Message 00415: Re: .gov crawl
How current is that crawl? I really have issues with the way Brewster
does crawls ... they have never been very timely or complete, and he has
really strange restrictions on use.
On Nov 11, 2008, at 7:35 AM, Aaron Swartz wrote:
Archive.org has a 5TB crawl of .gov and a QA team looking through it to see
if they missed anything. I suggest we build scripts that can run on
this corpus and then we can run Heritrix (the archive.org crawler) to
generate new corpuses if needed. This will save a bunch of crawling
time and give us a good basis for statistical sampling and so on. If
we notice any URLs which aren't in the corpus we can add them
ourselves.
I tried running the W3C LinkChecker but it was very slow for some
reason. Anyway, I think it'd be pretty easy to build a linkchecker
ourselves that runs on the archive.org corpus. All that's really
needed is a good way to grep for the appropriate attributes.
Similarly, it's pretty easy to run validators offline.
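
A rough sketch of the "grep for the appropriate attributes" step, assuming each page in the corpus is available as an HTML string together with the URL it was fetched from. The names `LINK_ATTRS`, `LinkExtractor`, and `extract_links` are made up for illustration, and the attribute list is a guess at what a linkchecker would care about, not a complete one. It only pulls out the link targets; checking whether each target resolves (or exists in the corpus) would be a separate step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: attributes whose values are URLs worth checking.
LINK_ATTRS = {"href", "src", "action", "cite", "longdesc", "background"}


class LinkExtractor(HTMLParser):
    """Collects absolute URLs found in link-bearing attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link target in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Using a real parser rather than a regex costs a little speed but avoids false matches inside comments and scripts.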
We can also use their corpus to get a list of domain names to run nmap
and other tools against.
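
Getting the domain list could look something like this, assuming the corpus URLs can be dumped as an iterable of strings (how to get that dump out of the archive files is not covered here). `unique_hosts` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted, de-duplicated hostnames found in `urls`."""
    hosts = set()
    for url in urls:
        # .hostname is lowercased and excludes any port; it is None
        # when the string has no network location.
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)
```

Writing the result one host per line gives a file that nmap can read as a target list with its `-iL` option.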