Message 00414: .gov crawl

Archive.org has a 5TB crawl of .gov and a QA team looking through it
to see if they missed anything. I suggest we build scripts that can
run on this corpus; then we can run Heritrix (the archive.org
crawler) to generate new corpora if needed. This will save a bunch of
crawling time and give us a good basis for statistical sampling and
so on. If we notice any URLs which aren't in the corpus, we can add
them ourselves.
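
For the "URLs which aren't in the corpus" check, something like the
following would do as a first pass. It's a minimal sketch, assuming
we can dump the crawl's URL list (say, from its index) to a flat
file; the filenames crawled-urls.txt and our-urls.txt are made up for
illustration:

    def normalize(url):
        # Crude normalization so trivial differences don't cause misses.
        return url.strip().rstrip('/').lower()

    # Set of everything the crawl got, assuming one URL per line.
    with open('crawled-urls.txt') as f:
        crawled = {normalize(line) for line in f}

    # Print anything on our list that the crawl is missing.
    with open('our-urls.txt') as f:
        for line in f:
            url = normalize(line)
            if url and url not in crawled:
                print(url)  # candidate to feed back into Heritrix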

I tried running the W3C LinkChecker, but it was very slow for some
reason. Anyway, I think it'd be pretty easy to build a link checker
ourselves that runs on the archive.org corpus. All that's really
needed is a good way to grep for the appropriate attributes (href,
src, and so on). Similarly, it's pretty easy to run validators
offline.
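
As a rough illustration of the grep-for-attributes idea, here's a
sketch that walks a directory of saved HTML files and prints every
href/src value it finds. Reading loose files off disk is an
assumption; the real corpus is presumably packed in archive
containers, so we'd swap in a reader for those:

    import os
    import sys
    from html.parser import HTMLParser

    LINK_ATTRS = {'href', 'src'}

    class LinkExtractor(HTMLParser):
        # Print the value of every href/src attribute we encounter.
        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in LINK_ATTRS and value:
                    print(value)

    # Walk a directory tree of saved pages and extract links from each.
    for root, dirs, files in os.walk(sys.argv[1]):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding='utf-8', errors='replace') as f:
                    LinkExtractor().feed(f.read())
            except OSError:
                pass  # unreadable file; skip it

Each URL this spits out can then be checked against the corpus's own
URL list, which keeps the whole check offline.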

We can also use their corpus to get a list of domain names to run nmap
and other tools against.
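
Getting a target list for nmap out of the corpus is just a matter of
reducing the URL list to unique hostnames. Another small sketch,
reusing the hypothetical crawled-urls.txt file from above; the output
can be fed straight to nmap with -iL targets.txt:

    from urllib.parse import urlsplit

    hosts = set()
    with open('crawled-urls.txt') as f:
        for line in f:
            host = urlsplit(line.strip()).hostname
            if host and host.endswith('.gov'):
                hosts.add(host)

    # One hostname per line, the format nmap's -iL option expects.
    with open('targets.txt', 'w') as out:
        for host in sorted(hosts):
            out.write(host + '\n')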