Message 00414: .gov crawl
- To: "Carl Malamud" <xxxx@media.org>
- Subject: .gov crawl
- From: "Aaron Swartz" <xx@aaronsw.com>
- Date: Tue, 11 Nov 2008 10:35:53 -0500
- Sender: xxxxxxx@gmail.com
Archive.org has a 5TB crawl of .gov and a QA team looking through it to see
if they missed anything. I suggest we build scripts that can run on
this corpus and then we can run Heritrix (the archive.org crawler) to
generate new corpora if needed. This will save a bunch of crawling
time and give us a good basis for statistical sampling and so on. If
we notice any URLs which aren't in the corpus we can add them
ourselves.
I tried running the W3C LinkChecker but it was very slow for some
reason. Anyway, I think it'd be pretty easy to build a linkchecker
ourselves that runs on the archive.org corpus. All that's really
needed is a good way to grep for the appropriate attributes.
Similarly, it's pretty easy to run validators offline.
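
A minimal sketch of the "grep for the appropriate attributes" idea, in Python. It assumes each archived page is already available as a (URL, HTML string) pair and that the set of URLs present in the corpus has been loaded into memory; how the records are actually read out of the archive.org files is not covered here. The names extract_links, missing_links, and known_urls are made up for illustration, and only href and src are treated as link attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit

# Assumption: these two attributes cover the links we care about.
LINK_ATTRS = {"href", "src"}


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs found in href/src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in LINK_ATTRS or not value:
                continue
            # Resolve relative links against the page URL, drop "#fragment".
            absolute, _fragment = urldefrag(urljoin(self.base_url, value))
            if urlsplit(absolute).scheme in ("http", "https"):
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links on one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def missing_links(base_url, html, known_urls):
    """Return links on the page that do not appear in the corpus URL set."""
    return [url for url in extract_links(base_url, html) if url not in known_urls]
```

A link reported by missing_links is only a candidate: it may be broken, or it may simply not have been crawled, so those URLs would still need a live check or a follow-up crawl.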
We can also use their corpus to get a list of domain names to run nmap
and other tools against.
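
Getting the domain list could look roughly like this, again as a Python sketch. It assumes we can get the corpus URLs as an iterable of strings (for example, one per line from some index dump); gov_hostnames is a hypothetical helper name, not anything archive.org provides.

```python
from urllib.parse import urlsplit


def gov_hostnames(urls):
    """Return the sorted, de-duplicated .gov hostnames found in urls."""
    hosts = set()
    for url in urls:
        try:
            # .hostname is lowercased and has any port stripped.
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (e.g. a bad IPv6 literal); skip it.
            continue
        if host and host.endswith(".gov"):
            hosts.add(host)
    return sorted(hosts)
```

Written out one hostname per line, the result can be handed to nmap as a target list with its -iL option.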