Subject: Re: .gov crawl

Message 00415: Re: .gov crawl

To: Aaron Swartz <xx@aaronsw.com>
Subject: Re: .gov crawl
From: Carl Malamud <xxxx@media.org>
Date: Tue, 11 Nov 2008 07:48:02 -0800
In-reply-to: <dc21c7860811110735pa856bb6g29784cc72a0fc7a8@mail.gmail.com>
References: <dc21c7860811110735pa856bb6g29784cc72a0fc7a8@mail.gmail.com>

How current is that crawl? I really have issues with the way Brewster does crawls ... it has never been very timely or complete and he has really strange restrictions on use.


On Nov 11, 2008, at 7:35 AM, Aaron Swartz wrote:

Archive.org has a 5TB crawl of .gov and a QA team looking thru to see
if they missed anything. I suggest we build scripts that can run on
this corpus and then we can run Heritrix (the archive.org crawler) to
generate new corpuses if needed. This will save a bunch of crawling
time and give us a good basis for statistical sampling and so on. If
we notice any URLs which aren't in the corpus we can add them
ourselves.

I tried running the W3C LinkChecker but it was very slow for some
reason. Anyway, I think it'd be pretty easy to build a linkchecker
ourselves that runs on the archive.org corpus. All that's really
needed is a good way to grep for the appropriate attributes.
Similarly, it's pretty easy to run validators offline.
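
A minimal sketch of the "grep for the appropriate attributes" idea, assuming each page in the corpus is available as an HTML string together with the URL it was fetched from. The attribute list and the names LinkExtractor, extract_links, and missing_links are illustrative, not part of any archive.org tooling. A link that is absent from the corpus is only a candidate for a broken link; it may simply never have been crawled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that usually carry a URL (an illustrative, incomplete list).
LINK_ATTRS = {"href", "src", "action", "cite", "longdesc", "background"}


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from link-bearing attributes of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; values may be None.
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                self.links.append(urljoin(self.base_url, value.strip()))


def extract_links(html, base_url):
    """Return every linked URL in document order, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def missing_links(html, base_url, known_urls):
    """Return linked URLs that do not appear in the set of crawled URLs."""
    return [u for u in extract_links(html, base_url) if u not in known_urls]
```

For example, calling missing_links(page, page_url, crawled) for every page, where crawled is a set of all URLs in the corpus, gives a list of targets to check by hand or to add to a later crawl.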

We can also use their corpus to get a list of domain names to run nmap
and other tools against.
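
A rough sketch of pulling that domain list out of the corpus, assuming the crawled URLs are available as an iterable of strings. The function name gov_hostnames is hypothetical, and the ".gov" suffix check is an assumption about which hosts are in scope.

```python
from urllib.parse import urlsplit


def gov_hostnames(urls):
    """Return the sorted, de-duplicated .gov hostnames found in urls."""
    hosts = set()
    for url in urls:
        try:
            # .hostname is lowercased and excludes any port or userinfo.
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URLs (e.g. a bad IPv6 literal) are skipped.
            continue
        if host and host.endswith(".gov"):
            hosts.add(host)
    return sorted(hosts)
```

Written one hostname per line to a file, the result can be handed to nmap with its -iL option, which reads targets from a list.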
