

Message 00671: Re: is it really that simple?



On Apr 1, 2009, at 3:53 PM, Stephen Schultze wrote:

> That's the basic idea. The cookie will last for a week. The crawling involves some annoying parsing (including generating POST requests), but once you have it figured out it's not terribly complex.
Can actually do this one all with GET, I think ... see below.


> You start with a given case number, go to a standard URL to grab the docket, parse the docket to get the document sub-pages, request each of those, parse each of those to see if you need to get another sub-page, and ultimately parse out the PDF link (there are different standards used in different versions of PACER), and then request the PDF.
I get all that. Not writing a full crawler, just need to clean up after you folks and get some missing files, so it will actually be much simpler (for each directory, I issue a call to qryAttorneys.pl ...

https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl?114972

where the number is the directory name ... pretty simple. Just need to do 12,000 GETs and I'm done with stage 1.
Carl
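[Editor's note: the cleanup pass Carl describes — one GET per directory number, reusing the week-long login cookie Stephen mentions — could be sketched roughly as below. The directory list, the login step, and the function names are assumptions for illustration, not part of the original exchange.]

```python
# Minimal sketch of the cleanup crawl: one GET per directory number
# against qryAttorneys.pl, with a shared cookie jar so the week-long
# PACER login cookie rides along on every request.
# Hypothetical names: query_url, fetch_all; a prior login that seeds
# the cookie jar is assumed and not shown.
import http.cookiejar
import urllib.request

BASE = "https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl"

def query_url(directory):
    """Build the GET URL for one directory number."""
    return f"{BASE}?{directory}"

def fetch_all(directories):
    """Issue one GET per directory, reusing cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    pages = {}
    for d in directories:
        with opener.open(query_url(d)) as resp:
            pages[d] = resp.read()
    return pages
```

For the 12,000-directory pass, `fetch_all(range(first, last))` over the known directory numbers would cover stage 1; a polite delay between requests would be worth adding in practice.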