Message 00671: Re: is it really that simple?
On Apr 1, 2009, at 3:53 PM, Stephen Schultze wrote:
That's the basic idea. The cookie will last for a week. The
crawling involves some annoying parsing (including generating POST
requests) but once you have it figured out it's not terribly complex.
Can actually do this one all with GET, I think ... see below.
You start with a given case number, go to a standard URL to grab the
docket, parse the docket to get the document sub-pages, request each
of those, parse each of those to see if you need to get another
sub-page, and ultimately parse out the PDF link (there are different
standards used in different versions of PACER), and then request the
PDF.
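The crawl loop above could be sketched roughly like this. This is only an illustration under assumptions: the helper names are made up, and the ".pdf" suffix test is a stand-in for the version-dependent link formats mentioned above.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a docket or sub-page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def classify_links(html):
    """Split a page's links into PDF documents and further sub-pages.

    Treating anything ending in '.pdf' as a document is an assumption;
    different PACER versions link to PDFs differently, so a real
    crawler would need per-version rules here.
    """
    parser = LinkExtractor()
    parser.feed(html)
    pdfs = [h for h in parser.links if h.lower().endswith(".pdf")]
    subpages = [h for h in parser.links if not h.lower().endswith(".pdf")]
    return pdfs, subpages
```

A driver would then request each sub-page in turn, re-running `classify_links` on the result until it bottoms out at PDF links.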
I get all that. I'm not writing a full crawler, just need to clean up
after you folks and get some missing files, so it will actually be
much simpler (for each directory, I issue a call to qryAttorneys.pl ...
https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl?114972
where the number is the directory name ... pretty simple). Just need
to do 12,000 GETs and I'm done with stage 1.
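That stage-1 pass might look like the sketch below. The URL pattern is taken from the message; the list of directory names and the cookie header are assumptions (the week-long session cookie is mentioned earlier in the thread).

```python
import urllib.request

BASE = "https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl"

def attorney_url(directory_name):
    """Build the query URL for one case directory; per the example
    in the message, the value after '?' is just the directory name."""
    return f"{BASE}?{directory_name}"

def fetch_all(directories, cookie):
    """Issue one GET per directory, sending the session cookie.

    'directories' would be the ~12,000 directory names; 'cookie' is
    the PACER session cookie string (both assumptions here).
    """
    for d in directories:
        req = urllib.request.Request(
            attorney_url(d), headers={"Cookie": cookie})
        with urllib.request.urlopen(req) as resp:
            yield d, resp.read()
```

At 12,000 plain GETs this is a simple sequential loop; no POST parsing needed.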
Carl