

Message 00671: Re: is it really that simple?



On Apr 1, 2009, at 3:53 PM, Stephen Schultze wrote:

> That's the basic idea. The cookie will last for a week. The crawling involves some annoying parsing (including generating POST requests), but once you have it figured out it's not terribly complex.
Can actually do this one all with GET, I think ... see below.


> You start with a given case number, go to a standard URL to grab the docket, parse the docket to get the document sub-pages, request each of those, parse each of those to see if you need to get another sub-page, and ultimately parse out the PDF link (there are different standards used in different versions of PACER), and then request the PDF.
I get all that. Not writing a full crawler, just need to clean up after you folks and get some missing files, so it will actually be much simpler (for each directory, I issue a call to qryAttorneys.pl ...

https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl?114972

where the number is the directory name ... pretty simple. Just need to do 12,000 GETs and I'm done with stage 1.
Carl
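[Editor's note: the cleanup pass Carl describes — one GET per directory number, reusing the week-long login cookie Stephen mentions — could be sketched roughly as below. The directory list, the login step, and the function names are assumptions for illustration, not part of the original exchange.]

```python
# Minimal sketch of the cleanup crawl: one GET per directory number
# against qryAttorneys.pl, with a shared cookie jar so the week-long
# PACER login cookie rides along on every request.
# Hypothetical names: query_url, fetch_all; a prior login that seeds
# the cookie jar is assumed and not shown.
import http.cookiejar
import urllib.request

BASE = "https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl"

def query_url(directory):
    """Build the GET URL for one directory number."""
    return f"{BASE}?{directory}"

def fetch_all(directories):
    """Issue one GET per directory, reusing cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    pages = {}
    for d in directories:
        with opener.open(query_url(d)) as resp:
            pages[d] = resp.read()
    return pages
```

For the 12,000-directory pass, `fetch_all(range(first, last))` over the known directory numbers would cover stage 1; a polite delay between requests would be worth adding in practice.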