

Message 00671: Re: is it really that simple?




On Apr 1, 2009, at 3:53 PM, Stephen Schultze wrote:

That's the basic idea. The cookie will last for a week. The crawling involves some annoying parsing (including generating POST requests) but once you have it figured out it's not terribly complex.

Can actually do this one all with GET, I think ... see below.



You start with a given case number, go to a standard URL to grab the docket, parse the docket to get the document sub-pages, request each of those, parse each of those to see if you need to get another sub-page, and ultimately parse out the PDF link (there are different standards used in different versions of PACER), and then request the PDF.

I get all that. I'm not writing a full crawler, just cleaning up after you folks and grabbing some missing files, so it will actually be much simpler: for each directory, I issue a call to qryAttorneys.pl ...

https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl?114972

where the number is the directory name ... pretty simple. Just need to do 12,000 GETs and I'm done with stage 1.
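A minimal sketch of that loop, assuming the week-long PACER session cookie mentioned above is passed in a Cookie header (the cookie value and the list of missing case numbers are placeholders, not from the thread):

```python
# Hypothetical sketch: fetch each qryAttorneys.pl page with a plain GET,
# sending a previously obtained PACER session cookie.
import urllib.request

BASE = "https://ecf.dcd.uscourts.gov/cgi-bin/qryAttorneys.pl?%d"
COOKIE = "PacerSession=PLACEHOLDER"  # real session cookie goes here


def fetch_case(case_number):
    # One GET per directory number; the response is the attorneys page HTML.
    req = urllib.request.Request(BASE % case_number,
                                 headers={"Cookie": COOKIE})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# e.g., for each of the ~12,000 missing directories:
# for n in missing_case_numbers:
#     with open("%d.html" % n, "wb") as f:
#         f.write(fetch_case(n))
```

With the cookie good for a week, one pass over the whole list fits comfortably in a single session.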

Carl