Subject: Re: is it really that simple?

Message 00670: Re: is it really that simple?

To: Carl Malamud <xxxx@media.org>
Subject: Re: is it really that simple?
From: Stephen Schultze <xxxxxxxxxx@cyber.law.harvard.edu>
Date: Wed, 1 Apr 2009 18:53:29 -0400
Cc: Aaron Swartz <xx@aaronsw.com>
In-reply-to: <54B16536-B061-4CBE-8A0B-57515396D930@media.org>
References: <54B16536-B061-4CBE-8A0B-57515396D930@media.org>

That's the basic idea. The cookie will last for a week. The crawling involves some annoying parsing (including generating POST requests), but once you have it figured out it's not terribly complex.
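
To make the cookie reuse concrete, here is a minimal sketch in Python. It assumes the cookie was saved in the Netscape cookies.txt format that wget's --load-cookies reads. The function names are made up for illustration, and nothing here is specific to PACER's actual cookie names or login flow.

```python
import time
import urllib.request
from http.cookiejar import MozillaCookieJar


def build_opener_from_cookie_file(path):
    """Return (opener, jar); the opener sends the saved cookies on every request."""
    jar = MozillaCookieJar(path)
    # Load everything in the file, even session or expired cookies, so that
    # cookies_still_valid() below can decide whether a fresh login is needed.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def cookies_still_valid(jar, now=None):
    """True if the jar holds at least one cookie and none of them has expired."""
    now = time.time() if now is None else now
    cookies = list(jar)
    return bool(cookies) and all(not c.is_expired(now) for c in cookies)
```

You would check cookies_still_valid(jar) before a crawl and log in again once it returns False, which should happen after roughly a week if the cookie file records that expiry.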

You start with a given case number, go to a standard URL to grab the docket, parse the docket to get the document sub-pages, request each of those, parse each of those to see if you need to get another sub-page, and ultimately parse out the PDF link (there are different standards used in different versions of PACER), and then request the PDF.
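
Roughly, that walk could be sketched like this in Python. This is only an outline under assumptions: fetch is any function you supply that returns the HTML for a URL (for example one that sends the saved cookie), and is_document_link and is_pdf_link are hypothetical predicates standing in for the version-specific parsing rules. It does not generate the POST requests or download the PDFs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def find_pdf_urls(fetch, docket_url, is_document_link, is_pdf_link, max_depth=3):
    """Walk docket -> document sub-pages -> PDF links and return the PDF URLs."""
    pdfs = []
    seen = set()

    def visit(url, depth):
        # Skip pages already visited and stop if the sub-page chain gets too deep.
        if url in seen or depth > max_depth:
            return
        seen.add(url)
        for link in extract_links(fetch(url), url):
            if is_pdf_link(link):
                if link not in pdfs:
                    pdfs.append(link)
            elif is_document_link(link):
                visit(link, depth + 1)

    visit(docket_url, 0)
    return pdfs
```

Keeping fetch as a parameter means the parsing can be tried against saved pages before any real request is made.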


On Apr 1, 2009, at 6:21 PM, Carl Malamud wrote:

Hi -
Is a pacer crawl as simple as download one file, save the cookie, then hand that cookie back with every subsequent request you make?
e.g., wget --load-cookies=file.txt --output-document=out.html http.....
Carl


--
Stephen Schultze
Fellow, Berkman Center for Internet and Society
xxxxxxx@cyber.law.harvard.edu
