Message 00206: Re: pacer program
Hi -
This looks great.
On http://watchdog.net/static/.tmp/vtd/15151/docket.html you need to
adjust the URLs to be relative.
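A rough sketch of the conversion in Python, assuming the linked files sit on the same host as docket.html. The helper name make_relative and the file name 1.pdf are made up for illustration and are not taken from the actual page:

```python
import posixpath
from urllib.parse import urlsplit


def make_relative(page_url, link_url):
    """Return link_url relative to the directory of page_url (hypothetical helper)."""
    page, link = urlsplit(page_url), urlsplit(link_url)
    # Links to another scheme or host cannot be made relative; leave them alone.
    if (page.scheme, page.netloc) != (link.scheme, link.netloc):
        return link_url
    base_dir = posixpath.dirname(page.path)
    relative = posixpath.relpath(link.path or "/", start=base_dir)
    # Keep any query string; fragments are dropped in this sketch.
    if link.query:
        relative += "?" + link.query
    return relative


page = "http://watchdog.net/static/.tmp/vtd/15151/docket.html"
# "1.pdf" is an assumed file name, used only as an example.
print(make_relative(page, "http://watchdog.net/static/.tmp/vtd/15151/1.pdf"))
```

With links rewritten this way, the docket directory can be moved off the temporary URL without breaking.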
You have no metadata in the PDF docs. At the very least, we need to
stamp in the following pieces of information:
1. the url of the doc you got (e.g., what is in your docket.html file)
2. the court: district court for the district of vermont
3. the office, which is on your docket.html file
4. the case number
5. the docket number (which is embedded inside of the case number)
6. the document number
7. the fact that it is public domain
(Tim, please chime in if I've forgotten anything.)
Do you guys have/use exiftool?
http://www.sno.phy.queensu.ca/~phil/exiftool/
I believe we want to do everything in the XMP headers. Aaron, you
might be able to help me get this more precise. A couple things I
think we need to set:
xmp:rights False
xmp:license [url of creative commons public domain license]
xmp:contributor (name of downloader?)
xmp:date (date on the document?)
xmp:publisher (name of the court?)
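For illustration only, here is one way the list above might translate into an exiftool call, built as an argument list in Python. The mapping from the loose names above to ExifTool tag names (XMP-xmpRights:Marked, XMP-cc:License, and the Dublin Core tags XMP-dc:Contributor, XMP-dc:Date, XMP-dc:Publisher) is an assumption, not something settled in this thread, and every value shown is a placeholder:

```python
def exiftool_args(pdf_path, license_url, contributor, date, publisher):
    """Build an exiftool command line for stamping XMP fields (hypothetical helper).

    The tag names are an assumed mapping of the fields discussed above.
    """
    return [
        "exiftool",
        "-XMP-xmpRights:Marked=False",          # assumed stand-in for "xmp:rights False"
        f"-XMP-cc:License={license_url}",       # assumed stand-in for "xmp:license"
        f"-XMP-dc:Contributor={contributor}",
        f"-XMP-dc:Date={date}",
        f"-XMP-dc:Publisher={publisher}",
        pdf_path,
    ]


args = exiftool_args(
    "1.pdf",                                    # placeholder file name
    "https://example.org/public-domain",        # placeholder, not the real license URL
    "name of downloader",                       # placeholder
    "2008:09:06",                               # placeholder date
    "name of the court",                        # placeholder
)
print(" ".join(args))
# To actually stamp the file (requires exiftool on the PATH):
#   import subprocess; subprocess.run(args, check=True)
```

By default exiftool keeps a backup copy of the original file next to the rewritten one, which is a useful safety net while the field list is still in flux.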
Where I get lost is where and how to put the identifying information.
One project I work on (archimedespalimpsest.org) shoves it all in the
description field as a bunch of name-value pairs. There is a proposal
Tom Bruce has advanced (see the open case list for details), but I
could never figure out from his spec where to shoehorn in things such
as the name of the office, or the case number.
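To make the name-value idea concrete, here is a minimal sketch of packing the identifying fields into a single description string and reading them back. The one-pair-per-line "name=value" layout and the field names are assumptions for illustration. They are not the format archimedespalimpsest.org uses, nor Tom Bruce's proposal:

```python
def encode_description(fields):
    """Pack a dict into 'name=value' lines (hypothetical format)."""
    lines = []
    for name, value in fields.items():
        value = str(value)
        # Reject anything that would make the text ambiguous to parse back.
        if "=" in name or "\n" in name or "\n" in value:
            raise ValueError(f"cannot encode field {name!r}")
        lines.append(f"{name}={value}")
    return "\n".join(lines)


def decode_description(text):
    """Inverse of encode_description; lines without '=' are ignored."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {name: value for name, value in pairs}


# Field names and values are illustrative placeholders.
fields = {
    "court": "district court for the district of vermont",
    "office": "OFFICE-PLACEHOLDER",
    "case_number": "CASE-NUMBER-PLACEHOLDER",
    "document_number": "1",
    "source_url": "http://watchdog.net/static/.tmp/vtd/15151/docket.html",
}
description = encode_description(fields)
print(description)
```

Splitting on the first "=" only means values such as URLs with query strings survive the round trip.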
It would be very nice if we all came up with a standard list of what
gets stamped where. We can write that up as a precise guide for
others to follow and that would be very useful.
Note that none of this should slow you down from harvesting ... if you
keep everything collected as sets in a docket, we can go back and do
that part later as long as you keep a record of the urls you were at.
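One possible shape for that record, sketched as a per-docket JSON manifest. The structure, the key names, and the sample document URL are all assumptions; only the court code "vtd" and PACER ID 15151 come from the sample case:

```python
import json


def new_manifest(court, pacer_id):
    """Start an empty manifest for one docket (hypothetical structure)."""
    return {"court": court, "pacer_id": pacer_id, "documents": []}


def add_record(manifest, source_url, local_file):
    """Note where a downloaded file came from, so it can be stamped later."""
    manifest["documents"].append(
        {"source_url": source_url, "local_file": local_file}
    )


manifest = new_manifest("vtd", 15151)
# The source URL and file name below are placeholders.
add_record(manifest, "https://example.org/original-document-url", "1.pdf")
print(json.dumps(manifest, indent=2))
```

Saving one such file next to each docket.html would keep the URLs available for stamping into the PDFs afterwards.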
Carl
On Sep 6, 2008, at 4:12 PM, Aaron Swartz wrote:
Hello! I've put up a sample case (PACER ID 15151 at the Vermont
District Court) at this temporary URL:
http://watchdog.net/static/.tmp/vtd/15151/
Can you all look it over and see if we're missing anything or if
there's something we can do better? Thanks.