Message 00056: Re: govdocs/google
- To: "Carl Malamud" <xxxx@media.org>
- Subject: Re: govdocs/google
- From: "Aaron Swartz" <xx@aaronsw.com>
- Date: Mon, 31 Dec 2007 13:24:52 -0800
- In-reply-to: <200712312121.lBVLLHYb029125@bulk.resource.org>
- References: <dc21c7860712311305j1f57bbedg1ffc4ab758a3fe24@mail.gmail.com> <200712312121.lBVLLHYb029125@bulk.resource.org>
- Sender: xxxxxxx@gmail.com
> Have you found the Google book IDs embedded in the PDFs by
> any chance? It is such a pain to grab them out of the URLs.
Yeah, there's apparently a line like:
<< /Type /Annot /Subtype /Link /C [0 0 1] /Border [0 0 1] /Rect [022
227 167 238] /H /I /A << /S /URI /URI
(http://books.google.com/books?id=2Sw6AAAAMAAJ&ie=ISO-8859-1) >> >>
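A rough sketch of pulling the ID out of that annotation, in Python. This assumes the link annotation is stored uncompressed in the PDF bytes, as in the line quoted above; if it sits inside a compressed stream, a raw byte search like this would miss it. The function name and the regular expression are illustrative, not part of any existing tool.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches a URI action pointing at books.google.com.
# \s* tolerates a line break between "/URI" and the opening parenthesis.
URI_PATTERN = re.compile(
    rb"/URI\s*\((https?://books\.google\.com/books\?[^)\s]*)\)"
)

def find_google_book_ids(pdf_bytes):
    """Return the distinct `id` query values found in link annotations."""
    ids = []
    for match in URI_PATTERN.finditer(pdf_bytes):
        url = match.group(1).decode("latin-1")
        query = parse_qs(urlsplit(url).query)
        for book_id in query.get("id", []):
            if book_id not in ids:
                ids.append(book_id)
    return ids

# The annotation quoted above, with its line wrapping preserved.
sample = (
    b"<< /Type /Annot /Subtype /Link /C [0 0 1] /Border [0 0 1] /Rect [022\n"
    b"227 167 238] /H /I /A << /S /URI /URI\n"
    b"(http://books.google.com/books?id=2Sw6AAAAMAAJ&ie=ISO-8859-1) >> >>"
)
print(find_google_book_ids(sample))
```

For a file on disk, the bytes could be read with open(path, "rb").read() and passed to the same function.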
> By 530k ... 530,000 separate titles? Is there an easy way to
> find those? What are you doing for metadata?
Yeah, 530K books. We extract the metadata from the HTML page that goes
with them and archive both. Here's an example:
http://www.archive.org/details/reportscasesarg255courgoog/
We haven't announced them yet, but when we do they'll be in the search engine.