Subject: Re: pacer crawl

Message 00198: Re: pacer crawl

To: "Carl Malamud" <xxxx@media.org>
Subject: Re: pacer crawl
From: "Aaron Swartz" <xx@aaronsw.com>
Date: Thu, 4 Sep 2008 22:44:22 -0400
In-reply-to: <C786130E-7A81-41A9-90C1-3676160231E6@media.org>
References: <dc21c7860809041922i64893bd7p2c8bdac1df1a137f@mail.gmail.com> <4DA167E8-5E2E-402E-8331-1A4616F8D129@media.org> <dc21c7860809041923p41c4b55dtf3e5264ac217fd31@mail.gmail.com> <762EBFD0-5A93-423D-BFF6-93D974E441A7@media.org> <dc21c7860809041930v1cd6719dueddd61d949df57cc@mail.gmail.com> <8D62995A-C4A4-4D86-8032-EA7792662283@media.org> <dc21c7860809041938g555f7ad7r506692f2cd691d21@mail.gmail.com> <EF4849BF-BB11-4BAA-A3AB-E1EDBF1B147D@media.org> <dc21c7860809041941u7bbe5b85q826185a1a59539e8@mail.gmail.com> <C786130E-7A81-41A9-90C1-3676160231E6@media.org>
Sender: xxxxxxx@gmail.com

fair enough. stephen is building a team to go to the library.

On Thu, Sep 4, 2008 at 10:43 PM, Carl Malamud <xxxxxxx@media.org> wrote:
> sigh.
>
> this is not how we do things.  :)
>
> we can't have thumper be an on-site patron at the library that is part of
> the 20-library pacer trial because we are cutting corners.  we don't cut
> corners, we belly up to the bar and get permission.
>
> if your librarian wants to do this, i'm on board.  if you have a legal right
> to go to the library and be a patron and use the published procedure to grab
> data, i'm also on board.  but, i'm not going to shave the rules.  if we're
> coming in off-site, we can drain pacer, but you want a valid account and pay
> $0.08/page.  then, we can do whatever we want with the data.
>
> On Sep 4, 2008, at 7:41 PM, Aaron Swartz wrote:
>
>> no
>>
>> On Thu, Sep 4, 2008 at 10:40 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>
>>> do you have your library's permission/tacit agreement to drain pacer?
>>> what library?
>>>
>>> Carl
>>>
>>> On Sep 4, 2008, at 7:38 PM, Aaron Swartz wrote:
>>>
>>>> the easiest thing would just be to have a screen session open with a
>>>> couple perl scripts calling wget on the various pacer urls
>>>>
>>>> On Thu, Sep 4, 2008 at 10:36 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>
>>>>> so, what specifically do you want to do on the box?
>>>>>
>>>>> do you need to run scripts, cron jobs, etc...?  periodically dump data off
>>>>> local crawlers?  run python jobs?
>>>>>
>>>>> On Sep 4, 2008, at 7:30 PM, Aaron Swartz wrote:
>>>>>
>>>>>> it's a disk space thing -- last time I did something like this, i kept
>>>>>> filling up people's disks whenever the process moving stuff off
>>>>>> hiccupped. and if we're at speed the hiccups don't have to last long.
>>>>>>
>>>>>> On Thu, Sep 4, 2008 at 10:26 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>>>
>>>>>>> i don't mind crawling pacer with a valid account.  that is our production
>>>>>>> box, so i wouldn't want to be too intensive, but in principle I suppose one
>>>>>>> could crawl straight from thumper.
>>>>>>>
>>>>>>> is this a bandwidth thing?  not enough bits between your local computers and
>>>>>>> thumper to get the data over the wall?
>>>>>>>
>>>>>>> if this is really serious, there are a couple other places we can put you.
>>>>>>>
>>>>>>> let me know what you have in mind.
>>>>>>>
>>>>>>> On Sep 4, 2008, at 7:23 PM, Aaron Swartz wrote:
>>>>>>>
>>>>>>>> On Thu, Sep 4, 2008 at 10:23 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>>>>>
>>>>>>>>> On Sep 4, 2008, at 7:22 PM, Aaron Swartz wrote:
>>>>>>>>>
>>>>>>>>>> I assume running the pacer crawl from thumper is not on, right?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> so, what are you crawling?
>>>>>>>>>
>>>>>>>>> the thumb drive corps is based on going to the library and using their
>>>>>>>>> access.  other access is $0.08/page.  do you have some kind of magic account
>>>>>>>>> or something?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> just the library's account.
>>>>>>>>

