Subject: Re: pacer crawl

Message 00194: Re: pacer crawl

To: "Carl Malamud" <xxxx@media.org>
Subject: Re: pacer crawl
From: "Aaron Swartz" <xx@aaronsw.com>
Date: Thu, 4 Sep 2008 22:38:52 -0400
In-reply-to: <8D62995A-C4A4-4D86-8032-EA7792662283@media.org>
References: <dc21c7860809041922i64893bd7p2c8bdac1df1a137f@mail.gmail.com> <4DA167E8-5E2E-402E-8331-1A4616F8D129@media.org> <dc21c7860809041923p41c4b55dtf3e5264ac217fd31@mail.gmail.com> <762EBFD0-5A93-423D-BFF6-93D974E441A7@media.org> <dc21c7860809041930v1cd6719dueddd61d949df57cc@mail.gmail.com> <8D62995A-C4A4-4D86-8032-EA7792662283@media.org>
Sender: xxxxxxx@gmail.com

the easiest thing would just be to have a screen session open with a
couple perl scripts calling wget on the various pacer urls

On Thu, Sep 4, 2008 at 10:36 PM, Carl Malamud <xxxxxxx@media.org> wrote:
> so, what specifically do you want to do on the box?
>
> do you need to run scripts, cron jobs, etc...?  periodically dump data off
> local crawlers?  run python jobs?
>
> On Sep 4, 2008, at 7:30 PM, Aaron Swartz wrote:
>
>> it's a disk space thing -- last time I did something like this, i kept
>> filling up people's disks whenever the process moving stuff off
>> hiccupped. and if we're at speed the hiccups don't have to last long.
>>
>> On Thu, Sep 4, 2008 at 10:26 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>
>>> i don't mind crawling pacer with a valid account.  that is our
>>> production box, so i wouldn't want to be too intensive, but in
>>> principle I suppose one could crawl straight from thumper.
>>>
>>> is this a bandwidth thing?  not enough bits between your local
>>> computers and thumper to get the data over the wall?
>>>
>>> if this is really serious, there are a couple other places we can
>>> put you.
>>>
>>> let me know what you have in mind.
>>>
>>> On Sep 4, 2008, at 7:23 PM, Aaron Swartz wrote:
>>>
>>>> On Thu, Sep 4, 2008 at 10:23 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>
>>>>> On Sep 4, 2008, at 7:22 PM, Aaron Swartz wrote:
>>>>>
>>>>>> I assume running the pacer crawl from thumper is not on, right?
>>>>>>
>>>>>
>>>>> so, what are you crawling?
>>>>>
>>>>> the thumb drive corps is based on going to the library and using
>>>>> their access.  other access is $0.08/page.  do you have some kind
>>>>> of magic account or something?
>>>>>
>>>>>
>>>>
>>>> just the library's account.
>>>>
>>>
>>>
>>
>
>
