Message 00194: Re: pacer crawl



the easiest thing would just be to have a screen session open with a
couple perl scripts calling wget on the various pacer urls
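
A minimal sketch of that setup, written in Python rather than perl and meant to be left running inside a screen session. It also pauses while the disk is nearly full, which is the worry raised earlier in the thread. The function names, the wget flags chosen, the free-space threshold and the polling interval are all illustrative assumptions, not anything specified in this thread.

```python
import shutil
import subprocess
import time


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


def build_wget_command(url, dest_dir):
    """Build the wget argument list for one URL (flag choice is an assumption)."""
    return ["wget", "--tries=3", "--directory-prefix", dest_dir, url]


def crawl(urls, dest_dir, min_free_bytes, run=subprocess.run,
          space_ok=has_free_space, sleep=time.sleep, poll_seconds=60):
    """Fetch each URL in turn, pausing while the disk is too full.

    Returns the list of URLs whose wget call exited non-zero.
    """
    failed = []
    for url in urls:
        # Wait for whatever job moves files off the box to catch up.
        while not space_ok(dest_dir, min_free_bytes):
            sleep(poll_seconds)
        result = run(build_wget_command(url, dest_dir))
        if result.returncode != 0:
            failed.append(url)
    return failed
```

To use it, pass a list of URLs (for example, read from a hypothetical urls.txt), an existing download directory, and a threshold such as 5 * 1024**3 bytes. The `run`, `space_ok` and `sleep` parameters exist only so the loop can be exercised without touching the network or the disk.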

On Thu, Sep 4, 2008 at 10:36 PM, Carl Malamud <xxxxxxx@media.org> wrote:
> so, what specifically do you want to do on the box?
>
> do you need to run scripts, cron jobs, etc...?  periodically dump data off
> local crawlers?  run python jobs?
>
> On Sep 4, 2008, at 7:30 PM, Aaron Swartz wrote:
>
>> it's a disk space thing -- last time I did something like this, i kept
>> filling up people's disks whenever the process moving stuff off
>> hiccupped. and if we're at speed the hiccups don't have to last long.
>>
>> On Thu, Sep 4, 2008 at 10:26 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>
>>> i don't mind crawling pacer with a valid account.  that is our production
>>> box, so i wouldn't want to be too intensive, but in principle I suppose one
>>> could crawl straight from thumper.
>>>
>>> is this a bandwidth thing?  not enough bits between your local computers and
>>> thumper to get the data over the wall?
>>>
>>> if this is really serious, there are a couple other places we can put you.
>>>
>>> let me know what you have in mind.
>>>
>>> On Sep 4, 2008, at 7:23 PM, Aaron Swartz wrote:
>>>
>>>> On Thu, Sep 4, 2008 at 10:23 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>
>>>>> On Sep 4, 2008, at 7:22 PM, Aaron Swartz wrote:
>>>>>
>>>>>> I assume running the pacer crawl from thumper is not on, right?
>>>>>>
>>>>>
>>>>> so, what are you crawling?
>>>>>
>>>>> the thumb drive corps is based on going to the library and using their
>>>>> access.  other access is $0.08/page.  do you have some kind of magic account
>>>>> or something?
>>>>>
>>>>>
>>>>
>>>> just the library's account.
>>>>
>>>
>>>
>>
>
>