Message 00198: Re: pacer crawl



fair enough. stephen is building a team to go to the library.

On Thu, Sep 4, 2008 at 10:43 PM, Carl Malamud <xxxxxxx@media.org> wrote:
> sigh.
>
> this is not how we do things.  :)
>
> we can't have thumper be an on-site patron at the library that is part of
> the 20-library pacer trial because we are cutting corners.  we don't cut
> corners, we belly up to the bar and get permission.
>
> if your librarian wants to do this, i'm on board.  if you have a legal right
> to go to the library and be a patron and use the published procedure to grab
> data, i'm also on board.  but, i'm not going to shave the rules.  if we're
> coming in off-site, we can drain pacer, but you want a valid account and pay
> $0.08/page.  then, we can do whatever we want with the data.
>
> On Sep 4, 2008, at 7:41 PM, Aaron Swartz wrote:
>
>> no
>>
>> On Thu, Sep 4, 2008 at 10:40 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>
>>> do you have your library's permission/tacit agreement to drain pacer?
>>> what library?
>>>
>>> Carl
>>>
>>> On Sep 4, 2008, at 7:38 PM, Aaron Swartz wrote:
>>>
>>>> the easiest thing would just be to have a screen session open with a
>>>> couple perl scripts calling wget on the various pacer urls
>>>>
>>>> On Thu, Sep 4, 2008 at 10:36 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>
>>>>> so, what specifically do you want to do on the box?
>>>>>
>>>>> do you need to run scripts, cron jobs, etc...?  periodically dump data off
>>>>> local crawlers?  run python jobs?
>>>>>
>>>>> On Sep 4, 2008, at 7:30 PM, Aaron Swartz wrote:
>>>>>
>>>>>> it's a disk space thing -- last time I did something like this, i kept
>>>>>> filling up people's disks whenever the process moving stuff off
>>>>>> hiccupped. and if we're at speed the hiccups don't have to last long.
>>>>>>
>>>>>> On Thu, Sep 4, 2008 at 10:26 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>>>
>>>>>>> i don't mind crawling pacer with a valid account.  that is our production
>>>>>>> box, so i wouldn't want to be too intensive, but in principle I suppose one
>>>>>>> could crawl straight from thumper.
>>>>>>>
>>>>>>> is this a bandwidth thing?  not enough bits between your local computers and
>>>>>>> thumper to get the data over the wall?
>>>>>>>
>>>>>>> if this is really serious, there are a couple other places we can put you.
>>>>>>>
>>>>>>> let me know what you have in mind.
>>>>>>>
>>>>>>> On Sep 4, 2008, at 7:23 PM, Aaron Swartz wrote:
>>>>>>>
>>>>>>>> On Thu, Sep 4, 2008 at 10:23 PM, Carl Malamud <xxxxxxx@media.org> wrote:
>>>>>>>>>
>>>>>>>>> On Sep 4, 2008, at 7:22 PM, Aaron Swartz wrote:
>>>>>>>>>
>>>>>>>>>> I assume running the pacer crawl from thumper is not on, right?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> so, what are you crawling?
>>>>>>>>>
>>>>>>>>> the thumb drive corps is based on going to the library and using their
>>>>>>>>> access.  other access is $0.08/page.  do you have some kind of magic account
>>>>>>>>> or something?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> just the library's account.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>