Subject: Re: pacer crawl

Message 00199: Re: pacer crawl

To: Aaron Swartz <xx@aaronsw.com>
Subject: Re: pacer crawl
From: Carl Malamud <xxxx@media.org>
Date: Thu, 4 Sep 2008 23:15:40 -0700
In-reply-to: <dc21c7860809041944n5ec791a7tdba99016ed68fa7e@mail.gmail.com>
References: <dc21c7860809041922i64893bd7p2c8bdac1df1a137f@mail.gmail.com> <4DA167E8-5E2E-402E-8331-1A4616F8D129@media.org> <dc21c7860809041923p41c4b55dtf3e5264ac217fd31@mail.gmail.com> <762EBFD0-5A93-423D-BFF6-93D974E441A7@media.org> <dc21c7860809041930v1cd6719dueddd61d949df57cc@mail.gmail.com> <8D62995A-C4A4-4D86-8032-EA7792662283@media.org> <dc21c7860809041938g555f7ad7r506692f2cd691d21@mail.gmail.com> <EF4849BF-BB11-4BAA-A3AB-E1EDBF1B147D@media.org> <dc21c7860809041941u7bbe5b85q826185a1a59539e8@mail.gmail.com> <C786130E-7A81-41A9-90C1-3676160231E6@media.org> <dc21c7860809041944n5ec791a7tdba99016ed68fa7e@mail.gmail.com>

ok, but let's not have that be a distraction from your day jobs? getting watchdog up is by far more important. pacer is not critical path for me ... it is important, but it is not where the major push is right now. i'd welcome the effort, but not if it is at the expense of your "real" work.


On Sep 4, 2008, at 7:44 PM, Aaron Swartz wrote:

fair enough. stephen is building a team to go to the library.

On Thu, Sep 4, 2008 at 10:43 PM, Carl Malamud <xxxxxxx@media.org> wrote:
sigh.

this is not how we do things.  :)
we can't have thumper be an on-site patron at the library that is part
of the 20-library pacer trial because we are cutting corners. we don't
cut corners, we belly up to the bar and get permission.
if your librarian wants to do this, i'm on board. if you have a legal
right to go to the library and be a patron and use the published
procedure to grab data, i'm also on board. but, i'm not going to shave
the rules. if we're coming in off-site, we can drain pacer, but you
want a valid account and pay $0.08/page.  then, we can do whatever we
want with the data.

On Sep 4, 2008, at 7:41 PM, Aaron Swartz wrote:
no
On Thu, Sep 4, 2008 at 10:40 PM, Carl Malamud <xxxxxxx@media.org> wrote:
do you have your library's permission/tacit agreement to drain pacer?
what library?

Carl

On Sep 4, 2008, at 7:38 PM, Aaron Swartz wrote:
the easiest thing would just be to have a screen session open with a
couple perl scripts calling wget on the various pacer urls
On Thu, Sep 4, 2008 at 10:36 PM, Carl Malamud <xxxxxxx@media.org> wrote:
so, what specifically do you want to do on the box?
do you need to run scripts, cron jobs, etc...? periodically dump data
off local crawlers?  run python jobs?

On Sep 4, 2008, at 7:30 PM, Aaron Swartz wrote:
it's a disk space thing -- last time I did something like this, i kept
filling up people's disks whenever the process moving stuff off
hiccupped. and if we're at speed the hiccups don't have to last long.
On Thu, Sep 4, 2008 at 10:26 PM, Carl Malamud <xxxxxxx@media.org> wrote:
i don't mind crawling pacer with a valid account.  that is our
production box, so i wouldn't want to be too intensive, but in
principle I suppose one could crawl straight from thumper.

is this a bandwidth thing?  not enough bits between your local
computers and thumper to get the data over the wall?
if this is really serious, there are a couple other places we can put
you.

let me know what you have in mind.

On Sep 4, 2008, at 7:23 PM, Aaron Swartz wrote:
On Thu, Sep 4, 2008 at 10:23 PM, Carl Malamud <xxxxxxx@media.org> wrote:
On Sep 4, 2008, at 7:22 PM, Aaron Swartz wrote:
I assume running the pacer crawl from thumper is not on, right?
so, what are you crawling?
the thumb drive corps is based on going to the library and using their
access. other access is $0.08/page. do you have some kind of magic
account or something?
just the library's account.

