Message 00197: Re: pacer crawl
this is not how we do things. :)
we can't have thumper be an on-site patron at the library that is part
of the 20-library pacer trial because we are cutting corners. we
don't cut corners, we belly up to the bar and get permission.
if your librarian wants to do this, i'm on board. if you have a legal
right to go to the library and be a patron and use the published
procedure to grab data, i'm also on board. but, i'm not going to
shave the rules. if we're coming in off-site, we can drain pacer, but
you want a valid account and to pay $0.08/page. then, we can do whatever
we want with the data.
On Sep 4, 2008, at 7:41 PM, Aaron Swartz wrote:
On Thu, Sep 4, 2008 at 10:40 PM, Carl Malamud <email@example.com> wrote:
do you have your library's permission/tacit agreement to drain
On Sep 4, 2008, at 7:38 PM, Aaron Swartz wrote:
the easiest thing would just be to have a screen session open with a
couple perl scripts calling wget on the various pacer urls
On Thu, Sep 4, 2008 at 10:36 PM, Carl Malamud <firstname.lastname@example.org> wrote:
so, what specifically do you want to do on the box?
do you need to run scripts, cron jobs, etc...? periodically dump
local crawlers? run python jobs?
On Sep 4, 2008, at 7:30 PM, Aaron Swartz wrote:
it's a disk space thing -- last time I did something like this,
filling up people's disks whenever the process moving stuff off
hiccupped. and if we're at speed the hiccups don't have to last
On Thu, Sep 4, 2008 at 10:26 PM, Carl Malamud <email@example.com> wrote:
i don't mind crawling pacer with a valid account. that is our
box, so i wouldn't want to be too intensive, but in principle I
could crawl straight from thumper.
is this a bandwidth thing? not enough bits between your local
thumper to get the data over the wall?
if this is really serious, there are a couple other places we
let me know what you have in mind.
On Sep 4, 2008, at 7:23 PM, Aaron Swartz wrote:
On Thu, Sep 4, 2008 at 10:23 PM, Carl Malamud <firstname.lastname@example.org> wrote:
On Sep 4, 2008, at 7:22 PM, Aaron Swartz wrote:
I assume running the pacer crawl from thumper is not on,
so, what are you crawling?
the thumb drive corps is based on going to the library and
access. other access is $0.08/page. do you have some kind
just the library's account.