Message 00371: Re: a couple of questions
On Oct 20, 2008, at 5:27 PM, Aaron Swartz wrote:
the data's a real mess -
That's ultimately my biggest defense in all of this. But, I'm still
*really* nervous when I hear the Superintendent of Documents talk
about "security breach" and "investigation."
In terms of the data being a mess, I'm thinking of adding a Disallow for
Google to my robots.txt on this or maybe even just releasing really
big tarballs ... I'm positive should be public, but I'm not
necessarily convinced this stuff deserves to go live on random Google
searches until more volunteers have done more scrubbing. There is
some really bad crap I caught, which means there is a whole bunch I
didn't catch.
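
For illustration only, here is a rough sketch of what a Google-only disallow could look like and how to sanity-check it with Python's standard urllib.robotparser. The /bulk/ path, the example.org URLs, and the can_fetch helper are all made up for the example; none of them come from the actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of an assumed /bulk/ directory
# and leave every other crawler unrestricted (an empty Disallow allows all).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /bulk/

User-agent: *
Disallow:
"""


def can_fetch(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.org/bulk/case-0001.pdf"  # placeholder URL
    print("Googlebot allowed:", can_fetch("Googlebot", url))
    print("OtherBot allowed:", can_fetch("OtherBot", url))
```

Note that a disallow like this only asks well-behaved crawlers to stay away; it does nothing to stop anyone who downloads the files directly.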
It really sucks of course that all the commercial guys don't care and
have all this live, but our biggest defense (again) is that we
actually care. My letter to Rosenthal will point out very clearly
that her computer people and the commercial boys never told her any of
this stuff nor did they redact the data ... by making the data public,
ironically, we protect privacy much better.
What do you think if the initial release consists of:
1. Scribd copies of all my letters to the Judicial Conference (with all the
private information like the lists of hits and even the case numbers
with hits redacted, of course)
2. a bunch of tarballs of 50 GB or so each.
That lets us get the data out but not have to be in the end user
business.
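
As a rough sketch of the tarball idea, assuming we already have a list of file names with their sizes in bytes, something like the following could plan which files go into which roughly 50 GB tarball. The plan_tarballs name, the 50 GB default, and the simple in-order grouping are assumptions for illustration, and the sizes counted are uncompressed.

```python
from typing import Iterable, List, Tuple

GB = 10**9  # decimal gigabyte; an assumption about what "50 GB" means here


def plan_tarballs(
    files: Iterable[Tuple[str, int]], limit: int = 50 * GB
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs, in order, into batches.

    Each batch's total size stays at or under `limit`, except that a single
    file larger than `limit` ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new batch if adding this file would exceed the limit.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sample = [("a.pdf", 30 * GB), ("b.pdf", 30 * GB), ("c.pdf", 10 * GB)]
    for i, batch in enumerate(plan_tarballs(sample), start=1):
        print(f"tarball {i}: {batch}")
```

Each planned batch could then be written out with the standard tarfile module or plain tar.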
We should realize that if we put out even just big tarballs, there will be
some jerks that take all the data and slap Google AdWords on it and
will not care if, for example, some really bad document is found and
needs to be redacted.
Thoughts?
Carl