From: Carl Malamud <xxxxxxx@media.org>
Date: November 7, 2008 4:50:39 PM PST
To: Aaron Swartz <me@aaronsw.com>
Subject: Re: audit
On Nov 7, 2008, at 4:42 PM, Aaron Swartz wrote:
what is your corporate form these days? are you incorporated?
filed for c4
or c3? are you under official "fiscal sponsorship" of Sunlight?
we incorporated in MA and filed for c4. no fiscal sponsorship.
ok. I can work with that. I'm going to make a $25k contribution to
Watchdog. my thinking is you write the general auditing software/
scripts and I apply them to .gov (these scripts could easily be used
on state governments as well). please don't start yet ... I want a
few days to put together some thoughts on this. I ran some dry runs
of this audit concept a couple of years ago and there are some
things I learned I want to transmit before you get started. I also
want to make sure we stay very clearly on the above-board side of
this thing (e.g., we'll run nmap to look for open ports and
fingerprint os's, but we're *not* going to crack their password
files. :)).
Anyway, let's talk Monday.
BTW, good progress on my irs project. I've got the 12-dvd loader up
and running (finder screendump attached) and today I got the program
working that scarfs a dozen dvds, reads an index file to figure out
which tiffs go with which return (they are one page per tiff in
semi-random order), uses tiffcp to concatenate them together, uses
tiff2pdf to create a pdf, and uses exiftool to stamp the metadata into
the pdf header. I still need to automate running them through OCR,
looking for SSNs (there are a bunch), and doing a few other
housekeeping tasks. This is definitely a big project, but this is
certainly progress.
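
A rough sketch of the tiff-to-pdf steps described above, in Python. The
index layout here (tab-separated return id, page number, tiff path) is
purely an assumption for illustration, since the real index format is
not described; the function names and output file names are likewise
hypothetical. The sketch only groups pages and builds the tiffcp,
tiff2pdf, and exiftool command lines; running them is left to a small
helper.

```python
import csv
import io
import subprocess
from collections import defaultdict


def group_pages(index_text):
    """Map each return id to its tiff paths, sorted into page order.

    Assumes a hypothetical tab-separated index:
    return_id <TAB> page_number <TAB> tiff_path
    """
    pages = defaultdict(list)
    for row in csv.reader(io.StringIO(index_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        return_id, page_no, tiff_path = row
        pages[return_id].append((int(page_no), tiff_path))
    # Pages arrive in semi-random order, so sort by page number.
    return {rid: [path for _, path in sorted(entries)]
            for rid, entries in pages.items()}


def build_commands(return_id, tiff_paths, title):
    """Build the three command lines for one return (nothing is run here)."""
    merged = f"{return_id}.tif"  # hypothetical output names
    pdf = f"{return_id}.pdf"
    return [
        ["tiffcp", *tiff_paths, merged],           # concatenate the pages
        ["tiff2pdf", "-o", pdf, merged],           # multi-page tiff to pdf
        ["exiftool", f"-Title={title}",            # stamp metadata
         "-overwrite_original", pdf],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to
print the commands for a dry run before touching a full set of dvds.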
For the CFR, I'm now able to go from their broken sgml to well-
formed xml. Now, I need to figure out how to lay it out as xhtml,
convert the eps files to png and pdf, and automate loading it all
into svn so you can do diffs.
Carl