[spambayes-dev] modified version of sb_dbexpimp.py

Tue Mar 16 18:21:47 EST 2004

(Not noticing that Tony cc'd spambayes-dev, I originally sent this reply
just to him.)

    >> Since this is rather late in the game 1.0-wise I would like a little
    >> extra feedback before checking this stuff in.

    Tony> I was too late trying it out for this, but it (cvs version) also
    Tony> works for me.

    Tony> One query (I've used the csv module quite a bit since I moved to
    Tony> 2.3, but only reading, never writing, so haven't noticed this
    Tony> before): I see that it writes rows with '\r\n' termination, so in
    Tony> Excel I get blank lines between every row (with a file as long as
    Tony> the spambayes database, this means I miss a lot of data).

The csv file should be opened in "wb" mode.  I thought I caught that.  Can
you take a quick look?  Also, you are talking about using the real csv
module, not the compatcsv thing, right?
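
For anyone reading this with a current Python, here is a minimal sketch of
the same idea. It is not the sb_dbexpimp.py code. In Python 3 the
counterpart of "wb" for the csv module is text mode with newline="", which
stops the platform from turning the writer's '\r\n' into '\r\r\n' (the
cause of the blank rows in Excel). The helper names, the file name and the
sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    # newline="" disables newline translation, so the '\r\n' emitted by
    # csv.writer reaches the file unchanged on every platform.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    # Open with newline="" on the reading side as well, so the csv module
    # sees the line endings exactly as they were written.
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Hypothetical sample data, shaped like token,nspam,nham.
rows = [["long cons word", "32", "7"],
        ["subject:long cons word", "9", "0"]]
path = os.path.join(tempfile.mkdtemp(), "tokens.csv")
write_rows(path, rows)

# Inspect the raw bytes to check the row terminators.
with open(path, "rb") as f:
    raw = f.read()
```

With the file opened this way there should be no need for a
platform-specific lineterminator setting.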

    Tony> Should we provide an option to the dbexpimp script to change the
    Tony> line terminator to '\n'?  (Simple enough to do, if I read the csv
    Tony> doc correctly).  Or maybe just have a "if sys.platform == "win32":
    Tony> lineterminator = '\n'" kinda thing?

No, I don't think so.  It seems we have a bug to squash.  We control
everything about reading and writing that file.  We should be able to make
it work without any hints from the user.

    Tony> For example, I'll want to see how often an experimental token gets
    Tony> used, or something like that.  A lot of the time I could just use
    Tony> a shell script (even on Windows <wink>) to get around the long
    Tony> pathname, anyway.  Forget I mentioned it ;)

Okay.  Here's a simple use of spamcounts:

    % spamcounts -d ~/tmp/tte.db -r 'long cons word'
    db: /Users/skip/tmp/tte.db
    token,nspam,nham,spam prob
    long cons word,32,7,0.797764401748
    subject:long cons word,9,0,0.97619047619

This says to report on all tokens in tte.db that match the regular
expression 'long cons word' (using re.search).  Without the -r it matches
only the first token above, i.e. the exact string.  (It also runs a lot
faster.)
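
The difference between the two modes can be sketched roughly as follows.
This is not the spamcounts source. A plain dict stands in for the database,
the function names are hypothetical, and the "unrelated" entry is invented
so the regular expression has something to skip. The speed difference
presumably comes from the exact lookup touching one key while the regex has
to scan every token.

```python
import re

# Stand-in for the token database: token -> (nspam, nham).
# The first two entries mirror the sample output; "unrelated" is made up.
counts = {
    "long cons word": (32, 7),
    "subject:long cons word": (9, 0),
    "unrelated": (1, 1),
}


def exact_lookup(db, token):
    # Without -r: a single keyed lookup, so only the exact token matches.
    return [token] if token in db else []


def regex_lookup(db, pattern):
    # With -r: re.search against every token, so the pattern may match
    # anywhere inside a token. This walks the whole database.
    rx = re.compile(pattern)
    return sorted(tok for tok in db if rx.search(tok))


exact = exact_lookup(counts, "long cons word")
matched = regex_lookup(counts, "long cons word")
```

Here exact holds only 'long cons word', while matched also picks up
'subject:long cons word'.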

Skip