[spambayes-dev] minor csv module problem

Skip Montanaro skip at pobox.com
Fri Jan 21 03:25:38 CET 2005


In my message training I train into a pickle (faster at that point), then
use sb_dbexpimp to dump it to a csv file.  For use by sb_bnfilter I then
convert that to a Berkeley db file.  (The csv file also serves as a
convenient debug/interchange format.)  The Python csv module is used both to
write and read the csv file.  Unfortunately, it seems to have a bug.  It
generates this line:

    "subject:          \r",0,1\r

(\r subbing for the real CR), which it later refuses to read because it
thinks there is a newline inside the string.  This is a long-standing bug as
far as I can tell.  I can reproduce it with Python 2.3 and 2.4, though it
is fixed in the latest CVS, probably as a side-effect of the recent changes
to the csv module.
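
A minimal sketch of the round trip described above, for checking whether a
given Python is affected.  It assumes Python 3 (io.StringIO standing in for
the csv file); the helper name survives_round_trip is made up for
illustration and is not part of SpamBayes or sb_dbexpimp.

```python
import csv
import io


def survives_round_trip(row):
    """Write one row with csv.writer, read it back, and compare.

    Hypothetical helper: the in-memory buffer stands in for the csv file.
    """
    buf = io.StringIO(newline="")  # no newline translation, as for csv files
    csv.writer(buf).writerow(row)
    buf.seek(0)
    read_back = next(csv.reader(buf))
    # csv.reader returns every field as a string, so compare as strings.
    return read_back == [str(field) for field in row]


# The row from the message: a token ending in CR, followed by two counts.
row = ["subject:          \r", 0, 1]
print(survives_round_trip(row))
```

On an interpreter with the fix the row should come back unchanged; on the
affected versions the read step is where the failure shows up.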

I imagine we'll get the csv problem fixed (hopefully by the 2.3.5 release),
but that doesn't help SpamBayes in the short term, so I think a workaround
is in order.  The problem is triggered by a generated token that ends with
a \r character.  One spam's subject is:

    '=?iso-2022-jp?B?k36LeILdgs2DRYNug0WDbiAgICAgICAgICAN?='

After decoding with email.Header.decode_header we have:

    '\x93~\x8bx\x82\xdd\x82\xcd\x83E\x83n\x83E\x83n          \r'
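
The decoding step can be reproduced with a short sketch.  It assumes
Python 3, where the module is spelled email.header (lowercase) and
decode_header returns the payload as bytes together with the charset name;
the variable names are only for illustration.

```python
from email.header import decode_header

# The encoded Subject value quoted in the message.
raw_subject = "=?iso-2022-jp?B?k36LeILdgs2DRYNug0WDbiAgICAgICAgICAN?="

# decode_header returns a list of (payload, charset) pairs.
payload, charset = decode_header(raw_subject)[0]

print(repr(payload))
# The base64 payload itself carries the trailing CR that ends up in the token.
print(payload.endswith(b"\r"))
```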

The tokenizer generates this token as part of its output:

    'subject:          \r'

Perhaps we could replace '\r' with ' ' in the subject before tokenizing
without losing much/any accuracy.  I don't believe we can get whitespace in
body tokens.
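
A rough sketch of that workaround, assuming the decoded subject is
available as a string before it reaches the tokenizer.  The function name
sanitize_subject is hypothetical and does not correspond to anything in the
SpamBayes tokenizer.

```python
def sanitize_subject(subject):
    """Replace carriage returns with spaces before tokenizing.

    Hypothetical helper: keeps the subject the same length, so any
    whitespace-run tokens stay close to what they would have been.
    """
    return subject.replace("\r", " ")


# A subject shaped like the one in the message: text, ten spaces, then CR.
decoded = "some subject" + " " * 10 + "\r"
cleaned = sanitize_subject(decoded)
print(repr(cleaned))
```

With the CR gone, no token built from the subject can end in \r, so the csv
round trip never sees the problematic field.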

Skip

