[spambayes-dev] minor csv module problem
Skip Montanaro
skip at pobox.com
Fri Jan 21 03:25:38 CET 2005
In my message training I train into a pickle (faster at that point), then
use sb_dbexpimp to dump it to a csv file. For use by sb_bnfilter I then
convert that to a Berkeley db file. (The csv file also serves as a
convenient debug/interchange format.) The Python csv module is used both to
write and read the csv file. Unfortunately, it seems to have a bug. It
generates this line:
"subject: \r",0,1\r
(\r subbing for the real CR), which it later refuses to read because it
thinks there is a newline inside the string. This is a long-standing bug as
far as I can tell. I can reproduce it with Python 2.3 and 2.4, though it is
fixed in the latest CVS, probably as a side effect of the recent changes to
the csv module.
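A minimal round-trip check for this case, as a sketch only. It is written
for current Python 3 rather than the 2.3/2.4 interpreters discussed here,
and the helper name roundtrips is made up for illustration. It writes one
row containing the offending token and reads it back:

```python
import csv
import io


def roundtrips(row):
    """Write one row with the csv module, read it back, and compare."""
    # newline="" keeps the buffer from translating the embedded CR.
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(row)
    buf.seek(0)
    return list(csv.reader(buf)) == [row]


# The token from the failing line, plus the two counts as strings.
row = ["subject: \r", "0", "1"]
print(roundtrips(row))
```

An interpreter without the bug prints True. On an affected version I would
expect the read step to fail instead, as described above.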
I imagine we'll get the csv problem fixed (hopefully by the 2.3.5 release),
but that doesn't help SpamBayes in the short term, so I think a workaround
is in order. The problem is that the tokenizer generates a token ending in a
\r character. One spam's subject is:
'=?iso-2022-jp?B?k36LeILdgs2DRYNug0WDbiAgICAgICAgICAN?='
After decoding by email.Header.decode_header we have
'\x93~\x8bx\x82\xdd\x82\xcd\x83E\x83n\x83E\x83n \r'
The tokenizer generates this token as part of its output:
'subject: \r'
Perhaps we could replace '\r' with ' ' in the subject before tokenizing
without losing much/any accuracy. I don't believe we can get whitespace in
body tokens.
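A rough sketch of that replacement, assuming it is applied to the decoded
subject before it reaches the tokenizer. The helper name strip_cr is
hypothetical and is not part of SpamBayes. The code is written for Python 3,
where decode_header lives in email.header and returns bytes for an encoded
word:

```python
from email.header import decode_header

SUBJECT = "=?iso-2022-jp?B?k36LeILdgs2DRYNug0WDbiAgICAgICAgICAN?="


def strip_cr(raw):
    """Replace every CR in a decoded subject (bytes) with a space."""
    return raw.replace(b"\r", b" ")


# decode_header returns (decoded, charset) pairs; this subject has one.
raw, charset = decode_header(SUBJECT)[0]
cleaned = strip_cr(raw)

# First value: the decoded subject ends in CR. Second: the CR is gone.
print(raw.endswith(b"\r"), b"\r" not in cleaned)
```

Replacing rather than deleting the CR keeps the length of the subject
unchanged, so only the one trailing character differs.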
Skip