[Spambayes] test sets?
Tim Peters
tim.one@comcast.net
Fri, 06 Sep 2002 12:55:07 -0400
[Anthony Baxter]
> The other thing on my todo list (probably tonight's tram ride home) is
> to add all headers from non-text parts of multipart messages. If nothing
> else, it'll pick up most virus email real quick.
See the checkin comments for timtest.py last night. Adding this code gave a
major reduction in the false negative rate:
def crack_content_xyz(msg):
    x = msg.get_type()
    if x is not None:
        yield 'content-type:' + x.lower()
    x = msg.get_param('type')
    if x is not None:
        yield 'content-type/type:' + x.lower()
    for x in msg.get_charsets(None):
        if x is not None:
            yield 'charset:' + x.lower()
    x = msg.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + x.lower()
    fname = msg.get_filename()
    if fname is not None:
        for x in fname.lower().split('/'):
            for y in x.split('.'):
                yield 'filename:' + y
    x = msg.get('content-transfer-encoding')
    if x is not None:
        yield 'content-transfer-encoding:' + x.lower()
...
t = ''
for x in msg.walk():
    for w in crack_content_xyz(x):
        yield t + w
    t = '>'
I *suspect* most of that stuff didn't make any difference, but I put it all
in as one blob, so I don't know which parts helped and which didn't.
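
For anyone who wants to poke at this outside timtest.py, here is a rough
standalone sketch of the same idea against the current Python 3 email
package. It is not the checked-in code. The names mime_tokens and SAMPLE are
made up for illustration, and get_content_type() stands in for the older
get_type(). Since get_content_type() always returns a default, the sketch
only emits a content-type token when the header is actually present.

```python
from email import message_from_string


def crack_content_xyz(msg):
    """Yield MIME-structure tokens for a single message part."""
    # Assumption: only tokenize the type when the header really exists,
    # because get_content_type() falls back to a default otherwise.
    if 'content-type' in msg:
        yield 'content-type:' + msg.get_content_type()
    x = msg.get_param('type')
    # get_param() can return a tuple for RFC 2231 values; skip those here.
    if isinstance(x, str):
        yield 'content-type/type:' + x.lower()
    for x in msg.get_charsets(None):
        if x is not None:
            yield 'charset:' + x.lower()
    x = msg.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + str(x).lower()
    fname = msg.get_filename()
    if fname is not None:
        # Split on path separators and dots so extensions become tokens.
        for x in fname.lower().split('/'):
            for y in x.split('.'):
                yield 'filename:' + y
    x = msg.get('content-transfer-encoding')
    if x is not None:
        yield 'content-transfer-encoding:' + str(x).lower()


def mime_tokens(msg):
    """Walk every part; prefix tokens from non-root parts with '>'."""
    prefix = ''
    for part in msg.walk():
        for w in crack_content_xyz(part):
            yield prefix + w
        prefix = '>'


# Hypothetical two-part message with an executable-looking attachment.
SAMPLE = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/plain; charset="US-ASCII"\n'
    'Content-Transfer-Encoding: 7bit\n'
    '\n'
    'hello\n'
    '--XYZ\n'
    'Content-Type: application/octet-stream\n'
    'Content-Disposition: attachment; filename="Invoice.PIF"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'AAAA\n'
    '--XYZ--\n'
)

tokens = list(mime_tokens(message_from_string(SAMPLE)))
for tok in tokens:
    print(tok)
```

The attachment's extension comes out as its own token ('>filename:pif'),
which is presumably the kind of clue that catches virus mail quickly.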