[Spambayes] test sets?

Tim Peters tim.one@comcast.net
Fri, 06 Sep 2002 12:55:07 -0400


[Anthony Baxter]
> The other thing on my todo list (probably tonight's tram ride home) is
> to add all headers from non-text parts of multipart messages. If nothing
> else, it'll pick up most virus email real quick.

See the checkin comments for timtest.py last night.  Adding this code gave a
major reduction in the false negative rate:

def crack_content_xyz(msg):
    x = msg.get_type()
    if x is not None:
        yield 'content-type:' + x.lower()

    x = msg.get_param('type')
    if x is not None:
        yield 'content-type/type:' + x.lower()

    for x in msg.get_charsets(None):
        if x is not None:
            yield 'charset:' + x.lower()

    x = msg.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + x.lower()

    fname = msg.get_filename()
    if fname is not None:
        for x in fname.lower().split('/'):
            for y in x.split('.'):
                yield 'filename:' + y

    x = msg.get('content-transfer-encoding')
    if x is not None:
        yield 'content-transfer-encoding:' + x.lower()


...

    t = ''
    for x in msg.walk():
        for w in crack_content_xyz(x):
            yield t + w
        t = '>'

I *suspect* most of that stuff didn't make any difference, but I put it all
in as one blob, so I don't know which parts did and didn't help.
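
For anyone who wants to poke at this outside timtest.py, here is a rough
self-contained sketch of the same idea against the current Python 3 email
package. It is not the checked-in code: the names crack_content,
structure_tokens and SAMPLE are made up for illustration, the sample message
is invented, and since get_type() is not available in Python 3 the sketch
assumes that checking for the header and then calling get_content_type() is
a close enough stand-in.

```python
import email

def crack_content(part):
    """Yield structural tokens for one MIME part (hypothetical helper)."""
    # Assumption: only emit a content type when the header is present,
    # since get_content_type() otherwise falls back to a default value.
    if part.get('content-type') is not None:
        yield 'content-type:' + part.get_content_type()

    x = part.get_param('type')
    if isinstance(x, str):
        yield 'content-type/type:' + x.lower()

    # On a multipart container this also reports the subparts' charsets.
    for x in part.get_charsets(None):
        if x is not None:
            yield 'charset:' + x.lower()

    x = part.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + x.lower()

    fname = part.get_filename()
    if fname is not None:
        # Split on path separators, then on dots, so each extension
        # (e.g. 'pif') becomes its own token.
        for piece in fname.lower().split('/'):
            for bit in piece.split('.'):
                yield 'filename:' + bit

    x = part.get('content-transfer-encoding')
    if x is not None:
        yield 'content-transfer-encoding:' + x.lower()

def structure_tokens(msg):
    """Walk all parts; tokens from every part after the first get '>'."""
    prefix = ''
    for part in msg.walk():
        for tok in crack_content(part):
            yield prefix + tok
        prefix = '>'

# Invented sample: a text part plus a suspicious-looking attachment.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain; charset="us-ascii"

hello
--XX
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="Invoice.doc.pif"

AAAA
--XX--
"""

tokens = list(structure_tokens(email.message_from_string(SAMPLE)))
for tok in tokens:
    print(tok)
```

The attachment's filename is split into separate tokens for each dotted
piece, which is presumably where a classifier would pick up on double
extensions of the kind virus mail tends to carry.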