[Spambayes] to From_ or not to From_?

01 Oct 2002 10:03:15 -0700

So then, Tim Peters <tim.one@comcast.net> is all like:

> Actually, none of mine do, because BruceG's spam didn't.  I removed
> all the "From " lines from the c.l.py archive to match that (easier
> than inventing such lines for Bruce's msgs).  I don't know that it
> makes any difference for the way I run the tests, but it certainly
> could make a difference if "From " lines were getting mined for clues.
> I forced all my msgs alike in this respect just to cut off that
> possibility.

Sorry to enter this discussion a little late--I've been pretty busy with
a release at work.

I understand some people may not have them, but the "From " lines seem
to be very useful, as they record the address the sender gave in the
MAIL command of the SMTP envelope.  I've had a great deal of success
stopping spam at the gate by denying access to senders who identify
themselves with addresses from certain domains.  I would expect that
looking at "From " lines would be a clear win for anyone.
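
To make the idea concrete, here is a minimal sketch of pulling the envelope sender's domain out of an mbox-style "From " line. The function names and the sample address are made up for illustration, and it assumes the usual "From <address> <date>" layout; it is not what any particular mail program or Spambayes does.

```python
def envelope_sender(from_line):
    """Return the address from an mbox "From " separator line, or None."""
    # Only the separator form "From " (with a space) counts, not "From:".
    if not from_line.startswith("From "):
        return None
    parts = from_line.split()
    return parts[1] if len(parts) > 1 else None


def sender_domain(from_line):
    """Return the lowercased domain of the envelope sender, or None."""
    sender = envelope_sender(from_line)
    if sender is None or "@" not in sender:
        return None
    return sender.rsplit("@", 1)[1].lower()


# Hypothetical sample line; the address is invented.
print(sender_domain("From someone@Example.COM Tue Oct  1 10:03:15 2002"))
```

A gate that denies certain domains would then just compare that value against a list.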

Here, I'll put my money where my mouth is.  My mail program writes the
From_ line as an X-From: header.  I add this to my bayescustomize.ini:

[Tokenizer]
basic_header_tokenize: True
basic_header_skip: received
    date
    x-[^f][^r].*
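
For anyone wondering which headers those three patterns let through, here is a rough check. It assumes each pattern is applied with `re.match` to the lowercased header name; that is my assumption about how the option is used, so treat this as a sketch and not as the tokenizer's actual code. `is_skipped` is a made-up helper name.

```python
import re

# Patterns copied from the basic_header_skip option above.
SKIP_PATTERNS = ["received", "date", "x-[^f][^r].*"]


def is_skipped(header_name, patterns=SKIP_PATTERNS):
    """Return True if any skip pattern matches the start of the header name.

    Assumption: names are lowercased and tested with re.match.
    """
    name = header_name.lower()
    return any(re.match(p, name) for p in patterns)


for name in ["Received", "Date", "X-From", "X-Mailer",
             "X-Face", "X-Priority", "Subject"]:
    print(name, "skipped" if is_skipped(name) else "tokenized")
```

Under that assumption X-From is kept, but so is any other x- header with an "f" right after the dash or an "r" in the next position (X-Face and X-Priority, for example).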

And I get this on my tiny corpus (2x5x200 messages):

"""
false positive percentages
    1.500  1.500  tied
    1.000  1.000  tied
    2.000  1.000  won    -50.00%
    1.500  1.000  won    -33.33%
    1.500  1.000  won    -33.33%

won   3 times
tied  2 times
lost  0 times

total unique fp went from 15 to 11 won    -26.67%
mean fp % went from 1.5 to 1.1 won    -26.67%

false negative percentages
    1.500  1.000  won    -33.33%
    0.000  0.500  lost  +(was 0)
    1.000  1.000  tied
    0.500  0.000  won   -100.00%
    1.000  1.000  tied

won   2 times
tied  2 times
lost  1 times
"""

In all but one case where something changed, it was just a single
message.  That's not a huge improvement, but maybe enough of one to
convince someone with a larger test set to try it out?
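
The arithmetic behind "a single message" can be sketched like this, assuming each row of the false positive table is a percentage of 200 messages (my reading of the 2x5x200 corpus). The helper names are made up.

```python
MESSAGES_PER_RUN = 200  # assumption: each percentage is out of 200 messages

# False positive percentages from the table above (before, after).
fp_before = [1.5, 1.0, 2.0, 1.5, 1.5]
fp_after = [1.5, 1.0, 1.0, 1.0, 1.0]


def to_count(pct, n=MESSAGES_PER_RUN):
    """Convert a percentage of n messages back to a message count."""
    return round(pct * n / 100)


def relative_change(old, new):
    """Percent change from old to new, as printed in the 'won' columns."""
    return (new - old) / old * 100


before = [to_count(p) for p in fp_before]
after = [to_count(p) for p in fp_after]

print(before, sum(before))
print(after, sum(after))
print(round(relative_change(sum(before), sum(after)), 2))
```

With 200 messages per run, 0.5% is one message, so a row going from 1.500 to 1.000 is one message and the 2.000 to 1.000 row is two.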

Neale