[Spambayes] to From_ or not to From_?
Neale Pickett
neale@woozle.org
01 Oct 2002 10:03:15 -0700
So then, Tim Peters <tim.one@comcast.net> is all like:
> Actually, none of mine do, because BruceG's spam didn't. I removed
> all the "From " lines from the c.l.py archive to match that (easier
> than inventing such lines for Bruce's msgs). I don't know that it
> makes any difference for the way I run the tests, but it certainly
> could make a difference if "From " lines were getting mined for clues.
> I forced all my msgs alike in this respect just to cut off that
> possibility.
Sorry to enter this discussion a little late--I've been pretty busy with
a release at work.
I understand some people may not have them, but the "From " lines seem
to be very useful, as they report who the sender identified themselves
as in the MAIL command of the SMTP envelope. I've had a great deal of
success stopping spam at the gate by denying access to people who
identify themselves with addresses from certain domains. I would expect
that looking at "From " lines would be a clear win for anyone.
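
To make the idea concrete, here is a minimal sketch of pulling the envelope sender and its domain out of an mbox-style "From " line. The helper names (envelope_sender, is_blocked) and the example addresses are hypothetical, chosen only for illustration; this is not how spambayes or any particular MTA does it.

```python
import re

# An mbox separator line looks roughly like:
#   From sender@example.com Tue Oct  1 10:03:15 2002
# The first whitespace-delimited token after "From " is the envelope sender.
FROM_LINE = re.compile(r"^From (\S+)")


def envelope_sender(from_line):
    """Return (address, domain) from a 'From ' line, or None if it is not one."""
    m = FROM_LINE.match(from_line)
    if m is None:
        return None
    addr = m.group(1).lower()
    # Senders such as MAILER-DAEMON have no domain part.
    domain = addr.rpartition("@")[2] if "@" in addr else ""
    return addr, domain


def is_blocked(from_line, blocked_domains):
    """True if the envelope sender's domain is in the (hypothetical) block list."""
    parsed = envelope_sender(from_line)
    return parsed is not None and parsed[1] in blocked_domains
```

A tokenizer could emit the address and the domain as separate clues; the block list is only there to mirror the gate-keeping described above.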
Here, I'll put my money where my mouth is. My mail program writes the
"From " line as an X-From: header. I add this to my bayescustomize.ini:
[Tokenizer]
basic_header_tokenize: True
basic_header_skip: received
    date
    x-[^f][^r].*
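
As a rough check of what those skip patterns let through, here is a small sketch. It assumes header names are lowercased and each pattern is matched from the start of the name; that is my assumption about how the option is applied, and the helper names are hypothetical.

```python
import re

# Patterns copied from the basic_header_skip option above.
SKIP_PATTERNS = [re.compile(p) for p in ("received", "date", "x-[^f][^r].*")]


def is_skipped(header_name):
    # Assumption: names are lowercased and patterns are anchored at the start.
    name = header_name.lower()
    return any(p.match(name) for p in SKIP_PATTERNS)


def kept_headers(names):
    """Return the header names that would still be tokenized."""
    return [n for n in names if not is_skipped(n)]


print(kept_headers(["Received", "Date", "X-From", "X-Mailer",
                    "X-Face", "X-Priority", "Subject"]))
```

Under that assumption X-From survives, but so do X-Face and X-Priority, because the two character classes are tested independently: any x- header whose third character is "f" or whose fourth is "r" escapes the skip.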
And I get this on my tiny corpus (2x5x200 messages):
"""
false positive percentages
    1.500  1.500  tied
    1.000  1.000  tied
    2.000  1.000  won    -50.00%
    1.500  1.000  won    -33.33%
    1.500  1.000  won    -33.33%

won   3 times
tied  2 times
lost  0 times

total unique fp went from 15 to 11 won    -26.67%
mean fp % went from 1.5 to 1.1 won    -26.67%

false negative percentages
    1.500  1.000  won    -33.33%
    0.000  0.500  lost  +(was 0)
    1.000  1.000  tied
    0.500  0.000  won   -100.00%
    1.000  1.000  tied

won   2 times
tied  2 times
lost  1 times
"""
In all but one case where something changed, it was just a single
message. That's not a huge improvement, but maybe enough of one to
convince someone with a larger test set to try it out?
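
For anyone who wants to check the arithmetic, here is a small sketch that turns the percentages above into message counts. It assumes each rate is measured over 200 messages (my reading of "2x5x200"); the variable and function names are made up for the example.

```python
MESSAGES_PER_SET = 200  # assumption: each percentage is over 200 messages

# Before/after rates copied from the output above.
fp_before = [1.5, 1.0, 2.0, 1.5, 1.5]
fp_after = [1.5, 1.0, 1.0, 1.0, 1.0]
fn_before = [1.5, 0.0, 1.0, 0.5, 1.0]
fn_after = [1.0, 0.5, 1.0, 0.0, 1.0]


def to_counts(percentages, n=MESSAGES_PER_SET):
    """Convert per-run error percentages into whole message counts."""
    return [round(p * n / 100) for p in percentages]


def deltas(before, after):
    """Per-run change in misclassified messages (negative means fewer errors)."""
    return [a - b for b, a in zip(to_counts(before), to_counts(after))]


print("fp totals:", sum(to_counts(fp_before)), "->", sum(to_counts(fp_after)))
print("fp deltas:", deltas(fp_before, fp_after))
print("fn deltas:", deltas(fn_before, fn_after))
```

The false positive totals come out to 15 and 11, matching the "total unique fp" line, and only one of the six runs that changed moves by more than one message.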
Neale