[Spambayes] to From_ or not to From_?
Tim Peters
tim.one@comcast.net
Tue, 01 Oct 2002 19:35:14 -0400
[Neale Pickett, "From " lines]
> ...
> Here, I'll put my money where my mouth is. My mail program writes the
> From header as an X-From: line. I add this to my bayescustomize.ini:
>
> [Tokenizer]
> basic_header_tokenize: True
> basic_header_skip: received
>     date
>     x-[^f][^r].*
Note that this tokenizes a great many header lines beyond just x-from.
Something like
basic_header_skip: (?!x-from)
would have been sharper (that's a negative lookahead assertion: it matches
iff the header name doesn't match x-from, so it skips a header line iff it's
not x-from, so it looks only at x-from -- all obvious to the most casual
observer <wink>).
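
As a rough illustration of the difference, here is a minimal sketch. It assumes each skip pattern is applied with re.match (anchored at the start) to the lowercased header name, and that a header is tokenized only when no pattern matches. The tokenized() helper and the list of header names are hypothetical, made up for this example; they are not part of the spambayes code.

```python
import re

# Header names picked only for illustration.
names = ["x-from", "x-mailer", "x-face", "x-priority",
         "received", "date", "subject"]

def tokenized(skip_patterns, header_names):
    """Return the header names that no skip pattern matches at the start."""
    compiled = [re.compile(p) for p in skip_patterns]
    return [n for n in header_names
            if not any(c.match(n) for c in compiled)]

# The three patterns from the quoted bayescustomize.ini.
original = ["received", "date", "x-[^f][^r].*"]

# The negative lookahead: matches (the empty string) unless the
# name starts with "x-from", so everything else gets skipped.
sharper = ["(?!x-from)"]

kept_original = tokenized(original, names)
kept_sharper = tokenized(sharper, names)

print(kept_original)  # ['x-from', 'x-face', 'x-priority', 'subject']
print(kept_sharper)   # ['x-from']
```

Under those assumptions the original setting lets through x-face (third character is "f"), x-priority (fourth character is "r"), and every non-x header other than received and date, while the lookahead lets through x-from alone.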
> And I get this on my tiny corpus (2x5x200 messages):
>
> """
> false positive percentages
> 1.500 1.500 tied
> 1.000 1.000 tied
> 2.000 1.000 won -50.00%
> 1.500 1.000 won -33.33%
> 1.500 1.000 won -33.33%
>
> won 3 times
> tied 2 times
> lost 0 times
>
> total unique fp went from 15 to 11 won -26.67%
> mean fp % went from 1.5 to 1.1 won -26.67%
>
> false negative percentages
> 1.500 1.000 won -33.33%
> 0.000 0.500 lost +(was 0)
> 1.000 1.000 tied
> 0.500 0.000 won -100.00%
> 1.000 1.000 tied
>
> won 2 times
> tied 2 times
> lost 1 times
> """
>
> In all but one case where something changed, it was just a single
> message. That's not a huge improvement,
*Relative to* your error rates, it was a huge improvement, but it's hard to
be confident about it because the absolute # of msgs involved is so small.
Still, that it won 3 times on f-p, and never lost, adds to the confidence
you should have that it truly helped.
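
To make the relative-versus-absolute point concrete, a small sketch of the arithmetic behind the quoted f-p totals follows. It assumes "2x5x200" means 5 sets of 200 ham and 200 spam, so 1000 ham scored in total, which is consistent with 15 false positives being reported as 1.5%. The function and variable names are hypothetical.

```python
def relative_change(before, after):
    """Percentage change from before to after, relative to before."""
    return (after - before) / before * 100.0

# Assumption: 5 sets x 200 ham each, every ham scored exactly once.
ham_total = 5 * 200

fp_before, fp_after = 15, 11  # "total unique fp went from 15 to 11"

rel = relative_change(fp_before, fp_after)
abs_points = (fp_after - fp_before) / ham_total * 100.0

print(f"relative change: {rel:.2f}%")               # relative change: -26.67%
print(f"absolute change: {abs_points:.2f} points")  # absolute change: -0.40 points
print(f"messages involved: {fp_before - fp_after}")  # messages involved: 4
```

A drop of more than a quarter in relative terms comes down to four messages out of an assumed thousand, which is why the result is encouraging but hard to be confident about.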
> but maybe enough of one to convince someone with a larger test
> set to try it out?
I can't get away with tokenizing so many header lines; there are too many
"good clues for bad reasons" in my mixed-source data.