[Spambayes] Another optimization

T. Alexander Popiel popiel@wolfskeep.com
Wed, 18 Sep 2002 13:37:19 -0700


In message:  <LNBBLJKPBEHFEDALKOLCMEAFBFAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>>
>> * Graham was very specific in describing his tokenizer...
>>   and you folks seem to have ignored that description.
>
>The answer to every question is "because careful experiments said it worked
>better".  This one is discussed in a long tokenizer.py comment block, with
>title "How to tokenize?", and some full both-way test results are given
>there.

Argh.  I just saw that comment block, nestled between n-grams
and HTML discussions.  Somehow I managed to overlook it on my
prior readings.  Sorry.

>> * Why do you throw out super-long words?
>
>Likewise, but under the section "How big should 'a word' be?".

Yeah, but that section only seems to refer to deciding the cutoff
length between 12 and 13 characters, not whether there should be a
cutoff at all.  Hrm.
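
For concreteness, the sort of cutoff I'm talking about looks roughly
like this (a sketch only, not the actual tokenizer.py code; the regex
and the exact limit are illustrative):

    import re

    MAX_WORD_LEN = 12   # the 12-vs-13 boundary that section discusses

    def tokenize(text):
        # Split on whitespace and drop any "word" longer than the cutoff,
        # on the theory that very long tokens are mostly encoded junk.
        for word in re.split(r"\s+", text):
            if word and len(word) <= MAX_WORD_LEN:
                yield word.lower()

Testing whether there should be a cutoff at all would just mean running
both with and without that length check.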

>>   Since I'm not storing _all_ words encountered (I'm only keeping the
>>   probability computations after ignoring the few-occurrence words)
>
>(Perhaps not so) Strangely enough, my experiments showed that never ignoring
>a word worked better.

That's not something I thought to check.  Perhaps I'll try it.
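
If I do, the change is just dropping the minimum-count gate from the
per-word probability computation.  Roughly (a sketch of my scheme with
made-up names and threshold, not spambayes code):

    MIN_COUNT = 5        # my "few occurrences" threshold (illustrative)
    UNKNOWN_PROB = 0.4   # what a word we have no opinion on scores as

    def word_prob(good, bad, ngood, nbad, ignore_rare=True):
        # good/bad are this word's counts; ngood/nbad are corpus sizes.
        # Assumes the word was seen at least once, so g + b > 0 below.
        if ignore_rare and good + bad < MIN_COUNT:
            return UNKNOWN_PROB              # my way: rare words dropped
        g = min(1.0, 2.0 * good / ngood)     # Graham's ham-count doubling
        b = min(1.0, bad / nbad)
        # "Never ignoring a word" means always falling through to here.
        return max(0.01, min(0.99, b / (g + b)))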

>Do read our project's TESTING.txt:

*nod*  Good stuff.

>>   However, I do _not_ consider additional less significant words if
>>   the number of maximally significant words is >N.
>
>And we do, *if* the number of non-cancelling maxprob words is less than N.

Yeah, that's the distinction I thought I saw.  I tested it both ways
(though not as rigorously as you do) and chose my way.
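
Spelled out, the distinction as I understand it (illustrative only,
not classifier.py; N and the helper names are invented):

    N = 16
    HAM_EXTREME, SPAM_EXTREME = 0.01, 0.99

    def strongest(probs, n):
        # The n clues farthest from the 0.5 fence.
        return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

    def clues_mine(probs):
        # My rule: if the maximally significant clues alone number more
        # than N, the weaker clues are never consulted.
        extremes = [p for p in probs if p in (HAM_EXTREME, SPAM_EXTREME)]
        if len(extremes) > N:
            return strongest(extremes, N)
        return strongest(probs, N)

    def clues_yours(probs):
        # Your rule: cancel .01/.99 pairs, and if fewer than N
        # non-cancelling extremes remain, pad out to N with the
        # strongest of the weaker clues.
        nham = probs.count(HAM_EXTREME)
        nspam = probs.count(SPAM_EXTREME)
        survivors = ([HAM_EXTREME] * max(0, nham - nspam)
                     + [SPAM_EXTREME] * max(0, nspam - nham))
        if len(survivors) >= N:
            return survivors[:N]
        others = [p for p in probs if p not in (HAM_EXTREME, SPAM_EXTREME)]
        return survivors + strongest(others, N - len(survivors))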

>The fact is that the scheme is almost totally lost in these cases
>regardless, and Graham's combining rule is such that just a few .01 or .99
>clues not cancelled pretty much determines the final outcome.

The cases that decided me had about 20-30 maximal clues, with about
5 or 6 unpaired .99s.  Allowing the less significant clues (ranging
from .02 to .08 with only a couple .96-.98s) flipped them from 'spam'
to 'ham', increasing the FN rate annoyingly.
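
To put numbers on that sensitivity (toy figures, not drawn from my
corpora):

    def graham_combine(probs):
        # Graham's combining rule: prod(p) / (prod(p) + prod(1 - p)).
        prod = inv = 1.0
        for p in probs:
            prod *= p
            inv *= 1.0 - p
        return prod / (prod + inv)

    print(graham_combine([0.99] * 5))               # ~1.0: spam
    print(graham_combine([0.99] * 5 + [0.02] * 8))  # ~0.0003: ham

Five unpaired .99s decide it by themselves; let a handful of mild ham
clues in and the product on the other side wins.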

Of course, this may be an artifact of the size of my corpora... those
.02-.08 clues may get promoted to .01s with more training data,
completely eliminating the difference between our approaches.

>One of my most stubborn false negatives has over a hundred .01 clues
>to a couple dozen .99 clues -- it doesn't matter what we do to combine
>those, it's always going to be a false negative without a more fundamental
>change elsewhere.

Yes.

>This "only look at smoking guns" scoring scheme seems to be systematically
>weak when dealing with long and chatty "just folks" spam (of which there is
>blessedly little so far <wink>).

Indeed.  I'm at a loss for how to combat it, though, unless we
artificially restrict the number of .01s we get through limited
corpora.

>Well, it's a statistical approach, [...]
>You really need multiple test runs to have confidence you're
>making progress.

Yeah, I really ought to be more anal about it. ;-)

- Alex