[Spambayes] RE: spam detection via probability - actual results!

Gary Robinson grobinson@transpose.com
Fri, 20 Sep 2002 07:44:51 -0400


> Something without an explanation:  Gary had a report from someone else who
> tried his combining scheme without bounding the number of words.

I got an apology this morning; he said he had accidentally trained and
tested on the same data set. So maybe that's the explanation.

> It's possible that the fellow who generated Gary's other result in this
> direction wasn't aware of the potential underflow problems

That's the most likely thing, IMHO.
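For anyone following along, the underflow Tim mentions comes from multiplying
hundreds of small per-word probabilities directly; the usual remedy is to work
in the log domain. A minimal sketch (my own illustration, not the actual
Spambayes combining code; the 700 guard is just a safe bound below where
math.exp overflows):

```python
import math

def combine(probs):
    """Combine per-word spam probabilities in the log domain to
    avoid floating-point underflow.  Generic sketch, not the exact
    Spambayes scheme."""
    # A naive product over hundreds of small probs underflows to 0.0;
    # summing logs keeps everything in a representable range.
    log_p = sum(math.log(p) for p in probs)          # log of prod(p)
    log_q = sum(math.log(1.0 - p) for p in probs)    # log of prod(1-p)
    # Graham-style score P/(P+Q), computed stably as
    # 1 / (1 + exp(log_q - log_p)).
    d = log_q - log_p
    if d > 700:          # exp(d) would overflow; Q utterly dominates P
        return 0.0
    return 1.0 / (1.0 + math.exp(d))
```

With 500 words at .01 the naive product is 10**-1000, which is exactly the
kind of value that silently becomes 0.0 in a double; the log-domain version
still returns a sensible score.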

> Now a question for Gary (hope you're still here <wink>), to help me
> understand what's needed to do this right:  what, exactly, does it mean to
> require that the spam probabilities be uniformly distributed?
> 
> Concrete and relevant example:  suppose I were to take the spamprobs exactly
> as they are now, and merely round them to two significant decimal digits.
> Then there would be exactly 99 distinct spamprobs in the system, uniformly
> distributed in .01 through .99.  Is that all it takes to meet the formal
> precondition?

No, that would mean you would have 99 buckets, but a different number of
words in each bucket. So the distribution over *words* still wouldn't be
uniform. Unless I'm misunderstanding you.
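To make the distinction concrete (an illustration with made-up counts, not
real Spambayes data): rounding produces few distinct *values*, but the number
of words landing on each value is wildly unequal, which is what breaks
uniformity.

```python
from collections import Counter

# Hypothetical spamprob database: most words pile up near .01 and .99
# (as Tim describes), with a much thinner middle.
spamprobs = [0.01] * 10000 + [0.99] * 10000 + [0.5] * 100 + [0.37] * 50

# Rounding to two digits yields at most 99 distinct values...
buckets = Counter(round(p, 2) for p in spamprobs)

# ...but the mass per bucket is far from equal, so the probabilities
# assigned to words are not uniformly distributed over (0, 1).
print(buckets[0.01], buckets[0.5])
```

A uniform distribution requires roughly the same number of words at each
probability level, not merely that the levels themselves are evenly spaced.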


> 
> If I normalize the existing probabilities instead based on rank (which I'm
> happy to do), I have tens of thousands of words all with spamprob .01 now,
> and also with spamprob .99 now.  Based on rank, then, assigning all ties to
> a probability based on the median rank in an all-equal range would *still*
> end up giving tens of thousands of words the same probabilities in the end.

I think you should get rid of the code that turns so many things into .01's
and .99's. There should be no need for those cut-offs under my scheme. Just
let everything be what it naturally is. Then you won't have the thousands of
equal probabilities.
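For reference, here is a sketch of the rank-based normalization Tim describes,
with tied scores assigned the probability of their median rank. This is my own
illustration of the idea, not code from Spambayes:

```python
def rank_normalize(probs):
    """Map scores to (0, 1) uniformly by rank; a tied run of equal
    scores all get the probability of the run's median rank."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    out = [0.0] * n
    i = 0
    while i < n:
        # Find the run of ties starting at sorted position i.
        j = i
        while j < n and probs[order[j]] == probs[order[i]]:
            j += 1
        # Median rank of the tied run i..j-1, mapped into (0, 1).
        mid = (i + j - 1) / 2.0
        p = (mid + 1) / (n + 1)
        for k in range(i, j):
            out[order[k]] = p
        i = j
    return out
```

As Tim notes, if tens of thousands of words are clamped to .01, this mapping
still hands them all one shared probability, which is why dropping the
clamping first matters.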

BUT be sure you do "Further Improvement 1" first. (That is, the new and
improved FI1 that was written after we talked about Graham's way of
generating his probabilities.) That will eliminate much of the reason for
the .01 and .99 cutoffs because it will be rarer to get such extreme values.
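FI1 isn't restated in this message; based on Robinson's published description
of the adjustment, it shrinks each word's raw probability toward a neutral
prior x with strength s, so a word seen only once can no longer hit an extreme
value. A sketch under that assumption (parameter names s and x are from the
published write-up; the exact "new and improved FI1" in this thread may differ
in detail):

```python
def adjusted_prob(p, n, s=1.0, x=0.5):
    """Shrink a raw per-word spam probability p, observed in n
    messages, toward the prior x with strength s.  Sketch of the
    Bayesian-style adjustment, not the exact Spambayes code."""
    return (s * x + n * p) / (s + n)
```

A word spammy in its single sighting gets 0.75 rather than 1.0, while a word
spammy in a thousand sightings stays near 1.0 -- which is why the hard .01/.99
cutoffs become largely unnecessary.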

I think there's a really good chance that if you do FI1 AND FI2, the results
will be significantly better than we're seeing now.


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454