[spambayes-dev] I took a big step Tuesday...

Rob W.W. Hooft rob at hooft.net
Mon Aug 4 11:03:49 EDT 2003


Tim Peters wrote:
> [Rob Hooft]
> 
> Nice to hear from you, Rob!

Still trying to keep up with the -dev list, but I haven't updated the 
software in months....

>>We have done lots of research in the earlier days of spambayes, and
>>have come to the conclusion that there are no more than two useful
>>cut-off points. Our false-positives mostly scored hopelessly close
>>to the ideal 1.00000000000000000.

> Hmm.  That wasn't true of my data:  the only FP I had scoring 1.00 (rounded)
> was the message that consisted almost entirely of a full quote of a Nigerian
> scam.  That one was hopeless.  All other FP scored below 1.00 (rounded).

I'm not sure about the statistics, but I guess you had so few FPs that 
you cannot test the hypothesis that the distribution of false positives 
is non-homogeneous. A higher cutoff may still be useful if the FP 
distribution is homogeneous, because, as you note, the TP distribution 
is very sharply peaked. Cutting at 0.9995 instead of 0.995 may catch 
almost as many spam, and will cost 1/10th the amount of ham if the 
homogeneous distribution hypothesis is true.
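
A minimal sketch of that arithmetic, assuming "homogeneous" means the 
false-positive scores are uniformly distributed near the top of the 
scale, so the number of ham lost above a cutoff is proportional to 
(1 - cutoff). The function name fp_cost_ratio is made up for this 
illustration and is not part of spambayes:

```python
def fp_cost_ratio(new_cutoff, old_cutoff):
    """Expected ham lost above new_cutoff relative to old_cutoff.

    Assumption: false-positive scores are uniform (homogeneous), so the
    count above a cutoff c is proportional to (1 - c).
    """
    if not (0.0 <= old_cutoff < 1.0 and 0.0 <= new_cutoff <= 1.0):
        raise ValueError("cutoffs must lie in [0, 1], with old_cutoff < 1")
    return (1.0 - new_cutoff) / (1.0 - old_cutoff)


# Moving the cutoff from 0.995 to 0.9995 under the uniform hypothesis.
ratio = fp_cost_ratio(0.9995, 0.995)
print(f"ham cost relative to the 0.995 cutoff: {ratio:.3f}")
```

The ratio comes out at about 0.1, which is the "1/10th" above. It says 
nothing about how much spam is missed; that depends on how sharply the 
TP distribution is peaked in real data.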

This could even suggest an intermediate training strategy: don't train 
on messages that score <0.001 or >0.999.
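
As a rough sketch of that rule, assuming each message already comes 
with a score between 0 and 1 (the helper name select_for_training and 
the (message, score) pairs are hypothetical, not spambayes API):

```python
def select_for_training(scored_messages, low=0.001, high=0.999):
    """Keep only messages whose score is not already extreme.

    scored_messages: iterable of (message, score) pairs, score in [0, 1].
    Messages scoring below `low` or above `high` are skipped, on the
    assumption that the classifier is already certain about them.
    """
    return [(msg, score) for msg, score in scored_messages
            if low <= score <= high]


# Made-up scores purely for illustration.
sample = [("ham-1", 0.0), ("unsure-1", 0.5), ("spam-1", 0.9995), ("spam-2", 0.999)]
for msg, score in select_for_training(sample):
    print(msg, score)
```

With these sample scores only "unsure-1" and "spam-2" would be kept for 
training; the thresholds are just the ones mentioned above and would 
need testing on real data.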

>>If you find spam boring and want to delete everything above 0.995
>>automatically, there is no scientific basis for not cutting at 0.90
>>instead.
> 
> 
> There's an obvious basis for not doing that, though:  I've seen FP scoring
> above 0.90 in day-to-day use, always a piece of HTML email I actually want,
> from an online business spambayes hadn't yet been taught about.  OTOH, I've
> never seen an FP in day-to-day use that scored 1.00 (rounded), although
> *most* spam scores 1.00 (rounded -- and most ham scores 0.00 (rounded)).  I
> think Skip is seeing the same.

When do you actually read your spam? In my setup it is very difficult 
to get at my spam. The only reason for me to have a look at it is when 
I am re-training (about once a month): just before training I use 
"mail -f spam.mbox" to skim the headers.

> Theory simply hasn't kept up with practice here.  That's what happens when
> all the theorists die <wink>.

I just gave a sign of life...from a distance. And you yourself are still 
around as well.

Rob

-- 
Rob W.W. Hooft  ||  rob at hooft.net  ||  http://www.hooft.net/people/rob/



