[Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein)

Tim Peters tim.one@comcast.net
Fri Nov 1 20:49:46 2002


[Tim Peters, on Jeremy's poorly-scoring example]
> e.g., for a tech guy to have the word "computer" as a
> high-spamprob word is suspicious all by itself:
>
>> computer 0.851704776271

[Toby Dickenson]
> It's a 0.88 for me too, due to "If you want to make money with
> your computer" spam.

I believe it.  In context, Jeremy had many computer*ish* words scoring with
high spamprobs, and many mailing-list lexicalisms not scoring with low
spamprobs, and some obvious spam words not scoring with high spamprobs.
Jeremy has said in the past that he's inclined to train only on mistakes,
and I've raised as many cautions about that as I can.  The system was
intended from the start to be trained on a random sampling of all your ham
and spam.  Every time someone has sent me a "surprising msg", my personal
classifier has absolutely nailed it in the correct category; I don't think
that's because I know a secret way to start Python <wink>, but suspect it's
because I've made sustained attempts to train my personal classifier on a
"random slice of real life" every day (including a representative sampling
of duplicates when I get a single ham or spam from multiple sources).  This
gives it a reality-driven view of the probabilities instead of a
mistake-driven view, and also adapts its view of both as time goes on.
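
To make the difference concrete, here is a toy sketch, not the Spambayes
classifier itself. It assumes a bare, unsmoothed per-word estimate
(fraction of spam containing the word, divided by that plus the fraction
of ham containing it). The names train and spamprob and the sample
messages are all made up for illustration.

```python
from collections import Counter

def train(messages):
    """Count, per word, how many ham and spam messages contain it."""
    ham_counts, spam_counts = Counter(), Counter()
    nham = nspam = 0
    for words, is_spam in messages:
        if is_spam:
            nspam += 1
            spam_counts.update(set(words))
        else:
            nham += 1
            ham_counts.update(set(words))
    return ham_counts, spam_counts, nham, nspam

def spamprob(word, model):
    """Simplified, unsmoothed estimate (an assumption of this sketch)."""
    ham_counts, spam_counts, nham, nspam = model
    ham_ratio = ham_counts[word] / nham if nham else 0.0
    spam_ratio = spam_counts[word] / nspam if nspam else 0.0
    if ham_ratio + spam_ratio == 0:
        return 0.5  # never-seen word: no evidence either way
    return spam_ratio / (ham_ratio + spam_ratio)

# Hypothetical mailbox of a tech user: most ham mentions "computer".
ham = [(["computer", "python", "patch"], False)] * 6 \
    + [(["lunch", "meeting"], False)] * 2
spam = [(["money", "computer", "free"], True)] * 2 \
     + [(["money", "free", "offer"], True)] * 2

# "Random slice of real life": train on everything that arrived.
p_all = spamprob("computer", train(ham + spam))

# Mistake-driven: train only on what was misclassified, here assumed
# to be the two "computer" spams plus two hams.
mistakes = spam[:2] + [ham[0], ham[6]]
p_mistakes = spamprob("computer", train(mistakes))

print(f"trained on everything:   {p_all:.2f}")       # 0.40
print(f"trained on mistakes only: {p_mistakes:.2f}")  # 0.67
```

With the full sample, "computer" leans hammy because most of this
user's ham contains it. The mistakes-only sample never sees that bulk of
ordinary ham, so the same word ends up leaning spammy.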