[Spambayes] progress on POP+VM+ZODB deployment
Tim Peters
tim.one@comcast.net
Sun Oct 27 18:02:07 2002
[Derek Simkowiak]
> How can we test out new algorithms if the project doesn't have a
> control group? We have no way of knowing if someone's successful (or
> poor) results are an attribute of the new algorithm, or if it's an
> attribute of their particular sample data.
Do read TESTING.txt, checked into the project. The testing framework is set
up in a statistically sound way, so that even people working with a single
corpus get it sliced and diced in random ways across multiple testing runs.
In addition, as Alex already said, Big Changes have been made only after
multiple-corpora tests reported on this list. When 10 randomized runs
across each of several distinct corpora all yield similar results, it's easy
to have confidence.
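
As a rough illustration of what "sliced and diced in random ways" can mean, here is a minimal sketch of randomized cross-validation over a single corpus. It is not the project's actual test driver (TESTING.txt describes that); the function names, fold count and seed are made up for the example.

```python
import random

def random_folds(messages, n_folds, seed):
    """Shuffle the messages and deal them into n_folds roughly equal sets."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def cross_validation_runs(messages, n_folds, seed):
    """Yield (train, test) pairs; each fold is held out exactly once."""
    folds = random_folds(messages, n_folds, seed)
    for i, test in enumerate(folds):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test

# Hypothetical corpus of 100 message ids, split into 10 randomized runs.
for train, test in cross_validation_runs(range(100), n_folds=10, seed=2002):
    pass  # train a classifier on `train`, then score it on `test`
```

Repeating this per corpus, and comparing results across corpora, is what guards against a change that merely suits one person's sample data.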
> Having a starter.db would both (a) make life easier for getting
> started,
I couldn't give you a starter db that would work well for your ham. The
algorithms here aren't *trying* to "identify spam" -- if that's what you
want, look at something like SpamAssassin. The algorithms here are trying
to *separate* ham from spam, and the ham words are just as important to that
as the spam words. I've run several experiments where a classifier trained
on one corpus was used to predict against a different corpus. The false
negative rate remained good (little spam snuck thru), but the false positive
rate zoomed (many ham were *called* spam). In IR terms, spam recall
remained good but spam precision suffered badly.
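
To make the terminology concrete, here is a small sketch relating the false negative and false positive rates to spam recall and precision. The counts are invented purely for illustration (they are not results from any experiment), and the function name is hypothetical.

```python
def spam_metrics(spam_caught, spam_missed, ham_flagged, ham_passed):
    """Compute error rates and IR-style metrics from raw counts."""
    return {
        # fraction of spam that got through
        "fn_rate": spam_missed / (spam_caught + spam_missed),
        # fraction of ham that was called spam
        "fp_rate": ham_flagged / (ham_flagged + ham_passed),
        # fraction of all spam that was caught (1 - fn_rate)
        "recall": spam_caught / (spam_caught + spam_missed),
        # fraction of "spam" verdicts that were really spam
        "precision": spam_caught / (spam_caught + ham_flagged),
    }

# Made-up counts: trained and tested on the same kind of mail ...
same_corpus = spam_metrics(spam_caught=990, spam_missed=10,
                           ham_flagged=5, ham_passed=995)
# ... versus trained on one corpus and tested on someone else's ham.
cross_corpus = spam_metrics(spam_caught=990, spam_missed=10,
                            ham_flagged=200, ham_passed=800)

for name, m in (("same", same_corpus), ("cross", cross_corpus)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In the invented cross-corpus case the false negative rate (and so recall) is unchanged, while the extra flagged ham drags precision down.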
This isn't surprising, either: except for foreign-language spam, spam
still uses ordinary words, and the same words show up in ham too. For
example, in the very msg I'm replying to,
    'give'        0.648963
    'skip:w 10'   0.664292
    'results'     0.693332
    'database.'   0.718815
    'successful'  0.821229
    'stock'       0.867852
    'data.'       0.887295
    "someone's"   0.969799
    'subject:+'   0.987106
That's a decent collection of high-spamprob words. Nevertheless,
chi-combining was extremely confident the msg was ham, because of a much
larger number of low-spamprob words, some of which are specific to the topic
being discussed on this mailing list, and some of which are specific to
computer-geek chatter:
    'argument'               0.0155709
    'header:In-reply-to:1'   0.0158379
    'subject:: ['            0.0169746
    'attribute'              0.0196507
    'url:mailman-21'         0.0196507
    'skip:_ 40'              0.0320263
    "else's"                 0.0348837
    '(b)'                    0.0412844
    'header:Errors-to:1'     0.0458968
    'started,'               0.0505618
    'subject:Spambayes'      0.0505618
    'algorithms'             0.0652174
    'subject:ZODB'           0.0652174
    'subject:] '             0.0772017
    'from:derek'             0.0918367
    'spambayes'              0.0918367
    'header:Return-path:1'   0.0946929
    'header:Message-id:1'    0.0962885
    'header:MIME-version:1'  0.122459
The low-spamprob words specific to *your* ham will depend on the content of
your ham in equally quirky ways.
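
For readers who want to see how a pile of low-spamprob words can outvote nine high-spamprob ones, here is a minimal sketch. It assumes chi-combining follows Fisher's method: the word probabilities are folded into two chi-squared statistics, one measuring evidence for spam and one for ham, which are then merged into a single score. The names chi2Q and chi_combine and the final (S - H + 1) / 2 step are assumptions for illustration, not something this message spells out. Only the words listed above are used, so the scores need not match what the real classifier reported for the full message.

```python
import math

def chi2Q(x2, v):
    """Return prob(chisq >= x2) with v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # near 1 when spam evidence is strong
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # near 1 when ham evidence is strong
    return (S - H + 1.0) / 2.0

# Spamprobs copied from the two lists in this message.
high = [0.648963, 0.664292, 0.693332, 0.718815, 0.821229,
        0.867852, 0.887295, 0.969799, 0.987106]
low = [0.0155709, 0.0158379, 0.0169746, 0.0196507, 0.0196507,
       0.0320263, 0.0348837, 0.0412844, 0.0458968, 0.0505618,
       0.0505618, 0.0652174, 0.0652174, 0.0772017, 0.0918367,
       0.0918367, 0.0946929, 0.0962885, 0.122459]

print("high-spamprob words only:", chi_combine(high))
print("all listed words:        ", chi_combine(high + low))
```

Under these assumptions the high-spamprob words alone score close to 1 (spam), while adding the low-spamprob words pulls the combined score close to 0 (ham).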