[Spambayes] progress on POP+VM+ZODB deployment

Tim Peters tim.one@comcast.net
Sun Oct 27 18:02:07 2002


[Derek Simkowiak]
> 	How can we test out new algorithms if the project doesn't have a
> control group?  We have no way of knowing if someone's successful (or
> poor) results are an attribute of the new algorithm, or if it's an
> attribute of their particular sample data.

Do read TESTING.txt, checked into the project.  The testing framework is set
up in a statistically sound way, so that even people working with a single
corpus get it sliced and diced in random ways across multiple testing runs.
In addition, as Alex already said, Big Changes have been made only after
multiple-corpora tests reported on this list.  When 10 randomized runs
across each of several distinct corpora all yield similar results, it's easy
to have confidence.
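
As a rough illustration of the kind of randomized slicing described above
(not the project's actual test driver; TESTING.txt is the authority on
that), here is a minimal sketch of shuffling one corpus into folds and
holding each fold out once.  The function names, the seed, and the
placeholder message ids are all assumptions made for the example.

```python
import random

def random_folds(messages, k, seed):
    """Shuffle messages with the given seed and deal them into k folds."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    # Dealing round-robin keeps fold sizes within one message of each other.
    return [shuffled[i::k] for i in range(k)]

def cross_validation_runs(messages, k=10, seed=0):
    """Yield (train, test) pairs; each fold is held out exactly once."""
    folds = random_folds(messages, k, seed)
    for i, test in enumerate(folds):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test

# Placeholder message ids standing in for a real corpus.
corpus = [f"msg{n:03d}" for n in range(50)]
runs = list(cross_validation_runs(corpus, k=10, seed=42))
```

Changing the seed gives a different random slicing of the same corpus, so
a result that holds up across several seeds (and several corpora) is less
likely to be an accident of one particular split.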

> 	Having a starter.db would both (a) make life easier for getting
> started,

I couldn't give you a starter db that would work well for your ham.  The
algorithms here aren't *trying* to "identify spam" -- use something
like SpamAssassin if that's what you want.  The algorithms here are trying
to *separate* ham from spam, and the ham words are just as important to that
as the spam words.  I've run several experiments where a classifier trained
on one corpus was used to predict against a different corpus.  The false
negative rate remained good (little spam snuck thru), but the false positive
rate zoomed (many ham were *called* spam).  In IR terms, spam recall
remained good but spam precision suffered badly.
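
For anyone unsure how the four measures relate, here is a small sketch
that derives them from raw counts.  The function name is made up, and the
counts are invented purely to show the arithmetic; they are not results
from the experiments mentioned above.

```python
def error_rates(n_ham, n_spam, false_pos, false_neg):
    """Derive error rates and IR measures from raw classification counts.

    false_pos: ham messages that were called spam
    false_neg: spam messages that were called ham
    """
    true_pos = n_spam - false_neg  # spam correctly called spam
    return {
        "fp_rate": false_pos / n_ham,
        "fn_rate": false_neg / n_spam,
        # Recall: what fraction of the spam was caught.
        "spam_recall": true_pos / n_spam,
        # Precision: what fraction of "spam" verdicts were really spam.
        "spam_precision": true_pos / (true_pos + false_pos),
    }

# Invented counts: few spam slip through, but many ham get flagged.
rates = error_rates(n_ham=1000, n_spam=1000, false_pos=200, false_neg=10)
```

Spam recall is just one minus the false negative rate, so it stays high
whenever little spam gets through.  Precision is the measure that drops
when ham gets flagged, because every false positive lands in its
denominator.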

This isn't surprising, either:  except for foreign-language spam, spam is
still using ordinary words, and the same words show up in ham too.  For
example, in the very msg I'm replying to,

'give'                         0.648963
'skip:w 10'                    0.664292
'results'                      0.693332
'database.'                    0.718815
'successful'                   0.821229
'stock'                        0.867852
'data.'                        0.887295
"someone's"                    0.969799
'subject:+'                    0.987106

That's a decent collection of high-spamprob words.  Nevertheless,
chi-combining was extremely confident the msg was ham, because of a much
larger number of low-spamprob words, some of which are specific to the topic
being discussed on this mailing list, and some of which are specific to
computer-geek chatter:

'argument'                     0.0155709
'header:In-reply-to:1'         0.0158379
'subject:: ['                  0.0169746
'attribute'                    0.0196507
'url:mailman-21'               0.0196507
'skip:_ 40'                    0.0320263
"else's"                       0.0348837
'(b)'                          0.0412844
'header:Errors-to:1'           0.0458968
'started,'                     0.0505618
'subject:Spambayes'            0.0505618
'algorithms'                   0.0652174
'subject:ZODB'                 0.0652174
'subject:] '                   0.0772017
'from:derek'                   0.0918367
'spambayes'                    0.0918367
'header:Return-path:1'         0.0946929
'header:Message-id:1'          0.0962885
'header:MIME-version:1'        0.122459

The low-spamprob words specific to *your* ham will depend on the content of
your ham in equally quirky ways.
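
To see how a handful of strong spam clues can be outvoted, here is a
minimal sketch of chi-squared combining in the style of Fisher's method.
This message doesn't spell out "chi-combining", so treat the formula as an
assumption about what is meant, not a copy of the project's code.  The
function names are invented, and only the clues listed above are fed in;
the real classifier may well have looked at more.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spamprobs into one score in [0, 1] (assumed form)."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # near 1 when spam clues dominate
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # near 1 when ham clues dominate
    return (S - H + 1.0) / 2.0

# Spamprobs copied from the two clue lists above.
high = [0.648963, 0.664292, 0.693332, 0.718815, 0.821229,
        0.867852, 0.887295, 0.969799, 0.987106]
low = [0.0155709, 0.0158379, 0.0169746, 0.0196507, 0.0196507,
       0.0320263, 0.0348837, 0.0412844, 0.0458968, 0.0505618,
       0.0505618, 0.0652174, 0.0652174, 0.0772017, 0.0918367,
       0.0918367, 0.0946929, 0.0962885, 0.122459]

score = chi_combine(high + low)
```

Fed the high-spamprob clues alone, the same function scores close to 1;
adding the longer list of low-spamprob clues drags the combined score down
toward 0, which is the ham end of the scale.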