[Spambayes] An alternate use

Tim Peters tim.one@comcast.net
Sun Nov 3 18:51:48 2002


[Tim]
> That's actually what started this project:  Barry Warsaw is GNU
> Mailman's author, and he asked me to look into adapting Graham's
> scheme for incorporation into Mailman. ...

[Rob Hooft]
> So, we'd have to make mailing lists keep a spam-archive as well? Or do
> we deliver spambayes with a pre-cooked spam archive to get started with
> new mailing lists?

That will remain unclear until someone sets up relevant experiments and
people measure results.  I'm counting on Barry to drive that.  Seeding a
mailing-list classifier with ham may also be a puzzle.  I suspect, but don't
know, that training several times on the initial list introduction post will
do well at that -- most lists have "a topic" <wink>, and a good list intro
is bound to mention many words characteristic of that topic.
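
A minimal sketch of the repeated-training idea above, using a toy stand-in rather than the project's real classifier. The `ToyClassifier` class, the `tokenize` helper, and the intro text are all made up for illustration; the only point is that training on one intro post several times gives its topic words a ham count from day one.

```python
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a hypothetical stand-in for a real tokenizer.
    # Returning a set means each word counts at most once per message.
    return set(text.lower().split())


class ToyClassifier:
    """Hypothetical toy: tracks only ham message and per-word ham counts."""

    def __init__(self):
        self.nham = 0
        self.hamcount = Counter()

    def learn_ham(self, text):
        self.nham += 1
        self.hamcount.update(tokenize(text))


# Invented example of a list introduction post.
intro = "Welcome to the list for discussing Python packaging and distutils"

clf = ToyClassifier()
for _ in range(5):  # "several times"; 5 is an arbitrary choice
    clf.learn_ham(intro)
```

Whether a handful of repeats is enough, or too much, is exactly the kind of thing that would need measuring.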

For python.org use, I expect we'll share a single spam corpus across all
non-personal email carried by that site.

One of the reasons I keep the default header analysis as
platform-independent as I can is so that it won't be a nightmare to *try* to
share spam stats.  I haven't tried to do this, though.

A hint of the potential, where w is the WordInfo dict from my fat c.l.py
(comp.lang.python) test:

"""
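import pickle

# Keep only words that are strongly spammy and were seen at least 10 times
# in total, so they are not close to being hapaxes.
d = {}
for k, r in w.items():
    if r.spamprob > 0.95 and r.spamcount + r.hamcount >= 10:
        d[k] = r

# Save the reduced dict using binary pickle protocol 1.
with open('reduced.pik', 'wb') as f:
    pickle.dump(d, f, 1)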
"""

Of the 327,439 words in the full dict, 10,559 pass that rather demanding
test for "strong spamness" (high spamprob and not close to being a hapax).
Seeding a classifier with those *may* work well, although the probabilities
will get recomputed in the new classifier, and it's unclear (to me) how to
fiddle the spamcounts and hamcounts in the inherited words so that they
don't dominate the first year of a list's life.
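
One possible way to fiddle the counts, offered only as an untested sketch: cap each inherited word's total count while roughly preserving its spam/ham ratio, so a new list's own training can outvote the seed fairly quickly. The `WordInfo` class here is a hypothetical stand-in carrying just the three attributes used in the snippet above, and `damp_counts` and its `cap` parameter are made-up names, not part of the project.

```python
class WordInfo:
    """Hypothetical stand-in record with the attributes used in the post."""

    def __init__(self, spamcount, hamcount, spamprob=0.5):
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.spamprob = spamprob


def damp_counts(words, cap=10):
    """Return a new dict whose per-word total count is at most `cap`.

    The spam/ham split is kept roughly proportional to the original.
    """
    damped = {}
    for word, r in words.items():
        total = r.spamcount + r.hamcount
        if total == 0:
            # Nothing to scale; carry the record over with zero counts.
            damped[word] = WordInfo(0, 0, r.spamprob)
            continue
        new_total = min(total, cap)
        spam = round(new_total * r.spamcount / total)
        if r.spamcount > 0:
            # Keep at least one spam sighting for a word that had any.
            spam = max(1, spam)
        ham = new_total - spam
        damped[word] = WordInfo(spam, ham, r.spamprob)
    return damped


# Invented example records, not taken from any real corpus.
seed = {"viagra": WordInfo(180, 20, 0.97), "rare": WordInfo(3, 0, 0.99)}
light = damp_counts(seed, cap=10)
```

The stored spamprob is copied unchanged here, but as noted above it would be recomputed by the new classifier anyway; what value of `cap` (if any) behaves well is an open question for experiment.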