[spambayes-dev] TREC-Spam

Wed Mar 23 03:42:46 CET 2005

I'm planning on participating in TREC 2005 (the spam track) using SpamBayes:

<http://trec.nist.gov/>
<http://plg.uwaterloo.ca/~gvcormac/spam/>

Basically the idea is that a whole lot of filters are run over a few corpora
(a couple of public and a couple of private) and the results are compared.
(The aim isn't to say, "hey, my filter is best", but to see what works well
and where improvements can be made.)

The testing system is similar to our (Alex's) incremental testing setup -
the steps are:

initialize
classify emailfile resultfile
train [ham|spam] emailfile resultfile
finalize

So there is (or can be) training after each classification.  I'll create
scripts (a modified sb_filter, probably) that do each of the steps.  I don't
think that train-on-everything is a good idea here, so I'll also include some
sort of training regime, like the incremental testing setup does (maybe
train-to-exhaustion?).
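
To make the classify-then-train loop concrete, here is a rough sketch in
Python.  It is not SpamBayes code: ToyFilter, run() and the two cutoff values
are all hypothetical, made up for illustration, and the file-based
emailfile/resultfile handling is left out.  The regime shown (train only on
mistakes and unsures) is just one alternative to train-on-everything, not
necessarily the one I'd end up using.

```python
import math
import re
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real classifier (not the SpamBayes API)."""

    def __init__(self):
        # "initialize": start with an empty database.
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.trained = {"ham": 0, "spam": 0}

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        log_odds = 0.0
        for tok in self.tokens(text):
            # Add-one smoothing so unseen tokens never give a zero ratio.
            s = (self.counts["spam"][tok] + 1) / (self.trained["spam"] + 2)
            h = (self.counts["ham"][tok] + 1) / (self.trained["ham"] + 2)
            log_odds += math.log(s / h)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))

    def train(self, text, label):
        self.counts[label].update(self.tokens(text))
        self.trained[label] += 1


def run(filter_, messages, ham_cutoff=0.2, spam_cutoff=0.9):
    """Classify each (text, truth) pair, then train only when the verdict
    was wrong or unsure.  The cutoffs are illustrative assumptions."""
    results = []
    for text, truth in messages:
        score = filter_.score(text)          # "classify"
        if score >= spam_cutoff:
            verdict = "spam"
        elif score <= ham_cutoff:
            verdict = "ham"
        else:
            verdict = "unsure"
        results.append((verdict, score))
        if verdict != truth:                 # "train ham|spam", selectively
            filter_.train(text, truth)
    return results
```

Calling run(ToyFilter(), [(text, "spam"), (text, "ham"), ...]) gives back one
(verdict, score) pair per message, in order, which is roughly what would be
written to each resultfile.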

I'm interested in doing this:

 o As research that I can work on after I submit my PhD and before I defend
it.

 o To see how spambayes compares with various types of filter, on various
types of corpus.

 o As a sideline to other research I'd like to do with spambayes (see #1).

To get to the point of the email:

 o Does anyone object to me using spambayes in this way?  Everyone will be
acknowledged in the write-ups, obviously, and I'm participating as an
individual (with tentative ties to my work, and obviously using the work of,
but not speaking for, the spambayes group).

 o Is anyone else interested in this?  I can certainly report back as things
progress, but if anyone is really interested and can spare the time, I'd
happily work on it with someone else.

=Tony.Meyer