[Spambayes] Central limit

Josiah Carlson jcarlson@uci.edu
Mon, 30 Sep 2002 11:07:37 -0700


> There's a large literature on this.  A good start is Jason Rennie's paper on
> ifile, a classic Bayesian N-way classifier.  The paper is available from
> ifile's homepage:

I'm going to check it out later this week.  Thank you.

> Error rates reported there range from 10-20%.  Rennie (ifile's author)
> reports an error rate of about 15% on his own email, doing 50-way
> classification (this info is in the ifile FAQ).  Before assuming that
> popfile does better than that, measure it <wink>.  The ham-vs-spam error
> rates I'm seeing on my corpus are at least 100x better; keep in mind that
> people have non-zero error rates too, and several times on this list we've
> had vigorous debates about whether a specific piece of email *was* spam.
> It's not always clear, and in my set of 20,000 hams I'm still keeping a
> message that added a one-line comment to a quote of an entire Nigerian-scam
> spam -- that's one of the 2 false positives remaining in my corpus.

My thoughts may already be addressed in the FAQ.  Even if we do n-way
classification, what if we really only care whether a message is in one
of those 50 good folders or in that one bad folder?  What I was trying
to express is that collapsing those 50 good folders into 1 good folder
could be generalizing too much.  It could also be that our spam category
is just as over-generalized.  What if we were to split it up into
financial spam, porn spam, etc.?  I would think that would even the
playing field a bit.  Then one could have two lists: the spam list
(categories of spam) and the ham list (categories of ham).  We are
really only concerned with ham versus spam, but a bit more work on our
side could make the computer's job easier.
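
To make the two-list idea concrete, here is a rough sketch in Python.
It is not how Spambayes, ifile, or popfile work; it is a toy multinomial
naive Bayes classifier with add-one smoothing, and the folder names, the
GROUP mapping, and the training messages are all made up for
illustration.  It scores a message against every fine-grained category,
picks the best one, and then reports only which list that category
belongs to.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mapping from fine-grained categories to the two lists.
GROUP = {
    "work": "ham",
    "family": "ham",
    "financial": "spam",
    "porn": "spam",
}


class NWayClassifier:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages seen
        self.vocab = set()                       # every word seen in training

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Return log(prior * likelihood) for each trained category."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for cat, n_docs in self.doc_counts.items():
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            scores[cat] = score
        return scores


def ham_or_spam(clf, text):
    """Pick the best fine-grained category, then report only its list."""
    scores = clf.log_scores(text)
    best = max(scores, key=scores.get)
    return GROUP[best], best


# Made-up training data, one message per category.
clf = NWayClassifier()
clf.train("work", "meeting agenda for the project review")
clf.train("family", "dinner with mom and dad on sunday")
clf.train("financial", "urgent transfer of funds to your bank account")
clf.train("porn", "hot xxx pics click here")

print(ham_or_spam(clf, "please transfer funds to my bank account"))
```

Whether the extra categories actually lower the ham-vs-spam error rate
is something that would have to be measured on a real corpus; the
sketch only shows the mechanics of collapsing an n-way answer back down
to two lists.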

Again, thank you for the information,
 - Josiah