[Spambayes] Central limit
Rob Hooft
rob@hooft.net
Mon, 30 Sep 2002 10:49:42 +0200
While my computer is eating away cycles running timcv on my 24000 ham/
5750 spam messages, a few remarks on the Central Limit Theorem:
- The standard deviations seem "underestimated". Gary already said this
can be caused by correlations between scores. Alternatively, it can
indicate that the data is not one-dimensional: in more than one
dimension, a higher percentage of normally distributed data lies
outside of the "core regions". Anyway, something can be done about
this: just calculate the RMS Z-score and scale it to 1.0. After
applying that scale, the normal 68%/95% rule should apply. One would
hope.
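A minimal sketch of that rescaling, assuming the scores are already
expressed as Z-scores (the function name and the sample values are mine,
not from the spambayes code):

```python
import math

def rescale_z_scores(z_scores):
    """Divide every Z-score by the RMS of the whole set, so that the
    rescaled scores have RMS exactly 1.0 by construction."""
    rms = math.sqrt(sum(z * z for z in z_scores) / len(z_scores))
    return [z / rms for z in z_scores]

scores = [0.5, -1.2, 2.0, -0.3, 0.9]
scaled = rescale_z_scores(scores)
```

If the scores really are 1-D Gaussian apart from the scale factor, about
68% of the rescaled values should then fall within 1.0 and about 95%
within 2.0.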
- The "certainty" rule of Tim should be formalized. This is a very
powerful concept not only for those who believe in a middle ground,
but also for those who want to use a pre-trained spambayes system for
other corpora. Once this has crystallized a bit more we should
exchange pickles and see how well we can do with each other's
training data!
- It should somehow be possible to classify messages into any number of
distinct groups using this trick. A new message can get a Z-score
describing the likelihood that it is part of each of the groups; if
all of these numbers are large, the test message does not belong to
any class. I guess, e.g., that it should not be too difficult for the
bayesian algorithms used here to judge whether E-mail I receive is
"work", "private" or "spam". What would really take this to the next
generation would be an algorithm that can make the classification "ab
initio", as a sort of clustering algorithm: e.g. something that would
start with two of the most different messages in a single corpus, and
add single messages to either of the two groups until it finds a
message that has 2 large Z-scores. Then it starts a third group.
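To make the "start a new group when all Z-scores are large" idea
concrete, here is a toy sketch. The Z-score here is just the distance of
a single number from a group mean (a stand-in for the real spambayes
scoring), and both the threshold Z_MAX and the sigma floor of 1.0 are my
assumptions, needed to keep tiny groups from having degenerate spreads:

```python
import math

Z_MAX = 2.0  # assumed threshold: a larger Z means "not in this group"

def z_score(x, members):
    """Distance of x from the group mean, in units of the group's
    standard deviation (floored at 1.0 for small/degenerate groups)."""
    mean = sum(members) / len(members)
    var = sum((m - mean) ** 2 for m in members) / len(members)
    sigma = max(math.sqrt(var), 1.0)
    return abs(x - mean) / sigma

def cluster(values):
    """Assign each value to the nearest existing group, or found a new
    group when every existing group gives a large Z-score."""
    groups = []
    for v in values:
        scores = [z_score(v, g) for g in groups]
        if groups and min(scores) < Z_MAX:
            groups[scores.index(min(scores))].append(v)
        else:
            groups.append([v])  # all Z-scores large: start a new group
    return groups
```

Feeding it two well-separated clumps, e.g.
cluster([1.0, 1.1, 0.9, 10.0, 10.2]), yields two groups, which is the
behaviour the third-group idea above is after.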
Just dreaming.
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/