[Spambayes] Central limit
Rob Hooft
rob@hooft.net
Mon, 30 Sep 2002 10:49:42 +0200
While my computer is eating away cycles running timcv on my 24000 ham/
5750 spam messages, a few remarks on the Central Limit Theorem:
- The standard deviations seem "underestimated". Gary already said this
can be caused by correlations between scores. Alternatively, it can
indicate that the data is not one-dimensional: in more than one
dimension, a higher percentage of normally distributed data lies
outside of the "core regions". Anyway, something can be done about
this: just calculate the RMS Z-score and scale it to 1.0. After
applying that scale, the normal 68%/95% rule should apply. One would
hope.
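A minimal sketch of that rescaling, assuming the scores are already
expressed as Z-scores (the function name and the sample values are mine,
not from the spambayes code):

```python
import math

def rescale_z_scores(z_scores):
    """Divide every Z-score by the RMS of the whole set, so that the
    rescaled scores have RMS exactly 1.0 by construction."""
    rms = math.sqrt(sum(z * z for z in z_scores) / len(z_scores))
    return [z / rms for z in z_scores]

scores = [0.5, -1.2, 2.0, -0.3, 0.9]
scaled = rescale_z_scores(scores)
```

If the scores really are 1-D Gaussian apart from the scale factor, about
68% of the rescaled values should then fall within 1.0 and about 95%
within 2.0.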
- The "certainty" rule of Tim should be formalized. This is a very
powerful concept not only for those who believe in a middle ground,
but also for those who want to use a pre-trained spambayes system for
other corpora. Once this has crystallized a bit more we should
exchange pickles and see how well we can do with each other's
training data!
- It should somehow be possible to classify messages into any number of
distinct groups using this trick. A new message can get a Z-score
describing the likelihood that it is part of each of the groups; if
all of these numbers are large, the test message does not belong to
any class. I guess, e.g., that it should not be too difficult for the
bayesian algorithms used here to judge whether E-mail I receive is
"work", "private" or "spam". What would really take this to the next
generation would be an algorithm that can make the classification "ab
initio", as a sort of clustering algorithm: e.g. something that would
start with two of the most different messages in a single corpus, and
add single messages to either of the two groups until it finds a
message that has 2 large Z-scores. Then it starts a third group.
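To make the "start a new group when all Z-scores are large" idea
concrete, here is a toy sketch. The Z-score here is just the distance of
a single number from a group mean (a stand-in for the real spambayes
scoring), and both the threshold Z_MAX and the sigma floor of 1.0 are my
assumptions, needed to keep tiny groups from having degenerate spreads:

```python
import math

Z_MAX = 2.0  # assumed threshold: a larger Z means "not in this group"

def z_score(x, members):
    """Distance of x from the group mean, in units of the group's
    standard deviation (floored at 1.0 for small/degenerate groups)."""
    mean = sum(members) / len(members)
    var = sum((m - mean) ** 2 for m in members) / len(members)
    sigma = max(math.sqrt(var), 1.0)
    return abs(x - mean) / sigma

def cluster(values):
    """Assign each value to the nearest existing group, or found a new
    group when every existing group gives a large Z-score."""
    groups = []
    for v in values:
        scores = [z_score(v, g) for g in groups]
        if groups and min(scores) < Z_MAX:
            groups[scores.index(min(scores))].append(v)
        else:
            groups.append([v])  # all Z-scores large: start a new group
    return groups
```

Feeding it two well-separated clumps, e.g.
cluster([1.0, 1.1, 0.9, 10.0, 10.2]), yields two groups, which is the
behaviour the third-group idea above is after.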
Just dreaming.
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/