[Spambayes] Training

Wed Nov 20 04:14:05 2002

[Paul Moore]
> ...
> This happened to me today, with Tim's new adjustment switched on, with
> a 10:1 ham:spam imbalance. IIRC, Tim's change means that with this
> sort of imbalance, ham clues will only have 10% of their normal
> effect, so saying "This is ham" will be pretty much ignored :-(

That change affects only the Bayesian adjustment to the by-counting spamprob
estimates, and the adjustment isn't a linear function, so 10:1 -> 10% isn't
what happens.  For what really happens, study update_probabilities <wink>.

The effect of the Bayesian adjustment is *always* to move a by-counting
estimate closer to 0.5 (unknown_word_prob).  It can never increase the
distance of a by-counting estimate from 0.5.  So even if the Bayesian
adjustment weren't done at all, a hamprob can only get as low as the data
says it should get, and that's purely a matter of how often the word has
been seen in trained ham and trained spam.  Doing better than that would
require major psychic powers.
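
To make the "always moves toward 0.5" claim concrete, here is a minimal
sketch.  It assumes the adjustment is a weighted blend of the by-counting
estimate p with the prior, (s*x + n*p) / (s + n), where n is how many
trained messages contained the word.  The function names, the strength
value 0.45, and the counts are illustrative assumptions, not quoted from
update_probabilities.

```python
# Hypothetical names and values, chosen for illustration only.
UNKNOWN_WORD_PROB = 0.5        # the prior the estimate is pulled toward
UNKNOWN_WORD_STRENGTH = 0.45   # assumed weight given to that prior


def by_counting_prob(hamcount, spamcount, nham, nspam):
    """Spamprob estimate from raw counts alone (word must have been seen)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)


def adjusted_prob(p, n, s=UNKNOWN_WORD_STRENGTH, x=UNKNOWN_WORD_PROB):
    """Blend the by-counting estimate p with the prior x, given n sightings."""
    return (s * x + n * p) / (s + n)


# Made-up counts: word seen in 3 of 1000 ham and 1 of 100 spam.
p = by_counting_prob(hamcount=3, spamcount=1, nham=1000, nspam=100)
for n in (1, 4, 40):
    a = adjusted_prob(p, n)
    # The last column checks that the adjusted value is no farther from 0.5.
    print(n, round(p, 4), round(a, 4), abs(a - 0.5) <= abs(p - 0.5))
```

Under this form, adjusted - x = (p - x) * n / (s + n), and n / (s + n) is
never more than 1, so the distance from 0.5 can shrink but never grow.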

> I'm not sure. All training is basically saying "these specific
> messages *are* ham/spam". Whether this is done in bulk, or on an
> individual basis, shouldn't matter. A naive view says that therefore
> trained messages will score 0/100 "by definition". But the maths
> doesn't work like that, and nothing is going to make it.

You could train on a message over and over and over ... again, until the
score became arbitrarily close to 0 or 100.  It would probably ruin the
classifier for most other msgs, though.
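
A rough sketch of why repeated training hurts other messages, using only
the by-counting estimate and made-up counts (the function names and all
numbers are hypothetical).  A word that starts out strongly spammy is
dragged toward hammy just because one ham message containing it was
trained over and over.

```python
def by_counting_prob(hamcount, spamcount, nham, nspam):
    """Spamprob estimate from raw counts alone."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)


def after_retraining(k, hamcount=1, spamcount=50, nham=100, nspam=100):
    """Spamprob of a word after one ham message holding it is trained k more times.

    Each extra training pass adds 1 to the word's ham count and 1 to the
    total number of trained ham messages.
    """
    return by_counting_prob(hamcount + k, spamcount, nham + k, nspam)


# Word initially seen in 1 of 100 ham and 50 of 100 spam.
for k in (0, 10, 100, 1000):
    print(k, round(after_retraining(k), 3))
```

With these numbers the word's spamprob starts above 0.9 and ends below
0.5, so every other message containing it is now scored as if the word
were mild evidence of ham.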

> But I think it's a reasonable assumption that any messages which have
> been explicitly trained will no longer hit the "unsure" range. I just
> can't see a way of making even that assumption be true.

I have an FP that's an entire Nigerian scam msg, prefaced by a one-line
comment saying something like "Jeez, here's another Nigerian wire scam --
this has been around for 20 years".  Think about it <wink>.