[Spambayes] training WAS: aging information

Tim Peters tim.one at comcast.net
Wed Feb 19 20:01:29 EST 2003


[D. R. Evans]
> ...
> I saw a comment in the LJ article that one should train on roughly
> equal numbers of spam and ham. Is this actually true? (This question of
> course merely demonstrates that I'm too lazy to do the maths myself.)

I think so, but not enough testing has been done to establish it with high
confidence.

A thought experiment may help:  suppose you don't know any French or
Russian, but get a job requiring you to identify which is which from
transcripts of conversations.  Say you've been trained on 100 French
transcripts and 1 Russian transcript.  90 of 100 French transcripts
contained the phrase "bon mot".  The single Russian transcript you saw did
not.

First day on the job, the first transcript you see does contain "bon mot".
Is it French or Russian?  A fact that's hard to account for is that you know
much less about Russian than about French at this point.  By default,
spambayes gives full credit (a very high francoprob) to "bon mot" based on
what you do know about French, and doesn't penalize it (lower the
francoprob) to account for the fact that you know so much less about
Russian.  As a result, the transcript will almost certainly be judged
French.
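
For the curious, here's a rough sketch of the arithmetic, transposed into
the thought experiment (simplified from spambayes' classifier.py; the
function name is made up, and S and x stand in for the unknown_word_strength
and unknown_word_prob options, which default to 0.45 and 0.5):

    def francoprob(fr_count, n_french, ru_count, n_russian, S=0.45, x=0.5):
        # Fraction of transcripts in each language containing the word.
        fr_ratio = fr_count / float(n_french)
        ru_ratio = ru_count / float(n_russian)
        # Raw probability that a transcript with this word is French.
        prob = fr_ratio / (fr_ratio + ru_ratio)
        # Pull toward the prior x, weighted by how much evidence
        # (n occurrences) has actually been seen.
        n = fr_count + ru_count
        return (S * x + n * prob) / (S + n)

    # "bon mot": in 90 of 100 French transcripts, 0 of 1 Russian.
    print(francoprob(90, 100, 0, 1))   # -> ~0.998

So "bon mot" lands around 0.998 -- about as French as a word can get.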

But spam contains ham words routinely, and vice versa, and, indeed, a number
of French phrases have become part of the international vocabulary --
there's just no better way to say mot juste <wink>.

With

experimental_ham_spam_imbalance_adjustment: True

spambayes takes the French evidence and discounts it, to give words
francoprobs *as if* you had seen no more French transcripts than Russian
ones.  In the example, "bon mot" will get a mild francoprob instead of a
very strong one, because the system can't claim to be sure of anything based
on one training example of each.
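
In terms of the earlier sketch, the adjustment scales down the evidence
counts from whichever corpus you've trained more on, by the ratio of the
training counts (again a simplification of what classifier.py does):

    def francoprob_adjusted(fr_count, n_french, ru_count, n_russian,
                            S=0.45, x=0.5):
        fr_ratio = fr_count / float(n_french)
        ru_ratio = ru_count / float(n_russian)
        prob = fr_ratio / (fr_ratio + ru_ratio)
        # Discount counts from the overrepresented corpus; the min()
        # means the smaller corpus's counts are never inflated.
        ru2fr = min(n_russian / float(n_french), 1.0)
        fr2ru = min(n_french / float(n_russian), 1.0)
        n = fr_count * ru2fr + ru_count * fr2ru
        return (S * x + n * prob) / (S + n)

    print(francoprob_adjusted(90, 100, 0, 1))   # -> ~0.83, much milder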

There are downsides to both settings in practice.  Mark mentioned that he
tends to keep training spam, and that's a predictable outcome of setting
this option to True once spam outnumbers ham:  additional training on spam
doesn't do a heck of a lot to boost spamprobs then, because the imbalance
adjustment knocks them down again almost as much as non-adjusted training
would boost them.  (So, Mark, if you're listening, try training on a pile
of ham instead next time:  that will, perhaps paradoxically, raise the
spamprobs on spam words.)
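
The same sketch, recast in ham/spam terms, shows why:  for a word that
appears in every spam, the discounted evidence count works out to
nspam * (nham/nspam) = nham, no matter how much more spam you train on --
only more ham budges it (still a simplification, not spambayes verbatim):

    def spamprob(spam_count, nspam, ham_count, nham, S=0.45, x=0.5):
        sr = spam_count / float(nspam)
        hr = ham_count / float(nham)
        prob = sr / (sr + hr)
        # The discounted evidence count, as in the sketch above.
        n = (spam_count * min(nham / float(nspam), 1.0) +
             ham_count * min(nspam / float(nham), 1.0))
        return (S * x + n * prob) / (S + n)

    # A word appearing in every spam and no ham:
    print(spamprob(200, 200, 0, 100))   # nspam=200, nham=100: ~0.9978
    print(spamprob(400, 400, 0, 100))   # double the spam:     ~0.9978 still
    print(spamprob(400, 400, 0, 200))   # double the ham:      ~0.9989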

OTOH, if this adjustment isn't made, the corpus (ham or spam) with the
higher training count gets words with probabilities closer to its endpoint
(0.0 or 1.0) than the other corpus *can* reach.  That can give the
accidental appearance of a strong-flavored word in a weak-flavored msg
more power than the weak words can overcome.
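
Concretely, with the same simplified formula but no discount (a sketch,
not the exact spambayes code):

    def spamprob_raw(spam_count, nspam, ham_count, nham, S=0.45, x=0.5):
        sr = spam_count / float(nspam)
        hr = ham_count / float(nham)
        prob = sr / (sr + hr)
        n = spam_count + ham_count     # raw evidence count, no discount
        return (S * x + n * prob) / (S + n)

    # Trained on 1000 ham and 100 spam:
    print(spamprob_raw(100, 100, 0, 1000))   # word in every spam: ~0.998
    print(spamprob_raw(0, 100, 1000, 1000))  # word in every ham:  ~0.0002

The pure ham word ends up within about 0.0002 of its endpoint (0.0); the
pure spam word can't get closer than about 0.002 to its endpoint (1.0).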

In an uncharitable mood, you can think of it as getting screwed either
way -- but if you've told any system a lot more about one kind of msg than
the other, relatively speaking it *has* to "guess" a lot more about the kind
of msg you've withheld.  Remember that it can't infer patterns or meanings
either -- it's just staring at isolated words.



