[Spambayes] train on error - to exhaustion?

Tim Peters tim at zope.com
Tue Dec 3 16:53:10 2002


[Greg Louis]
> ...
> Doesn't look as though pure training-on-error is particularly
> advantageous with the Robinson-Fisher (chi) calculation method.

Are you hashing tokens?  spambayes does not, CRM114 does.  Bill generates
about 16 hash codes per input token, and with just a million hash buckets,
collision rates zoom quickly if you train on everything.  The experiments
spambayes did with CRM114-like schemes were a disaster due to this -- we
continued to train on everything, with hashing but without any bounds on
bucket count, and the hash collisions quickly caused outrageously bad
classification mistakes.  Removing the hashing cured that, but then the
database size goes through the roof (when generating ~16 "exact strings" per
input token, and training on everything).
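For a rough sense of why collisions explode, here is a back-of-the-envelope sketch (my own illustrative numbers and formula, not measurements from the spambayes experiments): under uniform hashing of m codes into n buckets, a given code shares its bucket with no other code with probability about exp(-m/n), so the expected fraction of codes involved in a collision is roughly 1 - exp(-m/n).

```python
import math

BUCKETS = 1_000_000  # roughly the bucket count mentioned above

def expected_collision_fraction(n_codes, buckets=BUCKETS):
    # Under uniform hashing, a given code lands alone in its bucket with
    # probability (1 - 1/buckets)**(n_codes - 1) ~= exp(-n_codes/buckets),
    # so this approximates the fraction of codes that collide with at
    # least one other code.
    return 1.0 - math.exp(-(n_codes - 1) / buckets)

# Hypothetical corpus sizes, purely for illustration:
# train-on-everything: 5,000 messages x 300 tokens x 16 codes/token
everything = expected_collision_fraction(5_000 * 300 * 16)
# train-on-error: suppose only 200 messages ever get trained
on_error = expected_collision_fraction(200 * 300 * 16)
```

With numbers like these, train-on-everything drives the collision fraction to essentially 1, while train-on-error keeps it far lower -- consistent with the disaster described above.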

Training-on-error helps Bill because it slashes hash collisions, simply by
producing far fewer hash codes than training on everything does.
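The train-on-error loop itself is trivial; the point is just that only misclassified messages contribute tokens (and hence hash codes) to the database. A minimal sketch, with a toy stand-in classifier (the class and its methods are hypothetical, not spambayes or CRM114 code):

```python
class ToyClassifier:
    """Hypothetical stand-in: votes each token toward whichever label
    it has been seen with more often."""

    def __init__(self):
        self.counts = {}  # token -> (spam_count, ham_count)

    def classify(self, msg):
        score = 0
        for tok in msg.split():
            s, h = self.counts.get(tok, (0, 0))
            score += (s > h) - (h > s)
        return "spam" if score > 0 else "ham"

    def learn(self, msg, label):
        for tok in msg.split():
            s, h = self.counts.get(tok, (0, 0))
            self.counts[tok] = (s + (label == "spam"),
                                h + (label == "ham"))

def train_on_error(classifier, labeled_messages):
    # Only messages the current classifier gets wrong are trained on,
    # so the database sees far fewer tokens than with train-on-everything.
    errors = 0
    for msg, label in labeled_messages:
        if classifier.classify(msg) != label:
            classifier.learn(msg, label)
            errors += 1
    return errors
```

With train-on-everything, every token of every message would be stored; here only the error messages' tokens are, which is exactly what keeps the hash-code count (and collision rate) down in a CRM114-like scheme.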

Experiments in the default non-hashing spambayes unigram code found that
train-on-error hurt the unsure rate but not the false positive (FP) or false
negative (FN) rates.

> It may still be useful in maintaining the effectiveness of an established
> training base.

Possibly; we didn't do any experiments on that.
