[spambayes-dev] Hapaxes? (was: How low can you go?)

Matthew Dixon Cowles matt at mondoinfo.com
Thu Dec 18 21:58:46 EST 2003


> Another newbie Q: were hapaxes not stored at one time?  Some of the
> recent discussion implies that a recent change (storing them?) has
> increased the DB size considerably.  Was that the only heuristic,
> or was it tokens seen less than N times...?

Hapaxes have always been stored. There have been various experiments
with removing them since they seem to make up about half of an
"average" database. It turns out that if you have a well-trained
database, you can remove hapaxes with little effect on scoring. The
problem comes if you're doing ongoing training. If you remove hapaxes
every day, a token that shows up only once a day is deleted before its
count can grow, so it never becomes the strong clue it would otherwise
be.
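
To make the ongoing-training problem concrete, here is a toy sketch. It
is not SpamBayes code: it assumes the database is nothing more than a
map from token to the number of training messages containing it, and
the names prune_hapaxes, train and simulate, as well as the tokens
"rare-token" and "common", are made up for illustration.

```python
from collections import Counter

def prune_hapaxes(counts):
    """Return a copy of counts without tokens seen exactly once."""
    return Counter({tok: n for tok, n in counts.items() if n > 1})

def train(counts, tokens):
    """Count each distinct token once per training message."""
    counts.update(set(tokens))

def simulate(days, prune_daily):
    """Train on two hypothetical messages per day, optionally pruning nightly."""
    counts = Counter()
    for _ in range(days):
        # "rare-token" appears in only one message per day.
        train(counts, ["rare-token", "common"])
        train(counts, ["common"])
        if prune_daily:
            counts = prune_hapaxes(counts)
    return counts

with_pruning = simulate(7, prune_daily=True)
without_pruning = simulate(7, prune_daily=False)
print("rare-token with daily pruning:", with_pruning["rare-token"])
print("rare-token without pruning:", without_pruning["rare-token"])
```

With daily pruning the once-a-day token is a hapax every evening and is
thrown away each time, so its count never gets past one. Without
pruning it accumulates one count per day.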

Regards,
Matt



