[spambayes-dev] SPAM In This List

Tim Peters tim.one at comcast.net
Sun May 16 02:10:15 EDT 2004


[Tim]
>> ... (although I have some non-default options set, and that may be
>> relevant).

[OTR Comm]
> If you don't mind, what are these 'non-default options?'

"""
[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True
mine_received_headers: True

[Classifier]
use_bigrams: True
"""
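
For anyone who wants to check such a snippet mechanically: it is ordinary ini-style text, so a minimal sketch like the following can read it. This uses Python's standard configparser module purely for illustration; it is an assumption that this is good enough for inspection, and it is not SpamBayes's own option-loading code.

```python
import configparser

# The option text from the message above, copied verbatim.
OPTIONS_TEXT = """\
[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True
mine_received_headers: True

[Classifier]
use_bigrams: True
"""

# configparser accepts "name: value" as well as "name = value".
parser = configparser.ConfigParser()
parser.read_string(OPTIONS_TEXT)

# Collect every (section, option) pair with its boolean value.
non_default = {
    (section, name): parser.getboolean(section, name)
    for section in parser.sections()
    for name in parser.options(section)
}

for (section, name), value in sorted(non_default.items()):
    print("%s.%s = %s" % (section, name, value))
```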

use_bigrams in particular has major effects, creating a much larger database
packed with many more hapaxes (tokens that appear only once).  The
classifier learns faster when it's enabled (less training is needed to get
to a comparable level of effectiveness).  OTOH, the database is much larger
than without it, and over time it's unclear whether it retains an
effectiveness advantage.  In large-scale train-on-everything tests quite
some time ago, leaving it off did just as well, and created a much smaller
database, so use_bigrams didn't have anything to recommend it for
high-volume applications on server-class machines.  The jury is still out on
whether the tradeoffs differ for personal classifiers.
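
To make the database-size and hapax effect concrete, here is a toy sketch. It is not the SpamBayes tokenizer or classifier: the "bi:" token prefix, the three sample messages, and the helper names are all made up for illustration. It only counts how many distinct tokens, and how many seen-only-once tokens, result when adjacent word pairs are added to the single words.

```python
from collections import Counter

def unigrams_and_bigrams(words):
    """Yield each word, plus each adjacent pair as one 'bi:' token."""
    prev = None
    for word in words:
        yield word
        if prev is not None:
            yield "bi:%s %s" % (prev, word)
        prev = word

def hapax_count(counts):
    """Number of tokens that were seen in exactly one message."""
    return sum(1 for n in counts.values() if n == 1)

# Hypothetical training messages, invented for this example.
messages = [
    "buy cheap pills now",
    "cheap pills shipped now",
    "meeting notes shipped today",
]

uni = Counter()   # single words only
both = Counter()  # single words plus bigrams
for msg in messages:
    words = msg.split()
    # Count each token at most once per message.
    uni.update(set(words))
    both.update(set(unigrams_and_bigrams(words)))

print("unigrams only:   %d tokens, %d hapaxes" % (len(uni), hapax_count(uni)))
print("with bigrams:    %d tokens, %d hapaxes" % (len(both), hapax_count(both)))
```

On these three messages the token count goes from 8 to 16 and the hapax count from 4 to 11. Real mail will give different ratios, but the direction is the same: most word pairs are rarer than the words they are built from.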




