[spambayes-dev] SPAM In This List
Tim Peters
tim.one at comcast.net
Sun May 16 02:10:15 EDT 2004
[Tim]
>> ... (although I have some non-default options set, and that may be
>> relevant).
[OTR Comm]
> If you don't mind, what are these 'non-default options?'
"""
[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True
mine_received_headers: True
[Classifier]
use_bigrams: True
"""
use_bigrams in particular has major effects, creating a much larger database
packed with many more hapaxes (tokens that appear only once). The
classifier learns faster when it's enabled (less training is needed to get
to a comparable level of effectiveness). OTOH, the database is much larger
than without it, and it's unclear whether it retains an effectiveness
advantage over time. In large-scale train-on-everything tests quite
some time ago, leaving it off did just as well, and created a much smaller
database, so use_bigrams didn't have anything to recommend it for
high-volume applications on server-class machines. The jury is still out on
whether the tradeoffs differ for personal classifiers.
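
To make the database-size point concrete, here is a toy sketch of why adding word pairs inflates the token count and the number of hapaxes. It is not the SpamBayes tokenizer or classifier: the helper names (unigrams, with_bigrams, hapax_count), the "bi:" prefix, and the three sample messages are all made up for illustration, and a hapax is assumed here to mean a token seen in exactly one training message.

```python
from collections import Counter


def unigrams(words):
    """Hypothetical baseline: one token per word."""
    return list(words)


def with_bigrams(words):
    """Hypothetical bigram mode: single words plus every adjacent pair."""
    pairs = [f"bi:{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def hapax_count(messages, tokenize):
    """Return (hapaxes, distinct tokens) over a list of messages.

    Assumption: a token is counted once per message, and a hapax is a
    token that shows up in exactly one message.
    """
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenize(msg.split())))
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes, len(counts)


# Made-up training messages, purely for illustration.
messages = [
    "buy cheap pills now",
    "cheap pills buy now",
    "meeting moved to now",
]

print("unigrams only:", hapax_count(messages, unigrams))
print("with bigrams: ", hapax_count(messages, with_bigrams))
```

On this toy input the word-only run gives 3 hapaxes out of 7 distinct tokens, while the run with pairs gives 10 hapaxes out of 15 distinct tokens. Real mail will behave differently in detail, but the direction matches what's described above: more tokens overall, and a larger share of them seen only once.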