[spambayes-dev] Deprecated options

Ryan Malayter rmalayter at bai.org
Thu Aug 5 18:56:24 CEST 2004


[Tim Peters]
> You're incapable of making a bad decision here, so I've stayed silent
> <wink>.  Bigrams remain an interesting option, so I don't expect the
> code to go away.  The database size can be pretty amazing, though! 
> Using bigrams and a giant pickled dict, my Outlook routinely consumes
> over 120MB of RAM now.  Fine by me -- I've got plenty of RAM.  But it
> sure makes False the right default.

CRM-114 uses 5-grams or even longer n-grams, but ultimately represents
each n-gram string with a short hash. This (intentionally?) short hash
(effectively 20 bits, from what I've read) produces a lot of
collisions, which keeps the classifier DB small. Classification
performance doesn't seem to suffer much at all from these collisions.
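The idea, as I understand it, is just to fold each n-gram into a small
fixed bucket space and live with the collisions. Something like this in
Python (the hash function, names, and bit count here are my own
guesses, not what CRM-114 actually does):

    import hashlib

    HASH_BITS = 20                # the bucket-space size I've read about
    BUCKETS = 1 << HASH_BITS      # ~1 million possible "tokens"

    def ngram_bucket(tokens):
        """Collapse an n-gram into a small integer bucket; distinct
        n-grams that collide simply share their ham/spam counts."""
        digest = hashlib.md5(" ".join(tokens).encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % BUCKETS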

Should this approach be considered for n-gram support in SpamBayes? I
would love to try the "engine" of CRM-114's SBPH classifier inside
SpamBayes' comparatively pretty and easy-to-use skin. I think the
maximum DB size with this approach (5 bytes for the hex-encoded hash
"token", plus 4 bytes each for the ham and spam counts, times 2^20
possible hashes) would be something like 13.5 MB. Perhaps there's more
overhead (record terminators, whatever) in the DB format than I
realize, but the DB could still be kept fairly small. Heck, the size of
the hash could be made configurable as well, giving people the option
to use whatever length (and resulting DB size) they're comfortable
with.
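The back-of-the-envelope arithmetic behind that figure, assuming one
flat record per possible hash value (the worst case, before any
per-record overhead):

    HASH_BITS = 20
    RECORD_BYTES = 5 + 4 + 4   # hex hash "token" + ham count + spam count

    worst_case = (1 << HASH_BITS) * RECORD_BYTES
    print(worst_case)          # 13631488 bytes -- a bit over 13 MB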

I know Bill Y. (CRM-114's creator) used to participate here; perhaps he
could offer some ideas. To me, using SBPH to generate tokens for
SpamBayes seems like it would be fairly straightforward, and the rest
of SpamBayes would stay mostly the same.
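For concreteness, here's roughly how I picture the SBPH feature
generation (the window size, skip marker, and function name are my
guesses, not CRM-114's actual code):

    from itertools import product

    WINDOW = 5          # assumed window size
    SKIP = "<skip>"     # placeholder for omitted positions

    def sbph_features(tokens):
        """For each position, emit every combination of the preceding
        WINDOW-1 tokens plus the newest token (16 features per position
        with a full window of 5), with omitted slots marked."""
        for i in range(len(tokens)):
            window = tokens[max(0, i - WINDOW + 1):i + 1]
            older, newest = window[:-1], window[-1]
            for mask in product((True, False), repeat=len(older)):
                feature = [tok if keep else SKIP
                           for tok, keep in zip(older, mask)]
                feature.append(newest)
                yield " ".join(feature)

Each of those feature strings could then be hashed down to a short
bucket as above and fed to the existing classifier in place of today's
unigram tokens.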

If I ever find the time, I will give it a shot myself.

Regards,
	Ryan



