[spambayes-dev] Using crm114-style hash files

Dobes Vandermeer dobes at dobesland.com
Tue Jul 15 12:24:11 EDT 2003


Hearing all this talk of 27MB databases makes we want to suggest (sorry,
I'm not a python hacker - yet - or I'd submit a file instead) trying
crm114-style hash files.

They are very simple:

Instead of storing precise token counts per word, you compute the hash
of each token modulus the number of counters the hash file (a fixed size
of 1 million counts, which is 1 MB if you are using one byte counters,
is probably plenty) and treat the counter at that file position as the
counter for your token.  He has two of these files - one counts the
number of times a token has appeared in spam, another counting the
number of times a token apeared as ham.

At first glance, I thought... what about collisions?

In practice it seems to work anyway.  First of all, there aren't really
that many valuable tokens; so the majority of collisions will be for
words that you don't care about.  Secondly, if a popular hammy token
collides with a popular spammy one, they'll just cancel each other out,
which may slightly reduce the effectiveness but isn't really fatal.  Of
course, if a popular token collides with a less popular one you might be
in trouble; crm114 compensates for this possibility by generating a LOT
of tokens (it actually hashes a 5-token phrase 16 times with each
combination of possible word wildcards) so these errors are swamped out.

So essentially you're just using token hashes instead of actual tokens
as your key, and taking advantage of the special property of those
hashes (e.g. they all fit inside a fixed-size flat file) to optimize
database access.

Since this would be a very small module to implement, you could bundle
it with spambayes and use it as your final fallback instead of dumbdbm.

Advantages:
- Easy to implement
- Fixed-size database

Disadvantages:
- Less precise; you don't know for sure whether the token is as
significant as it seems
- Can't export to another DB format

CU
Dobes



More information about the spambayes-dev mailing list