Graham's spam filter

Fri Aug 23 05:01:35 EDT 2002

Oren Tirosh <oren-py-l at hishome.net> writes:
> My original proposal was to mmap a hash table into memory. Let's assume 
> that the hash file looks like this:
> 
> hash table size = N
> N*hash table entry - each entry contains 32 bit hash and file offset
> M<N variable size word entries:
>    full word
>    counts, probabilities, expiration info, etc

I think the mmaped table only needs the probability for each word.  An
8 bit probability might be enough and 16 bits is certainly enough.
I'd first try an 8 bit probability with a few reserved values.  Don't
store the word or hash values in the table.  If there's 100,000
distinct words in the corpus, make say a 200kbyte table.  Compute the
hash for each word, e.g. hash(word) = CRC(word) % 200000.  Sort all
the hashes and find collisions.  For words with no collision, set
table(hash(word)) = probability(word), where probability(word) is a
number from 0 to 250, giving the probability to the nearest 0.4%.  If
there's a collision, use a special code like 251, saying to look the
hashcode up in a smaller, secondary table (or use some other collision
resolution method).

The above scheme keeps the mmap table pretty small, reducing the
amount of paging needed.