Graham's spam filter
Paul Rubin
phr-n2002b at NOSPAMnightsong.com
Fri Aug 23 05:01:35 EDT 2002
Oren Tirosh <oren-py-l at hishome.net> writes:
> My original proposal was to mmap a hash table into memory. Let's assume
> that the hash file looks like this:
>
> hash table size = N
> N*hash table entry - each entry contains 32 bit hash and file offset
> M<N variable size word entries:
> full word
> counts, probabilities, expiration info, etc
I think the mmaped table only needs the probability for each word. An
8 bit probability might be enough and 16 bits is certainly enough.
I'd first try an 8 bit probability with a few reserved values. Don't
store the word or hash values in the table. If there are 100,000
distinct words in the corpus, make, say, a 200-kbyte table. Compute the
hash for each word, e.g. hash(word) = CRC(word) % 200000. Sort all
the hashes and find collisions. For words with no collision, set
table(hash(word)) = probability(word), where probability(word) is a
number from 0 to 250, giving the probability to the nearest 0.4%. If
there's a collision, use a special code like 251, saying to look the
hashcode up in a smaller, secondary table (or use some other collision
resolution method).
The above scheme keeps the mmap table pretty small, reducing the
amount of paging needed.
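The scheme above can be sketched in a few lines of Python. This is a
minimal illustration, not the poster's actual code: it assumes zlib.crc32
as the CRC, a plain dict as the secondary collision table, and 251 as the
reserved collision code.

```python
# Sketch of the byte-per-slot probability table described above.
# Assumptions (not from the original post): zlib.crc32 as the CRC,
# a Python dict as the secondary table, 251 as the collision sentinel.
import zlib

TABLE_SIZE = 200_000   # ~2x the number of distinct words
COLLISION = 251        # reserved code: look the word up in the secondary table

def slot(word):
    # hash(word) = CRC(word) % table size
    return zlib.crc32(word.encode()) % TABLE_SIZE

def build_table(probabilities):
    """probabilities: dict mapping word -> spam probability in [0.0, 1.0]."""
    table = bytearray(TABLE_SIZE)  # one byte per slot; 0 means "no entry"
    secondary = {}
    # group words by slot to find collisions
    owners = {}
    for word in probabilities:
        owners.setdefault(slot(word), []).append(word)
    for s, words in owners.items():
        if len(words) == 1:
            # quantize to the nearest 0.4%: an integer 0..250
            table[s] = round(probabilities[words[0]] * 250)
        else:
            # collision: mark the slot and resolve via the secondary table
            table[s] = COLLISION
            for w in words:
                secondary[w] = round(probabilities[w] * 250)
    return bytes(table), secondary

def lookup(word, table, secondary):
    v = table[slot(word)]
    if v == COLLISION:
        v = secondary.get(word, 125)  # unknown word: neutral 0.5
    return v / 250.0
```

In a real filter the `bytes` table would be written to a file and
mmapped read-only, so all scoring processes share one copy of the pages.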