[Spambayes] Database reduction

Tim Peters tim.one@comcast.net
Thu Oct 31 02:18:08 2002


There's a semi-standard trick for database size reduction I haven't persued
and don't intend to pursue.  Those keenly interested in reducing database
size may wish to pursue it.

Currently, the classifier's wordinfo dict is indexed by strings S.  There's
no bound on how many unique strings may appear, and so also no bound on how
large the database may grow.

A cheesy but probably-effective trick is to pick an integer N for all time,
and index a wordinfo structure by hash(S) % N instead of by S.

Since strings are no longer stored:

+ Good:  Space for storing strings isn't needed.
+ Bad:  You can't get words out again (for, e.g., clue lists).

Since hash(S) is a many-to-one mapping:

+ Bad:  Words get combined more-than-less randomly.

Since mod N is also a many-to-one mapping:

+ Bad:  As above.
+ Good:  N is a solid upper bound on the maximum number of wordinfo records,
and so you can know exactly how big a classifier can get.
+ Good:  You could drop the dict and use hash(S)%N to index a contiguous
structure directly, like an mmap'ed file (http://crm114.sf.net/ uses that
specific trick after multiple layers of hashing, into distinct files of
one-byte clamped ham and spam counts).

Since database size would be bounded:

+ Good:  There's less obvious need to prune the database over time (a main
point of pruning is to reclaim space for words that aren't being used
anymore -- or ever).
+ Bad:  If the database is never pruned, it will adapt slower to changes in
the nature of ham and spam.

I suppose the scariest thing is combining words "at random".  It's possible,
e.g., that Python would get mapped to the same record as Viagra.  And the
smaller N is, the most certain "bad stuff like that" *will* happen.  We
won't know until someone tries it and measures results; my intuition <wink>
is that unless you get silly with N, it won't hurt much, as most words are
approximately worthless anyway.  Think about what happens when N=1 for the
other side of this coin.