[Spambayes] Using mxBeeBase as hammie DB

Tim Peters tim.one@comcast.net
Thu Oct 17 16:59:31 2002


[M.-A. Lemburg, on mxBeeBase]
> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).

Note that at least half the words in the database are almost certainly
unique (seen only once), and so of no actual use.  Pruning the database,
especially as it grows over time, is something that still needs work here.
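
To make the pruning idea concrete, here is a minimal sketch, not what
hammie actually does.  It assumes a hypothetical layout where the database
is a dict mapping each token to a (spamcount, hamcount) pair; the name
prune_singletons and the min_count threshold are made up for illustration.

```python
def prune_singletons(wordinfo, min_count=2):
    """Return a new dict keeping only tokens seen at least min_count times.

    wordinfo is assumed (hypothetically) to map token -> (spamcount, hamcount).
    """
    return {
        tok: (spam, ham)
        for tok, (spam, ham) in wordinfo.items()
        if spam + ham >= min_count
    }


# Tiny made-up database: two tokens were seen exactly once.
db = {
    "python": (1, 40),
    "viagra": (25, 0),
    "xqzzy123": (1, 0),
    "frobnitz": (0, 1),
}
pruned = prune_singletons(db)
```

A token dropped this way may turn up again later, so pruning by age as
well as by count is probably the harder part; this sketch ignores that.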

> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

For contrast, I run tests using a plain Python dict as the "database",
reading msgs stored one per file, and doing many (on the order of 1e5)
scorings per run.  On a slower 866MHz Pentium box with 256MB RAM, this
scores about 80 msgs/second, or about 12.5ms per msg (under 2.3 CVS Python,
which is zippier than 2.2.2).  Firing up the system once per msg is a real
expense; keeping it running in the background all the time is a real expense
of a different kind.
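
The arithmetic behind that, as a rough sketch: the 80 msgs/second figure is
from the run above, but the 150ms startup cost below is a hypothetical
number picked only to show how a fixed startup expense amortizes over a
batch.  It is not a measurement from either machine.

```python
def ms_per_msg(msgs_per_second):
    """Convert a throughput in msgs/second to milliseconds per msg."""
    return 1000.0 / msgs_per_second


def amortized_ms(startup_ms, score_ms, n_msgs):
    """Average cost per msg when one startup is shared by n_msgs scorings."""
    return startup_ms / n_msgs + score_ms


score_ms = ms_per_msg(80)  # 12.5 ms per msg

# Hypothetical 150ms startup: paid once per msg vs. once per 100 msgs.
one_at_a_time = amortized_ms(150.0, score_ms, 1)
batch_of_100 = amortized_ms(150.0, score_ms, 100)
```

With those assumed numbers, starting up once per msg costs 162.5ms per msg,
while a batch of 100 averages 14.0ms per msg.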