[spambayes-dev] Speedup for full retrain when using DB dict

Thu Sep 4 22:09:20 EDT 2003

[Skip]
> My earlier message (which because of the mail load on mail.python.org
> you will probably get after this one)

Heh -- I still haven't seen that one.

> indicated that I had a patch which might speed up full retrains when
> using a shelve database.  I'm happy to say it works well for me.  The
> test I ran essentially executed
>
>     rm hammie.db
>     hammie.py -d -p hammie.db -g newham.clean -s newspam.clean
>
> between calls to the Unix date(1) program.  The above two files
> contained a total of 15720 messages.  The full retrain time dropped
> from about 33 minutes to about 20 minutes.  The speedup comes from
> not writing to the shelve until the training is completed.  The
> context diff is attached.

Wouldn't it be simpler to do the full retrain using a PickledClassifier
instance, then populate a DBDictClassifier from the result?  That would also
skip the extra layers of code (and time) to maintain the changed_words dict
during the retrain.
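
A rough sketch of that idea, using only the standard library's shelve
module and a plain Counter as stand-ins.  These are not the actual
PickledClassifier or DBDictClassifier classes, and train_in_memory() and
populate_shelf() are made-up names for illustration.  The point is only
that all counting happens in memory and the database is written once, at
the end:

```python
import os
import shelve
import tempfile
from collections import Counter


def train_in_memory(messages):
    """Count, per token, how many messages contain it.

    Hypothetical stand-in for training an in-memory classifier;
    nothing is written to disk here.
    """
    counts = Counter()
    for text in messages:
        # Each token is counted at most once per message.
        counts.update(set(text.lower().split()))
    return counts


def populate_shelf(path, counts):
    """Copy the finished counts into a fresh shelve database in one pass."""
    with shelve.open(path, flag="n") as db:  # "n": always create a new, empty db
        for token, n in counts.items():
            db[token] = n


# Toy data standing in for the ham/spam mailboxes.
messages = ["cheap pills now", "meeting notes attached", "cheap meeting"]

counts = train_in_memory(messages)

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
populate_shelf(db_path, counts)

# Read everything back to check that the shelf matches the in-memory result.
with shelve.open(db_path, flag="r") as db:
    stored = dict(db)
```

With this shape there is no need to track which words changed during
training, since the shelf is never touched until the counts are final.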