[spambayes-dev] Speedup for full retrain when using DB dict
Tim Peters
tim.one at comcast.net
Thu Sep 4 22:09:20 EDT 2003
[Skip]
> My earlier message (which because of the mail load on mail.python.org
> you will probably get after this one)
Heh -- I still haven't seen that one.
> indicated that I had a patch which might speed up full retrains when
> using a shelve database. I'm happy to say it works well for me. The
> test I ran essentially executed
>
> rm hammie.db
> hammie.py -d -p hammie.db -g newham.clean -s newspam.clean
>
> between calls to the Unix date(1) program. The above two files
> contained a total of 15720 messages. The full retrain time dropped
> from about 33 minutes to about 20 minutes. The speedup comes from
> not writing to the shelve until the training is completed. The
> context diff is attached.
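The attached diff is not reproduced in this message, so the following is only a rough sketch of the idea of deferring shelve writes, not the actual patch. It uses the standard-library shelve module directly; the DeferredShelf class, its method names, and the train() helper are hypothetical and are not spambayes code.

```python
import shelve


class DeferredShelf:
    """Hypothetical wrapper: buffer updates in memory, write to the shelve once."""

    def __init__(self, path):
        self.db = shelve.open(path)
        self.changed = {}  # key -> newest value that has not been written yet

    def get(self, key, default=None):
        # Pending in-memory values take priority over what is on disk.
        if key in self.changed:
            return self.changed[key]
        return self.db.get(key, default)

    def put(self, key, value):
        # No disk access here; the write is postponed until flush().
        self.changed[key] = value

    def flush(self):
        # One pass over the buffered keys, then a single sync.
        for key, value in self.changed.items():
            self.db[key] = value
        self.changed.clear()
        self.db.sync()

    def close(self):
        self.flush()
        self.db.close()


def train(store, tokens):
    """Toy stand-in for training: count how often each token is seen."""
    for tok in tokens:
        store.put(tok, store.get(tok, 0) + 1)
```

With this arrangement each token's record is written at most once per retrain, however many messages mention it, at the cost of holding the changed records in memory until the flush.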
Wouldn't it be simpler to do the full retrain using a PickledClassifier
instance, then populate a DBDictClassifier from the result? That would also
skip the extra layers of code (and time) to maintain the changed_words dict
during the retrain.
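A minimal sketch of that alternative, again using only the standard-library shelve module rather than the real PickledClassifier and DBDictClassifier classes: train into a plain in-memory dict, then copy the result into a fresh shelve in one pass. The function names and the token-count "training" are assumptions made for illustration.

```python
import shelve


def train_in_memory(tokens):
    """Toy stand-in for a full retrain: build the counts in a plain dict."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts


def populate_shelf(path, counts):
    """Write the finished dict to a new, empty shelve in a single pass."""
    # flag="n" always creates a new, empty database, like the "rm hammie.db" step.
    with shelve.open(path, flag="n") as db:
        db.update(counts)
```

Because the dict is only copied once training has finished, nothing has to track which keys changed along the way.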