[spambayes-dev] Another incremental training idea...

Seth Goodman nobody at spamcop.net
Wed Jan 14 14:05:41 EST 2004


[T. Alex Popiel]
> This code trains immediately on the non-edge stuff, and expires
> at the end of each day.  It does not choose the messages to train
> for the day after expiring, as you suggest.  Your suggestion is
> interesting, though it would be a bit expensive to do (doubling
> the number of classifications done).

Thanks, Alex.  That's just what I think of the idea: interesting, but who
knows.  As for the expense, it does mean two classifications for every
message.  In practice, though, assuming you train once per day after making
sure all messages are correctly classified, it only has to re-classify one
day's messages.  With the Outlook plug-in, that takes only a few seconds on
my system for a couple of hundred messages.  I have no idea whether the
Python scripts using the standard message data structure are slower.

Here are a few more potentially dumb questions.  Does your script work
directly with incremental.py from CVS, or do you use a modified version?  To
implement the expire/reclassify/train regime, would I modify just
incremental.py, or are these functions spread across other modules?  I
would like to play with this on my own saved mail corpus (it only goes back
to September and has just 10K messages, but it grows daily) and get my feet
wet with some cv runs.  As always, thanks for your indulgence.
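
For concreteness, the expire/reclassify/train cycle discussed above could be
sketched roughly as follows.  This is only an illustration under stated
assumptions: ToyClassifier, end_of_day, the 120-day window and the 0.2/0.9
cutoffs are all hypothetical names and values, not anything from
incremental.py or the SpamBayes classifier, and the rule for picking which
messages to train on is a placeholder.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable filter (NOT the SpamBayes API)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # token -> message count
        self.nmsgs = {True: 0, False: 0}                   # trained messages per class

    def learn(self, tokens, is_spam):
        self.counts[is_spam].update(set(tokens))
        self.nmsgs[is_spam] += 1

    def unlearn(self, tokens, is_spam):
        self.counts[is_spam].subtract(set(tokens))
        self.nmsgs[is_spam] -= 1

    def score(self, tokens):
        """Return a spam score in [0, 1] from smoothed per-token log odds."""
        log_odds = 0.0
        for tok in set(tokens):
            p_spam = (self.counts[True][tok] + 1) / (self.nmsgs[True] + 2)
            p_ham = (self.counts[False][tok] + 1) / (self.nmsgs[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-log_odds))


def end_of_day(clf, trained, todays, day, max_age_days=120,
               ham_cutoff=0.2, spam_cutoff=0.9):
    """Run one daily cycle; return how many of today's messages were trained.

    trained: list of (day, tokens, is_spam) records currently in the database.
    todays:  list of (tokens, is_spam) pairs with verified labels.
    """
    # 1. Expire: untrain everything older than the window.
    for rec in [r for r in trained if day - r[0] >= max_age_days]:
        clf.unlearn(rec[1], rec[2])
        trained.remove(rec)

    # 2. Reclassify today's messages against the expired database.
    #    This is the second classification each message receives.
    scored = [(tokens, is_spam, clf.score(tokens)) for tokens, is_spam in todays]

    # 3. Train only on messages not scored confidently on the correct side
    #    (placeholder selection rule, assumed for illustration).
    n_trained = 0
    for tokens, is_spam, s in scored:
        confident = s >= spam_cutoff if is_spam else s <= ham_cutoff
        if not confident:
            clf.learn(tokens, is_spam)
            trained.append((day, tokens, is_spam))
            n_trained += 1
    return n_trained
```

The extra cost is confined to step 2, which touches only the current day's
messages, so it scales with daily volume rather than with the size of the
whole corpus.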

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



