[Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
Tim Peters
tim.one@comcast.net
Wed Nov 13 07:18:45 2002
[Mark Hammond]
> I'm a little confused about these probabilities.
>
> Isn't it true that whenever we do a "train operation", we should
> also update the probabilities?
It's a tradeoff. The bigger the database, the longer update_probabilities()
takes. If the user is staring at a specific msg, and expects to see its
score change, then the probs *have* to be updated or the score won't change.
So that was a very clear reason to force updating here. I didn't know why
the probs weren't being updated anyway, so fixed the one thing that was
unarguably buggy.
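
To make the stale-score point concrete, here's a minimal sketch. It is not
the Spambayes code: ToyClassifier, its fields, and the crude spam-fraction
formula are all assumptions made up for illustration. The only thing it is
meant to show is that learn() changes the counts, while the score comes from
cached probabilities that move only when update_probabilities() runs.

```python
class ToyClassifier:
    """Hypothetical stand-in for a classifier with cached word probabilities."""

    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}
        self.probs = {}  # cached; refreshed only by update_probabilities()

    def learn(self, words, is_spam):
        # Training touches the raw counts only, so it stays cheap.
        counts = self.spam_counts if is_spam else self.ham_counts
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1

    def update_probabilities(self):
        # Walks every known word, so the cost grows with the database size.
        for w in set(self.spam_counts) | set(self.ham_counts):
            s = self.spam_counts.get(w, 0)
            h = self.ham_counts.get(w, 0)
            self.probs[w] = s / (s + h)  # toy formula, assumed for the sketch

    def score(self, words):
        # Averages the *cached* probabilities; unknown words count as 0.5.
        ps = [self.probs.get(w, 0.5) for w in set(words)]
        return sum(ps) / len(ps) if ps else 0.5


clf = ToyClassifier()
msg = ["cheap", "pills"]

before = clf.score(msg)        # nothing learned yet
clf.learn(msg, is_spam=True)
stale = clf.score(msg)         # counts changed, cached probs did not
clf.update_probabilities()
fresh = clf.score(msg)         # now the training shows up in the score
```

In this toy, stale equals before, and only fresh differs, which is what a
user staring at a just-trained msg would see if the probs weren't updated.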
> For a batch train, we only want to do it at the end, but for an
> individual, incremental train, I would have thought we still want the
> probabilities updated, even if we don't rescore the message. Otherwise
> future messages will not use the new probabilities.
That's so. I haven't worried about it, perhaps because I run on Win9x most
of the time so live with frequent reboots (i.e., I retrain from scratch
several times every day anyway, as incremental updates are lost when a
forced reboot occurs; that's not *this* code's fault, although I eventually
hope to get around to writing out the updated database whenever the probs
get updated).
> I ask because revision 1.14 did exactly this, and we regressed it.
That's odd -- the CVS log says mhammond did that <wink>.
> ...
> And it seems to me that a new param, specifically for update_probs, is
> less of a hack than tying it to the "rescore" param - we want the
> new probs used for the *next* incoming message even if we don't need
> it for *this* message.
It's still a tradeoff, though. Once a classifier has gotten any amount of
decent training, whether or not a new training msg gets reflected instantly
in the probs should make little difference to results.
If it's possible that update_probabilities() *never* gets called after
training and before shutdown now, then that's clearly a bug.
It's OK by me whatever you'd rather do here, and updating probs after
training, without fail, is certainly the least error-prone strategy.
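
For what it's worth, the options discussed above might look roughly like
this. It's a sketch under assumptions, not the actual train.py:
CountingClassifier is a hypothetical stub that only records how often the
expensive step runs, and train_message(), train_batch(), shutdown(), the
update_probs parameter name, and the dirty flag are all invented here for
illustration.

```python
class CountingClassifier:
    """Hypothetical stub: counts calls instead of doing real statistics."""

    def __init__(self):
        self.learned = 0
        self.updates = 0
        self.dirty = False  # True when counts changed since the last update

    def learn(self, words, is_spam):
        self.learned += 1
        self.dirty = True

    def update_probabilities(self):
        self.updates += 1
        self.dirty = False


def train_message(clf, words, is_spam, update_probs=True):
    # Incremental train: update by default, independent of any rescoring,
    # so the *next* incoming message sees the new probabilities.
    clf.learn(words, is_spam)
    if update_probs:
        clf.update_probabilities()


def train_batch(clf, messages):
    # Batch train: defer the expensive step and run it once at the end.
    for words, is_spam in messages:
        train_message(clf, words, is_spam, update_probs=False)
    clf.update_probabilities()


def shutdown(clf):
    # Safety net: never exit with trained-but-unapplied counts.
    if clf.dirty:
        clf.update_probabilities()


clf = CountingClassifier()
train_batch(clf, [(["a"], True), (["b"], False), (["c"], True)])
batch_updates = clf.updates          # one update for three messages
train_message(clf, ["d"], False)     # incremental: updates immediately
shutdown(clf)                        # nothing pending, so no extra update
```

Defaulting update_probs to True keeps the error-prone case (forgetting to
update) opt-in rather than opt-out; the batch path is the only caller that
turns it off.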