[Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17

Wed Nov 13 07:18:45 2002

[Mark Hammond]
> I'm a little confused about these probabilities.
>
> Isn't it true that whenever we do a "train operation", we should
> also update the probabilities?

It's a tradeoff.  The bigger the database, the longer update_probabilities()
takes.  If the user is staring at a specific msg, and expects to see its
score change, then the probs *have* to be updated or the score won't change.
So that was a very clear reason to force updating here.  I didn't know why
the probs weren't being updated anyway, so I fixed the one thing that was
unarguably buggy.
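
To make the tradeoff concrete, here is a minimal sketch using a toy classifier.  ToyClassifier, its fields, and its scoring rule are all made up for illustration (this is not the spambayes classifier); the only point is that learn() touches just the words of one msg, update_probabilities() walks the whole database, and a score computed in between is stale.

```python
class ToyClassifier:
    """Hypothetical stand-in for illustration; not the spambayes classifier."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.counts = {}   # word -> [spam_count, ham_count]
        self.probs = {}    # word -> cached spam probability

    def learn(self, words, is_spam):
        # Cheap: only the counts for this message's words change.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for w in set(words):
            c = self.counts.setdefault(w, [0, 0])
            c[0 if is_spam else 1] += 1

    def update_probabilities(self):
        # Expensive: walks every word in the database.
        for w, (s, h) in self.counts.items():
            spam_ratio = s / self.nspam if self.nspam else 0.0
            ham_ratio = h / self.nham if self.nham else 0.0
            total = spam_ratio + ham_ratio
            self.probs[w] = spam_ratio / total if total else 0.5

    def score(self, words):
        # Toy scoring rule: average of the cached per-word probabilities.
        ps = [self.probs.get(w, 0.5) for w in set(words)]
        return sum(ps) / len(ps) if ps else 0.5


clf = ToyClassifier()
clf.learn(["hello", "meeting"], is_spam=False)
clf.learn(["cheap", "pills"], is_spam=True)
clf.update_probabilities()

msg = ["cheap", "meeting"]
before = clf.score(msg)

clf.learn(msg, is_spam=True)      # train on the msg the user is staring at
stale = clf.score(msg)            # cached probs untouched, score unchanged

clf.update_probabilities()
fresh = clf.score(msg)            # now the training shows up in the score
```

Here stale equals before, and only fresh moves toward spam, which is why rescoring has to force the update.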

> For a batch train, we only want to do it at the end, but for an
> individual, incremental train, I would have thought we still want the
> probabilities updated, even if we don't rescore the message.  Otherwise
> future messages will not use the new probabilities.

That's so.  I haven't worried about it, perhaps because I run on Win9x most
of the time so live with frequent reboots (i.e., I retrain from scratch
several times every day anyway, as incremental updates are lost when a
forced reboot occurs; that's not *this* code's fault, although I eventually
hope to get around to writing out the updated database whenever the probs
get updated).

> I ask because revision 1.14 did exactly this, and we regressed it.

That's odd -- the CVS log says mhammond did that <wink>.

> ...
> And it seems to me that a new param, specifically for update_probs, is
> less of a hack than tying it to the "rescore" param - we want the
> new probs used for the *next* incoming message even if we don't need
> it for *this* message.

It's still a tradeoff, though.  Once a classifier has gotten any amount of
decent training, whether or not a new training msg gets reflected instantly
in the probs should make little difference to results.

If it's possible that update_probabilities() *never* gets called after
training and before shutdown now, then that's clearly a bug.

It's OK by me whatever you'd rather do here, and updating probs after
training, without fail, is certainly the least error-prone strategy.
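
For the record, a rough sketch of the "separate param" idea, under loud assumptions: train_message, train_batch, flush_if_stale, and the stub classifier are hypothetical names invented here, not what train.py actually contains.  Incremental training updates probs by default, batch training defers the update to the end, and a dirty flag guards against shutting down with stale probs.

```python
class StubClassifier:
    """Hypothetical stub that only records calls; not the real classifier."""

    def __init__(self):
        self.learned = 0
        self.updates = 0
        self.dirty = False   # True when counts changed since the last update

    def learn(self, words, is_spam):
        self.learned += 1
        self.dirty = True

    def update_probabilities(self):
        self.updates += 1
        self.dirty = False


def train_message(clf, words, is_spam, update_probs=True, rescore=False):
    """Train on one message; rescoring always needs fresh probs."""
    clf.learn(words, is_spam)
    if update_probs or rescore:
        clf.update_probabilities()


def train_batch(clf, messages):
    """Train on many messages, paying for the update only once."""
    for words, is_spam in messages:
        train_message(clf, words, is_spam, update_probs=False)
    clf.update_probabilities()


def flush_if_stale(clf):
    """Call before shutdown so training is never left unreflected."""
    if clf.dirty:
        clf.update_probabilities()


clf = StubClassifier()
train_batch(clf, [(["a"], True), (["b"], False), (["c"], True)])
updates_after_batch = clf.updates          # one update for three messages

train_message(clf, ["d"], is_spam=True)    # incremental: updates by default
updates_after_single = clf.updates

train_message(clf, ["e"], is_spam=False, update_probs=False)
dirty_before_flush = clf.dirty
flush_if_stale(clf)
dirty_after_flush = clf.dirty
```

Defaulting update_probs to True keeps the error-prone case (forgetting to update) opt-in rather than opt-out.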