[spambayes-dev] Strange performance dip and DBRunRecoveryError retreat

Tim Peters tim.one at comcast.net
Sun Jan 4 22:30:12 EST 2004


[Tony Meyer]
> <http://sourceforge.net/tracker/index.php?func=detail&group_id=61702&atid=498103&aid=797890>
>
> Basically, Richie added it to help prevent the ham/spam count going
> to 0 when training was interrupted.

Ya, except that's crazy <wink>.  If we're trying to keep a database that's
going to remain in a self-consistent state across "unexpected" stoppages,
then we have to use an explicit transaction model.  The database entries
aren't independent, and we need exactly what transactions provide:  "commit
all of the changes in this batch of related mutations in one shot, or commit
none of them".  Otherwise multiple entries in the database can become
mutually inconsistent.
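
For concreteness, here's a minimal sketch (not SpamBayes code) of the
all-or-nothing idea, using the transactional Berkeley DB API as exposed
by the bsddb3 package; the environment directory, key names and record
format are made up for illustration:

    from bsddb3 import db

    env = db.DBEnv()
    env.open('dbhome', db.DB_CREATE | db.DB_INIT_TXN | db.DB_INIT_MPOOL |
             db.DB_INIT_LOCK | db.DB_INIT_LOG)

    d = db.DB(env)
    d.open('bayes.db', dbtype=db.DB_HASH,
           flags=db.DB_CREATE | db.DB_AUTO_COMMIT)

    txn = env.txn_begin()
    try:
        # All of these related mutations land together, or none of them do.
        d.put(b'nspam', b'1001', txn=txn)
        d.put(b'token:cheap', b'57 3', txn=txn)  # made-up record format
        txn.commit()
    except:
        txn.abort()
        raise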

A giant pickled dict gets that result trivially, by rewriting the entire
database in one gulp.  Berkeley supplies a transactional API, but we're not
using it.  In its absence, I don't see a safe way to proceed except to
sync() frequently, *and* hope that the convenience wrappers don't do
opportunistic database syncs under the covers whenever-the-heck they feel
like it -- it's impossible for a wrapper to guess when we've made all the
mutations necessary to restore the database's contained data to a
self-consistent state.
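
For contrast, here's a sketch of the "one gulp" pickle approach, assuming
the classifier state is an ordinary dict; writing to a temp file and
renaming it over the old file means a crash mid-write leaves the previous,
self-consistent copy intact (the temp-and-rename step is my illustration,
not a claim about what the current code does):

    import os, pickle, tempfile

    def save_state(state, path):
        # Rewrite the whole database in one shot; the rename is atomic,
        # so readers see either the old file or the new one, never a mix.
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname)
        try:
            with os.fdopen(fd, 'wb') as f:
                pickle.dump(state, f, pickle.HIGHEST_PROTOCOL)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        except:
            os.unlink(tmp)
            raise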

> ...
> I'm not certain about saving after every message, though.

Using a transactional API explicitly allows the "saving" granularity to be
at any frequency we choose; until a transaction is explicitly committed,
it's guaranteed that none of the changes *provisionally* made will be
reflected in the disk file.  If, e.g., you choose to commit after every
thousand messages trained, then it's possible to lose the training for up
to the last 999 messages you trained on, but the database will still hold
the self-consistent data it had as of the last committed batch.
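
As a sketch of that granularity knob, assuming the transactional handles
from the earlier snippet and a hypothetical train_one() helper that writes
its token-count updates under the transaction it's given:

    COMMIT_EVERY = 1000

    txn = env.txn_begin()
    for n, msg in enumerate(messages, 1):
        train_one(d, msg, txn)       # hypothetical helper; mutates under txn
        if n % COMMIT_EVERY == 0:
            txn.commit()             # durable, self-consistent checkpoint
            txn = env.txn_begin()
    txn.commit()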

We're also probably in trouble keeping more than one physical database
around, hoping they remain consistent with each other.

> I have the feeling that Mark won't like this either (after all, he
> recently changed the plug-in code so that it *didn't* save after
> every message trained).

Then it's also living dangerously to the extent that it does.

> If we don't sync after every message, we probably shouldn't
> write the new state key, though.  I think we could remove the
> _post_training() call without harm (the bug report had the guy using
> dumbdbm, which isn't possible anymore, and if we call store() often
> enough then the state key will be written anyway (along with a sync)).

Yes, the _post_training() hook should go regardless.



