[spambayes-dev] Strange performance dip and DBRunRecoveryError retreat
Tim Peters
tim.one at comcast.net
Sun Jan 4 18:24:04 EST 2004
[Richie Hindle]
> Sadly not. sb_server saves the db after every train as well, out of
> paranoia. The page should always say "Training... Saving... Done".
> If there's a way of training without saving, maybe that's the
> problem, but I don't believe there is...?
Sorry, I don't know -- there's a lot of code, and it's pasted together in
lots of creative ways.
I'll note one thing: somewhere along the line the classifier grew a funky
"_post_training()" method. The implementation in DBDictClassifier is:
    def _post_training(self):
        """This is called after training on a wordstream.  We ensure
        that the database is in a consistent state at this point by
        writing the state key.
        """
        self._write_state_key()
But, of course, that *doesn't* ensure the database is in a consistent state.
To the contrary, it all but guarantees that the disk file gets *out* of
sync, because the implementation of _write_state_key is just this:
    def _write_state_key(self):
        self.db[self.statekey] = (classifier.PICKLE_VERSION,
                                  self.nspam, self.nham)
So that's an obvious way to get the in-memory Berkeley internals out of sync
with what's on disk. After adding the line:
    self.db.sync()
to the end of _write_state_key(), your hammer.py (as checked in, with the
reopen-without-closing business) has run here w/o complaint for a lot longer
than it ran before adding the sync() (I typically got a DBRunRecoveryError
shortly after the first occurrence of "Re-opening." output before; I've had
a few dozen of those go by so far after the change).
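For concreteness, here is a minimal, self-contained sketch of the patched method. It is not the real DBDictClassifier: SketchClassifier, the PICKLE_VERSION value, the statekey string and the file name are all stand-ins invented for illustration, and the stdlib shelve module is used in place of the Berkeley hash database so the snippet runs anywhere. The only point it makes is the ordering: assign the state key, then sync() before returning.

```python
import os
import shelve
import tempfile

# Placeholder value; stands in for classifier.PICKLE_VERSION (assumption).
PICKLE_VERSION = 1


class SketchClassifier:
    """Hypothetical, stripped-down stand-in for DBDictClassifier."""

    # Assumed key name, used only for this illustration.
    statekey = "saved state"

    def __init__(self, path):
        # shelve is a stdlib substitute for the Berkeley-backed db here.
        self.db = shelve.open(path)
        self.nspam = 0
        self.nham = 0

    def _write_state_key(self):
        # The innocent-looking assignment is a database write ...
        self.db[self.statekey] = (PICKLE_VERSION, self.nspam, self.nham)
        # ... so ask the backend to push it to disk before returning.
        self.db.sync()

    def _post_training(self):
        """Called after training on a wordstream."""
        self._write_state_key()

    def close(self):
        self.db.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sketch_db")
    c = SketchClassifier(path)
    c.nspam, c.nham = 3, 7
    c._post_training()
    c.close()
```

Whether sync() alone is enough for every access pattern is a separate question; the sketch only shows where the call goes.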
So maybe that's relevant. It's too easy to look at
    db[key] = value
syntax and overlook that it's hiding a very dangerous operation (which is
another reason to avoid "convenience wrappers" -- code mutating a disk-based
database *shouldn't* be easy to read <0.7 wink>).