[spambayes-dev] Strange performance dip and DBRunRecoveryError retreat

Sun Jan 4 18:24:04 EST 2004

[Richie Hindle]
> Sadly not.  sb_server saves the db after every train as well, out of
> paranoia.  The page should always say "Training... Saving... Done".
> If there's a way of training without saving, maybe that's the
> problem, but I don't believe there is...?

Sorry, I don't know -- there's a lot of code, and it's pasted together in
lots of creative ways.

I'll note one thing:  somewhere along the line the classifier grew a funky
"_post_training()" method.  The implementation in DBDictClassifier is:

    def _post_training(self):
        """This is called after training on a wordstream.  We ensure
           that the database is in a consistent state at this point by
           writing the state key.
         """
        self._write_state_key()

But, of course, that *doesn't* ensure the database is in a consistent state.
To the contrary, it all but guarantees that the disk file gets *out* of
sync, because the implementation of _write_state_key is just this:

    def _write_state_key(self):
        self.db[self.statekey] = (classifier.PICKLE_VERSION,
                                  self.nspam, self.nham)

So that's an obvious way to get the in-memory Berkeley internals out of sync
with what's on disk.   After adding the line:

        self.db.sync()

to the end of _write_state_key(), your hammer.py (as checked in, with the
reopen-without-closing business) has run here w/o complaint for a lot longer
than it ran before adding the sync() (I typically got a DBRunRecoveryError
shortly after the first occurrence of "Re-opening." output before; I've had
a few dozen of those go by so far after the change).
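
For concreteness, here's a minimal sketch of where the sync() goes.  It is
not the real DBDictClassifier:  the class name StateKeyStore, the statekey
string, the PICKLE_VERSION value and the demo() helper are all made up for
illustration, and the standard shelve module stands in for the Berkeley
wrapper.  It won't reproduce a DBRunRecoveryError; it only shows the state
key being flushed right after it's written, and read back after a reopen.

```python
import os
import shelve
import tempfile

# Placeholder value; the real one lives in classifier.PICKLE_VERSION.
PICKLE_VERSION = 1


class StateKeyStore:
    """Hypothetical stand-in for the state-key handling in DBDictClassifier."""

    statekey = "saved state"  # assumed key name, for illustration only

    def __init__(self, path):
        # shelve is only a stand-in for the Berkeley-backed database.
        self.db = shelve.open(path)
        self.nspam = 0
        self.nham = 0

    def _write_state_key(self):
        self.db[self.statekey] = (PICKLE_VERSION, self.nspam, self.nham)
        # Flush right away so the disk file matches what was just written.
        self.db.sync()

    def _post_training(self):
        # Called after training on a wordstream.
        self._write_state_key()

    def close(self):
        self.db.close()


def demo():
    """Write the state key, close, reopen, and return what is on disk."""
    path = os.path.join(tempfile.mkdtemp(), "hammer_sketch")
    store = StateKeyStore(path)
    store.nspam, store.nham = 3, 7
    store._post_training()
    store.close()
    with shelve.open(path) as db:
        return db[StateKeyStore.statekey]
```

Calling demo() returns the tuple that was stored under the state key, read
back from a fresh open of the same file.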

So maybe that's relevant.  It's too easy to look at

    db[key] = value

syntax and overlook that it's hiding a very dangerous operation (which is
another reason to avoid "convenience wrappers" -- code mutating a disk-based
database *shouldn't* be easy to read <0.7 wink>).