[Spambayes] Hammiefilter doesn't write out the pickle

Neale Pickett neale@woozle.org
Tue Nov 19 17:13:51 2002


So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> I think we've got some real potential for a great little api here.  I
> do have some questions about the data storage.  We've agreed that an
> explicit store is the way we want to go, which I think is correct.
> However, dbm really doesn't support this.  I fooled with a couple
> ideas (hacks) to make DBDict behave in a load/store fashion, and the
> best thing I can come up with is to actually make a working copy of
> the dbm file, which is then used for the session.  When store() is
> called, the original is replaced with the working copy.  There are
> some difficulties with this approach.  If store is never called, then
> there is no guaranteed way to clean up the working copy.  Replacing
> the original with the working copy may be a bit difficult, because dbm
> doesn't support a close method...

Yeah.  I ran into the same problem yesterday.  As I thought about it, I
realized this must have been why I implemented the __del__ method of
DBDict.

The problem, really, with DBDict is that there is this meta-information
it has to store (nham, nspam).  If individual db entries are updated but
the meta-info isn't, your database is corrupt, game over.  That problem
manifests itself in two ways:

1. You need to be very careful about when you hit ^C when running hammie
2. The pop3proxy's "store" method doesn't really do anything

But couldn't this be adequately explained by merely stating that the
DBDict method stores things instantaneously?  If we're careful to always
update nham and nspam *before* writing any new wordinfo, then the worst
you could do would be to start training, then hit ^C right away--equivalent
to training on an empty message.  And people running the pop3proxy would
have to be aware that the proxy's working state is always in sync
with what's on the disk.  I don't see either of these as a huge
problem.

So we need to write out nham and nspam before writing out the new
WordInfo counts.  I don't think it'd be much of a penalty to do this
before every message in a batch training run, and of course for the
pickle method it makes no difference at all whether you increment the
count before or after training on a message.
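
Here's a rough sketch of that ordering, just to make the idea concrete.
It is not the real DBDict or hammie code: META_KEY, train(),
is_consistent() and InterruptingStore are all made-up names, and a plain
dict stands in for the dbm file.  The point is only that the message
totals hit the store before any word counts do, so an interrupt can
never leave a word count larger than the totals.

```python
META_KEY = "__meta__"  # hypothetical key holding the (nham, nspam) pair


def train(db, words, is_spam):
    """Train on one message, writing message totals before word counts."""
    nham, nspam = db.get(META_KEY, (0, 0))
    if is_spam:
        nspam += 1
    else:
        nham += 1
    # Step 1: the totals go to the store first.
    db[META_KEY] = (nham, nspam)
    # Step 2: per-word counts follow.  An interrupt anywhere in this loop
    # leaves every word count <= the totals written in step 1.
    for word in sorted(set(words)):
        hamcount, spamcount = db.get(word, (0, 0))
        if is_spam:
            spamcount += 1
        else:
            hamcount += 1
        db[word] = (hamcount, spamcount)


def is_consistent(db):
    """True if no word has been counted in more messages than were trained."""
    nham, nspam = db.get(META_KEY, (0, 0))
    return all(
        hamcount <= nham and spamcount <= nspam
        for key, (hamcount, spamcount) in db.items()
        if key != META_KEY
    )


class InterruptingStore(dict):
    """Dict that raises KeyboardInterrupt after a fixed number of writes,
    standing in for someone hitting ^C mid-training."""

    def __init__(self, writes_allowed):
        super().__init__()
        self.writes_allowed = writes_allowed

    def __setitem__(self, key, value):
        if self.writes_allowed <= 0:
            raise KeyboardInterrupt
        self.writes_allowed -= 1
        super().__setitem__(key, value)


# Simulate ^C after the totals and one word have been written.
db = InterruptingStore(writes_allowed=2)
try:
    train(db, ["cheap", "pills", "now"], is_spam=True)
except KeyboardInterrupt:
    pass
print(is_consistent(db))
```

With writes_allowed=1 only the totals get written, which is the
"training on an empty message" case.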

Neale


