[Spambayes] Guidance re pickles versus DB for Outlook

Tue Nov 26 06:28:13 2002

Mark, I'm going to have to go with everyone else, and I'm the guy who
wrote the DBM back-end.  Until a reasonable dbm implementation is
available in a default Windows install, sticking with the pickle is
pretty much a no-brainer, given the potential for database corruption.

However, I'm still going to address a few of your points :)

So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:

> * Move to a DB, but stick with a fully synchronous model.  We still
>   wear the DB load time at startup, but this should be reduced
>   significantly.  We wear the performance costs at runtime associated
>   with the scoring, and do all such scoring in the "foreground", and
>   saving of the DB as necessary.

The startup time for loading a DB is virtually nonexistent.  I can't
say for sure what the DBM back-ends do, but I imagine it's something
like "open file, check a magic number, do a sanity check or two, build a
few structures in memory, return".  The only thing a DBDict does when
you start it is read in the MetaInfo class, which is stored as a
2-tuple.  So I don't think this is going to be very slow at all.

It's very easy to implement a hybrid dict/pickle method, which caches
DBM writes and only writes them out when you call the store() method.
I've been meaning to implement the write cache for a while now, because
training a dbdict on a large corpus is so abysmally slow right now, and
I have to do that a lot.

For small training batches though (1 or 2 messages), I don't think
you'll notice much difference.
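
A minimal sketch of what such a write cache might look like, not the
actual dbdict code.  The class name WriteCachedDict and its layout are
made up for illustration; the only assumption about the back-end is
that it behaves like a mapping with bytes keys and values, which an
open dbm file does.  Writes and deletes stay in memory until store()
is called:

```python
import pickle


class WriteCachedDict:
    """Dict-like wrapper that buffers writes in memory until store().

    Hypothetical illustration only; `backend` is any mapping with bytes
    keys and values, such as an open dbm file.
    """

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}        # pending writes: key -> Python object
        self.deleted = set()   # keys deleted since the last store()

    def __contains__(self, key):
        if key in self.cache:
            return True
        if key in self.deleted:
            return False
        return key.encode("utf-8") in self.backend

    def __getitem__(self, key):
        if key in self.cache:
            return self.cache[key]
        if key in self.deleted:
            raise KeyError(key)
        return pickle.loads(self.backend[key.encode("utf-8")])

    def __setitem__(self, key, value):
        # Only touch memory; the back-end is not written until store().
        self.deleted.discard(key)
        self.cache[key] = value

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.cache.pop(key, None)
        self.deleted.add(key)

    def store(self):
        """Flush all pending deletes and writes to the back-end."""
        for key in self.deleted:
            bkey = key.encode("utf-8")
            if bkey in self.backend:
                del self.backend[bkey]
        for key, value in self.cache.items():
            self.backend[key.encode("utf-8")] = pickle.dumps(value)
        self.cache.clear()
        self.deleted.clear()
```

With a real database you would wrap the object returned by the dbm
module's open() call, train as usual, and call store() once at the end
of the batch.  The trade-off is the same as with the pickle: anything
trained since the last store() is lost if the process dies first.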

> I would appreciate some comments on this.  I am leaning towards the
> asynch model, but it is clearly more complicated.  However, if moving
> to a DB simply means we will have perf issues, just not at startup,
> then the complexity would be warranted.

The DBM method is currently about 10 times slower than the pickle for
training, but it's a lot faster when you look at the whole picture, at
least if you are constantly opening and closing your persistent store.
I trained a new database with 50 messages using both methods:

pickle:
    min: 0.00468504428864
    max: 0.0303419828415
    avg: 0.00997757434845
    tot: 1.799s

dbm:
    min: 0.0343930721283
    max: 0.35492503643
    avg: 0.102057716846
    tot: 5.976s

Here's that same run with an existing database trained on a full 1088
messages.  You can see that the dbm method scales much better with a
large dataset:

pickle:
    min: 0.0046820640564
    max: 0.128466010094
    avg: 0.0142578792572
    tot: 8.874s

dbm:
    min: 0.0369809865952
    max: 0.546903014183
    avg: 0.11867954731
    tot: 6.749s

This is why the procmail crowd prefers the dbm, though.  Here's a train
on *one* message:

pickle:
    min: 0.011234998703
    max: 0.011234998703
    avg: 0.011234998703
    tot: 7.908s

dbm:
    min: 0.0912280082703
    max: 0.0912280082703
    avg: 0.0912280082703
    tot: 0.574s

But in *getting* it trained, the pickle smoked the DBM:

pickle:
    min: 0.00426197052002
    max: 0.302268981934
    avg: 0.0123967074734
    tot: 26.991s

dbm:
    min: 0.0284680128098
    max: 1.34902703762
    avg: 0.139986485572
    tot: 2m46.591s

This performance loss can be mitigated pretty well by caching DBM
writes.  It would also fix the "problem" Tim S has with closing the DBM
before writing out the MetaData.  To me, that's the same as crashing,
but in any case, it'll fix it.
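
For anyone who wants to repeat the comparison, per-message figures
like the min/max/avg lines above could be collected with something
along these lines.  This is a rough sketch, not the script that
produced those numbers: time_training and train_one are hypothetical
names, and train_one stands for whatever callable trains your
classifier on a single message.  It does not reproduce the "tot" line,
which presumably also covers opening and closing the store.

```python
import time


def time_training(train_one, messages):
    """Time train_one(msg) for each message.

    Returns (min, max, avg) of the per-message durations in seconds.
    Both arguments are placeholders for illustration.
    """
    durations = []
    for msg in messages:
        start = time.perf_counter()
        train_one(msg)
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("no messages to train on")
    return min(durations), max(durations), sum(durations) / len(durations)
```

Running it once against a pickle-backed classifier and once against a
dbm-backed one, on the same messages, gives directly comparable
numbers.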

So if the DBM support on Windows were any good, I wouldn't know which
one you should use for the Outlook stuff.  But I suspect that a DBM with
write-caching could pound the vinegar-flavored snot out of a pickle.  :)

Things being what they are, though, it sounds like you should stay away
from DBM until Python 2.3.

Neale