[Spambayes] Guidance re pickles versus DB for Outlook
Neale Pickett
neale@woozle.org
Tue Nov 26 06:28:13 2002
Mark, I'm going to have to go with everyone else, and I'm the guy who
wrote the DBM back-end. Until a reasonable dbm implementation is
available in a default Windows install, sticking with the pickle is
pretty much a no-brainer, given the potential for database corruption.
However, I'm still going to address a few of your points :)
So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:
> * Move to a DB, but stick with a fully synchronous model. We still
> wear the DB load time at startup, but this should be reduced
> significantly. We wear the performance costs at runtime associated
> with the scoring, and do all such scoring in the "foreground", and
> saving of the DB as necessary.
The startup time for loading a DB is virtually nonexistent. I can't
say for sure what the DBM back-ends do, but I imagine it's something
like "open file, check a magic number, do a sanity check or two, build a
few structures in memory, return". The only thing a DBDict does when
you start it is read in the MetaInfo class, which is stored as a
2-tuple. So I don't think this is going to be very slow at all.
It's very easy to implement a hybrid dict/pickle method, which caches
DBM writes and only writes them out when you call the store() method.
I've been meaning to implement the write cache for a while now, because
training a dbdict on a large corpus is so abysmally slow right now, and
I have to do that a lot.
For small training batches though (1 or 2 messages), I don't think
you'll notice much difference.
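A minimal sketch of such a write cache might look like the following.
This is not existing Spambayes code: the class name WriteCachedDict and
its attributes are hypothetical, and the backend is assumed to be any
dict-like object (an open dbm, for instance). Only the store() method
name comes from the description above.

```python
class WriteCachedDict:
    """Hypothetical wrapper that buffers writes until store() is called."""

    def __init__(self, backend):
        self.backend = backend   # assumed: any dict-like object, e.g. an open dbm
        self.cache = {}          # pending writes, not yet in the backend
        self.deleted = set()     # pending deletions

    def __contains__(self, key):
        if key in self.deleted:
            return False
        return key in self.cache or key in self.backend

    def __getitem__(self, key):
        if key in self.deleted:
            raise KeyError(key)
        if key in self.cache:
            return self.cache[key]
        return self.backend[key]

    def __setitem__(self, key, value):
        # Writes only touch memory; the backend is untouched until store().
        self.deleted.discard(key)
        self.cache[key] = value

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.cache.pop(key, None)
        self.deleted.add(key)

    def store(self):
        """Flush pending deletions and writes to the backend in one pass."""
        for key in self.deleted:
            if key in self.backend:
                del self.backend[key]
        for key, value in self.cache.items():
            self.backend[key] = value
        self.cache.clear()
        self.deleted.clear()
```

With something like this, a long training run pays the dbm write cost
once, at store() time, instead of on every token update. The trade-off
is that anything still in the cache is lost if the process dies before
store() is called.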
> I would appreciate some comments on this. I am leaning towards the
> asynch model, but it is clearly more complicated. However, if moving
> to a DB simply means we will have perf issues, just not at startup,
> then the complexity would be warranted.
The DBM method is currently about 10 times slower than the pickle for
training, but it's a lot faster when you look at the whole picture, at
least if you are constantly opening and closing your persistent store.
I trained a new database with 50 messages using both methods:
pickle:
min: 0.00468504428864
max: 0.0303419828415
avg: 0.00997757434845
tot: 1.799s
dbm:
min: 0.0343930721283
max: 0.35492503643
avg: 0.102057716846
tot: 5.976s
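Per-message figures like the min/max/avg above could be gathered with a
harness along these lines. This is a guess at how such numbers might be
produced, not the script actually used: time_training and train_one are
hypothetical names, and the timings are assumed to be seconds per
message.

```python
import time


def time_training(train_one, messages):
    """Call train_one(msg) for each message and time every call.

    Returns (min, max, avg) of the per-message timings in seconds.
    `messages` must be non-empty.
    """
    timings = []
    for msg in messages:
        start = time.perf_counter()
        train_one(msg)
        timings.append(time.perf_counter() - start)
    return min(timings), max(timings), sum(timings) / len(timings)
```

Note that a harness like this only measures the training calls
themselves. The time spent opening and closing the persistent store is
not included and would have to be measured around the whole run.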
Here's that same run with an existing database trained on a full 1088
messages. You can see that the dbm method scales much better with a
large dataset:
pickle:
min: 0.0046820640564
max: 0.128466010094
avg: 0.0142578792572
tot: 8.874s
dbm:
min: 0.0369809865952
max: 0.546903014183
avg: 0.11867954731
tot: 6.749s
This is why the procmail crowd prefers the dbm, though. Here's a train
on *one* message:
pickle:
min: 0.011234998703
max: 0.011234998703
avg: 0.011234998703
tot: 7.908s
dbm:
min: 0.0912280082703
max: 0.0912280082703
avg: 0.0912280082703
tot: 0.574s
But in *getting* it trained, the pickle smoked the DBM:
pickle:
min: 0.00426197052002
max: 0.302268981934
avg: 0.0123967074734
tot: 26.991s
dbm:
min: 0.0284680128098
max: 1.34902703762
avg: 0.139986485572
tot: 2m46.591s
This performance loss can be mitigated pretty well by caching DBM
writes. It would also fix the "problem" Tim S has with closing the DBM
before writing out the MetaData. To me, that's the same as crashing,
but in any case, it'll fix it.
So if the DBM support on Windows were any good, I wouldn't know which
one you should use for the Outlook stuff. But I suspect that a DBM with
write-caching could pound the vinegar-flavored snot out of a pickle. :)
Things being what they are, though, it sounds like you should stay away
from DBM until Python 2.3.
Neale