[Spambayes] Guidance re pickles versus DB for Outlook
Tim Stone - Four Stones Expressions
tim@fourstonesExpressions.com
Tue Nov 26 14:03:34 2002
Nice treatment of the issue, dude. So how are you going to do the write
caching thing? I imagine you're not going to use a working copy model, like I
had going... ;)
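
For what it's worth, here is roughly how I picture the write cache Neale describes below: writes pile up in a plain dict and only reach the DBM when store() is called. This is a minimal sketch, not the actual dbdict code. The names WriteCachedDict, backing and cache are made up for illustration, and a plain dict stands in for the DBM file.

```python
class WriteCachedDict:
    """Hypothetical wrapper: buffers writes in memory until store() is called."""

    _DELETED = object()  # marks keys deleted in the cache but not yet flushed

    def __init__(self, backing):
        self.backing = backing  # any mapping; a DBM-style object in practice
        self.cache = {}         # pending writes: key -> value or _DELETED

    def __contains__(self, key):
        if key in self.cache:
            return self.cache[key] is not self._DELETED
        return key in self.backing

    def __getitem__(self, key):
        # Pending writes win over whatever is on disk.
        if key in self.cache:
            value = self.cache[key]
            if value is self._DELETED:
                raise KeyError(key)
            return value
        return self.backing[key]

    def __setitem__(self, key, value):
        self.cache[key] = value  # nothing touches the backing store yet

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.cache[key] = self._DELETED

    def store(self):
        """Flush all pending writes and deletes to the backing store."""
        for key, value in self.cache.items():
            if value is self._DELETED:
                if key in self.backing:
                    del self.backing[key]
            else:
                self.backing[key] = value
        self.cache.clear()
```

Anything still sitting in the cache is lost if the process dies before store(), which is the same trade-off the pickle has today.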
The problem Richie had with __del__ is that there's no guarantee that it will
actually be called.
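
The usual way around that is to not lean on __del__ at all and close the store explicitly, so the save happens even when training raises. A minimal sketch, with a made-up Store class and train_and_save function standing in for the real thing:

```python
class Store:
    """Hypothetical persistent store that must be closed to save its data."""

    def __init__(self):
        self.pending = ["word counts"]  # unsaved changes
        self.saved = []                 # stands in for what is on disk

    def close(self):
        # Write out pending changes; safe to call more than once.
        self.saved.extend(self.pending)
        self.pending = []


def train_and_save(store, fail=False):
    """Train, then always close the store, even if training raises."""
    try:
        if fail:
            raise RuntimeError("training blew up")
    finally:
        store.close()  # runs on both the normal and the error path
```

The finally clause is what gives the guarantee here; whether __del__ ever runs no longer matters.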
- TimS
11/26/2002 12:28:13 AM, Neale Pickett <neale@woozle.org> wrote:
>Mark, I'm going to have to go with everyone else, and I'm the guy who
>wrote the DBM back-end. Until a reasonable dbm implementation is
>available in a default Windows install, it's pretty much a no-brainer
>due to potential database corruption.
>
>However, I'm still going to address a few of your points :)
>
>So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:
>
>> * Move to a DB, but stick with a fully synchronous model. We still
>> wear the DB load time at startup, but this should be reduced
>> significantly. We wear the performance costs at runtime associated
>> with the scoring, and do all such scoring in the "foreground", and
>> saving of the DB as necessary.
>
>The startup time for loading a DB is virtually nonexistent. I can't
>say for sure what the DBM back-ends do, but I imagine it's something
>like "open file, check a magic number, do a sanity check or two, build a
>few structures in memory, return". The only thing a DBDict does when
>you start it is read in the MetaInfo class, which is stored as a
>2-tuple. So I don't think this is going to be very slow at all.
>
>It's very easy to implement a hybrid dict/pickle method, which caches
>DBM writes and only writes them out when you call the store() method.
>I've been meaning to implement the write cache for a while now, because
>training a dbdict on a large corpus is so abysmally slow right now, and
>I have to do that a lot.
>
>For small training batches though (1 or 2 messages), I don't think
>you'll notice much difference.
>
>> I would appreciate some comments on this. I am leaning towards the
>> asynch model, but it is clearly more complicated. However, if moving
>> to a DB simply means we will have perf issues, just not at startup,
>> then the complexity would be warranted.
>
>The DBM method is currently about 10 times slower than the pickle for
>training, but it's a lot faster when you look at the whole picture, at
>least if you are constantly opening and closing your persistent store.
>I trained a new database with 50 messages using both methods:
>
>pickle:
> min: 0.00468504428864
> max: 0.0303419828415
> avg: 0.00997757434845
> tot: 1.799s
>
>dbm:
> min: 0.0343930721283
> max: 0.35492503643
> avg: 0.102057716846
> tot: 5.976s
>
>
>Here's that same run with an existing database trained on a full 1088
>messages. You can see that the dbm method scales much better with a
>large dataset:
>
>pickle:
> min: 0.0046820640564
> max: 0.128466010094
> avg: 0.0142578792572
> tot: 8.874s
>
>dbm:
> min: 0.0369809865952
> max: 0.546903014183
> avg: 0.11867954731
> tot: 6.749s
>
>
>This is why the procmail crowd prefers the dbm, though. Here's a train
>on *one* message:
>
>pickle:
> min: 0.011234998703
> max: 0.011234998703
> avg: 0.011234998703
> tot: 7.908s
>
>dbm:
> min: 0.0912280082703
> max: 0.0912280082703
> avg: 0.0912280082703
> tot: 0.574s
>
>
>But in *getting* it trained, the pickle smoked the DBM:
>
>pickle:
> min: 0.00426197052002
> max: 0.302268981934
> avg: 0.0123967074734
> tot: 26.991s
>
>dbm:
> min: 0.0284680128098
> max: 1.34902703762
> avg: 0.139986485572
> tot: 2m46.591s
>
>This performance loss can be mitigated pretty well by caching DBM
>writes. It would also fix the "problem" Tim S has with closing the DBM
>before writing out the MetaData. To me, that's the same as crashing,
>but in any case, it'll fix it.
>
>So if the DBM support on Windows were any good, I wouldn't know which
>one you should use for the Outlook stuff. But I suspect that a DBM with
>write-caching could pound the vinegar-flavored snot out of a pickle. :)
>
>Things being what they are, though, it sounds like you should stay away
>from DBM until Python 2.3.
>
>Neale
>
c'est moi - TimS
www.fourstonesExpressions.com