[Spambayes] Guidance re pickles versus DB for Outlook

Tim Stone - Four Stones Expressions tim@fourstonesExpressions.com
Tue Nov 26 14:03:34 2002


Nice treatment of the issue, dude.  So how are you going to do the write 
caching thing?  I imagine you're not going to use a working copy model, like I 
had going... ;)

The problem Richie had with __del__ is that Python gives no guarantee that it
will actually be called, for example on objects still alive at interpreter exit.
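
One common way around the __del__ problem is to make the flush explicit and
let a "with" block call it. The sketch below is only an illustration in
present-day Python: the Store class and its attributes are hypothetical, and
the "saved" dict merely stands in for the on-disk database.

```python
class Store:
    """Hypothetical persistent store; all names here are illustrative."""

    def __init__(self):
        self.pending = {}    # changes not yet written out
        self.saved = {}      # stands in for the on-disk database
        self.closed = False

    def __setitem__(self, key, value):
        self.pending[key] = value

    def close(self):
        # Explicit, idempotent flush: the caller decides when data is saved,
        # instead of hoping that __del__ runs.
        if not self.closed:
            self.saved.update(self.pending)
            self.pending.clear()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


with Store() as store:
    store["spam"] = 1
# close() has run by this point, whether or not the block raised.
```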

- TimS

11/26/2002 12:28:13 AM, Neale Pickett <neale@woozle.org> wrote:

>Mark, I'm going to have to go with everyone else, and I'm the guy who
>wrote the DBM back-end.  Until a reasonable dbm implementation is
>available in a default Windows install, it's pretty much a no-brainer
>due to potential database corruption.
>
>However, I'm still going to address a few of your points :)
>
>So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:
>
>> * Move to a DB, but stick with a fully synchronous model.  We still
>>   wear the DB load time at startup, but this should be reduced
>>   significantly.  We wear the performance costs at runtime associated
>>   with the scoring, and do all such scoring in the "foreground", and
>>   saving of the DB as necessary.
>
>The startup time for loading a DB is virtually non-existent.  I can't
>say for sure what the DBM back-ends do, but I imagine it's something
>like "open file, check a magic number, do a sanity check or two, build a
>few structures in memory, return".  The only thing a DBDict does when
>you start it is read in the MetaInfo class, which is stored as a
>2-tuple.  So I don't think this is going to be very slow at all.
>
>It's very easy to implement a hybrid dict/pickle method, which caches
>DBM writes and only writes them out when you call the store() method.
>I've been meaning to implement the write cache for a while now, because
>training a dbdict on a large corpus is so abysmally slow right now, and
>I have to do that a lot.
>
>For small training batches though (1 or 2 messages), I don't think
>you'll notice much difference.
>
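
A minimal sketch of the write cache described above, in present-day Python.
It is not Neale's implementation: the WriteCachedDict name is hypothetical,
and "backend" is assumed to be any mapping-like object (an open dbm file
would be one candidate). Only the store() method name comes from the message.

```python
class WriteCachedDict:
    """Buffer writes in memory; push them to the backend only on store()."""

    def __init__(self, backend):
        self.backend = backend   # assumed: any mapping-like object
        self.cache = {}          # writes not yet pushed to the backend

    def __getitem__(self, key):
        # Unsaved writes take priority over what the backend holds.
        if key in self.cache:
            return self.cache[key]
        return self.backend[key]

    def __setitem__(self, key, value):
        self.cache[key] = value  # cheap: no backend write yet

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def store(self):
        # Write everything out in one pass, then start a fresh cache.
        for key, value in self.cache.items():
            self.backend[key] = value
        self.cache.clear()


backend = {}                     # a plain dict stands in for the dbm file
counts = WriteCachedDict(backend)
counts["viagra"] = counts.get("viagra", 0) + 1
before_store = dict(backend)     # still empty: nothing written yet
counts.store()
```

The trade-off is that anything still in the cache is lost if the process dies
before store() is called.
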
>> I would appreciate some comments on this.  I am leaning towards the
>> asynch model, but it is clearly more complicated.  However, if moving
>> to a DB simply means we will have perf issues, just not at startup,
>> then the complexity would be warranted.
>
>The DBM method is currently about 10 times slower than the pickle for
>training, but it's a lot faster when you look at the whole picture, at
>least if you are constantly opening and closing your persistent store.
>I trained a new database with 50 messages using both methods:
>
>pickle:
>    min: 0.00468504428864
>    max: 0.0303419828415
>    avg: 0.00997757434845
>    tot: 1.799s
>
>dbm:
>    min: 0.0343930721283
>    max: 0.35492503643
>    avg: 0.102057716846
>    tot: 5.976s
>
>
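
One plausible reading of the min/max/avg/tot figures is per-message training
time plus a total that also covers opening and closing the store. The sketch
below measures it that way; this is an assumption, not necessarily how the
numbers in this message were produced. time_training and its arguments are
hypothetical names, and the toy store is just a dict.

```python
import time


def time_training(open_store, train_one, close_store, messages):
    """Time each train_one() call, plus the whole open/train/close run."""
    per_message = []
    run_start = time.perf_counter()
    store = open_store()              # load cost counts toward the total
    for msg in messages:
        start = time.perf_counter()
        train_one(store, msg)
        per_message.append(time.perf_counter() - start)
    close_store(store)                # save cost counts toward the total
    total = time.perf_counter() - run_start
    return {
        "min": min(per_message),
        "max": max(per_message),
        "avg": sum(per_message) / len(per_message),
        "tot": total,
    }


def count_words(store, msg):
    # Toy stand-in for training: bump a counter per word.
    for word in msg.split():
        store[word] = store.get(word, 0) + 1


stats = time_training(
    open_store=dict,
    train_one=count_words,
    close_store=lambda store: None,
    messages=["buy cheap pills", "meeting at noon", "cheap meeting"],
)
print(stats)
```
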
>Here's that same run with an existing database trained on a full 1088
>messages.  You can see that the dbm method scales much better with a
>large dataset:
>
>pickle:
>    min: 0.0046820640564
>    max: 0.128466010094
>    avg: 0.0142578792572
>    tot: 8.874s
>
>dbm:
>    min: 0.0369809865952
>    max: 0.546903014183
>    avg: 0.11867954731
>    tot: 6.749s
>
>
>This is why the procmail crowd prefers the dbm, though.  Here's a train on
>*one* message:
>
>pickle:
>    min: 0.011234998703
>    max: 0.011234998703
>    avg: 0.011234998703
>    tot: 7.908s
>
>dbm:
>    min: 0.0912280082703
>    max: 0.0912280082703
>    avg: 0.0912280082703
>    tot: 0.574s
>
>
>But in *getting* it trained, the pickle smoked the DBM:
>
>pickle:
>    min: 0.00426197052002
>    max: 0.302268981934
>    avg: 0.0123967074734
>    tot: 26.991s
>
>dbm:
>    min: 0.0284680128098
>    max: 1.34902703762
>    avg: 0.139986485572
>    tot: 2m46.591s
>
>This performance loss can be mitigated pretty well by caching DBM
>writes.  It would also fix the "problem" Tim S has with closing the DBM
>before writing out the MetaData.  To me, that's the same as crashing,
>but in any case, it'll fix it.
>
>So if the DBM support on Windows were any good, I wouldn't know which
>one you should use for the Outlook stuff.  But I suspect that a DBM with
>write-caching could pound the vinegar-flavored snot out of a pickle.  :)
>
>Things being what they are, though, it sounds like you should stay away
>from DBM until Python 2.3.
>
>Neale


c'est moi - TimS
www.fourstonesExpressions.com 