[Spambayes] Guidance re pickles versus DB for Outlook

Tim Stone - Four Stones Expressions tim@fourstonesExpressions.com
Tue Nov 26 02:15:41 2002


Ok, I'm glad you've put this out here. IMO, DBM is too unreliable to be 
anything but a test database.  In real life, bad stuff happens... the database 
has to be resilient, or at least recoverable.  DBM doesn't seem to be either, 
really.  (Are the Perl DBM implementations better?)  In the absence of a real 
database, which may be out of reach here, we should stick with pickles, which 
have a rather short 'in-doubt' window that exists only while the pickle is 
being written.  Pickles are slow to load, slow to store, and fast to access, 
primarily because the entire object model is being materialized into memory.  
This makes 'em honkin memory hogs, with the memory consumption being a 
potential show-stopper.  But that won't happen except in huge database cases, 
and we can perhaps deal with that by placing some artificial limit on the 
pickle size.  When it exceeds that size, prune the least important stuff out.
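
To make the 'in-doubt' window and the size cap concrete, here is a minimal
sketch in present-day Python.  Everything in it is an assumption for
illustration, not the Spambayes code: the names save_pickle_atomically and
prune are hypothetical, the data is assumed to be a plain dict of
word -> (spamcount, hamcount), the cap is expressed as a number of entries
rather than bytes, and "least important" is taken to mean "lowest total
count".

```python
import os
import pickle
import tempfile


def save_pickle_atomically(obj, path):
    """Write obj to a temp file in the same directory, then rename it over path.

    A crash during the write leaves the previous pickle untouched; the only
    in-doubt moment is the rename itself.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def prune(word_counts, max_words):
    """Return a copy holding at most max_words entries.

    Assumption: an entry's importance is spamcount + hamcount, so the
    rarest words are dropped first.
    """
    if len(word_counts) <= max_words:
        return dict(word_counts)
    ranked = sorted(
        word_counts.items(),
        key=lambda item: item[1][0] + item[1][1],
        reverse=True,
    )
    return dict(ranked[:max_words])
```

Calling save_pickle_atomically(prune(counts, limit), path) at shutdown would
cap the pickle and keep the old file intact if the write dies halfway.  A
smarter importance measure (age, distance of spamprob from 0.5) could be
swapped into prune without touching the save logic.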

So, as far as async goes, wow... that adds a huge amount of complexity.  Is it 
really worth it?  I really doubt it.  It makes for really neat architectures, 
and it certainly isn't out of the question, but it makes a rigorous test of a 
system all but impossible, makes the code really hard to understand, modify, 
and maintain, and seriously violates the 'stupid is good' principle.

So, to deal with the Outlook startup times, I wonder if there are any 
partitioning schemes we can implement.  Perhaps we could split the pickled 
stuff into partitions, based on spamprob (perhaps), alphabetically, nham, 
whatever.  We could load a small subset by default, and then load the whole 
thing later at a user's request, or ... I don't know, I'm just thinking out 
loud.
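
One way the partitioning idea could look, as a rough sketch only.  The
assumptions: the data is a plain dict of word -> spamprob, the split is by
how far spamprob sits from 0.5 (a "strong" partition loaded at startup and
a "weak" one loaded later), and the names partition_by_strength,
save_partitions and PartitionedStore, the 0.4 threshold and the .pck file
names are all made up for illustration.

```python
import os
import pickle


def partition_by_strength(probs, threshold=0.4):
    """Split {word: spamprob} into strong clues and everything else."""
    strong, weak = {}, {}
    for word, p in probs.items():
        target = strong if abs(p - 0.5) >= threshold else weak
        target[word] = p
    return {"strong": strong, "weak": weak}


def save_partitions(parts, directory):
    """Write each partition to its own pickle file in directory."""
    for name, data in parts.items():
        with open(os.path.join(directory, name + ".pck"), "wb") as f:
            pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


class PartitionedStore:
    """Loads a small default subset up front; other partitions on request."""

    def __init__(self, directory, default=("strong",)):
        self.directory = directory
        self.probs = {}
        self.loaded = set()
        for name in default:
            self.load(name)

    def load(self, name):
        """Merge one partition into memory; a repeat call is a no-op."""
        if name in self.loaded:
            return
        with open(os.path.join(self.directory, name + ".pck"), "rb") as f:
            self.probs.update(pickle.load(f))
        self.loaded.add(name)

    def spamprob(self, word, unknown=0.5):
        """Words in partitions not yet loaded score as unknown."""
        return self.probs.get(word, unknown)
```

The catch is that until the "weak" partition is loaded, its words score as
unknown, so early scores can differ from the ones the full database would
give.  Whether that trade is acceptable for faster startup is exactly the
open question.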

- TimS


11/25/2002 5:45:01 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>Hi everyone (and tim1 <wink>)
>
>  I've been thinking about the "database" to use for the Outlook plugin.  I
>see two reasonable choices today: pickles and whatever anydbm picks up on
>Windows.
>
>My understanding is that the main trade-offs are that pickles are slow to
>load, but lightning to use, whereas a database is fast(er) to load, but
>slow to use.  IIRC, updating the probabilities was a real killer for a DB,
>but this has recently died.
>
>To be honest, my main motivation in even thinking about this is the terrible
>things we are doing to Outlook's startup time.  My decent machine is taking
>quite a few seconds longer to get Outlook started - and this cost is worn
>every time *any* application uses Outlook for anything at all.  If we do any
>sort of training, we also pay this penalty shutting down, saving the pickle.
>If we crash, we lose all recent training data.
>
>So, I see two basic routes I can take:
>
>* Move to a DB, but stick with a fully synchronous model.  We still wear the
>DB load time at startup, but this should be reduced significantly.  We wear
>the performance costs at runtime associated with the scoring, and do all
>such scoring in the "foreground", and saving of the DB as necessary.
>
>* Stick with pickles, but move to a threaded asynchronous model.  Messages
>can be "queued" for scoring/training.  At startup, we spin a new thread to
>load the pickle.  Any "missed" messages at startup, and all messages as they
>arrive are queued for scoring and filtering.  If the pickle is loaded, then
>it will generally appear synchronous, otherwise new messages may sit in your
>inbox for a few seconds before they are removed.  When the pickle is
>modified, a background thread copies the data, and starts writing.  We do
>some smarts with renaming the previous versions, as Tim1 implicated.  There
>would be support for synchronous calls too (eg, "show spam clues"), but in
>general, asynch could be used.
>
>I would appreciate some comments on this.  I am leaning towards the asynch
>model, but it is clearly more complicated.  However, if moving to a DB
>simply means we will have perf issues, just not at startup, then the
>complexity would be warranted.
>
>Any thoughts?  Fairy god-mothers? Magic answers?
>
>Thanks,
>
>Mark.
>


c'est moi - TimS
www.fourstonesExpressions.com 




