[Spambayes] Results of playing with CDB

Neale Pickett neale@woozle.org
15 Sep 2002 12:06:33 -0700


So then, "Tim Peters" <tim@zope.com> is all like:

> Sure.  My post suggested a different approach: one client box, one
> client server process running on the client box.  I understood that
> Neale had in mind one server machine and N server processes running on
> that single machine.  That's not what I had in mind.

Unfortunately, that's the only situation I can afford :) I have 80
users, one box, and if what SpamAssassin says is correct, 1/3 of all my
incoming email is spam.  I suspect it's closer to 1/2.  In any case, I
have a big problem and limited resources.  The "quick-scoring" Paul
Graham approach looks really appealing, but I have to make sure it
actually runs quickly.  About the only thing I can really spare is
storage space.

> I expect that, in the end, 20MB RAM will prove enough for a 0.2% fp
> rate and a 1% fn rate, and assuming English.

So that basically eliminates one server process per user for me, and
probably for a lot of other people, too.  At 20MB apiece, 204 users
would fill up all 4GiB addressable on a 32-bit architecture.
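
The arithmetic behind that estimate, as a quick sketch.  It assumes
"20MB" means 20 MiB per process and pretends nothing else on the box
uses any memory, so the real ceiling would be lower.  The function
names are made up for illustration.

```python
# Back-of-the-envelope: how many 20MB classifier processes fit in the
# 4GiB addressable on a 32-bit machine?
# Assumptions: "20MB" is 20 MiB, and the kernel, mail server, and
# everything else take no memory at all (they do).

MIB = 1024 * 1024
ADDRESSABLE = 2 ** 32          # 4 GiB
PER_PROCESS = 20 * MIB         # the quoted estimate, one process per user


def max_processes(total=ADDRESSABLE, each=PER_PROCESS):
    """Whole number of processes that fit in `total` bytes."""
    return total // each


def total_needed(users, each=PER_PROCESS):
    """Bytes needed to give every user a private process."""
    return users * each


print(max_processes())             # 204
print(total_needed(80) / MIB)      # 1600.0 (MiB for 80 users)
```

So 80 private processes would already want about 1.6GiB under these
assumptions.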

> > But that is indeed a question to ponder: can 80 users effectively
> > share the classifier database?
> 
> I expect that depends most on their tolerance for errors.  Just about
> everyone subscribes to some bulk email, and it seems likely that a
> classifier trained to accept the union of 80 peoples' private
> delusions <wink> would suffer a lot.  At the start, there would be a
> lot of false positives as the system struggled to learn that stuff
> looking exactly like spam is actually desired by someone.  OTOH, only
> testing can answer this, and we've got a lot more question-askers than
> testers here.

I volunteer my hapless 80 as testers!  (Actually, I doubt more than four
or five would sign up for such a thing, but maybe that's enough.)  Do
you think they should all contribute to one big vat-o-meat database, or
should individual words be tagged per-user?
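
To make the two options concrete, here is a rough sketch of each
layout.  Nothing here is the spambayes code or its database format; the
class names, the (spam, ham) count pairs, and the user names are all
hypothetical.

```python
# Hypothetical sketch of the two layouts; not the spambayes classifier.
from collections import defaultdict


class SharedCounts:
    """One big vat: every user's training lands in the same counts."""

    def __init__(self):
        # word -> [spam_count, ham_count]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, user, words, is_spam):
        # `user` is accepted but ignored, so both classes share a signature.
        for word in set(words):
            self.counts[word][0 if is_spam else 1] += 1

    def lookup(self, user, word):
        return tuple(self.counts.get(word, (0, 0)))


class PerUserCounts:
    """Words tagged per-user: each (user, word) pair has its own counts."""

    def __init__(self):
        # (user, word) -> [spam_count, ham_count]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, user, words, is_spam):
        for word in set(words):
            self.counts[(user, word)][0 if is_spam else 1] += 1

    def lookup(self, user, word):
        return tuple(self.counts.get((user, word), (0, 0)))


shared, tagged = SharedCounts(), PerUserCounts()
for db in (shared, tagged):
    db.train("alice", ["mortgage", "rates"], is_spam=True)
    db.train("bob", ["mortgage", "rates"], is_spam=False)  # bob wants these

print(shared.lookup("bob", "mortgage"))        # (1, 1)
print(tagged.lookup("bob", "mortgage"))        # (0, 1)
print(len(shared.counts), len(tagged.counts))  # 2 4
```

In the shared layout one user's ham dilutes another user's spam
evidence for the same word; the per-user layout keeps the opinions
apart but needs up to one entry per user per word.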