[Spambayes] Results of playing with CDB
Neale Pickett
neale@woozle.org
15 Sep 2002 12:06:33 -0700
So then, "Tim Peters" <tim@zope.com> is all like:
> Sure. My post suggested a different approach: one client box, one
> client server process running on the client box. I understood that
> Neale had in mind one server machine and N server processes running on
> that single machine. That's not what I had in mind.
Unfortunately, that's the only situation I can afford :) I have 80
users, one box, and if what SpamAssassin says is correct, 1/3 of all my
incoming email is spam. I suspect it's closer to 1/2. In any case, I
have a big problem and limited resources. The "quick-scoring" Paul
Graham approach looks really appealing, but I have to make sure it
runs quickly. About the only thing I really can spare is storage
space.
> I expect that, in the end, 20MB RAM will prove enough for a 0.2% fp
> rate and a 1% fn rate, and assuming English.
So that basically eliminates one server process per user for me, and
probably for a lot of other people, too. At 20MB apiece, 204 users
would fill the entire 4GiB addressable on a 32-bit architecture.
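
For anyone who wants to check the arithmetic, here is a rough
back-of-the-envelope sketch. It assumes the 20MB estimate applies to
each server process and counts in MiB; the function names are made up
for illustration and are not part of any Spambayes code.

```python
# Back-of-the-envelope: how many 20MB classifier processes fit in RAM?
MIB = 1024 * 1024
PER_PROCESS = 20 * MIB          # assumption: 20 MiB per server process
ADDRESSABLE = 4 * 1024 * MIB    # 4 GiB, the 32-bit addressing limit


def max_processes(total_bytes, per_process_bytes=PER_PROCESS):
    """Number of whole processes that fit in total_bytes."""
    return total_bytes // per_process_bytes


def total_ram_mib(users, per_process_bytes=PER_PROCESS):
    """RAM in MiB needed for one process per user."""
    return users * per_process_bytes // MIB


print(max_processes(ADDRESSABLE))   # 204
print(total_ram_mib(80))            # 1600
```

So even my 80 users would want about 1.6GB of RAM under that
assumption, which is more than I can spare.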
> > But that is indeed a question to ponder: can 80 users effectively
> > share the classifier database?
>
> I expect that depends most on their tolerance for errors. Just about
> everyone subscribes to some bulk email, and it seems likely that a
> classifier trained to accept the union of 80 peoples' private
> delusions <wink> would suffer a lot. At the start, there would be a
> lot of false positives as the system struggled to learn that stuff
> looking exactly like spam is actually desired by someone. OTOH, only
> testing can answer this, and we've got a lot more question-askers than
> testers here.
I volunteer my hapless 80 as testers! (Actually, I doubt more than four
or five would sign up for such a thing, but maybe that's enough.) Do
you think they should all contribute to one big vat-o-meat database, or
should individual words be tagged per-user?
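
To make the two options concrete, here is a minimal sketch of what I
mean. Everything in it is hypothetical (the class names, the
[spam, ham] count layout, the example users); it is not how the
current classifier stores anything, just an illustration of the two
layouts.

```python
from collections import defaultdict


class SharedCounts:
    """One big database: every user's training lands in the same counts."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # token -> [spam, ham]

    def train(self, user, tokens, is_spam):
        # The user is ignored on purpose; all training is pooled.
        for tok in set(tokens):
            self.counts[tok][0 if is_spam else 1] += 1

    def lookup(self, user, token):
        return tuple(self.counts.get(token, (0, 0)))


class PerUserCounts:
    """Same kind of store, but each word is tagged with its trainer."""

    def __init__(self):
        # (user, token) -> [spam, ham]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, user, tokens, is_spam):
        for tok in set(tokens):
            self.counts[(user, tok)][0 if is_spam else 1] += 1

    def lookup(self, user, token):
        return tuple(self.counts.get((user, token), (0, 0)))


shared, tagged = SharedCounts(), PerUserCounts()
for db in (shared, tagged):
    db.train("alice", ["mortgage", "rates"], is_spam=False)  # her newsletter
    db.train("bob", ["mortgage", "free"], is_spam=True)      # his spam

print(shared.lookup("alice", "mortgage"))  # (1, 1)
print(tagged.lookup("alice", "mortgage"))  # (0, 1)
print(tagged.lookup("bob", "mortgage"))    # (1, 0)
```

The shared layout keeps the database small but mixes everyone's
opinions about a word; the tagged layout keeps them apart at the cost
of storing up to one entry per user per word. Storage is the one thing
I have plenty of.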