[Spambayes] Results of playing with CDB

Sun, 15 Sep 2002 03:05:29 -0400

[Guido]
> That's not how I read Neale's post.  He's considering one server
> *process* per user though, and thinks that 80 server processes with a
> virtual memory address space (what he calls "RAM") of 20 MB would take
> up too much swap space.

Sure.  My post suggested a different approach:  one client box, one server
process running on that client box.  I understood that Neale had in mind
one server machine with N server processes running on that single machine.
That's not what I was proposing.

> Is 20 MB a reasonable estimate for the in-memory dict?

It depends on how much training data you feed it, how often it's cleaned,
how aggressively it's cleaned, and what the user considers to be acceptable
error rates (of both kinds).  The only solid data we have on any of that is
the qualitative observation that both error rates decrease as the training
data increases.

BTW, I've noticed that my process data size on test runs is about 3-4x
larger than the classifier pickle on disk.

Note that saving an 8-byte timestamp and an 8-byte probability (both
doubles) per word is merely convenient for now.
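
To make the per-word cost concrete, here is a minimal sketch using the
struct module.  The two-double layout is the one described above.  The
compact layout (a 4-byte integer timestamp plus a probability quantized to
one byte) is purely a hypothetical illustration of how the record might
shrink, not something the classifier does today.

```python
import struct

# Layout described in the text: timestamp and probability, both 8-byte doubles.
WIDE = struct.Struct("<dd")

# Hypothetical compact layout (an assumption, for illustration only):
# 4-byte unsigned timestamp in whole seconds, probability quantized to 1 byte.
COMPACT = struct.Struct("<IB")

def pack_wide(timestamp, prob):
    """Pack one word record as two doubles."""
    return WIDE.pack(timestamp, prob)

def pack_compact(timestamp, prob):
    """Pack one word record lossily: prob in [0, 1] becomes an int in 0..255."""
    return COMPACT.pack(int(timestamp), round(prob * 255))

def unpack_compact(data):
    """Recover (timestamp, approximate probability) from the compact form."""
    ts, q = COMPACT.unpack(data)
    return ts, q / 255.0

wide_size = WIDE.size        # bytes per word with two doubles
compact_size = COMPACT.size  # bytes per word with the hypothetical layout
```

Whether one byte of probability resolution is good enough is exactly the
kind of thing only testing can settle.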

I expect that, in the end, 20MB of RAM will prove enough for a 0.2% fp rate
and a 1% fn rate, assuming English.
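
For anyone unsure what the two rates measure, a small sketch:  fp is ham
wrongly called spam, as a fraction of all ham; fn is spam wrongly called
ham, as a fraction of all spam.  The message counts below are made up
solely to reproduce the 0.2% and 1% figures and are not test results.

```python
def error_rates(n_ham, n_spam, false_pos, false_neg):
    """Return (fp_rate, fn_rate) as fractions.

    false_pos: ham messages classified as spam
    false_neg: spam messages classified as ham
    """
    return false_pos / n_ham, false_neg / n_spam

# Hypothetical counts, chosen only to match the rates quoted above.
fp_rate, fn_rate = error_rates(n_ham=10_000, n_spam=5_000,
                               false_pos=20, false_neg=50)
```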

> But I don't buy his assumption that if you share the server between 80
> users the dict becomes too large.  Why would it?  Because their ham
> collections would be disjoint?  Not completely, I expect.

The only figure I have on that was posted yesterday:  when I increased the
training data by a factor of 4, the pickle size grew by a factor of 2.4 (to
about 600 bytes per message, and about 27,000 messages trained on).
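
A back-of-the-envelope reading of those figures, as a sketch only:  if
pickle size followed a power law in the amount of training data (an
assumption; two data points can't establish that), the exponent would be
log(2.4)/log(4), about 0.63, so growth is well short of linear.  Taking the
parenthetical at face value, the larger pickle is about 600 * 27,000 bytes.
The projected_size helper is hypothetical and just extrapolates that assumed
power law.

```python
import math

data_factor = 4.0      # training data increased by this factor
size_factor = 2.4      # pickle size grew by this factor
bytes_per_msg = 600    # approximate, after the increase
messages = 27_000      # approximate, after the increase

# Exponent of the assumed power law: size ~ data ** exponent.
exponent = math.log(size_factor) / math.log(data_factor)

# Total pickle size implied by the per-message figure.
total_pickle_bytes = bytes_per_msg * messages

def projected_size(growth):
    """Extrapolated pickle size if training data grew by `growth` again.

    Only as trustworthy as the power-law assumption, which is untested.
    """
    return total_pickle_bytes * growth ** exponent
```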

> But that is indeed a question to ponder: can 80 users effectively
> share the classifier database?

I expect that depends most on their tolerance for errors.  Just about
everyone subscribes to some bulk email, and it seems likely that a
classifier trained to accept the union of 80 people's private delusions
<wink> would suffer a lot.  At the start, there would be a lot of false
positives as the system struggled to learn that stuff looking exactly like
spam is actually desired by someone.  OTOH, only testing can answer this,
and we've got a lot more question-askers than testers here.