Graham's spam filter

Fri Aug 23 02:05:12 EDT 2002

Heiko Wundram <heikowu at ceosg.de> writes:
> The probabilities database and the non-spam-_corpus_ is kept on the
> client, only the spam-_corpus_ is kept on the server. This doesn't leak
> any information whatsoever (at least in my point of view)...
> 
> That's what my proposal is about...

If the probabilities database includes probabilities of non-spam words,
that definitely leaks information (e.g. my artichoke example).

If there's a public spam corpus, that's ok, but I think it should be
populated only with spam sent to addresses published specifically to
gather spam.  It shouldn't be populated with spam sent to individual
user accounts.  Spam mailing is not completely isotropic.  If I buy
sex toys from online vendor XYZ, and XYZ sells my email address to
specific spammers, I'll get a non-uniform selection of spam, which
will get reflected in the spam corpus.  If the spam corpus is
public, someone could deduce from it that I've been buying sex toys
from XYZ.  Even reflecting the XYZ mailings in the probability table
leaks information.
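
To make the leak concrete, here is a rough Python sketch of how a
per-token probability table gives away what is in the mail it was
built from.  The weighting loosely follows the formula in Graham's
"A Plan for Spam" as I understand it; the function names, token
counts, and the 0.2/0.8 cutoffs are all made up for illustration.

```python
def token_probabilities(good_counts, bad_counts, ngood, nbad):
    """Build a Graham-style table: token -> estimated P(spam | token)."""
    probs = {}
    for token in set(good_counts) | set(bad_counts):
        g = 2 * good_counts.get(token, 0)   # non-spam counts are doubled
        b = bad_counts.get(token, 0)
        if g + b < 5:                       # skip tokens seen too rarely
            continue
        good_ratio = min(1.0, g / ngood)
        bad_ratio = min(1.0, b / nbad)
        p = bad_ratio / (good_ratio + bad_ratio)
        probs[token] = max(0.01, min(0.99, p))
    return probs


def revealing_tokens(published, low=0.2, high=0.8):
    """What an outsider can read off a published table.

    Returns (tokens common in the user's real mail,
             tokens common in the spam the user receives).
    """
    in_real_mail = sorted(t for t, p in published.items() if p <= low)
    in_spam = sorted(t for t, p in published.items() if p >= high)
    return in_real_mail, in_spam


# Hypothetical counts for one user (100 non-spam and 100 spam messages).
good_counts = {"artichoke": 12, "meeting": 30}
bad_counts = {"viagra": 40, "meeting": 2, "xyz-toys": 9}

published = token_probabilities(good_counts, bad_counts, ngood=100, nbad=100)
in_real_mail, in_spam = revealing_tokens(published)
print(in_real_mail)
print(in_spam)
```

With these made-up counts, "artichoke" ends up at the low end of the
table and "xyz-toys" at the high end, so anyone who sees the table
learns both what I write about and which vendor's spam I receive.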

I think it's best to keep things completely airtight.  Don't publish
anything at all that's based on mail (spam or non-spam) sent to
private users.