Graham's spam filter

Fri Aug 23 01:23:27 EDT 2002

On Fri, 2002-08-23 at 07:06, Paul Rubin wrote:
> If you haven't been using the word artichoke in your previous email,
> artichoke will now be flagged in the database as a spam word, so my
> final artichoke message will get labelled as spam.  But if you HAVE
> been emailing about artichokes, then "artichoke" will be in both
> databases with similar probabilities, and my message won't get
> flagged.  So the filter sharing databases leaks info about the
> contents of your email.

Hmm... Consider this: keeping the spam-_corpus_ on the server (no
probabilities, only the _count_ of each word found in spam messages, and
nothing else) allows users to start using the system right away. It also
reduces the chance that a user falsely marks a message as spam, because
a new user mostly has to mark messages as OK, thereby building a
personal corpus (which contains the counts of words that appear in
non-spam messages).

The probabilities database and the non-spam-_corpus_ are kept on the
client; only the spam-_corpus_ is kept on the server. This doesn't leak
any information whatsoever (at least from my point of view)...
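
To make the split concrete, here is a minimal sketch of how a client
might combine the shared spam counts with its own non-spam counts. The
function name, the sample words, and the message totals are all made up
for illustration. The formula is a simplified ratio with clamping in the
spirit of Graham's filter, not necessarily his exact formula.

```python
from collections import Counter


def word_spam_probability(word, shared_spam, n_spam, local_ham, n_ham):
    """Estimate how spammy a word is, computed entirely on the client.

    shared_spam: word counts downloaded from the server (spam only)
    local_ham:   word counts from the user's own non-spam mail
    n_spam/n_ham: number of messages behind each corpus
    """
    # Relative frequency of the word in each corpus, capped at 1.0.
    b = min(1.0, shared_spam[word] / n_spam) if n_spam else 0.0
    g = min(1.0, local_ham[word] / n_ham) if n_ham else 0.0
    if b + g == 0:
        return 0.5  # never seen anywhere: treat as neutral (assumption)
    # Clamp so that no single word is ever treated as absolute proof.
    return max(0.01, min(0.99, b / (b + g)))


# Hypothetical data: the server only ever sees the spam counts.
shared_spam = Counter({"viagra": 40, "artichoke": 1})
# The non-spam counts never leave the client.
local_ham = Counter({"artichoke": 12, "meeting": 30})

p_viagra = word_spam_probability("viagra", shared_spam, 100, local_ham, 50)
p_artichoke = word_spam_probability("artichoke", shared_spam, 100, local_ham, 50)
```

In this sketch the only thing a client would ever send to the server is
the word counts of messages the user marked as spam. The non-spam counts
and the derived probabilities stay local.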

That's what my proposal is about...

Yours,

	Heiko Wundram
	Netzwart Wohnheim-D
	Universität 18 - Zimmer 2206 - Saarbrücken