[spambayes-dev] SpamBayes rules. New use. Suggestion for improvement.

Mon Jul 28 10:05:52 EDT 2003

Hello to all readers. This is my first post to the list, so please bear
with me if I bring up anything that has been discussed already; the
Mailman archives lack any search functionality.

I just wanted to mention that I have had success creating a learning
server-side spam filter on our company server. It is based on IMAP-SSL
and basically works with three folders: INBOX, one for Learning and
one for Spam. The filter simply learns by observing how users move
their mail from one folder to another.
The system design and database (SQL) would even allow for more learnable
folders per user, so that mail could be pre-sorted into folders other
than Spam using the same principle, although this feature has to undergo
some more testing yet. The beta period is almost over and I am pretty
satisfied with the result.
We intend to offer the system to companies and end users for a reasonable
price soon, and I am still negotiating about publishing the code, which
in some areas is still slightly spaghetti and too poorly documented at
the moment, so I am a bit reluctant myself. Anyway, if you should see one
of these silly patents on the principle some day soon, here is your
prior art ;-)
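
To illustrate the folder-based learning described above, here is a
minimal sketch, not the actual code of the system. It compares two
snapshots of which folder each message is in and derives a training
label from each move. The function names, the snapshot format
(message id -> folder name) and the labelling rule are assumptions
made for this example; in particular, the role of the Learning folder
is not modelled here.

```python
def detect_moves(before, after):
    """Return (msg_id, old_folder, new_folder) for messages that changed folder.

    `before` and `after` are hypothetical snapshots mapping a message id
    to the name of the folder it was seen in.
    """
    moves = []
    for msg_id, new_folder in after.items():
        old_folder = before.get(msg_id)
        # Messages that are new in `after` have no old folder and are skipped.
        if old_folder is not None and old_folder != new_folder:
            moves.append((msg_id, old_folder, new_folder))
    return moves


def training_label(old_folder, new_folder, spam_folder="Spam"):
    """Map a move to 'spam', 'ham' or None (no training signal).

    Assumed rule: moving a message into the spam folder trains it as
    spam, moving it out of the spam folder trains it as ham, and any
    other move is ignored.
    """
    if new_folder == spam_folder:
        return "spam"
    if old_folder == spam_folder:
        return "ham"
    return None


before = {"a": "INBOX", "b": "Spam", "c": "INBOX"}
after = {"a": "Spam", "b": "INBOX", "c": "INBOX", "d": "INBOX"}
for msg_id, old, new in detect_moves(before, after):
    print(msg_id, training_label(old, new))
```

Each label would then be handed to the classifier's training step for
the user who moved the message.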

As you would expect I have chosen SpamBayes for the actual core of the
system because of its excellent modularity and improvements to the basic
idea, although we "only" use tokenize() and our own subclass of the
classifier object.

Although the system works almost perfectly for the average user, I have
found that power users will almost inevitably get false positives. One
outstanding example is a mail from Bruce Schneier's CRYPTO-GRAM mailing
list dated Jul. 15, which scored a whopping 100.0% for 8 out of 9 users
receiving it. The mail covers various topics and one of them is spam
filtering. In another story it contained the two deadly words:
prescr*ption and ph*rmacy. But it has some attributes that should be
easy to tokenize and that would tell it apart from typical spam.

As far as I can see there is, for example, no token that indicates the
length of a message. It might even be advisable to specify the length
not in words but in tokens. I just looked over the last day's logs and
would estimate that about 50% of all spam is less than 75 tokens long,
about 90% is less than 250 tokens, and hardly any spam at all reaches
1000 tokens or more. So a special token bearing the length of a mail,
in a form like
t-length: [<500|>500|>1000|>2000|>5000] might be a useful indicator
against a spam verdict for mail like the one mentioned above, which was
a pretty long one.
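
A minimal sketch of how such a length token could be generated follows.
It is not part of SpamBayes; the function names are made up for this
example, and it works on any list of tokens rather than calling
tokenize() itself. The bucket boundaries are the ones proposed above.
Since the proposal does not say where a mail of exactly 500 tokens
belongs, the sketch assumes it falls into the "<500" bucket.

```python
from bisect import bisect_left

# Bucket boundaries and labels taken from the proposal above.
THRESHOLDS = (500, 1000, 2000, 5000)
LABELS = ("<500", ">500", ">1000", ">2000", ">5000")


def length_token(n_tokens):
    """Return the synthetic length token for a mail with n_tokens tokens.

    bisect_left counts the thresholds strictly below n_tokens, so every
    ">N" label is only used when n_tokens really exceeds N. A count of
    exactly 500 lands in "<500" (an assumption, see above).
    """
    return "t-length:" + LABELS[bisect_left(THRESHOLDS, n_tokens)]


def with_length_token(tokens):
    """Return the tokens of one mail plus the length token appended."""
    tokens = list(tokens)
    return tokens + [length_token(len(tokens))]


print(length_token(74))    # t-length:<500
print(length_token(1200))  # t-length:>1000
print(length_token(6000))  # t-length:>5000
```

Because the estimates above put most spam below 250 tokens, finer
buckets at the low end (for example at 75 and 250) might also be worth
testing.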

If there is interest, I can forward the mail to individuals for
testing or post it in full to the list.

Thanks for reading.

Regards,
	Fionn
-- 
Software patents    -  not allowed in Europe | See whats going on:
Archiving Email     -  patented in Europe    | http://freepatents.org/
E-Shopping Baskets  -  patented in Europe    | Become active easily:  
Cross-compiling     -  patented in Europe    | http://aktiv.ffii.org/eubsa/en  