[spambayes-bugs] [ spambayes-Feature Requests-1000427 ] non-English spam; localized filters

SourceForge.net noreply at sourceforge.net
Mon Aug 9 09:04:09 CEST 2004


Feature Requests item #1000427, was opened at 2004-07-30 00:07
Message generated for change (Comment added) made by mkengel
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1000427&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Michael Engel (mkengel)
Assigned to: Nobody/Anonymous (nobody)
Summary: non-English spam; localized filters

Initial Comment:

How to deal with spam in a mixture of English/non-English
mails*? These messages seem to pass the filters easily.
* in my case English/German/French and Japanese

Solution idea: localized filters, applied one after the other;
it should be possible to choose them at installation.


----------------------------------------------------------------------

>Comment By: Michael Engel (mkengel)
Date: 2004-08-09 07:04

Message:
Logged In: YES 
user_id=780774

Thank you for the comments.

I waited a little while to see whether training on German
spam had an effect.
It did: after a total of 4 weeks, SpamBayes now identifies
these messages as spam (score 0.44 - my cutoff is 0.35).

Probably there were not enough messages in German and
French for SpamBayes to tell the difference.



----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-08-03 06:42

Message:
Logged In: YES 
user_id=552329

Is your ham also mixed language?  With
English/German/French, SpamBayes doesn't care about the
language and will just learn each word as good/bad, so
should work fine (with appropriate training).  Have you
trained on these sorts of spam?  Attaching the clues for a
misclassified message would give more insight into this.
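The language-agnostic scoring described above can be sketched roughly like this (a simplified illustration, not SpamBayes' actual arithmetic, which uses chi-squared combining and smoothed probabilities):

```python
# Simplified sketch: a Bayesian filter scores each trained token by how
# often it appeared in spam vs. ham, with no notion of language at all.
def word_spamprob(spam_count, ham_count, nspam, nham):
    """Estimate P(spam | word) from per-word training counts."""
    spam_rate = spam_count / nspam if nspam else 0.0
    ham_rate = ham_count / nham if nham else 0.0
    if spam_rate + ham_rate == 0.0:
        return 0.5  # word never seen in training: neutral
    return spam_rate / (spam_rate + ham_rate)

# A German word seen mostly in trained spam scores high, exactly as an
# English word would -- the filter just counts tokens:
print(word_spamprob(40, 2, 100, 100))  # ~0.95
print(word_spamprob(0, 0, 100, 100))   # 0.5 (unseen)
```

This is why training on the German and French spam is enough: each word, whatever the language, accumulates its own spam/ham statistics.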

The Japanese is more difficult, because SpamBayes creates
tokens by (mostly) splitting on whitespace, and this isn't
how Asian languages work (we would get sentence tokens, I
think).  It's unlikely that we will ever handle this well,
and the best solution would be to have someone (willing to
do all the work) create a forked project that has a
different tokeniser, customised for Asian languages.
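The whitespace problem described above is easy to demonstrate (a minimal sketch; SpamBayes' real tokeniser also handles headers, URLs, and so on, but the body-splitting issue is the same):

```python
# Minimal sketch of whitespace-based tokenisation.
def tokenize(text):
    """Split a message body into tokens on whitespace."""
    return text.split()

# English/German/French text yields useful per-word tokens:
print(tokenize("Viagra günstig kaufen"))
# ['Viagra', 'günstig', 'kaufen']

# Japanese is written without spaces between words, so an entire
# sentence survives as one near-useless token:
print(tokenize("これは迷惑メールです。"))
# ['これは迷惑メールです。']
```

A fork for Asian languages would need word segmentation (or character n-grams) in place of the whitespace split.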

----------------------------------------------------------------------
