[spambayes-bugs] [ spambayes-Feature Requests-1206796 ] Catch intentional mispellings

SourceForge.net noreply at sourceforge.net
Mon May 23 06:24:12 CEST 2005


Feature Requests item #1206796, was opened at 2005-05-23 16:12
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1206796&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
Priority: 5
Submitted By: Matt (matthew_levine)
Assigned to: Nobody/Anonymous (nobody)
Summary: Catch intentional mispellings

Initial Comment:
Most of the spam I receive has many of its key words 
intentionally misspelled to throw off spam filters.  If 
spammers always used the same misspellings, 
SpamBayes would catch them just fine, but spammers 
are smart enough to vary the way they misspell words.  
Also, if many different versions of the same word are in 
the spam database, each version is seen less often, which 
greatly weakens the word's spam association.  

I think it would help if SpamBayes could recognize 
words as variants of other words and count them as the 
same token.  One way to do this, for words composed 
primarily of ASCII characters, might be to replace 
zeros and ones with 'o's and 'i's, and replace any accented 
characters or other symbols with the normal letters 
they resemble.  Then, instead of requiring the letters 
of the word to be in order, count the quantity of each 
letter in the word, and if the letter counts are more than 
a certain percentage similar to those of a known spam 
token, count the email as having that token.  
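
The idea in the paragraph above could be sketched roughly as 
follows.  This is only an illustration, not SpamBayes code: the 
names normalize, letter_similarity and match_token, the 
similarity measure (shared letters divided by combined letters) 
and the 0.8 threshold are all assumptions, and the "primarily 
ASCII" check is left out.

```python
import unicodedata
from collections import Counter

# Assumed look-alike map; the request only names zeros and ones.
LOOKALIKES = str.maketrans({"0": "o", "1": "i"})


def normalize(word):
    """Lowercase, strip accents, and map look-alike digits to letters."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    # Drop the combining marks left behind by the decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.translate(LOOKALIKES)


def letter_similarity(a, b):
    """Compare letter counts only, ignoring letter order (0.0 to 1.0)."""
    ca, cb = Counter(normalize(a)), Counter(normalize(b))
    total = sum((ca | cb).values())  # per-letter maximum counts
    if total == 0:
        return 0.0
    return sum((ca & cb).values()) / total  # per-letter minimum counts


def match_token(word, known_tokens, threshold=0.8):
    """Return the most similar known token, or None if below threshold."""
    best = max(known_tokens, key=lambda t: letter_similarity(word, t),
               default=None)
    if best is not None and letter_similarity(word, best) >= threshold:
        return best
    return None
```

Because letter order is ignored, anagrams of a known token also 
match, so a real implementation would probably need a stricter 
measure or a list of words to leave alone.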

Misspellings may be more common in subject lines than 
bodies, so this feature could possibly be applied only 
to the subject line and not the body of the email.

Here's another kind of misspelling that would be even 
tougher to decipher: "Do u Want 
M:or:eInt:ense:Org:as:ms&3"inWe:eks?"  To tackle this, 
we'd need to detect the breaks between words, which 
are marked either by capitalization, or by the insertion of 
symbols or punctuation marks.
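
A rough sketch of that kind of break detection, applied to the 
example above.  The rules here are guesses made for illustration 
only: a single stray symbol inside a word is treated as filler 
and dropped, a run of two or more non-letters is treated as a 
word break, and a capital letter after a lowercase letter starts 
a new word.  The function name split_obfuscated is hypothetical.

```python
import re


def split_obfuscated(text):
    """Split obfuscated text into candidate words (heuristic sketch)."""
    words = []
    for chunk in text.split():
        # Assumption: two or more non-letters in a row mark a word break.
        chunk = re.sub(r"[^A-Za-z]{2,}", " ", chunk)
        # Assumption: any remaining single non-letter is filler.
        chunk = re.sub(r"[^A-Za-z ]", "", chunk)
        # A capital letter right after a lowercase letter starts a new word.
        chunk = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk)
        words.extend(chunk.split())
    return words


subject = 'Do u Want M:or:eInt:ense:Org:as:ms&3"inWe:eks?'
print(split_obfuscated(subject))
```

Note that these rules throw away digits, so the "3" in the 
example is lost, and legitimate mixed-case words would be split 
too.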

These features might be tricky to implement or resource-
intensive to run, but I think they could greatly improve 
functionality.

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2005-05-23 16:24

Message:
Logged In: YES 
user_id=552329

Dupe of 

[ 817813 ] Consider bad spelling a sign of spam
<https://sourceforge.net/tracker/?group_id=61702&atid=498106&func=detail&aid=817813>

----------------------------------------------------------------------
