[spambayes-bugs] [ spambayes-Feature Requests-1206796 ] Catch intentional mispellings
SourceForge.net
noreply at sourceforge.net
Mon May 23 06:45:34 CEST 2005
Feature Requests item #1206796, was opened at 2005-05-23 00:12
Message generated for change (Comment added) made by matthew_levine
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1206796&group_id=61702
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Closed
Priority: 5
Submitted By: Matt (matthew_levine)
Assigned to: Nobody/Anonymous (nobody)
Summary: Catch intentional mispellings
Initial Comment:
Most of the spam I receive has many of its key words
intentionally misspelled to throw off spam filters. If
spammers always used the same misspellings,
SpamBayes would catch them just fine, but spammers
are smart enough to vary the way they misspell words.
Also, if many different versions of the same word are in
the spam database, the word's spam association is
greatly weakened.
I think it would help if SpamBayes could recognize
words as variants of other words and count them as the
same token. One way to do this, for words composed
primarily of ASCII characters, might be to replace
zeros and ones with 'o's and 'i's, and replace any accented
characters or other symbols with the normal letters that
they resemble. Then, instead of requiring the letters
of the word to be in order, count the quantity of each
letter in the word, and if the letter counts are more than a
certain percentage similar to those of a known spam token,
count the email as having that token.
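As a rough illustration of the idea above, here is a minimal
Python sketch. Nothing in it is SpamBayes code: the function
names, the look-alike table (only 0 -> o and 1 -> i come from
this request; the other entries are my guesses), and the 0.8
threshold are all assumptions made up for the example.

```python
import unicodedata
from collections import Counter

# Hypothetical look-alike table. Only "0" and "1" are named in the
# request; "!", "@" and "$" are assumed examples of "other symbols".
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "!": "i", "@": "a", "$": "s"})


def normalize(word):
    """Lowercase, strip accents, map look-alikes, drop remaining symbols."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_accents.translate(LOOKALIKES) if ch.isalpha())


def letter_similarity(a, b):
    """Fraction of letters the two words share, ignoring letter order."""
    count_a, count_b = Counter(normalize(a)), Counter(normalize(b))
    longest = max(sum(count_a.values()), sum(count_b.values()))
    if longest == 0:
        return 0.0
    shared = sum((count_a & count_b).values())  # multiset intersection
    return shared / longest


def match_known_token(word, known_tokens, threshold=0.8):
    """Return the most similar known token, or None if below the threshold."""
    best = max(known_tokens, key=lambda t: letter_similarity(word, t), default=None)
    if best is not None and letter_similarity(word, best) >= threshold:
        return best
    return None


print(match_known_token("C!al1s", ["cialis", "viagra"]))
```

One cost of ignoring letter order is that anagrams become
indistinguishable: letter_similarity("silica", "cialis") is 1.0,
so a real implementation would need some guard against ordinary
words that happen to share letters with a spam token.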
Misspellings may be more common in subject lines than
bodies, so this feature could possibly be applied only to
the subject line and not to the body of the email.
Here's another kind of misspelling that would be even
tougher to decipher: "Do u Want
M:or:eInt:ense:Org:as:ms&3"inWe:eks?" To tackle this,
we'd need to detect the breaks between words, which
are marked either by capitalization, or by the insertion of
symbols or punctuation marks.
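A sketch of that word-break detection, again in Python and
again purely illustrative. It assumes that ':' is filler
inserted inside words (as in the example subject) and that any
other punctuation or a capital letter starts a new word. The
FILLER set and the function name are hypothetical.

```python
import re

# Assumed filler characters that spammers insert inside words.
FILLER = ":"

# A word is a capital followed by lowercase letters, a run of
# lowercase letters, or a run of digits. Anything else is a break.
WORD = re.compile(r"[A-Z][a-z]*|[a-z]+|[0-9]+")


def split_run_together(text):
    """Remove filler characters, then split on capitals and punctuation."""
    cleaned = "".join(ch for ch in text if ch not in FILLER)
    return WORD.findall(cleaned)


print(split_run_together('Do u Want M:or:eInt:ense:Org:as:ms&3"inWe:eks?'))
```

With these assumptions the example subject comes out as Do, u,
Want, More, Intense, Orgasms, 3, in, Weeks. The same rule would
break an all-capitals word such as "FREE" into single letters,
so it is only a starting point.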
These features might be tricky to implement or resource-
intensive to run, but I think they could greatly improve
functionality.
----------------------------------------------------------------------
>Comment By: Matt (matthew_levine)
Date: 2005-05-23 00:45
Message:
Logged In: YES
user_id=1283553
I don't think it's quite the same as the other feature request.
That one says that the presence of misspelled or
non-dictionary words should be a sign of spam. I'm saying that
misspelled words should be treated as if they were spelled
correctly, so SpamBayes will know that "C!al1s" is not a new
word, but a word that's been in 500 spam messages.
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2005-05-23 00:24
Message:
Logged In: YES
user_id=552329
Dupe of
[ 817813 ] Consider bad spelling a sign of spam
<https://sourceforge.net/tracker/?group_id=61702&atid=498106&func=detail&aid=817813>
----------------------------------------------------------------------