[spambayes-bugs] [ spambayes-Feature Requests-854705 ] Detect "line noise" in subject and body

Fri Dec 5 09:11:58 EST 2003

Feature Requests item #854705, was opened at 2003-12-05 12:58
Message generated for change (Comment added) made by richiehindle
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=854705&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Julian Morrison (julianm)
Assigned to: Nobody/Anonymous (nobody)
Summary: Detect "line noise" in subject and body

Initial Comment:
Spell check words in the message subject and body,
generate tokens for the count of misspellings in each.
Perhaps also generate tokens for the ratio of
incorrect/correct spellings? This could be chunked to
make it easier to train, e.g.: all, more than half, about
half, less than half, none. These should be separate
for subject and for body, since garble in the header is
very predictive of spam.
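
A minimal sketch of the count and chunked-ratio tokens, in Python.
The function name, the token format, the bucket thresholds and the
use of a plain set as the "dictionary" are all assumptions made for
illustration; none of this is existing spambayes code.

```python
def misspelling_tokens(words, dictionary, section):
    """Yield a count token and a bucketed ratio token for one section.

    `dictionary` is a hypothetical set of correctly spelled lowercase
    words; `section` is a label such as "subject" or "body".
    """
    words = [w.lower() for w in words]
    misspelled = sum(1 for w in words if w not in dictionary)
    yield "%s:misspelled-count:%d" % (section, misspelled)
    if not words:
        return  # no ratio for an empty section
    ratio = misspelled / len(words)
    # Arbitrary thresholds for the five chunks named in the request.
    if ratio == 0:
        bucket = "none"
    elif ratio < 0.4:
        bucket = "less-than-half"
    elif ratio <= 0.6:
        bucket = "about-half"
    elif ratio < 1:
        bucket = "more-than-half"
    else:
        bucket = "all"
    yield "%s:misspelled-ratio:%s" % (section, bucket)
```

Calling it once with section "subject" and once with "body" keeps
the two sets of tokens separate, as asked for above.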

Also, there has to be some way to look for words with
"impossible to pronounce" consonant clusters such as
"dvgkbm". Could spambayes be made to look for
"syllables"? E.g. by parsing words into syllables and
generating tokens for each? I'm not sure there's a
parsing technique that's sufficiently
internationalized.  Perhaps even just generating tokens
for ASCII consonant clusters would be better than nothing.
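
The ASCII-only fallback could be sketched roughly like this in
Python. The minimum run length of four, the choice to treat "y" as
a vowel, and the "cluster:" token prefix are assumptions for
illustration only, not anything spambayes currently does.

```python
import re

# Runs of four or more ASCII consonants ("y" counted as a vowel here).
CLUSTER_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")

def consonant_cluster_tokens(text):
    """Yield one token per long consonant run found in the text."""
    for word in re.findall(r"[a-z]+", text.lower()):
        for cluster in CLUSTER_RE.findall(word):
            yield "cluster:" + cluster
```

For example, list(consonant_cluster_tokens("Buy dvgkbm now"))
gives ["cluster:dvgkbm"]. Note that at this threshold an ordinary
word like "strengths" also produces a token ("cluster:ngths"), so
the classifier would have to learn which clusters matter.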

----------------------------------------------------------------------

>Comment By: Richie Hindle (richiehindle)
Date: 2003-12-05 14:11

Message:
Logged In: YES 
user_id=85414

What's the difference between the tokeniser spitting
out "xmlrpc" and spitting out "unpronounceable:xmlrpc"?
That doesn't make any difference.  The difference is when
you "generate tokens for the count of misspellings" (or
unpronounceables) - then your system starts to decide
that high unpronounceable counts are spammy, and techie
messages get more spammy.  (Unless the tech-speak
outweighs the spam garbage, but even we're not *that*
techie!)

----------------------------------------------------------------------

Comment By: Julian Morrison (julianm)
Date: 2003-12-05 14:04

Message:
Logged In: YES 
user_id=21754

Hmm, would it not merely learn token
"unpronounceable:xmlrpc" as a ham indicator?

Also, as a spellcheck hack: words that are already
recognised tokens, and are ham indicators, should not count
as misspelled even if the spell check rejects them. This
would then quickly learn not to add "xmlrpc" into the
misspelled-words count and ratio.
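
A rough Python sketch of that hack. Here `spam_prob` is a
hypothetical dict mapping already-trained tokens to their spam
probability, and the 0.2 cutoff for "ham indicator" is an arbitrary
assumption; neither is the real spambayes classifier interface.

```python
def count_misspellings(words, dictionary, spam_prob, ham_cutoff=0.2):
    """Count words the spell check rejects, skipping known ham clues."""
    count = 0
    for word in words:
        word = word.lower()
        if word in dictionary:
            continue  # spelled correctly
        # Unknown tokens default to a neutral 0.5 and so are counted.
        if spam_prob.get(word, 0.5) <= ham_cutoff:
            continue  # already trained as a ham indicator
        count += 1
    return count
```

With "xmlrpc" trained at, say, 0.05, it stops contributing to the
misspelled-words count, while an untrained garble like "dvgkbm"
still does.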

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-12-05 13:53

Message:
Logged In: YES 
user_id=85414

We spambayes developers spend a lot of time talking
about smtp, pop3, cdo, mapi, tcpip, http, html, py2exe,
rfc822, chi2, kmail, ie, oe, xmlrpc, bsddb...

Now those things would be trained as ham clues, but
your scheme would dilute them.  I'm not saying it's a
bad idea, but just because something is unpronounceable
and not in the dictionary doesn't make it the same class
of thing as all the other tokens which are unpronounceable
and not in the dictionary.

----------------------------------------------------------------------
