[spambayes-bugs] [ spambayes-Feature Requests-1242708 ] Counter-counter-spam filtering suggestions

Mon Jul 25 03:34:25 CEST 2005

Feature Requests item #1242708, was opened at 2005-07-22 12:11
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1242708&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Pending
Priority: 5
Submitted By: Mark Storer (mstorer3772)
Assigned to: Nobody/Anonymous (nobody)
Summary: Counter-counter-spam filtering suggestions

Initial Comment:
My experience is that the majority of spam that gets
around filtering involves lots of deliberate
misspellings, either by add1ng or ins^ertin*g
non-le++er ch at racters, thro wing in sp aces wher e t
hey do n't belon g, or
ByUsingTitleCaseToSeperateWordsRatherThanSpaces.

Ditching spaces 

There are several possible workarounds to this:

1) Drop all non-letters, including spaces, and evaluate
the resulting monolithic string.  Downside: More
computationally expensive, as the list of possible
matches increases dramatically for each segment of the
monolith, and you have to test each segment against
multiple lengths.  O(n^2) might be generous.

2) Attempt to merge adjacent tokens to see if they
qualify as spam (or ham I suppose).  This sounds more
like a O(n) operation, but would only stamp out the
"additional spaces" method.

Downside: Again, more CPU time, but to a lesser extent
than #1.  Defeated by any spam that does not use the
"add spaces" technique.

3) Treat all new words as having a low positive spam
rating of some sort.  Each newly encountered
misspelling would be initially biased towards spam.

4) Add a spelling checker.  New misspelled words have a
slightly-spam rating (outside training).

Downside: Big data file tacked onto your otherwise
light-weight plugin/app/thingy.

One concern with #3 and #4 is how they would react to
an email containing source code of whatever language. 
Variable and function names are infrequently found in a
dictionary (as you're no doubt aware).
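
As a rough illustration of suggestion #1, the sketch below collapses
a message to letters only and then tries to split the result back
into known words with a simple dynamic-programming pass.  It is not
SpamBayes code: the function names, the max_len cutoff, and the tiny
vocabulary are all hypothetical, and a real filter would need some
word list or trained token set to stand in for `vocabulary`.

```python
import re

def collapse(text):
    """Lowercase the text and drop everything except ASCII letters."""
    return re.sub(r"[^a-z]", "", text.lower())

def segment(blob, vocabulary, max_len=12):
    """Split blob into known words, or return None if that is impossible.

    best[i] holds one list of words that exactly covers blob[:i],
    or None when no such split has been found.
    """
    best = [None] * (len(blob) + 1)
    best[0] = []
    for end in range(1, len(blob) + 1):
        # Only look back max_len characters (hypothetical cutoff).
        for start in range(max(0, end - max_len), end):
            piece = blob[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(blob)]

# Hypothetical vocabulary, for illustration only.
vocabulary = {"buy", "cheap", "pills", "online"}
print(segment(collapse("b u y ch*eap p-i-l-l-s On Line"), vocabulary))
```

With a bounded word length the pass does about n * max_len lookups
rather than n^2, but it only helps when the obfuscation adds
characters.  Substitutions such as "add1ng" collapse to "addng",
which still matches nothing.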

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2005-07-25 13:34

Message:
Logged In: YES 
user_id=552329

Something like #2 (but better) is done by the use_bigrams
option.  This is an experimental option in 1.0.x, and a
regular option in 1.1.x.  You can enable it and see how you
like it.
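
For readers unfamiliar with the idea, here is a minimal sketch of
generating extra tokens from adjacent pairs in a single pass.  It is
an assumption-laden illustration, not the SpamBayes tokenizer: the
function name and the "merged:" and "pair:" prefixes are made up
here.  The merged form corresponds to suggestion #2 (rejoining
"sp" + "aces"), while the paired form keeps both words and is closer
in spirit to a bigram.

```python
def with_adjacent_pairs(tokens):
    """Yield every token, plus two combined forms of each adjacent pair."""
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # Hypothetical prefixes keep combined tokens distinct
            # from ordinary single-word tokens.
            yield "merged:" + previous + token
            yield "pair:" + previous + " " + token
        previous = token

print(list(with_adjacent_pairs(["sp", "aces"])))
```

Each token is visited once, so the extra work is linear in the
number of tokens, at the price of a larger token database.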

You can change the value an unknown token is assigned.  This
is the unknown_word_prob option.  Experimental testing
indicated that the current value of 0.5 gives the best results.
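
To show why this value matters for suggestion #3, the sketch below
blends a prior for unseen tokens with the observed counts, in the
style of Gary Robinson's smoothing.  Treat it as an assumption about
how such a prior is typically applied, not as a copy of the SpamBayes
classifier: the function name, the strength parameter and its 0.45
default are not taken from this thread, and the training counts
nspam and nham are assumed to be positive.

```python
def adjusted_spam_prob(spam_count, ham_count, nspam, nham,
                       unknown_word_prob=0.5,
                       unknown_word_strength=0.45):
    """Return a smoothed spam probability for one token.

    spam_count / ham_count: messages of each kind containing the token.
    nspam / nham: total trained messages of each kind (assumed > 0).
    """
    n = spam_count + ham_count
    if n == 0:
        # A never-seen token gets exactly the prior.
        return unknown_word_prob
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    s = unknown_word_strength  # weight given to the prior (assumed name)
    return (s * unknown_word_prob + n * raw) / (s + n)

print(adjusted_spam_prob(0, 0, 100, 100))
print(adjusted_spam_prob(0, 0, 100, 100, unknown_word_prob=0.6))
```

Raising the prior above 0.5 would bias every new word, misspelled or
not, slightly towards spam, which is the effect suggestion #3 asks
for.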

Various testing has been done with spell checking/adding
tokens for words not in a dictionary.  None have shown any
improvement.

I don't understand what you mean by #1.  If you drop all
spaces, you are left with one token per email body.  This
will only match identical mail, which will certainly
not help.  Or are you planning on splitting up the token
somehow?

----------------------------------------------------------------------
