[spambayes-bugs] [ spambayes-Feature Requests-1242708 ] Counter-counter-spam filtering suggestions

SourceForge.net noreply at sourceforge.net
Fri Jul 22 02:11:05 CEST 2005


Feature Requests item #1242708, was opened at 2005-07-21 17:11
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1242708&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Mark Storer (mstorer3772)
Assigned to: Nobody/Anonymous (nobody)
Summary: Counter-counter-spam filtering suggestions

Initial Comment:
My experience is that the majority of spam that gets
around filtering involves lots of deliberate
misspellings, either by add1ng or ins^ertin*g
non-le++er ch at racters, thro wing in sp aces wher e t
hey do n't belon g, or
ByUsingTitleCaseToSeperateWordsRatherThanSpaces.

Ditching spaces 

There are several possible workarounds to this:

1) Drop all non-letters, including spaces, and evaluate
the resulting monolithic string.  Downside: More
computationally expensive, as the list of possible
matches increases dramatically for each segment of the
monolith, and you have to test each segment against
multiple lengths.  O(n^2) might be generous.
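
A rough sketch of suggestion 1, to make the cost concrete.
Everything here is an assumption for illustration: the
helper names, the max_len cap, and the tiny vocabulary are
made up, and none of it is SpamBayes code.

```python
import re

def collapse(text):
    """Drop everything except ASCII letters and lowercase the result."""
    return re.sub(r"[^A-Za-z]", "", text).lower()

def known_substrings(monolith, vocabulary, max_len=12):
    """Return (start, word) for every vocabulary word found in the monolith.

    Every start position is tested against every length up to max_len,
    which is the repeated work the downside above refers to.
    """
    hits = []
    for start in range(len(monolith)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(monolith):
                break
            piece = monolith[start:end]
            if piece in vocabulary:
                hits.append((start, piece))
    return hits

vocabulary = {"buy", "cheap", "pills"}  # hypothetical trained tokens
monolith = collapse("b-u-y ch*eap p i l l s")
hits = known_substrings(monolith, vocabulary)
```

Capping the candidate length keeps the work proportional to
the message length times max_len.  Without a cap, every
start/end pair has to be tried, which is where the
quadratic behaviour comes from.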

2) Attempt to merge adjacent tokens to see if they
qualify as spam (or ham I suppose).  This sounds more
like an O(n) operation, but would only stamp out the
"additional spaces" method.

Downside: Again, more CPU time, but to a lesser extent
than #1.  Defeated by spammers simply not using the
"add spaces" technique.
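
A minimal sketch of suggestion 2, assuming the message has
already been split on whitespace.  The function name, the
max_merge limit, and the vocabulary are hypothetical; this
is not how SpamBayes tokenizes.

```python
def merged_candidates(tokens, vocabulary, max_merge=3):
    """Return known words formed by joining 2..max_merge adjacent tokens."""
    found = []
    for i in range(len(tokens)):
        for width in range(2, max_merge + 1):
            if i + width > len(tokens):
                break
            joined = "".join(tokens[i:i + width]).lower()
            if joined in vocabulary:
                found.append(joined)
    return found

tokens = "thro wing in sp aces wher e".split()
vocabulary = {"throwing", "spaces", "where"}  # hypothetical trained tokens
found = merged_candidates(tokens, vocabulary)
```

With max_merge fixed, each token is joined with only a
constant number of neighbours, so the extra work grows
linearly with the number of tokens.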

3) Treat all new words as having a low positive spam
rating of some sort.  Each newly encountered
misspelling would be initially biased towards spam.

4) Add a spelling checker.  New misspelled words have a
slightly-spam rating (outside training).

Downside: Big data file tacked onto your otherwise
light-weight plugin/app/thingy.


One concern with #3 and #4 is how they would react to
an email containing source code of whatever language. 
Variable and function names are infrequently found in a
dictionary (as you're no doubt aware).
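
Suggestions 3 and 4, and the source-code concern, can be
sketched together.  The constants and the function below
are assumptions for illustration only: the bias is not a
tuned value and this is not the SpamBayes classifier API.

```python
NEUTRAL = 0.5    # assumed neutral spam probability for an unseen token
SPAM_BIAS = 0.1  # hypothetical nudge towards spam; not a tuned value

def unseen_token_prob(token, trained, dictionary=None):
    """Return a starting spam probability, or None if the token is trained.

    dictionary=None models suggestion 3 (every new token is nudged).
    Passing a set of correctly spelled words models suggestion 4
    (only tokens missing from the dictionary are nudged).
    """
    if token in trained:
        return None  # real statistics exist, so no prior is needed
    if dictionary is None or token.lower() not in dictionary:
        return NEUTRAL + SPAM_BIAS
    return NEUTRAL

trained = {"meeting"}                        # tokens seen in training
dictionary = {"meeting", "invoice", "copy"}  # stand-in for a word list

new_word_s3 = unseen_token_prob("invoice", trained)              # nudged
new_word_s4 = unseen_token_prob("invoice", trained, dictionary)  # neutral
misspelled = unseen_token_prob("ins^ertin*g", trained, dictionary)
identifier = unseen_token_prob("strncpy", trained, dictionary)
```

The last line shows the concern: a function name such as
strncpy is not in the dictionary, so it receives the same
spam nudge as a deliberate misspelling.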

----------------------------------------------------------------------


