[spambayes-bugs] [ spambayes-Feature Requests-1242708 ] Counter-counter-spam filtering suggestions
SourceForge.net
noreply at sourceforge.net
Mon Jul 25 03:34:25 CEST 2005
Feature Requests item #1242708, was opened at 2005-07-22 12:11
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1242708&group_id=61702
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Pending
Priority: 5
Submitted By: Mark Storer (mstorer3772)
Assigned to: Nobody/Anonymous (nobody)
Summary: Counter-counter-spam filtering suggestions
Initial Comment:
My experience is that the majority of spam that gets
around filtering involves lots of deliberate
misspellings, either by add1ng or ins^ertin*g
non-le++er ch at racters, thro wing in sp aces wher e t
hey do n't belon g, or
ByUsingTitleCaseToSeperateWordsRatherThanSpaces.
Ditching spaces
There are several possible workarounds to this:
1) Drop all non-letters and spaces, evaluating the
resulting monolithic string. Downside: More
computationally expensive, as the list of possible
matches increases dramatically for each segment of the
monolith, and you have to test each segment against
multiple lengths. O(n^2) might be generous.
2) Attempt to merge adjacent tokens to see if they
qualify as spam (or ham I suppose). This sounds more
like a O(n) operation, but would only stamp out the
"additional spaces" method.
Downside: Again, more CPU time, but to a lesser extent
than #1. Defeated by not using the "add spaces" technique.
3) Treat all new words as having a low positive spam
rating of some sort. Each newly encountered
misspelling would be initially biased towards spam.
4) Add a spelling checker. New misspelled words have a
slightly-spam rating (outside training).
Downside: Big data file tacked onto your otherwise
light-weight plugin/app/thingy.
One concern with #3 and #4 is how they would react to
an email containing source code of whatever language.
Variable and function names are infrequently found in a
dictionary (as you're no doubt aware).
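To make workaround #2 concrete, here is a minimal sketch of merging adjacent tokens and keeping the merged form only when it is a known word. The function name merge_adjacent, the max_parts limit, and the tiny vocabulary are all hypothetical and chosen for illustration; this is not how SpamBayes tokenizes mail.

```python
def merge_adjacent(tokens, vocabulary, max_parts=3):
    """Return known words formed by joining 2..max_parts neighbouring tokens.

    Hypothetical helper for illustration only. With max_parts fixed,
    the work grows linearly with the number of tokens.
    """
    merged = []
    for start in range(len(tokens)):
        joined = tokens[start]
        # Extend the run one neighbour at a time, up to max_parts tokens.
        for end in range(start + 1, min(start + max_parts, len(tokens))):
            joined += tokens[end]
            if joined.lower() in vocabulary:
                merged.append(joined.lower())
    return merged


# Assumed toy vocabulary; a real filter would use its trained token database.
vocabulary = {"throwing", "spaces", "where", "they", "belong"}
tokens = "thro wing in sp aces wher e t hey".split()
print(merge_adjacent(tokens, vocabulary))
```

As noted in #2, this only helps against inserted spaces; a message that never splits words produces no merged tokens at all.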
----------------------------------------------------------------------
>Comment By: Tony Meyer (anadelonbrin)
Date: 2005-07-25 13:34
Message:
Logged In: YES
user_id=552329
Something like #2 (but better) is done by the use_bigrams
option. This is an experimental option in 1.0.x, and a
regular option in 1.1.x. You can enable it and see how you
like it.
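As a rough illustration of the general idea behind token bigrams (not the actual SpamBayes code), each pair of neighbouring tokens can be emitted as one extra token, so that a split word such as "sp aces" is learned as a unit. The function name and the "bi:" prefix below are assumptions made for this sketch.

```python
def unigrams_and_bigrams(tokens):
    """Return the original tokens followed by one token per adjacent pair.

    Illustrative sketch only; the "bi:" prefix is an assumed convention
    that keeps pair tokens distinct from single-word tokens.
    """
    tokens = list(tokens)
    pairs = ["bi:%s %s" % (a, b) for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs


print(unigrams_and_bigrams(["sp", "aces", "here"]))
```

Unlike merging tokens against a word list, this needs no dictionary: the pair tokens simply pick up spam or ham scores through ordinary training.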
You can change the value an unknown token is assigned. This
is the unknown_word_prob option. Experimental testing
indicated that the current value of 0.5 gives the best results.
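For context, the effect of such a value can be sketched with the Robinson-style smoothing commonly used by Bayesian filters of this kind: a token that has never been seen gets exactly the prior, and the prior fades as evidence accumulates. The function name, the unknown_word_strength parameter, and its 0.45 default are assumptions for this sketch; only the 0.5 prior comes from the comment above.

```python
def adjusted_prob(spam_count, ham_count, nspam, nham,
                  unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Smoothed spam probability for one token (illustrative sketch).

    spam_count / ham_count: messages of each kind containing the token.
    nspam / nham: total trained messages of each kind.
    """
    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: fall back to the prior.
        return unknown_word_prob
    spam_ratio = spam_count / max(nspam, 1)
    ham_ratio = ham_count / max(nham, 1)
    raw = spam_ratio / (spam_ratio + ham_ratio)
    s = unknown_word_strength
    # Weighted average of the prior (weight s) and the evidence (weight n).
    return (s * unknown_word_prob + n * raw) / (s + n)


print(adjusted_prob(0, 0, 100, 100))                         # unseen token
print(adjusted_prob(0, 0, 100, 100, unknown_word_prob=0.6))  # suggestion #3
print(adjusted_prob(1, 0, 100, 100))                         # seen once, in spam
```

Raising the prior above 0.5, as in suggestion #3, would tilt every new token toward spam, including legitimate rare words such as identifiers in source code.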
Various testing has been done with spell checking/adding
tokens for words not in a dictionary. None have shown any
improvement.
I don't understand what you mean by #1. If you drop all
spaces, you are left with one token per email body. This
will only match for identical mail - that will certainly
not help. Or are you planning on splitting up the token
somehow?
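One possible reading of #1 is that the monolithic string would be re-split against a list of known words. A minimal sketch of that idea follows, using dynamic programming so that each position only tries words up to max_len letters long, which keeps the number of dictionary lookups at roughly n * max_len rather than n^2. The function name, the max_len limit, and the toy vocabulary are hypothetical; nothing here is taken from SpamBayes.

```python
import re


def segment(text, vocabulary, max_len=12):
    """Drop non-letters, then split the result into known words.

    Illustrative sketch: minimises the number of letters not covered by
    a known word, then the number of pieces. Letters that fit no known
    word come back as single-letter pieces.
    """
    letters = re.sub(r"[^a-z]", "", text.lower())
    # best[i] = (unknown_letters, piece_count, pieces) for letters[:i]
    best = [(0, 0, [])]
    for i in range(1, len(letters) + 1):
        # Fallback: treat the letter just before position i as unknown.
        unknown, count, pieces = best[i - 1]
        candidate = (unknown + 1, count + 1, pieces + [letters[i - 1]])
        for j in range(max(0, i - max_len), i):
            word = letters[j:i]
            if word in vocabulary:
                unknown, count, pieces = best[j]
                option = (unknown, count + 1, pieces + [word])
                if option[:2] < candidate[:2]:
                    candidate = option
        best.append(candidate)
    return best[-1][2]


# Assumed toy vocabulary, including a few distractors.
vocabulary = {"inserting", "spaces", "where", "they", "dont", "belong",
              "space", "the", "here"}
print(segment("ins^ertin*g sp aces wher e t hey do n't belon g", vocabulary))
```

This only undoes inserted characters and spaces. Substitutions such as "add1ng" lose a letter when non-letters are dropped, so they would still fail to match.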
----------------------------------------------------------------------