[spambayes-bugs] [ spambayes-Feature Requests-1206807 ] "Trojan text"

Mon May 23 06:39:33 CEST 2005

Feature Requests item #1206807, was opened at 2005-05-23 16:33
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1206807&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
Priority: 5
Submitted By: Matt (matthew_levine)
Assigned to: Nobody/Anonymous (nobody)
Summary: "Trojan text"

Initial Comment:
Some spam includes long sections of text from random 
sources, such as excerpts from classic novels or books of 
quotes, so that plenty of normal (i.e. hammy) words 
help the spam get past filters.  The actual spam content 
consists of URLs and possibly images.

An obvious solution would be to search the URLs for spam 
clues, and you already have this as an experimental 
feature.  However, that feature only works for emails that 
are below a certain threshold of tokens, and the phony 
text could easily push a message over that threshold.  So 
I suggest that either the feature should be able to check 
URLs in all messages, or it should also kick in when 
conditions are met that indicate the likely presence 
of "Trojan text," such as a high number of ham words 
along with linked images.  
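
The condition suggested above (mostly hammy words plus a linked 
image) could be sketched roughly as follows.  This is only an 
illustration under stated assumptions, not SpamBayes code: the 
function name, the thresholds, and the `token_probs` mapping from 
word to spam probability are all hypothetical.

```python
import re

# Hypothetical thresholds for illustration; these are not SpamBayes options.
HAM_CUTOFF = 0.2         # a word with spam probability below this counts as hammy
MIN_HAM_FRACTION = 0.8   # fraction of hammy words needed to suspect padding

# Matches an <img> tag whose src attribute points at a remote http(s) URL.
IMG_RE = re.compile(r'<img\b[^>]*\bsrc\s*=\s*["\']?https?://', re.IGNORECASE)


def looks_like_trojan_text(body, token_probs, unknown_prob=0.5):
    """Return True if body is mostly hammy words but carries a linked image.

    token_probs is an assumed dict mapping lowercase words to a spam
    probability in [0, 1]; words missing from it get unknown_prob.
    """
    words = re.findall(r"[a-z']+", body.lower())
    if not words:
        return False
    hammy = sum(1 for w in words if token_probs.get(w, unknown_prob) < HAM_CUTOFF)
    ham_fraction = hammy / len(words)
    has_linked_image = bool(IMG_RE.search(body))
    return ham_fraction >= MIN_HAM_FRACTION and has_linked_image
```

A message flagged this way would then be a candidate for the URL 
check regardless of how many tokens it contains.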

Additionally, I suggest that when this feature causes a 
message to be classified as spam, SpamBayes should 
not be spam-trained on the "Trojan text," because it was 
inserted specifically to throw off spam filters, so the filter 
should work better if that text is ignored.
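
One way to read this suggestion is to drop the strongly hammy tokens 
before spam-training and keep the rest.  The sketch below assumes a 
plain list of tokens and the same kind of hypothetical `token_probs` 
mapping (token to spam probability); it does not use the SpamBayes 
training API, and the function name and cutoff are made up for 
illustration.

```python
def tokens_for_spam_training(tokens, token_probs, ham_cutoff=0.2, unknown_prob=0.5):
    """Return the tokens to spam-train on, skipping strongly hammy ones.

    Tokens whose current spam probability is below ham_cutoff are treated
    as likely "Trojan text" and left out; unknown tokens are kept.
    """
    return [t for t in tokens if token_probs.get(t, unknown_prob) >= ham_cutoff]
```

For example, with "whale" scored as strongly hammy, only the URL 
token and the spammy word would be passed on for training.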

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2005-05-23 16:39

Message:

The experimental URL slurping options (available with 1.0.4
or 1.1a1) do more or less what you describe.  Please
feel free to try them out, suggest any specific
improvements, and let us know whether or not they
improve your results.

Identifying text that doesn't fit with the message is fairly
complicated - DSPAM has a "noise" detection algorithm that
does this.  We may try this at some point.

----------------------------------------------------------------------