[spambayes-dev] Interesting unsure

Skip Montanaro skip at pobox.com
Wed Jun 25 16:37:53 EDT 2003


I got an interesting spam just now.  Besides the very short message body,
which was little more than an <a> tag wrapping an <img> tag:

    doorknob WI-6-5RJ0 molder y-nbs<br>
    <a href="http://jjoelle@www.beachballboy.com/fun/"><img src="http://njackson73@www.superfuntimes.net/jkbie/lobw.gif"></a><br>
    e--HRdW detect 0O-Q-Xmn marijuana

(interesting nonsense words - doorknob? molder?)

the subject had umlauts over many of the vowels:

    ousp Wänt to makë löve likë a teën?

so of course, I got several tokens which the classifier ignored.  The debug
and classification headers were

    X-Spambayes-Debug: '*H*': 0.21; '*S*': 0.66; 'doorknob': 0.09;
	    'subject:?': 0.23; 'detect': 0.26; 'header:Message-ID:1': 0.37;
	    'header:Reply-To:1': 0.61; 'url:com': 0.61; 'url:www': 0.67;
	    'header:Received:2': 0.76; 'subject:\xf6': 0.84;
	    'content-type:text/html': 0.87; 'url:gif': 0.93
    X-Spambayes-Classification: unsure; 0.73

It's not clear much can be done, though it might be interesting to try an
option to map Latin-1 accented characters to their unadorned ASCII
counterparts, at least in subjects (strip_subject_accents?).  For instance,
"subject:teen" and "subject:love" are both pretty spammy in my database but
"subject:teën" and "subject:löve" don't occur at all.  Even "subject:make"
is more spammy than hammy.
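
Here is a rough sketch of what such a mapping might look like.  It is not
SpamBayes code: strip_accents() and subject_tokens() are hypothetical
helpers, strip_subject_accents is only the option name floated above, and
the whitespace split is a simplified stand-in for the real subject
tokenizer.  It assumes the subject has already been decoded to a Unicode
string.

```python
import unicodedata


def strip_accents(text):
    """Map accented characters to their unadorned base characters."""
    # NFKD decomposition splits e.g. "ë" into "e" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def subject_tokens(subject, strip_subject_accents=False):
    """Yield 'subject:' tokens; a simplified stand-in, not the real tokenizer."""
    if strip_subject_accents:
        subject = strip_accents(subject)
    # Assumption: split on whitespace and lowercase each word.
    return ["subject:" + word.lower() for word in subject.split()]


if __name__ == "__main__":
    print(strip_accents("Wänt to makë löve likë a teën?"))
    print(subject_tokens("löve teën", strip_subject_accents=True))
```

With the option on, "löve" and "teën" would produce the same tokens as
"love" and "teen", so they would pick up the existing spam statistics.
Characters with no decomposition, such as "ø" or "ß", pass through
unchanged, so this only covers part of Latin-1.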

The trouble with running such an experiment isn't whether it would be
worthwhile.  If this is a new spammer technique, few messages in our
existing spam/ham databases will exercise it, so a test run would show
little either way.

Skip


