[Spambayes] Are they learning?

Thu Feb 20 23:30:33 EST 2003

[Kaitlin Duck Sherwood]
> If the spammers ever get too clever for a purely word-based approach,
> then it would be easy to toss in the ratio of
> 	non-letter characters (perl \W) : letter characters (perl \w)
> and/or
> 	characters inside HTML tags : characters outside HTML
> and/or
> 	number of spaces : total length of message
> as features.
>
> I believe that those ratios will do a good job of spotting messages
> that have wildly different "eye space" and "ASCII space" presentations.

In unreported early experiments, I generated a token for the ratio of number
of bytes to number of whitespace-separated "words" in a msg.  A high ratio
was a very strong spam indicator.  I left the code out, though, because it
made no difference to overall error rates in testing:  whatever it was
latching on to was already covered by other stuff at the time.  Like many
other gimmicks, it also over-penalized HTML msgs *just* for using HTML at
all.
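
A minimal sketch of what such a ratio token could look like, for
illustration only; it is not the code from those experiments.  The
function name, the token prefix "bytes-per-word:", the UTF-8 byte count,
and the log2 bucketing are all assumptions made here so that nearby
ratios collapse into one token.

```python
import math

def bytes_per_word_token(msg_text):
    """Return one synthetic token for the bytes : words ratio of a msg.

    Hypothetical helper: the token format and the log2 bucketing are
    assumptions, not anything taken from the spambayes tokenizer.
    """
    # Count bytes rather than characters, as the experiment described.
    nbytes = len(msg_text.encode("utf-8", errors="replace"))
    # "Words" are simply whitespace-separated chunks.
    nwords = len(msg_text.split())
    if nwords == 0:
        return "bytes-per-word:none"
    ratio = nbytes / nwords
    # Each word has at least one byte, so ratio >= 1 and log2(ratio) >= 0.
    bucket = int(math.log2(ratio))
    return "bytes-per-word:2**%d" % bucket
```

The classifier would then see this string as one more token alongside the
ordinary word tokens, so a msg made of a few enormous "words" lands in a
high bucket.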

I expect that your suggested statistics would also be strong indicators, but
also possibly redundant.
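
For concreteness, the three suggested ratios could be computed roughly as
below.  This is a sketch under stated assumptions: the function and key
names are made up, "letter" means str.isalpha(), and HTML tags are found
with a crude regexp rather than a real HTML parser.

```python
import re

# Crude assumption: anything between "<" and the next ">" is a tag.
TAG_RE = re.compile(r"<[^>]*>")

def _ratio(numer, denom):
    # Hypothetical convention: 0/0 -> 0.0, n/0 -> infinity.
    if denom:
        return numer / denom
    return float("inf") if numer else 0.0

def ratio_features(text):
    """Return the three suggested ratios for a msg body, as a dict."""
    letters = sum(1 for ch in text if ch.isalpha())
    non_letters = len(text) - letters
    in_tags = sum(len(m.group()) for m in TAG_RE.finditer(text))
    outside_tags = len(text) - in_tags
    spaces = text.count(" ")
    return {
        "nonletter:letter": _ratio(non_letters, letters),
        "intag:outtag": _ratio(in_tags, outside_tags),
        "space:length": _ratio(spaces, len(text)),
    }
```

To use these as classifier features, each ratio would still have to be
bucketed into a discrete token, and whether that helps error rates is
something only testing can show.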

The msg Rob forwarded that kicked this off didn't impress me, just like
other msgs in the past playing goofy typographical tricks didn't impress me:
anything that makes an advertisement harder to read is going to reduce
response rate, so I don't expect such tricks to endure.  I've seen "stuff
like that" all along, but it's always been a very small percentage of the
spam I get.  I expect that Rob noticed it only because he got it from a
python.org mailing list, and spam from such lists is rare (so all the header
clues saying "it came from python.org" -- and there are many -- have very low
spamprobs).