[Spambayes] Spam that bypasses spambayes

Tim Peters tim.one at comcast.net
Fri Sep 12 00:08:18 EDT 2003


[Harri Pesonen]
> I had an idea a couple of weeks ago, that all url tokens should have
> more weight than other tokens. The spammer just wants you to click on
> some url, so the other text is not so important. They could even put
> random words there, and they have.

Code it and try it.  spambayes *used* to have fancier URL tokenization than
it has now, and results got better by simplifying it -- there's no
substitute for testing ideas in a statistical system, and *everything* you
try will have both good effects and bad effects (there are no pure wins).
The best you can hope for is that the good outweigh the bad across a large
variety of test sets, and there's no way to determine that without testing
on a large variety of test sets.
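
A minimal sketch of what "testing on a large variety of test sets" can boil
down to: run the old and new tokenizers over the same test sets, then count
where the change won, tied, or lost.  The function name and the error rates
below are made up for illustration and are not taken from any real run.

```python
def compare_runs(baseline, variant):
    """Count the test sets where the variant won, tied, or lost.

    Each argument is a list of error rates (say, false negative
    percentages), one per test set, in the same order.
    """
    if len(baseline) != len(variant):
        raise ValueError("need one result per test set in both runs")
    won = sum(1 for b, v in zip(baseline, variant) if v < b)
    lost = sum(1 for b, v in zip(baseline, variant) if v > b)
    tied = len(baseline) - won - lost
    return won, tied, lost


# Made-up numbers, purely to show the shape of the comparison.
before = [1.20, 0.95, 1.40, 1.10, 0.80]
after = [1.10, 0.95, 1.55, 1.00, 0.75]
print("won %d, tied %d, lost %d" % compare_runs(before, after))
```

A change that wins on a few sets and loses badly on one is exactly the mixed
result described above, which is why a single test set is not enough.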

> And maybe the url server address should be tokenized the same way as
> the address is tokenized in Received header. So the address below
> would yield
>
> url:biz
> url:gadgitz.biz
> url:www.gadgitz.biz
>
> Now it just does
>
> url:www
> url:gadgitz
> url:biz

I believe you're talking about

    http://www.gadgitz.biz/promo.php?id=93778

That actually generates 7 url tokens today:

    url:93778
    url:biz
    url:gadgitz
    url:id
    url:php
    url:promo
    url:www

and a

    proto:http

token.  Of these, the url:biz token has the highest spamprob in my database
today, and url:id isn't far behind it.  Curiously, url:promo has only a
slightly spammy spamprob for me.  There are no tokens in my database
containing the string gadgitz, so generating more tokens containing that
string wouldn't have helped.  Given that spammers lose their domains only
slightly less frequently than they lose their email addresses, loading the
database with more spammer domain names du jour doesn't sound like a good
bet either.
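
To make the two schemes concrete, here is a rough sketch that puts them side
by side.  It is an approximation written for this example, not the actual
spambayes tokenizer code: split_tokens() and suffix_tokens() are hypothetical
names, and splitting on every non-alphanumeric character is an assumption
that happens to reproduce the token list shown above.

```python
import re
from urllib.parse import urlsplit


def split_tokens(url):
    """Roughly today's behavior: break the URL at every non-alphanumeric."""
    proto, _, rest = url.partition("://")
    pieces = [p for p in re.split(r"[^A-Za-z0-9]+", rest) if p]
    return ["proto:" + proto] + sorted({"url:" + p for p in pieces})


def suffix_tokens(url):
    """The proposed scheme: one token per trailing part of the host name."""
    labels = urlsplit(url).hostname.split(".")
    return ["url:" + ".".join(labels[i:])
            for i in range(len(labels) - 1, -1, -1)]


url = "http://www.gadgitz.biz/promo.php?id=93778"
print(split_tokens(url))
print(suffix_tokens(url))
```

Of the three suffix tokens, only url:biz is independent of the particular
domain; the other two can score only if that exact domain has been trained on
before.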

> Maybe it should do both and have more weight that way. Also decode %
> encoding and find server names for ip addresses... :-)

I hesitate to put in anything by default that requires going off the local
machine (whether to suck down a web page or just to do a DNS lookup).  That
may be OK in an industrial-strength setting with industrial-strength
connectivity, but lots of users are stuck on slow dialup lines to sluggish
ISPs.  Controlling such stuff by options, disabled by default, would be
fine.
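
As a sketch of "controlled by options, disabled by default": decoding %
escapes is purely local and can always run, while the reverse DNS lookup
happens only if the caller asks for it.  host_tokens() and its resolve_ips
flag are hypothetical names invented for this example, not existing
spambayes options.

```python
import ipaddress
import socket
from urllib.parse import unquote, urlsplit


def is_ip(host):
    """Return True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def host_tokens(url, resolve_ips=False):
    """Tokenize a URL's host; touch the network only when asked to."""
    # Decoding % escapes needs no network access, so it is always done.
    host = unquote(urlsplit(url).hostname or "")
    tokens = ["url:" + host]
    if resolve_ips and is_ip(host):
        try:
            name = socket.gethostbyaddr(host)[0]  # reverse DNS lookup
        except OSError:
            name = None  # lookup failed; carry on without it
        if name:
            tokens.append("url:" + name)
    return tokens


print(host_tokens("http://www%2Egadgitz%2Ebiz/promo.php"))
```

With the default resolve_ips=False the function never leaves the local
machine, so a user on a slow dialup line pays nothing for the feature.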



