[Spambayes] Tokenizing ideas (images, attachments)

Harri Pesonen harri.pesonen at wicom.com
Wed Aug 27 11:00:58 EDT 2003


Why not tokenize image URLs? Quite often the message body is empty or
almost empty and contains only an image URL. Here is an example:

<html><body>
<center><!--kj3evc37dmn--><a
href="http://www.greatbizss3.com/host/default.asp?ID=omni"><img
src="http://clearsale12.com/pics/gv1.gif" height="270"
width="405"></a></center>
</html></body>

While SpamBayes detected this message just fine, it did so only from
tokens in the headers (mainly the subject). This example would have
produced the tokens:

image:clearsale12 (ignore com)
href:greatbizss3 (ignore www, com, but not biz)

Or combine both under a single href token.
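
A rough sketch of how this could look in Python, using only the standard
library. This is not SpamBayes code: the names UrlTokenParser, tokenize_urls
and IGNORED_LABELS are made up for illustration, and the ignore list simply
mirrors the "ignore www, com" notes above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Host labels that carry no information (assumption: just "www" and "com").
IGNORED_LABELS = {"www", "com"}


class UrlTokenParser(HTMLParser):
    """Collect href:/image: tokens from <a href> and <img src> attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._add("href", attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self._add("image", attrs["src"])

    def _add(self, prefix, url):
        # urlparse lowercases the host name and strips any port.
        host = urlparse(url).hostname or ""
        for label in host.split("."):
            if label and label not in IGNORED_LABELS:
                self.tokens.append(f"{prefix}:{label}")


def tokenize_urls(html):
    """Return the URL-derived tokens found in an HTML message body."""
    parser = UrlTokenParser()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

To put everything under href, pass "href" as the prefix in both branches of
handle_starttag.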

One thing I have noticed is that the domain name often ends in
digits. You could add a token for this:

href:#

Or just remove the numbers:

href:clearsale
href:greatbizss
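
A small sketch of the digit handling, again only as an illustration. The
function name digit_tokens is hypothetical, and it emits both the href:#
marker and the stripped name, although the two were suggested above as
alternatives; drop whichever one is not wanted.

```python
import re

# One or more ASCII digits at the very end of a host label.
TRAILING_DIGITS = re.compile(r"[0-9]+$")


def digit_tokens(label, prefix="href"):
    """Return tokens for one host label, splitting off trailing digits."""
    stripped = TRAILING_DIGITS.sub("", label)
    if stripped == label:
        # No trailing digits: keep the label unchanged.
        return [f"{prefix}:{label}"]
    tokens = [f"{prefix}:#"]
    if stripped:
        # Skip the name token when the label was digits only.
        tokens.append(f"{prefix}:{stripped}")
    return tokens
```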

If the href contains an IP address, do a name lookup on it. If the
lookup fails, then add

href:ip
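
Something like the following could do the lookup. It is only a sketch: the
name ip_href_tokens and the lookup parameter are invented here (the
parameter exists so the DNS call can be swapped out or cached), and what to
do when the lookup succeeds is my assumption; here the resolved name is
just emitted as one token. Note that a real lookup blocks on DNS, so it
may be too slow to run on every message.

```python
import ipaddress
import socket


def ip_href_tokens(host, lookup=socket.gethostbyaddr):
    """Return ["href:ip"] when host is an IP literal that has no name."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; leave it to the ordinary host tokenizing.
        return []
    try:
        name, _aliases, _addresses = lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return ["href:ip"]
    # Assumption: on success, use the resolved name as the token.
    return [f"href:{name.lower()}"]
```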

Also tokenize attachment extensions:

attachment:pif

This helps to fight viruses.
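
As a sketch, the extension tokens could be pulled out with the standard
email package. The function name attachment_tokens is hypothetical, and
only the last extension of each file name is used, so "photo.jpg.pif"
gives attachment:pif.

```python
import os
from email.message import Message


def attachment_tokens(msg: Message):
    """Return one attachment:<ext> token per named part of the message."""
    tokens = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue  # body parts and unnamed parts have no file name
        ext = os.path.splitext(filename)[1].lstrip(".").lower()
        if ext:
            tokens.append(f"attachment:{ext}")
    return tokens
```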

Harri


