[Spambayes] Tokenizing ideas (images, attachments)
Harri Pesonen
harri.pesonen at wicom.com
Wed Aug 27 11:00:58 EDT 2003
Why not tokenize image URLs? Often a spam message is empty or almost
empty and contains only an image URL. Here is an example:
<html><body>
<center><!--kj3evc37dmn--><a
href="http://www.greatbizss3.com/host/default.asp?ID=omni"><img
src="http://clearsale12.com/pics/gv1.gif" height="270"
width="405"></a></center>
</html></body>
SpamBayes detected this message just fine, but only from tokens in the
headers (mainly the subject). Tokenizing the URLs in this example would
have given these tokens:
image:clearsale12 (ignore com)
href:greatbizss3 (ignore www, com, but not biz)
Or combine both under a single href token.
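
The tokens above could be produced along these lines. This is only a
sketch in modern Python using the standard library, not the actual
SpamBayes tokenizer; the names UrlTokenCollector, url_tokens and
IGNORED_LABELS are made up for illustration, and the ignore list holds
only the two labels mentioned in the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: only the labels named in the example are ignored.
IGNORED_LABELS = {"www", "com"}


class UrlTokenCollector(HTMLParser):
    """Collect href:/image: tokens from <a href> and <img src> hosts."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._add("href", attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self._add("image", attrs["src"])

    def _add(self, prefix, url):
        # urlparse lowercases the host name; it is None for relative URLs.
        host = urlparse(url).hostname or ""
        for label in host.split("."):
            if label and label not in IGNORED_LABELS:
                self.tokens.append(f"{prefix}:{label}")


def url_tokens(html):
    """Return the URL tokens found in an HTML body, in document order."""
    collector = UrlTokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens
```

Fed the HTML body shown earlier, url_tokens should return the href token
for greatbizss3 followed by the image token for clearsale12. Combining
both under href would just mean passing "href" as the prefix in both
branches.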
One thing I have noticed is that the domain name often ends in digits.
You could add a token for this:
href:#
Or just remove the numbers:
href:clearsale
href:greatbizss
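
Both variants are easy to sketch. The function below is hypothetical
(digit_tokens is not an existing SpamBayes name) and emits the marker
token and the stripped name together; either one could be dropped to
match just one of the two suggestions.

```python
import re

TRAILING_DIGITS_RE = re.compile(r"\d+$")


def digit_tokens(label, prefix="href"):
    """Tokenize one domain label, treating trailing digits specially."""
    stripped = TRAILING_DIGITS_RE.sub("", label)
    if stripped == label:
        # No trailing digits: keep the label unchanged.
        return [f"{prefix}:{label}"]
    tokens = [f"{prefix}:#"]  # marker: the label ended in digits
    if stripped:
        # Skipped when the label consisted of digits only.
        tokens.append(f"{prefix}:{stripped}")
    return tokens
```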
If the href contains an IP address, do a reverse name lookup. If it
fails, then add
href:ip
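
A rough sketch of that check, again in modern Python. The function name
lookup_ip_host and its return shape are assumptions for illustration.
What to do when the lookup succeeds is not spelled out above; here the
resolved name is simply handed back so it can be tokenized like any
other host.

```python
import ipaddress
import socket


def lookup_ip_host(host, resolve=socket.gethostbyaddr):
    """Return (name, tokens) for a URL host.

    name is the host to tokenize further, or None if nothing is known;
    tokens holds 'href:ip' when an IP address has no reverse name.
    """
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host, []  # not an IP literal, already a name
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        return resolve(host)[0], []
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return None, ["href:ip"]
```

The resolve parameter exists only so the lookup can be replaced in tests.
DNS lookups block, so a real tokenizer would probably want a cache or a
timeout around this step.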
Also tokenize attachment extensions:
attachment:pif
This helps fight viruses.
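
For example, with the standard email package (a sketch only; the name
attachment_tokens is made up, and the demo message is invented test
data):

```python
import os
from email.message import EmailMessage


def attachment_tokens(msg):
    """Return attachment:<ext> tokens for every named part of a message."""
    tokens = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue  # body parts and containers have no filename
        ext = os.path.splitext(filename)[1].lstrip(".").lower()
        if ext:
            tokens.append(f"attachment:{ext}")
    return tokens


# Invented demo message with one .pif attachment.
demo = EmailMessage()
demo.set_content("see attached")
demo.add_attachment(b"MZ", maintype="application",
                    subtype="octet-stream", filename="Details.PIF")
demo_tokens = attachment_tokens(demo)
```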
Harri