[Spambayes] RE: About my Anti Phishing suggestions for Spambayes

Wed Jan 12 00:20:22 CET 2005

> You seem to be the resident expert/developer on the Spambayes
> mailing list.

If you replace "the" with "a", then you'd be more-or-less right :)

> I made two posts 2 days ago and I was wondering if you had any
> reaction to my suggestion. 

Yes.  It was on my list (now at 268 items) of mail to reply to or deal
with, but I hadn't had time to get to it until today.  Since it concerned a
possible enhancement rather than a bug or problem, it seemed less urgent
than other mail.

[phishing description snipped]
> Now, since I do this by hand, why can't SpamBayes do something like this
> automatically?  For example, SpamBayes should be able to easily parse the
> real URL and the displayed text for a link in an email message.  If the
> text displayed looks like a URL address, it could do a DNS lookup on the
> address like "www.paypal.com" and see that it does not match real URL
> "123.456.234.567" and automatically mark the message as SPAM/Phish?

This (or close enough) was also suggested (on spambayes-dev) in June last
year:

<http://mail.python.org/pipermail/spambayes-dev/2004-June/002922.html>
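
For concreteness, here is a minimal sketch of the kind of check being
suggested.  It is not SpamBayes code: LinkCollector, HOSTLIKE and
mismatched_links are hypothetical names, and it compares the hostname in the
displayed text with the hostname in the href rather than doing the DNS
lookup the suggestion mentions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical helper, not part of SpamBayes: matches link text that
# looks like a hostname, with or without a leading http:// or https://.
HOSTLIKE = re.compile(
    r"^(?:https?://)?([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)", re.I)


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html):
    """Return links whose visible text names a different host than the href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    suspicious = []
    for href, text in parser.links:
        shown = HOSTLIKE.match(text)
        real = urlsplit(href).hostname
        if shown and real and shown.group(1).lower() != real.lower():
            suspicious.append((href, text))
    return suspicious
```

A link whose text is "Click here" is ignored, since the text does not look
like a hostname.  Whether a check like this actually improves results is a
separate question that would need testing.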

> However another possibility might be
> that at the point where SpamBayes is tokenizing the email content, it may
> be possible to change strings like "http://123.456.234.567/" into
> something generic like "http://NNN.NNN.NNN.NNN/" and then use the normal
> Bayesian technique to get these messages flagged as SPAM since I don't
> think any of my HAM messages ever use numeric URL addresses.  This would
> probably only work for a while till the Phishers eventually use
> non-numeric addresses for their faked URLs.  Developers?  Is this
> possible/easier?

The string "http://123.456.234.567/" generates the tokens: ['proto:http',
'url:http', 'url:', 'url:', 'url:123', 'url:456', 'url:234', 'url:567'].  If
you only ever see numeric URLs in spam, then you should have a reasonable
number of tokens like 'url:123' in your database with probabilities on the
spam side.  You could generate a "url:IPaddr_instead_of_hostname" token, I
suppose (as Glenn suggested), but is it really going to help anything?  See
FAQ 6.1:

<http://spambayes.org/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x>
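
As a rough illustration of what generating such a token might involve, here
is a minimal sketch.  It is not the SpamBayes tokenizer: extra_url_tokens
and DOTTED_QUAD are hypothetical names, and the pattern is deliberately
loose, since the example address in this thread has "octets" above 255 and
a strict IP address parser would reject it.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: a host made of four dot-separated digit groups.
# It does not check that each group is in the range 0-255.
DOTTED_QUAD = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def extra_url_tokens(url):
    """Yield a marker token when the URL's host is numeric (sketch only)."""
    host = urlsplit(url).hostname or ""
    if DOTTED_QUAD.match(host):
        yield "url:IPaddr_instead_of_hostname"


print(list(extra_url_tokens("http://123.456.234.567/")))
print(list(extra_url_tokens("http://www.paypal.com/")))
```

The first call yields the marker token and the second yields nothing.  The
token would be in addition to the existing per-component tokens, so it only
helps if those are not already doing the job.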

The important point here is that ideas need to be tested.

My question would be: are there messages that would contain these tokens
that are currently being scored incorrectly?  (i.e. for which the existing
token set is not sufficient).  Any mail I get with phishing content is
easily scored as spam from the rest of the content - however, I don't use
PayPal, eBay, any US bank, or any of the NZ banks that have been the subject of
this sort of mail so far, so legitimate mail from these places might be
scored as spam, for all I know.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.