[spambayes-dev] Re: Generating SB tokens based upon information on the net

Fri Aug 6 04:12:51 CEST 2004

> From: T. Alexander Popiel
> Sent: Wednesday, August 04, 2004 11:31 AM

<...>

> Actually, the Received header info can come from the HELO or
> EHLO command that opened the conversation, not DNS.  I haven't
> looked to see if any MTAs actually do it that way, but it's
> the way I would do it if I were writing one...  (And sure,
> that means a rogue could lie about identification in the
> HELO... but that's why both the name and the IP appear in the
> Received line.)

You'd be exactly right.  Most MTAs put the SMTP-client (the sender) IP
in [ ], put the EHLO string before that and put the rDNS result before
that.  What you usually see for the sending machine in a Received header
is:

rDNS result ( EHLO name [ IP address ])

Some mailers take the proactive step of noting "May be forged" if they
notice that the rDNS lookup and EHLO string differ too much.  In any
case, with these three pieces of information, a user can interpret the
Received headers going from top to bottom.  The only piece of
information that can really be forged is the EHLO string.  Spoofing an
IP address for an SMTP session is very hard and is best done through a
proxy with the address you want to spoof, and attacking the rDNS tree is
pretty tough.
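
As a rough sketch of pulling those three pieces out of a Received
header, the Python below matches the "name ( name [ IP ])" shape with a
regular expression.  The function name parse_from_clause and the field
names are made up for illustration.  MTAs differ in which name they put
outside the parentheses and which inside, so the fields are kept neutral
("name" and "paren_name") rather than labelled EHLO or rDNS.

```python
import re

# Matches: from <name> ( [<paren_name>] [ <ip> ] )
# paren_name is optional, since some lines carry only the bracketed IP.
FROM_RE = re.compile(
    r"\bfrom\s+(?P<name>[^\s()]+)\s*"
    r"\(\s*(?P<paren_name>[^\s()\[\]]+)?\s*"
    r"\[\s*(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s*\]\s*\)",
    re.IGNORECASE,
)


def parse_from_clause(received):
    """Return the two names and the IP from a Received line, or None."""
    m = FROM_RE.search(received)
    if m is None:
        return None
    return {
        "name": m.group("name"),              # name before the parentheses
        "paren_name": m.group("paren_name"),  # name inside them, may be None
        "ip": m.group("ip"),                  # bracketed IPv4 address
    }
```

Lines that do not follow this shape simply return None, so a caller can
skip them instead of guessing.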

The only implication for Spambayes is that when mining headers, the EHLO
string in spam is often non-existent.  When it does appear, it is
sometimes the target domain (to try to fool the mailer into thinking it
is a local message), sometimes a joe-job victim and sometimes just a
nonsense string.  That one piece of information may be of dubious value
for the classifier, but the IP and rDNS result are certainly useful.

The other item of dubious value that seems to generate tokens is the
recipient machine in the top received header.  That is _always_ your own
MX, and listing it is not of any value.  Often, the sending machine in
the first received line is another internal machine at your own
provider, but not always.  The first external sender occurs in either
the first or second received header.  For example, the top two received
lines in Alex's post that I am responding to were (for me):

Received: from inbound-mx3.atl.registeredsite.com ([64.224.219.91])
          by imta04a2.registeredsite.com with ESMTP
          id <20040804163157.IEAL28804.imta04a2.registeredsite.com at inbound-mx3.atl.registeredsite.com>
          for <sethg at goodmanassociates.com>; Wed, 4 Aug 2004 12:31:57 -0400

Received: from smtp-vbr2.xs4all.nl (smtp-vbr2.xs4all.nl [194.109.24.22])
	by inbound-mx3.atl.registeredsite.com (8.12.11/8.12.8) with ESMTP id i74GV4aI019873
	for <sethg at goodmanassociates.com>; Wed, 4 Aug 2004 16:31:05 GMT

The top received line is a local handoff between the gateway MX and an
internal MTA at my provider.  The gateway MX did not even bother to
provide an EHLO string, since it is a trusted internal handoff.  None of
the address information in this line is suitable for generating tokens.
This would take some effort to suppress, since some providers don't have
the incoming MX and MDA (mail delivery agent) functions separated.  One
possible way to detect an internal transfer: if the sending and
receiving machines in the top received header have the same domain, it
is an internal handoff.
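
The same-domain test could be sketched in Python roughly as follows.
The names base_domain and is_internal_handoff are hypothetical, and
"same domain" is approximated by comparing the last two labels of each
host name, which is an assumption that breaks for names such as
example.co.uk.

```python
import re

# Grab the host after "from" and the host after the following "by".
HOSTS_RE = re.compile(
    r"\bfrom\s+(?P<frm>[^\s()]+).*?\bby\s+(?P<by>[^\s();]+)",
    re.IGNORECASE | re.DOTALL,
)


def base_domain(host):
    # Crude assumption: the last two labels identify the organisation.
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_internal_handoff(received):
    """True if the 'from' and 'by' machines share a base domain."""
    m = HOSTS_RE.search(received)
    if m is None:
        return False
    return base_domain(m.group("frm")) == base_domain(m.group("by"))
```

Applied to the two lines quoted above, this flags the first
(registeredsite.com to registeredsite.com) as internal and the second
(xs4all.nl to registeredsite.com) as external.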

The next received line contains the actual external SMTP-client (the
sender).  In this case, the EHLO string matches the rDNS exactly,
showing the sender has their DNS properly configured.  The machine in
the 'from' part of this line is worth generating tokens for.  The
machine in the 'by' part is the same as the machine in the 'from' part
of the first header line, and since it is not suitable for generating
tokens in the first line, it is not suitable here.

In general, only machines listed in the 'from' part of each received
line should be candidates for token generation, and only if they have a
different domain from the machine in the 'by' part of the top received
line.  When generating a token for a machine, it would probably be wise
to ignore the EHLO string, since it is simply part of the MTA
configuration and can be forged to be the same as a machine that you
would normally trust.  In fact, some spamming MTAs change their EHLO
string with every message.  Another dead giveaway is when the sending
machine has an rDNS result that has a pattern resembling a dynamic IP
connection.  Typically, this is something like
220-15-7-52-adslpool.bigISP.com.  Unfortunately, this gets into regexes,
which are really the bailiwick of SpamAssassin and other rule-based
systems.  The best way to determine if the line is a dynamic IP is to
consult a dynamic IP DNSBL, but for all the reasons that have been
mentioned, this is really out of the question for Spambayes.
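
Purely to illustrate the selection rule and the kind of pattern meant
here, a Python sketch follows.  Everything in it is an assumption: the
function names, the two-label notion of "domain", and the keyword list
in DYNAMIC_RE, which is a guess at common naming habits and not a
vetted rule set.  It takes the name right after 'from' as the machine
name.

```python
import re

FROM_RE = re.compile(r"\bfrom\s+([^\s()]+)", re.IGNORECASE)
BY_RE = re.compile(r"\bby\s+([^\s();]+)", re.IGNORECASE)

# Hypothetical heuristic: four dash- or dot-separated numbers, or a
# keyword that ISPs often use for dynamically assigned addresses.
DYNAMIC_RE = re.compile(
    r"(?:\d{1,3}[-.]){3}\d{1,3}|dsl|dial|pool|dhcp|cable|ppp",
    re.IGNORECASE,
)


def base_domain(host):
    # Crude assumption: the last two labels identify the organisation.
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def candidate_hosts(received_lines):
    """'from' machines whose domain differs from the top line's 'by' machine.

    received_lines holds the Received header values, topmost first.
    """
    if not received_lines:
        return []
    top_by = BY_RE.search(received_lines[0])
    own = base_domain(top_by.group(1)) if top_by else None
    hosts = []
    for line in received_lines:
        m = FROM_RE.search(line)
        if m and base_domain(m.group(1)) != own:
            hosts.append(m.group(1).lower())
    return hosts


def looks_dynamic(name):
    """True if the name resembles a dynamic-IP pool entry."""
    return DYNAMIC_RE.search(name) is not None
```

A keyword list like this will misfire on legitimate hosts, which is
part of why the DNSBL route is the more reliable one.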

OTOH, running a proxy ahead of Spambayes that checks half a dozen
DNSBLs on all the machines in the 'from' parts of the header lines
might be a useful adjunct.  While YMMV, I had very few errors when
I used to do this (I no longer bother).  For large spam loads where a
few percent unsures amounts to a lot of mail to manually classify, this
can be helpful.
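
For reference, a DNSBL check is conventionally an ordinary DNS lookup
of the IP with its octets reversed, under the list's zone; an answer
means "listed" and a lookup failure means "not listed".  A minimal
Python sketch, assuming IPv4 only, is below.  The zone name
dnsbl.example.org used in the usage note is a placeholder and not a
real list, and the function names are hypothetical.

```python
import socket


def dnsbl_query_name(ip, zone):
    # 194.109.24.22 under zone Z becomes 22.24.109.194.Z
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """True if the DNSBL zone returns any address for this IP."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # Name did not resolve: the IP is not on this list.
        return False
    return True


def listed_zones(ip, zones, resolve=socket.gethostbyname):
    """Return the subset of zones that list the IP."""
    return [zone for zone in zones if is_listed(ip, zone, resolve)]
```

A proxy would call something like listed_zones(ip,
["dnsbl.example.org"]) for each 'from' IP.  The resolve parameter exists
only so the lookup can be stubbed out when trying this without network
access.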

--

Seth Goodman