[Spambayes] Another software in the field

T. Alexander Popiel popiel@wolfskeep.com
Fri Nov 15 18:20:39 2002


In message:  <LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>
>Poor man -- I'm glad you uncloaked!  Did the Outlook Message-Ids fit a
>pattern you've seen?  I'm keen to pursue that.

If you're keen on message ids, then one idea I've had (with no
time to implement, alas) is to compare the message id domain with
the sequence in the received headers, to detect when message ids
are generated late in the delivery sequence.  In more detail:

Most Received headers these days are of the (RFC 821 dictated) form:

  Received: from ([^ ]*).* by ([^ ]*).*;(.*)

where \1 is the prior MTA, \2 is the current MTA, and \3 is the
time of transfer.  Reading all the received headers, you can get
a chain of MTAs as the delivery sequence... as an example:

  Received: from mail.python.org (mail.python.org [12.155.117.29])
          by cashew.wolfskeep.com (Postfix) with ESMTP id 97FAFF54C
          for <popiel@wolfskeep.com>; Fri, 15 Nov 2002 09:44:19 -0800 (PST)
  Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
          by mail.python.org with esmtp (Exim 4.05)
          id 18CkXd-00065D-01; Fri, 15 Nov 2002 12:46:05 -0500
  Received: from smtp.comcast.net ([24.153.64.2])
          by mail.python.org with esmtp (Exim 4.05)
          id 18CkAN-0007r1-00
          for spambayes@python.org; Fri, 15 Nov 2002 12:22:03 -0500
  Received: from cj569191b (pcp736393pcs.reston01.va.comcast.net
          [68.48.241.201]) by mtaout03.icomcast.net
          (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
          spambayes@python.org; Fri, 15 Nov 2002 12:21:16 -0500 (EST)

yields the sequence:

  cj569191b -> mtaout03.icomcast.net -> smtp.comcast.net -> mail.python.org
  -> localhost.localdomain -> mail.python.org -> mail.python.org ->
  cashew.wolfskeep.com
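
A minimal Python sketch of this parsing step, under some assumptions: the
function name received_chain is hypothetical, the header values arrive
newest-first (as they appear in a message) without the "Received:" prefix,
and headers that do not fit the pattern are simply skipped. The first ".*"
is made non-greedy so the earliest " by " is used.

```python
import re

# The pattern from above, applied to the header value (no "Received:" prefix).
RECEIVED_RE = re.compile(r"from ([^ ]*).*? by ([^ ]*).*;(.*)")

def received_chain(received_values):
    """Return the hop sequence, oldest hop first.

    received_values: header values in message order (newest first).
    """
    chain = []
    for value in reversed(received_values):
        # Unfold continuation lines into single spaces before matching.
        unfolded = " ".join(value.split())
        m = RECEIVED_RE.match(unfolded)
        if m is None:
            continue  # assumption: ignore headers the pattern cannot parse
        chain.extend([m.group(1), m.group(2)])
    return chain
```

Fed the four headers above, this should reproduce the eight-element
sequence shown, including the naive "cj569191b" for the iPlanet line.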

Remove references to localhost.localdomain or localhost, then compress
identical neighbors to yield:

  cj569191b -> mtaout03.icomcast.net -> smtp.comcast.net -> mail.python.org
  -> cashew.wolfskeep.com
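
The cleanup step could be sketched like this in Python. The name
clean_chain and the set LOCAL_NAMES are hypothetical, and lowercasing
host names before comparing them is an added assumption.

```python
# Hosts that say nothing about the real delivery path.
LOCAL_NAMES = {"localhost", "localhost.localdomain"}

def clean_chain(chain):
    """Drop localhost entries, then collapse identical neighbors."""
    cleaned = []
    for host in chain:
        host = host.lower()  # assumption: compare case-insensitively
        if host in LOCAL_NAMES:
            continue
        if cleaned and cleaned[-1] == host:
            continue  # same as the previous surviving hop
        cleaned.append(host)
    return cleaned
```

Because local names are dropped before the neighbor check, the three
mail.python.org entries around localhost.localdomain collapse into one.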

Now, look at the message id:

  Message-id: <LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net>

Extracting just the domain name from that, we get:

  comcast.net
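
A small Python sketch of the extraction, assuming the domain is whatever
follows the last "@" inside the angle brackets. The name
message_id_domain is hypothetical, and returning an empty string for an
id with no "@" is an assumption.

```python
def message_id_domain(message_id):
    """Return the lowercased domain part of a Message-Id value."""
    value = message_id.strip().strip("<>")
    _local, sep, domain = value.rpartition("@")
    # Assumption: an id without "@" has no usable domain.
    return domain.lower() if sep else ""
```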

Now, compare the domain from the message id to the domains in the
received list, yielding the number of hierarchy levels matched:

  0 -> 1 -> 2 -> 0 -> 0
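
One way to count matched hierarchy levels is to compare dot-separated
labels from the right until they differ. This is only a sketch; the
name levels_matched is hypothetical.

```python
def levels_matched(host, domain):
    """Count trailing domain labels that host and domain share."""
    host_labels = host.lower().split(".")[::-1]
    domain_labels = domain.lower().split(".")[::-1]
    count = 0
    for a, b in zip(host_labels, domain_labels):
        if a != b:
            break
        count += 1
    return count
```

For example, "mtaout03.icomcast.net" against "comcast.net" shares only
"net", so it scores 1, while "smtp.comcast.net" scores 2.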

Find the first occurrence of the best match, and generate a token:

  message-id-generation:skipped 2
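
Generating the token from the match list might look like the Python
sketch below. The name generation_token is hypothetical. What to do when
nothing in the chain matches at all is not specified above; returning
None in that case (and for an empty list) is an assumption.

```python
def generation_token(match_levels):
    """Build the token from the per-hop match counts, oldest hop first."""
    if not match_levels:
        return None  # assumption: no hops, no token
    best = max(match_levels)
    if best == 0:
        return None  # assumption: no hop matched the message id domain
    skipped = match_levels.index(best)  # index of the first best match
    return "message-id-generation:skipped %d" % skipped
```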

If the received parser were a little smarter about parsing iPlanet
received lines, it would have "pcp736393pcs.reston01.va.comcast.net"
instead of "cj569191b" as the first element in the sequence, and
the match list would have been 2 -> 1 -> 2 -> 0 -> 0, yielding:

  message-id-generation:skipped 0

I suspect that high skipped numbers would be a strong spam indicator,
showing where message ids were omitted in the sent mail and/or received
headers were naively forged to prevent backtracking.

Unfortunately, I haven't had time to implement and test this...

- Alex


