[spambayes-dev] default to mine_received_headers=True, "may be forged"

Mon Dec 22 16:58:59 EST 2003

    Tim> Changing the regexp to use [a-z] instead of \w would weed out all
    Tim> that stuff. 

I'll give that a try.  Thanks.
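
For illustration, here is a minimal sketch of what switching from \w to
[a-z] changes. The actual regexp isn't shown in this message, so both
patterns and the sample Received text below are hypothetical: \w also
matches digits and underscores, so it picks up parenthesized comments that
a letters-only class skips.

```python
import re

# Hypothetical Received-style text; not taken from real mail.
text = ("from a (may be forged) by b "
        "(qmail 12345 invoked) (untrusted sender)")

# Hypothetical patterns: parenthesized runs of "word" characters and spaces
# versus runs of lowercase letters and spaces only.
word_pat = re.compile(r'\(([\w ]+)\)')
alpha_pat = re.compile(r'\(([a-z ]+)\)')

word_hits = word_pat.findall(text)    # includes the comment with digits
alpha_hits = alpha_pat.findall(text)  # letters-only comments
```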

    >> Perhaps we should add
    >> 
    >> header = re.sub(r'\s+', ' ', header)
    >> 
    >> to the "for header ..." loop in any case?

    Tim> There are many "for header" loops, and I'm not sure which one(s)
    Tim> you're talking about here.  If you want to do this somewhere,

    Tim>     header = ' '.join(header.split())

    Tim> is faster.

Okay.  I was just referring to the loop over the Received headers in the
section of code we've been messing with.
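
A minimal sketch of the two whitespace-collapsing forms mentioned above,
applied to a made-up folded Received header. The function names are
hypothetical. One difference worth noting: the split/join form also strips
leading and trailing whitespace, while the re.sub form leaves a single
space at either end.

```python
import re

def collapse_re(header):
    # Replace every run of whitespace with one space.
    return re.sub(r'\s+', ' ', header)

def collapse_split(header):
    # Split on whitespace runs and rejoin; also trims both ends.
    return ' '.join(header.split())

# Made-up folded header value (newline plus tab continuation).
sample = "from mail.example.com (unknown [10.0.0.1])\n\tby mx.example.org"

same_inside = collapse_re(sample) == collapse_split(sample)
padded = "  a \t b  "
```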

    >> I'm willing to tuck the more general received sifting into the
    >> tokenizer controlled by a new experimental option.  Let me know if
    >> you want me to take that step.

    Tim> No, I don't want another experimental option just for this.  It
    Tim> seems clear enough already that "may be forged" is potentially
    Tim> interesting, and also that "may be forged" isn't the only
    Tim> potentially interesting string.  We should suck up a bunch of them,
    Tim> or none of them.  The classifier will learn which are and aren't
    Tim> useful, and it sure looks like that will vary depending on user
    Tim> (that one of my ISPs is Comcast and one of yours isn't is not a
    Tim> good reason to poo-poo the clues Comcast leaves behind <wink>).

Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted
sender)".  I posted a note to comp.mail.misc asking for equivalents to "(may
be forged)" for other MTAs.  I'll see whether anything turns up that
warrants investigation.
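
Roughly what I have in mind, as a sketch only and not the tokenizer's
actual code: the function name, the token format, and the MARKERS tuple
are all made up for illustration, and only the two marker strings come
from this thread. Whitespace is collapsed first so that a marker split
across a folded header line still matches.

```python
# Hypothetical marker list; more could be added as other MTAs' strings
# turn up.
MARKERS = ("may be forged", "untrusted sender")

def received_marker_tokens(received_headers):
    """Yield one token per known marker found in each Received header."""
    for header in received_headers:
        # Collapse folded whitespace and normalize case before matching.
        header = ' '.join(header.split()).lower()
        for marker in MARKERS:
            if "(" + marker + ")" in header:
                # Hypothetical token format.
                yield "received:" + marker

# Made-up header with the marker folded across two lines.
headers = ["from foo (bar [192.0.2.1] (may be\n\tforged)) by mx.example.org"]
tokens = list(received_marker_tokens(headers))
```

The classifier would then decide per user whether each token is a useful
clue.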

Skip