[spambayes-dev] Re: Generating SB tokens based upon information on the net

Skip Montanaro skip at pobox.com
Wed Aug 4 22:00:43 CEST 2004


    Brad> I don't know what it's doing.  I know nothing about the SpamBayes
    Brad> configuration on bag.

Aside from the way it's plugged into bag's Postfix setup, the Spambayes
proxy running on bag.python.org is no different than any other Spambayes
installation.  The classification and scoring pieces of the system are no
different than what you'd find in any other Spambayes application
(sb_server.py, sb_filter.py, etc).  In particular, if the
mine_received_headers option is True, the tokenizer will spew out all sorts
of interesting tokens based on IP addresses and hostnames like

    received:83.69
    received:83.69.163
    received:83.69.163.110

and

    received:grp.scd.yahoo.com
    received:mail.sc5.yahoo.com
    received:mail.yahoo.com
    received:n16.grp.scd.yahoo.com
    received:n36.grp.scd.yahoo.com
    received:n39.grp.scd.yahoo.com
    received:n53.grp.scd.yahoo.com
    received:n54.grp.scd.yahoo.com
    received:sc5.yahoo.com
    received:scd.yahoo.com
    received:smtp805.mail.sc5.yahoo.com
    received:web21506.mail.yahoo.com
    received:web50510.mail.yahoo.com
    received:web60101.mail.yahoo.com
    received:web60909.mail.yahoo.com
    received:web61208.mail.yahoo.com
    received:yahoo.com

It will grub around in many other mail headers as well.
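
To make the Received: tokens above concrete, here is a minimal sketch of how
such tokens could be derived.  It is not the code from tokenizer.py: the
regular expressions, the function names, the choice to start at two
components, and the sample header are all assumptions made for illustration.

```python
import re
from email import message_from_string

# Hypothetical patterns; the real tokenizer's regular expressions may differ.
IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")
HOST_RE = re.compile(r"from\s+([a-z0-9][a-z0-9.-]*\.[a-z]{2,})", re.IGNORECASE)


def ip_prefixes(ip, start=2):
    """Yield leading pieces of a dotted quad: 83.69, 83.69.163, ..."""
    parts = ip.split(".")
    for i in range(start, len(parts) + 1):
        yield ".".join(parts[:i])


def host_suffixes(host, start=2):
    """Yield trailing pieces of a hostname: yahoo.com, mail.yahoo.com, ..."""
    parts = host.lower().split(".")
    for i in range(start, len(parts) + 1):
        yield ".".join(parts[-i:])


def received_tokens(msg):
    """Yield 'received:' tokens for every Received: header in msg."""
    for header in msg.get_all("Received", []):
        m = HOST_RE.search(header)
        if m:
            for piece in host_suffixes(m.group(1)):
                yield "received:" + piece
        m = IP_RE.search(header)
        if m:
            for piece in ip_prefixes(m.group(1)):
                yield "received:" + piece


# Made-up sample message; the host/IP pairing is invented for the example.
SAMPLE = (
    "Received: from web21506.mail.yahoo.com ([83.69.163.110]) "
    "by mail.example.org; Wed, 4 Aug 2004 22:00:43 +0200\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

for token in received_tokens(message_from_string(SAMPLE)):
    print(token)
```

Breaking addresses and hostnames into progressively longer pieces lets the
classifier learn about a whole network or domain as well as about one
specific machine.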

Like many other pieces of software, the code is the best documentation you're
going to find.  You needn't read and understand it all, however.  Thanks in
large degree to Tim Peters' skill, the tokenizer is clearly written and very
well-commented (comments make up probably half the file):

    http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/tokenizer.py?rev=1.31&view=markup

The classifier is also well-structured and well-commented thanks to Tim and
contains links to both Paul Graham's original "Plan for Spam" as well as
some of Gary Robinson's writings:

    http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/classifier.py?rev=1.25&view=markup

    Brad> I am stating that this is a capability that I believe we need.

I think we already have it.

    >> There's no need to do blacklisting as far as I'm concerned.

    Brad> The content of the message is important, yes.  But you're throwing
    Brad> away all the envelope information which can also be very
    Brad> instructive.

What envelope information?  Does it turn up in a message header?  If Postfix
adds it to a Received: header, Spambayes probably already takes it into
account.

Skip

