[spambayes-dev] sb_filter change

Skip Montanaro skip at pobox.com
Wed Nov 12 09:48:25 EST 2003


I modified sb_filter.py to accept one or more file names on the command
line.  Existing behavior should be retained.  If a single message is read
from stdin, the output message will have a From_ line only if the input
message did.  When processing files named on the command line, sb_filter.py
uses mboxutils.getmbox() to decipher their format.  In such cases, the
output is always a Unix-style mailbox on stdout.
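
The From_ line rule above can be sketched with nothing but the standard
library's email package.  This is only an illustration of the behavior, not
the code in sb_filter.py; the function names format_single and
format_as_mbox are made up for the example, and the sketch uses modern
Python rather than the Python of the checkin.

```python
import email


def format_single(text):
    """Re-emit one message, keeping a From_ line only if the input had one."""
    msg = email.message_from_string(text)
    had_from_line = msg.get_unixfrom() is not None
    return msg.as_string(unixfrom=had_from_line)


def format_as_mbox(texts):
    """Emit every message with a From_ line, i.e. as a Unix-style mailbox."""
    parts = []
    for text in texts:
        msg = email.message_from_string(text)
        # With unixfrom=True the generator synthesizes a "From nobody <date>"
        # line when the parsed message did not carry one of its own.
        out = msg.as_string(unixfrom=True)
        if not out.endswith("\n"):
            out += "\n"
        parts.append(out + "\n")  # blank line separates mbox entries
    return "".join(parts)
```

The first function corresponds to the single-message stdin case, the second
to the case where files are given on the command line.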

This change probably doesn't have a lot of practical use, but I find it
helpful in one situation.  If I want to score a mailbox full of messages to
identify outliers (perhaps mistakes in my classification of a large body of
messages), I used to do this:

    formail -s sb_filter.py < somembox \
    | egrep -i '^(x-spambayes-classification|message-id): '

which incurred sb_filter.py startup for each message.  Now I execute

    sb_filter.py somembox \
    | egrep -i '^(x-spambayes-classification|message-id): '

which runs a lot faster.
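
For the outlier hunt itself, the egrep stage could be replaced by a few
lines of Python that read the scored mailbox and report only the messages
whose classification disagrees with what I expected.  A rough sketch,
assuming the scored output has been saved to a file and that the label is
whatever precedes an optional ';' in the X-Spambayes-Classification header;
find_outliers is a hypothetical helper, not part of spambayes:

```python
import mailbox


def find_outliers(mbox_path, expected):
    """Return (message-id, classification) pairs that disagree with `expected`."""
    outliers = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            raw = msg.get("X-Spambayes-Classification", "")
            # Assumption: the label is the text before any ';'
            # (for example "spam; 0.99" yields "spam").
            label = raw.split(";", 1)[0].strip().lower()
            if label != expected:
                outliers.append((msg.get("Message-Id", "<unknown>"), label))
    finally:
        box.close()
    return outliers
```

Calling find_outliers("scored.mbox", "ham") on the scored copy of a ham
mailbox would then list only the suspicious messages.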

I should be able to figure out how to process my incoming mail that way as
well, then spit the result into

    formail -s procmail

to do the usual procmail processing.

This usage suggests an enhancement to mboxutils.getmbox().  Currently, it
doesn't recognize Tim-style training databases (e.g. Data/Ham/SetN, where all
files have numeric filenames).  mboxutils.DirOfTxtFileMailbox could be
extended to simply accept all plain files as messages and all subdirectories
as nested DirOfTxtFileMailboxes.  Would that change break anyone's usage?
(What are .lorien files anyway?)
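
To make the proposed behavior concrete, here is a minimal sketch of the
traversal: every plain file is one message, every subdirectory is a nested
mailbox.  It is not the existing DirOfTxtFileMailbox code; the name
iter_dir_messages is hypothetical and it uses only the standard library.

```python
import email
import os


def iter_dir_messages(dirname):
    """Yield messages from every plain file under dirname, recursing into subdirs."""
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            # A subdirectory is treated as a nested mailbox.
            yield from iter_dir_messages(path)
        elif os.path.isfile(path):
            # Any plain file is taken to hold exactly one message.
            with open(path, "rb") as fp:
                yield email.message_from_binary_file(fp)
```

Pointed at Data/Ham, this would walk each SetN directory in turn.  Names
are sorted as strings, so "10" comes before "2"; a numeric sort would need
a little more work.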

Skip


