[Spambayes-checkins] spambayes/contrib BULK.txt,NONE,1.1 bulkgraph.py,NONE,1.1 bulktrain.sh,NONE,1.1 procmailrc,NONE,1.1 spambayes.el,NONE,1.1

Neale Pickett npickett at users.sourceforge.net
Wed Jan 22 20:46:29 EST 2003


Update of /cvsroot/spambayes/spambayes/contrib
In directory sc8-pr-cvs1:/tmp/cvs-serv28017/contrib

Added Files:
	BULK.txt bulkgraph.py bulktrain.sh procmailrc spambayes.el 
Log Message:
* Fixed runtest.sh to handle new paths for all the utilities
* moved hammie/* to contrib/*
* new spambayes.el for Gnus integration


--- NEW FILE: BULK.txt ---
Alex's spambayes filter scripts
-------------------------------

I've finally started using spambayes for my incoming mail filtering.
I've got a slightly unusual setup, so I had to write a couple scripts
to deal with the nightly retraining...

First off, let me describe how I've got things set up.  I am an
avid (and rather religious) MH user, so my mail folders are of
course stored in the MH format (directories full of single-message
files, where the filenames are numbers indicating ordering in the
folder).  I've got four mail folders of interest for this discussion:
everything, spam, newspam, and inbox.

When mail arrives, it is classified, then immediately copied into the
everything folder.  If it was classified as spam or ham, it is
trained as such, reinforcing the classification.  Then, if it was
labeled as spam, it goes into the newspam folder; otherwise it
goes into my inbox.

When I read my mail (from inbox or newspam), I move any confirmed
spam into my spam folder; ham may be deleted.  (Of course, I still
have a copy of my ham in the everything folder.)
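
The delivery rules above can be summarized in a few lines of Python.
This is only a sketch of the decision logic as described in the prose,
not code from the scripts: the function name route and its return
shape are made up for illustration, and the real work is done by the
procmail recipes in procmailrc.

```python
def route(classification):
    """Return (folders, train_as) for a freshly classified message.

    classification is "spam", "ham", or "unsure", mirroring the value
    of the X-Spambayes-Classification header.  Hypothetical helper.
    """
    # Every message is archived first, whatever the verdict.
    folders = ["everything"]
    # Only confident verdicts are fed back into training.
    train_as = classification if classification in ("spam", "ham") else None
    # Spam is diverted; ham and unsure both land in the inbox.
    folders.append("newspam" if classification == "spam" else "inbox")
    return folders, train_as
```

An unsure message therefore ends up in everything and inbox without
being trained on at delivery time.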

Every night, I run a complete retraining (from cron at 2:10am);
it trains on all mail in the everything folder that is less than
4 months old.  If a given message has an identical copy in the spam
or newspam folder, then it is trained as spam; otherwise it is
trained as ham.  This does mean that unread unsures will be
treated as ham for up to a day; there are few enough of them that
I don't care.  The four-month age limit will have the effect of
expiring old mail out of the training set, which will keep the
database size fairly manageable (it's currently just under 10 meg,
with 6 days to go until I have 4 months of data).
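
The "identical copy" test is the unusual part of this setup, so here
is a rough sketch of one way to do it: bucket the spam files by size,
then compare contents only within a bucket.  It is written for a
current Python 3 rather than the Python 2.2 the scripts target, the
helper names index_by_size and is_spam are invented for this example,
and it picks out MH message files with isdigit() as a simplification.
The age limit is not covered here.

```python
import filecmp
import os


def index_by_size(spam_dirs):
    """Map file size -> list of paths for every message in the spam folders."""
    by_size = {}
    for folder in spam_dirs:
        for entry in os.listdir(folder):
            # MH message files are named with plain numbers; skip the rest.
            if entry.isdigit():
                path = os.path.join(folder, entry)
                by_size.setdefault(os.path.getsize(path), []).append(path)
    return by_size


def is_spam(path, by_size):
    """True if some spam-folder file is byte-for-byte identical to path."""
    candidates = by_size.get(os.path.getsize(path), [])
    # shallow=False always compares contents instead of trusting os.stat().
    return any(filecmp.cmp(path, other, shallow=False) for other in candidates)
```

Grouping by size first means most messages in everything are never
opened for comparison at all, since files of different sizes cannot be
identical.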

The retraining generates a little report for me each night,
showing a graph of my ham and spam levels over time.  Here's
a sample:

| Scanning spamdir (/home/cashew/popiel/Mail/spam):
| Scanning spamdir (/home/cashew/popiel/Mail/newspam):
| Scanning everything
| sshsshsshsshsshsshsshshsshshshshsshshshshshshsshsshshsshssshsshshsshshsshshs
| sshshshshsshshsshshshshshssshshshsshsshsshshshshshshsshshhshshsshshshshssshs
| sshshsssshs
|   154
|   152|                                                             
|   144|                                                             
|   136|                                                             
|   128|                                                   h         
|   120|                                                   h      s  
|   112|                             s       ss     ss s   h   s  ss 
|   104|                             ss      ss     ss sHs h   s  ss 
|    96|                           s ss   s  sH  s  ss sHs h  Sss ss 
|    88|                    h  ss  s sss ss  sH sss ssssHHhS sSsssss 
|    80|                 s sSH ss ssssss sssssH HssssHsHHHSS sSsssss 
|    72|                 ssHSH ssssssssssssHHsHSHssHsHsHHHSSssSsssss 
|    64|      s  s  s s sHsHSHsssssssHsHsssHHsHSHssHsHsHHHSSssSsssss 
|    56|   s sss ss sssssHHHSHsHsssHsHHHHssHHsHSHHsHHHsHHHSSsHSsssss 
|    48|   ssssssssssssssHHHSHHHHssHsHHHHHsHHsHSHHsHHHsHHHSSsHSssHsss
|    40|   ssssssssssHsHHHHHSHHHHHsHsHHHHHHHHHHSHHsHHHHHHHSSsHSHsHHss
|    32|   ssHHssHsssHHHHHHHSHHHHHHHsHHHHHHHHHHSHHsHHHHHHHSSHHSHHHHHs
|    24|   ssHHHHHHHsHHHHHHHSHHHHHHHsHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
|    16|   HsHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
|     8|   HHHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHH
|     0|SSSUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
|      +------------------------------------------------------------
| 
| Total: 6441 ham, 9987 spam (60.79% spam)
| 
| real    7m45.049s
| user    5m38.980s
| sys     0m39.170s

At the top of the output it mentions what it's scanning, and has a
long line of s and h indicating progress, one character per 100
messages of that kind (so it doesn't look hung if you run it by
hand).

Below is a set of overlaid bar graphs; s is for spam, h is for ham,
u is unsure.  Each column averages two days, oldest on the left.
Where bars overlap, the shortest one is drawn in front and
capitalized.  In the example, I have very few days where I have
more ham than spam.
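
How one character of the graph gets chosen may be easier to see in
isolation.  The sketch below restates the per-cell logic of row() in
bulkgraph.py as a standalone Python 3 function; the name cell and its
argument names are made up for this example.

```python
def cell(level, spam, ham, unsure):
    """Pick the character drawn at height `level` for one column."""
    count = 0
    char = " "
    if spam >= level:
        count += 1
        char = "s"
    if ham >= level:
        count += 1
        # Ham takes over only if nothing is drawn yet or its bar is shorter.
        if char == " " or ham < spam:
            char = "h"
    if unsure >= level:
        count += 1
        if (char == " "
                or (char == "s" and unsure < spam)
                or (char == "h" and unsure < ham)):
            char = "u"
    # More than one bar reaches this height: capitalize the one in front.
    return char.upper() if count > 1 else char
```

For a column with 100 spam and 60 ham, cell(50, 100, 60, 0) gives "H"
(both bars reach 50, ham is shorter and sits in front), while
cell(80, 100, 60, 0) gives "s" (only the spam bar is that tall).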

Finally, there's the amount of time it took to run the retraining.

My scripts are:
  bulkgraph.py
    read and train on messages, and generate the graph

  bulktrain.sh
    wrapper for bulkgraph.py, times the process and moves databases around

  procmailrc
    a slightly edited version of my .procmailrc file

When I actually use this, I put bulkgraph.py and bulktrain.sh in
the root of my spambayes tree.  Minor tweaks would probably make
this unnecessary, but as a python newbie I don't know what they
are off the top of my head, and I can't be bothered to find out. ;-)
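
One tweak that might do it is to extend sys.path before the spambayes
modules are imported, so the script can live anywhere.  This is an
untested suggestion, shown in Python 3 syntax; add_to_path is a
made-up helper, and the directory is the one bulktrain.sh changes
into, which will differ on other machines.

```python
import os
import sys


def add_to_path(directory):
    """Put directory at the front of sys.path unless it is already listed."""
    directory = os.path.abspath(os.path.expanduser(directory))
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Location of the spambayes checkout (an assumption; adjust to taste).
add_to_path("~/spambayes/active/spambayes")
# Imports of hammie and mboxutils after this point would search there too.
```

Setting PYTHONPATH in the cron environment would be another way to get
the same effect without touching the script.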

--- NEW FILE: bulkgraph.py ---
#! /usr/bin/env python

### Train spambayes on messages in an MH mailbox, with spam identified
### by identical copies in other designated MH mailboxes.
###
### Run this from a cron job on your server.

"""Usage: %(program)s [OPTIONS] ...

Where OPTIONS is one or more of:
    -h
        show usage and exit
    -d DBNAME
        use the DBM store.  A DBM file is larger than the pickle and
        creating it is slower, but loading it is much faster,
        especially for large word databases.  Recommended for use with
        hammiefilter or any procmail-based filter.
    -D DBNAME
        use the pickle store.  A pickle is smaller and faster to create,
        but much slower to load.  Recommended for use with pop3proxy and
        hammiesrv.
    -e PATH
        MH directory holding a copy of every message (ham and spam) to
        train on.  Required.
    -s PATH
        MH directory of known spam messages.  Any message in the -e
        directory with an identical copy here is trained as spam.
        Can be specified more than once.
    -f
        force training.  Accepted, but currently has no effect in this
        script.
    -q
        quiet mode; no output
"""

import mboxutils
import getopt
import hammie
import sys
import os
import re
import time
import filecmp

program = sys.argv[0]
loud = True
day = 24 * 60 * 60
# The following are in days
expire = 4 * 30
grouping = 2

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def row(value, spamday, hamday, unsureday):
    line = "%5d|" % value
    for j in range((expire) // grouping, -1, -1):
        spamv = 0
        hamv = 0
        unsurev = 0
        for k in range(j * grouping, (j + 1) * grouping):
            try:
                spamv += spamday[k]
                hamv += hamday[k]
                unsurev += unsureday[k]
            except IndexError:
                # The leftmost column reaches past the end of the lists
                pass
        spamv = spamv // grouping
        hamv = hamv // grouping
        unsurev = unsurev // grouping
        # print "%d: %ds %dh %du" % (j, spamv, hamv, unsurev)
        count = 0
        char = ' '
        if spamv >= value:
            count += 1
            char = 's'
        if hamv >= value:
            count += 1
            if (char == ' ' or hamv < spamv):
                char = 'h'
        if unsurev >= value:
            count += 1
            if (char == ' ' or
                (char == 's' and unsurev < spamv) or
                (char == 'h' and unsurev < hamv)):
                char = 'u'
        if count > 1:
            char = char.upper()
        line += char
    return line

def main():
    """Main program; parse options and go."""

    global loud
    
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hfqd:D:s:e:')
    except getopt.error, msg:
        usage(2, msg)

    if not opts:
        usage(2, "No options given")

    pck = None
    usedb = None
    force = False
    everything = None
    spam = []
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == "-f":
            force = True
        elif opt == "-q":
            loud = False
        elif opt == '-e':
            everything = arg
        elif opt == '-s':
            spam.append(arg)
        elif opt == "-d":
            usedb = True
            pck = arg
        elif opt == "-D":
            usedb = False
            pck = arg
    if args:
        usage(2, "Positional arguments not allowed")

    if usedb is None:
        usage(2, "Must specify one of -d or -D")

    if everything is None:
        usage(2, "Must specify -e")

    h = hammie.open(pck, usedb, "c")

    spamsizes = {}

    for s in spam:
        if loud: print "Scanning spamdir (%s):" % s
        files = os.listdir(s)
        for f in files:
            if f[0] in ('1', '2', '3', '4', '5', '6', '7', '8', '9'):
                name = os.path.join(s, f)
                size = os.stat(name).st_size
                try:
                    spamsizes[size].append(name)
                except KeyError:
                    spamsizes[size] = [name]

    skipcount = 0
    spamcount = 0
    hamcount = 0
    spamday = [0] * expire
    hamday = [0] * expire
    unsureday = [0] * expire
    date_re = re.compile(
        r";.* (\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{2,4})")
    now = time.mktime(time.strptime(time.strftime("%d %b %Y"), "%d %b %Y"))
    if loud: print "Scanning everything"
    for f in os.listdir(everything):
        if f[0] in ('1', '2', '3', '4', '5', '6', '7', '8', '9'):
            name = os.path.join(everything, f)

            fh = file(name, "rb")
            msg = mboxutils.get_message(fh)
            fh.close()
            # Figure out how old the message is
            age = 2 * expire
            try:
                received = (msg.get_all("Received"))[0]
                received = date_re.search(received).group(1)
                # if loud: print "  %s" % received
                date = time.mktime(time.strptime(received, "%d %b %Y"))
                # if loud: print "  %d" % date
                age = (now - date) // day
                # Can't just continue here... we're in a try
                if age < 0:
                    age = 2 * expire
            except:
                pass
            # Skip anything that has no date or is too old or from the future
            # if loud: print "%s: %d" % (name, age)
            if age >= expire:
                skipcount += 1
                if loud and not (skipcount % 100):
                    sys.stdout.write("-")
                    sys.stdout.flush()
                continue
            age = int(age)

            try:
                if msg.get("X-Spambayes-Classification").find("unsure") >= 0:
                    unsureday[age] += 1
            except:
                pass

            size = os.stat(name).st_size
            isspam = False
            try:
                for s in spamsizes[size]:
                    if filecmp.cmp(name, s):
                        isspam = True
            except KeyError:
                pass
            if isspam:
                spamcount += 1
                spamday[age] += 1
                if loud and not (spamcount % 100):
                    sys.stdout.write("s")
                    sys.stdout.flush()
            else:
                hamcount += 1
                hamday[age] += 1
                if loud and not (hamcount % 100):
                    sys.stdout.write("h")
                    sys.stdout.flush()
            
            h.train(msg, isspam)

    if loud:
        print

        mval = max(max(spamday), max(hamday), max(unsureday))
        scale = (mval + 19) // 20
        print "%5d" % mval
        for j in range(19, -1, -1):
            print row(scale * j, spamday, hamday, unsureday)
        print "     +" + ('-' * 60)
        print

        # Guard against dividing by zero when nothing was trained
        total = max(hamcount + spamcount, 1)
        print "Total: %d ham, %d spam (%.2f%% spam)" % (
            hamcount, spamcount, spamcount * 100.0 / total)

    h.store()


if __name__ == "__main__":
    main()

--- NEW FILE: bulktrain.sh ---
#!/bin/bash
# Bail out if the tree is missing, so nothing below runs elsewhere
cd "$HOME/spambayes/active/spambayes" || exit 1
rm -f tmpdb 2>/dev/null
time /usr/bin/python2.2 bulkgraph.py \
 -d tmpdb \
 -e $HOME/Mail/everything/ \
 -s $HOME/Mail/spam \
 -s $HOME/Mail/newspam \
&& mv -f tmpdb hammiedb
ls -l hammiedb

--- NEW FILE: procmailrc ---
MAILDIR=/home/cashew/popiel/Mail
HOME=/home/cashew/popiel

# Classify message (up here so all copies have the classification)
:0fw:
| /usr/bin/python2.2 $HOME/spambayes/active/spambayes/hammiefilter.py
# And trust the classification
:0Hc:
* ^X-Spambayes-Classification: ham
| /usr/bin/python2.2 $HOME/spambayes/active/spambayes/hammiefilter.py -g
:0Hc:
* ^X-Spambayes-Classification: spam
| /usr/bin/python2.2 $HOME/spambayes/active/spambayes/hammiefilter.py -s


# Save all mail for analysis
:0c:
everything/.


# Block spam
:0H:
* ^Content-Type:.*text/html
newspam/.
:0H:
* ^X-Spambayes-Classification: spam
newspam/.

# Put mail from myself in outbox
:0H:
* ^From:.*popiel\@wolfskeep
outbox/.

# Everything else is presumably good
:0:
inbox/.

--- NEW FILE: spambayes.el ---
;; spambayes.el -- integrate spambayes into Gnus
;; Copyright (C) 2003 Neale Pickett <neale at woozle.org>
;; Time-stamp: <2003-01-21 20:54:15 neale>

;; This is free software; you can redistribute it and/or modify it under
;; the terms of the GNU General Public License as published by the Free
;; Software Foundation; either version 2, or (at your option) any later
;; version.

;; This program is distributed in the hope that it will be useful, but
;; WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
;; General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs; see the file COPYING.  If not, write to the
;; Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

;; Purpose:
;;
;; Functions to put spambayes into Gnus.  
;;
;; This binds "B s" to "refile as spam", and "B h" to "refile as ham".
;; After refiling, the message is rescored and respooled.  I haven't yet
;; run across a case where refiling doesn't change a message's score
;; well into the ham or spam range.  If this happens to you, please let
;; me know.

;; Installation:
;;
;; To install, just drop this file in your load path, and insert the
;; following lines in ~/.gnus:
;;
;; (load-library "spambayes")
;; (add-hook
;;  'gnus-sum-load-hook
;;  (lambda nil
;;    (define-key gnus-summary-mode-map [(B) (s)] 'spambayes-refile-as-spam)
;;    (define-key gnus-summary-mode-map [(B) (h)] 'spambayes-refile-as-ham)))
;;

;; `labels' (used in spambayes-retrain) is a macro from the CL package.
(require 'cl)

(defvar spambayes-spam-group "spam"
  "Group name for spam messages")

(defvar spambayes-hammiefilter "~/src/spambayes/hammiefilter.py"
  "Path to the hammiefilter program")

(defun spambayes-retrain (args)
  "Retrain on all processable articles, or the one under the cursor.

This will replace the buffer contents with command output."
  (labels ((do-exec (n g args)
		    (with-temp-buffer
		      (gnus-request-article-this-buffer n g)
		      (shell-command-on-region (point-min) (point-max)
					       (concat spambayes-hammiefilter " " args)
					       (current-buffer)
					       t)
		      (gnus-request-replace-article n g (current-buffer)))))
    (let ((g gnus-newsgroup-name)
	  (list gnus-newsgroup-processable))
      (if (>= (length list) 1)
	  (while list
	    (let ((n (car list)))
	      (do-exec n g args))
	    (setq list (cdr list)))
	(let ((n (gnus-summary-article-number)))
	  (do-exec n g args))))))

(defun spambayes-refile-as-spam ()
  "Retrain and refilter all process-marked messages as spam, then respool them"
  (interactive)
  (spambayes-retrain "-s -f")
  (gnus-summary-respool-article nil (gnus-group-method gnus-newsgroup-name)))

(defun spambayes-refile-as-ham ()
  "Retrain and refilter all process-marked messages as ham, then respool them"
  (interactive)
  (spambayes-retrain "-g -f")
  (gnus-summary-respool-article nil (gnus-group-method gnus-newsgroup-name)))





