From montanaro@users.sourceforge.net  Fri Nov  1 01:23:30 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Thu, 31 Oct 2002 17:23:30 -0800
Subject: [Spambayes-checkins] spambayes INTEGRATION.txt,NONE,1.1
Message-ID: <E187QX4-0006yo-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26766

Added Files:
	INTEGRATION.txt 
Log Message:
first scribbled notes about integrating Spambayes with different email
packages.


--- NEW FILE: INTEGRATION.txt ---

=======================================
Integrating Spambayes with mail systems
=======================================

General
-------

Spambayes is a tool used to segregate unwanted (spam) mail from the mail you
want (ham).  Before Spambayes can be your spam filter of choice you need to
train it on representative samples of email you receive.  After it's been
trained, you use Spambayes to classify new mail according to its spamminess
and hamminess qualities.

To train Spambayes, you need to save your incoming email for a while,
segregating it into two piles, known spam and known ham (ham is our nickname
for good mail).  It's best to train on recent email, because your interests
and the nature of what spam looks like change over time.  Once you've
collected a fair portion of each (anything is better than nothing, but it
helps to have a couple hundred of each), you can tell Spambayes, "Here's my
ham and my spam".  It will then process that mail and save information about
different patterns which appear in ham and spam.  That information is then
used during the filtering stage.

When Spambayes filters your email, it compares each unclassified message
against the information it saved from training and makes a decision about
whether it thinks the message qualifies as ham or spam, or if it's unsure
about how to classify the message.
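
The three-way decision described above can be sketched in a few lines of
Python.  This is only an illustration, not the code Spambayes itself uses:
the function name and both cutoff values are assumptions chosen for the
example, and it is written for a current Python 3 rather than the 2.2.1 this
document targets.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'.

    The cutoff defaults are hypothetical; tune them to your own mail.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    # Anything between the two cutoffs is left for a human to review.
    return "unsure"
```

Widening the gap between the two cutoffs sends more mail to the unsure pile
and less to the wrong pile, which is usually the safer trade.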

The sections below gather notes on how Spambayes can be integrated into
your mail processing system.  As a general requirement, you
must have a recent version of Python installed on your computer, version
2.2.1 or later.  (Don't ask about backporting it to earlier versions of
Python.  It's almost a certainty this won't happen.)  If you need to install
Python on your system, check the Python download page for the version
appropriate to your computer:

    http://www.python.org/download/


Training
--------

Given a pair of Unix mailbox format files (each message starts with a line
which begins with 'From '), one containing nothing but spam and the other
containing nothing but ham, you can train Spambayes using a command like

    hammie.py -g ~/tmp/newham -s ~/tmp/newspam

The above command is Unix-centric.  In other environments it's likely that a
less command-line-oriented tool will be available in the near future.
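
Before training, it can help to check how many messages each mailbox file
actually holds.  The sketch below uses the standard library's mailbox module
from a current Python 3 (an assumption, since this document targets Python
2.2.1); the function name is made up for the example.

```python
import mailbox


def count_messages(path):
    """Return the number of messages in a Unix mbox file."""
    # create=False makes a mistyped path an error instead of silently
    # creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

Calling count_messages() on each of the two files passed to -g and -s shows
whether you have reached the couple hundred messages of each kind suggested
above.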


Windows
-------

TBD.


Unix/Linux
----------

Unlike on Windows, there are too many combinations of mail reading tools (mutt,
pine, Eudora, ...) and mail transport and delivery tools (sendmail, exim,
procmail, qmail, ...) to attempt to be exhaustive about how to integrate
Spambayes into your environment at this time.  This section just documents
some of what's possible.


Procmail
--------

Many people on Unix-like systems have procmail available as an optional or
as the default local delivery agent.  Integrating Spambayes checking with
Procmail is straightforward.  Once you've trained Spambayes on your
collection of known ham and spam, you can use the hammie.py script to
classify incoming mail like so:

    :0 fw:hamlock
    | /usr/local/bin/hammie.py -f -d -p $HOME/hammie.db

The above Procmail recipe tells it to run /usr/local/bin/hammie.py in filter
mode (-f), and to use the training results stored in the dbm-style file
~/hammie.db (-d -p).  While hammie.py is running, Procmail uses the lock file
hamlock to prevent multiple invocations from stepping on each other's toes.
(It's not strictly necessary in this case since no files on-disk are
modified, but Procmail will still complain if you don't specify a lock
file.)

The result of running hammie.py in filter mode is that Procmail will use the
output from the run as the mail message for further processing downstream.
Hammie.py inserts an X-Hammie-Disposition header in the output message which
looks like

    X-Hammie-Disposition: No; 0.00; '*H*': 1.00; '*S*': 0.00; 'python': 0.00;
	'linux,': 0.01; 'desirable': 0.01; 'cvs,': 0.01; 'perl.': 0.02;
	...

You can then use this to segregate your messages into various inboxes, like
so:

    :0
    * ^X-Hammie-Disposition: Yes
    spam

    :0
    * ^X-Hammie-Disposition: Unsure
    unsure

The first recipe catches all messages which hammie.py classified as spam.
The second catches all messages about which it was unsure.  The combination
allows you to isolate spam from your good mail and tuck away messages it was
unsure about so you can scan them more closely.
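
If you would rather act on the header from a script than from Procmail
recipes, its value can be pulled apart as sketched below.  The layout assumed
here (verdict, overall score, then 'token': probability pairs separated by
semicolons) is inferred only from the sample header shown above; the function
name is made up, the code targets a current Python 3, and tokens that
themselves contain a semicolon are not handled.

```python
def parse_disposition(value):
    """Split an X-Hammie-Disposition value into (verdict, score, clues)."""
    # The header may be folded over several lines; collapse the whitespace.
    flat = " ".join(value.split())
    parts = [p.strip() for p in flat.split(";")]
    parts = [p for p in parts if p]
    verdict = parts[0]          # "Yes", "No" or "Unsure"
    score = float(parts[1])     # overall spam probability
    clues = {}
    for item in parts[2:]:
        # Split on the last colon so tokens containing ':' stay intact.
        word, _, prob = item.rpartition(":")
        clues[word.strip().strip("'")] = float(prob)
    return verdict, score, clues
```

The clues dictionary makes it easy to see which words pushed a message toward
spam or ham when you review the unsure pile.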


X/Emacs+VM
----------

Emacs and XEmacs both come with VM, one of a choice of several Emacs-based
mail packages.  Emacs is extensible using Emacs Lisp or Pymacs.  This
extensibility allows you to easily segregate your incoming mail for training
purposes.  Here's one such example.  If you place the following code in your
~/.vm file:

    (defun copy-to-spam ()
      (interactive)
      (vm-save-message (expand-file-name "~/tmp/newspam"))
      (vm-undelete-message 1))

    (defun copy-to-nonspam ()
      (interactive)
      (vm-save-message (expand-file-name "~/tmp/newham"))
      (vm-undelete-message 1))

    (define-key vm-mode-map "ls" 'copy-to-spam)
    (define-key vm-summary-mode-map "ls" 'copy-to-spam)
    (define-key vm-mode-map "lh" 'copy-to-nonspam)
    (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)

'ls' will save a copy of the current message to ~/tmp/newspam and 'lh' will
save a copy of the current message to ~/tmp/newham.  You can then use those
files later as arguments to hammie.py for training.


Things to watch out for
-----------------------

While Spambayes does an excellent job of classifying incoming mail, it is
only as good as the data on which it was trained.  Here are some tips to
help you create a good training set:

 * Don't use old mail.  The characteristics of your email change over time,
   sometimes subtly, sometimes dramatically, so it's best to use very recent
   mail to train Spambayes.  If you've abandoned an email address in the
   past because it was getting spammed heavily, there are probably some
   clues in mail sent to your old address which would bias Spambayes.

 * Check and recheck your training collections.  While you are manually
   classifying mail as spam or ham, it's easy to make a mistake and toss a
   message or ten in the wrong file.  Such miscategorized mail will throw
   off the classifier.
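
One cheap check on the second tip is to look for messages that ended up in
both piles.  The sketch below compares Message-ID headers across the two
mailbox files using the standard mailbox module of a current Python 3 (an
assumption, as is every function name); it finds only exact duplicates, not
messages that were filed once but in the wrong pile.

```python
import mailbox


def message_ids(path):
    """Return the set of Message-ID values found in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        # Messages without a Message-ID header are skipped.
        return {msg["Message-ID"] for msg in box if msg["Message-ID"]}
    finally:
        box.close()


def filed_in_both(ham_path, spam_path):
    """Return the Message-IDs present in both the ham and spam files."""
    return message_ids(ham_path) & message_ids(spam_path)
```

Any ID this returns points at a message that is being trained as both ham
and spam, so one of the two copies should be removed before training.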


From mhammond@users.sourceforge.net  Fri Nov  1 01:23:39 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 17:23:39 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
	FilterDialog.py,1.6,1.7
Message-ID: <E187QXD-0006ys-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv26773

Modified Files:
	FilterDialog.py 
Log Message:
Missing an import of the win32com constants.


Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FilterDialog.py	31 Oct 2002 21:57:00 -0000	1.6
--- FilterDialog.py	1 Nov 2002 01:23:27 -0000	1.7
***************
*** 7,10 ****
--- 7,11 ----
  import win32api
  import pythoncom
+ from win32com.client import constants
  
  from DialogGlobals import *
***************
*** 365,369 ****
  
  if __name__=='__main__':
!     from win32com.client import Dispatch, constants
      outlook = Dispatch("Outlook.Application")
  
--- 366,370 ----
  
  if __name__=='__main__':
!     from win32com.client import Dispatch
      outlook = Dispatch("Outlook.Application")
  

From mhammond@users.sourceforge.net  Fri Nov  1 01:24:52 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 17:24:52 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.1,1.2
Message-ID: <E187QYO-000726-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv26936

Modified Files:
	about.html 
Log Message:
Add a bit more cruft


Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** about.html	31 Oct 2002 21:56:59 -0000	1.1
--- about.html	1 Nov 2002 01:24:09 -0000	1.2
***************
*** 1,7 ****
! <HTML>
! <Title>About SpamBayes</Title>
! 
! <BODY>
! Contributions welcome!
! </BODY>
! </HTML>
\ No newline at end of file
--- 1,57 ----
! <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
! <html>
! <head>
!   <title>About SpamBayes</title>
! </head>
! <body>
! <span style="font-style: italic;">NOTE: This is very very early code. &nbsp;If
! you are looking at this, you have probably been told about it against our better
! judgement &lt;wink&gt;. &nbsp;Stuff doesn't work correctly. &nbsp;Fields are
! funny. &nbsp;If you want something known to work well today for a lot of people,
! this is not for you.<br>
! </span><br style="font-style: italic;">
! The source code is maintained at <a
!  href="http://spambayes.sourceforge.net">SourceForge</a>.<br>
! <br>
! This spam filter uses Bayesian analysis to filter spam. &nbsp;Unlike other
! spam detection systems, Bayesian systems actually "learn" about what you
! consider spam, and continually adapt as both your regular email and spam
! patterns change.<br>
! <h2>Training</h2>
! Due to the nature of the system, it must be trained before it can be effective.
! &nbsp;Although the system does learn over time, when first installed it has
! no knowledge of either spam or good email.<br>
! <h3>Initial Training</h3>
! When first installed, it is recommended you perform the following steps:<br>
! <ul>
!   <li>Create two folders - one for "Spam", and one for "Possible Spam"</li>
!   <li>Go through your Inbox and Deleted Items, and move as much spam as you
! can find to the "Spam" folder. &nbsp;Try and get as much Spam out of your
! inbox as possible.</li>
!   <li>Select the <span style="font-style: italic;">Training</span> dialog.
! &nbsp;Nominate your Spam folder for spam, and your Inbox for good messages,
! and start training.</li>
! </ul>
! To see how effective your Inbox cleanup was, you may like to try:<br>
! <ul>
!   <li>Go to the <span style="font-style: italic;">Filter Now</span> dialog.</li>
!   <li>Select your Inbox as the folder to filter.</li>
!   <li>Select <span style="font-style: italic;">Score messages, but dont perform
! filter action</span>.</li>
!   <li>Clear both checkboxes so all messages will be scored.</li>
!   <li>Start the score operation.</li>
! </ul>
! You can then look at and sort by the Spam field in your Inbox - this is likely
! to find hidden spam that you missed from your inbox cleanup.
! <h3>Incremental Training</h3>
! When you drag a message to your Spam folder, it will be automatically trained
! as spam. &nbsp;Thus, as the classifier misses spam (or is unsure about them),
! it learns as you correct it.<br>
! If messages are dropped back into the Inbox, they are trained as good - thus,
! the system learns what good messages look like should it incorrectly classify
! it as spam or possible spam.<br>
! <br>
! Contributions to this documentation are welcome!<br>
! <br>
! </body>
! </html>


From tim_one@users.sourceforge.net  Fri Nov  1 02:04:36 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 31 Oct 2002 18:04:36 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.20,1.21 filter.py,1.11,1.12
 manager.py,1.27,1.28 msgstore.py,1.13,1.14
Message-ID: <E187RAq-0001ZF-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000

Modified Files:
	addin.py filter.py manager.py msgstore.py 
Log Message:
Whitespace normalization.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** addin.py	31 Oct 2002 21:56:59 -0000	1.20
--- addin.py	1 Nov 2002 02:03:39 -0000	1.21
***************
*** 300,304 ****
                  self.folder_hooks[k]._obj_.close()
          self.folder_hooks = new_hooks
!         
      def _HookFolderEvents(self, folder_ids, include_sub, HandlerClass):
          new_hooks = {}
--- 300,304 ----
                  self.folder_hooks[k]._obj_.close()
          self.folder_hooks = new_hooks
! 
      def _HookFolderEvents(self, folder_ids, include_sub, HandlerClass):
          new_hooks = {}

Index: filter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** filter.py	31 Oct 2002 21:56:59 -0000	1.11
--- filter.py	1 Nov 2002 02:03:42 -0000	1.12
***************
*** 79,83 ****
          if progress.stop_requested():
              return
!     # All done - report what we did.    
      err_text = ""
      if dispositions.has_key("Error"):
--- 79,83 ----
          if progress.stop_requested():
              return
!     # All done - report what we did.
      err_text = ""
      if dispositions.has_key("Error"):

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** manager.py	31 Oct 2002 21:56:59 -0000	1.27
--- manager.py	1 Nov 2002 02:03:43 -0000	1.28
***************
*** 113,117 ****
                             # "Integer" from the UI doesn't exist!
                             # 'olNumber' doesn't seem to work with PT_INT*
!                            win32com.client.constants.olCombination, 
                             True) # Add to folder
                      item.Save()
--- 113,117 ----
                             # "Integer" from the UI doesn't exist!
                             # 'olNumber' doesn't seem to work with PT_INT*
!                            win32com.client.constants.olCombination,
                             True) # Add to folder
                      item.Save()
***************
*** 130,134 ****
                  self.EnsureOutlookFieldsForFolder(folder.EntryID, True)
                  folder = folders.GetNext()
!     
      def LoadBayes(self):
          if not os.path.exists(self.ini_filename):
--- 130,134 ----
                  self.EnsureOutlookFieldsForFolder(folder.EntryID, True)
                  folder = folders.GetNext()
! 
      def LoadBayes(self):
          if not os.path.exists(self.ini_filename):

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** msgstore.py	31 Oct 2002 21:56:59 -0000	1.13
--- msgstore.py	1 Nov 2002 02:03:45 -0000	1.14
***************
*** 363,367 ****
          # objects use the same name-to-identifier mapping.
          # [MarkH: Note MAPIUUID object are supported and hashable]
!         
          # XXX If the SpamProb (Hammie, whatever) property is passed in as an
          # XXX int, Outlook displays the field as all blanks, and sorting on
--- 363,367 ----
          # objects use the same name-to-identifier mapping.
          # [MarkH: Note MAPIUUID object are supported and hashable]
! 
          # XXX If the SpamProb (Hammie, whatever) property is passed in as an
          # XXX int, Outlook displays the field as all blanks, and sorting on


From tim_one@users.sourceforge.net  Fri Nov  1 02:04:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 31 Oct 2002 18:04:39 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
	FilterDialog.py,1.7,1.8
	ManagerDialog.py,1.4,1.5 TrainingDialog.py,1.6,1.7
Message-ID: <E187RAt-0001b4-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000/dialogs

Modified Files:
	FilterDialog.py ManagerDialog.py TrainingDialog.py 
Log Message:
Whitespace normalization.


Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** FilterDialog.py	1 Nov 2002 01:23:27 -0000	1.7
--- FilterDialog.py	1 Nov 2002 02:03:46 -0000	1.8
***************
*** 213,217 ****
          slider_pos = slider.GetPos()
          self.SetDlgItemText(idc_edit, "%d" % slider_pos)
!  
      def _InitSlider(self, idc_slider, idc_edit):
          slider = self.GetDlgItem(idc_slider)
--- 213,217 ----
          slider_pos = slider.GetPos()
          self.SetDlgItemText(idc_edit, "%d" % slider_pos)
! 
      def _InitSlider(self, idc_slider, idc_edit):
          slider = self.GetDlgItem(idc_slider)
***************
*** 285,289 ****
          [BUTTON,          action_score,         IDC_BUT_ACT_SCORE,   (15,62,203,10), csts | win32con.BS_AUTORADIOBUTTON],
  
!         
          [BUTTON,          only_group,           -1,                  (7,84,230,35),  cs   | win32con.BS_GROUPBOX | win32con.WS_GROUP],
          [BUTTON,          only_unread,          IDC_BUT_UNREAD,      (15,94,149,9),  csts | win32con.BS_AUTOCHECKBOX],
--- 285,289 ----
          [BUTTON,          action_score,         IDC_BUT_ACT_SCORE,   (15,62,203,10), csts | win32con.BS_AUTORADIOBUTTON],
  
! 
          [BUTTON,          only_group,           -1,                  (7,84,230,35),  cs   | win32con.BS_GROUPBOX | win32con.WS_GROUP],
          [BUTTON,          only_unread,          IDC_BUT_UNREAD,      (15,94,149,9),  csts | win32con.BS_AUTOCHECKBOX],

Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** ManagerDialog.py	31 Oct 2002 21:57:00 -0000	1.4
--- ManagerDialog.py	1 Nov 2002 02:03:48 -0000	1.5
***************
*** 28,32 ****
      training_intro = "Training is the process of giving examples of both good and bad email to the system so it can classify future email"
      filtering_intro = "Filtering defines how spam is handled as it arrives"
!     
      dt = [
          # Dialog itself.
--- 28,32 ----
      training_intro = "Training is the process of giving examples of both good and bad email to the system so it can classify future email"
      filtering_intro = "Filtering defines how spam is handled as it arrives"
! 
      dt = [
          # Dialog itself.
***************
*** 39,48 ****
          [BUTTON,          "It is moved from a spam folder back to the Inbox",
                                                  IDC_BUT_TRAIN_FROM_SPAM_FOLDER,(20,50,204,9), csts | win32con.BS_AUTOCHECKBOX],
!         
          [STATIC,          "Automatically train that a message is spam when",
                                                  -1,                  (15,64,208,10), cs],
          [BUTTON,          "It is moved to the certain-spam folder",
                                                  IDC_BUT_TRAIN_TO_SPAM_FOLDER,(20,75,204,9), csts | win32con.BS_AUTOCHECKBOX],
!         
          [STATIC,          "",                   IDC_TRAINING_STATUS, (15,88,146,14),       cs   | win32con.SS_LEFTNOWORDWRAP | win32con.SS_CENTERIMAGE | win32con.SS_SUNKEN],
          [BUTTON,          'Train Now...',       IDC_BUT_TRAIN_NOW,   (167,88,63,14),       csts | win32con.BS_PUSHBUTTON],
--- 39,48 ----
          [BUTTON,          "It is moved from a spam folder back to the Inbox",
                                                  IDC_BUT_TRAIN_FROM_SPAM_FOLDER,(20,50,204,9), csts | win32con.BS_AUTOCHECKBOX],
! 
          [STATIC,          "Automatically train that a message is spam when",
                                                  -1,                  (15,64,208,10), cs],
          [BUTTON,          "It is moved to the certain-spam folder",
                                                  IDC_BUT_TRAIN_TO_SPAM_FOLDER,(20,75,204,9), csts | win32con.BS_AUTOCHECKBOX],
! 
          [STATIC,          "",                   IDC_TRAINING_STATUS, (15,88,146,14),       cs   | win32con.SS_LEFTNOWORDWRAP | win32con.SS_CENTERIMAGE | win32con.SS_SUNKEN],
          [BUTTON,          'Train Now...',       IDC_BUT_TRAIN_NOW,   (167,88,63,14),       csts | win32con.BS_PUSHBUTTON],
***************
*** 72,76 ****
              (IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"),
          ]
!         
          dialog.Dialog.__init__(self, self.dt)
  
--- 72,76 ----
              (IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"),
          ]
! 
          dialog.Dialog.__init__(self, self.dt)
  
***************
*** 125,129 ****
                  filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \
                                  % (watch_names, certain_spam_name, unsure_name)
!                 
          self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable)
          enabled = config.enabled
--- 125,129 ----
                  filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \
                                  % (watch_names, certain_spam_name, unsure_name)
! 
          self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable)
          enabled = config.enabled
***************
*** 133,137 ****
      def OnButAbout(self, id, code):
          if code == win32con.BN_CLICKED:
!             
              fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html")
              fname = os.path.abspath(fname)
--- 133,137 ----
      def OnButAbout(self, id, code):
          if code == win32con.BN_CLICKED:
! 
              fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html")
              fname = os.path.abspath(fname)

Index: TrainingDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/TrainingDialog.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** TrainingDialog.py	31 Oct 2002 21:57:00 -0000	1.6
--- TrainingDialog.py	1 Nov 2002 02:03:52 -0000	1.7
***************
*** 76,80 ****
          if len(self.config.spam_folder_ids)==0 and self.mgr.config.filter.spam_folder_id:
              self.config.spam_folder_ids = [self.mgr.config.filter.spam_folder_id]
!         
          names = []
          for eid in self.config.ham_folder_ids:
--- 76,80 ----
          if len(self.config.spam_folder_ids)==0 and self.mgr.config.filter.spam_folder_id:
              self.config.spam_folder_ids = [self.mgr.config.filter.spam_folder_id]
! 
          names = []
          for eid in self.config.ham_folder_ids:


From tim_one@users.sourceforge.net  Fri Nov  1 02:04:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 31 Oct 2002 18:04:39 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/sandbox delete_outlook_field.py,1.1,1.2
Message-ID: <E187RAt-0001bJ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000/sandbox

Modified Files:
	delete_outlook_field.py 
Log Message:
Whitespace normalization.


Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** delete_outlook_field.py	31 Oct 2002 21:57:00 -0000	1.1
--- delete_outlook_field.py	1 Nov 2002 02:04:03 -0000	1.2
***************
*** 69,73 ****
                                            None,
                                            mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
!     
      table = mapi_folder.GetContentsTable(0)
      prop_ids = PR_ENTRYID,
--- 69,73 ----
                                            None,
                                            mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
! 
      table = mapi_folder.GetContentsTable(0)
      prop_ids = PR_ENTRYID,
***************
*** 152,156 ****
      print msg
  
!     
  def main():
      import getopt
--- 152,156 ----
      print msg
  
! 
  def main():
      import getopt


From npickett@users.sourceforge.net  Fri Nov  1 02:55:35 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Thu, 31 Oct 2002 18:55:35 -0800
Subject: [Spambayes-checkins] spambayes hammiesrv.py,1.8,1.9
Message-ID: <E187RyB-00056r-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18408

Modified Files:
	hammiesrv.py 
Log Message:
* XML-encode the output (thanks Toby Dickenson)


Index: hammiesrv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** hammiesrv.py	27 Oct 2002 05:13:55 -0000	1.8
--- hammiesrv.py	1 Nov 2002 02:55:32 -0000	1.9
***************
*** 41,45 ****
          except AttributeError:
              pass
!         return hammie.Hammie.score(self, msg, *extra)
  
      def filter(self, msg, *extra):
--- 41,45 ----
          except AttributeError:
              pass
!         return xmlrpclib.Binary(hammie.Hammie.score(self, msg, *extra))
  
      def filter(self, msg, *extra):
***************
*** 48,52 ****
          except AttributeError:
              pass
!         return hammie.Hammie.filter(self, msg, *extra)
  
  
--- 48,52 ----
          except AttributeError:
              pass
!         return xmlrpclib.Binary(hammie.Hammie.filter(self, msg, *extra))
  
  
From anthonybaxter@users.sourceforge.net  Fri Nov  1 04:06:52 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:06:52 -0800
Subject: [Spambayes-checkins] website related.ht,1.2,1.3
Message-ID: <E187T5A-0001gC-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv6404

Modified Files:
	related.ht 
Log Message:
bogofilter now on SF.


Index: related.ht
===================================================================
RCS file: /cvsroot/spambayes/website/related.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** related.ht	30 Sep 2002 04:02:31 -0000	1.2
--- related.ht	1 Nov 2002 04:06:49 -0000	1.3
***************
*** 9,13 ****
  <li>Gary Arnold's <a href="http://www.garyarnold.com/projects.php#bayespam">bayespam</a>, a perl qmail filter.
  <li>The mozilla project is working on this, see <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=163188">bug 163188</a>
! <li>Eric Raymond's <a href="http://www.tuxedo.org/~esr/bogofilter/">bogofilter</a>, a C code bayesian filter.
  <li><a href="http://www.ai.mit.edu/~jrennie/ifile/">ifile</a>, a Naive Bayes classification system.
  <li><a href="http://sourceforge.net/projects/pasp">PASP</a>, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.
--- 9,13 ----
  <li>Gary Arnold's <a href="http://www.garyarnold.com/projects.php#bayespam">bayespam</a>, a perl qmail filter.
  <li>The mozilla project is working on this, see <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=163188">bug 163188</a>
! <li>Eric Raymond's <a href="http://bogofilter.sf.net/">bogofilter</a>, a C code bayesian filter.
  <li><a href="http://www.ai.mit.edu/~jrennie/ifile/">ifile</a>, a Naive Bayes classification system.
  <li><a href="http://sourceforge.net/projects/pasp">PASP</a>, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.


From anthonybaxter@users.sourceforge.net  Fri Nov  1 04:10:52 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:10:52 -0800
Subject: [Spambayes-checkins] spambayes timcv.py,1.10,1.11 msgs.py,1.4,1.5
Message-ID: <E187T92-00025S-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7003

Modified Files:
	timcv.py msgs.py 
Log Message:
Added support for specifying different numbers for training and testing
ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam) 
still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain
have been added to timcv.py.

Note that msgs.setparms _tries_ to do the right thing if it's called as
an old 3-arg form, but I might not have captured all the possible 
twistedness. As far as I can tell, only timcv.py and timtest.py
actually call these


Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** timcv.py	10 Oct 2002 04:55:15 -0000	1.10
--- timcv.py	1 Nov 2002 04:10:50 -0000	1.11
***************
*** 14,24 ****
  If you only want to use some of the messages in each set,
  
      --ham-keep int
!         The maximum number of msgs to use from each Ham set.  The msgs are
!         chosen randomly.  See also the -s option.
  
      --spam-keep int
!         The maximum number of msgs to use from each Spam set.  The msgs are
!         chosen randomly.  See also the -s option.
  
      -s int
--- 14,40 ----
  If you only want to use some of the messages in each set,
  
+     --HamTrain int
+         The maximum number of msgs to use from each Ham set for training.  
+         The msgs are chosen randomly.  See also the -s option.
+ 
+     --SpamTrain int
+         The maximum number of msgs to use from each Spam set for training.
+         The msgs are chosen randomly.  See also the -s option.
+ 
+     --HamTest int
+         The maximum number of msgs to use from each Ham set for testing.  
+         The msgs are chosen randomly.  See also the -s option.
+ 
+     --SpamTest int
+         The maximum number of msgs to use from each Spam set for testing.
+         The msgs are chosen randomly.  See also the -s option.
+ 
      --ham-keep int
!         The maximum number of msgs to use from each Ham set for testing
!         and training. The msgs are chosen randomly.  See also the -s option.
  
      --spam-keep int
!         The maximum number of msgs to use from each Spam set for testing
!         and training. The msgs are chosen randomly.  See also the -s option.
  
      -s int
***************
*** 57,62 ****
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
  
      # Now run nsets times, predicting pair i against all except pair i.
--- 73,80 ----
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), 
!                             hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), 
!                             spamdirs[1:], train=1))
  
      # Now run nsets times, predicting pair i against all except pair i.
***************
*** 64,69 ****
          h = hamdirs[i]
          s = spamdirs[i]
!         hamstream = msgs.HamStream(h, [h])
!         spamstream = msgs.SpamStream(s, [s])
  
          if i > 0:
--- 82,87 ----
          h = hamdirs[i]
          s = spamdirs[i]
!         hamstream = msgs.HamStream(h, [h], train=0)
!         spamstream = msgs.SpamStream(s, [s], train=0)
  
          if i > 0:
***************
*** 80,84 ****
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2))
  
              else:
--- 98,103 ----
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1), 
!                         msgs.SpamStream(sname, s2, train=1))
  
              else:
***************
*** 101,109 ****
      try:
          opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
!                                    ['ham-keep=', 'spam-keep='])
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = seed = hamkeep = spamkeep = None
      for opt, arg in opts:
          if opt == '-h':
--- 120,131 ----
      try:
          opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
!                                    ['HamTrain=', 'SpamTrain=',
!                                    'HamTest=', 'SpamTest=',
!                                    'ham-keep=', 'spam-keep='])
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = seed = hamtrain = spamtrain = None
!     hamtest = spamtest = hamkeep = spamkeep = None
      for opt, arg in opts:
          if opt == '-h':
***************
*** 113,116 ****
--- 135,146 ----
          elif opt == '-s':
              seed = int(arg)
+         elif opt == '--HamTest':
+             hamtest = int(arg)
+         elif opt == '--SpamTest':
+             spamtest = int(arg)
+         elif opt == '--HamTrain':
+             hamtrain = int(arg)
+         elif opt == '--SpamTrain':
+             spamtrain = int(arg)
          elif opt == '--ham-keep':
              hamkeep = int(arg)
***************
*** 123,127 ****
          usage(1, "-n is required")
  
!     msgs.setparms(hamkeep, spamkeep, seed)
      drive(nsets)
  
--- 153,160 ----
          usage(1, "-n is required")
  
!     if hamkeep is not None:
!         msgs.setparms(hamkeep, spamkeep, seed=seed)
!     else:
!         msgs.setparms(hamtrain, spamtrain, hamtest, spamtest, seed)
      drive(nsets)
  

Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** msgs.py	25 Sep 2002 20:07:06 -0000	1.4
--- msgs.py	1 Nov 2002 04:10:50 -0000	1.5
***************
*** 6,11 ****
  from tokenizer import tokenize
  
! HAMKEEP  = None
! SPAMKEEP = None
  SEED = random.randrange(2000000000)
  
--- 6,13 ----
  from tokenizer import tokenize
  
! HAMTEST  = None
! SPAMTEST = None
! HAMTRAIN  = None
! SPAMTRAIN = None
  SEED = random.randrange(2000000000)
  
***************
*** 68,83 ****
  
  class HamStream(MsgStream):
!     def __init__(self, tag, directories):
!         MsgStream.__init__(self, tag, directories, HAMKEEP)
  
  class SpamStream(MsgStream):
!     def __init__(self, tag, directories):
!         MsgStream.__init__(self, tag, directories, SPAMKEEP)
  
! def setparms(hamkeep, spamkeep, seed=None):
!     """Set HAMKEEP and SPAMKEEP.  If seed is not None, also set SEED."""
  
!     global HAMKEEP, SPAMKEEP, SEED
!     HAMKEEP, SPAMKEEP = hamkeep, spamkeep
      if seed is not None:
          SEED = seed
--- 70,103 ----
  
  class HamStream(MsgStream):
!     def __init__(self, tag, directories, train=0):
!         if train:
!             MsgStream.__init__(self, tag, directories, HAMTRAIN)
!         else:
!             MsgStream.__init__(self, tag, directories, HAMTEST)
  
  class SpamStream(MsgStream):
!     def __init__(self, tag, directories, train=0):
!         if train:
!             MsgStream.__init__(self, tag, directories, SPAMTRAIN)
!         else:
!             MsgStream.__init__(self, tag, directories, SPAMTEST)
  
! def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.  
!        If seed is not None, also set SEED.
!        If (ham|spam)test are not set, set to the same as the (ham|spam)train
!        numbers (backwards compat option).
!     """
  
!     global HAMTEST, SPAMTEST, HAMTRAIN, SPAMTRAIN, SEED
!     HAMTRAIN, SPAMTRAIN = hamtrain, spamtrain
!     if hamtest is None:
!         HAMTEST = HAMTRAIN
!     else:
!         HAMTEST = hamtest
!     if spamtest is None:
!         SPAMTEST = SPAMTRAIN
!     else:
!         SPAMTEST = spamtest
      if seed is not None:
          SEED = seed


From anthonybaxter@users.sourceforge.net  Fri Nov  1 04:13:13 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:13:13 -0800
Subject: [Spambayes-checkins] spambayes timtest.py,1.29,1.30
Message-ID: <E187TBJ-0002Hk-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8231

Modified Files:
	timtest.py 
Log Message:
Added support for specifying different numbers for training and testing
ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam) 
still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain  
have been added to timcv.py.
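
For anyone updating their own harness, the option handling described above
can be sketched roughly as follows. This is a simplified Python 3
illustration of what the timcv.py diff does, not the script itself:
resolve_limits and LONG_OPTS are hypothetical names made up for the example,
and the returned tuple stands for the arguments handed on to msgs.setparms.

```python
import getopt

# Long option -> name of the limit it sets (names mirror the diff's variables).
LONG_OPTS = {
    '--HamTrain': 'hamtrain', '--SpamTrain': 'spamtrain',
    '--HamTest': 'hamtest', '--SpamTest': 'spamtest',
    '--ham-keep': 'hamkeep', '--spam-keep': 'spamkeep',
}


def resolve_limits(argv):
    """Return the (hamtrain, spamtrain, hamtest, spamtest) arguments.

    Hypothetical helper: it only models which numbers are passed on,
    it does not call msgs.setparms.
    """
    longopts = [name[2:] + '=' for name in LONG_OPTS]
    opts, _args = getopt.getopt(argv, 'hn:s:', longopts)
    limits = dict.fromkeys(LONG_OPTS.values())  # every limit starts as None
    for opt, arg in opts:
        if opt in LONG_OPTS:
            limits[LONG_OPTS[opt]] = int(arg)
    if limits['hamkeep'] is not None:
        # Old form: one number per class; the test limits are left unset
        # so that they fall back to the training limits later on.
        return (limits['hamkeep'], limits['spamkeep'], None, None)
    # New form: separate numbers for training and testing.
    return (limits['hamtrain'], limits['spamtrain'],
            limits['hamtest'], limits['spamtest'])
```

With this logic the old form is chosen only when --ham-keep is given, so
--spam-keep on its own has no effect. The abbreviations --ham/--spam keep
working because getopt accepts any unambiguous prefix of a long option.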

Note that msgs.setparms _tries_ to do the right thing if it's called as
an old 3-arg form, but I might not have captured all the possible 
twistedness. As far as I can tell, only timcv.py and timtest.py
actually call these. Also, msgs.HamStream and msgs.SpamStream now
have an optional 'train' argument (which defaults to 0/False), which
tells them whether to use the test or train numbers. 
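
A rough sketch of that compatibility behaviour, written as standalone
Python 3 rather than copied from msgs.py. The module-level names follow the
msgs.py diff; stream_limit is a hypothetical helper standing in for the
choice HamStream and SpamStream make from their 'train' argument.

```python
import random

HAMTRAIN = SPAMTRAIN = HAMTEST = SPAMTEST = None
SEED = random.randrange(2000000000)


def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
    """Set the per-set limits; unset test limits copy the train limits."""
    global HAMTRAIN, SPAMTRAIN, HAMTEST, SPAMTEST, SEED
    HAMTRAIN, SPAMTRAIN = hamtrain, spamtrain
    HAMTEST = hamtrain if hamtest is None else hamtest
    SPAMTEST = spamtrain if spamtest is None else spamtest
    if seed is not None:
        SEED = seed


def stream_limit(is_spam, train=0):
    """Hypothetical helper: the limit a Ham/SpamStream would pick."""
    if is_spam:
        return SPAMTRAIN if train else SPAMTEST
    return HAMTRAIN if train else HAMTEST


# Old-style call with the seed passed by keyword: test limits follow train.
setparms(100, 50, seed=42)
# New-style call: training and testing use different limits.
setparms(200, 300, 20, 30)
# Pitfall: a positional third argument is now read as hamtest, not seed.
setparms(100, 50, 12345)
```

The last call is the case to watch in private harnesses: a seed passed as the
third positional argument now lands in hamtest, which is why timcv.py and
timtest.py switched to seed=seed.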

If you have your own test harnesses, you _might_ need to update them
a little. 


Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** timtest.py	24 Sep 2002 05:37:11 -0000	1.29
--- timtest.py	1 Nov 2002 04:13:11 -0000	1.30
***************
*** 98,102 ****
          usage(1, "-n is required")
  
!     msgs.setparms(hamkeep, spamkeep, seed)
      drive(nsets)
  
--- 98,102 ----
          usage(1, "-n is required")
  
!     msgs.setparms(hamkeep, spamkeep, seed=seed)
      drive(nsets)
  

From anthony@interlink.com.au  Fri Nov  1 04:13:29 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 15:13:29 +1100
Subject: [Spambayes-checkins] spambayes timcv.py,1.10,1.11 msgs.py,1.4,1.5
	
In-Reply-To: <E187T92-00025S-00@usw-pr-cvs1.sourceforge.net> 
Message-ID: <200211010413.gA14DUn09404@localhost.localdomain>


>>> "Anthony Baxter" wrote
> Update of /cvsroot/spambayes/spambayes
> In directory usw-pr-cvs1:/tmp/cvs-serv7003
> 
> Modified Files:
> 	timcv.py msgs.py 
> Log Message:
> Added support for specifying different numbers for training and testing
> ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam) 
> still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain
> have been added to timcv.py.
> 
> Note that msgs.setparms _tries_ to do the right thing if it's called as
> an old 3-arg form, but I might not have captured all the possible 
> twistedness. As far as I can tell, only timcv.py and timtest.py
> actually call these

Weird. My cvs commit aborted and only did two of the files, and truncated
my commit message??? I'll use cvs admin to fix the commit message next.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From anthonybaxter@users.sourceforge.net  Fri Nov  1 04:50:21 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:50:21 -0800
Subject: [Spambayes-checkins] 
 website applications.ht,NONE,1.1 index.ht,1.1.1.1,1.2 links.h,1.2,1.3
Message-ID: <E187TlF-0005J0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv20352

Modified Files:
	index.ht links.h 
Added Files:
	applications.ht 
Log Message:
initial 'applications' notes.


--- NEW FILE: applications.ht ---
Title: SpamBayes: Applications
Author-Email: spambayes@python.org
Author: spambayes

<h2>Applications</h2>
<p>A number of applications are available in the SpamBayes project. None
of these are particularly polished, finished pieces of work, but they're
getting there (and help is always appreciated).
</p>
<h3>Outlook2000</h3>
<p>Sean True and Mark Hammond have developed an addin for Outlook2000 that
adds support for the spambayes classifier. 
<h4>Requirements</h4>
<ul>
<li>Python2.2 or later (2.2.2 recommended)
<li>Outlook 2000 (<b>not</b> Outlook Express)
<li>Python's <a href="http://starship.python.net/crew/mhammond">win32com</a>
extensions (win32all-149 or later)
<li>CDO installed.
</ul>
For more on this, see the <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/Outlook2000/README.txt?rev=HEAD&content-type=text/plain">README.txt</a> or 
<a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/Outlook2000/about.html?rev=HEAD&content-type=text/html">about.html</a> file in the spambayes CVS repository's Outlook2000 directory.
</p>
<h4>Availability</h4>
<p>At the moment, you'll need to use CVS to get the code - go <a href="http://sourceforge.net/cvs/?group_id=61702">to the CVS page</a> on the project's sourceforge site for more.</p>

<h3>hammie.py</h3>
<p>hammie is a command line tool for marking mail as ham or spam. Skip Montanaro has started a <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/INTEGRATION.txt?rev=HEAD&content-type=text/plain">guide to integrating hammie with your mailer</a> (Unix-only instructions at the moment - additions welcome!). 
Currently it focusses on running hammie via procmail. </p>
<h4>Requirements</h4>
<ul>
<li>Python2.2 or later (2.2.2 recommended)
<li>Currently documentation focusses on Unix.
</ul>
<h4>Availability</h4>
<p>At the moment, you'll need to use CVS to get the code - go <a href="http://sourceforge.net/cvs/?group_id=61702">to the CVS page</a> on the project's sourceforge site for more.</p>

<h3>pop3proxy.py</h3>
<p>pop3proxy sits between your mail client and your real POP3 server and marks
mail as ham or spam as it passes through. See the <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/pop3proxy.py?rev=HEAD&content-type=text/plain">docstring at the top of pop3proxy.py</a> for more.
<h4>Requirements</h4>
<ul>
<li>Python2.2 or later (2.2.2 recommended)
<li>Should work on windows/unix/whatever... ?
</ul>
</p>
<h4>Availability</h4>
<p>At the moment, you'll need to use CVS to get the code - go <a href="http://sourceforge.net/cvs/?group_id=61702">to the CVS page</a> on the project's sourceforge site for more.</p>


Index: index.ht
===================================================================
RCS file: /cvsroot/spambayes/website/index.ht,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** index.ht	19 Sep 2002 08:40:55 -0000	1.1.1.1
--- index.ht	1 Nov 2002 04:50:19 -0000	1.2
***************
*** 12,16 ****
  <a href="http://sourceforge.net/cvs/?group_id=61702">via CVS</a> - 
  note that it's not yet 
! suitable for end-users, but for people interested in experimenting.
  </p>
  
--- 12,22 ----
  <a href="http://sourceforge.net/cvs/?group_id=61702">via CVS</a> - 
  note that it's not yet 
! suitable for non-technical end-users, but for people interested 
! in experimenting.
! </p>
! <p>
! There are now a couple of end-user applications available for those
! excited by the bleeding edge - these are detailed on the 
! <a href="applications.html">Applications</a> page.
  </p>
  

Index: links.h
===================================================================
RCS file: /cvsroot/spambayes/website/links.h,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** links.h	19 Sep 2002 23:39:24 -0000	1.2
--- links.h	1 Nov 2002 04:50:19 -0000	1.3
***************
*** 3,6 ****
--- 3,7 ----
  <li><a href="background.html">Background</a>
  <li><a href="docs.html">Documentation</a>
+ <li><a href="applications.html">Applications</a>
  <li><a href="developer.html">Developers</a>
  <li><a href="related.html">Related</a>


From mhammond@users.sourceforge.net  Fri Nov  1 05:48:02 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 21:48:02 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/dialogs FolderSelector.py,1.5,1.6
Message-ID: <E187Uf4-0000CY-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv548/dialogs

Modified Files:
	FolderSelector.py 
Log Message:
All items are now identified by a (store_id, entry_id) tuple.  This was
done in such a way that old config files should be fully supported - no
need to reconfigure.

Not much should look different, except multiple stores should be *fully*
supported - you should be able to train and filter across stores to your
heart's content.
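
The backwards-compatible matching described above can be sketched without
MAPI as follows. This is only an illustration under stated assumptions:
as_pair, compare_ids and in_ids are hypothetical stand-ins for the dialog's
CompareIDs/InIDs methods, and a case-insensitive comparison of the hex
strings replaces the mapi.CompareEntryIDs call that the real code relies on.

```python
def as_pair(folder_id, default_store_id):
    """Promote an old-style bare entry ID to a (store_id, entry_id) tuple."""
    if isinstance(folder_id, tuple):
        return folder_id
    return (default_store_id, folder_id)


def compare_ids(id1, id2, default_store_id):
    """True when both IDs name the same entry in the same store.

    Assumption: plain hex-string equality is good enough for the sketch;
    the real dialog asks MAPI to compare the binary entry IDs instead.
    """
    store1, entry1 = as_pair(id1, default_store_id)
    store2, entry2 = as_pair(id2, default_store_id)
    return (store1.upper() == store2.upper()
            and entry1.upper() == entry2.upper())


def in_ids(folder_id, ids, default_store_id):
    """Membership test that tolerates a mix of old and new ID formats."""
    return any(compare_ids(candidate, folder_id, default_store_id)
               for candidate in ids)
```

A selection saved by an old config file (bare hex strings) therefore still
matches folders in the default store, while folders in other stores match
only on the full tuple.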


Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** FolderSelector.py	31 Oct 2002 21:57:00 -0000	1.5
--- FolderSelector.py	1 Nov 2002 05:47:59 -0000	1.6
***************
*** 53,63 ****
  from win32com.mapi.mapitags import *
  
  def _BuildFoldersMAPI(msgstore, folder):
      # Get the hierarchy table for it.
      table = folder.GetHierarchyTable(0)
      children = []
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID,PR_DISPLAY_NAME_A), None, None, 0)
!     for (eid_tag, eid),(name_tag, name) in rows:
!         spec = FolderSpec(mapi.HexFromBin(eid), name)
          child_folder = msgstore.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
          spec.children = _BuildFoldersMAPI(msgstore, child_folder)
--- 53,66 ----
  from win32com.mapi.mapitags import *
  
+ default_store_id = None
+ 
  def _BuildFoldersMAPI(msgstore, folder):
      # Get the hierarchy table for it.
      table = folder.GetHierarchyTable(0)
      children = []
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
!     for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
!         folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
!         spec = FolderSpec(folder_id, name)
          child_folder = msgstore.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
          spec.children = _BuildFoldersMAPI(msgstore, child_folder)
***************
*** 66,79 ****
  
  def BuildFolderTreeMAPI(session):
      root = FolderSpec(None, "root")
      tab = session.GetMsgStoresTable(0)
!     rows = mapi.HrQueryAllRows(tab, (PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
      for row in rows:
!         (eid_tag, eid), (name_tag, name) = row
          msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
          hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
          subtree_eid = data[0][1]
          folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
!         spec = FolderSpec(mapi.HexFromBin(subtree_eid), name)
          spec.children = _BuildFoldersMAPI(msgstore, folder)
          root.children.append(spec)
--- 69,89 ----
  
  def BuildFolderTreeMAPI(session):
+     global default_store_id
      root = FolderSpec(None, "root")
      tab = session.GetMsgStoresTable(0)
!     prop_tags = PR_ENTRYID, PR_DEFAULT_STORE, PR_DISPLAY_NAME_A
!     rows = mapi.HrQueryAllRows(tab, prop_tags, None, None, 0)
      for row in rows:
!         (eid_tag, eid), (is_def_tag, is_def), (name_tag, name) = row
!         hex_eid = mapi.HexFromBin(eid)
!         if is_def:
!             default_store_id = hex_eid
! 
          msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
          hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
          subtree_eid = data[0][1]
          folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
!         folder_id = hex_eid, mapi.HexFromBin(subtree_eid)
!         spec = FolderSpec(folder_id, name)
          spec.children = _BuildFoldersMAPI(msgstore, folder)
          root.children.append(spec)
***************
*** 126,129 ****
--- 136,153 ----
          self.checkbox_text = checkbox_text or "Include &subfolders"
  
+     def CompareIDs(self, id1, id2):
+         if type(id1) != type(()):
+             id1 = default_store_id, id1
+         if type(id2) != type(()):
+             id2 = default_store_id, id2
+         return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
+                self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))
+ 
+     def InIDs(self, id, ids):
+         for id_check in ids:
+             if self.CompareIDs(id_check, id):
+                 return True
+         return False
+ 
      def _MakeItemParam(self, item):
          item_id = self.next_item_id
***************
*** 144,148 ****
                  mask = state = 0
              else:
!                 if self.selected_ids and child.folder_id in self.selected_ids:
                      state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
                      num_children_selected += 1
--- 168,172 ----
                  mask = state = 0
              else:
!                 if self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
                      state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
                      num_children_selected += 1
***************
*** 152,156 ****
              item_id = self._MakeItemParam(child)
              hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
!             if self.single_select and self.selected_ids and child.folder_id in self.selected_ids:
                  self.list.SelectItem(hitem)
  
--- 176,180 ----
              item_id = self._MakeItemParam(child)
              hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
!             if self.single_select and self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
                  self.list.SelectItem(hitem)
  

From mhammond@users.sourceforge.net  Fri Nov  1 05:48:01 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 21:48:01 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.21,1.22 manager.py,1.28,1.29
 msgstore.py,1.14,1.15
Message-ID: <E187Uf3-0000CP-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv548

Modified Files:
	addin.py manager.py msgstore.py 
Log Message:
All items are now identified by a (store_id, entry_id) tuple.  This was
done in such a way that old config files should be fully supported - no
need to reconfigure.

Not much should look different, except multiple stores should be *fully*
supported - you should be able to train and filter across stores to your
heart's content.
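
How old and new IDs end up in one form can be sketched as below. It follows
the NormalizeID method in the msgstore.py diff, but it is a standalone
Python 3 approximation: normalize_id is a hypothetical free function,
binascii.unhexlify is assumed to be an adequate stand-in for
mapi.BinFromHex, and the default store's binary ID is passed in explicitly
instead of being read from the session.

```python
from binascii import unhexlify


def normalize_id(item_id, default_store_bin_eid):
    """Return a (store_id, entry_id) tuple of bytes for any accepted form.

    Accepted forms, as listed in the msgstore.py comments:
    - a bare hex entry ID (old config files): default store assumed
    - (None, hex_entry_id): default store assumed
    - (hex_store_id, hex_entry_id)
    """
    if isinstance(item_id, tuple):
        store_hex, entry_hex = item_id
        if store_hex is None:
            store_bin = default_store_bin_eid
        else:
            store_bin = unhexlify(store_hex)
        return store_bin, unhexlify(entry_hex)
    if not isinstance(item_id, str):
        raise TypeError("What kind of ID is %r?" % (item_id,))
    return default_store_bin_eid, unhexlify(item_id)
```

Because every public entry point normalizes first, the rest of the code only
ever sees binary (store_id, entry_id) tuples, which is what lets old config
files keep working without reconfiguration.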


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** addin.py	1 Nov 2002 02:03:39 -0000	1.21
--- addin.py	1 Nov 2002 05:47:59 -0000	1.22
***************
*** 308,312 ****
              existing = self.folder_hooks.get(eid)
              if existing is None or existing.__class__ != HandlerClass:
!                 folder = self.application.Session.GetFolderFromID(eid)
                  name = folder.Name.encode("mbcs", "replace")
                  try:
--- 308,312 ----
              existing = self.folder_hooks.get(eid)
              if existing is None or existing.__class__ != HandlerClass:
!                 folder = self.application.Session.GetFolderFromID(*eid)
                  name = folder.Name.encode("mbcs", "replace")
                  try:

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** manager.py	1 Nov 2002 02:03:43 -0000	1.28
--- manager.py	1 Nov 2002 05:47:59 -0000	1.29
***************
*** 92,96 ****
          assert self.outlook is not None, "I need outlook :("
          ol = self.outlook
!         folder = ol.Session.GetFolderFromID(folder_id)
          if self.verbose > 1:
              print "Checking folder '%s' for our field '%s'" \
--- 92,96 ----
          assert self.outlook is not None, "I need outlook :("
          ol = self.outlook
!         folder = ol.Session.GetFolderFromID(*folder_id)
          if self.verbose > 1:
              print "Checking folder '%s' for our field '%s'" \

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** msgstore.py	1 Nov 2002 02:03:45 -0000	1.14
--- msgstore.py	1 Nov 2002 05:47:59 -0000	1.15
***************
*** 91,123 ****
                        mapi.MAPI_USE_DEFAULT)
          self.session = mapi.MAPILogonEx(0, None, None, logonFlags)
!         self._FindDefaultMessageStore()
          os.chdir(cwd)
  
      def Close(self):
!         self.mapi_msgstore = None
          self.session.Logoff(0, 0, 0)
          self.session = None
          mapi.MAPIUninitialize()
  
!     def _FindDefaultMessageStore(self):
!         tab = self.session.GetMsgStoresTable(0)
!         # Restriction for the table:  get rows where PR_DEFAULT_STORE is true.
!         # There should be only one.
!         restriction = (mapi.RES_PROPERTY,   # a property restriction
!                        (mapi.RELOP_EQ,      # check for equality
!                         PR_DEFAULT_STORE,   # of the PR_DEFAULT_STORE prop
!                         (PR_DEFAULT_STORE, True))) # with True
!         rows = mapi.HrQueryAllRows(tab,
!                                    (PR_ENTRYID,),   # columns to retrieve
!                                    restriction,     # only these rows
!                                    None,            # any sort order is fine
!                                    0)               # any # of results is fine
!         # get first entry, a (property_tag, value) pair, for PR_ENTRYID
!         row = rows[0]
!         eid_tag, eid = row[0]
!         # Open the store.
!         self.mapi_msgstore = self.session.OpenMsgStore(
                                  0,      # no parent window
!                                 eid,    # msg store to open
                                  None,   # IID; accept default IMsgStore
                                  # need write access to add score fields
--- 91,135 ----
                        mapi.MAPI_USE_DEFAULT)
          self.session = mapi.MAPILogonEx(0, None, None, logonFlags)
!         self.mapi_msg_stores = {}
!         self.default_store_bin_eid = None
!         self._GetMessageStore(None)
          os.chdir(cwd)
  
      def Close(self):
!         self.mapi_msg_stores = None
          self.session.Logoff(0, 0, 0)
          self.session = None
          mapi.MAPIUninitialize()
  
!     def _GetMessageStore(self, store_eid): # bin eid.
!         try:
!             # Will usually be pre-fetched, so fast-path out
!             return self.mapi_msg_stores[store_eid]
!         except KeyError:
!             pass
!         given_store_eid = store_eid
!         if store_eid is None:
!             # Find the EID for the default store.
!             tab = self.session.GetMsgStoresTable(0)
!             # Restriction for the table:  get rows where PR_DEFAULT_STORE is true.
!             # There should be only one.
!             restriction = (mapi.RES_PROPERTY,   # a property restriction
!                            (mapi.RELOP_EQ,      # check for equality
!                             PR_DEFAULT_STORE,   # of the PR_DEFAULT_STORE prop
!                             (PR_DEFAULT_STORE, True))) # with True
!             rows = mapi.HrQueryAllRows(tab,
!                                        (PR_ENTRYID,),   # columns to retrieve
!                                        restriction,     # only these rows
!                                        None,            # any sort order is fine
!                                        0)               # any # of results is fine
!             # get first entry, a (property_tag, value) pair, for PR_ENTRYID
!             row = rows[0]
!             eid_tag, store_eid = row[0]
!             self.default_store_bin_eid = store_eid
! 
!         # Open it.
!         store = self.session.OpenMsgStore(
                                  0,      # no parent window
!                                 store_eid,    # msg store to open
                                  None,   # IID; accept default IMsgStore
                                  # need write access to add score fields
***************
*** 126,158 ****
                                      mapi.MDB_NO_MAIL |
                                      USE_DEFERRED_ERRORS)
  
      def _GetSubFolderIter(self, folder):
          table = folder.GetHierarchyTable(0)
          rows = mapi.HrQueryAllRows(table,
!                                    (PR_ENTRYID, PR_DISPLAY_NAME_A),
                                     None,
                                     None,
                                     0)
!         for (eid_tag, eid),(name_tag, name) in rows:
!             sub = self.mapi_msgstore.OpenEntry(eid,
!                                                None,
!                                                mapi.MAPI_MODIFY |
!                                                    USE_DEFERRED_ERRORS)
              table = sub.GetContentsTable(0)
!             yield MAPIMsgStoreFolder(self, eid, name, table.GetRowCount(0))
!             folder = self.mapi_msgstore.OpenEntry(eid,
!                                                   None,
!                                                   mapi.MAPI_MODIFY |
!                                                       USE_DEFERRED_ERRORS)
!             for store_folder in self._GetSubFolderIter(folder):
                  yield store_folder
  
      def GetFolderGenerator(self, folder_ids, include_sub):
          for folder_id in folder_ids:
!             folder_id = mapi.BinFromHex(folder_id)
!             folder = self.mapi_msgstore.OpenEntry(folder_id,
!                                                   None,
!                                                   mapi.MAPI_MODIFY |
!                                                       USE_DEFERRED_ERRORS)
              table = folder.GetContentsTable(0)
              rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
--- 138,191 ----
                                      mapi.MDB_NO_MAIL |
                                      USE_DEFERRED_ERRORS)
+         # cache it
+         self.mapi_msg_stores[store_eid] = store
+         if given_store_eid is None: # The default store
+             self.mapi_msg_stores[None] = store
+         return store
+ 
+     def _OpenEntry(self, id, iid = None, flags = None):
+         # id is already normalized.
+         store_id, item_id = id
+         store = self._GetMessageStore(store_id)
+         if flags is None:
+             flags = mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS
+         return store.OpenEntry(item_id, iid, flags)
+ 
+     # Given an ID, normalize it into a (store_id, item_id) binary tuple.
+     # item_id may be:
+     # - Simple hex EID, in which case default store ID is assumed.
+     # - Tuple of (None, hex_eid), in which case default store assumed.
+     # - Tuple of (hex_store_id, hex_id)
+     def NormalizeID(self, item_id):
+         if type(item_id)==type(()):
+             store_id, item_id = item_id
+             item_id = mapi.BinFromHex(item_id)
+             if store_id is None:
+                 store_id = self.default_store_bin_eid
+             else:
+                 store_id = mapi.BinFromHex(store_id)
+             return store_id, item_id
+         assert type(item_id) in [type(''), type(u'')], "What kind of ID is '%r'?" % (item_id,)
+         return self.default_store_bin_eid, mapi.BinFromHex(item_id)
  
      def _GetSubFolderIter(self, folder):
          table = folder.GetHierarchyTable(0)
          rows = mapi.HrQueryAllRows(table,
!                                    (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A),
                                     None,
                                     None,
                                     0)
!         for (eid_tag, eid), (store_eid_tag, store_eid), (name_tag, name) in rows:
!             item_id = store_eid, eid
!             sub = self._OpenEntry(item_id)
              table = sub.GetContentsTable(0)
!             yield MAPIMsgStoreFolder(self, item_id, name, table.GetRowCount(0))
!             for store_folder in self._GetSubFolderIter(sub):
                  yield store_folder
  
      def GetFolderGenerator(self, folder_ids, include_sub):
          for folder_id in folder_ids:
!             folder_id = self.NormalizeID(folder_id)
!             folder = self._OpenEntry(folder_id)
              table = folder.GetContentsTable(0)
              rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
***************
*** 165,173 ****
      def GetFolder(self, folder_id):
          # Return a single folder given the ID.
!         folder_id = mapi.BinFromHex(folder_id)
!         folder = self.mapi_msgstore.OpenEntry(folder_id,
!                                               None,
!                                               mapi.MAPI_MODIFY |
!                                                   USE_DEFERRED_ERRORS)
          table = folder.GetContentsTable(0)
          rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
--- 198,203 ----
      def GetFolder(self, folder_id):
          # Return a single folder given the ID.
!         folder_id = self.NormalizeID(folder_id)
!         folder = self._OpenEntry(folder_id)
          table = folder.GetContentsTable(0)
          rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
***************
*** 177,191 ****
      def GetMessage(self, message_id):
          # Return a single message given the ID.
!         message_id = mapi.BinFromHex(message_id)
          prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
!         mapi_object = self.mapi_msgstore.OpenEntry(message_id,
!                                                    None,
!                                                    mapi.MAPI_MODIFY |
!                                                        USE_DEFERRED_ERRORS)
          hr, data = mapi_object.GetProps(prop_ids,0)
          folder_eid = data[0][1]
          searchkey = data[1][1]
          unread = data[2][1]
!         folder = MAPIMsgStoreFolder(self, folder_eid,
                                      "Unknown - temp message", -1)
          return  MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
--- 207,219 ----
      def GetMessage(self, message_id):
          # Return a single message given the ID.
!         message_id = self.NormalizeID(message_id)
          prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
!         mapi_object = self._OpenEntry(message_id)
          hr, data = mapi_object.GetProps(prop_ids,0)
          folder_eid = data[0][1]
          searchkey = data[1][1]
          unread = data[2][1]
!         folder_id = message_id[0], folder_eid
!         folder = MAPIMsgStoreFolder(self, folder_id,
                                      "Unknown - temp message", -1)
          return  MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
***************
*** 216,232 ****
  
      def __repr__(self):
!         return "<%s '%s' (%d items), id=%s>" % (self.__class__.__name__,
                                                  self.name,
                                                  self.count,
!                                                 mapi.HexFromBin(self.id))
  
      def GetOutlookEntryID(self):
!         return mapi.HexFromBin(self.id)
  
      def GetMessageGenerator(self):
!         folder = self.msgstore.mapi_msgstore.OpenEntry(self.id,
!                                                        None,
!                                                        mapi.MAPI_MODIFY |
!                                                            USE_DEFERRED_ERRORS)
          table = folder.GetContentsTable(0)
          prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
--- 244,263 ----
  
      def __repr__(self):
!         return "<%s '%s' (%d items), id=%s/%s>" % (self.__class__.__name__,
                                                  self.name,
                                                  self.count,
!                                                 mapi.HexFromBin(self.id[0]),
!                                                 mapi.HexFromBin(self.id[1]))
  
      def GetOutlookEntryID(self):
!         # Return EntryID, StoreID - we use this order as it is the same as
!         # Session.GetItemFromID() uses - thus:
!         # ids = me.GetOutlookEntryID()
!         # session.GetItemFromID(*ids)
!         # should work.
!         return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
  
      def GetMessageGenerator(self):
!         folder = self.msgstore._OpenEntry(self.id)
          table = folder.GetContentsTable(0)
          prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
***************
*** 239,244 ****
                  break
              for row in rows:
                  yield MAPIMsgStoreMsg(self.msgstore, self,
!                                       row[0][1], row[1][1], row[2][1])
  
  
--- 270,276 ----
                  break
              for row in rows:
+                 item_id = self.id[0], row[0][1] # assume in same store as folder!
                  yield MAPIMsgStoreMsg(self.msgstore, self,
!                                       item_id, row[1][1], row[2][1])
  
  
***************
*** 263,272 ****
          else:
              urs = "unread"
!         return "<%s, (%s) id=%s>" % (self.__class__.__name__,
                                       urs,
!                                      mapi.HexFromBin(self.id))
  
      def GetOutlookEntryID(self):
!         return mapi.HexFromBin(self.id)
  
      def _GetPropFromStream(self, prop_id):
--- 295,310 ----
          else:
              urs = "unread"
!         return "<%s, (%s) id=%s/%s>" % (self.__class__.__name__,
                                       urs,
!                                      mapi.HexFromBin(self.id[0]),
!                                      mapi.HexFromBin(self.id[1]))
  
      def GetOutlookEntryID(self):
!         # Return EntryID, StoreID - we use this order as it is the same as
!         # Session.GetItemFromID() uses - thus:
!         # ids = me.GetOutlookEntryID()
!         # session.GetItemFromID(*ids)
!         # should work.
!         return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
  
      def _GetPropFromStream(self, prop_id):
***************
*** 319,326 ****
      def _EnsureObject(self):
          if self.mapi_object is None:
!             self.mapi_object = self.msgstore.mapi_msgstore.OpenEntry(
!                                    self.id,
!                                    None,
!                                    mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS)
  
      def GetEmailPackageObject(self, strip_mime_headers=True):
--- 357,361 ----
      def _EnsureObject(self):
          if self.mapi_object is None:
!             self.mapi_object = self.msgstore._OpenEntry(self.id)
  
      def GetEmailPackageObject(self, strip_mime_headers=True):
***************
*** 418,432 ****
          assert not self.dirty, \
                 "asking me to move a dirty message - later saves will fail!"
!         dest_folder = self.msgstore.mapi_msgstore.OpenEntry(
!                           folder.id,
!                           None,
!                           mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS)
!         source_folder = self.msgstore.mapi_msgstore.OpenEntry(
!                             self.folder.id,
!                             None,
!                             mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS)
          flags = 0
          if isMove: flags |= MESSAGE_MOVE
!         source_folder.CopyMessages((self.id,),
                                     None,
                                     dest_folder,
--- 453,462 ----
          assert not self.dirty, \
                 "asking me to move a dirty message - later saves will fail!"
!         dest_folder = self.msgstore._OpenEntry(folder.id)
!         source_folder = self.msgstore._OpenEntry(self.folder.id)
          flags = 0
          if isMove: flags |= MESSAGE_MOVE
!         eid = self.id[1]
!         source_folder.CopyMessages((eid,),
                                     None,
                                     dest_folder,
***************
*** 434,438 ****
                                     None,
                                     flags)
!         self.folder = self.msgstore.GetFolder(mapi.HexFromBin(folder.id))
  
      def MoveTo(self, folder):
--- 464,473 ----
                                     None,
                                     flags)
!         # At this stage, I think we have lost meaningful ID etc values
!         # Set everything to None to make it clearer what is wrong should
!         # this become an issue.  We would need to re-fetch the eid of
!         # the item, and set the store_id to the dest folder.
!         self.id = None
!         self.folder = None
  
      def MoveTo(self, folder):
***************
*** 453,457 ****
              print msg
      store.Close()
- 
  
  if __name__=='__main__':
--- 488,491 ----


From mhammond@users.sourceforge.net  Fri Nov  1 06:09:08 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 22:09:08 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.29,1.30
Message-ID: <E187UzU-0001RN-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv5475

Modified Files:
	manager.py 
Log Message:
Stop everyone fretting over a known problem.


Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** manager.py	1 Nov 2002 05:47:59 -0000	1.29
--- manager.py	1 Nov 2002 06:09:06 -0000	1.30
***************
*** 119,125 ****
                          print "Created the UserProperty!"
                  except pythoncom.com_error:
!                     import traceback
!                     print "Failed to create the field"
!                     traceback.print_exc()
          # else no items in this folder - not much worth doing!
          if include_sub:
--- 119,126 ----
                          print "Created the UserProperty!"
                  except pythoncom.com_error:
!                     pass # We know, we know...
! ##                    import traceback
! ##                    print "Failed to create the field"
! ##                    traceback.print_exc()
          # else no items in this folder - not much worth doing!
          if include_sub:


From tim.one@comcast.net  Fri Nov  1 06:22:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 01:22:38 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
 FolderSelector.py,1.5,1.6
In-Reply-To: <E187Uf4-0000CY-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEOACDAB.tim.one@comcast.net>

[Mark Hammond]
> Modified Files:
> 	FolderSelector.py
> Log Message:
> All items are now identified by a (store_id, entry_id) tuple.  This was
> done in such a way that old config files should be fully supported - no
> need to reconfigure.
>
> Not much should look different, except multiple stores should be *fully*
> supported - you should be able to train and filter across stores to your
> heart's content.

That's impressive!  I'll do my bit next by ensuring there's no trailing
whitespace <wink>.


From richiehindle@users.sourceforge.net  Fri Nov  1 09:14:50 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 01 Nov 2002 01:14:50 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.7,1.8
Message-ID: <E187XtC-0004EZ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16187

Modified Files:
	pop3proxy.py 
Log Message:
Made this work on Linux, where socket.makefile behaves differently from
Windows.
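
A minimal sketch of the pattern this change adopts: replies are read through
a read-only file object from makefile('r'), and requests go straight out
through the socket with sendall(), instead of reading and writing through one
shared file object.  The sketch is written for Python 3 and uses
socket.socketpair() as a stand-in for the POP3 server connection; the names
request_reply and fake_server are illustrative and do not appear in
pop3proxy.py.

```python
import socket

def request_reply(sock, reader, request):
    """Send one request line on the raw socket, read one reply line via reader."""
    sock.sendall(request.encode("ascii") + b"\r\n")   # unbuffered write
    return reader.readline()                          # buffered, read-only

# socketpair() stands in for the proxy <-> POP3 server connection.
client, fake_server = socket.socketpair()
# For reading only; newline="" keeps the \r\n terminator untranslated.
reader = client.makefile("r", encoding="ascii", newline="")

# The fake server queues its reply first, so the read below cannot block.
fake_server.sendall(b"+OK 2 messages\r\n")
reply = request_reply(client, reader, "STAT")
received = fake_server.recv(100)

reader.close()
client.close()
fake_server.close()
```

Keeping the write path on the socket itself avoids depending on how a
platform's file object buffers interleaved reads and writes.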

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** pop3proxy.py	29 Oct 2002 21:02:40 -0000	1.7
--- pop3proxy.py	1 Nov 2002 09:14:47 -0000	1.8
***************
*** 87,94 ****
          self.request = ''
          self.set_terminator('\r\n')
!         serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         serverSocket.connect((serverName, serverPort))
!         self.serverFile = serverSocket.makefile()
!         self.push(self.serverFile.readline())
  
      def handle_connect(self):
--- 87,94 ----
          self.request = ''
          self.set_terminator('\r\n')
!         self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.serverSocket.connect((serverName, serverPort))
!         self.serverIn = self.serverSocket.makefile('r')  # For reading only
!         self.push(self.serverIn.readline())
  
      def handle_connect(self):
***************
*** 135,139 ****
          seenAllHeaders = False
          while True:
!             line = self.serverFile.readline()
              if not line:
                  # The socket's been closed by the server, probably by QUIT.
--- 135,139 ----
          seenAllHeaders = False
          while True:
!             line = self.serverIn.readline()
              if not line:
                  # The socket's been closed by the server, probably by QUIT.
***************
*** 173,184 ****
          # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
!             self.serverFile.write('QUIT\r\n')
!             self.serverFile.flush()
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverFile.write(self.request + '\r\n')
!         self.serverFile.flush()
          if self.request.strip() == '':
              # Someone just hit the Enter key.
--- 173,182 ----
          # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
!             self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverSocket.sendall(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
***************
*** 200,204 ****
          if timedOut:
              while True:
!                 line = self.serverFile.readline()
                  if not line:
                      # The socket's been closed by the server.
--- 198,202 ----
          if timedOut:
              while True:
!                 line = self.serverIn.readline()
                  if not line:
                      # The socket's been closed by the server.
***************
*** 529,532 ****
--- 527,531 ----
          asyncore.loop(map=testSocketMap)
  
+     proxyReady = threading.Event()
      def runProxy():
          # Name the database in case it ever gets auto-flushed to disk.
***************
*** 535,538 ****
--- 534,538 ----
          bayes.learn(tokenizer.tokenize(spam1), True)
          bayes.learn(tokenizer.tokenize(good1), False)
+         proxyReady.set()
          asyncore.loop()
  
***************
*** 540,548 ****
      testServerReady.wait()
      threading.Thread(target=runProxy).start()
  
      # Connect to the proxy.
      proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      proxy.connect(('localhost', 8111))
!     assert proxy.recv(100) == "+OK ready\r\n"
  
      # Stat the mailbox to get the number of messages.
--- 540,550 ----
      testServerReady.wait()
      threading.Thread(target=runProxy).start()
+     proxyReady.wait()
  
      # Connect to the proxy.
      proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      proxy.connect(('localhost', 8111))
!     response = proxy.recv(100)
!     assert response == "+OK ready\r\n"
  
      # Stat the mailbox to get the number of messages.
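
The proxyReady change in the self-test follows a common pattern: the main
thread waits on a threading.Event until the worker thread has finished its
setup, so it cannot connect too early.  A small Python 3 sketch of that
handshake, with a dictionary standing in for the trained classifier (the
names ready, state and run_worker are hypothetical):

```python
import threading

ready = threading.Event()
state = {}

def run_worker():
    state["trained"] = True   # stand-in for the bayes.learn() setup calls
    ready.set()               # signal: setup is done, clients may connect

worker = threading.Thread(target=run_worker)
worker.start()
ready.wait()                  # block until the worker has signalled
trained_before_connect = state.get("trained", False)
worker.join()
```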


From mhammond@users.sourceforge.net  Fri Nov  1 14:35:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 06:35:10 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.22,1.23 manager.py,1.30,1.31
 msgstore.py,1.15,1.16
Message-ID: <E187ctC-0003o9-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14364

Modified Files:
	addin.py manager.py msgstore.py 
Log Message:
Fix a problem with the (store_id, item_id) change, and remove the
confusing GetOutlookEntryID concept - just get the item!
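
For orientation, a small sketch of the ID handling involved: items are keyed
internally by a binary (store_id, entry_id) tuple, GetID() returns the same
order as hex strings, and Session.GetFolderFromID takes the entry ID first
and the store ID second.  binascii stands in for mapi.HexFromBin here, and
the FakeFolder class and its lookup_args method are hypothetical
illustrations, not part of msgstore.py.

```python
import binascii

def hex_from_bin(data):
    # Stand-in for mapi.HexFromBin: bytes -> hex string.
    return binascii.hexlify(data).decode("ascii")

class FakeFolder:
    def __init__(self, store_id, entry_id):
        self.id = (store_id, entry_id)      # internal order: store first

    def GetID(self):
        # Same order as self.id, hex-encoded so it can be stored or printed.
        return hex_from_bin(self.id[0]), hex_from_bin(self.id[1])

    def lookup_args(self):
        # Argument order for the Outlook lookup: entry ID, then store ID.
        return hex_from_bin(self.id[1]), hex_from_bin(self.id[0])

folder = FakeFolder(b"\x01\xab", b"\x02\xcd")
```

Hiding the argument swap inside one method means callers never have to
remember which order Outlook expects.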


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** addin.py	1 Nov 2002 05:47:59 -0000	1.22
--- addin.py	1 Nov 2002 14:35:05 -0000	1.23
***************
*** 305,312 ****
          for msgstore_folder in self.manager.message_store.GetFolderGenerator(
                      folder_ids, include_sub):
!             eid = msgstore_folder.GetOutlookEntryID()
!             existing = self.folder_hooks.get(eid)
              if existing is None or existing.__class__ != HandlerClass:
!                 folder = self.application.Session.GetFolderFromID(*eid)
                  name = folder.Name.encode("mbcs", "replace")
                  try:
--- 305,311 ----
          for msgstore_folder in self.manager.message_store.GetFolderGenerator(
                      folder_ids, include_sub):
!             existing = self.folder_hooks.get(msgstore_folder.id)
              if existing is None or existing.__class__ != HandlerClass:
!                 folder = msgstore_folder.GetOutlookItem()
                  name = folder.Name.encode("mbcs", "replace")
                  try:
***************
*** 317,325 ****
                  if new_hook is not None:
                      new_hook.Init(folder, self.application, self.manager)
!                     new_hooks[eid] = new_hook
!                     self.manager.EnsureOutlookFieldsForFolder(eid)
                      print "AntiSpam: Watching for new messages in folder", name
              else:
!                 new_hooks[eid] = existing
          return new_hooks
  
--- 316,324 ----
                  if new_hook is not None:
                      new_hook.Init(folder, self.application, self.manager)
!                     new_hooks[msgstore_folder.id] = new_hook
!                     self.manager.EnsureOutlookFieldsForFolder(msgstore_folder.GetID())
                      print "AntiSpam: Watching for new messages in folder", name
              else:
!                 new_hooks[msgstore_folder.id] = existing
          return new_hooks
  

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** manager.py	1 Nov 2002 06:09:06 -0000	1.30
--- manager.py	1 Nov 2002 14:35:05 -0000	1.31
***************
*** 92,96 ****
          assert self.outlook is not None, "I need outlook :("
          ol = self.outlook
!         folder = ol.Session.GetFolderFromID(*folder_id)
          if self.verbose > 1:
              print "Checking folder '%s' for our field '%s'" \
--- 92,97 ----
          assert self.outlook is not None, "I need outlook :("
          ol = self.outlook
!         msgstore_folder = self.message_store.GetFolder(folder_id)
!         folder = msgstore_folder.GetOutlookItem()
          if self.verbose > 1:
              print "Checking folder '%s' for our field '%s'" \

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** msgstore.py	1 Nov 2002 05:47:59 -0000	1.15
--- msgstore.py	1 Nov 2002 14:35:06 -0000	1.16
***************
*** 219,230 ****
          return  MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
  
- ##    # Currently no need for this
- ##    def GetOutlookObjectFromID(self, eid):
- ##        if self.outlook is None:
- ##            from win32com.client import Dispatch
- ##            self.outlook = Dispatch("Outlook.Application")
- ##        return self.outlook.Session.GetItemFromID(mapi.HexFromBin(eid))
- 
- 
  _MapiTypeMap = {
      type(0.0): PT_DOUBLE,
--- 219,222 ----
***************
*** 250,260 ****
                                                  mapi.HexFromBin(self.id[1]))
  
!     def GetOutlookEntryID(self):
!         # Return EntryID, StoreID - we use this order as it is the same as
!         # Session.GetItemFromID() uses - thus:
!         # ids = me.GetOutlookEntryID()
!         # session.GetItemFromID(*ids)
!         # should work.
!         return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
  
      def GetMessageGenerator(self):
--- 242,252 ----
                                                  mapi.HexFromBin(self.id[1]))
  
!     def GetID(self):
!         return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
! 
!     def GetOutlookItem(self):
!         hex_item_id = mapi.HexFromBin(self.id[1])
!         hex_store_id = mapi.HexFromBin(self.id[0])
!         return self.msgstore.outlook.Session.GetFolderFromID(hex_item_id, hex_store_id)
  
      def GetMessageGenerator(self):
***************
*** 300,310 ****
                                       mapi.HexFromBin(self.id[1]))
  
!     def GetOutlookEntryID(self):
!         # Return EntryID, StoreID - we use this order as it is the same as
!         # Session.GetItemFromID() uses - thus:
!         # ids = me.GetOutlookEntryID()
!         # session.GetItemFromID(*ids)
!         # should work.
!         return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
  
      def _GetPropFromStream(self, prop_id):
--- 292,302 ----
                                       mapi.HexFromBin(self.id[1]))
  
!     def GetID(self):
!         return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
! 
!     def GetOutlookItem(self):
!         hex_item_id = mapi.HexFromBin(self.id[1])
!         hex_store_id = mapi.HexFromBin(self.id[0])
!         return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id)
  
      def _GetPropFromStream(self, prop_id):


From tim_one@users.sourceforge.net  Fri Nov  1 16:01:20 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 08:01:20 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.45,1.46
Message-ID: <E187eEa-0002c7-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7943

Modified Files:
	classifier.py 
Log Message:
WordInfo.__init__:  if an initial spamprob isn't specified, set it to
options.robinson_probability_x (the "unknown word" probability) instead
of to None.  If threads exist such that scoring can happen in parallel
with training, None could cause scoring to raise an exception.  "A real"
spamprob can't be computed until update_probabilities is called to
recalculate the entire database; before then, giving a new word the
unknown-word spamprob is thoroughly appropriate.
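
A rough sketch of the failure mode, under stated assumptions: the value 0.5
stands in for options.robinson_probability_x, the assignment of
self.spamprob is assumed from the constructor signature, and
distance_from_neutral is a hypothetical piece of scoring-style arithmetic
rather than the classifier's real code.

```python
UNKNOWN_WORD_PROB = 0.5   # assumed stand-in for options.robinson_probability_x

class WordInfo:
    def __init__(self, atime, spamprob=UNKNOWN_WORD_PROB):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob            # assumed: stored as given

def distance_from_neutral(record):
    # Any arithmetic on spamprob raises TypeError when it is None.
    return abs(record.spamprob - 0.5)

fresh = WordInfo(atime=0)                      # new word, not yet recomputed
old_style = WordInfo(atime=0, spamprob=None)   # what the old default produced

safe = distance_from_neutral(fresh)
try:
    distance_from_neutral(old_style)
    old_default_failed = False
except TypeError:
    old_default_failed = True
```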


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** classifier.py	27 Oct 2002 17:11:00 -0000	1.45
--- classifier.py	1 Nov 2002 16:01:14 -0000	1.46
***************
*** 62,66 ****
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=None):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 62,66 ----
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0


From sjoerd@users.sourceforge.net  Fri Nov  1 16:10:18 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Fri, 01 Nov 2002 08:10:18 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.59,1.60
Message-ID: <E187eNG-0003XF-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13555

Modified Files:
	tokenizer.py 
Log Message:
Switch " and ' in the url_re character class and add a trailing # ' comment
after the re to resync python-mode.
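
The order of characters inside a regex character class does not affect what
it matches, so the swap is purely cosmetic.  A quick check, reproducing only
the capture group from url_re (the sample string is made up):

```python
import re

# Only the character class from url_re is reproduced here.
guts_old = re.compile(r"""([^\s<>'"\x7f-\xff]+)""")
guts_new = re.compile(r"""([^\s<>"'\x7f-\xff]+)""")

sample = "www.example.com/a%20b'>click"
old_match = guts_old.match(sample).group(1)
new_match = guts_new.match(sample).group(1)
```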


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.59
retrieving revision 1.60
diff -C2 -d -r1.59 -r1.60
*** tokenizer.py	31 Oct 2002 15:43:55 -0000	1.59
--- tokenizer.py	1 Nov 2002 16:10:13 -0000	1.60
***************
*** 604,609 ****
      # be in HTML, may or may not be in quotes, etc.  If it's full of %
      # escapes, cool -- that's a clue too.
!     ([^\s<>'"\x7f-\xff]+)  # capture the guts
! """, re.VERBOSE)
  
  urlsep_re = re.compile(r"[;?:@&=+,$.]")
--- 604,609 ----
      # be in HTML, may or may not be in quotes, etc.  If it's full of %
      # escapes, cool -- that's a clue too.
!     ([^\s<>"'\x7f-\xff]+)  # capture the guts
! """, re.VERBOSE)                        # '
  
  urlsep_re = re.compile(r"[;?:@&=+,$.]")


From mhammond@users.sourceforge.net  Fri Nov  1 23:54:05 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 15:54:05 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.23,1.24 msgstore.py,1.16,1.17
Message-ID: <E187lc5-0003pD-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14570

Modified Files:
	addin.py msgstore.py 
Log Message:
Fix a couple of places the "multiple stores" concept fell over.
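
The GetMessage change below accepts either an ID or an Outlook object and
tells them apart by checking for an EntryID attribute.  A simplified sketch
of that duck-typing follows; the Fake* classes are hypothetical, binascii
stands in for mapi.BinFromHex, and the tuple branch is an assumed
simplification of NormalizeID, whose real behaviour is not shown in this
check-in.

```python
import binascii

class FakeOutlookFolder:
    StoreID = "01ab"

class FakeOutlookItem:
    EntryID = "02cd"
    Parent = FakeOutlookFolder()

def to_message_id(message_or_id):
    if hasattr(message_or_id, "EntryID"):
        # An Outlook object: the store ID comes from its parent folder.
        return (binascii.unhexlify(message_or_id.Parent.StoreID),
                binascii.unhexlify(message_or_id.EntryID))
    # Otherwise assume a (store_hex, entry_hex) pair of hex strings.
    store_hex, entry_hex = message_or_id
    return binascii.unhexlify(store_hex), binascii.unhexlify(entry_hex)

from_object = to_message_id(FakeOutlookItem())
from_pair = to_message_id(("01ab", "02cd"))
```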


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** addin.py	1 Nov 2002 14:35:05 -0000	1.23
--- addin.py	1 Nov 2002 23:54:03 -0000	1.24
***************
*** 121,125 ****
          #     PR_RECEIVED_BY_ENTRYID
          #     PR_TRANSPORT_MESSAGE_HEADERS
!         msgstore_message = self.manager.message_store.GetMessage(item.EntryID)
          if msgstore_message.GetField(self.manager.config.field_score_name) is not None:
              # Already seen this message - user probably moving it back
--- 121,125 ----
          #     PR_RECEIVED_BY_ENTRYID
          #     PR_TRANSPORT_MESSAGE_HEADERS
!         msgstore_message = self.manager.message_store.GetMessage(item)
          if msgstore_message.GetField(self.manager.config.field_score_name) is not None:
              # Already seen this message - user probably moving it back
***************
*** 154,158 ****
          if not self.manager.config.training.train_manual_spam:
              return
!         msgstore_message = self.manager.message_store.GetMessage(item.EntryID)
          prop = msgstore_message.GetField(self.manager.config.field_score_name)
          if prop is not None:
--- 154,158 ----
          if not self.manager.config.training.train_manual_spam:
              return
!         msgstore_message = self.manager.message_store.GetMessage(item)
          prop = msgstore_message.GetField(self.manager.config.field_score_name)
          if prop is not None:
***************
*** 189,193 ****
          return
  
!     msgstore_message = mgr.message_store.GetMessage(item.EntryID)
      score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
      new_msg = app.CreateItem(0)
--- 189,193 ----
          return
  
!     msgstore_message = mgr.message_store.GetMessage(item)
      score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
      new_msg = app.CreateItem(0)

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** msgstore.py	1 Nov 2002 14:35:06 -0000	1.16
--- msgstore.py	1 Nov 2002 23:54:03 -0000	1.17
***************
*** 206,211 ****
  
      def GetMessage(self, message_id):
!         # Return a single message given the ID.
!         message_id = self.NormalizeID(message_id)
          prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
          mapi_object = self._OpenEntry(message_id)
--- 206,217 ----
  
      def GetMessage(self, message_id):
!         # Return a single message given either the ID, or an Outlook
!         # message representing the object.
!         if hasattr(message_id, "EntryID"):
!             # A CDO object
!             message_id = mapi.BinFromHex(message_id.Parent.StoreID), \
!                          mapi.BinFromHex(message_id.EntryID)
!         else:
!             message_id = self.NormalizeID(message_id)
          prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
          mapi_object = self._OpenEntry(message_id)


From mhammond@users.sourceforge.net  Sat Nov  2 03:12:15 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:12:15 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/sandbox delete_outlook_field.py,1.2,1.3
Message-ID: <E187ohr-0007y5-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv30593

Modified Files:
	delete_outlook_field.py 
Log Message:
Fix missing quote in usage string.


Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** delete_outlook_field.py	1 Nov 2002 02:04:03 -0000	1.2
--- delete_outlook_field.py	2 Nov 2002 03:12:12 -0000	1.3
***************
*** 147,151 ****
  of the default message store
  
! Eg, python\\python-dev' will locate a python-dev subfolder in a python
  subfolder in your default store.
  """ % os.path.basename(sys.argv[0])
--- 147,151 ----
  of the default message store
  
! Eg, 'python\\python-dev' will locate a python-dev subfolder in a python
  subfolder in your default store.
  """ % os.path.basename(sys.argv[0])


From mhammond@users.sourceforge.net  Sat Nov  2 03:13:24 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:13:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	dump_props.py,NONE,1.1
Message-ID: <E187oiy-00081z-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv30848

Added Files:
	dump_props.py 
Log Message:
Tool to dump everything we know about a message.


--- NEW FILE: dump_props.py ---
# Dump every property we can find for a MAPI item

from win32com.client import Dispatch, constants
import pythoncom
import os, sys

from win32com.mapi import mapi, mapiutil
from win32com.mapi.mapitags import *

mapi.MAPIInitialize(None)
logonFlags = (mapi.MAPI_NO_MAIL |
              mapi.MAPI_EXTENDED |
              mapi.MAPI_USE_DEFAULT)
session = mapi.MAPILogonEx(0, None, None, logonFlags)

def _FindDefaultMessageStore():
    tab = session.GetMsgStoresTable(0)
    # Restriction for the table:  get rows where PR_DEFAULT_STORE is true.
    # There should be only one.
    restriction = (mapi.RES_PROPERTY,   # a property restriction
                   (mapi.RELOP_EQ,      # check for equality
                    PR_DEFAULT_STORE,   # of the PR_DEFAULT_STORE prop
                    (PR_DEFAULT_STORE, True))) # with True
    rows = mapi.HrQueryAllRows(tab,
                               (PR_ENTRYID,),   # columns to retrieve
                               restriction,     # only these rows
                               None,            # any sort order is fine
                               0)               # any # of results is fine
    # get first entry, a (property_tag, value) pair, for PR_ENTRYID
    row = rows[0]
    eid_tag, eid = row[0]
    # Open the store.
    return session.OpenMsgStore(
                            0,      # no parent window
                            eid,    # msg store to open
                            None,   # IID; accept default IMsgStore
                            # need write access to add score fields
                            mapi.MDB_WRITE |
                                # we won't send or receive email
                                mapi.MDB_NO_MAIL |
                                mapi.MAPI_DEFERRED_ERRORS)

def _FindItemsWithValue(folder, prop_tag, prop_val):
    tab = folder.GetContentsTable(0)
    # Restriction for the table:  get rows where our prop values match
    restriction = (mapi.RES_CONTENT,   # a content restriction
                   (mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
                    prop_tag,   # of the given prop
                    (prop_tag, prop_val))) # with given val
##    tab.SetColumns((PR_ENTRYID,), 0)
##    restriction = None
    rows = mapi.HrQueryAllRows(tab,
                               (PR_ENTRYID,),   # columns to retrieve
                               restriction,     # only these rows
                               None,            # any sort order is fine
                               0)               # any # of results is fine
    # get entry IDs
    print rows
    return [row[0][1] for row in rows]
    
def _FindFolderEID(name):
    assert name
    from win32com.mapi import exchange
    if not name.startswith("\\"):
        name = "\\Top Of Personal Folders\\" + name
    store = _FindDefaultMessageStore()
    folder_eid = exchange.HrMAPIFindFolderEx(store, "\\", name)
    return folder_eid

# Also in new versions of mapiutil
def GetAllProperties(obj, make_tag_names = True):
    tags = obj.GetPropList(0)
    hr, data = obj.GetProps(tags)
    ret = []
    for tag, val in data:
        if make_tag_names:
            hr, tags, array = obj.GetNamesFromIDs( (tag,) )
            if type(array[0][1])==type(u''):
                name = array[0][1]
            else:
                name = mapiutil.GetPropTagName(tag)
        else:
            name = tag
        ret.append((name, val))
    return ret

def DumpProps(folder_eid, subject, shorten):
    mapi_msgstore = _FindDefaultMessageStore()
    mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
                                          None,
                                          mapi.MAPI_DEFERRED_ERRORS)
    hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
    name = data[0][1]
    print name
    eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject)
    print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject)
    for eid in eids:
        print "Dumping item with ID", mapi.HexFromBin(eid)
        item = mapi_msgstore.OpenEntry(eid,
                                       None,
                                       mapi.MAPI_DEFERRED_ERRORS)
        for prop_name, prop_val in GetAllProperties(item):
            prop_repr = repr(prop_val)
            if shorten:
                prop_repr = prop_repr[:50]
            print "%-20s: %s" % (prop_name, prop_repr)

def usage():
    msg = """\
Usage: %s [-f foldername] [-s] subject of the message
-f - Search for the message in the specified folder (default = Inbox)
-s - Shorten long property values.

Dumps all properties for all messages that match the subject.  Subject
matching is substring and ignore-case.

Folder name must be a hierarchical 'path' name, using '\\'
as the path separator.  If the folder name begins with a
\\, it must be a fully-qualified name, including the message
store name (eg, "Top of Public Folders").  If the path does not
begin with a \\, it is assumed to be fully-qualified from the root
of the default message store.

Eg, 'python\\python-dev' will locate a python-dev subfolder in a python
subfolder in your default store.
""" % os.path.basename(sys.argv[0])
    print msg


def main():
    import getopt
    try:
        opts, args = getopt.getopt(sys.argv[1:], "f:s")
    except getopt.error, e:
        print e
        print
        usage()
        sys.exit(1)
    folder_name = ""
    subject = " ".join(args)
    if not subject:
        usage()
        sys.exit(1)

    shorten = False
    for opt, opt_val in opts:
        if opt == "-f":
            folder_name = opt_val
        elif opt == "-s":
            shorten = True
        else:
            print "Invalid arg"
            return

    if not folder_name:
        folder_name = "Inbox" # Assume this exists!
        
    eid = _FindFolderEID(folder_name)
    if eid is None:
        print "*** Can't find folder", folder_name
        return
    DumpProps(eid, subject, shorten)

if __name__=='__main__':
    main()
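
The folder-name rule in the usage text can be illustrated without MAPI.
This is a minimal sketch only: normalize_folder_path is a hypothetical
helper, not part of dump_props.py.  It mirrors the name handling that
_FindFolderEID does before it calls HrMAPIFindFolderEx, and reuses the
"Top Of Personal Folders" prefix hard-coded in the script above.

```python
# Hypothetical helper, for illustration only; it mirrors the name handling
# in _FindFolderEID but makes no MAPI calls.
DEFAULT_ROOT = "\\Top Of Personal Folders\\"

def normalize_folder_path(name):
    """Return a fully-qualified folder path that starts with a backslash."""
    if not name:
        raise ValueError("folder name must not be empty")
    if name.startswith("\\"):
        # Already fully qualified, including the message store name.
        return name
    # Relative names are resolved against the root of the default store.
    return DEFAULT_ROOT + name
```

With this rule, "Inbox" and "python\python-dev" are looked up under the
default store, while a name with a leading backslash is passed through
unchanged.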


From mhammond@users.sourceforge.net  Sat Nov  2 03:18:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:18:10 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	dump_props.py,1.1,1.2
Message-ID: <E187ona-0008Fy-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv31673

Modified Files:
	dump_props.py 
Log Message:
Remove old debug code I missed.


Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** dump_props.py	2 Nov 2002 03:13:22 -0000	1.1
--- dump_props.py	2 Nov 2002 03:18:08 -0000	1.2
***************
*** 48,53 ****
                      prop_tag,   # of the given prop
                      (prop_tag, prop_val))) # with given val
- ##    tab.SetColumns((PR_ENTRYID,), 0)
- ##    restriction = None
      rows = mapi.HrQueryAllRows(tab,
                                 (PR_ENTRYID,),   # columns to retrieve
--- 48,51 ----
***************
*** 56,60 ****
                                 0)               # any # of results is fine
      # get entry IDs
-     print rows
      return [row[0][1] for row in rows]
      
--- 54,57 ----


From mhammond@users.sourceforge.net  Sat Nov  2 04:00:45 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 20:00:45 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
Message-ID: <E187pSn-0003aG-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv13755

Modified Files:
	README.txt 
Log Message:
Update to reflect the current world state.


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** README.txt	21 Oct 2002 01:38:10 -0000	1.4
--- README.txt	2 Nov 2002 04:00:43 -0000	1.5
***************
*** 4,12 ****
  to run the Outlook Addin you *must* have win32all-149 or later.
  
! ** NOTE ** - You also need CDO installed.  This comes with Outlook 2k, but is
! not installed by default.  Attempting to install the add-in will detect this 
! situation, and print instructions how to install CDO.  Note however that
! running the stand-alone scripts (see below) will generally just print the raw
! Python exception - generally a semi-incomprehensible COM exception.
  
  Outlook Addin
--- 4,8 ----
  to run the Outlook Addin you *must* have win32all-149 or later.
  
! CDO is no longer needed :)
  
  Outlook Addin
***************
*** 43,54 ****
  Inbox filter).  You can watch as many folders for Spam as you like.
  
- You can define any number of filters to apply, each performing a different 
- action or testing a different spam probability.  You can enable and disable
- any rule, and you can "Filter Now" an entire folder in one step.
- 
- Note that the rule ordering can be important, as if early rules move
- a message, later rules will not fire for that message (cos MAPI
- appears to make access to the message once moved impossible)
- 
  Command Line Tools
  -------------------
--- 39,42 ----
***************
*** 66,76 ****
      plugin must be running for filtering of new mail to occur)
  
- classify.py
-     Creates a field in each message with the classifier score.  Once run, 
-     the Outlook Field Chooser can be used to display, sort etc the field,
-     or used to change formatting of these messages.  The field will appear
-     in "user defined fields"
- 
- 
  Misc Comments
  ===========
--- 54,57 ----
***************
*** 78,86 ****
  Somewhere over 4MB, they seem to stop working.  Mark's hasn't got
  that big yet - just over 2MB and going strong.
- 
- Outlook will occasionally complain that folders are corrupted after running
- filter.  Closing and reopening Outlook always seems to restore things,
- with no fuss.  Your mileage may vary.  Buyer beware.  Worth what you paid.
- (Mark hasn't seen this)
  
  Copyright transferred to PSF from Sean D. True and WebReply.com.
--- 59,62 ----


From mhammond@users.sourceforge.net  Sat Nov  2 04:08:04 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 20:08:04 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.5,1.6
Message-ID: <E187pZs-00040d-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv15352

Modified Files:
	README.txt 
Log Message:
Add known problems.


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** README.txt	2 Nov 2002 04:00:43 -0000	1.5
--- README.txt	2 Nov 2002 04:08:02 -0000	1.6
***************
*** 2,9 ****
  Outlook 2000, courtesy of Sean True and Mark Hammond.  Note that you need 
  Python's win32com extensions (http://starship.python.net/crew/mhammond) and
! to run the Outlook Addin you *must* have win32all-149 or later.
  
  CDO is no longer needed :)
  
  Outlook Addin
  ==========
--- 2,12 ----
  Outlook 2000, courtesy of Sean True and Mark Hammond.  Note that you need 
  Python's win32com extensions (http://starship.python.net/crew/mhammond) and
! you *must* have win32all-149 or later.
  
  CDO is no longer needed :)
  
+ See below for a list of known problems (particularly that you must manually
+ create an Outlook property before you can see the Spam scores)
+ 
  Outlook Addin
  ==========
***************
*** 54,63 ****
      plugin must be running for filtering of new mail to occur)
  
  Misc Comments
  ===========
- Sean reports bad output saving very large classifiers in training.py.
- Somewhere over 4MB, they seem to stop working.  Mark's hasn't got
- that big yet - just over 2MB and going strong.
- 
  Copyright transferred to PSF from Sean D. True and WebReply.com.
  Licensed under PSF, see Tim Peters for IANAL interpretation.
--- 57,76 ----
      plugin must be running for filtering of new mail to occur)
  
+ Known Problems
+ ---------------
+ * No field is created in Outlook for the Spam Score field.  To create
+   the field, go to the field chooser for the folder you are interested
+   in, and create a new User Property called "Spam".  Ensure the type
+   of the field is "Integer" (the last option), NOT "Number".  This is only
+   necessary for you to *see* the score, not for the scoring to work.
+ 
+ * Filtering an Exchange Server public store appears to not work.
+ 
+ * Sean reports bad output saving very large classifiers in training.py.
+   Somewhere over 4MB, they seem to stop working.  Mark's hasn't got
+   that big yet - just over 2MB and going strong.
+ 
  Misc Comments
  ===========
  Copyright transferred to PSF from Sean D. True and WebReply.com.
  Licensed under PSF, see Tim Peters for IANAL interpretation.


From tim.one@comcast.net  Sat Nov  2 04:12:29 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:12:29 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
In-Reply-To: <E187pSn-0003aG-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEEFCFAB.tim.one@comcast.net>

[Mark Hammond]
> ...
> Modified Files:
> 	README.txt
> Log Message:
> Update to reflect the current world state.

> ...
> - Outlook will occasionally complain that folders are corrupted
> - after running filter.  Closing and reopening Outlook always seems to
> - restore things, with no fuss.  Your mileage may vary.  Buyer beware.
> - Worth what you paid.
> - (Mark hasn't seen this)

I meant to mention before that I've never seen this either.  Sean, do you
still see it?  scanpst.exe sometimes claims there are minor inconsistencies
when I run it, but it's always done that, and AFAICT it doesn't claim it
more often now than before I started using the addin.


From mhammond@skippinet.com.au  Sat Nov  2 04:18:30 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 2 Nov 2002 15:18:30 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEEFCFAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEMGHHAA.mhammond@skippinet.com.au>

> > ...
> > - Outlook will occasionally complain that folders are corrupted
> > - after running filter.  Closing and reopening Outlook always seems to
> > - restore things, with no fuss.  Your mileage may vary.  Buyer beware.
> > - Worth what you paid.
> > - (Mark hasn't seen this)
>
> I meant to mention before that I've never seen this either.  Sean, do you
> still see it?  scanpst.exe sometimes claims there are minor
> inconsistencies
> when I run it, but it's always done that, and AFAICT it doesn't claim it
> more often now than before I started using the addin.

Actually, I saw similar things when using the Outlook model to scan huge
folders.  Since moving to MAPI I think it will have gone away.

Mark.


From mhammond@users.sourceforge.net  Sat Nov  2 05:26:55 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 21:26:55 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	dump_props.py,1.2,1.3
Message-ID: <E187qoB-000172-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv4243

Modified Files:
	dump_props.py 
Log Message:
Add support for dumping attachments too


Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** dump_props.py	2 Nov 2002 03:18:08 -0000	1.2
--- dump_props.py	2 Nov 2002 05:26:52 -0000	1.3
***************
*** 82,86 ****
  	return ret
  
! def DumpProps(folder_eid, subject, shorten):
      mapi_msgstore = _FindDefaultMessageStore()
      mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
--- 82,93 ----
  	return ret
  
! def DumpItemProps(item, shorten):
!     for prop_name, prop_val in GetAllProperties(item):
!         prop_repr = repr(prop_val)
!         if shorten:
!             prop_repr = prop_repr[:50]
!         print "%-20s: %s" % (prop_name, prop_repr)
!     
! def DumpProps(folder_eid, subject, include_attach, shorten):
      mapi_msgstore = _FindDefaultMessageStore()
      mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
***************
*** 89,93 ****
      hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
      name = data[0][1]
-     print name
      eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject)
      print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject)
--- 96,99 ----
***************
*** 97,105 ****
                                         None,
                                         mapi.MAPI_DEFERRED_ERRORS)
!         for prop_name, prop_val in GetAllProperties(item):
!             prop_repr = repr(prop_val)
!             if shorten:
!                 prop_repr = prop_repr[:50]
!             print "%-20s: %s" % (prop_name, prop_repr)
  
  def usage():
--- 103,116 ----
                                         None,
                                         mapi.MAPI_DEFERRED_ERRORS)
!         DumpItemProps(item, shorten)
!         if include_attach:
!             print
!             table = item.GetAttachmentTable(0)
!             rows = mapi.HrQueryAllRows(table, (PR_ATTACH_NUM,), None, None, 0)
!             for row in rows:
!                 attach_num = row[0][1]
!                 print "Dumping attachment (PR_ATTACH_NUM=%d)" % (attach_num,)
!                 attach = item.OpenAttach(attach_num, None, mapi.MAPI_DEFERRED_ERRORS)
!                 DumpItemProps(attach, shorten)
  
  def usage():
***************
*** 108,111 ****
--- 119,123 ----
  -f - Search for the message in the specified folder (default = Inbox)
  -s - Shorten long property values.
+ -a - Include attachments
  
  Dumps all properties for all messages that match the subject.  Subject
***************
*** 128,132 ****
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "f:s")
      except getopt.error, e:
          print e
--- 140,144 ----
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "af:s")
      except getopt.error, e:
          print e
***************
*** 141,144 ****
--- 153,157 ----
  
      shorten = False
+     include_attach = False
      for opt, opt_val in opts:
          if opt == "-f":
***************
*** 146,149 ****
--- 159,164 ----
          elif opt == "-s":
              shorten = True
+         elif opt == "-a":
+             include_attach = True
          else:
              print "Invalid arg"
***************
*** 157,161 ****
          print "*** Can't find folder", folder_name
          return
!     DumpProps(eid, subject, shorten)
  
  if __name__=='__main__':
--- 172,176 ----
          print "*** Can't find folder", folder_name
          return
!     DumpProps(eid, subject, include_attach, shorten)
  
  if __name__=='__main__':
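
The per-property output line produced by DumpItemProps can be tried on
its own.  A minimal sketch, assuming a hypothetical format_prop helper
(not in dump_props.py) that returns the line instead of printing it; the
50-character cut-off is the one used by the -s option in the diff above.

```python
def format_prop(prop_name, prop_val, shorten=False, max_len=50):
    """Format one property the way DumpItemProps prints it."""
    prop_repr = repr(prop_val)
    if shorten:
        # -s: keep only the first max_len characters of the repr.
        prop_repr = prop_repr[:max_len]
    # The name is left-justified in a 20-character column.
    return "%-20s: %s" % (prop_name, prop_repr)
```

Because the value goes through repr() first, the truncation applies to
the quoted form, so a shortened string value loses its closing quote.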


From mhammond@users.sourceforge.net  Sat Nov  2 06:12:36 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 22:12:36 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.17,1.18
Message-ID: <E187rWO-0003Wr-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv13542

Modified Files:
	msgstore.py 
Log Message:
Correct misleading comment.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** msgstore.py	1 Nov 2002 23:54:03 -0000	1.17
--- msgstore.py	2 Nov 2002 06:12:34 -0000	1.18
***************
*** 209,213 ****
          # message representing the object.
          if hasattr(message_id, "EntryID"):
!             # A CDO object
              message_id = mapi.BinFromHex(message_id.Parent.StoreID), \
                           mapi.BinFromHex(message_id.EntryID)
--- 209,213 ----
          # message representing the object.
          if hasattr(message_id, "EntryID"):
!             # An Outlook object
              message_id = mapi.BinFromHex(message_id.Parent.StoreID), \
                           mapi.BinFromHex(message_id.EntryID)


From tim_one@users.sourceforge.net  Sat Nov  2 06:53:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 22:53:26 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.2,1.3
Message-ID: <E187s9u-0005Uj-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv21025/Outlook2000

Modified Files:
	about.html 
Log Message:
Added exhaustive sister-friendly instructions for creating a Spam column
in a view in a folder.


Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** about.html	1 Nov 2002 01:24:09 -0000	1.2
--- about.html	2 Nov 2002 06:53:24 -0000	1.3
***************
*** 18,25 ****
--- 18,27 ----
  consider spam, and continually adapt as both your regular email and spam
  patterns change.<br>
+ 
  <h2>Training</h2>
  Due to the nature of the system, it must be trained before it can be effective.
  &nbsp;Although the system does learn over time, when first installed it has
  no knowledge of either spam or good email.<br>
+ 
  <h3>Initial Training</h3>
  When first installed, it is recommended you perform the following steps:<br>
***************
*** 44,47 ****
--- 46,50 ----
  You can then look at and sort by the Spam field in your Inbox - this is likely
  to find hidden spam that you missed from your inbox cleanup.
+ 
  <h3>Incremental Training</h3>
  When you drag a message to your Spam folder, it will be automatically trained
***************
*** 51,55 ****
  the system learns what good messages look like should it incorrectly classify
  it as spam or possible spam.<br>
! <br>
  Contributions to this documentation are welcome!<br>
  <br>
--- 54,97 ----
  the system learns what good messages look like should it incorrectly classify
  it as spam or possible spam.<br>
! 
! <h2>Creating a Spam Score Field</h2>
! A custom property named "Spam" is added to all Outlook messages scored.
! This is an integer in 0 (ham) through 100 (spam) inclusive.
! You can teach Outlook to display this field as a column in any table view,
! like the standard Messages view.
! <p>
! This takes some work, and has to be done again for every folder in which
! you want to display a Spam column:
! <ul>
!     <li>While looking at an Outlook table view (like Messages), right-click
!         on the line with column headers (From, Subject, To, Received, ...).
!         In the context menu that pops up, click on Field Chooser.  A box
!         with title <i>Field Chooser</i> pops up.
!     <li>In the lower left corner of the <i>Field Chooser</i> box, click
!         <i>New...</i>.  A box with title <i>New Field</i> pops up.
!     <li>In the <i>Name:</i> box, type Spam.
!     <li>In the <i>Type:</i> dropdown list, select <i>Integer</i>.  This is the
!         last choice in the dropdown list.
!         Do not select <i>Number</i> -- it won't work.
!     <li>The <i>Format:</i> dropdown list should display "1,234" now.  Leave it alone.
!     <li>Click OK in the <i>New Field</i> box.  Now you're back in the
!         <i>Field Chooser</i> box.
!     <li>The dropdown list at the top of the <i>Field Chooser</i> box should say
!         <i>User-defined fields in FOLDER</i> now, where FOLDER is the name of the
!         folder you're currently looking at (like Inbox).  Below that, you
!         should see a new rectangular button with a Spam label.
!     <li>Use your mouse to drag the Spam button to the column header position
!         where you want to see the Spam column.  You don't have to be precise
!         here -- you can rearrange or resize the column later just by dragging
!         it around.
!     <li>You're done!  Close the <i>Field Chooser</i> box.
! </ul>
! Outlook's standard Automatic Formatting features can also be taught how
! access the value of this field; for example, you could tell Outlook to display
! rows with suspected spam messages in green italic.  However, for whatever reason,
! the Outlook Rules Wizard does not allow creating rules based on user-defined
! fields.  That's why this addin supplies its own filtering rules.
! 
! <p>
  Contributions to this documentation are welcome!<br>
  <br>


From tim_one@users.sourceforge.net  Sat Nov  2 07:01:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:01:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.3,1.4
Message-ID: <E187sHc-0005rD-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv22485/Outlook2000

Modified Files:
	about.html 
Log Message:
Grammar repair in new stuff.


Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** about.html	2 Nov 2002 06:53:24 -0000	1.3
--- about.html	2 Nov 2002 07:01:21 -0000	1.4
***************
*** 87,91 ****
      <li>You're done!  Close the <i>Field Chooser</i> box.
  </ul>
! Outlook's standard Automatic Formatting features can also be taught how
  access the value of this field; for example, you could tell Outlook to display
  rows with suspected spam messages in green italic.  However, for whatever reason,
--- 87,91 ----
      <li>You're done!  Close the <i>Field Chooser</i> box.
  </ul>
! Outlook's standard Automatic Formatting features can also be taught how to
  access the value of this field; for example, you could tell Outlook to display
  rows with suspected spam messages in green italic.  However, for whatever reason,
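
The about.html text above describes the Spam field as an integer from
0 (ham) through 100 (spam) inclusive.  A minimal sketch of producing
such a value, assuming the classifier yields a probability between 0.0
and 1.0 and that plain rounding is used; spam_score is a hypothetical
name and this is not necessarily how the addin computes the field.

```python
def spam_score(probability):
    """Map a spam probability in [0.0, 1.0] to an integer in 0..100."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # Assumption: simple rounding to the nearest whole percent.
    return int(round(probability * 100))
```

Storing a whole number matches the README's advice to create the
Outlook field with type "Integer" rather than "Number".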


From mhammond@users.sourceforge.net  Sat Nov  2 11:27:55 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 03:27:55 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	dump_props.py,1.3,1.4
Message-ID: <E187wRX-0002Rx-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv9291/sandbox

Modified Files:
	dump_props.py 
Log Message:
Beat Tim to the whitespace normalization <Wink>


Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** dump_props.py	2 Nov 2002 05:26:52 -0000	1.3
--- dump_props.py	2 Nov 2002 11:27:53 -0000	1.4
***************
*** 55,59 ****
      # get entry IDs
      return [row[0][1] for row in rows]
!     
  def _FindFolderEID(name):
      assert name
--- 55,59 ----
      # get entry IDs
      return [row[0][1] for row in rows]
! 
  def _FindFolderEID(name):
      assert name
***************
*** 67,84 ****
  # Also in new versions of mapiutil
  def GetAllProperties(obj, make_tag_names = True):
! 	tags = obj.GetPropList(0)
! 	hr, data = obj.GetProps(tags)
! 	ret = []
! 	for tag, val in data:
! 		if make_tag_names:
! 			hr, tags, array = obj.GetNamesFromIDs( (tag,) )
! 			if type(array[0][1])==type(u''):
! 				name = array[0][1]
! 			else:
! 				name = mapiutil.GetPropTagName(tag)
! 		else:
! 			name = tag
! 		ret.append((name, val))
! 	return ret
  
  def DumpItemProps(item, shorten):
--- 67,84 ----
  # Also in new versions of mapiutil
  def GetAllProperties(obj, make_tag_names = True):
!     tags = obj.GetPropList(0)
!     hr, data = obj.GetProps(tags)
!     ret = []
!     for tag, val in data:
!         if make_tag_names:
!             hr, tags, array = obj.GetNamesFromIDs( (tag,) )
!             if type(array[0][1])==type(u''):
!                 name = array[0][1]
!             else:
!                 name = mapiutil.GetPropTagName(tag)
!         else:
!             name = tag
!         ret.append((name, val))
!     return ret
  
  def DumpItemProps(item, shorten):
***************
*** 88,92 ****
              prop_repr = prop_repr[:50]
          print "%-20s: %s" % (prop_name, prop_repr)
!     
  def DumpProps(folder_eid, subject, include_attach, shorten):
      mapi_msgstore = _FindDefaultMessageStore()
--- 88,92 ----
              prop_repr = prop_repr[:50]
          print "%-20s: %s" % (prop_name, prop_repr)
! 
  def DumpProps(folder_eid, subject, include_attach, shorten):
      mapi_msgstore = _FindDefaultMessageStore()
***************
*** 167,171 ****
      if not folder_name:
          folder_name = "Inbox" # Assume this exists!
!         
      eid = _FindFolderEID(folder_name)
      if eid is None:
--- 167,171 ----
      if not folder_name:
          folder_name = "Inbox" # Assume this exists!
! 
      eid = _FindFolderEID(folder_name)
      if eid is None:


From mhammond@users.sourceforge.net  Sat Nov  2 12:09:38 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 04:09:38 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.18,1.19
Message-ID: <E187x5u-0000Ni-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv812

Modified Files:
	msgstore.py 
Log Message:
Nice patch from Piers Haken that does the best we can with Exchange Server
delivered messages.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** msgstore.py	2 Nov 2002 06:12:34 -0000	1.18
--- msgstore.py	2 Nov 2002 12:09:36 -0000	1.19
***************
*** 351,355 ****
--- 351,379 ----
          body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
          html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
+         # Mail delivered internally via Exchange Server etc may not have
+         # headers - fake some up.
+         if not headers:
+             headers = self._GetFakeHeaders ()
+         # Mail delivered via the Exchange Internet Mail MTA may have
+         # gibberish at the start of the headers - fix this.
+         elif headers.startswith("Microsoft Mail"):
+             headers = "X-MS-Mail-Gibberish: " + headers
          return "%s\n%s\n%s" % (headers, html, body)
+ 
+     def _GetFakeHeaders(self):
+         # This is designed to fake up some SMTP headers for messages
+         # on an exchange server that do not have such headers of their own
+         prop_ids = PR_SUBJECT_A, PR_DISPLAY_NAME_A, PR_DISPLAY_TO_A, PR_DISPLAY_CC_A
+         hr, data = self.mapi_object.GetProps(prop_ids,0)
+         subject = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0])
+         sender = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
+         to = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
+         cc = self._GetPotentiallyLargeStringProp(prop_ids[3], data[3])
+         headers = ["X-Exchange-Message: true"]
+         if subject: headers.append("Subject: "+subject)
+         if sender: headers.append("From: "+sender)
+         if to: headers.append("To: "+to)
+         if cc: headers.append("CC: "+cc)
+         return "\n".join(headers)
  
      def _EnsureObject(self):
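
The header handling in the patch above can be exercised without a MAPI
object.  A minimal sketch, assuming the four property values have
already been read as plain strings; fake_headers and repair_headers are
hypothetical stand-alone names, not methods of the msgstore classes.

```python
def fake_headers(subject="", sender="", to="", cc=""):
    """Build stand-in SMTP headers for a message that has none."""
    headers = ["X-Exchange-Message: true"]
    if subject:
        headers.append("Subject: " + subject)
    if sender:
        headers.append("From: " + sender)
    if to:
        headers.append("To: " + to)
    if cc:
        headers.append("CC: " + cc)
    return "\n".join(headers)

def repair_headers(headers, **fields):
    """Return usable headers, following the two cases in the patch."""
    if not headers:
        # Internally delivered Exchange mail: no headers at all.
        return fake_headers(**fields)
    if headers.startswith("Microsoft Mail"):
        # Turn the leading non-header line into a header of its own.
        return "X-MS-Mail-Gibberish: " + headers
    return headers
```

Prefixing the first line rather than deleting it keeps the text
available to the tokenizer while making the block parse as headers.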


From mhammond@users.sourceforge.net  Sat Nov  2 12:28:41 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 04:28:41 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
	FilterDialog.py,1.8,1.9
	FolderSelector.py,1.6,1.7 TrainingDialog.py,1.7,1.8
Message-ID: <E187xOL-0002vT-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv9661

Modified Files:
	FilterDialog.py FolderSelector.py TrainingDialog.py 
Log Message:
Another nice patch from Piers Haken - use the Outlook object model for the
folder dialog.  I have no idea why this is necessary for Exchange server,
but it seems OK, and is trivial to revert.

I'm certain that Exchange Server can be navigated via Ext MAPI, but I'm
happy this at least gets more people going.

Note after applying this, the Folder dialog may not automatically 
pre-select the folders you had selected (but they are still working)
- however, once you have re-selected, it does re-remember.

(It seems Outlook has done something funky with the entry IDs, and made 
them binary comparable, whereas MAPI and CDO ones are not.  Whatever)
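
The simplified comparison this change relies on can be sketched as
follows.  It is an illustration only: normalize_id, ids_equal and in_ids
are hypothetical names, and the IDs are assumed to be the hex strings
the Outlook object model returns for StoreID and EntryID.

```python
def normalize_id(folder_id, default_store_id):
    """Return a (store_id, entry_id) tuple for a bare or paired ID."""
    if not isinstance(folder_id, tuple):
        # A bare entry ID is assumed to live in the default store.
        return (default_store_id, folder_id)
    return folder_id

def ids_equal(id1, id2, default_store_id):
    """Compare two folder IDs by plain equality after normalizing."""
    return (normalize_id(id1, default_store_id) ==
            normalize_id(id2, default_store_id))

def in_ids(folder_id, ids, default_store_id):
    """True if folder_id matches any entry of ids."""
    return any(ids_equal(folder_id, other, default_store_id)
               for other in ids)
```

Plain equality is only safe if the same folder always yields the same
ID string, which is what the log message reports for Outlook's IDs; the
Extended MAPI IDs needed CompareEntryIDs instead.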


Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** FilterDialog.py	1 Nov 2002 02:03:46 -0000	1.8
--- FilterDialog.py	2 Nov 2002 12:28:38 -0000	1.9
***************
*** 194,198 ****
                  ids = [ids]
              single_select = not ids_are_list
!             d = FolderSelector.FolderSelector(self.mgr.message_store.session, ids, checkbox_state=None, single_select=single_select)
              if d.DoModal()==win32con.IDOK:
                  new_ids, include_sub = d.GetSelectedIDs()
--- 194,199 ----
                  ids = [ids]
              single_select = not ids_are_list
! #            d = FolderSelector.FolderSelector(self.mgr.message_store.session, ids, checkbox_state=None, single_select=single_select)
!             d = FolderSelector.FolderSelector(self.mgr.outlook.Session, ids, checkbox_state=None, single_select=single_select)
              if d.DoModal()==win32con.IDOK:
                  new_ids, include_sub = d.GetSelectedIDs()

Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FolderSelector.py	1 Nov 2002 05:47:59 -0000	1.6
--- FolderSelector.py	2 Nov 2002 12:28:38 -0000	1.7
***************
*** 22,25 ****
--- 22,35 ----
              c.dump(level+1)
  
+ # Oh, lord help us.
+ # We started with a CDO version - but CDO sucks for lots of reasons I
+ # won't even start to mention.
+ # So we moved to an Extended MAPI version which is nice and fast - screams
+ # along!  Except it doesn't work in all cases with Exchange (which
+ # strikes Mark as extremely strange given that the Extended MAPI Python
+ # bindings were developed against an Exchange Server - but Mark doesn't
+ # have an Exchange server handy these days, and really doesn't give a
+ # rat's arse <wink>).
+ # So finally we have an Outlook object model version!
  #########################################################################
  ## CDO version of a folder walker.
***************
*** 90,93 ****
--- 100,118 ----
      return root
  
+ ## <sob> - An Outlook object model version
+ def _BuildFolderTreeOutlook(session, parent):
+     children = []
+     for i in range (parent.Folders.Count):
+         folder = parent.Folders [i+1]
+         spec = FolderSpec ((folder.StoreID, folder.EntryID), folder.Name.encode("mbcs", "replace"))
+         if folder.Folders != None:
+             spec.children = _BuildFolderTreeOutlook (session, folder)
+         children.append(spec)
+     return children
+ 
+ def BuildFolderTreeOutlook(session):
+     root = FolderSpec(None, "root")
+     root.children = _BuildFolderTreeOutlook(session, session)
+     return root
  
  #########################################################################
***************
*** 141,146 ****
          if type(id2) != type(()):
              id2 = default_store_id, id2
!         return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
!                self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))
  
      def InIDs(self, id, ids):
--- 166,172 ----
          if type(id2) != type(()):
              id2 = default_store_id, id2
!         return id1 == id2
! #        return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
! #               self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))
  
      def InIDs(self, id, ids):
***************
*** 251,260 ****
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
  
!         if hasattr(self.mapi, "_oleobj_"): # Dispatch COM object
!             # CDO
!             tree = BuildFolderTreeCDO(self.mapi)
!         else:
!             # Extended MAPI.
!             tree = BuildFolderTreeMAPI(self.mapi)
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
--- 277,287 ----
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
  
!         tree = BuildFolderTreeOutlook(self.mapi)
! #        if hasattr(self.mapi, "_oleobj_"): # Dispatch COM object
! #            # CDO
! #            tree = BuildFolderTreeCDO(self.mapi)
! #        else:
! #            # Extended MAPI.
! #            tree = BuildFolderTreeMAPI(self.mapi)            
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
***************
*** 353,356 ****
      print d.GetSelectedIDs()
  
  if __name__=='__main__':
!     TestWithMAPI()
--- 380,391 ----
      print d.GetSelectedIDs()
  
+ def TestWithOutlook():
+     from win32com.client import Dispatch
+     outlook = Dispatch("Outlook.Application")
+     d=FolderSelector(outlook.Session, None, single_select = False)
+     d.DoModal()
+     print d.GetSelectedIDs()
+ 
+ 
  if __name__=='__main__':
!     TestWithOutlook()

Index: TrainingDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/TrainingDialog.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** TrainingDialog.py	1 Nov 2002 02:03:52 -0000	1.7
--- TrainingDialog.py	2 Nov 2002 12:28:38 -0000	1.8
***************
*** 105,109 ****
                  sub_attr = "ham_include_sub"
              include_sub = getattr(self.config, sub_attr)
!             d = FolderSelector.FolderSelector(self.mgr.message_store.session, l, checkbox_state=include_sub)
              if d.DoModal()==win32con.IDOK:
                  l[:], include_sub = d.GetSelectedIDs()[:]
--- 105,110 ----
                  sub_attr = "ham_include_sub"
              include_sub = getattr(self.config, sub_attr)
! #            d = FolderSelector.FolderSelector(self.mgr.message_store.session, l, checkbox_state=include_sub)
!             d = FolderSelector.FolderSelector(self.mgr.outlook.Session, l, checkbox_state=include_sub)
              if d.DoModal()==win32con.IDOK:
                  l[:], include_sub = d.GetSelectedIDs()[:]


From tim_one@users.sourceforge.net  Sat Nov  2 17:11:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 02 Nov 2002 09:11:50 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/dialogs FolderSelector.py,1.7,1.8
Message-ID: <E1881oM-0005q6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv19232/Outlook2000/dialogs

Modified Files:
	FolderSelector.py 
Log Message:
Folded long lines so I could read it better.  We've got a regression
here:  the folder selectors in the Training and Define Filters dialogs
still work, but in the Filter Now dialog clicking Browse dies with

Traceback (most recent call last):
File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py",
    line 313, in OnInitDialog
    tree = BuildFolderTreeOutlook(self.mapi)
File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py",
    line 119, in BuildFolderTreeOutlook
    root.children = _BuildFolderTreeOutlook(session, session)
File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py",
    line 108, in _BuildFolderTreeOutlook
    for i in range(parent.Folders.Count):
AttributeError: Folders
win32ui: OnInitDialog() virtual handler
    (<bound method FolderSelector.OnInitDialog of
     <dialogs.FolderSelector.FolderSelector instance at 0x04025050>>)
raised an exception
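
The traceback above can be reproduced outside Outlook with a small
stand-in.  This is only a sketch: FakeFolders and FakeFolder are
hypothetical classes that mimic the two features the walker relies on (a
Count attribute and 1-based indexing), and build_children/build_tree are
simplified copies of _BuildFolderTreeOutlook/BuildFolderTreeOutlook without
the "mbcs" name encoding.  Passing any object that lacks a Folders
attribute fails in the same way the Filter Now dialog does.

```python
class FolderSpec:
    """Simplified node in the folder tree shown by the selector."""
    def __init__(self, folder_id, name):
        self.folder_id = folder_id
        self.name = name
        self.children = []


class FakeFolders:
    """Hypothetical stand-in for an Outlook Folders collection."""
    def __init__(self, items):
        self._items = list(items)

    @property
    def Count(self):
        return len(self._items)

    def __len__(self):
        # Lets "if folder.Folders:" be false for an empty collection.
        return len(self._items)

    def __getitem__(self, index):
        # The walker indexes the collection starting at 1, not 0.
        if not 1 <= index <= len(self._items):
            raise IndexError(index)
        return self._items[index - 1]


class FakeFolder:
    """Hypothetical stand-in for an Outlook folder (or the Session)."""
    def __init__(self, name, children=()):
        self.Name = name
        self.StoreID = "store"
        self.EntryID = "entry-" + name
        self.Folders = FakeFolders(children)


def build_children(parent):
    children = []
    for i in range(parent.Folders.Count):
        folder = parent.Folders[i + 1]
        spec = FolderSpec((folder.StoreID, folder.EntryID), folder.Name)
        if folder.Folders:
            spec.children = build_children(folder)
        children.append(spec)
    return children


def build_tree(session):
    root = FolderSpec(None, "root")
    root.children = build_children(session)
    return root
```

Calling build_tree(object()) raises AttributeError because a plain object
has no Folders attribute, which matches the last frame of the traceback.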


Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** FolderSelector.py	2 Nov 2002 12:28:38 -0000	1.7
--- FolderSelector.py	2 Nov 2002 17:11:47 -0000	1.8
***************
*** 26,34 ****
  # wont even start to mention.
  # So we moved to an Extended MAPI version with is nice and fast - screams
! # along!  Except it doesn't work in all cases with Exchange (which 
  # strikes Mark as extremely strange given that the Extended MAPI Python
  # bindings were developed against an Exchange Server - but Mark doesn't
  # have an Exchange server handy these days, and really doesn't give a
! # rat's arse <wink>
  # So finally we have an Outlook object model version!
  #########################################################################
--- 26,34 ----
  # wont even start to mention.
  # So we moved to an Extended MAPI version with is nice and fast - screams
! # along!  Except it doesn't work in all cases with Exchange (which
  # strikes Mark as extremely strange given that the Extended MAPI Python
  # bindings were developed against an Exchange Server - but Mark doesn't
  # have an Exchange server handy these days, and really doesn't give a
! # rat's arse <wink>).
  # So finally we have an Outlook object model version!
  #########################################################################
***************
*** 69,73 ****
      table = folder.GetHierarchyTable(0)
      children = []
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
      for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
          folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
--- 69,75 ----
      table = folder.GetHierarchyTable(0)
      children = []
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID,
!                                        PR_STORE_ENTRYID,
!                                        PR_DISPLAY_NAME_A), None, None, 0)
      for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
          folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
***************
*** 90,95 ****
              default_store_id = hex_eid
  
!         msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
!         hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
          subtree_eid = data[0][1]
          folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
--- 92,98 ----
              default_store_id = hex_eid
  
!         msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
!                                                       mapi.MAPI_DEFERRED_ERRORS)
!         hr, data = msgstore.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
          subtree_eid = data[0][1]
          folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
***************
*** 103,111 ****
  def _BuildFolderTreeOutlook(session, parent):
      children = []
!     for i in range (parent.Folders.Count):
!         folder = parent.Folders [i+1]
!         spec = FolderSpec ((folder.StoreID, folder.EntryID), folder.Name.encode("mbcs", "replace"))
!         if folder.Folders != None:
!             spec.children = _BuildFolderTreeOutlook (session, folder)
          children.append(spec)
      return children
--- 106,115 ----
  def _BuildFolderTreeOutlook(session, parent):
      children = []
!     for i in range(parent.Folders.Count):
!         folder = parent.Folders[i+1]
!         spec = FolderSpec((folder.StoreID, folder.EntryID),
!                           folder.Name.encode("mbcs", "replace"))
!         if folder.Folders:
!             spec.children = _BuildFolderTreeOutlook(session, folder)
          children.append(spec)
      return children
***************
*** 128,136 ****
  
  class FolderSelector(dialog.Dialog):
!     style = win32con.DS_MODALFRAME | win32con.WS_POPUP | win32con.WS_VISIBLE | win32con.WS_CAPTION | win32con.WS_SYSMENU | win32con.DS_SETFONT
      cs = win32con.WS_CHILD | win32con.WS_VISIBLE
!     treestyle = cs | win32con.WS_BORDER | commctrl.TVS_HASLINES | commctrl.TVS_LINESATROOT | \
!                 commctrl.TVS_CHECKBOXES | commctrl.TVS_HASBUTTONS | \
!                 commctrl.TVS_DISABLEDRAGDROP | commctrl.TVS_SHOWSELALWAYS
      dt = [
          # Dialog itself.
--- 132,150 ----
  
  class FolderSelector(dialog.Dialog):
!     style = (win32con.DS_MODALFRAME |
!              win32con.WS_POPUP |
!              win32con.WS_VISIBLE |
!              win32con.WS_CAPTION |
!              win32con.WS_SYSMENU |
!              win32con.DS_SETFONT)
      cs = win32con.WS_CHILD | win32con.WS_VISIBLE
!     treestyle = (cs |
!                  win32con.WS_BORDER |
!                  commctrl.TVS_HASLINES |
!                  commctrl.TVS_LINESATROOT |
!                  commctrl.TVS_CHECKBOXES |
!                  commctrl.TVS_HASBUTTONS |
!                  commctrl.TVS_DISABLEDRAGDROP |
!                  commctrl.TVS_SHOWSELALWAYS)
      dt = [
          # Dialog itself.
***************
*** 147,151 ****
      ]
  
!     def __init__ (self, mapi, selected_ids = None, single_select = False, checkbox_state = False, checkbox_text = None, desc_noun = "Select", desc_noun_suffix = "ed"):
          assert not single_select or selected_ids is None or len(selected_ids)<=1
          dialog.Dialog.__init__ (self, self.dt)
--- 161,170 ----
      ]
  
!     def __init__ (self, mapi, selected_ids=None,
!                               single_select=False,
!                               checkbox_state=False,
!                               checkbox_text=None,
!                               desc_noun="Select",
!                               desc_noun_suffix="ed"):
          assert not single_select or selected_ids is None or len(selected_ids)<=1
          dialog.Dialog.__init__ (self, self.dt)
***************
*** 194,198 ****
                  mask = state = 0
              else:
!                 if self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
                      state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
                      num_children_selected += 1
--- 213,218 ----
                  mask = state = 0
              else:
!                 if (self.selected_ids and
!                         self.InIDs(child.folder_id, self.selected_ids)):
                      state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
                      num_children_selected += 1
***************
*** 201,206 ****
                  mask = commctrl.TVIS_STATEIMAGEMASK
              item_id = self._MakeItemParam(child)
!             hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
!             if self.single_select and self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
                  self.list.SelectItem(hitem)
  
--- 221,236 ----
                  mask = commctrl.TVIS_STATEIMAGEMASK
              item_id = self._MakeItemParam(child)
!             hitem = self.list.InsertItem(hParent, 0,
!                                          (None,
!                                           state,
!                                           mask,
!                                           text,
!                                           bitmapCol,
!                                           bitmapSel,
!                                           cItems,
!                                           item_id))
!             if (self.single_select and
!                     self.selected_ids and
!                     self.InIDs(child.folder_id, self.selected_ids)):
                  self.list.SelectItem(hitem)
  
***************
*** 232,236 ****
      def _YieldCheckedChildren(self):
          if self.single_select:
!             # If single-select, the checked state is not used, just the selected state.
              try:
                  h = self.list.GetSelectedItem()
--- 262,267 ----
      def _YieldCheckedChildren(self):
          if self.single_select:
!             # If single-select, the checked state is not used, just the
!             # selected state.
              try:
                  h = self.list.GetSelectedItem()
***************
*** 271,277 ****
          if self.single_select:
              # Remove the checkbox style from the list for single-selection
!             style = win32api.GetWindowLong(self.list.GetSafeHwnd(), win32con.GWL_STYLE)
              style = style & ~commctrl.TVS_CHECKBOXES
!             win32api.SetWindowLong(self.list.GetSafeHwnd(), win32con.GWL_STYLE, style)
              # Hide "clear all"
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
--- 302,311 ----
          if self.single_select:
              # Remove the checkbox style from the list for single-selection
!             style = win32api.GetWindowLong(self.list.GetSafeHwnd(),
!                                            win32con.GWL_STYLE)
              style = style & ~commctrl.TVS_CHECKBOXES
!             win32api.SetWindowLong(self.list.GetSafeHwnd(),
!                                    win32con.GWL_STYLE,
!                                    style)
              # Hide "clear all"
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
***************
*** 283,287 ****
  #        else:
  #            # Extended MAPI.
! #            tree = BuildFolderTreeMAPI(self.mapi)            
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
--- 317,321 ----
  #        else:
  #            # Extended MAPI.
! #            tree = BuildFolderTreeMAPI(self.mapi)
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
***************
*** 311,315 ****
                      names.append(info[3])
  
!             status_string = "%s%s %d folder" % (self.select_desc_noun, self.select_desc_noun_suffix, num_checked)
              if num_checked != 1:
                  status_string += "s"
--- 345,351 ----
                      names.append(info[3])
  
!             status_string = "%s%s %d folder" % (self.select_desc_noun,
!                                                 self.select_desc_noun_suffix,
!                                                 num_checked)
              if num_checked != 1:
                  status_string += "s"


From tim_one@users.sourceforge.net  Sat Nov  2 17:27:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 02 Nov 2002 09:27:49 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/dialogs FilterDialog.py,1.9,1.10
Message-ID: <E18823p-0001em-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv5390/Outlook2000/dialogs

Modified Files:
	FilterDialog.py 
Log Message:
FilterNowDialog.OnButBrowse():  Repaired the way FolderSelector is
called so that this works again.


Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** FilterDialog.py	2 Nov 2002 12:28:38 -0000	1.9
--- FilterDialog.py	2 Nov 2002 17:27:44 -0000	1.10
***************
*** 333,338 ****
              import FolderSelector
              filter = self.mgr.config.filter_now
!             d = FolderSelector.FolderSelector(self.mgr.message_store.session, filter.folder_ids,checkbox_state=filter.include_sub)
!             if d.DoModal()==win32con.IDOK:
                  filter.folder_ids, filter.include_sub = d.GetSelectedIDs()
                  self.UpdateFolderNames()
--- 333,341 ----
              import FolderSelector
              filter = self.mgr.config.filter_now
!             # d = FolderSelector.FolderSelector(self.mgr.message_store.session, filter.folder_ids,checkbox_state=filter.include_sub)
!             d = FolderSelector.FolderSelector(self.mgr.outlook.Session,
!                                               filter.folder_ids,
!                                               checkbox_state=filter.include_sub)
!             if d.DoModal() == win32con.IDOK:
                  filter.folder_ids, filter.include_sub = d.GetSelectedIDs()
                  self.UpdateFolderNames()


From richiehindle@users.sourceforge.net  Sat Nov  2 21:00:23 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Sat, 02 Nov 2002 13:00:23 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.8,1.9
Message-ID: <E1885NX-0003aN-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13701

Modified Files:
	pop3proxy.py 
Log Message:
Can now listen on the port of your choice (thanks to Tim Stone).
Now supports the 'Unsure' value for X-Hammie-Disposition.
No longer pads the disposition to a fixed width; the length of the longest
possible header is added to the size of each message instead.
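
A minimal sketch of the size handling, assuming the header name is the
default X-Hammie-Disposition and that sizes are reported before a message
has been classified (as POP3 commands such as STAT and LIST require).
reported_size is a hypothetical helper, not a function in pop3proxy.py.

```python
DISPHEADER = "X-Hammie-Disposition"   # assumed default header name

HEADER_FORMAT = "%s: %%s\r\n" % DISPHEADER
# "Unsure" is the longest of the three dispositions, so this is the
# longest header the proxy can add.
HEADER_EXAMPLE = HEADER_FORMAT % "Unsure"


def reported_size(original_size):
    """Size to report for a message that has not been classified yet.

    The real header may be shorter ("Yes" or "No"), so the result is an
    upper bound rather than an exact figure.
    """
    return original_size + len(HEADER_EXAMPLE)
```

With these assumptions the reported size is never smaller than the size of
the message actually delivered, whichever disposition it ends up with.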


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** pop3proxy.py	1 Nov 2002 09:14:47 -0000	1.8
--- pop3proxy.py	2 Nov 2002 21:00:21 -0000	1.9
***************
*** 12,18 ****
                   defaults to 110.
  
!         options (the same as hammie):
              -p FILE : use the named data file
              -d      : the file is a DBM file rather than a pickle
  
      pop3proxy -t
--- 12,19 ----
                   defaults to 110.
  
!         options:
              -p FILE : use the named data file
              -d      : the file is a DBM file rather than a pickle
+             -l port : listen on this port number (default 110)
  
      pop3proxy -t
***************
*** 39,44 ****
  from Options import options
  
  HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Yes\r\n' % hammie.DISPHEADER
  
  
--- 40,47 ----
  from Options import options
  
+ # HEADER_EXAMPLE is the longest possible header - the length of this one
+ # is added to the size of each message.
  HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Unsure\r\n' % hammie.DISPHEADER
  
  
***************
*** 58,61 ****
--- 61,65 ----
          self.set_socket(s, socketMap)
          self.set_reuse_addr()
+         print "Listening on port %d." % port
          self.bind(('', port))
          self.listen(5)
***************
*** 337,350 ****
              ok, message = response.split('\n', 1)
  
!             # Now find the spam disposition and add the header.  The
!             # trailing space in "No " ensures consistent lengths - this
!             # is required because POP3 commands like 'STAT' and 'LIST'
!             # need to be able to report the size of a message before
!             # it's been classified.
              prob = self.bayes.spamprob(tokenizer.tokenize(message))
!             if prob > options.spam_cutoff:
                  disposition = "Yes"
              else:
!                 disposition = "No "
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
--- 341,353 ----
              ok, message = response.split('\n', 1)
  
!             # Now find the spam disposition and add the header.
              prob = self.bayes.spamprob(tokenizer.tokenize(message))
!             if prob < options.ham_cutoff:
!                 disposition = "No"
!             elif prob > options.spam_cutoff:
                  disposition = "Yes"
              else:
!                 disposition = "Unsure"
!             
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
***************
*** 577,581 ****
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdp:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
--- 580,584 ----
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdp:l:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
***************
*** 583,586 ****
--- 586,590 ----
  
      pickleName = hammie.DEFAULTDB
+     proxyPort = 110
      useDB = False
      runTestServer = False
***************
*** 595,599 ****
          elif opt == '-p':
              pickleName = arg
! 
      # Do whatever we've been asked to do...
      if not opts and not args:
--- 599,605 ----
          elif opt == '-p':
              pickleName = arg
!         elif opt == '-l':
!             proxyPort = int(arg)
!             
      # Do whatever we've been asked to do...
      if not opts and not args:
***************
*** 609,617 ****
      elif len(args) == 1:
          # Named POP3 server, default port.
!         main(args[0], 110, 110, pickleName, useDB)
  
      elif len(args) == 2:
          # Named POP3 server, named port.
!         main(args[0], int(args[1]), 110, pickleName, useDB)
  
      else:
--- 615,623 ----
      elif len(args) == 1:
          # Named POP3 server, default port.
!         main(args[0], 110, proxyPort, pickleName, useDB)
  
      elif len(args) == 2:
          # Named POP3 server, named port.
!         main(args[0], int(args[1]), proxyPort, pickleName, useDB)
  
      else:


From mhammond@users.sourceforge.net  Sun Nov  3 02:00:33 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 18:00:33 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.19,1.20
Message-ID: <E188A41-0008J0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv31898

Modified Files:
	msgstore.py 
Log Message:
_GetFakeHeaders must end with \n


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** msgstore.py	2 Nov 2002 12:09:36 -0000	1.19
--- msgstore.py	3 Nov 2002 02:00:31 -0000	1.20
***************
*** 375,379 ****
          if to: headers.append("To: "+to)
          if cc: headers.append("CC: "+cc)
!         return "\n".join(headers)
  
      def _EnsureObject(self):
--- 375,379 ----
          if to: headers.append("To: "+to)
          if cc: headers.append("CC: "+cc)
!         return "\n".join(headers) + "\n"
  
      def _EnsureObject(self):


From hooft@users.sourceforge.net  Sun Nov  3 13:48:49 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 03 Nov 2002 05:48:49 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.63,1.64
	hammie.py,1.33,1.34
Message-ID: <E188L7R-0004OA-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16667

Modified Files:
	Options.py hammie.py 
Log Message:
 * Added options "header_spam_string", "header_unsure_string",
   "header_ham_string". Defaults are set to "Yes", "Unsure", "No".
 * Added options header_score_digits and header_score_logarithm. The
   first is an integer giving the number of decimal digits hammie uses
   to show the score. If the second option is set to "True", scores of
   1.00 or 0.00 are augmented by a logarithmic "one-ness" or "zero-ness"
   score (basically it shows the "number of zeros" or "number of nines"
   next to the score value).
 * Added support for a debugging header, using the boolean option
   hammie_debug_header and the string option hammie_debug_header_name.
 * Changed hammie.py to use all of the new options.

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** Options.py	28 Oct 2002 20:19:46 -0000	1.63
--- Options.py	3 Nov 2002 13:48:47 -0000	1.64
***************
*** 286,302 ****
  [Hammie]
  # The name of the header that hammie adds to an E-mail in filter mode
  hammie_header_name: X-Hammie-Disposition
  
! # The default database path used by hammie
! persistent_storage_file: hammie.db
  
! # The range of clues that are added to the "hammie" header in the E-mail
  # All clues that have their probability smaller than this number, or larger
  # than one minus this number are added to the header such that you can see
  # why spambayes thinks this is ham/spam or why it is unsure. The default is
  # to show all clues, but you can reduce that by setting showclue to a lower
! # value, such as 0.1 (which Rob is using)
  clue_mailheader_cutoff: 0.5
  
  # hammie can use either a database (quick to score one message) or a pickle
  # (quick to train on huge amounts of messages). Set this to True to use a
--- 286,324 ----
  [Hammie]
  # The name of the header that hammie adds to an E-mail in filter mode
+ # It contains the "classification" of the mail, plus the score.
  hammie_header_name: X-Hammie-Disposition
  
! # The three disposition names are added to the header as the following
! # Three words:
! header_spam_string: Yes
! header_unsure_string: Unsure
! header_ham_string: No
  
! # Accuracy of the score in the header in decimal digits
! header_score_digits: 2
! 
! # Set this to "True", to augment scores of 1.00 or 0.00 by a logarithmic
! # "one-ness" or "zero-ness" score (basically it shows the "number of zeros"
! # or "number of nines" next to the score value).
! header_score_logarithm: False
! 
! # Enable debugging information in the header.
! hammie_debug_header: False
! 
! # Name of a debugging header for spambayes hackers, showing the strongest
! # clues that have resulted in the classification in the standard header.
! hammie_debug_header_name: X-Hammie-Debug
! 
! # The range of clues that are added to the "debug" header in the E-mail
  # All clues that have their probability smaller than this number, or larger
  # than one minus this number are added to the header such that you can see
  # why spambayes thinks this is ham/spam or why it is unsure. The default is
  # to show all clues, but you can reduce that by setting showclue to a lower
! # value, such as 0.1
  clue_mailheader_cutoff: 0.5
  
+ # The default database path used by hammie
+ persistent_storage_file: hammie.db
+ 
  # hammie can use either a database (quick to score one message) or a pickle
  # (quick to train on huge amounts of messages). Set this to True to use a
***************
*** 363,366 ****
--- 385,395 ----
                 'clue_mailheader_cutoff': float_cracker,
                 'persistent_use_database': boolean_cracker,
+                'header_spam_string': string_cracker,
+                'header_unsure_string': string_cracker,
+                'header_ham_string': string_cracker,
+                'header_score_digits': int_cracker,
+                'header_score_logarithm': boolean_cracker,
+                'hammie_debug_header': boolean_cracker,
+                'hammie_debug_header_name': string_cracker,
                 },
  

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** hammie.py	27 Oct 2002 22:56:15 -0000	1.33
--- hammie.py	3 Nov 2002 13:48:47 -0000	1.34
***************
*** 57,60 ****
--- 57,62 ----
  # Name of the header to add in filter mode
  DISPHEADER = options.hammie_header_name
+ DEBUGHEADER = options.hammie_debug_header_name
+ DODEBUG = options.hammie_debug_header
  
  # Default database name
***************
*** 242,246 ****
  
      def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
!                ham_cutoff=HAM_THRESHOLD):
          """Score (judge) a message and add a disposition header.
  
--- 244,249 ----
  
      def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
!                ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER,
!                debug=DODEBUG):
          """Score (judge) a message and add a disposition header.
  
***************
*** 248,253 ****
  
          Optionally, set header to the name of the header to add, and/or
!         cutoff to the probability value which must be met or exceeded
!         for a message to get a 'Yes' disposition.
  
          Returns the same message with a new disposition header.
--- 251,261 ----
  
          Optionally, set header to the name of the header to add, and/or
!         spam_cutoff/ham_cutoff to the probability values which must be met
!         or exceeded for a message to get a 'Spam' or 'Ham' classification.
! 
!         An extra debugging header can be added if 'debug' is set to True.
!         The name of the debugging header is given as 'debugheader'.
! 
!         All defaults for optional parameters come from the Options file.
  
          Returns the same message with a new disposition header.
***************
*** 261,272 ****
          prob, clues = self._scoremsg(msg, True)
          if prob < ham_cutoff:
!             disp = "No"
          elif prob > spam_cutoff:
!             disp = "Yes"
          else:
!             disp = "Unsure"
!         disp += "; %.2f" % prob
!         disp += "; " + self.formatclues(clues)
          msg.add_header(header, disp)
          return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
  
--- 269,291 ----
          prob, clues = self._scoremsg(msg, True)
          if prob < ham_cutoff:
!             disp = options.header_ham_string
          elif prob > spam_cutoff:
!             disp = options.header_spam_string
          else:
!             disp = options.header_unknown_string
!         disp += ("; %."+str(options.header_score_digits)+"f") % prob
!         if options.header_score_logarithm:
!             if prob<=0.005 and prob>0.0:
!                 import math
!                 x=-math.log10(prob)
!                 disp += " (%d)"%x
!             if prob>=0.995 and prob<1.0:
!                 import math
!                 x=-math.log10(1.0-prob)
!                 disp += " (%d)"%x
          msg.add_header(header, disp)
+         if debug:
+             disp = self.formatclues(clues)
+             msg.add_header(debugheader, disp)
          return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
  

From hooft@users.sourceforge.net  Sun Nov  3 14:24:38 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 03 Nov 2002 06:24:38 -0800
Subject: [Spambayes-checkins] spambayes hammie.py,1.34,1.35
Message-ID: <E188Lg6-0006kG-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25907

Modified Files:
	hammie.py 
Log Message:
fix typo(?)

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** hammie.py	3 Nov 2002 13:48:47 -0000	1.34
--- hammie.py	3 Nov 2002 14:24:36 -0000	1.35
***************
*** 273,277 ****
              disp = options.header_spam_string
          else:
!             disp = options.header_unknown_string
          disp += ("; %."+str(options.header_score_digits)+"f") % prob
          if options.header_score_logarithm:
--- 273,277 ----
              disp = options.header_spam_string
          else:
!             disp = options.header_unsure_string
          disp += ("; %."+str(options.header_score_digits)+"f") % prob
          if options.header_score_logarithm:
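The disposition-header logic touched by this change and the previous one can be sketched as a standalone function. This is a minimal illustration in modern Python, not the hammie.py code itself: the function name, the keyword parameters, and the default cutoffs of 0.2 and 0.9 are assumptions made for the example, and the default strings are the literals the older revision used.

```python
import math

def format_disposition(prob, ham_cutoff=0.2, spam_cutoff=0.9,
                       ham_string="No", spam_string="Yes",
                       unsure_string="Unsure", score_digits=2,
                       score_logarithm=False):
    """Build a disposition header value for a score in [0.0, 1.0].

    Hypothetical helper: the cutoffs and strings stand in for the
    values that hammie.py reads from the Options file.
    """
    if prob < ham_cutoff:
        disp = ham_string
    elif prob > spam_cutoff:
        disp = spam_string
    else:
        disp = unsure_string
    # "%.*f" takes the number of digits from the argument tuple.
    disp += "; %.*f" % (score_digits, prob)
    if score_logarithm:
        # Scores that round to 0.00 or 1.00 lose information, so append
        # the order of magnitude of the distance from the extreme.
        if 0.0 < prob <= 0.005:
            disp += " (%d)" % -math.log10(prob)
        elif 0.995 <= prob < 1.0:
            disp += " (%d)" % -math.log10(1.0 - prob)
    return disp
```

With these defaults, format_disposition(0.5) gives "Unsure; 0.50". The logarithm suffix only appears for scores within 0.005 of either extreme, where the fixed-digit score alone can no longer distinguish one message from another.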


From mhammond@users.sourceforge.net  Mon Nov  4 00:41:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:41:10 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.20,1.21
Message-ID: <E188VIk-0006ze-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv26387

Modified Files:
	msgstore.py 
Log Message:
Allow an Outlook folder to be passed as a "folder id" (in the same way
we did that for messages).

Give __eq__ and __ne__ methods to compare folders.  I'm pretty sure the
MAPI semantics are correct, but not as confident on the new rich
comparisons <wink>.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** msgstore.py	3 Nov 2002 02:00:31 -0000	1.20
--- msgstore.py	4 Nov 2002 00:41:08 -0000	1.21
***************
*** 198,202 ****
      def GetFolder(self, folder_id):
          # Return a single folder given the ID.
!         folder_id = self.NormalizeID(folder_id)
          folder = self._OpenEntry(folder_id)
          table = folder.GetContentsTable(0)
--- 198,207 ----
      def GetFolder(self, folder_id):
          # Return a single folder given the ID.
!         if hasattr(folder_id, "EntryID"):
!             # An Outlook object
!             folder_id = mapi.BinFromHex(folder_id.StoreID), \
!                          mapi.BinFromHex(folder_id.EntryID)
!         else:
!             folder_id = self.NormalizeID(folder_id)
          folder = self._OpenEntry(folder_id)
          table = folder.GetContentsTable(0)
***************
*** 248,251 ****
--- 253,265 ----
                                                  mapi.HexFromBin(self.id[1]))
  
+     def __eq__(self, other):
+         if other is None: return False
+         ceid = self.msgstore.session.CompareEntryIDs
+         return ceid(self.id[0], other.id[0]) and \
+                ceid(self.id[1], other.id[1])
+ 
+     def __ne__(self, other):
+         return not self.__eq__(other)
+ 
      def GetID(self):
          return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
***************
*** 298,301 ****
--- 312,324 ----
                                       mapi.HexFromBin(self.id[1]))
  
+     def __eq__(self, other):
+         if other is None: return False
+         ceid = self.msgstore.session.CompareEntryIDs
+         return ceid(self.id[0], other.id[0]) and \
+                ceid(self.id[1], other.id[1])
+ 
+     def __ne__(self, other):
+         return not self.__eq__(other)
+ 
      def GetID(self):
          return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
***************
*** 303,307 ****
      def GetOutlookItem(self):
          hex_item_id = mapi.HexFromBin(self.id[1])
!         store_hex_id = mapi.HexFromBin(self.id[0])
          return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id)
  
--- 326,330 ----
      def GetOutlookItem(self):
          hex_item_id = mapi.HexFromBin(self.id[1])
!         hex_store_id = mapi.HexFromBin(self.id[0])
          return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id)
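The paired __eq__/__ne__ methods added above can be illustrated without MAPI. The sketch below is an assumption-laden stand-in: FolderRef and its compare_ids callback are hypothetical names, with the callback playing the role that session.CompareEntryIDs plays in msgstore.py. Unlike the checked-in version, it returns NotImplemented for unrelated types instead of special-casing None.

```python
class FolderRef(object):
    """Hypothetical wrapper identified by a (store_id, entry_id) pair."""

    def __init__(self, store_id, entry_id, compare_ids=None):
        self.id = (store_id, entry_id)
        # compare_ids stands in for session.CompareEntryIDs; it falls
        # back to plain equality when no callback is supplied.
        self._compare = compare_ids or (lambda a, b: a == b)

    def __eq__(self, other):
        if not isinstance(other, FolderRef):
            # Let Python try the other operand, then fall back to identity.
            return NotImplemented
        return bool(self._compare(self.id[0], other.id[0]) and
                    self._compare(self.id[1], other.id[1]))

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so define both.
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result

    # Equality is delegated to a callback, so no consistent hash exists.
    __hash__ = None
```

Comparing a FolderRef with None then yields False for == and True for != without an explicit None check, because both operands decline the comparison and Python falls back to identity.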
  

From mhammond@users.sourceforge.net  Mon Nov  4 00:49:13 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:49:13 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	dump_props.py,1.4,1.5
Message-ID: <E188VQX-0007bY-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv29119

Modified Files:
	dump_props.py 
Log Message:
If the property type is PT_ERROR, show the best error code repr we can.


Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** dump_props.py	2 Nov 2002 11:27:53 -0000	1.4
--- dump_props.py	4 Nov 2002 00:49:11 -0000	1.5
***************
*** 66,75 ****
  
  # Also in new versions of mapituil
! def GetAllProperties(obj, make_tag_names = True):
      tags = obj.GetPropList(0)
      hr, data = obj.GetProps(tags)
      ret = []
      for tag, val in data:
!         if make_tag_names:
              hr, tags, array = obj.GetNamesFromIDs( (tag,) )
              if type(array[0][1])==type(u''):
--- 66,75 ----
  
  # Also in new versions of mapituil
! def GetAllProperties(obj, make_pretty = True):
      tags = obj.GetPropList(0)
      hr, data = obj.GetProps(tags)
      ret = []
      for tag, val in data:
!         if make_pretty:
              hr, tags, array = obj.GetNamesFromIDs( (tag,) )
              if type(array[0][1])==type(u''):
***************
*** 77,80 ****
--- 77,83 ----
              else:
                  name = mapiutil.GetPropTagName(tag)
+             # pretty value transformations
+             if PROP_TYPE(tag)==PT_ERROR:
+                 val = mapiutil.GetScodeString(val)
          else:
              name = tag
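The PT_ERROR check above depends on how a MAPI property tag is packed: the low 16 bits carry the property type and the high 16 bits the property ID. A minimal sketch of that decoding in plain Python follows. The function names and the scode_names lookup table are assumptions for illustration; the real script relies on PROP_TYPE and mapiutil.GetScodeString instead.

```python
PT_ERROR = 0x000A            # property type meaning "the value is an error code"
PROP_TYPE_MASK = 0x0000FFFF  # low word of a property tag holds the type

def prop_type(tag):
    """Return the property type stored in the low 16 bits of a tag."""
    return tag & PROP_TYPE_MASK

def prop_id(tag):
    """Return the property ID stored in the high 16 bits of a tag."""
    return (tag >> 16) & 0xFFFF

def pretty_value(tag, val, scode_names=None):
    """Replace an error-typed value with a readable form.

    scode_names is a hypothetical {code: name} table standing in for
    mapiutil.GetScodeString; unknown codes are shown in hex.
    """
    if prop_type(tag) == PT_ERROR:
        code = val & 0xFFFFFFFF  # COM hands back signed 32-bit values
        return (scode_names or {}).get(code, "0x%08X" % code)
    return val
```

For example, a tag of 0x0037000A with the value -2147221233 is rendered as "0x8004010F", or as whatever name the supplied table gives that code.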


From mhammond@users.sourceforge.net  Mon Nov  4 00:50:11 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:50:11 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.31,1.32
Message-ID: <E188VRT-0007ge-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv29458

Modified Files:
	manager.py 
Log Message:
Wipe outlook reference as we die.


Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.31
retrieving revision 1.32
diff -C2 -d -r1.31 -r1.32
*** manager.py	1 Nov 2002 14:35:05 -0000	1.31
--- manager.py	4 Nov 2002 00:50:09 -0000	1.32
***************
*** 239,242 ****
--- 239,243 ----
              self.message_store.Close()
              self.message_store = None
+         self.outlook = None
  
      def score(self, msg, evidence=False, scale=True):
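The one-line change above follows a common shutdown pattern: Close() drops every reference the object holds so that no cycle keeps the host application alive. A minimal sketch of the pattern, with hypothetical Store and Manager classes standing in for the real ones:

```python
class Store(object):
    """Hypothetical resource with an explicit Close()."""

    def __init__(self):
        self.closed = False

    def Close(self):
        self.closed = True


class Manager(object):
    """Hypothetical owner that holds a host object and a store."""

    def __init__(self, outlook, store):
        self.outlook = outlook
        self.message_store = store

    def Close(self):
        # Close owned resources first, then drop every reference so the
        # host object is not kept alive through this manager.
        if self.message_store is not None:
            self.message_store.Close()
            self.message_store = None
        self.outlook = None
```

Guarding on None makes Close() safe to call more than once, which matters when several shutdown paths can reach it.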


From mhammond@users.sourceforge.net  Mon Nov  4 00:50:26 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:50:26 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/images - New directory
Message-ID: <E188VRi-0007hi-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/images
In directory usw-pr-cvs1:/tmp/cvs-serv29597/images

Log Message:
Directory /cvsroot/spambayes/spambayes/Outlook2000/images added to the repository


From mhammond@users.sourceforge.net  Mon Nov  4 00:51:18 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:51:18 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/images delete_as_spam.bmp,NONE,1.1
 recover_ham.bmp,NONE,1.1
Message-ID: <E188VSY-0007lU-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/images
In directory usw-pr-cvs1:/tmp/cvs-serv29827

Added Files:
	delete_as_spam.bmp recover_ham.bmp 
Log Message:
Some button images :)


--- NEW FILE: delete_as_spam.bmp ---
(This appears to be a binary file; contents omitted.)

--- NEW FILE: recover_ham.bmp ---
(This appears to be a binary file; contents omitted.)


From mhammond@users.sourceforge.net  Mon Nov  4 00:52:12 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:52:12 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.24,1.25
Message-ID: <E188VTQ-0007pZ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv29880

Modified Files:
	addin.py 
Log Message:
New "Delete As Spam" button, complete with button image, and the
button changes appearance and behaviour when one of the spam
folders is selected.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** addin.py	1 Nov 2002 23:54:03 -0000	1.24
--- addin.py	4 Nov 2002 00:52:10 -0000	1.25
***************
*** 1,5 ****
  # SpamBayes Outlook Addin
  
! import sys
  import warnings
  
--- 1,5 ----
  # SpamBayes Outlook Addin
  
! import sys, os
  import warnings
  
***************
*** 16,19 ****
--- 16,21 ----
  import win32ui
  
+ import win32gui, win32con, win32clipboard # for button images!
+ 
  # If we are not running in a console, redirect all print statements to the
  # win32traceutil collector.
***************
*** 28,38 ****
  
  
! # A lovely big block that attempts to catch the most common errors - COM objects not installed.
  try:
!     # Support for COM objects we use.
      gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0, bForDemand=True) # Outlook 9
      gencache.EnsureModule('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1, bForDemand=True) # Office 9
  
!     # The TLB defiining the interfaces we implement
      universal.RegisterInterfaces('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0, ["_IDTExtensibility2"])
  except pythoncom.com_error, (hr, msg, exc, arg):
--- 30,40 ----
  
  
! # Attempt to catch the most common errors - COM objects not installed.
  try:
!     # Generate support so we get complete support including events
      gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0, bForDemand=True) # Outlook 9
      gencache.EnsureModule('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1, bForDemand=True) # Office 9
  
!     # Register what vtable based interfaces we need to implement.
      universal.RegisterInterfaces('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0, ["_IDTExtensibility2"])
  except pythoncom.com_error, (hr, msg, exc, arg):
***************
*** 46,76 ****
      if exc:
          print "Exception: %s" % (exc)
!     print "Sorry, I can't be more help, but I can't continue while I have this error."
      sys.exit(1)
  
! # Something that should be in win32com in some form or another.
  def CastToClone(ob, target):
      """'Cast' a COM object to another type"""
-     # todo - should support target being an IID
      if hasattr(target, "index"): # string like
      # for now, we assume makepy for this to work.
          if not ob.__class__.__dict__.has_key("CLSID"):
-             # Eeek - no makepy support - try and build it.
              ob = gencache.EnsureDispatch(ob)
          if not ob.__class__.__dict__.has_key("CLSID"):
              raise ValueError, "Must be a makepy-able object for this to work"
          clsid = ob.CLSID
-         # Lots of hoops to support "demand-build" - ie, generating
-         # code for an interface first time it is used.  We assume the
-         # interface name exists in the same library as the object.
-         # This is generally the case - only referenced typelibs may be
-         # a problem, and we can handle that later.  Maybe <wink>
-         # So get the generated module for the library itself, then
-         # find the interface CLSID there.
          mod = gencache.GetModuleForCLSID(clsid)
-         # Get the 'root' module.
          mod = gencache.GetModuleForTypelib(mod.CLSID, mod.LCID,
                                             mod.MajorVersion, mod.MinorVersion)
-         # Find the CLSID of the target
          # XXX - should not be looking in VTables..., but no general map currently exists
          # (Fixed in win32all!)
--- 48,69 ----
      if exc:
          print "Exception: %s" % (exc)
!     print "Sorry I can't be more help, but I can't continue while I have this error."
      sys.exit(1)
  
! # A couple of functions that are in new win32all, but we don't want to
! # force people to upgrade if we can avoid it.
! # NOTE: Most docstrings and comments removed - see the win32all version
  def CastToClone(ob, target):
      """'Cast' a COM object to another type"""
      if hasattr(target, "index"): # string like
      # for now, we assume makepy for this to work.
          if not ob.__class__.__dict__.has_key("CLSID"):
              ob = gencache.EnsureDispatch(ob)
          if not ob.__class__.__dict__.has_key("CLSID"):
              raise ValueError, "Must be a makepy-able object for this to work"
          clsid = ob.CLSID
          mod = gencache.GetModuleForCLSID(clsid)
          mod = gencache.GetModuleForTypelib(mod.CLSID, mod.LCID,
                                             mod.MajorVersion, mod.MinorVersion)
          # XXX - should not be looking in VTables..., but no general map currently exists
          # (Fixed in win32all!)
***************
*** 81,85 ****
          mod = gencache.GetModuleForCLSID(target_clsid)
          target_class = getattr(mod, target)
-         # resolve coclass to interface
          target_class = getattr(target_class, "default_interface", target_class)
          return target_class(ob) # auto QI magic happens
--- 74,77 ----
***************
*** 90,93 ****
--- 82,118 ----
      CastTo = CastToClone
  
+ # Something else in later win32alls - like "DispatchWithEvents", but the
+ # returned object is not both the Dispatch *and* the event handler
+ def WithEventsClone(clsid, user_event_class):
+     clsid = getattr(clsid, "_oleobj_", clsid)
+     disp = Dispatch(clsid)
+     if not disp.__dict__.get("CLSID"): # Eeek - no makepy support - try and build it.
+         try:
+             ti = disp._oleobj_.GetTypeInfo()
+             disp_clsid = ti.GetTypeAttr()[0]
+             tlb, index = ti.GetContainingTypeLib()
+             tla = tlb.GetLibAttr()
+             mod = gencache.EnsureModule(tla[0], tla[1], tla[3], tla[4])
+             disp_class = gencache.GetClassForProgID(str(disp_clsid))
+         except pythoncom.com_error:
+             raise TypeError, "This COM object can not automate the makepy process - please run makepy manually for this object"
+     else:
+         disp_class = disp.__class__
+     clsid = disp_class.CLSID
+     import new
+     events_class = getevents(clsid)
+     if events_class is None:
+         raise ValueError, "This COM object does not support events."
+     result_class = new.classobj("COMEventClass", (events_class, user_event_class), {})
+     instance = result_class(disp) # This only calls the first base class __init__.
+     if hasattr(user_event_class, "__init__"):
+         user_event_class.__init__(instance)
+     return instance
+ 
+ try:
+     from win32com.client import WithEvents
+ except ImportError: # appears in 151 and later.
+     WithEvents = WithEventsClone
+ 
  # Whew - we seem to have all the COM support we need - let's rock!
  
***************
*** 97,101 ****
          self.handler = handler
          self.args = args
! 
      def OnClick(self, button, cancel):
          self.handler(*self.args)
--- 122,127 ----
          self.handler = handler
          self.args = args
!     def Close(self):
!         self.handler = self.args = None
      def OnClick(self, button, cancel):
          self.handler(*self.args)
***************
*** 107,110 ****
--- 133,138 ----
          self.manager = manager
          self.target = target
+     def Close(self):
+         self.application = self.manager = self.target = None
  
  class FolderItemsEvent(_BaseItemsEvent):
***************
*** 172,195 ****
                  assert train.been_trained_as_spam(msgstore_message, self.manager)
  
  def ShowClues(mgr, app):
      from cgi import escape
  
!     sel = app.ActiveExplorer().Selection
!     if sel.Count == 0:
!         win32ui.MessageBox("No items are selected", "No selection")
!         return
!     if sel.Count > 1:
!         win32ui.MessageBox("Please select a single item", "Large selection")
!         return
! 
!     item = sel.Item(1)
!     if item.Class != constants.olMail:
!         win32ui.MessageBox("This function can only be performed on mail items",
!                            "Not a mail message")
          return
! 
!     msgstore_message = mgr.message_store.GetMessage(item)
      score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
      new_msg = app.CreateItem(0)
      body = ["<h2>Spam Score: %g</h2><br>" % score]
      push = body.append
--- 200,217 ----
                  assert train.been_trained_as_spam(msgstore_message, self.manager)
  
+ # Event function fired from the "Show Clues" UI items.
  def ShowClues(mgr, app):
      from cgi import escape
  
!     msgstore_message = mgr.addin.GetSelectedMessages(False)
!     if msgstore_message is None:
          return
!     item = msgstore_message.GetOutlookItem()
      score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
      new_msg = app.CreateItem(0)
+     # NOTE: Silly Outlook always switches the message editor back to RTF
+     # once the Body property has been set.  Thus, there is no reasonable
+     # way to get this as text only.  Next best then is to use HTML, 'cos at
+     # least we know how to exploit it!
      body = ["<h2>Spam Score: %g</h2><br>" % score]
      push = body.append
***************
*** 210,215 ****
  
      new_msg.Subject = "Spam Clues: " + item.Subject
!     # Stupid outlook always switches to RTF :( Work-around
! ##    new_msg.Body = body
      new_msg.HTMLBody = "<HTML><BODY>" + body + "</BODY></HTML>"
      # Attach the source message to it
--- 232,236 ----
  
      new_msg.Subject = "Spam Clues: " + item.Subject
!     # As above, use HTMLBody else Outlook refuses to behave.
      new_msg.HTMLBody = "<HTML><BODY>" + body + "</BODY></HTML>"
      # Attach the source message to it
***************
*** 218,221 ****
--- 239,359 ----
      new_msg.Display()
  
+ # The "Delete As Spam" and "Recover Spam" button
+ # The event from Outlook's explorer that our folder has changed.
+ class ButtonDeleteAsExplorerEvent:
+     def Init(self, but):
+         self.but = but
+     def Close(self):
+         self.but = None
+     def OnFolderSwitch(self):
+         self.but._UpdateForFolderChange()
+ 
+ class ButtonDeleteAsEvent:
+     def Init(self, manager, application, explorer):
+         # NOTE - keeping a reference to 'explorer' in this event
+         # appears to cause an Outlook circular reference, and outlook
+         # never terminates (it does close, but the process remains alive)
+         # This is why we needed to use WithEvents, so the event class
+         # itself doesn't keep such a reference (and we need to keep a ref
+         # to the event class so it doesn't auto-disconnect!)
+         self.manager = manager
+         self.application = application
+         self.explorer_events = WithEvents(explorer,
+                                            ButtonDeleteAsExplorerEvent)
+         self.set_for_as_spam = None
+         self.explorer_events.Init(self)
+         self._UpdateForFolderChange()
+ 
+     def Close(self):
+         self.manager = self.application = self.explorer = None
+ 
+     def _UpdateForFolderChange(self):
+         explorer = self.application.ActiveExplorer()
+         if explorer is None:
+             print "** Folder Change, but don't have an explorer"
+             return
+         outlook_folder = explorer.CurrentFolder
+         is_spam = False
+         if outlook_folder is not None:
+             mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
+             look_id = self.manager.config.filter.spam_folder_id
+             if look_id:
+                 look_folder = self.manager.message_store.GetFolder(look_id)
+                 if mapi_folder == look_folder:
+                     is_spam = True
+             if not is_spam:
+                 look_id = self.manager.config.filter.unsure_folder_id
+                 if look_id:
+                     look_folder = self.manager.message_store.GetFolder(look_id)
+                     if mapi_folder == look_folder:
+                         is_spam = True
+         if is_spam:
+             set_for_as_spam = False
+         else:
+             set_for_as_spam = True
+         if set_for_as_spam != self.set_for_as_spam:
+             if set_for_as_spam:
+                 image = "delete_as_spam.bmp"
+                 self.Caption = "Delete As Spam"
+                 self.TooltipText = \
+                         "Move the selected message to the Spam folder,\n" \
+                         "and train the system that this is Spam."
+             else:
+                 image = "recover_ham.bmp"
+                 self.Caption = "Recover from Spam"
+                 self.TooltipText = \
+                         "Recovers the selected item back to the folder\n" \
+                         "it was filtered from (or to the Inbox if this\n" \
+                         "folder is not known), and trains the system that\n" \
+                         "this is a good message\n"
+             # Set the image.
+             print "Setting image to", image
+             SetButtonImage(self, image)
+             self.set_for_as_spam = set_for_as_spam
+ 
+     def OnClick(self, button, cancel):
+         msgstore = self.manager.message_store
+         msgstore_messages = self.manager.addin.GetSelectedMessages(True)
+         if not msgstore_messages:
+             return
+         if self.set_for_as_spam:
+             # Delete this item as spam.
+             spam_folder_id = self.manager.config.filter.spam_folder_id
+             spam_folder = msgstore.GetFolder(spam_folder_id)
+             if not spam_folder:
+                 win32ui.MessageBox("You must configure the Spam folder",
+                                    "Invalid Configuration")
+                 return
+             import train
+             for msgstore_message in msgstore_messages:
+                 # Must train before moving, else we lose the message!
+                 print "Training on message - ",
+                 if train.train_message(msgstore_message, True, self.manager):
+                     print "trained as spam"
+                 else:
+                     print "already was trained as spam"
+                 # Now move it.
+                 msgstore_message.MoveTo(spam_folder)
+         else:
+             win32ui.MessageBox("Please be patient <wink>")
+ 
+ # Helpers to work with images on buttons/toolbars.
+ def SetButtonImage(button, fname):
+     # whew - http://support.microsoft.com/default.aspx?scid=KB;EN-US;q288771
+     # shows how to make a transparent bmp.
+     # Also note that the clipboard takes ownership of the handle -
+     # this, we can not simply perform this load once and reuse the image.
+     if not os.path.isabs(fname):
+         fname = os.path.join( os.path.dirname(__file__), "images", fname)
+     if not os.path.isfile(fname):
+         print "WARNING - Trying to use image '%s', but it doesn't exist" % (fname,)
+         return None
+     handle = win32gui.LoadImage(0, fname, win32con.IMAGE_BITMAP, 0, 0, win32con.LR_DEFAULTSIZE | win32con.LR_LOADFROMFILE)
+     win32clipboard.OpenClipboard()
+     win32clipboard.SetClipboardData(win32con.CF_BITMAP, handle)
+     win32clipboard.CloseClipboard()
+     button.Style = constants.msoButtonIconAndCaption
+     button.PasteFace()
+ 
  # The outlook Plugin COM object itself.
  class OutlookAddin:
***************
*** 247,250 ****
--- 385,396 ----
              bars = activeExplorer.CommandBars
              toolbar = bars.Item("Standard")
+             # Add our "Delete as ..." button
+             button = toolbar.Controls.Add(Type=constants.msoControlButton, Temporary=True)
+             # Hook events for the item
+             button.BeginGroup = True
+             button = DispatchWithEvents(button, ButtonDeleteAsEvent)
+             button.Init(self.manager, application, activeExplorer)
+             self.buttons.append(button)
+ 
              # Add a pop-up menu to the toolbar
              popup = toolbar.Controls.Add(Type=constants.msoControlPopup, Temporary=True)
***************
*** 323,326 ****
--- 469,494 ----
          return new_hooks
  
+     def GetSelectedMessages(self, allow_multi = True, explorer = None):
+         if explorer is None:
+             explorer = self.application.ActiveExplorer()
+         sel = explorer.Selection
+         if sel.Count > 1 and not allow_multi:
+             win32ui.MessageBox("Please select a single item", "Large selection")
+             return None
+ 
+         ret = []
+         for i in range(sel.Count):
+             item = sel.Item(i+1)
+             if item.Class == constants.olMail:
+                 msgstore_message = self.manager.message_store.GetMessage(item)
+                 ret.append(msgstore_message)
+ 
+         if len(ret) == 0:
+             win32ui.MessageBox("No mail items are selected", "No selection")
+             return None
+         if allow_multi:
+             return ret
+         return ret[0]
+ 
      def OnDisconnection(self, mode, custom):
          print "SpamAddin - Disconnecting from Outlook"
***************
*** 331,336 ****
              self.manager.Close()
              self.manager = None
!         self.buttons = None
! 
          print "Addin terminating: %d COM client and %d COM servers exist." \
                % (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount())
--- 499,506 ----
              self.manager.Close()
              self.manager = None
!         if self.buttons:
!             for button in self.buttons:
!                 button.Close()
!             self.buttons = None
          print "Addin terminating: %d COM client and %d COM servers exist." \
                % (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount())
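The core trick in WithEventsClone above is to build a class at run time that inherits from both the generated event-sink class and the user's handler class, then to run the user's __init__ by hand because only the first base's __init__ is invoked. The sketch below shows that technique without COM. It uses type() because the `new` module no longer exists in Python 3, and every name in it (with_events, FakeEventSource, FakeEventsBase, FolderSwitchCounter) is hypothetical.

```python
class FakeEventSource(object):
    """Hypothetical stand-in for a COM object that fires events."""

    def __init__(self):
        self.sinks = []

    def fire(self, name, *args):
        for sink in self.sinks:
            getattr(sink, name)(*args)


class FakeEventsBase(object):
    """Hypothetical stand-in for the generated event-sink class."""

    def __init__(self, source):
        # Register with the source but keep no reference back to it.
        source.sinks.append(self)


def with_events(source, events_base, user_event_class):
    """Return a handler that is both an event sink and a user object."""
    result_class = type("EventClass", (events_base, user_event_class), {})
    # Only the first base's __init__ runs here, since it does not
    # chain to the next class with super().
    instance = result_class(source)
    if "__init__" in vars(user_event_class):
        user_event_class.__init__(instance)
    return instance


class FolderSwitchCounter(object):
    def __init__(self):
        self.count = 0

    def OnFolderSwitch(self):
        self.count += 1
```

The returned handler receives events but holds no attribute pointing at the source, which is the property the comment in ButtonDeleteAsEvent.Init relies on to avoid a circular reference.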

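The selection rules implemented by GetSelectedMessages in the diff above can be separated from the Outlook objects. In this sketch the function name, the is_mail predicate, and the (result, error) return convention are assumptions; the real method shows a message box and returns None on failure.

```python
def select_messages(items, allow_multi=True, is_mail=lambda item: True):
    """Apply the selection rules to a plain list of items.

    Returns (result, error). result is a list when allow_multi is true,
    a single item when it is false, and None when error is set.
    """
    if len(items) > 1 and not allow_multi:
        return None, "Please select a single item"
    # Silently skip anything that is not a mail item.
    mail = [item for item in items if is_mail(item)]
    if not mail:
        return None, "No mail items are selected"
    if allow_multi:
        return mail, None
    return mail[0], None
```

The return type depends on allow_multi, which is why ShowClues can pass False and treat the result as one message while the button's OnClick passes True and iterates.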

From mhammond@users.sourceforge.net  Mon Nov  4 01:12:56 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 17:12:56 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
Message-ID: <E188VnU-0000fp-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv2046

Modified Files:
	train.py 
Log Message:
Fix the root of my:
  File "F:\src\spambayes\classifier.py", line 450, in _getclues
    distance = abs(prob - 0.5)

Exception - problem is that we trained, but didn't update probabilities -
thus, we failed for every new word seen only since the last complete
retrain.

There may be a case for _getclues() to detect a probability of None
and call update_probabilities() automatically - either that or just
keep throwing vague exceptions <wink>


Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** train.py	31 Oct 2002 22:03:35 -0000	1.12
--- train.py	4 Nov 2002 01:12:53 -0000	1.13
***************
*** 19,23 ****
      return spam == True
  
! def train_message(msg, is_spam, mgr):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
--- 19,23 ----
      return spam == True
  
! def train_message(msg, is_spam, mgr, update_probs = True):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
***************
*** 41,44 ****
--- 41,47 ----
      mgr.bayes.learn(tokens, is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
+     if update_probs:
+         mgr.bayes.update_probabilities()
+ 
      mgr.bayes_dirty = True
      return True
***************
*** 51,55 ****
          progress.tick()
          try:
!             if train_message(message, isspam, mgr):
                  num_added += 1
          except:
--- 54,58 ----
          progress.tick()
          try:
!             if train_message(message, isspam, mgr, False):
                  num_added += 1
          except:
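The failure described in the log message, and the reason for the new update_probs flag, can be reproduced with a toy model. ToyClassifier below is a hypothetical stand-in whose probability is a naive spam ratio, not the real Spambayes calculation. The single update at the end of train_folder is also an assumption, since that part of train.py is not shown in the diff.

```python
class ToyClassifier(object):
    """Hypothetical classifier that tracks only the bookkeeping."""

    def __init__(self):
        self.counts = {}   # word -> [spam count, ham count]
        self.probs = {}    # word -> probability, None while stale
        self.updates = 0

    def learn(self, tokens, is_spam):
        for word in tokens:
            record = self.counts.setdefault(word, [0, 0])
            record[0 if is_spam else 1] += 1
            self.probs[word] = None  # stale until recomputed

    def update_probabilities(self):
        self.updates += 1
        for word, (spam, ham) in self.counts.items():
            # Naive ratio, for illustration only.
            self.probs[word] = spam / float(spam + ham)


def train_message(classifier, tokens, is_spam, update_probs=True):
    """Train one message; recompute probabilities unless told not to."""
    classifier.learn(tokens, is_spam)
    if update_probs:
        classifier.update_probabilities()


def train_folder(classifier, messages, is_spam):
    """Train a batch, recomputing probabilities once at the end."""
    for tokens in messages:
        train_message(classifier, tokens, is_spam, update_probs=False)
    classifier.update_probabilities()
```

If update_probabilities() is skipped after learn(), any newly seen word keeps a probability of None, and an expression like abs(prob - 0.5) fails. Defaulting the flag to True protects one-off callers, while the batch path avoids a full recomputation per message.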


From jhylton@users.sourceforge.net  Mon Nov  4 04:36:01 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:36:01 -0800
Subject: [Spambayes-checkins] spambayes/pspam - New directory
Message-ID: <E188Yy1-00050t-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv19246/pspam

Log Message:
Directory /cvsroot/spambayes/spambayes/pspam added to the repository


From jhylton@users.sourceforge.net  Mon Nov  4 04:42:44 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:42:44 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam - New directory
Message-ID: <E188Z4W-0005Vw-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv21182/pspam/pspam

Log Message:
Directory /cvsroot/spambayes/spambayes/pspam/pspam added to the repository


From jhylton@users.sourceforge.net  Mon Nov  4 04:44:22 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:44:22 -0800
Subject: [Spambayes-checkins] 
 spambayes/pspam/pspam __init__.py,NONE,1.1 database.py,NONE,1.1
 folder.py,NONE,1.1 message.py,NONE,1.1 options.py,NONE,1.1
 profile.py,NONE,1.1
Message-ID: <E188Z66-0005c1-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv21558/pspam/pspam

Added Files:
	__init__.py database.py folder.py message.py options.py 
	profile.py 
Log Message:
Initial checkin of pspam code.


--- NEW FILE: __init__.py ---
"""Package for interacting with VM folders.

Design notes go here.

Use ZODB to store training data and classifier.

The spam and ham data are culled from sets of folders.  The actual
tokenized messages are stored in a training database.  When the folder
changes, the training data is updated.

- Updates are incremental.
- Changes to a folder are detected based on mtime and folder size.
- The contents of the folder are keyed on message-id.
- If a message is removed from a folder, it is removed from training data.
"""

--- NEW FILE: database.py ---
from pspam.options import options

import ZODB
from ZEO.ClientStorage import ClientStorage
import zLOG

import os

def logging():
    os.environ["STUPID_LOG_FILE"] = options.event_log_file
    os.environ["STUPID_LOG_SEVERITY"] = str(options.event_log_severity)
    zLOG.initialize()

def open():
    cs = ClientStorage(options.zeo_addr)
    db = ZODB.DB(cs, cache_size=options.cache_size)
    return db

--- NEW FILE: folder.py ---
import ZODB
from Persistence import Persistent
from BTrees.OOBTree import OOBTree, OOSet, difference

import email
import mailbox
import os
import stat

from pspam.message import PMessage

def factory(fp):
    try:
        return email.message_from_file(fp, PMessage)
    except email.Errors.MessageError, msg:
        print msg
        return PMessage()

class Folder(Persistent):

    def __init__(self, path):
        self.path = path
        self.mtime = 0
        self.size = 0
        self.messages = OOBTree()

    def _stat(self):
        t = os.stat(self.path)
        self.mtime = t[stat.ST_MTIME]
        self.size = t[stat.ST_SIZE]

    def changed(self):
        t = os.stat(self.path)
        if (t[stat.ST_MTIME] != self.mtime
            or t[stat.ST_SIZE] != self.size):
            return True
        else:
            return False

    def read(self):
        """Return messages added and removed from folder.

        Two sets of message objects are returned.  The first set is
        messages that were added to the folder since the last read.
        The second set is the messages that were removed from the
        folder since the last read.

        The code assumes messages are added and removed but not edited.
        """
        mbox = mailbox.UnixMailbox(open(self.path, "rb"), factory)
        self._stat()
        cur = OOSet()
        new = OOSet()
        while 1:
            msg = mbox.next()
            if msg is None:
                break
            msgid = msg["message-id"]
            cur.insert(msgid)
            if not self.messages.has_key(msgid):
                self.messages[msgid] = msg
                new.insert(msg)
                
        removed = difference(self.messages, cur)
        for msgid in removed.keys():
            del self.messages[msgid]

        # XXX perhaps just return the OOBTree for removed?
        return new, OOSet(removed.values())

if __name__ == "__main__":
    f = Folder("/home/jeremy/Mail/INBOX")

--- NEW FILE: message.py ---
import ZODB
from Persistence import Persistent
from email.Message import Message

class PMessage(Message, Persistent):

    def __hash__(self):
        return id(self)

--- NEW FILE: options.py ---
from Options import options, all_options, \
     boolean_cracker, float_cracker, int_cracker, string_cracker
from sets import Set     

all_options["Score"] = {'max_ham': float_cracker,
                        'min_spam': float_cracker,
                        }

all_options["Train"] = {'folder_dir': string_cracker,
                        'spam_folders': ('get', lambda s: Set(s.split())),
                        'ham_folders': ('get', lambda s: Set(s.split())),
                        }

all_options["Proxy"] = {'server': string_cracker,
                        'server_port': int_cracker,
                        'proxy_port': int_cracker,
                        'log_pop_session': boolean_cracker,
                        'log_pop_session_file': string_cracker,
                        }

all_options["ZODB"] = {'zeo_addr': string_cracker,
                       'event_log_file': string_cracker,
                       'event_log_severity': int_cracker,
                       'cache_size': int_cracker,
                       }

import os
options.mergefiles("vmspam.ini")

def mergefile(p):
    options.mergefiles(p)

--- NEW FILE: profile.py ---
"""Spam/ham profile for a single VM user."""

import ZODB
from ZODB.PersistentList import PersistentList
from Persistence import Persistent
from BTrees.OOBTree import OOBTree

import classifier
from tokenizer import tokenize

from pspam.folder import Folder

import os

def open_folders(dir, names, klass):
    L = []
    for name in names:
        path = os.path.join(dir, name)
        L.append(klass(path))
    return L

import time
_start = None
def log(s):
    global _start
    if _start is None:
        _start = time.time()
    print round(time.time() - _start, 2), s


class IterOOBTree(OOBTree):

    def iteritems(self):
        return self.items()

class WordInfo(Persistent):

    def __init__(self, atime, spamprob=None):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __repr__(self):
        return "WordInfo%r" % repr((self.atime, self.spamcount,
                                    self.hamcount, self.killcount,
                                    self.spamprob))

class PBayes(classifier.Bayes, Persistent):

    WordInfoClass = WordInfo

    def __init__(self):
        classifier.Bayes.__init__(self)
        self.wordinfo = IterOOBTree()

    # XXX what about the getstate and setstate defined in base class

class Profile(Persistent):

    FolderClass = Folder

    def __init__(self, folder_dir):
        self._dir = folder_dir
        self.classifier = PBayes()
        self.hams = PersistentList()
        self.spams = PersistentList()

    def add_ham(self, folder):
        p = os.path.join(self._dir, folder)
        f = self.FolderClass(p)
        self.hams.append(f)

    def add_spam(self, folder):
        p = os.path.join(self._dir, folder)
        f = self.FolderClass(p)
        self.spams.append(f)

    def update(self):
        """Update classifier from current folder contents."""
        changed1 = self._update(self.hams, False)
        changed2 = self._update(self.spams, True)
        if changed1 or changed2:
            self.classifier.update_probabilities()
        get_transaction().commit()
        log("updated probabilities")
        
    def _update(self, folders, is_spam):
        changed = False
        for f in folders:
            log("update from %s" % f.path)
            added, removed = f.read()
            if added:
                log("added %d" % len(added))
            if removed:    
                log("removed %d" % len(removed))
            get_transaction().commit()
            if not (added or removed):
                continue
            changed = True

            # It's important not to commit a transaction until
            # after update_probabilities is called in update().
            # Otherwise some new entries will cause scoring to fail.
            for msg in added.keys():
                self.classifier.learn(tokenize(msg), is_spam, False)
            del added
            get_transaction().commit(1)
            log("learned")
            for msg in removed.keys():
                self.classifier.unlearn(tokenize(msg), is_spam, False)
            if removed: 
                log("unlearned")
            del removed
            get_transaction().commit(1)
        return changed
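
The incremental bookkeeping described in the pspam design notes (folder
contents keyed on message-id, removed messages dropped from the training
data) can be sketched with plain dicts.  This is only an illustration in
current Python 3, not the ZODB-backed code above; diff_folder and the
sample message-ids are hypothetical names made up for the example.

```python
def diff_folder(known, current):
    """Return (added, removed) dicts keyed on message-id.

    known   -- message-id -> message, as recorded at the last read
    current -- message-id -> message, as found in the folder now
    """
    added = {mid: msg for mid, msg in current.items() if mid not in known}
    removed = {mid: msg for mid, msg in known.items() if mid not in current}
    # Bring the stored view up to date, much as Folder.read() does.
    known.update(added)
    for mid in removed:
        del known[mid]
    return added, removed


# Hypothetical folder states: one message deleted, one message new.
known = {"<1@example>": "ham one", "<2@example>": "ham two"}
current = {"<2@example>": "ham two", "<3@example>": "ham three"}

added, removed = diff_folder(known, current)
print("learn:", sorted(added))
print("unlearn:", sorted(removed))
```

The caller would then train on the added messages and untrain on the
removed ones, which is the role Profile._update() plays above.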


From jhylton@users.sourceforge.net  Mon Nov  4 04:44:22 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:44:22 -0800
Subject: [Spambayes-checkins] spambayes/pspam README.txt,NONE,1.1
	pop.py,NONE,1.1 vmspam.ini,NONE,1.1 zeo.sh,NONE,1.1
Message-ID: <E188Z66-0005bz-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv21558/pspam

Added Files:
	README.txt pop.py scoremsg.py update.py vmspam.ini zeo.sh 
Log Message:
Initial checkin of pspam code.


--- NEW FILE: README.txt ---
pspam: persistent spambayes filtering system
--------------------------------------------

pspam uses a POP proxy to score incoming messages, a set of VM folders
to manage training data, and a ZODB database to manage data used by
the various applications.

The current code only works with a patched version of classifier.py.
To patch it, remove the object base class and change the class used to
create new WordInfo objects.

This directory contains:

pspam -- a Python package
pop.py -- a POP proxy based on SocketServer
scoremsg.py -- prints the evidence for a single message read from stdin
update.py -- a script to update training data from folders
vmspam.ini -- a sample configuration file
zeo.sh -- a script to start a ZEO server

The code depends on ZODB3, which you can download from
http://www.zope.org/Products/StandaloneZODB.
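
The sample vmspam.ini uses plain "key: value" sections, so its settings
can be inspected without the rest of the package.  The sketch below is an
illustration only: it uses the Python 3 standard library configparser
rather than the project's own Options machinery, and load_settings is a
hypothetical helper.  The values are copied from the sample file.

```python
import configparser

SAMPLE = """\
[Train]
folder_dir: /home/jeremy/Mail
spam_folders: train/spam
ham_folders: train/ham

[Score]
max_ham: 0.05
min_spam: 0.99
"""


def load_settings(text):
    """Parse the [Train] and [Score] sections into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "folder_dir": parser.get("Train", "folder_dir"),
        # Folder lists are whitespace-separated names.
        "spam_folders": set(parser.get("Train", "spam_folders").split()),
        "ham_folders": set(parser.get("Train", "ham_folders").split()),
        "max_ham": parser.getfloat("Score", "max_ham"),
        "min_spam": parser.getfloat("Score", "min_spam"),
    }


settings = load_settings(SAMPLE)
print(settings["folder_dir"], settings["max_ham"], settings["min_spam"])
```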


--- NEW FILE: pop.py ---
"""Spam-filtering proxy for a POP3 server.

The implementation uses the SocketServer module to run a
multi-threaded POP3 proxy.  It adds an X-Spambayes header with a spam
probability.  It scores a message using a persistent spambayes
classifier loaded from a ZEO server.

The strategy for adding spam headers is from Richie Hindle's
pop3proxy.py.  The STAT, LIST, RETR, and TOP commands are intercepted
to change the number of bytes the client is told to expect and/or to
insert the spam header.

XXX A POP3 server sometimes adds the number of bytes in the +OK
response to some commands when the POP3 spec doesn't require it to.
In those cases, the proxy does not re-write the number of bytes.  I
assume the clients won't be confused by this behavior, because they
shouldn't be expecting to see the number of bytes.

POP3 is documented in RFC 1939.
"""

import SocketServer
import asyncore
import cStringIO
import email
import re
import socket
import sys
import threading
import time

import ZODB
from ZEO.ClientStorage import ClientStorage
import zLOG

from tokenizer import tokenize
import pspam.database
from pspam.options import options

HEADER = "X-Spambayes: %5.3f\r\n"
HEADER_SIZE = len(HEADER % 0.0)

class POP3ProxyServer(SocketServer.ThreadingTCPServer):

    allow_reuse_address = True

    def __init__(self, addr, handler, classifier, real_server, log, zodb):
        SocketServer.ThreadingTCPServer.__init__(self, addr, handler)
        self.classifier = classifier
        self.pop_server = real_server
        self.log = log
        self.zodb = zodb

class LogWrapper:

    def __init__(self, log, file):
        self.log = log
        self.file = file

    def readline(self):
        line = self.file.readline()
        self.log.write(line)
        return line

    def write(self, buf):
        self.log.write(buf)
        return self.file.write(buf)

    def close(self):
        self.file.close()

class POP3RequestHandler(SocketServer.StreamRequestHandler):
    """Act as proxy between POP client and server."""

    def connect_pop(self):
        # connect to the pop server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(self.server.pop_server)
        self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb"))
        # the write side should be unbuffered
        self.pop_wfile = LogWrapper(self.server.log, s.makefile("wb", 0))

    def close_pop(self):
        self.pop_rfile.close()
        self.pop_wfile.close()

    def handle(self):
        zLOG.LOG("POP3", zLOG.INFO,
                 "Connection from %s" % repr(self.client_address))
        self.server.zodb.sync()
        self.sess_retr_count = 0
        self.connect_pop()
        try:
            self.handle_pop()
        finally:
            self.close_pop()
            if self.sess_retr_count == 1:
                ending = ""
            else:
                ending = "s"
            zLOG.LOG("POP3", zLOG.INFO,
                     "Ending session (%d message%s retrieved)"
                     % (self.sess_retr_count, ending))

    _multiline = {"RETR": True, "TOP": True,}
    _multiline_noargs = {"LIST": True, "UIDL": True,}

    def is_multiline(self, command, args):
        if command in self._multiline:
            return True
        if command in self._multiline_noargs and not args:
            return True
        return False

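    def parse_request(self, req):
        # POP3 keywords are case-insensitive (RFC 1939), so normalize
        # the command before dispatching on it.
        parts = req.split()
        if not parts:
            # an empty line from the client; nothing to dispatch on
            return "", ()
        req = parts[0].upper()
        args = tuple(parts[1:])
        return req, args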

    def handle_pop(self):
        # send the initial server hello
        hello = self.pop_rfile.readline()
        self.wfile.write(hello)

        # now get client requests and return server responses
        while 1:
            line = self.rfile.readline()
            if line == '':
                break
            self.pop_wfile.write(line)
            if not self.handle_pop_response(line):
                break

    def handle_pop_response(self, req):
        # Return True if connection is still open
        cmd, args = self.parse_request(req)
        multiline = self.is_multiline(cmd, args)
        firstline = self.pop_rfile.readline()
        zLOG.LOG("POP3", zLOG.DEBUG, "command %s multiline %s resp %s"
                 % (cmd, multiline, firstline.strip()))
        if multiline and firstline.startswith("+OK"):
            # Collect the entire response as one string.  An -ERR status
            # line has no body, so only +OK responses are read further.
            resp = cStringIO.StringIO()
            while 1:
                line = self.pop_rfile.readline()
                if not line:
                    # the server closed the connection mid-response
                    break
                resp.write(line)
                # The response is finished when we get the "." line.
                # XXX should handle byte-stuffed response
                if line == ".\r\n":
                    break
            buf = resp.getvalue()
        else:
            buf = None

        handler = getattr(self, "handle_%s" % cmd, None)
        if handler:
            firstline, buf = handler(cmd, args, firstline, buf)

        self.wfile.write(firstline)
        if buf is not None:
            self.wfile.write(buf)
        if cmd == "QUIT":
            return False
        else:
            return True

    def handle_RETR(self, cmd, args, firstline, resp):
        if not resp:
            return firstline, resp
        try:
            msg = email.message_from_string(resp)
        except email.Errors.MessageParseError, err:
            zLOG.LOG("POP3", zLOG.WARNING,
                     "Failed to parse msg: %s" % err, error=sys.exc_info())
            resp = self.message_parse_error(resp)
        else:
            self.score_msg(msg)
            resp = msg.as_string()

        self.sess_retr_count += 1
        return firstline, resp

    def handle_TOP(self, cmd, args, firstline, resp):
        # XXX Just handle TOP like RETR?
        return self.handle_RETR(cmd, args, firstline, resp)

    rx_STAT = re.compile(r"\+OK (\d+) (\d+)(.*)", re.DOTALL)

    def handle_STAT(self, cmd, args, firstline, resp):
        # STAT returns the number of messages and the total size.  The
        # proxy must add the size of new headers to the total size.
        # Example: +OK 3 340
        mo = self.rx_STAT.match(firstline)
        if mo is None:
            return firstline, resp
        count, size, extra = mo.group(1, 2, 3)
        count = int(count)
        size = int(size)
        size += count * HEADER_SIZE
        firstline = "+OK %d %d%s" % (count, size, extra)
        return firstline, resp

    rx_LIST = re.compile(r"\+OK (\d+) (\d+)(.*)", re.DOTALL)
    rx_LIST_2 = re.compile(r"(\d+) (\d+)(.*)", re.DOTALL)

    def handle_LIST(self, cmd, args, firstline, resp):
        # If there are no args, LIST returns size info for each message.
        # If there is an arg, LIST returns number and size for one message.
        mo = self.rx_LIST.match(firstline)
        if mo:
            # a single-line response
            n, size, extra = mo.group(1, 2, 3)
            size = int(size) + HEADER_SIZE
            firstline = "+OK %s %d%s" % (n, size, extra)
            return firstline, resp
        else:
            # possibly a multiline response
            if not firstline.startswith("+OK"):
                return firstline, resp
            # update each line of the response
            L = []
            for line in resp.split("\r\n"):
                if not line:
                    continue
                mo = self.rx_LIST_2.match(line)
                if not mo:
                    L.append(line)
                else:
                    n, size, extra = mo.group(1, 2, 3)
                    size = int(size) + HEADER_SIZE
                    L.append("%s %d%s" % (n, size, extra))
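            # The split above dropped the empty string after the final
            # ".", so put back the CRLF that terminates the response.
            return firstline, "\r\n".join(L) + "\r\n"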

    def message_parse_error(self, buf):
        # We got an error parsing the message.  We've already told the
        # client to expect more bytes than this buffer contains, but
        # there's no clean way to add the header.

        self.server.log.write("# error: %s\n" % repr(buf))

        # XXX what to do?  let's just add it after the first line
        score = self.server.classifier.spamprob(tokenize(buf))

        L = buf.split("\n")
        # HEADER already ends in "\r\n"; drop its "\n" so that the
        # join below does not insert a stray blank line.
        L.insert(1, (HEADER % score).rstrip("\n"))
        return "\n".join(L)

    def score_msg(self, msg):
        score = self.server.classifier.spamprob(tokenize(msg))
        msg.add_header("X-Spambayes", "%5.3f" % score)

def main():
    db = pspam.database.open()
    conn = db.open()
    r = conn.root()
    profile = r["profile"]

    log = open("/var/tmp/pop.log", "ab")
    print >> log, "+PROXY start", time.ctime()

    server = POP3ProxyServer(('', options.proxy_port),
                             POP3RequestHandler,
                             profile.classifier,
                             (options.server, options.server_port),
                             log,
                             conn,
                             )
    server.serve_forever()

if __name__ == "__main__":
    main()

--- NEW FILE: scoremsg.py ---
#! /usr/bin/env python
"""Score a message provided on stdin and show the evidence."""

import ZODB
from ZEO.ClientStorage import ClientStorage

from tokenizer import tokenize

import email
import sys

import pspam.options

def main(fp):
    cs = ClientStorage("/var/tmp/zeospam")
    db = ZODB.DB(cs)
    r = db.open().root()

    # make sure scoring uses the right set of options
    pspam.options.mergefile("/home/jeremy/src/vmspam/vmspam.ini")

    p = r["profile"]

    msg = email.message_from_file(fp)
    prob, evidence = p.classifier.spamprob(tokenize(msg), True)
    print "Score:", prob
    print
    print "Clues"
    print "-----"
    for clue, prob in evidence:
        print clue, prob
##    print
##    print msg
        
if __name__ == "__main__":
    main(sys.stdin)

--- NEW FILE: update.py ---
import getopt
import os
import sys

import ZODB
from ZEO.ClientStorage import ClientStorage

import pspam.database
from pspam.profile import Profile
from pspam.options import options

def folder_exists(L, p):
    """Return true if a folder with path p exists in list L."""
    for f in L:
        if f.path == p:
            return True
    return False

def main(rebuild=False):
    db = pspam.database.open()
    r = db.open().root()

    profile = r.get("profile")
    if profile is None or rebuild:
        # if there is no profile, create it
        profile = r["profile"] = Profile(options.folder_dir)
        get_transaction().commit()

    # check for new folders of training data
    for ham in options.ham_folders:
        p = os.path.join(options.folder_dir, ham)
        if not folder_exists(profile.hams, p):
            profile.add_ham(p)
    
    for spam in options.spam_folders:
        p = os.path.join(options.folder_dir, spam)
        if not folder_exists(profile.spams, p):
            profile.add_spam(p)
    get_transaction().commit()

    # read new messages from folders
    profile.update()
    get_transaction().commit()
    
    db.close()

if __name__ == "__main__":
    FORCE_REBUILD = False
    opts, args = getopt.getopt(sys.argv[1:], 'F')
    for k, v in opts:
        if k == '-F':
            FORCE_REBUILD = True
    
    main(FORCE_REBUILD)

--- NEW FILE: vmspam.ini ---
[Train]
folder_dir: /home/jeremy/Mail
spam_folders: train/spam
ham_folders: train/ham

[Score]
max_ham: 0.05
min_spam: 0.99

[Proxy]
server: mail.zope.com
server_port: 110
proxy_port: 1111
log_pop_session: true
log_pop_session_file: /var/tmp/pop.log

[ZODB]
zeo_addr: /var/tmp/zeospam
event_log_file: /var/tmp/zeospam.log
event_log_severity: 0
cache_size: 2000

--- NEW FILE: zeo.sh ---
#! /bin/bash

export STUPID_LOG_FILE=/var/tmp/zeospam.log
export LIBDIR=/usr/local/lib/python2.3/site-packages
python2.3 $LIBDIR/ZEO/start.py -U /var/tmp/zeospam /var/tmp/zeospam.fs
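
The size rewriting that pop.py performs for STAT and LIST can be tried in
isolation.  The sketch below is a standalone Python 3 illustration of the
same arithmetic, not the proxy itself; rewrite_stat and rewrite_list_body
are hypothetical helper names, while HEADER and the regular expressions
are taken from pop.py.

```python
import re

HEADER = "X-Spambayes: %5.3f\r\n"
HEADER_SIZE = len(HEADER % 0.0)   # 20 bytes per inserted header

RX_STAT = re.compile(r"\+OK (\d+) (\d+)(.*)", re.DOTALL)
RX_LIST_LINE = re.compile(r"(\d+) (\d+)(.*)", re.DOTALL)


def rewrite_stat(firstline):
    """Add one header's worth of bytes per message to a STAT reply."""
    mo = RX_STAT.match(firstline)
    if mo is None:
        # Not a well-formed +OK reply (for example -ERR): pass through.
        return firstline
    count, size, extra = mo.group(1, 2, 3)
    total = int(size) + int(count) * HEADER_SIZE
    return "+OK %s %d%s" % (count, total, extra)


def rewrite_list_body(body):
    """Add HEADER_SIZE to every 'n size' line of a multi-line LIST body."""
    out = []
    for line in body.split("\r\n"):
        if not line:
            continue
        mo = RX_LIST_LINE.match(line)
        if mo:
            n, size, extra = mo.group(1, 2, 3)
            line = "%s %d%s" % (n, int(size) + HEADER_SIZE, extra)
        out.append(line)
    # Every line, including the terminating ".", must end in CRLF.
    return "\r\n".join(out) + "\r\n"


print(repr(rewrite_stat("+OK 3 340\r\n")))
print(repr(rewrite_list_body("1 120\r\n2 200\r\n.\r\n")))
```

With three messages the STAT total grows by 3 * 20 = 60 bytes, so the
example reply "+OK 3 340" becomes "+OK 3 400".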


From tim.one@comcast.net  Mon Nov  4 05:03:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 00:03:05 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
In-Reply-To: <E188VnU-0000fp-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEOMCFAB.tim.one@comcast.net>

[Mark Hammond]
> Modified Files:
> 	train.py
> Log Message:
> Fix the root of my:
>   File "F:\src\spambayes\classifier.py", line 450, in _getclues
>     distance = abs(prob - 0.5)
>
> Exception - problem is that we trained, but didn't update probabilities -
> thus, we failed for every new word seen only since the last complete
> retrain.

Mark, I've never seen this, and believed I fixed the only way it could have
happened last week -- WordInfo records start life with a genuine probability
(spamprob) now, instead of with a spamprob of None.  It's possible, though,
that you had some leftover WordInfo record with None in your dict, and
didn't retrain from scratch after that fix.  Or it's possible there's an
entirely different bug I still don't know about.

> There may be a case for _getclues() to detect a probability of None
> and call update_probabilities() automatically - either that or just
> keep throwing vague exceptions <wink>

Except it should never be possible for _getclues() to see None -- if that
was still happening for you, there's a deeper bug that still needs to be
fixed.
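
To make the failure mode concrete, here is a minimal sketch in current
Python 3.  It is not the classifier's code: the WordInfo class is a
cut-down stand-in, distance() only mirrors the expression shown in the
traceback, and the 0.5 default is an assumed placeholder rather than the
value the project actually uses.

```python
UNKNOWN_WORD_PROB = 0.5   # assumed placeholder default


class WordInfo:
    def __init__(self, atime, spamprob=UNKNOWN_WORD_PROB):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob


def distance(record):
    # Same shape as the failing line: distance = abs(prob - 0.5)
    return abs(record.spamprob - 0.5)


fresh = WordInfo(atime=0)                  # starts with a real probability
stale = WordInfo(atime=0, spamprob=None)   # a leftover old-style record

print(distance(fresh))
try:
    distance(stale)
except TypeError as exc:
    print("stale record fails:", exc)
```

A record created with a genuine probability scores without trouble; only
a leftover record that still carries None trips the subtraction.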


In other news, here's a shallow bug, upon starting Outlook now:

Traceback (most recent call last):
  File "C:\PYTHON22\lib\site-packages\win32com\universal.py", line 150, in dispatch
    retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None)
  File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_
    return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider)
  File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
  File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Code\spambayes\Outlook2000\addin.py", line 392, in OnConnection
    button.Init(self.manager, application, activeExplorer)
  File "C:\Code\spambayes\Outlook2000\addin.py", line 262, in Init
    ButtonDeleteAsExplorerEvent)
  File "C:\Code\spambayes\Outlook2000\addin.py", line 103, in WithEventsClone
    events_class = getevents(clsid)
exceptions.NameError: global name 'getevents' is not defined

It can't have worked for you, either.  I fiddled my local copy to do

from win32com.client import constants, getevents

near the top, and that appears to have fixed it.  I'll check that in, but
please ensure that was the correct fix.


From tim_one@users.sourceforge.net  Mon Nov  4 05:03:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 03 Nov 2002 21:03:49 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.25,1.26
Message-ID: <E188ZOv-0006ly-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv25951/Outlook2000

Modified Files:
	addin.py 
Log Message:
Fix what appeared to be a missing import of win32com.client.getevents.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** addin.py	4 Nov 2002 00:52:10 -0000	1.25
--- addin.py	4 Nov 2002 05:03:47 -0000	1.26
***************
*** 13,17 ****
  import win32api
  import pythoncom
! from win32com.client import constants
  import win32ui
  
--- 13,17 ----
  import win32api
  import pythoncom
! from win32com.client import constants, getevents
  import win32ui
  

From anthonybaxter@users.sourceforge.net  Mon Nov  4 06:38:54 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 03 Nov 2002 22:38:54 -0800
Subject: [Spambayes-checkins] website developer.ht,1.3,1.4
Message-ID: <E188asw-0004BE-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16008

Modified Files:
	developer.ht 
Log Message:
added a "what needs to be done" section.


Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** developer.ht	22 Sep 2002 07:48:03 -0000	1.3
--- developer.ht	4 Nov 2002 06:38:52 -0000	1.4
***************
*** 27,30 ****
--- 27,38 ----
  available as links from the <a href="docs.html">documentation</a> page.
  
+ <h3>So what needs to be done</h3>
+ <p>Currently (early November) work is now being focussed on finding 
+ additional things that are beneficial to the tokenizer. The combining
+ scheme is now pretty solid and pretty amazing. The other big body of
+ work at the moment is producing something that's useful to end-users -
+ actually building the applications and the code so that Tim's sister
+ &lt;wink&gt; can use the system.</p>
+ 
  <h2>Collecting training data</h2>
  <p>One of the tricky problems is collecting a set of data that's 


From anthonybaxter@users.sourceforge.net  Mon Nov  4 06:39:44 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 03 Nov 2002 22:39:44 -0800
Subject: [Spambayes-checkins] website background.ht,1.1,1.2
Message-ID: <E188atk-0004Ex-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16178

Modified Files:
	background.ht 
Log Message:
A bit of a potted history here. I probably have a bunch of things here
that need to be cleaned up and made more obvious, but hey, it's a start.


Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** background.ht	19 Sep 2002 23:39:24 -0000	1.1
--- background.ht	4 Nov 2002 06:39:42 -0000	1.2
***************
*** 15,18 ****
--- 15,67 ----
  <p><i>more links? mail anthony at interlink.com.au</i></p>
  
+ <h2>Overall Approach</h2>
+ <b>Please note that I (Anthony) am writing this based on memory and
+ limited understanding of some of the subtler points of the maths. Gentle
+ corrections are welcome, or even encouraged.</b>
+ <h3>Tokenizing</h3>
+ <p>The architecture of the spambayes system has a couple of distinct 
+ parts. The first, and most obvious, is the <i>tokenizer</i>. This takes
+ a mail message and breaks it up into a series of tokens. At the moment
+ it splits words out of the text parts of a message; a variety of header
+ tokenization goes on as well. The code in tokenizer.py
+ and the comments in the Tokenizer section of Options.py contain more 
+ information about various approaches to tokenizing.</p>
+ 
+ <h3>Combining and Scoring</h3>
+ <p>The next part of the system is the scoring and combining part. This
+ is where the hairy mathematics and statistics come in. </p>
+ <p>Initially we started with Paul Graham's original combining scheme - 
+ this has a number of "magic numbers" and "fuzz factors" built into it. 
+ The Graham combining scheme has a number of problems, aside from the
+ magic in the internal fudge factors - it tends to produce scores of 
+ either 1 or 0, and there's a very small middle ground in between - it 
+ doesn't often claim to be "unsure", and gets it wrong because of this. 
+ There's a number of discussions back and forth between Tim Peters and 
+ Gary Robinson on this subject in the mailing list archives - I'll try 
+ and put links to the relevant threads at some point.</p>
+ <p>Gary produced a number of alternative approaches to combining and
+ scoring word probabilities. The initial one, after much back and forth
+ in the mailing list, is in the code today as 'gary_combining'. A couple
+ of other approaches, using the Central Limit Theorem, were also tried.
+ They produced interesting output - but histograms of the ham and spam
+ distributions had a disturbingly large overlap in the middle. There was
+ also an issue with incremental training and untraining of messages that
+ made it harder to use in the "real world". These two central limit 
+ approaches were dropped after Tim, Gary and Rob Hooft produced a combining
+ scheme using chi-squared probabilities. This is now the default combining
+ scheme. </p>
+ <p>The chi-squared approach produces two numbers - a "ham probability" ("*H*")
+ and a "spam probability" ("*S*"). A typical spam will have a high *S*
+ and low *H*, while a ham will have high *H* and low *S*. In the case where
+ the message looks entirely unlike anything the system's been trained on,
+ you can end up with a low *H* and low *S* - this is the code saying "I don't
+ know what this message is". So at the end of the processing, you end up 
+ with three possible results - "Spam", "Ham", or "Unsure". It's possible to
+ tweak the high and low cutoffs for the Unsure window - this trades off 
+ unsure messages vs possible false positives or negatives.</p>
+ 
+ <h3>Training</h3>
+ <p>TBD</p>
+ 
  <h2>Mailing list archives</h2>
  <p>There's a lot of background on what's been tried available from


From anthonybaxter@users.sourceforge.net  Mon Nov  4 09:58:02 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 04 Nov 2002 01:58:02 -0800
Subject: [Spambayes-checkins] website background.ht,1.2,1.3
Message-ID: <E188dze-0001lq-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv6694

Modified Files:
	background.ht 
Log Message:
addition from RobH about high *H* and high *S* meaning.


Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** background.ht	4 Nov 2002 06:39:42 -0000	1.2
--- background.ht	4 Nov 2002 09:57:59 -0000	1.3
***************
*** 56,60 ****
  the message looks entirely unlike anything the system's been trained on,
  you can end up with a low *H* and low *S* - this is the code saying "I don't
! know what this message is". So at the end of the processing, you end up 
  with three possible results - "Spam", "Ham", or "Unsure". It's possible to
  tweak the high and low cutoffs for the Unsure window - this trades off 
--- 56,66 ----
  the message looks entirely unlike anything the system's been trained on,
  you can end up with a low *H* and low *S* - this is the code saying "I don't
! know what this message is". 
! Some messages can even have both a high *H* and a high *S*, telling you 
! basically that the message looks very much like ham, but also very much 
! like spam. In this case spambayes is also unsure where the message 
! should be classified, and the final score will be near 0.5.</p>
! 
! <p>So at the end of the processing, you end up 
  with three possible results - "Spam", "Ham", or "Unsure". It's possible to
  tweak the high and low cutoffs for the Unsure window - this trades off 


From tim_one@users.sourceforge.net  Mon Nov  4 21:06:30 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 13:06:30 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.46,1.47
Message-ID: <E188oQY-0002Nz-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8400

Modified Files:
	classifier.py 
Log Message:
_add_msg():  Removed redundant store into wordinfo[word].

_remove_msg():  Added a store into wordinfo[word], which may be needed
if wordinfo is a persistent database, to let the persistence machinery
know that an internal field in the value associated *with* word changed.
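
A minimal sketch of why the extra store matters.  It uses the standard
library's shelve module as a stand-in for a persistent wordinfo mapping
(an assumption made for the example; the pspam code uses ZODB), plain
dicts in place of WordInfo records, and Python 3 syntax.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wordinfo")

with shelve.open(path) as wordinfo:
    wordinfo["free"] = {"spamcount": 0, "hamcount": 0}

    # Mutating the fetched record only changes an in-memory copy ...
    record = wordinfo["free"]
    record["spamcount"] += 1
    lost = wordinfo["free"]["spamcount"]

    # ... until the record is stored back under its key.
    wordinfo["free"] = record
    kept = wordinfo["free"]["spamcount"]

print("before store:", lost)
print("after store:", kept)
```

With an ordinary in-memory dict the store is redundant, because the dict
already holds a reference to the very object being mutated.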


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.46
retrieving revision 1.47
diff -C2 -d -r1.46 -r1.47
*** classifier.py	1 Nov 2002 16:01:14 -0000	1.46
--- classifier.py	4 Nov 2002 21:06:26 -0000	1.47
***************
*** 401,405 ****
              record = wordinfoget(word)
              if record is None:
!                 record = wordinfo[word] = WordInfo(now)
  
              if is_spam:
--- 401,405 ----
              record = wordinfoget(word)
              if record is None:
!                 record = WordInfo(now)
  
              if is_spam:
***************
*** 407,410 ****
--- 407,411 ----
              else:
                  record.hamcount += 1
+             # Needed to tell a persistent DB that the content changed.
              wordinfo[word] = record
  
***************
*** 419,423 ****
              self.nham -= 1
  
!         wordinfoget = self.wordinfo.get
          for word in Set(wordstream):
              record = wordinfoget(word)
--- 420,425 ----
              self.nham -= 1
  
!         wordinfo = self.wordinfo
!         wordinfoget = wordinfo.get
          for word in Set(wordstream):
              record = wordinfoget(word)
***************
*** 430,434 ****
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del self.wordinfo[word]
  
      def _getclues(self, wordstream):
--- 432,439 ----
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del wordinfo[word]
!                 else:
!                     # Needed to tell a persistent DB that the content changed.
!                     wordinfo[word] = record
  
      def _getclues(self, wordstream):
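
The log message above says the extra store into wordinfo[word] "may be
needed if wordinfo is a persistent database".  A minimal sketch of why,
using the standard shelve module as a stand-in for whatever database is
actually used (an assumption; the check-in does not name one).  The word
"viagra", the dict-based record and count_after_reopen() are all
hypothetical, and the code is written for Python 3 rather than the
Python 2 of the diffs.

```python
import os
import shelve
import tempfile


def count_after_reopen(store_back):
    """Bump a word's spamcount, optionally store the record back, then
    reopen the database and return the count that was actually saved."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "wordinfo")

        db = shelve.open(path)          # writeback defaults to False
        db["viagra"] = {"spamcount": 0, "hamcount": 0}

        record = db["viagra"]           # an unpickled *copy* of the value
        record["spamcount"] += 1        # mutates the copy only
        if store_back:
            # Needed to tell a persistent DB that the content changed.
            db["viagra"] = record
        db.close()

        db = shelve.open(path)
        saved = db["viagra"]["spamcount"]
        db.close()
        return saved
```

With a plain dict the lookup returns the stored object itself, so the
store is redundant there.  With a shelf each lookup unpickles a fresh
copy, so a change made to it is lost unless the record is assigned back.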


From jhylton@users.sourceforge.net  Mon Nov  4 21:25:56 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Mon, 04 Nov 2002 13:25:56 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.1,1.2
Message-ID: <E188ojM-0004hO-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv18044

Modified Files:
	profile.py 
Log Message:
Use the same default spamprob as regular classifier.


Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** profile.py	4 Nov 2002 04:44:20 -0000	1.1
--- profile.py	4 Nov 2002 21:25:54 -0000	1.2
***************
*** 10,13 ****
--- 10,14 ----
  
  from pspam.folder import Folder
+ from pspam.options import options
  
  import os
***************
*** 36,40 ****
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=None):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 37,41 ----
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0


From jhylton@users.sourceforge.net  Mon Nov  4 21:24:54 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Mon, 04 Nov 2002 13:24:54 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.47,1.48
Message-ID: <E188oiM-0004Ys-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17508

Modified Files:
	classifier.py 
Log Message:
Two changes to support pspam.

Make Bayes a classic class so that it can be mixed with
ExtensionClass.

Define Bayes.WordInfoClass so that a subclass can define a different
class to represent word info.


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.47
retrieving revision 1.48
diff -C2 -d -r1.47 -r1.48
*** classifier.py	4 Nov 2002 21:06:26 -0000	1.47
--- classifier.py	4 Nov 2002 21:24:52 -0000	1.48
***************
*** 80,84 ****
           self.spamprob) = t
  
! class Bayes(object):
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object.  There's
--- 80,84 ----
           self.spamprob) = t
  
! class Bayes:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object.  There's
***************
*** 92,95 ****
--- 92,98 ----
      #            )
  
+     # allow a subclass to use a different class for WordInfo
+     WordInfoClass = WordInfo
+ 
      def __init__(self):
          self.wordinfo = {}
***************
*** 401,405 ****
              record = wordinfoget(word)
              if record is None:
!                 record = WordInfo(now)
  
              if is_spam:
--- 404,408 ----
              record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass(now)
  
              if is_spam:
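
The WordInfoClass hook added above lets a subclass swap in its own record
class without overriding the training code.  A rough sketch of the pattern
follows.  The learn() method here is heavily simplified and is not the
real classifier signature, and TracedWordInfo and TracedBayes are
hypothetical names used only for illustration (Python 3 syntax).

```python
import time


class WordInfo:
    """Cut-down per-word record: access time plus two counters."""

    def __init__(self, atime):
        self.atime = atime
        self.spamcount = self.hamcount = 0


class Bayes:
    # Allow a subclass to use a different class for WordInfo.
    WordInfoClass = WordInfo

    def __init__(self):
        self.wordinfo = {}

    def learn(self, words, is_spam):
        """Simplified training loop; creates records via the hook."""
        now = time.time()
        for word in set(words):
            record = self.wordinfo.get(word)
            if record is None:
                record = self.WordInfoClass(now)
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
            self.wordinfo[word] = record


class TracedWordInfo(WordInfo):
    """Hypothetical replacement record that counts how many were made."""

    created = 0

    def __init__(self, atime):
        WordInfo.__init__(self, atime)
        TracedWordInfo.created += 1


class TracedBayes(Bayes):
    # The only change a subclass needs: point the hook elsewhere.
    WordInfoClass = TracedWordInfo
```

In pspam the replacement would presumably be the Persistent WordInfo from
profile.py; the counting subclass here only makes the effect visible.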


From mhammond@users.sourceforge.net  Mon Nov  4 22:19:36 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Mon, 04 Nov 2002 14:19:36 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.13,1.14
Message-ID: <E188pZI-0004gG-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv17976

Modified Files:
	train.py 
Log Message:
Roll-back my previous "update probs" change - Tim's fix would have fixed it had I done a complete retrain.  Done that now, and if I still need this Tim will sort it out once-and-for-all <wink>

Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** train.py	4 Nov 2002 01:12:53 -0000	1.13
--- train.py	4 Nov 2002 22:19:34 -0000	1.14
***************
*** 19,23 ****
      return spam == True
  
! def train_message(msg, is_spam, mgr, update_probs = True):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
--- 19,23 ----
      return spam == True
  
! def train_message(msg, is_spam, mgr):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
***************
*** 41,47 ****
      mgr.bayes.learn(tokens, is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
-     if update_probs:
-         mgr.bayes.update_probabilities()
- 
      mgr.bayes_dirty = True
      return True
--- 41,44 ----
***************
*** 54,58 ****
          progress.tick()
          try:
!             if train_message(message, isspam, mgr, False):
                  num_added += 1
          except:
--- 51,55 ----
          progress.tick()
          try:
!             if train_message(message, isspam, mgr):
                  num_added += 1
          except:


From mhammond@skippinet.com.au  Mon Nov  4 22:48:08 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 5 Nov 2002 09:48:08 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEOMCFAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGELPHIAA.mhammond@skippinet.com.au>

[Tim]
> In other news, here's a shallow bug, upon starting Outlook now:
...

> It can't have worked for you, either.

It can - my code took the "win32all has such a function" path.  Pity mine is
the only machine in the world taking that path <wink>

> I fiddled my local copy to do
>
> from win32com.client import constants, getevents
>
> near the top, and that appears to have fixed it.  I'll check that in, but
> please ensure that was the correct fix.

Just dandy - thanks!  pychecker can tell us when it is no longer necessary!

Mark.


From mhammond@users.sourceforge.net  Mon Nov  4 22:50:44 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Mon, 04 Nov 2002 14:50:44 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.26,1.27 train.py,1.14,1.15
Message-ID: <E188q3Q-000895-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv30899

Modified Files:
	addin.py train.py 
Log Message:
After incremental training on individual messages, they are also rescored
so that they appear in the ham/spam folder with the *new* post-training
score rather than their pre-training, presumably wrong score.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** addin.py	4 Nov 2002 05:03:47 -0000	1.26
--- addin.py	4 Nov 2002 22:50:41 -0000	1.27
***************
*** 159,163 ****
                  import train
                  print "Training on message '%s' - " % subject,
!                 if train.train_message(msgstore_message, False, self.manager):
                      print "trained as good"
                  else:
--- 159,163 ----
                  import train
                  print "Training on message '%s' - " % subject,
!                 if train.train_message(msgstore_message, False, self.manager, rescore = True):
                      print "trained as good"
                  else:
***************
*** 191,195 ****
                  subject = item.Subject.encode("mbcs", "replace")
                  print "Training on message '%s' - " % subject,
!                 if train.train_message(msgstore_message, True, self.manager):
                      print "trained as spam"
                  else:
--- 191,195 ----
                  subject = item.Subject.encode("mbcs", "replace")
                  print "Training on message '%s' - " % subject,
!                 if train.train_message(msgstore_message, True, self.manager, rescore = True):
                      print "trained as spam"
                  else:
***************
*** 329,333 ****
                  # Must train before moving, else we lose the message!
                  print "Training on message - ",
!                 if train.train_message(msgstore_message, True, self.manager):
                      print "trained as spam"
                  else:
--- 329,333 ----
                  # Must train before moving, else we lose the message!
                  print "Training on message - ",
!                 if train.train_message(msgstore_message, True, self.manager, rescore = True):
                      print "trained as spam"
                  else:

Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** train.py	4 Nov 2002 22:19:34 -0000	1.14
--- train.py	4 Nov 2002 22:50:41 -0000	1.15
***************
*** 19,27 ****
      return spam == True
  
! def train_message(msg, is_spam, mgr):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
      # untrained if it was in the wrong category), False if already
      # in the correct category.  Catch your own damn exceptions.
      from tokenizer import tokenize
      stream = msg.GetEmailPackageObject()
--- 19,29 ----
      return spam == True
  
! def train_message(msg, is_spam, mgr, rescore = False):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
      # untrained if it was in the wrong category), False if already
      # in the correct category.  Catch your own damn exceptions.
+     # If re-classified AND rescore = True, then a new score will
+     # be written to the message (so the user can see some effects)
      from tokenizer import tokenize
      stream = msg.GetEmailPackageObject()
***************
*** 42,45 ****
--- 44,52 ----
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
+     # Simplest way to rescore is to re-filter with all_actions = False
+     if rescore:
+         import filter
+         filter.filter_message(msg, mgr, all_actions = False)
+ 
      return True
  

From tim_one@users.sourceforge.net  Mon Nov  4 23:21:45 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:21:45 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.64,1.65
	tokenizer.py,1.60,1.61
Message-ID: <E188qXR-0003EZ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12377

Modified Files:
	Options.py tokenizer.py 
Log Message:
New option record_header_absence.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.64
retrieving revision 1.65
diff -C2 -d -r1.64 -r1.65
*** Options.py	3 Nov 2002 13:48:47 -0000	1.64
--- Options.py	4 Nov 2002 23:21:43 -0000	1.65
***************
*** 54,63 ****
  # very strong ham clue, but a bogus one.  In that case, set
  # count_all_header_lines to False, and adjust safe_headers instead.
- 
  count_all_header_lines: False
  
! # Like count_all_header_lines, but restricted to headers in this list.
! # safe_headers is ignored when count_all_header_lines is true.
  
  safe_headers: abuse-reports-to
      date
--- 54,68 ----
  # very strong ham clue, but a bogus one.  In that case, set
  # count_all_header_lines to False, and adjust safe_headers instead.
  count_all_header_lines: False
  
! # When True, generate a "noheader:HEADERNAME" token for each header in
! # safe_headers (below) that *doesn't* appear in the headers.  This helped
! # in various of Tim's python.org tests, but appeared to hurt a little in
! # Anthony Baxter's tests.
! record_header_absence: False
  
+ # Like count_all_header_lines, but restricted to headers in this list.
+ # safe_headers is ignored when count_all_header_lines is true, unless
+ # record_header_absence is also true.
  safe_headers: abuse-reports-to
      date
***************
*** 336,339 ****
--- 341,345 ----
                    'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
+                   'record_header_absence': boolean_cracker,
                    'generate_long_skips': boolean_cracker,
                    'skip_max_word_size': int_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.60
retrieving revision 1.61
diff -C2 -d -r1.60 -r1.61
*** tokenizer.py	1 Nov 2002 16:10:13 -0000	1.60
--- tokenizer.py	4 Nov 2002 23:21:43 -0000	1.61
***************
*** 1179,1182 ****
--- 1179,1185 ----
          for x in x2n.items():
              yield "header:%s:%d" % x
+         if options.record_header_absence:
+             for x in options.safe_headers - Set([k.lower() for k in x2n]):
+                 yield "noheader:" + x
  
      def tokenize_body(self, msg, maxword=options.skip_max_word_size):
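
To illustrate what record_header_absence adds, here is a small standalone
sketch of the same set-difference idea.  header_tokens() is a hypothetical
helper, not part of tokenizer.py; it counts every header name it is given
(the real tokenizer only counts headers allowed by count_all_header_lines
or safe_headers) and sorts its output so the result is predictable
(Python 3 syntax).

```python
def header_tokens(header_names, safe_headers, record_header_absence=False):
    """Return "header:NAME:COUNT" tokens for the headers present and,
    optionally, "noheader:NAME" tokens for safe headers that are missing."""
    counts = {}
    for name in header_names:
        counts[name] = counts.get(name, 0) + 1

    tokens = ["header:%s:%d" % item for item in sorted(counts.items())]

    if record_header_absence:
        # safe_headers holds lower-case names, so compare in lower case.
        present = {name.lower() for name in counts}
        for name in sorted(safe_headers - present):
            tokens.append("noheader:" + name)
    return tokens
```

For a message with one Date header and two Received headers, and
safe_headers of {"date", "message-id"}, enabling the option adds a
single extra token, "noheader:message-id".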


From tim_one@users.sourceforge.net  Mon Nov  4 23:21:45 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:21:45 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 default_bayes_customize.ini,1.4,1.5
Message-ID: <E188qXR-0003Ed-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv12377/Outlook2000

Modified Files:
	default_bayes_customize.ini 
Log Message:
New option record_header_absence.


Index: default_bayes_customize.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** default_bayes_customize.ini	27 Oct 2002 03:42:58 -0000	1.4
--- default_bayes_customize.ini	4 Nov 2002 23:21:43 -0000	1.5
***************
*** 14,17 ****
--- 14,20 ----
  replace_nonascii_chars: True
  
+ # It's helpful for Tim <wink>.
+ record_header_absence: True
+ 
  [Classifier]
  # Uncomment the next lines if you want to use the former default for


From tim.one@comcast.net  Mon Nov  4 23:39:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 18:39:27 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.13,1.14
In-Reply-To: <E188pZI-0004gG-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEHCCGAB.tim.one@comcast.net>

[Mark Hammond]
> Roll-back my previous "update probs" change - Tim's fix would
> have fixed it had I done a complete retrain.  Done that now, and
> if I still need this Tim will sort it out once-and-for-all <wink>

Do keep an eye on it!  I've never seen software that had a bug, but I keep
hearing it's possible ...


From mhammond@users.sourceforge.net  Tue Nov  5 11:44:30 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 05 Nov 2002 03:44:30 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.21,1.22
Message-ID: <E18928E-00057T-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14881

Modified Files:
	msgstore.py 
Log Message:
Fix a few typos in comments, and code!

Also adding a check if the message has attachments - currently not used, 
but will be soon (to handle multipart/signed messages) - was in the code
then found the typos, so decided I should get 'em in.

[DoCopyMode -> DoCopyMove does get me wondering about the utility of
auto-complete in editors tho' <0.1 wink>]


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** msgstore.py	4 Nov 2002 00:41:08 -0000	1.21
--- msgstore.py	5 Nov 2002 11:44:27 -0000	1.22
***************
*** 296,301 ****
          # only problem is that it can potentially be changed - however, the
          # Outlook client provides no such (easy/obvious) way
!         # (ie, someone would need to really want to change it <wink>
!         # This, searchkey is the only reliable long-lived message key.
          self.searchkey = searchkey
          self.unread = unread
--- 296,301 ----
          # only problem is that it can potentially be changed - however, the
          # Outlook client provides no such (easy/obvious) way
!         # (ie, someone would need to really want to change it <wink>)
!         # Thus, searchkey is the only reliable long-lived message key.
          self.searchkey = searchkey
          self.unread = unread
***************
*** 369,377 ****
          # Oh - and for multipart/signed messages <frown>
          self._EnsureObject()
!         prop_ids = PR_TRANSPORT_MESSAGE_HEADERS_A, PR_BODY_A, MYPR_BODY_HTML_A
          hr, data = self.mapi_object.GetProps(prop_ids,0)
          headers = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0])
          body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
          html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
          # Mail delivered internally via Exchange Server etc may not have
          # headers - fake some up.
--- 369,381 ----
          # Oh - and for multipart/signed messages <frown>
          self._EnsureObject()
!         prop_ids = (PR_TRANSPORT_MESSAGE_HEADERS_A,
!                     PR_BODY_A,
!                     MYPR_BODY_HTML_A,
!                     PR_HASATTACH)
          hr, data = self.mapi_object.GetProps(prop_ids,0)
          headers = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0])
          body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
          html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
+         has_attach = data[3][1]
          # Mail delivered internally via Exchange Server etc may not have
          # headers - fake some up.
***************
*** 382,385 ****
--- 386,395 ----
          elif headers.startswith("Microsoft Mail"):
              headers = "X-MS-Mail-Gibberish: " + headers
+         if not html and not body:
+             # Only ever seen this for "multipart/signed" messages, so
+             # without any better clues, just handle this.
+             # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
+             pass
+             
          return "%s\n%s\n%s" % (headers, html, body)
  
***************
*** 476,480 ****
              props = ( (mapi.PS_PUBLIC_STRINGS, prop), )
              prop = self.mapi_object.GetIDsFromNames(props, 0)[0]
-             # Docs say PT_ERROR, reality shows PT_UNSPECIFIED
              if PROP_TYPE(prop) == PT_ERROR: # No such property
                  return None
--- 486,489 ----
***************
*** 494,498 ****
          self.dirty = False
  
!     def _DoCopyMode(self, folder, isMove):
  ##        self.mapi_object = None # release the COM pointer
          assert not self.dirty, \
--- 503,507 ----
          self.dirty = False
  
!     def _DoCopyMove(self, folder, isMove):
  ##        self.mapi_object = None # release the COM pointer
          assert not self.dirty, \
***************
*** 517,524 ****
  
      def MoveTo(self, folder):
!         self._DoCopyMode(folder, True)
  
      def CopyTo(self, folder):
!         self._DoCopyMode(folder, True)
  
  def test():
--- 526,533 ----
  
      def MoveTo(self, folder):
!         self._DoCopyMove(folder, True)
  
      def CopyTo(self, folder):
!         self._DoCopyMove(folder, False)
  
  def test():


From mhammond@users.sourceforge.net  Tue Nov  5 21:51:55 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 05 Nov 2002 13:51:55 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/dialogs ManagerDialog.py,1.5,1.6
Message-ID: <E189Bc3-0002f0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv10075

Modified Files:
	ManagerDialog.py 
Log Message:
Ensure filter_status is always set to a value indicating why the filter
can not be enabled.


Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** ManagerDialog.py	1 Nov 2002 02:03:48 -0000	1.5
--- ManagerDialog.py	5 Nov 2002 21:51:53 -0000	1.6
***************
*** 120,123 ****
--- 120,128 ----
                  if ok_to_enable:
                      unsure_name = self.mgr.FormatFolderNames([config.unsure_folder_id], False)
+                 else:
+                     filter_status = "You must define the folder to receive your possible spam"
+             else:
+                 filter_status = "You must define the folder to receive your certain spam"
+                 
              # whew
              if ok_to_enable:


From richiehindle@users.sourceforge.net  Tue Nov  5 22:18:59 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Tue, 05 Nov 2002 14:18:59 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.9,1.10
Message-ID: <E189C2F-00065f-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23270

Modified Files:
	pop3proxy.py 
Log Message:
First cut of the HTML user interface - see the docstring for -b and -u.
Now reads the classification header and its values from the options.
Added TOP support to the test server (to make 40tude Dialog happy).


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** pop3proxy.py	2 Nov 2002 21:00:21 -0000	1.9
--- pop3proxy.py	5 Nov 2002 22:18:56 -0000	1.10
***************
*** 15,19 ****
              -p FILE : use the named data file
              -d      : the file is a DBM file rather than a pickle
!             -l port : listen on this port number (default 110)
  
      pop3proxy -t
--- 15,22 ----
              -p FILE : use the named data file
              -d      : the file is a DBM file rather than a pickle
!             -l port : proxy listens on this port number (default 110)
!             -u port : User interface listens on this port number
!                       (default 8880; Browse http://localhost:8880/)
!             -b      : Launch a web browser showing the user interface.
  
      pop3proxy -t
***************
*** 35,40 ****
  
  
! import sys, re, operator, errno, getopt, cPickle, time
! import socket, asyncore, asynchat
  import classifier, tokenizer, hammie
  from Options import options
--- 38,43 ----
  
  
! import sys, re, operator, errno, getopt, cPickle, cStringIO, time
! import socket, asyncore, asynchat, cgi, urlparse, webbrowser
  import classifier, tokenizer, hammie
  from Options import options
***************
*** 42,47 ****
  # HEADER_EXAMPLE is the longest possible header - the length of this one
  # is added to the size of each message.
! HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Unsure\r\n' % hammie.DISPHEADER
  
  
--- 45,57 ----
  # HEADER_EXAMPLE is the longest possible header - the length of this one
  # is added to the size of each message.
! HEADER_FORMAT = '%s: %%s\r\n' % options.hammie_header_name
! HEADER_EXAMPLE = '%s: xxxxxxxxxxxxxxxxxxxx\r\n' % options.hammie_header_name
! 
! # This keeps the global status of the module - the command-line options,
! # how many mails have been classified, how many active connections there
! # are, and so on.
! class Status:
!     pass
! status = Status()
  
  
***************
*** 61,65 ****
          self.set_socket(s, socketMap)
          self.set_reuse_addr()
!         print "Listening on port %d." % port
          self.bind(('', port))
          self.listen(5)
--- 71,75 ----
          self.set_socket(s, socketMap)
          self.set_reuse_addr()
!         print "%s listening on port %d." % (self.__class__.__name__, port)
          self.bind(('', port))
          self.listen(5)
***************
*** 73,80 ****
              self.factory(*args)
  
  
! class POP3ProxyBase(asynchat.async_chat):
      """An async dispatcher that understands POP3 and proxies to a POP3
!     server, calling `self.onTransaction( request, response )` for each
      transaction. Responses are not un-byte-stuffed before reaching
      self.onTransaction() (they probably should be for a totally generic
--- 83,107 ----
              self.factory(*args)
  
+ class BrighterAsyncChat(asynchat.async_chat):
+     """An asynchat.async_chat that doesn't give spurious warnings on
+     receiving an incoming connection, and lets SystemExit cause an
+     exit."""
  
!     def handle_connect(self):
!         """Suppress the asyncore "unhandled connect event" warning."""
!         pass
! 
!     def handle_error(self):
!         """Let SystemExit cause an exit."""
!         type, v, t = sys.exc_info()
!         if type == SystemExit:
!             raise
!         else:
!             asynchat.async_chat.handle_error(self)
! 
! 
! class POP3ProxyBase(BrighterAsyncChat):
      """An async dispatcher that understands POP3 and proxies to a POP3
!     server, calling `self.onTransaction(request, response)` for each
      transaction. Responses are not un-byte-stuffed before reaching
      self.onTransaction() (they probably should be for a totally generic
***************
*** 88,92 ****
  
      def __init__(self, clientSocket, serverName, serverPort):
!         asynchat.async_chat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
--- 115,119 ----
  
      def __init__(self, clientSocket, serverName, serverPort):
!         BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
***************
*** 96,103 ****
          self.push(self.serverIn.readline())
  
-     def handle_connect(self):
-         """Suppress the asyncore "unhandled connect event" warning."""
-         pass
- 
      def onTransaction(self, command, args, response):
          """Overide this.  Takes the raw request and the response, and
--- 123,126 ----
***************
*** 221,232 ****
              self.close_when_done()
  
-     def handle_error(self):
-         """Let SystemExit cause an exit."""
-         type, v, t = sys.exc_info()
-         if type == SystemExit:
-             raise
-         else:
-             asynchat.async_chat.handle_error(self)
- 
  
  class BayesProxyListener(Listener):
--- 244,247 ----
***************
*** 276,279 ****
--- 291,296 ----
          self.handlers = {'STAT': self.onStat, 'LIST': self.onList,
                           'RETR': self.onRetr, 'TOP': self.onTop}
+         status.totalSessions += 1
+         status.activeSessions += 1
  
      def send(self, data):
***************
*** 290,293 ****
--- 307,314 ----
          return data
  
+     def close(self):
+         status.activeSessions -= 1
+         POP3ProxyBase.close(self)
+     
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
***************
*** 343,352 ****
              # Now find the spam disposition and add the header.
              prob = self.bayes.spamprob(tokenizer.tokenize(message))
              if prob < options.ham_cutoff:
!                 disposition = "No"
              elif prob > options.spam_cutoff:
!                 disposition = "Yes"
              else:
!                 disposition = "Unsure"
              
              headers, body = re.split(r'\n\r?\n', response, 1)
--- 364,381 ----
              # Now find the spam disposition and add the header.
              prob = self.bayes.spamprob(tokenizer.tokenize(message))
+             if command == 'RETR':
+                 status.numEmails += 1
              if prob < options.ham_cutoff:
!                 disposition = options.header_ham_string
!                 if command == 'RETR':
!                     status.numHams += 1
              elif prob > options.spam_cutoff:
!                 disposition = options.header_spam_string
!                 if command == 'RETR':
!                     status.numSpams += 1
              else:
!                 disposition = options.header_unsure_string
!                 if command == 'RETR':
!                     status.numUnsure += 1
              
              headers, body = re.split(r'\n\r?\n', response, 1)
***************
*** 368,372 ****
  
  
! def main(serverName, serverPort, proxyPort, pickleName, useDB):
      """Runs the proxy forever or until a 'KILL' command is received or
      someone hits Ctrl+Break."""
--- 397,646 ----
  
  
! class UserInterfaceListener(Listener):
!     """Listens for incoming web browser connections and spins off
!     UserInterface objects to serve them."""
! 
!     def __init__(self, uiPort, bayes):
!         uiArgs = (bayes,)
!         Listener.__init__(self, uiPort, UserInterface, uiArgs)
! 
! 
! # Until the user interface has had a wider audience, I won't pollute the
! # project with .gif files and the like.  Here's the viking helmet.
! import base64
! helmet = base64.decodestring(
! """R0lGODlhIgAYAPcAAEJCRlVTVGNaUl5eXmtaVm9lXGtrZ3NrY3dvZ4d0Znt3dImHh5R+a6GDcJyU
! jrSdjaWlra2tra2tta+3ur2trcC9t7W9ysDDyMbGzsbS3r3W78bW78be78be973e/8bn/86pjNav
! kc69re/Lrc7Ly9ba4vfWveTh5M7e79be79bn797n7+fr6+/v5+/v7/f3787e987n987n/9bn99bn
! /9bv/97n997v++fv9+f3/+/v9+/3//f39/f/////9////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAEAAB4ALAAAAAAiABgA
! AAj+AD0IHEiwoMGDA2XI8PBhxg2EECN+YJHjwwccOz5E3FhQBgseMmK44KGRo0kaLHzQENljoUmO
! NE74uGHDxQ8aL2GmzFHzZs6NNFr8yKHC5sOfEEUOVcHiR8aNFksi/LCCx1KZPXAilLHBAoYMMSB6
! 9DEUhsyhUgl+wOBAwQIHFsIapGpzaIcTVnvcSOsBhgUFBgYUMKAgAgqNH2J0aPjxR9YPJerqlYEi
! w4YYExQM2FygwIHCKVBgiBChBIsXP5wu3HD2Bw8MC2JD0CygAIHOnhU4cLDA7QWrqfd6iBE5dQsH
! BgJvHiDgNoID0A88V6AAAQSyjl16QIHXBwnNAwDIBAhAwDmDBAjQHyiAIPkC7DnUljhxwkGAAQHE
! B+icIAGD8+clUMByCNjUUkEdlHCBAvflF0BtB/zHQAMSCjhYYBXsoFVBMWAQWH4AAFBbAg2UWOID
! FK432AEO2ABRBwtsFuKDBTSAYgMghBDCAwwgwB4CClQAQ0R/4RciAQjYyMADIIwwAggN+PeWBTPw
! VdAHHEjA4IMR8ojjCCaEEGUCFcygnUQxaEndbhBAwKQIFVAAgQMQHPZTBxrkqUEHfHLAAZ+AdgBR
! QAAAOw==""")
! 
! 
! class UserInterface(BrighterAsyncChat):
!     """Serves the HTML user interface of the proxy."""
! 
!     header = """<html><head><title>Spambayes proxy: %s</title>
!              <style>
!              body { font: 90%% arial, swiss, helvetica }
!              table { font: 90%% arial, swiss, helvetica }
!              form { margin: 0 }
!              .banner { background: #c0e0ff; padding: 5; padding-left: 15 }
!              .header { font-size: 133%% }
!              .content { margin: 15 }
!              .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: #fffae0; padding-left: 1ex;
!                                border-bottom: 1px solid #808080;
!                                font-weight: bold }
!              .sectionbody { padding: 1em }
!              </style>
!              </head>\n"""
! 
!     bodyStart = """<body style='margin: 0'>
!                 <div class='banner'>
!                 <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>Spambayes proxy: %s</span></div>
!                 <div class='content'>\n"""
! 
!     footer = """</div>
!              <form action='/shutdown'>
!              <table width='100%%' cellspacing='0'>
!              <tr><td class='banner'>&nbsp;Spambayes Proxy, %s.
!              <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
!              <td align='right' class='banner'>
!              <input type='submit' value='Shutdown now'>
!              </td></tr></table></form>\n"""
! 
!     pageSection = """<table class='sectiontable' cellspacing='0'>
!                   <tr><td class='sectionheading'>%s</td></tr>
!                   <tr><td class='sectionbody'>%s</td></tr></table>
!                   &nbsp;<br>\n"""
!     
!     wordQuery = """<form action='/wordquery'>
!                 <input name='word' type='text' size='30'>
!                 <input type='submit' value='Tell me about this word'>
!                 </form>"""
!     
!     def __init__(self, clientSocket, bayes):
!         BrighterAsyncChat.__init__(self, clientSocket)
!         self.bayes = bayes
!         self.request = ''
!         self.set_terminator('\r\n\r\n')
!         self.helmet = helmet
! 
!     def collect_incoming_data(self, data):
!         """Asynchat override."""
!         self.request = self.request + data
! 
!     def found_terminator(self):
!         """Asynchat override.
!         Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = self.request.split('\r\n', 1)
!         try:
!             method, url, version = requestLine.strip().split()
!         except ValueError:
!             self.pushError(400, "Malformed request: '%s'" % requestLine)  # XXX: 400??
!             self.close_when_done()
!         else:
!             method = method.upper()
!             _, _, path, _, query, _ = urlparse.urlparse(url)
!             params = cgi.parse_qs(query, keep_blank_values=True)
!             if self.get_terminator() == '\r\n\r\n' and method == 'POST':
!                 # We need to read a body; set a numeric async_chat terminator.
!                 match = re.search(r'(?i)content-length:\s*(\d+)', headers)
!                 self.set_terminator(int(match.group(1)))
!                 self.request = self.request + '\r\n\r\n'
!                 return
!     
!             if type(self.get_terminator()) is type(1):
!                 # We've just read the body of a POSTed request.
!                 self.set_terminator('\r\n\r\n')
!                 body = self.request.split('\r\n\r\n', 1)[1]
!                 match = re.search(r'(?i)content-type:\s*([^\r\n]+)', headers)
!                 contentTypeHeader = match.group(1)
!                 contentType, pdict = cgi.parse_header(contentTypeHeader)
!                 if contentType == 'multipart/form-data':
!                     # multipart/form-data - probably a file upload.
!                     bodyFile = cStringIO.StringIO(body)
!                     params.update(cgi.parse_multipart(bodyFile, pdict))
!                 else:
!                     # A normal x-www-form-urlencoded.
!                     params.update(cgi.parse_qs(body, keep_blank_values=True))
!             
!             # Convert the cgi params into a simple dictionary.
!             plainParams = {}
!             for name, value in params.iteritems():
!                 plainParams[name] = value[0]
!             self.onRequest(path, plainParams)
!             self.close_when_done()
! 
!     def onRequest(self, path, params):
!         """Handles a decoded HTTP request."""
!         if path == '/':
!             path = '/Home'
!         
!         if path == '/helmet.gif':
!             self.pushOKHeaders('image/gif')
!             self.push(self.helmet)
!         else:
!             try:
!                 name = path[1:].capitalize()
!                 handler = getattr(self, 'on' + name)
!             except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % path)
!             else:
!                 # This is a request for a valid page; run the handler.
!                 self.pushOKHeaders('text/html')
!                 self.pushPreamble(name)
!                 handler(params)
!                 timeString = time.asctime(time.localtime())
!                 self.push(self.footer % timeString)
!     
!     def pushOKHeaders(self, contentType):
!         self.push("HTTP/1.0 200 OK\r\n")
!         self.push("Content-Type: %s\r\n" % contentType)
!         self.push("\r\n")
! 
!     def pushError(self, code, message):
!         self.push("HTTP/1.0 %d Error\r\n" % code)
!         self.push("Content-Type: text/html\r\n")
!         self.push("\r\n")
!         self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
!     
!     def pushPreamble(self, name):
!         self.push(self.header % name)
!         if name == 'Home':
!             homeLink = name
!         else:
!             homeLink = "<a href='/'>Home</a> > %s" % name
!         self.push(self.bodyStart % homeLink)
! 
!     def onHome(self, params):
!         summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!                   proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
!                   Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
!                   POP3 conversations this session:
!                     <b>%(totalSessions)d</b>.<br>
!                   Emails classified this session: <b>%(numSpams)d</b> spam,
!                     <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
!                   """ % status.__dict__
!         
!         train = """<form action='/upload' method='POST'
!                     enctype='multipart/form-data'>
!                 Either upload a message file:
!                 <input type='file' name='file'><br>
!                 Or paste the whole message (including headers) here:<br>
!                 <textarea name='text' rows='3' cols='60'></textarea><br>
!                 Is this message
!                 <input type='radio' name='which' value='ham'>Ham</input> or
!                 <input type='radio'
!                        name='which' value='spam' checked>Spam</input>?<br>
!                 <input type='submit' value='Train on this message'>
!                 </form>"""
!         
!         body = (self.pageSection % ('Status', summary) +
!                 self.pageSection % ('Word query', self.wordQuery) +
!                 self.pageSection % ('Train', train))
!         self.push(body)
! 
!     def onShutdown(self, params):
!         self.push("<p><b>Shutdown.</b> Goodbye.</p>")
!         self.push(' ')  # Acts as a flush for small buffers.
!         self.shutdown(2)
!         self.close()
!         raise SystemExit
! 
!     def onUpload(self, params):
!         message = params.get('file') or params.get('text')            
!         isSpam = (params['which'] == 'spam')
!         self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("""<p>Trained on your message. Saving database...</p>""")
!         self.push(" ")  # Flush... must find out how to do this properly...
!         if not status.useDB and status.pickleName:
!             fp = open(status.pickleName, 'wb')
!             cPickle.dump(self.bayes, fp, 1)
!             fp.close()
!         self.push("<p>Done.</p><p><a href='/'>Home</a></p>")
! 
!     def onWordquery(self, params):
!         word = params['word']
!         try:
!             # Must be a better way to get __dict__ for a new-style class...
!             wi = self.bayes.wordinfo[word]
!             members = dict(map(lambda n: (n, getattr(wi, n)), wi.__slots__))
!             members['atime'] = time.asctime(time.localtime(members['atime']))
!             info = """Number of spam messages: <b>%(spamcount)d</b>.<br>
!                    Number of ham messages: <b>%(hamcount)d</b>.<br>
!                    Number of times used to classify: <b>%(killcount)s</b>.<br>
!                    Probability that a message containing this word is spam:
!                    <b>%(spamprob)f</b>.<br>
!                    Last used: <b>%(atime)s</b>.<br>""" % members
!         except KeyError:
!             info = "'%s' does not appear in the database." % word
!         
!         body = (self.pageSection % ("Statistics for '%s':" % word, info) +
!                 self.pageSection % ('Word query', self.wordQuery))
!         self.push(body)
! 
! 
! def main(serverName, serverPort, proxyPort,
!          uiPort, launchUI, pickleName, useDB):
      """Runs the proxy forever or until a 'KILL' command is received or
      someone hits Ctrl+Break."""
***************
*** 375,378 ****
--- 649,655 ----
      print "Done."
      BayesProxyListener(serverName, serverPort, proxyPort, bayes)
+     UserInterfaceListener(uiPort, bayes)
+     if launchUI:
+         webbrowser.open_new("http://localhost:%d/" % uiPort)
      asyncore.loop()
  
***************
*** 424,430 ****
  
  
! class TestPOP3Server(asynchat.async_chat):
!     """Minimal POP3 server, for testing purposes.  Doesn't support TOP
!     or UIDL.  USER, PASS, APOP, DELE and RSET simply return "+OK"
      without doing anything.  Also understands the 'KILL' command, to
      kill it.  The mail content is the example messages above.
--- 701,707 ----
  
  
! class TestPOP3Server(BrighterAsyncChat):
!     """Minimal POP3 server, for testing purposes.  Doesn't support
!     UIDL.  USER, PASS, APOP, DELE and RSET simply return "+OK"
      without doing anything.  Also understands the 'KILL' command, to
      kill it.  The mail content is the example messages above.
***************
*** 434,439 ****
          # Grumble: asynchat.__init__ doesn't take a 'map' argument,
          # hence the two-stage construction.
!         asynchat.async_chat.__init__(self)
!         asynchat.async_chat.set_socket(self, clientSocket, socketMap)
          self.maildrop = [spam1, good1]
          self.set_terminator('\r\n')
--- 711,716 ----
          # Grumble: asynchat.__init__ doesn't take a 'map' argument,
          # hence the two-stage construction.
!         BrighterAsyncChat.__init__(self)
!         BrighterAsyncChat.set_socket(self, clientSocket, socketMap)
          self.maildrop = [spam1, good1]
          self.set_terminator('\r\n')
***************
*** 442,453 ****
          self.handlers = {'STAT': self.onStat,
                           'LIST': self.onList,
!                          'RETR': self.onRetr}
          self.push("+OK ready\r\n")
          self.request = ''
  
-     def handle_connect(self):
-         """Suppress the asyncore "unhandled connect event" warning."""
-         pass
- 
      def collect_incoming_data(self, data):
          """Asynchat override."""
--- 719,727 ----
          self.handlers = {'STAT': self.onStat,
                           'LIST': self.onList,
!                          'RETR': self.onRetr,
!                          'TOP': self.onTop}
          self.push("+OK ready\r\n")
          self.request = ''
  
      def collect_incoming_data(self, data):
          """Asynchat override."""
***************
*** 466,469 ****
--- 740,745 ----
                  self.close_when_done()
              if command == 'KILL':
+                 self.shutdown(2)
+                 self.close()
                  raise SystemExit
          else:
***************
*** 472,483 ****
          self.request = ''
  
-     def handle_error(self):
-         """Let SystemExit cause an exit."""
-         type, v, t = sys.exc_info()
-         if type == SystemExit:
-             raise
-         else:
-             asynchat.async_chat.handle_error(self)
- 
      def onStat(self, command, args):
          """POP3 STAT command."""
--- 748,751 ----
***************
*** 502,514 ****
              return '\r\n'.join(returnLines) + '\r\n'
  
!     def onRetr(self, command, args):
!         """POP3 RETR command."""
!         number = int(args)
          if 0 < number <= len(self.maildrop):
              message = self.maildrop[number-1]
              return "+OK\r\n%s\r\n.\r\n" % message
          else:
              return "-ERR no such message\r\n"
  
      def onUnknown(self, command, args):
          """Unknown POP3 command."""
--- 770,793 ----
              return '\r\n'.join(returnLines) + '\r\n'
  
!     def _getMessage(self, number, maxLines):
!         """Implements the POP3 RETR and TOP commands."""
          if 0 < number <= len(self.maildrop):
              message = self.maildrop[number-1]
+             headers, body = message.split('\n\n', 1)
+             bodyLines = body.split('\n')[:maxLines]
+             message = headers + '\r\n\r\n' + '\n'.join(bodyLines)
              return "+OK\r\n%s\r\n.\r\n" % message
          else:
              return "-ERR no such message\r\n"
  
+     def onRetr(self, command, args):
+         """POP3 RETR command."""
+         return self._getMessage(int(args), 12345)
+ 
+     def onTop(self, command, args):
+         """POP3 TOP command."""
+         number, lines = map(int, args.split())
+         return self._getMessage(number, lines)
+ 
      def onUnknown(self, command, args):
          """Unknown POP3 command."""
***************
*** 564,568 ****
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(hammie.DISPHEADER) != -1
  
      # Kill the proxy and the test server.
--- 843,847 ----
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) != -1
  
      # Kill the proxy and the test server.
***************
*** 580,592 ****
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdp:l:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
          sys.exit()
  
!     pickleName = hammie.DEFAULTDB
!     proxyPort = 110
!     useDB = False
!     runTestServer = False
      for opt, arg in opts:
          if opt == '-h':
--- 859,880 ----
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdbp:l:u:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
          sys.exit()
  
!     status.pickleName = hammie.DEFAULTDB
!     status.proxyPort = 110
!     status.uiPort = 8880
!     status.serverPort = 110
!     status.useDB = False
!     status.runTestServer = False
!     status.launchUI = False
!     status.totalSessions = 0
!     status.activeSessions = 0
!     status.numEmails = 0
!     status.numSpams = 0
!     status.numHams = 0
!     status.numUnsure = 0
      for opt, arg in opts:
          if opt == '-h':
***************
*** 594,604 ****
              sys.exit()
          elif opt == '-t':
!             runTestServer = True
          elif opt == '-d':
!             useDB = True
          elif opt == '-p':
!             pickleName = arg
          elif opt == '-l':
!             proxyPort = int(arg)
              
      # Do whatever we've been asked to do...
--- 882,896 ----
              sys.exit()
          elif opt == '-t':
!             status.runTestServer = True
!         elif opt == '-b':
!             status.launchUI = True
          elif opt == '-d':
!             status.useDB = True
          elif opt == '-p':
!             status.pickleName = arg
          elif opt == '-l':
!             status.proxyPort = int(arg)
!         elif opt == '-u':
!             status.uiPort = int(arg)
              
      # Do whatever we've been asked to do...
***************
*** 608,623 ****
          print "Self-test passed."   # ...else it would have asserted.
  
!     elif runTestServer:
          print "Running a test POP3 server on port 8110..."
          TestListener()
          asyncore.loop()
  
!     elif len(args) == 1:
!         # Named POP3 server, default port.
!         main(args[0], 110, proxyPort, pickleName, useDB)
! 
!     elif len(args) == 2:
!         # Named POP3 server, named port.
!         main(args[0], int(args[1]), proxyPort, pickleName, useDB)
  
      else:
--- 900,915 ----
          print "Self-test passed."   # ...else it would have asserted.
  
!     elif status.runTestServer:
          print "Running a test POP3 server on port 8110..."
          TestListener()
          asyncore.loop()
  
!     elif 1 <= len(args) <= 2:
!         # Normal usage, with optional server port number.
!         status.serverName = args[0]
!         if len(args) == 2:
!             status.serverPort = int(args[1])
!         main(status.serverName, status.serverPort, status.proxyPort,
!              status.uiPort, status.launchUI, status.pickleName, status.useDB)
  
      else:
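
The request handling that found_terminator() performs in the diff above
(split the request line, parse the URL, keep the first value of each query
parameter) can be sketched in isolation.  This is a minimal illustration
written for Python 3, where urlparse and parse_qs live in urllib.parse
rather than in the urlparse and cgi modules the checked-in code imports.
The function name parse_request_line is hypothetical and is not part of
the proxy.

```python
from urllib.parse import urlparse, parse_qs


def parse_request_line(request_line):
    """Split an HTTP request line and flatten its query parameters.

    Hypothetical helper: returns (method, path, params), or None when the
    line does not consist of exactly three whitespace-separated fields.
    """
    try:
        method, url, _version = request_line.strip().split()
    except ValueError:
        return None  # the proxy answers such a request with an error page
    parsed = urlparse(url)
    params = parse_qs(parsed.query, keep_blank_values=True)
    # parse_qs maps each name to a list; keep only the first value,
    # mirroring the plainParams loop in the diff.
    plain = {name: values[0] for name, values in params.items()}
    return method.upper(), parsed.path, plain
```

Keeping blank values means a form submitted with an empty 'word' field
still yields a 'word' key, so a handler that indexes params['word'] does
not fail on it.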

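The shared RETR/TOP logic that _getMessage() introduces for the test
server can be tried outside the asynchat machinery.  The sketch below is
an assumption-laden restatement in Python 3: the name top_response is
hypothetical, and it uses max_lines=None to mean "the whole body" where
the checked-in code passes the large constant 12345 for RETR.

```python
def top_response(maildrop, number, max_lines=None):
    """Build a POP3 reply for message `number` (1-based) in `maildrop`.

    Hypothetical stand-in for the test server's _getMessage(); with
    max_lines=None the whole body is returned, as RETR would do.
    """
    if 0 < number <= len(maildrop):
        headers, body = maildrop[number - 1].split('\n\n', 1)
        body_lines = body.split('\n')
        if max_lines is not None:
            body_lines = body_lines[:max_lines]  # TOP: truncate the body
        message = headers + '\r\n\r\n' + '\n'.join(body_lines)
        return "+OK\r\n%s\r\n.\r\n" % message
    return "-ERR no such message\r\n"
```

Like the test server, the sketch assumes each stored message contains a
blank line separating headers from body; a message without one would make
the split fail.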

From jhylton@users.sourceforge.net  Tue Nov  5 22:57:29 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Tue, 05 Nov 2002 14:57:29 -0800
Subject: [Spambayes-checkins] spambayes/pspam pop.py,1.1,1.2
Message-ID: <E189CdV-0002NK-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv9113

Modified Files:
	pop.py 
Log Message:
Allow the proxy server to get the real server name from USER command.


Index: pop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** pop.py	4 Nov 2002 04:44:19 -0000	1.1
--- pop.py	5 Nov 2002 22:57:27 -0000	1.2
***************
*** 11,14 ****
--- 11,21 ----
  insert the spam header.
  
+ The proxy can connect to any real POP3 server.  It parses the USER
+ command to figure out the address of the real server, and expects the
+ USER argument to follow the format user@server[:port].  For example,
+ if you configure your POP client to send USER jeremy@example.com:111,
+ the proxy will connect to a server on port 111 at example.com and send
+ it the command USER jeremy.
+ 
  XXX A POP3 server sometimes adds the number of bytes in the +OK
  response to some commands when the POP3 spec doesn't require it to.
***************
*** 41,52 ****
  HEADER_SIZE = len(HEADER % 0.0)
  
  class POP3ProxyServer(SocketServer.ThreadingTCPServer):
  
      allow_reuse_address = True
  
!     def __init__(self, addr, handler, classifier, real_server, log, zodb):
          SocketServer.ThreadingTCPServer.__init__(self, addr, handler)
          self.classifier = classifier
-         self.pop_server = real_server
          self.log = log
          self.zodb = zodb
--- 48,60 ----
  HEADER_SIZE = len(HEADER % 0.0)
  
+ VERSION = 0.1
+ 
  class POP3ProxyServer(SocketServer.ThreadingTCPServer):
  
      allow_reuse_address = True
  
!     def __init__(self, addr, handler, classifier, log, zodb):
          SocketServer.ThreadingTCPServer.__init__(self, addr, handler)
          self.classifier = classifier
          self.log = log
          self.zodb = zodb
***************
*** 73,80 ****
      """Act as proxy between POP client and server."""
  
!     def connect_pop(self):
          # connect to the pop server
          s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         s.connect(self.server.pop_server)
          self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb"))
          # the write side should be unbuffered
--- 81,117 ----
      """Act as proxy between POP client and server."""
  
!     def read_user(self):
!         # XXX This could be cleaned up a bit.
!         line = self.rfile.readline()
!         if line == "":
!             return False
!         parts = line.split()
!         if parts[0] != "USER":
!             self.wfile.write("-ERR Invalid command; must specify USER first\r\n")
!             return False
!         user = parts[1]
!         i = user.rfind("@")
!         username = user[:i]
!         server = user[i+1:]
!         i = server.find(":")
!         if i == -1:
!             server = server, 110
!         else:
!             port = int(server[i+1:])
!             server = server[:i], port
!         zLOG.LOG("POP3", zLOG.INFO, "Got connect for %s" % repr(server))
!         self.connect_pop(server)
!         self.pop_wfile.write("USER %s\r\n" % username)
!         resp = self.pop_rfile.readline()
!         # As long as the server responds OK, just swallow this response.
!         if resp.startswith("+OK"):
!             return True
!         else:
!             return False
! 
!     def connect_pop(self, pop_server):
          # connect to the pop server
          s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         s.connect(pop_server)
          self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb"))
          # the write side should be unbuffered
***************
*** 90,94 ****
          self.server.zodb.sync()
          self.sess_retr_count = 0
!         self.connect_pop()
          try:
              self.handle_pop()
--- 127,135 ----
          self.server.zodb.sync()
          self.sess_retr_count = 0
!         self.wfile.write("+OK pspam/pop %s\r\n" % VERSION)
!         # First read the USER command to get the real server's name
!         if not self.read_user():
!             zLOG.LOG("POP3", zLOG.INFO, "Did not get valid USER")
!             return
          try:
              self.handle_pop()
***************
*** 265,269 ****
                               POP3RequestHandler,
                               profile.classifier,
-                              (options.server, options.server_port),
                               log,
                               conn,
--- 306,309 ----
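
The user@server[:port] convention described in the new docstring can be
illustrated with a small standalone parser.  This is a sketch for Python 3
and not the code that was checked in: the name parse_user_argument is
hypothetical, and the explicit check for a missing "@" is an addition that
read_user() does not make.

```python
def parse_user_argument(user, default_port=110):
    """Split 'user@server[:port]' into (username, (host, port)).

    Hypothetical helper.  The last '@' is used as the separator, as in
    read_user(), so the username itself may contain '@'.
    """
    i = user.rfind("@")
    if i == -1:
        # Assumption: reject the argument rather than guess a server.
        raise ValueError("expected user@server[:port], got %r" % user)
    username, server = user[:i], user[i + 1:]
    host, sep, port = server.partition(":")
    return username, (host, int(port) if sep else default_port)
```

The (host, port) tuple has the shape that socket.connect() expects, which
is how connect_pop() uses it in the diff.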


From montanaro@users.sourceforge.net  Wed Nov  6 01:57:42 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 17:57:42 -0800
Subject: [Spambayes-checkins] spambayes mboxutils.py,1.3,1.4
Message-ID: <E189FRu-0003EX-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12413

Modified Files:
	mboxutils.py 
Log Message:
Add get_message() factory function ripped from
tokenizer.Tokenizer.get_message().  Replace usage of _factory() with it.


Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** mboxutils.py	27 Oct 2002 21:35:00 -0000	1.3
--- mboxutils.py	6 Nov 2002 01:57:39 -0000	1.4
***************
*** 24,27 ****
--- 24,28 ----
  import email
  import mailbox
+ import email.Message
  
  class DirOfTxtFileMailbox:
***************
*** 44,54 ****
              f.close()
  
- def _factory(fp):
-     # Helper for getmbox
-     try:
-         return email.message_from_file(fp)
-     except email.Errors.MessageParseError:
-         return ''
- 
  def _cat(seqs):
      for seq in seqs:
--- 45,48 ----
***************
*** 74,78 ****
          for name in names:
              filename = os.path.join(mhpath, name)
!             mbox = mailbox.MHMailbox(filename, _factory)
              mboxes.append(mbox)
          if len(mboxes) == 1:
--- 68,72 ----
          for name in names:
              filename = os.path.join(mhpath, name)
!             mbox = mailbox.MHMailbox(filename, get_message)
              mboxes.append(mbox)
          if len(mboxes) == 1:
***************
*** 85,95 ****
          # if the pathname contains /Mail/, else a DirOfTxtFileMailbox.
          if os.path.exists(os.path.join(name, 'cur')):
!             mbox = mailbox.Maildir(name, _factory)
          elif name.find("/Mail/") >= 0:
!             mbox = mailbox.MHMailbox(name, _factory)
          else:
!             mbox = DirOfTxtFileMailbox(name, _factory)
      else:
          fp = open(name, "rb")
!         mbox = mailbox.PortableUnixMailbox(fp, _factory)
      return iter(mbox)
--- 79,120 ----
          # if the pathname contains /Mail/, else a DirOfTxtFileMailbox.
          if os.path.exists(os.path.join(name, 'cur')):
!             mbox = mailbox.Maildir(name, get_message)
          elif name.find("/Mail/") >= 0:
!             mbox = mailbox.MHMailbox(name, get_message)
          else:
!             mbox = DirOfTxtFileMailbox(name, get_message)
      else:
          fp = open(name, "rb")
!         mbox = mailbox.PortableUnixMailbox(fp, get_message)
      return iter(mbox)
+ 
+ def get_message(obj):
+     """Return an email Message object.
+ 
+     The argument may be a Message object already, in which case it's
+     returned as-is.
+ 
+     If the argument is a string or file-like object (supports read()),
+     the email package is used to create a Message object from it.  This
+     can fail if the message is malformed.  In that case, the headers
+     (everything through the first blank line) are thrown out, and the
+     rest of the text is wrapped in a bare email.Message.Message.
+     """
+ 
+     if isinstance(obj, email.Message.Message):
+         return obj
+     # Create an email Message object.
+     if hasattr(obj, "read"):
+         obj = obj.read()
+     try:
+         msg = email.message_from_string(obj)
+     except email.Errors.MessageParseError:
+         # Wrap the raw text in a bare Message object.  Since the
+         # headers are most likely damaged, we can't use the email
+         # package to parse them, so just get rid of them first.
+         i = obj.find('\n\n')
+         if i >= 0:
+             obj = obj[i+2:]     # strip headers
+         msg = email.Message.Message()
+         msg.set_payload(obj)
+     return msg
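
The three input kinds that get_message() accepts can be exercised with a
short script.  The version below is a port of the function above to
Python 3 module names (email.message, email.errors), offered as a sketch
rather than as the project's code.  One caveat, stated as an assumption
about current Python versions: the modern parser is lenient and records
most problems as defects, so the except branch is rarely reached there.

```python
import email
import email.errors
import email.message
import io


def get_message(obj):
    """Return an email Message for a Message, a string, or a file-like object."""
    if isinstance(obj, email.message.Message):
        return obj
    if hasattr(obj, "read"):
        obj = obj.read()
    try:
        msg = email.message_from_string(obj)
    except email.errors.MessageParseError:
        # Headers are presumably damaged: drop everything through the
        # first blank line and wrap the rest in a bare Message.
        i = obj.find('\n\n')
        if i >= 0:
            obj = obj[i + 2:]
        msg = email.message.Message()
        msg.set_payload(obj)
    return msg


# Usage: a file-like object is read and parsed like a string.
msg = get_message(io.StringIO("To: a@example.com\n\nhello\n"))
```

Because a Message passed in is returned unchanged, callers such as the
mailbox factories and hammie can call get_message() without first checking
what they were handed.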


From montanaro@users.sourceforge.net  Wed Nov  6 01:58:37 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 17:58:37 -0800
Subject: [Spambayes-checkins] spambayes mboxcount.py,1.1,1.2
Message-ID: <E189FSn-0003Hw-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12636

Modified Files:
	mboxcount.py 
Log Message:
replace _factory() with mboxutils.get_message()


Index: mboxcount.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxcount.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** mboxcount.py	5 Sep 2002 16:16:43 -0000	1.1
--- mboxcount.py	6 Nov 2002 01:58:35 -0000	1.2
***************
*** 34,40 ****
  import glob
  
! program = sys.argv[0]
  
! _marker = object()
  
  def usage(code, msg=''):
--- 34,40 ----
  import glob
  
! from mboxutils import get_message
  
! program = sys.argv[0]
  
  def usage(code, msg=''):
***************
*** 44,60 ****
      sys.exit(code)
  
- def _factory(fp):
-     try:
-         return email.message_from_file(fp)
-     except email.Errors.MessageParseError:
-         return _marker
- 
  def count(fname):
      fp = open(fname, 'rb')
!     mbox = mailbox.PortableUnixMailbox(fp, _factory)
      goodcount = 0
      badcount = 0
      for msg in mbox:
!         if msg is _marker:
              badcount += 1
          else:
--- 44,54 ----
      sys.exit(code)
  
  def count(fname):
      fp = open(fname, 'rb')
!     mbox = mailbox.PortableUnixMailbox(fp, get_message)
      goodcount = 0
      badcount = 0
      for msg in mbox:
!         if msg["to"] is None and msg["cc"] is None:
              badcount += 1
          else:
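
The new test in count() treats a message with neither a To nor a Cc header
as bad, which matches what get_message() produces when it falls back to a
bare Message.  A minimal Python 3 sketch of that counting rule follows;
the name count_messages is hypothetical, and the function takes any
iterable of Message objects instead of opening an mbox file.

```python
import email
import email.message


def count_messages(messages):
    """Return (goodcount, badcount) for an iterable of Message objects.

    Hypothetical helper: a message with no To and no Cc header is
    counted as bad, following the check in the diff above.
    """
    goodcount = badcount = 0
    for msg in messages:
        if msg["to"] is None and msg["cc"] is None:
            badcount += 1
        else:
            goodcount += 1
    return goodcount, badcount


# Usage: two parsed messages and one bare Message with no headers.
sample = [
    email.message_from_string("To: a@example.com\n\nhi"),
    email.message_from_string("Cc: b@example.com\n\nhi"),
    email.message.Message(),
]
```

The rule is a heuristic: a well-formed message that happens to carry
neither header (some list mail uses only Bcc) would also be counted as
bad.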


From montanaro@users.sourceforge.net  Wed Nov  6 02:01:27 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 18:01:27 -0800
Subject: [Spambayes-checkins] spambayes split.py,1.1,1.2
Message-ID: <E189FVX-0003Tj-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13359

Modified Files:
	split.py 
Log Message:
replace _factory() with mboxutils.get_message()


Index: split.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/split.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** split.py	5 Sep 2002 16:16:43 -0000	1.1
--- split.py	6 Nov 2002 02:01:25 -0000	1.2
***************
*** 32,35 ****
--- 32,37 ----
  import getopt
  
+ import mboxutils
+ 
  program = sys.argv[0]
  
***************
*** 44,55 ****
  
  
- def _factory(fp):
-     try:
-         return email.message_from_file(fp)
-     except email.Errors.MessageParseError:
-         return ''
- 
- 
- 
  def main():
      try:
--- 46,49 ----
***************
*** 81,85 ****
      infp = open(mboxfile, 'rb')
  
!     mbox = mailbox.PortableUnixMailbox(infp, _factory)
      for msg in mbox:
          if random.random() < percent:
--- 75,79 ----
      infp = open(mboxfile, 'rb')
  
!     mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
      for msg in mbox:
          if random.random() < percent:


From montanaro@users.sourceforge.net  Wed Nov  6 02:02:10 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 18:02:10 -0800
Subject: [Spambayes-checkins] spambayes splitn.py,1.2,1.3
Message-ID: <E189FWE-0003Wx-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13571

Modified Files:
	splitn.py 
Log Message:
replace _factory() with mboxutils.get_message()


Index: splitn.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitn.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** splitn.py	8 Sep 2002 17:41:56 -0000	1.2
--- splitn.py	6 Nov 2002 02:02:08 -0000	1.3
***************
*** 46,49 ****
--- 46,51 ----
  import getopt
  
+ import mboxutils
+ 
  program = sys.argv[0]
  
***************
*** 54,63 ****
      sys.exit(code)
  
- def _factory(fp):
-     try:
-         return email.message_from_file(fp)
-     except email.Errors.MessageParseError:
-         return ''
- 
  def main():
      try:
--- 56,59 ----
***************
*** 89,93 ****
                  for i in range(1, n+1)]
  
!     mbox = mailbox.PortableUnixMailbox(infile, _factory)
      counter = 0
      for msg in mbox:
--- 85,89 ----
                  for i in range(1, n+1)]
  
!     mbox = mailbox.PortableUnixMailbox(infile, mboxutils.get_message)
      counter = 0
      for msg in mbox:


From montanaro@users.sourceforge.net  Wed Nov  6 02:02:46 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 18:02:46 -0800
Subject: [Spambayes-checkins] spambayes splitndirs.py,1.5,1.6
Message-ID: <E189FWo-0003Zu-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13738

Modified Files:
	splitndirs.py 
Log Message:
delete unused _factory() function


Index: splitndirs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** splitndirs.py	24 Sep 2002 18:26:11 -0000	1.5
--- splitndirs.py	6 Nov 2002 02:02:43 -0000	1.6
***************
*** 63,72 ****
      sys.exit(code)
  
- def _factory(fp):
-     try:
-         return email.message_from_file(fp)
-     except email.Errors.MessageParseError:
-         return ''
- 
  def main():
      try:
--- 63,66 ----


From montanaro@users.sourceforge.net  Wed Nov  6 02:07:44 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 18:07:44 -0800
Subject: [Spambayes-checkins] spambayes hammie.py,1.35,1.36
Message-ID: <E189Fbc-0003yR-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15267

Modified Files:
	hammie.py 
Log Message:
use mboxutils.get_message()


Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** hammie.py	3 Nov 2002 14:24:36 -0000	1.35
--- hammie.py	6 Nov 2002 02:07:42 -0000	1.36
***************
*** 263,270 ****
          """
  
!         if hasattr(msg, "readlines"):
!             msg = email.message_from_file(msg)
!         elif not hasattr(msg, "add_header"):
!             msg = email.message_from_string(msg)
          prob, clues = self._scoremsg(msg, True)
          if prob < ham_cutoff:
--- 263,267 ----
          """
  
!         msg = mboxutils.get_message(msg)
          prob, clues = self._scoremsg(msg, True)
          if prob < ham_cutoff:


From montanaro@users.sourceforge.net  Wed Nov  6 02:12:49 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 05 Nov 2002 18:12:49 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.61,1.62
Message-ID: <E189FgX-0004KH-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16612

Modified Files:
	tokenizer.py 
Log Message:
move Tokenizer.get_message() to mboxutils.py where it becomes the one true
place to try and generate email.Message.Message objects.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.61
retrieving revision 1.62
diff -C2 -d -r1.61 -r1.62
*** tokenizer.py	4 Nov 2002 23:21:43 -0000	1.61
--- tokenizer.py	6 Nov 2002 02:12:47 -0000	1.62
***************
*** 14,17 ****
--- 14,19 ----
  from Options import options
  
+ from mboxutils import get_message
+ 
  # Patch encodings.aliases to recognize 'ansi_x3_4_1968'
  from encodings.aliases import aliases # The aliases dictionary
***************
*** 985,1017 ****
  
      def get_message(self, obj):
!         """Return an email Message object.
! 
!         The argument may be a Message object already, in which case it's
!         returned as-is.
! 
!         If the argument is a string or file-like object (supports read()),
!         the email package is used to create a Message object from it.  This
!         can fail if the message is malformed.  In that case, the headers
!         (everything through the first blank line) are thrown out, and the
!         rest of the text is wrapped in a bare email.Message.Message.
!         """
! 
!         if isinstance(obj, email.Message.Message):
!             return obj
!         # Create an email Message object.
!         if hasattr(obj, "read"):
!             obj = obj.read()
!         try:
!             msg = email.message_from_string(obj)
!         except email.Errors.MessageParseError:
!             # Wrap the raw text in a bare Message object.  Since the
!             # headers are most likely damaged, we can't use the email
!             # package to parse them, so just get rid of them first.
!             i = obj.find('\n\n')
!             if i >= 0:
!                 obj = obj[i+2:]     # strip headers
!             msg = email.Message.Message()
!             msg.set_payload(obj)
!         return msg
  
      def tokenize(self, obj):
--- 987,991 ----
  
      def get_message(self, obj):
!         return get_message(obj)
  
      def tokenize(self, obj):
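
Note on the check-in above: the docstring removed from tokenizer.py describes what the shared helper is expected to do. The following is a minimal sketch of that behaviour, adapted to the Python 3 email package (lowercase module names such as email.errors). It is reconstructed from the removed method body shown in the diff, not copied from mboxutils.py, whose contents are not part of this message. Current Python parsers tend to record defects rather than raise, so the fallback branch is kept mainly to mirror the docstring.

```python
import email
import email.errors
import email.message


def get_message(obj):
    """Return an email Message for a Message, a string, or a file-like object."""
    # Already a Message: hand it back untouched.
    if isinstance(obj, email.message.Message):
        return obj
    # File-like objects are read into a string first.
    if hasattr(obj, "read"):
        obj = obj.read()
    try:
        msg = email.message_from_string(obj)
    except email.errors.MessageParseError:
        # The headers are most likely damaged, so drop everything through
        # the first blank line and wrap the remaining text in a bare Message.
        i = obj.find("\n\n")
        if i >= 0:
            obj = obj[i + 2:]
        msg = email.message.Message()
        msg.set_payload(obj)
    return msg
```

Callers such as Tokenizer.get_message() can then delegate to this one function instead of each keeping a private parsing helper.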


From anthonybaxter@users.sourceforge.net  Wed Nov  6 20:07:37 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Wed, 06 Nov 2002 12:07:37 -0800
Subject: [Spambayes-checkins] website related.ht,1.3,1.4
Message-ID: <E189WSf-0000sP-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv3271

Modified Files:
	related.ht 
Log Message:
couple more projects, from Alexandre Fayolle


Index: related.ht
===================================================================
RCS file: /cvsroot/spambayes/website/related.ht,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** related.ht	1 Nov 2002 04:06:49 -0000	1.3
--- related.ht	6 Nov 2002 20:07:34 -0000	1.4
***************
*** 12,16 ****
  <li><a href="http://www.ai.mit.edu/~jrennie/ifile/">ifile</a>, a Naive Bayes classification system.
  <li><a href="http://sourceforge.net/projects/pasp">PASP</a>, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.
! <li> ...
  </ul>
  
--- 12,17 ----
  <li><a href="http://www.ai.mit.edu/~jrennie/ifile/">ifile</a>, a Naive Bayes classification system.
  <li><a href="http://sourceforge.net/projects/pasp">PASP</a>, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.
! <li><a href="http://pauillac.inria.fr/~xleroy/software.html">spamoracle</a>, a Paul Graham based spam filter written in OCaml, designed for use with procmail.
! <li><a href="http://popfile.sf.net">popfile</a>, a pop3 proxy written in Perl with a Naive Bayes classifier.
  </ul>
  

From anthonybaxter@users.sourceforge.net  Wed Nov  6 22:12:52 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Wed, 06 Nov 2002 14:12:52 -0800
Subject: [Spambayes-checkins] spambayes table.py,1.4,1.5
Message-ID: <E189YPs-00041l-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15111

Modified Files:
	table.py 
Log Message:
added '-m' option to print means for each row.

little bit of a cleanup.


Index: table.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/table.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** table.py	26 Oct 2002 15:30:23 -0000	1.4
--- table.py	6 Nov 2002 22:12:48 -0000	1.5
***************
*** 2,6 ****
  
  """
! table.py base1 base2 ... baseN
  
  Combines output from base1.txt, base2.txt, etc., which are created by
--- 2,6 ----
  
  """
! table.py [-m] base1 base2 ... baseN
  
  Combines output from base1.txt, base2.txt, etc., which are created by
***************
*** 8,15 ****
  comparison statistics to stdout.  Each input file is represented by
  one column in the table.
- """
  
! import sys
! import re
  
  # Return
--- 8,15 ----
  comparison statistics to stdout.  Each input file is represented by
  one column in the table.
  
! Optional argument -m shows a final column with the mean value of each
! statistic.
! """
  
  # Return
***************
*** 46,56 ****
          line = get()
          if line.startswith('-> <stat> tested'):
!             # -> <stat> tested 1910 hams & 948 spams against 2741 hams & 948 spams
!             #  0      1      2    3    4 5   6
              print line,
  
          elif line.find(' items; mean ') > 0 and line.find('for all runs') > 0:
!             # -> <stat> Ham scores for all runs: 2741 items; mean 0.86; sdev 6.28
!             #                                             0          1          2
              vals = line.split(';')
              mean = float(vals[1].split()[-1])
--- 46,56 ----
          line = get()
          if line.startswith('-> <stat> tested'):
!             # <stat> tested 1910 hams & 948 spams against 2741 hams & 948 spams
!             #      1      2    3    4 5   6
              print line,
  
          elif line.find(' items; mean ') > 0 and line.find('for all runs') > 0:
!             # <stat> Ham scores for all runs: 2741 items; mean 0.86; sdev 6.28
!             #                                          0          1          2
              vals = line.split(';')
              mean = float(vals[1].split()[-1])
***************
*** 103,184 ****
          return fn
  
! fname = "filename: "
! fnam2 = "          "
! ratio = "ham:spam: "
! rat2  = "          "
! fptot = "fp total: "
! fpper = "fp %:     "
! fntot = "fn total: "
! fnper = "fn %:     "
! untot = "unsure t: "
! unper = "unsure %: "
! rcost = "real cost:"
! bcost = "best cost:"
  
! hmean = "h mean:   "
! hsdev = "h sdev:   "
! smean = "s mean:   "
! ssdev = "s sdev:   "
! meand = "mean diff:"
! kval  = "k:        "
  
! for filename in sys.argv[1:]:
!     filename = windowsfy(filename)
!     (htest, stest, fp, fn, un, fpp, fnp, unp, cost, bestcost,
!      hamdevall, spamdevall) = suck(file(filename))
!     if filename.endswith('.txt'):
!         filename = filename[:-4]
!     filename = filename[filename.rfind('/')+1:]
!     filename = filename[filename.rfind("\\")+1:]
!     if len(fname) > len(fnam2):
!         fname += "        "
!         fname = fname[0:(len(fnam2) + 8)]
!         fnam2 += " %7s" % filename
!     else:
!         fnam2 += "        "
!         fnam2 = fnam2[0:(len(fname) + 8)]
!         fname += " %7s" % filename
!     if len(ratio) > len(rat2):
!         ratio += "        "
!         ratio = ratio[0:(len(rat2) + 8)]
!         rat2  += " %7s" % ("%d:%d" % (htest, stest))
!     else:
!         rat2  += "        "
!         rat2  = rat2[0:(len(ratio) + 8)]
!         ratio += " %7s" % ("%d:%d" % (htest, stest))
!     fptot += "%8d"   % fp
!     fpper += "%8.2f" % fpp
!     fntot += "%8d"   % fn
!     fnper += "%8.2f" % fnp
!     untot += "%8d"   % un
!     unper += "%8.2f" % unp
!     rcost += "%8s"   % ("$%.2f" % cost)
!     bcost += "%8s"   % ("$%.2f" % bestcost)
!     hmean += "%8.2f" % hamdevall[0]
!     hsdev += "%8.2f" % hamdevall[1]
!     smean += "%8.2f" % spamdevall[0]
!     ssdev += "%8.2f" % spamdevall[1]
!     meand += "%8.2f" % (spamdevall[0] - hamdevall[0])
!     k = (spamdevall[0] - hamdevall[0]) / (spamdevall[1] + hamdevall[1])
!     kval  += "%8.2f" % k
  
! print fname
! if len(fnam2.strip()) > 0:
!     print fnam2
! print ratio
! if len(rat2.strip()) > 0:
!     print rat2
! print fptot
! print fpper
! print fntot
! print fnper
! print untot
! print unper
! print rcost
! print bcost
! print hmean
! print hsdev
! print smean
! print ssdev
! print meand
! print kval
--- 103,231 ----
          return fn
  
! def table():
!     import getopt, sys
  
!     showMean = 0
  
!     fname = "filename: "
!     fnam2 = "          "
!     ratio = "ham:spam: "
!     rat2  = "          "
!     fptot = "fp total: "
!     fpper = "fp %:     "
!     fntot = "fn total: "
!     fnper = "fn %:     "
!     untot = "unsure t: "
!     unper = "unsure %: "
!     rcost = "real cost:"
!     bcost = "best cost:"
  
!     hmean = "h mean:   "
!     hsdev = "h sdev:   "
!     smean = "s mean:   "
!     ssdev = "s sdev:   "
!     meand = "mean diff:"
!     kval  = "k:        "
! 
!     tfptot = tfpper = tfntot = tfnper = tuntot = tunper = trcost = tbcost = \
!     thmean = thsdev = tsmean = tssdev = tmeand = tkval =  0
! 
!     args, fileargs = getopt.getopt(sys.argv[1:], 'm')
!     for arg, val in args:
!         if arg == "-m":
!             showMean = 1
! 
!     for filename in fileargs:
!         filename = windowsfy(filename)
!         (htest, stest, fp, fn, un, fpp, fnp, unp, cost, bestcost,
!          hamdevall, spamdevall) = suck(file(filename))
!         if filename.endswith('.txt'):
!             filename = filename[:-4]
!         filename = filename[filename.rfind('/')+1:]
!         filename = filename[filename.rfind("\\")+1:]
!         if len(fname) > len(fnam2):
!             fname += "        "
!             fname = fname[0:(len(fnam2) + 8)]
!             fnam2 += " %7s" % filename
!         else:
!             fnam2 += "        "
!             fnam2 = fnam2[0:(len(fname) + 8)]
!             fname += " %7s" % filename
!         if len(ratio) > len(rat2):
!             ratio += "        "
!             ratio = ratio[0:(len(rat2) + 8)]
!             rat2  += " %7s" % ("%d:%d" % (htest, stest))
!         else:
!             rat2  += "        "
!             rat2  = rat2[0:(len(ratio) + 8)]
!             ratio += " %7s" % ("%d:%d" % (htest, stest))
!         fptot += "%8d"   % fp
!         tfptot += fp
!         fpper += "%8.2f" % fpp
!         tfpper += fpp
!         fntot += "%8d"   % fn
!         tfntot += fn
!         fnper += "%8.2f" % fnp
!         tfnper += fnp
!         untot += "%8d"   % un
!         tuntot += un
!         unper += "%8.2f" % unp
!         tunper += unp
!         rcost += "%8s"   % ("$%.2f" % cost)
!         trcost += cost
!         bcost += "%8s"   % ("$%.2f" % bestcost)
!         tbcost += bestcost
!         hmean += "%8.2f" % hamdevall[0]
!         thmean += hamdevall[0]
!         hsdev += "%8.2f" % hamdevall[1]
!         thsdev += hamdevall[1]
!         smean += "%8.2f" % spamdevall[0]
!         tsmean += spamdevall[0]
!         ssdev += "%8.2f" % spamdevall[1]
!         tssdev += spamdevall[1]
!         meand += "%8.2f" % (spamdevall[0] - hamdevall[0])
!         tmeand += (spamdevall[0] - hamdevall[0])
!         k = (spamdevall[0] - hamdevall[0]) / (spamdevall[1] + hamdevall[1])
!         kval  += "%8.2f" % k
!         tkval  += k
! 
!     nfiles = len(fileargs)
!     if nfiles and showMean:
!         fptot += "%12d"   % (tfptot/nfiles)
!         fpper += "%12.2f" % (tfpper/nfiles)
!         fntot += "%12d"   % (tfntot/nfiles)
!         fnper += "%12.2f" % (tfnper/nfiles)
!         untot += "%12d"   % (tuntot/nfiles)
!         unper += "%12.2f" % (tunper/nfiles)
!         rcost += "%12s"   % ("$%.2f" % (trcost/nfiles))
!         bcost += "%12s"   % ("$%.2f" % (tbcost/nfiles))
!         hmean += "%12.2f" % (thmean/nfiles)
!         hsdev += "%12.2f" % (thsdev/nfiles)
!         smean += "%12.2f" % (tsmean/nfiles)
!         ssdev += "%12.2f" % (tssdev/nfiles)
!         meand += "%12.2f" % (tmeand/nfiles)
!         kval  += "%12.2f" % (tkval/nfiles)
! 
!     print fname
!     if len(fnam2.strip()) > 0:
!         print fnam2
!     print ratio
!     if len(rat2.strip()) > 0:
!         print rat2
!     print fptot
!     print fpper
!     print fntot
!     print fnper
!     print untot
!     print unper
!     print rcost
!     print bcost
!     print hmean
!     print hsdev
!     print smean
!     print ssdev
!     print meand
!     print kval
! 
! if __name__ == "__main__":
!     table()
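
Note on the check-in above: the -m handling boils down to parsing one flag with getopt and appending a wider final column holding the mean of each row. The sketch below isolates those two steps in Python 3. The helper names parse_args and format_row are hypothetical and do not appear in table.py; the column widths (8 for per-file values, 12 for the mean) follow the diff.

```python
import getopt


def parse_args(argv):
    """Split argv into (show_mean, basenames); only -m is recognised."""
    opts, files = getopt.getopt(argv, "m")
    show_mean = any(opt == "-m" for opt, _ in opts)
    return show_mean, files


def format_row(label, values, show_mean):
    """Format one statistics row, optionally appending the mean of its values."""
    row = label + "".join("%8.2f" % v for v in values)
    # Guard against an empty file list, as the nfiles check in the diff does.
    if show_mean and values:
        row += "%12.2f" % (sum(values) / len(values))
    return row
```

For example, format_row("fp %:     ", [1.0, 2.0], True) yields the two per-file columns followed by a 12-character column containing 1.50.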


From mhammond@users.sourceforge.net  Thu Nov  7 02:54:18 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 06 Nov 2002 18:54:18 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs ManagerDialog.py,1.6,1.7
Message-ID: <E189coE-0004nh-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv18380/dialogs

Modified Files:
	ManagerDialog.py 
Log Message:
As per report on mailing list, don't insist on an "Unsure" folder before
filtering can be enabled.  Also wrapped a few long lines.


Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** ManagerDialog.py	5 Nov 2002 21:51:53 -0000	1.6
--- ManagerDialog.py	7 Nov 2002 02:54:16 -0000	1.7
***************
*** 69,74 ****
          self.checkbox_items = [
              (IDC_BUT_FILTER_ENABLE, "self.mgr.config.filter.enabled"),
!             (IDC_BUT_TRAIN_FROM_SPAM_FOLDER, "self.mgr.config.training.train_recovered_spam"),
!             (IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"),
          ]
  
--- 69,76 ----
          self.checkbox_items = [
              (IDC_BUT_FILTER_ENABLE, "self.mgr.config.filter.enabled"),
!             (IDC_BUT_TRAIN_FROM_SPAM_FOLDER,
!                      "self.mgr.config.training.train_recovered_spam"),
!             (IDC_BUT_TRAIN_TO_SPAM_FOLDER,
!                      "self.mgr.config.training.train_manual_spam"),
          ]
  
***************
*** 105,114 ****
          ok_to_enable = operator.truth(config.watch_folder_ids)
          if not ok_to_enable:
!             filter_status = "You must define folders to watch for new messages"
          if ok_to_enable:
              ok_to_enable = nspam >= min_spam and nham >= min_ham
              if not ok_to_enable:
!                 filter_status = "There must be %d good and %d spam messages\n" \
!                                 "trained before filtering can be enabled" \
                                  % (min_ham, min_spam)
          if ok_to_enable:
--- 107,118 ----
          ok_to_enable = operator.truth(config.watch_folder_ids)
          if not ok_to_enable:
!             filter_status = "You must define folders to watch "\
!                             "for new messages"
          if ok_to_enable:
              ok_to_enable = nspam >= min_spam and nham >= min_ham
              if not ok_to_enable:
!                 filter_status = "There must be %d good and %d spam " \
!                                 "messages\ntrained before filtering " \
!                                 "can be enabled" \
                                  % (min_ham, min_spam)
          if ok_to_enable:
***************
*** 116,137 ****
              ok_to_enable = operator.truth(config.spam_folder_id)
              if ok_to_enable:
!                 certain_spam_name = self.mgr.FormatFolderNames([config.spam_folder_id], False)
!                 ok_to_enable = operator.truth(config.unsure_folder_id)
!                 if ok_to_enable:
!                     unsure_name = self.mgr.FormatFolderNames([config.unsure_folder_id], False)
                  else:
!                     filter_status = "You must define the folder to receive your possible spam"
              else:
!                 filter_status = "You must define the folder to receive your certain spam"
!                 
              # whew
              if ok_to_enable:
!                 watch_names = self.mgr.FormatFolderNames(config.watch_folder_ids, config.watch_include_sub)
!                 filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \
!                                 % (watch_names, certain_spam_name, unsure_name)
  
          self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable)
          enabled = config.enabled
!         self.GetDlgItem(IDC_BUT_FILTER_ENABLE).SetCheck(ok_to_enable and enabled)
          self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
  
--- 120,148 ----
              ok_to_enable = operator.truth(config.spam_folder_id)
              if ok_to_enable:
!                 certain_spam_name = self.mgr.FormatFolderNames(
!                                         [config.spam_folder_id], False)
!                 if config.unsure_folder_id:
!                     unsure_name = self.mgr.FormatFolderNames(
!                                         [config.unsure_folder_id], False)
!                     unsure_text = "unsure managed in '%s'" % (unsure_name,)
                  else:
!                     unsure_text = "unsure messages untouched"
              else:
!                 filter_status = "You must define the folder to " \
!                                 "receive your certain spam"
! 
              # whew
              if ok_to_enable:
!                 watch_names = self.mgr.FormatFolderNames(
!                         config.watch_folder_ids, config.watch_include_sub)
!                 filter_status = "Watching '%s'. Spam managed in '%s', %s" \
!                                 % (watch_names,
!                                    certain_spam_name,
!                                    unsure_text)
  
          self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable)
          enabled = config.enabled
!         self.GetDlgItem(IDC_BUT_FILTER_ENABLE).SetCheck(
!                                                 ok_to_enable and enabled)
          self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
  
***************
*** 139,143 ****
          if code == win32con.BN_CLICKED:
  
!             fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html")
              fname = os.path.abspath(fname)
              print fname
--- 150,156 ----
          if code == win32con.BN_CLICKED:
  
!             fname = os.path.join(os.path.dirname(__file__),
!                                  os.pardir,
!                                  "about.html")
              fname = os.path.abspath(fname)
              print fname
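
Note on the check-in above: the point of the change is that a missing "Unsure" folder no longer blocks filtering. The sketch below restates that decision as a plain Python 3 function. The name filter_status_for and its arguments are hypothetical stand-ins for the dialog's config lookups, and the training-count check from the diff is left out for brevity.

```python
def filter_status_for(watch_names, spam_name, unsure_name=None):
    """Return (ok_to_enable, status_text) for the given folder names."""
    if not watch_names:
        return False, "You must define folders to watch for new messages"
    if not spam_name:
        return False, "You must define the folder to receive your certain spam"
    # The unsure folder is optional: without one, unsure mail stays where it is.
    if unsure_name:
        unsure_text = "unsure managed in '%s'" % (unsure_name,)
    else:
        unsure_text = "unsure messages untouched"
    status = "Watching '%s'. Spam managed in '%s', %s" % (
        watch_names, spam_name, unsure_text)
    return True, status
```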


From mhammond@users.sourceforge.net  Thu Nov  7 05:05:06 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 06 Nov 2002 21:05:06 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.27,1.28
Message-ID: <E189eqo-00018e-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv3431

Modified Files:
	addin.py 
Log Message:
Revamp the "delete as spam" and "recover from spam" buttons - now 2
buttons, and the visibility state changes depending on the folder.  The
"unsure" folder now has both buttons available.  Probably lighter on
Outlook too, as all we do now is toggle a Visible property on a folder
change event.
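
The visibility rule described here can be sketched as a small pure function (Python 3). The name button_visibility and the use of plain values for folders are assumptions made for illustration; the add-in itself compares MAPI folder objects and sets the Visible property on the two toolbar buttons.

```python
def button_visibility(current_folder, spam_folder, unsure_folder=None):
    """Return (show_delete_as, show_recover_as) for the folder being viewed."""
    # Default for ordinary folders: only "Delete As Spam" is shown.
    show_delete_as, show_recover_as = True, False
    if current_folder is not None:
        if spam_folder is not None and current_folder == spam_folder:
            # Spam folder: only "Recover from Spam" makes sense.
            show_delete_as, show_recover_as = False, True
        if unsure_folder is not None and current_folder == unsure_folder:
            # Unsure folder: a message may go either way, so show both.
            show_delete_as, show_recover_as = True, True
    return show_delete_as, show_recover_as
```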


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** addin.py	4 Nov 2002 22:50:41 -0000	1.27
--- addin.py	7 Nov 2002 05:05:03 -0000	1.28
***************
*** 239,278 ****
      new_msg.Display()
  
! # The "Delete As Spam" and "Recover Spam" button
! # The event from Outlook's explorer that our folder has changed.
! class ButtonDeleteAsExplorerEvent:
!     def Init(self, but):
!         self.but = but
!     def Close(self):
!         self.but = None
!     def OnFolderSwitch(self):
!         self.but._UpdateForFolderChange()
! 
! class ButtonDeleteAsEvent:
!     def Init(self, manager, application, explorer):
!         # NOTE - keeping a reference to 'explorer' in this event
!         # appears to cause an Outlook circular reference, and outlook
!         # never terminates (it does close, but the process remains alive)
!         # This is why we needed to use WithEvents, so the event class
!         # itself doesnt keep such a reference (and we need to keep a ref
!         # to the event class so it doesn't auto-disconnect!)
          self.manager = manager
          self.application = application
!         self.explorer_events = WithEvents(explorer,
!                                            ButtonDeleteAsExplorerEvent)
!         self.set_for_as_spam = None
!         self.explorer_events.Init(self)
!         self._UpdateForFolderChange()
! 
      def Close(self):
!         self.manager = self.application = self.explorer = None
! 
!     def _UpdateForFolderChange(self):
          explorer = self.application.ActiveExplorer()
          if explorer is None:
              print "** Folder Change, but don't have an explorer"
              return
          outlook_folder = explorer.CurrentFolder
!         is_spam = False
          if outlook_folder is not None:
              mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
--- 239,262 ----
      new_msg.Display()
  
! # Events from our Explorer instance - currently used to enable/disable
! # controls
! class ExplorerEvent:
!     def Init(self, manager, application, but_delete_as, but_recover_as):
          self.manager = manager
          self.application = application
!         self.but_delete_as = but_delete_as
!         self.but_recover_as = but_recover_as
      def Close(self):
!         self.but_delete_as = self.but_recover_as = None
!     def OnFolderSwitch(self):
!         # Work out what folder we are in.
          explorer = self.application.ActiveExplorer()
          if explorer is None:
              print "** Folder Change, but don't have an explorer"
              return
+ 
          outlook_folder = explorer.CurrentFolder
!         show_delete_as = True
!         show_recover_as = False
          if outlook_folder is not None:
              mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
***************
*** 281,314 ****
                  look_folder = self.manager.message_store.GetFolder(look_id)
                  if mapi_folder == look_folder:
!                     is_spam = True
!             if not is_spam:
!                 look_id = self.manager.config.filter.unsure_folder_id
!                 if look_id:
!                     look_folder = self.manager.message_store.GetFolder(look_id)
!                     if mapi_folder == look_folder:
!                         is_spam = True
!         if is_spam:
!             set_for_as_spam = False
!         else:
!             set_for_as_spam = True
!         if set_for_as_spam != self.set_for_as_spam:
!             if set_for_as_spam:
!                 image = "delete_as_spam.bmp"
!                 self.Caption = "Delete As Spam"
!                 self.TooltipText = \
                          "Move the selected message to the Spam folder,\n" \
                          "and train the system that this is Spam."
              else:
!                 image = "recover_ham.bmp"
!                 self.Caption = "Recover from Spam"
!                 self.TooltipText = \
!                         "Recovers the selected item back to the folder\n" \
!                         "it was filtered from (or to the Inbox if this\n" \
!                         "folder is not known), and trains the system that\n" \
!                         "this is a good message\n"
!             # Set the image.
!             print "Setting image to", image
!             SetButtonImage(self, image)
!             self.set_for_as_spam = set_for_as_spam
  
      def OnClick(self, button, cancel):
--- 265,341 ----
                  look_folder = self.manager.message_store.GetFolder(look_id)
                  if mapi_folder == look_folder:
!                     # This is the Spam folder - only show "recover"
!                     show_recover_as = True
!                     show_delete_as = False
!             # Check if uncertain
!             look_id = self.manager.config.filter.unsure_folder_id
!             if look_id:
!                 look_folder = self.manager.message_store.GetFolder(look_id)
!                 if mapi_folder == look_folder:
!                     show_recover_as = True
!                     show_delete_as = True
!         self.but_recover_as.Visible = show_recover_as
!         self.but_delete_as.Visible = show_delete_as
! 
! # The "Delete As Spam" and "Recover Spam" button
! # The event from Outlook's explorer that our folder has changed.
! class ButtonDeleteAsEventBase:
!     def Init(self, manager, application):
!         # NOTE - keeping a reference to 'explorer' in this event
!         # appears to cause an Outlook circular reference, and outlook
!         # never terminates (it does close, but the process remains alive)
!         # This is why we needed to use WithEvents, so the event class
!         # itself doesnt keep such a reference (and we need to keep a ref
!         # to the event class so it doesn't auto-disconnect!)
!         self.manager = manager
!         self.application = application
! 
!     def Close(self):
!         self.manager = self.application = None
! 
! class ButtonDeleteAsSpamEvent(ButtonDeleteAsEventBase):
!     def Init(self, manager, application):
!         ButtonDeleteAsEventBase.Init(self, manager, application)
!         image = "delete_as_spam.bmp"
!         self.Caption = "Delete As Spam"
!         self.TooltipText = \
                          "Move the selected message to the Spam folder,\n" \
                          "and train the system that this is Spam."
+         SetButtonImage(self, image)
+ 
+     def OnClick(self, button, cancel):
+         msgstore = self.manager.message_store
+         msgstore_messages = self.manager.addin.GetSelectedMessages(True)
+         if not msgstore_messages:
+             return
+         # Delete this item as spam.
+         spam_folder_id = self.manager.config.filter.spam_folder_id
+         spam_folder = msgstore.GetFolder(spam_folder_id)
+         if not spam_folder:
+             win32ui.MessageBox("You must configure the Spam folder",
+                                "Invalid Configuration")
+             return
+         import train
+         for msgstore_message in msgstore_messages:
+             # Must train before moving, else we lose the message!
+             print "Training on message - ",
+             if train.train_message(msgstore_message, True, self.manager, rescore = True):
+                 print "trained as spam"
              else:
!                 print "already was trained as spam"
!             # Now move it.
!             msgstore_message.MoveTo(spam_folder)
! 
! class ButtonRecoverFromSpamEvent(ButtonDeleteAsEventBase):
!     def Init(self, manager, application):
!         ButtonDeleteAsEventBase.Init(self, manager, application)
!         image = "recover_ham.bmp"
!         self.Caption = "Recover from Spam"
!         self.TooltipText = \
!                 "Recovers the selected item back to the folder\n" \
!                 "it was filtered from (or to the Inbox if this\n" \
!                 "folder is not known), and trains the system that\n" \
!                 "this is a good message\n"
!         SetButtonImage(self, image)
  
      def OnClick(self, button, cancel):
***************
*** 317,340 ****
          if not msgstore_messages:
              return
!         if self.set_for_as_spam:
!             # Delete this item as spam.
!             spam_folder_id = self.manager.config.filter.spam_folder_id
!             spam_folder = msgstore.GetFolder(spam_folder_id)
!             if not spam_folder:
!                 win32ui.MessageBox("You must configure the Spam folder",
!                                    "Invalid Configuration")
!                 return
!             import train
!             for msgstore_message in msgstore_messages:
!                 # Must train before moving, else we lose the message!
!                 print "Training on message - ",
!                 if train.train_message(msgstore_message, True, self.manager, rescore = True):
!                     print "trained as spam"
!                 else:
!                     print "already was trained as spam"
!                 # Now move it.
!                 msgstore_message.MoveTo(spam_folder)
!         else:
!             win32ui.MessageBox("Please be patient <wink>")
  
  # Helpers to work with images on buttons/toolbars.
--- 344,364 ----
          if not msgstore_messages:
              return
!         # Recover to where they were moved from
!         # Get the inbox as the default place to restore to
!         # (in case we don't know (early code) or the folder was removed, etc.)
!         inbox_folder = msgstore.GetFolder(
!                     self.application.Session.GetDefaultFolder(
!                         constants.olFolderInbox))
!         import train
!         for msgstore_message in msgstore_messages:
!             # Must train before moving, else we lose the message!
!             print "Training on message - ",
!             if train.train_message(msgstore_message, False, self.manager, rescore = True):
!                 print "trained as ham"
!             else:
!                 print "already was trained as ham"
!             # Now move it.
!             # XXX - still don't write the source, so no point looking :(
!             msgstore_message.MoveTo(inbox_folder)
  
  # Helpers to work with images on buttons/toolbars.
***************
*** 379,382 ****
--- 403,407 ----
          assert self.manager.addin is None, "Should not already have an addin"
          self.manager.addin = self
+         self.explorer_events = None
  
          # ActiveExplorer may be none when started without a UI (eg, WinCE synchronisation)
***************
*** 385,414 ****
              bars = activeExplorer.CommandBars
              toolbar = bars.Item("Standard")
!             # Add our "Delete as ..." button
!             button = toolbar.Controls.Add(Type=constants.msoControlButton, Temporary=True)
              # Hook events for the item
              button.BeginGroup = True
!             button = DispatchWithEvents(button, ButtonDeleteAsEvent)
!             button.Init(self.manager, application, activeExplorer)
              self.buttons.append(button)
  
              # Add a pop-up menu to the toolbar
!             popup = toolbar.Controls.Add(Type=constants.msoControlPopup, Temporary=True)
              popup.Caption="Anti-Spam"
              popup.TooltipText = "Anti-Spam filters and functions"
              popup.Enabled = True
!             # Convert from "CommandBarItem" to derived "CommandBarPopup"
!             # Not sure if we should be able to work this out ourselves, but no
!             # introspection I tried seemed to indicate we can.  VB does it via
!             # strongly-typed declarations.
              popup = CastTo(popup, "CommandBarPopup")
              # And add our children.
-             self._AddPopup(popup, ShowClues, (self.manager, application),
-                            Caption="Show spam clues for current message",
-                            Enabled=True)
              self._AddPopup(popup, manager.ShowManager, (self.manager,),
                             Caption="Anti-Spam Manager...",
                             TooltipText = "Show the Anti-Spam manager dialog.",
                             Enabled = True)
  
          self.FiltersChanged()
--- 410,460 ----
              bars = activeExplorer.CommandBars
              toolbar = bars.Item("Standard")
!             # Add our "Delete as ..." and "Recover as" buttons
!             but_delete_as = button = toolbar.Controls.Add(
!                                     Type=constants.msoControlButton,
!                                     Temporary=True)
              # Hook events for the item
              button.BeginGroup = True
!             button = DispatchWithEvents(button, ButtonDeleteAsSpamEvent)
!             button.Init(self.manager, application)
              self.buttons.append(button)
+             # And again for "Recover as"
+             but_recover_as = button = toolbar.Controls.Add(
+                                     Type=constants.msoControlButton,
+                                     Temporary=True)
+             button = DispatchWithEvents(button, ButtonRecoverFromSpamEvent)
+             self.buttons.append(button)
+             # Hook our explorer events, and pass the buttons.
+             button.Init(self.manager, application)
+ 
+             self.explorer_events = WithEvents(activeExplorer,
+                                                ExplorerEvent)
  
+             self.explorer_events.Init(self.manager, application, but_delete_as, but_recover_as)
+             # And prime the event handler.
+             self.explorer_events.OnFolderSwitch()
+ 
+             # The main tool-bar dropdown with all our entries.
              # Add a pop-up menu to the toolbar
!             popup = toolbar.Controls.Add(
!                                 Type=constants.msoControlPopup,
!                                 Temporary=True)
              popup.Caption="Anti-Spam"
              popup.TooltipText = "Anti-Spam filters and functions"
              popup.Enabled = True
!             # Convert from "CommandBarItem" to derived
!             # "CommandBarPopup" Not sure if we should be able to work
!             # this out ourselves, but no introspection I tried seemed
!             # to indicate we can.  VB does it via strongly-typed
!             # declarations.
              popup = CastTo(popup, "CommandBarPopup")
              # And add our children.
              self._AddPopup(popup, manager.ShowManager, (self.manager,),
                             Caption="Anti-Spam Manager...",
                             TooltipText = "Show the Anti-Spam manager dialog.",
                             Enabled = True)
+             self._AddPopup(popup, ShowClues, (self.manager, application),
+                            Caption="Show spam clues for current message",
+                            Enabled=True)
  
          self.FiltersChanged()
***************
*** 499,506 ****
--- 545,556 ----
              self.manager.Close()
              self.manager = None
+ 
+         if self.explorer_events is not None:
+             self.explorer_events = None
          if self.buttons:
              for button in self.buttons:
                  button.Close()
              self.buttons = None
+ 
          print "Addin terminating: %d COM client and %d COM servers exist." \
                % (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount())
***************
*** 514,522 ****
  
      def OnAddInsUpdate(self, custom):
!         print "SpamAddin - OnAddInsUpdate", custom
      def OnStartupComplete(self, custom):
!         print "SpamAddin - OnStartupComplete", custom
      def OnBeginShutdown(self, custom):
!         print "SpamAddin - OnBeginShutdown", custom
  
  def RegisterAddin(klass):
--- 564,572 ----
  
      def OnAddInsUpdate(self, custom):
!         pass
      def OnStartupComplete(self, custom):
!         pass
      def OnBeginShutdown(self, custom):
!         pass
  
  def RegisterAddin(klass):


From tim.one@comcast.net  Thu Nov  7 05:58:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 00:58:58 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.27,1.28
In-Reply-To: <E189eqo-00018e-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEGJCHAB.tim.one@comcast.net>

[Mark Hammond]
> Modified Files:
> 	addin.py
> Log Message:
> Revamp the "delete as spam" and "recover from spam" buttons - now 2
> buttons, and the visibility state changes depending on the folder.

Wow -- a 21KB patch to change a button.  I *knew* there was a reason I
always left this stuff to you <wink>.


From jvr@users.sourceforge.net  Thu Nov  7 22:27:05 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:27:05 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.10,1.11
Message-ID: <E189v7B-0006Ff-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23622

Modified Files:
	pop3proxy.py 
Log Message:
- added True/False for compatibility with Python 2.2
- write out trained messages to files, to make it easier
  to rebuild the database


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** pop3proxy.py	5 Nov 2002 22:18:56 -0000	1.10
--- pop3proxy.py	7 Nov 2002 22:27:02 -0000	1.11
***************
*** 28,31 ****
--- 28,34 ----
  For safety, and to help debugging, the whole POP3 conversation is
  written out to _pop3proxy.log for each run.
+ 
+ To make rebuilding the database easier, trained messages are appended
+ to _pop3proxyham.mbox and _pop3proxyspam.mbox.
  """
  
***************
*** 37,40 ****
--- 40,49 ----
  __credits__ = "Tim Peters, Neale Pickett, all the spambayes contributors."
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
  
  import sys, re, operator, errno, getopt, cPickle, cStringIO, time
***************
*** 609,614 ****
  
      def onUpload(self, params):
!         message = params.get('file') or params.get('text')            
          isSpam = (params['which'] == 'spam')
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("""<p>Trained on your message. Saving database...</p>""")
--- 618,634 ----
  
      def onUpload(self, params):
!         message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         # Append the message to a file, to make it easier to rebuild
+         # the database later.
+         message = message.replace('\r\n', '\n').replace('\r', '\n')
+         if isSpam:
+             f = open("_pop3proxyspam.mbox", "a")
+         else:
+             f = open("_pop3proxyham.mbox", "a")
+         f.write("From ???@???\n")  # fake From line (XXX good enough?)
+         f.write(message)
+         f.write("\n")
+         f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("""<p>Trained on your message. Saving database...</p>""")
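
[Editor's note on the "XXX good enough?" comment in the hunk above.]
The patch writes a fixed "From ???@???" separator and does not quote body
lines that themselves begin with "From ", so a reader that splits on such
lines could cut one trained message into several. The following is only a
sketch of one common way to handle that (the "mboxo" convention of
prefixing such lines with ">"); it is not part of the checked-in code. The
helper name append_to_mbox and the default sender "MAILER-DAEMON" are
assumptions made for illustration.

```python
import time

def append_to_mbox(path, message, sender="MAILER-DAEMON"):
    """Append one message to an mbox file (hypothetical helper)."""
    # Normalize line endings the same way the patch does.
    message = message.replace('\r\n', '\n').replace('\r', '\n')
    # Quote body lines that would otherwise look like a message separator.
    escaped = []
    for line in message.split('\n'):
        if line.startswith("From "):
            line = ">" + line
        escaped.append(line)
    body = "\n".join(escaped)
    if not body.endswith("\n"):
        body += "\n"
    f = open(path, "a")
    try:
        # Separator line: "From <sender> <asctime-style date>".
        # The sender is a placeholder, as in the original patch.
        f.write("From %s %s\n" % (sender, time.asctime(time.gmtime())))
        f.write(body)
        f.write("\n")  # blank line between messages
    finally:
        f.close()
```

With something like this, onUpload could call
append_to_mbox("_pop3proxyspam.mbox", message) or
append_to_mbox("_pop3proxyham.mbox", message) depending on isSpam. A tool
that rebuilds the database from these files would then have to strip the
leading ">" again if it wants the original text back.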


From jvr@users.sourceforge.net  Thu Nov  7 22:30:12 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:30:12 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.28,1.29 config.py,1.3,1.4
 filter.py,1.12,1.13 manager.py,1.32,1.33 msgstore.py,1.22,1.23
 train.py,1.15,1.16
Message-ID: <E189vAC-0006c6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv25250/Outlook2000

Modified Files:
	addin.py config.py filter.py manager.py msgstore.py train.py 
Log Message:
Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py.

Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** addin.py	7 Nov 2002 05:05:03 -0000	1.28
--- addin.py	7 Nov 2002 22:30:08 -0000	1.29
***************
*** 4,7 ****
--- 4,14 ----
  import warnings
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  if sys.version_info >= (2, 3):
      # sick off the new hex() warnings!

Index: config.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/config.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** config.py	31 Oct 2002 21:56:59 -0000	1.3
--- config.py	7 Nov 2002 22:30:09 -0000	1.4
***************
*** 3,6 ****
--- 3,13 ----
  # or as a module.
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  class _ConfigurationContainer:
      def __init__(self, **kw):

Index: filter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** filter.py	1 Nov 2002 02:03:42 -0000	1.12
--- filter.py	7 Nov 2002 22:30:09 -0000	1.13
***************
*** 4,8 ****
  # Copyright PSF, license under the PSF license
  
! def filter_message(msg, mgr, all_actions = True):
      config = mgr.config.filter
      prob = mgr.score(msg)
--- 4,15 ----
  # Copyright PSF, license under the PSF license
  
! try:
!     True, False
! except NameError:
!     # Maintain compatibility with Python 2.2
!     True, False = 1, 0
! 
! 
! def filter_message(msg, mgr, all_actions=True):
      config = mgr.config.filter
      prob = mgr.score(msg)

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** manager.py	4 Nov 2002 00:50:09 -0000	1.32
--- manager.py	7 Nov 2002 22:30:09 -0000	1.33
***************
*** 13,16 ****
--- 13,22 ----
  
  try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ try:
      this_filename = os.path.abspath(__file__)
  except NameError:
***************
*** 83,87 ****
          return ret
  
!     def EnsureOutlookFieldsForFolder(self, folder_id, include_sub = False):
          # Ensure that our fields exist on the Outlook *folder*
          # Setting properties via our msgstore (via Ext Mapi) gets the props
--- 89,93 ----
          return ret
  
!     def EnsureOutlookFieldsForFolder(self, folder_id, include_sub=False):
          # Ensure that our fields exist on the Outlook *folder*
          # Setting properties via our msgstore (via Ext Mapi) gets the props

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** msgstore.py	5 Nov 2002 11:44:27 -0000	1.22
--- msgstore.py	7 Nov 2002 22:30:09 -0000	1.23
***************
*** 3,6 ****
--- 3,12 ----
  import sys, os
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
  
  # Abstract definition - can be moved out when we have more than one sub-class <wink>

Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** train.py	4 Nov 2002 22:50:41 -0000	1.15
--- train.py	7 Nov 2002 22:30:09 -0000	1.16
***************
*** 7,10 ****
--- 7,17 ----
  from win32com.mapi import mapi
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  # Note our Message Database uses PR_SEARCH_KEY, *not* PR_ENTRYID, as the
  # latter changes after a Move operation - see msgstore.py


From jvr@users.sourceforge.net  Thu Nov  7 22:30:13 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:30:13 -0800
Subject: [Spambayes-checkins] 
 spambayes/pspam/pspam folder.py,1.1,1.2 profile.py,1.2,1.3
Message-ID: <E189vAD-0006co-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv25250/pspam/pspam

Modified Files:
	folder.py profile.py 
Log Message:
Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py.

Index: folder.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/folder.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** folder.py	4 Nov 2002 04:44:20 -0000	1.1
--- folder.py	7 Nov 2002 22:30:11 -0000	1.2
***************
*** 10,13 ****
--- 10,20 ----
  from pspam.message import PMessage
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  def factory(fp):
      try:

Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** profile.py	4 Nov 2002 21:25:54 -0000	1.2
--- profile.py	7 Nov 2002 22:30:11 -0000	1.3
***************
*** 14,17 ****
--- 14,24 ----
  import os
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  def open_folders(dir, names, klass):
      L = []


From jvr@users.sourceforge.net  Thu Nov  7 22:30:13 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:30:13 -0800
Subject: [Spambayes-checkins] 
 spambayes/pspam pop.py,1.2,1.3 scoremsg.py,1.1,1.2 update.py,1.1,1.2
Message-ID: <E189vAD-0006cd-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv25250/pspam

Modified Files:
	pop.py scoremsg.py update.py 
Log Message:
Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py.

Index: pop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pop.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** pop.py	5 Nov 2002 22:57:27 -0000	1.2
--- pop.py	7 Nov 2002 22:30:10 -0000	1.3
***************
*** 45,48 ****
--- 45,55 ----
  from pspam.options import options
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  HEADER = "X-Spambayes: %5.3f\r\n"
  HEADER_SIZE = len(HEADER % 0.0)

Index: scoremsg.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/scoremsg.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** scoremsg.py	4 Nov 2002 04:44:19 -0000	1.1
--- scoremsg.py	7 Nov 2002 22:30:10 -0000	1.2
***************
*** 12,15 ****
--- 12,22 ----
  import pspam.options
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  def main(fp):
      cs = ClientStorage("/var/tmp/zeospam")

Index: update.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/update.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** update.py	4 Nov 2002 04:44:19 -0000	1.1
--- update.py	7 Nov 2002 22:30:10 -0000	1.2
***************
*** 10,13 ****
--- 10,20 ----
  from pspam.options import options
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  def folder_exists(L, p):
      """Return true folder with path p exists in list L."""


From jvr@users.sourceforge.net  Thu Nov  7 22:30:12 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:30:12 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
	AsyncDialog.py,1.2,1.3
	FilterDialog.py,1.10,1.11 FolderSelector.py,1.8,1.9
	ManagerDialog.py,1.7,1.8
Message-ID: <E189vAC-0006cR-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv25250/Outlook2000/dialogs

Modified Files:
	AsyncDialog.py FilterDialog.py FolderSelector.py 
	ManagerDialog.py 
Log Message:
Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py.

Index: AsyncDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/AsyncDialog.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** AsyncDialog.py	19 Oct 2002 18:14:01 -0000	1.2
--- AsyncDialog.py	7 Nov 2002 22:30:10 -0000	1.3
***************
*** 6,9 ****
--- 6,15 ----
  import win32api
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
  
  IDC_START = 1100

Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** FilterDialog.py	2 Nov 2002 17:27:44 -0000	1.10
--- FilterDialog.py	7 Nov 2002 22:30:10 -0000	1.11
***************
*** 11,14 ****
--- 11,21 ----
  from DialogGlobals import *
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  IDC_FOLDER_WATCH = 1024
  IDC_BROWSE_WATCH = 1025

Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** FolderSelector.py	2 Nov 2002 17:11:47 -0000	1.8
--- FolderSelector.py	7 Nov 2002 22:30:10 -0000	1.9
***************
*** 9,12 ****
--- 9,19 ----
  from DialogGlobals import *
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  # Helpers for building the folder list
  class FolderSpec:

Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** ManagerDialog.py	7 Nov 2002 02:54:16 -0000	1.7
--- ManagerDialog.py	7 Nov 2002 22:30:10 -0000	1.8
***************
*** 11,14 ****
--- 11,21 ----
  from DialogGlobals import *
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  IDC_BUT_ABOUT = 1024
  IDC_BUT_TRAIN_FROM_SPAM_FOLDER = 1025


From jvr@users.sourceforge.net  Thu Nov  7 22:30:40 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:30:40 -0800
Subject: [Spambayes-checkins] 
 spambayes README.txt,1.40,1.41 TestDriver.py,1.27,1.28
 Tester.py,1.7,1.8 chi2.py,1.7,1.8 classifier.py,1.48,1.49
 hammie.py,1.36,1.37 hammiesrv.py,1.9,1.10 mboxcount.py,1.2,1.3
 mboxtest.py,1.9,1.10 neiltrain.py,1.3,1.4 rebal.py,1.8,1.9
 sets.py,1.1,1.2 splitn.py,1.3,1.4 splitndirs.py,1.6,1.7
 tokenizer.py,1.62,1.63
Message-ID: <E189vAe-0006bg-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25250

Modified Files:
	README.txt TestDriver.py Tester.py chi2.py classifier.py 
	hammie.py hammiesrv.py mboxcount.py mboxtest.py neiltrain.py 
	rebal.py sets.py splitn.py splitndirs.py tokenizer.py 
Log Message:
Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py.

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** README.txt	27 Oct 2002 22:04:32 -0000	1.40
--- README.txt	7 Nov 2002 22:30:02 -0000	1.41
***************
*** 24,28 ****
  too small to measure reliably across that much training data.
  
! The code in this project requires Python 2.2.1 (or later).
  
  
--- 24,28 ----
  too small to measure reliably across that much training data.
  
! The code in this project requires Python 2.2 (or later).
  
  
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** TestDriver.py	20 Oct 2002 05:19:48 -0000	1.27
--- TestDriver.py	7 Nov 2002 22:30:04 -0000	1.28
***************
*** 31,34 ****
--- 31,41 ----
  from Histogram import Hist
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  def printhist(tag, ham, spam, nbuckets=options.nbuckets):
      print

Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** Tester.py	20 Oct 2002 04:01:08 -0000	1.7
--- Tester.py	7 Nov 2002 22:30:04 -0000	1.8
***************
*** 1,4 ****
--- 1,11 ----
  from Options import options
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  class Test:
      # Pass a classifier instance (an instance of Bayes).

Index: chi2.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/chi2.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** chi2.py	16 Oct 2002 21:31:19 -0000	1.7
--- chi2.py	7 Nov 2002 22:30:05 -0000	1.8
***************
*** 1,4 ****
--- 1,11 ----
  import math as _math
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  def chi2Q(x2, v, exp=_math.exp, min=min):
      """Return prob(chisq >= x2, with v degrees of freedom).

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.48
retrieving revision 1.49
diff -C2 -d -r1.48 -r1.49
*** classifier.py	4 Nov 2002 21:24:52 -0000	1.48
--- classifier.py	7 Nov 2002 22:30:05 -0000	1.49
***************
*** 37,40 ****
--- 37,48 ----
  from Options import options
  from chi2 import chi2Q
+ 
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  LN2 = math.log(2)       # used frequently by chi-combining
  

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** hammie.py	6 Nov 2002 02:07:42 -0000	1.36
--- hammie.py	7 Nov 2002 22:30:05 -0000	1.37
***************
*** 53,56 ****
--- 53,63 ----
  from Options import options
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  program = sys.argv[0] # For usage(); referenced by docstring above
  

Index: hammiesrv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** hammiesrv.py	1 Nov 2002 02:55:32 -0000	1.9
--- hammiesrv.py	7 Nov 2002 22:30:06 -0000	1.10
***************
*** 30,33 ****
--- 30,40 ----
  import hammie
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  program = sys.argv[0] # For usage(); referenced by docstring above
  

Index: mboxcount.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxcount.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** mboxcount.py	6 Nov 2002 01:58:35 -0000	1.2
--- mboxcount.py	7 Nov 2002 22:30:07 -0000	1.3
***************
*** 36,39 ****
--- 36,46 ----
  from mboxutils import get_message
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  program = sys.argv[0]
  

Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** mboxtest.py	23 Sep 2002 21:20:10 -0000	1.9
--- mboxtest.py	7 Nov 2002 22:30:07 -0000	1.10
***************
*** 33,36 ****
--- 33,43 ----
  from Options import options
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  mbox_fmts = {"unix": mailbox.PortableUnixMailbox,
               "mmdf": mailbox.MmdfMailbox,

Index: neiltrain.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** neiltrain.py	27 Sep 2002 21:18:18 -0000	1.3
--- neiltrain.py	7 Nov 2002 22:30:07 -0000	1.4
***************
*** 13,16 ****
--- 13,23 ----
  import mboxutils
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  program = sys.argv[0] # For usage(); referenced by docstring above
  

Index: rebal.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rebal.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** rebal.py	29 Sep 2002 16:55:10 -0000	1.8
--- rebal.py	7 Nov 2002 22:30:07 -0000	1.9
***************
*** 46,49 ****
--- 46,56 ----
  import getopt
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  # defaults
  NPERDIR = 4000

Index: sets.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/sets.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** sets.py	22 Sep 2002 06:58:36 -0000	1.1
--- sets.py	7 Nov 2002 22:30:07 -0000	1.2
***************
*** 60,63 ****
--- 60,70 ----
  
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  class BaseSet(object):
      """Common base class for mutable and immutable sets."""

Index: splitn.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitn.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** splitn.py	6 Nov 2002 02:02:08 -0000	1.3
--- splitn.py	7 Nov 2002 22:30:08 -0000	1.4
***************
*** 48,51 ****
--- 48,58 ----
  import mboxutils
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  program = sys.argv[0]
  

Index: splitndirs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** splitndirs.py	6 Nov 2002 02:02:43 -0000	1.6
--- splitndirs.py	7 Nov 2002 22:30:08 -0000	1.7
***************
*** 55,58 ****
--- 55,65 ----
  import mboxutils
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  program = sys.argv[0]
  

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.62
retrieving revision 1.63
diff -C2 -d -r1.62 -r1.63
*** tokenizer.py	6 Nov 2002 02:12:47 -0000	1.62
--- tokenizer.py	7 Nov 2002 22:30:08 -0000	1.63
***************
*** 16,19 ****
--- 16,26 ----
  from mboxutils import get_message
  
+ try:
+     True, False
+ except NameError:
+     # Maintain compatibility with Python 2.2
+     True, False = 1, 0
+ 
+ 
  # Patch encodings.aliases to recognize 'ansi_x3_4_1968'
  from encodings.aliases import aliases # The aliases dictionary


From jvr@users.sourceforge.net  Thu Nov  7 22:32:17 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Thu, 07 Nov 2002 14:32:17 -0800
Subject: [Spambayes-checkins] website developer.ht,1.4,1.5
Message-ID: <E189vCD-0006qi-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv26318

Modified Files:
	developer.ht 
Log Message:
Python version requirement dropped to 2.2. Someone else should regenerate and upload the site; I haven't got a clue.

Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** developer.ht	4 Nov 2002 06:38:52 -0000	1.4
--- developer.ht	7 Nov 2002 22:32:15 -0000	1.5
***************
*** 12,16 ****
  come crying &lt;wink&gt;.
  </p>
! <p>This project works with either the absolute bleeding edge of python code, available from <a href="https://sourceforge.net/cvs/?group_id=5470">CVS on sourceforge</a>, or with Python 2.2.1 (not 2.2, or 2.1.3).
  </p>
  <p>The spambayes code itself is also available <a href="http://sourceforge.net/cvs/?group_id=61702">via CVS</a>
--- 12,16 ----
  come crying &lt;wink&gt;.
  </p>
! <p>This project works with either the absolute bleeding edge of python code, available from <a href="https://sourceforge.net/cvs/?group_id=5470">CVS on sourceforge</a>, or with Python 2.2 (not 2.1.x or earlier).
  </p>
  <p>The spambayes code itself is also available <a href="http://sourceforge.net/cvs/?group_id=61702">via CVS</a>


From just@letterror.com  Thu Nov  7 22:51:11 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 23:51:11 +0100
Subject: [Spambayes-checkins]  spambayes README.txt,1.40,1.41
	TestDriver.py,1.27,1.28 Tester.py,1.7,1.8 chi2.py,1.7,1.8
	classifier.py,1.48,1.49 hammie.py,1.36,1.37 hammiesrv.py,1.9,1.10
	mboxcount.py,1.2,1.3 mboxtest.py,1.9,1.10 neiltrain.py,1.3,1.4 rebal.py,1.
In-Reply-To: <E189vAe-0006bg-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <r01050400-1021-6CD83424F2A311D68CC8003065D5E7E4@[10.0.0.23]>

Just van Rossum wrote:

> Mass checkin: Remain compatible with Python 2.2. Only tested with
> pop3proxy.py.

Btw. I screwed up the checkin for Options.py, Histogram.py and INTEGRATION.txt;
these have a bogus log message for the 2.2 compat patch :-(.

Just

From tim_one@users.sourceforge.net  Fri Nov  8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
	tokenizer.py,1.63,1.64
Message-ID: <E18A0Pd-0008K2-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798

Modified Files:
	Options.py tokenizer.py 
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py	7 Nov 2002 22:25:46 -0000	1.66
--- Options.py	8 Nov 2002 04:06:23 -0000	1.67
***************
*** 42,53 ****
      x-.*
  
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages.  Set true to retain HTML tags in this
- # case.  On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
- 
  # If true, the first few characters of application/octet-stream sections
  # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
  
  all_options = {
!     'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
!                   'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
--- 339,343 ----
  
  all_options = {
!     'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py	7 Nov 2002 22:30:08 -0000	1.63
--- tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
***************
*** 495,504 ****
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was removed.
  
  ##############################################################################
--- 495,505 ----
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False.  Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was also removed.
  
  ##############################################################################
***************
*** 1167,1175 ****
          """Generate a stream of tokens from an email Message.
  
-         HTML tags are always stripped from text/plain sections.
-         options.retain_pure_html_tags controls whether HTML tags are
-         also stripped from text/html sections.  Except in special cases,
-         it's recommended to leave that at its default of false.
- 
          If options.check_octets is True, the first few undecoded characters
          of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             if (part.get_content_type() == "text/plain" or
!                     not options.retain_pure_html_tags):
!                 text = text.replace('&nbsp;', ' ')
!                 text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
--- 1224,1229 ----
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             text = text.replace('&nbsp;', ' ')
!             text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
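
With retain_pure_html_tags gone, the tag-stripping step in the hunk above runs on every text part, whatever its content type. A minimal sketch of that step follows. The pattern bound to html_re here is a simplified stand-in chosen for illustration, not necessarily the expression tokenizer.py uses, and strip_html is a hypothetical helper name.

```python
import re

# Simplified stand-in for tokenizer.py's html_re (an assumption): match
# anything that looks like a tag, from '<' up to the next '>'.
html_re = re.compile(r"<[^>]*>")


def strip_html(text):
    """Replace &nbsp; entities and tag-like runs with single spaces."""
    text = text.replace('&nbsp;', ' ')
    return html_re.sub(' ', text)


if __name__ == "__main__":
    # Tags and the entity each become one space; other text is untouched.
    print(repr(strip_html("<p>Buy&nbsp;now</p>")))
```

Each tag is replaced by a space rather than deleted, so words on either side of a tag do not fuse into a single token.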


From richiehindle@users.sourceforge.net  Fri Nov  8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID: <E18A440-0006h6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390

Modified Files:
	pop3proxy.py 
Log Message:
 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome greeting.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.
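
The move to async sockets amounts to two pieces: buffer whatever bytes arrive until a full CRLF-terminated line is available, then decide line by line whether the POP3 response is complete. The sketch below illustrates that idea without any sockets. LineAssembler and ResponseCollector are hypothetical names, not classes from pop3proxy.py, and the sketch leaves out the 30-second timeout and the connection-closed case.

```python
class LineAssembler:
    """Collect arbitrary chunks and pass each complete line to a callback,
    much as collect_incoming_data() and found_terminator() cooperate."""

    def __init__(self, line_callback, terminator="\r\n"):
        self.line_callback = line_callback
        self.terminator = terminator
        self.buffer = ""

    def feed(self, data):
        # Chunks may split a line anywhere, so accumulate before scanning.
        self.buffer += data
        while self.terminator in self.buffer:
            line, self.buffer = self.buffer.split(self.terminator, 1)
            self.line_callback(line + self.terminator)


class ResponseCollector:
    """Group server lines into one complete POP3 response."""

    def __init__(self, multiline, on_response):
        self.multiline = multiline      # does the command expect many lines?
        self.on_response = on_response
        self.response = ""

    def on_line(self, line):
        is_first = not self.response
        self.response += line
        # Complete when the command is single-line, when the lone-dot
        # terminator arrives, or when the first line reports an error.
        done = (not self.multiline
                or line == ".\r\n"
                or (is_first and line.startswith("-ERR")))
        if done:
            self.on_response(self.response)
            self.response = ""


if __name__ == "__main__":
    responses = []
    collector = ResponseCollector(multiline=True,
                                  on_response=responses.append)
    reader = LineAssembler(collector.on_line)
    # Simulate a LIST reply arriving in awkwardly split chunks.
    for chunk in ["+OK 2 mess", "ages\r\n1 120\r\n2 ", "200\r\n.\r\n"]:
        reader.feed(chunk)
    print(responses)
```

Because nothing here waits for data, the caller stays free to service other connections between chunks, which is the property the web interface needed.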


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py	7 Nov 2002 22:27:02 -0000	1.11
--- pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
***************
*** 47,50 ****
--- 47,74 ----
  
  
+ todo = """
+  o (Re)training interface - one message per line, quick-rendering table.
+  o Slightly-wordy index page; intro paragraph for each page.
+  o Once the training stuff is on a separate page, make the paste box
+    bigger.
+  o "Links" section (on homepage?) to project homepage, mailing list,
+    etc.
+  o "Home" link (with helmet!) at the end of each page.
+  o "Classify this" - just like Train.
+  o "Send me an email every [...] to remind me to train on new
+    messages."
+  o "Send me a status email every [...] telling how many mails have been
+    classified, etc."
+  o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
+    webbrowser?
+  o Possibly integrate Tim Stone's SMTP code - make it use async, make
+    the training code update (rather than replace!) the database.
+  o Can it cleanly dynamically update its status display while having a
+    POP3 converation?  Hammering reload sucks.
+  o Add a command to save the database without shutting down, and one to
+    reload the database.
+  o Leave the word in the input field after a Word query.
+ """
+ 
  import sys, re, operator, errno, getopt, cPickle, cStringIO, time
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
              self.factory(*args)
  
+ 
  class BrighterAsyncChat(asynchat.async_chat):
      """An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
  
  
+ class ServerLineReader(BrighterAsyncChat):
+     """An async socket that reads lines from a remote server and
+     simply calls a callback with the data.  The BayesProxy object
+     can't connect to the real POP3 server and talk to it
+     synchronously, because that would block the process."""
+     
+     def __init__(self, serverName, serverPort, lineCallback):
+         BrighterAsyncChat.__init__(self)
+         self.lineCallback = lineCallback
+         self.request = ''
+         self.set_terminator('\r\n')
+         self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+         self.connect((serverName, serverPort))
+     
+     def collect_incoming_data(self, data):
+         self.request = self.request + data
+ 
+     def found_terminator(self):
+         self.lineCallback(self.request + '\r\n')
+         self.request = ''
+ 
+     def handle_close(self):
+         self.lineCallback('')
+         self.close()
+ 
+ 
  class POP3ProxyBase(BrighterAsyncChat):
      """An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
!         self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.serverSocket.connect((serverName, serverPort))
!         self.serverIn = self.serverSocket.makefile('r')  # For reading only
!         self.push(self.serverIn.readline())
  
      def onTransaction(self, command, args, response):
--- 177,189 ----
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
+         self.response = ''
          self.set_terminator('\r\n')
!         self.command = ''           # The POP3 command being processed...
!         self.args = ''              # ...and its arguments
!         self.isClosing = False      # Has the server closed the socket?
!         self.seenAllHeaders = False # For the current RETR or TOP
!         self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
!                                              self.onServerLine)
  
      def onTransaction(self, command, args, response):
***************
*** 139,152 ****
          raise NotImplementedError
  
!     def isMultiline(self, command, args):
!         """Returns True if the given request should get a multiline
          response (assuming the response is positive).
          """
!         if command in ['USER', 'PASS', 'APOP', 'QUIT',
!                        'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif command in ['RETR', 'TOP']:
              return True
!         elif command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
--- 194,237 ----
          raise NotImplementedError
  
!     def onServerLine(self, line):
!         """A line of response has been received from the POP3 server."""
!         isFirstLine = not self.response
!         self.response = self.response + line
!         
!         # Is this line that terminates a set of headers?
!         self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
!         # Has the server closed its end of the socket?
!         if not line:
!             self.isClosing = True
!         
!         # If we're not processing a command, just echo the response.
!         if not self.command:
!             self.push(self.response)
!             self.response = ''
!         
!         # Time out after 30 seconds for message-retrieval commands if
!         # all the headers are down.  The rest of the message will proxy
!         # straight through.
!         if self.command in ['TOP', 'RETR'] and \
!            self.seenAllHeaders and time.time() > self.startTime + 30:
!             self.onResponse()
!             self.response = ''
!         # If that's a complete response, handle it.
!         elif not self.isMultiline() or line == '.\r\n' or \
!            (isFirstLine and line.startswith('-ERR')):
!             self.onResponse()
!             self.response = ''
!     
!     def isMultiline(self):
!         """Returns True if the request should get a multiline
          response (assuming the response is positive).
          """
!         if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
!                             'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif self.command in ['RETR', 'TOP']:
              return True
!         elif self.command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
***************
*** 155,204 ****
              return False
  
-     def readResponse(self, command, args):
-         """Reads the POP3 server's response and returns a tuple of
-         (response, isClosing, timedOut).  isClosing is True if the
-         server closes the socket, which tells found_terminator() to
-         close when the response has been sent.  timedOut is set if a
-         TOP or RETR request was still arriving after 30 seconds, and
-         tells found_terminator() to proxy the remainder of the response.
-         """
-         responseLines = []
-         startTime = time.time()
-         isMulti = self.isMultiline(command, args)
-         isClosing = False
-         timedOut = False
-         isFirstLine = True
-         seenAllHeaders = False
-         while True:
-             line = self.serverIn.readline()
-             if not line:
-                 # The socket's been closed by the server, probably by QUIT.
-                 isClosing = True
-                 break
-             elif not isMulti or (isFirstLine and line.startswith('-ERR')):
-                 # A single-line response.
-                 responseLines.append(line)
-                 break
-             elif line == '.\r\n':
-                 # The termination line.
-                 responseLines.append(line)
-                 break
-             else:
-                 # A normal line - append it to the response and carry on.
-                 responseLines.append(line)
-                 seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
- 
-             # Time out after 30 seconds for message-retrieval commands
-             # if all the headers are down - found_terminator() knows how
-             # to deal with this.
-             if command in ['TOP', 'RETR'] and \
-                seenAllHeaders and time.time() > startTime + 30:
-                 timedOut = True
-                 break
- 
-             isFirstLine = False
- 
-         return ''.join(responseLines), isClosing, timedOut
- 
      def collect_incoming_data(self, data):
          """Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
      def found_terminator(self):
          """Asynchat override."""
-         # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverSocket.sendall(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             command, args = ('', '')
          else:
              splitCommand = self.request.strip().split(None, 1)
!             command = splitCommand[0].upper()
!             args = splitCommand[1:]
!         rawResponse, isClosing, timedOut = self.readResponse(command, args)
! 
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cookedResponse = self.onTransaction(command, args, rawResponse)
!         self.push(cookedResponse)
!         self.request = ''
! 
!         # If readResponse() timed out, we still need to read and proxy
!         # the rest of the message.
!         if timedOut:
!             while True:
!                 line = self.serverIn.readline()
!                 if not line:
!                     # The socket's been closed by the server.
!                     isClosing = True
!                     break
!                 elif line == '.\r\n':
!                     # The termination line.
!                     self.push(line)
!                     break
!                 else:
!                     # A normal line.
!                     self.push(line)
! 
!         # If readResponse() or the loop above decided that the server
!         # has closed its socket, close this one when the response has
!         # been sent.
!         if isClosing:
              self.close_when_done()
  
  
  class BayesProxyListener(Listener):
--- 246,288 ----
      def found_terminator(self):
          """Asynchat override."""
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
+             self.serverSocket.shutdown(2)
+             self.serverSocket.close()
              self.shutdown(2)
              self.close()
              raise SystemExit
!         
!         self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             self.command = self.args = ''
          else:
+             # A proper command.
              splitCommand = self.request.strip().split(None, 1)
!             self.command = splitCommand[0].upper()
!             self.args = splitCommand[1:]
!             self.startTime = time.time()
!         
!         self.request = ''
!         
!     def onResponse(self):
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cooked = self.onTransaction(self.command, self.args, self.response)
!         self.push(cooked)
!         
!         # If onServerLine() decided that the server has closed its
!         # socket, close this one when the response has been sent.
!         if self.isClosing:
              self.close_when_done()
  
+         # Reset.
+         self.command = ''
+         self.args = ''
+         self.isClosing = False
+         self.seenAllHeaders = False
+ 
  
  class BayesProxyListener(Listener):
***************
*** 452,456 ****
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15 }
               .header { font-size: 133%% }
               .content { margin: 15 }
--- 484,490 ----
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15;
!                        border-top: 1px solid black;
!                        border-bottom: 1px solid black }
               .header { font-size: 133%% }
               .content { margin: 15 }
***************
*** 466,470 ****
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
--- 500,504 ----
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>&nbsp;Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
***************
*** 475,481 ****
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              <input type='submit' value='Shutdown now'>
               </td></tr></table></form>\n"""
  
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
--- 509,520 ----
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              %s
               </td></tr></table></form>\n"""
  
+     shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
+     
+     shutdownPickle = shutdownDB + """&nbsp;&nbsp;
+             <input type='submit' name='how' value='Save &amp; shutdown'>"""
+ 
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
***************
*** 483,486 ****
--- 522,533 ----
                    &nbsp;<br>\n"""
      
+     summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
+               proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
+               Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
+               POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
+               Emails classified this session: <b>%(numSpams)d</b> spam,
+                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
+               """
+     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
***************
*** 488,491 ****
--- 535,550 ----
                  </form>"""
      
+     train = """<form action='/upload' method='POST'
+                 enctype='multipart/form-data'>
+             Either upload a message file: <input type='file' name='file'><br>
+             Or paste the whole message (incuding headers) here:<br>
+             <textarea name='text' rows='3' cols='60'></textarea><br>
+             Is this message
+             <input type='radio' name='which' value='ham'>Ham</input> or
+             <input type='radio'
+                    name='which' value='spam' checked>Spam</input>?<br>
+             <input type='submit' value='Train on this message'>
+             </form>"""
+     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = self.request.split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
--- 561,565 ----
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
          
          if path == '/helmet.gif':
!             self.pushOKHeaders('image/gif')
              self.push(self.helmet)
          else:
--- 606,614 ----
          
          if path == '/helmet.gif':
!             # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
!             inOneHour = time.gmtime(time.time() + 3600)
!             expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
!             extraHeaders = {'Expires': expiryDate}
!             self.pushOKHeaders('image/gif', extraHeaders)
              self.push(self.helmet)
          else:
***************
*** 554,558 ****
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % url)
              else:
                  # This is a request for a valid page; run the handler.
--- 617,621 ----
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % path)
              else:
                  # This is a request for a valid page; run the handler.
***************
*** 561,569 ****
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 self.push(self.footer % timeString)
      
!     def pushOKHeaders(self, contentType):
!         self.push("HTTP/1.0 200 OK\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
          self.push("\r\n")
  
--- 624,641 ----
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 if status.useDB:
!                     self.push(self.footer % (timeString, self.shutdownDB))
!                 else:
!                     self.push(self.footer % (timeString, self.shutdownPickle))
      
!     def pushOKHeaders(self, contentType, extraHeaders={}):
!         timeNow = time.gmtime(time.time())
!         httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
!         self.push("HTTP/1.1 200 OK\r\n")
!         self.push("Connection: close\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
+         self.push("Date: %s\r\n" % httpNow)
+         for name, value in extraHeaders.items():
+             self.push("%s: %s\r\n" % (name, value))
          self.push("\r\n")
  
***************
*** 583,616 ****
  
      def onHome(self, params):
!         summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!                   proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
!                   Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
!                   POP3 conversations this session:
!                     <b>%(totalSessions)d</b>.<br>
!                   Emails classified this session: <b>%(numSpams)d</b> spam,
!                     <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
!                   """ % status.__dict__
!         
!         train = """<form action='/upload' method='POST'
!                     enctype='multipart/form-data'>
!                 Either upload a message file:
!                 <input type='file' name='file'><br>
!                 Or paste the whole message (incuding headers) here:<br>
!                 <textarea name='text' rows='3' cols='60'></textarea><br>
!                 Is this message
!                 <input type='radio' name='which' value='ham'>Ham</input> or
!                 <input type='radio'
!                        name='which' value='spam' checked>Spam</input>?<br>
!                 <input type='submit' value='Train on this message'>
!                 </form>"""
!         
!         body = (self.pageSection % ('Status', summary) +
!                 self.pageSection % ('Word query', self.wordQuery) +
!                 self.pageSection % ('Train', train))
          self.push(body)
  
      def onShutdown(self, params):
!         self.push("<p><b>Shutdown.</b> Goodbye.</p>")
!         self.push(' ')  # Acts as a flush for small buffers.
          self.shutdown(2)
          self.close()
--- 655,675 ----
  
      def onHome(self, params):
!         """Serve up the homepage."""
!         body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Word query', self.wordQuery)+
!                 self.pageSection % ('Train', self.train))
          self.push(body)
  
      def onShutdown(self, params):
!         """Shutdown the server, saving the pickle if requested to do so."""
!         if params['how'].lower().find('save') >= 0:
!             if not status.useDB and status.pickleName:
!                 self.push("<b>Saving...</b>")
!                 self.push(' ')  # Acts as a flush for small buffers.
!                 fp = open(status.pickleName, 'wb')
!                 cPickle.dump(self.bayes, fp, 1)
!                 fp.close()
!         self.push("<b>Shutdown</b>. Goodbye.")
!         self.push(' ')
          self.shutdown(2)
          self.close()
***************
*** 618,625 ****
  
      def onUpload(self, params):
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
          # Append the message to a file, to make it easier to rebuild
!         # the database later.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
--- 677,690 ----
  
      def onUpload(self, params):
+         """Train on an uploaded or pasted message."""
+         # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         
          # Append the message to a file, to make it easier to rebuild
!         # the database later.   This is a temporary implementation -
!         # it should keep a Corpus (from Tim Stone's forthcoming message
!         # management module) to manage a cache of messages.  It needs
!         # to keep them for the HTML retraining interface anyway.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
***************
*** 627,642 ****
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From ???@???\n")  # fake From line (XXX good enough?)
          f.write(message)
!         f.write("\n")
          f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("""<p>Trained on your message. Saving database...</p>""")
!         self.push(" ")  # Flush... must find out how to do this properly...
!         if not status.useDB and status.pickleName:
!             fp = open(status.pickleName, 'wb')
!             cPickle.dump(self.bayes, fp, 1)
!             fp.close()
!         self.push("<p>Done.</p><p><a href='/'>Home</a></p>")
  
      def onWordquery(self, params):
--- 692,704 ----
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
          f.write(message)
!         f.write("\n\n")
          f.close()
+ 
+         # Train on the message.
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
!         self.push(self.pageSection % ('Train another', self.train))
  
      def onWordquery(self, params):
***************
*** 656,660 ****
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s':" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
--- 718,722 ----
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
***************
*** 765,771 ****
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))
          self.request = ''
  
      def onStat(self, command, args):
          """POP3 STAT command."""
--- 827,839 ----
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))   # Or push_slowly for testing
          self.request = ''
  
+     def push_slowly(self, response):
+         """Useful for testing."""
+         for c in response:
+             self.push(c)
+             time.sleep(0.02)
+ 
      def onStat(self, command, args):
          """POP3 STAT command."""
***************
*** 777,781 ****
          """POP3 LIST command, with optional message number argument."""
          if args:
!             number = int(args)
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
          """POP3 LIST command, with optional message number argument."""
          if args:
!             try:
!                 number = int(args)
!             except ValueError:
!                 number = -1
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         return self._getMessage(int(args), 12345)
  
      def onTop(self, command, args):
          """POP3 RETR command."""
!         number, lines = map(int, args.split())
          return self._getMessage(number, lines)
  
--- 874,889 ----
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         try:
!             number = int(args)
!         except ValueError:
!             number = -1
!         return self._getMessage(number, 12345)
  
      def onTop(self, command, args):
          """POP3 RETR command."""
!         try:
!             number, lines = map(int, args.split())
!         except ValueError:
!             number, lines = -1, -1
          return self._getMessage(number, lines)
  
***************
*** 863,867 ****
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) != -1
  
      # Kill the proxy and the test server.
--- 941,945 ----
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) >= 0
  
      # Kill the proxy and the test server.
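
Note on the hunks above: the test POP3 server now maps a non-numeric argument to -1 so that the existing range check rejects it instead of letting int() raise ValueError. A minimal sketch of that pattern follows, written for current Python. The helper names and the "-ERR" reply text are assumptions made for illustration; they are not taken from pop3proxy.py.

```python
def parse_message_number(args):
    """Return the message number in args, or -1 if it is not an integer."""
    try:
        return int(args)
    except ValueError:
        # Same fallback as the diff: -1 can never pass the range check.
        return -1


def list_response(args, maildrop):
    """Hypothetical stand-in for the single-message branch of a LIST handler."""
    number = parse_message_number(args)
    if 0 < number <= len(maildrop):
        return "+OK %d\r\n" % len(maildrop[number - 1])
    # Assumed error reply; the real handler's wording is not shown above.
    return "-ERR no such message\r\n"
```

Mapping bad input to an out-of-range sentinel keeps one error path for "not a number" and "no such message", which is probably why the change did not add a separate error branch.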


From jvr@users.sourceforge.net  Sat Nov  9 18:05:44 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Sat, 09 Nov 2002 10:05:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13
Message-ID: <E18AZzM-0005QJ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20814

Modified Files:
	pop3proxy.py 
Log Message:
force word query to be lowercase, making the UI case insensitive

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
--- pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
***************
*** 704,707 ****
--- 704,708 ----
      def onWordquery(self, params):
          word = params['word']
+         word = word.lower()
          try:
              # Must be a better way to get __dict__ for a new-style class...


From hooft@users.sourceforge.net  Sat Nov  9 21:48:55 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sat, 09 Nov 2002 13:48:55 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,NONE,1.1
Message-ID: <E18AdTL-00086Q-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31102

Added Files:
	weaktest.py 
Log Message:
New test driver to simulate "unsure only" training

--- NEW FILE: weaktest.py ---
#! /usr/bin/env python

# A test driver using "the standard" test directory structure.
# This simulates a user who receives email and trains only on false
# positives (fp), false negatives (fn) and unsure messages. It starts by
# training on about the first 30 messages; from that point on, correctly
# classified messages are not used for training. This can be used to see how
# the scoring algorithm performs under such conditions. Questions are:
#  * How does the size of the database behave over time?
#  * Does the classification get better over time?
#  * Are there other combinations of parameters for the classifier
#    that make this better behaved than the default values?


"""Usage: %(program)s  [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

from __future__ import generators

import sys,os

from Options import options
import hammie

import msgs

program = sys.argv[0]

debug = 0

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def drive(nsets):
    print options.display()

    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs  = [options.ham_directories % i for i in range(1, nsets+1)]

    spamfns = [(x,y,1) for x in spamdirs for y in os.listdir(x)]
    hamfns = [(x,y,0) for x in hamdirs for y in os.listdir(x)]

    nham = len(hamfns)
    nspam = len(spamfns)
    
    allfns={}
    for fn in spamfns+hamfns:
        allfns[fn] = None

    d = hammie.Hammie(hammie.createbayes('weaktest.db', False))

    n=0
    unsure=0
    hamtrain=0
    spamtrain=0
    fp=0
    fn=0
    for dir,name, is_spam in allfns.iterkeys():
        n += 1
        m=msgs.Msg(dir, name).guts
        if debug:
            print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name),
        if hamtrain + spamtrain > 30:
            scr=d.score(m)
        else:
            scr=0.50
        if debug:
            print "score:%.3f"%scr,
        if scr < hammie.SPAM_THRESHOLD and is_spam:
            if scr < hammie.HAM_THRESHOLD:
                fn += 1
                if debug:
                    print "fn"
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            spamtrain += 1
            d.train_spam(m)
            d.update_probabilities()
        elif scr > hammie.HAM_THRESHOLD and not is_spam:
            if scr > hammie.SPAM_THRESHOLD:
                fp += 1
                if debug:
                    print "fp"
                else:
                    print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            hamtrain += 1
            d.train_ham(m)
            d.update_probabilities()
        else:
            if debug:
                print "OK"
        if n % 100 == 0:
            print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
                n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
    print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
    print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
        unsure,unsure*100.0/len(allfns))
    print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
    print "fp: %d fn: %d"%(fp,fn)
    FPW = options.best_cutoff_fp_weight
    FNW = options.best_cutoff_fn_weight
    UNW = options.best_cutoff_unsure_weight
    print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)
    
def main():
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
                                   ['ham-keep=', 'spam-keep='])
    except getopt.error, msg:
        usage(1, msg)

    nsets = seed = hamkeep = spamkeep = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    drive(nsets)

if __name__ == "__main__":
    main()
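
Note on drive() above: each message ends up in one of four outcomes, and only three of them trigger training. The sketch below restates that decision as a standalone function in current Python. The function name and return convention are assumptions for illustration, and the cutoffs used in the usage note (0.20 and 0.90) are example values, not necessarily the values of hammie.HAM_THRESHOLD and hammie.SPAM_THRESHOLD.

```python
def classify_outcome(score, is_spam, ham_cutoff, spam_cutoff):
    """Return (label, train_as) following the branches in drive().

    label is "fn", "fp", "unsure" or "ok"; train_as is "spam", "ham",
    or None when the message was classified correctly and is skipped.
    """
    if is_spam and score < spam_cutoff:
        # Spam that did not score as spam: false negative or unsure.
        label = "fn" if score < ham_cutoff else "unsure"
        return label, "spam"
    if not is_spam and score > ham_cutoff:
        # Ham that did not score as ham: false positive or unsure.
        label = "fp" if score > spam_cutoff else "unsure"
        return label, "ham"
    return "ok", None
```

With cutoffs 0.20 and 0.90, a spam message scoring 0.50 gives ("unsure", "spam"), and a ham message scoring 0.10 gives ("ok", None), so it is never added to the training data.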


From hooft@users.sourceforge.net  Sun Nov 10 12:02:36 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:02:36 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.1,1.2
Message-ID: <E18AqnU-0005vF-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22741

Modified Files:
	weaktest.py 
Log Message:
add flexcost; sanitize spacing

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weaktest.py	9 Nov 2002 21:48:52 -0000	1.1
--- weaktest.py	10 Nov 2002 12:02:33 -0000	1.2
***************
*** 59,63 ****
      nspam = len(spamfns)
      
!     allfns={}
      for fn in spamfns+hamfns:
          allfns[fn] = None
--- 59,63 ----
      nspam = len(spamfns)
      
!     allfns = {}
      for fn in spamfns+hamfns:
          allfns[fn] = None
***************
*** 65,74 ****
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
!     n=0
!     unsure=0
!     hamtrain=0
!     spamtrain=0
!     fp=0
!     fn=0
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
--- 65,80 ----
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
!     n = 0
!     unsure = 0
!     hamtrain = 0
!     spamtrain = 0
!     fp = 0
!     fn = 0
!     flexcost = 0
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     SPC = options.spam_cutoff
!     HC = options.ham_cutoff
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
***************
*** 82,87 ****
          if debug:
              print "score:%.3f"%scr,
!         if scr < hammie.SPAM_THRESHOLD and is_spam:
!             if scr < hammie.HAM_THRESHOLD:
                  fn += 1
                  if debug:
--- 88,96 ----
          if debug:
              print "score:%.3f"%scr,
!         if scr < SPC and is_spam:
!             t = FNW * (SPC - scr) / (SPC - HC)
!             #print "Spam at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr < HC:
                  fn += 1
                  if debug:
***************
*** 94,104 ****
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > hammie.HAM_THRESHOLD and not is_spam:
!             if scr > hammie.SPAM_THRESHOLD:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
              else:
                  unsure += 1
--- 103,116 ----
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > HC and not is_spam:
!             t = FPW * (scr - HC) / (SPC - HC)
!             #print "Ham at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr > SPC:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir, name), scr)
              else:
                  unsure += 1
***************
*** 113,126 ****
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure,unsure*100.0/len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
!     print "fp: %d fn: %d"%(fp,fn)
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)
      
  def main():
--- 125,136 ----
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure, unsure * 100.0 / len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
!     print "fp: %d fn: %d"%(fp, fn)
!     print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
!     print "Flex cost: $%.4f"%flexcost
      
  def main():
***************
*** 128,137 ****
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
!                                    ['ham-keep=', 'spam-keep='])
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = seed = hamkeep = spamkeep = None
      for opt, arg in opts:
          if opt == '-h':
--- 138,146 ----
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:')
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = None
      for opt, arg in opts:
          if opt == '-h':
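
Note on flexcost: unlike the fp/fn/unsure counts, the flex cost grows linearly with how far a message's score lies on the wrong side of its cutoff, scaled by the width of the unsure band (SPC - HC). The sketch below isolates that formula in current Python. The function and parameter names are assumptions for illustration, and the weights in the usage note are example values rather than the defaults in Options.

```python
def flex_cost(score, is_spam, ham_cutoff, spam_cutoff, fp_weight, fn_weight):
    """Per-message flex cost, following the two formulas in the diff."""
    width = spam_cutoff - ham_cutoff
    if is_spam and score < spam_cutoff:
        # Spam scoring below the spam cutoff: cost grows as the score drops.
        return fn_weight * (spam_cutoff - score) / width
    if not is_spam and score > ham_cutoff:
        # Ham scoring above the ham cutoff: cost grows as the score rises.
        return fp_weight * (score - ham_cutoff) / width
    return 0.0
```

With cutoffs 0.20 and 0.90, a spam message scoring 0.55 sits halfway across the unsure band and costs about half of fn_weight. A score beyond the opposite cutoff costs more than the full weight, so the measure keeps penalizing confident mistakes instead of saturating.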


From hooft@users.sourceforge.net  Sun Nov 10 12:07:18 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:07:18 -0800
Subject: [Spambayes-checkins] spambayes optimize.py,NONE,1.1
Message-ID: <E18Aqs2-0006JK-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24245

Added Files:
	optimize.py 
Log Message:
Simplex maximization

--- NEW FILE: optimize.py ---
#
__version__ = '$Id: optimize.py,v 1.1 2002/11/10 12:07:15 hooft Exp $'
#
# Optimize any parametric function.
#
import copy
import Numeric

def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001):
    var = Numeric.array(var)
    simplex = [var]
    for i in range(len(var)):
	var2 = copy.copy(var)
	var2[i] = var[i] + err[i]
	simplex.append(var2)
    value = []
    for i in range(len(simplex)):
	value.append(func(simplex[i]))
    while 1:
	# Determine worst and best
	wi = 0
	bi = 0
	for i in range(len(simplex)):
	    if value[wi] > value[i]:
		wi = i
	    if value[bi] < value[i]:
		bi = i
	# Test for convergence
	#print "worst, best are",wi,bi,"with",value[wi],value[bi]
	if abs(value[bi] - value[wi]) <= convcrit:
	    return simplex[bi]
	# Calculate average of non-worst
	ave=Numeric.zeros(len(var), 'd')
	for i in range(len(simplex)):
	    if i != wi:
		ave = ave + simplex[i]
	ave = ave / (len(simplex) - 1)
	worst = Numeric.array(simplex[wi])
	# Check for too-small simplex
	simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
	if simsize <= minerr:
	    #print "Size of simplex too small:",simsize
	    return simplex[bi]
	# Invert worst
	new = 2 * ave - simplex[wi]
	newv = func(new)
	if newv <= value[wi]:
	    # Even worse. Shrink instead
	    #print "Shrunk simplex"
	    #print "ave=",repr(ave)
	    #print "wi=",repr(worst)
	    new = 0.5 * ave + 0.5 * worst
	    newv = func(new)
	elif newv > value[bi]:
	    # Better than the best. Expand
	    new2 = 3 * ave - 2 * worst
	    newv2 = func(new2)
	    if newv2 > newv:
		# Accept
		#print "Expanded simplex"
		new = new2
		newv = newv2
	simplex[wi] = new
	value[wi] = newv

def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
    err = Numeric.array(err)
    var = SimplexMaximize(var, err, func, convcrit*5, minerr*5)
    return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr)
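
Note on SimplexMaximize: the routine is a simplified downhill-simplex search. It reflects the worst vertex through the average of the others, tries a longer step when the reflection beats the current best, and pulls the worst vertex halfway toward the average when the reflection is no improvement. The sketch below restates the same steps with plain lists in current Python, so that it runs without Numeric. It is an illustrative port under that assumption, not a replacement for optimize.py, and the name simplex_maximize is hypothetical.

```python
def simplex_maximize(var, err, func, convcrit=0.001, minerr=0.001):
    """Maximize func, starting from var with initial step sizes err."""
    var = [float(v) for v in var]
    # Initial simplex: the start point plus one offset vertex per dimension.
    simplex = [var]
    for i in range(len(var)):
        vertex = list(var)
        vertex[i] = var[i] + err[i]
        simplex.append(vertex)
    value = [func(v) for v in simplex]
    while True:
        # Indices of the worst (lowest) and best (highest) vertices.
        wi = min(range(len(simplex)), key=lambda i: value[i])
        bi = max(range(len(simplex)), key=lambda i: value[i])
        if abs(value[bi] - value[wi]) <= convcrit:
            return simplex[bi]
        worst = simplex[wi]
        others = [v for i, v in enumerate(simplex) if i != wi]
        ave = [sum(coords) / len(others) for coords in zip(*others)]
        # Stop when the worst vertex is already very close to the average.
        simsize = sum(abs(a - w) for a, w in zip(ave, worst))
        if simsize <= minerr:
            return simplex[bi]
        # Reflect the worst vertex through the average of the others.
        new = [2 * a - w for a, w in zip(ave, worst)]
        newv = func(new)
        if newv <= value[wi]:
            # No improvement: move the worst vertex halfway to the average.
            new = [0.5 * a + 0.5 * w for a, w in zip(ave, worst)]
            newv = func(new)
        elif newv > value[bi]:
            # Better than the best: try a longer step in the same direction.
            new2 = [3 * a - 2 * w for a, w in zip(ave, worst)]
            newv2 = func(new2)
            if newv2 > newv:
                new, newv = new2, newv2
        simplex[wi] = new
        value[wi] = newv


def example_objective(point):
    """Concave test function with its maximum at (1, -2)."""
    x, y = point
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)


best = simplex_maximize([0.0, 0.0], [0.5, 0.5], example_objective)
```

The routine only maximizes, so a caller that wants to minimize a cost has to return the negated cost from func.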


From hooft@users.sourceforge.net  Sun Nov 10 12:08:42 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:08:42 -0800
Subject: [Spambayes-checkins] spambayes weakloop.py,NONE,1.1
Message-ID: <E18AqtO-0006Q0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24653

Added Files:
	weakloop.py 
Log Message:
Loop simplex optimization over weaktest.py

--- NEW FILE: weakloop.py ---
#
# Optimize classifier parameters by running weaktest.py in a simplex loop
#
"""Usage: %(program)s  [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

import sys

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

program = sys.argv[0]

default="""
[Classifier]
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150

[TestDriver]
spam_cutoff = 0.90
ham_cutoff = 0.20
"""

import Options

start = (Options.options.robinson_probability_x,
         Options.options.robinson_minimum_prob_strength,
         Options.options.robinson_probability_s,
         Options.options.spam_cutoff,
         Options.options.ham_cutoff)
err = (0.01, 0.01, 0.01, 0.005, 0.01)

def mkini(vars):
    f=open('bayescustomize.ini', 'w')
    f.write("""
[Classifier]
robinson_probability_x = %.6f
robinson_minimum_prob_strength = %.6f
robinson_probability_s = %.6f

[TestDriver]
spam_cutoff = %.4f
ham_cutoff = %.4f
"""%tuple(vars))
    f.close()

def score(vars):
    import os
    mkini(vars)
    status = os.system('python2.3 weaktest.py -n %d > weak.out'%nsets)
    if status != 0:
        print >> sys.stderr, "Error status from weaktest"
        sys.exit(status)
    f = open('weak.out', 'r')
    txt = f.readlines()
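    # Extract the flex cost from the last output line ("Flex cost: $1.2345"):
    # take the third whitespace-separated field and strip the leading "$".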
    cost = float(txt[-1].split()[2][1:])
    f.close()
    print ''.join(txt[-4:])[:-1]
    print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,))
    return -cost

def main():
    import optimize
    finish=optimize.SimplexMaximize(start,err,score)
    mkini(finish)

if __name__ == "__main__":
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:')
    except getopt.error, msg:
        usage(1, msg)

    nsets = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    main()
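
Note on score() above: it relies on weaktest.py printing "Flex cost: $..." as its very last line, and returns the negated cost because SimplexMaximize maximizes. The sketch below shows the same extraction in current Python, with a search for the labelled line instead of a fixed position. The function names are assumptions for illustration and are not part of weakloop.py.

```python
def parse_flex_cost(lines):
    """Return the flex cost from weaktest-style output lines.

    Looks for the last line starting with "Flex cost:" rather than
    assuming it is the final line of the output.
    """
    for line in reversed(lines):
        if line.startswith("Flex cost:"):
            # The third field looks like "$1.2345"; drop the dollar sign.
            return float(line.split()[2].lstrip("$"))
    raise ValueError("no 'Flex cost:' line found")


def objective(lines):
    """Negated cost, so that a maximizer drives the cost down."""
    return -parse_flex_cost(lines)
```

Searching for the label costs a few more lines than indexing txt[-1], but it fails with a clear error if the output format of the test driver changes.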


From tim_one@users.sourceforge.net  Sun Nov 10 19:59:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 11:59:24 -0800
Subject: [Spambayes-checkins] spambayes msgs.py,1.5,1.6 optimize.py,1.1,1.2
 pop3proxy.py,1.13,1.14 timcv.py,1.11,1.12 weaktest.py,1.2,1.3
Message-ID: <E18AyEu-0003ql-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14712

Modified Files:
	msgs.py optimize.py pop3proxy.py timcv.py weaktest.py 
Log Message:
Whitespace normalization.


Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** msgs.py	1 Nov 2002 04:10:50 -0000	1.5
--- msgs.py	10 Nov 2002 19:59:22 -0000	1.6
***************
*** 84,88 ****
  
  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.  
         If seed is not None, also set SEED.
         If (ham|spam)test are not set, set to the same as the (ham|spam)train
--- 84,88 ----
  
  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
         If seed is not None, also set SEED.
         If (ham|spam)test are not set, set to the same as the (ham|spam)train

Index: optimize.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/optimize.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** optimize.py	10 Nov 2002 12:07:15 -0000	1.1
--- optimize.py	10 Nov 2002 19:59:22 -0000	1.2
***************
*** 11,66 ****
      simplex = [var]
      for i in range(len(var)):
! 	var2 = copy.copy(var)
! 	var2[i] = var[i] + err[i]
! 	simplex.append(var2)
      value = []
      for i in range(len(simplex)):
! 	value.append(func(simplex[i]))
      while 1:
! 	# Determine worst and best
! 	wi = 0
! 	bi = 0
! 	for i in range(len(simplex)):
! 	    if value[wi] > value[i]:
! 		wi = i
! 	    if value[bi] < value[i]:
! 		bi = i
! 	# Test for convergence
! 	#print "worst, best are",wi,bi,"with",value[wi],value[bi]
! 	if abs(value[bi] - value[wi]) <= convcrit:
! 	    return simplex[bi]
! 	# Calculate average of non-worst
! 	ave=Numeric.zeros(len(var), 'd')
! 	for i in range(len(simplex)):
! 	    if i != wi:
! 		ave = ave + simplex[i]
! 	ave = ave / (len(simplex) - 1)
! 	worst = Numeric.array(simplex[wi])
! 	# Check for too-small simplex
! 	simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
! 	if simsize <= minerr:
! 	    #print "Size of simplex too small:",simsize
! 	    return simplex[bi]
! 	# Invert worst
! 	new = 2 * ave - simplex[wi]
! 	newv = func(new)
! 	if newv <= value[wi]:
! 	    # Even worse. Shrink instead
! 	    #print "Shrunk simplex"
! 	    #print "ave=",repr(ave)
! 	    #print "wi=",repr(worst)
! 	    new = 0.5 * ave + 0.5 * worst
! 	    newv = func(new)
! 	elif newv > value[bi]:
! 	    # Better than the best. Expand
! 	    new2 = 3 * ave - 2 * worst
! 	    newv2 = func(new2)
! 	    if newv2 > newv:
! 		# Accept
! 		#print "Expanded simplex"
! 		new = new2
! 		newv = newv2
! 	simplex[wi] = new
! 	value[wi] = newv
  
  def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
--- 11,66 ----
      simplex = [var]
      for i in range(len(var)):
!         var2 = copy.copy(var)
!         var2[i] = var[i] + err[i]
!         simplex.append(var2)
      value = []
      for i in range(len(simplex)):
!         value.append(func(simplex[i]))
      while 1:
!         # Determine worst and best
!         wi = 0
!         bi = 0
!         for i in range(len(simplex)):
!             if value[wi] > value[i]:
!                 wi = i
!             if value[bi] < value[i]:
!                 bi = i
!         # Test for convergence
!         #print "worst, best are",wi,bi,"with",value[wi],value[bi]
!         if abs(value[bi] - value[wi]) <= convcrit:
!             return simplex[bi]
!         # Calculate average of non-worst
!         ave=Numeric.zeros(len(var), 'd')
!         for i in range(len(simplex)):
!             if i != wi:
!                 ave = ave + simplex[i]
!         ave = ave / (len(simplex) - 1)
!         worst = Numeric.array(simplex[wi])
!         # Check for too-small simplex
!         simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
!         if simsize <= minerr:
!             #print "Size of simplex too small:",simsize
!             return simplex[bi]
!         # Invert worst
!         new = 2 * ave - simplex[wi]
!         newv = func(new)
!         if newv <= value[wi]:
!             # Even worse. Shrink instead
!             #print "Shrunk simplex"
!             #print "ave=",repr(ave)
!             #print "wi=",repr(worst)
!             new = 0.5 * ave + 0.5 * worst
!             newv = func(new)
!         elif newv > value[bi]:
!             # Better than the best. Expand
!             new2 = 3 * ave - 2 * worst
!             newv2 = func(new2)
!             if newv2 > newv:
!                 # Accept
!                 #print "Expanded simplex"
!                 new = new2
!                 newv = newv2
!         simplex[wi] = new
!         value[wi] = newv
  
  def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
--- pop3proxy.py	10 Nov 2002 19:59:22 -0000	1.14
***************
*** 140,144 ****
      can't connect to the real POP3 server and talk to it
      synchronously, because that would block the process."""
!     
      def __init__(self, serverName, serverPort, lineCallback):
          BrighterAsyncChat.__init__(self)
--- 140,144 ----
      can't connect to the real POP3 server and talk to it
      synchronously, because that would block the process."""
! 
      def __init__(self, serverName, serverPort, lineCallback):
          BrighterAsyncChat.__init__(self)
***************
*** 148,152 ****
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
          self.connect((serverName, serverPort))
!     
      def collect_incoming_data(self, data):
          self.request = self.request + data
--- 148,152 ----
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
          self.connect((serverName, serverPort))
! 
      def collect_incoming_data(self, data):
          self.request = self.request + data
***************
*** 184,188 ****
          self.seenAllHeaders = False # For the current RETR or TOP
          self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
                                               self.onServerLine)
  
--- 184,188 ----
          self.seenAllHeaders = False # For the current RETR or TOP
          self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort,
                                               self.onServerLine)
  
***************
*** 198,214 ****
          isFirstLine = not self.response
          self.response = self.response + line
!         
          # Is this line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
          # Has the server closed its end of the socket?
          if not line:
              self.isClosing = True
!         
          # If we're not processing a command, just echo the response.
          if not self.command:
              self.push(self.response)
              self.response = ''
!         
          # Time out after 30 seconds for message-retrieval commands if
          # all the headers are down.  The rest of the message will proxy
--- 198,214 ----
          isFirstLine = not self.response
          self.response = self.response + line
! 
          # Is this line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
! 
          # Has the server closed its end of the socket?
          if not line:
              self.isClosing = True
! 
          # If we're not processing a command, just echo the response.
          if not self.command:
              self.push(self.response)
              self.response = ''
! 
          # Time out after 30 seconds for message-retrieval commands if
          # all the headers are down.  The rest of the message will proxy
***************
*** 223,227 ****
              self.onResponse()
              self.response = ''
!     
      def isMultiline(self):
          """Returns True if the request should get a multiline
--- 223,227 ----
              self.onResponse()
              self.response = ''
! 
      def isMultiline(self):
          """Returns True if the request should get a multiline
***************
*** 254,258 ****
              self.close()
              raise SystemExit
!         
          self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
--- 254,258 ----
              self.close()
              raise SystemExit
! 
          self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
***************
*** 265,271 ****
              self.args = splitCommand[1:]
              self.startTime = time.time()
!         
          self.request = ''
!         
      def onResponse(self):
          # Pass the request and the raw response to the subclass and
--- 265,271 ----
              self.args = splitCommand[1:]
              self.startTime = time.time()
! 
          self.request = ''
! 
      def onResponse(self):
          # Pass the request and the raw response to the subclass and
***************
*** 273,277 ****
          cooked = self.onTransaction(self.command, self.args, self.response)
          self.push(cooked)
!         
          # If onServerLine() decided that the server has closed its
          # socket, close this one when the response has been sent.
--- 273,277 ----
          cooked = self.onTransaction(self.command, self.args, self.response)
          self.push(cooked)
! 
          # If onServerLine() decided that the server has closed its
          # socket, close this one when the response has been sent.
***************
*** 351,355 ****
          status.activeSessions -= 1
          POP3ProxyBase.close(self)
!     
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
--- 351,355 ----
          status.activeSessions -= 1
          POP3ProxyBase.close(self)
! 
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
***************
*** 419,423 ****
                  if command == 'RETR':
                      status.numUnsure += 1
!             
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
--- 419,423 ----
                  if command == 'RETR':
                      status.numUnsure += 1
! 
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
***************
*** 490,494 ****
               .content { margin: 15 }
               .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: fffae0; padding-left: 1ex; 
                                 border-bottom: 1px solid #808080;
                                 font-weight: bold }
--- 490,494 ----
               .content { margin: 15 }
               .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: fffae0; padding-left: 1ex;
                                 border-bottom: 1px solid #808080;
                                 font-weight: bold }
***************
*** 513,517 ****
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
!     
      shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
--- 513,517 ----
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
! 
      shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
***************
*** 521,525 ****
                    <tr><td class='sectionbody'>%s</td></tr></table>
                    &nbsp;<br>\n"""
!     
      summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
                proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
--- 521,525 ----
                    <tr><td class='sectionbody'>%s</td></tr></table>
                    &nbsp;<br>\n"""
! 
      summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
                proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
***************
*** 529,538 ****
                  <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
!     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
!     
      train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
--- 529,538 ----
                  <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
! 
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
! 
      train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
***************
*** 546,550 ****
              <input type='submit' value='Train on this message'>
              </form>"""
!     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
--- 546,550 ----
              <input type='submit' value='Train on this message'>
              </form>"""
! 
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 577,581 ****
                  self.request = self.request + '\r\n\r\n'
                  return
!     
              if type(self.get_terminator()) is type(1):
                  # We've just read the body of a POSTed request.
--- 577,581 ----
                  self.request = self.request + '\r\n\r\n'
                  return
! 
              if type(self.get_terminator()) is type(1):
                  # We've just read the body of a POSTed request.
***************
*** 592,596 ****
                      # A normal x-www-form-urlencoded.
                      params.update(cgi.parse_qs(body, keep_blank_values=True))
!             
              # Convert the cgi params into a simple dictionary.
              plainParams = {}
--- 592,596 ----
                      # A normal x-www-form-urlencoded.
                      params.update(cgi.parse_qs(body, keep_blank_values=True))
! 
              # Convert the cgi params into a simple dictionary.
              plainParams = {}
***************
*** 604,608 ****
          if path == '/':
              path = '/Home'
!         
          if path == '/helmet.gif':
              # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
--- 604,608 ----
          if path == '/':
              path = '/Home'
! 
          if path == '/helmet.gif':
              # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
***************
*** 628,632 ****
                  else:
                      self.push(self.footer % (timeString, self.shutdownPickle))
!     
      def pushOKHeaders(self, contentType, extraHeaders={}):
          timeNow = time.gmtime(time.time())
--- 628,632 ----
                  else:
                      self.push(self.footer % (timeString, self.shutdownPickle))
! 
      def pushOKHeaders(self, contentType, extraHeaders={}):
          timeNow = time.gmtime(time.time())
***************
*** 645,649 ****
          self.push("\r\n")
          self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
!     
      def pushPreamble(self, name):
          self.push(self.header % name)
--- 645,649 ----
          self.push("\r\n")
          self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
! 
      def pushPreamble(self, name):
          self.push(self.header % name)
***************
*** 681,685 ****
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
!         
          # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
--- 681,685 ----
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
! 
          # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
***************
*** 718,722 ****
          except KeyError:
              info = "'%s' does not appear in the database." % word
!         
          body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
--- 718,722 ----
          except KeyError:
              info = "'%s' does not appear in the database." % word
! 
          body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
***************
*** 992,996 ****
          elif opt == '-u':
              status.uiPort = int(arg)
!             
      # Do whatever we've been asked to do...
      if not opts and not args:
--- 992,996 ----
          elif opt == '-u':
              status.uiPort = int(arg)
! 
      # Do whatever we've been asked to do...
      if not opts and not args:

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timcv.py	1 Nov 2002 04:10:50 -0000	1.11
--- timcv.py	10 Nov 2002 19:59:22 -0000	1.12
***************
*** 15,19 ****
  
      --HamTrain int
!         The maximum number of msgs to use from each Ham set for training.  
          The msgs are chosen randomly.  See also the -s option.
  
--- 15,19 ----
  
      --HamTrain int
!         The maximum number of msgs to use from each Ham set for training.
          The msgs are chosen randomly.  See also the -s option.
  
***************
*** 23,27 ****
  
      --HamTest int
!         The maximum number of msgs to use from each Ham set for testing.  
          The msgs are chosen randomly.  See also the -s option.
  
--- 23,27 ----
  
      --HamTest int
!         The maximum number of msgs to use from each Ham set for testing.
          The msgs are chosen randomly.  See also the -s option.
  
***************
*** 73,79 ****
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), 
                              hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), 
                              spamdirs[1:], train=1))
  
--- 73,79 ----
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
                              hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
                              spamdirs[1:], train=1))
  
***************
*** 98,102 ****
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1), 
                          msgs.SpamStream(sname, s2, train=1))
  
--- 98,102 ----
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1),
                          msgs.SpamStream(sname, s2, train=1))
  

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** weaktest.py	10 Nov 2002 12:02:33 -0000	1.2
--- weaktest.py	10 Nov 2002 19:59:22 -0000	1.3
***************
*** 58,62 ****
      nham = len(hamfns)
      nspam = len(spamfns)
!     
      allfns = {}
      for fn in spamfns+hamfns:
--- 58,62 ----
      nham = len(hamfns)
      nspam = len(spamfns)
! 
      allfns = {}
      for fn in spamfns+hamfns:
***************
*** 133,137 ****
      print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
      print "Flex cost: $%.4f"%flexcost
!     
  def main():
      import getopt
--- 133,137 ----
      print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
      print "Flex cost: $%.4f"%flexcost
! 
  def main():
      import getopt


From tim_one@users.sourceforge.net  Sun Nov 10 20:00:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 12:00:03 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24
Message-ID: <E18AyFX-0003uk-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14946

Modified Files:
	msgstore.py 
Log Message:
Whitespace normalization.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** msgstore.py	7 Nov 2002 22:30:09 -0000	1.23
--- msgstore.py	10 Nov 2002 19:59:59 -0000	1.24
***************
*** 397,401 ****
              # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
              pass
!             
          return "%s\n%s\n%s" % (headers, html, body)
  
--- 397,401 ----
              # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
              pass
! 
          return "%s\n%s\n%s" % (headers, html, body)
  

From tim_one@users.sourceforge.net  Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4
Message-ID: <E18B3r2-0001Re-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam

Modified Files:
	profile.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** profile.py	7 Nov 2002 22:30:11 -0000	1.3
--- profile.py	11 Nov 2002 01:59:06 -0000	1.4
***************
*** 44,48 ****
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 44,48 ----
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0


From tim_one@users.sourceforge.net  Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] 
 spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Message-ID: <E18B3r2-0001RY-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402

Modified Files:
	Options.py classifier.py weakloop.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py	8 Nov 2002 04:06:23 -0000	1.67
--- Options.py	11 Nov 2002 01:59:06 -0000	1.68
***************
*** 241,268 ****
  
  # These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before.  Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam.  This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x.  All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE:  Gary Robinson previously used a different formula involving 'a'
! # and 'x'.  The 'x' here is the same as before.  The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinson's base scheme.  0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on Gary Robinson's web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
--- 241,268 ----
  
  # These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before.  Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting.  At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam.  This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob.  All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests.  0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on the Robinson web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
***************
*** 272,284 ****
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  That is
! # the "provably most-sensitive" test Gary's original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far on Tim's data.
! # One systematic benefit is that it is immune to "cancellation disease".  One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H)).
--- 272,284 ----
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  This is
! # the "provably most-sensitive" test the original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease".  One
! # systematic drawback is sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H)).
***************
*** 381,387 ****
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'robinson_probability_x': float_cracker,
!                    'robinson_probability_s': float_cracker,
!                    'robinson_minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'unknown_word_prob': float_cracker,
!                    'unknown_word_strength': float_cracker,
!                    'minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
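
A minimal sketch of the chi-squared combining that the Options.py comments
describe, assuming the usual closed form for the chi-squared tail with an
even number of degrees of freedom.  The function names chi2Q and
chi_squared_score are illustrative, not taken from this check-in, and the
word probabilities are assumed to lie strictly between 0 and 1 (which a
nonzero unknown_word_strength is meant to ensure).

```python
import math


def chi2Q(x2, v):
    """Return P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    # Closed form for even v: exp(-m) * sum(m**i / i!) for i in 0..v/2-1.
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi_squared_score(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # -2*sum(ln(p_i)) follows chi-squared with 2*n degrees of freedom
    # when the p_i are uniformly distributed.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Rob Hooft's combination of the final S and H measures.
    return (S - H + 1.0) / 2.0
```

With strong evidence on both sides, S and H are both close to 1 and the
score lands near 0.5 instead of being dragged to one extreme, which is one
way to read the "cancellation disease" remark.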

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py	7 Nov 2002 22:30:05 -0000	1.49
--- classifier.py	11 Nov 2002 01:59:06 -0000	1.50
***************
*** 70,74 ****
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
          nspam = float(self.nspam or 1)
  
!         S = options.robinson_probability_s
!         StimesX = S * options.robinson_probability_x
  
          for word, record in self.wordinfo.iteritems():
--- 322,327 ----
          nspam = float(self.nspam or 1)
  
!         S = options.unknown_word_strength
!         StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
  
      def _getclues(self, wordstream):
!         mindist = options.robinson_minimum_prob_strength
!         unknown = options.robinson_probability_x
  
          clues = []  # (distance, prob, word, record) tuples
--- 449,454 ----
  
      def _getclues(self, wordstream):
!         mindist = options.minimum_prob_strength
!         unknown = options.unknown_word_prob
  
          clues = []  # (distance, prob, word, record) tuples
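
How the three renamed options interact can be sketched as follows.  This is
an illustration under stated assumptions, not the classifier.py code: the
helper names adjusted_spamprob and select_clues are hypothetical, and the
adjustment is assumed to be the weighted average (S*x + n*p) / (S + n)
suggested by the S and StimesX variables in the hunk above.

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Blend the counting estimate with the prior for one word."""
    n = spamcount + hamcount
    if n == 0:
        # A word never seen in training gets the prior outright.
        return unknown_word_prob
    spamratio = spamcount / float(nspam or 1)
    hamratio = hamcount / float(nham or 1)
    prob = spamratio / (hamratio + spamratio)
    S = unknown_word_strength
    # S = 0 trusts the counts completely; a large S pulls toward the prior.
    return (S * unknown_word_prob + n * prob) / (S + n)


def select_clues(probs, minimum_prob_strength=0.1, max_discriminators=150):
    """Keep the strongest clues, dropping those too close to 0.5."""
    strong = [p for p in probs if abs(p - 0.5) >= minimum_prob_strength]
    strong.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return strong[:max_discriminators]
```

For example, a word seen once, in spam only, scores about 0.845 with the
default strength of 0.45, rather than the certainty of 1.0 that a strength
of 0 would assign.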

Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py	10 Nov 2002 12:08:40 -0000	1.1
--- weakloop.py	11 Nov 2002 01:59:06 -0000	1.2
***************
*** 29,35 ****
  default="""
  [Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
  max_discriminators = 150
  
--- 29,35 ----
  default="""
  [Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
  max_discriminators = 150
  
***************
*** 41,47 ****
  import Options
  
! start = (Options.options.robinson_probability_x,
!          Options.options.robinson_minimum_prob_strength,
!          Options.options.robinson_probability_s,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
--- 41,47 ----
  import Options
  
! start = (Options.options.unknown_word_prob,
!          Options.options.minimum_prob_strength,
!          Options.options.unknown_word_strength,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
***************
*** 52,58 ****
      f.write("""
  [Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
  
  [TestDriver]
--- 52,58 ----
      f.write("""
  [Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
  
  [TestDriver]


From tim_one@users.sourceforge.net  Fri Nov  8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
	tokenizer.py,1.63,1.64
Message-ID: <E18A0Pd-0008K2-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798

Modified Files:
	Options.py tokenizer.py 
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py	7 Nov 2002 22:25:46 -0000	1.66
--- Options.py	8 Nov 2002 04:06:23 -0000	1.67
***************
*** 42,53 ****
      x-.*
  
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages.  Set true to retain HTML tags in this
- # case.  On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
- 
  # If true, the first few characters of application/octet-stream sections
  # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
  
  all_options = {
!     'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
!                   'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
--- 339,343 ----
  
  all_options = {
!     'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py	7 Nov 2002 22:30:08 -0000	1.63
--- tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
***************
*** 495,504 ****
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was removed.
  
  ##############################################################################
--- 495,505 ----
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False.  Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was also removed.
  
  ##############################################################################
***************
*** 1167,1175 ****
          """Generate a stream of tokens from an email Message.
  
-         HTML tags are always stripped from text/plain sections.
-         options.retain_pure_html_tags controls whether HTML tags are
-         also stripped from text/html sections.  Except in special cases,
-         it's recommended to leave that at its default of false.
- 
          If options.check_octets is True, the first few undecoded characters
          of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             if (part.get_content_type() == "text/plain" or
!                     not options.retain_pure_html_tags):
!                 text = text.replace('&nbsp;', ' ')
!                 text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
--- 1224,1229 ----
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             text = text.replace('&nbsp;', ' ')
!             text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
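
With retain_pure_html_tags gone, tag stripping is unconditional.  A small
sketch of that step, assuming a simplified tag pattern: the real html_re is
defined elsewhere in tokenizer.py and is not shown in this diff, so the
pattern below is a stand-in, and strip_html is a hypothetical helper name.

```python
import re

# Stand-in for tokenizer.py's html_re (assumption: the real pattern differs).
# Matches a '<', up to 256 characters that are neither '<' nor '>', then '>'.
html_re = re.compile(r"<[^<>]{0,256}>", re.DOTALL)


def strip_html(text):
    """Replace &nbsp; entities and HTML/XML tags with single spaces."""
    text = text.replace('&nbsp;', ' ')
    return html_re.sub(' ', text)
```

Replacing each tag with a space rather than an empty string keeps the words
on either side of the tag from fusing into a single token.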


From richiehindle@users.sourceforge.net  Fri Nov  8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID: <E18A440-0006h6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390

Modified Files:
	pop3proxy.py 
Log Message:
 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome greeting.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.
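
Because the proxy now receives server output one line at a time, it has to
decide for itself when a response is complete.  The rule used by
onServerLine and isMultiline in the diff that follows can be sketched as two
plain functions.  The names is_multiline and response_complete are
illustrative, and treating unknown commands as single-line is an assumption
about the part of the method the diff does not show.

```python
SINGLE_LINE = ('USER', 'PASS', 'APOP', 'QUIT',
               'STAT', 'DELE', 'NOOP', 'RSET', 'KILL')


def is_multiline(command, args):
    """Return True if a positive response to the command spans many lines."""
    if command in SINGLE_LINE:
        return False
    if command in ('RETR', 'TOP'):
        return True
    if command in ('LIST', 'UIDL'):
        # With an argument these return one line; without, a listing.
        return len(args) == 0
    # Assumption: unknown commands are treated as single-line.
    return False


def response_complete(command, args, line, is_first_line):
    """Return True if this line ends the server's response."""
    if not is_multiline(command, args):
        return True
    if is_first_line and line.startswith('-ERR'):
        # A failed multiline command gets only the one-line error.
        return True
    return line == '.\r\n'
```

Keeping the decision free of socket state makes it easy to check against
canned server transcripts before wiring it into the async handlers.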


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py	7 Nov 2002 22:27:02 -0000	1.11
--- pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
***************
*** 47,50 ****
--- 47,74 ----
  
  
+ todo = """
+  o (Re)training interface - one message per line, quick-rendering table.
+  o Slightly-wordy index page; intro paragraph for each page.
+  o Once the training stuff is on a separate page, make the paste box
+    bigger.
+  o "Links" section (on homepage?) to project homepage, mailing list,
+    etc.
+  o "Home" link (with helmet!) at the end of each page.
+  o "Classify this" - just like Train.
+  o "Send me an email every [...] to remind me to train on new
+    messages."
+  o "Send me a status email every [...] telling how many mails have been
+    classified, etc."
+  o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
+    webbrowser?
+  o Possibly integrate Tim Stone's SMTP code - make it use async, make
+    the training code update (rather than replace!) the database.
+  o Can it cleanly dynamically update its status display while having a
+    POP3 conversation?  Hammering reload sucks.
+  o Add a command to save the database without shutting down, and one to
+    reload the database.
+  o Leave the word in the input field after a Word query.
+ """
+ 
  import sys, re, operator, errno, getopt, cPickle, cStringIO, time
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
              self.factory(*args)
  
+ 
  class BrighterAsyncChat(asynchat.async_chat):
      """An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
  
  
+ class ServerLineReader(BrighterAsyncChat):
+     """An async socket that reads lines from a remote server and
+     simply calls a callback with the data.  The BayesProxy object
+     can't connect to the real POP3 server and talk to it
+     synchronously, because that would block the process."""
+     
+     def __init__(self, serverName, serverPort, lineCallback):
+         BrighterAsyncChat.__init__(self)
+         self.lineCallback = lineCallback
+         self.request = ''
+         self.set_terminator('\r\n')
+         self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+         self.connect((serverName, serverPort))
+     
+     def collect_incoming_data(self, data):
+         self.request = self.request + data
+ 
+     def found_terminator(self):
+         self.lineCallback(self.request + '\r\n')
+         self.request = ''
+ 
+     def handle_close(self):
+         self.lineCallback('')
+         self.close()
+ 
+ 
  class POP3ProxyBase(BrighterAsyncChat):
      """An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
!         self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.serverSocket.connect((serverName, serverPort))
!         self.serverIn = self.serverSocket.makefile('r')  # For reading only
!         self.push(self.serverIn.readline())
  
      def onTransaction(self, command, args, response):
--- 177,189 ----
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
+         self.response = ''
          self.set_terminator('\r\n')
!         self.command = ''           # The POP3 command being processed...
!         self.args = ''              # ...and its arguments
!         self.isClosing = False      # Has the server closed the socket?
!         self.seenAllHeaders = False # For the current RETR or TOP
!         self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
!                                              self.onServerLine)
  
      def onTransaction(self, command, args, response):
***************
*** 139,152 ****
          raise NotImplementedError
  
!     def isMultiline(self, command, args):
!         """Returns True if the given request should get a multiline
          response (assuming the response is positive).
          """
!         if command in ['USER', 'PASS', 'APOP', 'QUIT',
!                        'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif command in ['RETR', 'TOP']:
              return True
!         elif command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
--- 194,237 ----
          raise NotImplementedError
  
!     def onServerLine(self, line):
!         """A line of response has been received from the POP3 server."""
!         isFirstLine = not self.response
!         self.response = self.response + line
!         
!         # Is this line that terminates a set of headers?
!         self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
!         # Has the server closed its end of the socket?
!         if not line:
!             self.isClosing = True
!         
!         # If we're not processing a command, just echo the response.
!         if not self.command:
!             self.push(self.response)
!             self.response = ''
!         
!         # Time out after 30 seconds for message-retrieval commands if
!         # all the headers are down.  The rest of the message will proxy
!         # straight through.
!         if self.command in ['TOP', 'RETR'] and \
!            self.seenAllHeaders and time.time() > self.startTime + 30:
!             self.onResponse()
!             self.response = ''
!         # If that's a complete response, handle it.
!         elif not self.isMultiline() or line == '.\r\n' or \
!            (isFirstLine and line.startswith('-ERR')):
!             self.onResponse()
!             self.response = ''
!     
!     def isMultiline(self):
!         """Returns True if the request should get a multiline
          response (assuming the response is positive).
          """
!         if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
!                             'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif self.command in ['RETR', 'TOP']:
              return True
!         elif self.command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
***************
*** 155,204 ****
              return False
  
-     def readResponse(self, command, args):
-         """Reads the POP3 server's response and returns a tuple of
-         (response, isClosing, timedOut).  isClosing is True if the
-         server closes the socket, which tells found_terminator() to
-         close when the response has been sent.  timedOut is set if a
-         TOP or RETR request was still arriving after 30 seconds, and
-         tells found_terminator() to proxy the remainder of the response.
-         """
-         responseLines = []
-         startTime = time.time()
-         isMulti = self.isMultiline(command, args)
-         isClosing = False
-         timedOut = False
-         isFirstLine = True
-         seenAllHeaders = False
-         while True:
-             line = self.serverIn.readline()
-             if not line:
-                 # The socket's been closed by the server, probably by QUIT.
-                 isClosing = True
-                 break
-             elif not isMulti or (isFirstLine and line.startswith('-ERR')):
-                 # A single-line response.
-                 responseLines.append(line)
-                 break
-             elif line == '.\r\n':
-                 # The termination line.
-                 responseLines.append(line)
-                 break
-             else:
-                 # A normal line - append it to the response and carry on.
-                 responseLines.append(line)
-                 seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
- 
-             # Time out after 30 seconds for message-retrieval commands
-             # if all the headers are down - found_terminator() knows how
-             # to deal with this.
-             if command in ['TOP', 'RETR'] and \
-                seenAllHeaders and time.time() > startTime + 30:
-                 timedOut = True
-                 break
- 
-             isFirstLine = False
- 
-         return ''.join(responseLines), isClosing, timedOut
- 
      def collect_incoming_data(self, data):
          """Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
      def found_terminator(self):
          """Asynchat override."""
-         # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverSocket.sendall(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             command, args = ('', '')
          else:
              splitCommand = self.request.strip().split(None, 1)
!             command = splitCommand[0].upper()
!             args = splitCommand[1:]
!         rawResponse, isClosing, timedOut = self.readResponse(command, args)
! 
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cookedResponse = self.onTransaction(command, args, rawResponse)
!         self.push(cookedResponse)
!         self.request = ''
! 
!         # If readResponse() timed out, we still need to read and proxy
!         # the rest of the message.
!         if timedOut:
!             while True:
!                 line = self.serverIn.readline()
!                 if not line:
!                     # The socket's been closed by the server.
!                     isClosing = True
!                     break
!                 elif line == '.\r\n':
!                     # The termination line.
!                     self.push(line)
!                     break
!                 else:
!                     # A normal line.
!                     self.push(line)
! 
!         # If readResponse() or the loop above decided that the server
!         # has closed its socket, close this one when the response has
!         # been sent.
!         if isClosing:
              self.close_when_done()
  
  
  class BayesProxyListener(Listener):
--- 246,288 ----
      def found_terminator(self):
          """Asynchat override."""
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
+             self.serverSocket.shutdown(2)
+             self.serverSocket.close()
              self.shutdown(2)
              self.close()
              raise SystemExit
!         
!         self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             self.command = self.args = ''
          else:
+             # A proper command.
              splitCommand = self.request.strip().split(None, 1)
!             self.command = splitCommand[0].upper()
!             self.args = splitCommand[1:]
!             self.startTime = time.time()
!         
!         self.request = ''
!         
!     def onResponse(self):
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cooked = self.onTransaction(self.command, self.args, self.response)
!         self.push(cooked)
!         
!         # If onServerLine() decided that the server has closed its
!         # socket, close this one when the response has been sent.
!         if self.isClosing:
              self.close_when_done()
  
+         # Reset.
+         self.command = ''
+         self.args = ''
+         self.isClosing = False
+         self.seenAllHeaders = False
+ 
  
  class BayesProxyListener(Listener):
***************
*** 452,456 ****
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15 }
               .header { font-size: 133%% }
               .content { margin: 15 }
--- 484,490 ----
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15;
!                        border-top: 1px solid black;
!                        border-bottom: 1px solid black }
               .header { font-size: 133%% }
               .content { margin: 15 }
***************
*** 466,470 ****
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
--- 500,504 ----
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>&nbsp;Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
***************
*** 475,481 ****
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              <input type='submit' value='Shutdown now'>
               </td></tr></table></form>\n"""
  
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
--- 509,520 ----
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              %s
               </td></tr></table></form>\n"""
  
+     shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
+     
+     shutdownPickle = shutdownDB + """&nbsp;&nbsp;
+             <input type='submit' name='how' value='Save &amp; shutdown'>"""
+ 
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
***************
*** 483,486 ****
--- 522,533 ----
                    &nbsp;<br>\n"""
      
+     summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
+               proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
+               Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
+               POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
+               Emails classified this session: <b>%(numSpams)d</b> spam,
+                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
+               """
+     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
***************
*** 488,491 ****
--- 535,550 ----
                  </form>"""
      
+     train = """<form action='/upload' method='POST'
+                 enctype='multipart/form-data'>
+             Either upload a message file: <input type='file' name='file'><br>
+             Or paste the whole message (incuding headers) here:<br>
+             <textarea name='text' rows='3' cols='60'></textarea><br>
+             Is this message
+             <input type='radio' name='which' value='ham'>Ham</input> or
+             <input type='radio'
+                    name='which' value='spam' checked>Spam</input>?<br>
+             <input type='submit' value='Train on this message'>
+             </form>"""
+     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = self.request.split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
--- 561,565 ----
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
          
          if path == '/helmet.gif':
!             self.pushOKHeaders('image/gif')
              self.push(self.helmet)
          else:
--- 606,614 ----
          
          if path == '/helmet.gif':
!             # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
!             inOneHour = time.gmtime(time.time() + 3600)
!             expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
!             extraHeaders = {'Expires': expiryDate}
!             self.pushOKHeaders('image/gif', extraHeaders)
              self.push(self.helmet)
          else:
***************
*** 554,558 ****
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % url)
              else:
                  # This is a request for a valid page; run the handler.
--- 617,621 ----
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % path)
              else:
                  # This is a request for a valid page; run the handler.
***************
*** 561,569 ****
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 self.push(self.footer % timeString)
      
!     def pushOKHeaders(self, contentType):
!         self.push("HTTP/1.0 200 OK\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
          self.push("\r\n")
  
--- 624,641 ----
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 if status.useDB:
!                     self.push(self.footer % (timeString, self.shutdownDB))
!                 else:
!                     self.push(self.footer % (timeString, self.shutdownPickle))
      
!     def pushOKHeaders(self, contentType, extraHeaders={}):
!         timeNow = time.gmtime(time.time())
!         httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
!         self.push("HTTP/1.1 200 OK\r\n")
!         self.push("Connection: close\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
+         self.push("Date: %s\r\n" % httpNow)
+         for name, value in extraHeaders.items():
+             self.push("%s: %s\r\n" % (name, value))
          self.push("\r\n")
  
***************
*** 583,616 ****
  
      def onHome(self, params):
!         summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!                   proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
!                   Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
!                   POP3 conversations this session:
!                     <b>%(totalSessions)d</b>.<br>
!                   Emails classified this session: <b>%(numSpams)d</b> spam,
!                     <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
!                   """ % status.__dict__
!         
!         train = """<form action='/upload' method='POST'
!                     enctype='multipart/form-data'>
!                 Either upload a message file:
!                 <input type='file' name='file'><br>
!                 Or paste the whole message (incuding headers) here:<br>
!                 <textarea name='text' rows='3' cols='60'></textarea><br>
!                 Is this message
!                 <input type='radio' name='which' value='ham'>Ham</input> or
!                 <input type='radio'
!                        name='which' value='spam' checked>Spam</input>?<br>
!                 <input type='submit' value='Train on this message'>
!                 </form>"""
!         
!         body = (self.pageSection % ('Status', summary) +
!                 self.pageSection % ('Word query', self.wordQuery) +
!                 self.pageSection % ('Train', train))
          self.push(body)
  
      def onShutdown(self, params):
!         self.push("<p><b>Shutdown.</b> Goodbye.</p>")
!         self.push(' ')  # Acts as a flush for small buffers.
          self.shutdown(2)
          self.close()
--- 655,675 ----
  
      def onHome(self, params):
!         """Serve up the homepage."""
!         body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Word query', self.wordQuery)+
!                 self.pageSection % ('Train', self.train))
          self.push(body)
  
      def onShutdown(self, params):
!         """Shutdown the server, saving the pickle if requested to do so."""
!         if params['how'].lower().find('save') >= 0:
!             if not status.useDB and status.pickleName:
!                 self.push("<b>Saving...</b>")
!                 self.push(' ')  # Acts as a flush for small buffers.
!                 fp = open(status.pickleName, 'wb')
!                 cPickle.dump(self.bayes, fp, 1)
!                 fp.close()
!         self.push("<b>Shutdown</b>. Goodbye.")
!         self.push(' ')
          self.shutdown(2)
          self.close()
***************
*** 618,625 ****
  
      def onUpload(self, params):
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
          # Append the message to a file, to make it easier to rebuild
!         # the database later.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
--- 677,690 ----
  
      def onUpload(self, params):
+         """Train on an uploaded or pasted message."""
+         # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         
          # Append the message to a file, to make it easier to rebuild
!         # the database later.   This is a temporary implementation -
!         # it should keep a Corpus (from Tim Stone's forthcoming message
!         # management module) to manage a cache of messages.  It needs
!         # to keep them for the HTML retraining interface anyway.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
***************
*** 627,642 ****
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From ???@???\n")  # fake From line (XXX good enough?)
          f.write(message)
!         f.write("\n")
          f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("""<p>Trained on your message. Saving database...</p>""")
!         self.push(" ")  # Flush... must find out how to do this properly...
!         if not status.useDB and status.pickleName:
!             fp = open(status.pickleName, 'wb')
!             cPickle.dump(self.bayes, fp, 1)
!             fp.close()
!         self.push("<p>Done.</p><p><a href='/'>Home</a></p>")
  
      def onWordquery(self, params):
--- 692,704 ----
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
          f.write(message)
!         f.write("\n\n")
          f.close()
+ 
+         # Train on the message.
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
!         self.push(self.pageSection % ('Train another', self.train))
  
      def onWordquery(self, params):
***************
*** 656,660 ****
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s':" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
--- 718,722 ----
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
***************
*** 765,771 ****
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))
          self.request = ''
  
      def onStat(self, command, args):
          """POP3 STAT command."""
--- 827,839 ----
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))   # Or push_slowly for testing
          self.request = ''
  
+     def push_slowly(self, response):
+         """Useful for testing."""
+         for c in response:
+             self.push(c)
+             time.sleep(0.02)
+ 
      def onStat(self, command, args):
          """POP3 STAT command."""
***************
*** 777,781 ****
          """POP3 LIST command, with optional message number argument."""
          if args:
!             number = int(args)
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
          """POP3 LIST command, with optional message number argument."""
          if args:
!             try:
!                 number = int(args)
!             except ValueError:
!                 number = -1
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         return self._getMessage(int(args), 12345)
  
      def onTop(self, command, args):
          """POP3 RETR command."""
!         number, lines = map(int, args.split())
          return self._getMessage(number, lines)
  
--- 874,889 ----
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         try:
!             number = int(args)
!         except ValueError:
!             number = -1
!         return self._getMessage(number, 12345)
  
      def onTop(self, command, args):
          """POP3 RETR command."""
!         try:
!             number, lines = map(int, args.split())
!         except ValueError:
!             number, lines = -1, -1
          return self._getMessage(number, lines)
  
***************
*** 863,867 ****
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) != -1
  
      # Kill the proxy and the test server.
--- 941,945 ----
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) >= 0
  
      # Kill the proxy and the test server.
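
The onUpload() hunk above appends each trained message to an mbox file, using a fixed "From " separator line and a blank line after the message. Below is a minimal sketch of that step, not the proxy's own code. The helper name append_to_mbox is hypothetical, and the ">From " quoting of body lines is an added assumption that does not appear in the diff.

```python
import os
import re
import tempfile

# Separator line copied from the diff above; the address and date are fixed
# placeholders, not real envelope data.
FROM_LINE = "From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n"


def append_to_mbox(path, message):
    """Append one message to an mbox file (hypothetical helper)."""
    # Normalise line endings the same way onUpload() does.
    message = message.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption, not in the original: quote body lines that start with
    # "From " so they are not mistaken for a message separator.
    message = re.sub(r"(?m)^From ", ">From ", message)
    with open(path, "a") as f:
        f.write(FROM_LINE)
        f.write(message)
        f.write("\n\n")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "_pop3proxyham.mbox")
    append_to_mbox(path, "Subject: hi\r\n\r\nFrom here on, all is well.\r\n")
    append_to_mbox(path, "Subject: again\n\nbody\n")
    with open(path) as f:
        print(f.read())
```

Keeping the separator and the trailing blank line consistent is what should let a standard mbox reader split the file back into individual messages when the database is rebuilt.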


From tim_one@users.sourceforge.net  Fri Nov  8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
	tokenizer.py,1.63,1.64
Message-ID: <E18A0Pd-0008K2-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798

Modified Files:
	Options.py tokenizer.py 
Log Message:
Removed option retain_pure_html_tags; nobody enables it anymore, and it's
hard to believe it would ever help again (except as an HTML detector).
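
With the option gone, tag stripping applies to every text part regardless of content type. Here is a rough sketch of that unconditional step. The real tokenizer.html_re is not shown in this checkin, so the pattern below is a simplified stand-in, and strip_html is a hypothetical helper name.

```python
import re

# Simplified stand-in for tokenizer.html_re, whose real definition is not
# shown in this checkin.
html_re = re.compile(r"<[^>]*>")


def strip_html(text):
    """Replace &nbsp; and HTML/XML tags with spaces (hypothetical helper)."""
    text = text.replace("&nbsp;", " ")
    return html_re.sub(" ", text)


if __name__ == "__main__":
    print(strip_html("<p>Buy&nbsp;<b>now</b></p>").split())
```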


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py	7 Nov 2002 22:25:46 -0000	1.66
--- Options.py	8 Nov 2002 04:06:23 -0000	1.67
***************
*** 42,53 ****
      x-.*
  
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages.  Set true to retain HTML tags in this
- # case.  On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
- 
  # If true, the first few characters of application/octet-stream sections
  # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
  
  all_options = {
!     'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
!                   'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
--- 339,343 ----
  
  all_options = {
!     'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py	7 Nov 2002 22:30:08 -0000	1.63
--- tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
***************
*** 495,504 ****
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was removed.
  
  ##############################################################################
--- 495,505 ----
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False.  Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was also removed.
  
  ##############################################################################
***************
*** 1167,1175 ****
          """Generate a stream of tokens from an email Message.
  
-         HTML tags are always stripped from text/plain sections.
-         options.retain_pure_html_tags controls whether HTML tags are
-         also stripped from text/html sections.  Except in special cases,
-         it's recommended to leave that at its default of false.
- 
          If options.check_octets is True, the first few undecoded characters
          of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             if (part.get_content_type() == "text/plain" or
!                     not options.retain_pure_html_tags):
!                 text = text.replace('&nbsp;', ' ')
!                 text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
--- 1224,1229 ----
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             text = text.replace('&nbsp;', ' ')
!             text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.


From richiehindle@users.sourceforge.net  Fri Nov  8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID: <E18A440-0006h6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390

Modified Files:
	pop3proxy.py 
Log Message:
 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome greeting.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.
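
Because the proxy can no longer block in readline(), each server line now arrives through a callback, and the proxy has to decide for itself when a response is complete. The sketch below shows only that decision logic, with no sockets involved. The class ResponseAssembler and its method names are hypothetical. The rules are simplified from onServerLine() and isMultiline() in the diff that follows, and the 30-second timeout is left out.

```python
class ResponseAssembler:
    """Collects server lines for one POP3 command (hypothetical class)."""

    def __init__(self, on_response):
        self.on_response = on_response  # called with (command, response)
        self.command = ''
        self.args = []
        self.response = ''

    def start(self, request):
        """Record the command whose response we are about to receive."""
        parts = request.strip().split(None, 1)
        self.command = parts[0].upper() if parts else ''
        self.args = parts[1:]

    def is_multiline(self):
        """True if a positive response to the command spans several lines."""
        if self.command in ('RETR', 'TOP'):
            return True
        if self.command in ('LIST', 'UIDL'):
            return len(self.args) == 0
        return False

    def feed(self, line):
        """Handle one line from the server; fire the callback when done."""
        is_first = not self.response
        self.response += line
        done = (not self.is_multiline()
                or line == '.\r\n'
                or (is_first and line.startswith('-ERR')))
        if done:
            self.on_response(self.command, self.response)
            self.command, self.args, self.response = '', [], ''


if __name__ == "__main__":
    seen = []
    assembler = ResponseAssembler(lambda cmd, resp: seen.append((cmd, resp)))
    assembler.start('STAT')
    assembler.feed('+OK 2 320\r\n')
    assembler.start('LIST')
    for line in ['+OK\r\n', '1 120\r\n', '2 200\r\n', '.\r\n']:
        assembler.feed(line)
    print(seen)
```

Keeping the state on the object rather than in local variables of a read loop is what allows the event loop to service the web interface between lines.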


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py	7 Nov 2002 22:27:02 -0000	1.11
--- pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
***************
*** 47,50 ****
--- 47,74 ----
  
  
+ todo = """
+  o (Re)training interface - one message per line, quick-rendering table.
+  o Slightly-wordy index page; intro paragraph for each page.
+  o Once the training stuff is on a separate page, make the paste box
+    bigger.
+  o "Links" section (on homepage?) to project homepage, mailing list,
+    etc.
+  o "Home" link (with helmet!) at the end of each page.
+  o "Classify this" - just like Train.
+  o "Send me an email every [...] to remind me to train on new
+    messages."
+  o "Send me a status email every [...] telling how many mails have been
+    classified, etc."
+  o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
+    webbrowser?
+  o Possibly integrate Tim Stone's SMTP code - make it use async, make
+    the training code update (rather than replace!) the database.
+  o Can it cleanly dynamically update its status display while having a
+    POP3 converation?  Hammering reload sucks.
+  o Add a command to save the database without shutting down, and one to
+    reload the database.
+  o Leave the word in the input field after a Word query.
+ """
+ 
  import sys, re, operator, errno, getopt, cPickle, cStringIO, time
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
              self.factory(*args)
  
+ 
  class BrighterAsyncChat(asynchat.async_chat):
      """An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
  
  
+ class ServerLineReader(BrighterAsyncChat):
+     """An async socket that reads lines from a remote server and
+     simply calls a callback with the data.  The BayesProxy object
+     can't connect to the real POP3 server and talk to it
+     synchronously, because that would block the process."""
+     
+     def __init__(self, serverName, serverPort, lineCallback):
+         BrighterAsyncChat.__init__(self)
+         self.lineCallback = lineCallback
+         self.request = ''
+         self.set_terminator('\r\n')
+         self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+         self.connect((serverName, serverPort))
+     
+     def collect_incoming_data(self, data):
+         self.request = self.request + data
+ 
+     def found_terminator(self):
+         self.lineCallback(self.request + '\r\n')
+         self.request = ''
+ 
+     def handle_close(self):
+         self.lineCallback('')
+         self.close()
+ 
+ 
  class POP3ProxyBase(BrighterAsyncChat):
      """An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
!         self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.serverSocket.connect((serverName, serverPort))
!         self.serverIn = self.serverSocket.makefile('r')  # For reading only
!         self.push(self.serverIn.readline())
  
      def onTransaction(self, command, args, response):
--- 177,189 ----
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
+         self.response = ''
          self.set_terminator('\r\n')
!         self.command = ''           # The POP3 command being processed...
!         self.args = ''              # ...and its arguments
!         self.isClosing = False      # Has the server closed the socket?
!         self.seenAllHeaders = False # For the current RETR or TOP
!         self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
!                                              self.onServerLine)
  
      def onTransaction(self, command, args, response):
***************
*** 139,152 ****
          raise NotImplementedError
  
!     def isMultiline(self, command, args):
!         """Returns True if the given request should get a multiline
          response (assuming the response is positive).
          """
!         if command in ['USER', 'PASS', 'APOP', 'QUIT',
!                        'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif command in ['RETR', 'TOP']:
              return True
!         elif command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
--- 194,237 ----
          raise NotImplementedError
  
!     def onServerLine(self, line):
!         """A line of response has been received from the POP3 server."""
!         isFirstLine = not self.response
!         self.response = self.response + line
!         
!         # Is this the line that terminates a set of headers?
!         self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
!         # Has the server closed its end of the socket?
!         if not line:
!             self.isClosing = True
!         
!         # If we're not processing a command, just echo the response.
!         if not self.command:
!             self.push(self.response)
!             self.response = ''
!         
!         # Time out after 30 seconds for message-retrieval commands if
!         # all the headers are down.  The rest of the message will proxy
!         # straight through.
!         if self.command in ['TOP', 'RETR'] and \
!            self.seenAllHeaders and time.time() > self.startTime + 30:
!             self.onResponse()
!             self.response = ''
!         # If that's a complete response, handle it.
!         elif not self.isMultiline() or line == '.\r\n' or \
!            (isFirstLine and line.startswith('-ERR')):
!             self.onResponse()
!             self.response = ''
!     
!     def isMultiline(self):
!         """Returns True if the request should get a multiline
          response (assuming the response is positive).
          """
!         if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
!                             'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif self.command in ['RETR', 'TOP']:
              return True
!         elif self.command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
***************
*** 155,204 ****
              return False
  
-     def readResponse(self, command, args):
-         """Reads the POP3 server's response and returns a tuple of
-         (response, isClosing, timedOut).  isClosing is True if the
-         server closes the socket, which tells found_terminator() to
-         close when the response has been sent.  timedOut is set if a
-         TOP or RETR request was still arriving after 30 seconds, and
-         tells found_terminator() to proxy the remainder of the response.
-         """
-         responseLines = []
-         startTime = time.time()
-         isMulti = self.isMultiline(command, args)
-         isClosing = False
-         timedOut = False
-         isFirstLine = True
-         seenAllHeaders = False
-         while True:
-             line = self.serverIn.readline()
-             if not line:
-                 # The socket's been closed by the server, probably by QUIT.
-                 isClosing = True
-                 break
-             elif not isMulti or (isFirstLine and line.startswith('-ERR')):
-                 # A single-line response.
-                 responseLines.append(line)
-                 break
-             elif line == '.\r\n':
-                 # The termination line.
-                 responseLines.append(line)
-                 break
-             else:
-                 # A normal line - append it to the response and carry on.
-                 responseLines.append(line)
-                 seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
- 
-             # Time out after 30 seconds for message-retrieval commands
-             # if all the headers are down - found_terminator() knows how
-             # to deal with this.
-             if command in ['TOP', 'RETR'] and \
-                seenAllHeaders and time.time() > startTime + 30:
-                 timedOut = True
-                 break
- 
-             isFirstLine = False
- 
-         return ''.join(responseLines), isClosing, timedOut
- 
      def collect_incoming_data(self, data):
          """Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
      def found_terminator(self):
          """Asynchat override."""
-         # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverSocket.sendall(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             command, args = ('', '')
          else:
              splitCommand = self.request.strip().split(None, 1)
!             command = splitCommand[0].upper()
!             args = splitCommand[1:]
!         rawResponse, isClosing, timedOut = self.readResponse(command, args)
! 
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cookedResponse = self.onTransaction(command, args, rawResponse)
!         self.push(cookedResponse)
!         self.request = ''
! 
!         # If readResponse() timed out, we still need to read and proxy
!         # the rest of the message.
!         if timedOut:
!             while True:
!                 line = self.serverIn.readline()
!                 if not line:
!                     # The socket's been closed by the server.
!                     isClosing = True
!                     break
!                 elif line == '.\r\n':
!                     # The termination line.
!                     self.push(line)
!                     break
!                 else:
!                     # A normal line.
!                     self.push(line)
! 
!         # If readResponse() or the loop above decided that the server
!         # has closed its socket, close this one when the response has
!         # been sent.
!         if isClosing:
              self.close_when_done()
  
  
  class BayesProxyListener(Listener):
--- 246,288 ----
      def found_terminator(self):
          """Asynchat override."""
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
+             self.serverSocket.shutdown(2)
+             self.serverSocket.close()
              self.shutdown(2)
              self.close()
              raise SystemExit
!         
!         self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             self.command = self.args = ''
          else:
+             # A proper command.
              splitCommand = self.request.strip().split(None, 1)
!             self.command = splitCommand[0].upper()
!             self.args = splitCommand[1:]
!             self.startTime = time.time()
!         
!         self.request = ''
!         
!     def onResponse(self):
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cooked = self.onTransaction(self.command, self.args, self.response)
!         self.push(cooked)
!         
!         # If onServerLine() decided that the server has closed its
!         # socket, close this one when the response has been sent.
!         if self.isClosing:
              self.close_when_done()
  
+         # Reset.
+         self.command = ''
+         self.args = ''
+         self.isClosing = False
+         self.seenAllHeaders = False
+ 
  
  class BayesProxyListener(Listener):
***************
*** 452,456 ****
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15 }
               .header { font-size: 133%% }
               .content { margin: 15 }
--- 484,490 ----
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding: 5; padding-left: 15;
!                        border-top: 1px solid black;
!                        border-bottom: 1px solid black }
               .header { font-size: 133%% }
               .content { margin: 15 }
***************
*** 466,470 ****
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
--- 500,504 ----
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>&nbsp;Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
***************
*** 475,481 ****
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              <input type='submit' value='Shutdown now'>
               </td></tr></table></form>\n"""
  
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
--- 509,520 ----
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              %s
               </td></tr></table></form>\n"""
  
+     shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
+     
+     shutdownPickle = shutdownDB + """&nbsp;&nbsp;
+             <input type='submit' name='how' value='Save &amp; shutdown'>"""
+ 
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
***************
*** 483,486 ****
--- 522,533 ----
                    &nbsp;<br>\n"""
      
+     summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
+               proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
+               Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
+               POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
+               Emails classified this session: <b>%(numSpams)d</b> spam,
+                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
+               """
+     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
***************
*** 488,491 ****
--- 535,550 ----
                  </form>"""
      
+     train = """<form action='/upload' method='POST'
+                 enctype='multipart/form-data'>
+             Either upload a message file: <input type='file' name='file'><br>
+             Or paste the whole message (including headers) here:<br>
+             <textarea name='text' rows='3' cols='60'></textarea><br>
+             Is this message
+             <input type='radio' name='which' value='ham'>Ham</input> or
+             <input type='radio'
+                    name='which' value='spam' checked>Spam</input>?<br>
+             <input type='submit' value='Train on this message'>
+             </form>"""
+     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = self.request.split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
--- 561,565 ----
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
          
          if path == '/helmet.gif':
!             self.pushOKHeaders('image/gif')
              self.push(self.helmet)
          else:
--- 606,614 ----
          
          if path == '/helmet.gif':
!             # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
!             inOneHour = time.gmtime(time.time() + 3600)
!             expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
!             extraHeaders = {'Expires': expiryDate}
!             self.pushOKHeaders('image/gif', extraHeaders)
              self.push(self.helmet)
          else:
***************
*** 554,558 ****
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % url)
              else:
                  # This is a request for a valid page; run the handler.
--- 617,621 ----
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % path)
              else:
                  # This is a request for a valid page; run the handler.
***************
*** 561,569 ****
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 self.push(self.footer % timeString)
      
!     def pushOKHeaders(self, contentType):
!         self.push("HTTP/1.0 200 OK\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
          self.push("\r\n")
  
--- 624,641 ----
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 if status.useDB:
!                     self.push(self.footer % (timeString, self.shutdownDB))
!                 else:
!                     self.push(self.footer % (timeString, self.shutdownPickle))
      
!     def pushOKHeaders(self, contentType, extraHeaders={}):
!         timeNow = time.gmtime(time.time())
!         httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
!         self.push("HTTP/1.1 200 OK\r\n")
!         self.push("Connection: close\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
+         self.push("Date: %s\r\n" % httpNow)
+         for name, value in extraHeaders.items():
+             self.push("%s: %s\r\n" % (name, value))
          self.push("\r\n")
  
***************
*** 583,616 ****
  
      def onHome(self, params):
!         summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!                   proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
!                   Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
!                   POP3 conversations this session:
!                     <b>%(totalSessions)d</b>.<br>
!                   Emails classified this session: <b>%(numSpams)d</b> spam,
!                     <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
!                   """ % status.__dict__
!         
!         train = """<form action='/upload' method='POST'
!                     enctype='multipart/form-data'>
!                 Either upload a message file:
!                 <input type='file' name='file'><br>
!                 Or paste the whole message (incuding headers) here:<br>
!                 <textarea name='text' rows='3' cols='60'></textarea><br>
!                 Is this message
!                 <input type='radio' name='which' value='ham'>Ham</input> or
!                 <input type='radio'
!                        name='which' value='spam' checked>Spam</input>?<br>
!                 <input type='submit' value='Train on this message'>
!                 </form>"""
!         
!         body = (self.pageSection % ('Status', summary) +
!                 self.pageSection % ('Word query', self.wordQuery) +
!                 self.pageSection % ('Train', train))
          self.push(body)
  
      def onShutdown(self, params):
!         self.push("<p><b>Shutdown.</b> Goodbye.</p>")
!         self.push(' ')  # Acts as a flush for small buffers.
          self.shutdown(2)
          self.close()
--- 655,675 ----
  
      def onHome(self, params):
!         """Serve up the homepage."""
!         body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Word query', self.wordQuery)+
!                 self.pageSection % ('Train', self.train))
          self.push(body)
  
      def onShutdown(self, params):
!         """Shut down the server, saving the pickle if requested to do so."""
!         if params['how'].lower().find('save') >= 0:
!             if not status.useDB and status.pickleName:
!                 self.push("<b>Saving...</b>")
!                 self.push(' ')  # Acts as a flush for small buffers.
!                 fp = open(status.pickleName, 'wb')
!                 cPickle.dump(self.bayes, fp, 1)
!                 fp.close()
!         self.push("<b>Shutdown</b>. Goodbye.")
!         self.push(' ')
          self.shutdown(2)
          self.close()
***************
*** 618,625 ****
  
      def onUpload(self, params):
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
          # Append the message to a file, to make it easier to rebuild
!         # the database later.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
--- 677,690 ----
  
      def onUpload(self, params):
+         """Train on an uploaded or pasted message."""
+         # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         
          # Append the message to a file, to make it easier to rebuild
!         # the database later.   This is a temporary implementation -
!         # it should keep a Corpus (from Tim Stone's forthcoming message
!         # management module) to manage a cache of messages.  It needs
!         # to keep them for the HTML retraining interface anyway.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
***************
*** 627,642 ****
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From ???@???\n")  # fake From line (XXX good enough?)
          f.write(message)
!         f.write("\n")
          f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("""<p>Trained on your message. Saving database...</p>""")
!         self.push(" ")  # Flush... must find out how to do this properly...
!         if not status.useDB and status.pickleName:
!             fp = open(status.pickleName, 'wb')
!             cPickle.dump(self.bayes, fp, 1)
!             fp.close()
!         self.push("<p>Done.</p><p><a href='/'>Home</a></p>")
  
      def onWordquery(self, params):
--- 692,704 ----
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From pop3proxy@spambayes.org Mon Jan 31 00:00:00 2000\n")
          f.write(message)
!         f.write("\n\n")
          f.close()
+ 
+         # Train on the message.
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
!         self.push(self.pageSection % ('Train another', self.train))
  
      def onWordquery(self, params):
***************
*** 656,660 ****
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s':" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
--- 718,722 ----
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
***************
*** 765,771 ****
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))
          self.request = ''
  
      def onStat(self, command, args):
          """POP3 STAT command."""
--- 827,839 ----
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))   # Or push_slowly for testing
          self.request = ''
  
+     def push_slowly(self, response):
+         """Useful for testing."""
+         for c in response:
+             self.push(c)
+             time.sleep(0.02)
+ 
      def onStat(self, command, args):
          """POP3 STAT command."""
***************
*** 777,781 ****
          """POP3 LIST command, with optional message number argument."""
          if args:
!             number = int(args)
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
          """POP3 LIST command, with optional message number argument."""
          if args:
!             try:
!                 number = int(args)
!             except ValueError:
!                 number = -1
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         return self._getMessage(int(args), 12345)
  
      def onTop(self, command, args):
          """POP3 RETR command."""
!         number, lines = map(int, args.split())
          return self._getMessage(number, lines)
  
--- 874,889 ----
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         try:
!             number = int(args)
!         except ValueError:
!             number = -1
!         return self._getMessage(number, 12345)
  
      def onTop(self, command, args):
          """POP3 RETR command."""
!         try:
!             number, lines = map(int, args.split())
!         except ValueError:
!             number, lines = -1, -1
          return self._getMessage(number, lines)
  
***************
*** 863,867 ****
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) != -1
  
      # Kill the proxy and the test server.
--- 941,945 ----
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) >= 0
  
      # Kill the proxy and the test server.
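
The reworked proxy in the diff above decides, one server line at a time,
whether a POP3 response is complete.  A minimal sketch of just that decision
follows.  It assumes a hypothetical ResponseTracker class that is not part of
pop3proxy.py, is written for current Python 3 rather than the Python 2 of the
checkin, and leaves out the 30-second timeout and the asynchat plumbing.

```python
# Hypothetical helper for illustration only; not part of pop3proxy.py.
ALWAYS_MULTI = {'RETR', 'TOP'}
MULTI_WITHOUT_ARGS = {'LIST', 'UIDL'}


def is_multiline(command, args):
    """True if a positive response to this request spans several lines."""
    if command in ALWAYS_MULTI:
        return True
    if command in MULTI_WITHOUT_ARGS:
        # "LIST" lists every message; "LIST 1" answers on one line.
        return len(args) == 0
    # USER, PASS, STAT, ... and anything unknown: single line.
    return False


class ResponseTracker:
    """Accumulates server lines for one request and reports completion."""

    def __init__(self, request):
        parts = request.strip().split(None, 1)
        self.command = parts[0].upper() if parts else ''
        self.args = parts[1:]
        self.response = ''

    def feed(self, line):
        """Add one server line; return True once the response is complete."""
        is_first_line = not self.response
        self.response += line
        if not is_multiline(self.command, self.args):
            return True
        if is_first_line and line.startswith('-ERR'):
            # A negative reply is always a single line.
            return True
        # Multi-line responses end with a line holding a lone dot.
        return line == '.\r\n'


tracker = ResponseTracker('LIST')
for line in ['+OK 2 messages\r\n', '1 120\r\n', '2 200\r\n', '.\r\n']:
    if tracker.feed(line):
        print('complete after %d bytes' % len(tracker.response))
```

The sketch passes the arguments explicitly.  In the diff, the unchanged
context line "return len(args) == 0" still names a bare args, which
isMultiline(self) no longer receives, so it presumably needs to read
self.args after this refactoring.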

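The Date and Expires headers in the diff above are built with
time.strftime(), whose %a and %b names follow the current locale.  A small
sketch of one alternative, using the standard library's
email.utils.formatdate, which always emits English day and month names.  The
helper name http_date is hypothetical, and the code targets current Python 3.

```python
import time
from email.utils import formatdate


def http_date(offset_seconds=0, now=None):
    """Return an HTTP date string, e.g. for a Date or Expires header.

    offset_seconds shifts the time into the future (3600 = one hour).
    now is a Unix timestamp; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    # usegmt=True ends the string with 'GMT' rather than '-0000'.
    return formatdate(now + offset_seconds, usegmt=True)


extra_headers = {'Expires': http_date(3600)}
print(extra_headers['Expires'])
```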

From jvr@users.sourceforge.net  Sat Nov  9 18:05:44 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Sat, 09 Nov 2002 10:05:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13
Message-ID: <E18AZzM-0005QJ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20814

Modified Files:
	pop3proxy.py 
Log Message:
force word query to be lowercase, making the UI case insensitive

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
--- pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
***************
*** 704,707 ****
--- 704,708 ----
      def onWordquery(self, params):
          word = params['word']
+         word = word.lower()
          try:
              # Must be a better way to get __dict__ for a new-style class...


From hooft@users.sourceforge.net  Sat Nov  9 21:48:55 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sat, 09 Nov 2002 13:48:55 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,NONE,1.1
Message-ID: <E18AdTL-00086Q-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31102

Added Files:
	weaktest.py 
Log Message:
New test driver to simulate "unsure only" training

--- NEW FILE: weaktest.py ---
#! /usr/bin/env python

# A test driver using "the standard" test directory structure.
# This simulates a user who receives email and trains only on false
# positives (fp), false negatives (fn) and unsure messages. It starts by
# training on the first 30 messages; from that point on, correctly
# classified messages are not used for training. This can be used to see
# how the scoring algorithm performs under such conditions. Questions are:
#  * How does the size of the database behave over time?
#  * Does the classification get better over time?
#  * Are there other combinations of parameters for the classifier
#    that make this better behaved than the default values?


"""Usage: %(program)s  [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

from __future__ import generators

import sys,os

from Options import options
import hammie

import msgs

program = sys.argv[0]

debug = 0

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def drive(nsets):
    print options.display()

    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs  = [options.ham_directories % i for i in range(1, nsets+1)]

    spamfns = [(x,y,1) for x in spamdirs for y in os.listdir(x)]
    hamfns = [(x,y,0) for x in hamdirs for y in os.listdir(x)]

    nham = len(hamfns)
    nspam = len(spamfns)
    
    allfns={}
    for fn in spamfns+hamfns:
        allfns[fn] = None

    d = hammie.Hammie(hammie.createbayes('weaktest.db', False))

    n=0
    unsure=0
    hamtrain=0
    spamtrain=0
    fp=0
    fn=0
    for dir,name, is_spam in allfns.iterkeys():
        n += 1
        m=msgs.Msg(dir, name).guts
        if debug:
            print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name),
        if hamtrain + spamtrain > 30:
            scr=d.score(m)
        else:
            scr=0.50
        if debug:
            print "score:%.3f"%scr,
        if scr < hammie.SPAM_THRESHOLD and is_spam:
            if scr < hammie.HAM_THRESHOLD:
                fn += 1
                if debug:
                    print "fn"
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            spamtrain += 1
            d.train_spam(m)
            d.update_probabilities()
        elif scr > hammie.HAM_THRESHOLD and not is_spam:
            if scr > hammie.SPAM_THRESHOLD:
                fp += 1
                if debug:
                    print "fp"
                else:
                    print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            hamtrain += 1
            d.train_ham(m)
            d.update_probabilities()
        else:
            if debug:
                print "OK"
        if n % 100 == 0:
            print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
                n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
    print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
    print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
        unsure,unsure*100.0/len(allfns))
    print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
    print "fp: %d fn: %d"%(fp,fn)
    FPW = options.best_cutoff_fp_weight
    FNW = options.best_cutoff_fn_weight
    UNW = options.best_cutoff_unsure_weight
    print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)
    
def main():
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
                                   ['ham-keep=', 'spam-keep='])
    except getopt.error, msg:
        usage(1, msg)

    nsets = seed = hamkeep = spamkeep = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    drive(nsets)

if __name__ == "__main__":
    main()
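
The "unsure only" training loop in weaktest.py can be tried without a
message corpus.  The sketch below is a rough illustration under stated
assumptions: ToyClassifier is a hypothetical stand-in for hammie.Hammie, the
two cutoffs are made-up values rather than the real thresholds, and the code
is written for current Python 3.

```python
from collections import Counter

HAM_CUTOFF = 0.2    # assumed value, for illustration only
SPAM_CUTOFF = 0.9   # assumed value, for illustration only


class ToyClassifier:
    """Hypothetical stand-in for hammie.Hammie: scores by token counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(text.split()))

    def score(self, text):
        tokens = set(text.split())
        s = sum(self.spam[t] for t in tokens)
        h = sum(self.ham[t] for t in tokens)
        # No evidence at all counts as "unsure".
        return 0.5 if s + h == 0 else s / (s + h)


def weak_train(messages, startup=30):
    """Train only on fp, fn and unsure messages; return the counters.

    messages is an iterable of (text, is_spam) pairs.
    """
    clf = ToyClassifier()
    counts = Counter()
    for text, is_spam in messages:
        trained = counts['hamtrain'] + counts['spamtrain']
        # During start-up every message is forced to score as unsure.
        score = clf.score(text) if trained > startup else 0.5
        if is_spam and score < SPAM_CUTOFF:
            counts['fn' if score < HAM_CUTOFF else 'unsure'] += 1
            counts['spamtrain'] += 1
            clf.train(text, True)
        elif not is_spam and score > HAM_CUTOFF:
            counts['fp' if score > SPAM_CUTOFF else 'unsure'] += 1
            counts['hamtrain'] += 1
            clf.train(text, False)
        # Correctly classified messages are not trained on.
    return counts


sample = [("buy pills now", True), ("meeting at noon", False)] * 5
print(dict(weak_train(sample, startup=2)))
```

As in the driver, the interesting numbers are how few messages end up being
trained on and how the fp, fn and unsure counters grow once start-up is over.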


From hooft@users.sourceforge.net  Sun Nov 10 12:02:36 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:02:36 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.1,1.2
Message-ID: <E18AqnU-0005vF-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22741

Modified Files:
	weaktest.py 
Log Message:
add flexcost; sanitize spacing

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weaktest.py	9 Nov 2002 21:48:52 -0000	1.1
--- weaktest.py	10 Nov 2002 12:02:33 -0000	1.2
***************
*** 59,63 ****
      nspam = len(spamfns)
      
!     allfns={}
      for fn in spamfns+hamfns:
          allfns[fn] = None
--- 59,63 ----
      nspam = len(spamfns)
      
!     allfns = {}
      for fn in spamfns+hamfns:
          allfns[fn] = None
***************
*** 65,74 ****
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
!     n=0
!     unsure=0
!     hamtrain=0
!     spamtrain=0
!     fp=0
!     fn=0
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
--- 65,80 ----
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
!     n = 0
!     unsure = 0
!     hamtrain = 0
!     spamtrain = 0
!     fp = 0
!     fn = 0
!     flexcost = 0
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     SPC = options.spam_cutoff
!     HC = options.ham_cutoff
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
***************
*** 82,87 ****
          if debug:
              print "score:%.3f"%scr,
!         if scr < hammie.SPAM_THRESHOLD and is_spam:
!             if scr < hammie.HAM_THRESHOLD:
                  fn += 1
                  if debug:
--- 88,96 ----
          if debug:
              print "score:%.3f"%scr,
!         if scr < SPC and is_spam:
!             t = FNW * (SPC - scr) / (SPC - HC)
!             #print "Spam at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr < HC:
                  fn += 1
                  if debug:
***************
*** 94,104 ****
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > hammie.HAM_THRESHOLD and not is_spam:
!             if scr > hammie.SPAM_THRESHOLD:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
              else:
                  unsure += 1
--- 103,116 ----
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > HC and not is_spam:
!             t = FPW * (scr - HC) / (SPC - HC)
!             #print "Ham at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr > SPC:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir, name), scr)
              else:
                  unsure += 1
***************
*** 113,126 ****
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure,unsure*100.0/len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
!     print "fp: %d fn: %d"%(fp,fn)
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)
      
  def main():
--- 125,136 ----
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure, unsure * 100.0 / len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
!     print "fp: %d fn: %d"%(fp, fn)
!     print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
!     print "Flex cost: $%.4f"%flexcost
      
  def main():
***************
*** 128,137 ****
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
!                                    ['ham-keep=', 'spam-keep='])
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = seed = hamkeep = spamkeep = None
      for opt, arg in opts:
          if opt == '-h':
--- 138,146 ----
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:')
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = None
      for opt, arg in opts:
          if opt == '-h':
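
The "flex cost" added to weaktest.py in the hunks above can be read as a
graded penalty: instead of charging a flat amount per mistake, each message
is charged in proportion to how far its score sits on the wrong side of its
cutoff.  A minimal sketch of that arithmetic follows.  The function name,
the default weights (10.0 and 1.0) and the spam-branch condition
(score below the spam cutoff) are assumptions made for illustration; only
the two formulas are taken from the diff.  The cutoffs reuse the 0.20/0.90
values that appear elsewhere in this archive.

```python
def flex_cost(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90,
              fp_weight=10.0, fn_weight=1.0):
    """Return the graded cost of one scored message (illustrative sketch).

    fp_weight and fn_weight stand in for FPW and FNW; their defaults here
    are placeholders, not values confirmed by the check-in.
    """
    width = spam_cutoff - ham_cutoff
    if is_spam and score < spam_cutoff:
        # Spam that scored too low: t = FNW * (SPC - scr) / (SPC - HC)
        return fn_weight * (spam_cutoff - score) / width
    if not is_spam and score > ham_cutoff:
        # Ham that scored too high: t = FPW * (scr - HC) / (SPC - HC)
        return fp_weight * (score - ham_cutoff) / width
    # Message landed on the correct side of its cutoff.
    return 0.0


if __name__ == "__main__":
    for scr, spam in [(0.10, False), (0.55, False), (0.95, True), (0.20, True)]:
        print("score=%.2f is_spam=%s cost=%.4f" % (scr, spam, flex_cost(scr, spam)))
```

Under these assumptions a ham scoring exactly at the spam cutoff costs the
full false-positive weight, and the cost falls off linearly to zero at the
ham cutoff.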


From hooft@users.sourceforge.net  Sun Nov 10 12:07:18 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:07:18 -0800
Subject: [Spambayes-checkins] spambayes optimize.py,NONE,1.1
Message-ID: <E18Aqs2-0006JK-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24245

Added Files:
	optimize.py 
Log Message:
Simplex maximization

--- NEW FILE: optimize.py ---
#
__version__ = '$Id: optimize.py,v 1.1 2002/11/10 12:07:15 hooft Exp $'
#
# Optimize any parametric function.
#
import copy
import Numeric

def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001):
    var = Numeric.array(var)
    simplex = [var]
    for i in range(len(var)):
	var2 = copy.copy(var)
	var2[i] = var[i] + err[i]
	simplex.append(var2)
    value = []
    for i in range(len(simplex)):
	value.append(func(simplex[i]))
    while 1:
	# Determine worst and best
	wi = 0
	bi = 0
	for i in range(len(simplex)):
	    if value[wi] > value[i]:
		wi = i
	    if value[bi] < value[i]:
		bi = i
	# Test for convergence
	#print "worst, best are",wi,bi,"with",value[wi],value[bi]
	if abs(value[bi] - value[wi]) <= convcrit:
	    return simplex[bi]
	# Calculate average of non-worst
	ave=Numeric.zeros(len(var), 'd')
	for i in range(len(simplex)):
	    if i != wi:
		ave = ave + simplex[i]
	ave = ave / (len(simplex) - 1)
	worst = Numeric.array(simplex[wi])
	# Check for too-small simplex
	simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
	if simsize <= minerr:
	    #print "Size of simplex too small:",simsize
	    return simplex[bi]
	# Invert worst
	new = 2 * ave - simplex[wi]
	newv = func(new)
	if newv <= value[wi]:
	    # Even worse. Shrink instead
	    #print "Shrunk simplex"
	    #print "ave=",repr(ave)
	    #print "wi=",repr(worst)
	    new = 0.5 * ave + 0.5 * worst
	    newv = func(new)
	elif newv > value[bi]:
	    # Better than the best. Expand
	    new2 = 3 * ave - 2 * worst
	    newv2 = func(new2)
	    if newv2 > newv:
		# Accept
		#print "Expanded simplex"
		new = new2
		newv = newv2
	simplex[wi] = new
	value[wi] = newv

def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
    err = Numeric.array(err)
    var = SimplexMaximize(var, err, func, convcrit*5, minerr*5)
    return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr)
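
For readers without the Numeric package, the reflect / shrink / expand loop
in SimplexMaximize above can be sketched with plain lists.  This is a
hypothetical port written for illustration, not part of the check-in: the
name simplex_maximize and the max_iter guard are additions, and the code is
written to run on current Python as well.

```python
def simplex_maximize(start, step, func, convcrit=0.001, minerr=0.001,
                     max_iter=10000):
    """List-based sketch of the simplex search; max_iter is an added guard."""
    n = len(start)
    # Initial simplex: the start point plus one offset vertex per dimension.
    simplex = [list(start)]
    for i in range(n):
        vertex = list(start)
        vertex[i] += step[i]
        simplex.append(vertex)
    values = [func(v) for v in simplex]
    indices = range(len(simplex))
    for _ in range(max_iter):
        wi = min(indices, key=values.__getitem__)   # worst vertex
        bi = max(indices, key=values.__getitem__)   # best vertex
        if abs(values[bi] - values[wi]) <= convcrit:
            break
        worst = simplex[wi]
        # Average of all vertices except the worst.
        ave = [sum(simplex[i][k] for i in indices if i != wi) / n
               for k in range(n)]
        if sum(abs(a - w) for a, w in zip(ave, worst)) <= minerr:
            break                                   # simplex too small
        # Reflect the worst vertex through the average.
        new = [2 * a - w for a, w in zip(ave, worst)]
        newv = func(new)
        if newv <= values[wi]:
            # Even worse: shrink toward the average instead.
            new = [0.5 * a + 0.5 * w for a, w in zip(ave, worst)]
            newv = func(new)
        elif newv > values[bi]:
            # Better than the best: try a longer step and keep it if it wins.
            new2 = [3 * a - 2 * w for a, w in zip(ave, worst)]
            newv2 = func(new2)
            if newv2 > newv:
                new, newv = new2, newv2
        simplex[wi] = new
        values[wi] = newv
    return simplex[max(indices, key=values.__getitem__)]


def score(v):
    # One-dimensional test function with its maximum at v[0] == 1.0.
    return -(v[0] - 1.0) ** 2


if __name__ == "__main__":
    print(simplex_maximize([0.0], [0.5], score))
```

As in the original, the best vertex is never the one replaced, so the
returned point is never worse than the starting point.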


From hooft@users.sourceforge.net  Sun Nov 10 12:08:42 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:08:42 -0800
Subject: [Spambayes-checkins] spambayes weakloop.py,NONE,1.1
Message-ID: <E18AqtO-0006Q0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24653

Added Files:
	weakloop.py 
Log Message:
Loop simplex optimization over weaktest.py

--- NEW FILE: weakloop.py ---
#
# Optimize parameters
#
"""Usage: %(program)s  [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

import sys

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

program = sys.argv[0]

default="""
[Classifier]
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150

[TestDriver]
spam_cutoff = 0.90
ham_cutoff = 0.20
"""

import Options

start = (Options.options.robinson_probability_x,
         Options.options.robinson_minimum_prob_strength,
         Options.options.robinson_probability_s,
         Options.options.spam_cutoff,
         Options.options.ham_cutoff)
err = (0.01, 0.01, 0.01, 0.005, 0.01)

def mkini(vars):
    f=open('bayescustomize.ini', 'w')
    f.write("""
[Classifier]
robinson_probability_x = %.6f
robinson_minimum_prob_strength = %.6f
robinson_probability_s = %.6f

[TestDriver]
spam_cutoff = %.4f
ham_cutoff = %.4f
"""%tuple(vars))
    f.close()

def score(vars):
    import os
    mkini(vars)
    status = os.system('python2.3 weaktest.py -n %d > weak.out'%nsets)
    if status != 0:
        print >> sys.stderr, "Error status from weaktest"
        sys.exit(status)
    f = open('weak.out', 'r')
    txt = f.readlines()
    # Extract the flex cost field.
    cost = float(txt[-1].split()[2][1:])
    f.close()
    print ''.join(txt[-4:])[:-1]
    print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,))
    return -cost

def main():
    import optimize
    finish=optimize.SimplexMaximize(start,err,score)
    mkini(finish)

if __name__ == "__main__":
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:')
    except getopt.error, msg:
        usage(1, msg)

    nsets = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    main()
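
The score() function above relies on the last line of weak.out having the
form "Flex cost: $<number>" and returns the negated cost, so that maximizing
the score minimizes the cost.  A slightly more defensive version of that
extraction might look like the sketch below.  The helper names
parse_flex_cost and objective are hypothetical, and searching for the line
by its prefix (rather than taking the last line) is a substitution made
here, not what weakloop.py does.

```python
def parse_flex_cost(lines):
    """Return the number on the 'Flex cost: $...' line (illustrative helper)."""
    for line in reversed(lines):
        if line.startswith("Flex cost:"):
            # "Flex cost: $3.2500" splits into ['Flex', 'cost:', '$3.2500'].
            return float(line.split()[2].lstrip("$"))
    raise ValueError("no 'Flex cost:' line found")


def objective(lines):
    """Negate the cost so that a maximizer drives the cost down."""
    return -parse_flex_cost(lines)


if __name__ == "__main__":
    sample = [
        "fp: 1 fn: 2\n",
        "Total cost: $12.40\n",
        "Flex cost: $3.2500\n",
    ]
    print(parse_flex_cost(sample), objective(sample))
```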


From tim_one@users.sourceforge.net  Sun Nov 10 19:59:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 11:59:24 -0800
Subject: [Spambayes-checkins] spambayes msgs.py,1.5,1.6 optimize.py,1.1,1.2
 pop3proxy.py,1.13,1.14 timcv.py,1.11,1.12 weaktest.py,1.2,1.3
Message-ID: <E18AyEu-0003ql-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14712

Modified Files:
	msgs.py optimize.py pop3proxy.py timcv.py weaktest.py 
Log Message:
Whitespace normalization.


Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** msgs.py	1 Nov 2002 04:10:50 -0000	1.5
--- msgs.py	10 Nov 2002 19:59:22 -0000	1.6
***************
*** 84,88 ****
  
  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.  
         If seed is not None, also set SEED.
         If (ham|spam)test are not set, set to the same as the (ham|spam)train
--- 84,88 ----
  
  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
         If seed is not None, also set SEED.
         If (ham|spam)test are not set, set to the same as the (ham|spam)train

Index: optimize.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/optimize.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** optimize.py	10 Nov 2002 12:07:15 -0000	1.1
--- optimize.py	10 Nov 2002 19:59:22 -0000	1.2
***************
*** 11,66 ****
      simplex = [var]
      for i in range(len(var)):
! 	var2 = copy.copy(var)
! 	var2[i] = var[i] + err[i]
! 	simplex.append(var2)
      value = []
      for i in range(len(simplex)):
! 	value.append(func(simplex[i]))
      while 1:
! 	# Determine worst and best
! 	wi = 0
! 	bi = 0
! 	for i in range(len(simplex)):
! 	    if value[wi] > value[i]:
! 		wi = i
! 	    if value[bi] < value[i]:
! 		bi = i
! 	# Test for convergence
! 	#print "worst, best are",wi,bi,"with",value[wi],value[bi]
! 	if abs(value[bi] - value[wi]) <= convcrit:
! 	    return simplex[bi]
! 	# Calculate average of non-worst
! 	ave=Numeric.zeros(len(var), 'd')
! 	for i in range(len(simplex)):
! 	    if i != wi:
! 		ave = ave + simplex[i]
! 	ave = ave / (len(simplex) - 1)
! 	worst = Numeric.array(simplex[wi])
! 	# Check for too-small simplex
! 	simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
! 	if simsize <= minerr:
! 	    #print "Size of simplex too small:",simsize
! 	    return simplex[bi]
! 	# Invert worst
! 	new = 2 * ave - simplex[wi]
! 	newv = func(new)
! 	if newv <= value[wi]:
! 	    # Even worse. Shrink instead
! 	    #print "Shrunk simplex"
! 	    #print "ave=",repr(ave)
! 	    #print "wi=",repr(worst)
! 	    new = 0.5 * ave + 0.5 * worst
! 	    newv = func(new)
! 	elif newv > value[bi]:
! 	    # Better than the best. Expand
! 	    new2 = 3 * ave - 2 * worst
! 	    newv2 = func(new2)
! 	    if newv2 > newv:
! 		# Accept
! 		#print "Expanded simplex"
! 		new = new2
! 		newv = newv2
! 	simplex[wi] = new
! 	value[wi] = newv
  
  def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
--- 11,66 ----
      simplex = [var]
      for i in range(len(var)):
!         var2 = copy.copy(var)
!         var2[i] = var[i] + err[i]
!         simplex.append(var2)
      value = []
      for i in range(len(simplex)):
!         value.append(func(simplex[i]))
      while 1:
!         # Determine worst and best
!         wi = 0
!         bi = 0
!         for i in range(len(simplex)):
!             if value[wi] > value[i]:
!                 wi = i
!             if value[bi] < value[i]:
!                 bi = i
!         # Test for convergence
!         #print "worst, best are",wi,bi,"with",value[wi],value[bi]
!         if abs(value[bi] - value[wi]) <= convcrit:
!             return simplex[bi]
!         # Calculate average of non-worst
!         ave=Numeric.zeros(len(var), 'd')
!         for i in range(len(simplex)):
!             if i != wi:
!                 ave = ave + simplex[i]
!         ave = ave / (len(simplex) - 1)
!         worst = Numeric.array(simplex[wi])
!         # Check for too-small simplex
!         simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
!         if simsize <= minerr:
!             #print "Size of simplex too small:",simsize
!             return simplex[bi]
!         # Invert worst
!         new = 2 * ave - simplex[wi]
!         newv = func(new)
!         if newv <= value[wi]:
!             # Even worse. Shrink instead
!             #print "Shrunk simplex"
!             #print "ave=",repr(ave)
!             #print "wi=",repr(worst)
!             new = 0.5 * ave + 0.5 * worst
!             newv = func(new)
!         elif newv > value[bi]:
!             # Better than the best. Expand
!             new2 = 3 * ave - 2 * worst
!             newv2 = func(new2)
!             if newv2 > newv:
!                 # Accept
!                 #print "Expanded simplex"
!                 new = new2
!                 newv = newv2
!         simplex[wi] = new
!         value[wi] = newv
  
  def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
--- pop3proxy.py	10 Nov 2002 19:59:22 -0000	1.14
***************
*** 140,144 ****
      can't connect to the real POP3 server and talk to it
      synchronously, because that would block the process."""
!     
      def __init__(self, serverName, serverPort, lineCallback):
          BrighterAsyncChat.__init__(self)
--- 140,144 ----
      can't connect to the real POP3 server and talk to it
      synchronously, because that would block the process."""
! 
      def __init__(self, serverName, serverPort, lineCallback):
          BrighterAsyncChat.__init__(self)
***************
*** 148,152 ****
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
          self.connect((serverName, serverPort))
!     
      def collect_incoming_data(self, data):
          self.request = self.request + data
--- 148,152 ----
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
          self.connect((serverName, serverPort))
! 
      def collect_incoming_data(self, data):
          self.request = self.request + data
***************
*** 184,188 ****
          self.seenAllHeaders = False # For the current RETR or TOP
          self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
                                               self.onServerLine)
  
--- 184,188 ----
          self.seenAllHeaders = False # For the current RETR or TOP
          self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort,
                                               self.onServerLine)
  
***************
*** 198,214 ****
          isFirstLine = not self.response
          self.response = self.response + line
!         
          # Is this the line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
          # Has the server closed its end of the socket?
          if not line:
              self.isClosing = True
!         
          # If we're not processing a command, just echo the response.
          if not self.command:
              self.push(self.response)
              self.response = ''
!         
          # Time out after 30 seconds for message-retrieval commands if
          # all the headers are down.  The rest of the message will proxy
--- 198,214 ----
          isFirstLine = not self.response
          self.response = self.response + line
! 
          # Is this the line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
! 
          # Has the server closed its end of the socket?
          if not line:
              self.isClosing = True
! 
          # If we're not processing a command, just echo the response.
          if not self.command:
              self.push(self.response)
              self.response = ''
! 
          # Time out after 30 seconds for message-retrieval commands if
          # all the headers are down.  The rest of the message will proxy
***************
*** 223,227 ****
              self.onResponse()
              self.response = ''
!     
      def isMultiline(self):
          """Returns True if the request should get a multiline
--- 223,227 ----
              self.onResponse()
              self.response = ''
! 
      def isMultiline(self):
          """Returns True if the request should get a multiline
***************
*** 254,258 ****
              self.close()
              raise SystemExit
!         
          self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
--- 254,258 ----
              self.close()
              raise SystemExit
! 
          self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
***************
*** 265,271 ****
              self.args = splitCommand[1:]
              self.startTime = time.time()
!         
          self.request = ''
!         
      def onResponse(self):
          # Pass the request and the raw response to the subclass and
--- 265,271 ----
              self.args = splitCommand[1:]
              self.startTime = time.time()
! 
          self.request = ''
! 
      def onResponse(self):
          # Pass the request and the raw response to the subclass and
***************
*** 273,277 ****
          cooked = self.onTransaction(self.command, self.args, self.response)
          self.push(cooked)
!         
          # If onServerLine() decided that the server has closed its
          # socket, close this one when the response has been sent.
--- 273,277 ----
          cooked = self.onTransaction(self.command, self.args, self.response)
          self.push(cooked)
! 
          # If onServerLine() decided that the server has closed its
          # socket, close this one when the response has been sent.
***************
*** 351,355 ****
          status.activeSessions -= 1
          POP3ProxyBase.close(self)
!     
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
--- 351,355 ----
          status.activeSessions -= 1
          POP3ProxyBase.close(self)
! 
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
***************
*** 419,423 ****
                  if command == 'RETR':
                      status.numUnsure += 1
!             
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
--- 419,423 ----
                  if command == 'RETR':
                      status.numUnsure += 1
! 
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
***************
*** 490,494 ****
               .content { margin: 15 }
               .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: fffae0; padding-left: 1ex; 
                                 border-bottom: 1px solid #808080;
                                 font-weight: bold }
--- 490,494 ----
               .content { margin: 15 }
               .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: fffae0; padding-left: 1ex;
                                 border-bottom: 1px solid #808080;
                                 font-weight: bold }
***************
*** 513,517 ****
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
!     
      shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
--- 513,517 ----
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
! 
      shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
***************
*** 521,525 ****
                    <tr><td class='sectionbody'>%s</td></tr></table>
                    &nbsp;<br>\n"""
!     
      summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
                proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
--- 521,525 ----
                    <tr><td class='sectionbody'>%s</td></tr></table>
                    &nbsp;<br>\n"""
! 
      summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
                proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
***************
*** 529,538 ****
                  <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
!     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
!     
      train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
--- 529,538 ----
                  <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
! 
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
! 
      train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
***************
*** 546,550 ****
              <input type='submit' value='Train on this message'>
              </form>"""
!     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
--- 546,550 ----
              <input type='submit' value='Train on this message'>
              </form>"""
! 
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 577,581 ****
                  self.request = self.request + '\r\n\r\n'
                  return
!     
              if type(self.get_terminator()) is type(1):
                  # We've just read the body of a POSTed request.
--- 577,581 ----
                  self.request = self.request + '\r\n\r\n'
                  return
! 
              if type(self.get_terminator()) is type(1):
                  # We've just read the body of a POSTed request.
***************
*** 592,596 ****
                      # A normal x-www-form-urlencoded.
                      params.update(cgi.parse_qs(body, keep_blank_values=True))
!             
              # Convert the cgi params into a simple dictionary.
              plainParams = {}
--- 592,596 ----
                      # A normal x-www-form-urlencoded.
                      params.update(cgi.parse_qs(body, keep_blank_values=True))
! 
              # Convert the cgi params into a simple dictionary.
              plainParams = {}
***************
*** 604,608 ****
          if path == '/':
              path = '/Home'
!         
          if path == '/helmet.gif':
              # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
--- 604,608 ----
          if path == '/':
              path = '/Home'
! 
          if path == '/helmet.gif':
              # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
***************
*** 628,632 ****
                  else:
                      self.push(self.footer % (timeString, self.shutdownPickle))
!     
      def pushOKHeaders(self, contentType, extraHeaders={}):
          timeNow = time.gmtime(time.time())
--- 628,632 ----
                  else:
                      self.push(self.footer % (timeString, self.shutdownPickle))
! 
      def pushOKHeaders(self, contentType, extraHeaders={}):
          timeNow = time.gmtime(time.time())
***************
*** 645,649 ****
          self.push("\r\n")
          self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
!     
      def pushPreamble(self, name):
          self.push(self.header % name)
--- 645,649 ----
          self.push("\r\n")
          self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
! 
      def pushPreamble(self, name):
          self.push(self.header % name)
***************
*** 681,685 ****
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
!         
          # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
--- 681,685 ----
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
! 
          # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
***************
*** 718,722 ****
          except KeyError:
              info = "'%s' does not appear in the database." % word
!         
          body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
--- 718,722 ----
          except KeyError:
              info = "'%s' does not appear in the database." % word
! 
          body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
***************
*** 992,996 ****
          elif opt == '-u':
              status.uiPort = int(arg)
!             
      # Do whatever we've been asked to do...
      if not opts and not args:
--- 992,996 ----
          elif opt == '-u':
              status.uiPort = int(arg)
! 
      # Do whatever we've been asked to do...
      if not opts and not args:

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timcv.py	1 Nov 2002 04:10:50 -0000	1.11
--- timcv.py	10 Nov 2002 19:59:22 -0000	1.12
***************
*** 15,19 ****
  
      --HamTrain int
!         The maximum number of msgs to use from each Ham set for training.  
          The msgs are chosen randomly.  See also the -s option.
  
--- 15,19 ----
  
      --HamTrain int
!         The maximum number of msgs to use from each Ham set for training.
          The msgs are chosen randomly.  See also the -s option.
  
***************
*** 23,27 ****
  
      --HamTest int
!         The maximum number of msgs to use from each Ham set for testing.  
          The msgs are chosen randomly.  See also the -s option.
  
--- 23,27 ----
  
      --HamTest int
!         The maximum number of msgs to use from each Ham set for testing.
          The msgs are chosen randomly.  See also the -s option.
  
***************
*** 73,79 ****
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), 
                              hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), 
                              spamdirs[1:], train=1))
  
--- 73,79 ----
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
                              hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
                              spamdirs[1:], train=1))
  
***************
*** 98,102 ****
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1), 
                          msgs.SpamStream(sname, s2, train=1))
  
--- 98,102 ----
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1),
                          msgs.SpamStream(sname, s2, train=1))
  

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** weaktest.py	10 Nov 2002 12:02:33 -0000	1.2
--- weaktest.py	10 Nov 2002 19:59:22 -0000	1.3
***************
*** 58,62 ****
      nham = len(hamfns)
      nspam = len(spamfns)
!     
      allfns = {}
      for fn in spamfns+hamfns:
--- 58,62 ----
      nham = len(hamfns)
      nspam = len(spamfns)
! 
      allfns = {}
      for fn in spamfns+hamfns:
***************
*** 133,137 ****
      print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
      print "Flex cost: $%.4f"%flexcost
!     
  def main():
      import getopt
--- 133,137 ----
      print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
      print "Flex cost: $%.4f"%flexcost
! 
  def main():
      import getopt


From tim_one@users.sourceforge.net  Sun Nov 10 20:00:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 12:00:03 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24
Message-ID: <E18AyFX-0003uk-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14946

Modified Files:
	msgstore.py 
Log Message:
Whitespace normalization.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** msgstore.py	7 Nov 2002 22:30:09 -0000	1.23
--- msgstore.py	10 Nov 2002 19:59:59 -0000	1.24
***************
*** 397,401 ****
              # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
              pass
!             
          return "%s\n%s\n%s" % (headers, html, body)
  
--- 397,401 ----
              # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
              pass
! 
          return "%s\n%s\n%s" % (headers, html, body)
  

From tim_one@users.sourceforge.net  Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4
Message-ID: <E18B3r2-0001Re-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam

Modified Files:
	profile.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** profile.py	7 Nov 2002 22:30:11 -0000	1.3
--- profile.py	11 Nov 2002 01:59:06 -0000	1.4
***************
*** 44,48 ****
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 44,48 ----
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0


From tim_one@users.sourceforge.net  Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] 
 spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Message-ID: <E18B3r2-0001RY-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402

Modified Files:
	Options.py classifier.py weakloop.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength
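
As an illustration of what the renamed options control, the sketch below
blends a word's counting estimate with the prior, weighting the prior by
unknown_word_strength and the estimate by the number of messages the word
was seen in.  The blending formula is the usual Robinson-style adjustment
and is assumed here rather than quoted from classifier.py; the function
names are hypothetical.

```python
def adjusted_spamprob(count_estimate, n,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Blend the counting estimate with the prior (assumed formula).

    n is the number of messages the word appeared in.  With n == 0 the
    result is unknown_word_prob; with unknown_word_strength == 0 the
    counting estimate is believed outright.
    """
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * count_estimate) / (s + n)


def is_strong_clue(spamprob, minimum_prob_strength=0.1):
    """True if the word is far enough from 0.5 to be used in scoring."""
    return abs(spamprob - 0.5) >= minimum_prob_strength


if __name__ == "__main__":
    for n in (0, 1, 10):
        p = adjusted_spamprob(1.0, n)
        print("seen in %2d spam only: prob=%.4f strong=%s"
              % (n, p, is_strong_clue(p)))
```

With these defaults a word seen once, and only in spam, gets a probability
well below 1.0, which is the point of the prior.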


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py	8 Nov 2002 04:06:23 -0000	1.67
--- Options.py	11 Nov 2002 01:59:06 -0000	1.68
***************
*** 241,268 ****
  
  # These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before.  Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam.  This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x.  All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE:  Gary Robinson previously used a different formula involving 'a'
! # and 'x'.  The 'x' here is the same as before.  The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinsons base scheme.  0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on Gary Robinons web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
--- 241,268 ----
  
  # These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before.  Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting.  At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam.  This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob.  All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests.  0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on the Robinson web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
***************
*** 272,284 ****
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  That is
! # the "provably most-sensitive" test Garys original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far on Tims data.
! # One systematic benefit is that it is immune to "cancellation disease".  One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H)).
--- 272,284 ----
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  This is
! # the "provably most-sensitive" test the original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease".  One
! # systematic drawback is sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether it is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H)).
***************
*** 381,387 ****
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'robinson_probability_x': float_cracker,
!                    'robinson_probability_s': float_cracker,
!                    'robinson_minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'unknown_word_prob': float_cracker,
!                    'unknown_word_strength': float_cracker,
!                    'minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
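
The chi-squared combining comment in the Options.py hunk above can be sketched
as follows.  This is a minimal illustration under stated assumptions, not the
classifier's own code: the names chi2Q and chi2_score are chosen for this
example, every word probability is assumed to lie strictly between 0 and 1,
and no care is taken over floating-point underflow on very long messages.

```python
import math


def chi2Q(x2, v):
    """Probability that a chi-squared variable with v degrees of freedom
    is at least x2.  Only even v is supported, which is all that is
    needed here because v is always 2*n."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_score(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # -2*sum(ln(p_i)) follows chi-squared with 2*n degrees of freedom
    # when the p_i are uniformly distributed.
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)   # evidence of spam
    H = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)   # evidence of ham
    # Rob Hooft's combination: strong evidence both ways lands near 0.5.
    return (S - H + 1.0) / 2.0
```

A message with equally strong spam and ham clues scores close to 0.5, which
is how the (S-H+1)/2 form keeps such messages in the middle ground.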

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py	7 Nov 2002 22:30:05 -0000	1.49
--- classifier.py	11 Nov 2002 01:59:06 -0000	1.50
***************
*** 70,74 ****
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
          nspam = float(self.nspam or 1)
  
!         S = options.robinson_probability_s
!         StimesX = S * options.robinson_probability_x
  
          for word, record in self.wordinfo.iteritems():
--- 322,327 ----
          nspam = float(self.nspam or 1)
  
!         S = options.unknown_word_strength
!         StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
  
      def _getclues(self, wordstream):
!         mindist = options.robinson_minimum_prob_strength
!         unknown = options.robinson_probability_x
  
          clues = []  # (distance, prob, word, record) tuples
--- 449,454 ----
  
      def _getclues(self, wordstream):
!         mindist = options.minimum_prob_strength
!         unknown = options.unknown_word_prob
  
          clues = []  # (distance, prob, word, record) tuples

Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py	10 Nov 2002 12:08:40 -0000	1.1
--- weakloop.py	11 Nov 2002 01:59:06 -0000	1.2
***************
*** 29,35 ****
  default="""
  [Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
  max_discriminators = 150
  
--- 29,35 ----
  default="""
  [Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
  max_discriminators = 150
  
***************
*** 41,47 ****
  import Options
  
! start = (Options.options.robinson_probability_x,
!          Options.options.robinson_minimum_prob_strength,
!          Options.options.robinson_probability_s,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
--- 41,47 ----
  import Options
  
! start = (Options.options.unknown_word_prob,
!          Options.options.minimum_prob_strength,
!          Options.options.unknown_word_strength,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
***************
*** 52,58 ****
      f.write("""
  [Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
  
  [TestDriver]
--- 52,58 ----
      f.write("""
  [Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
  
  [TestDriver]


From tim_one@users.sourceforge.net  Fri Nov  8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
	tokenizer.py,1.63,1.64
Message-ID: <E18A0Pd-0008K2-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798

Modified Files:
	Options.py tokenizer.py 
Log Message:
Removed option retain_pure_html_tags; nobody enables it anymore, and it's
hard to believe it would ever help again (except as an HTML detector).
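
With the option gone, tag stripping applies to every text part.  A minimal
sketch of that step, assuming a much simpler pattern than the tokenizer's real
html_re (the regular expression and the name strip_html below are stand-ins
invented for this example):

```python
import re

# Simplified stand-in for tokenizer.html_re: a '<' that is not followed by
# whitespace or another angle bracket, then everything up to the next '>'.
html_re = re.compile(r"<[^<>\s][^>]*>")


def strip_html(text):
    """Replace &nbsp; and HTML/XML tags with spaces before tokenizing."""
    text = text.replace('&nbsp;', ' ')
    return html_re.sub(' ', text)
```

Replacing tags with a space rather than deleting them keeps words on either
side of a tag from being glued together into one token.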


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py	7 Nov 2002 22:25:46 -0000	1.66
--- Options.py	8 Nov 2002 04:06:23 -0000	1.67
***************
*** 42,53 ****
      x-.*
  
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages.  Set true to retain HTML tags in this
- # case.  On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
- 
  # If true, the first few characters of application/octet-stream sections
  # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
  
  all_options = {
!     'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
!                   'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
--- 339,343 ----
  
  all_options = {
!     'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py	7 Nov 2002 22:30:08 -0000	1.63
--- tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
***************
*** 495,504 ****
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was removed.
  
  ##############################################################################
--- 495,505 ----
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False.  Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was also removed.
  
  ##############################################################################
***************
*** 1167,1175 ****
          """Generate a stream of tokens from an email Message.
  
-         HTML tags are always stripped from text/plain sections.
-         options.retain_pure_html_tags controls whether HTML tags are
-         also stripped from text/html sections.  Except in special cases,
-         it's recommended to leave that at its default of false.
- 
          If options.check_octets is True, the first few undecoded characters
          of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             if (part.get_content_type() == "text/plain" or
!                     not options.retain_pure_html_tags):
!                 text = text.replace('&nbsp;', ' ')
!                 text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
--- 1224,1229 ----
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             text = text.replace('&nbsp;', ' ')
!             text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.


From richiehindle@users.sourceforge.net  Fri Nov  8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID: <E18A440-0006h6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390

Modified Files:
	pop3proxy.py 
Log Message:
 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't send a welcome greeting.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.
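
The move to async sockets means a response is now assembled one line at a
time instead of being read in a blocking loop.  The completion rule can be
sketched on its own, without any sockets.  This is an illustration only: the
class name ResponseAssembler and the function is_multiline are invented for
this example, and the 30-second timeout for RETR and TOP is left out.

```python
def is_multiline(command, args):
    """True if a positive response to this POP3 command spans many lines."""
    if command in ('RETR', 'TOP'):
        return True
    if command in ('LIST', 'UIDL'):
        # With an argument these return a single line.
        return len(args) == 0
    return False


class ResponseAssembler:
    """Collects server lines until a whole POP3 response has arrived."""

    def __init__(self, command, args):
        self.command = command
        self.args = args
        self.lines = []

    def feed(self, line):
        """Add one line; return the full response if complete, else None."""
        is_first_line = not self.lines
        self.lines.append(line)
        complete = (not is_multiline(self.command, self.args)
                    or line == '.\r\n'
                    or (is_first_line and line.startswith('-ERR')))
        if not complete:
            return None
        response = ''.join(self.lines)
        self.lines = []
        return response
```

Each call to feed() corresponds to one callback from the line reader, so
nothing ever waits on the server while the web interface is being served.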


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py	7 Nov 2002 22:27:02 -0000	1.11
--- pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
***************
*** 47,50 ****
--- 47,74 ----
  
  
+ todo = """
+  o (Re)training interface - one message per line, quick-rendering table.
+  o Slightly-wordy index page; intro paragraph for each page.
+  o Once the training stuff is on a separate page, make the paste box
+    bigger.
+  o "Links" section (on homepage?) to project homepage, mailing list,
+    etc.
+  o "Home" link (with helmet!) at the end of each page.
+  o "Classify this" - just like Train.
+  o "Send me an email every [...] to remind me to train on new
+    messages."
+  o "Send me a status email every [...] telling how many mails have been
+    classified, etc."
+  o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
+    webbrowser?
+  o Possibly integrate Tim Stone's SMTP code - make it use async, make
+    the training code update (rather than replace!) the database.
+  o Can it cleanly dynamically update its status display while having a
+    POP3 conversation?  Hammering reload sucks.
+  o Add a command to save the database without shutting down, and one to
+    reload the database.
+  o Leave the word in the input field after a Word query.
+ """
+ 
  import sys, re, operator, errno, getopt, cPickle, cStringIO, time
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
              self.factory(*args)
  
+ 
  class BrighterAsyncChat(asynchat.async_chat):
      """An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
  
  
+ class ServerLineReader(BrighterAsyncChat):
+     """An async socket that reads lines from a remote server and
+     simply calls a callback with the data.  The BayesProxy object
+     can't connect to the real POP3 server and talk to it
+     synchronously, because that would block the process."""
+     
+     def __init__(self, serverName, serverPort, lineCallback):
+         BrighterAsyncChat.__init__(self)
+         self.lineCallback = lineCallback
+         self.request = ''
+         self.set_terminator('\r\n')
+         self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+         self.connect((serverName, serverPort))
+     
+     def collect_incoming_data(self, data):
+         self.request = self.request + data
+ 
+     def found_terminator(self):
+         self.lineCallback(self.request + '\r\n')
+         self.request = ''
+ 
+     def handle_close(self):
+         self.lineCallback('')
+         self.close()
+ 
+ 
  class POP3ProxyBase(BrighterAsyncChat):
      """An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
!         self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.serverSocket.connect((serverName, serverPort))
!         self.serverIn = self.serverSocket.makefile('r')  # For reading only
!         self.push(self.serverIn.readline())
  
      def onTransaction(self, command, args, response):
--- 177,189 ----
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
+         self.response = ''
          self.set_terminator('\r\n')
!         self.command = ''           # The POP3 command being processed...
!         self.args = ''              # ...and its arguments
!         self.isClosing = False      # Has the server closed the socket?
!         self.seenAllHeaders = False # For the current RETR or TOP
!         self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
!                                              self.onServerLine)
  
      def onTransaction(self, command, args, response):
***************
*** 139,152 ****
          raise NotImplementedError
  
!     def isMultiline(self, command, args):
!         """Returns True if the given request should get a multiline
          response (assuming the response is positive).
          """
!         if command in ['USER', 'PASS', 'APOP', 'QUIT',
!                        'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif command in ['RETR', 'TOP']:
              return True
!         elif command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
--- 194,237 ----
          raise NotImplementedError
  
!     def onServerLine(self, line):
!         """A line of response has been received from the POP3 server."""
!         isFirstLine = not self.response
!         self.response = self.response + line
!         
+         # Is this the line that terminates a set of headers?
!         self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
!         # Has the server closed its end of the socket?
!         if not line:
!             self.isClosing = True
!         
!         # If we're not processing a command, just echo the response.
!         if not self.command:
!             self.push(self.response)
!             self.response = ''
!         
!         # Time out after 30 seconds for message-retrieval commands if
!         # all the headers are down.  The rest of the message will proxy
!         # straight through.
!         if self.command in ['TOP', 'RETR'] and \
!            self.seenAllHeaders and time.time() > self.startTime + 30:
!             self.onResponse()
!             self.response = ''
!         # If that's a complete response, handle it.
!         elif not self.isMultiline() or line == '.\r\n' or \
!            (isFirstLine and line.startswith('-ERR')):
!             self.onResponse()
!             self.response = ''
!     
!     def isMultiline(self):
!         """Returns True if the request should get a multiline
          response (assuming the response is positive).
          """
!         if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
!                             'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif self.command in ['RETR', 'TOP']:
              return True
!         elif self.command in ['LIST', 'UIDL']:
!             return len(self.args) == 0
          else:
***************
*** 155,204 ****
              return False
  
-     def readResponse(self, command, args):
-         """Reads the POP3 server's response and returns a tuple of
-         (response, isClosing, timedOut).  isClosing is True if the
-         server closes the socket, which tells found_terminator() to
-         close when the response has been sent.  timedOut is set if a
-         TOP or RETR request was still arriving after 30 seconds, and
-         tells found_terminator() to proxy the remainder of the response.
-         """
-         responseLines = []
-         startTime = time.time()
-         isMulti = self.isMultiline(command, args)
-         isClosing = False
-         timedOut = False
-         isFirstLine = True
-         seenAllHeaders = False
-         while True:
-             line = self.serverIn.readline()
-             if not line:
-                 # The socket's been closed by the server, probably by QUIT.
-                 isClosing = True
-                 break
-             elif not isMulti or (isFirstLine and line.startswith('-ERR')):
-                 # A single-line response.
-                 responseLines.append(line)
-                 break
-             elif line == '.\r\n':
-                 # The termination line.
-                 responseLines.append(line)
-                 break
-             else:
-                 # A normal line - append it to the response and carry on.
-                 responseLines.append(line)
-                 seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
- 
-             # Time out after 30 seconds for message-retrieval commands
-             # if all the headers are down - found_terminator() knows how
-             # to deal with this.
-             if command in ['TOP', 'RETR'] and \
-                seenAllHeaders and time.time() > startTime + 30:
-                 timedOut = True
-                 break
- 
-             isFirstLine = False
- 
-         return ''.join(responseLines), isClosing, timedOut
- 
      def collect_incoming_data(self, data):
          """Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
      def found_terminator(self):
          """Asynchat override."""
-         # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverSocket.sendall(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             command, args = ('', '')
          else:
              splitCommand = self.request.strip().split(None, 1)
!             command = splitCommand[0].upper()
!             args = splitCommand[1:]
!         rawResponse, isClosing, timedOut = self.readResponse(command, args)
! 
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cookedResponse = self.onTransaction(command, args, rawResponse)
!         self.push(cookedResponse)
!         self.request = ''
! 
!         # If readResponse() timed out, we still need to read and proxy
!         # the rest of the message.
!         if timedOut:
!             while True:
!                 line = self.serverIn.readline()
!                 if not line:
!                     # The socket's been closed by the server.
!                     isClosing = True
!                     break
!                 elif line == '.\r\n':
!                     # The termination line.
!                     self.push(line)
!                     break
!                 else:
!                     # A normal line.
!                     self.push(line)
! 
!         # If readResponse() or the loop above decided that the server
!         # has closed its socket, close this one when the response has
!         # been sent.
!         if isClosing:
              self.close_when_done()
  
  
  class BayesProxyListener(Listener):
--- 246,288 ----
      def found_terminator(self):
          """Asynchat override."""
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
+             self.serverSocket.shutdown(2)
+             self.serverSocket.close()
              self.shutdown(2)
              self.close()
              raise SystemExit
!         
!         self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             self.command = self.args = ''
          else:
+             # A proper command.
              splitCommand = self.request.strip().split(None, 1)
!             self.command = splitCommand[0].upper()
!             self.args = splitCommand[1:]
!             self.startTime = time.time()
!         
!         self.request = ''
!         
!     def onResponse(self):
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cooked = self.onTransaction(self.command, self.args, self.response)
!         self.push(cooked)
!         
!         # If onServerLine() decided that the server has closed its
!         # socket, close this one when the response has been sent.
!         if self.isClosing:
              self.close_when_done()
  
+         # Reset.
+         self.command = ''
+         self.args = ''
+         self.isClosing = False
+         self.seenAllHeaders = False
+ 
  
  class BayesProxyListener(Listener):
***************
*** 452,456 ****
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15 }
               .header { font-size: 133%% }
               .content { margin: 15 }
--- 484,490 ----
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15;
!                        border-top: 1px solid black;
!                        border-bottom: 1px solid black }
               .header { font-size: 133%% }
               .content { margin: 15 }
***************
*** 466,470 ****
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
--- 500,504 ----
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>&nbsp;Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
***************
*** 475,481 ****
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              <input type='submit' value='Shutdown now'>
               </td></tr></table></form>\n"""
  
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
--- 509,520 ----
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              %s
               </td></tr></table></form>\n"""
  
+     shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
+     
+     shutdownPickle = shutdownDB + """&nbsp;&nbsp;
+             <input type='submit' name='how' value='Save &amp; shutdown'>"""
+ 
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
***************
*** 483,486 ****
--- 522,533 ----
                    &nbsp;<br>\n"""
      
+     summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
+               proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
+               Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
+               POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
+               Emails classified this session: <b>%(numSpams)d</b> spam,
+                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
+               """
+     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
***************
*** 488,491 ****
--- 535,550 ----
                  </form>"""
      
+     train = """<form action='/upload' method='POST'
+                 enctype='multipart/form-data'>
+             Either upload a message file: <input type='file' name='file'><br>
+             Or paste the whole message (including headers) here:<br>
+             <textarea name='text' rows='3' cols='60'></textarea><br>
+             Is this message
+             <input type='radio' name='which' value='ham'>Ham</input> or
+             <input type='radio'
+                    name='which' value='spam' checked>Spam</input>?<br>
+             <input type='submit' value='Train on this message'>
+             </form>"""
+     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = self.request.split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
--- 561,565 ----
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
          
          if path == '/helmet.gif':
!             self.pushOKHeaders('image/gif')
              self.push(self.helmet)
          else:
--- 606,614 ----
          
          if path == '/helmet.gif':
!             # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
!             inOneHour = time.gmtime(time.time() + 3600)
!             expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
!             extraHeaders = {'Expires': expiryDate}
!             self.pushOKHeaders('image/gif', extraHeaders)
              self.push(self.helmet)
          else:
***************
*** 554,558 ****
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % url)
              else:
                  # This is a request for a valid page; run the handler.
--- 617,621 ----
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % path)
              else:
                  # This is a request for a valid page; run the handler.
***************
*** 561,569 ****
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 self.push(self.footer % timeString)
      
!     def pushOKHeaders(self, contentType):
!         self.push("HTTP/1.0 200 OK\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
          self.push("\r\n")
  
--- 624,641 ----
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 if status.useDB:
!                     self.push(self.footer % (timeString, self.shutdownDB))
!                 else:
!                     self.push(self.footer % (timeString, self.shutdownPickle))
      
!     def pushOKHeaders(self, contentType, extraHeaders={}):
!         timeNow = time.gmtime(time.time())
!         httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
!         self.push("HTTP/1.1 200 OK\r\n")
!         self.push("Connection: close\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
+         self.push("Date: %s\r\n" % httpNow)
+         for name, value in extraHeaders.items():
+             self.push("%s: %s\r\n" % (name, value))
          self.push("\r\n")
  
***************
*** 583,616 ****
  
      def onHome(self, params):
!         summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!                   proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
!                   Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
!                   POP3 conversations this session:
!                     <b>%(totalSessions)d</b>.<br>
!                   Emails classified this session: <b>%(numSpams)d</b> spam,
!                     <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
!                   """ % status.__dict__
!         
!         train = """<form action='/upload' method='POST'
!                     enctype='multipart/form-data'>
!                 Either upload a message file:
!                 <input type='file' name='file'><br>
!                 Or paste the whole message (incuding headers) here:<br>
!                 <textarea name='text' rows='3' cols='60'></textarea><br>
!                 Is this message
!                 <input type='radio' name='which' value='ham'>Ham</input> or
!                 <input type='radio'
!                        name='which' value='spam' checked>Spam</input>?<br>
!                 <input type='submit' value='Train on this message'>
!                 </form>"""
!         
!         body = (self.pageSection % ('Status', summary) +
!                 self.pageSection % ('Word query', self.wordQuery) +
!                 self.pageSection % ('Train', train))
          self.push(body)
  
      def onShutdown(self, params):
!         self.push("<p><b>Shutdown.</b> Goodbye.</p>")
!         self.push(' ')  # Acts as a flush for small buffers.
          self.shutdown(2)
          self.close()
--- 655,675 ----
  
      def onHome(self, params):
!         """Serve up the homepage."""
!         body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Word query', self.wordQuery)+
!                 self.pageSection % ('Train', self.train))
          self.push(body)
  
      def onShutdown(self, params):
!         """Shutdown the server, saving the pickle if requested to do so."""
!         if params['how'].lower().find('save') >= 0:
!             if not status.useDB and status.pickleName:
!                 self.push("<b>Saving...</b>")
!                 self.push(' ')  # Acts as a flush for small buffers.
!                 fp = open(status.pickleName, 'wb')
!                 cPickle.dump(self.bayes, fp, 1)
!                 fp.close()
!         self.push("<b>Shutdown</b>. Goodbye.")
!         self.push(' ')
          self.shutdown(2)
          self.close()
***************
*** 618,625 ****
  
      def onUpload(self, params):
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
          # Append the message to a file, to make it easier to rebuild
!         # the database later.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
--- 677,690 ----
  
      def onUpload(self, params):
+         """Train on an uploaded or pasted message."""
+         # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         
          # Append the message to a file, to make it easier to rebuild
!         # the database later.   This is a temporary implementation -
!         # it should keep a Corpus (from Tim Stone's forthcoming message
!         # management module) to manage a cache of messages.  It needs
!         # to keep them for the HTML retraining interface anyway.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
***************
*** 627,642 ****
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From ???@???\n")  # fake From line (XXX good enough?)
          f.write(message)
!         f.write("\n")
          f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("""<p>Trained on your message. Saving database...</p>""")
!         self.push(" ")  # Flush... must find out how to do this properly...
!         if not status.useDB and status.pickleName:
!             fp = open(status.pickleName, 'wb')
!             cPickle.dump(self.bayes, fp, 1)
!             fp.close()
!         self.push("<p>Done.</p><p><a href='/'>Home</a></p>")
  
      def onWordquery(self, params):
--- 692,704 ----
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
          f.write(message)
!         f.write("\n\n")
          f.close()
+ 
+         # Train on the message.
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
!         self.push(self.pageSection % ('Train another', self.train))
  
      def onWordquery(self, params):
***************
*** 656,660 ****
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s':" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
--- 718,722 ----
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
***************
*** 765,771 ****
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))
          self.request = ''
  
      def onStat(self, command, args):
          """POP3 STAT command."""
--- 827,839 ----
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))   # Or push_slowly for testing
          self.request = ''
  
+     def push_slowly(self, response):
+         """Useful for testing."""
+         for c in response:
+             self.push(c)
+             time.sleep(0.02)
+ 
      def onStat(self, command, args):
          """POP3 STAT command."""
***************
*** 777,781 ****
          """POP3 LIST command, with optional message number argument."""
          if args:
!             number = int(args)
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
          """POP3 LIST command, with optional message number argument."""
          if args:
!             try:
!                 number = int(args)
!             except ValueError:
!                 number = -1
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         return self._getMessage(int(args), 12345)
  
      def onTop(self, command, args):
          """POP3 TOP command."""
!         number, lines = map(int, args.split())
          return self._getMessage(number, lines)
  
--- 874,889 ----
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         try:
!             number = int(args)
!         except ValueError:
!             number = -1
!         return self._getMessage(number, 12345)
  
      def onTop(self, command, args):
          """POP3 TOP command."""
          """POP3 RETR command."""
!         try:
!             number, lines = map(int, args.split())
!         except ValueError:
!             number, lines = -1, -1
          return self._getMessage(number, lines)
  
***************
*** 863,867 ****
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) != -1
  
      # Kill the proxy and the test server.
--- 941,945 ----
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) >= 0
  
      # Kill the proxy and the test server.
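
The pushOKHeaders() change above adds Date, Connection and caller-supplied
headers to every 200 response.  A minimal standalone sketch of that header
block follows.  It is written for a current Python rather than the 2.2-era
code in the diff; build_ok_headers and expires_in are hypothetical names, and
the None default for the extra headers is a choice made for this sketch, not
something taken from pop3proxy.py.

```python
import time

# Same strftime pattern as in the diff above.
HTTP_DATE = '%a, %d %b %Y %H:%M:%S GMT'


def expires_in(seconds, now=None):
    """Return an HTTP date string `seconds` after `now` (epoch seconds)."""
    if now is None:
        now = time.time()
    return time.strftime(HTTP_DATE, time.gmtime(now + seconds))


def build_ok_headers(content_type, extra_headers=None, now=None):
    """Return the header block of a 200 response as a single string."""
    if now is None:
        now = time.time()
    # None rather than {} as the default, so no dict is shared between calls.
    if extra_headers is None:
        extra_headers = {}
    lines = ["HTTP/1.1 200 OK",
             "Connection: close",
             "Content-Type: %s" % content_type,
             "Date: %s" % time.strftime(HTTP_DATE, time.gmtime(now))]
    for name, value in sorted(extra_headers.items()):
        lines.append("%s: %s" % (name, value))
    # A blank line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


# Headers for an image that may be cached for one hour.
headers = build_ok_headers('image/gif', {'Expires': expires_in(3600)})
```

Passing `now` explicitly makes the output reproducible, which is handy when
checking the date format by hand.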


From jvr@users.sourceforge.net  Sat Nov  9 18:05:44 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Sat, 09 Nov 2002 10:05:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13
Message-ID: <E18AZzM-0005QJ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20814

Modified Files:
	pop3proxy.py 
Log Message:
force word query to be lowercase, making the UI case insensitive

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
--- pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
***************
*** 704,707 ****
--- 704,708 ----
      def onWordquery(self, params):
          word = params['word']
+         word = word.lower()
          try:
              # Must be a better way to get __dict__ for a new-style class...


From hooft@users.sourceforge.net  Sat Nov  9 21:48:55 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sat, 09 Nov 2002 13:48:55 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,NONE,1.1
Message-ID: <E18AdTL-00086Q-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31102

Added Files:
	weaktest.py 
Log Message:
New test driver to simulate "unsure only" training

--- NEW FILE: weaktest.py ---
#! /usr/bin/env python

# A test driver using "the standard" test directory structure.
# This simulates a user who receives e-mail and trains only on false
# positives (fp), false negatives (fn) and unsure messages.  It starts by
# training on the first 30 messages; from that point on, correctly
# classified messages are not used for training.  This shows how the
# scoring algorithm performs under such conditions.  Questions are:
#  * How does the size of the database behave over time?
#  * Does the classification get better over time?
#  * Are there other combinations of parameters for the classifier
#    that make this better behaved than the default values?


"""Usage: %(program)s  [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

from __future__ import generators

import sys,os

from Options import options
import hammie

import msgs

program = sys.argv[0]

debug = 0

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def drive(nsets):
    print options.display()

    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs  = [options.ham_directories % i for i in range(1, nsets+1)]

    spamfns = [(x,y,1) for x in spamdirs for y in os.listdir(x)]
    hamfns = [(x,y,0) for x in hamdirs for y in os.listdir(x)]

    nham = len(hamfns)
    nspam = len(spamfns)
    
    allfns={}
    for fn in spamfns+hamfns:
        allfns[fn] = None

    d = hammie.Hammie(hammie.createbayes('weaktest.db', False))

    n=0
    unsure=0
    hamtrain=0
    spamtrain=0
    fp=0
    fn=0
    for dir,name, is_spam in allfns.iterkeys():
        n += 1
        m=msgs.Msg(dir, name).guts
        if debug:
            print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name),
        if hamtrain + spamtrain > 30:
            scr=d.score(m)
        else:
            scr=0.50
        if debug:
            print "score:%.3f"%scr,
        if scr < hammie.SPAM_THRESHOLD and is_spam:
            if scr < hammie.HAM_THRESHOLD:
                fn += 1
                if debug:
                    print "fn"
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            spamtrain += 1
            d.train_spam(m)
            d.update_probabilities()
        elif scr > hammie.HAM_THRESHOLD and not is_spam:
            if scr > hammie.SPAM_THRESHOLD:
                fp += 1
                if debug:
                    print "fp"
                else:
                    print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            hamtrain += 1
            d.train_ham(m)
            d.update_probabilities()
        else:
            if debug:
                print "OK"
        if n % 100 == 0:
            print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
                n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
    print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
    print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
        unsure,unsure*100.0/len(allfns))
    print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
    print "fp: %d fn: %d"%(fp,fn)
    FPW = options.best_cutoff_fp_weight
    FNW = options.best_cutoff_fn_weight
    UNW = options.best_cutoff_unsure_weight
    print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)
    
def main():
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
                                   ['ham-keep=', 'spam-keep='])
    except getopt.error, msg:
        usage(1, msg)

    nsets = seed = hamkeep = spamkeep = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    drive(nsets)

if __name__ == "__main__":
    main()
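
The branching in drive() above decides, for each message, whether it counts
as a false negative, a false positive, unsure, or correctly classified, and
trains only in the first three cases.  A minimal sketch of that decision as a
pure function follows, assuming the two thresholds are passed in as plain
numbers; outcome() is a hypothetical helper and does not exist in weaktest.py.

```python
def outcome(score, is_spam, ham_cutoff, spam_cutoff):
    """Classify one scored message the way the drive() loop does.

    Returns "fn", "fp", "unsure" or "ok".  Anything other than "ok"
    is a message the simulated user would train on.
    """
    if is_spam and score < spam_cutoff:
        # Spam that was not flagged as spam.
        if score < ham_cutoff:
            return "fn"
        return "unsure"
    if not is_spam and score > ham_cutoff:
        # Ham that was not passed as ham.
        if score > spam_cutoff:
            return "fp"
        return "unsure"
    return "ok"


# Illustrative thresholds only; the real ones come from the options.
needs_training = outcome(0.5, True, 0.2, 0.9) != "ok"
```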


From hooft@users.sourceforge.net  Sun Nov 10 12:02:36 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:02:36 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.1,1.2
Message-ID: <E18AqnU-0005vF-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22741

Modified Files:
	weaktest.py 
Log Message:
add flexcost; sanitize spacing

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weaktest.py	9 Nov 2002 21:48:52 -0000	1.1
--- weaktest.py	10 Nov 2002 12:02:33 -0000	1.2
***************
*** 59,63 ****
      nspam = len(spamfns)
      
!     allfns={}
      for fn in spamfns+hamfns:
          allfns[fn] = None
--- 59,63 ----
      nspam = len(spamfns)
      
!     allfns = {}
      for fn in spamfns+hamfns:
          allfns[fn] = None
***************
*** 65,74 ****
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
!     n=0
!     unsure=0
!     hamtrain=0
!     spamtrain=0
!     fp=0
!     fn=0
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
--- 65,80 ----
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
!     n = 0
!     unsure = 0
!     hamtrain = 0
!     spamtrain = 0
!     fp = 0
!     fn = 0
!     flexcost = 0
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     SPC = options.spam_cutoff
!     HC = options.ham_cutoff
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
***************
*** 82,87 ****
          if debug:
              print "score:%.3f"%scr,
!         if scr < hammie.SPAM_THRESHOLD and is_spam:
!             if scr < hammie.HAM_THRESHOLD:
                  fn += 1
                  if debug:
--- 88,96 ----
          if debug:
              print "score:%.3f"%scr,
!         if scr < SPC and is_spam:
!             t = FNW * (SPC - scr) / (SPC - HC)
!             #print "Spam at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr < HC:
                  fn += 1
                  if debug:
***************
*** 94,104 ****
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > hammie.HAM_THRESHOLD and not is_spam:
!             if scr > hammie.SPAM_THRESHOLD:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
              else:
                  unsure += 1
--- 103,116 ----
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > HC and not is_spam:
!             t = FPW * (scr - HC) / (SPC - HC)
!             #print "Ham at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr > SPC:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir, name), scr)
              else:
                  unsure += 1
***************
*** 113,126 ****
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure,unsure*100.0/len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
!     print "fp: %d fn: %d"%(fp,fn)
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)
      
  def main():
--- 125,136 ----
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure, unsure * 100.0 / len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
!     print "fp: %d fn: %d"%(fp, fn)
!     print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
!     print "Flex cost: $%.4f"%flexcost
      
  def main():
***************
*** 128,137 ****
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
!                                    ['ham-keep=', 'spam-keep='])
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = seed = hamkeep = spamkeep = None
      for opt, arg in opts:
          if opt == '-h':
--- 138,146 ----
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:')
      except getopt.error, msg:
          usage(1, msg)
  
!     nsets = None
      for opt, arg in opts:
          if opt == '-h':
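
The flexcost added in this revision charges each misplaced message in
proportion to how far its score lies on the wrong side of the relevant
cutoff, scaled by the width of the unsure band.  A minimal sketch of that
formula as a standalone function follows; flex_cost() is a hypothetical name,
and the weights used in the example call are illustrative values rather than
the configured options.

```python
def flex_cost(score, is_spam, ham_cutoff, spam_cutoff, fp_weight, fn_weight):
    """Return the flexible cost of one scored message.

    Spam scoring below spam_cutoff costs fn_weight times its distance
    from spam_cutoff; ham scoring above ham_cutoff costs fp_weight
    times its distance from ham_cutoff.  Both distances are measured
    in units of the unsure band (spam_cutoff - ham_cutoff).
    """
    band = spam_cutoff - ham_cutoff
    if is_spam and score < spam_cutoff:
        return fn_weight * (spam_cutoff - score) / band
    if not is_spam and score > ham_cutoff:
        return fp_weight * (score - ham_cutoff) / band
    return 0.0


# A spam message sitting exactly on the ham cutoff costs one full fn_weight.
cost = flex_cost(0.2, True, 0.2, 0.9, 10.0, 1.0)
```

Unlike the fixed "Total cost", this measure changes smoothly with the score,
which gives an optimizer something to follow even when no message crosses a
cutoff.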


From hooft@users.sourceforge.net  Sun Nov 10 12:07:18 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:07:18 -0800
Subject: [Spambayes-checkins] spambayes optimize.py,NONE,1.1
Message-ID: <E18Aqs2-0006JK-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24245

Added Files:
	optimize.py 
Log Message:
Simplex maximization

--- NEW FILE: optimize.py ---
#
__version__ = '$Id: optimize.py,v 1.1 2002/11/10 12:07:15 hooft Exp $'
#
# Maximize a function of several parameters with a simplex search.
#
import copy
import Numeric

def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001):
    var = Numeric.array(var)
    simplex = [var]
    for i in range(len(var)):
	var2 = copy.copy(var)
	var2[i] = var[i] + err[i]
	simplex.append(var2)
    value = []
    for i in range(len(simplex)):
	value.append(func(simplex[i]))
    while 1:
	# Determine worst and best
	wi = 0
	bi = 0
	for i in range(len(simplex)):
	    if value[wi] > value[i]:
		wi = i
	    if value[bi] < value[i]:
		bi = i
	# Test for convergence
	#print "worst, best are",wi,bi,"with",value[wi],value[bi]
	if abs(value[bi] - value[wi]) <= convcrit:
	    return simplex[bi]
	# Calculate average of non-worst
	ave=Numeric.zeros(len(var), 'd')
	for i in range(len(simplex)):
	    if i != wi:
		ave = ave + simplex[i]
	ave = ave / (len(simplex) - 1)
	worst = Numeric.array(simplex[wi])
	# Check for too-small simplex
	simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
	if simsize <= minerr:
	    #print "Size of simplex too small:",simsize
	    return simplex[bi]
	# Invert worst
	new = 2 * ave - simplex[wi]
	newv = func(new)
	if newv <= value[wi]:
	    # Even worse. Shrink instead
	    #print "Shrunk simplex"
	    #print "ave=",repr(ave)
	    #print "wi=",repr(worst)
	    new = 0.5 * ave + 0.5 * worst
	    newv = func(new)
	elif newv > value[bi]:
	    # Better than the best. Expand
	    new2 = 3 * ave - 2 * worst
	    newv2 = func(new2)
	    if newv2 > newv:
		# Accept
		#print "Expanded simplex"
		new = new2
		newv = newv2
	simplex[wi] = new
	value[wi] = newv

def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
    err = Numeric.array(err)
    var = SimplexMaximize(var, err, func, convcrit*5, minerr*5)
    return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr)
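
For readers without Numeric installed, here is a pure-Python sketch of the
same reflect / shrink / expand loop.  It assumes plain lists in place of
Numeric arrays and is written for a current Python; simplex_maximize() and
the one-parameter function peak() are illustrative names, not part of
optimize.py.

```python
def simplex_maximize(var, err, func, convcrit=0.001, minerr=0.001):
    """Maximize func(point) starting from var, with initial step sizes err."""
    n = len(var)
    # Initial simplex: the start point plus one point per parameter.
    simplex = [list(var)]
    for i in range(n):
        point = list(var)
        point[i] += err[i]
        simplex.append(point)
    value = [func(p) for p in simplex]
    while True:
        # Indices of the worst (lowest) and best (highest) values.
        wi = min(range(len(simplex)), key=lambda i: value[i])
        bi = max(range(len(simplex)), key=lambda i: value[i])
        if abs(value[bi] - value[wi]) <= convcrit:
            return simplex[bi]
        worst = simplex[wi]
        # Average of all points except the worst one.
        others = [p for i, p in enumerate(simplex) if i != wi]
        ave = [sum(p[k] for p in others) / len(others) for k in range(n)]
        # Stop when the simplex has become too small.
        if sum(abs(a - w) for a, w in zip(ave, worst)) <= minerr:
            return simplex[bi]
        # Reflect the worst point through the average.
        new = [2 * a - w for a, w in zip(ave, worst)]
        newv = func(new)
        if newv <= value[wi]:
            # Even worse: shrink towards the average instead.
            new = [0.5 * a + 0.5 * w for a, w in zip(ave, worst)]
            newv = func(new)
        elif newv > value[bi]:
            # Better than the best: try going twice as far.
            new2 = [3 * a - 2 * w for a, w in zip(ave, worst)]
            newv2 = func(new2)
            if newv2 > newv:
                new, newv = new2, newv2
        simplex[wi] = new
        value[wi] = newv


def peak(p):
    """Illustrative one-parameter function with its maximum at 3."""
    return -(p[0] - 3.0) ** 2


best = simplex_maximize([0.0], [1.0], peak)
```

As in the original, nothing bounds the number of iterations, so a function
that never satisfies either stopping test will loop forever.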


From hooft@users.sourceforge.net  Sun Nov 10 12:08:42 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:08:42 -0800
Subject: [Spambayes-checkins] spambayes weakloop.py,NONE,1.1
Message-ID: <E18AqtO-0006Q0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24653

Added Files:
	weakloop.py 
Log Message:
Loop simplex optimization over weaktest.py

--- NEW FILE: weakloop.py ---
#
# Optimize classifier parameters by running a simplex search over weaktest.py
#
"""Usage: %(program)s  [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

import sys

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

program = sys.argv[0]

default="""
[Classifier]
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150

[TestDriver]
spam_cutoff = 0.90
ham_cutoff = 0.20
"""

import Options

start = (Options.options.robinson_probability_x,
         Options.options.robinson_minimum_prob_strength,
         Options.options.robinson_probability_s,
         Options.options.spam_cutoff,
         Options.options.ham_cutoff)
err = (0.01, 0.01, 0.01, 0.005, 0.01)

def mkini(vars):
    f=open('bayescustomize.ini', 'w')
    f.write("""
[Classifier]
robinson_probability_x = %.6f
robinson_minimum_prob_strength = %.6f
robinson_probability_s = %.6f

[TestDriver]
spam_cutoff = %.4f
ham_cutoff = %.4f
"""%tuple(vars))
    f.close()

def score(vars):
    import os
    mkini(vars)
    status = os.system('python2.3 weaktest.py -n %d > weak.out'%nsets)
    if status != 0:
        print >> sys.stderr, "Error status from weaktest"
        sys.exit(status)
    f = open('weak.out', 'r')
    txt = f.readlines()
    # The last line reads "Flex cost: $<number>"; take the number after "$".
    cost = float(txt[-1].split()[2][1:])
    f.close()
    print ''.join(txt[-4:])[:-1]
    print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,))
    return -cost

def main():
    import optimize
    finish=optimize.SimplexMaximize(start,err,score)
    mkini(finish)

if __name__ == "__main__":
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:')
    except getopt.error, msg:
        usage(1, msg)

    nsets = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    main()
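
The score() function above depends on the last line of weaktest.py's output
having the form "Flex cost: $<number>".  A minimal sketch of that parsing
step on its own follows, assuming the output has already been read into a
list of lines; parse_flex_cost() is a hypothetical helper and the sample
lines are made up for illustration.

```python
def parse_flex_cost(lines):
    """Return the flex cost from the last line of weaktest-style output."""
    # The last line splits into ['Flex', 'cost:', '$<number>'].
    fields = lines[-1].split()
    return float(fields[2].lstrip('$'))


sample = ["fp: 1 fn: 2\n",
          "Total cost: $12.40\n",
          "Flex cost: $3.2500\n"]
cost = parse_flex_cost(sample)
# The simplex code maximizes, so the objective is the negated cost.
objective = -cost
```

Because the parser trusts the position of the last line, any extra output
printed after the flex cost in weaktest.py would break it.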


From tim_one@users.sourceforge.net  Sun Nov 10 19:59:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 11:59:24 -0800
Subject: [Spambayes-checkins] spambayes msgs.py,1.5,1.6 optimize.py,1.1,1.2
 pop3proxy.py,1.13,1.14 timcv.py,1.11,1.12 weaktest.py,1.2,1.3
Message-ID: <E18AyEu-0003ql-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14712

Modified Files:
	msgs.py optimize.py pop3proxy.py timcv.py weaktest.py 
Log Message:
Whitespace normalization.


Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** msgs.py	1 Nov 2002 04:10:50 -0000	1.5
--- msgs.py	10 Nov 2002 19:59:22 -0000	1.6
***************
*** 84,88 ****
  
  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.  
         If seed is not None, also set SEED.
         If (ham|spam)test are not set, set to the same as the (ham|spam)train
--- 84,88 ----
  
  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
         If seed is not None, also set SEED.
         If (ham|spam)test are not set, set to the same as the (ham|spam)train

Index: optimize.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/optimize.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** optimize.py	10 Nov 2002 12:07:15 -0000	1.1
--- optimize.py	10 Nov 2002 19:59:22 -0000	1.2
***************
*** 11,66 ****
      simplex = [var]
      for i in range(len(var)):
! 	var2 = copy.copy(var)
! 	var2[i] = var[i] + err[i]
! 	simplex.append(var2)
      value = []
      for i in range(len(simplex)):
! 	value.append(func(simplex[i]))
      while 1:
! 	# Determine worst and best
! 	wi = 0
! 	bi = 0
! 	for i in range(len(simplex)):
! 	    if value[wi] > value[i]:
! 		wi = i
! 	    if value[bi] < value[i]:
! 		bi = i
! 	# Test for convergence
! 	#print "worst, best are",wi,bi,"with",value[wi],value[bi]
! 	if abs(value[bi] - value[wi]) <= convcrit:
! 	    return simplex[bi]
! 	# Calculate average of non-worst
! 	ave=Numeric.zeros(len(var), 'd')
! 	for i in range(len(simplex)):
! 	    if i != wi:
! 		ave = ave + simplex[i]
! 	ave = ave / (len(simplex) - 1)
! 	worst = Numeric.array(simplex[wi])
! 	# Check for too-small simplex
! 	simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
! 	if simsize <= minerr:
! 	    #print "Size of simplex too small:",simsize
! 	    return simplex[bi]
! 	# Invert worst
! 	new = 2 * ave - simplex[wi]
! 	newv = func(new)
! 	if newv <= value[wi]:
! 	    # Even worse. Shrink instead
! 	    #print "Shrunk simplex"
! 	    #print "ave=",repr(ave)
! 	    #print "wi=",repr(worst)
! 	    new = 0.5 * ave + 0.5 * worst
! 	    newv = func(new)
! 	elif newv > value[bi]:
! 	    # Better than the best. Expand
! 	    new2 = 3 * ave - 2 * worst
! 	    newv2 = func(new2)
! 	    if newv2 > newv:
! 		# Accept
! 		#print "Expanded simplex"
! 		new = new2
! 		newv = newv2
! 	simplex[wi] = new
! 	value[wi] = newv
  
  def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
--- 11,66 ----
      simplex = [var]
      for i in range(len(var)):
!         var2 = copy.copy(var)
!         var2[i] = var[i] + err[i]
!         simplex.append(var2)
      value = []
      for i in range(len(simplex)):
!         value.append(func(simplex[i]))
      while 1:
!         # Determine worst and best
!         wi = 0
!         bi = 0
!         for i in range(len(simplex)):
!             if value[wi] > value[i]:
!                 wi = i
!             if value[bi] < value[i]:
!                 bi = i
!         # Test for convergence
!         #print "worst, best are",wi,bi,"with",value[wi],value[bi]
!         if abs(value[bi] - value[wi]) <= convcrit:
!             return simplex[bi]
!         # Calculate average of non-worst
!         ave=Numeric.zeros(len(var), 'd')
!         for i in range(len(simplex)):
!             if i != wi:
!                 ave = ave + simplex[i]
!         ave = ave / (len(simplex) - 1)
!         worst = Numeric.array(simplex[wi])
!         # Check for too-small simplex
!         simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
!         if simsize <= minerr:
!             #print "Size of simplex too small:",simsize
!             return simplex[bi]
!         # Invert worst
!         new = 2 * ave - simplex[wi]
!         newv = func(new)
!         if newv <= value[wi]:
!             # Even worse. Shrink instead
!             #print "Shrunk simplex"
!             #print "ave=",repr(ave)
!             #print "wi=",repr(worst)
!             new = 0.5 * ave + 0.5 * worst
!             newv = func(new)
!         elif newv > value[bi]:
!             # Better than the best. Expand
!             new2 = 3 * ave - 2 * worst
!             newv2 = func(new2)
!             if newv2 > newv:
!                 # Accept
!                 #print "Expanded simplex"
!                 new = new2
!                 newv = newv2
!         simplex[wi] = new
!         value[wi] = newv
  
  def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
--- pop3proxy.py	10 Nov 2002 19:59:22 -0000	1.14
***************
*** 140,144 ****
      can't connect to the real POP3 server and talk to it
      synchronously, because that would block the process."""
!     
      def __init__(self, serverName, serverPort, lineCallback):
          BrighterAsyncChat.__init__(self)
--- 140,144 ----
      can't connect to the real POP3 server and talk to it
      synchronously, because that would block the process."""
! 
      def __init__(self, serverName, serverPort, lineCallback):
          BrighterAsyncChat.__init__(self)
***************
*** 148,152 ****
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
          self.connect((serverName, serverPort))
!     
      def collect_incoming_data(self, data):
          self.request = self.request + data
--- 148,152 ----
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
          self.connect((serverName, serverPort))
! 
      def collect_incoming_data(self, data):
          self.request = self.request + data
***************
*** 184,188 ****
          self.seenAllHeaders = False # For the current RETR or TOP
          self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
                                               self.onServerLine)
  
--- 184,188 ----
          self.seenAllHeaders = False # For the current RETR or TOP
          self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort,
                                               self.onServerLine)
  
***************
*** 198,214 ****
          isFirstLine = not self.response
          self.response = self.response + line
!         
          # Is this line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
          # Has the server closed its end of the socket?
          if not line:
              self.isClosing = True
!         
          # If we're not processing a command, just echo the response.
          if not self.command:
              self.push(self.response)
              self.response = ''
!         
          # Time out after 30 seconds for message-retrieval commands if
          # all the headers are down.  The rest of the message will proxy
--- 198,214 ----
          isFirstLine = not self.response
          self.response = self.response + line
! 
          # Is this line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
! 
          # Has the server closed its end of the socket?
          if not line:
              self.isClosing = True
! 
          # If we're not processing a command, just echo the response.
          if not self.command:
              self.push(self.response)
              self.response = ''
! 
          # Time out after 30 seconds for message-retrieval commands if
          # all the headers are down.  The rest of the message will proxy
***************
*** 223,227 ****
              self.onResponse()
              self.response = ''
!     
      def isMultiline(self):
          """Returns True if the request should get a multiline
--- 223,227 ----
              self.onResponse()
              self.response = ''
! 
      def isMultiline(self):
          """Returns True if the request should get a multiline
***************
*** 254,258 ****
              self.close()
              raise SystemExit
!         
          self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
--- 254,258 ----
              self.close()
              raise SystemExit
! 
          self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
***************
*** 265,271 ****
              self.args = splitCommand[1:]
              self.startTime = time.time()
!         
          self.request = ''
!         
      def onResponse(self):
          # Pass the request and the raw response to the subclass and
--- 265,271 ----
              self.args = splitCommand[1:]
              self.startTime = time.time()
! 
          self.request = ''
! 
      def onResponse(self):
          # Pass the request and the raw response to the subclass and
***************
*** 273,277 ****
          cooked = self.onTransaction(self.command, self.args, self.response)
          self.push(cooked)
!         
          # If onServerLine() decided that the server has closed its
          # socket, close this one when the response has been sent.
--- 273,277 ----
          cooked = self.onTransaction(self.command, self.args, self.response)
          self.push(cooked)
! 
          # If onServerLine() decided that the server has closed its
          # socket, close this one when the response has been sent.
***************
*** 351,355 ****
          status.activeSessions -= 1
          POP3ProxyBase.close(self)
!     
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
--- 351,355 ----
          status.activeSessions -= 1
          POP3ProxyBase.close(self)
! 
      def onTransaction(self, command, args, response):
          """Takes the raw request and response, and returns the
***************
*** 419,423 ****
                  if command == 'RETR':
                      status.numUnsure += 1
!             
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
--- 419,423 ----
                  if command == 'RETR':
                      status.numUnsure += 1
! 
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
***************
*** 490,494 ****
               .content { margin: 15 }
               .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: fffae0; padding-left: 1ex; 
                                 border-bottom: 1px solid #808080;
                                 font-weight: bold }
--- 490,494 ----
               .content { margin: 15 }
               .sectiontable { border: 1px solid #808080; width: 95%% }
!              .sectionheading { background: fffae0; padding-left: 1ex;
                                 border-bottom: 1px solid #808080;
                                 font-weight: bold }
***************
*** 513,517 ****
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
!     
      shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
--- 513,517 ----
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
! 
      shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
***************
*** 521,525 ****
                    <tr><td class='sectionbody'>%s</td></tr></table>
                    &nbsp;<br>\n"""
!     
      summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
                proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
--- 521,525 ----
                    <tr><td class='sectionbody'>%s</td></tr></table>
                    &nbsp;<br>\n"""
! 
      summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
                proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
***************
*** 529,538 ****
                  <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
!     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
!     
      train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
--- 529,538 ----
                  <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
! 
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
! 
      train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
***************
*** 546,550 ****
              <input type='submit' value='Train on this message'>
              </form>"""
!     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
--- 546,550 ----
              <input type='submit' value='Train on this message'>
              </form>"""
! 
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 577,581 ****
                  self.request = self.request + '\r\n\r\n'
                  return
!     
              if type(self.get_terminator()) is type(1):
                  # We've just read the body of a POSTed request.
--- 577,581 ----
                  self.request = self.request + '\r\n\r\n'
                  return
! 
              if type(self.get_terminator()) is type(1):
                  # We've just read the body of a POSTed request.
***************
*** 592,596 ****
                      # A normal x-www-form-urlencoded.
                      params.update(cgi.parse_qs(body, keep_blank_values=True))
!             
              # Convert the cgi params into a simple dictionary.
              plainParams = {}
--- 592,596 ----
                      # A normal x-www-form-urlencoded.
                      params.update(cgi.parse_qs(body, keep_blank_values=True))
! 
              # Convert the cgi params into a simple dictionary.
              plainParams = {}
***************
*** 604,608 ****
          if path == '/':
              path = '/Home'
!         
          if path == '/helmet.gif':
              # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
--- 604,608 ----
          if path == '/':
              path = '/Home'
! 
          if path == '/helmet.gif':
              # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
***************
*** 628,632 ****
                  else:
                      self.push(self.footer % (timeString, self.shutdownPickle))
!     
      def pushOKHeaders(self, contentType, extraHeaders={}):
          timeNow = time.gmtime(time.time())
--- 628,632 ----
                  else:
                      self.push(self.footer % (timeString, self.shutdownPickle))
! 
      def pushOKHeaders(self, contentType, extraHeaders={}):
          timeNow = time.gmtime(time.time())
***************
*** 645,649 ****
          self.push("\r\n")
          self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
!     
      def pushPreamble(self, name):
          self.push(self.header % name)
--- 645,649 ----
          self.push("\r\n")
          self.push("<html><body><p>%d %s</p></body></html>" % (code, message))
! 
      def pushPreamble(self, name):
          self.push(self.header % name)
***************
*** 681,685 ****
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
!         
          # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
--- 681,685 ----
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
! 
          # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
***************
*** 718,722 ****
          except KeyError:
              info = "'%s' does not appear in the database." % word
!         
          body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
--- 718,722 ----
          except KeyError:
              info = "'%s' does not appear in the database." % word
! 
          body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
***************
*** 992,996 ****
          elif opt == '-u':
              status.uiPort = int(arg)
!             
      # Do whatever we've been asked to do...
      if not opts and not args:
--- 992,996 ----
          elif opt == '-u':
              status.uiPort = int(arg)
! 
      # Do whatever we've been asked to do...
      if not opts and not args:

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timcv.py	1 Nov 2002 04:10:50 -0000	1.11
--- timcv.py	10 Nov 2002 19:59:22 -0000	1.12
***************
*** 15,19 ****
  
      --HamTrain int
!         The maximum number of msgs to use from each Ham set for training.  
          The msgs are chosen randomly.  See also the -s option.
  
--- 15,19 ----
  
      --HamTrain int
!         The maximum number of msgs to use from each Ham set for training.
          The msgs are chosen randomly.  See also the -s option.
  
***************
*** 23,27 ****
  
      --HamTest int
!         The maximum number of msgs to use from each Ham set for testing.  
          The msgs are chosen randomly.  See also the -s option.
  
--- 23,27 ----
  
      --HamTest int
!         The maximum number of msgs to use from each Ham set for testing.
          The msgs are chosen randomly.  See also the -s option.
  
***************
*** 73,79 ****
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), 
                              hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), 
                              spamdirs[1:], train=1))
  
--- 73,79 ----
      d = TestDriver.Driver()
      # Train it on all sets except the first.
!     d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
                              hamdirs[1:], train=1),
!             msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
                              spamdirs[1:], train=1))
  
***************
*** 98,102 ****
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1), 
                          msgs.SpamStream(sname, s2, train=1))
  
--- 98,102 ----
                  del s2[i]
  
!                 d.train(msgs.HamStream(hname, h2, train=1),
                          msgs.SpamStream(sname, s2, train=1))
  

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** weaktest.py	10 Nov 2002 12:02:33 -0000	1.2
--- weaktest.py	10 Nov 2002 19:59:22 -0000	1.3
***************
*** 58,62 ****
      nham = len(hamfns)
      nspam = len(spamfns)
!     
      allfns = {}
      for fn in spamfns+hamfns:
--- 58,62 ----
      nham = len(hamfns)
      nspam = len(spamfns)
! 
      allfns = {}
      for fn in spamfns+hamfns:
***************
*** 133,137 ****
      print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
      print "Flex cost: $%.4f"%flexcost
!     
  def main():
      import getopt
--- 133,137 ----
      print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
      print "Flex cost: $%.4f"%flexcost
! 
  def main():
      import getopt


From tim_one@users.sourceforge.net  Sun Nov 10 20:00:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 12:00:03 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24
Message-ID: <E18AyFX-0003uk-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14946

Modified Files:
	msgstore.py 
Log Message:
Whitespace normalization.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** msgstore.py	7 Nov 2002 22:30:09 -0000	1.23
--- msgstore.py	10 Nov 2002 19:59:59 -0000	1.24
***************
*** 397,401 ****
              # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
              pass
!             
          return "%s\n%s\n%s" % (headers, html, body)
  
--- 397,401 ----
              # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
              pass
! 
          return "%s\n%s\n%s" % (headers, html, body)
  

From tim_one@users.sourceforge.net  Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4
Message-ID: <E18B3r2-0001Re-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam

Modified Files:
	profile.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** profile.py	7 Nov 2002 22:30:11 -0000	1.3
--- profile.py	11 Nov 2002 01:59:06 -0000	1.4
***************
*** 44,48 ****
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 44,48 ----
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0


From tim_one@users.sourceforge.net  Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] 
 spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Message-ID: <E18B3r2-0001RY-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402

Modified Files:
	Options.py classifier.py weakloop.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength
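
For readers meeting the new names for the first time, here is a rough sketch
of how the two "unknown word" options combine with a counting estimate.  It
assumes the usual Robinson adjustment, (s*x + n*p) / (s + n), which is
consistent with the S and StimesX values computed in classifier.py below.
The function name and the meaning of n (the number of training messages
containing the word) are assumptions made for illustration, and the sketch
targets Python 3 rather than the Python 2 of the checked-in code.

```python
def adjusted_spamprob(raw_prob, n,
                      unknown_word_prob=0.5,
                      unknown_word_strength=0.45):
    """Blend a counting estimate with the prior for unknown words.

    raw_prob: spam probability estimated purely by counting.
    n:        how many training messages contained the word (assumed).
    """
    s = unknown_word_strength
    x = unknown_word_prob
    # With n == 0 this reduces to x; as n grows, raw_prob dominates.
    # With s == 0 and n == 0 the division would fail, so s must be > 0
    # for words that have never been seen.
    return (s * x + n * raw_prob) / (s + n)


never_seen = adjusted_spamprob(1.0, 0)    # falls back to the prior
seen_once = adjusted_spamprob(1.0, 1)     # pulled part-way toward 1.0
seen_often = adjusted_spamprob(1.0, 10)   # close to, but never exactly, 1.0
```

This is why a strength of 0 is described as a disaster in Options.py: the
prior then has no weight at all, and a word seen only in spam gets a
probability of exactly 1.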


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py	8 Nov 2002 04:06:23 -0000	1.67
--- Options.py	11 Nov 2002 01:59:06 -0000	1.68
***************
*** 241,268 ****
  
  # These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before.  Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam.  This is a disaster.
! # As s tends toward infintity, all probabilities tend toward x.  All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE:  Gary Robinson previously used a different formula involving 'a'
! # and 'x'.  The 'x' here is the same as before.  The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinsons base scheme.  0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on Gary Robinons web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
--- 241,268 ----
  
  # These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before.  Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting.  At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam.  This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob.  All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests.  0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on the Robinson web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
***************
*** 272,284 ****
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  That is
! # the "provably most-sensitive" test Garys original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far on Tims data.
! # One systematic benefit is that it is immune to "cancellation disease".  One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H)).
--- 272,284 ----
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  This is
! # the "provably most-sensitive" test the original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease".  One
! # systematic drawback is sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H)).
***************
*** 381,387 ****
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'robinson_probability_x': float_cracker,
!                    'robinson_probability_s': float_cracker,
!                    'robinson_minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'unknown_word_prob': float_cracker,
!                    'unknown_word_strength': float_cracker,
!                    'minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
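
The chi-squared comment in the Options.py diff above can be made concrete
with a short sketch.  It assumes the closed form of the chi-squared survival
function for an even number of degrees of freedom, and it combines the S and
H measures via (S-H+1)/2 as the comment describes.  The function names are
illustrative, the input probabilities are assumed to lie strictly between 0
and 1, and the code is written for Python 3; it is not the classifier's
actual implementation.

```python
import math


def chi2Q(x2, v):
    """Return P(chi-squared with v degrees of freedom >= x2), v even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # -2*sum(ln(p_i)) is chi-squared with 2*n degrees of freedom when the
    # p_i are uniformly distributed; large values are evidence against that.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0


spammy = chi_combine([0.99] * 10)
hammy = chi_combine([0.01] * 10)
mixed = chi_combine([0.99] * 5 + [0.01] * 5)
```

In this sketch, strong clues pointing in both directions give large S and
large H together, so the mixed case lands in the middle rather than at
either extreme.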

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py	7 Nov 2002 22:30:05 -0000	1.49
--- classifier.py	11 Nov 2002 01:59:06 -0000	1.50
***************
*** 70,74 ****
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
          nspam = float(self.nspam or 1)
  
!         S = options.robinson_probability_s
!         StimesX = S * options.robinson_probability_x
  
          for word, record in self.wordinfo.iteritems():
--- 322,327 ----
          nspam = float(self.nspam or 1)
  
!         S = options.unknown_word_strength
!         StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
  
      def _getclues(self, wordstream):
!         mindist = options.robinson_minimum_prob_strength
!         unknown = options.robinson_probability_x
  
          clues = []  # (distance, prob, word, record) tuples
--- 449,454 ----
  
      def _getclues(self, wordstream):
!         mindist = options.minimum_prob_strength
!         unknown = options.unknown_word_prob
  
          clues = []  # (distance, prob, word, record) tuples
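
As an illustration of what minimum_prob_strength and unknown_word_prob do in
_getclues, here is a simplified sketch.  It assumes a plain dict of word
probabilities instead of the classifier's word records, and it assumes that
duplicate words in the stream count once; the function name and signature
are hypothetical, and the code targets Python 3.

```python
def get_clues(wordstream, known_probs,
              unknown_word_prob=0.5,
              minimum_prob_strength=0.1,
              max_discriminators=150):
    """Pick the strongest clues from a message's words.

    known_probs maps word -> spam probability (a stand-in for the
    classifier's database).  Returns (prob, word) pairs, strongest first.
    """
    clues = []  # (distance, prob, word) tuples
    for word in set(wordstream):
        prob = known_probs.get(word, unknown_word_prob)
        distance = abs(prob - 0.5)
        # Words too close to 0.5 carry little evidence and are ignored.
        if distance >= minimum_prob_strength:
            clues.append((distance, prob, word))
    clues.sort(reverse=True)
    return [(prob, word) for distance, prob, word in clues[:max_discriminators]]


known = {"viagra": 0.99, "python": 0.02, "the": 0.52}
clues = get_clues(["viagra", "python", "the", "zzz", "viagra"], known)
```

With the default unknown_word_prob of 0.5, a never-seen word such as "zzz"
sits at distance 0 and is dropped along with bland words like "the".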

Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py	10 Nov 2002 12:08:40 -0000	1.1
--- weakloop.py	11 Nov 2002 01:59:06 -0000	1.2
***************
*** 29,35 ****
  default="""
  [Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
  max_discriminators = 150
  
--- 29,35 ----
  default="""
  [Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
  max_discriminators = 150
  
***************
*** 41,47 ****
  import Options
  
! start = (Options.options.robinson_probability_x,
!          Options.options.robinson_minimum_prob_strength,
!          Options.options.robinson_probability_s,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
--- 41,47 ----
  import Options
  
! start = (Options.options.unknown_word_prob,
!          Options.options.minimum_prob_strength,
!          Options.options.unknown_word_strength,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
***************
*** 52,58 ****
      f.write("""
  [Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
  
  [TestDriver]
--- 52,58 ----
      f.write("""
  [Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
  
  [TestDriver]


From tim_one@users.sourceforge.net  Fri Nov  8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
	tokenizer.py,1.63,1.64
Message-ID: <E18A0Pd-0008K2-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798

Modified Files:
	Options.py tokenizer.py 
Log Message:
Removed the retain_pure_html_tags option; nobody enables it anymore, and it's
hard to believe it would ever help now (except as an HTML detector).
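
As a rough illustration of the stripping that now applies to every text part (not only text/plain), here is a minimal sketch in modern Python. The tag pattern is a simplified stand-in: the real html_re in tokenizer.py is not shown in this checkin and may differ. strip_html is a hypothetical helper name.

```python
import re

# Simplified, assumed pattern; not the actual html_re from tokenizer.py.
html_re = re.compile(r"<[^>]*>")


def strip_html(text):
    """Replace &nbsp; entities and anything that looks like a tag with a space."""
    text = text.replace('&nbsp;', ' ')
    return html_re.sub(' ', text)
```

With the option gone, these two substitutions run unconditionally before the body is tokenized.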


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py	7 Nov 2002 22:25:46 -0000	1.66
--- Options.py	8 Nov 2002 04:06:23 -0000	1.67
***************
*** 42,53 ****
      x-.*
  
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages.  Set true to retain HTML tags in this
- # case.  On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
- 
  # If true, the first few characters of application/octet-stream sections
  # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
  
  all_options = {
!     'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
!                   'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
--- 339,343 ----
  
  all_options = {
!     'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py	7 Nov 2002 22:30:08 -0000	1.63
--- tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
***************
*** 495,504 ****
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was removed.
  
  ##############################################################################
--- 495,505 ----
  # Later:  As the amount of training data increased, the effect of retaining
  # HTML tags decreased to insignificance.  options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False.  Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
  #
  # Later:  The decision to ignore "redundant" HTML is also dubious, since
  # the text/plain and text/html alternatives may have entirely different
  # content.  options.ignore_redundant_html was introduced to control this,
! # and it defaults to False.  Later:  ignore_redundant_html was also removed.
  
  ##############################################################################
***************
*** 1167,1175 ****
          """Generate a stream of tokens from an email Message.
  
-         HTML tags are always stripped from text/plain sections.
-         options.retain_pure_html_tags controls whether HTML tags are
-         also stripped from text/html sections.  Except in special cases,
-         it's recommended to leave that at its default of false.
- 
          If options.check_octets is True, the first few undecoded characters
          of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             if (part.get_content_type() == "text/plain" or
!                     not options.retain_pure_html_tags):
!                 text = text.replace('&nbsp;', ' ')
!                 text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
--- 1224,1229 ----
  
              # Remove HTML/XML tags.  Also &nbsp;.
!             text = text.replace('&nbsp;', ' ')
!             text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.


From richiehindle@users.sourceforge.net  Fri Nov  8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID: <E18A440-0006h6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390

Modified Files:
	pop3proxy.py 
Log Message:
 o The database is now saved (optionally) on exit, rather than after each
   message you train on.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome greeting.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.
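
The move to async sockets means the proxy can no longer loop on readline(); it has to decide, one line at a time, whether the server's response is complete. The sketch below (modern Python) isolates that rule as it appears in onServerLine() in the diff: a response ends if the command is single-line, if the line is the '.' terminator, or if the first line is an -ERR. The names ResponseAssembler, is_multiline and feed are hypothetical, and the 30-second timeout and all socket handling are deliberately left out.

```python
def is_multiline(command, args):
    """True if a positive response to this POP3 command spans several lines."""
    if command in ('RETR', 'TOP'):
        return True
    if command in ('LIST', 'UIDL'):
        # With a message-number argument the reply is a single line.
        return len(args) == 0
    return False


class ResponseAssembler:
    """Collects server lines for one command (illustrative, no sockets)."""

    def __init__(self, command, args=()):
        self.command = command
        self.args = list(args)
        self.response = ''

    def feed(self, line):
        """Add one line; return the whole response once complete, else None."""
        is_first_line = not self.response
        self.response += line
        complete = (not is_multiline(self.command, self.args)
                    or line == '.\r\n'
                    or (is_first_line and line.startswith('-ERR')))
        if complete:
            done, self.response = self.response, ''
            return done
        return None
```

Feeding '+OK\r\n', '1 120\r\n' and '.\r\n' to an assembler for a bare LIST returns None twice and then the full three-line response; the same first line for STAT would complete immediately.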


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py	7 Nov 2002 22:27:02 -0000	1.11
--- pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
***************
*** 47,50 ****
--- 47,74 ----
  
  
+ todo = """
+  o (Re)training interface - one message per line, quick-rendering table.
+  o Slightly-wordy index page; intro paragraph for each page.
+  o Once the training stuff is on a separate page, make the paste box
+    bigger.
+  o "Links" section (on homepage?) to project homepage, mailing list,
+    etc.
+  o "Home" link (with helmet!) at the end of each page.
+  o "Classify this" - just like Train.
+  o "Send me an email every [...] to remind me to train on new
+    messages."
+  o "Send me a status email every [...] telling how many mails have been
+    classified, etc."
+  o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
+    webbrowser?
+  o Possibly integrate Tim Stone's SMTP code - make it use async, make
+    the training code update (rather than replace!) the database.
+  o Can it cleanly dynamically update its status display while having a
+    POP3 conversation?  Hammering reload sucks.
+  o Add a command to save the database without shutting down, and one to
+    reload the database.
+  o Leave the word in the input field after a Word query.
+ """
+ 
  import sys, re, operator, errno, getopt, cPickle, cStringIO, time
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
              self.factory(*args)
  
+ 
  class BrighterAsyncChat(asynchat.async_chat):
      """An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
  
  
+ class ServerLineReader(BrighterAsyncChat):
+     """An async socket that reads lines from a remote server and
+     simply calls a callback with the data.  The BayesProxy object
+     can't connect to the real POP3 server and talk to it
+     synchronously, because that would block the process."""
+     
+     def __init__(self, serverName, serverPort, lineCallback):
+         BrighterAsyncChat.__init__(self)
+         self.lineCallback = lineCallback
+         self.request = ''
+         self.set_terminator('\r\n')
+         self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+         self.connect((serverName, serverPort))
+     
+     def collect_incoming_data(self, data):
+         self.request = self.request + data
+ 
+     def found_terminator(self):
+         self.lineCallback(self.request + '\r\n')
+         self.request = ''
+ 
+     def handle_close(self):
+         self.lineCallback('')
+         self.close()
+ 
+ 
  class POP3ProxyBase(BrighterAsyncChat):
      """An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
          self.set_terminator('\r\n')
!         self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.serverSocket.connect((serverName, serverPort))
!         self.serverIn = self.serverSocket.makefile('r')  # For reading only
!         self.push(self.serverIn.readline())
  
      def onTransaction(self, command, args, response):
--- 177,189 ----
          BrighterAsyncChat.__init__(self, clientSocket)
          self.request = ''
+         self.response = ''
          self.set_terminator('\r\n')
!         self.command = ''           # The POP3 command being processed...
!         self.args = ''              # ...and its arguments
!         self.isClosing = False      # Has the server closed the socket?
!         self.seenAllHeaders = False # For the current RETR or TOP
!         self.startTime = 0          # (ditto)
!         self.serverSocket = ServerLineReader(serverName, serverPort, 
!                                              self.onServerLine)
  
      def onTransaction(self, command, args, response):
***************
*** 139,152 ****
          raise NotImplementedError
  
!     def isMultiline(self, command, args):
!         """Returns True if the given request should get a multiline
          response (assuming the response is positive).
          """
!         if command in ['USER', 'PASS', 'APOP', 'QUIT',
!                        'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif command in ['RETR', 'TOP']:
              return True
!         elif command in ['LIST', 'UIDL']:
              return len(args) == 0
          else:
--- 194,237 ----
          raise NotImplementedError
  
!     def onServerLine(self, line):
!         """A line of response has been received from the POP3 server."""
!         isFirstLine = not self.response
!         self.response = self.response + line
!         
!         # Is this the line that terminates a set of headers?
!         self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!         
!         # Has the server closed its end of the socket?
!         if not line:
!             self.isClosing = True
!         
!         # If we're not processing a command, just echo the response.
!         if not self.command:
!             self.push(self.response)
!             self.response = ''
!         
!         # Time out after 30 seconds for message-retrieval commands if
!         # all the headers are down.  The rest of the message will proxy
!         # straight through.
!         if self.command in ['TOP', 'RETR'] and \
!            self.seenAllHeaders and time.time() > self.startTime + 30:
!             self.onResponse()
!             self.response = ''
!         # If that's a complete response, handle it.
!         elif not self.isMultiline() or line == '.\r\n' or \
!            (isFirstLine and line.startswith('-ERR')):
!             self.onResponse()
!             self.response = ''
!     
!     def isMultiline(self):
!         """Returns True if the request should get a multiline
          response (assuming the response is positive).
          """
!         if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
!                             'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
              return False
!         elif self.command in ['RETR', 'TOP']:
              return True
!         elif self.command in ['LIST', 'UIDL']:
!             return len(self.args) == 0
          else:
***************
*** 155,204 ****
              return False
  
-     def readResponse(self, command, args):
-         """Reads the POP3 server's response and returns a tuple of
-         (response, isClosing, timedOut).  isClosing is True if the
-         server closes the socket, which tells found_terminator() to
-         close when the response has been sent.  timedOut is set if a
-         TOP or RETR request was still arriving after 30 seconds, and
-         tells found_terminator() to proxy the remainder of the response.
-         """
-         responseLines = []
-         startTime = time.time()
-         isMulti = self.isMultiline(command, args)
-         isClosing = False
-         timedOut = False
-         isFirstLine = True
-         seenAllHeaders = False
-         while True:
-             line = self.serverIn.readline()
-             if not line:
-                 # The socket's been closed by the server, probably by QUIT.
-                 isClosing = True
-                 break
-             elif not isMulti or (isFirstLine and line.startswith('-ERR')):
-                 # A single-line response.
-                 responseLines.append(line)
-                 break
-             elif line == '.\r\n':
-                 # The termination line.
-                 responseLines.append(line)
-                 break
-             else:
-                 # A normal line - append it to the response and carry on.
-                 responseLines.append(line)
-                 seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
- 
-             # Time out after 30 seconds for message-retrieval commands
-             # if all the headers are down - found_terminator() knows how
-             # to deal with this.
-             if command in ['TOP', 'RETR'] and \
-                seenAllHeaders and time.time() > startTime + 30:
-                 timedOut = True
-                 break
- 
-             isFirstLine = False
- 
-         return ''.join(responseLines), isClosing, timedOut
- 
      def collect_incoming_data(self, data):
          """Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
      def found_terminator(self):
          """Asynchat override."""
-         # Send the request to the server and read the reply.
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
              self.shutdown(2)
              self.close()
              raise SystemExit
!         self.serverSocket.sendall(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             command, args = ('', '')
          else:
              splitCommand = self.request.strip().split(None, 1)
!             command = splitCommand[0].upper()
!             args = splitCommand[1:]
!         rawResponse, isClosing, timedOut = self.readResponse(command, args)
! 
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cookedResponse = self.onTransaction(command, args, rawResponse)
!         self.push(cookedResponse)
!         self.request = ''
! 
!         # If readResponse() timed out, we still need to read and proxy
!         # the rest of the message.
!         if timedOut:
!             while True:
!                 line = self.serverIn.readline()
!                 if not line:
!                     # The socket's been closed by the server.
!                     isClosing = True
!                     break
!                 elif line == '.\r\n':
!                     # The termination line.
!                     self.push(line)
!                     break
!                 else:
!                     # A normal line.
!                     self.push(line)
! 
!         # If readResponse() or the loop above decided that the server
!         # has closed its socket, close this one when the response has
!         # been sent.
!         if isClosing:
              self.close_when_done()
  
  
  class BayesProxyListener(Listener):
--- 246,288 ----
      def found_terminator(self):
          """Asynchat override."""
          if self.request.strip().upper() == 'KILL':
              self.serverSocket.sendall('QUIT\r\n')
              self.send("+OK, dying.\r\n")
+             self.serverSocket.shutdown(2)
+             self.serverSocket.close()
              self.shutdown(2)
              self.close()
              raise SystemExit
!         
!         self.serverSocket.push(self.request + '\r\n')
          if self.request.strip() == '':
              # Someone just hit the Enter key.
!             self.command = self.args = ''
          else:
+             # A proper command.
              splitCommand = self.request.strip().split(None, 1)
!             self.command = splitCommand[0].upper()
!             self.args = splitCommand[1:]
!             self.startTime = time.time()
!         
!         self.request = ''
!         
!     def onResponse(self):
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cooked = self.onTransaction(self.command, self.args, self.response)
!         self.push(cooked)
!         
!         # If onServerLine() decided that the server has closed its
!         # socket, close this one when the response has been sent.
!         if self.isClosing:
              self.close_when_done()
  
+         # Reset.
+         self.command = ''
+         self.args = ''
+         self.isClosing = False
+         self.seenAllHeaders = False
+ 
  
  class BayesProxyListener(Listener):
***************
*** 452,456 ****
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding=5; padding-left: 15 }
               .header { font-size: 133%% }
               .content { margin: 15 }
--- 484,490 ----
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
!              .banner { background: #c0e0ff; padding: 5; padding-left: 15;
!                        border-top: 1px solid black;
!                        border-bottom: 1px solid black }
               .header { font-size: 133%% }
               .content { margin: 15 }
***************
*** 466,470 ****
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
--- 500,504 ----
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
!                 <span class='header'>&nbsp;Spambayes proxy: %s</span></div>
                  <div class='content'>\n"""
  
***************
*** 475,481 ****
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              <input type='submit' value='Shutdown now'>
               </td></tr></table></form>\n"""
  
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
--- 509,520 ----
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
!              %s
               </td></tr></table></form>\n"""
  
+     shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
+     
+     shutdownPickle = shutdownDB + """&nbsp;&nbsp;
+             <input type='submit' name='how' value='Save &amp; shutdown'>"""
+ 
      pageSection = """<table class='sectiontable' cellspacing='0'>
                    <tr><td class='sectionheading'>%s</td></tr>
***************
*** 483,486 ****
--- 522,533 ----
                    &nbsp;<br>\n"""
      
+     summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
+               proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
+               Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
+               POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
+               Emails classified this session: <b>%(numSpams)d</b> spam,
+                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
+               """
+     
      wordQuery = """<form action='/wordquery'>
                  <input name='word' type='text' size='30'>
***************
*** 488,491 ****
--- 535,550 ----
                  </form>"""
      
+     train = """<form action='/upload' method='POST'
+                 enctype='multipart/form-data'>
+             Either upload a message file: <input type='file' name='file'><br>
+             Or paste the whole message (including headers) here:<br>
+             <textarea name='text' rows='3' cols='60'></textarea><br>
+             Is this message
+             <input type='radio' name='which' value='ham'>Ham</input> or
+             <input type='radio'
+                    name='which' value='spam' checked>Spam</input>?<br>
+             <input type='submit' value='Train on this message'>
+             </form>"""
+     
      def __init__(self, clientSocket, bayes):
          BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = self.request.split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
--- 561,565 ----
          """Asynchat override.
          Read and parse the HTTP request and call an on<Command> handler."""
!         requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
          try:
              method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
          
          if path == '/helmet.gif':
!             self.pushOKHeaders('image/gif')
              self.push(self.helmet)
          else:
--- 606,614 ----
          
          if path == '/helmet.gif':
!             # XXX Why doesn't Expires work?  Must read RFC 2616 one day.
!             inOneHour = time.gmtime(time.time() + 3600)
!             expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
!             extraHeaders = {'Expires': expiryDate}
!             self.pushOKHeaders('image/gif', extraHeaders)
              self.push(self.helmet)
          else:
***************
*** 554,558 ****
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % url)
              else:
                  # This is a request for a valid page; run the handler.
--- 617,621 ----
                  handler = getattr(self, 'on' + name)
              except AttributeError:
!                 self.pushError(404, "Not found: '%s'" % path)
              else:
                  # This is a request for a valid page; run the handler.
***************
*** 561,569 ****
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 self.push(self.footer % timeString)
      
!     def pushOKHeaders(self, contentType):
!         self.push("HTTP/1.0 200 OK\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
          self.push("\r\n")
  
--- 624,641 ----
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 if status.useDB:
!                     self.push(self.footer % (timeString, self.shutdownDB))
!                 else:
!                     self.push(self.footer % (timeString, self.shutdownPickle))
      
!     def pushOKHeaders(self, contentType, extraHeaders={}):
!         timeNow = time.gmtime(time.time())
!         httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
!         self.push("HTTP/1.1 200 OK\r\n")
!         self.push("Connection: close\r\n")
          self.push("Content-Type: %s\r\n" % contentType)
+         self.push("Date: %s\r\n" % httpNow)
+         for name, value in extraHeaders.items():
+             self.push("%s: %s\r\n" % (name, value))
          self.push("\r\n")
  
***************
*** 583,616 ****
  
      def onHome(self, params):
!         summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!                   proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
!                   Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
!                   POP3 conversations this session:
!                     <b>%(totalSessions)d</b>.<br>
!                   Emails classified this session: <b>%(numSpams)d</b> spam,
!                     <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
!                   """ % status.__dict__
!         
!         train = """<form action='/upload' method='POST'
!                     enctype='multipart/form-data'>
!                 Either upload a message file:
!                 <input type='file' name='file'><br>
!                 Or paste the whole message (incuding headers) here:<br>
!                 <textarea name='text' rows='3' cols='60'></textarea><br>
!                 Is this message
!                 <input type='radio' name='which' value='ham'>Ham</input> or
!                 <input type='radio'
!                        name='which' value='spam' checked>Spam</input>?<br>
!                 <input type='submit' value='Train on this message'>
!                 </form>"""
!         
!         body = (self.pageSection % ('Status', summary) +
!                 self.pageSection % ('Word query', self.wordQuery) +
!                 self.pageSection % ('Train', train))
          self.push(body)
  
      def onShutdown(self, params):
!         self.push("<p><b>Shutdown.</b> Goodbye.</p>")
!         self.push(' ')  # Acts as a flush for small buffers.
          self.shutdown(2)
          self.close()
--- 655,675 ----
  
      def onHome(self, params):
!         """Serve up the homepage."""
!         body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Word query', self.wordQuery)+
!                 self.pageSection % ('Train', self.train))
          self.push(body)
  
      def onShutdown(self, params):
!         """Shutdown the server, saving the pickle if requested to do so."""
!         if params['how'].lower().find('save') >= 0:
!             if not status.useDB and status.pickleName:
!                 self.push("<b>Saving...</b>")
!                 self.push(' ')  # Acts as a flush for small buffers.
!                 fp = open(status.pickleName, 'wb')
!                 cPickle.dump(self.bayes, fp, 1)
!                 fp.close()
!         self.push("<b>Shutdown</b>. Goodbye.")
!         self.push(' ')
          self.shutdown(2)
          self.close()
***************
*** 618,625 ****
  
      def onUpload(self, params):
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
          # Append the message to a file, to make it easier to rebuild
!         # the database later.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
--- 677,690 ----
  
      def onUpload(self, params):
+         """Train on an uploaded or pasted message."""
+         # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         
          # Append the message to a file, to make it easier to rebuild
!         # the database later.   This is a temporary implementation -
!         # it should keep a Corpus (from Tim Stone's forthcoming message
!         # management module) to manage a cache of messages.  It needs
!         # to keep them for the HTML retraining interface anyway.
          message = message.replace('\r\n', '\n').replace('\r', '\n')
          if isSpam:
***************
*** 627,642 ****
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From ???@???\n")  # fake From line (XXX good enough?)
          f.write(message)
!         f.write("\n")
          f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("""<p>Trained on your message. Saving database...</p>""")
!         self.push(" ")  # Flush... must find out how to do this properly...
!         if not status.useDB and status.pickleName:
!             fp = open(status.pickleName, 'wb')
!             cPickle.dump(self.bayes, fp, 1)
!             fp.close()
!         self.push("<p>Done.</p><p><a href='/'>Home</a></p>")
  
      def onWordquery(self, params):
--- 692,704 ----
          else:
              f = open("_pop3proxyham.mbox", "a")
!         f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
          f.write(message)
!         f.write("\n\n")
          f.close()
+ 
+         # Train on the message.
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
!         self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
!         self.push(self.pageSection % ('Train another', self.train))
  
      def onWordquery(self, params):
***************
*** 656,660 ****
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s':" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
--- 718,722 ----
              info = "'%s' does not appear in the database." % word
          
!         body = (self.pageSection % ("Statistics for '%s'" % word, info) +
                  self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
***************
*** 765,771 ****
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))
          self.request = ''
  
      def onStat(self, command, args):
          """POP3 STAT command."""
--- 827,839 ----
          else:
              handler = self.handlers.get(command, self.onUnknown)
!             self.push(handler(command, args))   # Or push_slowly for testing
          self.request = ''
  
+     def push_slowly(self, response):
+         """Useful for testing."""
+         for c in response:
+             self.push(c)
+             time.sleep(0.02)
+ 
      def onStat(self, command, args):
          """POP3 STAT command."""
***************
*** 777,781 ****
          """POP3 LIST command, with optional message number argument."""
          if args:
!             number = int(args)
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
          """POP3 LIST command, with optional message number argument."""
          if args:
!             try:
!                 number = int(args)
!             except ValueError:
!                 number = -1
              if 0 < number <= len(self.maildrop):
                  return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         return self._getMessage(int(args), 12345)
  
      def onTop(self, command, args):
          """POP3 TOP command."""
!         number, lines = map(int, args.split())
          return self._getMessage(number, lines)
  
--- 874,889 ----
      def onRetr(self, command, args):
          """POP3 RETR command."""
!         try:
!             number = int(args)
!         except ValueError:
!             number = -1
!         return self._getMessage(number, 12345)
  
      def onTop(self, command, args):
          """POP3 TOP command."""
!         try:
!             number, lines = map(int, args.split())
!         except ValueError:
!             number, lines = -1, -1
          return self._getMessage(number, lines)
  
***************
*** 863,867 ****
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) != -1
  
      # Kill the proxy and the test server.
--- 941,945 ----
          while response.find('\n.\r\n') == -1:
              response = response + proxy.recv(1000)
!         assert response.find(options.hammie_header_name) >= 0
  
      # Kill the proxy and the test server.
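
Editor's note: the LIST, RETR and TOP hunks above all apply the same guard, mapping a non-numeric argument to -1 so that the existing range check rejects it. A minimal standalone sketch of that pattern follows. The names parse_message_number and on_list are hypothetical, and the "-ERR" reply text is an assumption, since the diff does not show what the real handlers return for a bad number.

```python
def parse_message_number(args):
    """Return the message number in args, or -1 if it is not an integer."""
    try:
        return int(args)
    except ValueError:
        # -1 can never satisfy 0 < number <= len(maildrop), so the
        # caller's ordinary range check doubles as the error path.
        return -1


def on_list(maildrop, args):
    """Hypothetical stand-in for a LIST handler with a message number."""
    number = parse_message_number(args)
    if 0 < number <= len(maildrop):
        return "+OK %d\r\n" % len(maildrop[number - 1])
    return "-ERR no such message\r\n"  # assumed reply text


print(repr(on_list(["abc", "hello"], "2")))
print(repr(on_list(["abc", "hello"], "bogus")))
```

Using a sentinel keeps each handler to a single code path, at the cost of hiding the difference between "not a number" and "number out of range".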


From jvr@users.sourceforge.net  Sat Nov  9 18:05:44 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Sat, 09 Nov 2002 10:05:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13
Message-ID: <E18AZzM-0005QJ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20814

Modified Files:
	pop3proxy.py 
Log Message:
force word query to be lowercase, making the UI case insensitive

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** pop3proxy.py	8 Nov 2002 08:00:20 -0000	1.12
--- pop3proxy.py	9 Nov 2002 18:05:42 -0000	1.13
***************
*** 704,707 ****
--- 704,708 ----
      def onWordquery(self, params):
          word = params['word']
+         word = word.lower()
          try:
              # Must be a better way to get __dict__ for a new-style class...


From tim_one@users.sourceforge.net  Mon Nov 11 23:26:21 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 11 Nov 2002 15:26:21 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.64,1.65
Message-ID: <E18BNwj-0002mj-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10237

Modified Files:
	tokenizer.py 
Log Message:
An idea from Anthony Baxter:  decode Subject lines, so that they're
tokenized in decoded form, and so that they generate charset tokens too.
This had minor good effects in both our tests.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.64
retrieving revision 1.65
diff -C2 -d -r1.64 -r1.65
*** tokenizer.py	8 Nov 2002 04:06:24 -0000	1.64
--- tokenizer.py	11 Nov 2002 23:26:18 -0000	1.65
***************
*** 5,8 ****
--- 5,9 ----
  
  import email
+ import email.Header
  import email.Message
  import email.Errors
***************
*** 1054,1062 ****
          # but real benefit to keeping case intact in this specific context.
          x = msg.get('subject', '')
!         for w in subject_word_re.findall(x):
!             for t in tokenize_word(w):
!                 yield 'subject:' + t
!         for w in punctuation_run_re.findall(x):
!             yield 'subject:' + w
  
          # Dang -- I can't use Sender:.  If I do,
--- 1055,1066 ----
          # but real benefit to keeping case intact in this specific context.
          x = msg.get('subject', '')
!         for x, subjcharset in email.Header.decode_header(x):
!             if subjcharset is not None:
!                 yield 'subjectcharset:' + subjcharset
!             for w in subject_word_re.findall(x):
!                 for t in tokenize_word(w):
!                     yield 'subject:' + t
!             for w in punctuation_run_re.findall(x):
!                 yield 'subject:' + w
  
          # Dang -- I can't use Sender:.  If I do,
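
Editor's note: a rough sketch of the idea in this check-in, written for current Python, where the module is spelled email.header rather than email.Header. The word_re pattern is a simplified stand-in for the tokenizer's subject_word_re, and tokenize_word and the punctuation-run tokens are left out, so the tokens here will not match what Spambayes actually produces.

```python
import re
from email.header import decode_header

word_re = re.compile(r"\w+")  # simplified stand-in for subject_word_re


def subject_tokens(subject):
    """Yield charset and word tokens from a possibly RFC 2047-encoded Subject."""
    for part, charset in decode_header(subject):
        if charset is not None:
            yield "subjectcharset:" + charset
        if isinstance(part, bytes):
            # Python 3 returns bytes for encoded (and mixed) headers.
            try:
                part = part.decode(charset or "ascii", "replace")
            except LookupError:
                # Charset name unknown to Python: fall back to latin-1.
                part = part.decode("latin-1")
        for w in word_re.findall(part):
            yield "subject:" + w


for token in subject_tokens("=?iso-8859-1?q?Buy_now?= cheap"):
    print(token)
```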


From anthonybaxter@users.sourceforge.net  Tue Nov 12 00:37:21 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 16:37:21 -0800
Subject: [Spambayes-checkins] website docs.ht,1.3,1.4
Message-ID: <E18BP3R-0001Xf-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv5772

Modified Files:
	docs.ht 
Log Message:
few more definitions


Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** docs.ht	19 Sep 2002 23:39:24 -0000	1.3
--- docs.ht	12 Nov 2002 00:37:19 -0000	1.4
***************
*** 27,32 ****
  <dt>f-n, FN<dd>(abbrev.) false negative
  <dt>f-p, FP<dd>(abbrev.) false positive
! 
  </dl>
- 
  
--- 27,34 ----
  <dt>f-n, FN<dd>(abbrev.) false negative
  <dt>f-p, FP<dd>(abbrev.) false positive
! <dt>corpus<dd>in this context, a body of messages. Usually referring to a
! training database.
! <dt>hapax, hapax legomenon <dd>a word or form occuring only once in a 
! document or corpus. (plural is hapax legomena)
  </dl>
  

From tim.one@comcast.net  Tue Nov 12 00:40:44 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 19:40:44 -0500
Subject: [Spambayes-checkins] website docs.ht,1.3,1.4
In-Reply-To: <E18BP3R-0001Xf-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOELMCJAB.tim.one@comcast.net>

> ! <dt>hapax, hapax legomenon <dd>a word or form occuring only once in a
> ! document or corpus. (plural is hapax legomena)
>   </dl>

Ya, but even I'm not that anal -- I usually say hapaxes.  hapaxora would be
a hoot too <wink>.


From anthony@interlink.com.au  Tue Nov 12 00:43:58 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 11:43:58 +1100
Subject: [Spambayes-checkins] website docs.ht,1.3,1.4 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOELMCJAB.tim.one@comcast.net> 
Message-ID: <200211120043.gAC0hwp09308@localhost.localdomain>


>>> Tim Peters wrote
> > ! <dt>hapax, hapax legomenon <dd>a word or form occuring only once in a
> > ! document or corpus. (plural is hapax legomena)
> >   </dl>
> 
> Ya, but even I'm not that anal -- I usually say hapaxes.  hapaxora would be
> a hoot too <wink>

Hapax legomena sounds like something that the CDC sends the black 
helicopters in to lock down an outbreak of...


From tim_one@users.sourceforge.net  Tue Nov 12 04:52:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 11 Nov 2002 20:52:14 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.29,1.30 manager.py,1.33,1.34
Message-ID: <E18BT26-00075F-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv27097/Outlook2000

Modified Files:
	addin.py manager.py 
Log Message:
In the "show clues" msg, for each word give the raw ham and spam counts
too.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** addin.py	7 Nov 2002 22:30:08 -0000	1.29
--- addin.py	12 Nov 2002 04:52:12 -0000	1.30
***************
*** 225,233 ****
      # Format the clues.
      push("<PRE>\n")
      for word, prob in clues:
          word = repr(word)
!         push(escape(word) + ' ' * (30 - len(word)))
!         push(' %g\n' % prob)
      push("</PRE>\n")
      # Now the raw text of the message, as best we can
      push("<h2>Message Stream:</h2><br>")
--- 225,244 ----
      # Format the clues.
      push("<PRE>\n")
+     push("word                                spamprob         #ham  #spam\n")
+     format = " %-12g %8s %6s\n"
+     c = mgr.GetClassifier()
+     fetchword = c.wordinfo.get
      for word, prob in clues:
+         record = fetchword(word)
+         if record:
+             nham = record.hamcount
+             nspam = record.spamcount
+         else:
+             nham = nspam = "-"
          word = repr(word)
!         push(escape(word) + " " * (35-len(word)))
!         push(format % (prob, nham, nspam))
      push("</PRE>\n")
+ 
      # Now the raw text of the message, as best we can
      push("<h2>Message Stream:</h2><br>")

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** manager.py	7 Nov 2002 22:30:09 -0000	1.33
--- manager.py	12 Nov 2002 04:52:12 -0000	1.34
***************
*** 223,226 ****
--- 223,230 ----
          self.bayes_dirty = False
  
+     def GetClassifier(self):
+         """Return the classifier we're using."""
+         return self.bayes
+ 
      def SaveConfig(self):
          if self.verbose > 1:
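
Editor's note: the addin.py hunk above builds a fixed-width table of clues. The sketch below reproduces that layout outside Outlook. WordRecord and the plain dict used for wordinfo are hypothetical stand-ins for the classifier's real records, and html.escape stands in for whatever escape function addin.py imports.

```python
from collections import namedtuple
from html import escape

# Hypothetical stand-in for the classifier's per-word record.
WordRecord = namedtuple("WordRecord", "hamcount spamcount")

ROW_FORMAT = " %-12g %8s %6s"


def format_clues(clues, wordinfo):
    """Return the clue table as text: word, spamprob, raw ham and spam counts."""
    lines = ["word".ljust(35) + " spamprob         #ham  #spam"]
    for word, prob in clues:
        record = wordinfo.get(word)
        if record:
            nham, nspam = record.hamcount, record.spamcount
        else:
            nham = nspam = "-"  # word is not in the database
        shown = repr(word)
        # Pad using the unescaped length so the columns line up once rendered.
        padded = escape(shown, quote=False) + " " * (35 - len(shown))
        lines.append(padded + ROW_FORMAT % (prob, nham, nspam))
    return "\n".join(lines)


print(format_clues([("free", 0.99), ("zzz", 0.5)],
                   {"free": WordRecord(1, 40)}))
```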


From anthonybaxter@users.sourceforge.net  Tue Nov 12 06:21:41 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 22:21:41 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.65,1.66
	Options.py,1.68,1.69
Message-ID: <E18BUQf-0004L7-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16090

Modified Files:
	tokenizer.py Options.py 
Log Message:
New tokenizer option 'address_headers'. Allows the mining of headers 
other than 'from' for email addresses and names (e.g. to or cc). 

By default, it's just set to 'from' for now.

In addition, address headers (including from) now get decoded and parsed
correctly, rather than by a whitespace split.

This shows a quite nice improvement for me.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.65
retrieving revision 1.66
diff -C2 -d -r1.65 -r1.66
*** tokenizer.py	11 Nov 2002 23:26:18 -0000	1.65
--- tokenizer.py	12 Nov 2002 06:21:38 -0000	1.66
***************
*** 7,10 ****
--- 7,12 ----
  import email.Header
  import email.Message
+ import email.Header
+ import email.Utils
  import email.Errors
  import re
***************
*** 1072,1082 ****
          #               # one (smalls wins & losses across runs, overall
          #               # not significant), so leaving it out
!         for field in ('from',):
!             prefix = field + ':'
!             x = msg.get(field, 'none').lower()
!             for w in x.split():
!                 for t in tokenize_word(w):
!                     yield prefix + t
! 
          # To:
          # Cc:
--- 1074,1096 ----
          #               # one (smalls wins & losses across runs, overall
          #               # not significant), so leaving it out
!         # To:, Cc:      # These can help, if your ham and spam are sourced
!         #               # from the same location. If not, they'll be horrible.
!         for field in options.address_headers:
!             addrlist = msg.get_all(field, [])
!             if not addrlist:
!                 yield field + ":none"
!             for addrs in addrlist:
!                 for rname,ename in email.Utils.getaddresses([addrs]):
!                     if rname:
!                         for rname,rcharset in email.Header.decode_header(rname):
!                             for w in rname.lower().split():
!                                 for t in tokenize_word(w):
!                                     yield field+'realname:'+t
!                             if rcharset is not None:
!                                 yield field+'charset:'+rcharset
!                     if ename:
!                         for w in ename.lower().split('@'):
!                             for t in tokenize_word(w):
!                                 yield field+'email:'+t
          # To:
          # Cc:

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.68
retrieving revision 1.69
diff -C2 -d -r1.68 -r1.69
*** Options.py	11 Nov 2002 01:59:06 -0000	1.68
--- Options.py	12 Nov 2002 06:21:38 -0000	1.69
***************
*** 90,93 ****
--- 90,101 ----
  mine_received_headers: False
  
+ # Mine the following address headers. If you have mixed source corpuses
+ # (as opposed to a mixed sauce walrus, which is delicious!) then you
+ # probably don't want to use 'to' or 'cc')
+ # Address headers will be decoded, and will generate charset tokens as
+ # well as the real address.
+ # others to consider: to, cc, reply-to, errors-to, sender, ...
+ address_headers: from
+ 
  # If legitimate mail contains things that look like text to the tokenizer
  # and turning turning off this option helps (perhaps binary attachments get
***************
*** 340,343 ****
--- 348,352 ----
  all_options = {
      'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
+                   'address_headers': ('get', lambda s: Set(s.split())),
                    'count_all_header_lines': boolean_cracker,
                    'record_header_absence': boolean_cracker,
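
Editor's note: a simplified sketch of the address-header mining described in the log message, for current Python (email.utils rather than email.Utils). It parses each header with getaddresses instead of splitting on whitespace, then emits separate real-name and address tokens. The decode_header pass and tokenize_word are omitted, so this shows only the overall shape, not the tokenizer's exact output.

```python
import email
from email.utils import getaddresses


def address_tokens(msg, field):
    """Yield tokens for every address found in the given header field."""
    values = msg.get_all(field, [])
    if not values:
        yield field + ":none"
    for realname, addr in getaddresses(values):
        for w in realname.lower().split():
            yield field + "realname:" + w
        if addr:
            # Split into local part and domain, as the diff does.
            for part in addr.lower().split("@"):
                yield field + "email:" + part


msg = email.message_from_string(
    'From: "Jane Doe" <Jane@Example.com>\n\nbody\n')
print(list(address_tokens(msg, "from")))
print(list(address_tokens(msg, "cc")))
```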


From anthonybaxter@users.sourceforge.net  Tue Nov 12 07:03:22 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 23:03:22 -0800
Subject: [Spambayes-checkins] spambayes/pspam scoremsg.py,1.2,1.3
	update.py,1.2,1.3
Message-ID: <E18BV50-0006n5-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv26080

Modified Files:
	scoremsg.py update.py 
Log Message:
whitespace normalisation.


Index: scoremsg.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/scoremsg.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** scoremsg.py	7 Nov 2002 22:30:10 -0000	1.2
--- scoremsg.py	12 Nov 2002 07:03:20 -0000	1.3
***************
*** 39,43 ****
  ##    print
  ##    print msg
!         
  if __name__ == "__main__":
      main(sys.stdin)
--- 39,43 ----
  ##    print
  ##    print msg
! 
  if __name__ == "__main__":
      main(sys.stdin)

Index: update.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/update.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** update.py	7 Nov 2002 22:30:10 -0000	1.2
--- update.py	12 Nov 2002 07:03:20 -0000	1.3
***************
*** 39,43 ****
          if not folder_exists(profile.hams, p):
              profile.add_ham(p)
!     
      for spam in options.spam_folders:
          p = os.path.join(options.folder_dir, spam)
--- 39,43 ----
          if not folder_exists(profile.hams, p):
              profile.add_ham(p)
! 
      for spam in options.spam_folders:
          p = os.path.join(options.folder_dir, spam)
***************
*** 49,53 ****
      profile.update()
      get_transaction().commit()
!     
      db.close()
  
--- 49,53 ----
      profile.update()
      get_transaction().commit()
! 
      db.close()
  
***************
*** 58,61 ****
          if k == '-F':
              FORCE_REBUILD = True
!     
      main(FORCE_REBUILD)
--- 58,61 ----
          if k == '-F':
              FORCE_REBUILD = True
! 
      main(FORCE_REBUILD)


From anthonybaxter@users.sourceforge.net  Tue Nov 12 07:03:22 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 23:03:22 -0800
Subject: [Spambayes-checkins] 
 spambayes/pspam/pspam folder.py,1.2,1.3 options.py,1.1,1.2
 profile.py,1.4,1.5
Message-ID: <E18BV50-0006n8-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv26080/pspam

Modified Files:
	folder.py options.py profile.py 
Log Message:
whitespace normalisation.


Index: folder.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/folder.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** folder.py	7 Nov 2002 22:30:11 -0000	1.2
--- folder.py	12 Nov 2002 07:03:20 -0000	1.3
***************
*** 68,72 ****
                  self.messages[msgid] = msg
                  new.insert(msg)
!                 
          removed = difference(self.messages, cur)
          for msgid in removed.keys():
--- 68,72 ----
                  self.messages[msgid] = msg
                  new.insert(msg)
! 
          removed = difference(self.messages, cur)
          for msgid in removed.keys():

Index: options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/options.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** options.py	4 Nov 2002 04:44:20 -0000	1.1
--- options.py	12 Nov 2002 07:03:20 -0000	1.2
***************
*** 1,5 ****
  from Options import options, all_options, \
       boolean_cracker, float_cracker, int_cracker, string_cracker
! from sets import Set     
  
  all_options["Score"] = {'max_ham': float_cracker,
--- 1,5 ----
  from Options import options, all_options, \
       boolean_cracker, float_cracker, int_cracker, string_cracker
! from sets import Set
  
  all_options["Score"] = {'max_ham': float_cracker,

Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** profile.py	11 Nov 2002 01:59:06 -0000	1.4
--- profile.py	12 Nov 2002 07:03:20 -0000	1.5
***************
*** 92,96 ****
          get_transaction().commit()
          log("updated probabilities")
!         
      def _update(self, folders, is_spam):
          changed = False
--- 92,96 ----
          get_transaction().commit()
          log("updated probabilities")
! 
      def _update(self, folders, is_spam):
          changed = False
***************
*** 100,104 ****
              if added:
                  log("added %d" % len(added))
!             if removed:    
                  log("removed %d" % len(removed))
              get_transaction().commit()
--- 100,104 ----
              if added:
                  log("added %d" % len(added))
!             if removed:
                  log("removed %d" % len(removed))
              get_transaction().commit()
***************
*** 117,121 ****
              for msg in removed.keys():
                  self.classifier.unlearn(tokenize(msg), is_spam, False)
!             if removed: 
                  log("unlearned")
              del removed
--- 117,121 ----
              for msg in removed.keys():
                  self.classifier.unlearn(tokenize(msg), is_spam, False)
!             if removed:
                  log("unlearned")
              del removed


From tim_one@users.sourceforge.net  Tue Nov 12 22:56:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 14:56:26 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.30,1.31 msgstore.py,1.24,1.25
Message-ID: <E18BjxK-0005by-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv21157/Outlook2000

Modified Files:
	addin.py msgstore.py 
Log Message:
Removed the strip_mime_headers business.  I'm not sure whether it ever
helped, but at this point it was definitely happening too late to do
any good.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** addin.py	12 Nov 2002 04:52:12 -0000	1.30
--- addin.py	12 Nov 2002 22:56:24 -0000	1.31
***************
*** 244,248 ****
      push("<h2>Message Stream:</h2><br>")
      push("<PRE>\n")
!     msg = msgstore_message.GetEmailPackageObject(strip_mime_headers=False)
      push(escape(msg.as_string(), True))
      push("</PRE>\n")
--- 244,248 ----
      push("<h2>Message Stream:</h2><br>")
      push("<PRE>\n")
!     msg = msgstore_message.GetEmailPackageObject()
      push(escape(msg.as_string(), True))
      push("</PRE>\n")

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** msgstore.py	10 Nov 2002 19:59:59 -0000	1.24
--- msgstore.py	12 Nov 2002 22:56:24 -0000	1.25
***************
*** 49,53 ****
      def __init__(self):
          self.unread = False
!     def GetEmailPackageObject(self, strip_mime_headers=True):
          # Return a "read-only" Python email package object
          # "read-only" in that changes will never be reflected to the real store.
--- 49,53 ----
      def __init__(self):
          self.unread = False
!     def GetEmailPackageObject(self):
          # Return a "read-only" Python email package object
          # "read-only" in that changes will never be reflected to the real store.
***************
*** 420,424 ****
              self.mapi_object = self.msgstore._OpenEntry(self.id)
  
!     def GetEmailPackageObject(self, strip_mime_headers=True):
          import email
          # XXX If this was originally a MIME msg, we're hosed at this point --
--- 420,424 ----
              self.mapi_object = self.msgstore._OpenEntry(self.id)
  
!     def GetEmailPackageObject(self):
          import email
          # XXX If this was originally a MIME msg, we're hosed at this point --
***************
*** 433,451 ****
              print "FAILED to create email.message from: ", `text`
              raise
- 
-         if strip_mime_headers:
-             # If we're going to pass this to a scoring function, the MIME
-             # headers must be stripped, else the email pkg will run off
-             # looking for MIME boundaries that don't exist.  The charset
-             # info from the original MIME armor is also lost, and we don't
-             # want the email pkg to try decoding the msg a second time
-             # (assuming Outlook is in fact already decoding text originally
-             # in base64 and quoted-printable).
-             # We want to retain the MIME headers if we're just displaying
-             # the msg stream.
-             if msg.has_key('content-type'):
-                 del msg['content-type']
-             if msg.has_key('content-transfer-encoding'):
-                 del msg['content-transfer-encoding']
          return msg
  
--- 433,436 ----


From tim_one@users.sourceforge.net  Tue Nov 12 23:12:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:12:14 -0800
Subject: [Spambayes-checkins] spambayes mboxutils.py,1.4,1.5
Message-ID: <E18BkCc-0008BZ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31150

Modified Files:
	mboxutils.py 
Log Message:
New utility function extract_headers(), for very simple-minded header
extraction.


Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** mboxutils.py	6 Nov 2002 01:57:39 -0000	1.4
--- mboxutils.py	12 Nov 2002 23:12:11 -0000	1.5
***************
*** 25,28 ****
--- 25,29 ----
  import mailbox
  import email.Message
+ import re
  
  class DirOfTxtFileMailbox:
***************
*** 119,120 ****
--- 120,164 ----
          msg.set_payload(obj)
      return msg
+ 
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+ 
+ def extract_headers(text):
+     """Very simple-minded header extraction:  prefix of text up to blank line.
+ 
+     A blank line is recognized via two adjacent line-ending sequences, where
+     a line-ending sequence is a newline optionally preceded by a carriage
+     return.
+ 
+     If no blank line is found, all of text is considered to be a potential
+     header section.  If a blank line is found, the text up to (but not
+     including) the blank line is considered to be a potential header section.
+ 
+     The potential header section is returned, unless it doesn't contain a
+     colon, in which case an empty string is returned.
+ 
+     >>> extract_headers("abc")
+     ''
+     >>> extract_headers("abc\\n\\n\\n")  # no colon
+     ''
+     >>> extract_headers("abc: xyz\\n\\n\\n")
+     'abc: xyz\\n'
+     >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+     'abc: xyz\\r\\n'
+     >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+     'a: b\\ngibberish\\n'
+     """
+ 
+     m = header_break_re.search(text)
+     if m:
+         eol = m.start(1)
+         text = text[:eol]
+     if ':' not in text:
+         text = ""
+     return text
+ 
+ def _test():
+     import doctest, mboxutils
+     return doctest.testmod(mboxutils)
+ 
+ if __name__ == "__main__":
+     _test()
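
Editor's note: a small usage sketch for extract_headers(), repeated here so the block runs on its own. split_message is a hypothetical helper showing how the returned prefix can be used to separate the header section from the rest. Because the returned text stops before the blank line, the remainder still begins with that blank line's line ending.

```python
import re

header_break_re = re.compile(r"\r?\n(\r?\n)")


def extract_headers(text):
    """Prefix of text up to the first blank line, or '' if it has no colon."""
    m = header_break_re.search(text)
    if m:
        # Group 1 is the second line ending, i.e. the blank line itself.
        text = text[:m.start(1)]
    if ":" not in text:
        text = ""
    return text


def split_message(text):
    """Hypothetical helper: return (headers, everything after them)."""
    headers = extract_headers(text)
    return headers, text[len(headers):]


print(split_message("a: b\n\nbody"))
print(split_message("no headers here\n\nbody"))
```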


From tim_one@users.sourceforge.net  Tue Nov 12 23:16:06 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:16:06 -0800
Subject: [Spambayes-checkins] spambayes mboxutils.py,1.5,1.6
	tokenizer.py,1.66,1.67
Message-ID: <E18BkGM-0000M8-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1192

Modified Files:
	mboxutils.py tokenizer.py 
Log Message:
get_message():  changed to use the new extract_headers() hack.


Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** mboxutils.py	12 Nov 2002 23:12:11 -0000	1.5
--- mboxutils.py	12 Nov 2002 23:16:04 -0000	1.6
***************
*** 114,120 ****
          # headers are most likely damaged, we can't use the email
          # package to parse them, so just get rid of them first.
!         i = obj.find('\n\n')
!         if i >= 0:
!             obj = obj[i+2:]     # strip headers
          msg = email.Message.Message()
          msg.set_payload(obj)
--- 114,119 ----
          # headers are most likely damaged, we can't use the email
          # package to parse them, so just get rid of them first.
!         headers = extract_headers(obj)
!         obj = obj[len(headers):]
          msg = email.Message.Message()
          msg.set_payload(obj)

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** tokenizer.py	12 Nov 2002 06:21:38 -0000	1.66
--- tokenizer.py	12 Nov 2002 23:16:04 -0000	1.67
***************
*** 17,20 ****
--- 17,21 ----
  from Options import options
  
+ import mboxutils
  from mboxutils import get_message
  

From tim_one@users.sourceforge.net  Tue Nov 12 23:19:35 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:19:35 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.25,1.26
Message-ID: <E18BkJj-0000sK-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv3198/Outlook2000

Modified Files:
	msgstore.py 
Log Message:
GetEmailPackageObject():  Removed comments that no longer made sense, at
least not here.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** msgstore.py	12 Nov 2002 22:56:24 -0000	1.25
--- msgstore.py	12 Nov 2002 23:19:33 -0000	1.26
***************
*** 422,430 ****
      def GetEmailPackageObject(self):
          import email
-         # XXX If this was originally a MIME msg, we're hosed at this point --
-         # the boundary tag in the headers doesn't exist in the body, and
-         # the msg is simply ill-formed.  The miserable hack here simply
-         # squashes the text part (if any) and the HTML part (if any) together,
-         # and strips MIME info from the original headers.
          text = self._GetMessageText()
          try:
--- 422,425 ----


From tim_one@users.sourceforge.net  Tue Nov 12 23:33:48 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:33:48 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.26,1.27
Message-ID: <E18BkXU-0003Co-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv11116/Outlook2000

Modified Files:
	msgstore.py 
Log Message:
_GetMessageText():  Whatever the value of the headers property, stop
paying attention to it after the first blank line, and don't believe it
at all if it doesn't contain a colon.  Cheap trick to worm around the
problems some people have reported with Outlook returning multiple header
sections here (including internal MIME armor with empty bodies).


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** msgstore.py	12 Nov 2002 23:19:33 -0000	1.26
--- msgstore.py	12 Nov 2002 23:33:45 -0000	1.27
***************
*** 1,5 ****
  from __future__ import generators
  
! import sys, os
  
  try:
--- 1,5 ----
  from __future__ import generators
  
! import sys, os, re
  
  try:
***************
*** 10,13 ****
--- 10,53 ----
  
  
+ # XXX
+ # import mboxutils  doesn't work at this point.  The extract_headers function
+ # here is a copy-and-paste.
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+ 
+ def extract_headers(text):
+     """Very simple-minded header extraction:  prefix of text up to blank line.
+ 
+     A blank line is recognized via two adjacent line-ending sequences, where
+     a line-ending sequence is a newline optionally preceded by a carriage
+     return.
+ 
+     If no blank line is found, all of text is considered to be a potential
+     header section.  If a blank line is found, the text up to (but not
+     including) the blank line is considered to be a potential header section.
+ 
+     The potential header section is returned, unless it doesn't contain a
+     colon, in which case an empty string is returned.
+ 
+     >>> extract_headers("abc")
+     ''
+     >>> extract_headers("abc\\n\\n\\n")  # no colon
+     ''
+     >>> extract_headers("abc: xyz\\n\\n\\n")
+     'abc: xyz\\n'
+     >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+     'abc: xyz\\r\\n'
+     >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+     'a: b\\ngibberish\\n'
+     """
+ 
+     m = header_break_re.search(text)
+     if m:
+         eol = m.start(1)
+         text = text[:eol]
+     if ':' not in text:
+         text = ""
+     return text
+ 
+ 
  # Abstract definition - can be moved out when we have more than one sub-class <wink>
  # External interface to this module is almost exclusively via a "folder ID"
***************
*** 384,387 ****
--- 424,434 ----
          html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
          has_attach = data[3][1]
+ 
+         # Some Outlooks deliver a strange notion of headers, including
+         # interior MIME armor.  To prevent later errors, try to get rid
+         # of stuff now that can't possibly be parsed as "real" (SMTP)
+         # headers.
+         headers = extract_headers(headers)
+ 
          # Mail delivered internally via Exchange Server etc may not have
          # headers - fake some up.
***************
*** 392,395 ****
--- 439,443 ----
          elif headers.startswith("Microsoft Mail"):
              headers = "X-MS-Mail-Gibberish: " + headers
+ 
          if not html and not body:
              # Only ever seen this for "multipart/signed" messages, so


From tim_one@users.sourceforge.net  Wed Nov 13 05:29:15 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 21:29:15 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
Message-ID: <E18Bq5T-0001Ke-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv4228

Modified Files:
	train.py 
Log Message:
train_message():  When rescoring was asked for, it had no visible
effect, since the probabilities didn't get updated after training.
So update the probs before rescoring.


Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** train.py	7 Nov 2002 22:30:09 -0000	1.16
--- train.py	13 Nov 2002 05:29:10 -0000	1.17
***************
*** 26,30 ****
      return spam == True
  
! def train_message(msg, is_spam, mgr, rescore = False):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
--- 26,30 ----
      return spam == True
  
! def train_message(msg, is_spam, mgr, rescore=False):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
***************
*** 54,57 ****
--- 54,58 ----
      if rescore:
          import filter
+         mgr.bayes.update_probabilities()  # else rescoring gives the same score
          filter.filter_message(msg, mgr, all_actions = False)
  

From tim_one@users.sourceforge.net  Wed Nov 13 06:25:10 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 22:25:10 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.67,1.68
Message-ID: <E18Bqxa-0000he-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2039a

Modified Files:
	tokenizer.py 
Log Message:
More refinements of address-header tokenization.  In particular, it
now generates "no real name" log-count tokens, which are strong
spam clues in my data.
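
A rough sketch of the "log-count" idea, in modern Python rather than the
Python 2 of the checkin: the number of addresses with no real name is
bucketed by the rounded base-2 logarithm, so 3 and 4 nameless addresses
share a token while 1 and 8 do not.  The function name
no_real_name_token is made up for illustration; only the token format is
taken from the diff.

```python
import math


def no_real_name_token(field, noname_count):
    """Return a bucketed token for addresses lacking a real name.

    `field` is a header name such as "to" or "cc".  Returns None when
    every address had a real name, mirroring the `if noname_count:`
    guard in the checkin.
    """
    if noname_count <= 0:
        return None
    # round(log2(n)) collapses nearby counts into one bucket:
    # 1 -> 0, 2 -> 1, 3..5 -> 2, 6..11 -> 3, and so on.
    bucket = round(math.log2(noname_count))
    return "%s:no real name:2**%d" % (field, bucket)


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 8):
        print(n, no_real_name_token("to", n))
```

Bucketing this way keeps the number of distinct tokens small, so each one
gathers enough training evidence to be useful.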


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** tokenizer.py	12 Nov 2002 23:16:04 -0000	1.67
--- tokenizer.py	13 Nov 2002 06:25:08 -0000	1.68
***************
*** 1081,1097 ****
              if not addrlist:
                  yield field + ":none"
!             for addrs in addrlist:
!                 for rname,ename in email.Utils.getaddresses([addrs]):
!                     if rname:
!                         for rname,rcharset in email.Header.decode_header(rname):
!                             for w in rname.lower().split():
!                                 for t in tokenize_word(w):
!                                     yield field+'realname:'+t
!                             if rcharset is not None:
!                                 yield field+'charset:'+rcharset
!                     if ename:
!                         for w in ename.lower().split('@'):
!                             for t in tokenize_word(w):
!                                 yield field+'email:'+t
          # To:
          # Cc:
--- 1081,1105 ----
              if not addrlist:
                  yield field + ":none"
!                 continue
! 
!             noname_count = 0
!             for name, addr in email.Utils.getaddresses(addrlist):
!                 if name:
!                     for name, charset in email.Header.decode_header(name):
!                         yield "%s:name:%s" % (field, name.lower())
!                         if charset is not None:
!                             yield "%s:charset:%s" % (field, charset)
!                 else:
!                     noname_count += 1
!                 if addr:
!                     for w in addr.lower().split('@'):
!                         yield "%s:addr:%s" % (field, w)
!                 else:
!                     yield field + ":addr:none"
! 
!             if noname_count:
!                 yield "%s:no real name:2**%d" % (field,
!                                                  round(log2(noname_count)))
! 
          # To:
          # Cc:


From mhammond@skippinet.com.au  Wed Nov 13 07:01:59 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 13 Nov 2002 18:01:59 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
In-Reply-To: <E18Bq5T-0001Ke-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEPKHKAA.mhammond@skippinet.com.au>

> Log Message:
> train_message():  When rescoring was asked for, it had no visible
> effect, since the probabilities didn't get updated after training.
> So update the probs before rescoring.

I'm a little confused about these probabilities.

Isn't it true that whenever we do a "train operation", we should also update
the probabilities?  For a batch train, we only want to do it at the end, but
for an individual, incremental train, I would have thought we still want the
probabilities updated, even if we don't rescore the message.  Otherwise
future messages will not use the new probabilities.

I ask because revision 1.14 did exactly this, and we regressed it.  That
revision was:

diff -r1.13 -r1.14
21c21
< def train_message(msg, is_spam, mgr, update_probs = True):
---
> def train_message(msg, is_spam, mgr):
43,45d42
<     if update_probs:
<         mgr.bayes.update_probabilities()
<
56c53
<             if train_message(message, isspam, mgr, False):
---
>             if train_message(message, isspam, mgr):

And it seems to me that a new param, specifically for update_probs, is less
of a hack than tying it to the "rescore" param - we want the new probs used
for the *next* incoming message even if we don't need it for *this* message.

Mark.


From tim_one@users.sourceforge.net  Wed Nov 13 06:59:27 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 22:59:27 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 default_bayes_customize.ini,1.5,1.6
Message-ID: <E18BrUl-0005FQ-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv19210/Outlook2000

Modified Files:
	default_bayes_customize.ini 
Log Message:
Enable more address-header tokenization than the default.  This should
help any personal email classifier.  I recommend a full retrain to
get the most benefit.


Index: default_bayes_customize.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** default_bayes_customize.ini	4 Nov 2002 23:21:43 -0000	1.5
--- default_bayes_customize.ini	13 Nov 2002 06:59:24 -0000	1.6
***************
*** 17,20 ****
--- 17,26 ----
  record_header_absence: True
  
+ # These should help.  All but "from" are disabled by default, because
+ # they're killer-good clues for bad reasons when using mixed-source
+ # data.
+ address_headers: from to cc sender reply-to
+ 
+ 
  [Classifier]
  # Uncomment the next lines if you want to use the former default for


From tim.one@comcast.net  Wed Nov 13 07:18:45 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 02:18:45 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAEPKHKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEGFCKAB.tim.one@comcast.net>

[Mark Hammond]
> I'm a little confused about these probabilities.
>
> Isn't it true that whenever we do a "train operation", we should
> also update the probabilities?

It's a tradeoff.  The bigger the database, the longer update_probabilities()
takes.  If the user is staring at a specific msg, and expects to see its
score change, then the probs *have* to be updated or the score won't change.
So that was a very clear reason to force updating here.  I didn't know why
the probs weren't being updated anyway, so fixed the one thing that was
unarguably buggy.

> For a batch train, we only want to do it at the end, but for an
> individual, incremental train, I would have thought we still want the
> probabilities updated, even if we don't rescore the message.  Otherwise
> future messages will not use the new probabilities.

That's so.  I haven't worried about it, perhaps because I run on Win9x most
of the time so live with frequent reboots (i.e., I retrain from scratch
several times every day anyway, as incremental updates are lost when a
forced reboot occurs; that's not *this* code's fault, although I eventually
hope to get around to writing out the updated database whenever the probs
get updated).

> I ask because revision 1.14 did exactly this, and we regressed it.

That's odd -- the CVS log says mhammond did that <wink>.

> ...
> And it seems to me that a new param, specifically for update_probs, is
> less of a hack than tieing it to the "rescore" param - we want the
> new probs used for the *next* incoming message even if we don't need
> it for *this* message.

It's still a tradeoff, though.  Once a classifier has gotten any amount of
decent training, whether or not a new training msg gets reflected instantly
in the probs should make little difference to results.

If it's possible that update_probabilities() *never* gets called after
training and before shutdown now, then that's clearly a bug.

It's OK by me whatever you'd rather do here, and updating probs after
training, without fail, is certainly the least error-prone strategy.
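
To make the tradeoff concrete, here is a minimal sketch in modern Python.
ToyClassifier is a hypothetical stand-in, not the real Spambayes
classifier; it only mimics the split between cheap counting in learn()
and a separate update_probabilities() pass whose cost grows with the
database.  The update_probs parameter follows the signature quoted from
revision 1.13; train_batch is an invented helper name.

```python
class ToyClassifier:
    """Hypothetical stand-in: counts tokens, caches probabilities."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.counts = {}  # token -> [spam count, ham count]
        self.probs = {}   # token -> cached spam probability

    def learn(self, tokens, is_spam):
        # Cheap: only the raw counts change here.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for token in set(tokens):
            record = self.counts.setdefault(token, [0, 0])
            record[0 if is_spam else 1] += 1

    def update_probabilities(self):
        # Expensive: walks every token in the database.
        for token, (spam, ham) in self.counts.items():
            spam_ratio = spam / max(self.nspam, 1)
            ham_ratio = ham / max(self.nham, 1)
            self.probs[token] = spam_ratio / (spam_ratio + ham_ratio)


def train_message(bayes, tokens, is_spam, update_probs=True):
    """Incremental train: refresh probabilities unless told not to."""
    bayes.learn(tokens, is_spam)
    if update_probs:
        bayes.update_probabilities()


def train_batch(bayes, messages):
    """Batch train: defer the expensive pass until the end."""
    for tokens, is_spam in messages:
        train_message(bayes, tokens, is_spam, update_probs=False)
    bayes.update_probabilities()
```

With update_probs=False the cached probabilities stay stale until
something else calls update_probabilities(), which is exactly the
situation the rescoring fix ran into.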


From richiehindle@users.sourceforge.net  Wed Nov 13 18:13:46 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 13 Nov 2002 10:13:46 -0800
Subject: [Spambayes-checkins] spambayes README.txt,1.41,1.42
Message-ID: <E18C21K-0003p6-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14506

Modified Files:
	README.txt 
Log Message:
Added a note about the web interface implemented by pop3proxy.py.


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** README.txt	7 Nov 2002 22:30:02 -0000	1.41
--- README.txt	13 Nov 2002 18:13:43 -0000	1.42
***************
*** 74,77 ****
--- 74,82 ----
      delivery system.
  
+     Also acts as a web server providing a user interface that allows you
+     to train the classifier, classify messages interactively, and query
+     the token database.  This piece will at some point be split out into
+     a separate module.
+ 
  neiltrain.py
      Builds a CDB (constant database) file of word probabilities using


From richiehindle@users.sourceforge.net  Wed Nov 13 18:14:34 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 13 Nov 2002 10:14:34 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.69,1.70
Message-ID: <E18C226-00042H-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15336

Modified Files:
	Options.py 
Log Message:
Added options for pop3proxy.py, so you don't need a huge command line.
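
For readers who want to see what such a configuration looks like when
parsed, here is a small sketch using the standard configparser module of
modern Python.  It does not use Options.py or its "cracker" functions;
load_proxy_settings and the sample values (server name, port 1110) are
invented for illustration.  Only the section and option names come from
this checkin.

```python
import configparser

# Example values only; the option names match the new sections.
SAMPLE_INI = """\
[pop3proxy]
pop3proxy_server_name: pop3.my-isp.com
pop3proxy_server_port: 110
pop3proxy_port: 1110
pop3proxy_cache_use_gzip: True
pop3proxy_cache_expiry_days: 7

[html_ui]
html_ui_port: 8880
html_ui_launch_browser: False
"""


def load_proxy_settings(ini_text):
    """Parse the two sections into a plain dict of typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    proxy = parser["pop3proxy"]
    ui = parser["html_ui"]
    return {
        "server_name": proxy.get("pop3proxy_server_name"),
        "server_port": proxy.getint("pop3proxy_server_port"),
        "proxy_port": proxy.getint("pop3proxy_port"),
        "cache_use_gzip": proxy.getboolean("pop3proxy_cache_use_gzip"),
        "cache_expiry_days": proxy.getint("pop3proxy_cache_expiry_days"),
        "ui_port": ui.getint("html_ui_port"),
        "launch_browser": ui.getboolean("html_ui_launch_browser"),
    }


if __name__ == "__main__":
    print(load_proxy_settings(SAMPLE_INI))
```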


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.69
retrieving revision 1.70
diff -C2 -d -r1.69 -r1.70
*** Options.py	12 Nov 2002 06:21:38 -0000	1.69
--- Options.py	13 Nov 2002 18:14:32 -0000	1.70
***************
*** 339,342 ****
--- 339,357 ----
  # database by default.
  persistent_use_database: False
+ 
+ [pop3proxy]
+ # pop3proxy settings - pop3proxy also respects the options in the Hammie
+ # section, with the exception of the extra header details at the moment.
+ # The only mandatory option is pop3proxy_server_name, eg. pop3.my-isp.com,
+ # but that can come from the command line - see "pop3proxy -h".
+ pop3proxy_server_name: ""
+ pop3proxy_server_port: 110
+ pop3proxy_port: 110
+ pop3proxy_cache_use_gzip: True
+ pop3proxy_cache_expiry_days: 7
+ 
+ [html_ui]
+ html_ui_port: 8880
+ html_ui_launch_browser: False
  """
  
***************
*** 408,412 ****
                 'hammie_debug_header_name': string_cracker,
                 },
! 
  }
  
--- 423,435 ----
                 'hammie_debug_header_name': string_cracker,
                 },
!     'pop3proxy': {'pop3proxy_server_name': string_cracker,
!                   'pop3proxy_server_port': int_cracker,
!                   'pop3proxy_port': int_cracker,
!                   'pop3proxy_cache_use_gzip': boolean_cracker,
!                   'pop3proxy_cache_expiry_days': int_cracker,
!                   },
!     'html_ui': {'html_ui_port': int_cracker,
!                 'html_ui_launch_browser': boolean_cracker,
!                 },
  }
  

From richiehindle@users.sourceforge.net  Wed Nov 13 18:19:48 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 13 Nov 2002 10:19:48 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.14,1.15
Message-ID: <E18C27A-0005ON-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20474

Modified Files:
	pop3proxy.py 
Log Message:
 o All command line switches and options now default to values from
   bayescustomize.ini.  Thanks to Francois Granger for the idea.
 o Instead of there being two radio buttons (ham, spam) on the training
   form, there are now two buttons: "Train as Ham" and "Train as Spam".
   Thanks to Just van Rossum for the suggestion.
 o "Classify message" form - paste or upload a message for classification.
   Gives you the spam probability and the clues.
 o It now gives a decent error if the POP3 server is unreachable.
 o The "Bad file descriptor" / last-response-is-logged-three-times bug
   is (hopefully) fixed.
 o The bug whereby socket errors could cause the "Active POP3
   conversations" count to go negative is fixed.
 o After doing a word query, it now prepopulates the query field with
   your word - handy if you mistyped it or you want to try a variant.
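
The prepopulation in the last item relies on a fixed attribute order in
the form HTML (name='...' followed by value='...').  A minimal sketch of
that technique in modern Python follows; it is modelled on the
setFieldValue method in the diff but is not a copy of it.  The name
set_field_value and the use of re.escape are additions made here.

```python
import re


def set_field_value(form, name, value):
    """Fill in the value='...' attribute that follows name='<name>'.

    Returns the form unchanged if no such field is found.  Single
    quotes in `value` are replaced with a character reference so they
    cannot terminate the attribute early.
    """
    pattern = r"\s+name='%s'\s+value='([^']*)'" % re.escape(name)
    match = re.search(pattern, form)
    if match is None:
        return form
    quoted = value.replace("'", "&#39;")
    return form[:match.start(1)] + quoted + form[match.end(1):]


if __name__ == "__main__":
    form = "<input name='word' value='' type='text' size='30'>"
    print(set_field_value(form, "word", "don't"))
```

Only the single quote is escaped in this sketch; a production version
would presumably also escape characters such as & and <.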


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** pop3proxy.py	10 Nov 2002 19:59:22 -0000	1.14
--- pop3proxy.py	13 Nov 2002 18:19:45 -0000	1.15
***************
*** 7,11 ****
  header.  Usage:
  
!     pop3proxy.py [options] <server> [<server port>]
          <server> is the name of your real POP3 server
          <port>   is the port number of your real POP3 server, which
--- 7,11 ----
  header.  Usage:
  
!     pop3proxy.py [options] [<server> [<server port>]]
          <server> is the name of your real POP3 server
          <port>   is the port number of your real POP3 server, which
***************
*** 13,16 ****
--- 13,20 ----
  
          options:
+             -z      : Runs a self-test and exits.
+             -t      : Runs a test POP3 server on port 8110 (for testing).
+             -h      : Displays this help message.
+ 
              -p FILE : use the named data file
              -d      : the file is a DBM file rather than a pickle
***************
*** 20,28 ****
              -b      : Launch a web browser showing the user interface.
  
!     pop3proxy -t
!         Runs a test POP3 server on port 8110; useful for testing.
! 
!     pop3proxy -h
!         Displays this help message.
  
  For safety, and to help debugging, the whole POP3 conversation is
--- 24,30 ----
              -b      : Launch a web browser showing the user interface.
  
!         All command line arguments and switches take their default
!         values from the [Hammie], [pop3proxy] and [html_ui] sections
!         of bayescustomize.ini.
  
  For safety, and to help debugging, the whole POP3 conversation is
***************
*** 48,72 ****
  
  todo = """
!  o (Re)training interface - one message per line, quick-rendering table.
!  o Slightly-wordy index page; intro paragraph for each page.
   o Once the training stuff is on a separate page, make the paste box
     bigger.
-  o "Links" section (on homepage?) to project homepage, mailing list,
-    etc.
-  o "Home" link (with helmet!) at the end of each page.
-  o "Classify this" - just like Train.
-  o "Send me an email every [...] to remind me to train on new
-    messages."
-  o "Send me a status email every [...] telling how many mails have been
-    classified, etc."
   o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
     webbrowser?
-  o Possibly integrate Tim Stone's SMTP code - make it use async, make
-    the training code update (rather than replace!) the database.
   o Can it cleanly dynamically update its status display while having a
     POP3 converation?  Hammering reload sucks.
   o Add a command to save the database without shutting down, and one to
     reload the database.
!  o Leave the word in the input field after a Word query.
  """
  
--- 50,103 ----
  
  todo = """
! 
! User interface improvements:
! 
   o Once the training stuff is on a separate page, make the paste box
     bigger.
   o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
     webbrowser?
   o Can it cleanly dynamically update its status display while having a
     POP3 converation?  Hammering reload sucks.
   o Add a command to save the database without shutting down, and one to
     reload the database.
!  o Save the Status (num classified, etc.) between sessions.
! 
! 
! New features:
! 
!  o (Re)training interface - one message per line, quick-rendering table.
!  o "Send me an email every [...] to remind me to train on new
!    messages."
!  o "Send me a status email every [...] telling how many mails have been
!    classified, etc."
!  o Possibly integrate Tim Stone's SMTP code - make it use async, make
!    the training code update (rather than replace!) the database.
!  o Option to keep trained messages and view potential FPs and FNs to
!    correct them.
!  o Allow use of the UI without the POP3 proxy.
! 
! 
! Code quality:
! 
!  o Move the UI into its own module.
!  o Eventually, pull the common HTTP code from pop3proxy.py and Entrian
!    Debugger into a library.
! 
! 
! Info:
! 
!  o Slightly-wordy index page; intro paragraph for each page.
!  o In both stats and training results, report nham and nspam - warn if
!    they're very different (for some value of 'very').
!  o "Links" section (on homepage?) to project homepage, mailing list,
!    etc.
! 
! 
! Gimmicks:
! 
!  o Classify a web page given a URL.
!  o Graphs.  Of something.  Who cares what?
!  o Zoe...!
! 
  """
  
***************
*** 147,151 ****
          self.set_terminator('\r\n')
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
!         self.connect((serverName, serverPort))
  
      def collect_incoming_data(self, data):
--- 178,188 ----
          self.set_terminator('\r\n')
          self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
!         try:
!             self.connect((serverName, serverPort))
!         except socket.error, e:
!             print >>sys.stderr, "Can't connect to %s:%d: %s" % \
!                                 (serverName, serverPort, e)
!             self.close()
!             self.lineCallback('')   # "The socket's been closed."
  
      def collect_incoming_data(self, data):
***************
*** 199,203 ****
          self.response = self.response + line
  
!         # Is this line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
  
--- 236,240 ----
          self.response = self.response + line
  
!         # Is this the line that terminates a set of headers?
          self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
  
***************
*** 237,241 ****
          else:
              # Assume that an unknown command will get a single-line
!             # response.  This should work for errors and for POP-AUTH.
              return False
  
--- 274,281 ----
          else:
              # Assume that an unknown command will get a single-line
!             # response.  This should work for errors and for POP-AUTH,
!             # and is harmless even for multiline responses - the first
!             # line will be passed to onTransaction and ignored, then the
!             # rest will be proxied straight through.
              return False
  
***************
*** 246,257 ****
      def found_terminator(self):
          """Asynchat override."""
!         if self.request.strip().upper() == 'KILL':
!             self.serverSocket.sendall('QUIT\r\n')
!             self.send("+OK, dying.\r\n")
!             self.serverSocket.shutdown(2)
!             self.serverSocket.close()
              self.shutdown(2)
              self.close()
              raise SystemExit
  
          self.serverSocket.push(self.request + '\r\n')
--- 286,298 ----
      def found_terminator(self):
          """Asynchat override."""
!         verb = self.request.strip().upper()
!         if verb == 'KILL':
              self.shutdown(2)
              self.close()
              raise SystemExit
+         elif verb == 'CRASH':
+             # For testing
+             x = 0
+             y = 1/x
  
          self.serverSocket.push(self.request + '\r\n')
***************
*** 271,276 ****
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         cooked = self.onTransaction(self.command, self.args, self.response)
!         self.push(cooked)
  
          # If onServerLine() decided that the server has closed its
--- 312,318 ----
          # Pass the request and the raw response to the subclass and
          # send back the cooked response.
!         if self.response:
!             cooked = self.onTransaction(self.command, self.args, self.response)
!             self.push(cooked)
  
          # If onServerLine() decided that the server has closed its
***************
*** 334,337 ****
--- 376,380 ----
          status.totalSessions += 1
          status.activeSessions += 1
+         self.isClosed = False
  
      def send(self, data):
***************
*** 339,343 ****
          self.logFile.write(data)
          self.logFile.flush()
!         return POP3ProxyBase.send(self, data)
  
      def recv(self, size):
--- 382,392 ----
          self.logFile.write(data)
          self.logFile.flush()
!         try:
!             return POP3ProxyBase.send(self, data)
!         except socket.error:
!             # The email client has closed the connection - 40tude Dialog
!             # does this immediately after issuing a QUIT command,
!             # without waiting for the response.
!             self.close()
  
      def recv(self, size):
***************
*** 349,354 ****
  
      def close(self):
!         status.activeSessions -= 1
!         POP3ProxyBase.close(self)
  
      def onTransaction(self, command, args, response):
--- 398,406 ----
  
      def close(self):
!         # This can be called multiple times by async.
!         if not self.isClosed:
!             self.isClosed = True
!             status.activeSessions -= 1
!             POP3ProxyBase.close(self)
  
      def onTransaction(self, command, args, response):
***************
*** 442,448 ****
      UserInterface objects to serve them."""
  
!     def __init__(self, uiPort, bayes):
          uiArgs = (bayes,)
!         Listener.__init__(self, uiPort, UserInterface, uiArgs)
  
  
--- 494,500 ----
      UserInterface objects to serve them."""
  
!     def __init__(self, uiPort, bayes, socketMap=asyncore.socket_map):
          uiArgs = (bayes,)
!         Listener.__init__(self, uiPort, UserInterface, uiArgs, socketMap=socketMap)
  
  
***************
*** 479,485 ****
      """Serves the HTML user interface of the proxy."""
  
      header = """<html><head><title>Spambayes proxy: %s</title>
               <style>
!              body { font: 90%% arial, swiss, helvetica }
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
--- 531,544 ----
      """Serves the HTML user interface of the proxy."""
  
+     # A couple of notes about the HTML here:
+     #  o I've tried to keep content and presentation separate using
+     #    one main stylesheet - no <font> tags, and no inline stylesheets
+     #  o Form fields must specify their name and value attributes like
+     #    this: "... name='n' value='v' ..." even if there is no default
+     #    value.  This is so that setFieldValue can set the value.
+ 
      header = """<html><head><title>Spambayes proxy: %s</title>
               <style>
!              body { font: 90%% arial, swiss, helvetica; margin: 0 }
               table { font: 90%% arial, swiss, helvetica }
               form { margin: 0 }
***************
*** 497,501 ****
               </head>\n"""
  
!     bodyStart = """<body style='margin: 0'>
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
--- 556,560 ----
               </head>\n"""
  
!     bodyStart = """<body>
                  <div class='banner'>
                  <img src='/helmet.gif' align='absmiddle'>
***************
*** 504,514 ****
  
      footer = """</div>
!              <form action='/shutdown'>
               <table width='100%%' cellspacing='0'>
!              <tr><td class='banner'>&nbsp;Spambayes Proxy, %s.
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
               %s
!              </td></tr></table></form>\n"""
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
--- 563,575 ----
  
      footer = """</div>
!              <form action='/shutdown' method='POST'>
               <table width='100%%' cellspacing='0'>
!              <tr><td class='banner'>&nbsp;<a href='/'>Spambayes Proxy</a>,
!              %s.
               <a href='http://www.spambayes.org/'>Spambayes.org</a></td>
               <td align='right' class='banner'>
               %s
!              </td></tr></table></form>
!              </body></html>\n"""
  
      shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
***************
*** 531,552 ****
  
      wordQuery = """<form action='/wordquery'>
!                 <input name='word' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
  
!     train = """<form action='/upload' method='POST'
                  enctype='multipart/form-data'>
!             Either upload a message file: <input type='file' name='file'><br>
!             Or paste the whole message (incuding headers) here:<br>
!             <textarea name='text' rows='3' cols='60'></textarea><br>
!             Is this message
!             <input type='radio' name='which' value='ham'>Ham</input> or
!             <input type='radio'
!                    name='which' value='spam' checked>Spam</input>?<br>
!             <input type='submit' value='Train on this message'>
!             </form>"""
  
!     def __init__(self, clientSocket, bayes):
!         BrighterAsyncChat.__init__(self, clientSocket)
          self.bayes = bayes
          self.request = ''
--- 592,621 ----
  
      wordQuery = """<form action='/wordquery'>
!                 <input name='word' value='' type='text' size='30'>
                  <input type='submit' value='Tell me about this word'>
                  </form>"""
  
!     upload = """<form action='/%s' method='POST'
                  enctype='multipart/form-data'>
!              Either upload a message file:
!              <input type='file' name='file' value=''><br>
!              Or paste the whole message (incuding headers) here:<br>
!              <textarea name='text' rows='3' cols='60'></textarea><br>
!              %s
!              </form>"""
  
!     uploadSumbit = """<input type='submit' name='which' value='%s'>"""
! 
!     train = upload % ('train',
!                       (uploadSumbit % "Train as Spam") + "&nbsp;" + \
!                       (uploadSumbit % "Train as Ham"))
! 
!     classify = upload % ('classify', uploadSumbit % "Classify")
! 
!     def __init__(self, clientSocket, bayes, socketMap=asyncore.socket_map):
!         # Grumble: asynchat.__init__ doesn't take a 'map' argument,
!         # hence the two-stage construction.
!         BrighterAsyncChat.__init__(self)
!         BrighterAsyncChat.set_socket(self, clientSocket, socketMap)
          self.bayes = bayes
          self.request = ''
***************
*** 654,662 ****
          self.push(self.bodyStart % homeLink)
  
      def onHome(self, params):
          """Serve up the homepage."""
          body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Word query', self.wordQuery)+
!                 self.pageSection % ('Train', self.train))
          self.push(body)
  
--- 723,745 ----
          self.push(self.bodyStart % homeLink)
  
+     def setFieldValue(self, form, name, value):
+         """Sets the default value of a field in a form.  See the comment
+         at the top of this class for how to specify HTML that works with
+         this function.  (This is exactly what Entrian PyMeld is for, but
+         that ships under the Sleepycat License.)"""
+         match = re.search(r"\s+name='%s'\s+value='([^']*)'" % name, form)
+         if match:
+             quotedValue = re.sub("'", "&#%d;" % ord("'"), value)
+             return form[:match.start(1)] + quotedValue + form[match.end(1):]
+         else:
+             print >>sys.stderr, "Warning: setFieldValue('%s') failed" % name
+             return form
+ 
      def onHome(self, params):
          """Serve up the homepage."""
          body = (self.pageSection % ('Status', self.summary % status.__dict__)+
!                 self.pageSection % ('Train', self.train)+
!                 self.pageSection % ('Classify a message', self.classify)+
!                 self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
  
***************
*** 676,684 ****
          raise SystemExit
  
!     def onUpload(self, params):
          """Train on an uploaded or pasted message."""
          # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
!         isSpam = (params['which'] == 'spam')
  
          # Append the message to a file, to make it easier to rebuild
--- 759,767 ----
          raise SystemExit
  
!     def onTrain(self, params):
          """Train on an uploaded or pasted message."""
          # Upload or paste?  Spam or ham?
          message = params.get('file') or params.get('text')
!         isSpam = (params['which'] == 'Train as Spam')
  
          # Append the message to a file, to make it easier to rebuild
***************
*** 698,705 ****
  
          # Train on the message.
!         self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
          self.push(self.pageSection % ('Train another', self.train))
  
      def onWordquery(self, params):
          word = params['word']
--- 781,803 ----
  
          # Train on the message.
!         tokens = tokenizer.tokenize(message)
!         self.bayes.learn(tokens, isSpam, True)
          self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
          self.push(self.pageSection % ('Train another', self.train))
  
+     def onClassify(self, params):
+         """Classify an uploaded or pasted message."""
+         message = params.get('file') or params.get('text')
+         tokens = tokenizer.tokenize(message)
+         prob, clues = self.bayes.spamprob(tokens, evidence=True)
+         self.push("<p>Spam probability: <b>%.8f</b></p>" % prob)
+         self.push("<table class='sectiontable' cellspacing='0'>")
+         self.push("<tr><td class='sectionheading'>Clues:</td></tr>\n")
+         self.push("<tr><td class='sectionbody'><table>")
+         for w, p in clues:
+             self.push("<tr><td>%s</td><td>%.8f</td></tr>\n" % (w, p))
+         self.push("</table></td></tr></table>")
+         self.push("<p>Return <a href='/'>Home</a> or classify another:</p>")
+         self.push(self.pageSection % ('Classify another', self.classify))
      def onWordquery(self, params):
          word = params['word']
***************
*** 717,727 ****
                     Last used: <b>%(atime)s</b>.<br>""" % members
          except KeyError:
!             info = "'%s' does not appear in the database." % word
  
!         body = (self.pageSection % ("Statistics for '%s'" % word, info) +
!                 self.pageSection % ('Word query', self.wordQuery))
          self.push(body)
  
  
  def main(serverName, serverPort, proxyPort,
           uiPort, launchUI, pickleName, useDB):
--- 815,845 ----
                     Last used: <b>%(atime)s</b>.<br>""" % members
          except KeyError:
!             info = "%r does not appear in the database." % word
  
!         query = self.setFieldValue(self.wordQuery, 'word', params['word'])
!         body = (self.pageSection % ("Statistics for %r" % word, info) +
!                 self.pageSection % ('Word query', query))
          self.push(body)
  
  
+ def initStatus():
+     status.proxyPort = options.pop3proxy_port
+     status.serverName = options.pop3proxy_server_name
+     status.serverPort = options.pop3proxy_server_port
+     status.pickleName = options.persistent_storage_file
+     status.useDB = options.persistent_use_database
+     status.uiPort = options.html_ui_port
+     status.launchUI = options.html_ui_launch_browser
+     status.gzipCache = options.pop3proxy_cache_use_gzip
+     status.cacheExpiryDays = options.pop3proxy_cache_expiry_days
+     status.runTestServer = False
+     status.totalSessions = 0
+     status.activeSessions = 0
+     status.numEmails = 0
+     status.numSpams = 0
+     status.numHams = 0
+     status.numUnsure = 0
+ 
+ 
  def main(serverName, serverPort, proxyPort,
           uiPort, launchUI, pickleName, useDB):
***************
*** 891,895 ****
      def onUnknown(self, command, args):
          """Unknown POP3 command."""
!         return "-ERR Unknown command: '%s'\r\n" % command
  
  
--- 1009,1013 ----
      def onUnknown(self, command, args):
          """Unknown POP3 command."""
!         return "-ERR Unknown command: %s\r\n" % repr(command)
  
  
***************
*** 901,904 ****
--- 1019,1023 ----
      # asyncore environments.
      import threading
+     initStatus()
      testServerReady = threading.Event()
      def runTestServer():
***************
*** 912,915 ****
--- 1031,1035 ----
          # Name the database in case it ever gets auto-flushed to disk.
          bayes = hammie.createbayes('_pop3proxy.db')
+         UserInterfaceListener(8881, bayes)
          BayesProxyListener('localhost', 8110, 8111, bayes)
          bayes.learn(tokenizer.tokenize(spam1), True)
***************
*** 944,952 ****
          assert response.find(options.hammie_header_name) >= 0
  
      # Kill the proxy and the test server.
      proxy.sendall("kill\r\n")
!     server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!     server.connect(('localhost', 8110))
!     server.sendall("kill\r\n")
  
  
--- 1064,1085 ----
          assert response.find(options.hammie_header_name) >= 0
  
+     # Smoke-test the HTML UI.
+     httpServer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+     httpServer.connect(('localhost', 8881))
+     httpServer.sendall("get / HTTP/1.0\r\n\r\n")
+     response = ''
+     while 1:
+         packet = httpServer.recv(1000)
+         if not packet: break
+         response += packet
+     assert re.search(r"(?s)<html>.*Spambayes proxy.*</html>", response)
+ 
      # Kill the proxy and the test server.
      proxy.sendall("kill\r\n")
!     proxy.recv(100)
!     pop3Server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
!     pop3Server.connect(('localhost', 8110))
!     pop3Server.sendall("kill\r\n")
!     pop3Server.recv(100)
  
  
***************
*** 958,979 ****
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdbp:l:u:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
          sys.exit()
  
!     status.pickleName = hammie.DEFAULTDB
!     status.proxyPort = 110
!     status.uiPort = 8880
!     status.serverPort = 110
!     status.useDB = False
!     status.runTestServer = False
!     status.launchUI = False
!     status.totalSessions = 0
!     status.activeSessions = 0
!     status.numEmails = 0
!     status.numSpams = 0
!     status.numHams = 0
!     status.numUnsure = 0
      for opt, arg in opts:
          if opt == '-h':
--- 1091,1101 ----
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdbzp:l:u:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
          sys.exit()
  
!     initStatus()
!     runSelfTest = False
      for opt, arg in opts:
          if opt == '-h':
***************
*** 992,999 ****
          elif opt == '-u':
              status.uiPort = int(arg)
  
      # Do whatever we've been asked to do...
!     if not opts and not args:
!         print "Running a self-test (use 'pop3proxy -h' for help)"
          test()
          print "Self-test passed."   # ...else it would have asserted.
--- 1114,1123 ----
          elif opt == '-u':
              status.uiPort = int(arg)
+         elif opt == '-z':
+             runSelfTest = True
  
      # Do whatever we've been asked to do...
!     if runSelfTest:
!         print "\nRunning self-test...\n"
          test()
          print "Self-test passed."   # ...else it would have asserted.
***************
*** 1004,1014 ****
          asyncore.loop()
  
!     elif 1 <= len(args) <= 2:
!         # Normal usage, with optional server port number.
!         status.serverName = args[0]
!         if len(args) == 2:
              status.serverPort = int(args[1])
!         main(status.serverName, status.serverPort, status.proxyPort,
!              status.uiPort, status.launchUI, status.pickleName, status.useDB)
  
      else:
--- 1128,1147 ----
          asyncore.loop()
  
!     elif 0 <= len(args) <= 2:
!         # Normal usage, with optional server name and port number.
!         if len(args) >= 1:
!             status.serverName = args[0]
!         if len(args) >= 2:
              status.serverPort = int(args[1])
! 
!         if not status.serverName:
!             print >>sys.stderr, \
!                   ("Error: You must give a POP3 server name, either in\n"
!                    "bayescustomize.ini as pop3proxy_server_name or on the\n"
!                    "command line.  pop3server.py -h prints a usage message.")
!         else:
!             main(status.serverName, status.serverPort, status.proxyPort,
!                  status.uiPort, status.launchUI, status.pickleName,
!                  status.useDB)
  
      else:


From tim_one@users.sourceforge.net  Wed Nov 13 18:30:04 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 13 Nov 2002 10:30:04 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.27,1.28
Message-ID: <E18C2H6-0007Zg-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv28672/Outlook2000

Modified Files:
	msgstore.py 
Log Message:
_GetMessageText():  Use extract_headers() from mboxutils, instead of
our own cut'n'paste duplicate.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** msgstore.py	12 Nov 2002 23:33:45 -0000	1.27
--- msgstore.py	13 Nov 2002 18:30:01 -0000	1.28
***************
*** 10,53 ****
  
  
- # XXX
- # import mboxutils  doesn't work at this point.  The extract_headers function
- # here is a copy-and-paste.
- header_break_re = re.compile(r"\r?\n(\r?\n)")
- 
- def extract_headers(text):
-     """Very simple-minded header extraction:  prefix of text up to blank line.
- 
-     A blank line is recognized via two adjacent line-ending sequences, where
-     a line-ending sequence is a newline optionally preceded by a carriage
-     return.
- 
-     If no blank line is found, all of text is considered to be a potential
-     header section.  If a blank line is found, the text up to (but not
-     including) the blank line is considered to be a potential header section.
- 
-     The potential header section is returned, unless it doesn't contain a
-     colon, in which case an empty string is returned.
- 
-     >>> extract_headers("abc")
-     ''
-     >>> extract_headers("abc\\n\\n\\n")  # no colon
-     ''
-     >>> extract_headers("abc: xyz\\n\\n\\n")
-     'abc: xyz\\n'
-     >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
-     'abc: xyz\\r\\n'
-     >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
-     'a: b\\ngibberish\\n'
-     """
- 
-     m = header_break_re.search(text)
-     if m:
-         eol = m.start(1)
-         text = text[:eol]
-     if ':' not in text:
-         text = ""
-     return text
- 
- 
  # Abstract definition - can be moved out when we have more than one sub-class <wink>
  # External interface to this module is almost exclusively via a "folder ID"
--- 10,13 ----
***************
*** 414,417 ****
--- 374,379 ----
          # in an attachment.  Later.
          # Oh - and for multipart/signed messages <frown>
+         import mboxutils
+ 
          self._EnsureObject()
          prop_ids = (PR_TRANSPORT_MESSAGE_HEADERS_A,
***************
*** 429,433 ****
          # of stuff now that can't possibly be parsed as "real" (SMTP)
          # headers.
!         headers = extract_headers(headers)
  
          # Mail delivered internally via Exchange Server etc may not have
--- 391,395 ----
          # of stuff now that can't possibly be parsed as "real" (SMTP)
          # headers.
!         headers = mboxutils.extract_headers(headers)
  
          # Mail delivered internally via Exchange Server etc may not have


From tim_one@users.sourceforge.net  Wed Nov 13 19:26:30 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 13 Nov 2002 11:26:30 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.17,1.18
Message-ID: <E18C39i-0000ij-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv31686/Outlook2000

Modified Files:
	train.py 
Log Message:
train_message():

Bugfix:  If a msg was incorrectly classified, untraining from the wrong
category worked fine, but training for the new category had no effect.
That's because tokenize() returns an iterator rather than a sequence,
and after you've run thru the end of the iterator once (as unlearning
did do), trying to run thru it again simply yields an empty sequence.
So called tokenize() anew whenever needed.  Transforming into a sequence
via list() or tuple() would also have worked, but the case in which
the tokenstream *can* be reused is too rare to worry about.

Optimization:  Don't bother tokenizing, or even materializing a msg
object, if the msg has already been trained with the correct
classification.  Incremental training goes at light speed now.
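
The iterator pitfall described in the bugfix note can be reproduced in a
few lines.  This is only a sketch: the tokenize() below is a hypothetical
stand-in generator, not the real Spambayes tokenizer, and the lists stand
in for the unlearn() and learn() passes over the token stream.

```python
def tokenize(text):
    """Hypothetical stand-in tokenizer: yields tokens lazily."""
    for word in text.split():
        yield word.lower()

stream = "Buy cheap pills NOW"

# Buggy pattern: one token stream shared by two consumers.
tokens = tokenize(stream)
unlearned = list(tokens)   # first pass runs the generator to its end
learned = list(tokens)     # second pass finds it already exhausted

# Fixed pattern: tokenize anew for each consumer.
unlearned_ok = list(tokenize(stream))
learned_ok = list(tokenize(stream))
```

In the buggy pattern the second consumer silently receives nothing, which
matches the symptom above: untraining worked, retraining had no effect.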


Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** train.py	13 Nov 2002 05:29:10 -0000	1.17
--- train.py	13 Nov 2002 19:26:27 -0000	1.18
***************
*** 34,54 ****
      # be written to the message (so the user can see some effects)
      from tokenizer import tokenize
!     stream = msg.GetEmailPackageObject()
!     tokens = tokenize(stream)
!     # Handle we may have already been trained.
      was_spam = mgr.message_db.get(msg.searchkey)
!     if was_spam is None:
!         # never previously trained.
!         pass
!     elif was_spam == is_spam:
!         # Already in DB - do nothing (full retrain will wipe msg db)
!         # leave now.
!         return False
!     else:
!         mgr.bayes.unlearn(tokens, was_spam, False)
!     # OK - setup the new data.
!     mgr.bayes.learn(tokens, is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
      # Simplest way to rescore is to re-filter with all_actions = False
      if rescore:
--- 34,53 ----
      # be written to the message (so the user can see some effects)
      from tokenizer import tokenize
! 
      was_spam = mgr.message_db.get(msg.searchkey)
!     if was_spam == is_spam:
!         return False    # already correctly classified
! 
!     # Brand new (was_spam is None), or incorrectly classified.
!     stream = msg.GetEmailPackageObject()
!     if was_spam is not None:
!         # The classification has changed; unlearn the old classification.
!         mgr.bayes.unlearn(tokenize(stream), was_spam, False)
! 
!     # Learn the correct classification.
!     mgr.bayes.learn(tokenize(stream), is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
+ 
      # Simplest way to rescore is to re-filter with all_actions = False
      if rescore:
***************
*** 59,63 ****
      return True
  
! def train_folder( f, isspam, mgr, progress):
      num = num_added = 0
      for message in f.GetMessageGenerator():
--- 58,62 ----
      return True
  
! def train_folder(f, isspam, mgr, progress):
      num = num_added = 0
      for message in f.GetMessageGenerator():


From tim_one@users.sourceforge.net  Thu Nov 14 01:16:13 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 13 Nov 2002 17:16:13 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.31,1.32 msgstore.py,1.28,1.29
Message-ID: <E18C8c9-000073-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv32322/Outlook2000

Modified Files:
	addin.py msgstore.py 
Log Message:
GetEmailPackageObject():  Put back code to strip Content-Type:  turns out
there was a superb reason to do this after all, just not the one I
thought there was <wink>.
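
The reasoning spelled out in the GetEmailPackageObject() comment can be
illustrated with the standard email package alone.  This is a rough
sketch, assuming a current Python 3 email package rather than the 2002
one: the message text is invented, and text_parts() is a hypothetical
stand-in for whatever a tokenizer does to find text/* parts.

```python
import email

# Headers claim multipart/alternative, but the body carries no MIME
# boundaries at all (invented example text).
raw = (
    "Subject: test\n"
    'Content-Type: multipart/alternative; boundary="xyz"\n'
    "\n"
    "Just the plain text body, with no MIME decorations.\n"
)

def text_parts(msg):
    """Hypothetical helper: the parts a text-only consumer would see."""
    return [part for part in msg.walk()
            if part.get_content_maintype() == "text"]

msg = email.message_from_string(raw)
type_before = msg.get_content_type()
count_before = len(text_parts(msg))   # no text/* part is found

# Stripping Content-Type makes the package fall back to text/plain.
del msg["content-type"]
type_after = msg.get_content_type()
count_after = len(text_parts(msg))    # the body is now a text part
```

With the header left in place the message still reports itself as
multipart/alternative and no text part turns up; once the header is
deleted, the same body is treated as text/plain.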


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.31
retrieving revision 1.32
diff -C2 -d -r1.31 -r1.32
*** addin.py	12 Nov 2002 22:56:24 -0000	1.31
--- addin.py	14 Nov 2002 01:16:11 -0000	1.32
***************
*** 244,248 ****
      push("<h2>Message Stream:</h2><br>")
      push("<PRE>\n")
!     msg = msgstore_message.GetEmailPackageObject()
      push(escape(msg.as_string(), True))
      push("</PRE>\n")
--- 244,248 ----
      push("<h2>Message Stream:</h2><br>")
      push("<PRE>\n")
!     msg = msgstore_message.GetEmailPackageObject(strip_content_type=False)
      push(escape(msg.as_string(), True))
      push("</PRE>\n")

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** msgstore.py	13 Nov 2002 18:30:01 -0000	1.28
--- msgstore.py	14 Nov 2002 01:16:11 -0000	1.29
***************
*** 430,434 ****
              self.mapi_object = self.msgstore._OpenEntry(self.id)
  
!     def GetEmailPackageObject(self):
          import email
          text = self._GetMessageText()
--- 430,451 ----
              self.mapi_object = self.msgstore._OpenEntry(self.id)
  
!     def GetEmailPackageObject(self, strip_content_type=True):
!         # Return an email.Message object.
!         # strip_content_type is a hack, and should be left True unless you're
!         # trying to display all the headers for diagnostic purposes.  If we
!         # figure out something better to do, it should go away entirely.
!         # The problem:  suppose a msg is multipart/alternative, with
!         # text/plain and text/html sections.  The latter MIME decorations
!         # are plain missing in what _GetMessageText() returns.  If we leave
!         # the multipart/alternative in the headers anyway, the email
!         # package's "lax parsing" won't complain about not finding any
!         # sections, but since the type *is* multipart/alternative then
!         # anyway, the tokenizer finds no text/* parts at all to tokenize.
!         # As a result, only the headers get tokenized.  By stripping
!         # Content-Type from the headers (if present), the email pkg
!         # considers the body to be text/plain (the default), and so it
!         # does get tokenized.
!         # Short course:  we either have to synthesize non-insane MIME
!         # structure, or eliminate all evidence of original MIME structure.
          import email
          text = self._GetMessageText()
***************
*** 438,441 ****
--- 455,463 ----
              print "FAILED to create email.message from: ", `text`
              raise
+ 
+         if strip_content_type:
+             if msg.has_key('content-type'):
+                 del msg['content-type']
+ 
          return msg
  

From mhammond@users.sourceforge.net  Thu Nov 14 02:52:52 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 13 Nov 2002 18:52:52 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.32,1.33
Message-ID: <E18CA7g-0004BR-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv15470

Modified Files:
	addin.py 
Log Message:
Add a dumb exception handler for the folder switch event.


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** addin.py	14 Nov 2002 01:16:11 -0000	1.32
--- addin.py	14 Nov 2002 02:52:50 -0000	1.33
***************
*** 277,296 ****
          show_delete_as = True
          show_recover_as = False
!         if outlook_folder is not None:
!             mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
!             look_id = self.manager.config.filter.spam_folder_id
!             if look_id:
!                 look_folder = self.manager.message_store.GetFolder(look_id)
!                 if mapi_folder == look_folder:
!                     # This is the Spam folder - only show "recover"
!                     show_recover_as = True
!                     show_delete_as = False
!             # Check if uncertain
!             look_id = self.manager.config.filter.unsure_folder_id
!             if look_id:
!                 look_folder = self.manager.message_store.GetFolder(look_id)
!                 if mapi_folder == look_folder:
!                     show_recover_as = True
!                     show_delete_as = True
          self.but_recover_as.Visible = show_recover_as
          self.but_delete_as.Visible = show_delete_as
--- 277,301 ----
          show_delete_as = True
          show_recover_as = False
!         try:
!             if outlook_folder is not None:
!                 mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
!                 look_id = self.manager.config.filter.spam_folder_id
!                 if look_id:
!                     look_folder = self.manager.message_store.GetFolder(look_id)
!                     if mapi_folder == look_folder:
!                         # This is the Spam folder - only show "recover"
!                         show_recover_as = True
!                         show_delete_as = False
!                 # Check if uncertain
!                 look_id = self.manager.config.filter.unsure_folder_id
!                 if look_id:
!                     look_folder = self.manager.message_store.GetFolder(look_id)
!                     if mapi_folder == look_folder:
!                         show_recover_as = True
!                         show_delete_as = True
!         except:
!             print "Error finding the MAPI folders for a folder switch event"
!             import traceback
!             traceback.print_exc()
          self.but_recover_as.Visible = show_recover_as
          self.but_delete_as.Visible = show_delete_as


From mhammond@users.sourceforge.net  Thu Nov 14 03:59:24 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 13 Nov 2002 19:59:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.29,1.30
Message-ID: <E18CBA4-0000hi-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv2659

Modified Files:
	msgstore.py 
Log Message:
Handle multipart/signed messages.
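
The last step of this change (parse the attachment as a complete MIME
message, then take the payload of its first part) can be tried without
MAPI.  A minimal sketch, assuming a current Python 3 email package: the
attach_body string below is invented sample data, standing in for what
the PR_ATTACH_DATA_BIN property would return.

```python
import email

# Invented multipart/signed message: signed text first, signature second.
attach_body = (
    'Content-Type: multipart/signed; boundary="sig"\n'
    "\n"
    "--sig\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello, this is the signed text.\n"
    "--sig\n"
    "Content-Type: application/pgp-signature\n"
    "\n"
    "(signature data would go here)\n"
    "--sig--\n"
)

msg = email.message_from_string(attach_body)
assert msg.is_multipart()
sub = msg.get_payload(0)    # first part: the signed content
body = sub.get_payload()    # its text, as a string
```

Note that this only yields a string when the first part is itself not
multipart; a signed multipart/alternative message would need one more
level of unpacking.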


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** msgstore.py	14 Nov 2002 01:16:11 -0000	1.29
--- msgstore.py	14 Nov 2002 03:59:21 -0000	1.30
***************
*** 373,377 ****
          # are for "forwarded" messages, where the forwards are actually
          # in an attachment.  Later.
!         # Oh - and for multipart/signed messages <frown>
          import mboxutils
  
--- 373,378 ----
          # are for "forwarded" messages, where the forwards are actually
          # in an attachment.  Later.
!         # Note we *dont* look in plain text attachments, which we arguably
!         # should.
          import mboxutils
  
***************
*** 405,410 ****
              # Only ever seen this for "multipart/signed" messages, so
              # without any better clues, just handle this.
!             # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
!             pass
  
          return "%s\n%s\n%s" % (headers, html, body)
--- 406,456 ----
              # Only ever seen this for "multipart/signed" messages, so
              # without any better clues, just handle this.
!             # Find all attachments with
!             # PR_ATTACH_MIME_TAG_A=multipart/signed
!             table = self.mapi_object.GetAttachmentTable(0)
!             restriction = (mapi.RES_PROPERTY,   # a property restriction
!                            (mapi.RELOP_EQ,      # check for equality
!                             PR_ATTACH_MIME_TAG_A,   # of the given prop
!                             (PR_ATTACH_MIME_TAG_A, "multipart/signed")))
!             rows = mapi.HrQueryAllRows(table,
!                                        (PR_ATTACH_NUM,), # columns to get
!                                        restriction,    # only these rows
!                                        None,    # any sort order is fine
!                                        0)       # any # of results is fine
!             if len(rows) == 0:
!                 pass # Nothing we can fetch :(
!             else:
!                 if len(rows) > 1:
!                     print "WARNING: Found %d rows with multipart/signed" \
!                           "- using first only" % len(rows)
!                 row = rows[0]
!                 (attach_num_tag, attach_num), = row
!                 assert attach_num_tag != PT_ERROR, \
!                        "Error fetching attach_num prop"
!                 # Open the attachment
!                 attach = self.mapi_object.OpenAttach(attach_num,
!                                                    None,
!                                                    mapi.MAPI_DEFERRED_ERRORS)
!                 prop_ids = (PR_ATTACH_DATA_BIN,)
!                 hr, data = attach.GetProps(prop_ids, 0)
!                 attach_body = self._GetPotentiallyLargeStringProp(
!                     prop_ids[0], data[0])
!                 # What we seem to have here now is a *complete* multi-part
!                 # mime message - that Outlook must have re-constituted on
!                 # the fly immediately after pulling it apart! - not unlike
!                 # exactly what we are doing ourselves right here - putting
!                 # it into a message object, so we can extract the text, so
!                 # we can stick it back into another one.  Ahhhhh.
!                 import email
!                 msg = email.message_from_string(attach_body)
!                 assert msg.is_multipart()
!                 sub = msg.get_payload(0)
!                 body = sub.get_payload()
! 
!         if not html and not body:
!             # MarkH has only ever seen this when it is indeed true!
!             # (generally as the message has an attachment and nothing else)
!             print "Couldn't find any useful body for message '%s'" \
!                   % (self.GetField(PR_SUBJECT_A),)
  
          return "%s\n%s\n%s" % (headers, html, body)


From mhammond@users.sourceforge.net  Thu Nov 14 07:01:07 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 13 Nov 2002 23:01:07 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/sandbox delete_outlook_field.py,1.3,1.4
Message-ID: <E18CDzv-0003WY-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv13358

Modified Files:
	delete_outlook_field.py 
Log Message:
Allow finer control over exactly how we try and delete
(Useful when Outlook dies with an out of memory error, so
you can skip this step!)
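
The option handling added here follows the usual getopt pattern for
mixing short and long options.  The sketch below mirrors the option
strings from the diff; the parse() wrapper, the returned dictionary and
the meaning given to -d are assumptions made for illustration.

```python
import getopt

def parse(argv):
    """Hypothetical wrapper around the getopt call used in the script."""
    opts, args = getopt.getopt(argv, "dsf:",
                               ["no-mapi", "no-outlook", "no-folder"])
    settings = {"delete": False, "show": False, "folders": [],
                "do_mapi": True, "do_outlook": True, "do_folder": True}
    for opt, val in opts:
        if opt == "-d":               # assumed to mean "delete"
            settings["delete"] = True
        elif opt == "-s":
            settings["show"] = True
        elif opt == "-f":
            settings["folders"].append(val)
        elif opt == "--no-mapi":
            settings["do_mapi"] = False
        elif opt == "--no-outlook":
            settings["do_outlook"] = False
        elif opt == "--no-folder":
            settings["do_folder"] = False
    return settings, args

settings, args = parse(["-d", "--no-outlook", "-f", "Inbox", "Spam"])
```

Each step defaults to enabled, so skipping the Outlook pass after an out
of memory error only takes the one extra flag.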


Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** delete_outlook_field.py	2 Nov 2002 03:12:12 -0000	1.3
--- delete_outlook_field.py	14 Nov 2002 07:01:04 -0000	1.4
***************
*** 49,56 ****
      return mapi.HexFromBin(folder_eid)
  
! def DeleteField(folder, name):
      name = name.lower()
      entries = folder.Items
!     num_outlook = num_mapi = 0
      entry = entries.GetFirst()
      while entry is not None:
--- 49,56 ----
      return mapi.HexFromBin(folder_eid)
  
! def DeleteField_Outlook(folder, name):
      name = name.lower()
      entries = folder.Items
!     num_outlook = 0
      entry = entries.GetFirst()
      while entry is not None:
***************
*** 64,67 ****
--- 64,70 ----
                  break
          entry = entries.GetNext()
+     return num_outlook
+ 
+ def DeleteField_MAPI(folder, name):
      # OK - now try and wipe the field using MAPI.
      mapi_msgstore = _FindDefaultMessageStore()
***************
*** 74,78 ****
      table.SetColumns(prop_ids, 0)
      propIds = mapi_folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
!     del_from_folder = False
      if PROP_TYPE(propIds[0])!=PT_ERROR:
          assert propIds[0] == PROP_TAG( PT_UNSPECIFIED, PROP_ID(propIds[0]))
--- 77,81 ----
      table.SetColumns(prop_ids, 0)
      propIds = mapi_folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
!     num_mapi = 0
      if PROP_TYPE(propIds[0])!=PT_ERROR:
          assert propIds[0] == PROP_TAG( PT_UNSPECIFIED, PROP_ID(propIds[0]))
***************
*** 94,97 ****
--- 97,110 ----
                          item.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
                          num_mapi += 1
+     return num_mapi
+ 
+ def DeleteField_Folder(folder, name):
+     mapi_msgstore = _FindDefaultMessageStore()
+     mapi_folder = mapi_msgstore.OpenEntry(mapi.BinFromHex(folder.EntryID),
+                                           None,
+                                           mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
+     propIds = mapi_folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
+     num_mapi = 0
+     if PROP_TYPE(propIds[0])!=PT_ERROR:
          hr, vals = mapi_folder.GetProps(propIds)
          if hr==0: # We actually have it
***************
*** 99,104 ****
              if  hr == 0:
                  mapi_folder.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
!                 del_from_folder = True
!     return num_outlook, num_mapi, del_from_folder
  
  def CountFields(folder):
--- 112,117 ----
              if  hr == 0:
                  mapi_folder.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
!                 return 1
!     return 0
  
  def CountFields(folder):
***************
*** 139,142 ****
--- 152,158 ----
  -s - Show message subject and field value for all messages with field
  If no options given, prints a summary of field names in the folders
+ --no-outlook - Don't delete via the Outlook UserProperties API
+ --no-mapi - Don't delete via the extended MAPI API
+ --no-folder - Don't attempt to delete the field from the folder itself
  
  Folder name must be a hierarchical 'path' name, using '\\'
***************
*** 156,160 ****
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "dsf:")
      except getopt.error, e:
          print e
--- 172,178 ----
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:],
!                                    "dsf:",
!                                    ["no-mapi", "no-outlook", "no-folder"])
      except getopt.error, e:
          print e
***************
*** 163,166 ****
--- 181,185 ----
          sys.exit(1)
      delete = show = False
+     do_mapi = do_outlook = do_folder = True
      folder_names = []
      for opt, opt_val in opts:
***************
*** 171,174 ****
--- 190,200 ----
          elif opt == "-f":
              folder_names.append(opt_val)
+         elif opt == "--no-mapi":
+             do_mapi = False
+         elif opt == "--no-outlook":
+             do_outlook = False
+         elif opt == "--no-folder":
+             do_folder = False
+ 
          else:
              print "Invalid arg"
***************
*** 196,206 ****
              if delete:
                  print "Deleting field", field_name
!                 num_ol, num_mapi, did_folder = DeleteField(folder, field_name)
!                 print "Deleted", num_ol, "field instances from Outlook"
!                 print "Deleted", num_mapi, "field instances via MAPI"
!                 if did_folder:
!                     print "Deleted property from folder"
!                 else:
!                     print "Could not find property to delete in the folder"
  
  ##        item = folder.Items.Add()
--- 222,237 ----
              if delete:
                  print "Deleting field", field_name
!                 if do_outlook:
!                     num = DeleteField_Outlook(folder, field_name)
!                     print "Deleted", num, "field instances from Outlook"
!                 if do_mapi:
!                     num = DeleteField_MAPI(folder, field_name)
!                     print "Deleted", num, "field instances via MAPI"
!                 if do_folder:
!                     num = DeleteField_Folder(folder, field_name)
!                     if num:
!                         print "Deleted property from folder"
!                     else:
!                         print "Could not find property to delete in the folder"
  
  ##        item = folder.Items.Add()


From mhammond@users.sourceforge.net  Thu Nov 14 07:04:48 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 13 Nov 2002 23:04:48 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.33,1.34 msgstore.py,1.30,1.31
Message-ID: <E18CE3U-0003lu-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14133

Modified Files:
	addin.py msgstore.py 
Log Message:
Process all missed messages at startup.  "missed" is defined as both unread,
and missing our "Spam" field.  This should be quite fast (unless, of
course, it finds your huge folder is all unread and unscored!)


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** addin.py	14 Nov 2002 02:52:50 -0000	1.33
--- addin.py	14 Nov 2002 07:04:45 -0000	1.34
***************
*** 124,127 ****
--- 124,153 ----
  # Whew - we seem to have all the COM support we need - let's rock!
  
+ # Function to filter a message - note it is a msgstore msg, not an
+ # outlook one
+ def ProcessMessage(msgstore_message, manager):
+     if msgstore_message.GetField(manager.config.field_score_name) is not None:
+         # Already seem this message - user probably moving it back
+         # after incorrect classification.
+         # If enabled, re-train as Ham
+         # otherwise just ignore.
+         if manager.config.training.train_recovered_spam:
+             subject = msgstore_message.GetSubject()
+             import train
+             print "Training on message '%s' - " % subject,
+             if train.train_message(msgstore_message, False, manager, rescore = True):
+                 print "trained as good"
+             else:
+                 print "already was trained as good"
+             assert train.been_trained_as_ham(msgstore_message, manager)
+         return
+     if manager.config.filter.enabled:
+         import filter
+         disposition = filter.filter_message(msgstore_message, manager)
+         print "Message '%s' had a Spam classification of '%s'" \
+               % (msgstore_message.GetSubject(), disposition)
+     else:
+         print "Spam filtering is disabled - ignoring new message"
+ 
  # Button/Menu and other UI event handler classes
  class ButtonEvent:
***************
*** 157,182 ****
          #     PR_TRANSPORT_MESSAGE_HEADERS
          msgstore_message = self.manager.message_store.GetMessage(item)
!         if msgstore_message.GetField(self.manager.config.field_score_name) is not None:
!             # Already seem this message - user probably moving it back
!             # after incorrect classification.
!             # If enabled, re-train as Ham
!             # otherwise just ignore.
!             if self.manager.config.training.train_recovered_spam:
!                 subject = item.Subject.encode("mbcs", "replace")
!                 import train
!                 print "Training on message '%s' - " % subject,
!                 if train.train_message(msgstore_message, False, self.manager, rescore = True):
!                     print "trained as good"
!                 else:
!                     print "already was trained as good"
!                 assert train.been_trained_as_ham(msgstore_message, self.manager)
!             return
!         if self.manager.config.filter.enabled:
!             import filter
!             disposition = filter.filter_message(msgstore_message, self.manager)
!             print "Message '%s' had a Spam classification of '%s'" \
!                   % (item.Subject.encode("ascii", "replace"), disposition)
!         else:
!             print "Spam filtering is disabled - ignoring new message"
  
  # Event fired when item moved into the Spam folder.
--- 183,187 ----
          #     PR_TRANSPORT_MESSAGE_HEADERS
          msgstore_message = self.manager.message_store.GetMessage(item)
!         ProcessMessage(msgstore_message, self.manager)
  
  # Event fired when item moved into the Spam folder.
***************
*** 458,462 ****
              self.explorer_events.OnFolderSwitch()
  
!             # The main tool-bar dropdown with all out entries.
              # Add a pop-up menu to the toolbar
              popup = toolbar.Controls.Add(
--- 463,467 ----
              self.explorer_events.OnFolderSwitch()
  
!             # The main tool-bar dropdown with all our entries.
              # Add a pop-up menu to the toolbar
              popup = toolbar.Controls.Add(
***************
*** 482,485 ****
--- 487,497 ----
  
          self.FiltersChanged()
+         if self.manager.config.filter.enabled:
+             try:
+                 self.ProcessMissedMessages()
+             except:
+                 print "Error processing missed messages!"
+                 import traceback
+                 traceback.print_exc()
  
      def _AddPopup(self, parent, target, target_args, **item_attrs):
***************
*** 491,494 ****
--- 503,524 ----
              setattr(item, attr, val)
          self.buttons.append(item)
+ 
+     def ProcessMissedMessages(self):
+         # This could possibly spawn threads if it was too slow!
+         from time import clock
+         config = self.manager.config.filter
+         manager = self.manager
+         field_name = manager.config.field_score_name
+         for folder in manager.message_store.GetFolderGenerator(
+                                     config.watch_folder_ids,
+                                     config.watch_include_sub):
+             num = 0
+             start = clock()
+             for message in folder.GetNewUnscoredMessageGenerator(field_name):
+                 ProcessMessage(message, manager)
+                 num += 1
+             # See if perf hurts anyone too much.
+             print "Processing %d missed spam in folder '%s' took %gms" \
+                   % (num, folder.name, clock()-start*1000)
  
      def FiltersChanged(self):

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** msgstore.py	14 Nov 2002 03:59:21 -0000	1.30
--- msgstore.py	14 Nov 2002 07:04:45 -0000	1.31
***************
*** 58,61 ****
--- 58,64 ----
          # should get their own methods
          raise NotImplementedError
+     def GetSubject(self):
+         # Get the subject - function as it may require a trip to the store!
+         raise NotImplementedError
      def GetField(self, name):
          # Abstractly get a user field name/id to a field value.
***************
*** 292,295 ****
--- 295,326 ----
                                        item_id, row[1][1], row[2][1])
  
+     def GetNewUnscoredMessageGenerator(self, scoreFieldName):
+         folder = self.msgstore._OpenEntry(self.id)
+         table = folder.GetContentsTable(0)
+         # Resolve the field name
+         resolve_props = ( (mapi.PS_PUBLIC_STRINGS, "Spam"), )
+         resolve_ids = folder.GetIDsFromNames(resolve_props, 0)
+         field_id = PROP_TAG( PT_I4, PROP_ID(resolve_ids[0]))
+         # Setup the properties we want to read.
+         prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
+         table.SetColumns(prop_ids, 0)
+         # Set up the restriction
+         prop_restriction = (mapi.RES_PROPERTY,   # a property restriction
+                                (mapi.RELOP_EQ,      # check for equality
+                                 PR_CONTENT_UNREAD,   # of the unread flag
+                                 (PR_CONTENT_UNREAD, True))
+                             )
+         exist_restriction = mapi.RES_EXIST, (field_id,)
+         not_exist_restriction = mapi.RES_NOT, (exist_restriction,)
+         restriction = (mapi.RES_AND, (prop_restriction, not_exist_restriction))
+         table.Restrict(restriction, 0)
+         while 1:
+             rows = table.QueryRows(70, 0)
+             if len(rows) == 0:
+                 break
+             for row in rows:
+                 item_id = self.id[0], row[0][1] # assume in same store as folder!
+                 yield MAPIMsgStoreMsg(self.msgstore, self,
+                                       item_id, row[1][1], row[2][1])
  
  class MAPIMsgStoreMsg(MsgStoreMsg):
***************
*** 299,302 ****
--- 330,334 ----
          self.mapi_object = None
          self.id = entryid
+         self.subject = None
          # Search key is the only reliable thing after a move/copy operation
          # only problem is that it can potentially be changed - however, the
***************
*** 313,317 ****
          else:
              urs = "unread"
!         return "<%s, (%s) id=%s/%s>" % (self.__class__.__name__,
                                       urs,
                                       mapi.HexFromBin(self.id[0]),
--- 345,350 ----
          else:
              urs = "unread"
!         return "<%s, '%s' (%s) id=%s/%s>" % (self.__class__.__name__,
!                                      self.GetSubject(),
                                       urs,
                                       mapi.HexFromBin(self.id[0]),
***************
*** 329,332 ****
--- 362,370 ----
      def GetID(self):
          return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
+ 
+     def GetSubject(self):
+         if self.subject is None:
+             self.subject = self.GetField(PR_SUBJECT_A,)
+         return self.subject
  
      def GetOutlookItem(self):


From mhammond@users.sourceforge.net  Thu Nov 14 11:07:20 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 14 Nov 2002 03:07:20 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.34,1.35
Message-ID: <E18CHqC-0005iT-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv21897

Modified Files:
	addin.py 
Log Message:
BODMAS, BODMAS, BODMAS :(


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** addin.py	14 Nov 2002 07:04:45 -0000	1.34
--- addin.py	14 Nov 2002 11:07:18 -0000	1.35
***************
*** 520,524 ****
              # See if perf hurts anyone too much.
              print "Processing %d missed spam in folder '%s' took %gms" \
!                   % (num, folder.name, clock()-start*1000)
  
      def FiltersChanged(self):
--- 520,524 ----
              # See if perf hurts anyone too much.
              print "Processing %d missed spam in folder '%s' took %gms" \
!                   % (num, folder.name, (clock()-start)*1000)
  
      def FiltersChanged(self):
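
The one-line fix above is an operator precedence correction: multiplication
binds tighter than subtraction, so the original expression scaled the start
time instead of the elapsed time. A minimal sketch with made-up timestamps,
written for Python 3 (the function names are hypothetical and not part of
addin.py):

```python
def elapsed_ms_buggy(start, now):
    # Parsed as now - (start * 1000): the start time is scaled, not the
    # difference, so the result is not an elapsed time at all.
    return now - start * 1000


def elapsed_ms_fixed(start, now):
    # Parenthesise the subtraction first, then convert seconds to ms.
    return (now - start) * 1000


start, now = 2.0, 2.5  # illustrative values, in seconds
print(elapsed_ms_buggy(start, now))  # -1997.5
print(elapsed_ms_fixed(start, now))  # 500.0
```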


From bwarsaw@users.sourceforge.net  Thu Nov 14 17:08:52 2002
From: bwarsaw@users.sourceforge.net (Barry Warsaw)
Date: Thu, 14 Nov 2002 09:08:52 -0800
Subject: [Spambayes-checkins] 
 spambayes/email Charset.py,1.1.1.1,NONE Encoders.py,1.1.1.1,NONE
 Errors.py,1.1.1.1,NONE Generator.py,1.1.1.1,NONE
 Header.py,1.1.1.1,NONE Iterators.py,1.1.1.1,NONE
 MIMEAudio.py,1.1.1.1,NONE MIMEBase.py,1.1.1.1,NONE
 MIMEImage.py,1.1.1.1,NONE MIMEMessage.py,1.1.1.1,NONE
 MIMEMultipart.py,1.1.1.1,NONE MIMENonMultipart.py,1.1.1.1,NONE
 MIMEText.py,1.1.1.1,NONE Message.py,1.2,NONE Parser.py,1.1.1.1,NONE
 Utils.py,1.1.1.1,NONE __init__.py,1.2,NONE
 _compat21.py,1.1.1.1,NONE _compat22.py,1.1.1.1,NONE
 base64MIME.py,1.1.1.1,NONE quopriMIME.py,1.1.1.1,NONE
Message-ID: <E18CNU4-0004gv-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/email
In directory usw-pr-cvs1:/tmp/cvs-serv17425

Removed Files:
	Charset.py Encoders.py Errors.py Generator.py Header.py 
	Iterators.py MIMEAudio.py MIMEBase.py MIMEImage.py 
	MIMEMessage.py MIMEMultipart.py MIMENonMultipart.py 
	MIMEText.py Message.py Parser.py Utils.py __init__.py 
	_compat21.py _compat22.py base64MIME.py quopriMIME.py 
Log Message:
"Deleting" the email package from here.  Use "cvs up -P" to prune out
the email directory.  Get the email package either from Python 2.2.2,
Python 2.3cvs (which will always have the latest version), or from
mimelib.sf.net for older Pythons.


--- Charset.py DELETED ---

--- Encoders.py DELETED ---

--- Errors.py DELETED ---

--- Generator.py DELETED ---

--- Header.py DELETED ---

--- Iterators.py DELETED ---

--- MIMEAudio.py DELETED ---

--- MIMEBase.py DELETED ---

--- MIMEImage.py DELETED ---

--- MIMEMessage.py DELETED ---

--- MIMEMultipart.py DELETED ---

--- MIMENonMultipart.py DELETED ---

--- MIMEText.py DELETED ---

--- Message.py DELETED ---

--- Parser.py DELETED ---

--- Utils.py DELETED ---

--- __init__.py DELETED ---

--- _compat21.py DELETED ---

--- _compat22.py DELETED ---

--- base64MIME.py DELETED ---

--- quopriMIME.py DELETED ---


From bwarsaw@users.sourceforge.net  Thu Nov 14 17:09:11 2002
From: bwarsaw@users.sourceforge.net (Barry Warsaw)
Date: Thu, 14 Nov 2002 09:09:11 -0800
Subject: [Spambayes-checkins] spambayes/email .cvsignore,1.1,NONE
Message-ID: <E18CNUN-0004jr-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/email
In directory usw-pr-cvs1:/tmp/cvs-serv18146

Removed Files:
	.cvsignore 
Log Message:
Oops, one more to delete...


--- .cvsignore DELETED ---


From bwarsaw@users.sourceforge.net  Thu Nov 14 19:56:46 2002
From: bwarsaw@users.sourceforge.net (Barry Warsaw)
Date: Thu, 14 Nov 2002 11:56:46 -0800
Subject: [Spambayes-checkins] website developer.ht,1.5,1.6
Message-ID: <E18CQ6Y-0000Nb-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv1437

Modified Files:
	developer.ht 
Log Message:
Updated some information on email package compatibility.


Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** developer.ht	7 Nov 2002 22:32:15 -0000	1.5
--- developer.ht	14 Nov 2002 19:56:44 -0000	1.6
***************
*** 12,16 ****
  come crying &lt;wink&gt;.
  </p>
! <p>This project works with either the absolute bleeding edge of python code, available from <a href="https://sourceforge.net/cvs/?group_id=5470">CVS on sourceforge</a>, or with Python 2.2 (not 2.1.x or earlier).
  </p>
  <p>The spambayes code itself is also available <a href="http://sourceforge.net/cvs/?group_id=61702">via CVS</a>
--- 12,26 ----
  come crying &lt;wink&gt;.
  </p>
! <p>This project works with either the absolute bleeding edge of python
! code, available from <a 
! href="http://sourceforge.net/cvs/?group_id=5470">CVS on
! sourceforge</a>, or with Python 2.2 (not 2.1.x or earlier).  Note that
! you really want to be running Python 2.2.2 or Python 2.3cvs to get the
! latest <a href="http://mimelib.sf.net">email package</a>.  If you
! really plan on using an older version of Python, you'll need to
! <a
! href="http://sourceforge.net/project/showfiles.php?group_id=25568">download</a>
! and install the email package (unpack the tarball and read the README
! file for more details).
  </p>
  <p>The spambayes code itself is also available <a href="http://sourceforge.net/cvs/?group_id=61702">via CVS</a>


From montanaro@users.sourceforge.net  Thu Nov 14 22:00:17 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Thu, 14 Nov 2002 14:00:17 -0800
Subject: [Spambayes-checkins] spambayes hammie.py,1.37,1.38
Message-ID: <E18CS25-0007yB-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30618

Modified Files:
	hammie.py 
Log Message:
Only open the database in write mode if we're training.  This allows
multiple users to share the same database, though of course you're still
subject to training quality issues.


Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** hammie.py	7 Nov 2002 22:30:05 -0000	1.37
--- hammie.py	14 Nov 2002 22:00:15 -0000	1.38
***************
*** 105,110 ****
      """
  
!     def __init__(self, dbname, iterskip=()):
!         self.hash = anydbm.open(dbname, 'c')
          self.iterskip = iterskip
  
--- 105,110 ----
      """
  
!     def __init__(self, dbname, mode, iterskip=()):
!         self.hash = anydbm.open(dbname, mode)
          self.iterskip = iterskip
  
***************
*** 185,192 ****
      # should just use ZODB.
  
!     def __init__(self, dbname):
          classifier.Bayes.__init__(self)
          self.statekey = "saved state"
!         self.wordinfo = DBDict(dbname, (self.statekey,))
  
          self.restore_state()
--- 185,193 ----
      # should just use ZODB.
  
!     def __init__(self, dbname, mode):
          classifier.Bayes.__init__(self)
          self.statekey = "saved state"
!         self.wordinfo = DBDict(dbname, mode, (self.statekey,))
!         self.dbmode = mode
  
          self.restore_state()
***************
*** 197,201 ****
  
      def save_state(self):
!         self.wordinfo[self.statekey] = (self.nham, self.nspam)
  
      def restore_state(self):
--- 198,203 ----
  
      def save_state(self):
!         if self.dbmode != 'r':
!             self.wordinfo[self.statekey] = (self.nham, self.nspam)
  
      def restore_state(self):
***************
*** 383,392 ****
      return (spams, hams)
  
! def createbayes(pck=DEFAULTDB, usedb=False):
      """Create a Bayes instance for the given pickle (which
      doesn't have to exist).  Create a PersistentBayes if
      usedb is True."""
      if usedb:
!         bayes = PersistentBayes(pck)
      else:
          bayes = None
--- 385,394 ----
      return (spams, hams)
  
! def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
      """Create a Bayes instance for the given pickle (which
      doesn't have to exist).  Create a PersistentBayes if
      usedb is True."""
      if usedb:
!         bayes = PersistentBayes(pck, mode)
      else:
          bayes = None
***************
*** 427,430 ****
--- 429,433 ----
      do_filter = False
      usedb = USEDB
+     mode = 'r'
      for opt, arg in opts:
          if opt == '-h':
***************
*** 432,437 ****
--- 435,442 ----
          elif opt == '-g':
              good.append(arg)
+             mode = 'c'
          elif opt == '-s':
              spam.append(arg)
+             mode = 'c'
          elif opt == '-p':
              pck = arg
***************
*** 451,455 ****
      save = False
  
!     bayes = createbayes(pck, usedb)
      h = Hammie(bayes)
  
--- 456,460 ----
      save = False
  
!     bayes = createbayes(pck, usedb, mode)
      h = Hammie(bayes)
  

From anthonybaxter@users.sourceforge.net  Fri Nov 15 00:58:15 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 14 Nov 2002 16:58:15 -0800
Subject: [Spambayes-checkins] website related.ht,1.4,1.5
Message-ID: <E18CUoJ-0004Bk-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16076

Modified Files:
	related.ht 
Log Message:
updated mozilla news.


Index: related.ht
===================================================================
RCS file: /cvsroot/spambayes/website/related.ht,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** related.ht	6 Nov 2002 20:07:34 -0000	1.4
--- related.ht	15 Nov 2002 00:58:12 -0000	1.5
***************
*** 8,12 ****
  <ul>
  <li>Gary Arnold's <a href="http://www.garyarnold.com/projects.php#bayespam">bayespam</a>, a perl qmail filter.
! <li>The mozilla project is working on this, see <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=163188">bug 163188</a>
  <li>Eric Raymond's <a href="http://bogofilter.sf.net/">bogofilter</a>, a C code bayesian filter.
  <li><a href="http://www.ai.mit.edu/~jrennie/ifile/">ifile</a>, a Naive Bayes classification system.
--- 8,12 ----
  <ul>
  <li>Gary Arnold's <a href="http://www.garyarnold.com/projects.php#bayespam">bayespam</a>, a perl qmail filter.
! <li>The mozilla project is working on this, see <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=163188">bug 163188</a>, or <a href="http://www.mozilla.org/mailnews/spam.html">this section</a> on the mozilla website. It looks like they're only using the Graham-style filtering, which is a pity.
  <li>Eric Raymond's <a href="http://bogofilter.sf.net/">bogofilter</a>, a C code bayesian filter.
  <li><a href="http://www.ai.mit.edu/~jrennie/ifile/">ifile</a>, a Naive Bayes classification system.


From sjoerd@acm.org  Fri Nov 15 09:45:19 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Fri, 15 Nov 2002 10:45:19 +0100
Subject: [Spambayes-checkins] spambayes/email Charset.py,1.1.1.1,NONE
	Encoders.py,1.1.1.1,NONE Errors.py,1.1.1.1,NONE Generator.py,1.1.1.1,NONE
	Header.py,1.1.1.1,NONE Iterators.py,1.1.1.1,NONE MIMEAudio.py,1.1.1.1,NONE
	MIMEBase.py,1.1.1.1,NONE MIMEImage.py,1.1.1.1,NONE MIMEMessage.py,1.1.1.1,NONE
	MIMEMultipart.py,1.1.1.1,NONE MIMENonMultipart.py,1.1.1.1,NONE
	MIMEText.py,1.1.1.1,NONE Message.py,1.2,NONE Parser.py,1.1.1.1,NONE
	Utils.py,1.1.1.1,NONE __init__.py,1.2,NONE _compat21.py,1.1.1.1,NONE
	_compat22.py,1.1.1.1,NONE base64MIME.py,1.1.1.1,NONE
	quopriMIME.py,1.1.1.1,NONE
In-Reply-To: <E18CNU4-0004gv-00@usw-pr-cvs1.sourceforge.net> 
References: <E18CNU4-0004gv-00@usw-pr-cvs1.sourceforge.net> 
Message-ID: <20021115094524.2DC0F74C3B@indus.ins.cwi.nl>

On Thu, Nov 14 2002 "Barry Warsaw" wrote:

> "Deleting" the email package from here.  Use "cvs up -P" to prune out
> the email directory.

Don't forget to remove the *.pyc (and *.pyo) files manually.
Otherwise "cvs up -P" won't delete the directory and you'll still use
the email stuff that was just deleted.

-- Sjoerd Mullender <sjoerd@acm.org>
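
A minimal sketch of the manual cleanup described above, written for Python 3.
remove_compiled is a hypothetical helper, not part of the spambayes tree; it
deletes every *.pyc and *.pyo file under a directory so that "cvs up -P" can
then prune the directory.

```python
import os


def remove_compiled(directory):
    """Delete *.pyc and *.pyo files under `directory`; return their paths."""
    removed = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(('.pyc', '.pyo')):
                path = os.path.join(root, name)
                os.remove(path)
                removed.append(path)
    return removed
```

For example, remove_compiled('email') run from the top of the checkout would
clear the compiled files left behind in the old email directory.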

From hooft@users.sourceforge.net  Fri Nov 15 21:28:39 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Fri, 15 Nov 2002 13:28:39 -0800
Subject: [Spambayes-checkins] spambayes weakloop.py,1.2,NONE
Message-ID: <E18Co11-00026Z-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8076

Removed Files:
	weakloop.py 
Log Message:
removed bad optimization tool

--- weakloop.py DELETED ---


From hooft@users.sourceforge.net  Fri Nov 15 21:31:32 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Fri, 15 Nov 2002 13:31:32 -0800
Subject: [Spambayes-checkins] spambayes CostCounter.py,NONE,1.1
Message-ID: <E18Co3o-0002LC-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8976

Added Files:
	CostCounter.py 
Log Message:
generic framework to handle cost functions

--- NEW FILE: CostCounter.py ---
from Options import options

class CostCounter:
    name = "Superclass Cost"

    def __init__(self):
        self.total = 0

    def spam(self, scr):
        pass

    def ham(self, scr):
        pass

    def __str__(self):
        return "%s: $%.2f" % (self.name, self.total)

class CompositeCostCounter:
    def __init__(self,cclist):
        self.clients = cclist

    def spam(self, scr):
        for c in self.clients:
            c.spam(scr)

    def ham(self, scr):
        for c in self.clients:
            c.ham(scr)

    def __str__(self):
        s = []
        for c in self.clients:
            s.append(str(c))
        return '\n'.join(s)

class StdCostCounter(CostCounter):
    name = "Standard Cost"
    def spam(self, scr):
        if scr < options.ham_cutoff:
            self.total += options.best_cutoff_fn_weight
        elif scr < options.spam_cutoff:
            self.total += options.best_cutoff_unsure_weight

    def ham(self, scr):
        if scr > options.spam_cutoff:
            self.total += options.best_cutoff_fp_weight
        elif scr > options.ham_cutoff:
            self.total += options.best_cutoff_unsure_weight

class FlexCostCounter(CostCounter):
    name = "Flex Cost"
    def _lambda(self, scr):
        if scr < options.ham_cutoff:
            return 0
        elif scr > options.spam_cutoff:
            return 1
        else:
            return (scr - options.ham_cutoff) / (
                      options.spam_cutoff - options.ham_cutoff)

    def spam(self, scr):
        self.total += self._lambda(scr) * options.best_cutoff_fn_weight

    def ham(self, scr):
        self.total += (1 - self._lambda(scr)) * options.best_cutoff_fp_weight

def default():
     return CompositeCostCounter([
                                  StdCostCounter(),
                                  FlexCostCounter(),
                                 ])


From hooft@users.sourceforge.net  Fri Nov 15 21:32:22 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Fri, 15 Nov 2002 13:32:22 -0800
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.28,1.29
Message-ID: <E18Co4c-0002Ph-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9245

Modified Files:
	TestDriver.py 
Log Message:
use CostCounter

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** TestDriver.py	7 Nov 2002 22:30:04 -0000	1.28
--- TestDriver.py	15 Nov 2002 21:32:19 -0000	1.29
***************
*** 131,134 ****
--- 131,135 ----
      print guts
  
+ 
  class Driver:
  
***************
*** 141,144 ****
--- 142,147 ----
          self.ntimes_finishtest_called = 0
          self.new_classifier()
+         import CostCounter
+         self.cc=CostCounter.default()
  
      def new_classifier(self):
***************
*** 204,207 ****
--- 207,211 ----
                nfn * options.best_cutoff_fn_weight +
                nun * options.best_cutoff_unsure_weight)
+         print self.cc
  
          if options.save_histogram_pickles:
***************
*** 223,226 ****
--- 227,231 ----
                                 hi=options.show_ham_hi):
              local_ham_hist.add(prob * 100.0)
+             self.cc.ham(prob)
              if lo <= prob <= hi:
                  print
***************
*** 232,235 ****
--- 237,241 ----
                                  hi=options.show_spam_hi):
              local_spam_hist.add(prob * 100.0)
+             self.cc.spam(prob)
              if lo <= prob <= hi:
                  print


From hooft@users.sourceforge.net  Fri Nov 15 21:35:17 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Fri, 15 Nov 2002 13:35:17 -0800
Subject: [Spambayes-checkins] spambayes simplexloop.py,NONE,1.1
Message-ID: <E18Co7R-0002fh-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10258

Added Files:
	simplexloop.py 
Log Message:
more generic simplex optimizer; accepts any command line as argument and will optimize the cost it reports in its last line of output by tuning 5 parameters

--- NEW FILE: simplexloop.py ---
#
# Optimize parameters
#
"""Usage: %(program)s  [options] -c command

Where:
    -h
        Show usage and exit.
    -c command
        The command to be run, with all its options. 
        The last line of output from this program should
        match 'YYYYYYY cost: $xxxx.xx'
        (i.e. the third word of the last line should be the value to be
         minimized, preceded by a dollar sign)
        I have used
         "python2.3 timcv.py -n 10 --spam-keep=600 --ham-keep=600 -s 12345"

This program will overwrite bayescustomize.ini!
"""

import sys

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

program = sys.argv[0]

import Options

start = (Options.options.unknown_word_prob,
         Options.options.minimum_prob_strength,
         Options.options.unknown_word_strength,
         Options.options.spam_cutoff,
         Options.options.ham_cutoff)
err = (0.01, 0.01, 0.01, 0.005, 0.01)

def mkini(vars):
    f=open('bayescustomize.ini', 'w')
    f.write("""
[Classifier]
unknown_word_prob = %.6f
minimum_prob_strength = %.6f
unknown_word_strength = %.6f

[TestDriver]
spam_cutoff = %.4f
ham_cutoff = %.4f
"""%tuple(vars))
    f.close()

def score(vars):
    import os
    mkini(vars)
    status = os.system('%s > loop.out'%command)
    if status != 0:
        print >> sys.stderr, "Error status from subcommand"
        sys.exit(status)
    f = open('loop.out', 'r')
    txt = f.readlines()
    # Extract the flex cost field.
    cost = float(txt[-1].split()[2][1:])
    f.close()
    # print ''.join(txt[-4:])[:-1]
    print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,))
    return -cost

def main():
    import optimize
    finish=optimize.SimplexMaximize(start,err,score)
    mkini(finish)
    print "Best result left in bayescustomize.ini"

if __name__ == "__main__":
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hc:')
    except getopt.error, msg:
        usage(1, msg)

    command = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-c':
            command = arg

    if args:
        usage(1, "Positional arguments not supported")
    if command is None:
        usage(1, "-c is required")

    main()
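
The score() function above depends on the last line of the command's output
having the form 'YYYYYYY cost: $xxxx.xx', as the usage text states. A minimal
Python 3 sketch of that extraction step follows; parse_cost is a hypothetical
name, and the sample line is made up to match the "%s: $%.2f" format that
CostCounter prints.

```python
def parse_cost(last_line):
    """Extract the dollar amount from a line like 'Flex Cost: $12.34'."""
    word = last_line.split()[2]  # third whitespace-separated word
    if not word.startswith('$'):
        raise ValueError("expected a '$' before the cost: %r" % last_line)
    return float(word[1:])       # drop the dollar sign


print(parse_cost("Flex Cost: $12.34\n"))  # 12.34
```

The explicit '$' check is an addition for illustration; the original simply
slices off the first character of the third word.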


From hooft@users.sourceforge.net  Sat Nov 16 05:41:31 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Fri, 15 Nov 2002 21:41:31 -0800
Subject: [Spambayes-checkins] spambayes CostCounter.py,1.1,1.2
Message-ID: <E18Cvhz-0000vl-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3570

Modified Files:
	CostCounter.py 
Log Message:
Yikes; a HUGE bug in the FlexCost

Index: CostCounter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CostCounter.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** CostCounter.py	15 Nov 2002 21:31:28 -0000	1.1
--- CostCounter.py	16 Nov 2002 05:41:29 -0000	1.2
***************
*** 60,67 ****
  
      def spam(self, scr):
!         self.total += self._lambda(scr) * options.best_cutoff_fn_weight
  
      def ham(self, scr):
!         self.total += (1 - self._lambda(scr)) * options.best_cutoff_fp_weight
  
  def default():
--- 60,67 ----
  
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr)) * options.best_cutoff_fn_weight
  
      def ham(self, scr):
!         self.total += self._lambda(scr) * options.best_cutoff_fp_weight
  
  def default():
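
A worked sketch of the corrected flex cost, written for Python 3. The cutoff
and weight values below are illustrative assumptions, not values read from
Options, and the function names are hypothetical. The point is the direction
of the fix: a spam message should cost most when it scores low (a false
negative), and a ham message should cost most when it scores high (a false
positive).

```python
HAM_CUTOFF = 0.20   # assumed for illustration
SPAM_CUTOFF = 0.90  # assumed for illustration
FP_WEIGHT = 10.0    # assumed cost of a false positive
FN_WEIGHT = 1.0     # assumed cost of a false negative


def flex_lambda(score):
    """Map a score onto 0..1, linear between the two cutoffs."""
    if score < HAM_CUTOFF:
        return 0.0
    elif score > SPAM_CUTOFF:
        return 1.0
    return (score - HAM_CUTOFF) / (SPAM_CUTOFF - HAM_CUTOFF)


def spam_cost(score):
    # Corrected form: a spam scoring near 0 pays the full FN weight.
    return (1 - flex_lambda(score)) * FN_WEIGHT


def ham_cost(score):
    # Corrected form: a ham scoring near 1 pays the full FP weight.
    return flex_lambda(score) * FP_WEIGHT


print(spam_cost(0.95), spam_cost(0.05))  # 0.0 1.0
print(ham_cost(0.05), ham_cost(0.95))    # 0.0 10.0
```

With the revision 1.1 formulas the two results on each line would be swapped,
so correctly classified messages carried the full cost.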


From hooft@users.sourceforge.net  Sat Nov 16 05:42:37 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Fri, 15 Nov 2002 21:42:37 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.3,1.4
Message-ID: <E18Cvj3-00010p-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3886

Modified Files:
	weaktest.py 
Log Message:
Use the CostCounter

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** weaktest.py	10 Nov 2002 19:59:22 -0000	1.3
--- weaktest.py	16 Nov 2002 05:42:35 -0000	1.4
***************
*** 34,37 ****
--- 34,38 ----
  
  import msgs
+ import CostCounter
  
  program = sys.argv[0]
***************
*** 58,61 ****
--- 59,63 ----
      nham = len(hamfns)
      nspam = len(spamfns)
+     cc = CostCounter.default()
  
      allfns = {}
***************
*** 71,78 ****
      fp = 0
      fn = 0
-     flexcost = 0
-     FPW = options.best_cutoff_fp_weight
-     FNW = options.best_cutoff_fn_weight
-     UNW = options.best_cutoff_unsure_weight
      SPC = options.spam_cutoff
      HC = options.ham_cutoff
--- 73,76 ----
***************
*** 88,95 ****
          if debug:
              print "score:%.3f"%scr,
          if scr < SPC and is_spam:
-             t = FNW * (SPC - scr) / (SPC - HC)
-             #print "Spam at %.3f costs %.2f"%(scr,t)
-             flexcost += t
              if scr < HC:
                  fn += 1
--- 86,94 ----
          if debug:
              print "score:%.3f"%scr,
+         if is_spam:
+             cc.spam(scr)
+         else:
+             cc.ham(scr)
          if scr < SPC and is_spam:
              if scr < HC:
                  fn += 1
***************
*** 104,110 ****
              d.update_probabilities()
          elif scr > HC and not is_spam:
-             t = FPW * (scr - HC) / (SPC - HC)
-             #print "Ham at %.3f costs %.2f"%(scr,t)
-             flexcost += t
              if scr > SPC:
                  fp += 1
--- 103,106 ----
***************
*** 131,136 ****
      print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
      print "fp: %d fn: %d"%(fp, fn)
!     print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
!     print "Flex cost: $%.4f"%flexcost
  
  def main():
--- 127,131 ----
      print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
      print "fp: %d fn: %d"%(fp, fn)
!     print cc
  
  def main():
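
The graded "flex cost" that this change moves out of weaktest.py can be sketched as a standalone function. This is only an illustration of the formula visible in the removed lines, not the CostCounter implementation itself; the name flex_cost and the default cutoffs and weights are assumptions chosen for the example, and the code targets a current Python 3 interpreter rather than the Python 2 of the checkin.

```python
def flex_cost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9,
              fp_weight=10.0, fn_weight=1.0):
    """Graded cost of one scored message (defaults are illustrative only)."""
    width = spam_cutoff - ham_cutoff
    if is_spam and score < spam_cutoff:
        # Spam that scored below the spam cutoff: the cost grows as the
        # score falls further below that cutoff.
        return fn_weight * (spam_cutoff - score) / width
    if not is_spam and score > ham_cutoff:
        # Ham that scored above the ham cutoff: the cost grows as the
        # score rises further above that cutoff.
        return fp_weight * (score - ham_cutoff) / width
    return 0.0


# A ham message halfway between the cutoffs costs about half the
# false-positive weight.
halfway_ham_cost = flex_cost(0.55, is_spam=False)
```

As in the removed code, the cost is not clamped, so a spam message scoring below the ham cutoff costs more than the full false-negative weight.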


From timstone4@users.sourceforge.net  Sat Nov 16 16:18:25 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 08:18:25 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,NONE,1.1
Message-ID: <E18D5eL-00085s-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31078

Added Files:
	Bayes.py 
Log Message:
This module manages Bayes databases, and includes a Trainer class that is
used in conjunction with Corpus to provide message movement based
automatic training and untraining capabilities.
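
The message-movement idea in this log message can be sketched roughly as follows. This is a simplified stand-in written for a current Python 3 interpreter, not the code in the file below: CountingClassifier and ObservableCorpus are hypothetical names, messages are plain strings, and the classifier merely counts what it has been trained on.

```python
class CountingClassifier:
    """Hypothetical stand-in for classifier.Bayes; it only counts messages."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


class Trainer:
    """Observer that trains or untrains as messages enter or leave a corpus."""

    def __init__(self, classifier, is_spam):
        self.classifier = classifier
        self.is_spam = is_spam

    def onAddMessage(self, text):
        self.classifier.learn(text.split(), self.is_spam)

    def onRemoveMessage(self, text):
        self.classifier.unlearn(text.split(), self.is_spam)


class ObservableCorpus:
    """Dictionary of message texts that notifies observers of changes."""

    def __init__(self):
        self.msgs = {}
        self.observers = []

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, key, text):
        self.msgs[key] = text
        for obs in self.observers:
            obs.onAddMessage(text)

    def removeMessage(self, key):
        text = self.msgs.pop(key)
        for obs in self.observers:
            obs.onRemoveMessage(text)

    def takeMessage(self, key, fromcorpus):
        # Moving a message is a remove from one corpus plus an add to another.
        text = fromcorpus.msgs[key]
        fromcorpus.removeMessage(key)
        self.addMessage(key, text)


bayes = CountingClassifier()
spam, ham = ObservableCorpus(), ObservableCorpus()
spam.addObserver(Trainer(bayes, True))
ham.addObserver(Trainer(bayes, False))

spam.addMessage("MSG00002", "meeting notes attached")  # filed as spam by mistake
ham.takeMessage("MSG00002", spam)                      # the move corrects the training
```

Because the move is expressed as a remove followed by an add, the spam trainer untrains the message and the ham trainer retrains it without the caller touching the classifier directly.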

--- NEW FILE: Bayes.py ---
#! /usr/bin/env python

'''Bayes.py - Spambayes database management framework.

Classes:
    PersistentBayes - subclass of Bayes, adds auto persistence
    PickledBayes - PersistentBayes that uses a pickle db
    DBDictBayes - PersistentBayes that uses a (hammie.) DB_Dict db
    Trainer - Bayes training observer
    SpamTrainer - Trainer for spam
    HamTrainer - Trainer for ham

Abstract:
    PersistentBayes is an abstract subclass of Bayes (classifier.Bayes)
    that adds automatic state store/restore function to the Bayes class.
    It also adds a convenience method, which should probably
    more properly be defined in Bayes: classify, which returns
    'spam'|'ham'|'unsure' for a message based on the spamprob against
    the ham_cutoff and spam_cutoff specified in Options.
    
    PickledBayes is a concrete PersistentBayes class that uses a cPickle
    datastore.  This database is relatively small, but slower than other
    databases.

    DBDictBayes is a concrete PersistentBayes class that uses a DBDict
    datastore.  DBDict is currently defined in hammie.py, and wraps
    an anydbm with some very convenient dictionary functionality, such as
    the ability to skip particular keys or key patterns during iteration.

    Trainer is a concrete class that observes a Corpus and trains a
    Bayes object based upon movement of messages between corpora.  When
    an add message notification is received, the trainer trains the
    database with the message, as spam or ham as appropriate given the
    type of trainer (spam or ham).  When a remove message notification
    is received, the trainer untrains the database as appropriate.

    SpamTrainer and HamTrainer are convenience subclasses of Trainer that
    initialize as the appropriate type of Trainer.

To Do:
    o ZODBBayes
    o Would Trainer.trainall really want to train with the whole corpus,
      or just a random subset?
    o Corpus.Verbose is a bit of a strange thing to have.  Verbose should be
      in the global namespace, but how do you get it there?
    o Suggestions?

    '''

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, Neil Gunton, \
all the spambayes contributors."

import Corpus
from classifier import Bayes
from Options import options
from hammie import DBDict     # hammie only for DBDict, which should
                              # probably really be somewhere else
import cPickle as pickle
import errno
import sys           # for output of docstring

PICKLE_TYPE = 1
NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
UPDATEPROBS = True       # Probabilities will be autoupdated with training

class PersistentBayes(Bayes):
    '''Persistent Bayes database object'''

    def __init__(self, db_name):
        '''Constructor(database name)'''

        self.db_name = db_name
        self.load()

    def load(self):
        '''Restore state from a persistent store'''

        raise NotImplementedError

    def store(self):
        '''Persist state into a persistent store'''

        raise NotImplementedError

    def classify(self, message):
        '''Returns the classification of a Message {'spam'|'ham'|'unsure'}'''

        prob = self.spamprob(message.tokenize())

        message.setSpamprob(prob)   # don't like this

        if prob < options.ham_cutoff:
            type = 'ham'
        elif prob > options.spam_cutoff:
            type = 'spam'
        else:
            type = 'unsure'

        return type


class PickledBayes(PersistentBayes):
    '''Bayes object persisted in a pickle'''

    def load(self):
        '''Load this instance from the pickle.'''
        # This is a bit strange, because the loading process
        # creates a temporary instance of PickledBayes, from which
        # this object's state is copied.  This is a nuance of the way
        # that pickle does its job

        if Corpus.Verbose:
            print 'Loading state from',self.db_name,'pickle'

        tempbayes = None
        try:
            fp = open(self.db_name, 'rb')
        except IOError, e:
            if e.errno != errno.ENOENT: raise
        else:
            tempbayes = pickle.load(fp)
            fp.close()

        if tempbayes:
            self.wordinfo = tempbayes.wordinfo
            self.nham = tempbayes.nham
            self.nspam = tempbayes.nspam
            
            if Corpus.Verbose:
                print '%s is an existing pickle, with %d ham and %d spam' \
                      % (self.db_name, self.nham, self.nspam)
        else:
            # new pickle
            if Corpus.Verbose:
                print self.db_name,'is a new pickle'
            self.wordinfo = {}
            self.nham = 0
            self.nspam = 0

    def store(self):
        '''Store self as a pickle'''

        if Corpus.Verbose:
            print 'Persisting',self.db_name,'as a pickle'

        fp = open(self.db_name, 'wb')
        pickle.dump(self, fp, PICKLE_TYPE)
        fp.close()

    def __getstate__(self):
        '''State requested by pickler'''

        return PICKLE_TYPE, self.wordinfo, self.nspam, self.nham

    def __setstate__(self, t):
        '''State provided by pickler'''
        # This can be confusing, because self in this method
        # is not the same instance as self in the load() method

        if t[0] != PICKLE_TYPE:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])

        self.wordinfo, self.nspam, self.nham = t[1:]
        

class DBDictBayes(PersistentBayes):
    '''Bayes object persisted in a hammie.DBDict'''

    def __init__(self, db_name):
        '''Constructor(database name)'''

        self.db_name = db_name
        self.statekey = "saved state"
        self.wordinfo = DBDict(db_name, (self.statekey,))  # r/rw?

        self.load()

    def load(self):
        '''Load state from DBDict'''

        if Corpus.Verbose:
            print 'Loading state from',self.db_name,'DB_Dict'

        if self.wordinfo.has_key(self.statekey):
            self.nham, self.nspam = self.wordinfo[self.statekey]
            
            if Corpus.Verbose:
                print '%s is an existing DBDict, with %d ham and %d spam' \
                      % (self.db_name, self.nham, self.nspam)
        else:
            # new dbdict
            if Corpus.Verbose:
                print self.db_name,'is a new DBDict'
            self.nham = 0
            self.nspam = 0

    def store(self):
        '''Place state into persistent store'''

        if Corpus.Verbose:
            print 'Persisting',self.db_name,'state in DBDict'

        self.wordinfo[self.statekey] = (self.nham, self.nspam)


class Trainer:
    '''Associates a Bayes object and one or more Corpora, \
    is an observer of the corpora'''

    def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
        '''Constructor(Bayes, Corpus.SPAM|Corpus.HAM,
        updateprobs(True|False))'''

        self.bayes = bayes
        self.trainertype = trainertype
        self.updateprobs = updateprobs

    def onAddMessage(self, message):
        '''A message is being added to an observed corpus.'''

        self.train(message)

    def train(self, message):
        '''Train the database with the message'''

        if Corpus.Verbose:
            print 'training with',message.key()

        self.bayes.learn(message.tokenize(), \
                         self.trainertype, \
                         self.updateprobs)

    def onRemoveMessage(self, message):
        '''A message is being removed from an observed corpus.'''

        self.untrain(message)

    def untrain(self, message):
        '''Untrain the database with the message'''

        if Corpus.Verbose:
            print 'untraining with',message.key()

        self.bayes.unlearn(message.tokenize(), \
                           self.trainertype, \
                           self.updateprobs)
        # can raise ValueError if database is fouled.  If this is the case,
        # then retraining is the only recovery option.

    def trainAll(self, corpus):
        '''Train all the messages in the corpus'''

        for msg in corpus:
            self.train(msg)

    def untrainAll(self, corpus):
        '''Untrain all the messages in the corpus'''

        for msg in corpus:
            self.untrain(msg)


class SpamTrainer(Trainer):
    '''Trainer for spam'''

    def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
        '''Constructor'''

        Trainer.__init__(self, bayes, Corpus.SPAM, updateprobs)


class HamTrainer(Trainer):
    '''Trainer for ham'''

    def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
        '''Constructor'''

        Trainer.__init__(self, bayes, Corpus.HAM, updateprobs)


if __name__ == '__main__':
    print >>sys.stderr, __doc__
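
The three-way decision made by PersistentBayes.classify above can be shown in isolation. A minimal sketch, assuming illustrative cutoffs of 0.2 and 0.9 (the real values come from Options and are not shown in this checkin), written for a current Python 3 interpreter:

```python
def classify(prob, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a spam probability to 'ham', 'spam' or 'unsure'.

    The cutoff defaults are assumptions made for this example.
    """
    if prob < ham_cutoff:
        return 'ham'
    if prob > spam_cutoff:
        return 'spam'
    # Everything between the cutoffs, including a score exactly equal to
    # either cutoff, is left undecided.
    return 'unsure'


labels = [classify(p) for p in (0.05, 0.2, 0.5, 0.9, 0.99)]
```

Both comparisons are strict, as in the method above, so a message scoring exactly at a cutoff is reported as unsure.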


From timstone4@users.sourceforge.net  Sat Nov 16 16:27:41 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 08:27:41 -0800
Subject: [Spambayes-checkins] spambayes Corpus.py,NONE,1.1
Message-ID: <E18D5nJ-0000fi-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2563

Added Files:
	Corpus.py 
Log Message:
This module defines abstract classes for the management of message corpora.
A corpus is defined simply as a set of messages.  Corpus objects can be
observed by Bayes.Trainer objects, to provide training when messages are
added to or removed from corpora, or moved from one corpus to another.
Corpora are defined as spam or ham depending on the kind of trainer that
observes them, but a corpus need not be either; an Unsure corpus, for
example, is neither.

This module also defines abstract Message and MessageFactory classes which
are specifically useful for Corpus and Trainer.

--- NEW FILE: Corpus.py ---
#! /usr/bin/env python

'''Corpus.py - Spambayes corpus management framework.

Classes:
    Corpus - a collection of Messages
    ExpiryCorpus - a "young" Corpus
    Message - a subject of Spambayes training
    MessageFactory - creates a Message

Abstract:
    A corpus is defined as a set of messages that share some common
    characteristic relative to spamness.  Examples might be spam, ham,
    unsure, or untrained, or "bayes rating between .4 and .6".  A
    corpus is a collection of messages.  Corpus is a dictionary that
    is keyed by the keys of the messages within it.  It is iterable,
    and observable.  Observers are notified when a message is added
    to or removed from the corpus.

    Corpus is designed to cache message objects.  By default, it will
    only engage in lazy creation of message objects, keeping those
    objects in memory until the corpus instance itself is destroyed.
    In large corpora, this could consume a large amount of memory.  A
    cacheSize operand is implemented on the constructor, which is used
    to limit the *number* of messages currently loaded into memory.
    The instance variable that implements this cache is
    Corpus.Corpus.msgs, a dictionary.  Access to this variable should
    be through keys(), [key], or using an iterator.  Direct access
    should not be used, as subclasses that manage their cache may use
    this variable very differently.

    Iterating Corpus objects is potentially very expensive, as each
    message in the corpus will be brought into memory.  For large
    corpora, this could consume a lot of system resources.

    ExpiryCorpus is designed to keep a corpus of file messages that
    are guaranteed to be younger than a given age.  The age is
    specified on the constructor, as a number of seconds in the past.
    If a message file was created before that point in time, the
    message is deemed to be "old" and thus ignored.  Access to a
    message that is deemed to be old will raise KeyError, which should
    be handled by the corpus user as appropriate.  While iterating,
    KeyError is handled by the iterator, and messages that raise
    KeyError are ignored.

    As messages pass their "expiration date," they are eligible for
    removal from the corpus. To remove them properly,
    removeExpiredMessages() should be called.  As messages are removed,
    observers are notified.

    ExpiryCorpus function is included into a concrete Corpus through
    multiple inheritance. It must be inherited before any inheritance
    that derives from Corpus.  For example:

        class RealCorpus(Corpus):
           ...

        class ExpiryRealCorpus(Corpus.ExpiryCorpus, RealCorpus):
           ...

    Messages have substance, which is the textual content of the
    message. They also have a key, which uniquely defines them within
    the corpus.  This framework makes no assumptions about how or if
    messages persist.

    MessageFactory is a required factory class, because Corpus is
    designed to do lazy initialization of messages and as an abstract
    class, must know how to create concrete instances of the correct
    class.

To Do:
    o Suggestions?

    '''

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

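# A __future__ import must precede every other statement in the module;
# only the docstring and comments may come before it.
from __future__ import generators

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."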

import sys           # for output of docstring
import time
import tokenizer
import re

SPAM = True
HAM = False
Verbose = False

class Corpus:
    '''An observable dictionary of Messages'''

    def __init__(self, factory, cacheSize=-1):
        '''Constructor(MessageFactory)'''

        self.msgs = {}            # dict of all messages in corpus
                                  # value is None if msg not currently loaded
        self.keysInMemory = []    # keys of messages currently loaded
                                  # this *could* be derived by iterating msgs
        self.cacheSize = cacheSize  # max number of messages in memory
        self.observers = []       # observers of this corpus
        self.factory = factory    # factory for the correct Message subclass
        self.mfilter = None       # regex to filter messages

    def addObserver(self, observer):
        '''Register an observer, which must implement
        onAddMessage, onRemoveMessage'''

        self.observers.append(observer)

    def addMessage(self, message):
        '''Add a Message to this corpus'''

        if Verbose:
            print 'adding message %s to corpus' % (message.key())

        self.cacheMessage(message)

        for obs in self.observers:
            # there is no reason that a Corpus observer MUST be a Trainer
            # and so it may very well not be interested in AddMessage events
            # even though right now the only observable events are
            # training related
            try:
                obs.onAddMessage(message)
            except AttributeError:   # ignore if not implemented
                pass

    def removeMessage(self, message):
        '''Remove a Message from this corpus'''

        key = message.key()
        if Verbose:
            print 'removing message %s from corpus' % (key)
        self.unCacheMessage(key)
        del self.msgs[key]

        for obs in self.observers:
            # see comments in event loop in addMessage
            try:
                obs.onRemoveMessage(message)
            except AttributeError:
                pass

    def cacheMessage(self, message):
        '''Add a message to the in-memory cache'''
        # This method should probably not be overridden

        key = message.key()
        sub = message.getSubstance()
        
        if self.mfilter is not None:
            match = re.match(self.mfilter, sub, re.DOTALL)
            if not match:
                if Verbose:
                    print 'not caching %s because it does not ' \
                          'match the corpus filter' % (key)
                raise KeyError, message

        if Verbose:
            print 'placing %s in corpus cache' % (key)

        self.msgs[key] = message

        # Here is where we manage the in-memory cache size...
        self.keysInMemory.append(key)

        if self.cacheSize > 0:       # performance optimization
            if len(self.keysInMemory) > self.cacheSize:
                keyToFlush = self.keysInMemory[0]
                self.unCacheMessage(keyToFlush)

    def unCacheMessage(self, key):
        '''Remove a message from the in-memory cache'''
        # This method should probably not be overridden

        if Verbose:
            print 'Flushing %s from corpus cache' % (key)

        try:
            ki = self.keysInMemory.index(key)
        except ValueError:
            pass
        else:
            del self.keysInMemory[ki]

        self.msgs[key] = None

    def takeMessage(self, key, fromcorpus):
        '''Move a Message from another corpus to this corpus'''

        msg = fromcorpus[key]
        fromcorpus.removeMessage(msg)
        self.addMessage(msg)

    def __getitem__(self, key):
        '''Corpus is a dictionary'''

        amsg = self.msgs[key]

        if not amsg:
            amsg = self.makeMessage(key)     # lazy init, saves memory
            self.cacheMessage(amsg)

        return amsg

    def keys(self):
        '''Message keys in the Corpus'''

        return self.msgs.keys()

    def __iter__(self):
        '''Corpus is iterable'''

        for key in self.keys():
            try:
                yield self[key]
            except KeyError:
                pass

    def __str__(self):
        '''Instance as a printable string'''

        return self.__repr__()

    def __repr__(self):
        '''Instance as a representative string'''

        raise NotImplementedError

    def makeMessage(self, key):
        '''Call the factory to make a message'''

        # This method will likely be overridden
        msg = self.factory.create(key)

        return msg

    def setFilter(self, sub):
        '''set this message filter'''
        
        self.mfilter = sub
        
    def getFilter(self):
        '''Return this message filter'''
        
        return self.mfilter
        

class ExpiryCorpus:
    '''Corpus of "young" file system artifacts'''

    def __init__(self, expireBefore, factory, cacheSize=-1):
        '''Constructor'''

        self.expireBefore = expireBefore
        Corpus.__init__(self, factory, cacheSize)

    def cacheMessage(self, msg):
        '''Add a message to the in-memory cache'''
        # This is where the expiry of a message is enforced
        # This method should probably not be overridden

        if msg.createTimestamp() >= time.time() - self.expireBefore:
            Corpus.cacheMessage(self, msg)
        else:
            if Verbose:
                print 'Not caching %s because it has expired' % (msg.key())
            raise KeyError, msg

        return msg

    def removeExpiredMessages(self):
        '''Kill expired messages'''

        for key in self.keys():
            try:
                msg = self[key]
            except KeyError, e:
                if Verbose:
                    print 'message %s has expired' % (key)
                self.removeMessage(e[0])


class Message:
    '''Abstract Message class'''

    def __init__(self):
        '''Constructor()'''
        pass

    def load(self):
        '''Method to load headers and body'''

        raise NotImplementedError

    def store(self):
        '''Method to persist a message'''

        raise NotImplementedError

    def remove(self):
        '''Method to obliterate a message'''

        raise NotImplementedError

    def __repr__(self):
        '''Instance as a representative string'''

        raise NotImplementedError

    def __str__(self):
        '''Instance as a printable string'''

        return self.substance

    def name(self):
        '''Message may have a unique human readable name'''

        return self.__repr__()

    def key(self):
        '''The key for this instance'''

        raise NotImplementedError

    def setSubstance(self, sub):
        '''set this message substance'''
        
        self.substance = sub
        
    def getSubstance(self):
        '''Return this message substance'''
        
        return self.substance
        
    def setSpamprob(self, prob):
        '''Score of the last spamprob calc, may not be persistent'''

        self.spamprob = prob

    def tokenize(self):
        '''Returns substance as tokens'''

        return tokenizer.tokenize(self.substance)

    def createTimestamp(self):
        '''Returns the create time of this message'''
        # Should return a timestamp like time.time()

        raise NotImplementedError


class MessageFactory:
    '''Abstract Message Factory'''

    def __init__(self):
        '''Constructor()'''
        pass

    def create(self, key):
        '''Create a message instance'''

        raise NotImplementedError


if __name__ == '__main__':
    print >>sys.stderr, __doc__
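
The lazy loading and cacheSize behaviour described in the docstring above can be reduced to a few lines. This is a simplified sketch for a current Python 3 interpreter, not the Corpus class itself: LazyCache and loader are hypothetical names, and messages are plain strings rather than Message objects.

```python
class LazyCache:
    """Dictionary whose values are loaded on first access and evicted
    oldest-first once more than cache_size of them are in memory."""

    def __init__(self, loader, keys, cache_size=-1):
        self.loader = loader
        self.msgs = dict.fromkeys(keys)   # None means "known but not loaded"
        self.keys_in_memory = []          # load order, oldest first
        self.cache_size = cache_size      # values <= 0 mean "no limit"

    def __getitem__(self, key):
        msg = self.msgs[key]              # unknown keys raise KeyError
        if msg is None:
            msg = self.loader(key)        # lazy init, saves memory
            self._cache(key, msg)
        return msg

    def _cache(self, key, msg):
        self.msgs[key] = msg
        self.keys_in_memory.append(key)
        if 0 < self.cache_size < len(self.keys_in_memory):
            oldest = self.keys_in_memory.pop(0)
            self.msgs[oldest] = None      # keep the key, drop the body


loads = []

def loader(key):
    loads.append(key)                     # record every real load
    return "body of " + key

cache = LazyCache(loader, ["a", "b", "c"], cache_size=2)
for key in ("a", "b", "c", "a"):
    cache[key]
```

With a limit of two, loading "c" evicts "a", so the final access to "a" has to load it again. Eviction follows load order, not access order, which matches the first-in, first-out list kept by the class above.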


From timstone4@users.sourceforge.net  Sat Nov 16 16:30:23 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 08:30:23 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,NONE,1.1
Message-ID: <E18D5pv-0000ys-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3745

Added Files:
	FileCorpus.py 
Log Message:
This module defines classes for the management of message corpora
resident in file systems.  Messages can exist in directories as simple files
or as gzip files.

This module also has a test harness for exercising the entire
Bayes/Corpus/FileCorpus set of classes.  This harness is useful for
understanding the general use of these classes.
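
Storing messages as gzip files, as mentioned above, amounts to a write and a read through the gzip module. The sketch below is an illustration for a current Python 3 interpreter; store_gzip and load_gzip are hypothetical helper names, not functions from this module, and a missing file is reported as None here purely for the sake of the example.

```python
import errno
import gzip
import os
import tempfile


def store_gzip(path, substance):
    """Write the message bytes to a gzip-compressed file."""
    with gzip.open(path, 'wb') as fp:
        fp.write(substance)


def load_gzip(path):
    """Return the message bytes, or None if the file does not exist."""
    try:
        fp = gzip.open(path, 'rb')
    except IOError as e:
        # Only a missing file is tolerated; other errors propagate.
        if e.errno != errno.ENOENT:
            raise
        return None
    with fp:
        return fp.read()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'MSG00001')
    store_gzip(path, b'Subject: test\n\nhello')
    roundtrip = load_gzip(path)
    missing = load_gzip(os.path.join(tmpdir, 'MSG99999'))
```

The message is written and read back unchanged, while the compression stays invisible to the caller.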

--- NEW FILE: FileCorpus.py ---
#! /usr/bin/env python

'''FileCorpus.py - Corpus composed of file system artifacts

Classes:
    FileCorpus - an observable dictionary of FileMessages
    ExpiryFileCorpus - a FileCorpus of young files
    FileMessage - a subject of Spambayes training
    FileMessageFactory - a factory to create FileMessage objects
    GzipFileMessage - A FileMessage zipped for less storage
    GzipFileMessageFactory - factory to create GzipFileMessage objects

Abstract:
    These classes are concrete implementations of the Corpus framework.

    FileCorpus is designed to manage corpora that are directories of
    message files.

    ExpiryFileCorpus is an ExpiryCorpus of file messages.

    FileMessage manages messages that are files in the file system.

    FileMessageFactory is responsible for the creation of FileMessages,
    in response to requests to a corpus for messages.

    GzipFileMessage and GzipFileMessageFactory are used to persist messages
    as zipped files.  This can save a bit of persistent storage, though the
    ability of the compressor to do very much deflation is limited due to the
    relatively small size of the average textual message.  Still, for a large
    corpus, this could amount to a significant space savings.

    See Corpus.__doc__ for more information.

Test harness:
    FileCorpus [options]

        options:
            -h : show this message
            -v : execute in verbose mode, useful for general understanding
                 and debugging purposes
            -g : use GzipFileMessage and GzipFileMessageFactory
            -s : setup self test, useful for seeing what is going into the
                 test
            -t : setup and execute a self test.
            -c : clean up file system after self test

    Please note that running with -s or -t will create file system artifacts
    in the current directory.  Be sure this doesn't stomp something of
    yours...  The artifacts created are:

        fctestmisc.bayes
        fctestclass.bayes
        fctestspamcorpus/MSG00001
        fctestspamcorpus/MSG00002
        fctestunsurecorpus/MSG00003
        fctestunsurecorpus/MSG00004
        fctestunsurecorpus/MSG00005
        fctestunsurecorpus/MSG00006
        fctesthamcorpus/

    After the test has executed, the following file system artifacts
    should exist:

        fctestmisc.bayes
        fctestclass.bayes
        fctestspamcorpus/MSG00001
        fctestspamcorpus/MSG00004
        fctesthamcorpus/MSG00002
        fctesthamcorpus/MSG00005
        fctesthamcorpus/MSG00006
        fctestunsurecorpus/

To Do:
    o Suggestions?

'''

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

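# A __future__ import must precede every other statement in the module;
# only the docstring and comments may come before it.
from __future__ import generators

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."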

import Corpus
import Bayes
import sys, os, gzip, fnmatch, getopt, errno, time, stat

class FileCorpus(Corpus.Corpus):

    def __init__(self, factory, directory, filter='*', cacheSize=250):
        '''Constructor(FileMessageFactory, corpus directory name, fnmatch
        filter, cache size)'''

        Corpus.Corpus.__init__(self, factory, cacheSize)

        self.directory = directory
        self.filter = filter

        # This assumes that the directory exists.  A horrible death occurs
        # otherwise. We *could* simply create it, but that will likely only
        # mask errors

        # This will not pick up any changes to the corpus that are made
        # through the file system. The key list is established in __init__,
        # and if anybody stores files in the directory, even if they match
        # the filter, they won't make it into the key list.  The same
        # problem exists if anybody removes files. This *could* be a problem.
        # If so, we can maybe override the keys() method to account for this,
        # but there would be training side-effects...  The short of it is that
        # corpora that are managed by FileCorpus should *only* be managed by
        # FileCorpus (at least for now).  External changes that must be made
        # to the corpus should for the moment be handled by a complete
        # retraining.

        for filename in os.listdir(directory):
            if fnmatch.fnmatch(filename, filter):
                self.msgs[filename] = None

    def makeMessage(self, key):
        '''Ask our factory to make a Message'''

        msg = self.factory.create(key, self.directory)

        return msg

    def addMessage(self, message):
        '''Add a Message to this corpus'''

        if not fnmatch.fnmatch(message.key(), self.filter):
            raise ValueError

        if Corpus.Verbose:
            print 'adding',message.key(),'to corpus'

        message.directory = self.directory
        message.store()
        # superclass processing *MUST* be done
        # perform superclass processing *LAST!*
        Corpus.Corpus.addMessage(self, message)

    def removeMessage(self, message):
        '''Remove a Message from this corpus'''

        if Corpus.Verbose:
            print 'removing',message.key(),'from corpus'

        message.remove()

        # superclass processing *MUST* be done
        # perform superclass processing *LAST!*
        Corpus.Corpus.removeMessage(self, message)

    def __repr__(self):
        '''Instance as a representative string'''

        nummsgs = len(self.msgs)
        if nummsgs != 1:
            s = 's'
        else:
            s = ''

        if Corpus.Verbose and nummsgs > 0:
            lst = ', ' + '%s' % (self.keys())
        else:
            lst = ''

        return "<%s object at %8.8x, directory: %s, %s message%s%s>" % \
            (self.__class__.__name__, \
            id(self), \
            self.directory, \
            nummsgs, s, lst)


class ExpiryFileCorpus(Corpus.ExpiryCorpus, FileCorpus):
    '''FileCorpus of "young" file system artifacts'''

    def __init__(self, expireBefore, factory, directory, filter='*', cacheSize=250):
        '''Constructor(expiry age in seconds, FileMessageFactory, corpus
        directory name, fnmatch filter, cache size)'''

        Corpus.ExpiryCorpus.__init__(self, expireBefore, factory, cacheSize)
        FileCorpus.__init__(self, factory, directory, filter, cacheSize)


class FileMessage(Corpus.Message):
    '''Message that persists as a file system artifact.'''

    def __init__(self,file_name, directory):
        '''Constructor(message file name, corpus directory name)'''

        self.file_name = file_name
        self.directory = directory
        self.load()

    def pathname(self):
        '''Derive the pathname of the message file'''

        return os.path.join(self.directory, self.file_name)

    def load(self):
        '''Read the Message substance from the file'''

        if Corpus.Verbose:
            print 'loading', self.file_name

        pn = self.pathname()
        try:
            fp = open(pn, 'rb')
        except IOError, e:
            if e.errno != errno.ENOENT:
                raise
        else:
            self.substance = fp.read()
            fp.close()

    def store(self):
        '''Write the Message substance to the file'''

        if Corpus.Verbose:
            print 'storing', self.file_name

        pn = self.pathname()
        fp = open(pn, 'wb')
        fp.write(self.substance)
        fp.close()

    def remove(self):
        '''Message hara-kiri'''

        if Corpus.Verbose:
            print 'physically deleting file',self.pathname()

        os.unlink(self.pathname())

    def name(self):
        '''A unique name for the message'''
        return self.file_name

    def key(self):
        '''The key of this message in the msgs dictionary'''
        return self.file_name

    def __repr__(self):
        '''Instance as a representative string'''

        sub = self.substance

        if not Corpus.Verbose and len(sub) > 20:
            # abbreviate long messages: the first 20 characters, plus the
            # last 20 when the message is longer than 40 characters
            if len(sub) > 40:
                sub = sub[:20] + '...' + sub[-20:]
            else:
                sub = sub[:20]

        pn = os.path.join(self.directory, self.file_name)

        return "<%s object at %8.8x, file: %s, %s>" % \
            (self.__class__.__name__, \
            id(self), pn, sub)

    def __str__(self):
        '''Instance as a printable string'''

        return self.__repr__()

    def createTimestamp(self):
        '''Return the create timestamp for the file'''

        stats = os.stat(self.pathname())
        ctime = stats[stat.ST_CTIME]

        return ctime


class FileMessageFactory(Corpus.MessageFactory):
    '''MessageFactory for FileMessage objects'''

    def create(self, key, directory):
        '''Create a message object from a filename in a directory'''

        return FileMessage(key, directory)


class GzipFileMessage(FileMessage):
    '''Message that persists as a zipped file system artifact.'''

    def load(self):
        '''Read the Message substance from the file'''

        if Corpus.Verbose:
            print 'loading', self.file_name

        pn = self.pathname()

        try:
            fp = gzip.open(pn, 'rb')
        except IOError, e:
            if e.errno != errno.ENOENT:
                raise
        else:
            self.substance = fp.read()
            fp.close()


    def store(self):
        '''Write the Message substance to the file'''

        if Corpus.Verbose:
            print 'storing', self.file_name

        pn = self.pathname()
        gz = gzip.open(pn, 'wb')
        gz.write(self.substance)
        gz.flush()
        gz.close()


class GzipFileMessageFactory(FileMessageFactory):
    '''MessageFactory for GzipFileMessage objects'''

    def create(self, key, directory):
        '''Create a message object from a filename in a directory'''

        return GzipFileMessage(key, directory)


def runTest(useGzip):

    print 'Executing Test'

    if useGzip:
        fmFact = GzipFileMessageFactory()
        print 'Executing with Gzipped files'
    else:
        fmFact =  FileMessageFactory()
        print 'Executing with uncompressed files'

    print '\n\nCreating two Bayes databases'
    miscbayes = Bayes.PickledBayes('fctestmisc.bayes')
    classbayes = Bayes.DBDictBayes('fctestclass.bayes')

    print '\n\nSetting up spam corpus'
    spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
    spamtrainer = Bayes.SpamTrainer(miscbayes)
    spamcorpus.addObserver(spamtrainer)
    anotherspamtrainer = Bayes.SpamTrainer(classbayes, Bayes.UPDATEPROBS)
    spamcorpus.addObserver(anotherspamtrainer)

    keys = spamcorpus.keys()
    keys.sort()
    for key in keys:                          # iterate the list of keys
        msg = spamcorpus[key]                 # corpus is a dictionary
        spamtrainer.train(msg)
        anotherspamtrainer.train(msg)


    print '\n\nSetting up ham corpus'
    hamcorpus = FileCorpus(fmFact, \
                          'fctesthamcorpus', \
                          'MSG*')
    hamtrainer = Bayes.HamTrainer(miscbayes)
    hamcorpus.addObserver(hamtrainer)
    hamtrainer.trainAll(hamcorpus)


    print '\n\nAdd a message to hamcorpus that does not match the filter'
    if useGzip:
        fmClass = GzipFileMessage
    else:
        fmClass = FileMessage

    m1 = fmClass('XMG00001', 'fctestspamcorpus')

    try:
        hamcorpus.addMessage(m1)
    except ValueError:
        print 'Add failed, test passed'
    else:
        print 'Add passed, test failed'


    print '\n\nThis is the hamcorpus'
    print hamcorpus


    print '\n\nThis is the spamcorpus'
    print spamcorpus


    print '\n\nSetting up unsure corpus'
    # the unsure corpus is an expiry corpus with five second expiry
    # and a cache size of 2 (for testing purposes only...), and
    # no trainers, since there's no such thing as 'unsure training'
    unsurecorpus = ExpiryFileCorpus(5, fmFact, \
                                    'fctestunsurecorpus', 'MSG*', 2)


    print '\n\nIterate the unsure corpus, to make sure cache size \
is managed correctly.  We should not see MSG00003 in this iteration.'
    for msg in unsurecorpus:
        print msg.key()    # don't print msg, too much information
    for msg in unsurecorpus:
        print msg.key()    # don't print msg, too much information


    print '\n\nIterate the unsure corpus with a filter, to make sure \
the filtering mechanism is working correctly.  We should not see \
MSG00004 in this iteration.'
#    unsurecorpus.setFilter('richie')
    for msg in unsurecorpus:
        print msg.key()    # don't print msg, too much information


    print '\n\nRemoving expired messages from unsure corpus.'
    unsurecorpus.removeExpiredMessages()


    print '\n\nTrain with an individual message'
    anotherhamtrainer = Bayes.HamTrainer(classbayes)
    anotherhamtrainer.train(unsurecorpus['MSG00005'])


    print '\n\nMoving msg00002 from spamcorpus to hamcorpus'
    hamcorpus.takeMessage('MSG00002', spamcorpus)   # Oops. made a mistake...


    print "\n\nLet's test printing a message"
    msg = spamcorpus['MSG00001']
    print msg


    print '\n\nClassifying messages in unsure corpus'

    for msg in unsurecorpus:
        type = classbayes.classify(msg)

        print 'Message %s spam probability is %f' % (msg.key(), msg.spamprob)

        if type == 'ham':
            print 'Moving %s from unsurecorpus to hamcorpus, \
based on prob of %f' % (msg.key(), msg.spamprob)
            hamcorpus.takeMessage(msg.key(), unsurecorpus)
        elif type == 'spam':
            print 'Moving %s from unsurecorpus to spamcorpus, \
based on prob of %f' % (msg.key(), msg.spamprob)
            spamcorpus.takeMessage(msg.key(), unsurecorpus)


    print '\n\nThis is the new hamcorpus'
    print hamcorpus


    print '\n\nThis is the new spamcorpus'
    print spamcorpus


    print '\n\nThis is the new unsurecorpus'
    print unsurecorpus
    print 'unsurecorpus cache contains', unsurecorpus.keysInMemory
    print 'unsurecorpus msgs dict contains', unsurecorpus.msgs


    print '\n\nUpdating and storing bayes databases'
    miscbayes.update_probabilities()  # if we don't, training is forgotten
    miscbayes.store()
    classbayes.store()

def cleanupTest():

    print 'Cleaning up'

    cleanupDirectory('fctestspamcorpus')
    cleanupDirectory('fctesthamcorpus')
    cleanupDirectory('fctestunsurecorpus')

    if not useExistingDB:
        try:
            os.unlink('fctestmisc.bayes')
        except OSError, e:
            if e.errno != errno.ENOENT:     # a missing file is fine
                raise

        try:
            os.unlink('fctestclass.bayes')
        except OSError, e:
            if e.errno != errno.ENOENT:     # a missing file is fine
                raise

def cleanupDirectory(dirname):

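    try:
        flist = os.listdir(dirname)
    except OSError, e:
        # 3 is the code the original tested for (it appears to be what
        # Windows reports for a missing path); POSIX reports
        # errno.ENOENT.  Either way there is nothing to clean up.
        if e.errno not in (3, errno.ENOENT):
            raise
    else:
        for filename in flist:
            fn = os.path.join(dirname, filename)
            os.unlink(fn)
    try:
        os.rmdir(dirname)
    except OSError, e:
        if e.errno != errno.ENOENT:     # a missing directory is fine
            raise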

def setupTest(useGzip):

    cleanupTest()

    print 'Setting up test'

    # no try blocks here, because if any of this dies, the test
    # cannot proceed

    os.mkdir('fctestspamcorpus')
    os.mkdir('fctesthamcorpus')
    os.mkdir('fctestunsurecorpus')

    tm1 = testmsg1()
    tm2 = testmsg2()

    if useGzip:
        fmClass = GzipFileMessage
    else:
        fmClass = FileMessage

    m1 = fmClass('MSG00001', 'fctestspamcorpus')
    m1.substance = tm1
    m1.store()

    m2 = fmClass('MSG00002', 'fctestspamcorpus')
    m2.substance = tm2
    m2.store()

    m3 = fmClass('MSG00003', 'fctestunsurecorpus')
    m3.substance = tm1
    m3.store()

    for x in range(11):
        time.sleep(1)    # make sure MSG00003 has expired
        if 10-x == 1:
            s = ''
        else:
            s = 's'
        print 'wait',10-x,'more second%s' % (s)

    m4 = fmClass('MSG00004', 'fctestunsurecorpus')
    m4.substance = tm1
    m4.store()

    m5 = fmClass('MSG00005', 'fctestunsurecorpus')
    m5.substance = tm2
    m5.store()

    m6 = fmClass('MSG00006', 'fctestunsurecorpus')
    m6.substance = tm2
    m6.store()


def testmsg1():

    return '''
X-Hd:skip@pobox.com Mon Nov 04 10:50:49 2002
Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
1998/05/13) Mon Nov 4 08:50:58 2002)
X-From_:skip@mojam.com Mon Nov 4 08:49:03 2002
Return-Path:<skip@mojam.com>
Delivered-To:timstone@mail.powweb.com
Received:from manatee.mojam.com (manatee.mojam.com [199.249.165.175]) by
mail.powweb.com (Postfix) with ESMTP id DC95A1BB1D0 for
<tim@fourstonesExpressions.com>; Mon, 4 Nov 2002 08:49:02 -0800 (PST)
Received:from montanaro.dyndns.org (12-248-11-90.client.attbi.com
[12.248.11.90]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id
gA4Gn0oY029655 for <tim@fourstonesExpressions.com>; Mon, 4 Nov 2002 10:49:00
-0600
Received:from montanaro.dyndns.org (localhost [127.0.0.1]) by
montanaro.dyndns.org (8.12.2/8.12.2) with ESMTP id gA4Gn3cP015572 for
<tim@fourstonesExpressions.com>; Mon, 4 Nov 2002 10:49:03 -0600 (CST)
Received:(from skip@localhost) by montanaro.dyndns.org (8.12.2/8.12.2/Submit)
id gA4Gn37l015569; Mon, 4 Nov 2002 10:49:03 -0600 (CST)
From:Skip Montanaro <skip@pobox.com>
MIME-Version:1.0
Content-Type:text/plain; charset=us-ascii
Content- Transfer- Encoding:7bit

Message-ID:<15814.42238.882013.702030@montanaro.dyndns.org>
Date:Mon, 4 Nov 2002 10:49:02 -0600
To:Four Stones Expressions <tim@fourstonesExpressions.com>
Subject:Reformat mail to 80 columns?
In-Reply-To:<QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1@riven>
References:<8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc@riven>
<QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1@riven>
X-Mailer:VM 7.07 under 21.5 (beta9) "brussels sprouts" XEmacs Lucid
Reply-To:skip@pobox.com
X-Hammie- Disposition:Unsure


11/4/2002 10:49:02 AM, Skip Montanaro <skip@pobox.com> wrote:

>(off-list)
>
>Tim,
>
>Any chance you can easily generate messages to the spambayes list which wrap
>at something between 70 and 78 columns?  I find I have to always edit your
>messages to read them easily.
>
>Thanks,
>
>--
>Skip Montanaro - skip@pobox.com
>http://www.mojam.com/
>http://www.musi-cal.com/
>
>
- Tim
www.fourstonesExpressions.com '''

def testmsg2():
    return '''
X-Hd:richie@entrian.com Wed Nov 06 12:05:41 2002
Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
1998/05/13) Wed Nov 6 10:05:45 2002)
X-From_:richie@entrian.com Wed Nov 6 10:05:33 2002
Return-Path:<richie@entrian.com>
Delivered-To:timstone@mail.powweb.com
Received:from anchor-post-31.mail.demon.net (anchor-post-31.mail.demon.net
[194.217.242.89]) by mail.powweb.com (Postfix) with ESMTP id 3DC431BB06A for
<tim@fourstonesexpressions.com>; Wed, 6 Nov 2002 10:05:33 -0800 (PST)
Received:from sundog.demon.co.uk ([158.152.226.183]) by
anchor-post-31.mail.demon.net with smtp (Exim 3.35 #1) id 189UYP-000IAw-0V for
tim@fourstonesExpressions.com; Wed, 06 Nov 2002 18:05:25 +0000
From:Richie Hindle <richie@entrian.com>
To:tim@fourstonesExpressions.com
Subject:Re: What to call this training stuff
Date:Wed, 06 Nov 2002 18:05:56 +0000
Organization:entrian.com
Reply-To:richie@entrian.com
Message-ID:<d0hisugn3nau4m704kotgpd4jlt33rvrda@4ax.com>
References:<IFWRHE041VTXW72JGDBD0RTS04YTGE.3dc933a1@riven>
In-Reply-To:<IFWRHE041VTXW72JGDBD0RTS04YTGE.3dc933a1@riven>
X-Mailer:Forte Agent 1.7/32.534
MIME-Version:1.0
Content-Type:text/plain; charset=us-ascii
Content- Transfer- Encoding:7bit

X-Hammie- Disposition:Unsure


Hi Tim,

> Richie, I think we should package these classes I've been writing as
> 'corpusManagement.py'  What we're really doing here is creating a set of
tools
> that can be used to manage corpi (?) corpusses (?)  corpae (?)  whatever...
of
> messages.

Good plan.  Minor point of style: mixed-case module names (like class
names) tend to have an initial capital: CorpusManagement.py

On the name... sorry to disagree about names again, but what does the word
'management' add?  This is a module for manipulating corpuses, so I reckon
it should be called Corpus.  Like Cookie, gzip, zipfile, locale, mailbox...
see what I mean?

--
Richie Hindle
richie@entrian.com'''

if __name__ == '__main__':

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'estgvhcu')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    Corpus.Verbose = False
    runTestServer = False
    setupTestServer = False
    cleanupTestServer = False
    useGzip = False
    useExistingDB = False

    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-s':
            setupTestServer = True
        elif opt == '-e':
            runTestServer = True
        elif opt == '-t':
            setupTestServer = True
            runTestServer = True
        elif opt == '-c':
            cleanupTestServer = True
        elif opt == '-v':
            Corpus.Verbose = True
        elif opt == '-g':
            useGzip = True
        elif opt == '-u':
            useExistingDB = True

    if setupTestServer:
        setupTest(useGzip)
    if runTestServer:
        runTest(useGzip)
    elif cleanupTestServer:
        cleanupTest()
    elif not setupTestServer:
        print >>sys.stderr, __doc__
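
The load/store pairs in FileMessage and GzipFileMessage above come down to one pattern: write the substance out, and on load treat a missing file as "nothing stored yet" while re-raising every other error. Below is a minimal sketch of that pattern for the gzip case. It is written for current Python 3 rather than the Python 2 of this check-in, and store_gz and load_gz are hypothetical helper names that do not appear in FileCorpus.py.

```python
import errno
import gzip
import os
import tempfile


def store_gz(pathname, substance):
    """Write the message bytes to a gzip-compressed file."""
    with gzip.open(pathname, 'wb') as fp:
        fp.write(substance)


def load_gz(pathname):
    """Return the message bytes, or None if the file does not exist."""
    try:
        with gzip.open(pathname, 'rb') as fp:
            return fp.read()
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise          # anything but "no such file" is a real error
        return None


if __name__ == '__main__':
    directory = tempfile.mkdtemp()
    pn = os.path.join(directory, 'MSG00001')
    store_gz(pn, b'Subject: test\n\nbody\n')
    print(load_gz(pn))                                   # the stored bytes
    print(load_gz(os.path.join(directory, 'MSG99999')))  # missing file
```

Returning None for a missing file is a choice made for this sketch; the classes above instead leave self.substance untouched in that case.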

       
From timstone4@users.sourceforge.net  Sat Nov 16 16:38:58 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 08:38:58 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.1,1.2
Message-ID: <E18D5yE-00022v-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7833

Modified Files:
	Bayes.py 
Log Message:
Added mode 'c' on DBDict constructor call to account for recent change
to that class.

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** Bayes.py	16 Nov 2002 16:18:23 -0000	1.1
--- Bayes.py	16 Nov 2002 16:38:56 -0000	1.2
***************
*** 176,180 ****
          self.db_name = db_name
          self.statekey = "saved state"
!         self.wordinfo = DBDict(db_name, (self.statekey,))  # r/rw?
  
          self.load()
--- 176,180 ----
          self.db_name = db_name
          self.statekey = "saved state"
!         self.wordinfo = DBDict(db_name, (self.statekey,), 'c')  # r/rw?
  
          self.load()


From timstone4@users.sourceforge.net  Sat Nov 16 16:47:41 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 08:47:41 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.2,1.3
Message-ID: <E18D66f-0003Ls-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12873

Modified Files:
	Bayes.py 
Log Message:
Put the mode in the correct spot in the constructor parm list  :: sigh ::

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** Bayes.py	16 Nov 2002 16:38:56 -0000	1.2
--- Bayes.py	16 Nov 2002 16:47:39 -0000	1.3
***************
*** 176,180 ****
          self.db_name = db_name
          self.statekey = "saved state"
!         self.wordinfo = DBDict(db_name, (self.statekey,), 'c')  # r/rw?
  
          self.load()
--- 176,180 ----
          self.db_name = db_name
          self.statekey = "saved state"
!         self.wordinfo = DBDict(db_name, 'c', (self.statekey,))  # r/rw?
  
          self.load()
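
The two revisions above differ only in where the 'c' mode lands in the positional argument list. One way to avoid that kind of slip is to pass the flag by keyword. The sketch below shows the idea with the standard-library dbm module in Python 3 (the successor to Python 2's anydbm), where 'c' means "open read/write, creating the file if needed". The wrapper name open_state_db is hypothetical, and DBDict's real signature is not shown in this diff, so nothing here should be read as its API.

```python
import dbm
import os
import tempfile


def open_state_db(db_name, mode='c'):
    """Open a dbm file, passing the flag by keyword so it cannot end
    up in the wrong positional slot."""
    return dbm.open(db_name, flag=mode)


if __name__ == '__main__':
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, 'fctestclass')
    # 'c' creates the database on first use.
    with open_state_db(path, 'c') as db:
        db['saved state'] = b'1 2'
    # 'r' opens an existing database read-only.
    with open_state_db(path, 'r') as db:
        print(db['saved state'])
```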


From timstone4@users.sourceforge.net  Sat Nov 16 19:03:18 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 11:03:18 -0800
Subject: [Spambayes-checkins] spambayes Corpus.py,1.1,1.2
Message-ID: <E18D8Du-0001t0-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7226

Modified Files:
	Corpus.py 
Log Message:
Removed prototype message filtering code

Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** Corpus.py	16 Nov 2002 16:27:39 -0000	1.1
--- Corpus.py	16 Nov 2002 19:03:15 -0000	1.2
***************
*** 105,109 ****
          self.observers = []       # observers of this corpus
          self.factory = factory    # factory for the correct Message subclass
-         self.mfilter = None       # regex to filter messages
  
      def addObserver(self, observer):
--- 105,108 ----
***************
*** 152,163 ****
  
          key = message.key()
-         sub = message.getSubstance()
-         
-         if self.mfilter != None:
-             match = re.match(self.mfilter, sub, re.DOTALL)
-             if not match:
-                 print 'not cacheing %s because it does not \
- match the corpus filter' % (key)
-                 raise KeyError, message
  
          if Verbose:
--- 151,154 ----
***************
*** 239,252 ****
  
          return msg
- 
-     def setFilter(self, sub):
-         '''set this message filter'''
-         
-         self.mfilter = sub
-         
-     def getFilter(self):
-         '''Return this message filter'''
-         
-         return self.mfilter
          
  
--- 230,233 ----


From timstone4@users.sourceforge.net  Sat Nov 16 19:06:29 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sat, 16 Nov 2002 11:06:29 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.1,1.2
Message-ID: <E18D8Gz-0002ME-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9030

Modified Files:
	FileCorpus.py 
Log Message:
Removed prototype message filtering code

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** FileCorpus.py	16 Nov 2002 16:30:20 -0000	1.1
--- FileCorpus.py	16 Nov 2002 19:06:27 -0000	1.2
***************
*** 402,417 ****
  
  
!     print '\n\nIterate the unsure corpus, to make sure cache size \
! is managed correctly.  We should not see MSG00003 in this iteration.'
!     for msg in unsurecorpus:
!         print msg.key()    # don't print msg, too much information
      for msg in unsurecorpus:
          print msg.key()    # don't print msg, too much information
! 
! 
!     print '\n\nIterate the unsure corpus with a filter, to make sure \
! the filtering mechanism is working correctly.  We should not see \
! MSG00004 in this iteration.'
! #    unsurecorpus.setFilter('richie')
      for msg in unsurecorpus:
          print msg.key()    # don't print msg, too much information
--- 402,411 ----
  
  
!     print '\n\nIterate the unsure corpus twice, to make sure cache size \
! is managed correctly, and to make sure iteration is repeatable. \
! We should not see MSG00003 in this iteration.'
      for msg in unsurecorpus:
          print msg.key()    # don't print msg, too much information
!     print '...and again'
      for msg in unsurecorpus:
          print msg.key()    # don't print msg, too much information


From npickett@users.sourceforge.net  Sun Nov 17 03:42:39 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sat, 16 Nov 2002 19:42:39 -0800
Subject: [Spambayes-checkins] 
 spambayes hammiefilter.py,NONE,1.1 README.txt,1.42,1.43
 hammie.py,1.38,1.39 mboxutils.py,1.6,1.7
Message-ID: <E18DGKV-0003cB-00@usw-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12012

Modified Files:
	README.txt hammie.py mboxutils.py 
Added Files:
	hammiefilter.py 
Log Message:
* WordInfo optimization in hammie.py.  If you didn't catch it on the
  mail list, this is going to make your dbm file smaller, and
  unusable by older hammies.
* hammie.py can now take messages on stdin, but it's ugly.  If you
  want to do this, you should look at hammiefilter.py
* hammiefilter.py is like hammie jr--it takes a single message on
  stdin and either scores it or trains on it.
* Modified README to talk about hammiecli.py and new hammiefilter.py


--- NEW FILE: hammiefilter.py ---
#!/usr/bin/env python

## A hammie front-end to make the simple stuff simple.
##
##
## The intent is to call this from procmail and its ilk like so:
##
##   :0 fw
##   | hammiefilter.py
## 
## Then, you can set up your MUA to pipe ham and spam to it, one at a
## time, by calling it with either the -g or -s options, respectively.
##
## Author: Neale Pickett <neale@woozle.org>
##

"""Usage: %(program)s [option]

Where [option] is one of:
    -h
        show usage and exit
    -n
        create a new database
    -g
        train on stdin as a good (ham) message
    -s
        train on stdin as a bad (spam) message

If neither -g nor -s is given, stdin will be scored: the same message,
with a new header containing the score, will be sent to stdout.
"""

import sys
import getopt
import hammie
from Options import options

# See Options.py for explanations of these properties
DBNAME = options.persistent_storage_file
USEDB = options.persistent_use_database
program = sys.argv[0]

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def hammie_open(mode):
    b = hammie.createbayes(DBNAME, USEDB, mode)
    return hammie.Hammie(b)

def newdb():
    hammie_open('n')
    print "Created new database in", DBNAME

def filter():
    h = hammie_open('r')
    msg = sys.stdin.read()
    print h.filter(msg)

def train_ham():
    h = hammie_open('w')
    msg = sys.stdin.read()
    h.train_ham(msg)
    h.update_probabilities()

def train_spam():
    h = hammie_open('w')
    msg = sys.stdin.read()
    h.train_spam(msg)
    h.update_probabilities()


def main():
    action = filter
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hngs')
    except getopt.error, msg:
        usage(2, msg)
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-g':
            action = train_ham
        elif opt == '-s':
            action = train_spam
        elif opt == "-n":
            action = newdb
    action()

if __name__ == "__main__":
    main()
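
The option handling in main() above picks a single action: scoring is the default, and the last of -g, -s or -n on the command line overrides it. Here is a small sketch of that selection logic on its own, written for Python 3. The function choose_action and the ACTIONS table are hypothetical names, and the sketch returns action names as strings instead of calling the real training functions.

```python
import getopt

# Map each flag to the name of the action it selects.
ACTIONS = {'-g': 'train_ham', '-s': 'train_spam', '-n': 'newdb'}


def choose_action(argv):
    """Return the action name selected by the command-line flags."""
    opts, _args = getopt.getopt(argv, 'hngs')
    action = 'filter'               # default: score the message on stdin
    for opt, _arg in opts:
        if opt == '-h':
            return 'usage'          # the original exits here via usage(0)
        action = ACTIONS[opt]       # a later flag overrides an earlier one
    return action


if __name__ == '__main__':
    print(choose_action([]))
    print(choose_action(['-g']))
    print(choose_action(['-g', '-s']))
```

An unknown flag makes getopt.getopt raise getopt.GetoptError, which a caller would normally turn into a usage message.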


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** README.txt	13 Nov 2002 18:13:43 -0000	1.42
--- README.txt	17 Nov 2002 03:42:36 -0000	1.43
***************
*** 68,71 ****
--- 68,78 ----
      XML-RPC.
  
+ hammiecli.py
+     A client for hammiesrv.
+ 
+ hammiefilter.py
+     A simpler hammie front-end that doesn't print anything.  Useful for
+     procmail filtering and scoring from your MUA.
+ 
  pop3proxy.py
      A spam-classifying POP3 proxy.  It adds a spam-judgement header to

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** hammie.py	14 Nov 2002 22:00:15 -0000	1.38
--- hammie.py	17 Nov 2002 03:42:37 -0000	1.39
***************
*** 11,18 ****
      -g PATH
          mbox or directory of known good messages (non-spam) to train on.
!         Can be specified more than once.
      -s PATH
          mbox or directory of known spam messages to train on.
!         Can be specified more than once.
      -u PATH
          mbox of unknown messages.  A ham/spam decision is reported for each.
--- 11,18 ----
      -g PATH
          mbox or directory of known good messages (non-spam) to train on.
!         Can be specified more than once, or use - for stdin.
      -s PATH
          mbox or directory of known spam messages to train on.
!         Can be specified more than once, or use - for stdin.
      -u PATH
          mbox of unknown messages.  A ham/spam decision is reported for each.
***************
*** 41,44 ****
--- 41,45 ----
  import sys
  import os
+ import types
  import getopt
  import mailbox
***************
*** 110,120 ****
  
      def __getitem__(self, key):
!         if self.hash.has_key(key):
!             return pickle.loads(self.hash[key])
          else:
!             raise KeyError(key)
  
      def __setitem__(self, key, val):
!         v = pickle.dumps(val, 1)
          self.hash[key] = v
  
--- 111,131 ----
  
      def __getitem__(self, key):
!         v = self.hash[key]
!         if v[0] == 'W':
!             val = pickle.loads(v[1:])
!             # We could be sneaky, like pickle.Unpickler.load_inst,
!             # but I think that's overly confusing.
!             obj = classifier.WordInfo(0)
!             obj.__setstate__(val)
!             return obj
          else:
!             return pickle.loads(v)
  
      def __setitem__(self, key, val):
!         if isinstance(val, classifier.WordInfo):
!             val = val.__getstate__()
!             v = 'W' + pickle.dumps(val, 1)
!         else:
!             v = pickle.dumps(val, 1)
          self.hash[key] = v
  

Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** mboxutils.py	12 Nov 2002 23:16:04 -0000	1.6
--- mboxutils.py	17 Nov 2002 03:42:37 -0000	1.7
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
  """Utilities for dealing with various types of mailboxes.
  
***************
*** 21,24 ****
--- 22,26 ----
  
  import os
+ import sys
  import glob
  import email
***************
*** 53,56 ****
--- 55,61 ----
  def getmbox(name):
      """Return an mbox iterator given a file/directory/folder name."""
+ 
+     if name == "-":
+         return [get_message(sys.stdin)]
  
      if name.startswith("+"):
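
The WordInfo optimization in the hammie.py diff above stores only the object's state, prefixed with a one-character tag, instead of pickling the whole instance. The sketch below shows that tagging scheme in Python 3. The WordInfo class here is a hypothetical two-field stand-in for classifier.WordInfo, and encode and decode are illustrative names for what __setitem__ and __getitem__ do in the diff.

```python
import pickle


class WordInfo:
    """Hypothetical stand-in for classifier.WordInfo."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state


def encode(val):
    """Tag WordInfo state with b'W'; pickle everything else as-is."""
    if isinstance(val, WordInfo):
        return b'W' + pickle.dumps(val.__getstate__(), 1)
    return pickle.dumps(val, 1)


def decode(raw):
    """Rebuild a WordInfo from tagged state, or unpickle a plain value."""
    if raw[:1] == b'W':
        obj = WordInfo()
        obj.__setstate__(pickle.loads(raw[1:]))
        return obj
    return pickle.loads(raw)


if __name__ == '__main__':
    w = decode(encode(WordInfo(3, 1)))
    print(w.spamcount, w.hamcount)
    print(decode(encode((10, 20))))
```

Because the stored bytes no longer carry the class name, a reader that does not know about the tag cannot load them, which matches the log message's warning that older hammies cannot use the new file.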


From tim_one@projects.sourceforge.net  Mon Nov 18 01:40:06 2002
From: tim_one@projects.sourceforge.net (Tim Peters)
Date: Sun, 17 Nov 2002 17:40:06 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 default_bayes_customize.ini,1.6,1.7
Message-ID: <E18DatS-0006cE-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv24664/Outlook2000

Modified Files:
	default_bayes_customize.ini 
Log Message:
Added option experimental_ham_spam_imbalance_adjustment.  Please test!
Especially if you train on a lot more ham than spam (or vice versa).


Index: default_bayes_customize.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** default_bayes_customize.ini	13 Nov 2002 06:59:24 -0000	1.6
--- default_bayes_customize.ini	18 Nov 2002 01:40:04 -0000	1.7
***************
*** 28,29 ****
--- 28,32 ----
  #use_chi_squared_combining: False
  #use_gary_combining: True
+ 
+ # This will probably go away if testing confirms it's a Good Thing.
+ experimental_ham_spam_imbalance_adjustment: True
\ No newline at end of file


From tim_one@projects.sourceforge.net  Mon Nov 18 01:40:06 2002
From: tim_one@projects.sourceforge.net (Tim Peters)
Date: Sun, 17 Nov 2002 17:40:06 -0800
Subject: [Spambayes-checkins] 
 spambayes Options.py,1.70,1.71 classifier.py,1.50,1.51
Message-ID: <E18DatS-0006c9-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv24664

Modified Files:
	Options.py classifier.py 
Log Message:
Added option experimental_ham_spam_imbalance_adjustment.  Please test!
Especially if you train on a lot more ham than spam (or vice versa).


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.70
retrieving revision 1.71
diff -C2 -d -r1.70 -r1.71
*** Options.py	13 Nov 2002 18:14:32 -0000	1.70
--- Options.py	18 Nov 2002 01:40:03 -0000	1.71
***************
*** 298,301 ****
--- 298,315 ----
  use_chi_squared_combining: True
  
+ # If the # of ham and spam in training data are out of balance, the
+ # spamprob guesses can get stronger in the direction of the category with
+ # more training msgs.  In one sense this must be so, since the more data
+ # we have of one flavor, the more we know about that flavor.  But that
+ # allows the accidental appearance of a strong word of that flavor in a msg
+ # of the other flavor much more power than an accident in the other
+ # direction.  Enable experimental_ham_spam_imbalance_adjustment if you have
+ # more ham than spam training data (or more spam than ham), and the
+ # Bayesian probability adjustment won't 'believe' raw counts more than
+ # min(# ham trained on, # spam trained on) justifies.  I *expect* this
+ # option will go away (and become the default), but people *with* strong
+ # imbalance need to test it first.
+ experimental_ham_spam_imbalance_adjustment: False
+ 
  [Hammie]
  # The name of the header that hammie adds to an E-mail in filter mode
***************
*** 410,414 ****
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
!                    },
      'Hammie': {'hammie_header_name': string_cracker,
                 'persistent_storage_file': string_cracker,
--- 424,429 ----
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
!                    'experimental_ham_spam_imbalance_adjustment': boolean_cracker,
!                   },
      'Hammie': {'hammie_header_name': string_cracker,
                 'persistent_storage_file': string_cracker,

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.50
retrieving revision 1.51
diff -C2 -d -r1.50 -r1.51
*** classifier.py	11 Nov 2002 01:59:06 -0000	1.50
--- classifier.py	18 Nov 2002 01:40:04 -0000	1.51
***************
*** 322,330 ****
          nspam = float(self.nspam or 1)
  
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
!             # Compute prob(msg is spam | msg contains word).
              # This is the Graham calculation, but stripped of biases, and
              # stripped of clamping into 0.01 thru 0.99.  The Bayesian
--- 322,336 ----
          nspam = float(self.nspam or 1)
  
+         if options.experimental_ham_spam_imbalance_adjustment:
+             spam2ham = min(nspam / nham, 1.0)
+             ham2spam = min(nham / nspam, 1.0)
+         else:
+             spam2ham = ham2spam = 1.0
+ 
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
!             # Compute p(word) = prob(msg is spam | msg contains word).
              # This is the Graham calculation, but stripped of biases, and
              # stripped of clamping into 0.01 thru 0.99.  The Bayesian
***************
*** 358,362 ****
              # less so the larger n is, or the smaller s is.
  
!             n = hamcount + spamcount
              prob = (StimesX + n * prob) / (S + n)
  
--- 364,386 ----
              # less so the larger n is, or the smaller s is.
  
!             # Experimental:
!             # Picking a good value for n is interesting:  how much empirical
!             # evidence do we really have?  If nham == nspam,
!             # hamcount + spamcount makes a lot of sense, and the code here
!             # does that by default.
!             # But if, e.g., nham is much larger than nspam, p(w) can get a
!             # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!             # strong ham words (high hamcount) much stronger than strong
!             # spam words (high spamcount), and that makes the accidental
!             # appearance of a strong ham word in spam much more damaging than
!             # the accidental appearance of a strong spam word in ham.
!             # So we don't give hamcount full credit when nham > nspam (or
!             # spamcount when nspam > nham):  instead we knock hamcount down
!             # to what it would have been had nham been equal to nspam.  IOW,
!             # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!             # we don't "believe" any count to an extent more than
!             # min(nspam, nham) justifies.
! 
!             n = hamcount * spam2ham  +  spamcount * ham2spam
              prob = (StimesX + n * prob) / (S + n)
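
Editor's sketch of the imbalance adjustment described in the comment above, written as a standalone Python 3 function rather than the Python 2 method in the diff. The function name, the default values for S and X, and the unbiased ratio used for the raw probability are assumptions made for illustration; only the spam2ham/ham2spam scaling and the final formula are taken from the diff.

```python
def adjusted_spamprob(hamcount, spamcount, nham, nspam,
                      S=0.45, X=0.5, imbalance_adjustment=True):
    """Illustrative only: spam probability of one word.

    S stands in for options.unknown_word_strength and X for
    options.unknown_word_prob; the defaults are example values.
    Assumes hamcount + spamcount > 0.
    """
    nham = float(nham or 1)
    nspam = float(nspam or 1)

    if imbalance_adjustment:
        # Scale down the count from the over-represented class.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Assumed raw estimate: ratio of per-class frequencies.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)

    # n is the amount of evidence we are willing to believe.
    n = hamcount * spam2ham + spamcount * ham2spam
    return (S * X + n * prob) / (S + n)


# A word seen in 50 of 1000 hams and in none of 100 spams.
with_adjustment = adjusted_spamprob(50, 0, 1000, 100)
without_adjustment = adjusted_spamprob(50, 0, 1000, 100,
                                       imbalance_adjustment=False)
```

With ten times more ham than spam, the adjusted score stays further from 0.0 than the unadjusted one, which is the weakening of strong ham words that the comment argues for.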
  

From timstone4@projects.sourceforge.net  Mon Nov 18 04:48:12 2002
From: timstone4@projects.sourceforge.net (Tim Stone)
Date: Sun, 17 Nov 2002 20:48:12 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.3,1.4
Message-ID: <E18DdpU-0008V4-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv32640

Modified Files:
	Bayes.py 
Log Message:
Corrected the load/store semantic to DBDictBayes, which had been using a
straight DBDict instance - with continual persistence.  This was
inconsistent with the PersistentBayes design point, and did not meet
the requirements of pop3proxy, which will have 'quit' and 'save and quit'
options.

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** Bayes.py	16 Nov 2002 16:47:39 -0000	1.3
--- Bayes.py	18 Nov 2002 04:48:10 -0000	1.4
***************
*** 63,66 ****
--- 63,68 ----
  import cPickle as pickle
  import errno
+ import copy
+ import anydbm
  
  PICKLE_TYPE = 1
***************
*** 176,180 ****
          self.db_name = db_name
          self.statekey = "saved state"
-         self.wordinfo = DBDict(db_name, 'c', (self.statekey,))  # r/rw?
  
          self.load()
--- 178,181 ----
***************
*** 186,199 ****
              print 'Loading state from',self.db_name,'DB_Dict'
  
!         if self.wordinfo.has_key(self.statekey):
!             self.nham, self.nspam = self.wordinfo[self.statekey]
!             
              if Corpus.Verbose:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
              if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
              self.nham = 0
              self.nspam = 0
--- 187,209 ----
              print 'Loading state from',self.db_name,'DB_Dict'
  
!         try:
!             wi = DBDict(self.db_name, 'r')
!         except anydbm.error:
!             wi = {}
!         
!         if wi.has_key(self.statekey):
              if Corpus.Verbose:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
+ 
+             self.nham, self.nspam = wi[self.statekey]
+ 
+             for word,info in wi:
+                 self.wordinfo[word] = info
          else:
              # new dbdict
              if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
+             self.wordinfo = {}
              self.nham = 0
              self.nspam = 0
***************
*** 202,209 ****
          '''Place state into persistent store'''
  
          if Corpus.Verbose:
              print 'Persisting',self.db_name,'state in DBDict'
  
!         self.wordinfo[self.statekey] = (self.nham, self.nspam)
  
  
--- 212,223 ----
          '''Place state into persistent store'''
  
+         wi = DBDict(self.db_name, 'c')
+ 
          if Corpus.Verbose:
              print 'Persisting',self.db_name,'state in DBDict'
  
!         wi[self.statekey] = (self.nham, self.nspam)
!         for word in self.wordinfo:
!             wi[word] = self.wordinfo[word]
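
Editor's sketch of the load/store semantic this log message describes: state lives in an ordinary dict and reaches disk only when store() is called, so a caller can quit without saving. It is written for Python 3 with the standard shelve module, which is an assumption; the class name ExplicitStore is hypothetical and DBDict is not used. Note that iterating a mapping yields keys only, so the values are fetched explicitly.

```python
import dbm
import shelve

STATE_KEY = "saved state"


class ExplicitStore:
    """Hypothetical stand-in for a classifier with explicit persistence."""

    def __init__(self, db_name):
        self.db_name = db_name
        self.wordinfo = {}
        self.nham = 0
        self.nspam = 0
        self.load()

    def load(self):
        """Copy any saved state into memory, then close the database."""
        try:
            db = shelve.open(self.db_name, "r")
        except dbm.error:
            return  # no database yet: keep the empty defaults
        try:
            if STATE_KEY in db:
                self.nham, self.nspam = db[STATE_KEY]
            for word in db:  # iteration yields keys, not pairs
                if word != STATE_KEY:
                    self.wordinfo[word] = db[word]
        finally:
            db.close()

    def store(self):
        """Write the in-memory state out; nothing is saved before this."""
        db = shelve.open(self.db_name, "c")
        try:
            db[STATE_KEY] = (self.nham, self.nspam)
            for word, info in self.wordinfo.items():
                db[word] = info
        finally:
            db.close()
```

A "quit" option then simply skips store(), while "save and quit" calls it first.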
  
  
From timstone4@projects.sourceforge.net  Mon Nov 18 13:04:25 2002
From: timstone4@projects.sourceforge.net (Tim Stone)
Date: Mon, 18 Nov 2002 05:04:25 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.4,1.5
Message-ID: <E18DlZh-0002WJ-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9648

Modified Files:
	Bayes.py 
Log Message:
Backed out load/store semantic.  Didn't work in all cases.

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** Bayes.py	18 Nov 2002 04:48:10 -0000	1.4
--- Bayes.py	18 Nov 2002 13:04:20 -0000	1.5
***************
*** 18,22 ****
      'spam'|'ham'|'unsure' for a message based on the spamprob against
      the ham_cutoff and spam_cutoff specified in Options.
!     
      PickledBayes is a concrete PersistentBayes class that uses a cPickle
      datastore.  This database is relatively small, but slower than other
--- 18,22 ----
      'spam'|'ham'|'unsure' for a message based on the spamprob against
      the ham_cutoff and spam_cutoff specified in Options.
! 
      PickledBayes is a concrete PersistentBayes class that uses a cPickle
      datastore.  This database is relatively small, but slower than other
***************
*** 132,136 ****
              self.nham = tempbayes.nham
              self.nspam = tempbayes.nspam
!             
              if Corpus.Verbose:
                  print '%s is an existing pickle, with %d ham and %d spam' \
--- 132,136 ----
              self.nham = tempbayes.nham
              self.nspam = tempbayes.nspam
! 
              if Corpus.Verbose:
                  print '%s is an existing pickle, with %d ham and %d spam' \
***************
*** 168,172 ****
  
          self.wordinfo, self.nspam, self.nham = t[1:]
!         
  
  class DBDictBayes(PersistentBayes):
--- 168,172 ----
  
          self.wordinfo, self.nspam, self.nham = t[1:]
! 
  
  class DBDictBayes(PersistentBayes):
***************
*** 187,209 ****
              print 'Loading state from',self.db_name,'DB_Dict'
  
!         try:
!             wi = DBDict(self.db_name, 'r')
!         except anydbm.error:
!             wi = {}
!         
!         if wi.has_key(self.statekey):
              if Corpus.Verbose:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
- 
-             self.nham, self.nspam = wi[self.statekey]
- 
-             for word,info in wi:
-                 self.wordinfo[word] = info
          else:
              # new dbdict
              if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
-             self.wordinfo = {}
              self.nham = 0
              self.nspam = 0
--- 187,202 ----
              print 'Loading state from',self.db_name,'DB_Dict'
  
!         self.wordinfo = DBDict(self.db_name, 'c')
! 
!         if self.wordinfo.has_key(self.statekey):
! 
!             self.nham, self.nspam = self.wordinfo[self.statekey]
              if Corpus.Verbose:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
              if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
              self.nham = 0
              self.nspam = 0
***************
*** 212,223 ****
          '''Place state into persistent store'''
  
-         wi = DBDict(self.db_name, 'c')
- 
          if Corpus.Verbose:
              print 'Persisting',self.db_name,'state in DBDict'
  
!         wi[self.statekey] = (self.nham, self.nspam)
!         for word in self.wordinfo:
!             wi[word] = self.wordinfo[word]
  
  
--- 205,212 ----
          '''Place state into persistent store'''
  
          if Corpus.Verbose:
              print 'Persisting',self.db_name,'state in DBDict'
  
!         self.wordinfo[self.statekey] = (self.nham, self.nspam)
  
  
From npickett@projects.sourceforge.net  Mon Nov 18 18:14:41 2002
From: npickett@projects.sourceforge.net (Neale Pickett)
Date: Mon, 18 Nov 2002 10:14:41 -0800
Subject: [Spambayes-checkins] spambayes hammie.py,1.39,1.40
	hammiefilter.py,1.1,1.2
Message-ID: <E18DqPx-0008Pe-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv31157

Modified Files:
	hammie.py hammiefilter.py 
Log Message:
* hammie.py now removes the header before adding it, so we can be
  sure the header we write is unique (thanks Todd Mokros)
* hammiefilter.py now uses Config values, although it's currently
  very yucky
* hammiefilter.py writes out pickles on exit


Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** hammie.py	17 Nov 2002 03:42:37 -0000	1.39
--- hammie.py	18 Nov 2002 18:13:54 -0000	1.40
***************
*** 284,287 ****
--- 284,291 ----
  
          msg = mboxutils.get_message(msg)
+         try:
+             del msg[header]
+         except KeyError:
+             pass
          prob, clues = self._scoremsg(msg, True)
          if prob < ham_cutoff:

Index: hammiefilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** hammiefilter.py	17 Nov 2002 03:42:37 -0000	1.1
--- hammiefilter.py	18 Nov 2002 18:14:04 -0000	1.2
***************
*** 31,44 ****
  """
  
  import sys
  import getopt
  import hammie
! from Options import options
  
  # See Options.py for explanations of these properties
- DBNAME = options.persistent_storage_file
- USEDB = options.persistent_use_database
  program = sys.argv[0]
  
  def usage(code, msg=''):
      """Print usage message and sys.exit(code)."""
--- 31,47 ----
  """
  
+ import os
  import sys
  import getopt
  import hammie
! import Options
! import StringIO
  
  # See Options.py for explanations of these properties
  program = sys.argv[0]
  
+ # Options
+ options = Options.options
+ 
  def usage(code, msg=''):
      """Print usage message and sys.exit(code)."""
***************
*** 49,59 ****
      sys.exit(code)
  
  def hammie_open(mode):
!     b = hammie.createbayes(DBNAME, USEDB, mode)
      return hammie.Hammie(b)
  
  def newdb():
!     hammie_open('n')
!     print "Created new database in", DBNAME
  
  def filter():
--- 52,73 ----
      sys.exit(code)
  
+ def jar_pickle(h):
+     if not options.persistent_use_database:
+         import pickle
+         fp = open(options.persistent_storage_file, 'wb')
+         pickle.dump(h.bayes, fp, 1)
+         fp.close()
+     
+ 
  def hammie_open(mode):
!     b = hammie.createbayes(options.persistent_storage_file,
!                            options.persistent_use_database,
!                            mode)
      return hammie.Hammie(b)
  
  def newdb():
!     h = hammie_open('n')
!     jar_pickle(h)
!     print "Created new database in", options.persistent_storage_file
  
  def filter():
***************
*** 67,70 ****
--- 81,85 ----
      h.train_ham(msg)
      h.update_probabilities()
+     jar_pickle(h)    
  
  def train_spam():
***************
*** 73,77 ****
      h.train_spam(msg)
      h.update_probabilities()
! 
  
  def main():
--- 88,92 ----
      h.train_spam(msg)
      h.update_probabilities()
!     jar_pickle(h)    
  
  def main():
***************
*** 87,90 ****
--- 102,115 ----
          elif opt == "-n":
              action = newdb
+ 
+     # hammiefilter overrides
+     config_overrides = """[Hammie]
+ persistent_storage_file = %s
+ persistent_use_database = True
+ """ % os.path.expanduser('~/.hammiedb')
+     options.mergefilelike(StringIO.StringIO(config_overrides))
+     options.mergefiles(['/etc/hammierc',
+                         os.path.expanduser('~/.hammierc')])
+ 
      action()
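
Editor's sketch of the remove-then-add header pattern from the hammie.py hunk above, in Python 3 with the standard email package. The helper name set_unique_header and the sample message are assumptions for illustration. With the standard Message class, assigning to a header appends another copy, while del removes every copy and is silent when the header is absent, so the try/except in the diff would only matter for message objects that behave differently.

```python
import email


def set_unique_header(msg, name, value):
    """Ensure exactly one header called name, holding value."""
    del msg[name]      # drops all existing copies; no error if absent
    msg[name] = value  # plain assignment would append, not replace


raw = (
    "Subject: hello\n"
    "X-Hammie-Disposition: No\n"
    "\n"
    "body text\n"
)

# Plain assignment leaves the stale header in place.
naive = email.message_from_string(raw)
naive["X-Hammie-Disposition"] = "Yes"

# Deleting first leaves only the new value.
fixed = email.message_from_string(raw)
set_unique_header(fixed, "X-Hammie-Disposition", "Yes")
```

Here naive ends up carrying both the old and the new header, while fixed carries only the new one.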
  

From tim_one@projects.sourceforge.net  Mon Nov 18 18:18:05 2002
From: tim_one@projects.sourceforge.net (Tim Peters)
Date: Mon, 18 Nov 2002 10:18:05 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.51,1.52
Message-ID: <E18DqTF-0000JX-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv818

Modified Files:
	classifier.py 
Log Message:
Repaired braino in comment.


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.51
retrieving revision 1.52
diff -C2 -d -r1.51 -r1.52
*** classifier.py	18 Nov 2002 01:40:04 -0000	1.51
--- classifier.py	18 Nov 2002 18:17:13 -0000	1.52
***************
*** 418,422 ****
      # repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
      # adding only once in training, a word like that was simply ignored until
!     # it appeared in 5 distinct training hams.  Without the ham-favoring
      # biases, though, and never ignoring words, counting n times introduces
      # a subtle and unhelpful bias.
--- 418,422 ----
      # repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
      # adding only once in training, a word like that was simply ignored until
!     # it appeared in 5 distinct training spams.  Without the ham-favoring
      # biases, though, and never ignoring words, counting n times introduces
      # a subtle and unhelpful bias.


From tim_one@projects.sourceforge.net  Mon Nov 18 18:23:43 2002
From: tim_one@projects.sourceforge.net (Tim Peters)
Date: Mon, 18 Nov 2002 10:23:43 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.52,1.53
Message-ID: <E18DqYh-0000wI-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv2946

Modified Files:
	classifier.py 
Log Message:
clearjunk():  More proof nobody ever tried this -- it would have blown
up at once w/ a NameError on mincount (left over from a previous version).


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** classifier.py	18 Nov 2002 18:17:13 -0000	1.52
--- classifier.py	18 Nov 2002 18:23:09 -0000	1.53
***************
*** 399,403 ****
  
          wordinfo = self.wordinfo
-         mincount = float(mincount)
          tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
          for w in tonuke:
--- 399,402 ----


From richiehindle@projects.sourceforge.net  Mon Nov 18 19:14:51 2002
From: richiehindle@projects.sourceforge.net (Richie Hindle)
Date: Mon, 18 Nov 2002 11:14:51 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.15,1.16
	Options.py,1.71,1.72
Message-ID: <E18DrMB-0005jD-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv21748

Modified Files:
	pop3proxy.py Options.py 
Log Message:
 o New web-based training interface for the POP3 proxy.
 o Fix for François' log-file-re-opening problem on the Mac.
 o Fix for François' classify-by-upload problem on the Mac.


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** pop3proxy.py	13 Nov 2002 18:19:45 -0000	1.15
--- pop3proxy.py	18 Nov 2002 19:14:48 -0000	1.16
***************
*** 2,6 ****
  
  """A POP3 proxy that works with classifier.py, and adds a simple
! X-Hammie-Disposition header (Yes or No) to each incoming email.
  You point pop3proxy at your POP3 server, and configure your email
  client to collect mail from the proxy then filter on the added
--- 2,6 ----
  
  """A POP3 proxy that works with classifier.py, and adds a simple
! X-Hammie-Disposition header (Yes/No/Unsure) to each incoming email.
  You point pop3proxy at your POP3 server, and configure your email
[...995 lines suppressed...]
!         if not state.serverName:
              print >>sys.stderr, \
                    ("Error: You must give a POP3 server name, either in\n"
***************
*** 1141,1147 ****
                     "command line.  pop3server.py -h prints a usage message.")
          else:
!             main(status.serverName, status.serverPort, status.proxyPort,
!                  status.uiPort, status.launchUI, status.pickleName,
!                  status.useDB)
  
      else:
--- 1414,1420 ----
                     "command line.  pop3server.py -h prints a usage message.")
          else:
!             main(state.serverName, state.serverPort, state.proxyPort,
!                  state.uiPort, state.launchUI, state.databaseFilename,
!                  state.useDB)
  
      else:

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.71
retrieving revision 1.72
diff -C2 -d -r1.71 -r1.72
*** Options.py	18 Nov 2002 01:40:03 -0000	1.71
--- Options.py	18 Nov 2002 19:14:48 -0000	1.72
***************
*** 362,367 ****
  pop3proxy_server_port: 110
  pop3proxy_port: 110
! pop3proxy_cache_use_gzip: True
  pop3proxy_cache_expiry_days: 7
  
  [html_ui]
--- 362,370 ----
  pop3proxy_server_port: 110
  pop3proxy_port: 110
! pop3proxy_cache_use_gzip: False
  pop3proxy_cache_expiry_days: 7
+ pop3proxy_spam_cache: pop3proxy-spam-cache
+ pop3proxy_ham_cache: pop3proxy-ham-cache
+ pop3proxy_unknown_cache: pop3proxy-unknown-cache
  
  [html_ui]
***************
*** 443,446 ****
--- 446,452 ----
                    'pop3proxy_cache_use_gzip': boolean_cracker,
                    'pop3proxy_cache_expiry_days': int_cracker,
+                   'pop3proxy_spam_cache': string_cracker,
+                   'pop3proxy_ham_cache': string_cracker,
+                   'pop3proxy_unknown_cache': string_cracker,
                    },
      'html_ui': {'html_ui_port': int_cracker,


From richiehindle@projects.sourceforge.net  Mon Nov 18 22:51:09 2002
From: richiehindle@projects.sourceforge.net (Richie Hindle)
Date: Mon, 18 Nov 2002 14:51:09 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.72,1.73
Message-ID: <E18DujV-0004KV-00@projects.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv16563

Modified Files:
	Options.py 
Log Message:
 o Fix for François' lack-of-os.getenv problem on the Mac.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72
retrieving revision 1.73
diff -C2 -d -r1.72 -r1.73
*** Options.py	18 Nov 2002 19:14:48 -0000	1.72
--- Options.py	18 Nov 2002 22:51:07 -0000	1.73
***************
*** 504,508 ****
  del d
  
! alternate = os.getenv('BAYESCUSTOMIZE')
  if alternate:
      options.mergefiles(alternate.split())
--- 504,510 ----
  del d
  
! alternate = None
! if hasattr(os, 'getenv'):
!     alternate = os.getenv('BAYESCUSTOMIZE')
  if alternate:
      options.mergefiles(alternate.split())


From tim_one@users.sourceforge.net  Tue Nov 19 02:13:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 18 Nov 2002 18:13:03 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.68,1.69
Message-ID: <E18Dxst-0000yT-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv3684

Modified Files:
	tokenizer.py 
Log Message:
Removed redundant import.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.68
retrieving revision 1.69
diff -C2 -d -r1.68 -r1.69
*** tokenizer.py	13 Nov 2002 06:25:08 -0000	1.68
--- tokenizer.py	19 Nov 2002 02:13:00 -0000	1.69
***************
*** 5,9 ****
  
  import email
- import email.Header
  import email.Message
  import email.Header
--- 5,8 ----


From hooft@users.sourceforge.net  Tue Nov 19 17:41:34 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Tue, 19 Nov 2002 09:41:34 -0800
Subject: [Spambayes-checkins] spambayes CostCounter.py,1.2,1.3
Message-ID: <E18ECNS-0007ul-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv30383

Modified Files:
	CostCounter.py 
Log Message:
More different cost counters as optimization targets

Index: CostCounter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CostCounter.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** CostCounter.py	16 Nov 2002 05:41:29 -0000	1.2
--- CostCounter.py	19 Nov 2002 17:41:28 -0000	1.3
***************
*** 14,18 ****
  
      def __str__(self):
!         return "%s: $%.2f" % (self.name, self.total)
  
  class CompositeCostCounter:
--- 14,18 ----
  
      def __str__(self):
!         return "%s: $%.4f" % (self.name, self.total)
  
  class CompositeCostCounter:
***************
*** 34,37 ****
--- 34,59 ----
          return '\n'.join(s)
  
+ class DelayedCostCounter(CompositeCostCounter):
+     def __init__(self,cclist):
+         CompositeCostCounter.__init__(self,cclist)
+         self.spamscr=[]
+         self.hamscr=[]
+ 
+     def spam(self, scr):
+         self.spamscr.append(scr)
+ 
+     def ham(self, scr):
+         self.hamscr.append(scr)
+ 
+     def __str__(self):
+         for scr in self.spamscr:
+             CompositeCostCounter.spam(self,scr)
+         for scr in self.hamscr:
+             CompositeCostCounter.ham(self,scr)
+         s=[]
+         for line in CompositeCostCounter.__str__(self).split('\n'):
+             s.append('Delayed-'+line)
+         return '\n'.join(s)
+ 
  class StdCostCounter(CostCounter):
      name = "Standard Cost"
***************
*** 65,72 ****
          self.total += self._lambda(scr) * options.best_cutoff_fp_weight
  
  def default():
       return CompositeCostCounter([
!                                   StdCostCounter(),
!                                   FlexCostCounter(),
!                                  ])
  
--- 87,117 ----
          self.total += self._lambda(scr) * options.best_cutoff_fp_weight
  
+ class Flex2CostCounter(FlexCostCounter):
+     name = "Flex**2 Cost"
+     def spam(self, scr):
+         self.total += (1 - self._lambda(scr))**2 * options.best_cutoff_fn_weight
+ 
+     def ham(self, scr):
+         self.total += self._lambda(scr)**2 * options.best_cutoff_fp_weight
+ 
  def default():
       return CompositeCostCounter([
!                 StdCostCounter(),
!                 FlexCostCounter(),
!                 Flex2CostCounter(),
!                 DelayedCostCounter([
!                     StdCostCounter(),
!                     FlexCostCounter(),
!                     Flex2CostCounter(),
!                 ])
!             ])
  
+ if __name__=="__main__":
+     cc=default()
+     cc.ham(0)
+     cc.spam(1)
+     cc.ham(0.5)
+     cc.spam(0.5)
+     options.spam_cutoff=0.7
+     options.ham_cutoff=0.4
+     print cc
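
Editor's sketch of the delayed-counting idea in this diff: scores are only recorded while the test runs, and the cost is worked out later, once the cutoffs are known. The class name DelayedCost, the three weights, and the simple per-message cost rule are assumptions for illustration and are not the CostCounter classes themselves. As posted, DelayedCostCounter.__str__ replays the stored scores into its children on every call, so printing it twice would appear to count everything twice; the sketch avoids that by computing the total from scratch each time.

```python
class DelayedCost:
    """Hypothetical cost counter that defers all evaluation."""

    def __init__(self):
        self.ham_scores = []
        self.spam_scores = []

    def ham(self, scr):
        self.ham_scores.append(scr)

    def spam(self, scr):
        self.spam_scores.append(scr)

    def total(self, ham_cutoff, spam_cutoff,
              fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        """Cost of the recorded scores under the given cutoffs."""
        cost = 0.0
        for scr in self.ham_scores:
            if scr > spam_cutoff:
                cost += fp_weight      # ham called spam
            elif scr > ham_cutoff:
                cost += unsure_weight  # ham left unsure
        for scr in self.spam_scores:
            if scr < ham_cutoff:
                cost += fn_weight      # spam called ham
            elif scr < spam_cutoff:
                cost += unsure_weight  # spam left unsure
        return cost


# Same four scores as the __main__ block in the diff.
cc = DelayedCost()
cc.ham(0)
cc.spam(1)
cc.ham(0.5)
cc.spam(0.5)
```

The same recorded scores can then be costed under several candidate cutoff pairs without rerunning the classifier.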


From hooft@users.sourceforge.net  Tue Nov 19 17:43:29 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Tue, 19 Nov 2002 09:43:29 -0800
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.29,1.30
Message-ID: <E18ECPJ-00086i-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv31142

Modified Files:
	TestDriver.py 
Log Message:
Some changes to enable the use of Delayed cost counters. It is a bit ugly to change the options on-the-fly, but OTOH: hey, this is a test driver!

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** TestDriver.py	15 Nov 2002 21:32:19 -0000	1.29
--- TestDriver.py	19 Nov 2002 17:43:27 -0000	1.30
***************
*** 120,123 ****
--- 120,125 ----
                (num_unh + num_uns)*1e2 / (ham.n + spam.n))
  
+     return float(bests[0][0])/n,float(bests[0][1])/n
+ 
  def printmsg(msg, prob, clues):
      print msg.tag
***************
*** 190,195 ****
      def alldone(self):
          if options.show_histograms:
!             printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
! 
          nham = self.global_ham_hist.n
          nspam = self.global_spam_hist.n
--- 192,201 ----
      def alldone(self):
          if options.show_histograms:
!             besthamcut,bestspamcut = printhist("all runs:", 
!                                                self.global_ham_hist, 
!                                                self.global_spam_hist)
!         else:
!             besthamcut = options.ham_cutoff
!             bestspamcut = options.spam_cutoff
          nham = self.global_ham_hist.n
          nspam = self.global_spam_hist.n
***************
*** 207,210 ****
--- 213,219 ----
                nfn * options.best_cutoff_fn_weight +
                nun * options.best_cutoff_unsure_weight)
+         # Set back the options for the delayed calculations in self.cc
+         options.ham_cutoff = besthamcut
+         options.spam_cutoff = bestspamcut
          print self.cc
  

From hooft@users.sourceforge.net  Tue Nov 19 17:44:27 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Tue, 19 Nov 2002 09:44:27 -0800
Subject: [Spambayes-checkins] spambayes simplexloop.py,1.1,1.2
Message-ID: <E18ECQF-0008DS-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv31566

Modified Files:
	simplexloop.py 
Log Message:
optimize only 3 parameters; some changes to make it easier to follow a run

Index: simplexloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/simplexloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** simplexloop.py	15 Nov 2002 21:35:15 -0000	1.1
--- simplexloop.py	19 Nov 2002 17:44:25 -0000	1.2
***************
*** 35,42 ****
  start = (Options.options.unknown_word_prob,
           Options.options.minimum_prob_strength,
!          Options.options.unknown_word_strength,
!          Options.options.spam_cutoff,
!          Options.options.ham_cutoff)
! err = (0.01, 0.01, 0.01, 0.005, 0.01)
  
  def mkini(vars):
--- 35,40 ----
  start = (Options.options.unknown_word_prob,
           Options.options.minimum_prob_strength,
!          Options.options.unknown_word_strength)
! err = (0.01, 0.01, 0.01)
  
  def mkini(vars):
***************
*** 47,54 ****
  minimum_prob_strength = %.6f
  unknown_word_strength = %.6f
- 
- [TestDriver]
- spam_cutoff = %.4f
- ham_cutoff = %.4f
  """%tuple(vars))
      f.close()
--- 45,48 ----
***************
*** 66,71 ****
      cost = float(txt[-1].split()[2][1:])
      f.close()
!     # print ''.join(txt[-4:])[:-1]
!     print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,))
      return -cost
  
--- 60,67 ----
      cost = float(txt[-1].split()[2][1:])
      f.close()
!     os.rename('loop.out','loop.out.old')
!     print ''.join(txt[-20:])[:-1]
!     print "x=%.4f p=%.4f s=%.4f %.2f"%(tuple(vars)+(cost,))
!     sys.stdout.flush()
      return -cost
  

From hooft@users.sourceforge.net  Tue Nov 19 21:55:00 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Tue, 19 Nov 2002 13:55:00 -0800
Subject: [Spambayes-checkins] spambayes CostCounter.py,1.3,1.4
Message-ID: <E18EGKi-0002uG-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv11148

Modified Files:
	CostCounter.py 
Log Message:
add simple unit counter; add nodelay function to instantiate all non-delayed cost counters only.

Index: CostCounter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CostCounter.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** CostCounter.py	19 Nov 2002 17:41:28 -0000	1.3
--- CostCounter.py	19 Nov 2002 21:54:57 -0000	1.4
***************
*** 56,59 ****
--- 56,112 ----
          return '\n'.join(s)
  
+ class CountCostCounter(CostCounter):
+     def __init__(self):
+         CostCounter.__init__(self)
+         self._fp = 0
+         self._fn = 0
+         self._unsure = 0
+         self._unsureham = 0
+         self._unsurespam = 0
+         self._spam = 0
+         self._ham = 0
+         self._correctham = 0
+         self._correctspam = 0
+         self._total = 0
+ 
+     def spam(self, scr):
+         self._total += 1
+         self._spam += 1
+         if scr < options.ham_cutoff:
+             self._fn += 1
+         elif scr < options.spam_cutoff:
+             self._unsure += 1
+             self._unsurespam += 1
+         else:
+             self._correctspam += 1
+ 
+     def ham(self, scr):
+         self._total += 1
+         self._ham += 1
+         if scr > options.spam_cutoff:
+             self._fp += 1
+         elif scr > options.ham_cutoff:
+             self._unsure += 1
+             self._unsureham += 1
+         else:
+             self._correctham += 1
+ 
+     def __str__(self):
+          return ("Total messages: %d; %d (%.1f%%) ham + %d (%.1f%%) spam\n"%(
+                      self._total,
+                      self._ham, (100.*self._ham)/self._total,
+                      self._spam, (100.*self._spam)/self._total)+
+                  "Ham: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fp\n"%(
+                      self._correctham, (100.*self._correctham)/self._ham,
+                      self._unsureham, (100.*self._unsureham)/self._ham,
+                      self._fp, (100.*self._fp)/self._ham)+
+                  "Spam: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fn\n"%(
+                      self._correctspam, (100.*self._correctspam)/self._spam,
+                      self._unsurespam, (100.*self._unsurespam)/self._spam,
+                      self._fn, (100.*self._fn)/self._spam)+
+                  "Score False: %.2f%% Unsure %.2f%%"%(
+                      (100.*(self._fp+self._fn))/self._total,
+                      (100.*self._unsure)/self._total))
+ 
  class StdCostCounter(CostCounter):
      name = "Standard Cost"
***************
*** 97,108 ****
--- 150,171 ----
  def default():
       return CompositeCostCounter([
+                 CountCostCounter(),
                  StdCostCounter(),
                  FlexCostCounter(),
                  Flex2CostCounter(),
                  DelayedCostCounter([
+                     CountCostCounter(),
                      StdCostCounter(),
                      FlexCostCounter(),
                      Flex2CostCounter(),
                  ])
+             ])
+ 
+ def nodelay():
+      return CompositeCostCounter([
+                 CountCostCounter(),
+                 StdCostCounter(),
+                 FlexCostCounter(),
+                 Flex2CostCounter(),
              ])
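
Editor's sketch of the three-way bucketing that CountCostCounter performs, as standalone Python 3 functions. The names bucket and percent and the default cutoffs are assumptions for illustration; the comparison operators follow the diff. As posted, __str__ divides by the ham, spam and total counts directly, so it would raise ZeroDivisionError if it were printed before at least one ham and one spam had been scored; percent() shows one way to guard against that.

```python
def bucket(scr, is_spam, ham_cutoff=0.2, spam_cutoff=0.9):
    """Return 'ok', 'unsure', 'fp' or 'fn' for one scored message."""
    if is_spam:
        if scr < ham_cutoff:
            return "fn"       # spam scored like ham
        elif scr < spam_cutoff:
            return "unsure"
        return "ok"
    if scr > spam_cutoff:
        return "fp"           # ham scored like spam
    elif scr > ham_cutoff:
        return "unsure"
    return "ok"


def percent(part, whole):
    """Percentage that tolerates an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


# (score, is_spam) pairs, values made up for the example.
scored = [(0.05, False), (0.95, False), (0.5, True), (0.99, True)]
counts = {"ok": 0, "unsure": 0, "fp": 0, "fn": 0}
for scr, is_spam in scored:
    counts[bucket(scr, is_spam)] += 1
```

Because the spam branch uses < and the ham branch uses >, a score exactly on a cutoff is bucketed differently depending on the message's true class.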
  

From hooft@users.sourceforge.net  Tue Nov 19 22:38:40 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Tue, 19 Nov 2002 14:38:40 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.4,1.5
Message-ID: <E18EH0y-0007to-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv30359

Modified Files:
	weaktest.py 
Log Message:
more flexible design of what to train on

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** weaktest.py	16 Nov 2002 05:42:35 -0000	1.4
--- weaktest.py	19 Nov 2002 22:38:37 -0000	1.5
***************
*** 21,24 ****
--- 21,30 ----
          Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
          This is required.
+     -d decider 
+         Name of the decider. One of %(decisionkeys)s
+     -u updater
+         Name of the updater. One of %(updaterkeys)s
+     -m min
+         Minimal number of messages to train on before involving the decider.
  
  In addition, an attempt is made to merge bayescustomize.ini into the options.
***************
*** 48,52 ****
      sys.exit(code)
  
! def drive(nsets):
      print options.display()
  
--- 54,143 ----
      sys.exit(code)
  
! class TrainDecision:
!     def __call__(self,scr,is_spam):
!         if is_spam:
!             return self.spamtrain(scr)
!         else:
!             return self.hamtrain(scr)
! 
! class UnsureAndFalses(TrainDecision):
!     def spamtrain(self,scr):
!         return scr < options.spam_cutoff
! 
!     def hamtrain(self,scr):
!         return scr > options.ham_cutoff
! 
! class UnsureOnly(TrainDecision):
!     def spamtrain(self,scr):
!         return options.ham_cutoff < scr < options.spam_cutoff
! 
!     hamtrain = spamtrain
! 
! class All(TrainDecision):
!     def spamtrain(self,scr):
!         return 1
! 
!     hamtrain = spamtrain
! 
! class AllBut0and100(TrainDecision):
!     def spamtrain(self,scr):
!         return scr < 0.995
! 
!     def hamtrain(self,scr):
!         return scr > 0.005
! 
! decisions={'all': All,
!            'allbut0and100': AllBut0and100,
!            'unsureonly': UnsureOnly,
!            'unsureandfalses': UnsureAndFalses,
!           }
! decisionkeys=decisions.keys()
! decisionkeys.sort()
! 
! class FirstN:
!     def __init__(self,n,client):
!         self.client = client
!         self.x = 0
!         self.n = n
! 
!     def __call__(self,scr,is_spam):
!         self.x += 1
!         if self.tooearly():
!             return True
!         else:
!             return self.client(scr,is_spam)
!     
!     def tooearly(self):
!         return self.x < self.n
! 
! class Updater:
!     def __init__(self,d=None):
!         self.setd(d)
! 
!     def setd(self,d):
!         self.d=d
! 
! class AlwaysUpdate(Updater):
!     def __call__(self):
!         self.d.update_probabilities()
! 
! class SometimesUpdate(Updater):
!     def __init__(self,d=None,factor=10):
!         Updater.__init__(self,d)
!         self.factor=factor
!         self.n = 0
! 
!     def __call__(self):
!         self.n += 1
!         if self.n % self.factor == 0:
!             self.d.update_probabilities()
! 
! updaters={'always':AlwaysUpdate,
!           'sometimes':SometimesUpdate,
!          }
! updaterkeys=updaters.keys()
! updaterkeys.sort()
! 
! def drive(nsets,decision,updater):
      print options.display()
  
***************
*** 59,63 ****
      nham = len(hamfns)
      nspam = len(spamfns)
!     cc = CostCounter.default()
  
      allfns = {}
--- 150,154 ----
      nham = len(hamfns)
      nspam = len(spamfns)
!     cc = CostCounter.nodelay()
  
      allfns = {}
***************
*** 66,141 ****
  
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
  
-     n = 0
-     unsure = 0
      hamtrain = 0
      spamtrain = 0
!     fp = 0
!     fn = 0
!     SPC = options.spam_cutoff
!     HC = options.ham_cutoff
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
          m=msgs.Msg(dir, name).guts
!         if debug:
!             print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name),
!         if hamtrain + spamtrain > 30:
!             scr=d.score(m)
!         else:
!             scr=0.50
!         if debug:
!             print "score:%.3f"%scr,
!         if is_spam:
!             cc.spam(scr)
!         else:
!             cc.ham(scr)
!         if scr < SPC and is_spam:
!             if scr < HC:
!                 fn += 1
!                 if debug:
!                     print "fn"
              else:
!                 unsure += 1
!                 if debug:
!                     print "Unsure"
!             spamtrain += 1
!             d.train_spam(m)
!             d.update_probabilities()
!         elif scr > HC and not is_spam:
!             if scr > SPC:
!                 fp += 1
!                 if debug:
!                     print "fp"
!                 else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir, name), scr)
              else:
!                 unsure += 1
!                 if debug:
!                     print "Unsure"
!             hamtrain += 1
!             d.train_ham(m)
!             d.update_probabilities()
!         else:
!             if debug:
!                 print "OK"
          if n % 100 == 0:
!             print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam)
!     print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure, unsure * 100.0 / len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
!     print "fp: %d fn: %d"%(fp, fn)
      print cc
  
  def main():
      import getopt
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:')
      except getopt.error, msg:
          usage(1, msg)
  
      nsets = None
      for opt, arg in opts:
          if opt == '-h':
--- 157,214 ----
  
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
+     updater.setd(d)
  
      hamtrain = 0
      spamtrain = 0
!     n = 0
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
          m=msgs.Msg(dir, name).guts
!         if debug > 1:
!             print "trained:%dH+%dS"%(hamtrain,spamtrain)
!         scr=d.score(m)
!         if debug > 1:
!             print "score:%.3f"%scr
!         if not decision.tooearly():
!             if is_spam:
!                 if debug > 0:
!                     print "Spam with score %.2f"%scr
!                 cc.spam(scr)
              else:
!                 if debug > 0:
!                     print "Ham with score %.2f"%scr
!                 cc.ham(scr)
!         if decision(scr,is_spam):
!             if is_spam:
!                 d.train_spam(m)
!                 spamtrain += 1
              else:
!                 d.train_ham(m)
!                 hamtrain += 1
!             updater()
          if n % 100 == 0:
!             print "%5d trained:%dH+%dS wrds:%d"%(
!                 n, hamtrain, spamtrain, len(d.bayes.wordinfo))
!             print cc
!     print "="*70
!     print "%5d trained:%dH+%dS wrds:%d"%(
!         n, hamtrain, spamtrain, len(d.bayes.wordinfo))
      print cc
  
  def main():
+     global debug
+ 
      import getopt
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'vd:u:hn:m:')
      except getopt.error, msg:
          usage(1, msg)
  
      nsets = None
+     decision = decisions['unsureonly']
+     updater = updaters['always']
+     m = 10
+ 
      for opt, arg in opts:
          if opt == '-h':
***************
*** 143,146 ****
--- 216,231 ----
          elif opt == '-n':
              nsets = int(arg)
+         elif opt == '-v':
+             debug += 1
+         elif opt == '-m':
+             m = int(arg)
+         elif opt == '-d':
+             if not decisions.has_key(arg):
+                 usage(1,'Unknown decisionmaker')
+             decision = decisions[arg]
+         elif opt == '-u':
+             if not updaters.has_key(arg):
+                 usage(1,'Unknown updater')
+             updater = updaters[arg]
  
      if args:
***************
*** 149,153 ****
          usage(1, "-n is required")
  
!     drive(nsets)
  
  if __name__ == "__main__":
--- 234,238 ----
          usage(1, "-n is required")
  
!     drive(nsets,decision=FirstN(m,decision()),updater=updater())
  
  if __name__ == "__main__":
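
Note on the design above: the diff splits "should we train on this message?" (the decider, wrapped in FirstN) from "when do we recompute probabilities?" (the updater). The following is a minimal Python 3 sketch of how those pieces compose. The cutoff values, the StubClassifier class and the run() helper are assumptions made for illustration and are not part of weaktest.py. As in the diff, FirstN compares with "<", so it forces training on the first n - 1 messages only.

```python
# Illustrative Python 3 sketch; the cutoffs and StubClassifier are
# assumptions made for this example, not values from weaktest.py.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class UnsureOnly:
    """Train only on messages whose score falls between the cutoffs."""

    def __call__(self, score, is_spam):
        return HAM_CUTOFF < score < SPAM_CUTOFF


class FirstN:
    """Force training on the early messages, then defer to a client decider."""

    def __init__(self, n, client):
        self.n = n
        self.client = client
        self.seen = 0

    def __call__(self, score, is_spam):
        self.seen += 1
        if self.tooearly():
            return True
        return self.client(score, is_spam)

    def tooearly(self):
        return self.seen < self.n


class SometimesUpdate:
    """Recompute probabilities on every `factor`-th call only."""

    def __init__(self, classifier, factor=10):
        self.classifier = classifier
        self.factor = factor
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls % self.factor == 0:
            self.classifier.update_probabilities()


class StubClassifier:
    """Hypothetical stand-in that just counts probability updates."""

    def __init__(self):
        self.updates = 0

    def update_probabilities(self):
        self.updates += 1


def run(scores, n=3, factor=2):
    """Return (per-message training decisions, number of updates)."""
    classifier = StubClassifier()
    decision = FirstN(n, UnsureOnly())
    updater = SometimesUpdate(classifier, factor)
    trained = []
    for score in scores:
        train = decision(score, is_spam=False)
        trained.append(train)
        if train:
            updater()
    return trained, classifier.updates
```

Because the decider and the updater are plain callables, a new training policy only needs another small class and an entry in the lookup table; the driver loop itself stays unchanged.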


From mhammond@users.sourceforge.net  Tue Nov 19 22:52:27 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 19 Nov 2002 14:52:27 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.6,1.7
Message-ID: <E18EHEJ-0000eW-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv2231

Modified Files:
	README.txt 
Log Message:
Let Sean off the tech-support hook <wink>


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** README.txt	2 Nov 2002 04:08:02 -0000	1.6
--- README.txt	19 Nov 2002 22:52:25 -0000	1.7
***************
*** 65,69 ****
    necessary for you to *see* the score, not for the scoring to work.
  
! * Filtering an Exchange Server public store appears to not work.
  
  * Sean reports bad output saving very large classifiers in training.py.
--- 65,70 ----
    necessary for you to *see* the score, not for the scoring to work.
  
! * Filtering an Exchange Server public store appears to not work (is this 
!   still true?)
  
  * Sean reports bad output saving very large classifiers in training.py.
***************
*** 79,88 ****
  Licensed under PSF, see Tim Peters for IANAL interpretation.
  
! Ask me technical questions, and if your mail doesn't get eaten by a broken
! spam filter, I'll try to help.
  -- Sean
  seant@iname.com
- 
- Ask Sean all the technical questions <wink>
  -- Mark
  mhammond@skippinet.com.au
--- 80,88 ----
  Licensed under PSF, see Tim Peters for IANAL interpretation.
  
! Please send all comments, queries, support questions etc to the SpamBayes
! mailing list - see http://mail.python.org/mailman-21/listinfo/spambayes
! 
  -- Sean
  seant@iname.com
  -- Mark
  mhammond@skippinet.com.au


From npickett@users.sourceforge.net  Tue Nov 19 23:31:46 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Tue, 19 Nov 2002 15:31:46 -0800
Subject: [Spambayes-checkins] spambayes dbdict.py,NONE,1.1
Message-ID: <E18EHqM-0004M9-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv16599

Added Files:
	dbdict.py 
Log Message:
* new DBDict module


--- NEW FILE: dbdict.py ---
#! /usr/bin/env python

from __future__ import generators
import dbhash
try:
    import cPickle as pickle
except ImportError:
    import pickle

class DBDict:
    """Database Dictionary.

    This wraps a dbhash database to make it look even more like a
    dictionary, much like the Shelf class in the standard shelve module.
    The difference is that a DBDict supports all dict methods.

    Call it with the database.  Optionally, you can specify a list of
    keys to skip when iterating.  This only affects iterators; things
    like .keys() still list everything.  For instance:

    >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo'))
    >>> d['skipme'] = 'booga'
    >>> d['countme'] = 'wakka'
    >>> print d.keys()
    ['skipme', 'countme']
    >>> for k in d.iterkeys():
    ...     print k
    countme

    """

    def __init__(self, dbname, mode, iterskip=()):
        self.hash = dbhash.open(dbname, mode)
        self.iterskip = iterskip

    def __getitem__(self, key):
        return pickle.loads(self.hash[key])

    def __setitem__(self, key, val):
        self.hash[key] = pickle.dumps(val, 1)

    def __delitem__(self, key):
        del self.hash[key]

    def __iter__(self, fn=None):
        k = self.hash.first()
        while k is not None:
            key = k[0]
            val = self.__getitem__(key)
            if key not in self.iterskip:
                if fn:
                    yield fn((key, val))
                else:
                    yield (key, val)
            try:
                k = self.hash.next()
            except KeyError:
                break

    def __contains__(self, name):
        return self.has_key(name)

    def __getattr__(self, name):
        # Pass the buck
        return getattr(self.hash, name)

    def get(self, key, dfl=None):
        if self.has_key(key):
            return self[key]
        else:
            return dfl

    def iteritems(self):
        return self.__iter__()

    def iterkeys(self):
        return self.__iter__(lambda k: k[0])

    def itervalues(self):
        return self.__iter__(lambda k: k[1])

open = DBDict

def _test():
    import doctest
    import dbdict

    doctest.testmod(dbdict)

if __name__ == '__main__':
    _test()
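
Note on the module above: the core idea is a dictionary facade over a dbm-style file, with values pickled on the way in and unpickled on the way out, plus a skip list that affects iteration only. The following is a minimal Python 3 sketch of that idea, not a port of dbdict.py. It assumes the standard dbm module in place of dbhash, and the PickleDBDict class and demo() helper are hypothetical names introduced for illustration.

```python
import dbm
import os
import pickle
import tempfile


class PickleDBDict:
    """Hypothetical dict-like wrapper that pickles values into a dbm file."""

    def __init__(self, dbname, mode="c", iterskip=()):
        self.db = dbm.open(dbname, mode)
        # dbm stores keys as bytes, so normalise the skip list once.
        self.iterskip = {self._key(k) for k in iterskip}

    @staticmethod
    def _key(key):
        return key.encode("utf-8") if isinstance(key, str) else key

    def __getitem__(self, key):
        return pickle.loads(self.db[self._key(key)])

    def __setitem__(self, key, val):
        self.db[self._key(key)] = pickle.dumps(val)

    def __delitem__(self, key):
        del self.db[self._key(key)]

    def __contains__(self, key):
        return self._key(key) in self.db

    def keys(self):
        """List every key, including the ones iteration skips."""
        return sorted(k.decode("utf-8") for k in self.db.keys())

    def iterkeys(self):
        """Yield keys, leaving out anything named in iterskip."""
        for k in self.db.keys():
            if k not in self.iterskip:
                yield k.decode("utf-8")

    def close(self):
        self.db.close()


def demo():
    """Return (all keys, iterated keys, remaining keys after a delete)."""
    with tempfile.TemporaryDirectory() as tmp:
        d = PickleDBDict(os.path.join(tmp, "goober"), "c", ("skipme",))
        d["skipme"] = "booga"
        d["countme"] = "wakka"
        everything = d.keys()
        iterated = sorted(d.iterkeys())
        del d["skipme"]
        remaining = d.keys()
        d.close()
        return everything, iterated, remaining
```

The skip list is what lets a classifier keep a bookkeeping record (such as a saved-state key) in the same file as its word data without that record showing up when the words are iterated.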


From npickett@users.sourceforge.net  Tue Nov 19 23:45:27 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Tue, 19 Nov 2002 15:45:27 -0800
Subject: [Spambayes-checkins] 
 spambayes Bayes.py,1.5,1.5.2.1 Options.py,1.72,1.72.2.1
 hammie.py,1.40,1.40.2.1 hammiefilter.py,1.2,1.2.2.1
 pop3proxy.py,1.16,1.16.2.1
Message-ID: <E18EI3b-0005mW-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv20373

Modified Files:
      Tag: hammie-playground
	Bayes.py Options.py hammie.py hammiefilter.py pop3proxy.py 
Log Message:
* Removes DBDict and PersistentBayes from hammie.py
* hammie.py is no longer an executable, just a container for the
  Hammie class
* Splits persistent_use_database into pop3proxy and hammiefilter
  sections


Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.5
retrieving revision 1.5.2.1
diff -C2 -d -r1.5 -r1.5.2.1
*** Bayes.py	18 Nov 2002 13:04:20 -0000	1.5
--- Bayes.py	19 Nov 2002 23:45:24 -0000	1.5.2.1
***************
*** 41,47 ****
      o ZODBBayes
      o Would Trainer.trainall really want to train with the whole corpus,
!       or just a random subset?
!     o Corpus.Verbose is a bit of a strange thing to have.  Verbose should be
!       in the global namespace, but how do you get it there?
      o Suggestions?
  
--- 41,47 ----
      o ZODBBayes
      o Would Trainer.trainall really want to train with the whole corpus,
!         or just a random subset?
!     o Corpus.Verbose is a bit of a strange thing to have.  Verbose
!         should be in the global namespace, but how do you get it there?
      o Suggestions?
  
***************
*** 57,65 ****
  
  import Corpus
! from classifier import Bayes
  from Options import options
- from hammie import DBDict     # hammie only for DBDict, which should
-                               # probably really be somewhere else
  import cPickle as pickle
  import errno
  import copy
--- 57,64 ----
  
  import Corpus
! import classifier
  from Options import options
  import cPickle as pickle
+ import dbdict
  import errno
  import copy
***************
*** 70,74 ****
  UPDATEPROBS = True       # Probabilities will be autoupdated with training
  
! class PersistentBayes(Bayes):
      '''Persistent Bayes database object'''
  
--- 69,73 ----
  UPDATEPROBS = True       # Probabilities will be autoupdated with training
  
! class PersistentBayes(classifier.Bayes):
      '''Persistent Bayes database object'''
  
***************
*** 170,179 ****
  
  
  class DBDictBayes(PersistentBayes):
      '''Bayes object persisted in a hammie.DB_Dict'''
  
!     def __init__(self, db_name):
          '''Constructor(database name)'''
  
          self.db_name = db_name
          self.statekey = "saved state"
--- 169,215 ----
  
  
+ class WIDict(dbdict.DBDict):
+     """DBDict optimized for holding lots of WordInfo objects.
+ 
+     Normally, the pickler can figure out that you're pickling the same
+     type of thing over and over, and will just tag the type with a new
+     byte, thus reducing Administrative Pickle Bloat(R).  Since the
+     DBDict continually creates new picklers, however, nothing ever gets
+     the chance to do this optimization.
+ 
+     The WIDict class forces this optimization by stealing the
+     (currently) unused 'W' pickle type for WordInfo objects.  This
+     results in about a 50% reduction in database size.
+ 
+     """
+ 
+     def __getitem__(self, key):
+         v = self.hash[key]
+         if v[0] == 'W':
+             val = pickle.loads(v[1:])
+             # We could be sneaky, like pickle.Unpickler.load_inst,
+             # but I think that's overly confusing.
+             obj = classifier.WordInfo(0)
+             obj.__setstate__(val)
+             return obj
+         else:
+             return pickle.loads(v)
+ 
+     def __setitem__(self, key, val):
+         if isinstance(val, classifier.WordInfo):
+             val = val.__getstate__()
+             v = 'W' + pickle.dumps(val, 1)
+         else:
+             v = pickle.dumps(val, 1)
+         self.hash[key] = v
+ 
+ 
  class DBDictBayes(PersistentBayes):
      '''Bayes object persisted in a hammie.DB_Dict'''
  
!     def __init__(self, db_name, mode='c'):
          '''Constructor(database name)'''
  
+         self.mode = mode
          self.db_name = db_name
          self.statekey = "saved state"
***************
*** 187,191 ****
              print 'Loading state from',self.db_name,'DB_Dict'
  
!         self.wordinfo = DBDict(self.db_name, 'c')
  
          if self.wordinfo.has_key(self.statekey):
--- 223,228 ----
              print 'Loading state from',self.db_name,'DB_Dict'
  
!         self.wordinfo = WIDict(self.db_name, self.mode,
!                                iterskip=[self.statekey])
  
          if self.wordinfo.has_key(self.statekey):
***************
*** 217,221 ****
      def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
          '''Constructor(Bayes, \
!                        Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
  
          self.bayes = bayes
--- 254,258 ----
      def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
          '''Constructor(Bayes, \
!             Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
  
          self.bayes = bayes
***************
*** 287,289 ****
  
  if __name__ == '__main__':
!     print >>sys.stderr, __doc__
\ No newline at end of file
--- 324,326 ----
  
  if __name__ == '__main__':
!     print >>sys.stderr, __doc__
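
Note on WIDict above: the size saving comes from storing only the object's state behind a one-byte tag, so the class name is never written per record. The following is a minimal Python 3 sketch of that tagging scheme. The WordInfo class and its two fields are stand-ins assumed for illustration (the real classifier.WordInfo may differ), and the sketch relies on the source's statement that 'W' is not used as a pickle type code, so no ordinary pickle begins with that byte.

```python
import pickle


class WordInfo:
    """Hypothetical stand-in for classifier.WordInfo (fields are assumptions)."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state


def encode(val):
    """Pickle val; WordInfo objects are stored as a tagged state tuple."""
    if isinstance(val, WordInfo):
        # Only the state tuple is pickled, so no class reference is stored.
        return b"W" + pickle.dumps(val.__getstate__(), 1)
    return pickle.dumps(val, 1)


def decode(raw):
    """Inverse of encode(): rebuild a WordInfo when the tag byte is present."""
    if raw[:1] == b"W":
        obj = WordInfo()
        obj.__setstate__(pickle.loads(raw[1:]))
        return obj
    return pickle.loads(raw)
```

Untagged values, such as the saved-state tuple, still go through the ordinary pickle path, so a database can mix both kinds of record.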

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72
retrieving revision 1.72.2.1
diff -C2 -d -r1.72 -r1.72.2.1
*** Options.py	18 Nov 2002 19:14:48 -0000	1.72
--- Options.py	19 Nov 2002 23:45:24 -0000	1.72.2.1
***************
*** 349,356 ****
  persistent_storage_file: hammie.db
  
! # hammie can use either a database (quick to score one message) or a pickle
! # (quick to train on huge amounts of messages). Set this to True to use a
! # database by default.
! persistent_use_database: False
  
  [pop3proxy]
--- 349,357 ----
  persistent_storage_file: hammie.db
  
! [hammiefilter]
! # hammiefilter can use either a database (quick to score one message) or
! # a pickle (quick to train on huge amounts of messages). Set this to
! # True to use a database by default.
! hammiefilter_persistent_use_database: False
  
  [pop3proxy]
***************
*** 367,370 ****
--- 368,372 ----
  pop3proxy_ham_cache: pop3proxy-ham-cache
  pop3proxy_unknown_cache: pop3proxy-unknown-cache
+ pop3proxy_persistent_use_database: False
  
  [html_ui]
***************
*** 441,444 ****
--- 443,448 ----
                 'hammie_debug_header_name': string_cracker,
                 },
+     'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker,
+                       },
      'pop3proxy': {'pop3proxy_server_name': string_cracker,
                    'pop3proxy_server_port': int_cracker,
***************
*** 449,452 ****
--- 453,457 ----
                    'pop3proxy_ham_cache': string_cracker,
                    'pop3proxy_unknown_cache': string_cracker,
+                   'pop3proxy_persistent_use_database': boolean_cracker,
                    },
      'html_ui': {'html_ui_port': int_cracker,

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40
retrieving revision 1.40.2.1
diff -C2 -d -r1.40 -r1.40.2.1
*** hammie.py	18 Nov 2002 18:13:54 -0000	1.40
--- hammie.py	19 Nov 2002 23:45:24 -0000	1.40.2.1
***************
*** 1,56 ****
  #! /usr/bin/env python
  
- # A driver for the classifier module and Tim's tokenizer that you can
- # call from procmail.
- 
- """Usage: %(program)s [options]
- 
- Where:
-     -h
-         show usage and exit
-     -g PATH
-         mbox or directory of known good messages (non-spam) to train on.
-         Can be specified more than once, or use - for stdin.
-     -s PATH
-         mbox or directory of known spam messages to train on.
-         Can be specified more than once, or use - for stdin.
-     -u PATH
-         mbox of unknown messages.  A ham/spam decision is reported for each.
-         Can be specified more than once.
-     -r
-         reverse the meaning of the check (report ham instead of spam).
-         Only meaningful with the -u option.
-     -p FILE
-         use file as the persistent store.  loads data from this file if it
-         exists, and saves data to this file at the end.
-         Default: %(DEFAULTDB)s
-     -d
-         use the DBM store instead of cPickle.  The file is larger and
-         creating it is slower, but checking against it is much faster,
-         especially for large word databases. Default: %(USEDB)s
-     -D
-         the reverse of -d: use the cPickle instead of DBM
-     -f
-         run as a filter: read a single message from stdin, add an
-         %(DISPHEADER)s header, and write it to stdout.  If you want to
-         run from procmail, this is your option.
- """
- 
- from __future__ import generators
- 
- import sys
- import os
- import types
- import getopt
- import mailbox
- import glob
- import email
- import errno
- import anydbm
- import cPickle as pickle
  
  import mboxutils
! import classifier
  from Options import options
  
  try:
--- 1,10 ----
  #! /usr/bin/env python
  
  
+ import dbdict
  import mboxutils
! import Bayes
  from Options import options
+ from tokenizer import tokenize
  
  try:
***************
*** 61,224 ****
  
  
! program = sys.argv[0] # For usage(); referenced by docstring above
! 
! # Name of the header to add in filter mode
! DISPHEADER = options.hammie_header_name
! DEBUGHEADER = options.hammie_debug_header_name
! DODEBUG = options.hammie_debug_header
! 
! # Default database name
! DEFAULTDB = options.persistent_storage_file
! 
! # Probability at which a message is considered spam
! SPAM_THRESHOLD = options.spam_cutoff
! HAM_THRESHOLD = options.ham_cutoff
! 
! # Probability limit for a clue to be added to the DISPHEADER
! SHOWCLUE = options.clue_mailheader_cutoff
! 
! # Use a database? If False, use a pickle
! USEDB = options.persistent_use_database
! 
! # Tim's tokenizer kicks far more booty than anything I would have
! # written.  Score one for analysis ;)
! from tokenizer import tokenize
! 
! class DBDict:
! 
!     """Database Dictionary.
! 
!     This wraps an anydbm to make it look even more like a dictionary.
! 
!     Call it with the name of your database file.  Optionally, you can
!     specify a list of keys to skip when iterating.  This only affects
!     iterators; things like .keys() still list everything.  For instance:
! 
!     >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
!     >>> d['skipme'] = 'booga'
!     >>> d['countme'] = 'wakka'
!     >>> print d.keys()
!     ['skipme', 'countme']
!     >>> for k in d.iterkeys():
!     ...     print k
!     countme
! 
!     """
! 
!     def __init__(self, dbname, mode, iterskip=()):
!         self.hash = anydbm.open(dbname, mode)
!         self.iterskip = iterskip
! 
!     def __getitem__(self, key):
!         v = self.hash[key]
!         if v[0] == 'W':
!             val = pickle.loads(v[1:])
!             # We could be sneaky, like pickle.Unpickler.load_inst,
!             # but I think that's overly confusing.
!             obj = classifier.WordInfo(0)
!             obj.__setstate__(val)
!             return obj
!         else:
!             return pickle.loads(v)
! 
!     def __setitem__(self, key, val):
!         if isinstance(val, classifier.WordInfo):
!             val = val.__getstate__()
!             v = 'W' + pickle.dumps(val, 1)
!         else:
!             v = pickle.dumps(val, 1)
!         self.hash[key] = v
! 
!     def __delitem__(self, key, val):
!         del(self.hash[key])
! 
!     def __iter__(self, fn=None):
!         k = self.hash.first()
!         while k != None:
!             key = k[0]
!             val = self.__getitem__(key)
!             if key not in self.iterskip:
!                 if fn:
!                     yield fn((key, val))
!                 else:
!                     yield (key, val)
!             try:
!                 k = self.hash.next()
!             except KeyError:
!                 break
! 
!     def __contains__(self, name):
!         return self.has_key(name)
! 
!     def __getattr__(self, name):
!         # Pass the buck
!         return getattr(self.hash, name)
! 
!     def get(self, key, dfl=None):
!         if self.has_key(key):
!             return self[key]
!         else:
!             return dfl
! 
!     def iteritems(self):
!         return self.__iter__()
! 
!     def iterkeys(self):
!         return self.__iter__(lambda k: k[0])
! 
!     def itervalues(self):
!         return self.__iter__(lambda k: k[1])
! 
! 
! class PersistentBayes(classifier.Bayes):
! 
!     """A persistent Bayes classifier.
! 
!     This is just like classifier.Bayes, except that the dictionary is a
!     database.  You take less disk this way and you can pretend it's
!     persistent.  The tradeoffs vs. a pickle are: 1. it's slower
!     training, but faster checking, and 2. it needs less memory to run,
!     but takes more space on the hard drive.
  
!     On destruction, an instantiation of this class will write its state
!     to a special key.  When you instantiate a new one, it will attempt
!     to read these values out of that key again, so you can pick up where
!     you left off.
  
      """
  
-     # XXX: Would it be even faster to remember (in a list) which keys
-     # had been modified, and only recalculate those keys?  No sense in
-     # going over the entire word database if only 100 words are
-     # affected.
- 
-     # XXX: Another idea: cache stuff in memory.  But by then maybe we
-     # should just use ZODB.
- 
-     def __init__(self, dbname, mode):
-         classifier.Bayes.__init__(self)
-         self.statekey = "saved state"
-         self.wordinfo = DBDict(dbname, mode, (self.statekey,))
-         self.dbmode = mode
- 
-         self.restore_state()
- 
-     def __del__(self):
-         #super.__del__(self)
-         self.save_state()
- 
-     def save_state(self):
-         if self.dbmode != 'r':
-             self.wordinfo[self.statekey] = (self.nham, self.nspam)
- 
-     def restore_state(self):
-         if self.wordinfo.has_key(self.statekey):
-             self.nham, self.nspam = self.wordinfo[self.statekey]
- 
- 
- class Hammie:
- 
-     """A spambayes mail filter"""
- 
      def __init__(self, bayes):
          self.bayes = bayes
--- 15,26 ----
  
  
! class Hammie:
!     """A spambayes mail filter.
  
!     This implements the basic functionality needed to score, filter, or
!     train.  
  
      """
  
      def __init__(self, bayes):
          self.bayes = bayes
***************
*** 263,269 ****
              traceback.print_exc()
  
!     def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
!                ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER,
!                debug=DODEBUG):
          """Score (judge) a message and add a disposition header.
  
--- 65,71 ----
              traceback.print_exc()
  
!     def filter(self, msg, header=None, spam_cutoff=None,
!                ham_cutoff=None, debugheader=None,
!                debug=None):
          """Score (judge) a message and add a disposition header.
  
***************
*** 283,286 ****
--- 85,99 ----
          """
  
+         if header is None:
+             header = options.hammie_header_name
+         if spam_cutoff is None:
+             spam_cutoff = options.spam_cutoff
+         if ham_cutoff is None:
+             ham_cutoff = options.ham_cutoff
+         if debugheader is None:
+             debugheader = options.hammie_debug_header_name
+         if debug is None:
+             debug = options.hammie_debug_header
+ 
          msg = mboxutils.get_message(msg)
          try:
***************
*** 349,353 ****
          self.train(msg, True)
  
!     def update_probabilities(self):
          """Update probability values.
  
--- 162,166 ----
          self.train(msg, True)
  
!     def update_probabilities(self, store=True):
          """Update probability values.
  
***************
*** 356,510 ****
          until you're all done before calling this.
  
          """
  
          self.bayes.update_probabilities()
  
  
! def train(hammie, msgs, is_spam):
!     """Train bayes with all messages from a mailbox."""
!     mbox = mboxutils.getmbox(msgs)
!     i = 0
!     for msg in mbox:
!         i += 1
!         # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
!         # back in the day.  Maybe it's a line-printer-ism ;)
!         sys.stdout.write("\r%6d" % i)
!         sys.stdout.flush()
!         hammie.train(msg, is_spam)
!     print
! 
! def score(hammie, msgs, reverse=0):
!     """Score (judge) all messages from a mailbox."""
!     # XXX The reporting needs work!
!     mbox = mboxutils.getmbox(msgs)
!     i = 0
!     spams = hams = 0
!     for msg in mbox:
!         i += 1
!         prob, clues = hammie.score(msg, True)
!         if hasattr(msg, '_mh_msgno'):
!             msgno = msg._mh_msgno
!         else:
!             msgno = i
!         isspam = (prob >= SPAM_THRESHOLD)
!         if isspam:
!             spams += 1
!             if not reverse:
!                 print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
!                 print hammie.formatclues(clues)
!         else:
!             hams += 1
!             if reverse:
!                 print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
!                 print hammie.formatclues(clues)
!     return (spams, hams)
! 
! def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
!     """Create a Bayes instance for the given pickle (which
!     doesn't have to exist).  Create a PersistentBayes if
!     usedb is True."""
!     if usedb:
!         bayes = PersistentBayes(pck, mode)
!     else:
!         bayes = None
!         try:
!             fp = open(pck, 'rb')
!         except IOError, e:
!             if e.errno <> errno.ENOENT: raise
!         else:
!             bayes = pickle.load(fp)
!             fp.close()
!         if bayes is None:
!             bayes = classifier.Bayes()
!     return bayes
! 
! def usage(code, msg=''):
!     """Print usage message and sys.exit(code)."""
!     if msg:
!         print >> sys.stderr, msg
!         print >> sys.stderr
!     print >> sys.stderr, __doc__ % globals()
!     sys.exit(code)
! 
! def main():
!     """Main program; parse options and go."""
!     try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
!     except getopt.error, msg:
!         usage(2, msg)
! 
!     if not opts:
!         usage(2, "No options given")
! 
!     pck = DEFAULTDB
!     good = []
!     spam = []
!     unknown = []
!     reverse = 0
!     do_filter = False
!     usedb = USEDB
!     mode = 'r'
!     for opt, arg in opts:
!         if opt == '-h':
!             usage(0)
!         elif opt == '-g':
!             good.append(arg)
!             mode = 'c'
!         elif opt == '-s':
!             spam.append(arg)
!             mode = 'c'
!         elif opt == '-p':
!             pck = arg
!         elif opt == "-d":
!             usedb = True
!         elif opt == "-D":
!             usedb = False
!         elif opt == "-f":
!             do_filter = True
!         elif opt == '-u':
!             unknown.append(arg)
!         elif opt == '-r':
!             reverse = 1
!     if args:
!         usage(2, "Positional arguments not allowed")
  
!     save = False
  
!     bayes = createbayes(pck, usedb, mode)
!     h = Hammie(bayes)
  
-     for g in good:
-         print "Training ham (%s):" % g
-         train(h, g, False)
-         save = True
  
!     for s in spam:
!         print "Training spam (%s):" % s
!         train(h, s, True)
!         save = True
  
!     if save:
!         h.update_probabilities()
!         if not usedb and pck:
!             fp = open(pck, 'wb')
!             pickle.dump(bayes, fp, 1)
!             fp.close()
  
!     if do_filter:
!         msg = sys.stdin.read()
!         filtered = h.filter(msg)
!         sys.stdout.write(filtered)
  
!     if unknown:
!         (spams, hams) = (0, 0)
!         for u in unknown:
!             if len(unknown) > 1:
!                 print "Scoring", u
!             s, g = score(h, u, reverse)
!             spams += s
!             hams += g
!         print "Total %d spam, %d ham" % (spams, hams)
  
  
- if __name__ == "__main__":
-     main()
--- 169,207 ----
          until you're all done before calling this.
  
+         Unless store is false, the persistent store will be written after
+         updating probabilities.
+ 
          """
  
          self.bayes.update_probabilities()
+         if store:
+             self.store()
  
+     def store(self):
+         """Write out the persistent store.
  
!         This makes sure the persistent store reflects what is currently
!         in memory.  You would want to do this after a write and before
!         exiting.
  
!         """
  
!         self.bayes.store()
  
  
! def open(filename, usedb=True, mode='r'):
!     """Open a file, returning a Hammie instance.
  
!     If usedb is False, open as a pickle instead of a DBDict.  mode is
  
!     used as the flag to open DBDict objects.  'c' for read-write (create
!     if needed), 'r' for read-only, 'w' for read-write.
  
!     """
  
+     if usedb:
+         b = Bayes.DBDictBayes(filename, mode)
+     else:
+         b = Bayes.PickledBayes(filename)
+     return Hammie(b)
  

Index: hammiefilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
retrieving revision 1.2
retrieving revision 1.2.2.1
diff -C2 -d -r1.2 -r1.2.2.1
*** hammiefilter.py	18 Nov 2002 18:14:04 -0000	1.2
--- hammiefilter.py	19 Nov 2002 23:45:25 -0000	1.2.2.1
***************
*** 52,92 ****
      sys.exit(code)
  
- def jar_pickle(h):
-     if not options.persistent_use_database:
-         import pickle
-         fp = open(options.persistent_storage_file, 'wb')
-         pickle.dump(h.bayes, fp, 1)
-         fp.close()
-     
- 
- def hammie_open(mode):
-     b = hammie.createbayes(options.persistent_storage_file,
-                            options.persistent_use_database,
-                            mode)
-     return hammie.Hammie(b)
- 
  def newdb():
!     h = hammie_open('n')
!     jar_pickle(h)
      print "Created new database in", options.persistent_storage_file
  
  def filter():
!     h = hammie_open('r')
      msg = sys.stdin.read()
      print h.filter(msg)
  
  def train_ham():
!     h = hammie_open('w')
      msg = sys.stdin.read()
      h.train_ham(msg)
      h.update_probabilities()
!     jar_pickle(h)    
  
  def train_spam():
!     h = hammie_open('w')
      msg = sys.stdin.read()
      h.train_spam(msg)
      h.update_probabilities()
!     jar_pickle(h)    
  
  def main():
--- 52,86 ----
      sys.exit(code)
  
  def newdb():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'n')
!     h.store()
      print "Created new database in", options.persistent_storage_file
  
  def filter():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'r')
      msg = sys.stdin.read()
      print h.filter(msg)
  
  def train_ham():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'w')
      msg = sys.stdin.read()
      h.train_ham(msg)
      h.update_probabilities()
!     h.store()
  
  def train_spam():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'w')
      msg = sys.stdin.read()
      h.train_spam(msg)
      h.update_probabilities()
!     h.store()
  
  def main():
***************
*** 104,112 ****
  
      # hammiefilter overrides
-     config_overrides = """[Hammie]
- persistent_storage_file = %s
- persistent_use_database = True
- """ % os.path.expanduser('~/.hammiedb')
-     options.mergefilelike(StringIO.StringIO(config_overrides))
      options.mergefiles(['/etc/hammierc',
                          os.path.expanduser('~/.hammierc')])
--- 98,101 ----

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.16
retrieving revision 1.16.2.1
diff -C2 -d -r1.16 -r1.16.2.1
*** pop3proxy.py	18 Nov 2002 19:14:48 -0000	1.16
--- pop3proxy.py	19 Nov 2002 23:45:25 -0000	1.16.2.1
***************
*** 1051,1056 ****
          self.serverName = options.pop3proxy_server_name
          self.serverPort = options.pop3proxy_server_port
!         self.databaseFilename = options.persistent_storage_file
!         self.useDB = options.persistent_use_database
          self.uiPort = options.html_ui_port
          self.launchUI = options.html_ui_launch_browser
--- 1051,1056 ----
          self.serverName = options.pop3proxy_server_name
          self.serverPort = options.pop3proxy_server_port
!         self.databaseFilename = options.pop3proxy_persistent_storage_file
!         self.useDB = options.pop3proxy_persistent_use_database
          self.uiPort = options.html_ui_port
          self.launchUI = options.html_ui_launch_browser


From timstone4@users.sourceforge.net  Wed Nov 20 04:28:36 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Tue, 19 Nov 2002 20:28:36 -0800
Subject: [Spambayes-checkins] spambayes dbdict.py,1.1,1.1.2.1
Message-ID: <E18EMTc-0008Q8-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv32342

Modified Files:
      Tag: hammie-playground
	dbdict.py 
Log Message:
Added LSDBDict class, supports load/store/restore

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1
retrieving revision 1.1.2.1
diff -C2 -d -r1.1 -r1.1.2.1
*** dbdict.py	19 Nov 2002 23:31:44 -0000	1.1
--- dbdict.py	20 Nov 2002 04:28:34 -0000	1.1.2.1
***************
*** 1,4 ****
--- 1,55 ----
  #! /usr/bin/env python
  
+ '''DBDict.py - Dictionary access to dbhash
+ 
+ Classes:
+     DBDict - wraps an anydbm file
+     LSDBDict - adds load/store/restore semantics to DBDict
+ 
+ Abstract:
+     DBDict class wraps an anydbm file with a reasonably complete set
+     of dictionary access methods.  DBDicts can be iterated like a dictionary.
+ 
+     DBDict accepts an iterskip operand on the constructor.  This is a tuple
+     of hash keys that will be skipped (not seen) during iteration.  This
+     is for iteration only.  Methods such as keys() will return the entire
+     complement of keys in the dbm hash, even if they're in iterskip.  An
+     iterkeys() method is provided for iterating with skipped keys, and
+     itervalues() is provided for iterating values with skipped keys.
+ 
+         >>> d = DBDict('/tmp/goober.db', MODE_CREATE, ('skipme', 'skipmetoo'))
+         >>> d['skipme'] = 'booga'
+         >>> d['countme'] = 'wakka'
+         >>> print d.keys()
+         ['skipme', 'countme']
+         >>> for k in d.iterkeys():
+         ...     print k
+         countme
+         >>> for v in d.itervalues():
+         ...     print v
+         wakka
+         >>> for k,v in d.iteritems():
+         ...     print k,v
+         countme wakka
+ 
+     LSDBDict class adds load/store/restore functions to DBDict.  It does this
+     by creating a working copy of the dbm file, and using that for all
+     working access.  When the store() method is called, the working dbm hash
+     is closed, copied to the real copy, then reopened, in effect
+     committing any changes.  When restore() is called, the working copy
+     is closed, replaced with the real copy, then reopened.  Store and restore
+     methods are disallowed for readonly (mode MODE_READONLY) LSDBDicts.
+ 
+ To Do:
+     '''
+ 
+ # This module is part of the spambayes project, which is Copyright 2002
+ # The Python Software Foundation and is covered by the Python Software
+ # Foundation license.
+ 
+ __author__ = "Neale Pickett <neale@woozle.org>, \
+               Tim Stone <tim@fourstonesExpressions.com>"
+ __credits__ = "Tim Peters (author of DBDict class), \
+                all the spambayes contributors."
  from __future__ import generators
  import dbhash
***************
*** 7,10 ****
--- 58,72 ----
  except ImportError:
      import pickle
+     
+ import errno
+ import copy
+ import shutil
+ import os
+ 
+ MODE_CREATE = 'c'       # create file if necessary, open for readwrite
+ MODE_NEW = 'n'          # always create new file, open for readwrite
+ MODE_READWRITE = 'w'    # open existing file for readwrite
+ MODE_READONLY = 'r'     # open existing file for read only
+ 
  
  class DBDict:
***************
*** 80,83 ****
--- 142,201 ----
          return self.__iter__(lambda k: k[1])
  
+ 
+ class LSDBDict(DBDict):
+     """Database Dictionary that supports Load/Store semantic."""
+ 
+     def __init__(self, dbname, mode=MODE_CREATE, iterskip=()):
+         '''Constructor, dbname, mode {c|n|r|w}, iteration skip tuple'''
+ 
+         self.mode = mode
+         self.dbname = dbname
+         self.wdbname = self.dbname+'.working'
+         self.iterskip = iterskip
+ 
+         if self.mode == MODE_READWRITE or self.mode == MODE_CREATE:
+             try:
+                 shutil.copyfile(self.dbname, self.wdbname)
+             except (IOError, os.error), why:
+                 pass           # don't blow up for now
+         elif self.mode == MODE_READONLY:
+             # for readonly access, use the real dbm file
+             self.wdbname = self.dbname
+         elif self.mode == MODE_NEW:
+             try:
+                 os.unlink(self.wdbname)
+             except OSError, e:
+                 if e.errno != errno.ENOENT:
+                     raise
+         else:
+             raise ValueError, "Mode must be MODE_CREATE, MODE_NEW, MODE_READONLY, or MODE_READWRITE"
+ 
+         self.hash = dbhash.open(self.wdbname, self.mode)
+ 
+ 
+     def store(self):
+         '''store the working dbm into the 'real' dbm file'''
+ 
+         if self.mode != MODE_READONLY:
+             self.hash.close()
+             shutil.copyfile(self.wdbname, self.dbname)
+             self.hash = dbhash.open(self.wdbname, MODE_CREATE)
+         else:
+             raise dbhash.error, 'Store operation not permitted on readonly dbm'
+ 
+     def restore(self):
+         '''restore the working dbm to the 'real' dbm condition'''
+ 
+         if self.mode == MODE_READONLY:
+             raise dbhash.error, \
+                    'Restore operation not permitted on readonly dbm'
+         else:
+             self.hash.close()
+ 
+             if self.mode != MODE_NEW:
+                 shutil.copyfile(self.dbname, self.wdbname)
+ 
+             self.hash = dbhash.open(self.wdbname, self.mode)
+            
  open = DBDict
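
The load/store/restore behaviour described in the LSDBDict docstring can be
sketched in isolation.  This is a minimal Python 3 illustration, not the
check-in's code: WorkingCopyDict is a hypothetical stand-in that keeps its
data in a JSON file instead of a dbhash file, and the ".working" suffix is
the only detail taken from the diff.  All writes go to the working copy;
store() copies it over the real file, and restore() throws the working copy
away and reloads from the real file.

```python
import json
import os
import shutil
import tempfile


class WorkingCopyDict:
    """Hypothetical stand-in for LSDBDict, backed by a JSON file."""

    def __init__(self, path):
        self.path = path                  # the "real" file
        self.wpath = path + '.working'    # the working copy
        self.restore()

    def _flush(self):
        # Every write lands in the working copy only.
        with open(self.wpath, 'w') as fp:
            json.dump(self.data, fp)

    def __setitem__(self, key, value):
        self.data[key] = value
        self._flush()

    def __getitem__(self, key):
        return self.data[key]

    def store(self):
        """Commit: copy the working copy over the real file."""
        shutil.copyfile(self.wpath, self.path)

    def restore(self):
        """Roll back: replace the working copy with the real file."""
        if os.path.exists(self.path):
            shutil.copyfile(self.path, self.wpath)
            with open(self.wpath) as fp:
                self.data = json.load(fp)
        else:
            # Nothing committed yet, so start from an empty working copy.
            self.data = {}
            self._flush()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        d = WorkingCopyDict(os.path.join(tmp, 'words.json'))
        d['spam'] = 3
        d.store()        # 3 is now committed
        d['spam'] = 99   # uncommitted change in the working copy
        d.restore()      # discard it
        return d['spam']
```

Copying whole files keeps the commit step simple, at the cost of one full
copy per store(); that trade-off is presumably acceptable when store() is
called once per training run rather than once per word.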
  

From timstone4@users.sourceforge.net  Wed Nov 20 04:29:58 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Tue, 19 Nov 2002 20:29:58 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.5.2.1,1.5.2.2
Message-ID: <E18EMUw-0008W0-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv32696

Modified Files:
      Tag: hammie-playground
	Bayes.py 
Log Message:
Minor tweaks to accommodate LSDBDict usage

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.5.2.1
retrieving revision 1.5.2.2
diff -C2 -d -r1.5.2.1 -r1.5.2.2
*** Bayes.py	19 Nov 2002 23:45:24 -0000	1.5.2.1
--- Bayes.py	20 Nov 2002 04:29:55 -0000	1.5.2.2
***************
*** 6,10 ****
      PersistentBayes - subclass of Bayes, adds auto persistence
      PickledBayes - PersistentBayes that uses a pickle db
!     DBDictBayes - PersistentBayes that uses a (hammie.) DB_Dict db
      Trainer - Bayes training observer
      SpamTrainer - Trainer for spam
--- 6,10 ----
      PersistentBayes - subclass of Bayes, adds auto persistence
      PickledBayes - PersistentBayes that uses a pickle db
!     DBDictBayes - PersistentBayes that uses a LSDBDict db
      Trainer - Bayes training observer
      SpamTrainer - Trainer for spam
***************
*** 23,30 ****
      databases.
  
!     DBDictBayes is a concrete PersistentBayes class that uses a DB_Dict
!     datastore.  DB_Dict is currently defined in hammie.py, and wraps
!     an anydbm with some very convenient dictionary functionality, such as
!     the ability to skip particular keys or key patterns during iteration.
  
      Trainer is a concrete class that observes a Corpus and trains a
--- 23,28 ----
      databases.
  
!     DBDictBayes is a concrete PersistentBayes class that uses a LSDBDict
!     datastore.
  
      Trainer is a concrete class that observes a Corpus and trains a
***************
*** 53,57 ****
  
  __author__ = "Tim Stone <tim@fourstonesExpressions.com>"
! __credits__ = "Richie Hindle, Tim Peters, Neil Gunton, \
  all the spambayes contributors."
  
--- 51,55 ----
  
  __author__ = "Tim Stone <tim@fourstonesExpressions.com>"
! __credits__ = "Richie Hindle, Tim Peters, Neale Pickett, \
  all the spambayes contributors."
  
***************
*** 62,67 ****
  import dbdict
  import errno
- import copy
- import anydbm
  
  PICKLE_TYPE = 1
--- 60,63 ----
***************
*** 169,174 ****
  
  
! class WIDict(dbdict.DBDict):
!     """DBDict optimized for holding lots of WordInfo objects.
  
      Normally, the pickler can figure out that you're pickling the same
--- 165,170 ----
  
  
! class WIDict(dbdict.LSDBDict):
!     """LSDBDict optimized for holding lots of WordInfo objects.
  
      Normally, the pickler can figure out that you're pickling the same
***************
*** 246,249 ****
--- 242,246 ----
  
          self.wordinfo[self.statekey] = (self.nham, self.nspam)
+         self.wordinfo.store()
  
  
From timstone4@users.sourceforge.net  Wed Nov 20 05:04:06 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Tue, 19 Nov 2002 21:04:06 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.72.2.1,1.72.2.2
Message-ID: <E18EN1y-000201-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv7648

Modified Files:
      Tag: hammie-playground
	Options.py 
Log Message:
Added missing pop3proxy_persistent_storage_file

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72.2.1
retrieving revision 1.72.2.2
diff -C2 -d -r1.72.2.1 -r1.72.2.2
*** Options.py	19 Nov 2002 23:45:24 -0000	1.72.2.1
--- Options.py	20 Nov 2002 05:04:03 -0000	1.72.2.2
***************
*** 369,372 ****
--- 369,373 ----
  pop3proxy_unknown_cache: pop3proxy-unknown-cache
  pop3proxy_persistent_use_database: False
+ pop3proxy_persistent_storage_file: ""
  
  [html_ui]
***************
*** 454,457 ****
--- 455,459 ----
                    'pop3proxy_unknown_cache': string_cracker,
                    'pop3proxy_persistent_use_database': string_cracker,
+                   'pop3proxy_persistent_storage_file': string_cracker,
                    },
      'html_ui': {'html_ui_port': int_cracker,


From npickett@users.sourceforge.net  Wed Nov 20 06:06:30 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Tue, 19 Nov 2002 22:06:30 -0800
Subject: [Spambayes-checkins] 
 spambayes Options.py,1.72.2.2,1.72.2.3 classifier.py,1.53,1.53.2.1
 dbdict.py,1.1.2.1,1.1.2.2
Message-ID: <E18EO0M-0005ZQ-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv21044

Modified Files:
      Tag: hammie-playground
	Options.py classifier.py dbdict.py 
Log Message:
* new classifier method to only update the probability of a single
  word.  I want to try using this during word reads with the dbm
  method, to see if I can make training on single messages quicker.
* s/string/boolean/ in new pop3proxy option
* dbdict ''' to """ to cope with emacs syntax highlighting bogosity


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72.2.2
retrieving revision 1.72.2.3
diff -C2 -d -r1.72.2.2 -r1.72.2.3
*** Options.py	20 Nov 2002 05:04:03 -0000	1.72.2.2
--- Options.py	20 Nov 2002 06:06:27 -0000	1.72.2.3
***************
*** 353,357 ****
  # a pickle (quick to train on huge amounts of messages). Set this to
  # True to use a database by default.
! hammiefilter_persistent_use_database: False
  
  [pop3proxy]
--- 353,357 ----
  # a pickle (quick to train on huge amounts of messages). Set this to
  # True to use a database by default.
! hammiefilter_persistent_use_database: True
  
  [pop3proxy]
***************
*** 454,458 ****
                    'pop3proxy_ham_cache': string_cracker,
                    'pop3proxy_unknown_cache': string_cracker,
!                   'pop3proxy_persistent_use_database': string_cracker,
                    'pop3proxy_persistent_storage_file': string_cracker,
                    },
--- 454,458 ----
                    'pop3proxy_ham_cache': string_cracker,
                    'pop3proxy_unknown_cache': string_cracker,
!                   'pop3proxy_persistent_use_database': boolean_cracker,
                    'pop3proxy_persistent_storage_file': string_cracker,
                    },

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53
retrieving revision 1.53.2.1
diff -C2 -d -r1.53 -r1.53.2.1
*** classifier.py	18 Nov 2002 18:23:09 -0000	1.53
--- classifier.py	20 Nov 2002 06:06:28 -0000	1.53.2.1
***************
*** 319,322 ****
--- 319,334 ----
          """
  
+         for word, record in self.wordinfo.iteritems():
+             self.update_word(word, record)
+                 
+     def update_word(self, word, record):
+         """Compute p(word) = prob(msg is spam | msg contains word).
+         
+         This is the Graham calculation, but stripped of biases, and
+         stripped of clamping into 0.01 thru 0.99.  The Bayesian
+         adjustment following keeps them in a sane range, and one
+         that naturally grows the more evidence there is to back up
+         a probability.
+         """
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
***************
*** 330,393 ****
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
!         for word, record in self.wordinfo.iteritems():
!             # Compute p(word) = prob(msg is spam | msg contains word).
!             # This is the Graham calculation, but stripped of biases, and
!             # stripped of clamping into 0.01 thru 0.99.  The Bayesian
!             # adjustment following keeps them in a sane range, and one
!             # that naturally grows the more evidence there is to back up
!             # a probability.
!             hamcount = record.hamcount
!             assert hamcount <= nham
!             hamratio = hamcount / nham
! 
!             spamcount = record.spamcount
!             assert spamcount <= nspam
!             spamratio = spamcount / nspam
  
!             prob = spamratio / (hamratio + spamratio)
  
!             # Now do Robinson's Bayesian adjustment.
!             #
!             #         s*x + n*p(w)
!             # f(w) = --------------
!             #           s + n
!             #
!             # I find this easier to reason about like so (equivalent when
!             # s != 0):
!             #
!             #        x - p
!             #  p +  -------
!             #       1 + n/s
!             #
!             # IOW, it moves p a fraction of the distance from p to x, and
!             # less so the larger n is, or the smaller s is.
  
!             # Experimental:
!             # Picking a good value for n is interesting:  how much empirical
!             # evidence do we really have?  If nham == nspam,
!             # hamcount + spamcount makes a lot of sense, and the code here
!             # does that by default.
!             # But if, e.g., nham is much larger than nspam, p(w) can get a
!             # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!             # strong ham words (high hamcount) much stronger than strong
!             # spam words (high spamcount), and that makes the accidental
!             # appearance of a strong ham word in spam much more damaging than
!             # the accidental appearance of a strong spam word in ham.
!             # So we don't give hamcount full credit when nham > nspam (or
!             # spamcount when nspam > nham):  instead we knock hamcount down
!             # to what it would have been had nham been equal to nspam.  IOW,
!             # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!             # we don't "believe" any count to an extent more than
!             # min(nspam, nham) justifies.
  
!             n = hamcount * spam2ham  +  spamcount * ham2spam
!             prob = (StimesX + n * prob) / (S + n)
  
!             if record.spamprob != prob:
!                 record.spamprob = prob
!                 # The next seemingly pointless line appears to be a hack
!                 # to allow a persistent db to realize the record has changed.
!                 self.wordinfo[word] = record
  
      def clearjunk(self, oldesttime):
--- 342,398 ----
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
+                 
+         hamcount = record.hamcount
+         assert hamcount <= nham
+         hamratio = hamcount / nham
  
!         spamcount = record.spamcount
!         assert spamcount <= nspam
!         spamratio = spamcount / nspam
  
!         prob = spamratio / (hamratio + spamratio)
  
!         # Now do Robinson's Bayesian adjustment.
!         #
!         #         s*x + n*p(w)
!         # f(w) = --------------
!         #           s + n
!         #
!         # I find this easier to reason about like so (equivalent when
!         # s != 0):
!         #
!         #        x - p
!         #  p +  -------
!         #       1 + n/s
!         #
!         # IOW, it moves p a fraction of the distance from p to x, and
!         # less so the larger n is, or the smaller s is.
  
!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.
  
!         n = hamcount * spam2ham  +  spamcount * ham2spam
!         prob = (StimesX + n * prob) / (S + n)
  
!         if record.spamprob != prob:
!             record.spamprob = prob
!             # The next seemingly pointless line appears to be a hack
!             # to allow a persistent db to realize the record has changed.
!             self.wordinfo[word] = record
  
      def clearjunk(self, oldesttime):
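
The per-word calculation that update_word() performs can be tried out on its
own.  The following is a Python 3 sketch under stated assumptions, not the
classifier's code: robinson_spamprob is a hypothetical helper, s and x stand
in for options.unknown_word_strength and options.unknown_word_prob (the
default values here are placeholders), and spam2ham / ham2spam are inferred
from the comment about knocking counts down to min(nspam, nham), since their
definitions fall outside the lines shown in the diff.

```python
def robinson_spamprob(hamcount, spamcount, nham, nspam, s=1.0, x=0.5):
    """Return Robinson's adjusted spam probability f(w) for one word."""
    nham = float(nham or 1)
    nspam = float(nspam or 1)

    if hamcount == 0 and spamcount == 0:
        # Guard added for this sketch: with no evidence, n == 0 and
        # the formula reduces to the prior x.
        return x

    hamratio = hamcount / nham
    spamratio = spamcount / nspam

    # Graham-style estimate, without biases or clamping.
    prob = spamratio / (hamratio + spamratio)

    # Assumed scaling: do not trust a count more than the smaller
    # training set justifies.
    spam2ham = min(nspam / nham, 1.0)
    ham2spam = min(nham / nspam, 1.0)
    n = hamcount * spam2ham + spamcount * ham2spam

    # f(w) = (s*x + n*p(w)) / (s + n)
    return (s * x + n * prob) / (s + n)


if __name__ == '__main__':
    # A word seen in 10 of 100 spams and in no hams.
    print(robinson_spamprob(0, 10, 100, 100))
    # The same ham evidence counts for less when ham outnumbers spam.
    print(robinson_spamprob(50, 0, 1000, 100))
```

With balanced training data n is simply hamcount + spamcount, so the result
moves from x toward p(w) as the word is seen more often.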

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1.2.1
retrieving revision 1.1.2.2
diff -C2 -d -r1.1.2.1 -r1.1.2.2
*** dbdict.py	20 Nov 2002 04:28:34 -0000	1.1.2.1
--- dbdict.py	20 Nov 2002 06:06:28 -0000	1.1.2.2
***************
*** 1,5 ****
  #! /usr/bin/env python
  
! '''DBDict.py - Dictionary access to dbhash
  
  Classes:
--- 1,5 ----
  #! /usr/bin/env python
  
! """DBDict.py - Dictionary access to dbhash
  
  Classes:
***************
*** 42,46 ****
  
  To Do:
!     '''
  
  # This module is part of the spambayes project, which is Copyright 2002
--- 42,46 ----
  
  To Do:
!     """
  
  # This module is part of the spambayes project, which is Copyright 2002


From richiehindle@users.sourceforge.net  Wed Nov 20 12:28:18 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 20 Nov 2002 04:28:18 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.73,1.74
Message-ID: <E18ETxq-0004Ps-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv16909

Modified Files:
	Options.py 
Log Message:
pop3proxy: New options for configuring multiple POP3 servers.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.73
retrieving revision 1.74
diff -C2 -d -r1.73 -r1.74
*** Options.py	18 Nov 2002 22:51:07 -0000	1.73
--- Options.py	20 Nov 2002 12:28:15 -0000	1.74
***************
*** 357,365 ****
  # pop3proxy settings - pop3proxy also respects the options in the Hammie
  # section, with the exception of the extra header details at the moment.
! # The only mandatory option is pop3proxy_server_name, eg. pop3.my-isp.com,
! # but that can come from the command line - see "pop3proxy -h".
! pop3proxy_server_name: ""
! pop3proxy_server_port: 110
! pop3proxy_port: 110
  pop3proxy_cache_use_gzip: False
  pop3proxy_cache_expiry_days: 7
--- 357,366 ----
  # pop3proxy settings - pop3proxy also respects the options in the Hammie
  # section, with the exception of the extra header details at the moment.
! # The only mandatory option is pop3proxy_servers, eg. "pop3.my-isp.com:110",
! # or a comma-separated list of those.  The ":110" is optional.  If you
! # specify more than one server in pop3proxy_servers, you must specify the
! # same number of ports in pop3proxy_ports.
! pop3proxy_servers: ""
! pop3proxy_ports: ""
  pop3proxy_cache_use_gzip: False
  pop3proxy_cache_expiry_days: 7
***************
*** 368,371 ****
--- 369,377 ----
  pop3proxy_unknown_cache: pop3proxy-unknown-cache
  
+ # Deprecated - use pop3proxy_servers and pop3proxy_ports instead.
+ pop3proxy_server_name: ""
+ pop3proxy_server_port: 110
+ pop3proxy_port: 110
+ 
  [html_ui]
  html_ui_port: 8880
***************
*** 441,445 ****
                 'hammie_debug_header_name': string_cracker,
                 },
!     'pop3proxy': {'pop3proxy_server_name': string_cracker,
                    'pop3proxy_server_port': int_cracker,
                    'pop3proxy_port': int_cracker,
--- 447,453 ----
                 'hammie_debug_header_name': string_cracker,
                 },
!     'pop3proxy': {'pop3proxy_servers': string_cracker,
!                   'pop3proxy_ports': string_cracker,
!                   'pop3proxy_server_name': string_cracker,
                    'pop3proxy_server_port': int_cracker,
                    'pop3proxy_port': int_cracker,
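
The option comments above describe pop3proxy_servers as a comma-separated
list of "host:port" entries with an optional port, paired with a list of
local ports in pop3proxy_ports.  A minimal Python 3 sketch of how such values
could be parsed follows.  parse_servers is a hypothetical helper, not
pop3proxy's own parser (which is not part of this check-in), and falling
back to port 110 for a single server with no listed local port is an
assumption.

```python
def parse_servers(servers, ports, default_port=110):
    """Return a list of (host, server_port, local_port) tuples."""
    entries = [e.strip() for e in servers.split(',') if e.strip()]
    local_ports = [int(p) for p in ports.split(',') if p.strip()]

    # Assumption: one server and no local port means the default port.
    if len(entries) == 1 and not local_ports:
        local_ports = [default_port]
    if len(entries) != len(local_ports):
        raise ValueError("need one local port per server")

    result = []
    for entry, local_port in zip(entries, local_ports):
        # The ":port" suffix is optional.
        host, sep, port = entry.partition(':')
        server_port = int(port) if sep else default_port
        result.append((host, server_port, local_port))
    return result


if __name__ == '__main__':
    print(parse_servers("pop3.my-isp.com", ""))
    print(parse_servers("a.example:995, b.example", "8110,8111"))
```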


From richiehindle@users.sourceforge.net  Wed Nov 20 12:30:18 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 20 Nov 2002 04:30:18 -0800
Subject: [Spambayes-checkins] spambayes pop3graph.py,NONE,1.1
Message-ID: <E18ETzm-0004cz-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv17757

Added Files:
	pop3graph.py 
Log Message:
Script for producing ASCII graphs of classifier performance, based
on pop3proxy corpuses.


--- NEW FILE: pop3graph.py ---
"""Analyse the pop3proxy's caches and produce a graph of how accurate
the classifier has been over time.  Only really meaningful if you started
with an empty database."""

from __future__ import division

import sys, mboxutils
from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
from Options import options

def main():
   # Create the corpuses and the factory that reads the messages.
   if options.pop3proxy_cache_use_gzip:
       messageFactory = GzipFileMessageFactory()
   else:
       messageFactory = FileMessageFactory()
   spamCorpus = FileCorpus(messageFactory, options.pop3proxy_spam_cache)
   hamCorpus = FileCorpus(messageFactory, options.pop3proxy_ham_cache)

   # Read in all the trained messages.
   allTrained = {}
   for corpus, disposition in [(spamCorpus, 'Yes'), (hamCorpus, 'No')]:
      for m in corpus:
         message = mboxutils.get_message(m.getSubstance())
         message._pop3CacheDisposition = disposition
         allTrained[m.key()] = message

   # Sort the messages into the order they arrived, then work out a scaling
   # factor for the graph - 'limit' is the widest it can be in characters.
   keys = allTrained.keys()
   keys.sort()
   limit = 70
   if len(keys) < limit:
      scale = 1
   else:
      scale = len(keys) // (limit//2)

   # Build the data - an array of cumulative success indexed by count.
   count = successful = 0
   successByCount = []
   for key in keys:
      message = allTrained[key]
      disposition = message[options.hammie_header_name]
      if (message._pop3CacheDisposition == disposition):
         successful += 1
      count += 1
      if count % scale == (scale-1):
         successByCount.append(successful // scale)

   # Build the graph, as a list of rows of characters.
   size = count // scale
   graph = [[" " for i in range(size+3)] for j in range(size)]
   for c in range(size):
      graph[c][1] = "|"
      graph[c][c+3] = "."
      graph[successByCount[c]][c+3] = "*"
   graph.reverse()

   # Print the graph.
   print "\n   Success of the classifier over time:\n"
   print "   . - Number of messages over time"
   print "   * - Number of correctly classified messages over time\n\n"
   for row in range(size):
      line = ''.join(graph[row])
      if row == 0:
         print line + " %d" % count
      elif row == (count - successful) // scale:
         print line + " %d" % successful
      else:
         print line
   print " " + "_" * (size+2)

if __name__ == '__main__':
   main()
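
The scaling and sampling steps in pop3graph.py can be exercised without any
corpus files.  This Python 3 sketch mirrors the arithmetic of the script
above; cumulative_success is a hypothetical helper, and its input is assumed
to be a list of booleans saying whether each message, in arrival order, was
classified correctly.

```python
def cumulative_success(outcomes, limit=70):
    """Return (scale, samples) as pop3graph.py would compute them."""
    total = len(outcomes)
    # One graph column per 'scale' messages; small inputs are unscaled.
    if total < limit:
        scale = 1
    else:
        scale = total // (limit // 2)

    count = successful = 0
    samples = []
    for ok in outcomes:
        if ok:
            successful += 1
        count += 1
        # Take one sample per 'scale' messages, in graph-row units.
        if count % scale == scale - 1:
            samples.append(successful // scale)
    return scale, samples


if __name__ == '__main__':
    scale, samples = cumulative_success([True] * 140)
    print(scale, len(samples))
```

Dividing by limit // 2 rather than limit means large inputs get between
limit // 2 and roughly limit columns, so the graph stays within the stated
maximum width.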


From richiehindle@users.sourceforge.net  Wed Nov 20 12:45:24 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 20 Nov 2002 04:45:24 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.16,1.17
Message-ID: <E18EUEO-00061L-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv21143

Modified Files:
	pop3proxy.py 
Log Message:
 o Multiple server support - the old ini-file settings are deprecated;
   see Options.py
 o Added a 'defer' choice in addition to discard/ham/spam - thanks to
   Skip for the suggestion.
 o The training page now groups by X-Hammie-Disposition - thanks again
   to Skip.
 o Added a Save Database button to the status panel.
 o Added nspam and nham to the status panel.
 o Fixed several Mac-related problems reported by François, whereby I
   needed to use longs for timestamps.
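
As a rough illustration of the multiple-server settings, here is a minimal
sketch of how comma-separated pop3proxy_servers and pop3proxy_ports values
could be paired up.  The helper name parse_proxy_settings is hypothetical and
is not part of pop3proxy.py, and the sketch is written for current Python
rather than the Python 2 used in the patch.  The default port of 110 and the
length check follow the behaviour visible in the diff.

```python
def parse_proxy_settings(servers_option, ports_option):
    """Pair each 'host[:port]' entry with a local listening port.

    Hypothetical helper for illustration; not part of pop3proxy.py.
    """
    servers = []
    for entry in servers_option.split(','):
        entry = entry.strip()
        if ':' in entry:
            host, port = entry.split(':', 1)
        else:
            host, port = entry, '110'   # standard POP3 port
        servers.append((host, int(port)))

    proxy_ports = [int(p.strip()) for p in ports_option.split(',')]

    # Each real server needs exactly one local port to listen on.
    if len(servers) != len(proxy_ports):
        raise ValueError("pop3proxy_servers and pop3proxy_ports "
                         "are different lengths")
    return list(zip(servers, proxy_ports))


# Example host names below are placeholders.
pairs = parse_proxy_settings("pop.example.com, mail.example.net:995",
                             "1110, 1111")
for (host, port), proxy_port in pairs:
    print("localhost:%d -> %s:%d" % (proxy_port, host, port))
```

Raising ValueError instead of calling sys.exit() is a choice made for the
sketch so that the check can be exercised in isolation.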


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** pop3proxy.py	18 Nov 2002 19:14:48 -0000	1.16
--- pop3proxy.py	20 Nov 2002 12:45:21 -0000	1.17
***************
*** 53,58 ****
  Web training interface:
  
-  o Include more stats in the Status box - it's easy to lose track of
-    where you are when testing.
   o Functional tests.
   o Review already-trained messages, and purge them.
--- 53,56 ----
***************
*** 80,85 ****
   o Possibly integrate Tim Stone's SMTP code - make it use async, make
     the training code update (rather than replace!) the database.
-  o Option to keep trained messages and view potential FPs and FNs to
-    correct them.
   o Allow use of the UI without the POP3 proxy.
   o Remove any existing X-Hammie-Disposition header from incoming emails.
--- 78,81 ----
***************
*** 107,115 ****
   o Classify a web page given a URL.
   o Graphs.  Of something.  Who cares what?
   o Zoe...!
  
  """
  
! import os, sys, re, operator, errno, getopt, cPickle, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
  import Bayes, tokenizer, mboxutils
--- 103,112 ----
   o Classify a web page given a URL.
   o Graphs.  Of something.  Who cares what?
+  o NNTP proxy.
   o Zoe...!
  
  """
  
! import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
  import Bayes, tokenizer, mboxutils
***************
*** 477,481 ****
                  # The message name is the time it arrived, with a uniquifier
                  # appended if two arrive within one clock tick of each other.
!                 messageName = "%10.10d" % time.time()
                  if messageName == state.lastBaseMessageName:
                      state.lastBaseMessageName = messageName
--- 474,478 ----
                  # The message name is the time it arrived, with a uniquifier
                  # appended if two arrive within one clock tick of each other.
!                 messageName = "%10.10d" % long(time.time())
                  if messageName == state.lastBaseMessageName:
                      state.lastBaseMessageName = messageName
***************
*** 603,612 ****
                    &nbsp;<br>\n"""
  
!     summary = """POP3 proxy running on port <b>%(proxyPort)d</b>,
!               proxying to <b>%(serverName)s:%(serverPort)d</b>.<br>
                Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
                POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
                Emails classified this session: <b>%(numSpams)d</b> spam,
!                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.
                """
  
--- 600,614 ----
                    &nbsp;<br>\n"""
  
!     summary = """POP3 proxy running on <b>%(proxyPortsString)s</b>,
!               proxying to <b>%(serversString)s</b>.<br>
                Active POP3 conversations: <b>%(activeSessions)d</b>.<br>
                POP3 conversations this session: <b>%(totalSessions)d</b>.<br>
                Emails classified this session: <b>%(numSpams)d</b> spam,
!                 <b>%(numHams)d</b> ham, <b>%(numUnsure)d</b> unsure.<br>
!               Total emails trained: Spam: <b>%(nspam)d</b>
!                                      Ham: <b>%(nham)d</b><br>
!               <form action='save' method='POST'>
!               <input type='submit' value='Save database'>
!               </form>
                """
  
***************
*** 620,628 ****
               using the <a href='review'>Review messages</a> page."""
  
!     reviewHeader = """<p>These are unclassified emails, which you can use to
!                    train the classifier.  Check the Discard / Ham / Spam
!                    buttton for each email, then click 'Train' below.  (To
!                    discard the whole page, leave everything with Discard
!                    checked and click 'Train'.)</p>
                     <form action='review' method='GET'>
                         <input type='hidden' name='prior' value='%d'>
--- 622,630 ----
               using the <a href='review'>Review messages</a> page."""
  
!     reviewHeader = """<p>These are untrained emails, which you can use to
!                    train the classifier.  Check the Discard / Defer / Ham /
!                    Spam buttton for each email, then click 'Train' below.
!                    (Defer leaves the message here, to be trained on
!                    later.)</p>
                     <form action='review' method='GET'>
                         <input type='hidden' name='prior' value='%d'>
***************
*** 639,644 ****
                     <form action='review' method='POST'>
                     <table class='messagetable' cellpadding='0' cellspacing='0'>
!                    <tr><td><b>Subject:</b></td><td><b>From:</b></td>
!                    <td><b>Discard / Ham / Spam</b></td></tr>"""
  
      upload = """<form action='%s' method='POST'
--- 641,649 ----
                     <form action='review' method='POST'>
                     <table class='messagetable' cellpadding='0' cellspacing='0'>
!                    """
! 
!     reviewSubheader = """<tr><td><b>Messages classified as %s:</b></td>
!                           <td><b>From:</b></td>
!                           <td><b>Discard / Defer / Ham / Spam</b></td></tr>"""
  
      upload = """<form action='%s' method='POST'
***************
*** 769,773 ****
              homeLink = "<a href='home'>Home</a> > %s" % name
          if showImage:
!             image = "<img src='/helmet.gif' align='absmiddle'>&nbsp;"
          else:
              image = ""
--- 774,778 ----
              homeLink = "<a href='home'>Home</a> > %s" % name
          if showImage:
!             image = "<img src='helmet.gif' align='absmiddle'>&nbsp;"
          else:
              image = ""
***************
*** 796,800 ****
      def onHome(self, params):
          """Serve up the homepage."""
!         body = (self.pageSection % ('Status', self.summary % state.__dict__)+
                  self.pageSection % ('Train on proxied messages', self.review)+
                  self.pageSection % ('Train on a given message', self.train)+
--- 801,807 ----
      def onHome(self, params):
          """Serve up the homepage."""
!         stateDict = state.__dict__
!         stateDict.update(state.bayes.__dict__)
!         body = (self.pageSection % ('Status', self.summary % stateDict)+
                  self.pageSection % ('Train on proxied messages', self.review)+
                  self.pageSection % ('Train on a given message', self.train)+
***************
*** 803,813 ****
          self.push(body)
  
      def onShutdown(self, params):
          """Shutdown the server, saving the pickle if requested to do so."""
          if params['how'].lower().find('save') >= 0:
!             if not state.useDB and state.databaseFilename:
!                 self.push("<b>Saving...</b>")
!                 self.push(' ')  # Acts as a flush for small buffers.
!                 state.bayes.store()
          self.push("<b>Shutdown</b>. Goodbye.</div></body></html>")
          self.push(' ')
--- 810,828 ----
          self.push(body)
  
+     def doSave(self):
+         """Saves the database.  Worker for onSave and onShutdown."""
+         self.push("<b>Saving... ")
+         self.push(' ')
+         state.bayes.store()
+         self.push("Done</b>.")
+ 
+     def onSave(self, params):
+         """Command handler for "Save"."""
+         self.doSave()
+ 
      def onShutdown(self, params):
          """Shutdown the server, saving the pickle if requested to do so."""
          if params['how'].lower().find('save') >= 0:
!             self.doSave()
          self.push("<b>Shutdown</b>. Goodbye.</div></body></html>")
          self.push(' ')
***************
*** 845,849 ****
          for that message.  This is the time that the message was received,
          not the Date header."""
!         return int(key[:10])
  
      def getTimeRange(self, timestamp):
--- 860,864 ----
          for that message.  This is the time that the message was received,
          not the Date header."""
!         return long(key[:10])
  
      def getTimeRange(self, timestamp):
***************
*** 879,884 ****
  
          # Find the subset of the keys within this range.
!         startKeyIndex = bisect.bisect(allKeys, "%d" % start)
!         endKeyIndex = bisect.bisect(allKeys, "%d" % end)
          keys = allKeys[startKeyIndex:endKeyIndex]
          keys.reverse()
--- 894,899 ----
  
          # Find the subset of the keys within this range.
!         startKeyIndex = bisect.bisect(allKeys, "%d" % long(start))
!         endKeyIndex = bisect.bisect(allKeys, "%d" % long(end))
          keys = allKeys[startKeyIndex:endKeyIndex]
          keys.reverse()
***************
*** 896,911 ****
          return keys, date, prior, start, end
  
!     def onReview(self, params):
!         """Present a list of message for (re)training."""
  
!         # This is the radio group for training/discarding.
!         trainRadio = """<input type='radio' name='classify:%s'
!                                value='discard' checked>
!                         <input type='radio' name='classify:%s' value='ham'>
!                         <input type='radio' name='classify:%s' value='spam'>"""
  
          # Train/discard sumbitted messages.
          id = ''
          numTrained = 0
          for key, value in params.items():
              if key.startswith('classify:'):
--- 911,947 ----
          return keys, date, prior, start, end
  
!     def appendMessages(self, lines, keyedMessages, judgement):
!         """Appends the lines of a table of messages to 'lines'."""
!         buttons = """<input type='radio' name='classify:%s' value='discard'>
!                   <input type='radio' name='classify:%s' value='defer' %s>
!                   <input type='radio' name='classify:%s' value='ham' %s>
!                   <input type='radio' name='classify:%s' value='spam' %s>"""
!         stripe = 0
!         for key, message in keyedMessages:
!             # Parse the message and get the relevant headers.
!             subject = self.trimAndQuote(message["Subject"] or "(none)", 50)
!             from_ = self.trimAndQuote(message["From"] or "(none)", 40)
  
!             # Output the table row for this message.
!             defer = ham = spam = ""
!             if judgement == options.header_spam_string:
!                 spam='checked'
!             elif judgement == options.header_ham_string:
!                 ham='checked'
!             elif judgement == options.header_unsure_string:
!                 defer='checked'
!             radioGroup = buttons % (key, key, defer, key, ham, key, spam)
!             stripeClass = ['stripe_on', 'stripe_off'][stripe]
!             lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             <td align='middle'>%s</td></tr>""" % \
!                             (stripeClass, subject, from_, radioGroup))
!             stripe = stripe ^ 1
  
+     def onReview(self, params):
+         """Present a list of message for (re)training."""
          # Train/discard sumbitted messages.
          id = ''
          numTrained = 0
+         numDeferred = 0
          for key, value in params.items():
              if key.startswith('classify:'):
***************
*** 915,921 ****
                  elif value == 'ham':
                      targetCorpus = state.hamCorpus
!                 else: # Discard
                      targetCorpus = None
!                     state.unknownCorpus.removeMessage(state.unknownCorpus[id])
                  if targetCorpus:
                      try:
--- 951,963 ----
                  elif value == 'ham':
                      targetCorpus = state.hamCorpus
!                 elif value == 'discard':
                      targetCorpus = None
!                     try:
!                         state.unknownCorpus.removeMessage(state.unknownCorpus[id])
!                     except KeyError:
!                         pass  # Must be a reload.
!                 else: # defer
!                     targetCorpus = None
!                     numDeferred += 1
                  if targetCorpus:
                      try:
***************
*** 939,946 ****
              self.push("Done.</b></p>")
  
!         # After submitting a page, display the prior page or the next one.
!         # Derive the day of the submitted page from the ID of the last
!         # processed message.
!         if id:
              start = self.keyToTimestamp(id)
              _, _, prior, _, next = self.buildReviewKeys(start)
--- 981,992 ----
              self.push("Done.</b></p>")
  
!         # If any messages were deferred, show the same page again.
!         if numDeferred > 0:
!             start = self.keyToTimestamp(id)
! 
!         # Else after submitting a whole page, display the prior page or the
!         # next one.  Derive the day of the submitted page from the ID of the
!         # last processed message.
!         elif id:
              start = self.keyToTimestamp(id)
              _, _, prior, _, next = self.buildReviewKeys(start)
***************
*** 960,965 ****
              start = 0
  
!         # Present the list of messages in reverse order of appearance.
          keys, date, prior, this, next = self.buildReviewKeys(start)
          if keys:
              priorState = nextState = ""
--- 1006,1024 ----
              start = 0
  
!         # Build the lists of messages: spams, hams and unsure.
          keys, date, prior, this, next = self.buildReviewKeys(start)
+         keyedMessages = {options.header_spam_string: [],
+                          options.header_ham_string: [],
+                          options.header_unsure_string: []}
+         for key in keys:
+             # Parse the message and get the judgement header.
+             cachedMessage = state.unknownCorpus[key]
+             message = mboxutils.get_message(cachedMessage.getSubstance())
+             judgement = message[options.hammie_header_name] or \
+                                             options.header_unsure_string
+             keyedMessages[judgement].append((key, message))
+ 
+         # Present the list of messages in their groups in reverse order of
+         # appearance.
          if keys:
              priorState = nextState = ""
***************
*** 969,996 ****
                  nextState = 'disabled'
              lines = [self.reviewHeader % (prior, next, priorState, nextState)]
!             stripe = 0
!             for key in keys:
!                 # Parse the message and get the relevant headers.
!                 cachedMessage = state.unknownCorpus[key]
!                 message = mboxutils.get_message(cachedMessage.getSubstance())
!                 subject = self.trimAndQuote(message["Subject"] or "(none)", 50)
!                 from_ = self.trimAndQuote(message["From"] or "(none)", 40)
  
-                 # Output the table row for this message.
-                 key = cachedMessage.key()
-                 radioGroup = trainRadio % (key, key, key)
-                 stripeClass = ['stripe_on', 'stripe_off'][stripe]
-                 lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
-                                 <td align='middle'>%s</td></tr>""" % \
-                                 (stripeClass, subject, from_, radioGroup))
-                 stripe = stripe ^ 1
              lines.append("""<tr><td></td><td></td><td align='middle'>&nbsp;<br>
                              <input type='submit' value='Train'></td></tr>""")
              lines.append("</table></form>")
              content = "\n".join(lines)
!             title = "Unclassified messages received on %s" % date
          else:
!             content = "<p>There are no unclassified messages to display.</p>"
!             title = "No unclassified messages"
  
          self.push(self.pageSection % (title, content))
--- 1028,1047 ----
                  nextState = 'disabled'
              lines = [self.reviewHeader % (prior, next, priorState, nextState)]
!             for header, type in ((options.header_spam_string, 'Spam'),
!                                  (options.header_ham_string, 'Ham'),
!                                  (options.header_unsure_string, 'Unsure')):
!                 if keyedMessages[header]:
!                     lines.append("<tr><td>&nbsp;</td><td></td><td></td></tr>")
!                     lines.append(self.reviewSubheader % type)
!                     self.appendMessages(lines, keyedMessages[header], header)
  
              lines.append("""<tr><td></td><td></td><td align='middle'>&nbsp;<br>
                              <input type='submit' value='Train'></td></tr>""")
              lines.append("</table></form>")
              content = "\n".join(lines)
!             title = "Untrained messages received on %s" % date
          else:
!             content = "<p>There are no untrained messages to display.</p>"
!             title = "No untrained messages"
  
          self.push(self.pageSection % (title, content))
***************
*** 1047,1054 ****
          self.logFile = open('_pop3proxy.log', 'wb', 0)
  
!         # Load up the default settings from Option.py / bayescustomize.ini
!         self.proxyPort = options.pop3proxy_port
!         self.serverName = options.pop3proxy_server_name
!         self.serverPort = options.pop3proxy_server_port
          self.databaseFilename = options.persistent_storage_file
          self.useDB = options.persistent_use_database
--- 1098,1134 ----
          self.logFile = open('_pop3proxy.log', 'wb', 0)
  
!         # Load up the old proxy settings from Options.py / bayescustomize.ini
!         # and give warnings if they're present.   XXX Remove these soon.
!         if options.pop3proxy_port != 110 or \
!            options.pop3proxy_server_name != '' or \
!            options.pop3proxy_server_port != 110:
!             print "\n    pop3proxy_port, pop3proxy_server_name and"
!             print "    pop3proxy_server_port are deprecated!  Please use"
!             print "    pop3proxy_servers and pop3proxy_ports instead.\n"
!         self.servers = [(options.pop3proxy_server_name,
!                          options.pop3proxy_server_port)]
!         self.proxyPorts = [options.pop3proxy_port]
! 
!         # Load the new proxy settings - these will override the old ones
!         # if both are present.
!         if options.pop3proxy_servers:
!             self.servers = []
!             for server in options.pop3proxy_servers.split(','):
!                 server = server.strip()
!                 if server.find(':') > -1:
!                     server, port = server.split(':', 1)
!                 else:
!                     port = '110'
!                 self.servers.append((server, int(port)))
! 
!         if options.pop3proxy_ports:
!             splitPorts = options.pop3proxy_ports.split(',')
!             self.proxyPorts = map(int, map(string.strip, splitPorts))
! 
!         if len(self.servers) != len(self.proxyPorts):
!             print "pop3proxy_servers & pop3proxy_ports are different lengths!"
!             sys.exit()
! 
!         # Load up the other settings from Option.py / bayescustomize.ini
          self.databaseFilename = options.persistent_storage_file
          self.useDB = options.persistent_use_database
***************
*** 1074,1077 ****
--- 1154,1164 ----
          self.uniquifier = 2
  
+     def buildServerStrings(self):
+         """After the server details have been set up, this creates string
+         versions of the details, for display in the Status panel."""
+         serverStrings = ["%s:%s" % (s, p) for s, p in self.servers]
+         self.serversString = ', '.join(serverStrings)
+         self.proxyPortsString = ', '.join(map(str, self.proxyPorts))
+ 
      def createWorkers(self):
          """Using the options that were initialised in __init__ and then
***************
*** 1117,1125 ****
  
  
! def main(serverName, serverPort, proxyPort,
!          uiPort, launchUI, databaseFilename, useDB):
      """Runs the proxy forever or until a 'KILL' command is received or
      someone hits Ctrl+Break."""
!     BayesProxyListener(serverName, serverPort, proxyPort)
      UserInterfaceListener(uiPort)
      if launchUI:
--- 1204,1212 ----
  
  
! def main(servers, proxyPorts, uiPort, launchUI):
      """Runs the proxy forever or until a 'KILL' command is received or
      someone hits Ctrl+Break."""
!     for (server, serverPort), proxyPort in zip(servers, proxyPorts):
!         BayesProxyListener(server, serverPort, proxyPort)
      UserInterfaceListener(uiPort)
      if launchUI:
***************
*** 1382,1386 ****
              state.databaseFilename = arg
          elif opt == '-l':
!             state.proxyPort = int(arg)
          elif opt == '-u':
              state.uiPort = int(arg)
--- 1469,1473 ----
              state.databaseFilename = arg
          elif opt == '-l':
!             state.proxyPorts = [int(arg)]
          elif opt == '-u':
              state.uiPort = int(arg)
***************
*** 1393,1396 ****
--- 1480,1484 ----
      if runSelfTest:
          print "\nRunning self-test...\n"
+         state.buildServerStrings()
          test()
          print "Self-test passed."   # ...else it would have asserted.
***************
*** 1403,1420 ****
      elif 0 <= len(args) <= 2:
          # Normal usage, with optional server name and port number.
!         if len(args) >= 1:
!             state.serverName = args[0]
!         if len(args) >= 2:
!             state.serverPort = int(args[1])
  
!         if not state.serverName:
              print >>sys.stderr, \
                    ("Error: You must give a POP3 server name, either in\n"
!                    "bayescustomize.ini as pop3proxy_server_name or on the\n"
                     "command line.  pop3server.py -h prints a usage message.")
          else:
!             main(state.serverName, state.serverPort, state.proxyPort,
!                  state.uiPort, state.launchUI, state.databaseFilename,
!                  state.useDB)
  
      else:
--- 1491,1507 ----
      elif 0 <= len(args) <= 2:
          # Normal usage, with optional server name and port number.
!         if len(args) == 1:
!             state.servers = [(args[0], 110)]
!         elif len(args) == 2:
!             state.servers = [(args[0], int(args[1]))]
  
!         if not state.servers or not state.servers[0][0]:
              print >>sys.stderr, \
                    ("Error: You must give a POP3 server name, either in\n"
!                    "bayescustomize.ini as pop3proxy_servers or on the\n"
                     "command line.  pop3server.py -h prints a usage message.")
          else:
!             state.buildServerStrings()
!             main(state.servers, state.proxyPorts, state.uiPort, state.launchUI)
  
      else:


From jeremy@alum.mit.edu  Wed Nov 20 13:08:34 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Wed, 20 Nov 2002 08:08:34 -0500
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.16,1.17
In-Reply-To: <E18EUEO-00061L-00@sc8-pr-cvs1.sourceforge.net>
References: <E18EUEO-00061L-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <15835.35154.632695.854279@slothrop.zope.com>

>>>>> "RH" == Richie Hindle <richiehindle@users.sourceforge.net> writes:

  RH> Update of /cvsroot/spambayes/spambayes In directory
  RH> sc8-pr-cvs1:/tmp/cvs-serv21143

  RH> Modified Files:
  RH> 	pop3proxy.py
  RH> Log Message:
  RH> o Multiple server support - the old ini-file settings are
  RH>    deprecated; see Options.py

I only glanced at the patch and didn't see what multiple server
support meant, but I wanted to suggest a feature from pspam/pop.py
that I've found very useful.  The proxy determines the real server to
use based on the USER name.  My pop client (VM in XEmacs) is
configured this way:
      "slothrop.zope.com:1110:pass:jeremy@mail.zope.com:"
      "slothrop.zope.com:1110:pass:jhylton@mail.speakeasy.net:"

There's a single server listening on port 1110.  It will act as a
proxy for whatever real server is listed after the @ in the user
name.  It strips off the server name before passing the USER command
on to the real server.
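
The routing described above might be sketched roughly as follows.  This is
not the pspam/pop.py code; the function name split_user_command is
hypothetical, and splitting on the last @ (so that user names may themselves
contain an @) is an assumption.

```python
def split_user_command(line):
    """Split 'USER name@server' into the command to forward and the server.

    Hypothetical helper for illustration; not taken from pspam/pop.py.
    """
    command, _, argument = line.strip().partition(' ')
    if command.upper() != 'USER' or '@' not in argument:
        raise ValueError("expected 'USER name@server'")
    # Assumption: the real server is whatever follows the last '@'.
    user, _, server = argument.rpartition('@')
    return "USER %s" % user, server


forward, server = split_user_command("USER jeremy@mail.zope.com\r\n")
print("connect to %s, then send %r" % (server, forward))
```

The proxy would open its connection to the returned server and pass on only
the stripped USER command.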

Jeremy
 

From mhammond@users.sourceforge.net  Wed Nov 20 22:06:19 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Wed, 20 Nov 2002 14:06:19 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	dump_props.py,1.5,1.6
Message-ID: <E18EczD-0003rQ-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory sc8-pr-cvs1:/tmp/cvs-serv14510

Modified Files:
	dump_props.py 
Log Message:
Work better with multiple stores, and re-implement a folder search
instead of using a brain dead MS one.

-n option shows all store and top-level folder names.
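
The folder search starts by deciding which store a path refers to.  A minimal
sketch of that step, without any MAPI calls, is given here.  The function name
split_folder_path and its default_store parameter are hypothetical; in the
script itself the default store name is looked up through MAPI.  The
lower-casing follows the case-insensitive comparisons in the diff.

```python
def split_folder_path(path, default_store):
    """Return (store_name, folder_names) for a backslash-separated path.

    Hypothetical helper for illustration; not part of dump_props.py.
    """
    if not path:
        raise ValueError("folder path must not be empty")
    names = [n.lower() for n in path.split("\\")]
    if names[0]:
        # No leading backslash: resolve against the default message store.
        return default_store.lower(), names
    # Leading backslash: the first component names the message store.
    return names[1], names[2:]


# The store names below are placeholders.
print(split_folder_path("python\\python-dev", "Personal Folders"))
print(split_folder_path("\\Top of Public Folders\\lists", "Personal Folders"))
```

Each name in the returned list would then be matched, one level at a time,
against the hierarchy table of the folder found so far.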


Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** dump_props.py	4 Nov 2002 00:49:11 -0000	1.5
--- dump_props.py	20 Nov 2002 22:06:17 -0000	1.6
***************
*** 1,2 ****
--- 1,3 ----
+ from __future__ import generators
  # Dump every property we can find for a MAPI item
  
***************
*** 14,35 ****
  session = mapi.MAPILogonEx(0, None, None, logonFlags)
  
! def _FindDefaultMessageStore():
      tab = session.GetMsgStoresTable(0)
-     # Restriction for the table:  get rows where PR_DEFAULT_STORE is true.
-     # There should be only one.
-     restriction = (mapi.RES_PROPERTY,   # a property restriction
-                    (mapi.RELOP_EQ,      # check for equality
-                     PR_DEFAULT_STORE,   # of the PR_DEFAULT_STORE prop
-                     (PR_DEFAULT_STORE, True))) # with True
      rows = mapi.HrQueryAllRows(tab,
!                                (PR_ENTRYID,),   # columns to retrieve
!                                restriction,     # only these rows
                                 None,            # any sort order is fine
                                 0)               # any # of results is fine
!     # get first entry, a (property_tag, value) pair, for PR_ENTRYID
!     row = rows[0]
!     eid_tag, eid = row[0]
!     # Open the store.
!     return session.OpenMsgStore(
                              0,      # no parent window
                              eid,    # msg store to open
--- 15,29 ----
  session = mapi.MAPILogonEx(0, None, None, logonFlags)
  
! def GetMessageStores():
      tab = session.GetMsgStoresTable(0)
      rows = mapi.HrQueryAllRows(tab,
!                                (PR_ENTRYID, PR_DISPLAY_NAME_A, PR_DEFAULT_STORE),   # columns to retrieve
!                                None,     # all rows
                                 None,            # any sort order is fine
                                 0)               # any # of results is fine
!     for row in rows:
!         (eid_tag, eid), (name_tag, name), (def_store_tag, def_store) = row
!         # Open the store.
!         store = session.OpenMsgStore(
                              0,      # no parent window
                              eid,    # msg store to open
***************
*** 40,67 ****
                                  mapi.MDB_NO_MAIL |
                                  mapi.MAPI_DEFERRED_ERRORS)
  
! def _FindItemsWithValue(folder, prop_tag, prop_val):
!     tab = folder.GetContentsTable(0)
!     # Restriction for the table:  get rows where our prop values match
!     restriction = (mapi.RES_CONTENT,   # a property restriction
!                    (mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
!                     prop_tag,   # of the given prop
!                     (prop_tag, prop_val))) # with given val
!     rows = mapi.HrQueryAllRows(tab,
!                                (PR_ENTRYID,),   # columns to retrieve
!                                restriction,     # only these rows
!                                None,            # any sort order is fine
!                                0)               # any # of results is fine
!     # get entry IDs
!     return [row[0][1] for row in rows]
  
! def _FindFolderEID(name):
      assert name
!     from win32com.mapi import exchange
!     if not name.startswith("\\"):
!         name = "\\Top Of Personal Folders\\" + name
!     store = _FindDefaultMessageStore()
!     folder_eid = exchange.HrMAPIFindFolderEx(store, "\\", name)
!     return folder_eid
  
  # Also in new versions of mapituil
--- 34,77 ----
                                  mapi.MDB_NO_MAIL |
                                  mapi.MAPI_DEFERRED_ERRORS)
+         yield store, name, def_store
  
! def _FindSubfolder(store, folder, find_name):
!     find_name = find_name.lower()
!     table = folder.GetHierarchyTable(0)
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
!     for (eid_tag, eid), (name_tag, name), in rows:
!         if name.lower() == find_name:
!             return store.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
!     return None
  
! def FindFolder(name):
      assert name
!     names = [n.lower() for n in name.split("\\")]
!     if names[0]:
!         for store, name, is_default in GetMessageStores():
!             if is_default:
!                 store_name = name.lower()
!                 break
!         folder_names = names
!     else:
!         store_name = names[1]
!         folder_names = names[2:]
!     # Find the store with the name
!     for store, name, is_default in GetMessageStores():
!         if name.lower() == store_name:
!             folder_store = store
!             break
!     else:
!         raise ValueError, "The store '%s' can not be located" % (store_name,)
! 
!     hr, data = store.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
!     subtree_eid = data[0][1]
!     folder = folder_store.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
! 
!     for name in folder_names:
!         folder = _FindSubfolder(folder_store, folder, name)
!         if folder is None:
!             raise ValueError, "The subfolder '%s' can not be located" % (name,)
!     return folder_store, folder        
  
  # Also in new versions of mapituil
***************
*** 85,88 ****
--- 95,114 ----
      return ret
  
+ def _FindItemsWithValue(folder, prop_tag, prop_val):
+     tab = folder.GetContentsTable(0)
+     # Restriction for the table:  get rows where our prop values match
+     restriction = (mapi.RES_CONTENT,   # a property restriction
+                    (mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
+                     prop_tag,   # of the given prop
+                     (prop_tag, prop_val))) # with given val
+     rows = mapi.HrQueryAllRows(tab,
+                                (PR_ENTRYID,),   # columns to retrieve
+                                restriction,     # only these rows
+                                None,            # any sort order is fine
+                                0)               # any # of results is fine
+     # get entry IDs
+     return [row[0][1] for row in rows]
+ 
+ 
  def DumpItemProps(item, shorten):
      for prop_name, prop_val in GetAllProperties(item):
***************
*** 92,100 ****
          print "%-20s: %s" % (prop_name, prop_repr)
  
! def DumpProps(folder_eid, subject, include_attach, shorten):
!     mapi_msgstore = _FindDefaultMessageStore()
!     mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
!                                           None,
!                                           mapi.MAPI_DEFERRED_ERRORS)
      hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
      name = data[0][1]
--- 118,122 ----
          print "%-20s: %s" % (prop_name, prop_repr)
  
! def DumpProps(mapi_msgstore, mapi_folder, subject, include_attach, shorten):
      hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
      name = data[0][1]
***************
*** 117,121 ****
--- 139,160 ----
                  DumpItemProps(attach, shorten)
  
+ def DumpTopLevelFolders():
+     print "Top-level folder names are:"
+     for store, name, is_default in GetMessageStores():
+         # Find the folder with the content.
+         hr, data = store.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
+         subtree_eid = data[0][1]
+         folder = store.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
+         # Now the top-level folders in the store.
+         table = folder.GetHierarchyTable(0)
+         rows = mapi.HrQueryAllRows(table, (PR_DISPLAY_NAME_A), None, None, 0)
+         for (name_tag, folder_name), in rows:
+             print " \\%s\\%s" % (name, folder_name)
+ 
  def usage():
+     def_store_name = "<??unknown??>"
+     for store, name, is_def in GetMessageStores():
+         if is_def:
+             def_store_name = name
      msg = """\
  Usage: %s [-f foldername] subject of the message
***************
*** 123,126 ****
--- 162,166 ----
  -s - Shorten long property values.
  -a - Include attachments
+ -n - Show top-level folder names and exit
  
  Dumps all properties for all messages that match the subject.  Subject
***************
*** 130,140 ****
  as the path separator.  If the folder name begins with a
  \\, it must be a fully-qualified name, including the message
! store name (eg, "Top of Public Folders").  If the path does not
! begin with a \\, it is assumed to be fully-qualifed from the root
! of the default message store
  
! Eg, python\\python-dev' will locate a python-dev subfolder in a python
! subfolder in your default store.
! """ % os.path.basename(sys.argv[0])
      print msg
  
--- 170,180 ----
  as the path separator.  If the folder name begins with a
  \\, it must be a fully-qualified name, including the message
! store name. For example, your Inbox can be specified either as:
!   -f "Inbox"
! or
!   -f "\\%s\\Inbox"
  
! Use the -n option to see all top-level folder names from all stores.
! """ % (os.path.basename(sys.argv[0]), def_store_name)
      print msg
  
***************
*** 143,147 ****
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "af:s")
      except getopt.error, e:
          print e
--- 183,187 ----
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "af:sn")
      except getopt.error, e:
          print e
***************
*** 150,157 ****
          sys.exit(1)
      folder_name = ""
-     subject = " ".join(args)
-     if not subject:
-         usage()
-         sys.exit(1)
  
      shorten = False
--- 190,193 ----
***************
*** 164,167 ****
--- 200,206 ----
          elif opt == "-a":
              include_attach = True
+         elif opt == "-n":
+             DumpTopLevelFolders()
+             sys.exit(1)
          else:
              print "Invalid arg"
***************
*** 171,179 ****
          folder_name = "Inbox" # Assume this exists!
  
!     eid = _FindFolderEID(folder_name)
!     if eid is None:
!         print "*** Cant find folder", folder_name
!         return
!     DumpProps(eid, subject, include_attach, shorten)
  
  if __name__=='__main__':
--- 210,227 ----
          folder_name = "Inbox" # Assume this exists!
  
!     subject = " ".join(args)
!     if not subject:
!         print "You must specify a subject"
!         print
!         usage()
!         sys.exit(1)
! 
!     try:
!         store, folder = FindFolder(folder_name)
!     except ValueError, details:
!         print details
!         sys.exit(1)
! 
!     DumpProps(store, folder, subject, include_attach, shorten)
  
  if __name__=='__main__':
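
The folder-name rule described in the usage text above can be sketched as
follows.  This is only an illustration, assuming a hypothetical helper named
split_folder_path() and a caller that already knows the default store name;
it does not touch MAPI and is not the FindFolder() code from the check-in.

```python
def split_folder_path(path, default_store):
    """Split a folder path into (lowercased store name, [folder names]).

    Hypothetical helper: a path that begins with a backslash must name
    the message store first; any other path is taken relative to the
    root of the default store.
    """
    parts = [p for p in path.split("\\") if p]
    if path.startswith("\\"):
        if not parts:
            raise ValueError("A fully-qualified path must name a store")
        # Store names are compared case-insensitively, so lowercase them.
        return parts[0].lower(), parts[1:]
    return default_store.lower(), parts


qualified = split_folder_path("\\Personal Folders\\Inbox", "Mailbox")
relative = split_folder_path("python\\python-dev", "Mailbox")
```

The returned folder names would then be walked one level at a time, failing
with ValueError as soon as a subfolder cannot be located.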


From richiehindle@users.sourceforge.net  Wed Nov 20 22:41:52 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 20 Nov 2002 14:41:52 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.17,1.18
	Options.py,1.74,1.75
Message-ID: <E18EdXc-0000xy-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv1364

Modified Files:
	pop3proxy.py Options.py 
Log Message:
 o Hovering the mouse on a message subject now displays a
   hovertip with some of the message in it (if that's how your
   browser handles the 'title' attribute).  Thanks to David Ascher
   for the suggestion.
 o Fixed Francois Granger's weird accept() problem (I think).
 o Removed gratuitous quotes from Options.py (thanks to papaDoc)
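
A minimal sketch of the hovertip idea, assuming a hypothetical helper named
make_hovertip() and using html.escape() from current Python in place of the
cgi.escape() call in the diff below.  The point it illustrates is that text
placed inside a title="..." attribute needs its double quotes escaped as
well, which is why trimAndQuote() grows a quote argument.

```python
import html
import re


def make_hovertip(subject, body_text, limit=200):
    """Wrap a subject in a <span> whose title attribute shows body text."""
    # Collapse runs of whitespace, e.g. multiple blank lines.
    text = re.sub(r"(\s)\s+", r"\1", body_text).strip()
    # Trim, adding an ellipsis if necessary.
    if len(text) > limit:
        text = text[:limit - 3] + "..."
    # quote=True also escapes double quotes, needed inside title="...".
    title = html.escape(text, quote=True)
    return '<span title="%s">%s</span>' % (title,
                                           html.escape(subject, quote=False))


tip = make_hovertip('Hi <you>', 'Say "hello"\n\n\nnow')
```

Whether the title text is actually shown on hover is up to the browser.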


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** pop3proxy.py	20 Nov 2002 12:45:21 -0000	1.17
--- pop3proxy.py	20 Nov 2002 22:41:50 -0000	1.18
***************
*** 56,59 ****
--- 56,63 ----
   o Review already-trained messages, and purge them.
   o Put in a link to view a message (plain text, html, multipart...?)
+  o Keyboard navigation (David Ascher).  But aren't Tab and left/right
+    arrow enough?
+  o [Francois Granger] Show the raw spamprob number close to the buttons
+    (this would mean using the extra X-Hammie header by default).
  
  
***************
*** 68,71 ****
--- 72,76 ----
     reload the database.
   o Save the stats (num classified, etc.) between sessions.
+  o "Reload database" button.
  
  
***************
*** 81,84 ****
--- 86,92 ----
   o Remove any existing X-Hammie-Disposition header from incoming emails.
   o Whitelist.
+  o Online manual.
+  o Links to project homepage, mailing list, etc.
+  o Edit settings through the web.
  
  
***************
*** 88,91 ****
--- 96,100 ----
   o Eventually, pull the common HTTP code from pop3proxy.py and Entrian
     Debugger into a library.
+  o Cope with the email client timing out and closing the connection.
  
  
***************
*** 112,115 ****
--- 121,125 ----
  import Bayes, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
+ from email.Iterators import typed_subpart_iterator
  from Options import options
  
***************
*** 140,149 ****
  
      def handle_accept(self):
!         clientSocket, clientAddress = self.accept()
!         args = [clientSocket] + list(self.factoryArgs)
!         if self.socketMap != asyncore.socket_map:
!             self.factory(*args, **{'socketMap': self.socketMap})
!         else:
!             self.factory(*args)
  
  
--- 150,165 ----
  
      def handle_accept(self):
!         # If an incoming connection is instantly reset, eg. by following a
!         # link in the web interface then instantly following another one or
!         # hitting stop, handle_accept() will be triggered but accept() will
!         # return None.
!         result = self.accept()
!         if result:
!             clientSocket, clientAddress = result
!             args = [clientSocket] + list(self.factoryArgs)
!             if self.socketMap != asyncore.socket_map:
!                 self.factory(*args, **{'socketMap': self.socketMap})
!             else:
!                 self.factory(*args)
  
  
***************
*** 792,801 ****
              return form
  
!     def trimAndQuote(self, field, limit):
          """Trims a string, adding an ellipsis if necessary, and
          HTML-quotes it."""
          if len(field) > limit:
              field = field[:limit-3] + "..."
!         return cgi.escape(field)
  
      def onHome(self, params):
--- 808,817 ----
              return form
  
!     def trimAndQuote(self, field, limit, quote=False):
          """Trims a string, adding an ellipsis if necessary, and
          HTML-quotes it."""
          if len(field) > limit:
              field = field[:limit-3] + "..."
!         return cgi.escape(field, quote)
  
      def onHome(self, params):
***************
*** 815,819 ****
          self.push(' ')
          state.bayes.store()
!         self.push("Done</b>.")
  
      def onSave(self, params):
--- 831,835 ----
          self.push(' ')
          state.bayes.store()
!         self.push("Done</b>.\n")
  
      def onSave(self, params):
***************
*** 919,925 ****
          stripe = 0
          for key, message in keyedMessages:
!             # Parse the message and get the relevant headers.
              subject = self.trimAndQuote(message["Subject"] or "(none)", 50)
              from_ = self.trimAndQuote(message["From"] or "(none)", 40)
  
              # Output the table row for this message.
--- 935,955 ----
          stripe = 0
          for key, message in keyedMessages:
!             # Parse the message and get the relevant headers and the first
!             # part of the body if we can.
              subject = self.trimAndQuote(message["Subject"] or "(none)", 50)
              from_ = self.trimAndQuote(message["From"] or "(none)", 40)
+             try:
+                 part = typed_subpart_iterator(message, 'text', 'plain').next()
+                 text = part.get_payload()
+             except StopIteration:
+                 try:
+                     part = typed_subpart_iterator(message, 'text', 'html').next()
+                     text = tokenizer.html_re.sub(' ', part.get_payload())
+                     text = '(this message only has an HTML body)\n' + text
+                 except StopIteration:
+                     text = '(this message has no text body)'
+             text = text.replace('&nbsp;', ' ')      # Else they'll be quoted
+             text = re.sub(r'(\s)\s+', r'\1', text)  # Eg. multiple blank lines
+             text = self.trimAndQuote(text.strip(), 200, True)
  
              # Output the table row for this message.
***************
*** 931,934 ****
--- 961,965 ----
              elif judgement == options.header_unsure_string:
                  defer='checked'
+             subject = "<span title=\"%s\">%s</span>" % (text, subject)
              radioGroup = buttons % (key, key, defer, key, ham, key, spam)
              stripeClass = ['stripe_on', 'stripe_off'][stripe]

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.74
retrieving revision 1.75
diff -C2 -d -r1.74 -r1.75
*** Options.py	20 Nov 2002 12:28:15 -0000	1.74
--- Options.py	20 Nov 2002 22:41:50 -0000	1.75
***************
*** 361,366 ****
  # specify more than one server in pop3proxy_servers, you must specify the
  # same number of ports in pop3proxy_ports.
! pop3proxy_servers: ""
! pop3proxy_ports: ""
  pop3proxy_cache_use_gzip: False
  pop3proxy_cache_expiry_days: 7
--- 361,366 ----
  # specify more than one server in pop3proxy_servers, you must specify the
  # same number of ports in pop3proxy_ports.
! pop3proxy_servers:
! pop3proxy_ports:
  pop3proxy_cache_use_gzip: False
  pop3proxy_cache_expiry_days: 7
***************
*** 370,374 ****
  
  # Deprecated - use pop3proxy_servers and pop3proxy_ports instead.
! pop3proxy_server_name: ""
  pop3proxy_server_port: 110
  pop3proxy_port: 110
--- 370,374 ----
  
  # Deprecated - use pop3proxy_servers and pop3proxy_ports instead.
! pop3proxy_server_name:
  pop3proxy_server_port: 110
  pop3proxy_port: 110


From tim_one@users.sourceforge.net  Thu Nov 21 02:57:07 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 20 Nov 2002 18:57:07 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.35,1.36 msgstore.py,1.31,1.32
Message-ID: <E18EhWd-0001Ll-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv4035/Outlook2000

Modified Files:
	addin.py msgstore.py 
Log Message:
GetEmailPackageObject():  renamed the optional arg to strip_mime_headers,
and put back the default strip of the Content-Transfer-Encoding header I
took out before.  Mark Hammond rediscovered the hard way why it was there
before:  Outlook already delivers decoded text, and leaving the CTE
header in makes the (Python) email pkg try to decode it again.  This
wasn't fatal (because the tokenizer recovers from decoding errors), but
did lead to some weird results.  Explained this all in excruciatingly
long comments, so nobody is tempted to take it out again.
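
The double-decoding problem can be reproduced outside Outlook.  The sketch
below is an assumption-laden illustration, not the msgstore.py code: it
uses the email package from current Python, a made-up message, and a
standalone strip_mime_headers() function standing in for the
strip_mime_headers=True behaviour.  The body is already plain text, but the
leftover Content-Transfer-Encoding header makes the email package decode it
a second time.

```python
import email


def strip_mime_headers(msg):
    """Delete the headers that would trigger a second decoding pass."""
    for name in ("content-type", "content-transfer-encoding"):
        if name in msg:
            del msg[name]
    return msg


# Made-up message: the text is already decoded, but the header remains.
raw = (
    "Subject: demo\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Use width=200 for the image.\n"
)

# With the header left in, "=20" is treated as an encoded space.
double_decoded = email.message_from_string(raw).get_payload(decode=True)

# With the header stripped, the text comes back untouched.
clean = strip_mime_headers(
    email.message_from_string(raw)).get_payload(decode=True)
```

As the log message notes, the cost of stripping is that the classifier no
longer sees that a message was sent encoded, which can itself be a clue.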


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** addin.py	14 Nov 2002 11:07:18 -0000	1.35
--- addin.py	21 Nov 2002 02:57:05 -0000	1.36
***************
*** 249,253 ****
      push("<h2>Message Stream:</h2><br>")
      push("<PRE>\n")
!     msg = msgstore_message.GetEmailPackageObject(strip_content_type=False)
      push(escape(msg.as_string(), True))
      push("</PRE>\n")
--- 249,253 ----
      push("<h2>Message Stream:</h2><br>")
      push("<PRE>\n")
!     msg = msgstore_message.GetEmailPackageObject(strip_mime_headers=False)
      push(escape(msg.as_string(), True))
      push("</PRE>\n")

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.31
retrieving revision 1.32
diff -C2 -d -r1.31 -r1.32
*** msgstore.py	14 Nov 2002 07:04:45 -0000	1.31
--- msgstore.py	21 Nov 2002 02:57:05 -0000	1.32
***************
*** 514,523 ****
              self.mapi_object = self.msgstore._OpenEntry(self.id)
  
!     def GetEmailPackageObject(self, strip_content_type=True):
          # Return an email.Message object.
!         # strip_content_type is a hack, and should be left True unless you're
          # trying to display all the headers for diagnostic purposes.  If we
          # figure out something better to do, it should go away entirely.
!         # The problem:  suppose a msg is multipart/alternative, with
          # text/plain and text/html sections.  The latter MIME decorations
          # are plain missing in what _GetMessageText() returns.  If we leave
--- 514,525 ----
              self.mapi_object = self.msgstore._OpenEntry(self.id)
  
!     def GetEmailPackageObject(self, strip_mime_headers=True):
          # Return an email.Message object.
!         #
!         # strip_mime_headers is a hack, and should be left True unless you're
          # trying to display all the headers for diagnostic purposes.  If we
          # figure out something better to do, it should go away entirely.
!         #
!         # Problem #1:  suppose a msg is multipart/alternative, with
          # text/plain and text/html sections.  The latter MIME decorations
          # are plain missing in what _GetMessageText() returns.  If we leave
***************
*** 530,535 ****
--- 532,547 ----
          # considers the body to be text/plain (the default), and so it
          # does get tokenized.
+         #
+         # Problem #2:  Outlook decodes quoted-printable and base64 on its
+         # own, but leaves any Content-Transfer-Encoding line in the headers.
+         # This can cause the email pkg to try to decode the text again,
+         # with unpleasant (but rarely fatal) results.  If we strip that
+         # header too, no problem -- although the fact that a msg was
+         # encoded in base64 is usually a good spam clue, and we miss that.
+         #
          # Short course:  we either have to synthesize non-insane MIME
          # structure, or eliminate all evidence of original MIME structure.
+         # Since we don't have a way to do the former, by default this function
+         # does the latter.
          import email
          text = self._GetMessageText()
***************
*** 540,546 ****
              raise
  
!         if strip_content_type:
              if msg.has_key('content-type'):
                  del msg['content-type']
  
          return msg
--- 552,560 ----
              raise
  
!         if strip_mime_headers:
              if msg.has_key('content-type'):
                  del msg['content-type']
+             if msg.has_key('content-transfer-encoding'):
+                 del msg['content-transfer-encoding']
  
          return msg


From timstone4@users.sourceforge.net  Thu Nov 21 02:58:39 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Wed, 20 Nov 2002 18:58:39 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.5.2.2,1.5.2.3
Message-ID: <E18EhY7-0001TH-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv5621

Modified Files:
      Tag: hammie-playground
	Bayes.py 
Log Message:
Removed LSDBDict class

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.5.2.2
retrieving revision 1.5.2.3
diff -C2 -d -r1.5.2.2 -r1.5.2.3
*** Bayes.py	20 Nov 2002 04:29:55 -0000	1.5.2.2
--- Bayes.py	21 Nov 2002 02:58:37 -0000	1.5.2.3
***************
*** 165,169 ****
  
  
! class WIDict(dbdict.LSDBDict):
      """LSDBDict optimized for holding lots of WordInfo objects.
  
--- 165,169 ----
  
  
! class WIDict(dbdict.DBDict):
      """LSDBDict optimized for holding lots of WordInfo objects.
  
***************
*** 242,246 ****
  
          self.wordinfo[self.statekey] = (self.nham, self.nspam)
!         self.wordinfo.store()
  
  
--- 242,246 ----
  
          self.wordinfo[self.statekey] = (self.nham, self.nspam)
!         self.wordinfo.sync()
  
  
From timstone4@users.sourceforge.net  Thu Nov 21 02:58:58 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Wed, 20 Nov 2002 18:58:58 -0800
Subject: [Spambayes-checkins] spambayes dbdict.py,1.1.2.2,1.1.2.3
Message-ID: <E18EhYQ-0001Ui-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv5722

Modified Files:
      Tag: hammie-playground
	dbdict.py 
Log Message:
Removed LSDBDict class

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1.2.2
retrieving revision 1.1.2.3
diff -C2 -d -r1.1.2.2 -r1.1.2.3
*** dbdict.py	20 Nov 2002 06:06:28 -0000	1.1.2.2
--- dbdict.py	21 Nov 2002 02:58:56 -0000	1.1.2.3
***************
*** 58,62 ****
  except ImportError:
      import pickle
!     
  import errno
  import copy
--- 58,62 ----
  except ImportError:
      import pickle
! 
  import errno
  import copy
***************
*** 143,201 ****
  
  
- class LSDBDict(DBDict):
-     """Database Dictionary that supports Load/Store semantic."""
- 
-     def __init__(self, dbname, mode=MODE_CREATE, iterskip=()):
-         '''Constructor, dbname, mode {c|n|r|w}, iteration skip tuple'''
- 
-         self.mode = mode
-         self.dbname = dbname
-         self.wdbname = self.dbname+'.working'
-         self.iterskip = iterskip
- 
-         if self.mode == MODE_READWRITE or self.mode == MODE_CREATE:
-             try:
-                 shutil.copyfile(self.dbname, self.wdbname)
-             except (IOError, os.error), why:
-                 pass           # don't blow up for now
-         elif self.mode == MODE_READONLY:
-             # for readonly access, use the real dbm file
-             self.wdbname = self.dbname
-         elif self.mode == MODE_NEW:
-             try:
-                 os.unlink(self.wdbname)
-             except OSError, e:
-                 if e.errno != errno.ENOENT:
-                     raise
-         else:
-             raise ValueError, "Mode must be MODE_CREATE, MODE_NEW, MODE_READONLY, or MODE_READWRITE"
- 
-         self.hash = dbhash.open(self.wdbname, self.mode)
- 
- 
-     def store(self):
-         '''store the working dbm into the 'real' dbm file'''
- 
-         if self.mode != MODE_READONLY:
-             self.hash.close()
-             shutil.copyfile(self.wdbname, self.dbname)
-             self.hash = dbhash.open(self.wdbname, MODE_CREATE)
-         else:
-             raise error, 'Store operation not permitted on readonly dbm'
- 
-     def restore(self):
-         '''restore the working dbm to the 'real' dbm condition'''
- 
-         if self.mode == MODE_READONLY:
-             raise error, \
-                    'Restore operation not permitted on readonly dbm'
-         else:
-            self.hash.close()
- 
-            if self.mode != MODE_NEW:
-                shutil.copyfile(self.dbname, self.wdbname)
- 
-            self.hash = dbhash.open(self.wdbname, self.mode)
-            
  open = DBDict
  
--- 143,146 ----


From npickett@users.sourceforge.net  Thu Nov 21 04:16:39 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Wed, 20 Nov 2002 20:16:39 -0800
Subject: [Spambayes-checkins] 
 spambayes Bayes.py,1.5.2.3,1.5.2.4 Options.py,1.72.2.3,1.72.2.4
 classifier.py,1.53.2.1,1.53.2.2 hammiefilter.py,1.2.2.1,1.2.2.2
Message-ID: <E18Eilb-0006jf-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv25529

Modified Files:
      Tag: hammie-playground
	Bayes.py Options.py classifier.py hammiefilter.py 
Log Message:
Bayes.py: __init__ cleanup

Options.py: moved persistent_storage_file out to hammiefilter and
            pop3proxy sections.
	    
classifier.py: New MetaInfo class which keeps counters
	    for nham and nspam, also a revision, incremented every
	    time either is changed.
					  
            WordInfo class calculates probability on the fly iff
	    MetaInfo revision has changed since last calculation.

            Probabilities are no longer stored in the persistent
            databases.
	       
hammiefilter.py: takes advantage of all this stuff :)
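
The revision scheme described above can be sketched in a few lines.  This is
a simplified illustration rather than the classifier.py code in the diff
below: the two constants are assumed stand-ins for
options.unknown_word_strength and options.unknown_word_prob, the
ham/spam imbalance adjustment is left out, and the recalculations counter
exists only to make the caching visible.

```python
UNKNOWN_WORD_STRENGTH = 0.45  # assumed stand-in for the option value
UNKNOWN_WORD_PROB = 0.5       # assumed stand-in for the option value


class MetaInfo(object):
    """Corpus counters plus a revision bumped on every change."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.revision = 0

    def incr_ham(self, amt=1):
        self.nham += amt
        self.revision += 1

    def incr_spam(self, amt=1):
        self.nspam += amt
        self.revision += 1


class WordInfo(object):
    """Per-word counts; at least one count must be non-zero."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.spamprob = None
        self.revision = None
        self.recalculations = 0  # illustration only

    def probability(self, meta):
        # Recompute only if the corpus changed since the cached value.
        if meta.revision != self.revision:
            hamratio = self.hamcount / float(meta.nham or 1)
            spamratio = self.spamcount / float(meta.nspam or 1)
            p = spamratio / (hamratio + spamratio)
            n = self.hamcount + self.spamcount
            s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
            # Robinson's adjustment: f(w) = (s*x + n*p) / (s + n)
            self.spamprob = (s * x + n * p) / (s + n)
            self.revision = meta.revision
            self.recalculations += 1
        return self.spamprob


meta = MetaInfo()
meta.incr_ham(10)
meta.incr_spam(10)
word = WordInfo(spamcount=4, hamcount=1)
first = word.probability(meta)    # computed
again = word.probability(meta)    # served from the cached value
meta.incr_spam(10)                # bumps the revision
updated = word.probability(meta)  # recomputed with the new counts
```

Because only the counts need to be pickled, the probabilities can stay out
of the persistent database and be rebuilt lazily after loading.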


Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.5.2.3
retrieving revision 1.5.2.4
diff -C2 -d -r1.5.2.3 -r1.5.2.4
*** Bayes.py	21 Nov 2002 02:58:37 -0000	1.5.2.3
--- Bayes.py	21 Nov 2002 04:16:36 -0000	1.5.2.4
***************
*** 71,74 ****
--- 71,75 ----
          '''Constructor(database name)'''
  
+         classifier.Bayes.__init__(self)
          self.db_name = db_name
          self.load()
***************
*** 186,190 ****
              # We could be sneaky, like pickle.Unpickler.load_inst,
              # but I think that's overly confusing.
!             obj = classifier.WordInfo(0)
              obj.__setstate__(val)
              return obj
--- 187,191 ----
              # We could be sneaky, like pickle.Unpickler.load_inst,
              # but I think that's overly confusing.
!             obj = classifier.WordInfo()
              obj.__setstate__(val)
              return obj
***************
*** 211,215 ****
          self.statekey = "saved state"
  
!         self.load()
  
      def load(self):
--- 212,216 ----
          self.statekey = "saved state"
  
!         PersistentBayes.__init__(self, db_name)
  
      def load(self):

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72.2.3
retrieving revision 1.72.2.4
diff -C2 -d -r1.72.2.3 -r1.72.2.4
*** Options.py	20 Nov 2002 06:06:27 -0000	1.72.2.3
--- Options.py	21 Nov 2002 04:16:36 -0000	1.72.2.4
***************
*** 346,352 ****
  clue_mailheader_cutoff: 0.5
  
- # The default database path used by hammie
- persistent_storage_file: hammie.db
- 
  [hammiefilter]
  # hammiefilter can use either a database (quick to score one message) or
--- 346,349 ----
***************
*** 354,357 ****
--- 351,355 ----
  # True to use a database by default.
  hammiefilter_persistent_use_database: True
+ hammiefilter_persistent_storage_file: ~/.hammiedb
  
  [pop3proxy]
***************
*** 360,364 ****
  # The only mandatory option is pop3proxy_server_name, eg. pop3.my-isp.com,
  # but that can come from the command line - see "pop3proxy -h".
! pop3proxy_server_name: ""
  pop3proxy_server_port: 110
  pop3proxy_port: 110
--- 358,362 ----
  # The only mandatory option is pop3proxy_server_name, eg. pop3.my-isp.com,
  # but that can come from the command line - see "pop3proxy -h".
! pop3proxy_server_name: 
  pop3proxy_server_port: 110
  pop3proxy_port: 110
***************
*** 369,373 ****
  pop3proxy_unknown_cache: pop3proxy-unknown-cache
  pop3proxy_persistent_use_database: False
! pop3proxy_persistent_storage_file: ""
  
  [html_ui]
--- 367,371 ----
  pop3proxy_unknown_cache: pop3proxy-unknown-cache
  pop3proxy_persistent_use_database: False
! pop3proxy_persistent_storage_file: hammie.db
  
  [html_ui]
***************
*** 433,437 ****
                    },
      'Hammie': {'hammie_header_name': string_cracker,
-                'persistent_storage_file': string_cracker,
                 'clue_mailheader_cutoff': float_cracker,
                 'persistent_use_database': boolean_cracker,
--- 431,434 ----
***************
*** 445,448 ****
--- 442,446 ----
                 },
      'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker,
+                       'hammiefilter_persistent_storage_file': string_cracker,
                        },
      'pop3proxy': {'pop3proxy_server_name': string_cracker,

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.1
retrieving revision 1.53.2.2
diff -C2 -d -r1.53.2.1 -r1.53.2.2
*** classifier.py	20 Nov 2002 06:06:28 -0000	1.53.2.1
--- classifier.py	21 Nov 2002 04:16:36 -0000	1.53.2.2
***************
*** 32,36 ****
  
  import math
- import time
  from sets import Set
  
--- 32,35 ----
***************
*** 49,90 ****
  PICKLE_VERSION = 1
  
! class WordInfo(object):
!     __slots__ = ('atime',     # when this record was last used by scoring(*)
!                  'spamcount', # # of spams in which this word appears
!                  'hamcount',  # # of hams in which this word appears
!                  'killcount', # # of times this made it to spamprob()'s nbest
!                  'spamprob',  # prob(spam | msg contains this word)
!                 )
  
      # Invariant:  For use in a classifier database, at least one of
      # spamcount and hamcount must be non-zero.
-     #
-     # (*)atime is the last access time, a UTC time.time() value.  It's the
-     # most recent time this word was used by scoring (i.e., by spamprob(),
-     # not by training via learn()); or, if the word has never been used by
-     # scoring, the time the word record was created (i.e., by learn()).
-     # One good criterion for identifying junk (word records that have no
-     # value) is to delete words that haven't been used for a long time.
-     # Perhaps they were typos, or unique identifiers, or relevant to a
-     # once-hot topic or scam that's fallen out of favor.  Whatever, if
-     # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
!         self.atime = atime
!         self.spamcount = self.hamcount = self.killcount = 0
!         self.spamprob = spamprob
  
      def __repr__(self):
!         return "WordInfo%r" % repr((self.atime, self.spamcount,
!                                     self.hamcount, self.killcount,
                                      self.spamprob))
  
      def __getstate__(self):
!         return (self.atime, self.spamcount, self.hamcount, self.killcount,
!                 self.spamprob)
  
      def __setstate__(self, t):
!         (self.atime, self.spamcount, self.hamcount, self.killcount,
!          self.spamprob) = t
  
  class Bayes:
--- 48,196 ----
  PICKLE_VERSION = 1
  
! class MetaInfo(object):
!     """Information about the corpora.
! 
!     Contains nham and nspam, used for calculating probabilities.  Also
!     has a revision, incremented every time nham or nspam is adjusted to
!     invalidate any cached probabilities.
!     
!     """
!     def __init__(self):
!         self._nham = 0
!         self._nspam = 0
!         self.revision = 0
! 
!     def __repr__(self):
!         return "MetaInfo%r" % repr((self._nham,
!                                     self._nspam,
!                                     self.revision))
! 
!     def __getstate__(self):
!         return (self._nham, self._nspam)
! 
!     def __setstate__(self, t):
!         (self._nham, self._nspam) = t
! 
!     def nham(self):
!         return self._nham
! 
!     def nspam(self):
!         return self._nspam
! 
!     def incr_rev(self):
!         self.revision += 1
!         
!     def incr_ham(self, amt=1):
!         self._nham += amt
!         self.incr_rev()
  
+     def incr_spam(self, amt=1):
+         self._nspam += amt
+         self.incr_rev()
+     
+ 
+ class WordInfo(object):
      # Invariant:  For use in a classifier database, at least one of
      # spamcount and hamcount must be non-zero.
  
!     def __init__(self):
!         self.__setstate__((0, 0))
  
      def __repr__(self):
!         return "WordInfo%r" % repr((self.spamcount,
!                                     self.hamcount,
                                      self.spamprob))
  
      def __getstate__(self):
!         return (self.spamcount,
!                 self.hamcount)
  
      def __setstate__(self, t):
!         (self.spamcount, self.hamcount) = t
!         self.spamprob = None
!         self.revision = None
! 
!     def _update_probability(self, meta):
!         """Compute and store p(word) = prob(msg is spam | msg contains word).
!         
!         This is the Graham calculation, but stripped of biases, and
!         stripped of clamping into 0.01 thru 0.99.  The Bayesian
!         adjustment following keeps them in a sane range, and one
!         that naturally grows the more evidence there is to back up
!         a probability.
! 
!         Returns True if the probability changed, False otherwise.
!         """
! 
!         nham = float(meta.nham() or 1)
!         nspam = float(meta.nspam() or 1)
! 
!         if options.experimental_ham_spam_imbalance_adjustment:
!             spam2ham = min(nspam / nham, 1.0)
!             ham2spam = min(nham / nspam, 1.0)
!         else:
!             spam2ham = ham2spam = 1.0
! 
!         S = options.unknown_word_strength
!         StimesX = S * options.unknown_word_prob
!                 
!         assert self.hamcount <= nham
!         hamratio = self.hamcount / nham
! 
!         assert self.spamcount <= nspam
!         spamratio = self.spamcount / nspam
! 
!         prob = spamratio / (hamratio + spamratio)
! 
!         # Now do Robinson's Bayesian adjustment.
!         #
!         #         s*x + n*p(w)
!         # f(w) = --------------
!         #           s + n
!         #
!         # I find this easier to reason about like so (equivalent when
!         # s != 0):
!         #
!         #        x - p
!         #  p +  -------
!         #       1 + n/s
!         #
!         # IOW, it moves p a fraction of the distance from p to x, and
!         # less so the larger n is, or the smaller s is.
! 
!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.
! 
!         n = self.hamcount * spam2ham  +  self.spamcount * ham2spam
!         prob = (StimesX + n * prob) / (S + n)
! 
!         self.revision = meta.revision
!         if self.spamprob != prob:
!             self.spamprob = prob
!             return True
!         else:
!             return False
! 
!     def probability(self, meta):
!         """Return this word's spam probability, recalculating if needed."""
!         if meta.revision != self.revision:
!             self._update_probability(meta)
!         return self.spamprob
! 
  
  class Bayes:
***************
*** 105,117 ****
      def __init__(self):
          self.wordinfo = {}
!         self.nspam = self.nham = 0
  
      def __getstate__(self):
!         return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.wordinfo, self.nspam, self.nham = t[1:]
  
      # spamprob() implementations.  One of the following is aliased to
--- 211,223 ----
      def __init__(self):
          self.wordinfo = {}
!         self.meta = MetaInfo()
  
      def __getstate__(self):
!         return PICKLE_VERSION, self.wordinfo, self.meta
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.wordinfo, self.meta = t[1:]
  
      # spamprob() implementations.  One of the following is aliased to
***************
*** 145,150 ****
          clues = self._getclues(wordstream)
          for prob, word, record in clues:
-             if record is not None:  # else wordinfo doesn't know about it
-                 record.killcount += 1
              P *= 1.0 - prob
              Q *= prob
--- 251,254 ----
***************
*** 234,239 ****
          clues = self._getclues(wordstream)
          for prob, word, record in clues:
-             if record is not None:  # else wordinfo doesn't know about it
-                 record.killcount += 1
              S *= 1.0 - prob
              H *= prob
--- 338,341 ----
***************
*** 278,282 ****
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam, update_probabilities=True):
          """Teach the classifier by example.
  
--- 380,384 ----
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam, update_word_probabilities=True):
          """Teach the classifier by example.
  
***************
*** 285,302 ****
          else that it's definitely not spam.
  
!         If optional arg update_probabilities is False (the default is True),
!         don't update word probabilities.  Updating them is expensive, and if
!         you're going to pass many messages to learn(), it's more efficient
!         to pass False here and call update_probabilities() once when you're
!         done -- or to call learn() with update_probabilities=True when
!         passing the last new example.  The important thing is that the
!         probabilities get updated before calling spamprob() again.
          """
  
!         self._add_msg(wordstream, is_spam)
!         if update_probabilities:
!             self.update_probabilities()
  
!     def unlearn(self, wordstream, is_spam, update_probabilities=True):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
--- 387,403 ----
          else that it's definitely not spam.
  
!         If optional arg update_word_probabilities is False (the default
!         is True), don't update individual words' probabilities.
!         Updating them is expensive, and if you're going to pass many
!         messages to learn(), it's more efficient to pass False here and
!         call update_probabilities() once when you're done.  The
!         important thing is that the probabilities get updated before
!         calling spamprob() again.
!         
          """
  
!         self._add_msg(wordstream, is_spam, update_word_probabilities)
  
!     def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
***************
*** 304,310 ****
          """
  
!         self._remove_msg(wordstream, is_spam)
!         if update_probabilities:
!             self.update_probabilities()
  
      def update_probabilities(self):
--- 405,409 ----
          """
  
!         self._remove_msg(wordstream, is_spam, update_word_probabilities)
  
      def update_probabilities(self):
***************
*** 320,410 ****
  
          for word, record in self.wordinfo.iteritems():
!             self.update_word(word, record)
!                 
!     def update_word(self, word, record):
!         """Compute p(word) = prob(msg is spam | msg contains word).
!         
!         This is the Graham calculation, but stripped of biases, and
!         stripped of clamping into 0.01 thru 0.99.  The Bayesian
!         adjustment following keeps them in a sane range, and one
!         that naturally grows the more evidence there is to back up
!         a probability.
!         """
!         nham = float(self.nham or 1)
!         nspam = float(self.nspam or 1)
! 
!         if options.experimental_ham_spam_imbalance_adjustment:
!             spam2ham = min(nspam / nham, 1.0)
!             ham2spam = min(nham / nspam, 1.0)
!         else:
!             spam2ham = ham2spam = 1.0
! 
!         S = options.unknown_word_strength
!         StimesX = S * options.unknown_word_prob
!                 
!         hamcount = record.hamcount
!         assert hamcount <= nham
!         hamratio = hamcount / nham
! 
!         spamcount = record.spamcount
!         assert spamcount <= nspam
!         spamratio = spamcount / nspam
! 
!         prob = spamratio / (hamratio + spamratio)
! 
!         # Now do Robinson's Bayesian adjustment.
!         #
!         #         s*x + n*p(w)
!         # f(w) = --------------
!         #           s + n
!         #
!         # I find this easier to reason about like so (equivalent when
!         # s != 0):
!         #
!         #        x - p
!         #  p +  -------
!         #       1 + n/s
!         #
!         # IOW, it moves p a fraction of the distance from p to x, and
!         # less so the larger n is, or the smaller s is.
! 
!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.
! 
!         n = hamcount * spam2ham  +  spamcount * ham2spam
!         prob = (StimesX + n * prob) / (S + n)
! 
!         if record.spamprob != prob:
!             record.spamprob = prob
!             # The next seemingly pointless line appears to be a hack
!             # to allow a persistent db to realize the record has changed.
!             self.wordinfo[word] = record
! 
!     def clearjunk(self, oldesttime):
!         """Forget useless wordinfo records.  This can shrink the database size.
! 
!         A record for a word will be retained only if the word was accessed
!         at or after oldesttime.
!         """
! 
!         wordinfo = self.wordinfo
!         tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
!         for w in tonuke:
!             del wordinfo[w]
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
--- 419,425 ----
  
          for word, record in self.wordinfo.iteritems():
!             # This method updates probability iff the metainfo revision
!             # has changed.
!             record.probability(self.meta)
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
***************
*** 428,444 ****
      # appears in a msg, but distorting spamprob doesn't appear a correct way
      # to exploit it.
!     def _add_msg(self, wordstream, is_spam):
          if is_spam:
!             self.nspam += 1
          else:
!             self.nham += 1
  
          wordinfo = self.wordinfo
          wordinfoget = wordinfo.get
-         now = time.time()
          for word in Set(wordstream):
              record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass(now)
  
              if is_spam:
--- 443,458 ----
      # appears in a msg, but distorting spamprob doesn't appear a correct way
      # to exploit it.
!     def _add_msg(self, wordstream, is_spam, update_word_probabilities):
          if is_spam:
!             self.meta.incr_spam()
          else:
!             self.meta.incr_ham()
  
          wordinfo = self.wordinfo
          wordinfoget = wordinfo.get
          for word in Set(wordstream):
              record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass()
  
              if is_spam:
***************
*** 446,461 ****
              else:
                  record.hamcount += 1
!             # Needed to tell a persistent DB that the content changed.
!             wordinfo[word] = record
  
!     def _remove_msg(self, wordstream, is_spam):
          if is_spam:
!             if self.nspam <= 0:
                  raise ValueError("spam count would go negative!")
!             self.nspam -= 1
          else:
!             if self.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.nham -= 1
  
          wordinfo = self.wordinfo
--- 460,480 ----
              else:
                  record.hamcount += 1
!                 
!             if update_word_probabilities:
!                 self.update_word_probability(word, record)
!             else:
!                 # Needed to tell a persistent DB that the content changed.
!                 wordinfo[word] = record
  
! 
!     def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
          if is_spam:
!             if self.meta.nspam() <= 0:
                  raise ValueError("spam count would go negative!")
!             self.meta.incr_spam(-1)
          else:
!             if self.meta.nham() <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.meta.incr_ham(-1)
  
          wordinfo = self.wordinfo
***************
*** 472,477 ****
                  if record.hamcount == 0 == record.spamcount:
                      del wordinfo[word]
                  else:
!                     # Needed to tell a persistent DB that the content changed.
                      wordinfo[word] = record
  
--- 491,499 ----
                  if record.hamcount == 0 == record.spamcount:
                      del wordinfo[word]
+                 elif update_word_probabilities:
+                     update_word_probability(word, record)
                  else:
!                     # Needed to tell a persistent DB that the content
!                     # changed.
                      wordinfo[word] = record
  
***************
*** 484,488 ****
  
          wordinfoget = self.wordinfo.get
-         now = time.time()
          for word in Set(wordstream):
              record = wordinfoget(word)
--- 506,509 ----
***************
*** 490,495 ****
                  prob = unknown
              else:
!                 record.atime = now
!                 prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist:
--- 511,515 ----
                  prob = unknown
              else:
!                 prob = record.probability(self.meta)
              distance = abs(prob - 0.5)
              if distance >= mindist:

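Editorial note: the comments in the classifier.py diff above describe Robinson's Bayesian adjustment and the experimental ham/spam imbalance correction. The following is a minimal standalone sketch of that arithmetic in modern Python, not the project's code. The function names and the `balance` flag are hypothetical, and the defaults `s=0.45` and `x=0.5` are illustrative stand-ins for `options.unknown_word_strength` and `options.unknown_word_prob`; they are not taken from this diff.

```python
def robinson_adjust(hamcount, spamcount, nham, nspam,
                    s=0.45, x=0.5, balance=False):
    """Return f(w) = (s*x + n*p(w)) / (s + n) for a single word.

    s and x are illustrative defaults (assumptions, not from the diff).
    """
    nham = float(nham or 1)      # avoid dividing by zero on an empty corpus
    nspam = float(nspam or 1)
    if hamcount + spamcount == 0:
        return x                 # no evidence at all: fall back to the prior x
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    if balance:
        # Scale each count down to what min(nspam, nham) would justify.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = hamcount * spam2ham + spamcount * ham2spam
    return (s * x + n * p) / (s + n)


def shifted_form(p, n, s=0.45, x=0.5):
    """The equivalent form from the comment: p + (x - p) / (1 + n/s)."""
    return p + (x - p) / (1.0 + n / s)
```

For a word seen once in spam and never in ham, with 10 messages of each kind, the raw p(w) is 1.0 and `robinson_adjust(0, 1, 10, 10)` pulls it back to about 0.845; `shifted_form(1.0, 1)` gives the same value.
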
Index: hammiefilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
retrieving revision 1.2.2.1
retrieving revision 1.2.2.2
diff -C2 -d -r1.2.2.1 -r1.2.2.2
*** hammiefilter.py	19 Nov 2002 23:45:25 -0000	1.2.2.1
--- hammiefilter.py	21 Nov 2002 04:16:36 -0000	1.2.2.2
***************
*** 52,89 ****
      sys.exit(code)
  
! def newdb():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'n')
!     h.store()
!     print "Created new database in", options.persistent_storage_file
  
! def filter():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'r')
!     msg = sys.stdin.read()
!     print h.filter(msg)
  
! def train_ham():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'w')
!     msg = sys.stdin.read()
!     h.train_ham(msg)
!     h.update_probabilities()
!     h.store()
  
! def train_spam():
!     h = hammie.open(options.persistent_storage_file,
!                     options.hammiefilter_persistent_use_database,
!                     'w')
!     msg = sys.stdin.read()
!     h.train_spam(msg)
!     h.update_probabilities()
!     h.store()
  
  def main():
!     action = filter
      opts, args = getopt.getopt(sys.argv[1:], 'hngs')
      for opt, arg in opts:
--- 52,93 ----
      sys.exit(code)
  
! class HammieFilter(object):
!     def __init__(self):
!         options = Options.options
!         options.mergefiles(['/etc/hammierc',
!                             os.path.expanduser('~/.hammierc')])
!         
!         self.dbname = options.hammiefilter_persistent_storage_file
!         self.dbname = os.path.expanduser(self.dbname)
!         self.usedb = options.hammiefilter_persistent_use_database
!         
  
!     def newdb(self):
!         h = hammie.open(self.dbname, self.usedb, 'n')
!         h.store()
!         print "Created new database in", self.dbname
  
!     def filter(self):
!         h = hammie.open(self.dbname, self.usedb, 'r')
!         msg = sys.stdin.read()
!         print h.filter(msg)
  
!     def train_ham(self):
!         h = hammie.open(self.dbname, self.usedb, 'c')
!         msg = sys.stdin.read()
!         h.train_ham(msg)
!         h.update_probabilities()
!         h.store()
! 
!     def train_spam(self):
!         h = hammie.open(self.dbname, self.usedb, 'c')
!         msg = sys.stdin.read()
!         h.train_spam(msg)
!         h.update_probabilities()
!         h.store()
  
  def main():
!     h = HammieFilter()
!     action = h.filter
      opts, args = getopt.getopt(sys.argv[1:], 'hngs')
      for opt, arg in opts:
***************
*** 91,103 ****
              usage(0)
          elif opt == '-g':
!             action = train_ham
          elif opt == '-s':
!             action = train_spam
          elif opt == "-n":
!             action = newdb
! 
!     # hammiefilter overrides
!     options.mergefiles(['/etc/hammierc',
!                         os.path.expanduser('~/.hammierc')])
  
      action()
--- 95,103 ----
              usage(0)
          elif opt == '-g':
!             action = h.train_ham
          elif opt == '-s':
!             action = h.train_spam
          elif opt == "-n":
!             action = h.newdb
  
      action()

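Editorial note: the hammiefilter.py change above moves the module-level actions into a class and has main() select a bound method. A minimal sketch of that dispatch pattern in modern Python follows. `FilterActions` and `choose_action` are hypothetical names, and the actions only return a label instead of opening a database.

```python
import getopt


class FilterActions:
    """Hypothetical stand-in for HammieFilter; each action only reports itself."""

    def __init__(self, dbname):
        self.dbname = dbname

    def newdb(self):
        return "newdb " + self.dbname

    def filter(self):
        return "filter " + self.dbname

    def train_ham(self):
        return "train_ham " + self.dbname

    def train_spam(self):
        return "train_spam " + self.dbname


def choose_action(argv, actions):
    """Map -n/-g/-s to a bound method; filtering is the default."""
    action = actions.filter
    opts, args = getopt.getopt(argv, "hngs")
    for opt, arg in opts:
        if opt == "-g":
            action = actions.train_ham
        elif opt == "-s":
            action = actions.train_spam
        elif opt == "-n":
            action = actions.newdb
    return action
```

Because each bound method carries its instance, the configuration read in `__init__` travels with the chosen action, so `choose_action(["-s"], FilterActions("db"))()` needs no further arguments.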

From npickett@users.sourceforge.net  Thu Nov 21 04:27:30 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Wed, 20 Nov 2002 20:27:30 -0800
Subject: [Spambayes-checkins] 
 spambayes Bayes.py,1.5.2.4,1.5.2.5 classifier.py,1.53.2.2,1.53.2.3
 hammie.py,1.40.2.1,1.40.2.2 hammiefilter.py,1.2.2.2,1.2.2.3
Message-ID: <E18Eiw6-0007MU-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv27763

Modified Files:
      Tag: hammie-playground
	Bayes.py classifier.py hammie.py hammiefilter.py 
Log Message:
* A few more MetaInfo class-related changes which I somehow
  overlooked.  hammiefilter will need to start with a new database.

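Editorial note: in the diffs below, the state stored under `statekey` changes from an `(nham, nspam)` tuple to a MetaInfo object, and `MetaInfo.__setstate__` resets `revision` to 0. The sketch below illustrates that state round trip in modern Python. The `__getstate__` body is an assumption (it is not shown in this message), and `roundtrip` is a hypothetical helper that imitates what pickle does with these two hooks.

```python
class MetaInfo:
    """Sketch of the message-count holder; attribute names follow the diff."""

    def __init__(self):
        self.__setstate__((0, 0))

    def __getstate__(self):
        # Assumed: only the two counts are persisted, not the revision.
        return (self._nham, self._nspam)

    def __setstate__(self, t):
        (self._nham, self._nspam) = t
        self.revision = 0


def roundtrip(obj):
    """Imitate a pickle save/load cycle using the two state hooks."""
    state = obj.__getstate__()
    clone = obj.__class__.__new__(obj.__class__)
    clone.__setstate__(state)
    return clone
```

Under these assumptions the counts survive a save and reload, while the revision counter always restarts at 0.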

Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.5.2.4
retrieving revision 1.5.2.5
diff -C2 -d -r1.5.2.4 -r1.5.2.5
*** Bayes.py	21 Nov 2002 04:16:36 -0000	1.5.2.4
--- Bayes.py	21 Nov 2002 04:27:26 -0000	1.5.2.5
***************
*** 224,238 ****
  
          if self.wordinfo.has_key(self.statekey):
! 
!             self.nham, self.nspam = self.wordinfo[self.statekey]
              if Corpus.Verbose:
!                 print '%s is an existing DBDict, with %d ham and %d spam' \
!                       % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
              if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
-             self.nham = 0
-             self.nspam = 0
  
      def store(self):
--- 224,235 ----
  
          if self.wordinfo.has_key(self.statekey):
!             self.meta = self.wordinfo[self.statekey]
              if Corpus.Verbose:
!                 print '%s is an existing DBDict' \
!                       % (self.db_name)
          else:
              # new dbdict
              if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
  
      def store(self):
***************
*** 242,246 ****
              print 'Persisting',self.db_name,'state in DBDict'
  
!         self.wordinfo[self.statekey] = (self.nham, self.nspam)
          self.wordinfo.sync()
  
--- 239,243 ----
              print 'Persisting',self.db_name,'state in DBDict'
  
!         self.wordinfo[self.statekey] = self.meta
          self.wordinfo.sync()
  

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.2
retrieving revision 1.53.2.3
diff -C2 -d -r1.53.2.2 -r1.53.2.3
*** classifier.py	21 Nov 2002 04:16:36 -0000	1.53.2.2
--- classifier.py	21 Nov 2002 04:27:27 -0000	1.53.2.3
***************
*** 57,63 ****
      """
      def __init__(self):
!         self._nham = 0
!         self._nspam = 0
!         self.revision = 0
  
      def __repr__(self):
--- 57,61 ----
      """
      def __init__(self):
!         self.__setstate__((0, 0))
  
      def __repr__(self):
***************
*** 71,74 ****
--- 69,73 ----
      def __setstate__(self, t):
          (self._nham, self._nspam) = t
+         self.revision = 0
  
      def nham(self):
***************
*** 380,384 ****
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam, update_word_probabilities=True):
          """Teach the classifier by example.
  
--- 379,383 ----
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam):
          """Teach the classifier by example.
  
***************
*** 397,403 ****
          """
  
!         self._add_msg(wordstream, is_spam, update_word_probabilities)
  
!     def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
--- 396,402 ----
          """
  
!         self._add_msg(wordstream, is_spam)
  
!     def unlearn(self, wordstream, is_spam):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
***************
*** 405,409 ****
          """
  
!         self._remove_msg(wordstream, is_spam, update_word_probabilities)
  
      def update_probabilities(self):
--- 404,408 ----
          """
  
!         self._remove_msg(wordstream, is_spam)
  
      def update_probabilities(self):
***************
*** 443,447 ****
      # appears in a msg, but distorting spamprob doesn't appear a correct way
      # to exploit it.
!     def _add_msg(self, wordstream, is_spam, update_word_probabilities):
          if is_spam:
              self.meta.incr_spam()
--- 442,446 ----
      # appears in a msg, but distorting spamprob doesn't appear a correct way
      # to exploit it.
!     def _add_msg(self, wordstream, is_spam):
          if is_spam:
              self.meta.incr_spam()
***************
*** 461,472 ****
                  record.hamcount += 1
                  
!             if update_word_probabilities:
!                 self.update_word_probability(word, record)
!             else:
!                 # Needed to tell a persistent DB that the content changed.
!                 wordinfo[word] = record
  
  
!     def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
          if is_spam:
              if self.meta.nspam() <= 0:
--- 460,468 ----
                  record.hamcount += 1
                  
!             # Needed to tell a persistent DB that the content changed.
!             wordinfo[word] = record
  
  
!     def _remove_msg(self, wordstream, is_spam):
          if is_spam:
              if self.meta.nspam() <= 0:
***************
*** 491,496 ****
                  if record.hamcount == 0 == record.spamcount:
                      del wordinfo[word]
-                 elif update_word_probabilities:
-                     update_word_probability(word, record)
                  else:
                      # Needed to tell a persistent DB that the content
--- 487,490 ----

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40.2.1
retrieving revision 1.40.2.2
diff -C2 -d -r1.40.2.1 -r1.40.2.2
*** hammie.py	19 Nov 2002 23:45:24 -0000	1.40.2.1
--- hammie.py	21 Nov 2002 04:27:27 -0000	1.40.2.2
***************
*** 136,140 ****
          """
  
!         self.bayes.learn(tokenize(msg), is_spam, False)
  
      def train_ham(self, msg):
--- 136,140 ----
          """
  
!         self.bayes.learn(tokenize(msg), is_spam)
  
      def train_ham(self, msg):
***************
*** 161,180 ****
  
          self.train(msg, True)
- 
-     def update_probabilities(self, store=True):
-         """Update probability values.
- 
-         You would want to call this after a training session.  It's
-         pretty slow, so if you have a lot of messages to train, wait
-         until you're all done before calling this.
- 
-         Unless store is false, the peristent store will be written after
-         updating probabilities.
- 
-         """
- 
-         self.bayes.update_probabilities()
-         if store:
-             self.store()
  
      def store(self):
--- 161,164 ----

Index: hammiefilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
retrieving revision 1.2.2.2
retrieving revision 1.2.2.3
diff -C2 -d -r1.2.2.2 -r1.2.2.3
*** hammiefilter.py	21 Nov 2002 04:16:36 -0000	1.2.2.2
--- hammiefilter.py	21 Nov 2002 04:27:27 -0000	1.2.2.3
***************
*** 77,81 ****
          msg = sys.stdin.read()
          h.train_ham(msg)
-         h.update_probabilities()
          h.store()
  
--- 77,80 ----
***************
*** 84,88 ****
          msg = sys.stdin.read()
          h.train_spam(msg)
-         h.update_probabilities()
          h.store()
  
--- 83,86 ----


From npickett@users.sourceforge.net  Thu Nov 21 04:36:23 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Wed, 20 Nov 2002 20:36:23 -0800
Subject: [Spambayes-checkins] spambayes hammiecli.py,1.2,1.3
Message-ID: <E18Ej4h-0007tf-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv30038

Modified Files:
	hammiecli.py 
Log Message:
hammiecli.py: Print out the Binary data, not the Binary itself
              (thanks Ranieri J D Severiano)

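Editorial note: the fix below prints the `data` attribute of the returned Binary wrapper instead of the wrapper object. A minimal sketch using the Python 3 name of the same module (`xmlrpc.client`, formerly `xmlrpclib`) follows; the message text is made up for illustration.

```python
from xmlrpc.client import Binary

# Wrap raw message bytes the way an XML-RPC client would before sending.
wrapped = Binary(b"Subject: hello\n\nbody\n")

# The payload itself lives in the .data attribute of the wrapper.
payload = wrapped.data
print(payload.decode("ascii"))
```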

Index: hammiecli.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiecli.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** hammiecli.py	27 Oct 2002 05:13:54 -0000	1.2
--- hammiecli.py	21 Nov 2002 04:36:21 -0000	1.3
***************
*** 19,23 ****
          m = xmlrpclib.Binary(msg)
          out = x.filter(m)
!         print out
      except:
          if __debug__:
--- 19,23 ----
          m = xmlrpclib.Binary(msg)
          out = x.filter(m)
!         print out.data
      except:
          if __debug__:


From npickett@users.sourceforge.net  Thu Nov 21 06:03:26 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Wed, 20 Nov 2002 22:03:26 -0800
Subject: [Spambayes-checkins] 
 spambayes Options.py,1.72.2.4,1.72.2.5 TestDriver.py,1.29,1.29.2.1
 Tester.py,1.8,1.8.2.1 classifier.py,1.53.2.3,1.53.2.4
Message-ID: <E18EkQw-0004xr-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv17235

Modified Files:
      Tag: hammie-playground
	Options.py TestDriver.py Tester.py classifier.py 
Log Message:
* classifier.py(MetaInfo): nham/nspam look like variables again,
  thanks to the spiff-o property() function.
* Tester.py: adjusted for new arguments for classifier.learn
* Options.py: removed show_best_discriminators, which won't work at
  all without classifier.WordInfo.killcount.
* TestDriver.py: ditto

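Editorial note: the classifier.py diff below turns nham/nspam into properties so that assigning to them can bump a revision counter, and word records recompute their probability only when that revision has moved. A minimal sketch of the pattern in modern Python follows. In this sketch both setters bump the revision, whereas in the diff as committed only `set_nham` calls `incr_rev()`. The `recalcs` counter is a hypothetical addition for demonstration, and the probability formula is reduced to the unadjusted ratio.

```python
class MetaInfo:
    """Holds message counts; any assignment to a count bumps revision."""

    def __init__(self):
        self._nham = 0
        self._nspam = 0
        self.revision = 0

    def get_nham(self):
        return self._nham

    def set_nham(self, val):
        self._nham = val
        self.revision += 1
    nham = property(get_nham, set_nham)

    def get_nspam(self):
        return self._nspam

    def set_nspam(self, val):
        self._nspam = val
        self.revision += 1
    nspam = property(get_nspam, set_nspam)


class WordInfo:
    """Caches spamprob and recomputes it only when meta.revision changes."""

    def __init__(self):
        self.hamcount = 0
        self.spamcount = 0
        self.spamprob = None
        self.revision = -1   # forces a computation on first use
        self.recalcs = 0     # hypothetical counter, for demonstration only

    def probability(self, meta):
        if meta.revision != self.revision:
            hamratio = self.hamcount / float(meta.nham or 1)
            spamratio = self.spamcount / float(meta.nspam or 1)
            self.spamprob = spamratio / (hamratio + spamratio)
            self.revision = meta.revision
            self.recalcs += 1
        return self.spamprob
```

With this arrangement `meta.nspam += 1` reads through the getter and writes through the setter, so callers can treat the counts as plain attributes while stale cached probabilities are still detected.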

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72.2.4
retrieving revision 1.72.2.5
diff -C2 -d -r1.72.2.4 -r1.72.2.5
*** Options.py	21 Nov 2002 04:16:36 -0000	1.72.2.4
--- Options.py	21 Nov 2002 06:03:24 -0000	1.72.2.5
***************
*** 198,209 ****
  show_unsure: False
  
- # Near the end of Driver.test(), you can get a listing of the best
- # discriminators in the words from the training sets.  These are the
- # words whose WordInfo.killcount values are highest, meaning they most
- # often were among the most extreme clues spamprob() found.  The number
- # of best discriminators to show is given by show_best_discriminators;
- # set this <= 0 to suppress showing any of the best discriminators.
- show_best_discriminators: 30
- 
  # The maximum # of characters to display for a msg displayed due to the
  # show_xyz options above.
--- 198,201 ----
***************
*** 406,410 ****
                     'show_histograms': boolean_cracker,
                     'percentiles': ('get', lambda s: map(float, s.split())),
-                    'show_best_discriminators': int_cracker,
                     'save_trained_pickles': boolean_cracker,
                     'save_histogram_pickles': boolean_cracker,
--- 398,401 ----

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.29
retrieving revision 1.29.2.1
diff -C2 -d -r1.29 -r1.29.2.1
*** TestDriver.py	15 Nov 2002 21:32:19 -0000	1.29
--- TestDriver.py	21 Nov 2002 06:03:24 -0000	1.29.2.1
***************
*** 296,315 ****
              printmsg(e, prob, clues)
  
-         if options.show_best_discriminators > 0:
-             print
-             print "    best discriminators:"
-             stats = [(-1, None)] * options.show_best_discriminators
-             smallest_killcount = -1
-             for w, r in c.wordinfo.iteritems():
-                 if r.killcount > smallest_killcount:
-                     heapreplace(stats, (r.killcount, w))
-                     smallest_killcount = stats[0][0]
-             stats.sort()
-             for count, w in stats:
-                 if count < 0:
-                     continue
-                 r = c.wordinfo[w]
-                 print "        %r %d %g" % (w, r.killcount, r.spamprob)
- 
          if options.show_histograms:
              printhist("this pair:", local_ham_hist, local_spam_hist)
--- 296,299 ----

Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.8
retrieving revision 1.8.2.1
diff -C2 -d -r1.8 -r1.8.2.1
*** Tester.py	7 Nov 2002 22:30:04 -0000	1.8
--- Tester.py	21 Nov 2002 06:03:24 -0000	1.8.2.1
***************
*** 60,68 ****
          if hamstream is not None:
              for example in hamstream:
!                 learn(example, False, False)
          if spamstream is not None:
              for example in spamstream:
!                 learn(example, True, False)
!         self.classifier.update_probabilities()
  
      # Untrain the classifier on streams of ham and spam.  Updates
--- 60,67 ----
          if hamstream is not None:
              for example in hamstream:
!                 learn(example, False)
          if spamstream is not None:
              for example in spamstream:
!                 learn(example, True)
  
      # Untrain the classifier on streams of ham and spam.  Updates
***************
*** 73,81 ****
          if hamstream is not None:
              for example in hamstream:
!                 unlearn(example, False, False)
          if spamstream is not None:
              for example in spamstream:
!                 unlearn(example, True, False)
!         self.classifier.update_probabilities()
  
      # Run prediction on each sample in stream.  You're swearing that stream
--- 72,79 ----
          if hamstream is not None:
              for example in hamstream:
!                 unlearn(example, False)
          if spamstream is not None:
              for example in spamstream:
!                 unlearn(example, True)
  
      # Run prediction on each sample in stream.  You're swearing that stream

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.3
retrieving revision 1.53.2.4
diff -C2 -d -r1.53.2.3 -r1.53.2.4
*** classifier.py	21 Nov 2002 04:27:27 -0000	1.53.2.3
--- classifier.py	21 Nov 2002 06:03:24 -0000	1.53.2.4
***************
*** 71,90 ****
          self.revision = 0
  
!     def nham(self):
          return self._nham
  
!     def nspam(self):
          return self._nspam
  
-     def incr_rev(self):
-         self.revision += 1
          
-     def incr_ham(self, amt=1):
-         self._nham += amt
-         self.incr_rev()
- 
-     def incr_spam(self, amt=1):
-         self._nspam += 1
-         self.incr_rev()
      
  
--- 71,92 ----
          self.revision = 0
  
!     def incr_rev(self):
!         print "revision going up...", self.revision
!         self.revision += 1
! 
!     def get_nham(self):
          return self._nham
+     def set_nham(self, val):
+         self._nham = val
+         self.incr_rev()
+     nham = property(get_nham, set_nham)
  
!     def set_nspam(self, val):
!         self._nspam = val
!     def get_nspam(self):
          return self._nspam
+     nspam = property(get_nspam, set_nspam)
  
          
***************
*** 122,127 ****
          """
  
!         nham = float(meta.nham() or 1)
!         nspam = float(meta.nspam() or 1)
  
          if options.experimental_ham_spam_imbalance_adjustment:
--- 124,129 ----
          """
  
!         nham = float(meta.nham or 1)
!         nspam = float(meta.nspam or 1)
  
          if options.experimental_ham_spam_imbalance_adjustment:
***************
*** 220,223 ****
--- 222,239 ----
          self.wordinfo, self.meta = t[1:]
  
+     # Slacker's way out--pass calls to nham/nspam up to the meta class
+ 
+     def get_nham(self):
+         return self.meta.nham
+     def set_nham(self, val):
+         self.meta.nham = val
+     nham = property(get_nham, set_nham)
+ 
+     def get_nspam(self):
+         return self.meta.nspam
+     def set_nspam(self, val):
+         self.meta.nspam = val
+     nspam = property(get_nspam, set_nspam)
+ 
      # spamprob() implementations.  One of the following is aliased to
      # spamprob, depending on option settings.
***************
*** 415,418 ****
--- 431,437 ----
          you should call update_probabilities() after feeding the last
          message and before calling spamprob().
+ 
+         You probably don't need to call this, since probabilities are
+         automatically updated.
          """
  
***************
*** 444,450 ****
      def _add_msg(self, wordstream, is_spam):
          if is_spam:
!             self.meta.incr_spam()
          else:
!             self.meta.incr_ham()
  
          wordinfo = self.wordinfo
--- 463,469 ----
      def _add_msg(self, wordstream, is_spam):
          if is_spam:
!             self.meta.nspam += 1
          else:
!             self.meta.nham += 1
  
          wordinfo = self.wordinfo
***************
*** 466,476 ****
      def _remove_msg(self, wordstream, is_spam):
          if is_spam:
!             if self.meta.nspam() <= 0:
                  raise ValueError("spam count would go negative!")
!             self.meta.incr_spam(-1)
          else:
!             if self.meta.nham() <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.meta.incr_ham(-1)
  
          wordinfo = self.wordinfo
--- 485,495 ----
      def _remove_msg(self, wordstream, is_spam):
          if is_spam:
!             if self.meta.nspam <= 0:
                  raise ValueError("spam count would go negative!")
!             self.meta.nspam -= 1
          else:
!             if self.meta.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.meta.nham -= 1
  
          wordinfo = self.wordinfo
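
The property-based counters introduced by this change can be sketched in
isolation.  What follows is only an illustration under stated assumptions: it
is written for Python 3 rather than the Python 2 of the checkin, the class
name Meta is hypothetical, and, unlike the diff above (where set_nspam does
not call incr_rev), both setters bump the revision here.

```python
class Meta:
    """Hypothetical stand-in for the classifier's meta-information object."""

    def __init__(self):
        self._nham = 0
        self._nspam = 0
        self.revision = 0

    def incr_rev(self):
        # Every change to a counter marks the state as modified.
        self.revision += 1

    @property
    def nham(self):
        return self._nham

    @nham.setter
    def nham(self, val):
        self._nham = val
        self.incr_rev()

    @property
    def nspam(self):
        return self._nspam

    @nspam.setter
    def nspam(self, val):
        # Assumption: spam updates should bump the revision as ham updates do.
        self._nspam = val
        self.incr_rev()


meta = Meta()
meta.nham += 1    # reads the property, then calls the setter
meta.nspam += 1
meta.nspam -= 1   # the form _remove_msg relies on
```

Because augmented assignment on a property goes through the setter, callers
can write meta.nham += 1 and meta.nham -= 1 without separate incr_ham and
incr_spam helpers.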


From mhammond@users.sourceforge.net  Thu Nov 21 11:20:17 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 21 Nov 2002 03:20:17 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 export.py,NONE,1.1
Message-ID: <E18EpNZ-0003sM-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv14604

Added Files:
	export.py 
Log Message:
Export your outlook spam and ham folders to a standard test directory
structure with one message per file.


--- NEW FILE: export.py ---
# Exports your ham and spam folders to a standard SpamBayes test directory

import sys, os, shutil
from manager import GetManager


def BuildBuckets(manager, root_directory, folder_ids, include_sub):
    store = manager.message_store
    config = manager.config
    num = 0
    for folder in store.GetFolderGenerator(config.training.spam_folder_ids, config.training.spam_include_sub):
        for msg in folder.GetMessageGenerator():
            num += 1
    num_buckets = num / 400
    dirs = []
    for i in range(num_buckets):
        dir=os.path.join(root_directory, "Set%d" % (i+1,))
        dir=os.path.abspath(dir)
        if os.path.isdir(dir):
            shutil.rmtree(dir)
        os.makedirs(dir)
        dirs.append(dir)
    return dirs

def ChooseBucket(buckets):
    import random
    return random.choice(buckets)

def _export_folders(manager, dir, folder_ids, include_sub):
    num = 0
    store = manager.message_store
    buckets = BuildBuckets(manager, dir, folder_ids, include_sub)
    for folder in store.GetFolderGenerator(folder_ids, include_sub):
        print "", folder.name
        for message in folder.GetMessageGenerator():
            dir = ChooseBucket(buckets)
            # filename is the EID.txt
            try:
                msg_text = str(message.GetEmailPackageObject())
            except KeyboardInterrupt:
                raise
            except:
                print "Failed to get message text for '%s': %s" \
                      % (message.GetSubject(), sys.exc_info()[1])
                continue

            fname = os.path.join(dir, message.GetID()[1]) + ".txt"
            f = open(fname, "w")
            f.write(msg_text)
            f.close()
            num += 1
    return num

def export(directory):
    print "Loading bayes manager..."
    manager = GetManager()
    config = manager.config

    print "Exporting spam..."
    num = _export_folders(manager, os.path.join(directory, "Spam"),
                          config.training.spam_folder_ids, config.training.spam_include_sub)
    print "Exported", num, " spam messages."

    print "Exporting ham...",
    num = _export_folders(manager, os.path.join(directory, "Ham"),
                          config.training.ham_folder_ids, config.training.ham_include_sub)
    print "Exported", num, " ham messages."

def main():
    import getopt
    try:
        opts, args = getopt.getopt(sys.argv[1:], "q")
    except getopt.error, d:
        print d
        print
        usage()
    quiet = 0
    for opt, val in opts:
        if opt=='-q':
            quiet = 1

    if len(args) > 1:
        print "Only one directory name can be specified"
        print
        usage()

    if len(args)==0:
        directory = os.path.join(os.path.dirname(sys.argv[0]), "..\\Data")
    else:
        directory = args[0]

    directory = os.path.abspath(directory)
    print "This program will export your Outlook Ham and Spam folders"
    print "to the directory '%s'" % (directory,)
    if os.path.exists(directory):
        print "*******"
        print "WARNING: all existing files in '%s' will be deleted" % (directory,)
        print "*******"
    if not quiet:
        raw_input("Press enter to continue, or Ctrl+C to abort.")
    export(directory)

def usage():
    print """ \
Usage: %s [-q] [directory]

-q : quiet - don't prompt for confirmation.

Export the folders defined in the Outlook Plugin to a test directory.
The directory structure is as defined in the parent README.txt file,
in the "Standard Test Data Setup" section.

If 'directory' is not specified, '..\\Data' is assumed.

If 'directory' exists, it will be recursively deleted before
the export (but you will be asked to confirm unless -q is given).""" \
            % (os.path.basename(sys.argv[0]))
    sys.exit(1)

if __name__=='__main__':
    main()
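
The bucket logic in BuildBuckets and ChooseBucket above can be sketched on its
own.  This is a minimal illustration in Python 3, not the script's actual
code: the function names build_buckets and choose_bucket and the floor of one
bucket are assumptions added here.  The floor is included because the integer
division in the listing yields zero buckets for fewer than 400 messages, and
random.choice raises IndexError on an empty list.

```python
import random


def build_buckets(num_messages, files_per_directory=400):
    """Return the SetN directory names needed for num_messages messages."""
    # Floor division mirrors the Python 2 "/" in the listing; max() is an
    # assumption that keeps at least one bucket for a small corpus.
    num_buckets = max(1, num_messages // files_per_directory)
    return ["Set%d" % (i + 1) for i in range(num_buckets)]


def choose_bucket(buckets, rng=random):
    """Pick the bucket a message is exported to, uniformly at random."""
    return rng.choice(buckets)


buckets = build_buckets(1000)   # 1000 // 400 == 2 buckets
small = build_buckets(50)       # fewer than 400 messages still gets Set1
picked = choose_bucket(buckets, random.Random(0))
```

Random assignment spreads the messages roughly evenly over the sets, so each
SetN directory ends up with about files_per_directory messages without any
bookkeeping.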


From mhammond@users.sourceforge.net  Thu Nov 21 12:06:58 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 21 Nov 2002 04:06:58 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 export.py,1.1,1.2
Message-ID: <E18Eq6k-0000IQ-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv913

Modified Files:
	export.py 
Log Message:
Select correct number of sets even when more spam, and allow user to
specify how many messages in each dir.


Index: export.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/export.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** export.py	21 Nov 2002 11:20:14 -0000	1.1
--- export.py	21 Nov 2002 12:06:55 -0000	1.2
***************
*** 4,25 ****
  from manager import GetManager
  
  
! def BuildBuckets(manager, root_directory, folder_ids, include_sub):
      store = manager.message_store
      config = manager.config
!     num = 0
      for folder in store.GetFolderGenerator(config.training.spam_folder_ids, config.training.spam_include_sub):
          for msg in folder.GetMessageGenerator():
!             num += 1
!     num_buckets = num / 400
      dirs = []
      for i in range(num_buckets):
!         dir=os.path.join(root_directory, "Set%d" % (i+1,))
!         dir=os.path.abspath(dir)
!         if os.path.isdir(dir):
!             shutil.rmtree(dir)
!         os.makedirs(dir)
!         dirs.append(dir)
!     return dirs
  
  def ChooseBucket(buckets):
--- 4,24 ----
  from manager import GetManager
  
+ files_per_directory = 400
  
! def BuildBuckets(manager):
      store = manager.message_store
      config = manager.config
!     num_ham = num_spam = 0
      for folder in store.GetFolderGenerator(config.training.spam_folder_ids, config.training.spam_include_sub):
          for msg in folder.GetMessageGenerator():
!             num_spam += 1
!     for folder in store.GetFolderGenerator(config.training.ham_folder_ids, config.training.ham_include_sub):
!         for msg in folder.GetMessageGenerator():
!             num_ham += 1
!     num_buckets = min(num_ham, num_spam)/ files_per_directory
      dirs = []
      for i in range(num_buckets):
!         dirs.append("Set%d" % (i+1,))
!     return num_spam, num_ham, dirs
  
  def ChooseBucket(buckets):
***************
*** 27,38 ****
      return random.choice(buckets)
  
! def _export_folders(manager, dir, folder_ids, include_sub):
      num = 0
      store = manager.message_store
-     buckets = BuildBuckets(manager, dir, folder_ids, include_sub)
      for folder in store.GetFolderGenerator(folder_ids, include_sub):
          print "", folder.name
          for message in folder.GetMessageGenerator():
!             dir = ChooseBucket(buckets)
              # filename is the EID.txt
              try:
--- 26,37 ----
      return random.choice(buckets)
  
! def _export_folders(manager, dir, buckets, folder_ids, include_sub):
      num = 0
      store = manager.message_store
      for folder in store.GetFolderGenerator(folder_ids, include_sub):
          print "", folder.name
          for message in folder.GetMessageGenerator():
!             sub = ChooseBucket(buckets)
!             this_dir = os.path.join(dir, sub)
              # filename is the EID.txt
              try:
***************
*** 45,49 ****
                  continue
  
!             fname = os.path.join(dir, message.GetID()[1]) + ".txt"
              f = open(fname, "w")
              f.write(msg_text)
--- 44,48 ----
                  continue
  
!             fname = os.path.join(this_dir, message.GetID()[1]) + ".txt"
              f = open(fname, "w")
              f.write(msg_text)
***************
*** 57,74 ****
      config = manager.config
  
      print "Exporting spam..."
!     num = _export_folders(manager, os.path.join(directory, "Spam"),
                            config.training.spam_folder_ids, config.training.spam_include_sub)
!     print "Exported", num, " spam messages."
  
!     print "Exporting ham...",
!     num = _export_folders(manager, os.path.join(directory, "Ham"),
                            config.training.ham_folder_ids, config.training.ham_include_sub)
!     print "Exported", num, " ham messages."
  
  def main():
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "q")
      except getopt.error, d:
          print d
--- 56,84 ----
      config = manager.config
  
+     num_spam, num_ham, buckets = BuildBuckets(manager)
+     print "Have %d spam, and %d ham to export, spread over %d directories." \
+           % (num_spam, num_ham, len(buckets))
+ 
+     for sub in ["Spam", "Ham"]:
+         if os.path.exists(os.path.join(directory, sub)):
+             shutil.rmtree(os.path.join(directory, sub))
+         for b in buckets:
+             d = os.path.join(directory, sub, b)
+             os.makedirs(d)
+ 
      print "Exporting spam..."
!     num = _export_folders(manager, os.path.join(directory, "Spam"), buckets,
                            config.training.spam_folder_ids, config.training.spam_include_sub)
!     print "Exported", num, "spam messages."
  
!     print "Exporting ham..."
!     num = _export_folders(manager, os.path.join(directory, "Ham"), buckets,
                            config.training.ham_folder_ids, config.training.ham_include_sub)
!     print "Exported", num, "ham messages."
  
  def main():
      import getopt
      try:
!         opts, args = getopt.getopt(sys.argv[1:], "qn:")
      except getopt.error, d:
          print d
***************
*** 79,82 ****
--- 89,95 ----
          if opt=='-q':
              quiet = 1
+         elif opt=='-n':
+             global files_per_directory
+             files_per_directory = int(val)
  
      if len(args) > 1:
***************
*** 106,109 ****
--- 119,123 ----
  
  -q : quiet - don't prompt for confirmation.
+ -n : Minimum number of files to aim for in each directory, default=%d
  
  Export the folders defined in the Outlook Plugin to a test directory.
***************
*** 115,119 ****
  If 'directory' exists, it will be recursively deleted before
  the export (but you will be asked to confirm unless -q is given).""" \
!             % (os.path.basename(sys.argv[0]))
      sys.exit(1)
  
--- 129,133 ----
  If 'directory' exists, it will be recursively deleted before
  the export (but you will be asked to confirm unless -q is given).""" \
!             % (os.path.basename(sys.argv[0]), files_per_directory)
      sys.exit(1)
  

From npickett@users.sourceforge.net  Thu Nov 21 23:00:01 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Thu, 21 Nov 2002 15:00:01 -0800
Subject: [Spambayes-checkins] spambayes hammiebulk.py,NONE,1.1.2.1
 classifier.py,1.53.2.4,1.53.2.5 hammie.py,1.40.2.2,1.40.2.3
Message-ID: <E18F0Ij-0000cI-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv1861

Modified Files:
      Tag: hammie-playground
	classifier.py hammie.py 
Added Files:
      Tag: hammie-playground
	hammiebulk.py 
Log Message:
* classifier.py: removed a debug print
* hammie.py: removed some debug code I put in for hammiesrv
* hammiebulk.py: this does what hammie.py used to do.


--- NEW FILE: hammiebulk.py ---
#! /usr/bin/env python

"""Usage: %(program)s [options]

Where:
    -h
        show usage and exit
    -g PATH
        mbox or directory of known good messages (non-spam) to train on.
        Can be specified more than once, or use - for stdin.
    -s PATH
        mbox or directory of known spam messages to train on.
        Can be specified more than once, or use - for stdin.
    -u PATH
        mbox of unknown messages.  A ham/spam decision is reported for each.
        Can be specified more than once.
    -r
        reverse the meaning of the check (report ham instead of spam).
        Only meaningful with the -u option.
    -p FILE
        use file as the persistent store.  loads data from this file if it
        exists, and saves data to this file at the end.
        Default: %(DEFAULTDB)s
    -d
        use the DBM store instead of cPickle.  The file is larger and
        creating it is slower, but checking against it is much faster,
        especially for large word databases. Default: %(USEDB)s
    -D
        the reverse of -d: use cPickle instead of DBM
    -f
        run as a filter: read a single message from stdin, add a new
        header, and write it to stdout.  If you want to run from
        procmail, this is your option.
"""

import sys
import os
import types
import getopt
import mailbox
import glob
import email
import errno
import anydbm
import cPickle as pickle

from Options import options
import mboxutils
import classifier
import hammie

program = sys.argv[0] # For usage(); referenced by docstring above

# Default database name
DEFAULTDB = os.path.expanduser(options.hammiefilter_persistent_storage_file)

# Use a database? If False, use a pickle
USEDB = options.hammiefilter_persistent_use_database

# Probability at which a message is considered spam
SPAM_THRESHOLD = options.spam_cutoff
HAM_THRESHOLD = options.ham_cutoff


def train(h, msgs, is_spam):
    """Train bayes with all messages from a mailbox."""
    mbox = mboxutils.getmbox(msgs)
    i = 0
    for msg in mbox:
        i += 1
        # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
        # back in the day.  Maybe it's a line-printer-ism ;)
        sys.stdout.write("\r%6d" % i)
        sys.stdout.flush()
        h.train(msg, is_spam)
    print

def score(h, msgs, reverse=0):
    """Score (judge) all messages from a mailbox."""
    # XXX The reporting needs work!
    mbox = mboxutils.getmbox(msgs)
    i = 0
    spams = hams = 0
    for msg in mbox:
        i += 1
        prob, clues = h.score(msg, True)
        if hasattr(msg, '_mh_msgno'):
            msgno = msg._mh_msgno
        else:
            msgno = i
        isspam = (prob >= SPAM_THRESHOLD)
        if isspam:
            spams += 1
            if not reverse:
                print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
                print h.formatclues(clues)
        else:
            hams += 1
            if reverse:
                print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
                print h.formatclues(clues)
    return (spams, hams)

def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
    """Create a Bayes instance for the given pickle (which
    doesn't have to exist).  Create a PersistentBayes if
    usedb is True."""
    if usedb:
        bayes = PersistentBayes(pck, mode)
    else:
        bayes = None
        try:
            fp = open(pck, 'rb')
        except IOError, e:
            if e.errno <> errno.ENOENT: raise
        else:
            bayes = pickle.load(fp)
            fp.close()
        if bayes is None:
            bayes = classifier.Bayes()
    return bayes

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main():
    """Main program; parse options and go."""
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
    except getopt.error, msg:
        usage(2, msg)

    if not opts:
        usage(2, "No options given")

    pck = DEFAULTDB
    good = []
    spam = []
    unknown = []
    reverse = 0
    do_filter = False
    usedb = USEDB
    mode = 'r'
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-g':
            good.append(arg)
            mode = 'c'
        elif opt == '-s':
            spam.append(arg)
            mode = 'c'
        elif opt == '-p':
            pck = arg
        elif opt == "-d":
            usedb = True
        elif opt == "-D":
            usedb = False
        elif opt == "-f":
            do_filter = True
        elif opt == '-u':
            unknown.append(arg)
        elif opt == '-r':
            reverse = 1
    if args:
        usage(2, "Positional arguments not allowed")

    save = False

    h = hammie.open(pck, usedb, mode)

    for g in good:
        print "Training ham (%s):" % g
        train(h, g, False)
        save = True

    for s in spam:
        print "Training spam (%s):" % s
        train(h, s, True)
        save = True

    if save:
        h.store()

    if do_filter:
        msg = sys.stdin.read()
        filtered = h.filter(msg)
        sys.stdout.write(filtered)

    if unknown:
        (spams, hams) = (0, 0)
        for u in unknown:
            if len(unknown) > 1:
                print "Scoring", u
            s, g = score(h, u, reverse)
            spams += s
            hams += g
        print "Total %d spam, %d ham" % (spams, hams)

if __name__ == "__main__":
    main()
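
The load-or-create pattern used by createbayes above (open the pickle, treat
a missing file as "start fresh", re-raise any other I/O error) can be sketched
as follows.  This is an illustration in Python 3 under stated assumptions: the
helper name load_or_create and its factory argument are hypothetical, and a
plain dict stands in for the classifier object.

```python
import errno
import os
import pickle
import tempfile


def load_or_create(path, factory):
    """Return the object pickled at path, or factory() if no file exists."""
    try:
        fp = open(path, "rb")
    except IOError as e:
        # Only a missing file means "new database"; anything else is an error.
        if e.errno != errno.ENOENT:
            raise
        return factory()
    with fp:
        return pickle.load(fp)


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "hammie.pck")

fresh = load_or_create(path, dict)      # no file yet, so factory() is used
fresh["nham"] = 3
with open(path, "wb") as fp:
    pickle.dump(fresh, fp)

reloaded = load_or_create(path, dict)   # file exists now, so it is unpickled
```

Checking errno rather than swallowing every IOError keeps permission problems
and similar failures visible instead of silently replacing the database with
an empty one.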

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.4
retrieving revision 1.53.2.5
diff -C2 -d -r1.53.2.4 -r1.53.2.5
*** classifier.py	21 Nov 2002 06:03:24 -0000	1.53.2.4
--- classifier.py	21 Nov 2002 22:59:55 -0000	1.53.2.5
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
  # An implementation of a Bayes-like spam classifier.
  #
***************
*** 72,76 ****
  
      def incr_rev(self):
-         print "revision going up...", self.revision
          self.revision += 1
  
--- 73,76 ----
***************
*** 135,139 ****
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
!                 
          assert self.hamcount <= nham
          hamratio = self.hamcount / nham
--- 135,139 ----
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
! 
          assert self.hamcount <= nham
          hamratio = self.hamcount / nham

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40.2.2
retrieving revision 1.40.2.3
diff -C2 -d -r1.40.2.2 -r1.40.2.3
*** hammie.py	21 Nov 2002 04:27:27 -0000	1.40.2.2
--- hammie.py	21 Nov 2002 22:59:56 -0000	1.40.2.3
***************
*** 58,67 ****
          """
  
!         try:
!             return self._scoremsg(msg, evidence)
!         except:
!             print msg
!             import traceback
!             traceback.print_exc()
  
      def filter(self, msg, header=None, spam_cutoff=None,
--- 58,62 ----
          """
  
!         return self._scoremsg(msg, evidence)
  
      def filter(self, msg, header=None, spam_cutoff=None,


From timstone4@users.sourceforge.net  Fri Nov 22 00:12:37 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 16:12:37 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.5,1.53.2.6
Message-ID: <E18F1Qz-00056i-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19621

Modified Files:
      Tag: hammie-playground
	classifier.py 
Log Message:
Changed name of Bayes class to Classifier, aliased Classifier as Bayes
so stuff wouldn't break.

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.5
retrieving revision 1.53.2.6
diff -C2 -d -r1.53.2.5 -r1.53.2.6
*** classifier.py	21 Nov 2002 22:59:55 -0000	1.53.2.5
--- classifier.py	22 Nov 2002 00:12:35 -0000	1.53.2.6
***************
*** 195,199 ****
  
  
! class Bayes:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object.  There's
--- 195,199 ----
  
  
! class Classifier:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object.  There's
***************
*** 534,535 ****
--- 534,537 ----
          # Return (prob, word, record).
          return [t[1:] for t in clues]
+ 
+ Bayes = Classifier
\ No newline at end of file


From timstone4@users.sourceforge.net  Fri Nov 22 00:21:30 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 16:21:30 -0800
Subject: [Spambayes-checkins] spambayes dbdict.py,1.1.2.3,1.1.2.4
Message-ID: <E18F1Za-0005bh-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv21550

Modified Files:
      Tag: hammie-playground
	dbdict.py 
Log Message:
Modified to include wclass operand on constructor, which is used to
create special 'W' pickle strings when an instance of that class is
pickled, and to unpickle to the same class when 'W' pickle strings are
encountered.

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1.2.3
retrieving revision 1.1.2.4
diff -C2 -d -r1.1.2.3 -r1.1.2.4
*** dbdict.py	21 Nov 2002 02:58:56 -0000	1.1.2.3
--- dbdict.py	22 Nov 2002 00:21:28 -0000	1.1.2.4
***************
*** 4,13 ****
  
  Classes:
!     DBDict - wraps an anydbm file
!     LSDBDict - adds load/store/restore semantic to DBDict
  
  Abstract:
!     DBDict class wraps an anydbm file with a reasonably complete set
      of dictionary access methods.  DBDicts can be iterated like a dictionary.
  
      DBDict accepts an iterskip operand on the constructor.  This is a tuple
--- 4,19 ----
  
  Classes:
!     DBDict - wraps a dbhash file
  
  Abstract:
!     DBDict class wraps a dbhash file with a reasonably complete set
      of dictionary access methods.  DBDicts can be iterated like a dictionary.
+     
+     The constructor accepts a class (wclass) which is used specifically to
+     pickle/unpickle an instance of that class.  When an instance of
+     that class is being pickled, the pickler (actually __getstate__) prepends
+     a 'W' to the pickled string, and when the unpickler (really __setstate__)
+     encounters that 'W', it constructs that class (with no constructor
+     arguments) and executes __setstate__ on the constructed instance.
  
      DBDict accepts an iterskip operand on the constructor.  This is a tuple
***************
*** 33,44 ****
          countme wakka
  
-     LSDBDict class addes load/store/restore functions to DBDict.  It does this
-     by creating a working copy of the dbm file, and using that for all
-     working access.  When the store() method is called, the working dbm hash
-     is closed, copied to the real copy, then reopened, in effect
-     committing any changes.  When restore() is called, the working copy
-     is closed, replaced with the real copy, then reopened.  Store and restore
-     methods are disallowed for readonly (mode MODE_READONLY) LSDBDicts.
- 
  To Do:
      """
--- 39,42 ----
***************
*** 53,57 ****
                 all the spambayes contributors."
  from __future__ import generators
! import dbhash
  try:
      import cPickle as pickle
--- 51,55 ----
                 all the spambayes contributors."
  from __future__ import generators
! 
  try:
      import cPickle as pickle
***************
*** 59,62 ****
--- 57,61 ----
      import pickle
  
+ import dbhash
  import errno
  import copy
***************
*** 81,85 ****
      like .keys() still list everything.  For instance:
  
!     >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo'))
      >>> d['skipme'] = 'booga'
      >>> d['countme'] = 'wakka'
--- 80,84 ----
      like .keys() still list everything.  For instance:
  
!     >>> d = DBDict('goober.db', MODE_CREATE, ('skipme', 'skipmetoo'))
      >>> d['skipme'] = 'booga'
      >>> d['countme'] = 'wakka'
***************
*** 92,98 ****
      """
  
!     def __init__(self, dbname, mode, iterskip=()):
          self.hash = dbhash.open(dbname, mode)
!         self.iterskip = iterskip
  
      def __getitem__(self, key):
--- 91,121 ----
      """
  
!     def __init__(self, dbname, mode, wclass, iterskip=()):
          self.hash = dbhash.open(dbname, mode)
!         if iterskip:
!             self.iterskip = iterskip
!         else:
!             self.iterskip = ()
!         self.wclass=wclass
! 
!     def __getitem__(self, key):
!         v = self.hash[key]
!         if v[0] == 'W':
!             val = pickle.loads(v[1:])
!             # We could be sneaky, like pickle.Unpickler.load_inst,
!             # but I think that's overly confusing.
!             obj = self.wclass()
!             obj.__setstate__(val)
!             return obj
!         else:
!             return pickle.loads(v)
! 
!     def __setitem__(self, key, val):
!         if isinstance(val, self.wclass):
!             val = val.__getstate__()
!             v = 'W' + pickle.dumps(val, 1)
!         else:
!             v = pickle.dumps(val, 1)
!         self.hash[key] = v
  
      def __getitem__(self, key):


From timstone4@users.sourceforge.net  Fri Nov 22 00:25:43 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 16:25:43 -0800
Subject: [Spambayes-checkins] spambayes Persistent.py,NONE,1.1.2.1
Message-ID: <E18F1df-0005ok-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv22357

Added Files:
      Tag: hammie-playground
	Persistent.py 
Log Message:
Replaces Bayes.py, which has been removed.

--- NEW FILE: Persistent.py ---
#! /usr/bin/env python

'''Persistent.py - Spambayes database management framework.

Classes:
    PersistentClassifier - subclass of Classifier, adds auto persistence
    PickledClassifier - PersistentClassifier that uses a pickle db
    DBDictClassifier - PersistentClassifier that uses a DBDict db
    Trainer - Classifier training observer
    SpamTrainer - Trainer for spam
    HamTrainer - Trainer for ham

Abstract:
    PersistentClassifier is an abstract subclass of Classifier
    (classifier.Classifier) that adds automatic state store/restore
    functions to the Classifier class.  It also adds a convenience method,
    classify, which probably belongs in Classifier itself.  classify
    returns 'spam'|'ham'|'unsure' for a message by comparing its spamprob
    against the ham_cutoff and spam_cutoff specified in Options.

    PickledClassifier is a concrete PersistentClassifier class that uses a cPickle
    datastore.  This database is relatively small, but slower than other
    databases.

    DBDictClassifier is a concrete PersistentClassifier class that uses a DBDict
    datastore.

    Trainer is a concrete class that observes a Corpus and trains a
    Classifier object based upon movement of messages between corpora.  When
    an add message notification is received, the trainer trains the
    database with the message, as spam or ham as appropriate given the
    type of trainer (spam or ham).  When a remove message notification
    is received, the trainer untrains the database as appropriate.

    SpamTrainer and HamTrainer are convenience subclasses of Trainer that
    initialize as the appropriate type of Trainer.

To Do:
    o ZODBClassifier
    o Would Trainer.trainall really want to train with the whole corpus,
        or just a random subset?
    o Corpus.Verbose is a bit of a strange thing to have.  Verbose
        should be in the global namespace, but how do you get it there?
    o Suggestions?

    '''

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, Neale Pickett, \
all the spambayes contributors."

import Corpus
import classifier
from Options import options
import cPickle as pickle
import dbdict
import errno

PICKLE_TYPE = 1
NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
UPDATEPROBS = True       # Probabilities will be autoupdated with training

class PersistentClassifier(classifier.Classifier):
    '''Persistent Classifier database object'''

    def __init__(self, db_name):
        '''Constructor(database name)'''

        classifier.Classifier.__init__(self)
        self.db_name = db_name
        self.load()

    def load(self):
        '''Restore state from a persistent store'''

        raise NotImplementedError

    def store(self):
        '''Persist state into a persistent store'''

        raise NotImplementedError

    def classify(self, message):
        '''Returns the classification of a Message {'spam'|'ham'|'unsure'}'''

        prob = self.spamprob(message.tokenize())

        message.setSpamprob(prob)   # don't like this

        if prob < options.ham_cutoff:
            type = 'ham'
        elif prob > options.spam_cutoff:
            type = 'spam'
        else:
            type = 'unsure'

        return type


class PickledClassifier(PersistentClassifier):
    '''Classifier object persisted in a pickle'''

    def load(self):
        '''Load this instance from the pickle.'''
        # This is a bit strange, because the loading process
        # creates a temporary instance of PickledClassifier, from which
        # this object's state is copied.  This is a nuance of the way
        # that pickle does its job

        if Corpus.Verbose:
            print 'Loading state from',self.db_name,'pickle'

        tempbayes = None
        try:
            fp = open(self.db_name, 'rb')
        except IOError, e:
            if e.errno != errno.ENOENT: raise
        else:
            tempbayes = pickle.load(fp)
            fp.close()

        if tempbayes:
            self.wordinfo = tempbayes.wordinfo
            self.meta.nham = tempbayes.get_nham()
            self.meta.nspam = tempbayes.get_nspam()

            if Corpus.Verbose:
                print '%s is an existing pickle, with %d ham and %d spam' \
                      % (self.db_name, self.nham, self.nspam)
        else:
            # new pickle
            if Corpus.Verbose:
                print self.db_name,'is a new pickle'
            self.wordinfo = {}
            self.meta.nham = 0
            self.meta.nspam = 0

    def store(self):
        '''Store self as a pickle'''

        if Corpus.Verbose:
            print 'Persisting',self.db_name,'as a pickle'

        fp = open(self.db_name, 'wb')
        pickle.dump(self, fp, PICKLE_TYPE)
        fp.close()

    def __getstate__(self):
        return PICKLE_TYPE, self.wordinfo, self.meta

    def __setstate__(self, t):
        if t[0] != PICKLE_TYPE:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
        self.wordinfo, self.meta = t[1:]


class DBDictClassifier(PersistentClassifier):
    '''Classifier object persisted in a DBDict'''

    def __init__(self, db_name, mode='c'):
        '''Constructor(database name, mode)'''

        self.mode = mode
        self.statekey = "saved state"
        PersistentClassifier.__init__(self, db_name)

    def load(self):
        '''Load state from DBDict'''

        if Corpus.Verbose:
            print 'Loading state from',self.db_name,'DBDict'

        self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
                             classifier.WordInfo,iterskip=[self.statekey])

        if self.wordinfo.has_key(self.statekey):
            (nham, nspam) = self.wordinfo[self.statekey]
            self.set_nham(nham)
            self.set_nspam(nspam)

            if Corpus.Verbose:
                print '%s is an existing DBDict, with %d ham and %d spam' \
                      % (self.db_name, self.nham, self.nspam)
        else:
            # new dbdict
            if Corpus.Verbose:
                print self.db_name,'is a new DBDict'
            self.set_nham(0)
            self.set_nspam(0)

    def store(self):
        '''Place state into persistent store'''

        if Corpus.Verbose:
            print 'Persisting',self.db_name,'state in DBDict'

        self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam())
        self.wordinfo.sync()


class Trainer:
    '''Associates a Classifier object and one or more Corpora, \
    is an observer of the corpora'''

    def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
        '''Constructor(Classifier, Corpus.SPAM|Corpus.HAM,
            updateprobs(True|False))'''

        self.bayes = bayes
        self.trainertype = trainertype
        self.updateprobs = updateprobs

    def onAddMessage(self, message):
        '''A message is being added to an observed corpus.'''

        self.train(message)

    def train(self, message):
        '''Train the database with the message'''

        if Corpus.Verbose:
            print 'training with',message.key()

        self.bayes.learn(message.tokenize(), \
                         self.trainertype)
#                         self.updateprobs)

    def onRemoveMessage(self, message):
        '''A message is being removed from an observed corpus.'''

        self.untrain(message)

    def untrain(self, message):
        '''Untrain the database with the message'''

        if Corpus.Verbose:
            print 'untraining with',message.key()

        self.bayes.unlearn(message.tokenize(), \
                           self.trainertype)
#                           self.updateprobs)
        # can raise ValueError if database is fouled.  If this is the case,
        # then retraining is the only recovery option.

    def trainAll(self, corpus):
        '''Train all the messages in the corpus'''

        for msg in corpus:
            self.train(msg)

    def untrainAll(self, corpus):
        '''Untrain all the messages in the corpus'''

        for msg in corpus:
            self.untrain(msg)


class SpamTrainer(Trainer):
    '''Trainer for spam'''

    def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
        '''Constructor'''

        Trainer.__init__(self, bayes, Corpus.SPAM, updateprobs)


class HamTrainer(Trainer):
    '''Trainer for ham'''

    def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
        '''Constructor'''

        Trainer.__init__(self, bayes, Corpus.HAM, updateprobs)


if __name__ == '__main__':
    print >>sys.stderr, __doc__
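
A minimal sketch of how the Trainer classes above might be wired to a corpus
as observers.  It is an illustration only: ToyClassifier and ToyCorpus are
hypothetical stand-ins for classifier.Classifier and Corpus.Corpus, messages
are plain strings tokenized with split(), and the sketch uses Python 3 syntax
rather than the Python 2 of the module itself.

```python
# Hypothetical stand-ins; these are not the real spambayes classes.
SPAM, HAM = True, False


class ToyClassifier:
    """Counts how often each token was seen in ham and in spam."""

    def __init__(self):
        self.counts = {}              # token -> [hamcount, spamcount]
        self.nham = self.nspam = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for tok in set(tokens):
            self.counts.setdefault(tok, [0, 0])[int(is_spam)] += 1

    def unlearn(self, tokens, is_spam):
        # Validate first, so a fouled database is reported before any
        # count is changed.
        for tok in set(tokens):
            pair = self.counts.get(tok)
            if pair is None or pair[int(is_spam)] == 0:
                raise ValueError("token %r was never trained this way" % tok)
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1
        for tok in set(tokens):
            pair = self.counts[tok]
            pair[int(is_spam)] -= 1
            if pair == [0, 0]:
                del self.counts[tok]


class Trainer:
    """Observer that trains or untrains as messages come and go."""

    def __init__(self, bayes, trainertype):
        self.bayes = bayes
        self.trainertype = trainertype

    def onAddMessage(self, message):
        self.bayes.learn(message.split(), self.trainertype)

    def onRemoveMessage(self, message):
        self.bayes.unlearn(message.split(), self.trainertype)


class ToyCorpus:
    """Holds messages by key and notifies observers of changes."""

    def __init__(self):
        self.msgs = {}
        self.observers = []

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, key, text):
        self.msgs[key] = text
        for obs in self.observers:
            obs.onAddMessage(text)

    def removeMessage(self, key):
        text = self.msgs.pop(key)
        for obs in self.observers:
            obs.onRemoveMessage(text)


bayes = ToyClassifier()
spam, ham = ToyCorpus(), ToyCorpus()
spam.addObserver(Trainer(bayes, SPAM))
ham.addObserver(Trainer(bayes, HAM))
spam.addMessage('m1', 'cheap pills now')
ham.addMessage('m2', 'meeting now')
```

Moving a message between corpora then amounts to removeMessage on one corpus
and addMessage on the other; the trainers keep the classifier in step without
the caller touching it directly.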


From timstone4@users.sourceforge.net  Fri Nov 22 00:28:21 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 16:28:21 -0800
Subject: [Spambayes-checkins] spambayes Corpus.py,1.2,1.2.2.1
Message-ID: <E18F1gD-0005xS-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv22886

Modified Files:
      Tag: hammie-playground
	Corpus.py 
Log Message:
Added methods to Message class:

    getSubject()
    getFrom()
    getDate()
    getHeaders()
    getBody()
    getHeadersList()

Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.2
retrieving revision 1.2.2.1
diff -C2 -d -r1.2 -r1.2.2.1
*** Corpus.py	16 Nov 2002 19:03:15 -0000	1.2
--- Corpus.py	22 Nov 2002 00:28:19 -0000	1.2.2.1
***************
*** 230,234 ****
  
          return msg
!         
  
  class ExpiryCorpus:
--- 230,234 ----
  
          return msg
! 
  
  class ExpiryCorpus:
***************
*** 272,276 ****
      def __init__(self):
          '''Constructor()'''
!         pass
  
      def load(self):
--- 272,278 ----
      def __init__(self):
          '''Constructor()'''
! 
!         self.bodytxt = None
!         self.hdrtxt = None
  
      def load(self):
***************
*** 297,301 ****
          '''Instance as a printable string'''
  
!         return self.substance
  
      def name(self):
--- 299,303 ----
          '''Instance as a printable string'''
  
!         return self.getSubstance()
  
      def name(self):
***************
*** 311,322 ****
      def setSubstance(self, sub):
          '''set this message substance'''
!         
          self.substance = sub
!         
      def getSubstance(self):
          '''Return this message substance'''
!         
          return self.substance
!         
      def setSpamprob(self, prob):
          '''Score of the last spamprob calc, may not be persistent'''
--- 313,329 ----
      def setSubstance(self, sub):
          '''set this message substance'''
! 
          self.substance = sub
!         bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL+re.MULTILINE)
!         bmatch = bodyRE.search(sub)
!         if bmatch:
!             self.bodytxt = bmatch.group(2)
!             self.hdrtxt = sub[:bmatch.start(2)]
! 
      def getSubstance(self):
          '''Return this message substance'''
! 
          return self.substance
! 
      def setSpamprob(self, prob):
          '''Score of the last spamprob calc, may not be persistent'''
***************
*** 327,331 ****
          '''Returns substance as tokens'''
  
!         return tokenizer.tokenize(self.substance)
  
      def createTimeStamp(self):
--- 334,338 ----
          '''Returns substance as tokens'''
  
!         return tokenizer.tokenize(self.getSubstance())
  
      def createTimeStamp(self):
***************
*** 335,338 ****
--- 342,399 ----
          raise NotImplementedError
  
+     def getFrom(self):
+         '''Return a message From header content'''
+ 
+         if self.hdrtxt:
+             match = re.search(r'^From:(.*)$', self.hdrtxt, re.MULTILINE)
+             return match.group(1)
+         else:
+             return None
+ 
+     def getSubject(self):
+         '''Return a message Subject header contents'''
+ 
+         if self.hdrtxt:
+             match = re.search(r'^Subject:(.*)$', self.hdrtxt, re.MULTILINE)
+             return match.group(1)
+         else:
+             return None
+ 
+     def getDate(self):
+         '''Return a message Date header contents'''
+ 
+         if self.hdrtxt:
+             match = re.search(r'^Date:(.*)$', self.hdrtxt, re.MULTILINE)
+             return match.group(1)
+         else:
+             return None
+ 
+     def getHeadersList(self):
+         '''Return a list of message header tuples'''
+ 
+         hdrregex = re.compile(r'^([A-Za-z0-9-_]*): ?(.*)$', re.MULTILINE)
+         data = re.sub(r'\r?\n\r?\s',' ',self.hdrtxt)
+         match = hdrregex.findall(data)
+ 
+         return match
+ 
+     def getHeaders(self):
+         '''Return message headers as text'''
+         
+         return self.hdrtxt
+ 
+     def getBody(self):
+         '''Return the message body'''
+ 
+         return self.bodytxt
+ 
+     def stripSBDHeader(self):
+         '''Removes the X-Spambayes-Disposition: header from the message'''
+ 
+         # This is useful for training, where a spammer may be spoofing
+         # our header, to make sure that our header doesn't become an
+         # overweight clue to hamminess
+ 
+         raise NotImplementedError
  
  
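
A small sketch of the header helpers added in this change, for illustration
only.  The function names headers_list and get_header are hypothetical and
are not part of Corpus.py.  Unlike the methods in the diff, get_header
returns None when the header is absent instead of calling group() on a
failed match, and unfolding only joins lines that continue with a space or
a tab.  Python 3 syntax is assumed.

```python
import re

# Header name, colon, optional single space, then the value.
HEADER_RE = re.compile(r'^([A-Za-z0-9_-]+): ?(.*)$', re.MULTILINE)


def headers_list(hdrtxt):
    """Return a list of (name, value) tuples from raw header text."""
    # Unfold continuation lines: a line break followed by spaces or tabs
    # belongs to the previous header.
    unfolded = re.sub(r'\r?\n[ \t]+', ' ', hdrtxt)
    return HEADER_RE.findall(unfolded)


def get_header(hdrtxt, name):
    """Return the first value of the named header, or None if missing."""
    for key, value in headers_list(hdrtxt):
        if key.lower() == name.lower():
            return value.strip()
    return None


hdr = ("From: a@example.com\n"
       "Subject: a long\n"
       " subject line\n"
       "Date: Mon, 4 Nov 2002 10:49:02 -0600\n"
       "\n")
print(headers_list(hdr))
print(get_header(hdr, 'subject'))
```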
From timstone4@users.sourceforge.net  Fri Nov 22 00:31:21 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 16:31:21 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.2,1.2.2.1
Message-ID: <E18F1j7-00066g-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv23466

Modified Files:
      Tag: hammie-playground
	FileCorpus.py 
Log Message:
Corrected some references to .substance instead of .getSubstance()
and .setSubstance()

Added tests for the header and body convenience methods that were
added to Message

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.2
retrieving revision 1.2.2.1
diff -C2 -d -r1.2 -r1.2.2.1
*** FileCorpus.py	16 Nov 2002 19:06:27 -0000	1.2
--- FileCorpus.py	22 Nov 2002 00:31:19 -0000	1.2.2.1
***************
*** 86,90 ****
  
  import Corpus
! import Bayes
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
  
--- 86,90 ----
  
  import Corpus
! import Persistent
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
  
***************
*** 192,195 ****
--- 192,196 ----
          '''Constructor(message file name, corpus directory name)'''
  
+         Corpus.Message.__init__(self)
          self.file_name = file_name
          self.directory = directory
***************
*** 214,218 ****
                 raise
          else:
!            self.substance = fp.read()
             fp.close()
  
--- 215,219 ----
                 raise
          else:
!            self.setSubstance(fp.read())
             fp.close()
  
***************
*** 225,229 ****
          pn = self.pathname()
          fp = open(pn, 'wb')
!         fp.write(self.substance)
          fp.close()
  
--- 226,230 ----
          pn = self.pathname()
          fp = open(pn, 'wb')
!         fp.write(self.getSubstance())
          fp.close()
  
***************
*** 248,260 ****
  
          elip = ''
!         sub = self.substance
! 
          if Corpus.Verbose:
!             sub = self.substance
          else:
!             if len(self.substance) > 20:
!                 sub = self.substance[:20]
!                 if len(self.substance) > 40:
!                     sub += '...' + self.substance[-20:]
  
          pn = os.path.join(self.directory, self.file_name)
--- 249,261 ----
  
          elip = ''
!         sub = self.getSubstance()
!         
          if Corpus.Verbose:
!             sub = self.getSubstance()
          else:
!             if len(sub) > 40:
!                 sub = sub[:20] + '...' + sub[-20:]
!             elif len(sub) > 20:
!                 sub = sub[:20]
  
          pn = os.path.join(self.directory, self.file_name)
***************
*** 304,308 ****
                  raise
          else:
!             self.substance = fp.read()
              fp.close()
  
--- 305,309 ----
                  raise
          else:
!             self.setSubstance(fp.read())
              fp.close()
  
***************
*** 316,320 ****
          pn = self.pathname()
          gz = gzip.open(pn, 'wb')
!         gz.write(self.substance)
          gz.flush()
          gz.close()
--- 317,321 ----
          pn = self.pathname()
          gz = gzip.open(pn, 'wb')
!         gz.write(self.getSubstance())
          gz.flush()
          gz.close()
***************
*** 342,354 ****
          print 'Executing with uncompressed files'
  
!     print '\n\nCreating two Bayes databases'
!     miscbayes = Bayes.PickledBayes('fctestmisc.bayes')
!     classbayes = Bayes.DBDictBayes('fctestclass.bayes')
  
      print '\n\nSetting up spam corpus'
      spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
!     spamtrainer = Bayes.SpamTrainer(miscbayes)
      spamcorpus.addObserver(spamtrainer)
!     anotherspamtrainer = Bayes.SpamTrainer(classbayes, Bayes.UPDATEPROBS)
      spamcorpus.addObserver(anotherspamtrainer)
  
--- 343,355 ----
          print 'Executing with uncompressed files'
  
!     print '\n\nCreating two Classifier databases'
!     miscbayes = Persistent.PickledClassifier('fctestmisc.bayes')
!     classbayes = Persistent.DBDictClassifier('fctestclass.bayes')
  
      print '\n\nSetting up spam corpus'
      spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
!     spamtrainer = Persistent.SpamTrainer(miscbayes)
      spamcorpus.addObserver(spamtrainer)
!     anotherspamtrainer = Persistent.SpamTrainer(classbayes, Persistent.UPDATEPROBS)
      spamcorpus.addObserver(anotherspamtrainer)
  
***************
*** 365,374 ****
                            'fctesthamcorpus', \
                            'MSG*')
!     hamtrainer = Bayes.HamTrainer(miscbayes)
      hamcorpus.addObserver(hamtrainer)
      hamtrainer.trainAll(hamcorpus)
  
! 
!     print '\n\nAdd a message to hamcorpus that does not match the filter'
      if useGzip:
          fmClass = GzipFileMessage
--- 366,374 ----
                            'fctesthamcorpus', \
                            'MSG*')
!     hamtrainer = Persistent.HamTrainer(miscbayes)
      hamcorpus.addObserver(hamtrainer)
      hamtrainer.trainAll(hamcorpus)
  
!     print '\n\nA couple of message related tests'
      if useGzip:
          fmClass = GzipFileMessage
***************
*** 377,380 ****
--- 377,383 ----
  
      m1 = fmClass('XMG00001', 'fctestspamcorpus')
+     m1.setSubstance(testmsg2())
+     
+     print '\n\nAdd a message to hamcorpus that does not match the filter'
  
      try:
***************
*** 417,421 ****
  
      print '\n\nTrain with an individual message'
!     anotherhamtrainer = Bayes.HamTrainer(classbayes)
      anotherhamtrainer.train(unsurecorpus['MSG00005'])
  
--- 420,424 ----
  
      print '\n\nTrain with an individual message'
!     anotherhamtrainer = Persistent.HamTrainer(classbayes)
      anotherhamtrainer.train(unsurecorpus['MSG00005'])
  
***************
*** 428,431 ****
--- 431,443 ----
      msg = spamcorpus['MSG00001']
      print msg
+     print '\n\nThis is some vital information in the message'
+     print 'Date header is',msg.getDate()
+     print 'Subject header is',msg.getSubject()
+     print 'From header is',msg.getFrom()
+     
+     print 'Header text is:',msg.getHeaders()
+     print 'Headers are:',msg.getHeadersList()
+     print 'Body is:',msg.getBody()
+ 
  
  
***************
*** 526,538 ****
  
      m1 = fmClass('MSG00001', 'fctestspamcorpus')
!     m1.substance = tm1
      m1.store()
  
      m2 = fmClass('MSG00002', 'fctestspamcorpus')
!     m2.substance = tm2
      m2.store()
  
      m3 = fmClass('MSG00003', 'fctestunsurecorpus')
!     m3.substance = tm1
      m3.store()
  
--- 538,550 ----
  
      m1 = fmClass('MSG00001', 'fctestspamcorpus')
!     m1.setSubstance(tm1)
      m1.store()
  
      m2 = fmClass('MSG00002', 'fctestspamcorpus')
!     m2.setSubstance(tm2)
      m2.store()
  
      m3 = fmClass('MSG00003', 'fctestunsurecorpus')
!     m3.setSubstance(tm1)
      m3.store()
  
***************
*** 546,558 ****
  
      m4 = fmClass('MSG00004', 'fctestunsurecorpus')
!     m4.substance = tm1
      m4.store()
  
      m5 = fmClass('MSG00005', 'fctestunsurecorpus')
!     m5.substance = tm2
      m5.store()
  
      m6 = fmClass('MSG00006', 'fctestunsurecorpus')
!     m6.substance = tm2
      m6.store()
  
--- 558,570 ----
  
      m4 = fmClass('MSG00004', 'fctestunsurecorpus')
!     m4.setSubstance(tm1)
      m4.store()
  
      m5 = fmClass('MSG00005', 'fctestunsurecorpus')
!     m5.setSubstance(tm2)
      m5.store()
  
      m6 = fmClass('MSG00006', 'fctestunsurecorpus')
!     m6.setSubstance(tm2)
      m6.store()
  
***************
*** 583,587 ****
  Content-Type:text/plain; charset=us-ascii
  Content- Transfer- Encoding:7bit
- 
  Message-ID:<15814.42238.882013.702030@montanaro.dyndns.org>
  Date:Mon, 4 Nov 2002 10:49:02 -0600
--- 595,598 ----
***************
*** 644,648 ****
  Content-Type:text/plain; charset=us-ascii
  Content- Transfer- Encoding:7bit
- 
  X-Hammie- Disposition:Unsure
  
--- 655,658 ----


From timstone4@users.sourceforge.net  Fri Nov 22 02:07:35 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 18:07:35 -0800
Subject: [Spambayes-checkins] spambayes Corpus.py,1.2.2.1,1.2.2.2
Message-ID: <E18F3EF-0002Qa-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9332

Modified Files:
      Tag: hammie-playground
	Corpus.py 
Log Message:
Removed substance instance variable because its contents were being
kept in bodytxt and hdrtxt, resulting in double memory usage.

Changed bodytxt and getBody methods to payload and getPayload

Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.2.2.1
retrieving revision 1.2.2.2
diff -C2 -d -r1.2.2.1 -r1.2.2.2
*** Corpus.py	22 Nov 2002 00:28:19 -0000	1.2.2.1
--- Corpus.py	22 Nov 2002 02:07:33 -0000	1.2.2.2
***************
*** 273,277 ****
          '''Constructor()'''
  
!         self.bodytxt = None
          self.hdrtxt = None
  
--- 273,277 ----
          '''Constructor()'''
  
!         self.payload = None
          self.hdrtxt = None
  
***************
*** 314,322 ****
          '''set this message substance'''
  
-         self.substance = sub
          bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL+re.MULTILINE)
          bmatch = bodyRE.search(sub)
          if bmatch:
!             self.bodytxt = bmatch.group(2)
              self.hdrtxt = sub[:bmatch.start(2)]
  
--- 314,321 ----
          '''set this message substance'''
  
          bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL+re.MULTILINE)
          bmatch = bodyRE.search(sub)
          if bmatch:
!             self.payload = bmatch.group(2)
              self.hdrtxt = sub[:bmatch.start(2)]
  
***************
*** 324,328 ****
          '''Return this message substance'''
  
!         return self.substance
  
      def setSpamprob(self, prob):
--- 323,327 ----
          '''Return this message substance'''
  
!         return self.hdrtxt + self.payload
  
      def setSpamprob(self, prob):
***************
*** 383,390 ****
          return self.hdrtxt
  
!     def getBody(self):
          '''Return the message body'''
  
!         return self.bodytxt
  
      def stripSBDHeader(self):
--- 382,389 ----
          return self.hdrtxt
  
!     def getPayload(self):
          '''Return the message body'''
  
!         return self.payload
  
      def stripSBDHeader(self):
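
A sketch of the header/payload split that this change relies on, for
illustration only.  split_message is a hypothetical helper, not part of
Corpus.py.  It uses the same regular expression as setSubstance, so the
header text keeps the blank separator line and the two parts concatenate
back to the original message.  Treating a message with no blank line as
all headers is an assumption of the sketch; in the diff above that case
leaves both attributes as None.  Python 3 syntax is assumed.

```python
import re

# First line break that is followed by an empty line, then everything else.
BODY_RE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL)


def split_message(text):
    """Return (hdrtxt, payload) such that hdrtxt + payload == text."""
    match = BODY_RE.search(text)
    if match is None:
        # Assumption: no blank line means there is no payload.
        return text, ""
    return text[:match.start(2)], match.group(2)


raw = "From: a@example.com\nSubject: hi\n\nbody line 1\nbody line 2\n"
hdrtxt, payload = split_message(raw)
```

Because the parts always add up to the original text, getSubstance can be
rebuilt from them without keeping a third copy of the message.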


From timstone4@users.sourceforge.net  Fri Nov 22 02:08:08 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 18:08:08 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.2.2.1,1.2.2.2
Message-ID: <E18F3Em-0002So-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9468

Modified Files:
      Tag: hammie-playground
	FileCorpus.py 
Log Message:
Changed test to use getPayload

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.2.2.1
retrieving revision 1.2.2.2
diff -C2 -d -r1.2.2.1 -r1.2.2.2
*** FileCorpus.py	22 Nov 2002 00:31:19 -0000	1.2.2.1
--- FileCorpus.py	22 Nov 2002 02:08:06 -0000	1.2.2.2
***************
*** 438,442 ****
      print 'Header text is:',msg.getHeaders()
      print 'Headers are:',msg.getHeadersList()
!     print 'Body is:',msg.getBody()
  
  
--- 438,442 ----
      print 'Header text is:',msg.getHeaders()
      print 'Headers are:',msg.getHeadersList()
!     print 'Body is:',msg.getPayload()
  
  
From timstone4@users.sourceforge.net  Fri Nov 22 02:16:25 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 18:16:25 -0800
Subject: [Spambayes-checkins] spambayes Bayes.py,1.5.2.5,NONE
Message-ID: <E18F3Mn-0002rq-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv11015

Removed Files:
      Tag: hammie-playground
	Bayes.py 
Log Message:
Bayes.py has been replaced with Persistent.py

--- Bayes.py DELETED ---


From timstone4@users.sourceforge.net  Fri Nov 22 03:00:38 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 19:00:38 -0800
Subject: [Spambayes-checkins] spambayes hammie.py,1.40.2.3,1.40.2.4
Message-ID: <E18F43a-00058B-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19720

Modified Files:
      Tag: hammie-playground
	hammie.py 
Log Message:
corrected module and class names

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40.2.3
retrieving revision 1.40.2.4
diff -C2 -d -r1.40.2.3 -r1.40.2.4
*** hammie.py	21 Nov 2002 22:59:56 -0000	1.40.2.3
--- hammie.py	22 Nov 2002 03:00:36 -0000	1.40.2.4
***************
*** 4,8 ****
  import dbdict
  import mboxutils
! import Bayes
  from Options import options
  from tokenizer import tokenize
--- 4,8 ----
  import dbdict
  import mboxutils
! import Persistent
  from Options import options
  from tokenizer import tokenize
***************
*** 180,186 ****
  
      if usedb:
!         b = Bayes.DBDictBayes(filename, mode)
      else:
!         b = Bayes.PickledBayes(filename)
      return Hammie(b)
  
--- 180,186 ----
  
      if usedb:
!         b = Persistent.DBDictClassifier(filename, mode)
      else:
!         b = Persistent.PickledClassifier(filename)
      return Hammie(b)
  

From timstone4@users.sourceforge.net  Fri Nov 22 03:00:48 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 19:00:48 -0800
Subject: [Spambayes-checkins] spambayes hammiebulk.py,1.1.2.1,1.1.2.2
Message-ID: <E18F43k-00058f-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19744

Modified Files:
      Tag: hammie-playground
	hammiebulk.py 
Log Message:
corrected module and class names

Index: hammiebulk.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Attic/hammiebulk.py,v
retrieving revision 1.1.2.1
retrieving revision 1.1.2.2
diff -C2 -d -r1.1.2.1 -r1.1.2.2
*** hammiebulk.py	21 Nov 2002 22:59:58 -0000	1.1.2.1
--- hammiebulk.py	22 Nov 2002 03:00:45 -0000	1.1.2.2
***************
*** 48,52 ****
--- 48,56 ----
  import mboxutils
  import classifier
+ import Persistent
  import hammie
+ import Corpus
+ 
+ Corpus.Verbose = True
  
  program = sys.argv[0] # For usage(); referenced by docstring above
***************
*** 107,122 ****
      usedb is True."""
      if usedb:
!         bayes = PersistentBayes(pck, mode)
      else:
!         bayes = None
!         try:
!             fp = open(pck, 'rb')
!         except IOError, e:
!             if e.errno <> errno.ENOENT: raise
!         else:
!             bayes = pickle.load(fp)
!             fp.close()
!         if bayes is None:
!             bayes = classifier.Bayes()
      return bayes
  
--- 111,117 ----
      usedb is True."""
      if usedb:
!         bayes = Persistent.DBDictClassifier(pck, mode)
      else:
!         bayes = Persistent.PickledClassifier(pck)
      return bayes
  

From timstone4@users.sourceforge.net  Fri Nov 22 03:00:58 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 21 Nov 2002 19:00:58 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.16.2.1,1.16.2.2
Message-ID: <E18F43u-00059H-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19780

Modified Files:
      Tag: hammie-playground
	pop3proxy.py 
Log Message:
corrected module and class names

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.16.2.1
retrieving revision 1.16.2.2
diff -C2 -d -r1.16.2.1 -r1.16.2.2
*** pop3proxy.py	19 Nov 2002 23:45:25 -0000	1.16.2.1
--- pop3proxy.py	22 Nov 2002 03:00:56 -0000	1.16.2.2
***************
*** 113,117 ****
  import os, sys, re, operator, errno, getopt, cPickle, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import Bayes, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from Options import options
--- 113,117 ----
  import os, sys, re, operator, errno, getopt, cPickle, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import Persistent, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from Options import options
***************
*** 1037,1041 ****
  # This keeps the global state of the module - the command-line options,
  # statistics like how many mails have been classified, the handle of the
! # log file, the Bayes and FileCorpus objects, and so on.
  class State:
      def __init__(self):
--- 1037,1041 ----
  # This keeps the global state of the module - the command-line options,
  # statistics like how many mails have been classified, the handle of the
! # log file, the Classifier and FileCorpus objects, and so on.
  class State:
      def __init__(self):
***************
*** 1082,1088 ****
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:
!             self.bayes = Bayes.DBDictBayes(self.databaseFilename)
          else:
!             self.bayes = Bayes.PickledBayes(self.databaseFilename)
          print "Done."
  
--- 1082,1088 ----
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:
!             self.bayes = Persistent.DBDictClassifier(self.databaseFilename)
          else:
!             self.bayes = Persistent.PickledClassifier(self.databaseFilename)
          print "Done."
  
***************
*** 1109,1114 ****
  
              # Create the Trainers.
!             self.spamTrainer = Bayes.SpamTrainer(self.bayes)
!             self.hamTrainer = Bayes.HamTrainer(self.bayes)
              self.spamCorpus.addObserver(self.spamTrainer)
              self.hamCorpus.addObserver(self.hamTrainer)
--- 1109,1114 ----
  
              # Create the Trainers.
!             self.spamTrainer = Persistent.SpamTrainer(self.bayes)
!             self.hamTrainer = Persistent.HamTrainer(self.bayes)
              self.spamCorpus.addObserver(self.spamTrainer)
              self.hamCorpus.addObserver(self.hamTrainer)


From timstone4@users.sourceforge.net  Fri Nov 22 16:33:21 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Fri, 22 Nov 2002 08:33:21 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7
Message-ID: <E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv400

Modified Files:
      Tag: hammie-playground
	classifier.py 
Log Message:
Added probability calculation result caching.  No benchmark available to see
how much, if any, performance gain is achieved, but it seems like it could
be significant, particularly in training large corpora, or with long running
processes.

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.6
retrieving revision 1.53.2.7
diff -C2 -d -r1.53.2.6 -r1.53.2.7
*** classifier.py	22 Nov 2002 00:12:35 -0000	1.53.2.6
--- classifier.py	22 Nov 2002 16:33:19 -0000	1.53.2.7
***************
*** 48,51 ****
--- 48,52 ----
  
  PICKLE_VERSION = 1
+ probcache = {}
  
  class MetaInfo(object):
***************
*** 127,130 ****
--- 128,150 ----
          nspam = float(meta.nspam or 1)
  
+         assert self.hamcount <= nham
+         hamratio = self.hamcount / nham
+ 
+         assert self.spamcount <= nspam
+         spamratio = self.spamcount / nspam
+         
+         self.revision = meta.revision
+         
+         # do a cache lookaside here, to possibly save a bunch of calculations
+         try:
+             self.spamprob = probcache[hamratio][spamratio]
+             return True
+         except KeyError:
+             pass
+         except TypeError:
+             probcache[hamratio] = {}
+ 
+         prob = spamratio / (hamratio + spamratio)
+         
          if options.experimental_ham_spam_imbalance_adjustment:
              spam2ham = min(nspam / nham, 1.0)
***************
*** 136,146 ****
          StimesX = S * options.unknown_word_prob
  
-         assert self.hamcount <= nham
-         hamratio = self.hamcount / nham
- 
-         assert self.spamcount <= nspam
-         spamratio = self.spamcount / nspam
- 
-         prob = spamratio / (hamratio + spamratio)
  
          # Now do Robinson's Bayesian adjustment.
--- 156,159 ----
***************
*** 181,190 ****
          prob = (StimesX + n * prob) / (S + n)
  
!         self.revision = meta.revision
!         if self.spamprob != prob:
!             self.spamprob = prob
!             return True
!         else:
!             return False
  
      def probability(self, meta):
--- 194,216 ----
          prob = (StimesX + n * prob) / (S + n)
  
!         # populate the cache, so this calculation won't have to be done again
!         try:
!             probcache[hamratio][spamratio] = prob
!         except KeyError:
!             probcache[hamratio] = {}
!             probcache[hamratio][spamratio] = prob
!         
!         # the following code is meaningless to me, maybe a performance hack?
!         # if so, it's been nullified by the cache, so simply set self.spamprob
!         # and return True
!         
!         #if self.spamprob != prob:
!         #    self.spamprob = prob
!         #    return True
!         #else:
!         #    return False
!         
!         self.spamprob = prob
!         return True
  
      def probability(self, meta):


From popiel@wolfskeep.com  Fri Nov 22 18:22:57 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 22 Nov 2002 10:22:57 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7 
In-Reply-To: Message from "Tim Stone" <timstone4@users.sourceforge.net> 
	<E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net> 
References: <E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net> 
Message-ID: <20021122182258.5CCA9F580@cashew.wolfskeep.com>

In message:  <E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net>
             "Tim Stone" <timstone4@users.sourceforge.net> writes:
>Update of /cvsroot/spambayes/spambayes
>In directory sc8-pr-cvs1:/tmp/cvs-serv400
>
>Modified Files:
>      Tag: hammie-playground
>	classifier.py 
>Log Message:
>Added probability calculation result caching.  No benchmark available to see
>how much, if any, performance gain is achieved, but it seems like it could
>be significant, particularly in training large corpora, or with long running
>processes.

You need to nuke the probcache when meta.revision changes. :-)

Also, wouldn't the cache implemented by this patch be more
efficient if it indexed by hamcount and spamcount (both
integers) instead of hamratio and spamratio (both floats)?

- Alex
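
A minimal sketch of both suggestions, not the committed code: the cache is
keyed by the integer (hamcount, spamcount) pair and is emptied whenever the
revision number changes.  The names ProbCache, lookup and compute are
hypothetical and do not appear in classifier.py.

```python
class ProbCache:
    """Hypothetical helper: memoise word probabilities by integer counts."""

    def __init__(self):
        self.revision = None   # revision the cached values were computed for
        self.cache = {}        # (hamcount, spamcount) -> probability

    def lookup(self, hamcount, spamcount, revision, compute):
        # nham or nspam changed, so every cached probability is stale.
        if revision != self.revision:
            self.cache.clear()
            self.revision = revision
        key = (hamcount, spamcount)
        try:
            return self.cache[key]
        except KeyError:
            # compute is any callable taking (hamcount, spamcount).
            prob = self.cache[key] = compute(hamcount, spamcount)
            return prob
```

Integer keys hash and compare exactly, so two words with the same counts
always share one entry, whereas float ratios depend on how the division
happened to round.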

From timstone4@users.sourceforge.net  Fri Nov 22 23:50:21 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Fri, 22 Nov 2002 15:50:21 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.7,1.53.2.8
Message-ID: <E18FNYz-00042l-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv15530

Modified Files:
      Tag: hammie-playground
	classifier.py 
Log Message:
Corrected probability calculation result caching, which in the previous
version was <quite> flawed.

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.7
retrieving revision 1.53.2.8
diff -C2 -d -r1.53.2.7 -r1.53.2.8
*** classifier.py	22 Nov 2002 16:33:19 -0000	1.53.2.7
--- classifier.py	22 Nov 2002 23:50:18 -0000	1.53.2.8
***************
*** 48,52 ****
  
  PICKLE_VERSION = 1
- probcache = {}
  
  class MetaInfo(object):
--- 48,51 ----
***************
*** 56,60 ****
      has a revision, incremented every time nham or nspam is adjusted to
      invalidate any cached probabilities.
!     
      """
      def __init__(self):
--- 55,59 ----
      has a revision, incremented every time nham or nspam is adjusted to
      invalidate any cached probabilities.
! 
      """
      def __init__(self):
***************
*** 89,94 ****
      nspam = property(get_nspam, set_nspam)
  
!         
!     
  
  class WordInfo(object):
--- 88,93 ----
      nspam = property(get_nspam, set_nspam)
  
! 
! 
  
  class WordInfo(object):
***************
*** 115,119 ****
      def _update_probability(self, meta):
          """Compute and store p(word) = prob(msg is spam | msg contains word).
!         
          This is the Graham calculation, but stripped of biases, and
          stripped of clamping into 0.01 thru 0.99.  The Bayesian
--- 114,118 ----
      def _update_probability(self, meta):
          """Compute and store p(word) = prob(msg is spam | msg contains word).
! 
          This is the Graham calculation, but stripped of biases, and
          stripped of clamping into 0.01 thru 0.99.  The Bayesian
***************
*** 133,150 ****
          assert self.spamcount <= nspam
          spamratio = self.spamcount / nspam
-         
-         self.revision = meta.revision
-         
-         # do a cache lookaside here, to possibly save a bunch of calculations
-         try:
-             self.spamprob = probcache[hamratio][spamratio]
-             return True
-         except KeyError:
-             pass
-         except TypeError:
-             probcache[hamratio] = {}
  
          prob = spamratio / (hamratio + spamratio)
!         
          if options.experimental_ham_spam_imbalance_adjustment:
              spam2ham = min(nspam / nham, 1.0)
--- 132,138 ----
          assert self.spamcount <= nspam
          spamratio = self.spamcount / nspam
  
          prob = spamratio / (hamratio + spamratio)
! 
          if options.experimental_ham_spam_imbalance_adjustment:
              spam2ham = min(nspam / nham, 1.0)
***************
*** 194,216 ****
          prob = (StimesX + n * prob) / (S + n)
  
!         # populate the cache, so this calculation won't have to be done again
!         try:
!             probcache[hamratio][spamratio] = prob
!         except KeyError:
!             probcache[hamratio] = {}
!             probcache[hamratio][spamratio] = prob
!         
!         # the following code is meaningless to me, maybe a performance hack?
!         # if so, it's been nullified by the cache, so simply set self.spamprob
!         # and return True
!         
!         #if self.spamprob != prob:
!         #    self.spamprob = prob
!         #    return True
!         #else:
!         #    return False
!         
!         self.spamprob = prob
!         return True
  
      def probability(self, meta):
--- 182,192 ----
          prob = (StimesX + n * prob) / (S + n)
  
!         self.revision = meta.revision
! 
!         if self.spamprob != prob:
!             self.spamprob = prob
!             return True
!         else:
!             return False
  
      def probability(self, meta):
***************
*** 239,242 ****
--- 215,219 ----
          self.wordinfo = {}
          self.meta = MetaInfo()
+         self.probcache = {}
  
      def __getstate__(self):
***************
*** 435,441 ****
          important thing is that the probabilities get updated before
          calling spamprob() again.
!         
          """
  
          self._add_msg(wordstream, is_spam)
  
--- 412,419 ----
          important thing is that the probabilities get updated before
          calling spamprob() again.
! 
          """
  
+         self.probcache = {}    # nuke the prob cache
          self._add_msg(wordstream, is_spam)
  
***************
*** 445,449 ****
          Pass the same arguments you passed to learn().
          """
! 
          self._remove_msg(wordstream, is_spam)
  
--- 423,427 ----
          Pass the same arguments you passed to learn().
          """
!         self.probcache = {}    # nuke the prob cache
          self._remove_msg(wordstream, is_spam)
  
***************
*** 504,508 ****
              else:
                  record.hamcount += 1
!                 
              # Needed to tell a persistent DB that the content changed.
              wordinfo[word] = record
--- 482,486 ----
              else:
                  record.hamcount += 1
! 
              # Needed to tell a persistent DB that the content changed.
              wordinfo[word] = record
***************
*** 550,554 ****
                  prob = unknown
              else:
!                 prob = record.probability(self.meta)
              distance = abs(prob - 0.5)
              if distance >= mindist:
--- 528,532 ----
                  prob = unknown
              else:
!                 prob = self.probability(record)
              distance = abs(prob - 0.5)
              if distance >= mindist:
***************
*** 560,563 ****
--- 538,565 ----
          # Return (prob, word, record).
          return [t[1:] for t in clues]
+ 
+     def probability(self, word):
+         """Look up a word's (hamcount, spamcount) in the prob cache."""
+ 
+         # Dictionary of dictionaries is used here for efficiency
+ 
+         h = word.hamcount
+         s = word.spamcount
+ 
+         try:
+             return self.probcache[h][s]
+         except (KeyError, TypeError):
+             pass
+ 
+         # populate the cache, so this calculation won't have to be done again
+         try:
+             self.probcache[h]
+         except KeyError:
+             self.probcache[h] = {}
+ 
+         word.probability(self.meta)
+         self.probcache[h][s] = word.spamprob
+ 
+         return word.spamprob
  
  Bayes = Classifier


From mhammond@users.sourceforge.net  Sat Nov 23 02:57:49 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 22 Nov 2002 18:57:49 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
	mapi_driver.py,NONE,1.1
	delete_outlook_field.py,1.4,1.5 dump_props.py,1.6,1.7
Message-ID: <E18FQUP-0000hQ-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory sc8-pr-cvs1:/tmp/cvs-serv2516

Modified Files:
	delete_outlook_field.py dump_props.py 
Added Files:
	mapi_driver.py 
Log Message:
Stop cloning the folder location code by creating a utility class.


--- NEW FILE: mapi_driver.py ---
from __future__ import generators
# Utilities for our sandbox

import pythoncom
from win32com.mapi import mapi, mapiutil
from win32com.mapi.mapitags import *

from win32com.client import Dispatch

class MAPIDriver:
    def __init__(self, read_only = False):
        mapi.MAPIInitialize(None)
        logonFlags = (mapi.MAPI_NO_MAIL |
                      mapi.MAPI_EXTENDED |
                      mapi.MAPI_USE_DEFAULT)
        self.session = mapi.MAPILogonEx(0, None, None, logonFlags)
        if read_only:
            self.mapi_flags = mapi.MAPI_DEFERRED_ERRORS
        else:
            self.mapi_flags = mapi.MAPI_DEFERRED_ERRORS | mapi.MAPI_BEST_ACCESS
        self.outlook = None

    def _GetMAPIFlags(self, mapi_flags = None):
        if mapi_flags is None:
            mapi_flags = self.mapi_flags
        return mapi_flags

    def GetOutlookFolder(self, item):
        if self.outlook is None:
            self.outlook = Dispatch("Outlook.Application")

        hr, props = item.GetProps((PR_ENTRYID,PR_STORE_ENTRYID), 0)
        (tag, eid), (tag, store_eid) = props
        eid = mapi.HexFromBin(eid)
        store_eid = mapi.HexFromBin(store_eid)
        return self.outlook.Session.GetFolderFromID(eid, store_eid)

    def GetMessageStores(self):
        tab = self.session.GetMsgStoresTable(0)
        rows = mapi.HrQueryAllRows(tab,
                                   (PR_ENTRYID, PR_DISPLAY_NAME_A, PR_DEFAULT_STORE),   # columns to retrieve
                                   None,     # all rows
                                   None,            # any sort order is fine
                                   0)               # any # of results is fine
        for row in rows:
            (eid_tag, eid), (name_tag, name), (def_store_tag, def_store) = row
            # Open the store.
            store = self.session.OpenMsgStore(
                                0,      # no parent window
                                eid,    # msg store to open
                                None,   # IID; accept default IMsgStore
                                # need write access to add score fields
                                mapi.MDB_WRITE |
                                    # we won't send or receive email
                                    mapi.MDB_NO_MAIL |
                                    mapi.MAPI_DEFERRED_ERRORS)
            yield store, name, def_store

    def _FindSubfolder(self, store, folder, find_name):
        find_name = find_name.lower()
        table = folder.GetHierarchyTable(0)
        rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
        for (eid_tag, eid), (name_tag, name), in rows:
            if name.lower() == find_name:
                return store.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
        return None

    def FindFolder(self, name):
        assert name
        names = [n.lower() for n in name.split("\\")]
        if names[0]:
            for store, name, is_default in self.GetMessageStores():
                if is_default:
                    store_name = name.lower()
                    break
            folder_names = names
        else:
            store_name = names[1]
            folder_names = names[2:]
        # Find the store with the name
        for store, name, is_default in self.GetMessageStores():
            if name.lower() == store_name:
                folder_store = store
                break
        else:
            raise ValueError, "The store '%s' can not be located" % (store_name,)

        # folder_store is the store matched by the loop above.
        hr, data = folder_store.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
        subtree_eid = data[0][1]
        folder = folder_store.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)

        for name in folder_names:
            folder = self._FindSubfolder(folder_store, folder, name)
            if folder is None:
                raise ValueError, "The subfolder '%s' can not be located" % (name,)
        return folder

    def GetAllItems(self, folder, mapi_flags = None):
        mapi_flags = self._GetMAPIFlags(mapi_flags)
        table = folder.GetContentsTable(0)
        table.SetColumns((PR_ENTRYID,PR_STORE_ENTRYID), 0)
        while 1:
            # Getting 70 at a time was the random number that gave best
            # perf for me ;)
            rows = table.QueryRows(70, 0)
            if len(rows) == 0:
                break
            for row in rows:
                (tag, eid), (tag, store_eid) = row
                store = self.session.OpenMsgStore(0, store_eid, None, mapi_flags)
                item = store.OpenEntry(eid, None, mapi_flags)
                yield item

    def GetItemsWithValue(self, folder, prop_tag, prop_val, mapi_flags = None):
        mapi_flags = self._GetMAPIFlags(mapi_flags)
        tab = folder.GetContentsTable(0)
        # Restriction for the table:  get rows where our prop values match
        restriction = (mapi.RES_CONTENT,   # a property restriction
                       (mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
                        prop_tag,   # of the given prop
                        (prop_tag, prop_val))) # with given val
        rows = mapi.HrQueryAllRows(tab,
                                   (PR_ENTRYID, PR_STORE_ENTRYID),   # columns to retrieve
                                   restriction,     # only these rows
                                   None,            # any sort order is fine
                                   0)               # any # of results is fine
        for row in rows:
            (tag, eid),(tag, store_eid) = row
            store = self.session.OpenMsgStore(0, store_eid, None, mapi_flags)
            item = store.OpenEntry(eid, None, mapi_flags)
            yield item

    def DumpTopLevelFolders(self):
        print "Top-level folder names are:"
        for store, name, is_default in self.GetMessageStores():
            # Find the folder with the content.
            hr, data = store.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
            subtree_eid = data[0][1]
            folder = store.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
            # Now the top-level folders in the store.
            table = folder.GetHierarchyTable(0)
            rows = mapi.HrQueryAllRows(table, (PR_DISPLAY_NAME_A), None, None, 0)
            for (name_tag, folder_name), in rows:
                print " \\%s\\%s" % (name, folder_name)

    def GetFolderNameDoc(self):
        def_store_name = "<??unknown??>"
        for store, name, is_def in self.GetMessageStores():
            if is_def:
                def_store_name = name
        return """\
Folder name is a hierarchical 'path' name, using '\\'
as the path separator.  If the folder name begins with a
\\, it must be a fully-qualified name, including the message
store name. For example, as your default store is currently named
'%s', your Inbox can be specified either as:
  -f "Inbox"
or
  -f "\\%s\\Inbox"
""" % (def_store_name, def_store_name)


if __name__=='__main__':
    print "This is a utility script for the other scripts in this directory"
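
The folder-name rule described in GetFolderNameDoc() can be sketched without
any MAPI calls.  This is an illustration only: split_folder_path and its
default_store argument are hypothetical names, and it mirrors just the path
splitting done at the top of FindFolder(), not the store lookup.

```python
def split_folder_path(path, default_store):
    """Return (store_name, [folder, ...]) for a backslash-separated path.

    A leading backslash means the first component names the message
    store; otherwise the default store is assumed.  Names are lowered
    because the lookups compare case-insensitively.
    """
    if not path:
        raise ValueError("folder name must not be empty")
    names = [n.lower() for n in path.split("\\")]
    if names[0]:
        # Relative name: every component is a folder in the default store.
        return default_store.lower(), names
    if not names[1]:
        raise ValueError("a fully-qualified name must include a store name")
    # Fully-qualified name: component 1 is the store, the rest are folders.
    return names[1], names[2:]
```

With a default store named "Personal Folders", both "Inbox" and
"\Personal Folders\Inbox" resolve to the same store and folder list.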

Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** delete_outlook_field.py	14 Nov 2002 07:01:04 -0000	1.4
--- delete_outlook_field.py	23 Nov 2002 02:57:46 -0000	1.5
***************
*** 1,2 ****
--- 1,3 ----
+ from __future__ import generators
  # Do the best we can to completely obliterate a field from Outlook!
  
***************
*** 8,51 ****
  from win32com.mapi.mapitags import *
  
! mapi.MAPIInitialize(None)
! logonFlags = (mapi.MAPI_NO_MAIL |
!               mapi.MAPI_EXTENDED |
!               mapi.MAPI_USE_DEFAULT)
! session = mapi.MAPILogonEx(0, None, None, logonFlags)
! 
! def _FindDefaultMessageStore():
!     tab = session.GetMsgStoresTable(0)
!     # Restriction for the table:  get rows where PR_DEFAULT_STORE is true.
!     # There should be only one.
!     restriction = (mapi.RES_PROPERTY,   # a property restriction
!                    (mapi.RELOP_EQ,      # check for equality
!                     PR_DEFAULT_STORE,   # of the PR_DEFAULT_STORE prop
!                     (PR_DEFAULT_STORE, True))) # with True
!     rows = mapi.HrQueryAllRows(tab,
!                                (PR_ENTRYID,),   # columns to retrieve
!                                restriction,     # only these rows
!                                None,            # any sort order is fine
!                                0)               # any # of results is fine
!     # get first entry, a (property_tag, value) pair, for PR_ENTRYID
!     row = rows[0]
!     eid_tag, eid = row[0]
!     # Open the store.
!     return session.OpenMsgStore(
!                             0,      # no parent window
!                             eid,    # msg store to open
!                             None,   # IID; accept default IMsgStore
!                             # need write access to add score fields
!                             mapi.MDB_WRITE |
!                                 # we won't send or receive email
!                                 mapi.MDB_NO_MAIL |
!                                 mapi.MAPI_DEFERRED_ERRORS)
! 
! def _FindFolderEID(name):
!     from win32com.mapi import exchange
!     if not name.startswith("\\"):
!         name = "\\Top Of Personal Folders\\" + name
!     store = _FindDefaultMessageStore()
!     folder_eid = exchange.HrMAPIFindFolderEx(store, "\\", name)
!     return mapi.HexFromBin(folder_eid)
  
  def DeleteField_Outlook(folder, name):
--- 9,13 ----
  from win32com.mapi.mapitags import *
  
! import mapi_driver
  
  def DeleteField_Outlook(folder, name):
***************
*** 66,115 ****
      return num_outlook
  
! def DeleteField_MAPI(folder, name):
      # OK - now try and wipe the field using MAPI.
!     mapi_msgstore = _FindDefaultMessageStore()
!     mapi_folder = mapi_msgstore.OpenEntry(mapi.BinFromHex(folder.EntryID),
!                                           None,
!                                           mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
! 
!     table = mapi_folder.GetContentsTable(0)
!     prop_ids = PR_ENTRYID,
!     table.SetColumns(prop_ids, 0)
!     propIds = mapi_folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
      num_mapi = 0
!     if PROP_TYPE(propIds[0])!=PT_ERROR:
!         assert propIds[0] == PROP_TAG( PT_UNSPECIFIED, PROP_ID(propIds[0]))
!         while 1:
!             # Getting 70 at a time was the random number that gave best
!             # perf for me ;)
!             rows = table.QueryRows(70, 0)
!             if len(rows) == 0:
!                 break
!             for row in rows:
!                 eid = row[0][1]
!                 item = mapi_msgstore.OpenEntry(eid, None, mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
!                 # DeleteProps always says"success" - so check to see if it
!                 # actually exists just so we can count it.
!                 hr, vals = item.GetProps(propIds)
!                 if hr==0: # We actually have it
!                     hr, probs = item.DeleteProps(propIds)
!                     if  hr == 0:
!                         item.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
!                         num_mapi += 1
      return num_mapi
  
! def DeleteField_Folder(folder, name):
!     mapi_msgstore = _FindDefaultMessageStore()
!     mapi_folder = mapi_msgstore.OpenEntry(mapi.BinFromHex(folder.EntryID),
!                                           None,
!                                           mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
!     propIds = mapi_folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
!     num_mapi = 0
      if PROP_TYPE(propIds[0])!=PT_ERROR:
!         hr, vals = mapi_folder.GetProps(propIds)
          if hr==0: # We actually have it
!             hr, probs = mapi_folder.DeleteProps(propIds)
              if  hr == 0:
!                 mapi_folder.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
                  return 1
      return 0
--- 28,58 ----
      return num_outlook
  
! def DeleteField_MAPI(driver, folder, name):
      # OK - now try and wipe the field using MAPI.
!     propIds = folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
!     if PROP_TYPE(propIds[0])==PT_ERROR:
!         print "No such field '%s' in folder" % (name,)
!         return 0
!     assert propIds[0] == PROP_TAG( PT_UNSPECIFIED, PROP_ID(propIds[0]))
      num_mapi = 0
!     for item in driver.GetAllItems(folder):
!         # DeleteProps always says"success" - so check to see if it
!         # actually exists just so we can count it.
!         hr, vals = item.GetProps(propIds)
!         if hr==0: # We actually have it
!             hr, probs = item.DeleteProps(propIds)
!             if  hr == 0:
!                 item.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
!                 num_mapi += 1
      return num_mapi
  
! def DeleteField_Folder(driver, folder, name):
!     propIds = folder.GetIDsFromNames(((mapi.PS_PUBLIC_STRINGS,name),), 0)
      if PROP_TYPE(propIds[0])!=PT_ERROR:
!         hr, vals = folder.GetProps(propIds)
          if hr==0: # We actually have it
!             hr, probs = folder.DeleteProps(propIds)
              if  hr == 0:
!                 folder.SaveChanges(mapi.MAPI_DEFERRED_ERRORS)
                  return 1
      return 0
***************
*** 145,182 ****
          entry = entries.GetNext()
  
! def usage():
      msg = """\
! Usage: %s [-f foldername] [-f foldername] [-d] [-s] [FieldName ...]
  -f - Run over the specified folders (default = Inbox)
  -d - Delete the named fields
  -s - Show message subject and field value for all messages with field
! If no options given, prints a summary of field names in the folders
! --no-outlook - Don't delete via the Outlook UserProperties API
! --no-mapi - Don't delete via the extended MAPI API
! --no-folder - Don't attempt to delete the field from the folder itself
  
! Folder name must be a hierarchical 'path' name, using '\\'
! as the path seperator.  If the folder name begins with a
! \\, it must be a fully-qualified name, including the message
! store name (eg, "Top of Public Folders").  If the path does not
! begin with a \\, it is assumed to be fully-qualifed from the root
! of the default message store
  
! Eg, 'python\\python-dev' will locate a python-dev subfolder in a python
! subfolder in your default store.
! """ % os.path.basename(sys.argv[0])
      print msg
  
  
  def main():
      import getopt
      try:
          opts, args = getopt.getopt(sys.argv[1:],
!                                    "dsf:",
                                     ["no-mapi", "no-outlook", "no-folder"])
      except getopt.error, e:
          print e
          print
!         usage()
          sys.exit(1)
      delete = show = False
--- 88,123 ----
          entry = entries.GetNext()
  
! def usage(driver):
!     folder_doc = driver.GetFolderNameDoc()
      msg = """\
! Usage: %s [-f foldername -f ...] [-d] [-s] [FieldName ...]
  -f - Run over the specified folders (default = Inbox)
  -d - Delete the named fields
+   --no-outlook - Don't delete via the Outlook UserProperties API
+   --no-mapi - Don't delete via the extended MAPI API
+   --no-folder - Don't attempt to delete the field from the folder itself
  -s - Show message subject and field value for all messages with field
! -n - Show top-level folder names and exit
  
! If no options are given, prints a summary of field names in the folders.
  
! %s
! Use the -n option to see all top-level folder names from all stores.""" \
!         % (os.path.basename(sys.argv[0]), folder_doc)
      print msg
  
  
  def main():
+     driver = mapi_driver.MAPIDriver()
+ 
      import getopt
      try:
          opts, args = getopt.getopt(sys.argv[1:],
!                                    "dnsf:",
                                     ["no-mapi", "no-outlook", "no-folder"])
      except getopt.error, e:
          print e
          print
!         usage(driver)
          sys.exit(1)
      delete = show = False
***************
*** 196,200 ****
          elif opt == "--no-folder":
              do_folder = False
! 
          else:
              print "Invalid arg"
--- 137,143 ----
          elif opt == "--no-folder":
              do_folder = False
!         elif opt == "-n":
!             driver.DumpTopLevelFolders()
!             sys.exit(1)
          else:
              print "Invalid arg"
***************
*** 203,230 ****
      if not folder_names:
          folder_names = ["Inbox"] # Assume this exists!
-     app = Dispatch("Outlook.Application")
      if not args:
          print "No args specified - dumping all unique UserProperty names,"
          print "and the count of messages they appear in"
      for folder_name in folder_names:
!         eid = _FindFolderEID(folder_name)
!         if eid is None:
!             print "*** Cant find folder", folder_name
              continue
!         folder = app.Session.GetFolderFromID(eid)
!         print "Processing folder", folder.Name.encode("mbcs", "replace")
          if not args:
!             CountFields(folder)
              continue
          for field_name in args:
              if show:
!                 ShowFields(folder, field_name)
              if delete:
                  print "Deleting field", field_name
                  if do_outlook:
!                     num = DeleteField_Outlook(folder, field_name)
                      print "Deleted", num, "field instances from Outlook"
                  if do_mapi:
!                     num = DeleteField_MAPI(folder, field_name)
                      print "Deleted", num, "field instances via MAPI"
                  if do_folder:
--- 146,177 ----
      if not folder_names:
          folder_names = ["Inbox"] # Assume this exists!
      if not args:
          print "No args specified - dumping all unique UserProperty names,"
          print "and the count of messages they appear in"
+     outlook = None
      for folder_name in folder_names:
!         try:
!             folder = driver.FindFolder(folder_name)
!         except ValueError, details:
!             print details
!             print "Ignoring folder '%s'" % (folder_name,)
              continue
!         print "Processing folder '%s'" % (folder_name,)
          if not args:
!             outlook_folder = driver.GetOutlookFolder(folder)
!             CountFields(outlook_folder)
              continue
          for field_name in args:
              if show:
!                 outlook_folder = driver.GetOutlookFolder(folder)
!                 ShowFields(outlook_folder, field_name)
              if delete:
                  print "Deleting field", field_name
                  if do_outlook:
!                     outlook_folder = driver.GetOutlookFolder(folder)
!                     num = DeleteField_Outlook(outlook_folder, field_name)
                      print "Deleted", num, "field instances from Outlook"
                  if do_mapi:
!                     num = DeleteField_MAPI(driver, folder, field_name)
                      print "Deleted", num, "field instances via MAPI"
                  if do_folder:

Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** dump_props.py	20 Nov 2002 22:06:17 -0000	1.6
--- dump_props.py	23 Nov 2002 02:57:46 -0000	1.7
***************
*** 2,6 ****
  # Dump every property we can find for a MAPI item
  
- from win32com.client import Dispatch, constants
  import pythoncom
  import os, sys
--- 2,5 ----
***************
*** 9,77 ****
  from win32com.mapi.mapitags import *
  
! mapi.MAPIInitialize(None)
! logonFlags = (mapi.MAPI_NO_MAIL |
!               mapi.MAPI_EXTENDED |
!               mapi.MAPI_USE_DEFAULT)
! session = mapi.MAPILogonEx(0, None, None, logonFlags)
! 
! def GetMessageStores():
!     tab = session.GetMsgStoresTable(0)
!     rows = mapi.HrQueryAllRows(tab,
!                                (PR_ENTRYID, PR_DISPLAY_NAME_A, PR_DEFAULT_STORE),   # columns to retrieve
!                                None,     # all rows
!                                None,            # any sort order is fine
!                                0)               # any # of results is fine
!     for row in rows:
!         (eid_tag, eid), (name_tag, name), (def_store_tag, def_store) = row
!         # Open the store.
!         store = session.OpenMsgStore(
!                             0,      # no parent window
!                             eid,    # msg store to open
!                             None,   # IID; accept default IMsgStore
!                             # need write access to add score fields
!                             mapi.MDB_WRITE |
!                                 # we won't send or receive email
!                                 mapi.MDB_NO_MAIL |
!                                 mapi.MAPI_DEFERRED_ERRORS)
!         yield store, name, def_store
! 
! def _FindSubfolder(store, folder, find_name):
!     find_name = find_name.lower()
!     table = folder.GetHierarchyTable(0)
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
!     for (eid_tag, eid), (name_tag, name), in rows:
!         if name.lower() == find_name:
!             return store.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
!     return None
! 
! def FindFolder(name):
!     assert name
!     names = [n.lower() for n in name.split("\\")]
!     if names[0]:
!         for store, name, is_default in GetMessageStores():
!             if is_default:
!                 store_name = name.lower()
!                 break
!         folder_names = names
!     else:
!         store_name = names[1]
!         folder_names = names[2:]
!     # Find the store with the name
!     for store, name, is_default in GetMessageStores():
!         if name.lower() == store_name:
!             folder_store = store
!             break
!     else:
!         raise ValueError, "The store '%s' can not be located" % (store_name,)
! 
!     hr, data = store.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
!     subtree_eid = data[0][1]
!     folder = folder_store.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
! 
!     for name in folder_names:
!         folder = _FindSubfolder(folder_store, folder, name)
!         if folder is None:
!             raise ValueError, "The subfolder '%s' can not be located" % (name,)
!     return folder_store, folder        
  
  # Also in new versions of mapituil
--- 8,12 ----
  from win32com.mapi.mapitags import *
  
! import mapi_driver
  
  # Also in new versions of mapituil
***************
*** 95,114 ****
      return ret
  
- def _FindItemsWithValue(folder, prop_tag, prop_val):
-     tab = folder.GetContentsTable(0)
-     # Restriction for the table:  get rows where our prop values match
-     restriction = (mapi.RES_CONTENT,   # a property restriction
-                    (mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
-                     prop_tag,   # of the given prop
-                     (prop_tag, prop_val))) # with given val
-     rows = mapi.HrQueryAllRows(tab,
-                                (PR_ENTRYID,),   # columns to retrieve
-                                restriction,     # only these rows
-                                None,            # any sort order is fine
-                                0)               # any # of results is fine
-     # get entry IDs
-     return [row[0][1] for row in rows]
- 
- 
  def DumpItemProps(item, shorten):
      for prop_name, prop_val in GetAllProperties(item):
--- 30,33 ----
***************
*** 118,131 ****
          print "%-20s: %s" % (prop_name, prop_repr)
  
! def DumpProps(mapi_msgstore, mapi_folder, subject, include_attach, shorten):
      hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
      name = data[0][1]
!     eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject)
!     print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject)
!     for eid in eids:
!         print "Dumping item with ID", mapi.HexFromBin(eid)
!         item = mapi_msgstore.OpenEntry(eid,
!                                        None,
!                                        mapi.MAPI_DEFERRED_ERRORS)
          DumpItemProps(item, shorten)
          if include_attach:
--- 37,44 ----
          print "%-20s: %s" % (prop_name, prop_repr)
  
! def DumpProps(driver, mapi_folder, subject, include_attach, shorten):
      hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
      name = data[0][1]
!     for item in driver.GetItemsWithValue(mapi_folder, PR_SUBJECT_A, subject):
          DumpItemProps(item, shorten)
          if include_attach:
***************
*** 139,160 ****
                  DumpItemProps(attach, shorten)
  
! def DumpTopLevelFolders():
!     print "Top-level folder names are:"
!     for store, name, is_default in GetMessageStores():
!         # Find the folder with the content.
!         hr, data = store.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
!         subtree_eid = data[0][1]
!         folder = store.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
!         # Now the top-level folders in the store.
!         table = folder.GetHierarchyTable(0)
!         rows = mapi.HrQueryAllRows(table, (PR_DISPLAY_NAME_A), None, None, 0)
!         for (name_tag, folder_name), in rows:
!             print " \\%s\\%s" % (name, folder_name)
! 
! def usage():
!     def_store_name = "<??unknown??>"
!     for store, name, is_def in GetMessageStores():
!         if is_def:
!             def_store_name = name
      msg = """\
  Usage: %s [-f foldername] subject of the message
--- 52,57 ----
                  DumpItemProps(attach, shorten)
  
! def usage(driver):
!     folder_doc = driver.GetFolderNameDoc()
      msg = """\
  Usage: %s [-f foldername] subject of the message
***************
*** 167,184 ****
  matching is substring and ignore-case.
  
! Folder name must be a hierarchical 'path' name, using '\\'
! as the path seperator.  If the folder name begins with a
! \\, it must be a fully-qualified name, including the message
! store name. For example, your Inbox can be specified either as:
!   -f "Inbox"
! or
!   -f "\\%s\\Inbox"
! 
! Use the -n option to see all top-level folder names from all stores.
! """ % (os.path.basename(sys.argv[0]), def_store_name)
      print msg
  
- 
  def main():
      import getopt
      try:
--- 64,75 ----
  matching is substring and ignore-case.
  
! %s
! Use the -n option to see all top-level folder names from all stores.""" \
!     % (os.path.basename(sys.argv[0]),folder_doc)
      print msg
  
  def main():
+     driver = mapi_driver.MAPIDriver()
+ 
      import getopt
      try:
***************
*** 187,191 ****
          print e
          print
!         usage()
          sys.exit(1)
      folder_name = ""
--- 78,82 ----
          print e
          print
!         usage(driver)
          sys.exit(1)
      folder_name = ""
***************
*** 201,205 ****
              include_attach = True
          elif opt == "-n":
!             DumpTopLevelFolders()
              sys.exit(1)
          else:
--- 92,96 ----
              include_attach = True
          elif opt == "-n":
!             driver.DumpTopLevelFolders()
              sys.exit(1)
          else:
***************
*** 214,227 ****
          print "You must specify a subject"
          print
!         usage()
          sys.exit(1)
  
      try:
!         store, folder = FindFolder(folder_name)
      except ValueError, details:
          print details
          sys.exit(1)
  
!     DumpProps(store, folder, subject, include_attach, shorten)
  
  if __name__=='__main__':
--- 105,118 ----
          print "You must specify a subject"
          print
!         usage(driver)
          sys.exit(1)
  
      try:
!         folder = driver.FindFolder(folder_name)
      except ValueError, details:
          print details
          sys.exit(1)
  
!     DumpProps(driver, folder, subject, include_attach, shorten)
  
  if __name__=='__main__':
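
The folder-name convention handled by FindFolder above (a leading backslash marks a fully-qualified name whose first component is the message store; otherwise the default store is assumed) can be sketched without MAPI. This is only an illustration: `parse_folder_path` and its `default_store_name` parameter are hypothetical names, and the real code goes on to open the store and walk its subfolders.

```python
def parse_folder_path(name, default_store_name):
    # Split a folder 'path' such as "Inbox" or "\\Store\\Inbox" into
    # (store name, list of folder names), lower-cased as in FindFolder.
    if not name:
        raise ValueError("A folder name is required")
    names = [n.lower() for n in name.split("\\")]
    if names[0]:
        # No leading backslash: the path is relative to the default store.
        return default_store_name.lower(), names
    # Leading backslash: the first real component names the store.
    return names[1], names[2:]
```

For example, `parse_folder_path("Inbox", "Personal Folders")` yields the default store with a single folder component, while a name starting with a backslash takes its store from the path itself.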


From mhammond@users.sourceforge.net  Sat Nov 23 04:58:59 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 22 Nov 2002 20:58:59 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.32,1.33
Message-ID: <E18FSNf-0001hx-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv6187

Modified Files:
	msgstore.py 
Log Message:
Change the way we detect unread messages (this seems to work with 
Exchange), skip all non "IPM.Note*" message classes, and remove debug 
print that doesn't seem necessary any more.
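
As a rough illustration of the two tests this change introduces, the sketch below models them in plain Python rather than through MAPI. The helper names `is_unread`, `passes_class_restriction` and `is_note_class` are hypothetical, the flag value mirrors the `MSGFLAG_READ = 0x1` constant added in this revision, and it is an assumption that the `RELOP_GE` comparison behaves like an ordinary string comparison.

```python
MSGFLAG_READ = 0x1  # same value as the constant added to msgstore.py

def is_unread(message_flags):
    # Mirrors the BMR_EQZ bitmask test: true when the read bit is clear.
    return (message_flags & MSGFLAG_READ) == 0

def passes_class_restriction(message_class, minimum="IPM.Note"):
    # Models RELOP_GE as a plain string comparison (an assumption).
    return message_class >= minimum

def is_note_class(message_class):
    # A stricter prefix test, shown only for comparison.
    return message_class.startswith("IPM.Note")
```

Under that assumption, `passes_class_restriction` accepts "IPM.Note" and "IPM.Note.SMIME" and rejects "IPM.Appointment", but it would also accept classes that merely sort after "IPM.Note", such as "IPM.Post"; the prefix test would not.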


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** msgstore.py	21 Nov 2002 02:57:05 -0000	1.32
--- msgstore.py	23 Nov 2002 04:58:56 -0000	1.33
***************
*** 86,89 ****
--- 86,90 ----
  
  MESSAGE_MOVE = 0x1 # from MAPIdefs.h
+ MSGFLAG_READ = 0x1 # from MAPIdefs.h
  MYPR_BODY_HTML_A = 0x1013001e # magic <wink>
  MYPR_BODY_HTML_W = 0x1013001f # ditto
***************
*** 282,285 ****
--- 283,292 ----
          folder = self.msgstore._OpenEntry(self.id)
          table = folder.GetContentsTable(0)
+         # Limit ourselves to IPM.Note objects - ie, messages.
+         restriction = (mapi.RES_PROPERTY,   # a property restriction
+                        (mapi.RELOP_GE,      # >=
+                         PR_MESSAGE_CLASS_A,   # of this prop
+                         (PR_MESSAGE_CLASS_A, "IPM.Note"))) # with this value
+         table.Restrict(restriction, 0)
          prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
          table.SetColumns(prop_ids, 0)
***************
*** 306,317 ****
          table.SetColumns(prop_ids, 0)
          # Set up the restriction
!         prop_restriction = (mapi.RES_PROPERTY,   # a property restriction
!                                (mapi.RELOP_EQ,      # check for equality
!                                 PR_CONTENT_UNREAD,   # of the unread flag
!                                 (PR_CONTENT_UNREAD, True))
!                             )
          exist_restriction = mapi.RES_EXIST, (field_id,)
          not_exist_restriction = mapi.RES_NOT, (exist_restriction,)
!         restriction = (mapi.RES_AND, (prop_restriction, not_exist_restriction))
          table.Restrict(restriction, 0)
          while 1:
--- 313,332 ----
          table.SetColumns(prop_ids, 0)
          # Set up the restriction
!         # Need to check message-flags - PR_CONTENT_UNREAD "optional"
!         prop_restriction = (mapi.RES_BITMASK,   # a bitmask restriction
!                                (mapi.BMR_EQZ,      # when bit is clear
!                                 PR_MESSAGE_FLAGS,
!                                 MSGFLAG_READ))
          exist_restriction = mapi.RES_EXIST, (field_id,)
          not_exist_restriction = mapi.RES_NOT, (exist_restriction,)
!         # A restriction for the message class
!         class_restriction = (mapi.RES_PROPERTY,   # a property restriction
!                              (mapi.RELOP_GE,      # >=
!                               PR_MESSAGE_CLASS_A,   # of the this prop
!                               PR_MESSAGE_CLASS_A,   # of this prop
!         # Put the final restriction together
!         restriction = (mapi.RES_AND, (prop_restriction,
!                                       not_exist_restriction,
!                                       class_restriction))
          table.Restrict(restriction, 0)
          while 1:
***************
*** 485,494 ****
                  sub = msg.get_payload(0)
                  body = sub.get_payload()
- 
-         if not html and not body:
-             # MarkH has only ever seen this when it is indeed true!
-             # (generally as the message has an attachment and nothing else)
-             print "Couldn't find any useful body for message '%s'" \
-                   % (self.GetField(PR_SUBJECT_A),)
  
          return "%s\n%s\n%s" % (headers, html, body)
--- 500,503 ----


From mhammond@users.sourceforge.net  Sat Nov 23 06:45:46 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 22 Nov 2002 22:45:46 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.33,1.34
Message-ID: <E18FU30-00034e-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv11667

Modified Files:
	msgstore.py 
Log Message:
Ignore errors when looking into MIME attachments for a body.
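
The pattern applied here, treating a failed query as "no rows" so the caller's existing empty-result branch handles it, can be sketched in isolation. `QueryError` is a hypothetical stand-in for `pythoncom.com_error`, and `query_rows_or_empty` and `failing_query` are illustrative names that do not appear in msgstore.py.

```python
class QueryError(Exception):
    """Hypothetical stand-in for pythoncom.com_error."""

def query_rows_or_empty(run_query):
    # Run the query; treat a failure as "no rows" instead of propagating it.
    try:
        return run_query()
    except QueryError:
        return []

def failing_query():
    # Simulates a table that cannot be read.
    raise QueryError("table could not be read")
```

Only the one expected exception type is caught, so unrelated errors still surface.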


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** msgstore.py	23 Nov 2002 04:58:56 -0000	1.33
--- msgstore.py	23 Nov 2002 06:45:43 -0000	1.34
***************
*** 466,474 ****
                              PR_ATTACH_MIME_TAG_A,   # of the given prop
                              (PR_ATTACH_MIME_TAG_A, "multipart/signed")))
!             rows = mapi.HrQueryAllRows(table,
!                                        (PR_ATTACH_NUM,), # columns to get
!                                        restriction,    # only these rows
!                                        None,    # any sort order is fine
!                                        0)       # any # of results is fine
              if len(rows) == 0:
                  pass # Nothing we can fetch :(
--- 466,478 ----
                              PR_ATTACH_MIME_TAG_A,   # of the given prop
                              (PR_ATTACH_MIME_TAG_A, "multipart/signed")))
!             try:
!                 rows = mapi.HrQueryAllRows(table,
!                                            (PR_ATTACH_NUM,), # columns to get
!                                            restriction,    # only these rows
!                                            None,    # any sort order is fine
!                                            0)       # any # of results is fine
!             except pythoncom.com_error:
!                 # For some reason there are no rows we can get
!                 rows = []
              if len(rows) == 0:
                  pass # Nothing we can fetch :(


From mhammond@users.sourceforge.net  Sat Nov 23 10:32:50 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 23 Nov 2002 02:32:50 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.34,1.35
Message-ID: <E18FXak-0002MV-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv9065

Modified Files:
	manager.py 
Log Message:
Make ShowManager() an instance method as well as a module-level function.
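
The shape of this change, with the dialog logic moved onto the class and the old module-level entry point kept as a thin wrapper, looks roughly like the sketch below. The `Manager` class here is a hypothetical stand-in that only counts calls; the real method opens the manager dialog.

```python
class Manager:
    def __init__(self):
        self.shown = 0

    def ShowManager(self):
        # The real method opens the manager dialog; this stand-in counts calls.
        self.shown += 1

def ShowManager(mgr):
    # Module-level wrapper kept for existing callers; it simply delegates.
    mgr.ShowManager()
```

Existing callers of `ShowManager(mgr)` keep working, while new code can pass the bound method `mgr.ShowManager` around without extra arguments.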

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** manager.py	12 Nov 2002 04:52:12 -0000	1.34
--- manager.py	23 Nov 2002 10:32:48 -0000	1.35
***************
*** 282,285 ****
--- 282,311 ----
              return score
  
+     def ShowManager(self):
+         def do_train(dlg):
+             import train
+             import dialogs.TrainingDialog
+             d = dialogs.TrainingDialog.TrainingDialog(dlg.mgr, train.trainer)
+             d.DoModal()
+ 
+         def do_filter(dlg):
+             import filter
+             import dialogs.FilterDialog
+             d = dialogs.FilterDialog.FilterNowDialog(dlg.mgr, filter.filterer)
+             d.DoModal()
+ 
+         def define_filter(dlg):
+             import filter
+             import dialogs.FilterDialog
+             d = dialogs.FilterDialog.FilterArrivalsDialog(dlg.mgr, filter.filterer)
+             d.DoModal()
+             if dlg.mgr.addin is not None:
+                 dlg.mgr.addin.FiltersChanged()
+ 
+ 
+         import dialogs.ManagerDialog
+         d = dialogs.ManagerDialog.ManagerDialog(self, do_train, do_filter, define_filter)
+         d.DoModal()
+ 
  _mgr = None
  
***************
*** 296,323 ****
  
  def ShowManager(mgr):
!     def do_train(dlg):
!         import train
!         import dialogs.TrainingDialog
!         d = dialogs.TrainingDialog.TrainingDialog(dlg.mgr, train.trainer)
!         d.DoModal()
! 
!     def do_filter(dlg):
!         import filter
!         import dialogs.FilterDialog
!         d = dialogs.FilterDialog.FilterNowDialog(dlg.mgr, filter.filterer)
!         d.DoModal()
! 
!     def define_filter(dlg):
!         import filter
!         import dialogs.FilterDialog
!         d = dialogs.FilterDialog.FilterArrivalsDialog(dlg.mgr, filter.filterer)
!         d.DoModal()
!         if dlg.mgr.addin is not None:
!             dlg.mgr.addin.FiltersChanged()
! 
! 
!     import dialogs.ManagerDialog
!     d = dialogs.ManagerDialog.ManagerDialog(mgr, do_train, do_filter, define_filter)
!     d.DoModal()
  
  def main(verbose_level = 1):
--- 322,326 ----
  
  def ShowManager(mgr):
!     mgr.ShowManager()
  
  def main(verbose_level = 1):


From mhammond@users.sourceforge.net  Sat Nov 23 10:34:26 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 23 Nov 2002 02:34:26 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.36,1.37
Message-ID: <E18FXcI-0002Rk-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv9397

Modified Files:
	addin.py 
Log Message:
Ensure our UI is attached to every Outlook window, not just the first one when we start.  Involved a fair bit of reorganization!
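
The reorganization amounts to keeping one wrapper object per top-level window and tracking them in a shared list. A minimal sketch of that bookkeeping, with no COM involved, follows. `ExplorerWindow` and `ExplorerCollection` are hypothetical names standing in for the event classes in the diff, and the UI setup is reduced to a flag.

```python
class ExplorerWindow:
    """Hypothetical per-window wrapper; one instance per top-level window."""
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.have_setup_ui = False

    def on_activate(self):
        # Build the UI lazily, the first time the window is activated.
        if not self.have_setup_ui:
            self.have_setup_ui = True

    def on_close(self):
        # Drop ourselves from the shared list so the window can go away.
        self.registry.remove(self)


class ExplorerCollection:
    """Hypothetical tracker for every open window."""
    def __init__(self):
        self.windows = []

    def add_existing(self, name):
        # Windows already open at startup get their UI immediately.
        window = self._track(name)
        window.on_activate()
        return window

    def on_new_window(self, name):
        # Windows opened later set up their UI on first activation.
        return self._track(name)

    def _track(self, name):
        window = ExplorerWindow(name, self.windows)
        self.windows.append(window)
        return window
```

Holding the windows in a list owned by the collection keeps each wrapper alive while its window is open, and removing it on close lets it be released.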

Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** addin.py	21 Nov 2002 02:57:05 -0000	1.36
--- addin.py	23 Nov 2002 10:34:24 -0000	1.37
***************
*** 100,104 ****
              tlb, index = ti.GetContainingTypeLib()
              tla = tlb.GetLibAttr()
!             mod = gencache.EnsureModule(tla[0], tla[1], tla[3], tla[4])
              disp_class = gencache.GetClassForProgID(str(disp_clsid))
          except pythoncom.com_error:
--- 100,104 ----
              tlb, index = ti.GetContainingTypeLib()
              tla = tlb.GetLibAttr()
!             gencache.EnsureModule(tla[0], tla[1], tla[3], tla[4])
              disp_class = gencache.GetClassForProgID(str(disp_clsid))
          except pythoncom.com_error:
***************
*** 213,222 ****
  
  # Event function fired from the "Show Clues" UI items.
! def ShowClues(mgr, app):
      from cgi import escape
  
!     msgstore_message = mgr.addin.GetSelectedMessages(False)
      if msgstore_message is None:
          return
      item = msgstore_message.GetOutlookItem()
      score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
--- 213,224 ----
  
  # Event function fired from the "Show Clues" UI items.
! def ShowClues(mgr, explorer):
      from cgi import escape
  
!     app = explorer.Application
!     msgstore_message = explorer.GetSelectedMessages(False)
      if msgstore_message is None:
          return
+ 
      item = msgstore_message.GetOutlookItem()
      score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
***************
*** 262,328 ****
      new_msg.Display()
  
- # Events from our Explorer instance - currently used to enable/disable
- # controls
- class ExplorerEvent:
-     def Init(self, manager, application, but_delete_as, but_recover_as):
-         self.manager = manager
-         self.application = application
-         self.but_delete_as = but_delete_as
-         self.but_recover_as = but_recover_as
-     def Close(self):
-         self.but_delete_as = self.but_recover_as = None
-     def OnFolderSwitch(self):
-         # Work out what folder we are in.
-         explorer = self.application.ActiveExplorer()
-         if explorer is None:
-             print "** Folder Change, but don't have an explorer"
-             return
- 
-         outlook_folder = explorer.CurrentFolder
-         show_delete_as = True
-         show_recover_as = False
-         try:
-             if outlook_folder is not None:
-                 mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
-                 look_id = self.manager.config.filter.spam_folder_id
-                 if look_id:
-                     look_folder = self.manager.message_store.GetFolder(look_id)
-                     if mapi_folder == look_folder:
-                         # This is the Spam folder - only show "recover"
-                         show_recover_as = True
-                         show_delete_as = False
-                 # Check if uncertain
-                 look_id = self.manager.config.filter.unsure_folder_id
-                 if look_id:
-                     look_folder = self.manager.message_store.GetFolder(look_id)
-                     if mapi_folder == look_folder:
-                         show_recover_as = True
-                         show_delete_as = True
-         except:
-             print "Error finding the MAPI folders for a folder switch event"
-             import traceback
-             traceback.print_exc()
-         self.but_recover_as.Visible = show_recover_as
-         self.but_delete_as.Visible = show_delete_as
- 
  # The "Delete As Spam" and "Recover Spam" button
  # The event from Outlook's explorer that our folder has changed.
  class ButtonDeleteAsEventBase:
!     def Init(self, manager, application):
!         # NOTE - keeping a reference to 'explorer' in this event
!         # appears to cause an Outlook circular reference, and outlook
!         # never terminates (it does close, but the process remains alive)
!         # This is why we needed to use WithEvents, so the event class
!         # itself doesnt keep such a reference (and we need to keep a ref
!         # to the event class so it doesn't auto-disconnect!)
          self.manager = manager
!         self.application = application
  
      def Close(self):
!         self.manager = self.application = None
  
  class ButtonDeleteAsSpamEvent(ButtonDeleteAsEventBase):
!     def Init(self, manager, application):
!         ButtonDeleteAsEventBase.Init(self, manager, application)
          image = "delete_as_spam.bmp"
          self.Caption = "Delete As Spam"
--- 264,280 ----
      new_msg.Display()
  
  # The "Delete As Spam" and "Recover Spam" button
  # The event from Outlook's explorer that our folder has changed.
  class ButtonDeleteAsEventBase:
!     def Init(self, manager, explorer):
          self.manager = manager
!         self.explorer = explorer
  
      def Close(self):
!         self.manager = self.explorer = None
  
  class ButtonDeleteAsSpamEvent(ButtonDeleteAsEventBase):
!     def Init(self, manager, explorer):
!         ButtonDeleteAsEventBase.Init(self, manager, explorer)
          image = "delete_as_spam.bmp"
          self.Caption = "Delete As Spam"
***************
*** 334,338 ****
      def OnClick(self, button, cancel):
          msgstore = self.manager.message_store
!         msgstore_messages = self.manager.addin.GetSelectedMessages(True)
          if not msgstore_messages:
              return
--- 286,290 ----
      def OnClick(self, button, cancel):
          msgstore = self.manager.message_store
!         msgstore_messages = self.explorer.GetSelectedMessages(True)
          if not msgstore_messages:
              return
***************
*** 356,361 ****
  
  class ButtonRecoverFromSpamEvent(ButtonDeleteAsEventBase):
!     def Init(self, manager, application):
!         ButtonDeleteAsEventBase.Init(self, manager, application)
          image = "recover_ham.bmp"
          self.Caption = "Recover from Spam"
--- 308,313 ----
  
  class ButtonRecoverFromSpamEvent(ButtonDeleteAsEventBase):
!     def Init(self, manager, explorer):
!         ButtonDeleteAsEventBase.Init(self, manager, explorer)
          image = "recover_ham.bmp"
          self.Caption = "Recover from Spam"
***************
*** 369,373 ****
      def OnClick(self, button, cancel):
          msgstore = self.manager.message_store
!         msgstore_messages = self.manager.addin.GetSelectedMessages(True)
          if not msgstore_messages:
              return
--- 321,325 ----
      def OnClick(self, button, cancel):
          msgstore = self.manager.message_store
!         msgstore_messages = self.explorer.GetSelectedMessages(True)
          if not msgstore_messages:
              return
***************
*** 375,381 ****
          # Get the inbox as the default place to restore to
          # (incase we dont know (early code) or folder removed etc
          inbox_folder = msgstore.GetFolder(
!                     self.application.Session.GetDefaultFolder(
!                         constants.olFolderInbox))
          import train
          for msgstore_message in msgstore_messages:
--- 327,333 ----
          # Get the inbox as the default place to restore to
          # (incase we dont know (early code) or folder removed etc
+         app = self.explorer.Application
          inbox_folder = msgstore.GetFolder(
!                     app.Session.GetDefaultFolder(constants.olFolderInbox))
          import train
          for msgstore_message in msgstore_messages:
***************
*** 408,411 ****
--- 360,515 ----
      button.PasteFace()
  
+ # A class that manages an "Outlook Explorer" - that is, a top-level window
+ # All UI elements are managed here, and there is one instance per explorer.
+ class ExplorerWithEvents:
+     def Init(self, manager, explorer_list):
+         self.manager = manager
+         self.have_setup_ui = False
+         self.explorer_list = explorer_list
+ 
+     def SetupUI(self):
+         application = self.Application
+         manager = self.manager
+         self.buttons = []
+         activeExplorer = self
+         bars = activeExplorer.CommandBars
+         toolbar = bars.Item("Standard")
+         # Add our "Delete as ..." and "Recover as" buttons
+         self.but_delete_as = button = toolbar.Controls.Add(
+                                 Type=constants.msoControlButton,
+                                 Temporary=True)
+         # Hook events for the item
+         button.BeginGroup = True
+         button = DispatchWithEvents(button, ButtonDeleteAsSpamEvent)
+         button.Init(self.manager, self)
+         self.buttons.append(button)
+         # And again for "Recover as"
+         self.but_recover_as = button = toolbar.Controls.Add(
+                                 Type=constants.msoControlButton,
+                                 Temporary=True)
+         button = DispatchWithEvents(button, ButtonRecoverFromSpamEvent)
+         self.buttons.append(button)
+         # Hook our explorer events, and pass the buttons.
+         button.Init(self.manager, self)
+ 
+         # And prime our event handler.
+         self.OnFolderSwitch()
+ 
+         # The main tool-bar dropdown with all our entries.
+         # Add a pop-up menu to the toolbar
+         popup = toolbar.Controls.Add(
+                             Type=constants.msoControlPopup,
+                             Temporary=True)
+         popup.Caption="Anti-Spam"
+         popup.TooltipText = "Anti-Spam filters and functions"
+         popup.Enabled = True
+         # Convert from "CommandBarItem" to derived
+         # "CommandBarPopup" Not sure if we should be able to work
+         # this out ourselves, but no introspection I tried seemed
+         # to indicate we can.  VB does it via strongly-typed
+         # declarations.
+         popup = CastTo(popup, "CommandBarPopup")
+         # And add our children.
+         self._AddPopup(popup, manager.ShowManager, (),
+                        Caption="Anti-Spam Manager...",
+                        TooltipText = "Show the Anti-Spam manager dialog.",
+                        Enabled = True)
+         self._AddPopup(popup, ShowClues, (self.manager, self),
+                        Caption="Show spam clues for current message",
+                        Enabled=True)
+         self.have_setup_ui = True
+ 
+     def _AddPopup(self, parent, target, target_args, **item_attrs):
+         item = parent.Controls.Add(Type=constants.msoControlButton, Temporary=True)
+         # Hook events for the item
+         item = DispatchWithEvents(item, ButtonEvent)
+         item.Init(target, target_args)
+         for attr, val in item_attrs.items():
+             setattr(item, attr, val)
+         self.buttons.append(item)
+ 
+     def GetSelectedMessages(self, allow_multi = True, explorer = None):
+         if explorer is None:
+             explorer = self.Application.ActiveExplorer()
+         sel = explorer.Selection
+         if sel.Count > 1 and not allow_multi:
+             win32ui.MessageBox("Please select a single item", "Large selection")
+             return None
+ 
+         ret = []
+         for i in range(sel.Count):
+             item = sel.Item(i+1)
+             if item.Class == constants.olMail:
+                 msgstore_message = self.manager.message_store.GetMessage(item)
+                 ret.append(msgstore_message)
+ 
+         if len(ret) == 0:
+             win32ui.MessageBox("No mail items are selected", "No selection")
+             return None
+         if allow_multi:
+             return ret
+         return ret[0]
+ 
+     # The Outlook event handlers
+     def OnActivate(self):
+         if not self.have_setup_ui:
+             self.SetupUI()
+ 
+     def OnClose(self):
+         self.explorer_list.remove(self)
+         self.explorer_list = None
+         for button in self.buttons:
+             button.Close()
+         self.buttons = []
+         self.close() # disconnect events.
+ 
+     def OnFolderSwitch(self):
+         # Work out what folder we are in.
+         outlook_folder = self.CurrentFolder
+         show_delete_as = True
+         show_recover_as = False
+         try:
+             if outlook_folder is not None:
+                 mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
+                 look_id = self.manager.config.filter.spam_folder_id
+                 if look_id:
+                     look_folder = self.manager.message_store.GetFolder(look_id)
+                     if mapi_folder == look_folder:
+                         # This is the Spam folder - only show "recover"
+                         show_recover_as = True
+                         show_delete_as = False
+                 # Check if uncertain
+                 look_id = self.manager.config.filter.unsure_folder_id
+                 if look_id:
+                     look_folder = self.manager.message_store.GetFolder(look_id)
+                     if mapi_folder == look_folder:
+                         show_recover_as = True
+                         show_delete_as = True
+         except:
+             print "Error finding the MAPI folders for a folder switch event"
+             import traceback
+             traceback.print_exc()
+         self.but_recover_as.Visible = show_recover_as
+         self.but_delete_as.Visible = show_delete_as
+ 
+ # Events from our "Explorers" collection (not an Explorer instance)
+ class ExplorersEvent:
+     def Init(self, manager):
+         self.manager = manager
+         self.explorers = []
+ 
+     def Close(self):
+         self.explorers = None
+ 
+     def _DoNewExplorer(self, explorer, do_activate):
+         explorer = DispatchWithEvents(explorer, ExplorerWithEvents)
+         explorer.Init(self.manager, self.explorers)
+         if do_activate:
+             explorer.OnActivate()
+         self.explorers.append(explorer)
+ 
+     def OnNewExplorer(self, explorer):
+         self._DoNewExplorer(explorer, False)
+ 
  # The outlook Plugin COM object itself.
  class OutlookAddin:
***************
*** 420,424 ****
          self.folder_hooks = {}
          self.application = None
-         self.buttons = []
  
      def OnConnection(self, application, connectMode, addin, custom):
--- 524,527 ----
***************
*** 431,488 ****
          assert self.manager.addin is None, "Should not already have an addin"
          self.manager.addin = self
-         self.explorer_events = None
- 
-         # ActiveExplorer may be none when started without a UI (eg, WinCE synchronisation)
-         activeExplorer = application.ActiveExplorer()
-         if activeExplorer is not None:
-             bars = activeExplorer.CommandBars
-             toolbar = bars.Item("Standard")
-             # Add our "Delete as ..." and "Recover as" buttons
-             but_delete_as = button = toolbar.Controls.Add(
-                                     Type=constants.msoControlButton,
-                                     Temporary=True)
-             # Hook events for the item
-             button.BeginGroup = True
-             button = DispatchWithEvents(button, ButtonDeleteAsSpamEvent)
-             button.Init(self.manager, application)
-             self.buttons.append(button)
-             # And again for "Recover as"
-             but_recover_as = button = toolbar.Controls.Add(
-                                     Type=constants.msoControlButton,
-                                     Temporary=True)
-             button = DispatchWithEvents(button, ButtonRecoverFromSpamEvent)
-             self.buttons.append(button)
-             # Hook our explorer events, and pass the buttons.
-             button.Init(self.manager, application)
  
!             self.explorer_events = WithEvents(activeExplorer,
!                                                ExplorerEvent)
! 
!             self.explorer_events.Init(self.manager, application, but_delete_as, but_recover_as)
!             # And prime the event handler.
!             self.explorer_events.OnFolderSwitch()
! 
!             # The main tool-bar dropdown with all our entries.
!             # Add a pop-up menu to the toolbar
!             popup = toolbar.Controls.Add(
!                                 Type=constants.msoControlPopup,
!                                 Temporary=True)
!             popup.Caption="Anti-Spam"
!             popup.TooltipText = "Anti-Spam filters and functions"
!             popup.Enabled = True
!             # Convert from "CommandBarItem" to derived
!             # "CommandBarPopup" Not sure if we should be able to work
!             # this out ourselves, but no introspection I tried seemed
!             # to indicate we can.  VB does it via strongly-typed
!             # declarations.
!             popup = CastTo(popup, "CommandBarPopup")
!             # And add our children.
!             self._AddPopup(popup, manager.ShowManager, (self.manager,),
!                            Caption="Anti-Spam Manager...",
!                            TooltipText = "Show the Anti-Spam manager dialog.",
!                            Enabled = True)
!             self._AddPopup(popup, ShowClues, (self.manager, application),
!                            Caption="Show spam clues for current message",
!                            Enabled=True)
  
          self.FiltersChanged()
--- 534,546 ----
          assert self.manager.addin is None, "Should not already have an addin"
          self.manager.addin = self
  
!         explorers = application.Explorers
!         # and Explorers events so we know when new explorers spring into life.
!         self.explorers_events = WithEvents(explorers, ExplorersEvent)
!         self.explorers_events.Init(self.manager)
!         # And hook our UI elements to all existing explorers
!         for i in range(explorers.Count):
!             explorer = explorers.Item(i+1)
!             self.explorers_events._DoNewExplorer(explorer, True)
  
          self.FiltersChanged()
***************
*** 495,507 ****
                  traceback.print_exc()
  
-     def _AddPopup(self, parent, target, target_args, **item_attrs):
-         item = parent.Controls.Add(Type=constants.msoControlButton, Temporary=True)
-         # Hook events for the item
-         item = DispatchWithEvents(item, ButtonEvent)
-         item.Init(target, target_args)
-         for attr, val in item_attrs.items():
-             setattr(item, attr, val)
-         self.buttons.append(item)
- 
      def ProcessMissedMessages(self):
          # This could possibly spawn threads if it was too slow!
--- 553,556 ----
***************
*** 568,608 ****
          return new_hooks
  
-     def GetSelectedMessages(self, allow_multi = True, explorer = None):
-         if explorer is None:
-             explorer = self.application.ActiveExplorer()
-         sel = explorer.Selection
-         if sel.Count > 1 and not allow_multi:
-             win32ui.MessageBox("Please select a single item", "Large selection")
-             return None
- 
-         ret = []
-         for i in range(sel.Count):
-             item = sel.Item(i+1)
-             if item.Class == constants.olMail:
-                 msgstore_message = self.manager.message_store.GetMessage(item)
-                 ret.append(msgstore_message)
- 
-         if len(ret) == 0:
-             win32ui.MessageBox("No mail items are selected", "No selection")
-             return None
-         if allow_multi:
-             return ret
-         return ret[0]
- 
      def OnDisconnection(self, mode, custom):
          print "SpamAddin - Disconnecting from Outlook"
          self.folder_hooks = None
          self.application = None
          if self.manager is not None:
              self.manager.Save()
              self.manager.Close()
              self.manager = None
- 
-         if self.explorer_events is not None:
-             self.explorer_events = None
-         if self.buttons:
-             for button in self.buttons:
-                 button.Close()
-             self.buttons = None
  
          print "Addin terminating: %d COM client and %d COM servers exist." \
--- 617,629 ----
          return new_hooks
  
      def OnDisconnection(self, mode, custom):
          print "SpamAddin - Disconnecting from Outlook"
          self.folder_hooks = None
          self.application = None
+         self.explorers_events = None
          if self.manager is not None:
              self.manager.Save()
              self.manager.Close()
              self.manager = None
  
          print "Addin terminating: %d COM client and %d COM servers exist." \


From mhammond@users.sourceforge.net  Sat Nov 23 10:47:13 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 23 Nov 2002 02:47:13 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.37,1.38
Message-ID: <E18FXof-00039m-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv12048

Modified Files:
	addin.py 
Log Message:
Add comments I forgot to add while working out the best workaround!


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** addin.py	23 Nov 2002 10:34:24 -0000	1.37
--- addin.py	23 Nov 2002 10:47:10 -0000	1.38
***************
*** 453,456 ****
--- 453,457 ----
      # The Outlook event handlers
      def OnActivate(self):
+         # See comments for OnNewExplorer below.
          if not self.have_setup_ui:
              self.SetupUI()
***************
*** 510,513 ****
--- 511,518 ----
  
      def OnNewExplorer(self, explorer):
+         # NOTE - Outlook has a bug, as confirmed by many on Usenet, in
+         # that OnNewExplorer is too early to access the CommandBars
+         # etc elements. We hack around this by putting the logic in
+         # the first OnActivate call of the explorer itself.
          self._DoNewExplorer(explorer, False)
  
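The workaround described in the comment above (do nothing UI-related when the new-explorer event fires, and build the UI on the explorer's first activation) can be sketched without any COM machinery. This is only an illustration: FakeExplorer, ExplorerHandler and the command_bars_ready flag are hypothetical stand-ins, not part of addin.py or the Outlook object model.

```python
class FakeExplorer:
    """Hypothetical stand-in for an Outlook Explorer object."""

    def __init__(self):
        # Models the reported problem: the toolbars cannot be touched
        # at the moment the "new explorer" event fires.
        self.command_bars_ready = False


class ExplorerHandler:
    """Defers UI construction until the first activation."""

    def __init__(self, explorer):
        self.explorer = explorer
        self.have_setup_ui = False
        self.setup_calls = 0

    def setup_ui(self):
        if not self.explorer.command_bars_ready:
            raise RuntimeError("toolbars not available yet")
        self.setup_calls += 1
        self.have_setup_ui = True

    def on_new_explorer(self):
        # Too early to build toolbars, so deliberately do nothing here.
        pass

    def on_activate(self):
        # The first activation performs the deferred setup; later
        # activations see the flag and skip it.
        if not self.have_setup_ui:
            self.setup_ui()


explorer = FakeExplorer()
handler = ExplorerHandler(explorer)
handler.on_new_explorer()            # nothing is built yet
explorer.command_bars_ready = True   # by activation time the UI is usable
handler.on_activate()
handler.on_activate()                # second activation is a no-op
```

The have_setup_ui flag is what keeps repeated activations from adding the buttons more than once.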

From mhammond@users.sourceforge.net  Sat Nov 23 12:00:06 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 23 Nov 2002 04:00:06 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.34,1.35
Message-ID: <E18FYxC-00082D-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv30369

Modified Files:
	msgstore.py 
Log Message:
PR_CONTENT_UNREAD is documented as optional, and doesn't always work with
Exchange stores - so nuke it completely.


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** msgstore.py	23 Nov 2002 06:45:43 -0000	1.34
--- msgstore.py	23 Nov 2002 12:00:02 -0000	1.35
***************
*** 229,242 ****
          else:
              message_id = self.NormalizeID(message_id)
!         prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
          mapi_object = self._OpenEntry(message_id)
          hr, data = mapi_object.GetProps(prop_ids,0)
          folder_eid = data[0][1]
          searchkey = data[1][1]
!         unread = data[2][1]
          folder_id = message_id[0], folder_eid
          folder = MAPIMsgStoreFolder(self, folder_id,
                                      "Unknown - temp message", -1)
!         return  MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
  
  _MapiTypeMap = {
--- 229,242 ----
          else:
              message_id = self.NormalizeID(message_id)
!         prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_MESSAGE_FLAGS
          mapi_object = self._OpenEntry(message_id)
          hr, data = mapi_object.GetProps(prop_ids,0)
          folder_eid = data[0][1]
          searchkey = data[1][1]
!         flags = data[2][1]
          folder_id = message_id[0], folder_eid
          folder = MAPIMsgStoreFolder(self, folder_id,
                                      "Unknown - temp message", -1)
!         return  MAPIMsgStoreMsg(self, folder, message_id, searchkey, flags)
  
  _MapiTypeMap = {
***************
*** 289,293 ****
                          (PR_MESSAGE_CLASS_A, "IPM.Note"))) # with this value
          table.Restrict(restriction, 0)
!         prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
          table.SetColumns(prop_ids, 0)
          while 1:
--- 289,293 ----
                          (PR_MESSAGE_CLASS_A, "IPM.Note"))) # with this value
          table.Restrict(restriction, 0)
!         prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_MESSAGE_FLAGS
          table.SetColumns(prop_ids, 0)
          while 1:
***************
*** 310,317 ****
          field_id = PROP_TAG( PT_I4, PROP_ID(resolve_ids[0]))
          # Setup the properties we want to read.
!         prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
          table.SetColumns(prop_ids, 0)
          # Set up the restriction
!         # Need to check message-flags - PR_CONTENT_UNREAD "optional"
          prop_restriction = (mapi.RES_BITMASK,   # a bitmask restriction
                                 (mapi.BMR_EQZ,      # when bit is clear
--- 310,319 ----
          field_id = PROP_TAG( PT_I4, PROP_ID(resolve_ids[0]))
          # Setup the properties we want to read.
!         prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_MESSAGE_FLAGS
          table.SetColumns(prop_ids, 0)
          # Set up the restriction
!         # Need to check message-flags
!         # (PR_CONTENT_UNREAD is optional, and somewhat unreliable
!         # PR_MESSAGE_FLAGS & MSGFLAG_READ is the official way)
          prop_restriction = (mapi.RES_BITMASK,   # a bitmask restriction
                                 (mapi.BMR_EQZ,      # when bit is clear
***************
*** 340,344 ****
  
  class MAPIMsgStoreMsg(MsgStoreMsg):
!     def __init__(self, msgstore, folder, entryid, searchkey, unread):
          self.folder = folder
          self.msgstore = msgstore
--- 342,346 ----
  
  class MAPIMsgStoreMsg(MsgStoreMsg):
!     def __init__(self, msgstore, folder, entryid, searchkey, flags):
          self.folder = folder
          self.msgstore = msgstore
***************
*** 352,356 ****
          # Thus, searchkey is the only reliable long-lived message key.
          self.searchkey = searchkey
!         self.unread = unread
          self.dirty = False
  
--- 354,359 ----
          # Thus, searchkey is the only reliable long-lived message key.
          self.searchkey = searchkey
!         self.flags = flags
!         self.unread = flags & MSGFLAG_READ == 0
          self.dirty = False
  
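The read/unread test introduced in this change is a plain bitmask check on the message flags. A minimal sketch, assuming MSGFLAG_READ is the low bit (0x1); the constant is defined locally here for illustration rather than imported from the MAPI bindings, and is_unread is a hypothetical helper name:

```python
# Assumed value of the MAPI "message has been read" bit, defined
# locally so the sketch runs without the win32 extensions.
MSGFLAG_READ = 0x00000001


def is_unread(flags):
    """Return True when the read bit is clear in a PR_MESSAGE_FLAGS value."""
    # In Python, & binds tighter than ==, so the parentheses are
    # optional; they are kept for readers used to C precedence.
    return (flags & MSGFLAG_READ) == 0


print(is_unread(0x0))   # no bits set
print(is_unread(0x3))   # read bit plus one other bit
```

Other bits in the flags value do not affect the result, which is why the whole flags word can be stored and the unread state derived from it.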

From mhammond@users.sourceforge.net  Sat Nov 23 12:07:17 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 23 Nov 2002 04:07:17 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/sandbox delete_outlook_field.py,1.5,1.6
Message-ID: <E18FZ49-0002Po-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory sc8-pr-cvs1:/tmp/cvs-serv9234

Modified Files:
	delete_outlook_field.py 
Log Message:
Fix error from previous checkin


Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** delete_outlook_field.py	23 Nov 2002 02:57:46 -0000	1.5
--- delete_outlook_field.py	23 Nov 2002 12:07:15 -0000	1.6
***************
*** 176,180 ****
                      print "Deleted", num, "field instances via MAPI"
                  if do_folder:
!                     num = DeleteField_Folder(folder, field_name)
                      if num:
                          print "Deleted property from folder"
--- 176,180 ----
                      print "Deleted", num, "field instances via MAPI"
                  if do_folder:
!                     num = DeleteField_Folder(driver, folder, field_name)
                      if num:
                          print "Deleted property from folder"


From npickett@users.sourceforge.net  Sat Nov 23 21:25:19 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sat, 23 Nov 2002 13:25:19 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.8,1.53.2.9
 hammie.py,1.40.2.4,1.40.2.5 hammiebulk.py,1.1.2.2,1.1.2.3
Message-ID: <E18FhmB-0007e0-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv27728

Modified Files:
      Tag: hammie-playground
	classifier.py hammie.py hammiebulk.py 
Log Message:
* Added PICKLE_VERSION to MetaInfo class, so dbdict can have version
  checking too
* Moved update_probability out of WordInfo and back into Classifier
  class.  Classifier now caches probabilities in the Popiel/Hooft
  method ;)
* hammie.py now works exactly like it did before the branch


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.8
retrieving revision 1.53.2.9
diff -C2 -d -r1.53.2.8 -r1.53.2.9
*** classifier.py	22 Nov 2002 23:50:18 -0000	1.53.2.8
--- classifier.py	23 Nov 2002 21:25:16 -0000	1.53.2.9
***************
*** 47,51 ****
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 1
  
  class MetaInfo(object):
--- 47,51 ----
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 3
  
  class MetaInfo(object):
***************
*** 53,62 ****
  
      Contains nham and nspam, used for calculating probabilities.  Also
!     has a revision, incremented every time nham or nspam is adjusted to
!     invalidate any cached probabilities.
  
      """
      def __init__(self):
!         self.__setstate__((0, 0))
  
      def __repr__(self):
--- 53,62 ----
  
      Contains nham and nspam, used for calculating probabilities.  Also
!     has a revision, incremented every time nham or nspam is adjusted.
!     Nothing uses this, currently, but it's there if you want it.
  
      """
      def __init__(self):
!         self.__setstate__((PICKLE_VERSION, 0, 0))
  
      def __repr__(self):
***************
*** 66,73 ****
  
      def __getstate__(self):
!         return (self._nham, self._nspam)
  
      def __setstate__(self, t):
!         (self._nham, self._nspam) = t
          self.revision = 0
  
--- 66,75 ----
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self._nham, self._nspam)
  
      def __setstate__(self, t):
!         if t[0] != PICKLE_VERSION:
!             raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         (self._nham, self._nspam) = t[1:]
          self.revision = 0
  
***************
*** 112,199 ****
          self.revision = None
  
-     def _update_probability(self, meta):
-         """Compute and store p(word) = prob(msg is spam | msg contains word).
- 
-         This is the Graham calculation, but stripped of biases, and
-         stripped of clamping into 0.01 thru 0.99.  The Bayesian
-         adjustment following keeps them in a sane range, and one
-         that naturally grows the more evidence there is to back up
-         a probability.
- 
-         Returns True if the probability changed, False otherwise.
-         """
- 
-         nham = float(meta.nham or 1)
-         nspam = float(meta.nspam or 1)
- 
-         assert self.hamcount <= nham
-         hamratio = self.hamcount / nham
- 
-         assert self.spamcount <= nspam
-         spamratio = self.spamcount / nspam
- 
-         prob = spamratio / (hamratio + spamratio)
- 
-         if options.experimental_ham_spam_imbalance_adjustment:
-             spam2ham = min(nspam / nham, 1.0)
-             ham2spam = min(nham / nspam, 1.0)
-         else:
-             spam2ham = ham2spam = 1.0
- 
-         S = options.unknown_word_strength
-         StimesX = S * options.unknown_word_prob
- 
- 
-         # Now do Robinson's Bayesian adjustment.
-         #
-         #         s*x + n*p(w)
-         # f(w) = --------------
-         #           s + n
-         #
-         # I find this easier to reason about like so (equivalent when
-         # s != 0):
-         #
-         #        x - p
-         #  p +  -------
-         #       1 + n/s
-         #
-         # IOW, it moves p a fraction of the distance from p to x, and
-         # less so the larger n is, or the smaller s is.
- 
-         # Experimental:
-         # Picking a good value for n is interesting:  how much empirical
-         # evidence do we really have?  If nham == nspam,
-         # hamcount + spamcount makes a lot of sense, and the code here
-         # does that by default.
-         # But if, e.g., nham is much larger than nspam, p(w) can get a
-         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
-         # strong ham words (high hamcount) much stronger than strong
-         # spam words (high spamcount), and that makes the accidental
-         # appearance of a strong ham word in spam much more damaging than
-         # the accidental appearance of a strong spam word in ham.
-         # So we don't give hamcount full credit when nham > nspam (or
-         # spamcount when nspam > nham):  instead we knock hamcount down
-         # to what it would have been had nham been equal to nspam.  IOW,
-         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
-         # we don't "believe" any count to an extent more than
-         # min(nspam, nham) justifies.
- 
-         n = self.hamcount * spam2ham  +  self.spamcount * ham2spam
-         prob = (StimesX + n * prob) / (S + n)
- 
-         self.revision = meta.revision
- 
-         if self.spamprob != prob:
-             self.spamprob = prob
-             return True
-         else:
-             return False
- 
-     def probability(self, meta):
-         """Return this word's spam probability, recalculating if needed."""
-         if meta.revision != self.revision:
-             self._update_probability(meta)
-         return self.spamprob
- 
  
  class Classifier:
--- 114,117 ----
***************
*** 415,419 ****
          """
  
-         self.probcache = {}    # nuke the prob cache
          self._add_msg(wordstream, is_spam)
  
--- 333,336 ----
***************
*** 423,447 ****
          Pass the same arguments you passed to learn().
          """
-         self.probcache = {}    # nuke the prob cache
          self._remove_msg(wordstream, is_spam)
  
      def update_probabilities(self):
          """Update the word probabilities in the spam database.
  
          This computes a new probability for every word in the database,
!         so can be expensive.  learn() and unlearn() update the probabilities
!         each time by default.  Thay have an optional argument that allows
!         to skip this step when feeding in many messages, and in that case
!         you should call update_probabilities() after feeding the last
!         message and before calling spamprob().
! 
!         You probably don't need to call this, since probabilities are
!         automatically updated.
          """
  
          for word, record in self.wordinfo.iteritems():
!             # This method updates probability iff the metainfo revision
!             # has changed.
!             record.probability(self.meta)
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
--- 340,440 ----
          Pass the same arguments you passed to learn().
          """
          self._remove_msg(wordstream, is_spam)
  
+     def probability(self, record):
+         """Compute, store, and return prob(msg is spam | msg contains word).
+ 
+         This is the Graham calculation, but stripped of biases, and
+         stripped of clamping into 0.01 thru 0.99.  The Bayesian
+         adjustment following keeps them in a sane range, and one
+         that naturally grows the more evidence there is to back up
+         a probability.
+         """
+ 
+         spamcount = record.spamcount
+         hamcount = record.hamcount
+         
+         # Try the cache first
+         try:
+             return self.probcache[(spamcount, hamcount)]
+         except:
+             pass
+ 
+         nham = float(self.meta.nham or 1)
+         nspam = float(self.meta.nspam or 1)
+ 
+         assert hamcount <= nham
+         hamratio = hamcount / nham
+ 
+         assert spamcount <= nspam
+         spamratio = spamcount / nspam
+ 
+         prob = spamratio / (hamratio + spamratio)
+ 
+         if options.experimental_ham_spam_imbalance_adjustment:
+             spam2ham = min(nspam / nham, 1.0)
+             ham2spam = min(nham / nspam, 1.0)
+         else:
+             spam2ham = ham2spam = 1.0
+ 
+         S = options.unknown_word_strength
+         StimesX = S * options.unknown_word_prob
+ 
+ 
+         # Now do Robinson's Bayesian adjustment.
+         #
+         #         s*x + n*p(w)
+         # f(w) = --------------
+         #           s + n
+         #
+         # I find this easier to reason about like so (equivalent when
+         # s != 0):
+         #
+         #        x - p
+         #  p +  -------
+         #       1 + n/s
+         #
+         # IOW, it moves p a fraction of the distance from p to x, and
+         # less so the larger n is, or the smaller s is.
+ 
+         # Experimental:
+         # Picking a good value for n is interesting:  how much empirical
+         # evidence do we really have?  If nham == nspam,
+         # hamcount + spamcount makes a lot of sense, and the code here
+         # does that by default.
+         # But if, e.g., nham is much larger than nspam, p(w) can get a
+         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
+         # strong ham words (high hamcount) much stronger than strong
+         # spam words (high spamcount), and that makes the accidental
+         # appearance of a strong ham word in spam much more damaging than
+         # the accidental appearance of a strong spam word in ham.
+         # So we don't give hamcount full credit when nham > nspam (or
+         # spamcount when nspam > nham):  instead we knock hamcount down
+         # to what it would have been had nham been equal to nspam.  IOW,
+         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
+         # we don't "believe" any count to an extent more than
+         # min(nspam, nham) justifies.
+ 
+         n = hamcount * spam2ham  +  spamcount * ham2spam
+         prob = (StimesX + n * prob) / (S + n)
+ 
+         # Update the cache
+         self.probcache[(spamcount, hamcount)] = prob
+ 
+         return prob
+ 
      def update_probabilities(self):
          """Update the word probabilities in the spam database.
  
          This computes a new probability for every word in the database,
!         which can be expensive.  learn() and unlearn() clear the
!         probability cache each time by default, and that will be rebuilt
!         as probabilities are looked up.  If for some reason you need to
!         update all the probabilities in one step (say, for
!         benchmarking), you can call this method.
          """
  
          for word, record in self.wordinfo.iteritems():
!             self.probability(record)
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
***************
*** 466,469 ****
--- 459,464 ----
      # to exploit it.
      def _add_msg(self, wordstream, is_spam):
+         print "Nuking the prob cache"
+         self.probcache = {}    # nuke the prob cache
          if is_spam:
              self.meta.nspam += 1
***************
*** 488,491 ****
--- 483,487 ----
  
      def _remove_msg(self, wordstream, is_spam):
+         self.probcache = {}    # nuke the prob cache
          if is_spam:
              if self.meta.nspam <= 0:
***************
*** 539,565 ****
          return [t[1:] for t in clues]
  
-     def probability(self, word):
-         """Look up words (spamcount, hamcount) in the prob cache"""
- 
-         # Dictionary of dictionaries is used here for efficiency
- 
-         h = word.hamcount
-         s = word.spamcount
- 
-         try:
-             return self.probcache[h][s]
-         except (KeyError, TypeError):
-             pass
- 
-         # populate the cache, so this calculation won't have to be done again
-         try:
-             self.probcache[h]
-         except KeyError:
-             self.probcache[h] = {}
- 
-         word.probability(self.meta)
-         self.probcache[h][s] = word.spamprob
  
!         return word.spamprob
! 
! Bayes = Classifier
\ No newline at end of file
--- 535,538 ----
          return [t[1:] for t in clues]
  
  
! Bayes = Classifier
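
The version check added to MetaInfo above follows a common pickling pattern: put a version number at the front of the saved state and refuse to load anything else. A minimal sketch in modern Python; the attribute names nham and nspam are simplified from the _nham/_nspam used in the diff, and the class is a reduced illustration rather than the real MetaInfo:

```python
import pickle

PICKLE_VERSION = 3


class MetaInfo:
    """Reduced illustration of version-tagged pickle state."""

    def __init__(self):
        self.__setstate__((PICKLE_VERSION, 0, 0))

    def __getstate__(self):
        # The version travels with the data as the first tuple element.
        return (PICKLE_VERSION, self.nham, self.nspam)

    def __setstate__(self, t):
        if t[0] != PICKLE_VERSION:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
        (self.nham, self.nspam) = t[1:]


meta = MetaInfo()
meta.nham, meta.nspam = 5, 7
copy = pickle.loads(pickle.dumps(meta))   # round trip goes through both hooks

try:
    # State saved by a hypothetical older version is rejected.
    MetaInfo.__new__(MetaInfo).__setstate__((1, 5, 7))
except ValueError as exc:
    print(exc)
```

Rejecting unknown versions outright is the simplest policy; converting older state would need a separate branch per old version.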

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40.2.4
retrieving revision 1.40.2.5
diff -C2 -d -r1.40.2.4 -r1.40.2.5
*** hammie.py	22 Nov 2002 03:00:36 -0000	1.40.2.4
--- hammie.py	23 Nov 2002 21:25:17 -0000	1.40.2.5
***************
*** 185,186 ****
--- 185,192 ----
      return Hammie(b)
  
+ 
+ if __name__ == "__main__":
+     # Everybody's used to running hammie.py.  Why mess with success?  ;)
+     import hammiebulk
+ 
+     hammiebulk.main()

Index: hammiebulk.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Attic/hammiebulk.py,v
retrieving revision 1.1.2.2
retrieving revision 1.1.2.3
diff -C2 -d -r1.1.2.2 -r1.1.2.3
*** hammiebulk.py	22 Nov 2002 03:00:45 -0000	1.1.2.2
--- hammiebulk.py	23 Nov 2002 21:25:17 -0000	1.1.2.3
***************
*** 73,78 ****
      for msg in mbox:
          i += 1
-         # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
-         # back in the day.  Maybe it's a line-printer-ism ;)
          sys.stdout.write("\r%6d" % i)
          sys.stdout.flush()
--- 73,76 ----

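The probability calculation moved into Classifier in this check-in can be reduced to a few lines once the imbalance adjustment is left out. The sketch below is an illustration under stated assumptions, not the project's code: ProbCache and note_message are hypothetical names, the defaults s=0.45 and x=0.5 are placeholder values for unknown_word_strength and unknown_word_prob, the guard for a word with no counts is an addition, and the bare except of the diff is narrowed to KeyError.

```python
class ProbCache:
    """Word spam probability with a (spamcount, hamcount) cache."""

    def __init__(self, s=0.45, x=0.5):
        self.nham = 0
        self.nspam = 0
        self.s = s          # strength given to the prior (assumed default)
        self.x = x          # prior probability for unknown words (assumed)
        self.cache = {}

    def note_message(self, is_spam):
        # Any change to nham or nspam invalidates every cached value.
        self.cache = {}
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def probability(self, spamcount, hamcount):
        key = (spamcount, hamcount)
        try:
            return self.cache[key]
        except KeyError:
            pass

        n = spamcount + hamcount
        if n == 0:
            # Added guard: no evidence at all, so fall back to the prior.
            prob = self.x
        else:
            hamratio = hamcount / float(self.nham or 1)
            spamratio = spamcount / float(self.nspam or 1)
            p = spamratio / (hamratio + spamratio)
            # Robinson's adjustment: f(w) = (s*x + n*p(w)) / (s + n)
            prob = (self.s * self.x + n * p) / (self.s + n)

        self.cache[key] = prob
        return prob


pc = ProbCache()
for _ in range(10):
    pc.note_message(True)
    pc.note_message(False)
print(pc.probability(5, 0))   # seen only in spam: close to, but below, 1.0
print(pc.probability(3, 3))   # equal evidence on both sides: 0.5
```

Because the result depends only on the two counts and on nham/nspam, every word with the same counts shares one cache entry, which is what makes keying the cache on the counts worthwhile.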

From mhammond@users.sourceforge.net  Sat Nov 23 21:35:25 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 23 Nov 2002 13:35:25 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/dialogs FilterDialog.py,1.11,1.12
Message-ID: <E18Fhvx-0000Qp-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory sc8-pr-cvs1:/tmp/cvs-serv1562

Modified Files:
	FilterDialog.py 
Log Message:
Disable a couple of extra controls while filtering.


Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** FilterDialog.py	7 Nov 2002 22:30:10 -0000	1.11
--- FilterDialog.py	23 Nov 2002 21:35:23 -0000	1.12
***************
*** 304,308 ****
          [BUTTON,         'Close',               win32con.IDCANCEL,   (187,161,50,14), csts | win32con.BS_PUSHBUTTON],
      ]
!     disable_while_running_ids = [IDC_BUT_UNSEEN, IDC_BUT_UNREAD, IDC_BROWSE, win32con.IDCANCEL]
  
      def __init__(self, mgr, filterer):
--- 304,310 ----
          [BUTTON,         'Close',               win32con.IDCANCEL,   (187,161,50,14), csts | win32con.BS_PUSHBUTTON],
      ]
!     disable_while_running_ids = [IDC_BUT_UNSEEN, IDC_BUT_UNREAD,
!                                  IDC_BROWSE, win32con.IDCANCEL,
!                                  IDC_BUT_ACT_SCORE, IDC_BUT_ACT_SCORE]
  
      def __init__(self, mgr, filterer):


From npickett@users.sourceforge.net  Sat Nov 23 23:57:24 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sat, 23 Nov 2002 15:57:24 -0800
Subject: [Spambayes-checkins] spambayes Persistent.py,1.1.2.1,1.1.2.2
 classifier.py,1.53.2.9,1.53.2.10 hammiebulk.py,1.1.2.3,1.1.2.4
Message-ID: <E18Fk9M-0007Op-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv26743

Modified Files:
      Tag: hammie-playground
	Persistent.py classifier.py hammiebulk.py 
Log Message:
* Persistent.py no longer depends on Corpus.py (trying to reduce
  interdependencies).  However, the debug messages are a minor
  problem.  Maybe they should just be taken out?
* Tim Stone says a dict of dicts of ints is faster than a dict of
  tuple of ints, and I believe him, so classifier.probcache does
  that now
* hammie (and hammiebulk) now require either -D or -d (no default
  value).  I suspect people are using hammie.py to create a
  database, and the best choice of database store depends on how
  you're going to use it.


Index: Persistent.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Attic/Persistent.py,v
retrieving revision 1.1.2.1
retrieving revision 1.1.2.2
diff -C2 -d -r1.1.2.1 -r1.1.2.2
*** Persistent.py	22 Nov 2002 00:25:41 -0000	1.1.2.1
--- Persistent.py	23 Nov 2002 23:57:22 -0000	1.1.2.2
***************
*** 40,45 ****
      o Would Trainer.trainall really want to train with the whole corpus,
          or just a random subset?
-     o Corpus.Verbose is a bit of a strange thing to have.  Verbose
-         should be in the global namespace, but how do you get it there?
      o Suggestions?
  
--- 40,43 ----
***************
*** 54,58 ****
  all the spambayes contributors."
  
- import Corpus
  import classifier
  from Options import options
--- 52,55 ----
***************
*** 112,116 ****
          # that pickle does its job
  
!         if Corpus.Verbose:
              print 'Loading state from',self.db_name,'pickle'
  
--- 109,113 ----
          # that pickle does its job
  
!         if __debug__:
              print 'Loading state from',self.db_name,'pickle'
  
***************
*** 129,138 ****
              self.meta.nspam = tempbayes.get_nspam()
  
!             if Corpus.Verbose:
                  print '%s is an existing pickle, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new pickle
!             if Corpus.Verbose:
                  print self.db_name,'is a new pickle'
              self.wordinfo = {}
--- 126,135 ----
              self.meta.nspam = tempbayes.get_nspam()
  
!             if __debug__:
                  print '%s is an existing pickle, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new pickle
!             if __debug__:
                  print self.db_name,'is a new pickle'
              self.wordinfo = {}
***************
*** 143,147 ****
          '''Store self as a pickle'''
  
!         if Corpus.Verbose:
              print 'Persisting',self.db_name,'as a pickle'
  
--- 140,144 ----
          '''Store self as a pickle'''
  
!         if __debug__:
              print 'Persisting',self.db_name,'as a pickle'
  
***************
*** 172,176 ****
          '''Load state from WIDict'''
  
!         if Corpus.Verbose:
              print 'Loading state from',self.db_name,'WIDict'
  
--- 169,173 ----
          '''Load state from WIDict'''
  
!         if __debug__:
              print 'Loading state from',self.db_name,'WIDict'
  
***************
*** 183,192 ****
              self.set_nspam(nspam)
  
!             if Corpus.Verbose:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
!             if Corpus.Verbose:
                  print self.db_name,'is a new DBDict'
              self.set_nham(0)
--- 180,189 ----
              self.set_nspam(nspam)
  
!             if __debug__:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
!             if __debug__:
                  print self.db_name,'is a new DBDict'
              self.set_nham(0)
***************
*** 196,200 ****
          '''Place state into persistent store'''
  
!         if Corpus.Verbose:
              print 'Persisting',self.db_name,'state in WIDict'
  
--- 193,197 ----
          '''Place state into persistent store'''
  
!         if __debug__:
              print 'Persisting',self.db_name,'state in WIDict'
  
***************
*** 207,216 ****
      is an observer of the corpora'''
  
!     def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
!         '''Constructor(Classifier, \
!             Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
  
          self.bayes = bayes
!         self.trainertype = trainertype
          self.updateprobs = updateprobs
  
--- 204,212 ----
      is an observer of the corpora'''
  
!     def __init__(self, bayes, is_spam, updateprobs=NO_UPDATEPROBS):
!         '''Constructor(Classifier, is_spam(True|False), updprobs(True|False)'''
  
          self.bayes = bayes
!         self.is_spam = is_spam
          self.updateprobs = updateprobs
  
***************
*** 223,231 ****
          '''Train the database with the message'''
  
!         if Corpus.Verbose:
              print 'training with',message.key()
  
!         self.bayes.learn(message.tokenize(), \
!                          self.trainertype)
  #                         self.updateprobs)
  
--- 219,226 ----
          '''Train the database with the message'''
  
!         if __debug__:
              print 'training with',message.key()
  
!         self.bayes.learn(message.tokenize(), self.is_spam)
  #                         self.updateprobs)
  
***************
*** 238,246 ****
          '''Untrain the database with the message'''
  
!         if Corpus.Verbose:
              print 'untraining with',message.key()
  
!         self.bayes.unlearn(message.tokenize(), \
!                            self.trainertype)
  #                           self.updateprobs)
          # can raise ValueError if database is fouled.  If this is the case,
--- 233,240 ----
          '''Untrain the database with the message'''
  
!         if __debug__:
              print 'untraining with',message.key()
  
!         self.bayes.unlearn(message.tokenize(), self.is_spam)
  #                           self.updateprobs)
          # can raise ValueError if database is fouled.  If this is the case,
***************
*** 266,270 ****
          '''Constructor'''
  
!         Trainer.__init__(self, bayes, Corpus.SPAM, updateprobs)
  
  
--- 260,264 ----
          '''Constructor'''
  
!         Trainer.__init__(self, bayes, True, updateprobs)
  
  
***************
*** 275,279 ****
          '''Constructor'''
  
!         Trainer.__init__(self, bayes, Corpus.HAM, updateprobs)
  
  
--- 269,273 ----
          '''Constructor'''
  
!         Trainer.__init__(self, bayes, False, updateprobs)
  
  
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.9
retrieving revision 1.53.2.10
diff -C2 -d -r1.53.2.9 -r1.53.2.10
*** classifier.py	23 Nov 2002 21:25:16 -0000	1.53.2.9
--- classifier.py	23 Nov 2002 23:57:22 -0000	1.53.2.10
***************
*** 357,362 ****
          # Try the cache first
          try:
!             return self.probcache[(spamcount, hamcount)]
!         except:
              pass
  
--- 357,362 ----
          # Try the cache first
          try:
!             return self.probcache[spamcount][hamcount]
!         except KeyError:
              pass
  
***************
*** 420,424 ****
  
          # Update the cache
!         self.probcache[(spamcount, hamcount)] = prob
  
          return prob
--- 420,427 ----
  
          # Update the cache
!         try:
!             self.probcache[spamcount][hamcount] = prob
!         except KeyError:
!             self.probcache[spamcount] = {hamcount: prob}
  
          return prob
***************
*** 459,463 ****
      # to exploit it.
      def _add_msg(self, wordstream, is_spam):
-         print "Nuking the prob cache"
          self.probcache = {}    # nuke the prob cache
          if is_spam:
--- 462,465 ----

Index: hammiebulk.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Attic/hammiebulk.py,v
retrieving revision 1.1.2.3
retrieving revision 1.1.2.4
diff -C2 -d -r1.1.2.3 -r1.1.2.4
*** hammiebulk.py	23 Nov 2002 21:25:17 -0000	1.1.2.3
--- hammiebulk.py	23 Nov 2002 23:57:22 -0000	1.1.2.4
***************
*** 1,9 ****
  #! /usr/bin/env python
  
! """Usage: %(program)s [options]
  
  Where:
      -h
          show usage and exit
      -g PATH
          mbox or directory of known good messages (non-spam) to train on.
--- 1,27 ----
  #! /usr/bin/env python
  
! """Usage: %(program)s [-D|-d] [options]
  
  Where:
      -h
          show usage and exit
+     -d
+         use the DBM store.  A DBM file is larger than the pickle and
+         creating it is slower, but loading it is much faster,
+         especially for large word databases.  Recommended for use with
+         hammiefilter or any procmail-based filter.
+     -D
+         use the pickle store.  A pickle is smaller and faster to create,
+         but much slower to load.  Recommended for use with pop3proxy and
+         hammiesrv.
+     -p FILE
+         use file as the persistent store.  loads data from this file if it
+         exists, and saves data to this file at the end.
+         Default: %(DEFAULTDB)s
+ 
+     -f
+         run as a filter: read a single message from stdin, add a new
+         header, and write it to stdout.  If you want to run from
+         procmail, this is your option.
      -g PATH
          mbox or directory of known good messages (non-spam) to train on.
***************
*** 18,35 ****
          reverse the meaning of the check (report ham instead of spam).
          Only meaningful with the -u option.
-     -p FILE
-         use file as the persistent store.  loads data from this file if it
-         exists, and saves data to this file at the end.
-         Default: %(DEFAULTDB)s
-     -d
-         use the DBM store instead of cPickle.  The file is larger and
-         creating it is slower, but checking against it is much faster,
-         especially for large word databases. Default: %(USEDB)s
-     -D
-         the reverse of -d: use the cPickle instead of DBM
-     -f
-         run as a filter: read a single message from stdin, add a new
-         header, and write it to stdout.  If you want to run from
-         procmail, this is your option.
  """
  
--- 36,39 ----
***************
*** 59,65 ****
  DEFAULTDB = os.path.expanduser(options.hammiefilter_persistent_storage_file)
  
- # Use a database? If False, use a pickle
- USEDB = options.hammiefilter_persistent_use_database
- 
  # Probability at which a message is considered spam
  SPAM_THRESHOLD = options.spam_cutoff
--- 63,66 ----
***************
*** 138,142 ****
      reverse = 0
      do_filter = False
!     usedb = USEDB
      mode = 'r'
      for opt, arg in opts:
--- 139,143 ----
      reverse = 0
      do_filter = False
!     usedb = None
      mode = 'r'
      for opt, arg in opts:
***************
*** 163,166 ****
--- 164,170 ----
      if args:
          usage(2, "Positional arguments not allowed")
+ 
+     if usedb == None:
+         usage(2, "Must specify one of -d or -D")
  
      save = False
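
A minimal sketch of the cache change made in the classifier.py hunks earlier in this checkin, for illustration only.  The probability cache goes from one dict keyed by a (spamcount, hamcount) tuple to a dict of dicts indexed first by spamcount and then by hamcount, and the bare "except:" is narrowed to "except KeyError:".  The class name ProbCache, the computed counter and the placeholder formula in _compute() are assumptions made for this example and are not part of classifier.py.  The sketch is written for Python 3, not the Python 2 of the checkin.

```python
class ProbCache:
    """Hypothetical stand-in for the classifier's probability cache."""

    def __init__(self):
        self.probcache = {}   # {spamcount: {hamcount: prob}}
        self.computed = 0     # hypothetical counter, for demonstration only

    def probability(self, spamcount, hamcount):
        # Try the cache first.  Only a missing key counts as a miss;
        # any other exception propagates instead of being swallowed.
        try:
            return self.probcache[spamcount][hamcount]
        except KeyError:
            pass

        prob = self._compute(spamcount, hamcount)

        # Update the cache, creating the inner dict on first use.
        try:
            self.probcache[spamcount][hamcount] = prob
        except KeyError:
            self.probcache[spamcount] = {hamcount: prob}
        return prob

    def _compute(self, spamcount, hamcount):
        # Placeholder ratio, NOT the Spambayes probability formula.
        self.computed += 1
        return (spamcount + 1) / (spamcount + hamcount + 2)


cache = ProbCache()
first = cache.probability(3, 1)    # miss: computed and stored
second = cache.probability(3, 1)   # hit: served from the nested dict
```

The second call returns the stored value without calling _compute() again.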


From tim_one@users.sourceforge.net  Sun Nov 24 07:41:05 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 23 Nov 2002 23:41:05 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.69,1.70
Message-ID: <E18FrO5-0004SO-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv16042

Modified Files:
	tokenizer.py 
Log Message:
Revamped the "look for special things and get rid of them" body
tokenization code, making most of this work thru a common new Stripper
class.  Moved the <style and HTML comment stripping into that framework,
so that re stack blowups should never happen again.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.69
retrieving revision 1.70
diff -C2 -d -r1.69 -r1.70
*** tokenizer.py	19 Nov 2002 02:13:00 -0000	1.69
--- tokenizer.py	24 Nov 2002 07:41:03 -0000	1.70
***************
*** 611,643 ****
                        msg.walk()))
  
- url_re = re.compile(r"""
-     (https? | ftp)  # capture the protocol
-     ://             # skip the boilerplate
-     # Do a reasonable attempt at detecting the end.  It may or may not
-     # be in HTML, may or may not be in quotes, etc.  If it's full of %
-     # escapes, cool -- that's a clue too.
-     ([^\s<>"'\x7f-\xff]+)  # capture the guts
- """, re.VERBOSE)                        # '
- 
- urlsep_re = re.compile(r"[;?:@&=+,$.]")
- 
  has_highbit_char = re.compile(r"[\x80-\xff]").search
  
  # Cheap-ass gimmick to probabilistically find HTML/XML tags.
  html_re = re.compile(r"""
      <
      (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
!     (?:
!         # style sheets can be very long
!         style\b     # maybe it's <style>, or maybe <style type=...>, etc.
!         .{0,2048}?
!         </style
!     |   # so can comments
!         !--
!         .{0,2048}?
!         --
!     |   # guessing that other tags are usually "short"
!         [^>]{0,256} # search for the end '>', but don't run wild
!     )
      >
  """, re.VERBOSE | re.DOTALL)
--- 611,625 ----
                        msg.walk()))
  
  has_highbit_char = re.compile(r"[\x80-\xff]").search
  
  # Cheap-ass gimmick to probabilistically find HTML/XML tags.
+ # Note that <style and HTML comments are handled by crack_html_style()
+ # and crack_html_comment() instead -- they can be very long, and long
+ # minimal matches have a nasty habit of blowing the C stack.
  html_re = re.compile(r"""
      <
      (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
!     # guessing that other tags are usually "short"
!     [^>]{0,256} # search for the end '>', but don't run wild
      >
  """, re.VERBOSE | re.DOTALL)
***************
*** 882,885 ****
--- 864,919 ----
      return log(n)/c
  
+ 
+ class Stripper(object):
+     def __init__(self, find_start, find_end):
+         # find_start and find_end have signature
+         #     string, int -> match_object
+         # where the search starts at string[int:int].  If a match isn't found,
+         # they must return None.  The match_object for find_start, if not
+         # None, is passed to self.tokenize, which returns a (possibly empty)
+         # list of tokens to generate.  Subclasses may override tokenize().
+         # Text between find_start and find_end is thrown away, except for
+         # whatever tokenize() produces.  A match_object must support method
+         #     span() -> int, int    # the slice bounds of what was matched
+         self.find_start = find_start
+         self.find_end = find_end
+ 
+     # Efficiency note:  This is cheaper than it looks if there aren't any
+     # special sections.  Under the covers, string[0:] is optimized to
+     # return string (no new object is built), and likewise ' '.join([string])
+     # is optimized to return string.  It would actually slow this code down
+     # to special-case these "do nothing" special cases at the Python level!
+ 
+     def analyze(self, text):
+         i = 0
+         retained = []
+         pushretained = retained.append
+         tokens = []
+         while True:
+             m = self.find_start(text, i)
+             if not m:
+                 pushretained(text[i:])
+                 break
+             start, end = m.span()
+             pushretained(text[i : start])
+             tokens.extend(self.tokenize(m))
+             m = self.find_end(text, end)
+             if not m:
+                 break
+             dummy, i = m.span()
+         # Replace each skipped portion with a single blank.
+         return ' '.join(retained), tokens
+ 
+     def tokenize(self, match_object):
+         # Override this if you want to suck info out of the start pattern.
+         return []
+ 
+ # Strip out uuencoded sections and produce tokens.  The return value
+ # is (new_text, sequence_of_tokens), where new_text no longer contains
+ # uuencoded stuff.  Note that we're not bothering to decode it!  Maybe
+ # we should.  One of my persistent false negatives is a spam containing
+ # nothing but a uuencoded money.txt; OTOH, uuencoded seems to be on
+ # its way out (that's an old spam).
+ 
  uuencode_begin_re = re.compile(r"""
      ^begin \s+
***************
*** 891,949 ****
  uuencode_end_re = re.compile(r"^end\s*\n", re.MULTILINE)
  
! # Strip out uuencoded sections and produce tokens.  The return value
! # is (new_text, sequence_of_tokens), where new_text no longer contains
! # uuencoded stuff.  Note that we're not bothering to decode it!  Maybe
! # we should.  One of my persistent false negatives is a spam containing
! # nothing but a uuencoded money.txt; OTOH, uuencoded seems to be on
! # its way out (that's an old spam).
! #
! # Efficiency note:  This is cheaper than it looks if there aren't any
! # uuencoded sections.  Under the covers, string[0:] is optimized to
! # return string (no new object is built), and likewise ''.join([string])
! # is optimized to return string.  It would actually slow this code down
! # to special-case these "do nothing" special cases at the Python level!
! def crack_uuencode(text):
!     new_text = []
!     tokens = []
!     i = 0
!     while True:
!         # Invariant:  Through text[:i], all non-uuencoded text is in
!         # new_text, and tokens contains summary clues for all uuencoded
!         # portions.  text[i:] hasn't been looked at yet.
!         m = uuencode_begin_re.search(text, i)
!         if not m:
!             new_text.append(text[i:])
!             break
!         start, end = m.span()
!         new_text.append(text[i : start])
          mode, fname = m.groups()
!         tokens.append('uuencode mode:%s' % mode)
!         tokens.extend(['uuencode:%s' % x for x in crack_filename(fname)])
!         m = uuencode_end_re.search(text, end)
!         if not m:
!             break
!         i = m.end()
  
!     return ''.join(new_text), tokens
  
! def crack_urls(text):
!     new_text = []
!     clues = []
!     pushclue = clues.append
!     i = 0
!     while True:
!         # Invariant:  Through text[:i], all non-URL text is in new_text, and
!         # clues contains clues for all URLs.  text[i:] hasn't been looked at
!         # yet.
!         m = url_re.search(text, i)
!         if not m:
!             new_text.append(text[i:])
!             break
          proto, guts = m.groups()
!         start, end = m.span()
!         new_text.append(text[i : start])
!         new_text.append(' ')
  
-         pushclue("proto:" + proto)
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
--- 925,964 ----
  uuencode_end_re = re.compile(r"^end\s*\n", re.MULTILINE)
  
! class UUencodeStripper(Stripper):
!     def __init__(self):
!         Stripper.__init__(self, uuencode_begin_re.search,
!                                 uuencode_end_re.search)
! 
!     def tokenize(self, m):
          mode, fname = m.groups()
!         return (['uuencode mode:%s' % mode] +
!                 ['uuencode:%s' % x for x in crack_filename(fname)])
  
! crack_uuencode = UUencodeStripper().analyze
  
! 
! # Strip and specially tokenize embedded URLish thingies.
! 
! url_re = re.compile(r"""
!     (https? | ftp)  # capture the protocol
!     ://             # skip the boilerplate
!     # Do a reasonable attempt at detecting the end.  It may or may not
!     # be in HTML, may or may not be in quotes, etc.  If it's full of %
!     # escapes, cool -- that's a clue too.
!     ([^\s<>"'\x7f-\xff]+)  # capture the guts
! """, re.VERBOSE)                        # '
! 
! urlsep_re = re.compile(r"[;?:@&=+,$.]")
! 
! class URLStripper(Stripper):
!     def __init__(self):
!         # The empty regexp matches anything at once.
!         Stripper.__init__(self, url_re.search, re.compile("").search)
! 
!     def tokenize(self, m):
          proto, guts = m.groups()
!         tokens = ["proto:" + proto]
!         pushclue = tokens.append
  
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
***************
*** 956,963 ****
              for chunk in urlsep_re.split(piece):
                  pushclue("url:" + chunk)
  
!         i = end
  
-     return ''.join(new_text), clues
  
  # Scan HTML for constructs often seen in viruses and worms.
--- 971,999 ----
              for chunk in urlsep_re.split(piece):
                  pushclue("url:" + chunk)
+         return tokens
  
! crack_urls = URLStripper().analyze
! 
! # Nuke HTML <style gimmicks.
! html_style_start_re = re.compile(r"""
!     < \s* style\b [^>]* >
! """, re.VERBOSE)
! 
! class StyleStripper(Stripper):
!     def __init__(self):
!         Stripper.__init__(self, html_style_start_re.search,
!                                 re.compile(r"</style>").search)
! 
! crack_html_style = StyleStripper().analyze
! 
! # Nuke HTML comments.
! 
! class CommentStripper(Stripper):
!     def __init__(self):
!         Stripper.__init__(self, re.compile(r"<!--").search,
!                                 re.compile(r"-->").search)
! 
! crack_html_comment = CommentStripper().analyze
  
  
  # Scan HTML for constructs often seen in viruses and worms.
***************
*** 1232,1251 ****
              text = text.lower()
  
-             # Get rid of uuencoded sections.
-             text, tokens = crack_uuencode(text)
-             for t in tokens:
-                 yield t
- 
              if options.replace_nonascii_chars:
                  # Replace high-bit chars and control chars with '?'.
                  text = text.translate(non_ascii_translate_tab)
  
-             # Special tagging of embedded URLs.
-             text, tokens = crack_urls(text)
-             for t in tokens:
-                 yield t
- 
              for t in find_html_virus_clues(text):
                  yield "virus:%s" % t
  
              # Remove HTML/XML tags.  Also &nbsp;.
--- 1268,1287 ----
              text = text.lower()
  
              if options.replace_nonascii_chars:
                  # Replace high-bit chars and control chars with '?'.
                  text = text.translate(non_ascii_translate_tab)
  
              for t in find_html_virus_clues(text):
                  yield "virus:%s" % t
+ 
+             # Get rid of uuencoded sections, embedded URLs, <style gimmicks,
+             # and HTML comments.
+             for cracker in (crack_uuencode,
+                             crack_urls,
+                             crack_html_style,
+                             crack_html_comment):
+                 text, tokens = cracker(text)
+                 for t in tokens:
+                     yield t
  
              # Remove HTML/XML tags.  Also &nbsp;.
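
The Stripper framework described in the log message can be exercised in isolation.  The sketch below is a Python 3 condensation of the Stripper and CommentStripper classes from the diff above; the sample text at the end is made up for this example.  It is a rough illustration under those assumptions, not a replacement for tokenizer.py.

```python
import re


class Stripper:
    """Throw away delimited sections of text, optionally emitting tokens."""

    def __init__(self, find_start, find_end):
        # Both callables take (string, start_index) and return a match
        # object or None, like the search method of a compiled regexp.
        self.find_start = find_start
        self.find_end = find_end

    def analyze(self, text):
        i = 0
        retained = []
        tokens = []
        while True:
            m = self.find_start(text, i)
            if not m:
                retained.append(text[i:])
                break
            start, end = m.span()
            retained.append(text[i:start])
            tokens.extend(self.tokenize(m))
            m = self.find_end(text, end)
            if not m:
                break          # unterminated section: the rest is dropped
            i = m.end()
        # Each skipped portion is replaced with a single blank.
        return ' '.join(retained), tokens

    def tokenize(self, match_object):
        # Subclasses may override this to pull tokens out of the start match.
        return []


class CommentStripper(Stripper):
    def __init__(self):
        Stripper.__init__(self, re.compile(r"<!--").search,
                          re.compile(r"-->").search)


# Made-up sample: one closed comment and one that is never closed.
sample = "buy <!-- hidden --> now <!-- never closed"
new_text, tokens = CommentStripper().analyze(sample)
```

Because the start and end patterns are searched for separately, no single regexp has to match across a long comment or style sheet, which is the point the log message makes about re stack blowups.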


From npickett@users.sourceforge.net  Sun Nov 24 09:31:09 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sun, 24 Nov 2002 01:31:09 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.10,1.53.2.11
Message-ID: <E18Ft6b-0002MW-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv8720

Modified Files:
      Tag: hammie-playground
	classifier.py 
Log Message:
* changed (nham, nspam) to (nspam, nham) to reflect the ordering
  used everywhere else in the classifier.  Unfortunately, this
  requires incrementing the pickle version.  Better now than after
  the merge though.


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.10
retrieving revision 1.53.2.11
diff -C2 -d -r1.53.2.10 -r1.53.2.11
*** classifier.py	23 Nov 2002 23:57:22 -0000	1.53.2.10
--- classifier.py	24 Nov 2002 09:31:06 -0000	1.53.2.11
***************
*** 47,51 ****
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 3
  
  class MetaInfo(object):
--- 47,51 ----
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 4
  
  class MetaInfo(object):
***************
*** 61,75 ****
  
      def __repr__(self):
!         return "MetaInfo%r" % repr((self._nham,
!                                     self._nspam,
                                      self.revision))
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self._nham, self._nspam)
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         (self._nham, self._nspam) = t[1:]
          self.revision = 0
  
--- 61,75 ----
  
      def __repr__(self):
!         return "MetaInfo%r" % repr((self._nspam,
!                                     self._nham,
                                      self.revision))
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self._nspam, self._nham)
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         (self._nspam, self._nham) = t[1:]
          self.revision = 0
  

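The reason the reordering forces a new pickle version can be shown with a small sketch.  It reuses the __getstate__ and __setstate__ bodies from the diff above; the __init__ signature and the sample counts are assumptions made for this example, and the code is written for Python 3.

```python
import pickle

PICKLE_VERSION = 4


class MetaInfo:
    def __init__(self, nspam=0, nham=0):
        # Hypothetical constructor, for this example only.
        self._nspam = nspam
        self._nham = nham
        self.revision = 0

    def __getstate__(self):
        return (PICKLE_VERSION, self._nspam, self._nham)

    def __setstate__(self, t):
        if t[0] != PICKLE_VERSION:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
        (self._nspam, self._nham) = t[1:]
        self.revision = 0


# Round trip with the current version: the counts come back in place.
restored = pickle.loads(pickle.dumps(MetaInfo(nspam=7, nham=3)))

# A version-3 state stored the counts as (nham, nspam).  The version
# check refuses it instead of silently swapping the two numbers.
stale = MetaInfo()
try:
    stale.__setstate__((3, 3, 7))
    rejected = False
except ValueError:
    rejected = True
```

Without the version bump, an old state tuple would load cleanly and the ham and spam counts would trade places unnoticed.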
From npickett@users.sourceforge.net  Sun Nov 24 09:32:07 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sun, 24 Nov 2002 01:32:07 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.53.2.11,1.53.2.12
Message-ID: <E18Ft7X-0002V5-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9426

Modified Files:
      Tag: hammie-playground
	classifier.py 
Log Message:
* Removed remaining vestiges of WordInfo computing its own spamprob


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53.2.11
retrieving revision 1.53.2.12
diff -C2 -d -r1.53.2.11 -r1.53.2.12
*** classifier.py	24 Nov 2002 09:31:06 -0000	1.53.2.11
--- classifier.py	24 Nov 2002 09:32:05 -0000	1.53.2.12
***************
*** 102,107 ****
      def __repr__(self):
          return "WordInfo%r" % repr((self.spamcount,
!                                     self.hamcount,
!                                     self.spamprob))
  
      def __getstate__(self):
--- 102,106 ----
      def __repr__(self):
          return "WordInfo%r" % repr((self.spamcount,
!                                     self.hamcount))
  
      def __getstate__(self):
***************
*** 111,116 ****
      def __setstate__(self, t):
          (self.spamcount, self.hamcount) = t
-         self.spamprob = None
-         self.revision = None
  
  
--- 110,113 ----


From mhammond@users.sourceforge.net  Sun Nov 24 22:43:46 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 24 Nov 2002 14:43:46 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 README.txt,1.7,1.8 about.html,1.4,1.5
 addin.py,1.38,1.39 filter.py,1.13,1.14 manager.py,1.35,1.36
Message-ID: <E18G5Te-0007pQ-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv29193

Modified Files:
	README.txt about.html addin.py filter.py manager.py 
Log Message:
Use a percentage for the SpamScore - this is so we can play nicely
with Outlook's UserProperty API.

NOTE: Does require some user intervention - please see
http://mail.python.org/pipermail/spambayes/2002-November/002170.html
for details.


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** README.txt	19 Nov 2002 22:52:25 -0000	1.7
--- README.txt	24 Nov 2002 22:43:43 -0000	1.8
***************
*** 4,11 ****
  you *must* have win32all-149 or later.
  
! CDO is no longer needed :)
! 
! See below for a list of known problems (particularly that you must manually
! create an Outlook property before you can see the Spam scores)
  
  Outlook Addin
--- 4,8 ----
  you *must* have win32all-149 or later.
  
! See below for a list of known problems.
  
  Outlook Addin

Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** about.html	2 Nov 2002 07:01:21 -0000	1.4
--- about.html	24 Nov 2002 22:43:43 -0000	1.5
***************
*** 5,99 ****
  </head>
  <body>
! <span style="font-style: italic;">NOTE: This is very very early code. &nbsp;If
! you are looking this, you have probably been told about it against our better
! judgement &lt;wink&gt;. &nbsp;Stuff doesnt work correctly. &nbsp;Fields are
! funny. &nbsp;If you want something known to work well today for alot of people,
! this is not for you.<br>
  </span><br style="font-style: italic;">
! The source code is maintained at <a
   href="http://spambayes.sourceforge.net">SourceForge</a>.<br>
  <br>
! This spam filter uses Bayesian analysis to filter spam. &nbsp;Unlike other
! spam detection systems, Bayesian systems actually "learn" about what you
! consider spam, and continually adapt as both your regular email and spam
! patterns change.<br>
! 
! <h2>Training</h2>
! Due to the nature of the system, it must be trained before it can be effective.
! &nbsp;Although the system does learn over time, when first installed it has
! no knowledge of either spam or good email.<br>
! 
  <h3>Initial Training</h3>
  When first installed, it is recommended you perform the following steps:<br>
  <ul>
    <li>Create two folders - one for "Spam", and one for "Possible Spam"</li>
!   <li>Go through your Inbox and Deleted Items, and move as much spam as you
! can find to the "Spam" folder. &nbsp;Try and get as much Spam out of your
! inbox as possible.</li>
!   <li>Select the <span style="font-style: italic;">Training</span> dialog.
! &nbsp;Nominate your Spam folder for spam, and your Inbox for good messages,
! and start training.</li>
  </ul>
  To see how effective your Inbox cleanup was, you may like to try:<br>
  <ul>
!   <li>Go to the <span style="font-style: italic;">Filter Now</span> dialog.</li>
    <li>Select your Inbox as the folder to filter.</li>
!   <li>Select <span style="font-style: italic;">Score messages, but dont perform
! filter action</span>.</li>
    <li>Clear both checkboxes so all messages will be scored.</li>
    <li>Start the score operation.</li>
  </ul>
! You can then look at and sort by the Spam field in your Inbox - this is likely
! to find hidden spam that you missed from your inbox cleanup.
! 
  <h3>Incremental Training</h3>
! When you drag a message to your Spam folder, it will be automatically trained
! as spam. &nbsp;Thus, as the classifier misses spam (or is unsure about them),
! it learns as you correct it.<br>
! If messages are dropped back into the Inbox, they are trained as good - thus,
! the system learns what good messages look like should it incorrectly classify
! it as spam or possible spam.<br>
! 
! <h2>Creating a Spam Score Field</h2>
! A custom property named "Spam" is added to all Outlook messages scored.
! This is an integer in 0 (ham) through 100 (spam) inclusive.
! You can teach Outlook to display this field as a column in any table view,
! like the standard Messages view.
! <p>
! This takes some work, and has to be done again for every folder in which
! you want to display a Spam column:
  <ul>
!     <li>While looking at an Outlook table view (like Messages), right-click
!         on the line with column headers (From, Subject, To, Received, ...).
!         In the context menu that pops up, click on Field Chooser.  A box
!         with title <i>Field Chooser</i> pops up.
      <li>In the lower left corner of the <i>Field Chooser</i> box, click
!         <i>New...</i>.  A box with title <i>New Field</i> pops up.
!     <li>In the <i>Name:</i> box, type Spam.
!     <li>In the <i>Type:</i> dropdown list, select <i>Integer</i>.  This is the
!         last choice in the dropdown list.
!         Do not select <i>Number</i> -- it won't work.
!     <li>The <i>Format:</i> dropdown list should display "1,234" now.  Leave it alone.
!     <li>Click OK in the <i>New Field</i> box.  Now you're back in the
!         <i>Field Chooser</i> box.
!     <li>The dropdown list at the top of the <i>Field Chooser</i> box should say
!         <i>User-defined fields in FOLDER</i> now, where FOLDER is the name of the
!         folder you're currently looking at (like Inbox).  Below that, you
!         should see a new rectangular button with a Spam label.
!     <li>Use your mouse to drag the Spam button to the column header position
!         where you want to see the Spam column.  You don't have to be precise
!         here -- you can rearrange or resize the column later just by dragging
!         it around.
!     <li>You're done!  Close the <i>Field Chooser</i> box.
  </ul>
! Outlook's standard Automatic Formatting features can also be taught how to
! access the value of this field; for example, you could tell Outlook to display
! rows with suspected spam messages in green italic.  However, for whatever reason,
! the Outlook Rules Wizard does not allow creating rules based on user-defined
! fields.  That's why this addin supplies its own filtering rules.
! 
! <p>
! Contributions to this documentation are welcome!<br>
  <br>
  </body>
  </html>
--- 5,117 ----
  </head>
  <body>
! <h1>SpamBayes Outlook Plugin<br>
! </h1>
! <span style="font-style: italic;">NOTE: This is very very early code.
! &nbsp;If you are looking at this, you have probably been told about it
! against our better judgement &lt;wink&gt;. &nbsp;Stuff doesnt work
! correctly. &nbsp;If you want something known to work well today for alot
! of people, this is not for you.</span> &nbsp;That said, this plug-in
! works amazingly well! So welcome aboard.<span
!  style="font-style: italic;"><br>
  </span><br style="font-style: italic;">
! This spam filter uses Bayesian analysis to filter spam. &nbsp;Unlike
! other spam detection systems, Bayesian systems actually "learn" about
! what you consider spam, and continually adapt as both your regular email
! and spam patterns change. The source code is maintained at <a
   href="http://spambayes.sourceforge.net">SourceForge</a>.<br>
  <br>
! Here you can find information on:<br>
! <div style="margin-left: 40px;"><a href="#Training">Training</a><br>
! <a href="#Field">Viewing the Spam Score field</a><br>
! </div>
! <h2><a name="Training"></a>Training</h2>
! Due to the nature of the system, it must be trained before it can be
! effective. &nbsp;Although the system does learn over time, when first
! installed it has no knowledge of either spam or good email.<br>
  <h3>Initial Training</h3>
  When first installed, it is recommended you perform the following steps:<br>
  <ul>
    <li>Create two folders - one for "Spam", and one for "Possible Spam"</li>
!   <li>Go through your Inbox and Deleted Items, and move as much spam as
! you can find to the "Spam" folder. &nbsp;Try and get as much Spam out of
! your inbox as possible.</li>
!   <li>Select the <span style="font-style: italic;">Training</span>
! dialog. &nbsp;Nominate your Spam folder for spam, and your Inbox for
! good messages, and start training.</li>
  </ul>
  To see how effective your Inbox cleanup was, you may like to try:<br>
  <ul>
!   <li>Go to the <span style="font-style: italic;">Filter Now</span>
! dialog.</li>
    <li>Select your Inbox as the folder to filter.</li>
!   <li>Select <span style="font-style: italic;">Score messages, but
! dont perform filter action</span>.</li>
    <li>Clear both checkboxes so all messages will be scored.</li>
    <li>Start the score operation.</li>
  </ul>
! You can then look at and sort by the Spam field in your Inbox - this is
! likely to find hidden spam that you missed from your inbox cleanup.
  <h3>Incremental Training</h3>
! When you drag a message to your Spam folder, it will be automatically
! trained as spam. &nbsp;Thus, as the classifier misses spam (or is unsure
! about them), it learns as you correct it.<br>
! If messages are dropped back into the Inbox, they are trained as good -
! thus, the system learns what good messages look like should it
! incorrectly classify it as spam or possible spam.<br>
! You will also notice a "Delete as Spam" button (in all folders except
! the Spam folder) and a "Recover from Spam" button in the Spam and Unsure
! folders. &nbsp;These buttons have the same effect as the drags above.
! &nbsp;(Note that currently the "Recover from Spam" option will move the
! item to the Inbox - this is a bug - it should restore the message to
! the folder it was originally filtered from in the first place)<br>
! <h2><a name="Field"></a>Viewing the Spam Score Field</h2>
! A custom property named <span style="font-style: italic;">Spam</span>
! is added to all Outlook messages scored. This is a percentage indicating
! the likelihood of the message being spam (i.e., 0% is "certain" ham; 100%
! is "certain" spam). You can teach Outlook to display this field as a
! column in any table view, like the standard Messages view.
! <p> This takes some work, and has to be done again for every folder in
! which you want to display a Spam column: </p>
  <ul>
!   <li>While looking at an Outlook table view (like Messages),
! right-click on the line with column headers (From, Subject, To,
! Received, ...). In the context menu that pops up, click on Field
! Chooser. A box with title <i>Field Chooser</i> pops up.</li>
!   <li>In the drop-down list at the top of the <span
!  style="font-style: italic;">Field Chooser</span> window, select <span
!  style="font-style: italic;">User Defined Fields</span></li>
!   <li>Below the drop-down, you should see a rectangular button
! with a <span style="font-style: italic;">Spam</span> label. This
! should be automatically created for
! all folders managed by the system, but if it does not appear, you will
! need to add it yourself. &nbsp;To do this, perform the following steps:</li>
!   <ul>
      <li>In the lower left corner of the <i>Field Chooser</i> box, click
!  <i>New...</i>.  A box with title <i>New Field</i> pops up. </li>
!     <li>In the <i>Name:</i> box, type Spam. </li>
!     <li>In the <i>Type:</i> dropdown list, select <i>Percent</i>.
! This is the third choice in the dropdown list. Do not
! select any other type -- it won't work. </li>
!     <li>In the <i>Format:</i> list, select the first entry -
! "Rounded".</li>
!     <li>Click OK in the <i>New Field</i> box.  Now you're back in the <i>Field
! Chooser</i> box, with a new <span style="font-style: italic;">Spam</span>
! button shown. </li>
!   </ul>
!   <li>Use your mouse to drag the <span style="font-style: italic;">Spam</span>
! button to the column header position where you want to see the
! Spam column. You don't have to be precise here -- you can
! rearrange or resize the column later just by dragging it around. </li>
!   <li>You're done!  Close the <i>Field Chooser</i> box. </li>
  </ul>
! Outlook's standard Automatic Formatting features can also be taught how
! to access the value of this field; for example, you could tell Outlook
! to display rows with suspected spam messages in green italic.  However,
! for whatever reason, the Outlook Rules Wizard does not allow creating
! rules based on user-defined fields.  That's why this addin supplies its
! own filtering rules.
! <p> Contributions to this documentation are welcome!<br>
  <br>
+ </p>
  </body>
  </html>

Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** addin.py	23 Nov 2002 10:47:10 -0000	1.38
--- addin.py	24 Nov 2002 22:43:43 -0000	1.39
***************
*** 199,203 ****
              import train
              trained_as_good = train.been_trained_as_ham(msgstore_message, self.manager)
!             if self.manager.config.filter.spam_threshold > prop or \
                 trained_as_good:
                  subject = item.Subject.encode("mbcs", "replace")
--- 199,203 ----
              import train
              trained_as_good = train.been_trained_as_ham(msgstore_message, self.manager)
!             if self.manager.config.filter.spam_threshold > prop * 100 or \
                 trained_as_good:
                  subject = item.Subject.encode("mbcs", "replace")
***************
*** 222,226 ****
  
      item = msgstore_message.GetOutlookItem()
!     score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
      new_msg = app.CreateItem(0)
      # NOTE: Silly Outlook always switches the message editor back to RTF
--- 222,226 ----
  
      item = msgstore_message.GetOutlookItem()
!     score, clues = mgr.score(msgstore_message, evidence=True)
      new_msg = app.CreateItem(0)
      # NOTE: Silly Outlook always switches the message editor back to RTF

Index: filter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** filter.py	7 Nov 2002 22:30:09 -0000	1.13
--- filter.py	24 Nov 2002 22:43:43 -0000	1.14
***************
*** 14,21 ****
      config = mgr.config.filter
      prob = mgr.score(msg)
!     if prob >= config.spam_threshold:
          disposition = "Yes"
          attr_prefix = "spam"
!     elif prob >= config.unsure_threshold:
          disposition = "Unsure"
          attr_prefix = "unsure"
--- 14,22 ----
      config = mgr.config.filter
      prob = mgr.score(msg)
!     prob_perc = prob * 100
!     if prob_perc >= config.spam_threshold:
          disposition = "Yes"
          attr_prefix = "spam"
!     elif prob_perc >= config.unsure_threshold:
          disposition = "Unsure"
          attr_prefix = "unsure"
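
Note on the filter.py change above: mgr.score() now returns a float between 0.0 (ham) and 1.0 (spam), while the configured thresholds remain percentages, so the caller scales the score before comparing. A minimal standalone sketch of that comparison follows. The function name, the default threshold values, and the final "No" disposition are assumptions made for illustration; none of them appear in the diff.

```python
def disposition(prob, spam_threshold=90, unsure_threshold=15):
    """Map a 0.0-1.0 score onto a disposition string.

    The thresholds are percentages (0-100). The default values here are
    hypothetical and are not taken from the add-in's configuration.
    """
    prob_perc = prob * 100  # scale the float score to a percentage
    if prob_perc >= spam_threshold:
        return "Yes"
    elif prob_perc >= unsure_threshold:
        return "Unsure"
    # Assumed fall-through case; the diff does not show this branch.
    return "No"


if __name__ == "__main__":
    for score in (0.95, 0.5, 0.01):
        print(score, disposition(score))
```

Keeping the score unscaled inside the manager and converting only at the comparison leaves a single place where the percent convention has to be remembered.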

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** manager.py	23 Nov 2002 10:32:48 -0000	1.35
--- manager.py	24 Nov 2002 22:43:43 -0000	1.36
***************
*** 96,99 ****
--- 96,105 ----
          # So until we know better, use Outlook to hack this in.
          # Should be called once per folder you are watching/filtering etc
+         #
+         # Oh the tribulations of our property grail
+         # We originally wanted to use the "Integer" Outlook field,
+         # but it seems this property type alone is not exposed via the Object
+         # model.  So we resort to olPercent, and live with the % sign
+         # (which really is OK!)
          assert self.outlook is not None, "I need outlook :("
          ol = self.outlook
***************
*** 107,113 ****
          if item is not None:
              ups = item.UserProperties
-             # Display format is documented as being the 1-based index in
-             # the combo box in the outlook UI for the given data type.
-             # 1 is the first - "all digits", which seems fine.
              # *sigh* - need to search by int index
              for i in range(ups.Count):
--- 113,116 ----
***************
*** 117,133 ****
              else: # for not broken
                  try:
                      ups.Add(self.config.field_score_name,
!                            # "Integer" from the UI doesn't exist!
!                            # 'olNumber' doesn't seem to work with PT_INT*
!                            win32com.client.constants.olCombination,
!                            True) # Add to folder
                      item.Save()
                      if self.verbose > 1:
                          print "Created the UserProperty!"
!                 except pythoncom.com_error:
!                     pass # We know, we know...
! ##                    import traceback
! ##                    print "Failed to create the field"
! ##                    traceback.print_exc()
          # else no items in this folder - not much worth doing!
          if include_sub:
--- 120,142 ----
              else: # for not broken
                  try:
+                     # Display format is documented as being the 1-based index in
+                     # the combo box in the outlook UI for the given data type.
+                     # 1 is the first - "Rounded", which seems fine.
+                     format = 1
                      ups.Add(self.config.field_score_name,
!                            win32com.client.constants.olPercent,
!                            True, # Add to folder
!                            format)
                      item.Save()
                      if self.verbose > 1:
                          print "Created the UserProperty!"
!                 except pythoncom.com_error, details:
!                     print "Warning: failed to create the Outlook " \
!                           "user-property in folder '%s'" \
!                           % (folder.Name.encode("mbcs", "replace"),)
!                     print "", details
!                     print " This is probably because the code has recently"\
!                           " been changed, but it will"
!                     print " have no effect on the filtering or scoring."
          # else no items in this folder - not much worth doing!
          if include_sub:
***************
*** 251,255 ****
          self.outlook = None
  
!     def score(self, msg, evidence=False, scale=True):
          """Score a msg.
  
--- 260,264 ----
          self.outlook = None
  
!     def score(self, msg, evidence=False):
          """Score a msg.
  
***************
*** 261,280 ****
          where clues is a list of the (word, spamprob(word)) pairs that
          went into determining the score.  Else just the score is returned.
- 
-         If optional arg scale is specified and false, the score is a float
-         in 0.0 (ham) thru 1.0 (spam).  Else (the default), the score is
-         scaled into an integer from 0 (ham) thru 100 (spam).
          """
- 
          email = msg.GetEmailPackageObject()
          result = self.bayes.spamprob(bayes_tokenize(email), evidence)
-         if not scale:
-             return result
-         # For sister-friendliness, multiply score by 100 and round to an int.
          if evidence:
              score, the_evidence = result
          else:
              score = result
-         score = int(round(score * 100.0))
          if evidence:
              return score, the_evidence
--- 270,280 ----


From npickett@users.sourceforge.net  Mon Nov 25 02:29:47 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sun, 24 Nov 2002 18:29:47 -0800
Subject: [Spambayes-checkins] spambayes Persistent.py,1.1,1.2
	hammiebulk.py,1.1,1.2
	Corpus.py,1.2,1.3 FileCorpus.py,1.2,1.3 Options.py,1.75,1.76
	TestDriver.py,1.30,1.31 Tester.py,1.8,1.9 classifier.py,1.53,1.54
	dbdict.py,1.1,1.2 hammie.py,1.40,1.41 hammiefilter.py,1.2,1.3
	pop3proxy.py,1.18,1.19 Bayes.py,1.5,NONE
Message-ID: <E18G90M-0008Mj-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv31682

Modified Files:
	Corpus.py FileCorpus.py Options.py TestDriver.py Tester.py 
	classifier.py dbdict.py hammie.py hammiefilter.py pop3proxy.py 
Added Files:
	Persistent.py hammiebulk.py 
Removed Files:
	Bayes.py 
Log Message:
* Merge from hammie-playground to HEAD.  See spambayes list for more
  details.


Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** Corpus.py	16 Nov 2002 19:03:15 -0000	1.2
--- Corpus.py	25 Nov 2002 02:29:44 -0000	1.3
***************
*** 230,234 ****
  
          return msg
!         
  
  class ExpiryCorpus:
--- 230,234 ----
  
          return msg
! 
  
  class ExpiryCorpus:
***************
*** 272,276 ****
      def __init__(self):
          '''Constructor()'''
!         pass
  
      def load(self):
--- 272,278 ----
      def __init__(self):
          '''Constructor()'''
! 
!         self.payload = None
!         self.hdrtxt = None
  
      def load(self):
***************
*** 297,301 ****
          '''Instance as a printable string'''
  
!         return self.substance
  
      def name(self):
--- 299,303 ----
          '''Instance as a printable string'''
  
!         return self.getSubstance()
  
      def name(self):
***************
*** 311,322 ****
      def setSubstance(self, sub):
          '''set this message substance'''
!         
!         self.substance = sub
!         
      def getSubstance(self):
          '''Return this message substance'''
!         
!         return self.substance
!         
      def setSpamprob(self, prob):
          '''Score of the last spamprob calc, may not be persistent'''
--- 313,328 ----
      def setSubstance(self, sub):
          '''set this message substance'''
! 
!         bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL+re.MULTILINE)
!         bmatch = bodyRE.search(sub)
!         if bmatch:
!             self.payload = bmatch.group(2)
!             self.hdrtxt = sub[:bmatch.start(2)]
! 
      def getSubstance(self):
          '''Return this message substance'''
! 
!         return self.hdrtxt + self.payload
! 
      def setSpamprob(self, prob):
          '''Score of the last spamprob calc, may not be persistent'''
***************
*** 327,331 ****
          '''Returns substance as tokens'''
  
!         return tokenizer.tokenize(self.substance)
  
      def createTimeStamp(self):
--- 333,337 ----
          '''Returns substance as tokens'''
  
!         return tokenizer.tokenize(self.getSubstance())
  
      def createTimeStamp(self):
***************
*** 335,338 ****
--- 341,398 ----
          raise NotImplementedError
  
+     def getFrom(self):
+         '''Return a message From header content'''
+ 
+         if self.hdrtxt:
+             match = re.search(r'^From:(.*)$', self.hdrtxt, re.MULTILINE)
+             return match.group(1)
+         else:
+             return None
+ 
+     def getSubject(self):
+         '''Return a message Subject header contents'''
+ 
+         if self.hdrtxt:
+             match = re.search(r'^Subject:(.*)$', self.hdrtxt, re.MULTILINE)
+             return match.group(1)
+         else:
+             return None
+ 
+     def getDate(self):
+         '''Return a message Date header contents'''
+ 
+         if self.hdrtxt:
+             match = re.search(r'^Date:(.*)$', self.hdrtxt, re.MULTILINE)
+             return match.group(1)
+         else:
+             return None
+ 
+     def getHeadersList(self):
+         '''Return a list of message header tuples'''
+ 
+         hdrregex = re.compile(r'^([A-Za-z0-9-_]*): ?(.*)$', re.MULTILINE)
+         data = re.sub(r'\r?\n\r?\s', ' ', self.hdrtxt)
+         match = hdrregex.findall(data)
+ 
+         return match
+ 
+     def getHeaders(self):
+         '''Return message headers as text'''
+         
+         return self.hdrtxt
+ 
+     def getPayload(self):
+         '''Return the message body'''
+ 
+         return self.payload
+ 
+     def stripSBDHeader(self):
+         '''Removes the X-Spambayes-Disposition: header from the message'''
+ 
+         # This is useful for training, where a spammer may be spoofing
+         # our header, to make sure that our header doesn't become an
+         # overweight clue to hamminess
+ 
+         raise NotImplementedError
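
Note on the Corpus.py change above: setSubstance() now splits a message at the first blank line into header text and payload, and the new getFrom(), getSubject() and getDate() accessors search the header text. The sketch below reproduces that idea outside the class. split_message() and get_header() are hypothetical names, and unlike the committed code the sketch returns None for a missing header and treats a message with no blank line as all headers; both behaviours are assumptions, not what the diff does.

```python
import re

# Same pattern as in the diff: a line ending, a blank line, then the body.
BODY_RE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL)


def split_message(text):
    """Return (header_text, payload) split at the first blank line."""
    m = BODY_RE.search(text)
    if m is None:
        # Assumption: with no blank line, treat everything as headers.
        return text, ""
    return text[:m.start(2)], m.group(2)


def get_header(hdrtxt, name):
    """Return the stripped value of the first header called name, or None."""
    m = re.search(r"^%s:(.*)$" % re.escape(name), hdrtxt, re.MULTILINE)
    if m is None:
        return None
    return m.group(1).strip()


if __name__ == "__main__":
    msg = "From: a@example.com\nSubject: hi\n\nbody text\n"
    hdr, body = split_message(msg)
    print(repr(hdr))
    print(repr(body))
    print(get_header(hdr, "Subject"))
```

Because the header text keeps the separating blank line, concatenating the two parts gives back the original message, which is what getSubstance() relies on.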
  
  
Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** FileCorpus.py	16 Nov 2002 19:06:27 -0000	1.2
--- FileCorpus.py	25 Nov 2002 02:29:44 -0000	1.3
***************
*** 86,90 ****
  
  import Corpus
! import Bayes
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
  
--- 86,90 ----
  
  import Corpus
! import Persistent
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
  
***************
*** 192,195 ****
--- 192,196 ----
          '''Constructor(message file name, corpus directory name)'''
  
+         Corpus.Message.__init__(self)
          self.file_name = file_name
          self.directory = directory
***************
*** 214,218 ****
                 raise
          else:
!            self.substance = fp.read()
             fp.close()
  
--- 215,219 ----
                 raise
          else:
!            self.setSubstance(fp.read())
             fp.close()
  
***************
*** 225,229 ****
          pn = self.pathname()
          fp = open(pn, 'wb')
!         fp.write(self.substance)
          fp.close()
  
--- 226,230 ----
          pn = self.pathname()
          fp = open(pn, 'wb')
!         fp.write(self.getSubstance())
          fp.close()
  
***************
*** 248,260 ****
  
          elip = ''
!         sub = self.substance
! 
          if Corpus.Verbose:
!             sub = self.substance
          else:
!             if len(self.substance) > 20:
!                 sub = self.substance[:20]
!                 if len(self.substance) > 40:
!                     sub += '...' + self.substance[-20:]
  
          pn = os.path.join(self.directory, self.file_name)
--- 249,261 ----
  
          elip = ''
!         sub = self.getSubstance()
!         
          if Corpus.Verbose:
!             sub = self.getSubstance()
          else:
!             if len(sub) > 20:
!                 sub = sub[:20]
!                 if len(sub) > 40:
!                     sub += '...' + sub[-20:]
  
          pn = os.path.join(self.directory, self.file_name)
***************
*** 304,308 ****
                  raise
          else:
!             self.substance = fp.read()
              fp.close()
  
--- 305,309 ----
                  raise
          else:
!             self.setSubstance(fp.read())
              fp.close()
  
***************
*** 316,320 ****
          pn = self.pathname()
          gz = gzip.open(pn, 'wb')
!         gz.write(self.substance)
          gz.flush()
          gz.close()
--- 317,321 ----
          pn = self.pathname()
          gz = gzip.open(pn, 'wb')
!         gz.write(self.getSubstance())
          gz.flush()
          gz.close()
***************
*** 342,354 ****
          print 'Executing with uncompressed files'
  
!     print '\n\nCreating two Bayes databases'
!     miscbayes = Bayes.PickledBayes('fctestmisc.bayes')
!     classbayes = Bayes.DBDictBayes('fctestclass.bayes')
  
      print '\n\nSetting up spam corpus'
      spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
!     spamtrainer = Bayes.SpamTrainer(miscbayes)
      spamcorpus.addObserver(spamtrainer)
!     anotherspamtrainer = Bayes.SpamTrainer(classbayes, Bayes.UPDATEPROBS)
      spamcorpus.addObserver(anotherspamtrainer)
  
--- 343,355 ----
          print 'Executing with uncompressed files'
  
!     print '\n\nCreating two Classifier databases'
!     miscbayes = Persistent.PickledClassifier('fctestmisc.bayes')
!     classbayes = Persistent.DBDictClassifier('fctestclass.bayes')
  
      print '\n\nSetting up spam corpus'
      spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
!     spamtrainer = Persistent.SpamTrainer(miscbayes)
      spamcorpus.addObserver(spamtrainer)
!     anotherspamtrainer = Persistent.SpamTrainer(classbayes, Persistent.UPDATEPROBS)
      spamcorpus.addObserver(anotherspamtrainer)
  
***************
*** 365,374 ****
                            'fctesthamcorpus', \
                            'MSG*')
!     hamtrainer = Bayes.HamTrainer(miscbayes)
      hamcorpus.addObserver(hamtrainer)
      hamtrainer.trainAll(hamcorpus)
  
! 
!     print '\n\nAdd a message to hamcorpus that does not match the filter'
      if useGzip:
          fmClass = GzipFileMessage
--- 366,374 ----
                            'fctesthamcorpus', \
                            'MSG*')
!     hamtrainer = Persistent.HamTrainer(miscbayes)
      hamcorpus.addObserver(hamtrainer)
      hamtrainer.trainAll(hamcorpus)
  
!     print '\n\nA couple of message related tests'
      if useGzip:
          fmClass = GzipFileMessage
***************
*** 377,380 ****
--- 377,383 ----
  
      m1 = fmClass('XMG00001', 'fctestspamcorpus')
+     m1.setSubstance(testmsg2())
+     
+     print '\n\nAdd a message to hamcorpus that does not match the filter'
  
      try:
***************
*** 417,421 ****
  
      print '\n\nTrain with an individual message'
!     anotherhamtrainer = Bayes.HamTrainer(classbayes)
      anotherhamtrainer.train(unsurecorpus['MSG00005'])
  
--- 420,424 ----
  
      print '\n\nTrain with an individual message'
!     anotherhamtrainer = Persistent.HamTrainer(classbayes)
      anotherhamtrainer.train(unsurecorpus['MSG00005'])
  
***************
*** 428,431 ****
--- 431,443 ----
      msg = spamcorpus['MSG00001']
      print msg
+     print '\n\nThis is some vital information in the message'
+     print 'Date header is',msg.getDate()
+     print 'Subject header is',msg.getSubject()
+     print 'From header is',msg.getFrom()
+     
+     print 'Header text is:',msg.getHeaders()
+     print 'Headers are:',msg.getHeadersList()
+     print 'Body is:',msg.getPayload()
+ 
  
  
***************
*** 526,538 ****
  
      m1 = fmClass('MSG00001', 'fctestspamcorpus')
!     m1.substance = tm1
      m1.store()
  
      m2 = fmClass('MSG00002', 'fctestspamcorpus')
!     m2.substance = tm2
      m2.store()
  
      m3 = fmClass('MSG00003', 'fctestunsurecorpus')
!     m3.substance = tm1
      m3.store()
  
--- 538,550 ----
  
      m1 = fmClass('MSG00001', 'fctestspamcorpus')
!     m1.setSubstance(tm1)
      m1.store()
  
      m2 = fmClass('MSG00002', 'fctestspamcorpus')
!     m2.setSubstance(tm2)
      m2.store()
  
      m3 = fmClass('MSG00003', 'fctestunsurecorpus')
!     m3.setSubstance(tm1)
      m3.store()
  
***************
*** 546,558 ****
  
      m4 = fmClass('MSG00004', 'fctestunsurecorpus')
!     m4.substance = tm1
      m4.store()
  
      m5 = fmClass('MSG00005', 'fctestunsurecorpus')
!     m5.substance = tm2
      m5.store()
  
      m6 = fmClass('MSG00006', 'fctestunsurecorpus')
!     m6.substance = tm2
      m6.store()
  
--- 558,570 ----
  
      m4 = fmClass('MSG00004', 'fctestunsurecorpus')
!     m4.setSubstance(tm1)
      m4.store()
  
      m5 = fmClass('MSG00005', 'fctestunsurecorpus')
!     m5.setSubstance(tm2)
      m5.store()
  
      m6 = fmClass('MSG00006', 'fctestunsurecorpus')
!     m6.setSubstance(tm2)
      m6.store()
  
***************
*** 583,587 ****
  Content-Type:text/plain; charset=us-ascii
  Content- Transfer- Encoding:7bit
- 
  Message-ID:<15814.42238.882013.702030@montanaro.dyndns.org>
  Date:Mon, 4 Nov 2002 10:49:02 -0600
--- 595,598 ----
***************
*** 644,648 ****
  Content-Type:text/plain; charset=us-ascii
  Content- Transfer- Encoding:7bit
- 
  X-Hammie- Disposition:Unsure
  
--- 655,658 ----

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.75
retrieving revision 1.76
diff -C2 -d -r1.75 -r1.76
*** Options.py	20 Nov 2002 22:41:50 -0000	1.75
--- Options.py	25 Nov 2002 02:29:44 -0000	1.76
***************
*** 198,209 ****
  show_unsure: False
  
- # Near the end of Driver.test(), you can get a listing of the best
- # discriminators in the words from the training sets.  These are the
- # words whose WordInfo.killcount values are highest, meaning they most
- # often were among the most extreme clues spamprob() found.  The number
- # of best discriminators to show is given by show_best_discriminators;
- # set this <= 0 to suppress showing any of the best discriminators.
- show_best_discriminators: 30
- 
  # The maximum # of characters to display for a msg displayed due to the
  # show_xyz options above.
--- 198,201 ----
***************
*** 346,356 ****
  clue_mailheader_cutoff: 0.5
  
! # The default database path used by hammie
! persistent_storage_file: hammie.db
! 
! # hammie can use either a database (quick to score one message) or a pickle
! # (quick to train on huge amounts of messages). Set this to True to use a
! # database by default.
! persistent_use_database: False
  
  [pop3proxy]
--- 338,347 ----
  clue_mailheader_cutoff: 0.5
  
! [hammiefilter]
! # hammiefilter can use either a database (quick to score one message) or
! # a pickle (quick to train on huge amounts of messages). Set this to
! # True to use a database by default.
! hammiefilter_persistent_use_database: True
! hammiefilter_persistent_storage_file: ~/.hammiedb
  
  [pop3proxy]
***************
*** 368,371 ****
--- 359,364 ----
  pop3proxy_ham_cache: pop3proxy-ham-cache
  pop3proxy_unknown_cache: pop3proxy-unknown-cache
+ pop3proxy_persistent_use_database: False
+ pop3proxy_persistent_storage_file: hammie.db
  
  # Deprecated - use pop3proxy_servers and pop3proxy_ports instead.
***************
*** 411,415 ****
                     'show_histograms': boolean_cracker,
                     'percentiles': ('get', lambda s: map(float, s.split())),
-                    'show_best_discriminators': int_cracker,
                     'save_trained_pickles': boolean_cracker,
                     'save_histogram_pickles': boolean_cracker,
--- 404,407 ----
***************
*** 436,440 ****
                    },
      'Hammie': {'hammie_header_name': string_cracker,
-                'persistent_storage_file': string_cracker,
                 'clue_mailheader_cutoff': float_cracker,
                 'persistent_use_database': boolean_cracker,
--- 428,431 ----
***************
*** 447,450 ****
--- 438,444 ----
                 'hammie_debug_header_name': string_cracker,
                 },
+     'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker,
+                       'hammiefilter_persistent_storage_file': string_cracker,
+                       },
      'pop3proxy': {'pop3proxy_servers': string_cracker,
                    'pop3proxy_ports': string_cracker,
***************
*** 457,460 ****
--- 451,456 ----
                    'pop3proxy_ham_cache': string_cracker,
                    'pop3proxy_unknown_cache': string_cracker,
+                   'pop3proxy_persistent_use_database': boolean_cracker,
+                   'pop3proxy_persistent_storage_file': string_cracker,
                    },
      'html_ui': {'html_ui_port': int_cracker,
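
Note on the Options.py change above: hammiefilter and pop3proxy now each have their own storage options. A rough sketch of reading the new [hammiefilter] values with the standard-library configparser is shown below. load_storage_settings() is a hypothetical helper, the sketch bypasses the project's own Options module, and expanding "~" with os.path.expanduser is an assumption about how the path is meant to be used.

```python
import configparser
import os

# Sample text mirroring the [hammiefilter] section added in the diff.
SAMPLE = """
[hammiefilter]
hammiefilter_persistent_use_database: True
hammiefilter_persistent_storage_file: ~/.hammiedb
"""


def load_storage_settings(text):
    """Return (use_database, storage_path) from a [hammiefilter] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = "hammiefilter"
    use_db = parser.getboolean(section, "hammiefilter_persistent_use_database")
    raw_path = parser.get(section, "hammiefilter_persistent_storage_file")
    # Assumption: the leading "~" is meant to be the user's home directory.
    return use_db, os.path.expanduser(raw_path)


if __name__ == "__main__":
    print(load_storage_settings(SAMPLE))
```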

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** TestDriver.py	19 Nov 2002 17:43:27 -0000	1.30
--- TestDriver.py	25 Nov 2002 02:29:44 -0000	1.31
***************
*** 305,324 ****
              printmsg(e, prob, clues)
  
-         if options.show_best_discriminators > 0:
-             print
-             print "    best discriminators:"
-             stats = [(-1, None)] * options.show_best_discriminators
-             smallest_killcount = -1
-             for w, r in c.wordinfo.iteritems():
-                 if r.killcount > smallest_killcount:
-                     heapreplace(stats, (r.killcount, w))
-                     smallest_killcount = stats[0][0]
-             stats.sort()
-             for count, w in stats:
-                 if count < 0:
-                     continue
-                 r = c.wordinfo[w]
-                 print "        %r %d %g" % (w, r.killcount, r.spamprob)
- 
          if options.show_histograms:
              printhist("this pair:", local_ham_hist, local_spam_hist)
--- 305,308 ----

Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** Tester.py	7 Nov 2002 22:30:04 -0000	1.8
--- Tester.py	25 Nov 2002 02:29:44 -0000	1.9
***************
*** 60,68 ****
          if hamstream is not None:
              for example in hamstream:
!                 learn(example, False, False)
          if spamstream is not None:
              for example in spamstream:
!                 learn(example, True, False)
!         self.classifier.update_probabilities()
  
      # Untrain the classifier on streams of ham and spam.  Updates
--- 60,67 ----
          if hamstream is not None:
              for example in hamstream:
!                 learn(example, False)
          if spamstream is not None:
              for example in spamstream:
!                 learn(example, True)
  
      # Untrain the classifier on streams of ham and spam.  Updates
***************
*** 73,81 ****
          if hamstream is not None:
              for example in hamstream:
!                 unlearn(example, False, False)
          if spamstream is not None:
              for example in spamstream:
!                 unlearn(example, True, False)
!         self.classifier.update_probabilities()
  
      # Run prediction on each sample in stream.  You're swearing that stream
--- 72,79 ----
          if hamstream is not None:
              for example in hamstream:
!                 unlearn(example, False)
          if spamstream is not None:
              for example in spamstream:
!                 unlearn(example, True)
  
      # Run prediction on each sample in stream.  You're swearing that stream

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53
retrieving revision 1.54
diff -C2 -d -r1.53 -r1.54
*** classifier.py	18 Nov 2002 18:23:09 -0000	1.53
--- classifier.py	25 Nov 2002 02:29:44 -0000	1.54
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
  # An implementation of a Bayes-like spam classifier.
  #
***************
*** 32,36 ****
  
  import math
- import time
  from sets import Set
  
--- 33,36 ----
***************
*** 47,92 ****
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 1
  
- class WordInfo(object):
-     __slots__ = ('atime',     # when this record was last used by scoring(*)
-                  'spamcount', # # of spams in which this word appears
-                  'hamcount',  # # of hams in which this word appears
-                  'killcount', # # of times this made it to spamprob()'s nbest
-                  'spamprob',  # prob(spam | msg contains this word)
-                 )
  
      # Invariant:  For use in a classifier database, at least one of
      # spamcount and hamcount must be non-zero.
-     #
-     # (*)atime is the last access time, a UTC time.time() value.  It's the
-     # most recent time this word was used by scoring (i.e., by spamprob(),
-     # not by training via learn()); or, if the word has never been used by
-     # scoring, the time the word record was created (i.e., by learn()).
-     # One good criterion for identifying junk (word records that have no
-     # value) is to delete words that haven't been used for a long time.
-     # Perhaps they were typos, or unique identifiers, or relevant to a
-     # once-hot topic or scam that's fallen out of favor.  Whatever, if
-     # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
!         self.atime = atime
!         self.spamcount = self.hamcount = self.killcount = 0
!         self.spamprob = spamprob
  
      def __repr__(self):
!         return "WordInfo%r" % repr((self.atime, self.spamcount,
!                                     self.hamcount, self.killcount,
!                                     self.spamprob))
  
      def __getstate__(self):
!         return (self.atime, self.spamcount, self.hamcount, self.killcount,
!                 self.spamprob)
  
      def __setstate__(self, t):
!         (self.atime, self.spamcount, self.hamcount, self.killcount,
!          self.spamprob) = t
  
! class Bayes:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object.  There's
--- 47,116 ----
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 4
! 
! class MetaInfo(object):
!     """Information about the corpora.
! 
!     Contains nham and nspam, used for calculating probabilities.  Also
!     has a revision, incremented every time nham or nspam is adjusted.
!     Nothing uses this, currently, but it's there if you want it.
! 
!     """
!     def __init__(self):
!         self.__setstate__((PICKLE_VERSION, 0, 0))
! 
!     def __repr__(self):
!         return "MetaInfo%r" % repr((self._nspam,
!                                     self._nham,
!                                     self.revision))
! 
!     def __getstate__(self):
!         return (PICKLE_VERSION, self._nspam, self._nham)
! 
!     def __setstate__(self, t):
!         if t[0] != PICKLE_VERSION:
!             raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         (self._nspam, self._nham) = t[1:]
!         self.revision = 0
! 
!     def incr_rev(self):
!         self.revision += 1
! 
!     def get_nham(self):
!         return self._nham
!     def set_nham(self, val):
!         self._nham = val
!         self.incr_rev()
!     nham = property(get_nham, set_nham)
! 
!     def set_nspam(self, val):
!         self._nspam = val
!         self.incr_rev()
!     def get_nspam(self):
!         return self._nspam
!     nspam = property(get_nspam, set_nspam)
! 
! 
  
  
+ class WordInfo(object):
      # Invariant:  For use in a classifier database, at least one of
      # spamcount and hamcount must be non-zero.
  
!     def __init__(self):
!         self.__setstate__((0, 0))
  
      def __repr__(self):
!         return "WordInfo%r" % repr((self.spamcount,
!                                     self.hamcount))
  
      def __getstate__(self):
!         return (self.spamcount,
!                 self.hamcount)
  
      def __setstate__(self, t):
!         (self.spamcount, self.hamcount) = t
  
! 
! class Classifier:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object.  There's
***************
*** 105,117 ****
      def __init__(self):
          self.wordinfo = {}
!         self.nspam = self.nham = 0
  
      def __getstate__(self):
!         return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.wordinfo, self.nspam, self.nham = t[1:]
  
      # spamprob() implementations.  One of the following is aliased to
--- 129,156 ----
      def __init__(self):
          self.wordinfo = {}
!         self.meta = MetaInfo()
!         self.probcache = {}
  
      def __getstate__(self):
!         return PICKLE_VERSION, self.wordinfo, self.meta
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.wordinfo, self.meta = t[1:]
! 
!     # Slacker's way out--pass nham/nspam access through to self.meta
! 
!     def get_nham(self):
!         return self.meta.nham
!     def set_nham(self, val):
!         self.meta.nham = val
!     nham = property(get_nham, set_nham)
! 
!     def get_nspam(self):
!         return self.meta.nspam
!     def set_nspam(self, val):
!         self.meta.nspam = val
!     nspam = property(get_nspam, set_nspam)
  
      # spamprob() implementations.  One of the following is aliased to
***************
*** 145,150 ****
          clues = self._getclues(wordstream)
          for prob, word, record in clues:
-             if record is not None:  # else wordinfo doesn't know about it
-                 record.killcount += 1
              P *= 1.0 - prob
              Q *= prob
--- 184,187 ----
***************
*** 234,239 ****
          clues = self._getclues(wordstream)
          for prob, word, record in clues:
-             if record is not None:  # else wordinfo doesn't know about it
-                 record.killcount += 1
              S *= 1.0 - prob
              H *= prob
--- 271,274 ----
***************
*** 278,282 ****
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam, update_probabilities=True):
          """Teach the classifier by example.
  
--- 313,317 ----
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam):
          """Teach the classifier by example.
  
***************
*** 285,324 ****
          else that it's definitely not spam.
  
!         If optional arg update_probabilities is False (the default is True),
!         don't update word probabilities.  Updating them is expensive, and if
!         you're going to pass many messages to learn(), it's more efficient
!         to pass False here and call update_probabilities() once when you're
!         done -- or to call learn() with update_probabilities=True when
!         passing the last new example.  The important thing is that the
!         probabilities get updated before calling spamprob() again.
          """
  
          self._add_msg(wordstream, is_spam)
-         if update_probabilities:
-             self.update_probabilities()
  
!     def unlearn(self, wordstream, is_spam, update_probabilities=True):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
          Pass the same arguments you passed to learn().
          """
- 
          self._remove_msg(wordstream, is_spam)
-         if update_probabilities:
-             self.update_probabilities()
  
!     def update_probabilities(self):
!         """Update the word probabilities in the spam database.
  
!         This computes a new probability for every word in the database,
!         so can be expensive.  learn() and unlearn() update the probabilities
!         each time by default.  Thay have an optional argument that allows
!         to skip this step when feeding in many messages, and in that case
!         you should call update_probabilities() after feeding the last
!         message and before calling spamprob().
          """
  
!         nham = float(self.nham or 1)
!         nspam = float(self.nspam or 1)
  
          if options.experimental_ham_spam_imbalance_adjustment:
--- 320,371 ----
          else that it's definitely not spam.
  
!         Training no longer recomputes word probabilities eagerly:
!         learn() only updates the counts and clears the probability
!         cache.  Each word's probability is then recomputed on demand
!         by probability() the next time spamprob() needs it, so there
!         is no separate update step to run before scoring.  Call
!         update_probabilities() only if you want every word's
!         probability computed in one pass.
! 
          """
  
          self._add_msg(wordstream, is_spam)
  
!     def unlearn(self, wordstream, is_spam):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
          Pass the same arguments you passed to learn().
          """
          self._remove_msg(wordstream, is_spam)
  
!     def probability(self, record):
!         """Compute, store, and return prob(msg is spam | msg contains word).
  
!         This is the Graham calculation, but stripped of biases, and
!         stripped of clamping into 0.01 thru 0.99.  The Bayesian
!         adjustment following keeps them in a sane range, and one
!         that naturally grows the more evidence there is to back up
!         a probability.
          """
  
!         spamcount = record.spamcount
!         hamcount = record.hamcount
!         
!         # Try the cache first
!         try:
!             return self.probcache[spamcount][hamcount]
!         except KeyError:
!             pass
! 
!         nham = float(self.meta.nham or 1)
!         nspam = float(self.meta.nspam or 1)
! 
!         assert hamcount <= nham
!         hamratio = hamcount / nham
! 
!         assert spamcount <= nspam
!         spamratio = spamcount / nspam
! 
!         prob = spamratio / (hamratio + spamratio)
  
          if options.experimental_ham_spam_imbalance_adjustment:
***************
*** 331,405 ****
          StimesX = S * options.unknown_word_prob
  
-         for word, record in self.wordinfo.iteritems():
-             # Compute p(word) = prob(msg is spam | msg contains word).
-             # This is the Graham calculation, but stripped of biases, and
-             # stripped of clamping into 0.01 thru 0.99.  The Bayesian
-             # adjustment following keeps them in a sane range, and one
-             # that naturally grows the more evidence there is to back up
-             # a probability.
-             hamcount = record.hamcount
-             assert hamcount <= nham
-             hamratio = hamcount / nham
  
!             spamcount = record.spamcount
!             assert spamcount <= nspam
!             spamratio = spamcount / nspam
! 
!             prob = spamratio / (hamratio + spamratio)
  
!             # Now do Robinson's Bayesian adjustment.
!             #
!             #         s*x + n*p(w)
!             # f(w) = --------------
!             #           s + n
!             #
!             # I find this easier to reason about like so (equivalent when
!             # s != 0):
!             #
!             #        x - p
!             #  p +  -------
!             #       1 + n/s
!             #
!             # IOW, it moves p a fraction of the distance from p to x, and
!             # less so the larger n is, or the smaller s is.
  
!             # Experimental:
!             # Picking a good value for n is interesting:  how much empirical
!             # evidence do we really have?  If nham == nspam,
!             # hamcount + spamcount makes a lot of sense, and the code here
!             # does that by default.
!             # But if, e.g., nham is much larger than nspam, p(w) can get a
!             # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!             # strong ham words (high hamcount) much stronger than strong
!             # spam words (high spamcount), and that makes the accidental
!             # appearance of a strong ham word in spam much more damaging than
!             # the accidental appearance of a strong spam word in ham.
!             # So we don't give hamcount full credit when nham > nspam (or
!             # spamcount when nspam > nham):  instead we knock hamcount down
!             # to what it would have been had nham been equal to nspam.  IOW,
!             # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!             # we don't "believe" any count to an extent more than
!             # min(nspam, nham) justifies.
  
!             n = hamcount * spam2ham  +  spamcount * ham2spam
!             prob = (StimesX + n * prob) / (S + n)
  
!             if record.spamprob != prob:
!                 record.spamprob = prob
!                 # The next seemingly pointless line appears to be a hack
!                 # to allow a persistent db to realize the record has changed.
!                 self.wordinfo[word] = record
  
!     def clearjunk(self, oldesttime):
!         """Forget useless wordinfo records.  This can shrink the database size.
  
!         A record for a word will be retained only if the word was accessed
!         at or after oldesttime.
          """
  
!         wordinfo = self.wordinfo
!         tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
!         for w in tonuke:
!             del wordinfo[w]
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
--- 378,440 ----
          StimesX = S * options.unknown_word_prob
  
  
!         # Now do Robinson's Bayesian adjustment.
!         #
!         #         s*x + n*p(w)
!         # f(w) = --------------
!         #           s + n
!         #
!         # I find this easier to reason about like so (equivalent when
!         # s != 0):
!         #
!         #        x - p
!         #  p +  -------
!         #       1 + n/s
!         #
!         # IOW, it moves p a fraction of the distance from p to x, and
!         # less so the larger n is, or the smaller s is.
  
!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.
  
!         n = hamcount * spam2ham  +  spamcount * ham2spam
!         prob = (StimesX + n * prob) / (S + n)
  
!         # Update the cache
!         try:
!             self.probcache[spamcount][hamcount] = prob
!         except KeyError:
!             self.probcache[spamcount] = {hamcount: prob}
  
!         return prob
  
!     def update_probabilities(self):
!         """Update the word probabilities in the spam database.
  
!         This computes a new probability for every word in the database,
!         which can be expensive.  learn() and unlearn() clear the
!         probability cache on every call, and it is rebuilt lazily
!         as probabilities are looked up.  If for some reason you need to
!         update all the probabilities in one step (say, for
!         benchmarking), you can call this method.
          """
  
!         for word, record in self.wordinfo.iteritems():
!             self.probability(record)
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
***************
*** 424,439 ****
      # to exploit it.
      def _add_msg(self, wordstream, is_spam):
          if is_spam:
!             self.nspam += 1
          else:
!             self.nham += 1
  
          wordinfo = self.wordinfo
          wordinfoget = wordinfo.get
-         now = time.time()
          for word in Set(wordstream):
              record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass(now)
  
              if is_spam:
--- 459,474 ----
      # to exploit it.
      def _add_msg(self, wordstream, is_spam):
+         self.probcache = {}    # nuke the prob cache
          if is_spam:
!             self.meta.nspam += 1
          else:
!             self.meta.nham += 1
  
          wordinfo = self.wordinfo
          wordinfoget = wordinfo.get
          for word in Set(wordstream):
              record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass()
  
              if is_spam:
***************
*** 441,456 ****
              else:
                  record.hamcount += 1
              # Needed to tell a persistent DB that the content changed.
              wordinfo[word] = record
  
      def _remove_msg(self, wordstream, is_spam):
          if is_spam:
!             if self.nspam <= 0:
                  raise ValueError("spam count would go negative!")
!             self.nspam -= 1
          else:
!             if self.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.nham -= 1
  
          wordinfo = self.wordinfo
--- 476,494 ----
              else:
                  record.hamcount += 1
+ 
              # Needed to tell a persistent DB that the content changed.
              wordinfo[word] = record
  
+ 
      def _remove_msg(self, wordstream, is_spam):
+         self.probcache = {}    # nuke the prob cache
          if is_spam:
!             if self.meta.nspam <= 0:
                  raise ValueError("spam count would go negative!")
!             self.meta.nspam -= 1
          else:
!             if self.meta.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.meta.nham -= 1
  
          wordinfo = self.wordinfo
***************
*** 468,472 ****
                      del wordinfo[word]
                  else:
!                     # Needed to tell a persistent DB that the content changed.
                      wordinfo[word] = record
  
--- 506,511 ----
                      del wordinfo[word]
                  else:
!                     # Needed to tell a persistent DB that the content
!                     # changed.
                      wordinfo[word] = record
  
***************
*** 479,483 ****
  
          wordinfoget = self.wordinfo.get
-         now = time.time()
          for word in Set(wordstream):
              record = wordinfoget(word)
--- 518,521 ----
***************
*** 485,490 ****
                  prob = unknown
              else:
!                 record.atime = now
!                 prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist:
--- 523,527 ----
                  prob = unknown
              else:
!                 prob = self.probability(record)
              distance = abs(prob - 0.5)
              if distance >= mindist:
***************
*** 496,497 ****
--- 533,537 ----
          # Return (prob, word, record).
          return [t[1:] for t in clues]
+ 
+ 
+ Bayes = Classifier
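
The classifier.py change above replaces the stored per-word spamprob with a
probability computed on demand and cached by (spamcount, hamcount).  The
following is only a rough sketch of that idea, written for Python 3 rather than
the Python 2.2 of the patch.  TinyClassifier, its tuple-keyed cache, and the
values s=0.45 and x=0.5 are assumptions made for illustration (the patch uses a
nested dict and reads its constants from options), and the experimental
ham/spam imbalance adjustment is left out.

```python
class TinyClassifier:
    """Illustrative stand-in for Classifier; not the spambayes class."""

    def __init__(self, s=0.45, x=0.5):
        self.nspam = 0
        self.nham = 0
        self.wordinfo = {}    # word -> [spamcount, hamcount]
        self.probcache = {}   # (spamcount, hamcount) -> probability
        self.s = s            # assumed strength given to the prior
        self.x = x            # assumed probability for an unknown word

    def learn(self, words, is_spam):
        # Any change to the counts invalidates every cached probability.
        self.probcache = {}
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            counts = self.wordinfo.setdefault(word, [0, 0])
            counts[0 if is_spam else 1] += 1

    def probability(self, word):
        spamcount, hamcount = self.wordinfo[word]
        key = (spamcount, hamcount)
        # Try the cache first; words with equal counts share one entry.
        try:
            return self.probcache[key]
        except KeyError:
            pass

        nham = float(self.nham or 1)
        nspam = float(self.nspam or 1)
        hamratio = hamcount / nham
        spamratio = spamcount / nspam
        prob = spamratio / (hamratio + spamratio)

        # Robinson's adjustment: f(w) = (s*x + n*p(w)) / (s + n)
        n = hamcount + spamcount
        prob = (self.s * self.x + n * prob) / (self.s + n)

        self.probcache[key] = prob
        return prob
```

Because the cache key is the pair of counts rather than the word, two words
that were each seen once in spam and never in ham cost a single computation.
The price is that every learn() call throws the whole cache away.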

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** dbdict.py	19 Nov 2002 23:31:44 -0000	1.1
--- dbdict.py	25 Nov 2002 02:29:44 -0000	1.2
***************
*** 1,6 ****
  #! /usr/bin/env python
  
  from __future__ import generators
! import dbhash
  try:
      import cPickle as pickle
--- 1,55 ----
  #! /usr/bin/env python
  
+ """DBDict.py - Dictionary access to dbhash
+ 
+ Classes:
+     DBDict - wraps a dbhash file
+ 
+ Abstract:
+     DBDict class wraps a dbhash file with a reasonably complete set
+     of dictionary access methods.  DBDicts can be iterated like a dictionary.
+     
+     The constructor accepts a class name which is used specifically to
+     to pickle/unpickle an instance of that class.  When an instance of
+     that class is being pickled, the pickler (actually __getstate__) prepends
+     a 'W' to the pickled string, and when the unpickler (really __setstate__)
+     encounters that 'W', it constructs that class (with no constructor
+     arguments) and executes __setstate__ on the constructed instance.
+ 
+     DBDict accepts an iterskip operand on the constructor.  This is a tuple
+     of hash keys that will be skipped (not seen) during iteration.  This
+     is for iteration only.  Methods such as keys() will return the entire
+     complement of keys in the dbm hash, even if they're in iterskip.  An
+     iterkeys() method is provided for iterating with skipped keys, and
+     itervalues() is provided for iterating values with skipped keys.
+ 
+         >>> d = DBDict('/tmp/goober.db', MODE_CREATE, ('skipme', 'skipmetoo'))
+         >>> d['skipme'] = 'booga'
+         >>> d['countme'] = 'wakka'
+         >>> print d.keys()
+         ['skipme', 'countme']
+         >>> for k in d.iterkeys():
+         ...     print k
+         countme
+         >>> for v in d.itervalues():
+         ...     print v
+         wakka
+         >>> for k,v in d.iteritems():
+         ...     print k,v
+         countme wakka
+ 
+ To Do:
+     """
+ 
+ # This module is part of the spambayes project, which is Copyright 2002
+ # The Python Software Foundation and is covered by the Python Software
+ # Foundation license.
+ 
+ __author__ = "Neale Pickett <neale@woozle.org>, \
+               Tim Stone <tim@fourstonesExpressions.com>"
+ __credits__ = "Tim Peters (author of DBDict class), \
+                all the spambayes contributors."
  from __future__ import generators
! 
  try:
      import cPickle as pickle
***************
*** 8,11 ****
--- 57,72 ----
      import pickle
  
+ import dbhash
+ import errno
+ import copy
+ import shutil
+ import os
+ 
+ MODE_CREATE = 'c'       # create file if necessary, open for readwrite
+ MODE_NEW = 'n'          # always create new file, open for readwrite
+ MODE_READWRITE = 'w'    # open existing file for readwrite
+ MODE_READONLY = 'r'     # open existing file for read only
+ 
+ 
  class DBDict:
      """Database Dictionary.
***************
*** 19,23 ****
      like .keys() still list everything.  For instance:
  
!     >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo'))
      >>> d['skipme'] = 'booga'
      >>> d['countme'] = 'wakka'
--- 80,84 ----
      like .keys() still list everything.  For instance:
  
!     >>> d = DBDict('goober.db', MODE_CREATE, ('skipme', 'skipmetoo'))
      >>> d['skipme'] = 'booga'
      >>> d['countme'] = 'wakka'
***************
*** 30,36 ****
      """
  
!     def __init__(self, dbname, mode, iterskip=()):
          self.hash = dbhash.open(dbname, mode)
!         self.iterskip = iterskip
  
      def __getitem__(self, key):
--- 91,121 ----
      """
  
!     def __init__(self, dbname, mode, wclass, iterskip=()):
          self.hash = dbhash.open(dbname, mode)
!         if iterskip:
!             self.iterskip = iterskip
!         else:
!             self.iterskip = ()
!         self.wclass = wclass
! 
!     def __getitem__(self, key):
!         v = self.hash[key]
!         if v[0] == 'W':
!             val = pickle.loads(v[1:])
!             # We could be sneaky, like pickle.Unpickler.load_inst,
!             # but I think that's overly confusing.
!             obj = self.wclass()
!             obj.__setstate__(val)
!             return obj
!         else:
!             return pickle.loads(v)
! 
!     def __setitem__(self, key, val):
!         if isinstance(val, self.wclass):
!             val = val.__getstate__()
!             v = 'W' + pickle.dumps(val, 1)
!         else:
!             v = pickle.dumps(val, 1)
!         self.hash[key] = v
  
      def __getitem__(self, key):
***************
*** 79,82 ****
--- 164,168 ----
      def itervalues(self):
          return self.__iter__(lambda k: k[1])
+ 
  
  open = DBDict
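
The 'W' tagging that DBDict applies to instances of wclass can be tried out
without a dbhash file.  The sketch below is a Python 3 approximation under
stated assumptions: TaggedDict and Counts are hypothetical names, and a plain
in-memory dict of byte strings stands in for the database.  It is meant to
show the round trip and the iterskip behaviour, not to reproduce dbdict.py.

```python
import pickle


class Counts:
    """Hypothetical record class with the same state protocol as WordInfo."""

    def __init__(self):
        self.__setstate__((0, 0))

    def __getstate__(self):
        return (self.spamcount, self.hamcount)

    def __setstate__(self, t):
        (self.spamcount, self.hamcount) = t


class TaggedDict:
    """Illustrative stand-in for DBDict, backed by a dict instead of dbhash."""

    def __init__(self, wclass, iterskip=()):
        self.hash = {}            # key -> pickled bytes
        self.wclass = wclass
        self.iterskip = iterskip

    def __setitem__(self, key, val):
        if isinstance(val, self.wclass):
            # Store only the state tuple, tagged so it can be recognised.
            v = b'W' + pickle.dumps(val.__getstate__(), 1)
        else:
            v = pickle.dumps(val, 1)
        self.hash[key] = v

    def __getitem__(self, key):
        v = self.hash[key]
        if v[:1] == b'W':
            # Rebuild with no constructor arguments, then restore the state.
            obj = self.wclass()
            obj.__setstate__(pickle.loads(v[1:]))
            return obj
        return pickle.loads(v)

    def keys(self):
        # Like DBDict.keys(), this lists everything, skipped keys included.
        return list(self.hash)

    def iterkeys(self):
        for k in self.hash:
            if k not in self.iterskip:
                yield k
```

Storing the bare state tuple rather than a full instance pickle keeps each
record small and avoids embedding the class's module path in every value,
which is presumably why the constructor has to be told which class to rebuild.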

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** hammie.py	18 Nov 2002 18:13:54 -0000	1.40
--- hammie.py	25 Nov 2002 02:29:44 -0000	1.41
***************
*** 1,56 ****
  #! /usr/bin/env python
  
- # A driver for the classifier module and Tim's tokenizer that you can
- # call from procmail.
- 
- """Usage: %(program)s [options]
- 
- Where:
-     -h
-         show usage and exit
-     -g PATH
-         mbox or directory of known good messages (non-spam) to train on.
-         Can be specified more than once, or use - for stdin.
-     -s PATH
-         mbox or directory of known spam messages to train on.
-         Can be specified more than once, or use - for stdin.
-     -u PATH
-         mbox of unknown messages.  A ham/spam decision is reported for each.
-         Can be specified more than once.
-     -r
-         reverse the meaning of the check (report ham instead of spam).
-         Only meaningful with the -u option.
-     -p FILE
-         use file as the persistent store.  loads data from this file if it
-         exists, and saves data to this file at the end.
-         Default: %(DEFAULTDB)s
-     -d
-         use the DBM store instead of cPickle.  The file is larger and
-         creating it is slower, but checking against it is much faster,
-         especially for large word databases. Default: %(USEDB)s
-     -D
-         the reverse of -d: use the cPickle instead of DBM
-     -f
-         run as a filter: read a single message from stdin, add an
-         %(DISPHEADER)s header, and write it to stdout.  If you want to
-         run from procmail, this is your option.
- """
- 
- from __future__ import generators
- 
- import sys
- import os
- import types
- import getopt
- import mailbox
- import glob
- import email
- import errno
- import anydbm
- import cPickle as pickle
  
  import mboxutils
! import classifier
  from Options import options
  
  try:
--- 1,10 ----
  #! /usr/bin/env python
  
  
+ import dbdict
  import mboxutils
! import Persistent
  from Options import options
+ from tokenizer import tokenize
  
  try:
***************
*** 61,224 ****
  
  
! program = sys.argv[0] # For usage(); referenced by docstring above
! 
! # Name of the header to add in filter mode
! DISPHEADER = options.hammie_header_name
! DEBUGHEADER = options.hammie_debug_header_name
! DODEBUG = options.hammie_debug_header
! 
! # Default database name
! DEFAULTDB = options.persistent_storage_file
! 
! # Probability at which a message is considered spam
! SPAM_THRESHOLD = options.spam_cutoff
! HAM_THRESHOLD = options.ham_cutoff
! 
! # Probability limit for a clue to be added to the DISPHEADER
! SHOWCLUE = options.clue_mailheader_cutoff
! 
! # Use a database? If False, use a pickle
! USEDB = options.persistent_use_database
! 
! # Tim's tokenizer kicks far more booty than anything I would have
! # written.  Score one for analysis ;)
! from tokenizer import tokenize
! 
! class DBDict:
! 
!     """Database Dictionary.
! 
!     This wraps an anydbm to make it look even more like a dictionary.
! 
!     Call it with the name of your database file.  Optionally, you can
!     specify a list of keys to skip when iterating.  This only affects
!     iterators; things like .keys() still list everything.  For instance:
! 
!     >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
!     >>> d['skipme'] = 'booga'
!     >>> d['countme'] = 'wakka'
!     >>> print d.keys()
!     ['skipme', 'countme']
!     >>> for k in d.iterkeys():
!     ...     print k
!     countme
! 
!     """
! 
!     def __init__(self, dbname, mode, iterskip=()):
!         self.hash = anydbm.open(dbname, mode)
!         self.iterskip = iterskip
! 
!     def __getitem__(self, key):
!         v = self.hash[key]
!         if v[0] == 'W':
!             val = pickle.loads(v[1:])
!             # We could be sneaky, like pickle.Unpickler.load_inst,
!             # but I think that's overly confusing.
!             obj = classifier.WordInfo(0)
!             obj.__setstate__(val)
!             return obj
!         else:
!             return pickle.loads(v)
! 
!     def __setitem__(self, key, val):
!         if isinstance(val, classifier.WordInfo):
!             val = val.__getstate__()
!             v = 'W' + pickle.dumps(val, 1)
!         else:
!             v = pickle.dumps(val, 1)
!         self.hash[key] = v
! 
!     def __delitem__(self, key, val):
!         del(self.hash[key])
! 
!     def __iter__(self, fn=None):
!         k = self.hash.first()
!         while k != None:
!             key = k[0]
!             val = self.__getitem__(key)
!             if key not in self.iterskip:
!                 if fn:
!                     yield fn((key, val))
!                 else:
!                     yield (key, val)
!             try:
!                 k = self.hash.next()
!             except KeyError:
!                 break
! 
!     def __contains__(self, name):
!         return self.has_key(name)
! 
!     def __getattr__(self, name):
!         # Pass the buck
!         return getattr(self.hash, name)
! 
!     def get(self, key, dfl=None):
!         if self.has_key(key):
!             return self[key]
!         else:
!             return dfl
! 
!     def iteritems(self):
!         return self.__iter__()
! 
!     def iterkeys(self):
!         return self.__iter__(lambda k: k[0])
! 
!     def itervalues(self):
!         return self.__iter__(lambda k: k[1])
! 
! 
! class PersistentBayes(classifier.Bayes):
! 
!     """A persistent Bayes classifier.
! 
!     This is just like classifier.Bayes, except that the dictionary is a
!     database.  You take less disk this way and you can pretend it's
!     persistent.  The tradeoffs vs. a pickle are: 1. it's slower
!     training, but faster checking, and 2. it needs less memory to run,
!     but takes more space on the hard drive.
  
!     On destruction, an instantiation of this class will write its state
!     to a special key.  When you instantiate a new one, it will attempt
!     to read these values out of that key again, so you can pick up where
!     you left off.
  
      """
  
-     # XXX: Would it be even faster to remember (in a list) which keys
-     # had been modified, and only recalculate those keys?  No sense in
-     # going over the entire word database if only 100 words are
-     # affected.
- 
-     # XXX: Another idea: cache stuff in memory.  But by then maybe we
-     # should just use ZODB.
- 
-     def __init__(self, dbname, mode):
-         classifier.Bayes.__init__(self)
-         self.statekey = "saved state"
-         self.wordinfo = DBDict(dbname, mode, (self.statekey,))
-         self.dbmode = mode
- 
-         self.restore_state()
- 
-     def __del__(self):
-         #super.__del__(self)
-         self.save_state()
- 
-     def save_state(self):
-         if self.dbmode != 'r':
-             self.wordinfo[self.statekey] = (self.nham, self.nspam)
- 
-     def restore_state(self):
-         if self.wordinfo.has_key(self.statekey):
-             self.nham, self.nspam = self.wordinfo[self.statekey]
- 
- 
- class Hammie:
- 
-     """A spambayes mail filter"""
- 
      def __init__(self, bayes):
          self.bayes = bayes
--- 15,26 ----
  
  
! class Hammie:
!     """A spambayes mail filter.
  
!     This implements the basic functionality needed to score, filter, or
!     train.  
  
      """
  
      def __init__(self, bayes):
          self.bayes = bayes
***************
*** 256,269 ****
          """
  
!         try:
!             return self._scoremsg(msg, evidence)
!         except:
!             print msg
!             import traceback
!             traceback.print_exc()
  
!     def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
!                ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER,
!                debug=DODEBUG):
          """Score (judge) a message and add a disposition header.
  
--- 58,66 ----
          """
  
!         return self._scoremsg(msg, evidence)
  
!     def filter(self, msg, header=None, spam_cutoff=None,
!                ham_cutoff=None, debugheader=None,
!                debug=None):
          """Score (judge) a message and add a disposition header.
  
***************
*** 283,286 ****
--- 80,94 ----
          """
  
+         if header is None:
+             header = options.hammie_header_name
+         if spam_cutoff is None:
+             spam_cutoff = options.spam_cutoff
+         if ham_cutoff is None:
+             ham_cutoff = options.ham_cutoff
+         if debugheader is None:
+             debugheader = options.hammie_debug_header_name
+         if debug is None:
+             debug = options.hammie_debug_header
+ 
          msg = mboxutils.get_message(msg)
          try:
***************
*** 323,327 ****
          """
  
!         self.bayes.learn(tokenize(msg), is_spam, False)
  
      def train_ham(self, msg):
--- 131,135 ----
          """
  
!         self.bayes.learn(tokenize(msg), is_spam)
  
      def train_ham(self, msg):
***************
*** 349,510 ****
          self.train(msg, True)
  
!     def update_probabilities(self):
!         """Update probability values.
  
!         You would want to call this after a training session.  It's
!         pretty slow, so if you have a lot of messages to train, wait
!         until you're all done before calling this.
  
          """
  
!         self.bayes.update_probabilities()
! 
! 
! def train(hammie, msgs, is_spam):
!     """Train bayes with all messages from a mailbox."""
!     mbox = mboxutils.getmbox(msgs)
!     i = 0
!     for msg in mbox:
!         i += 1
!         # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
!         # back in the day.  Maybe it's a line-printer-ism ;)
!         sys.stdout.write("\r%6d" % i)
!         sys.stdout.flush()
!         hammie.train(msg, is_spam)
!     print
! 
! def score(hammie, msgs, reverse=0):
!     """Score (judge) all messages from a mailbox."""
!     # XXX The reporting needs work!
!     mbox = mboxutils.getmbox(msgs)
!     i = 0
!     spams = hams = 0
!     for msg in mbox:
!         i += 1
!         prob, clues = hammie.score(msg, True)
!         if hasattr(msg, '_mh_msgno'):
!             msgno = msg._mh_msgno
!         else:
!             msgno = i
!         isspam = (prob >= SPAM_THRESHOLD)
!         if isspam:
!             spams += 1
!             if not reverse:
!                 print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
!                 print hammie.formatclues(clues)
!         else:
!             hams += 1
!             if reverse:
!                 print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
!                 print hammie.formatclues(clues)
!     return (spams, hams)
! 
! def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
!     """Create a Bayes instance for the given pickle (which
!     doesn't have to exist).  Create a PersistentBayes if
!     usedb is True."""
!     if usedb:
!         bayes = PersistentBayes(pck, mode)
!     else:
!         bayes = None
!         try:
!             fp = open(pck, 'rb')
!         except IOError, e:
!             if e.errno <> errno.ENOENT: raise
!         else:
!             bayes = pickle.load(fp)
!             fp.close()
!         if bayes is None:
!             bayes = classifier.Bayes()
!     return bayes
! 
! def usage(code, msg=''):
!     """Print usage message and sys.exit(code)."""
!     if msg:
!         print >> sys.stderr, msg
!         print >> sys.stderr
!     print >> sys.stderr, __doc__ % globals()
!     sys.exit(code)
! 
! def main():
!     """Main program; parse options and go."""
!     try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
!     except getopt.error, msg:
!         usage(2, msg)
! 
!     if not opts:
!         usage(2, "No options given")
! 
!     pck = DEFAULTDB
!     good = []
!     spam = []
!     unknown = []
!     reverse = 0
!     do_filter = False
!     usedb = USEDB
!     mode = 'r'
!     for opt, arg in opts:
!         if opt == '-h':
!             usage(0)
!         elif opt == '-g':
!             good.append(arg)
!             mode = 'c'
!         elif opt == '-s':
!             spam.append(arg)
!             mode = 'c'
!         elif opt == '-p':
!             pck = arg
!         elif opt == "-d":
!             usedb = True
!         elif opt == "-D":
!             usedb = False
!         elif opt == "-f":
!             do_filter = True
!         elif opt == '-u':
!             unknown.append(arg)
!         elif opt == '-r':
!             reverse = 1
!     if args:
!         usage(2, "Positional arguments not allowed")
! 
!     save = False
  
-     bayes = createbayes(pck, usedb, mode)
-     h = Hammie(bayes)
  
!     for g in good:
!         print "Training ham (%s):" % g
!         train(h, g, False)
!         save = True
  
!     for s in spam:
!         print "Training spam (%s):" % s
!         train(h, s, True)
!         save = True
  
!     if save:
!         h.update_probabilities()
!         if not usedb and pck:
!             fp = open(pck, 'wb')
!             pickle.dump(bayes, fp, 1)
!             fp.close()
  
!     if do_filter:
!         msg = sys.stdin.read()
!         filtered = h.filter(msg)
!         sys.stdout.write(filtered)
  
!     if unknown:
!         (spams, hams) = (0, 0)
!         for u in unknown:
!             if len(unknown) > 1:
!                 print "Scoring", u
!             s, g = score(h, u, reverse)
!             spams += s
!             hams += g
!         print "Total %d spam, %d ham" % (spams, hams)
  
  
  if __name__ == "__main__":
!     main()
--- 157,192 ----
          self.train(msg, True)
  
!     def store(self):
!         """Write out the persistent store.
  
!         This makes sure the persistent store reflects what is currently
!         in memory.  You would want to do this after a write and before
!         exiting.
  
          """
  
!         self.bayes.store()
  
  
! def open(filename, usedb=True, mode='r'):
!     """Open a file, returning a Hammie instance.
  
!     If usedb is False, open as a pickle instead of a DBDict.  mode is
  
!     used as the flag to open DBDict objects.  'c' for read-write (create
!     if needed), 'r' for read-only, 'w' for read-write.
  
!     """
  
!     if usedb:
!         b = Persistent.DBDictClassifier(filename, mode)
!     else:
!         b = Persistent.PickledClassifier(filename)
!     return Hammie(b)
  
  
  if __name__ == "__main__":
!     # Everybody's used to running hammie.py.  Why mess with success?  ;)
!     import hammiebulk
! 
!     hammiebulk.main()
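
The filter() change above swaps module-level default values for None and
looks the real values up in options when the method is called.  Python
evaluates default arguments once, when the def statement runs, so a default
such as spam_cutoff=SPAM_THRESHOLD would not notice options loaded later.
The following is a minimal sketch of that pattern, not the hammie code
itself: the Options class, the classify() function and the cutoff values
are hypothetical stand-ins, and the comparison logic is an assumption.  It
also uses "is None", the usual idiom, where the diff uses "== None".

```python
class Options(object):
    """Hypothetical stand-in for the real options object."""
    spam_cutoff = 0.90   # assumed value, for illustration only
    ham_cutoff = 0.20    # assumed value, for illustration only

options = Options()

def classify(prob, spam_cutoff=None, ham_cutoff=None):
    """Return a disposition word for a score between 0.0 and 1.0."""
    # None means "use whatever the options say at call time".
    if spam_cutoff is None:
        spam_cutoff = options.spam_cutoff
    if ham_cutoff is None:
        ham_cutoff = options.ham_cutoff
    if prob >= spam_cutoff:
        return "spam"
    if prob <= ham_cutoff:
        return "ham"
    return "unsure"

before = classify(0.85)        # judged against the initial cutoffs
options.spam_cutoff = 0.80     # e.g. a config file merged in later
after = classify(0.85)         # the new cutoff is picked up
```

Because the lookup happens inside the function, the same call gives a
different answer once the option changes, which a value frozen into the
signature would not.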

Index: hammiefilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** hammiefilter.py	18 Nov 2002 18:14:04 -0000	1.2
--- hammiefilter.py	25 Nov 2002 02:29:44 -0000	1.3
***************
*** 52,95 ****
      sys.exit(code)
  
! def jar_pickle(h):
!     if not options.persistent_use_database:
!         import pickle
!         fp = open(options.persistent_storage_file, 'wb')
!         pickle.dump(h.bayes, fp, 1)
!         fp.close()
!     
! 
! def hammie_open(mode):
!     b = hammie.createbayes(options.persistent_storage_file,
!                            options.persistent_use_database,
!                            mode)
!     return hammie.Hammie(b)
  
! def newdb():
!     h = hammie_open('n')
!     jar_pickle(h)
!     print "Created new database in", options.persistent_storage_file
  
! def filter():
!     h = hammie_open('r')
!     msg = sys.stdin.read()
!     print h.filter(msg)
  
! def train_ham():
!     h = hammie_open('w')
!     msg = sys.stdin.read()
!     h.train_ham(msg)
!     h.update_probabilities()
!     jar_pickle(h)    
  
! def train_spam():
!     h = hammie_open('w')
!     msg = sys.stdin.read()
!     h.train_spam(msg)
!     h.update_probabilities()
!     jar_pickle(h)    
  
  def main():
!     action = filter
      opts, args = getopt.getopt(sys.argv[1:], 'hngs')
      for opt, arg in opts:
--- 52,91 ----
      sys.exit(code)
  
! class HammieFilter(object):
!     def __init__(self):
!         options = Options.options
!         options.mergefiles(['/etc/hammierc',
!                             os.path.expanduser('~/.hammierc')])
!         
!         self.dbname = options.hammiefilter_persistent_storage_file
!         self.dbname = os.path.expanduser(self.dbname)
!         self.usedb = options.hammiefilter_persistent_use_database
!         
  
!     def newdb(self):
!         h = hammie.open(self.dbname, self.usedb, 'n')
!         h.store()
!         print "Created new database in", self.dbname
  
!     def filter(self):
!         h = hammie.open(self.dbname, self.usedb, 'r')
!         msg = sys.stdin.read()
!         print h.filter(msg)
  
!     def train_ham(self):
!         h = hammie.open(self.dbname, self.usedb, 'c')
!         msg = sys.stdin.read()
!         h.train_ham(msg)
!         h.store()
  
!     def train_spam(self):
!         h = hammie.open(self.dbname, self.usedb, 'c')
!         msg = sys.stdin.read()
!         h.train_spam(msg)
!         h.store()
  
  def main():
!     h = HammieFilter()
!     action = h.filter
      opts, args = getopt.getopt(sys.argv[1:], 'hngs')
      for opt, arg in opts:
***************
*** 97,114 ****
              usage(0)
          elif opt == '-g':
!             action = train_ham
          elif opt == '-s':
!             action = train_spam
          elif opt == "-n":
!             action = newdb
! 
!     # hammiefilter overrides
!     config_overrides = """[Hammie]
! persistent_storage_file = %s
! persistent_use_database = True
! """ % os.path.expanduser('~/.hammiedb')
!     options.mergefilelike(StringIO.StringIO(config_overrides))
!     options.mergefiles(['/etc/hammierc',
!                         os.path.expanduser('~/.hammierc')])
  
      action()
--- 93,101 ----
              usage(0)
          elif opt == '-g':
!             action = h.train_ham
          elif opt == '-s':
!             action = h.train_spam
          elif opt == "-n":
!             action = h.newdb
  
      action()

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** pop3proxy.py	20 Nov 2002 22:41:50 -0000	1.18
--- pop3proxy.py	25 Nov 2002 02:29:44 -0000	1.19
***************
*** 119,123 ****
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import Bayes, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
--- 119,123 ----
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import Persistent, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
***************
*** 819,822 ****
--- 819,825 ----
          stateDict = state.__dict__
          stateDict.update(state.bayes.__dict__)
+         # so the property() isn't as cool as we thought.  -ntp
+         stateDict['nham'] = state.bayes.nham
+         stateDict['nspam'] = state.bayes.nspam
          body = (self.pageSection % ('Status', self.summary % stateDict)+
                  self.pageSection % ('Train on proxied messages', self.review)+
***************
*** 1119,1123 ****
  # This keeps the global state of the module - the command-line options,
  # statistics like how many mails have been classified, the handle of the
! # log file, the Bayes and FileCorpus objects, and so on.
  class State:
      def __init__(self):
--- 1122,1126 ----
  # This keeps the global state of the module - the command-line options,
  # statistics like how many mails have been classified, the handle of the
! # log file, the Classifier and FileCorpus objects, and so on.
  class State:
      def __init__(self):
***************
*** 1162,1167 ****
  
          # Load up the other settings from Option.py / bayescustomize.ini
!         self.databaseFilename = options.persistent_storage_file
!         self.useDB = options.persistent_use_database
          self.uiPort = options.html_ui_port
          self.launchUI = options.html_ui_launch_browser
--- 1165,1170 ----
  
          # Load up the other settings from Option.py / bayescustomize.ini
!         self.databaseFilename = options.pop3proxy_persistent_storage_file
!         self.useDB = options.pop3proxy_persistent_use_database
          self.uiPort = options.html_ui_port
          self.launchUI = options.html_ui_launch_browser
***************
*** 1200,1206 ****
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:
!             self.bayes = Bayes.DBDictBayes(self.databaseFilename)
          else:
!             self.bayes = Bayes.PickledBayes(self.databaseFilename)
          print "Done."
  
--- 1203,1209 ----
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:
!             self.bayes = Persistent.DBDictClassifier(self.databaseFilename)
          else:
!             self.bayes = Persistent.PickledClassifier(self.databaseFilename)
          print "Done."
  
***************
*** 1227,1232 ****
  
              # Create the Trainers.
!             self.spamTrainer = Bayes.SpamTrainer(self.bayes)
!             self.hamTrainer = Bayes.HamTrainer(self.bayes)
              self.spamCorpus.addObserver(self.spamTrainer)
              self.hamCorpus.addObserver(self.hamTrainer)
--- 1230,1235 ----
  
              # Create the Trainers.
!             self.spamTrainer = Persistent.SpamTrainer(self.bayes)
!             self.hamTrainer = Persistent.HamTrainer(self.bayes)
              self.spamCorpus.addObserver(self.spamTrainer)
              self.hamCorpus.addObserver(self.hamTrainer)

--- Bayes.py DELETED ---


From npickett@users.sourceforge.net  Mon Nov 25 04:11:31 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sun, 24 Nov 2002 20:11:31 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.54,1.55
Message-ID: <E18GAap-0007Ym-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv28730

Modified Files:
	classifier.py 
Log Message:
* Set Classifier.probcache on unpickle.  This should fix the
  non-obvious problem Mark Hammond encountered.


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.54
retrieving revision 1.55
diff -C2 -d -r1.54 -r1.55
*** classifier.py	25 Nov 2002 02:29:44 -0000	1.54
--- classifier.py	25 Nov 2002 04:11:29 -0000	1.55
***************
*** 139,142 ****
--- 139,143 ----
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
          self.wordinfo, self.meta = t[1:]
+         self.probcache = {}
  
      # Slacker's way out--pass calls to nham/nspam up to the meta class
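
The one-line fix above is needed because unpickling does not run __init__;
an attribute that is left out of the pickled state has to be recreated in
__setstate__.  Below is a small sketch of the idea, assuming a made-up
TinyClassifier class; only the wordinfo and probcache attribute names and
the version check are taken from the diff, everything else is illustrative.

```python
import pickle

class TinyClassifier(object):
    """Hypothetical stand-in for a classifier with a derived cache."""
    VERSION = 1

    def __init__(self):
        self.wordinfo = {}
        self.probcache = {}   # derived data, deliberately not pickled

    def __getstate__(self):
        # Persist only the durable state, tagged with a version number.
        return (self.VERSION, self.wordinfo)

    def __setstate__(self, state):
        if state[0] != self.VERSION:
            raise ValueError("Can't unpickle -- version %s unknown" % state[0])
        self.wordinfo = state[1]
        # __init__ is not called on unpickle, so rebuild the cache here.
        self.probcache = {}

original = TinyClassifier()
original.wordinfo["free"] = (3, 1)
original.probcache["free"] = 0.75

restored = pickle.loads(pickle.dumps(original))
```

Without the last line of __setstate__, the restored object would have no
probcache attribute at all, and the first lookup in it would raise
AttributeError.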


From npickett@users.sourceforge.net  Mon Nov 25 04:23:34 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sun, 24 Nov 2002 20:23:34 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.76,1.77
Message-ID: <E18GAmU-0000xz-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv2282

Modified Files:
	Options.py 
Log Message:
* Changed header to X-Spambayes-Classification: spam/ham/unsure
* Not capitalizing the S in "spam" to comply with Hormel's wishes.
  They've been very cool about the use of their trademark, and I
  don't mind keeping bit 6 high :)


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.76
retrieving revision 1.77
diff -C2 -d -r1.76 -r1.77
*** Options.py	25 Nov 2002 02:29:44 -0000	1.76
--- Options.py	25 Nov 2002 04:23:31 -0000	1.77
***************
*** 307,317 ****
  # The name of the header that hammie adds to an E-mail in filter mode
  # It contains the "classification" of the mail, plus the score.
! hammie_header_name: X-Hammie-Disposition
  
  # The three disposition names are added to the header as the following
  # Three words:
! header_spam_string: Yes
! header_unsure_string: Unsure
! header_ham_string: No
  
  # Accuracy of the score in the header in decimal digits
--- 307,317 ----
  # The name of the header that hammie adds to an E-mail in filter mode
  # It contains the "classification" of the mail, plus the score.
! hammie_header_name: X-Spambayes-Classification
  
  # The three disposition names are added to the header as the following
  # Three words:
! header_spam_string: spam
! header_ham_string: ham
! header_unsure_string: unsure
  
  # Accuracy of the score in the header in decimal digits


From timstone4@users.sourceforge.net  Mon Nov 25 04:59:53 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sun, 24 Nov 2002 20:59:53 -0800
Subject: [Spambayes-checkins] spambayes Corpus.py,1.2.2.2,1.2.2.3
Message-ID: <E18GBLd-000559-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19513

Modified Files:
      Tag: hammie-playground
	Corpus.py 
Log Message:
Changed use of Verbose to __debug__ (I'm learning all the time)

Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.2.2.2
retrieving revision 1.2.2.3
diff -C2 -d -r1.2.2.2 -r1.2.2.3
*** Corpus.py	22 Nov 2002 02:07:33 -0000	1.2.2.2
--- Corpus.py	25 Nov 2002 04:59:51 -0000	1.2.2.3
***************
*** 90,94 ****
  SPAM = True
  HAM = False
- Verbose = False
  
  class Corpus:
--- 90,93 ----
***************
*** 115,119 ****
          '''Add a Message to this corpus'''
  
!         if Verbose:
              print 'adding message %s to corpus' % (message.key())
  
--- 114,118 ----
          '''Add a Message to this corpus'''
  
!         if __debug__:
              print 'adding message %s to corpus' % (message.key())
  
***************
*** 134,138 ****
  
          key = message.key()
!         if Verbose:
              print 'removing message %s from corpus' % (key)
          self.unCacheMessage(key)
--- 133,137 ----
  
          key = message.key()
!         if __debug__:
              print 'removing message %s from corpus' % (key)
          self.unCacheMessage(key)
***************
*** 152,156 ****
          key = message.key()
  
!         if Verbose:
              print 'placing %s in corpus cache' % (key)
  
--- 151,155 ----
          key = message.key()
  
!         if __debug__:
              print 'placing %s in corpus cache' % (key)
  
***************
*** 169,173 ****
          # This method should probably not be overridden
  
!         if Verbose:
              print 'Flushing %s from corpus cache' % (key)
  
--- 168,172 ----
          # This method should probably not be overridden
  
!         if __debug__:
              print 'Flushing %s from corpus cache' % (key)
  
***************
*** 249,253 ****
              Corpus.cacheMessage(self, msg)
          else:
!             if Verbose:
                  print 'Not caching %s because it has expired' % (msg.key())
              raise KeyError, msg
--- 248,252 ----
              Corpus.cacheMessage(self, msg)
          else:
!             if __debug__:
                  print 'Not caching %s because it has expired' % (msg.key())
              raise KeyError, msg
***************
*** 262,266 ****
                  msg = self[key]
              except KeyError, e:
!                 if Verbose:
                      print 'message %s has expired' % (key)
                  self.removeMessage(e[0])
--- 261,265 ----
                  msg = self[key]
              except KeyError, e:
!                 if __debug__:
                      print 'message %s has expired' % (key)
                  self.removeMessage(e[0])


From timstone4@users.sourceforge.net  Mon Nov 25 05:00:05 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sun, 24 Nov 2002 21:00:05 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.2.2.2,1.2.2.3
Message-ID: <E18GBLp-00056q-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19579

Modified Files:
      Tag: hammie-playground
	FileCorpus.py 
Log Message:
Changed use of Verbose to __debug__ (I'm learning all the time)

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.2.2.2
retrieving revision 1.2.2.3
diff -C2 -d -r1.2.2.2 -r1.2.2.3
*** FileCorpus.py	22 Nov 2002 02:08:06 -0000	1.2.2.2
--- FileCorpus.py	25 Nov 2002 05:00:02 -0000	1.2.2.3
***************
*** 37,41 ****
          options:
              -h : show this message
!             -v : execute in verbose mode, useful for general understanding
                   and debugging purposes
              -g : use GzipFileMessage and GzipFileMessageFactory
--- 37,41 ----
          options:
              -h : show this message
!             -v : execute in __debug__ mode, useful for general understanding
                   and debugging purposes
              -g : use GzipFileMessage and GzipFileMessageFactory
***************
*** 133,137 ****
              raise ValueError
  
!         if Corpus.Verbose:
              print 'adding',message.key(),'to corpus'
  
--- 133,137 ----
              raise ValueError
  
!         if __debug__:
              print 'adding',message.key(),'to corpus'
  
***************
*** 145,149 ****
          '''Remove a Message from this corpus'''
  
!         if Corpus.Verbose:
              print 'removing',message.key(),'from corpus'
  
--- 145,149 ----
          '''Remove a Message from this corpus'''
  
!         if __debug__:
              print 'removing',message.key(),'from corpus'
  
***************
*** 163,167 ****
              s = ''
  
!         if Corpus.Verbose and nummsgs > 0:
              lst = ', ' + '%s' % (self.keys())
          else:
--- 163,167 ----
              s = ''
  
!         if __debug__ and nummsgs > 0:
              lst = ', ' + '%s' % (self.keys())
          else:
***************
*** 205,209 ****
          '''Read the Message substance from the file'''
  
!         if Corpus.Verbose:
              print 'loading', self.file_name
  
--- 205,209 ----
          '''Read the Message substance from the file'''
  
!         if __debug__:
              print 'loading', self.file_name
  
***************
*** 221,225 ****
          '''Write the Message substance to the file'''
  
!         if Corpus.Verbose:
              print 'storing', self.file_name
  
--- 221,225 ----
          '''Write the Message substance to the file'''
  
!         if __debug__:
              print 'storing', self.file_name
  
***************
*** 232,236 ****
          '''Message hara-kiri'''
  
!         if Corpus.Verbose:
              print 'physically deleting file',self.pathname()
  
--- 232,236 ----
          '''Message hara-kiri'''
  
!         if __debug__:
              print 'physically deleting file',self.pathname()
  
***************
*** 251,255 ****
          sub = self.getSubstance()
          
!         if Corpus.Verbose:
              sub = self.getSubstance()
          else:
--- 251,255 ----
          sub = self.getSubstance()
          
!         if __debug__:
              sub = self.getSubstance()
          else:
***************
*** 294,298 ****
          '''Read the Message substance from the file'''
  
!         if Corpus.Verbose:
              print 'loading', self.file_name
  
--- 294,298 ----
          '''Read the Message substance from the file'''
  
!         if __debug__:
              print 'loading', self.file_name
  
***************
*** 312,316 ****
          '''Write the Message substance to the file'''
  
!         if Corpus.Verbose:
              print 'storing', self.file_name
  
--- 312,316 ----
          '''Write the Message substance to the file'''
  
!         if __debug__:
              print 'storing', self.file_name
  
***************
*** 687,691 ****
          sys.exit()
  
-     Corpus.Verbose = False
      runTestServer = False
      setupTestServer = False
--- 687,690 ----
***************
*** 707,712 ****
          elif opt == '-c':
              cleanupTestServer = True
-         elif opt == '-v':
-             Corpus.Verbose = True
          elif opt == '-g':
              useGzip = True
--- 706,709 ----


From timstone4@users.sourceforge.net  Mon Nov 25 05:00:18 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Sun, 24 Nov 2002 21:00:18 -0800
Subject: [Spambayes-checkins] spambayes Persistent.py,1.1.2.2,1.1.2.3
Message-ID: <E18GBM2-00058O-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19721

Modified Files:
      Tag: hammie-playground
	Persistent.py 
Log Message:
Changed use of Verbose to __debug__ (I'm learning all the time)

Index: Persistent.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Persistent.py,v
retrieving revision 1.1.2.2
retrieving revision 1.1.2.3
diff -C2 -d -r1.1.2.2 -r1.1.2.3
*** Persistent.py	23 Nov 2002 23:57:22 -0000	1.1.2.2
--- Persistent.py	25 Nov 2002 05:00:16 -0000	1.1.2.3
***************
*** 157,161 ****
  
  class DBDictClassifier(PersistentClassifier):
!     '''Classifier object persisted in a WIDict'''
  
      def __init__(self, db_name, mode='c'):
--- 157,161 ----
  
  class DBDictClassifier(PersistentClassifier):
!     '''Classifier object persisted in a DBDict'''
  
      def __init__(self, db_name, mode='c'):
***************
*** 167,174 ****
  
      def load(self):
!         '''Load state from WIDict'''
  
          if __debug__:
!             print 'Loading state from',self.db_name,'WIDict'
  
          self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
--- 167,174 ----
  
      def load(self):
!         '''Load state from DBDict'''
  
          if __debug__:
!             print 'Loading state from',self.db_name,'DBDict'
  
          self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
***************
*** 194,198 ****
  
          if __debug__:
!             print 'Persisting',self.db_name,'state in WIDict'
  
          self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam())
--- 194,198 ----
  
          if __debug__:
!             print 'Persisting',self.db_name,'state in DBDict'
  
          self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam())


From tim.one@comcast.net  Mon Nov 25 05:06:23 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 25 Nov 2002 00:06:23 -0500
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.2.2.2,1.2.2.3
In-Reply-To: <E18GBLp-00056q-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEFACPAB.tim.one@comcast.net>

[Tim Stone]
> Changed use of Verbose to __debug__ (I'm learning all the time)

You should probably unlearn this one <wink>.  __debug__ is normal -- it's
rare that anyone bothers to run Python with -O.  It's indeed idiomatic to
define a "verbose" variable instead (and also idiomatic for attributes to have
names beginning with a lowercase letter).
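
A rough sketch of the idiom described above, assuming a made-up
add_message() function: __debug__ is true in every normal run and only
becomes false under python -O, so it cannot serve as a chatty/quiet switch.
A lowercase module-level flag, typically set from a -v option, can.

```python
import io

# Module-level switch, normally set by command-line handling (e.g. -v).
verbose = False

def add_message(key, out):
    """Pretend to add a message; report it only in verbose mode."""
    if verbose:
        out.write("adding message %s to corpus\n" % key)

quiet = io.StringIO()
add_message("msg-1", quiet)      # flag is off: nothing is written

verbose = True                   # what handling "-v" would do
chatty = io.StringIO()
add_message("msg-1", chatty)     # flag is on: one line is written
```

Guarding the same print with "if __debug__:" would make it fire on every
run unless the interpreter was started with -O.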


From mhammond@users.sourceforge.net  Mon Nov 25 05:57:43 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 24 Nov 2002 21:57:43 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.35,1.36
Message-ID: <E18GCFb-0005CF-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv19768

Modified Files:
	msgstore.py 
Log Message:
Missed a place in the int->double re-conversion


Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** msgstore.py	23 Nov 2002 12:00:02 -0000	1.35
--- msgstore.py	25 Nov 2002 05:57:41 -0000	1.36
***************
*** 308,312 ****
          resolve_props = ( (mapi.PS_PUBLIC_STRINGS, "Spam"), )
          resolve_ids = folder.GetIDsFromNames(resolve_props, 0)
!         field_id = PROP_TAG( PT_I4, PROP_ID(resolve_ids[0]))
          # Setup the properties we want to read.
          prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_MESSAGE_FLAGS
--- 308,312 ----
          resolve_props = ( (mapi.PS_PUBLIC_STRINGS, "Spam"), )
          resolve_ids = folder.GetIDsFromNames(resolve_props, 0)
!         field_id = PROP_TAG( PT_DOUBLE, PROP_ID(resolve_ids[0]))
          # Setup the properties we want to read.
          prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_MESSAGE_FLAGS


From mhammond@users.sourceforge.net  Mon Nov 25 06:02:36 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 24 Nov 2002 22:02:36 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.18,1.19
Message-ID: <E18GCKK-0005u8-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv22540

Modified Files:
	train.py 
Log Message:
Fixes for new interface to main engine.


Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** train.py	13 Nov 2002 19:26:27 -0000	1.18
--- train.py	25 Nov 2002 06:02:34 -0000	1.19
***************
*** 43,50 ****
      if was_spam is not None:
          # The classification has changed; unlearn the old classification.
!         mgr.bayes.unlearn(tokenize(stream), was_spam, False)
  
      # Learn the correct classification.
!     mgr.bayes.learn(tokenize(stream), is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
--- 43,50 ----
      if was_spam is not None:
          # The classification has changed; unlearn the old classification.
!         mgr.bayes.unlearn(tokenize(stream), was_spam)
  
      # Learn the correct classification.
!     mgr.bayes.learn(tokenize(stream), is_spam)
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
***************
*** 53,57 ****
      if rescore:
          import filter
-         mgr.bayes.update_probabilities()  # else rescoring gives the same score
          filter.filter_message(msg, mgr, all_actions = False)
  
--- 53,56 ----


From npickett@users.sourceforge.net  Mon Nov 25 06:22:28 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Sun, 24 Nov 2002 22:22:28 -0800
Subject: [Spambayes-checkins] spambayes storage.py,NONE,1.1
	FileCorpus.py,1.3,1.4
	classifier.py,1.55,1.56 hammie.py,1.41,1.42 hammiebulk.py,1.2,1.3
	pop3proxy.py,1.19,1.20 Persistent.py,1.2,NONE
Message-ID: <E18GCdY-0000VY-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv1046

Modified Files:
	FileCorpus.py classifier.py hammie.py hammiebulk.py 
	pop3proxy.py 
Added Files:
	storage.py 
Removed Files:
	Persistent.py 
Log Message:
* renamed Persistent.py to storage.py
* removed PersistentClassifier class, moved classify() method to
  classifier.Classifier class.
* This cvs commit has a lot of class ;)


--- NEW FILE: storage.py ---
#! /usr/bin/env python

'''storage.py - Spambayes database management framework.

Classes:
    PickledClassifier - Classifier that uses a pickle db
    DBDictClassifier - Classifier that uses a DBDict db
    Trainer - Classifier training observer
    SpamTrainer - Trainer for spam
    HamTrainer - Trainer for ham

Abstract:
    The *Classifier classes are subclasses of Classifier (classifier.Classifier)
    that add automatic state store/restore functions to the Classifier class.

    PickledClassifier is a Classifier class that uses a cPickle
    datastore.  This database is relatively small, but slower than other
    databases.

    DBDictClassifier is a Classifier class that uses a DBDict
    datastore.

    Trainer is a concrete class that observes a Corpus and trains a
    Classifier object based upon movement of messages between corpora.  When
    an add message notification is received, the trainer trains the
    database with the message, as spam or ham as appropriate given the
    type of trainer (spam or ham).  When a remove message notification
    is received, the trainer untrains the database as appropriate.

    SpamTrainer and HamTrainer are convenience subclasses of Trainer that
    initialize as the appropriate type of Trainer.

To Do:
    o ZODBClassifier
    o Would Trainer.trainall really want to train with the whole corpus,
        or just a random subset?
    o Suggestions?

    '''

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, Neale Pickett, \
all the spambayes contributors."

import classifier
from Options import options
import cPickle as pickle
import dbdict
import errno

PICKLE_TYPE = 1
NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
UPDATEPROBS = True       # Probabilities will be autoupdated with training
DEBUG = False

class PickledClassifier(classifier.Classifier):
    '''Classifier object persisted in a pickle'''

    def __init__(self, db_name):
        classifier.Classifier.__init__(self)
        self.db_name = db_name
        self.load()

    def load(self):
        '''Load this instance from the pickle.'''
        # This is a bit strange, because the loading process
        # creates a temporary instance of PickledClassifier, from which
        # this object's state is copied.  This is a nuance of the way
        # that pickle does its job

        if DEBUG:
            print 'Loading state from',self.db_name,'pickle'

        tempbayes = None
        try:
            fp = open(self.db_name, 'rb')
        except IOError, e:
            if e.errno != errno.ENOENT: raise
        else:
            tempbayes = pickle.load(fp)
            fp.close()

        if tempbayes:
            self.wordinfo = tempbayes.wordinfo
            self.meta.nham = tempbayes.get_nham()
            self.meta.nspam = tempbayes.get_nspam()

            if DEBUG:
                print '%s is an existing pickle, with %d ham and %d spam' \
                      % (self.db_name, self.nham, self.nspam)
        else:
            # new pickle
            if DEBUG:
                print self.db_name,'is a new pickle'
            self.wordinfo = {}
            self.meta.nham = 0
            self.meta.nspam = 0

    def store(self):
        '''Store self as a pickle'''

        if DEBUG:
            print 'Persisting',self.db_name,'as a pickle'

        fp = open(self.db_name, 'wb')
        pickle.dump(self, fp, PICKLE_TYPE)
        fp.close()

    def __getstate__(self):
        return PICKLE_TYPE, self.wordinfo, self.meta

    def __setstate__(self, t):
        if t[0] != PICKLE_TYPE:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
        self.wordinfo, self.meta = t[1:]


class DBDictClassifier(classifier.Classifier):
    '''Classifier object persisted in a DBDict'''

    def __init__(self, db_name, mode='c'):
        '''Constructor(database name)'''

        classifier.Classifier.__init__(self)
        self.statekey = "saved state"
        self.mode = mode
        self.db_name = db_name
        self.load()

    def load(self):
        '''Load state from DBDict'''

        if DEBUG:
            print 'Loading state from',self.db_name,'WIDict'

        self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
                             classifier.WordInfo,iterskip=[self.statekey])

        if self.wordinfo.has_key(self.statekey):
            (nham, nspam) = self.wordinfo[self.statekey]
            self.set_nham(nham)
            self.set_nspam(nspam)

            if DEBUG:
                print '%s is an existing DBDict, with %d ham and %d spam' \
                      % (self.db_name, self.nham, self.nspam)
        else:
            # new dbdict
            if DEBUG:
                print self.db_name,'is a new DBDict'
            self.set_nham(0)
            self.set_nspam(0)

    def store(self):
        '''Place state into persistent store'''

        if DEBUG:
            print 'Persisting',self.db_name,'state in WIDict'

        self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam())
        self.wordinfo.sync()


class Trainer:
    '''Associates a Classifier object and one or more Corpora, \
    is an observer of the corpora'''

    def __init__(self, bayes, is_spam, updateprobs=NO_UPDATEPROBS):
        '''Constructor(Classifier, is_spam(True|False), updateprobs(True|False))'''

        self.bayes = bayes
        self.is_spam = is_spam
        self.updateprobs = updateprobs

    def onAddMessage(self, message):
        '''A message is being added to an observed corpus.'''

        self.train(message)

    def train(self, message):
        '''Train the database with the message'''

        if DEBUG:
            print 'training with',message.key()

        self.bayes.learn(message.tokenize(), self.is_spam)
#                         self.updateprobs)

    def onRemoveMessage(self, message):
        '''A message is being removed from an observed corpus.'''

        self.untrain(message)

    def untrain(self, message):
        '''Untrain the database with the message'''

        if DEBUG:
            print 'untraining with',message.key()

        self.bayes.unlearn(message.tokenize(), self.is_spam)
#                           self.updateprobs)
        # can raise ValueError if database is fouled.  If this is the case,
        # then retraining is the only recovery option.

    def trainAll(self, corpus):
        '''Train all the messages in the corpus'''

        for msg in corpus:
            self.train(msg)

    def untrainAll(self, corpus):
        '''Untrain all the messages in the corpus'''

        for msg in corpus:
            self.untrain(msg)


class SpamTrainer(Trainer):
    '''Trainer for spam'''

    def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
        '''Constructor'''

        Trainer.__init__(self, bayes, True, updateprobs)


class HamTrainer(Trainer):
    '''Trainer for ham'''

    def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
        '''Constructor'''

        Trainer.__init__(self, bayes, False, updateprobs)


if __name__ == '__main__':
    print >>sys.stderr, __doc__
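
The observer arrangement described in the storage.py docstring can be
sketched as follows.  This is only an illustration in modern Python, not
the project's code: ToyClassifier and ToyCorpus are hypothetical stand-ins
for classifier.Classifier and the Corpus classes, messages are plain
strings split on whitespace, and the addMessage/removeMessage names are
assumptions made for the example.

```python
class ToyClassifier:
    """Stand-in for classifier.Classifier: it only counts tokens per class."""

    def __init__(self):
        self.counts = {}  # token -> [ham count, spam count]

    def learn(self, tokens, is_spam):
        for tok in tokens:
            self.counts.setdefault(tok, [0, 0])[int(is_spam)] += 1

    def unlearn(self, tokens, is_spam):
        for tok in tokens:
            record = self.counts[tok]  # KeyError if the token was never learned
            record[int(is_spam)] -= 1
            if record == [0, 0]:
                del self.counts[tok]


class Trainer:
    """Observer that trains on added messages and untrains on removed ones."""

    def __init__(self, bayes, is_spam):
        self.bayes = bayes
        self.is_spam = is_spam

    def onAddMessage(self, message):
        self.bayes.learn(message.split(), self.is_spam)

    def onRemoveMessage(self, message):
        self.bayes.unlearn(message.split(), self.is_spam)


class ToyCorpus:
    """Stand-in corpus that notifies its observers about every change."""

    def __init__(self):
        self.messages = {}
        self.observers = []

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, key, text):
        self.messages[key] = text
        for observer in self.observers:
            observer.onAddMessage(text)

    def removeMessage(self, key):
        text = self.messages.pop(key)
        for observer in self.observers:
            observer.onRemoveMessage(text)
```

Moving a message between a spam corpus and a ham corpus then amounts to a
removeMessage on one and an addMessage on the other, and the two trainers
keep the shared classifier in step without the caller touching it.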

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** FileCorpus.py	25 Nov 2002 02:29:44 -0000	1.3
--- FileCorpus.py	25 Nov 2002 06:22:26 -0000	1.4
***************
*** 86,90 ****
  
  import Corpus
! import Persistent
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
  
--- 86,90 ----
  
  import Corpus
! import storage
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
  
***************
*** 344,355 ****
  
      print '\n\nCreating two Classifier databases'
!     miscbayes = Persistent.PickledClassifier('fctestmisc.bayes')
!     classbayes = Persistent.DBDictClassifier('fctestclass.bayes')
  
      print '\n\nSetting up spam corpus'
      spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
!     spamtrainer = Persistent.SpamTrainer(miscbayes)
      spamcorpus.addObserver(spamtrainer)
!     anotherspamtrainer = Persistent.SpamTrainer(classbayes, Persistent.UPDATEPROBS)
      spamcorpus.addObserver(anotherspamtrainer)
  
--- 344,355 ----
  
      print '\n\nCreating two Classifier databases'
!     miscbayes = storage.PickledClassifier('fctestmisc.bayes')
!     classbayes = storage.DBDictClassifier('fctestclass.bayes')
  
      print '\n\nSetting up spam corpus'
      spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
!     spamtrainer = storage.SpamTrainer(miscbayes)
      spamcorpus.addObserver(spamtrainer)
!     anotherspamtrainer = storage.SpamTrainer(classbayes, storage.UPDATEPROBS)
      spamcorpus.addObserver(anotherspamtrainer)
  
***************
*** 366,370 ****
                            'fctesthamcorpus', \
                            'MSG*')
!     hamtrainer = Persistent.HamTrainer(miscbayes)
      hamcorpus.addObserver(hamtrainer)
      hamtrainer.trainAll(hamcorpus)
--- 366,370 ----
                            'fctesthamcorpus', \
                            'MSG*')
!     hamtrainer = storage.HamTrainer(miscbayes)
      hamcorpus.addObserver(hamtrainer)
      hamtrainer.trainAll(hamcorpus)
***************
*** 420,424 ****
  
      print '\n\nTrain with an individual message'
!     anotherhamtrainer = Persistent.HamTrainer(classbayes)
      anotherhamtrainer.train(unsurecorpus['MSG00005'])
  
--- 420,424 ----
  
      print '\n\nTrain with an individual message'
!     anotherhamtrainer = storage.HamTrainer(classbayes)
      anotherhamtrainer.train(unsurecorpus['MSG00005'])
  
***************
*** 723,725 ****
          print >>sys.stderr, __doc__
  
!        
\ No newline at end of file
--- 723,725 ----
          print >>sys.stderr, __doc__
  
!        

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.55
retrieving revision 1.56
diff -C2 -d -r1.55 -r1.56
*** classifier.py	25 Nov 2002 04:11:29 -0000	1.55
--- classifier.py	25 Nov 2002 06:22:26 -0000	1.56
***************
*** 158,161 ****
--- 158,177 ----
      # spamprob, depending on option settings.
  
+     def classify(self, message):
+         """Return the classification of a message as a string."""
+ 
+         prob = self.spamprob(message.tokenize())
+ 
+         message.setSpamprob(prob)       # don't like this
+ 
+         if prob < options.ham_cutoff:
+             type = options.header_ham_string
+         elif prob > options.spam_cutoff:
+             type = options.header_spam_string
+         else:
+             type = options.header_unsure_string
+ 
+         return type
+ 
      def gary_spamprob(self, wordstream, evidence=False):
          """Return best-guess probability that wordstream is spam.

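The three-way cutoff logic that classify() adds in the diff above can be
tried in isolation with a sketch like the one below.  The cutoff values and
the returned strings are assumptions chosen for the example; the real
method reads them from options.ham_cutoff, options.spam_cutoff and the
options.header_*_string settings.

```python
HAM_CUTOFF = 0.20    # assumed value, stands in for options.ham_cutoff
SPAM_CUTOFF = 0.90   # assumed value, stands in for options.spam_cutoff


def classify(prob, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability onto 'ham', 'spam' or 'unsure'."""
    if prob < ham_cutoff:
        return 'ham'
    elif prob > spam_cutoff:
        return 'spam'
    else:
        # Scores on or between the two cutoffs are left undecided.
        return 'unsure'
```

Both comparisons are strict, so a score exactly equal to either cutoff
falls into the unsure band.
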
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** hammie.py	25 Nov 2002 02:29:44 -0000	1.41
--- hammie.py	25 Nov 2002 06:22:26 -0000	1.42
***************
*** 4,8 ****
  import dbdict
  import mboxutils
! import Persistent
  from Options import options
  from tokenizer import tokenize
--- 4,8 ----
  import dbdict
  import mboxutils
! import storage
  from Options import options
  from tokenizer import tokenize
***************
*** 180,186 ****
  
      if usedb:
!         b = Persistent.DBDictClassifier(filename, mode)
      else:
!         b = Persistent.PickledClassifier(filename)
      return Hammie(b)
  
--- 180,186 ----
  
      if usedb:
!         b = storage.DBDictClassifier(filename, mode)
      else:
!         b = storage.PickledClassifier(filename)
      return Hammie(b)
  

Index: hammiebulk.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiebulk.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** hammiebulk.py	25 Nov 2002 02:29:44 -0000	1.2
--- hammiebulk.py	25 Nov 2002 06:22:26 -0000	1.3
***************
*** 52,56 ****
  import mboxutils
  import classifier
! import Persistent
  import hammie
  import Corpus
--- 52,56 ----
  import mboxutils
  import classifier
! import storage
  import hammie
  import Corpus
***************
*** 104,117 ****
                  print h.formatclues(clues)
      return (spams, hams)
- 
- def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
-     """Create a Bayes instance for the given pickle (which
-     doesn't have to exist).  Create a PersistentBayes if
-     usedb is True."""
-     if usedb:
-         bayes = Persistent.DBDictClassifier(pck, mode)
-     else:
-         bayes = Persistent.PickledClassifier(pck)
-     return bayes
  
  def usage(code, msg=''):
--- 104,107 ----

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** pop3proxy.py	25 Nov 2002 02:29:44 -0000	1.19
--- pop3proxy.py	25 Nov 2002 06:22:26 -0000	1.20
***************
*** 119,123 ****
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import Persistent, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
--- 119,123 ----
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import storage, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
***************
*** 1203,1209 ****
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:
!             self.bayes = Persistent.DBDictClassifier(self.databaseFilename)
          else:
!             self.bayes = Persistent.PickledClassifier(self.databaseFilename)
          print "Done."
  
--- 1203,1209 ----
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:
!             self.bayes = storage.DBDictClassifier(self.databaseFilename)
          else:
!             self.bayes = storage.PickledClassifier(self.databaseFilename)
          print "Done."
  
***************
*** 1230,1235 ****
  
              # Create the Trainers.
!             self.spamTrainer = Persistent.SpamTrainer(self.bayes)
!             self.hamTrainer = Persistent.HamTrainer(self.bayes)
              self.spamCorpus.addObserver(self.spamTrainer)
              self.hamCorpus.addObserver(self.hamTrainer)
--- 1230,1235 ----
  
              # Create the Trainers.
!             self.spamTrainer = storage.SpamTrainer(self.bayes)
!             self.hamTrainer = storage.HamTrainer(self.bayes)
              self.spamCorpus.addObserver(self.spamTrainer)
              self.hamCorpus.addObserver(self.hamTrainer)

--- Persistent.py DELETED ---


From neale@woozle.org  Mon Nov 25 06:25:29 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 22:25:29 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.2.2.2,1.2.2.3
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFACPAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCOEFACPAB.tim.one@comcast.net>
Message-ID: <w53k7j2dv3a.fsf@woozle.org>

So then, Tim Peters <tim.one@comcast.net> is all like:

> [Tim Stone]
> > Changed use of Verbose to __debug__ (I'm learning all the time)
> 
> You should probably unlearn this one <wink>.  __debug__ is normal -- it's
> rare that anyone bothers to run Python with -O.  It's indeed idiomatic to
> define a "verbose" vrbl instead (and also idiomatic for attributes to have
> names beginning with a lowercase letter).

You can blame me for teaching him that one.  I've unlearned it already
in the storage.py module.  A verbose variable might not be a bad idea
as a globally recognized option.
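
For what it's worth, a module-level verbose flag along the lines Tim
describes might look like the sketch below (modern Python; the names log,
set_verbose and load_state are made up for illustration and are not taken
from storage.py).  Unlike __debug__, which is true unless Python is run
with -O, the flag stays off until somebody asks for output.

```python
import sys

verbose = False  # module-level switch; an option parser or a caller sets it


def set_verbose(flag):
    """Turn diagnostic output on or off for this module."""
    global verbose
    verbose = bool(flag)


def log(*args, stream=None):
    """Print a diagnostic line only when the verbose flag is on."""
    if verbose:
        out = stream if stream is not None else sys.stderr
        print(*args, file=out)


def load_state(db_name, stream=None):
    """Hypothetical loader that reports what it is doing when asked to."""
    log('Loading state from', db_name, stream=stream)
    return {}
```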

From hooft@users.sourceforge.net  Mon Nov 25 13:20:04 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Mon, 25 Nov 2002 05:20:04 -0800
Subject: [Spambayes-checkins] spambayes CostCounter.py,1.4,1.5
Message-ID: <E18GJ9g-0002nF-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv10681

Modified Files:
	CostCounter.py 
Log Message:
protect against division by zero

Index: CostCounter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CostCounter.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** CostCounter.py	19 Nov 2002 21:54:57 -0000	1.4
--- CostCounter.py	25 Nov 2002 13:20:01 -0000	1.5
***************
*** 95,111 ****
           return ("Total messages: %d; %d (%.1f%%) ham + %d (%.1f%%) spam\n"%(
                       self._total,
!                      self._ham, (100.*self._ham)/self._total,
!                      self._spam, (100.*self._spam)/self._total)+
                   "Ham: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fp\n"%(
!                      self._correctham, (100.*self._correctham)/self._ham,
!                      self._unsureham, (100.*self._unsureham)/self._ham,
!                      self._fp, (100.*self._fp)/self._ham)+
                   "Spam: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fn\n"%(
!                      self._correctspam, (100.*self._correctspam)/self._spam,
!                      self._unsurespam, (100.*self._unsurespam)/self._spam,
!                      self._fn, (100.*self._fn)/self._spam)+
                   "Score False: %.2f%% Unsure %.2f%%"%(
!                      (100.*(self._fp+self._fn))/self._total,
!                      (100.*self._unsure)/self._total))
  
  class StdCostCounter(CostCounter):
--- 95,117 ----
           return ("Total messages: %d; %d (%.1f%%) ham + %d (%.1f%%) spam\n"%(
                       self._total,
!                      self._ham, zd(100.*self._ham,self._total),
!                      self._spam, zd(100.*self._spam,self._total))+
                   "Ham: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fp\n"%(
!                      self._correctham, zd(100.*self._correctham,self._ham),
!                      self._unsureham, zd(100.*self._unsureham,self._ham),
!                      self._fp, zd(100.*self._fp,self._ham))+
                   "Spam: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fn\n"%(
!                      self._correctspam, zd(100.*self._correctspam,self._spam),
!                      self._unsurespam, zd(100.*self._unsurespam,self._spam),
!                      self._fn, zd(100.*self._fn,self._spam))+
                   "Score False: %.2f%% Unsure %.2f%%"%(
!                      zd(100.*(self._fp+self._fn),self._total),
!                      zd(100.*self._unsure,self._total)))
! 
! def zd(x,y):
!     if y > 0:
!        return x / y
!     else:
!        return 0
  
  class StdCostCounter(CostCounter):
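
The guard added in this check-in can be exercised on its own with a sketch
like the following.  zd mirrors the helper in the diff; percent is a
hypothetical wrapper added only for the example.

```python
def zd(x, y):
    """Return x / y, or 0 when the divisor is not positive."""
    if y > 0:
        return x / y
    else:
        return 0


def percent(part, total):
    """Percentage of total made up by part; 0 when nothing was counted yet."""
    return zd(100.0 * part, total)
```

With no messages counted, percent(0, 0) yields 0 instead of raising
ZeroDivisionError, which is what lets the summary string be formatted
before any ham or spam has been seen.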


From hooft@users.sourceforge.net  Mon Nov 25 13:22:39 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Mon, 25 Nov 2002 05:22:39 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.5,1.6
Message-ID: <E18GJCB-0003DS-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv12331

Modified Files:
	weaktest.py 
Log Message:
adapt to new update philosophy; add one new training system

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** weaktest.py	19 Nov 2002 22:38:37 -0000	1.5
--- weaktest.py	25 Nov 2002 13:22:37 -0000	1.6
***************
*** 23,28 ****
      -d decider 
          Name of the decider. One of %(decisionkeys)s
-     -u updater
-         Name of the updater. One of %(updaterkeys)s
      -m min
          Minimal number of messages to train on before involving the decider.
--- 23,26 ----
***************
*** 54,57 ****
--- 52,59 ----
      sys.exit(code)
  
+ DONT_TRAIN = None
+ TRAIN_AS_HAM = 1
+ TRAIN_AS_SPAM = 2
+ 
  class TrainDecision:
      def __call__(self,scr,is_spam):
***************
*** 63,89 ****
  class UnsureAndFalses(TrainDecision):
      def spamtrain(self,scr):
!         return scr < options.spam_cutoff
  
      def hamtrain(self,scr):
!         return scr > options.ham_cutoff
  
  class UnsureOnly(TrainDecision):
      def spamtrain(self,scr):
!         return options.ham_cutoff < scr < options.spam_cutoff
  
!     hamtrain = spamtrain
  
  class All(TrainDecision):
      def spamtrain(self,scr):
!         return 1
  
!     hamtrain = spamtrain
  
  class AllBut0and100(TrainDecision):
      def spamtrain(self,scr):
!         return scr < 0.995
  
      def hamtrain(self,scr):
!         return scr > 0.005
  
  decisions={'all': All,
--- 65,112 ----
  class UnsureAndFalses(TrainDecision):
      def spamtrain(self,scr):
!         if scr < options.spam_cutoff:
! 	    return TRAIN_AS_SPAM
  
      def hamtrain(self,scr):
!         if scr > options.ham_cutoff:
! 	    return TRAIN_AS_HAM
  
  class UnsureOnly(TrainDecision):
      def spamtrain(self,scr):
!         if options.ham_cutoff < scr < options.spam_cutoff:
! 	    return TRAIN_AS_SPAM
  
!     def hamtrain(self,scr):
!         if options.ham_cutoff < scr < options.spam_cutoff:
! 	    return TRAIN_AS_HAM
  
  class All(TrainDecision):
      def spamtrain(self,scr):
!         return TRAIN_AS_SPAM
  
!     def hamtrain(self,scr):
!         return TRAIN_AS_HAM
  
  class AllBut0and100(TrainDecision):
      def spamtrain(self,scr):
!         if scr < 0.995:
! 	    return TRAIN_AS_SPAM
  
      def hamtrain(self,scr):
!         if scr > 0.005:
!             return TRAIN_AS_HAM
! 
! class OwnDecision(TrainDecision):
!     def hamtrain(self,scr):
!         if scr < options.ham_cutoff:
! 	    return TRAIN_AS_HAM
!         elif scr > options.spam_cutoff:
! 	    return TRAIN_AS_SPAM
! 
!     spamtrain = hamtrain
! 
! class OwnDecisionFNCorrection(OwnDecision):
!     def spamtrain(self,scr):
!         return TRAIN_AS_SPAM
  
  decisions={'all': All,
***************
*** 91,94 ****
--- 114,119 ----
             'unsureonly': UnsureOnly,
             'unsureandfalses': UnsureAndFalses,
+            'owndecision': OwnDecision,
+            'owndecision+fn': OwnDecisionFNCorrection,
            }
  decisionkeys=decisions.keys()
***************
*** 104,108 ****
          self.x += 1
          if self.tooearly():
!             return True
          else:
              return self.client(scr,is_spam)
--- 129,136 ----
          self.x += 1
          if self.tooearly():
!             if is_spam:
! 		return TRAIN_AS_SPAM
!             else:
! 		return TRAIN_AS_HAM
          else:
              return self.client(scr,is_spam)
***************
*** 118,143 ****
          self.d=d
  
! class AlwaysUpdate(Updater):
!     def __call__(self):
!         self.d.update_probabilities()
! 
! class SometimesUpdate(Updater):
!     def __init__(self,d=None,factor=10):
!         Updater.__init__(self,d)
!         self.factor=factor
!         self.n = 0
! 
!     def __call__(self):
!         self.n += 1
!         if self.n % self.factor == 0:
!             self.d.update_probabilities()
! 
! updaters={'always':AlwaysUpdate,
!           'sometimes':SometimesUpdate,
!          }
! updaterkeys=updaters.keys()
! updaterkeys.sort()
! 
! def drive(nsets,decision,updater):
      print options.display()
  
--- 146,150 ----
          self.d=d
  
! def drive(nsets,decision):
      print options.display()
  
***************
*** 156,161 ****
          allfns[fn] = None
  
!     d = hammie.Hammie(hammie.createbayes('weaktest.db', False))
!     updater.setd(d)
  
      hamtrain = 0
--- 163,167 ----
          allfns[fn] = None
  
!     d = hammie.open('weaktest.db', False)
  
      hamtrain = 0
***************
*** 179,190 ****
                      print "Ham with score %.2f"%scr
                  cc.ham(scr)
!         if decision(scr,is_spam):
!             if is_spam:
!                 d.train_spam(m)
!                 spamtrain += 1
!             else:
!                 d.train_ham(m)
!                 hamtrain += 1
!             updater()
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d"%(
--- 185,195 ----
                      print "Ham with score %.2f"%scr
                  cc.ham(scr)
!         de = decision(scr,is_spam) 
!         if de == TRAIN_AS_SPAM: 
!             d.train_spam(m)
!             spamtrain += 1
!         elif de == TRAIN_AS_HAM:
!             d.train_ham(m)
!             hamtrain += 1
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d"%(
***************
*** 202,206 ****
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'vd:u:hn:m:')
      except getopt.error, msg:
          usage(1, msg)
--- 207,211 ----
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'vd:hn:m:')
      except getopt.error, msg:
          usage(1, msg)
***************
*** 208,212 ****
      nsets = None
      decision = decisions['unsureonly']
-     updater = updaters['always']
      m = 10
  
--- 213,216 ----
***************
*** 224,231 ****
                  usage(1,'Unknown decisionmaker')
              decision = decisions[arg]
-         elif opt == '-u':
-             if not updaters.has_key(arg):
-                 usage(1,'Unknown updater')
-             updater = updaters[arg]
  
      if args:
--- 228,231 ----
***************
*** 234,238 ****
          usage(1, "-n is required")
  
!     drive(nsets,decision=FirstN(m,decision()),updater=updater())
  
  if __name__ == "__main__":
--- 234,238 ----
          usage(1, "-n is required")
  
!     drive(nsets,decision=FirstN(m,decision()))
  
  if __name__ == "__main__":
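
The point of the new three-valued return is that a decider can now pick
the training class itself instead of answering yes or no.  A rough
function-style sketch of the two new deciders follows; the cutoff values
are assumptions, and it presumes that TrainDecision.__call__ dispatches to
spamtrain for real spam and to hamtrain for real ham, which the diff does
not show.

```python
DONT_TRAIN = None
TRAIN_AS_HAM = 1
TRAIN_AS_SPAM = 2

HAM_CUTOFF = 0.20    # assumed value, stands in for options.ham_cutoff
SPAM_CUTOFF = 0.90   # assumed value, stands in for options.spam_cutoff


def own_decision(score):
    """Train on the classifier's own verdict, ignoring the true label."""
    if score < HAM_CUTOFF:
        return TRAIN_AS_HAM
    elif score > SPAM_CUTOFF:
        return TRAIN_AS_SPAM
    return DONT_TRAIN


def own_decision_fn_correction(score, is_spam):
    """As own_decision, except that real spam is always trained as spam."""
    if is_spam:
        return TRAIN_AS_SPAM
    return own_decision(score)
```

Note that own_decision can train a real ham as spam when the classifier
scores it above the spam cutoff; only the false-negative side is corrected
in the second variant.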


From npickett@users.sourceforge.net  Mon Nov 25 16:24:28 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Mon, 25 Nov 2002 08:24:28 -0800
Subject: [Spambayes-checkins] spambayes dbdict.py,1.2,1.3
	hammiebulk.py,1.3,1.4
Message-ID: <E18GM28-00080k-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv30660

Modified Files:
	dbdict.py hammiebulk.py 
Log Message:
* s/dbhash/anydbm/


Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** dbdict.py	25 Nov 2002 02:29:44 -0000	1.2
--- dbdict.py	25 Nov 2002 16:24:26 -0000	1.3
***************
*** 1,11 ****
  #! /usr/bin/env python
  
! """DBDict.py - Dictionary access to dbhash
  
  Classes:
!     DBDict - wraps a dbhash file
  
  Abstract:
!     DBDict class wraps a dbhash file with a reasonably complete set
      of dictionary access methods.  DBDicts can be iterated like a dictionary.
      
--- 1,11 ----
  #! /usr/bin/env python
  
! """DBDict.py - Dictionary access to anydbm
  
  Classes:
!     DBDict - wraps an anydbm file
  
  Abstract:
!     DBDict class wraps an anydbm file with a reasonably complete set
      of dictionary access methods.  DBDicts can be iterated like a dictionary.
      
***************
*** 57,61 ****
      import pickle
  
! import dbhash
  import errno
  import copy
--- 57,61 ----
      import pickle
  
! import anydbm
  import errno
  import copy
***************
*** 72,76 ****
      """Database Dictionary.
  
!     This wraps a dbhash database to make it look even more like a
      dictionary, much like the built-in shelf class.  The difference is
      that a DBDict supports all dict methods.
--- 72,76 ----
      """Database Dictionary.
  
!     This wraps an anydbm database to make it look even more like a
      dictionary, much like the built-in shelf class.  The difference is
      that a DBDict supports all dict methods.
***************
*** 92,96 ****
  
      def __init__(self, dbname, mode, wclass, iterskip=()):
!         self.hash = dbhash.open(dbname, mode)
          if not iterskip:
              self.iterskip = iterskip
--- 92,96 ----
  
      def __init__(self, dbname, mode, wclass, iterskip=()):
!         self.hash = anydbm.open(dbname, mode)
          if not iterskip:
              self.iterskip = iterskip

Index: hammiebulk.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiebulk.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** hammiebulk.py	25 Nov 2002 06:22:26 -0000	1.3
--- hammiebulk.py	25 Nov 2002 16:24:26 -0000	1.4
***************
*** 46,50 ****
  import email
  import errno
- import anydbm
  import cPickle as pickle
  
--- 46,49 ----


From nascheme@users.sourceforge.net  Mon Nov 25 18:13:42 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Mon, 25 Nov 2002 10:13:42 -0800
Subject: [Spambayes-checkins] spambayes neilfilter.py,1.4,1.5
Message-ID: <E18GNjq-0001j6-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv6617

Modified Files:
	neilfilter.py 
Log Message:
Repair to work with new Classifier interface.


Index: neilfilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neilfilter.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** neilfilter.py	2 Oct 2002 16:05:27 -0000	1.4
--- neilfilter.py	25 Nov 2002 18:13:40 -0000	1.5
***************
*** 20,37 ****
  SPAM_CUTOFF = 0.57
  
! class CdbWrapper(cdb.Cdb):
!     def get(self, key, default=None,
!             cdb_get=cdb.Cdb.get,
!             WordInfo=classifier.WordInfo):
!         prob = cdb_get(self, key, default)
!         if prob is None:
!             return None
!         else:
!             return WordInfo(0, float(prob))
! 
! class CdbBayes(classifier.Bayes):
      def __init__(self, cdbfile):
          classifier.Bayes.__init__(self)
!         self.wordinfo = CdbWrapper(cdbfile)
  
  def maketmp(dir):
--- 20,30 ----
  SPAM_CUTOFF = 0.57
  
! class CdbClassifer(classifier.Classifier):
      def __init__(self, cdbfile):
          classifier.Bayes.__init__(self)
!         self.wordinfo = cdb.Cdb(cdbfile)
! 
!     def probability(self, record):
!         return float(record)
  
  def maketmp(dir):
***************
*** 94,98 ****
              msg = email.message_from_string(msgdata)
              del msgdata
!             bayes = CdbBayes(open(wordprobfilename, 'rb'))
              prob = bayes.spamprob(tokenize(msg))
          else:
--- 87,91 ----
              msg = email.message_from_string(msgdata)
              del msgdata
!             bayes = CdbClassifer(open(wordprobfilename, 'rb'))
              prob = bayes.spamprob(tokenize(msg))
          else:


From nascheme@users.sourceforge.net  Mon Nov 25 18:14:03 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Mon, 25 Nov 2002 10:14:03 -0800
Subject: [Spambayes-checkins] spambayes neiltrain.py,1.4,1.5
Message-ID: <E18GNkB-0001nE-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv6852

Modified Files:
	neiltrain.py 
Log Message:
Repair to work with new Classifier interface.


Index: neiltrain.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** neiltrain.py	7 Nov 2002 22:30:07 -0000	1.4
--- neiltrain.py	25 Nov 2002 18:13:59 -0000	1.5
***************
*** 28,32 ****
      mbox = mboxutils.getmbox(msgs)
      for msg in mbox:
!         bayes.learn(tokenize(msg), is_spam, False)
  
  def usage(code, msg=''):
--- 28,32 ----
      mbox = mboxutils.getmbox(msgs)
      for msg in mbox:
!         bayes.learn(tokenize(msg), is_spam)
  
  def usage(code, msg=''):
***************
*** 46,50 ****
      ham_name = sys.argv[2]
      db_name = sys.argv[3]
!     bayes = classifier.Bayes()
      print 'Training with spam...'
      train(bayes, spam_name, True)
--- 46,50 ----
      ham_name = sys.argv[2]
      db_name = sys.argv[3]
!     bayes = classifier.Classifier()
      print 'Training with spam...'
      train(bayes, spam_name, True)
***************
*** 54,60 ****
      bayes.update_probabilities()
      items = []
!     for word, winfo in bayes.wordinfo.iteritems():
!         #print `word`, str(winfo.spamprob)
!         items.append((word, str(winfo.spamprob)))
      print 'Writing DB...'
      db = open(db_name, "wb")
--- 54,61 ----
      bayes.update_probabilities()
      items = []
!     for word, record in bayes.wordinfo.iteritems():
!         prob = bayes.probability(record)
!         #print `word`, prob
!         items.append((word, str(prob)))
      print 'Writing DB...'
      db = open(db_name, "wb")


From npickett@users.sourceforge.net  Mon Nov 25 20:49:26 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Mon, 25 Nov 2002 12:49:26 -0800
Subject: [Spambayes-checkins] 
 spambayes FileCorpus.py,1.4,1.5 classifier.py,1.56,1.57
 dbdict.py,1.3,1.4 hammie.py,1.42,1.43 neiltrain.py,1.5,1.6
 pop3proxy.py,1.20,1.21
Message-ID: <E18GQAY-0006ac-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv25160

Modified Files:
	FileCorpus.py classifier.py dbdict.py hammie.py neiltrain.py 
	pop3proxy.py 
Log Message:
* Removed Classifier.update_probabilities() and all references to it
* Eliminated dbdict's iteritems() and associates; we don't need them
  anymore and not having them allows more dbm types to be used


Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** FileCorpus.py	25 Nov 2002 06:22:26 -0000	1.4
--- FileCorpus.py	25 Nov 2002 20:49:12 -0000	1.5
***************
*** 473,478 ****
  
  
!     print '\n\nUpdating and storing bayes databases'
!     miscbayes.update_probabilities()  # if we don't, training is forgotten
      miscbayes.store()
      classbayes.store()
--- 473,477 ----
  
  
!     print '\n\nStoring bayes databases'
      miscbayes.store()
      classbayes.store()

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.56
retrieving revision 1.57
diff -C2 -d -r1.56 -r1.57
*** classifier.py	25 Nov 2002 06:22:26 -0000	1.56
--- classifier.py	25 Nov 2002 20:49:13 -0000	1.57
***************
*** 337,348 ****
          else that it's definitely not spam.
  
-         If optional arg update_word_probabilities is False (the default
-         is True), don't update individual words' probabilities.
-         Updating them is expensive, and if you're going to pass many
-         messages to learn(), it's more efficient to pass False here and
-         call update_probabilities() once when you're done.  The
-         important thing is that the probabilities get updated before
-         calling spamprob() again.
- 
          """
  
--- 337,340 ----
***************
*** 440,457 ****
  
          return prob
- 
-     def update_probabilities(self):
-         """Update the word probabilities in the spam database.
- 
-         This computes a new probability for every word in the database,
-         which can be expensive.  learn() and unlearn() clear the
-         probability cache each time by default, and that will be rebuilt
-         as probabilities are looked up.  If for some reason you need to
-         update all the probabilities in one step (say, for
-         benchmarking), you can call this method.
-         """
- 
-         for word, record in self.wordinfo.iteritems():
-             self.probability(record)
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
--- 432,435 ----

Index: dbdict.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/dbdict.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** dbdict.py	25 Nov 2002 16:24:26 -0000	1.3
--- dbdict.py	25 Nov 2002 20:49:16 -0000	1.4
***************
*** 50,54 ****
  __credits__ = "Tim Peters (author of DBDict class), \
                 all the spambayes contributors."
- from __future__ import generators
  
  try:
--- 50,53 ----
***************
*** 128,146 ****
          del(self.hash[key])
  
-     def __iter__(self, fn=None):
-         k = self.hash.first()
-         while k != None:
-             key = k[0]
-             val = self.__getitem__(key)
-             if key not in self.iterskip:
-                 if fn:
-                     yield fn((key, val))
-                 else:
-                     yield (key, val)
-             try:
-                 k = self.hash.next()
-             except KeyError:
-                 break
- 
      def __contains__(self, name):
          return self.has_key(name)
--- 127,130 ----
***************
*** 155,168 ****
          else:
              return dfl
- 
-     def iteritems(self):
-         return self.__iter__()
- 
-     def iterkeys(self):
-         return self.__iter__(lambda k: k[0])
- 
-     def itervalues(self):
-         return self.__iter__(lambda k: k[1])
- 
  
  open = DBDict
--- 139,142 ----

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** hammie.py	25 Nov 2002 06:22:26 -0000	1.42
--- hammie.py	25 Nov 2002 20:49:17 -0000	1.43
***************
*** 126,132 ****
          is_spam should be 1 if the message is spam, 0 if not.
  
-         Probabilities are not updated after this call is made; to do
-         that, call update_probabilities().
- 
          """
  
--- 126,129 ----
***************
*** 138,144 ****
          msg can be a string, a file object, or a Message object.
  
-         Probabilities are not updated after this call is made; to do
-         that, call update_probabilities().
- 
          """
  
--- 135,138 ----
***************
*** 149,155 ****
  
          msg can be a string, a file object, or a Message object.
- 
-         Probabilities are not updated after this call is made; to do
-         that, call update_probabilities().
  
          """
--- 143,146 ----

Index: neiltrain.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** neiltrain.py	25 Nov 2002 18:13:59 -0000	1.5
--- neiltrain.py	25 Nov 2002 20:49:18 -0000	1.6
***************
*** 52,56 ****
      train(bayes, ham_name, False)
      print 'Updating probabilities...'
-     bayes.update_probabilities()
      items = []
      for word, record in bayes.wordinfo.iteritems():
--- 52,55 ----

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** pop3proxy.py	25 Nov 2002 06:22:26 -0000	1.20
--- pop3proxy.py	25 Nov 2002 20:49:19 -0000	1.21
***************
*** 1004,1008 ****
                          pass  # Must be a reload.
  
!         # Update the probabilities if we've done any training.
          if numTrained > 0:
              plural = ''
--- 1004,1008 ----
                          pass  # Must be a reload.
  
!         # Report on any training.
          if numTrained > 0:
              plural = ''
***************
*** 1010,1017 ****
                  plural = 's'
              self.push("Trained on %d message%s. " % (numTrained, plural))
-             self.push("Updating probabilities... ")
-             self.push(" ")
-             state.bayes.update_probabilities()
-             self.push("Done.</b></p>")
  
          # If any messages were deferred, show the same page again.
--- 1010,1013 ----


From npickett@users.sourceforge.net  Mon Nov 25 20:52:52 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Mon, 25 Nov 2002 12:52:52 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.19,1.20
Message-ID: <E18GQDs-00072T-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv26793

Modified Files:
	train.py 
Log Message:
* Removed reference to Classifier.update_probabilities(), which no
  longer exists.
* Hope I did this right...


Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** train.py	25 Nov 2002 06:02:34 -0000	1.19
--- train.py	25 Nov 2002 20:52:49 -0000	1.20
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
  # Train a classifier from Outlook Mail folders
  # Authors: Sean D. True, WebReply.Com, Mark Hammond
***************
*** 104,110 ****
              return
  
-     progress.tick()
-     progress.set_status('Updating probabilities...')
-     bayes.update_probabilities()
      progress.tick()
      if progress.stop_requested():
--- 105,108 ----


From npickett@users.sourceforge.net  Mon Nov 25 21:02:07 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Mon, 25 Nov 2002 13:02:07 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.57,1.58
Message-ID: <E18GQMp-0008OW-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv32113

Modified Files:
	classifier.py 
Log Message:
* Pruned Classifier.classify() -- nothing was using it
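
Callers that still want a ham/spam/unsure label can derive it from the
spam probability themselves.  A minimal sketch, assuming illustrative
cutoff values (the real ones come from options.ham_cutoff and
options.spam_cutoff) and plain string labels; the function name is
hypothetical:

```python
def label_for(prob, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability to a label.

    The default cutoffs are placeholders for illustration only.
    """
    if prob < ham_cutoff:
        return 'ham'
    elif prob > spam_cutoff:
        return 'spam'
    else:
        # Anything between the two cutoffs (inclusive) is left undecided.
        return 'unsure'
```

Unlike the removed method, this takes the probability as an argument and
does not write anything back onto the message.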


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.57
retrieving revision 1.58
diff -C2 -d -r1.57 -r1.58
*** classifier.py	25 Nov 2002 20:49:13 -0000	1.57
--- classifier.py	25 Nov 2002 21:02:03 -0000	1.58
***************
*** 158,177 ****
      # spamprob, depending on option settings.
  
-     def classify(self, message):
-         """Return the classification of a message as a string."""
- 
-         prob = self.spamprob(message.tokenize())
- 
-         message.setSpamprob(prob)       # don't like this
- 
-         if prob < options.ham_cutoff:
-             type = options.header_ham_string
-         elif prob > options.spam_cutoff:
-             type = options.header_spam_string
-         else:
-             type = options.header_unsure_string
- 
-         return type
- 
      def gary_spamprob(self, wordstream, evidence=False):
          """Return best-guess probability that wordstream is spam.
--- 158,161 ----


From timstone4@users.sourceforge.net  Mon Nov 25 21:13:12 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Mon, 25 Nov 2002 13:13:12 -0800
Subject: [Spambayes-checkins] spambayes sb0.5.exe,NONE,1.1.2.1
Message-ID: <E18GQXY-0001Mb-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv5123

Added Files:
      Tag: hammie-playground
	sb0.5.exe 
Log Message:
This is a first crack at packaging.  It's a self-extracting executable, created
on W2K.  I'm not sure exactly how compatible it is with other windoze platforms.
I don't have a linux platform installable yet.  It doesn't include outlook stuff,
yet.  To run, extract it, then right click on sb0.5.msi and hit install.  I don't
know how to make it automatically install, yet.  I don't know much else,
either.  But I'm learning... 

--- NEW FILE: sb0.5.exe ---
(This appears to be a binary file; contents omitted.)


From timstone4@users.sourceforge.net  Mon Nov 25 21:39:30 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Mon, 25 Nov 2002 13:39:30 -0800
Subject: [Spambayes-checkins] spambayes sb0.5.exe,1.1.2.1,NONE
Message-ID: <E18GQx0-00051Q-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv19187

Removed Files:
      Tag: hammie-playground
	sb0.5.exe 
Log Message:
Objection noted.

--- sb0.5.exe DELETED ---


From timstone4@users.sourceforge.net  Tue Nov 26 00:43:53 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Mon, 25 Nov 2002 16:43:53 -0800
Subject: [Spambayes-checkins] spambayes Corpus.py,1.3,1.4
	FileCorpus.py,1.5,1.6 Options.py,1.77,1.78 storage.py,1.1,1.2
Message-ID: <E18GTpR-0004mB-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv18305

Modified Files:
	Corpus.py FileCorpus.py Options.py storage.py 
Log Message:
Added [globals] section to Options, with a verbose boolean

Changed FileCorpus, Corpus, storage to use the verbose global

Changed the test harness in FileCorpus to account for the wanton
destruction of my favorite method: classifier.classify()  <wink>
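
A minimal sketch of reading a boolean from a [globals] section, using
the Python 3 standard library's configparser rather than the option
crackers in Options.py.  DEFAULTS and load_verbose() are hypothetical
names used only for illustration:

```python
import configparser

# Built-in defaults, in the same "name: value" style as Options.py.
DEFAULTS = """
[globals]
verbose: False
"""

def load_verbose(user_config=""):
    """Return the verbose flag, letting user_config override the default."""
    parser = configparser.ConfigParser()
    parser.read_string(DEFAULTS)
    if user_config:
        # Later reads override values set by earlier ones.
        parser.read_string(user_config)
    return parser.getboolean("globals", "verbose")
```

A single shared flag like this replaces the separate per-module switches
(Corpus.Verbose, storage.DEBUG) that the diffs below remove.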

Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** Corpus.py	25 Nov 2002 02:29:44 -0000	1.3
--- Corpus.py	26 Nov 2002 00:43:51 -0000	1.4
***************
*** 87,94 ****
  import tokenizer
  import re
  
  SPAM = True
  HAM = False
- Verbose = False
  
  class Corpus:
--- 87,94 ----
  import tokenizer
  import re
+ from Options import options
  
  SPAM = True
  HAM = False
  
  class Corpus:
***************
*** 115,119 ****
          '''Add a Message to this corpus'''
  
!         if Verbose:
              print 'adding message %s to corpus' % (message.key())
  
--- 115,119 ----
          '''Add a Message to this corpus'''
  
!         if options.verbose:
              print 'adding message %s to corpus' % (message.key())
  
***************
*** 134,138 ****
  
          key = message.key()
!         if Verbose:
              print 'removing message %s from corpus' % (key)
          self.unCacheMessage(key)
--- 134,138 ----
  
          key = message.key()
!         if options.verbose:
              print 'removing message %s from corpus' % (key)
          self.unCacheMessage(key)
***************
*** 152,156 ****
          key = message.key()
  
!         if Verbose:
              print 'placing %s in corpus cache' % (key)
  
--- 152,156 ----
          key = message.key()
  
!         if options.verbose:
              print 'placing %s in corpus cache' % (key)
  
***************
*** 169,173 ****
          # This method should probably not be overridden
  
!         if Verbose:
              print 'Flushing %s from corpus cache' % (key)
  
--- 169,173 ----
          # This method should probably not be overridden
  
!         if options.verbose:
              print 'Flushing %s from corpus cache' % (key)
  
***************
*** 249,253 ****
              Corpus.cacheMessage(self, msg)
          else:
!             if Verbose:
                  print 'Not caching %s because it has expired' % (msg.key())
              raise KeyError, msg
--- 249,253 ----
              Corpus.cacheMessage(self, msg)
          else:
!             if options.verbose:
                  print 'Not caching %s because it has expired' % (msg.key())
              raise KeyError, msg
***************
*** 262,266 ****
                  msg = self[key]
              except KeyError, e:
!                 if Verbose:
                      print 'message %s has expired' % (key)
                  self.removeMessage(e[0])
--- 262,266 ----
                  msg = self[key]
              except KeyError, e:
!                 if options.verbose:
                      print 'message %s has expired' % (key)
                  self.removeMessage(e[0])

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** FileCorpus.py	25 Nov 2002 20:49:12 -0000	1.5
--- FileCorpus.py	26 Nov 2002 00:43:51 -0000	1.6
***************
*** 88,91 ****
--- 88,92 ----
  import storage
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
+ from Options import options
  
  class FileCorpus(Corpus.Corpus):
***************
*** 133,137 ****
              raise ValueError
  
!         if Corpus.Verbose:
              print 'adding',message.key(),'to corpus'
  
--- 134,138 ----
              raise ValueError
  
!         if options.verbose:
              print 'adding',message.key(),'to corpus'
  
***************
*** 145,149 ****
          '''Remove a Message from this corpus'''
  
!         if Corpus.Verbose:
              print 'removing',message.key(),'from corpus'
  
--- 146,150 ----
          '''Remove a Message from this corpus'''
  
!         if options.verbose:
              print 'removing',message.key(),'from corpus'
  
***************
*** 163,167 ****
              s = ''
  
!         if Corpus.Verbose and nummsgs > 0:
              lst = ', ' + '%s' % (self.keys())
          else:
--- 164,168 ----
              s = ''
  
!         if options.verbose and nummsgs > 0:
              lst = ', ' + '%s' % (self.keys())
          else:
***************
*** 205,209 ****
          '''Read the Message substance from the file'''
  
!         if Corpus.Verbose:
              print 'loading', self.file_name
  
--- 206,210 ----
          '''Read the Message substance from the file'''
  
!         if options.verbose:
              print 'loading', self.file_name
  
***************
*** 221,225 ****
          '''Write the Message substance to the file'''
  
!         if Corpus.Verbose:
              print 'storing', self.file_name
  
--- 222,226 ----
          '''Write the Message substance to the file'''
  
!         if options.verbose:
              print 'storing', self.file_name
  
***************
*** 232,236 ****
          '''Message hara-kiri'''
  
!         if Corpus.Verbose:
              print 'physically deleting file',self.pathname()
  
--- 233,237 ----
          '''Message hara-kiri'''
  
!         if options.verbose:
              print 'physically deleting file',self.pathname()
  
***************
*** 251,255 ****
          sub = self.getSubstance()
          
!         if Corpus.Verbose:
              sub = self.getSubstance()
          else:
--- 252,256 ----
          sub = self.getSubstance()
          
!         if options.verbose:
              sub = self.getSubstance()
          else:
***************
*** 294,298 ****
          '''Read the Message substance from the file'''
  
!         if Corpus.Verbose:
              print 'loading', self.file_name
  
--- 295,299 ----
          '''Read the Message substance from the file'''
  
!         if options.verbose:
              print 'loading', self.file_name
  
***************
*** 312,316 ****
          '''Write the Message substance to the file'''
  
!         if Corpus.Verbose:
              print 'storing', self.file_name
  
--- 313,317 ----
          '''Write the Message substance to the file'''
  
!         if options.verbose:
              print 'storing', self.file_name
  
***************
*** 445,459 ****
  
      for msg in unsurecorpus:
!         type = classbayes.classify(msg)
  
!         print 'Message %s spam probability is %f' % (msg.key(), msg.spamprob)
  
!         if type == 'ham':
              print 'Moving %s from unsurecorpus to hamcorpus, \
! based on prob of %f' % (msg.key(), msg.spamprob)
              hamcorpus.takeMessage(msg.key(), unsurecorpus)
!         elif type == 'spam':
              print 'Moving %s from unsurecorpus to spamcorpus, \
! based on prob of %f' % (msg.key(), msg.spamprob)
              spamcorpus.takeMessage(msg.key(), unsurecorpus)
  
--- 446,460 ----
  
      for msg in unsurecorpus:
!         prob = classbayes.spamprob(msg.tokenize())
  
!         print 'Message %s spam probability is %f' % (msg.key(), prob)
  
!         if prob < options.ham_cutoff:
              print 'Moving %s from unsurecorpus to hamcorpus, \
! based on prob of %f' % (msg.key(), prob)
              hamcorpus.takeMessage(msg.key(), unsurecorpus)
!         elif prob > options.spam_cutoff:
              print 'Moving %s from unsurecorpus to spamcorpus, \
! based on prob of %f' % (msg.key(), prob)
              spamcorpus.takeMessage(msg.key(), unsurecorpus)
  
***************
*** 686,690 ****
          sys.exit()
  
!     Corpus.Verbose = False
      runTestServer = False
      setupTestServer = False
--- 687,691 ----
          sys.exit()
  
!     options.verbose = False
      runTestServer = False
      setupTestServer = False
***************
*** 707,715 ****
              cleanupTestServer = True
          elif opt == '-v':
!             Corpus.Verbose = True
          elif opt == '-g':
              useGzip = True
          elif opt == '-u':
              useExistingDB = True
  
      if setupTestServer:
--- 708,718 ----
              cleanupTestServer = True
          elif opt == '-v':
!             options.verbose = True
          elif opt == '-g':
              useGzip = True
          elif opt == '-u':
              useExistingDB = True
+         elif opt == '-v':
+             options.verbose = True
  
      if setupTestServer:

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.77
retrieving revision 1.78
diff -C2 -d -r1.77 -r1.78
*** Options.py	25 Nov 2002 04:23:31 -0000	1.77
--- Options.py	26 Nov 2002 00:43:51 -0000	1.78
***************
*** 370,373 ****
--- 370,376 ----
  html_ui_port: 8880
  html_ui_launch_browser: False
+ 
+ [globals]
+ verbose: False
  """
  
***************
*** 456,459 ****
--- 459,464 ----
      'html_ui': {'html_ui_port': int_cracker,
                  'html_ui_launch_browser': boolean_cracker,
+                 },
+     'globals': {'verbose': boolean_cracker,
                  },
  }

Index: storage.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/storage.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** storage.py	25 Nov 2002 06:22:26 -0000	1.1
--- storage.py	26 Nov 2002 00:43:51 -0000	1.2
***************
*** 56,60 ****
  NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
  UPDATEPROBS = True       # Probabilities will be autoupdated with training
- DEBUG = False
  
  class PickledClassifier(classifier.Classifier):
--- 56,59 ----
***************
*** 73,77 ****
          # that pickle does its job
  
!         if DEBUG:
              print 'Loading state from',self.db_name,'pickle'
  
--- 72,76 ----
          # that pickle does its job
  
!         if options.verbose:
              print 'Loading state from',self.db_name,'pickle'
  
***************
*** 90,99 ****
              self.meta.nspam = tempbayes.get_nspam()
  
!             if DEBUG:
                  print '%s is an existing pickle, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new pickle
!             if DEBUG:
                  print self.db_name,'is a new pickle'
              self.wordinfo = {}
--- 89,98 ----
              self.meta.nspam = tempbayes.get_nspam()
  
!             if options.verbose:
                  print '%s is an existing pickle, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new pickle
!             if options.verbose:
                  print self.db_name,'is a new pickle'
              self.wordinfo = {}
***************
*** 104,108 ****
          '''Store self as a pickle'''
  
!         if DEBUG:
              print 'Persisting',self.db_name,'as a pickle'
  
--- 103,107 ----
          '''Store self as a pickle'''
  
!         if options.verbose:
              print 'Persisting',self.db_name,'as a pickle'
  
***************
*** 135,139 ****
          '''Load state from WIDict'''
  
!         if DEBUG:
              print 'Loading state from',self.db_name,'WIDict'
  
--- 134,138 ----
          '''Load state from WIDict'''
  
!         if options.verbose:
              print 'Loading state from',self.db_name,'WIDict'
  
***************
*** 146,155 ****
              self.set_nspam(nspam)
  
!             if DEBUG:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
!             if DEBUG:
                  print self.db_name,'is a new DBDict'
              self.set_nham(0)
--- 145,154 ----
              self.set_nspam(nspam)
  
!             if options.verbose:
                  print '%s is an existing DBDict, with %d ham and %d spam' \
                        % (self.db_name, self.nham, self.nspam)
          else:
              # new dbdict
!             if options.verbose:
                  print self.db_name,'is a new DBDict'
              self.set_nham(0)
***************
*** 159,163 ****
          '''Place state into persistent store'''
  
!         if DEBUG:
              print 'Persisting',self.db_name,'state in WIDict'
  
--- 158,162 ----
          '''Place state into persistent store'''
  
!         if options.verbose:
              print 'Persisting',self.db_name,'state in WIDict'
  
***************
*** 185,189 ****
          '''Train the database with the message'''
  
!         if DEBUG:
              print 'training with',message.key()
  
--- 184,188 ----
          '''Train the database with the message'''
  
!         if options.verbose:
              print 'training with',message.key()
  
***************
*** 199,203 ****
          '''Untrain the database with the message'''
  
!         if DEBUG:
              print 'untraining with',message.key()
  
--- 198,202 ----
          '''Untrain the database with the message'''
  
!         if options.verbose:
              print 'untraining with',message.key()
  

From mhammond@skippinet.com.au  Tue Nov 26 01:23:32 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 26 Nov 2002 12:23:32 +1100
Subject: [Spambayes-checkins] spambayes
	Corpus.py,1.3,1.4 FileCorpus.py,1.5,1.6 Options.py,1.77,1.78 storage.py,1.1,1.2
In-Reply-To: <E18GTpR-0004mB-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPOEHGHOAA.mhammond@skippinet.com.au>

> Modified Files:
> 	Corpus.py FileCorpus.py Options.py storage.py 
> Log Message:
> Added [globals] section to Options, with a verbose boolean

A verbose *level* can also be handy <wink>

Mark.


From timstone4@users.sourceforge.net  Tue Nov 26 03:32:13 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Mon, 25 Nov 2002 19:32:13 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.21,1.22
Message-ID: <E18GWSL-0005qz-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv22470

Modified Files:
	pop3proxy.py 
Log Message:
Got rid of an infernal 'Bad file descriptor' asyncore error message.
Forgive me, Richie.
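
The check in the diff below compares the error code against the literal
9.  A minimal sketch of the same test using the symbolic constant from
the errno module, written for Python 3 (where socket.error is an alias
of OSError); the helper name is hypothetical:

```python
import errno

def is_bad_file_descriptor(exc):
    """True if exc is an OS-level 'Bad file descriptor' error."""
    return isinstance(exc, OSError) and exc.errno == errno.EBADF
```

Using errno.EBADF avoids depending on the numeric value, which is
platform-defined.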

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** pop3proxy.py	25 Nov 2002 20:49:19 -0000	1.21
--- pop3proxy.py	26 Nov 2002 03:32:11 -0000	1.22
***************
*** 176,180 ****
          """Let SystemExit cause an exit."""
          type, v, t = sys.exc_info()
!         if type == SystemExit:
              raise
          else:
--- 176,182 ----
          """Let SystemExit cause an exit."""
          type, v, t = sys.exc_info()
!         if type == socket.error and v[0] == 9:  # Why?  Who knows...
!             pass
!         elif type == SystemExit:
              raise
          else:


From timstone4@users.sourceforge.net  Tue Nov 26 04:27:21 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Mon, 25 Nov 2002 20:27:21 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.22,1.23
Message-ID: <E18GXJh-00030Z-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv11550

Modified Files:
	pop3proxy.py 
Log Message:
Corrected reference to __slots__ in word query.
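
One thing to keep in mind when formatting from an object's __dict__: it
is the object's live attribute dictionary, so assigning a key into it
also sets an attribute on the object.  A minimal sketch that copies the
dictionary first, assuming a toy record class (WordInfo here is a
stand-in, not the class in classifier.py) and a hypothetical helper:

```python
class WordInfo:
    """Toy record for illustration only."""

    def __init__(self, spamcount, hamcount):
        self.spamcount = spamcount
        self.hamcount = hamcount

def describe(wi, spamprob):
    # Shallow copy, so the computed value is not stored on the record.
    members = dict(vars(wi))
    members['spamprob'] = spamprob
    return ("Number of spam messages: %(spamcount)d. "
            "Number of ham messages: %(hamcount)d. "
            "Spam probability: %(spamprob)f." % members)
```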

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** pop3proxy.py	26 Nov 2002 03:32:11 -0000	1.22
--- pop3proxy.py	26 Nov 2002 04:27:19 -0000	1.23
***************
*** 1099,1112 ****
          word = word.lower()
          try:
-             # Must be a better way to get __dict__ for a new-style class...
              wi = state.bayes.wordinfo[word]
!             members = dict(map(lambda n: (n, getattr(wi, n)), wi.__slots__))
!             members['atime'] = time.asctime(time.localtime(members['atime']))
              info = """Number of spam messages: <b>%(spamcount)d</b>.<br>
                     Number of ham messages: <b>%(hamcount)d</b>.<br>
-                    Number of times used to classify: <b>%(killcount)s</b>.<br>
                     Probability that a message containing this word is spam:
!                    <b>%(spamprob)f</b>.<br>
!                    Last used: <b>%(atime)s</b>.<br>""" % members
          except KeyError:
              info = "%r does not appear in the database." % word
--- 1099,1109 ----
          word = word.lower()
          try:
              wi = state.bayes.wordinfo[word]
!             members = wi.__dict__
!             members['spamprob'] = state.bayes.probability(wi)
              info = """Number of spam messages: <b>%(spamcount)d</b>.<br>
                     Number of ham messages: <b>%(hamcount)d</b>.<br>
                     Probability that a message containing this word is spam:
!                    <b>%(spamprob)f</b>.<br>""" % members
          except KeyError:
              info = "%r does not appear in the database." % word


From richiehindle@users.sourceforge.net  Tue Nov 26 16:22:14 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Tue, 26 Nov 2002 08:22:14 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.23,1.24
Message-ID: <E18GiTV-0007uS-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv30178

Modified Files:
	pop3proxy.py 
Log Message:
 o You can now train on mbox files through the web interface.
 o Automatically save after training.  This can be slow, but we get nasty
   consequences from not doing it.
 o Also removed the "Shutdown without saving" button, and moved the "Save"
   button to the footer - the "Save" button should be all-but-redundant
   now, but I've left it in out of paranoia.
 o Updated the training functions to account for the new Classifier API.
 o Improve the look-n-feel of the training interface, especially on the
   Mac, by centring the radio buttons using the more-universally-accepted
   <center> tag and by spreading them out a little more.
 o Replaced instances of "X-Hammie-Disposition" in comments with the new
   "X-Spambayes-Classification".
 o Forced the test code to always use pickles.
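
A minimal sketch of tagging a message with the new header, using the
Python 3 standard library's email package.  It also drops any copy of
the header that arrived with the message, as the to-do list in the
docstring suggests.  tag_message() is a hypothetical helper, not a
function in pop3proxy.py:

```python
import email

HEADER = "X-Spambayes-Classification"

def tag_message(raw_text, classification):
    """Return raw_text with exactly one classification header."""
    msg = email.message_from_string(raw_text)
    # Deleting removes every occurrence and is a no-op if none exist.
    del msg[HEADER]
    msg[HEADER] = classification
    return msg.as_string()
```

A mail client can then filter on that single header value (ham, spam, or
unsure).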


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** pop3proxy.py	26 Nov 2002 04:27:19 -0000	1.23
--- pop3proxy.py	26 Nov 2002 16:22:11 -0000	1.24
***************
*** 2,8 ****
  
  """A POP3 proxy that works with classifier.py, and adds a simple
! X-Hammie-Disposition header (Yes/No/Unsure) to each incoming email.
! You point pop3proxy at your POP3 server, and configure your email
! client to collect mail from the proxy then filter on the added
  header.  Usage:
  
--- 2,8 ----
  
  """A POP3 proxy that works with classifier.py, and adds a simple
! X-Spambayes-Classification header (ham/spam/unsure) to each incoming
! email.  You point pop3proxy at your POP3 server, and configure your
! email client to collect mail from the proxy then filter on the added
  header.  Usage:
  
***************
*** 31,35 ****
  written out to _pop3proxy.log for each run.
  
! To make rebuilding the database easier, trained messages are appended
  to _pop3proxyham.mbox and _pop3proxyspam.mbox.
  """
--- 31,35 ----
  written out to _pop3proxy.log for each run.
  
! To make rebuilding the database easier, uploaded messages are appended
  to _pop3proxyham.mbox and _pop3proxyspam.mbox.
  """
***************
*** 60,63 ****
--- 60,65 ----
   o [Francois Granger] Show the raw spambrob number close to the buttons
     (this would mean using the extra X-Hammie header by default).
+  o Add Today and Refresh buttons on the Review page.
+  o "There are no untrained messages to display.  Return Home."
  
  
***************
*** 69,74 ****
   o Can it cleanly dynamically update its status display while having a
     POP3 converation?  Hammering reload sucks.
-  o Add a command to save the database without shutting down, and one to
-    reload the database.
   o Save the stats (num classified, etc.) between sessions.
   o "Reload database" button.
--- 71,74 ----
***************
*** 84,92 ****
     the training code update (rather than replace!) the database.
   o Allow use of the UI without the POP3 proxy.
!  o Remove any existing X-Hammie-Disposition header from incoming emails.
   o Whitelist.
   o Online manual.
   o Links to project homepage, mailing list, etc.
   o Edit settings through the web.
  
  
--- 84,94 ----
     the training code update (rather than replace!) the database.
   o Allow use of the UI without the POP3 proxy.
!  o Remove any existing X-Spambayes-Classification header from incoming
!    emails.
   o Whitelist.
   o Online manual.
   o Links to project homepage, mailing list, etc.
   o Edit settings through the web.
+  o List of words with stats (it would have to be paged!) a la SpamSieve.
  
  
***************
*** 115,123 ****
   o Zoe...!
  
  """
  
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import storage, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
--- 117,138 ----
   o Zoe...!
  
+ Notes, for the sake of somewhere better to put them:
+ 
+ Don't proxy spams at all?  This would mean writing a full POP3 client
+ and server - it would download all your mail on a timer and serve to you
+ all the non-spams.  It could be 'safe' in that it leaves the messages in
+ the real POP3 account until you collect them from it (or in the case of
+ spams, until you collect contemporaneous hams).  The web interface would
+ then present all the spams so that you could correct any FPs and mark
+ them for collection.  The thing is no longer a proxy (because the first
+ POP3 command in a conversation is STAT or LIST, which tells you how many
+ mails there are - it wouldn't know the answer, and finding out could
+ take weeks over a modem - I've already had problems with clients timing
+ out while the proxy was downloading stuff from the server).
  """
  
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import mailbox, storage, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
***************
*** 298,301 ****
--- 313,329 ----
              return False
  
+     ## This is an attempt to solve the problem whereby the email client
+     ## times out and closes the connection but the ServerLineReader is still
+     ## connected, so you get errors from the POP3 server next time because
+     ## there's already an active connection.  But after introducing this,
+     ## I kept getting unexplained "Bad file descriptor" errors in recv.
+     ##
+     ## def handle_close(self):
+     ##     """If the email client closes the connection unexpectedly, eg.
+     ##     because of a timeout, close the server connection."""
+     ##     self.serverSocket.shutdown(2)
+     ##     self.serverSocket.close()
+     ##     self.close()
+ 
      def collect_incoming_data(self, data):
          """Asynchat override."""
***************
*** 598,602 ****
  
      footer = """</div>
!              <form action='shutdown' method='POST'>
               <table width='100%%' cellspacing='0'>
               <tr><td class='banner'>&nbsp;<a href='home'>Spambayes Proxy</a>,
--- 626,630 ----
  
      footer = """</div>
!              <form action='save' method='POST'>
               <table width='100%%' cellspacing='0'>
               <tr><td class='banner'>&nbsp;<a href='home'>Spambayes Proxy</a>,
***************
*** 608,614 ****
               </body></html>\n"""
  
!     shutdownDB = """<input type='submit' name='how' value='Shutdown'>"""
! 
!     shutdownPickle = shutdownDB + """&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
  
--- 636,640 ----
               </body></html>\n"""
  
!     saveButtons = """<input type='submit' name='how' value='Save'>&nbsp;&nbsp;
              <input type='submit' name='how' value='Save &amp; shutdown'>"""
  
***************
*** 626,632 ****
                Total emails trained: Spam: <b>%(nspam)d</b>
                                       Ham: <b>%(nham)d</b><br>
-               <form action='save' method='POST'>
-               <input type='submit' value='Save database'>
-               </form>
                """
  
--- 652,655 ----
***************
*** 667,673 ****
      upload = """<form action='%s' method='POST'
                  enctype='multipart/form-data'>
!              Either upload a message file:
               <input type='file' name='file' value=''><br>
!              Or paste the whole message (incuding headers) here:<br>
               <textarea name='text' rows='3' cols='60'></textarea><br>
               %s
--- 690,696 ----
      upload = """<form action='%s' method='POST'
                  enctype='multipart/form-data'>
!              Either upload a message %s file:
               <input type='file' name='file' value=''><br>
!              Or paste one whole message (incuding headers) here:<br>
               <textarea name='text' rows='3' cols='60'></textarea><br>
               %s
***************
*** 676,684 ****
      uploadSumbit = """<input type='submit' name='which' value='%s'>"""
  
!     train = upload % ('train',
                        (uploadSumbit % "Train as Spam") + "&nbsp;" + \
                        (uploadSumbit % "Train as Ham"))
  
!     classify = upload % ('classify', uploadSumbit % "Classify")
  
      def __init__(self, clientSocket, socketMap=asyncore.socket_map):
--- 699,707 ----
      uploadSumbit = """<input type='submit' name='which' value='%s'>"""
  
!     train = upload % ('train', "or mbox",
                        (uploadSumbit % "Train as Spam") + "&nbsp;" + \
                        (uploadSumbit % "Train as Ham"))
  
!     classify = upload % ('classify', "", uploadSumbit % "Classify")
  
      def __init__(self, clientSocket, socketMap=asyncore.socket_map):
***************
*** 760,770 ****
                  # This is a request for a valid page; run the handler.
                  self.pushOKHeaders('text/html')
!                 self.pushPreamble(name, showImage=(name != 'Shutdown'))
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 if state.useDB:
!                     self.push(self.footer % (timeString, self.shutdownDB))
!                 else:
!                     self.push(self.footer % (timeString, self.shutdownPickle))
  
      def pushOKHeaders(self, contentType, extraHeaders={}):
--- 783,791 ----
                  # This is a request for a valid page; run the handler.
                  self.pushOKHeaders('text/html')
!                 isKill = (params.get('how', '').lower().find('shutdown') >= 0)
!                 self.pushPreamble(name, showImage=(not isKill))
                  handler(params)
                  timeString = time.asctime(time.localtime())
!                 self.push(self.footer % (timeString, self.saveButtons))
  
      def pushOKHeaders(self, contentType, extraHeaders={}):
***************
*** 832,836 ****
  
      def doSave(self):
!         """Saves the database.  Worker for onSave and onShutdown."""
          self.push("<b>Saving... ")
          self.push(' ')
--- 853,857 ----
  
      def doSave(self):
!         """Saves the database."""
          self.push("<b>Saving... ")
          self.push(' ')
***************
*** 839,878 ****
  
      def onSave(self, params):
!         """Command handler for "Save"."""
          self.doSave()
! 
!     def onShutdown(self, params):
!         """Shutdown the server, saving the pickle if requested to do so."""
!         if params['how'].lower().find('save') >= 0:
!             self.doSave()
!         self.push("<b>Shutdown</b>. Goodbye.</div></body></html>")
!         self.push(' ')
!         self.shutdown(2)
!         self.close()
!         raise SystemExit
  
      def onTrain(self, params):
          """Train on an uploaded or pasted message."""
          # Upload or paste?  Spam or ham?
!         message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'Train as Spam')
  
!         # Append the message to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
          # it should keep a Corpus of trained messages.
-         message = message.replace('\r\n', '\n').replace('\r', '\n') # For Macs
          if isSpam:
              f = open("_pop3proxyspam.mbox", "a")
          else:
              f = open("_pop3proxyham.mbox", "a")
-         f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
-         f.write(message)
-         f.write("\n\n")
-         f.close()
  
!         # Train on the message.
!         tokens = tokenizer.tokenize(message)
!         state.bayes.learn(tokens, isSpam, True)
!         self.push("<p>OK. Return <a href='home'>Home</a> or train another:</p>")
          self.push(self.pageSection % ('Train another', self.train))
  
--- 860,916 ----
  
      def onSave(self, params):
!         """Command handler for "Save" and "Save & shutdown"."""
          self.doSave()
!         if params['how'].lower().find('shutdown') >= 0:
!             self.push("<b>Shutdown</b>. Goodbye.</div></body></html>")
!             self.push(' ')
!             self.shutdown(2)
!             self.close()
!             raise SystemExit
  
      def onTrain(self, params):
          """Train on an uploaded or pasted message."""
          # Upload or paste?  Spam or ham?
!         content = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'Train as Spam')
  
!         # Convert platform-specific line endings into unix-style.
!         content = content.replace('\r\n', '\n').replace('\r', '\n')
! 
!         # Single message or mbox?
!         if content.startswith('From '):
!             # Get a list of raw messages from the mbox content.
!             class SimpleMessage:
!                 def __init__(self, fp):
!                     self.guts = fp.read()
!             contentFile = cStringIO.StringIO(content)
!             mbox = mailbox.PortableUnixMailbox(contentFile, SimpleMessage)
!             messages = map(lambda m: m.guts, mbox)
!         else:
!             # Just the one message.
!             messages = [content]
! 
!         # Append the message(s) to a file, to make it easier to rebuild
          # the database later.   This is a temporary implementation -
          # it should keep a Corpus of trained messages.
          if isSpam:
              f = open("_pop3proxyspam.mbox", "a")
          else:
              f = open("_pop3proxyham.mbox", "a")
  
!         # Train on the uploaded message(s).
!         self.push("<b>Training...</b>\n")
!         self.push(' ')
!         for message in messages:
!             tokens = tokenizer.tokenize(message)
!             state.bayes.learn(tokens, isSpam)
!             f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
!             f.write(message)
!             f.write("\n\n")
! 
!         # Save the database and return a link Home and another training form.
!         f.close()
!         self.doSave()
!         self.push("<p>OK. Return <a href='home'>Home</a> or train again:</p>")
          self.push(self.pageSection % ('Train another', self.train))
  
***************
*** 934,941 ****
      def appendMessages(self, lines, keyedMessages, judgement):
          """Appends the lines of a table of messages to 'lines'."""
!         buttons = """<input type='radio' name='classify:%s' value='discard'>
!                   <input type='radio' name='classify:%s' value='defer' %s>
!                   <input type='radio' name='classify:%s' value='ham' %s>
!                   <input type='radio' name='classify:%s' value='spam' %s>"""
          stripe = 0
          for key, message in keyedMessages:
--- 972,980 ----
      def appendMessages(self, lines, keyedMessages, judgement):
          """Appends the lines of a table of messages to 'lines'."""
!         buttons = \
!              """<input type='radio' name='classify:%s' value='discard'>&nbsp;
!                 <input type='radio' name='classify:%s' value='defer' %s>&nbsp;
!                 <input type='radio' name='classify:%s' value='ham' %s>&nbsp;
!                 <input type='radio' name='classify:%s' value='spam' %s>"""
          stripe = 0
          for key, message in keyedMessages:
***************
*** 970,974 ****
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             <td align='middle'>%s</td></tr>""" % \
                              (stripeClass, subject, from_, radioGroup))
              stripe = stripe ^ 1
--- 1009,1013 ----
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             <td><center>%s</center></td></tr>""" % \
                              (stripeClass, subject, from_, radioGroup))
              stripe = stripe ^ 1
***************
*** 1006,1010 ****
                          pass  # Must be a reload.
  
!         # Report on any training.
          if numTrained > 0:
              plural = ''
--- 1045,1049 ----
                          pass  # Must be a reload.
  
!         # Report on any training, and save the database if there was any.
          if numTrained > 0:
              plural = ''
***************
*** 1012,1015 ****
--- 1051,1056 ----
                  plural = 's'
              self.push("Trained on %d message%s. " % (numTrained, plural))
+             self.doSave()
+             self.push("<br>&nbsp;")
  
          # If any messages were deferred, show the same page again.
***************
*** 1196,1199 ****
--- 1237,1241 ----
          print "Loading database...",
          if self.isTest:
+             self.useDB = True
              self.databaseFilename = '_pop3proxy_test.pickle'   # Never saved
          if self.useDB:


From richiehindle@users.sourceforge.net  Tue Nov 26 16:20:42 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Tue, 26 Nov 2002 08:20:42 -0800
Subject: [Spambayes-checkins] spambayes INTEGRATION.txt,1.2,1.3
Message-ID: <E18GiS2-0007hD-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv29495

Modified Files:
	INTEGRATION.txt 
Log Message:
Added the first-draft user documentation I posted to the list last week.

Index: INTEGRATION.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/INTEGRATION.txt,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** INTEGRATION.txt	7 Nov 2002 22:25:46 -0000	1.2
--- INTEGRATION.txt	26 Nov 2002 16:20:38 -0000	1.3
***************
*** 13,17 ****
  and hamminess qualities.
  
! To train Spambayes, you need to save your incoming email for awhile,
  segregating it into two piles, known spam and known ham (ham is our nickname
  for good mail).  It's best to train on recent email, because your interests
--- 13,19 ----
  and hamminess qualities.
  
! To train Spambayes (which you don't need to do if you're going to be using
! the POP3 proxy to classify messages, but you'll get better results from
! the outset if you do) you need to save your incoming email for awhile,
  segregating it into two piles, known spam and known ham (ham is our nickname
  for good mail).  It's best to train on recent email, because your interests
***************
*** 21,45 ****
  ham and my spam".  It will then process that mail and save information about
  different patterns which appear in ham and spam.  That information is then
! used during the filtering stage.
  
  When Spambayes filters your email, it compares each unclassified message
  against the information it saved from training and makes a decision about
  whether it thinks the message qualifies as ham or spam, or if it's unsure
! about how to classify the message.
  
  In the sections below, are gathered notes about how Spambayes can be
! integrated into your mail processing system.  As a general requirement, you
! must have a recent version of Python installed on your computer, version
! 2.2 or later.  (Don't ask about backporting it to earlier versions of
! Python.  It's almost a certainty this won't happen.)  If you need to install
! Python on your system, check the Python download page for the version
! appropriate to your computer:
  
      http://www.python.org/download/
  
  
! Training
  --------
  
  Given a pair of Unix mailbox format files (each message starts with a line
  which begins with 'From '), one containing nothing but spam and the other
--- 23,170 ----
  ham and my spam".  It will then process that mail and save information about
  different patterns which appear in ham and spam.  That information is then
! used during the filtering stage.  See the "Command-line training" section
! below for details.
  
  When Spambayes filters your email, it compares each unclassified message
  against the information it saved from training and makes a decision about
  whether it thinks the message qualifies as ham or spam, or if it's unsure
! about how to classify the message.  It adds its classification to the message
! by adding a header, X-Spambayes-Classification: spam|ham|unsure.  You can
! then filter on this header, to file away suspected spam into its own mail
! folder for example.
  
  In the sections below, are gathered notes about how Spambayes can be
! integrated into your mail processing system.
! 
! 
! Requirements
! ------------
! 
! As a general requirement, you must have a recent version of Python installed
! on your computer, version 2.2 or later.  (Don't ask about backporting it to
! earlier versions of Python.  It's almost a certainty this won't happen.)  If
! you need to install Python on your system, check the Python download page
! for the version appropriate to your computer:
  
      http://www.python.org/download/
  
+ You also need version 2.4.3 or above of the Python "email" package.  If
+ you're running Python 2.3 (which at the time of writing is only available
+ from SourceForge CVS) then you already have this.  If not, you can download
+ it from http://mimelib.sf.net and install it - unpack the archive, cd to
+ the email-2.4.3 directory and type "python setup.py install" (YMMV on
+ different platforms).  This will install it into your Python site-packages
+ directory.  You'll also need to move aside the standard "email" library -
+ go to your Python "Lib" directory and rename "email" to "email_old".
  
! 
! Overview
  --------
  
+ There are six main components to the Spambayes system:
+ 
+  o A database.  Loosely speaking, this is a collection of words and
+    associated spam and ham probabilities.  The database says "If a message
+    contains the word 'Viagra' then there's a 98% chance that it's spam, and
+    a 2% chance that it's ham."  This database is created by training - you
+    give it messages, tell it whether those messages are ham or spam, and it
+    adjusts its probabilities accordingly.  How to train it is covered
+    below.  By default it lives in a file called "hammie.db".
+ 
+  o The tokenizer/classifier.  This is the core engine of the system.  The
+    tokenizer splits emails into tokens (words, roughly speaking), and the
+    classifier looks at those tokens to determine whether the message looks
+    like spam or not.  You don't use the tokenizer/classifier directly -
+    it powers the other parts of the system.
+ 
+  o The POP3 proxy.  This sits between your email client (Eudora, Outlook
+    Express, etc) and your email server, and adds the classification header
+    to emails as you download them.  A typical user's email setup looks
+    like this:
+ 
+        +-----------------+                              +-------------+
+        | Outlook Express |      Internet or intranet    |             |
+        |  (or similar)   | <--------------------------> | POP3 server |
+        |                 |                              |             |
+        +-----------------+                              +-------------+
+ 
+    The POP3 server runs either at your ISP for internet mail, or somewhere
+    on your internal network for corporate mail.  The POP3 proxy sits in the
+    middle and adds the classification header as you retrieve your email:
+ 
+        +-----------------+        +------------+        +-------------+
+        | Outlook Express |        | Spambayes  |        |             |
+        |  (or similar)   | <----> | POP3 proxy | <----> | POP3 server |
+        |                 |        |            |        |             |
+        +-----------------+        +------------+        +-------------+
+ 
+    So where you currently have your email client configured to talk to,
+    say, "pop3.my-isp.com", you instead configure the *proxy* to talk to
+    "pop3.my-isp.com" and configure your email client to talk to the proxy.
+    The POP3 proxy can live on your PC, or on the same machine as the POP3
+    server, or on a different machine entirely; it really doesn't matter.
+    If it's living on your PC, you'd configure your email client to talk
+    to "localhost".  You can configure the proxy to talk to multiple POP3
+    servers, if you have more than one email account.
+ 
+  o The web interface.  This is a server that runs alongside the POP3 proxy
+    and lets you control it through the web.  You can upload emails to it
+    for training or classification, query the probabilities database ("How
+    many of my emails really *do* contain the word Viagra?") and most
+    importantly, train it on the emails you've received.  When you start
+    using the system, unless you train it using the Hammie script it will
+    classify most things as Unsure, and often make mistakes.  But it keeps
+    copies of all the email's its seen, and through the web interface you
+    can train it by going through a list of all the emails you've received
+    and checking a Ham/Spam box next to each one.  After training on a few
+    messages (say 20 spams and 20 hams), you'll find that it's getting it
+    right most of the time.   The web training interface automatically
+    checks the Ham/Spam boxes according to what it thinks, so all you need
+    to do is correct the odd mistake - it's very quick and easy.
+ 
+  o The Outlook plug-in.  For Outlook 2000 users (not Outlook Express) this
+    lets you manage the whole thing from within Outlook.  You set up a Ham
+    folder and a Spam folder, and train it simply by dragging messages into
+    those folders.  Alternatively there are buttons to do the same thing.
+    And it integrates into Outlook's filtering system to make it easy to
+    file all the suspected spam into its own folder, for instance.
+ 
+  o The Hammie script.  This does three jobs: command-line training,
+    procmail filtering, and XML-RPC.  See below for details of how to use
+    Hammie for training, and how to use it as a procmail filter.  Hammie can
+    also run as an XML-RPC server, so that a programmer can write code that
+    uses a remote server to classify emails programmatically - see
+    hammiesrv.py.
+ 
+ 
+ Where things live
+ -----------------
+ 
+ The Hammie script is called hammie.py.  The POP3 proxy and the web
+ interface live in pop3proxy.py.  The Outlook plug-in lives in the
+ Outlook2000 subdirectory - see the README.txt in that directory for more
+ information on that.
+ 
+ As well as these components, there's also a whole pile of utility scripts,
+ test harnesses and so on - see README.txt and TESTING.txt in the spambayes
+ distribution for more information.
+ 
+ 
+ Configuration
+ -------------
+ 
+ The system is configured through a file called "bayescustomize.ini".  In
+ here you can configure the name and type of your database, the POP3
+ server(s) you want to proxy to, the ports you want the proxy and the web
+ interface to run on, and so on.  You can also control details like how sure
+ you want the system to be that a message really is spam before it marks it as
+ such.  The default values for all the options, and the documentation for
+ them, all live in Options.py.  To change an option, create a
+ bayescustomize.ini and add the option to that - don't edit Options.py.
+ 
+ 
+ Command-line training
+ ---------------------
+ 
  Given a pair of Unix mailbox format files (each message starts with a line
  which begins with 'From '), one containing nothing but spam and the other
***************
*** 48,73 ****
      hammie.py -g ~/tmp/newham -s ~/tmp/newspam
  
! The above command is Unix-centric.  In other environments it's likely that a
! less command-line-oriented tool will be available in the near future.
  
  
! Windows
! -------
  
! TBD.
  
  
! Unix/Linux
! ----------
  
! Unlike Windows, there are too many combinations of mail reading tools (mutt,
! pine, Eudora, ...) and mail transport and delivery tools (sendmail, exim,
! procmail, qmail, ...) to attempt to be exhaustive about how to integrate
! Spambayes into your environment at this time.  This section just documents
! some of what's possible.
  
  
! Procmail
! --------
  
  Many people on Unix-like systems have procmail available as an optional or
--- 173,251 ----
      hammie.py -g ~/tmp/newham -s ~/tmp/newspam
  
! The above command is command-line-centric (eg. unix, or Windows command
! prompt).  You can also use the web interface for training as detailed below.
  
  
! Minimal setup for using the POP3 proxy and web interface
! --------------------------------------------------------
  
! The minimum you need to do to get started is create a bayescustomize.ini
! containing the following:
  
+ [pop3proxy]
+ pop3proxy_servers: pop3.my-isp.com
  
! where "pop3.my-isp.com" is wherever you currently have your email client
! configured to collect mail from.  The proxy will run on port 110 - if you're
! already running a real POP3 proxy on that port, or you're running on a
! platform that won't let unprivileged processes use that port (eg. unix),
! you can use a different one by adding a line like this:
  
! pop3proxy_ports: 1110
  
+ to the [pop3proxy] section of bayescustomize.ini.
  
! You can now run the proxy by running "python pop3proxy.py".  This will
! print some status messages, which should include:
! 
! BayesProxyListener listening on port 110.
! UserInterfaceListener listening on port 8880.
! 
! What that means is that the POP3 proxy is ready for your email client to
! connect to it on port 110 and that the web interface is ready for your
! browser to connect to it.  The address of the web interface is
! http://localhost:8880/ (or if you're running it on a different machine,
! replace 'localhost' with the name of the machine).  You can have a look
! at the web interface now, but it won't be very interesting because the
! system hasn't seen any messages yet.
! 
! 
! Reading emails and training the classifier
! ------------------------------------------
! 
! You now need to configure your email client to talk to the proxy instead of
! the real email server.  Change your equivalent of "pop3.my-isp.com" to
! "localhost" (or to the name of the machine you're running the proxy on) in
! your email client's setup.  Hit "Get new email" and look at the headers of
! the emails (send yourself an email if you don't have any!) - there should
! be an X-Spambayes-Classification header there.  It probably says "unsure",
! if you haven't done any training yet.  You should be able to create a
! mail folder called "Suspected spam" and set up a filtering rule that puts
! emails with an "X-Spambayes-Classification: spam" header into that folder.
! (Eventually we should publish instructions on how to do this in all the
! popular email clients).
! 
! You can now train the system through the web interface - follow the "Review
! messages" link and you'll see a list of the emails that the system has seen
! so far.  Check the appropriate boxes and hit Train.  The messages disappear
! (eventually you'll be able to get back to them, for instance to correct any
! training mistakes) and if you go back to the home page you'll see that the
! "Total emails trained" has increased.
! 
! Once you've done this on a few spams and a few hams, you'll find that the
! X-Spambayes-Classification header is getting it right most of the time.  The
! more you train it the more accurate it gets.  There's no need to train it on
! every message you receive, but you should train on a few spams and a few
! hams on a regular basis.  You should also try to train it on about the same
! number of spams as hams.
! 
! You can train it on lots of messages in one go by either using the Hammie
! script as explained in the "Command-line training" section, or by giving
! messages to the web interface via the "Train" form on the Home page.  You
! can train on individual messages (which is tedious) or using mbox files.
! 
! 
! Procmail filtering
! ------------------
  
  Many people on Unix-like systems have procmail available as an optional or
***************
*** 90,97 ****
  The result of running hammie.py in filter mode is that Procmail will use the
  output from the run as the mail message for further processing downstream.
! Hammie.py inserts an X-Hammie-Disposition header in the output message which
! looks like
  
!     X-Hammie-Disposition: No; 0.00; '*H*': 1.00; '*S*': 0.00; 'python': 0.00;
  	'linux,': 0.01; 'desirable': 0.01; 'cvs,': 0.01; 'perl.': 0.02;
  	...
--- 268,275 ----
  The result of running hammie.py in filter mode is that Procmail will use the
  output from the run as the mail message for further processing downstream.
! Hammie.py inserts an X-Spambayes-Classification header in the output message
! which looks like:
  
!     X-Spambayes-Classification: ham; 0.00; '*H*': 1.00; '*S*': 0.00; 'python': 0.00;
  	'linux,': 0.01; 'desirable': 0.01; 'cvs,': 0.01; 'perl.': 0.02;
  	...
***************
*** 101,109 ****
  
      :0
!     * ^X-Hammie-Disposition: Yes
      spam
  
      :0
!     * ^X-Hammie-Disposition: Unsure
      unsure
  
--- 279,287 ----
  
      :0
!     * ^X-Spambayes-Classification: spam
      spam
  
      :0
!     * ^X-Spambayes-Classification: unsure
      unsure
  
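
The header-based filtering described in this revision of INTEGRATION.txt can
be sketched in Python 3 as below.  It only shows the idea of filtering on
X-Spambayes-Classification; folder_for is a hypothetical helper, and the
folder names "Unsure" and "Inbox" are assumptions ("Suspected spam" is the
name the document suggests).

```python
from email import message_from_string

HEADER = "X-Spambayes-Classification"

def folder_for(raw_message):
    """Pick a mail folder from the classification header (illustrative)."""
    msg = message_from_string(raw_message)
    value = msg.get(HEADER, "")
    # hammie.py's header may carry "; score; clues" after the verdict, so
    # keep only the text before the first semicolon.
    verdict = value.split(";", 1)[0].strip().lower()
    # Assumed folder names; ham and unclassified mail stay in the inbox.
    folders = {"spam": "Suspected spam", "unsure": "Unsure"}
    return folders.get(verdict, "Inbox")
```

A mail client rule or a procmail recipe does the same job; the sketch is
only meant to make the matching on the verdict explicit.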

From jhylton@users.sourceforge.net  Tue Nov 26 17:16:37 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Tue, 26 Nov 2002 09:16:37 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.5,1.6
Message-ID: <E18GjK9-0005tq-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory sc8-pr-cvs1:/tmp/cvs-serv22659/pspam/pspam

Modified Files:
	profile.py 
Log Message:
Minimal and ultimately fruitless attempt to get this code up to date.


Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** profile.py	12 Nov 2002 07:03:20 -0000	1.5
--- profile.py	26 Nov 2002 17:16:35 -0000	1.6
***************
*** 44,56 ****
  class WordInfo(Persistent):
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
!         self.atime = atime
!         self.spamcount = self.hamcount = self.killcount = 0
!         self.spamprob = spamprob
  
      def __repr__(self):
!         return "WordInfo%r" % repr((self.atime, self.spamcount,
!                                     self.hamcount, self.killcount,
!                                     self.spamprob))
  
  class PBayes(classifier.Bayes, Persistent):
--- 44,55 ----
  class WordInfo(Persistent):
  
!     def __init__(self):
!         self.spamcount = self.hamcount = 0
  
      def __repr__(self):
!         return "WordInfo(%r, %r)" % (self.spamcount, self.hamcount)
! 
! class PMetaInfo(classifier.MetaInfo, Persistent):
!     pass
  
  class PBayes(classifier.Bayes, Persistent):
***************
*** 61,64 ****
--- 60,64 ----
          classifier.Bayes.__init__(self)
          self.wordinfo = IterOOBTree()
+         self.meta = PMetaInfo()
  
      # XXX what about the getstate and setstate defined in base class
***************
*** 88,93 ****
          changed1 = self._update(self.hams, False)
          changed2 = self._update(self.spams, True)
!         if changed1 or changed2:
!             self.classifier.update_probabilities()
          get_transaction().commit()
          log("updated probabilities")
--- 88,93 ----
          changed1 = self._update(self.hams, False)
          changed2 = self._update(self.spams, True)
! ##        if changed1 or changed2:
! ##            self.classifier.update_probabilities()
          get_transaction().commit()
          log("updated probabilities")
***************
*** 111,120 ****
              # Otherwise some new entries will cause scoring to fail.
              for msg in added.keys():
!                 self.classifier.learn(tokenize(msg), is_spam, False)
              del added
              get_transaction().commit(1)
              log("learned")
              for msg in removed.keys():
!                 self.classifier.unlearn(tokenize(msg), is_spam, False)
              if removed:
                  log("unlearned")
--- 111,120 ----
              # Otherwise some new entries will cause scoring to fail.
              for msg in added.keys():
!                 self.classifier.learn(tokenize(msg), is_spam)
              del added
              get_transaction().commit(1)
              log("learned")
              for msg in removed.keys():
!                 self.classifier.unlearn(tokenize(msg), is_spam)
              if removed:
                  log("unlearned")


From jhylton@users.sourceforge.net  Tue Nov 26 17:16:56 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Tue, 26 Nov 2002 09:16:56 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.58,1.59
Message-ID: <E18GjKS-0005vh-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv22765

Modified Files:
	classifier.py 
Log Message:
Remove needless parens.


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.58
retrieving revision 1.59
diff -C2 -d -r1.58 -r1.59
*** classifier.py	25 Nov 2002 21:02:03 -0000	1.58
--- classifier.py	26 Nov 2002 17:16:52 -0000	1.59
***************
*** 71,75 ****
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         (self._nspam, self._nham) = t[1:]
          self.revision = 0
  
--- 71,75 ----
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self._nspam, self._nham = t[1:]
          self.revision = 0
  

From npickett@users.sourceforge.net  Tue Nov 26 20:22:08 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Tue, 26 Nov 2002 12:22:08 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.59,1.60
Message-ID: <E18GmDg-0001kG-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv6102

Modified Files:
	classifier.py 
Log Message:
* MetaInfo doesn't need the revision anymore, so it's gone.  This
  makes the class much simpler.


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.59
retrieving revision 1.60
diff -C2 -d -r1.59 -r1.60
*** classifier.py	26 Nov 2002 17:16:52 -0000	1.59
--- classifier.py	26 Nov 2002 20:22:05 -0000	1.60
***************
*** 52,58 ****
      """Information about the corpora.
  
!     Contains nham and nspam, used for calculating probabilities.  Also
!     has a revision, incremented every time nham or nspam is adjusted.
!     Nothing uses this, currently, but it's there if you want it.
  
      """
--- 52,56 ----
      """Information about the corpora.
  
!     Contains nham and nspam, used for calculating probabilities.
  
      """
***************
*** 66,94 ****
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self._nspam, self._nham)
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self._nspam, self._nham = t[1:]
          self.revision = 0
- 
-     def incr_rev(self):
-         self.revision += 1
- 
-     def get_nham(self):
-         return self._nham
-     def set_nham(self, val):
-         self._nham = val
-         self.incr_rev()
-     nham = property(get_nham, set_nham)
- 
-     def set_nspam(self, val):
-         self._nspam = val
-     def get_nspam(self):
-         return self._nspam
-     nspam = property(get_nspam, set_nspam)
- 
- 
  
  
--- 64,74 ----
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self.nspam, self.nham)
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.nspam, self.nham = t[1:]
          self.revision = 0
  
  
From mhammond@users.sourceforge.net  Wed Nov 27 05:49:58 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 26 Nov 2002 21:49:58 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000 addin.py,1.39,1.40 manager.py,1.36,1.37
Message-ID: <E18Gv5C-0007bq-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory sc8-pr-cvs1:/tmp/cvs-serv29012

Modified Files:
	addin.py manager.py 
Log Message:
First steps to stand-alone filter - check sys.frozen, and use sys.argv[0]
rather than __file__ to determine where we are.  Apparently this will work
with Gordon's installer <wink>
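
The check described above can be sketched on its own, outside the Outlook
add-in.  This is only an illustration: application_dir and image_path are
hypothetical helper names, not functions from addin.py or manager.py, and
the "images" subdirectory is taken from the diff below.

```python
import os
import sys


def application_dir(module_file=None):
    """Return the directory that data files should be resolved against."""
    if hasattr(sys, "frozen"):
        # Frozen executable: __file__ may not be usable, so fall back to
        # the location of the program that was launched.
        return os.path.dirname(os.path.abspath(sys.argv[0]))
    if module_file is not None:
        # Normal source install: resolve relative to the given .py file.
        return os.path.dirname(os.path.abspath(module_file))
    # No module file supplied: sys.argv[0] is the best remaining guess.
    return os.path.dirname(os.path.abspath(sys.argv[0]))


def image_path(fname, module_file=None):
    """Leave absolute paths alone; put relative ones under 'images'."""
    if os.path.isabs(fname):
        return fname
    return os.path.join(application_dir(module_file), "images", fname)
```

A caller inside a module would pass its own __file__, for example
image_path("logo.bmp", __file__); the file name here is made up.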


Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** addin.py	24 Nov 2002 22:43:43 -0000	1.39
--- addin.py	27 Nov 2002 05:49:52 -0000	1.40
***************
*** 349,353 ****
      # this, we can not simply perform this load once and reuse the image.
      if not os.path.isabs(fname):
!         fname = os.path.join( os.path.dirname(__file__), "images", fname)
      if not os.path.isfile(fname):
          print "WARNING - Trying to use image '%s', but it doesn't exist" % (fname,)
--- 349,359 ----
      # this, we can not simply perform this load once and reuse the image.
      if not os.path.isabs(fname):
!         if hasattr(sys, "frozen"):
!             # images relative to the executable.
!             fname = os.path.join(os.path.dirname(sys.argv[0]),
!                                  "images", fname)
!         else:
!             # Ensure references are relative to this .py file
!             fname = os.path.join( os.path.dirname(__file__), "images", fname)
      if not os.path.isfile(fname):
          print "WARNING - Trying to use image '%s', but it doesn't exist" % (fname,)

Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** manager.py	24 Nov 2002 22:43:43 -0000	1.36
--- manager.py	27 Nov 2002 05:49:53 -0000	1.37
***************
*** 19,24 ****
  
  try:
!     this_filename = os.path.abspath(__file__)
! except NameError:
      this_filename = os.path.abspath(sys.argv[0])
  
--- 19,27 ----
  
  try:
!     if hasattr(sys, "frozen"):
!         this_filename = os.path.abspath(sys.argv[0])
!     else:
!         this_filename = os.path.abspath(__file__)
! except NameError: # no __file__
      this_filename = os.path.abspath(sys.argv[0])
  

From mhammond@users.sourceforge.net  Wed Nov 27 05:50:00 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 26 Nov 2002 21:50:00 -0800
Subject: [Spambayes-checkins] 
 spambayes/Outlook2000/dialogs ManagerDialog.py,1.8,1.9
Message-ID: <E18Gv5E-0007c4-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory sc8-pr-cvs1:/tmp/cvs-serv29012/dialogs

Modified Files:
	ManagerDialog.py 
Log Message:
First steps to stand-alone filter - check sys.frozen, and use sys.argv[0]
rather than __file__ to determine where we are.  Apparently this will work
with Gordon's installer <wink>


Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** ManagerDialog.py	7 Nov 2002 22:30:10 -0000	1.8
--- ManagerDialog.py	27 Nov 2002 05:49:56 -0000	1.9
***************
*** 1,3 ****
! import os
  import operator
  
--- 1,3 ----
! import os, sys
  import operator
  
***************
*** 156,165 ****
      def OnButAbout(self, id, code):
          if code == win32con.BN_CLICKED:
! 
!             fname = os.path.join(os.path.dirname(__file__),
!                                  os.pardir,
!                                  "about.html")
              fname = os.path.abspath(fname)
-             print fname
              if os.path.isfile(fname):
                  win32ui.DoWaitCursor(1)
--- 156,169 ----
      def OnButAbout(self, id, code):
          if code == win32con.BN_CLICKED:
!             if hasattr(sys, "frozen"):
!                 # Same directory as to the executable.
!                 fname = os.path.join(os.path.dirname(sys.argv[0]),
!                                      "about.html")
!             else:
!                 # In the parent (ie, main Outlook2000) dir
!                 fname = os.path.join(os.path.dirname(__file__),
!                                      os.pardir,
!                                      "about.html")
              fname = os.path.abspath(fname)
              if os.path.isfile(fname):
                  win32ui.DoWaitCursor(1)


From richiehindle@users.sourceforge.net  Wed Nov 27 17:04:06 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 27 Nov 2002 09:04:06 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.24,1.25
Message-ID: <E18H5ba-0000M5-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv1148

Modified Files:
	pop3proxy.py 
Log Message:
Use Tim's new HTML-stripping devices to build the hovertips for HTML-only emails.


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** pop3proxy.py	26 Nov 2002 16:22:11 -0000	1.24
--- pop3proxy.py	27 Nov 2002 17:04:02 -0000	1.25
***************
*** 989,993 ****
                  try:
                      part = typed_subpart_iterator(message, 'text', 'html').next()
!                     text = tokenizer.html_re.sub(' ', part.get_payload())
                      text = '(this message only has an HTML body)\n' + text
                  except StopIteration:
--- 989,996 ----
                  try:
                      part = typed_subpart_iterator(message, 'text', 'html').next()
!                     text = part.get_payload()
!                     text, _ = tokenizer.crack_html_style(text)
!                     text, _ = tokenizer.crack_html_comment(text)
!                     text = tokenizer.html_re.sub(' ', text)
                      text = '(this message only has an HTML body)\n' + text
                  except StopIteration:


From richiehindle@users.sourceforge.net  Wed Nov 27 18:44:44 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 27 Nov 2002 10:44:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.25,1.26
Message-ID: <E18H7Ay-0006aW-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv25125

Modified Files:
	pop3proxy.py 
Log Message:
 o The web interface now decodes charset-sections in headers, so that
   François' name should be displayed correctly on the Review page.  8-)
 o You can now click the Discard / Defer / Ham / Spam headers to check
   all the radio buttons in a section in one go (assuming the Javascript works
   for you - feedback is welcome!)
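
The header decoding mentioned in the first item can be sketched as follows.
This is a minimal illustration written against the email.header module of
current Python, not the email.Header spelling used in the diff below, and
decode_header_text is a hypothetical name rather than a pop3proxy function.
It decodes each section with its declared charset instead of joining the
raw sections, which is an assumption about what a reader would want today.

```python
from email.header import decode_header


def decode_header_text(field):
    """Decode RFC 2047 encoded-words in a header value to one string."""
    parts = []
    for text, charset in decode_header(field):
        if isinstance(text, bytes):
            try:
                # Sections without a declared charset are treated as ASCII;
                # undecodable bytes are replaced rather than raising.
                text = text.decode(charset or "ascii", "replace")
            except LookupError:
                # Unknown charset name: fall back to Latin-1, which can
                # decode any byte sequence.
                text = text.decode("latin-1")
        parts.append(text)
    return "".join(parts)
```

The result would still need trimming and HTML-quoting before it is placed
in the Review page, as trimAndQuote does.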


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** pop3proxy.py	27 Nov 2002 17:04:02 -0000	1.25
--- pop3proxy.py	27 Nov 2002 18:44:41 -0000	1.26
***************
*** 134,138 ****
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import mailbox, storage, tokenizer, mboxutils
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
--- 134,138 ----
  import os, sys, re, operator, errno, getopt, string, cStringIO, time, bisect
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
! import mailbox, storage, tokenizer, mboxutils, email.Header
  from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
***************
*** 615,618 ****
--- 615,619 ----
                                 font-weight: bold }
               .sectionbody { padding: 1em }
+              .reviewheaders a { color: #000000 }
               .stripe_on td { background: #f4f4f4 }
               </style>
***************
*** 664,671 ****
  
      reviewHeader = """<p>These are untrained emails, which you can use to
!                    train the classifier.  Check the Discard / Defer / Ham /
!                    Spam buttton for each email, then click 'Train' below.
!                    (Defer leaves the message here, to be trained on
!                    later.)</p>
                     <form action='review' method='GET'>
                         <input type='hidden' name='prior' value='%d'>
--- 665,673 ----
  
      reviewHeader = """<p>These are untrained emails, which you can use to
!                    train the classifier.  Check the appropriate buttton for
!                    each email, then click 'Train' below.  'Defer' leaves the
!                    message here, to be trained on later.  Click one of the
!                    Discard / Defer / Ham / Spam headers to check all of the
!                    buttons in that section in one go.</p>
                     <form action='review' method='GET'>
                         <input type='hidden' name='prior' value='%d'>
***************
*** 684,690 ****
                     """
  
!     reviewSubheader = """<tr><td><b>Messages classified as %s:</b></td>
!                           <td><b>From:</b></td>
!                           <td><b>Discard / Defer / Ham / Spam</b></td></tr>"""
  
      upload = """<form action='%s' method='POST'
--- 686,719 ----
                     """
  
!     onReviewHeader = \
!     """<script type='text/javascript'>
!     function onHeader(type, switchTo)
!     {
!         if (document.forms && document.forms.length >= 2)
!         {
!             form = document.forms[1];
!             for (i = 0; i < form.length; i++)
!             {
!                 splitName = form[i].name.split(':');
!                 if (splitName.length == 3 && splitName[1] == type &&
!                     form[i].value == switchTo.toLowerCase())
!                 {
!                     form[i].checked = true;
!                 }
!             }
!         }
!     }
!     </script>
!     """
! 
!     reviewSubheader = \
!         """<tr><td><b>Messages classified as %s:</b></td>
!           <td><b>From:</b></td>
!           <td class='reviewheaders'><b>
!               <a href='javascript: onHeader("%s", "Discard");'>Discard</a> /
!               <a href='javascript: onHeader("%s", "Defer");'>Defer</a> /
!               <a href='javascript: onHeader("%s", "Ham");'>Ham</a> /
!               <a href='javascript: onHeader("%s", "Spam");'>Spam</a>
!           </b></td></tr>"""
  
      upload = """<form action='%s' method='POST'
***************
*** 833,837 ****
      def trimAndQuote(self, field, limit, quote=False):
          """Trims a string, adding an ellipsis if necessary, and
!         HTML-quotes it."""
          if len(field) > limit:
              field = field[:limit-3] + "..."
--- 862,871 ----
      def trimAndQuote(self, field, limit, quote=False):
          """Trims a string, adding an ellipsis if necessary, and
!         HTML-quotes it.  Also pumps it through email.Header.decode_header,
!         which understands charset sections in email headers - I suspect
!         this will only work for Latin character sets, but hey, it works for
!         Francois Granger's name.  8-)"""
!         sections = email.Header.decode_header(field)
!         field = ' '.join([text for text, _ in sections])
          if len(field) > limit:
              field = field[:limit-3] + "..."
***************
*** 970,980 ****
          return keys, date, prior, start, end
  
!     def appendMessages(self, lines, keyedMessages, judgement):
          """Appends the lines of a table of messages to 'lines'."""
          buttons = \
!              """<input type='radio' name='classify:%s' value='discard'>&nbsp;
!                 <input type='radio' name='classify:%s' value='defer' %s>&nbsp;
!                 <input type='radio' name='classify:%s' value='ham' %s>&nbsp;
!                 <input type='radio' name='classify:%s' value='spam' %s>"""
          stripe = 0
          for key, message in keyedMessages:
--- 1004,1014 ----
          return keys, date, prior, start, end
  
!     def appendMessages(self, lines, keyedMessages, label):
          """Appends the lines of a table of messages to 'lines'."""
          buttons = \
!           """<input type='radio' name='classify:%s:%s' value='discard'>&nbsp;
!              <input type='radio' name='classify:%s:%s' value='defer' %s>&nbsp;
!              <input type='radio' name='classify:%s:%s' value='ham' %s>&nbsp;
!              <input type='radio' name='classify:%s:%s' value='spam' %s>"""
          stripe = 0
          for key, message in keyedMessages:
***************
*** 1002,1013 ****
              # Output the table row for this message.
              defer = ham = spam = ""
!             if judgement == options.header_spam_string:
                  spam='checked'
!             elif judgement == options.header_ham_string:
                  ham='checked'
!             elif judgement == options.header_unsure_string:
                  defer='checked'
              subject = "<span title=\"%s\">%s</span>" % (text, subject)
!             radioGroup = buttons % (key, key, defer, key, ham, key, spam)
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
--- 1036,1050 ----
              # Output the table row for this message.
              defer = ham = spam = ""
!             if label == 'Spam':
                  spam='checked'
!             elif label == 'Ham':
                  ham='checked'
!             elif label == 'Unsure':
                  defer='checked'
              subject = "<span title=\"%s\">%s</span>" % (text, subject)
!             radioGroup = buttons % (label, key,
!                                     label, key, defer,
!                                     label, key, ham,
!                                     label, key, spam)
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
***************
*** 1024,1028 ****
          for key, value in params.items():
              if key.startswith('classify:'):
!                 id = key.split(':', 1)[1]
                  if value == 'spam':
                      targetCorpus = state.spamCorpus
--- 1061,1065 ----
          for key, value in params.items():
              if key.startswith('classify:'):
!                 id = key.split(':')[2]
                  if value == 'spam':
                      targetCorpus = state.spamCorpus
***************
*** 1103,1114 ****
              if not next:
                  nextState = 'disabled'
!             lines = [self.reviewHeader % (prior, next, priorState, nextState)]
!             for header, type in ((options.header_spam_string, 'Spam'),
!                                  (options.header_ham_string, 'Ham'),
!                                  (options.header_unsure_string, 'Unsure')):
                  if keyedMessages[header]:
                      lines.append("<tr><td>&nbsp;</td><td></td><td></td></tr>")
!                     lines.append(self.reviewSubheader % type)
!                     self.appendMessages(lines, keyedMessages[header], header)
  
              lines.append("""<tr><td></td><td></td><td align='middle'>&nbsp;<br>
--- 1140,1153 ----
              if not next:
                  nextState = 'disabled'
!             lines = [self.onReviewHeader,
!                      self.reviewHeader % (prior, next, priorState, nextState)]
!             for header, label in ((options.header_spam_string, 'Spam'),
!                                   (options.header_ham_string, 'Ham'),
!                                   (options.header_unsure_string, 'Unsure')):
                  if keyedMessages[header]:
                      lines.append("<tr><td>&nbsp;</td><td></td><td></td></tr>")
!                     lines.append(self.reviewSubheader %
!                                  (label, label, label, label, label))
!                     self.appendMessages(lines, keyedMessages[header], label)
  
              lines.append("""<tr><td></td><td></td><td align='middle'>&nbsp;<br>


From npickett@users.sourceforge.net  Wed Nov 27 22:38:00 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Wed, 27 Nov 2002 14:38:00 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.60,1.61
	hammie.py,1.43,1.44 storage.py,1.2,1.3 dbdict.py,1.4,NONE
Message-ID: <E18HAoi-0008Gz-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv31393

Modified Files:
	classifier.py hammie.py storage.py 
Removed Files:
	dbdict.py 
Log Message:
* Caching dbdict implementation.  You'll have to retrain your
  databases again (sorry)
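
One way to picture a caching store behind the _wordinfoget, _wordinfoset
and _wordinfodel hooks introduced in classifier.py below is a write-back
cache in front of a dict-like database.  This is a sketch under stated
assumptions, not the code from storage.py: CachingWordStore, its backing
argument and its store method are hypothetical, and a plain dict stands in
for the database.

```python
class WordInfo(object):
    """Per-token training counts."""

    def __init__(self):
        self.spamcount = self.hamcount = 0


class CachingWordStore(object):
    """Write-back cache in front of a dict-like backing store."""

    def __init__(self, backing):
        self.backing = backing   # any mapping, e.g. a dict or a shelve
        self.cache = {}          # word -> record read or written so far
        self.deleted = set()     # words removed but not yet flushed

    def _wordinfoget(self, word):
        if word in self.deleted:
            return None
        record = self.cache.get(word)
        if record is None:
            # Cache miss: consult the backing store and remember the hit.
            record = self.backing.get(word)
            if record is not None:
                self.cache[word] = record
        return record

    def _wordinfoset(self, word, record):
        self.deleted.discard(word)
        self.cache[word] = record

    def _wordinfodel(self, word):
        self.cache.pop(word, None)
        self.deleted.add(word)

    def store(self):
        """Flush pending deletions, then every cached record."""
        for word in self.deleted:
            self.backing.pop(word, None)
        self.deleted.clear()
        self.backing.update(self.cache)
```

Nothing reaches the backing store until store() is called, so training a
large mailbox touches the database once per distinct word rather than once
per occurrence.  The price is that store() also rewrites records that were
only read.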


Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.60
retrieving revision 1.61
diff -C2 -d -r1.60 -r1.61
*** classifier.py	26 Nov 2002 20:22:05 -0000	1.60
--- classifier.py	27 Nov 2002 22:37:55 -0000	1.61
***************
*** 47,75 ****
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 4
! 
! class MetaInfo(object):
!     """Information about the corpora.
! 
!     Contains nham and nspam, used for calculating probabilities.
! 
!     """
!     def __init__(self):
!         self.__setstate__((PICKLE_VERSION, 0, 0))
! 
!     def __repr__(self):
!         return "MetaInfo%r" % repr((self._nspam,
!                                     self._nham,
!                                     self.revision))
! 
!     def __getstate__(self):
!         return (PICKLE_VERSION, self.nspam, self.nham)
! 
!     def __setstate__(self, t):
!         if t[0] != PICKLE_VERSION:
!             raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.nspam, self.nham = t[1:]
!         self.revision = 0
! 
  
  class WordInfo(object):
--- 47,51 ----
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 5
  
  class WordInfo(object):
***************
*** 109,138 ****
      def __init__(self):
          self.wordinfo = {}
-         self.meta = MetaInfo()
          self.probcache = {}
  
      def __getstate__(self):
!         return PICKLE_VERSION, self.wordinfo, self.meta
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.wordinfo, self.meta = t[1:]
          self.probcache = {}
  
-     # Slacker's way out--pass calls to nham/nspam up to the meta class
- 
-     def get_nham(self):
-         return self.meta.nham
-     def set_nham(self, val):
-         self.meta.nham = val
-     nham = property(get_nham, set_nham)
- 
-     def get_nspam(self):
-         return self.meta.nspam
-     def set_nspam(self, val):
-         self.meta.nspam = val
-     nspam = property(get_nspam, set_nspam)
- 
      # spamprob() implementations.  One of the following is aliased to
      # spamprob, depending on option settings.
--- 85,100 ----
      def __init__(self):
          self.wordinfo = {}
          self.probcache = {}
+         self.nspam = self.nham = 0
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self.wordinfo, self.nspam, self.nham)
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         (self.wordinfo, self.nspam, self.nham) = t[1:]
          self.probcache = {}
  
      # spamprob() implementations.  One of the following is aliased to
      # spamprob, depending on option settings.
***************
*** 331,336 ****
              pass
  
!         nham = float(self.meta.nham or 1)
!         nspam = float(self.meta.nspam or 1)
  
          assert hamcount <= nham
--- 293,298 ----
              pass
  
!         nham = float(self.nham or 1)
!         nspam = float(self.nspam or 1)
  
          assert hamcount <= nham
***************
*** 420,431 ****
          self.probcache = {}    # nuke the prob cache
          if is_spam:
!             self.meta.nspam += 1
          else:
!             self.meta.nham += 1
  
-         wordinfo = self.wordinfo
-         wordinfoget = wordinfo.get
          for word in Set(wordstream):
!             record = wordinfoget(word)
              if record is None:
                  record = self.WordInfoClass()
--- 382,391 ----
          self.probcache = {}    # nuke the prob cache
          if is_spam:
!             self.nspam += 1
          else:
!             self.nham += 1
  
          for word in Set(wordstream):
!             record = self._wordinfoget(word)
              if record is None:
                  record = self.WordInfoClass()
***************
*** 436,441 ****
                  record.hamcount += 1
  
!             # Needed to tell a persistent DB that the content changed.
!             wordinfo[word] = record
  
  
--- 396,400 ----
                  record.hamcount += 1
  
!             self._wordinfoset(word, record)
  
  
***************
*** 443,458 ****
          self.probcache = {}    # nuke the prob cache
          if is_spam:
!             if self.meta.nspam <= 0:
                  raise ValueError("spam count would go negative!")
!             self.meta.nspam -= 1
          else:
!             if self.meta.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.meta.nham -= -1
  
-         wordinfo = self.wordinfo
-         wordinfoget = wordinfo.get
          for word in Set(wordstream):
!             record = wordinfoget(word)
              if record is not None:
                  if is_spam:
--- 402,415 ----
          self.probcache = {}    # nuke the prob cache
          if is_spam:
!             if self.nspam <= 0:
                  raise ValueError("spam count would go negative!")
!             self.nspam -= 1
          else:
!             if self.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
!             self.nham -= -1
  
          for word in Set(wordstream):
!             record = self._wordinfoget(word)
              if record is not None:
                  if is_spam:
***************
*** 463,471 ****
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del wordinfo[word]
                  else:
!                     # Needed to tell a persistent DB that the content
!                     # changed.
!                     wordinfo[word] = record
  
      def _getclues(self, wordstream):
--- 420,426 ----
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     self._wordinfodel(word)
                  else:
!                     self._wordinfoset(word, record)
  
      def _getclues(self, wordstream):
***************
*** 476,482 ****
          pushclue = clues.append
  
-         wordinfoget = self.wordinfo.get
          for word in Set(wordstream):
!             record = wordinfoget(word)
              if record is None:
                  prob = unknown
--- 431,436 ----
          pushclue = clues.append
  
          for word in Set(wordstream):
!             record = self._wordinfoget(word)
              if record is None:
                  prob = unknown
***************
*** 492,495 ****
--- 446,459 ----
          # Return (prob, word, record).
          return [t[1:] for t in clues]
+ 
+     def _wordinfoget(self, word):
+         return self.wordinfo.get(word)
+ 
+     def _wordinfoset(self, word, record):
+         self.wordinfo[word] = record
+ 
+     def _wordinfodel(self, word):
+         del self.wordinfo[word]
+         
  
  
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** hammie.py	25 Nov 2002 20:49:17 -0000	1.43
--- hammie.py	27 Nov 2002 22:37:56 -0000	1.44
***************
*** 2,6 ****
  
  
- import dbdict
  import mboxutils
  import storage
--- 2,5 ----
***************
*** 45,49 ****
                           for word, prob in clues
                           if (word[0] == '*' or
!                              prob <= SHOWCLUE or prob >= 1.0 - SHOWCLUE)])
  
      def score(self, msg, evidence=False):
--- 44,49 ----
                           for word, prob in clues
                           if (word[0] == '*' or
!                              prob <= options.clue_mailheader_cutoff or
!                              prob >= 1.0 - options.clue_mailheader_cutoff)])
  
      def score(self, msg, evidence=False):

Index: storage.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/storage.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** storage.py	26 Nov 2002 00:43:51 -0000	1.2
--- storage.py	27 Nov 2002 22:37:56 -0000	1.3
***************
*** 5,9 ****
  Classes:
      PickledClassifier - Classifier that uses a pickle db
!     DBDictClassifier - Classifier that uses a DBDict db
      Trainer - Classifier training observer
      SpamTrainer - Trainer for spam
--- 5,9 ----
  Classes:
      PickledClassifier - Classifier that uses a pickle db
!     DBDictClassifier - Classifier that uses a DBM db
      Trainer - Classifier training observer
      SpamTrainer - Trainer for spam
***************
*** 18,23 ****
      databases.
  
!     DBDictClassifier is a Classifier class that uses a DBDict
!     datastore.
  
      Trainer is concrete class that observes a Corpus and trains a
--- 18,23 ----
      databases.
  
!     DBDictClassifier is a Classifier class that uses a database
!     store.
  
      Trainer is concrete class that observes a Corpus and trains a
***************
*** 50,55 ****
  from Options import options
  import cPickle as pickle
- import dbdict
  import errno
  
  PICKLE_TYPE = 1
--- 50,55 ----
  from Options import options
  import cPickle as pickle
  import errno
+ import shelve
  
  PICKLE_TYPE = 1
***************
*** 84,91 ****
              fp.close()
  
          if tempbayes:
              self.wordinfo = tempbayes.wordinfo
!             self.meta.nham = tempbayes.get_nham()
!             self.meta.nspam = tempbayes.get_nspam()
  
              if options.verbose:
--- 84,92 ----
              fp.close()
  
+         # XXX: why not self.__setstate__(tempbayes.__getstate__())?
          if tempbayes:
              self.wordinfo = tempbayes.wordinfo
!             self.nham = tempbayes.nham
!             self.nspam = tempbayes.nspam
  
              if options.verbose:
***************
*** 97,102 ****
                  print self.db_name,'is a new pickle'
              self.wordinfo = {}
!             self.meta.nham = 0
!             self.meta.nspam = 0
  
      def store(self):
--- 98,103 ----
                  print self.db_name,'is a new pickle'
              self.wordinfo = {}
!             self.nham = 0
!             self.nspam = 0
  
      def store(self):
***************
*** 110,124 ****
          fp.close()
  
-     def __getstate__(self):
-         return PICKLE_TYPE, self.wordinfo, self.meta
- 
-     def __setstate__(self, t):
-         if t[0] != PICKLE_TYPE:
-             raise ValueError("Can't unpickle -- version %s unknown" % t[0])
-         self.wordinfo, self.meta = t[1:]
- 
  
  class DBDictClassifier(classifier.Classifier):
!     '''Classifier object persisted in a WIDict'''
  
      def __init__(self, db_name, mode='c'):
--- 111,117 ----
          fp.close()
  
  
  class DBDictClassifier(classifier.Classifier):
!     '''Classifier object persisted in a caching database'''
  
      def __init__(self, db_name, mode='c'):
***************
*** 126,129 ****
--- 119,123 ----
  
          classifier.Classifier.__init__(self)
+         self.wordcache = {}
          self.statekey = "saved state"
          self.mode = mode
***************
*** 132,157 ****
  
      def load(self):
!         '''Load state from WIDict'''
  
          if options.verbose:
!             print 'Loading state from',self.db_name,'WIDict'
  
!         self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
!                              classifier.WordInfo,iterskip=[self.statekey])
  
!         if self.wordinfo.has_key(self.statekey):
!             (nham, nspam) = self.wordinfo[self.statekey]
!             self.set_nham(nham)
!             self.set_nspam(nspam)
  
              if options.verbose:
!                 print '%s is an existing DBDict, with %d ham and %d spam' \
!                       % (self.db_name, self.nham, self.nspam)
          else:
!             # new dbdict
              if options.verbose:
!                 print self.db_name,'is a new DBDict'
!             self.set_nham(0)
!             self.set_nspam(0)
  
      def store(self):
--- 126,152 ----
  
      def load(self):
!         '''Load state from database'''
  
          if options.verbose:
!             print 'Loading state from',self.db_name,'database'
  
!         self.db = shelve.DbfilenameShelf(self.db_name, self.mode)
  
!         if self.db.has_key(self.statekey):
!             t = self.db[self.statekey]
!             if t[0] != classifier.PICKLE_VERSION:
!                 raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!             (self.nspam, self.nham) = t[1:]
  
              if options.verbose:
!                 print '%s is an existing database, with %d spam and %d ham' \
!                       % (self.db_name, self.nspam, self.nham)
          else:
!             # new database
              if options.verbose:
!                 print self.db_name,'is a new database'
!             self.nspam = 0
!             self.nham = 0
!         self.wordinfo = {}
  
      def store(self):
***************
*** 159,166 ****
  
          if options.verbose:
!             print 'Persisting',self.db_name,'state in WIDict'
  
!         self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam())
!         self.wordinfo.sync()
  
  
--- 154,186 ----
  
          if options.verbose:
!             print 'Persisting',self.db_name,'state in database'
  
!         for key, val in self.wordinfo.iteritems():
!             if val == None:
!                 del self.wordinfo[key]
!                 try:
!                     del self.db[key]
!                 except KeyError:
!                     pass
!             else:
!                 self.db[key] = val.__getstate__()
!         self.db[self.statekey] = (classifier.PICKLE_VERSION,
!                                   self.nspam, self.nham)
!         self.db.sync()
! 
!     def _wordinfoget(self, word):
!         ret = self.wordinfo.get(word)
!         if not ret:
!             r = self.db.get(word)
!             if r:
!                 ret = self.WordInfoClass()
!                 ret.__setstate__(r)
!                 self.wordinfo[word] = ret
!         return ret
! 
!     # _wordinfoset is the same
! 
!     def _wordinfodel(self, word):
!         self.wordinfo[word] = None
  
  
--- dbdict.py DELETED ---
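
The caching scheme in the storage.py diff above (an in-memory dict in front of a shelve database, with None marking a pending delete) might be sketched as follows. This is a minimal sketch in modern Python 3 and not the DBDictClassifier code: the class name CachedWordStore, its method names, and the sample values are all hypothetical. One deliberate difference is that store() walks a snapshot of the cache, because Python does not allow deleting from a dict while iterating over it directly.

```python
import os
import shelve
import tempfile


class CachedWordStore:
    """Write-back cache in front of a shelve database (hypothetical name)."""

    def __init__(self, path):
        self.db = shelve.open(path, flag="c")
        self.cache = {}  # word -> value, or None as a "deleted" marker

    def get(self, word):
        if word in self.cache:
            return self.cache[word]  # None here means a pending delete
        value = self.db.get(word)
        if value is not None:
            self.cache[word] = value
        return value

    def set(self, word, value):
        self.cache[word] = value

    def delete(self, word):
        self.cache[word] = None  # the real delete happens in store()

    def store(self):
        # Iterate over a snapshot so entries can be dropped from the cache.
        for word, value in list(self.cache.items()):
            if value is None:
                del self.cache[word]
                try:
                    del self.db[word]
                except KeyError:
                    pass  # never reached the database
            else:
                self.db[word] = value
        self.db.sync()

    def close(self):
        self.store()
        self.db.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words")
    store = CachedWordStore(path)
    store.set("viagra", (0, 12))   # made-up (hamcount, spamcount) pairs
    store.set("python", (30, 0))
    store.close()

    store = CachedWordStore(path)
    store.delete("viagra")
    looked_up = store.get("python")
    store.store()
    remaining = sorted(store.db.keys())
    store.close()
```

Nothing is written until store() runs, so a training session touches the database once per changed word rather than once per update.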


From timstone4@users.sourceforge.net  Wed Nov 27 23:04:17 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Wed, 27 Nov 2002 15:04:17 -0800
Subject: [Spambayes-checkins] spambayes storage.py,1.3,1.4
Message-ID: <E18HBE9-0002qc-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv10926

Modified Files:
	storage.py 
Log Message:
Fixed a couple of comments

Index: storage.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/storage.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** storage.py	27 Nov 2002 22:37:56 -0000	1.3
--- storage.py	27 Nov 2002 23:04:14 -0000	1.4
***************
*** 5,9 ****
  Classes:
      PickledClassifier - Classifier that uses a pickle db
!     DBDictClassifier - Classifier that uses a DBM db
      Trainer - Classifier training observer
      SpamTrainer - Trainer for spam
--- 5,9 ----
  Classes:
      PickledClassifier - Classifier that uses a pickle db
!     DBDictClassifier - Classifier that uses a shelve db
      Trainer - Classifier training observer
      SpamTrainer - Trainer for spam
***************
*** 43,49 ****
  # Foundation license.
  
! __author__ = "Tim Stone <tim@fourstonesExpressions.com>"
! __credits__ = "Richie Hindle, Tim Peters, Neale Pickett, \
! all the spambayes contributors."
  
  import classifier
--- 43,49 ----
  # Foundation license.
  
! __author__ = "Neale Pickett <neale@woozle.org>, \
! Tim Stone <tim@fourstonesExpressions.com>"
! __credits__ = "All the spambayes contributors."
  
  import classifier


From Paul.Moore@atosorigin.com  Thu Nov 28 09:26:34 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 28 Nov 2002 09:26:34 -0000
Subject: [Spambayes-checkins] spambayes
	classifier.py,1.60,1.61 hammie.py,1.43,1.44 storage.py,1.2,1.3
	dbdict.py,1.4,NONE
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861995C@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:npickett@users.sourceforge.net]
> + import shelve
> !         self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
> !                              classifier.WordInfo,iterskip=[self.statekey])

> !         self.db = shelve.DbfilenameShelf(self.db_name, self.mode)

You do realise that shelve uses anydbm under the hood, making it susceptible to
the same problems with Windows (only broken DBM or dumbdbm available) that the
old version had - but with no obvious way of patching it up to allow customisation
by the user?

As I said, I use pickles now, so I no longer have a use case where Windows users
would be using DBM format anyway, but there probably should be at least a warning
in a comment somewhere...

Paul
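
One way to surface the concern above is to ask which dbm backend a shelf actually ended up on. The following is a sketch in modern Python 3, where anydbm and whichdb have become the dbm module and dbm.whichdb(); the function name shelf_backend, the "probe" key, and the file name are assumptions for illustration and nothing like this exists in storage.py.

```python
import dbm
import os
import shelve
import tempfile


def shelf_backend(path):
    """Create (or open) a shelf at path and report which dbm module backs it."""
    db = shelve.open(path, flag="c")
    db["probe"] = 1  # hypothetical key, written so the files exist on disk
    db.close()
    return dbm.whichdb(path)


with tempfile.TemporaryDirectory() as tmp:
    backend = shelf_backend(os.path.join(tmp, "hammie_db"))
    if backend == "dbm.dumb":
        print("warning: only the slow fallback dbm implementation is available")
    else:
        print("shelve is using", backend)
```

A check like this could back the warning suggested above, but it only reports the backend; it does not let the user choose one.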

From sjoerd@users.sourceforge.net  Thu Nov 28 15:48:31 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Thu, 28 Nov 2002 07:48:31 -0800
Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.6,1.7
Message-ID: <E18HQtz-0000H5-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv1021

Modified Files:
	FileCorpus.py 
Log Message:
Use double quotes for some triple-quoted strings that contain lonely
single quotes.
This makes XEmacs' fontification a whole lot happier.


Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FileCorpus.py	26 Nov 2002 00:43:51 -0000	1.6
--- FileCorpus.py	28 Nov 2002 15:48:29 -0000	1.7
***************
*** 1,5 ****
  #! /usr/bin/env python
  
! '''FileCorpus.py - Corpus composed of file system artifacts
  
  Classes:
--- 1,5 ----
  #! /usr/bin/env python
  
! """FileCorpus.py - Corpus composed of file system artifacts
  
  Classes:
***************
*** 74,78 ****
      o Suggestions?
  
! '''
  
  # This module is part of the spambayes project, which is Copyright 2002
--- 74,78 ----
      o Suggestions?
  
! """
  
  # This module is part of the spambayes project, which is Copyright 2002
***************
*** 572,576 ****
  def testmsg1():
  
!     return '''
  X-Hd:skip@pobox.com Mon Nov 04 10:50:49 2002
  Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
--- 572,576 ----
  def testmsg1():
  
!     return """
  X-Hd:skip@pobox.com Mon Nov 04 10:50:49 2002
  Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
***************
*** 626,633 ****
  >
  - Tim
! www.fourstonesExpressions.com '''
  
  def testmsg2():
!     return '''
  X-Hd:richie@entrian.com Wed Nov 06 12:05:41 2002
  Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
--- 626,633 ----
  >
  - Tim
! www.fourstonesExpressions.com """
  
  def testmsg2():
!     return """
  X-Hd:richie@entrian.com Wed Nov 06 12:05:41 2002
  Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
***************
*** 677,681 ****
  --
  Richie Hindle
! richie@entrian.com'''
  
  if __name__ == '__main__':
--- 677,681 ----
  --
  Richie Hindle
! richie@entrian.com"""
  
  if __name__ == '__main__':


From richiehindle@users.sourceforge.net  Thu Nov 28 16:10:49 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Thu, 28 Nov 2002 08:10:49 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.26,1.27
Message-ID: <E18HRFZ-0004D5-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv15947

Modified Files:
	pop3proxy.py 
Log Message:
 o Fixed Tim Stone's hanging problem - "LIST 1" would hang
   because it thought that the response should be multiline (I
   don't like nested scopes 8-)
 o Don't allow the radio buttons headers in the training interface
   to word wrap.
 o When the POP3 server is unreachable, return an error to the
   email client as well as printing it to the console.
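
The "LIST 1" fix in the first item can be sketched as below (modern Python 3, not the pop3proxy code). In POP3, LIST and UIDL return a multi-line response only when called without an argument; with a message number they return a single line, so a proxy that waits for a terminating "." will hang. The function names and the set of always-multi-line commands are assumptions for the example.

```python
def parse_request(line):
    """Split a POP3 request line into (command, args)."""
    parts = line.split()
    return parts[0].upper(), parts[1:]


def is_multiline(command, args):
    """Guess whether the server's reply to this request spans several lines."""
    if command in ('RETR', 'TOP', 'CAPA'):   # assumed always multi-line
        return True
    elif command in ('LIST', 'UIDL'):
        return len(args) == 0                # "LIST 1" gets a one-line reply
    else:
        # Assume that an unknown command gets a single-line response.
        return False


requests = ("LIST", "LIST 1", "UIDL 2", "RETR 1", "STAT")
checks = {req: is_multiline(*parse_request(req)) for req in requests}
for req in requests:
    print("%-7s multiline=%s" % (req, checks[req]))
```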


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** pop3proxy.py	27 Nov 2002 18:44:41 -0000	1.26
--- pop3proxy.py	28 Nov 2002 16:10:46 -0000	1.27
***************
*** 14,22 ****
          options:
              -z      : Runs a self-test and exits.
!             -t      : Runs a test POP3 server on port 8110 (for testing).
              -h      : Displays this help message.
  
!             -p FILE : use the named data file
!             -d      : the file is a DBM file rather than a pickle
              -l port : proxy listens on this port number (default 110)
              -u port : User interface listens on this port number
--- 14,22 ----
          options:
              -z      : Runs a self-test and exits.
!             -t      : Runs a fake POP3 server on port 8110 (for testing).
              -h      : Displays this help message.
  
!             -p FILE : use the named database file
!             -d      : the database is a DBM file rather than a pickle
              -l port : proxy listens on this port number (default 110)
              -u port : User interface listens on this port number
***************
*** 25,30 ****
  
          All command line arguments and switches take their default
!         values from the [Hammie], [pop3proxy] and [html_ui] sections
!         of bayescustomize.ini.
  
  For safety, and to help debugging, the whole POP3 conversation is
--- 25,30 ----
  
          All command line arguments and switches take their default
!         values from the [pop3proxy] and [html_ui] sections of
!         bayescustomize.ini.
  
  For safety, and to help debugging, the whole POP3 conversation is
***************
*** 40,44 ****
  
  __author__ = "Richie Hindle <richie@entrian.com>"
! __credits__ = "Tim Peters, Neale Pickett, all the spambayes contributors."
  
  try:
--- 40,44 ----
  
  __author__ = "Richie Hindle <richie@entrian.com>"
! __credits__ = "Tim Peters, Neale Pickett, Tim Stone, all the Spambayes folk."
  
  try:
***************
*** 56,59 ****
--- 56,61 ----
   o Review already-trained messages, and purge them.
   o Put in a link to view a message (plain text, html, multipart...?)
+    Include a Reply link that launches the registered email client, eg.
+    mailto:tim@fourstonesExpressions.com?subject=Re:%20pop3proxy&body=Hi%21%0D
   o Keyboard navigation (David Ascher).  But aren't Tab and left/right
     arrow enough?
***************
*** 130,133 ****
--- 132,139 ----
  take weeks over a modem - I've already had problems with clients timing
  out while the proxy was downloading stuff from the server).
+ 
+ Adam's idea: add checkboxes to a Google results list for "Relevant" /
+ "Irrelevant", then submit that to build a search including the
+ highest-scoring tokens and excluding the lowest-scoring ones.
  """
  
***************
*** 214,221 ****
              self.connect((serverName, serverPort))
          except socket.error, e:
!             print >>sys.stderr, "Can't connect to %s:%d: %s" % \
!                                 (serverName, serverPort, e)
!             self.close()
              self.lineCallback('')   # "The socket's been closed."
  
      def collect_incoming_data(self, data):
--- 220,228 ----
              self.connect((serverName, serverPort))
          except socket.error, e:
!             error = "Can't connect to %s:%d: %s" % (serverName, serverPort, e)
!             print >>sys.stderr, error
!             self.lineCallback('-ERR %s\r\n' % error)
              self.lineCallback('')   # "The socket's been closed."
+             self.close()
  
      def collect_incoming_data(self, data):
***************
*** 304,308 ****
              return True
          elif self.command in ['LIST', 'UIDL']:
!             return len(args) == 0
          else:
              # Assume that an unknown command will get a single-line
--- 311,315 ----
              return True
          elif self.command in ['LIST', 'UIDL']:
!             return len(self.args) == 0
          else:
              # Assume that an unknown command will get a single-line
***************
*** 710,714 ****
          """<tr><td><b>Messages classified as %s:</b></td>
            <td><b>From:</b></td>
!           <td class='reviewheaders'><b>
                <a href='javascript: onHeader("%s", "Discard");'>Discard</a> /
                <a href='javascript: onHeader("%s", "Defer");'>Defer</a> /
--- 717,721 ----
          """<tr><td><b>Messages classified as %s:</b></td>
            <td><b>From:</b></td>
!           <td class='reviewheaders' nowrap><b>
                <a href='javascript: onHeader("%s", "Discard");'>Discard</a> /
                <a href='javascript: onHeader("%s", "Defer");'>Defer</a> /


From timstone4@users.sourceforge.net  Thu Nov 28 16:35:59 2002
From: timstone4@users.sourceforge.net (Tim Stone)
Date: Thu, 28 Nov 2002 08:35:59 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.27,1.28
Message-ID: <E18HRdv-0007q9-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv30105

Modified Files:
	pop3proxy.py 
Log Message:
Changed startup messages to be a bit more informative.

Made writing of log file dependent on options.verbose

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** pop3proxy.py	28 Nov 2002 16:10:46 -0000	1.27
--- pop3proxy.py	28 Nov 2002 16:35:57 -0000	1.28
***************
*** 29,33 ****
  
  For safety, and to help debugging, the whole POP3 conversation is
! written out to _pop3proxy.log for each run.
  
  To make rebuilding the database easier, uploaded messages are appended
--- 29,33 ----
  
  For safety, and to help debugging, the whole POP3 conversation is
! written out to _pop3proxy.log for each run, if options.verbose is True.
  
  To make rebuilding the database easier, uploaded messages are appended
***************
*** 166,170 ****
          self.set_socket(s, socketMap)
          self.set_reuse_addr()
!         print "%s listening on port %d." % (self.__class__.__name__, port)
          self.bind(('', port))
          self.listen(5)
--- 166,171 ----
          self.set_socket(s, socketMap)
          self.set_reuse_addr()
!         if options.verbose:
!             print "%s listening on port %d." % (self.__class__.__name__, port)
          self.bind(('', port))
          self.listen(5)
***************
*** 389,392 ****
--- 390,394 ----
          proxyArgs = (serverName, serverPort)
          Listener.__init__(self, proxyPort, BayesProxy, proxyArgs)
+         print 'Listener on port %d is proxying %s:%d' % (proxyPort, serverName, serverPort)
  
  
***************
*** 429,434 ****
      def send(self, data):
          """Logs the data to the log file."""
!         state.logFile.write(data)
!         state.logFile.flush()
          try:
              return POP3ProxyBase.send(self, data)
--- 431,437 ----
      def send(self, data):
          """Logs the data to the log file."""
!         if options.verbose:
!             state.logFile.write(data)
!             state.logFile.flush()
          try:
              return POP3ProxyBase.send(self, data)
***************
*** 442,447 ****
          """Logs the data to the log file."""
          data = POP3ProxyBase.recv(self, size)
!         state.logFile.write(data)
!         state.logFile.flush()
          return data
  
--- 445,451 ----
          """Logs the data to the log file."""
          data = POP3ProxyBase.recv(self, size)
!         if options.verbose:
!             state.logFile.write(data)
!             state.logFile.flush()
          return data
  
***************
*** 565,568 ****
--- 569,573 ----
      def __init__(self, uiPort, socketMap=asyncore.socket_map):
          Listener.__init__(self, uiPort, UserInterface, (), socketMap=socketMap)
+         print 'User interface url is http://localhost:%d' % (uiPort)
  
  
***************
*** 1215,1219 ****
          __main__ code below."""
          # Open the log file.
!         self.logFile = open('_pop3proxy.log', 'wb', 0)
  
          # Load up the old proxy settings from Options.py / bayescustomize.ini
--- 1220,1225 ----
          __main__ code below."""
          # Open the log file.
!         if options.verbose:
!             self.logFile = open('_pop3proxy.log', 'wb', 0)
  
          # Load up the old proxy settings from Options.py / bayescustomize.ini


From richiehindle@users.sourceforge.net  Thu Nov 28 17:05:00 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Thu, 28 Nov 2002 09:05:00 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.28,1.29
Message-ID: <E18HS60-00043z-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv15372

Modified Files:
	pop3proxy.py 
Log Message:
Don't introduce module-level variables in the __main__ code, because
they mask potential NameErrors later on.


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** pop3proxy.py	28 Nov 2002 16:35:57 -0000	1.28
--- pop3proxy.py	28 Nov 2002 17:04:58 -0000	1.29
***************
*** 1572,1576 ****
  # ===================================================================
  
! if __name__ == '__main__':
      # Read the arguments.
      try:
--- 1572,1576 ----
  # ===================================================================
  
! def run():
      # Read the arguments.
      try:
***************
*** 1633,1634 ****
--- 1633,1637 ----
      else:
          print >>sys.stderr, __doc__
+ 
+ if __name__ == '__main__':
+     run()
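
A small sketch of why the wrapper helps (modern Python 3; describe and its arguments are made up for the example): once the start-up code lives in run(), its variables are locals, so a helper that wrongly relies on one of them fails with a NameError instead of silently picking up a module-level name.

```python
def describe(port):
    # Bug on purpose: refers to `ports`, which was never passed in.
    return "listening on %d" % ports[0]


def run(argv):
    ports = [int(arg) for arg in argv]  # local to run(), invisible to describe()
    try:
        return describe(ports[0])
    except NameError as exc:
        return "NameError surfaced: %s" % exc


result = run(["8110"])
print(result)
```

Had the list comprehension been executed directly under "if __name__ == '__main__':", ports would be a module global and describe() would appear to work when run as a script, then fail when imported.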


From richiehindle@users.sourceforge.net  Thu Nov 28 21:27:11 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Thu, 28 Nov 2002 13:27:11 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.29,1.30
Message-ID: <E18HWBj-0000tV-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv3157

Modified Files:
	pop3proxy.py 
Log Message:
HTML tidyings.


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** pop3proxy.py	28 Nov 2002 17:04:58 -0000	1.29
--- pop3proxy.py	28 Nov 2002 21:27:09 -0000	1.30
***************
*** 611,615 ****
      #    value.  This is so that setFieldValue can set the value.
  
!     header = """<html><head><title>Spambayes proxy: %s</title>
               <style>
               body { font: 90%% arial, swiss, helvetica; margin: 0 }
--- 611,616 ----
      #    value.  This is so that setFieldValue can set the value.
  
!     header = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
!              <html><head><title>Spambayes proxy: %s</title>
               <style>
               body { font: 90%% arial, swiss, helvetica; margin: 0 }
***************
*** 852,856 ****
              homeLink = name
          else:
!             homeLink = "<a href='home'>Home</a> > %s" % name
          if showImage:
              image = "<img src='helmet.gif' align='absmiddle'>&nbsp;"
--- 853,857 ----
              homeLink = name
          else:
!             homeLink = "<a href='home'>Home</a> &gt; %s" % name
          if showImage:
              image = "<img src='helmet.gif' align='absmiddle'>&nbsp;"
***************
*** 1061,1065 ****
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             <td><center>%s</center></td></tr>""" % \
                              (stripeClass, subject, from_, radioGroup))
              stripe = stripe ^ 1
--- 1062,1066 ----
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             <td align='center'>%s</td></tr>""" % \
                              (stripeClass, subject, from_, radioGroup))
              stripe = stripe ^ 1
***************
*** 1163,1167 ****
                      self.appendMessages(lines, keyedMessages[header], label)
  
!             lines.append("""<tr><td></td><td></td><td align='middle'>&nbsp;<br>
                              <input type='submit' value='Train'></td></tr>""")
              lines.append("</table></form>")
--- 1164,1168 ----
                      self.appendMessages(lines, keyedMessages[header], label)
  
!             lines.append("""<tr><td></td><td></td><td align='center'>&nbsp;<br>
                              <input type='submit' value='Train'></td></tr>""")
              lines.append("</table></form>")


From richiehindle@users.sourceforge.net  Thu Nov 28 22:02:48 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Thu, 28 Nov 2002 14:02:48 -0800
Subject: [Spambayes-checkins] 
 spambayes Corpus.py,1.4,1.5 FileCorpus.py,1.7,1.8 pop3proxy.py,1.30,1.31
Message-ID: <E18HWkC-0007KY-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv27764

Modified Files:
	Corpus.py FileCorpus.py pop3proxy.py 
Log Message:
Expire old messages from the trained corpuses.  ExpiryFileCorpus is
now less clever - you need to call removeExpiredMessages() for it
to expire anything.  "Explicit is better than implicit."


Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Corpus.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** Corpus.py	26 Nov 2002 00:43:51 -0000	1.4
--- Corpus.py	28 Nov 2002 22:02:46 -0000	1.5
***************
*** 235,268 ****
      '''Corpus of "young" file system artifacts'''
  
!     def __init__(self, expireBefore, factory, cacheSize=-1):
          '''Constructor'''
  
          self.expireBefore = expireBefore
-         Corpus.__init__(self, factory, cacheSize)
- 
-     def cacheMessage(self, msg):
-         '''Add a message to the in-memory cache'''
-         # This is where the expiry of a message is enforced
-         # This method should probably not be overridden
- 
-         if msg.createTimestamp() >= time.time() - self.expireBefore:
-             Corpus.cacheMessage(self, msg)
-         else:
-             if options.verbose:
-                 print 'Not caching %s because it has expired' % (msg.key())
-             raise KeyError, msg
- 
-         return msg
  
      def removeExpiredMessages(self):
          '''Kill expired messages'''
  
!         for key in self.keys():
!             try:
!                 msg = self[key]
!             except KeyError, e:
                  if options.verbose:
                      print 'message %s has expired' % (key)
!                 self.removeMessage(e[0])
  
  
--- 235,251 ----
      '''Corpus of "young" file system artifacts'''
  
!     def __init__(self, expireBefore):
          '''Constructor'''
  
          self.expireBefore = expireBefore
  
      def removeExpiredMessages(self):
          '''Kill expired messages'''
  
!         for msg in self:
!             if msg.createTimestamp() < time.time() - self.expireBefore:
                  if options.verbose:
                      print 'message %s has expired' % (key)
!                 self.removeMessage(msg)
  
  
***************
*** 376,383 ****
  
  	return match
! 	
      def getHeaders(self):
          '''Return message headers as text'''
!         
          return self.hdrtxt
  
--- 359,366 ----
  
  	return match
! 
      def getHeaders(self):
          '''Return message headers as text'''
! 
          return self.hdrtxt
  
***************
*** 411,413 ****
  
  if __name__ == '__main__':
!     print >>sys.stderr, __doc__
\ No newline at end of file
--- 394,396 ----
  
  if __name__ == '__main__':
!     print >>sys.stderr, __doc__

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** FileCorpus.py	28 Nov 2002 15:48:29 -0000	1.7
--- FileCorpus.py	28 Nov 2002 22:02:46 -0000	1.8
***************
*** 183,187 ****
  filter'''
  
!         Corpus.ExpiryCorpus.__init__(self, expireBefore, factory, cacheSize)
          FileCorpus.__init__(self, factory, directory, filter, cacheSize)
  
--- 183,187 ----
  filter'''
  
!         Corpus.ExpiryCorpus.__init__(self, expireBefore)
          FileCorpus.__init__(self, factory, directory, filter, cacheSize)
  
***************
*** 251,255 ****
          elip = ''
          sub = self.getSubstance()
!         
          if options.verbose:
              sub = self.getSubstance()
--- 251,255 ----
          elip = ''
          sub = self.getSubstance()
! 
          if options.verbose:
              sub = self.getSubstance()
***************
*** 379,383 ****
      m1 = fmClass('XMG00001', 'fctestspamcorpus')
      m1.setSubstance(testmsg2())
!     
      print '\n\nAdd a message to hamcorpus that does not match the filter'
  
--- 379,383 ----
      m1 = fmClass('XMG00001', 'fctestspamcorpus')
      m1.setSubstance(testmsg2())
! 
      print '\n\nAdd a message to hamcorpus that does not match the filter'
  
***************
*** 404,407 ****
--- 404,408 ----
      unsurecorpus = ExpiryFileCorpus(5, fmFact, \
                                      'fctestunsurecorpus', 'MSG*', 2)
+     unsurecorpus.removeExpiredMessages()
  
  
***************
*** 436,440 ****
      print 'Subject header is',msg.getSubject()
      print 'From header is',msg.getFrom()
!     
      print 'Header text is:',msg.getHeaders()
      print 'Headers are:',msg.getHeadersList()
--- 437,441 ----
      print 'Subject header is',msg.getSubject()
      print 'From header is',msg.getFrom()
! 
      print 'Header text is:',msg.getHeaders()
      print 'Headers are:',msg.getHeadersList()
***************
*** 492,496 ****
              if e.errno != 2:     # errno.<WHAT>
                  raise
!     
          try:
              os.unlink('fctestclass.bayes')
--- 493,497 ----
              if e.errno != 2:     # errno.<WHAT>
                  raise
! 
          try:
              os.unlink('fctestclass.bayes')
***************
*** 725,727 ****
          print >>sys.stderr, __doc__
  
!        
--- 726,728 ----
          print >>sys.stderr, __doc__
  
! 

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** pop3proxy.py	28 Nov 2002 21:27:09 -0000	1.30
--- pop3proxy.py	28 Nov 2002 22:02:46 -0000	1.31
***************
*** 141,145 ****
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
  import mailbox, storage, tokenizer, mboxutils, email.Header
! from FileCorpus import FileCorpus, FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
  from Options import options
--- 141,146 ----
  import socket, asyncore, asynchat, cgi, urlparse, webbrowser
  import mailbox, storage, tokenizer, mboxutils, email.Header
! from FileCorpus import FileCorpus, ExpiryFileCorpus
! from FileCorpus import FileMessageFactory, GzipFileMessageFactory
  from email.Iterators import typed_subpart_iterator
  from Options import options
***************
*** 1314,1324 ****
              map(ensureDir, [self.spamCache, self.hamCache, self.unknownCache])
              if self.gzipCache:
!                 messageFactory = GzipFileMessageFactory()
              else:
!                 messageFactory = FileMessageFactory()
!             self.messageFactory = messageFactory
!             self.spamCorpus = FileCorpus(messageFactory, self.spamCache)
!             self.hamCorpus = FileCorpus(messageFactory, self.hamCache)
!             self.unknownCorpus = FileCorpus(messageFactory, self.unknownCache)
  
              # Create the Trainers.
--- 1315,1329 ----
              map(ensureDir, [self.spamCache, self.hamCache, self.unknownCache])
              if self.gzipCache:
!                 factory = GzipFileMessageFactory()
              else:
!                 factory = FileMessageFactory()
!             age = options.pop3proxy_cache_expiry_days*24*60*60
!             self.spamCorpus = ExpiryFileCorpus(age, factory, self.spamCache)
!             self.hamCorpus = ExpiryFileCorpus(age, factory, self.hamCache)
!             self.unknownCorpus = FileCorpus(factory, self.unknownCache)
! 
!             # Expire old messages from the trained corpuses.
!             self.spamCorpus.removeExpiredMessages()
!             self.hamCorpus.removeExpiredMessages()
  
              # Create the Trainers.
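
Editor's note on the hunk above: the days-to-seconds conversion and the
expiry pass can be illustrated with a small standalone sketch.  This is not
the ExpiryFileCorpus implementation; it assumes a message's age is judged by
its file modification time, and the function names expired_files and
remove_expired are hypothetical.

```python
import os
import time

def expired_files(directory, max_age_seconds, now=None):
    """Return paths in `directory` whose mtime is older than the cutoff."""
    if now is None:
        now = time.time()
    cutoff = now - max_age_seconds
    old = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Assumption: age is measured from the file's modification time.
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            old.append(path)
    return old

def remove_expired(directory, expiry_days):
    """Delete files older than `expiry_days`; return how many were removed."""
    age = expiry_days * 24 * 60 * 60  # same conversion as in the diff
    victims = expired_files(directory, age)
    for path in victims:
        os.remove(path)
    return len(victims)
```

Note that in the diff only the spam and ham corpuses get an expiry age; the
unknown corpus stays a plain FileCorpus and is never expired.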


From nascheme@users.sourceforge.net  Fri Nov 29 00:57:25 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Thu, 28 Nov 2002 16:57:25 -0800
Subject: [Spambayes-checkins] spambayes mailsort.py,NONE,1.1
	README.txt,1.43,1.44 neilfilter.py,1.5,NONE neiltrain.py,1.6,NONE
Message-ID: <E18HZTB-0004bJ-00@sc8-pr-cvs1.sourceforge.net>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv17382

Modified Files:
	README.txt 
Added Files:
	mailsort.py 
Removed Files:
	neilfilter.py neiltrain.py 
Log Message:
Merge neiltrain.py and neilfilter.py as mailsort.py.


--- NEW FILE: mailsort.py ---
#! /usr/bin/env python
"""\
To train:
    %(program)s -t wordprobs.cdb ham.mbox spam.mbox

To filter mail (using .forward or .qmail):
    |%(program)s wordprobs.cdb Maildir/ Mail/Spam/

To print the score and top evidence for a message or messages:
    %(program)s -s wordprobs.cdb message [...]
"""

SPAM_CUTOFF = 0.57
SIZE_LIMIT = 5000000 # messages larger are not analyzed
BLOCK_SIZE = 10000

import sys
import os
import errno
import getopt
import email
import time
import signal
import socket
import mboxutils

import cdb
from tokenizer import tokenize
import classifier


try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


program = sys.argv[0] # For usage(); referenced by docstring above


def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

class CdbClassifer(classifier.Classifier):
    def __init__(self, cdbfile):
        classifier.Classifier.__init__(self)
        self.wordinfo = cdb.Cdb(cdbfile)

    def probability(self, record):
        return float(record)

def maketmp(dir):
    """Create a uniquely named file in dir/tmp.

    Returns (file object, pathname, filename).
    """
    hostname = socket.gethostname()
    pid = os.getpid()
    fd = -1
    for x in xrange(200):
        filename = "%d.%d.%s" % (time.time(), pid, hostname)
        pathname = "%s/tmp/%s" % (dir, filename)
        try:
            fd = os.open(pathname, os.O_WRONLY|os.O_CREAT|os.O_EXCL, 0600)
        except OSError, exc:
            # os.open() raises OSError; retry only on EINTR or a name clash
            if exc.errno not in (errno.EINTR, errno.EEXIST):
                raise
        else:
            break
        time.sleep(2)
    if fd == -1:
        raise SystemExit, "could not create a mail file"
    return (os.fdopen(fd, "wb"), pathname, filename)

def train(bayes, msgs, is_spam):
    """Train bayes with all messages from a mailbox."""
    mbox = mboxutils.getmbox(msgs)
    for msg in mbox:
        bayes.learn(tokenize(msg), is_spam)

def train_messages(db_name, ham_name, spam_name):
    """Create database using messages."""

    bayes = classifier.Classifier()
    print 'Training with ham...'
    train(bayes, ham_name, False)
    print 'Training with spam...'
    train(bayes, spam_name, True)
    print 'Updating probabilities...'
    items = []
    for word, record in bayes.wordinfo.iteritems():
        prob = bayes.probability(record)
        #print `word`, prob
        items.append((word, str(prob)))
    print 'Writing DB...'
    db = open(db_name, "wb")
    cdb.cdb_make(db, items)
    db.close()
    print 'done'

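def filter_message(db_name, hamdir, spamdir):
    """Read a message from stdin and deliver it to hamdir or spamdir."""
    # Give up if delivery hangs for more than 24 hours.  A signal handler
    # is called with two arguments: the signal number and the stack frame.
    signal.signal(signal.SIGALRM, lambda signum, frame: sys.exit(1))
    signal.alarm(24 * 60 * 60)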

    # write message to temporary file (must be on same partition)
    tmpfile, pathname, filename = maketmp(hamdir)
    try:
        tmpfile.write(os.environ.get("DTLINE", "")) # delivered-to line
        bytes = 0
        blocks = []
        while 1:
            block = sys.stdin.read(BLOCK_SIZE)
            if not block:
                break
            bytes += len(block)
            if bytes < SIZE_LIMIT:
                blocks.append(block)
            tmpfile.write(block)
        tmpfile.close()

        if bytes < SIZE_LIMIT:
            msgdata = ''.join(blocks)
            del blocks
            msg = email.message_from_string(msgdata)
            del msgdata
            bayes = CdbClassifer(open(db_name, 'rb'))
            prob = bayes.spamprob(tokenize(msg))
        else:
            prob = 0.0

        if prob > SPAM_CUTOFF:
            os.rename(pathname, "%s/new/%s" % (spamdir, filename))
        else:
            os.rename(pathname, "%s/new/%s" % (hamdir, filename))
    except:
        os.unlink(pathname)
        raise

def print_message_score(db_name, msg_name):
    msg = email.message_from_file(open(msg_name))
    bayes = CdbClassifer(open(db_name, 'rb'))
    prob, evidence = bayes.spamprob(tokenize(msg), evidence=True)
    print msg_name, prob
    for word, prob in evidence:
        print '  ', `word`, prob

def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'ts')
    except getopt.error, msg:
        usage(2, msg)

    if len(opts) > 1:
        usage(2, 'conflicting options')

    if not opts:
        if len(args) != 3:
            usage(2, 'wrong number of arguments')
        filter_message(args[0], args[1], args[2])
    elif opts[0][0] == '-t':
        if len(args) != 3:
            usage(2, 'wrong number of arguments')
        train_messages(args[0], args[1], args[2])
    elif opts[0][0] == '-s':
        if len(args) < 2:
            usage(2, 'wrong number of arguments')
        db = args[0]
        for msg in args[1:]:
            print_message_score(db, msg)
    else:
        raise RuntimeError # shouldn't get here
    
    
if __name__ == "__main__":
    main()

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** README.txt	17 Nov 2002 03:42:36 -0000	1.43
--- README.txt	29 Nov 2002 00:57:23 -0000	1.44
***************
*** 86,101 ****
      a separate module.
  
! neiltrain.py
!     Builds a CDB (constant database) file of word probabilities using
!     spam and non-spam mail.  The database in intended for use with
!     neilfilter.py.
! 
! neilfilter.py
!     A delivery agent that uses the CDB created by neiltrain.py and
!     delivers a message to one of two Maildir message folders, depending
!     on the classifier score.  Note that both Maildirs must be on the
!     same device.  An example .qmail or .forward file would be:
! 
!      |python2.3 spambayes/neilfilter.py wordprobs.cdb Maildir/ Mail/Spam/
  
  
--- 86,94 ----
      a separate module.
  
! mailsort.py
!     A delivery agent that uses a CDB of word probabilities and delivers
!     a message to one of two Maildir message folders, depending on the
!     classifier score.  Note that both Maildirs must be on the same
!     device.
  
  
--- neilfilter.py DELETED ---

--- neiltrain.py DELETED ---
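
Editor's note: the delivery step in mailsort.py follows the usual Maildir
pattern: write the message under tmp/, then rename it into new/ of whichever
folder the score selects.  The rename only works within one filesystem,
which is why the README asks for both Maildirs to be on the same device.
Below is a minimal sketch of that pattern for a current Python.  It is not
the committed code: deliver, unique_name and same_device are hypothetical
names, the retry loop from maketmp() is left out, and the score is passed in
rather than computed by the classifier.

```python
import os
import socket
import time

SPAM_CUTOFF = 0.57  # same threshold as mailsort.py

def same_device(path_a, path_b):
    """Return True if both paths live on the same filesystem device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

def unique_name():
    """Build a Maildir-style name from time, pid and hostname."""
    return "%d.%d.%s" % (time.time(), os.getpid(), socket.gethostname())

def deliver(data, score, hamdir, spamdir):
    """Write `data` (bytes) under hamdir/tmp, then rename it into new/."""
    if not same_device(hamdir, spamdir):
        raise ValueError("both Maildirs must be on the same device")
    name = unique_name()
    tmp_path = os.path.join(hamdir, "tmp", name)
    # O_EXCL makes creation fail rather than overwrite an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    target = spamdir if score > SPAM_CUTOFF else hamdir
    final_path = os.path.join(target, "new", name)
    os.rename(tmp_path, final_path)
    return final_path
```

Because the file only appears in new/ after a rename, a mail reader never
sees a partially written message.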


From nas@python.ca  Mon Nov 25 21:40:14 2002
From: nas@python.ca (Neil Schemenauer)
Date: Mon, 25 Nov 2002 13:40:14 -0800
Subject: [Spambayes-checkins] spambayes sb0.5.exe,NONE,1.1.2.1
In-Reply-To: <E18GQXY-0001Mb-00@sc8-pr-cvs1.sourceforge.net>
References: <E18GQXY-0001Mb-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <20021125214014.GA11635@glacier.arctrix.com>

Tim Stone wrote:
> Added Files:
>       Tag: hammie-playground
> 	sb0.5.exe 

I don't think this belongs in CVS.  SF has a file distribution feature
that would be more appropriate.

  Neil