From montanaro@users.sourceforge.net Fri Nov 1 01:23:30 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Thu, 31 Oct 2002 17:23:30 -0800
Subject: [Spambayes-checkins] spambayes INTEGRATION.txt,NONE,1.1
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26766

Added Files:
	INTEGRATION.txt
Log Message:
first scribbled notes about integrating Spambayes with different email
packages.

--- NEW FILE: INTEGRATION.txt ---
=======================================
Integrating Spambayes with mail systems
=======================================

General
-------

Spambayes is a tool used to segregate unwanted (spam) mail from the mail you
want (ham).  Before Spambayes can be your spam filter of choice you need to
train it on representative samples of email you receive.  After it's been
trained, you use Spambayes to classify new mail according to its spamminess
and hamminess qualities.

To train Spambayes, you need to save your incoming email for a while,
segregating it into two piles, known spam and known ham (ham is our nickname
for good mail).  It's best to train on recent email, because your interests
and the nature of what spam looks like change over time.  Once you've
collected a fair portion of each (anything is better than nothing, but it
helps to have a couple hundred of each), you can tell Spambayes, "Here's my
ham and my spam".  It will then process that mail and save information about
different patterns which appear in ham and spam.  That information is then
used during the filtering stage.  When Spambayes filters your email, it
compares each unclassified message against the information it saved from
training and decides whether the message qualifies as ham or spam, or
whether it is unsure how to classify the message.

The sections below gather notes on how Spambayes can be integrated into your
mail processing system.
As a general requirement, you must have a recent version of Python
installed on your computer, version 2.2.1 or later.  (Don't ask about
backporting it to earlier versions of Python.  It's almost a certainty this
won't happen.)  If you need to install Python on your system, check the
Python download page for the version appropriate to your computer:

    http://www.python.org/download/

Training
--------

Given a pair of Unix mailbox format files (each message starts with a line
which begins with 'From '), one containing nothing but spam and the other
containing nothing but ham, you can train Spambayes using a command like

    hammie.py -g ~/tmp/newham -s ~/tmp/newspam

The above command is Unix-centric.  In other environments it's likely that a
less command-line-oriented tool will be available in the near future.

Windows
-------

TBD.

Unix/Linux
----------

Unlike Windows, there are too many combinations of mail reading tools (mutt,
pine, Eudora, ...) and mail transport and delivery tools (sendmail, exim,
procmail, qmail, ...) to attempt to be exhaustive about how to integrate
Spambayes into your environment at this time.  This section just documents
some of what's possible.

Procmail
--------

Many people on Unix-like systems have procmail available as an optional or
as the default local delivery agent.  Integrating Spambayes checking with
Procmail is straightforward.  Once you've trained Spambayes on your
collection of known ham and spam, you can use the hammie.py script to
classify incoming mail like so:

    :0 fw:hamlock
    | /usr/local/bin/hammie.py -f -d -p $HOME/hammie.db

The above Procmail recipe tells it to run /usr/local/bin/hammie.py in filter
mode (-f), and to use the training results stored in the dbm-style file
~/hammie.db.  While hammie.py is running, Procmail uses the lock file
hamlock to prevent multiple invocations from stepping on each others' toes.
(It's not strictly necessary in this case since no files on-disk are
modified, but Procmail will still complain if you don't specify a lock
file.)

The result of running hammie.py in filter mode is that Procmail will use the
output from the run as the mail message for further processing downstream.
Hammie.py inserts an X-Hammie-Disposition header in the output message which
looks like

    X-Hammie-Disposition: No; 0.00; '*H*': 1.00; '*S*': 0.00; 'python': 0.00;
        'linux,': 0.01; 'desirable': 0.01; 'cvs,': 0.01; 'perl.': 0.02; ...

You can then use this to segregate your messages into various inboxes, like
so:

    :0
    * ^X-Hammie-Disposition: Yes
    spam

    :0
    * ^X-Hammie-Disposition: Unsure
    unsure

The first recipe catches all messages which hammie.py classified as spam.
The second catches all messages about which it was unsure.  The combination
allows you to isolate spam from your good mail and tuck away messages it was
unsure about so you can scan them more closely.

X/Emacs+VM
----------

Emacs and XEmacs both come with VM, one of a choice of several Emacs-based
mail packages.  Emacs is extensible using Emacs Lisp or Pymacs.  This
extensibility allows you to easily segregate your incoming mail for training
purposes.  Here's one such example.  If you place the following code in your
~/.vm file:

    (defun copy-to-spam ()
      (interactive)
      (vm-save-message (expand-file-name "~/tmp/newspam"))
      (vm-undelete-message 1))

    (defun copy-to-nonspam ()
      (interactive)
      (vm-save-message (expand-file-name "~/tmp/newham"))
      (vm-undelete-message 1))

    (define-key vm-mode-map "ls" 'copy-to-spam)
    (define-key vm-summary-mode-map "ls" 'copy-to-spam)
    (define-key vm-mode-map "lh" 'copy-to-nonspam)
    (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)

'ls' will save a copy of the current message to ~/tmp/newspam and 'lh' will
save a copy of the current message to ~/tmp/newham.  You can then use those
files later as arguments to hammie.py for training.
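
Before handing the two saved files to hammie.py, it can be worth checking
how many messages each one holds, since training works best with a couple
hundred of each.  The following is a minimal sketch and not part of
Spambayes.  It assumes a current Python 3 interpreter (newer than the 2.2.1
minimum mentioned earlier) and the standard mailbox module; the function
names and the threshold of 200 are illustrative assumptions.

```python
import mailbox
import os


def count_messages(path):
    """Count the messages in a Unix mbox file (each starts with 'From ').

    Returns 0 when the file does not exist yet.
    """
    full_path = os.path.expanduser(path)
    if not os.path.isfile(full_path):
        return 0
    box = mailbox.mbox(full_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def training_report(ham_path, spam_path, minimum=200):
    """Return (label, count, enough) tuples for the ham and spam piles.

    `minimum` is an illustrative threshold (an assumption), not a
    Spambayes setting.
    """
    report = []
    for label, path in (("ham", ham_path), ("spam", spam_path)):
        n = count_messages(path)
        report.append((label, n, n >= minimum))
    return report
```

Calling training_report("~/tmp/newham", "~/tmp/newspam") gives one tuple per
pile, so a pile that is still too small is easy to spot before training.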

Things to watch out for
-----------------------

While Spambayes does an excellent job of classifying incoming mail, it is
only as good as the data on which it was trained.  Here are some tips to
help you create a good training set:

  * Don't use old mail.  The characteristics of your email change over time,
    sometimes subtly, sometimes dramatically, so it's best to use very
    recent mail to train Spambayes.  If you've abandoned an email address in
    the past because it was getting spammed heavily, there are probably some
    clues in mail sent to your old address which would bias Spambayes.

  * Check and recheck your training collections.  While you are manually
    classifying mail as spam or ham, it's easy to make a mistake and toss a
    message or ten in the wrong file.  Such miscategorized mail will throw
    off the classifier.

From mhammond@users.sourceforge.net Fri Nov 1 01:23:39 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 17:23:39 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FilterDialog.py,1.6,1.7
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv26773

Modified Files:
	FilterDialog.py
Log Message:
Missing an import of the win32com constants.

Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FilterDialog.py 31 Oct 2002 21:57:00 -0000 1.6
--- FilterDialog.py 1 Nov 2002 01:23:27 -0000 1.7
***************
*** 7,10 ****
--- 7,11 ----
  import win32api
  import pythoncom
+ from win32com.client import constants
  from DialogGlobals import *
***************
*** 365,369 ****
  if __name__=='__main__':
!     from win32com.client import Dispatch, constants
      outlook = Dispatch("Outlook.Application")
--- 366,370 ----
  if __name__=='__main__':
!     from win32com.client import Dispatch
      outlook = Dispatch("Outlook.Application")

From mhammond@users.sourceforge.net Fri Nov 1 01:24:52 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 17:24:52 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.1,1.2
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv26936

Modified Files:
	about.html
Log Message:
Add a bit more cruft

Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** about.html 31 Oct 2002 21:56:59 -0000 1.1
--- about.html 1 Nov 2002 01:24:09 -0000 1.2
***************
*** 1,7 ****
!
! About SpamBayes
!
!
! Contributions welcome!
!
!
\ No newline at end of file
--- 1,57 ----
!
!
!
! About SpamBayes
!
!
! NOTE: This is very very early code.  If
! you are looking this, you have probably been told about it against our better
! judgement <wink>.  Stuff doesnt work correctly.  Fields are
! funny.  If you want something known to work well today for alot of people,
! this is not for you.
!

! The source code is maintained at SourceForge.
!
! This spam filter uses Bayesian analysis to filter spam.  Unlike other ! spam detection systems, Bayesian systems actually "learn" about what you ! consider spam, and continually adapt as both your regular email and spam ! patterns change.
!

Training

! Due to the nature of the system, it must be trained before it can be effective. !  Although the system does learn over time, when first installed it has ! no knowledge of either spam or good email.
!

Initial Training

! When first installed, it is recommended you perform the following steps:
!
    !
  • Create two folders - one for "Spam", and one for "Possible Spam"
  • !
  • Go through your Inbox and Deleted Items, and move as much spam as you ! can find to the "Spam" folder.  Try and get as much Spam out of your ! inbox as possible.
  • !
  • Select the Training dialog. !  Nominate your Spam folder for spam, and your Inbox for good messages, ! and start training.
  • !
! To see how effective your Inbox cleanup was, you may like to try:
!
    !
  • Go to the Filter Now dialog.
  • !
  • Select your Inbox as the folder to filter.
  • !
  • Select Score messages, but dont perform ! filter action.
  • !
  • Clear both checkboxes so all messages will be scored.
  • !
  • Start the score operation.
  • !
! You can then look at and sort by the Spam field in your Inbox - this is likely ! to find hidden spam that you missed from your inbox cleanup. !

Incremental Training

! When you drag a message to your Spam folder, it will be automatically trained ! as spam.  Thus, as the classifier misses spam (or is unsure about them), ! it learns as you correct it.
! If messages are dropped back into the Inbox, they are trained as good - thus, ! the system learns what good messages look like should it incorrectly classify ! it as spam or possible spam.
!
! Contributions to this documentation are welcome!
!
! ! From tim_one@users.sourceforge.net Fri Nov 1 02:04:36 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 31 Oct 2002 18:04:36 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.20,1.21 filter.py,1.11,1.12 manager.py,1.27,1.28 msgstore.py,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000 Modified Files: addin.py filter.py manager.py msgstore.py Log Message: Whitespace normalization. Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** addin.py 31 Oct 2002 21:56:59 -0000 1.20 --- addin.py 1 Nov 2002 02:03:39 -0000 1.21 *************** *** 300,304 **** self.folder_hooks[k]._obj_.close() self.folder_hooks = new_hooks ! def _HookFolderEvents(self, folder_ids, include_sub, HandlerClass): new_hooks = {} --- 300,304 ---- self.folder_hooks[k]._obj_.close() self.folder_hooks = new_hooks ! def _HookFolderEvents(self, folder_ids, include_sub, HandlerClass): new_hooks = {} Index: filter.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** filter.py 31 Oct 2002 21:56:59 -0000 1.11 --- filter.py 1 Nov 2002 02:03:42 -0000 1.12 *************** *** 79,83 **** if progress.stop_requested(): return ! # All done - report what we did. err_text = "" if dispositions.has_key("Error"): --- 79,83 ---- if progress.stop_requested(): return ! # All done - report what we did. 
err_text = "" if dispositions.has_key("Error"): Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** manager.py 31 Oct 2002 21:56:59 -0000 1.27 --- manager.py 1 Nov 2002 02:03:43 -0000 1.28 *************** *** 113,117 **** # "Integer" from the UI doesn't exist! # 'olNumber' doesn't seem to work with PT_INT* ! win32com.client.constants.olCombination, True) # Add to folder item.Save() --- 113,117 ---- # "Integer" from the UI doesn't exist! # 'olNumber' doesn't seem to work with PT_INT* ! win32com.client.constants.olCombination, True) # Add to folder item.Save() *************** *** 130,134 **** self.EnsureOutlookFieldsForFolder(folder.EntryID, True) folder = folders.GetNext() ! def LoadBayes(self): if not os.path.exists(self.ini_filename): --- 130,134 ---- self.EnsureOutlookFieldsForFolder(folder.EntryID, True) folder = folders.GetNext() ! def LoadBayes(self): if not os.path.exists(self.ini_filename): Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** msgstore.py 31 Oct 2002 21:56:59 -0000 1.13 --- msgstore.py 1 Nov 2002 02:03:45 -0000 1.14 *************** *** 363,367 **** # objects use the same name-to-identifier mapping. # [MarkH: Note MAPIUUID object are supported and hashable] ! # XXX If the SpamProb (Hammie, whatever) property is passed in as an # XXX int, Outlook displays the field as all blanks, and sorting on --- 363,367 ---- # objects use the same name-to-identifier mapping. # [MarkH: Note MAPIUUID object are supported and hashable] ! 
# XXX If the SpamProb (Hammie, whatever) property is passed in as an # XXX int, Outlook displays the field as all blanks, and sorting on From tim_one@users.sourceforge.net Fri Nov 1 02:04:39 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 31 Oct 2002 18:04:39 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FilterDialog.py,1.7,1.8 ManagerDialog.py,1.4,1.5 TrainingDialog.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000/dialogs Modified Files: FilterDialog.py ManagerDialog.py TrainingDialog.py Log Message: Whitespace normalization. Index: FilterDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** FilterDialog.py 1 Nov 2002 01:23:27 -0000 1.7 --- FilterDialog.py 1 Nov 2002 02:03:46 -0000 1.8 *************** *** 213,217 **** slider_pos = slider.GetPos() self.SetDlgItemText(idc_edit, "%d" % slider_pos) ! def _InitSlider(self, idc_slider, idc_edit): slider = self.GetDlgItem(idc_slider) --- 213,217 ---- slider_pos = slider.GetPos() self.SetDlgItemText(idc_edit, "%d" % slider_pos) ! def _InitSlider(self, idc_slider, idc_edit): slider = self.GetDlgItem(idc_slider) *************** *** 285,289 **** [BUTTON, action_score, IDC_BUT_ACT_SCORE, (15,62,203,10), csts | win32con.BS_AUTORADIOBUTTON], ! [BUTTON, only_group, -1, (7,84,230,35), cs | win32con.BS_GROUPBOX | win32con.WS_GROUP], [BUTTON, only_unread, IDC_BUT_UNREAD, (15,94,149,9), csts | win32con.BS_AUTOCHECKBOX], --- 285,289 ---- [BUTTON, action_score, IDC_BUT_ACT_SCORE, (15,62,203,10), csts | win32con.BS_AUTORADIOBUTTON], ! 
[BUTTON, only_group, -1, (7,84,230,35), cs | win32con.BS_GROUPBOX | win32con.WS_GROUP], [BUTTON, only_unread, IDC_BUT_UNREAD, (15,94,149,9), csts | win32con.BS_AUTOCHECKBOX], Index: ManagerDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** ManagerDialog.py 31 Oct 2002 21:57:00 -0000 1.4 --- ManagerDialog.py 1 Nov 2002 02:03:48 -0000 1.5 *************** *** 28,32 **** training_intro = "Training is the process of giving examples of both good and bad email to the system so it can classify future email" filtering_intro = "Filtering defines how spam is handled as it arrives" ! dt = [ # Dialog itself. --- 28,32 ---- training_intro = "Training is the process of giving examples of both good and bad email to the system so it can classify future email" filtering_intro = "Filtering defines how spam is handled as it arrives" ! dt = [ # Dialog itself. *************** *** 39,48 **** [BUTTON, "It is moved from a spam folder back to the Inbox", IDC_BUT_TRAIN_FROM_SPAM_FOLDER,(20,50,204,9), csts | win32con.BS_AUTOCHECKBOX], ! [STATIC, "Automatically train that a message is spam when", -1, (15,64,208,10), cs], [BUTTON, "It is moved to the certain-spam folder", IDC_BUT_TRAIN_TO_SPAM_FOLDER,(20,75,204,9), csts | win32con.BS_AUTOCHECKBOX], ! [STATIC, "", IDC_TRAINING_STATUS, (15,88,146,14), cs | win32con.SS_LEFTNOWORDWRAP | win32con.SS_CENTERIMAGE | win32con.SS_SUNKEN], [BUTTON, 'Train Now...', IDC_BUT_TRAIN_NOW, (167,88,63,14), csts | win32con.BS_PUSHBUTTON], --- 39,48 ---- [BUTTON, "It is moved from a spam folder back to the Inbox", IDC_BUT_TRAIN_FROM_SPAM_FOLDER,(20,50,204,9), csts | win32con.BS_AUTOCHECKBOX], ! 
[STATIC, "Automatically train that a message is spam when", -1, (15,64,208,10), cs], [BUTTON, "It is moved to the certain-spam folder", IDC_BUT_TRAIN_TO_SPAM_FOLDER,(20,75,204,9), csts | win32con.BS_AUTOCHECKBOX], ! [STATIC, "", IDC_TRAINING_STATUS, (15,88,146,14), cs | win32con.SS_LEFTNOWORDWRAP | win32con.SS_CENTERIMAGE | win32con.SS_SUNKEN], [BUTTON, 'Train Now...', IDC_BUT_TRAIN_NOW, (167,88,63,14), csts | win32con.BS_PUSHBUTTON], *************** *** 72,76 **** (IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"), ] ! dialog.Dialog.__init__(self, self.dt) --- 72,76 ---- (IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"), ] ! dialog.Dialog.__init__(self, self.dt) *************** *** 125,129 **** filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \ % (watch_names, certain_spam_name, unsure_name) ! self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable) enabled = config.enabled --- 125,129 ---- filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \ % (watch_names, certain_spam_name, unsure_name) ! self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable) enabled = config.enabled *************** *** 133,137 **** def OnButAbout(self, id, code): if code == win32con.BN_CLICKED: ! fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html") fname = os.path.abspath(fname) --- 133,137 ---- def OnButAbout(self, id, code): if code == win32con.BN_CLICKED: ! 
fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html") fname = os.path.abspath(fname) Index: TrainingDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/TrainingDialog.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** TrainingDialog.py 31 Oct 2002 21:57:00 -0000 1.6 --- TrainingDialog.py 1 Nov 2002 02:03:52 -0000 1.7 *************** *** 76,80 **** if len(self.config.spam_folder_ids)==0 and self.mgr.config.filter.spam_folder_id: self.config.spam_folder_ids = [self.mgr.config.filter.spam_folder_id] ! names = [] for eid in self.config.ham_folder_ids: --- 76,80 ---- if len(self.config.spam_folder_ids)==0 and self.mgr.config.filter.spam_folder_id: self.config.spam_folder_ids = [self.mgr.config.filter.spam_folder_id] ! names = [] for eid in self.config.ham_folder_ids: From tim_one@users.sourceforge.net Fri Nov 1 02:04:39 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 31 Oct 2002 18:04:39 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox delete_outlook_field.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000/sandbox Modified Files: delete_outlook_field.py Log Message: Whitespace normalization. Index: delete_outlook_field.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** delete_outlook_field.py 31 Oct 2002 21:57:00 -0000 1.1 --- delete_outlook_field.py 1 Nov 2002 02:04:03 -0000 1.2 *************** *** 69,73 **** None, mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS) ! table = mapi_folder.GetContentsTable(0) prop_ids = PR_ENTRYID, --- 69,73 ---- None, mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS) ! 
table = mapi_folder.GetContentsTable(0) prop_ids = PR_ENTRYID, *************** *** 152,156 **** print msg ! def main(): import getopt --- 152,156 ---- print msg ! def main(): import getopt From npickett@users.sourceforge.net Fri Nov 1 02:55:35 2002 From: npickett@users.sourceforge.net (Neale Pickett) Date: Thu, 31 Oct 2002 18:55:35 -0800 Subject: [Spambayes-checkins] spambayes hammiesrv.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18408 Modified Files: hammiesrv.py Log Message: * XML-encode the output (thanks Toby Dickenson) Index: hammiesrv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** hammiesrv.py 27 Oct 2002 05:13:55 -0000 1.8 --- hammiesrv.py 1 Nov 2002 02:55:32 -0000 1.9 *************** *** 41,45 **** except AttributeError: pass ! return hammie.Hammie.score(self, msg, *extra) def filter(self, msg, *extra): --- 41,45 ---- except AttributeError: pass ! return xmlrpclib.Binary(hammie.Hammie.score(self, msg, *extra)) def filter(self, msg, *extra): *************** *** 48,52 **** except AttributeError: pass ! return hammie.Hammie.filter(self, msg, *extra) --- 48,52 ---- except AttributeError: pass ! return xmlrpclib.Binary(hammie.Hammie.filter(self, msg, *extra)) From anthonybaxter@users.sourceforge.net Fri Nov 1 04:06:52 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 31 Oct 2002 20:06:52 -0800 Subject: [Spambayes-checkins] website related.ht,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv6404 Modified Files: related.ht Log Message: bogofilter now on SF. 
Index: related.ht =================================================================== RCS file: /cvsroot/spambayes/website/related.ht,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** related.ht 30 Sep 2002 04:02:31 -0000 1.2 --- related.ht 1 Nov 2002 04:06:49 -0000 1.3 *************** *** 9,13 ****
  • Gary Arnold's bayespam, a perl qmail filter.
  • The mozilla project is working on this, see bug 163188 !
  • Eric Raymond's bogofilter, a C code bayesian filter.
  • ifile, a Naive Bayes classification system.
  • PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification. --- 9,13 ----
  • Gary Arnold's bayespam, a perl qmail filter.
  • The mozilla project is working on this, see bug 163188 !
  • Eric Raymond's bogofilter, a C code bayesian filter.
  • ifile, a Naive Bayes classification system.
  • PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification. From anthonybaxter@users.sourceforge.net Fri Nov 1 04:10:52 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 31 Oct 2002 20:10:52 -0800 Subject: [Spambayes-checkins] spambayes timcv.py,1.10,1.11 msgs.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7003 Modified Files: timcv.py msgs.py Log Message: Added support for specifying different numbers for training and testing ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam) still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain have been added to timcv.py. Note that msgs.setparms _tries_ to do the right thing if it's called as an old 3-arg form, but I might not have captured all the possible twistedness. As far as I can tell, only timcv.py and timtest.py actually call these Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** timcv.py 10 Oct 2002 04:55:15 -0000 1.10 --- timcv.py 1 Nov 2002 04:10:50 -0000 1.11 *************** *** 14,24 **** If you only want to use some of the messages in each set, --ham-keep int ! The maximum number of msgs to use from each Ham set. The msgs are ! chosen randomly. See also the -s option. --spam-keep int ! The maximum number of msgs to use from each Spam set. The msgs are ! chosen randomly. See also the -s option. -s int --- 14,40 ---- If you only want to use some of the messages in each set, + --HamTrain int + The maximum number of msgs to use from each Ham set for training. + The msgs are chosen randomly. See also the -s option. + + --SpamTrain int + The maximum number of msgs to use from each Spam set for training. + The msgs are chosen randomly. See also the -s option. 
+ + --HamTest int + The maximum number of msgs to use from each Ham set for testing. + The msgs are chosen randomly. See also the -s option. + + --SpamTest int + The maximum number of msgs to use from each Spam set for testing. + The msgs are chosen randomly. See also the -s option. + --ham-keep int ! The maximum number of msgs to use from each Ham set for testing ! and training. The msgs are chosen randomly. See also the -s option. --spam-keep int ! The maximum number of msgs to use from each Spam set for testing ! and training. The msgs are chosen randomly. See also the -s option. -s int *************** *** 57,62 **** d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:])) # Now run nsets times, predicting pair i against all except pair i. --- 73,80 ---- d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), ! hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), ! spamdirs[1:], train=1)) # Now run nsets times, predicting pair i against all except pair i. *************** *** 64,69 **** h = hamdirs[i] s = spamdirs[i] ! hamstream = msgs.HamStream(h, [h]) ! spamstream = msgs.SpamStream(s, [s]) if i > 0: --- 82,87 ---- h = hamdirs[i] s = spamdirs[i] ! hamstream = msgs.HamStream(h, [h], train=0) ! spamstream = msgs.SpamStream(s, [s], train=0) if i > 0: *************** *** 80,84 **** del s2[i] ! d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2)) else: --- 98,103 ---- del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), ! msgs.SpamStream(sname, s2, train=1)) else: *************** *** 101,109 **** try: opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ! ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) ! 
nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': --- 120,131 ---- try: opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ! ['HamTrain=', 'SpamTrain=', ! 'HamTest=', 'SpamTest=', ! 'ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) ! nsets = seed = hamtrain = spamtrain = None ! hamtest = spamtest = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': *************** *** 113,116 **** --- 135,146 ---- elif opt == '-s': seed = int(arg) + elif opt == '--HamTest': + hamtest = int(arg) + elif opt == '--SpamTest': + spamtest = int(arg) + elif opt == '--HamTrain': + hamtrain = int(arg) + elif opt == '--SpamTrain': + spamtrain = int(arg) elif opt == '--ham-keep': hamkeep = int(arg) *************** *** 123,127 **** usage(1, "-n is required") ! msgs.setparms(hamkeep, spamkeep, seed) drive(nsets) --- 153,160 ---- usage(1, "-n is required") ! if hamkeep is not None: ! msgs.setparms(hamkeep, spamkeep, seed=seed) ! else: ! msgs.setparms(hamtrain, spamtrain, hamtest, spamtest, seed) drive(nsets) Index: msgs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/msgs.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** msgs.py 25 Sep 2002 20:07:06 -0000 1.4 --- msgs.py 1 Nov 2002 04:10:50 -0000 1.5 *************** *** 6,11 **** from tokenizer import tokenize ! HAMKEEP = None ! SPAMKEEP = None SEED = random.randrange(2000000000) --- 6,13 ---- from tokenizer import tokenize ! HAMTEST = None ! SPAMTEST = None ! HAMTRAIN = None ! SPAMTRAIN = None SEED = random.randrange(2000000000) *************** *** 68,83 **** class HamStream(MsgStream): ! def __init__(self, tag, directories): ! MsgStream.__init__(self, tag, directories, HAMKEEP) class SpamStream(MsgStream): ! def __init__(self, tag, directories): ! MsgStream.__init__(self, tag, directories, SPAMKEEP) ! def setparms(hamkeep, spamkeep, seed=None): ! """Set HAMKEEP and SPAMKEEP. 
If seed is not None, also set SEED.""" ! global HAMKEEP, SPAMKEEP, SEED ! HAMKEEP, SPAMKEEP = hamkeep, spamkeep if seed is not None: SEED = seed --- 70,103 ---- class HamStream(MsgStream): ! def __init__(self, tag, directories, train=0): ! if train: ! MsgStream.__init__(self, tag, directories, HAMTRAIN) ! else: ! MsgStream.__init__(self, tag, directories, HAMTEST) class SpamStream(MsgStream): ! def __init__(self, tag, directories, train=0): ! if train: ! MsgStream.__init__(self, tag, directories, SPAMTRAIN) ! else: ! MsgStream.__init__(self, tag, directories, SPAMTEST) ! def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None): ! """Set HAMTEST/TRAIN and SPAMTEST/TRAIN. ! If seed is not None, also set SEED. ! If (ham|spam)test are not set, set to the same as the (ham|spam)train ! numbers (backwards compat option). ! """ ! global HAMTEST, SPAMTEST, HAMTRAIN, SPAMTRAIN, SEED ! HAMTRAIN, SPAMTRAIN = hamtrain, spamtrain ! if hamtest is None: ! HAMTEST = HAMTRAIN ! else: ! HAMTEST = hamtest ! if spamtest is None: ! SPAMTEST = SPAMTRAIN ! else: ! SPAMTEST = spamtest if seed is not None: SEED = seed From anthonybaxter@users.sourceforge.net Fri Nov 1 04:13:13 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 31 Oct 2002 20:13:13 -0800 Subject: [Spambayes-checkins] spambayes timtest.py,1.29,1.30 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8231 Modified Files: timtest.py Log Message: Added support for specifying different numbers for training and testing ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam) still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain have been added to timcv.py. Note that msgs.setparms _tries_ to do the right thing if it's called as an old 3-arg form, but I might not have captured all the possible twistedness. As far as I can tell, only timcv.py and timtest.py actually call these. 
Also, msgs.HamStream and msgs.SpamStream now have an optional 'train'
argument (which defaults to 0/False), which tells them whether to use the
test or train numbers.  If you have your own test harnesses, you _might_
need to update them a little.

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** timtest.py 24 Sep 2002 05:37:11 -0000 1.29
--- timtest.py 1 Nov 2002 04:13:11 -0000 1.30
***************
*** 98,102 ****
          usage(1, "-n is required")
!     msgs.setparms(hamkeep, spamkeep, seed)
      drive(nsets)
--- 98,102 ----
          usage(1, "-n is required")
!     msgs.setparms(hamkeep, spamkeep, seed=seed)
      drive(nsets)

From anthony@interlink.com.au Fri Nov 1 04:13:29 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 15:13:29 +1100
Subject: [Spambayes-checkins] spambayes timcv.py,1.10,1.11 msgs.py,1.4,1.5
In-Reply-To:
Message-ID: <200211010413.gA14DUn09404@localhost.localdomain>

>>> "Anthony Baxter" wrote
> Update of /cvsroot/spambayes/spambayes
> In directory usw-pr-cvs1:/tmp/cvs-serv7003
>
> Modified Files:
> 	timcv.py msgs.py
> Log Message:
> Added support for specifying different numbers for training and testing
> ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam)
> still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain
> have been added to timcv.py.
>
> Note that msgs.setparms _tries_ to do the right thing if it's called as
> an old 3-arg form, but I might not have captured all the possible
> twistedness. As far as I can tell, only timcv.py and timtest.py
> actually call these

Weird. My cvs commit aborted and only did two of the files, and truncated
my commit message??? I'll use cvs admin to fix the commit message next.

Anthony
--
Anthony Baxter
It's never too late to have a happy childhood.
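
The defaulting rule in the new msgs.setparms, and the reason timtest.py now
passes seed by keyword, may be easier to see in isolation.  The sketch below
restates the rule from the msgs.py diff as a pure function without the
module globals.  The name resolve_limits and the returned dictionary are
hypothetical and exist only for illustration, and the sketch is written for
a current Python 3 interpreter.

```python
def resolve_limits(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
    """Restate the fallback rule from the msgs.setparms diff.

    If a test limit is not given, it falls back to the matching train
    limit, which is what keeps the old --ham-keep/--spam-keep behaviour.
    """
    if hamtest is None:
        hamtest = hamtrain
    if spamtest is None:
        spamtest = spamtrain
    return {
        "hamtrain": hamtrain,
        "spamtrain": spamtrain,
        "hamtest": hamtest,
        "spamtest": spamtest,
        "seed": seed,
    }


# Old-style call with the seed passed by keyword: the test limits
# follow the train limits and the seed is kept.
keyword_seed = resolve_limits(100, 50, seed=42)

# Old-style call with the seed passed positionally: the third argument
# is now read as hamtest, so the seed is silently lost.
positional_seed = resolve_limits(100, 50, 42)
```

With the keyword form, hamtest is 100 and seed is 42.  With the positional
form, hamtest becomes 42 and seed stays None, which is the kind of
twistedness the log message warns about for callers still using the old
three-argument form.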
From anthonybaxter@users.sourceforge.net Fri Nov 1 04:50:21 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 31 Oct 2002 20:50:21 -0800 Subject: [Spambayes-checkins] website applications.ht,NONE,1.1 index.ht,1.1.1.1,1.2 links.h,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv20352 Modified Files: index.ht links.h Added Files: applications.ht Log Message: initial 'applications' notes. --- NEW FILE: applications.ht --- Title: SpamBayes: Applications Author-Email: spambayes@python.org Author: spambayes

    Applications

    A number of applications are available in the SpamBayes project. None of these are particularly polished, finished pieces of work, but they're getting there (and help is always appreciated).

    Outlook2000

    Sean True and Mark Hammond have developed an addin for Outlook2000 that adds support for the spambayes classifier.

    Requirements

    • Python2.2 or later (2.2.2 recommended)
    • Outlook 2000 (not Outlook Express)
    • Python's win32com extensions (win32all-149 or later)
    • CDO installed.

    For more on this, see the README.txt or about.html file in the spambayes CVS repository's Outlook2000 directory.

    Availability

    At the moment, you'll need to use CVS to get the code - go to the CVS page on the project's SourceForge site for more.

    hammie.py

    hammie is a command-line tool for marking mail as ham or spam. Skip Montanaro has started a guide to integrating hammie with your mailer (Unix-only instructions at the moment - additions welcome!). Currently it focuses on running hammie via procmail.

    Requirements

    • Python2.2 or later (2.2.2 recommended)
    • The documentation currently focuses on Unix.

    Availability

    At the moment, you'll need to use CVS to get the code - go to the CVS page on the project's SourceForge site for more.

    pop3proxy.py

    pop3proxy sits between your mail client and your real POP3 server and marks mail as ham or spam as it passes through. See the docstring at the top of pop3proxy.py for more.
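
    As a rough illustration of the idea (this is not the actual pop3proxy.py code), a proxy of this kind could tag each retrieved message by adding a header before passing it on to the mail client. The function name, the header name and the cutoff values below are all assumptions made for this sketch.

```python
# Hypothetical sketch: tag a raw message with a classification header.
# tag_message, "X-Spam-Sketch" and the cutoffs are illustrative names and
# values, not taken from pop3proxy.py.

def tag_message(raw_message, score, header_name="X-Spam-Sketch",
                spam_cutoff=0.9, ham_cutoff=0.2):
    """Return raw_message with a verdict header appended to its headers.

    score is assumed to be a spam probability between 0.0 and 1.0.
    """
    if score >= spam_cutoff:
        verdict = "Yes"
    elif score <= ham_cutoff:
        verdict = "No"
    else:
        verdict = "Unsure"
    # Headers and body are separated by the first blank line (CRLF CRLF).
    headers, separator, body = raw_message.partition("\r\n\r\n")
    return headers + "\r\n" + header_name + ": " + verdict + separator + body


tagged = tag_message("Subject: hi\r\n\r\nbody", 0.95)
```

    The mail client can then filter on that header with its ordinary rules, so no further changes are needed on the client side.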

    Requirements

    • Python2.2 or later (2.2.2 recommended)
    • Should work on Windows/Unix/whatever... ?

    Availability

    At the moment, you'll need to use CVS to get the code - go to the CVS page on the project's SourceForge site for more.

    Index: index.ht =================================================================== RCS file: /cvsroot/spambayes/website/index.ht,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** index.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1 --- index.ht 1 Nov 2002 04:50:19 -0000 1.2 *************** *** 12,16 **** via CVS - note that it's not yet ! suitable for end-users, but for people interested in experimenting.

    --- 12,22 ---- via CVS - note that it's not yet ! suitable for non-technical end-users, but for people interested ! in experimenting. !

    !

    ! There are now a couple of end-user applications available for those ! excited by the bleeding edge - these are detailed on the ! Applications page.

    Index: links.h =================================================================== RCS file: /cvsroot/spambayes/website/links.h,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** links.h 19 Sep 2002 23:39:24 -0000 1.2 --- links.h 1 Nov 2002 04:50:19 -0000 1.3 *************** *** 3,6 **** --- 3,7 ----
  • Background
  • Documentation +
  • Applications
  • Developers
  • Related

From mhammond@users.sourceforge.net Fri Nov 1 05:48:02 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Thu, 31 Oct 2002 21:48:02 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FolderSelector.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs In directory usw-pr-cvs1:/tmp/cvs-serv548/dialogs Modified Files: FolderSelector.py Log Message: All items are now identified by a (store_id, entry_id) tuple. This was done in such a way that old config files should be fully supported - no need to reconfigure. Not much should look different, except multiple stores should be *fully* supported - you should be able to train and filter across stores to your heart's content. Index: FolderSelector.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** FolderSelector.py 31 Oct 2002 21:57:00 -0000 1.5 --- FolderSelector.py 1 Nov 2002 05:47:59 -0000 1.6 *************** *** 53,63 **** from win32com.mapi.mapitags import * def _BuildFoldersMAPI(msgstore, folder): # Get the hierarchy table for it. table = folder.GetHierarchyTable(0) children = [] ! rows = mapi.HrQueryAllRows(table, (PR_ENTRYID,PR_DISPLAY_NAME_A), None, None, 0) ! for (eid_tag, eid),(name_tag, name) in rows: ! spec = FolderSpec(mapi.HexFromBin(eid), name) child_folder = msgstore.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS) spec.children = _BuildFoldersMAPI(msgstore, child_folder) --- 53,66 ---- from win32com.mapi.mapitags import * + default_store_id = None + def _BuildFoldersMAPI(msgstore, folder): # Get the hierarchy table for it. table = folder.GetHierarchyTable(0) children = [] ! rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0) ! for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows: ! 
folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid) ! spec = FolderSpec(folder_id, name) child_folder = msgstore.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS) spec.children = _BuildFoldersMAPI(msgstore, child_folder) *************** *** 66,79 **** def BuildFolderTreeMAPI(session): root = FolderSpec(None, "root") tab = session.GetMsgStoresTable(0) ! rows = mapi.HrQueryAllRows(tab, (PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0) for row in rows: ! (eid_tag, eid), (name_tag, name) = row msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS) hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0) subtree_eid = data[0][1] folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS) ! spec = FolderSpec(mapi.HexFromBin(subtree_eid), name) spec.children = _BuildFoldersMAPI(msgstore, folder) root.children.append(spec) --- 69,89 ---- def BuildFolderTreeMAPI(session): + global default_store_id root = FolderSpec(None, "root") tab = session.GetMsgStoresTable(0) ! prop_tags = PR_ENTRYID, PR_DEFAULT_STORE, PR_DISPLAY_NAME_A ! rows = mapi.HrQueryAllRows(tab, prop_tags, None, None, 0) for row in rows: ! (eid_tag, eid), (is_def_tag, is_def), (name_tag, name) = row ! hex_eid = mapi.HexFromBin(eid) ! if is_def: ! default_store_id = hex_eid ! msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS) hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0) subtree_eid = data[0][1] folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS) ! folder_id = hex_eid, mapi.HexFromBin(subtree_eid) ! 
spec = FolderSpec(folder_id, name) spec.children = _BuildFoldersMAPI(msgstore, folder) root.children.append(spec) *************** *** 126,129 **** --- 136,153 ---- self.checkbox_text = checkbox_text or "Include &subfolders" + def CompareIDs(self, id1, id2): + if type(id1) != type(()): + id1 = default_store_id, id1 + if type(id2) != type(()): + id2 = default_store_id, id2 + return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \ + self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1])) + + def InIDs(self, id, ids): + for id_check in ids: + if self.CompareIDs(id_check, id): + return True + return False + def _MakeItemParam(self, item): item_id = self.next_item_id *************** *** 144,148 **** mask = state = 0 else: ! if self.selected_ids and child.folder_id in self.selected_ids: state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED) num_children_selected += 1 --- 168,172 ---- mask = state = 0 else: ! if self.selected_ids and self.InIDs(child.folder_id, self.selected_ids): state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED) num_children_selected += 1 *************** *** 152,156 **** item_id = self._MakeItemParam(child) hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id)) ! if self.single_select and self.selected_ids and child.folder_id in self.selected_ids: self.list.SelectItem(hitem) --- 176,180 ---- item_id = self._MakeItemParam(child) hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id)) ! 
if self.single_select and self.selected_ids and self.InIDs(child.folder_id, self.selected_ids): self.list.SelectItem(hitem) From mhammond@users.sourceforge.net Fri Nov 1 05:48:01 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Thu, 31 Oct 2002 21:48:01 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.21,1.22 manager.py,1.28,1.29 msgstore.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv548 Modified Files: addin.py manager.py msgstore.py Log Message: All items are now identified by a (store_id, entry_id) tuple. This was done in such a way that old config files should be fully supported - no need to reconfigure. Not much should look different, except multiple stores should be *fully* supported - you should be able to train and filter across stores to your heart's content. Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** addin.py 1 Nov 2002 02:03:39 -0000 1.21 --- addin.py 1 Nov 2002 05:47:59 -0000 1.22 *************** *** 308,312 **** existing = self.folder_hooks.get(eid) if existing is None or existing.__class__ != HandlerClass: ! folder = self.application.Session.GetFolderFromID(eid) name = folder.Name.encode("mbcs", "replace") try: --- 308,312 ---- existing = self.folder_hooks.get(eid) if existing is None or existing.__class__ != HandlerClass: ! 
folder = self.application.Session.GetFolderFromID(*eid) name = folder.Name.encode("mbcs", "replace") try: Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** manager.py 1 Nov 2002 02:03:43 -0000 1.28 --- manager.py 1 Nov 2002 05:47:59 -0000 1.29 *************** *** 92,96 **** assert self.outlook is not None, "I need outlook :(" ol = self.outlook ! folder = ol.Session.GetFolderFromID(folder_id) if self.verbose > 1: print "Checking folder '%s' for our field '%s'" \ --- 92,96 ---- assert self.outlook is not None, "I need outlook :(" ol = self.outlook ! folder = ol.Session.GetFolderFromID(*folder_id) if self.verbose > 1: print "Checking folder '%s' for our field '%s'" \ Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** msgstore.py 1 Nov 2002 02:03:45 -0000 1.14 --- msgstore.py 1 Nov 2002 05:47:59 -0000 1.15 *************** *** 91,123 **** mapi.MAPI_USE_DEFAULT) self.session = mapi.MAPILogonEx(0, None, None, logonFlags) ! self._FindDefaultMessageStore() os.chdir(cwd) def Close(self): ! self.mapi_msgstore = None self.session.Logoff(0, 0, 0) self.session = None mapi.MAPIUninitialize() ! def _FindDefaultMessageStore(self): ! tab = self.session.GetMsgStoresTable(0) ! # Restriction for the table: get rows where PR_DEFAULT_STORE is true. ! # There should be only one. ! restriction = (mapi.RES_PROPERTY, # a property restriction ! (mapi.RELOP_EQ, # check for equality ! PR_DEFAULT_STORE, # of the PR_DEFAULT_STORE prop ! (PR_DEFAULT_STORE, True))) # with True ! rows = mapi.HrQueryAllRows(tab, ! (PR_ENTRYID,), # columns to retrieve ! restriction, # only these rows ! None, # any sort order is fine ! 
0) # any # of results is fine ! # get first entry, a (property_tag, value) pair, for PR_ENTRYID ! row = rows[0] ! eid_tag, eid = row[0] ! # Open the store. ! self.mapi_msgstore = self.session.OpenMsgStore( 0, # no parent window ! eid, # msg store to open None, # IID; accept default IMsgStore # need write access to add score fields --- 91,135 ---- mapi.MAPI_USE_DEFAULT) self.session = mapi.MAPILogonEx(0, None, None, logonFlags) ! self.mapi_msg_stores = {} ! self.default_store_bin_eid = None ! self._GetMessageStore(None) os.chdir(cwd) def Close(self): ! self.mapi_msg_stores = None self.session.Logoff(0, 0, 0) self.session = None mapi.MAPIUninitialize() ! def _GetMessageStore(self, store_eid): # bin eid. ! try: ! # Will usually be pre-fetched, so fast-path out ! return self.mapi_msg_stores[store_eid] ! except KeyError: ! pass ! given_store_eid = store_eid ! if store_eid is None: ! # Find the EID for the default store. ! tab = self.session.GetMsgStoresTable(0) ! # Restriction for the table: get rows where PR_DEFAULT_STORE is true. ! # There should be only one. ! restriction = (mapi.RES_PROPERTY, # a property restriction ! (mapi.RELOP_EQ, # check for equality ! PR_DEFAULT_STORE, # of the PR_DEFAULT_STORE prop ! (PR_DEFAULT_STORE, True))) # with True ! rows = mapi.HrQueryAllRows(tab, ! (PR_ENTRYID,), # columns to retrieve ! restriction, # only these rows ! None, # any sort order is fine ! 0) # any # of results is fine ! # get first entry, a (property_tag, value) pair, for PR_ENTRYID ! row = rows[0] ! eid_tag, store_eid = row[0] ! self.default_store_bin_eid = store_eid ! ! # Open it. ! store = self.session.OpenMsgStore( 0, # no parent window ! store_eid, # msg store to open None, # IID; accept default IMsgStore # need write access to add score fields *************** *** 126,158 **** mapi.MDB_NO_MAIL | USE_DEFERRED_ERRORS) def _GetSubFolderIter(self, folder): table = folder.GetHierarchyTable(0) rows = mapi.HrQueryAllRows(table, ! 
(PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0) ! for (eid_tag, eid),(name_tag, name) in rows: ! sub = self.mapi_msgstore.OpenEntry(eid, ! None, ! mapi.MAPI_MODIFY | ! USE_DEFERRED_ERRORS) table = sub.GetContentsTable(0) ! yield MAPIMsgStoreFolder(self, eid, name, table.GetRowCount(0)) ! folder = self.mapi_msgstore.OpenEntry(eid, ! None, ! mapi.MAPI_MODIFY | ! USE_DEFERRED_ERRORS) ! for store_folder in self._GetSubFolderIter(folder): yield store_folder def GetFolderGenerator(self, folder_ids, include_sub): for folder_id in folder_ids: ! folder_id = mapi.BinFromHex(folder_id) ! folder = self.mapi_msgstore.OpenEntry(folder_id, ! None, ! mapi.MAPI_MODIFY | ! USE_DEFERRED_ERRORS) table = folder.GetContentsTable(0) rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0) --- 138,191 ---- mapi.MDB_NO_MAIL | USE_DEFERRED_ERRORS) + # cache it + self.mapi_msg_stores[store_eid] = store + if given_store_eid is None: # The default store + self.mapi_msg_stores[None] = store + return store + + def _OpenEntry(self, id, iid = None, flags = None): + # id is already normalized. + store_id, item_id = id + store = self._GetMessageStore(store_id) + if flags is None: + flags = mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS + return store.OpenEntry(item_id, iid, flags) + + # Given an ID, normalize it into a (store_id, item_id) binary tuple. + # item_id may be: + # - Simple hex EID, in wich case default store ID is assumed. + # - Tuple of (None, hex_eid), in which case default store assumed. + # - Tuple of (hex_store_id, hex_id) + def NormalizeID(self, item_id): + if type(item_id)==type(()): + store_id, item_id = item_id + item_id = mapi.BinFromHex(item_id) + if store_id is None: + store_id = self.default_store_bin_eid + else: + store_id = mapi.BinFromHex(store_id) + return store_id, item_id + assert type(item_id) in [type(''), type(u'')], "What kind of ID is '%r'?" 
% (item_id,) + return self.default_store_bin_eid, mapi.BinFromHex(item_id) def _GetSubFolderIter(self, folder): table = folder.GetHierarchyTable(0) rows = mapi.HrQueryAllRows(table, ! (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0) ! for (eid_tag, eid), (store_eid_tag, store_eid), (name_tag, name) in rows: ! item_id = store_eid, eid ! sub = self._OpenEntry(item_id) table = sub.GetContentsTable(0) ! yield MAPIMsgStoreFolder(self, item_id, name, table.GetRowCount(0)) ! for store_folder in self._GetSubFolderIter(sub): yield store_folder def GetFolderGenerator(self, folder_ids, include_sub): for folder_id in folder_ids: ! folder_id = self.NormalizeID(folder_id) ! folder = self._OpenEntry(folder_id) table = folder.GetContentsTable(0) rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0) *************** *** 165,173 **** def GetFolder(self, folder_id): # Return a single folder given the ID. ! folder_id = mapi.BinFromHex(folder_id) ! folder = self.mapi_msgstore.OpenEntry(folder_id, ! None, ! mapi.MAPI_MODIFY | ! USE_DEFERRED_ERRORS) table = folder.GetContentsTable(0) rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0) --- 198,203 ---- def GetFolder(self, folder_id): # Return a single folder given the ID. ! folder_id = self.NormalizeID(folder_id) ! folder = self._OpenEntry(folder_id) table = folder.GetContentsTable(0) rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0) *************** *** 177,191 **** def GetMessage(self, message_id): # Return a single message given the ID. ! message_id = mapi.BinFromHex(message_id) prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD ! mapi_object = self.mapi_msgstore.OpenEntry(message_id, ! None, ! mapi.MAPI_MODIFY | ! USE_DEFERRED_ERRORS) hr, data = mapi_object.GetProps(prop_ids,0) folder_eid = data[0][1] searchkey = data[1][1] unread = data[2][1] ! 
folder = MAPIMsgStoreFolder(self, folder_eid, "Unknown - temp message", -1) return MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread) --- 207,219 ---- def GetMessage(self, message_id): # Return a single message given the ID. ! message_id = self.NormalizeID(message_id) prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD ! mapi_object = self._OpenEntry(message_id) hr, data = mapi_object.GetProps(prop_ids,0) folder_eid = data[0][1] searchkey = data[1][1] unread = data[2][1] ! folder_id = message_id[0], folder_eid ! folder = MAPIMsgStoreFolder(self, folder_id, "Unknown - temp message", -1) return MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread) *************** *** 216,232 **** def __repr__(self): ! return "<%s '%s' (%d items), id=%s>" % (self.__class__.__name__, self.name, self.count, ! mapi.HexFromBin(self.id)) def GetOutlookEntryID(self): ! return mapi.HexFromBin(self.id) def GetMessageGenerator(self): ! folder = self.msgstore.mapi_msgstore.OpenEntry(self.id, ! None, ! mapi.MAPI_MODIFY | ! USE_DEFERRED_ERRORS) table = folder.GetContentsTable(0) prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD --- 244,263 ---- def __repr__(self): ! return "<%s '%s' (%d items), id=%s/%s>" % (self.__class__.__name__, self.name, self.count, ! mapi.HexFromBin(self.id[0]), ! mapi.HexFromBin(self.id[1])) def GetOutlookEntryID(self): ! # Return EntryID, StoreID - we use this order as it is the same as ! # Session.GetItemFromID() uses - thus: ! # ids = me.GetOutlookEntryID() ! # session.GetItemFromID(*ids) ! # should work. ! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0]) def GetMessageGenerator(self): ! folder = self.msgstore._OpenEntry(self.id) table = folder.GetContentsTable(0) prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD *************** *** 239,244 **** break for row in rows: yield MAPIMsgStoreMsg(self.msgstore, self, ! 
row[0][1], row[1][1], row[2][1]) --- 270,276 ---- break for row in rows: + item_id = self.id[0], row[0][1] # assume in same store as folder! yield MAPIMsgStoreMsg(self.msgstore, self, ! item_id, row[1][1], row[2][1]) *************** *** 263,272 **** else: urs = "unread" ! return "<%s, (%s) id=%s>" % (self.__class__.__name__, urs, ! mapi.HexFromBin(self.id)) def GetOutlookEntryID(self): ! return mapi.HexFromBin(self.id) def _GetPropFromStream(self, prop_id): --- 295,310 ---- else: urs = "unread" ! return "<%s, (%s) id=%s/%s>" % (self.__class__.__name__, urs, ! mapi.HexFromBin(self.id[0]), ! mapi.HexFromBin(self.id[1])) def GetOutlookEntryID(self): ! # Return EntryID, StoreID - we use this order as it is the same as ! # Session.GetItemFromID() uses - thus: ! # ids = me.GetOutlookEntryID() ! # session.GetItemFromID(*ids) ! # should work. ! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0]) def _GetPropFromStream(self, prop_id): *************** *** 319,326 **** def _EnsureObject(self): if self.mapi_object is None: ! self.mapi_object = self.msgstore.mapi_msgstore.OpenEntry( ! self.id, ! None, ! mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS) def GetEmailPackageObject(self, strip_mime_headers=True): --- 357,361 ---- def _EnsureObject(self): if self.mapi_object is None: ! self.mapi_object = self.msgstore._OpenEntry(self.id) def GetEmailPackageObject(self, strip_mime_headers=True): *************** *** 418,432 **** assert not self.dirty, \ "asking me to move a dirty message - later saves will fail!" ! dest_folder = self.msgstore.mapi_msgstore.OpenEntry( ! folder.id, ! None, ! mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS) ! source_folder = self.msgstore.mapi_msgstore.OpenEntry( ! self.folder.id, ! None, ! mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS) flags = 0 if isMove: flags |= MESSAGE_MOVE ! source_folder.CopyMessages((self.id,), None, dest_folder, --- 453,462 ---- assert not self.dirty, \ "asking me to move a dirty message - later saves will fail!" ! 
dest_folder = self.msgstore._OpenEntry(folder.id) ! source_folder = self.msgstore._OpenEntry(self.folder.id) flags = 0 if isMove: flags |= MESSAGE_MOVE ! eid = self.id[1] ! source_folder.CopyMessages((eid,), None, dest_folder, *************** *** 434,438 **** None, flags) ! self.folder = self.msgstore.GetFolder(mapi.HexFromBin(folder.id)) def MoveTo(self, folder): --- 464,473 ---- None, flags) ! # At this stage, I think we have lost meaningful ID etc values ! # Set everything to None to make it clearer what is wrong should ! # this become an issue. We would need to re-fetch the eid of ! # the item, and set the store_id to the dest folder. ! self.id = None ! self.folder = None def MoveTo(self, folder): *************** *** 453,457 **** print msg store.Close() - if __name__=='__main__': --- 488,491 ---- From mhammond@users.sourceforge.net Fri Nov 1 06:09:08 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Thu, 31 Oct 2002 22:09:08 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.29,1.30 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv5475 Modified Files: manager.py Log Message: Stop everyone fretting over a known problem. Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** manager.py 1 Nov 2002 05:47:59 -0000 1.29 --- manager.py 1 Nov 2002 06:09:06 -0000 1.30 *************** *** 119,125 **** print "Created the UserProperty!" except pythoncom.com_error: ! import traceback ! print "Failed to create the field" ! traceback.print_exc() # else no items in this folder - not much worth doing! if include_sub: --- 119,126 ---- print "Created the UserProperty!" except pythoncom.com_error: ! pass # We know, we know... ! ## import traceback ! ## print "Failed to create the field" ! 
## traceback.print_exc() # else no items in this folder - not much worth doing! if include_sub: From tim.one@comcast.net Fri Nov 1 06:22:38 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 01 Nov 2002 01:22:38 -0500 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FolderSelector.py,1.5,1.6 In-Reply-To: Message-ID: [Mark Hammond] > Modified Files: > FolderSelector.py > Log Message: > All items are now identified by a (store_id, entry_id) tuple. This was > done in such a way that old config files should be fully supported - no > need to reconfigure. > > Not much should look different, except mutiple stores should be *fully* > supported - you should be able to train and filter across stores to your > hearts content. That's impressive! I'll do my bit next by ensuring there's no trailing whitespace . From richiehindle@users.sourceforge.net Fri Nov 1 09:14:50 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Fri, 01 Nov 2002 01:14:50 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16187 Modified Files: pop3proxy.py Log Message: Made this work on Linux, where socket.makefile behaves differently from Windows. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** pop3proxy.py 29 Oct 2002 21:02:40 -0000 1.7 --- pop3proxy.py 1 Nov 2002 09:14:47 -0000 1.8 *************** *** 87,94 **** self.request = '' self.set_terminator('\r\n') ! serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! serverSocket.connect((serverName, serverPort)) ! self.serverFile = serverSocket.makefile() ! self.push(self.serverFile.readline()) def handle_connect(self): --- 87,94 ---- self.request = '' self.set_terminator('\r\n') ! 
self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! self.serverSocket.connect((serverName, serverPort)) ! self.serverIn = self.serverSocket.makefile('r') # For reading only ! self.push(self.serverIn.readline()) def handle_connect(self): *************** *** 135,139 **** seenAllHeaders = False while True: ! line = self.serverFile.readline() if not line: # The socket's been closed by the server, probably by QUIT. --- 135,139 ---- seenAllHeaders = False while True: ! line = self.serverIn.readline() if not line: # The socket's been closed by the server, probably by QUIT. *************** *** 173,184 **** # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': ! self.serverFile.write('QUIT\r\n') ! self.serverFile.flush() self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! self.serverFile.write(self.request + '\r\n') ! self.serverFile.flush() if self.request.strip() == '': # Someone just hit the Enter key. --- 173,182 ---- # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': ! self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! self.serverSocket.sendall(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. *************** *** 200,204 **** if timedOut: while True: ! line = self.serverFile.readline() if not line: # The socket's been closed by the server. --- 198,202 ---- if timedOut: while True: ! line = self.serverIn.readline() if not line: # The socket's been closed by the server. *************** *** 529,532 **** --- 527,531 ---- asyncore.loop(map=testSocketMap) + proxyReady = threading.Event() def runProxy(): # Name the database in case it ever gets auto-flushed to disk. 
*************** *** 535,538 **** --- 534,538 ---- bayes.learn(tokenizer.tokenize(spam1), True) bayes.learn(tokenizer.tokenize(good1), False) + proxyReady.set() asyncore.loop() *************** *** 540,548 **** testServerReady.wait() threading.Thread(target=runProxy).start() # Connect to the proxy. proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM) proxy.connect(('localhost', 8111)) ! assert proxy.recv(100) == "+OK ready\r\n" # Stat the mailbox to get the number of messages. --- 540,550 ---- testServerReady.wait() threading.Thread(target=runProxy).start() + proxyReady.wait() # Connect to the proxy. proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM) proxy.connect(('localhost', 8111)) ! response = proxy.recv(100) ! assert response == "+OK ready\r\n" # Stat the mailbox to get the number of messages. From mhammond@users.sourceforge.net Fri Nov 1 14:35:10 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Fri, 01 Nov 2002 06:35:10 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.22,1.23 manager.py,1.30,1.31 msgstore.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv14364 Modified Files: addin.py manager.py msgstore.py Log Message: Fix a problem with the (store_id, item_id) change, and remove the confusing GetOutlookItemID concept - just get the item! Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** addin.py 1 Nov 2002 05:47:59 -0000 1.22 --- addin.py 1 Nov 2002 14:35:05 -0000 1.23 *************** *** 305,312 **** for msgstore_folder in self.manager.message_store.GetFolderGenerator( folder_ids, include_sub): ! eid = msgstore_folder.GetOutlookEntryID() ! existing = self.folder_hooks.get(eid) if existing is None or existing.__class__ != HandlerClass: ! 
folder = self.application.Session.GetFolderFromID(*eid) name = folder.Name.encode("mbcs", "replace") try: --- 305,311 ---- for msgstore_folder in self.manager.message_store.GetFolderGenerator( folder_ids, include_sub): ! existing = self.folder_hooks.get(msgstore_folder.id) if existing is None or existing.__class__ != HandlerClass: ! folder = msgstore_folder.GetOutlookItem() name = folder.Name.encode("mbcs", "replace") try: *************** *** 317,325 **** if new_hook is not None: new_hook.Init(folder, self.application, self.manager) ! new_hooks[eid] = new_hook ! self.manager.EnsureOutlookFieldsForFolder(eid) print "AntiSpam: Watching for new messages in folder", name else: ! new_hooks[eid] = existing return new_hooks --- 316,324 ---- if new_hook is not None: new_hook.Init(folder, self.application, self.manager) ! new_hooks[msgstore_folder.id] = new_hook ! self.manager.EnsureOutlookFieldsForFolder(msgstore_folder.GetID()) print "AntiSpam: Watching for new messages in folder", name else: ! new_hooks[msgstore_folder.id] = existing return new_hooks Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** manager.py 1 Nov 2002 06:09:06 -0000 1.30 --- manager.py 1 Nov 2002 14:35:05 -0000 1.31 *************** *** 92,96 **** assert self.outlook is not None, "I need outlook :(" ol = self.outlook ! folder = ol.Session.GetFolderFromID(*folder_id) if self.verbose > 1: print "Checking folder '%s' for our field '%s'" \ --- 92,97 ---- assert self.outlook is not None, "I need outlook :(" ol = self.outlook ! msgstore_folder = self.message_store.GetFolder(folder_id) ! 
folder = msgstore_folder.GetOutlookItem() if self.verbose > 1: print "Checking folder '%s' for our field '%s'" \ Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** msgstore.py 1 Nov 2002 05:47:59 -0000 1.15 --- msgstore.py 1 Nov 2002 14:35:06 -0000 1.16 *************** *** 219,230 **** return MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread) - ## # Currently no need for this - ## def GetOutlookObjectFromID(self, eid): - ## if self.outlook is None: - ## from win32com.client import Dispatch - ## self.outlook = Dispatch("Outlook.Application") - ## return self.outlook.Session.GetItemFromID(mapi.HexFromBin(eid)) - - _MapiTypeMap = { type(0.0): PT_DOUBLE, --- 219,222 ---- *************** *** 250,260 **** mapi.HexFromBin(self.id[1])) ! def GetOutlookEntryID(self): ! # Return EntryID, StoreID - we use this order as it is the same as ! # Session.GetItemFromID() uses - thus: ! # ids = me.GetOutlookEntryID() ! # session.GetItemFromID(*ids) ! # should work. ! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0]) def GetMessageGenerator(self): --- 242,252 ---- mapi.HexFromBin(self.id[1])) ! def GetID(self): ! return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1]) ! ! def GetOutlookItem(self): ! hex_item_id = mapi.HexFromBin(self.id[1]) ! hex_store_id = mapi.HexFromBin(self.id[0]) ! return self.msgstore.outlook.Session.GetFolderFromID(hex_item_id, hex_store_id) def GetMessageGenerator(self): *************** *** 300,310 **** mapi.HexFromBin(self.id[1])) ! def GetOutlookEntryID(self): ! # Return EntryID, StoreID - we use this order as it is the same as ! # Session.GetItemFromID() uses - thus: ! # ids = me.GetOutlookEntryID() ! # session.GetItemFromID(*ids) ! # should work. ! 
return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0]) def _GetPropFromStream(self, prop_id): --- 292,302 ---- mapi.HexFromBin(self.id[1])) ! def GetID(self): ! return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1]) ! ! def GetOutlookItem(self): ! hex_item_id = mapi.HexFromBin(self.id[1]) ! store_hex_id = mapi.HexFromBin(self.id[0]) ! return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id) def _GetPropFromStream(self, prop_id): From tim_one@users.sourceforge.net Fri Nov 1 16:01:20 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 01 Nov 2002 08:01:20 -0800 Subject: [Spambayes-checkins] spambayes classifier.py,1.45,1.46 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7943 Modified Files: classifier.py Log Message: WordInfo.__init__: if an initial spamprob isn't specified, set it to options.robinson_probability_x (the "unknown word" probability) instead of to None. If threads exist such that scoring can happen in parallel with training, None could cause scoring to raise an exception. "A real" spamprob can't be computed until update_probabilities is called to recalculate the entire database; before then, giving a new word the unknown-word spamprob is thoroughly appropriate. Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.45 retrieving revision 1.46 diff -C2 -d -r1.45 -r1.46 *** classifier.py 27 Oct 2002 17:11:00 -0000 1.45 --- classifier.py 1 Nov 2002 16:01:14 -0000 1.46 *************** *** 62,66 **** # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=None): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 62,66 ---- # a word is no longer being used, it's just wasting space. ! 
def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 From sjoerd@users.sourceforge.net Fri Nov 1 16:10:18 2002 From: sjoerd@users.sourceforge.net (Sjoerd Mullender) Date: Fri, 01 Nov 2002 08:10:18 -0800 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.59,1.60 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13555 Modified Files: tokenizer.py Log Message: Switch " and ' in url_re character class and add # ' token the re to resync python-mode. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.59 retrieving revision 1.60 diff -C2 -d -r1.59 -r1.60 *** tokenizer.py 31 Oct 2002 15:43:55 -0000 1.59 --- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60 *************** *** 604,609 **** # be in HTML, may or may not be in quotes, etc. If it's full of % # escapes, cool -- that's a clue too. ! ([^\s<>'"\x7f-\xff]+) # capture the guts ! """, re.VERBOSE) urlsep_re = re.compile(r"[;?:@&=+,$.]") --- 604,609 ---- # be in HTML, may or may not be in quotes, etc. If it's full of % # escapes, cool -- that's a clue too. ! ([^\s<>"'\x7f-\xff]+) # capture the guts ! """, re.VERBOSE) # ' urlsep_re = re.compile(r"[;?:@&=+,$.]") From mhammond@users.sourceforge.net Fri Nov 1 23:54:05 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Fri, 01 Nov 2002 15:54:05 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.23,1.24 msgstore.py,1.16,1.17 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv14570 Modified Files: addin.py msgstore.py Log Message: Fix a couple of places the "multiple stores" concept fell over. 
Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** addin.py 1 Nov 2002 14:35:05 -0000 1.23 --- addin.py 1 Nov 2002 23:54:03 -0000 1.24 *************** *** 121,125 **** # PR_RECEIVED_BY_ENTRYID # PR_TRANSPORT_MESSAGE_HEADERS ! msgstore_message = self.manager.message_store.GetMessage(item.EntryID) if msgstore_message.GetField(self.manager.config.field_score_name) is not None: # Already seem this message - user probably moving it back --- 121,125 ---- # PR_RECEIVED_BY_ENTRYID # PR_TRANSPORT_MESSAGE_HEADERS ! msgstore_message = self.manager.message_store.GetMessage(item) if msgstore_message.GetField(self.manager.config.field_score_name) is not None: # Already seem this message - user probably moving it back *************** *** 154,158 **** if not self.manager.config.training.train_manual_spam: return ! msgstore_message = self.manager.message_store.GetMessage(item.EntryID) prop = msgstore_message.GetField(self.manager.config.field_score_name) if prop is not None: --- 154,158 ---- if not self.manager.config.training.train_manual_spam: return ! msgstore_message = self.manager.message_store.GetMessage(item) prop = msgstore_message.GetField(self.manager.config.field_score_name) if prop is not None: *************** *** 189,193 **** return ! msgstore_message = mgr.message_store.GetMessage(item.EntryID) score, clues = mgr.score(msgstore_message, evidence=True, scale=False) new_msg = app.CreateItem(0) --- 189,193 ---- return ! 
msgstore_message = mgr.message_store.GetMessage(item) score, clues = mgr.score(msgstore_message, evidence=True, scale=False) new_msg = app.CreateItem(0) Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** msgstore.py 1 Nov 2002 14:35:06 -0000 1.16 --- msgstore.py 1 Nov 2002 23:54:03 -0000 1.17 *************** *** 206,211 **** def GetMessage(self, message_id): ! # Return a single message given the ID. ! message_id = self.NormalizeID(message_id) prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD mapi_object = self._OpenEntry(message_id) --- 206,217 ---- def GetMessage(self, message_id): ! # Return a single message given either the ID, or an Outlook ! # message representing the object. ! if hasattr(message_id, "EntryID"): ! # A CDO object ! message_id = mapi.BinFromHex(message_id.Parent.StoreID), \ ! mapi.BinFromHex(message_id.EntryID) ! else: ! message_id = self.NormalizeID(message_id) prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD mapi_object = self._OpenEntry(message_id) From mhammond@users.sourceforge.net Sat Nov 2 03:12:15 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Fri, 01 Nov 2002 19:12:15 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox delete_outlook_field.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox In directory usw-pr-cvs1:/tmp/cvs-serv30593 Modified Files: delete_outlook_field.py Log Message: Fix missing quote in usage string. 
Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** delete_outlook_field.py 1 Nov 2002 02:04:03 -0000 1.2
--- delete_outlook_field.py 2 Nov 2002 03:12:12 -0000 1.3
***************
*** 147,151 ****
  of the default message store

! Eg, python\\python-dev' will locate a python-dev subfolder in a python
  subfolder in your default store.
  """ % os.path.basename(sys.argv[0])
--- 147,151 ----
  of the default message store

! Eg, 'python\\python-dev' will locate a python-dev subfolder in a python
  subfolder in your default store.
  """ % os.path.basename(sys.argv[0])

From mhammond@users.sourceforge.net Sat Nov 2 03:13:24 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:13:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox dump_props.py,NONE,1.1
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv30848

Added Files:
        dump_props.py
Log Message:
Tool to dump everything we know about a message.

--- NEW FILE: dump_props.py ---
# Dump every property we can find for a MAPI item

from win32com.client import Dispatch, constants
import pythoncom
import os, sys

from win32com.mapi import mapi, mapiutil
from win32com.mapi.mapitags import *

mapi.MAPIInitialize(None)
logonFlags = (mapi.MAPI_NO_MAIL |
              mapi.MAPI_EXTENDED |
              mapi.MAPI_USE_DEFAULT)
session = mapi.MAPILogonEx(0, None, None, logonFlags)

def _FindDefaultMessageStore():
    tab = session.GetMsgStoresTable(0)
    # Restriction for the table: get rows where PR_DEFAULT_STORE is true.
    # There should be only one.
    restriction = (mapi.RES_PROPERTY,   # a property restriction
                   (mapi.RELOP_EQ,      # check for equality
                    PR_DEFAULT_STORE,   # of the PR_DEFAULT_STORE prop
                    (PR_DEFAULT_STORE, True))) # with True
    rows = mapi.HrQueryAllRows(tab,
                               (PR_ENTRYID,),   # columns to retrieve
                               restriction,     # only these rows
                               None,            # any sort order is fine
                               0)               # any # of results is fine
    # get first entry, a (property_tag, value) pair, for PR_ENTRYID
    row = rows[0]
    eid_tag, eid = row[0]
    # Open the store.
    return session.OpenMsgStore(
                            0,      # no parent window
                            eid,    # msg store to open
                            None,   # IID; accept default IMsgStore
                            # need write access to add score fields
                            mapi.MDB_WRITE |
                                # we won't send or receive email
                                mapi.MDB_NO_MAIL |
                                mapi.MAPI_DEFERRED_ERRORS)

def _FindItemsWithValue(folder, prop_tag, prop_val):
    tab = folder.GetContentsTable(0)
    # Restriction for the table: get rows where our prop values match
    restriction = (mapi.RES_CONTENT,   # a property restriction
                   (mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
                    prop_tag,   # of the given prop
                    (prop_tag, prop_val))) # with given val
##    tab.SetColumns((PR_ENTRYID,), 0)
##    restriction = None
    rows = mapi.HrQueryAllRows(tab,
                               (PR_ENTRYID,),   # columns to retrieve
                               restriction,     # only these rows
                               None,            # any sort order is fine
                               0)               # any # of results is fine
    # get entry IDs
    print rows
    return [row[0][1] for row in rows]

def _FindFolderEID(name):
    assert name
    from win32com.mapi import exchange
    if not name.startswith("\\"):
        name = "\\Top Of Personal Folders\\" + name
    store = _FindDefaultMessageStore()
    folder_eid = exchange.HrMAPIFindFolderEx(store, "\\", name)
    return folder_eid

# Also in new versions of mapituil
def GetAllProperties(obj, make_tag_names = True):
    tags = obj.GetPropList(0)
    hr, data = obj.GetProps(tags)
    ret = []
    for tag, val in data:
        if make_tag_names:
            hr, tags, array = obj.GetNamesFromIDs( (tag,) )
            if type(array[0][1])==type(u''):
                name = array[0][1]
            else:
                name = mapiutil.GetPropTagName(tag)
        else:
            name = tag
        ret.append((name, val))
    return ret

def DumpProps(folder_eid, subject, shorten):
    mapi_msgstore = _FindDefaultMessageStore()
    mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
                                          None,
                                          mapi.MAPI_DEFERRED_ERRORS)
    hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
    name = data[0][1]
    print name
    eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject)
    print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject)
    for eid in eids:
        print "Dumping item with ID", mapi.HexFromBin(eid)
        item = mapi_msgstore.OpenEntry(eid,
                                       None,
                                       mapi.MAPI_DEFERRED_ERRORS)
        for prop_name, prop_val in GetAllProperties(item):
            prop_repr = repr(prop_val)
            if shorten:
                prop_repr = prop_repr[:50]
            print "%-20s: %s" % (prop_name, prop_repr)

def usage():
    msg = """\
Usage: %s [-f foldername] subject of the message
-f - Search for the message in the specified folder (default = Inbox)
-s - Shorten long property values.

Dumps all properties for all messages that match the subject. Subject
matching is substring and ignore-case.

Folder name must be a hierarchical 'path' name, using '\\'
as the path seperator. If the folder name begins with a
\\, it must be a fully-qualified name, including the message
store name (eg, "Top of Public Folders"). If the path does not
begin with a \\, it is assumed to be fully-qualifed from the root
of the default message store

Eg, python\\python-dev' will locate a python-dev subfolder in a python
subfolder in your default store.
""" % os.path.basename(sys.argv[0])
    print msg

def main():
    import getopt
    try:
        opts, args = getopt.getopt(sys.argv[1:], "f:s")
    except getopt.error, e:
        print e
        print
        usage()
        sys.exit(1)
    folder_name = ""
    subject = " ".join(args)
    if not subject:
        usage()
        sys.exit(1)

    shorten = False
    for opt, opt_val in opts:
        if opt == "-f":
            folder_name = opt_val
        elif opt == "-s":
            shorten = True
        else:
            print "Invalid arg"
            return

    if not folder_name:
        folder_name = "Inbox" # Assume this exists!

    eid = _FindFolderEID(folder_name)
    if eid is None:
        print "*** Cant find folder", folder_name
        return
    DumpProps(eid, subject, shorten)

if __name__=='__main__':
    main()

From mhammond@users.sourceforge.net Sat Nov 2 03:18:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:18:10 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox dump_props.py,1.1,1.2
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv31673

Modified Files:
        dump_props.py
Log Message:
Remove old debug code I missed.

Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** dump_props.py 2 Nov 2002 03:13:22 -0000 1.1
--- dump_props.py 2 Nov 2002 03:18:08 -0000 1.2
***************
*** 48,53 ****
                      prop_tag,   # of the given prop
                      (prop_tag, prop_val))) # with given val
- ##    tab.SetColumns((PR_ENTRYID,), 0)
- ##    restriction = None
      rows = mapi.HrQueryAllRows(tab,
                                 (PR_ENTRYID,),   # columns to retrieve
--- 48,51 ----
***************
*** 56,60 ****
                                 0)               # any # of results is fine
      # get entry IDs
-     print rows
      return [row[0][1] for row in rows]

--- 54,57 ----

From mhammond@users.sourceforge.net Sat Nov 2 04:00:45 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 20:00:45 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv13755

Modified Files:
        README.txt
Log Message:
Update to reflect the current world state.

Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** README.txt 21 Oct 2002 01:38:10 -0000 1.4 --- README.txt 2 Nov 2002 04:00:43 -0000 1.5 *************** *** 4,12 **** to run the Outlook Addin you *must* have win32all-149 or later. ! ** NOTE ** - You also need CDO installed. This comes with Outlook 2k, but is ! not installed by default. Attempting to install the add-in will detect this ! situation, and print instructions how to install CDO. Note however that ! running the stand-alone scripts (see below) will generally just print the raw ! Python exception - generally a semi-incomprehensible COM exception. Outlook Addin --- 4,8 ---- to run the Outlook Addin you *must* have win32all-149 or later. ! CDO is no longer needed :) Outlook Addin *************** *** 43,54 **** Inbox filter). You can watch as many folders for Spam as you like. - You can define any number of filters to apply, each performing a different - action or testing a different spam probability. You can enable and disable - any rule, and you can "Filter Now" an entire folder in one step. - - Note that the rule ordering can be important, as if early rules move - a message, later rules will not fire for that message (cos MAPI - appears to make access to the message once moved impossible) - Command Line Tools ------------------- --- 39,42 ---- *************** *** 66,76 **** plugin must be running for filtering of new mail to occur) - classify.py - Creates a field in each message with the classifier score. Once run, - the Outlook Field Chooser can be used to display, sort etc the field, - or used to change formatting of these messages. The field will appear - in "user defined fields" - - Misc Comments =========== --- 54,57 ---- *************** *** 78,86 **** Somewhere over 4MB, they seem to stop working. 
Mark's hasn't got that big yet - just over 2MB and going strong. - - Outlook will occasionally complain that folders are corrupted after running - filter. Closing and reopening Outlook always seems to restore things, - with no fuss. Your mileage may vary. Buyer beware. Worth what you paid. - (Mark hasn't seen this) Copyright transferred to PSF from Sean D. True and WebReply.com. --- 59,62 ---- From mhammond@users.sourceforge.net Sat Nov 2 04:08:04 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Fri, 01 Nov 2002 20:08:04 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv15352 Modified Files: README.txt Log Message: Add known problems. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** README.txt 2 Nov 2002 04:00:43 -0000 1.5 --- README.txt 2 Nov 2002 04:08:02 -0000 1.6 *************** *** 2,9 **** Outlook 2000, courtesy of Sean True and Mark Hammond. Note that you need Python's win32com extensions (http://starship.python.net/crew/mhammond) and ! to run the Outlook Addin you *must* have win32all-149 or later. CDO is no longer needed :) Outlook Addin ========== --- 2,12 ---- Outlook 2000, courtesy of Sean True and Mark Hammond. Note that you need Python's win32com extensions (http://starship.python.net/crew/mhammond) and ! you *must* have win32all-149 or later. CDO is no longer needed :) + See below for a list of known problems (particularly that you must manually + create an Outlook property before you can see the Spam scores) + Outlook Addin ========== *************** *** 54,63 **** plugin must be running for filtering of new mail to occur) Misc Comments =========== - Sean reports bad output saving very large classifiers in training.py. 
- Somewhere over 4MB, they seem to stop working. Mark's hasn't got
- that big yet - just over 2MB and going strong.
-
  Copyright transferred to PSF from Sean D. True and WebReply.com.
  Licensed under PSF, see Tim Peters for IANAL interpretation.
--- 57,76 ----
  plugin must be running for filtering of new mail to occur)

+ Known Problems
+ ---------------
+ * No field is created in Outlook for the Spam Score field. To create
+   the field, go to the field chooser for the folder you are interested
+   in, and create a new User Property called "Spam". Ensure the type
+   of the field is "Integer" (the last option), NOT "Number". This is only
+   necessary for you to *see* the score, not for the scoring to work.
+
+ * Filtering an Exchange Server public store appears to not work.
+
+ * Sean reports bad output saving very large classifiers in training.py.
+   Somewhere over 4MB, they seem to stop working. Mark's hasn't got
+   that big yet - just over 2MB and going strong.
+
  Misc Comments
  ===========
  Copyright transferred to PSF from Sean D. True and WebReply.com.
  Licensed under PSF, see Tim Peters for IANAL interpretation.

From tim.one@comcast.net Sat Nov 2 04:12:29 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:12:29 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
In-Reply-To:
Message-ID:

[Mark Hammond]
> ...
> Modified Files:
> README.txt
> Log Message:
> Update to reflect the current world state.
> ...
> - Outlook will occasionally complain that folders are corrupted
> - after running filter. Closing and reopening Outlook always seems to
> - restore things, with no fuss. Your mileage may vary. Buyer beware.
> - Worth what you paid.
> - (Mark hasn't seen this)

I meant to mention before that I've never seen this either. Sean, do you
still see it? scanpst.exe sometimes claims there are minor inconsistencies
when I run it, but it's always done that, and AFAICT it doesn't claim it
more often now than before I started using the addin.

From mhammond@skippinet.com.au Sat Nov 2 04:18:30 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sat, 2 Nov 2002 15:18:30 +1100 Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5 In-Reply-To: Message-ID: > > ... > > - Outlook will occasionally complain that folders are corrupted > > - after running filter. Closing and reopening Outlook always seems to > > - restore things, with no fuss. Your mileage may vary. Buyer beware. > > - Worth what you paid. > > - (Mark hasn't seen this) > > I meant to mention before that I've never seen this either. Sean, do you > still see it? scanpst.exe sometimes claims there are minor > inconsistencies > when I run it, but it's always done that, and AFAICT it doesn't claim it > more often now than before I started using the addin. Actually, I saw similar things when using the Outlook model to scan huge folders. Since moving to MAPI I think it will have gone away. Mark. From mhammond@users.sourceforge.net Sat Nov 2 05:26:55 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Fri, 01 Nov 2002 21:26:55 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox dump_props.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox In directory usw-pr-cvs1:/tmp/cvs-serv4243 Modified Files: dump_props.py Log Message: Add support for dumping attachments too Index: dump_props.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** dump_props.py 2 Nov 2002 03:18:08 -0000 1.2 --- dump_props.py 2 Nov 2002 05:26:52 -0000 1.3 *************** *** 82,86 **** return ret ! def DumpProps(folder_eid, subject, shorten): mapi_msgstore = _FindDefaultMessageStore() mapi_folder = mapi_msgstore.OpenEntry(folder_eid, --- 82,93 ---- return ret ! def DumpItemProps(item, shorten): ! 
for prop_name, prop_val in GetAllProperties(item): ! prop_repr = repr(prop_val) ! if shorten: ! prop_repr = prop_repr[:50] ! print "%-20s: %s" % (prop_name, prop_repr) ! ! def DumpProps(folder_eid, subject, include_attach, shorten): mapi_msgstore = _FindDefaultMessageStore() mapi_folder = mapi_msgstore.OpenEntry(folder_eid, *************** *** 89,93 **** hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0) name = data[0][1] - print name eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject) print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject) --- 96,99 ---- *************** *** 97,105 **** None, mapi.MAPI_DEFERRED_ERRORS) ! for prop_name, prop_val in GetAllProperties(item): ! prop_repr = repr(prop_val) ! if shorten: ! prop_repr = prop_repr[:50] ! print "%-20s: %s" % (prop_name, prop_repr) def usage(): --- 103,116 ---- None, mapi.MAPI_DEFERRED_ERRORS) ! DumpItemProps(item, shorten) ! if include_attach: ! print ! table = item.GetAttachmentTable(0) ! rows = mapi.HrQueryAllRows(table, (PR_ATTACH_NUM,), None, None, 0) ! for row in rows: ! attach_num = row[0][1] ! print "Dumping attachment (PR_ATTACH_NUM=%d)" % (attach_num,) ! attach = item.OpenAttach(attach_num, None, mapi.MAPI_DEFERRED_ERRORS) ! DumpItemProps(attach, shorten) def usage(): *************** *** 108,111 **** --- 119,123 ---- -f - Search for the message in the specified folder (default = Inbox) -s - Shorten long property values. + -a - Include attachments Dumps all properties for all messages that match the subject. Subject *************** *** 128,132 **** import getopt try: ! opts, args = getopt.getopt(sys.argv[1:], "f:s") except getopt.error, e: print e --- 140,144 ---- import getopt try: ! 
opts, args = getopt.getopt(sys.argv[1:], "af:s") except getopt.error, e: print e *************** *** 141,144 **** --- 153,157 ---- shorten = False + include_attach = False for opt, opt_val in opts: if opt == "-f": *************** *** 146,149 **** --- 159,164 ---- elif opt == "-s": shorten = True + elif opt == "-a": + include_attach = True else: print "Invalid arg" *************** *** 157,161 **** print "*** Cant find folder", folder_name return ! DumpProps(eid, subject, shorten) if __name__=='__main__': --- 172,176 ---- print "*** Cant find folder", folder_name return ! DumpProps(eid, subject, include_attach, shorten) if __name__=='__main__': From mhammond@users.sourceforge.net Sat Nov 2 06:12:36 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Fri, 01 Nov 2002 22:12:36 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.17,1.18 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv13542 Modified Files: msgstore.py Log Message: Correct misleading comment. Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** msgstore.py 1 Nov 2002 23:54:03 -0000 1.17 --- msgstore.py 2 Nov 2002 06:12:34 -0000 1.18 *************** *** 209,213 **** # message representing the object. if hasattr(message_id, "EntryID"): ! # A CDO object message_id = mapi.BinFromHex(message_id.Parent.StoreID), \ mapi.BinFromHex(message_id.EntryID) --- 209,213 ---- # message representing the object. if hasattr(message_id, "EntryID"): ! 
# An Outlook object message_id = mapi.BinFromHex(message_id.Parent.StoreID), \ mapi.BinFromHex(message_id.EntryID) From tim_one@users.sourceforge.net Sat Nov 2 06:53:26 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 01 Nov 2002 22:53:26 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv21025/Outlook2000 Modified Files: about.html Log Message: Added exhaustive sister-friendly instructions for creating a Spam column in a view in a folder. Index: about.html =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** about.html 1 Nov 2002 01:24:09 -0000 1.2 --- about.html 2 Nov 2002 06:53:24 -0000 1.3 *************** *** 18,25 **** --- 18,27 ---- consider spam, and continually adapt as both your regular email and spam patterns change.
    +

    Training

    Due to the nature of the system, it must be trained before it can be effective.  Although the system does learn over time, when first installed it has no knowledge of either spam or good email.
    +

    Initial Training

    When first installed, it is recommended you perform the following steps:
    *************** *** 44,47 **** --- 46,50 ---- You can then look at and sort by the Spam field in your Inbox - this is likely to find hidden spam that you missed from your inbox cleanup. +

    Incremental Training

    When you drag a message to your Spam folder, it will be automatically trained *************** *** 51,55 **** the system learns what good messages look like should it incorrectly classify it as spam or possible spam.
    !
    Contributions to this documentation are welcome!

    --- 54,97 ---- the system learns what good messages look like should it incorrectly classify it as spam or possible spam.
    ! !

    Creating a Spam Score Field

    ! A custom property named "Spam" is added to all Outlook messages scored. ! This is an integer in 0 (ham) through 100 (spam) inclusive. ! You can teach Outlook to display this field as a column in any table view, ! like the standard Messages view. !

    ! This takes some work, and has to be done again for every folder in which ! you want to display a Spam column: !

      !
    • While looking at an Outlook table view (like Messages), right-click ! on the line with column headers (From, Subject, To, Received, ...). ! In the context menu that pops up, click on Field Chooser. A box ! with title Field Chooser pops up. !
    • In the lower left corner of the Field Chooser box, click ! New.... A box with title New Field pops up. !
    • In the Name: box, type Spam. !
    • In the Type: dropdown list, select Integer. This is the ! last choice in the dropdown list. ! Do not select Number -- it won't work. !
    • The Format: dropdown list should display "1,234" now. Leave it alone. !
    • Click OK in the New Field box. Now you're back in the ! Field Chooser box. !
    • The dropdown list at the top of the Field Chooser box should say ! User-defined fields in FOLDER now, where FOLDER is the name of the ! folder you're currently looking at (like Inbox). Below that, you ! should see a new rectangular button with a Spam label. !
    • Use your mouse to drag the Spam button to the column header position ! where you want to see the Spam column. You don't have to be precise ! here -- you can rearrange or resize the column later just by dragging ! it around. !
    • You're done! Close the Field Chooser box. !
    ! Outlook's standard Automatic Formatting features can also be taught how ! access the value of this field; for example, you could tell Outlook to display ! rows with suspected spam messages in green italic. However, for whatever reason, ! the Outlook Rules Wizard does not allow creating rules based on user-defined ! fields. That's why this addin supplies its own filtering rules. ! !

    Contributions to this documentation are welcome!
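
The about.html text above describes the "Spam" property as an integer from 0 (ham) through 100 (spam) inclusive. As a rough illustration only, and assuming the classifier produces a probability between 0.0 and 1.0, the mapping could be sketched as below. The function name spam_prob_to_outlook_score is hypothetical and is not taken from the add-in's source.

```python
def spam_prob_to_outlook_score(prob):
    """Map a spam probability in [0.0, 1.0] to an integer in 0..100.

    Hypothetical helper for illustration; the add-in's own scaling
    code is not shown in this archive.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # An integer result matches the "Integer" (not "Number") field type
    # that the instructions tell you to pick in the Field Chooser.
    return int(round(prob * 100))
```

A value produced this way sorts naturally in an Outlook column, with likely ham near 0 and likely spam near 100.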

From tim_one@users.sourceforge.net Sat Nov 2 07:01:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:01:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.3,1.4
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv22485/Outlook2000

Modified Files:
        about.html
Log Message:
Grammar repair in new stuff.

Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** about.html 2 Nov 2002 06:53:24 -0000 1.3
--- about.html 2 Nov 2002 07:01:21 -0000 1.4
***************
*** 87,91 ****
  • You're done! Close the Field Chooser box.

! Outlook's standard Automatic Formatting features can also be taught how
  access the value of this field; for example, you could tell Outlook to display
  rows with suspected spam messages in green italic. However, for whatever reason,
--- 87,91 ----
  • You're done! Close the Field Chooser box. ! Outlook's standard Automatic Formatting features can also be taught how to access the value of this field; for example, you could tell Outlook to display rows with suspected spam messages in green italic. However, for whatever reason, From mhammond@users.sourceforge.net Sat Nov 2 11:27:55 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sat, 02 Nov 2002 03:27:55 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox dump_props.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox In directory usw-pr-cvs1:/tmp/cvs-serv9291/sandbox Modified Files: dump_props.py Log Message: Beat Tim to the whitespace normalization Index: dump_props.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** dump_props.py 2 Nov 2002 05:26:52 -0000 1.3 --- dump_props.py 2 Nov 2002 11:27:53 -0000 1.4 *************** *** 55,59 **** # get entry IDs return [row[0][1] for row in rows] ! def _FindFolderEID(name): assert name --- 55,59 ---- # get entry IDs return [row[0][1] for row in rows] ! def _FindFolderEID(name): assert name *************** *** 67,84 **** # Also in new versions of mapituil def GetAllProperties(obj, make_tag_names = True): ! tags = obj.GetPropList(0) ! hr, data = obj.GetProps(tags) ! ret = [] ! for tag, val in data: ! if make_tag_names: ! hr, tags, array = obj.GetNamesFromIDs( (tag,) ) ! if type(array[0][1])==type(u''): ! name = array[0][1] ! else: ! name = mapiutil.GetPropTagName(tag) ! else: ! name = tag ! ret.append((name, val)) ! return ret def DumpItemProps(item, shorten): --- 67,84 ---- # Also in new versions of mapituil def GetAllProperties(obj, make_tag_names = True): ! tags = obj.GetPropList(0) ! hr, data = obj.GetProps(tags) ! ret = [] ! for tag, val in data: ! if make_tag_names: ! 
hr, tags, array = obj.GetNamesFromIDs( (tag,) ) ! if type(array[0][1])==type(u''): ! name = array[0][1] ! else: ! name = mapiutil.GetPropTagName(tag) ! else: ! name = tag ! ret.append((name, val)) ! return ret def DumpItemProps(item, shorten): *************** *** 88,92 **** prop_repr = prop_repr[:50] print "%-20s: %s" % (prop_name, prop_repr) ! def DumpProps(folder_eid, subject, include_attach, shorten): mapi_msgstore = _FindDefaultMessageStore() --- 88,92 ---- prop_repr = prop_repr[:50] print "%-20s: %s" % (prop_name, prop_repr) ! def DumpProps(folder_eid, subject, include_attach, shorten): mapi_msgstore = _FindDefaultMessageStore() *************** *** 167,171 **** if not folder_name: folder_name = "Inbox" # Assume this exists! ! eid = _FindFolderEID(folder_name) if eid is None: --- 167,171 ---- if not folder_name: folder_name = "Inbox" # Assume this exists! ! eid = _FindFolderEID(folder_name) if eid is None: From mhammond@users.sourceforge.net Sat Nov 2 12:09:38 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sat, 02 Nov 2002 04:09:38 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.18,1.19 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv812 Modified Files: msgstore.py Log Message: Nice patch from Piers Haken that does the best we can with Exchange Server delivered messages. Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** msgstore.py 2 Nov 2002 06:12:34 -0000 1.18 --- msgstore.py 2 Nov 2002 12:09:36 -0000 1.19 *************** *** 351,355 **** --- 351,379 ---- body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1]) html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2]) + # Mail delivered internally via Exchange Server etc may not have + # headers - fake some up. 
+ if not headers: + headers = self._GetFakeHeaders () + # Mail delivered via the Exchange Internet Mail MTA may have + # gibberish at the start of the headers - fix this. + elif headers.startswith("Microsoft Mail"): + headers = "X-MS-Mail-Gibberish: " + headers return "%s\n%s\n%s" % (headers, html, body) + + def _GetFakeHeaders(self): + # This is designed to fake up some SMTP headers for messages + # on an exchange server that do not have such headers of their own + prop_ids = PR_SUBJECT_A, PR_DISPLAY_NAME_A, PR_DISPLAY_TO_A, PR_DISPLAY_CC_A + hr, data = self.mapi_object.GetProps(prop_ids,0) + subject = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0]) + sender = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1]) + to = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2]) + cc = self._GetPotentiallyLargeStringProp(prop_ids[3], data[3]) + headers = ["X-Exchange-Message: true"] + if subject: headers.append("Subject: "+subject) + if sender: headers.append("From: "+sender) + if to: headers.append("To: "+to) + if cc: headers.append("CC: "+cc) + return "\n".join(headers) def _EnsureObject(self): From mhammond@users.sourceforge.net Sat Nov 2 12:28:41 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sat, 02 Nov 2002 04:28:41 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FilterDialog.py,1.8,1.9 FolderSelector.py,1.6,1.7 TrainingDialog.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs In directory usw-pr-cvs1:/tmp/cvs-serv9661 Modified Files: FilterDialog.py FolderSelector.py TrainingDialog.py Log Message: Another nice patch from Piers Haken - use the Outlook object model for the folder dialog. I have no idea why this is necessary for Exchange server, but it seems OK, and is trivial to revert. I'm certain that Exchange Server can be navigated via Ext MAPI, but I'm happy this at least gets more people going. 
Note after applying this, the Folder dialog may not automatically pre-select
the folders you had selected (but they are still working) - however, once
you have re-selected, it does re-remember. (It seems Outlook has done
something funky with the entry IDs, and made them binary comparable,
whereas MAPI and CDO ones are not. Whatever)

Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** FilterDialog.py 1 Nov 2002 02:03:46 -0000 1.8
--- FilterDialog.py 2 Nov 2002 12:28:38 -0000 1.9
***************
*** 194,198 ****
              ids = [ids]
          single_select = not ids_are_list
!         d = FolderSelector.FolderSelector(self.mgr.message_store.session, ids, checkbox_state=None, single_select=single_select)
          if d.DoModal()==win32con.IDOK:
              new_ids, include_sub = d.GetSelectedIDs()
--- 194,199 ----
              ids = [ids]
          single_select = not ids_are_list
! #        d = FolderSelector.FolderSelector(self.mgr.message_store.session, ids, checkbox_state=None, single_select=single_select)
!         d = FolderSelector.FolderSelector(self.mgr.outlook.Session, ids, checkbox_state=None, single_select=single_select)
          if d.DoModal()==win32con.IDOK:
              new_ids, include_sub = d.GetSelectedIDs()

Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FolderSelector.py 1 Nov 2002 05:47:59 -0000 1.6
--- FolderSelector.py 2 Nov 2002 12:28:38 -0000 1.7
***************
*** 22,25 ****
--- 22,35 ----
              c.dump(level+1)

+ # Oh, lord help us.
+ # We started with a CDO version - but CDO sucks for lots of reasons I
+ # wont even start to mention.
+ # So we moved to an Extended MAPI version with is nice and fast - screams
+ # along! Except it doesn't work in all cases with Exchange (which
+ # strikes Mark as extremely strange given that the Extended MAPI Python
+ # bindings were developed against an Exchange Server - but Mark doesn't
+ # have an Exchange server handy these days, and really doesn't give a
+ # rat's arse
+ # So finally we have an Outlook object model version!
  #########################################################################
  ## CDO version of a folder walker.
***************
*** 90,93 ****
--- 100,118 ----
      return root

+ ## - An Outlook object model version
+ def _BuildFolderTreeOutlook(session, parent):
+     children = []
+     for i in range (parent.Folders.Count):
+         folder = parent.Folders [i+1]
+         spec = FolderSpec ((folder.StoreID, folder.EntryID), folder.Name.encode("mbcs", "replace"))
+         if folder.Folders != None:
+             spec.children = _BuildFolderTreeOutlook (session, folder)
+         children.append(spec)
+     return children
+
+ def BuildFolderTreeOutlook(session):
+     root = FolderSpec(None, "root")
+     root.children = _BuildFolderTreeOutlook(session, session)
+     return root

  #########################################################################
***************
*** 141,146 ****
          if type(id2) != type(()):
              id2 = default_store_id, id2
!         return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
!                self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))

      def InIDs(self, id, ids):
--- 166,172 ----
          if type(id2) != type(()):
              id2 = default_store_id, id2
!         return id1 == id2
! #        return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
! #               self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))

      def InIDs(self, id, ids):
***************
*** 251,260 ****
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)

!         if hasattr(self.mapi, "_oleobj_"): # Dispatch COM object
!             # CDO
!             tree = BuildFolderTreeCDO(self.mapi)
!         else:
!             # Extended MAPI.
!             tree = BuildFolderTreeMAPI(self.mapi)
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
--- 277,287 ----
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)

!         tree = BuildFolderTreeOutlook(self.mapi)
! #        if hasattr(self.mapi, "_oleobj_"): # Dispatch COM object
! #            # CDO
! #            tree = BuildFolderTreeCDO(self.mapi)
! #        else:
! #            # Extended MAPI.
! #            tree = BuildFolderTreeMAPI(self.mapi)
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
***************
*** 353,356 ****
      print d.GetSelectedIDs()

  if __name__=='__main__':
!     TestWithMAPI()
--- 380,391 ----
      print d.GetSelectedIDs()

+ def TestWithOutlook():
+     from win32com.client import Dispatch
+     outlook = Dispatch("Outlook.Application")
+     d=FolderSelector(outlook.Session, None, single_select = False)
+     d.DoModal()
+     print d.GetSelectedIDs()
+
+
  if __name__=='__main__':
!     TestWithOutlook()

Index: TrainingDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/TrainingDialog.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** TrainingDialog.py 1 Nov 2002 02:03:52 -0000 1.7
--- TrainingDialog.py 2 Nov 2002 12:28:38 -0000 1.8
***************
*** 105,109 ****
              sub_attr = "ham_include_sub"
          include_sub = getattr(self.config, sub_attr)
!         d = FolderSelector.FolderSelector(self.mgr.message_store.session, l, checkbox_state=include_sub)
          if d.DoModal()==win32con.IDOK:
              l[:], include_sub = d.GetSelectedIDs()[:]
--- 105,110 ----
              sub_attr = "ham_include_sub"
          include_sub = getattr(self.config, sub_attr)
! #        d = FolderSelector.FolderSelector(self.mgr.message_store.session, l, checkbox_state=include_sub)
!         d = FolderSelector.FolderSelector(self.mgr.outlook.Session, l, checkbox_state=include_sub)
          if d.DoModal()==win32con.IDOK:
              l[:], include_sub = d.GetSelectedIDs()[:]

From tim_one@users.sourceforge.net Sat Nov 2 17:11:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 02 Nov 2002 09:11:50 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FolderSelector.py,1.7,1.8
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv19232/Outlook2000/dialogs

Modified Files:
        FolderSelector.py
Log Message:
Folded long lines so I could read it better.

We've got a regression here: the folder selectors in the Training and
Define Filters dialogs still work, but in the Filter Now dialog clicking
Browse dies with

Traceback (most recent call last):
  File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py", line 313, in OnInitDialog
    tree = BuildFolderTreeOutlook(self.mapi)
  File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py", line 119, in BuildFolderTreeOutlook
    root.children = _BuildFolderTreeOutlook(session, session)
  File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py", line 108, in _BuildFolderTreeOutlook
    for i in range(parent.Folders.Count):
AttributeError: Folders
win32ui: OnInitDialog() virtual handler (>) raised an exception

Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** FolderSelector.py 2 Nov 2002 12:28:38 -0000 1.7
--- FolderSelector.py 2 Nov 2002 17:11:47 -0000 1.8
***************
*** 26,34 ****
  # wont even start to mention.
  # So we moved to an Extended MAPI version with is nice and fast - screams
! # along! Except it doesn't work in all cases with Exchange (which
  # strikes Mark as extremely strange given that the Extended MAPI Python
  # bindings were developed against an Exchange Server - but Mark doesn't
  # have an Exchange server handy these days, and really doesn't give a
! # rat's arse
  # So finally we have an Outlook object model version!
  #########################################################################
--- 26,34 ----
  # wont even start to mention.
  # So we moved to an Extended MAPI version with is nice and fast - screams
! # along! Except it doesn't work in all cases with Exchange (which
  # strikes Mark as extremely strange given that the Extended MAPI Python
  # bindings were developed against an Exchange Server - but Mark doesn't
  # have an Exchange server handy these days, and really doesn't give a
! # rat's arse ).
  # So finally we have an Outlook object model version!
  #########################################################################
***************
*** 69,73 ****
      table = folder.GetHierarchyTable(0)
      children = []
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
      for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
          folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
--- 69,75 ----
      table = folder.GetHierarchyTable(0)
      children = []
!     rows = mapi.HrQueryAllRows(table, (PR_ENTRYID,
!                                        PR_STORE_ENTRYID,
!                                        PR_DISPLAY_NAME_A), None, None, 0)
      for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
          folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
***************
*** 90,95 ****
              default_store_id = hex_eid

!         msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
!         hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
          subtree_eid = data[0][1]
          folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
--- 92,98 ----
              default_store_id = hex_eid

!         msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
!                                         mapi.MAPI_DEFERRED_ERRORS)
!         hr, data = msgstore.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
          subtree_eid = data[0][1]
          folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
***************
*** 103,111 ****
  def _BuildFolderTreeOutlook(session, parent):
      children = []
!     for i in range (parent.Folders.Count):
!         folder = parent.Folders [i+1]
!         spec = FolderSpec ((folder.StoreID, folder.EntryID), folder.Name.encode("mbcs", "replace"))
!         if folder.Folders != None:
!             spec.children = _BuildFolderTreeOutlook (session, folder)
          children.append(spec)
      return children
--- 106,115 ----
  def _BuildFolderTreeOutlook(session, parent):
      children = []
!     for i in range(parent.Folders.Count):
!         folder = parent.Folders[i+1]
!         spec = FolderSpec((folder.StoreID, folder.EntryID),
!                           folder.Name.encode("mbcs", "replace"))
!         if folder.Folders:
!             spec.children = _BuildFolderTreeOutlook(session, folder)
          children.append(spec)
      return children
***************
*** 128,136 ****

  class FolderSelector(dialog.Dialog):
!     style = win32con.DS_MODALFRAME | win32con.WS_POPUP | win32con.WS_VISIBLE | win32con.WS_CAPTION | win32con.WS_SYSMENU | win32con.DS_SETFONT
      cs = win32con.WS_CHILD | win32con.WS_VISIBLE
!     treestyle = cs | win32con.WS_BORDER | commctrl.TVS_HASLINES | commctrl.TVS_LINESATROOT | \
!                 commctrl.TVS_CHECKBOXES | commctrl.TVS_HASBUTTONS | \
!                 commctrl.TVS_DISABLEDRAGDROP | commctrl.TVS_SHOWSELALWAYS
      dt = [
          # Dialog itself.
--- 132,150 ----

  class FolderSelector(dialog.Dialog):
!     style = (win32con.DS_MODALFRAME |
!              win32con.WS_POPUP |
!              win32con.WS_VISIBLE |
!              win32con.WS_CAPTION |
!              win32con.WS_SYSMENU |
!              win32con.DS_SETFONT)
      cs = win32con.WS_CHILD | win32con.WS_VISIBLE
!     treestyle = (cs |
!                  win32con.WS_BORDER |
!                  commctrl.TVS_HASLINES |
!                  commctrl.TVS_LINESATROOT |
!                  commctrl.TVS_CHECKBOXES |
!                  commctrl.TVS_HASBUTTONS |
!                  commctrl.TVS_DISABLEDRAGDROP |
!                  commctrl.TVS_SHOWSELALWAYS)
      dt = [
          # Dialog itself.
***************
*** 147,151 ****
      ]

!     def __init__ (self, mapi, selected_ids = None, single_select = False, checkbox_state = False, checkbox_text = None, desc_noun = "Select", desc_noun_suffix = "ed"):
          assert not single_select or selected_ids is None or len(selected_ids)<=1
          dialog.Dialog.__init__ (self, self.dt)
--- 161,170 ----
      ]

!     def __init__ (self, mapi, selected_ids=None,
!                               single_select=False,
!                               checkbox_state=False,
!                               checkbox_text=None,
!                               desc_noun="Select",
!                               desc_noun_suffix="ed"):
          assert not single_select or selected_ids is None or len(selected_ids)<=1
          dialog.Dialog.__init__ (self, self.dt)
***************
*** 194,198 ****
                  mask = state = 0
              else:
!                 if self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
                      state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
                      num_children_selected += 1
--- 213,218 ----
                  mask = state = 0
              else:
!                 if (self.selected_ids and
!                         self.InIDs(child.folder_id, self.selected_ids)):
                      state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
                      num_children_selected += 1
***************
*** 201,206 ****
                  mask = commctrl.TVIS_STATEIMAGEMASK
              item_id = self._MakeItemParam(child)
!             hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
!             if self.single_select and self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
                  self.list.SelectItem(hitem)

--- 221,236 ----
                  mask = commctrl.TVIS_STATEIMAGEMASK
              item_id = self._MakeItemParam(child)
!             hitem = self.list.InsertItem(hParent, 0,
!                                          (None,
!                                           state,
!                                           mask,
!                                           text,
!                                           bitmapCol,
!                                           bitmapSel,
!                                           cItems,
!                                           item_id))
!             if (self.single_select and
!                     self.selected_ids and
!                     self.InIDs(child.folder_id, self.selected_ids)):
                  self.list.SelectItem(hitem)

***************
*** 232,236 ****
      def _YieldCheckedChildren(self):
          if self.single_select:
!             # If single-select, the checked state is not used, just the selected state.
              try:
                  h = self.list.GetSelectedItem()
--- 262,267 ----
      def _YieldCheckedChildren(self):
          if self.single_select:
!             # If single-select, the checked state is not used, just the
!             # selected state.
              try:
                  h = self.list.GetSelectedItem()
***************
*** 271,277 ****
          if self.single_select:
              # Remove the checkbox style from the list for single-selection
!             style = win32api.GetWindowLong(self.list.GetSafeHwnd(), win32con.GWL_STYLE)
              style = style & ~commctrl.TVS_CHECKBOXES
!             win32api.SetWindowLong(self.list.GetSafeHwnd(), win32con.GWL_STYLE, style)
              # Hide "clear all"
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
--- 302,311 ----
          if self.single_select:
              # Remove the checkbox style from the list for single-selection
!             style = win32api.GetWindowLong(self.list.GetSafeHwnd(),
!                                            win32con.GWL_STYLE)
              style = style & ~commctrl.TVS_CHECKBOXES
!             win32api.SetWindowLong(self.list.GetSafeHwnd(),
!                                    win32con.GWL_STYLE,
!                                    style)
              # Hide "clear all"
              self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
***************
*** 283,287 ****
  #        else:
  #            # Extended MAPI.
! #            tree = BuildFolderTreeMAPI(self.mapi)
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
--- 317,321 ----
  #        else:
  #            # Extended MAPI.
! #            tree = BuildFolderTreeMAPI(self.mapi)
          self._InsertSubFolders(0, tree)
          self.selected_ids = [] # wipe this out while we are alive.
***************
*** 311,315 ****
                  names.append(info[3])

!         status_string = "%s%s %d folder" % (self.select_desc_noun, self.select_desc_noun_suffix, num_checked)
          if num_checked != 1:
              status_string += "s"
--- 345,351 ----
                  names.append(info[3])

!         status_string = "%s%s %d folder" % (self.select_desc_noun,
!                                             self.select_desc_noun_suffix,
!                                             num_checked)
          if num_checked != 1:
              status_string += "s"

From tim_one@users.sourceforge.net Sat Nov 2 17:27:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 02 Nov 2002 09:27:49 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs FilterDialog.py,1.9,1.10
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv5390/Outlook2000/dialogs

Modified Files:
        FilterDialog.py
Log Message:
FilterNowDialog.OnButBrowse(): Repaired the way FolderSelector is called
so that this works again.

Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** FilterDialog.py 2 Nov 2002 12:28:38 -0000 1.9
--- FilterDialog.py 2 Nov 2002 17:27:44 -0000 1.10
***************
*** 333,338 ****
          import FolderSelector
          filter = self.mgr.config.filter_now
!         d = FolderSelector.FolderSelector(self.mgr.message_store.session, filter.folder_ids,checkbox_state=filter.include_sub)
!         if d.DoModal()==win32con.IDOK:
              filter.folder_ids, filter.include_sub = d.GetSelectedIDs()
              self.UpdateFolderNames()
--- 333,341 ----
          import FolderSelector
          filter = self.mgr.config.filter_now
! #        d = FolderSelector.FolderSelector(self.mgr.message_store.session, filter.folder_ids,checkbox_state=filter.include_sub)
!         d = FolderSelector.FolderSelector(self.mgr.outlook.Session,
!                                           filter.folder_ids,
!                                           checkbox_state=filter.include_sub)
!         if d.DoModal() == win32con.IDOK:
              filter.folder_ids, filter.include_sub = d.GetSelectedIDs()
              self.UpdateFolderNames()

From richiehindle@users.sourceforge.net Sat Nov 2 21:00:23 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Sat, 02 Nov 2002 13:00:23 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.8,1.9
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13701

Modified Files:
        pop3proxy.py
Log Message:
Can now listen on the port of your choice (thanks to Tim Stone).
Now supports the 'Unsure' value for X-Hammie-Disposition.
Now less anal about correcting for the size of the added header.

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** pop3proxy.py 1 Nov 2002 09:14:47 -0000 1.8
--- pop3proxy.py 2 Nov 2002 21:00:21 -0000 1.9
***************
*** 12,18 ****
          defaults to 110.

!     options (the same as hammie):
          -p FILE : use the named data file
          -d      : the file is a DBM file rather than a pickle

      pop3proxy -t
--- 12,19 ----
          defaults to 110.

!     options:
          -p FILE : use the named data file
          -d      : the file is a DBM file rather than a pickle
+         -l port : listen on this port number (default 110)

      pop3proxy -t
***************
*** 39,44 ****
  from Options import options

  HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Yes\r\n' % hammie.DISPHEADER


--- 40,47 ----
  from Options import options

+ # HEADER_EXAMPLE is the longest possible header - the length of this one
+ # is added to the size of each message.
  HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Unsure\r\n' % hammie.DISPHEADER


***************
*** 58,61 ****
--- 61,65 ----
          self.set_socket(s, socketMap)
          self.set_reuse_addr()
+         print "Listening on port %d." % port
          self.bind(('', port))
          self.listen(5)
***************
*** 337,350 ****
              ok, message = response.split('\n', 1)

!             # Now find the spam disposition and add the header. The
!             # trailing space in "No " ensures consistent lengths - this
!             # is required because POP3 commands like 'STAT' and 'LIST'
!             # need to be able to report the size of a message before
!             # it's been classified.
              prob = self.bayes.spamprob(tokenizer.tokenize(message))
!             if prob > options.spam_cutoff:
                  disposition = "Yes"
              else:
!                 disposition = "No "
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
--- 341,353 ----
              ok, message = response.split('\n', 1)

!             # Now find the spam disposition and add the header.
              prob = self.bayes.spamprob(tokenizer.tokenize(message))
!             if prob < options.ham_cutoff:
!                 disposition = "No"
!             elif prob > options.spam_cutoff:
                  disposition = "Yes"
              else:
!                 disposition = "Unsure"
!
              headers, body = re.split(r'\n\r?\n', response, 1)
              headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
***************
*** 577,581 ****
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdp:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
--- 580,584 ----
      # Read the arguments.
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htdp:l:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
***************
*** 583,586 ****
--- 586,590 ----

      pickleName = hammie.DEFAULTDB
+     proxyPort = 110
      useDB = False
      runTestServer = False
***************
*** 595,599 ****
          elif opt == '-p':
              pickleName = arg
!
      # Do whatever we've been asked to do...
      if not opts and not args:
--- 599,605 ----
          elif opt == '-p':
              pickleName = arg
!         elif opt == '-l':
!             proxyPort = int(arg)
!
      # Do whatever we've been asked to do...
      if not opts and not args:
***************
*** 609,617 ****
      elif len(args) == 1:
          # Named POP3 server, default port.
!         main(args[0], 110, 110, pickleName, useDB)

      elif len(args) == 2:
          # Named POP3 server, named port.
!         main(args[0], int(args[1]), 110, pickleName, useDB)

      else:
--- 615,623 ----
      elif len(args) == 1:
          # Named POP3 server, default port.
!         main(args[0], 110, proxyPort, pickleName, useDB)

      elif len(args) == 2:
          # Named POP3 server, named port.
!         main(args[0], int(args[1]), proxyPort, pickleName, useDB)

      else:

From mhammond@users.sourceforge.net Sun Nov 3 02:00:33 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 18:00:33 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.19,1.20
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv31898

Modified Files:
        msgstore.py
Log Message:
_GetFakeHeaders must end with \n

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** msgstore.py 2 Nov 2002 12:09:36 -0000 1.19
--- msgstore.py 3 Nov 2002 02:00:31 -0000 1.20
***************
*** 375,379 ****
          if to: headers.append("To: "+to)
          if cc: headers.append("CC: "+cc)
!         return "\n".join(headers)

      def _EnsureObject(self):
--- 375,379 ----
          if to: headers.append("To: "+to)
          if cc: headers.append("CC: "+cc)
!         return "\n".join(headers) + "\n"

      def _EnsureObject(self):

From hooft@users.sourceforge.net Sun Nov 3 13:48:49 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 03 Nov 2002 05:48:49 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.63,1.64 hammie.py,1.33,1.34
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16667

Modified Files:
        Options.py hammie.py
Log Message:
* Added options "header_spam_string", "header_unsure_string",
  "header_ham_string". Defaults are set to "Yes", "Unsure", "No".
* Added options header_score_digits and header_score_logarithm.
  The first is an integer telling hammie in how many digits it should show
  the score. If the second option is set to "True", scores of 1.00 or 0.00
  are augmented by a logarithmic "one-ness" or "zero-ness" score (basically
  it shows the "number of zeros" or "number of nines" next to the score
  value).
* Added support for a debugging header using the boolean
  hammie_debug_header option and the string hammie_debug_header_name
* Changed hammie.py to use all of the new options

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** Options.py 28 Oct 2002 20:19:46 -0000 1.63
--- Options.py 3 Nov 2002 13:48:47 -0000 1.64
***************
*** 286,302 ****
  [Hammie]
  # The name of the header that hammie adds to an E-mail in filter mode
  hammie_header_name: X-Hammie-Disposition

! # The default database path used by hammie
! persistent_storage_file: hammie.db

! # The range of clues that are added to the "hammie" header in the E-mail
  # All clues that have their probability smaller than this number, or larger
  # than one minus this number are added to the header such that you can see
  # why spambayes thinks this is ham/spam or why it is unsure. The default is
  # to show all clues, but you can reduce that by setting showclue to a lower
! # value, such as 0.1 (which Rob is using)
  clue_mailheader_cutoff: 0.5

  # hammie can use either a database (quick to score one message) or a pickle
  # (quick to train on huge amounts of messages). Set this to True to use a
--- 286,324 ----
  [Hammie]
  # The name of the header that hammie adds to an E-mail in filter mode
+ # It contains the "classification" of the mail, plus the score.
  hammie_header_name: X-Hammie-Disposition

! # The three disposition names are added to the header as the following
! # Three words:
! header_spam_string: Yes
! header_unsure_string: Unsure
! header_ham_string: No

! # Accuracy of the score in the header in decimal digits
! header_score_digits: 2
!
! # Set this to "True", to augment scores of 1.00 or 0.00 by a logarithmic
! # "one-ness" or "zero-ness" score (basically it shows the "number of zeros"
! # or "number of nines" next to the score value).
! header_score_logarithm: False
!
! # Enable debugging information in the header.
! hammie_debug_header: False
!
! # Name of a debugging header for spambayes hackers, showing the strongest
! # clues that have resulted in the classification in the standard header.
! hammie_debug_header_name: X-Hammie-Debug
!
! # The range of clues that are added to the "debug" header in the E-mail
  # All clues that have their probability smaller than this number, or larger
  # than one minus this number are added to the header such that you can see
  # why spambayes thinks this is ham/spam or why it is unsure. The default is
  # to show all clues, but you can reduce that by setting showclue to a lower
! # value, such as 0.1
  clue_mailheader_cutoff: 0.5

+ # The default database path used by hammie
+ persistent_storage_file: hammie.db
+
  # hammie can use either a database (quick to score one message) or a pickle
  # (quick to train on huge amounts of messages). Set this to True to use a
Set this to True to use a *************** *** 363,366 **** --- 385,395 ---- 'clue_mailheader_cutoff': float_cracker, 'persistent_use_database': boolean_cracker, + 'header_spam_string': string_cracker, + 'header_unsure_string': string_cracker, + 'header_ham_string': string_cracker, + 'header_score_digits': int_cracker, + 'header_score_logarithm': boolean_cracker, + 'hammie_debug_header': boolean_cracker, + 'hammie_debug_header_name': string_cracker, }, Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** hammie.py 27 Oct 2002 22:56:15 -0000 1.33 --- hammie.py 3 Nov 2002 13:48:47 -0000 1.34 *************** *** 57,60 **** --- 57,62 ---- # Name of the header to add in filter mode DISPHEADER = options.hammie_header_name + DEBUGHEADER = options.hammie_debug_header_name + DODEBUG = options.hammie_debug_header # Default database name *************** *** 242,246 **** def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD, ! ham_cutoff=HAM_THRESHOLD): """Score (judge) a message and add a disposition header. --- 244,249 ---- def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD, ! ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER, ! debug=DODEBUG): """Score (judge) a message and add a disposition header. *************** *** 248,253 **** Optionally, set header to the name of the header to add, and/or ! cutoff to the probability value which must be met or exceeded ! for a message to get a 'Yes' disposition. Returns the same message with a new disposition header. --- 251,261 ---- Optionally, set header to the name of the header to add, and/or ! spam_cutoff/ham_cutoff to the probability values which must be met ! or exceeded for a message to get a 'Spam' or 'Ham' classification. ! ! An extra debugging header can be added if 'debug' is set to True. ! 
The name of the debugging header is given as 'debugheader'. ! ! All defaults for optional parameters come from the Options file. Returns the same message with a new disposition header. *************** *** 261,272 **** prob, clues = self._scoremsg(msg, True) if prob < ham_cutoff: ! disp = "No" elif prob > spam_cutoff: ! disp = "Yes" else: ! disp = "Unsure" ! disp += "; %.2f" % prob ! disp += "; " + self.formatclues(clues) msg.add_header(header, disp) return msg.as_string(unixfrom=(msg.get_unixfrom() is not None)) --- 269,291 ---- prob, clues = self._scoremsg(msg, True) if prob < ham_cutoff: ! disp = options.header_ham_string elif prob > spam_cutoff: ! disp = options.header_spam_string else: ! disp = options.header_unknown_string ! disp += ("; %."+str(options.header_score_digits)+"f") % prob ! if options.header_score_logarithm: ! if prob<=0.005 and prob>0.0: ! import math ! x=-math.log10(prob) ! disp += " (%d)"%x ! if prob>=0.995 and prob<1.0: ! import math ! x=-math.log10(1.0-prob) ! disp += " (%d)"%x msg.add_header(header, disp) + if debug: + disp = self.formatclues(clues) + msg.add_header(debugheader, disp) return msg.as_string(unixfrom=(msg.get_unixfrom() is not None)) From hooft@users.sourceforge.net Sun Nov 3 14:24:38 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sun, 03 Nov 2002 06:24:38 -0800 Subject: [Spambayes-checkins] spambayes hammie.py,1.34,1.35 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25907 Modified Files: hammie.py Log Message: fix typo(?) Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** hammie.py 3 Nov 2002 13:48:47 -0000 1.34 --- hammie.py 3 Nov 2002 14:24:36 -0000 1.35 *************** *** 273,277 **** disp = options.header_spam_string else: ! 
disp = options.header_unknown_string disp += ("; %."+str(options.header_score_digits)+"f") % prob if options.header_score_logarithm: --- 273,277 ---- disp = options.header_spam_string else: ! disp = options.header_unsure_string disp += ("; %."+str(options.header_score_digits)+"f") % prob if options.header_score_logarithm: From mhammond@users.sourceforge.net Mon Nov 4 00:41:10 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 16:41:10 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.20,1.21 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv26387 Modified Files: msgstore.py Log Message: Allow an Outlook folder to be passed as a "folder id" (in the same way we did that for messages). Give __eq__ and __ne__ methods to compare folders. I'm pretty sure the MAPI semantics are correct, but not as confident on the new rich comparisons . Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** msgstore.py 3 Nov 2002 02:00:31 -0000 1.20 --- msgstore.py 4 Nov 2002 00:41:08 -0000 1.21 *************** *** 198,202 **** def GetFolder(self, folder_id): # Return a single folder given the ID. ! folder_id = self.NormalizeID(folder_id) folder = self._OpenEntry(folder_id) table = folder.GetContentsTable(0) --- 198,207 ---- def GetFolder(self, folder_id): # Return a single folder given the ID. ! if hasattr(folder_id, "EntryID"): ! # An Outlook object ! folder_id = mapi.BinFromHex(folder_id.StoreID), \ ! mapi.BinFromHex(folder_id.EntryID) ! else: ! 
folder_id = self.NormalizeID(folder_id) folder = self._OpenEntry(folder_id) table = folder.GetContentsTable(0) *************** *** 248,251 **** --- 253,265 ---- mapi.HexFromBin(self.id[1])) + def __eq__(self, other): + if other is None: return False + ceid = self.msgstore.session.CompareEntryIDs + return ceid(self.id[0], other.id[0]) and \ + ceid(self.id[1], other.id[1]) + + def __ne__(self, other): + return not self.__eq__(other) + def GetID(self): return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1]) *************** *** 298,301 **** --- 312,324 ---- mapi.HexFromBin(self.id[1])) + def __eq__(self, other): + if other is None: return False + ceid = self.msgstore.session.CompareEntryIDs + return ceid(self.id[0], other.id[0]) and \ + ceid(self.id[1], other.id[1]) + + def __ne__(self, other): + return not self.__eq__(other) + def GetID(self): return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1]) *************** *** 303,307 **** def GetOutlookItem(self): hex_item_id = mapi.HexFromBin(self.id[1]) ! store_hex_id = mapi.HexFromBin(self.id[0]) return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id) --- 326,330 ---- def GetOutlookItem(self): hex_item_id = mapi.HexFromBin(self.id[1]) ! hex_store_id = mapi.HexFromBin(self.id[0]) return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id) From mhammond@users.sourceforge.net Mon Nov 4 00:49:13 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 16:49:13 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox dump_props.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox In directory usw-pr-cvs1:/tmp/cvs-serv29119 Modified Files: dump_props.py Log Message: If the property type is PT_ERROR, show the best error code repr we can. 
Index: dump_props.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** dump_props.py 2 Nov 2002 11:27:53 -0000 1.4 --- dump_props.py 4 Nov 2002 00:49:11 -0000 1.5 *************** *** 66,75 **** # Also in new versions of mapituil ! def GetAllProperties(obj, make_tag_names = True): tags = obj.GetPropList(0) hr, data = obj.GetProps(tags) ret = [] for tag, val in data: ! if make_tag_names: hr, tags, array = obj.GetNamesFromIDs( (tag,) ) if type(array[0][1])==type(u''): --- 66,75 ---- # Also in new versions of mapituil ! def GetAllProperties(obj, make_pretty = True): tags = obj.GetPropList(0) hr, data = obj.GetProps(tags) ret = [] for tag, val in data: ! if make_pretty: hr, tags, array = obj.GetNamesFromIDs( (tag,) ) if type(array[0][1])==type(u''): *************** *** 77,80 **** --- 77,83 ---- else: name = mapiutil.GetPropTagName(tag) + # pretty value transformations + if PROP_TYPE(tag)==PT_ERROR: + val = mapiutil.GetScodeString(val) else: name = tag From mhammond@users.sourceforge.net Mon Nov 4 00:50:11 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 16:50:11 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.31,1.32 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv29458 Modified Files: manager.py Log Message: Wipe outlook reference as we die. 
Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** manager.py 1 Nov 2002 14:35:05 -0000 1.31 --- manager.py 4 Nov 2002 00:50:09 -0000 1.32 *************** *** 239,242 **** --- 239,243 ---- self.message_store.Close() self.message_store = None + self.outlook = None def score(self, msg, evidence=False, scale=True): From mhammond@users.sourceforge.net Mon Nov 4 00:50:26 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 16:50:26 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/images - New directory Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/images In directory usw-pr-cvs1:/tmp/cvs-serv29597/images Log Message: Directory /cvsroot/spambayes/spambayes/Outlook2000/images added to the repository From mhammond@users.sourceforge.net Mon Nov 4 00:51:18 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 16:51:18 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/images delete_as_spam.bmp,NONE,1.1 recover_ham.bmp,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/images In directory usw-pr-cvs1:/tmp/cvs-serv29827 Added Files: delete_as_spam.bmp recover_ham.bmp Log Message: Some button images :) --- NEW FILE: delete_as_spam.bmp --- (This appears to be a binary file; contents omitted.) --- NEW FILE: recover_ham.bmp --- (This appears to be a binary file; contents omitted.) 
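Note on the hammie.py 1.34 and 1.35 check-ins above: the disposition-header logic added there can be tried outside Spambayes. What follows is a minimal standalone sketch, not the code in CVS. The function name format_disposition, its keyword arguments, and the default cutoffs and strings are assumptions made for illustration; in hammie.py the corresponding values come from the options object (header_ham_string, header_spam_string, header_unsure_string, header_score_digits, header_score_logarithm) and from the spam/ham cutoffs passed to filter().

```python
import math


def format_disposition(prob, ham_cutoff=0.2, spam_cutoff=0.9,
                       ham_string="ham", spam_string="spam",
                       unsure_string="unsure", digits=2, logarithm=False):
    """Build a disposition header value for a score between 0 and 1.

    All default values here are hypothetical stand-ins for the
    settings that hammie.py reads from its Options file.
    """
    # Pick the classification word from the two cutoffs.
    if prob < ham_cutoff:
        disp = ham_string
    elif prob > spam_cutoff:
        disp = spam_string
    else:
        disp = unsure_string

    # Append the score, rounded to the requested number of digits.
    disp += "; %.*f" % (digits, prob)

    # For extreme scores, optionally append the order of magnitude of
    # the distance from 0 or 1, which the rounded score cannot show.
    if logarithm:
        if 0.0 < prob <= 0.005:
            disp += " (%d)" % -math.log10(prob)
        elif 0.995 <= prob < 1.0:
            disp += " (%d)" % -math.log10(1.0 - prob)
    return disp


if __name__ == "__main__":
    for score in (0.0005, 0.5, 0.9995):
        print(format_disposition(score, logarithm=True))
```

With the logarithm option on, a score of 0.0005 comes out as "ham; 0.00 (3)": the two-digit score rounds to zero, and the parenthesised number records roughly how many decimal places away from zero it really is.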
From mhammond@users.sourceforge.net Mon Nov 4 00:52:12 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 16:52:12 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.24,1.25 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv29880 Modified Files: addin.py Log Message: New "Delete As Spam" button, complete with button image, and the button changes appearance and behaviour when one of the spam folders is selected. Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** addin.py 1 Nov 2002 23:54:03 -0000 1.24 --- addin.py 4 Nov 2002 00:52:10 -0000 1.25 *************** *** 1,5 **** # SpamBayes Outlook Addin ! import sys import warnings --- 1,5 ---- # SpamBayes Outlook Addin ! import sys, os import warnings *************** *** 16,19 **** --- 16,21 ---- import win32ui + import win32gui, win32con, win32clipboard # for button images! + # If we are not running in a console, redirect all print statements to the # win32traceutil collector. *************** *** 28,38 **** ! # A lovely big block that attempts to catch the most common errors - COM objects not installed. try: ! # Support for COM objects we use. gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0, bForDemand=True) # Outlook 9 gencache.EnsureModule('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1, bForDemand=True) # Office 9 ! # The TLB defiining the interfaces we implement universal.RegisterInterfaces('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0, ["_IDTExtensibility2"]) except pythoncom.com_error, (hr, msg, exc, arg): --- 30,40 ---- ! # Attempt to catch the most common errors - COM objects not installed. try: ! 
# Generate support so we get complete support including events gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0, bForDemand=True) # Outlook 9 gencache.EnsureModule('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1, bForDemand=True) # Office 9 ! # Register what vtable based interfaces we need to implement. universal.RegisterInterfaces('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0, ["_IDTExtensibility2"]) except pythoncom.com_error, (hr, msg, exc, arg): *************** *** 46,76 **** if exc: print "Exception: %s" % (exc) ! print "Sorry, I can't be more help, but I can't continue while I have this error." sys.exit(1) ! # Something that should be in win32com in some form or another. def CastToClone(ob, target): """'Cast' a COM object to another type""" - # todo - should support target being an IID if hasattr(target, "index"): # string like # for now, we assume makepy for this to work. if not ob.__class__.__dict__.has_key("CLSID"): - # Eeek - no makepy support - try and build it. ob = gencache.EnsureDispatch(ob) if not ob.__class__.__dict__.has_key("CLSID"): raise ValueError, "Must be a makepy-able object for this to work" clsid = ob.CLSID - # Lots of hoops to support "demand-build" - ie, generating - # code for an interface first time it is used. We assume the - # interface name exists in the same library as the object. - # This is generally the case - only referenced typelibs may be - # a problem, and we can handle that later. Maybe - # So get the generated module for the library itself, then - # find the interface CLSID there. mod = gencache.GetModuleForCLSID(clsid) - # Get the 'root' module. mod = gencache.GetModuleForTypelib(mod.CLSID, mod.LCID, mod.MajorVersion, mod.MinorVersion) - # Find the CLSID of the target # XXX - should not be looking in VTables..., but no general map currently exists # (Fixed in win32all!) --- 48,69 ---- if exc: print "Exception: %s" % (exc) ! 
print "Sorry I can't be more help, but I can't continue while I have this error." sys.exit(1) ! # A couple of functions that are in new win32all, but we dont want to ! # force people to ugrade if we can avoid it. ! # NOTE: Most docstrings and comments removed - see the win32all version def CastToClone(ob, target): """'Cast' a COM object to another type""" if hasattr(target, "index"): # string like # for now, we assume makepy for this to work. if not ob.__class__.__dict__.has_key("CLSID"): ob = gencache.EnsureDispatch(ob) if not ob.__class__.__dict__.has_key("CLSID"): raise ValueError, "Must be a makepy-able object for this to work" clsid = ob.CLSID mod = gencache.GetModuleForCLSID(clsid) mod = gencache.GetModuleForTypelib(mod.CLSID, mod.LCID, mod.MajorVersion, mod.MinorVersion) # XXX - should not be looking in VTables..., but no general map currently exists # (Fixed in win32all!) *************** *** 81,85 **** mod = gencache.GetModuleForCLSID(target_clsid) target_class = getattr(mod, target) - # resolve coclass to interface target_class = getattr(target_class, "default_interface", target_class) return target_class(ob) # auto QI magic happens --- 74,77 ---- *************** *** 90,93 **** --- 82,118 ---- CastTo = CastToClone + # Something else in later win32alls - like "DispatchWithEvents", but the + # returned object is not both the Dispatch *and* the event handler + def WithEventsClone(clsid, user_event_class): + clsid = getattr(clsid, "_oleobj_", clsid) + disp = Dispatch(clsid) + if not disp.__dict__.get("CLSID"): # Eeek - no makepy support - try and build it. 
+ try: + ti = disp._oleobj_.GetTypeInfo() + disp_clsid = ti.GetTypeAttr()[0] + tlb, index = ti.GetContainingTypeLib() + tla = tlb.GetLibAttr() + mod = gencache.EnsureModule(tla[0], tla[1], tla[3], tla[4]) + disp_class = gencache.GetClassForProgID(str(disp_clsid)) + except pythoncom.com_error: + raise TypeError, "This COM object can not automate the makepy process - please run makepy manually for this object" + else: + disp_class = disp.__class__ + clsid = disp_class.CLSID + import new + events_class = getevents(clsid) + if events_class is None: + raise ValueError, "This COM object does not support events." + result_class = new.classobj("COMEventClass", (events_class, user_event_class), {}) + instance = result_class(disp) # This only calls the first base class __init__. + if hasattr(user_event_class, "__init__"): + user_event_class.__init__(instance) + return instance + + try: + from win32com.client import WithEvents + except ImportError: # appears in 151 and later. + WithEvents = WithEventsClone + # Whew - we seem to have all the COM support we need - let's rock! *************** *** 97,101 **** self.handler = handler self.args = args ! def OnClick(self, button, cancel): self.handler(*self.args) --- 122,127 ---- self.handler = handler self.args = args ! def Close(self): ! self.handler = self.args = None def OnClick(self, button, cancel): self.handler(*self.args) *************** *** 107,110 **** --- 133,138 ---- self.manager = manager self.target = target + def Close(self): + self.application = self.manager = self.target = None class FolderItemsEvent(_BaseItemsEvent): *************** *** 172,195 **** assert train.been_trained_as_spam(msgstore_message, self.manager) def ShowClues(mgr, app): from cgi import escape ! sel = app.ActiveExplorer().Selection ! if sel.Count == 0: ! win32ui.MessageBox("No items are selected", "No selection") ! return ! if sel.Count > 1: ! win32ui.MessageBox("Please select a single item", "Large selection") ! return ! ! item = sel.Item(1) ! 
if item.Class != constants.olMail: ! win32ui.MessageBox("This function can only be performed on mail items", ! "Not a mail message") return ! ! msgstore_message = mgr.message_store.GetMessage(item) score, clues = mgr.score(msgstore_message, evidence=True, scale=False) new_msg = app.CreateItem(0) body = ["

    Spam Score: %g
    " % score] push = body.append --- 200,217 ---- assert train.been_trained_as_spam(msgstore_message, self.manager) + # Event function fired from the "Show Clues" UI items. def ShowClues(mgr, app): from cgi import escape ! msgstore_message = mgr.addin.GetSelectedMessages(False) ! if msgstore_message is None: return ! item = msgstore_message.GetOutlookItem() score, clues = mgr.score(msgstore_message, evidence=True, scale=False) new_msg = app.CreateItem(0) + # NOTE: Silly Outlook always switches the message editor back to RTF + # once the Body property has been set. Thus, there is no reasonable + # way to get this as text only. Next best then is to use HTML, 'cos at + # least we know how to exploit it! body = ["

    Spam Score: %g
    " % score] push = body.append *************** *** 210,215 **** new_msg.Subject = "Spam Clues: " + item.Subject ! # Stupid outlook always switches to RTF :( Work-around ! ## new_msg.Body = body new_msg.HTMLBody = "" + body + "" # Attach the source message to it --- 232,236 ---- new_msg.Subject = "Spam Clues: " + item.Subject ! # As above, use HTMLBody else Outlook refuses to behave. new_msg.HTMLBody = "" + body + "" # Attach the source message to it *************** *** 218,221 **** --- 239,359 ---- new_msg.Display() + # The "Delete As Spam" and "Recover Spam" button + # The event from Outlook's explorer that our folder has changed. + class ButtonDeleteAsExplorerEvent: + def Init(self, but): + self.but = but + def Close(self): + self.but = None + def OnFolderSwitch(self): + self.but._UpdateForFolderChange() + + class ButtonDeleteAsEvent: + def Init(self, manager, application, explorer): + # NOTE - keeping a reference to 'explorer' in this event + # appears to cause an Outlook circular reference, and outlook + # never terminates (it does close, but the process remains alive) + # This is why we needed to use WithEvents, so the event class + # itself doesnt keep such a reference (and we need to keep a ref + # to the event class so it doesn't auto-disconnect!) 
+ self.manager = manager + self.application = application + self.explorer_events = WithEvents(explorer, + ButtonDeleteAsExplorerEvent) + self.set_for_as_spam = None + self.explorer_events.Init(self) + self._UpdateForFolderChange() + + def Close(self): + self.manager = self.application = self.explorer = None + + def _UpdateForFolderChange(self): + explorer = self.application.ActiveExplorer() + if explorer is None: + print "** Folder Change, but don't have an explorer" + return + outlook_folder = explorer.CurrentFolder + is_spam = False + if outlook_folder is not None: + mapi_folder = self.manager.message_store.GetFolder(outlook_folder) + look_id = self.manager.config.filter.spam_folder_id + if look_id: + look_folder = self.manager.message_store.GetFolder(look_id) + if mapi_folder == look_folder: + is_spam = True + if not is_spam: + look_id = self.manager.config.filter.unsure_folder_id + if look_id: + look_folder = self.manager.message_store.GetFolder(look_id) + if mapi_folder == look_folder: + is_spam = True + if is_spam: + set_for_as_spam = False + else: + set_for_as_spam = True + if set_for_as_spam != self.set_for_as_spam: + if set_for_as_spam: + image = "delete_as_spam.bmp" + self.Caption = "Delete As Spam" + self.TooltipText = \ + "Move the selected message to the Spam folder,\n" \ + "and train the system that this is Spam." + else: + image = "recover_ham.bmp" + self.Caption = "Recover from Spam" + self.TooltipText = \ + "Recovers the selected item back to the folder\n" \ + "it was filtered from (or to the Inbox if this\n" \ + "folder is not known), and trains the system that\n" \ + "this is a good message\n" + # Set the image. 
+ print "Setting image to", image + SetButtonImage(self, image) + self.set_for_as_spam = set_for_as_spam + + def OnClick(self, button, cancel): + msgstore = self.manager.message_store + msgstore_messages = self.manager.addin.GetSelectedMessages(True) + if not msgstore_messages: + return + if self.set_for_as_spam: + # Delete this item as spam. + spam_folder_id = self.manager.config.filter.spam_folder_id + spam_folder = msgstore.GetFolder(spam_folder_id) + if not spam_folder: + win32ui.MessageBox("You must configure the Spam folder", + "Invalid Configuration") + return + import train + for msgstore_message in msgstore_messages: + # Must train before moving, else we lose the message! + print "Training on message - ", + if train.train_message(msgstore_message, True, self.manager): + print "trained as spam" + else: + print "already was trained as spam" + # Now move it. + msgstore_message.MoveTo(spam_folder) + else: + win32ui.MessageBox("Please be patient ") + + # Helpers to work with images on buttons/toolbars. + def SetButtonImage(button, fname): + # whew - http://support.microsoft.com/default.aspx?scid=KB;EN-US;q288771 + # shows how to make a transparent bmp. + # Also note that the clipboard takes ownership of the handle - + # this, we can not simply perform this load once and reuse the image. + if not os.path.isabs(fname): + fname = os.path.join( os.path.dirname(__file__), "images", fname) + if not os.path.isfile(fname): + print "WARNING - Trying to use image '%s', but it doesn't exist" % (fname,) + return None + handle = win32gui.LoadImage(0, fname, win32con.IMAGE_BITMAP, 0, 0, win32con.LR_DEFAULTSIZE | win32con.LR_LOADFROMFILE) + win32clipboard.OpenClipboard() + win32clipboard.SetClipboardData(win32con.CF_BITMAP, handle) + win32clipboard.CloseClipboard() + button.Style = constants.msoButtonIconAndCaption + button.PasteFace() + # The outlook Plugin COM object itself. 
class OutlookAddin: *************** *** 247,250 **** --- 385,396 ---- bars = activeExplorer.CommandBars toolbar = bars.Item("Standard") + # Add our "Delete as ..." button + button = toolbar.Controls.Add(Type=constants.msoControlButton, Temporary=True) + # Hook events for the item + button.BeginGroup = True + button = DispatchWithEvents(button, ButtonDeleteAsEvent) + button.Init(self.manager, application, activeExplorer) + self.buttons.append(button) + # Add a pop-up menu to the toolbar popup = toolbar.Controls.Add(Type=constants.msoControlPopup, Temporary=True) *************** *** 323,326 **** --- 469,494 ---- return new_hooks + def GetSelectedMessages(self, allow_multi = True, explorer = None): + if explorer is None: + explorer = self.application.ActiveExplorer() + sel = explorer.Selection + if sel.Count > 1 and not allow_multi: + win32ui.MessageBox("Please select a single item", "Large selection") + return None + + ret = [] + for i in range(sel.Count): + item = sel.Item(i+1) + if item.Class == constants.olMail: + msgstore_message = self.manager.message_store.GetMessage(item) + ret.append(msgstore_message) + + if len(ret) == 0: + win32ui.MessageBox("No mail items are selected", "No selection") + return None + if allow_multi: + return ret + return ret[0] + def OnDisconnection(self, mode, custom): print "SpamAddin - Disconnecting from Outlook" *************** *** 331,336 **** self.manager.Close() self.manager = None ! self.buttons = None ! print "Addin terminating: %d COM client and %d COM servers exist." \ % (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount()) --- 499,506 ---- self.manager.Close() self.manager = None ! if self.buttons: ! for button in self.buttons: ! button.Close() ! self.buttons = None print "Addin terminating: %d COM client and %d COM servers exist." 
\ % (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount()) From mhammond@users.sourceforge.net Mon Nov 4 01:12:56 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Sun, 03 Nov 2002 17:12:56 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv2046 Modified Files: train.py Log Message: Fix the root of my: File "F:\src\spambayes\classifier.py", line 450, in _getclues distance = abs(prob - 0.5) Exception - problem is that we trained, but didn't update probabilities - thus, we failed for every new word seen only since the last complete retrain. There may be a case for _getclues() to detect a probability of None and call update_probabilities() automatically - either that or just keep throwing vague exceptions Index: train.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** train.py 31 Oct 2002 22:03:35 -0000 1.12 --- train.py 4 Nov 2002 01:12:53 -0000 1.13 *************** *** 19,23 **** return spam == True ! def train_message(msg, is_spam, mgr): # Train an individual message. # Returns True if newly added (message will be correctly --- 19,23 ---- return spam == True ! def train_message(msg, is_spam, mgr, update_probs = True): # Train an individual message. # Returns True if newly added (message will be correctly *************** *** 41,44 **** --- 41,47 ---- mgr.bayes.learn(tokens, is_spam, False) mgr.message_db[msg.searchkey] = is_spam + if update_probs: + mgr.bayes.update_probabilities() + mgr.bayes_dirty = True return True *************** *** 51,55 **** progress.tick() try: ! if train_message(message, isspam, mgr): num_added += 1 except: --- 54,58 ---- progress.tick() try: ! 
if train_message(message, isspam, mgr, False): num_added += 1 except: From jhylton@users.sourceforge.net Mon Nov 4 04:36:01 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sun, 03 Nov 2002 20:36:01 -0800 Subject: [Spambayes-checkins] spambayes/pspam - New directory Message-ID: Update of /cvsroot/spambayes/spambayes/pspam In directory usw-pr-cvs1:/tmp/cvs-serv19246/pspam Log Message: Directory /cvsroot/spambayes/spambayes/pspam added to the repository From jhylton@users.sourceforge.net Mon Nov 4 04:42:44 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sun, 03 Nov 2002 20:42:44 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam - New directory Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv21182/pspam/pspam Log Message: Directory /cvsroot/spambayes/spambayes/pspam/pspam added to the repository From jhylton@users.sourceforge.net Mon Nov 4 04:44:22 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sun, 03 Nov 2002 20:44:22 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam __init__.py,NONE,1.1 database.py,NONE,1.1 folder.py,NONE,1.1 message.py,NONE,1.1 options.py,NONE,1.1 profile.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv21558/pspam/pspam Added Files: __init__.py database.py folder.py message.py options.py profile.py Log Message: Initial checkin of pspam code. --- NEW FILE: __init__.py --- """Package for interacting with VM folders. Design notes go here. Use ZODB to store training data and classifier. The spam and ham data are culled from sets of folders. The actual tokenized messages are stored in a training database. When the folder changes, the training data is updated. - Updates are incremental. - Changes to a folder are detected based on mtime and folder size. - The contents of the folder are keyed on message-id. 
- If a message is removed from a folder, it is removed from training data. """ --- NEW FILE: database.py --- from pspam.options import options import ZODB from ZEO.ClientStorage import ClientStorage import zLOG import os def logging(): os.environ["STUPID_LOG_FILE"] = options.event_log_file os.environ["STUPID_LOG_SEVERITY"] = str(options.event_log_severity) zLOG.initialize() def open(): cs = ClientStorage(options.zeo_addr) db = ZODB.DB(cs, cache_size=options.cache_size) return db --- NEW FILE: folder.py --- import ZODB from Persistence import Persistent from BTrees.OOBTree import OOBTree, OOSet, difference import email import mailbox import os import stat from pspam.message import PMessage def factory(fp): try: return email.message_from_file(fp, PMessage) except email.Errors.MessageError, msg: print msg return PMessage() class Folder(Persistent): def __init__(self, path): self.path = path self.mtime = 0 self.size = 0 self.messages = OOBTree() def _stat(self): t = os.stat(self.path) self.mtime = t[stat.ST_MTIME] self.size = t[stat.ST_SIZE] def changed(self): t = os.stat(self.path) if (t[stat.ST_MTIME] != self.mtime or t[stat.ST_SIZE] != self.size): return True else: return False def read(self): """Return messages added and removed from folder. Two sets of message objects are returned. The first set is messages that were added to the folder since the last read. The second set is the messages that were removed from the folder since the last read. The code assumes messages are added and removed but not edited. """ mbox = mailbox.UnixMailbox(open(self.path, "rb"), factory) self._stat() cur = OOSet() new = OOSet() while 1: msg = mbox.next() if msg is None: break msgid = msg["message-id"] cur.insert(msgid) if not self.messages.has_key(msgid): self.messages[msgid] = msg new.insert(msg) removed = difference(self.messages, cur) for msgid in removed.keys(): del self.messages[msgid] # XXX perhaps just return the OOBTree for removed? 
return new, OOSet(removed.values()) if __name__ == "__main__": f = Folder("/home/jeremy/Mail/INBOX") --- NEW FILE: message.py --- import ZODB from Persistence import Persistent from email.Message import Message class PMessage(Message, Persistent): def __hash__(self): return id(self) --- NEW FILE: options.py --- from Options import options, all_options, \ boolean_cracker, float_cracker, int_cracker, string_cracker from sets import Set all_options["Score"] = {'max_ham': float_cracker, 'min_spam': float_cracker, } all_options["Train"] = {'folder_dir': string_cracker, 'spam_folders': ('get', lambda s: Set(s.split())), 'ham_folders': ('get', lambda s: Set(s.split())), } all_options["Proxy"] = {'server': string_cracker, 'server_port': int_cracker, 'proxy_port': int_cracker, 'log_pop_session': boolean_cracker, 'log_pop_session_file': string_cracker, } all_options["ZODB"] = {'zeo_addr': string_cracker, 'event_log_file': string_cracker, 'event_log_severity': int_cracker, 'cache_size': int_cracker, } import os options.mergefiles("vmspam.ini") def mergefile(p): options.mergefiles(p) --- NEW FILE: profile.py --- """Spam/ham profile for a single VM user.""" import ZODB from ZODB.PersistentList import PersistentList from Persistence import Persistent from BTrees.OOBTree import OOBTree import classifier from tokenizer import tokenize from pspam.folder import Folder import os def open_folders(dir, names, klass): L = [] for name in names: path = os.path.join(dir, name) L.append(klass(path)) return L import time _start = None def log(s): global _start if _start is None: _start = time.time() print round(time.time() - _start, 2), s class IterOOBTree(OOBTree): def iteritems(self): return self.items() class WordInfo(Persistent): def __init__(self, atime, spamprob=None): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 self.spamprob = spamprob def __repr__(self): return "WordInfo%r" % repr((self.atime, self.spamcount, self.hamcount, self.killcount, self.spamprob)) 
class PBayes(classifier.Bayes, Persistent): WordInfoClass = WordInfo def __init__(self): classifier.Bayes.__init__(self) self.wordinfo = IterOOBTree() # XXX what about the getstate and setstate defined in base class class Profile(Persistent): FolderClass = Folder def __init__(self, folder_dir): self._dir = folder_dir self.classifier = PBayes() self.hams = PersistentList() self.spams = PersistentList() def add_ham(self, folder): p = os.path.join(self._dir, folder) f = self.FolderClass(p) self.hams.append(f) def add_spam(self, folder): p = os.path.join(self._dir, folder) f = self.FolderClass(p) self.spams.append(f) def update(self): """Update classifier from current folder contents.""" changed1 = self._update(self.hams, False) changed2 = self._update(self.spams, True) if changed1 or changed2: self.classifier.update_probabilities() get_transaction().commit() log("updated probabilities") def _update(self, folders, is_spam): changed = False for f in folders: log("update from %s" % f.path) added, removed = f.read() if added: log("added %d" % len(added)) if removed: log("removed %d" % len(removed)) get_transaction().commit() if not (added or removed): continue changed = True # It's important not to commit a transaction until # after update_probabilities is called in update(). # Otherwise some new entries will cause scoring to fail. 
for msg in added.keys(): self.classifier.learn(tokenize(msg), is_spam, False) del added get_transaction().commit(1) log("learned") for msg in removed.keys(): self.classifier.unlearn(tokenize(msg), is_spam, False) if removed: log("unlearned") del removed get_transaction().commit(1) return changed From jhylton@users.sourceforge.net Mon Nov 4 04:44:22 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sun, 03 Nov 2002 20:44:22 -0800 Subject: [Spambayes-checkins] spambayes/pspam README.txt,NONE,1.1 pop.py,NONE,1.1vmspam.ini,NONE,1.1zeo.sh,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam In directory usw-pr-cvs1:/tmp/cvs-serv21558/pspam Added Files: README.txt pop.py scoremsg.py update.py vmspam.ini zeo.sh Log Message: Initial checkin of pspam code. --- NEW FILE: README.txt --- pspam: persistent spambayes filtering system -------------------------------------------- pspam uses a POP proxy to score incoming messages, a set of VM folders to manage training data, and a ZODB database to manage data used by the various applications. The current code only works with a patched version of classifier.py. Remove the object base class & change the class used to create new WordInfo objects. This directory contains: pspam -- a Python package pop.py -- a POP proxy based on SocketServer scoremsg.py -- prints the evidence for a single message read from stdin update.py -- a script to update training data from folders vmspam.ini -- a sample configuration file zeo.sh -- a script to start a ZEO server The code depends on ZODB3, which you can download from http://www.zope.org/Products/StandaloneZODB. --- NEW FILE: pop.py --- """Spam-filtering proxy for a POP3 server. The implementation uses the SocketServer module to run a multi-threaded POP3 proxy. It adds an X-Spambayes header with a spam probability. It scores a message using a persistent spambayes classifier loaded from a ZEO server. The strategy for adding spam headers is from Richie Hindler's pop3proxy.py. 
The STAT, LIST, RETR, and TOP commands are intercepted to change the
number of bytes the client is told to expect and/or to insert the spam
header.

XXX A POP3 server sometimes adds the number of bytes in the +OK
response to some commands when the POP3 spec doesn't require it to.
In those cases, the proxy does not re-write the number of bytes. I
assume the clients won't be confused by this behavior, because they
shouldn't be expecting to see the number of bytes.

POP3 is documented in RFC 1939.
"""

import SocketServer
import asyncore
import cStringIO
import email
import re
import socket
import sys
import threading
import time

import ZODB
from ZEO.ClientStorage import ClientStorage
import zLOG

from tokenizer import tokenize
import pspam.database
from pspam.options import options

HEADER = "X-Spambayes: %5.3f\r\n"
HEADER_SIZE = len(HEADER % 0.0)

class POP3ProxyServer(SocketServer.ThreadingTCPServer):

    allow_reuse_address = True

    def __init__(self, addr, handler, classifier, real_server, log, zodb):
        SocketServer.ThreadingTCPServer.__init__(self, addr, handler)
        self.classifier = classifier
        self.pop_server = real_server
        self.log = log
        self.zodb = zodb

class LogWrapper:

    def __init__(self, log, file):
        self.log = log
        self.file = file

    def readline(self):
        line = self.file.readline()
        self.log.write(line)
        return line

    def write(self, buf):
        self.log.write(buf)
        return self.file.write(buf)

    def close(self):
        self.file.close()

class POP3RequestHandler(SocketServer.StreamRequestHandler):
    """Act as proxy between POP client and server."""

    def connect_pop(self):
        # connect to the pop server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(self.server.pop_server)
        self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb"))
        # the write side should be unbuffered
        self.pop_wfile = LogWrapper(self.server.log, s.makefile("wb", 0))

    def close_pop(self):
        self.pop_rfile.close()
        self.pop_wfile.close()

    def handle(self):
        zLOG.LOG("POP3", zLOG.INFO, "Connection from %s" %
                 repr(self.client_address))
        self.server.zodb.sync()
        self.sess_retr_count = 0
        self.connect_pop()
        try:
            self.handle_pop()
        finally:
            self.close_pop()
            if self.sess_retr_count == 1:
                ending = ""
            else:
                ending = "s"
            zLOG.LOG("POP3", zLOG.INFO,
                     "Ending session (%d message%s retrieved)"
                     % (self.sess_retr_count, ending))

    _multiline = {"RETR": True, "TOP": True,}
    _multiline_noargs = {"LIST": True, "UIDL": True,}

    def is_multiline(self, command, args):
        if command in self._multiline:
            return True
        if command in self._multiline_noargs and not args:
            return True
        return False

    def parse_request(self, req):
        parts = req.split()
        req = parts[0]
        args = tuple(parts[1:])
        return req, args

    def handle_pop(self):
        # send the initial server hello
        hello = self.pop_rfile.readline()
        self.wfile.write(hello)
        # now get client requests and return server responses
        while 1:
            line = self.rfile.readline()
            if line == '':
                break
            self.pop_wfile.write(line)
            if not self.handle_pop_response(line):
                break

    def handle_pop_response(self, req):
        # Return True if connection is still open
        cmd, args = self.parse_request(req)
        multiline = self.is_multiline(cmd, args)
        firstline = self.pop_rfile.readline()
        zLOG.LOG("POP3", zLOG.DEBUG, "command %s multiline %s resp %s"
                 % (cmd, multiline, firstline.strip()))
        if multiline:
            # Collect the entire response as one string
            resp = cStringIO.StringIO()
            while 1:
                line = self.pop_rfile.readline()
                resp.write(line)
                # The response is finished if we get . or an error.
                # XXX should handle byte-stuffed response
                if line == ".\r\n":
                    break
                if line.startswith("-ERR"):
                    break
            buf = resp.getvalue()
        else:
            buf = None

        handler = getattr(self, "handle_%s" % cmd, None)
        if handler:
            firstline, buf = handler(cmd, args, firstline, buf)

        self.wfile.write(firstline)
        if buf is not None:
            self.wfile.write(buf)
        if cmd == "QUIT":
            return False
        else:
            return True

    def handle_RETR(self, cmd, args, firstline, resp):
        if not resp:
            return firstline, resp
        try:
            msg = email.message_from_string(resp)
        except email.Errors.MessageParseError, err:
            zLOG.LOG("POP3", zLOG.WARNING,
                     "Failed to parse msg: %s" % err, error=sys.exc_info())
            resp = self.message_parse_error(resp)
        else:
            self.score_msg(msg)
            resp = msg.as_string()
        self.sess_retr_count += 1
        return firstline, resp

    def handle_TOP(self, cmd, args, firstline, resp):
        # XXX Just handle TOP like RETR?
        return self.handle_RETR(cmd, args, firstline, resp)

    rx_STAT = re.compile("\+OK (\d+) (\d+)(.*)", re.DOTALL)

    def handle_STAT(self, cmd, args, firstline, resp):
        # STAT returns the number of messages and the total size. The
        # proxy must add the size of new headers to the total size.
        # Example: +OK 3 340
        mo = self.rx_STAT.match(firstline)
        if mo is None:
            return firstline, resp
        count, size, extra = mo.group(1, 2, 3)
        count = int(count)
        size = int(size)
        size += count * HEADER_SIZE
        firstline = "+OK %d %d%s" % (count, size, extra)
        return firstline, resp

    rx_LIST = re.compile("\+OK (\d+) (\d+)(.*)", re.DOTALL)
    rx_LIST_2 = re.compile("(\d+) (\d+)(.*)", re.DOTALL)

    def handle_LIST(self, cmd, args, firstline, resp):
        # If there are no args, LIST returns size info for each message.
        # If there is an arg, LIST returns number and size for one message.
        mo = self.rx_LIST.match(firstline)
        if mo:
            # a single-line response
            n, size, extra = mo.group(1, 2, 3)
            size = int(size) + HEADER_SIZE
            firstline = "+OK %s %d%s" % (n, size, extra)
            return firstline, resp
        else:
            # possibly a multiline response
            if not firstline.startswith("+OK"):
                return firstline, resp

            # update each line of the response
            L = []
            for line in resp.split("\r\n"):
                if not line:
                    continue
                mo = self.rx_LIST_2.match(line)
                if not mo:
                    L.append(line)
                else:
                    n, size, extra = mo.group(1, 2, 3)
                    size = int(size) + HEADER_SIZE
                    L.append("%s %d%s" % (n, size, extra))
            return firstline, "\r\n".join(L)

    def message_parse_error(self, buf):
        # We get an error parsing the message. We've already told the
        # client to expect more bytes than this buffer contains, but
        # there's no clean way to add the header.
        self.server.log.write("# error: %s\n" % repr(buf))
        # XXX what to do? let's just add it after the first line
        score = self.server.classifier.spamprob(tokenize(buf))
        L = buf.split("\n")
        L.insert(1, HEADER % score)
        return "\n".join(L)

    def score_msg(self, msg):
        score = self.server.classifier.spamprob(tokenize(msg))
        msg.add_header("X-Spambayes", "%5.3f" % score)

def main():
    db = pspam.database.open()
    conn = db.open()
    r = conn.root()
    profile = r["profile"]

    log = open("/var/tmp/pop.log", "ab")
    print >> log, "+PROXY start", time.ctime()

    server = POP3ProxyServer(('', options.proxy_port),
                             POP3RequestHandler,
                             profile.classifier,
                             (options.server, options.server_port),
                             log,
                             conn,
                             )
    server.serve_forever()

if __name__ == "__main__":
    main()

--- NEW FILE: scoremsg.py ---
#! /usr/bin/env python
"""Score a message provided on stdin and show the evidence."""

import ZODB
from ZEO.ClientStorage import ClientStorage
from tokenizer import tokenize

import email
import sys

import pspam.options

def main(fp):
    cs = ClientStorage("/var/tmp/zeospam")
    db = ZODB.DB(cs)
    r = db.open().root()

    # make sure scoring uses the right set of options
    pspam.options.mergefile("/home/jeremy/src/vmspam/vmspam.ini")

    p = r["profile"]

    msg = email.message_from_file(fp)
    prob, evidence = p.classifier.spamprob(tokenize(msg), True)
    print "Score:", prob
    print
    print "Clues"
    print "-----"
    for clue, prob in evidence:
        print clue, prob
##    print
##    print msg

if __name__ == "__main__":
    main(sys.stdin)

--- NEW FILE: update.py ---
import getopt
import os
import sys

import ZODB
from ZEO.ClientStorage import ClientStorage

import pspam.database
from pspam.profile import Profile
from pspam.options import options

def folder_exists(L, p):
    """Return true if a folder with path p exists in list L."""
    for f in L:
        if f.path == p:
            return True
    return False

def main(rebuild=False):
    db = pspam.database.open()
    r = db.open().root()

    profile = r.get("profile")
    if profile is None or rebuild:
        # if there is no profile, create it
        profile = r["profile"] = Profile(options.folder_dir)
        get_transaction().commit()

    # check for new folders of training data
    for ham in options.ham_folders:
        p = os.path.join(options.folder_dir, ham)
        if not folder_exists(profile.hams, p):
            profile.add_ham(p)

    for spam in options.spam_folders:
        p = os.path.join(options.folder_dir, spam)
        if not folder_exists(profile.spams, p):
            profile.add_spam(p)
    get_transaction().commit()

    # read new messages from folders
    profile.update()
    get_transaction().commit()

    db.close()

if __name__ == "__main__":
    FORCE_REBUILD = False
    opts, args = getopt.getopt(sys.argv[1:], 'F')
    for k, v in opts:
        if k == '-F':
            FORCE_REBUILD = True

    main(FORCE_REBUILD)

--- NEW FILE: vmspam.ini ---
[Train]
folder_dir: /home/jeremy/Mail
spam_folders: train/spam
ham_folders: train/ham

[Score]
max_ham: 0.05
min_spam: 0.99

[Proxy]
server: mail.zope.com
server_port: 110
proxy_port: 1111
log_pop_session: true
log_pop_session_file: /var/tmp/pop.log

[ZODB]
zeo_addr: /var/tmp/zeospam
event_log_file: /var/tmp/zeospam.log
event_log_severity: 0
cache_size: 2000

--- NEW FILE: zeo.sh ---
#! /bin/bash
export STUPID_LOG_FILE=/var/tmp/zeospam.log
export LIBDIR=/usr/local/lib/python2.3/site-packages
python2.3 $LIBDIR/ZEO/start.py -U /var/tmp/zeospam /var/tmp/zeospam.fs

From tim.one@comcast.net Mon Nov 4 05:03:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 00:03:05 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
In-Reply-To:
Message-ID:

[Mark Hammond]
> Modified Files:
>         train.py
> Log Message:
> Fix the root of my:
>   File "F:\src\spambayes\classifier.py", line 450, in _getclues
>     distance = abs(prob - 0.5)
>
> Exception - problem is that we trained, but didn't update probabilities -
> thus, we failed for every new word seen only since the last complete
> retrain.

Mark, I've never seen this, and believed I fixed the only way it could have
happened last week -- WordInfo records start life with a genuine probability
(spamprob) now, instead of with a spamprob of None. It's possible, though,
that you had some leftover WordInfo record with None in your dict, and didn't
retrain from scratch after that fix. Or it's possible there's an entirely
different bug I still don't know about.

> There may be a case for _getclues() to detect a probability of None
> and call update_probabilities() automatically - either that or just
> keep throwing vague exceptions

Except it should never be possible for _getclues() to see None -- if that
was still happening for you, there's a deeper bug that still needs to be
fixed.

In other news, here's a shallow bug, upon starting Outlook now:

Traceback (most recent call last):
  File "C:\PYTHON22\lib\site-packages\win32com\universal.py", line 150, in dispatch
    retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None)
  File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_
    return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider)
  File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
  File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Code\spambayes\Outlook2000\addin.py", line 392, in OnConnection
    button.Init(self.manager, application, activeExplorer)
  File "C:\Code\spambayes\Outlook2000\addin.py", line 262, in Init
    ButtonDeleteAsExplorerEvent)
  File "C:\Code\spambayes\Outlook2000\addin.py", line 103, in WithEventsClone
    events_class = getevents(clsid)
exceptions.NameError: global name 'getevents' is not defined

It can't have worked for you, either. I fiddled my local copy to do

    from win32com.client import constants, getevents

near the top, and that appears to have fixed it. I'll check that in, but
please ensure that was the correct fix.

From tim_one@users.sourceforge.net Mon Nov 4 05:03:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 03 Nov 2002 21:03:49 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.25,1.26
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv25951/Outlook2000

Modified Files:
        addin.py
Log Message:
Fix what appeared to be a missing import of win32com.client.getevents.

Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** addin.py 4 Nov 2002 00:52:10 -0000 1.25
--- addin.py 4 Nov 2002 05:03:47 -0000 1.26
***************
*** 13,17 ****
  import win32api
  import pythoncom
! from win32com.client import constants
  import win32ui

--- 13,17 ----
  import win32api
  import pythoncom
! from win32com.client import constants, getevents
  import win32ui

From anthonybaxter@users.sourceforge.net Mon Nov 4 06:38:54 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 03 Nov 2002 22:38:54 -0800
Subject: [Spambayes-checkins] website developer.ht,1.3,1.4
Message-ID:

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16008

Modified Files:
        developer.ht
Log Message:
added a "what needs to be done" section.

Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** developer.ht 22 Sep 2002 07:48:03 -0000 1.3
--- developer.ht 4 Nov 2002 06:38:52 -0000 1.4
***************
*** 27,30 ****
--- 27,38 ----
  available as links from the documentation page.
+

    So what needs to be done

    + Currently (early November) work is now being focussed on finding
    + additional things that are beneficial to the tokenizer. The combining
    + scheme is now pretty solid and pretty amazing. The other big body of
    + work at the moment is producing something that's useful to end-users -
    + actually building the applications and the code so that Tim's sister
    + <wink> can use the system.

    +

    Collecting training data

    One of the tricky problems is collecting a set of data that's

From anthonybaxter@users.sourceforge.net Mon Nov 4 06:39:44 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 03 Nov 2002 22:39:44 -0800
Subject: [Spambayes-checkins] website background.ht,1.1,1.2
Message-ID:

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16178

Modified Files:
        background.ht
Log Message:
A bit of a potted history here. I probably have a bunch of things here that
need to be cleaned up and made more obvious, but hey, it's a start.

Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** background.ht 19 Sep 2002 23:39:24 -0000 1.1
--- background.ht 4 Nov 2002 06:39:42 -0000 1.2
***************
*** 15,18 ****
--- 15,67 ----

    more links? mail anthony at interlink.com.au

    +

    Overall Approach

    + Please note that I (Anthony) am writing this based on memory and
    + limited understanding of some of the subtler points of the maths. Gentle
    + corrections are welcome, or even encouraged.
    +

    Tokenizing

    + The architecture of the spambayes system has a couple of distinct
    + parts. The first, and most obvious, is the tokenizer. This takes
    + a mail message and breaks it up into a series of tokens. At the moment
    + it splits words out of the text parts of a message, there's a variety
    + of header tokenization that goes on as well. The code in tokenizer.py
    + and the comments in the Tokenizer section of Options.py contain more
    + information about various approaches to tokenizing.
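
    As a rough illustration of the tokenizing step described above, here is
    a minimal sketch. It is not the code in tokenizer.py: the function name
    sketch_tokenize, the word pattern, and the choice of the Subject and From
    headers are all assumptions made for this example.

```python
import email
import re

# Hypothetical word pattern; the real tokenizer has its own, richer rules.
WORD_RE = re.compile(r"[A-Za-z0-9$'-]+")

def sketch_tokenize(raw_message):
    """Yield header and body tokens from a raw message string."""
    msg = email.message_from_string(raw_message)
    # Header tokenization: tag each word with the header it came from.
    for name in ("subject", "from"):
        value = msg.get(name, "")
        for word in WORD_RE.findall(value.lower()):
            yield "%s:%s" % (name, word)
    # Body tokenization: split words out of the text parts only.
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        text = payload.decode("latin-1")
        for word in WORD_RE.findall(text.lower()):
            yield word
```

    Tagging header words with the header name keeps a word seen in a Subject
    line distinct from the same word seen in the body, which is one plausible
    reason to tokenize headers separately.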

    + +

    Combining and Scoring

    + The next part of the system is the scoring and combining part. This
    + is where the hairy mathematics and statistics come in.

    + Initially we started with Paul Graham's original combining scheme -
    + this has a number of "magic numbers" and "fuzz factors" built into it.
    + The Graham combining scheme has a number of problems, aside from the
    + magic in the internal fudge factors - it tends to produce scores of
    + either 1 or 0, and there's a very small middle ground in between - it
    + doesn't often claim to be "unsure", and gets it wrong because of this.
    + There's a number of discussions back and forth between Tim Peters and
    + Gary Robinson on this subject in the mailing list archives - I'll try
    + and put links to the relevant threads at some point.
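
    To see why the Graham scheme lands so close to 0 or 1, here is a minimal
    sketch of its combining step, assuming the formula from Graham's "A Plan
    for Spam" (the product of the word probabilities divided by that product
    plus the product of their complements). The function name and the
    clamping bounds are assumptions for illustration; the other fudge factors
    mentioned above are left out.

```python
def graham_combine(probs, lo=0.01, hi=0.99):
    """Combine per-word spam probabilities the way Graham's scheme does.

    lo and hi are illustrative clamping bounds (an assumption here); they
    keep a single 0.0 or 1.0 from zeroing out a product.
    """
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        p = min(max(p, lo), hi)
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)
```

    Even twenty mildly spammy clues at 0.6 each combine to a score above
    0.99, so there is very little room left for a middle ground.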

    + Gary produced a number of alternative approaches to combining and
    + scoring word probabilities. The initial one, after much back and forth
    + in the mailing list, is in the code today as 'gary_combining'. A couple
    + of other approaches, using the Central Limit Theorem, were also tried.
    + They produced interesting output - but histograms of the ham and spam
    + distributions had a disturbingly large overlap in the middle. There was
    + also an issue with incremental training and untraining of messages that
    + made it harder to use in the "real world". These two central limit
    + approaches were dropped after Tim, Gary and Rob Hooft produced a combining
    + scheme using chi-squared probabilities. This is now the default combining
    + scheme.

    + The chi-squared approach produces two numbers - a "ham probability" ("*H*")
    + and a "spam probability" ("*S*"). A typical spam will have a high *S*
    + and low *H*, while a ham will have high *H* and low *S*. In the case where
    + the message looks entirely unlike anything the system's been trained on,
    + you can end up with a low *H* and low *S* - this is the code saying "I don't
    + know what this message is". So at the end of the processing, you end up
    + with three possible results - "Spam", "Ham", or "Unsure". It's possible to
    + tweak the high and low cutoffs for the Unsure window - this trades off
    + unsure messages vs possible false positives or negatives.
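
    The *H* and *S* numbers can be sketched as follows. This is a simplified
    illustration under stated assumptions, not the code in classifier.py: it
    assumes the word probabilities are strictly between 0 and 1, it sums
    logarithms instead of guarding against underflow, and it assumes the
    final score is (S - H + 1) / 2. The names chi2Q and chi_combine are used
    for the example only.

```python
import math

def chi2Q(x2, v):
    """Return P(chisq >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Return (S, H, score) for word spam probabilities strictly in (0, 1)."""
    n = len(probs)
    if n == 0:
        # No clues at all: nothing to say either way.
        return 0.0, 0.0, 0.5
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S, H, (S - H + 1.0) / 2.0
```

    With ten clues at 0.99 this gives a high *S* and a low *H*; with ten
    neutral clues at 0.5 both *S* and *H* come out low and the score sits at
    0.5, which is the "I don't know what this message is" case.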

    + +

    Training

    +

    TBD

    +

    Mailing list archives

    There's a lot of background on what's been tried available from

From anthonybaxter@users.sourceforge.net Mon Nov 4 09:58:02 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 04 Nov 2002 01:58:02 -0800
Subject: [Spambayes-checkins] website background.ht,1.2,1.3
Message-ID:

Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv6694

Modified Files:
        background.ht
Log Message:
addition from RobH about high *H* and high *S* meaning.

Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** background.ht 4 Nov 2002 06:39:42 -0000 1.2
--- background.ht 4 Nov 2002 09:57:59 -0000 1.3
***************
*** 56,60 ****
  the message looks entirely unlike anything the system's been trained on,
  you can end up with a low *H* and low *S* - this is the code saying "I don't
! know what this message is". So at the end of the processing, you end up
  with three possible results - "Spam", "Ham", or "Unsure". It's possible to
  tweak the high and low cutoffs for the Unsure window - this trades off
--- 56,66 ----
  the message looks entirely unlike anything the system's been trained on,
  you can end up with a low *H* and low *S* - this is the code saying "I don't
! know what this message is".
! Some messages can even have both a high *H* and a high *S*, telling you
! basically that the message looks very much like ham, but also very much
! like spam. In this case spambayes is also unsure where the message
! should be classified, and the final score will be near 0.5.

    ! !

    So at the end of the processing, you end up
  with three possible results - "Spam", "Ham", or "Unsure". It's possible to
  tweak the high and low cutoffs for the Unsure window - this trades off

From tim_one@users.sourceforge.net Mon Nov 4 21:06:30 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 13:06:30 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.46,1.47
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8400

Modified Files:
        classifier.py
Log Message:
_add_msg(): Removed redundant store into wordinfo[word].

_remove_msg(): Added a store into wordinfo[word], which may be needed
if wordinfo is a persistent database, to let the persistence machinery
know that an internal field in the value associated *with* word changed.

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.46
retrieving revision 1.47
diff -C2 -d -r1.46 -r1.47
*** classifier.py 1 Nov 2002 16:01:14 -0000 1.46
--- classifier.py 4 Nov 2002 21:06:26 -0000 1.47
***************
*** 401,405 ****
              record = wordinfoget(word)
              if record is None:
!                 record = wordinfo[word] = WordInfo(now)

              if is_spam:
--- 401,405 ----
              record = wordinfoget(word)
              if record is None:
!                 record = WordInfo(now)

              if is_spam:
***************
*** 407,410 ****
--- 407,411 ----
              else:
                  record.hamcount += 1
+             # Needed to tell a persistent DB that the content changed.
              wordinfo[word] = record

***************
*** 419,423 ****
              self.nham -= 1

!         wordinfoget = self.wordinfo.get
          for word in Set(wordstream):
              record = wordinfoget(word)
--- 420,425 ----
              self.nham -= 1

!         wordinfo = self.wordinfo
!         wordinfoget = wordinfo.get
          for word in Set(wordstream):
              record = wordinfoget(word)
***************
*** 430,434 ****
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del self.wordinfo[word]

      def _getclues(self, wordstream):
--- 432,439 ----
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del wordinfo[word]
!                 else:
!                     # Needed to tell a persistent DB that the content changed.
!                     wordinfo[word] = record

      def _getclues(self, wordstream):

From jhylton@users.sourceforge.net Mon Nov 4 21:25:56 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Mon, 04 Nov 2002 13:25:56 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.1,1.2
Message-ID:

Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv18044

Modified Files:
        profile.py
Log Message:
Use the same default spamprob as regular classifier.

Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** profile.py 4 Nov 2002 04:44:20 -0000 1.1
--- profile.py 4 Nov 2002 21:25:54 -0000 1.2
***************
*** 10,13 ****
--- 10,14 ----
  from pspam.folder import Folder
+ from pspam.options import options
  import os
***************
*** 36,40 ****
  class WordInfo(Persistent):
!     def __init__(self, atime, spamprob=None):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 37,41 ----
  class WordInfo(Persistent):
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0

From jhylton@users.sourceforge.net Mon Nov 4 21:24:54 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Mon, 04 Nov 2002 13:24:54 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.47,1.48
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17508

Modified Files:
        classifier.py
Log Message:
Two changes to support pspam.

Make Bayes a classic class so that it can be mixed with ExtensionClass.
Define Bayes.WordInfoClass so that a subclass can define a different
class to represent word info.

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.47
retrieving revision 1.48
diff -C2 -d -r1.47 -r1.48
*** classifier.py 4 Nov 2002 21:06:26 -0000 1.47
--- classifier.py 4 Nov 2002 21:24:52 -0000 1.48
***************
*** 80,84 ****
           self.spamprob) = t

! class Bayes(object):
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object. There's
--- 80,84 ----
           self.spamprob) = t

! class Bayes:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
      # trying to hook this all up to ZODB as a persistent object. There's
***************
*** 92,95 ****
--- 92,98 ----
      # )

+     # allow a subclass to use a different class for WordInfo
+     WordInfoClass = WordInfo
+
      def __init__(self):
          self.wordinfo = {}
***************
*** 401,405 ****
              record = wordinfoget(word)
              if record is None:
!                 record = WordInfo(now)

              if is_spam:
--- 404,408 ----
              record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass(now)

              if is_spam:

From mhammond@users.sourceforge.net Mon Nov 4 22:19:36 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Mon, 04 Nov 2002 14:19:36 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.13,1.14
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv17976

Modified Files:
        train.py
Log Message:
Roll-back my previous "update probs" change - Tim's fix would have fixed it
had I done a complete retrain.
Done that now, and if I still need this Tim will sort it out
once-and-for-all

Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** train.py 4 Nov 2002 01:12:53 -0000 1.13
--- train.py 4 Nov 2002 22:19:34 -0000 1.14
***************
*** 19,23 ****
      return spam == True

! def train_message(msg, is_spam, mgr, update_probs = True):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
--- 19,23 ----
      return spam == True

! def train_message(msg, is_spam, mgr):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
***************
*** 41,47 ****
      mgr.bayes.learn(tokens, is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
-     if update_probs:
-         mgr.bayes.update_probabilities()
-
      mgr.bayes_dirty = True
      return True
--- 41,44 ----
***************
*** 54,58 ****
          progress.tick()
          try:
!             if train_message(message, isspam, mgr, False):
                  num_added += 1
          except:
--- 51,55 ----
          progress.tick()
          try:
!             if train_message(message, isspam, mgr):
                  num_added += 1
          except:

From mhammond@skippinet.com.au Mon Nov 4 22:48:08 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 5 Nov 2002 09:48:08 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
In-Reply-To:
Message-ID:

[Tim]
> In other news, here's a shallow bug, upon starting Outlook now:
...
> It can't have worked for you, either.

It can - my code took the "win32all has such a function" path. Pity mine is
the only machine in the world taking that path

> I fiddled my local copy to do
>
> from win32com.client import constants, getevents
>
> near the top, and that appears to have fixed it. I'll check that in, but
> please ensure that was the correct fix.

Just dandy - thanks! pychecker can tell us when it is no longer necessary!

Mark.

From mhammond@users.sourceforge.net Mon Nov 4 22:50:44 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Mon, 04 Nov 2002 14:50:44 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.26,1.27 train.py,1.14,1.15
Message-ID:

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv30899

Modified Files:
        addin.py train.py
Log Message:
After incremental training on individual messages, they are also rescored so
that they appear in the ham/spam folder with the *new* post-training score
rather than their pre-training, presumably wrong score.

Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** addin.py 4 Nov 2002 05:03:47 -0000 1.26
--- addin.py 4 Nov 2002 22:50:41 -0000 1.27
***************
*** 159,163 ****
              import train
              print "Training on message '%s' - " % subject,
!             if train.train_message(msgstore_message, False, self.manager):
                  print "trained as good"
              else:
--- 159,163 ----
              import train
              print "Training on message '%s' - " % subject,
!             if train.train_message(msgstore_message, False, self.manager, rescore = True):
                  print "trained as good"
              else:
***************
*** 191,195 ****
              subject = item.Subject.encode("mbcs", "replace")
              print "Training on message '%s' - " % subject,
!             if train.train_message(msgstore_message, True, self.manager):
                  print "trained as spam"
              else:
--- 191,195 ----
              subject = item.Subject.encode("mbcs", "replace")
              print "Training on message '%s' - " % subject,
!             if train.train_message(msgstore_message, True, self.manager, rescore = True):
                  print "trained as spam"
              else:
***************
*** 329,333 ****
              # Must train before moving, else we lose the message!
              print "Training on message - ",
!             if train.train_message(msgstore_message, True, self.manager):
                  print "trained as spam"
              else:
--- 329,333 ----
              # Must train before moving, else we lose the message!
              print "Training on message - ",
!             if train.train_message(msgstore_message, True, self.manager, rescore = True):
                  print "trained as spam"
              else:

Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** train.py 4 Nov 2002 22:19:34 -0000 1.14
--- train.py 4 Nov 2002 22:50:41 -0000 1.15
***************
*** 19,27 ****
      return spam == True

! def train_message(msg, is_spam, mgr):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
      # untrained if it was in the wrong category), False if already
      # in the correct category. Catch your own damn exceptions.
      from tokenizer import tokenize
      stream = msg.GetEmailPackageObject()
--- 19,29 ----
      return spam == True

! def train_message(msg, is_spam, mgr, rescore = False):
      # Train an individual message.
      # Returns True if newly added (message will be correctly
      # untrained if it was in the wrong category), False if already
      # in the correct category. Catch your own damn exceptions.
+ # If re-classified AND rescore = True, then a new score will + # be written to the message (so the user can see some effects) from tokenizer import tokenize stream = msg.GetEmailPackageObject() *************** *** 42,45 **** --- 44,52 ---- mgr.message_db[msg.searchkey] = is_spam mgr.bayes_dirty = True + # Simplest way to rescore is to re-filter with all_actions = False + if rescore: + import filter + filter.filter_message(msg, mgr, all_actions = False) + return True From tim_one@users.sourceforge.net Mon Nov 4 23:21:45 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 04 Nov 2002 15:21:45 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.64,1.65 tokenizer.py,1.60,1.61 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12377 Modified Files: Options.py tokenizer.py Log Message: New option record_header_absence. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.64 retrieving revision 1.65 diff -C2 -d -r1.64 -r1.65 *** Options.py 3 Nov 2002 13:48:47 -0000 1.64 --- Options.py 4 Nov 2002 23:21:43 -0000 1.65 *************** *** 54,63 **** # very strong ham clue, but a bogus one. In that case, set # count_all_header_lines to False, and adjust safe_headers instead. - count_all_header_lines: False ! # Like count_all_header_lines, but restricted to headers in this list. ! # safe_headers is ignored when count_all_header_lines is true. safe_headers: abuse-reports-to date --- 54,68 ---- # very strong ham clue, but a bogus one. In that case, set # count_all_header_lines to False, and adjust safe_headers instead. count_all_header_lines: False ! # When True, generate a "noheader:HEADERNAME" token for each header in ! # safe_headers (below) that *doesn't* appear in the headers. This helped ! # in various of Tim's python.org tests, but appeared to hurt a little in ! # Anthony Baxter's tests. ! 
record_header_absence: False + # Like count_all_header_lines, but restricted to headers in this list. + # safe_headers is ignored when count_all_header_lines is true, unless + # record_header_absence is also true. safe_headers: abuse-reports-to date *************** *** 336,339 **** --- 341,345 ---- 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, + 'record_header_absence': boolean_cracker, 'generate_long_skips': boolean_cracker, 'skip_max_word_size': int_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.60 retrieving revision 1.61 diff -C2 -d -r1.60 -r1.61 *** tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60 --- tokenizer.py 4 Nov 2002 23:21:43 -0000 1.61 *************** *** 1179,1182 **** --- 1179,1185 ---- for x in x2n.items(): yield "header:%s:%d" % x + if options.record_header_absence: + for x in options.safe_headers - Set([k.lower() for k in x2n]): + yield "noheader:" + x def tokenize_body(self, msg, maxword=options.skip_max_word_size): From tim_one@users.sourceforge.net Mon Nov 4 23:21:45 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 04 Nov 2002 15:21:45 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 default_bayes_customize.ini,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv12377/Outlook2000 Modified Files: default_bayes_customize.ini Log Message: New option record_header_absence. 
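Editorial sketch of the record_header_absence behaviour in the tokenizer.py diff above. It is a minimal sketch, not the project's code: the function name header_tokens and its parameters are hypothetical, the built-in set stands in for sets.Set, and the absence tokens are sorted only to make the output deterministic. It assumes present_counts maps each header name found in a message to its occurrence count.

```python
def header_tokens(present_counts, safe_headers, record_header_absence=False):
    """Yield header-count tokens and, optionally, header-absence tokens.

    present_counts: dict mapping header name (as found in the message)
                    to the number of times it occurs.
    safe_headers:   set of lowercase header names.
    """
    # One "header:NAME:COUNT" token per header that is present.
    for name, count in present_counts.items():
        yield "header:%s:%d" % (name, count)
    if record_header_absence:
        # Compare case-insensitively, as the diff does with k.lower().
        seen = {name.lower() for name in present_counts}
        # sorted() is an addition here, for a stable token order.
        for name in sorted(safe_headers - seen):
            yield "noheader:" + name


tokens = list(header_tokens({"Date": 1, "Received": 3},
                            {"date", "message-id", "subject"},
                            record_header_absence=True))
print(tokens)
```

With the option off, only the "header:" tokens are produced, which matches the default of False in Options.py.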
Index: default_bayes_customize.ini =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** default_bayes_customize.ini 27 Oct 2002 03:42:58 -0000 1.4 --- default_bayes_customize.ini 4 Nov 2002 23:21:43 -0000 1.5 *************** *** 14,17 **** --- 14,20 ---- replace_nonascii_chars: True + # It's helpful for Tim . + record_header_absence: True + [Classifier] # Uncomment the next lines if you want to use the former default for From tim.one@comcast.net Mon Nov 4 23:39:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 18:39:27 -0500 Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.13,1.14 In-Reply-To: Message-ID: [Mark Hammond] > Roll-back my previous "update probs" change - Tim's fix would > have fixed it had I done a complete retrain. Done that now, and > if I still need this Tim will sort it out once-and-for-all Do keep an eye on it! I've never seen software that had a bug, but I keep hearing it's possible ... From mhammond@users.sourceforge.net Tue Nov 5 11:44:30 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Tue, 05 Nov 2002 03:44:30 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.21,1.22 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv14881 Modified Files: msgstore.py Log Message: Fix a few typos in comments, and code! Also adding a check if the message has attachments - currently not used, but will be soon (to handle multipart/signed messages) - was in the code then found the typos, so decided I should get 'em in.
[DoCopyMode -> DoCopyMove does get me wondering about the utility of auto-complete in editors tho' <0.1 wink>] Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** msgstore.py 4 Nov 2002 00:41:08 -0000 1.21 --- msgstore.py 5 Nov 2002 11:44:27 -0000 1.22 *************** *** 296,301 **** # only problem is that it can potentially be changed - however, the # Outlook client provides no such (easy/obvious) way ! # (ie, someone would need to really want to change it ! # This, searchkey is the only reliable long-lived message key. self.searchkey = searchkey self.unread = unread --- 296,301 ---- # only problem is that it can potentially be changed - however, the # Outlook client provides no such (easy/obvious) way ! # (ie, someone would need to really want to change it ) ! # Thus, searchkey is the only reliable long-lived message key. self.searchkey = searchkey self.unread = unread *************** *** 369,377 **** # Oh - and for multipart/signed messages self._EnsureObject() ! prop_ids = PR_TRANSPORT_MESSAGE_HEADERS_A, PR_BODY_A, MYPR_BODY_HTML_A hr, data = self.mapi_object.GetProps(prop_ids,0) headers = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0]) body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1]) html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2]) # Mail delivered internally via Exchange Server etc may not have # headers - fake some up. --- 369,381 ---- # Oh - and for multipart/signed messages self._EnsureObject() ! prop_ids = (PR_TRANSPORT_MESSAGE_HEADERS_A, ! PR_BODY_A, ! MYPR_BODY_HTML_A, ! 
PR_HASATTACH) hr, data = self.mapi_object.GetProps(prop_ids,0) headers = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0]) body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1]) html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2]) + has_attach = data[3][1] # Mail delivered internally via Exchange Server etc may not have # headers - fake some up. *************** *** 382,385 **** --- 386,395 ---- elif headers.startswith("Microsoft Mail"): headers = "X-MS-Mail-Gibberish: " + headers + if not html and not body: + # Only ever seen this for "multipart/signed" messages, so + # without any better clues, just handle this. + # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed + pass + return "%s\n%s\n%s" % (headers, html, body) *************** *** 476,480 **** props = ( (mapi.PS_PUBLIC_STRINGS, prop), ) prop = self.mapi_object.GetIDsFromNames(props, 0)[0] - # Docs say PT_ERROR, reality shows PT_UNSPECIFIED if PROP_TYPE(prop) == PT_ERROR: # No such property return None --- 486,489 ---- *************** *** 494,498 **** self.dirty = False ! def _DoCopyMode(self, folder, isMove): ## self.mapi_object = None # release the COM pointer assert not self.dirty, \ --- 503,507 ---- self.dirty = False ! def _DoCopyMove(self, folder, isMove): ## self.mapi_object = None # release the COM pointer assert not self.dirty, \ *************** *** 517,524 **** def MoveTo(self, folder): ! self._DoCopyMode(folder, True) def CopyTo(self, folder): ! self._DoCopyMode(folder, True) def test(): --- 526,533 ---- def MoveTo(self, folder): ! self._DoCopyMove(folder, True) def CopyTo(self, folder): ! 
self._DoCopyMove(folder, False) def test(): From mhammond@users.sourceforge.net Tue Nov 5 21:51:55 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Tue, 05 Nov 2002 13:51:55 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs ManagerDialog.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs In directory usw-pr-cvs1:/tmp/cvs-serv10075 Modified Files: ManagerDialog.py Log Message: Ensure filter_status is always set to a value indicating why the filter can not be enabled. Index: ManagerDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** ManagerDialog.py 1 Nov 2002 02:03:48 -0000 1.5 --- ManagerDialog.py 5 Nov 2002 21:51:53 -0000 1.6 *************** *** 120,123 **** --- 120,128 ---- if ok_to_enable: unsure_name = self.mgr.FormatFolderNames([config.unsure_folder_id], False) + else: + filter_status = "You must define the folder to receive your possible spam" + else: + filter_status = "You must define the folder to receive your certain spam" + # whew if ok_to_enable: From richiehindle@users.sourceforge.net Tue Nov 5 22:18:59 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Tue, 05 Nov 2002 14:18:59 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.9,1.10 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv23270 Modified Files: pop3proxy.py Log Message: First cut of the HTML user interface - see the docstring for -b and -u. Now reads the classification header and its values from the options. Added TOP support to the test server (to make 40tude Dialog happy). 
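Editorial sketch of the TOP support this log message mentions for the test server. It restates the _getMessage logic from the pop3proxy.py diff as a standalone function; the name get_message and the maildrop argument are hypothetical, and the sketch assumes each stored message separates headers from body with a blank line ('\n\n').

```python
def get_message(maildrop, number, max_lines):
    """Build a POP3 RETR/TOP style response for message `number` (1-based).

    Only the first `max_lines` lines of the body are returned, which is
    what TOP asks for; RETR can pass a very large max_lines.
    """
    if 0 < number <= len(maildrop):
        message = maildrop[number - 1]
        # Split once at the blank line between headers and body.
        headers, body = message.split('\n\n', 1)
        body_lines = body.split('\n')[:max_lines]
        message = headers + '\r\n\r\n' + '\n'.join(body_lines)
        return "+OK\r\n%s\r\n.\r\n" % message
    return "-ERR no such message\r\n"


maildrop = ["Subject: hi\n\nline1\nline2\nline3"]
top_two = get_message(maildrop, 1, 2)      # like "TOP 1 2"
missing = get_message(maildrop, 2, 1)      # no second message
```

Sharing one helper between RETR and TOP keeps the two commands consistent; RETR is simply TOP with a line limit larger than any test message.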
Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** pop3proxy.py 2 Nov 2002 21:00:21 -0000 1.9 --- pop3proxy.py 5 Nov 2002 22:18:56 -0000 1.10 *************** *** 15,19 **** -p FILE : use the named data file -d : the file is a DBM file rather than a pickle ! -l port : listen on this port number (default 110) pop3proxy -t --- 15,22 ---- -p FILE : use the named data file -d : the file is a DBM file rather than a pickle ! -l port : proxy listens on this port number (default 110) ! -u port : User interface listens on this port number ! (default 8880; Browse http://localhost:8880/) ! -b : Launch a web browser showing the user interface. pop3proxy -t *************** *** 35,40 **** ! import sys, re, operator, errno, getopt, cPickle, time ! import socket, asyncore, asynchat import classifier, tokenizer, hammie from Options import options --- 38,43 ---- ! import sys, re, operator, errno, getopt, cPickle, cStringIO, time ! import socket, asyncore, asynchat, cgi, urlparse, webbrowser import classifier, tokenizer, hammie from Options import options *************** *** 42,47 **** # HEADER_EXAMPLE is the longest possible header - the length of this one # is added to the size of each message. ! HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER ! HEADER_EXAMPLE = '%s: Unsure\r\n' % hammie.DISPHEADER --- 45,57 ---- # HEADER_EXAMPLE is the longest possible header - the length of this one # is added to the size of each message. ! HEADER_FORMAT = '%s: %%s\r\n' % options.hammie_header_name ! HEADER_EXAMPLE = '%s: xxxxxxxxxxxxxxxxxxxx\r\n' % options.hammie_header_name ! ! # This keeps the global status of the module - the command-line options, ! # how many mails have been classified, how many active connections there ! # are, and so on. ! class Status: ! pass ! 
status = Status() *************** *** 61,65 **** self.set_socket(s, socketMap) self.set_reuse_addr() ! print "Listening on port %d." % port self.bind(('', port)) self.listen(5) --- 71,75 ---- self.set_socket(s, socketMap) self.set_reuse_addr() ! print "%s listening on port %d." % (self.__class__.__name__, port) self.bind(('', port)) self.listen(5) *************** *** 73,80 **** self.factory(*args) ! class POP3ProxyBase(asynchat.async_chat): """An async dispatcher that understands POP3 and proxies to a POP3 ! server, calling `self.onTransaction( request, response )` for each transaction. Responses are not un-byte-stuffed before reaching self.onTransaction() (they probably should be for a totally generic --- 83,107 ---- self.factory(*args) + class BrighterAsyncChat(asynchat.async_chat): + """An asynchat.async_chat that doesn't give spurious warnings on + receiving an incoming connection, and lets SystemExit cause an + exit.""" ! def handle_connect(self): ! """Suppress the asyncore "unhandled connect event" warning.""" ! pass ! ! def handle_error(self): ! """Let SystemExit cause an exit.""" ! type, v, t = sys.exc_info() ! if type == SystemExit: ! raise ! else: ! asynchat.async_chat.handle_error(self) ! ! ! class POP3ProxyBase(BrighterAsyncChat): """An async dispatcher that understands POP3 and proxies to a POP3 ! server, calling `self.onTransaction(request, response)` for each transaction. Responses are not un-byte-stuffed before reaching self.onTransaction() (they probably should be for a totally generic *************** *** 88,92 **** def __init__(self, clientSocket, serverName, serverPort): ! asynchat.async_chat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') --- 115,119 ---- def __init__(self, clientSocket, serverName, serverPort): ! 
BrighterAsyncChat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') *************** *** 96,103 **** self.push(self.serverIn.readline()) - def handle_connect(self): - """Suppress the asyncore "unhandled connect event" warning.""" - pass - def onTransaction(self, command, args, response): """Overide this. Takes the raw request and the response, and --- 123,126 ---- *************** *** 221,232 **** self.close_when_done() - def handle_error(self): - """Let SystemExit cause an exit.""" - type, v, t = sys.exc_info() - if type == SystemExit: - raise - else: - asynchat.async_chat.handle_error(self) - class BayesProxyListener(Listener): --- 244,247 ---- *************** *** 276,279 **** --- 291,296 ---- self.handlers = {'STAT': self.onStat, 'LIST': self.onList, 'RETR': self.onRetr, 'TOP': self.onTop} + status.totalSessions += 1 + status.activeSessions += 1 def send(self, data): *************** *** 290,293 **** --- 307,314 ---- return data + def close(self): + status.activeSessions -= 1 + POP3ProxyBase.close(self) + def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the *************** *** 343,352 **** # Now find the spam disposition and add the header. prob = self.bayes.spamprob(tokenizer.tokenize(message)) if prob < options.ham_cutoff: ! disposition = "No" elif prob > options.spam_cutoff: ! disposition = "Yes" else: ! disposition = "Unsure" headers, body = re.split(r'\n\r?\n', response, 1) --- 364,381 ---- # Now find the spam disposition and add the header. prob = self.bayes.spamprob(tokenizer.tokenize(message)) + if command == 'RETR': + status.numEmails += 1 if prob < options.ham_cutoff: ! disposition = options.header_ham_string ! if command == 'RETR': ! status.numHams += 1 elif prob > options.spam_cutoff: ! disposition = options.header_spam_string ! if command == 'RETR': ! status.numSpams += 1 else: ! disposition = options.header_unsure_string ! if command == 'RETR': ! 
status.numUnsure += 1 headers, body = re.split(r'\n\r?\n', response, 1) *************** *** 368,372 **** ! def main(serverName, serverPort, proxyPort, pickleName, useDB): """Runs the proxy forever or until a 'KILL' command is received or someone hits Ctrl+Break.""" --- 397,646 ---- ! class UserInterfaceListener(Listener): ! """Listens for incoming web browser connections and spins off ! UserInterface objects to serve them.""" ! ! def __init__(self, uiPort, bayes): ! uiArgs = (bayes,) ! Listener.__init__(self, uiPort, UserInterface, uiArgs) ! ! ! # Until the user interface has had a wider audience, I won't pollute the ! # project with .gif files and the like. Here's the viking helmet. ! import base64 ! helmet = base64.decodestring( ! """R0lGODlhIgAYAPcAAEJCRlVTVGNaUl5eXmtaVm9lXGtrZ3NrY3dvZ4d0Znt3dImHh5R+a6GDcJyU ! jrSdjaWlra2tra2tta+3ur2trcC9t7W9ysDDyMbGzsbS3r3W78bW78be78be973e/8bn/86pjNav ! kc69re/Lrc7Ly9ba4vfWveTh5M7e79be79bn797n7+fr6+/v5+/v7/f3787e987n987n/9bn99bn ! /9bv/97n997v++fv9+f3/+/v9+/3//f39/f/////9////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAEAAB4ALAAAAAAiABgA ! AAj+AD0IHEiwoMGDA2XI8PBhxg2EECN+YJHjwwccOz5E3FhQBgseMmK44KGRo0kaLHzQENljoUmO ! 
NE74uGHDxQ8aL2GmzFHzZs6NNFr8yKHC5sOfEEUOVcHiR8aNFksi/LCCx1KZPXAilLHBAoYMMSB6 ! 9DEUhsyhUgl+wOBAwQIHFsIapGpzaIcTVnvcSOsBhgUFBgYUMKAgAgqNH2J0aPjxR9YPJerqlYEi ! w4YYExQM2FygwIHCKVBgiBChBIsXP5wu3HD2Bw8MC2JD0CygAIHOnhU4cLDA7QWrqfd6iBE5dQsH ! BgJvHiDgNoID0A88V6AAAQSyjl16QIHXBwnNAwDIBAhAwDmDBAjQHyiAIPkC7DnUljhxwkGAAQHE ! B+icIAGD8+clUMByCNjUUkEdlHCBAvflF0BtB/zHQAMSCjhYYBXsoFVBMWAQWH4AAFBbAg2UWOID ! FK432AEO2ABRBwtsFuKDBTSAYgMghBDCAwwgwB4CClQAQ0R/4RciAQjYyMADIIwwAggN+PeWBTPw ! VdAHHEjA4IMR8ojjCCaEEGUCFcygnUQxaEndbhBAwKQIFVAAgQMQHPZTBxrkqUEHfHLAAZ+AdgBR ! QAAAOw==""") ! ! ! class UserInterface(BrighterAsyncChat): ! """Serves the HTML user interface of the proxy.""" ! ! header = """Spambayes proxy: %s ! ! \n""" ! ! bodyStart = """ !

    !
    \n""" ! ! footer = """
    !
    ! ! !
    \n""" ! ! pageSection = """ ! !
    %s
    %s
    !  
    \n""" ! ! wordQuery = """
    ! ! !
    """ ! ! def __init__(self, clientSocket, bayes): ! BrighterAsyncChat.__init__(self, clientSocket) ! self.bayes = bayes ! self.request = '' ! self.set_terminator('\r\n\r\n') ! self.helmet = helmet ! ! def collect_incoming_data(self, data): ! """Asynchat override.""" ! self.request = self.request + data ! ! def found_terminator(self): ! """Asynchat override. ! Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = self.request.split('\r\n', 1) ! try: ! method, url, version = requestLine.strip().split() ! except ValueError: ! self.pushError(400, "Malformed request: '%s'" % requestLine) # XXX: 400?? ! self.close_when_done() ! else: ! method = method.upper() ! _, _, path, _, query, _ = urlparse.urlparse(url) ! params = cgi.parse_qs(query, keep_blank_values=True) ! if self.get_terminator() == '\r\n\r\n' and method == 'POST': ! # We need to read a body; set a numeric async_chat terminator. ! match = re.search(r'(?i)content-length:\s*(\d+)', headers) ! self.set_terminator(int(match.group(1))) ! self.request = self.request + '\r\n\r\n' ! return ! ! if type(self.get_terminator()) is type(1): ! # We've just read the body of a POSTed request. ! self.set_terminator('\r\n\r\n') ! body = self.request.split('\r\n\r\n', 1)[1] ! match = re.search(r'(?i)content-type:\s*([^\r\n]+)', headers) ! contentTypeHeader = match.group(1) ! contentType, pdict = cgi.parse_header(contentTypeHeader) ! if contentType == 'multipart/form-data': ! # multipart/form-data - probably a file upload. ! bodyFile = cStringIO.StringIO(body) ! params.update(cgi.parse_multipart(bodyFile, pdict)) ! else: ! # A normal x-www-form-urlencoded. ! params.update(cgi.parse_qs(body, keep_blank_values=True)) ! ! # Convert the cgi params into a simple dictionary. ! plainParams = {} ! for name, value in params.iteritems(): ! plainParams[name] = value[0] ! self.onRequest(path, plainParams) ! self.close_when_done() ! ! def onRequest(self, path, params): ! """Handles a decoded HTTP request.""" ! 
if path == '/': ! path = '/Home' ! ! if path == '/helmet.gif': ! self.pushOKHeaders('image/gif') ! self.push(self.helmet) ! else: ! try: ! name = path[1:].capitalize() ! handler = getattr(self, 'on' + name) ! except AttributeError: ! self.pushError(404, "Not found: '%s'" % path) ! else: ! # This is a request for a valid page; run the handler. ! self.pushOKHeaders('text/html') ! self.pushPreamble(name) ! handler(params) ! timeString = time.asctime(time.localtime()) ! self.push(self.footer % timeString) ! ! def pushOKHeaders(self, contentType): ! self.push("HTTP/1.0 200 OK\r\n") ! self.push("Content-Type: %s\r\n" % contentType) ! self.push("\r\n") ! ! def pushError(self, code, message): ! self.push("HTTP/1.0 %d Error\r\n" % code) ! self.push("Content-Type: text/html\r\n") ! self.push("\r\n") ! self.push("

    %d %s

    " % (code, message)) ! ! def pushPreamble(self, name): ! self.push(self.header % name) ! if name == 'Home': ! homeLink = name ! else: ! homeLink = "Home > %s" % name ! self.push(self.bodyStart % homeLink) ! ! def onHome(self, params): ! summary = """POP3 proxy running on port %(proxyPort)d, ! proxying to %(serverName)s:%(serverPort)d.
    ! Active POP3 conversations: %(activeSessions)d.
    ! POP3 conversations this session: ! %(totalSessions)d.
    ! Emails classified this session: %(numSpams)d spam, ! %(numHams)d ham, %(numUnsure)d unsure. ! """ % status.__dict__ ! ! train = """
    ! Either upload a message file: !
    ! Or paste the whole message (including headers) here:
    !
    ! Is this message ! Ham or ! Spam?
    ! !
    """ ! ! body = (self.pageSection % ('Status', summary) + ! self.pageSection % ('Word query', self.wordQuery) + ! self.pageSection % ('Train', train)) ! self.push(body) ! ! def onShutdown(self, params): ! self.push("

    Shutdown. Goodbye.

    ") ! self.push(' ') # Acts as a flush for small buffers. ! self.shutdown(2) ! self.close() ! raise SystemExit ! ! def onUpload(self, params): ! message = params.get('file') or params.get('text') ! isSpam = (params['which'] == 'spam') ! self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("""

    Trained on your message. Saving database...

    """) ! self.push(" ") # Flush... must find out how to do this properly... ! if not status.useDB and status.pickleName: ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("

    Done.

    Home

    ") ! ! def onWordquery(self, params): ! word = params['word'] ! try: ! # Must be a better way to get __dict__ for a new-style class... ! wi = self.bayes.wordinfo[word] ! members = dict(map(lambda n: (n, getattr(wi, n)), wi.__slots__)) ! members['atime'] = time.asctime(time.localtime(members['atime'])) ! info = """Number of spam messages: %(spamcount)d.
    ! Number of ham messages: %(hamcount)d.
    ! Number of times used to classify: %(killcount)s.
    ! Probability that a message containing this word is spam: ! %(spamprob)f.
    ! Last used: %(atime)s.
    """ % members ! except KeyError: ! info = "'%s' does not appear in the database." % word ! ! body = (self.pageSection % ("Statistics for '%s':" % word, info) + ! self.pageSection % ('Word query', self.wordQuery)) ! self.push(body) ! ! ! def main(serverName, serverPort, proxyPort, ! uiPort, launchUI, pickleName, useDB): """Runs the proxy forever or until a 'KILL' command is received or someone hits Ctrl+Break.""" *************** *** 375,378 **** --- 649,655 ---- print "Done." BayesProxyListener(serverName, serverPort, proxyPort, bayes) + UserInterfaceListener(uiPort, bayes) + if launchUI: + webbrowser.open_new("http://localhost:%d/" % uiPort) asyncore.loop() *************** *** 424,430 **** ! class TestPOP3Server(asynchat.async_chat): ! """Minimal POP3 server, for testing purposes. Doesn't support TOP ! or UIDL. USER, PASS, APOP, DELE and RSET simply return "+OK" without doing anything. Also understands the 'KILL' command, to kill it. The mail content is the example messages above. --- 701,707 ---- ! class TestPOP3Server(BrighterAsyncChat): ! """Minimal POP3 server, for testing purposes. Doesn't support ! UIDL. USER, PASS, APOP, DELE and RSET simply return "+OK" without doing anything. Also understands the 'KILL' command, to kill it. The mail content is the example messages above. *************** *** 434,439 **** # Grumble: asynchat.__init__ doesn't take a 'map' argument, # hence the two-stage construction. ! asynchat.async_chat.__init__(self) ! asynchat.async_chat.set_socket(self, clientSocket, socketMap) self.maildrop = [spam1, good1] self.set_terminator('\r\n') --- 711,716 ---- # Grumble: asynchat.__init__ doesn't take a 'map' argument, # hence the two-stage construction. ! BrighterAsyncChat.__init__(self) ! BrighterAsyncChat.set_socket(self, clientSocket, socketMap) self.maildrop = [spam1, good1] self.set_terminator('\r\n') *************** *** 442,453 **** self.handlers = {'STAT': self.onStat, 'LIST': self.onList, ! 
'RETR': self.onRetr} self.push("+OK ready\r\n") self.request = '' - def handle_connect(self): - """Suppress the asyncore "unhandled connect event" warning.""" - pass - def collect_incoming_data(self, data): """Asynchat override.""" --- 719,727 ---- self.handlers = {'STAT': self.onStat, 'LIST': self.onList, ! 'RETR': self.onRetr, ! 'TOP': self.onTop} self.push("+OK ready\r\n") self.request = '' def collect_incoming_data(self, data): """Asynchat override.""" *************** *** 466,469 **** --- 740,745 ---- self.close_when_done() if command == 'KILL': + self.shutdown(2) + self.close() raise SystemExit else: *************** *** 472,483 **** self.request = '' - def handle_error(self): - """Let SystemExit cause an exit.""" - type, v, t = sys.exc_info() - if type == SystemExit: - raise - else: - asynchat.async_chat.handle_error(self) - def onStat(self, command, args): """POP3 STAT command.""" --- 748,751 ---- *************** *** 502,514 **** return '\r\n'.join(returnLines) + '\r\n' ! def onRetr(self, command, args): ! """POP3 RETR command.""" ! number = int(args) if 0 < number <= len(self.maildrop): message = self.maildrop[number-1] return "+OK\r\n%s\r\n.\r\n" % message else: return "-ERR no such message\r\n" def onUnknown(self, command, args): """Unknown POP3 command.""" --- 770,793 ---- return '\r\n'.join(returnLines) + '\r\n' ! def _getMessage(self, number, maxLines): ! 
"""Implements the POP3 RETR and TOP commands.""" if 0 < number <= len(self.maildrop): message = self.maildrop[number-1] + headers, body = message.split('\n\n', 1) + bodyLines = body.split('\n')[:maxLines] + message = headers + '\r\n\r\n' + '\n'.join(bodyLines) return "+OK\r\n%s\r\n.\r\n" % message else: return "-ERR no such message\r\n" + def onRetr(self, command, args): + """POP3 RETR command.""" + return self._getMessage(int(args), 12345) + + def onTop(self, command, args): + """POP3 RETR command.""" + number, lines = map(int, args.split()) + return self._getMessage(number, lines) + def onUnknown(self, command, args): """Unknown POP3 command.""" *************** *** 564,568 **** while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(hammie.DISPHEADER) != -1 # Kill the proxy and the test server. --- 843,847 ---- while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) != -1 # Kill the proxy and the test server. *************** *** 580,592 **** # Read the arguments. try: ! opts, args = getopt.getopt(sys.argv[1:], 'htdp:l:') except getopt.error, msg: print >>sys.stderr, str(msg) + '\n\n' + __doc__ sys.exit() ! pickleName = hammie.DEFAULTDB ! proxyPort = 110 ! useDB = False ! runTestServer = False for opt, arg in opts: if opt == '-h': --- 859,880 ---- # Read the arguments. try: ! opts, args = getopt.getopt(sys.argv[1:], 'htdbp:l:u:') except getopt.error, msg: print >>sys.stderr, str(msg) + '\n\n' + __doc__ sys.exit() ! status.pickleName = hammie.DEFAULTDB ! status.proxyPort = 110 ! status.uiPort = 8880 ! status.serverPort = 110 ! status.useDB = False ! status.runTestServer = False ! status.launchUI = False ! status.totalSessions = 0 ! status.activeSessions = 0 ! status.numEmails = 0 ! status.numSpams = 0 ! status.numHams = 0 ! status.numUnsure = 0 for opt, arg in opts: if opt == '-h': *************** *** 594,604 **** sys.exit() elif opt == '-t': ! 
runTestServer = True elif opt == '-d': ! useDB = True elif opt == '-p': ! pickleName = arg elif opt == '-l': ! proxyPort = int(arg) # Do whatever we've been asked to do... --- 882,896 ---- sys.exit() elif opt == '-t': ! status.runTestServer = True ! elif opt == '-b': ! status.launchUI = True elif opt == '-d': ! status.useDB = True elif opt == '-p': ! status.pickleName = arg elif opt == '-l': ! status.proxyPort = int(arg) ! elif opt == '-u': ! status.uiPort = int(arg) # Do whatever we've been asked to do... *************** *** 608,623 **** print "Self-test passed." # ...else it would have asserted. ! elif runTestServer: print "Running a test POP3 server on port 8110..." TestListener() asyncore.loop() ! elif len(args) == 1: ! # Named POP3 server, default port. ! main(args[0], 110, proxyPort, pickleName, useDB) ! ! elif len(args) == 2: ! # Named POP3 server, named port. ! main(args[0], int(args[1]), proxyPort, pickleName, useDB) else: --- 900,915 ---- print "Self-test passed." # ...else it would have asserted. ! elif status.runTestServer: print "Running a test POP3 server on port 8110..." TestListener() asyncore.loop() ! elif 1 <= len(args) <= 2: ! # Normal usage, with optional server port number. ! status.serverName = args[0] ! if len(args) == 2: ! status.serverPort = int(args[1]) ! main(status.serverName, status.serverPort, status.proxyPort, ! status.uiPort, status.launchUI, status.pickleName, status.useDB) else: From jhylton@users.sourceforge.net Tue Nov 5 22:57:29 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Tue, 05 Nov 2002 14:57:29 -0800 Subject: [Spambayes-checkins] spambayes/pspam pop.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam In directory usw-pr-cvs1:/tmp/cvs-serv9113 Modified Files: pop.py Log Message: Allow the proxy server to get the real server name from USER command. 
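Editorial sketch of the USER argument format this log message describes (user@server[:port]). It follows the parsing in the pop.py diff, but the function name split_user is hypothetical, and the explicit check for a missing "@" is an addition that the diff does not contain.

```python
def split_user(arg, default_port=110):
    """Split 'user@server[:port]' into (username, (host, port)).

    The last '@' is used as the separator, so the username itself may
    contain '@'.
    """
    i = arg.rfind("@")
    if i == -1:
        # Added for the sketch: the diff does not handle this case.
        raise ValueError("expected user@server[:port], got %r" % arg)
    username = arg[:i]
    server = arg[i + 1:]
    j = server.find(":")
    if j == -1:
        return username, (server, default_port)
    return username, (server[:j], int(server[j + 1:]))


with_port = split_user("jeremy@example.com:111")
without_port = split_user("jeremy@example.com")
```

The (host, port) tuple has the shape socket.connect() expects, which is how the diff passes it on to connect_pop().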
Index: pop.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pop.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** pop.py 4 Nov 2002 04:44:19 -0000 1.1 --- pop.py 5 Nov 2002 22:57:27 -0000 1.2 *************** *** 11,14 **** --- 11,21 ---- insert the spam header. + The proxy can connect to any real POP3 server. It parses the USER + command to figure out the address of the real server. It expects the + USER argument to follow this format: user@server[:port]. For example, + if you configure your POP client to send USER jeremy@example.com:111, + it will connect to a server on port 111 at example.com and send it the + command USER jeremy. + XXX A POP3 server sometimes adds the number of bytes in the +OK response to some commands when the POP3 spec doesn't require it to. *************** *** 41,52 **** HEADER_SIZE = len(HEADER % 0.0) class POP3ProxyServer(SocketServer.ThreadingTCPServer): allow_reuse_address = True ! def __init__(self, addr, handler, classifier, real_server, log, zodb): SocketServer.ThreadingTCPServer.__init__(self, addr, handler) self.classifier = classifier - self.pop_server = real_server self.log = log self.zodb = zodb --- 48,60 ---- HEADER_SIZE = len(HEADER % 0.0) + VERSION = 0.1 + class POP3ProxyServer(SocketServer.ThreadingTCPServer): allow_reuse_address = True ! def __init__(self, addr, handler, classifier, log, zodb): SocketServer.ThreadingTCPServer.__init__(self, addr, handler) self.classifier = classifier self.log = log self.zodb = zodb *************** *** 73,80 **** """Act as proxy between POP client and server.""" ! def connect_pop(self): # connect to the pop server s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! s.connect(self.server.pop_server) self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb")) # the write side should be unbuffered --- 81,117 ---- """Act as proxy between POP client and server.""" ! def read_user(self): !
# XXX This could be cleaned up a bit. ! line = self.rfile.readline() ! if line == "": ! return False ! parts = line.split() ! if parts[0] != "USER": ! self.wfile.write("-ERR Invalid command; must specify USER first") ! return False ! user = parts[1] ! i = user.rfind("@") ! username = user[:i] ! server = user[i+1:] ! i = server.find(":") ! if i == -1: ! server = server, 110 ! else: ! port = int(server[i+1:]) ! server = server[:i], port ! zLOG.LOG("POP3", zLOG.INFO, "Got connect for %s" % repr(server)) ! self.connect_pop(server) ! self.pop_wfile.write("USER %s\r\n" % username) ! resp = self.pop_rfile.readline() ! # As long as the server responds OK, just swallow this response. ! if resp.startswith("+OK"): ! return True ! else: ! return False ! ! def connect_pop(self, pop_server): # connect to the pop server s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! s.connect(pop_server) self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb")) # the write side should be unbuffered *************** *** 90,94 **** self.server.zodb.sync() self.sess_retr_count = 0 ! self.connect_pop() try: self.handle_pop() --- 127,135 ---- self.server.zodb.sync() self.sess_retr_count = 0 ! self.wfile.write("+OK pspam/pop %s\r\n" % VERSION) ! # First read the USER command to get the real server's name ! if not self.read_user(): ! zLOG.LOG("POP3", zLOG.INFO, "Did not get valid USER") ! return try: self.handle_pop() *************** *** 265,269 **** POP3RequestHandler, profile.classifier, - (options.server, options.server_port), log, conn, --- 306,309 ---- From montanaro@users.sourceforge.net Wed Nov 6 01:57:42 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 17:57:42 -0800 Subject: [Spambayes-checkins] spambayes mboxutils.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12413 Modified Files: mboxutils.py Log Message: Add get_message() factory function ripped from tokenizer.Tokenizer.get_message().
Replace usage of _factory() with it. Index: mboxutils.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** mboxutils.py 27 Oct 2002 21:35:00 -0000 1.3 --- mboxutils.py 6 Nov 2002 01:57:39 -0000 1.4 *************** *** 24,27 **** --- 24,28 ---- import email import mailbox + import email.Message class DirOfTxtFileMailbox: *************** *** 44,54 **** f.close() - def _factory(fp): - # Helper for getmbox - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return '' - def _cat(seqs): for seq in seqs: --- 45,48 ---- *************** *** 74,78 **** for name in names: filename = os.path.join(mhpath, name) ! mbox = mailbox.MHMailbox(filename, _factory) mboxes.append(mbox) if len(mboxes) == 1: --- 68,72 ---- for name in names: filename = os.path.join(mhpath, name) ! mbox = mailbox.MHMailbox(filename, get_message) mboxes.append(mbox) if len(mboxes) == 1: *************** *** 85,95 **** # if the pathname contains /Mail/, else a DirOfTxtFileMailbox. if os.path.exists(os.path.join(name, 'cur')): ! mbox = mailbox.Maildir(name, _factory) elif name.find("/Mail/") >= 0: ! mbox = mailbox.MHMailbox(name, _factory) else: ! mbox = DirOfTxtFileMailbox(name, _factory) else: fp = open(name, "rb") ! mbox = mailbox.PortableUnixMailbox(fp, _factory) return iter(mbox) --- 79,120 ---- # if the pathname contains /Mail/, else a DirOfTxtFileMailbox. if os.path.exists(os.path.join(name, 'cur')): ! mbox = mailbox.Maildir(name, get_message) elif name.find("/Mail/") >= 0: ! mbox = mailbox.MHMailbox(name, get_message) else: ! mbox = DirOfTxtFileMailbox(name, get_message) else: fp = open(name, "rb") ! mbox = mailbox.PortableUnixMailbox(fp, get_message) return iter(mbox) + + def get_message(obj): + """Return an email Message object. + + The argument may be a Message object already, in which case it's + returned as-is. 
+ + If the argument is a string or file-like object (supports read()), + the email package is used to create a Message object from it. This + can fail if the message is malformed. In that case, the headers + (everything through the first blank line) are thrown out, and the + rest of the text is wrapped in a bare email.Message.Message. + """ + + if isinstance(obj, email.Message.Message): + return obj + # Create an email Message object. + if hasattr(obj, "read"): + obj = obj.read() + try: + msg = email.message_from_string(obj) + except email.Errors.MessageParseError: + # Wrap the raw text in a bare Message object. Since the + # headers are most likely damaged, we can't use the email + # package to parse them, so just get rid of them first. + i = obj.find('\n\n') + if i >= 0: + obj = obj[i+2:] # strip headers + msg = email.Message.Message() + msg.set_payload(obj) + return msg From montanaro@users.sourceforge.net Wed Nov 6 01:58:37 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 17:58:37 -0800 Subject: [Spambayes-checkins] spambayes mboxcount.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12636 Modified Files: mboxcount.py Log Message: replace _factory() with mboxutils.get_message() Index: mboxcount.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxcount.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** mboxcount.py 5 Sep 2002 16:16:43 -0000 1.1 --- mboxcount.py 6 Nov 2002 01:58:35 -0000 1.2 *************** *** 34,40 **** import glob ! program = sys.argv[0] ! _marker = object() def usage(code, msg=''): --- 34,40 ---- import glob ! from mboxutils import get_message ! 
program = sys.argv[0] def usage(code, msg=''): *************** *** 44,60 **** sys.exit(code) - def _factory(fp): - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return _marker - def count(fname): fp = open(fname, 'rb') ! mbox = mailbox.PortableUnixMailbox(fp, _factory) goodcount = 0 badcount = 0 for msg in mbox: ! if msg is _marker: badcount += 1 else: --- 44,54 ---- sys.exit(code) def count(fname): fp = open(fname, 'rb') ! mbox = mailbox.PortableUnixMailbox(fp, get_message) goodcount = 0 badcount = 0 for msg in mbox: ! if msg["to"] is None and msg["cc"] is None: badcount += 1 else: From montanaro@users.sourceforge.net Wed Nov 6 02:01:27 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 18:01:27 -0800 Subject: [Spambayes-checkins] spambayes split.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13359 Modified Files: split.py Log Message: replace _factory() with mboxutils.get_message() Index: split.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/split.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** split.py 5 Sep 2002 16:16:43 -0000 1.1 --- split.py 6 Nov 2002 02:01:25 -0000 1.2 *************** *** 32,35 **** --- 32,37 ---- import getopt + import mboxutils + program = sys.argv[0] *************** *** 44,55 **** - def _factory(fp): - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return '' - - - def main(): try: --- 46,49 ---- *************** *** 81,85 **** infp = open(mboxfile, 'rb') ! mbox = mailbox.PortableUnixMailbox(infp, _factory) for msg in mbox: if random.random() < percent: --- 75,79 ---- infp = open(mboxfile, 'rb') ! 
mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message) for msg in mbox: if random.random() < percent: From montanaro@users.sourceforge.net Wed Nov 6 02:02:10 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 18:02:10 -0800 Subject: [Spambayes-checkins] spambayes splitn.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13571 Modified Files: splitn.py Log Message: replace _factory() with mboxutils.get_message() Index: splitn.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitn.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** splitn.py 8 Sep 2002 17:41:56 -0000 1.2 --- splitn.py 6 Nov 2002 02:02:08 -0000 1.3 *************** *** 46,49 **** --- 46,51 ---- import getopt + import mboxutils + program = sys.argv[0] *************** *** 54,63 **** sys.exit(code) - def _factory(fp): - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return '' - def main(): try: --- 56,59 ---- *************** *** 89,93 **** for i in range(1, n+1)] ! mbox = mailbox.PortableUnixMailbox(infile, _factory) counter = 0 for msg in mbox: --- 85,89 ---- for i in range(1, n+1)] ! 
mbox = mailbox.PortableUnixMailbox(infile, mboxutils.get_message) counter = 0 for msg in mbox: From montanaro@users.sourceforge.net Wed Nov 6 02:02:46 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 18:02:46 -0800 Subject: [Spambayes-checkins] spambayes splitndirs.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13738 Modified Files: splitndirs.py Log Message: delete unused _factory() function Index: splitndirs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** splitndirs.py 24 Sep 2002 18:26:11 -0000 1.5 --- splitndirs.py 6 Nov 2002 02:02:43 -0000 1.6 *************** *** 63,72 **** sys.exit(code) - def _factory(fp): - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return '' - def main(): try: --- 63,66 ---- From montanaro@users.sourceforge.net Wed Nov 6 02:07:44 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 18:07:44 -0800 Subject: [Spambayes-checkins] spambayes hammie.py,1.35,1.36 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15267 Modified Files: hammie.py Log Message: use mboxutils.get_message() Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** hammie.py 3 Nov 2002 14:24:36 -0000 1.35 --- hammie.py 6 Nov 2002 02:07:42 -0000 1.36 *************** *** 263,270 **** """ ! if hasattr(msg, "readlines"): ! msg = email.message_from_file(msg) ! elif not hasattr(msg, "add_header"): ! msg = email.message_from_string(msg) prob, clues = self._scoremsg(msg, True) if prob < ham_cutoff: --- 263,267 ---- """ ! 
msg = mboxutils.get_message(msg) prob, clues = self._scoremsg(msg, True) if prob < ham_cutoff: From montanaro@users.sourceforge.net Wed Nov 6 02:12:49 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 05 Nov 2002 18:12:49 -0800 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.61,1.62 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16612 Modified Files: tokenizer.py Log Message: move Tokenizer.get_message() to mboxutils.py where it becomes the one true place to try and generate email.Message.Message objects. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.61 retrieving revision 1.62 diff -C2 -d -r1.61 -r1.62 *** tokenizer.py 4 Nov 2002 23:21:43 -0000 1.61 --- tokenizer.py 6 Nov 2002 02:12:47 -0000 1.62 *************** *** 14,17 **** --- 14,19 ---- from Options import options + from mboxutils import get_message + # Patch encodings.aliases to recognize 'ansi_x3_4_1968' from encodings.aliases import aliases # The aliases dictionary *************** *** 985,1017 **** def get_message(self, obj): ! """Return an email Message object. ! ! The argument may be a Message object already, in which case it's ! returned as-is. ! ! If the argument is a string or file-like object (supports read()), ! the email package is used to create a Message object from it. This ! can fail if the message is malformed. In that case, the headers ! (everything through the first blank line) are thrown out, and the ! rest of the text is wrapped in a bare email.Message.Message. ! """ ! ! if isinstance(obj, email.Message.Message): ! return obj ! # Create an email Message object. ! if hasattr(obj, "read"): ! obj = obj.read() ! try: ! msg = email.message_from_string(obj) ! except email.Errors.MessageParseError: ! # Wrap the raw text in a bare Message object. Since the ! 
# headers are most likely damaged, we can't use the email ! # package to parse them, so just get rid of them first. ! i = obj.find('\n\n') ! if i >= 0: ! obj = obj[i+2:] # strip headers ! msg = email.Message.Message() ! msg.set_payload(obj) ! return msg def tokenize(self, obj): --- 987,991 ---- def get_message(self, obj): ! return get_message(obj) def tokenize(self, obj): From anthonybaxter@users.sourceforge.net Wed Nov 6 20:07:37 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Wed, 06 Nov 2002 12:07:37 -0800 Subject: [Spambayes-checkins] website related.ht,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv3271 Modified Files: related.ht Log Message: couple more projects, from Alexandre Fayolle Index: related.ht =================================================================== RCS file: /cvsroot/spambayes/website/related.ht,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** related.ht 1 Nov 2002 04:06:49 -0000 1.3 --- related.ht 6 Nov 2002 20:07:34 -0000 1.4 *************** *** 12,16 ****
  • ifile, a Naive Bayes classification system.
  • PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification. !
  • ... --- 12,17 ----
  • ifile, a Naive Bayes classification system.
  • PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification. !
  • spamoracle, a Paul Graham based spam filter written in OCaml, designed for use with procmail. !
  • popfile, a pop3 proxy written in Perl with a Naive Bayes classifier. From anthonybaxter@users.sourceforge.net Wed Nov 6 22:12:52 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Wed, 06 Nov 2002 14:12:52 -0800 Subject: [Spambayes-checkins] spambayes table.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15111 Modified Files: table.py Log Message: added '-m' option to print means for each row. little bit of a cleanup. Index: table.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/table.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** table.py 26 Oct 2002 15:30:23 -0000 1.4 --- table.py 6 Nov 2002 22:12:48 -0000 1.5 *************** *** 2,6 **** """ ! table.py base1 base2 ... baseN Combines output from base1.txt, base2.txt, etc., which are created by --- 2,6 ---- """ ! table.py [-m] base1 base2 ... baseN Combines output from base1.txt, base2.txt, etc., which are created by *************** *** 8,15 **** comparison statistics to stdout. Each input file is represented by one column in the table. - """ ! import sys ! import re # Return --- 8,15 ---- comparison statistics to stdout. Each input file is represented by one column in the table. ! Optional argument -m shows a final column with the mean value of each ! statistic. ! """ # Return *************** *** 46,56 **** line = get() if line.startswith('-> tested'): ! # -> tested 1910 hams & 948 spams against 2741 hams & 948 spams ! # 0 1 2 3 4 5 6 print line, elif line.find(' items; mean ') > 0 and line.find('for all runs') > 0: ! # -> Ham scores for all runs: 2741 items; mean 0.86; sdev 6.28 ! # 0 1 2 vals = line.split(';') mean = float(vals[1].split()[-1]) --- 46,56 ---- line = get() if line.startswith('-> tested'): ! # tested 1910 hams & 948 spams against 2741 hams & 948 spams ! 
# 1 2 3 4 5 6 print line, elif line.find(' items; mean ') > 0 and line.find('for all runs') > 0: ! # Ham scores for all runs: 2741 items; mean 0.86; sdev 6.28 ! # 0 1 2 vals = line.split(';') mean = float(vals[1].split()[-1]) *************** *** 103,184 **** return fn ! fname = "filename: " ! fnam2 = " " ! ratio = "ham:spam: " ! rat2 = " " ! fptot = "fp total: " ! fpper = "fp %: " ! fntot = "fn total: " ! fnper = "fn %: " ! untot = "unsure t: " ! unper = "unsure %: " ! rcost = "real cost:" ! bcost = "best cost:" ! hmean = "h mean: " ! hsdev = "h sdev: " ! smean = "s mean: " ! ssdev = "s sdev: " ! meand = "mean diff:" ! kval = "k: " ! for filename in sys.argv[1:]: ! filename = windowsfy(filename) ! (htest, stest, fp, fn, un, fpp, fnp, unp, cost, bestcost, ! hamdevall, spamdevall) = suck(file(filename)) ! if filename.endswith('.txt'): ! filename = filename[:-4] ! filename = filename[filename.rfind('/')+1:] ! filename = filename[filename.rfind("\\")+1:] ! if len(fname) > len(fnam2): ! fname += " " ! fname = fname[0:(len(fnam2) + 8)] ! fnam2 += " %7s" % filename ! else: ! fnam2 += " " ! fnam2 = fnam2[0:(len(fname) + 8)] ! fname += " %7s" % filename ! if len(ratio) > len(rat2): ! ratio += " " ! ratio = ratio[0:(len(rat2) + 8)] ! rat2 += " %7s" % ("%d:%d" % (htest, stest)) ! else: ! rat2 += " " ! rat2 = rat2[0:(len(ratio) + 8)] ! ratio += " %7s" % ("%d:%d" % (htest, stest)) ! fptot += "%8d" % fp ! fpper += "%8.2f" % fpp ! fntot += "%8d" % fn ! fnper += "%8.2f" % fnp ! untot += "%8d" % un ! unper += "%8.2f" % unp ! rcost += "%8s" % ("$%.2f" % cost) ! bcost += "%8s" % ("$%.2f" % bestcost) ! hmean += "%8.2f" % hamdevall[0] ! hsdev += "%8.2f" % hamdevall[1] ! smean += "%8.2f" % spamdevall[0] ! ssdev += "%8.2f" % spamdevall[1] ! meand += "%8.2f" % (spamdevall[0] - hamdevall[0]) ! k = (spamdevall[0] - hamdevall[0]) / (spamdevall[1] + hamdevall[1]) ! kval += "%8.2f" % k ! print fname ! if len(fnam2.strip()) > 0: ! print fnam2 ! print ratio ! if len(rat2.strip()) > 0: ! 
print rat2 ! print fptot ! print fpper ! print fntot ! print fnper ! print untot ! print unper ! print rcost ! print bcost ! print hmean ! print hsdev ! print smean ! print ssdev ! print meand ! print kval --- 103,231 ---- return fn ! def table(): ! import getopt, sys ! showMean = 0 ! fname = "filename: " ! fnam2 = " " ! ratio = "ham:spam: " ! rat2 = " " ! fptot = "fp total: " ! fpper = "fp %: " ! fntot = "fn total: " ! fnper = "fn %: " ! untot = "unsure t: " ! unper = "unsure %: " ! rcost = "real cost:" ! bcost = "best cost:" ! hmean = "h mean: " ! hsdev = "h sdev: " ! smean = "s mean: " ! ssdev = "s sdev: " ! meand = "mean diff:" ! kval = "k: " ! ! tfptot = tfpper = tfntot = tfnper = tuntot = tunper = trcost = tbcost = \ ! thmean = thsdev = tsmean = tssdev = tmeand = tkval = 0 ! ! args, fileargs = getopt.getopt(sys.argv[1:], 'm') ! for arg, val in args: ! if arg == "-m": ! showMean = 1 ! ! for filename in fileargs: ! filename = windowsfy(filename) ! (htest, stest, fp, fn, un, fpp, fnp, unp, cost, bestcost, ! hamdevall, spamdevall) = suck(file(filename)) ! if filename.endswith('.txt'): ! filename = filename[:-4] ! filename = filename[filename.rfind('/')+1:] ! filename = filename[filename.rfind("\\")+1:] ! if len(fname) > len(fnam2): ! fname += " " ! fname = fname[0:(len(fnam2) + 8)] ! fnam2 += " %7s" % filename ! else: ! fnam2 += " " ! fnam2 = fnam2[0:(len(fname) + 8)] ! fname += " %7s" % filename ! if len(ratio) > len(rat2): ! ratio += " " ! ratio = ratio[0:(len(rat2) + 8)] ! rat2 += " %7s" % ("%d:%d" % (htest, stest)) ! else: ! rat2 += " " ! rat2 = rat2[0:(len(ratio) + 8)] ! ratio += " %7s" % ("%d:%d" % (htest, stest)) ! fptot += "%8d" % fp ! tfptot += fp ! fpper += "%8.2f" % fpp ! tfpper += fpp ! fntot += "%8d" % fn ! tfntot += fn ! fnper += "%8.2f" % fnp ! tfnper += fnp ! untot += "%8d" % un ! tuntot += un ! unper += "%8.2f" % unp ! tunper += unp ! rcost += "%8s" % ("$%.2f" % cost) ! trcost += cost ! bcost += "%8s" % ("$%.2f" % bestcost) ! tbcost += bestcost ! 
hmean += "%8.2f" % hamdevall[0] ! thmean += hamdevall[0] ! hsdev += "%8.2f" % hamdevall[1] ! thsdev += hamdevall[1] ! smean += "%8.2f" % spamdevall[0] ! tsmean += spamdevall[0] ! ssdev += "%8.2f" % spamdevall[1] ! tssdev += spamdevall[1] ! meand += "%8.2f" % (spamdevall[0] - hamdevall[0]) ! tmeand += (spamdevall[0] - hamdevall[0]) ! k = (spamdevall[0] - hamdevall[0]) / (spamdevall[1] + hamdevall[1]) ! kval += "%8.2f" % k ! tkval += k ! ! nfiles = len(fileargs) ! if nfiles and showMean: ! fptot += "%12d" % (tfptot/nfiles) ! fpper += "%12.2f" % (tfpper/nfiles) ! fntot += "%12d" % (tfntot/nfiles) ! fnper += "%12.2f" % (tfnper/nfiles) ! untot += "%12d" % (tuntot/nfiles) ! unper += "%12.2f" % (tunper/nfiles) ! rcost += "%12s" % ("$%.2f" % (trcost/nfiles)) ! bcost += "%12s" % ("$%.2f" % (tbcost/nfiles)) ! hmean += "%12.2f" % (thmean/nfiles) ! hsdev += "%12.2f" % (thsdev/nfiles) ! smean += "%12.2f" % (tsmean/nfiles) ! ssdev += "%12.2f" % (tssdev/nfiles) ! meand += "%12.2f" % (tmeand/nfiles) ! kval += "%12.2f" % (tkval/nfiles) ! ! print fname ! if len(fnam2.strip()) > 0: ! print fnam2 ! print ratio ! if len(rat2.strip()) > 0: ! print rat2 ! print fptot ! print fpper ! print fntot ! print fnper ! print untot ! print unper ! print rcost ! print bcost ! print hmean ! print hsdev ! print smean ! print ssdev ! print meand ! print kval ! ! if __name__ == "__main__": ! table() From mhammond@users.sourceforge.net Thu Nov 7 02:54:18 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Wed, 06 Nov 2002 18:54:18 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs ManagerDialog.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs In directory usw-pr-cvs1:/tmp/cvs-serv18380/dialogs Modified Files: ManagerDialog.py Log Message: As per report on mailing list, don't insist on an "Unsure" folder before filtering can be enabled. Also wrapped a few long lines. 
Index: ManagerDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** ManagerDialog.py 5 Nov 2002 21:51:53 -0000 1.6 --- ManagerDialog.py 7 Nov 2002 02:54:16 -0000 1.7 *************** *** 69,74 **** self.checkbox_items = [ (IDC_BUT_FILTER_ENABLE, "self.mgr.config.filter.enabled"), ! (IDC_BUT_TRAIN_FROM_SPAM_FOLDER, "self.mgr.config.training.train_recovered_spam"), ! (IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"), ] --- 69,76 ---- self.checkbox_items = [ (IDC_BUT_FILTER_ENABLE, "self.mgr.config.filter.enabled"), ! (IDC_BUT_TRAIN_FROM_SPAM_FOLDER, ! "self.mgr.config.training.train_recovered_spam"), ! (IDC_BUT_TRAIN_TO_SPAM_FOLDER, ! "self.mgr.config.training.train_manual_spam"), ] *************** *** 105,114 **** ok_to_enable = operator.truth(config.watch_folder_ids) if not ok_to_enable: ! filter_status = "You must define folders to watch for new messages" if ok_to_enable: ok_to_enable = nspam >= min_spam and nham >= min_ham if not ok_to_enable: ! filter_status = "There must be %d good and %d spam messages\n" \ ! "trained before filtering can be enabled" \ % (min_ham, min_spam) if ok_to_enable: --- 107,118 ---- ok_to_enable = operator.truth(config.watch_folder_ids) if not ok_to_enable: ! filter_status = "You must define folders to watch "\ ! "for new messages" if ok_to_enable: ok_to_enable = nspam >= min_spam and nham >= min_ham if not ok_to_enable: ! filter_status = "There must be %d good and %d spam " \ ! "messages\ntrained before filtering " \ ! "can be enabled" \ % (min_ham, min_spam) if ok_to_enable: *************** *** 116,137 **** ok_to_enable = operator.truth(config.spam_folder_id) if ok_to_enable: ! certain_spam_name = self.mgr.FormatFolderNames([config.spam_folder_id], False) ! ok_to_enable = operator.truth(config.unsure_folder_id) ! if ok_to_enable: ! 
unsure_name = self.mgr.FormatFolderNames([config.unsure_folder_id], False) else: ! filter_status = "You must define the folder to receive your possible spam" else: ! filter_status = "You must define the folder to receive your certain spam" ! # whew if ok_to_enable: ! watch_names = self.mgr.FormatFolderNames(config.watch_folder_ids, config.watch_include_sub) ! filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \ ! % (watch_names, certain_spam_name, unsure_name) self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable) enabled = config.enabled ! self.GetDlgItem(IDC_BUT_FILTER_ENABLE).SetCheck(ok_to_enable and enabled) self.SetDlgItemText(IDC_FILTER_STATUS, filter_status) --- 120,148 ---- ok_to_enable = operator.truth(config.spam_folder_id) if ok_to_enable: ! certain_spam_name = self.mgr.FormatFolderNames( ! [config.spam_folder_id], False) ! if config.unsure_folder_id: ! unsure_name = self.mgr.FormatFolderNames( ! [config.unsure_folder_id], False) ! unsure_text = "unsure managed in '%s'" % (unsure_name,) else: ! unsure_text = "unsure messages untouched" else: ! filter_status = "You must define the folder to " \ ! "receive your certain spam" ! # whew if ok_to_enable: ! watch_names = self.mgr.FormatFolderNames( ! config.watch_folder_ids, config.watch_include_sub) ! filter_status = "Watching '%s'. Spam managed in '%s', %s" \ ! % (watch_names, ! certain_spam_name, ! unsure_text) self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable) enabled = config.enabled ! self.GetDlgItem(IDC_BUT_FILTER_ENABLE).SetCheck( ! ok_to_enable and enabled) self.SetDlgItemText(IDC_FILTER_STATUS, filter_status) *************** *** 139,143 **** if code == win32con.BN_CLICKED: ! fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html") fname = os.path.abspath(fname) print fname --- 150,156 ---- if code == win32con.BN_CLICKED: ! fname = os.path.join(os.path.dirname(__file__), ! os.pardir, ! 
"about.html") fname = os.path.abspath(fname) print fname From mhammond@users.sourceforge.net Thu Nov 7 05:05:06 2002 From: mhammond@users.sourceforge.net (Mark Hammond) Date: Wed, 06 Nov 2002 21:05:06 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.27,1.28 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv3431 Modified Files: addin.py Log Message: Revamp the "delete as spam" and "recover from spam" buttons - now 2 buttons, and the visibility state changes depending on the folder. The "unsure" folder now has both buttons available. Probably lighter on Outlook too, as all we do now is toggle a Visible property on a folder change event. Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** addin.py 4 Nov 2002 22:50:41 -0000 1.27 --- addin.py 7 Nov 2002 05:05:03 -0000 1.28 *************** *** 239,278 **** new_msg.Display() ! # The "Delete As Spam" and "Recover Spam" button ! # The event from Outlook's explorer that our folder has changed. ! class ButtonDeleteAsExplorerEvent: ! def Init(self, but): ! self.but = but ! def Close(self): ! self.but = None ! def OnFolderSwitch(self): ! self.but._UpdateForFolderChange() ! ! class ButtonDeleteAsEvent: ! def Init(self, manager, application, explorer): ! # NOTE - keeping a reference to 'explorer' in this event ! # appears to cause an Outlook circular reference, and outlook ! # never terminates (it does close, but the process remains alive) ! # This is why we needed to use WithEvents, so the event class ! # itself doesnt keep such a reference (and we need to keep a ref ! # to the event class so it doesn't auto-disconnect!) self.manager = manager self.application = application ! self.explorer_events = WithEvents(explorer, ! ButtonDeleteAsExplorerEvent) ! self.set_for_as_spam = None ! 
self.explorer_events.Init(self) ! self._UpdateForFolderChange() ! def Close(self): ! self.manager = self.application = self.explorer = None ! ! def _UpdateForFolderChange(self): explorer = self.application.ActiveExplorer() if explorer is None: print "** Folder Change, but don't have an explorer" return outlook_folder = explorer.CurrentFolder ! is_spam = False if outlook_folder is not None: mapi_folder = self.manager.message_store.GetFolder(outlook_folder) --- 239,262 ---- new_msg.Display() ! # Events from our Explorer instance - currently used to enable/disable ! # controls ! class ExplorerEvent: ! def Init(self, manager, application, but_delete_as, but_recover_as): self.manager = manager self.application = application ! self.but_delete_as = but_delete_as ! self.but_recover_as = but_recover_as def Close(self): ! self.but_delete_as = self.but_recover_as = None ! def OnFolderSwitch(self): ! # Work out what folder we are in. explorer = self.application.ActiveExplorer() if explorer is None: print "** Folder Change, but don't have an explorer" return + outlook_folder = explorer.CurrentFolder ! show_delete_as = True ! show_recover_as = False if outlook_folder is not None: mapi_folder = self.manager.message_store.GetFolder(outlook_folder) *************** *** 281,314 **** look_folder = self.manager.message_store.GetFolder(look_id) if mapi_folder == look_folder: ! is_spam = True ! if not is_spam: ! look_id = self.manager.config.filter.unsure_folder_id ! if look_id: ! look_folder = self.manager.message_store.GetFolder(look_id) ! if mapi_folder == look_folder: ! is_spam = True ! if is_spam: ! set_for_as_spam = False ! else: ! set_for_as_spam = True ! if set_for_as_spam != self.set_for_as_spam: ! if set_for_as_spam: ! image = "delete_as_spam.bmp" ! self.Caption = "Delete As Spam" ! self.TooltipText = \ "Move the selected message to the Spam folder,\n" \ "and train the system that this is Spam." else: ! image = "recover_ham.bmp" ! self.Caption = "Recover from Spam" ! 
self.TooltipText = \ ! "Recovers the selected item back to the folder\n" \ ! "it was filtered from (or to the Inbox if this\n" \ ! "folder is not known), and trains the system that\n" \ ! "this is a good message\n" ! # Set the image. ! print "Setting image to", image ! SetButtonImage(self, image) ! self.set_for_as_spam = set_for_as_spam def OnClick(self, button, cancel): --- 265,341 ---- look_folder = self.manager.message_store.GetFolder(look_id) if mapi_folder == look_folder: ! # This is the Spam folder - only show "recover" ! show_recover_as = True ! show_delete_as = False ! # Check if uncertain ! look_id = self.manager.config.filter.unsure_folder_id ! if look_id: ! look_folder = self.manager.message_store.GetFolder(look_id) ! if mapi_folder == look_folder: ! show_recover_as = True ! show_delete_as = True ! self.but_recover_as.Visible = show_recover_as ! self.but_delete_as.Visible = show_delete_as ! ! # The "Delete As Spam" and "Recover Spam" button ! # The event from Outlook's explorer that our folder has changed. ! class ButtonDeleteAsEventBase: ! def Init(self, manager, application): ! # NOTE - keeping a reference to 'explorer' in this event ! # appears to cause an Outlook circular reference, and outlook ! # never terminates (it does close, but the process remains alive) ! # This is why we needed to use WithEvents, so the event class ! # itself doesnt keep such a reference (and we need to keep a ref ! # to the event class so it doesn't auto-disconnect!) ! self.manager = manager ! self.application = application ! ! def Close(self): ! self.manager = self.application = None ! ! class ButtonDeleteAsSpamEvent(ButtonDeleteAsEventBase): ! def Init(self, manager, application): ! ButtonDeleteAsEventBase.Init(self, manager, application) ! image = "delete_as_spam.bmp" ! self.Caption = "Delete As Spam" ! self.TooltipText = \ "Move the selected message to the Spam folder,\n" \ "and train the system that this is Spam." 
+ SetButtonImage(self, image) + + def OnClick(self, button, cancel): + msgstore = self.manager.message_store + msgstore_messages = self.manager.addin.GetSelectedMessages(True) + if not msgstore_messages: + return + # Delete this item as spam. + spam_folder_id = self.manager.config.filter.spam_folder_id + spam_folder = msgstore.GetFolder(spam_folder_id) + if not spam_folder: + win32ui.MessageBox("You must configure the Spam folder", + "Invalid Configuration") + return + import train + for msgstore_message in msgstore_messages: + # Must train before moving, else we lose the message! + print "Training on message - ", + if train.train_message(msgstore_message, True, self.manager, rescore = True): + print "trained as spam" else: ! print "already was trained as spam" ! # Now move it. ! msgstore_message.MoveTo(spam_folder) ! ! class ButtonRecoverFromSpamEvent(ButtonDeleteAsEventBase): ! def Init(self, manager, application): ! ButtonDeleteAsEventBase.Init(self, manager, application) ! image = "recover_ham.bmp" ! self.Caption = "Recover from Spam" ! self.TooltipText = \ ! "Recovers the selected item back to the folder\n" \ ! "it was filtered from (or to the Inbox if this\n" \ ! "folder is not known), and trains the system that\n" \ ! "this is a good message\n" ! SetButtonImage(self, image) def OnClick(self, button, cancel): *************** *** 317,340 **** if not msgstore_messages: return ! if self.set_for_as_spam: ! # Delete this item as spam. ! spam_folder_id = self.manager.config.filter.spam_folder_id ! spam_folder = msgstore.GetFolder(spam_folder_id) ! if not spam_folder: ! win32ui.MessageBox("You must configure the Spam folder", ! "Invalid Configuration") ! return ! import train ! for msgstore_message in msgstore_messages: ! # Must train before moving, else we lose the message! ! print "Training on message - ", ! if train.train_message(msgstore_message, True, self.manager, rescore = True): ! print "trained as spam" ! else: ! print "already was trained as spam" ! 
# Now move it. ! msgstore_message.MoveTo(spam_folder) ! else: ! win32ui.MessageBox("Please be patient ") # Helpers to work with images on buttons/toolbars. --- 344,364 ---- if not msgstore_messages: return ! # Recover to where they were moved from ! # Get the inbox as the default place to restore to ! # (incase we dont know (early code) or folder removed etc ! inbox_folder = msgstore.GetFolder( ! self.application.Session.GetDefaultFolder( ! constants.olFolderInbox)) ! import train ! for msgstore_message in msgstore_messages: ! # Must train before moving, else we lose the message! ! print "Training on message - ", ! if train.train_message(msgstore_message, False, self.manager, rescore = True): ! print "trained as ham" ! else: ! print "already was trained as ham" ! # Now move it. ! # XXX - still don't write the source, so no point looking :( ! msgstore_message.MoveTo(inbox_folder) # Helpers to work with images on buttons/toolbars. *************** *** 379,382 **** --- 403,407 ---- assert self.manager.addin is None, "Should not already have an addin" self.manager.addin = self + self.explorer_events = None # ActiveExplorer may be none when started without a UI (eg, WinCE synchronisation) *************** *** 385,414 **** bars = activeExplorer.CommandBars toolbar = bars.Item("Standard") ! # Add our "Delete as ..." button ! button = toolbar.Controls.Add(Type=constants.msoControlButton, Temporary=True) # Hook events for the item button.BeginGroup = True ! button = DispatchWithEvents(button, ButtonDeleteAsEvent) ! button.Init(self.manager, application, activeExplorer) self.buttons.append(button) # Add a pop-up menu to the toolbar ! popup = toolbar.Controls.Add(Type=constants.msoControlPopup, Temporary=True) popup.Caption="Anti-Spam" popup.TooltipText = "Anti-Spam filters and functions" popup.Enabled = True ! # Convert from "CommandBarItem" to derived "CommandBarPopup" ! # Not sure if we should be able to work this out ourselves, but no ! 
# introspection I tried seemed to indicate we can. VB does it via ! # strongly-typed declarations. popup = CastTo(popup, "CommandBarPopup") # And add our children. - self._AddPopup(popup, ShowClues, (self.manager, application), - Caption="Show spam clues for current message", - Enabled=True) self._AddPopup(popup, manager.ShowManager, (self.manager,), Caption="Anti-Spam Manager...", TooltipText = "Show the Anti-Spam manager dialog.", Enabled = True) self.FiltersChanged() --- 410,460 ---- bars = activeExplorer.CommandBars toolbar = bars.Item("Standard") ! # Add our "Delete as ..." and "Recover as" buttons ! but_delete_as = button = toolbar.Controls.Add( ! Type=constants.msoControlButton, ! Temporary=True) # Hook events for the item button.BeginGroup = True ! button = DispatchWithEvents(button, ButtonDeleteAsSpamEvent) ! button.Init(self.manager, application) self.buttons.append(button) + # And again for "Recover as" + but_recover_as = button = toolbar.Controls.Add( + Type=constants.msoControlButton, + Temporary=True) + button = DispatchWithEvents(button, ButtonRecoverFromSpamEvent) + self.buttons.append(button) + # Hook our explorer events, and pass the buttons. + button.Init(self.manager, application) + + self.explorer_events = WithEvents(activeExplorer, + ExplorerEvent) + self.explorer_events.Init(self.manager, application, but_delete_as, but_recover_as) + # And prime the event handler. + self.explorer_events.OnFolderSwitch() + + # The main tool-bar dropdown with all out entries. # Add a pop-up menu to the toolbar ! popup = toolbar.Controls.Add( ! Type=constants.msoControlPopup, ! Temporary=True) popup.Caption="Anti-Spam" popup.TooltipText = "Anti-Spam filters and functions" popup.Enabled = True ! # Convert from "CommandBarItem" to derived ! # "CommandBarPopup" Not sure if we should be able to work ! # this out ourselves, but no introspection I tried seemed ! # to indicate we can. VB does it via strongly-typed ! # declarations. 
popup = CastTo(popup, "CommandBarPopup") # And add our children. self._AddPopup(popup, manager.ShowManager, (self.manager,), Caption="Anti-Spam Manager...", TooltipText = "Show the Anti-Spam manager dialog.", Enabled = True) + self._AddPopup(popup, ShowClues, (self.manager, application), + Caption="Show spam clues for current message", + Enabled=True) self.FiltersChanged() *************** *** 499,506 **** --- 545,556 ---- self.manager.Close() self.manager = None + + if self.explorer_events is not None: + self.explorer_events = None if self.buttons: for button in self.buttons: button.Close() self.buttons = None + print "Addin terminating: %d COM client and %d COM servers exist." \ % (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount()) *************** *** 514,522 **** def OnAddInsUpdate(self, custom): ! print "SpamAddin - OnAddInsUpdate", custom def OnStartupComplete(self, custom): ! print "SpamAddin - OnStartupComplete", custom def OnBeginShutdown(self, custom): ! print "SpamAddin - OnBeginShutdown", custom def RegisterAddin(klass): --- 564,572 ---- def OnAddInsUpdate(self, custom): ! pass def OnStartupComplete(self, custom): ! pass def OnBeginShutdown(self, custom): ! pass def RegisterAddin(klass): From tim.one@comcast.net Thu Nov 7 05:58:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 07 Nov 2002 00:58:58 -0500 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.27,1.28 In-Reply-To: Message-ID: [Mark Hammond] > Modified Files: > addin.py > Log Message: > Revamp the "delete as spam" and "recover from spam" buttons - now 2 > buttons, and the visibility state changes depending on the folder. Wow -- a 21KB patch to change a button. I *knew* there was a reason I always left this stuff to you . 
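The train-then-move ordering used by both OnClick handlers in the patch above can be sketched outside Outlook. This is only an illustration under stated assumptions: FakeMessage, move_to, delete_as_spam and the dict-based training record are hypothetical stand-ins, not part of addin.py, train.py or msgstore.py. The sketch mirrors two things visible in the diff: training happens before the move, and the training call reports whether anything actually changed.

```python
# Hypothetical stand-ins for illustration; none of these names come from
# the Outlook add-in itself.
class FakeMessage:
    def __init__(self, key, folder):
        self.key = key        # stable identifier for the message
        self.folder = folder  # name of the folder currently holding it

    def move_to(self, folder):
        self.folder = folder


def train_message(msg, is_spam, trained):
    """Record msg as spam or ham; return True only if the record changed."""
    if trained.get(msg.key) == is_spam:
        return False          # already trained this way, nothing to do
    trained[msg.key] = is_spam
    return True


def delete_as_spam(messages, spam_folder, trained):
    """Train each message as spam, then move it to spam_folder."""
    results = []
    for msg in messages:
        # Train first, then move, matching the order the diff insists on.
        results.append(train_message(msg, True, trained))
        msg.move_to(spam_folder)
    return results
```

Calling delete_as_spam twice on the same messages should report a change the first time and no change the second, which corresponds to the "trained as spam" and "already was trained as spam" branches in the handler.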
From jvr@users.sourceforge.net Thu Nov 7 22:27:05 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:27:05 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.10,1.11 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv23622 Modified Files: pop3proxy.py Log Message: - added True/False for compatibilty with Python 2.2 - write out trained messages to files, to make it easier to rebuild the database Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** pop3proxy.py 5 Nov 2002 22:18:56 -0000 1.10 --- pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11 *************** *** 28,31 **** --- 28,34 ---- For safety, and to help debugging, the whole POP3 conversation is written out to _pop3proxy.log for each run. + + To make rebuilding the database easier, trained messages are appended + to _pop3proxyham.mbox and _pop3proxyspam.mbox. """ *************** *** 37,40 **** --- 40,49 ---- __credits__ = "Tim Peters, Neale Pickett, all the spambayes contributors." + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + import sys, re, operator, errno, getopt, cPickle, cStringIO, time *************** *** 609,614 **** def onUpload(self, params): ! message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') self.bayes.learn(tokenizer.tokenize(message), isSpam, True) self.push("""

    Trained on your message. Saving database...

    """) --- 618,634 ---- def onUpload(self, params): ! message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild + # the database later. + message = message.replace('\r\n', '\n').replace('\r', '\n') + if isSpam: + f = open("_pop3proxyspam.mbox", "a") + else: + f = open("_pop3proxyham.mbox", "a") + f.write("From ???@???\n") # fake From line (XXX good enough?) + f.write(message) + f.write("\n") + f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) self.push("""

    Trained on your message. Saving database...

    """) From jvr@users.sourceforge.net Thu Nov 7 22:30:12 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:30:12 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.28,1.29 config.py,1.3,1.4 filter.py,1.12,1.13 manager.py,1.32,1.33 msgstore.py,1.22,1.23 train.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv25250/Outlook2000 Modified Files: addin.py config.py filter.py manager.py msgstore.py train.py Log Message: Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py. Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** addin.py 7 Nov 2002 05:05:03 -0000 1.28 --- addin.py 7 Nov 2002 22:30:08 -0000 1.29 *************** *** 4,7 **** --- 4,14 ---- import warnings + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + if sys.version_info >= (2, 3): # sick off the new hex() warnings! Index: config.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/config.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** config.py 31 Oct 2002 21:56:59 -0000 1.3 --- config.py 7 Nov 2002 22:30:09 -0000 1.4 *************** *** 3,6 **** --- 3,13 ---- # or as a module. 
+ try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + class _ConfigurationContainer: def __init__(self, **kw): Index: filter.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** filter.py 1 Nov 2002 02:03:42 -0000 1.12 --- filter.py 7 Nov 2002 22:30:09 -0000 1.13 *************** *** 4,8 **** # Copyright PSF, license under the PSF license ! def filter_message(msg, mgr, all_actions = True): config = mgr.config.filter prob = mgr.score(msg) --- 4,15 ---- # Copyright PSF, license under the PSF license ! try: ! True, False ! except NameError: ! # Maintain compatibility with Python 2.2 ! True, False = 1, 0 ! ! ! def filter_message(msg, mgr, all_actions=True): config = mgr.config.filter prob = mgr.score(msg) Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** manager.py 4 Nov 2002 00:50:09 -0000 1.32 --- manager.py 7 Nov 2002 22:30:09 -0000 1.33 *************** *** 13,16 **** --- 13,22 ---- try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + try: this_filename = os.path.abspath(__file__) except NameError: *************** *** 83,87 **** return ret ! def EnsureOutlookFieldsForFolder(self, folder_id, include_sub = False): # Ensure that our fields exist on the Outlook *folder* # Setting properties via our msgstore (via Ext Mapi) gets the props --- 89,93 ---- return ret ! 
def EnsureOutlookFieldsForFolder(self, folder_id, include_sub=False): # Ensure that our fields exist on the Outlook *folder* # Setting properties via our msgstore (via Ext Mapi) gets the props Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** msgstore.py 5 Nov 2002 11:44:27 -0000 1.22 --- msgstore.py 7 Nov 2002 22:30:09 -0000 1.23 *************** *** 3,6 **** --- 3,12 ---- import sys, os + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + # Abstract definition - can be moved out when we have more than one sub-class Index: train.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** train.py 4 Nov 2002 22:50:41 -0000 1.15 --- train.py 7 Nov 2002 22:30:09 -0000 1.16 *************** *** 7,10 **** --- 7,17 ---- from win32com.mapi import mapi + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + # Note our Message Database uses PR_SEARCH_KEY, *not* PR_ENTRYID, as the # latter changes after a Move operation - see msgstore.py From jvr@users.sourceforge.net Thu Nov 7 22:30:13 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:30:13 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam folder.py,1.1,1.2 profile.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv25250/pspam/pspam Modified Files: folder.py profile.py Log Message: Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py. 
Index: folder.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/folder.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** folder.py 4 Nov 2002 04:44:20 -0000 1.1 --- folder.py 7 Nov 2002 22:30:11 -0000 1.2 *************** *** 10,13 **** --- 10,20 ---- from pspam.message import PMessage + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + def factory(fp): try: Index: profile.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** profile.py 4 Nov 2002 21:25:54 -0000 1.2 --- profile.py 7 Nov 2002 22:30:11 -0000 1.3 *************** *** 14,17 **** --- 14,24 ---- import os + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + def open_folders(dir, names, klass): L = [] From jvr@users.sourceforge.net Thu Nov 7 22:30:13 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:30:13 -0800 Subject: [Spambayes-checkins] spambayes/pspam pop.py,1.2,1.3 scoremsg.py,1.1,1.2 update.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam In directory usw-pr-cvs1:/tmp/cvs-serv25250/pspam Modified Files: pop.py scoremsg.py update.py Log Message: Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py. 
Index: pop.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pop.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** pop.py 5 Nov 2002 22:57:27 -0000 1.2 --- pop.py 7 Nov 2002 22:30:10 -0000 1.3 *************** *** 45,48 **** --- 45,55 ---- from pspam.options import options + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + HEADER = "X-Spambayes: %5.3f\r\n" HEADER_SIZE = len(HEADER % 0.0) Index: scoremsg.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/scoremsg.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** scoremsg.py 4 Nov 2002 04:44:19 -0000 1.1 --- scoremsg.py 7 Nov 2002 22:30:10 -0000 1.2 *************** *** 12,15 **** --- 12,22 ---- import pspam.options + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + def main(fp): cs = ClientStorage("/var/tmp/zeospam") Index: update.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/update.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** update.py 4 Nov 2002 04:44:19 -0000 1.1 --- update.py 7 Nov 2002 22:30:10 -0000 1.2 *************** *** 10,13 **** --- 10,20 ---- from pspam.options import options + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + def folder_exists(L, p): """Return true folder with path p exists in list L.""" From jvr@users.sourceforge.net Thu Nov 7 22:30:12 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:30:12 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs AsyncDialog.py,1.2,1.3 FilterDialog.py,1.10,1.11 FolderSelector.py,1.8,1.9 ManagerDialog.py,1.7,1.8 Message-ID: Update of 
/cvsroot/spambayes/spambayes/Outlook2000/dialogs In directory usw-pr-cvs1:/tmp/cvs-serv25250/Outlook2000/dialogs Modified Files: AsyncDialog.py FilterDialog.py FolderSelector.py ManagerDialog.py Log Message: Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py. Index: AsyncDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/AsyncDialog.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** AsyncDialog.py 19 Oct 2002 18:14:01 -0000 1.2 --- AsyncDialog.py 7 Nov 2002 22:30:10 -0000 1.3 *************** *** 6,9 **** --- 6,15 ---- import win32api + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + IDC_START = 1100 Index: FilterDialog.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** FilterDialog.py 2 Nov 2002 17:27:44 -0000 1.10 --- FilterDialog.py 7 Nov 2002 22:30:10 -0000 1.11 *************** *** 11,14 **** --- 11,21 ---- from DialogGlobals import * + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + IDC_FOLDER_WATCH = 1024 IDC_BROWSE_WATCH = 1025 Index: FolderSelector.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** FolderSelector.py 2 Nov 2002 17:11:47 -0000 1.8 --- FolderSelector.py 7 Nov 2002 22:30:10 -0000 1.9 *************** *** 9,12 **** --- 9,19 ---- from DialogGlobals import * + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + # Helpers for building the folder list class FolderSpec: Index: ManagerDialog.py 
=================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** ManagerDialog.py 7 Nov 2002 02:54:16 -0000 1.7 --- ManagerDialog.py 7 Nov 2002 22:30:10 -0000 1.8 *************** *** 11,14 **** --- 11,21 ---- from DialogGlobals import * + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + IDC_BUT_ABOUT = 1024 IDC_BUT_TRAIN_FROM_SPAM_FOLDER = 1025 From jvr@users.sourceforge.net Thu Nov 7 22:30:40 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:30:40 -0800 Subject: [Spambayes-checkins] spambayes README.txt,1.40,1.41 TestDriver.py,1.27,1.28 Tester.py,1.7,1.8 chi2.py,1.7,1.8 classifier.py,1.48,1.49 hammie.py,1.36,1.37 hammiesrv.py,1.9,1.10 mboxcount.py,1.2,1.3 mboxtest.py,1.9,1.10 neiltrain.py,1.3,1.4 rebal.py,1.8,1.9 sets.py,1.1,1.2 splitn.py,1.3,1.4 splitndirs.py,1.6,1.7 tokenizer.py,1.62,1.63 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25250 Modified Files: README.txt TestDriver.py Tester.py chi2.py classifier.py hammie.py hammiesrv.py mboxcount.py mboxtest.py neiltrain.py rebal.py sets.py splitn.py splitndirs.py tokenizer.py Log Message: Mass checkin: Remain compatible with Python 2.2. Only tested with pop3proxy.py. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** README.txt 27 Oct 2002 22:04:32 -0000 1.40 --- README.txt 7 Nov 2002 22:30:02 -0000 1.41 *************** *** 24,28 **** too small to measure reliably across that much training data. ! The code in this project requires Python 2.2.1 (or later). --- 24,28 ---- too small to measure reliably across that much training data. ! 
The code in this project requires Python 2.2 (or later). Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** TestDriver.py 20 Oct 2002 05:19:48 -0000 1.27 --- TestDriver.py 7 Nov 2002 22:30:04 -0000 1.28 *************** *** 31,34 **** --- 31,41 ---- from Histogram import Hist + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + def printhist(tag, ham, spam, nbuckets=options.nbuckets): print Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** Tester.py 20 Oct 2002 04:01:08 -0000 1.7 --- Tester.py 7 Nov 2002 22:30:04 -0000 1.8 *************** *** 1,4 **** --- 1,11 ---- from Options import options + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + class Test: # Pass a classifier instance (an instance of Bayes). Index: chi2.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/chi2.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** chi2.py 16 Oct 2002 21:31:19 -0000 1.7 --- chi2.py 7 Nov 2002 22:30:05 -0000 1.8 *************** *** 1,4 **** --- 1,11 ---- import math as _math + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + def chi2Q(x2, v, exp=_math.exp, min=min): """Return prob(chisq >= x2, with v degrees of freedom). 
Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.48 retrieving revision 1.49 diff -C2 -d -r1.48 -r1.49 *** classifier.py 4 Nov 2002 21:24:52 -0000 1.48 --- classifier.py 7 Nov 2002 22:30:05 -0000 1.49 *************** *** 37,40 **** --- 37,48 ---- from Options import options from chi2 import chi2Q + + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + LN2 = math.log(2) # used frequently by chi-combining Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** hammie.py 6 Nov 2002 02:07:42 -0000 1.36 --- hammie.py 7 Nov 2002 22:30:05 -0000 1.37 *************** *** 53,56 **** --- 53,63 ---- from Options import options + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + program = sys.argv[0] # For usage(); referenced by docstring above Index: hammiesrv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** hammiesrv.py 1 Nov 2002 02:55:32 -0000 1.9 --- hammiesrv.py 7 Nov 2002 22:30:06 -0000 1.10 *************** *** 30,33 **** --- 30,40 ---- import hammie + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + program = sys.argv[0] # For usage(); referenced by docstring above Index: mboxcount.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxcount.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** mboxcount.py 6 Nov 2002 01:58:35 -0000 1.2 --- mboxcount.py 7 Nov 2002 22:30:07 -0000 1.3 
*************** *** 36,39 **** --- 36,46 ---- from mboxutils import get_message + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + program = sys.argv[0] Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** mboxtest.py 23 Sep 2002 21:20:10 -0000 1.9 --- mboxtest.py 7 Nov 2002 22:30:07 -0000 1.10 *************** *** 33,36 **** --- 33,43 ---- from Options import options + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + mbox_fmts = {"unix": mailbox.PortableUnixMailbox, "mmdf": mailbox.MmdfMailbox, Index: neiltrain.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** neiltrain.py 27 Sep 2002 21:18:18 -0000 1.3 --- neiltrain.py 7 Nov 2002 22:30:07 -0000 1.4 *************** *** 13,16 **** --- 13,23 ---- import mboxutils + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + program = sys.argv[0] # For usage(); referenced by docstring above Index: rebal.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rebal.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** rebal.py 29 Sep 2002 16:55:10 -0000 1.8 --- rebal.py 7 Nov 2002 22:30:07 -0000 1.9 *************** *** 46,49 **** --- 46,56 ---- import getopt + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + # defaults NPERDIR = 4000 Index: sets.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/sets.py,v retrieving revision 1.1 retrieving revision 1.2 
diff -C2 -d -r1.1 -r1.2 *** sets.py 22 Sep 2002 06:58:36 -0000 1.1 --- sets.py 7 Nov 2002 22:30:07 -0000 1.2 *************** *** 60,63 **** --- 60,70 ---- + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + class BaseSet(object): """Common base class for mutable and immutable sets.""" Index: splitn.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitn.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** splitn.py 6 Nov 2002 02:02:08 -0000 1.3 --- splitn.py 7 Nov 2002 22:30:08 -0000 1.4 *************** *** 48,51 **** --- 48,58 ---- import mboxutils + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + program = sys.argv[0] Index: splitndirs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** splitndirs.py 6 Nov 2002 02:02:43 -0000 1.6 --- splitndirs.py 7 Nov 2002 22:30:08 -0000 1.7 *************** *** 55,58 **** --- 55,65 ---- import mboxutils + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + program = sys.argv[0] Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.62 retrieving revision 1.63 diff -C2 -d -r1.62 -r1.63 *** tokenizer.py 6 Nov 2002 02:12:47 -0000 1.62 --- tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63 *************** *** 16,19 **** --- 16,26 ---- from mboxutils import get_message + try: + True, False + except NameError: + # Maintain compatibility with Python 2.2 + True, False = 1, 0 + + # Patch encodings.aliases to recognize 'ansi_x3_4_1968' from encodings.aliases import aliases # The aliases dictionary From jvr@users.sourceforge.net Thu Nov 
7 22:32:17 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Thu, 07 Nov 2002 14:32:17 -0800 Subject: [Spambayes-checkins] website developer.ht,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv26318 Modified Files: developer.ht Log Message: Python version requirement dropped to 2.2. Someone else should regenerate and upload the site, I haven't got a clue.. Index: developer.ht =================================================================== RCS file: /cvsroot/spambayes/website/developer.ht,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** developer.ht 4 Nov 2002 06:38:52 -0000 1.4 --- developer.ht 7 Nov 2002 22:32:15 -0000 1.5 *************** *** 12,16 **** come crying <wink>.

    !

    This project works with either the absolute bleeding edge of python code, available from CVS on sourceforge, or with Python 2.2.1 (not 2.2, or 2.1.3).

    The spambayes code itself is also available via CVS --- 12,16 ---- come crying <wink>.

    !

    This project works with either the absolute bleeding edge of python code, available from CVS on sourceforge, or with Python 2.2 (not 2.1.x or earlier).

    The spambayes code itself is also available via CVS From just@letterror.com Thu Nov 7 22:51:11 2002 From: just@letterror.com (Just van Rossum) Date: Thu, 7 Nov 2002 23:51:11 +0100 Subject: [Spambayes-checkins] spambayes README.txt,1.40,1.41 TestDriver.py,1.27,1.28 Tester.py,1.7,1.8 chi2.py,1.7,1.8 classifier.py,1.48,1.49 hammie.py,1.36,1.37 hammiesrv.py,1.9,1.10 mboxcount.py,1.2,1.3 mboxtest.py,1.9,1.10 neiltrain.py,1.3,1.4 rebal.py,1. In-Reply-To: Message-ID: Just van Rossum wrote: > Mass checkin: Remain compatible with Python 2.2. Only tested with > pop3proxy.py. Btw. I screwed up the checkin for Options.py, Histogram.py and INTEGRATION.txt; these have a bogus log message for the 2.2 compat patch :-(. Just From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 07 Nov 2002 20:06:29 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67 tokenizer.py,1.63,1.64 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31798 Modified Files: Options.py tokenizer.py Log Message: Removed option retain_pure_html_tags; nobody enables that anymore, and it's hard to believe it would ever help anymore (except as an HTML detector). Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** Options.py 7 Nov 2002 22:25:46 -0000 1.66 --- Options.py 8 Nov 2002 04:06:23 -0000 1.67 *************** *** 42,53 **** x-.* - # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags - # from pure text/html messages. Set true to retain HTML tags in this - # case. On the c.l.py corpus, it helps to set this true because any - # sign of HTML is so despised on tech lists; however, the advantage - # of setting it true eventually vanishes even there given enough - # training data. 
- retain_pure_html_tags: False - # If true, the first few characters of application/octet-stream sections # are used, undecoded. What 'few' means is decided by octet_prefix_size. --- 42,45 ---- *************** *** 347,352 **** all_options = { ! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, ! 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, --- 339,343 ---- all_options = { ! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.63 retrieving revision 1.64 diff -C2 -d -r1.63 -r1.64 *** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63 --- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64 *************** *** 495,504 **** # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaults to False. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was removed. ############################################################################## --- 495,505 ---- # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaulted to False. Later, as the ! # algorithm improved, retain_pure_html_tags was removed. 
# # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was also removed. ############################################################################## *************** *** 1167,1175 **** """Generate a stream of tokens from an email Message. - HTML tags are always stripped from text/plain sections. - options.retain_pure_html_tags controls whether HTML tags are - also stripped from text/html sections. Except in special cases, - it's recommended to leave that at its default of false. - If options.check_octets is True, the first few undecoded characters of application/octet-stream parts of the message body become tokens. --- 1168,1171 ---- *************** *** 1228,1235 **** # Remove HTML/XML tags. Also &nbsp;. ! if (part.get_content_type() == "text/plain" or ! not options.retain_pure_html_tags): ! text = text.replace('&nbsp;', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. --- 1224,1229 ---- # Remove HTML/XML tags. Also &nbsp;. ! text = text.replace('&nbsp;', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Fri, 08 Nov 2002 00:00:25 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25390 Modified Files: pop3proxy.py Log Message: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch. 
o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. o It now copes with POP3 servers that don't issue a welcome command. o The training form now appears in the training results, so you can train on another message without having to go back to the Home page. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11 --- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 *************** *** 47,50 **** --- 47,74 ---- + todo = """ + o (Re)training interface - one message per line, quick-rendering table. + o Slightly-wordy index page; intro paragraph for each page. + o Once the training stuff is on a separate page, make the paste box + bigger. + o "Links" section (on homepage?) to project homepage, mailing list, + etc. + o "Home" link (with helmet!) at the end of each page. + o "Classify this" - just like Train. + o "Send me an email every [...] to remind me to train on new + messages." + o "Send me a status email every [...] telling how many mails have been + classified, etc." + o Deployment: Windows executable? atlaxwin and ctypes? Or just + webbrowser? + o Possibly integrate Tim Stone's SMTP code - make it use async, make + the training code update (rather than replace!) the database. + o Can it cleanly dynamically update its status display while having a + POP3 conversation? Hammering reload sucks. + o Add a command to save the database without shutting down, and one to + reload the database. + o Leave the word in the input field after a Word query. 
+ """ + import sys, re, operator, errno, getopt, cPickle, cStringIO, time import socket, asyncore, asynchat, cgi, urlparse, webbrowser *************** *** 92,95 **** --- 116,120 ---- self.factory(*args) + class BrighterAsyncChat(asynchat.async_chat): """An asynchat.async_chat that doesn't give spurious warnings on *************** *** 110,113 **** --- 135,164 ---- + class ServerLineReader(BrighterAsyncChat): + """An async socket that reads lines from a remote server and + simply calls a callback with the data. The BayesProxy object + can't connect to the real POP3 server and talk to it + synchronously, because that would block the process.""" + + def __init__(self, serverName, serverPort, lineCallback): + BrighterAsyncChat.__init__(self) + self.lineCallback = lineCallback + self.request = '' + self.set_terminator('\r\n') + self.create_socket(socket.AF_INET, socket.SOCK_STREAM) + self.connect((serverName, serverPort)) + + def collect_incoming_data(self, data): + self.request = self.request + data + + def found_terminator(self): + self.lineCallback(self.request + '\r\n') + self.request = '' + + def handle_close(self): + self.lineCallback('') + self.close() + + class POP3ProxyBase(BrighterAsyncChat): """An async dispatcher that understands POP3 and proxies to a POP3 *************** *** 126,134 **** BrighterAsyncChat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') ! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! self.serverSocket.connect((serverName, serverPort)) ! self.serverIn = self.serverSocket.makefile('r') # For reading only ! self.push(self.serverIn.readline()) def onTransaction(self, command, args, response): --- 177,189 ---- BrighterAsyncChat.__init__(self, clientSocket) self.request = '' + self.response = '' self.set_terminator('\r\n') ! self.command = '' # The POP3 command being processed... ! self.args = '' # ...and its arguments ! self.isClosing = False # Has the server closed the socket? ! 
self.seenAllHeaders = False # For the current RETR or TOP ! self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, ! self.onServerLine) def onTransaction(self, command, args, response): *************** *** 139,152 **** raise NotImplementedError ! def isMultiline(self, command, args): ! """Returns True if the given request should get a multiline response (assuming the response is positive). """ ! if command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif command in ['RETR', 'TOP']: return True ! elif command in ['LIST', 'UIDL']: return len(args) == 0 else: --- 194,237 ---- raise NotImplementedError ! def onServerLine(self, line): ! """A line of response has been received from the POP3 server.""" ! isFirstLine = not self.response ! self.response = self.response + line ! ! # Is this line that terminates a set of headers? ! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! ! # Has the server closed its end of the socket? ! if not line: ! self.isClosing = True ! ! # If we're not processing a command, just echo the response. ! if not self.command: ! self.push(self.response) ! self.response = '' ! ! # Time out after 30 seconds for message-retrieval commands if ! # all the headers are down. The rest of the message will proxy ! # straight through. ! if self.command in ['TOP', 'RETR'] and \ ! self.seenAllHeaders and time.time() > self.startTime + 30: ! self.onResponse() ! self.response = '' ! # If that's a complete response, handle it. ! elif not self.isMultiline() or line == '.\r\n' or \ ! (isFirstLine and line.startswith('-ERR')): ! self.onResponse() ! self.response = '' ! ! def isMultiline(self): ! """Returns True if the request should get a multiline response (assuming the response is positive). """ ! if self.command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif self.command in ['RETR', 'TOP']: return True ! 
elif self.command in ['LIST', 'UIDL']: return len(args) == 0 else: *************** *** 155,204 **** return False - def readResponse(self, command, args): - """Reads the POP3 server's response and returns a tuple of - (response, isClosing, timedOut). isClosing is True if the - server closes the socket, which tells found_terminator() to - close when the response has been sent. timedOut is set if a - TOP or RETR request was still arriving after 30 seconds, and - tells found_terminator() to proxy the remainder of the response. - """ - responseLines = [] - startTime = time.time() - isMulti = self.isMultiline(command, args) - isClosing = False - timedOut = False - isFirstLine = True - seenAllHeaders = False - while True: - line = self.serverIn.readline() - if not line: - # The socket's been closed by the server, probably by QUIT. - isClosing = True - break - elif not isMulti or (isFirstLine and line.startswith('-ERR')): - # A single-line response. - responseLines.append(line) - break - elif line == '.\r\n': - # The termination line. - responseLines.append(line) - break - else: - # A normal line - append it to the response and carry on. - responseLines.append(line) - seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n'] - - # Time out after 30 seconds for message-retrieval commands - # if all the headers are down - found_terminator() knows how - # to deal with this. - if command in ['TOP', 'RETR'] and \ - seenAllHeaders and time.time() > startTime + 30: - timedOut = True - break - - isFirstLine = False - - return ''.join(responseLines), isClosing, timedOut - def collect_incoming_data(self, data): """Asynchat override.""" --- 240,243 ---- *************** *** 207,256 **** def found_terminator(self): """Asynchat override.""" - # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! 
self.serverSocket.sendall(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! command, args = ('', '') else: splitCommand = self.request.strip().split(None, 1) ! command = splitCommand[0].upper() ! args = splitCommand[1:] ! rawResponse, isClosing, timedOut = self.readResponse(command, args) ! # Pass the request and the raw response to the subclass and # send back the cooked response. ! cookedResponse = self.onTransaction(command, args, rawResponse) ! self.push(cookedResponse) ! self.request = '' ! ! # If readResponse() timed out, we still need to read and proxy ! # the rest of the message. ! if timedOut: ! while True: ! line = self.serverIn.readline() ! if not line: ! # The socket's been closed by the server. ! isClosing = True ! break ! elif line == '.\r\n': ! # The termination line. ! self.push(line) ! break ! else: ! # A normal line. ! self.push(line) ! ! # If readResponse() or the loop above decided that the server ! # has closed its socket, close this one when the response has ! # been sent. ! if isClosing: self.close_when_done() class BayesProxyListener(Listener): --- 246,288 ---- def found_terminator(self): """Asynchat override.""" if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") + self.serverSocket.shutdown(2) + self.serverSocket.close() self.shutdown(2) self.close() raise SystemExit ! ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! self.command = self.args = '' else: + # A proper command. splitCommand = self.request.strip().split(None, 1) ! self.command = splitCommand[0].upper() ! self.args = splitCommand[1:] ! self.startTime = time.time() ! ! self.request = '' ! ! def onResponse(self): # Pass the request and the raw response to the subclass and # send back the cooked response. ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) ! ! 
# If onServerLine() decided that the server has closed its ! # socket, close this one when the response has been sent. ! if self.isClosing: self.close_when_done() + # Reset. + self.command = '' + self.args = '' + self.isClosing = False + self.seenAllHeaders = False + class BayesProxyListener(Listener): *************** *** 452,456 **** table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15 } .header { font-size: 133%% } .content { margin: 15 } --- 484,490 ---- table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15; ! border-top: 1px solid black; ! border-bottom: 1px solid black } .header { font-size: 133%% } .content { margin: 15 } *************** *** 466,470 ****

    \n""" --- 500,504 ----
    \n""" *************** *** 475,481 **** Spambayes.org ! \n""" pageSection = """ --- 509,520 ---- Spambayes.org
    %s
    \n""" + shutdownDB = """""" + + shutdownPickle = shutdownDB + """   + """ + pageSection = """ *************** *** 483,486 **** --- 522,533 ----  
    \n""" + summary = """POP3 proxy running on port %(proxyPort)d, + proxying to %(serverName)s:%(serverPort)d.
    + Active POP3 conversations: %(activeSessions)d.
    + POP3 conversations this session: %(totalSessions)d.
    + Emails classified this session: %(numSpams)d spam, + %(numHams)d ham, %(numUnsure)d unsure. + """ + wordQuery = """ *************** *** 488,491 **** --- 535,550 ---- """ + train = """ + Either upload a message file:
    + Or paste the whole message (including headers) here:
    +
    + Is this message + Ham or + Spam?
    + + """ + def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 502,506 **** """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = self.request.split('\r\n', 1) try: method, url, version = requestLine.strip().split() --- 561,565 ---- """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = (self.request+'\r\n').split('\r\n', 1) try: method, url, version = requestLine.strip().split() *************** *** 547,551 **** if path == '/helmet.gif': ! self.pushOKHeaders('image/gif') self.push(self.helmet) else: --- 606,614 ---- if path == '/helmet.gif': ! # XXX Why doesn't Expires work? Must read RFC 2616 one day. ! inOneHour = time.gmtime(time.time() + 3600) ! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour) ! extraHeaders = {'Expires': expiryDate} ! self.pushOKHeaders('image/gif', extraHeaders) self.push(self.helmet) else: *************** *** 554,558 **** handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % url) else: # This is a request for a valid page; run the handler. --- 617,621 ---- handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % path) else: # This is a request for a valid page; run the handler. *************** *** 561,569 **** handler(params) timeString = time.asctime(time.localtime()) ! self.push(self.footer % timeString) ! def pushOKHeaders(self, contentType): ! self.push("HTTP/1.0 200 OK\r\n") self.push("Content-Type: %s\r\n" % contentType) self.push("\r\n") --- 624,641 ---- handler(params) timeString = time.asctime(time.localtime()) ! if status.useDB: ! self.push(self.footer % (timeString, self.shutdownDB)) ! else: ! self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): ! timeNow = time.gmtime(time.time()) ! 
httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow) ! self.push("HTTP/1.1 200 OK\r\n") ! self.push("Connection: close\r\n") self.push("Content-Type: %s\r\n" % contentType) + self.push("Date: %s\r\n" % httpNow) + for name, value in extraHeaders.items(): + self.push("%s: %s\r\n" % (name, value)) self.push("\r\n") *************** *** 583,616 **** def onHome(self, params): ! summary = """POP3 proxy running on port %(proxyPort)d, ! proxying to %(serverName)s:%(serverPort)d.
    ! Active POP3 conversations: %(activeSessions)d.
    ! POP3 conversations this session: ! %(totalSessions)d.
    ! Emails classified this session: %(numSpams)d spam, ! %(numHams)d ham, %(numUnsure)d unsure. ! """ % status.__dict__ ! ! train = """
    ! Either upload a message file: !
    ! Or paste the whole message (including headers) here:
    !
    ! Is this message ! Ham or ! Spam?
    ! ! """ ! ! body = (self.pageSection % ('Status', summary) + ! self.pageSection % ('Word query', self.wordQuery) + ! self.pageSection % ('Train', train)) self.push(body) def onShutdown(self, params): ! self.push("

    Shutdown. Goodbye.

    ") ! self.push(' ') # Acts as a flush for small buffers. self.shutdown(2) self.close() --- 655,675 ---- def onHome(self, params): ! """Serve up the homepage.""" ! body = (self.pageSection % ('Status', self.summary % status.__dict__)+ ! self.pageSection % ('Word query', self.wordQuery)+ ! self.pageSection % ('Train', self.train)) self.push(body) def onShutdown(self, params): ! """Shutdown the server, saving the pickle if requested to do so.""" ! if params['how'].lower().find('save') >= 0: ! if not status.useDB and status.pickleName: ! self.push("Saving...") ! self.push(' ') # Acts as a flush for small buffers. ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("Shutdown. Goodbye.") ! self.push(' ') self.shutdown(2) self.close() *************** *** 618,625 **** def onUpload(self, params): message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') # Append the message to a file, to make it easier to rebuild ! # the database later. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: --- 677,690 ---- def onUpload(self, params): + """Train on an uploaded or pasted message.""" + # Upload or paste? Spam or ham? message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild ! # the database later. This is a temporary implementation - ! # it should keep a Corpus (from Tim Stone's forthcoming message ! # management module) to manage a cache of messages. It needs ! # to keep them for the HTML retraining interface anyway. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: *************** *** 627,642 **** else: f = open("_pop3proxyham.mbox", "a") ! f.write("From ???@???\n") # fake From line (XXX good enough?) f.write(message) ! f.write("\n") f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("""

    Trained on your message. Saving database...

    """) ! self.push(" ") # Flush... must find out how to do this properly... ! if not status.useDB and status.pickleName: ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("

    Done.

    Home

    ") def onWordquery(self, params): --- 692,704 ---- else: f = open("_pop3proxyham.mbox", "a") ! f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n") f.write(message) ! f.write("\n\n") f.close() + + # Train on the message. self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("

    OK. Return Home or train another:

    ") ! self.push(self.pageSection % ('Train another', self.train)) def onWordquery(self, params): *************** *** 656,660 **** info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s':" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) --- 718,722 ---- info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) *************** *** 765,771 **** else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) self.request = '' def onStat(self, command, args): """POP3 STAT command.""" --- 827,839 ---- else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) # Or push_slowly for testing self.request = '' + def push_slowly(self, response): + """Useful for testing.""" + for c in response: + self.push(c) + time.sleep(0.02) + def onStat(self, command, args): """POP3 STAT command.""" *************** *** 777,781 **** """POP3 LIST command, with optional message number argument.""" if args: ! number = int(args) if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) --- 845,852 ---- """POP3 LIST command, with optional message number argument.""" if args: ! try: ! number = int(args) ! except ValueError: ! number = -1 if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) *************** *** 803,811 **** def onRetr(self, command, args): """POP3 RETR command.""" ! return self._getMessage(int(args), 12345) def onTop(self, command, args): """POP3 RETR command.""" ! number, lines = map(int, args.split()) return self._getMessage(number, lines) --- 874,889 ---- def onRetr(self, command, args): """POP3 RETR command.""" ! try: ! number = int(args) ! except ValueError: ! number = -1 ! 
return self._getMessage(number, 12345) def onTop(self, command, args): """POP3 RETR command.""" ! try: ! number, lines = map(int, args.split()) ! except ValueError: ! number, lines = -1, -1 return self._getMessage(number, lines) *************** *** 863,867 **** while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) != -1 # Kill the proxy and the test server. --- 941,945 ---- while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) >= 0 # Kill the proxy and the test server. From jvr@users.sourceforge.net Sat Nov 9 18:05:44 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Sat, 09 Nov 2002 10:05:44 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20814 Modified Files: pop3proxy.py Log Message: force word query to be lowercase, making the UI case insensitive Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 --- pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13 *************** *** 704,707 **** --- 704,708 ---- def onWordquery(self, params): word = params['word'] + word = word.lower() try: # Must be a better way to get __dict__ for a new-style class... From hooft@users.sourceforge.net Sat Nov 9 21:48:55 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sat, 09 Nov 2002 13:48:55 -0800 Subject: [Spambayes-checkins] spambayes weaktest.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31102 Added Files: weaktest.py Log Message: New test driver to simulate "unsure only" training --- NEW FILE: weaktest.py --- #! 
/usr/bin/env python # A test driver using "the standard" test directory structure. # This simulates a user that gets E-mail, and only trains on fp, # fn and unsure messages. It starts by training on the first 30 # messages, and from that point on well classified messages will # not be used for training. This can be used to see what the performance # of the scoring algorithm is under such conditions. Questions are: # * How does the size of the database behave over time? # * Does the classification get better over time? # * Are there other combinations of parameters for the classifier # that make this better behaved than the default values? """Usage: %(program)s [options] -n nsets Where: -h Show usage and exit. -n int Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...). This is required. In addition, an attempt is made to merge bayescustomize.ini into the options. If that exists, it can be used to change the settings in Options.options. """ from __future__ import generators import sys,os from Options import options import hammie import msgs program = sys.argv[0] debug = 0 def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def drive(nsets): print options.display() spamdirs = [options.spam_directories % i for i in range(1, nsets+1)] hamdirs = [options.ham_directories % i for i in range(1, nsets+1)] spamfns = [(x,y,1) for x in spamdirs for y in os.listdir(x)] hamfns = [(x,y,0) for x in hamdirs for y in os.listdir(x)] nham = len(hamfns) nspam = len(spamfns) allfns={} for fn in spamfns+hamfns: allfns[fn] = None d = hammie.Hammie(hammie.createbayes('weaktest.db', False)) n=0 unsure=0 hamtrain=0 spamtrain=0 fp=0 fn=0 for dir,name, is_spam in allfns.iterkeys(): n += 1 m=msgs.Msg(dir, name).guts if debug: print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name), if hamtrain + spamtrain 
> 30: scr=d.score(m) else: scr=0.50 if debug: print "score:%.3f"%scr, if scr < hammie.SPAM_THRESHOLD and is_spam: if scr < hammie.HAM_THRESHOLD: fn += 1 if debug: print "fn" else: unsure += 1 if debug: print "Unsure" spamtrain += 1 d.train_spam(m) d.update_probabilities() elif scr > hammie.HAM_THRESHOLD and not is_spam: if scr > hammie.SPAM_THRESHOLD: fp += 1 if debug: print "fp" else: print "fp: %s score:%.4f"%(os.path.join(dir,name),scr) else: unsure += 1 if debug: print "Unsure" hamtrain += 1 d.train_ham(m) d.update_probabilities() else: if debug: print "OK" if n % 100 == 0: print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%( n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure) print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam) print "Total unsure (including 30 startup messages): %d (%.1f%%)"%( unsure,unsure*100.0/len(allfns)) print "Trained on %d ham and %d spam"%(hamtrain,spamtrain) print "fp: %d fn: %d"%(fp,fn) FPW = options.best_cutoff_fp_weight FNW = options.best_cutoff_fn_weight UNW = options.best_cutoff_unsure_weight print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure) def main(): import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-n': nsets = int(arg) if args: usage(1, "Positional arguments not supported") if nsets is None: usage(1, "-n is required") drive(nsets) if __name__ == "__main__": main() From hooft@users.sourceforge.net Sun Nov 10 12:02:36 2002 From: hooft@users.sourceforge.net (Rob W.W. 
Hooft) Date: Sun, 10 Nov 2002 04:02:36 -0800 Subject: [Spambayes-checkins] spambayes weaktest.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22741 Modified Files: weaktest.py Log Message: add flexcost; sanitize spacing Index: weaktest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** weaktest.py 9 Nov 2002 21:48:52 -0000 1.1 --- weaktest.py 10 Nov 2002 12:02:33 -0000 1.2 *************** *** 59,63 **** nspam = len(spamfns) ! allfns={} for fn in spamfns+hamfns: allfns[fn] = None --- 59,63 ---- nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: allfns[fn] = None *************** *** 65,74 **** d = hammie.Hammie(hammie.createbayes('weaktest.db', False)) ! n=0 ! unsure=0 ! hamtrain=0 ! spamtrain=0 ! fp=0 ! fn=0 for dir,name, is_spam in allfns.iterkeys(): n += 1 --- 65,80 ---- d = hammie.Hammie(hammie.createbayes('weaktest.db', False)) ! n = 0 ! unsure = 0 ! hamtrain = 0 ! spamtrain = 0 ! fp = 0 ! fn = 0 ! flexcost = 0 ! FPW = options.best_cutoff_fp_weight ! FNW = options.best_cutoff_fn_weight ! UNW = options.best_cutoff_unsure_weight ! SPC = options.spam_cutoff ! HC = options.ham_cutoff for dir,name, is_spam in allfns.iterkeys(): n += 1 *************** *** 82,87 **** if debug: print "score:%.3f"%scr, ! if scr < hammie.SPAM_THRESHOLD and is_spam: ! if scr < hammie.HAM_THRESHOLD: fn += 1 if debug: --- 88,96 ---- if debug: print "score:%.3f"%scr, ! if scr < SPC and is_spam: ! t = FNW * (SPC - scr) / (SPC - HC) ! #print "Spam at %.3f costs %.2f"%(scr,t) ! flexcost += t ! if scr < HC: fn += 1 if debug: *************** *** 94,104 **** d.train_spam(m) d.update_probabilities() ! elif scr > hammie.HAM_THRESHOLD and not is_spam: ! if scr > hammie.SPAM_THRESHOLD: fp += 1 if debug: print "fp" else: ! 
print "fp: %s score:%.4f"%(os.path.join(dir,name),scr) else: unsure += 1 --- 103,116 ---- d.train_spam(m) d.update_probabilities() ! elif scr > HC and not is_spam: ! t = FPW * (scr - HC) / (SPC - HC) ! #print "Ham at %.3f costs %.2f"%(scr,t) ! flexcost += t ! if scr > SPC: fp += 1 if debug: print "fp" else: ! print "fp: %s score:%.4f"%(os.path.join(dir, name), scr) else: unsure += 1 *************** *** 113,126 **** if n % 100 == 0: print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%( ! n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure) ! print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam) print "Total unsure (including 30 startup messages): %d (%.1f%%)"%( ! unsure,unsure*100.0/len(allfns)) ! print "Trained on %d ham and %d spam"%(hamtrain,spamtrain) ! print "fp: %d fn: %d"%(fp,fn) ! FPW = options.best_cutoff_fp_weight ! FNW = options.best_cutoff_fn_weight ! UNW = options.best_cutoff_unsure_weight ! print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure) def main(): --- 125,136 ---- if n % 100 == 0: print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%( ! n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure) ! print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam) print "Total unsure (including 30 startup messages): %d (%.1f%%)"%( ! unsure, unsure * 100.0 / len(allfns)) ! print "Trained on %d ham and %d spam"%(hamtrain, spamtrain) ! print "fp: %d fn: %d"%(fp, fn) ! print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) ! print "Flex cost: $%.4f"%flexcost def main(): *************** *** 128,137 **** try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ! ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) ! nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': --- 138,146 ---- try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) ! 
nsets = None for opt, arg in opts: if opt == '-h': From hooft@users.sourceforge.net Sun Nov 10 12:07:18 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sun, 10 Nov 2002 04:07:18 -0800 Subject: [Spambayes-checkins] spambayes optimize.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24245 Added Files: optimize.py Log Message: Simplex maximization --- NEW FILE: optimize.py --- # __version__ = '$Id: optimize.py,v 1.1 2002/11/10 12:07:15 hooft Exp $' # # Optimize any parametric function. # import copy import Numeric def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001): var = Numeric.array(var) simplex = [var] for i in range(len(var)): var2 = copy.copy(var) var2[i] = var[i] + err[i] simplex.append(var2) value = [] for i in range(len(simplex)): value.append(func(simplex[i])) while 1: # Determine worst and best wi = 0 bi = 0 for i in range(len(simplex)): if value[wi] > value[i]: wi = i if value[bi] < value[i]: bi = i # Test for convergence #print "worst, best are",wi,bi,"with",value[wi],value[bi] if abs(value[bi] - value[wi]) <= convcrit: return simplex[bi] # Calculate average of non-worst ave=Numeric.zeros(len(var), 'd') for i in range(len(simplex)): if i != wi: ave = ave + simplex[i] ave = ave / (len(simplex) - 1) worst = Numeric.array(simplex[wi]) # Check for too-small simplex simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) if simsize <= minerr: #print "Size of simplex too small:",simsize return simplex[bi] # Invert worst new = 2 * ave - simplex[wi] newv = func(new) if newv <= value[wi]: # Even worse. Shrink instead #print "Shrunk simplex" #print "ave=",repr(ave) #print "wi=",repr(worst) new = 0.5 * ave + 0.5 * worst newv = func(new) elif newv > value[bi]: # Better than the best. 
Expand new2 = 3 * ave - 2 * worst newv2 = func(new2) if newv2 > newv: # Accept #print "Expanded simplex" new = new2 newv = newv2 simplex[wi] = new value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): err = Numeric.array(err) var = SimplexMaximize(var, err, func, convcrit*5, minerr*5) return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr) From hooft@users.sourceforge.net Sun Nov 10 12:08:42 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sun, 10 Nov 2002 04:08:42 -0800 Subject: [Spambayes-checkins] spambayes weakloop.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24653 Added Files: weakloop.py Log Message: Loop simplex optimization over weaktest.py --- NEW FILE: weakloop.py --- # # Optimize parameters # """Usage: %(program)s [options] -n nsets Where: -h Show usage and exit. -n int Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...). This is required. In addition, an attempt is made to merge bayescustomize.ini into the options. If that exists, it can be used to change the settings in Options.options. 
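For readers following optimize.py from the earlier check-in, one iteration of its reflect / shrink / expand update can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not the committed code: it uses lists instead of Numeric arrays, it is written so that it also runs on current Python, and the name `simplex_step` is hypothetical. The convergence tests (convcrit and minerr) are left out.

```python
def simplex_step(simplex, values, func):
    """One maximizing simplex update (hypothetical helper, not in optimize.py).

    simplex is a list of points (lists of floats), values[i] == func(simplex[i]).
    The worst vertex is replaced in place and both lists are returned.
    """
    n = len(simplex)
    dim = len(simplex[0])
    # Worst = lowest value, best = highest value (we are maximizing).
    wi = min(range(n), key=lambda i: values[i])
    bi = max(range(n), key=lambda i: values[i])
    worst = simplex[wi]
    # Average of all vertices except the worst one.
    ave = [sum(simplex[i][k] for i in range(n) if i != wi) / float(n - 1)
           for k in range(dim)]
    # Reflect the worst vertex through that average.
    new = [2 * a - w for a, w in zip(ave, worst)]
    newv = func(new)
    if newv <= values[wi]:
        # Even worse than the worst: shrink halfway towards the average.
        new = [0.5 * a + 0.5 * w for a, w in zip(ave, worst)]
        newv = func(new)
    elif newv > values[bi]:
        # Better than the best: try going further in the same direction.
        new2 = [3 * a - 2 * w for a, w in zip(ave, worst)]
        newv2 = func(new2)
        if newv2 > newv:
            new, newv = new2, newv2
    simplex[wi] = new
    values[wi] = newv
    return simplex, values
```

A caller would repeat this step until the best and worst values differ by no more than convcrit, or the simplex becomes smaller than minerr, which is how SimplexMaximize decides to stop.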
""" import sys def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) program = sys.argv[0] default=""" [Classifier] robinson_probability_x = 0.5 robinson_minimum_prob_strength = 0.1 robinson_probability_s = 0.45 max_discriminators = 150 [TestDriver] spam_cutoff = 0.90 ham_cutoff = 0.20 """ import Options start = (Options.options.robinson_probability_x, Options.options.robinson_minimum_prob_strength, Options.options.robinson_probability_s, Options.options.spam_cutoff, Options.options.ham_cutoff) err = (0.01, 0.01, 0.01, 0.005, 0.01) def mkini(vars): f=open('bayescustomize.ini', 'w') f.write(""" [Classifier] robinson_probability_x = %.6f robinson_minimum_prob_strength = %.6f robinson_probability_s = %.6f [TestDriver] spam_cutoff = %.4f ham_cutoff = %.4f """%tuple(vars)) f.close() def score(vars): import os mkini(vars) status = os.system('python2.3 weaktest.py -n %d > weak.out'%nsets) if status != 0: print >> sys.stderr, "Error status from weaktest" sys.exit(status) f = open('weak.out', 'r') txt = f.readlines() # Extract the flex cost field. 
cost = float(txt[-1].split()[2][1:]) f.close() print ''.join(txt[-4:])[:-1] print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,)) return -cost def main(): import optimize finish=optimize.SimplexMaximize(start,err,score) mkini(finish) if __name__ == "__main__": import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) nsets = None for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-n': nsets = int(arg) if args: usage(1, "Positional arguments not supported") if nsets is None: usage(1, "-n is required") main() From tim_one@users.sourceforge.net Sun Nov 10 19:59:24 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 11:59:24 -0800 Subject: [Spambayes-checkins] spambayes msgs.py,1.5,1.6 optimize.py,1.1,1.2 pop3proxy.py,1.13,1.14 timcv.py,1.11,1.12 weaktest.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv14712 Modified Files: msgs.py optimize.py pop3proxy.py timcv.py weaktest.py Log Message: Whitespace normalization. Index: msgs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/msgs.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** msgs.py 1 Nov 2002 04:10:50 -0000 1.5 --- msgs.py 10 Nov 2002 19:59:22 -0000 1.6 *************** *** 84,88 **** def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None): ! """Set HAMTEST/TRAIN and SPAMTEST/TRAIN. If seed is not None, also set SEED. If (ham|spam)test are not set, set to the same as the (ham|spam)train --- 84,88 ---- def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None): ! """Set HAMTEST/TRAIN and SPAMTEST/TRAIN. If seed is not None, also set SEED. 
If (ham|spam)test are not set, set to the same as the (ham|spam)train Index: optimize.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/optimize.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** optimize.py 10 Nov 2002 12:07:15 -0000 1.1 --- optimize.py 10 Nov 2002 19:59:22 -0000 1.2 *************** *** 11,66 **** simplex = [var] for i in range(len(var)): ! var2 = copy.copy(var) ! var2[i] = var[i] + err[i] ! simplex.append(var2) value = [] for i in range(len(simplex)): ! value.append(func(simplex[i])) while 1: ! # Determine worst and best ! wi = 0 ! bi = 0 ! for i in range(len(simplex)): ! if value[wi] > value[i]: ! wi = i ! if value[bi] < value[i]: ! bi = i ! # Test for convergence ! #print "worst, best are",wi,bi,"with",value[wi],value[bi] ! if abs(value[bi] - value[wi]) <= convcrit: ! return simplex[bi] ! # Calculate average of non-worst ! ave=Numeric.zeros(len(var), 'd') ! for i in range(len(simplex)): ! if i != wi: ! ave = ave + simplex[i] ! ave = ave / (len(simplex) - 1) ! worst = Numeric.array(simplex[wi]) ! # Check for too-small simplex ! simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) ! if simsize <= minerr: ! #print "Size of simplex too small:",simsize ! return simplex[bi] ! # Invert worst ! new = 2 * ave - simplex[wi] ! newv = func(new) ! if newv <= value[wi]: ! # Even worse. Shrink instead ! #print "Shrunk simplex" ! #print "ave=",repr(ave) ! #print "wi=",repr(worst) ! new = 0.5 * ave + 0.5 * worst ! newv = func(new) ! elif newv > value[bi]: ! # Better than the best. Expand ! new2 = 3 * ave - 2 * worst ! newv2 = func(new2) ! if newv2 > newv: ! # Accept ! #print "Expanded simplex" ! new = new2 ! newv = newv2 ! simplex[wi] = new ! value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): --- 11,66 ---- simplex = [var] for i in range(len(var)): ! var2 = copy.copy(var) ! var2[i] = var[i] + err[i] ! 
simplex.append(var2) value = [] for i in range(len(simplex)): ! value.append(func(simplex[i])) while 1: ! # Determine worst and best ! wi = 0 ! bi = 0 ! for i in range(len(simplex)): ! if value[wi] > value[i]: ! wi = i ! if value[bi] < value[i]: ! bi = i ! # Test for convergence ! #print "worst, best are",wi,bi,"with",value[wi],value[bi] ! if abs(value[bi] - value[wi]) <= convcrit: ! return simplex[bi] ! # Calculate average of non-worst ! ave=Numeric.zeros(len(var), 'd') ! for i in range(len(simplex)): ! if i != wi: ! ave = ave + simplex[i] ! ave = ave / (len(simplex) - 1) ! worst = Numeric.array(simplex[wi]) ! # Check for too-small simplex ! simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) ! if simsize <= minerr: ! #print "Size of simplex too small:",simsize ! return simplex[bi] ! # Invert worst ! new = 2 * ave - simplex[wi] ! newv = func(new) ! if newv <= value[wi]: ! # Even worse. Shrink instead ! #print "Shrunk simplex" ! #print "ave=",repr(ave) ! #print "wi=",repr(worst) ! new = 0.5 * ave + 0.5 * worst ! newv = func(new) ! elif newv > value[bi]: ! # Better than the best. Expand ! new2 = 3 * ave - 2 * worst ! newv2 = func(new2) ! if newv2 > newv: ! # Accept ! #print "Expanded simplex" ! new = new2 ! newv = newv2 ! simplex[wi] = new ! value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13 --- pop3proxy.py 10 Nov 2002 19:59:22 -0000 1.14 *************** *** 140,144 **** can't connect to the real POP3 server and talk to it synchronously, because that would block the process.""" ! 
def __init__(self, serverName, serverPort, lineCallback): BrighterAsyncChat.__init__(self) --- 140,144 ---- can't connect to the real POP3 server and talk to it synchronously, because that would block the process.""" ! def __init__(self, serverName, serverPort, lineCallback): BrighterAsyncChat.__init__(self) *************** *** 148,152 **** self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((serverName, serverPort)) ! def collect_incoming_data(self, data): self.request = self.request + data --- 148,152 ---- self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((serverName, serverPort)) ! def collect_incoming_data(self, data): self.request = self.request + data *************** *** 184,188 **** self.seenAllHeaders = False # For the current RETR or TOP self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, self.onServerLine) --- 184,188 ---- self.seenAllHeaders = False # For the current RETR or TOP self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, self.onServerLine) *************** *** 198,214 **** isFirstLine = not self.response self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! # Has the server closed its end of the socket? if not line: self.isClosing = True ! # If we're not processing a command, just echo the response. if not self.command: self.push(self.response) self.response = '' ! # Time out after 30 seconds for message-retrieval commands if # all the headers are down. The rest of the message will proxy --- 198,214 ---- isFirstLine = not self.response self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! # Has the server closed its end of the socket? if not line: self.isClosing = True ! # If we're not processing a command, just echo the response. 
if not self.command: self.push(self.response) self.response = '' ! # Time out after 30 seconds for message-retrieval commands if # all the headers are down. The rest of the message will proxy *************** *** 223,227 **** self.onResponse() self.response = '' ! def isMultiline(self): """Returns True if the request should get a multiline --- 223,227 ---- self.onResponse() self.response = '' ! def isMultiline(self): """Returns True if the request should get a multiline *************** *** 254,258 **** self.close() raise SystemExit ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': --- 254,258 ---- self.close() raise SystemExit ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': *************** *** 265,271 **** self.args = splitCommand[1:] self.startTime = time.time() ! self.request = '' ! def onResponse(self): # Pass the request and the raw response to the subclass and --- 265,271 ---- self.args = splitCommand[1:] self.startTime = time.time() ! self.request = '' ! def onResponse(self): # Pass the request and the raw response to the subclass and *************** *** 273,277 **** cooked = self.onTransaction(self.command, self.args, self.response) self.push(cooked) ! # If onServerLine() decided that the server has closed its # socket, close this one when the response has been sent. --- 273,277 ---- cooked = self.onTransaction(self.command, self.args, self.response) self.push(cooked) ! # If onServerLine() decided that the server has closed its # socket, close this one when the response has been sent. *************** *** 351,355 **** status.activeSessions -= 1 POP3ProxyBase.close(self) ! def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the --- 351,355 ---- status.activeSessions -= 1 POP3ProxyBase.close(self) ! 
def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the *************** *** 419,423 **** if command == 'RETR': status.numUnsure += 1 ! headers, body = re.split(r'\n\r?\n', response, 1) headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n" --- 419,423 ---- if command == 'RETR': status.numUnsure += 1 ! headers, body = re.split(r'\n\r?\n', response, 1) headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n" *************** *** 490,494 **** .content { margin: 15 } .sectiontable { border: 1px solid #808080; width: 95%% } ! .sectionheading { background: fffae0; padding-left: 1ex; border-bottom: 1px solid #808080; font-weight: bold } --- 490,494 ---- .content { margin: 15 } .sectiontable { border: 1px solid #808080; width: 95%% } ! .sectionheading { background: fffae0; padding-left: 1ex; border-bottom: 1px solid #808080; font-weight: bold } *************** *** 513,517 **** shutdownDB = """""" ! shutdownPickle = shutdownDB + """   """ --- 513,517 ---- shutdownDB = """""" ! shutdownPickle = shutdownDB + """   """ *************** *** 521,525 ****
    %s
    %s
     
    \n""" ! summary = """POP3 proxy running on port %(proxyPort)d, proxying to %(serverName)s:%(serverPort)d.
    --- 521,525 ---- %s  
    \n""" ! summary = """POP3 proxy running on port %(proxyPort)d, proxying to %(serverName)s:%(serverPort)d.
    *************** *** 529,538 **** %(numHams)d ham, %(numUnsure)d unsure. """ ! wordQuery = """
    """ ! train = """
    --- 529,538 ---- %(numHams)d ham, %(numUnsure)d unsure. """ ! wordQuery = """
    """ ! train = """
    *************** *** 546,550 ****
    """ ! def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) --- 546,550 ---- """ ! def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 577,581 **** self.request = self.request + '\r\n\r\n' return ! if type(self.get_terminator()) is type(1): # We've just read the body of a POSTed request. --- 577,581 ---- self.request = self.request + '\r\n\r\n' return ! if type(self.get_terminator()) is type(1): # We've just read the body of a POSTed request. *************** *** 592,596 **** # A normal x-www-form-urlencoded. params.update(cgi.parse_qs(body, keep_blank_values=True)) ! # Convert the cgi params into a simple dictionary. plainParams = {} --- 592,596 ---- # A normal x-www-form-urlencoded. params.update(cgi.parse_qs(body, keep_blank_values=True)) ! # Convert the cgi params into a simple dictionary. plainParams = {} *************** *** 604,608 **** if path == '/': path = '/Home' ! if path == '/helmet.gif': # XXX Why doesn't Expires work? Must read RFC 2616 one day. --- 604,608 ---- if path == '/': path = '/Home' ! if path == '/helmet.gif': # XXX Why doesn't Expires work? Must read RFC 2616 one day. *************** *** 628,632 **** else: self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): timeNow = time.gmtime(time.time()) --- 628,632 ---- else: self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): timeNow = time.gmtime(time.time()) *************** *** 645,649 **** self.push("\r\n") self.push("

    %d %s

    " % (code, message)) ! def pushPreamble(self, name): self.push(self.header % name) --- 645,649 ---- self.push("\r\n") self.push("

    %d %s

    " % (code, message)) ! def pushPreamble(self, name): self.push(self.header % name) *************** *** 681,685 **** message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') ! # Append the message to a file, to make it easier to rebuild # the database later. This is a temporary implementation - --- 681,685 ---- message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') ! # Append the message to a file, to make it easier to rebuild # the database later. This is a temporary implementation - *************** *** 718,722 **** except KeyError: info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) --- 718,722 ---- except KeyError: info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) *************** *** 992,996 **** elif opt == '-u': status.uiPort = int(arg) ! # Do whatever we've been asked to do... if not opts and not args: --- 992,996 ---- elif opt == '-u': status.uiPort = int(arg) ! # Do whatever we've been asked to do... if not opts and not args: Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** timcv.py 1 Nov 2002 04:10:50 -0000 1.11 --- timcv.py 10 Nov 2002 19:59:22 -0000 1.12 *************** *** 15,19 **** --HamTrain int ! The maximum number of msgs to use from each Ham set for training. The msgs are chosen randomly. See also the -s option. --- 15,19 ---- --HamTrain int ! The maximum number of msgs to use from each Ham set for training. The msgs are chosen randomly. See also the -s option. *************** *** 23,27 **** --HamTest int ! The maximum number of msgs to use from each Ham set for testing. 
The msgs are chosen randomly. See also the -s option. --- 23,27 ---- --HamTest int ! The maximum number of msgs to use from each Ham set for testing. The msgs are chosen randomly. See also the -s option. *************** *** 73,79 **** d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:], train=1)) --- 73,79 ---- d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:], train=1)) *************** *** 98,102 **** del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), msgs.SpamStream(sname, s2, train=1)) --- 98,102 ---- del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), msgs.SpamStream(sname, s2, train=1)) Index: weaktest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** weaktest.py 10 Nov 2002 12:02:33 -0000 1.2 --- weaktest.py 10 Nov 2002 19:59:22 -0000 1.3 *************** *** 58,62 **** nham = len(hamfns) nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: --- 58,62 ---- nham = len(hamfns) nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: *************** *** 133,137 **** print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) print "Flex cost: $%.4f"%flexcost ! def main(): import getopt --- 133,137 ---- print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) print "Flex cost: $%.4f"%flexcost ! 
def main(): import getopt From tim_one@users.sourceforge.net Sun Nov 10 20:00:03 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 12:00:03 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv14946 Modified Files: msgstore.py Log Message: Whitespace normalization. Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** msgstore.py 7 Nov 2002 22:30:09 -0000 1.23 --- msgstore.py 10 Nov 2002 19:59:59 -0000 1.24 *************** *** 397,401 **** # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed pass ! return "%s\n%s\n%s" % (headers, html, body) --- 397,401 ---- # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed pass ! return "%s\n%s\n%s" % (headers, html, body) From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 17:59:08 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam Modified Files: profile.py Log Message: For the benefit of future generations, renamed some options: Old New --- --- robinson_probability_x unknown_word_prob robinson_probability_s unknown_word_strength robinson_minimum_prob_strength minimum_prob_strength Index: profile.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** profile.py 7 Nov 2002 22:30:11 -0000 1.3 --- profile.py 11 Nov 2002 01:59:06 -0000 1.4 *************** *** 44,48 **** class WordInfo(Persistent): ! 
def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 44,48 ---- class WordInfo(Persistent): ! def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 17:59:08 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5402 Modified Files: Options.py classifier.py weakloop.py Log Message: For the benefit of future generations, renamed some options: Old New --- --- robinson_probability_x unknown_word_prob robinson_probability_s unknown_word_strength robinson_minimum_prob_strength minimum_prob_strength Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.67 retrieving revision 1.68 diff -C2 -d -r1.67 -r1.68 *** Options.py 8 Nov 2002 04:06:23 -0000 1.67 --- Options.py 11 Nov 2002 01:59:06 -0000 1.68 *************** *** 241,268 **** # These two control the prior assumption about word probabilities. ! # "x" is essentially the probability given to a word that has never been ! # seen before. Nobody has reported an improvement via moving it away ! # from 1/2. ! # "s" adjusts how much weight to give the prior assumption relative to ! # the probabilities estimated by counting. At s=0, the counting estimates ! # are believed 100%, even to the extent of assigning certainty (0 or 1) ! # to a word that has appeared in only ham or only spam. This is a disaster. ! # As s tends toward infintity, all probabilities tend toward x. All ! # reports were that a value near 0.4 worked best, so this does not seem to ! # be corpus-dependent. ! 
# NOTE: Gary Robinson previously used a different formula involving 'a' ! # and 'x'. The 'x' here is the same as before. The 's' here is the old ! # 'a' divided by 'x'. ! robinson_probability_x: 0.5 ! robinson_probability_s: 0.45 # When scoring a message, ignore all words with ! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many ! # tests over Robinsons base scheme. 0.1 appeared to work well across ! # all corpora. ! robinson_minimum_prob_strength: 0.1 ! # The combining scheme currently detailed on Gary Robinons web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme --- 241,268 ---- # These two control the prior assumption about word probabilities. ! # unknown_word_prob is essentially the probability given to a word that ! # has never been seen before. Nobody has reported an improvement via moving ! # it away from 1/2, although Tim has measured a mean spamprob of a bit over ! # 0.5 (0.51-0.55) in 3 well-trained classifiers. ! # ! # unknown_word_strength adjusts how much weight to give the prior assumption ! # relative to the probabilities estimated by counting. At 0, the counting ! # estimates are believed 100%, even to the extent of assigning certainty ! # (0 or 1) to a word that has appeared in only ham or only spam. This ! # is a disaster. ! # ! # As unknown_word_strength tends toward infintity, all probabilities tend ! # toward unknown_word_prob. All reports were that a value near 0.4 worked ! # best, so this does not seem to be corpus-dependent. ! unknown_word_prob: 0.5 ! unknown_word_strength: 0.45 # When scoring a message, ignore all words with ! # abs(word.spamprob - 0.5) < minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many ! # tests. 0.1 appeared to work well across all corpora. ! minimum_prob_strength: 0.1 ! 
# The combining scheme currently detailed on the Robinon web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme *************** *** 272,284 **** # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) ! # follows the chi-squared distribution with 2*n degrees of freedom. That is ! # the "provably most-sensitive" test Garys original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where ! # lots of the mistakes live. This is the best method so far on Tims data. ! # One systematic benefit is that it is immune to "cancellation disease". One ! # systematic drawback is that it is sensitive to *any* deviation from a ! # uniform distribution, regardless of whether that is actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). --- 272,284 ---- # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) ! # follows the chi-squared distribution with 2*n degrees of freedom. This is ! # the "provably most-sensitive" test the original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where ! # lots of the mistakes live. This is the best method so far. ! # One systematic benefit is is immunity to "cancellation disease". One ! # systematic drawback is sensitivity to *any* deviation from a ! # uniform distribution, regardless of whether actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). 
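The option comments in this hunk describe two steps: smoothing each word's counted probability toward unknown_word_prob with weight unknown_word_strength, and combining the per-word probabilities with a chi-squared test, using Rob Hooft's (S-H+1)/2 for the final score. The following is a minimal sketch of those two steps, not the classifier.py code itself. It is written for current Python 3 rather than the Python of this checkin, and the function names adjusted_prob, chi2q and chi2_combine are made up for illustration. The smoothing formula follows the S and StimesX variables visible in the classifier.py diff; the closed-form chi-squared tail used here is valid only for an even number of degrees of freedom, which always holds for 2*n.

```python
import math


def adjusted_prob(spamcount, hamcount, nspam, nham,
                  unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Smooth a word's counted spam probability toward the prior.

    Illustrative helper (name is hypothetical). With strength 0 the raw
    counts are believed completely; as strength grows, the result tends
    toward unknown_word_prob.
    """
    n = spamcount + hamcount
    if n == 0:
        # Never-seen word: only the prior is available.
        return unknown_word_prob
    spamratio = spamcount / max(nspam, 1)
    hamratio = hamcount / max(nham, 1)
    raw = spamratio / (spamratio + hamratio)
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * raw) / (s + n)


def chi2q(x2, dof):
    """Return P(chi-squared >= x2) for an even number of degrees of freedom."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi2_combine(probs):
    """Combine per-word spam probabilities (each strictly between 0 and 1)."""
    if not probs:
        return 0.5
    n = len(probs)
    # -2*sum(ln(p_i)) follows chi-squared with 2*n degrees of freedom
    # when the p_i are uniformly distributed.
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2q(spam_stat, 2 * n)
    H = 1.0 - chi2q(ham_stat, 2 * n)
    # Rob Hooft's combination of the two measures.
    return (S - H + 1.0) / 2.0
```

A message whose clues all sit at 0.5 scores exactly 0.5, while ten clues at 0.99 score very close to 1. Because the smoothing keeps every word probability away from 0 and 1 whenever unknown_word_strength is positive, the logarithms above stay finite.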
*************** *** 381,387 **** }, 'Classifier': {'max_discriminators': int_cracker, ! 'robinson_probability_x': float_cracker, ! 'robinson_probability_s': float_cracker, ! 'robinson_minimum_prob_strength': float_cracker, 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, --- 381,387 ---- }, 'Classifier': {'max_discriminators': int_cracker, ! 'unknown_word_prob': float_cracker, ! 'unknown_word_strength': float_cracker, ! 'minimum_prob_strength': float_cracker, 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** classifier.py 7 Nov 2002 22:30:05 -0000 1.49 --- classifier.py 11 Nov 2002 01:59:06 -0000 1.50 *************** *** 70,74 **** # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 70,74 ---- # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 *************** *** 322,327 **** nspam = float(self.nspam or 1) ! S = options.robinson_probability_s ! StimesX = S * options.robinson_probability_x for word, record in self.wordinfo.iteritems(): --- 322,327 ---- nspam = float(self.nspam or 1) ! S = options.unknown_word_strength ! StimesX = S * options.unknown_word_prob for word, record in self.wordinfo.iteritems(): *************** *** 449,454 **** def _getclues(self, wordstream): ! mindist = options.robinson_minimum_prob_strength ! unknown = options.robinson_probability_x clues = [] # (distance, prob, word, record) tuples --- 449,454 ---- def _getclues(self, wordstream): ! 
mindist = options.minimum_prob_strength ! unknown = options.unknown_word_prob clues = [] # (distance, prob, word, record) tuples Index: weakloop.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1 --- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2 *************** *** 29,35 **** default=""" [Classifier] ! robinson_probability_x = 0.5 ! robinson_minimum_prob_strength = 0.1 ! robinson_probability_s = 0.45 max_discriminators = 150 --- 29,35 ---- default=""" [Classifier] ! unknown_word_prob = 0.5 ! minimum_prob_strength = 0.1 ! unknown_word_strength = 0.45 max_discriminators = 150 *************** *** 41,47 **** import Options ! start = (Options.options.robinson_probability_x, ! Options.options.robinson_minimum_prob_strength, ! Options.options.robinson_probability_s, Options.options.spam_cutoff, Options.options.ham_cutoff) --- 41,47 ---- import Options ! start = (Options.options.unknown_word_prob, ! Options.options.minimum_prob_strength, ! Options.options.unknown_word_strength, Options.options.spam_cutoff, Options.options.ham_cutoff) *************** *** 52,58 **** f.write(""" [Classifier] ! robinson_probability_x = %.6f ! robinson_minimum_prob_strength = %.6f ! robinson_probability_s = %.6f [TestDriver] --- 52,58 ---- f.write(""" [Classifier] ! unknown_word_prob = %.6f ! minimum_prob_strength = %.6f ! 
unknown_word_strength = %.6f [TestDriver] From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 07 Nov 2002 20:06:29 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67 tokenizer.py,1.63,1.64 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31798 Modified Files: Options.py tokenizer.py Log Message: Removed option retain_pure_html_tags; nobody enables that anymore, and it's hard to believe it would ever help anymore (except as an HTML detector). Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** Options.py 7 Nov 2002 22:25:46 -0000 1.66 --- Options.py 8 Nov 2002 04:06:23 -0000 1.67 *************** *** 42,53 **** x-.* - # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags - # from pure text/html messages. Set true to retain HTML tags in this - # case. On the c.l.py corpus, it helps to set this true because any - # sign of HTML is so despised on tech lists; however, the advantage - # of setting it true eventually vanishes even there given enough - # training data. - retain_pure_html_tags: False - # If true, the first few characters of application/octet-stream sections # are used, undecoded. What 'few' means is decided by octet_prefix_size. --- 42,45 ---- *************** *** 347,352 **** all_options = { ! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, ! 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, --- 339,343 ---- all_options = { ! 
'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.63 retrieving revision 1.64 diff -C2 -d -r1.63 -r1.64 *** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63 --- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64 *************** *** 495,504 **** # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaults to False. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was removed. ############################################################################## --- 495,505 ---- # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaulted to False. Later, as the ! # algorithm improved, retain_pure_html_tags was removed. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was also removed. ############################################################################## *************** *** 1167,1175 **** """Generate a stream of tokens from an email Message. - HTML tags are always stripped from text/plain sections. - options.retain_pure_html_tags controls whether HTML tags are - also stripped from text/html sections. 
Except in special cases, - it's recommended to leave that at its default of false. - If options.check_octets is True, the first few undecoded characters of application/octet-stream parts of the message body become tokens. --- 1168,1171 ---- *************** *** 1228,1235 **** # Remove HTML/XML tags. Also  . ! if (part.get_content_type() == "text/plain" or ! not options.retain_pure_html_tags): ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. --- 1224,1229 ---- # Remove HTML/XML tags. Also  . ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Fri, 08 Nov 2002 00:00:25 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25390 Modified Files: pop3proxy.py Log Message: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch. o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. o It now copes with POP3 servers that don't issue a welcome command. o The training form now appears in the training results, so you can train on another message without having to go back to the Home page. 
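The move to non-blocking sockets described in the log message means the proxy can no longer call readline() on the server connection; it has to buffer whatever bytes arrive and hand complete lines to a callback. The following is a rough, socket-free sketch of that buffering idea only, written for current Python 3. The class name LineAccumulator and its feed/close methods are hypothetical and are not part of pop3proxy.py; the empty-string end-of-stream signal mirrors what the handle_close method in this checkin passes to its callback.

```python
class LineAccumulator:
    """Collect arbitrary chunks and emit complete terminator-ended lines.

    Hypothetical stand-in for the buffering an async line reader performs;
    no networking is involved.
    """

    def __init__(self, line_callback, terminator="\r\n"):
        self.line_callback = line_callback
        self.terminator = terminator
        self.buffer = ""

    def feed(self, data):
        """Add a chunk; call the callback once per completed line."""
        self.buffer += data
        while True:
            head, sep, tail = self.buffer.partition(self.terminator)
            if not sep:
                # No full line yet; keep the partial data for later.
                break
            self.line_callback(head + self.terminator)
            self.buffer = tail

    def close(self):
        """Signal end of stream with an empty string."""
        self.line_callback("")
```

For example, feeding "+OK rea" and then "dy\r\n" produces a single callback with "+OK ready\r\n", however the bytes happened to be split on the wire.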
Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11 --- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 *************** *** 47,50 **** --- 47,74 ---- + todo = """ + o (Re)training interface - one message per line, quick-rendering table. + o Slightly-wordy index page; intro paragraph for each page. + o Once the training stuff is on a separate page, make the paste box + bigger. + o "Links" section (on homepage?) to project homepage, mailing list, + etc. + o "Home" link (with helmet!) at the end of each page. + o "Classify this" - just like Train. + o "Send me an email every [...] to remind me to train on new + messages." + o "Send me a status email every [...] telling how many mails have been + classified, etc." + o Deployment: Windows executable? atlaxwin and ctypes? Or just + webbrowser? + o Possibly integrate Tim Stone's SMTP code - make it use async, make + the training code update (rather than replace!) the database. + o Can it cleanly dynamically update its status display while having a + POP3 conversation? Hammering reload sucks. + o Add a command to save the database without shutting down, and one to + reload the database. + o Leave the word in the input field after a Word query. + """ + import sys, re, operator, errno, getopt, cPickle, cStringIO, time import socket, asyncore, asynchat, cgi, urlparse, webbrowser *************** *** 92,95 **** --- 116,120 ---- self.factory(*args) + class BrighterAsyncChat(asynchat.async_chat): """An asynchat.async_chat that doesn't give spurious warnings on *************** *** 110,113 **** --- 135,164 ---- + class ServerLineReader(BrighterAsyncChat): + """An async socket that reads lines from a remote server and + simply calls a callback with the data. 
The BayesProxy object + can't connect to the real POP3 server and talk to it + synchronously, because that would block the process.""" + + def __init__(self, serverName, serverPort, lineCallback): + BrighterAsyncChat.__init__(self) + self.lineCallback = lineCallback + self.request = '' + self.set_terminator('\r\n') + self.create_socket(socket.AF_INET, socket.SOCK_STREAM) + self.connect((serverName, serverPort)) + + def collect_incoming_data(self, data): + self.request = self.request + data + + def found_terminator(self): + self.lineCallback(self.request + '\r\n') + self.request = '' + + def handle_close(self): + self.lineCallback('') + self.close() + + class POP3ProxyBase(BrighterAsyncChat): """An async dispatcher that understands POP3 and proxies to a POP3 *************** *** 126,134 **** BrighterAsyncChat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') ! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! self.serverSocket.connect((serverName, serverPort)) ! self.serverIn = self.serverSocket.makefile('r') # For reading only ! self.push(self.serverIn.readline()) def onTransaction(self, command, args, response): --- 177,189 ---- BrighterAsyncChat.__init__(self, clientSocket) self.request = '' + self.response = '' self.set_terminator('\r\n') ! self.command = '' # The POP3 command being processed... ! self.args = '' # ...and its arguments ! self.isClosing = False # Has the server closed the socket? ! self.seenAllHeaders = False # For the current RETR or TOP ! self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, ! self.onServerLine) def onTransaction(self, command, args, response): *************** *** 139,152 **** raise NotImplementedError ! def isMultiline(self, command, args): ! """Returns True if the given request should get a multiline response (assuming the response is positive). """ ! if command in ['USER', 'PASS', 'APOP', 'QUIT', ! 
'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif command in ['RETR', 'TOP']: return True ! elif command in ['LIST', 'UIDL']: return len(args) == 0 else: --- 194,237 ---- raise NotImplementedError ! def onServerLine(self, line): ! """A line of response has been received from the POP3 server.""" ! isFirstLine = not self.response ! self.response = self.response + line ! ! # Is this line that terminates a set of headers? ! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! ! # Has the server closed its end of the socket? ! if not line: ! self.isClosing = True ! ! # If we're not processing a command, just echo the response. ! if not self.command: ! self.push(self.response) ! self.response = '' ! ! # Time out after 30 seconds for message-retrieval commands if ! # all the headers are down. The rest of the message will proxy ! # straight through. ! if self.command in ['TOP', 'RETR'] and \ ! self.seenAllHeaders and time.time() > self.startTime + 30: ! self.onResponse() ! self.response = '' ! # If that's a complete response, handle it. ! elif not self.isMultiline() or line == '.\r\n' or \ ! (isFirstLine and line.startswith('-ERR')): ! self.onResponse() ! self.response = '' ! ! def isMultiline(self): ! """Returns True if the request should get a multiline response (assuming the response is positive). """ ! if self.command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif self.command in ['RETR', 'TOP']: return True ! elif self.command in ['LIST', 'UIDL']: return len(args) == 0 else: *************** *** 155,204 **** return False - def readResponse(self, command, args): - """Reads the POP3 server's response and returns a tuple of - (response, isClosing, timedOut). isClosing is True if the - server closes the socket, which tells found_terminator() to - close when the response has been sent. 
timedOut is set if a - TOP or RETR request was still arriving after 30 seconds, and - tells found_terminator() to proxy the remainder of the response. - """ - responseLines = [] - startTime = time.time() - isMulti = self.isMultiline(command, args) - isClosing = False - timedOut = False - isFirstLine = True - seenAllHeaders = False - while True: - line = self.serverIn.readline() - if not line: - # The socket's been closed by the server, probably by QUIT. - isClosing = True - break - elif not isMulti or (isFirstLine and line.startswith('-ERR')): - # A single-line response. - responseLines.append(line) - break - elif line == '.\r\n': - # The termination line. - responseLines.append(line) - break - else: - # A normal line - append it to the response and carry on. - responseLines.append(line) - seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n'] - - # Time out after 30 seconds for message-retrieval commands - # if all the headers are down - found_terminator() knows how - # to deal with this. - if command in ['TOP', 'RETR'] and \ - seenAllHeaders and time.time() > startTime + 30: - timedOut = True - break - - isFirstLine = False - - return ''.join(responseLines), isClosing, timedOut - def collect_incoming_data(self, data): """Asynchat override.""" --- 240,243 ---- *************** *** 207,256 **** def found_terminator(self): """Asynchat override.""" - # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! self.serverSocket.sendall(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! command, args = ('', '') else: splitCommand = self.request.strip().split(None, 1) ! command = splitCommand[0].upper() ! args = splitCommand[1:] ! rawResponse, isClosing, timedOut = self.readResponse(command, args) ! 
# Pass the request and the raw response to the subclass and # send back the cooked response. ! cookedResponse = self.onTransaction(command, args, rawResponse) ! self.push(cookedResponse) ! self.request = '' ! ! # If readResponse() timed out, we still need to read and proxy ! # the rest of the message. ! if timedOut: ! while True: ! line = self.serverIn.readline() ! if not line: ! # The socket's been closed by the server. ! isClosing = True ! break ! elif line == '.\r\n': ! # The termination line. ! self.push(line) ! break ! else: ! # A normal line. ! self.push(line) ! ! # If readResponse() or the loop above decided that the server ! # has closed its socket, close this one when the response has ! # been sent. ! if isClosing: self.close_when_done() class BayesProxyListener(Listener): --- 246,288 ---- def found_terminator(self): """Asynchat override.""" if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") + self.serverSocket.shutdown(2) + self.serverSocket.close() self.shutdown(2) self.close() raise SystemExit ! ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! self.command = self.args = '' else: + # A proper command. splitCommand = self.request.strip().split(None, 1) ! self.command = splitCommand[0].upper() ! self.args = splitCommand[1:] ! self.startTime = time.time() ! ! self.request = '' ! ! def onResponse(self): # Pass the request and the raw response to the subclass and # send back the cooked response. ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) ! ! # If onServerLine() decided that the server has closed its ! # socket, close this one when the response has been sent. ! if self.isClosing: self.close_when_done() + # Reset. 
+ self.command = '' + self.args = '' + self.isClosing = False + self.seenAllHeaders = False + class BayesProxyListener(Listener): *************** *** 452,456 **** table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15 } .header { font-size: 133%% } .content { margin: 15 } --- 484,490 ---- table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15; ! border-top: 1px solid black; ! border-bottom: 1px solid black } .header { font-size: 133%% } .content { margin: 15 } *************** *** 466,470 ****
    \n""" --- 500,504 ----
    \n""" *************** *** 475,481 **** Spambayes.org ! \n""" pageSection = """ --- 509,520 ---- Spambayes.org
    %s
    \n""" + shutdownDB = """""" + + shutdownPickle = shutdownDB + """   + """ + pageSection = """ *************** *** 483,486 **** --- 522,533 ----  
    \n""" + summary = """POP3 proxy running on port %(proxyPort)d, + proxying to %(serverName)s:%(serverPort)d.
    + Active POP3 conversations: %(activeSessions)d.
    + POP3 conversations this session: %(totalSessions)d.
    + Emails classified this session: %(numSpams)d spam, + %(numHams)d ham, %(numUnsure)d unsure. + """ + wordQuery = """ *************** *** 488,491 **** --- 535,550 ---- """ + train = """ + Either upload a message file:
    + Or paste the whole message (including headers) here:
    +
    + Is this message + Ham or + Spam?
    + + """ + def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 502,506 **** """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = self.request.split('\r\n', 1) try: method, url, version = requestLine.strip().split() --- 561,565 ---- """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = (self.request+'\r\n').split('\r\n', 1) try: method, url, version = requestLine.strip().split() *************** *** 547,551 **** if path == '/helmet.gif': ! self.pushOKHeaders('image/gif') self.push(self.helmet) else: --- 606,614 ---- if path == '/helmet.gif': ! # XXX Why doesn't Expires work? Must read RFC 2616 one day. ! inOneHour = time.gmtime(time.time() + 3600) ! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour) ! extraHeaders = {'Expires': expiryDate} ! self.pushOKHeaders('image/gif', extraHeaders) self.push(self.helmet) else: *************** *** 554,558 **** handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % url) else: # This is a request for a valid page; run the handler. --- 617,621 ---- handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % path) else: # This is a request for a valid page; run the handler. *************** *** 561,569 **** handler(params) timeString = time.asctime(time.localtime()) ! self.push(self.footer % timeString) ! def pushOKHeaders(self, contentType): ! self.push("HTTP/1.0 200 OK\r\n") self.push("Content-Type: %s\r\n" % contentType) self.push("\r\n") --- 624,641 ---- handler(params) timeString = time.asctime(time.localtime()) ! if status.useDB: ! self.push(self.footer % (timeString, self.shutdownDB)) ! else: ! self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): ! timeNow = time.gmtime(time.time()) ! 
httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow) ! self.push("HTTP/1.1 200 OK\r\n") ! self.push("Connection: close\r\n") self.push("Content-Type: %s\r\n" % contentType) + self.push("Date: %s\r\n" % httpNow) + for name, value in extraHeaders.items(): + self.push("%s: %s\r\n" % (name, value)) self.push("\r\n") *************** *** 583,616 **** def onHome(self, params): ! summary = """POP3 proxy running on port %(proxyPort)d, ! proxying to %(serverName)s:%(serverPort)d.
    ! Active POP3 conversations: %(activeSessions)d.
    ! POP3 conversations this session: ! %(totalSessions)d.
    ! Emails classified this session: %(numSpams)d spam, ! %(numHams)d ham, %(numUnsure)d unsure. ! """ % status.__dict__ ! ! train = """
    ! Either upload a message file: !
    ! Or paste the whole message (including headers) here:
    !
    ! Is this message ! Ham or ! Spam?
    ! ! """ ! ! body = (self.pageSection % ('Status', summary) + ! self.pageSection % ('Word query', self.wordQuery) + ! self.pageSection % ('Train', train)) self.push(body) def onShutdown(self, params): ! self.push("

    Shutdown. Goodbye.

    ") ! self.push(' ') # Acts as a flush for small buffers. self.shutdown(2) self.close() --- 655,675 ---- def onHome(self, params): ! """Serve up the homepage.""" ! body = (self.pageSection % ('Status', self.summary % status.__dict__)+ ! self.pageSection % ('Word query', self.wordQuery)+ ! self.pageSection % ('Train', self.train)) self.push(body) def onShutdown(self, params): ! """Shutdown the server, saving the pickle if requested to do so.""" ! if params['how'].lower().find('save') >= 0: ! if not status.useDB and status.pickleName: ! self.push("Saving...") ! self.push(' ') # Acts as a flush for small buffers. ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("Shutdown. Goodbye.") ! self.push(' ') self.shutdown(2) self.close() *************** *** 618,625 **** def onUpload(self, params): message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') # Append the message to a file, to make it easier to rebuild ! # the database later. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: --- 677,690 ---- def onUpload(self, params): + """Train on an uploaded or pasted message.""" + # Upload or paste? Spam or ham? message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild ! # the database later. This is a temporary implementation - ! # it should keep a Corpus (from Tim Stone's forthcoming message ! # management module) to manage a cache of messages. It needs ! # to keep them for the HTML retraining interface anyway. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: *************** *** 627,642 **** else: f = open("_pop3proxyham.mbox", "a") ! f.write("From ???@???\n") # fake From line (XXX good enough?) f.write(message) ! f.write("\n") f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("""

    Trained on your message. Saving database...

    """) ! self.push(" ") # Flush... must find out how to do this properly... ! if not status.useDB and status.pickleName: ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("

    Done.

    Home

    ") def onWordquery(self, params): --- 692,704 ---- else: f = open("_pop3proxyham.mbox", "a") ! f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n") f.write(message) ! f.write("\n\n") f.close() + + # Train on the message. self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("

    OK. Return Home or train another:

    ") ! self.push(self.pageSection % ('Train another', self.train)) def onWordquery(self, params): *************** *** 656,660 **** info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s':" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) --- 718,722 ---- info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) *************** *** 765,771 **** else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) self.request = '' def onStat(self, command, args): """POP3 STAT command.""" --- 827,839 ---- else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) # Or push_slowly for testing self.request = '' + def push_slowly(self, response): + """Useful for testing.""" + for c in response: + self.push(c) + time.sleep(0.02) + def onStat(self, command, args): """POP3 STAT command.""" *************** *** 777,781 **** """POP3 LIST command, with optional message number argument.""" if args: ! number = int(args) if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) --- 845,852 ---- """POP3 LIST command, with optional message number argument.""" if args: ! try: ! number = int(args) ! except ValueError: ! number = -1 if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) *************** *** 803,811 **** def onRetr(self, command, args): """POP3 RETR command.""" ! return self._getMessage(int(args), 12345) def onTop(self, command, args): """POP3 RETR command.""" ! number, lines = map(int, args.split()) return self._getMessage(number, lines) --- 874,889 ---- def onRetr(self, command, args): """POP3 RETR command.""" ! try: ! number = int(args) ! except ValueError: ! number = -1 ! 
return self._getMessage(number, 12345) def onTop(self, command, args): """POP3 RETR command.""" ! try: ! number, lines = map(int, args.split()) ! except ValueError: ! number, lines = -1, -1 return self._getMessage(number, lines) *************** *** 863,867 **** while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) != -1 # Kill the proxy and the test server. --- 941,945 ---- while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) >= 0 # Kill the proxy and the test server. From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 07 Nov 2002 20:06:29 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67 tokenizer.py,1.63,1.64 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31798 Modified Files: Options.py tokenizer.py Log Message: Removed option retain_pure_html_tags; nobody enables that anymore, and it's hard to believe it would ever help anymore (except as an HTML detector). Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** Options.py 7 Nov 2002 22:25:46 -0000 1.66 --- Options.py 8 Nov 2002 04:06:23 -0000 1.67 *************** *** 42,53 **** x-.* - # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags - # from pure text/html messages. Set true to retain HTML tags in this - # case. On the c.l.py corpus, it helps to set this true because any - # sign of HTML is so despised on tech lists; however, the advantage - # of setting it true eventually vanishes even there given enough - # training data. - retain_pure_html_tags: False - # If true, the first few characters of application/octet-stream sections # are used, undecoded. 
What 'few' means is decided by octet_prefix_size. --- 42,45 ---- *************** *** 347,352 **** all_options = { ! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, ! 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, --- 339,343 ---- all_options = { ! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.63 retrieving revision 1.64 diff -C2 -d -r1.63 -r1.64 *** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63 --- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64 *************** *** 495,504 **** # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaults to False. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was removed. ############################################################################## --- 495,505 ---- # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaulted to False. Later, as the ! # algorithm improved, retain_pure_html_tags was removed. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. 
Later: ignore_redundant_html was also removed. ############################################################################## *************** *** 1167,1175 **** """Generate a stream of tokens from an email Message. - HTML tags are always stripped from text/plain sections. - options.retain_pure_html_tags controls whether HTML tags are - also stripped from text/html sections. Except in special cases, - it's recommended to leave that at its default of false. - If options.check_octets is True, the first few undecoded characters of application/octet-stream parts of the message body become tokens. --- 1168,1171 ---- *************** *** 1228,1235 **** # Remove HTML/XML tags. Also  . ! if (part.get_content_type() == "text/plain" or ! not options.retain_pure_html_tags): ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. --- 1224,1229 ---- # Remove HTML/XML tags. Also  . ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Fri, 08 Nov 2002 00:00:25 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25390 Modified Files: pop3proxy.py Log Message: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch. o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. o It now copes with POP3 servers that don't issue a welcome command. 
o The training form now appears in the training results, so you can train on another message without having to go back to the Home page. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11 --- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 *************** *** 47,50 **** --- 47,74 ---- + todo = """ + o (Re)training interface - one message per line, quick-rendering table. + o Slightly-wordy index page; intro paragraph for each page. + o Once the training stuff is on a separate page, make the paste box + bigger. + o "Links" section (on homepage?) to project homepage, mailing list, + etc. + o "Home" link (with helmet!) at the end of each page. + o "Classify this" - just like Train. + o "Send me an email every [...] to remind me to train on new + messages." + o "Send me a status email every [...] telling how many mails have been + classified, etc." + o Deployment: Windows executable? atlaxwin and ctypes? Or just + webbrowser? + o Possibly integrate Tim Stone's SMTP code - make it use async, make + the training code update (rather than replace!) the database. + o Can it cleanly dynamically update its status display while having a + POP3 conversation? Hammering reload sucks. + o Add a command to save the database without shutting down, and one to + reload the database. + o Leave the word in the input field after a Word query. 
+ """ + import sys, re, operator, errno, getopt, cPickle, cStringIO, time import socket, asyncore, asynchat, cgi, urlparse, webbrowser *************** *** 92,95 **** --- 116,120 ---- self.factory(*args) + class BrighterAsyncChat(asynchat.async_chat): """An asynchat.async_chat that doesn't give spurious warnings on *************** *** 110,113 **** --- 135,164 ---- + class ServerLineReader(BrighterAsyncChat): + """An async socket that reads lines from a remote server and + simply calls a callback with the data. The BayesProxy object + can't connect to the real POP3 server and talk to it + synchronously, because that would block the process.""" + + def __init__(self, serverName, serverPort, lineCallback): + BrighterAsyncChat.__init__(self) + self.lineCallback = lineCallback + self.request = '' + self.set_terminator('\r\n') + self.create_socket(socket.AF_INET, socket.SOCK_STREAM) + self.connect((serverName, serverPort)) + + def collect_incoming_data(self, data): + self.request = self.request + data + + def found_terminator(self): + self.lineCallback(self.request + '\r\n') + self.request = '' + + def handle_close(self): + self.lineCallback('') + self.close() + + class POP3ProxyBase(BrighterAsyncChat): """An async dispatcher that understands POP3 and proxies to a POP3 *************** *** 126,134 **** BrighterAsyncChat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') ! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! self.serverSocket.connect((serverName, serverPort)) ! self.serverIn = self.serverSocket.makefile('r') # For reading only ! self.push(self.serverIn.readline()) def onTransaction(self, command, args, response): --- 177,189 ---- BrighterAsyncChat.__init__(self, clientSocket) self.request = '' + self.response = '' self.set_terminator('\r\n') ! self.command = '' # The POP3 command being processed... ! self.args = '' # ...and its arguments ! self.isClosing = False # Has the server closed the socket? ! 
self.seenAllHeaders = False # For the current RETR or TOP ! self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, ! self.onServerLine) def onTransaction(self, command, args, response): *************** *** 139,152 **** raise NotImplementedError ! def isMultiline(self, command, args): ! """Returns True if the given request should get a multiline response (assuming the response is positive). """ ! if command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif command in ['RETR', 'TOP']: return True ! elif command in ['LIST', 'UIDL']: return len(args) == 0 else: --- 194,237 ---- raise NotImplementedError ! def onServerLine(self, line): ! """A line of response has been received from the POP3 server.""" ! isFirstLine = not self.response ! self.response = self.response + line ! ! # Is this line that terminates a set of headers? ! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! ! # Has the server closed its end of the socket? ! if not line: ! self.isClosing = True ! ! # If we're not processing a command, just echo the response. ! if not self.command: ! self.push(self.response) ! self.response = '' ! ! # Time out after 30 seconds for message-retrieval commands if ! # all the headers are down. The rest of the message will proxy ! # straight through. ! if self.command in ['TOP', 'RETR'] and \ ! self.seenAllHeaders and time.time() > self.startTime + 30: ! self.onResponse() ! self.response = '' ! # If that's a complete response, handle it. ! elif not self.isMultiline() or line == '.\r\n' or \ ! (isFirstLine and line.startswith('-ERR')): ! self.onResponse() ! self.response = '' ! ! def isMultiline(self): ! """Returns True if the request should get a multiline response (assuming the response is positive). """ ! if self.command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif self.command in ['RETR', 'TOP']: return True ! 
elif self.command in ['LIST', 'UIDL']: return len(args) == 0 else: *************** *** 155,204 **** return False - def readResponse(self, command, args): - """Reads the POP3 server's response and returns a tuple of - (response, isClosing, timedOut). isClosing is True if the - server closes the socket, which tells found_terminator() to - close when the response has been sent. timedOut is set if a - TOP or RETR request was still arriving after 30 seconds, and - tells found_terminator() to proxy the remainder of the response. - """ - responseLines = [] - startTime = time.time() - isMulti = self.isMultiline(command, args) - isClosing = False - timedOut = False - isFirstLine = True - seenAllHeaders = False - while True: - line = self.serverIn.readline() - if not line: - # The socket's been closed by the server, probably by QUIT. - isClosing = True - break - elif not isMulti or (isFirstLine and line.startswith('-ERR')): - # A single-line response. - responseLines.append(line) - break - elif line == '.\r\n': - # The termination line. - responseLines.append(line) - break - else: - # A normal line - append it to the response and carry on. - responseLines.append(line) - seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n'] - - # Time out after 30 seconds for message-retrieval commands - # if all the headers are down - found_terminator() knows how - # to deal with this. - if command in ['TOP', 'RETR'] and \ - seenAllHeaders and time.time() > startTime + 30: - timedOut = True - break - - isFirstLine = False - - return ''.join(responseLines), isClosing, timedOut - def collect_incoming_data(self, data): """Asynchat override.""" --- 240,243 ---- *************** *** 207,256 **** def found_terminator(self): """Asynchat override.""" - # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! 
self.serverSocket.sendall(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! command, args = ('', '') else: splitCommand = self.request.strip().split(None, 1) ! command = splitCommand[0].upper() ! args = splitCommand[1:] ! rawResponse, isClosing, timedOut = self.readResponse(command, args) ! # Pass the request and the raw response to the subclass and # send back the cooked response. ! cookedResponse = self.onTransaction(command, args, rawResponse) ! self.push(cookedResponse) ! self.request = '' ! ! # If readResponse() timed out, we still need to read and proxy ! # the rest of the message. ! if timedOut: ! while True: ! line = self.serverIn.readline() ! if not line: ! # The socket's been closed by the server. ! isClosing = True ! break ! elif line == '.\r\n': ! # The termination line. ! self.push(line) ! break ! else: ! # A normal line. ! self.push(line) ! ! # If readResponse() or the loop above decided that the server ! # has closed its socket, close this one when the response has ! # been sent. ! if isClosing: self.close_when_done() class BayesProxyListener(Listener): --- 246,288 ---- def found_terminator(self): """Asynchat override.""" if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") + self.serverSocket.shutdown(2) + self.serverSocket.close() self.shutdown(2) self.close() raise SystemExit ! ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! self.command = self.args = '' else: + # A proper command. splitCommand = self.request.strip().split(None, 1) ! self.command = splitCommand[0].upper() ! self.args = splitCommand[1:] ! self.startTime = time.time() ! ! self.request = '' ! ! def onResponse(self): # Pass the request and the raw response to the subclass and # send back the cooked response. ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) ! ! 
# If onServerLine() decided that the server has closed its ! # socket, close this one when the response has been sent. ! if self.isClosing: self.close_when_done() + # Reset. + self.command = '' + self.args = '' + self.isClosing = False + self.seenAllHeaders = False + class BayesProxyListener(Listener): *************** *** 452,456 **** table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15 } .header { font-size: 133%% } .content { margin: 15 } --- 484,490 ---- table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15; ! border-top: 1px solid black; ! border-bottom: 1px solid black } .header { font-size: 133%% } .content { margin: 15 } *************** *** 466,470 ****
    \n""" --- 500,504 ----
    \n""" *************** *** 475,481 **** Spambayes.org
    %s
    \n""" pageSection = """ --- 509,520 ---- Spambayes.org
    %s
    \n""" + shutdownDB = """""" + + shutdownPickle = shutdownDB + """   + """ + pageSection = """ *************** *** 483,486 **** --- 522,533 ----  
    \n""" + summary = """POP3 proxy running on port %(proxyPort)d, + proxying to %(serverName)s:%(serverPort)d.
    + Active POP3 conversations: %(activeSessions)d.
    + POP3 conversations this session: %(totalSessions)d.
    + Emails classified this session: %(numSpams)d spam, + %(numHams)d ham, %(numUnsure)d unsure. + """ + wordQuery = """ *************** *** 488,491 **** --- 535,550 ---- """ + train = """ + Either upload a message file:
    + Or paste the whole message (incuding headers) here:
    +
    + Is this message + Ham or + Spam?
    + + """ + def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 502,506 **** """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = self.request.split('\r\n', 1) try: method, url, version = requestLine.strip().split() --- 561,565 ---- """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = (self.request+'\r\n').split('\r\n', 1) try: method, url, version = requestLine.strip().split() *************** *** 547,551 **** if path == '/helmet.gif': ! self.pushOKHeaders('image/gif') self.push(self.helmet) else: --- 606,614 ---- if path == '/helmet.gif': ! # XXX Why doesn't Expires work? Must read RFC 2616 one day. ! inOneHour = time.gmtime(time.time() + 3600) ! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour) ! extraHeaders = {'Expires': expiryDate} ! self.pushOKHeaders('image/gif', extraHeaders) self.push(self.helmet) else: *************** *** 554,558 **** handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % url) else: # This is a request for a valid page; run the handler. --- 617,621 ---- handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % path) else: # This is a request for a valid page; run the handler. *************** *** 561,569 **** handler(params) timeString = time.asctime(time.localtime()) ! self.push(self.footer % timeString) ! def pushOKHeaders(self, contentType): ! self.push("HTTP/1.0 200 OK\r\n") self.push("Content-Type: %s\r\n" % contentType) self.push("\r\n") --- 624,641 ---- handler(params) timeString = time.asctime(time.localtime()) ! if status.useDB: ! self.push(self.footer % (timeString, self.shutdownDB)) ! else: ! self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): ! timeNow = time.gmtime(time.time()) ! 
httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow) ! self.push("HTTP/1.1 200 OK\r\n") ! self.push("Connection: close\r\n") self.push("Content-Type: %s\r\n" % contentType) + self.push("Date: %s\r\n" % httpNow) + for name, value in extraHeaders.items(): + self.push("%s: %s\r\n" % (name, value)) self.push("\r\n") *************** *** 583,616 **** def onHome(self, params): ! summary = """POP3 proxy running on port %(proxyPort)d, ! proxying to %(serverName)s:%(serverPort)d.
    ! Active POP3 conversations: %(activeSessions)d.
    ! POP3 conversations this session: ! %(totalSessions)d.
    ! Emails classified this session: %(numSpams)d spam, ! %(numHams)d ham, %(numUnsure)d unsure. ! """ % status.__dict__ ! ! train = """
    ! Either upload a message file: !
    ! Or paste the whole message (incuding headers) here:
    !
    ! Is this message ! Ham or ! Spam?
    ! ! """ ! ! body = (self.pageSection % ('Status', summary) + ! self.pageSection % ('Word query', self.wordQuery) + ! self.pageSection % ('Train', train)) self.push(body) def onShutdown(self, params): ! self.push("

    Shutdown. Goodbye.

    ") ! self.push(' ') # Acts as a flush for small buffers. self.shutdown(2) self.close() --- 655,675 ---- def onHome(self, params): ! """Serve up the homepage.""" ! body = (self.pageSection % ('Status', self.summary % status.__dict__)+ ! self.pageSection % ('Word query', self.wordQuery)+ ! self.pageSection % ('Train', self.train)) self.push(body) def onShutdown(self, params): ! """Shutdown the server, saving the pickle if requested to do so.""" ! if params['how'].lower().find('save') >= 0: ! if not status.useDB and status.pickleName: ! self.push("Saving...") ! self.push(' ') # Acts as a flush for small buffers. ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("Shutdown. Goodbye.") ! self.push(' ') self.shutdown(2) self.close() *************** *** 618,625 **** def onUpload(self, params): message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') # Append the message to a file, to make it easier to rebuild ! # the database later. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: --- 677,690 ---- def onUpload(self, params): + """Train on an uploaded or pasted message.""" + # Upload or paste? Spam or ham? message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild ! # the database later. This is a temporary implementation - ! # it should keep a Corpus (from Tim Stone's forthcoming message ! # management module) to manage a cache of messages. It needs ! # to keep them for the HTML retraining interface anyway. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: *************** *** 627,642 **** else: f = open("_pop3proxyham.mbox", "a") ! f.write("From ???@???\n") # fake From line (XXX good enough?) f.write(message) ! f.write("\n") f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("""

    Trained on your message. Saving database...

    """) ! self.push(" ") # Flush... must find out how to do this properly... ! if not status.useDB and status.pickleName: ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("

    Done.

    Home

    ") def onWordquery(self, params): --- 692,704 ---- else: f = open("_pop3proxyham.mbox", "a") ! f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n") f.write(message) ! f.write("\n\n") f.close() + + # Train on the message. self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("

    OK. Return Home or train another:

    ") ! self.push(self.pageSection % ('Train another', self.train)) def onWordquery(self, params): *************** *** 656,660 **** info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s':" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) --- 718,722 ---- info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) *************** *** 765,771 **** else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) self.request = '' def onStat(self, command, args): """POP3 STAT command.""" --- 827,839 ---- else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) # Or push_slowly for testing self.request = '' + def push_slowly(self, response): + """Useful for testing.""" + for c in response: + self.push(c) + time.sleep(0.02) + def onStat(self, command, args): """POP3 STAT command.""" *************** *** 777,781 **** """POP3 LIST command, with optional message number argument.""" if args: ! number = int(args) if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) --- 845,852 ---- """POP3 LIST command, with optional message number argument.""" if args: ! try: ! number = int(args) ! except ValueError: ! number = -1 if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) *************** *** 803,811 **** def onRetr(self, command, args): """POP3 RETR command.""" ! return self._getMessage(int(args), 12345) def onTop(self, command, args): """POP3 RETR command.""" ! number, lines = map(int, args.split()) return self._getMessage(number, lines) --- 874,889 ---- def onRetr(self, command, args): """POP3 RETR command.""" ! try: ! number = int(args) ! except ValueError: ! number = -1 ! 
        return self._getMessage(number, 12345)

      def onTop(self, command, args):
          """POP3 RETR command."""
!         try:
!             number, lines = map(int, args.split())
!         except ValueError:
!             number, lines = -1, -1
          return self._getMessage(number, lines)

***************
*** 863,867 ****
      while response.find('\n.\r\n') == -1:
          response = response + proxy.recv(1000)
!     assert response.find(options.hammie_header_name) != -1

      # Kill the proxy and the test server.
--- 941,945 ----
      while response.find('\n.\r\n') == -1:
          response = response + proxy.recv(1000)
!     assert response.find(options.hammie_header_name) >= 0

      # Kill the proxy and the test server.

From jvr@users.sourceforge.net Sat Nov 9 18:05:44 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Sat, 09 Nov 2002 10:05:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20814

Modified Files:
    pop3proxy.py
Log Message:
force word query to be lowercase, making the UI case insensitive

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
--- pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13
***************
*** 704,707 ****
--- 704,708 ----
      def onWordquery(self, params):
          word = params['word']
+         word = word.lower()
          try:
              # Must be a better way to get __dict__ for a new-style class...

From hooft@users.sourceforge.net Sat Nov 9 21:48:55 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sat, 09 Nov 2002 13:48:55 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,NONE,1.1
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31102

Added Files:
    weaktest.py
Log Message:
New test driver to simulate "unsure only" training

--- NEW FILE: weaktest.py ---
#! /usr/bin/env python

# A test driver using "the standard" test directory structure.
# This simulates a user that gets E-mail, and only trains on fp,
# fn and unsure messages. It starts by training on the first 30
# messages, and from that point on well classified messages will
# not be used for training. This can be used to see what the performance
# of the scoring algorithm is under such conditions. Questions are:
# * How does the size of the database behave over time?
# * Does the classification get better over time?
# * Are there other combinations of parameters for the classifier
# that make this better behaved than the default values?

"""Usage: %(program)s [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

from __future__ import generators

import sys,os

from Options import options
import hammie
import msgs

program = sys.argv[0]

debug = 0

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def drive(nsets):
    print options.display()

    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]

    spamfns = [(x,y,1) for x in spamdirs for y in os.listdir(x)]
    hamfns = [(x,y,0) for x in hamdirs for y in os.listdir(x)]

    nham = len(hamfns)
    nspam = len(spamfns)

    allfns={}
    for fn in spamfns+hamfns:
        allfns[fn] = None

    d = hammie.Hammie(hammie.createbayes('weaktest.db', False))

    n=0
    unsure=0
    hamtrain=0
    spamtrain=0
    fp=0
    fn=0
    for dir,name, is_spam in allfns.iterkeys():
        n += 1
        m=msgs.Msg(dir, name).guts
        if debug:
            print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name),
        if hamtrain + spamtrain > 30:
            scr=d.score(m)
        else:
            scr=0.50
        if debug:
            print "score:%.3f"%scr,
        if scr < hammie.SPAM_THRESHOLD and is_spam:
            if scr < hammie.HAM_THRESHOLD:
                fn += 1
                if debug:
                    print "fn"
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            spamtrain += 1
            d.train_spam(m)
            d.update_probabilities()
        elif scr > hammie.HAM_THRESHOLD and not is_spam:
            if scr > hammie.SPAM_THRESHOLD:
                fp += 1
                if debug:
                    print "fp"
                else:
                    print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
            else:
                unsure += 1
                if debug:
                    print "Unsure"
            hamtrain += 1
            d.train_ham(m)
            d.update_probabilities()
        else:
            if debug:
                print "OK"
        if n % 100 == 0:
            print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
                n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
    print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
    print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
        unsure,unsure*100.0/len(allfns))
    print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
    print "fp: %d fn: %d"%(fp,fn)
    FPW = options.best_cutoff_fp_weight
    FNW = options.best_cutoff_fn_weight
    UNW = options.best_cutoff_unsure_weight
    print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)

def main():
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
                                   ['ham-keep=', 'spam-keep='])
    except getopt.error, msg:
        usage(1, msg)

    nsets = seed = hamkeep = spamkeep = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    drive(nsets)

if __name__ == "__main__":
    main()

From hooft@users.sourceforge.net Sun Nov 10 12:02:36 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:02:36 -0800
Subject: [Spambayes-checkins] spambayes weaktest.py,1.1,1.2
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22741

Modified Files:
    weaktest.py
Log Message:
add flexcost; sanitize spacing

Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weaktest.py 9 Nov 2002 21:48:52 -0000 1.1
--- weaktest.py 10 Nov 2002 12:02:33 -0000 1.2
***************
*** 59,63 ****
      nspam = len(spamfns)

!     allfns={}
      for fn in spamfns+hamfns:
          allfns[fn] = None
--- 59,63 ----
      nspam = len(spamfns)

!     allfns = {}
      for fn in spamfns+hamfns:
          allfns[fn] = None
***************
*** 65,74 ****
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))

!     n=0
!     unsure=0
!     hamtrain=0
!     spamtrain=0
!     fp=0
!     fn=0
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
--- 65,80 ----
      d = hammie.Hammie(hammie.createbayes('weaktest.db', False))

!     n = 0
!     unsure = 0
!     hamtrain = 0
!     spamtrain = 0
!     fp = 0
!     fn = 0
!     flexcost = 0
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     SPC = options.spam_cutoff
!     HC = options.ham_cutoff
      for dir,name, is_spam in allfns.iterkeys():
          n += 1
***************
*** 82,87 ****
          if debug:
              print "score:%.3f"%scr,
!         if scr < hammie.SPAM_THRESHOLD and is_spam:
!             if scr < hammie.HAM_THRESHOLD:
                  fn += 1
                  if debug:
--- 88,96 ----
          if debug:
              print "score:%.3f"%scr,
!         if scr < SPC and is_spam:
!             t = FNW * (SPC - scr) / (SPC - HC)
!             #print "Spam at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr < HC:
                  fn += 1
                  if debug:
***************
*** 94,104 ****
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > hammie.HAM_THRESHOLD and not is_spam:
!             if scr > hammie.SPAM_THRESHOLD:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir,name),scr)
              else:
                  unsure += 1
--- 103,116 ----
              d.train_spam(m)
              d.update_probabilities()
!         elif scr > HC and not is_spam:
!             t = FPW * (scr - HC) / (SPC - HC)
!             #print "Ham at %.3f costs %.2f"%(scr,t)
!             flexcost += t
!             if scr > SPC:
                  fp += 1
                  if debug:
                      print "fp"
                  else:
!                     print "fp: %s score:%.4f"%(os.path.join(dir, name), scr)
              else:
                  unsure += 1
***************
*** 113,126 ****
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure,unsure*100.0/len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain,spamtrain)
!     print "fp: %d fn: %d"%(fp,fn)
!     FPW = options.best_cutoff_fp_weight
!     FNW = options.best_cutoff_fn_weight
!     UNW = options.best_cutoff_unsure_weight
!     print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure)

  def main():
--- 125,136 ----
          if n % 100 == 0:
              print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%(
!                 n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure)
!     print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam)
      print "Total unsure (including 30 startup messages): %d (%.1f%%)"%(
!         unsure, unsure * 100.0 / len(allfns))
!     print "Trained on %d ham and %d spam"%(hamtrain, spamtrain)
!     print "fp: %d fn: %d"%(fp, fn)
!     print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
!     print "Flex cost: $%.4f"%flexcost

  def main():
***************
*** 128,137 ****

      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
!                                    ['ham-keep=', 'spam-keep='])
      except getopt.error, msg:
          usage(1, msg)

!     nsets = seed = hamkeep = spamkeep = None
      for opt, arg in opts:
          if opt == '-h':
--- 138,146 ----

      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'hn:')
      except getopt.error, msg:
          usage(1, msg)

!     nsets = None
      for opt, arg in opts:
          if opt == '-h':

From hooft@users.sourceforge.net Sun Nov 10 12:07:18 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:07:18 -0800
Subject: [Spambayes-checkins] spambayes optimize.py,NONE,1.1
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24245

Added Files:
    optimize.py
Log Message:
Simplex maximization

--- NEW FILE: optimize.py ---
#
__version__ = '$Id: optimize.py,v 1.1 2002/11/10 12:07:15 hooft Exp $'
#
# Optimize any parametric function.
#
import copy
import Numeric

def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001):
    var = Numeric.array(var)
    simplex = [var]
    for i in range(len(var)):
        var2 = copy.copy(var)
        var2[i] = var[i] + err[i]
        simplex.append(var2)
    value = []
    for i in range(len(simplex)):
        value.append(func(simplex[i]))
    while 1:
        # Determine worst and best
        wi = 0
        bi = 0
        for i in range(len(simplex)):
            if value[wi] > value[i]:
                wi = i
            if value[bi] < value[i]:
                bi = i
        # Test for convergence
        #print "worst, best are",wi,bi,"with",value[wi],value[bi]
        if abs(value[bi] - value[wi]) <= convcrit:
            return simplex[bi]
        # Calculate average of non-worst
        ave=Numeric.zeros(len(var), 'd')
        for i in range(len(simplex)):
            if i != wi:
                ave = ave + simplex[i]
        ave = ave / (len(simplex) - 1)
        worst = Numeric.array(simplex[wi])
        # Check for too-small simplex
        simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
        if simsize <= minerr:
            #print "Size of simplex too small:",simsize
            return simplex[bi]
        # Invert worst
        new = 2 * ave - simplex[wi]
        newv = func(new)
        if newv <= value[wi]:
            # Even worse. Shrink instead
            #print "Shrunk simplex"
            #print "ave=",repr(ave)
            #print "wi=",repr(worst)
            new = 0.5 * ave + 0.5 * worst
            newv = func(new)
        elif newv > value[bi]:
            # Better than the best. Expand
            new2 = 3 * ave - 2 * worst
            newv2 = func(new2)
            if newv2 > newv:
                # Accept
                #print "Expanded simplex"
                new = new2
                newv = newv2
        simplex[wi] = new
        value[wi] = newv

def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
    err = Numeric.array(err)
    var = SimplexMaximize(var, err, func, convcrit*5, minerr*5)
    return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr)

From hooft@users.sourceforge.net Sun Nov 10 12:08:42 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 10 Nov 2002 04:08:42 -0800
Subject: [Spambayes-checkins] spambayes weakloop.py,NONE,1.1
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24653

Added Files:
    weakloop.py
Log Message:
Loop simplex optimization over weaktest.py

--- NEW FILE: weakloop.py ---
#
# Optimize parameters
#
"""Usage: %(program)s [options] -n nsets

Where:
    -h
        Show usage and exit.
    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

import sys

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

program = sys.argv[0]

default="""
[Classifier]
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150

[TestDriver]
spam_cutoff = 0.90
ham_cutoff = 0.20
"""

import Options

start = (Options.options.robinson_probability_x,
         Options.options.robinson_minimum_prob_strength,
         Options.options.robinson_probability_s,
         Options.options.spam_cutoff,
         Options.options.ham_cutoff)
err = (0.01, 0.01, 0.01, 0.005, 0.01)

def mkini(vars):
    f=open('bayescustomize.ini', 'w')
    f.write("""
[Classifier]
robinson_probability_x = %.6f
robinson_minimum_prob_strength = %.6f
robinson_probability_s = %.6f

[TestDriver]
spam_cutoff = %.4f
ham_cutoff = %.4f
"""%tuple(vars))
    f.close()

def score(vars):
    import os
    mkini(vars)
    status = os.system('python2.3 weaktest.py -n %d > weak.out'%nsets)
    if status != 0:
        print >> sys.stderr, "Error status from weaktest"
        sys.exit(status)
    f = open('weak.out', 'r')
    txt = f.readlines()
    # Extract the flex cost field.
    cost = float(txt[-1].split()[2][1:])
    f.close()
    print ''.join(txt[-4:])[:-1]
    print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,))
    return -cost

def main():
    import optimize
    finish=optimize.SimplexMaximize(start,err,score)
    mkini(finish)

if __name__ == "__main__":
    import getopt

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:')
    except getopt.error, msg:
        usage(1, msg)

    nsets = None
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-n':
            nsets = int(arg)

    if args:
        usage(1, "Positional arguments not supported")
    if nsets is None:
        usage(1, "-n is required")

    main()

From tim_one@users.sourceforge.net Sun Nov 10 19:59:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 11:59:24 -0800
Subject: [Spambayes-checkins] spambayes msgs.py,1.5,1.6 optimize.py,1.1,1.2 pop3proxy.py,1.13,1.14 timcv.py,1.11,1.12 weaktest.py,1.2,1.3
Message-ID:

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14712

Modified Files:
    msgs.py optimize.py pop3proxy.py timcv.py weaktest.py
Log Message:
Whitespace normalization.

Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** msgs.py 1 Nov 2002 04:10:50 -0000 1.5
--- msgs.py 10 Nov 2002 19:59:22 -0000 1.6
***************
*** 84,88 ****

  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
      If seed is not None, also set SEED.
      If (ham|spam)test are not set, set to the same as the (ham|spam)train
--- 84,88 ----

  def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
!     """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
      If seed is not None, also set SEED.
If (ham|spam)test are not set, set to the same as the (ham|spam)train Index: optimize.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/optimize.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** optimize.py 10 Nov 2002 12:07:15 -0000 1.1 --- optimize.py 10 Nov 2002 19:59:22 -0000 1.2 *************** *** 11,66 **** simplex = [var] for i in range(len(var)): ! var2 = copy.copy(var) ! var2[i] = var[i] + err[i] ! simplex.append(var2) value = [] for i in range(len(simplex)): ! value.append(func(simplex[i])) while 1: ! # Determine worst and best ! wi = 0 ! bi = 0 ! for i in range(len(simplex)): ! if value[wi] > value[i]: ! wi = i ! if value[bi] < value[i]: ! bi = i ! # Test for convergence ! #print "worst, best are",wi,bi,"with",value[wi],value[bi] ! if abs(value[bi] - value[wi]) <= convcrit: ! return simplex[bi] ! # Calculate average of non-worst ! ave=Numeric.zeros(len(var), 'd') ! for i in range(len(simplex)): ! if i != wi: ! ave = ave + simplex[i] ! ave = ave / (len(simplex) - 1) ! worst = Numeric.array(simplex[wi]) ! # Check for too-small simplex ! simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) ! if simsize <= minerr: ! #print "Size of simplex too small:",simsize ! return simplex[bi] ! # Invert worst ! new = 2 * ave - simplex[wi] ! newv = func(new) ! if newv <= value[wi]: ! # Even worse. Shrink instead ! #print "Shrunk simplex" ! #print "ave=",repr(ave) ! #print "wi=",repr(worst) ! new = 0.5 * ave + 0.5 * worst ! newv = func(new) ! elif newv > value[bi]: ! # Better than the best. Expand ! new2 = 3 * ave - 2 * worst ! newv2 = func(new2) ! if newv2 > newv: ! # Accept ! #print "Expanded simplex" ! new = new2 ! newv = newv2 ! simplex[wi] = new ! value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): --- 11,66 ---- simplex = [var] for i in range(len(var)): ! var2 = copy.copy(var) ! var2[i] = var[i] + err[i] ! 
simplex.append(var2) value = [] for i in range(len(simplex)): ! value.append(func(simplex[i])) while 1: ! # Determine worst and best ! wi = 0 ! bi = 0 ! for i in range(len(simplex)): ! if value[wi] > value[i]: ! wi = i ! if value[bi] < value[i]: ! bi = i ! # Test for convergence ! #print "worst, best are",wi,bi,"with",value[wi],value[bi] ! if abs(value[bi] - value[wi]) <= convcrit: ! return simplex[bi] ! # Calculate average of non-worst ! ave=Numeric.zeros(len(var), 'd') ! for i in range(len(simplex)): ! if i != wi: ! ave = ave + simplex[i] ! ave = ave / (len(simplex) - 1) ! worst = Numeric.array(simplex[wi]) ! # Check for too-small simplex ! simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) ! if simsize <= minerr: ! #print "Size of simplex too small:",simsize ! return simplex[bi] ! # Invert worst ! new = 2 * ave - simplex[wi] ! newv = func(new) ! if newv <= value[wi]: ! # Even worse. Shrink instead ! #print "Shrunk simplex" ! #print "ave=",repr(ave) ! #print "wi=",repr(worst) ! new = 0.5 * ave + 0.5 * worst ! newv = func(new) ! elif newv > value[bi]: ! # Better than the best. Expand ! new2 = 3 * ave - 2 * worst ! newv2 = func(new2) ! if newv2 > newv: ! # Accept ! #print "Expanded simplex" ! new = new2 ! newv = newv2 ! simplex[wi] = new ! value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13 --- pop3proxy.py 10 Nov 2002 19:59:22 -0000 1.14 *************** *** 140,144 **** can't connect to the real POP3 server and talk to it synchronously, because that would block the process.""" ! 
def __init__(self, serverName, serverPort, lineCallback): BrighterAsyncChat.__init__(self) --- 140,144 ---- can't connect to the real POP3 server and talk to it synchronously, because that would block the process.""" ! def __init__(self, serverName, serverPort, lineCallback): BrighterAsyncChat.__init__(self) *************** *** 148,152 **** self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((serverName, serverPort)) ! def collect_incoming_data(self, data): self.request = self.request + data --- 148,152 ---- self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((serverName, serverPort)) ! def collect_incoming_data(self, data): self.request = self.request + data *************** *** 184,188 **** self.seenAllHeaders = False # For the current RETR or TOP self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, self.onServerLine) --- 184,188 ---- self.seenAllHeaders = False # For the current RETR or TOP self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, self.onServerLine) *************** *** 198,214 **** isFirstLine = not self.response self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! # Has the server closed its end of the socket? if not line: self.isClosing = True ! # If we're not processing a command, just echo the response. if not self.command: self.push(self.response) self.response = '' ! # Time out after 30 seconds for message-retrieval commands if # all the headers are down. The rest of the message will proxy --- 198,214 ---- isFirstLine = not self.response self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! # Has the server closed its end of the socket? if not line: self.isClosing = True ! # If we're not processing a command, just echo the response. 
if not self.command: self.push(self.response) self.response = '' ! # Time out after 30 seconds for message-retrieval commands if # all the headers are down. The rest of the message will proxy *************** *** 223,227 **** self.onResponse() self.response = '' ! def isMultiline(self): """Returns True if the request should get a multiline --- 223,227 ---- self.onResponse() self.response = '' ! def isMultiline(self): """Returns True if the request should get a multiline *************** *** 254,258 **** self.close() raise SystemExit ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': --- 254,258 ---- self.close() raise SystemExit ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': *************** *** 265,271 **** self.args = splitCommand[1:] self.startTime = time.time() ! self.request = '' ! def onResponse(self): # Pass the request and the raw response to the subclass and --- 265,271 ---- self.args = splitCommand[1:] self.startTime = time.time() ! self.request = '' ! def onResponse(self): # Pass the request and the raw response to the subclass and *************** *** 273,277 **** cooked = self.onTransaction(self.command, self.args, self.response) self.push(cooked) ! # If onServerLine() decided that the server has closed its # socket, close this one when the response has been sent. --- 273,277 ---- cooked = self.onTransaction(self.command, self.args, self.response) self.push(cooked) ! # If onServerLine() decided that the server has closed its # socket, close this one when the response has been sent. *************** *** 351,355 **** status.activeSessions -= 1 POP3ProxyBase.close(self) ! def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the --- 351,355 ---- status.activeSessions -= 1 POP3ProxyBase.close(self) ! 
def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the *************** *** 419,423 **** if command == 'RETR': status.numUnsure += 1 ! headers, body = re.split(r'\n\r?\n', response, 1) headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n" --- 419,423 ---- if command == 'RETR': status.numUnsure += 1 ! headers, body = re.split(r'\n\r?\n', response, 1) headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n" *************** *** 490,494 **** .content { margin: 15 } .sectiontable { border: 1px solid #808080; width: 95%% } ! .sectionheading { background: fffae0; padding-left: 1ex; border-bottom: 1px solid #808080; font-weight: bold } --- 490,494 ---- .content { margin: 15 } .sectiontable { border: 1px solid #808080; width: 95%% } ! .sectionheading { background: fffae0; padding-left: 1ex; border-bottom: 1px solid #808080; font-weight: bold } *************** *** 513,517 **** shutdownDB = """""" ! shutdownPickle = shutdownDB + """   """ --- 513,517 ---- shutdownDB = """""" ! shutdownPickle = shutdownDB + """   """ *************** *** 521,525 ****
    %s
    %s
     
    \n""" ! summary = """POP3 proxy running on port %(proxyPort)d, proxying to %(serverName)s:%(serverPort)d.
    --- 521,525 ---- %s  
    \n""" ! summary = """POP3 proxy running on port %(proxyPort)d, proxying to %(serverName)s:%(serverPort)d.
    *************** *** 529,538 **** %(numHams)d ham, %(numUnsure)d unsure. """ ! wordQuery = """
    """ ! train = """
    --- 529,538 ---- %(numHams)d ham, %(numUnsure)d unsure. """ ! wordQuery = """
    """ ! train = """
    *************** *** 546,550 ****
    """ ! def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) --- 546,550 ---- """ ! def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 577,581 **** self.request = self.request + '\r\n\r\n' return ! if type(self.get_terminator()) is type(1): # We've just read the body of a POSTed request. --- 577,581 ---- self.request = self.request + '\r\n\r\n' return ! if type(self.get_terminator()) is type(1): # We've just read the body of a POSTed request. *************** *** 592,596 **** # A normal x-www-form-urlencoded. params.update(cgi.parse_qs(body, keep_blank_values=True)) ! # Convert the cgi params into a simple dictionary. plainParams = {} --- 592,596 ---- # A normal x-www-form-urlencoded. params.update(cgi.parse_qs(body, keep_blank_values=True)) ! # Convert the cgi params into a simple dictionary. plainParams = {} *************** *** 604,608 **** if path == '/': path = '/Home' ! if path == '/helmet.gif': # XXX Why doesn't Expires work? Must read RFC 2616 one day. --- 604,608 ---- if path == '/': path = '/Home' ! if path == '/helmet.gif': # XXX Why doesn't Expires work? Must read RFC 2616 one day. *************** *** 628,632 **** else: self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): timeNow = time.gmtime(time.time()) --- 628,632 ---- else: self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): timeNow = time.gmtime(time.time()) *************** *** 645,649 **** self.push("\r\n") self.push("

    %d %s

    " % (code, message)) ! def pushPreamble(self, name): self.push(self.header % name) --- 645,649 ---- self.push("\r\n") self.push("

    %d %s

    " % (code, message)) ! def pushPreamble(self, name): self.push(self.header % name) *************** *** 681,685 **** message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') ! # Append the message to a file, to make it easier to rebuild # the database later. This is a temporary implementation - --- 681,685 ---- message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') ! # Append the message to a file, to make it easier to rebuild # the database later. This is a temporary implementation - *************** *** 718,722 **** except KeyError: info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) --- 718,722 ---- except KeyError: info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) *************** *** 992,996 **** elif opt == '-u': status.uiPort = int(arg) ! # Do whatever we've been asked to do... if not opts and not args: --- 992,996 ---- elif opt == '-u': status.uiPort = int(arg) ! # Do whatever we've been asked to do... if not opts and not args: Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** timcv.py 1 Nov 2002 04:10:50 -0000 1.11 --- timcv.py 10 Nov 2002 19:59:22 -0000 1.12 *************** *** 15,19 **** --HamTrain int ! The maximum number of msgs to use from each Ham set for training. The msgs are chosen randomly. See also the -s option. --- 15,19 ---- --HamTrain int ! The maximum number of msgs to use from each Ham set for training. The msgs are chosen randomly. See also the -s option. *************** *** 23,27 **** --HamTest int ! The maximum number of msgs to use from each Ham set for testing. 
The msgs are chosen randomly. See also the -s option. --- 23,27 ---- --HamTest int ! The maximum number of msgs to use from each Ham set for testing. The msgs are chosen randomly. See also the -s option. *************** *** 73,79 **** d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:], train=1)) --- 73,79 ---- d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:], train=1)) *************** *** 98,102 **** del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), msgs.SpamStream(sname, s2, train=1)) --- 98,102 ---- del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), msgs.SpamStream(sname, s2, train=1)) Index: weaktest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** weaktest.py 10 Nov 2002 12:02:33 -0000 1.2 --- weaktest.py 10 Nov 2002 19:59:22 -0000 1.3 *************** *** 58,62 **** nham = len(hamfns) nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: --- 58,62 ---- nham = len(hamfns) nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: *************** *** 133,137 **** print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) print "Flex cost: $%.4f"%flexcost ! def main(): import getopt --- 133,137 ---- print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) print "Flex cost: $%.4f"%flexcost ! 
def main(): import getopt From tim_one@users.sourceforge.net Sun Nov 10 20:00:03 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 12:00:03 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv14946 Modified Files: msgstore.py Log Message: Whitespace normalization. Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** msgstore.py 7 Nov 2002 22:30:09 -0000 1.23 --- msgstore.py 10 Nov 2002 19:59:59 -0000 1.24 *************** *** 397,401 **** # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed pass ! return "%s\n%s\n%s" % (headers, html, body) --- 397,401 ---- # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed pass ! return "%s\n%s\n%s" % (headers, html, body) From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 17:59:08 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam Modified Files: profile.py Log Message: For the benefit of future generations, renamed some options: Old New --- --- robinson_probability_x unknown_word_prob robinson_probability_s unknown_word_strength robinson_minimum_prob_strength minimum_prob_strength Index: profile.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** profile.py 7 Nov 2002 22:30:11 -0000 1.3 --- profile.py 11 Nov 2002 01:59:06 -0000 1.4 *************** *** 44,48 **** class WordInfo(Persistent): ! 
def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 44,48 ---- class WordInfo(Persistent): ! def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 17:59:08 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5402 Modified Files: Options.py classifier.py weakloop.py Log Message: For the benefit of future generations, renamed some options: Old New --- --- robinson_probability_x unknown_word_prob robinson_probability_s unknown_word_strength robinson_minimum_prob_strength minimum_prob_strength Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.67 retrieving revision 1.68 diff -C2 -d -r1.67 -r1.68 *** Options.py 8 Nov 2002 04:06:23 -0000 1.67 --- Options.py 11 Nov 2002 01:59:06 -0000 1.68 *************** *** 241,268 **** # These two control the prior assumption about word probabilities. ! # "x" is essentially the probability given to a word that has never been ! # seen before. Nobody has reported an improvement via moving it away ! # from 1/2. ! # "s" adjusts how much weight to give the prior assumption relative to ! # the probabilities estimated by counting. At s=0, the counting estimates ! # are believed 100%, even to the extent of assigning certainty (0 or 1) ! # to a word that has appeared in only ham or only spam. This is a disaster. ! # As s tends toward infintity, all probabilities tend toward x. All ! # reports were that a value near 0.4 worked best, so this does not seem to ! # be corpus-dependent. ! 
# NOTE: Gary Robinson previously used a different formula involving 'a' ! # and 'x'. The 'x' here is the same as before. The 's' here is the old ! # 'a' divided by 'x'. ! robinson_probability_x: 0.5 ! robinson_probability_s: 0.45 # When scoring a message, ignore all words with ! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many ! # tests over Robinsons base scheme. 0.1 appeared to work well across ! # all corpora. ! robinson_minimum_prob_strength: 0.1 ! # The combining scheme currently detailed on Gary Robinons web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme --- 241,268 ---- # These two control the prior assumption about word probabilities. ! # unknown_word_prob is essentially the probability given to a word that ! # has never been seen before. Nobody has reported an improvement via moving ! # it away from 1/2, although Tim has measured a mean spamprob of a bit over ! # 0.5 (0.51-0.55) in 3 well-trained classifiers. ! # ! # unknown_word_strength adjusts how much weight to give the prior assumption ! # relative to the probabilities estimated by counting. At 0, the counting ! # estimates are believed 100%, even to the extent of assigning certainty ! # (0 or 1) to a word that has appeared in only ham or only spam. This ! # is a disaster. ! # ! # As unknown_word_strength tends toward infintity, all probabilities tend ! # toward unknown_word_prob. All reports were that a value near 0.4 worked ! # best, so this does not seem to be corpus-dependent. ! unknown_word_prob: 0.5 ! unknown_word_strength: 0.45 # When scoring a message, ignore all words with ! # abs(word.spamprob - 0.5) < minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many ! # tests. 0.1 appeared to work well across all corpora. ! minimum_prob_strength: 0.1 ! 
# The combining scheme currently detailed on the Robinon web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme *************** *** 272,284 **** # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) ! # follows the chi-squared distribution with 2*n degrees of freedom. That is ! # the "provably most-sensitive" test Garys original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where ! # lots of the mistakes live. This is the best method so far on Tims data. ! # One systematic benefit is that it is immune to "cancellation disease". One ! # systematic drawback is that it is sensitive to *any* deviation from a ! # uniform distribution, regardless of whether that is actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). --- 272,284 ---- # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) ! # follows the chi-squared distribution with 2*n degrees of freedom. This is ! # the "provably most-sensitive" test the original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where ! # lots of the mistakes live. This is the best method so far. ! # One systematic benefit is is immunity to "cancellation disease". One ! # systematic drawback is sensitivity to *any* deviation from a ! # uniform distribution, regardless of whether actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). 
*************** *** 381,387 **** }, 'Classifier': {'max_discriminators': int_cracker, ! 'robinson_probability_x': float_cracker, ! 'robinson_probability_s': float_cracker, ! 'robinson_minimum_prob_strength': float_cracker, 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, --- 381,387 ---- }, 'Classifier': {'max_discriminators': int_cracker, ! 'unknown_word_prob': float_cracker, ! 'unknown_word_strength': float_cracker, ! 'minimum_prob_strength': float_cracker, 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** classifier.py 7 Nov 2002 22:30:05 -0000 1.49 --- classifier.py 11 Nov 2002 01:59:06 -0000 1.50 *************** *** 70,74 **** # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 70,74 ---- # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 *************** *** 322,327 **** nspam = float(self.nspam or 1) ! S = options.robinson_probability_s ! StimesX = S * options.robinson_probability_x for word, record in self.wordinfo.iteritems(): --- 322,327 ---- nspam = float(self.nspam or 1) ! S = options.unknown_word_strength ! StimesX = S * options.unknown_word_prob for word, record in self.wordinfo.iteritems(): *************** *** 449,454 **** def _getclues(self, wordstream): ! mindist = options.robinson_minimum_prob_strength ! unknown = options.robinson_probability_x clues = [] # (distance, prob, word, record) tuples --- 449,454 ---- def _getclues(self, wordstream): ! 
mindist = options.minimum_prob_strength ! unknown = options.unknown_word_prob clues = [] # (distance, prob, word, record) tuples Index: weakloop.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1 --- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2 *************** *** 29,35 **** default=""" [Classifier] ! robinson_probability_x = 0.5 ! robinson_minimum_prob_strength = 0.1 ! robinson_probability_s = 0.45 max_discriminators = 150 --- 29,35 ---- default=""" [Classifier] ! unknown_word_prob = 0.5 ! minimum_prob_strength = 0.1 ! unknown_word_strength = 0.45 max_discriminators = 150 *************** *** 41,47 **** import Options ! start = (Options.options.robinson_probability_x, ! Options.options.robinson_minimum_prob_strength, ! Options.options.robinson_probability_s, Options.options.spam_cutoff, Options.options.ham_cutoff) --- 41,47 ---- import Options ! start = (Options.options.unknown_word_prob, ! Options.options.minimum_prob_strength, ! Options.options.unknown_word_strength, Options.options.spam_cutoff, Options.options.ham_cutoff) *************** *** 52,58 **** f.write(""" [Classifier] ! robinson_probability_x = %.6f ! robinson_minimum_prob_strength = %.6f ! robinson_probability_s = %.6f [TestDriver] --- 52,58 ---- f.write(""" [Classifier] ! unknown_word_prob = %.6f ! minimum_prob_strength = %.6f ! 
unknown_word_strength = %.6f [TestDriver] From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 07 Nov 2002 20:06:29 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67 tokenizer.py,1.63,1.64 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31798 Modified Files: Options.py tokenizer.py Log Message: Removed option retain_pure_html_tags; nobody enables that anymore, and it's hard to believe it would ever help anymore (except as an HTML detector). Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** Options.py 7 Nov 2002 22:25:46 -0000 1.66 --- Options.py 8 Nov 2002 04:06:23 -0000 1.67 *************** *** 42,53 **** x-.* - # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags - # from pure text/html messages. Set true to retain HTML tags in this - # case. On the c.l.py corpus, it helps to set this true because any - # sign of HTML is so despised on tech lists; however, the advantage - # of setting it true eventually vanishes even there given enough - # training data. - retain_pure_html_tags: False - # If true, the first few characters of application/octet-stream sections # are used, undecoded. What 'few' means is decided by octet_prefix_size. --- 42,45 ---- *************** *** 347,352 **** all_options = { ! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, ! 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, --- 339,343 ---- all_options = { ! 
'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.63 retrieving revision 1.64 diff -C2 -d -r1.63 -r1.64 *** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63 --- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64 *************** *** 495,504 **** # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaults to False. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was removed. ############################################################################## --- 495,505 ---- # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaulted to False. Later, as the ! # algorithm improved, retain_pure_html_tags was removed. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was also removed. ############################################################################## *************** *** 1167,1175 **** """Generate a stream of tokens from an email Message. - HTML tags are always stripped from text/plain sections. - options.retain_pure_html_tags controls whether HTML tags are - also stripped from text/html sections. 
Except in special cases, - it's recommended to leave that at its default of false. - If options.check_octets is True, the first few undecoded characters of application/octet-stream parts of the message body become tokens. --- 1168,1171 ---- *************** *** 1228,1235 **** # Remove HTML/XML tags. Also  . ! if (part.get_content_type() == "text/plain" or ! not options.retain_pure_html_tags): ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. --- 1224,1229 ---- # Remove HTML/XML tags. Also  . ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body.
From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Fri, 08 Nov 2002 00:00:25 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25390 Modified Files: pop3proxy.py Log Message: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch.
o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. o It now copes with POP3 servers that don't issue a welcome command. o The training form now appears in the training results, so you can train on another message without having to go back to the Home page. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11 --- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 *************** *** 47,50 **** --- 47,74 ---- + todo = """ + o (Re)training interface - one message per line, quick-rendering table. + o Slightly-wordy index page; intro paragraph for each page. + o Once the training stuff is on a separate page, make the paste box + bigger. + o "Links" section (on homepage?) to project homepage, mailing list, + etc. + o "Home" link (with helmet!) at the end of each page. + o "Classify this" - just like Train. + o "Send me an email every [...] to remind me to train on new + messages." + o "Send me a status email every [...] telling how many mails have been + classified, etc." + o Deployment: Windows executable? atlaxwin and ctypes? Or just + webbrowser? + o Possibly integrate Tim Stone's SMTP code - make it use async, make + the training code update (rather than replace!) the database. + o Can it cleanly dynamically update its status display while having a + POP3 converation? Hammering reload sucks. + o Add a command to save the database without shutting down, and one to + reload the database. + o Leave the word in the input field after a Word query. 
+ """ + import sys, re, operator, errno, getopt, cPickle, cStringIO, time import socket, asyncore, asynchat, cgi, urlparse, webbrowser *************** *** 92,95 **** --- 116,120 ---- self.factory(*args) + class BrighterAsyncChat(asynchat.async_chat): """An asynchat.async_chat that doesn't give spurious warnings on *************** *** 110,113 **** --- 135,164 ---- + class ServerLineReader(BrighterAsyncChat): + """An async socket that reads lines from a remote server and + simply calls a callback with the data. The BayesProxy object + can't connect to the real POP3 server and talk to it + synchronously, because that would block the process.""" + + def __init__(self, serverName, serverPort, lineCallback): + BrighterAsyncChat.__init__(self) + self.lineCallback = lineCallback + self.request = '' + self.set_terminator('\r\n') + self.create_socket(socket.AF_INET, socket.SOCK_STREAM) + self.connect((serverName, serverPort)) + + def collect_incoming_data(self, data): + self.request = self.request + data + + def found_terminator(self): + self.lineCallback(self.request + '\r\n') + self.request = '' + + def handle_close(self): + self.lineCallback('') + self.close() + + class POP3ProxyBase(BrighterAsyncChat): """An async dispatcher that understands POP3 and proxies to a POP3 *************** *** 126,134 **** BrighterAsyncChat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') ! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! self.serverSocket.connect((serverName, serverPort)) ! self.serverIn = self.serverSocket.makefile('r') # For reading only ! self.push(self.serverIn.readline()) def onTransaction(self, command, args, response): --- 177,189 ---- BrighterAsyncChat.__init__(self, clientSocket) self.request = '' + self.response = '' self.set_terminator('\r\n') ! self.command = '' # The POP3 command being processed... ! self.args = '' # ...and its arguments ! self.isClosing = False # Has the server closed the socket? ! 
self.seenAllHeaders = False # For the current RETR or TOP ! self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, ! self.onServerLine) def onTransaction(self, command, args, response): *************** *** 139,152 **** raise NotImplementedError ! def isMultiline(self, command, args): ! """Returns True if the given request should get a multiline response (assuming the response is positive). """ ! if command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif command in ['RETR', 'TOP']: return True ! elif command in ['LIST', 'UIDL']: return len(args) == 0 else: --- 194,237 ---- raise NotImplementedError ! def onServerLine(self, line): ! """A line of response has been received from the POP3 server.""" ! isFirstLine = not self.response ! self.response = self.response + line ! ! # Is this line that terminates a set of headers? ! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! ! # Has the server closed its end of the socket? ! if not line: ! self.isClosing = True ! ! # If we're not processing a command, just echo the response. ! if not self.command: ! self.push(self.response) ! self.response = '' ! ! # Time out after 30 seconds for message-retrieval commands if ! # all the headers are down. The rest of the message will proxy ! # straight through. ! if self.command in ['TOP', 'RETR'] and \ ! self.seenAllHeaders and time.time() > self.startTime + 30: ! self.onResponse() ! self.response = '' ! # If that's a complete response, handle it. ! elif not self.isMultiline() or line == '.\r\n' or \ ! (isFirstLine and line.startswith('-ERR')): ! self.onResponse() ! self.response = '' ! ! def isMultiline(self): ! """Returns True if the request should get a multiline response (assuming the response is positive). """ ! if self.command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif self.command in ['RETR', 'TOP']: return True ! 
elif self.command in ['LIST', 'UIDL']: return len(args) == 0 else: *************** *** 155,204 **** return False - def readResponse(self, command, args): - """Reads the POP3 server's response and returns a tuple of - (response, isClosing, timedOut). isClosing is True if the - server closes the socket, which tells found_terminator() to - close when the response has been sent. timedOut is set if a - TOP or RETR request was still arriving after 30 seconds, and - tells found_terminator() to proxy the remainder of the response. - """ - responseLines = [] - startTime = time.time() - isMulti = self.isMultiline(command, args) - isClosing = False - timedOut = False - isFirstLine = True - seenAllHeaders = False - while True: - line = self.serverIn.readline() - if not line: - # The socket's been closed by the server, probably by QUIT. - isClosing = True - break - elif not isMulti or (isFirstLine and line.startswith('-ERR')): - # A single-line response. - responseLines.append(line) - break - elif line == '.\r\n': - # The termination line. - responseLines.append(line) - break - else: - # A normal line - append it to the response and carry on. - responseLines.append(line) - seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n'] - - # Time out after 30 seconds for message-retrieval commands - # if all the headers are down - found_terminator() knows how - # to deal with this. - if command in ['TOP', 'RETR'] and \ - seenAllHeaders and time.time() > startTime + 30: - timedOut = True - break - - isFirstLine = False - - return ''.join(responseLines), isClosing, timedOut - def collect_incoming_data(self, data): """Asynchat override.""" --- 240,243 ---- *************** *** 207,256 **** def found_terminator(self): """Asynchat override.""" - # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! 
self.serverSocket.sendall(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! command, args = ('', '') else: splitCommand = self.request.strip().split(None, 1) ! command = splitCommand[0].upper() ! args = splitCommand[1:] ! rawResponse, isClosing, timedOut = self.readResponse(command, args) ! # Pass the request and the raw response to the subclass and # send back the cooked response. ! cookedResponse = self.onTransaction(command, args, rawResponse) ! self.push(cookedResponse) ! self.request = '' ! ! # If readResponse() timed out, we still need to read and proxy ! # the rest of the message. ! if timedOut: ! while True: ! line = self.serverIn.readline() ! if not line: ! # The socket's been closed by the server. ! isClosing = True ! break ! elif line == '.\r\n': ! # The termination line. ! self.push(line) ! break ! else: ! # A normal line. ! self.push(line) ! ! # If readResponse() or the loop above decided that the server ! # has closed its socket, close this one when the response has ! # been sent. ! if isClosing: self.close_when_done() class BayesProxyListener(Listener): --- 246,288 ---- def found_terminator(self): """Asynchat override.""" if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") + self.serverSocket.shutdown(2) + self.serverSocket.close() self.shutdown(2) self.close() raise SystemExit ! ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! self.command = self.args = '' else: + # A proper command. splitCommand = self.request.strip().split(None, 1) ! self.command = splitCommand[0].upper() ! self.args = splitCommand[1:] ! self.startTime = time.time() ! ! self.request = '' ! ! def onResponse(self): # Pass the request and the raw response to the subclass and # send back the cooked response. ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) ! ! 
# If onServerLine() decided that the server has closed its ! # socket, close this one when the response has been sent. ! if self.isClosing: self.close_when_done() + # Reset. + self.command = '' + self.args = '' + self.isClosing = False + self.seenAllHeaders = False + class BayesProxyListener(Listener): *************** *** 452,456 **** table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15 } .header { font-size: 133%% } .content { margin: 15 } --- 484,490 ---- table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15; ! border-top: 1px solid black; ! border-bottom: 1px solid black } .header { font-size: 133%% } .content { margin: 15 } *************** *** 466,470 ****
    \n""" --- 500,504 ----
    \n""" *************** *** 475,481 **** Spambayes.org ! \n""" pageSection = """ --- 509,520 ---- Spambayes.org
    %s
    \n""" + shutdownDB = """""" + + shutdownPickle = shutdownDB + """   + """ + pageSection = """ *************** *** 483,486 **** --- 522,533 ----  
    \n""" + summary = """POP3 proxy running on port %(proxyPort)d, + proxying to %(serverName)s:%(serverPort)d.
    + Active POP3 conversations: %(activeSessions)d.
    + POP3 conversations this session: %(totalSessions)d.
    + Emails classified this session: %(numSpams)d spam, + %(numHams)d ham, %(numUnsure)d unsure. + """ + wordQuery = """ *************** *** 488,491 **** --- 535,550 ---- """ + train = """ + Either upload a message file:
    + Or paste the whole message (including headers) here:
    +
    + Is this message + Ham or + Spam?
    + + """ + def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 502,506 **** """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = self.request.split('\r\n', 1) try: method, url, version = requestLine.strip().split() --- 561,565 ---- """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = (self.request+'\r\n').split('\r\n', 1) try: method, url, version = requestLine.strip().split() *************** *** 547,551 **** if path == '/helmet.gif': ! self.pushOKHeaders('image/gif') self.push(self.helmet) else: --- 606,614 ---- if path == '/helmet.gif': ! # XXX Why doesn't Expires work? Must read RFC 2616 one day. ! inOneHour = time.gmtime(time.time() + 3600) ! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour) ! extraHeaders = {'Expires': expiryDate} ! self.pushOKHeaders('image/gif', extraHeaders) self.push(self.helmet) else: *************** *** 554,558 **** handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % url) else: # This is a request for a valid page; run the handler. --- 617,621 ---- handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % path) else: # This is a request for a valid page; run the handler. *************** *** 561,569 **** handler(params) timeString = time.asctime(time.localtime()) ! self.push(self.footer % timeString) ! def pushOKHeaders(self, contentType): ! self.push("HTTP/1.0 200 OK\r\n") self.push("Content-Type: %s\r\n" % contentType) self.push("\r\n") --- 624,641 ---- handler(params) timeString = time.asctime(time.localtime()) ! if status.useDB: ! self.push(self.footer % (timeString, self.shutdownDB)) ! else: ! self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): ! timeNow = time.gmtime(time.time()) ! 
httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow) ! self.push("HTTP/1.1 200 OK\r\n") ! self.push("Connection: close\r\n") self.push("Content-Type: %s\r\n" % contentType) + self.push("Date: %s\r\n" % httpNow) + for name, value in extraHeaders.items(): + self.push("%s: %s\r\n" % (name, value)) self.push("\r\n") *************** *** 583,616 **** def onHome(self, params): ! summary = """POP3 proxy running on port %(proxyPort)d, ! proxying to %(serverName)s:%(serverPort)d.
    ! Active POP3 conversations: %(activeSessions)d.
    ! POP3 conversations this session: ! %(totalSessions)d.
    ! Emails classified this session: %(numSpams)d spam, ! %(numHams)d ham, %(numUnsure)d unsure. ! """ % status.__dict__ ! ! train = """
    ! Either upload a message file: !
    ! Or paste the whole message (including headers) here:
    !
    ! Is this message ! Ham or ! Spam?
    ! ! """ ! ! body = (self.pageSection % ('Status', summary) + ! self.pageSection % ('Word query', self.wordQuery) + ! self.pageSection % ('Train', train)) self.push(body) def onShutdown(self, params): ! self.push("

    Shutdown. Goodbye.

    ") ! self.push(' ') # Acts as a flush for small buffers. self.shutdown(2) self.close() --- 655,675 ---- def onHome(self, params): ! """Serve up the homepage.""" ! body = (self.pageSection % ('Status', self.summary % status.__dict__)+ ! self.pageSection % ('Word query', self.wordQuery)+ ! self.pageSection % ('Train', self.train)) self.push(body) def onShutdown(self, params): ! """Shutdown the server, saving the pickle if requested to do so.""" ! if params['how'].lower().find('save') >= 0: ! if not status.useDB and status.pickleName: ! self.push("Saving...") ! self.push(' ') # Acts as a flush for small buffers. ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("Shutdown. Goodbye.") ! self.push(' ') self.shutdown(2) self.close() *************** *** 618,625 **** def onUpload(self, params): message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') # Append the message to a file, to make it easier to rebuild ! # the database later. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: --- 677,690 ---- def onUpload(self, params): + """Train on an uploaded or pasted message.""" + # Upload or paste? Spam or ham? message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild ! # the database later. This is a temporary implementation - ! # it should keep a Corpus (from Tim Stone's forthcoming message ! # management module) to manage a cache of messages. It needs ! # to keep them for the HTML retraining interface anyway. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: *************** *** 627,642 **** else: f = open("_pop3proxyham.mbox", "a") ! f.write("From ???@???\n") # fake From line (XXX good enough?) f.write(message) ! f.write("\n") f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("""

    Trained on your message. Saving database...

    """) ! self.push(" ") # Flush... must find out how to do this properly... ! if not status.useDB and status.pickleName: ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("

    Done.

    Home

    ") def onWordquery(self, params): --- 692,704 ---- else: f = open("_pop3proxyham.mbox", "a") ! f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n") f.write(message) ! f.write("\n\n") f.close() + + # Train on the message. self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("

    OK. Return Home or train another:

    ") ! self.push(self.pageSection % ('Train another', self.train)) def onWordquery(self, params): *************** *** 656,660 **** info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s':" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) --- 718,722 ---- info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) *************** *** 765,771 **** else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) self.request = '' def onStat(self, command, args): """POP3 STAT command.""" --- 827,839 ---- else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) # Or push_slowly for testing self.request = '' + def push_slowly(self, response): + """Useful for testing.""" + for c in response: + self.push(c) + time.sleep(0.02) + def onStat(self, command, args): """POP3 STAT command.""" *************** *** 777,781 **** """POP3 LIST command, with optional message number argument.""" if args: ! number = int(args) if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) --- 845,852 ---- """POP3 LIST command, with optional message number argument.""" if args: ! try: ! number = int(args) ! except ValueError: ! number = -1 if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) *************** *** 803,811 **** def onRetr(self, command, args): """POP3 RETR command.""" ! return self._getMessage(int(args), 12345) def onTop(self, command, args): """POP3 RETR command.""" ! number, lines = map(int, args.split()) return self._getMessage(number, lines) --- 874,889 ---- def onRetr(self, command, args): """POP3 RETR command.""" ! try: ! number = int(args) ! except ValueError: ! number = -1 ! 
return self._getMessage(number, 12345) def onTop(self, command, args): """POP3 RETR command.""" ! try: ! number, lines = map(int, args.split()) ! except ValueError: ! number, lines = -1, -1 return self._getMessage(number, lines) *************** *** 863,867 **** while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) != -1 # Kill the proxy and the test server. --- 941,945 ---- while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) >= 0 # Kill the proxy and the test server. From jvr@users.sourceforge.net Sat Nov 9 18:05:44 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Sat, 09 Nov 2002 10:05:44 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20814 Modified Files: pop3proxy.py Log Message: force word query to be lowercase, making the UI case insensitive Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 --- pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13 *************** *** 704,707 **** --- 704,708 ---- def onWordquery(self, params): word = params['word'] + word = word.lower() try: # Must be a better way to get __dict__ for a new-style class... From hooft@users.sourceforge.net Sat Nov 9 21:48:55 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sat, 09 Nov 2002 13:48:55 -0800 Subject: [Spambayes-checkins] spambayes weaktest.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31102 Added Files: weaktest.py Log Message: New test driver to simulate "unsure only" training --- NEW FILE: weaktest.py --- #! 
/usr/bin/env python # A test driver using "the standard" test directory structure. # This simulates a user that gets E-mail, and only trains on fp, # fn and unsure messages. It starts by training on the first 30 # messages, and from that point on well classified messages will # not be used for training. This can be used to see what the performance # of the scoring algorithm is under such conditions. Questions are: # * How does the size of the database behave over time? # * Does the classification get better over time? # * Are there other combinations of parameters for the classifier # that make this better behaved than the default values? """Usage: %(program)s [options] -n nsets Where: -h Show usage and exit. -n int Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...). This is required. In addition, an attempt is made to merge bayescustomize.ini into the options. If that exists, it can be used to change the settings in Options.options. """ from __future__ import generators import sys,os from Options import options import hammie import msgs program = sys.argv[0] debug = 0 def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def drive(nsets): print options.display() spamdirs = [options.spam_directories % i for i in range(1, nsets+1)] hamdirs = [options.ham_directories % i for i in range(1, nsets+1)] spamfns = [(x,y,1) for x in spamdirs for y in os.listdir(x)] hamfns = [(x,y,0) for x in hamdirs for y in os.listdir(x)] nham = len(hamfns) nspam = len(spamfns) allfns={} for fn in spamfns+hamfns: allfns[fn] = None d = hammie.Hammie(hammie.createbayes('weaktest.db', False)) n=0 unsure=0 hamtrain=0 spamtrain=0 fp=0 fn=0 for dir,name, is_spam in allfns.iterkeys(): n += 1 m=msgs.Msg(dir, name).guts if debug: print "trained:%dH+%dS fp:%d fn:%d unsure:%d before %s/%s"%(hamtrain,spamtrain,fp,fn,unsure,dir,name), if hamtrain + spamtrain 
> 30: scr=d.score(m) else: scr=0.50 if debug: print "score:%.3f"%scr, if scr < hammie.SPAM_THRESHOLD and is_spam: if scr < hammie.HAM_THRESHOLD: fn += 1 if debug: print "fn" else: unsure += 1 if debug: print "Unsure" spamtrain += 1 d.train_spam(m) d.update_probabilities() elif scr > hammie.HAM_THRESHOLD and not is_spam: if scr > hammie.SPAM_THRESHOLD: fp += 1 if debug: print "fp" else: print "fp: %s score:%.4f"%(os.path.join(dir,name),scr) else: unsure += 1 if debug: print "Unsure" hamtrain += 1 d.train_ham(m) d.update_probabilities() else: if debug: print "OK" if n % 100 == 0: print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%( n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure) print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam) print "Total unsure (including 30 startup messages): %d (%.1f%%)"%( unsure,unsure*100.0/len(allfns)) print "Trained on %d ham and %d spam"%(hamtrain,spamtrain) print "fp: %d fn: %d"%(fp,fn) FPW = options.best_cutoff_fp_weight FNW = options.best_cutoff_fn_weight UNW = options.best_cutoff_unsure_weight print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure) def main(): import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-n': nsets = int(arg) if args: usage(1, "Positional arguments not supported") if nsets is None: usage(1, "-n is required") drive(nsets) if __name__ == "__main__": main() From hooft@users.sourceforge.net Sun Nov 10 12:02:36 2002 From: hooft@users.sourceforge.net (Rob W.W. 
Hooft) Date: Sun, 10 Nov 2002 04:02:36 -0800 Subject: [Spambayes-checkins] spambayes weaktest.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22741 Modified Files: weaktest.py Log Message: add flexcost; sanitize spacing Index: weaktest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** weaktest.py 9 Nov 2002 21:48:52 -0000 1.1 --- weaktest.py 10 Nov 2002 12:02:33 -0000 1.2 *************** *** 59,63 **** nspam = len(spamfns) ! allfns={} for fn in spamfns+hamfns: allfns[fn] = None --- 59,63 ---- nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: allfns[fn] = None *************** *** 65,74 **** d = hammie.Hammie(hammie.createbayes('weaktest.db', False)) ! n=0 ! unsure=0 ! hamtrain=0 ! spamtrain=0 ! fp=0 ! fn=0 for dir,name, is_spam in allfns.iterkeys(): n += 1 --- 65,80 ---- d = hammie.Hammie(hammie.createbayes('weaktest.db', False)) ! n = 0 ! unsure = 0 ! hamtrain = 0 ! spamtrain = 0 ! fp = 0 ! fn = 0 ! flexcost = 0 ! FPW = options.best_cutoff_fp_weight ! FNW = options.best_cutoff_fn_weight ! UNW = options.best_cutoff_unsure_weight ! SPC = options.spam_cutoff ! HC = options.ham_cutoff for dir,name, is_spam in allfns.iterkeys(): n += 1 *************** *** 82,87 **** if debug: print "score:%.3f"%scr, ! if scr < hammie.SPAM_THRESHOLD and is_spam: ! if scr < hammie.HAM_THRESHOLD: fn += 1 if debug: --- 88,96 ---- if debug: print "score:%.3f"%scr, ! if scr < SPC and is_spam: ! t = FNW * (SPC - scr) / (SPC - HC) ! #print "Spam at %.3f costs %.2f"%(scr,t) ! flexcost += t ! if scr < HC: fn += 1 if debug: *************** *** 94,104 **** d.train_spam(m) d.update_probabilities() ! elif scr > hammie.HAM_THRESHOLD and not is_spam: ! if scr > hammie.SPAM_THRESHOLD: fp += 1 if debug: print "fp" else: ! 
print "fp: %s score:%.4f"%(os.path.join(dir,name),scr) else: unsure += 1 --- 103,116 ---- d.train_spam(m) d.update_probabilities() ! elif scr > HC and not is_spam: ! t = FPW * (scr - HC) / (SPC - HC) ! #print "Ham at %.3f costs %.2f"%(scr,t) ! flexcost += t ! if scr > SPC: fp += 1 if debug: print "fp" else: ! print "fp: %s score:%.4f"%(os.path.join(dir, name), scr) else: unsure += 1 *************** *** 113,126 **** if n % 100 == 0: print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%( ! n,hamtrain,spamtrain,len(d.bayes.wordinfo),fp,fn,unsure) ! print "Total messages %d (%d ham and %d spam)"%(len(allfns),nham,nspam) print "Total unsure (including 30 startup messages): %d (%.1f%%)"%( ! unsure,unsure*100.0/len(allfns)) ! print "Trained on %d ham and %d spam"%(hamtrain,spamtrain) ! print "fp: %d fn: %d"%(fp,fn) ! FPW = options.best_cutoff_fp_weight ! FNW = options.best_cutoff_fn_weight ! UNW = options.best_cutoff_unsure_weight ! print "Total cost: $%.2f"%(FPW*fp+FNW*fn+UNW*unsure) def main(): --- 125,136 ---- if n % 100 == 0: print "%5d trained:%dH+%dS wrds:%d fp:%d fn:%d unsure:%d"%( ! n, hamtrain, spamtrain, len(d.bayes.wordinfo), fp, fn, unsure) ! print "Total messages %d (%d ham and %d spam)"%(len(allfns), nham, nspam) print "Total unsure (including 30 startup messages): %d (%.1f%%)"%( ! unsure, unsure * 100.0 / len(allfns)) ! print "Trained on %d ham and %d spam"%(hamtrain, spamtrain) ! print "fp: %d fn: %d"%(fp, fn) ! print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) ! print "Flex cost: $%.4f"%flexcost def main(): *************** *** 128,137 **** try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ! ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) ! nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': --- 138,146 ---- try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) ! 
nsets = None for opt, arg in opts: if opt == '-h': From hooft@users.sourceforge.net Sun Nov 10 12:07:18 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sun, 10 Nov 2002 04:07:18 -0800 Subject: [Spambayes-checkins] spambayes optimize.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24245 Added Files: optimize.py Log Message: Simplex maximization --- NEW FILE: optimize.py --- # __version__ = '$Id: optimize.py,v 1.1 2002/11/10 12:07:15 hooft Exp $' # # Optimize any parametric function. # import copy import Numeric def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001): var = Numeric.array(var) simplex = [var] for i in range(len(var)): var2 = copy.copy(var) var2[i] = var[i] + err[i] simplex.append(var2) value = [] for i in range(len(simplex)): value.append(func(simplex[i])) while 1: # Determine worst and best wi = 0 bi = 0 for i in range(len(simplex)): if value[wi] > value[i]: wi = i if value[bi] < value[i]: bi = i # Test for convergence #print "worst, best are",wi,bi,"with",value[wi],value[bi] if abs(value[bi] - value[wi]) <= convcrit: return simplex[bi] # Calculate average of non-worst ave=Numeric.zeros(len(var), 'd') for i in range(len(simplex)): if i != wi: ave = ave + simplex[i] ave = ave / (len(simplex) - 1) worst = Numeric.array(simplex[wi]) # Check for too-small simplex simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) if simsize <= minerr: #print "Size of simplex too small:",simsize return simplex[bi] # Invert worst new = 2 * ave - simplex[wi] newv = func(new) if newv <= value[wi]: # Even worse. Shrink instead #print "Shrunk simplex" #print "ave=",repr(ave) #print "wi=",repr(worst) new = 0.5 * ave + 0.5 * worst newv = func(new) elif newv > value[bi]: # Better than the best. 
Expand new2 = 3 * ave - 2 * worst newv2 = func(new2) if newv2 > newv: # Accept #print "Expanded simplex" new = new2 newv = newv2 simplex[wi] = new value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): err = Numeric.array(err) var = SimplexMaximize(var, err, func, convcrit*5, minerr*5) return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr) From hooft@users.sourceforge.net Sun Nov 10 12:08:42 2002 From: hooft@users.sourceforge.net (Rob W.W. Hooft) Date: Sun, 10 Nov 2002 04:08:42 -0800 Subject: [Spambayes-checkins] spambayes weakloop.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24653 Added Files: weakloop.py Log Message: Loop simplex optimization over weaktest.py --- NEW FILE: weakloop.py --- # # Optimize parameters # """Usage: %(program)s [options] -n nsets Where: -h Show usage and exit. -n int Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...). This is required. In addition, an attempt is made to merge bayescustomize.ini into the options. If that exists, it can be used to change the settings in Options.options. 
""" import sys def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) program = sys.argv[0] default=""" [Classifier] robinson_probability_x = 0.5 robinson_minimum_prob_strength = 0.1 robinson_probability_s = 0.45 max_discriminators = 150 [TestDriver] spam_cutoff = 0.90 ham_cutoff = 0.20 """ import Options start = (Options.options.robinson_probability_x, Options.options.robinson_minimum_prob_strength, Options.options.robinson_probability_s, Options.options.spam_cutoff, Options.options.ham_cutoff) err = (0.01, 0.01, 0.01, 0.005, 0.01) def mkini(vars): f=open('bayescustomize.ini', 'w') f.write(""" [Classifier] robinson_probability_x = %.6f robinson_minimum_prob_strength = %.6f robinson_probability_s = %.6f [TestDriver] spam_cutoff = %.4f ham_cutoff = %.4f """%tuple(vars)) f.close() def score(vars): import os mkini(vars) status = os.system('python2.3 weaktest.py -n %d > weak.out'%nsets) if status != 0: print >> sys.stderr, "Error status from weaktest" sys.exit(status) f = open('weak.out', 'r') txt = f.readlines() # Extract the flex cost field. 
cost = float(txt[-1].split()[2][1:]) f.close() print ''.join(txt[-4:])[:-1] print "x=%.4f p=%.4f s=%.4f sc=%.3f hc=%.3f %.2f"%(tuple(vars)+(cost,)) return -cost def main(): import optimize finish=optimize.SimplexMaximize(start,err,score) mkini(finish) if __name__ == "__main__": import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) nsets = None for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-n': nsets = int(arg) if args: usage(1, "Positional arguments not supported") if nsets is None: usage(1, "-n is required") main() From tim_one@users.sourceforge.net Sun Nov 10 19:59:24 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 11:59:24 -0800 Subject: [Spambayes-checkins] spambayes msgs.py,1.5,1.6 optimize.py,1.1,1.2 pop3proxy.py,1.13,1.14 timcv.py,1.11,1.12 weaktest.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv14712 Modified Files: msgs.py optimize.py pop3proxy.py timcv.py weaktest.py Log Message: Whitespace normalization. Index: msgs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/msgs.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** msgs.py 1 Nov 2002 04:10:50 -0000 1.5 --- msgs.py 10 Nov 2002 19:59:22 -0000 1.6 *************** *** 84,88 **** def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None): ! """Set HAMTEST/TRAIN and SPAMTEST/TRAIN. If seed is not None, also set SEED. If (ham|spam)test are not set, set to the same as the (ham|spam)train --- 84,88 ---- def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None): ! """Set HAMTEST/TRAIN and SPAMTEST/TRAIN. If seed is not None, also set SEED. 
If (ham|spam)test are not set, set to the same as the (ham|spam)train Index: optimize.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/optimize.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** optimize.py 10 Nov 2002 12:07:15 -0000 1.1 --- optimize.py 10 Nov 2002 19:59:22 -0000 1.2 *************** *** 11,66 **** simplex = [var] for i in range(len(var)): ! var2 = copy.copy(var) ! var2[i] = var[i] + err[i] ! simplex.append(var2) value = [] for i in range(len(simplex)): ! value.append(func(simplex[i])) while 1: ! # Determine worst and best ! wi = 0 ! bi = 0 ! for i in range(len(simplex)): ! if value[wi] > value[i]: ! wi = i ! if value[bi] < value[i]: ! bi = i ! # Test for convergence ! #print "worst, best are",wi,bi,"with",value[wi],value[bi] ! if abs(value[bi] - value[wi]) <= convcrit: ! return simplex[bi] ! # Calculate average of non-worst ! ave=Numeric.zeros(len(var), 'd') ! for i in range(len(simplex)): ! if i != wi: ! ave = ave + simplex[i] ! ave = ave / (len(simplex) - 1) ! worst = Numeric.array(simplex[wi]) ! # Check for too-small simplex ! simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) ! if simsize <= minerr: ! #print "Size of simplex too small:",simsize ! return simplex[bi] ! # Invert worst ! new = 2 * ave - simplex[wi] ! newv = func(new) ! if newv <= value[wi]: ! # Even worse. Shrink instead ! #print "Shrunk simplex" ! #print "ave=",repr(ave) ! #print "wi=",repr(worst) ! new = 0.5 * ave + 0.5 * worst ! newv = func(new) ! elif newv > value[bi]: ! # Better than the best. Expand ! new2 = 3 * ave - 2 * worst ! newv2 = func(new2) ! if newv2 > newv: ! # Accept ! #print "Expanded simplex" ! new = new2 ! newv = newv2 ! simplex[wi] = new ! value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): --- 11,66 ---- simplex = [var] for i in range(len(var)): ! var2 = copy.copy(var) ! var2[i] = var[i] + err[i] ! 
simplex.append(var2) value = [] for i in range(len(simplex)): ! value.append(func(simplex[i])) while 1: ! # Determine worst and best ! wi = 0 ! bi = 0 ! for i in range(len(simplex)): ! if value[wi] > value[i]: ! wi = i ! if value[bi] < value[i]: ! bi = i ! # Test for convergence ! #print "worst, best are",wi,bi,"with",value[wi],value[bi] ! if abs(value[bi] - value[wi]) <= convcrit: ! return simplex[bi] ! # Calculate average of non-worst ! ave=Numeric.zeros(len(var), 'd') ! for i in range(len(simplex)): ! if i != wi: ! ave = ave + simplex[i] ! ave = ave / (len(simplex) - 1) ! worst = Numeric.array(simplex[wi]) ! # Check for too-small simplex ! simsize = Numeric.add.reduce(Numeric.absolute(ave - worst)) ! if simsize <= minerr: ! #print "Size of simplex too small:",simsize ! return simplex[bi] ! # Invert worst ! new = 2 * ave - simplex[wi] ! newv = func(new) ! if newv <= value[wi]: ! # Even worse. Shrink instead ! #print "Shrunk simplex" ! #print "ave=",repr(ave) ! #print "wi=",repr(worst) ! new = 0.5 * ave + 0.5 * worst ! newv = func(new) ! elif newv > value[bi]: ! # Better than the best. Expand ! new2 = 3 * ave - 2 * worst ! newv2 = func(new2) ! if newv2 > newv: ! # Accept ! #print "Expanded simplex" ! new = new2 ! newv = newv2 ! simplex[wi] = new ! value[wi] = newv def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001): Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13 --- pop3proxy.py 10 Nov 2002 19:59:22 -0000 1.14 *************** *** 140,144 **** can't connect to the real POP3 server and talk to it synchronously, because that would block the process.""" ! 
def __init__(self, serverName, serverPort, lineCallback): BrighterAsyncChat.__init__(self) --- 140,144 ---- can't connect to the real POP3 server and talk to it synchronously, because that would block the process.""" ! def __init__(self, serverName, serverPort, lineCallback): BrighterAsyncChat.__init__(self) *************** *** 148,152 **** self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((serverName, serverPort)) ! def collect_incoming_data(self, data): self.request = self.request + data --- 148,152 ---- self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.connect((serverName, serverPort)) ! def collect_incoming_data(self, data): self.request = self.request + data *************** *** 184,188 **** self.seenAllHeaders = False # For the current RETR or TOP self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, self.onServerLine) --- 184,188 ---- self.seenAllHeaders = False # For the current RETR or TOP self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, self.onServerLine) *************** *** 198,214 **** isFirstLine = not self.response self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! # Has the server closed its end of the socket? if not line: self.isClosing = True ! # If we're not processing a command, just echo the response. if not self.command: self.push(self.response) self.response = '' ! # Time out after 30 seconds for message-retrieval commands if # all the headers are down. The rest of the message will proxy --- 198,214 ---- isFirstLine = not self.response self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! # Has the server closed its end of the socket? if not line: self.isClosing = True ! # If we're not processing a command, just echo the response. 
if not self.command: self.push(self.response) self.response = '' ! # Time out after 30 seconds for message-retrieval commands if # all the headers are down. The rest of the message will proxy *************** *** 223,227 **** self.onResponse() self.response = '' ! def isMultiline(self): """Returns True if the request should get a multiline --- 223,227 ---- self.onResponse() self.response = '' ! def isMultiline(self): """Returns True if the request should get a multiline *************** *** 254,258 **** self.close() raise SystemExit ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': --- 254,258 ---- self.close() raise SystemExit ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': *************** *** 265,271 **** self.args = splitCommand[1:] self.startTime = time.time() ! self.request = '' ! def onResponse(self): # Pass the request and the raw response to the subclass and --- 265,271 ---- self.args = splitCommand[1:] self.startTime = time.time() ! self.request = '' ! def onResponse(self): # Pass the request and the raw response to the subclass and *************** *** 273,277 **** cooked = self.onTransaction(self.command, self.args, self.response) self.push(cooked) ! # If onServerLine() decided that the server has closed its # socket, close this one when the response has been sent. --- 273,277 ---- cooked = self.onTransaction(self.command, self.args, self.response) self.push(cooked) ! # If onServerLine() decided that the server has closed its # socket, close this one when the response has been sent. *************** *** 351,355 **** status.activeSessions -= 1 POP3ProxyBase.close(self) ! def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the --- 351,355 ---- status.activeSessions -= 1 POP3ProxyBase.close(self) ! 
def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the *************** *** 419,423 **** if command == 'RETR': status.numUnsure += 1 ! headers, body = re.split(r'\n\r?\n', response, 1) headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n" --- 419,423 ---- if command == 'RETR': status.numUnsure += 1 ! headers, body = re.split(r'\n\r?\n', response, 1) headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n" *************** *** 490,494 **** .content { margin: 15 } .sectiontable { border: 1px solid #808080; width: 95%% } ! .sectionheading { background: fffae0; padding-left: 1ex; border-bottom: 1px solid #808080; font-weight: bold } --- 490,494 ---- .content { margin: 15 } .sectiontable { border: 1px solid #808080; width: 95%% } ! .sectionheading { background: fffae0; padding-left: 1ex; border-bottom: 1px solid #808080; font-weight: bold } *************** *** 513,517 **** shutdownDB = """""" ! shutdownPickle = shutdownDB + """   """ --- 513,517 ---- shutdownDB = """""" ! shutdownPickle = shutdownDB + """   """ *************** *** 521,525 ****
    %s
    %s
     
    \n""" ! summary = """POP3 proxy running on port %(proxyPort)d, proxying to %(serverName)s:%(serverPort)d.
    --- 521,525 ---- %s  
    \n""" ! summary = """POP3 proxy running on port %(proxyPort)d, proxying to %(serverName)s:%(serverPort)d.
    *************** *** 529,538 **** %(numHams)d ham, %(numUnsure)d unsure. """ ! wordQuery = """
    """ ! train = """
    --- 529,538 ---- %(numHams)d ham, %(numUnsure)d unsure. """ ! wordQuery = """
    """ ! train = """
    *************** *** 546,550 ****
    """ ! def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) --- 546,550 ---- """ ! def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 577,581 **** self.request = self.request + '\r\n\r\n' return ! if type(self.get_terminator()) is type(1): # We've just read the body of a POSTed request. --- 577,581 ---- self.request = self.request + '\r\n\r\n' return ! if type(self.get_terminator()) is type(1): # We've just read the body of a POSTed request. *************** *** 592,596 **** # A normal x-www-form-urlencoded. params.update(cgi.parse_qs(body, keep_blank_values=True)) ! # Convert the cgi params into a simple dictionary. plainParams = {} --- 592,596 ---- # A normal x-www-form-urlencoded. params.update(cgi.parse_qs(body, keep_blank_values=True)) ! # Convert the cgi params into a simple dictionary. plainParams = {} *************** *** 604,608 **** if path == '/': path = '/Home' ! if path == '/helmet.gif': # XXX Why doesn't Expires work? Must read RFC 2616 one day. --- 604,608 ---- if path == '/': path = '/Home' ! if path == '/helmet.gif': # XXX Why doesn't Expires work? Must read RFC 2616 one day. *************** *** 628,632 **** else: self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): timeNow = time.gmtime(time.time()) --- 628,632 ---- else: self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): timeNow = time.gmtime(time.time()) *************** *** 645,649 **** self.push("\r\n") self.push("

    %d %s

    " % (code, message)) ! def pushPreamble(self, name): self.push(self.header % name) --- 645,649 ---- self.push("\r\n") self.push("

    %d %s

    " % (code, message)) ! def pushPreamble(self, name): self.push(self.header % name) *************** *** 681,685 **** message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') ! # Append the message to a file, to make it easier to rebuild # the database later. This is a temporary implementation - --- 681,685 ---- message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') ! # Append the message to a file, to make it easier to rebuild # the database later. This is a temporary implementation - *************** *** 718,722 **** except KeyError: info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) --- 718,722 ---- except KeyError: info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) *************** *** 992,996 **** elif opt == '-u': status.uiPort = int(arg) ! # Do whatever we've been asked to do... if not opts and not args: --- 992,996 ---- elif opt == '-u': status.uiPort = int(arg) ! # Do whatever we've been asked to do... if not opts and not args: Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** timcv.py 1 Nov 2002 04:10:50 -0000 1.11 --- timcv.py 10 Nov 2002 19:59:22 -0000 1.12 *************** *** 15,19 **** --HamTrain int ! The maximum number of msgs to use from each Ham set for training. The msgs are chosen randomly. See also the -s option. --- 15,19 ---- --HamTrain int ! The maximum number of msgs to use from each Ham set for training. The msgs are chosen randomly. See also the -s option. *************** *** 23,27 **** --HamTest int ! The maximum number of msgs to use from each Ham set for testing. 
The msgs are chosen randomly. See also the -s option. --- 23,27 ---- --HamTest int ! The maximum number of msgs to use from each Ham set for testing. The msgs are chosen randomly. See also the -s option. *************** *** 73,79 **** d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:], train=1)) --- 73,79 ---- d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:], train=1), ! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:], train=1)) *************** *** 98,102 **** del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), msgs.SpamStream(sname, s2, train=1)) --- 98,102 ---- del s2[i] ! d.train(msgs.HamStream(hname, h2, train=1), msgs.SpamStream(sname, s2, train=1)) Index: weaktest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** weaktest.py 10 Nov 2002 12:02:33 -0000 1.2 --- weaktest.py 10 Nov 2002 19:59:22 -0000 1.3 *************** *** 58,62 **** nham = len(hamfns) nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: --- 58,62 ---- nham = len(hamfns) nspam = len(spamfns) ! allfns = {} for fn in spamfns+hamfns: *************** *** 133,137 **** print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) print "Flex cost: $%.4f"%flexcost ! def main(): import getopt --- 133,137 ---- print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure) print "Flex cost: $%.4f"%flexcost ! 
def main(): import getopt From tim_one@users.sourceforge.net Sun Nov 10 20:00:03 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 12:00:03 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv14946 Modified Files: msgstore.py Log Message: Whitespace normalization. Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** msgstore.py 7 Nov 2002 22:30:09 -0000 1.23 --- msgstore.py 10 Nov 2002 19:59:59 -0000 1.24 *************** *** 397,401 **** # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed pass ! return "%s\n%s\n%s" % (headers, html, body) --- 397,401 ---- # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed pass ! return "%s\n%s\n%s" % (headers, html, body) From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 17:59:08 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam Modified Files: profile.py Log Message: For the benefit of future generations, renamed some options: Old New --- --- robinson_probability_x unknown_word_prob robinson_probability_s unknown_word_strength robinson_minimum_prob_strength minimum_prob_strength Index: profile.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** profile.py 7 Nov 2002 22:30:11 -0000 1.3 --- profile.py 11 Nov 2002 01:59:06 -0000 1.4 *************** *** 44,48 **** class WordInfo(Persistent): ! 
def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 44,48 ---- class WordInfo(Persistent): ! def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 10 Nov 2002 17:59:08 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5402 Modified Files: Options.py classifier.py weakloop.py Log Message: For the benefit of future generations, renamed some options: Old New --- --- robinson_probability_x unknown_word_prob robinson_probability_s unknown_word_strength robinson_minimum_prob_strength minimum_prob_strength Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.67 retrieving revision 1.68 diff -C2 -d -r1.67 -r1.68 *** Options.py 8 Nov 2002 04:06:23 -0000 1.67 --- Options.py 11 Nov 2002 01:59:06 -0000 1.68 *************** *** 241,268 **** # These two control the prior assumption about word probabilities. ! # "x" is essentially the probability given to a word that has never been ! # seen before. Nobody has reported an improvement via moving it away ! # from 1/2. ! # "s" adjusts how much weight to give the prior assumption relative to ! # the probabilities estimated by counting. At s=0, the counting estimates ! # are believed 100%, even to the extent of assigning certainty (0 or 1) ! # to a word that has appeared in only ham or only spam. This is a disaster. ! # As s tends toward infintity, all probabilities tend toward x. All ! # reports were that a value near 0.4 worked best, so this does not seem to ! # be corpus-dependent. ! 
# NOTE: Gary Robinson previously used a different formula involving 'a' ! # and 'x'. The 'x' here is the same as before. The 's' here is the old ! # 'a' divided by 'x'. ! robinson_probability_x: 0.5 ! robinson_probability_s: 0.45 # When scoring a message, ignore all words with ! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many ! # tests over Robinsons base scheme. 0.1 appeared to work well across ! # all corpora. ! robinson_minimum_prob_strength: 0.1 ! # The combining scheme currently detailed on Gary Robinons web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme --- 241,268 ---- # These two control the prior assumption about word probabilities. ! # unknown_word_prob is essentially the probability given to a word that ! # has never been seen before. Nobody has reported an improvement via moving ! # it away from 1/2, although Tim has measured a mean spamprob of a bit over ! # 0.5 (0.51-0.55) in 3 well-trained classifiers. ! # ! # unknown_word_strength adjusts how much weight to give the prior assumption ! # relative to the probabilities estimated by counting. At 0, the counting ! # estimates are believed 100%, even to the extent of assigning certainty ! # (0 or 1) to a word that has appeared in only ham or only spam. This ! # is a disaster. ! # ! # As unknown_word_strength tends toward infintity, all probabilities tend ! # toward unknown_word_prob. All reports were that a value near 0.4 worked ! # best, so this does not seem to be corpus-dependent. ! unknown_word_prob: 0.5 ! unknown_word_strength: 0.45 # When scoring a message, ignore all words with ! # abs(word.spamprob - 0.5) < minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many ! # tests. 0.1 appeared to work well across all corpora. ! minimum_prob_strength: 0.1 ! 
# The combining scheme currently detailed on the Robinon web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme *************** *** 272,284 **** # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) ! # follows the chi-squared distribution with 2*n degrees of freedom. That is ! # the "provably most-sensitive" test Garys original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where ! # lots of the mistakes live. This is the best method so far on Tims data. ! # One systematic benefit is that it is immune to "cancellation disease". One ! # systematic drawback is that it is sensitive to *any* deviation from a ! # uniform distribution, regardless of whether that is actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). --- 272,284 ---- # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) ! # follows the chi-squared distribution with 2*n degrees of freedom. This is ! # the "provably most-sensitive" test the original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where ! # lots of the mistakes live. This is the best method so far. ! # One systematic benefit is is immunity to "cancellation disease". One ! # systematic drawback is sensitivity to *any* deviation from a ! # uniform distribution, regardless of whether actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). 
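[Editorial note on the Options.py comments above.] The two ideas described in these comments, the prior-adjusted word probability and the chi-squared combining rule, can be illustrated with a short sketch. This is not the code from classifier.py; it is a minimal illustration written for this note, assuming the usual form of the adjustment, (s*x + n*p) / (s + n), and the closed-form chi-squared tail for an even number of degrees of freedom. The function names adjusted_spamprob, chi2q and chi2_score are hypothetical, and the defaults simply mirror the option values shown in the diff.

```python
import math

def adjusted_spamprob(spamcount, hamcount, nspam, nham,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Blend the counting estimate for one word with the prior assumption.

    With no evidence the result is unknown_word_prob; as the word is seen
    more often, the counting estimate dominates.
    """
    spamratio = spamcount / float(nspam or 1)
    hamratio = hamcount / float(nham or 1)
    total = spamratio + hamratio
    if total == 0:
        return unknown_word_prob          # never-seen word
    prob = spamratio / total              # raw counting estimate
    n = spamcount + hamcount              # amount of evidence for this word
    s = unknown_word_strength
    return (s * unknown_word_prob + n * prob) / (s + n)

def chi2q(x2, v):
    """Return P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)                # guard against rounding above 1

def chi2_score(probs):
    """Combine word probabilities (each strictly between 0 and 1) into a score.

    S measures evidence of spam, H evidence of ham; they are merged with
    (S - H + 1) / 2 as described in the comment above.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2q(spam_stat, 2 * n)
    H = 1.0 - chi2q(ham_stat, 2 * n)
    return (S - H + 1.0) / 2.0
```

For example, chi2_score([0.99] * 10) comes out very close to 1, chi2_score([0.01] * 10) very close to 0, and a vector of all 0.5 values gives exactly 0.5. Note that a strength of 0 would let adjusted_spamprob return exactly 0 or 1 for a word seen in only one class, which is the "disaster" the comment warns about, and would also make the logarithms in chi2_score fail.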
*************** *** 381,387 **** }, 'Classifier': {'max_discriminators': int_cracker, ! 'robinson_probability_x': float_cracker, ! 'robinson_probability_s': float_cracker, ! 'robinson_minimum_prob_strength': float_cracker, 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, --- 381,387 ---- }, 'Classifier': {'max_discriminators': int_cracker, ! 'unknown_word_prob': float_cracker, ! 'unknown_word_strength': float_cracker, ! 'minimum_prob_strength': float_cracker, 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** classifier.py 7 Nov 2002 22:30:05 -0000 1.49 --- classifier.py 11 Nov 2002 01:59:06 -0000 1.50 *************** *** 70,74 **** # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=options.robinson_probability_x): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 --- 70,74 ---- # a word is no longer being used, it's just wasting space. ! def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 *************** *** 322,327 **** nspam = float(self.nspam or 1) ! S = options.robinson_probability_s ! StimesX = S * options.robinson_probability_x for word, record in self.wordinfo.iteritems(): --- 322,327 ---- nspam = float(self.nspam or 1) ! S = options.unknown_word_strength ! StimesX = S * options.unknown_word_prob for word, record in self.wordinfo.iteritems(): *************** *** 449,454 **** def _getclues(self, wordstream): ! mindist = options.robinson_minimum_prob_strength ! unknown = options.robinson_probability_x clues = [] # (distance, prob, word, record) tuples --- 449,454 ---- def _getclues(self, wordstream): ! 
mindist = options.minimum_prob_strength ! unknown = options.unknown_word_prob clues = [] # (distance, prob, word, record) tuples Index: weakloop.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1 --- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2 *************** *** 29,35 **** default=""" [Classifier] ! robinson_probability_x = 0.5 ! robinson_minimum_prob_strength = 0.1 ! robinson_probability_s = 0.45 max_discriminators = 150 --- 29,35 ---- default=""" [Classifier] ! unknown_word_prob = 0.5 ! minimum_prob_strength = 0.1 ! unknown_word_strength = 0.45 max_discriminators = 150 *************** *** 41,47 **** import Options ! start = (Options.options.robinson_probability_x, ! Options.options.robinson_minimum_prob_strength, ! Options.options.robinson_probability_s, Options.options.spam_cutoff, Options.options.ham_cutoff) --- 41,47 ---- import Options ! start = (Options.options.unknown_word_prob, ! Options.options.minimum_prob_strength, ! Options.options.unknown_word_strength, Options.options.spam_cutoff, Options.options.ham_cutoff) *************** *** 52,58 **** f.write(""" [Classifier] ! robinson_probability_x = %.6f ! robinson_minimum_prob_strength = %.6f ! robinson_probability_s = %.6f [TestDriver] --- 52,58 ---- f.write(""" [Classifier] ! unknown_word_prob = %.6f ! minimum_prob_strength = %.6f ! 
unknown_word_strength = %.6f [TestDriver] From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 07 Nov 2002 20:06:29 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67 tokenizer.py,1.63,1.64 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31798 Modified Files: Options.py tokenizer.py Log Message: Removed option retain_pure_html_tags; nobody enables that anymore, and it's hard to believe it would ever help anymore (except as an HTML detector). Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** Options.py 7 Nov 2002 22:25:46 -0000 1.66 --- Options.py 8 Nov 2002 04:06:23 -0000 1.67 *************** *** 42,53 **** x-.* - # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags - # from pure text/html messages. Set true to retain HTML tags in this - # case. On the c.l.py corpus, it helps to set this true because any - # sign of HTML is so despised on tech lists; however, the advantage - # of setting it true eventually vanishes even there given enough - # training data. - retain_pure_html_tags: False - # If true, the first few characters of application/octet-stream sections # are used, undecoded. What 'few' means is decided by octet_prefix_size. --- 42,45 ---- *************** *** 347,352 **** all_options = { ! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, ! 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, --- 339,343 ---- all_options = { ! 
'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.63 retrieving revision 1.64 diff -C2 -d -r1.63 -r1.64 *** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63 --- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64 *************** *** 495,504 **** # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaults to False. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was removed. ############################################################################## --- 495,505 ---- # Later: As the amount of training data increased, the effect of retaining # HTML tags decreased to insignificance. options.retain_pure_html_tags ! # was introduced to control this, and it defaulted to False. Later, as the ! # algorithm improved, retain_pure_html_tags was removed. # # Later: The decision to ignore "redundant" HTML is also dubious, since # the text/plain and text/html alternatives may have entirely different # content. options.ignore_redundant_html was introduced to control this, ! # and it defaults to False. Later: ignore_redundant_html was also removed. ############################################################################## *************** *** 1167,1175 **** """Generate a stream of tokens from an email Message. - HTML tags are always stripped from text/plain sections. - options.retain_pure_html_tags controls whether HTML tags are - also stripped from text/html sections. 
Except in special cases, - it's recommended to leave that at its default of false. - If options.check_octets is True, the first few undecoded characters of application/octet-stream parts of the message body become tokens. --- 1168,1171 ---- *************** *** 1228,1235 **** # Remove HTML/XML tags. Also  . ! if (part.get_content_type() == "text/plain" or ! not options.retain_pure_html_tags): ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. --- 1224,1229 ---- # Remove HTML/XML tags. Also  . ! text = text.replace(' ', ' ') ! text = html_re.sub(' ', text) # Tokenize everything in the body. From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Fri, 08 Nov 2002 00:00:25 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25390 Modified Files: pop3proxy.py Log Message: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch. o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. o It now copes with POP3 servers that don't issue a welcome command. o The training form now appears in the training results, so you can train on another message without having to go back to the Home page. 
Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11 --- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 *************** *** 47,50 **** --- 47,74 ---- + todo = """ + o (Re)training interface - one message per line, quick-rendering table. + o Slightly-wordy index page; intro paragraph for each page. + o Once the training stuff is on a separate page, make the paste box + bigger. + o "Links" section (on homepage?) to project homepage, mailing list, + etc. + o "Home" link (with helmet!) at the end of each page. + o "Classify this" - just like Train. + o "Send me an email every [...] to remind me to train on new + messages." + o "Send me a status email every [...] telling how many mails have been + classified, etc." + o Deployment: Windows executable? atlaxwin and ctypes? Or just + webbrowser? + o Possibly integrate Tim Stone's SMTP code - make it use async, make + the training code update (rather than replace!) the database. + o Can it cleanly dynamically update its status display while having a + POP3 converation? Hammering reload sucks. + o Add a command to save the database without shutting down, and one to + reload the database. + o Leave the word in the input field after a Word query. + """ + import sys, re, operator, errno, getopt, cPickle, cStringIO, time import socket, asyncore, asynchat, cgi, urlparse, webbrowser *************** *** 92,95 **** --- 116,120 ---- self.factory(*args) + class BrighterAsyncChat(asynchat.async_chat): """An asynchat.async_chat that doesn't give spurious warnings on *************** *** 110,113 **** --- 135,164 ---- + class ServerLineReader(BrighterAsyncChat): + """An async socket that reads lines from a remote server and + simply calls a callback with the data. 
The BayesProxy object + can't connect to the real POP3 server and talk to it + synchronously, because that would block the process.""" + + def __init__(self, serverName, serverPort, lineCallback): + BrighterAsyncChat.__init__(self) + self.lineCallback = lineCallback + self.request = '' + self.set_terminator('\r\n') + self.create_socket(socket.AF_INET, socket.SOCK_STREAM) + self.connect((serverName, serverPort)) + + def collect_incoming_data(self, data): + self.request = self.request + data + + def found_terminator(self): + self.lineCallback(self.request + '\r\n') + self.request = '' + + def handle_close(self): + self.lineCallback('') + self.close() + + class POP3ProxyBase(BrighterAsyncChat): """An async dispatcher that understands POP3 and proxies to a POP3 *************** *** 126,134 **** BrighterAsyncChat.__init__(self, clientSocket) self.request = '' self.set_terminator('\r\n') ! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ! self.serverSocket.connect((serverName, serverPort)) ! self.serverIn = self.serverSocket.makefile('r') # For reading only ! self.push(self.serverIn.readline()) def onTransaction(self, command, args, response): --- 177,189 ---- BrighterAsyncChat.__init__(self, clientSocket) self.request = '' + self.response = '' self.set_terminator('\r\n') ! self.command = '' # The POP3 command being processed... ! self.args = '' # ...and its arguments ! self.isClosing = False # Has the server closed the socket? ! self.seenAllHeaders = False # For the current RETR or TOP ! self.startTime = 0 # (ditto) ! self.serverSocket = ServerLineReader(serverName, serverPort, ! self.onServerLine) def onTransaction(self, command, args, response): *************** *** 139,152 **** raise NotImplementedError ! def isMultiline(self, command, args): ! """Returns True if the given request should get a multiline response (assuming the response is positive). """ ! if command in ['USER', 'PASS', 'APOP', 'QUIT', ! 
'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif command in ['RETR', 'TOP']: return True ! elif command in ['LIST', 'UIDL']: return len(args) == 0 else: --- 194,237 ---- raise NotImplementedError ! def onServerLine(self, line): ! """A line of response has been received from the POP3 server.""" ! isFirstLine = not self.response ! self.response = self.response + line ! ! # Is this line that terminates a set of headers? ! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] ! ! # Has the server closed its end of the socket? ! if not line: ! self.isClosing = True ! ! # If we're not processing a command, just echo the response. ! if not self.command: ! self.push(self.response) ! self.response = '' ! ! # Time out after 30 seconds for message-retrieval commands if ! # all the headers are down. The rest of the message will proxy ! # straight through. ! if self.command in ['TOP', 'RETR'] and \ ! self.seenAllHeaders and time.time() > self.startTime + 30: ! self.onResponse() ! self.response = '' ! # If that's a complete response, handle it. ! elif not self.isMultiline() or line == '.\r\n' or \ ! (isFirstLine and line.startswith('-ERR')): ! self.onResponse() ! self.response = '' ! ! def isMultiline(self): ! """Returns True if the request should get a multiline response (assuming the response is positive). """ ! if self.command in ['USER', 'PASS', 'APOP', 'QUIT', ! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']: return False ! elif self.command in ['RETR', 'TOP']: return True ! elif self.command in ['LIST', 'UIDL']: return len(args) == 0 else: *************** *** 155,204 **** return False - def readResponse(self, command, args): - """Reads the POP3 server's response and returns a tuple of - (response, isClosing, timedOut). isClosing is True if the - server closes the socket, which tells found_terminator() to - close when the response has been sent. 
timedOut is set if a - TOP or RETR request was still arriving after 30 seconds, and - tells found_terminator() to proxy the remainder of the response. - """ - responseLines = [] - startTime = time.time() - isMulti = self.isMultiline(command, args) - isClosing = False - timedOut = False - isFirstLine = True - seenAllHeaders = False - while True: - line = self.serverIn.readline() - if not line: - # The socket's been closed by the server, probably by QUIT. - isClosing = True - break - elif not isMulti or (isFirstLine and line.startswith('-ERR')): - # A single-line response. - responseLines.append(line) - break - elif line == '.\r\n': - # The termination line. - responseLines.append(line) - break - else: - # A normal line - append it to the response and carry on. - responseLines.append(line) - seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n'] - - # Time out after 30 seconds for message-retrieval commands - # if all the headers are down - found_terminator() knows how - # to deal with this. - if command in ['TOP', 'RETR'] and \ - seenAllHeaders and time.time() > startTime + 30: - timedOut = True - break - - isFirstLine = False - - return ''.join(responseLines), isClosing, timedOut - def collect_incoming_data(self, data): """Asynchat override.""" --- 240,243 ---- *************** *** 207,256 **** def found_terminator(self): """Asynchat override.""" - # Send the request to the server and read the reply. if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") self.shutdown(2) self.close() raise SystemExit ! self.serverSocket.sendall(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! command, args = ('', '') else: splitCommand = self.request.strip().split(None, 1) ! command = splitCommand[0].upper() ! args = splitCommand[1:] ! rawResponse, isClosing, timedOut = self.readResponse(command, args) ! 
# Pass the request and the raw response to the subclass and # send back the cooked response. ! cookedResponse = self.onTransaction(command, args, rawResponse) ! self.push(cookedResponse) ! self.request = '' ! ! # If readResponse() timed out, we still need to read and proxy ! # the rest of the message. ! if timedOut: ! while True: ! line = self.serverIn.readline() ! if not line: ! # The socket's been closed by the server. ! isClosing = True ! break ! elif line == '.\r\n': ! # The termination line. ! self.push(line) ! break ! else: ! # A normal line. ! self.push(line) ! ! # If readResponse() or the loop above decided that the server ! # has closed its socket, close this one when the response has ! # been sent. ! if isClosing: self.close_when_done() class BayesProxyListener(Listener): --- 246,288 ---- def found_terminator(self): """Asynchat override.""" if self.request.strip().upper() == 'KILL': self.serverSocket.sendall('QUIT\r\n') self.send("+OK, dying.\r\n") + self.serverSocket.shutdown(2) + self.serverSocket.close() self.shutdown(2) self.close() raise SystemExit ! ! self.serverSocket.push(self.request + '\r\n') if self.request.strip() == '': # Someone just hit the Enter key. ! self.command = self.args = '' else: + # A proper command. splitCommand = self.request.strip().split(None, 1) ! self.command = splitCommand[0].upper() ! self.args = splitCommand[1:] ! self.startTime = time.time() ! ! self.request = '' ! ! def onResponse(self): # Pass the request and the raw response to the subclass and # send back the cooked response. ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) ! ! # If onServerLine() decided that the server has closed its ! # socket, close this one when the response has been sent. ! if self.isClosing: self.close_when_done() + # Reset. 
+ self.command = '' + self.args = '' + self.isClosing = False + self.seenAllHeaders = False + class BayesProxyListener(Listener): *************** *** 452,456 **** table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15 } .header { font-size: 133%% } .content { margin: 15 } --- 484,490 ---- table { font: 90%% arial, swiss, helvetica } form { margin: 0 } ! .banner { background: #c0e0ff; padding=5; padding-left: 15; ! border-top: 1px solid black; ! border-bottom: 1px solid black } .header { font-size: 133%% } .content { margin: 15 } *************** *** 466,470 ****
    \n""" --- 500,504 ----
    \n""" *************** *** 475,481 **** Spambayes.org ! \n""" pageSection = """ --- 509,520 ---- Spambayes.org
    %s
    \n""" + shutdownDB = """""" + + shutdownPickle = shutdownDB + """   + """ + pageSection = """ *************** *** 483,486 **** --- 522,533 ----  
    \n""" + summary = """POP3 proxy running on port %(proxyPort)d, + proxying to %(serverName)s:%(serverPort)d.
    + Active POP3 conversations: %(activeSessions)d.
    + POP3 conversations this session: %(totalSessions)d.
    + Emails classified this session: %(numSpams)d spam, + %(numHams)d ham, %(numUnsure)d unsure. + """ + wordQuery = """ *************** *** 488,491 **** --- 535,550 ---- """ + train = """ + Either upload a message file:
    + Or paste the whole message (including headers) here:
    +
    + Is this message + Ham or + Spam?
    + + """ + def __init__(self, clientSocket, bayes): BrighterAsyncChat.__init__(self, clientSocket) *************** *** 502,506 **** """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = self.request.split('\r\n', 1) try: method, url, version = requestLine.strip().split() --- 561,565 ---- """Asynchat override. Read and parse the HTTP request and call an on handler.""" ! requestLine, headers = (self.request+'\r\n').split('\r\n', 1) try: method, url, version = requestLine.strip().split() *************** *** 547,551 **** if path == '/helmet.gif': ! self.pushOKHeaders('image/gif') self.push(self.helmet) else: --- 606,614 ---- if path == '/helmet.gif': ! # XXX Why doesn't Expires work? Must read RFC 2616 one day. ! inOneHour = time.gmtime(time.time() + 3600) ! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour) ! extraHeaders = {'Expires': expiryDate} ! self.pushOKHeaders('image/gif', extraHeaders) self.push(self.helmet) else: *************** *** 554,558 **** handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % url) else: # This is a request for a valid page; run the handler. --- 617,621 ---- handler = getattr(self, 'on' + name) except AttributeError: ! self.pushError(404, "Not found: '%s'" % path) else: # This is a request for a valid page; run the handler. *************** *** 561,569 **** handler(params) timeString = time.asctime(time.localtime()) ! self.push(self.footer % timeString) ! def pushOKHeaders(self, contentType): ! self.push("HTTP/1.0 200 OK\r\n") self.push("Content-Type: %s\r\n" % contentType) self.push("\r\n") --- 624,641 ---- handler(params) timeString = time.asctime(time.localtime()) ! if status.useDB: ! self.push(self.footer % (timeString, self.shutdownDB)) ! else: ! self.push(self.footer % (timeString, self.shutdownPickle)) ! def pushOKHeaders(self, contentType, extraHeaders={}): ! timeNow = time.gmtime(time.time()) ! 
httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow) ! self.push("HTTP/1.1 200 OK\r\n") ! self.push("Connection: close\r\n") self.push("Content-Type: %s\r\n" % contentType) + self.push("Date: %s\r\n" % httpNow) + for name, value in extraHeaders.items(): + self.push("%s: %s\r\n" % (name, value)) self.push("\r\n") *************** *** 583,616 **** def onHome(self, params): ! summary = """POP3 proxy running on port %(proxyPort)d, ! proxying to %(serverName)s:%(serverPort)d.
    ! Active POP3 conversations: %(activeSessions)d.
    ! POP3 conversations this session: ! %(totalSessions)d.
    ! Emails classified this session: %(numSpams)d spam, ! %(numHams)d ham, %(numUnsure)d unsure. ! """ % status.__dict__ ! ! train = """
    ! Either upload a message file: !
    ! Or paste the whole message (including headers) here:
    !
    ! Is this message ! Ham or ! Spam?
    ! ! """ ! ! body = (self.pageSection % ('Status', summary) + ! self.pageSection % ('Word query', self.wordQuery) + ! self.pageSection % ('Train', train)) self.push(body) def onShutdown(self, params): ! self.push("

    Shutdown. Goodbye.

    ") ! self.push(' ') # Acts as a flush for small buffers. self.shutdown(2) self.close() --- 655,675 ---- def onHome(self, params): ! """Serve up the homepage.""" ! body = (self.pageSection % ('Status', self.summary % status.__dict__)+ ! self.pageSection % ('Word query', self.wordQuery)+ ! self.pageSection % ('Train', self.train)) self.push(body) def onShutdown(self, params): ! """Shutdown the server, saving the pickle if requested to do so.""" ! if params['how'].lower().find('save') >= 0: ! if not status.useDB and status.pickleName: ! self.push("Saving...") ! self.push(' ') # Acts as a flush for small buffers. ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("Shutdown. Goodbye.") ! self.push(' ') self.shutdown(2) self.close() *************** *** 618,625 **** def onUpload(self, params): message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') # Append the message to a file, to make it easier to rebuild ! # the database later. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: --- 677,690 ---- def onUpload(self, params): + """Train on an uploaded or pasted message.""" + # Upload or paste? Spam or ham? message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild ! # the database later. This is a temporary implementation - ! # it should keep a Corpus (from Tim Stone's forthcoming message ! # management module) to manage a cache of messages. It needs ! # to keep them for the HTML retraining interface anyway. message = message.replace('\r\n', '\n').replace('\r', '\n') if isSpam: *************** *** 627,642 **** else: f = open("_pop3proxyham.mbox", "a") ! f.write("From ???@???\n") # fake From line (XXX good enough?) f.write(message) ! f.write("\n") f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("""

    Trained on your message. Saving database...

    """) ! self.push(" ") # Flush... must find out how to do this properly... ! if not status.useDB and status.pickleName: ! fp = open(status.pickleName, 'wb') ! cPickle.dump(self.bayes, fp, 1) ! fp.close() ! self.push("

    Done.

    Home

    ") def onWordquery(self, params): --- 692,704 ---- else: f = open("_pop3proxyham.mbox", "a") ! f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n") f.write(message) ! f.write("\n\n") f.close() + + # Train on the message. self.bayes.learn(tokenizer.tokenize(message), isSpam, True) ! self.push("

    OK. Return Home or train another:

    ") ! self.push(self.pageSection % ('Train another', self.train)) def onWordquery(self, params): *************** *** 656,660 **** info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s':" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) --- 718,722 ---- info = "'%s' does not appear in the database." % word ! body = (self.pageSection % ("Statistics for '%s'" % word, info) + self.pageSection % ('Word query', self.wordQuery)) self.push(body) *************** *** 765,771 **** else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) self.request = '' def onStat(self, command, args): """POP3 STAT command.""" --- 827,839 ---- else: handler = self.handlers.get(command, self.onUnknown) ! self.push(handler(command, args)) # Or push_slowly for testing self.request = '' + def push_slowly(self, response): + """Useful for testing.""" + for c in response: + self.push(c) + time.sleep(0.02) + def onStat(self, command, args): """POP3 STAT command.""" *************** *** 777,781 **** """POP3 LIST command, with optional message number argument.""" if args: ! number = int(args) if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) --- 845,852 ---- """POP3 LIST command, with optional message number argument.""" if args: ! try: ! number = int(args) ! except ValueError: ! number = -1 if 0 < number <= len(self.maildrop): return "+OK %d\r\n" % len(self.maildrop[number-1]) *************** *** 803,811 **** def onRetr(self, command, args): """POP3 RETR command.""" ! return self._getMessage(int(args), 12345) def onTop(self, command, args): """POP3 RETR command.""" ! number, lines = map(int, args.split()) return self._getMessage(number, lines) --- 874,889 ---- def onRetr(self, command, args): """POP3 RETR command.""" ! try: ! number = int(args) ! except ValueError: ! number = -1 ! 
return self._getMessage(number, 12345) def onTop(self, command, args): """POP3 RETR command.""" ! try: ! number, lines = map(int, args.split()) ! except ValueError: ! number, lines = -1, -1 return self._getMessage(number, lines) *************** *** 863,867 **** while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) != -1 # Kill the proxy and the test server. --- 941,945 ---- while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(options.hammie_header_name) >= 0 # Kill the proxy and the test server. From jvr@users.sourceforge.net Sat Nov 9 18:05:44 2002 From: jvr@users.sourceforge.net (Just van Rossum) Date: Sat, 09 Nov 2002 10:05:44 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20814 Modified Files: pop3proxy.py Log Message: force word query to be lowercase, making the UI case insensitive Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12 --- pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13 *************** *** 704,707 **** --- 704,708 ---- def onWordquery(self, params): word = params['word'] + word = word.lower() try: # Must be a better way to get __dict__ for a new-style class... 
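A minimal sketch of the line-buffering idea behind ServerLineReader in the pop3proxy.py diff above: data arrives in arbitrary chunks, each complete '\r\n'-terminated line is handed to a callback, and an empty string is passed when the server closes the connection. It leaves out the sockets and asynchat entirely. The LineAssembler class, its feed() and close() methods, and the sample POP3 responses are hypothetical, made up for illustration; they are not code from pop3proxy.py.

```python
class LineAssembler:
    """Hypothetical stand-in for ServerLineReader's buffering (no sockets)."""

    def __init__(self, line_callback, terminator='\r\n'):
        self.line_callback = line_callback
        self.terminator = terminator
        self.pending = ''   # data received so far that has no terminator yet

    def feed(self, data):
        """Accept an arbitrary chunk and emit every line it completes."""
        self.pending += data
        while self.terminator in self.pending:
            line, self.pending = self.pending.split(self.terminator, 1)
            # Like found_terminator() in the diff, put the terminator back
            # on before handing the line to the callback.
            self.line_callback(line + self.terminator)

    def close(self):
        """Signal end-of-stream with an empty line, as handle_close() does.

        Any unterminated data still pending is dropped, not flushed.
        """
        self.line_callback('')


# Chunk boundaries deliberately fall in the middle of lines.
received = []
reader = LineAssembler(received.append)
reader.feed('+OK 2 mess')
reader.feed('ages\r\n1 120\r\n2 ')
reader.feed('340\r\n.\r\n')
reader.close()
```

Because the callback only ever sees whole lines, a consumer such as onServerLine() can test for '.\r\n' or a leading '-ERR' without caring how the bytes were split on the wire.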
From tim_one@users.sourceforge.net Mon Nov 11 23:26:21 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 11 Nov 2002 15:26:21 -0800 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.64,1.65 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv10237 Modified Files: tokenizer.py Log Message: An idea from Anthony Baxter: decode Subject lines, so that they're tokenized in decoded form, and so that they generate charset tokens too. This had minor good effects in both our tests. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.64 retrieving revision 1.65 diff -C2 -d -r1.64 -r1.65 *** tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64 --- tokenizer.py 11 Nov 2002 23:26:18 -0000 1.65 *************** *** 5,8 **** --- 5,9 ---- import email + import email.Header import email.Message import email.Errors *************** *** 1054,1062 **** # but real benefit to keeping case intact in this specific context. x = msg.get('subject', '') ! for w in subject_word_re.findall(x): ! for t in tokenize_word(w): ! yield 'subject:' + t ! for w in punctuation_run_re.findall(x): ! yield 'subject:' + w # Dang -- I can't use Sender:. If I do, --- 1055,1066 ---- # but real benefit to keeping case intact in this specific context. x = msg.get('subject', '') ! for x, subjcharset in email.Header.decode_header(x): ! if subjcharset is not None: ! yield 'subjectcharset:' + subjcharset ! for w in subject_word_re.findall(x): ! for t in tokenize_word(w): ! yield 'subject:' + t ! for w in punctuation_run_re.findall(x): ! yield 'subject:' + w # Dang -- I can't use Sender:. 
If I do, From anthonybaxter@users.sourceforge.net Tue Nov 12 00:37:21 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 11 Nov 2002 16:37:21 -0800 Subject: [Spambayes-checkins] website docs.ht,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv5772 Modified Files: docs.ht Log Message: few more definitions Index: docs.ht =================================================================== RCS file: /cvsroot/spambayes/website/docs.ht,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** docs.ht 19 Sep 2002 23:39:24 -0000 1.3 --- docs.ht 12 Nov 2002 00:37:19 -0000 1.4 *************** *** 27,32 ****
    f-n, FN
    (abbrev.) false negative
    f-p, FP
    (abbrev.) false positive ! - --- 27,34 ----
    f-n, FN
    (abbrev.) false negative
    f-p, FP
    (abbrev.) false positive !
    corpus
    in this context, a body of messages. Usually referring to a ! training database. !
    hapax, hapax legomenon
    a word or form occurring only once in a ! document or corpus. (plural is hapax legomena) From tim.one@comcast.net Tue Nov 12 00:40:44 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 19:40:44 -0500 Subject: [Spambayes-checkins] website docs.ht,1.3,1.4 In-Reply-To: Message-ID: > !
    hapax, hapax legomenon
    a word or form occurring only once in a > ! document or corpus. (plural is hapax legomena) > Ya, but even I'm not that anal -- I usually say hapaxes. hapaxora would be a hoot too. From anthony@interlink.com.au Tue Nov 12 00:43:58 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 11:43:58 +1100 Subject: [Spambayes-checkins] website docs.ht,1.3,1.4 In-Reply-To: Message-ID: <200211120043.gAC0hwp09308@localhost.localdomain> >>> Tim Peters wrote > > !
    hapax, hapax legomenon
    a word or form occurring only once in a > > ! document or corpus. (plural is hapax legomena) > > > > Ya, but even I'm not that anal -- I usually say hapaxes. hapaxora would be > a hoot too Hapax legomena sounds like something that the CDC sends the black helicopters in to lock down an outbreak of... From tim_one@users.sourceforge.net Tue Nov 12 04:52:14 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 11 Nov 2002 20:52:14 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.29,1.30 manager.py,1.33,1.34 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv27097/Outlook2000 Modified Files: addin.py manager.py Log Message: In the "show clues" msg, for each word give the raw ham and spam counts too. Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** addin.py 7 Nov 2002 22:30:08 -0000 1.29 --- addin.py 12 Nov 2002 04:52:12 -0000 1.30 *************** *** 225,233 **** # Format the clues. push("
    \n")
          for word, prob in clues:
              word = repr(word)
    !         push(escape(word) + ' ' * (30 - len(word)))
    !         push(' %g\n' % prob)
          push("
    \n") # Now the raw text of the message, as best we can push("

    Message Stream:


    ") --- 225,244 ---- # Format the clues. push("
    \n")
    +     push("word                                spamprob         #ham  #spam\n")
    +     format = " %-12g %8s %6s\n"
    +     c = mgr.GetClassifier()
    +     fetchword = c.wordinfo.get
          for word, prob in clues:
    +         record = fetchword(word)
    +         if record:
    +             nham = record.hamcount
    +             nspam = record.spamcount
    +         else:
    +             nham = nspam = "-"
              word = repr(word)
    !         push(escape(word) + " " * (35-len(word)))
    !         push(format % (prob, nham, nspam))
          push("
    \n") + # Now the raw text of the message, as best we can push("

    Message Stream:


    ") Index: manager.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** manager.py 7 Nov 2002 22:30:09 -0000 1.33 --- manager.py 12 Nov 2002 04:52:12 -0000 1.34 *************** *** 223,226 **** --- 223,230 ---- self.bayes_dirty = False + def GetClassifier(self): + """Return the classifier we're using.""" + return self.bayes + def SaveConfig(self): if self.verbose > 1: From anthonybaxter@users.sourceforge.net Tue Nov 12 06:21:41 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 11 Nov 2002 22:21:41 -0800 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.65,1.66 Options.py,1.68,1.69 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16090 Modified Files: tokenizer.py Options.py Log Message: New tokenizer option 'address_headers'. Allows the mining of headers other than 'from' for email addresses and names (e.g. to or cc). By default, it's just set to 'from' for now. In addition, address headers (including from) now get decoded and parsed correctly, rather than by a whitespace split. This shows a quite nice improvement for me. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.65 retrieving revision 1.66 diff -C2 -d -r1.65 -r1.66 *** tokenizer.py 11 Nov 2002 23:26:18 -0000 1.65 --- tokenizer.py 12 Nov 2002 06:21:38 -0000 1.66 *************** *** 7,10 **** --- 7,12 ---- import email.Header import email.Message + import email.Header + import email.Utils import email.Errors import re *************** *** 1072,1082 **** # # one (smalls wins & losses across runs, overall # # not significant), so leaving it out ! for field in ('from',): ! prefix = field + ':' ! x = msg.get(field, 'none').lower() ! for w in x.split(): ! 
for t in tokenize_word(w): ! yield prefix + t ! # To: # Cc: --- 1074,1096 ---- # # one (smalls wins & losses across runs, overall # # not significant), so leaving it out ! # To:, Cc: # These can help, if your ham and spam are sourced ! # # from the same location. If not, they'll be horrible. ! for field in options.address_headers: ! addrlist = msg.get_all(field, []) ! if not addrlist: ! yield field + ":none" ! for addrs in addrlist: ! for rname,ename in email.Utils.getaddresses([addrs]): ! if rname: ! for rname,rcharset in email.Header.decode_header(rname): ! for w in rname.lower().split(): ! for t in tokenize_word(w): ! yield field+'realname:'+t ! if rcharset is not None: ! yield field+'charset:'+rcharset ! if ename: ! for w in ename.lower().split('@'): ! for t in tokenize_word(w): ! yield field+'email:'+t # To: # Cc: Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.68 retrieving revision 1.69 diff -C2 -d -r1.68 -r1.69 *** Options.py 11 Nov 2002 01:59:06 -0000 1.68 --- Options.py 12 Nov 2002 06:21:38 -0000 1.69 *************** *** 90,93 **** --- 90,101 ---- mine_received_headers: False + # Mine the following address headers. If you have mixed source corpuses + # (as opposed to a mixed sauce walrus, which is delicious!) then you + # probably don't want to use 'to' or 'cc') + # Address headers will be decoded, and will generate charset tokens as + # well as the real address. + # others to consider: to, cc, reply-to, errors-to, sender, ... 
+ address_headers: from + # If legitimate mail contains things that look like text to the tokenizer # and turning turning off this option helps (perhaps binary attachments get *************** *** 340,343 **** --- 348,352 ---- all_options = { 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())), + 'address_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, 'record_header_absence': boolean_cracker, From anthonybaxter@users.sourceforge.net Tue Nov 12 07:03:22 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 11 Nov 2002 23:03:22 -0800 Subject: [Spambayes-checkins] spambayes/pspam scoremsg.py,1.2,1.3 update.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam In directory usw-pr-cvs1:/tmp/cvs-serv26080 Modified Files: scoremsg.py update.py Log Message: whitespace normalisation. Index: scoremsg.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/scoremsg.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** scoremsg.py 7 Nov 2002 22:30:10 -0000 1.2 --- scoremsg.py 12 Nov 2002 07:03:20 -0000 1.3 *************** *** 39,43 **** ## print ## print msg ! if __name__ == "__main__": main(sys.stdin) --- 39,43 ---- ## print ## print msg ! if __name__ == "__main__": main(sys.stdin) Index: update.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/update.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** update.py 7 Nov 2002 22:30:10 -0000 1.2 --- update.py 12 Nov 2002 07:03:20 -0000 1.3 *************** *** 39,43 **** if not folder_exists(profile.hams, p): profile.add_ham(p) ! for spam in options.spam_folders: p = os.path.join(options.folder_dir, spam) --- 39,43 ---- if not folder_exists(profile.hams, p): profile.add_ham(p) ! 
for spam in options.spam_folders: p = os.path.join(options.folder_dir, spam) *************** *** 49,53 **** profile.update() get_transaction().commit() ! db.close() --- 49,53 ---- profile.update() get_transaction().commit() ! db.close() *************** *** 58,61 **** if k == '-F': FORCE_REBUILD = True ! main(FORCE_REBUILD) --- 58,61 ---- if k == '-F': FORCE_REBUILD = True ! main(FORCE_REBUILD) From anthonybaxter@users.sourceforge.net Tue Nov 12 07:03:22 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 11 Nov 2002 23:03:22 -0800 Subject: [Spambayes-checkins] spambayes/pspam/pspam folder.py,1.2,1.3 options.py,1.1,1.2 profile.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes/pspam/pspam In directory usw-pr-cvs1:/tmp/cvs-serv26080/pspam Modified Files: folder.py options.py profile.py Log Message: whitespace normalisation. Index: folder.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/folder.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** folder.py 7 Nov 2002 22:30:11 -0000 1.2 --- folder.py 12 Nov 2002 07:03:20 -0000 1.3 *************** *** 68,72 **** self.messages[msgid] = msg new.insert(msg) ! removed = difference(self.messages, cur) for msgid in removed.keys(): --- 68,72 ---- self.messages[msgid] = msg new.insert(msg) ! removed = difference(self.messages, cur) for msgid in removed.keys(): Index: options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/options.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** options.py 4 Nov 2002 04:44:20 -0000 1.1 --- options.py 12 Nov 2002 07:03:20 -0000 1.2 *************** *** 1,5 **** from Options import options, all_options, \ boolean_cracker, float_cracker, int_cracker, string_cracker ! 
from sets import Set all_options["Score"] = {'max_ham': float_cracker, --- 1,5 ---- from Options import options, all_options, \ boolean_cracker, float_cracker, int_cracker, string_cracker ! from sets import Set all_options["Score"] = {'max_ham': float_cracker, Index: profile.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** profile.py 11 Nov 2002 01:59:06 -0000 1.4 --- profile.py 12 Nov 2002 07:03:20 -0000 1.5 *************** *** 92,96 **** get_transaction().commit() log("updated probabilities") ! def _update(self, folders, is_spam): changed = False --- 92,96 ---- get_transaction().commit() log("updated probabilities") ! def _update(self, folders, is_spam): changed = False *************** *** 100,104 **** if added: log("added %d" % len(added)) ! if removed: log("removed %d" % len(removed)) get_transaction().commit() --- 100,104 ---- if added: log("added %d" % len(added)) ! if removed: log("removed %d" % len(removed)) get_transaction().commit() *************** *** 117,121 **** for msg in removed.keys(): self.classifier.unlearn(tokenize(msg), is_spam, False) ! if removed: log("unlearned") del removed --- 117,121 ---- for msg in removed.keys(): self.classifier.unlearn(tokenize(msg), is_spam, False) ! if removed: log("unlearned") del removed From tim_one@users.sourceforge.net Tue Nov 12 22:56:26 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 14:56:26 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.30,1.31 msgstore.py,1.24,1.25 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv21157/Outlook2000 Modified Files: addin.py msgstore.py Log Message: Removed the strip_mime_headers business. I'm not sure whether it ever helped, but at this point it was definitely happening too late to do any good. 
Index: addin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** addin.py 12 Nov 2002 04:52:12 -0000 1.30 --- addin.py 12 Nov 2002 22:56:24 -0000 1.31 *************** *** 244,248 **** push("

    Message Stream:


    ") push("
    \n")
    !     msg = msgstore_message.GetEmailPackageObject(strip_mime_headers=False)
          push(escape(msg.as_string(), True))
          push("
    \n") --- 244,248 ---- push("

    Message Stream:


    ") push("
    \n")
    !     msg = msgstore_message.GetEmailPackageObject()
          push(escape(msg.as_string(), True))
          push("
    \n") Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** msgstore.py 10 Nov 2002 19:59:59 -0000 1.24 --- msgstore.py 12 Nov 2002 22:56:24 -0000 1.25 *************** *** 49,53 **** def __init__(self): self.unread = False ! def GetEmailPackageObject(self, strip_mime_headers=True): # Return a "read-only" Python email package object # "read-only" in that changes will never be reflected to the real store. --- 49,53 ---- def __init__(self): self.unread = False ! def GetEmailPackageObject(self): # Return a "read-only" Python email package object # "read-only" in that changes will never be reflected to the real store. *************** *** 420,424 **** self.mapi_object = self.msgstore._OpenEntry(self.id) ! def GetEmailPackageObject(self, strip_mime_headers=True): import email # XXX If this was originally a MIME msg, we're hosed at this point -- --- 420,424 ---- self.mapi_object = self.msgstore._OpenEntry(self.id) ! def GetEmailPackageObject(self): import email # XXX If this was originally a MIME msg, we're hosed at this point -- *************** *** 433,451 **** print "FAILED to create email.message from: ", `text` raise - - if strip_mime_headers: - # If we're going to pass this to a scoring function, the MIME - # headers must be stripped, else the email pkg will run off - # looking for MIME boundaries that don't exist. The charset - # info from the original MIME armor is also lost, and we don't - # want the email pkg to try decoding the msg a second time - # (assuming Outlook is in fact already decoding text originally - # in base64 and quoted-printable). - # We want to retain the MIME headers if we're just displaying - # the msg stream. 
- if msg.has_key('content-type'): - del msg['content-type'] - if msg.has_key('content-transfer-encoding'): - del msg['content-transfer-encoding'] return msg --- 433,436 ---- From tim_one@users.sourceforge.net Tue Nov 12 23:12:14 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 15:12:14 -0800 Subject: [Spambayes-checkins] spambayes mboxutils.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31150 Modified Files: mboxutils.py Log Message: New utility function extract_headers(), for very simple-minded header extraction. Index: mboxutils.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** mboxutils.py 6 Nov 2002 01:57:39 -0000 1.4 --- mboxutils.py 12 Nov 2002 23:12:11 -0000 1.5 *************** *** 25,28 **** --- 25,29 ---- import mailbox import email.Message + import re class DirOfTxtFileMailbox: *************** *** 119,120 **** --- 120,164 ---- msg.set_payload(obj) return msg + + header_break_re = re.compile(r"\r?\n(\r?\n)") + + def extract_headers(text): + """Very simple-minded header extraction: prefix of text up to blank line. + + A blank line is recognized via two adjacent line-ending sequences, where + a line-ending sequence is a newline optionally preceded by a carriage + return. + + If no blank line is found, all of text is considered to be a potential + header section. If a blank line is found, the text up to (but not + including) the blank line is considered to be a potential header section. + + The potential header section is returned, unless it doesn't contain a + colon, in which case an empty string is returned. 
+ + >>> extract_headers("abc") + '' + >>> extract_headers("abc\\n\\n\\n") # no colon + '' + >>> extract_headers("abc: xyz\\n\\n\\n") + 'abc: xyz\\n' + >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n") + 'abc: xyz\\r\\n' + >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish") + 'a: b\\ngibberish\\n' + """ + + m = header_break_re.search(text) + if m: + eol = m.start(1) + text = text[:eol] + if ':' not in text: + text = "" + return text + + def _test(): + import doctest, mboxutils + return doctest.testmod(mboxutils) + + if __name__ == "__main__": + _test() From tim_one@users.sourceforge.net Tue Nov 12 23:16:06 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 15:16:06 -0800 Subject: [Spambayes-checkins] spambayes mboxutils.py,1.5,1.6 tokenizer.py,1.66,1.67 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv1192 Modified Files: mboxutils.py tokenizer.py Log Message: get_message(): changed to use the new extract_headers() hack. Index: mboxutils.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** mboxutils.py 12 Nov 2002 23:12:11 -0000 1.5 --- mboxutils.py 12 Nov 2002 23:16:04 -0000 1.6 *************** *** 114,120 **** # headers are most likely damaged, we can't use the email # package to parse them, so just get rid of them first. ! i = obj.find('\n\n') ! if i >= 0: ! obj = obj[i+2:] # strip headers msg = email.Message.Message() msg.set_payload(obj) --- 114,119 ---- # headers are most likely damaged, we can't use the email # package to parse them, so just get rid of them first. ! headers = extract_headers(obj) ! 
obj = obj[len(headers):] msg = email.Message.Message() msg.set_payload(obj) Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** tokenizer.py 12 Nov 2002 06:21:38 -0000 1.66 --- tokenizer.py 12 Nov 2002 23:16:04 -0000 1.67 *************** *** 17,20 **** --- 17,21 ---- from Options import options + import mboxutils from mboxutils import get_message From tim_one@users.sourceforge.net Tue Nov 12 23:19:35 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 15:19:35 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.25,1.26 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv3198/Outlook2000 Modified Files: msgstore.py Log Message: GetEmailPackageObject(): Removed comments that no longer made sense, at least not here. Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** msgstore.py 12 Nov 2002 22:56:24 -0000 1.25 --- msgstore.py 12 Nov 2002 23:19:33 -0000 1.26 *************** *** 422,430 **** def GetEmailPackageObject(self): import email - # XXX If this was originally a MIME msg, we're hosed at this point -- - # the boundary tag in the headers doesn't exist in the body, and - # the msg is simply ill-formed. The miserable hack here simply - # squashes the text part (if any) and the HTML part (if any) together, - # and strips MIME info from the original headers. 
text = self._GetMessageText() try: --- 422,425 ---- From tim_one@users.sourceforge.net Tue Nov 12 23:33:48 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 15:33:48 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.26,1.27 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv11116/Outlook2000 Modified Files: msgstore.py Log Message: _GetMessageText(): Whatever the value of the headers property, stop paying attention to it after the first blank line, and don't believe it at all if it doesn't contain a colon. Cheap trick to worm around the problems some people have reported with Outlook returning multiple header sections here (including internal MIME armor with empty bodies). Index: msgstore.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** msgstore.py 12 Nov 2002 23:19:33 -0000 1.26 --- msgstore.py 12 Nov 2002 23:33:45 -0000 1.27 *************** *** 1,5 **** from __future__ import generators ! import sys, os try: --- 1,5 ---- from __future__ import generators ! import sys, os, re try: *************** *** 10,13 **** --- 10,53 ---- + # XXX + # import mboxutils doesn't work at this point. The extract_headers function + # here is a copy-and-paste. + header_break_re = re.compile(r"\r?\n(\r?\n)") + + def extract_headers(text): + """Very simple-minded header extraction: prefix of text up to blank line. + + A blank line is recognized via two adjacent line-ending sequences, where + a line-ending sequence is a newline optionally preceded by a carriage + return. + + If no blank line is found, all of text is considered to be a potential + header section. If a blank line is found, the text up to (but not + including) the blank line is considered to be a potential header section. 
+ + The potential header section is returned, unless it doesn't contain a + colon, in which case an empty string is returned. + + >>> extract_headers("abc") + '' + >>> extract_headers("abc\\n\\n\\n") # no colon + '' + >>> extract_headers("abc: xyz\\n\\n\\n") + 'abc: xyz\\n' + >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n") + 'abc: xyz\\r\\n' + >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish") + 'a: b\\ngibberish\\n' + """ + + m = header_break_re.search(text) + if m: + eol = m.start(1) + text = text[:eol] + if ':' not in text: + text = "" + return text + + # Abstract definition - can be moved out when we have more than one sub-class # External interface to this module is almost exclusively via a "folder ID" *************** *** 384,387 **** --- 424,434 ---- html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2]) has_attach = data[3][1] + + # Some Outlooks deliver a strange notion of headers, including + # interior MIME armor. To prevent later errors, try to get rid + # of stuff now that can't possibly be parsed as "real" (SMTP) + # headers. + headers = extract_headers(headers) + # Mail delivered internally via Exchange Server etc may not have # headers - fake some up. *************** *** 392,395 **** --- 439,443 ---- elif headers.startswith("Microsoft Mail"): headers = "X-MS-Mail-Gibberish: " + headers + if not html and not body: # Only ever seen this for "multipart/signed" messages, so From tim_one@users.sourceforge.net Wed Nov 13 05:29:15 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 21:29:15 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv4228 Modified Files: train.py Log Message: train_message(): When rescoring was asked for, it had no visible effect, since the probabilities didn't get updated after training. So update the probs before rescoring. 
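To illustrate the log message above, here is a minimal sketch, assuming a toy classifier that caches per-token probabilities separately from its raw counts. The ToyBayes class and everything in it are hypothetical stand-ins written in modern Python, not the Spambayes classifier; only the method name update_probabilities() is taken from the commit.

```python
class ToyBayes:
    """Hypothetical stand-in: counts are updated by learn(), but the
    cached probabilities only change in update_probabilities()."""

    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}
        self.probs = {}  # cache consulted by score()

    def learn(self, tokens, is_spam):
        # Training touches the raw counts only, not the cache.
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in set(tokens):
            counts[tok] = counts.get(tok, 0) + 1

    def update_probabilities(self):
        # Recompute the cached spam probability of every known token.
        for tok in set(self.spam_counts) | set(self.ham_counts):
            s = self.spam_counts.get(tok, 0)
            h = self.ham_counts.get(tok, 0)
            self.probs[tok] = s / (s + h)

    def score(self, tokens):
        # Average the cached probabilities; unknown tokens are ignored.
        known = [self.probs[t] for t in set(tokens) if t in self.probs]
        if not known:
            return 0.5
        return sum(known) / len(known)


bayes = ToyBayes()
msg = ["cheap", "pills"]

before = bayes.score(msg)      # nothing learned yet
bayes.learn(msg, is_spam=True)
stale = bayes.score(msg)       # rescoring without refreshing the cache
bayes.update_probabilities()
fresh = bayes.score(msg)       # rescoring after the refresh
```

In this sketch the stale score equals the score from before training, which is the "no visible effect" the log message describes. Only the score taken after update_probabilities() reflects the new training data.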
Index: train.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** train.py 7 Nov 2002 22:30:09 -0000 1.16 --- train.py 13 Nov 2002 05:29:10 -0000 1.17 *************** *** 26,30 **** return spam == True ! def train_message(msg, is_spam, mgr, rescore = False): # Train an individual message. # Returns True if newly added (message will be correctly --- 26,30 ---- return spam == True ! def train_message(msg, is_spam, mgr, rescore=False): # Train an individual message. # Returns True if newly added (message will be correctly *************** *** 54,57 **** --- 54,58 ---- if rescore: import filter + mgr.bayes.update_probabilities() # else rescoring gives the same score filter.filter_message(msg, mgr, all_actions = False) From tim_one@users.sourceforge.net Wed Nov 13 06:25:10 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 22:25:10 -0800 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.67,1.68 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2039a Modified Files: tokenizer.py Log Message: More refinements of address-header tokenization. In particular, it now generates "no real name" log-count tokens, which are strong spam clues in my data. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.67 retrieving revision 1.68 diff -C2 -d -r1.67 -r1.68 *** tokenizer.py 12 Nov 2002 23:16:04 -0000 1.67 --- tokenizer.py 13 Nov 2002 06:25:08 -0000 1.68 *************** *** 1081,1097 **** if not addrlist: yield field + ":none" ! for addrs in addrlist: ! for rname,ename in email.Utils.getaddresses([addrs]): ! if rname: ! for rname,rcharset in email.Header.decode_header(rname): ! for w in rname.lower().split(): ! for t in tokenize_word(w): !
yield field+'realname:'+t ! if rcharset is not None: ! yield field+'charset:'+rcharset ! if ename: ! for w in ename.lower().split('@'): ! for t in tokenize_word(w): ! yield field+'email:'+t # To: # Cc: --- 1081,1105 ---- if not addrlist: yield field + ":none" ! continue ! ! noname_count = 0 ! for name, addr in email.Utils.getaddresses(addrlist): ! if name: ! for name, charset in email.Header.decode_header(name): ! yield "%s:name:%s" % (field, name.lower()) ! if charset is not None: ! yield "%s:charset:%s" % (field, charset) ! else: ! noname_count += 1 ! if addr: ! for w in addr.lower().split('@'): ! yield "%s:addr:%s" % (field, w) ! else: ! yield field + ":addr:none" ! ! if noname_count: ! yield "%s:no real name:2**%d" % (field, ! round(log2(noname_count))) ! # To: # Cc: From mhammond@skippinet.com.au Wed Nov 13 07:01:59 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 13 Nov 2002 18:01:59 +1100 Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17 In-Reply-To: Message-ID: > Log Message: > train_message(): When rescoring was asked for, it had no visible > effect, since the probabilities didn't get updated after training. > So update the probs before rescoring. I'm a little confused about these probabilities. Isn't it true that whenever we do a "train operation", we should also update the probabilities? For a batch train, we only want to do it at the end, but for an individual, incremental train, I would have thought we still want the probabilities updated, even if we don't rescore the message. Otherwise future messages will not use the new probabilities. I ask because revision 1.14 did exactly this, and we regressed it. 
That revision was: diff -r1.13 -r1.14 21c21 < def train_message(msg, is_spam, mgr, update_probs = True): --- > def train_message(msg, is_spam, mgr): 43,45d42 < if update_probs: < mgr.bayes.update_probabilities() < 56c53 < if train_message(message, isspam, mgr, False): --- > if train_message(message, isspam, mgr): And it seems to me that a new param, specifically for update_probs, is less of a hack than tieing it to the "rescore" param - we want the new probs used for the *next* incoming message even if we don't need it for *this* message. Mark. From tim_one@users.sourceforge.net Wed Nov 13 06:59:27 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 12 Nov 2002 22:59:27 -0800 Subject: [Spambayes-checkins] spambayes/Outlook2000 default_bayes_customize.ini,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes/Outlook2000 In directory usw-pr-cvs1:/tmp/cvs-serv19210/Outlook2000 Modified Files: default_bayes_customize.ini Log Message: Enable more address-header tokenization than the default. This should help any personal email classifier. I recommend a full retrain to get the most benefit. Index: default_bayes_customize.ini =================================================================== RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** default_bayes_customize.ini 4 Nov 2002 23:21:43 -0000 1.5 --- default_bayes_customize.ini 13 Nov 2002 06:59:24 -0000 1.6 *************** *** 17,20 **** --- 17,26 ---- record_header_absence: True + # These should help. All but "from" are disabled by default, because + # they're killer-good clues for bad reasons when using mixed-source + # data. 
+ address_headers: from to cc sender reply-to + + [Classifier] # Uncomment the next lines if you want to use the former default for From tim.one@comcast.net Wed Nov 13 07:18:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 13 Nov 2002 02:18:45 -0500 Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17 In-Reply-To: Message-ID: [Mark Hammond] > I'm a little confused about these probabilities. > > Isn't it true that whenever we do a "train operation", we should > also update the probabilities? It's a tradeoff. The bigger the database, the longer update_probabilities() takes. If the user is staring at a specific msg, and expects to see its score change, then the probs *have* to be updated or the score won't change. So that was a very clear reason to force updating here. I didn't know why the probs weren't being updated anyway, so fixed the one thing that was unarguably buggy. > For a batch train, we only want to do it at the end, but for an > individual, incremental train, I would have thought we still want the > probabilities updated, even if we don't rescore the message. Otherwise > future messages will not use the new probabilities. That's so. I haven't worried about it, perhaps because I run on Win9x most of the time so live with frequent reboots (i.e., I retrain from scratch several times every day anyway, as incremental updates are lost when a forced reboot occurs; that's not *this* code's fault, although I eventually hope to get around to writing out the updated database whenever the probs get updated). > I ask because revision 1.14 did exactly this, and we regressed it. That's odd -- the CVS log says mhammond did that. > ... > And it seems to me that a new param, specifically for update_probs, is > less of a hack than tieing it to the "rescore" param - we want the > new probs used for the *next* incoming message even if we don't need > it for *this* message. It's still a tradeoff, though.
Once a classifier has gotten any amount of decent training, whether or not a new training msg gets reflected instantly in the probs should make little difference to results. If it's possible that update_probabilities() *never* gets called after training and before shutdown now, then that's clearly a bug. It's OK by me whatever you'd rather do here, and updating probs after training, without fail, is certainly the least error-prone strategy. From richiehindle@users.sourceforge.net Wed Nov 13 18:13:46 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Wed, 13 Nov 2002 10:13:46 -0800 Subject: [Spambayes-checkins] spambayes README.txt,1.41,1.42 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv14506 Modified Files: README.txt Log Message: Added a note about the web interface implemented by pop3proxy.py. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** README.txt 7 Nov 2002 22:30:02 -0000 1.41 --- README.txt 13 Nov 2002 18:13:43 -0000 1.42 *************** *** 74,77 **** --- 74,82 ---- delivery system. + Also acts as a web server providing a user interface that allows you + to train the classifier, classify messages interactively, and query + the token database. This piece will at some point be split out into + a separate module. + neiltrain.py Builds a CDB (constant database) file of word probabilities using From richiehindle@users.sourceforge.net Wed Nov 13 18:14:34 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Wed, 13 Nov 2002 10:14:34 -0800 Subject: [Spambayes-checkins] spambayes Options.py,1.69,1.70 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15336 Modified Files: Options.py Log Message: Added options for pop3proxy.py, so you don't need a huge command line. 
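As a rough illustration of how the options added by this commit (listed in the diff that follows) could be read, here is a minimal sketch that uses the standard-library configparser module from modern Python. This is an assumption made for illustration: Options.py itself converts values with its own "cracker" functions, and the load_proxy_settings() helper and the sample server name are hypothetical.

```python
import configparser

# Sample settings using the option names from the [pop3proxy] and
# [html_ui] sections; the server name is a placeholder value.
SAMPLE_INI = """
[pop3proxy]
pop3proxy_server_name: pop3.my-isp.com
pop3proxy_server_port: 110
pop3proxy_port: 110
pop3proxy_cache_use_gzip: True
pop3proxy_cache_expiry_days: 7

[html_ui]
html_ui_port: 8880
html_ui_launch_browser: False
"""


def load_proxy_settings(ini_text):
    """Hypothetical helper: parse the two sections into typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    proxy = parser["pop3proxy"]
    ui = parser["html_ui"]
    return {
        "server_name": proxy.get("pop3proxy_server_name"),
        "server_port": proxy.getint("pop3proxy_server_port"),
        "proxy_port": proxy.getint("pop3proxy_port"),
        "cache_use_gzip": proxy.getboolean("pop3proxy_cache_use_gzip"),
        "cache_expiry_days": proxy.getint("pop3proxy_cache_expiry_days"),
        "ui_port": ui.getint("html_ui_port"),
        "ui_launch_browser": ui.getboolean("html_ui_launch_browser"),
    }


settings = load_proxy_settings(SAMPLE_INI)
```

The typed getters play the role that string_cracker, int_cracker and boolean_cracker play in Options.py: each option is stored as text and converted on the way out.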
Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.69 retrieving revision 1.70 diff -C2 -d -r1.69 -r1.70 *** Options.py 12 Nov 2002 06:21:38 -0000 1.69 --- Options.py 13 Nov 2002 18:14:32 -0000 1.70 *************** *** 339,342 **** --- 339,357 ---- # database by default. persistent_use_database: False + + [pop3proxy] + # pop3proxy settings - pop3proxy also respects the options in the Hammie + # section, with the exception of the extra header details at the moment. + # The only mandatory option is pop3proxy_server_name, eg. pop3.my-isp.com, + # but that can come from the command line - see "pop3proxy -h". + pop3proxy_server_name: "" + pop3proxy_server_port: 110 + pop3proxy_port: 110 + pop3proxy_cache_use_gzip: True + pop3proxy_cache_expiry_days: 7 + + [html_ui] + html_ui_port: 8880 + html_ui_launch_browser: False """ *************** *** 408,412 **** 'hammie_debug_header_name': string_cracker, }, ! } --- 423,435 ---- 'hammie_debug_header_name': string_cracker, }, ! 'pop3proxy': {'pop3proxy_server_name': string_cracker, ! 'pop3proxy_server_port': int_cracker, ! 'pop3proxy_port': int_cracker, ! 'pop3proxy_cache_use_gzip': boolean_cracker, ! 'pop3proxy_cache_expiry_days': int_cracker, ! }, ! 'html_ui': {'html_ui_port': int_cracker, ! 'html_ui_launch_browser': boolean_cracker, ! }, } From richiehindle@users.sourceforge.net Wed Nov 13 18:19:48 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Wed, 13 Nov 2002 10:19:48 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20474 Modified Files: pop3proxy.py Log Message: o All command line switches and options now default to values from bayescustomize.ini. Thanks to Francois Granger for the idea. 
o Instead of there being two radio buttons (ham, spam) on the training form, there are now two buttons: "Train as Ham" and "Train as Spam". Thanks to Just van Rossum for the suggestion. o "Classify message" form - paste or upload a message for classification. Gives you the spam probability and the clues. o It now gives a decent error if the POP3 server is unreachable. o The "Bad file descriptor" / last-response-is-logged-three-times bug is (hopefully) fixed. o The bug whereby socket errors could cause the "Active POP3 conversations" count to go negative is fixed. o After doing a word query, it now prepopulates the query field with your word - handy if you mistyped it or you want to try a variant. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** pop3proxy.py 10 Nov 2002 19:59:22 -0000 1.14 --- pop3proxy.py 13 Nov 2002 18:19:45 -0000 1.15 *************** *** 7,11 **** header. Usage: ! pop3proxy.py [options] [] is the name of your real POP3 server is the port number of your real POP3 server, which --- 7,11 ---- header. Usage: ! pop3proxy.py [options] [ []] is the name of your real POP3 server is the port number of your real POP3 server, which *************** *** 13,16 **** --- 13,20 ---- options: + -z : Runs a self-test and exits. + -t : Runs a test POP3 server on port 8110 (for testing). + -h : Displays this help message. + -p FILE : use the named data file -d : the file is a DBM file rather than a pickle *************** *** 20,28 **** -b : Launch a web browser showing the user interface. ! pop3proxy -t ! Runs a test POP3 server on port 8110; useful for testing. ! ! pop3proxy -h ! Displays this help message. For safety, and to help debugging, the whole POP3 conversation is --- 24,30 ---- -b : Launch a web browser showing the user interface. ! 
All command line arguments and switches take their default ! values from the [Hammie], [pop3proxy] and [html_ui] sections ! of bayescustomize.ini. For safety, and to help debugging, the whole POP3 conversation is *************** *** 48,72 **** todo = """ ! o (Re)training interface - one message per line, quick-rendering table. ! o Slightly-wordy index page; intro paragraph for each page. o Once the training stuff is on a separate page, make the paste box bigger. - o "Links" section (on homepage?) to project homepage, mailing list, - etc. - o "Home" link (with helmet!) at the end of each page. - o "Classify this" - just like Train. - o "Send me an email every [...] to remind me to train on new - messages." - o "Send me a status email every [...] telling how many mails have been - classified, etc." o Deployment: Windows executable? atlaxwin and ctypes? Or just webbrowser? - o Possibly integrate Tim Stone's SMTP code - make it use async, make - the training code update (rather than replace!) the database. o Can it cleanly dynamically update its status display while having a POP3 converation? Hammering reload sucks. o Add a command to save the database without shutting down, and one to reload the database. ! o Leave the word in the input field after a Word query. """ --- 50,103 ---- todo = """ ! ! User interface improvements: ! o Once the training stuff is on a separate page, make the paste box bigger. o Deployment: Windows executable? atlaxwin and ctypes? Or just webbrowser? o Can it cleanly dynamically update its status display while having a POP3 converation? Hammering reload sucks. o Add a command to save the database without shutting down, and one to reload the database. ! o Save the Status (num classified, etc.) between sessions. ! ! ! New features: ! ! o (Re)training interface - one message per line, quick-rendering table. ! o "Send me an email every [...] to remind me to train on new ! messages." ! o "Send me a status email every [...] 
telling how many mails have been ! classified, etc." ! o Possibly integrate Tim Stone's SMTP code - make it use async, make ! the training code update (rather than replace!) the database. ! o Option to keep trained messages and view potential FPs and FNs to ! correct them. ! o Allow use of the UI without the POP3 proxy. ! ! ! Code quality: ! ! o Move the UI into its own module. ! o Eventually, pull the common HTTP code from pop3proxy.py and Entrian ! Debugger into a library. ! ! ! Info: ! ! o Slightly-wordy index page; intro paragraph for each page. ! o In both stats and training results, report nham and nspam - warn if ! they're very different (for some value of 'very'). ! o "Links" section (on homepage?) to project homepage, mailing list, ! etc. ! ! ! Gimmicks: ! ! o Classify a web page given a URL. ! o Graphs. Of something. Who cares what? ! o Zoe...! ! """ *************** *** 147,151 **** self.set_terminator('\r\n') self.create_socket(socket.AF_INET, socket.SOCK_STREAM) ! self.connect((serverName, serverPort)) def collect_incoming_data(self, data): --- 178,188 ---- self.set_terminator('\r\n') self.create_socket(socket.AF_INET, socket.SOCK_STREAM) ! try: ! self.connect((serverName, serverPort)) ! except socket.error, e: ! print >>sys.stderr, "Can't connect to %s:%d: %s" % \ ! (serverName, serverPort, e) ! self.close() ! self.lineCallback('') # "The socket's been closed." def collect_incoming_data(self, data): *************** *** 199,203 **** self.response = self.response + line ! # Is this line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] --- 236,240 ---- self.response = self.response + line ! # Is this the line that terminates a set of headers? self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n'] *************** *** 237,241 **** else: # Assume that an unknown command will get a single-line ! # response. This should work for errors and for POP-AUTH. 
return False --- 274,281 ---- else: # Assume that an unknown command will get a single-line ! # response. This should work for errors and for POP-AUTH, ! # and is harmless even for multiline responses - the first ! # line will be passed to onTransaction and ignored, then the ! # rest will be proxied straight through. return False *************** *** 246,257 **** def found_terminator(self): """Asynchat override.""" ! if self.request.strip().upper() == 'KILL': ! self.serverSocket.sendall('QUIT\r\n') ! self.send("+OK, dying.\r\n") ! self.serverSocket.shutdown(2) ! self.serverSocket.close() self.shutdown(2) self.close() raise SystemExit self.serverSocket.push(self.request + '\r\n') --- 286,298 ---- def found_terminator(self): """Asynchat override.""" ! verb = self.request.strip().upper() ! if verb == 'KILL': self.shutdown(2) self.close() raise SystemExit + elif verb == 'CRASH': + # For testing + x = 0 + y = 1/x self.serverSocket.push(self.request + '\r\n') *************** *** 271,276 **** # Pass the request and the raw response to the subclass and # send back the cooked response. ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) # If onServerLine() decided that the server has closed its --- 312,318 ---- # Pass the request and the raw response to the subclass and # send back the cooked response. ! if self.response: ! cooked = self.onTransaction(self.command, self.args, self.response) ! self.push(cooked) # If onServerLine() decided that the server has closed its *************** *** 334,337 **** --- 376,380 ---- status.totalSessions += 1 status.activeSessions += 1 + self.isClosed = False def send(self, data): *************** *** 339,343 **** self.logFile.write(data) self.logFile.flush() ! return POP3ProxyBase.send(self, data) def recv(self, size): --- 382,392 ---- self.logFile.write(data) self.logFile.flush() ! try: ! return POP3ProxyBase.send(self, data) ! except socket.error: ! 
# The email client has closed the connection - 40tude Dialog ! # does this immediately after issuing a QUIT command, ! # without waiting for the response. ! self.close() def recv(self, size): *************** *** 349,354 **** def close(self): ! status.activeSessions -= 1 ! POP3ProxyBase.close(self) def onTransaction(self, command, args, response): --- 398,406 ---- def close(self): ! # This can be called multiple times by async. ! if not self.isClosed: ! self.isClosed = True ! status.activeSessions -= 1 ! POP3ProxyBase.close(self) def onTransaction(self, command, args, response): *************** *** 442,448 **** UserInterface objects to serve them.""" ! def __init__(self, uiPort, bayes): uiArgs = (bayes,) ! Listener.__init__(self, uiPort, UserInterface, uiArgs) --- 494,500 ---- UserInterface objects to serve them.""" ! def __init__(self, uiPort, bayes, socketMap=asyncore.socket_map): uiArgs = (bayes,) ! Listener.__init__(self, uiPort, UserInterface, uiArgs, socketMap=socketMap) *************** *** 479,485 **** """Serves the HTML user interface of the proxy.""" header = """Spambayes proxy: %s ]{0,256} # search for the end '>', but don't run wild ! ) > """, re.VERBOSE | re.DOTALL) --- 611,625 ---- msg.walk())) has_highbit_char = re.compile(r"[\x80-\xff]").search # Cheap-ass gimmick to probabilistically find HTML/XML tags. + # Note that ").search) ! ! crack_html_style = StyleStripper().analyze ! ! # Nuke HTML comments. ! ! class CommentStripper(Stripper): ! def __init__(self): ! Stripper.__init__(self, re.compile(r"").search) ! ! crack_html_comment = CommentStripper().analyze # Scan HTML for constructs often seen in viruses and worms. *************** *** 1232,1251 **** text = text.lower() - # Get rid of uuencoded sections. - text, tokens = crack_uuencode(text) - for t in tokens: - yield t - if options.replace_nonascii_chars: # Replace high-bit chars and control chars with '?'. text = text.translate(non_ascii_translate_tab) - # Special tagging of embedded URLs. 
- text, tokens = crack_urls(text) - for t in tokens: - yield t - for t in find_html_virus_clues(text): yield "virus:%s" % t # Remove HTML/XML tags. Also  . --- 1268,1287 ---- text = text.lower() if options.replace_nonascii_chars: # Replace high-bit chars and control chars with '?'. text = text.translate(non_ascii_translate_tab) for t in find_html_virus_clues(text): yield "virus:%s" % t + + # Get rid of uuencoded sections, embedded URLs, *************** *** 664,671 **** reviewHeader = """

    These are untrained emails, which you can use to ! train the classifier. Check the Discard / Defer / Ham / ! Spam buttton for each email, then click 'Train' below. ! (Defer leaves the message here, to be trained on ! later.)

    --- 665,673 ---- reviewHeader = """

    These are untrained emails, which you can use to ! train the classifier. Check the appropriate buttton for ! each email, then click 'Train' below. 'Defer' leaves the ! message here, to be trained on later. Click one of the ! Discard / Defer / Ham / Spam headers to check all of the ! buttons in that section in one go.

    *************** *** 684,690 **** """ ! reviewSubheader = """
    ! ! """ upload = """ ! function onHeader(type, switchTo) ! { ! if (document.forms && document.forms.length >= 2) ! { ! form = document.forms[1]; ! for (i = 0; i < form.length; i++) ! { ! splitName = form[i].name.split(':'); ! if (splitName.length == 3 && splitName[1] == type && ! form[i].value == switchTo.toLowerCase()) ! { ! form[i].checked = true; ! } ! } ! } ! } ! ! """ ! ! reviewSubheader = \ ! """ ! ! """ upload = """ limit: field = field[:limit-3] + "..." --- 862,871 ---- def trimAndQuote(self, field, limit, quote=False): """Trims a string, adding an ellipsis if necessary, and ! HTML-quotes it. Also pumps it through email.Header.decode_header, ! which understands charset sections in email headers - I suspect ! this will only work for Latin character sets, but hey, it works for ! Francois Granger's name. 8-)""" ! sections = email.Header.decode_header(field) ! field = ' '.join([text for text, _ in sections]) if len(field) > limit: field = field[:limit-3] + "..." *************** *** 970,980 **** return keys, date, prior, start, end ! def appendMessages(self, lines, keyedMessages, judgement): """Appends the lines of a table of messages to 'lines'.""" buttons = \ ! """  !   !   ! """ stripe = 0 for key, message in keyedMessages: --- 1004,1014 ---- return keys, date, prior, start, end ! def appendMessages(self, lines, keyedMessages, label): """Appends the lines of a table of messages to 'lines'.""" buttons = \ ! """  !   !   ! """ stripe = 0 for key, message in keyedMessages: *************** *** 1002,1013 **** # Output the table row for this message. defer = ham = spam = "" ! if judgement == options.header_spam_string: spam='checked' ! elif judgement == options.header_ham_string: ham='checked' ! elif judgement == options.header_unsure_string: defer='checked' subject = "%s" % (text, subject) ! 
radioGroup = buttons % (key, key, defer, key, ham, key, spam) stripeClass = ['stripe_on', 'stripe_off'][stripe] lines.append(""" --- 1036,1050 ---- # Output the table row for this message. defer = ham = spam = "" ! if label == 'Spam': spam='checked' ! elif label == 'Ham': ham='checked' ! elif label == 'Unsure': defer='checked' subject = "%s" % (text, subject) ! radioGroup = buttons % (label, key, ! label, key, defer, ! label, key, ham, ! label, key, spam) stripeClass = ['stripe_on', 'stripe_off'][stripe] lines.append(""" *************** *** 1024,1028 **** for key, value in params.items(): if key.startswith('classify:'): ! id = key.split(':', 1)[1] if value == 'spam': targetCorpus = state.spamCorpus --- 1061,1065 ---- for key, value in params.items(): if key.startswith('classify:'): ! id = key.split(':')[2] if value == 'spam': targetCorpus = state.spamCorpus *************** *** 1103,1114 **** if not next: nextState = 'disabled' ! lines = [self.reviewHeader % (prior, next, priorState, nextState)] ! for header, type in ((options.header_spam_string, 'Spam'), ! (options.header_ham_string, 'Ham'), ! (options.header_unsure_string, 'Unsure')): if keyedMessages[header]: lines.append("") ! lines.append(self.reviewSubheader % type) ! self.appendMessages(lines, keyedMessages[header], header) lines.append("""") ! lines.append(self.reviewSubheader % ! (label, label, label, label, label)) ! self.appendMessages(lines, keyedMessages[header], label) lines.append(""" ! !
    %s
    Messages classified as %s:From:Discard / Defer / Ham / Spam
    Messages classified as %s:From: ! Discard / ! Defer / ! Ham / ! Spam !
    %s%s
    %s%s
     
     
    --- 1140,1153 ---- if not next: nextState = 'disabled' ! lines = [self.onReviewHeader, ! self.reviewHeader % (prior, next, priorState, nextState)] ! for header, label in ((options.header_spam_string, 'Spam'), ! (options.header_ham_string, 'Ham'), ! (options.header_unsure_string, 'Unsure')): if keyedMessages[header]: lines.append("
     
     
    From npickett@users.sourceforge.net Wed Nov 27 22:38:00 2002 From: npickett@users.sourceforge.net (Neale Pickett) Date: Wed, 27 Nov 2002 14:38:00 -0800 Subject: [Spambayes-checkins] spambayes classifier.py,1.60,1.61 hammie.py,1.43,1.44 storage.py,1.2,1.3 dbdict.py,1.4,NONE Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv31393 Modified Files: classifier.py hammie.py storage.py Removed Files: dbdict.py Log Message: * Caching dbdict implementation. You'll have to retrain your databases again (sorry) Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.60 retrieving revision 1.61 diff -C2 -d -r1.60 -r1.61 *** classifier.py 26 Nov 2002 20:22:05 -0000 1.60 --- classifier.py 27 Nov 2002 22:37:55 -0000 1.61 *************** *** 47,75 **** LN2 = math.log(2) # used frequently by chi-combining ! PICKLE_VERSION = 4 ! ! class MetaInfo(object): ! """Information about the corpora. ! ! Contains nham and nspam, used for calculating probabilities. ! ! """ ! def __init__(self): ! self.__setstate__((PICKLE_VERSION, 0, 0)) ! ! def __repr__(self): ! return "MetaInfo%r" % repr((self._nspam, ! self._nham, ! self.revision)) ! ! def __getstate__(self): ! return (PICKLE_VERSION, self.nspam, self.nham) ! ! def __setstate__(self, t): ! if t[0] != PICKLE_VERSION: ! raise ValueError("Can't unpickle -- version %s unknown" % t[0]) ! self.nspam, self.nham = t[1:] ! self.revision = 0 ! class WordInfo(object): --- 47,51 ---- LN2 = math.log(2) # used frequently by chi-combining ! PICKLE_VERSION = 5 class WordInfo(object): *************** *** 109,138 **** def __init__(self): self.wordinfo = {} - self.meta = MetaInfo() self.probcache = {} def __getstate__(self): ! return PICKLE_VERSION, self.wordinfo, self.meta def __setstate__(self, t): if t[0] != PICKLE_VERSION: raise ValueError("Can't unpickle -- version %s unknown" % t[0]) ! 
self.wordinfo, self.meta = t[1:] self.probcache = {} - # Slacker's way out--pass calls to nham/nspam up to the meta class - - def get_nham(self): - return self.meta.nham - def set_nham(self, val): - self.meta.nham = val - nham = property(get_nham, set_nham) - - def get_nspam(self): - return self.meta.nspam - def set_nspam(self, val): - self.meta.nspam = val - nspam = property(get_nspam, set_nspam) - # spamprob() implementations. One of the following is aliased to # spamprob, depending on option settings. --- 85,100 ---- def __init__(self): self.wordinfo = {} self.probcache = {} + self.nspam = self.nham = 0 def __getstate__(self): ! return (PICKLE_VERSION, self.wordinfo, self.nspam, self.nham) def __setstate__(self, t): if t[0] != PICKLE_VERSION: raise ValueError("Can't unpickle -- version %s unknown" % t[0]) ! (self.wordinfo, self.nspam, self.nham) = t[1:] self.probcache = {} # spamprob() implementations. One of the following is aliased to # spamprob, depending on option settings. *************** *** 331,336 **** pass ! nham = float(self.meta.nham or 1) ! nspam = float(self.meta.nspam or 1) assert hamcount <= nham --- 293,298 ---- pass ! nham = float(self.nham or 1) ! nspam = float(self.nspam or 1) assert hamcount <= nham *************** *** 420,431 **** self.probcache = {} # nuke the prob cache if is_spam: ! self.meta.nspam += 1 else: ! self.meta.nham += 1 - wordinfo = self.wordinfo - wordinfoget = wordinfo.get for word in Set(wordstream): ! record = wordinfoget(word) if record is None: record = self.WordInfoClass() --- 382,391 ---- self.probcache = {} # nuke the prob cache if is_spam: ! self.nspam += 1 else: ! self.nham += 1 for word in Set(wordstream): ! record = self._wordinfoget(word) if record is None: record = self.WordInfoClass() *************** *** 436,441 **** record.hamcount += 1 ! # Needed to tell a persistent DB that the content changed. ! wordinfo[word] = record --- 396,400 ---- record.hamcount += 1 ! 
self._wordinfoset(word, record) *************** *** 443,458 **** self.probcache = {} # nuke the prob cache if is_spam: ! if self.meta.nspam <= 0: raise ValueError("spam count would go negative!") ! self.meta.nspam -= 1 else: ! if self.meta.nham <= 0: raise ValueError("non-spam count would go negative!") ! self.meta.nham -= -1 - wordinfo = self.wordinfo - wordinfoget = wordinfo.get for word in Set(wordstream): ! record = wordinfoget(word) if record is not None: if is_spam: --- 402,415 ---- self.probcache = {} # nuke the prob cache if is_spam: ! if self.nspam <= 0: raise ValueError("spam count would go negative!") ! self.nspam -= 1 else: ! if self.nham <= 0: raise ValueError("non-spam count would go negative!") ! self.nham -= -1 for word in Set(wordstream): ! record = self._wordinfoget(word) if record is not None: if is_spam: *************** *** 463,471 **** record.hamcount -= 1 if record.hamcount == 0 == record.spamcount: ! del wordinfo[word] else: ! # Needed to tell a persistent DB that the content ! # changed. ! wordinfo[word] = record def _getclues(self, wordstream): --- 420,426 ---- record.hamcount -= 1 if record.hamcount == 0 == record.spamcount: ! self._wordinfodel(word) else: ! self._wordinfoset(word, record) def _getclues(self, wordstream): *************** *** 476,482 **** pushclue = clues.append - wordinfoget = self.wordinfo.get for word in Set(wordstream): ! record = wordinfoget(word) if record is None: prob = unknown --- 431,436 ---- pushclue = clues.append for word in Set(wordstream): ! record = self._wordinfoget(word) if record is None: prob = unknown *************** *** 492,495 **** --- 446,459 ---- # Return (prob, word, record). 
return [t[1:] for t in clues] + + def _wordinfoget(self, word): + return self.wordinfo.get(word) + + def _wordinfoset(self, word, record): + self.wordinfo[word] = record + + def _wordinfodel(self, word): + del self.wordinfo[word] + Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** hammie.py 25 Nov 2002 20:49:17 -0000 1.43 --- hammie.py 27 Nov 2002 22:37:56 -0000 1.44 *************** *** 2,6 **** - import dbdict import mboxutils import storage --- 2,5 ---- *************** *** 45,49 **** for word, prob in clues if (word[0] == '*' or ! prob <= SHOWCLUE or prob >= 1.0 - SHOWCLUE)]) def score(self, msg, evidence=False): --- 44,49 ---- for word, prob in clues if (word[0] == '*' or ! prob <= options.clue_mailheader_cutoff or ! prob >= 1.0 - options.clue_mailheader_cutoff)]) def score(self, msg, evidence=False): Index: storage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/storage.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** storage.py 26 Nov 2002 00:43:51 -0000 1.2 --- storage.py 27 Nov 2002 22:37:56 -0000 1.3 *************** *** 5,9 **** Classes: PickledClassifier - Classifier that uses a pickle db ! DBDictClassifier - Classifier that uses a DBDict db Trainer - Classifier training observer SpamTrainer - Trainer for spam --- 5,9 ---- Classes: PickledClassifier - Classifier that uses a pickle db ! DBDictClassifier - Classifier that uses a DBM db Trainer - Classifier training observer SpamTrainer - Trainer for spam *************** *** 18,23 **** databases. ! DBDictClassifier is a Classifier class that uses a DBDict ! datastore. Trainer is concrete class that observes a Corpus and trains a --- 18,23 ---- databases. ! DBDictClassifier is a Classifier class that uses a database ! store. 
Trainer is concrete class that observes a Corpus and trains a *************** *** 50,55 **** from Options import options import cPickle as pickle - import dbdict import errno PICKLE_TYPE = 1 --- 50,55 ---- from Options import options import cPickle as pickle import errno + import shelve PICKLE_TYPE = 1 *************** *** 84,91 **** fp.close() if tempbayes: self.wordinfo = tempbayes.wordinfo ! self.meta.nham = tempbayes.get_nham() ! self.meta.nspam = tempbayes.get_nspam() if options.verbose: --- 84,92 ---- fp.close() + # XXX: why not self.__setstate__(tempbayes.__getstate__())? if tempbayes: self.wordinfo = tempbayes.wordinfo ! self.nham = tempbayes.nham ! self.nspam = tempbayes.nspam if options.verbose: *************** *** 97,102 **** print self.db_name,'is a new pickle' self.wordinfo = {} ! self.meta.nham = 0 ! self.meta.nspam = 0 def store(self): --- 98,103 ---- print self.db_name,'is a new pickle' self.wordinfo = {} ! self.nham = 0 ! self.nspam = 0 def store(self): *************** *** 110,124 **** fp.close() - def __getstate__(self): - return PICKLE_TYPE, self.wordinfo, self.meta - - def __setstate__(self, t): - if t[0] != PICKLE_TYPE: - raise ValueError("Can't unpickle -- version %s unknown" % t[0]) - self.wordinfo, self.meta = t[1:] - class DBDictClassifier(classifier.Classifier): ! '''Classifier object persisted in a WIDict''' def __init__(self, db_name, mode='c'): --- 111,117 ---- fp.close() class DBDictClassifier(classifier.Classifier): ! '''Classifier object persisted in a caching database''' def __init__(self, db_name, mode='c'): *************** *** 126,129 **** --- 119,123 ---- classifier.Classifier.__init__(self) + self.wordcache = {} self.statekey = "saved state" self.mode = mode *************** *** 132,157 **** def load(self): ! '''Load state from WIDict''' if options.verbose: ! print 'Loading state from',self.db_name,'WIDict' ! self.wordinfo = dbdict.DBDict(self.db_name, self.mode, ! classifier.WordInfo,iterskip=[self.statekey]) ! 
if self.wordinfo.has_key(self.statekey): ! (nham, nspam) = self.wordinfo[self.statekey] ! self.set_nham(nham) ! self.set_nspam(nspam) if options.verbose: ! print '%s is an existing DBDict, with %d ham and %d spam' \ ! % (self.db_name, self.nham, self.nspam) else: ! # new dbdict if options.verbose: ! print self.db_name,'is a new DBDict' ! self.set_nham(0) ! self.set_nspam(0) def store(self): --- 126,152 ---- def load(self): ! '''Load state from database''' if options.verbose: ! print 'Loading state from',self.db_name,'database' ! self.db = shelve.DbfilenameShelf(self.db_name, self.mode) ! if self.db.has_key(self.statekey): ! t = self.db[self.statekey] ! if t[0] != classifier.PICKLE_VERSION: ! raise ValueError("Can't unpickle -- version %s unknown" % t[0]) ! (self.nspam, self.nham) = t[1:] if options.verbose: ! print '%s is an existing database, with %d spam and %d ham' \ ! % (self.db_name, self.nspam, self.nham) else: ! # new database if options.verbose: ! print self.db_name,'is a new database' ! self.nspam = 0 ! self.nham = 0 ! self.wordinfo = {} def store(self): *************** *** 159,166 **** if options.verbose: ! print 'Persisting',self.db_name,'state in WIDict' ! self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam()) ! self.wordinfo.sync() --- 154,186 ---- if options.verbose: ! print 'Persisting',self.db_name,'state in database' ! for key, val in self.wordinfo.iteritems(): ! if val == None: ! del self.wordinfo[key] ! try: ! del self.db[key] ! except KeyError: ! pass ! else: ! self.db[key] = val.__getstate__() ! self.db[self.statekey] = (classifier.PICKLE_VERSION, ! self.nspam, self.nham) ! self.db.sync() ! ! def _wordinfoget(self, word): ! ret = self.wordinfo.get(word) ! if not ret: ! r = self.db.get(word) ! if r: ! ret = self.WordInfoClass() ! ret.__setstate__(r) ! self.wordinfo[word] = ret ! return ret ! ! # _wordinfoset is the same ! ! def _wordinfodel(self, word): ! 
self.wordinfo[word] = None --- dbdict.py DELETED --- From timstone4@users.sourceforge.net Wed Nov 27 23:04:17 2002 From: timstone4@users.sourceforge.net (Tim Stone) Date: Wed, 27 Nov 2002 15:04:17 -0800 Subject: [Spambayes-checkins] spambayes storage.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv10926 Modified Files: storage.py Log Message: Fixed a couple of comments Index: storage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/storage.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** storage.py 27 Nov 2002 22:37:56 -0000 1.3 --- storage.py 27 Nov 2002 23:04:14 -0000 1.4 *************** *** 5,9 **** Classes: PickledClassifier - Classifier that uses a pickle db ! DBDictClassifier - Classifier that uses a DBM db Trainer - Classifier training observer SpamTrainer - Trainer for spam --- 5,9 ---- Classes: PickledClassifier - Classifier that uses a pickle db ! DBDictClassifier - Classifier that uses a shelve db Trainer - Classifier training observer SpamTrainer - Trainer for spam *************** *** 43,49 **** # Foundation license. ! __author__ = "Tim Stone " ! __credits__ = "Richie Hindle, Tim Peters, Neale Pickett, \ ! all the spambayes contributors." import classifier --- 43,49 ---- # Foundation license. ! __author__ = "Neale Pickett , \ ! Tim Stone " ! __credits__ = "All the spambayes contributors." import classifier From Paul.Moore@atosorigin.com Thu Nov 28 09:26:34 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Thu, 28 Nov 2002 09:26:34 -0000 Subject: [Spambayes-checkins] spambayes classifier.py,1.60,1.61 hammie.py,1.43,1.44 storage.py,1.2,1.3 dbdict.py,1.4,NONE Message-ID: <16E1010E4581B049ABC51D4975CEDB8861995C@UKDCX001.uk.int.atosorigin.com> From: Neale Pickett [mailto:npickett@users.sourceforge.net] > + import shelve > ! self.wordinfo = dbdict.DBDict(self.db_name, self.mode, > !
classifier.WordInfo,iterskip=[self.statekey]) > ! self.db = shelve.DbfilenameShelf(self.db_name, self.mode) You do realise that shelve uses anydbm under the hood, making it susceptible to the same problems with Windows (only broken DBM or dumbdbm available) that the old version had - but with no obvious way of patching it up to allow customisation by the user? As I said, I use pickles now, so I no longer have a use case where Windows users would be using DBM format anyway, but there probably should be at least a warning in a comment somewhere... Paul From sjoerd@users.sourceforge.net Thu Nov 28 15:48:31 2002 From: sjoerd@users.sourceforge.net (Sjoerd Mullender) Date: Thu, 28 Nov 2002 07:48:31 -0800 Subject: [Spambayes-checkins] spambayes FileCorpus.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv1021 Modified Files: FileCorpus.py Log Message: Use double quotes for some triple-quoted strings that contain lonely single quotes. This makes XEmacs' fontification a whole lot happier. Index: FileCorpus.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/FileCorpus.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** FileCorpus.py 26 Nov 2002 00:43:51 -0000 1.6 --- FileCorpus.py 28 Nov 2002 15:48:29 -0000 1.7 *************** *** 1,5 **** #! /usr/bin/env python ! '''FileCorpus.py - Corpus composed of file system artifacts Classes: --- 1,5 ---- #! /usr/bin/env python ! """FileCorpus.py - Corpus composed of file system artifacts Classes: *************** *** 74,78 **** o Suggestions? ! ''' # This module is part of the spambayes project, which is Copyright 2002 --- 74,78 ---- o Suggestions? ! """ # This module is part of the spambayes project, which is Copyright 2002 *************** *** 572,576 **** def testmsg1(): !
return ''' X-Hd:skip@pobox.com Mon Nov 04 10:50:49 2002 Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31 --- 572,576 ---- def testmsg1(): ! return """ X-Hd:skip@pobox.com Mon Nov 04 10:50:49 2002 Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31 *************** *** 626,633 **** > - Tim ! www.fourstonesExpressions.com ''' def testmsg2(): ! return ''' X-Hd:richie@entrian.com Wed Nov 06 12:05:41 2002 Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31 --- 626,633 ---- > - Tim ! www.fourstonesExpressions.com """ def testmsg2(): ! return """ X-Hd:richie@entrian.com Wed Nov 06 12:05:41 2002 Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31 *************** *** 677,681 **** -- Richie Hindle ! richie@entrian.com''' if __name__ == '__main__': --- 677,681 ---- -- Richie Hindle ! richie@entrian.com""" if __name__ == '__main__': From richiehindle@users.sourceforge.net Thu Nov 28 16:10:49 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Thu, 28 Nov 2002 08:10:49 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.26,1.27 Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv15947 Modified Files: pop3proxy.py Log Message: o Fixed Tim Stone's hanging problem - "LIST 1" would hang because it thought that the response should be multiline (I don't like nested scopes 8-) o Don't allow the radio buttons headers in the training interface to word wrap. o When the POP3 server is unreachable, return an error to the email client as well as printing it to the console. 
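A note on the "LIST 1" hang in the log message above: in POP3, a bare LIST or UIDL is answered with a multi-line listing ended by a lone ".", while LIST or UIDL with a message number gets a single status line. A proxy that expects the terminator after "LIST 1" waits forever. The following is only a minimal sketch of that decision, written for current Python rather than the 2.2 syntax used in the diff. The function name is_multiline_response and the command table are hypothetical and not copied from pop3proxy.py.

```python
# Illustrative sketch only; names here are assumptions, not pop3proxy.py code.
MULTILINE_COMMANDS = ('RETR', 'TOP')   # always answered with a multi-line reply
LISTING_COMMANDS = ('LIST', 'UIDL')    # multi-line only when sent without arguments


def is_multiline_response(command, args):
    """Return True if the server's reply should end with a lone '.' line."""
    command = command.upper()
    if command in MULTILINE_COMMANDS:
        return True
    if command in LISTING_COMMANDS:
        # "LIST" lists every message; "LIST 1" returns one status line.
        return len(args) == 0
    # Anything else, including unknown commands, is assumed single-line.
    return False
```

With this rule, "LIST" is treated as multi-line and "LIST 1" as single-line, so the proxy stops reading after the first line of the reply to "LIST 1".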
Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** pop3proxy.py 27 Nov 2002 18:44:41 -0000 1.26 --- pop3proxy.py 28 Nov 2002 16:10:46 -0000 1.27 *************** *** 14,22 **** options: -z : Runs a self-test and exits. ! -t : Runs a test POP3 server on port 8110 (for testing). -h : Displays this help message. ! -p FILE : use the named data file ! -d : the file is a DBM file rather than a pickle -l port : proxy listens on this port number (default 110) -u port : User interface listens on this port number --- 14,22 ---- options: -z : Runs a self-test and exits. ! -t : Runs a fake POP3 server on port 8110 (for testing). -h : Displays this help message. ! -p FILE : use the named database file ! -d : the database is a DBM file rather than a pickle -l port : proxy listens on this port number (default 110) -u port : User interface listens on this port number *************** *** 25,30 **** All command line arguments and switches take their default ! values from the [Hammie], [pop3proxy] and [html_ui] sections ! of bayescustomize.ini. For safety, and to help debugging, the whole POP3 conversation is --- 25,30 ---- All command line arguments and switches take their default ! values from the [pop3proxy] and [html_ui] sections of ! bayescustomize.ini. For safety, and to help debugging, the whole POP3 conversation is *************** *** 40,44 **** __author__ = "Richie Hindle " ! __credits__ = "Tim Peters, Neale Pickett, all the spambayes contributors." try: --- 40,44 ---- __author__ = "Richie Hindle " ! __credits__ = "Tim Peters, Neale Pickett, Tim Stone, all the Spambayes folk." try: *************** *** 56,59 **** --- 56,61 ---- o Review already-trained messages, and purge them. o Put in a link to view a message (plain text, html, multipart...?) 
+ Include a Reply link that launches the registered email client, eg. + mailto:tim@fourstonesExpressions.com?subject=Re:%20pop3proxy&body=Hi%21%0D o Keyboard navigation (David Ascher). But aren't Tab and left/right arrow enough? *************** *** 130,133 **** --- 132,139 ---- take weeks over a modem - I've already had problems with clients timing out while the proxy was downloading stuff from the server). + + Adam's idea: add checkboxes to a Google results list for "Relevant" / + "Irrelevant", then submit that to build a search including the + highest-scoring tokens and excluding the lowest-scoring ones. """ *************** *** 214,221 **** self.connect((serverName, serverPort)) except socket.error, e: ! print >>sys.stderr, "Can't connect to %s:%d: %s" % \ ! (serverName, serverPort, e) ! self.close() self.lineCallback('') # "The socket's been closed." def collect_incoming_data(self, data): --- 220,228 ---- self.connect((serverName, serverPort)) except socket.error, e: ! error = "Can't connect to %s:%d: %s" % (serverName, serverPort, e) ! print >>sys.stderr, error ! self.lineCallback('-ERR %s\r\n' % error) self.lineCallback('') # "The socket's been closed." + self.close() def collect_incoming_data(self, data): *************** *** 304,308 **** return True elif self.command in ['LIST', 'UIDL']: ! return len(args) == 0 else: # Assume that an unknown command will get a single-line --- 311,315 ---- return True elif self.command in ['LIST', 'UIDL']: ! return len(self.args) == 0 else: # Assume that an unknown command will get a single-line *************** *** 710,714 **** """
    Messages classified as %s: From: Discard / Defer / --- 717,721 ---- """
    Messages classified as %s: From: Discard / Defer / From timstone4@users.sourceforge.net Thu Nov 28 16:35:59 2002 From: timstone4@users.sourceforge.net (Tim Stone) Date: Thu, 28 Nov 2002 08:35:59 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.27,1.28 Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv30105 Modified Files: pop3proxy.py Log Message: Changed startup messages to be a bit more informative. Made writing of log file dependent on options.verbose Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** pop3proxy.py 28 Nov 2002 16:10:46 -0000 1.27 --- pop3proxy.py 28 Nov 2002 16:35:57 -0000 1.28 *************** *** 29,33 **** For safety, and to help debugging, the whole POP3 conversation is ! written out to _pop3proxy.log for each run. To make rebuilding the database easier, uploaded messages are appended --- 29,33 ---- For safety, and to help debugging, the whole POP3 conversation is ! written out to _pop3proxy.log for each run, if options.verbose is True. To make rebuilding the database easier, uploaded messages are appended *************** *** 166,170 **** self.set_socket(s, socketMap) self.set_reuse_addr() ! print "%s listening on port %d." % (self.__class__.__name__, port) self.bind(('', port)) self.listen(5) --- 166,171 ---- self.set_socket(s, socketMap) self.set_reuse_addr() ! if options.verbose: ! print "%s listening on port %d." % (self.__class__.__name__, port) self.bind(('', port)) self.listen(5) *************** *** 389,392 **** --- 390,394 ---- proxyArgs = (serverName, serverPort) Listener.__init__(self, proxyPort, BayesProxy, proxyArgs) + print 'Listener on port %d is proxying %s:%d' % (proxyPort, serverName, serverPort) *************** *** 429,434 **** def send(self, data): """Logs the data to the log file.""" ! 
state.logFile.write(data) ! state.logFile.flush() try: return POP3ProxyBase.send(self, data) --- 431,437 ---- def send(self, data): """Logs the data to the log file.""" ! if options.verbose: ! state.logFile.write(data) ! state.logFile.flush() try: return POP3ProxyBase.send(self, data) *************** *** 442,447 **** """Logs the data to the log file.""" data = POP3ProxyBase.recv(self, size) ! state.logFile.write(data) ! state.logFile.flush() return data --- 445,451 ---- """Logs the data to the log file.""" data = POP3ProxyBase.recv(self, size) ! if options.verbose: ! state.logFile.write(data) ! state.logFile.flush() return data *************** *** 565,568 **** --- 569,573 ---- def __init__(self, uiPort, socketMap=asyncore.socket_map): Listener.__init__(self, uiPort, UserInterface, (), socketMap=socketMap) + print 'User interface url is http://localhost:%d' % (uiPort) *************** *** 1215,1219 **** __main__ code below.""" # Open the log file. ! self.logFile = open('_pop3proxy.log', 'wb', 0) # Load up the old proxy settings from Options.py / bayescustomize.ini --- 1220,1225 ---- __main__ code below.""" # Open the log file. ! if options.verbose: ! self.logFile = open('_pop3proxy.log', 'wb', 0) # Load up the old proxy settings from Options.py / bayescustomize.ini From richiehindle@users.sourceforge.net Thu Nov 28 17:05:00 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Thu, 28 Nov 2002 09:05:00 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.28,1.29 Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv15372 Modified Files: pop3proxy.py Log Message: Don't introduce module-level variables in the __main__ code, because they mask potential NameErrors later on. 
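To illustrate the log message above: a name assigned in the "if __name__ == '__main__':" block is a module global, so a function that mistakenly reads that name (instead of a parameter or attribute) silently picks up the global rather than failing. The len(args) versus len(self.args) fix in revision 1.27 looks like an instance of this. The sketch below is an assumption-laden demonstration for current Python. The helper, the two "main" styles, and the function outcome are all hypothetical, and exec on a fresh dictionary stands in for running a script.

```python
# Hypothetical helper with the bug: 'args' is never defined locally,
# so Python falls back to looking it up as a module global.
HELPER = '''
def is_multiline():
    return len(args) == 0
'''

# Style 1: argument handling directly at module level (the old __main__ block).
MODULE_LEVEL_MAIN = '''
args = ['--some-leftover-option']
result = is_multiline()
'''

# Style 2: the same code wrapped in run(), so 'args' is local to run().
FUNCTION_MAIN = '''
def run():
    args = ['--some-leftover-option']
    return is_multiline()

try:
    result = run()
except NameError as error:
    result = 'NameError: %s' % error
'''


def outcome(main_code):
    """Run the helper plus one style of main code in a fresh namespace."""
    namespace = {}
    exec(HELPER, namespace)
    exec(main_code, namespace)
    return namespace['result']
```

Calling outcome(MODULE_LEVEL_MAIN) returns a value computed from the unrelated global, so the bug stays hidden. Calling outcome(FUNCTION_MAIN) returns a string starting with "NameError", which is the early failure that moving the code into run() is meant to expose.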
Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** pop3proxy.py 28 Nov 2002 16:35:57 -0000 1.28 --- pop3proxy.py 28 Nov 2002 17:04:58 -0000 1.29 *************** *** 1572,1576 **** # =================================================================== ! if __name__ == '__main__': # Read the arguments. try: --- 1572,1576 ---- # =================================================================== ! def run(): # Read the arguments. try: *************** *** 1633,1634 **** --- 1633,1637 ---- else: print >>sys.stderr, __doc__ + + if __name__ == '__main__': + run() From richiehindle@users.sourceforge.net Thu Nov 28 21:27:11 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Thu, 28 Nov 2002 13:27:11 -0800 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.29,1.30 Message-ID: Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs1:/tmp/cvs-serv3157 Modified Files: pop3proxy.py Log Message: HTML tidyings. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** pop3proxy.py 28 Nov 2002 17:04:58 -0000 1.29 --- pop3proxy.py 28 Nov 2002 21:27:09 -0000 1.30 *************** *** 611,615 **** # value. This is so that setFieldValue can set the value. ! header = """Spambayes proxy: %s