From montanaro@users.sourceforge.net Fri Nov 1 01:23:30 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Thu, 31 Oct 2002 17:23:30 -0800
Subject: [Spambayes-checkins] spambayes INTEGRATION.txt,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26766
Added Files:
INTEGRATION.txt
Log Message:
first scribbled notes about integrating Spambayes with different email
packages.
--- NEW FILE: INTEGRATION.txt ---
=======================================
Integrating Spambayes with mail systems
=======================================
General
-------
Spambayes is a tool used to segregate unwanted (spam) mail from the mail you
want (ham). Before Spambayes can be your spam filter of choice you need to
train it on representative samples of email you receive. After it's been
trained, you use Spambayes to classify new mail according to its spamminess
and hamminess qualities.
To train Spambayes, you need to save your incoming email for a while,
segregating it into two piles, known spam and known ham (ham is our nickname
for good mail). It's best to train on recent email, because your interests
and the nature of what spam looks like change over time. Once you've
collected a fair portion of each (anything is better than nothing, but it
helps to have a couple hundred of each), you can tell Spambayes, "Here's my
ham and my spam". It will then process that mail and save information about
different patterns which appear in ham and spam. That information is then
used during the filtering stage.
When Spambayes filters your email, it compares each unclassified message
against the information it saved from training and makes a decision about
whether it thinks the message qualifies as ham or spam, or if it's unsure
about how to classify the message.
The sections below gather notes about how Spambayes can be
integrated into your mail processing system. As a general requirement, you
must have a recent version of Python installed on your computer, version
2.2.1 or later. (Don't ask about backporting it to earlier versions of
Python. It's almost a certainty this won't happen.) If you need to install
Python on your system, check the Python download page for the version
appropriate to your computer:
http://www.python.org/download/
Training
--------
Given a pair of Unix mailbox format files (each message starts with a line
which begins with 'From '), one containing nothing but spam and the other
containing nothing but ham, you can train Spambayes using a command like
hammie.py -g ~/tmp/newham -s ~/tmp/newspam
The above command is Unix-centric. In other environments it's likely that a
less command-line-oriented tool will be available in the near future.
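Before training, it can help to confirm that each mailbox file really holds
a reasonable number of messages. The following is a minimal sketch, not part
of Spambayes: it uses Python's standard mailbox module (Python 3 syntax), the
function names are hypothetical, and the default threshold of 200 is only one
reading of "a couple hundred of each".

```python
import mailbox


def count_messages(mbox_path):
    """Return how many messages a Unix mbox file contains."""
    # create=False: raise an error instead of creating a missing file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def training_report(ham_path, spam_path, suggested_minimum=200):
    """Map 'ham' and 'spam' to (message count, count >= suggested_minimum)."""
    report = {}
    for label, path in (("ham", ham_path), ("spam", spam_path)):
        n = count_messages(path)
        report[label] = (n, n >= suggested_minimum)
    return report
```

For the files used above, the paths would be the expanded forms of
~/tmp/newham and ~/tmp/newspam (for example via os.path.expanduser).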
Windows
-------
TBD.
Unix/Linux
----------
Unlike Windows, there are too many combinations of mail reading tools (mutt,
pine, Eudora, ...) and mail transport and delivery tools (sendmail, exim,
procmail, qmail, ...) to attempt to be exhaustive about how to integrate
Spambayes into your environment at this time. This section just documents
some of what's possible.
Procmail
--------
Many people on Unix-like systems have procmail available as an optional or
as the default local delivery agent. Integrating Spambayes checking with
Procmail is straightforward. Once you've trained Spambayes on your
collection of known ham and spam, you can use the hammie.py script to
classify incoming mail like so:
:0 fw:hamlock
| /usr/local/bin/hammie.py -f -d -p $HOME/hammie.db
The above Procmail recipe tells it to run /usr/local/bin/hammie.py in filter
mode (-f), and to use the training results stored in the dbm-style file
~/hammie.db. While hammie.py is running, Procmail uses the lock file
hamlock to prevent multiple invocations from stepping on each other's toes.
(It's not strictly necessary in this case since no files on-disk are
modified, but Procmail will still complain if you don't specify a lock
file.)
The result of running hammie.py in filter mode is that Procmail will use the
output from the run as the mail message for further processing downstream.
Hammie.py inserts an X-Hammie-Disposition header in the output message which
looks like
X-Hammie-Disposition: No; 0.00; '*H*': 1.00; '*S*': 0.00; 'python': 0.00;
'linux,': 0.01; 'desirable': 0.01; 'cvs,': 0.01; 'perl.': 0.02;
...
You can then use this to segregate your messages into various inboxes, like
so:
:0
* ^X-Hammie-Disposition: Yes
spam
:0
* ^X-Hammie-Disposition: Unsure
unsure
The first recipe catches all messages which hammie.py classified as spam.
The second catches all messages about which it was unsure. The combination
allows you to isolate spam from your good mail and tuck away messages it was
unsure about so you can scan them more closely.
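Tools other than Procmail can read the same header. The sketch below is an
illustration only (Python 3 syntax, hypothetical function name). It assumes
the header layout matches the sample shown earlier: a Yes/No/Unsure verdict,
then a number that appears to be the score, then the clue list, all
separated by semicolons.

```python
from email import message_from_string


def hammie_disposition(raw_message):
    """Return (verdict, score) from X-Hammie-Disposition, or None if absent.

    Assumes the first two semicolon-separated fields are the verdict and a
    numeric score, as in the sample header; the clue list is ignored.
    """
    msg = message_from_string(raw_message)
    header = msg.get("X-Hammie-Disposition")
    if header is None:
        return None
    # A folded header keeps its line breaks; strip() removes them per field.
    fields = [field.strip() for field in str(header).split(";")]
    if len(fields) < 2:
        return None
    try:
        score = float(fields[1])
    except ValueError:
        return None
    return fields[0], score
```

A caller could then route on the verdict string in the same way the recipes
above match on "Yes" and "Unsure".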
X/Emacs+VM
----------
Emacs and XEmacs both come with VM, one of a choice of several Emacs-based
mail packages. Emacs is extensible using Emacs Lisp or Pymacs. This
extensibility allows you to easily segregate your incoming mail for training
purposes. Here's one such example. If you place the following code in your
~/.vm file:
(defun copy-to-spam ()
(interactive)
(vm-save-message (expand-file-name "~/tmp/newspam"))
(vm-undelete-message 1))
(defun copy-to-nonspam ()
(interactive)
(vm-save-message (expand-file-name "~/tmp/newham"))
(vm-undelete-message 1))
(define-key vm-mode-map "ls" 'copy-to-spam)
(define-key vm-summary-mode-map "ls" 'copy-to-spam)
(define-key vm-mode-map "lh" 'copy-to-nonspam)
(define-key vm-summary-mode-map "lh" 'copy-to-nonspam)
'ls' will save a copy of the current message to ~/tmp/newspam and 'lh' will
save a copy of the current message to ~/tmp/newham. You can then use those
files later as arguments to hammie.py for training.
Things to watch out for
-----------------------
While Spambayes does an excellent job of classifying incoming mail, it is
only as good as the data on which it was trained. Here are some tips to
help you create a good training set:
* Don't use old mail. The characteristics of your email change over time,
sometimes subtly, sometimes dramatically, so it's best to use very recent
mail to train Spambayes. If you've abandoned an email address in the
past because it was getting spammed heavily, there are probably some
clues in mail sent to your old address which would bias Spambayes.
* Check and recheck your training collections. While you are manually
classifying mail as spam or ham, it's easy to make a mistake and toss a
message or ten in the wrong file. Such miscategorized mail will throw
off the classifier.
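One cheap check, sketched here under assumptions, is to look for messages
that ended up in both training files. This is not a Spambayes feature: the
function names are hypothetical, the code uses Python's standard mailbox
module (Python 3 syntax), and it compares Message-ID headers only, so it
catches a message saved to both piles but not one simply saved to the wrong
pile.

```python
import mailbox


def message_ids(mbox_path):
    """Collect the Message-ID values found in a Unix mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        ids = set()
        for msg in box:
            mid = msg.get("Message-ID")
            if mid:
                ids.add(str(mid).strip())
        return ids
    finally:
        box.close()


def filed_in_both(ham_path, spam_path):
    """Return the sorted Message-IDs present in both training files."""
    return sorted(message_ids(ham_path) & message_ids(spam_path))
```

Any Message-ID this returns deserves a second look before training.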
From mhammond@users.sourceforge.net Fri Nov 1 01:23:39 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 17:23:39 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
FilterDialog.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv26773
Modified Files:
FilterDialog.py
Log Message:
Missing an import of the win32com constants.
Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FilterDialog.py 31 Oct 2002 21:57:00 -0000 1.6
--- FilterDialog.py 1 Nov 2002 01:23:27 -0000 1.7
***************
*** 7,10 ****
--- 7,11 ----
import win32api
import pythoncom
+ from win32com.client import constants
from DialogGlobals import *
***************
*** 365,369 ****
if __name__=='__main__':
! from win32com.client import Dispatch, constants
outlook = Dispatch("Outlook.Application")
--- 366,370 ----
if __name__=='__main__':
! from win32com.client import Dispatch
outlook = Dispatch("Outlook.Application")
From mhammond@users.sourceforge.net Fri Nov 1 01:24:52 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 17:24:52 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv26936
Modified Files:
about.html
Log Message:
Add a bit more cruft
Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** about.html 31 Oct 2002 21:56:59 -0000 1.1
--- about.html 1 Nov 2002 01:24:09 -0000 1.2
***************
*** 1,7 ****
!
! About SpamBayes
!
!
! Contributions welcome!
!
!
\ No newline at end of file
--- 1,57 ----
!
!
!
! About SpamBayes
!
!
! NOTE: This is very very early code. If
! you are looking at this, you have probably been told about it against our better
! judgement <wink>. Stuff doesn't work correctly. Fields are
! funny. If you want something known to work well today for a lot of people,
! this is not for you.
!
! The source code is maintained at SourceForge.
!
! This spam filter uses Bayesian analysis to filter spam. Unlike other
! spam detection systems, Bayesian systems actually "learn" about what you
! consider spam, and continually adapt as both your regular email and spam
! patterns change.
!
Training
! Due to the nature of the system, it must be trained before it can be effective.
! Although the system does learn over time, when first installed it has
! no knowledge of either spam or good email.
!
Initial Training
! When first installed, it is recommended you perform the following steps:
!
!
Create two folders - one for "Spam", and one for "Possible Spam"
!
Go through your Inbox and Deleted Items, and move as much spam as you
! can find to the "Spam" folder. Try and get as much Spam out of your
! inbox as possible.
!
Select the Training dialog.
! Nominate your Spam folder for spam, and your Inbox for good messages,
! and start training.
!
! To see how effective your Inbox cleanup was, you may like to try:
!
!
Go to the Filter Now dialog.
!
Select your Inbox as the folder to filter.
!
Select Score messages, but don't perform
! filter action.
!
Clear both checkboxes so all messages will be scored.
!
Start the score operation.
!
! You can then look at and sort by the Spam field in your Inbox - this is likely
! to find hidden spam that you missed from your inbox cleanup.
!
Incremental Training
! When you drag a message to your Spam folder, it will be automatically trained
! as spam. Thus, as the classifier misses spam (or is unsure about them),
! it learns as you correct it.
! If messages are dropped back into the Inbox, they are trained as good - thus,
! the system learns what good messages look like should it incorrectly classify
! it as spam or possible spam.
!
! Contributions to this documentation are welcome!
!
!
!
From tim_one@users.sourceforge.net Fri Nov 1 02:04:36 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 31 Oct 2002 18:04:36 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.20,1.21 filter.py,1.11,1.12
manager.py,1.27,1.28 msgstore.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000
Modified Files:
addin.py filter.py manager.py msgstore.py
Log Message:
Whitespace normalization.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** addin.py 31 Oct 2002 21:56:59 -0000 1.20
--- addin.py 1 Nov 2002 02:03:39 -0000 1.21
***************
*** 300,304 ****
self.folder_hooks[k]._obj_.close()
self.folder_hooks = new_hooks
!
def _HookFolderEvents(self, folder_ids, include_sub, HandlerClass):
new_hooks = {}
--- 300,304 ----
self.folder_hooks[k]._obj_.close()
self.folder_hooks = new_hooks
!
def _HookFolderEvents(self, folder_ids, include_sub, HandlerClass):
new_hooks = {}
Index: filter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** filter.py 31 Oct 2002 21:56:59 -0000 1.11
--- filter.py 1 Nov 2002 02:03:42 -0000 1.12
***************
*** 79,83 ****
if progress.stop_requested():
return
! # All done - report what we did.
err_text = ""
if dispositions.has_key("Error"):
--- 79,83 ----
if progress.stop_requested():
return
! # All done - report what we did.
err_text = ""
if dispositions.has_key("Error"):
Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** manager.py 31 Oct 2002 21:56:59 -0000 1.27
--- manager.py 1 Nov 2002 02:03:43 -0000 1.28
***************
*** 113,117 ****
# "Integer" from the UI doesn't exist!
# 'olNumber' doesn't seem to work with PT_INT*
! win32com.client.constants.olCombination,
True) # Add to folder
item.Save()
--- 113,117 ----
# "Integer" from the UI doesn't exist!
# 'olNumber' doesn't seem to work with PT_INT*
! win32com.client.constants.olCombination,
True) # Add to folder
item.Save()
***************
*** 130,134 ****
self.EnsureOutlookFieldsForFolder(folder.EntryID, True)
folder = folders.GetNext()
!
def LoadBayes(self):
if not os.path.exists(self.ini_filename):
--- 130,134 ----
self.EnsureOutlookFieldsForFolder(folder.EntryID, True)
folder = folders.GetNext()
!
def LoadBayes(self):
if not os.path.exists(self.ini_filename):
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** msgstore.py 31 Oct 2002 21:56:59 -0000 1.13
--- msgstore.py 1 Nov 2002 02:03:45 -0000 1.14
***************
*** 363,367 ****
# objects use the same name-to-identifier mapping.
# [MarkH: Note MAPIUUID object are supported and hashable]
!
# XXX If the SpamProb (Hammie, whatever) property is passed in as an
# XXX int, Outlook displays the field as all blanks, and sorting on
--- 363,367 ----
# objects use the same name-to-identifier mapping.
# [MarkH: Note MAPIUUID object are supported and hashable]
!
# XXX If the SpamProb (Hammie, whatever) property is passed in as an
# XXX int, Outlook displays the field as all blanks, and sorting on
From tim_one@users.sourceforge.net Fri Nov 1 02:04:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 31 Oct 2002 18:04:39 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
FilterDialog.py,1.7,1.8
ManagerDialog.py,1.4,1.5 TrainingDialog.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000/dialogs
Modified Files:
FilterDialog.py ManagerDialog.py TrainingDialog.py
Log Message:
Whitespace normalization.
Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** FilterDialog.py 1 Nov 2002 01:23:27 -0000 1.7
--- FilterDialog.py 1 Nov 2002 02:03:46 -0000 1.8
***************
*** 213,217 ****
slider_pos = slider.GetPos()
self.SetDlgItemText(idc_edit, "%d" % slider_pos)
!
def _InitSlider(self, idc_slider, idc_edit):
slider = self.GetDlgItem(idc_slider)
--- 213,217 ----
slider_pos = slider.GetPos()
self.SetDlgItemText(idc_edit, "%d" % slider_pos)
!
def _InitSlider(self, idc_slider, idc_edit):
slider = self.GetDlgItem(idc_slider)
***************
*** 285,289 ****
[BUTTON, action_score, IDC_BUT_ACT_SCORE, (15,62,203,10), csts | win32con.BS_AUTORADIOBUTTON],
!
[BUTTON, only_group, -1, (7,84,230,35), cs | win32con.BS_GROUPBOX | win32con.WS_GROUP],
[BUTTON, only_unread, IDC_BUT_UNREAD, (15,94,149,9), csts | win32con.BS_AUTOCHECKBOX],
--- 285,289 ----
[BUTTON, action_score, IDC_BUT_ACT_SCORE, (15,62,203,10), csts | win32con.BS_AUTORADIOBUTTON],
!
[BUTTON, only_group, -1, (7,84,230,35), cs | win32con.BS_GROUPBOX | win32con.WS_GROUP],
[BUTTON, only_unread, IDC_BUT_UNREAD, (15,94,149,9), csts | win32con.BS_AUTOCHECKBOX],
Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** ManagerDialog.py 31 Oct 2002 21:57:00 -0000 1.4
--- ManagerDialog.py 1 Nov 2002 02:03:48 -0000 1.5
***************
*** 28,32 ****
training_intro = "Training is the process of giving examples of both good and bad email to the system so it can classify future email"
filtering_intro = "Filtering defines how spam is handled as it arrives"
!
dt = [
# Dialog itself.
--- 28,32 ----
training_intro = "Training is the process of giving examples of both good and bad email to the system so it can classify future email"
filtering_intro = "Filtering defines how spam is handled as it arrives"
!
dt = [
# Dialog itself.
***************
*** 39,48 ****
[BUTTON, "It is moved from a spam folder back to the Inbox",
IDC_BUT_TRAIN_FROM_SPAM_FOLDER,(20,50,204,9), csts | win32con.BS_AUTOCHECKBOX],
!
[STATIC, "Automatically train that a message is spam when",
-1, (15,64,208,10), cs],
[BUTTON, "It is moved to the certain-spam folder",
IDC_BUT_TRAIN_TO_SPAM_FOLDER,(20,75,204,9), csts | win32con.BS_AUTOCHECKBOX],
!
[STATIC, "", IDC_TRAINING_STATUS, (15,88,146,14), cs | win32con.SS_LEFTNOWORDWRAP | win32con.SS_CENTERIMAGE | win32con.SS_SUNKEN],
[BUTTON, 'Train Now...', IDC_BUT_TRAIN_NOW, (167,88,63,14), csts | win32con.BS_PUSHBUTTON],
--- 39,48 ----
[BUTTON, "It is moved from a spam folder back to the Inbox",
IDC_BUT_TRAIN_FROM_SPAM_FOLDER,(20,50,204,9), csts | win32con.BS_AUTOCHECKBOX],
!
[STATIC, "Automatically train that a message is spam when",
-1, (15,64,208,10), cs],
[BUTTON, "It is moved to the certain-spam folder",
IDC_BUT_TRAIN_TO_SPAM_FOLDER,(20,75,204,9), csts | win32con.BS_AUTOCHECKBOX],
!
[STATIC, "", IDC_TRAINING_STATUS, (15,88,146,14), cs | win32con.SS_LEFTNOWORDWRAP | win32con.SS_CENTERIMAGE | win32con.SS_SUNKEN],
[BUTTON, 'Train Now...', IDC_BUT_TRAIN_NOW, (167,88,63,14), csts | win32con.BS_PUSHBUTTON],
***************
*** 72,76 ****
(IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"),
]
!
dialog.Dialog.__init__(self, self.dt)
--- 72,76 ----
(IDC_BUT_TRAIN_TO_SPAM_FOLDER, "self.mgr.config.training.train_manual_spam"),
]
!
dialog.Dialog.__init__(self, self.dt)
***************
*** 125,129 ****
filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \
% (watch_names, certain_spam_name, unsure_name)
!
self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable)
enabled = config.enabled
--- 125,129 ----
filter_status = "Watching '%s'. Spam managed in '%s', unsure managed in '%s'" \
% (watch_names, certain_spam_name, unsure_name)
!
self.GetDlgItem(IDC_BUT_FILTER_ENABLE).EnableWindow(ok_to_enable)
enabled = config.enabled
***************
*** 133,137 ****
def OnButAbout(self, id, code):
if code == win32con.BN_CLICKED:
!
fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html")
fname = os.path.abspath(fname)
--- 133,137 ----
def OnButAbout(self, id, code):
if code == win32con.BN_CLICKED:
!
fname = os.path.join(os.path.dirname(__file__), os.pardir, "about.html")
fname = os.path.abspath(fname)
Index: TrainingDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/TrainingDialog.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** TrainingDialog.py 31 Oct 2002 21:57:00 -0000 1.6
--- TrainingDialog.py 1 Nov 2002 02:03:52 -0000 1.7
***************
*** 76,80 ****
if len(self.config.spam_folder_ids)==0 and self.mgr.config.filter.spam_folder_id:
self.config.spam_folder_ids = [self.mgr.config.filter.spam_folder_id]
!
names = []
for eid in self.config.ham_folder_ids:
--- 76,80 ----
if len(self.config.spam_folder_ids)==0 and self.mgr.config.filter.spam_folder_id:
self.config.spam_folder_ids = [self.mgr.config.filter.spam_folder_id]
!
names = []
for eid in self.config.ham_folder_ids:
From tim_one@users.sourceforge.net Fri Nov 1 02:04:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 31 Oct 2002 18:04:39 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000/sandbox delete_outlook_field.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv5945/Outlook2000/sandbox
Modified Files:
delete_outlook_field.py
Log Message:
Whitespace normalization.
Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** delete_outlook_field.py 31 Oct 2002 21:57:00 -0000 1.1
--- delete_outlook_field.py 1 Nov 2002 02:04:03 -0000 1.2
***************
*** 69,73 ****
None,
mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
!
table = mapi_folder.GetContentsTable(0)
prop_ids = PR_ENTRYID,
--- 69,73 ----
None,
mapi.MAPI_MODIFY | mapi.MAPI_DEFERRED_ERRORS)
!
table = mapi_folder.GetContentsTable(0)
prop_ids = PR_ENTRYID,
***************
*** 152,156 ****
print msg
!
def main():
import getopt
--- 152,156 ----
print msg
!
def main():
import getopt
From npickett@users.sourceforge.net Fri Nov 1 02:55:35 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Thu, 31 Oct 2002 18:55:35 -0800
Subject: [Spambayes-checkins] spambayes hammiesrv.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18408
Modified Files:
hammiesrv.py
Log Message:
* XML-encode the output (thanks Toby Dickenson)
Index: hammiesrv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** hammiesrv.py 27 Oct 2002 05:13:55 -0000 1.8
--- hammiesrv.py 1 Nov 2002 02:55:32 -0000 1.9
***************
*** 41,45 ****
except AttributeError:
pass
! return hammie.Hammie.score(self, msg, *extra)
def filter(self, msg, *extra):
--- 41,45 ----
except AttributeError:
pass
! return xmlrpclib.Binary(hammie.Hammie.score(self, msg, *extra))
def filter(self, msg, *extra):
***************
*** 48,52 ****
except AttributeError:
pass
! return hammie.Hammie.filter(self, msg, *extra)
--- 48,52 ----
except AttributeError:
pass
! return xmlrpclib.Binary(hammie.Hammie.filter(self, msg, *extra))
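A note on why the Binary wrapper helps: an XML-RPC string cannot safely
carry arbitrary bytes, whereas a Binary value is base64-encoded on the wire.
The sketch below is not taken from hammiesrv.py. It uses xmlrpc.client, the
Python 3 name for the xmlrpclib module used in the diff, and the roundtrip
helper is hypothetical.

```python
import xmlrpc.client


def roundtrip(payload):
    """Marshal bytes as an XML-RPC Binary response and unmarshal them again."""
    wire = xmlrpc.client.dumps(
        (xmlrpc.client.Binary(payload),), methodresponse=True)
    # loads() returns (params, methodname); a response has no method name.
    (result,), _method = xmlrpc.client.loads(wire)
    # By default the value comes back as a Binary object; .data holds bytes.
    return result.data
```

A client receiving such a value has to read the .data attribute rather than
treat the result as a plain string.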
From anthonybaxter@users.sourceforge.net Fri Nov 1 04:06:52 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:06:52 -0800
Subject: [Spambayes-checkins] website related.ht,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv6404
Modified Files:
related.ht
Log Message:
bogofilter now on SF.
Index: related.ht
===================================================================
RCS file: /cvsroot/spambayes/website/related.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** related.ht 30 Sep 2002 04:02:31 -0000 1.2
--- related.ht 1 Nov 2002 04:06:49 -0000 1.3
***************
*** 9,13 ****
PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.
From anthonybaxter@users.sourceforge.net Fri Nov 1 04:10:52 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:10:52 -0800
Subject: [Spambayes-checkins] spambayes timcv.py,1.10,1.11 msgs.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7003
Modified Files:
timcv.py msgs.py
Log Message:
Added support for specifying different numbers for training and testing
ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam)
still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain
have been added to timcv.py.
Note that msgs.setparms _tries_ to do the right thing if it's called as
an old 3-arg form, but I might not have captured all the possible
twistedness. As far as I can tell, only timcv.py and timtest.py
actually call these
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** timcv.py 10 Oct 2002 04:55:15 -0000 1.10
--- timcv.py 1 Nov 2002 04:10:50 -0000 1.11
***************
*** 14,24 ****
If you only want to use some of the messages in each set,
--ham-keep int
! The maximum number of msgs to use from each Ham set. The msgs are
! chosen randomly. See also the -s option.
--spam-keep int
! The maximum number of msgs to use from each Spam set. The msgs are
! chosen randomly. See also the -s option.
-s int
--- 14,40 ----
If you only want to use some of the messages in each set,
+ --HamTrain int
+ The maximum number of msgs to use from each Ham set for training.
+ The msgs are chosen randomly. See also the -s option.
+
+ --SpamTrain int
+ The maximum number of msgs to use from each Spam set for training.
+ The msgs are chosen randomly. See also the -s option.
+
+ --HamTest int
+ The maximum number of msgs to use from each Ham set for testing.
+ The msgs are chosen randomly. See also the -s option.
+
+ --SpamTest int
+ The maximum number of msgs to use from each Spam set for testing.
+ The msgs are chosen randomly. See also the -s option.
+
--ham-keep int
! The maximum number of msgs to use from each Ham set for testing
! and training. The msgs are chosen randomly. See also the -s option.
--spam-keep int
! The maximum number of msgs to use from each Spam set for testing
! and training. The msgs are chosen randomly. See also the -s option.
-s int
***************
*** 57,62 ****
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
# Now run nsets times, predicting pair i against all except pair i.
--- 73,80 ----
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
! hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
! spamdirs[1:], train=1))
# Now run nsets times, predicting pair i against all except pair i.
***************
*** 64,69 ****
h = hamdirs[i]
s = spamdirs[i]
! hamstream = msgs.HamStream(h, [h])
! spamstream = msgs.SpamStream(s, [s])
if i > 0:
--- 82,87 ----
h = hamdirs[i]
s = spamdirs[i]
! hamstream = msgs.HamStream(h, [h], train=0)
! spamstream = msgs.SpamStream(s, [s], train=0)
if i > 0:
***************
*** 80,84 ****
del s2[i]
! d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2))
else:
--- 98,103 ----
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
! msgs.SpamStream(sname, s2, train=1))
else:
***************
*** 101,109 ****
try:
opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
! ['ham-keep=', 'spam-keep='])
except getopt.error, msg:
usage(1, msg)
! nsets = seed = hamkeep = spamkeep = None
for opt, arg in opts:
if opt == '-h':
--- 120,131 ----
try:
opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
! ['HamTrain=', 'SpamTrain=',
! 'HamTest=', 'SpamTest=',
! 'ham-keep=', 'spam-keep='])
except getopt.error, msg:
usage(1, msg)
! nsets = seed = hamtrain = spamtrain = None
! hamtest = spamtest = hamkeep = spamkeep = None
for opt, arg in opts:
if opt == '-h':
***************
*** 113,116 ****
--- 135,146 ----
elif opt == '-s':
seed = int(arg)
+ elif opt == '--HamTest':
+ hamtest = int(arg)
+ elif opt == '--SpamTest':
+ spamtest = int(arg)
+ elif opt == '--HamTrain':
+ hamtrain = int(arg)
+ elif opt == '--SpamTrain':
+ spamtrain = int(arg)
elif opt == '--ham-keep':
hamkeep = int(arg)
***************
*** 123,127 ****
usage(1, "-n is required")
! msgs.setparms(hamkeep, spamkeep, seed)
drive(nsets)
--- 153,160 ----
usage(1, "-n is required")
! if hamkeep is not None:
! msgs.setparms(hamkeep, spamkeep, seed=seed)
! else:
! msgs.setparms(hamtrain, spamtrain, hamtest, spamtest, seed)
drive(nsets)
Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** msgs.py 25 Sep 2002 20:07:06 -0000 1.4
--- msgs.py 1 Nov 2002 04:10:50 -0000 1.5
***************
*** 6,11 ****
from tokenizer import tokenize
! HAMKEEP = None
! SPAMKEEP = None
SEED = random.randrange(2000000000)
--- 6,13 ----
from tokenizer import tokenize
! HAMTEST = None
! SPAMTEST = None
! HAMTRAIN = None
! SPAMTRAIN = None
SEED = random.randrange(2000000000)
***************
*** 68,83 ****
class HamStream(MsgStream):
! def __init__(self, tag, directories):
! MsgStream.__init__(self, tag, directories, HAMKEEP)
class SpamStream(MsgStream):
! def __init__(self, tag, directories):
! MsgStream.__init__(self, tag, directories, SPAMKEEP)
! def setparms(hamkeep, spamkeep, seed=None):
! """Set HAMKEEP and SPAMKEEP. If seed is not None, also set SEED."""
! global HAMKEEP, SPAMKEEP, SEED
! HAMKEEP, SPAMKEEP = hamkeep, spamkeep
if seed is not None:
SEED = seed
--- 70,103 ----
class HamStream(MsgStream):
! def __init__(self, tag, directories, train=0):
! if train:
! MsgStream.__init__(self, tag, directories, HAMTRAIN)
! else:
! MsgStream.__init__(self, tag, directories, HAMTEST)
class SpamStream(MsgStream):
! def __init__(self, tag, directories, train=0):
! if train:
! MsgStream.__init__(self, tag, directories, SPAMTRAIN)
! else:
! MsgStream.__init__(self, tag, directories, SPAMTEST)
! def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
! """Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
! If seed is not None, also set SEED.
! If (ham|spam)test are not set, set to the same as the (ham|spam)train
! numbers (backwards compat option).
! """
! global HAMTEST, SPAMTEST, HAMTRAIN, SPAMTRAIN, SEED
! HAMTRAIN, SPAMTRAIN = hamtrain, spamtrain
! if hamtest is None:
! HAMTEST = HAMTRAIN
! else:
! HAMTEST = hamtest
! if spamtest is None:
! SPAMTEST = SPAMTRAIN
! else:
! SPAMTEST = spamtest
if seed is not None:
SEED = seed
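The backwards-compatible defaulting in setparms can be illustrated in
isolation. This is a minimal sketch, not the code from msgs.py: the
function name resolve_limits is hypothetical, and it returns a dict instead
of setting module globals so that the behaviour is easy to check.

```python
def resolve_limits(hamtrain, spamtrain, hamtest=None, spamtest=None):
    """Return per-purpose message limits, defaulting test to train.

    When a test limit is None, the matching training limit is reused,
    which reproduces the old two-number (ham-keep/spam-keep) behaviour.
    """
    if hamtest is None:
        hamtest = hamtrain
    if spamtest is None:
        spamtest = spamtrain
    return {
        "hamtrain": hamtrain,
        "spamtrain": spamtrain,
        "hamtest": hamtest,
        "spamtest": spamtest,
    }
```

Calling it with two numbers mimics the old form, while four numbers give
separate training and testing limits.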
From anthonybaxter@users.sourceforge.net Fri Nov 1 04:13:13 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:13:13 -0800
Subject: [Spambayes-checkins] spambayes timtest.py,1.29,1.30
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8231
Modified Files:
timtest.py
Log Message:
Added support for specifying different numbers for training and testing
ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam)
still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain
have been added to timcv.py.
Note that msgs.setparms _tries_ to do the right thing if it's called as
an old 3-arg form, but I might not have captured all the possible
twistedness. As far as I can tell, only timcv.py and timtest.py
actually call these. Also, msgs.HamStream and msgs.SpamStream now
have an optional 'train' argument (which defaults to 0/False), which
tells them whether to use the test or train numbers.
If you have your own test harnesses, you _might_ need to update them
a little.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** timtest.py 24 Sep 2002 05:37:11 -0000 1.29
--- timtest.py 1 Nov 2002 04:13:11 -0000 1.30
***************
*** 98,102 ****
usage(1, "-n is required")
! msgs.setparms(hamkeep, spamkeep, seed)
drive(nsets)
--- 98,102 ----
usage(1, "-n is required")
! msgs.setparms(hamkeep, spamkeep, seed=seed)
drive(nsets)
From anthony@interlink.com.au Fri Nov 1 04:13:29 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 15:13:29 +1100
Subject: [Spambayes-checkins] spambayes timcv.py,1.10,1.11 msgs.py,1.4,1.5
In-Reply-To:
Message-ID: <200211010413.gA14DUn09404@localhost.localdomain>
>>> "Anthony Baxter" wrote
> Update of /cvsroot/spambayes/spambayes
> In directory usw-pr-cvs1:/tmp/cvs-serv7003
>
> Modified Files:
> timcv.py msgs.py
> Log Message:
> Added support for specifying different numbers for training and testing
> ham and spam. Old options --ham-keep and --spam-keep (or --ham/--spam)
> still work as before. New options --HamTest --SpamTest --HamTrain --SpamTrain
> have been added to timcv.py.
>
> Note that msgs.setparms _tries_ to do the right thing if it's called as
> an old 3-arg form, but I might not have captured all the possible
> twistedness. As far as I can tell, only timcv.py and timtest.py
> actually call these
Weird. My cvs commit aborted and only did two of the files, and truncated
my commit message??? I'll use cvs admin to fix the commit message next.
Anthony
--
Anthony Baxter
It's never too late to have a happy childhood.
From anthonybaxter@users.sourceforge.net Fri Nov 1 04:50:21 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 31 Oct 2002 20:50:21 -0800
Subject: [Spambayes-checkins]
website applications.ht,NONE,1.1 index.ht,1.1.1.1,1.2 links.h,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv20352
Modified Files:
index.ht links.h
Added Files:
applications.ht
Log Message:
initial 'applications' notes.
--- NEW FILE: applications.ht ---
Title: SpamBayes: Applications
Author-Email: spambayes@python.org
Author: spambayes
Applications
A number of applications are available in the SpamBayes project. None
of these are particularly polished, finished pieces of work, but they're
getting there (and help is always appreciated).
Outlook2000
Sean True and Mark Hammond have developed an addin for Outlook2000 that
adds support for the spambayes classifier.
Requirements
Python2.2 or later (2.2.2 recommended)
Outlook 2000 (not Outlook Express)
Python's win32com
extensions (win32all-149 or later)
CDO installed.
For more on this, see the README.txt or
about.html file in the spambayes CVS repository's Outlook2000 directory.
Availability
At the moment, you'll need to use CVS to get the code - go to the CVS page on the project's sourceforge site for more.
hammie.py
hammie is a command line tool for marking mail as ham or spam. Skip Montanaro has started a guide to integrating hammie with your mailer (Unix-only instructions at the moment - additions welcome!).
Currently it focusses on running hammie via procmail.
Requirements
Python2.2 or later (2.2.2 recommended)
Currently documentation focusses on Unix.
Availability
At the moment, you'll need to use CVS to get the code - go to the CVS page on the project's sourceforge site for more.
pop3proxy.py
pop3proxy sits between your mail client and your real POP3 server and marks
mail as ham or spam as it passes through. See the docstring at the top of pop3proxy.py for more.
Requirements
Python2.2 or later (2.2.2 recommended)
Should work on Windows/Unix/whatever... ?
Availability
At the moment, you'll need to use CVS to get the code - go to the CVS page on the project's sourceforge site for more.
Index: index.ht
===================================================================
RCS file: /cvsroot/spambayes/website/index.ht,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** index.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1
--- index.ht 1 Nov 2002 04:50:19 -0000 1.2
***************
*** 12,16 ****
via CVS -
note that it's not yet
! suitable for end-users, but for people interested in experimenting.
--- 12,22 ----
via CVS -
note that it's not yet
! suitable for non-technical end-users, but for people interested
! in experimenting.
!
!
! There are now a couple of end-user applications available for those
! excited by the bleeding edge - these are detailed on the
! Applications page.
Related
From mhammond@users.sourceforge.net Fri Nov 1 05:48:02 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 21:48:02 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000/dialogs FolderSelector.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv548/dialogs
Modified Files:
FolderSelector.py
Log Message:
All items are now identified by a (store_id, entry_id) tuple. This was
done in such a way that old config files should be fully supported - no
need to reconfigure.
Not much should look different, except multiple stores should be *fully*
supported - you should be able to train and filter across stores to your
heart's content.
Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** FolderSelector.py 31 Oct 2002 21:57:00 -0000 1.5
--- FolderSelector.py 1 Nov 2002 05:47:59 -0000 1.6
***************
*** 53,63 ****
from win32com.mapi.mapitags import *
def _BuildFoldersMAPI(msgstore, folder):
# Get the hierarchy table for it.
table = folder.GetHierarchyTable(0)
children = []
! rows = mapi.HrQueryAllRows(table, (PR_ENTRYID,PR_DISPLAY_NAME_A), None, None, 0)
! for (eid_tag, eid),(name_tag, name) in rows:
! spec = FolderSpec(mapi.HexFromBin(eid), name)
child_folder = msgstore.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
spec.children = _BuildFoldersMAPI(msgstore, child_folder)
--- 53,66 ----
from win32com.mapi.mapitags import *
+ default_store_id = None
+
def _BuildFoldersMAPI(msgstore, folder):
# Get the hierarchy table for it.
table = folder.GetHierarchyTable(0)
children = []
! rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
! for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
! folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
! spec = FolderSpec(folder_id, name)
child_folder = msgstore.OpenEntry(eid, None, mapi.MAPI_DEFERRED_ERRORS)
spec.children = _BuildFoldersMAPI(msgstore, child_folder)
***************
*** 66,79 ****
def BuildFolderTreeMAPI(session):
root = FolderSpec(None, "root")
tab = session.GetMsgStoresTable(0)
! rows = mapi.HrQueryAllRows(tab, (PR_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
for row in rows:
! (eid_tag, eid), (name_tag, name) = row
msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
subtree_eid = data[0][1]
folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
! spec = FolderSpec(mapi.HexFromBin(subtree_eid), name)
spec.children = _BuildFoldersMAPI(msgstore, folder)
root.children.append(spec)
--- 69,89 ----
def BuildFolderTreeMAPI(session):
+ global default_store_id
root = FolderSpec(None, "root")
tab = session.GetMsgStoresTable(0)
! prop_tags = PR_ENTRYID, PR_DEFAULT_STORE, PR_DISPLAY_NAME_A
! rows = mapi.HrQueryAllRows(tab, prop_tags, None, None, 0)
for row in rows:
! (eid_tag, eid), (is_def_tag, is_def), (name_tag, name) = row
! hex_eid = mapi.HexFromBin(eid)
! if is_def:
! default_store_id = hex_eid
!
msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
subtree_eid = data[0][1]
folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
! folder_id = hex_eid, mapi.HexFromBin(subtree_eid)
! spec = FolderSpec(folder_id, name)
spec.children = _BuildFoldersMAPI(msgstore, folder)
root.children.append(spec)
***************
*** 126,129 ****
--- 136,153 ----
self.checkbox_text = checkbox_text or "Include &subfolders"
+ def CompareIDs(self, id1, id2):
+ if type(id1) != type(()):
+ id1 = default_store_id, id1
+ if type(id2) != type(()):
+ id2 = default_store_id, id2
+ return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
+ self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))
+
+ def InIDs(self, id, ids):
+ for id_check in ids:
+ if self.CompareIDs(id_check, id):
+ return True
+ return False
+
def _MakeItemParam(self, item):
item_id = self.next_item_id
***************
*** 144,148 ****
mask = state = 0
else:
! if self.selected_ids and child.folder_id in self.selected_ids:
state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
num_children_selected += 1
--- 168,172 ----
mask = state = 0
else:
! if self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
num_children_selected += 1
***************
*** 152,156 ****
item_id = self._MakeItemParam(child)
hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
! if self.single_select and self.selected_ids and child.folder_id in self.selected_ids:
self.list.SelectItem(hitem)
--- 176,180 ----
item_id = self._MakeItemParam(child)
hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
! if self.single_select and self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
self.list.SelectItem(hitem)
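
The CompareIDs/InIDs change above replaces a plain "in" test with a comparison that also accepts old-style bare entry IDs. A minimal sketch of that idea in modern Python, runnable without MAPI, might look like this. DEFAULT_STORE_ID, as_pair and the case-insensitive _same_hex comparison are assumptions made for illustration; the real code delegates each half of the comparison to MAPI's CompareEntryIDs.

```python
DEFAULT_STORE_ID = "AA01"  # hypothetical hex ID of the default store

def as_pair(folder_id, default_store_id=DEFAULT_STORE_ID):
    """Promote a legacy bare entry ID to a (store_id, entry_id) tuple."""
    if not isinstance(folder_id, tuple):
        return (default_store_id, folder_id)
    return folder_id

def _same_hex(a, b):
    # Stand-in for CompareEntryIDs: treat hex strings as equal
    # regardless of letter case. This is an assumption of the sketch.
    return a.lower() == b.lower()

def compare_ids(id1, id2, same_entry=_same_hex):
    """True when both the store half and the entry half match."""
    store1, entry1 = as_pair(id1)
    store2, entry2 = as_pair(id2)
    return same_entry(store1, store2) and same_entry(entry1, entry2)

def in_ids(folder_id, ids):
    """Membership test that tolerates a mix of old and new ID forms."""
    return any(compare_ids(candidate, folder_id) for candidate in ids)

# A config file saved before the change holds bare entry IDs ...
saved_selection = ["0f"]
# ... while the folder tree now reports (store_id, entry_id) tuples.
print(in_ids(("AA01", "0F"), saved_selection))
```

Passing the comparison in as a parameter keeps the sketch testable; in the add-in itself the comparison has to go through MAPI, since two entry IDs can refer to the same object without being byte-for-byte identical.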
From mhammond@users.sourceforge.net Fri Nov 1 05:48:01 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 21:48:01 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.21,1.22 manager.py,1.28,1.29
msgstore.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv548
Modified Files:
addin.py manager.py msgstore.py
Log Message:
All items are now identified by a (store_id, entry_id) tuple. This was
done in such a way that old config files should be fully supported - no
need to reconfigure.
Not much should look different, except multiple stores should be *fully*
supported - you should be able to train and filter across stores to your
heart's content.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** addin.py 1 Nov 2002 02:03:39 -0000 1.21
--- addin.py 1 Nov 2002 05:47:59 -0000 1.22
***************
*** 308,312 ****
existing = self.folder_hooks.get(eid)
if existing is None or existing.__class__ != HandlerClass:
! folder = self.application.Session.GetFolderFromID(eid)
name = folder.Name.encode("mbcs", "replace")
try:
--- 308,312 ----
existing = self.folder_hooks.get(eid)
if existing is None or existing.__class__ != HandlerClass:
! folder = self.application.Session.GetFolderFromID(*eid)
name = folder.Name.encode("mbcs", "replace")
try:
Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** manager.py 1 Nov 2002 02:03:43 -0000 1.28
--- manager.py 1 Nov 2002 05:47:59 -0000 1.29
***************
*** 92,96 ****
assert self.outlook is not None, "I need outlook :("
ol = self.outlook
! folder = ol.Session.GetFolderFromID(folder_id)
if self.verbose > 1:
print "Checking folder '%s' for our field '%s'" \
--- 92,96 ----
assert self.outlook is not None, "I need outlook :("
ol = self.outlook
! folder = ol.Session.GetFolderFromID(*folder_id)
if self.verbose > 1:
print "Checking folder '%s' for our field '%s'" \
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** msgstore.py 1 Nov 2002 02:03:45 -0000 1.14
--- msgstore.py 1 Nov 2002 05:47:59 -0000 1.15
***************
*** 91,123 ****
mapi.MAPI_USE_DEFAULT)
self.session = mapi.MAPILogonEx(0, None, None, logonFlags)
! self._FindDefaultMessageStore()
os.chdir(cwd)
def Close(self):
! self.mapi_msgstore = None
self.session.Logoff(0, 0, 0)
self.session = None
mapi.MAPIUninitialize()
! def _FindDefaultMessageStore(self):
! tab = self.session.GetMsgStoresTable(0)
! # Restriction for the table: get rows where PR_DEFAULT_STORE is true.
! # There should be only one.
! restriction = (mapi.RES_PROPERTY, # a property restriction
! (mapi.RELOP_EQ, # check for equality
! PR_DEFAULT_STORE, # of the PR_DEFAULT_STORE prop
! (PR_DEFAULT_STORE, True))) # with True
! rows = mapi.HrQueryAllRows(tab,
! (PR_ENTRYID,), # columns to retrieve
! restriction, # only these rows
! None, # any sort order is fine
! 0) # any # of results is fine
! # get first entry, a (property_tag, value) pair, for PR_ENTRYID
! row = rows[0]
! eid_tag, eid = row[0]
! # Open the store.
! self.mapi_msgstore = self.session.OpenMsgStore(
0, # no parent window
! eid, # msg store to open
None, # IID; accept default IMsgStore
# need write access to add score fields
--- 91,135 ----
mapi.MAPI_USE_DEFAULT)
self.session = mapi.MAPILogonEx(0, None, None, logonFlags)
! self.mapi_msg_stores = {}
! self.default_store_bin_eid = None
! self._GetMessageStore(None)
os.chdir(cwd)
def Close(self):
! self.mapi_msg_stores = None
self.session.Logoff(0, 0, 0)
self.session = None
mapi.MAPIUninitialize()
! def _GetMessageStore(self, store_eid): # bin eid.
! try:
! # Will usually be pre-fetched, so fast-path out
! return self.mapi_msg_stores[store_eid]
! except KeyError:
! pass
! given_store_eid = store_eid
! if store_eid is None:
! # Find the EID for the default store.
! tab = self.session.GetMsgStoresTable(0)
! # Restriction for the table: get rows where PR_DEFAULT_STORE is true.
! # There should be only one.
! restriction = (mapi.RES_PROPERTY, # a property restriction
! (mapi.RELOP_EQ, # check for equality
! PR_DEFAULT_STORE, # of the PR_DEFAULT_STORE prop
! (PR_DEFAULT_STORE, True))) # with True
! rows = mapi.HrQueryAllRows(tab,
! (PR_ENTRYID,), # columns to retrieve
! restriction, # only these rows
! None, # any sort order is fine
! 0) # any # of results is fine
! # get first entry, a (property_tag, value) pair, for PR_ENTRYID
! row = rows[0]
! eid_tag, store_eid = row[0]
! self.default_store_bin_eid = store_eid
!
! # Open it.
! store = self.session.OpenMsgStore(
0, # no parent window
! store_eid, # msg store to open
None, # IID; accept default IMsgStore
# need write access to add score fields
***************
*** 126,158 ****
mapi.MDB_NO_MAIL |
USE_DEFERRED_ERRORS)
def _GetSubFolderIter(self, folder):
table = folder.GetHierarchyTable(0)
rows = mapi.HrQueryAllRows(table,
! (PR_ENTRYID, PR_DISPLAY_NAME_A),
None,
None,
0)
! for (eid_tag, eid),(name_tag, name) in rows:
! sub = self.mapi_msgstore.OpenEntry(eid,
! None,
! mapi.MAPI_MODIFY |
! USE_DEFERRED_ERRORS)
table = sub.GetContentsTable(0)
! yield MAPIMsgStoreFolder(self, eid, name, table.GetRowCount(0))
! folder = self.mapi_msgstore.OpenEntry(eid,
! None,
! mapi.MAPI_MODIFY |
! USE_DEFERRED_ERRORS)
! for store_folder in self._GetSubFolderIter(folder):
yield store_folder
def GetFolderGenerator(self, folder_ids, include_sub):
for folder_id in folder_ids:
! folder_id = mapi.BinFromHex(folder_id)
! folder = self.mapi_msgstore.OpenEntry(folder_id,
! None,
! mapi.MAPI_MODIFY |
! USE_DEFERRED_ERRORS)
table = folder.GetContentsTable(0)
rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
--- 138,191 ----
mapi.MDB_NO_MAIL |
USE_DEFERRED_ERRORS)
+ # cache it
+ self.mapi_msg_stores[store_eid] = store
+ if given_store_eid is None: # The default store
+ self.mapi_msg_stores[None] = store
+ return store
+
+ def _OpenEntry(self, id, iid = None, flags = None):
+ # id is already normalized.
+ store_id, item_id = id
+ store = self._GetMessageStore(store_id)
+ if flags is None:
+ flags = mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS
+ return store.OpenEntry(item_id, iid, flags)
+
+ # Given an ID, normalize it into a (store_id, item_id) binary tuple.
+ # item_id may be:
+ # - Simple hex EID, in which case default store ID is assumed.
+ # - Tuple of (None, hex_eid), in which case default store assumed.
+ # - Tuple of (hex_store_id, hex_id)
+ def NormalizeID(self, item_id):
+ if type(item_id)==type(()):
+ store_id, item_id = item_id
+ item_id = mapi.BinFromHex(item_id)
+ if store_id is None:
+ store_id = self.default_store_bin_eid
+ else:
+ store_id = mapi.BinFromHex(store_id)
+ return store_id, item_id
+ assert type(item_id) in [type(''), type(u'')], "What kind of ID is '%r'?" % (item_id,)
+ return self.default_store_bin_eid, mapi.BinFromHex(item_id)
def _GetSubFolderIter(self, folder):
table = folder.GetHierarchyTable(0)
rows = mapi.HrQueryAllRows(table,
! (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A),
None,
None,
0)
! for (eid_tag, eid), (store_eid_tag, store_eid), (name_tag, name) in rows:
! item_id = store_eid, eid
! sub = self._OpenEntry(item_id)
table = sub.GetContentsTable(0)
! yield MAPIMsgStoreFolder(self, item_id, name, table.GetRowCount(0))
! for store_folder in self._GetSubFolderIter(sub):
yield store_folder
def GetFolderGenerator(self, folder_ids, include_sub):
for folder_id in folder_ids:
! folder_id = self.NormalizeID(folder_id)
! folder = self._OpenEntry(folder_id)
table = folder.GetContentsTable(0)
rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
***************
*** 165,173 ****
def GetFolder(self, folder_id):
# Return a single folder given the ID.
! folder_id = mapi.BinFromHex(folder_id)
! folder = self.mapi_msgstore.OpenEntry(folder_id,
! None,
! mapi.MAPI_MODIFY |
! USE_DEFERRED_ERRORS)
table = folder.GetContentsTable(0)
rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
--- 198,203 ----
def GetFolder(self, folder_id):
# Return a single folder given the ID.
! folder_id = self.NormalizeID(folder_id)
! folder = self._OpenEntry(folder_id)
table = folder.GetContentsTable(0)
rc, props = folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
***************
*** 177,191 ****
def GetMessage(self, message_id):
# Return a single message given the ID.
! message_id = mapi.BinFromHex(message_id)
prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
! mapi_object = self.mapi_msgstore.OpenEntry(message_id,
! None,
! mapi.MAPI_MODIFY |
! USE_DEFERRED_ERRORS)
hr, data = mapi_object.GetProps(prop_ids,0)
folder_eid = data[0][1]
searchkey = data[1][1]
unread = data[2][1]
! folder = MAPIMsgStoreFolder(self, folder_eid,
"Unknown - temp message", -1)
return MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
--- 207,219 ----
def GetMessage(self, message_id):
# Return a single message given the ID.
! message_id = self.NormalizeID(message_id)
prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
! mapi_object = self._OpenEntry(message_id)
hr, data = mapi_object.GetProps(prop_ids,0)
folder_eid = data[0][1]
searchkey = data[1][1]
unread = data[2][1]
! folder_id = message_id[0], folder_eid
! folder = MAPIMsgStoreFolder(self, folder_id,
"Unknown - temp message", -1)
return MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
***************
*** 216,232 ****
def __repr__(self):
! return "<%s '%s' (%d items), id=%s>" % (self.__class__.__name__,
self.name,
self.count,
! mapi.HexFromBin(self.id))
def GetOutlookEntryID(self):
! return mapi.HexFromBin(self.id)
def GetMessageGenerator(self):
! folder = self.msgstore.mapi_msgstore.OpenEntry(self.id,
! None,
! mapi.MAPI_MODIFY |
! USE_DEFERRED_ERRORS)
table = folder.GetContentsTable(0)
prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
--- 244,263 ----
def __repr__(self):
! return "<%s '%s' (%d items), id=%s/%s>" % (self.__class__.__name__,
self.name,
self.count,
! mapi.HexFromBin(self.id[0]),
! mapi.HexFromBin(self.id[1]))
def GetOutlookEntryID(self):
! # Return EntryID, StoreID - we use this order as it is the same as
! # Session.GetItemFromID() uses - thus:
! # ids = me.GetOutlookEntryID()
! # session.GetItemFromID(*ids)
! # should work.
! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
def GetMessageGenerator(self):
! folder = self.msgstore._OpenEntry(self.id)
table = folder.GetContentsTable(0)
prop_ids = PR_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
***************
*** 239,244 ****
break
for row in rows:
yield MAPIMsgStoreMsg(self.msgstore, self,
! row[0][1], row[1][1], row[2][1])
--- 270,276 ----
break
for row in rows:
+ item_id = self.id[0], row[0][1] # assume in same store as folder!
yield MAPIMsgStoreMsg(self.msgstore, self,
! item_id, row[1][1], row[2][1])
***************
*** 263,272 ****
else:
urs = "unread"
! return "<%s, (%s) id=%s>" % (self.__class__.__name__,
urs,
! mapi.HexFromBin(self.id))
def GetOutlookEntryID(self):
! return mapi.HexFromBin(self.id)
def _GetPropFromStream(self, prop_id):
--- 295,310 ----
else:
urs = "unread"
! return "<%s, (%s) id=%s/%s>" % (self.__class__.__name__,
urs,
! mapi.HexFromBin(self.id[0]),
! mapi.HexFromBin(self.id[1]))
def GetOutlookEntryID(self):
! # Return EntryID, StoreID - we use this order as it is the same as
! # Session.GetItemFromID() uses - thus:
! # ids = me.GetOutlookEntryID()
! # session.GetItemFromID(*ids)
! # should work.
! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
def _GetPropFromStream(self, prop_id):
***************
*** 319,326 ****
def _EnsureObject(self):
if self.mapi_object is None:
! self.mapi_object = self.msgstore.mapi_msgstore.OpenEntry(
! self.id,
! None,
! mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS)
def GetEmailPackageObject(self, strip_mime_headers=True):
--- 357,361 ----
def _EnsureObject(self):
if self.mapi_object is None:
! self.mapi_object = self.msgstore._OpenEntry(self.id)
def GetEmailPackageObject(self, strip_mime_headers=True):
***************
*** 418,432 ****
assert not self.dirty, \
"asking me to move a dirty message - later saves will fail!"
! dest_folder = self.msgstore.mapi_msgstore.OpenEntry(
! folder.id,
! None,
! mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS)
! source_folder = self.msgstore.mapi_msgstore.OpenEntry(
! self.folder.id,
! None,
! mapi.MAPI_MODIFY | USE_DEFERRED_ERRORS)
flags = 0
if isMove: flags |= MESSAGE_MOVE
! source_folder.CopyMessages((self.id,),
None,
dest_folder,
--- 453,462 ----
assert not self.dirty, \
"asking me to move a dirty message - later saves will fail!"
! dest_folder = self.msgstore._OpenEntry(folder.id)
! source_folder = self.msgstore._OpenEntry(self.folder.id)
flags = 0
if isMove: flags |= MESSAGE_MOVE
! eid = self.id[1]
! source_folder.CopyMessages((eid,),
None,
dest_folder,
***************
*** 434,438 ****
None,
flags)
! self.folder = self.msgstore.GetFolder(mapi.HexFromBin(folder.id))
def MoveTo(self, folder):
--- 464,473 ----
None,
flags)
! # At this stage, I think we have lost meaningful ID etc values
! # Set everything to None to make it clearer what is wrong should
! # this become an issue. We would need to re-fetch the eid of
! # the item, and set the store_id to the dest folder.
! self.id = None
! self.folder = None
def MoveTo(self, folder):
***************
*** 453,457 ****
print msg
store.Close()
-
if __name__=='__main__':
--- 488,491 ----
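
The two ideas introduced in the msgstore.py diff above, normalizing every ID to a binary (store_id, item_id) tuple and opening each message store only once, can be sketched without MAPI as follows. This is an illustration in modern Python under stated assumptions: StoreRegistry and its open_store callback are hypothetical, and binascii.unhexlify stands in for mapi.BinFromHex.

```python
from binascii import unhexlify

class StoreRegistry:
    """Toy stand-in for the message-store wrapper; no MAPI involved."""

    def __init__(self, default_store_hex, open_store):
        self.default_store_bin_eid = unhexlify(default_store_hex)
        self._open_store = open_store   # callable: binary store ID -> store
        self._stores = {}               # cache keyed by binary store ID

    def normalize_id(self, item_id):
        """Return a (store_id, item_id) tuple of bytes.

        Accepts a bare hex string, (None, hex_id) or (hex_store, hex_id);
        the first two forms fall back to the default store.
        """
        if isinstance(item_id, tuple):
            store_id, item_id = item_id
            if store_id is None:
                store_bin = self.default_store_bin_eid
            else:
                store_bin = unhexlify(store_id)
            return store_bin, unhexlify(item_id)
        if not isinstance(item_id, str):
            raise TypeError("What kind of ID is %r?" % (item_id,))
        return self.default_store_bin_eid, unhexlify(item_id)

    def get_store(self, store_bin):
        """Open each store at most once and reuse it afterwards."""
        try:
            return self._stores[store_bin]
        except KeyError:
            store = self._stores[store_bin] = self._open_store(store_bin)
            return store

opened = []

def fake_open(store_bin):
    # Hypothetical opener that just records which stores were requested.
    opened.append(store_bin)
    return {"store": store_bin}

registry = StoreRegistry("aa01", fake_open)
store_bin, item_bin = registry.normalize_id("0f")   # old-style bare ID
registry.get_store(store_bin)
registry.get_store(store_bin)                       # served from the cache
print(len(opened))
```

Normalizing at the public entry points means the internal helpers only ever see one ID shape, which is what lets old config files keep working without reconfiguration.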
From mhammond@users.sourceforge.net Fri Nov 1 06:09:08 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Thu, 31 Oct 2002 22:09:08 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.29,1.30
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv5475
Modified Files:
manager.py
Log Message:
Stop everyone fretting over a known problem.
Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** manager.py 1 Nov 2002 05:47:59 -0000 1.29
--- manager.py 1 Nov 2002 06:09:06 -0000 1.30
***************
*** 119,125 ****
print "Created the UserProperty!"
except pythoncom.com_error:
! import traceback
! print "Failed to create the field"
! traceback.print_exc()
# else no items in this folder - not much worth doing!
if include_sub:
--- 119,126 ----
print "Created the UserProperty!"
except pythoncom.com_error:
! pass # We know, we know...
! ## import traceback
! ## print "Failed to create the field"
! ## traceback.print_exc()
# else no items in this folder - not much worth doing!
if include_sub:
From tim.one@comcast.net Fri Nov 1 06:22:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 01:22:38 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
FolderSelector.py,1.5,1.6
In-Reply-To:
Message-ID:
[Mark Hammond]
> Modified Files:
> FolderSelector.py
> Log Message:
> All items are now identified by a (store_id, entry_id) tuple. This was
> done in such a way that old config files should be fully supported - no
> need to reconfigure.
>
> Not much should look different, except multiple stores should be *fully*
> supported - you should be able to train and filter across stores to your
> heart's content.
That's impressive! I'll do my bit next by ensuring there's no trailing
whitespace.
From richiehindle@users.sourceforge.net Fri Nov 1 09:14:50 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 01 Nov 2002 01:14:50 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16187
Modified Files:
pop3proxy.py
Log Message:
Made this work on Linux, where socket.makefile behaves differently from
Windows.
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** pop3proxy.py 29 Oct 2002 21:02:40 -0000 1.7
--- pop3proxy.py 1 Nov 2002 09:14:47 -0000 1.8
***************
*** 87,94 ****
self.request = ''
self.set_terminator('\r\n')
! serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! serverSocket.connect((serverName, serverPort))
! self.serverFile = serverSocket.makefile()
! self.push(self.serverFile.readline())
def handle_connect(self):
--- 87,94 ----
self.request = ''
self.set_terminator('\r\n')
! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! self.serverSocket.connect((serverName, serverPort))
! self.serverIn = self.serverSocket.makefile('r') # For reading only
! self.push(self.serverIn.readline())
def handle_connect(self):
***************
*** 135,139 ****
seenAllHeaders = False
while True:
! line = self.serverFile.readline()
if not line:
# The socket's been closed by the server, probably by QUIT.
--- 135,139 ----
seenAllHeaders = False
while True:
! line = self.serverIn.readline()
if not line:
# The socket's been closed by the server, probably by QUIT.
***************
*** 173,184 ****
# Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
! self.serverFile.write('QUIT\r\n')
! self.serverFile.flush()
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverFile.write(self.request + '\r\n')
! self.serverFile.flush()
if self.request.strip() == '':
# Someone just hit the Enter key.
--- 173,182 ----
# Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
! self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverSocket.sendall(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
***************
*** 200,204 ****
if timedOut:
while True:
! line = self.serverFile.readline()
if not line:
# The socket's been closed by the server.
--- 198,202 ----
if timedOut:
while True:
! line = self.serverIn.readline()
if not line:
# The socket's been closed by the server.
***************
*** 529,532 ****
--- 527,531 ----
asyncore.loop(map=testSocketMap)
+ proxyReady = threading.Event()
def runProxy():
# Name the database in case it ever gets auto-flushed to disk.
***************
*** 535,538 ****
--- 534,538 ----
bayes.learn(tokenizer.tokenize(spam1), True)
bayes.learn(tokenizer.tokenize(good1), False)
+ proxyReady.set()
asyncore.loop()
***************
*** 540,548 ****
testServerReady.wait()
threading.Thread(target=runProxy).start()
# Connect to the proxy.
proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
proxy.connect(('localhost', 8111))
! assert proxy.recv(100) == "+OK ready\r\n"
# Stat the mailbox to get the number of messages.
--- 540,550 ----
testServerReady.wait()
threading.Thread(target=runProxy).start()
+ proxyReady.wait()
# Connect to the proxy.
proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
proxy.connect(('localhost', 8111))
! response = proxy.recv(100)
! assert response == "+OK ready\r\n"
# Stat the mailbox to get the number of messages.
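
The pattern adopted in the pop3proxy.py diff above (read through a file object created with makefile, write directly with sendall, and use a threading.Event so the client does not connect before the server is listening) can be sketched as a self-contained example. This is a modern Python illustration with bytes, not the proxy's code: run_demo, the throwaway server and its "+OK bye" reply are all invented for the example.

```python
import socket
import threading

def run_demo():
    """Start a one-shot server, talk to it, return (greeting, reply)."""
    ready = threading.Event()
    address = []                      # filled in by the server thread

    def server():
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind(("127.0.0.1", 0))   # port 0: the OS picks a free port
        listener.listen(1)
        address.append(listener.getsockname())
        ready.set()                   # only now is it safe to connect
        conn, _ = listener.accept()
        with conn, conn.makefile("rb") as client_in:
            conn.sendall(b"+OK ready\r\n")
            request = client_in.readline()
            if request.strip().upper() == b"QUIT":
                conn.sendall(b"+OK bye\r\n")
        listener.close()

    thread = threading.Thread(target=server)
    thread.start()
    if not ready.wait(5):
        raise RuntimeError("server thread did not start listening")

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(address[0])
    server_in = client.makefile("rb")  # file object used for reading only
    greeting = server_in.readline()
    client.sendall(b"QUIT\r\n")        # writes go straight to the socket
    reply = server_in.readline()
    server_in.close()
    client.close()
    thread.join()
    return greeting, reply

if __name__ == "__main__":
    print(run_demo())
```

Keeping reads and writes on separate objects avoids relying on how a single read/write file object buffers and flushes, which is where platforms can differ.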
From mhammond@users.sourceforge.net Fri Nov 1 14:35:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 06:35:10 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.22,1.23 manager.py,1.30,1.31
msgstore.py,1.15,1.16
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14364
Modified Files:
addin.py manager.py msgstore.py
Log Message:
Fix a problem with the (store_id, item_id) change, and remove the
confusing GetOutlookEntryID concept - just get the item!
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** addin.py 1 Nov 2002 05:47:59 -0000 1.22
--- addin.py 1 Nov 2002 14:35:05 -0000 1.23
***************
*** 305,312 ****
for msgstore_folder in self.manager.message_store.GetFolderGenerator(
folder_ids, include_sub):
! eid = msgstore_folder.GetOutlookEntryID()
! existing = self.folder_hooks.get(eid)
if existing is None or existing.__class__ != HandlerClass:
! folder = self.application.Session.GetFolderFromID(*eid)
name = folder.Name.encode("mbcs", "replace")
try:
--- 305,311 ----
for msgstore_folder in self.manager.message_store.GetFolderGenerator(
folder_ids, include_sub):
! existing = self.folder_hooks.get(msgstore_folder.id)
if existing is None or existing.__class__ != HandlerClass:
! folder = msgstore_folder.GetOutlookItem()
name = folder.Name.encode("mbcs", "replace")
try:
***************
*** 317,325 ****
if new_hook is not None:
new_hook.Init(folder, self.application, self.manager)
! new_hooks[eid] = new_hook
! self.manager.EnsureOutlookFieldsForFolder(eid)
print "AntiSpam: Watching for new messages in folder", name
else:
! new_hooks[eid] = existing
return new_hooks
--- 316,324 ----
if new_hook is not None:
new_hook.Init(folder, self.application, self.manager)
! new_hooks[msgstore_folder.id] = new_hook
! self.manager.EnsureOutlookFieldsForFolder(msgstore_folder.GetID())
print "AntiSpam: Watching for new messages in folder", name
else:
! new_hooks[msgstore_folder.id] = existing
return new_hooks
Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** manager.py 1 Nov 2002 06:09:06 -0000 1.30
--- manager.py 1 Nov 2002 14:35:05 -0000 1.31
***************
*** 92,96 ****
assert self.outlook is not None, "I need outlook :("
ol = self.outlook
! folder = ol.Session.GetFolderFromID(*folder_id)
if self.verbose > 1:
print "Checking folder '%s' for our field '%s'" \
--- 92,97 ----
assert self.outlook is not None, "I need outlook :("
ol = self.outlook
! msgstore_folder = self.message_store.GetFolder(folder_id)
! folder = msgstore_folder.GetOutlookItem()
if self.verbose > 1:
print "Checking folder '%s' for our field '%s'" \
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** msgstore.py 1 Nov 2002 05:47:59 -0000 1.15
--- msgstore.py 1 Nov 2002 14:35:06 -0000 1.16
***************
*** 219,230 ****
return MAPIMsgStoreMsg(self, folder, message_id, searchkey, unread)
- ## # Currently no need for this
- ## def GetOutlookObjectFromID(self, eid):
- ## if self.outlook is None:
- ## from win32com.client import Dispatch
- ## self.outlook = Dispatch("Outlook.Application")
- ## return self.outlook.Session.GetItemFromID(mapi.HexFromBin(eid))
-
-
_MapiTypeMap = {
type(0.0): PT_DOUBLE,
--- 219,222 ----
***************
*** 250,260 ****
mapi.HexFromBin(self.id[1]))
! def GetOutlookEntryID(self):
! # Return EntryID, StoreID - we use this order as it is the same as
! # Session.GetItemFromID() uses - thus:
! # ids = me.GetOutlookEntryID()
! # session.GetItemFromID(*ids)
! # should work.
! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
def GetMessageGenerator(self):
--- 242,252 ----
mapi.HexFromBin(self.id[1]))
! def GetID(self):
! return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
!
! def GetOutlookItem(self):
! hex_item_id = mapi.HexFromBin(self.id[1])
! hex_store_id = mapi.HexFromBin(self.id[0])
! return self.msgstore.outlook.Session.GetFolderFromID(hex_item_id, hex_store_id)
def GetMessageGenerator(self):
***************
*** 300,310 ****
mapi.HexFromBin(self.id[1]))
! def GetOutlookEntryID(self):
! # Return EntryID, StoreID - we use this order as it is the same as
! # Session.GetItemFromID() uses - thus:
! # ids = me.GetOutlookEntryID()
! # session.GetItemFromID(*ids)
! # should work.
! return mapi.HexFromBin(self.id[1]), mapi.HexFromBin(self.id[0])
def _GetPropFromStream(self, prop_id):
--- 292,302 ----
mapi.HexFromBin(self.id[1]))
! def GetID(self):
! return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
!
! def GetOutlookItem(self):
! hex_item_id = mapi.HexFromBin(self.id[1])
! hex_store_id = mapi.HexFromBin(self.id[0])
! return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id)
def _GetPropFromStream(self, prop_id):
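The ordering in this change is easy to get backwards: GetID() returns (store ID, entry ID), while the Outlook Session calls take the entry ID first and the store ID second. A minimal sketch of that swap follows. It assumes binascii as a stand-in for mapi.HexFromBin so that it runs without win32all, and FakeSession and Message are hypothetical names, not classes from msgstore.py.

```python
import binascii

def hex_from_bin(data):
    # Assumed stand-in for mapi.HexFromBin: hex text for a binary ID.
    return binascii.hexlify(data).decode("ascii")

class FakeSession:
    # Hypothetical stand-in for Outlook's Session object; it only records
    # its arguments in the order the diff passes them to GetItemFromID().
    def GetItemFromID(self, entry_id, store_id):
        return {"entry_id": entry_id, "store_id": store_id}

class Message:
    def __init__(self, session, store_id, entry_id):
        self.session = session
        self.id = (store_id, entry_id)   # internal order: store first

    def GetID(self):
        # Same order as the internal tuple: (store, entry).
        return hex_from_bin(self.id[0]), hex_from_bin(self.id[1])

    def GetOutlookItem(self):
        # Outlook is handed the entry ID first, then the store ID.
        hex_item_id = hex_from_bin(self.id[1])
        hex_store_id = hex_from_bin(self.id[0])
        return self.session.GetItemFromID(hex_item_id, hex_store_id)

msg = Message(FakeSession(), b"\x01\x02", b"\xab\xcd")
print(msg.GetID())
print(msg.GetOutlookItem())
```

Keeping the swap inside GetOutlookItem() means callers never have to remember which order the Outlook object model expects.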
From tim_one@users.sourceforge.net Fri Nov 1 16:01:20 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 08:01:20 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.45,1.46
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7943
Modified Files:
classifier.py
Log Message:
WordInfo.__init__: if an initial spamprob isn't specified, set it to
options.robinson_probability_x (the "unknown word" probability) instead
of to None. If threads exist such that scoring can happen in parallel
with training, None could cause scoring to raise an exception. A "real"
spamprob can't be computed until update_probabilities is called to
recalculate the entire database; before then, giving a new word the
unknown-word spamprob is thoroughly appropriate.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** classifier.py 27 Oct 2002 17:11:00 -0000 1.45
--- classifier.py 1 Nov 2002 16:01:14 -0000 1.46
***************
*** 62,66 ****
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=None):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 62,66 ----
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
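The failure mode described in the log message can be sketched as follows. This is an illustration only: UNKNOWN_WORD_PROB is an assumed stand-in for options.robinson_probability_x (0.5 is used here as a placeholder value), and log_spam_evidence is a toy scorer, not the combining scheme classifier.py actually uses.

```python
import math

# Assumed stand-in for options.robinson_probability_x.
UNKNOWN_WORD_PROB = 0.5

class WordInfo:
    def __init__(self, atime, spamprob=UNKNOWN_WORD_PROB):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

def log_spam_evidence(records):
    # Toy scorer: any arithmetic on spamprob fails if the value is None.
    return sum(math.log(r.spamprob) for r in records)

# A word added by training but not yet seen by update_probabilities
# scores as an "unknown word" instead of blowing up.
fresh = WordInfo(atime=0)
print(log_spam_evidence([fresh]))

# With the old default, a concurrent scorer would hit a TypeError.
broken = WordInfo(atime=0, spamprob=None)
try:
    log_spam_evidence([broken])
except TypeError as exc:
    print("scoring failed:", exc)
```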
From sjoerd@users.sourceforge.net Fri Nov 1 16:10:18 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Fri, 01 Nov 2002 08:10:18 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.59,1.60
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13555
Modified Files:
tokenizer.py
Log Message:
Switch " and ' in url_re character class and add # ' after the re to
resync python-mode.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.59
retrieving revision 1.60
diff -C2 -d -r1.59 -r1.60
*** tokenizer.py 31 Oct 2002 15:43:55 -0000 1.59
--- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60
***************
*** 604,609 ****
# be in HTML, may or may not be in quotes, etc. If it's full of %
# escapes, cool -- that's a clue too.
! ([^\s<>'"\x7f-\xff]+) # capture the guts
! """, re.VERBOSE)
urlsep_re = re.compile(r"[;?:@&=+,$.]")
--- 604,609 ----
# be in HTML, may or may not be in quotes, etc. If it's full of %
# escapes, cool -- that's a clue too.
! ([^\s<>"'\x7f-\xff]+) # capture the guts
! """, re.VERBOSE) # '
urlsep_re = re.compile(r"[;?:@&=+,$.]")
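The change above is cosmetic for the regex engine: a character class is a set, so swapping the two quote characters does not alter what it matches. A quick check, using only the "guts" group shown in the diff (the rest of url_re is not reproduced here, and the sample strings are made up for illustration):

```python
import re

# Only the capture group from the diff; the surrounding pattern is omitted.
old_guts = re.compile(r"""([^\s<>'"\x7f-\xff]+)""")   # revision 1.59 order
new_guts = re.compile(r"""([^\s<>"'\x7f-\xff]+)""")   # revision 1.60 order

samples = [
    'href="http://example.com/a%20b?x=1"',
    "href='http://example.com/'>click",
    "no quotes at all",
]

for text in samples:
    # Member order inside [...] cannot change the result.
    assert old_guts.findall(text) == new_guts.findall(text)
    print(new_guts.findall(text))
```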
From mhammond@users.sourceforge.net Fri Nov 1 23:54:05 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 15:54:05 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.23,1.24 msgstore.py,1.16,1.17
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14570
Modified Files:
addin.py msgstore.py
Log Message:
Fix a couple of places the "multiple stores" concept fell over.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** addin.py 1 Nov 2002 14:35:05 -0000 1.23
--- addin.py 1 Nov 2002 23:54:03 -0000 1.24
***************
*** 121,125 ****
# PR_RECEIVED_BY_ENTRYID
# PR_TRANSPORT_MESSAGE_HEADERS
! msgstore_message = self.manager.message_store.GetMessage(item.EntryID)
if msgstore_message.GetField(self.manager.config.field_score_name) is not None:
# Already seen this message - user probably moving it back
--- 121,125 ----
# PR_RECEIVED_BY_ENTRYID
# PR_TRANSPORT_MESSAGE_HEADERS
! msgstore_message = self.manager.message_store.GetMessage(item)
if msgstore_message.GetField(self.manager.config.field_score_name) is not None:
# Already seen this message - user probably moving it back
***************
*** 154,158 ****
if not self.manager.config.training.train_manual_spam:
return
! msgstore_message = self.manager.message_store.GetMessage(item.EntryID)
prop = msgstore_message.GetField(self.manager.config.field_score_name)
if prop is not None:
--- 154,158 ----
if not self.manager.config.training.train_manual_spam:
return
! msgstore_message = self.manager.message_store.GetMessage(item)
prop = msgstore_message.GetField(self.manager.config.field_score_name)
if prop is not None:
***************
*** 189,193 ****
return
! msgstore_message = mgr.message_store.GetMessage(item.EntryID)
score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
new_msg = app.CreateItem(0)
--- 189,193 ----
return
! msgstore_message = mgr.message_store.GetMessage(item)
score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
new_msg = app.CreateItem(0)
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** msgstore.py 1 Nov 2002 14:35:06 -0000 1.16
--- msgstore.py 1 Nov 2002 23:54:03 -0000 1.17
***************
*** 206,211 ****
def GetMessage(self, message_id):
! # Return a single message given the ID.
! message_id = self.NormalizeID(message_id)
prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
mapi_object = self._OpenEntry(message_id)
--- 206,217 ----
def GetMessage(self, message_id):
! # Return a single message given either the ID, or an Outlook
! # message representing the object.
! if hasattr(message_id, "EntryID"):
! # A CDO object
! message_id = mapi.BinFromHex(message_id.Parent.StoreID), \
! mapi.BinFromHex(message_id.EntryID)
! else:
! message_id = self.NormalizeID(message_id)
prop_ids = PR_PARENT_ENTRYID, PR_SEARCH_KEY, PR_CONTENT_UNREAD
mapi_object = self._OpenEntry(message_id)
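The new GetMessage() accepts either an ID or an Outlook item, telling them apart by the presence of an EntryID attribute. A rough sketch of that dispatch is below. It assumes binascii.unhexlify in place of mapi.BinFromHex, and the fallback branch is a hypothetical placeholder for NormalizeID(), whose real behaviour is not shown in this diff.

```python
import binascii
from types import SimpleNamespace

def bin_from_hex(text):
    # Assumed stand-in for mapi.BinFromHex.
    return binascii.unhexlify(text)

def to_message_id(message_id, default_store_id=b"\x00"):
    if hasattr(message_id, "EntryID"):
        # An Outlook object: the store ID comes from the parent folder,
        # which is what keeps multiple stores working.
        return (bin_from_hex(message_id.Parent.StoreID),
                bin_from_hex(message_id.EntryID))
    # Hypothetical placeholder for NormalizeID(): pass (store, entry)
    # tuples through and pair a bare entry ID with a default store.
    if isinstance(message_id, tuple):
        return message_id
    return (default_store_id, message_id)

# Fake Outlook item built from plain namespaces, for illustration only.
item = SimpleNamespace(EntryID="abcd",
                       Parent=SimpleNamespace(StoreID="0102"))
print(to_message_id(item))
print(to_message_id((b"store", b"entry")))
```

Passing the item itself, rather than item.EntryID, is what lets the store ID travel with the message.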
From mhammond@users.sourceforge.net Sat Nov 2 03:12:15 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:12:15 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000/sandbox delete_outlook_field.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv30593
Modified Files:
delete_outlook_field.py
Log Message:
Fix missing quote in usage string.
Index: delete_outlook_field.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/delete_outlook_field.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** delete_outlook_field.py 1 Nov 2002 02:04:03 -0000 1.2
--- delete_outlook_field.py 2 Nov 2002 03:12:12 -0000 1.3
***************
*** 147,151 ****
of the default message store
! Eg, python\\python-dev' will locate a python-dev subfolder in a python
subfolder in your default store.
""" % os.path.basename(sys.argv[0])
--- 147,151 ----
of the default message store
! Eg, 'python\\python-dev' will locate a python-dev subfolder in a python
subfolder in your default store.
""" % os.path.basename(sys.argv[0])
From mhammond@users.sourceforge.net Sat Nov 2 03:13:24 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:13:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
dump_props.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv30848
Added Files:
dump_props.py
Log Message:
Tool to dump everything we know about a message.
--- NEW FILE: dump_props.py ---
# Dump every property we can find for a MAPI item
from win32com.client import Dispatch, constants
import pythoncom
import os, sys
from win32com.mapi import mapi, mapiutil
from win32com.mapi.mapitags import *
mapi.MAPIInitialize(None)
logonFlags = (mapi.MAPI_NO_MAIL |
mapi.MAPI_EXTENDED |
mapi.MAPI_USE_DEFAULT)
session = mapi.MAPILogonEx(0, None, None, logonFlags)
def _FindDefaultMessageStore():
tab = session.GetMsgStoresTable(0)
# Restriction for the table: get rows where PR_DEFAULT_STORE is true.
# There should be only one.
restriction = (mapi.RES_PROPERTY, # a property restriction
(mapi.RELOP_EQ, # check for equality
PR_DEFAULT_STORE, # of the PR_DEFAULT_STORE prop
(PR_DEFAULT_STORE, True))) # with True
rows = mapi.HrQueryAllRows(tab,
(PR_ENTRYID,), # columns to retrieve
restriction, # only these rows
None, # any sort order is fine
0) # any # of results is fine
# get first entry, a (property_tag, value) pair, for PR_ENTRYID
row = rows[0]
eid_tag, eid = row[0]
# Open the store.
return session.OpenMsgStore(
0, # no parent window
eid, # msg store to open
None, # IID; accept default IMsgStore
# need write access to add score fields
mapi.MDB_WRITE |
# we won't send or receive email
mapi.MDB_NO_MAIL |
mapi.MAPI_DEFERRED_ERRORS)
def _FindItemsWithValue(folder, prop_tag, prop_val):
tab = folder.GetContentsTable(0)
# Restriction for the table: get rows where our prop values match
restriction = (mapi.RES_CONTENT, # a content restriction
(mapi.FL_SUBSTRING | mapi.FL_IGNORECASE | mapi.FL_LOOSE, # fuzz level
prop_tag, # of the given prop
(prop_tag, prop_val))) # with given val
## tab.SetColumns((PR_ENTRYID,), 0)
## restriction = None
rows = mapi.HrQueryAllRows(tab,
(PR_ENTRYID,), # columns to retrieve
restriction, # only these rows
None, # any sort order is fine
0) # any # of results is fine
# get entry IDs
print rows
return [row[0][1] for row in rows]
def _FindFolderEID(name):
assert name
from win32com.mapi import exchange
if not name.startswith("\\"):
name = "\\Top Of Personal Folders\\" + name
store = _FindDefaultMessageStore()
folder_eid = exchange.HrMAPIFindFolderEx(store, "\\", name)
return folder_eid
# Also in new versions of mapiutil
def GetAllProperties(obj, make_tag_names = True):
tags = obj.GetPropList(0)
hr, data = obj.GetProps(tags)
ret = []
for tag, val in data:
if make_tag_names:
hr, tags, array = obj.GetNamesFromIDs( (tag,) )
if type(array[0][1])==type(u''):
name = array[0][1]
else:
name = mapiutil.GetPropTagName(tag)
else:
name = tag
ret.append((name, val))
return ret
def DumpProps(folder_eid, subject, shorten):
mapi_msgstore = _FindDefaultMessageStore()
mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
None,
mapi.MAPI_DEFERRED_ERRORS)
hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
name = data[0][1]
print name
eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject)
print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject)
for eid in eids:
print "Dumping item with ID", mapi.HexFromBin(eid)
item = mapi_msgstore.OpenEntry(eid,
None,
mapi.MAPI_DEFERRED_ERRORS)
for prop_name, prop_val in GetAllProperties(item):
prop_repr = repr(prop_val)
if shorten:
prop_repr = prop_repr[:50]
print "%-20s: %s" % (prop_name, prop_repr)
def usage():
msg = """\
Usage: %s [-f foldername] [-s] subject of the message
-f - Search for the message in the specified folder (default = Inbox)
-s - Shorten long property values.
Dumps all properties for all messages that match the subject. Subject
matching is substring and ignore-case.
Folder name must be a hierarchical 'path' name, using '\\'
as the path separator. If the folder name begins with a
\\, it must be a fully-qualified name, including the message
store name (eg, "Top of Public Folders"). If the path does not
begin with a \\, it is assumed to be fully-qualified from the root
of the default message store
Eg, 'python\\python-dev' will locate a python-dev subfolder in a python
subfolder in your default store.
""" % os.path.basename(sys.argv[0])
print msg
def main():
import getopt
try:
opts, args = getopt.getopt(sys.argv[1:], "f:s")
except getopt.error, e:
print e
print
usage()
sys.exit(1)
folder_name = ""
subject = " ".join(args)
if not subject:
usage()
sys.exit(1)
shorten = False
for opt, opt_val in opts:
if opt == "-f":
folder_name = opt_val
elif opt == "-s":
shorten = True
else:
print "Invalid arg"
return
if not folder_name:
folder_name = "Inbox" # Assume this exists!
eid = _FindFolderEID(folder_name)
if eid is None:
print "*** Can't find folder", folder_name
return
DumpProps(eid, subject, shorten)
if __name__=='__main__':
main()
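The per-property output line in DumpProps can be tried without MAPI. The sketch below isolates just the formatting step; format_prop is a hypothetical helper name, and the property names and values are made up for illustration.

```python
def format_prop(prop_name, prop_val, shorten):
    # Same steps as the loop body in DumpProps: repr the value,
    # optionally cut it to 50 characters, left-justify the name to 20.
    prop_repr = repr(prop_val)
    if shorten:
        prop_repr = prop_repr[:50]
    return "%-20s: %s" % (prop_name, prop_repr)

print(format_prop("PR_SUBJECT", "x" * 100, shorten=True))
print(format_prop("PR_MESSAGE_SIZE", 12345, shorten=False))
```

Note that the cut is applied to the repr, so a shortened string value loses its closing quote.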
From mhammond@users.sourceforge.net Sat Nov 2 03:18:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 19:18:10 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
dump_props.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv31673
Modified Files:
dump_props.py
Log Message:
Remove old debug code I missed.
Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** dump_props.py 2 Nov 2002 03:13:22 -0000 1.1
--- dump_props.py 2 Nov 2002 03:18:08 -0000 1.2
***************
*** 48,53 ****
prop_tag, # of the given prop
(prop_tag, prop_val))) # with given val
- ## tab.SetColumns((PR_ENTRYID,), 0)
- ## restriction = None
rows = mapi.HrQueryAllRows(tab,
(PR_ENTRYID,), # columns to retrieve
--- 48,51 ----
***************
*** 56,60 ****
0) # any # of results is fine
# get entry IDs
- print rows
return [row[0][1] for row in rows]
--- 54,57 ----
From mhammond@users.sourceforge.net Sat Nov 2 04:00:45 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 20:00:45 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv13755
Modified Files:
README.txt
Log Message:
Update to reflect the current world state.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** README.txt 21 Oct 2002 01:38:10 -0000 1.4
--- README.txt 2 Nov 2002 04:00:43 -0000 1.5
***************
*** 4,12 ****
to run the Outlook Addin you *must* have win32all-149 or later.
! ** NOTE ** - You also need CDO installed. This comes with Outlook 2k, but is
! not installed by default. Attempting to install the add-in will detect this
! situation, and print instructions how to install CDO. Note however that
! running the stand-alone scripts (see below) will generally just print the raw
! Python exception - generally a semi-incomprehensible COM exception.
Outlook Addin
--- 4,8 ----
to run the Outlook Addin you *must* have win32all-149 or later.
! CDO is no longer needed :)
Outlook Addin
***************
*** 43,54 ****
Inbox filter). You can watch as many folders for Spam as you like.
- You can define any number of filters to apply, each performing a different
- action or testing a different spam probability. You can enable and disable
- any rule, and you can "Filter Now" an entire folder in one step.
-
- Note that the rule ordering can be important, as if early rules move
- a message, later rules will not fire for that message (cos MAPI
- appears to make access to the message once moved impossible)
-
Command Line Tools
-------------------
--- 39,42 ----
***************
*** 66,76 ****
plugin must be running for filtering of new mail to occur)
- classify.py
- Creates a field in each message with the classifier score. Once run,
- the Outlook Field Chooser can be used to display, sort etc the field,
- or used to change formatting of these messages. The field will appear
- in "user defined fields"
-
-
Misc Comments
===========
--- 54,57 ----
***************
*** 78,86 ****
Somewhere over 4MB, they seem to stop working. Mark's hasn't got
that big yet - just over 2MB and going strong.
-
- Outlook will occasionally complain that folders are corrupted after running
- filter. Closing and reopening Outlook always seems to restore things,
- with no fuss. Your mileage may vary. Buyer beware. Worth what you paid.
- (Mark hasn't seen this)
Copyright transferred to PSF from Sean D. True and WebReply.com.
--- 59,62 ----
From mhammond@users.sourceforge.net Sat Nov 2 04:08:04 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 20:08:04 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv15352
Modified Files:
README.txt
Log Message:
Add known problems.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/README.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** README.txt 2 Nov 2002 04:00:43 -0000 1.5
--- README.txt 2 Nov 2002 04:08:02 -0000 1.6
***************
*** 2,9 ****
Outlook 2000, courtesy of Sean True and Mark Hammond. Note that you need
Python's win32com extensions (http://starship.python.net/crew/mhammond) and
! to run the Outlook Addin you *must* have win32all-149 or later.
CDO is no longer needed :)
Outlook Addin
==========
--- 2,12 ----
Outlook 2000, courtesy of Sean True and Mark Hammond. Note that you need
Python's win32com extensions (http://starship.python.net/crew/mhammond) and
! you *must* have win32all-149 or later.
CDO is no longer needed :)
+ See below for a list of known problems (particularly that you must manually
+ create an Outlook property before you can see the Spam scores)
+
Outlook Addin
==========
***************
*** 54,63 ****
plugin must be running for filtering of new mail to occur)
Misc Comments
===========
- Sean reports bad output saving very large classifiers in training.py.
- Somewhere over 4MB, they seem to stop working. Mark's hasn't got
- that big yet - just over 2MB and going strong.
-
Copyright transferred to PSF from Sean D. True and WebReply.com.
Licensed under PSF, see Tim Peters for IANAL interpretation.
--- 57,76 ----
plugin must be running for filtering of new mail to occur)
+ Known Problems
+ ---------------
+ * No field is created in Outlook for the Spam Score field. To create
+ the field, go to the field chooser for the folder you are interested
+ in, and create a new User Property called "Spam". Ensure the type
+ of the field is "Integer" (the last option), NOT "Number". This is only
+ necessary for you to *see* the score, not for the scoring to work.
+
+ * Filtering an Exchange Server public store appears to not work.
+
+ * Sean reports bad output saving very large classifiers in training.py.
+ Somewhere over 4MB, they seem to stop working. Mark's hasn't got
+ that big yet - just over 2MB and going strong.
+
Misc Comments
===========
Copyright transferred to PSF from Sean D. True and WebReply.com.
Licensed under PSF, see Tim Peters for IANAL interpretation.
From tim.one@comcast.net Sat Nov 2 04:12:29 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:12:29 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
In-Reply-To:
Message-ID:
[Mark Hammond]
> ...
> Modified Files:
> README.txt
> Log Message:
> Update to reflect the current world state.
> ...
> - Outlook will occasionally complain that folders are corrupted
> - after running filter. Closing and reopening Outlook always seems to
> - restore things, with no fuss. Your mileage may vary. Buyer beware.
> - Worth what you paid.
> - (Mark hasn't seen this)
I meant to mention before that I've never seen this either. Sean, do you
still see it? scanpst.exe sometimes claims there are minor inconsistencies
when I run it, but it's always done that, and AFAICT it doesn't claim it
more often now than before I started using the addin.
From mhammond@skippinet.com.au Sat Nov 2 04:18:30 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 2 Nov 2002 15:18:30 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 README.txt,1.4,1.5
In-Reply-To:
Message-ID:
> > ...
> > - Outlook will occasionally complain that folders are corrupted
> > - after running filter. Closing and reopening Outlook always seems to
> > - restore things, with no fuss. Your mileage may vary. Buyer beware.
> > - Worth what you paid.
> > - (Mark hasn't seen this)
>
> I meant to mention before that I've never seen this either. Sean, do you
> still see it? scanpst.exe sometimes claims there are minor
> inconsistencies
> when I run it, but it's always done that, and AFAICT it doesn't claim it
> more often now than before I started using the addin.
Actually, I saw similar things when using the Outlook model to scan huge
folders. Since moving to MAPI I think it will have gone away.
Mark.
From mhammond@users.sourceforge.net Sat Nov 2 05:26:55 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 21:26:55 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
dump_props.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv4243
Modified Files:
dump_props.py
Log Message:
Add support for dumping attachments too
Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** dump_props.py 2 Nov 2002 03:18:08 -0000 1.2
--- dump_props.py 2 Nov 2002 05:26:52 -0000 1.3
***************
*** 82,86 ****
return ret
! def DumpProps(folder_eid, subject, shorten):
mapi_msgstore = _FindDefaultMessageStore()
mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
--- 82,93 ----
return ret
! def DumpItemProps(item, shorten):
! for prop_name, prop_val in GetAllProperties(item):
! prop_repr = repr(prop_val)
! if shorten:
! prop_repr = prop_repr[:50]
! print "%-20s: %s" % (prop_name, prop_repr)
!
! def DumpProps(folder_eid, subject, include_attach, shorten):
mapi_msgstore = _FindDefaultMessageStore()
mapi_folder = mapi_msgstore.OpenEntry(folder_eid,
***************
*** 89,93 ****
hr, data = mapi_folder.GetProps( (PR_DISPLAY_NAME_A,), 0)
name = data[0][1]
- print name
eids = _FindItemsWithValue(mapi_folder, PR_SUBJECT_A, subject)
print "Folder '%s' has %d items matching '%s'" % (name, len(eids), subject)
--- 96,99 ----
***************
*** 97,105 ****
None,
mapi.MAPI_DEFERRED_ERRORS)
! for prop_name, prop_val in GetAllProperties(item):
! prop_repr = repr(prop_val)
! if shorten:
! prop_repr = prop_repr[:50]
! print "%-20s: %s" % (prop_name, prop_repr)
def usage():
--- 103,116 ----
None,
mapi.MAPI_DEFERRED_ERRORS)
! DumpItemProps(item, shorten)
! if include_attach:
! print
! table = item.GetAttachmentTable(0)
! rows = mapi.HrQueryAllRows(table, (PR_ATTACH_NUM,), None, None, 0)
! for row in rows:
! attach_num = row[0][1]
! print "Dumping attachment (PR_ATTACH_NUM=%d)" % (attach_num,)
! attach = item.OpenAttach(attach_num, None, mapi.MAPI_DEFERRED_ERRORS)
! DumpItemProps(attach, shorten)
def usage():
***************
*** 108,111 ****
--- 119,123 ----
-f - Search for the message in the specified folder (default = Inbox)
-s - Shorten long property values.
+ -a - Include attachments
Dumps all properties for all messages that match the subject. Subject
***************
*** 128,132 ****
import getopt
try:
! opts, args = getopt.getopt(sys.argv[1:], "f:s")
except getopt.error, e:
print e
--- 140,144 ----
import getopt
try:
! opts, args = getopt.getopt(sys.argv[1:], "af:s")
except getopt.error, e:
print e
***************
*** 141,144 ****
--- 153,157 ----
shorten = False
+ include_attach = False
for opt, opt_val in opts:
if opt == "-f":
***************
*** 146,149 ****
--- 159,164 ----
elif opt == "-s":
shorten = True
+ elif opt == "-a":
+ include_attach = True
else:
print "Invalid arg"
***************
*** 157,161 ****
print "*** Can't find folder", folder_name
return
! DumpProps(eid, subject, shorten)
if __name__=='__main__':
--- 172,176 ----
print "*** Can't find folder", folder_name
return
! DumpProps(eid, subject, include_attach, shorten)
if __name__=='__main__':
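With -a added, the option string becomes "af:s": -a and -s are flags, and -f takes a value. A small sketch of the parsing, pulled out of main() so it can be exercised directly (parse_args and the returned dict are hypothetical, introduced only for this illustration):

```python
import getopt

def parse_args(argv):
    # "af:s": -a and -s are plain flags, -f requires an argument.
    opts, args = getopt.getopt(argv, "af:s")
    folder_name = ""
    shorten = False
    include_attach = False
    for opt, opt_val in opts:
        if opt == "-f":
            folder_name = opt_val
        elif opt == "-s":
            shorten = True
        elif opt == "-a":
            include_attach = True
    return {
        "folder": folder_name or "Inbox",   # same default as main()
        "subject": " ".join(args),          # remaining words form the subject
        "shorten": shorten,
        "include_attach": include_attach,
    }

print(parse_args(["-a", "-f", "python\\python-dev", "free", "money"]))
print(parse_args(["-s", "hello"]))
```

getopt stops at the first non-option word, so everything after the options is treated as the subject.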
From mhammond@users.sourceforge.net Sat Nov 2 06:12:36 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Fri, 01 Nov 2002 22:12:36 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.17,1.18
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv13542
Modified Files:
msgstore.py
Log Message:
Correct misleading comment.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** msgstore.py 1 Nov 2002 23:54:03 -0000 1.17
--- msgstore.py 2 Nov 2002 06:12:34 -0000 1.18
***************
*** 209,213 ****
# message representing the object.
if hasattr(message_id, "EntryID"):
! # A CDO object
message_id = mapi.BinFromHex(message_id.Parent.StoreID), \
mapi.BinFromHex(message_id.EntryID)
--- 209,213 ----
# message representing the object.
if hasattr(message_id, "EntryID"):
! # An Outlook object
message_id = mapi.BinFromHex(message_id.Parent.StoreID), \
mapi.BinFromHex(message_id.EntryID)
From tim_one@users.sourceforge.net Sat Nov 2 06:53:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 22:53:26 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv21025/Outlook2000
Modified Files:
about.html
Log Message:
Added exhaustive sister-friendly instructions for creating a Spam column
in a view in a folder.
Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** about.html 1 Nov 2002 01:24:09 -0000 1.2
--- about.html 2 Nov 2002 06:53:24 -0000 1.3
***************
*** 18,25 ****
--- 18,27 ----
consider spam, and continually adapt as both your regular email and spam
patterns change.
+
Training
Due to the nature of the system, it must be trained before it can be effective.
Although the system does learn over time, when first installed it has
no knowledge of either spam or good email.
+
Initial Training
When first installed, it is recommended you perform the following steps:
***************
*** 44,47 ****
--- 46,50 ----
You can then look at and sort by the Spam field in your Inbox - this is likely
to find hidden spam that you missed from your inbox cleanup.
+
Incremental Training
When you drag a message to your Spam folder, it will be automatically trained
***************
*** 51,55 ****
the system learns what good messages look like should it incorrectly classify
it as spam or possible spam.
!
Contributions to this documentation are welcome!
--- 54,97 ----
the system learns what good messages look like should it incorrectly classify
it as spam or possible spam.
!
!
Creating a Spam Score Field
! A custom property named "Spam" is added to all Outlook messages scored.
! This is an integer in 0 (ham) through 100 (spam) inclusive.
! You can teach Outlook to display this field as a column in any table view,
! like the standard Messages view.
!
! This takes some work, and has to be done again for every folder in which
! you want to display a Spam column:
!
!
While looking at an Outlook table view (like Messages), right-click
! on the line with column headers (From, Subject, To, Received, ...).
! In the context menu that pops up, click on Field Chooser. A box
! with title Field Chooser pops up.
!
In the lower left corner of the Field Chooser box, click
! New.... A box with title New Field pops up.
!
In the Name: box, type Spam.
!
In the Type: dropdown list, select Integer. This is the
! last choice in the dropdown list.
! Do not select Number -- it won't work.
!
The Format: dropdown list should display "1,234" now. Leave it alone.
!
Click OK in the New Field box. Now you're back in the
! Field Chooser box.
!
The dropdown list at the top of the Field Chooser box should say
! User-defined fields in FOLDER now, where FOLDER is the name of the
! folder you're currently looking at (like Inbox). Below that, you
! should see a new rectangular button with a Spam label.
!
Use your mouse to drag the Spam button to the column header position
! where you want to see the Spam column. You don't have to be precise
! here -- you can rearrange or resize the column later just by dragging
! it around.
!
You're done! Close the Field Chooser box.
!
! Outlook's standard Automatic Formatting features can also be taught how
! access the value of this field; for example, you could tell Outlook to display
! rows with suspected spam messages in green italic. However, for whatever reason,
! the Outlook Rules Wizard does not allow creating rules based on user-defined
! fields. That's why this addin supplies its own filtering rules.
!
!
Contributions to this documentation are welcome!
From tim_one@users.sourceforge.net Sat Nov 2 07:01:24 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:01:24 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 about.html,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv22485/Outlook2000
Modified Files:
about.html
Log Message:
Grammar repair in new stuff.
Index: about.html
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/about.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** about.html 2 Nov 2002 06:53:24 -0000 1.3
--- about.html 2 Nov 2002 07:01:21 -0000 1.4
***************
*** 87,91 ****
You're done! Close the Field Chooser box.
! Outlook's standard Automatic Formatting features can also be taught how
access the value of this field; for example, you could tell Outlook to display
rows with suspected spam messages in green italic. However, for whatever reason,
--- 87,91 ----
You're done! Close the Field Chooser box.
! Outlook's standard Automatic Formatting features can also be taught how to
access the value of this field; for example, you could tell Outlook to display
rows with suspected spam messages in green italic. However, for whatever reason,
From mhammond@users.sourceforge.net Sat Nov 2 11:27:55 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 03:27:55 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
 dump_props.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv9291/sandbox
Modified Files:
dump_props.py
Log Message:
Beat Tim to the whitespace normalization
Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** dump_props.py 2 Nov 2002 05:26:52 -0000 1.3
--- dump_props.py 2 Nov 2002 11:27:53 -0000 1.4
***************
*** 55,59 ****
# get entry IDs
return [row[0][1] for row in rows]
!
def _FindFolderEID(name):
assert name
--- 55,59 ----
# get entry IDs
return [row[0][1] for row in rows]
!
def _FindFolderEID(name):
assert name
***************
*** 67,84 ****
# Also in new versions of mapituil
def GetAllProperties(obj, make_tag_names = True):
! tags = obj.GetPropList(0)
! hr, data = obj.GetProps(tags)
! ret = []
! for tag, val in data:
! if make_tag_names:
! hr, tags, array = obj.GetNamesFromIDs( (tag,) )
! if type(array[0][1])==type(u''):
! name = array[0][1]
! else:
! name = mapiutil.GetPropTagName(tag)
! else:
! name = tag
! ret.append((name, val))
! return ret
def DumpItemProps(item, shorten):
--- 67,84 ----
# Also in new versions of mapituil
def GetAllProperties(obj, make_tag_names = True):
! tags = obj.GetPropList(0)
! hr, data = obj.GetProps(tags)
! ret = []
! for tag, val in data:
! if make_tag_names:
! hr, tags, array = obj.GetNamesFromIDs( (tag,) )
! if type(array[0][1])==type(u''):
! name = array[0][1]
! else:
! name = mapiutil.GetPropTagName(tag)
! else:
! name = tag
! ret.append((name, val))
! return ret
def DumpItemProps(item, shorten):
***************
*** 88,92 ****
prop_repr = prop_repr[:50]
print "%-20s: %s" % (prop_name, prop_repr)
!
def DumpProps(folder_eid, subject, include_attach, shorten):
mapi_msgstore = _FindDefaultMessageStore()
--- 88,92 ----
prop_repr = prop_repr[:50]
print "%-20s: %s" % (prop_name, prop_repr)
!
def DumpProps(folder_eid, subject, include_attach, shorten):
mapi_msgstore = _FindDefaultMessageStore()
***************
*** 167,171 ****
if not folder_name:
folder_name = "Inbox" # Assume this exists!
!
eid = _FindFolderEID(folder_name)
if eid is None:
--- 167,171 ----
if not folder_name:
folder_name = "Inbox" # Assume this exists!
!
eid = _FindFolderEID(folder_name)
if eid is None:
From mhammond@users.sourceforge.net Sat Nov 2 12:09:38 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 04:09:38 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.18,1.19
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv812
Modified Files:
msgstore.py
Log Message:
Nice patch from Piers Haken that does the best we can with Exchange Server
delivered messages.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** msgstore.py 2 Nov 2002 06:12:34 -0000 1.18
--- msgstore.py 2 Nov 2002 12:09:36 -0000 1.19
***************
*** 351,355 ****
--- 351,379 ----
body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
+ # Mail delivered internally via Exchange Server etc may not have
+ # headers - fake some up.
+ if not headers:
+ headers = self._GetFakeHeaders ()
+ # Mail delivered via the Exchange Internet Mail MTA may have
+ # gibberish at the start of the headers - fix this.
+ elif headers.startswith("Microsoft Mail"):
+ headers = "X-MS-Mail-Gibberish: " + headers
return "%s\n%s\n%s" % (headers, html, body)
+
+ def _GetFakeHeaders(self):
+ # This is designed to fake up some SMTP headers for messages
+ # on an exchange server that do not have such headers of their own
+ prop_ids = PR_SUBJECT_A, PR_DISPLAY_NAME_A, PR_DISPLAY_TO_A, PR_DISPLAY_CC_A
+ hr, data = self.mapi_object.GetProps(prop_ids,0)
+ subject = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0])
+ sender = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
+ to = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
+ cc = self._GetPotentiallyLargeStringProp(prop_ids[3], data[3])
+ headers = ["X-Exchange-Message: true"]
+ if subject: headers.append("Subject: "+subject)
+ if sender: headers.append("From: "+sender)
+ if to: headers.append("To: "+to)
+ if cc: headers.append("CC: "+cc)
+ return "\n".join(headers)
def _EnsureObject(self):
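The two header cases handled by this patch can be illustrated without MAPI. The sketch below is only an approximation under stated assumptions: the property lookups are replaced by plain string arguments, and the names `fake_headers` and `normalize_headers` are hypothetical, not part of msgstore.py.

```python
# Illustrative sketch only. The MAPI property reads from the patch are
# replaced by plain string arguments; both function names are hypothetical.

def fake_headers(subject="", sender="", to="", cc=""):
    """Build minimal header lines from display properties."""
    headers = ["X-Exchange-Message: true"]
    if subject:
        headers.append("Subject: " + subject)
    if sender:
        headers.append("From: " + sender)
    if to:
        headers.append("To: " + to)
    if cc:
        headers.append("CC: " + cc)
    return "\n".join(headers)


def normalize_headers(headers, **props):
    """Return a header block, covering the two cases named in the patch."""
    if not headers:
        # Internally delivered mail: no transport headers at all.
        return fake_headers(**props)
    if headers.startswith("Microsoft Mail"):
        # Leading non-header text: wrap it in a header name so the rest
        # of the block still looks like headers.
        return "X-MS-Mail-Gibberish: " + headers
    return headers
```

Messages that already carry ordinary headers pass through unchanged, so the extra tokens appear only for Exchange-delivered mail.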
From mhammond@users.sourceforge.net Sat Nov 2 12:28:41 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 04:28:41 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/dialogs
 FilterDialog.py,1.8,1.9
 FolderSelector.py,1.6,1.7 TrainingDialog.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv9661
Modified Files:
FilterDialog.py FolderSelector.py TrainingDialog.py
Log Message:
Another nice patch from Piers Haken - use the Outlook object model for the
folder dialog. I have no idea why this is necessary for Exchange server,
but it seems OK, and is trivial to revert.
I'm certain that Exchange Server can be navigated via Ext MAPI, but I'm
happy this at least gets more people going.
Note that after applying this, the Folder dialog may not automatically
pre-select the folders you had selected (but they are still working).
However, once you have re-selected them, it does remember them again.
(It seems Outlook has done something funky with the entry IDs, and made
them binary comparable, whereas MAPI and CDO ones are not. Whatever)
Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** FilterDialog.py 1 Nov 2002 02:03:46 -0000 1.8
--- FilterDialog.py 2 Nov 2002 12:28:38 -0000 1.9
***************
*** 194,198 ****
ids = [ids]
single_select = not ids_are_list
! d = FolderSelector.FolderSelector(self.mgr.message_store.session, ids, checkbox_state=None, single_select=single_select)
if d.DoModal()==win32con.IDOK:
new_ids, include_sub = d.GetSelectedIDs()
--- 194,199 ----
ids = [ids]
single_select = not ids_are_list
! # d = FolderSelector.FolderSelector(self.mgr.message_store.session, ids, checkbox_state=None, single_select=single_select)
! d = FolderSelector.FolderSelector(self.mgr.outlook.Session, ids, checkbox_state=None, single_select=single_select)
if d.DoModal()==win32con.IDOK:
new_ids, include_sub = d.GetSelectedIDs()
Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FolderSelector.py 1 Nov 2002 05:47:59 -0000 1.6
--- FolderSelector.py 2 Nov 2002 12:28:38 -0000 1.7
***************
*** 22,25 ****
--- 22,35 ----
c.dump(level+1)
+ # Oh, lord help us.
+ # We started with a CDO version - but CDO sucks for lots of reasons I
+ # wont even start to mention.
+ # So we moved to an Extended MAPI version with is nice and fast - screams
+ # along! Except it doesn't work in all cases with Exchange (which
+ # strikes Mark as extremely strange given that the Extended MAPI Python
+ # bindings were developed against an Exchange Server - but Mark doesn't
+ # have an Exchange server handy these days, and really doesn't give a
+ # rat's arse
+ # So finally we have an Outlook object model version!
#########################################################################
## CDO version of a folder walker.
***************
*** 90,93 ****
--- 100,118 ----
return root
+ ## - An Outlook object model version
+ def _BuildFolderTreeOutlook(session, parent):
+ children = []
+ for i in range (parent.Folders.Count):
+ folder = parent.Folders [i+1]
+ spec = FolderSpec ((folder.StoreID, folder.EntryID), folder.Name.encode("mbcs", "replace"))
+ if folder.Folders != None:
+ spec.children = _BuildFolderTreeOutlook (session, folder)
+ children.append(spec)
+ return children
+
+ def BuildFolderTreeOutlook(session):
+ root = FolderSpec(None, "root")
+ root.children = _BuildFolderTreeOutlook(session, session)
+ return root
#########################################################################
***************
*** 141,146 ****
if type(id2) != type(()):
id2 = default_store_id, id2
! return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
! self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))
def InIDs(self, id, ids):
--- 166,172 ----
if type(id2) != type(()):
id2 = default_store_id, id2
! return id1 == id2
! # return self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[0]), mapi.BinFromHex(id2[0])) and \
! # self.mapi.CompareEntryIDs(mapi.BinFromHex(id1[1]), mapi.BinFromHex(id2[1]))
def InIDs(self, id, ids):
***************
*** 251,260 ****
self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
! if hasattr(self.mapi, "_oleobj_"): # Dispatch COM object
! # CDO
! tree = BuildFolderTreeCDO(self.mapi)
! else:
! # Extended MAPI.
! tree = BuildFolderTreeMAPI(self.mapi)
self._InsertSubFolders(0, tree)
self.selected_ids = [] # wipe this out while we are alive.
--- 277,287 ----
self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
! tree = BuildFolderTreeOutlook(self.mapi)
! # if hasattr(self.mapi, "_oleobj_"): # Dispatch COM object
! # # CDO
! # tree = BuildFolderTreeCDO(self.mapi)
! # else:
! # # Extended MAPI.
! # tree = BuildFolderTreeMAPI(self.mapi)
self._InsertSubFolders(0, tree)
self.selected_ids = [] # wipe this out while we are alive.
***************
*** 353,356 ****
print d.GetSelectedIDs()
if __name__=='__main__':
! TestWithMAPI()
--- 380,391 ----
print d.GetSelectedIDs()
+ def TestWithOutlook():
+ from win32com.client import Dispatch
+ outlook = Dispatch("Outlook.Application")
+ d=FolderSelector(outlook.Session, None, single_select = False)
+ d.DoModal()
+ print d.GetSelectedIDs()
+
+
if __name__=='__main__':
! TestWithOutlook()
Index: TrainingDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/TrainingDialog.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** TrainingDialog.py 1 Nov 2002 02:03:52 -0000 1.7
--- TrainingDialog.py 2 Nov 2002 12:28:38 -0000 1.8
***************
*** 105,109 ****
sub_attr = "ham_include_sub"
include_sub = getattr(self.config, sub_attr)
! d = FolderSelector.FolderSelector(self.mgr.message_store.session, l, checkbox_state=include_sub)
if d.DoModal()==win32con.IDOK:
l[:], include_sub = d.GetSelectedIDs()[:]
--- 105,110 ----
sub_attr = "ham_include_sub"
include_sub = getattr(self.config, sub_attr)
! # d = FolderSelector.FolderSelector(self.mgr.message_store.session, l, checkbox_state=include_sub)
! d = FolderSelector.FolderSelector(self.mgr.outlook.Session, l, checkbox_state=include_sub)
if d.DoModal()==win32con.IDOK:
l[:], include_sub = d.GetSelectedIDs()[:]
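The recursive walk added in FolderSelector.py relies on the Outlook collection being indexed from 1. A minimal sketch of that walk follows, written so it runs without Outlook: `FakeFolder` and `FakeFolders` are hypothetical stand-ins that only imitate the attributes the patch touches (`Folders`, `Count`, `Name`, `StoreID`, `EntryID`), and `FolderSpec` is reduced to the fields used here.

```python
# Sketch with stand-in classes. FakeFolders/FakeFolder are hypothetical and
# only imitate a 1-based COM-style collection; no Outlook is involved.

class FakeFolders:
    def __init__(self, items):
        self._items = list(items)

    @property
    def Count(self):
        return len(self._items)

    def __getitem__(self, index):
        # Imitate a collection whose first element is at index 1.
        if not 1 <= index <= len(self._items):
            raise IndexError(index)
        return self._items[index - 1]


class FakeFolder:
    def __init__(self, name, store_id, entry_id, children=()):
        self.Name = name
        self.StoreID = store_id
        self.EntryID = entry_id
        self.Folders = FakeFolders(children)


class FolderSpec:
    def __init__(self, folder_id, name):
        self.folder_id = folder_id
        self.name = name
        self.children = []


def build_children(parent):
    """Recursively convert parent.Folders into a list of FolderSpec."""
    children = []
    for i in range(parent.Folders.Count):
        folder = parent.Folders[i + 1]  # 1-based access
        spec = FolderSpec((folder.StoreID, folder.EntryID), folder.Name)
        spec.children = build_children(folder)
        children.append(spec)
    return children


def build_tree(session):
    root = FolderSpec(None, "root")
    root.children = build_children(session)
    return root
```

Because each folder ID here is a plain `(StoreID, EntryID)` tuple, two IDs can be compared with `==`, which is the simplification the patch makes in place of `CompareEntryIDs`.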
From tim_one@users.sourceforge.net Sat Nov 2 17:11:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 02 Nov 2002 09:11:50 -0800
Subject: [Spambayes-checkins]
 spambayes/Outlook2000/dialogs FolderSelector.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv19232/Outlook2000/dialogs
Modified Files:
FolderSelector.py
Log Message:
Folded long lines so I could read it better. We've got a regression
here: the folder selectors in the Training and Define Filters dialogs
still work, but in the Filter Now dialog clicking Browse dies with
Traceback (most recent call last):
  File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py", line 313, in OnInitDialog
    tree = BuildFolderTreeOutlook(self.mapi)
  File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py", line 119, in BuildFolderTreeOutlook
    root.children = _BuildFolderTreeOutlook(session, session)
  File "C:\Code\spambayes\Outlook2000\dialogs\FolderSelector.py", line 108, in _BuildFolderTreeOutlook
    for i in range(parent.Folders.Count):
AttributeError: Folders
win32ui: OnInitDialog() virtual handler
(>)
raised an exception
Index: FolderSelector.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FolderSelector.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** FolderSelector.py 2 Nov 2002 12:28:38 -0000 1.7
--- FolderSelector.py 2 Nov 2002 17:11:47 -0000 1.8
***************
*** 26,34 ****
# wont even start to mention.
# So we moved to an Extended MAPI version with is nice and fast - screams
! # along! Except it doesn't work in all cases with Exchange (which
# strikes Mark as extremely strange given that the Extended MAPI Python
# bindings were developed against an Exchange Server - but Mark doesn't
# have an Exchange server handy these days, and really doesn't give a
! # rat's arse
# So finally we have an Outlook object model version!
#########################################################################
--- 26,34 ----
# wont even start to mention.
# So we moved to an Extended MAPI version with is nice and fast - screams
! # along! Except it doesn't work in all cases with Exchange (which
# strikes Mark as extremely strange given that the Extended MAPI Python
# bindings were developed against an Exchange Server - but Mark doesn't
# have an Exchange server handy these days, and really doesn't give a
! # rat's arse ).
# So finally we have an Outlook object model version!
#########################################################################
***************
*** 69,73 ****
table = folder.GetHierarchyTable(0)
children = []
! rows = mapi.HrQueryAllRows(table, (PR_ENTRYID, PR_STORE_ENTRYID, PR_DISPLAY_NAME_A), None, None, 0)
for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
--- 69,75 ----
table = folder.GetHierarchyTable(0)
children = []
! rows = mapi.HrQueryAllRows(table, (PR_ENTRYID,
! PR_STORE_ENTRYID,
! PR_DISPLAY_NAME_A), None, None, 0)
for (eid_tag, eid),(storeeid_tag, store_eid), (name_tag, name) in rows:
folder_id = mapi.HexFromBin(store_eid), mapi.HexFromBin(eid)
***************
*** 90,95 ****
default_store_id = hex_eid
! msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | mapi.MAPI_DEFERRED_ERRORS)
! hr, data = msgstore.GetProps( ( PR_IPM_SUBTREE_ENTRYID,), 0)
subtree_eid = data[0][1]
folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
--- 92,98 ----
default_store_id = hex_eid
! msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
! mapi.MAPI_DEFERRED_ERRORS)
! hr, data = msgstore.GetProps((PR_IPM_SUBTREE_ENTRYID,), 0)
subtree_eid = data[0][1]
folder = msgstore.OpenEntry(subtree_eid, None, mapi.MAPI_DEFERRED_ERRORS)
***************
*** 103,111 ****
def _BuildFolderTreeOutlook(session, parent):
children = []
! for i in range (parent.Folders.Count):
! folder = parent.Folders [i+1]
! spec = FolderSpec ((folder.StoreID, folder.EntryID), folder.Name.encode("mbcs", "replace"))
! if folder.Folders != None:
! spec.children = _BuildFolderTreeOutlook (session, folder)
children.append(spec)
return children
--- 106,115 ----
def _BuildFolderTreeOutlook(session, parent):
children = []
! for i in range(parent.Folders.Count):
! folder = parent.Folders[i+1]
! spec = FolderSpec((folder.StoreID, folder.EntryID),
! folder.Name.encode("mbcs", "replace"))
! if folder.Folders:
! spec.children = _BuildFolderTreeOutlook(session, folder)
children.append(spec)
return children
***************
*** 128,136 ****
class FolderSelector(dialog.Dialog):
! style = win32con.DS_MODALFRAME | win32con.WS_POPUP | win32con.WS_VISIBLE | win32con.WS_CAPTION | win32con.WS_SYSMENU | win32con.DS_SETFONT
cs = win32con.WS_CHILD | win32con.WS_VISIBLE
! treestyle = cs | win32con.WS_BORDER | commctrl.TVS_HASLINES | commctrl.TVS_LINESATROOT | \
! commctrl.TVS_CHECKBOXES | commctrl.TVS_HASBUTTONS | \
! commctrl.TVS_DISABLEDRAGDROP | commctrl.TVS_SHOWSELALWAYS
dt = [
# Dialog itself.
--- 132,150 ----
class FolderSelector(dialog.Dialog):
! style = (win32con.DS_MODALFRAME |
! win32con.WS_POPUP |
! win32con.WS_VISIBLE |
! win32con.WS_CAPTION |
! win32con.WS_SYSMENU |
! win32con.DS_SETFONT)
cs = win32con.WS_CHILD | win32con.WS_VISIBLE
! treestyle = (cs |
! win32con.WS_BORDER |
! commctrl.TVS_HASLINES |
! commctrl.TVS_LINESATROOT |
! commctrl.TVS_CHECKBOXES |
! commctrl.TVS_HASBUTTONS |
! commctrl.TVS_DISABLEDRAGDROP |
! commctrl.TVS_SHOWSELALWAYS)
dt = [
# Dialog itself.
***************
*** 147,151 ****
]
! def __init__ (self, mapi, selected_ids = None, single_select = False, checkbox_state = False, checkbox_text = None, desc_noun = "Select", desc_noun_suffix = "ed"):
assert not single_select or selected_ids is None or len(selected_ids)<=1
dialog.Dialog.__init__ (self, self.dt)
--- 161,170 ----
]
! def __init__ (self, mapi, selected_ids=None,
! single_select=False,
! checkbox_state=False,
! checkbox_text=None,
! desc_noun="Select",
! desc_noun_suffix="ed"):
assert not single_select or selected_ids is None or len(selected_ids)<=1
dialog.Dialog.__init__ (self, self.dt)
***************
*** 194,198 ****
mask = state = 0
else:
! if self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
num_children_selected += 1
--- 213,218 ----
mask = state = 0
else:
! if (self.selected_ids and
! self.InIDs(child.folder_id, self.selected_ids)):
state = INDEXTOSTATEIMAGEMASK(IIL_CHECKED)
num_children_selected += 1
***************
*** 201,206 ****
mask = commctrl.TVIS_STATEIMAGEMASK
item_id = self._MakeItemParam(child)
! hitem = self.list.InsertItem(hParent, 0, (None, state, mask, text, bitmapCol, bitmapSel, cItems, item_id))
! if self.single_select and self.selected_ids and self.InIDs(child.folder_id, self.selected_ids):
self.list.SelectItem(hitem)
--- 221,236 ----
mask = commctrl.TVIS_STATEIMAGEMASK
item_id = self._MakeItemParam(child)
! hitem = self.list.InsertItem(hParent, 0,
! (None,
! state,
! mask,
! text,
! bitmapCol,
! bitmapSel,
! cItems,
! item_id))
! if (self.single_select and
! self.selected_ids and
! self.InIDs(child.folder_id, self.selected_ids)):
self.list.SelectItem(hitem)
***************
*** 232,236 ****
def _YieldCheckedChildren(self):
if self.single_select:
! # If single-select, the checked state is not used, just the selected state.
try:
h = self.list.GetSelectedItem()
--- 262,267 ----
def _YieldCheckedChildren(self):
if self.single_select:
! # If single-select, the checked state is not used, just the
! # selected state.
try:
h = self.list.GetSelectedItem()
***************
*** 271,277 ****
if self.single_select:
# Remove the checkbox style from the list for single-selection
! style = win32api.GetWindowLong(self.list.GetSafeHwnd(), win32con.GWL_STYLE)
style = style & ~commctrl.TVS_CHECKBOXES
! win32api.SetWindowLong(self.list.GetSafeHwnd(), win32con.GWL_STYLE, style)
# Hide "clear all"
self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
--- 302,311 ----
if self.single_select:
# Remove the checkbox style from the list for single-selection
! style = win32api.GetWindowLong(self.list.GetSafeHwnd(),
! win32con.GWL_STYLE)
style = style & ~commctrl.TVS_CHECKBOXES
! win32api.SetWindowLong(self.list.GetSafeHwnd(),
! win32con.GWL_STYLE,
! style)
# Hide "clear all"
self.GetDlgItem(IDC_BUTTON_CLEARALL).ShowWindow(win32con.SW_HIDE)
***************
*** 283,287 ****
# else:
# # Extended MAPI.
! # tree = BuildFolderTreeMAPI(self.mapi)
self._InsertSubFolders(0, tree)
self.selected_ids = [] # wipe this out while we are alive.
--- 317,321 ----
# else:
# # Extended MAPI.
! # tree = BuildFolderTreeMAPI(self.mapi)
self._InsertSubFolders(0, tree)
self.selected_ids = [] # wipe this out while we are alive.
***************
*** 311,315 ****
names.append(info[3])
! status_string = "%s%s %d folder" % (self.select_desc_noun, self.select_desc_noun_suffix, num_checked)
if num_checked != 1:
status_string += "s"
--- 345,351 ----
names.append(info[3])
! status_string = "%s%s %d folder" % (self.select_desc_noun,
! self.select_desc_noun_suffix,
! num_checked)
if num_checked != 1:
status_string += "s"
From tim_one@users.sourceforge.net Sat Nov 2 17:27:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 02 Nov 2002 09:27:49 -0800
Subject: [Spambayes-checkins]
 spambayes/Outlook2000/dialogs FilterDialog.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv5390/Outlook2000/dialogs
Modified Files:
FilterDialog.py
Log Message:
FilterNowDialog.OnButBrowse(): Repaired the way FolderSelector is
called so that this works again.
Index: FilterDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/FilterDialog.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** FilterDialog.py 2 Nov 2002 12:28:38 -0000 1.9
--- FilterDialog.py 2 Nov 2002 17:27:44 -0000 1.10
***************
*** 333,338 ****
import FolderSelector
filter = self.mgr.config.filter_now
! d = FolderSelector.FolderSelector(self.mgr.message_store.session, filter.folder_ids,checkbox_state=filter.include_sub)
! if d.DoModal()==win32con.IDOK:
filter.folder_ids, filter.include_sub = d.GetSelectedIDs()
self.UpdateFolderNames()
--- 333,341 ----
import FolderSelector
filter = self.mgr.config.filter_now
! # d = FolderSelector.FolderSelector(self.mgr.message_store.session, filter.folder_ids,checkbox_state=filter.include_sub)
! d = FolderSelector.FolderSelector(self.mgr.outlook.Session,
! filter.folder_ids,
! checkbox_state=filter.include_sub)
! if d.DoModal() == win32con.IDOK:
filter.folder_ids, filter.include_sub = d.GetSelectedIDs()
self.UpdateFolderNames()
From richiehindle@users.sourceforge.net Sat Nov 2 21:00:23 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Sat, 02 Nov 2002 13:00:23 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13701
Modified Files:
pop3proxy.py
Log Message:
Can now listen on the port of your choice (thanks to Tim Stone).
Now supports the 'Unsure' value for X-Hammie-Disposition.
Now less anal about correcting for the size of the added header.
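The new listening-port option can be sketched with the standard `getopt` module. This is a simplified illustration, not the proxy's actual argument handling: the `parse_args` name, the settings dictionary, and the placeholder data-file name are assumptions made for the example, and only the option letters come from the usage text.

```python
import getopt

# Placeholder; the real default comes from hammie.DEFAULTDB, whose value
# is not shown in this message.
DEFAULT_PICKLE = "example.pik"


def parse_args(argv):
    """Parse proxy options; returns (settings, remaining positional args)."""
    opts, args = getopt.getopt(argv, "htdp:l:")
    settings = {
        "pickle_name": DEFAULT_PICKLE,
        "proxy_port": 110,        # default POP3 port
        "use_db": False,
        "run_test_server": False,
        "show_help": False,
    }
    for opt, arg in opts:
        if opt == "-h":
            settings["show_help"] = True
        elif opt == "-t":
            settings["run_test_server"] = True
        elif opt == "-d":
            settings["use_db"] = True
        elif opt == "-p":
            settings["pickle_name"] = arg
        elif opt == "-l":
            settings["proxy_port"] = int(arg)
    return settings, args
```

For example, `parse_args(["-l", "8110", "pop.example.invalid"])` yields a proxy port of 8110 and leaves the server name as the single positional argument.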
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** pop3proxy.py 1 Nov 2002 09:14:47 -0000 1.8
--- pop3proxy.py 2 Nov 2002 21:00:21 -0000 1.9
***************
*** 12,18 ****
defaults to 110.
! options (the same as hammie):
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
pop3proxy -t
--- 12,19 ----
defaults to 110.
! options:
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
+ -l port : listen on this port number (default 110)
pop3proxy -t
***************
*** 39,44 ****
from Options import options
HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Yes\r\n' % hammie.DISPHEADER
--- 40,47 ----
from Options import options
+ # HEADER_EXAMPLE is the longest possible header - the length of this one
+ # is added to the size of each message.
HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Unsure\r\n' % hammie.DISPHEADER
***************
*** 58,61 ****
--- 61,65 ----
self.set_socket(s, socketMap)
self.set_reuse_addr()
+ print "Listening on port %d." % port
self.bind(('', port))
self.listen(5)
***************
*** 337,350 ****
ok, message = response.split('\n', 1)
! # Now find the spam disposition and add the header. The
! # trailing space in "No " ensures consistent lengths - this
! # is required because POP3 commands like 'STAT' and 'LIST'
! # need to be able to report the size of a message before
! # it's been classified.
prob = self.bayes.spamprob(tokenizer.tokenize(message))
! if prob > options.spam_cutoff:
disposition = "Yes"
else:
! disposition = "No "
headers, body = re.split(r'\n\r?\n', response, 1)
headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
--- 341,353 ----
ok, message = response.split('\n', 1)
! # Now find the spam disposition and add the header.
prob = self.bayes.spamprob(tokenizer.tokenize(message))
! if prob < options.ham_cutoff:
! disposition = "No"
! elif prob > options.spam_cutoff:
disposition = "Yes"
else:
! disposition = "Unsure"
!
headers, body = re.split(r'\n\r?\n', response, 1)
headers = headers + "\n" + HEADER_FORMAT % disposition + "\r\n"
***************
*** 577,581 ****
# Read the arguments.
try:
! opts, args = getopt.getopt(sys.argv[1:], 'htdp:')
except getopt.error, msg:
print >>sys.stderr, str(msg) + '\n\n' + __doc__
--- 580,584 ----
# Read the arguments.
try:
! opts, args = getopt.getopt(sys.argv[1:], 'htdp:l:')
except getopt.error, msg:
print >>sys.stderr, str(msg) + '\n\n' + __doc__
***************
*** 583,586 ****
--- 586,590 ----
pickleName = hammie.DEFAULTDB
+ proxyPort = 110
useDB = False
runTestServer = False
***************
*** 595,599 ****
elif opt == '-p':
pickleName = arg
!
# Do whatever we've been asked to do...
if not opts and not args:
--- 599,605 ----
elif opt == '-p':
pickleName = arg
! elif opt == '-l':
! proxyPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
***************
*** 609,617 ****
elif len(args) == 1:
# Named POP3 server, default port.
! main(args[0], 110, 110, pickleName, useDB)
elif len(args) == 2:
# Named POP3 server, named port.
! main(args[0], int(args[1]), 110, pickleName, useDB)
else:
--- 615,623 ----
elif len(args) == 1:
# Named POP3 server, default port.
! main(args[0], 110, proxyPort, pickleName, useDB)
elif len(args) == 2:
# Named POP3 server, named port.
! main(args[0], int(args[1]), proxyPort, pickleName, useDB)
else:
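The three-way disposition and the size correction in this change can be summarized in a few lines. The sketch below is a standalone approximation: the header name is the default shown in Options.py, while the cutoff values used in the usage note and the `reported_size` helper are assumptions for illustration.

```python
# Standalone sketch; reported_size is a hypothetical helper.
DISPHEADER = "X-Hammie-Disposition"
HEADER_FORMAT = "%s: %%s\r\n" % DISPHEADER
# The longest header that can be added; its length is used as the fixed
# size correction, so shorter dispositions simply over-report slightly.
HEADER_EXAMPLE = "%s: Unsure\r\n" % DISPHEADER


def disposition(prob, ham_cutoff, spam_cutoff):
    """Map a spam probability to 'No', 'Yes' or 'Unsure'."""
    if prob < ham_cutoff:
        return "No"
    elif prob > spam_cutoff:
        return "Yes"
    else:
        return "Unsure"


def reported_size(original_size):
    """Size to report for a message before it has been classified."""
    return original_size + len(HEADER_EXAMPLE)
```

With assumed cutoffs of 0.2 and 0.9, a score of 0.5 falls between them and is labelled "Unsure". Since the padding trick with "No " is gone, reported sizes become an upper bound rather than an exact figure.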
From mhammond@users.sourceforge.net Sun Nov 3 02:00:33 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sat, 02 Nov 2002 18:00:33 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.19,1.20
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv31898
Modified Files:
msgstore.py
Log Message:
_GetFakeHeaders must end with \n
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** msgstore.py 2 Nov 2002 12:09:36 -0000 1.19
--- msgstore.py 3 Nov 2002 02:00:31 -0000 1.20
***************
*** 375,379 ****
if to: headers.append("To: "+to)
if cc: headers.append("CC: "+cc)
! return "\n".join(headers)
def _EnsureObject(self):
--- 375,379 ----
if to: headers.append("To: "+to)
if cc: headers.append("CC: "+cc)
! return "\n".join(headers) + "\n"
def _EnsureObject(self):
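Why the trailing newline matters can be shown with a generic header parser. The sketch below uses the modern Python 3 `email` package purely as an illustration (it is not what the 2002 code ran under); the `assemble` helper is hypothetical and only reproduces the `"%s\n%s\n%s"` join from msgstore.py.

```python
from email import message_from_string
from email.errors import MissingHeaderBodySeparatorDefect


def assemble(headers, html, body):
    # Same join as in msgstore.py: headers, then HTML, then plain body.
    return "%s\n%s\n%s" % (headers, html, body)


fake = "X-Exchange-Message: true\nSubject: hi"
html = "<html><body>offer</body></html>"

# Without the trailing newline, the HTML follows the last header directly.
broken = message_from_string(assemble(fake, html, "text"))

# With it, a blank line separates the headers from the content.
fixed = message_from_string(assemble(fake + "\n", html, "text"))

broken_has_defect = any(
    isinstance(d, MissingHeaderBodySeparatorDefect) for d in broken.defects
)
```

In the fixed form the parser sees a clean header block followed by the content. In the broken form it has to guess where the headers end, and a content line that happens to look like `name: value` could be taken for a header.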
From hooft@users.sourceforge.net Sun Nov 3 13:48:49 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 03 Nov 2002 05:48:49 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.63,1.64
 hammie.py,1.33,1.34
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16667
Modified Files:
Options.py hammie.py
Log Message:
* Added options "header_spam_string", "header_unsure_string",
"header_ham_string". Defaults are set to "Yes", "Unsure", "No".
* Added options header_score_digits and header_score_logarithm. The
first is an integer telling hammie in how many digits it should show
the score. If the second option is set to "True", scores of 1.00 or
0.00 are augmented by a logarithmic "one-ness" or "zero-ness" score
(basically it shows the "number of zeros" or "number of nines" next
to the score value).
* Added support for a debugging header using the boolean hammie_debug_header
option and the string hammie_debug_header_name
* Changed hammie.py to use all of the new options
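One way to read the header_score_digits and header_score_logarithm description is sketched below. This is an interpretation of the log message, not the code committed to hammie.py: the `format_score` name, the threshold used to decide when a score "looks like" 0.00 or 1.00, and the parenthesized suffix are all assumptions.

```python
import math


def format_score(prob, digits=2, logarithm=False):
    """Format a score, optionally adding a 'number of zeros/nines' hint."""
    text = "%.*f" % (digits, prob)
    if logarithm:
        # Roughly the point below which the score prints as all zeros
        # (or, mirrored, as 1 followed by zeros).
        limit = 0.5 * 10 ** -digits
        if 0.0 < prob <= limit:
            # Count of leading zeros after the decimal point.
            text += " (%d)" % int(-math.log10(prob))
        elif 1.0 - limit <= prob < 1.0:
            # Count of leading nines after the decimal point.
            text += " (%d)" % int(-math.log10(1.0 - prob))
    return text
```

Under these assumptions, a score of 0.00003 is shown as `0.00 (4)` and 0.99997 as `1.00 (4)`, so two messages that both print as 1.00 can still be told apart.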
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** Options.py 28 Oct 2002 20:19:46 -0000 1.63
--- Options.py 3 Nov 2002 13:48:47 -0000 1.64
***************
*** 286,302 ****
[Hammie]
# The name of the header that hammie adds to an E-mail in filter mode
hammie_header_name: X-Hammie-Disposition
! # The default database path used by hammie
! persistent_storage_file: hammie.db
! # The range of clues that are added to the "hammie" header in the E-mail
# All clues that have their probability smaller than this number, or larger
# than one minus this number are added to the header such that you can see
# why spambayes thinks this is ham/spam or why it is unsure. The default is
# to show all clues, but you can reduce that by setting showclue to a lower
! # value, such as 0.1 (which Rob is using)
clue_mailheader_cutoff: 0.5
# hammie can use either a database (quick to score one message) or a pickle
# (quick to train on huge amounts of messages). Set this to True to use a
--- 286,324 ----
[Hammie]
# The name of the header that hammie adds to an E-mail in filter mode
+ # It contains the "classification" of the mail, plus the score.
hammie_header_name: X-Hammie-Disposition
! # The three disposition names are added to the header as the following
! # Three words:
! header_spam_string: Yes
! header_unsure_string: Unsure
! header_ham_string: No
! # Accuracy of the score in the header in decimal digits
! header_score_digits: 2
!
! # Set this to "True", to augment scores of 1.00 or 0.00 by a logarithmic
! # "one-ness" or "zero-ness" score (basically it shows the "number of zeros"
! # or "number of nines" next to the score value).
! header_score_logarithm: False
!
! # Enable debugging information in the header.
! hammie_debug_header: False
!
! # Name of a debugging header for spambayes hackers, showing the strongest
! # clues that have resulted in the classification in the standard header.
! hammie_debug_header_name: X-Hammie-Debug
!
! # The range of clues that are added to the "debug" header in the E-mail
# All clues that have their probability smaller than this number, or larger
# than one minus this number are added to the header such that you can see
# why spambayes thinks this is ham/spam or why it is unsure. The default is
# to show all clues, but you can reduce that by setting showclue to a lower
! # value, such as 0.1
clue_mailheader_cutoff: 0.5
+ # The default database path used by hammie
+ persistent_storage_file: hammie.db
+
# hammie can use either a database (quick to score one message) or a pickle
# (quick to train on huge amounts of messages). Set this to True to use a
***************
*** 363,366 ****
--- 385,395 ----
'clue_mailheader_cutoff': float_cracker,
'persistent_use_database': boolean_cracker,
+ 'header_spam_string': string_cracker,
+ 'header_unsure_string': string_cracker,
+ 'header_ham_string': string_cracker,
+ 'header_score_digits': int_cracker,
+ 'header_score_logarithm': boolean_cracker,
+ 'hammie_debug_header': boolean_cracker,
+ 'hammie_debug_header_name': string_cracker,
},
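The mapping from option names to "crackers" amounts to choosing a type conversion per option. A rough modern equivalent is sketched below using Python 3's `configparser`; the `KINDS` table and `load_hammie_options` are hypothetical names, and the sample text simply repeats the defaults from the diff above.

```python
import configparser

SAMPLE = """\
[Hammie]
hammie_header_name: X-Hammie-Disposition
header_spam_string: Yes
header_unsure_string: Unsure
header_ham_string: No
header_score_digits: 2
header_score_logarithm: False
hammie_debug_header: False
hammie_debug_header_name: X-Hammie-Debug
"""

# Hypothetical table playing the role of the cracker mapping.
KINDS = {
    "header_spam_string": "string",
    "header_unsure_string": "string",
    "header_ham_string": "string",
    "header_score_digits": "int",
    "header_score_logarithm": "boolean",
    "hammie_debug_header": "boolean",
    "hammie_debug_header_name": "string",
}


def load_hammie_options(text):
    """Parse the [Hammie] section and convert each value to its type."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    getters = {
        "string": parser.get,
        "int": parser.getint,
        "boolean": parser.getboolean,
    }
    return {name: getters[kind]("Hammie", name) for name, kind in KINDS.items()}
```

Keeping the type table next to the option names means a new option needs one line in the defaults and one line in the table.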
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** hammie.py 27 Oct 2002 22:56:15 -0000 1.33
--- hammie.py 3 Nov 2002 13:48:47 -0000 1.34
***************
*** 57,60 ****
--- 57,62 ----
# Name of the header to add in filter mode
DISPHEADER = options.hammie_header_name
+ DEBUGHEADER = options.hammie_debug_header_name
+ DODEBUG = options.hammie_debug_header
# Default database name
***************
*** 242,246 ****
def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
! ham_cutoff=HAM_THRESHOLD):
"""Score (judge) a message and add a disposition header.
--- 244,249 ----
def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
! ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER,
! debug=DODEBUG):
"""Score (judge) a message and add a disposition header.
***************
*** 248,253 ****
Optionally, set header to the name of the header to add, and/or
! cutoff to the probability value which must be met or exceeded
! for a message to get a 'Yes' disposition.
Returns the same message with a new disposition header.
--- 251,261 ----
Optionally, set header to the name of the header to add, and/or
! spam_cutoff/ham_cutoff to the probability values which must be met
! or exceeded for a message to get a 'Spam' or 'Ham' classification.
!
! An extra debugging header can be added if 'debug' is set to True.
! The name of the debugging header is given as 'debugheader'.
!
! All defaults for optional parameters come from the Options file.
Returns the same message with a new disposition header.
***************
*** 261,272 ****
prob, clues = self._scoremsg(msg, True)
if prob < ham_cutoff:
! disp = "No"
elif prob > spam_cutoff:
! disp = "Yes"
else:
! disp = "Unsure"
! disp += "; %.2f" % prob
! disp += "; " + self.formatclues(clues)
msg.add_header(header, disp)
return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
--- 269,291 ----
prob, clues = self._scoremsg(msg, True)
if prob < ham_cutoff:
! disp = options.header_ham_string
elif prob > spam_cutoff:
! disp = options.header_spam_string
else:
! disp = options.header_unknown_string
! disp += ("; %."+str(options.header_score_digits)+"f") % prob
! if options.header_score_logarithm:
! if prob<=0.005 and prob>0.0:
! import math
! x=-math.log10(prob)
! disp += " (%d)"%x
! if prob>=0.995 and prob<1.0:
! import math
! x=-math.log10(1.0-prob)
! disp += " (%d)"%x
msg.add_header(header, disp)
+ if debug:
+ disp = self.formatclues(clues)
+ msg.add_header(debugheader, disp)
return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
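The header-building logic in the diff above can be sketched as a standalone function. This is only an illustration, not the checked-in code: the function name `format_disposition` and the default cutoffs of 0.2 and 0.9 are assumptions made for the example, while the three strings, the digit count and the logarithm switch mirror the new options.

```python
import math


def format_disposition(prob, ham_cutoff=0.2, spam_cutoff=0.9,
                       digits=2, logarithm=False,
                       spam="Yes", unsure="Unsure", ham="No"):
    """Build a disposition header value for a score in [0.0, 1.0].

    The cutoff defaults are assumptions for this sketch; hammie reads
    its real thresholds from the options file.
    """
    if prob < ham_cutoff:
        disp = ham
    elif prob > spam_cutoff:
        disp = spam
    else:
        disp = unsure
    # "%.*f" takes the precision from the argument tuple.
    disp += "; %.*f" % (digits, prob)
    if logarithm:
        # Show roughly how many zeros (or nines) follow the decimal point
        # for scores that would otherwise print as 0.00 or 1.00.
        if 0.0 < prob <= 0.005:
            disp += " (%d)" % -math.log10(prob)
        elif 0.995 <= prob < 1.0:
            disp += " (%d)" % -math.log10(1.0 - prob)
    return disp
```

With the logarithm switch on, a score of 0.0004 is reported with the extra "(3)" suffix, which separates it from a score of 0.004 even though both print as 0.00.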
From hooft@users.sourceforge.net Sun Nov 3 14:24:38 2002
From: hooft@users.sourceforge.net (Rob W.W. Hooft)
Date: Sun, 03 Nov 2002 06:24:38 -0800
Subject: [Spambayes-checkins] spambayes hammie.py,1.34,1.35
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25907
Modified Files:
hammie.py
Log Message:
fix typo(?)
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** hammie.py 3 Nov 2002 13:48:47 -0000 1.34
--- hammie.py 3 Nov 2002 14:24:36 -0000 1.35
***************
*** 273,277 ****
disp = options.header_spam_string
else:
! disp = options.header_unknown_string
disp += ("; %."+str(options.header_score_digits)+"f") % prob
if options.header_score_logarithm:
--- 273,277 ----
disp = options.header_spam_string
else:
! disp = options.header_unsure_string
disp += ("; %."+str(options.header_score_digits)+"f") % prob
if options.header_score_logarithm:
From mhammond@users.sourceforge.net Mon Nov 4 00:41:10 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:41:10 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.20,1.21
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv26387
Modified Files:
msgstore.py
Log Message:
Allow an Outlook folder to be passed as a "folder id" (in the same way
we did that for messages).
Give __eq__ and __ne__ methods to compare folders. I'm pretty sure the
MAPI semantics are correct, but not as confident on the new rich
comparisons.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** msgstore.py 3 Nov 2002 02:00:31 -0000 1.20
--- msgstore.py 4 Nov 2002 00:41:08 -0000 1.21
***************
*** 198,202 ****
def GetFolder(self, folder_id):
# Return a single folder given the ID.
! folder_id = self.NormalizeID(folder_id)
folder = self._OpenEntry(folder_id)
table = folder.GetContentsTable(0)
--- 198,207 ----
def GetFolder(self, folder_id):
# Return a single folder given the ID.
! if hasattr(folder_id, "EntryID"):
! # An Outlook object
! folder_id = mapi.BinFromHex(folder_id.StoreID), \
! mapi.BinFromHex(folder_id.EntryID)
! else:
! folder_id = self.NormalizeID(folder_id)
folder = self._OpenEntry(folder_id)
table = folder.GetContentsTable(0)
***************
*** 248,251 ****
--- 253,265 ----
mapi.HexFromBin(self.id[1]))
+ def __eq__(self, other):
+ if other is None: return False
+ ceid = self.msgstore.session.CompareEntryIDs
+ return ceid(self.id[0], other.id[0]) and \
+ ceid(self.id[1], other.id[1])
+
+ def __ne__(self, other):
+ return not self.__eq__(other)
+
def GetID(self):
return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
***************
*** 298,301 ****
--- 312,324 ----
mapi.HexFromBin(self.id[1]))
+ def __eq__(self, other):
+ if other is None: return False
+ ceid = self.msgstore.session.CompareEntryIDs
+ return ceid(self.id[0], other.id[0]) and \
+ ceid(self.id[1], other.id[1])
+
+ def __ne__(self, other):
+ return not self.__eq__(other)
+
def GetID(self):
return mapi.HexFromBin(self.id[0]), mapi.HexFromBin(self.id[1])
***************
*** 303,307 ****
def GetOutlookItem(self):
hex_item_id = mapi.HexFromBin(self.id[1])
! store_hex_id = mapi.HexFromBin(self.id[0])
return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id)
--- 326,330 ----
def GetOutlookItem(self):
hex_item_id = mapi.HexFromBin(self.id[1])
! hex_store_id = mapi.HexFromBin(self.id[0])
return self.msgstore.outlook.Session.GetItemFromID(hex_item_id, hex_store_id)
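The comparison pattern added above delegates equality to a comparer function instead of comparing the raw ID bytes. A minimal sketch of the same idea follows. The class `FolderRef` and the case-insensitive comparer are hypothetical stand-ins for illustration only; the real code calls the MAPI session's CompareEntryIDs.

```python
class FolderRef:
    """Holds a (store_id, entry_id) pair, compared via a comparer function."""

    def __init__(self, store_id, entry_id, compare_ids):
        self.id = (store_id, entry_id)
        # compare_ids is a hypothetical stand-in for session.CompareEntryIDs.
        self._compare_ids = compare_ids

    def __eq__(self, other):
        if not isinstance(other, FolderRef):
            return False
        return (self._compare_ids(self.id[0], other.id[0]) and
                self._compare_ids(self.id[1], other.id[1]))

    def __ne__(self, other):
        # Python 2 does not derive __ne__ from __eq__, so define both.
        return not self.__eq__(other)


def same_hex_id(a, b):
    """Toy comparer: two hex strings match regardless of letter case."""
    return a.lower() == b.lower()
```

Two references built from "AB01" and "ab01" then compare equal, although the strings themselves differ.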
From mhammond@users.sourceforge.net Mon Nov 4 00:49:13 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:49:13 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/sandbox
dump_props.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/sandbox
In directory usw-pr-cvs1:/tmp/cvs-serv29119
Modified Files:
dump_props.py
Log Message:
If the property type is PT_ERROR, show the best error code repr we can.
Index: dump_props.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/sandbox/dump_props.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** dump_props.py 2 Nov 2002 11:27:53 -0000 1.4
--- dump_props.py 4 Nov 2002 00:49:11 -0000 1.5
***************
*** 66,75 ****
# Also in new versions of mapiutil
! def GetAllProperties(obj, make_tag_names = True):
tags = obj.GetPropList(0)
hr, data = obj.GetProps(tags)
ret = []
for tag, val in data:
! if make_tag_names:
hr, tags, array = obj.GetNamesFromIDs( (tag,) )
if type(array[0][1])==type(u''):
--- 66,75 ----
# Also in new versions of mapiutil
! def GetAllProperties(obj, make_pretty = True):
tags = obj.GetPropList(0)
hr, data = obj.GetProps(tags)
ret = []
for tag, val in data:
! if make_pretty:
hr, tags, array = obj.GetNamesFromIDs( (tag,) )
if type(array[0][1])==type(u''):
***************
*** 77,80 ****
--- 77,83 ----
else:
name = mapiutil.GetPropTagName(tag)
+ # pretty value transformations
+ if PROP_TYPE(tag)==PT_ERROR:
+ val = mapiutil.GetScodeString(val)
else:
name = tag
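For readers unfamiliar with MAPI property tags: the low 16 bits of a tag hold the property type and the high 16 bits hold the property ID, so the PT_ERROR test above is a mask and a compare. The sketch below shows that check in plain Python. The helper names and the use of `hex` as the error formatter are assumptions for the example; the check-in itself uses mapiutil.GetScodeString.

```python
PT_ERROR = 0x000A            # MAPI property type for "error value"
PROP_TYPE_MASK = 0x0000FFFF  # low word of a property tag is the type


def prop_type(tag):
    """Return the type half of a MAPI property tag."""
    return tag & PROP_TYPE_MASK


def prop_id(tag):
    """Return the ID half of a MAPI property tag."""
    return tag >> 16


def pretty_value(tag, val, describe_error=hex):
    """Replace an error-typed value with a readable form.

    describe_error is a placeholder formatter for this sketch; it receives
    the status code as an unsigned 32-bit number.
    """
    if prop_type(tag) == PT_ERROR:
        return describe_error(val & 0xFFFFFFFF)
    return val
```

For example, a subject property that could not be fetched comes back tagged 0x0037000A with the signed value -2147221233, which this sketch renders as 0x8004010f (MAPI_E_NOT_FOUND).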
From mhammond@users.sourceforge.net Mon Nov 4 00:50:11 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:50:11 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 manager.py,1.31,1.32
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv29458
Modified Files:
manager.py
Log Message:
Wipe outlook reference as we die.
Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.31
retrieving revision 1.32
diff -C2 -d -r1.31 -r1.32
*** manager.py 1 Nov 2002 14:35:05 -0000 1.31
--- manager.py 4 Nov 2002 00:50:09 -0000 1.32
***************
*** 239,242 ****
--- 239,243 ----
self.message_store.Close()
self.message_store = None
+ self.outlook = None
def score(self, msg, evidence=False, scale=True):
From mhammond@users.sourceforge.net Mon Nov 4 00:50:26 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:50:26 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000/images - New directory
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/images
In directory usw-pr-cvs1:/tmp/cvs-serv29597/images
Log Message:
Directory /cvsroot/spambayes/spambayes/Outlook2000/images added to the repository
From mhammond@users.sourceforge.net Mon Nov 4 00:51:18 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:51:18 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000/images delete_as_spam.bmp,NONE,1.1
recover_ham.bmp,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/images
In directory usw-pr-cvs1:/tmp/cvs-serv29827
Added Files:
delete_as_spam.bmp recover_ham.bmp
Log Message:
Some button images :)
--- NEW FILE: delete_as_spam.bmp ---
(This appears to be a binary file; contents omitted.)
--- NEW FILE: recover_ham.bmp ---
(This appears to be a binary file; contents omitted.)
From mhammond@users.sourceforge.net Mon Nov 4 00:52:12 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 16:52:12 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.24,1.25
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv29880
Modified Files:
addin.py
Log Message:
New "Delete As Spam" button, complete with button image, and the
button changes appearance and behaviour when one of the spam
folders is selected.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** addin.py 1 Nov 2002 23:54:03 -0000 1.24
--- addin.py 4 Nov 2002 00:52:10 -0000 1.25
***************
*** 1,5 ****
# SpamBayes Outlook Addin
! import sys
import warnings
--- 1,5 ----
# SpamBayes Outlook Addin
! import sys, os
import warnings
***************
*** 16,19 ****
--- 16,21 ----
import win32ui
+ import win32gui, win32con, win32clipboard # for button images!
+
# If we are not running in a console, redirect all print statements to the
# win32traceutil collector.
***************
*** 28,38 ****
! # A lovely big block that attempts to catch the most common errors - COM objects not installed.
try:
! # Support for COM objects we use.
gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0, bForDemand=True) # Outlook 9
gencache.EnsureModule('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1, bForDemand=True) # Office 9
! # The TLB defining the interfaces we implement
universal.RegisterInterfaces('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0, ["_IDTExtensibility2"])
except pythoncom.com_error, (hr, msg, exc, arg):
--- 30,40 ----
! # Attempt to catch the most common errors - COM objects not installed.
try:
! # Generate support so we get complete support including events
gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0, bForDemand=True) # Outlook 9
gencache.EnsureModule('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1, bForDemand=True) # Office 9
! # Register what vtable based interfaces we need to implement.
universal.RegisterInterfaces('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0, ["_IDTExtensibility2"])
except pythoncom.com_error, (hr, msg, exc, arg):
***************
*** 46,76 ****
if exc:
print "Exception: %s" % (exc)
! print "Sorry, I can't be more help, but I can't continue while I have this error."
sys.exit(1)
! # Something that should be in win32com in some form or another.
def CastToClone(ob, target):
"""'Cast' a COM object to another type"""
- # todo - should support target being an IID
if hasattr(target, "index"): # string like
# for now, we assume makepy for this to work.
if not ob.__class__.__dict__.has_key("CLSID"):
- # Eeek - no makepy support - try and build it.
ob = gencache.EnsureDispatch(ob)
if not ob.__class__.__dict__.has_key("CLSID"):
raise ValueError, "Must be a makepy-able object for this to work"
clsid = ob.CLSID
- # Lots of hoops to support "demand-build" - ie, generating
- # code for an interface first time it is used. We assume the
- # interface name exists in the same library as the object.
- # This is generally the case - only referenced typelibs may be
- # a problem, and we can handle that later. Maybe
- # So get the generated module for the library itself, then
- # find the interface CLSID there.
mod = gencache.GetModuleForCLSID(clsid)
- # Get the 'root' module.
mod = gencache.GetModuleForTypelib(mod.CLSID, mod.LCID,
mod.MajorVersion, mod.MinorVersion)
- # Find the CLSID of the target
# XXX - should not be looking in VTables..., but no general map currently exists
# (Fixed in win32all!)
--- 48,69 ----
if exc:
print "Exception: %s" % (exc)
! print "Sorry I can't be more help, but I can't continue while I have this error."
sys.exit(1)
! # A couple of functions that are in new win32all, but we don't want to
! # force people to upgrade if we can avoid it.
! # NOTE: Most docstrings and comments removed - see the win32all version
def CastToClone(ob, target):
"""'Cast' a COM object to another type"""
if hasattr(target, "index"): # string like
# for now, we assume makepy for this to work.
if not ob.__class__.__dict__.has_key("CLSID"):
ob = gencache.EnsureDispatch(ob)
if not ob.__class__.__dict__.has_key("CLSID"):
raise ValueError, "Must be a makepy-able object for this to work"
clsid = ob.CLSID
mod = gencache.GetModuleForCLSID(clsid)
mod = gencache.GetModuleForTypelib(mod.CLSID, mod.LCID,
mod.MajorVersion, mod.MinorVersion)
# XXX - should not be looking in VTables..., but no general map currently exists
# (Fixed in win32all!)
***************
*** 81,85 ****
mod = gencache.GetModuleForCLSID(target_clsid)
target_class = getattr(mod, target)
- # resolve coclass to interface
target_class = getattr(target_class, "default_interface", target_class)
return target_class(ob) # auto QI magic happens
--- 74,77 ----
***************
*** 90,93 ****
--- 82,118 ----
CastTo = CastToClone
+ # Something else in later win32alls - like "DispatchWithEvents", but the
+ # returned object is not both the Dispatch *and* the event handler
+ def WithEventsClone(clsid, user_event_class):
+ clsid = getattr(clsid, "_oleobj_", clsid)
+ disp = Dispatch(clsid)
+ if not disp.__dict__.get("CLSID"): # Eeek - no makepy support - try and build it.
+ try:
+ ti = disp._oleobj_.GetTypeInfo()
+ disp_clsid = ti.GetTypeAttr()[0]
+ tlb, index = ti.GetContainingTypeLib()
+ tla = tlb.GetLibAttr()
+ mod = gencache.EnsureModule(tla[0], tla[1], tla[3], tla[4])
+ disp_class = gencache.GetClassForProgID(str(disp_clsid))
+ except pythoncom.com_error:
+ raise TypeError, "This COM object can not automate the makepy process - please run makepy manually for this object"
+ else:
+ disp_class = disp.__class__
+ clsid = disp_class.CLSID
+ import new
+ events_class = getevents(clsid)
+ if events_class is None:
+ raise ValueError, "This COM object does not support events."
+ result_class = new.classobj("COMEventClass", (events_class, user_event_class), {})
+ instance = result_class(disp) # This only calls the first base class __init__.
+ if hasattr(user_event_class, "__init__"):
+ user_event_class.__init__(instance)
+ return instance
+
+ try:
+ from win32com.client import WithEvents
+ except ImportError: # appears in 151 and later.
+ WithEvents = WithEventsClone
+
# Whew - we seem to have all the COM support we need - let's rock!
***************
*** 97,101 ****
self.handler = handler
self.args = args
!
def OnClick(self, button, cancel):
self.handler(*self.args)
--- 122,127 ----
self.handler = handler
self.args = args
! def Close(self):
! self.handler = self.args = None
def OnClick(self, button, cancel):
self.handler(*self.args)
***************
*** 107,110 ****
--- 133,138 ----
self.manager = manager
self.target = target
+ def Close(self):
+ self.application = self.manager = self.target = None
class FolderItemsEvent(_BaseItemsEvent):
***************
*** 172,195 ****
assert train.been_trained_as_spam(msgstore_message, self.manager)
def ShowClues(mgr, app):
from cgi import escape
! sel = app.ActiveExplorer().Selection
! if sel.Count == 0:
! win32ui.MessageBox("No items are selected", "No selection")
! return
! if sel.Count > 1:
! win32ui.MessageBox("Please select a single item", "Large selection")
! return
!
! item = sel.Item(1)
! if item.Class != constants.olMail:
! win32ui.MessageBox("This function can only be performed on mail items",
! "Not a mail message")
return
!
! msgstore_message = mgr.message_store.GetMessage(item)
score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
new_msg = app.CreateItem(0)
body = ["
Spam Score: %g
" % score]
push = body.append
--- 200,217 ----
assert train.been_trained_as_spam(msgstore_message, self.manager)
+ # Event function fired from the "Show Clues" UI items.
def ShowClues(mgr, app):
from cgi import escape
! msgstore_message = mgr.addin.GetSelectedMessages(False)
! if msgstore_message is None:
return
! item = msgstore_message.GetOutlookItem()
score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
new_msg = app.CreateItem(0)
+ # NOTE: Silly Outlook always switches the message editor back to RTF
+ # once the Body property has been set. Thus, there is no reasonable
+ # way to get this as text only. Next best then is to use HTML, 'cos at
+ # least we know how to exploit it!
body = ["
Spam Score: %g
" % score]
push = body.append
***************
*** 210,215 ****
new_msg.Subject = "Spam Clues: " + item.Subject
! # Stupid outlook always switches to RTF :( Work-around
! ## new_msg.Body = body
new_msg.HTMLBody = "" + body + ""
# Attach the source message to it
--- 232,236 ----
new_msg.Subject = "Spam Clues: " + item.Subject
! # As above, use HTMLBody else Outlook refuses to behave.
new_msg.HTMLBody = "" + body + ""
# Attach the source message to it
***************
*** 218,221 ****
--- 239,359 ----
new_msg.Display()
+ # The "Delete As Spam" and "Recover Spam" button
+ # The event from Outlook's explorer that our folder has changed.
+ class ButtonDeleteAsExplorerEvent:
+ def Init(self, but):
+ self.but = but
+ def Close(self):
+ self.but = None
+ def OnFolderSwitch(self):
+ self.but._UpdateForFolderChange()
+
+ class ButtonDeleteAsEvent:
+ def Init(self, manager, application, explorer):
+ # NOTE - keeping a reference to 'explorer' in this event
+ # appears to cause an Outlook circular reference, and outlook
+ # never terminates (it does close, but the process remains alive)
+ # This is why we needed to use WithEvents, so the event class
+ # itself doesn't keep such a reference (and we need to keep a ref
+ # to the event class so it doesn't auto-disconnect!)
+ self.manager = manager
+ self.application = application
+ self.explorer_events = WithEvents(explorer,
+ ButtonDeleteAsExplorerEvent)
+ self.set_for_as_spam = None
+ self.explorer_events.Init(self)
+ self._UpdateForFolderChange()
+
+ def Close(self):
+ self.manager = self.application = self.explorer = None
+
+ def _UpdateForFolderChange(self):
+ explorer = self.application.ActiveExplorer()
+ if explorer is None:
+ print "** Folder Change, but don't have an explorer"
+ return
+ outlook_folder = explorer.CurrentFolder
+ is_spam = False
+ if outlook_folder is not None:
+ mapi_folder = self.manager.message_store.GetFolder(outlook_folder)
+ look_id = self.manager.config.filter.spam_folder_id
+ if look_id:
+ look_folder = self.manager.message_store.GetFolder(look_id)
+ if mapi_folder == look_folder:
+ is_spam = True
+ if not is_spam:
+ look_id = self.manager.config.filter.unsure_folder_id
+ if look_id:
+ look_folder = self.manager.message_store.GetFolder(look_id)
+ if mapi_folder == look_folder:
+ is_spam = True
+ if is_spam:
+ set_for_as_spam = False
+ else:
+ set_for_as_spam = True
+ if set_for_as_spam != self.set_for_as_spam:
+ if set_for_as_spam:
+ image = "delete_as_spam.bmp"
+ self.Caption = "Delete As Spam"
+ self.TooltipText = \
+ "Move the selected message to the Spam folder,\n" \
+ "and train the system that this is Spam."
+ else:
+ image = "recover_ham.bmp"
+ self.Caption = "Recover from Spam"
+ self.TooltipText = \
+ "Recovers the selected item back to the folder\n" \
+ "it was filtered from (or to the Inbox if this\n" \
+ "folder is not known), and trains the system that\n" \
+ "this is a good message\n"
+ # Set the image.
+ print "Setting image to", image
+ SetButtonImage(self, image)
+ self.set_for_as_spam = set_for_as_spam
+
+ def OnClick(self, button, cancel):
+ msgstore = self.manager.message_store
+ msgstore_messages = self.manager.addin.GetSelectedMessages(True)
+ if not msgstore_messages:
+ return
+ if self.set_for_as_spam:
+ # Delete this item as spam.
+ spam_folder_id = self.manager.config.filter.spam_folder_id
+ spam_folder = msgstore.GetFolder(spam_folder_id)
+ if not spam_folder:
+ win32ui.MessageBox("You must configure the Spam folder",
+ "Invalid Configuration")
+ return
+ import train
+ for msgstore_message in msgstore_messages:
+ # Must train before moving, else we lose the message!
+ print "Training on message - ",
+ if train.train_message(msgstore_message, True, self.manager):
+ print "trained as spam"
+ else:
+ print "already was trained as spam"
+ # Now move it.
+ msgstore_message.MoveTo(spam_folder)
+ else:
+ win32ui.MessageBox("Please be patient ")
+
+ # Helpers to work with images on buttons/toolbars.
+ def SetButtonImage(button, fname):
+ # whew - http://support.microsoft.com/default.aspx?scid=KB;EN-US;q288771
+ # shows how to make a transparent bmp.
+ # Also note that the clipboard takes ownership of the handle -
+ # thus, we cannot simply perform this load once and reuse the image.
+ if not os.path.isabs(fname):
+ fname = os.path.join( os.path.dirname(__file__), "images", fname)
+ if not os.path.isfile(fname):
+ print "WARNING - Trying to use image '%s', but it doesn't exist" % (fname,)
+ return None
+ handle = win32gui.LoadImage(0, fname, win32con.IMAGE_BITMAP, 0, 0, win32con.LR_DEFAULTSIZE | win32con.LR_LOADFROMFILE)
+ win32clipboard.OpenClipboard()
+ win32clipboard.SetClipboardData(win32con.CF_BITMAP, handle)
+ win32clipboard.CloseClipboard()
+ button.Style = constants.msoButtonIconAndCaption
+ button.PasteFace()
+
# The outlook Plugin COM object itself.
class OutlookAddin:
***************
*** 247,250 ****
--- 385,396 ----
bars = activeExplorer.CommandBars
toolbar = bars.Item("Standard")
+ # Add our "Delete as ..." button
+ button = toolbar.Controls.Add(Type=constants.msoControlButton, Temporary=True)
+ # Hook events for the item
+ button.BeginGroup = True
+ button = DispatchWithEvents(button, ButtonDeleteAsEvent)
+ button.Init(self.manager, application, activeExplorer)
+ self.buttons.append(button)
+
# Add a pop-up menu to the toolbar
popup = toolbar.Controls.Add(Type=constants.msoControlPopup, Temporary=True)
***************
*** 323,326 ****
--- 469,494 ----
return new_hooks
+ def GetSelectedMessages(self, allow_multi = True, explorer = None):
+ if explorer is None:
+ explorer = self.application.ActiveExplorer()
+ sel = explorer.Selection
+ if sel.Count > 1 and not allow_multi:
+ win32ui.MessageBox("Please select a single item", "Large selection")
+ return None
+
+ ret = []
+ for i in range(sel.Count):
+ item = sel.Item(i+1)
+ if item.Class == constants.olMail:
+ msgstore_message = self.manager.message_store.GetMessage(item)
+ ret.append(msgstore_message)
+
+ if len(ret) == 0:
+ win32ui.MessageBox("No mail items are selected", "No selection")
+ return None
+ if allow_multi:
+ return ret
+ return ret[0]
+
def OnDisconnection(self, mode, custom):
print "SpamAddin - Disconnecting from Outlook"
***************
*** 331,336 ****
self.manager.Close()
self.manager = None
! self.buttons = None
!
print "Addin terminating: %d COM client and %d COM servers exist." \
% (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount())
--- 499,506 ----
self.manager.Close()
self.manager = None
! if self.buttons:
! for button in self.buttons:
! button.Close()
! self.buttons = None
print "Addin terminating: %d COM client and %d COM servers exist." \
% (pythoncom._GetInterfaceCount(), pythoncom._GetGatewayCount())
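The core trick in WithEventsClone above is to build a class at run time that inherits from both the generated events class and the user's handler class, then run the user's initialiser by hand because only the first base's __init__ runs on construction. Below is a sketch of that technique in isolation, using `type()` in place of the Python 2 `new.classobj`. `FakeEvents` and `FolderSwitchHandler` are hypothetical stand-ins with no COM behind them.

```python
def with_events(source, events_class, user_event_class):
    """Combine an events class and a user handler class into one object."""
    result_class = type("COMEventClass", (events_class, user_event_class), {})
    # Constructing the instance runs events_class.__init__ only, because
    # that class comes first in the method resolution order and does not
    # chain to the next base.
    instance = result_class(source)
    # Run the user's own initialiser explicitly, if it defines one.
    if "__init__" in vars(user_event_class):
        user_event_class.__init__(instance)
    return instance


class FakeEvents:
    """Hypothetical stand-in for a makepy-generated events class."""

    def __init__(self, source):
        self.source = source


class FolderSwitchHandler:
    """Hypothetical user handler that counts folder switches."""

    def __init__(self):
        self.switches = 0

    def OnFolderSwitch(self):
        self.switches += 1
```

The returned object is the event sink only. It is not also a proxy for the source object, which matches the distinction from DispatchWithEvents drawn in the comment above WithEventsClone.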
From mhammond@users.sourceforge.net Mon Nov 4 01:12:56 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Sun, 03 Nov 2002 17:12:56 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv2046
Modified Files:
train.py
Log Message:
Fix the root of my:
File "F:\src\spambayes\classifier.py", line 450, in _getclues
distance = abs(prob - 0.5)
Exception - problem is that we trained, but didn't update probabilities -
thus, we failed for every new word seen only since the last complete
retrain.
There may be a case for _getclues() to detect a probability of None
and call update_probabilities() automatically; either that, or just
keep throwing vague exceptions.
Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** train.py 31 Oct 2002 22:03:35 -0000 1.12
--- train.py 4 Nov 2002 01:12:53 -0000 1.13
***************
*** 19,23 ****
return spam == True
! def train_message(msg, is_spam, mgr):
# Train an individual message.
# Returns True if newly added (message will be correctly
--- 19,23 ----
return spam == True
! def train_message(msg, is_spam, mgr, update_probs = True):
# Train an individual message.
# Returns True if newly added (message will be correctly
***************
*** 41,44 ****
--- 41,47 ----
mgr.bayes.learn(tokens, is_spam, False)
mgr.message_db[msg.searchkey] = is_spam
+ if update_probs:
+ mgr.bayes.update_probabilities()
+
mgr.bayes_dirty = True
return True
***************
*** 51,55 ****
progress.tick()
try:
! if train_message(message, isspam, mgr):
num_added += 1
except:
--- 54,58 ----
progress.tick()
try:
! if train_message(message, isspam, mgr, False):
num_added += 1
except:
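The shape of this fix, updating probabilities after every single-message train but only once after a batch, can be sketched with a toy classifier. `ToyBayes` below is a hypothetical stand-in: its probability is a plain spam/(spam+ham) ratio, not the real classifier's calculation, and the function signatures are simplified.

```python
class ToyBayes:
    """Toy classifier: counts per word, probabilities computed on demand."""

    def __init__(self):
        self.counts = {}   # word -> [spam_count, ham_count]
        self.probs = {}    # word -> probability, None until updated
        self.updates = 0   # how often update_probabilities() ran

    def learn(self, tokens, is_spam):
        for word in tokens:
            pair = self.counts.setdefault(word, [0, 0])
            pair[0 if is_spam else 1] += 1
            # A word seen since the last update has no probability yet.
            self.probs.setdefault(word, None)

    def update_probabilities(self):
        self.updates += 1
        for word, (spam, ham) in self.counts.items():
            self.probs[word] = spam / float(spam + ham)


def train_message(bayes, tokens, is_spam, update_probs=True):
    """Train on one message and, by default, refresh probabilities."""
    bayes.learn(tokens, is_spam)
    if update_probs:
        bayes.update_probabilities()


def train_batch(bayes, messages):
    """Train on many messages, refreshing probabilities once at the end."""
    for tokens, is_spam in messages:
        train_message(bayes, tokens, is_spam, update_probs=False)
    bayes.update_probabilities()
```

Without the refresh, a newly seen word keeps a probability of None, which is what made `abs(prob - 0.5)` fail in the traceback quoted in the log message.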
From jhylton@users.sourceforge.net Mon Nov 4 04:36:01 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:36:01 -0800
Subject: [Spambayes-checkins] spambayes/pspam - New directory
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv19246/pspam
Log Message:
Directory /cvsroot/spambayes/spambayes/pspam added to the repository
From jhylton@users.sourceforge.net Mon Nov 4 04:42:44 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:42:44 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam - New directory
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv21182/pspam/pspam
Log Message:
Directory /cvsroot/spambayes/spambayes/pspam/pspam added to the repository
From jhylton@users.sourceforge.net Mon Nov 4 04:44:22 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:44:22 -0800
Subject: [Spambayes-checkins]
spambayes/pspam/pspam __init__.py,NONE,1.1 database.py,NONE,1.1
folder.py,NONE,1.1 message.py,NONE,1.1 options.py,NONE,1.1
profile.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv21558/pspam/pspam
Added Files:
__init__.py database.py folder.py message.py options.py
profile.py
Log Message:
Initial checkin of pspam code.
--- NEW FILE: __init__.py ---
"""Package for interacting with VM folders.
Design notes go here.
Use ZODB to store training data and classifier.
The spam and ham data are culled from sets of folders. The actual
tokenized messages are stored in a training database. When the folder
changes, the training data is updated.
- Updates are incremental.
- Changes to a folder are detected based on mtime and folder size.
- The contents of the folder are keyed on message-id.
- If a message is removed from a folder, it is removed from training data.
"""
--- NEW FILE: database.py ---
from pspam.options import options
import ZODB
from ZEO.ClientStorage import ClientStorage
import zLOG
import os
def logging():
os.environ["STUPID_LOG_FILE"] = options.event_log_file
os.environ["STUPID_LOG_SEVERITY"] = str(options.event_log_severity)
zLOG.initialize()
def open():
cs = ClientStorage(options.zeo_addr)
db = ZODB.DB(cs, cache_size=options.cache_size)
return db
--- NEW FILE: folder.py ---
import ZODB
from Persistence import Persistent
from BTrees.OOBTree import OOBTree, OOSet, difference
import email
import mailbox
import os
import stat
from pspam.message import PMessage
def factory(fp):
try:
return email.message_from_file(fp, PMessage)
except email.Errors.MessageError, msg:
print msg
return PMessage()
class Folder(Persistent):
def __init__(self, path):
self.path = path
self.mtime = 0
self.size = 0
self.messages = OOBTree()
def _stat(self):
t = os.stat(self.path)
self.mtime = t[stat.ST_MTIME]
self.size = t[stat.ST_SIZE]
def changed(self):
t = os.stat(self.path)
if (t[stat.ST_MTIME] != self.mtime
or t[stat.ST_SIZE] != self.size):
return True
else:
return False
def read(self):
"""Return messages added and removed from folder.
Two sets of message objects are returned. The first set is
messages that were added to the folder since the last read.
The second set is the messages that were removed from the
folder since the last read.
The code assumes messages are added and removed but not edited.
"""
mbox = mailbox.UnixMailbox(open(self.path, "rb"), factory)
self._stat()
cur = OOSet()
new = OOSet()
while 1:
msg = mbox.next()
if msg is None:
break
msgid = msg["message-id"]
cur.insert(msgid)
if not self.messages.has_key(msgid):
self.messages[msgid] = msg
new.insert(msg)
removed = difference(self.messages, cur)
for msgid in removed.keys():
del self.messages[msgid]
# XXX perhaps just return the OOBTree for removed?
return new, OOSet(removed.values())
if __name__ == "__main__":
f = Folder("/home/jeremy/Mail/INBOX")
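The two ideas in folder.py, detecting a change from mtime and size and then diffing the message IDs, can be sketched without ZODB using plain tuples and sets. The function names here are assumptions for the example and are not part of the pspam package.

```python
import os


def snapshot(path):
    """Return the (mtime, size) pair used to detect folder changes."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)


def has_changed(old, new):
    """A folder counts as changed if either mtime or size differs."""
    return old != new


def diff_ids(known, current):
    """Return (added, removed) message-id sets between two reads."""
    known = set(known)
    current = set(current)
    return current - known, known - current
```

As in Folder.read(), this treats messages as added or removed but never edited: a message whose body changes while its message-id stays the same is not noticed.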
--- NEW FILE: message.py ---
import ZODB
from Persistence import Persistent
from email.Message import Message
class PMessage(Message, Persistent):
def __hash__(self):
return id(self)
--- NEW FILE: options.py ---
from Options import options, all_options, \
boolean_cracker, float_cracker, int_cracker, string_cracker
from sets import Set
all_options["Score"] = {'max_ham': float_cracker,
'min_spam': float_cracker,
}
all_options["Train"] = {'folder_dir': string_cracker,
'spam_folders': ('get', lambda s: Set(s.split())),
'ham_folders': ('get', lambda s: Set(s.split())),
}
all_options["Proxy"] = {'server': string_cracker,
'server_port': int_cracker,
'proxy_port': int_cracker,
'log_pop_session': boolean_cracker,
'log_pop_session_file': string_cracker,
}
all_options["ZODB"] = {'zeo_addr': string_cracker,
'event_log_file': string_cracker,
'event_log_severity': int_cracker,
'cache_size': int_cracker,
}
import os
options.mergefiles("vmspam.ini")
def mergefile(p):
options.mergefiles(p)
--- NEW FILE: profile.py ---
"""Spam/ham profile for a single VM user."""
import ZODB
from ZODB.PersistentList import PersistentList
from Persistence import Persistent
from BTrees.OOBTree import OOBTree
import classifier
from tokenizer import tokenize
from pspam.folder import Folder
import os
def open_folders(dir, names, klass):
L = []
for name in names:
path = os.path.join(dir, name)
L.append(klass(path))
return L
import time
_start = None
def log(s):
global _start
if _start is None:
_start = time.time()
print round(time.time() - _start, 2), s
class IterOOBTree(OOBTree):
def iteritems(self):
return self.items()
class WordInfo(Persistent):
def __init__(self, atime, spamprob=None):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
self.spamprob = spamprob
def __repr__(self):
return "WordInfo%r" % repr((self.atime, self.spamcount,
self.hamcount, self.killcount,
self.spamprob))
class PBayes(classifier.Bayes, Persistent):
WordInfoClass = WordInfo
def __init__(self):
classifier.Bayes.__init__(self)
self.wordinfo = IterOOBTree()
# XXX what about the getstate and setstate defined in base class
class Profile(Persistent):
FolderClass = Folder
def __init__(self, folder_dir):
self._dir = folder_dir
self.classifier = PBayes()
self.hams = PersistentList()
self.spams = PersistentList()
def add_ham(self, folder):
p = os.path.join(self._dir, folder)
f = self.FolderClass(p)
self.hams.append(f)
def add_spam(self, folder):
p = os.path.join(self._dir, folder)
f = self.FolderClass(p)
self.spams.append(f)
def update(self):
"""Update classifier from current folder contents."""
changed1 = self._update(self.hams, False)
changed2 = self._update(self.spams, True)
if changed1 or changed2:
self.classifier.update_probabilities()
get_transaction().commit()
log("updated probabilities")
def _update(self, folders, is_spam):
changed = False
for f in folders:
log("update from %s" % f.path)
added, removed = f.read()
if added:
log("added %d" % len(added))
if removed:
log("removed %d" % len(removed))
get_transaction().commit()
if not (added or removed):
continue
changed = True
# It's important not to commit a transaction until
# after update_probabilities is called in update().
# Otherwise some new entries will cause scoring to fail.
for msg in added.keys():
self.classifier.learn(tokenize(msg), is_spam, False)
del added
get_transaction().commit(1)
log("learned")
for msg in removed.keys():
self.classifier.unlearn(tokenize(msg), is_spam, False)
if removed:
log("unlearned")
del removed
get_transaction().commit(1)
return changed
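
The added/removed bookkeeping in Folder.read() above can be sketched with
plain dicts in place of the ZODB OOBTree/OOSet types. This is only an
illustration under stated assumptions: the name read_folder and its two
arguments are hypothetical and not part of pspam, it is written for current
Python rather than the Python of this checkin, and, like the original, it
assumes messages are added or removed but never edited.

```python
def read_folder(known, current):
    """Return (added, removed) messages since the last read.

    known   -- dict mapping message-id -> message remembered from the
               previous read; updated in place (hypothetical stand-in
               for Folder.messages)
    current -- dict mapping message-id -> message found in the folder now
    """
    added = []
    for msgid, msg in current.items():
        if msgid not in known:
            # First time this message-id is seen: remember it.
            known[msgid] = msg
            added.append(msg)

    # Anything remembered but no longer present has been removed.
    removed_ids = [msgid for msgid in known if msgid not in current]
    removed = [known.pop(msgid) for msgid in removed_ids]
    return added, removed
```

The caller would then learn each message in the first list and unlearn each
message in the second, which is what Profile._update() does with the two
sets it gets back.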
From jhylton@users.sourceforge.net Mon Nov 4 04:44:22 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sun, 03 Nov 2002 20:44:22 -0800
Subject: [Spambayes-checkins] spambayes/pspam README.txt,NONE,1.1
pop.py,NONE,1.1 vmspam.ini,NONE,1.1 zeo.sh,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv21558/pspam
Added Files:
README.txt pop.py scoremsg.py update.py vmspam.ini zeo.sh
Log Message:
Initial checkin of pspam code.
--- NEW FILE: README.txt ---
pspam: persistent spambayes filtering system
--------------------------------------------
pspam uses a POP proxy to score incoming messages, a set of VM folders
to manage training data, and a ZODB database to manage data used by
the various applications.
The current code only works with a patched version of classifier.py.
Remove the object base class & change the class used to create new
WordInfo objects.
This directory contains:
pspam -- a Python package
pop.py -- a POP proxy based on SocketServer
scoremsg.py -- prints the evidence for a single message read from stdin
update.py -- a script to update training data from folders
vmspam.ini -- a sample configuration file
zeo.sh -- a script to start a ZEO server
The code depends on ZODB3, which you can download from
http://www.zope.org/Products/StandaloneZODB.
--- NEW FILE: pop.py ---
"""Spam-filtering proxy for a POP3 server.
The implementation uses the SocketServer module to run a
multi-threaded POP3 proxy. It adds an X-Spambayes header with a spam
probability. It scores a message using a persistent spambayes
classifier loaded from a ZEO server.
The strategy for adding spam headers is from Richie Hindle's
pop3proxy.py. The STAT, LIST, RETR, and TOP commands are intercepted
to change the number of bytes the client is told to expect and/or to
insert the spam header.
XXX A POP3 server sometimes adds the number of bytes in the +OK
response to some commands when the POP3 spec doesn't require it to.
In those cases, the proxy does not rewrite the number of bytes. I
assume the clients won't be confused by this behavior, because they
shouldn't be expecting to see the number of bytes.
POP3 is documented in RFC 1939.
"""
import SocketServer
import asyncore
import cStringIO
import email
import re
import socket
import sys
import threading
import time
import ZODB
from ZEO.ClientStorage import ClientStorage
import zLOG
from tokenizer import tokenize
import pspam.database
from pspam.options import options
HEADER = "X-Spambayes: %5.3f\r\n"
HEADER_SIZE = len(HEADER % 0.0)
class POP3ProxyServer(SocketServer.ThreadingTCPServer):
allow_reuse_address = True
def __init__(self, addr, handler, classifier, real_server, log, zodb):
SocketServer.ThreadingTCPServer.__init__(self, addr, handler)
self.classifier = classifier
self.pop_server = real_server
self.log = log
self.zodb = zodb
class LogWrapper:
def __init__(self, log, file):
self.log = log
self.file = file
def readline(self):
line = self.file.readline()
self.log.write(line)
return line
def write(self, buf):
self.log.write(buf)
return self.file.write(buf)
def close(self):
self.file.close()
class POP3RequestHandler(SocketServer.StreamRequestHandler):
"""Act as proxy between POP client and server."""
def connect_pop(self):
# connect to the pop server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(self.server.pop_server)
self.pop_rfile = LogWrapper(self.server.log, s.makefile("rb"))
# the write side should be unbuffered
self.pop_wfile = LogWrapper(self.server.log, s.makefile("wb", 0))
def close_pop(self):
self.pop_rfile.close()
self.pop_wfile.close()
def handle(self):
zLOG.LOG("POP3", zLOG.INFO,
"Connection from %s" % repr(self.client_address))
self.server.zodb.sync()
self.sess_retr_count = 0
self.connect_pop()
try:
self.handle_pop()
finally:
self.close_pop()
if self.sess_retr_count == 1:
ending = ""
else:
ending = "s"
zLOG.LOG("POP3", zLOG.INFO,
"Ending session (%d message%s retrieved)"
% (self.sess_retr_count, ending))
_multiline = {"RETR": True, "TOP": True,}
_multiline_noargs = {"LIST": True, "UIDL": True,}
def is_multiline(self, command, args):
if command in self._multiline:
return True
if command in self._multiline_noargs and not args:
return True
return False
def parse_request(self, req):
parts = req.split()
req = parts[0]
args = tuple(parts[1:])
return req, args
def handle_pop(self):
# send the initial server hello
hello = self.pop_rfile.readline()
self.wfile.write(hello)
# now get client requests and return server responses
while 1:
line = self.rfile.readline()
if line == '':
break
self.pop_wfile.write(line)
if not self.handle_pop_response(line):
break
def handle_pop_response(self, req):
# Return True if connection is still open
cmd, args = self.parse_request(req)
multiline = self.is_multiline(cmd, args)
firstline = self.pop_rfile.readline()
zLOG.LOG("POP3", zLOG.DEBUG, "command %s multiline %s resp %s"
% (cmd, multiline, firstline.strip()))
if multiline:
# Collect the entire response as one string
resp = cStringIO.StringIO()
while 1:
line = self.pop_rfile.readline()
resp.write(line)
# The response is finished if we get . or an error.
# XXX should handle byte-stuffed response
if line == ".\r\n":
break
if line.startswith("-ERR"):
break
buf = resp.getvalue()
else:
buf = None
handler = getattr(self, "handle_%s" % cmd, None)
if handler:
firstline, buf = handler(cmd, args, firstline, buf)
self.wfile.write(firstline)
if buf is not None:
self.wfile.write(buf)
if cmd == "QUIT":
return False
else:
return True
def handle_RETR(self, cmd, args, firstline, resp):
if not resp:
return firstline, resp
try:
msg = email.message_from_string(resp)
except email.Errors.MessageParseError, err:
zLOG.LOG("POP3", zLOG.WARNING,
"Failed to parse msg: %s" % err, error=sys.exc_info())
resp = self.message_parse_error(resp)
else:
self.score_msg(msg)
resp = msg.as_string()
self.sess_retr_count += 1
return firstline, resp
def handle_TOP(self, cmd, args, firstline, resp):
# XXX Just handle TOP like RETR?
return self.handle_RETR(cmd, args, firstline, resp)
rx_STAT = re.compile("\+OK (\d+) (\d+)(.*)", re.DOTALL)
def handle_STAT(self, cmd, args, firstline, resp):
# STAT returns the number of messages and the total size. The
# proxy must add the size of new headers to the total size.
# Example: +OK 3 340
mo = self.rx_STAT.match(firstline)
if mo is None:
return firstline, resp
count, size, extra = mo.group(1, 2, 3)
count = int(count)
size = int(size)
size += count * HEADER_SIZE
firstline = "+OK %d %d%s" % (count, size, extra)
return firstline, resp
rx_LIST = re.compile("\+OK (\d+) (\d+)(.*)", re.DOTALL)
rx_LIST_2 = re.compile("(\d+) (\d+)(.*)", re.DOTALL)
def handle_LIST(self, cmd, args, firstline, resp):
# If there are no args, LIST returns size info for each message.
# If there is an arg, LIST returns the number and size for one message.
mo = self.rx_LIST.match(firstline)
if mo:
# a single-line response
n, size, extra = mo.group(1, 2, 3)
size = int(size) + HEADER_SIZE
firstline = "+OK %s %d%s" % (n, size, extra)
return firstline, resp
else:
# possibly a multiline response
if not firstline.startswith("+OK"):
return firstline, resp
# update each line of the response
L = []
for line in resp.split("\r\n"):
if not line:
continue
mo = self.rx_LIST_2.match(line)
if not mo:
L.append(line)
else:
n, size, extra = mo.group(1, 2, 3)
size = int(size) + HEADER_SIZE
L.append("%s %d%s" % (n, size, extra))
# The split dropped the CRLF after the terminating ".", so restore it.
return firstline, "\r\n".join(L) + "\r\n"
def message_parse_error(self, buf):
# We got an error parsing the message. We've already told the
# client to expect more bytes than this buffer contains, but
# there's no clean way to add the header.
self.server.log.write("# error: %s\n" % repr(buf))
# XXX what to do? let's just add it after the first line
score = self.server.classifier.spamprob(tokenize(buf))
L = buf.split("\n")
L.insert(1, HEADER % score)
return "\n".join(L)
def score_msg(self, msg):
score = self.server.classifier.spamprob(tokenize(msg))
msg.add_header("X-Spambayes", "%5.3f" % score)
def main():
db = pspam.database.open()
conn = db.open()
r = conn.root()
profile = r["profile"]
log = open("/var/tmp/pop.log", "ab")
print >> log, "+PROXY start", time.ctime()
server = POP3ProxyServer(('', options.proxy_port),
POP3RequestHandler,
profile.classifier,
(options.server, options.server_port),
log,
conn,
)
server.serve_forever()
if __name__ == "__main__":
main()
--- NEW FILE: scoremsg.py ---
#! /usr/bin/env python
"""Score a message provided on stdin and show the evidence."""
import ZODB
from ZEO.ClientStorage import ClientStorage
from tokenizer import tokenize
import email
import sys
import pspam.options
def main(fp):
    cs = ClientStorage("/var/tmp/zeospam")
    db = ZODB.DB(cs)
    r = db.open().root()
    # make sure scoring uses the right set of options
    pspam.options.mergefile("/home/jeremy/src/vmspam/vmspam.ini")
    p = r["profile"]
    msg = email.message_from_file(fp)
    prob, evidence = p.classifier.spamprob(tokenize(msg), True)
    print "Score:", prob
    print
    print "Clues"
    print "-----"
    for clue, prob in evidence:
        print clue, prob
##    print
##    print msg

if __name__ == "__main__":
    main(sys.stdin)
--- NEW FILE: update.py ---
import getopt
import os
import sys
import ZODB
from ZEO.ClientStorage import ClientStorage
import pspam.database
from pspam.profile import Profile
from pspam.options import options
def folder_exists(L, p):
    """Return true if a folder with path p exists in list L."""
    for f in L:
        if f.path == p:
            return True
    return False

def main(rebuild=False):
    db = pspam.database.open()
    r = db.open().root()
    profile = r.get("profile")
    if profile is None or rebuild:
        # if there is no profile, create it
        profile = r["profile"] = Profile(options.folder_dir)
        get_transaction().commit()
    # check for new folders of training data
    for ham in options.ham_folders:
        p = os.path.join(options.folder_dir, ham)
        if not folder_exists(profile.hams, p):
            profile.add_ham(p)
    for spam in options.spam_folders:
        p = os.path.join(options.folder_dir, spam)
        if not folder_exists(profile.spams, p):
            profile.add_spam(p)
    get_transaction().commit()
    # read new messages from folders
    profile.update()
    get_transaction().commit()
    db.close()

if __name__ == "__main__":
    FORCE_REBUILD = False
    opts, args = getopt.getopt(sys.argv[1:], 'F')
    for k, v in opts:
        if k == '-F':
            FORCE_REBUILD = True
    main(FORCE_REBUILD)
--- NEW FILE: vmspam.ini ---
[Train]
folder_dir: /home/jeremy/Mail
spam_folders: train/spam
ham_folders: train/ham
[Score]
max_ham: 0.05
min_spam: 0.99
[Proxy]
server: mail.zope.com
server_port: 110
proxy_port: 1111
log_pop_session: true
log_pop_session_file: /var/tmp/pop.log
[ZODB]
zeo_addr: /var/tmp/zeospam
event_log_file: /var/tmp/zeospam.log
event_log_severity: 0
cache_size: 2000
--- NEW FILE: zeo.sh ---
#! /bin/bash
export STUPID_LOG_FILE=/var/tmp/zeospam.log
export LIBDIR=/usr/local/lib/python2.3/site-packages
python2.3 $LIBDIR/ZEO/start.py -U /var/tmp/zeospam /var/tmp/zeospam.fs
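
The size rewriting that pop.py performs for STAT can be tried in isolation.
The sketch below reuses the header template and the regular expression from
pop.py, but the function name adjust_stat is hypothetical and the code is
written for current Python. It relies on the score always rendering as five
characters with "%5.3f", which holds for scores between 0 and 1, so every
inserted header has the same length.

```python
import re

# Same header template as pop.py; its rendered length is what the proxy
# adds to every message it passes through.
HEADER = "X-Spambayes: %5.3f\r\n"
HEADER_SIZE = len(HEADER % 0.0)

RX_STAT = re.compile(r"\+OK (\d+) (\d+)(.*)", re.DOTALL)


def adjust_stat(firstline):
    """Grow the total size in a STAT reply by one header per message."""
    mo = RX_STAT.match(firstline)
    if mo is None:
        # Not of the form "+OK count size" (for example "-ERR"):
        # pass the line through unchanged.
        return firstline
    count, size, extra = mo.group(1, 2, 3)
    count = int(count)
    size = int(size) + count * HEADER_SIZE
    return "+OK %d %d%s" % (count, size, extra)
```

With a 20-byte header, a reply of "+OK 3 340" becomes "+OK 3 400". The
per-message sizes in a LIST reply grow by HEADER_SIZE each in the same way.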
From tim.one@comcast.net Mon Nov 4 05:03:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 00:03:05 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
In-Reply-To:
Message-ID:
[Mark Hammond]
> Modified Files:
> train.py
> Log Message:
> Fix the root of my:
> File "F:\src\spambayes\classifier.py", line 450, in _getclues
> distance = abs(prob - 0.5)
>
> Exception - problem is that we trained, but didn't update probabilities -
> thus, we failed for every new word seen only since the last complete
> retrain.
Mark, I've never seen this, and believed I fixed the only way it could have
happened last week -- WordInfo records start life with a genuine probability
(spamprob) now, instead of with a spamprob of None. It's possible, though,
that you had some leftover WordInfo record with None in your dict, and
didn't retrain from scratch after that fix. Or it's possible there's an
entirely different bug I still don't know about.
> There may be a case for _getclues() to detect a probability of None
> and call update_probabilities() automatically - either that or just
> keep throwing vague exceptions
Except it should never be possible for _getclues() to see None -- if that
was still happening for you, there's a deeper bug that still needs to be
fixed.
In other news, here's a shallow bug, upon starting Outlook now:
Traceback (most recent call last):
File "C:\PYTHON22\lib\site-packages\win32com\universal.py", line 150, in
dispatch
retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args,
None, None)
File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 322,
in _InvokeEx_
return self._invokeex_(dispid, lcid, wFlags, args, kwargs,
serviceProvider)
File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 562,
in _invokeex_
return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags,
args, kwArgs, serviceProvider)
File "C:\PYTHON22\lib\site-packages\win32com\server\policy.py", line 510,
in _invokeex_
return apply(func, args)
File "C:\Code\spambayes\Outlook2000\addin.py", line 392, in OnConnection
button.Init(self.manager, application, activeExplorer)
File "C:\Code\spambayes\Outlook2000\addin.py", line 262, in Init
ButtonDeleteAsExplorerEvent)
File "C:\Code\spambayes\Outlook2000\addin.py", line 103, in
WithEventsClone
events_class = getevents(clsid)
exceptions.NameError: global name 'getevents' is not defined
It can't have worked for you, either. I fiddled my local copy to do
from win32com.client import constants, getevents
near the top, and that appears to have fixed it. I'll check that in, but
please ensure that was the correct fix.
From tim_one@users.sourceforge.net Mon Nov 4 05:03:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 03 Nov 2002 21:03:49 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 addin.py,1.25,1.26
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv25951/Outlook2000
Modified Files:
addin.py
Log Message:
Fix what appeared to be a missing import of win32com.client.getevents.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** addin.py 4 Nov 2002 00:52:10 -0000 1.25
--- addin.py 4 Nov 2002 05:03:47 -0000 1.26
***************
*** 13,17 ****
import win32api
import pythoncom
! from win32com.client import constants
import win32ui
--- 13,17 ----
import win32api
import pythoncom
! from win32com.client import constants, getevents
import win32ui
From anthonybaxter@users.sourceforge.net Mon Nov 4 06:38:54 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 03 Nov 2002 22:38:54 -0800
Subject: [Spambayes-checkins] website developer.ht,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16008
Modified Files:
developer.ht
Log Message:
added a "what needs to be done" section.
Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** developer.ht 22 Sep 2002 07:48:03 -0000 1.3
--- developer.ht 4 Nov 2002 06:38:52 -0000 1.4
***************
*** 27,30 ****
--- 27,38 ----
available as links from the documentation page.
+
So what needs to be done
+
Currently (early November) work is now being focussed on finding
+ additional things that are beneficial to the tokenizer. The combining
+ scheme is now pretty solid and pretty amazing. The other big body of
+ work at the moment is producing something that's useful to end-users -
+ actually building the applications and the code so that Tim's sister
+ <wink> can use the system.
+
Collecting training data
One of the tricky problems is collecting a set of data that's
From anthonybaxter@users.sourceforge.net Mon Nov 4 06:39:44 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 03 Nov 2002 22:39:44 -0800
Subject: [Spambayes-checkins] website background.ht,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16178
Modified Files:
background.ht
Log Message:
A bit of a potted history here. I probably have a bunch of things here
that need to be cleaned up and made more obvious, but hey, it's a start.
Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** background.ht 19 Sep 2002 23:39:24 -0000 1.1
--- background.ht 4 Nov 2002 06:39:42 -0000 1.2
***************
*** 15,18 ****
--- 15,67 ----
more links? mail anthony at interlink.com.au
+
Overall Approach
+ Please note that I (Anthony) am writing this based on memory and
+ limited understanding of some of the subtler points of the maths. Gentle
+ corrections are welcome, or even encouraged.
+
Tokenizing
+
The architecture of the spambayes system has a couple of distinct
+ parts. The first, and most obvious, is the tokenizer. This takes
+ a mail message and breaks it up into a series of tokens. At the moment
+ it splits words out of the text parts of a message; a variety
+ of header tokenization goes on as well. The code in tokenizer.py
+ and the comments in the Tokenizer section of Options.py contain more
+ information about various approaches to tokenizing.
+
+
Combining and Scoring
+
The next part of the system is the scoring and combining part. This
+ is where the hairy mathematics and statistics come in.
+
Initially we started with Paul Graham's original combining scheme -
+ this has a number of "magic numbers" and "fuzz factors" built into it.
+ The Graham combining scheme has a number of problems, aside from the
+ magic in the internal fudge factors - it tends to produce scores of
+ either 1 or 0, and there's a very small middle ground in between - it
+ doesn't often claim to be "unsure", and gets it wrong because of this.
+ There's a number of discussions back and forth between Tim Peters and
+ Gary Robinson on this subject in the mailing list archives - I'll try
+ and put links to the relevant threads at some point.
+
Gary produced a number of alternative approaches to combining and
+ scoring word probabilities. The initial one, after much back and forth
+ in the mailing list, is in the code today as 'gary_combining'. A couple
+ of other approaches, using the Central Limit Theorem, were also tried.
+ They produced interesting output - but histograms of the ham and spam
+ distributions had a disturbingly large overlap in the middle. There was
+ also an issue with incremental training and untraining of messages that
+ made it harder to use in the "real world". These two central limit
+ approaches were dropped after Tim, Gary and Rob Hooft produced a combining
+ scheme using chi-squared probabilities. This is now the default combining
+ scheme.
+
The chi-squared approach produces two numbers - a "ham probability" ("*H*")
+ and a "spam probability" ("*S*"). A typical spam will have a high *S*
+ and low *H*, while a ham will have high *H* and low *S*. In the case where
+ the message looks entirely unlike anything the system's been trained on,
+ you can end up with a low *H* and low *S* - this is the code saying "I don't
+ know what this message is". So at the end of the processing, you end up
+ with three possible results - "Spam", "Ham", or "Unsure". It's possible to
+ tweak the high and low cutoffs for the Unsure window - this trades off
+ unsure messages vs possible false positives or negatives.
+
+
Training
+
TBD
+
Mailing list archives
There's a lot of background on what's been tried available from
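
The three-way result described in background.ht can be sketched as a pair
of cutoffs on the final score. The function name classify is hypothetical,
the default values are simply the [Score] settings from the sample
vmspam.ini, and treating both cutoffs as inclusive is an assumption; the
checkins do not show how pspam applies them.

```python
def classify(score, max_ham=0.05, min_spam=0.99):
    """Map a final score in [0, 1] to "Ham", "Spam" or "Unsure".

    max_ham and min_spam bound the Unsure window. Whether the
    comparisons are inclusive is an assumption of this sketch.
    """
    if score <= max_ham:
        return "Ham"
    if score >= min_spam:
        return "Spam"
    return "Unsure"
```

Widening the window (a lower max_ham or a higher min_spam) sends more mail
to Unsure in exchange for fewer false positives and false negatives, which
is the trade-off the text describes.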
From anthonybaxter@users.sourceforge.net Mon Nov 4 09:58:02 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 04 Nov 2002 01:58:02 -0800
Subject: [Spambayes-checkins] website background.ht,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv6694
Modified Files:
background.ht
Log Message:
addition from RobH about high *H* and high *S* meaning.
Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** background.ht 4 Nov 2002 06:39:42 -0000 1.2
--- background.ht 4 Nov 2002 09:57:59 -0000 1.3
***************
*** 56,60 ****
the message looks entirely unlike anything the system's been trained on,
you can end up with a low *H* and low *S* - this is the code saying "I don't
! know what this message is". So at the end of the processing, you end up
with three possible results - "Spam", "Ham", or "Unsure". It's possible to
tweak the high and low cutoffs for the Unsure window - this trades off
--- 56,66 ----
the message looks entirely unlike anything the system's been trained on,
you can end up with a low *H* and low *S* - this is the code saying "I don't
! know what this message is".
! Some messages can even have both a high *H* and a high *S*, telling you
! basically that the message looks very much like ham, but also very much
! like spam. In this case spambayes is also unsure where the message
! should be classified, and the final score will be near 0.5.
!
!
So at the end of the processing, you end up
with three possible results - "Spam", "Ham", or "Unsure". It's possible to
tweak the high and low cutoffs for the Unsure window - this trades off
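
The behaviour of *H* and *S* described above can be reproduced with a small
sketch of chi-squared combining. It is written from memory of the scheme
and is not a copy of classifier.py: the name chi2_combine is hypothetical,
the input is assumed to be a non-empty list of per-word spam probabilities
strictly between 0 and 1, and none of the clue selection or underflow
handling that a real classifier needs is included.

```python
import math


def chi2Q(x2, v):
    """Probability that a chi-squared value with v degrees of freedom
    is at least x2. Only valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def chi2_combine(probs):
    """Return (H, S, score) for a list of word spam probabilities."""
    n = len(probs)
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_not_spam, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)
    # Both high or both low lands near 0.5, i.e. "Unsure".
    score = (S - H + 1.0) / 2.0
    return H, S, score
```

Ten strongly spammy words (0.99 each) give a high *S*, a low *H* and a
score near 1. Ten spammy words mixed with ten hammy ones give a high *H*
and a high *S* with a score near 0.5. Ten uninformative words (0.5 each)
give a low *H* and a low *S*, again with a score of 0.5.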
From tim_one@users.sourceforge.net Mon Nov 4 21:06:30 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 13:06:30 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.46,1.47
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8400
Modified Files:
classifier.py
Log Message:
_add_msg(): Removed redundant store into wordinfo[word].
_remove_msg(): Added a store into wordinfo[word], which may be needed
if wordinfo is a persistent database, to let the persistence machinery
know that an internal field in the value associated *with* word changed.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.46
retrieving revision 1.47
diff -C2 -d -r1.46 -r1.47
*** classifier.py 1 Nov 2002 16:01:14 -0000 1.46
--- classifier.py 4 Nov 2002 21:06:26 -0000 1.47
***************
*** 401,405 ****
record = wordinfoget(word)
if record is None:
! record = wordinfo[word] = WordInfo(now)
if is_spam:
--- 401,405 ----
record = wordinfoget(word)
if record is None:
! record = WordInfo(now)
if is_spam:
***************
*** 407,410 ****
--- 407,411 ----
else:
record.hamcount += 1
+ # Needed to tell a persistent DB that the content changed.
wordinfo[word] = record
***************
*** 419,423 ****
self.nham -= 1
! wordinfoget = self.wordinfo.get
for word in Set(wordstream):
record = wordinfoget(word)
--- 420,425 ----
self.nham -= 1
! wordinfo = self.wordinfo
! wordinfoget = wordinfo.get
for word in Set(wordstream):
record = wordinfoget(word)
***************
*** 430,434 ****
record.hamcount -= 1
if record.hamcount == 0 == record.spamcount:
! del self.wordinfo[word]
def _getclues(self, wordstream):
--- 432,439 ----
record.hamcount -= 1
if record.hamcount == 0 == record.spamcount:
! del wordinfo[word]
! else:
! # Needed to tell a persistent DB that the content changed.
! wordinfo[word] = record
def _getclues(self, wordstream):
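
Why the store back into wordinfo[word] matters for a persistent mapping can
be shown with the standard library's shelve module. This is an analogy
only: shelve is not ZODB and detects changes differently, the function name
demo_store_back is hypothetical, and plain dicts stand in for WordInfo
records. The point it illustrates is the one in the log message: changing a
field inside a value does not by itself tell the store that the value
associated with the key changed.

```python
import os
import shelve
import tempfile


def demo_store_back():
    """Return (count read back without a store, count read back with one)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        with shelve.open(os.path.join(tmpdir, "wordinfo")) as db:
            db["free"] = {"spamcount": 0, "hamcount": 0}

            record = db["free"]        # an unpickled copy, detached from db
            record["spamcount"] += 1   # changes only the copy
            without_store = db["free"]["spamcount"]

            db["free"] = record        # explicit store, as in _add_msg()
            with_store = db["free"]["spamcount"]
    return without_store, with_store
```

The first value read back is still 0 and the second is 1, which is the
difference the added store in _remove_msg() is meant to guard against.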
From jhylton@users.sourceforge.net Mon Nov 4 21:25:56 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Mon, 04 Nov 2002 13:25:56 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv18044
Modified Files:
profile.py
Log Message:
Use the same default spamprob as regular classifier.
Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** profile.py 4 Nov 2002 04:44:20 -0000 1.1
--- profile.py 4 Nov 2002 21:25:54 -0000 1.2
***************
*** 10,13 ****
--- 10,14 ----
from pspam.folder import Folder
+ from pspam.options import options
import os
***************
*** 36,40 ****
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=None):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 37,41 ----
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
From jhylton@users.sourceforge.net Mon Nov 4 21:24:54 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Mon, 04 Nov 2002 13:24:54 -0800
Subject: [Spambayes-checkins] spambayes classifier.py,1.47,1.48
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17508
Modified Files:
classifier.py
Log Message:
Two changes to support pspam.
Make Bayes a classic class so that it can be mixed with
ExtensionClass.
Define Bayes.WordInfoClass so that a subclass can define a different
class to represent word info.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.47
retrieving revision 1.48
diff -C2 -d -r1.47 -r1.48
*** classifier.py 4 Nov 2002 21:06:26 -0000 1.47
--- classifier.py 4 Nov 2002 21:24:52 -0000 1.48
***************
*** 80,84 ****
self.spamprob) = t
! class Bayes(object):
# Defining __slots__ here made Jeremy's life needlessly difficult when
# trying to hook this all up to ZODB as a persistent object. There's
--- 80,84 ----
self.spamprob) = t
! class Bayes:
# Defining __slots__ here made Jeremy's life needlessly difficult when
# trying to hook this all up to ZODB as a persistent object. There's
***************
*** 92,95 ****
--- 92,98 ----
# )
+ # allow a subclass to use a different class for WordInfo
+ WordInfoClass = WordInfo
+
def __init__(self):
self.wordinfo = {}
***************
*** 401,405 ****
record = wordinfoget(word)
if record is None:
! record = WordInfo(now)
if is_spam:
--- 404,408 ----
record = wordinfoget(word)
if record is None:
! record = self.WordInfoClass(now)
if is_spam:
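
The WordInfoClass hook is a class-attribute factory: the base class creates
records through self.WordInfoClass, so a subclass can substitute its own
record type without overriding the training code. A minimal sketch follows.
The method count_word and the Audited* classes are hypothetical stand-ins
(pspam substitutes a Persistent subclass instead), and the record is
reduced to the two counters needed for the illustration.

```python
class WordInfo:
    def __init__(self, atime):
        self.atime = atime
        self.spamcount = self.hamcount = 0


class Bayes:
    # Allow a subclass to use a different class for word records.
    WordInfoClass = WordInfo

    def __init__(self):
        self.wordinfo = {}

    def count_word(self, word, is_spam, now=0):
        """Hypothetical, cut-down version of the per-word update."""
        record = self.wordinfo.get(word)
        if record is None:
            # Looked up on self, so a subclass override takes effect.
            record = self.WordInfoClass(now)
        if is_spam:
            record.spamcount += 1
        else:
            record.hamcount += 1
        self.wordinfo[word] = record
        return record


class AuditedWordInfo(WordInfo):
    """Hypothetical replacement record type."""


class AuditedBayes(Bayes):
    WordInfoClass = AuditedWordInfo
```

Calling count_word() on a Bayes instance creates WordInfo records, while
the same call on an AuditedBayes instance creates AuditedWordInfo records.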
From mhammond@users.sourceforge.net Mon Nov 4 22:19:36 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Mon, 04 Nov 2002 14:19:36 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv17976
Modified Files:
train.py
Log Message:
Roll back my previous "update probs" change - Tim's fix would have fixed it had I done a complete retrain. Done that now, and if I still need this Tim will sort it out once and for all.
Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** train.py 4 Nov 2002 01:12:53 -0000 1.13
--- train.py 4 Nov 2002 22:19:34 -0000 1.14
***************
*** 19,23 ****
return spam == True
! def train_message(msg, is_spam, mgr, update_probs = True):
# Train an individual message.
# Returns True if newly added (message will be correctly
--- 19,23 ----
return spam == True
! def train_message(msg, is_spam, mgr):
# Train an individual message.
# Returns True if newly added (message will be correctly
***************
*** 41,47 ****
mgr.bayes.learn(tokens, is_spam, False)
mgr.message_db[msg.searchkey] = is_spam
- if update_probs:
- mgr.bayes.update_probabilities()
-
mgr.bayes_dirty = True
return True
--- 41,44 ----
***************
*** 54,58 ****
progress.tick()
try:
! if train_message(message, isspam, mgr, False):
num_added += 1
except:
--- 51,55 ----
progress.tick()
try:
! if train_message(message, isspam, mgr):
num_added += 1
except:
From mhammond@skippinet.com.au Mon Nov 4 22:48:08 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 5 Nov 2002 09:48:08 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.12,1.13
In-Reply-To:
Message-ID:
[Tim]
> In other news, here's a shallow bug, upon starting Outlook now:
...
> It can't have worked for you, either.
It can - my code took the "win32all has such a function" path. Pity mine is
the only machine in the world taking that path
> I fiddled my local copy to do
>
> from win32com.client import constants, getevents
>
> near the top, and that appears to have fixed it. I'll check that in, but
> please ensure that was the correct fix.
Just dandy - thanks! pychecker can tell us when it is no longer necessary!
Mark.
From mhammond@users.sourceforge.net Mon Nov 4 22:50:44 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Mon, 04 Nov 2002 14:50:44 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.26,1.27 train.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv30899
Modified Files:
addin.py train.py
Log Message:
After incremental training on individual messages, they are also rescored
so that they appear in the ham/spam folder with the *new* post-training
score rather than their pre-training, presumably wrong score.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** addin.py 4 Nov 2002 05:03:47 -0000 1.26
--- addin.py 4 Nov 2002 22:50:41 -0000 1.27
***************
*** 159,163 ****
import train
print "Training on message '%s' - " % subject,
! if train.train_message(msgstore_message, False, self.manager):
print "trained as good"
else:
--- 159,163 ----
import train
print "Training on message '%s' - " % subject,
! if train.train_message(msgstore_message, False, self.manager, rescore = True):
print "trained as good"
else:
***************
*** 191,195 ****
subject = item.Subject.encode("mbcs", "replace")
print "Training on message '%s' - " % subject,
! if train.train_message(msgstore_message, True, self.manager):
print "trained as spam"
else:
--- 191,195 ----
subject = item.Subject.encode("mbcs", "replace")
print "Training on message '%s' - " % subject,
! if train.train_message(msgstore_message, True, self.manager, rescore = True):
print "trained as spam"
else:
***************
*** 329,333 ****
# Must train before moving, else we lose the message!
print "Training on message - ",
! if train.train_message(msgstore_message, True, self.manager):
print "trained as spam"
else:
--- 329,333 ----
# Must train before moving, else we lose the message!
print "Training on message - ",
! if train.train_message(msgstore_message, True, self.manager, rescore = True):
print "trained as spam"
else:
Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** train.py 4 Nov 2002 22:19:34 -0000 1.14
--- train.py 4 Nov 2002 22:50:41 -0000 1.15
***************
*** 19,27 ****
return spam == True
! def train_message(msg, is_spam, mgr):
# Train an individual message.
# Returns True if newly added (message will be correctly
# untrained if it was in the wrong category), False if already
# in the correct category. Catch your own damn exceptions.
from tokenizer import tokenize
stream = msg.GetEmailPackageObject()
--- 19,29 ----
return spam == True
! def train_message(msg, is_spam, mgr, rescore = False):
# Train an individual message.
# Returns True if newly added (message will be correctly
# untrained if it was in the wrong category), False if already
# in the correct category. Catch your own damn exceptions.
+ # If re-classified AND rescore = True, then a new score will
+ # be written to the message (so the user can see some effects)
from tokenizer import tokenize
stream = msg.GetEmailPackageObject()
***************
*** 42,45 ****
--- 44,52 ----
mgr.message_db[msg.searchkey] = is_spam
mgr.bayes_dirty = True
+ # Simplest way to rescore is to re-filter with all_actions = False
+ if rescore:
+ import filter
+ filter.filter_message(msg, mgr, all_actions = False)
+
return True
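
The train-then-rescore flow added to train_message() above can be sketched in isolation. This is a minimal model under stated assumptions, not the Outlook add-in code: FakeMessage, FakeManager and score_message are hypothetical stand-ins, the scoring rule is a toy, and the sketch is written for Python 3. Only the rescore keyword and the True/False return convention come from the diff.

```python
# Hypothetical stand-ins; only train_message's signature and its
# return convention follow the diff above.
class FakeMessage:
    def __init__(self, searchkey, text):
        self.searchkey = searchkey
        self.text = text
        self.score = None  # last score written to the message


class FakeManager:
    def __init__(self):
        self.message_db = {}     # searchkey -> is_spam
        self.spam_words = set()  # toy replacement for the classifier
        self.dirty = False


def score_message(msg, mgr):
    """Toy scorer: fraction of the message's words seen in spam."""
    words = msg.text.split()
    hits = sum(1 for w in words if w in mgr.spam_words)
    msg.score = hits / len(words) if words else 0.0


def train_message(msg, is_spam, mgr, rescore=False):
    """Return True if newly trained, False if already in that category."""
    if mgr.message_db.get(msg.searchkey) == is_spam:
        return False
    words = set(msg.text.split())
    if is_spam:
        mgr.spam_words |= words
    else:
        mgr.spam_words -= words
    mgr.message_db[msg.searchkey] = is_spam
    mgr.dirty = True
    # Rescore only after a real (re)classification, so the stored
    # score reflects the post-training state.
    if rescore:
        score_message(msg, mgr)
    return True
```

Calling train_message(msg, True, mgr, rescore=True) twice in a row trains and rescores the first time and returns False without touching the score the second time, which is the behaviour the comment in the diff describes.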
From tim_one@users.sourceforge.net Mon Nov 4 23:21:45 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:21:45 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.64,1.65
tokenizer.py,1.60,1.61
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12377
Modified Files:
Options.py tokenizer.py
Log Message:
New option record_header_absence.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.64
retrieving revision 1.65
diff -C2 -d -r1.64 -r1.65
*** Options.py 3 Nov 2002 13:48:47 -0000 1.64
--- Options.py 4 Nov 2002 23:21:43 -0000 1.65
***************
*** 54,63 ****
# very strong ham clue, but a bogus one. In that case, set
# count_all_header_lines to False, and adjust safe_headers instead.
-
count_all_header_lines: False
! # Like count_all_header_lines, but restricted to headers in this list.
! # safe_headers is ignored when count_all_header_lines is true.
safe_headers: abuse-reports-to
date
--- 54,68 ----
# very strong ham clue, but a bogus one. In that case, set
# count_all_header_lines to False, and adjust safe_headers instead.
count_all_header_lines: False
! # When True, generate a "noheader:HEADERNAME" token for each header in
! # safe_headers (below) that *doesn't* appear in the headers. This helped
! # in various of Tim's python.org tests, but appeared to hurt a little in
! # Anthony Baxter's tests.
! record_header_absence: False
+ # Like count_all_header_lines, but restricted to headers in this list.
+ # safe_headers is ignored when count_all_header_lines is true, unless
+ # record_header_absence is also true.
safe_headers: abuse-reports-to
date
***************
*** 336,339 ****
--- 341,345 ----
'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
+ 'record_header_absence': boolean_cracker,
'generate_long_skips': boolean_cracker,
'skip_max_word_size': int_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.60
retrieving revision 1.61
diff -C2 -d -r1.60 -r1.61
*** tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60
--- tokenizer.py 4 Nov 2002 23:21:43 -0000 1.61
***************
*** 1179,1182 ****
--- 1179,1185 ----
for x in x2n.items():
yield "header:%s:%d" % x
+ if options.record_header_absence:
+ for x in options.safe_headers - Set([k.lower() for k in x2n]):
+ yield "noheader:" + x
def tokenize_body(self, msg, maxword=options.skip_max_word_size):
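
The effect of record_header_absence can be illustrated with a small standalone sketch. It is an approximation under assumptions, not the tokenizer's code: header_count_tokens is a hypothetical helper, it is written for Python 3 (built-in set rather than the Set class in the diff), and the output is sorted only to make the example deterministic.

```python
from collections import Counter


def header_count_tokens(header_names, safe_headers,
                        record_header_absence=False):
    """Yield header-count tokens, plus absence tokens when enabled.

    header_names: header field names as they appear in a message.
    safe_headers: lower-case names that are allowed to be counted.
    """
    safe = {h.lower() for h in safe_headers}
    counts = Counter(n for n in header_names if n.lower() in safe)
    for name, n in sorted(counts.items()):
        yield "header:%s:%d" % (name, n)
    if record_header_absence:
        seen = {name.lower() for name in counts}
        # One token per safe header that never appeared in the message.
        for name in sorted(safe - seen):
            yield "noheader:" + name
```

For a message carrying Date once and Received twice, with message-id also listed as safe, the sketch yields the two count tokens and a single noheader:message-id token.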
From tim_one@users.sourceforge.net Mon Nov 4 23:21:45 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:21:45 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 default_bayes_customize.ini,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv12377/Outlook2000
Modified Files:
default_bayes_customize.ini
Log Message:
New option record_header_absence.
Index: default_bayes_customize.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** default_bayes_customize.ini 27 Oct 2002 03:42:58 -0000 1.4
--- default_bayes_customize.ini 4 Nov 2002 23:21:43 -0000 1.5
***************
*** 14,17 ****
--- 14,20 ----
replace_nonascii_chars: True
+ # It's helpful for Tim .
+ record_header_absence: True
+
[Classifier]
# Uncomment the next lines if you want to use the former default for
From tim.one@comcast.net Mon Nov 4 23:39:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 18:39:27 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.13,1.14
In-Reply-To:
Message-ID:
[Mark Hammond]
> Roll-back my previous "update probs" change - Tim's fix would
> have fixed it had I done a complete retrain. Done that now, and
> if I still need this Tim will sort it out once-and-for-all
Do keep an eye on it! I've never seen software that had a bug, but I keep
hearing it's possible ...
From mhammond@users.sourceforge.net Tue Nov 5 11:44:30 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 05 Nov 2002 03:44:30 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.21,1.22
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14881
Modified Files:
msgstore.py
Log Message:
Fix a few typos in comments, and code!
Also adding a check if the message has attachments - currently not used,
but will be soon (to handle multipart/signed messages) - was in the code
then found the typos, so decided I should get 'em in.
[DoCopyMode -> DoCopyMove does get me wondering about the utility of
auto-complete in editors tho' <0.1 wink>]
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** msgstore.py 4 Nov 2002 00:41:08 -0000 1.21
--- msgstore.py 5 Nov 2002 11:44:27 -0000 1.22
***************
*** 296,301 ****
# only problem is that it can potentially be changed - however, the
# Outlook client provides no such (easy/obvious) way
! # (ie, someone would need to really want to change it
! # This, searchkey is the only reliable long-lived message key.
self.searchkey = searchkey
self.unread = unread
--- 296,301 ----
# only problem is that it can potentially be changed - however, the
# Outlook client provides no such (easy/obvious) way
! # (ie, someone would need to really want to change it )
! # Thus, searchkey is the only reliable long-lived message key.
self.searchkey = searchkey
self.unread = unread
***************
*** 369,377 ****
# Oh - and for multipart/signed messages
self._EnsureObject()
! prop_ids = PR_TRANSPORT_MESSAGE_HEADERS_A, PR_BODY_A, MYPR_BODY_HTML_A
hr, data = self.mapi_object.GetProps(prop_ids,0)
headers = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0])
body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
# Mail delivered internally via Exchange Server etc may not have
# headers - fake some up.
--- 369,381 ----
# Oh - and for multipart/signed messages
self._EnsureObject()
! prop_ids = (PR_TRANSPORT_MESSAGE_HEADERS_A,
! PR_BODY_A,
! MYPR_BODY_HTML_A,
! PR_HASATTACH)
hr, data = self.mapi_object.GetProps(prop_ids,0)
headers = self._GetPotentiallyLargeStringProp(prop_ids[0], data[0])
body = self._GetPotentiallyLargeStringProp(prop_ids[1], data[1])
html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
+ has_attach = data[3][1]
# Mail delivered internally via Exchange Server etc may not have
# headers - fake some up.
***************
*** 382,385 ****
--- 386,395 ----
elif headers.startswith("Microsoft Mail"):
headers = "X-MS-Mail-Gibberish: " + headers
+ if not html and not body:
+ # Only ever seen this for "multipart/signed" messages, so
+ # without any better clues, just handle this.
+ # Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
+ pass
+
return "%s\n%s\n%s" % (headers, html, body)
***************
*** 476,480 ****
props = ( (mapi.PS_PUBLIC_STRINGS, prop), )
prop = self.mapi_object.GetIDsFromNames(props, 0)[0]
- # Docs say PT_ERROR, reality shows PT_UNSPECIFIED
if PROP_TYPE(prop) == PT_ERROR: # No such property
return None
--- 486,489 ----
***************
*** 494,498 ****
self.dirty = False
! def _DoCopyMode(self, folder, isMove):
## self.mapi_object = None # release the COM pointer
assert not self.dirty, \
--- 503,507 ----
self.dirty = False
! def _DoCopyMove(self, folder, isMove):
## self.mapi_object = None # release the COM pointer
assert not self.dirty, \
***************
*** 517,524 ****
def MoveTo(self, folder):
! self._DoCopyMode(folder, True)
def CopyTo(self, folder):
! self._DoCopyMode(folder, True)
def test():
--- 526,533 ----
def MoveTo(self, folder):
! self._DoCopyMove(folder, True)
def CopyTo(self, folder):
! self._DoCopyMove(folder, False)
def test():
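
The CopyTo fix above (passing False rather than True for isMove) is easy to check in a toy model. The sketch below is hypothetical and does not use MAPI: Folder and Message are invented classes, written for Python 3, that only mirror the shape of a shared copy/move helper selected by a boolean flag.

```python
class Folder:
    def __init__(self, name):
        self.name = name
        self.items = []


class Message:
    def __init__(self, subject, folder):
        self.subject = subject
        self.folder = folder
        folder.items.append(self)

    def _do_copy_move(self, dest, is_move):
        # One helper serves both operations; the flag decides whether
        # the original leaves its source folder.
        if is_move:
            self.folder.items.remove(self)
            self.folder = dest
            dest.items.append(self)
            return self
        return Message(self.subject, dest)

    def move_to(self, dest):
        return self._do_copy_move(dest, True)

    def copy_to(self, dest):
        # Passing True here would silently turn a copy into a move.
        return self._do_copy_move(dest, False)
```

A test that copies a message and then asserts the source folder still holds it would have caught the original slip.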
From mhammond@users.sourceforge.net Tue Nov 5 21:51:55 2002
From: mhammond@users.sourceforge.net (Mark Hammond)
Date: Tue, 05 Nov 2002 13:51:55 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000/dialogs ManagerDialog.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000/dialogs
In directory usw-pr-cvs1:/tmp/cvs-serv10075
Modified Files:
ManagerDialog.py
Log Message:
Ensure filter_status is always set to a value indicating why the filter
can not be enabled.
Index: ManagerDialog.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/dialogs/ManagerDialog.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** ManagerDialog.py 1 Nov 2002 02:03:48 -0000 1.5
--- ManagerDialog.py 5 Nov 2002 21:51:53 -0000 1.6
***************
*** 120,123 ****
--- 120,128 ----
if ok_to_enable:
unsure_name = self.mgr.FormatFolderNames([config.unsure_folder_id], False)
+ else:
+ filter_status = "You must define the folder to receive your possible spam"
+ else:
+ filter_status = "You must define the folder to receive your certain spam"
+
# whew
if ok_to_enable:
From richiehindle@users.sourceforge.net Tue Nov 5 22:18:59 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Tue, 05 Nov 2002 14:18:59 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23270
Modified Files:
pop3proxy.py
Log Message:
First cut of the HTML user interface - see the docstring for -b and -u.
Now reads the classification header and its values from the options.
Added TOP support to the test server (to make 40tude Dialog happy).
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** pop3proxy.py 2 Nov 2002 21:00:21 -0000 1.9
--- pop3proxy.py 5 Nov 2002 22:18:56 -0000 1.10
***************
*** 15,19 ****
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
! -l port : listen on this port number (default 110)
pop3proxy -t
--- 15,22 ----
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
! -l port : proxy listens on this port number (default 110)
! -u port : User interface listens on this port number
! (default 8880; Browse http://localhost:8880/)
! -b : Launch a web browser showing the user interface.
pop3proxy -t
***************
*** 35,40 ****
! import sys, re, operator, errno, getopt, cPickle, time
! import socket, asyncore, asynchat
import classifier, tokenizer, hammie
from Options import options
--- 38,43 ----
! import sys, re, operator, errno, getopt, cPickle, cStringIO, time
! import socket, asyncore, asynchat, cgi, urlparse, webbrowser
import classifier, tokenizer, hammie
from Options import options
***************
*** 42,47 ****
# HEADER_EXAMPLE is the longest possible header - the length of this one
# is added to the size of each message.
! HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
! HEADER_EXAMPLE = '%s: Unsure\r\n' % hammie.DISPHEADER
--- 45,57 ----
# HEADER_EXAMPLE is the longest possible header - the length of this one
# is added to the size of each message.
! HEADER_FORMAT = '%s: %%s\r\n' % options.hammie_header_name
! HEADER_EXAMPLE = '%s: xxxxxxxxxxxxxxxxxxxx\r\n' % options.hammie_header_name
!
! # This keeps the global status of the module - the command-line options,
! # how many mails have been classified, how many active connections there
! # are, and so on.
! class Status:
! pass
! status = Status()
***************
*** 61,65 ****
self.set_socket(s, socketMap)
self.set_reuse_addr()
! print "Listening on port %d." % port
self.bind(('', port))
self.listen(5)
--- 71,75 ----
self.set_socket(s, socketMap)
self.set_reuse_addr()
! print "%s listening on port %d." % (self.__class__.__name__, port)
self.bind(('', port))
self.listen(5)
***************
*** 73,80 ****
self.factory(*args)
! class POP3ProxyBase(asynchat.async_chat):
"""An async dispatcher that understands POP3 and proxies to a POP3
! server, calling `self.onTransaction( request, response )` for each
transaction. Responses are not un-byte-stuffed before reaching
self.onTransaction() (they probably should be for a totally generic
--- 83,107 ----
self.factory(*args)
+ class BrighterAsyncChat(asynchat.async_chat):
+ """An asynchat.async_chat that doesn't give spurious warnings on
+ receiving an incoming connection, and lets SystemExit cause an
+ exit."""
! def handle_connect(self):
! """Suppress the asyncore "unhandled connect event" warning."""
! pass
!
! def handle_error(self):
! """Let SystemExit cause an exit."""
! type, v, t = sys.exc_info()
! if type == SystemExit:
! raise
! else:
! asynchat.async_chat.handle_error(self)
!
!
! class POP3ProxyBase(BrighterAsyncChat):
"""An async dispatcher that understands POP3 and proxies to a POP3
! server, calling `self.onTransaction(request, response)` for each
transaction. Responses are not un-byte-stuffed before reaching
self.onTransaction() (they probably should be for a totally generic
***************
*** 88,92 ****
def __init__(self, clientSocket, serverName, serverPort):
! asynchat.async_chat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
--- 115,119 ----
def __init__(self, clientSocket, serverName, serverPort):
! BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
***************
*** 96,103 ****
self.push(self.serverIn.readline())
- def handle_connect(self):
- """Suppress the asyncore "unhandled connect event" warning."""
- pass
-
def onTransaction(self, command, args, response):
"""Override this. Takes the raw request and the response, and
--- 123,126 ----
***************
*** 221,232 ****
self.close_when_done()
- def handle_error(self):
- """Let SystemExit cause an exit."""
- type, v, t = sys.exc_info()
- if type == SystemExit:
- raise
- else:
- asynchat.async_chat.handle_error(self)
-
class BayesProxyListener(Listener):
--- 244,247 ----
***************
*** 276,279 ****
--- 291,296 ----
self.handlers = {'STAT': self.onStat, 'LIST': self.onList,
'RETR': self.onRetr, 'TOP': self.onTop}
+ status.totalSessions += 1
+ status.activeSessions += 1
def send(self, data):
***************
*** 290,293 ****
--- 307,314 ----
return data
+ def close(self):
+ status.activeSessions -= 1
+ POP3ProxyBase.close(self)
+
def onTransaction(self, command, args, response):
"""Takes the raw request and response, and returns the
***************
*** 343,352 ****
# Now find the spam disposition and add the header.
prob = self.bayes.spamprob(tokenizer.tokenize(message))
if prob < options.ham_cutoff:
! disposition = "No"
elif prob > options.spam_cutoff:
! disposition = "Yes"
else:
! disposition = "Unsure"
headers, body = re.split(r'\n\r?\n', response, 1)
--- 364,381 ----
# Now find the spam disposition and add the header.
prob = self.bayes.spamprob(tokenizer.tokenize(message))
+ if command == 'RETR':
+ status.numEmails += 1
if prob < options.ham_cutoff:
! disposition = options.header_ham_string
! if command == 'RETR':
! status.numHams += 1
elif prob > options.spam_cutoff:
! disposition = options.header_spam_string
! if command == 'RETR':
! status.numSpams += 1
else:
! disposition = options.header_unsure_string
! if command == 'RETR':
! status.numUnsure += 1
headers, body = re.split(r'\n\r?\n', response, 1)
***************
*** 368,372 ****
! def main(serverName, serverPort, proxyPort, pickleName, useDB):
"""Runs the proxy forever or until a 'KILL' command is received or
someone hits Ctrl+Break."""
--- 397,646 ----
! class UserInterfaceListener(Listener):
! """Listens for incoming web browser connections and spins off
! UserInterface objects to serve them."""
!
! def __init__(self, uiPort, bayes):
! uiArgs = (bayes,)
! Listener.__init__(self, uiPort, UserInterface, uiArgs)
!
!
! # Until the user interface has had a wider audience, I won't pollute the
! # project with .gif files and the like. Here's the viking helmet.
! import base64
! helmet = base64.decodestring(
! """R0lGODlhIgAYAPcAAEJCRlVTVGNaUl5eXmtaVm9lXGtrZ3NrY3dvZ4d0Znt3dImHh5R+a6GDcJyU
! jrSdjaWlra2tra2tta+3ur2trcC9t7W9ysDDyMbGzsbS3r3W78bW78be78be973e/8bn/86pjNav
! kc69re/Lrc7Ly9ba4vfWveTh5M7e79be79bn797n7+fr6+/v5+/v7/f3787e987n987n/9bn99bn
! /9bv/97n997v++fv9+f3/+/v9+/3//f39/f/////9////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAEAAB4ALAAAAAAiABgA
! AAj+AD0IHEiwoMGDA2XI8PBhxg2EECN+YJHjwwccOz5E3FhQBgseMmK44KGRo0kaLHzQENljoUmO
! NE74uGHDxQ8aL2GmzFHzZs6NNFr8yKHC5sOfEEUOVcHiR8aNFksi/LCCx1KZPXAilLHBAoYMMSB6
! 9DEUhsyhUgl+wOBAwQIHFsIapGpzaIcTVnvcSOsBhgUFBgYUMKAgAgqNH2J0aPjxR9YPJerqlYEi
! w4YYExQM2FygwIHCKVBgiBChBIsXP5wu3HD2Bw8MC2JD0CygAIHOnhU4cLDA7QWrqfd6iBE5dQsH
! BgJvHiDgNoID0A88V6AAAQSyjl16QIHXBwnNAwDIBAhAwDmDBAjQHyiAIPkC7DnUljhxwkGAAQHE
! B+icIAGD8+clUMByCNjUUkEdlHCBAvflF0BtB/zHQAMSCjhYYBXsoFVBMWAQWH4AAFBbAg2UWOID
! FK432AEO2ABRBwtsFuKDBTSAYgMghBDCAwwgwB4CClQAQ0R/4RciAQjYyMADIIwwAggN+PeWBTPw
! VdAHHEjA4IMR8ojjCCaEEGUCFcygnUQxaEndbhBAwKQIFVAAgQMQHPZTBxrkqUEHfHLAAZ+AdgBR
! QAAAOw==""")
!
!
! class UserInterface(BrighterAsyncChat):
! """Serves the HTML user interface of the proxy."""
!
! header = """Spambayes proxy: %s
!
! \n"""
!
! bodyStart = """
!
!
! Spambayes proxy: %s
!
\n"""
!
! footer = """
! \n"""
!
! pageSection = """
!
%s
!
%s
! \n"""
!
! wordQuery = """"""
!
! def __init__(self, clientSocket, bayes):
! BrighterAsyncChat.__init__(self, clientSocket)
! self.bayes = bayes
! self.request = ''
! self.set_terminator('\r\n\r\n')
! self.helmet = helmet
!
! def collect_incoming_data(self, data):
! """Asynchat override."""
! self.request = self.request + data
!
! def found_terminator(self):
! """Asynchat override.
! Read and parse the HTTP request and call an on handler."""
! requestLine, headers = self.request.split('\r\n', 1)
! try:
! method, url, version = requestLine.strip().split()
! except ValueError:
! self.pushError(400, "Malformed request: '%s'" % requestLine) # XXX: 400??
! self.close_when_done()
! else:
! method = method.upper()
! _, _, path, _, query, _ = urlparse.urlparse(url)
! params = cgi.parse_qs(query, keep_blank_values=True)
! if self.get_terminator() == '\r\n\r\n' and method == 'POST':
! # We need to read a body; set a numeric async_chat terminator.
! match = re.search(r'(?i)content-length:\s*(\d+)', headers)
! self.set_terminator(int(match.group(1)))
! self.request = self.request + '\r\n\r\n'
! return
!
! if type(self.get_terminator()) is type(1):
! # We've just read the body of a POSTed request.
! self.set_terminator('\r\n\r\n')
! body = self.request.split('\r\n\r\n', 1)[1]
! match = re.search(r'(?i)content-type:\s*([^\r\n]+)', headers)
! contentTypeHeader = match.group(1)
! contentType, pdict = cgi.parse_header(contentTypeHeader)
! if contentType == 'multipart/form-data':
! # multipart/form-data - probably a file upload.
! bodyFile = cStringIO.StringIO(body)
! params.update(cgi.parse_multipart(bodyFile, pdict))
! else:
! # A normal x-www-form-urlencoded.
! params.update(cgi.parse_qs(body, keep_blank_values=True))
!
! # Convert the cgi params into a simple dictionary.
! plainParams = {}
! for name, value in params.iteritems():
! plainParams[name] = value[0]
! self.onRequest(path, plainParams)
! self.close_when_done()
!
! def onRequest(self, path, params):
! """Handles a decoded HTTP request."""
! if path == '/':
! path = '/Home'
!
! if path == '/helmet.gif':
! self.pushOKHeaders('image/gif')
! self.push(self.helmet)
! else:
! try:
! name = path[1:].capitalize()
! handler = getattr(self, 'on' + name)
! except AttributeError:
! self.pushError(404, "Not found: '%s'" % url)
! else:
! # This is a request for a valid page; run the handler.
! self.pushOKHeaders('text/html')
! self.pushPreamble(name)
! handler(params)
! timeString = time.asctime(time.localtime())
! self.push(self.footer % timeString)
!
! def pushOKHeaders(self, contentType):
! self.push("HTTP/1.0 200 OK\r\n")
! self.push("Content-Type: %s\r\n" % contentType)
! self.push("\r\n")
!
! def pushError(self, code, message):
! self.push("HTTP/1.0 %d Error\r\n" % code)
! self.push("Content-Type: text/html\r\n")
! self.push("\r\n")
! self.push("
%d %s
" % (code, message))
!
! def pushPreamble(self, name):
! self.push(self.header % name)
! if name == 'Home':
! homeLink = name
! else:
! homeLink = "Home > %s" % name
! self.push(self.bodyStart % homeLink)
!
! def onHome(self, params):
! summary = """POP3 proxy running on port %(proxyPort)d,
! proxying to %(serverName)s:%(serverPort)d.
! Active POP3 conversations: %(activeSessions)d.
! POP3 conversations this session:
! %(totalSessions)d.
! Emails classified this session: %(numSpams)d spam,
! %(numHams)d ham, %(numUnsure)d unsure.
! """ % status.__dict__
!
! train = """"""
!
! body = (self.pageSection % ('Status', summary) +
! self.pageSection % ('Word query', self.wordQuery) +
! self.pageSection % ('Train', train))
! self.push(body)
!
! def onShutdown(self, params):
! self.push("
Shutdown. Goodbye.
")
! self.push(' ') # Acts as a flush for small buffers.
! self.shutdown(2)
! self.close()
! raise SystemExit
!
! def onUpload(self, params):
! message = params.get('file') or params.get('text')
! isSpam = (params['which'] == 'spam')
! self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
! self.push("""
Trained on your message. Saving database...
""")
! self.push(" ") # Flush... must find out how to do this properly...
! if not status.useDB and status.pickleName:
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("
This project works with either the absolute bleeding edge of python code, available from CVS on sourceforge, or with Python 2.2.1 (not 2.2, or 2.1.3).
The spambayes code itself is also available via CVS
--- 12,16 ----
come crying <wink>.
!
This project works with either the absolute bleeding edge of python code, available from CVS on sourceforge, or with Python 2.2 (not 2.1.x or earlier).
The spambayes code itself is also available via CVS
From just@letterror.com Thu Nov 7 22:51:11 2002
From: just@letterror.com (Just van Rossum)
Date: Thu, 7 Nov 2002 23:51:11 +0100
Subject: [Spambayes-checkins] spambayes README.txt,1.40,1.41
TestDriver.py,1.27,1.28 Tester.py,1.7,1.8 chi2.py,1.7,1.8
classifier.py,1.48,1.49 hammie.py,1.36,1.37 hammiesrv.py,1.9,1.10
mboxcount.py,1.2,1.3 mboxtest.py,1.9,1.10 neiltrain.py,1.3,1.4 rebal.py,1.
In-Reply-To:
Message-ID:
Just van Rossum wrote:
> Mass checkin: Remain compatible with Python 2.2. Only tested with
> pop3proxy.py.
Btw. I screwed up the checkin for Options.py, Histogram.py and INTEGRATION.txt;
these have a bogus log message for the 2.2 compat patch :-(.
Just
From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
tokenizer.py,1.63,1.64
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798
Modified Files:
Options.py tokenizer.py
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py 7 Nov 2002 22:25:46 -0000 1.66
--- Options.py 8 Nov 2002 04:06:23 -0000 1.67
***************
*** 42,53 ****
x-.*
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages. Set true to retain HTML tags in this
- # case. On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
-
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
! 'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
--- 339,343 ----
all_options = {
! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63
--- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
***************
*** 495,504 ****
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was removed.
##############################################################################
--- 495,505 ----
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False. Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was also removed.
##############################################################################
***************
*** 1167,1175 ****
"""Generate a stream of tokens from an email Message.
- HTML tags are always stripped from text/plain sections.
- options.retain_pure_html_tags controls whether HTML tags are
- also stripped from text/html sections. Except in special cases,
- it's recommended to leave that at its default of false.
-
If options.check_octets is True, the first few undecoded characters
of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
# Remove HTML/XML tags. Also &nbsp;.
! if (part.get_content_type() == "text/plain" or
! not options.retain_pure_html_tags):
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
--- 1224,1229 ----
# Remove HTML/XML tags. Also &nbsp;.
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
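
With retain_pure_html_tags gone, tag stripping is applied to every text part. A minimal sketch of that step follows, under assumptions: HTML_TAG_RE is a simplified stand-in for the tokenizer's html_re (whose exact pattern is not shown in this diff), the entity being replaced is assumed to be the non-breaking-space entity, and the code targets Python 3.

```python
import re

# Simplified stand-in for html_re: "<" followed by a character that is
# not whitespace or an angle bracket, then up to 256 characters, then ">".
HTML_TAG_RE = re.compile(r"<[^\s<>][^>]{0,256}>", re.DOTALL)


def strip_html(text):
    """Replace non-breaking-space entities and tags with single spaces."""
    text = text.replace("&nbsp;", " ")
    return HTML_TAG_RE.sub(" ", text)
```

Requiring a non-space character right after "<" keeps plain-text comparisons such as "1 < 2" intact, and the length cap stops an unmatched "<" from swallowing a large part of the body.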
From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390
Modified Files:
pop3proxy.py
Log Message:
o The database is now saved (optionally) on exit, rather than after each
message you train with. There should be explicit save/reload commands,
but they can come later.
o It now keeps two mbox files of all the messages that have been used to
train via the web interface - thanks to Just for the patch.
o All the sockets now use async - the web interface used to freeze
whenever the proxy was awaiting a response from the POP3 server. That's
now fixed.
o It now copes with POP3 servers that don't issue a welcome command.
o The training form now appears in the training results, so you can train
on another message without having to go back to the Home page.
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11
--- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
***************
*** 47,50 ****
--- 47,74 ----
+ todo = """
+ o (Re)training interface - one message per line, quick-rendering table.
+ o Slightly-wordy index page; intro paragraph for each page.
+ o Once the training stuff is on a separate page, make the paste box
+ bigger.
+ o "Links" section (on homepage?) to project homepage, mailing list,
+ etc.
+ o "Home" link (with helmet!) at the end of each page.
+ o "Classify this" - just like Train.
+ o "Send me an email every [...] to remind me to train on new
+ messages."
+ o "Send me a status email every [...] telling how many mails have been
+ classified, etc."
+ o Deployment: Windows executable? atlaxwin and ctypes? Or just
+ webbrowser?
+ o Possibly integrate Tim Stone's SMTP code - make it use async, make
+ the training code update (rather than replace!) the database.
+ o Can it cleanly dynamically update its status display while having a
+ POP3 conversation? Hammering reload sucks.
+ o Add a command to save the database without shutting down, and one to
+ reload the database.
+ o Leave the word in the input field after a Word query.
+ """
+
import sys, re, operator, errno, getopt, cPickle, cStringIO, time
import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
self.factory(*args)
+
class BrighterAsyncChat(asynchat.async_chat):
"""An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
+ class ServerLineReader(BrighterAsyncChat):
+ """An async socket that reads lines from a remote server and
+ simply calls a callback with the data. The BayesProxy object
+ can't connect to the real POP3 server and talk to it
+ synchronously, because that would block the process."""
+
+ def __init__(self, serverName, serverPort, lineCallback):
+ BrighterAsyncChat.__init__(self)
+ self.lineCallback = lineCallback
+ self.request = ''
+ self.set_terminator('\r\n')
+ self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+ self.connect((serverName, serverPort))
+
+ def collect_incoming_data(self, data):
+ self.request = self.request + data
+
+ def found_terminator(self):
+ self.lineCallback(self.request + '\r\n')
+ self.request = ''
+
+ def handle_close(self):
+ self.lineCallback('')
+ self.close()
+
+
class POP3ProxyBase(BrighterAsyncChat):
"""An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! self.serverSocket.connect((serverName, serverPort))
! self.serverIn = self.serverSocket.makefile('r') # For reading only
! self.push(self.serverIn.readline())
def onTransaction(self, command, args, response):
--- 177,189 ----
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
+ self.response = ''
self.set_terminator('\r\n')
! self.command = '' # The POP3 command being processed...
! self.args = '' # ...and its arguments
! self.isClosing = False # Has the server closed the socket?
! self.seenAllHeaders = False # For the current RETR or TOP
! self.startTime = 0 # (ditto)
! self.serverSocket = ServerLineReader(serverName, serverPort,
! self.onServerLine)
def onTransaction(self, command, args, response):
***************
*** 139,152 ****
raise NotImplementedError
! def isMultiline(self, command, args):
! """Returns True if the given request should get a multiline
response (assuming the response is positive).
"""
! if command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif command in ['RETR', 'TOP']:
return True
! elif command in ['LIST', 'UIDL']:
return len(args) == 0
else:
--- 194,237 ----
raise NotImplementedError
! def onServerLine(self, line):
! """A line of response has been received from the POP3 server."""
! isFirstLine = not self.response
! self.response = self.response + line
!
! # Is this the line that terminates a set of headers?
! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!
! # Has the server closed its end of the socket?
! if not line:
! self.isClosing = True
!
! # If we're not processing a command, just echo the response.
! if not self.command:
! self.push(self.response)
! self.response = ''
!
! # Time out after 30 seconds for message-retrieval commands if
! # all the headers are down. The rest of the message will proxy
! # straight through.
! if self.command in ['TOP', 'RETR'] and \
! self.seenAllHeaders and time.time() > self.startTime + 30:
! self.onResponse()
! self.response = ''
! # If that's a complete response, handle it.
! elif not self.isMultiline() or line == '.\r\n' or \
! (isFirstLine and line.startswith('-ERR')):
! self.onResponse()
! self.response = ''
!
! def isMultiline(self):
! """Returns True if the request should get a multiline
response (assuming the response is positive).
"""
! if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif self.command in ['RETR', 'TOP']:
return True
! elif self.command in ['LIST', 'UIDL']:
return len(args) == 0
else:
***************
*** 155,204 ****
return False
- def readResponse(self, command, args):
- """Reads the POP3 server's response and returns a tuple of
- (response, isClosing, timedOut). isClosing is True if the
- server closes the socket, which tells found_terminator() to
- close when the response has been sent. timedOut is set if a
- TOP or RETR request was still arriving after 30 seconds, and
- tells found_terminator() to proxy the remainder of the response.
- """
- responseLines = []
- startTime = time.time()
- isMulti = self.isMultiline(command, args)
- isClosing = False
- timedOut = False
- isFirstLine = True
- seenAllHeaders = False
- while True:
- line = self.serverIn.readline()
- if not line:
- # The socket's been closed by the server, probably by QUIT.
- isClosing = True
- break
- elif not isMulti or (isFirstLine and line.startswith('-ERR')):
- # A single-line response.
- responseLines.append(line)
- break
- elif line == '.\r\n':
- # The termination line.
- responseLines.append(line)
- break
- else:
- # A normal line - append it to the response and carry on.
- responseLines.append(line)
- seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
-
- # Time out after 30 seconds for message-retrieval commands
- # if all the headers are down - found_terminator() knows how
- # to deal with this.
- if command in ['TOP', 'RETR'] and \
- seenAllHeaders and time.time() > startTime + 30:
- timedOut = True
- break
-
- isFirstLine = False
-
- return ''.join(responseLines), isClosing, timedOut
-
def collect_incoming_data(self, data):
"""Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
def found_terminator(self):
"""Asynchat override."""
- # Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverSocket.sendall(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! command, args = ('', '')
else:
splitCommand = self.request.strip().split(None, 1)
! command = splitCommand[0].upper()
! args = splitCommand[1:]
! rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cookedResponse = self.onTransaction(command, args, rawResponse)
! self.push(cookedResponse)
! self.request = ''
!
! # If readResponse() timed out, we still need to read and proxy
! # the rest of the message.
! if timedOut:
! while True:
! line = self.serverIn.readline()
! if not line:
! # The socket's been closed by the server.
! isClosing = True
! break
! elif line == '.\r\n':
! # The termination line.
! self.push(line)
! break
! else:
! # A normal line.
! self.push(line)
!
! # If readResponse() or the loop above decided that the server
! # has closed its socket, close this one when the response has
! # been sent.
! if isClosing:
self.close_when_done()
class BayesProxyListener(Listener):
--- 246,288 ----
def found_terminator(self):
"""Asynchat override."""
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
+ self.serverSocket.shutdown(2)
+ self.serverSocket.close()
self.shutdown(2)
self.close()
raise SystemExit
!
! self.serverSocket.push(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! self.command = self.args = ''
else:
+ # A proper command.
splitCommand = self.request.strip().split(None, 1)
! self.command = splitCommand[0].upper()
! self.args = splitCommand[1:]
! self.startTime = time.time()
!
! self.request = ''
!
! def onResponse(self):
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
!
! # If onServerLine() decided that the server has closed its
! # socket, close this one when the response has been sent.
! if self.isClosing:
self.close_when_done()
+ # Reset.
+ self.command = ''
+ self.args = ''
+ self.isClosing = False
+ self.seenAllHeaders = False
+
class BayesProxyListener(Listener):
***************
*** 452,456 ****
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15 }
.header { font-size: 133%% }
.content { margin: 15 }
--- 484,490 ----
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15;
! border-top: 1px solid black;
! border-bottom: 1px solid black }
.header { font-size: 133%% }
.content { margin: 15 }
***************
*** 466,470 ****
***************
*** 483,486 ****
--- 522,533 ----
\n"""
+ summary = """POP3 proxy running on port %(proxyPort)d,
+ proxying to %(serverName)s:%(serverPort)d.
+ Active POP3 conversations: %(activeSessions)d.
+ POP3 conversations this session: %(totalSessions)d.
+ Emails classified this session: %(numSpams)d spam,
+ %(numHams)d ham, %(numUnsure)d unsure.
+ """
+
wordQuery = """"""
+ train = """"""
+
def __init__(self, clientSocket, bayes):
BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = self.request.split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
--- 561,565 ----
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
if path == '/helmet.gif':
! self.pushOKHeaders('image/gif')
self.push(self.helmet)
else:
--- 606,614 ----
if path == '/helmet.gif':
! # XXX Why doesn't Expires work? Must read RFC 2616 one day.
! inOneHour = time.gmtime(time.time() + 3600)
! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
! extraHeaders = {'Expires': expiryDate}
! self.pushOKHeaders('image/gif', extraHeaders)
self.push(self.helmet)
else:
***************
*** 554,558 ****
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % url)
else:
# This is a request for a valid page; run the handler.
--- 617,621 ----
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % path)
else:
# This is a request for a valid page; run the handler.
***************
*** 561,569 ****
handler(params)
timeString = time.asctime(time.localtime())
! self.push(self.footer % timeString)
! def pushOKHeaders(self, contentType):
! self.push("HTTP/1.0 200 OK\r\n")
self.push("Content-Type: %s\r\n" % contentType)
self.push("\r\n")
--- 624,641 ----
handler(params)
timeString = time.asctime(time.localtime())
! if status.useDB:
! self.push(self.footer % (timeString, self.shutdownDB))
! else:
! self.push(self.footer % (timeString, self.shutdownPickle))
! def pushOKHeaders(self, contentType, extraHeaders={}):
! timeNow = time.gmtime(time.time())
! httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
! self.push("HTTP/1.1 200 OK\r\n")
! self.push("Connection: close\r\n")
self.push("Content-Type: %s\r\n" % contentType)
+ self.push("Date: %s\r\n" % httpNow)
+ for name, value in extraHeaders.items():
+ self.push("%s: %s\r\n" % (name, value))
self.push("\r\n")
***************
*** 583,616 ****
def onHome(self, params):
! summary = """POP3 proxy running on port %(proxyPort)d,
! proxying to %(serverName)s:%(serverPort)d.
! Active POP3 conversations: %(activeSessions)d.
! POP3 conversations this session:
! %(totalSessions)d.
! Emails classified this session: %(numSpams)d spam,
! %(numHams)d ham, %(numUnsure)d unsure.
! """ % status.__dict__
!
! train = """"""
!
! body = (self.pageSection % ('Status', summary) +
! self.pageSection % ('Word query', self.wordQuery) +
! self.pageSection % ('Train', train))
self.push(body)
def onShutdown(self, params):
! self.push("
Shutdown. Goodbye.
")
! self.push(' ') # Acts as a flush for small buffers.
self.shutdown(2)
self.close()
--- 655,675 ----
def onHome(self, params):
! """Serve up the homepage."""
! body = (self.pageSection % ('Status', self.summary % status.__dict__)+
! self.pageSection % ('Word query', self.wordQuery)+
! self.pageSection % ('Train', self.train))
self.push(body)
def onShutdown(self, params):
! """Shutdown the server, saving the pickle if requested to do so."""
! if params['how'].lower().find('save') >= 0:
! if not status.useDB and status.pickleName:
! self.push("Saving...")
! self.push(' ') # Acts as a flush for small buffers.
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("Shutdown. Goodbye.")
! self.push(' ')
self.shutdown(2)
self.close()
***************
*** 618,625 ****
def onUpload(self, params):
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
# Append the message to a file, to make it easier to rebuild
! # the database later.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
--- 677,690 ----
def onUpload(self, params):
+ """Train on an uploaded or pasted message."""
+ # Upload or paste? Spam or ham?
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
+
# Append the message to a file, to make it easier to rebuild
! # the database later. This is a temporary implementation -
! # it should keep a Corpus (from Tim Stone's forthcoming message
! # management module) to manage a cache of messages. It needs
! # to keep them for the HTML retraining interface anyway.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
***************
*** 627,642 ****
else:
f = open("_pop3proxyham.mbox", "a")
! f.write("From ???@???\n") # fake From line (XXX good enough?)
f.write(message)
! f.write("\n")
f.close()
self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
! self.push("""
Trained on your message. Saving database...
""")
! self.push(" ") # Flush... must find out how to do this properly...
! if not status.useDB and status.pickleName:
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("
" % (code, message))
!
def pushPreamble(self, name):
self.push(self.header % name)
***************
*** 681,685 ****
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
!
# Append the message to a file, to make it easier to rebuild
# the database later. This is a temporary implementation -
--- 681,685 ----
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
!
# Append the message to a file, to make it easier to rebuild
# the database later. This is a temporary implementation -
***************
*** 718,722 ****
except KeyError:
info = "'%s' does not appear in the database." % word
!
body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
--- 718,722 ----
except KeyError:
info = "'%s' does not appear in the database." % word
!
body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
***************
*** 992,996 ****
elif opt == '-u':
status.uiPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
--- 992,996 ----
elif opt == '-u':
status.uiPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
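Note on the pop3proxy.py diff above: onServerLine() treats a response as
complete when the command is single-line, when the terminating "." line
arrives, or when the first line is an -ERR. The sketch below restates that
rule as two plain functions. The names is_multiline and response_complete
are assumptions made for this note, the 30-second timeout for RETR and TOP
is left out, and unknown commands are assumed to be single-line. The new
isMultiline() hunk still refers to a bare "args" in its LIST/UIDL branch,
so the sketch takes the arguments as an explicit parameter.

```python
SINGLE_LINE = {"USER", "PASS", "APOP", "QUIT",
               "STAT", "DELE", "NOOP", "RSET", "KILL"}
ALWAYS_MULTILINE = {"RETR", "TOP"}
MULTILINE_WITHOUT_ARGS = {"LIST", "UIDL"}


def is_multiline(command, args):
    """True if a positive response to this request spans several lines."""
    if command in SINGLE_LINE:
        return False
    if command in ALWAYS_MULTILINE:
        return True
    if command in MULTILINE_WITHOUT_ARGS:
        return len(args) == 0
    return False  # assumption: treat unknown commands as single-line


def response_complete(command, args, lines):
    """lines holds the CRLF-terminated response lines received so far."""
    if not lines:
        return False
    if not is_multiline(command, args):
        return True
    if lines[0].startswith("-ERR"):
        return True  # error replies are a single line, even for RETR
    return lines[-1] == ".\r\n"


print(response_complete("LIST", [], ["+OK\r\n", "1 120\r\n"]))
print(response_complete("LIST", [], ["+OK\r\n", "1 120\r\n", ".\r\n"]))
```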
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timcv.py 1 Nov 2002 04:10:50 -0000 1.11
--- timcv.py 10 Nov 2002 19:59:22 -0000 1.12
***************
*** 15,19 ****
--HamTrain int
! The maximum number of msgs to use from each Ham set for training.
The msgs are chosen randomly. See also the -s option.
--- 15,19 ----
--HamTrain int
! The maximum number of msgs to use from each Ham set for training.
The msgs are chosen randomly. See also the -s option.
***************
*** 23,27 ****
--HamTest int
! The maximum number of msgs to use from each Ham set for testing.
The msgs are chosen randomly. See also the -s option.
--- 23,27 ----
--HamTest int
! The maximum number of msgs to use from each Ham set for testing.
The msgs are chosen randomly. See also the -s option.
***************
*** 73,79 ****
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
spamdirs[1:], train=1))
--- 73,79 ----
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
spamdirs[1:], train=1))
***************
*** 98,102 ****
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
msgs.SpamStream(sname, s2, train=1))
--- 98,102 ----
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
msgs.SpamStream(sname, s2, train=1))
Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** weaktest.py 10 Nov 2002 12:02:33 -0000 1.2
--- weaktest.py 10 Nov 2002 19:59:22 -0000 1.3
***************
*** 58,62 ****
nham = len(hamfns)
nspam = len(spamfns)
!
allfns = {}
for fn in spamfns+hamfns:
--- 58,62 ----
nham = len(hamfns)
nspam = len(spamfns)
!
allfns = {}
for fn in spamfns+hamfns:
***************
*** 133,137 ****
print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
print "Flex cost: $%.4f"%flexcost
!
def main():
import getopt
--- 133,137 ----
print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
print "Flex cost: $%.4f"%flexcost
!
def main():
import getopt
From tim_one@users.sourceforge.net Sun Nov 10 20:00:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 12:00:03 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14946
Modified Files:
msgstore.py
Log Message:
Whitespace normalization.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** msgstore.py 7 Nov 2002 22:30:09 -0000 1.23
--- msgstore.py 10 Nov 2002 19:59:59 -0000 1.24
***************
*** 397,401 ****
# Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
pass
!
return "%s\n%s\n%s" % (headers, html, body)
--- 397,401 ----
# Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
pass
!
return "%s\n%s\n%s" % (headers, html, body)
From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam
Modified Files:
profile.py
Log Message:
For the benefit of future generations, renamed some options:
    Old                               New
    ---                               ---
    robinson_probability_x            unknown_word_prob
    robinson_probability_s            unknown_word_strength
    robinson_minimum_prob_strength    minimum_prob_strength
Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** profile.py 7 Nov 2002 22:30:11 -0000 1.3
--- profile.py 11 Nov 2002 01:59:06 -0000 1.4
***************
*** 44,48 ****
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 44,48 ----
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins]
spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402
Modified Files:
Options.py classifier.py weakloop.py
Log Message:
For the benefit of future generations, renamed some options:
    Old                               New
    ---                               ---
    robinson_probability_x            unknown_word_prob
    robinson_probability_s            unknown_word_strength
    robinson_minimum_prob_strength    minimum_prob_strength
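
Note on the rename: settings saved under the old names need the same
mapping applied before they match the new option names. A minimal sketch,
assuming the settings are held in a plain dict; the helper rename_options
is hypothetical and is not part of Options.py.

```python
# Old option name -> new option name, taken from the log message above.
RENAMED_OPTIONS = {
    "robinson_probability_x": "unknown_word_prob",
    "robinson_probability_s": "unknown_word_strength",
    "robinson_minimum_prob_strength": "minimum_prob_strength",
}


def rename_options(settings):
    """Return a copy of settings with old option names replaced."""
    return {RENAMED_OPTIONS.get(name, name): value
            for name, value in settings.items()}


old = {
    "robinson_probability_x": 0.5,
    "robinson_probability_s": 0.45,
    "max_discriminators": 150,
}
print(rename_options(old))
```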
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py 8 Nov 2002 04:06:23 -0000 1.67
--- Options.py 11 Nov 2002 01:59:06 -0000 1.68
***************
*** 241,268 ****
# These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before. Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam. This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x. All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE: Gary Robinson previously used a different formula involving 'a'
! # and 'x'. The 'x' here is the same as before. The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinsons base scheme. 0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
! # The combining scheme currently detailed on Gary Robinson's web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
--- 241,268 ----
# These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before. Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting. At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam. This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob. All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests. 0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
! # The combining scheme currently detailed on the Robinson web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
***************
*** 272,284 ****
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. That is
! # the "provably most-sensitive" test Garys original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far on Tims data.
! # One systematic benefit is that it is immune to "cancellation disease". One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
--- 272,284 ----
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. This is
! # the "provably most-sensitive" test the original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease". One
! # systematic drawback is sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether it is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
***************
*** 381,387 ****
},
'Classifier': {'max_discriminators': int_cracker,
! 'robinson_probability_x': float_cracker,
! 'robinson_probability_s': float_cracker,
! 'robinson_minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
},
'Classifier': {'max_discriminators': int_cracker,
! 'unknown_word_prob': float_cracker,
! 'unknown_word_strength': float_cracker,
! 'minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
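Note on the chi-squared comment in the Options.py diff above: for n word
probabilities, -2*sum(ln(p_i)) is compared against a chi-squared
distribution with 2*n degrees of freedom, once using the probabilities and
once using their complements, and the two results S and H are combined as
(S-H+1)/2. The sketch below is a reading of that comment, not the code in
classifier.py. The names chi2q and chi_combine are assumptions, and every
probability must lie strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Return prob(chi-squared >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # assumption: no evidence means "unsure"
    # S is near 1 when the probabilities are implausibly high for random data.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H is near 1 when they are implausibly low.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


print(chi_combine([0.99] * 10))
print(chi_combine([0.01] * 10))
print(chi_combine([0.99] * 5 + [0.01] * 5))
```

The third call shows the behaviour the comment describes: strong clues in
both directions push S and H toward 1 together, so the score lands in the
middle instead of letting one side cancel the other.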
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py 7 Nov 2002 22:30:05 -0000 1.49
--- classifier.py 11 Nov 2002 01:59:06 -0000 1.50
***************
*** 70,74 ****
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
nspam = float(self.nspam or 1)
! S = options.robinson_probability_s
! StimesX = S * options.robinson_probability_x
for word, record in self.wordinfo.iteritems():
--- 322,327 ----
nspam = float(self.nspam or 1)
! S = options.unknown_word_strength
! StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
def _getclues(self, wordstream):
! mindist = options.robinson_minimum_prob_strength
! unknown = options.robinson_probability_x
clues = [] # (distance, prob, word, record) tuples
--- 449,454 ----
def _getclues(self, wordstream):
! mindist = options.minimum_prob_strength
! unknown = options.unknown_word_prob
clues = [] # (distance, prob, word, record) tuples
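Note on the classifier.py hunks above: S and StimesX feed the adjustment
that blends a word's counted spam ratio with the prior unknown_word_prob,
weighted by unknown_word_strength, and _getclues() then ignores words whose
probability is closer than minimum_prob_strength to 0.5. The sketch below
is an assumed restatement using the defaults from Options.py (0.5, 0.45 and
0.1). The function names are hypothetical, and the blend
(s*x + n*p) / (s + n) is a reading of the Options.py comments rather than a
copy of the classifier code.

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Blend the counted estimate with the prior x, weighted by s."""
    n = spamcount + hamcount
    if n == 0:
        return x  # never-seen word: fall back to the prior
    spamratio = spamcount / float(nspam or 1)
    hamratio = hamcount / float(nham or 1)
    prob = spamratio / (hamratio + spamratio)
    return (s * x + n * prob) / (s + n)


def strong_clues(probs, mindist=0.1):
    """Keep only probabilities at least mindist away from 0.5."""
    return [p for p in probs if abs(p - 0.5) >= mindist]


# A word seen once, in spam only: counting alone would say 1.0, but the
# prior pulls the estimate back to (0.45*0.5 + 1*1.0) / (0.45 + 1).
print(adjusted_spamprob(1, 0, 100, 100))
print(strong_clues([0.45, 0.55, 0.2, 0.9]))
```

With s=0 the first call would return exactly 1.0, which is the "certainty"
case the Options.py comment calls a disaster.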
Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1
--- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2
***************
*** 29,35 ****
default="""
[Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
max_discriminators = 150
--- 29,35 ----
default="""
[Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
max_discriminators = 150
***************
*** 41,47 ****
import Options
! start = (Options.options.robinson_probability_x,
! Options.options.robinson_minimum_prob_strength,
! Options.options.robinson_probability_s,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
--- 41,47 ----
import Options
! start = (Options.options.unknown_word_prob,
! Options.options.minimum_prob_strength,
! Options.options.unknown_word_strength,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
***************
*** 52,58 ****
f.write("""
[Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
[TestDriver]
--- 52,58 ----
f.write("""
[Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
[TestDriver]
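Note on the weakloop.py hunks above: the script writes a [Classifier]
section with the three renamed options formatted as %.6f and reads the
starting values back through Options.options. The round trip can be
sketched with the standard library alone. This uses Python 3's
configparser, not the Spambayes Options module, and the helper
format_classifier_section is hypothetical.

```python
import configparser


def format_classifier_section(x, mindist, s):
    """Render the three renamed options the way weakloop.py formats them."""
    return ("[Classifier]\n"
            "unknown_word_prob = %.6f\n"
            "minimum_prob_strength = %.6f\n"
            "unknown_word_strength = %.6f\n" % (x, mindist, s))


text = format_classifier_section(0.5, 0.1, 0.45)

parser = configparser.ConfigParser()
parser.read_string(text)
section = parser["Classifier"]

# Same ordering as the start tuple in weakloop.py, minus the two cutoffs.
start = (section.getfloat("unknown_word_prob"),
         section.getfloat("minimum_prob_strength"),
         section.getfloat("unknown_word_strength"))
print(text)
print(start)
```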
From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
tokenizer.py,1.63,1.64
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798
Modified Files:
Options.py tokenizer.py
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py 7 Nov 2002 22:25:46 -0000 1.66
--- Options.py 8 Nov 2002 04:06:23 -0000 1.67
***************
*** 42,53 ****
x-.*
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages. Set true to retain HTML tags in this
- # case. On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
-
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
! 'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
--- 339,343 ----
all_options = {
! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63
--- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
***************
*** 495,504 ****
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was removed.
##############################################################################
--- 495,505 ----
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False. Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was also removed.
##############################################################################
***************
*** 1167,1175 ****
"""Generate a stream of tokens from an email Message.
- HTML tags are always stripped from text/plain sections.
- options.retain_pure_html_tags controls whether HTML tags are
- also stripped from text/html sections. Except in special cases,
- it's recommended to leave that at its default of false.
-
If options.check_octets is True, the first few undecoded characters
of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
# Remove HTML/XML tags. Also &nbsp;.
! if (part.get_content_type() == "text/plain" or
! not options.retain_pure_html_tags):
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
--- 1224,1229 ----
# Remove HTML/XML tags. Also &nbsp;.
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
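
With retain_pure_html_tags gone, tag stripping applies to every text part regardless of content type. A rough sketch of that unconditional step, assuming the entity being replaced is `&nbsp;`; the regex here is a simplified stand-in for the tokenizer's own html_re, and strip_html is a hypothetical name:

```python
import re

# Simplified stand-in for the tokenizer's html_re (assumption): any <...> run.
html_re = re.compile(r"<[^>]*>")


def strip_html(text):
    """Replace &nbsp; entities and HTML/XML tags with single spaces."""
    text = text.replace("&nbsp;", " ")
    return html_re.sub(" ", text)
```

Each tag becomes a space rather than the empty string, so words on either side of a tag stay separate tokens.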
From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390
Modified Files:
pop3proxy.py
Log Message:
o The database is now saved (optionally) on exit, rather than after each
message you train with. There should be explicit save/reload commands,
but they can come later.
o It now keeps two mbox files of all the messages that have been used to
train via the web interface - thanks to Just for the patch.
o All the sockets now use async - the web interface used to freeze
whenever the proxy was awaiting a response from the POP3 server. That's
now fixed.
o It now copes with POP3 servers that don't issue a welcome command.
o The training form now appears in the training results, so you can train
on another message without having to go back to the Home page.
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11
--- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
***************
*** 47,50 ****
--- 47,74 ----
+ todo = """
+ o (Re)training interface - one message per line, quick-rendering table.
+ o Slightly-wordy index page; intro paragraph for each page.
+ o Once the training stuff is on a separate page, make the paste box
+ bigger.
+ o "Links" section (on homepage?) to project homepage, mailing list,
+ etc.
+ o "Home" link (with helmet!) at the end of each page.
+ o "Classify this" - just like Train.
+ o "Send me an email every [...] to remind me to train on new
+ messages."
+ o "Send me a status email every [...] telling how many mails have been
+ classified, etc."
+ o Deployment: Windows executable? atlaxwin and ctypes? Or just
+ webbrowser?
+ o Possibly integrate Tim Stone's SMTP code - make it use async, make
+ the training code update (rather than replace!) the database.
+ o Can it cleanly dynamically update its status display while having a
+ POP3 conversation? Hammering reload sucks.
+ o Add a command to save the database without shutting down, and one to
+ reload the database.
+ o Leave the word in the input field after a Word query.
+ """
+
import sys, re, operator, errno, getopt, cPickle, cStringIO, time
import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
self.factory(*args)
+
class BrighterAsyncChat(asynchat.async_chat):
"""An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
+ class ServerLineReader(BrighterAsyncChat):
+ """An async socket that reads lines from a remote server and
+ simply calls a callback with the data. The BayesProxy object
+ can't connect to the real POP3 server and talk to it
+ synchronously, because that would block the process."""
+
+ def __init__(self, serverName, serverPort, lineCallback):
+ BrighterAsyncChat.__init__(self)
+ self.lineCallback = lineCallback
+ self.request = ''
+ self.set_terminator('\r\n')
+ self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+ self.connect((serverName, serverPort))
+
+ def collect_incoming_data(self, data):
+ self.request = self.request + data
+
+ def found_terminator(self):
+ self.lineCallback(self.request + '\r\n')
+ self.request = ''
+
+ def handle_close(self):
+ self.lineCallback('')
+ self.close()
+
+
class POP3ProxyBase(BrighterAsyncChat):
"""An async dispatcher that understands POP3 and proxies to a POP3
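
ServerLineReader above collects incoming data until it sees the terminator, then hands one complete line to a callback. The same buffering idea can be sketched without any sockets; LineBuffer and its method names are hypothetical, and this is only an illustration of the pattern, not the proxy's code:

```python
class LineBuffer:
    """Accumulate raw chunks and call line_callback once per complete line."""

    def __init__(self, line_callback, terminator="\r\n"):
        self.line_callback = line_callback
        self.terminator = terminator
        self.pending = ""

    def feed(self, data):
        """Add a chunk; emit every line that is now complete."""
        self.pending += data
        while self.terminator in self.pending:
            line, self.pending = self.pending.split(self.terminator, 1)
            self.line_callback(line + self.terminator)

    def close(self):
        """Signal end of stream with an empty string, as handle_close() does."""
        self.line_callback("")
```

Chunks may split a line, or even the terminator itself, at any point; the callback still sees whole lines only.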
***************
*** 126,134 ****
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! self.serverSocket.connect((serverName, serverPort))
! self.serverIn = self.serverSocket.makefile('r') # For reading only
! self.push(self.serverIn.readline())
def onTransaction(self, command, args, response):
--- 177,189 ----
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
+ self.response = ''
self.set_terminator('\r\n')
! self.command = '' # The POP3 command being processed...
! self.args = '' # ...and its arguments
! self.isClosing = False # Has the server closed the socket?
! self.seenAllHeaders = False # For the current RETR or TOP
! self.startTime = 0 # (ditto)
! self.serverSocket = ServerLineReader(serverName, serverPort,
! self.onServerLine)
def onTransaction(self, command, args, response):
***************
*** 139,152 ****
raise NotImplementedError
! def isMultiline(self, command, args):
! """Returns True if the given request should get a multiline
response (assuming the response is positive).
"""
! if command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif command in ['RETR', 'TOP']:
return True
! elif command in ['LIST', 'UIDL']:
return len(args) == 0
else:
--- 194,237 ----
raise NotImplementedError
! def onServerLine(self, line):
! """A line of response has been received from the POP3 server."""
! isFirstLine = not self.response
! self.response = self.response + line
!
! # Is this the line that terminates a set of headers?
! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!
! # Has the server closed its end of the socket?
! if not line:
! self.isClosing = True
!
! # If we're not processing a command, just echo the response.
! if not self.command:
! self.push(self.response)
! self.response = ''
!
! # Time out after 30 seconds for message-retrieval commands if
! # all the headers are down. The rest of the message will proxy
! # straight through.
! if self.command in ['TOP', 'RETR'] and \
! self.seenAllHeaders and time.time() > self.startTime + 30:
! self.onResponse()
! self.response = ''
! # If that's a complete response, handle it.
! elif not self.isMultiline() or line == '.\r\n' or \
! (isFirstLine and line.startswith('-ERR')):
! self.onResponse()
! self.response = ''
!
! def isMultiline(self):
! """Returns True if the request should get a multiline
response (assuming the response is positive).
"""
! if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif self.command in ['RETR', 'TOP']:
return True
! elif self.command in ['LIST', 'UIDL']:
return len(args) == 0
else:
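
The test in onServerLine() for "is that a complete response" can be sketched as two pure functions. This version passes the command and its arguments explicitly rather than reading them from self, and it leaves out the 30-second timeout; the function names are hypothetical:

```python
SINGLE_LINE = {"USER", "PASS", "APOP", "QUIT",
               "STAT", "DELE", "NOOP", "RSET", "KILL"}


def is_multiline(command, args):
    """True if a positive response to this POP3 command spans several lines."""
    if command in SINGLE_LINE:
        return False
    if command in ("RETR", "TOP"):
        return True
    if command in ("LIST", "UIDL"):
        # With no argument the server lists every message.
        return len(args) == 0
    # Anything else is treated as single-line, as in the diff's else branch.
    return False


def is_complete(command, args, line, is_first_line):
    """True once `line` finishes the server's response to `command`."""
    if not is_multiline(command, args):
        return True
    if line == ".\r\n":
        # The POP3 termination line.
        return True
    # A negative status on the first line is always a one-line response.
    return is_first_line and line.startswith("-ERR")
```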
***************
*** 155,204 ****
return False
- def readResponse(self, command, args):
- """Reads the POP3 server's response and returns a tuple of
- (response, isClosing, timedOut). isClosing is True if the
- server closes the socket, which tells found_terminator() to
- close when the response has been sent. timedOut is set if a
- TOP or RETR request was still arriving after 30 seconds, and
- tells found_terminator() to proxy the remainder of the response.
- """
- responseLines = []
- startTime = time.time()
- isMulti = self.isMultiline(command, args)
- isClosing = False
- timedOut = False
- isFirstLine = True
- seenAllHeaders = False
- while True:
- line = self.serverIn.readline()
- if not line:
- # The socket's been closed by the server, probably by QUIT.
- isClosing = True
- break
- elif not isMulti or (isFirstLine and line.startswith('-ERR')):
- # A single-line response.
- responseLines.append(line)
- break
- elif line == '.\r\n':
- # The termination line.
- responseLines.append(line)
- break
- else:
- # A normal line - append it to the response and carry on.
- responseLines.append(line)
- seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
-
- # Time out after 30 seconds for message-retrieval commands
- # if all the headers are down - found_terminator() knows how
- # to deal with this.
- if command in ['TOP', 'RETR'] and \
- seenAllHeaders and time.time() > startTime + 30:
- timedOut = True
- break
-
- isFirstLine = False
-
- return ''.join(responseLines), isClosing, timedOut
-
def collect_incoming_data(self, data):
"""Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
def found_terminator(self):
"""Asynchat override."""
- # Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverSocket.sendall(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! command, args = ('', '')
else:
splitCommand = self.request.strip().split(None, 1)
! command = splitCommand[0].upper()
! args = splitCommand[1:]
! rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cookedResponse = self.onTransaction(command, args, rawResponse)
! self.push(cookedResponse)
! self.request = ''
!
! # If readResponse() timed out, we still need to read and proxy
! # the rest of the message.
! if timedOut:
! while True:
! line = self.serverIn.readline()
! if not line:
! # The socket's been closed by the server.
! isClosing = True
! break
! elif line == '.\r\n':
! # The termination line.
! self.push(line)
! break
! else:
! # A normal line.
! self.push(line)
!
! # If readResponse() or the loop above decided that the server
! # has closed its socket, close this one when the response has
! # been sent.
! if isClosing:
self.close_when_done()
class BayesProxyListener(Listener):
--- 246,288 ----
def found_terminator(self):
"""Asynchat override."""
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
+ self.serverSocket.shutdown(2)
+ self.serverSocket.close()
self.shutdown(2)
self.close()
raise SystemExit
!
! self.serverSocket.push(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! self.command = self.args = ''
else:
+ # A proper command.
splitCommand = self.request.strip().split(None, 1)
! self.command = splitCommand[0].upper()
! self.args = splitCommand[1:]
! self.startTime = time.time()
!
! self.request = ''
!
! def onResponse(self):
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
!
! # If onServerLine() decided that the server has closed its
! # socket, close this one when the response has been sent.
! if self.isClosing:
self.close_when_done()
+ # Reset.
+ self.command = ''
+ self.args = ''
+ self.isClosing = False
+ self.seenAllHeaders = False
+
class BayesProxyListener(Listener):
***************
*** 452,456 ****
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15 }
.header { font-size: 133%% }
.content { margin: 15 }
--- 484,490 ----
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15;
! border-top: 1px solid black;
! border-bottom: 1px solid black }
.header { font-size: 133%% }
.content { margin: 15 }
***************
*** 466,470 ****
***************
*** 483,486 ****
--- 522,533 ----
\n"""
+ summary = """POP3 proxy running on port %(proxyPort)d,
+ proxying to %(serverName)s:%(serverPort)d.
+ Active POP3 conversations: %(activeSessions)d.
+ POP3 conversations this session: %(totalSessions)d.
+ Emails classified this session: %(numSpams)d spam,
+ %(numHams)d ham, %(numUnsure)d unsure.
+ """
+
wordQuery = """"""
+ train = """"""
+
def __init__(self, clientSocket, bayes):
BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = self.request.split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
--- 561,565 ----
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
if path == '/helmet.gif':
! self.pushOKHeaders('image/gif')
self.push(self.helmet)
else:
--- 606,614 ----
if path == '/helmet.gif':
! # XXX Why doesn't Expires work? Must read RFC 2616 one day.
! inOneHour = time.gmtime(time.time() + 3600)
! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
! extraHeaders = {'Expires': expiryDate}
! self.pushOKHeaders('image/gif', extraHeaders)
self.push(self.helmet)
else:
***************
*** 554,558 ****
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % url)
else:
# This is a request for a valid page; run the handler.
--- 617,621 ----
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % path)
else:
# This is a request for a valid page; run the handler.
***************
*** 561,569 ****
handler(params)
timeString = time.asctime(time.localtime())
! self.push(self.footer % timeString)
! def pushOKHeaders(self, contentType):
! self.push("HTTP/1.0 200 OK\r\n")
self.push("Content-Type: %s\r\n" % contentType)
self.push("\r\n")
--- 624,641 ----
handler(params)
timeString = time.asctime(time.localtime())
! if status.useDB:
! self.push(self.footer % (timeString, self.shutdownDB))
! else:
! self.push(self.footer % (timeString, self.shutdownPickle))
! def pushOKHeaders(self, contentType, extraHeaders={}):
! timeNow = time.gmtime(time.time())
! httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
! self.push("HTTP/1.1 200 OK\r\n")
! self.push("Connection: close\r\n")
self.push("Content-Type: %s\r\n" % contentType)
+ self.push("Date: %s\r\n" % httpNow)
+ for name, value in extraHeaders.items():
+ self.push("%s: %s\r\n" % (name, value))
self.push("\r\n")
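
The header block built by pushOKHeaders(), including the Expires value used for helmet.gif, can be sketched as a function that returns one string. The names build_ok_headers and expires_in are hypothetical, and the `now` parameter exists only to make the output reproducible:

```python
import time

HTTP_DATE = "%a, %d %b %Y %H:%M:%S GMT"


def expires_in(seconds, now=None):
    """Return an HTTP date string `seconds` after `now` (default: current time)."""
    if now is None:
        now = time.time()
    return time.strftime(HTTP_DATE, time.gmtime(now + seconds))


def build_ok_headers(content_type, extra_headers=None, now=None):
    """Return the 200 response head as one string, ending with a blank line."""
    if now is None:
        now = time.time()
    lines = [
        "HTTP/1.1 200 OK",
        "Connection: close",
        "Content-Type: %s" % content_type,
        "Date: %s" % time.strftime(HTTP_DATE, time.gmtime(now)),
    ]
    for name, value in (extra_headers or {}).items():
        lines.append("%s: %s" % (name, value))
    return "\r\n".join(lines) + "\r\n\r\n"
```

The sketch defaults extra_headers to None rather than a shared `{}`, which avoids the usual mutable-default pitfall. time.strftime() takes day and month names from the current locale, so outside the C locale email.utils.formatdate(usegmt=True) may be the safer choice.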
***************
*** 583,616 ****
def onHome(self, params):
! summary = """POP3 proxy running on port %(proxyPort)d,
! proxying to %(serverName)s:%(serverPort)d.
! Active POP3 conversations: %(activeSessions)d.
! POP3 conversations this session:
! %(totalSessions)d.
! Emails classified this session: %(numSpams)d spam,
! %(numHams)d ham, %(numUnsure)d unsure.
! """ % status.__dict__
!
! train = """"""
!
! body = (self.pageSection % ('Status', summary) +
! self.pageSection % ('Word query', self.wordQuery) +
! self.pageSection % ('Train', train))
self.push(body)
def onShutdown(self, params):
! self.push("Shutdown. Goodbye.")
! self.push(' ') # Acts as a flush for small buffers.
self.shutdown(2)
self.close()
--- 655,675 ----
def onHome(self, params):
! """Serve up the homepage."""
! body = (self.pageSection % ('Status', self.summary % status.__dict__)+
! self.pageSection % ('Word query', self.wordQuery)+
! self.pageSection % ('Train', self.train))
self.push(body)
def onShutdown(self, params):
! """Shutdown the server, saving the pickle if requested to do so."""
! if params['how'].lower().find('save') >= 0:
! if not status.useDB and status.pickleName:
! self.push("Saving...")
! self.push(' ') # Acts as a flush for small buffers.
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("Shutdown. Goodbye.")
! self.push(' ')
self.shutdown(2)
self.close()
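
The save-on-exit decision in onShutdown() depends on three things: whether the submitted 'how' value mentions saving, whether a database is in use instead of a pickle, and whether a pickle name is configured. A small sketch of that decision, using Python 3's pickle in place of cPickle; the function names and the button label in the usage note are assumptions:

```python
import pickle


def wants_save(how):
    """True if the shutdown request mentions saving (case-insensitive)."""
    return "save" in how.lower()


def save_on_shutdown(classifier, how, use_db, pickle_name):
    """Pickle `classifier` to `pickle_name` if asked to; return True if saved."""
    if wants_save(how) and not use_db and pickle_name:
        with open(pickle_name, "wb") as fp:
            # Protocol 1, matching the cPickle.dump(..., 1) call in the diff.
            pickle.dump(classifier, fp, 1)
        return True
    return False
```

For example, save_on_shutdown(bayes, "Save & shutdown", False, "hammie.pik") would write the pickle, while a plain "Shutdown" would not.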
***************
*** 618,625 ****
def onUpload(self, params):
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
# Append the message to a file, to make it easier to rebuild
! # the database later.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
--- 677,690 ----
def onUpload(self, params):
+ """Train on an uploaded or pasted message."""
+ # Upload or paste? Spam or ham?
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
+
# Append the message to a file, to make it easier to rebuild
! # the database later. This is a temporary implementation -
! # it should keep a Corpus (from Tim Stone's forthcoming message
! # management module) to manage a cache of messages. It needs
! # to keep them for the HTML retraining interface anyway.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
***************
*** 627,642 ****
else:
f = open("_pop3proxyham.mbox", "a")
! f.write("From ???@???\n") # fake From line (XXX good enough?)
f.write(message)
! f.write("\n")
f.close()
self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
! self.push("""
Trained on your message. Saving database...
""")
! self.push(" ") # Flush... must find out how to do this properly...
! if not status.useDB and status.pickleName:
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("
")
! self.push(self.pageSection % ('Train another', self.train))
def onWordquery(self, params):
***************
*** 656,660 ****
info = "'%s' does not appear in the database." % word
! body = (self.pageSection % ("Statistics for '%s':" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
self.push(body)
--- 718,722 ----
info = "'%s' does not appear in the database." % word
! body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
self.push(body)
***************
*** 765,771 ****
else:
handler = self.handlers.get(command, self.onUnknown)
! self.push(handler(command, args))
self.request = ''
def onStat(self, command, args):
"""POP3 STAT command."""
--- 827,839 ----
else:
handler = self.handlers.get(command, self.onUnknown)
! self.push(handler(command, args)) # Or push_slowly for testing
self.request = ''
+ def push_slowly(self, response):
+ """Useful for testing."""
+ for c in response:
+ self.push(c)
+ time.sleep(0.02)
+
def onStat(self, command, args):
"""POP3 STAT command."""
***************
*** 777,781 ****
"""POP3 LIST command, with optional message number argument."""
if args:
! number = int(args)
if 0 < number <= len(self.maildrop):
return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
"""POP3 LIST command, with optional message number argument."""
if args:
! try:
! number = int(args)
! except ValueError:
! number = -1
if 0 < number <= len(self.maildrop):
return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
def onRetr(self, command, args):
"""POP3 RETR command."""
! return self._getMessage(int(args), 12345)
def onTop(self, command, args):
"""POP3 RETR command."""
! number, lines = map(int, args.split())
return self._getMessage(number, lines)
--- 874,889 ----
def onRetr(self, command, args):
"""POP3 RETR command."""
! try:
! number = int(args)
! except ValueError:
! number = -1
! return self._getMessage(number, 12345)
def onTop(self, command, args):
"""POP3 RETR command."""
! try:
! number, lines = map(int, args.split())
! except ValueError:
! number, lines = -1, -1
return self._getMessage(number, lines)
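
The test server's handlers now map malformed arguments to -1, which falls outside the `0 < number <= len(self.maildrop)` range check and so produces an error response instead of an uncaught exception. The parsing on its own can be sketched as follows; the helper names are hypothetical:

```python
def parse_message_number(args):
    """Return int(args), or -1 if args is not a valid integer."""
    try:
        return int(args)
    except ValueError:
        return -1


def parse_top_args(args):
    """Return (number, lines) for a TOP command, or (-1, -1) on bad input."""
    try:
        # Raises ValueError for non-integers and for a wrong number of fields.
        number, lines = map(int, args.split())
    except ValueError:
        return -1, -1
    return number, lines
```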
***************
*** 863,867 ****
while response.find('\n.\r\n') == -1:
response = response + proxy.recv(1000)
! assert response.find(options.hammie_header_name) != -1
# Kill the proxy and the test server.
--- 941,945 ----
while response.find('\n.\r\n') == -1:
response = response + proxy.recv(1000)
! assert response.find(options.hammie_header_name) >= 0
# Kill the proxy and the test server.
From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
tokenizer.py,1.63,1.64
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798
Modified Files:
Options.py tokenizer.py
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py 7 Nov 2002 22:25:46 -0000 1.66
--- Options.py 8 Nov 2002 04:06:23 -0000 1.67
***************
*** 42,53 ****
x-.*
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages. Set true to retain HTML tags in this
- # case. On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
-
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
! 'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
--- 339,343 ----
all_options = {
! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63
--- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
***************
*** 495,504 ****
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was removed.
##############################################################################
--- 495,505 ----
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False. Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was also removed.
##############################################################################
***************
*** 1167,1175 ****
"""Generate a stream of tokens from an email Message.
- HTML tags are always stripped from text/plain sections.
- options.retain_pure_html_tags controls whether HTML tags are
- also stripped from text/html sections. Except in special cases,
- it's recommended to leave that at its default of false.
-
If options.check_octets is True, the first few undecoded characters
of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
# Remove HTML/XML tags. Also .
! if (part.get_content_type() == "text/plain" or
! not options.retain_pure_html_tags):
! text = text.replace(' ', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
--- 1224,1229 ----
# Remove HTML/XML tags. Also .
! text = text.replace(' ', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390
Modified Files:
pop3proxy.py
Log Message:
o The database is now saved (optionally) on exit, rather than after each
message you train with. There should be explicit save/reload commands,
but they can come later.
o It now keeps two mbox files of all the messages that have been used to
train via the web interface - thanks to Just for the patch.
o All the sockets now use async - the web interface used to freeze
whenever the proxy was awaiting a response from the POP3 server. That's
now fixed.
o It now copes with POP3 servers that don't issue a welcome command.
o The training form now appears in the training results, so you can train
on another message without having to go back to the Home page.
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11
--- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
***************
*** 47,50 ****
--- 47,74 ----
+ todo = """
+ o (Re)training interface - one message per line, quick-rendering table.
+ o Slightly-wordy index page; intro paragraph for each page.
+ o Once the training stuff is on a separate page, make the paste box
+ bigger.
+ o "Links" section (on homepage?) to project homepage, mailing list,
+ etc.
+ o "Home" link (with helmet!) at the end of each page.
+ o "Classify this" - just like Train.
+ o "Send me an email every [...] to remind me to train on new
+ messages."
+ o "Send me a status email every [...] telling how many mails have been
+ classified, etc."
+ o Deployment: Windows executable? atlaxwin and ctypes? Or just
+ webbrowser?
+ o Possibly integrate Tim Stone's SMTP code - make it use async, make
+ the training code update (rather than replace!) the database.
+ o Can it cleanly dynamically update its status display while having a
+ POP3 converation? Hammering reload sucks.
+ o Add a command to save the database without shutting down, and one to
+ reload the database.
+ o Leave the word in the input field after a Word query.
+ """
+
import sys, re, operator, errno, getopt, cPickle, cStringIO, time
import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
self.factory(*args)
+
class BrighterAsyncChat(asynchat.async_chat):
"""An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
+ class ServerLineReader(BrighterAsyncChat):
+ """An async socket that reads lines from a remote server and
+ simply calls a callback with the data. The BayesProxy object
+ can't connect to the real POP3 server and talk to it
+ synchronously, because that would block the process."""
+
+ def __init__(self, serverName, serverPort, lineCallback):
+ BrighterAsyncChat.__init__(self)
+ self.lineCallback = lineCallback
+ self.request = ''
+ self.set_terminator('\r\n')
+ self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+ self.connect((serverName, serverPort))
+
+ def collect_incoming_data(self, data):
+ self.request = self.request + data
+
+ def found_terminator(self):
+ self.lineCallback(self.request + '\r\n')
+ self.request = ''
+
+ def handle_close(self):
+ self.lineCallback('')
+ self.close()
+
+
class POP3ProxyBase(BrighterAsyncChat):
"""An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! self.serverSocket.connect((serverName, serverPort))
! self.serverIn = self.serverSocket.makefile('r') # For reading only
! self.push(self.serverIn.readline())
def onTransaction(self, command, args, response):
--- 177,189 ----
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
+ self.response = ''
self.set_terminator('\r\n')
! self.command = '' # The POP3 command being processed...
! self.args = '' # ...and its arguments
! self.isClosing = False # Has the server closed the socket?
! self.seenAllHeaders = False # For the current RETR or TOP
! self.startTime = 0 # (ditto)
! self.serverSocket = ServerLineReader(serverName, serverPort,
! self.onServerLine)
def onTransaction(self, command, args, response):
***************
*** 139,152 ****
raise NotImplementedError
! def isMultiline(self, command, args):
! """Returns True if the given request should get a multiline
response (assuming the response is positive).
"""
! if command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif command in ['RETR', 'TOP']:
return True
! elif command in ['LIST', 'UIDL']:
return len(args) == 0
else:
--- 194,237 ----
raise NotImplementedError
! def onServerLine(self, line):
! """A line of response has been received from the POP3 server."""
! isFirstLine = not self.response
! self.response = self.response + line
!
! # Is this line that terminates a set of headers?
! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!
! # Has the server closed its end of the socket?
! if not line:
! self.isClosing = True
!
! # If we're not processing a command, just echo the response.
! if not self.command:
! self.push(self.response)
! self.response = ''
!
! # Time out after 30 seconds for message-retrieval commands if
! # all the headers are down. The rest of the message will proxy
! # straight through.
! if self.command in ['TOP', 'RETR'] and \
! self.seenAllHeaders and time.time() > self.startTime + 30:
! self.onResponse()
! self.response = ''
! # If that's a complete response, handle it.
! elif not self.isMultiline() or line == '.\r\n' or \
! (isFirstLine and line.startswith('-ERR')):
! self.onResponse()
! self.response = ''
!
! def isMultiline(self):
! """Returns True if the request should get a multiline
response (assuming the response is positive).
"""
! if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif self.command in ['RETR', 'TOP']:
return True
! elif self.command in ['LIST', 'UIDL']:
! return len(self.args) == 0
else:
***************
*** 155,204 ****
return False
- def readResponse(self, command, args):
- """Reads the POP3 server's response and returns a tuple of
- (response, isClosing, timedOut). isClosing is True if the
- server closes the socket, which tells found_terminator() to
- close when the response has been sent. timedOut is set if a
- TOP or RETR request was still arriving after 30 seconds, and
- tells found_terminator() to proxy the remainder of the response.
- """
- responseLines = []
- startTime = time.time()
- isMulti = self.isMultiline(command, args)
- isClosing = False
- timedOut = False
- isFirstLine = True
- seenAllHeaders = False
- while True:
- line = self.serverIn.readline()
- if not line:
- # The socket's been closed by the server, probably by QUIT.
- isClosing = True
- break
- elif not isMulti or (isFirstLine and line.startswith('-ERR')):
- # A single-line response.
- responseLines.append(line)
- break
- elif line == '.\r\n':
- # The termination line.
- responseLines.append(line)
- break
- else:
- # A normal line - append it to the response and carry on.
- responseLines.append(line)
- seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
-
- # Time out after 30 seconds for message-retrieval commands
- # if all the headers are down - found_terminator() knows how
- # to deal with this.
- if command in ['TOP', 'RETR'] and \
- seenAllHeaders and time.time() > startTime + 30:
- timedOut = True
- break
-
- isFirstLine = False
-
- return ''.join(responseLines), isClosing, timedOut
-
def collect_incoming_data(self, data):
"""Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
def found_terminator(self):
"""Asynchat override."""
- # Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverSocket.sendall(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! command, args = ('', '')
else:
splitCommand = self.request.strip().split(None, 1)
! command = splitCommand[0].upper()
! args = splitCommand[1:]
! rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cookedResponse = self.onTransaction(command, args, rawResponse)
! self.push(cookedResponse)
! self.request = ''
!
! # If readResponse() timed out, we still need to read and proxy
! # the rest of the message.
! if timedOut:
! while True:
! line = self.serverIn.readline()
! if not line:
! # The socket's been closed by the server.
! isClosing = True
! break
! elif line == '.\r\n':
! # The termination line.
! self.push(line)
! break
! else:
! # A normal line.
! self.push(line)
!
! # If readResponse() or the loop above decided that the server
! # has closed its socket, close this one when the response has
! # been sent.
! if isClosing:
self.close_when_done()
class BayesProxyListener(Listener):
--- 246,288 ----
def found_terminator(self):
"""Asynchat override."""
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
+ self.serverSocket.shutdown(2)
+ self.serverSocket.close()
self.shutdown(2)
self.close()
raise SystemExit
!
! self.serverSocket.push(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! self.command = self.args = ''
else:
+ # A proper command.
splitCommand = self.request.strip().split(None, 1)
! self.command = splitCommand[0].upper()
! self.args = splitCommand[1:]
! self.startTime = time.time()
!
! self.request = ''
!
! def onResponse(self):
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
!
! # If onServerLine() decided that the server has closed its
! # socket, close this one when the response has been sent.
! if self.isClosing:
self.close_when_done()
+ # Reset.
+ self.command = ''
+ self.args = ''
+ self.isClosing = False
+ self.seenAllHeaders = False
+
class BayesProxyListener(Listener):
***************
*** 452,456 ****
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15 }
.header { font-size: 133%% }
.content { margin: 15 }
--- 484,490 ----
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding: 5; padding-left: 15;
! border-top: 1px solid black;
! border-bottom: 1px solid black }
.header { font-size: 133%% }
.content { margin: 15 }
***************
*** 466,470 ****
***************
*** 483,486 ****
--- 522,533 ----
\n"""
+ summary = """POP3 proxy running on port %(proxyPort)d,
+ proxying to %(serverName)s:%(serverPort)d.
+ Active POP3 conversations: %(activeSessions)d.
+ POP3 conversations this session: %(totalSessions)d.
+ Emails classified this session: %(numSpams)d spam,
+ %(numHams)d ham, %(numUnsure)d unsure.
+ """
+
wordQuery = """"""
+ train = """"""
+
def __init__(self, clientSocket, bayes):
BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = self.request.split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
--- 561,565 ----
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
if path == '/helmet.gif':
! self.pushOKHeaders('image/gif')
self.push(self.helmet)
else:
--- 606,614 ----
if path == '/helmet.gif':
! # XXX Why doesn't Expires work? Must read RFC 2616 one day.
! inOneHour = time.gmtime(time.time() + 3600)
! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
! extraHeaders = {'Expires': expiryDate}
! self.pushOKHeaders('image/gif', extraHeaders)
self.push(self.helmet)
else:
***************
*** 554,558 ****
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % url)
else:
# This is a request for a valid page; run the handler.
--- 617,621 ----
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % path)
else:
# This is a request for a valid page; run the handler.
***************
*** 561,569 ****
handler(params)
timeString = time.asctime(time.localtime())
! self.push(self.footer % timeString)
! def pushOKHeaders(self, contentType):
! self.push("HTTP/1.0 200 OK\r\n")
self.push("Content-Type: %s\r\n" % contentType)
self.push("\r\n")
--- 624,641 ----
handler(params)
timeString = time.asctime(time.localtime())
! if status.useDB:
! self.push(self.footer % (timeString, self.shutdownDB))
! else:
! self.push(self.footer % (timeString, self.shutdownPickle))
! def pushOKHeaders(self, contentType, extraHeaders={}):
! timeNow = time.gmtime(time.time())
! httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
! self.push("HTTP/1.1 200 OK\r\n")
! self.push("Connection: close\r\n")
self.push("Content-Type: %s\r\n" % contentType)
+ self.push("Date: %s\r\n" % httpNow)
+ for name, value in extraHeaders.items():
+ self.push("%s: %s\r\n" % (name, value))
self.push("\r\n")
***************
*** 583,616 ****
def onHome(self, params):
! summary = """POP3 proxy running on port %(proxyPort)d,
! proxying to %(serverName)s:%(serverPort)d.
! Active POP3 conversations: %(activeSessions)d.
! POP3 conversations this session:
! %(totalSessions)d.
! Emails classified this session: %(numSpams)d spam,
! %(numHams)d ham, %(numUnsure)d unsure.
! """ % status.__dict__
!
! train = """"""
!
! body = (self.pageSection % ('Status', summary) +
! self.pageSection % ('Word query', self.wordQuery) +
! self.pageSection % ('Train', train))
self.push(body)
def onShutdown(self, params):
! self.push("Shutdown. Goodbye.")
! self.push(' ') # Acts as a flush for small buffers.
self.shutdown(2)
self.close()
--- 655,675 ----
def onHome(self, params):
! """Serve up the homepage."""
! body = (self.pageSection % ('Status', self.summary % status.__dict__)+
! self.pageSection % ('Word query', self.wordQuery)+
! self.pageSection % ('Train', self.train))
self.push(body)
def onShutdown(self, params):
! """Shutdown the server, saving the pickle if requested to do so."""
! if params['how'].lower().find('save') >= 0:
! if not status.useDB and status.pickleName:
! self.push("Saving...")
! self.push(' ') # Acts as a flush for small buffers.
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("Shutdown. Goodbye.")
! self.push(' ')
self.shutdown(2)
self.close()
***************
*** 618,625 ****
def onUpload(self, params):
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
# Append the message to a file, to make it easier to rebuild
! # the database later.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
--- 677,690 ----
def onUpload(self, params):
+ """Train on an uploaded or pasted message."""
+ # Upload or paste? Spam or ham?
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
+
# Append the message to a file, to make it easier to rebuild
! # the database later. This is a temporary implementation -
! # it should keep a Corpus (from Tim Stone's forthcoming message
! # management module) to manage a cache of messages. It needs
! # to keep them for the HTML retraining interface anyway.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
***************
*** 627,642 ****
else:
f = open("_pop3proxyham.mbox", "a")
! f.write("From ???@???\n") # fake From line (XXX good enough?)
f.write(message)
! f.write("\n")
f.close()
self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
! self.push("""
Trained on your message. Saving database...
""")
! self.push(" ") # Flush... must find out how to do this properly...
! if not status.useDB and status.pickleName:
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("
" % (code, message))
!
def pushPreamble(self, name):
self.push(self.header % name)
***************
*** 681,685 ****
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
!
# Append the message to a file, to make it easier to rebuild
# the database later. This is a temporary implementation -
--- 681,685 ----
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
!
# Append the message to a file, to make it easier to rebuild
# the database later. This is a temporary implementation -
***************
*** 718,722 ****
except KeyError:
info = "'%s' does not appear in the database." % word
!
body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
--- 718,722 ----
except KeyError:
info = "'%s' does not appear in the database." % word
!
body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
***************
*** 992,996 ****
elif opt == '-u':
status.uiPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
--- 992,996 ----
elif opt == '-u':
status.uiPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
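A note on the pop3proxy.py change above: the refactoring replaces the blocking readResponse() loop with a per-line callback, so the proxy has to decide, line by line, whether the server's reply is complete. The sketch below restates that decision as standalone functions. It is only an illustration written for a current Python; the names split_request, is_multiline and response_complete are made up here and do not appear in pop3proxy.py, and the 30-second timeout handling is left out.

```python
# Hypothetical helper names; only the command tables and the
# completion rules are taken from the diff above.
MULTILINE_ALWAYS = {'RETR', 'TOP'}
MULTILINE_WITHOUT_ARGS = {'LIST', 'UIDL'}


def split_request(request):
    """Split a raw request line into (COMMAND, [args])."""
    parts = request.strip().split(None, 1)
    if not parts:
        # Someone just hit the Enter key.
        return '', []
    return parts[0].upper(), parts[1:]


def is_multiline(command, args):
    """True if a positive reply to this request spans several lines."""
    if command in MULTILINE_ALWAYS:
        return True
    if command in MULTILINE_WITHOUT_ARGS:
        # LIST and UIDL are multiline only when given no argument.
        return len(args) == 0
    return False


def response_complete(command, args, lines):
    """True once the lines received so far form a whole reply.

    Each entry in `lines` is one server line, CRLF included.
    """
    if not lines:
        return False
    if not is_multiline(command, args):
        return True                      # single-line reply
    if lines[0].startswith('-ERR'):
        return True                      # errors are always one line
    return lines[-1] == '.\r\n'          # the termination line
```

For example, response_complete('RETR', ['1'], lines) stays false until the lone '.' line arrives, whereas a STAT reply is complete after its first line.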
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timcv.py 1 Nov 2002 04:10:50 -0000 1.11
--- timcv.py 10 Nov 2002 19:59:22 -0000 1.12
***************
*** 15,19 ****
--HamTrain int
! The maximum number of msgs to use from each Ham set for training.
The msgs are chosen randomly. See also the -s option.
--- 15,19 ----
--HamTrain int
! The maximum number of msgs to use from each Ham set for training.
The msgs are chosen randomly. See also the -s option.
***************
*** 23,27 ****
--HamTest int
! The maximum number of msgs to use from each Ham set for testing.
The msgs are chosen randomly. See also the -s option.
--- 23,27 ----
--HamTest int
! The maximum number of msgs to use from each Ham set for testing.
The msgs are chosen randomly. See also the -s option.
***************
*** 73,79 ****
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
spamdirs[1:], train=1))
--- 73,79 ----
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
spamdirs[1:], train=1))
***************
*** 98,102 ****
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
msgs.SpamStream(sname, s2, train=1))
--- 98,102 ----
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
msgs.SpamStream(sname, s2, train=1))
Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** weaktest.py 10 Nov 2002 12:02:33 -0000 1.2
--- weaktest.py 10 Nov 2002 19:59:22 -0000 1.3
***************
*** 58,62 ****
nham = len(hamfns)
nspam = len(spamfns)
!
allfns = {}
for fn in spamfns+hamfns:
--- 58,62 ----
nham = len(hamfns)
nspam = len(spamfns)
!
allfns = {}
for fn in spamfns+hamfns:
***************
*** 133,137 ****
print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
print "Flex cost: $%.4f"%flexcost
!
def main():
import getopt
--- 133,137 ----
print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
print "Flex cost: $%.4f"%flexcost
!
def main():
import getopt
From tim_one@users.sourceforge.net Sun Nov 10 20:00:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 12:00:03 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14946
Modified Files:
msgstore.py
Log Message:
Whitespace normalization.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** msgstore.py 7 Nov 2002 22:30:09 -0000 1.23
--- msgstore.py 10 Nov 2002 19:59:59 -0000 1.24
***************
*** 397,401 ****
# Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
pass
!
return "%s\n%s\n%s" % (headers, html, body)
--- 397,401 ----
# Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
pass
!
return "%s\n%s\n%s" % (headers, html, body)
From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam
Modified Files:
profile.py
Log Message:
For the benefit of future generations, renamed some options:
Old New
--- ---
robinson_probability_x unknown_word_prob
robinson_probability_s unknown_word_strength
robinson_minimum_prob_strength minimum_prob_strength
Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** profile.py 7 Nov 2002 22:30:11 -0000 1.3
--- profile.py 11 Nov 2002 01:59:06 -0000 1.4
***************
*** 44,48 ****
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 44,48 ----
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins]
spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402
Modified Files:
Options.py classifier.py weakloop.py
Log Message:
For the benefit of future generations, renamed some options:
Old New
--- ---
robinson_probability_x unknown_word_prob
robinson_probability_s unknown_word_strength
robinson_minimum_prob_strength minimum_prob_strength
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py 8 Nov 2002 04:06:23 -0000 1.67
--- Options.py 11 Nov 2002 01:59:06 -0000 1.68
***************
*** 241,268 ****
# These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before. Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam. This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x. All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE: Gary Robinson previously used a different formula involving 'a'
! # and 'x'. The 'x' here is the same as before. The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinson's base scheme. 0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
! # The combining scheme currently detailed on Gary Robinson's web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
--- 241,268 ----
# These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before. Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting. At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam. This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob. All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests. 0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
! # The combining scheme currently detailed on the Robinson web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
***************
*** 272,284 ****
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. That is
! # the "provably most-sensitive" test Gary's original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far on Tim's data.
! # One systematic benefit is that it is immune to "cancellation disease". One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
--- 272,284 ----
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. This is
! # the "provably most-sensitive" test the original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease". One
! # systematic drawback is sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
***************
*** 381,387 ****
},
'Classifier': {'max_discriminators': int_cracker,
! 'robinson_probability_x': float_cracker,
! 'robinson_probability_s': float_cracker,
! 'robinson_minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
},
'Classifier': {'max_discriminators': int_cracker,
! 'unknown_word_prob': float_cracker,
! 'unknown_word_strength': float_cracker,
! 'minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
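The Options.py comment above says that -2*sum(ln(p_i)) follows a chi-squared distribution with 2*n degrees of freedom, and that the final S and H measures are combined via (S-H+1)/2. A rough sketch of how that could be computed follows. It is an illustration only, not the code in classifier.py: the function names are invented here, the way S and H are derived from the word probabilities is an assumption, and every probability is assumed to lie strictly between 0 and 1.

```python
import math


def chi2Q(x2, v):
    """Probability that a chi-squared variate with v degrees of
    freedom exceeds x2. Only even v is supported, which is all the
    2*n case needs."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1].

    Assumption: S measures evidence for spam, H evidence for ham,
    and the two are merged with (S - H + 1) / 2.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

A vector of probabilities all equal to 0.5 gives S == H and therefore a score of exactly 0.5, while ten words at 0.99 push the score very close to 1. For very long messages math.exp(-m) can underflow to zero, which this sketch does not guard against.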
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py 7 Nov 2002 22:30:05 -0000 1.49
--- classifier.py 11 Nov 2002 01:59:06 -0000 1.50
***************
*** 70,74 ****
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
nspam = float(self.nspam or 1)
! S = options.robinson_probability_s
! StimesX = S * options.robinson_probability_x
for word, record in self.wordinfo.iteritems():
--- 322,327 ----
nspam = float(self.nspam or 1)
! S = options.unknown_word_strength
! StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
def _getclues(self, wordstream):
! mindist = options.robinson_minimum_prob_strength
! unknown = options.robinson_probability_x
clues = [] # (distance, prob, word, record) tuples
--- 449,454 ----
def _getclues(self, wordstream):
! mindist = options.minimum_prob_strength
! unknown = options.unknown_word_prob
clues = [] # (distance, prob, word, record) tuples
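To make the renamed options concrete: unknown_word_prob is the prior, unknown_word_strength is the weight given to that prior relative to the counts, and minimum_prob_strength filters out weak clues. The sketch below follows the S and StimesX names visible in the classifier.py hunk, but the surrounding arithmetic (the count ratios and the function names) is an assumption made for illustration and is not copied from the source.

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Blend the counting estimate with the prior.

    spamcount/hamcount: messages of each kind containing the word.
    nspam/nham: total messages of each kind trained on.
    """
    nspam = float(nspam or 1)
    nham = float(nham or 1)
    n = spamcount + hamcount
    if n == 0:
        # Never seen: fall back on the prior alone.
        return unknown_word_prob
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (spamratio + hamratio)   # counting estimate
    S = unknown_word_strength
    StimesX = S * unknown_word_prob
    return (StimesX + n * prob) / (S + n)


def is_strong_clue(prob, minimum_prob_strength=0.1):
    """Words closer than minimum_prob_strength to 0.5 are ignored."""
    return abs(prob - 0.5) >= minimum_prob_strength
```

With a strength of 0 a word seen once, in spam only, gets a probability of exactly 1.0, which is the "disaster" the Options.py comment warns about; with the default 0.45 the same word comes out near 0.845.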
Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1
--- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2
***************
*** 29,35 ****
default="""
[Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
max_discriminators = 150
--- 29,35 ----
default="""
[Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
max_discriminators = 150
***************
*** 41,47 ****
import Options
! start = (Options.options.robinson_probability_x,
! Options.options.robinson_minimum_prob_strength,
! Options.options.robinson_probability_s,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
--- 41,47 ----
import Options
! start = (Options.options.unknown_word_prob,
! Options.options.minimum_prob_strength,
! Options.options.unknown_word_strength,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
***************
*** 52,58 ****
f.write("""
[Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
[TestDriver]
--- 52,58 ----
f.write("""
[Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
[TestDriver]
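Anyone with a customization file that still uses the old option names will need to rename them by hand. The helper below is a hypothetical convenience, not part of the check-in: it rewrites option names at the start of a line using the old-to-new table from the log message and leaves everything else, including comments and values, untouched.

```python
import re

# Old name -> new name, as listed in the log message.
RENAMED_OPTIONS = {
    'robinson_probability_x': 'unknown_word_prob',
    'robinson_probability_s': 'unknown_word_strength',
    'robinson_minimum_prob_strength': 'minimum_prob_strength',
}

# An option line: optional indent, a name, then ':' or '='.
_OPTION_LINE = re.compile(r'^(\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*[:=])')


def rename_options(ini_text):
    """Return ini_text with the old option names replaced."""
    out = []
    for line in ini_text.splitlines(True):
        m = _OPTION_LINE.match(line)
        if m and m.group(2) in RENAMED_OPTIONS:
            # Keep the indent and everything after the name as-is.
            line = m.group(1) + RENAMED_OPTIONS[m.group(2)] + line[m.end(2):]
        out.append(line)
    return ''.join(out)
```

Reading the old file, passing its text through rename_options() and writing the result back should be enough for simple files; lines that are commented out are deliberately left alone.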
From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
tokenizer.py,1.63,1.64
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798
Modified Files:
Options.py tokenizer.py
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py 7 Nov 2002 22:25:46 -0000 1.66
--- Options.py 8 Nov 2002 04:06:23 -0000 1.67
***************
*** 42,53 ****
x-.*
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages. Set true to retain HTML tags in this
- # case. On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
-
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
! 'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
--- 339,343 ----
all_options = {
! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63
--- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
***************
*** 495,504 ****
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was removed.
##############################################################################
--- 495,505 ----
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False. Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was also removed.
##############################################################################
***************
*** 1167,1175 ****
"""Generate a stream of tokens from an email Message.
- HTML tags are always stripped from text/plain sections.
- options.retain_pure_html_tags controls whether HTML tags are
- also stripped from text/html sections. Except in special cases,
- it's recommended to leave that at its default of false.
-
If options.check_octets is True, the first few undecoded characters
of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
# Remove HTML/XML tags. Also &nbsp;.
! if (part.get_content_type() == "text/plain" or
! not options.retain_pure_html_tags):
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
--- 1224,1229 ----
# Remove HTML/XML tags. Also &nbsp;.
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
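With retain_pure_html_tags gone, the body tokenizer strips tags unconditionally. The effect can be sketched as below. Note that the definition of html_re is not shown in this diff, so the pattern here is only a simple stand-in, and the entity being replaced is assumed to be the non-breaking-space entity &nbsp;.

```python
import re

# Stand-in for tokenizer.py's html_re, which this diff does not
# show: anything between '<' and the next '>'.
html_re = re.compile(r'<[^>]*>', re.DOTALL)


def strip_html(text):
    """Replace non-breaking-space entities and HTML/XML tags by spaces."""
    text = text.replace('&nbsp;', ' ')
    return html_re.sub(' ', text)
```

Each tag becomes a single space rather than nothing, so words on either side of a tag are not run together before the text is split into tokens.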
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was also removed.
##############################################################################
***************
*** 1167,1175 ****
"""Generate a stream of tokens from an email Message.
- HTML tags are always stripped from text/plain sections.
- options.retain_pure_html_tags controls whether HTML tags are
- also stripped from text/html sections. Except in special cases,
- it's recommended to leave that at its default of false.
-
If options.check_octets is True, the first few undecoded characters
of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
# Remove HTML/XML tags. Also &nbsp;.
! if (part.get_content_type() == "text/plain" or
! not options.retain_pure_html_tags):
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
--- 1224,1229 ----
# Remove HTML/XML tags. Also &nbsp;.
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390
Modified Files:
pop3proxy.py
Log Message:
o The database is now saved (optionally) on exit, rather than after each
message you train with. There should be explicit save/reload commands,
but they can come later.
o It now keeps two mbox files of all the messages that have been used to
train via the web interface - thanks to Just for the patch.
o All the sockets now use async - the web interface used to freeze
whenever the proxy was awaiting a response from the POP3 server. That's
now fixed.
o It now copes with POP3 servers that don't issue a welcome command.
o The training form now appears in the training results, so you can train
on another message without having to go back to the Home page.
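The move to async sockets means the proxy can no longer block on readline();
it has to assemble lines from whatever chunks arrive and decide after each
line whether the server's response is complete. A rough, socket-free sketch
of that idea in Python 3. LineAssembler, is_multiline and is_complete are
hypothetical names for illustration; the real code lives in ServerLineReader
and POP3ProxyBase and also handles the 30-second timeout, which is left out
here:

```python
class LineAssembler:
    """Collect raw chunks and hand each complete line to a callback.

    Plays the role of collect_incoming_data() plus found_terminator(),
    but without a socket so that it runs standalone.
    """

    def __init__(self, line_callback, terminator="\r\n"):
        self.line_callback = line_callback
        self.terminator = terminator
        self.buffer = ""

    def feed(self, data):
        # Append the new chunk, then emit every complete line it finished.
        self.buffer += data
        while self.terminator in self.buffer:
            line, self.buffer = self.buffer.split(self.terminator, 1)
            self.line_callback(line + self.terminator)


def is_multiline(command, args):
    """True if a positive response to this POP3 command spans many lines."""
    if command in ("RETR", "TOP"):
        return True
    if command in ("LIST", "UIDL"):
        # LIST/UIDL with an argument return a single line.
        return len(args) == 0
    return False


def is_complete(command, args, line, is_first_line):
    """True once `line` ends the response to the current command."""
    return (not is_multiline(command, args)
            or line == ".\r\n"
            or (is_first_line and line.startswith("-ERR")))
```

Feeding "+OK 2 mes" and then "sages\r\n" yields a single callback with the
whole line, which is what lets the web interface keep running while a slow
POP3 server trickles data. The sketch passes the command's arguments in
explicitly rather than relying on a bare args name.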
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11
--- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
***************
*** 47,50 ****
--- 47,74 ----
+ todo = """
+ o (Re)training interface - one message per line, quick-rendering table.
+ o Slightly-wordy index page; intro paragraph for each page.
+ o Once the training stuff is on a separate page, make the paste box
+ bigger.
+ o "Links" section (on homepage?) to project homepage, mailing list,
+ etc.
+ o "Home" link (with helmet!) at the end of each page.
+ o "Classify this" - just like Train.
+ o "Send me an email every [...] to remind me to train on new
+ messages."
+ o "Send me a status email every [...] telling how many mails have been
+ classified, etc."
+ o Deployment: Windows executable? atlaxwin and ctypes? Or just
+ webbrowser?
+ o Possibly integrate Tim Stone's SMTP code - make it use async, make
+ the training code update (rather than replace!) the database.
+ o Can it cleanly dynamically update its status display while having a
+ POP3 conversation? Hammering reload sucks.
+ o Add a command to save the database without shutting down, and one to
+ reload the database.
+ o Leave the word in the input field after a Word query.
+ """
+
import sys, re, operator, errno, getopt, cPickle, cStringIO, time
import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
self.factory(*args)
+
class BrighterAsyncChat(asynchat.async_chat):
"""An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
+ class ServerLineReader(BrighterAsyncChat):
+ """An async socket that reads lines from a remote server and
+ simply calls a callback with the data. The BayesProxy object
+ can't connect to the real POP3 server and talk to it
+ synchronously, because that would block the process."""
+
+ def __init__(self, serverName, serverPort, lineCallback):
+ BrighterAsyncChat.__init__(self)
+ self.lineCallback = lineCallback
+ self.request = ''
+ self.set_terminator('\r\n')
+ self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+ self.connect((serverName, serverPort))
+
+ def collect_incoming_data(self, data):
+ self.request = self.request + data
+
+ def found_terminator(self):
+ self.lineCallback(self.request + '\r\n')
+ self.request = ''
+
+ def handle_close(self):
+ self.lineCallback('')
+ self.close()
+
+
class POP3ProxyBase(BrighterAsyncChat):
"""An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! self.serverSocket.connect((serverName, serverPort))
! self.serverIn = self.serverSocket.makefile('r') # For reading only
! self.push(self.serverIn.readline())
def onTransaction(self, command, args, response):
--- 177,189 ----
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
+ self.response = ''
self.set_terminator('\r\n')
! self.command = '' # The POP3 command being processed...
! self.args = '' # ...and its arguments
! self.isClosing = False # Has the server closed the socket?
! self.seenAllHeaders = False # For the current RETR or TOP
! self.startTime = 0 # (ditto)
! self.serverSocket = ServerLineReader(serverName, serverPort,
! self.onServerLine)
def onTransaction(self, command, args, response):
***************
*** 139,152 ****
raise NotImplementedError
! def isMultiline(self, command, args):
! """Returns True if the given request should get a multiline
response (assuming the response is positive).
"""
! if command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif command in ['RETR', 'TOP']:
return True
! elif command in ['LIST', 'UIDL']:
return len(args) == 0
else:
--- 194,237 ----
raise NotImplementedError
! def onServerLine(self, line):
! """A line of response has been received from the POP3 server."""
! isFirstLine = not self.response
! self.response = self.response + line
!
! # Is this the line that terminates a set of headers?
! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!
! # Has the server closed its end of the socket?
! if not line:
! self.isClosing = True
!
! # If we're not processing a command, just echo the response.
! if not self.command:
! self.push(self.response)
! self.response = ''
!
! # Time out after 30 seconds for message-retrieval commands if
! # all the headers are down. The rest of the message will proxy
! # straight through.
! if self.command in ['TOP', 'RETR'] and \
! self.seenAllHeaders and time.time() > self.startTime + 30:
! self.onResponse()
! self.response = ''
! # If that's a complete response, handle it.
! elif not self.isMultiline() or line == '.\r\n' or \
! (isFirstLine and line.startswith('-ERR')):
! self.onResponse()
! self.response = ''
!
! def isMultiline(self):
! """Returns True if the request should get a multiline
response (assuming the response is positive).
"""
! if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif self.command in ['RETR', 'TOP']:
return True
! elif self.command in ['LIST', 'UIDL']:
return len(args) == 0
else:
***************
*** 155,204 ****
return False
- def readResponse(self, command, args):
- """Reads the POP3 server's response and returns a tuple of
- (response, isClosing, timedOut). isClosing is True if the
- server closes the socket, which tells found_terminator() to
- close when the response has been sent. timedOut is set if a
- TOP or RETR request was still arriving after 30 seconds, and
- tells found_terminator() to proxy the remainder of the response.
- """
- responseLines = []
- startTime = time.time()
- isMulti = self.isMultiline(command, args)
- isClosing = False
- timedOut = False
- isFirstLine = True
- seenAllHeaders = False
- while True:
- line = self.serverIn.readline()
- if not line:
- # The socket's been closed by the server, probably by QUIT.
- isClosing = True
- break
- elif not isMulti or (isFirstLine and line.startswith('-ERR')):
- # A single-line response.
- responseLines.append(line)
- break
- elif line == '.\r\n':
- # The termination line.
- responseLines.append(line)
- break
- else:
- # A normal line - append it to the response and carry on.
- responseLines.append(line)
- seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
-
- # Time out after 30 seconds for message-retrieval commands
- # if all the headers are down - found_terminator() knows how
- # to deal with this.
- if command in ['TOP', 'RETR'] and \
- seenAllHeaders and time.time() > startTime + 30:
- timedOut = True
- break
-
- isFirstLine = False
-
- return ''.join(responseLines), isClosing, timedOut
-
def collect_incoming_data(self, data):
"""Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
def found_terminator(self):
"""Asynchat override."""
- # Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverSocket.sendall(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! command, args = ('', '')
else:
splitCommand = self.request.strip().split(None, 1)
! command = splitCommand[0].upper()
! args = splitCommand[1:]
! rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cookedResponse = self.onTransaction(command, args, rawResponse)
! self.push(cookedResponse)
! self.request = ''
!
! # If readResponse() timed out, we still need to read and proxy
! # the rest of the message.
! if timedOut:
! while True:
! line = self.serverIn.readline()
! if not line:
! # The socket's been closed by the server.
! isClosing = True
! break
! elif line == '.\r\n':
! # The termination line.
! self.push(line)
! break
! else:
! # A normal line.
! self.push(line)
!
! # If readResponse() or the loop above decided that the server
! # has closed its socket, close this one when the response has
! # been sent.
! if isClosing:
self.close_when_done()
class BayesProxyListener(Listener):
--- 246,288 ----
def found_terminator(self):
"""Asynchat override."""
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
+ self.serverSocket.shutdown(2)
+ self.serverSocket.close()
self.shutdown(2)
self.close()
raise SystemExit
!
! self.serverSocket.push(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! self.command = self.args = ''
else:
+ # A proper command.
splitCommand = self.request.strip().split(None, 1)
! self.command = splitCommand[0].upper()
! self.args = splitCommand[1:]
! self.startTime = time.time()
!
! self.request = ''
!
! def onResponse(self):
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
!
! # If onServerLine() decided that the server has closed its
! # socket, close this one when the response has been sent.
! if self.isClosing:
self.close_when_done()
+ # Reset.
+ self.command = ''
+ self.args = ''
+ self.isClosing = False
+ self.seenAllHeaders = False
+
class BayesProxyListener(Listener):
***************
*** 452,456 ****
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15 }
.header { font-size: 133%% }
.content { margin: 15 }
--- 484,490 ----
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15;
! border-top: 1px solid black;
! border-bottom: 1px solid black }
.header { font-size: 133%% }
.content { margin: 15 }
***************
*** 466,470 ****
***************
*** 483,486 ****
--- 522,533 ----
\n"""
+ summary = """POP3 proxy running on port %(proxyPort)d,
+ proxying to %(serverName)s:%(serverPort)d.
+ Active POP3 conversations: %(activeSessions)d.
+ POP3 conversations this session: %(totalSessions)d.
+ Emails classified this session: %(numSpams)d spam,
+ %(numHams)d ham, %(numUnsure)d unsure.
+ """
+
wordQuery = """"""
+ train = """"""
+
def __init__(self, clientSocket, bayes):
BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = self.request.split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
--- 561,565 ----
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
if path == '/helmet.gif':
! self.pushOKHeaders('image/gif')
self.push(self.helmet)
else:
--- 606,614 ----
if path == '/helmet.gif':
! # XXX Why doesn't Expires work? Must read RFC 2616 one day.
! inOneHour = time.gmtime(time.time() + 3600)
! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
! extraHeaders = {'Expires': expiryDate}
! self.pushOKHeaders('image/gif', extraHeaders)
self.push(self.helmet)
else:
***************
*** 554,558 ****
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % url)
else:
# This is a request for a valid page; run the handler.
--- 617,621 ----
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % path)
else:
# This is a request for a valid page; run the handler.
***************
*** 561,569 ****
handler(params)
timeString = time.asctime(time.localtime())
! self.push(self.footer % timeString)
! def pushOKHeaders(self, contentType):
! self.push("HTTP/1.0 200 OK\r\n")
self.push("Content-Type: %s\r\n" % contentType)
self.push("\r\n")
--- 624,641 ----
handler(params)
timeString = time.asctime(time.localtime())
! if status.useDB:
! self.push(self.footer % (timeString, self.shutdownDB))
! else:
! self.push(self.footer % (timeString, self.shutdownPickle))
! def pushOKHeaders(self, contentType, extraHeaders={}):
! timeNow = time.gmtime(time.time())
! httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
! self.push("HTTP/1.1 200 OK\r\n")
! self.push("Connection: close\r\n")
self.push("Content-Type: %s\r\n" % contentType)
+ self.push("Date: %s\r\n" % httpNow)
+ for name, value in extraHeaders.items():
+ self.push("%s: %s\r\n" % (name, value))
self.push("\r\n")
***************
*** 583,616 ****
def onHome(self, params):
! summary = """POP3 proxy running on port %(proxyPort)d,
! proxying to %(serverName)s:%(serverPort)d.
! Active POP3 conversations: %(activeSessions)d.
! POP3 conversations this session:
! %(totalSessions)d.
! Emails classified this session: %(numSpams)d spam,
! %(numHams)d ham, %(numUnsure)d unsure.
! """ % status.__dict__
!
! train = """"""
!
! body = (self.pageSection % ('Status', summary) +
! self.pageSection % ('Word query', self.wordQuery) +
! self.pageSection % ('Train', train))
self.push(body)
def onShutdown(self, params):
! self.push("
Shutdown. Goodbye.
")
! self.push(' ') # Acts as a flush for small buffers.
self.shutdown(2)
self.close()
--- 655,675 ----
def onHome(self, params):
! """Serve up the homepage."""
! body = (self.pageSection % ('Status', self.summary % status.__dict__)+
! self.pageSection % ('Word query', self.wordQuery)+
! self.pageSection % ('Train', self.train))
self.push(body)
def onShutdown(self, params):
! """Shutdown the server, saving the pickle if requested to do so."""
! if params['how'].lower().find('save') >= 0:
! if not status.useDB and status.pickleName:
! self.push("Saving...")
! self.push(' ') # Acts as a flush for small buffers.
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("Shutdown. Goodbye.")
! self.push(' ')
self.shutdown(2)
self.close()
***************
*** 618,625 ****
def onUpload(self, params):
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
# Append the message to a file, to make it easier to rebuild
! # the database later.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
--- 677,690 ----
def onUpload(self, params):
+ """Train on an uploaded or pasted message."""
+ # Upload or paste? Spam or ham?
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
+
# Append the message to a file, to make it easier to rebuild
! # the database later. This is a temporary implementation -
! # it should keep a Corpus (from Tim Stone's forthcoming message
! # management module) to manage a cache of messages. It needs
! # to keep them for the HTML retraining interface anyway.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
***************
*** 627,642 ****
else:
f = open("_pop3proxyham.mbox", "a")
! f.write("From ???@???\n") # fake From line (XXX good enough?)
f.write(message)
! f.write("\n")
f.close()
self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
! self.push("""
Trained on your message. Saving database...
""")
! self.push(" ") # Flush... must find out how to do this properly...
! if not status.useDB and status.pickleName:
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("
" % (code, message))
!
def pushPreamble(self, name):
self.push(self.header % name)
***************
*** 681,685 ****
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
!
# Append the message to a file, to make it easier to rebuild
# the database later. This is a temporary implementation -
--- 681,685 ----
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
!
# Append the message to a file, to make it easier to rebuild
# the database later. This is a temporary implementation -
***************
*** 718,722 ****
except KeyError:
info = "'%s' does not appear in the database." % word
!
body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
--- 718,722 ----
except KeyError:
info = "'%s' does not appear in the database." % word
!
body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
***************
*** 992,996 ****
elif opt == '-u':
status.uiPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
--- 992,996 ----
elif opt == '-u':
status.uiPort = int(arg)
!
# Do whatever we've been asked to do...
if not opts and not args:
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timcv.py 1 Nov 2002 04:10:50 -0000 1.11
--- timcv.py 10 Nov 2002 19:59:22 -0000 1.12
***************
*** 15,19 ****
--HamTrain int
! The maximum number of msgs to use from each Ham set for training.
The msgs are chosen randomly. See also the -s option.
--- 15,19 ----
--HamTrain int
! The maximum number of msgs to use from each Ham set for training.
The msgs are chosen randomly. See also the -s option.
***************
*** 23,27 ****
--HamTest int
! The maximum number of msgs to use from each Ham set for testing.
The msgs are chosen randomly. See also the -s option.
--- 23,27 ----
--HamTest int
! The maximum number of msgs to use from each Ham set for testing.
The msgs are chosen randomly. See also the -s option.
***************
*** 73,79 ****
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
spamdirs[1:], train=1))
--- 73,79 ----
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets),
hamdirs[1:], train=1),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets),
spamdirs[1:], train=1))
***************
*** 98,102 ****
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
msgs.SpamStream(sname, s2, train=1))
--- 98,102 ----
del s2[i]
! d.train(msgs.HamStream(hname, h2, train=1),
msgs.SpamStream(sname, s2, train=1))
Index: weaktest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weaktest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** weaktest.py 10 Nov 2002 12:02:33 -0000 1.2
--- weaktest.py 10 Nov 2002 19:59:22 -0000 1.3
***************
*** 58,62 ****
nham = len(hamfns)
nspam = len(spamfns)
!
allfns = {}
for fn in spamfns+hamfns:
--- 58,62 ----
nham = len(hamfns)
nspam = len(spamfns)
!
allfns = {}
for fn in spamfns+hamfns:
***************
*** 133,137 ****
print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
print "Flex cost: $%.4f"%flexcost
!
def main():
import getopt
--- 133,137 ----
print "Total cost: $%.2f"%(FPW * fp + FNW * fn + UNW * unsure)
print "Flex cost: $%.4f"%flexcost
!
def main():
import getopt
From tim_one@users.sourceforge.net Sun Nov 10 20:00:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 12:00:03 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv14946
Modified Files:
msgstore.py
Log Message:
Whitespace normalization.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** msgstore.py 7 Nov 2002 22:30:09 -0000 1.23
--- msgstore.py 10 Nov 2002 19:59:59 -0000 1.24
***************
*** 397,401 ****
# Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
pass
!
return "%s\n%s\n%s" % (headers, html, body)
--- 397,401 ----
# Find all attachments with PR_ATTACH_MIME_TAG_A=multipart/signed
pass
!
return "%s\n%s\n%s" % (headers, html, body)
From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins] spambayes/pspam/pspam profile.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv5402/pspam/pspam
Modified Files:
profile.py
Log Message:
For the benefit of future generations, renamed some options:
Old New
--- ---
robinson_probability_x unknown_word_prob
robinson_probability_s unknown_word_strength
robinson_minimum_prob_strength minimum_prob_strength
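The first two renamed options describe a prior that is blended with the
counted evidence for each word. A minimal Python 3 sketch of that blend,
assuming the usual form (s*x + n*p) / (s + n), which matches the S and
StimesX variables visible in classifier.py. The function name and its
signature are illustrative only, not the project's API:

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Blend the counted spam probability of a word with the prior.

    spamcount/hamcount: messages of each kind that contain the word.
    nspam/nham: total messages of each kind seen in training.
    """
    nspam = float(nspam or 1)   # avoid dividing by zero before any training
    nham = float(nham or 1)
    n = spamcount + hamcount
    if n == 0:
        # Never-seen word: the prior is all we have.
        return unknown_word_prob
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (spamratio + hamratio)
    s = unknown_word_strength
    return (s * unknown_word_prob + n * prob) / (s + n)
```

A word seen once, in spam only, scores about 0.845 with the defaults instead
of a certain 1.0. Setting the strength to 0 restores the raw count estimate,
and a very large strength pulls every word toward unknown_word_prob.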
Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** profile.py 7 Nov 2002 22:30:11 -0000 1.3
--- profile.py 11 Nov 2002 01:59:06 -0000 1.4
***************
*** 44,48 ****
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 44,48 ----
class WordInfo(Persistent):
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
From tim_one@users.sourceforge.net Mon Nov 11 01:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 10 Nov 2002 17:59:08 -0800
Subject: [Spambayes-checkins]
spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402
Modified Files:
Options.py classifier.py weakloop.py
Log Message:
For the benefit of future generations, renamed some options:
Old New
--- ---
robinson_probability_x unknown_word_prob
robinson_probability_s unknown_word_strength
robinson_minimum_prob_strength minimum_prob_strength
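The Options.py comments touched by this change also describe how
minimum_prob_strength and chi-squared combining work together when a message
is scored. A rough Python 3 sketch under stated assumptions: the helper
names are illustrative, word probabilities are assumed to lie strictly
between 0 and 1, and the max_discriminators limit is ignored:

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def chi_combine(probs, minimum_prob_strength=0.1):
    """Combine word spam probabilities into one message score in [0, 1]."""
    # Ignore bland words, i.e. those too close to 0.5 to carry evidence.
    clues = [p for p in probs if abs(p - 0.5) >= minimum_prob_strength]
    n = len(clues)
    if n == 0:
        return 0.5
    # -2*sum(ln(p_i)) follows chi-squared with 2*n degrees of freedom.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in clues), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in clues), 2 * n)
    # Rob Hooft's combination of the spam and ham measures.
    return (S - H + 1.0) / 2.0
```

Five words at 0.99 score very close to 1, five at 0.01 very close to 0, and
one of each lands at 0.5, which is the middle ground the comment refers to.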
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py 8 Nov 2002 04:06:23 -0000 1.67
--- Options.py 11 Nov 2002 01:59:06 -0000 1.68
***************
*** 241,268 ****
# These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before. Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam. This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x. All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE: Gary Robinson previously used a different formula involving 'a'
! # and 'x'. The 'x' here is the same as before. The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinsons base scheme. 0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
! # The combining scheme currently detailed on Gary Robinsons web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
--- 241,268 ----
# These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before. Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting. At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam. This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob. All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
# When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
! # tests. 0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
! # The combining scheme currently detailed on the Robinson web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
***************
*** 272,284 ****
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. That is
! # the "provably most-sensitive" test Garys original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far on Tims data.
! # One systematic benefit is that it is immune to "cancellation disease". One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
--- 272,284 ----
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom. This is
! # the "provably most-sensitive" test the original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live. This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease". One
! # systematic drawback is sensitivity to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H)).
***************
*** 381,387 ****
},
'Classifier': {'max_discriminators': int_cracker,
! 'robinson_probability_x': float_cracker,
! 'robinson_probability_s': float_cracker,
! 'robinson_minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
},
'Classifier': {'max_discriminators': int_cracker,
! 'unknown_word_prob': float_cracker,
! 'unknown_word_strength': float_cracker,
! 'minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py 7 Nov 2002 22:30:05 -0000 1.49
--- classifier.py 11 Nov 2002 01:59:06 -0000 1.50
***************
*** 70,74 ****
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.robinson_probability_x):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
# a word is no longer being used, it's just wasting space.
! def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
nspam = float(self.nspam or 1)
! S = options.robinson_probability_s
! StimesX = S * options.robinson_probability_x
for word, record in self.wordinfo.iteritems():
--- 322,327 ----
nspam = float(self.nspam or 1)
! S = options.unknown_word_strength
! StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
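For readers unfamiliar with the two renamed options, here is a hedged sketch of how an unknown-word probability and strength can enter a Robinson-style adjustment. The function name and its argument list are assumptions made for illustration, the default values are the ones shown in the weakloop.py defaults of this same checkin, and the real loop in classifier.py may differ in detail.

```python
def adjusted_spamprob(hamcount, spamcount, nham, nspam,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Return a spam probability for one word, pulled toward the prior.

    hamcount/spamcount: how many ham/spam messages contained the word.
    nham/nspam: total number of ham/spam messages trained on.
    """
    n = hamcount + spamcount
    if n == 0:
        # Never seen: fall back to the prior for unknown words.
        return unknown_word_prob

    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    prob = spamratio / (hamratio + spamratio)

    # Weighted average of the prior (weight S) and the observed
    # probability (weight n); rare words stay close to the prior.
    S = unknown_word_strength
    StimesX = S * unknown_word_prob
    return (StimesX + n * prob) / (S + n)
```

A word seen once, only in spam, scores about 0.845 here instead of 1.0; the more often it is seen, the less the prior matters.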
***************
*** 449,454 ****
def _getclues(self, wordstream):
! mindist = options.robinson_minimum_prob_strength
! unknown = options.robinson_probability_x
clues = [] # (distance, prob, word, record) tuples
--- 449,454 ----
def _getclues(self, wordstream):
! mindist = options.minimum_prob_strength
! unknown = options.unknown_word_prob
clues = [] # (distance, prob, word, record) tuples
Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py 10 Nov 2002 12:08:40 -0000 1.1
--- weakloop.py 11 Nov 2002 01:59:06 -0000 1.2
***************
*** 29,35 ****
default="""
[Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
max_discriminators = 150
--- 29,35 ----
default="""
[Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
max_discriminators = 150
***************
*** 41,47 ****
import Options
! start = (Options.options.robinson_probability_x,
! Options.options.robinson_minimum_prob_strength,
! Options.options.robinson_probability_s,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
--- 41,47 ----
import Options
! start = (Options.options.unknown_word_prob,
! Options.options.minimum_prob_strength,
! Options.options.unknown_word_strength,
Options.options.spam_cutoff,
Options.options.ham_cutoff)
***************
*** 52,58 ****
f.write("""
[Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
[TestDriver]
--- 52,58 ----
f.write("""
[Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
[TestDriver]
From tim_one@users.sourceforge.net Fri Nov 8 04:06:29 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 07 Nov 2002 20:06:29 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.66,1.67
tokenizer.py,1.63,1.64
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31798
Modified Files:
Options.py tokenizer.py
Log Message:
Removed option retain_pure_html_tags; nobody enables that anymore, and it's
hard to believe it would ever help anymore (except as an HTML detector).
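With the option gone, tag stripping applies to every text part unconditionally. A rough sketch of that step follows; simple_html_re is a deliberately simplified stand-in (an assumption) and is not the html_re pattern used in tokenizer.py.

```python
import re

# Simplified stand-in for tokenizer.py's html_re (an assumption): anything
# of the form <...> with no nested angle brackets, at most 256 chars inside.
simple_html_re = re.compile(r"<[^<>]{0,256}>")


def strip_html(text):
    """Replace non-breaking-space entities and HTML/XML tags with spaces."""
    text = text.replace("&nbsp;", " ")
    return simple_html_re.sub(" ", text)
```

For example, strip_html("<b>Buy</b>&nbsp;now").split() leaves just the words "Buy" and "now", while a lone "<" in running text such as "a < b" is left untouched.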
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** Options.py 7 Nov 2002 22:25:46 -0000 1.66
--- Options.py 8 Nov 2002 04:06:23 -0000 1.67
***************
*** 42,53 ****
x-.*
- # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
- # from pure text/html messages. Set true to retain HTML tags in this
- # case. On the c.l.py corpus, it helps to set this true because any
- # sign of HTML is so despised on tech lists; however, the advantage
- # of setting it true eventually vanishes even there given enough
- # training data.
- retain_pure_html_tags: False
-
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
--- 42,45 ----
***************
*** 347,352 ****
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
! 'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
--- 339,343 ----
all_options = {
! 'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.63
retrieving revision 1.64
diff -C2 -d -r1.63 -r1.64
*** tokenizer.py 7 Nov 2002 22:30:08 -0000 1.63
--- tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
***************
*** 495,504 ****
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaults to False.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was removed.
##############################################################################
--- 495,505 ----
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
! # was introduced to control this, and it defaulted to False. Later, as the
! # algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
! # and it defaults to False. Later: ignore_redundant_html was also removed.
##############################################################################
***************
*** 1167,1175 ****
"""Generate a stream of tokens from an email Message.
- HTML tags are always stripped from text/plain sections.
- options.retain_pure_html_tags controls whether HTML tags are
- also stripped from text/html sections. Except in special cases,
- it's recommended to leave that at its default of false.
-
If options.check_octets is True, the first few undecoded characters
of application/octet-stream parts of the message body become tokens.
--- 1168,1171 ----
***************
*** 1228,1235 ****
# Remove HTML/XML tags. Also &nbsp;.
! if (part.get_content_type() == "text/plain" or
! not options.retain_pure_html_tags):
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
--- 1224,1229 ----
# Remove HTML/XML tags. Also &nbsp;.
! text = text.replace('&nbsp;', ' ')
! text = html_re.sub(' ', text)
# Tokenize everything in the body.
From richiehindle@users.sourceforge.net Fri Nov 8 08:00:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Fri, 08 Nov 2002 00:00:25 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25390
Modified Files:
pop3proxy.py
Log Message:
o The database is now saved (optionally) on exit, rather than after each
message you train with. There should be explicit save/reload commands,
but they can come later.
o It now keeps two mbox files of all the messages that have been used to
train via the web interface - thanks to Just for the patch.
o All the sockets now use async - the web interface used to freeze
whenever the proxy was awaiting a response from the POP3 server. That's
now fixed.
o It now copes with POP3 servers that don't issue a welcome greeting.
o The training form now appears in the training results, so you can train
on another message without having to go back to the Home page.
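The move to async sockets means the proxy can no longer loop on readline() until a response is finished; it has to decide, one line at a time, whether the server's response is complete. Below is a socket-free sketch of that decision, assuming the same command categories as the diff that follows. The names is_multiline and ResponseAssembler are hypothetical, and the 30-second timeout for RETR and TOP is left out for brevity.

```python
SINGLE_LINE = ('USER', 'PASS', 'APOP', 'QUIT',
               'STAT', 'DELE', 'NOOP', 'RSET', 'KILL')


def is_multiline(command, args):
    """True if a positive response to this request spans several lines."""
    if command in SINGLE_LINE:
        return False
    if command in ('RETR', 'TOP'):
        return True
    if command in ('LIST', 'UIDL'):
        # With a message number these return a single line.
        return len(args) == 0
    return False


class ResponseAssembler:
    """Collects server lines for one command until the response is complete."""

    def __init__(self, command, args=()):
        self.command = command
        self.args = list(args)
        self.lines = []

    def feed(self, line):
        """Add one line; return the full response when done, else None."""
        is_first = not self.lines
        self.lines.append(line)
        done = (not is_multiline(self.command, self.args)
                or line == '.\r\n'                            # terminator
                or (is_first and line.startswith('-ERR')))    # error reply
        if not done:
            return None
        response = ''.join(self.lines)
        self.lines = []
        return response
```

Feeding a RETR assembler '+OK\r\n', a header line, and then '.\r\n' returns None twice and the joined response on the third call; an '-ERR' first line completes immediately.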
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** pop3proxy.py 7 Nov 2002 22:27:02 -0000 1.11
--- pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
***************
*** 47,50 ****
--- 47,74 ----
+ todo = """
+ o (Re)training interface - one message per line, quick-rendering table.
+ o Slightly-wordy index page; intro paragraph for each page.
+ o Once the training stuff is on a separate page, make the paste box
+ bigger.
+ o "Links" section (on homepage?) to project homepage, mailing list,
+ etc.
+ o "Home" link (with helmet!) at the end of each page.
+ o "Classify this" - just like Train.
+ o "Send me an email every [...] to remind me to train on new
+ messages."
+ o "Send me a status email every [...] telling how many mails have been
+ classified, etc."
+ o Deployment: Windows executable? atlaxwin and ctypes? Or just
+ webbrowser?
+ o Possibly integrate Tim Stone's SMTP code - make it use async, make
+ the training code update (rather than replace!) the database.
+ o Can it cleanly dynamically update its status display while having a
+ POP3 conversation? Hammering reload sucks.
+ o Add a command to save the database without shutting down, and one to
+ reload the database.
+ o Leave the word in the input field after a Word query.
+ """
+
import sys, re, operator, errno, getopt, cPickle, cStringIO, time
import socket, asyncore, asynchat, cgi, urlparse, webbrowser
***************
*** 92,95 ****
--- 116,120 ----
self.factory(*args)
+
class BrighterAsyncChat(asynchat.async_chat):
"""An asynchat.async_chat that doesn't give spurious warnings on
***************
*** 110,113 ****
--- 135,164 ----
+ class ServerLineReader(BrighterAsyncChat):
+ """An async socket that reads lines from a remote server and
+ simply calls a callback with the data. The BayesProxy object
+ can't connect to the real POP3 server and talk to it
+ synchronously, because that would block the process."""
+
+ def __init__(self, serverName, serverPort, lineCallback):
+ BrighterAsyncChat.__init__(self)
+ self.lineCallback = lineCallback
+ self.request = ''
+ self.set_terminator('\r\n')
+ self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
+ self.connect((serverName, serverPort))
+
+ def collect_incoming_data(self, data):
+ self.request = self.request + data
+
+ def found_terminator(self):
+ self.lineCallback(self.request + '\r\n')
+ self.request = ''
+
+ def handle_close(self):
+ self.lineCallback('')
+ self.close()
+
+
class POP3ProxyBase(BrighterAsyncChat):
"""An async dispatcher that understands POP3 and proxies to a POP3
***************
*** 126,134 ****
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
self.set_terminator('\r\n')
! self.serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
! self.serverSocket.connect((serverName, serverPort))
! self.serverIn = self.serverSocket.makefile('r') # For reading only
! self.push(self.serverIn.readline())
def onTransaction(self, command, args, response):
--- 177,189 ----
BrighterAsyncChat.__init__(self, clientSocket)
self.request = ''
+ self.response = ''
self.set_terminator('\r\n')
! self.command = '' # The POP3 command being processed...
! self.args = '' # ...and its arguments
! self.isClosing = False # Has the server closed the socket?
! self.seenAllHeaders = False # For the current RETR or TOP
! self.startTime = 0 # (ditto)
! self.serverSocket = ServerLineReader(serverName, serverPort,
! self.onServerLine)
def onTransaction(self, command, args, response):
***************
*** 139,152 ****
raise NotImplementedError
! def isMultiline(self, command, args):
! """Returns True if the given request should get a multiline
response (assuming the response is positive).
"""
! if command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif command in ['RETR', 'TOP']:
return True
! elif command in ['LIST', 'UIDL']:
return len(args) == 0
else:
--- 194,237 ----
raise NotImplementedError
! def onServerLine(self, line):
! """A line of response has been received from the POP3 server."""
! isFirstLine = not self.response
! self.response = self.response + line
!
! # Is this the line that terminates a set of headers?
! self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
!
! # Has the server closed its end of the socket?
! if not line:
! self.isClosing = True
!
! # If we're not processing a command, just echo the response.
! if not self.command:
! self.push(self.response)
! self.response = ''
!
! # Time out after 30 seconds for message-retrieval commands if
! # all the headers are down. The rest of the message will proxy
! # straight through.
! if self.command in ['TOP', 'RETR'] and \
! self.seenAllHeaders and time.time() > self.startTime + 30:
! self.onResponse()
! self.response = ''
! # If that's a complete response, handle it.
! elif not self.isMultiline() or line == '.\r\n' or \
! (isFirstLine and line.startswith('-ERR')):
! self.onResponse()
! self.response = ''
!
! def isMultiline(self):
! """Returns True if the request should get a multiline
response (assuming the response is positive).
"""
! if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
! 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
return False
! elif self.command in ['RETR', 'TOP']:
return True
! elif self.command in ['LIST', 'UIDL']:
return len(self.args) == 0
else:
***************
*** 155,204 ****
return False
- def readResponse(self, command, args):
- """Reads the POP3 server's response and returns a tuple of
- (response, isClosing, timedOut). isClosing is True if the
- server closes the socket, which tells found_terminator() to
- close when the response has been sent. timedOut is set if a
- TOP or RETR request was still arriving after 30 seconds, and
- tells found_terminator() to proxy the remainder of the response.
- """
- responseLines = []
- startTime = time.time()
- isMulti = self.isMultiline(command, args)
- isClosing = False
- timedOut = False
- isFirstLine = True
- seenAllHeaders = False
- while True:
- line = self.serverIn.readline()
- if not line:
- # The socket's been closed by the server, probably by QUIT.
- isClosing = True
- break
- elif not isMulti or (isFirstLine and line.startswith('-ERR')):
- # A single-line response.
- responseLines.append(line)
- break
- elif line == '.\r\n':
- # The termination line.
- responseLines.append(line)
- break
- else:
- # A normal line - append it to the response and carry on.
- responseLines.append(line)
- seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
-
- # Time out after 30 seconds for message-retrieval commands
- # if all the headers are down - found_terminator() knows how
- # to deal with this.
- if command in ['TOP', 'RETR'] and \
- seenAllHeaders and time.time() > startTime + 30:
- timedOut = True
- break
-
- isFirstLine = False
-
- return ''.join(responseLines), isClosing, timedOut
-
def collect_incoming_data(self, data):
"""Asynchat override."""
--- 240,243 ----
***************
*** 207,256 ****
def found_terminator(self):
"""Asynchat override."""
- # Send the request to the server and read the reply.
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
self.shutdown(2)
self.close()
raise SystemExit
! self.serverSocket.sendall(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! command, args = ('', '')
else:
splitCommand = self.request.strip().split(None, 1)
! command = splitCommand[0].upper()
! args = splitCommand[1:]
! rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cookedResponse = self.onTransaction(command, args, rawResponse)
! self.push(cookedResponse)
! self.request = ''
!
! # If readResponse() timed out, we still need to read and proxy
! # the rest of the message.
! if timedOut:
! while True:
! line = self.serverIn.readline()
! if not line:
! # The socket's been closed by the server.
! isClosing = True
! break
! elif line == '.\r\n':
! # The termination line.
! self.push(line)
! break
! else:
! # A normal line.
! self.push(line)
!
! # If readResponse() or the loop above decided that the server
! # has closed its socket, close this one when the response has
! # been sent.
! if isClosing:
self.close_when_done()
class BayesProxyListener(Listener):
--- 246,288 ----
def found_terminator(self):
"""Asynchat override."""
if self.request.strip().upper() == 'KILL':
self.serverSocket.sendall('QUIT\r\n')
self.send("+OK, dying.\r\n")
+ self.serverSocket.shutdown(2)
+ self.serverSocket.close()
self.shutdown(2)
self.close()
raise SystemExit
!
! self.serverSocket.push(self.request + '\r\n')
if self.request.strip() == '':
# Someone just hit the Enter key.
! self.command = self.args = ''
else:
+ # A proper command.
splitCommand = self.request.strip().split(None, 1)
! self.command = splitCommand[0].upper()
! self.args = splitCommand[1:]
! self.startTime = time.time()
!
! self.request = ''
!
! def onResponse(self):
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
!
! # If onServerLine() decided that the server has closed its
! # socket, close this one when the response has been sent.
! if self.isClosing:
self.close_when_done()
+ # Reset.
+ self.command = ''
+ self.args = ''
+ self.isClosing = False
+ self.seenAllHeaders = False
+
class BayesProxyListener(Listener):
***************
*** 452,456 ****
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15 }
.header { font-size: 133%% }
.content { margin: 15 }
--- 484,490 ----
table { font: 90%% arial, swiss, helvetica }
form { margin: 0 }
! .banner { background: #c0e0ff; padding=5; padding-left: 15;
! border-top: 1px solid black;
! border-bottom: 1px solid black }
.header { font-size: 133%% }
.content { margin: 15 }
***************
*** 466,470 ****
***************
*** 483,486 ****
--- 522,533 ----
\n"""
+ summary = """POP3 proxy running on port %(proxyPort)d,
+ proxying to %(serverName)s:%(serverPort)d.
+ Active POP3 conversations: %(activeSessions)d.
+ POP3 conversations this session: %(totalSessions)d.
+ Emails classified this session: %(numSpams)d spam,
+ %(numHams)d ham, %(numUnsure)d unsure.
+ """
+
wordQuery = """"""
+ train = """"""
+
def __init__(self, clientSocket, bayes):
BrighterAsyncChat.__init__(self, clientSocket)
***************
*** 502,506 ****
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = self.request.split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
--- 561,565 ----
"""Asynchat override.
Read and parse the HTTP request and call an on handler."""
! requestLine, headers = (self.request+'\r\n').split('\r\n', 1)
try:
method, url, version = requestLine.strip().split()
***************
*** 547,551 ****
if path == '/helmet.gif':
! self.pushOKHeaders('image/gif')
self.push(self.helmet)
else:
--- 606,614 ----
if path == '/helmet.gif':
! # XXX Why doesn't Expires work? Must read RFC 2616 one day.
! inOneHour = time.gmtime(time.time() + 3600)
! expiryDate = time.strftime('%a, %d %b %Y %H:%M:%S GMT', inOneHour)
! extraHeaders = {'Expires': expiryDate}
! self.pushOKHeaders('image/gif', extraHeaders)
self.push(self.helmet)
else:
***************
*** 554,558 ****
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % url)
else:
# This is a request for a valid page; run the handler.
--- 617,621 ----
handler = getattr(self, 'on' + name)
except AttributeError:
! self.pushError(404, "Not found: '%s'" % path)
else:
# This is a request for a valid page; run the handler.
***************
*** 561,569 ****
handler(params)
timeString = time.asctime(time.localtime())
! self.push(self.footer % timeString)
! def pushOKHeaders(self, contentType):
! self.push("HTTP/1.0 200 OK\r\n")
self.push("Content-Type: %s\r\n" % contentType)
self.push("\r\n")
--- 624,641 ----
handler(params)
timeString = time.asctime(time.localtime())
! if status.useDB:
! self.push(self.footer % (timeString, self.shutdownDB))
! else:
! self.push(self.footer % (timeString, self.shutdownPickle))
! def pushOKHeaders(self, contentType, extraHeaders={}):
! timeNow = time.gmtime(time.time())
! httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
! self.push("HTTP/1.1 200 OK\r\n")
! self.push("Connection: close\r\n")
self.push("Content-Type: %s\r\n" % contentType)
+ self.push("Date: %s\r\n" % httpNow)
+ for name, value in extraHeaders.items():
+ self.push("%s: %s\r\n" % (name, value))
self.push("\r\n")
***************
*** 583,616 ****
def onHome(self, params):
! summary = """POP3 proxy running on port %(proxyPort)d,
! proxying to %(serverName)s:%(serverPort)d.
! Active POP3 conversations: %(activeSessions)d.
! POP3 conversations this session:
! %(totalSessions)d.
! Emails classified this session: %(numSpams)d spam,
! %(numHams)d ham, %(numUnsure)d unsure.
! """ % status.__dict__
!
! train = """"""
!
! body = (self.pageSection % ('Status', summary) +
! self.pageSection % ('Word query', self.wordQuery) +
! self.pageSection % ('Train', train))
self.push(body)
def onShutdown(self, params):
! self.push("
Shutdown. Goodbye.
")
! self.push(' ') # Acts as a flush for small buffers.
self.shutdown(2)
self.close()
--- 655,675 ----
def onHome(self, params):
! """Serve up the homepage."""
! body = (self.pageSection % ('Status', self.summary % status.__dict__)+
! self.pageSection % ('Word query', self.wordQuery)+
! self.pageSection % ('Train', self.train))
self.push(body)
def onShutdown(self, params):
! """Shut down the server, saving the pickle if requested to do so."""
! if params['how'].lower().find('save') >= 0:
! if not status.useDB and status.pickleName:
! self.push("Saving...")
! self.push(' ') # Acts as a flush for small buffers.
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("Shutdown. Goodbye.")
! self.push(' ')
self.shutdown(2)
self.close()
***************
*** 618,625 ****
def onUpload(self, params):
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
# Append the message to a file, to make it easier to rebuild
! # the database later.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
--- 677,690 ----
def onUpload(self, params):
+ """Train on an uploaded or pasted message."""
+ # Upload or paste? Spam or ham?
message = params.get('file') or params.get('text')
isSpam = (params['which'] == 'spam')
+
# Append the message to a file, to make it easier to rebuild
! # the database later. This is a temporary implementation -
! # it should keep a Corpus (from Tim Stone's forthcoming message
! # management module) to manage a cache of messages. It needs
! # to keep them for the HTML retraining interface anyway.
message = message.replace('\r\n', '\n').replace('\r', '\n')
if isSpam:
***************
*** 627,642 ****
else:
f = open("_pop3proxyham.mbox", "a")
! f.write("From ???@???\n") # fake From line (XXX good enough?)
f.write(message)
! f.write("\n")
f.close()
self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
! self.push("""
Trained on your message. Saving database...
""")
! self.push(" ") # Flush... must find out how to do this properly...
! if not status.useDB and status.pickleName:
! fp = open(status.pickleName, 'wb')
! cPickle.dump(self.bayes, fp, 1)
! fp.close()
! self.push("
")
! self.push(self.pageSection % ('Train another', self.train))
def onWordquery(self, params):
***************
*** 656,660 ****
info = "'%s' does not appear in the database." % word
! body = (self.pageSection % ("Statistics for '%s':" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
self.push(body)
--- 718,722 ----
info = "'%s' does not appear in the database." % word
! body = (self.pageSection % ("Statistics for '%s'" % word, info) +
self.pageSection % ('Word query', self.wordQuery))
self.push(body)
***************
*** 765,771 ****
else:
handler = self.handlers.get(command, self.onUnknown)
! self.push(handler(command, args))
self.request = ''
def onStat(self, command, args):
"""POP3 STAT command."""
--- 827,839 ----
else:
handler = self.handlers.get(command, self.onUnknown)
! self.push(handler(command, args)) # Or push_slowly for testing
self.request = ''
+ def push_slowly(self, response):
+ """Useful for testing."""
+ for c in response:
+ self.push(c)
+ time.sleep(0.02)
+
def onStat(self, command, args):
"""POP3 STAT command."""
***************
*** 777,781 ****
"""POP3 LIST command, with optional message number argument."""
if args:
! number = int(args)
if 0 < number <= len(self.maildrop):
return "+OK %d\r\n" % len(self.maildrop[number-1])
--- 845,852 ----
"""POP3 LIST command, with optional message number argument."""
if args:
! try:
! number = int(args)
! except ValueError:
! number = -1
if 0 < number <= len(self.maildrop):
return "+OK %d\r\n" % len(self.maildrop[number-1])
***************
*** 803,811 ****
def onRetr(self, command, args):
"""POP3 RETR command."""
! return self._getMessage(int(args), 12345)
def onTop(self, command, args):
"""POP3 TOP command."""
! number, lines = map(int, args.split())
return self._getMessage(number, lines)
--- 874,889 ----
def onRetr(self, command, args):
"""POP3 RETR command."""
! try:
! number = int(args)
! except ValueError:
! number = -1
! return self._getMessage(number, 12345)
def onTop(self, command, args):
"""POP3 TOP command."""
! try:
! number, lines = map(int, args.split())
! except ValueError:
! number, lines = -1, -1
return self._getMessage(number, lines)
***************
*** 863,867 ****
while response.find('\n.\r\n') == -1:
response = response + proxy.recv(1000)
! assert response.find(options.hammie_header_name) != -1
# Kill the proxy and the test server.
--- 941,945 ----
while response.find('\n.\r\n') == -1:
response = response + proxy.recv(1000)
! assert response.find(options.hammie_header_name) >= 0
# Kill the proxy and the test server.
From jvr@users.sourceforge.net Sat Nov 9 18:05:44 2002
From: jvr@users.sourceforge.net (Just van Rossum)
Date: Sat, 09 Nov 2002 10:05:44 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20814
Modified Files:
pop3proxy.py
Log Message:
force word query to be lowercase, making the UI case insensitive
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** pop3proxy.py 8 Nov 2002 08:00:20 -0000 1.12
--- pop3proxy.py 9 Nov 2002 18:05:42 -0000 1.13
***************
*** 704,707 ****
--- 704,708 ----
def onWordquery(self, params):
word = params['word']
+ word = word.lower()
try:
# Must be a better way to get __dict__ for a new-style class...
From tim_one@users.sourceforge.net Mon Nov 11 23:26:21 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 11 Nov 2002 15:26:21 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.64,1.65
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10237
Modified Files:
tokenizer.py
Log Message:
An idea from Anthony Baxter: decode Subject lines, so that they're
tokenized in decoded form, and so that they generate charset tokens too.
This had minor good effects in both our tests.
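A small sketch of the idea in modern Python, where the module is spelled email.header rather than email.Header. The function name subject_tokens and the whitespace-only word splitting are assumptions made for illustration; the real tokenizer uses its own regular expressions and tokenize_word(), as the diff below shows.

```python
from email.header import decode_header


def subject_tokens(subject):
    """Yield charset tokens and word tokens from a raw Subject header."""
    for chunk, charset in decode_header(subject):
        if charset is not None:
            yield 'subjectcharset:' + charset
        if isinstance(chunk, bytes):
            # Python 3 returns bytes for encoded words; decode for splitting.
            try:
                chunk = chunk.decode(charset or 'ascii', 'replace')
            except LookupError:
                # Unknown charset name: fall back to a lossless 8-bit decode.
                chunk = chunk.decode('latin-1')
        for word in chunk.split():
            yield 'subject:' + word
```

An encoded subject therefore contributes both its decoded words and a token naming the charset, which can itself be a useful clue.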
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.64
retrieving revision 1.65
diff -C2 -d -r1.64 -r1.65
*** tokenizer.py 8 Nov 2002 04:06:24 -0000 1.64
--- tokenizer.py 11 Nov 2002 23:26:18 -0000 1.65
***************
*** 5,8 ****
--- 5,9 ----
import email
+ import email.Header
import email.Message
import email.Errors
***************
*** 1054,1062 ****
# but real benefit to keeping case intact in this specific context.
x = msg.get('subject', '')
! for w in subject_word_re.findall(x):
! for t in tokenize_word(w):
! yield 'subject:' + t
! for w in punctuation_run_re.findall(x):
! yield 'subject:' + w
# Dang -- I can't use Sender:. If I do,
--- 1055,1066 ----
# but real benefit to keeping case intact in this specific context.
x = msg.get('subject', '')
! for x, subjcharset in email.Header.decode_header(x):
! if subjcharset is not None:
! yield 'subjectcharset:' + subjcharset
! for w in subject_word_re.findall(x):
! for t in tokenize_word(w):
! yield 'subject:' + t
! for w in punctuation_run_re.findall(x):
! yield 'subject:' + w
# Dang -- I can't use Sender:. If I do,
From anthonybaxter@users.sourceforge.net Tue Nov 12 00:37:21 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 16:37:21 -0800
Subject: [Spambayes-checkins] website docs.ht,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv5772
Modified Files:
docs.ht
Log Message:
few more definitions
Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** docs.ht 19 Sep 2002 23:39:24 -0000 1.3
--- docs.ht 12 Nov 2002 00:37:19 -0000 1.4
***************
*** 27,32 ****
f-n, FN
(abbrev.) false negative
f-p, FP
(abbrev.) false positive
!
-
--- 27,34 ----
f-n, FN
(abbrev.) false negative
f-p, FP
(abbrev.) false positive
!
corpus
in this context, a body of messages. Usually referring to a
! training database.
!
hapax, hapax legomenon
a word or form occurring only once in a
! document or corpus. (plural is hapax legomena)
From tim.one@comcast.net Tue Nov 12 00:40:44 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 19:40:44 -0500
Subject: [Spambayes-checkins] website docs.ht,1.3,1.4
In-Reply-To:
Message-ID:
> !
hapax, hapax legomenon
a word or form occurring only once in a
> ! document or corpus. (plural is hapax legomena)
>
Ya, but even I'm not that anal -- I usually say hapaxes. hapaxora would be
a hoot too .
From anthony@interlink.com.au Tue Nov 12 00:43:58 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 11:43:58 +1100
Subject: [Spambayes-checkins] website docs.ht,1.3,1.4
In-Reply-To:
Message-ID: <200211120043.gAC0hwp09308@localhost.localdomain>
>>> Tim Peters wrote
> > !
hapax, hapax legomenon
a word or form occurring only once in a
> > ! document or corpus. (plural is hapax legomena)
> >
>
> Ya, but even I'm not that anal -- I usually say hapaxes. hapaxora would be
> a hoot too
Hapax legomena sounds like something that the CDC sends the black
helicopters in to lock down an outbreak of...
From tim_one@users.sourceforge.net Tue Nov 12 04:52:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 11 Nov 2002 20:52:14 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.29,1.30 manager.py,1.33,1.34
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv27097/Outlook2000
Modified Files:
addin.py manager.py
Log Message:
In the "show clues" msg, for each word give the raw ham and spam counts
too.
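A rough sketch of the table this produces, detached from Outlook. The WordRecord type, the format_clues name and the exact column widths are assumptions for illustration; only the idea of looking each clue up in the classifier's word table and printing "-" for unknown words comes from the diff below.

```python
from collections import namedtuple

# Hypothetical stand-in for the classifier's per-word record.
WordRecord = namedtuple("WordRecord", "hamcount spamcount")


def format_clues(clues, wordinfo):
    """Return a text table of (word, spamprob, #ham, #spam) rows.

    clues: iterable of (word, prob) pairs.
    wordinfo: mapping from word to a record with hamcount/spamcount.
    """
    lines = ["%-35s %-12s %8s %6s" % ("word", "spamprob", "#ham", "#spam")]
    for word, prob in clues:
        record = wordinfo.get(word)
        if record is not None:
            nham, nspam = record.hamcount, record.spamcount
        else:
            # Synthetic clues may have no record in the database.
            nham = nspam = "-"
        lines.append("%-35s %-12g %8s %6s" % (repr(word), prob, nham, nspam))
    return "\n".join(lines)
```

Seeing the raw counts next to each probability makes it easier to tell a strong clue backed by hundreds of messages from one backed by a single sighting.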
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** addin.py 7 Nov 2002 22:30:08 -0000 1.29
--- addin.py 12 Nov 2002 04:52:12 -0000 1.30
***************
*** 225,233 ****
# Format the clues.
push("
\n")
for word, prob in clues:
word = repr(word)
! push(escape(word) + ' ' * (30 - len(word)))
! push(' %g\n' % prob)
push("
\n")
# Now the raw text of the message, as best we can
push("
Message Stream:
")
--- 225,244 ----
# Format the clues.
push("
\n")
+ push("word spamprob #ham #spam\n")
+ format = " %-12g %8s %6s\n"
+ c = mgr.GetClassifier()
+ fetchword = c.wordinfo.get
for word, prob in clues:
+ record = fetchword(word)
+ if record:
+ nham = record.hamcount
+ nspam = record.spamcount
+ else:
+ nham = nspam = "-"
word = repr(word)
! push(escape(word) + " " * (35-len(word)))
! push(format % (prob, nham, nspam))
push("
\n")
+
# Now the raw text of the message, as best we can
push("
Message Stream:
")
Index: manager.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** manager.py 7 Nov 2002 22:30:09 -0000 1.33
--- manager.py 12 Nov 2002 04:52:12 -0000 1.34
***************
*** 223,226 ****
--- 223,230 ----
self.bayes_dirty = False
+ def GetClassifier(self):
+ """Return the classifier we're using."""
+ return self.bayes
+
def SaveConfig(self):
if self.verbose > 1:
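The "show clues" change above lines up each clue word with its spam
probability and its raw ham and spam counts. A minimal sketch of that
formatting follows, written for a current Python rather than the Python 2.2
of the check-in. The function name format_clues and the plain dict of
(hamcount, spamcount) tuples are assumptions made for illustration; the real
code reads record.hamcount and record.spamcount from the classifier's
wordinfo.

```python
def format_clues(clues, wordinfo):
    """Return one formatted line per (word, prob) clue.

    wordinfo is a hypothetical stand-in for the classifier's word table:
    a dict mapping word -> (hamcount, spamcount).
    """
    fmt = " %-12g %8s %6s"
    lines = []
    for word, prob in clues:
        record = wordinfo.get(word)
        if record:
            nham, nspam = record
        else:
            # Synthetic tokens may have no training record at all.
            nham = nspam = "-"
        shown = repr(word)
        # Pad the repr to a fixed column width, as the check-in does.
        lines.append(shown + " " * (35 - len(shown)) + fmt % (prob, nham, nspam))
    return lines


if __name__ == "__main__":
    table = {"free": (1, 40)}
    for line in format_clues([("free", 0.99), ("xyzzy", 0.5)], table):
        print(line)
```

Words with no record show "-" in both count columns, matching the else
branch in the diff.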
From anthonybaxter@users.sourceforge.net Tue Nov 12 06:21:41 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 22:21:41 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.65,1.66
Options.py,1.68,1.69
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16090
Modified Files:
tokenizer.py Options.py
Log Message:
New tokenizer option 'address_headers'. Allows the mining of headers
other than 'from' for email addresses and names (e.g. to or cc).
By default, it's just set to 'from' for now.
In addition, address headers (including from) now get decoded and parsed
correctly, rather than by a whitespace split.
This shows a quite nice improvement for me.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.65
retrieving revision 1.66
diff -C2 -d -r1.65 -r1.66
*** tokenizer.py 11 Nov 2002 23:26:18 -0000 1.65
--- tokenizer.py 12 Nov 2002 06:21:38 -0000 1.66
***************
*** 7,10 ****
--- 7,12 ----
import email.Header
import email.Message
+ import email.Header
+ import email.Utils
import email.Errors
import re
***************
*** 1072,1082 ****
# # one (smalls wins & losses across runs, overall
# # not significant), so leaving it out
! for field in ('from',):
! prefix = field + ':'
! x = msg.get(field, 'none').lower()
! for w in x.split():
! for t in tokenize_word(w):
! yield prefix + t
!
# To:
# Cc:
--- 1074,1096 ----
# # one (smalls wins & losses across runs, overall
# # not significant), so leaving it out
! # To:, Cc: # These can help, if your ham and spam are sourced
! # # from the same location. If not, they'll be horrible.
! for field in options.address_headers:
! addrlist = msg.get_all(field, [])
! if not addrlist:
! yield field + ":none"
! for addrs in addrlist:
! for rname,ename in email.Utils.getaddresses([addrs]):
! if rname:
! for rname,rcharset in email.Header.decode_header(rname):
! for w in rname.lower().split():
! for t in tokenize_word(w):
! yield field+'realname:'+t
! if rcharset is not None:
! yield field+'charset:'+rcharset
! if ename:
! for w in ename.lower().split('@'):
! for t in tokenize_word(w):
! yield field+'email:'+t
# To:
# Cc:
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.68
retrieving revision 1.69
diff -C2 -d -r1.68 -r1.69
*** Options.py 11 Nov 2002 01:59:06 -0000 1.68
--- Options.py 12 Nov 2002 06:21:38 -0000 1.69
***************
*** 90,93 ****
--- 90,101 ----
mine_received_headers: False
+ # Mine the following address headers. If you have mixed source corpuses
+ # (as opposed to a mixed sauce walrus, which is delicious!) then you
+ # probably don't want to use 'to' or 'cc')
+ # Address headers will be decoded, and will generate charset tokens as
+ # well as the real address.
+ # others to consider: to, cc, reply-to, errors-to, sender, ...
+ address_headers: from
+
# If legitimate mail contains things that look like text to the tokenizer
# and turning turning off this option helps (perhaps binary attachments get
***************
*** 340,343 ****
--- 348,352 ----
all_options = {
'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
+ 'address_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
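The address_headers option above is a whitespace-separated list that the
options table turns into a set with s.split(). A rough sketch of the same
idea using the standard configparser module of a current Python is shown
here. The helper name crack_address_headers is hypothetical, and this is not
how Options.py itself loads its defaults.

```python
import configparser

EXAMPLE_INI = """
[Tokenizer]
address_headers: from to cc sender reply-to
"""


def crack_address_headers(ini_text):
    """Parse ini_text and return the address_headers option as a set."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get("Tokenizer", "address_headers")
    # Same cracking rule as the check-in: split on whitespace, keep unique names.
    return set(raw.split())


if __name__ == "__main__":
    print(sorted(crack_address_headers(EXAMPLE_INI)))
```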
From anthonybaxter@users.sourceforge.net Tue Nov 12 07:03:22 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 23:03:22 -0800
Subject: [Spambayes-checkins] spambayes/pspam scoremsg.py,1.2,1.3
update.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv26080
Modified Files:
scoremsg.py update.py
Log Message:
whitespace normalisation.
Index: scoremsg.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/scoremsg.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** scoremsg.py 7 Nov 2002 22:30:10 -0000 1.2
--- scoremsg.py 12 Nov 2002 07:03:20 -0000 1.3
***************
*** 39,43 ****
## print
## print msg
!
if __name__ == "__main__":
main(sys.stdin)
--- 39,43 ----
## print
## print msg
!
if __name__ == "__main__":
main(sys.stdin)
Index: update.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/update.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** update.py 7 Nov 2002 22:30:10 -0000 1.2
--- update.py 12 Nov 2002 07:03:20 -0000 1.3
***************
*** 39,43 ****
if not folder_exists(profile.hams, p):
profile.add_ham(p)
!
for spam in options.spam_folders:
p = os.path.join(options.folder_dir, spam)
--- 39,43 ----
if not folder_exists(profile.hams, p):
profile.add_ham(p)
!
for spam in options.spam_folders:
p = os.path.join(options.folder_dir, spam)
***************
*** 49,53 ****
profile.update()
get_transaction().commit()
!
db.close()
--- 49,53 ----
profile.update()
get_transaction().commit()
!
db.close()
***************
*** 58,61 ****
if k == '-F':
FORCE_REBUILD = True
!
main(FORCE_REBUILD)
--- 58,61 ----
if k == '-F':
FORCE_REBUILD = True
!
main(FORCE_REBUILD)
From anthonybaxter@users.sourceforge.net Tue Nov 12 07:03:22 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 11 Nov 2002 23:03:22 -0800
Subject: [Spambayes-checkins]
spambayes/pspam/pspam folder.py,1.2,1.3 options.py,1.1,1.2
profile.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes/pspam/pspam
In directory usw-pr-cvs1:/tmp/cvs-serv26080/pspam
Modified Files:
folder.py options.py profile.py
Log Message:
whitespace normalisation.
Index: folder.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/folder.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** folder.py 7 Nov 2002 22:30:11 -0000 1.2
--- folder.py 12 Nov 2002 07:03:20 -0000 1.3
***************
*** 68,72 ****
self.messages[msgid] = msg
new.insert(msg)
!
removed = difference(self.messages, cur)
for msgid in removed.keys():
--- 68,72 ----
self.messages[msgid] = msg
new.insert(msg)
!
removed = difference(self.messages, cur)
for msgid in removed.keys():
Index: options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/options.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** options.py 4 Nov 2002 04:44:20 -0000 1.1
--- options.py 12 Nov 2002 07:03:20 -0000 1.2
***************
*** 1,5 ****
from Options import options, all_options, \
boolean_cracker, float_cracker, int_cracker, string_cracker
! from sets import Set
all_options["Score"] = {'max_ham': float_cracker,
--- 1,5 ----
from Options import options, all_options, \
boolean_cracker, float_cracker, int_cracker, string_cracker
! from sets import Set
all_options["Score"] = {'max_ham': float_cracker,
Index: profile.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pspam/pspam/profile.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** profile.py 11 Nov 2002 01:59:06 -0000 1.4
--- profile.py 12 Nov 2002 07:03:20 -0000 1.5
***************
*** 92,96 ****
get_transaction().commit()
log("updated probabilities")
!
def _update(self, folders, is_spam):
changed = False
--- 92,96 ----
get_transaction().commit()
log("updated probabilities")
!
def _update(self, folders, is_spam):
changed = False
***************
*** 100,104 ****
if added:
log("added %d" % len(added))
! if removed:
log("removed %d" % len(removed))
get_transaction().commit()
--- 100,104 ----
if added:
log("added %d" % len(added))
! if removed:
log("removed %d" % len(removed))
get_transaction().commit()
***************
*** 117,121 ****
for msg in removed.keys():
self.classifier.unlearn(tokenize(msg), is_spam, False)
! if removed:
log("unlearned")
del removed
--- 117,121 ----
for msg in removed.keys():
self.classifier.unlearn(tokenize(msg), is_spam, False)
! if removed:
log("unlearned")
del removed
From tim_one@users.sourceforge.net Tue Nov 12 22:56:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 14:56:26 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 addin.py,1.30,1.31 msgstore.py,1.24,1.25
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv21157/Outlook2000
Modified Files:
addin.py msgstore.py
Log Message:
Removed the strip_mime_headers business. I'm not sure whether it ever
helped, but at this point it was definitely happening too late to do
any good.
Index: addin.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** addin.py 12 Nov 2002 04:52:12 -0000 1.30
--- addin.py 12 Nov 2002 22:56:24 -0000 1.31
***************
*** 244,248 ****
push("
\n")
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** msgstore.py 10 Nov 2002 19:59:59 -0000 1.24
--- msgstore.py 12 Nov 2002 22:56:24 -0000 1.25
***************
*** 49,53 ****
def __init__(self):
self.unread = False
! def GetEmailPackageObject(self, strip_mime_headers=True):
# Return a "read-only" Python email package object
# "read-only" in that changes will never be reflected to the real store.
--- 49,53 ----
def __init__(self):
self.unread = False
! def GetEmailPackageObject(self):
# Return a "read-only" Python email package object
# "read-only" in that changes will never be reflected to the real store.
***************
*** 420,424 ****
self.mapi_object = self.msgstore._OpenEntry(self.id)
! def GetEmailPackageObject(self, strip_mime_headers=True):
import email
# XXX If this was originally a MIME msg, we're hosed at this point --
--- 420,424 ----
self.mapi_object = self.msgstore._OpenEntry(self.id)
! def GetEmailPackageObject(self):
import email
# XXX If this was originally a MIME msg, we're hosed at this point --
***************
*** 433,451 ****
print "FAILED to create email.message from: ", `text`
raise
-
- if strip_mime_headers:
- # If we're going to pass this to a scoring function, the MIME
- # headers must be stripped, else the email pkg will run off
- # looking for MIME boundaries that don't exist. The charset
- # info from the original MIME armor is also lost, and we don't
- # want the email pkg to try decoding the msg a second time
- # (assuming Outlook is in fact already decoding text originally
- # in base64 and quoted-printable).
- # We want to retain the MIME headers if we're just displaying
- # the msg stream.
- if msg.has_key('content-type'):
- del msg['content-type']
- if msg.has_key('content-transfer-encoding'):
- del msg['content-transfer-encoding']
return msg
--- 433,436 ----
From tim_one@users.sourceforge.net Tue Nov 12 23:12:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:12:14 -0800
Subject: [Spambayes-checkins] spambayes mboxutils.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31150
Modified Files:
mboxutils.py
Log Message:
New utility function extract_headers(), for very simple-minded header
extraction.
Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** mboxutils.py 6 Nov 2002 01:57:39 -0000 1.4
--- mboxutils.py 12 Nov 2002 23:12:11 -0000 1.5
***************
*** 25,28 ****
--- 25,29 ----
import mailbox
import email.Message
+ import re
class DirOfTxtFileMailbox:
***************
*** 119,120 ****
--- 120,164 ----
msg.set_payload(obj)
return msg
+
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+
+ def extract_headers(text):
+ """Very simple-minded header extraction: prefix of text up to blank line.
+
+ A blank line is recognized via two adjacent line-ending sequences, where
+ a line-ending sequence is a newline optionally preceded by a carriage
+ return.
+
+ If no blank line is found, all of text is considered to be a potential
+ header section. If a blank line is found, the text up to (but not
+ including) the blank line is considered to be a potential header section.
+
+ The potential header section is returned, unless it doesn't contain a
+ colon, in which case an empty string is returned.
+
+ >>> extract_headers("abc")
+ ''
+ >>> extract_headers("abc\\n\\n\\n") # no colon
+ ''
+ >>> extract_headers("abc: xyz\\n\\n\\n")
+ 'abc: xyz\\n'
+ >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+ 'abc: xyz\\r\\n'
+ >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+ 'a: b\\ngibberish\\n'
+ """
+
+ m = header_break_re.search(text)
+ if m:
+ eol = m.start(1)
+ text = text[:eol]
+ if ':' not in text:
+ text = ""
+ return text
+
+ def _test():
+ import doctest, mboxutils
+ return doctest.testmod(mboxutils)
+
+ if __name__ == "__main__":
+ _test()
From tim_one@users.sourceforge.net Tue Nov 12 23:16:06 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:16:06 -0800
Subject: [Spambayes-checkins] spambayes mboxutils.py,1.5,1.6
tokenizer.py,1.66,1.67
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1192
Modified Files:
mboxutils.py tokenizer.py
Log Message:
get_message(): changed to use the new extract_headers() hack.
Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** mboxutils.py 12 Nov 2002 23:12:11 -0000 1.5
--- mboxutils.py 12 Nov 2002 23:16:04 -0000 1.6
***************
*** 114,120 ****
# headers are most likely damaged, we can't use the email
# package to parse them, so just get rid of them first.
! i = obj.find('\n\n')
! if i >= 0:
! obj = obj[i+2:] # strip headers
msg = email.Message.Message()
msg.set_payload(obj)
--- 114,119 ----
# headers are most likely damaged, we can't use the email
# package to parse them, so just get rid of them first.
! headers = extract_headers(obj)
! obj = obj[len(headers):]
msg = email.Message.Message()
msg.set_payload(obj)
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.66
retrieving revision 1.67
diff -C2 -d -r1.66 -r1.67
*** tokenizer.py 12 Nov 2002 06:21:38 -0000 1.66
--- tokenizer.py 12 Nov 2002 23:16:04 -0000 1.67
***************
*** 17,20 ****
--- 17,21 ----
from Options import options
+ import mboxutils
from mboxutils import get_message
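The get_message() change above swaps a search for '\n\n' for a call to
extract_headers() followed by a slice. A small sketch of what that slice
leaves behind is given below, for a current Python. extract_headers() is
copied from the check-in; split_damaged_message() is a hypothetical wrapper
added only to show the two pieces side by side.

```python
import re

header_break_re = re.compile(r"\r?\n(\r?\n)")


def extract_headers(text):
    """Prefix of text up to the first blank line, or '' if it has no colon."""
    m = header_break_re.search(text)
    if m:
        text = text[:m.start(1)]
    if ":" not in text:
        text = ""
    return text


def split_damaged_message(raw):
    """Return (headers, remainder) the way the new get_message() slices."""
    headers = extract_headers(raw)
    return headers, raw[len(headers):]


if __name__ == "__main__":
    print(split_damaged_message("a: b\r\n\r\nbody"))
    print(split_damaged_message("no headers here\n\nbody"))
```

Two things follow from the code as written: CRLF line endings are now
recognized, and the remainder starts with the blank separator line instead of
having it stripped. Text with no colon before the first blank line is left
whole, since the extracted header section is then empty.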
From tim_one@users.sourceforge.net Tue Nov 12 23:19:35 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:19:35 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.25,1.26
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv3198/Outlook2000
Modified Files:
msgstore.py
Log Message:
GetEmailPackageObject(): Removed comments that no longer made sense, at
least not here.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** msgstore.py 12 Nov 2002 22:56:24 -0000 1.25
--- msgstore.py 12 Nov 2002 23:19:33 -0000 1.26
***************
*** 422,430 ****
def GetEmailPackageObject(self):
import email
- # XXX If this was originally a MIME msg, we're hosed at this point --
- # the boundary tag in the headers doesn't exist in the body, and
- # the msg is simply ill-formed. The miserable hack here simply
- # squashes the text part (if any) and the HTML part (if any) together,
- # and strips MIME info from the original headers.
text = self._GetMessageText()
try:
--- 422,425 ----
From tim_one@users.sourceforge.net Tue Nov 12 23:33:48 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 15:33:48 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 msgstore.py,1.26,1.27
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv11116/Outlook2000
Modified Files:
msgstore.py
Log Message:
_GetMessageText(): Whatever the value of the headers property, stop
paying attention to it after the first blank line, and don't believe it
at all if it doesn't contain a colon. Cheap trick to worm around the
problems some people have reported with Outlook returning multiple header
sections here (including internal MIME armor with empty bodies).
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** msgstore.py 12 Nov 2002 23:19:33 -0000 1.26
--- msgstore.py 12 Nov 2002 23:33:45 -0000 1.27
***************
*** 1,5 ****
from __future__ import generators
! import sys, os
try:
--- 1,5 ----
from __future__ import generators
! import sys, os, re
try:
***************
*** 10,13 ****
--- 10,53 ----
+ # XXX
+ # import mboxutils doesn't work at this point. The extract_headers function
+ # here is a copy-and-paste.
+ header_break_re = re.compile(r"\r?\n(\r?\n)")
+
+ def extract_headers(text):
+ """Very simple-minded header extraction: prefix of text up to blank line.
+
+ A blank line is recognized via two adjacent line-ending sequences, where
+ a line-ending sequence is a newline optionally preceded by a carriage
+ return.
+
+ If no blank line is found, all of text is considered to be a potential
+ header section. If a blank line is found, the text up to (but not
+ including) the blank line is considered to be a potential header section.
+
+ The potential header section is returned, unless it doesn't contain a
+ colon, in which case an empty string is returned.
+
+ >>> extract_headers("abc")
+ ''
+ >>> extract_headers("abc\\n\\n\\n") # no colon
+ ''
+ >>> extract_headers("abc: xyz\\n\\n\\n")
+ 'abc: xyz\\n'
+ >>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
+ 'abc: xyz\\r\\n'
+ >>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
+ 'a: b\\ngibberish\\n'
+ """
+
+ m = header_break_re.search(text)
+ if m:
+ eol = m.start(1)
+ text = text[:eol]
+ if ':' not in text:
+ text = ""
+ return text
+
+
# Abstract definition - can be moved out when we have more than one sub-class
# External interface to this module is almost exclusively via a "folder ID"
***************
*** 384,387 ****
--- 424,434 ----
html = self._GetPotentiallyLargeStringProp(prop_ids[2], data[2])
has_attach = data[3][1]
+
+ # Some Outlooks deliver a strange notion of headers, including
+ # interior MIME armor. To prevent later errors, try to get rid
+ # of stuff now that can't possibly be parsed as "real" (SMTP)
+ # headers.
+ headers = extract_headers(headers)
+
# Mail delivered internally via Exchange Server etc may not have
# headers - fake some up.
***************
*** 392,395 ****
--- 439,443 ----
elif headers.startswith("Microsoft Mail"):
headers = "X-MS-Mail-Gibberish: " + headers
+
if not html and not body:
# Only ever seen this for "multipart/signed" messages, so
From tim_one@users.sourceforge.net Wed Nov 13 05:29:15 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 21:29:15 -0800
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv4228
Modified Files:
train.py
Log Message:
train_message(): When rescoring was asked for, it had no visible
effect, since the probabilities didn't get updated after training.
So update the probs before rescoring.
Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** train.py 7 Nov 2002 22:30:09 -0000 1.16
--- train.py 13 Nov 2002 05:29:10 -0000 1.17
***************
*** 26,30 ****
return spam == True
! def train_message(msg, is_spam, mgr, rescore = False):
# Train an individual message.
# Returns True if newly added (message will be correctly
--- 26,30 ----
return spam == True
! def train_message(msg, is_spam, mgr, rescore=False):
# Train an individual message.
# Returns True if newly added (message will be correctly
***************
*** 54,57 ****
--- 54,58 ----
if rescore:
import filter
+ mgr.bayes.update_probabilities() # else rescoring gives the same score
filter.filter_message(msg, mgr, all_actions = False)
From tim_one@users.sourceforge.net Wed Nov 13 06:25:10 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 22:25:10 -0800
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.67,1.68
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2039a
Modified Files:
tokenizer.py
Log Message:
More refinements of address-header tokenization. In particular, it
now generates "no real name" log-count tokens, which are strong
spam clues in my data.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** tokenizer.py 12 Nov 2002 23:16:04 -0000 1.67
--- tokenizer.py 13 Nov 2002 06:25:08 -0000 1.68
***************
*** 1081,1097 ****
if not addrlist:
yield field + ":none"
! for addrs in addrlist:
! for rname,ename in email.Utils.getaddresses([addrs]):
! if rname:
! for rname,rcharset in email.Header.decode_header(rname):
! for w in rname.lower().split():
! for t in tokenize_word(w):
! yield field+'realname:'+t
! if rcharset is not None:
! yield field+'charset:'+rcharset
! if ename:
! for w in ename.lower().split('@'):
! for t in tokenize_word(w):
! yield field+'email:'+t
# To:
# Cc:
--- 1081,1105 ----
if not addrlist:
yield field + ":none"
! continue
!
! noname_count = 0
! for name, addr in email.Utils.getaddresses(addrlist):
! if name:
! for name, charset in email.Header.decode_header(name):
! yield "%s:name:%s" % (field, name.lower())
! if charset is not None:
! yield "%s:charset:%s" % (field, charset)
! else:
! noname_count += 1
! if addr:
! for w in addr.lower().split('@'):
! yield "%s:addr:%s" % (field, w)
! else:
! yield field + ":addr:none"
!
! if noname_count:
! yield "%s:no real name:2**%d" % (field,
! round(log2(noname_count)))
!
# To:
# Cc:
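The refined address-header tokenization above can be sketched as a
standalone generator. This version targets a current Python, so it uses
email.utils and email.header in place of the old email.Utils and
email.Header names, and it decodes byte strings that decode_header() may
return. The function name tokenize_address_headers and the explicit
address_headers parameter are assumptions for illustration; the check-in
reads options.address_headers inside the tokenizer.

```python
import math
from email.header import decode_header
from email.message import Message
from email.utils import getaddresses


def tokenize_address_headers(msg, address_headers=("from",)):
    """Yield tokens for the names and addresses found in the given headers."""
    for field in address_headers:
        addrlist = msg.get_all(field, [])
        if not addrlist:
            yield field + ":none"
            continue

        noname_count = 0
        for name, addr in getaddresses(addrlist):
            if name:
                for part, charset in decode_header(name):
                    if isinstance(part, bytes):
                        # Assumption: fall back to ASCII when no charset is given.
                        part = part.decode(charset or "ascii", "replace")
                    yield "%s:name:%s" % (field, part.lower())
                    if charset is not None:
                        yield "%s:charset:%s" % (field, charset)
            else:
                noname_count += 1
            if addr:
                for w in addr.lower().split("@"):
                    yield "%s:addr:%s" % (field, w)
            else:
                yield field + ":addr:none"

        if noname_count:
            # Bucket the count of nameless addresses by powers of two.
            yield "%s:no real name:2**%d" % (field, round(math.log2(noname_count)))


if __name__ == "__main__":
    m = Message()
    m["From"] = "Alice Example <alice@example.com>"
    m["To"] = "bob@example.org, carol@example.org, dave@example.org"
    for token in tokenize_address_headers(m, ("from", "to", "cc")):
        print(token)
```

Bucketing by the rounded base-2 logarithm keeps the number of distinct
"no real name" tokens small, so a message with 3 nameless recipients and one
with 4 share the same token.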
From mhammond@skippinet.com.au Wed Nov 13 07:01:59 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 13 Nov 2002 18:01:59 +1100
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
In-Reply-To:
Message-ID:
> Log Message:
> train_message(): When rescoring was asked for, it had no visible
> effect, since the probabilities didn't get updated after training.
> So update the probs before rescoring.
I'm a little confused about these probabilities.
Isn't it true that whenever we do a "train operation", we should also update
the probabilities? For a batch train, we only want to do it at the end, but
for an individual, incremental train, I would have thought we still want the
probabilities updated, even if we don't rescore the message. Otherwise
future messages will not use the new probabilities.
I ask because revision 1.14 did exactly this, and we regressed it. That
revision was:
diff -r1.13 -r1.14
21c21
< def train_message(msg, is_spam, mgr, update_probs = True):
---
> def train_message(msg, is_spam, mgr):
43,45d42
< if update_probs:
< mgr.bayes.update_probabilities()
<
56c53
< if train_message(message, isspam, mgr, False):
---
> if train_message(message, isspam, mgr):
And it seems to me that a new param, specifically for update_probs, is less
of a hack than tying it to the "rescore" param - we want the new probs used
for the *next* incoming message even if we don't need it for *this* message.
Mark.
From tim_one@users.sourceforge.net Wed Nov 13 06:59:27 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 12 Nov 2002 22:59:27 -0800
Subject: [Spambayes-checkins]
spambayes/Outlook2000 default_bayes_customize.ini,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv19210/Outlook2000
Modified Files:
default_bayes_customize.ini
Log Message:
Enable more address-header tokenization than the default. This should
help any personal email classifier. I recommend a full retrain to
get the most benefit.
Index: default_bayes_customize.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/default_bayes_customize.ini,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** default_bayes_customize.ini 4 Nov 2002 23:21:43 -0000 1.5
--- default_bayes_customize.ini 13 Nov 2002 06:59:24 -0000 1.6
***************
*** 17,20 ****
--- 17,26 ----
record_header_absence: True
+ # These should help. All but "from" are disabled by default, because
+ # they're killer-good clues for bad reasons when using mixed-source
+ # data.
+ address_headers: from to cc sender reply-to
+
+
[Classifier]
# Uncomment the next lines if you want to use the former default for
From tim.one@comcast.net Wed Nov 13 07:18:45 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 02:18:45 -0500
Subject: [Spambayes-checkins] spambayes/Outlook2000 train.py,1.16,1.17
In-Reply-To:
Message-ID:
[Mark Hammond]
> I'm a little confused about these probabilities.
>
> Isn't it true that whenever we do a "train operation", we should
> also update the probabilities?
It's a tradeoff. The bigger the database, the longer update_probabilities()
takes. If the user is staring at a specific msg, and expects to see its
score change, then the probs *have* to be updated or the score won't change.
So that was a very clear reason to force updating here. I didn't know why
the probs weren't being updated anyway, so fixed the one thing that was
unarguably buggy.
> For a batch train, we only want to do it at the end, but for an
> individual, incremental train, I would have thought we still want the
> probabilities updated, even if we don't rescore the message. Otherwise
> future messages will not use the new probabilities.
That's so. I haven't worried about it, perhaps because I run on Win9x most
of the time so live with frequent reboots (i.e., I retrain from scratch
several times every day anyway, as incremental updates are lost when a
forced reboot occurs; that's not *this* code's fault, although I eventually
hope to get around to writing out the updated database whenever the probs
get updated).
> I ask because revision 1.14 did exactly this, and we regressed it.
That's odd -- the CVS log says mhammond did that.
> ...
> And it seems to me that a new param, specifically for update_probs, is
> less of a hack than tying it to the "rescore" param - we want the
> new probs used for the *next* incoming message even if we don't need
> it for *this* message.
It's still a tradeoff, though. Once a classifier has gotten any amount of
decent training, whether or not a new training msg gets reflected instantly
in the probs should make little difference to results.
If it's possible that update_probabilities() *never* gets called after
training and before shutdown now, then that's clearly a bug.
It's OK by me whatever you'd rather do here, and updating probs after
training, without fail, is certainly the least error-prone strategy.
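The tradeoff discussed above (training changes the counts, but scores only
move once the cached probabilities are recomputed) can be illustrated with a
toy model. Everything below is hypothetical: ToyClassifier, its simple
ratio-based probabilities, and its averaging score are stand-ins chosen for
brevity, not the Spambayes classifier or its combining scheme.

```python
class ToyClassifier:
    """Keeps raw counts and a separate cache of per-word probabilities."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.counts = {}   # word -> [hamcount, spamcount]
        self.probs = {}    # word -> cached spam probability

    def learn(self, words, is_spam):
        # Cheap: only touches the counts, never the cached probabilities.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for w in set(words):
            record = self.counts.setdefault(w, [0, 0])
            record[1 if is_spam else 0] += 1

    def update_probabilities(self):
        # Cost grows with the size of the word table, hence the tradeoff.
        for w, (h, s) in self.counts.items():
            hamratio = h / self.nham if self.nham else 0.0
            spamratio = s / self.nspam if self.nspam else 0.0
            self.probs[w] = spamratio / (hamratio + spamratio)

    def score(self, words):
        known = [self.probs[w] for w in set(words) if w in self.probs]
        return sum(known) / len(known) if known else 0.5


if __name__ == "__main__":
    c = ToyClassifier()
    c.learn(["hello"], is_spam=False)
    c.update_probabilities()
    msg = ["hello", "viagra"]
    before = c.score(msg)
    c.learn(msg, is_spam=True)
    stale = c.score(msg)          # unchanged: probabilities not refreshed yet
    c.update_probabilities()
    fresh = c.score(msg)          # now reflects the new training
    print(before, stale, fresh)
```

In this sketch a rescore straight after learn() returns the old score, which
is the symptom the train.py fix addresses by refreshing the probabilities
first.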
From richiehindle@users.sourceforge.net Wed Nov 13 18:13:46 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 13 Nov 2002 10:13:46 -0800
Subject: [Spambayes-checkins] spambayes README.txt,1.41,1.42
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14506
Modified Files:
README.txt
Log Message:
Added a note about the web interface implemented by pop3proxy.py.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** README.txt 7 Nov 2002 22:30:02 -0000 1.41
--- README.txt 13 Nov 2002 18:13:43 -0000 1.42
***************
*** 74,77 ****
--- 74,82 ----
delivery system.
+ Also acts as a web server providing a user interface that allows you
+ to train the classifier, classify messages interactively, and query
+ the token database. This piece will at some point be split out into
+ a separate module.
+
neiltrain.py
Builds a CDB (constant database) file of word probabilities using
From richiehindle@users.sourceforge.net Wed Nov 13 18:14:34 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 13 Nov 2002 10:14:34 -0800
Subject: [Spambayes-checkins] spambayes Options.py,1.69,1.70
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15336
Modified Files:
Options.py
Log Message:
Added options for pop3proxy.py, so you don't need a huge command line.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.69
retrieving revision 1.70
diff -C2 -d -r1.69 -r1.70
*** Options.py 12 Nov 2002 06:21:38 -0000 1.69
--- Options.py 13 Nov 2002 18:14:32 -0000 1.70
***************
*** 339,342 ****
--- 339,357 ----
# database by default.
persistent_use_database: False
+
+ [pop3proxy]
+ # pop3proxy settings - pop3proxy also respects the options in the Hammie
+ # section, with the exception of the extra header details at the moment.
+ # The only mandatory option is pop3proxy_server_name, eg. pop3.my-isp.com,
+ # but that can come from the command line - see "pop3proxy -h".
+ pop3proxy_server_name: ""
+ pop3proxy_server_port: 110
+ pop3proxy_port: 110
+ pop3proxy_cache_use_gzip: True
+ pop3proxy_cache_expiry_days: 7
+
+ [html_ui]
+ html_ui_port: 8880
+ html_ui_launch_browser: False
"""
***************
*** 408,412 ****
'hammie_debug_header_name': string_cracker,
},
!
}
--- 423,435 ----
'hammie_debug_header_name': string_cracker,
},
! 'pop3proxy': {'pop3proxy_server_name': string_cracker,
! 'pop3proxy_server_port': int_cracker,
! 'pop3proxy_port': int_cracker,
! 'pop3proxy_cache_use_gzip': boolean_cracker,
! 'pop3proxy_cache_expiry_days': int_cracker,
! },
! 'html_ui': {'html_ui_port': int_cracker,
! 'html_ui_launch_browser': boolean_cracker,
! },
}
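
The pairing above of ini-style defaults with a per-option "cracker" can be
sketched roughly as follows. This is a minimal illustration in modern
Python 3, not the code in Options.py: the names DEFAULTS, CRACKERS and
load_options are hypothetical, and representing each cracker as the name of
a ConfigParser getter method is an assumption made for the sketch.

```python
import configparser

# Defaults mirroring the [pop3proxy] and [html_ui] sections added in the diff.
# The server name is left empty here; it is the one value the user must supply.
DEFAULTS = """
[pop3proxy]
pop3proxy_server_name:
pop3proxy_server_port: 110
pop3proxy_port: 110
pop3proxy_cache_use_gzip: True
pop3proxy_cache_expiry_days: 7

[html_ui]
html_ui_port: 8880
html_ui_launch_browser: False
"""

# Hypothetical stand-in for the cracker table: each option maps to the name
# of the ConfigParser method that converts its raw text to the right type.
CRACKERS = {
    'pop3proxy': {'pop3proxy_server_name': 'get',
                  'pop3proxy_server_port': 'getint',
                  'pop3proxy_port': 'getint',
                  'pop3proxy_cache_use_gzip': 'getboolean',
                  'pop3proxy_cache_expiry_days': 'getint',
                  },
    'html_ui': {'html_ui_port': 'getint',
                'html_ui_launch_browser': 'getboolean',
                },
}


def load_options(defaults, overrides=""):
    """Parse the defaults, layer any user overrides on top, and return a
    flat dict of typed option values."""
    parser = configparser.ConfigParser()
    parser.read_string(defaults)
    if overrides:
        # A later read replaces values from an earlier one, which is how a
        # user's customization file would take precedence over the defaults.
        parser.read_string(overrides)
    result = {}
    for section, crackers in CRACKERS.items():
        for name, getter in crackers.items():
            result[name] = getattr(parser, getter)(section, name)
    return result


if __name__ == "__main__":
    user_ini = "[pop3proxy]\npop3proxy_server_name: pop3.my-isp.com\n"
    for key, value in sorted(load_options(DEFAULTS, user_ini).items()):
        print(key, "=", repr(value))
```

Keeping the type conversion in one table means a new option needs only a
default and a table entry, which appears to be the motivation for the
cracker dictionaries in the diff.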
From richiehindle@users.sourceforge.net Wed Nov 13 18:19:48 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 13 Nov 2002 10:19:48 -0800
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20474
Modified Files:
pop3proxy.py
Log Message:
o All command line switches and options now default to values from
bayescustomize.ini. Thanks to Francois Granger for the idea.
o Instead of there being two radio buttons (ham, spam) on the training
form, there are now two buttons: "Train as Ham" and "Train as Spam".
Thanks to Just van Rossum for the suggestion.
o "Classify message" form - paste or upload a message for classification.
Gives you the spam probability and the clues.
o It now gives a decent error if the POP3 server is unreachable.
o The "Bad file descriptor" / last-response-is-logged-three-times bug
is (hopefully) fixed.
o The bug whereby socket errors could cause the "Active POP3
conversations" count to go negative is fixed.
o After doing a word query, it now prepopulates the query field with
your word - handy if you mistyped it or you want to try a variant.
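
The fix for the negative "Active POP3 conversations" count amounts to making
close() idempotent, since the async framework may call it more than once for
the same connection. A minimal standalone sketch of that guard, in modern
Python 3 and without any sockets, might look like the following. The Status
and Session classes are simplified, hypothetical stand-ins for the proxy's
own classes.

```python
class Status:
    """Simplified stand-in for the proxy's shared status counters."""

    def __init__(self):
        self.totalSessions = 0
        self.activeSessions = 0


class Session:
    """Hypothetical connection object whose close() may be called
    several times."""

    def __init__(self, status):
        self.status = status
        self.status.totalSessions += 1
        self.status.activeSessions += 1
        self.isClosed = False

    def close(self):
        # Only the first call updates the counter; later calls are no-ops,
        # so the active count can never drop below zero.
        if not self.isClosed:
            self.isClosed = True
            self.status.activeSessions -= 1


if __name__ == "__main__":
    status = Status()
    session = Session(status)
    session.close()
    session.close()  # e.g. a second close triggered by a socket error
    print(status.totalSessions, status.activeSessions)
```

Without the isClosed flag, the second close() would decrement the counter
again, which matches the symptom described in the log message.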
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** pop3proxy.py 10 Nov 2002 19:59:22 -0000 1.14
--- pop3proxy.py 13 Nov 2002 18:19:45 -0000 1.15
***************
*** 7,11 ****
header. Usage:
! pop3proxy.py [options] <server> [<port>]
<server> is the name of your real POP3 server
<port> is the port number of your real POP3 server, which
--- 7,11 ----
header. Usage:
! pop3proxy.py [options] [<server> [<port>]]
<server> is the name of your real POP3 server
<port> is the port number of your real POP3 server, which
***************
*** 13,16 ****
--- 13,20 ----
options:
+ -z : Runs a self-test and exits.
+ -t : Runs a test POP3 server on port 8110 (for testing).
+ -h : Displays this help message.
+
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
***************
*** 20,28 ****
-b : Launch a web browser showing the user interface.
! pop3proxy -t
! Runs a test POP3 server on port 8110; useful for testing.
!
! pop3proxy -h
! Displays this help message.
For safety, and to help debugging, the whole POP3 conversation is
--- 24,30 ----
-b : Launch a web browser showing the user interface.
! All command line arguments and switches take their default
! values from the [Hammie], [pop3proxy] and [html_ui] sections
! of bayescustomize.ini.
For safety, and to help debugging, the whole POP3 conversation is
***************
*** 48,72 ****
todo = """
! o (Re)training interface - one message per line, quick-rendering table.
! o Slightly-wordy index page; intro paragraph for each page.
o Once the training stuff is on a separate page, make the paste box
bigger.
- o "Links" section (on homepage?) to project homepage, mailing list,
- etc.
- o "Home" link (with helmet!) at the end of each page.
- o "Classify this" - just like Train.
- o "Send me an email every [...] to remind me to train on new
- messages."
- o "Send me a status email every [...] telling how many mails have been
- classified, etc."
o Deployment: Windows executable? atlaxwin and ctypes? Or just
webbrowser?
- o Possibly integrate Tim Stone's SMTP code - make it use async, make
- the training code update (rather than replace!) the database.
o Can it cleanly dynamically update its status display while having a
POP3 converation? Hammering reload sucks.
o Add a command to save the database without shutting down, and one to
reload the database.
! o Leave the word in the input field after a Word query.
"""
--- 50,103 ----
todo = """
!
! User interface improvements:
!
o Once the training stuff is on a separate page, make the paste box
bigger.
o Deployment: Windows executable? atlaxwin and ctypes? Or just
webbrowser?
o Can it cleanly dynamically update its status display while having a
POP3 converation? Hammering reload sucks.
o Add a command to save the database without shutting down, and one to
reload the database.
! o Save the Status (num classified, etc.) between sessions.
!
!
! New features:
!
! o (Re)training interface - one message per line, quick-rendering table.
! o "Send me an email every [...] to remind me to train on new
! messages."
! o "Send me a status email every [...] telling how many mails have been
! classified, etc."
! o Possibly integrate Tim Stone's SMTP code - make it use async, make
! the training code update (rather than replace!) the database.
! o Option to keep trained messages and view potential FPs and FNs to
! correct them.
! o Allow use of the UI without the POP3 proxy.
!
!
! Code quality:
!
! o Move the UI into its own module.
! o Eventually, pull the common HTTP code from pop3proxy.py and Entrian
! Debugger into a library.
!
!
! Info:
!
! o Slightly-wordy index page; intro paragraph for each page.
! o In both stats and training results, report nham and nspam - warn if
! they're very different (for some value of 'very').
! o "Links" section (on homepage?) to project homepage, mailing list,
! etc.
!
!
! Gimmicks:
!
! o Classify a web page given a URL.
! o Graphs. Of something. Who cares what?
! o Zoe...!
!
"""
***************
*** 147,151 ****
self.set_terminator('\r\n')
self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
! self.connect((serverName, serverPort))
def collect_incoming_data(self, data):
--- 178,188 ----
self.set_terminator('\r\n')
self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
! try:
! self.connect((serverName, serverPort))
! except socket.error, e:
! print >>sys.stderr, "Can't connect to %s:%d: %s" % \
! (serverName, serverPort, e)
! self.close()
! self.lineCallback('') # "The socket's been closed."
def collect_incoming_data(self, data):
***************
*** 199,203 ****
self.response = self.response + line
! # Is this line that terminates a set of headers?
self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
--- 236,240 ----
self.response = self.response + line
! # Is this the line that terminates a set of headers?
self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']
***************
*** 237,241 ****
else:
# Assume that an unknown command will get a single-line
! # response. This should work for errors and for POP-AUTH.
return False
--- 274,281 ----
else:
# Assume that an unknown command will get a single-line
! # response. This should work for errors and for POP-AUTH,
! # and is harmless even for multiline responses - the first
! # line will be passed to onTransaction and ignored, then the
! # rest will be proxied straight through.
return False
***************
*** 246,257 ****
def found_terminator(self):
"""Asynchat override."""
! if self.request.strip().upper() == 'KILL':
! self.serverSocket.sendall('QUIT\r\n')
! self.send("+OK, dying.\r\n")
! self.serverSocket.shutdown(2)
! self.serverSocket.close()
self.shutdown(2)
self.close()
raise SystemExit
self.serverSocket.push(self.request + '\r\n')
--- 286,298 ----
def found_terminator(self):
"""Asynchat override."""
! verb = self.request.strip().upper()
! if verb == 'KILL':
self.shutdown(2)
self.close()
raise SystemExit
+ elif verb == 'CRASH':
+ # For testing
+ x = 0
+ y = 1/x
self.serverSocket.push(self.request + '\r\n')
***************
*** 271,276 ****
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
# If onServerLine() decided that the server has closed its
--- 312,318 ----
# Pass the request and the raw response to the subclass and
# send back the cooked response.
! if self.response:
! cooked = self.onTransaction(self.command, self.args, self.response)
! self.push(cooked)
# If onServerLine() decided that the server has closed its
***************
*** 334,337 ****
--- 376,380 ----
status.totalSessions += 1
status.activeSessions += 1
+ self.isClosed = False
def send(self, data):
***************
*** 339,343 ****
self.logFile.write(data)
self.logFile.flush()
! return POP3ProxyBase.send(self, data)
def recv(self, size):
--- 382,392 ----
self.logFile.write(data)
self.logFile.flush()
! try:
! return POP3ProxyBase.send(self, data)
! except socket.error:
! # The email client has closed the connection - 40tude Dialog
! # does this immediately after issuing a QUIT command,
! # without waiting for the response.
! self.close()
def recv(self, size):
***************
*** 349,354 ****
def close(self):
! status.activeSessions -= 1
! POP3ProxyBase.close(self)
def onTransaction(self, command, args, response):
--- 398,406 ----
def close(self):
! # This can be called multiple times by async.
! if not self.isClosed:
! self.isClosed = True
! status.activeSessions -= 1
! POP3ProxyBase.close(self)
def onTransaction(self, command, args, response):
***************
*** 442,448 ****
UserInterface objects to serve them."""
! def __init__(self, uiPort, bayes):
uiArgs = (bayes,)
! Listener.__init__(self, uiPort, UserInterface, uiArgs)
--- 494,500 ----
UserInterface objects to serve them."""
! def __init__(self, uiPort, bayes, socketMap=asyncore.socket_map):
uiArgs = (bayes,)
! Listener.__init__(self, uiPort, UserInterface, uiArgs, socketMap=socketMap)
***************
*** 479,485 ****
"""Serves the HTML user interface of the proxy."""
header = """Spambayes proxy: %s
]{0,256} # search for the end '>', but don't run wild
! )
>
""", re.VERBOSE | re.DOTALL)
--- 611,625 ----
msg.walk()))
has_highbit_char = re.compile(r"[\x80-\xff]").search
# Cheap-ass gimmick to probabilistically find HTML/XML tags.
+ # Note that ").search)
!
! crack_html_style = StyleStripper().analyze
!
! # Nuke HTML comments.
!
! class CommentStripper(Stripper):
! def __init__(self):
! Stripper.__init__(self, re.compile(r"").search)
!
! crack_html_comment = CommentStripper().analyze
# Scan HTML for constructs often seen in viruses and worms.
***************
*** 1232,1251 ****
text = text.lower()
- # Get rid of uuencoded sections.
- text, tokens = crack_uuencode(text)
- for t in tokens:
- yield t
-
if options.replace_nonascii_chars:
# Replace high-bit chars and control chars with '?'.
text = text.translate(non_ascii_translate_tab)
- # Special tagging of embedded URLs.
- text, tokens = crack_urls(text)
- for t in tokens:
- yield t
-
for t in find_html_virus_clues(text):
yield "virus:%s" % t
# Remove HTML/XML tags. Also &nbsp;.
--- 1268,1287 ----
text = text.lower()
if options.replace_nonascii_chars:
# Replace high-bit chars and control chars with '?'.
text = text.translate(non_ascii_translate_tab)
for t in find_html_virus_clues(text):
yield "virus:%s" % t
+
+ # Get rid of uuencoded sections, embedded URLs,
***************
*** 664,671 ****
reviewHeader = """
These are untrained emails, which you can use to
! train the classifier. Check the Discard / Defer / Ham /
! Spam buttton for each email, then click 'Train' below.
! (Defer leaves the message here, to be trained on
! later.)