From montanaro at users.sourceforge.net Tue Jul 3 16:12:54 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Tue, 03 Jul 2007 07:12:54 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dbmstorage.py,1.15,1.16 Message-ID: <20070703141258.49C631E400A@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv13923 Modified Files: dbmstorage.py Log Message: SF patch #810344. Should have applied this long ago. Index: dbmstorage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/dbmstorage.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** dbmstorage.py 7 Apr 2006 02:23:05 -0000 1.15 --- dbmstorage.py 3 Jul 2007 14:12:49 -0000 1.16 *************** *** 34,37 **** --- 34,42 ---- return gdbm.open(*args) + def open_dbm(*args): + """Open a dbm database.""" + import dbm + return dbm.open(*args) + def open_best(*args): if sys.platform == "win32": *************** *** 42,46 **** funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash] for f in funcs: try: --- 47,52 ---- funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash, ! open_dbm] for f in funcs: try: *************** *** 56,59 **** --- 62,66 ---- "bsddb185": open_db185hash, "gdbm": open_gdbm, + "dbm": open_dbm, } From montanaro at users.sourceforge.net Wed Jul 4 12:58:57 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Wed, 04 Jul 2007 03:58:57 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dbmstorage.py,1.16,1.17 Message-ID: <20070704105901.49AA71E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv19300 Modified Files: dbmstorage.py Log Message: Revert the last change. 
It was ill-considered, and only serves to sneak Berkeley DB 1.85 files into the system on Macs in the guise of supporting the dbm format. Index: dbmstorage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/dbmstorage.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** dbmstorage.py 3 Jul 2007 14:12:49 -0000 1.16 --- dbmstorage.py 4 Jul 2007 10:58:54 -0000 1.17 *************** *** 34,42 **** return gdbm.open(*args) - def open_dbm(*args): - """Open a dbm database.""" - import dbm - return dbm.open(*args) - def open_best(*args): if sys.platform == "win32": --- 34,37 ---- *************** *** 47,52 **** funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash, ! open_dbm] for f in funcs: try: --- 42,46 ---- funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash] for f in funcs: try: *************** *** 62,66 **** "bsddb185": open_db185hash, "gdbm": open_gdbm, - "dbm": open_dbm, } --- 56,59 ---- From mhammond at users.sourceforge.net Sat Jul 7 08:25:25 2007 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Fri, 06 Jul 2007 23:25:25 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.42,1.43 Message-ID: <20070707062531.72E361E4005@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32054 Modified Files: WHAT_IS_NEW.txt Log Message: Add info about 1.1a4 Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** WHAT_IS_NEW.txt 25 Aug 2006 02:02:12 -0000 1.42 --- WHAT_IS_NEW.txt 7 Jul 2007 06:25:23 -0000 1.43 *************** *** 16,19 **** --- 16,51 ---- is released. 
+ New in 1.1 Alpha 4 + ================== + + -------------------------------------------- + ** Incompatible changes and Transitioning ** + -------------------------------------------- + + Some options that were 'experimental' in 1.1a3 have now been upgraded to + non-experimental, meaning the option names have had their 'x-' prefix removed. + See below for details. + + Otherwise, there should be no incompatible changes since 1.1a3, though users + new to the 1.1 series should pay careful attention to the database changes + introduced in 1.1a2. + + ------------------- + ** Other changes ** + ------------------- + + The previously experimental options 'x-crack-images', 'x-ocr-engine' + and 'x-image-size' have all had their 'x-' prefix removed. 'crack-images' + now defaults to True (meaning you don't need to change anything for it + to be enabled), and ocr-engine defaults to 'gocr'. The Windows binary ships + with the gocr engine, so this should work out-of-the-box both for Outlook + and POP/IMAP/etc users. + + Image Cracking (i.e., using OCR to extract text from images) has been + implemented for the Outlook addin. + + Some localization-related issues have been fixed, and a German translation + contributed. + New in 1.1 Alpha 3 ================== *************** *** 91,94 **** --- 123,130 ---- -------------------------------------------- + * NOTE * - this section does not apply to people running SpamBayes on + Windows using the binary installer - only source code installations are + affected. + SpamBayes has changed to use ZODB as the default database backend, rather than dbm (usually bsddb). There are three methods for handling this *************** *** 109,115 **** persistent_use_database:dbm ! o You can convert your existing database files to the new format. ! Windows users will be given the opportunity to do this on installation; ! other users should use the utilities/convert_db.py script to do this. 
Note that only the token database (containing your training) is converted; the 'messageinfo' database (containing statistics about --- 145,150 ---- persistent_use_database:dbm ! o You can convert your existing database files to the new format using ! the utilities/convert_db.py script. Note that only the token database (containing your training) is converted; the 'messageinfo' database (containing statistics about From montanaro at users.sourceforge.net Sun Jul 15 01:13:14 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 14 Jul 2007 16:13:14 -0700 Subject: [Spambayes-checkins] spambayes/spambayes XMLRPCPlugin.py,1.2,1.3 Message-ID: <20070714231317.8BEB11E4008@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31847/spambayes Modified Files: XMLRPCPlugin.py Log Message: Add train and train_mime methods to the XML-RPC plugin. These come from Marian Neagul. Index: XMLRPCPlugin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/XMLRPCPlugin.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** XMLRPCPlugin.py 10 Jun 2007 15:27:36 -0000 1.2 --- XMLRPCPlugin.py 14 Jul 2007 23:13:09 -0000 1.3 *************** *** 37,40 **** --- 37,47 ---- """ + __author__ = "Skip Montanaro " + __credits__ = "All the Spambayes folk." + + # This module is part of the spambayes project, which is Copyright 2002 The + # Python Software Foundation and is covered by the Python Software + # Foundation license. + import threading import xmlrpclib *************** *** 70,82 **** def _dispatch(self, method, params): ! if method in ("score", "score_mime"): return getattr(self, method)(*params) else: raise xmlrpclib.Fault(404, '"%s" is not supported' % method) def score(self, form_dict, extra_tokens, attachments): """Score a dictionary + extra tokens.""" ! 
mime_message = form_to_mime(form_dict, extra_tokens, attachments) ! mime_message = unicode(mime_message).encode("utf-8") return self.score_mime(mime_message, "utf-8") --- 77,170 ---- def _dispatch(self, method, params): ! if method in ("score", "score_mime", "train", "train_mime"): return getattr(self, method)(*params) else: raise xmlrpclib.Fault(404, '"%s" is not supported' % method) + def train(self, form_dict, extra_tokens, attachments, is_spam=True): + newdict={} + for (i, k) in form_dict.items(): + if type(k)==unicode: + k = k.encode("utf-8") + newdict[i] = k + mime_message = form_to_mime(newdict, extra_tokens, attachments) + mime_message = unicode(mime_message.as_string(), "utf-8").encode("utf-8") + self.train_mime(mime_message, "utf-8", is_spam) + return "" + + def train_mime(self, msg_text, encoding, is_spam): + if self.state.bayes is None: + self.state.create_workers() + # Get msg_text into canonical string representation. + # Make sure we have a unicode object... + if isinstance(msg_text, str): + msg_text = unicode(msg_text, encoding) + # ... then encode it as utf-8. 
+ if isinstance(msg_text, unicode): + msg_text = msg_text.encode("utf-8") + msg = message_from_string(msg_text, + _class=spambayes.message.SBHeaderMessage) + tokens = tokenize(msg) + if is_spam: + desired_corpus = "spamCorpus" + else: + desired_corpus = "hamCorpus" + if hasattr(self, desired_corpus): + corpus = getattr(self, desired_corpus) + else: + if hasattr(self, "state"): + corpus = getattr(self.state, desired_corpus) + setattr(self, desired_corpus, corpus) + self.msg_name_func = self.state.getNewMessageName + else: + if is_spam: + fn = storage.get_pathname_option("Storage", + "spam_cache") + else: + fn = storage.get_pathname_option("Storage", + "ham_cache") + storage.ensureDir(fn) + if options["Storage", "cache_use_gzip"]: + factory = FileCorpus.GzipFileMessageFactory() + else: + factory = FileCorpus.FileMessageFactory() + age = options["Storage", "cache_expiry_days"]*24*60*60 + corpus = FileCorpus.ExpiryFileCorpus(age, factory, fn, + '[0123456789\-]*', cacheSize=20) + setattr(self, desired_corpus, corpus) + class UniqueNamer(object): + count = -1 + def generate_name(self): + self.count += 1 + return "%10.10d-%d" % (long(time.time()), self.count) + Namer = UniqueNamer() + self.msg_name_func = Namer.generate_name + key = self.msg_name_func() + mime_message = unicode(msg.as_string(), "utf-8").encode("utf-8") + msg = corpus.makeMessage(key, mime_message) + msg.setId(key) + corpus.addMessage(msg) + msg.RememberTrained(is_spam) + #self.stats.RecordTraining(not is_spam) + #if is_spam: + # self.state.bayes.nspam += 1 + #else: + # self.state.bayes.nham += 1 + + def train_spam(self, form_dict, extra_tokens, attachments): + pass + + def train_ham(self, form_dict, extra_tokens, attachments): + pass + def score(self, form_dict, extra_tokens, attachments): """Score a dictionary + extra tokens.""" ! newdict={} ! for (i, k) in form_dict.items(): ! if isinstance(k,unicode): ! k = k.encode("utf-8") ! newdict[i] = k ! 
mime_message = form_to_mime(newdict, extra_tokens, attachments) ! mime_message = unicode(mime_message.as_string(), "utf-8").encode("utf-8") return self.score_mime(mime_message, "utf-8") From montanaro at users.sourceforge.net Sun Jul 15 01:13:14 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 14 Jul 2007 16:13:14 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.43,1.44 Message-ID: <20070714231318.53E3C1E4008@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31847 Modified Files: WHAT_IS_NEW.txt Log Message: Add train and train_mime methods to the XML-RPC plugin. These come from Marian Neagul. Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** WHAT_IS_NEW.txt 7 Jul 2007 06:25:23 -0000 1.43 --- WHAT_IS_NEW.txt 14 Jul 2007 23:13:09 -0000 1.44 *************** *** 16,19 **** --- 16,25 ---- is released. + New in 1.1 Alpha 5 + ================== + + The XML-RPC plugin for core_server.py now has "train" and "train_mime" + methods. + New in 1.1 Alpha 4 ================== *************** *** 48,51 **** --- 54,61 ---- contributed. + There is a new application, core_server.py. It is functionally similar to + sb_server.py but uses a plugin architecture to adapt to different + protocols. The first plugin is for XML-RPC. + New in 1.1 Alpha 3 ================== From montanaro at users.sourceforge.net Sun Jul 15 01:14:16 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 14 Jul 2007 16:14:16 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.59,1.60 Message-ID: <20070714231420.290BB1E4008@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32661 Modified Files: CHANGELOG.txt Log Message: . 
Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.59 retrieving revision 1.60 diff -C2 -d -r1.59 -r1.60 *** CHANGELOG.txt 25 Jun 2007 12:10:10 -0000 1.59 --- CHANGELOG.txt 14 Jul 2007 23:14:14 -0000 1.60 *************** *** 1,4 **** --- 1,8 ---- [Note that all dates are in ISO 8601 format, e.g. YYYY-MM-DD to ease sorting] + Release 1.1a5 + + Skip Montanaro 2007-07-14 Add train and train_mime methods to XML-RPC plugin (from Marian Neagul). + Release 1.1a4 From montanaro at users.sourceforge.net Tue Jul 17 04:17:00 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Mon, 16 Jul 2007 19:17:00 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3152] trunk/spambayes/WHAT_IS_NEW.txt Message-ID: Revision: 3152 http://spambayes.svn.sourceforge.net/spambayes/?rev=3152&view=rev Author: montanaro Date: 2007-07-16 19:16:59 -0700 (Mon, 16 Jul 2007) Log Message: ----------- a trivial change - testing authentication and email notification Modified Paths: -------------- trunk/spambayes/WHAT_IS_NEW.txt Modified: trunk/spambayes/WHAT_IS_NEW.txt =================================================================== --- trunk/spambayes/WHAT_IS_NEW.txt 2007-07-16 11:26:57 UTC (rev 3151) +++ trunk/spambayes/WHAT_IS_NEW.txt 2007-07-17 02:16:59 UTC (rev 3152) @@ -18,7 +18,7 @@ New in 1.1 Alpha 5 ================== -The XML-RPC plugin for core_server.py now has "train" and "train_mime" +The XML-RPC plugin for core_server.py now has 'train' and 'train_mime' methods. The source code repository was switched from CVS to Subversion. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. 
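The new train/train_mime entry points go through the same `_dispatch` gate as score/score_mime: a fixed whitelist of method names is checked before `getattr`, so remote callers can never reach arbitrary attributes of the plugin. A minimal, self-contained sketch of that pattern in modern Python (the class and its placeholder `score` body are illustrative, not the plugin's actual state handling):

```python
from xmlrpc.client import Fault

class DispatchDemo:
    """Illustrative stand-in for the plugin's XML-RPC worker.

    Only names listed in ALLOWED are reachable through _dispatch;
    anything else raises an XML-RPC Fault, mirroring how the plugin
    keeps internal attributes uncallable from the network.
    """
    ALLOWED = ("score", "score_mime", "train", "train_mime")

    def _dispatch(self, method, params):
        # Whitelist check first, then dynamic lookup and call.
        if method in self.ALLOWED:
            return getattr(self, method)(*params)
        raise Fault(404, '"%s" is not supported' % method)

    def score(self, text):
        return 0.5  # placeholder probability for the sketch

demo = DispatchDemo()
print(demo._dispatch("score", ("hello",)))  # 0.5
try:
    demo._dispatch("_secret", ())
except Fault as e:
    print(e.faultCode)  # 404
```

Extending the whitelist tuple is all the revision above had to do to expose the two new methods.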
From montanaro at users.sourceforge.net Tue Jul 17 04:17:17 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Mon, 16 Jul 2007 19:17:17 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3153] trunk/spambayes/README-DEVEL.txt Message-ID: Revision: 3153 http://spambayes.svn.sourceforge.net/spambayes/?rev=3153&view=rev Author: montanaro Date: 2007-07-16 19:17:16 -0700 (Mon, 16 Jul 2007) Log Message: ----------- a trivial change - testing authentication and email notification Modified Paths: -------------- trunk/spambayes/README-DEVEL.txt Modified: trunk/spambayes/README-DEVEL.txt =================================================================== --- trunk/spambayes/README-DEVEL.txt 2007-07-17 02:16:59 UTC (rev 3152) +++ trunk/spambayes/README-DEVEL.txt 2007-07-17 02:17:16 UTC (rev 3153) @@ -27,7 +27,12 @@ You should definitely check out the FAQ: http://spambayes.org/faq.html +Getting Source Code +=================== +The SpamBayes project source code is hosted at SourceForge +(http://spambayes.sourceforge.net/). Access is via Subversion. + Primary Core Files ================== Options.py 
From montanaro at users.sourceforge.net Tue Jul 24 02:04:32 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Mon, 23 Jul 2007 17:04:32 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3154] trunk/spambayes/scripts/sb_notesfilter.py Message-ID: Revision: 3154 http://spambayes.svn.sourceforge.net/spambayes/?rev=3154&view=rev Author: montanaro Date: 2007-07-23 17:04:32 -0700 (Mon, 23 Jul 2007) Log Message: ----------- one more incorrectly positioned __future__ import Modified Paths: -------------- trunk/spambayes/scripts/sb_notesfilter.py Modified: trunk/spambayes/scripts/sb_notesfilter.py =================================================================== --- trunk/spambayes/scripts/sb_notesfilter.py 2007-07-17 02:17:16 UTC (rev 3153) +++ trunk/spambayes/scripts/sb_notesfilter.py 2007-07-24 00:04:32 UTC (rev 3154) @@ -130,11 +130,11 @@ # The Python Software Foundation and is covered by the Python Software # Foundation license. +from __future__ import generators + __author__ = "Tim Stone " __credits__ = "Mark Hammond, for his remarkable win32 modules." -from __future__ import generators - try: True, False except NameError: From montanaro at users.sourceforge.net Wed Jul 25 15:51:11 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Wed, 25 Jul 2007 06:51:11 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3156] trunk/website Message-ID: Revision: 3156 http://spambayes.svn.sourceforge.net/spambayes/?rev=3156&view=rev Author: montanaro Date: 2007-07-25 06:51:11 -0700 (Wed, 25 Jul 2007) Log Message: ----------- read the file name incorrectly as a misspelling of "prefs changelog" instead of "pre sf changelog"! 
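The sb_notesfilter.py patch in revision 3154 reflects a language rule: a `from __future__ import ...` statement must be the first statement in a module, preceded only by the docstring and comments; placing it after assignments such as `__author__` is a SyntaxError on the Python versions where the feature matters. A minimal sketch of the corrected ordering (module contents are illustrative; `generators` is a no-op import on modern Python, kept here only to show placement):

```python
"""Illustrative module showing __future__ import placement."""

# The __future__ import comes immediately after the docstring,
# before __author__/__credits__ or any other statement.
from __future__ import generators  # harmless no-op on Python >= 2.3

__author__ = "Tim Stone"

def counter(n):
    # Generator syntax, which this __future__ import enabled on Python 2.2.
    i = 0
    while i < n:
        yield i
        i += 1

print(list(counter(3)))  # [0, 1, 2]
```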
Added Paths: ----------- trunk/website/presfchangelog.ht Removed Paths: ------------- trunk/website/prefschangelog.ht Deleted: trunk/website/prefschangelog.ht =================================================================== --- trunk/website/prefschangelog.ht 2007-07-25 13:49:42 UTC (rev 3155) +++ trunk/website/prefschangelog.ht 2007-07-25 13:51:11 UTC (rev 3156) @@ -1,905 +0,0 @@ -

Pre-Sourceforge ChangeLog

-

This changelog lists the commits on the spambayes projects before the - separate project was set up. See also the -old CVS repository, but don't forget that it's now out of date, and you probably want to be looking at the current CVS. -

-
-2002-09-06 02:27  tim_one
-
-	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
-	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
-	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
-
-	This code has been moved to a new SourceForge project (spambayes).
-	
-2002-09-05 15:37  tim_one
-
-	* classifier.py (1.11):
-
-	Added note about MINCOUNT oddities.
-	
-2002-09-05 14:32  tim_one
-
-	* timtest.py (1.17):
-
-	Added note about word length.
-	
-2002-09-05 13:48  tim_one
-
-	* timtest.py (1.16):
-
-	tokenize_word():  Oops!  This was awfully permissive in what it
-	took as being "an email address".  Tightened that, and also
-	avoided 5-gram'ing of email addresses w/ high-bit characters.
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.050  lost
-	    0.075  0.075  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   0 times
-	tied 19 times
-	lost  1 times
-	
-	total unique fp went from 7 to 8
-	
-	false negative percentages
-	    0.764  0.691  won
-	    0.691  0.655  won
-	    0.981  0.945  won
-	    1.309  1.309  tied
-	    1.418  1.164  won
-	    0.873  0.800  won
-	    0.800  0.763  won
-	    1.163  1.163  tied
-	    1.491  1.345  won
-	    1.200  1.127  won
-	    1.381  1.345  won
-	    1.454  1.490  lost
-	    1.164  0.909  won
-	    0.655  0.582  won
-	    0.655  0.691  lost
-	    1.163  1.163  tied
-	    1.200  1.018  won
-	    0.982  0.873  won
-	    0.982  0.909  won
-	    1.236  1.127  won
-	
-	won  15 times
-	tied  3 times
-	lost  2 times
-	
-	total unique fn went from 260 to 249
-	
-	Note:  Each of the two losses there consist of just 1 msg difference.
-	The wins are bigger as well as being more common, and 260-249 = 11
-	spams no longer sneak by any run (which is more than 4% of the 260
-	spams that used to sneak thru!).
-	
-2002-09-05 11:51  tim_one
-
-	* classifier.py (1.10):
-
-	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
-	really matter; leaving it alone.
-	
-2002-09-05 10:02  tim_one
-
-	* classifier.py (1.9):
-
-	A now-rare pure win, changing spamprob() to work harder to find more
-	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
-	column, after in the right:
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.075  0.075  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.075  0.025  won
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   1 times
-	tied 19 times
-	lost  0 times
-	
-	total unique fp went from 9 to 7
-	
-	false negative percentages
-	    0.909  0.764  won
-	    0.800  0.691  won
-	    1.091  0.981  won
-	    1.381  1.309  won
-	    1.491  1.418  won
-	    1.055  0.873  won
-	    0.945  0.800  won
-	    1.236  1.163  won
-	    1.564  1.491  won
-	    1.200  1.200  tied
-	    1.454  1.381  won
-	    1.599  1.454  won
-	    1.236  1.164  won
-	    0.800  0.655  won
-	    0.836  0.655  won
-	    1.236  1.163  won
-	    1.236  1.200  won
-	    1.055  0.982  won
-	    1.127  0.982  won
-	    1.381  1.236  won
-	
-	won  19 times
-	tied  1 times
-	lost  0 times
-	
-	total unique fn went from 284 to 260
-	
-2002-09-04 11:21  tim_one
-
-	* timtest.py (1.15):
-
-	Augmented the spam callback to display spams with low probability.
-	
-2002-09-04 09:53  tim_one
-
-	* Tester.py (1.3), timtest.py (1.14):
-
-	Added support for simple histograms of the probability distributions for
-	ham and spam.
-	
-2002-09-03 12:13  tim_one
-
-	* timtest.py (1.13):
-
-	A reluctant "on principle" change no matter what it does to the stats:
-	take a stab at removing HTML decorations from plain text msgs.  See
-	comments for why it's *only* in plain text msgs.  This puts an end to
-	false positives due to text msgs talking *about* HTML.  Surprisingly, it
-	also gets rid of some false negatives.  Not surprisingly, it introduced
-	another small class of false positives due to the dumbass regexp trick
-	used to approximate HTML tag removal removing pieces of text that had
-	nothing to do with HTML tags (e.g., this happened in the middle of a
-	uuencoded .py file in such a why that it just happened to leave behind
-	a string that "looked like" a spam phrase; but before this it looked
-	like a pile of "too long" lines that didn't generate any tokens --
-	it's a nonsense outcome either way).
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.025  lost
-	    0.075  0.075  tied
-	    0.050  0.025  won
-	    0.025  0.025  tied
-	    0.000  0.025  lost
-	    0.050  0.075  lost
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   1 times
-	tied 16 times
-	lost  3 times
-	
-	total unique fp went from 8 to 9
-	
-	false negative percentages
-	    0.945  0.909  won
-	    0.836  0.800  won
-	    1.200  1.091  won
-	    1.418  1.381  won
-	    1.455  1.491  lost
-	    1.091  1.055  won
-	    1.091  0.945  won
-	    1.236  1.236  tied
-	    1.564  1.564  tied
-	    1.236  1.200  won
-	    1.563  1.454  won
-	    1.563  1.599  lost
-	    1.236  1.236  tied
-	    0.836  0.800  won
-	    0.873  0.836  won
-	    1.236  1.236  tied
-	    1.273  1.236  won
-	    1.018  1.055  lost
-	    1.091  1.127  lost
-	    1.490  1.381  won
-	
-	won  12 times
-	tied  4 times
-	lost  4 times
-	
-	total unique fn went from 292 to 284
-	
-2002-09-03 06:57  tim_one
-
-	* classifier.py (1.8):
-
-	Added a new xspamprob() method, which computes the combined probability
-	"correctly", and a long comment block explaining what happened when I
-	tried it.  There's something worth pursuing here (it greatly improves
-	the false negative rate), but this change alone pushes too many marginal
-	hams into the spam camp
-	
-2002-09-03 05:23  tim_one
-
-	* timtest.py (1.12):
-
-	Made "skip:" tokens shorter.
-	
-	Added a surprising treatment of Organization headers, with a tiny f-n
-	benefit for a tiny cost.  No change in f-p stats.
-	
-	false negative percentages
-	    1.091  0.945  won
-	    0.945  0.836  won
-	    1.236  1.200  won
-	    1.454  1.418  won
-	    1.491  1.455  won
-	    1.091  1.091  tied
-	    1.127  1.091  won
-	    1.236  1.236  tied
-	    1.636  1.564  won
-	    1.345  1.236  won
-	    1.672  1.563  won
-	    1.599  1.563  won
-	    1.236  1.236  tied
-	    0.836  0.836  tied
-	    1.018  0.873  won
-	    1.236  1.236  tied
-	    1.273  1.273  tied
-	    1.055  1.018  won
-	    1.091  1.091  tied
-	    1.527  1.490  won
-	
-	won  13 times
-	tied  7 times
-	lost  0 times
-	
-	total unique fn went from 302 to 292
-	
-2002-09-03 02:18  tim_one
-
-	* timtest.py (1.11):
-
-	tokenize_word():  dropped the prefix from the signature; it's faster
-	to let the caller do it, and this also repaired a bug in one place it
-	was being used (well, a *conceptual* bug anyway, in that the code didn't
-	do what I intended there).  This changes the stats in an insignificant
-	way.  The f-p stats didn't change.  The f-n stats shifted by one message
-	in a few cases:
-	
-	false negative percentages
-	    1.091  1.091  tied
-	    0.945  0.945  tied
-	    1.200  1.236  lost
-	    1.454  1.454  tied
-	    1.491  1.491  tied
-	    1.091  1.091  tied
-	    1.091  1.127  lost
-	    1.236  1.236  tied
-	    1.636  1.636  tied
-	    1.382  1.345  won
-	    1.636  1.672  lost
-	    1.599  1.599  tied
-	    1.236  1.236  tied
-	    0.836  0.836  tied
-	    1.018  1.018  tied
-	    1.236  1.236  tied
-	    1.273  1.273  tied
-	    1.055  1.055  tied
-	    1.091  1.091  tied
-	    1.527  1.527  tied
-	
-	won   1 times
-	tied 16 times
-	lost  3 times
-	
-	total unique unchanged
-	
-2002-09-02 19:30  tim_one
-
-	* timtest.py (1.10):
-
-	Don't ask me why this helps -- I don't really know!  When skipping "long
-	words", generating a token with a brief hint about what and how much got
-	skipped makes a definite improvement in the f-n rate, and doesn't affect
-	the f-p rate at all.  Since experiment said it's a winner, I'm checking
-	it in.  Before (left columan) and after (right column):
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.075  0.075  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   0 times
-	tied 20 times
-	lost  0 times
-	
-	total unique fp went from 8 to 8
-	
-	false negative percentages
-	    1.236  1.091  won
-	    1.164  0.945  won
-	    1.454  1.200  won
-	    1.599  1.454  won
-	    1.527  1.491  won
-	    1.236  1.091  won
-	    1.163  1.091  won
-	    1.309  1.236  won
-	    1.891  1.636  won
-	    1.418  1.382  won
-	    1.745  1.636  won
-	    1.708  1.599  won
-	    1.491  1.236  won
-	    0.836  0.836  tied
-	    1.091  1.018  won
-	    1.309  1.236  won
-	    1.491  1.273  won
-	    1.127  1.055  won
-	    1.309  1.091  won
-	    1.636  1.527  won
-	
-	won  19 times
-	tied  1 times
-	lost  0 times
-	
-	total unique fn went from 336 to 302
-	
-2002-09-02 17:55  tim_one
-
-	* timtest.py (1.9):
-
-	Some comment changes and nesting reduction.
-	
-2002-09-02 11:18  tim_one
-
-	* timtest.py (1.8):
-
-	Fixed some out-of-date comments.
-	
-	Made URL clumping lumpier:  now distinguishes among just "first field",
-	"second field", and "everything else".
-	
-	Changed tag names for email address fields (semantically neutral).
-	
-	Added "From:" line tagging.
-	
-	These add up to an almost pure win.  Before-and-after f-n rates across 20
-	runs:
-	
-	1.418   1.236
-	1.309   1.164
-	1.636   1.454
-	1.854   1.599
-	1.745   1.527
-	1.418   1.236
-	1.381   1.163
-	1.418   1.309
-	2.109   1.891
-	1.491   1.418
-	1.854   1.745
-	1.890   1.708
-	1.818   1.491
-	1.055   0.836
-	1.164   1.091
-	1.599   1.309
-	1.600   1.491
-	1.127   1.127
-	1.164   1.309
-	1.781   1.636
-	
-	It only increased in one run.  The variance appears to have been reduced
-	too (I didn't bother to compute that, though).
-	
-	Before-and-after f-p rates across 20 runs:
-	
-	0.000   0.000
-	0.000   0.000
-	0.075   0.050
-	0.000   0.000
-	0.025   0.025
-	0.050   0.025
-	0.075   0.050
-	0.025   0.025
-	0.025   0.025
-	0.025   0.000
-	0.100   0.075
-	0.050   0.050
-	0.025   0.025
-	0.000   0.000
-	0.075   0.050
-	0.025   0.025
-	0.025   0.025
-	0.000   0.000
-	0.075   0.025
-	0.100   0.050
-	
-	Note that 0.025% is a single message; it's really impossible to *measure*
-	an improvement in the f-p rate anymore with 4000-msg ham sets.
-	
-	Across all 20 runs,
-	
-	the total # of unique f-n fell from 353 to 336
-	the total # of unique f-p fell from 13 to 8
-	
-2002-09-02 10:06  tim_one
-
-	* timtest.py (1.7):
-
-	A number of changes.  The most significant is paying attention to the
-	Subject line (I was wrong before when I said my c.l.py ham corpus was
-	unusable for this due to Mailman-injected decorations).  In all, across
-	my 20 test runs,
-	
-	the total # of unique false positives fell from 23 to 13
-	the total # of unique false negatives rose from 337 to 353
-	
-	Neither result is statistically significant, although I bet the first
-	one would be if I pissed away a few days trying to come up with a more
-	realistic model for what "stat. sig." means here .
-	
-2002-09-01 17:22  tim_one
-
-	* classifier.py (1.7):
-
-	Added a comment block about HAMBIAS experiments.  There's no clearer
-	example of trading off precision against recall, and you can favor either
-	at the expense of the other to any degree you like by fiddling this knob.
-	
-2002-09-01 14:42  tim_one
-
-	* timtest.py (1.6):
-
-	Long new comment block summarizing all my experiments with character
-	n-grams.  Bottom line is that they have nothing going for them, and a
-	lot going against them, under Graham's scheme.  I believe there may
-	still be a place for them in *part* of a word-based tokenizer, though.
-	
-2002-09-01 10:05  tim_one
-
-	* classifier.py (1.6):
-
-	spamprob():  Never count unique words more than once anymore.  Counting
-	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
-	that's now a small drag instead.
-	
-2002-09-01 07:33  tim_one
-
-	* rebal.py (1.3), timtest.py (1.5):
-
-	Folding case is here to stay.  Read the new comments for why.  This may
-	be a bad idea for other languages, though.
-	
-	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
-	http is spam-neutral, but https is a strong spam indicator.  That
-	surprised me.
-	
-2002-09-01 06:47  tim_one
-
-	* classifier.py (1.5):
-
-	spamprob():  Removed useless check that wordstream isn't empty.  For one
-	thing, it didn't work, since wordstream is often an iterator.  Even if
-	it did work, it isn't needed -- the probability of an empty wordstream
-	gets computed as 0.5 based on the total absence of evidence.
-	
-2002-09-01 05:37  tim_one
-
-	* timtest.py (1.4):
-
-	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
-	details).
-	
-2002-09-01 05:09  tim_one
-
-	* rebal.py (1.2):
-
-	Aha!  Staring at the checkin msg revealed a logic bug that explains why
-	my ham directories sometimes remained unbalanced after running this --
-	if the randomly selected reservoir msg turned out to be spam, it wasn't
-	pushing the too-small directory on the stack again.
-	
-2002-09-01 04:56  tim_one
-
-	* timtest.py (1.3):
-
-	textparts():  This was failing to weed out redundant HTML in cases like
-	this:
-	
-	    multipart/alternative
-	        text/plain
-	        multipart/related
-	            text/html
-	
-	The tokenizer here also transforms everything to lowercase, but that's
-	an accident due simply to that I'm testing that now.  Can't say for
-	sure until the test runs end, but so far it looks like a bad idea for
-	the false positive rate.
-	
-2002-09-01 04:52  tim_one
-
-	* rebal.py (1.1):
-
-	A little script I use to rebalance the ham corpora after deleting what
-	turns out to be spam.  I have another Ham/reservoir directory with a
-	few thousand randomly selected msgs from the presumably-good archive.
-	These aren't used in scoring or training.  This script marches over all
-	the ham corpora directories that are used, and if any have gotten too
-	big (this never happens anymore) deletes msgs at random from them, and
-	if any have gotten too small plugs the holes by moving in random
-	msgs from the reservoir.
-	
-2002-09-01 03:25  tim_one
-
-	* classifier.py (1.4), timtest.py (1.2):
-
-	Boost UNKNOWN_SPAMPROB.
-	# The spam probability assigned to words never seen before.  Graham used
-	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
-	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
-	# negative rate by over 1/3.  The f-p rate increased, but there were so few
-	# f-ps that the increase wasn't statistically significant.  It also caught
-	# 13 more spams erroneously classified as ham.  By eyeball (and common
-	# sense ), this has most effect on very short messages, where there
-	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
-	# completely ignored by spamprob(), in favor of *any* word with *any* prob
-	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
-	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
-	# on the face of it.
-	
-2002-08-31 16:50  tim_one
-
-	* timtest.py (1.1):
-
-	This is a driver I've been using for test runs.  It's specific to my
-	corpus directories, but has useful stuff in it all the same.
-	
-2002-08-31 16:49  tim_one
-
-	* classifier.py (1.3):
-
-	The explanation for these changes was on Python-Dev.  You'll find out
-	why if the moderator approves the msg .
-	
-2002-08-29 07:04  tim_one
-
-	* Tester.py (1.2), classifier.py (1.2):
-
-	Tester.py:  Repaired a comment.  The false_{positive,negative}_rate()
-	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
-	too hard to get motivated to reduce 0.01 <0.1 wink>).
-	
-	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
-	the 15 strongest (word, probability) pairs is returned as well as the
-	overall probability (this is how to find out why a message scored as it
-	did).
-	
-2002-08-28 13:45  montanaro
-
-	* GBayes.py (1.15):
-
-	ehh - it actually didn't work all that well.  the spurious report that it
-	did well was pilot error.  besides, tim's report suggests that a simple
-	str.split() may be the best tokenizer anyway.
-	
-2002-08-28 10:45  montanaro
-
-	* setup.py (1.1):
-
-	trivial little setup.py file - i don't expect most people will be interested
-	in this, but it makes it a tad simpler to work with now that there are two
-	files
-	
-2002-08-28 10:43  montanaro
-
-	* GBayes.py (1.14):
-
-	add simple trigram tokenizer - this seems to yield the best results I've
-	seen so far (but has not been extensively tested)
-	
-2002-08-28 08:10  tim_one
-
-	* Tester.py (1.1):
-
-	A start at a testing class.  There isn't a lot here, but it automates
-	much of the tedium, and as the doctest shows it can already do
-	useful things, like remembering which inputs were misclassified.
-	
-2002-08-27 06:45  tim_one
-
-	* mboxcount.py (1.5):
-
-	Updated stats to what Barry and I both get now.  Fiddled output.
-	
-2002-08-27 05:09  bwarsaw
-
-	* split.py (1.5), splitn.py (1.2):
-
-	_factory(): Return the empty string instead of None in the except
-	clauses, so that for-loops won't break prematurely.  mailbox.py's base
-	class defines an __iter__() that raises a StopIteration on None
-	return.
-	
-2002-08-27 04:55  tim_one
-
-	* GBayes.py (1.13), mboxcount.py (1.4):
-
-	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
-	
-2002-08-27 04:40  bwarsaw
-
-	* mboxcount.py (1.3):
-
-	Some stats after splitting b/w good messages and unparseable messages
-	
-2002-08-27 04:23  bwarsaw
-
-	* mboxcount.py (1.2):
-
-	_factory(): Use a marker object to distinguish between good messages and
-	unparseable messages.  For some reason, returning None from the except
-	clause in _factory() caused Python 2.2.1 to exit early out of the for
-	loop.
-	
-	main(): Print statistics about both the number of good messages and
-	the number of unparseable messages.
-	
-2002-08-27 03:06  tim_one
-
-	* cleanarch (1.2):
-
-	"From " is a header more than a separator, so don't bump the msg count
-	at the end.
-	
-2002-08-24 01:42  tim_one
-
-	* GBayes.py (1.12), classifier.py (1.1):
-
-	Moved all the interesting code that was in the *original* GBayes.py into
-	a new classifier.py.  It was designed to have a very clean interface,
-	and there's no reason to keep slamming everything into one file.  The
-	ever-growing tokenizer stuff should probably also be split out, leaving
-	GBayes.py a pure driver.
-	
-	Also repaired _test() (Skip's checkin left it without a binding for
-	the tokenize function).
-	
-2002-08-24 01:17  tim_one
-
-	* splitn.py (1.1):
-
-	Utility to split an mbox into N random pieces in one gulp.  This gives
-	a convenient way to break a giant corpus into multiple files that can
-	then be used independently across multiple training and testing runs.
-	It's important to do multiple runs on different random samples to avoid
-	drawing conclusions based on accidents in a single random training corpus;
-	if the algorithm is robust, it should have similar performance across
-	all runs.
-	
-2002-08-24 00:25  montanaro
-
-	* GBayes.py (1.11):
-
-	Allow command line specification of tokenize functions
-	    run w/ -t flag to override default tokenize function
-	    run w/ -H flag to see list of tokenize functions
-	
-	When adding a new tokenizer, make docstring a short description and add a
-	key/value pair to the tokenizers dict.  The key is what the user specifies.
-	The value is a tokenize function.
-	
-	Added two new tokenizers - tokenize_wordpairs_foldcase and
-	tokenize_words_and_pairs.  It's not obvious that either is better than any
-	of the preexisting functions.
-	
-	Should probably add info to the pickle which indicates the tokenizing
-	function used to build it.  This could then be the default for spam
-	detection runs.
-	
-	Next step is to drive this with spam/non-spam corpora, selecting each of the
-	various tokenizer functions, and presenting the results in tabular form.
-	
-2002-08-23 13:10  tim_one
-
-	* GBayes.py (1.10):
-
-	spamprob():  Commented some subtleties.
-	
-	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
-	is that you can't delete entries from a dict that's being crawled over
-	by .iteritems(), which is why I (I suddenly recall) materialized a
-	list of words to be deleted the first time I wrote this.  It's a lot
-	better to materialize a list of to-be-deleted words than to materialize
-	the entire database in a dict.items() list.
-	
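The clearjunk() pitfall described above is the standard mutate-while-iterating trap: you cannot delete entries from a dict while crawling over it. A minimal illustration with made-up word probabilities (not the actual GBayes.py data structures):

```python
# You can't delete from a dict while iterating over it, so first
# materialize the list of keys to delete, then delete in a second pass.
wordinfo = {"cheap": 0.99, "madeup": 0.51, "python": 0.01}  # made-up data
doomed = [word for word, prob in wordinfo.items() if prob > 0.5]
for word in doomed:
    del wordinfo[word]
print(sorted(wordinfo))  # prints ['python']
```

Materializing just the doomed keys keeps the extra memory proportional to what gets deleted, instead of copying the whole database with dict.items().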
-2002-08-23 12:36  tim_one
-
-	* mboxcount.py (1.1):
-
-	Utility to count and display the # of msgs in (one or more) Unix mboxes.
-	
-2002-08-23 12:11  tim_one
-
-	* split.py (1.4):
-
-	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
-	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
-	the latter invents an extra newline.
-	
-2002-08-22 07:01  tim_one
-
-	* GBayes.py (1.9):
-
-	Renamed "modtime" to "atime", to better reflect its meaning, and added a
-	comment block to explain that better.
-	
-2002-08-21 08:07  bwarsaw
-
-	* split.py (1.3):
-
-	Guido suggests a different order for the positional args.
-	
-2002-08-21 07:37  bwarsaw
-
-	* split.py (1.2):
-
-	Get rid of the -1 and -2 arguments and make them positional.
-	
-2002-08-21 07:18  bwarsaw
-
-	* split.py (1.1):
-
-	A simple mailbox splitter
-	
-2002-08-21 06:42  tim_one
-
-	* GBayes.py (1.8):
-
-	Added a bunch of simple tokenizers.  The originals are renamed to
-	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
-	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
-	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
-	any of these to be the last word.  When Barry has the test corpus
-	set up it should be easy to let the data tell us which "pure" strategy
-	works best.  Straight character n-grams are very appealing because
-	they're the simplest and most language-neutral; I didn't have any luck
-	with them over the weekend, but the size of my training data was
-	trivial.
-	
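A character 5-gram tokenizer of the kind named above is only a few lines; this is an illustrative sketch, not the original GBayes.py code (which also had a whitespace-collapsing variant):

```python
def tokenize_5gram_foldcase(text):
    """Yield overlapping, case-folded character 5-grams."""
    text = text.lower()
    for i in range(len(text) - 4):
        yield text[i:i + 5]

print(list(tokenize_5gram_foldcase("SpamBayes")))
# ['spamb', 'pamba', 'ambay', 'mbaye', 'bayes']
```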
-2002-08-21 05:08  bwarsaw
-
-	* cleanarch (1.1):
-
-	An archive cleaner, adapted from the Mailman 2.1b3 version, but
-	de-Mailman-ified.
-	
-2002-08-21 04:44  gvanrossum
-
-	* GBayes.py (1.7):
-
-	Indent repair in clearjunk().
-	
-2002-08-21 04:22  gvanrossum
-
-	* GBayes.py (1.6):
-
-	Some minor cleanup:
-	
-	- Move the identifying comment to the top, clarify it a bit, and add
-	  author info.
-	
-	- There's no reason for _time and _heapreplace to be hidden names;
-	  change these back to time and heapreplace.
-	
-	- Rename main1() to _test() and main2() to main(); when main() sees
-	  there are no options or arguments, it runs _test().
-	
-	- Get rid of a list comprehension from clearjunk().
-	
-	- Put wordinfo.get as a local variable in _add_msg().
-	
-2002-08-20 15:16  tim_one
-
-	* GBayes.py (1.5):
-
-	Neutral typo repairs, except that clearjunk() has a better chance of
-	not blowing up immediately now .
-	
-2002-08-20 13:49  montanaro
-
-	* GBayes.py (1.4):
-
-	help make it more easily executable... ;-)
-	
-2002-08-20 09:32  bwarsaw
-
-	* GBayes.py (1.3):
-
-	Lots of hacks great and small to the main() program, but I didn't
-	touch the guts of the algorithm.
-	
-	Added a module docstring/usage message.
-	
-	Added a bunch of switches to train the system on an mbox of known good
-	and known spam messages (using PortableUnixMailbox only for now).
-	Uses the email package but does no decoding of message bodies.  Also,
-	allows you to specify a file for pickling the training data, and for
-	setting a threshold, above which messages get an X-Bayes-Score
-	header.  Also output messages (marked and unmarked) to an output file
-	for retraining.
-	
-	Print some statistics at the end.
-	
-2002-08-20 05:43  tim_one
-
-	* GBayes.py (1.2):
-
-	Turned off debugging vrbl mistakenly checked in at True.
-	
-	unlearn():  Gave this an update_probabilities=True default arg, for
-	symmetry with learn().
-	
-2002-08-20 03:33  tim_one
-
-	* GBayes.py (1.1):
-
-	An implementation of Paul Graham's Bayes-like spam classifier.
-
-
Copied: trunk/website/presfchangelog.ht (from rev 3155, trunk/website/prefschangelog.ht)
===================================================================
--- trunk/website/presfchangelog.ht	(rev 0)
+++ trunk/website/presfchangelog.ht	2007-07-25 13:51:11 UTC (rev 3156)
@@ -0,0 +1,905 @@
+
+Pre-Sourceforge ChangeLog
+
+This changelog lists the commits on the spambayes project before the
+separate project was set up.  See also the old CVS repository, but don't
+forget that it's now out of date, and you probably want to be looking at
+the current CVS.
+
+2002-09-06 02:27  tim_one
+
+	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
+	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
+	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
+
+	This code has been moved to a new SourceForge project (spambayes).
+	
+2002-09-05 15:37  tim_one
+
+	* classifier.py (1.11):
+
+	Added note about MINCOUNT oddities.
+	
+2002-09-05 14:32  tim_one
+
+	* timtest.py (1.17):
+
+	Added note about word length.
+	
+2002-09-05 13:48  tim_one
+
+	* timtest.py (1.16):
+
+	tokenize_word():  Oops!  This was awfully permissive in what it
+	took as being "an email address".  Tightened that, and also
+	avoided 5-gram'ing of email addresses w/ high-bit characters.
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.050  lost
+	    0.075  0.075  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   0 times
+	tied 19 times
+	lost  1 times
+	
+	total unique fp went from 7 to 8
+	
+	false negative percentages
+	    0.764  0.691  won
+	    0.691  0.655  won
+	    0.981  0.945  won
+	    1.309  1.309  tied
+	    1.418  1.164  won
+	    0.873  0.800  won
+	    0.800  0.763  won
+	    1.163  1.163  tied
+	    1.491  1.345  won
+	    1.200  1.127  won
+	    1.381  1.345  won
+	    1.454  1.490  lost
+	    1.164  0.909  won
+	    0.655  0.582  won
+	    0.655  0.691  lost
+	    1.163  1.163  tied
+	    1.200  1.018  won
+	    0.982  0.873  won
+	    0.982  0.909  won
+	    1.236  1.127  won
+	
+	won  15 times
+	tied  3 times
+	lost  2 times
+	
+	total unique fn went from 260 to 249
+	
+	Note:  Each of the two losses there consists of just 1 msg difference.
+	The wins are bigger as well as being more common, and 260-249 = 11
+	spams no longer sneak by any run (which is more than 4% of the 260
+	spams that used to sneak thru!).
+	
+2002-09-05 11:51  tim_one
+
+	* classifier.py (1.10):
+
+	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
+	really matter; leaving it alone.
+	
+2002-09-05 10:02  tim_one
+
+	* classifier.py (1.9):
+
+	A now-rare pure win, changing spamprob() to work harder to find more
+	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
+	column, after in the right:
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.075  0.075  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.075  0.025  won
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   1 times
+	tied 19 times
+	lost  0 times
+	
+	total unique fp went from 9 to 7
+	
+	false negative percentages
+	    0.909  0.764  won
+	    0.800  0.691  won
+	    1.091  0.981  won
+	    1.381  1.309  won
+	    1.491  1.418  won
+	    1.055  0.873  won
+	    0.945  0.800  won
+	    1.236  1.163  won
+	    1.564  1.491  won
+	    1.200  1.200  tied
+	    1.454  1.381  won
+	    1.599  1.454  won
+	    1.236  1.164  won
+	    0.800  0.655  won
+	    0.836  0.655  won
+	    1.236  1.163  won
+	    1.236  1.200  won
+	    1.055  0.982  won
+	    1.127  0.982  won
+	    1.381  1.236  won
+	
+	won  19 times
+	tied  1 times
+	lost  0 times
+	
+	total unique fn went from 284 to 260
+	
+2002-09-04 11:21  tim_one
+
+	* timtest.py (1.15):
+
+	Augmented the spam callback to display spams with low probability.
+	
+2002-09-04 09:53  tim_one
+
+	* Tester.py (1.3), timtest.py (1.14):
+
+	Added support for simple histograms of the probability distributions for
+	ham and spam.
+	
+2002-09-03 12:13  tim_one
+
+	* timtest.py (1.13):
+
+	A reluctant "on principle" change no matter what it does to the stats:
+	take a stab at removing HTML decorations from plain text msgs.  See
+	comments for why it's *only* in plain text msgs.  This puts an end to
+	false positives due to text msgs talking *about* HTML.  Surprisingly, it
+	also gets rid of some false negatives.  Not surprisingly, it introduced
+	another small class of false positives due to the dumbass regexp trick
+	used to approximate HTML tag removal removing pieces of text that had
+	nothing to do with HTML tags (e.g., this happened in the middle of a
+	uuencoded .py file in such a way that it just happened to leave behind
+	a string that "looked like" a spam phrase; but before this it looked
+	like a pile of "too long" lines that didn't generate any tokens --
+	it's a nonsense outcome either way).
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.025  lost
+	    0.075  0.075  tied
+	    0.050  0.025  won
+	    0.025  0.025  tied
+	    0.000  0.025  lost
+	    0.050  0.075  lost
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   1 times
+	tied 16 times
+	lost  3 times
+	
+	total unique fp went from 8 to 9
+	
+	false negative percentages
+	    0.945  0.909  won
+	    0.836  0.800  won
+	    1.200  1.091  won
+	    1.418  1.381  won
+	    1.455  1.491  lost
+	    1.091  1.055  won
+	    1.091  0.945  won
+	    1.236  1.236  tied
+	    1.564  1.564  tied
+	    1.236  1.200  won
+	    1.563  1.454  won
+	    1.563  1.599  lost
+	    1.236  1.236  tied
+	    0.836  0.800  won
+	    0.873  0.836  won
+	    1.236  1.236  tied
+	    1.273  1.236  won
+	    1.018  1.055  lost
+	    1.091  1.127  lost
+	    1.490  1.381  won
+	
+	won  12 times
+	tied  4 times
+	lost  4 times
+	
+	total unique fn went from 292 to 284
+	
+2002-09-03 06:57  tim_one
+
+	* classifier.py (1.8):
+
+	Added a new xspamprob() method, which computes the combined probability
+	"correctly", and a long comment block explaining what happened when I
+	tried it.  There's something worth pursuing here (it greatly improves
+	the false negative rate), but this change alone pushes too many marginal
+	hams into the spam camp.
+	
+2002-09-03 05:23  tim_one
+
+	* timtest.py (1.12):
+
+	Made "skip:" tokens shorter.
+	
+	Added a surprising treatment of Organization headers, with a tiny f-n
+	benefit for a tiny cost.  No change in f-p stats.
+	
+	false negative percentages
+	    1.091  0.945  won
+	    0.945  0.836  won
+	    1.236  1.200  won
+	    1.454  1.418  won
+	    1.491  1.455  won
+	    1.091  1.091  tied
+	    1.127  1.091  won
+	    1.236  1.236  tied
+	    1.636  1.564  won
+	    1.345  1.236  won
+	    1.672  1.563  won
+	    1.599  1.563  won
+	    1.236  1.236  tied
+	    0.836  0.836  tied
+	    1.018  0.873  won
+	    1.236  1.236  tied
+	    1.273  1.273  tied
+	    1.055  1.018  won
+	    1.091  1.091  tied
+	    1.527  1.490  won
+	
+	won  13 times
+	tied  7 times
+	lost  0 times
+	
+	total unique fn went from 302 to 292
+	
+2002-09-03 02:18  tim_one
+
+	* timtest.py (1.11):
+
+	tokenize_word():  dropped the prefix from the signature; it's faster
+	to let the caller do it, and this also repaired a bug in one place it
+	was being used (well, a *conceptual* bug anyway, in that the code didn't
+	do what I intended there).  This changes the stats in an insignificant
+	way.  The f-p stats didn't change.  The f-n stats shifted by one message
+	in a few cases:
+	
+	false negative percentages
+	    1.091  1.091  tied
+	    0.945  0.945  tied
+	    1.200  1.236  lost
+	    1.454  1.454  tied
+	    1.491  1.491  tied
+	    1.091  1.091  tied
+	    1.091  1.127  lost
+	    1.236  1.236  tied
+	    1.636  1.636  tied
+	    1.382  1.345  won
+	    1.636  1.672  lost
+	    1.599  1.599  tied
+	    1.236  1.236  tied
+	    0.836  0.836  tied
+	    1.018  1.018  tied
+	    1.236  1.236  tied
+	    1.273  1.273  tied
+	    1.055  1.055  tied
+	    1.091  1.091  tied
+	    1.527  1.527  tied
+	
+	won   1 times
+	tied 16 times
+	lost  3 times
+	
+	total unique unchanged
+	
+2002-09-02 19:30  tim_one
+
+	* timtest.py (1.10):
+
+	Don't ask me why this helps -- I don't really know!  When skipping "long
+	words", generating a token with a brief hint about what and how much got
+	skipped makes a definite improvement in the f-n rate, and doesn't affect
+	the f-p rate at all.  Since experiment said it's a winner, I'm checking
+	it in.  Before (left column) and after (right column):
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.075  0.075  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   0 times
+	tied 20 times
+	lost  0 times
+	
+	total unique fp went from 8 to 8
+	
+	false negative percentages
+	    1.236  1.091  won
+	    1.164  0.945  won
+	    1.454  1.200  won
+	    1.599  1.454  won
+	    1.527  1.491  won
+	    1.236  1.091  won
+	    1.163  1.091  won
+	    1.309  1.236  won
+	    1.891  1.636  won
+	    1.418  1.382  won
+	    1.745  1.636  won
+	    1.708  1.599  won
+	    1.491  1.236  won
+	    0.836  0.836  tied
+	    1.091  1.018  won
+	    1.309  1.236  won
+	    1.491  1.273  won
+	    1.127  1.055  won
+	    1.309  1.091  won
+	    1.636  1.527  won
+	
+	won  19 times
+	tied  1 times
+	lost  0 times
+	
+	total unique fn went from 336 to 302
+	
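The hint token can be as simple as recording the skipped word's first character and its rough length. The sketch below uses a hypothetical 12-character cutoff and may differ from the real tokenizer's details:

```python
def tokenize_word(word, maxlen=12):
    """Yield the word itself, or a summary token for over-long 'words'."""
    if len(word) <= maxlen:
        yield word
    else:
        # Bucket the length to the nearest 10 so similar skips
        # collapse into the same token.
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

print(list(tokenize_word("x" * 23)))  # ['skip:x 20']
```

The bucketing matters: identical tokens across many messages are what give the classifier something to count.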
+2002-09-02 17:55  tim_one
+
+	* timtest.py (1.9):
+
+	Some comment changes and nesting reduction.
+	
+2002-09-02 11:18  tim_one
+
+	* timtest.py (1.8):
+
+	Fixed some out-of-date comments.
+	
+	Made URL clumping lumpier:  now distinguishes among just "first field",
+	"second field", and "everything else".
+	
+	Changed tag names for email address fields (semantically neutral).
+	
+	Added "From:" line tagging.
+	
+	These add up to an almost pure win.  Before-and-after f-n rates across 20
+	runs:
+	
+	1.418   1.236
+	1.309   1.164
+	1.636   1.454
+	1.854   1.599
+	1.745   1.527
+	1.418   1.236
+	1.381   1.163
+	1.418   1.309
+	2.109   1.891
+	1.491   1.418
+	1.854   1.745
+	1.890   1.708
+	1.818   1.491
+	1.055   0.836
+	1.164   1.091
+	1.599   1.309
+	1.600   1.491
+	1.127   1.127
+	1.164   1.309
+	1.781   1.636
+	
+	It only increased in one run.  The variance appears to have been reduced
+	too (I didn't bother to compute that, though).
+	
+	Before-and-after f-p rates across 20 runs:
+	
+	0.000   0.000
+	0.000   0.000
+	0.075   0.050
+	0.000   0.000
+	0.025   0.025
+	0.050   0.025
+	0.075   0.050
+	0.025   0.025
+	0.025   0.025
+	0.025   0.000
+	0.100   0.075
+	0.050   0.050
+	0.025   0.025
+	0.000   0.000
+	0.075   0.050
+	0.025   0.025
+	0.025   0.025
+	0.000   0.000
+	0.075   0.025
+	0.100   0.050
+	
+	Note that 0.025% is a single message; it's really impossible to *measure*
+	an improvement in the f-p rate anymore with 4000-msg ham sets.
+	
+	Across all 20 runs,
+	
+	the total # of unique f-n fell from 353 to 336
+	the total # of unique f-p fell from 13 to 8
+	
+2002-09-02 10:06  tim_one
+
+	* timtest.py (1.7):
+
+	A number of changes.  The most significant is paying attention to the
+	Subject line (I was wrong before when I said my c.l.py ham corpus was
+	unusable for this due to Mailman-injected decorations).  In all, across
+	my 20 test runs,
+	
+	the total # of unique false positives fell from 23 to 13
+	the total # of unique false negatives rose from 337 to 353
+	
+	Neither result is statistically significant, although I bet the first
+	one would be if I pissed away a few days trying to come up with a more
+	realistic model for what "stat. sig." means here .
+	
+2002-09-01 17:22  tim_one
+
+	* classifier.py (1.7):
+
+	Added a comment block about HAMBIAS experiments.  There's no clearer
+	example of trading off precision against recall, and you can favor either
+	at the expense of the other to any degree you like by fiddling this knob.
+	
+2002-09-01 14:42  tim_one
+
+	* timtest.py (1.6):
+
+	Long new comment block summarizing all my experiments with character
+	n-grams.  Bottom line is that they have nothing going for them, and a
+	lot going against them, under Graham's scheme.  I believe there may
+	still be a place for them in *part* of a word-based tokenizer, though.
+	
+2002-09-01 10:05  tim_one
+
+	* classifier.py (1.6):
+
+	spamprob():  Never count unique words more than once anymore.  Counting
+	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
+	that's now a small drag instead.
+	
+2002-09-01 07:33  tim_one
+
+	* rebal.py (1.3), timtest.py (1.5):
+
+	Folding case is here to stay.  Read the new comments for why.  This may
+	be a bad idea for other languages, though.
+	
+	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
+	http is spam-neutral, but https is a strong spam indicator.  That
+	surprised me.
+	
+2002-09-01 06:47  tim_one
+
+	* classifier.py (1.5):
+
+	spamprob():  Removed useless check that wordstream isn't empty.  For one
+	thing, it didn't work, since wordstream is often an iterator.  Even if
+	it did work, it isn't needed -- the probability of an empty wordstream
+	gets computed as 0.5 based on the total absence of evidence.
+	
+2002-09-01 05:37  tim_one
+
+	* timtest.py (1.4):
+
+	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
+	details).
+	
+2002-09-01 05:09  tim_one
+
+	* rebal.py (1.2):
+
+	Aha!  Staring at the checkin msg revealed a logic bug that explains why
+	my ham directories sometimes remained unbalanced after running this --
+	if the randomly selected reservoir msg turned out to be spam, it wasn't
+	pushing the too-small directory on the stack again.
+	
+2002-09-01 04:56  tim_one
+
+	* timtest.py (1.3):
+
+	textparts():  This was failing to weed out redundant HTML in cases like
+	this:
+	
+	    multipart/alternative
+	        text/plain
+	        multipart/related
+	            text/html
+	
+	The tokenizer here also transforms everything to lowercase, but that's
+	an accident due simply to that I'm testing that now.  Can't say for
+	sure until the test runs end, but so far it looks like a bad idea for
+	the false positive rate.
+	
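The fix amounts to treating any text/html beneath a multipart/alternative as redundant whenever a text/plain alternative exists, however deeply the HTML is nested. A sketch of that idea using the email package (an illustration, not the original timtest.py code):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def textparts(msg):
    """Return leaf text parts, dropping text/html parts that merely
    duplicate a text/plain sibling inside a multipart/alternative."""
    texts, redundant_html = set(), set()
    for part in msg.walk():
        if part.get_content_type() == "multipart/alternative":
            plain, html = set(), set()
            for sub in part.walk():
                if sub.get_content_type() == "text/plain":
                    plain.add(sub)
                elif sub.get_content_type() == "text/html":
                    html.add(sub)
            if plain:  # the HTML rendering adds nothing new
                redundant_html |= html
        elif part.get_content_maintype() == "text":
            texts.add(part)
    return texts - redundant_html

# The problem case from the checkin: HTML nested under multipart/related.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("hello", "plain"))
related = MIMEMultipart("related")
related.attach(MIMEText("<b>hello</b>", "html"))
msg.attach(related)
print([p.get_content_type() for p in textparts(msg)])  # ['text/plain']
```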
+2002-09-01 04:52  tim_one
+
+	* rebal.py (1.1):
+
+	A little script I use to rebalance the ham corpora after deleting what
+	turns out to be spam.  I have another Ham/reservoir directory with a
+	few thousand randomly selected msgs from the presumably-good archive.
+	These aren't used in scoring or training.  This script marches over all
+	the ham corpora directories that are used, and if any have gotten too
+	big (this never happens anymore) deletes msgs at random from them, and
+	if any have gotten too small plugs the holes by moving in random
+	msgs from the reservoir.
+	
+2002-09-01 03:25  tim_one
+
+	* classifier.py (1.4), timtest.py (1.2):
+
+	Boost UNKNOWN_SPAMPROB.
+	# The spam probability assigned to words never seen before.  Graham used
+	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
+	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
+	# negative rate by over 1/3.  The f-p rate increased, but there were so few
+	# f-ps that the increase wasn't statistically significant.  It also caught
+	# 13 more spams erroneously classified as ham.  By eyeball (and common
+	# sense ), this has most effect on very short messages, where there
+	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
+	# completely ignored by spamprob(), in favor of *any* word with *any* prob
+	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
+	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
+	# on the face of it.
+	
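The "completely ignored" claim follows directly from Graham's combining formula, P = prod(p) / (prod(p) + prod(1-p)): a word with prob 0.5 scales numerator and denominator by the same factor. A quick check (math.prod needs Python 3.8+; the word probabilities are made up):

```python
import math

def graham_combine(probs):
    """Combine per-word spam probabilities Graham-style:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    p = math.prod(probs)
    q = math.prod(1.0 - x for x in probs)
    return p / (p + q)

base = graham_combine([0.99, 0.2])
with_unknown = graham_combine([0.99, 0.2, 0.5])
print(abs(base - with_unknown) < 1e-12)  # a 0.5 word changes nothing
```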
+2002-08-31 16:50  tim_one
+
+	* timtest.py (1.1):
+
+	This is a driver I've been using for test runs.  It's specific to my
+	corpus directories, but has useful stuff in it all the same.
+	
+2002-08-31 16:49  tim_one
+
+	* classifier.py (1.3):
+
+	The explanation for these changes was on Python-Dev.  You'll find out
+	why if the moderator approves the msg .
+	
+2002-08-29 07:04  tim_one
+
+	* Tester.py (1.2), classifier.py (1.2):
+
+	Tester.py:  Repaired a comment.  The false_{positive,negative}_rate()
+	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
+	too hard to get motivated to reduce 0.01 <0.1 wink>).
+	
+	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
+	the 15 strongest (word, probability) pairs is returned as well as the
+	overall probability (this is how to find out why a message scored as it
+	did).
+	
+2002-08-28 13:45  montanaro
+
+	* GBayes.py (1.15):
+
+	ehh - it actually didn't work all that well.  the spurious report that it
+	did well was pilot error.  besides, tim's report suggests that a simple
+	str.split() may be the best tokenizer anyway.
+	
+2002-08-28 10:45  montanaro
+
+	* setup.py (1.1):
+
+	trivial little setup.py file - i don't expect most people will be interested
+	in this, but it makes it a tad simpler to work with now that there are two
+	files
+	
+2002-08-28 10:43  montanaro
+
+	* GBayes.py (1.14):
+
+	add simple trigram tokenizer - this seems to yield the best results I've
+	seen so far (but has not been extensively tested)
+	
+2002-08-28 08:10  tim_one
+
+	* Tester.py (1.1):
+
+	A start at a testing class.  There isn't a lot here, but it automates
+	much of the tedium, and as the doctest shows it can already do
+	useful things, like remembering which inputs were misclassified.
+	
+2002-08-27 06:45  tim_one
+
+	* mboxcount.py (1.5):
+
+	Updated stats to what Barry and I both get now.  Fiddled output.
+	
+2002-08-27 05:09  bwarsaw
+
+	* split.py (1.5), splitn.py (1.2):
+
+	_factory(): Return the empty string instead of None in the except
+	clauses, so that for-loops won't break prematurely.  mailbox.py's base
+	class defines an __iter__() that raises a StopIteration on None
+	return.
+	
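To see why None was a bad sentinel: the old mailbox.py iterator treated a None from the factory as end-of-input, so a single unparseable message silently truncated the whole run. A minimal stand-in mimicking that protocol (a hypothetical model, not the Python 2.2 code):

```python
class OldStyleMailbox:
    """Mimics old mailbox.py: iteration ends when the factory returns None."""
    def __init__(self, raw_msgs, factory):
        self._raw, self._factory = raw_msgs, factory
    def __iter__(self):
        for raw in self._raw:
            msg = self._factory(raw)
            if msg is None:  # old mailbox.py stopped iterating here
                return
            yield msg

raw = ["good-1", "UNPARSEABLE", "good-2"]
none_factory = lambda r: None if r == "UNPARSEABLE" else r
empty_factory = lambda r: "" if r == "UNPARSEABLE" else r

print(list(OldStyleMailbox(raw, none_factory)))   # ['good-1'] -- truncated!
print(list(OldStyleMailbox(raw, empty_factory)))  # ['good-1', '', 'good-2']
```

Returning the empty string keeps the for-loop alive past the bad message, which is exactly the checkin's fix.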
+2002-08-27 04:55  tim_one
+
+	* GBayes.py (1.13), mboxcount.py (1.4):
+
+	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
+	
+2002-08-27 04:40  bwarsaw
+
+	* mboxcount.py (1.3):
+
+	Some stats after splitting b/w good messages and unparseable messages
+	
+2002-08-27 04:23  bwarsaw
+
+	* mboxcount.py (1.2):
+
+	_factory(): Use a marker object to distinguish between good messages and
+	unparseable messages.  For some reason, returning None from the except
+	clause in _factory() caused Python 2.2.1 to exit early out of the for
+	loop.
+	
+	main(): Print statistics about both the number of good messages and
+	the number of unparseable messages.
+	
+2002-08-27 03:06  tim_one
+
+	* cleanarch (1.2):
+
+	"From " is a header more than a separator, so don't bump the msg count
+	at the end.
+	
+2002-08-24 01:42  tim_one
+
+	* GBayes.py (1.12), classifier.py (1.1):
+
+	Moved all the interesting code that was in the *original* GBayes.py into
+	a new classifier.py.  It was designed to have a very clean interface,
+	and there's no reason to keep slamming everything into one file.  The
+	ever-growing tokenizer stuff should probably also be split out, leaving
+	GBayes.py a pure driver.
+	
+	Also repaired _test() (Skip's checkin left it without a binding for
+	the tokenize function).
+	
+2002-08-24 01:17  tim_one
+
+	* splitn.py (1.1):
+
+	Utility to split an mbox into N random pieces in one gulp.  This gives
+	a convenient way to break a giant corpus into multiple files that can
+	then be used independently across multiple training and testing runs.
+	It's important to do multiple runs on different random samples to avoid
+	drawing conclusions based on accidents in a single random training corpus;
+	if the algorithm is robust, it should have similar performance across
+	all runs.
+	
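The splitting step described above amounts to dealing each message into one of N piles at random; a rough sketch on plain strings (the real splitn.py works on mbox files):

```python
import random

# Deal a corpus into n random pieces in one pass, so the pieces can be
# used independently across multiple training/testing runs.  Messages
# here are plain strings standing in for mbox messages.
def splitn(messages, n, seed=None):
    rng = random.Random(seed)
    pieces = [[] for _ in range(n)]
    for msg in messages:
        rng.choice(pieces).append(msg)
    return pieces

corpus = ["msg%d" % i for i in range(100)]
pieces = splitn(corpus, 4, seed=1)
# every message lands in exactly one piece
```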
+2002-08-24 00:25  montanaro
+
+	* GBayes.py (1.11):
+
+	Allow command line specification of tokenize functions
+	    run w/ -t flag to override default tokenize function
+	    run w/ -H flag to see list of tokenize functions
+	
+	When adding a new tokenizer, make docstring a short description and add a
+	key/value pair to the tokenizers dict.  The key is what the user specifies.
+	The value is a tokenize function.
+	
+	Added two new tokenizers - tokenize_wordpairs_foldcase and
+	tokenize_words_and_pairs.  It's not obvious that either is better than any
+	of the preexisting functions.
+	
+	Should probably add info to the pickle which indicates the tokenizing
+	function used to build it.  This could then be the default for spam
+	detection runs.
+	
+	Next step is to drive this with spam/non-spam corpora, selecting each of the
+	various tokenizer functions, and presenting the results in tabular form.
+	
+2002-08-23 13:10  tim_one
+
+	* GBayes.py (1.10):
+
+	spamprob():  Commented some subtleties.
+	
+	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
+	is that you can't delete entries from a dict that's being crawled over
+	by .iteritems(), which is why I (I suddenly recall) materialized a
+	list of words to be deleted the first time I wrote this.  It's a lot
+	better to materialize a list of to-be-deleted words than to materialize
+	the entire database in a dict.items() list.
+	
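The subtlety behind that clearjunk() fix is a general Python constraint: you cannot delete keys from a dict while iterating over it. A small sketch of the materialize-then-delete pattern; the `wordinfo` contents and the count threshold are illustrative placeholders:

```python
# You can't delete entries from a dict you're iterating over, so
# materialize the (small) list of doomed keys first, then delete them
# in a second pass -- much cheaper than materializing the whole
# database via dict.items().
wordinfo = {"cheap": 3, "viagra": 1, "python": 250, "meeting": 40}

# First pass: collect the words to delete without touching the dict.
to_delete = [word for word, count in wordinfo.items() if count < 5]

# Second pass: safe to mutate now.
for word in to_delete:
    del wordinfo[word]

print(sorted(wordinfo))  # -> ['meeting', 'python']
```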
+2002-08-23 12:36  tim_one
+
+	* mboxcount.py (1.1):
+
+	Utility to count and display the # of msgs in (one or more) Unix mboxes.
+	
+2002-08-23 12:11  tim_one
+
+	* split.py (1.4):
+
+	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
+	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
+	the latter invents an extra newline.
+	
+2002-08-22 07:01  tim_one
+
+	* GBayes.py (1.9):
+
+	Renamed "modtime" to "atime", to better reflect its meaning, and added a
+	comment block to explain that better.
+	
+2002-08-21 08:07  bwarsaw
+
+	* split.py (1.3):
+
+	Guido suggests a different order for the positional args.
+	
+2002-08-21 07:37  bwarsaw
+
+	* split.py (1.2):
+
+	Get rid of the -1 and -2 arguments and make them positional.
+	
+2002-08-21 07:18  bwarsaw
+
+	* split.py (1.1):
+
+	A simple mailbox splitter
+	
+2002-08-21 06:42  tim_one
+
+	* GBayes.py (1.8):
+
+	Added a bunch of simple tokenizers.  The originals are renamed to
+	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
+	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
+	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
+	any of these to be the last word.  When Barry has the test corpus
+	set up it should be easy to let the data tell us which "pure" strategy
+	works best.  Straight character n-grams are very appealing because
+	they're the simplest and most language-neutral; I didn't have any luck
+	with them over the weekend, but the size of my training data was
+	trivial.
+	
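The character n-gram tokenizers named above slide a fixed-width window over the text; a rough reconstruction (the names mirror the entry, but the bodies are illustrative, not the originals from GBayes.py):

```python
# Character n-grams: slide a window of width n over the text.
def tokenize_ngram(text, n):
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

# The *_foldcase_wscollapse variants lowercase the text and collapse
# whitespace runs first, so "Buy  NOW" and "buy now" tokenize alike.
def tokenize_5gram_foldcase_wscollapse(text):
    collapsed = " ".join(text.lower().split())
    return tokenize_ngram(collapsed, 5)

print(list(tokenize_5gram_foldcase_wscollapse("Buy  NOW")))
# -> ['buy n', 'uy no', 'y now']
```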
+2002-08-21 05:08  bwarsaw
+
+	* cleanarch (1.1):
+
+	An archive cleaner, adapted from the Mailman 2.1b3 version, but
+	de-Mailman-ified.
+	
+2002-08-21 04:44  gvanrossum
+
+	* GBayes.py (1.7):
+
+	Indent repair in clearjunk().
+	
+2002-08-21 04:22  gvanrossum
+
+	* GBayes.py (1.6):
+
+	Some minor cleanup:
+	
+	- Move the identifying comment to the top, clarify it a bit, and add
+	  author info.
+	
+	- There's no reason for _time and _heapreplace to be hidden names;
+	  change these back to time and heapreplace.
+	
+	- Rename main1() to _test() and main2() to main(); when main() sees
+	  there are no options or arguments, it runs _test().
+	
+	- Get rid of a list comprehension from clearjunk().
+	
+	- Put wordinfo.get as a local variable in _add_msg().
+	
+2002-08-20 15:16  tim_one
+
+	* GBayes.py (1.5):
+
+	Neutral typo repairs, except that clearjunk() has a better chance of
+	not blowing up immediately now.
+	
+2002-08-20 13:49  montanaro
+
+	* GBayes.py (1.4):
+
+	help make it more easily executable... ;-)
+	
+2002-08-20 09:32  bwarsaw
+
+	* GBayes.py (1.3):
+
+	Lots of hacks great and small to the main() program, but I didn't
+	touch the guts of the algorithm.
+	
+	Added a module docstring/usage message.
+	
+	Added a bunch of switches to train the system on an mbox of known good
+	and known spam messages (using PortableUnixMailbox only for now).
+	Uses the email package but does no decoding of message bodies.  Also,
+	allows you to specify a file for pickling the training data, and for
+	setting a threshold, above which messages get an X-Bayes-Score
+	header.  Also outputs messages (marked and unmarked) to an output file
+	for retraining.
+	
+	Print some statistics at the end.
+	
+2002-08-20 05:43  tim_one
+
+	* GBayes.py (1.2):
+
+	Turned off debugging vrbl mistakenly checked in at True.
+	
+	unlearn():  Gave this an update_probabilities=True default arg, for
+	symmetry with learn().
+	
+2002-08-20 03:33  tim_one
+
+	* GBayes.py (1.1):
+
+	An implementation of Paul Graham's Bayes-like spam classifier.
+
+
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

From montanaro at users.sourceforge.net  Wed Jul 25 15:49:42 2007
From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net)
Date: Wed, 25 Jul 2007 06:49:42 -0700
Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3155] trunk/website
Message-ID: 

Revision: 3155
http://spambayes.svn.sourceforge.net/spambayes/?rev=3155&view=rev
Author: montanaro
Date: 2007-07-25 06:49:42 -0700 (Wed, 25 Jul 2007)

Log Message:
-----------
cvs -> svn

Modified Paths:
--------------
    trunk/website/applications.ht
    trunk/website/background.ht
    trunk/website/contact.ht
    trunk/website/developer.ht
    trunk/website/docs.ht
    trunk/website/download.ht
    trunk/website/unix.ht
    trunk/website/windows.ht

Added Paths:
-----------
    trunk/website/prefschangelog.ht

Removed Paths:
-------------
    trunk/website/presfchangelog.ht

Modified: trunk/website/applications.ht
===================================================================
--- trunk/website/applications.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/applications.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -25,13 +25,13 @@
  • Python's win32com extensions (win32all-149 or later - currently ActivePython is not suitable) -For more on this, see the README.txt or -about.html file in the spambayes CVS repository's Outlook2000 directory. -

    Alternatively, you can use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

    +For more on this, see the README.txt or +about.html file in the spambayes Subversion repository's Outlook2000 directory. +

    Alternatively, you can use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

    sb_filter.py

    sb_filter is a command line tool for marking mail as ham or spam. The readme - + includes a guide to integrating it with your mailer (Unix-only instructions at the moment - additions welcome!). Currently it focuses on running sb_filter via procmail.

    @@ -44,11 +44,11 @@

    Availability

    Download the source archive.

    -

    Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

    +

    Alternatively, use Subversion to get the code. Go to the Subversion page on the project's SourceForge site for more.

    sb_server.py

    sb_server provides a POP3 proxy which sits between your mail client and your real POP3 server and marks -mail as ham or spam as it passes through. See the README for more. +mail as ham or spam as it passes through. See the README for more. sb_server can also be used with the sb_upload.py script as a procmail (or similar) solution.

    Requirements

    @@ -63,12 +63,12 @@ it.

    Alternatively, to run from source, download the source archive.

    -

    Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

    +

    Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

    sb_imapfilter.py

    imapfilter connects to your imap server and marks mail as ham or spam, moving it to appropriate folders as it arrives. -See the README for more.

    +See the README for more.

    Requirements

      @@ -78,7 +78,7 @@

      Availability

      Download the source archive.

      -

      Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

      +

      Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

      sb_mboxtrain.py

      This application allows you to train incrementally on ham and spam @@ -94,7 +94,7 @@

      Availability

      Download the source archive.

      -

      Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

      +

      Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

      sb_notesfilter.py

      This application allows you to filter Lotus Notes folders, rather like @@ -112,4 +112,4 @@

      Availability

      Download the source archive.

      -

      Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

      +

      Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

      Modified: trunk/website/background.ht =================================================================== --- trunk/website/background.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/background.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -250,10 +250,10 @@
    • The discussions then moved to the spambayes mailing list.
    -

    CVS commit messages

    +

    CVS Commit Messages

    Tim Peters has whacked a whole lot of useful information into CVS commit messages. As the project was moved from an obscure corner of the -Python CVS tree, there's actually two sources of CVS commits.

    +Python CVS tree, there are actually two sources of CVS commits.

    • The older CVS repository via view CVS, or the entire changelog. Modified: trunk/website/contact.ht =================================================================== --- trunk/website/contact.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/contact.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -34,7 +34,7 @@ This list is also unmoderated.
    • -CVS commit messages go to the list spambayes-checkins. +Subversion commit messages go to the list spambayes-checkins. You shouldn't send email to this list; a program running at SourceForge automatically creates and sends emails to this list as a result of code checkins. Modified: trunk/website/developer.ht =================================================================== --- trunk/website/developer.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/developer.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -5,10 +5,9 @@

      Developer info

      So you want to get involved?

      Running the code

      -

      This project works with Python 2.2, Python 2.3, Python 2.4, -or on the bleeding edge of python code, -available from CVS on -sourceforge. It will not work on python 2.1.x or earlier, nor is it ever +

      This project works with Python 2.2 or later, +available from Subversion on +SourceForge. It will not work on python 2.1.x or earlier, nor is it ever likely to do so.

      If you're running Python 2.2 or 2.2.1, you'll need to separately fetch the latest email package. You can get @@ -17,7 +16,7 @@ (you'll need version 2.4.3 or later - version 3.0 or later is recommended).

      -

      The SpamBayes code itself is also available via CVS, or from the download page. +

The SpamBayes code itself is also available via Subversion, or from the download page.

      I just want to make suggestions

      Modified: trunk/website/docs.ht =================================================================== --- trunk/website/docs.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/docs.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -10,13 +10,13 @@ and generally help each other out. It would be great to see documentation improvements, hints and tips, scripts and recipes, and anything else (related to SpamBayes) that takes your fancy added here.
    • -
    • Instructions on installing Spambayes and integrating it into your mail system.
    • -
    • The Outlook plugin includes an "About" File, and a +
    • Instructions on installing Spambayes and integrating it into your mail system.
    • +
• The Outlook plugin includes an "About" File, and a "Troubleshooting Guide" that can be accessed via the toolbar. (Note that the online documentation is always for the latest source version, and so might not correspond exactly with the version you are using. Always start with the documentation that came with the version you installed.)
    • -
    • The README-DEVEL.txt information that should be of use to people planning on developing code based on SpamBayes.
    • -
    • The TESTING.txt file -- Clues about the practice of statistical testing, adapted from Tim +
    • The README-DEVEL.txt information that should be of use to people planning on developing code based on SpamBayes.
    • +
    • The TESTING.txt file -- Clues about the practice of statistical testing, adapted from Tim comments on python-dev.
    • There are also a vast number of clues and notes scattered as block comments through the code.
    Modified: trunk/website/download.ht =================================================================== --- trunk/website/download.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/download.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -38,7 +38,7 @@

    Prerequisites:

      Either: -
    • Python 2.2.2 or above, or a CVS build of python, or +
    • Python 2.2.2 or above, or a Subversion build of python, or
    • Python 2.2, 2.2.1, plus the latest email package.

    Once you've downloaded and unpacked the source archive, do the regular setup.py build; setup.py install dance, then: @@ -106,10 +106,8 @@

    These instructions are geared to GnuPG and command-line weenies. Suggestions are welcome for other OpenPGP applications.

    -

    CVS Access

    -

    The code is currently available from sourceforge's CVS server - -see here for -more details. Note that due to capacity problems with Sourceforge, -the public CVS servers often run up to 48 hours behind the real CVS -servers. This is something that SF are working on improving. +

    Subversion Access

    + +

The code is currently available from the SourceForge Subversion server.

    Copied: trunk/website/prefschangelog.ht (from rev 3151, trunk/website/presfchangelog.ht) =================================================================== --- trunk/website/prefschangelog.ht (rev 0) +++ trunk/website/prefschangelog.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -0,0 +1,905 @@ +

    Pre-Sourceforge ChangeLog

    +

    This changelog lists the commits on the spambayes projects before the + separate project was set up. See also the +old CVS repository, but don't forget that it's now out of date, and you probably want to be looking at the current CVS. +

    +
    +2002-09-06 02:27  tim_one
    +
    +	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
    +	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
    +	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
    +
    +	This code has been moved to a new SourceForge project (spambayes).
    +	
    +2002-09-05 15:37  tim_one
    +
    +	* classifier.py (1.11):
    +
    +	Added note about MINCOUNT oddities.
    +	
    +2002-09-05 14:32  tim_one
    +
    +	* timtest.py (1.17):
    +
    +	Added note about word length.
    +	
    +2002-09-05 13:48  tim_one
    +
    +	* timtest.py (1.16):
    +
    +	tokenize_word():  Oops!  This was awfully permissive in what it
    +	took as being "an email address".  Tightened that, and also
    +	avoided 5-gram'ing of email addresses w/ high-bit characters.
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.050  lost
    +	    0.075  0.075  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   0 times
    +	tied 19 times
    +	lost  1 times
    +	
    +	total unique fp went from 7 to 8
    +	
    +	false negative percentages
    +	    0.764  0.691  won
    +	    0.691  0.655  won
    +	    0.981  0.945  won
    +	    1.309  1.309  tied
    +	    1.418  1.164  won
    +	    0.873  0.800  won
    +	    0.800  0.763  won
    +	    1.163  1.163  tied
    +	    1.491  1.345  won
    +	    1.200  1.127  won
    +	    1.381  1.345  won
    +	    1.454  1.490  lost
    +	    1.164  0.909  won
    +	    0.655  0.582  won
    +	    0.655  0.691  lost
    +	    1.163  1.163  tied
    +	    1.200  1.018  won
    +	    0.982  0.873  won
    +	    0.982  0.909  won
    +	    1.236  1.127  won
    +	
    +	won  15 times
    +	tied  3 times
    +	lost  2 times
    +	
    +	total unique fn went from 260 to 249
    +	
+	Note:  Each of the two losses there consists of just 1 msg difference.
    +	The wins are bigger as well as being more common, and 260-249 = 11
    +	spams no longer sneak by any run (which is more than 4% of the 260
    +	spams that used to sneak thru!).
    +	
    +2002-09-05 11:51  tim_one
    +
    +	* classifier.py (1.10):
    +
    +	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
    +	really matter; leaving it alone.
    +	
    +2002-09-05 10:02  tim_one
    +
    +	* classifier.py (1.9):
    +
    +	A now-rare pure win, changing spamprob() to work harder to find more
    +	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
    +	column, after in the right:
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.075  0.075  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.075  0.025  won
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   1 times
    +	tied 19 times
    +	lost  0 times
    +	
    +	total unique fp went from 9 to 7
    +	
    +	false negative percentages
    +	    0.909  0.764  won
    +	    0.800  0.691  won
    +	    1.091  0.981  won
    +	    1.381  1.309  won
    +	    1.491  1.418  won
    +	    1.055  0.873  won
    +	    0.945  0.800  won
    +	    1.236  1.163  won
    +	    1.564  1.491  won
    +	    1.200  1.200  tied
    +	    1.454  1.381  won
    +	    1.599  1.454  won
    +	    1.236  1.164  won
    +	    0.800  0.655  won
    +	    0.836  0.655  won
    +	    1.236  1.163  won
    +	    1.236  1.200  won
    +	    1.055  0.982  won
    +	    1.127  0.982  won
    +	    1.381  1.236  won
    +	
    +	won  19 times
    +	tied  1 times
    +	lost  0 times
    +	
    +	total unique fn went from 284 to 260
    +	
    +2002-09-04 11:21  tim_one
    +
    +	* timtest.py (1.15):
    +
    +	Augmented the spam callback to display spams with low probability.
    +	
    +2002-09-04 09:53  tim_one
    +
    +	* Tester.py (1.3), timtest.py (1.14):
    +
    +	Added support for simple histograms of the probability distributions for
    +	ham and spam.
    +	
    +2002-09-03 12:13  tim_one
    +
    +	* timtest.py (1.13):
    +
    +	A reluctant "on principle" change no matter what it does to the stats:
    +	take a stab at removing HTML decorations from plain text msgs.  See
    +	comments for why it's *only* in plain text msgs.  This puts an end to
    +	false positives due to text msgs talking *about* HTML.  Surprisingly, it
    +	also gets rid of some false negatives.  Not surprisingly, it introduced
    +	another small class of false positives due to the dumbass regexp trick
    +	used to approximate HTML tag removal removing pieces of text that had
    +	nothing to do with HTML tags (e.g., this happened in the middle of a
+	uuencoded .py file in such a way that it just happened to leave behind
    +	a string that "looked like" a spam phrase; but before this it looked
    +	like a pile of "too long" lines that didn't generate any tokens --
    +	it's a nonsense outcome either way).
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.025  lost
    +	    0.075  0.075  tied
    +	    0.050  0.025  won
    +	    0.025  0.025  tied
    +	    0.000  0.025  lost
    +	    0.050  0.075  lost
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   1 times
    +	tied 16 times
    +	lost  3 times
    +	
    +	total unique fp went from 8 to 9
    +	
    +	false negative percentages
    +	    0.945  0.909  won
    +	    0.836  0.800  won
    +	    1.200  1.091  won
    +	    1.418  1.381  won
    +	    1.455  1.491  lost
    +	    1.091  1.055  won
    +	    1.091  0.945  won
    +	    1.236  1.236  tied
    +	    1.564  1.564  tied
    +	    1.236  1.200  won
    +	    1.563  1.454  won
    +	    1.563  1.599  lost
    +	    1.236  1.236  tied
    +	    0.836  0.800  won
    +	    0.873  0.836  won
    +	    1.236  1.236  tied
    +	    1.273  1.236  won
    +	    1.018  1.055  lost
    +	    1.091  1.127  lost
    +	    1.490  1.381  won
    +	
    +	won  12 times
    +	tied  4 times
    +	lost  4 times
    +	
    +	total unique fn went from 292 to 284
    +	
    +2002-09-03 06:57  tim_one
    +
    +	* classifier.py (1.8):
    +
    +	Added a new xspamprob() method, which computes the combined probability
    +	"correctly", and a long comment block explaining what happened when I
    +	tried it.  There's something worth pursuing here (it greatly improves
    +	the false negative rate), but this change alone pushes too many marginal
+	hams into the spam camp.
    +	
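The combining step being tuned in these entries is Graham's rule for merging per-word probabilities into one message score; a minimal sketch of just that step (the real spamprob() also clamps counts and selects only the strongest MAX_DISCRIMINATORS clues first):

```python
# Graham-style combining: score = prod(p) / (prod(p) + prod(1 - p))
# over the per-word spam probabilities of the selected clues.
def combine(probs):
    prod_p = 1.0
    prod_not_p = 1.0
    for p in probs:
        prod_p *= p
        prod_not_p *= 1.0 - p
    return prod_p / (prod_p + prod_not_p)

print(combine([0.5, 0.5, 0.5]))    # no evidence either way -> 0.5
print(combine([0.99, 0.99, 0.2]))  # two strong spam clues dominate
```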
    +2002-09-03 05:23  tim_one
    +
    +	* timtest.py (1.12):
    +
    +	Made "skip:" tokens shorter.
    +	
    +	Added a surprising treatment of Organization headers, with a tiny f-n
    +	benefit for a tiny cost.  No change in f-p stats.
    +	
    +	false negative percentages
    +	    1.091  0.945  won
    +	    0.945  0.836  won
    +	    1.236  1.200  won
    +	    1.454  1.418  won
    +	    1.491  1.455  won
    +	    1.091  1.091  tied
    +	    1.127  1.091  won
    +	    1.236  1.236  tied
    +	    1.636  1.564  won
    +	    1.345  1.236  won
    +	    1.672  1.563  won
    +	    1.599  1.563  won
    +	    1.236  1.236  tied
    +	    0.836  0.836  tied
    +	    1.018  0.873  won
    +	    1.236  1.236  tied
    +	    1.273  1.273  tied
    +	    1.055  1.018  won
    +	    1.091  1.091  tied
    +	    1.527  1.490  won
    +	
    +	won  13 times
    +	tied  7 times
    +	lost  0 times
    +	
    +	total unique fn went from 302 to 292
    +	
    +2002-09-03 02:18  tim_one
    +
    +	* timtest.py (1.11):
    +
    +	tokenize_word():  dropped the prefix from the signature; it's faster
    +	to let the caller do it, and this also repaired a bug in one place it
    +	was being used (well, a *conceptual* bug anyway, in that the code didn't
    +	do what I intended there).  This changes the stats in an insignificant
    +	way.  The f-p stats didn't change.  The f-n stats shifted by one message
    +	in a few cases:
    +	
    +	false negative percentages
    +	    1.091  1.091  tied
    +	    0.945  0.945  tied
    +	    1.200  1.236  lost
    +	    1.454  1.454  tied
    +	    1.491  1.491  tied
    +	    1.091  1.091  tied
    +	    1.091  1.127  lost
    +	    1.236  1.236  tied
    +	    1.636  1.636  tied
    +	    1.382  1.345  won
    +	    1.636  1.672  lost
    +	    1.599  1.599  tied
    +	    1.236  1.236  tied
    +	    0.836  0.836  tied
    +	    1.018  1.018  tied
    +	    1.236  1.236  tied
    +	    1.273  1.273  tied
    +	    1.055  1.055  tied
    +	    1.091  1.091  tied
    +	    1.527  1.527  tied
    +	
    +	won   1 times
    +	tied 16 times
    +	lost  3 times
    +	
    +	total unique unchanged
    +	
    +2002-09-02 19:30  tim_one
    +
    +	* timtest.py (1.10):
    +
    +	Don't ask me why this helps -- I don't really know!  When skipping "long
    +	words", generating a token with a brief hint about what and how much got
    +	skipped makes a definite improvement in the f-n rate, and doesn't affect
    +	the f-p rate at all.  Since experiment said it's a winner, I'm checking
+	it in.  Before (left column) and after (right column):
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.075  0.075  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   0 times
    +	tied 20 times
    +	lost  0 times
    +	
    +	total unique fp went from 8 to 8
    +	
    +	false negative percentages
    +	    1.236  1.091  won
    +	    1.164  0.945  won
    +	    1.454  1.200  won
    +	    1.599  1.454  won
    +	    1.527  1.491  won
    +	    1.236  1.091  won
    +	    1.163  1.091  won
    +	    1.309  1.236  won
    +	    1.891  1.636  won
    +	    1.418  1.382  won
    +	    1.745  1.636  won
    +	    1.708  1.599  won
    +	    1.491  1.236  won
    +	    0.836  0.836  tied
    +	    1.091  1.018  won
    +	    1.309  1.236  won
    +	    1.491  1.273  won
    +	    1.127  1.055  won
    +	    1.309  1.091  won
    +	    1.636  1.527  won
    +	
    +	won  19 times
    +	tied  1 times
    +	lost  0 times
    +	
    +	total unique fn went from 336 to 302
    +	
    +2002-09-02 17:55  tim_one
    +
    +	* timtest.py (1.9):
    +
    +	Some comment changes and nesting reduction.
    +	
    +2002-09-02 11:18  tim_one
    +
    +	* timtest.py (1.8):
    +
    +	Fixed some out-of-date comments.
    +	
    +	Made URL clumping lumpier:  now distinguishes among just "first field",
    +	"second field", and "everything else".
    +	
    +	Changed tag names for email address fields (semantically neutral).
    +	
    +	Added "From:" line tagging.
    +	
    +	These add up to an almost pure win.  Before-and-after f-n rates across 20
    +	runs:
    +	
    +	1.418   1.236
    +	1.309   1.164
    +	1.636   1.454
    +	1.854   1.599
    +	1.745   1.527
    +	1.418   1.236
    +	1.381   1.163
    +	1.418   1.309
    +	2.109   1.891
    +	1.491   1.418
    +	1.854   1.745
    +	1.890   1.708
    +	1.818   1.491
    +	1.055   0.836
    +	1.164   1.091
    +	1.599   1.309
    +	1.600   1.491
    +	1.127   1.127
    +	1.164   1.309
    +	1.781   1.636
    +	
    +	It only increased in one run.  The variance appears to have been reduced
    +	too (I didn't bother to compute that, though).
    +	
    +	Before-and-after f-p rates across 20 runs:
    +	
    +	0.000   0.000
    +	0.000   0.000
    +	0.075   0.050
    +	0.000   0.000
    +	0.025   0.025
    +	0.050   0.025
    +	0.075   0.050
    +	0.025   0.025
    +	0.025   0.025
    +	0.025   0.000
    +	0.100   0.075
    +	0.050   0.050
    +	0.025   0.025
    +	0.000   0.000
    +	0.075   0.050
    +	0.025   0.025
    +	0.025   0.025
    +	0.000   0.000
    +	0.075   0.025
    +	0.100   0.050
    +	
    +	Note that 0.025% is a single message; it's really impossible to *measure*
    +	an improvement in the f-p rate anymore with 4000-msg ham sets.
    +	
    +	Across all 20 runs,
    +	
    +	the total # of unique f-n fell from 353 to 336
    +	the total # of unique f-p fell from 13 to 8
    +	
    +2002-09-02 10:06  tim_one
    +
    +	* timtest.py (1.7):
    +
    +	A number of changes.  The most significant is paying attention to the
    +	Subject line (I was wrong before when I said my c.l.py ham corpus was
    +	unusable for this due to Mailman-injected decorations).  In all, across
    +	my 20 test runs,
    +	
    +	the total # of unique false positives fell from 23 to 13
    +	the total # of unique false negatives rose from 337 to 353
    +	
    +	Neither result is statistically significant, although I bet the first
    +	one would be if I pissed away a few days trying to come up with a more
+	realistic model for what "stat. sig." means here.
    +	
    +2002-09-01 17:22  tim_one
    +
    +	* classifier.py (1.7):
    +
    +	Added a comment block about HAMBIAS experiments.  There's no clearer
    +	example of trading off precision against recall, and you can favor either
    +	at the expense of the other to any degree you like by fiddling this knob.
    +	
    +2002-09-01 14:42  tim_one
    +
    +	* timtest.py (1.6):
    +
    +	Long new comment block summarizing all my experiments with character
    +	n-grams.  Bottom line is that they have nothing going for them, and a
    +	lot going against them, under Graham's scheme.  I believe there may
    +	still be a place for them in *part* of a word-based tokenizer, though.
    +	
    +2002-09-01 10:05  tim_one
    +
    +	* classifier.py (1.6):
    +
    +	spamprob():  Never count unique words more than once anymore.  Counting
    +	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
    +	that's now a small drag instead.
    +	
    +2002-09-01 07:33  tim_one
    +
    +	* rebal.py (1.3), timtest.py (1.5):
    +
    +	Folding case is here to stay.  Read the new comments for why.  This may
    +	be a bad idea for other languages, though.
    +	
    +	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
    +	http is spam-neutral, but https is a strong spam indicator.  That
    +	surprised me.
    +	
    +2002-09-01 06:47  tim_one
    +
    +	* classifier.py (1.5):
    +
    +	spamprob():  Removed useless check that wordstream isn't empty.  For one
    +	thing, it didn't work, since wordstream is often an iterator.  Even if
    +	it did work, it isn't needed -- the probability of an empty wordstream
    +	gets computed as 0.5 based on the total absence of evidence.
    +	
    +2002-09-01 05:37  tim_one
    +
    +	* timtest.py (1.4):
    +
    +	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
    +	details).
    +	
    +2002-09-01 05:09  tim_one
    +
    +	* rebal.py (1.2):
    +
    +	Aha!  Staring at the checkin msg revealed a logic bug that explains why
    +	my ham directories sometimes remained unbalanced after running this --
    +	if the randomly selected reservoir msg turned out to be spam, it wasn't
    +	pushing the too-small directory on the stack again.
    +	
    +2002-09-01 04:56  tim_one
    +
    +	* timtest.py (1.3):
    +
    +	textparts():  This was failing to weed out redundant HTML in cases like
    +	this:
    +	
    +	    multipart/alternative
    +	        text/plain
    +	        multipart/related
    +	            text/html
    +	
    +	The tokenizer here also transforms everything to lowercase, but that's
    +	an accident due simply to the fact that I'm testing that now.  Can't say for
    +	sure until the test runs end, but so far it looks like a bad idea for
    +	the false positive rate.
    +	
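The part-weeding idea above can be sketched with the stdlib email package: drop a text/html part whenever a text/plain sibling exists anywhere under the same multipart/alternative, however deeply the HTML is nested.  This is an illustrative reconstruction of the idea, not the checked-in textparts():

```python
import email

def textparts(msg):
    """Return the text/* parts of msg, minus any text/html part that
    merely duplicates a text/plain sibling inside multipart/alternative."""
    text = set()
    redundant = set()
    for part in msg.walk():
        if part.get_content_type() == "multipart/alternative":
            # Within an alternative group, HTML is redundant if a
            # plain-text rendering of the same content exists below it.
            subparts = list(part.walk())
            if any(p.get_content_type() == "text/plain" for p in subparts):
                redundant.update(p for p in subparts
                                 if p.get_content_type() == "text/html")
        elif part.get_content_maintype() == "text":
            text.add(part)
    return text - redundant

# The problem case from the entry: text/html hiding under multipart/related.
msg = email.message_from_string(
    "Content-Type: multipart/alternative; boundary=b\n\n"
    "--b\nContent-Type: text/plain\n\nhello\n"
    "--b\nContent-Type: multipart/related; boundary=c\n\n"
    "--c\nContent-Type: text/html\n\n<p>hello</p>\n--c--\n"
    "--b--\n")
kept = {p.get_content_type() for p in textparts(msg)}
```

Here the nested text/html is correctly weeded out, leaving only the text/plain part.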
    +2002-09-01 04:52  tim_one
    +
    +	* rebal.py (1.1):
    +
    +	A little script I use to rebalance the ham corpora after deleting what
    +	turns out to be spam.  I have another Ham/reservoir directory with a
    +	few thousand randomly selected msgs from the presumably-good archive.
    +	These aren't used in scoring or training.  This script marches over all
    +	the ham corpora directories that are used, and if any have gotten too
    +	big (this never happens anymore) deletes msgs at random from them, and
    +	if any have gotten too small plugs the holes by moving in random
    +	msgs from the reservoir.
    +	
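The rebalancing procedure described above amounts to: trim oversized directories at random, and plug holes in undersized ones from the reservoir.  A toy in-memory sketch of that job (directory names and the target size are illustrative, and this is not the original rebal.py):

```python
import random

def rebalance(corpora, reservoir, target, rng=random):
    """Bring every corpus to `target` messages.  `corpora` maps a
    directory name to its list of messages; `reservoir` is the spare
    pool of presumed-ham messages."""
    for name, msgs in corpora.items():
        while len(msgs) > target:               # too big: delete at random
            msgs.pop(rng.randrange(len(msgs)))
        while len(msgs) < target and reservoir:  # too small: refill
            msgs.append(reservoir.pop(rng.randrange(len(reservoir))))

corpora = {"Set1": list(range(7)), "Set2": list(range(3))}
reservoir = list(range(100, 110))
rebalance(corpora, reservoir, target=5)
sizes = {k: len(v) for k, v in corpora.items()}
```

After running, both sets hold exactly 5 messages and the reservoir has shrunk by the number of holes plugged.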
    +2002-09-01 03:25  tim_one
    +
    +	* classifier.py (1.4), timtest.py (1.2):
    +
    +	Boost UNKNOWN_SPAMPROB.
    +	# The spam probability assigned to words never seen before.  Graham used
    +	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
    +	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
    +	# negative rate by over 1/3.  The f-p rate increased, but there were so few
    +	# f-ps that the increase wasn't statistically significant.  It also caught
    +	# 13 more spams erroneously classified as ham.  By eyeball (and common
    +	# sense), this has most effect on very short messages, where there
    +	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
    +	# completely ignored by spamprob(), in favor of *any* word with *any* prob
    +	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
    +	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
    +	# on the face of it.
    +	
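Why a 0.5 word is "completely ignored": in Graham-style combining, every clue contributes p to one running product and 1-p to the other, so p = 0.5 scales both products equally and cancels out of the ratio.  A minimal sketch of that combining scheme (not the project's exact spamprob(); only the 15-strongest-clues limit is taken from the text above):

```python
UNKNOWN_SPAMPROB = 0.5   # prob assigned to never-before-seen words
MAX_DISCRIMINATORS = 15  # keep only the strongest clues

def spamprob(wordprobs):
    """Combine per-word spam probabilities Graham-style: rank clues by
    distance from 0.5, then fold the top ones into p and 1-p products."""
    clues = sorted(wordprobs, key=lambda p: abs(p - 0.5), reverse=True)
    prod = inv = 1.0
    for p in clues[:MAX_DISCRIMINATORS]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

# An unknown word at exactly 0.5 multiplies both products by 0.5,
# so the overall score is unchanged -- it is effectively ignored.
with_unknown = spamprob([0.99, 0.2, UNKNOWN_SPAMPROB])
without = spamprob([0.99, 0.2])
```

At 0.2 the unknown word would instead drag the score toward ham, which is the bias the boost removes.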
    +2002-08-31 16:50  tim_one
    +
    +	* timtest.py (1.1):
    +
    +	This is a driver I've been using for test runs.  It's specific to my
    +	corpus directories, but has useful stuff in it all the same.
    +	
    +2002-08-31 16:49  tim_one
    +
    +	* classifier.py (1.3):
    +
    +	The explanation for these changes was on Python-Dev.  You'll find out
    +	why if the moderator approves the msg.
    +	
    +2002-08-29 07:04  tim_one
    +
    +	* Tester.py (1.2), classifier.py (1.2):
    +
    +	Tester.py:  Repaired a comment.  The false_{positive,negative})_rate()
    +	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
    +	too hard to get motivated to reduce 0.01 <0.1 wink>).
    +	
    +	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
    +	the 15 strongest (word, probability) pairs is returned as well as the
    +	overall probability (this is how to find out why a message scored as it
    +	did).
    +	
    +2002-08-28 13:45  montanaro
    +
    +	* GBayes.py (1.15):
    +
    +	ehh - it actually didn't work all that well.  the spurious report that it
    +	did well was pilot error.  besides, tim's report suggests that a simple
    +	str.split() may be the best tokenizer anyway.
    +	
    +2002-08-28 10:45  montanaro
    +
    +	* setup.py (1.1):
    +
    +	trivial little setup.py file - i don't expect most people will be interested
    +	in this, but it makes it a tad simpler to work with now that there are two
    +	files
    +	
    +2002-08-28 10:43  montanaro
    +
    +	* GBayes.py (1.14):
    +
    +	add simple trigram tokenizer - this seems to yield the best results I've
    +	seen so far (but has not been extensively tested)
    +	
    +2002-08-28 08:10  tim_one
    +
    +	* Tester.py (1.1):
    +
    +	A start at a testing class.  There isn't a lot here, but it automates
    +	much of the tedium, and as the doctest shows it can already do
    +	useful things, like remembering which inputs were misclassified.
    +	
    +2002-08-27 06:45  tim_one
    +
    +	* mboxcount.py (1.5):
    +
    +	Updated stats to what Barry and I both get now.  Fiddled output.
    +	
    +2002-08-27 05:09  bwarsaw
    +
    +	* split.py (1.5), splitn.py (1.2):
    +
    +	_factory(): Return the empty string instead of None in the except
    +	clauses, so that for-loops won't break prematurely.  mailbox.py's base
    +	class defines an __iter__() that raises a StopIteration on None
    +	return.
    +	
    +2002-08-27 04:55  tim_one
    +
    +	* GBayes.py (1.13), mboxcount.py (1.4):
    +
    +	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
    +	
    +2002-08-27 04:40  bwarsaw
    +
    +	* mboxcount.py (1.3):
    +
    +	Some stats after splitting b/w good messages and unparseable messages
    +	
    +2002-08-27 04:23  bwarsaw
    +
    +	* mboxcount.py (1.2):
    +
    +	_factory(): Use a marker object to distinguish between good messages and
    +	unparseable messages.  For some reason, returning None from the except
    +	clause in _factory() caused Python 2.2.1 to exit early out of the for
    +	loop.
    +	
    +	main(): Print statistics about both the number of good messages and
    +	the number of unparseable messages.
    +	
    +2002-08-27 03:06  tim_one
    +
    +	* cleanarch (1.2):
    +
    +	"From " is a header more than a separator, so don't bump the msg count
    +	at the end.
    +	
    +2002-08-24 01:42  tim_one
    +
    +	* GBayes.py (1.12), classifier.py (1.1):
    +
    +	Moved all the interesting code that was in the *original* GBayes.py into
    +	a new classifier.py.  It was designed to have a very clean interface,
    +	and there's no reason to keep slamming everything into one file.  The
    +	ever-growing tokenizer stuff should probably also be split out, leaving
    +	GBayes.py a pure driver.
    +	
    +	Also repaired _test() (Skip's checkin left it without a binding for
    +	the tokenize function).
    +	
    +2002-08-24 01:17  tim_one
    +
    +	* splitn.py (1.1):
    +
    +	Utility to split an mbox into N random pieces in one gulp.  This gives
    +	a convenient way to break a giant corpus into multiple files that can
    +	then be used independently across multiple training and testing runs.
    +	It's important to do multiple runs on different random samples to avoid
    +	drawing conclusions based on accidents in a single random training corpus;
    +	if the algorithm is robust, it should have similar performance across
    +	all runs.
    +	
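The shuffle-and-deal idea behind splitn can be sketched over an in-memory list; a real run would parse the mbox into messages first, so this is illustrative rather than the actual script:

```python
import random

def splitn(msgs, n, rng=random):
    """Deal messages into n random, roughly equal pieces, so each piece
    can serve as an independent training/testing sample."""
    shuffled = list(msgs)
    rng.shuffle(shuffled)               # randomize once, in one gulp
    pieces = [[] for _ in range(n)]
    for i, m in enumerate(shuffled):
        pieces[i % n].append(m)         # round-robin deal
    return pieces

pieces = splitn(range(100), 4)
```

Every message lands in exactly one piece, and the pieces differ only by the random shuffle, which is what makes cross-run comparisons meaningful.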
    +2002-08-24 00:25  montanaro
    +
    +	* GBayes.py (1.11):
    +
    +	Allow command line specification of tokenize functions
    +	    run w/ -t flag to override default tokenize function
    +	    run w/ -H flag to see list of tokenize functions
    +	
    +	When adding a new tokenizer, make docstring a short description and add a
    +	key/value pair to the tokenizers dict.  The key is what the user specifies.
    +	The value is a tokenize function.
    +	
    +	Added two new tokenizers - tokenize_wordpairs_foldcase and
    +	tokenize_words_and_pairs.  It's not obvious that either is better than any
    +	of the preexisting functions.
    +	
    +	Should probably add info to the pickle which indicates the tokenizing
    +	function used to build it.  This could then be the default for spam
    +	detection runs.
    +	
    +	Next step is to drive this with spam/non-spam corpora, selecting each of the
    +	various tokenizer functions, and presenting the results in tabular form.
    +	
    +2002-08-23 13:10  tim_one
    +
    +	* GBayes.py (1.10):
    +
    +	spamprob():  Commented some subtleties.
    +	
    +	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
    +	is that you can't delete entries from a dict that's being crawled over
    +	by .iteritems(), which is why I (I suddenly recall) materialized a
    +	list of words to be deleted the first time I wrote this.  It's a lot
    +	better to materialize a list of to-be-deleted words than to materialize
    +	the entire database in a dict.items() list.
    +	
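The iteritems() pitfall above is the classic rule that a dict must not change size while being iterated; the fix is to materialize only the list of doomed keys, not the whole database.  A minimal illustration (the clearjunk/wordinfo names are borrowed from the entry for flavor, not the original code):

```python
def clearjunk(wordinfo, is_junk):
    """Delete junk entries from wordinfo without mutating the dict
    while iterating it: collect the doomed keys first, then delete."""
    doomed = [w for w, info in wordinfo.items() if is_junk(info)]
    for w in doomed:
        del wordinfo[w]

db = {"cheap": 1, "meeting": 42, "viagra": 2}
clearjunk(db, lambda count: count < 10)
```

Deleting inside the `for w, info in wordinfo.items()` loop itself would raise a RuntimeError in modern Python; the doomed list costs memory proportional only to the entries removed.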
    +2002-08-23 12:36  tim_one
    +
    +	* mboxcount.py (1.1):
    +
    +	Utility to count and display the # of msgs in (one or more) Unix mboxes.
    +	
    +2002-08-23 12:11  tim_one
    +
    +	* split.py (1.4):
    +
    +	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
    +	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
    +	the latter invents an extra newline.
    +	
    +2002-08-22 07:01  tim_one
    +
    +	* GBayes.py (1.9):
    +
    +	Renamed "modtime" to "atime", to better reflect its meaning, and added a
    +	comment block to explain that better.
    +	
    +2002-08-21 08:07  bwarsaw
    +
    +	* split.py (1.3):
    +
    +	Guido suggests a different order for the positional args.
    +	
    +2002-08-21 07:37  bwarsaw
    +
    +	* split.py (1.2):
    +
    +	Get rid of the -1 and -2 arguments and make them positional.
    +	
    +2002-08-21 07:18  bwarsaw
    +
    +	* split.py (1.1):
    +
    +	A simple mailbox splitter
    +	
    +2002-08-21 06:42  tim_one
    +
    +	* GBayes.py (1.8):
    +
    +	Added a bunch of simple tokenizers.  The originals are renamed to
    +	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
    +	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
    +	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
    +	any of these to be the last word.  When Barry has the test corpus
    +	set up it should be easy to let the data tell us which "pure" strategy
    +	works best.  Straight character n-grams are very appealing because
    +	they're the simplest and most language-neutral; I didn't have any luck
    +	with them over the weekend, but the size of my training data was
    +	trivial.
    +	
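For reference, tokenizers of the kinds named above are tiny, which is part of the appeal of character n-grams; this is a sketch of the two basic shapes, not the GBayes.py originals:

```python
def tokenize_words(text):
    """Whitespace-split word tokens."""
    return text.split()

def tokenize_ngram(text, n):
    """Overlapping character n-grams: the simplest and most
    language-neutral scheme, at the cost of a much larger vocabulary."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = tokenize_ngram("spam", 3)
```

A 5-gram variant is just tokenize_ngram(text, 5); folding case first gives the *_foldcase flavors.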
    +2002-08-21 05:08  bwarsaw
    +
    +	* cleanarch (1.1):
    +
    +	An archive cleaner, adapted from the Mailman 2.1b3 version, but
    +	de-Mailman-ified.
    +	
    +2002-08-21 04:44  gvanrossum
    +
    +	* GBayes.py (1.7):
    +
    +	Indent repair in clearjunk().
    +	
    +2002-08-21 04:22  gvanrossum
    +
    +	* GBayes.py (1.6):
    +
    +	Some minor cleanup:
    +	
    +	- Move the identifying comment to the top, clarify it a bit, and add
    +	  author info.
    +	
    +	- There's no reason for _time and _heapreplace to be hidden names;
    +	  change these back to time and heapreplace.
    +	
    +	- Rename main1() to _test() and main2() to main(); when main() sees
    +	  there are no options or arguments, it runs _test().
    +	
    +	- Get rid of a list comprehension from clearjunk().
    +	
    +	- Put wordinfo.get as a local variable in _add_msg().
    +	
    +2002-08-20 15:16  tim_one
    +
    +	* GBayes.py (1.5):
    +
    +	Neutral typo repairs, except that clearjunk() has a better chance of
    +	not blowing up immediately now.
    +	
    +2002-08-20 13:49  montanaro
    +
    +	* GBayes.py (1.4):
    +
    +	help make it more easily executable... ;-)
    +	
    +2002-08-20 09:32  bwarsaw
    +
    +	* GBayes.py (1.3):
    +
    +	Lots of hacks great and small to the main() program, but I didn't
    +	touch the guts of the algorithm.
    +	
    +	Added a module docstring/usage message.
    +	
    +	Added a bunch of switches to train the system on an mbox of known good
    +	and known spam messages (using PortableUnixMailbox only for now).
    +	Uses the email package but does no decoding of message bodies.  Also,
    +	allows you to specify a file for pickling the training data, and for
    +	setting a threshold, above which messages get an X-Bayes-Score
    +	header.  Also output messages (marked and unmarked) to an output file
    +	for retraining.
    +	
    +	Print some statistics at the end.
    +	
    +2002-08-20 05:43  tim_one
    +
    +	* GBayes.py (1.2):
    +
    +	Turned off debugging vrbl mistakenly checked in at True.
    +	
    +	unlearn():  Gave this an update_probabilities=True default arg, for
    +	symmetry with learn().
    +	
    +2002-08-20 03:33  tim_one
    +
    +	* GBayes.py (1.1):
    +
    +	An implementation of Paul Graham's Bayes-like spam classifier.
    +
    +
Deleted: trunk/website/presfchangelog.ht
===================================================================
--- trunk/website/presfchangelog.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/presfchangelog.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -1,905 +0,0 @@
    -Pre-Sourceforge ChangeLog
    -
    -This changelog lists the commits on the spambayes project before the
    -separate project was set up.  See also the old CVS repository, but
    -don't forget that it's now out of date, and you probably want to be
    -looking at the current CVS.
    -
    -2002-09-06 02:27  tim_one
    -
    -	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
    -	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
    -	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
    -
    -	This code has been moved to a new SourceForge project (spambayes).
    -	
    -2002-09-05 15:37  tim_one
    -
    -	* classifier.py (1.11):
    -
    -	Added note about MINCOUNT oddities.
    -	
    -2002-09-05 14:32  tim_one
    -
    -	* timtest.py (1.17):
    -
    -	Added note about word length.
    -	
    -2002-09-05 13:48  tim_one
    -
    -	* timtest.py (1.16):
    -
    -	tokenize_word():  Oops!  This was awfully permissive in what it
    -	took as being "an email address".  Tightened that, and also
    -	avoided 5-gram'ing of email addresses w/ high-bit characters.
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.050  lost
    -	    0.075  0.075  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   0 times
    -	tied 19 times
    -	lost  1 times
    -	
    -	total unique fp went from 7 to 8
    -	
    -	false negative percentages
    -	    0.764  0.691  won
    -	    0.691  0.655  won
    -	    0.981  0.945  won
    -	    1.309  1.309  tied
    -	    1.418  1.164  won
    -	    0.873  0.800  won
    -	    0.800  0.763  won
    -	    1.163  1.163  tied
    -	    1.491  1.345  won
    -	    1.200  1.127  won
    -	    1.381  1.345  won
    -	    1.454  1.490  lost
    -	    1.164  0.909  won
    -	    0.655  0.582  won
    -	    0.655  0.691  lost
    -	    1.163  1.163  tied
    -	    1.200  1.018  won
    -	    0.982  0.873  won
    -	    0.982  0.909  won
    -	    1.236  1.127  won
    -	
    -	won  15 times
    -	tied  3 times
    -	lost  2 times
    -	
    -	total unique fn went from 260 to 249
    -	
    -	Note:  Each of the two losses there consist of just 1 msg difference.
    -	The wins are bigger as well as being more common, and 260-249 = 11
    -	spams no longer sneak by any run (which is more than 4% of the 260
    -	spams that used to sneak thru!).
    -	
    -2002-09-05 11:51  tim_one
    -
    -	* classifier.py (1.10):
    -
    -	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
    -	really matter; leaving it alone.
    -	
    -2002-09-05 10:02  tim_one
    -
    -	* classifier.py (1.9):
    -
    -	A now-rare pure win, changing spamprob() to work harder to find more
    -	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
    -	column, after in the right:
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.075  0.075  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.075  0.025  won
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   1 times
    -	tied 19 times
    -	lost  0 times
    -	
    -	total unique fp went from 9 to 7
    -	
    -	false negative percentages
    -	    0.909  0.764  won
    -	    0.800  0.691  won
    -	    1.091  0.981  won
    -	    1.381  1.309  won
    -	    1.491  1.418  won
    -	    1.055  0.873  won
    -	    0.945  0.800  won
    -	    1.236  1.163  won
    -	    1.564  1.491  won
    -	    1.200  1.200  tied
    -	    1.454  1.381  won
    -	    1.599  1.454  won
    -	    1.236  1.164  won
    -	    0.800  0.655  won
    -	    0.836  0.655  won
    -	    1.236  1.163  won
    -	    1.236  1.200  won
    -	    1.055  0.982  won
    -	    1.127  0.982  won
    -	    1.381  1.236  won
    -	
    -	won  19 times
    -	tied  1 times
    -	lost  0 times
    -	
    -	total unique fn went from 284 to 260
    -	
    -2002-09-04 11:21  tim_one
    -
    -	* timtest.py (1.15):
    -
    -	Augmented the spam callback to display spams with low probability.
    -	
    -2002-09-04 09:53  tim_one
    -
    -	* Tester.py (1.3), timtest.py (1.14):
    -
    -	Added support for simple histograms of the probability distributions for
    -	ham and spam.
    -	
    -2002-09-03 12:13  tim_one
    -
    -	* timtest.py (1.13):
    -
    -	A reluctant "on principle" change no matter what it does to the stats:
    -	take a stab at removing HTML decorations from plain text msgs.  See
    -	comments for why it's *only* in plain text msgs.  This puts an end to
    -	false positives due to text msgs talking *about* HTML.  Surprisingly, it
    -	also gets rid of some false negatives.  Not surprisingly, it introduced
    -	another small class of false positives due to the dumbass regexp trick
    -	used to approximate HTML tag removal removing pieces of text that had
    -	nothing to do with HTML tags (e.g., this happened in the middle of a
    -	uuencoded .py file in such a way that it just happened to leave behind
    -	a string that "looked like" a spam phrase; but before this it looked
    -	like a pile of "too long" lines that didn't generate any tokens --
    -	it's a nonsense outcome either way).
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.025  lost
    -	    0.075  0.075  tied
    -	    0.050  0.025  won
    -	    0.025  0.025  tied
    -	    0.000  0.025  lost
    -	    0.050  0.075  lost
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   1 times
    -	tied 16 times
    -	lost  3 times
    -	
    -	total unique fp went from 8 to 9
    -	
    -	false negative percentages
    -	    0.945  0.909  won
    -	    0.836  0.800  won
    -	    1.200  1.091  won
    -	    1.418  1.381  won
    -	    1.455  1.491  lost
    -	    1.091  1.055  won
    -	    1.091  0.945  won
    -	    1.236  1.236  tied
    -	    1.564  1.564  tied
    -	    1.236  1.200  won
    -	    1.563  1.454  won
    -	    1.563  1.599  lost
    -	    1.236  1.236  tied
    -	    0.836  0.800  won
    -	    0.873  0.836  won
    -	    1.236  1.236  tied
    -	    1.273  1.236  won
    -	    1.018  1.055  lost
    -	    1.091  1.127  lost
    -	    1.490  1.381  won
    -	
    -	won  12 times
    -	tied  4 times
    -	lost  4 times
    -	
    -	total unique fn went from 292 to 284
    -	
    -2002-09-03 06:57  tim_one
    -
    -	* classifier.py (1.8):
    -
    -	Added a new xspamprob() method, which computes the combined probability
    -	"correctly", and a long comment block explaining what happened when I
    -	tried it.  There's something worth pursuing here (it greatly improves
    -	the false negative rate), but this change alone pushes too many marginal
    -	hams into the spam camp
    -	
    -2002-09-03 05:23  tim_one
    -
    -	* timtest.py (1.12):
    -
    -	Made "skip:" tokens shorter.
    -	
    -	Added a surprising treatment of Organization headers, with a tiny f-n
    -	benefit for a tiny cost.  No change in f-p stats.
    -	
    -	false negative percentages
    -	    1.091  0.945  won
    -	    0.945  0.836  won
    -	    1.236  1.200  won
    -	    1.454  1.418  won
    -	    1.491  1.455  won
    -	    1.091  1.091  tied
    -	    1.127  1.091  won
    -	    1.236  1.236  tied
    -	    1.636  1.564  won
    -	    1.345  1.236  won
    -	    1.672  1.563  won
    -	    1.599  1.563  won
    -	    1.236  1.236  tied
    -	    0.836  0.836  tied
    -	    1.018  0.873  won
    -	    1.236  1.236  tied
    -	    1.273  1.273  tied
    -	    1.055  1.018  won
    -	    1.091  1.091  tied
    -	    1.527  1.490  won
    -	
    -	won  13 times
    -	tied  7 times
    -	lost  0 times
    -	
    -	total unique fn went from 302 to 292
    -	
    -2002-09-03 02:18  tim_one
    -
    -	* timtest.py (1.11):
    -
    -	tokenize_word():  dropped the prefix from the signature; it's faster
    -	to let the caller do it, and this also repaired a bug in one place it
    -	was being used (well, a *conceptual* bug anyway, in that the code didn't
    -	do what I intended there).  This changes the stats in an insignificant
    -	way.  The f-p stats didn't change.  The f-n stats shifted by one message
    -	in a few cases:
    -	
    -	false negative percentages
    -	    1.091  1.091  tied
    -	    0.945  0.945  tied
    -	    1.200  1.236  lost
    -	    1.454  1.454  tied
    -	    1.491  1.491  tied
    -	    1.091  1.091  tied
    -	    1.091  1.127  lost
    -	    1.236  1.236  tied
    -	    1.636  1.636  tied
    -	    1.382  1.345  won
    -	    1.636  1.672  lost
    -	    1.599  1.599  tied
    -	    1.236  1.236  tied
    -	    0.836  0.836  tied
    -	    1.018  1.018  tied
    -	    1.236  1.236  tied
    -	    1.273  1.273  tied
    -	    1.055  1.055  tied
    -	    1.091  1.091  tied
    -	    1.527  1.527  tied
    -	
    -	won   1 times
    -	tied 16 times
    -	lost  3 times
    -	
    -	total unique unchanged
    -	
    -2002-09-02 19:30  tim_one
    -
    -	* timtest.py (1.10):
    -
    -	Don't ask me why this helps -- I don't really know!  When skipping "long
    -	words", generating a token with a brief hint about what and how much got
    -	skipped makes a definite improvement in the f-n rate, and doesn't affect
    -	the f-p rate at all.  Since experiment said it's a winner, I'm checking
    -	it in.  Before (left column) and after (right column):
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.075  0.075  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   0 times
    -	tied 20 times
    -	lost  0 times
    -	
    -	total unique fp went from 8 to 8
    -	
    -	false negative percentages
    -	    1.236  1.091  won
    -	    1.164  0.945  won
    -	    1.454  1.200  won
    -	    1.599  1.454  won
    -	    1.527  1.491  won
    -	    1.236  1.091  won
    -	    1.163  1.091  won
    -	    1.309  1.236  won
    -	    1.891  1.636  won
    -	    1.418  1.382  won
    -	    1.745  1.636  won
    -	    1.708  1.599  won
    -	    1.491  1.236  won
    -	    0.836  0.836  tied
    -	    1.091  1.018  won
    -	    1.309  1.236  won
    -	    1.491  1.273  won
    -	    1.127  1.055  won
    -	    1.309  1.091  won
    -	    1.636  1.527  won
    -	
    -	won  19 times
    -	tied  1 times
    -	lost  0 times
    -	
    -	total unique fn went from 336 to 302
    -	
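The "brief hint" token can be as simple as recording the first character and a bucketed length of the skipped long word.  A guess at the shape of the idea; the cutoff and token format here are illustrative, not the project's actual values:

```python
MAX_WORD_LEN = 12   # illustrative cutoff for "long words"

def tokenize_word(word):
    """Yield the word itself, or -- for long words -- a short hint token
    recording roughly what and how much got skipped."""
    if len(word) <= MAX_WORD_LEN:
        yield word
    else:
        # Bucket the length to a multiple of 10 to keep the number of
        # distinct skip tokens small.
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

tokens = list(tokenize_word("supercalifragilistic"))
```

So a 20-character word starting with "s" collapses into the single token "skip:s 20", which the classifier can then learn from like any other word.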
    -2002-09-02 17:55  tim_one
    -
    -	* timtest.py (1.9):
    -
    -	Some comment changes and nesting reduction.
    -	
    -2002-09-02 11:18  tim_one
    -
    -	* timtest.py (1.8):
    -
    -	Fixed some out-of-date comments.
    -	
    -	Made URL clumping lumpier:  now distinguishes among just "first field",
    -	"second field", and "everything else".
    -	
    -	Changed tag names for email address fields (semantically neutral).
    -	
    -	Added "From:" line tagging.
    -	
    -	These add up to an almost pure win.  Before-and-after f-n rates across 20
    -	runs:
    -	
    -	1.418   1.236
    -	1.309   1.164
    -	1.636   1.454
    -	1.854   1.599
    -	1.745   1.527
    -	1.418   1.236
    -	1.381   1.163
    -	1.418   1.309
    -	2.109   1.891
    -	1.491   1.418
    -	1.854   1.745
    -	1.890   1.708
    -	1.818   1.491
    -	1.055   0.836
    -	1.164   1.091
    -	1.599   1.309
    -	1.600   1.491
    -	1.127   1.127
    -	1.164   1.309
    -	1.781   1.636
    -	
    -	It only increased in one run.  The variance appears to have been reduced
    -	too (I didn't bother to compute that, though).
    -	
    -	Before-and-after f-p rates across 20 runs:
    -	
    -	0.000   0.000
    -	0.000   0.000
    -	0.075   0.050
    -	0.000   0.000
    -	0.025   0.025
    -	0.050   0.025
    -	0.075   0.050
    -	0.025   0.025
    -	0.025   0.025
    -	0.025   0.000
    -	0.100   0.075
    -	0.050   0.050
    -	0.025   0.025
    -	0.000   0.000
    -	0.075   0.050
    -	0.025   0.025
    -	0.025   0.025
    -	0.000   0.000
    -	0.075   0.025
    -	0.100   0.050
    -	
    -	Note that 0.025% is a single message; it's really impossible to *measure*
    -	an improvement in the f-p rate anymore with 4000-msg ham sets.
    -	
    -	Across all 20 runs,
    -	
    -	the total # of unique f-n fell from 353 to 336
    -	the total # of unique f-p fell from 13 to 8
    -	
    -2002-09-02 10:06  tim_one
    -
    -	* timtest.py (1.7):
    -
    -	A number of changes.  The most significant is paying attention to the
    -	Subject line (I was wrong before when I said my c.l.py ham corpus was
    -	unusable for this due to Mailman-injected decorations).  In all, across
    -	my 20 test runs,
    -	
    -	the total # of unique false positives fell from 23 to 13
    -	the total # of unique false negatives rose from 337 to 353
    -	
    -	Neither result is statistically significant, although I bet the first
    -	one would be if I pissed away a few days trying to come up with a more
    -	realistic model for what "stat. sig." means here.
    -	
    -2002-09-01 17:22  tim_one
    -
    -	* classifier.py (1.7):
    -
    -	Added a comment block about HAMBIAS experiments.  There's no clearer
    -	example of trading off precision against recall, and you can favor either
    -	at the expense of the other to any degree you like by fiddling this knob.
    -	
    -2002-09-01 14:42  tim_one
    -
    -	* timtest.py (1.6):
    -
    -	Long new comment block summarizing all my experiments with character
    -	n-grams.  Bottom line is that they have nothing going for them, and a
    -	lot going against them, under Graham's scheme.  I believe there may
    -	still be a place for them in *part* of a word-based tokenizer, though.
    -	
    -2002-09-01 10:05  tim_one
    -
    -	* classifier.py (1.6):
    -
    -	spamprob():  Never count unique words more than once anymore.  Counting
    -	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
    -	that's now a small drag instead.
    -	
    -2002-09-01 07:33  tim_one
    -
    -	* rebal.py (1.3), timtest.py (1.5):
    -
    -	Folding case is here to stay.  Read the new comments for why.  This may
    -	be a bad idea for other languages, though.
    -	
    -	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
    -	http is spam-neutral, but https is a strong spam indicator.  That
    -	surprised me.
    -	
    -2002-09-01 06:47  tim_one
    -
    -	* classifier.py (1.5):
    -
    -	spamprob():  Removed useless check that wordstream isn't empty.  For one
    -	thing, it didn't work, since wordstream is often an iterator.  Even if
    -	it did work, it isn't needed -- the probability of an empty wordstream
    -	gets computed as 0.5 based on the total absence of evidence.
    -	
    -2002-09-01 05:37  tim_one
    -
    -	* timtest.py (1.4):
    -
    -	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
    -	details).
    -	
    -2002-09-01 05:09  tim_one
    -
    -	* rebal.py (1.2):
    -
    -	Aha!  Staring at the checkin msg revealed a logic bug that explains why
    -	my ham directories sometimes remained unbalanced after running this --
    -	if the randomly selected reservoir msg turned out to be spam, it wasn't
    -	pushing the too-small directory on the stack again.
    -	
    -2002-09-01 04:56  tim_one
    -
    -	* timtest.py (1.3):
    -
    -	textparts():  This was failing to weed out redundant HTML in cases like
    -	this:
    -	
    -	    multipart/alternative
    -	        text/plain
    -	        multipart/related
    -	            text/html
    -	
    -	The tokenizer here also transforms everything to lowercase, but that's
    -	an accident due simply to the fact that I'm testing that now.  Can't say for
    -	sure until the test runs end, but so far it looks like a bad idea for
    -	the false positive rate.
    -	
    -2002-09-01 04:52  tim_one
    -
    -	* rebal.py (1.1):
    -
    -	A little script I use to rebalance the ham corpora after deleting what
    -	turns out to be spam.  I have another Ham/reservoir directory with a
    -	few thousand randomly selected msgs from the presumably-good archive.
    -	These aren't used in scoring or training.  This script marches over all
    -	the ham corpora directories that are used, and if any have gotten too
    -	big (this never happens anymore) deletes msgs at random from them, and
    -	if any have gotten too small plugs the holes by moving in random
    -	msgs from the reservoir.
    -	
    -2002-09-01 03:25  tim_one
    -
    -	* classifier.py (1.4), timtest.py (1.2):
    -
    -	Boost UNKNOWN_SPAMPROB.
    -	# The spam probability assigned to words never seen before.  Graham used
    -	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
    -	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
    -	# negative rate by over 1/3.  The f-p rate increased, but there were so few
    -	# f-ps that the increase wasn't statistically significant.  It also caught
    -	# 13 more spams erroneously classified as ham.  By eyeball (and common
    -	# sense), this has most effect on very short messages, where there
    -	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
    -	# completely ignored by spamprob(), in favor of *any* word with *any* prob
    -	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
    -	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
    -	# on the face of it.
    -	
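    Why a word at exactly 0.5 is "completely ignored" falls straight out of
    the Graham-style combining formula.  A minimal sketch, assuming the
    bare product form of the combiner (illustrative only, not the actual
    classifier.py code):

```python
from functools import reduce
import operator

def spamprob(probs):
    """Graham-style combining of per-word spam probabilities
    (bare product form, for illustration only)."""
    p = reduce(operator.mul, probs, 1.0)
    q = reduce(operator.mul, (1.0 - x for x in probs), 1.0)
    return p / (p + q)

# A word at exactly 0.5 multiplies numerator and denominator products
# by the same factor, so it cancels out of the ratio: effectively ignored.
assert abs(spamprob([0.9, 0.5]) - spamprob([0.9])) < 1e-9

# At 0.2, an unknown word actively drags the score toward ham.
assert spamprob([0.9, 0.2]) < spamprob([0.9])
```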
    -2002-08-31 16:50  tim_one
    -
    -	* timtest.py (1.1):
    -
    -	This is a driver I've been using for test runs.  It's specific to my
    -	corpus directories, but has useful stuff in it all the same.
    -	
    -2002-08-31 16:49  tim_one
    -
    -	* classifier.py (1.3):
    -
    -	The explanation for these changes was on Python-Dev.  You'll find out
    -	why if the moderator approves the msg.
    -	
    -2002-08-29 07:04  tim_one
    -
    -	* Tester.py (1.2), classifier.py (1.2):
    -
    -	Tester.py:  Repaired a comment.  The false_{positive,negative}_rate()
    -	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
    -	too hard to get motivated to reduce 0.01 <0.1 wink>).
    -	
    -	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
    -	the 15 strongest (word, probability) pairs is returned as well as the
    -	overall probability (this is how to find out why a message scored as it
    -	did).
    -	
    -2002-08-28 13:45  montanaro
    -
    -	* GBayes.py (1.15):
    -
    -	ehh - it actually didn't work all that well.  the spurious report that it
    -	did well was pilot error.  besides, tim's report suggests that a simple
    -	str.split() may be the best tokenizer anyway.
    -	
    -2002-08-28 10:45  montanaro
    -
    -	* setup.py (1.1):
    -
    -	trivial little setup.py file - i don't expect most people will be interested
    -	in this, but it makes it a tad simpler to work with now that there are two
    -	files
    -	
    -2002-08-28 10:43  montanaro
    -
    -	* GBayes.py (1.14):
    -
    -	add simple trigram tokenizer - this seems to yield the best results I've
    -	seen so far (but has not been extensively tested)
    -	
    -2002-08-28 08:10  tim_one
    -
    -	* Tester.py (1.1):
    -
    -	A start at a testing class.  There isn't a lot here, but it automates
    -	much of the tedium, and as the doctest shows it can already do
    -	useful things, like remembering which inputs were misclassified.
    -	
    -2002-08-27 06:45  tim_one
    -
    -	* mboxcount.py (1.5):
    -
    -	Updated stats to what Barry and I both get now.  Fiddled output.
    -	
    -2002-08-27 05:09  bwarsaw
    -
    -	* split.py (1.5), splitn.py (1.2):
    -
    -	_factory(): Return the empty string instead of None in the except
    -	clauses, so that for-loops won't break prematurely.  mailbox.py's base
    -	class defines an __iter__() that raises a StopIteration on None
    -	return.
    -	
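    To see why returning None broke the for-loops, here is a hedged,
    minimal stand-in for the iteration contract described above (Python 3
    spelling; the real mailbox.py class differs): the factory returning
    None ends iteration, so a parse failure must yield a placeholder such
    as the empty string to keep the loop alive.

```python
class Box:
    """Minimal stand-in for mailbox.py's contract: iteration stops
    as soon as the factory returns None."""
    def __init__(self, raws, factory):
        self._raws = iter(raws)
        self._factory = factory

    def __iter__(self):
        return self

    def __next__(self):
        msg = self._factory(next(self._raws))
        if msg is None:            # None is treated as end-of-mailbox
            raise StopIteration
        return msg

bad = lambda raw: None if raw == "broken" else raw   # old behaviour
good = lambda raw: "" if raw == "broken" else raw    # the fix

assert list(Box(["a", "broken", "b"], bad)) == ["a"]           # breaks early
assert list(Box(["a", "broken", "b"], good)) == ["a", "", "b"]
```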
    -2002-08-27 04:55  tim_one
    -
    -	* GBayes.py (1.13), mboxcount.py (1.4):
    -
    -	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
    -	
    -2002-08-27 04:40  bwarsaw
    -
    -	* mboxcount.py (1.3):
    -
    -	Some stats after splitting b/w good messages and unparseable messages
    -	
    -2002-08-27 04:23  bwarsaw
    -
    -	* mboxcount.py (1.2):
    -
    -	_factory(): Use a marker object to distinguish between good messages and
    -	unparseable messages.  For some reason, returning None from the except
    -	clause in _factory() caused Python 2.2.1 to exit early out of the for
    -	loop.
    -	
    -	main(): Print statistics about both the number of good messages and
    -	the number of unparseable messages.
    -	
    -2002-08-27 03:06  tim_one
    -
    -	* cleanarch (1.2):
    -
    -	"From " is a header more than a separator, so don't bump the msg count
    -	at the end.
    -	
    -2002-08-24 01:42  tim_one
    -
    -	* GBayes.py (1.12), classifier.py (1.1):
    -
    -	Moved all the interesting code that was in the *original* GBayes.py into
    -	a new classifier.py.  It was designed to have a very clean interface,
    -	and there's no reason to keep slamming everything into one file.  The
    -	ever-growing tokenizer stuff should probably also be split out, leaving
    -	GBayes.py a pure driver.
    -	
    -	Also repaired _test() (Skip's checkin left it without a binding for
    -	the tokenize function).
    -	
    -2002-08-24 01:17  tim_one
    -
    -	* splitn.py (1.1):
    -
    -	Utility to split an mbox into N random pieces in one gulp.  This gives
    -	a convenient way to break a giant corpus into multiple files that can
    -	then be used independently across multiple training and testing runs.
    -	It's important to do multiple runs on different random samples to avoid
    -	drawing conclusions based on accidents in a single random training corpus;
    -	if the algorithm is robust, it should have similar performance across
    -	all runs.
    -	
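    A sketch of the idea -- the name, signature, and round-robin dealing
    strategy are illustrative assumptions, not the actual splitn.py code:

```python
import random

def splitn(msgs, n, seed=None):
    """Shuffle the messages, then deal them into n random pieces."""
    rng = random.Random(seed)
    msgs = list(msgs)
    rng.shuffle(msgs)
    return [msgs[i::n] for i in range(n)]

pieces = splitn(range(100), 5, seed=1)
# Every message lands in exactly one piece, and pieces are equal-sized.
assert sorted(m for piece in pieces for m in piece) == list(range(100))
assert all(len(piece) == 20 for piece in pieces)
```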
    -2002-08-24 00:25  montanaro
    -
    -	* GBayes.py (1.11):
    -
    -	Allow command line specification of tokenize functions
    -	    run w/ -t flag to override default tokenize function
    -	    run w/ -H flag to see list of tokenize functions
    -	
    -	When adding a new tokenizer, make docstring a short description and add a
    -	key/value pair to the tokenizers dict.  The key is what the user specifies.
    -	The value is a tokenize function.
    -	
    -	Added two new tokenizers - tokenize_wordpairs_foldcase and
    -	tokenize_words_and_pairs.  It's not obvious that either is better than any
    -	of the preexisting functions.
    -	
    -	Should probably add info to the pickle which indicates the tokenizing
    -	function used to build it.  This could then be the default for spam
    -	detection runs.
    -	
    -	Next step is to drive this with spam/non-spam corpora, selecting each of the
    -	various tokenizer functions, and presenting the results in tabular form.
    -	
    -2002-08-23 13:10  tim_one
    -
    -	* GBayes.py (1.10):
    -
    -	spamprob():  Commented some subtleties.
    -	
    -	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
    -	is that you can't delete entries from a dict that's being crawled over
    -	by .iteritems(), which is why I (I suddenly recall) materialized a
    -	list of words to be deleted the first time I wrote this.  It's a lot
    -	better to materialize a list of to-be-deleted words than to materialize
    -	the entire database in a dict.items() list.
    -	
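    The safe pattern described -- materialize only the to-be-deleted words,
    then delete -- looks like this.  A sketch under assumed names; the real
    clearjunk() works on richer word-info records than bare counts.

```python
def clearjunk(wordinfo, min_count=10):
    """Drop rare words.  Materialize the doomed keys first: deleting
    from a dict while iterating over it is an error, and a short
    doomed-list is far cheaper than snapshotting every item with
    dict.items() as a list."""
    doomed = [w for w, count in wordinfo.items() if count < min_count]
    for w in doomed:
        del wordinfo[w]
    return wordinfo

db = {"cheap": 3, "v1agra": 1, "python": 250}
assert clearjunk(db) == {"python": 250}
```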
    -2002-08-23 12:36  tim_one
    -
    -	* mboxcount.py (1.1):
    -
    -	Utility to count and display the # of msgs in (one or more) Unix mboxes.
    -	
    -2002-08-23 12:11  tim_one
    -
    -	* split.py (1.4):
    -
    -	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
    -	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
    -	the latter invents an extra newline.
    -	
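    The extra-newline point is easy to demonstrate: print appends its own
    line terminator, so mbox lines that already end in a newline get
    doubled, while file.write() emits exactly what it is given.  (Shown
    with Python 3's print(..., file=...) in place of the old print>>
    syntax from the log entry.)

```python
import io

line = "From alice@example.com\n"   # mbox lines already carry a newline

buf = io.StringIO()
print(line, file=buf)               # print appends another newline...
assert buf.getvalue() == line + "\n"

buf = io.StringIO()
buf.write(line)                     # ...write() does not
assert buf.getvalue() == line
```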
    -2002-08-22 07:01  tim_one
    -
    -	* GBayes.py (1.9):
    -
    -	Renamed "modtime" to "atime", to better reflect its meaning, and added a
    -	comment block to explain that better.
    -	
    -2002-08-21 08:07  bwarsaw
    -
    -	* split.py (1.3):
    -
    -	Guido suggests a different order for the positional args.
    -	
    -2002-08-21 07:37  bwarsaw
    -
    -	* split.py (1.2):
    -
    -	Get rid of the -1 and -2 arguments and make them positional.
    -	
    -2002-08-21 07:18  bwarsaw
    -
    -	* split.py (1.1):
    -
    -	A simple mailbox splitter
    -	
    -2002-08-21 06:42  tim_one
    -
    -	* GBayes.py (1.8):
    -
    -	Added a bunch of simple tokenizers.  The originals are renamed to
    -	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
    -	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
    -	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
    -	any of these to be the last word.  When Barry has the test corpus
    -	set up it should be easy to let the data tell us which "pure" strategy
    -	works best.  Straight character n-grams are very appealing because
    -	they're the simplest and most language-neutral; I didn't have any luck
    -	with them over the weekend, but the size of my training data was
    -	trivial.
    -	
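    A character n-gram tokenizer of the kind listed is only a few lines.
    This sketch is not the original GBayes.py code; it just folds case the
    way the *_foldcase variants did:

```python
def tokenize_ngram_foldcase(text, n=5):
    """Straight character n-grams, lowercased: the simplest and most
    language-neutral of the strategies mentioned above."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

assert tokenize_ngram_foldcase("Spam!", 3) == ["spa", "pam", "am!"]
```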
    -2002-08-21 05:08  bwarsaw
    -
    -	* cleanarch (1.1):
    -
    -	An archive cleaner, adapted from the Mailman 2.1b3 version, but
    -	de-Mailman-ified.
    -	
    -2002-08-21 04:44  gvanrossum
    -
    -	* GBayes.py (1.7):
    -
    -	Indent repair in clearjunk().
    -	
    -2002-08-21 04:22  gvanrossum
    -
    -	* GBayes.py (1.6):
    -
    -	Some minor cleanup:
    -	
    -	- Move the identifying comment to the top, clarify it a bit, and add
    -	  author info.
    -	
    -	- There's no reason for _time and _heapreplace to be hidden names;
    -	  change these back to time and heapreplace.
    -	
    -	- Rename main1() to _test() and main2() to main(); when main() sees
    -	  there are no options or arguments, it runs _test().
    -	
    -	- Get rid of a list comprehension from clearjunk().
    -	
    -	- Put wordinfo.get as a local variable in _add_msg().
    -	
    -2002-08-20 15:16  tim_one
    -
    -	* GBayes.py (1.5):
    -
    -	Neutral typo repairs, except that clearjunk() has a better chance of
    -	not blowing up immediately now.
    -	
    -2002-08-20 13:49  montanaro
    -
    -	* GBayes.py (1.4):
    -
    -	help make it more easily executable... ;-)
    -	
    -2002-08-20 09:32  bwarsaw
    -
    -	* GBayes.py (1.3):
    -
    -	Lots of hacks great and small to the main() program, but I didn't
    -	touch the guts of the algorithm.
    -	
    -	Added a module docstring/usage message.
    -	
    -	Added a bunch of switches to train the system on an mbox of known good
    -	and known spam messages (using PortableUnixMailbox only for now).
    -	Uses the email package but does no decoding of message bodies.  Also
    -	allows you to specify a file for pickling the training data, and for
    -	setting a threshold, above which messages get an X-Bayes-Score
    -	header.  Also outputs messages (marked and unmarked) to an output file
    -	for retraining.
    -	
    -	Print some statistics at the end.
    -	
    -2002-08-20 05:43  tim_one
    -
    -	* GBayes.py (1.2):
    -
    -	Turned off debugging vrbl mistakenly checked in at True.
    -	
    -	unlearn():  Gave this an update_probabilities=True default arg, for
    -	symmetry with learn().
    -	
    -2002-08-20 03:33  tim_one
    -
    -	* GBayes.py (1.1):
    -
    -	An implementation of Paul Graham's Bayes-like spam classifier.
    -
    -
Modified: trunk/website/unix.ht
===================================================================
--- trunk/website/unix.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/unix.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -9,8 +9,8 @@
 href="download.html">recent enough version of Python is installed,
 then install the Spambayes source either as a bundled
-package or from
-CVS, then choose the Spambayes application which best fits into your
+package or from
+Subversion, then choose the Spambayes application which best fits into your
 mail setup.

    Procmail

@@ -52,8 +52,8 @@
 Additional details are available in the Hammie
-readme.
+href="http://spambayes.svn.sourceforge.net/viewvc/*checkout*/spambayes/trunk/spambayes/README.txt">README
+file.

    POP3

@@ -208,8 +208,8 @@

    Training

 See the Hammie
-readme for a detailed discussion of the many training options
+href="http://spambayes.svn.sourceforge.net/viewvc/*checkout*/spambayes/trunk/spambayes/README.txt">
+README file for a detailed discussion of the many training options
 on Unix systems.

    Notes

Modified: trunk/website/windows.ht
===================================================================
--- trunk/website/windows.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/windows.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -35,7 +35,7 @@
 report and add any useful information you may have, or to open a new
 bug report if your problem seems to be a new one.  Please be sure to
 go through the
+href="http://spambayes.svn.sourceforge.net/viewvc/*checkout*/spambayes/trunk/spambayes/Outlook2000/docs/troubleshooting.html">
 troubleshooting.html file that is installed with the plugin.

    Installing the Outlook Client From Source

@@ -56,10 +56,10 @@
 The SpamBayes source, either as a zip
- file or via
- CVS.  The zip file will probably be easier to handle, but there may
- be improvements to the code which make the CVS version a
- viable option (though you will have to have a CVS client for Windows
+ file or via
+ Subversion.  The zip file will probably be easier to handle, but there may
+ be improvements to the code which make the Subversion version a
+ viable option (though you will have to have a Subversion client for Windows
 installed).
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

From montanaro at users.sourceforge.net Fri Jul 27 15:34:45 2007
From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net)
Date: Fri, 27 Jul 2007 06:34:45 -0700
Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3157] trunk/website
Message-ID:

Revision: 3157
          http://spambayes.svn.sourceforge.net/spambayes/?rev=3157&view=rev
Author:   montanaro
Date:     2007-07-27 06:34:45 -0700 (Fri, 27 Jul 2007)

Log Message:
-----------
fix a couple more CVS->Subversion problems

Modified Paths:
--------------
    trunk/website/download.ht
    trunk/website/links.h

Modified: trunk/website/download.ht
===================================================================
--- trunk/website/download.ht	2007-07-25 13:51:11 UTC (rev 3156)
+++ trunk/website/download.ht	2007-07-27 13:34:45 UTC (rev 3157)
@@ -108,6 +108,6 @@

    Subversion Access

-The code is currently available the SourceForge Subversion server.
+The code is currently available from the SourceForge Subversion server.

Modified: trunk/website/links.h
===================================================================
--- trunk/website/links.h	2007-07-25 13:51:11 UTC (rev 3156)
+++ trunk/website/links.h	2007-07-27 13:34:45 UTC (rev 3157)
@@ -13,4 +13,4 @@
  • Mac OS

    Getting the code

   • Releases
-  • CVS access
+  • Subversion access

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.