From montanaro at users.sourceforge.net Tue Jul 3 16:12:54 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Tue, 03 Jul 2007 07:12:54 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dbmstorage.py,1.15,1.16 Message-ID: <20070703141258.49C631E400A@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv13923 Modified Files: dbmstorage.py Log Message: SF patch #810344. Should have applied this long ago. Index: dbmstorage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/dbmstorage.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** dbmstorage.py 7 Apr 2006 02:23:05 -0000 1.15 --- dbmstorage.py 3 Jul 2007 14:12:49 -0000 1.16 *************** *** 34,37 **** --- 34,42 ---- return gdbm.open(*args) + def open_dbm(*args): + """Open a dbm database.""" + import dbm + return dbm.open(*args) + def open_best(*args): if sys.platform == "win32": *************** *** 42,46 **** funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash] for f in funcs: try: --- 47,52 ---- funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash, ! open_dbm] for f in funcs: try: *************** *** 56,59 **** --- 62,66 ---- "bsddb185": open_db185hash, "gdbm": open_gdbm, + "dbm": open_dbm, } From montanaro at users.sourceforge.net Wed Jul 4 12:58:57 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Wed, 04 Jul 2007 03:58:57 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dbmstorage.py,1.16,1.17 Message-ID: <20070704105901.49AA71E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv19300 Modified Files: dbmstorage.py Log Message: Revert the last change. 
It was ill-considered, and only serves to sneak Berkeley DB 1.85 files into the system on Macs in the guise of supporting the dbm format. Index: dbmstorage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/dbmstorage.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** dbmstorage.py 3 Jul 2007 14:12:49 -0000 1.16 --- dbmstorage.py 4 Jul 2007 10:58:54 -0000 1.17 *************** *** 34,42 **** return gdbm.open(*args) - def open_dbm(*args): - """Open a dbm database.""" - import dbm - return dbm.open(*args) - def open_best(*args): if sys.platform == "win32": --- 34,37 ---- *************** *** 47,52 **** funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash, ! open_dbm] for f in funcs: try: --- 42,46 ---- funcs.insert(0, open_dbhash) else: ! funcs = [open_db3hash, open_dbhash, open_gdbm, open_db185hash] for f in funcs: try: *************** *** 62,66 **** "bsddb185": open_db185hash, "gdbm": open_gdbm, - "dbm": open_dbm, } --- 56,59 ---- From mhammond at users.sourceforge.net Sat Jul 7 08:25:25 2007 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Fri, 06 Jul 2007 23:25:25 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.42,1.43 Message-ID: <20070707062531.72E361E4005@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32054 Modified Files: WHAT_IS_NEW.txt Log Message: Add info about 1.1a4 Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** WHAT_IS_NEW.txt 25 Aug 2006 02:02:12 -0000 1.42 --- WHAT_IS_NEW.txt 7 Jul 2007 06:25:23 -0000 1.43 *************** *** 16,19 **** --- 16,51 ---- is released. 
+ New in 1.1 Alpha 4 + ================== + + -------------------------------------------- + ** Incompatible changes and Transitioning ** + -------------------------------------------- + + Some options that were 'experimental' in 1.1a3 have now been upgraded to + non-experimental, meaning the option names have had their 'x-' prefix removed. + See below for details. + + Otherwise, there should be no incompatible changes since 1.1a3, though users + new to the 1.1 series should pay careful attention to the database changes + introduced in 1.1a2. + + ------------------- + ** Other changes ** + ------------------- + + The previously experimental options 'x-crack-images', 'x-ocr-engine' + and 'x-image-size' have all had their 'x-' prefix removed. 'crack-images' + now defaults to True (meaning you don't need to change anything for it + to be enabled), and ocr-engine defaults to 'gocr'. The Windows binary ships + with the gocr engine, so this should work out-of-the-box both for Outlook + and POP/IMAP/etc users. + + Image Cracking (i.e., using OCR to extract text from images) has been + implemented for the Outlook addin. + + Some localization-related issues have been fixed, and a German translation + contributed. + New in 1.1 Alpha 3 ================== *************** *** 91,94 **** --- 123,130 ---- -------------------------------------------- + * NOTE * - this section does not apply to people running SpamBayes on + Windows using the binary installer - only source code installations are + affected. + SpamBayes has changed to use ZODB as the default database backend, rather than dbm (usually bsddb). There are three methods for handling this *************** *** 109,115 **** persistent_use_database:dbm ! o You can convert your existing database files to the new format. ! Windows users will be given the opportunity to do this on installation; ! other users should use the utilities/convert_db.py script to do this. 
Note that only the token database (containing your training) is converted; the 'messageinfo' database (containing statistics about --- 145,150 ---- persistent_use_database:dbm ! o You can convert your existing database files to the new format using ! the utilities/convert_db.py script. Note that only the token database (containing your training) is converted; the 'messageinfo' database (containing statistics about From montanaro at users.sourceforge.net Sun Jul 15 01:13:14 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 14 Jul 2007 16:13:14 -0700 Subject: [Spambayes-checkins] spambayes/spambayes XMLRPCPlugin.py,1.2,1.3 Message-ID: <20070714231317.8BEB11E4008@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31847/spambayes Modified Files: XMLRPCPlugin.py Log Message: Add train and train_mime methods to the XML-RPC plugin. These come from Marian Neagul. Index: XMLRPCPlugin.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/XMLRPCPlugin.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** XMLRPCPlugin.py 10 Jun 2007 15:27:36 -0000 1.2 --- XMLRPCPlugin.py 14 Jul 2007 23:13:09 -0000 1.3 *************** *** 37,40 **** --- 37,47 ---- """ + __author__ = "Skip Montanaro " + __credits__ = "All the Spambayes folk." + + # This module is part of the spambayes project, which is Copyright 2002 The + # Python Software Foundation and is covered by the Python Software + # Foundation license. + import threading import xmlrpclib *************** *** 70,82 **** def _dispatch(self, method, params): ! if method in ("score", "score_mime"): return getattr(self, method)(*params) else: raise xmlrpclib.Fault(404, '"%s" is not supported' % method) def score(self, form_dict, extra_tokens, attachments): """Score a dictionary + extra tokens.""" ! 
mime_message = form_to_mime(form_dict, extra_tokens, attachments) ! mime_message = unicode(mime_message).encode("utf-8") return self.score_mime(mime_message, "utf-8") --- 77,170 ---- def _dispatch(self, method, params): ! if method in ("score", "score_mime", "train", "train_mime"): return getattr(self, method)(*params) else: raise xmlrpclib.Fault(404, '"%s" is not supported' % method) + def train(self, form_dict, extra_tokens, attachments, is_spam=True): + newdict={} + for (i, k) in form_dict.items(): + if type(k)==unicode: + k = k.encode("utf-8") + newdict[i] = k + mime_message = form_to_mime(newdict, extra_tokens, attachments) + mime_message = unicode(mime_message.as_string(), "utf-8").encode("utf-8") + self.train_mime(mime_message, "utf-8", is_spam) + return "" + + def train_mime(self, msg_text, encoding, is_spam): + if self.state.bayes is None: + self.state.create_workers() + # Get msg_text into canonical string representation. + # Make sure we have a unicode object... + if isinstance(msg_text, str): + msg_text = unicode(msg_text, encoding) + # ... then encode it as utf-8. 
+ if isinstance(msg_text, unicode): + msg_text = msg_text.encode("utf-8") + msg = message_from_string(msg_text, + _class=spambayes.message.SBHeaderMessage) + tokens = tokenize(msg) + if is_spam: + desired_corpus = "spamCorpus" + else: + desired_corpus = "hamCorpus" + if hasattr(self, desired_corpus): + corpus = getattr(self, desired_corpus) + else: + if hasattr(self, "state"): + corpus = getattr(self.state, desired_corpus) + setattr(self, desired_corpus, corpus) + self.msg_name_func = self.state.getNewMessageName + else: + if is_spam: + fn = storage.get_pathname_option("Storage", + "spam_cache") + else: + fn = storage.get_pathname_option("Storage", + "ham_cache") + storage.ensureDir(fn) + if options["Storage", "cache_use_gzip"]: + factory = FileCorpus.GzipFileMessageFactory() + else: + factory = FileCorpus.FileMessageFactory() + age = options["Storage", "cache_expiry_days"]*24*60*60 + corpus = FileCorpus.ExpiryFileCorpus(age, factory, fn, + '[0123456789\-]*', cacheSize=20) + setattr(self, desired_corpus, corpus) + class UniqueNamer(object): + count = -1 + def generate_name(self): + self.count += 1 + return "%10.10d-%d" % (long(time.time()), self.count) + Namer = UniqueNamer() + self.msg_name_func = Namer.generate_name + key = self.msg_name_func() + mime_message = unicode(msg.as_string(), "utf-8").encode("utf-8") + msg = corpus.makeMessage(key, mime_message) + msg.setId(key) + corpus.addMessage(msg) + msg.RememberTrained(is_spam) + #self.stats.RecordTraining(not is_spam) + #if is_spam: + # self.state.bayes.nspam += 1 + #else: + # self.state.bayes.nham += 1 + + def train_spam(self, form_dict, extra_tokens, attachments): + pass + + def train_ham(self, form_dict, extra_tokens, attachments): + pass + def score(self, form_dict, extra_tokens, attachments): """Score a dictionary + extra tokens.""" ! newdict={} ! for (i, k) in form_dict.items(): ! if isinstance(k,unicode): ! k = k.encode("utf-8") ! newdict[i] = k ! 
mime_message = form_to_mime(newdict, extra_tokens, attachments) ! mime_message = unicode(mime_message.as_string(), "utf-8").encode("utf-8") return self.score_mime(mime_message, "utf-8") From montanaro at users.sourceforge.net Sun Jul 15 01:13:14 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 14 Jul 2007 16:13:14 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.43,1.44 Message-ID: <20070714231318.53E3C1E4008@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31847 Modified Files: WHAT_IS_NEW.txt Log Message: Add train and train_mime methods to the XML-RPC plugin. These come from Marian Neagul. Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** WHAT_IS_NEW.txt 7 Jul 2007 06:25:23 -0000 1.43 --- WHAT_IS_NEW.txt 14 Jul 2007 23:13:09 -0000 1.44 *************** *** 16,19 **** --- 16,25 ---- is released. + New in 1.1 Alpha 5 + ================== + + The XML-RPC plugin for core_server.py now has "train" and "train_mime" + methods. + New in 1.1 Alpha 4 ================== *************** *** 48,51 **** --- 54,61 ---- contributed. + There is a new application, core_server.py. It is functionally similar to + sb_server.py but uses a plugin architecture to adapt to different + protocols. The first plugin is for XML-RPC. + New in 1.1 Alpha 3 ================== From montanaro at users.sourceforge.net Sun Jul 15 01:14:16 2007 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 14 Jul 2007 16:14:16 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.59,1.60 Message-ID: <20070714231420.290BB1E4008@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32661 Modified Files: CHANGELOG.txt Log Message: . 
Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.59 retrieving revision 1.60 diff -C2 -d -r1.59 -r1.60 *** CHANGELOG.txt 25 Jun 2007 12:10:10 -0000 1.59 --- CHANGELOG.txt 14 Jul 2007 23:14:14 -0000 1.60 *************** *** 1,4 **** --- 1,8 ---- [Note that all dates are in ISO 8601 format, e.g. YYYY-MM-DD to ease sorting] + Release 1.1a5 + + Skip Montanaro 2007-07-14 Add train and train_mime methods to XML-RPC plugin (from Marian Neagul). + Release 1.1a4 From montanaro at users.sourceforge.net Tue Jul 17 04:17:00 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Mon, 16 Jul 2007 19:17:00 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3152] trunk/spambayes/WHAT_IS_NEW.txt Message-ID: Revision: 3152 http://spambayes.svn.sourceforge.net/spambayes/?rev=3152&view=rev Author: montanaro Date: 2007-07-16 19:16:59 -0700 (Mon, 16 Jul 2007) Log Message: ----------- a trivial change - testing authentication and email notification Modified Paths: -------------- trunk/spambayes/WHAT_IS_NEW.txt Modified: trunk/spambayes/WHAT_IS_NEW.txt =================================================================== --- trunk/spambayes/WHAT_IS_NEW.txt 2007-07-16 11:26:57 UTC (rev 3151) +++ trunk/spambayes/WHAT_IS_NEW.txt 2007-07-17 02:16:59 UTC (rev 3152) @@ -18,7 +18,7 @@ New in 1.1 Alpha 5 ================== -The XML-RPC plugin for core_server.py now has "train" and "train_mime" +The XML-RPC plugin for core_server.py now has 'train' and 'train_mime' methods. The source code repository was switched from CVS to Subversion. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. 
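The new train/train_mime entry points go through the same `_dispatch` gate as score/score_mime: a fixed whitelist of method names is checked before `getattr`, so remote callers can never reach arbitrary attributes of the plugin. A minimal, self-contained sketch of that pattern in modern Python (the class and its placeholder `score` body are illustrative, not the plugin's actual state handling):

```python
from xmlrpc.client import Fault

class DispatchDemo:
    """Illustrative stand-in for the plugin's XML-RPC worker.

    Only names listed in ALLOWED are reachable through _dispatch;
    anything else raises an XML-RPC Fault, mirroring how the plugin
    keeps internal attributes uncallable from the network.
    """
    ALLOWED = ("score", "score_mime", "train", "train_mime")

    def _dispatch(self, method, params):
        # Whitelist check first, then dynamic lookup and call.
        if method in self.ALLOWED:
            return getattr(self, method)(*params)
        raise Fault(404, '"%s" is not supported' % method)

    def score(self, text):
        return 0.5  # placeholder probability for the sketch

demo = DispatchDemo()
print(demo._dispatch("score", ("hello",)))  # 0.5
try:
    demo._dispatch("_secret", ())
except Fault as e:
    print(e.faultCode)  # 404
```

Extending the whitelist tuple is all the revision above had to do to expose the two new methods.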
From montanaro at users.sourceforge.net Tue Jul 17 04:17:17 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Mon, 16 Jul 2007 19:17:17 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3153] trunk/spambayes/README-DEVEL.txt Message-ID: Revision: 3153 http://spambayes.svn.sourceforge.net/spambayes/?rev=3153&view=rev Author: montanaro Date: 2007-07-16 19:17:16 -0700 (Mon, 16 Jul 2007) Log Message: ----------- a trivial change - testing authentication and email notification Modified Paths: -------------- trunk/spambayes/README-DEVEL.txt Modified: trunk/spambayes/README-DEVEL.txt =================================================================== --- trunk/spambayes/README-DEVEL.txt 2007-07-17 02:16:59 UTC (rev 3152) +++ trunk/spambayes/README-DEVEL.txt 2007-07-17 02:17:16 UTC (rev 3153) @@ -27,7 +27,12 @@ You should definitely check out the FAQ: http://spambayes.org/faq.html +Getting Source Code +=================== +The SpamBayes project source code is hosted at SourceForge +(http://spambayes.sourceforge.net/). Access is via Subversion. + Primary Core Files ================== Options.py 
From montanaro at users.sourceforge.net Tue Jul 24 02:04:32 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Mon, 23 Jul 2007 17:04:32 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3154] trunk/spambayes/scripts/sb_notesfilter.py Message-ID: Revision: 3154 http://spambayes.svn.sourceforge.net/spambayes/?rev=3154&view=rev Author: montanaro Date: 2007-07-23 17:04:32 -0700 (Mon, 23 Jul 2007) Log Message: ----------- one more incorrectly positioned __future__ import Modified Paths: -------------- trunk/spambayes/scripts/sb_notesfilter.py Modified: trunk/spambayes/scripts/sb_notesfilter.py =================================================================== --- trunk/spambayes/scripts/sb_notesfilter.py 2007-07-17 02:17:16 UTC (rev 3153) +++ trunk/spambayes/scripts/sb_notesfilter.py 2007-07-24 00:04:32 UTC (rev 3154) @@ -130,11 +130,11 @@ # The Python Software Foundation and is covered by the Python Software # Foundation license. +from __future__ import generators + __author__ = "Tim Stone " __credits__ = "Mark Hammond, for his remarkable win32 modules." -from __future__ import generators - try: True, False except NameError: From montanaro at users.sourceforge.net Wed Jul 25 15:51:11 2007 From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net) Date: Wed, 25 Jul 2007 06:51:11 -0700 Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3156] trunk/website Message-ID: Revision: 3156 http://spambayes.svn.sourceforge.net/spambayes/?rev=3156&view=rev Author: montanaro Date: 2007-07-25 06:51:11 -0700 (Wed, 25 Jul 2007) Log Message: ----------- read the file name incorrectly as a misspelling of "prefs changelog" instead of "pre sf changelog"! 
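The sb_notesfilter.py patch in revision 3154 reflects a language rule: a `from __future__ import ...` statement must be the first statement in a module, preceded only by the docstring and comments; placing it after assignments such as `__author__` is a SyntaxError on the Python versions where the feature matters. A minimal sketch of the corrected ordering (module contents are illustrative; `generators` is a no-op import on modern Python, kept here only to show placement):

```python
"""Illustrative module showing __future__ import placement."""

# The __future__ import comes immediately after the docstring,
# before __author__/__credits__ or any other statement.
from __future__ import generators  # harmless no-op on Python >= 2.3

__author__ = "Tim Stone"

def counter(n):
    # Generator syntax, which this __future__ import enabled on Python 2.2.
    i = 0
    while i < n:
        yield i
        i += 1

print(list(counter(3)))  # [0, 1, 2]
```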
Added Paths: ----------- trunk/website/presfchangelog.ht Removed Paths: ------------- trunk/website/prefschangelog.ht Deleted: trunk/website/prefschangelog.ht =================================================================== --- trunk/website/prefschangelog.ht 2007-07-25 13:49:42 UTC (rev 3155) +++ trunk/website/prefschangelog.ht 2007-07-25 13:51:11 UTC (rev 3156) @@ -1,905 +0,0 @@ -

Pre-Sourceforge ChangeLog

-

This changelog lists the commits on the spambayes projects before the - separate project was set up. See also the -old CVS repository, but don't forget that it's now out of date, and you probably want to be looking at the current CVS. -

-
-2002-09-06 02:27  tim_one
-
-	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
-	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
-	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
-
-	This code has been moved to a new SourceForge project (spambayes).
-	
-2002-09-05 15:37  tim_one
-
-	* classifier.py (1.11):
-
-	Added note about MINCOUNT oddities.
-	
-2002-09-05 14:32  tim_one
-
-	* timtest.py (1.17):
-
-	Added note about word length.
-	
-2002-09-05 13:48  tim_one
-
-	* timtest.py (1.16):
-
-	tokenize_word():  Oops!  This was awfully permissive in what it
-	took as being "an email address".  Tightened that, and also
-	avoided 5-gram'ing of email addresses w/ high-bit characters.
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.050  lost
-	    0.075  0.075  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   0 times
-	tied 19 times
-	lost  1 times
-	
-	total unique fp went from 7 to 8
-	
-	false negative percentages
-	    0.764  0.691  won
-	    0.691  0.655  won
-	    0.981  0.945  won
-	    1.309  1.309  tied
-	    1.418  1.164  won
-	    0.873  0.800  won
-	    0.800  0.763  won
-	    1.163  1.163  tied
-	    1.491  1.345  won
-	    1.200  1.127  won
-	    1.381  1.345  won
-	    1.454  1.490  lost
-	    1.164  0.909  won
-	    0.655  0.582  won
-	    0.655  0.691  lost
-	    1.163  1.163  tied
-	    1.200  1.018  won
-	    0.982  0.873  won
-	    0.982  0.909  won
-	    1.236  1.127  won
-	
-	won  15 times
-	tied  3 times
-	lost  2 times
-	
-	total unique fn went from 260 to 249
-	
-	Note:  Each of the two losses there consist of just 1 msg difference.
-	The wins are bigger as well as being more common, and 260-249 = 11
-	spams no longer sneak by any run (which is more than 4% of the 260
-	spams that used to sneak thru!).
-	
-2002-09-05 11:51  tim_one
-
-	* classifier.py (1.10):
-
-	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
-	really matter; leaving it alone.
-	
-2002-09-05 10:02  tim_one
-
-	* classifier.py (1.9):
-
-	A now-rare pure win, changing spamprob() to work harder to find more
-	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
-	column, after in the right:
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.075  0.075  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.075  0.025  won
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   1 times
-	tied 19 times
-	lost  0 times
-	
-	total unique fp went from 9 to 7
-	
-	false negative percentages
-	    0.909  0.764  won
-	    0.800  0.691  won
-	    1.091  0.981  won
-	    1.381  1.309  won
-	    1.491  1.418  won
-	    1.055  0.873  won
-	    0.945  0.800  won
-	    1.236  1.163  won
-	    1.564  1.491  won
-	    1.200  1.200  tied
-	    1.454  1.381  won
-	    1.599  1.454  won
-	    1.236  1.164  won
-	    0.800  0.655  won
-	    0.836  0.655  won
-	    1.236  1.163  won
-	    1.236  1.200  won
-	    1.055  0.982  won
-	    1.127  0.982  won
-	    1.381  1.236  won
-	
-	won  19 times
-	tied  1 times
-	lost  0 times
-	
-	total unique fn went from 284 to 260
-	
-2002-09-04 11:21  tim_one
-
-	* timtest.py (1.15):
-
-	Augmented the spam callback to display spams with low probability.
-	
-2002-09-04 09:53  tim_one
-
-	* Tester.py (1.3), timtest.py (1.14):
-
-	Added support for simple histograms of the probability distributions for
-	ham and spam.
-	
-2002-09-03 12:13  tim_one
-
-	* timtest.py (1.13):
-
-	A reluctant "on principle" change no matter what it does to the stats:
-	take a stab at removing HTML decorations from plain text msgs.  See
-	comments for why it's *only* in plain text msgs.  This puts an end to
-	false positives due to text msgs talking *about* HTML.  Surprisingly, it
-	also gets rid of some false negatives.  Not surprisingly, it introduced
-	another small class of false positives due to the dumbass regexp trick
-	used to approximate HTML tag removal removing pieces of text that had
-	nothing to do with HTML tags (e.g., this happened in the middle of a
-	uuencoded .py file in such a why that it just happened to leave behind
-	a string that "looked like" a spam phrase; but before this it looked
-	like a pile of "too long" lines that didn't generate any tokens --
-	it's a nonsense outcome either way).
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.025  lost
-	    0.075  0.075  tied
-	    0.050  0.025  won
-	    0.025  0.025  tied
-	    0.000  0.025  lost
-	    0.050  0.075  lost
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   1 times
-	tied 16 times
-	lost  3 times
-	
-	total unique fp went from 8 to 9
-	
-	false negative percentages
-	    0.945  0.909  won
-	    0.836  0.800  won
-	    1.200  1.091  won
-	    1.418  1.381  won
-	    1.455  1.491  lost
-	    1.091  1.055  won
-	    1.091  0.945  won
-	    1.236  1.236  tied
-	    1.564  1.564  tied
-	    1.236  1.200  won
-	    1.563  1.454  won
-	    1.563  1.599  lost
-	    1.236  1.236  tied
-	    0.836  0.800  won
-	    0.873  0.836  won
-	    1.236  1.236  tied
-	    1.273  1.236  won
-	    1.018  1.055  lost
-	    1.091  1.127  lost
-	    1.490  1.381  won
-	
-	won  12 times
-	tied  4 times
-	lost  4 times
-	
-	total unique fn went from 292 to 284
-	
-2002-09-03 06:57  tim_one
-
-	* classifier.py (1.8):
-
-	Added a new xspamprob() method, which computes the combined probability
-	"correctly", and a long comment block explaining what happened when I
-	tried it.  There's something worth pursuing here (it greatly improves
-	the false negative rate), but this change alone pushes too many marginal
-	hams into the spam camp
-	
-2002-09-03 05:23  tim_one
-
-	* timtest.py (1.12):
-
-	Made "skip:" tokens shorter.
-	
-	Added a surprising treatment of Organization headers, with a tiny f-n
-	benefit for a tiny cost.  No change in f-p stats.
-	
-	false negative percentages
-	    1.091  0.945  won
-	    0.945  0.836  won
-	    1.236  1.200  won
-	    1.454  1.418  won
-	    1.491  1.455  won
-	    1.091  1.091  tied
-	    1.127  1.091  won
-	    1.236  1.236  tied
-	    1.636  1.564  won
-	    1.345  1.236  won
-	    1.672  1.563  won
-	    1.599  1.563  won
-	    1.236  1.236  tied
-	    0.836  0.836  tied
-	    1.018  0.873  won
-	    1.236  1.236  tied
-	    1.273  1.273  tied
-	    1.055  1.018  won
-	    1.091  1.091  tied
-	    1.527  1.490  won
-	
-	won  13 times
-	tied  7 times
-	lost  0 times
-	
-	total unique fn went from 302 to 292
-	
-2002-09-03 02:18  tim_one
-
-	* timtest.py (1.11):
-
-	tokenize_word():  dropped the prefix from the signature; it's faster
-	to let the caller do it, and this also repaired a bug in one place it
-	was being used (well, a *conceptual* bug anyway, in that the code didn't
-	do what I intended there).  This changes the stats in an insignificant
-	way.  The f-p stats didn't change.  The f-n stats shifted by one message
-	in a few cases:
-	
-	false negative percentages
-	    1.091  1.091  tied
-	    0.945  0.945  tied
-	    1.200  1.236  lost
-	    1.454  1.454  tied
-	    1.491  1.491  tied
-	    1.091  1.091  tied
-	    1.091  1.127  lost
-	    1.236  1.236  tied
-	    1.636  1.636  tied
-	    1.382  1.345  won
-	    1.636  1.672  lost
-	    1.599  1.599  tied
-	    1.236  1.236  tied
-	    0.836  0.836  tied
-	    1.018  1.018  tied
-	    1.236  1.236  tied
-	    1.273  1.273  tied
-	    1.055  1.055  tied
-	    1.091  1.091  tied
-	    1.527  1.527  tied
-	
-	won   1 times
-	tied 16 times
-	lost  3 times
-	
-	total unique unchanged
-	
-2002-09-02 19:30  tim_one
-
-	* timtest.py (1.10):
-
-	Don't ask me why this helps -- I don't really know!  When skipping "long
-	words", generating a token with a brief hint about what and how much got
-	skipped makes a definite improvement in the f-n rate, and doesn't affect
-	the f-p rate at all.  Since experiment said it's a winner, I'm checking
-	it in.  Before (left columan) and after (right column):
-	
-	false positive percentages
-	    0.000  0.000  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.075  0.075  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.050  0.050  tied
-	    0.025  0.025  tied
-	    0.025  0.025  tied
-	    0.000  0.000  tied
-	    0.025  0.025  tied
-	    0.050  0.050  tied
-	
-	won   0 times
-	tied 20 times
-	lost  0 times
-	
-	total unique fp went from 8 to 8
-	
-	false negative percentages
-	    1.236  1.091  won
-	    1.164  0.945  won
-	    1.454  1.200  won
-	    1.599  1.454  won
-	    1.527  1.491  won
-	    1.236  1.091  won
-	    1.163  1.091  won
-	    1.309  1.236  won
-	    1.891  1.636  won
-	    1.418  1.382  won
-	    1.745  1.636  won
-	    1.708  1.599  won
-	    1.491  1.236  won
-	    0.836  0.836  tied
-	    1.091  1.018  won
-	    1.309  1.236  won
-	    1.491  1.273  won
-	    1.127  1.055  won
-	    1.309  1.091  won
-	    1.636  1.527  won
-	
-	won  19 times
-	tied  1 times
-	lost  0 times
-	
-	total unique fn went from 336 to 302
-	
-2002-09-02 17:55  tim_one
-
-	* timtest.py (1.9):
-
-	Some comment changes and nesting reduction.
-	
-2002-09-02 11:18  tim_one
-
-	* timtest.py (1.8):
-
-	Fixed some out-of-date comments.
-	
-	Made URL clumping lumpier:  now distinguishes among just "first field",
-	"second field", and "everything else".
-	
-	Changed tag names for email address fields (semantically neutral).
-	
-	Added "From:" line tagging.
-	
-	These add up to an almost pure win.  Before-and-after f-n rates across 20
-	runs:
-	
-	1.418   1.236
-	1.309   1.164
-	1.636   1.454
-	1.854   1.599
-	1.745   1.527
-	1.418   1.236
-	1.381   1.163
-	1.418   1.309
-	2.109   1.891
-	1.491   1.418
-	1.854   1.745
-	1.890   1.708
-	1.818   1.491
-	1.055   0.836
-	1.164   1.091
-	1.599   1.309
-	1.600   1.491
-	1.127   1.127
-	1.164   1.309
-	1.781   1.636
-	
-	It only increased in one run.  The variance appears to have been reduced
-	too (I didn't bother to compute that, though).
-	
-	Before-and-after f-p rates across 20 runs:
-	
-	0.000   0.000
-	0.000   0.000
-	0.075   0.050
-	0.000   0.000
-	0.025   0.025
-	0.050   0.025
-	0.075   0.050
-	0.025   0.025
-	0.025   0.025
-	0.025   0.000
-	0.100   0.075
-	0.050   0.050
-	0.025   0.025
-	0.000   0.000
-	0.075   0.050
-	0.025   0.025
-	0.025   0.025
-	0.000   0.000
-	0.075   0.025
-	0.100   0.050
-	
-	Note that 0.025% is a single message; it's really impossible to *measure*
-	an improvement in the f-p rate anymore with 4000-msg ham sets.
-	
-	Across all 20 runs,
-	
-	the total # of unique f-n fell from 353 to 336
-	the total # of unique f-p fell from 13 to 8
-	
-2002-09-02 10:06  tim_one
-
-	* timtest.py (1.7):
-
-	A number of changes.  The most significant is paying attention to the
-	Subject line (I was wrong before when I said my c.l.py ham corpus was
-	unusable for this due to Mailman-injected decorations).  In all, across
-	my 20 test runs,
-	
-	the total # of unique false positives fell from 23 to 13
-	the total # of unique false negatives rose from 337 to 353
-	
-	Neither result is statistically significant, although I bet the first
-	one would be if I pissed away a few days trying to come up with a more
-	realistic model for what "stat. sig." means here .
-	
-2002-09-01 17:22  tim_one
-
-	* classifier.py (1.7):
-
-	Added a comment block about HAMBIAS experiments.  There's no clearer
-	example of trading off precision against recall, and you can favor either
-	at the expense of the other to any degree you like by fiddling this knob.
-	
-2002-09-01 14:42  tim_one
-
-	* timtest.py (1.6):
-
-	Long new comment block summarizing all my experiments with character
-	n-grams.  Bottom line is that they have nothing going for them, and a
-	lot going against them, under Graham's scheme.  I believe there may
-	still be a place for them in *part* of a word-based tokenizer, though.
-	
-2002-09-01 10:05  tim_one
-
-	* classifier.py (1.6):
-
-	spamprob():  Never count unique words more than once anymore.  Counting
-	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
-	that's now a small drag instead.
-	
-2002-09-01 07:33  tim_one
-
-	* rebal.py (1.3), timtest.py (1.5):
-
-	Folding case is here to stay.  Read the new comments for why.  This may
-	be a bad idea for other languages, though.
-	
-	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
-	http is spam-neutral, but https is a strong spam indicator.  That
-	surprised me.
-	
-2002-09-01 06:47  tim_one
-
-	* classifier.py (1.5):
-
-	spamprob():  Removed useless check that wordstream isn't empty.  For one
-	thing, it didn't work, since wordstream is often an iterator.  Even if
-	it did work, it isn't needed -- the probability of an empty wordstream
-	gets computed as 0.5 based on the total absence of evidence.
-	
-2002-09-01 05:37  tim_one
-
-	* timtest.py (1.4):
-
-	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
-	details).
-	
-2002-09-01 05:09  tim_one
-
-	* rebal.py (1.2):
-
-	Aha!  Staring at the checkin msg revealed a logic bug that explains why
-	my ham directories sometimes remained unbalanced after running this --
-	if the randomly selected reservoir msg turned out to be spam, it wasn't
-	pushing the too-small directory on the stack again.
-	
-2002-09-01 04:56  tim_one
-
-	* timtest.py (1.3):
-
-	textparts():  This was failing to weed out redundant HTML in cases like
-	this:
-	
-	    multipart/alternative
-	        text/plain
-	        multipart/related
-	            text/html
-	
-	The tokenizer here also transforms everything to lowercase, but that's
-	an accident due simply to that I'm testing that now.  Can't say for
-	sure until the test runs end, but so far it looks like a bad idea for
-	the false positive rate.
-	
-2002-09-01 04:52  tim_one
-
-	* rebal.py (1.1):
-
-	A little script I use to rebalance the ham corpora after deleting what
-	turns out to be spam.  I have another Ham/reservoir directory with a
-	few thousand randomly selected msgs from the presumably-good archive.
-	These aren't used in scoring or training.  This script marches over all
-	the ham corpora directories that are used, and if any have gotten too
-	big (this never happens anymore) deletes msgs at random from them, and
-	if any have gotten too small plugs the holes by moving in random
-	msgs from the reservoir.
-	
-2002-09-01 03:25  tim_one
-
-	* classifier.py (1.4), timtest.py (1.2):
-
-	Boost UNKNOWN_SPAMPROB.
-	# The spam probability assigned to words never seen before.  Graham used
-	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
-	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
-	# negative rate by over 1/3.  The f-p rate increased, but there were so few
-	# f-ps that the increase wasn't statistically significant.  It also caught
-	# 13 more spams erroneously classified as ham.  By eyeball (and common
-	# sense ), this has most effect on very short messages, where there
-	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
-	# completely ignored by spamprob(), in favor of *any* word with *any* prob
-	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
-	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
-	# on the face of it.
-	
-2002-08-31 16:50  tim_one
-
-	* timtest.py (1.1):
-
-	This is a driver I've been using for test runs.  It's specific to my
-	corpus directories, but has useful stuff in it all the same.
-	
-2002-08-31 16:49  tim_one
-
-	* classifier.py (1.3):
-
-	The explanation for these changes was on Python-Dev.  You'll find out
-	why if the moderator approves the msg .
-	
-2002-08-29 07:04  tim_one
-
-	* Tester.py (1.2), classifier.py (1.2):
-
-	Tester.py:  Repaired a comment.  The false_{positive,negative}_rate()
-	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
-	too hard to get motivated to reduce 0.01 <0.1 wink>).
-	
-	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
-	the 15 strongest (word, probability) pairs is returned as well as the
-	overall probability (this is how to find out why a message scored as it
-	did).
-	
-2002-08-28 13:45  montanaro
-
-	* GBayes.py (1.15):
-
-	ehh - it actually didn't work all that well.  the spurious report that it
-	did well was pilot error.  besides, tim's report suggests that a simple
-	str.split() may be the best tokenizer anyway.
-	
-2002-08-28 10:45  montanaro
-
-	* setup.py (1.1):
-
-	trivial little setup.py file - i don't expect most people will be interested
-	in this, but it makes it a tad simpler to work with now that there are two
-	files
-	
-2002-08-28 10:43  montanaro
-
-	* GBayes.py (1.14):
-
-	add simple trigram tokenizer - this seems to yield the best results I've
-	seen so far (but has not been extensively tested)
-	
-2002-08-28 08:10  tim_one
-
-	* Tester.py (1.1):
-
-	A start at a testing class.  There isn't a lot here, but it automates
-	much of the tedium, and as the doctest shows it can already do
-	useful things, like remembering which inputs were misclassified.
-	
-2002-08-27 06:45  tim_one
-
-	* mboxcount.py (1.5):
-
-	Updated stats to what Barry and I both get now.  Fiddled output.
-	
-2002-08-27 05:09  bwarsaw
-
-	* split.py (1.5), splitn.py (1.2):
-
-	_factory(): Return the empty string instead of None in the except
-	clauses, so that for-loops won't break prematurely.  mailbox.py's base
-	class defines an __iter__() that raises a StopIteration on None
-	return.
-	
-2002-08-27 04:55  tim_one
-
-	* GBayes.py (1.13), mboxcount.py (1.4):
-
-	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
-	
-2002-08-27 04:40  bwarsaw
-
-	* mboxcount.py (1.3):
-
-	Some stats after splitting b/w good messages and unparseable messages
-	
-2002-08-27 04:23  bwarsaw
-
-	* mboxcount.py (1.2):
-
-	_factory(): Use a marker object to distinguish between good messages and
-	unparseable messages.  For some reason, returning None from the except
-	clause in _factory() caused Python 2.2.1 to exit early out of the for
-	loop.
-	
-	main(): Print statistics about both the number of good messages and
-	the number of unparseable messages.
-	
-2002-08-27 03:06  tim_one
-
-	* cleanarch (1.2):
-
-	"From " is a header more than a separator, so don't bump the msg count
-	at the end.
-	
-2002-08-24 01:42  tim_one
-
-	* GBayes.py (1.12), classifier.py (1.1):
-
-	Moved all the interesting code that was in the *original* GBayes.py into
-	a new classifier.py.  It was designed to have a very clean interface,
-	and there's no reason to keep slamming everything into one file.  The
-	ever-growing tokenizer stuff should probably also be split out, leaving
-	GBayes.py a pure driver.
-	
-	Also repaired _test() (Skip's checkin left it without a binding for
-	the tokenize function).
-	
-2002-08-24 01:17  tim_one
-
-	* splitn.py (1.1):
-
-	Utility to split an mbox into N random pieces in one gulp.  This gives
-	a convenient way to break a giant corpus into multiple files that can
-	then be used independently across multiple training and testing runs.
-	It's important to do multiple runs on different random samples to avoid
-	drawing conclusions based on accidents in a single random training corpus;
-	if the algorithm is robust, it should have similar performance across
-	all runs.
-	
-2002-08-24 00:25  montanaro
-
-	* GBayes.py (1.11):
-
-	Allow command line specification of tokenize functions
-	    run w/ -t flag to override default tokenize function
-	    run w/ -H flag to see list of tokenize functions
-	
-	When adding a new tokenizer, make docstring a short description and add a
-	key/value pair to the tokenizers dict.  The key is what the user specifies.
-	The value is a tokenize function.
-	
-	Added two new tokenizers - tokenize_wordpairs_foldcase and
-	tokenize_words_and_pairs.  It's not obvious that either is better than any
-	of the preexisting functions.
-	
-	Should probably add info to the pickle which indicates the tokenizing
-	function used to build it.  This could then be the default for spam
-	detection runs.
-	
-	Next step is to drive this with spam/non-spam corpora, selecting each of the
-	various tokenizer functions, and presenting the results in tabular form.
-	
-2002-08-23 13:10  tim_one
-
-	* GBayes.py (1.10):
-
-	spamprob():  Commented some subtleties.
-	
-	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
-	is that you can't delete entries from a dict that's being crawled over
-	by .iteritems(), which is why I (I suddenly recall) materialized a
-	list of words to be deleted the first time I wrote this.  It's a lot
-	better to materialize a list of to-be-deleted words than to materialize
-	the entire database in a dict.items() list.
-	
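The clearjunk() pitfall described above is the standard mutate-while-iterating trap: you cannot delete entries from a dict while crawling over it. A minimal illustration with made-up word probabilities (not the actual GBayes.py data structures):

```python
# You can't delete from a dict while iterating over it, so first
# materialize the list of keys to delete, then delete in a second pass.
wordinfo = {"cheap": 0.99, "madeup": 0.51, "python": 0.01}  # made-up data
doomed = [word for word, prob in wordinfo.items() if prob > 0.5]
for word in doomed:
    del wordinfo[word]
print(sorted(wordinfo))  # prints ['python']
```

Materializing just the doomed keys keeps the extra memory proportional to what gets deleted, instead of copying the whole database with dict.items().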
-2002-08-23 12:36  tim_one
-
-	* mboxcount.py (1.1):
-
-	Utility to count and display the # of msgs in (one or more) Unix mboxes.
-	
-2002-08-23 12:11  tim_one
-
-	* split.py (1.4):
-
-	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
-	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
-	the latter invents an extra newline.
-	
-2002-08-22 07:01  tim_one
-
-	* GBayes.py (1.9):
-
-	Renamed "modtime" to "atime", to better reflect its meaning, and added a
-	comment block to explain that better.
-	
-2002-08-21 08:07  bwarsaw
-
-	* split.py (1.3):
-
-	Guido suggests a different order for the positional args.
-	
-2002-08-21 07:37  bwarsaw
-
-	* split.py (1.2):
-
-	Get rid of the -1 and -2 arguments and make them positional.
-	
-2002-08-21 07:18  bwarsaw
-
-	* split.py (1.1):
-
-	A simple mailbox splitter
-	
-2002-08-21 06:42  tim_one
-
-	* GBayes.py (1.8):
-
-	Added a bunch of simple tokenizers.  The originals are renamed to
-	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
-	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
-	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
-	any of these to be the last word.  When Barry has the test corpus
-	set up it should be easy to let the data tell us which "pure" strategy
-	works best.  Straight character n-grams are very appealing because
-	they're the simplest and most language-neutral; I didn't have any luck
-	with them over the weekend, but the size of my training data was
-	trivial.
-	
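A character 5-gram tokenizer of the kind named above is only a few lines; this is an illustrative sketch, not the original GBayes.py code (which also had a whitespace-collapsing variant):

```python
def tokenize_5gram_foldcase(text):
    """Yield overlapping, case-folded character 5-grams."""
    text = text.lower()
    for i in range(len(text) - 4):
        yield text[i:i + 5]

print(list(tokenize_5gram_foldcase("SpamBayes")))
# ['spamb', 'pamba', 'ambay', 'mbaye', 'bayes']
```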
-2002-08-21 05:08  bwarsaw
-
-	* cleanarch (1.1):
-
-	An archive cleaner, adapted from the Mailman 2.1b3 version, but
-	de-Mailman-ified.
-	
-2002-08-21 04:44  gvanrossum
-
-	* GBayes.py (1.7):
-
-	Indent repair in clearjunk().
-	
-2002-08-21 04:22  gvanrossum
-
-	* GBayes.py (1.6):
-
-	Some minor cleanup:
-	
-	- Move the identifying comment to the top, clarify it a bit, and add
-	  author info.
-	
-	- There's no reason for _time and _heapreplace to be hidden names;
-	  change these back to time and heapreplace.
-	
-	- Rename main1() to _test() and main2() to main(); when main() sees
-	  there are no options or arguments, it runs _test().
-	
-	- Get rid of a list comprehension from clearjunk().
-	
-	- Put wordinfo.get as a local variable in _add_msg().
-	
-2002-08-20 15:16  tim_one
-
-	* GBayes.py (1.5):
-
-	Neutral typo repairs, except that clearjunk() has a better chance of
-	not blowing up immediately now .
-	
-2002-08-20 13:49  montanaro
-
-	* GBayes.py (1.4):
-
-	help make it more easily executable... ;-)
-	
-2002-08-20 09:32  bwarsaw
-
-	* GBayes.py (1.3):
-
-	Lots of hacks great and small to the main() program, but I didn't
-	touch the guts of the algorithm.
-	
-	Added a module docstring/usage message.
-	
-	Added a bunch of switches to train the system on an mbox of known good
-	and known spam messages (using PortableUnixMailbox only for now).
-	Uses the email package but does no decoding of message bodies.  Also,
-	allows you to specify a file for pickling the training data, and for
-	setting a threshold, above which messages get an X-Bayes-Score
-	header.  Also output messages (marked and unmarked) to an output file
-	for retraining.
-	
-	Print some statistics at the end.
-	
-2002-08-20 05:43  tim_one
-
-	* GBayes.py (1.2):
-
-	Turned off debugging vrbl mistakenly checked in at True.
-	
-	unlearn():  Gave this an update_probabilities=True default arg, for
-	symmetry with learn().
-	
-2002-08-20 03:33  tim_one
-
-	* GBayes.py (1.1):
-
-	An implementation of Paul Graham's Bayes-like spam classifier.
-
-
Copied: trunk/website/presfchangelog.ht (from rev 3155, trunk/website/prefschangelog.ht)
===================================================================
--- trunk/website/presfchangelog.ht	(rev 0)
+++ trunk/website/presfchangelog.ht	2007-07-25 13:51:11 UTC (rev 3156)
@@ -0,0 +1,905 @@
+
+Pre-Sourceforge ChangeLog
+
+This changelog lists the commits on the spambayes project before the
+separate project was set up.  See also the old CVS repository, but don't
+forget that it's now out of date, and you probably want to be looking at
+the current CVS.
+
+2002-09-06 02:27  tim_one
+
+	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
+	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
+	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
+
+	This code has been moved to a new SourceForge project (spambayes).
+	
+2002-09-05 15:37  tim_one
+
+	* classifier.py (1.11):
+
+	Added note about MINCOUNT oddities.
+	
+2002-09-05 14:32  tim_one
+
+	* timtest.py (1.17):
+
+	Added note about word length.
+	
+2002-09-05 13:48  tim_one
+
+	* timtest.py (1.16):
+
+	tokenize_word():  Oops!  This was awfully permissive in what it
+	took as being "an email address".  Tightened that, and also
+	avoided 5-gram'ing of email addresses w/ high-bit characters.
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.050  lost
+	    0.075  0.075  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   0 times
+	tied 19 times
+	lost  1 times
+	
+	total unique fp went from 7 to 8
+	
+	false negative percentages
+	    0.764  0.691  won
+	    0.691  0.655  won
+	    0.981  0.945  won
+	    1.309  1.309  tied
+	    1.418  1.164  won
+	    0.873  0.800  won
+	    0.800  0.763  won
+	    1.163  1.163  tied
+	    1.491  1.345  won
+	    1.200  1.127  won
+	    1.381  1.345  won
+	    1.454  1.490  lost
+	    1.164  0.909  won
+	    0.655  0.582  won
+	    0.655  0.691  lost
+	    1.163  1.163  tied
+	    1.200  1.018  won
+	    0.982  0.873  won
+	    0.982  0.909  won
+	    1.236  1.127  won
+	
+	won  15 times
+	tied  3 times
+	lost  2 times
+	
+	total unique fn went from 260 to 249
+	
+	Note:  Each of the two losses there consists of just 1 msg difference.
+	The wins are bigger as well as being more common, and 260-249 = 11
+	spams no longer sneak by any run (which is more than 4% of the 260
+	spams that used to sneak thru!).
+	
+2002-09-05 11:51  tim_one
+
+	* classifier.py (1.10):
+
+	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
+	really matter; leaving it alone.
+	
+2002-09-05 10:02  tim_one
+
+	* classifier.py (1.9):
+
+	A now-rare pure win, changing spamprob() to work harder to find more
+	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
+	column, after in the right:
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.075  0.075  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.075  0.025  won
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   1 times
+	tied 19 times
+	lost  0 times
+	
+	total unique fp went from 9 to 7
+	
+	false negative percentages
+	    0.909  0.764  won
+	    0.800  0.691  won
+	    1.091  0.981  won
+	    1.381  1.309  won
+	    1.491  1.418  won
+	    1.055  0.873  won
+	    0.945  0.800  won
+	    1.236  1.163  won
+	    1.564  1.491  won
+	    1.200  1.200  tied
+	    1.454  1.381  won
+	    1.599  1.454  won
+	    1.236  1.164  won
+	    0.800  0.655  won
+	    0.836  0.655  won
+	    1.236  1.163  won
+	    1.236  1.200  won
+	    1.055  0.982  won
+	    1.127  0.982  won
+	    1.381  1.236  won
+	
+	won  19 times
+	tied  1 times
+	lost  0 times
+	
+	total unique fn went from 284 to 260
+	
+2002-09-04 11:21  tim_one
+
+	* timtest.py (1.15):
+
+	Augmented the spam callback to display spams with low probability.
+	
+2002-09-04 09:53  tim_one
+
+	* Tester.py (1.3), timtest.py (1.14):
+
+	Added support for simple histograms of the probability distributions for
+	ham and spam.
+	
+2002-09-03 12:13  tim_one
+
+	* timtest.py (1.13):
+
+	A reluctant "on principle" change no matter what it does to the stats:
+	take a stab at removing HTML decorations from plain text msgs.  See
+	comments for why it's *only* in plain text msgs.  This puts an end to
+	false positives due to text msgs talking *about* HTML.  Surprisingly, it
+	also gets rid of some false negatives.  Not surprisingly, it introduced
+	another small class of false positives due to the dumbass regexp trick
+	used to approximate HTML tag removal removing pieces of text that had
+	nothing to do with HTML tags (e.g., this happened in the middle of a
+	uuencoded .py file in such a way that it just happened to leave behind
+	a string that "looked like" a spam phrase; but before this it looked
+	like a pile of "too long" lines that didn't generate any tokens --
+	it's a nonsense outcome either way).
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.025  lost
+	    0.075  0.075  tied
+	    0.050  0.025  won
+	    0.025  0.025  tied
+	    0.000  0.025  lost
+	    0.050  0.075  lost
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   1 times
+	tied 16 times
+	lost  3 times
+	
+	total unique fp went from 8 to 9
+	
+	false negative percentages
+	    0.945  0.909  won
+	    0.836  0.800  won
+	    1.200  1.091  won
+	    1.418  1.381  won
+	    1.455  1.491  lost
+	    1.091  1.055  won
+	    1.091  0.945  won
+	    1.236  1.236  tied
+	    1.564  1.564  tied
+	    1.236  1.200  won
+	    1.563  1.454  won
+	    1.563  1.599  lost
+	    1.236  1.236  tied
+	    0.836  0.800  won
+	    0.873  0.836  won
+	    1.236  1.236  tied
+	    1.273  1.236  won
+	    1.018  1.055  lost
+	    1.091  1.127  lost
+	    1.490  1.381  won
+	
+	won  12 times
+	tied  4 times
+	lost  4 times
+	
+	total unique fn went from 292 to 284
+	
+2002-09-03 06:57  tim_one
+
+	* classifier.py (1.8):
+
+	Added a new xspamprob() method, which computes the combined probability
+	"correctly", and a long comment block explaining what happened when I
+	tried it.  There's something worth pursuing here (it greatly improves
+	the false negative rate), but this change alone pushes too many marginal
+	hams into the spam camp.
+	
+2002-09-03 05:23  tim_one
+
+	* timtest.py (1.12):
+
+	Made "skip:" tokens shorter.
+	
+	Added a surprising treatment of Organization headers, with a tiny f-n
+	benefit for a tiny cost.  No change in f-p stats.
+	
+	false negative percentages
+	    1.091  0.945  won
+	    0.945  0.836  won
+	    1.236  1.200  won
+	    1.454  1.418  won
+	    1.491  1.455  won
+	    1.091  1.091  tied
+	    1.127  1.091  won
+	    1.236  1.236  tied
+	    1.636  1.564  won
+	    1.345  1.236  won
+	    1.672  1.563  won
+	    1.599  1.563  won
+	    1.236  1.236  tied
+	    0.836  0.836  tied
+	    1.018  0.873  won
+	    1.236  1.236  tied
+	    1.273  1.273  tied
+	    1.055  1.018  won
+	    1.091  1.091  tied
+	    1.527  1.490  won
+	
+	won  13 times
+	tied  7 times
+	lost  0 times
+	
+	total unique fn went from 302 to 292
+	
+2002-09-03 02:18  tim_one
+
+	* timtest.py (1.11):
+
+	tokenize_word():  dropped the prefix from the signature; it's faster
+	to let the caller do it, and this also repaired a bug in one place it
+	was being used (well, a *conceptual* bug anyway, in that the code didn't
+	do what I intended there).  This changes the stats in an insignificant
+	way.  The f-p stats didn't change.  The f-n stats shifted by one message
+	in a few cases:
+	
+	false negative percentages
+	    1.091  1.091  tied
+	    0.945  0.945  tied
+	    1.200  1.236  lost
+	    1.454  1.454  tied
+	    1.491  1.491  tied
+	    1.091  1.091  tied
+	    1.091  1.127  lost
+	    1.236  1.236  tied
+	    1.636  1.636  tied
+	    1.382  1.345  won
+	    1.636  1.672  lost
+	    1.599  1.599  tied
+	    1.236  1.236  tied
+	    0.836  0.836  tied
+	    1.018  1.018  tied
+	    1.236  1.236  tied
+	    1.273  1.273  tied
+	    1.055  1.055  tied
+	    1.091  1.091  tied
+	    1.527  1.527  tied
+	
+	won   1 times
+	tied 16 times
+	lost  3 times
+	
+	total unique unchanged
+	
+2002-09-02 19:30  tim_one
+
+	* timtest.py (1.10):
+
+	Don't ask me why this helps -- I don't really know!  When skipping "long
+	words", generating a token with a brief hint about what and how much got
+	skipped makes a definite improvement in the f-n rate, and doesn't affect
+	the f-p rate at all.  Since experiment said it's a winner, I'm checking
+	it in.  Before (left column) and after (right column):
+	
+	false positive percentages
+	    0.000  0.000  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.075  0.075  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.050  0.050  tied
+	    0.025  0.025  tied
+	    0.025  0.025  tied
+	    0.000  0.000  tied
+	    0.025  0.025  tied
+	    0.050  0.050  tied
+	
+	won   0 times
+	tied 20 times
+	lost  0 times
+	
+	total unique fp went from 8 to 8
+	
+	false negative percentages
+	    1.236  1.091  won
+	    1.164  0.945  won
+	    1.454  1.200  won
+	    1.599  1.454  won
+	    1.527  1.491  won
+	    1.236  1.091  won
+	    1.163  1.091  won
+	    1.309  1.236  won
+	    1.891  1.636  won
+	    1.418  1.382  won
+	    1.745  1.636  won
+	    1.708  1.599  won
+	    1.491  1.236  won
+	    0.836  0.836  tied
+	    1.091  1.018  won
+	    1.309  1.236  won
+	    1.491  1.273  won
+	    1.127  1.055  won
+	    1.309  1.091  won
+	    1.636  1.527  won
+	
+	won  19 times
+	tied  1 times
+	lost  0 times
+	
+	total unique fn went from 336 to 302
+	
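The hint token can be as simple as recording the skipped word's first character and its rough length. The sketch below uses a hypothetical 12-character cutoff and may differ from the real tokenizer's details:

```python
def tokenize_word(word, maxlen=12):
    """Yield the word itself, or a summary token for over-long 'words'."""
    if len(word) <= maxlen:
        yield word
    else:
        # Bucket the length to the nearest 10 so similar skips
        # collapse into the same token.
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

print(list(tokenize_word("x" * 23)))  # ['skip:x 20']
```

The bucketing matters: identical tokens across many messages are what give the classifier something to count.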
+2002-09-02 17:55  tim_one
+
+	* timtest.py (1.9):
+
+	Some comment changes and nesting reduction.
+	
+2002-09-02 11:18  tim_one
+
+	* timtest.py (1.8):
+
+	Fixed some out-of-date comments.
+	
+	Made URL clumping lumpier:  now distinguishes among just "first field",
+	"second field", and "everything else".
+	
+	Changed tag names for email address fields (semantically neutral).
+	
+	Added "From:" line tagging.
+	
+	These add up to an almost pure win.  Before-and-after f-n rates across 20
+	runs:
+	
+	1.418   1.236
+	1.309   1.164
+	1.636   1.454
+	1.854   1.599
+	1.745   1.527
+	1.418   1.236
+	1.381   1.163
+	1.418   1.309
+	2.109   1.891
+	1.491   1.418
+	1.854   1.745
+	1.890   1.708
+	1.818   1.491
+	1.055   0.836
+	1.164   1.091
+	1.599   1.309
+	1.600   1.491
+	1.127   1.127
+	1.164   1.309
+	1.781   1.636
+	
+	It only increased in one run.  The variance appears to have been reduced
+	too (I didn't bother to compute that, though).
+	
+	Before-and-after f-p rates across 20 runs:
+	
+	0.000   0.000
+	0.000   0.000
+	0.075   0.050
+	0.000   0.000
+	0.025   0.025
+	0.050   0.025
+	0.075   0.050
+	0.025   0.025
+	0.025   0.025
+	0.025   0.000
+	0.100   0.075
+	0.050   0.050
+	0.025   0.025
+	0.000   0.000
+	0.075   0.050
+	0.025   0.025
+	0.025   0.025
+	0.000   0.000
+	0.075   0.025
+	0.100   0.050
+	
+	Note that 0.025% is a single message; it's really impossible to *measure*
+	an improvement in the f-p rate anymore with 4000-msg ham sets.
+	
+	Across all 20 runs,
+	
+	the total # of unique f-n fell from 353 to 336
+	the total # of unique f-p fell from 13 to 8
+	
+2002-09-02 10:06  tim_one
+
+	* timtest.py (1.7):
+
+	A number of changes.  The most significant is paying attention to the
+	Subject line (I was wrong before when I said my c.l.py ham corpus was
+	unusable for this due to Mailman-injected decorations).  In all, across
+	my 20 test runs,
+	
+	the total # of unique false positives fell from 23 to 13
+	the total # of unique false negatives rose from 337 to 353
+	
+	Neither result is statistically significant, although I bet the first
+	one would be if I pissed away a few days trying to come up with a more
+	realistic model for what "stat. sig." means here .
+	
+2002-09-01 17:22  tim_one
+
+	* classifier.py (1.7):
+
+	Added a comment block about HAMBIAS experiments.  There's no clearer
+	example of trading off precision against recall, and you can favor either
+	at the expense of the other to any degree you like by fiddling this knob.
+	
+2002-09-01 14:42  tim_one
+
+	* timtest.py (1.6):
+
+	Long new comment block summarizing all my experiments with character
+	n-grams.  Bottom line is that they have nothing going for them, and a
+	lot going against them, under Graham's scheme.  I believe there may
+	still be a place for them in *part* of a word-based tokenizer, though.
+	
+2002-09-01 10:05  tim_one
+
+	* classifier.py (1.6):
+
+	spamprob():  Never count unique words more than once anymore.  Counting
+	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
+	that's now a small drag instead.
+	
+2002-09-01 07:33  tim_one
+
+	* rebal.py (1.3), timtest.py (1.5):
+
+	Folding case is here to stay.  Read the new comments for why.  This may
+	be a bad idea for other languages, though.
+	
+	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
+	http is spam-neutral, but https is a strong spam indicator.  That
+	surprised me.
+	
+2002-09-01 06:47  tim_one
+
+	* classifier.py (1.5):
+
+	spamprob():  Removed useless check that wordstream isn't empty.  For one
+	thing, it didn't work, since wordstream is often an iterator.  Even if
+	it did work, it isn't needed -- the probability of an empty wordstream
+	gets computed as 0.5 based on the total absence of evidence.
+	
+2002-09-01 05:37  tim_one
+
+	* timtest.py (1.4):
+
+	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
+	details).
+	
+2002-09-01 05:09  tim_one
+
+	* rebal.py (1.2):
+
+	Aha!  Staring at the checkin msg revealed a logic bug that explains why
+	my ham directories sometimes remained unbalanced after running this --
+	if the randomly selected reservoir msg turned out to be spam, it wasn't
+	pushing the too-small directory on the stack again.
+	
+2002-09-01 04:56  tim_one
+
+	* timtest.py (1.3):
+
+	textparts():  This was failing to weed out redundant HTML in cases like
+	this:
+	
+	    multipart/alternative
+	        text/plain
+	        multipart/related
+	            text/html
+	
+	The tokenizer here also transforms everything to lowercase, but that's
+	an accident due simply to that I'm testing that now.  Can't say for
+	sure until the test runs end, but so far it looks like a bad idea for
+	the false positive rate.
+	
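The fix amounts to treating any text/html beneath a multipart/alternative as redundant whenever a text/plain alternative exists, however deeply the HTML is nested. A sketch of that idea using the email package (an illustration, not the original timtest.py code):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def textparts(msg):
    """Return leaf text parts, dropping text/html parts that merely
    duplicate a text/plain sibling inside a multipart/alternative."""
    texts, redundant_html = set(), set()
    for part in msg.walk():
        if part.get_content_type() == "multipart/alternative":
            plain, html = set(), set()
            for sub in part.walk():
                if sub.get_content_type() == "text/plain":
                    plain.add(sub)
                elif sub.get_content_type() == "text/html":
                    html.add(sub)
            if plain:  # the HTML rendering adds nothing new
                redundant_html |= html
        elif part.get_content_maintype() == "text":
            texts.add(part)
    return texts - redundant_html

# The problem case from the checkin: HTML nested under multipart/related.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("hello", "plain"))
related = MIMEMultipart("related")
related.attach(MIMEText("<b>hello</b>", "html"))
msg.attach(related)
print([p.get_content_type() for p in textparts(msg)])  # ['text/plain']
```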
+2002-09-01 04:52  tim_one
+
+	* rebal.py (1.1):
+
+	A little script I use to rebalance the ham corpora after deleting what
+	turns out to be spam.  I have another Ham/reservoir directory with a
+	few thousand randomly selected msgs from the presumably-good archive.
+	These aren't used in scoring or training.  This script marches over all
+	the ham corpora directories that are used, and if any have gotten too
+	big (this never happens anymore) deletes msgs at random from them, and
+	if any have gotten too small plugs the holes by moving in random
+	msgs from the reservoir.
+	
+2002-09-01 03:25  tim_one
+
+	* classifier.py (1.4), timtest.py (1.2):
+
+	Boost UNKNOWN_SPAMPROB.
+	# The spam probability assigned to words never seen before.  Graham used
+	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
+	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
+	# negative rate by over 1/3.  The f-p rate increased, but there were so few
+	# f-ps that the increase wasn't statistically significant.  It also caught
+	# 13 more spams erroneously classified as ham.  By eyeball (and common
+	# sense ), this has most effect on very short messages, where there
+	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
+	# completely ignored by spamprob(), in favor of *any* word with *any* prob
+	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
+	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
+	# on the face of it.
+	
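The "completely ignored" claim follows directly from Graham's combining formula, P = prod(p) / (prod(p) + prod(1-p)): a word with prob 0.5 scales numerator and denominator by the same factor. A quick check (math.prod needs Python 3.8+; the word probabilities are made up):

```python
import math

def graham_combine(probs):
    """Combine per-word spam probabilities Graham-style:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    p = math.prod(probs)
    q = math.prod(1.0 - x for x in probs)
    return p / (p + q)

base = graham_combine([0.99, 0.2])
with_unknown = graham_combine([0.99, 0.2, 0.5])
print(abs(base - with_unknown) < 1e-12)  # a 0.5 word changes nothing
```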
+2002-08-31 16:50  tim_one
+
+	* timtest.py (1.1):
+
+	This is a driver I've been using for test runs.  It's specific to my
+	corpus directories, but has useful stuff in it all the same.
+	
+2002-08-31 16:49  tim_one
+
+	* classifier.py (1.3):
+
+	The explanation for these changes was on Python-Dev.  You'll find out
+	why if the moderator approves the msg .
+	
+2002-08-29 07:04  tim_one
+
+	* Tester.py (1.2), classifier.py (1.2):
+
+	Tester.py:  Repaired a comment.  The false_{positive,negative}_rate()
+	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
+	too hard to get motivated to reduce 0.01 <0.1 wink>).
+	
+	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
+	the 15 strongest (word, probability) pairs is returned as well as the
+	overall probability (this is how to find out why a message scored as it
+	did).
+	
+2002-08-28 13:45  montanaro
+
+	* GBayes.py (1.15):
+
+	ehh - it actually didn't work all that well.  the spurious report that it
+	did well was pilot error.  besides, tim's report suggests that a simple
+	str.split() may be the best tokenizer anyway.
+	
+2002-08-28 10:45  montanaro
+
+	* setup.py (1.1):
+
+	trivial little setup.py file - i don't expect most people will be interested
+	in this, but it makes it a tad simpler to work with now that there are two
+	files
+	
+2002-08-28 10:43  montanaro
+
+	* GBayes.py (1.14):
+
+	add simple trigram tokenizer - this seems to yield the best results I've
+	seen so far (but has not been extensively tested)
+	
+2002-08-28 08:10  tim_one
+
+	* Tester.py (1.1):
+
+	A start at a testing class.  There isn't a lot here, but it automates
+	much of the tedium, and as the doctest shows it can already do
+	useful things, like remembering which inputs were misclassified.
+	
+2002-08-27 06:45  tim_one
+
+	* mboxcount.py (1.5):
+
+	Updated stats to what Barry and I both get now.  Fiddled output.
+	
+2002-08-27 05:09  bwarsaw
+
+	* split.py (1.5), splitn.py (1.2):
+
+	_factory(): Return the empty string instead of None in the except
+	clauses, so that for-loops won't break prematurely.  mailbox.py's base
+	class defines an __iter__() that raises a StopIteration on None
+	return.
+	
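To see why None was a bad sentinel: the old mailbox.py iterator treated a None from the factory as end-of-input, so a single unparseable message silently truncated the whole run. A minimal stand-in mimicking that protocol (a hypothetical model, not the Python 2.2 code):

```python
class OldStyleMailbox:
    """Mimics old mailbox.py: iteration ends when the factory returns None."""
    def __init__(self, raw_msgs, factory):
        self._raw, self._factory = raw_msgs, factory
    def __iter__(self):
        for raw in self._raw:
            msg = self._factory(raw)
            if msg is None:  # old mailbox.py stopped iterating here
                return
            yield msg

raw = ["good-1", "UNPARSEABLE", "good-2"]
none_factory = lambda r: None if r == "UNPARSEABLE" else r
empty_factory = lambda r: "" if r == "UNPARSEABLE" else r

print(list(OldStyleMailbox(raw, none_factory)))   # ['good-1'] -- truncated!
print(list(OldStyleMailbox(raw, empty_factory)))  # ['good-1', '', 'good-2']
```

Returning the empty string keeps the for-loop alive past the bad message, which is exactly the checkin's fix.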
+2002-08-27 04:55  tim_one
+
+	* GBayes.py (1.13), mboxcount.py (1.4):
+
+	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
+	
+2002-08-27 04:40  bwarsaw
+
+	* mboxcount.py (1.3):
+
+	Some stats after splitting b/w good messages and unparseable messages
+	
+2002-08-27 04:23  bwarsaw
+
+	* mboxcount.py (1.2):
+
+	_factory(): Use a marker object to distinguish between good messages and
+	unparseable messages.  For some reason, returning None from the except
+	clause in _factory() caused Python 2.2.1 to exit early out of the for
+	loop.
+	
+	main(): Print statistics about both the number of good messages and
+	the number of unparseable messages.
+	
+2002-08-27 03:06  tim_one
+
+	* cleanarch (1.2):
+
+	"From " is a header more than a separator, so don't bump the msg count
+	at the end.
+	
+2002-08-24 01:42  tim_one
+
+	* GBayes.py (1.12), classifier.py (1.1):
+
+	Moved all the interesting code that was in the *original* GBayes.py into
+	a new classifier.py.  It was designed to have a very clean interface,
+	and there's no reason to keep slamming everything into one file.  The
+	ever-growing tokenizer stuff should probably also be split out, leaving
+	GBayes.py a pure driver.
+	
+	Also repaired _test() (Skip's checkin left it without a binding for
+	the tokenize function).
+	
+2002-08-24 01:17  tim_one
+
+	* splitn.py (1.1):
+
+	Utility to split an mbox into N random pieces in one gulp.  This gives
+	a convenient way to break a giant corpus into multiple files that can
+	then be used independently across multiple training and testing runs.
+	It's important to do multiple runs on different random samples to avoid
+	drawing conclusions based on accidents in a single random training corpus;
+	if the algorithm is robust, it should have similar performance across
+	all runs.
+	
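The splitting step described above amounts to dealing each message into one of N piles at random; a rough sketch on plain strings (the real splitn.py works on mbox files):

```python
import random

# Deal a corpus into n random pieces in one pass, so the pieces can be
# used independently across multiple training/testing runs.  Messages
# here are plain strings standing in for mbox messages.
def splitn(messages, n, seed=None):
    rng = random.Random(seed)
    pieces = [[] for _ in range(n)]
    for msg in messages:
        rng.choice(pieces).append(msg)
    return pieces

corpus = ["msg%d" % i for i in range(100)]
pieces = splitn(corpus, 4, seed=1)
# every message lands in exactly one piece
```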
+2002-08-24 00:25  montanaro
+
+	* GBayes.py (1.11):
+
+	Allow command line specification of tokenize functions
+	    run w/ -t flag to override default tokenize function
+	    run w/ -H flag to see list of tokenize functions
+	
+	When adding a new tokenizer, make docstring a short description and add a
+	key/value pair to the tokenizers dict.  The key is what the user specifies.
+	The value is a tokenize function.
+	
+	Added two new tokenizers - tokenize_wordpairs_foldcase and
+	tokenize_words_and_pairs.  It's not obvious that either is better than any
+	of the preexisting functions.
+	
+	Should probably add info to the pickle which indicates the tokenizing
+	function used to build it.  This could then be the default for spam
+	detection runs.
+	
+	Next step is to drive this with spam/non-spam corpora, selecting each of the
+	various tokenizer functions, and presenting the results in tabular form.
+	
+2002-08-23 13:10  tim_one
+
+	* GBayes.py (1.10):
+
+	spamprob():  Commented some subtleties.
+	
+	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
+	is that you can't delete entries from a dict that's being crawled over
+	by .iteritems(), which is why I (I suddenly recall) materialized a
+	list of words to be deleted the first time I wrote this.  It's a lot
+	better to materialize a list of to-be-deleted words than to materialize
+	the entire database in a dict.items() list.
+	
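The subtlety behind that clearjunk() fix is a general Python constraint: you cannot delete keys from a dict while iterating over it. A small sketch of the materialize-then-delete pattern; the `wordinfo` contents and the count threshold are illustrative placeholders:

```python
# You can't delete entries from a dict you're iterating over, so
# materialize the (small) list of doomed keys first, then delete them
# in a second pass -- much cheaper than materializing the whole
# database via dict.items().
wordinfo = {"cheap": 3, "viagra": 1, "python": 250, "meeting": 40}

# First pass: collect the words to delete without touching the dict.
to_delete = [word for word, count in wordinfo.items() if count < 5]

# Second pass: safe to mutate now.
for word in to_delete:
    del wordinfo[word]

print(sorted(wordinfo))  # -> ['meeting', 'python']
```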
+2002-08-23 12:36  tim_one
+
+	* mboxcount.py (1.1):
+
+	Utility to count and display the # of msgs in (one or more) Unix mboxes.
+	
+2002-08-23 12:11  tim_one
+
+	* split.py (1.4):
+
+	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
+	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
+	the latter invents an extra newline.
+	
+2002-08-22 07:01  tim_one
+
+	* GBayes.py (1.9):
+
+	Renamed "modtime" to "atime", to better reflect its meaning, and added a
+	comment block to explain that better.
+	
+2002-08-21 08:07  bwarsaw
+
+	* split.py (1.3):
+
+	Guido suggests a different order for the positional args.
+	
+2002-08-21 07:37  bwarsaw
+
+	* split.py (1.2):
+
+	Get rid of the -1 and -2 arguments and make them positional.
+	
+2002-08-21 07:18  bwarsaw
+
+	* split.py (1.1):
+
+	A simple mailbox splitter
+	
+2002-08-21 06:42  tim_one
+
+	* GBayes.py (1.8):
+
+	Added a bunch of simple tokenizers.  The originals are renamed to
+	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
+	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
+	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
+	any of these to be the last word.  When Barry has the test corpus
+	set up it should be easy to let the data tell us which "pure" strategy
+	works best.  Straight character n-grams are very appealing because
+	they're the simplest and most language-neutral; I didn't have any luck
+	with them over the weekend, but the size of my training data was
+	trivial.
+	
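The character n-gram tokenizers named above slide a fixed-width window over the text; a rough reconstruction (the names mirror the entry, but the bodies are illustrative, not the originals from GBayes.py):

```python
# Character n-grams: slide a window of width n over the text.
def tokenize_ngram(text, n):
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

# The *_foldcase_wscollapse variants lowercase the text and collapse
# whitespace runs first, so "Buy  NOW" and "buy now" tokenize alike.
def tokenize_5gram_foldcase_wscollapse(text):
    collapsed = " ".join(text.lower().split())
    return tokenize_ngram(collapsed, 5)

print(list(tokenize_5gram_foldcase_wscollapse("Buy  NOW")))
# -> ['buy n', 'uy no', 'y now']
```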
+2002-08-21 05:08  bwarsaw
+
+	* cleanarch (1.1):
+
+	An archive cleaner, adapted from the Mailman 2.1b3 version, but
+	de-Mailman-ified.
+	
+2002-08-21 04:44  gvanrossum
+
+	* GBayes.py (1.7):
+
+	Indent repair in clearjunk().
+	
+2002-08-21 04:22  gvanrossum
+
+	* GBayes.py (1.6):
+
+	Some minor cleanup:
+	
+	- Move the identifying comment to the top, clarify it a bit, and add
+	  author info.
+	
+	- There's no reason for _time and _heapreplace to be hidden names;
+	  change these back to time and heapreplace.
+	
+	- Rename main1() to _test() and main2() to main(); when main() sees
+	  there are no options or arguments, it runs _test().
+	
+	- Get rid of a list comprehension from clearjunk().
+	
+	- Put wordinfo.get as a local variable in _add_msg().
+	
+2002-08-20 15:16  tim_one
+
+	* GBayes.py (1.5):
+
+	Neutral typo repairs, except that clearjunk() has a better chance of
+	not blowing up immediately now.
+	
+2002-08-20 13:49  montanaro
+
+	* GBayes.py (1.4):
+
+	help make it more easily executable... ;-)
+	
+2002-08-20 09:32  bwarsaw
+
+	* GBayes.py (1.3):
+
+	Lots of hacks great and small to the main() program, but I didn't
+	touch the guts of the algorithm.
+	
+	Added a module docstring/usage message.
+	
+	Added a bunch of switches to train the system on an mbox of known good
+	and known spam messages (using PortableUnixMailbox only for now).
+	Uses the email package but does no decoding of message bodies.  Also,
+	allows you to specify a file for pickling the training data, and for
+	setting a threshold, above which messages get an X-Bayes-Score
+	header.  Also outputs messages (marked and unmarked) to an output file
+	for retraining.
+	
+	Print some statistics at the end.
+	
+2002-08-20 05:43  tim_one
+
+	* GBayes.py (1.2):
+
+	Turned off debugging vrbl mistakenly checked in at True.
+	
+	unlearn():  Gave this an update_probabilities=True default arg, for
+	symmetry with learn().
+	
+2002-08-20 03:33  tim_one
+
+	* GBayes.py (1.1):
+
+	An implementation of Paul Graham's Bayes-like spam classifier.
+
+
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

From montanaro at users.sourceforge.net  Wed Jul 25 15:49:42 2007
From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net)
Date: Wed, 25 Jul 2007 06:49:42 -0700
Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3155] trunk/website
Message-ID: 

Revision: 3155
http://spambayes.svn.sourceforge.net/spambayes/?rev=3155&view=rev
Author: montanaro
Date: 2007-07-25 06:49:42 -0700 (Wed, 25 Jul 2007)

Log Message:
-----------
cvs -> svn

Modified Paths:
--------------
    trunk/website/applications.ht
    trunk/website/background.ht
    trunk/website/contact.ht
    trunk/website/developer.ht
    trunk/website/docs.ht
    trunk/website/download.ht
    trunk/website/unix.ht
    trunk/website/windows.ht

Added Paths:
-----------
    trunk/website/prefschangelog.ht

Removed Paths:
-------------
    trunk/website/presfchangelog.ht

Modified: trunk/website/applications.ht
===================================================================
--- trunk/website/applications.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/applications.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -25,13 +25,13 @@
  • Python's win32com extensions (win32all-149 or later - currently ActivePython is not suitable) -For more on this, see the README.txt or -about.html file in the spambayes CVS repository's Outlook2000 directory. -

    Alternatively, you can use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

    +For more on this, see the README.txt or +about.html file in the spambayes Subversion repository's Outlook2000 directory. +

    Alternatively, you can use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

    sb_filter.py

    sb_filter is a command line tool for marking mail as ham or spam. The readme - + includes a guide to integrating it with your mailer (Unix-only instructions at the moment - additions welcome!). Currently it focuses on running sb_filter via procmail.

    @@ -44,11 +44,11 @@

    Availability

    Download the source archive.

    -

    Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

    +

    Alternatively, use Subversion to get the code. Go to the Subversion page on the project's SourceForge site for more.

    sb_server.py

    sb_server provides a POP3 proxy which sits between your mail client and your real POP3 server and marks -mail as ham or spam as it passes through. See the README for more. +mail as ham or spam as it passes through. See the README for more. sb_server can also be used with the sb_upload.py script as a procmail (or similar) solution.

    Requirements

    @@ -63,12 +63,12 @@ it.

    Alternatively, to run from source, download the source archive.

    -

    Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

    +

    Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

    sb_imapfilter.py

    imapfilter connects to your imap server and marks mail as ham or spam, moving it to appropriate folders as it arrives. -See the README for more.

    +See the README for more.

    Requirements

      @@ -78,7 +78,7 @@

      Availability

      Download the source archive.

      -

      Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

      +

      Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

      sb_mboxtrain.py

      This application allows you to train incrementally on ham and spam @@ -94,7 +94,7 @@

      Availability

      Download the source archive.

      -

      Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

      +

      Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

      sb_notesfilter.py

      This application allows you to filter Lotus Notes folders, rather like @@ -112,4 +112,4 @@

      Availability

      Download the source archive.

      -

      Alternatively, use CVS to get the code - go to the CVS page on the project's sourceforge site for more.

      +

      Alternatively, use Subversion to get the code - go to the Subversion page on the project's SourceForge site for more.

      Modified: trunk/website/background.ht =================================================================== --- trunk/website/background.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/background.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -250,10 +250,10 @@
    • The discussions then moved to the spambayes mailing list.
    -

    CVS commit messages

    +

    CVS Commit Messages

    Tim Peters has whacked a whole lot of useful information into CVS commit messages. As the project was moved from an obscure corner of the -Python CVS tree, there's actually two sources of CVS commits.

    +Python CVS tree, there are actually two sources of CVS commits.

    • The older CVS repository via view CVS, or the entire changelog. Modified: trunk/website/contact.ht =================================================================== --- trunk/website/contact.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/contact.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -34,7 +34,7 @@ This list is also unmoderated.
    • -CVS commit messages go to the list spambayes-checkins. +Subversion commit messages go to the list spambayes-checkins. You shouldn't send email to this list; a program running at SourceForge automatically creates and sends emails to this list as a result of code checkins. Modified: trunk/website/developer.ht =================================================================== --- trunk/website/developer.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/developer.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -5,10 +5,9 @@

      Developer info

      So you want to get involved?

      Running the code

      -

      This project works with Python 2.2, Python 2.3, Python 2.4, -or on the bleeding edge of python code, -available from CVS on -sourceforge. It will not work on python 2.1.x or earlier, nor is it ever +

      This project works with Python 2.2 or later, +available from Subversion on +SourceForge. It will not work on python 2.1.x or earlier, nor is it ever likely to do so.

      If you're running Python 2.2 or 2.2.1, you'll need to separately fetch the latest email package. You can get @@ -17,7 +16,7 @@ (you'll need version 2.4.3 or later - version 3.0 or later is recommended).

      -

      The SpamBayes code itself is also available via CVS, or from the download page. +

The SpamBayes code itself is also available via Subversion, or from the download page.

      I just want to make suggestions

      Modified: trunk/website/docs.ht =================================================================== --- trunk/website/docs.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/docs.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -10,13 +10,13 @@ and generally help each other out. It would be great to see documentation improvements, hints and tips, scripts and recipes, and anything else (related to SpamBayes) that takes your fancy added here.
    • -
    • Instructions on installing Spambayes and integrating it into your mail system.
    • -
    • The Outlook plugin includes an "About" File, and a +
    • Instructions on installing Spambayes and integrating it into your mail system.
    • +
• The Outlook plugin includes an "About" File, and a "Troubleshooting Guide" that can be accessed via the toolbar. (Note that the online documentation is always for the latest source version, and so might not correspond exactly with the version you are using. Always start with the documentation that came with the version you installed.)
    • -
    • The README-DEVEL.txt information that should be of use to people planning on developing code based on SpamBayes.
    • -
    • The TESTING.txt file -- Clues about the practice of statistical testing, adapted from Tim +
    • The README-DEVEL.txt information that should be of use to people planning on developing code based on SpamBayes.
    • +
    • The TESTING.txt file -- Clues about the practice of statistical testing, adapted from Tim comments on python-dev.
    • There are also a vast number of clues and notes scattered as block comments through the code.
    Modified: trunk/website/download.ht =================================================================== --- trunk/website/download.ht 2007-07-24 00:04:32 UTC (rev 3154) +++ trunk/website/download.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -38,7 +38,7 @@

    Prerequisites:

      Either: -
    • Python 2.2.2 or above, or a CVS build of python, or +
    • Python 2.2.2 or above, or a Subversion build of python, or
    • Python 2.2, 2.2.1, plus the latest email package.

    Once you've downloaded and unpacked the source archive, do the regular setup.py build; setup.py install dance, then: @@ -106,10 +106,8 @@

    These instructions are geared to GnuPG and command-line weenies. Suggestions are welcome for other OpenPGP applications.

    -

    CVS Access

    -

    The code is currently available from sourceforge's CVS server - -see here for -more details. Note that due to capacity problems with Sourceforge, -the public CVS servers often run up to 48 hours behind the real CVS -servers. This is something that SF are working on improving. +

    Subversion Access

    + +

The code is currently available from the SourceForge Subversion server.

    Copied: trunk/website/prefschangelog.ht (from rev 3151, trunk/website/presfchangelog.ht) =================================================================== --- trunk/website/prefschangelog.ht (rev 0) +++ trunk/website/prefschangelog.ht 2007-07-25 13:49:42 UTC (rev 3155) @@ -0,0 +1,905 @@ +

    Pre-Sourceforge ChangeLog

    +

    This changelog lists the commits on the spambayes projects before the + separate project was set up. See also the +old CVS repository, but don't forget that it's now out of date, and you probably want to be looking at the current CVS. +

    +
    +2002-09-06 02:27  tim_one
    +
    +	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
    +	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
    +	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
    +
    +	This code has been moved to a new SourceForge project (spambayes).
    +	
    +2002-09-05 15:37  tim_one
    +
    +	* classifier.py (1.11):
    +
    +	Added note about MINCOUNT oddities.
    +	
    +2002-09-05 14:32  tim_one
    +
    +	* timtest.py (1.17):
    +
    +	Added note about word length.
    +	
    +2002-09-05 13:48  tim_one
    +
    +	* timtest.py (1.16):
    +
    +	tokenize_word():  Oops!  This was awfully permissive in what it
    +	took as being "an email address".  Tightened that, and also
    +	avoided 5-gram'ing of email addresses w/ high-bit characters.
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.050  lost
    +	    0.075  0.075  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   0 times
    +	tied 19 times
    +	lost  1 times
    +	
    +	total unique fp went from 7 to 8
    +	
    +	false negative percentages
    +	    0.764  0.691  won
    +	    0.691  0.655  won
    +	    0.981  0.945  won
    +	    1.309  1.309  tied
    +	    1.418  1.164  won
    +	    0.873  0.800  won
    +	    0.800  0.763  won
    +	    1.163  1.163  tied
    +	    1.491  1.345  won
    +	    1.200  1.127  won
    +	    1.381  1.345  won
    +	    1.454  1.490  lost
    +	    1.164  0.909  won
    +	    0.655  0.582  won
    +	    0.655  0.691  lost
    +	    1.163  1.163  tied
    +	    1.200  1.018  won
    +	    0.982  0.873  won
    +	    0.982  0.909  won
    +	    1.236  1.127  won
    +	
    +	won  15 times
    +	tied  3 times
    +	lost  2 times
    +	
    +	total unique fn went from 260 to 249
    +	
+	Note:  Each of the two losses there consists of just 1 msg difference.
    +	The wins are bigger as well as being more common, and 260-249 = 11
    +	spams no longer sneak by any run (which is more than 4% of the 260
    +	spams that used to sneak thru!).
    +	
    +2002-09-05 11:51  tim_one
    +
    +	* classifier.py (1.10):
    +
    +	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
    +	really matter; leaving it alone.
    +	
    +2002-09-05 10:02  tim_one
    +
    +	* classifier.py (1.9):
    +
    +	A now-rare pure win, changing spamprob() to work harder to find more
    +	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
    +	column, after in the right:
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.075  0.075  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.075  0.025  won
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   1 times
    +	tied 19 times
    +	lost  0 times
    +	
    +	total unique fp went from 9 to 7
    +	
    +	false negative percentages
    +	    0.909  0.764  won
    +	    0.800  0.691  won
    +	    1.091  0.981  won
    +	    1.381  1.309  won
    +	    1.491  1.418  won
    +	    1.055  0.873  won
    +	    0.945  0.800  won
    +	    1.236  1.163  won
    +	    1.564  1.491  won
    +	    1.200  1.200  tied
    +	    1.454  1.381  won
    +	    1.599  1.454  won
    +	    1.236  1.164  won
    +	    0.800  0.655  won
    +	    0.836  0.655  won
    +	    1.236  1.163  won
    +	    1.236  1.200  won
    +	    1.055  0.982  won
    +	    1.127  0.982  won
    +	    1.381  1.236  won
    +	
    +	won  19 times
    +	tied  1 times
    +	lost  0 times
    +	
    +	total unique fn went from 284 to 260
    +	
    +2002-09-04 11:21  tim_one
    +
    +	* timtest.py (1.15):
    +
    +	Augmented the spam callback to display spams with low probability.
    +	
    +2002-09-04 09:53  tim_one
    +
    +	* Tester.py (1.3), timtest.py (1.14):
    +
    +	Added support for simple histograms of the probability distributions for
    +	ham and spam.
    +	
    +2002-09-03 12:13  tim_one
    +
    +	* timtest.py (1.13):
    +
    +	A reluctant "on principle" change no matter what it does to the stats:
    +	take a stab at removing HTML decorations from plain text msgs.  See
    +	comments for why it's *only* in plain text msgs.  This puts an end to
    +	false positives due to text msgs talking *about* HTML.  Surprisingly, it
    +	also gets rid of some false negatives.  Not surprisingly, it introduced
    +	another small class of false positives due to the dumbass regexp trick
    +	used to approximate HTML tag removal removing pieces of text that had
    +	nothing to do with HTML tags (e.g., this happened in the middle of a
+	uuencoded .py file in such a way that it just happened to leave behind
    +	a string that "looked like" a spam phrase; but before this it looked
    +	like a pile of "too long" lines that didn't generate any tokens --
    +	it's a nonsense outcome either way).
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.025  lost
    +	    0.075  0.075  tied
    +	    0.050  0.025  won
    +	    0.025  0.025  tied
    +	    0.000  0.025  lost
    +	    0.050  0.075  lost
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   1 times
    +	tied 16 times
    +	lost  3 times
    +	
    +	total unique fp went from 8 to 9
    +	
    +	false negative percentages
    +	    0.945  0.909  won
    +	    0.836  0.800  won
    +	    1.200  1.091  won
    +	    1.418  1.381  won
    +	    1.455  1.491  lost
    +	    1.091  1.055  won
    +	    1.091  0.945  won
    +	    1.236  1.236  tied
    +	    1.564  1.564  tied
    +	    1.236  1.200  won
    +	    1.563  1.454  won
    +	    1.563  1.599  lost
    +	    1.236  1.236  tied
    +	    0.836  0.800  won
    +	    0.873  0.836  won
    +	    1.236  1.236  tied
    +	    1.273  1.236  won
    +	    1.018  1.055  lost
    +	    1.091  1.127  lost
    +	    1.490  1.381  won
    +	
    +	won  12 times
    +	tied  4 times
    +	lost  4 times
    +	
    +	total unique fn went from 292 to 284
    +	
    +2002-09-03 06:57  tim_one
    +
    +	* classifier.py (1.8):
    +
    +	Added a new xspamprob() method, which computes the combined probability
    +	"correctly", and a long comment block explaining what happened when I
    +	tried it.  There's something worth pursuing here (it greatly improves
    +	the false negative rate), but this change alone pushes too many marginal
+	hams into the spam camp.
    +	
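The combining step being tuned in these entries is Graham's rule for merging per-word probabilities into one message score; a minimal sketch of just that step (the real spamprob() also clamps counts and selects only the strongest MAX_DISCRIMINATORS clues first):

```python
# Graham-style combining: score = prod(p) / (prod(p) + prod(1 - p))
# over the per-word spam probabilities of the selected clues.
def combine(probs):
    prod_p = 1.0
    prod_not_p = 1.0
    for p in probs:
        prod_p *= p
        prod_not_p *= 1.0 - p
    return prod_p / (prod_p + prod_not_p)

print(combine([0.5, 0.5, 0.5]))    # no evidence either way -> 0.5
print(combine([0.99, 0.99, 0.2]))  # two strong spam clues dominate
```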
    +2002-09-03 05:23  tim_one
    +
    +	* timtest.py (1.12):
    +
    +	Made "skip:" tokens shorter.
    +	
    +	Added a surprising treatment of Organization headers, with a tiny f-n
    +	benefit for a tiny cost.  No change in f-p stats.
    +	
    +	false negative percentages
    +	    1.091  0.945  won
    +	    0.945  0.836  won
    +	    1.236  1.200  won
    +	    1.454  1.418  won
    +	    1.491  1.455  won
    +	    1.091  1.091  tied
    +	    1.127  1.091  won
    +	    1.236  1.236  tied
    +	    1.636  1.564  won
    +	    1.345  1.236  won
    +	    1.672  1.563  won
    +	    1.599  1.563  won
    +	    1.236  1.236  tied
    +	    0.836  0.836  tied
    +	    1.018  0.873  won
    +	    1.236  1.236  tied
    +	    1.273  1.273  tied
    +	    1.055  1.018  won
    +	    1.091  1.091  tied
    +	    1.527  1.490  won
    +	
    +	won  13 times
    +	tied  7 times
    +	lost  0 times
    +	
    +	total unique fn went from 302 to 292
    +	
    +2002-09-03 02:18  tim_one
    +
    +	* timtest.py (1.11):
    +
    +	tokenize_word():  dropped the prefix from the signature; it's faster
    +	to let the caller do it, and this also repaired a bug in one place it
    +	was being used (well, a *conceptual* bug anyway, in that the code didn't
    +	do what I intended there).  This changes the stats in an insignificant
    +	way.  The f-p stats didn't change.  The f-n stats shifted by one message
    +	in a few cases:
    +	
    +	false negative percentages
    +	    1.091  1.091  tied
    +	    0.945  0.945  tied
    +	    1.200  1.236  lost
    +	    1.454  1.454  tied
    +	    1.491  1.491  tied
    +	    1.091  1.091  tied
    +	    1.091  1.127  lost
    +	    1.236  1.236  tied
    +	    1.636  1.636  tied
    +	    1.382  1.345  won
    +	    1.636  1.672  lost
    +	    1.599  1.599  tied
    +	    1.236  1.236  tied
    +	    0.836  0.836  tied
    +	    1.018  1.018  tied
    +	    1.236  1.236  tied
    +	    1.273  1.273  tied
    +	    1.055  1.055  tied
    +	    1.091  1.091  tied
    +	    1.527  1.527  tied
    +	
    +	won   1 times
    +	tied 16 times
    +	lost  3 times
    +	
    +	total unique unchanged
    +	
    +2002-09-02 19:30  tim_one
    +
    +	* timtest.py (1.10):
    +
    +	Don't ask me why this helps -- I don't really know!  When skipping "long
    +	words", generating a token with a brief hint about what and how much got
    +	skipped makes a definite improvement in the f-n rate, and doesn't affect
    +	the f-p rate at all.  Since experiment said it's a winner, I'm checking
+	it in.  Before (left column) and after (right column):
    +	
    +	false positive percentages
    +	    0.000  0.000  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.075  0.075  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.050  0.050  tied
    +	    0.025  0.025  tied
    +	    0.025  0.025  tied
    +	    0.000  0.000  tied
    +	    0.025  0.025  tied
    +	    0.050  0.050  tied
    +	
    +	won   0 times
    +	tied 20 times
    +	lost  0 times
    +	
    +	total unique fp went from 8 to 8
    +	
    +	false negative percentages
    +	    1.236  1.091  won
    +	    1.164  0.945  won
    +	    1.454  1.200  won
    +	    1.599  1.454  won
    +	    1.527  1.491  won
    +	    1.236  1.091  won
    +	    1.163  1.091  won
    +	    1.309  1.236  won
    +	    1.891  1.636  won
    +	    1.418  1.382  won
    +	    1.745  1.636  won
    +	    1.708  1.599  won
    +	    1.491  1.236  won
    +	    0.836  0.836  tied
    +	    1.091  1.018  won
    +	    1.309  1.236  won
    +	    1.491  1.273  won
    +	    1.127  1.055  won
    +	    1.309  1.091  won
    +	    1.636  1.527  won
    +	
    +	won  19 times
    +	tied  1 times
    +	lost  0 times
    +	
    +	total unique fn went from 336 to 302
    +	
    +2002-09-02 17:55  tim_one
    +
    +	* timtest.py (1.9):
    +
    +	Some comment changes and nesting reduction.
    +	
    +2002-09-02 11:18  tim_one
    +
    +	* timtest.py (1.8):
    +
    +	Fixed some out-of-date comments.
    +	
    +	Made URL clumping lumpier:  now distinguishes among just "first field",
    +	"second field", and "everything else".
    +	
    +	Changed tag names for email address fields (semantically neutral).
    +	
    +	Added "From:" line tagging.
    +	
    +	These add up to an almost pure win.  Before-and-after f-n rates across 20
    +	runs:
    +	
    +	1.418   1.236
    +	1.309   1.164
    +	1.636   1.454
    +	1.854   1.599
    +	1.745   1.527
    +	1.418   1.236
    +	1.381   1.163
    +	1.418   1.309
    +	2.109   1.891
    +	1.491   1.418
    +	1.854   1.745
    +	1.890   1.708
    +	1.818   1.491
    +	1.055   0.836
    +	1.164   1.091
    +	1.599   1.309
    +	1.600   1.491
    +	1.127   1.127
    +	1.164   1.309
    +	1.781   1.636
    +	
    +	It only increased in one run.  The variance appears to have been reduced
    +	too (I didn't bother to compute that, though).
    +	
    +	Before-and-after f-p rates across 20 runs:
    +	
    +	0.000   0.000
    +	0.000   0.000
    +	0.075   0.050
    +	0.000   0.000
    +	0.025   0.025
    +	0.050   0.025
    +	0.075   0.050
    +	0.025   0.025
    +	0.025   0.025
    +	0.025   0.000
    +	0.100   0.075
    +	0.050   0.050
    +	0.025   0.025
    +	0.000   0.000
    +	0.075   0.050
    +	0.025   0.025
    +	0.025   0.025
    +	0.000   0.000
    +	0.075   0.025
    +	0.100   0.050
    +	
    +	Note that 0.025% is a single message; it's really impossible to *measure*
    +	an improvement in the f-p rate anymore with 4000-msg ham sets.
    +	
    +	Across all 20 runs,
    +	
    +	the total # of unique f-n fell from 353 to 336
    +	the total # of unique f-p fell from 13 to 8
    +	
    +2002-09-02 10:06  tim_one
    +
    +	* timtest.py (1.7):
    +
    +	A number of changes.  The most significant is paying attention to the
    +	Subject line (I was wrong before when I said my c.l.py ham corpus was
    +	unusable for this due to Mailman-injected decorations).  In all, across
    +	my 20 test runs,
    +	
    +	the total # of unique false positives fell from 23 to 13
    +	the total # of unique false negatives rose from 337 to 353
    +	
    +	Neither result is statistically significant, although I bet the first
    +	one would be if I pissed away a few days trying to come up with a more
+	realistic model for what "stat. sig." means here.
    +	
    +2002-09-01 17:22  tim_one
    +
    +	* classifier.py (1.7):
    +
    +	Added a comment block about HAMBIAS experiments.  There's no clearer
    +	example of trading off precision against recall, and you can favor either
    +	at the expense of the other to any degree you like by fiddling this knob.
    +	
    +2002-09-01 14:42  tim_one
    +
    +	* timtest.py (1.6):
    +
    +	Long new comment block summarizing all my experiments with character
    +	n-grams.  Bottom line is that they have nothing going for them, and a
    +	lot going against them, under Graham's scheme.  I believe there may
    +	still be a place for them in *part* of a word-based tokenizer, though.
    +	
    +2002-09-01 10:05  tim_one
    +
    +	* classifier.py (1.6):
    +
    +	spamprob():  Never count unique words more than once anymore.  Counting
    +	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
    +	that's now a small drag instead.
    +	
    +2002-09-01 07:33  tim_one
    +
    +	* rebal.py (1.3), timtest.py (1.5):
    +
    +	Folding case is here to stay.  Read the new comments for why.  This may
    +	be a bad idea for other languages, though.
    +	
    +	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
    +	http is spam-neutral, but https is a strong spam indicator.  That
    +	surprised me.
    +	
    +2002-09-01 06:47  tim_one
    +
    +	* classifier.py (1.5):
    +
    +	spamprob():  Removed useless check that wordstream isn't empty.  For one
    +	thing, it didn't work, since wordstream is often an iterator.  Even if
    +	it did work, it isn't needed -- the probability of an empty wordstream
    +	gets computed as 0.5 based on the total absence of evidence.
    +	
    +2002-09-01 05:37  tim_one
    +
    +	* timtest.py (1.4):
    +
    +	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
    +	details).
    +	
    +2002-09-01 05:09  tim_one
    +
    +	* rebal.py (1.2):
    +
    +	Aha!  Staring at the checkin msg revealed a logic bug that explains why
    +	my ham directories sometimes remained unbalanced after running this --
    +	if the randomly selected reservoir msg turned out to be spam, it wasn't
    +	pushing the too-small directory on the stack again.
    +	
    +2002-09-01 04:56  tim_one
    +
    +	* timtest.py (1.3):
    +
    +	textparts():  This was failing to weed out redundant HTML in cases like
    +	this:
    +	
    +	    multipart/alternative
    +	        text/plain
    +	        multipart/related
    +	            text/html
    +	
    +	The tokenizer here also transforms everything to lowercase, but that's
    +	an accident due simply to the fact that I'm testing that now.  Can't say for
    +	sure until the test runs end, but so far it looks like a bad idea for
    +	the false positive rate.
    +	
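The part-weeding idea above can be sketched with the stdlib email package: drop a text/html part whenever a text/plain sibling exists anywhere under the same multipart/alternative, however deeply the HTML is nested.  This is an illustrative reconstruction of the idea, not the checked-in textparts():

```python
import email

def textparts(msg):
    """Return the text/* parts of msg, minus any text/html part that
    merely duplicates a text/plain sibling inside multipart/alternative."""
    text = set()
    redundant = set()
    for part in msg.walk():
        if part.get_content_type() == "multipart/alternative":
            # Within an alternative group, HTML is redundant if a
            # plain-text rendering of the same content exists below it.
            subparts = list(part.walk())
            if any(p.get_content_type() == "text/plain" for p in subparts):
                redundant.update(p for p in subparts
                                 if p.get_content_type() == "text/html")
        elif part.get_content_maintype() == "text":
            text.add(part)
    return text - redundant

# The problem case from the entry: text/html hiding under multipart/related.
msg = email.message_from_string(
    "Content-Type: multipart/alternative; boundary=b\n\n"
    "--b\nContent-Type: text/plain\n\nhello\n"
    "--b\nContent-Type: multipart/related; boundary=c\n\n"
    "--c\nContent-Type: text/html\n\n<p>hello</p>\n--c--\n"
    "--b--\n")
kept = {p.get_content_type() for p in textparts(msg)}
```

Here the nested text/html is correctly weeded out, leaving only the text/plain part.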
    +2002-09-01 04:52  tim_one
    +
    +	* rebal.py (1.1):
    +
    +	A little script I use to rebalance the ham corpora after deleting what
    +	turns out to be spam.  I have another Ham/reservoir directory with a
    +	few thousand randomly selected msgs from the presumably-good archive.
    +	These aren't used in scoring or training.  This script marches over all
    +	the ham corpora directories that are used, and if any have gotten too
    +	big (this never happens anymore) deletes msgs at random from them, and
    +	if any have gotten too small plugs the holes by moving in random
    +	msgs from the reservoir.
    +	
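The rebalancing procedure described above amounts to: trim oversized directories at random, and plug holes in undersized ones from the reservoir.  A toy in-memory sketch of that job (directory names and the target size are illustrative, and this is not the original rebal.py):

```python
import random

def rebalance(corpora, reservoir, target, rng=random):
    """Bring every corpus to `target` messages.  `corpora` maps a
    directory name to its list of messages; `reservoir` is the spare
    pool of presumed-ham messages."""
    for name, msgs in corpora.items():
        while len(msgs) > target:               # too big: delete at random
            msgs.pop(rng.randrange(len(msgs)))
        while len(msgs) < target and reservoir:  # too small: refill
            msgs.append(reservoir.pop(rng.randrange(len(reservoir))))

corpora = {"Set1": list(range(7)), "Set2": list(range(3))}
reservoir = list(range(100, 110))
rebalance(corpora, reservoir, target=5)
sizes = {k: len(v) for k, v in corpora.items()}
```

After running, both sets hold exactly 5 messages and the reservoir has shrunk by the number of holes plugged.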
    +2002-09-01 03:25  tim_one
    +
    +	* classifier.py (1.4), timtest.py (1.2):
    +
    +	Boost UNKNOWN_SPAMPROB.
    +	# The spam probability assigned to words never seen before.  Graham used
    +	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
    +	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
    +	# negative rate by over 1/3.  The f-p rate increased, but there were so few
    +	# f-ps that the increase wasn't statistically significant.  It also caught
    +	# 13 more spams erroneously classified as ham.  By eyeball (and common
    +	# sense), this has most effect on very short messages, where there
    +	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
    +	# completely ignored by spamprob(), in favor of *any* word with *any* prob
    +	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
    +	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
    +	# on the face of it.
    +	
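Why a 0.5 word is "completely ignored": in Graham-style combining, every clue contributes p to one running product and 1-p to the other, so p = 0.5 scales both products equally and cancels out of the ratio.  A minimal sketch of that combining scheme (not the project's exact spamprob(); only the 15-strongest-clues limit is taken from the text above):

```python
UNKNOWN_SPAMPROB = 0.5   # prob assigned to never-before-seen words
MAX_DISCRIMINATORS = 15  # keep only the strongest clues

def spamprob(wordprobs):
    """Combine per-word spam probabilities Graham-style: rank clues by
    distance from 0.5, then fold the top ones into p and 1-p products."""
    clues = sorted(wordprobs, key=lambda p: abs(p - 0.5), reverse=True)
    prod = inv = 1.0
    for p in clues[:MAX_DISCRIMINATORS]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

# An unknown word at exactly 0.5 multiplies both products by 0.5,
# so the overall score is unchanged -- it is effectively ignored.
with_unknown = spamprob([0.99, 0.2, UNKNOWN_SPAMPROB])
without = spamprob([0.99, 0.2])
```

At 0.2 the unknown word would instead drag the score toward ham, which is the bias the boost removes.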
    +2002-08-31 16:50  tim_one
    +
    +	* timtest.py (1.1):
    +
    +	This is a driver I've been using for test runs.  It's specific to my
    +	corpus directories, but has useful stuff in it all the same.
    +	
    +2002-08-31 16:49  tim_one
    +
    +	* classifier.py (1.3):
    +
    +	The explanation for these changes was on Python-Dev.  You'll find out
    +	why if the moderator approves the msg.
    +	
    +2002-08-29 07:04  tim_one
    +
    +	* Tester.py (1.2), classifier.py (1.2):
    +
    +	Tester.py:  Repaired a comment.  The false_{positive,negative})_rate()
    +	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
    +	too hard to get motivated to reduce 0.01 <0.1 wink>).
    +	
    +	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
    +	the 15 strongest (word, probability) pairs is returned as well as the
    +	overall probability (this is how to find out why a message scored as it
    +	did).
    +	
    +2002-08-28 13:45  montanaro
    +
    +	* GBayes.py (1.15):
    +
    +	ehh - it actually didn't work all that well.  the spurious report that it
    +	did well was pilot error.  besides, tim's report suggests that a simple
    +	str.split() may be the best tokenizer anyway.
    +	
    +2002-08-28 10:45  montanaro
    +
    +	* setup.py (1.1):
    +
    +	trivial little setup.py file - i don't expect most people will be interested
    +	in this, but it makes it a tad simpler to work with now that there are two
    +	files
    +	
    +2002-08-28 10:43  montanaro
    +
    +	* GBayes.py (1.14):
    +
    +	add simple trigram tokenizer - this seems to yield the best results I've
    +	seen so far (but has not been extensively tested)
    +	
    +2002-08-28 08:10  tim_one
    +
    +	* Tester.py (1.1):
    +
    +	A start at a testing class.  There isn't a lot here, but it automates
    +	much of the tedium, and as the doctest shows it can already do
    +	useful things, like remembering which inputs were misclassified.
    +	
    +2002-08-27 06:45  tim_one
    +
    +	* mboxcount.py (1.5):
    +
    +	Updated stats to what Barry and I both get now.  Fiddled output.
    +	
    +2002-08-27 05:09  bwarsaw
    +
    +	* split.py (1.5), splitn.py (1.2):
    +
    +	_factory(): Return the empty string instead of None in the except
    +	clauses, so that for-loops won't break prematurely.  mailbox.py's base
    +	class defines an __iter__() that raises a StopIteration on None
    +	return.
    +	
    +2002-08-27 04:55  tim_one
    +
    +	* GBayes.py (1.13), mboxcount.py (1.4):
    +
    +	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
    +	
    +2002-08-27 04:40  bwarsaw
    +
    +	* mboxcount.py (1.3):
    +
    +	Some stats after splitting b/w good messages and unparseable messages
    +	
    +2002-08-27 04:23  bwarsaw
    +
    +	* mboxcount.py (1.2):
    +
    +	_factory(): Use a marker object to distinguish between good messages and
    +	unparseable messages.  For some reason, returning None from the except
    +	clause in _factory() caused Python 2.2.1 to exit early out of the for
    +	loop.
    +	
    +	main(): Print statistics about both the number of good messages and
    +	the number of unparseable messages.
    +	
    +2002-08-27 03:06  tim_one
    +
    +	* cleanarch (1.2):
    +
    +	"From " is a header more than a separator, so don't bump the msg count
    +	at the end.
    +	
    +2002-08-24 01:42  tim_one
    +
    +	* GBayes.py (1.12), classifier.py (1.1):
    +
    +	Moved all the interesting code that was in the *original* GBayes.py into
    +	a new classifier.py.  It was designed to have a very clean interface,
    +	and there's no reason to keep slamming everything into one file.  The
    +	ever-growing tokenizer stuff should probably also be split out, leaving
    +	GBayes.py a pure driver.
    +	
    +	Also repaired _test() (Skip's checkin left it without a binding for
    +	the tokenize function).
    +	
    +2002-08-24 01:17  tim_one
    +
    +	* splitn.py (1.1):
    +
    +	Utility to split an mbox into N random pieces in one gulp.  This gives
    +	a convenient way to break a giant corpus into multiple files that can
    +	then be used independently across multiple training and testing runs.
    +	It's important to do multiple runs on different random samples to avoid
    +	drawing conclusions based on accidents in a single random training corpus;
    +	if the algorithm is robust, it should have similar performance across
    +	all runs.
    +	
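The shuffle-and-deal idea behind splitn can be sketched over an in-memory list; a real run would parse the mbox into messages first, so this is illustrative rather than the actual script:

```python
import random

def splitn(msgs, n, rng=random):
    """Deal messages into n random, roughly equal pieces, so each piece
    can serve as an independent training/testing sample."""
    shuffled = list(msgs)
    rng.shuffle(shuffled)               # randomize once, in one gulp
    pieces = [[] for _ in range(n)]
    for i, m in enumerate(shuffled):
        pieces[i % n].append(m)         # round-robin deal
    return pieces

pieces = splitn(range(100), 4)
```

Every message lands in exactly one piece, and the pieces differ only by the random shuffle, which is what makes cross-run comparisons meaningful.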
    +2002-08-24 00:25  montanaro
    +
    +	* GBayes.py (1.11):
    +
    +	Allow command line specification of tokenize functions
    +	    run w/ -t flag to override default tokenize function
    +	    run w/ -H flag to see list of tokenize functions
    +	
    +	When adding a new tokenizer, make docstring a short description and add a
    +	key/value pair to the tokenizers dict.  The key is what the user specifies.
    +	The value is a tokenize function.
    +	
    +	Added two new tokenizers - tokenize_wordpairs_foldcase and
    +	tokenize_words_and_pairs.  It's not obvious that either is better than any
    +	of the preexisting functions.
    +	
    +	Should probably add info to the pickle which indicates the tokenizing
    +	function used to build it.  This could then be the default for spam
    +	detection runs.
    +	
    +	Next step is to drive this with spam/non-spam corpora, selecting each of the
    +	various tokenizer functions, and presenting the results in tabular form.
    +	
    +2002-08-23 13:10  tim_one
    +
    +	* GBayes.py (1.10):
    +
    +	spamprob():  Commented some subtleties.
    +	
    +	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
    +	is that you can't delete entries from a dict that's being crawled over
    +	by .iteritems(), which is why I (I suddenly recall) materialized a
    +	list of words to be deleted the first time I wrote this.  It's a lot
    +	better to materialize a list of to-be-deleted words than to materialize
    +	the entire database in a dict.items() list.
    +	
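The iteritems() pitfall above is the classic rule that a dict must not change size while being iterated; the fix is to materialize only the list of doomed keys, not the whole database.  A minimal illustration (the clearjunk/wordinfo names are borrowed from the entry for flavor, not the original code):

```python
def clearjunk(wordinfo, is_junk):
    """Delete junk entries from wordinfo without mutating the dict
    while iterating it: collect the doomed keys first, then delete."""
    doomed = [w for w, info in wordinfo.items() if is_junk(info)]
    for w in doomed:
        del wordinfo[w]

db = {"cheap": 1, "meeting": 42, "viagra": 2}
clearjunk(db, lambda count: count < 10)
```

Deleting inside the `for w, info in wordinfo.items()` loop itself would raise a RuntimeError in modern Python; the doomed list costs memory proportional only to the entries removed.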
    +2002-08-23 12:36  tim_one
    +
    +	* mboxcount.py (1.1):
    +
    +	Utility to count and display the # of msgs in (one or more) Unix mboxes.
    +	
    +2002-08-23 12:11  tim_one
    +
    +	* split.py (1.4):
    +
    +	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
    +	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
    +	the latter invents an extra newline.
    +	
    +2002-08-22 07:01  tim_one
    +
    +	* GBayes.py (1.9):
    +
    +	Renamed "modtime" to "atime", to better reflect its meaning, and added a
    +	comment block to explain that better.
    +	
    +2002-08-21 08:07  bwarsaw
    +
    +	* split.py (1.3):
    +
    +	Guido suggests a different order for the positional args.
    +	
    +2002-08-21 07:37  bwarsaw
    +
    +	* split.py (1.2):
    +
    +	Get rid of the -1 and -2 arguments and make them positional.
    +	
    +2002-08-21 07:18  bwarsaw
    +
    +	* split.py (1.1):
    +
    +	A simple mailbox splitter
    +	
    +2002-08-21 06:42  tim_one
    +
    +	* GBayes.py (1.8):
    +
    +	Added a bunch of simple tokenizers.  The originals are renamed to
    +	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
    +	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
    +	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
    +	any of these to be the last word.  When Barry has the test corpus
    +	set up it should be easy to let the data tell us which "pure" strategy
    +	works best.  Straight character n-grams are very appealing because
    +	they're the simplest and most language-neutral; I didn't have any luck
    +	with them over the weekend, but the size of my training data was
    +	trivial.
    +	
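For reference, tokenizers of the kinds named above are tiny, which is part of the appeal of character n-grams; this is a sketch of the two basic shapes, not the GBayes.py originals:

```python
def tokenize_words(text):
    """Whitespace-split word tokens."""
    return text.split()

def tokenize_ngram(text, n):
    """Overlapping character n-grams: the simplest and most
    language-neutral scheme, at the cost of a much larger vocabulary."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = tokenize_ngram("spam", 3)
```

A 5-gram variant is just tokenize_ngram(text, 5); folding case first gives the *_foldcase flavors.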
    +2002-08-21 05:08  bwarsaw
    +
    +	* cleanarch (1.1):
    +
    +	An archive cleaner, adapted from the Mailman 2.1b3 version, but
    +	de-Mailman-ified.
    +	
    +2002-08-21 04:44  gvanrossum
    +
    +	* GBayes.py (1.7):
    +
    +	Indent repair in clearjunk().
    +	
    +2002-08-21 04:22  gvanrossum
    +
    +	* GBayes.py (1.6):
    +
    +	Some minor cleanup:
    +	
    +	- Move the identifying comment to the top, clarify it a bit, and add
    +	  author info.
    +	
    +	- There's no reason for _time and _heapreplace to be hidden names;
    +	  change these back to time and heapreplace.
    +	
    +	- Rename main1() to _test() and main2() to main(); when main() sees
    +	  there are no options or arguments, it runs _test().
    +	
    +	- Get rid of a list comprehension from clearjunk().
    +	
    +	- Put wordinfo.get as a local variable in _add_msg().
    +	
    +2002-08-20 15:16  tim_one
    +
    +	* GBayes.py (1.5):
    +
    +	Neutral typo repairs, except that clearjunk() has a better chance of
    +	not blowing up immediately now.
    +	
    +2002-08-20 13:49  montanaro
    +
    +	* GBayes.py (1.4):
    +
    +	help make it more easily executable... ;-)
    +	
    +2002-08-20 09:32  bwarsaw
    +
    +	* GBayes.py (1.3):
    +
    +	Lots of hacks great and small to the main() program, but I didn't
    +	touch the guts of the algorithm.
    +	
    +	Added a module docstring/usage message.
    +	
    +	Added a bunch of switches to train the system on an mbox of known good
    +	and known spam messages (using PortableUnixMailbox only for now).
    +	Uses the email package but does no decoding of message bodies.  Also,
    +	allows you to specify a file for pickling the training data, and for
    +	setting a threshold, above which messages get an X-Bayes-Score
    +	header.  Also output messages (marked and unmarked) to an output file
    +	for retraining.
    +	
    +	Print some statistics at the end.
    +	
    +2002-08-20 05:43  tim_one
    +
    +	* GBayes.py (1.2):
    +
    +	Turned off debugging vrbl mistakenly checked in at True.
    +	
    +	unlearn():  Gave this an update_probabilities=True default arg, for
    +	symmetry with learn().
    +	
    +2002-08-20 03:33  tim_one
    +
    +	* GBayes.py (1.1):
    +
    +	An implementation of Paul Graham's Bayes-like spam classifier.
    +
    +
Deleted: trunk/website/presfchangelog.ht
===================================================================
--- trunk/website/presfchangelog.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/presfchangelog.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -1,905 +0,0 @@
    -Pre-Sourceforge ChangeLog
    -
    -This changelog lists the commits on the spambayes project before the
    -separate project was set up.  See also the old CVS repository, but
    -don't forget that it's now out of date, and you probably want to be
    -looking at the current CVS.
    -
    -2002-09-06 02:27  tim_one
    -
    -	* GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
    -	cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
    -	(1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
    -
    -	This code has been moved to a new SourceForge project (spambayes).
    -	
    -2002-09-05 15:37  tim_one
    -
    -	* classifier.py (1.11):
    -
    -	Added note about MINCOUNT oddities.
    -	
    -2002-09-05 14:32  tim_one
    -
    -	* timtest.py (1.17):
    -
    -	Added note about word length.
    -	
    -2002-09-05 13:48  tim_one
    -
    -	* timtest.py (1.16):
    -
    -	tokenize_word():  Oops!  This was awfully permissive in what it
    -	took as being "an email address".  Tightened that, and also
    -	avoided 5-gram'ing of email addresses w/ high-bit characters.
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.050  lost
    -	    0.075  0.075  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   0 times
    -	tied 19 times
    -	lost  1 times
    -	
    -	total unique fp went from 7 to 8
    -	
    -	false negative percentages
    -	    0.764  0.691  won
    -	    0.691  0.655  won
    -	    0.981  0.945  won
    -	    1.309  1.309  tied
    -	    1.418  1.164  won
    -	    0.873  0.800  won
    -	    0.800  0.763  won
    -	    1.163  1.163  tied
    -	    1.491  1.345  won
    -	    1.200  1.127  won
    -	    1.381  1.345  won
    -	    1.454  1.490  lost
    -	    1.164  0.909  won
    -	    0.655  0.582  won
    -	    0.655  0.691  lost
    -	    1.163  1.163  tied
    -	    1.200  1.018  won
    -	    0.982  0.873  won
    -	    0.982  0.909  won
    -	    1.236  1.127  won
    -	
    -	won  15 times
    -	tied  3 times
    -	lost  2 times
    -	
    -	total unique fn went from 260 to 249
    -	
    -	Note:  Each of the two losses there consist of just 1 msg difference.
    -	The wins are bigger as well as being more common, and 260-249 = 11
    -	spams no longer sneak by any run (which is more than 4% of the 260
    -	spams that used to sneak thru!).
    -	
    -2002-09-05 11:51  tim_one
    -
    -	* classifier.py (1.10):
    -
    -	Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
    -	really matter; leaving it alone.
    -	
    -2002-09-05 10:02  tim_one
    -
    -	* classifier.py (1.9):
    -
    -	A now-rare pure win, changing spamprob() to work harder to find more
    -	evidence when competing 0.01 and 0.99 clues appear.  Before in the left
    -	column, after in the right:
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.075  0.075  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.075  0.025  won
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   1 times
    -	tied 19 times
    -	lost  0 times
    -	
    -	total unique fp went from 9 to 7
    -	
    -	false negative percentages
    -	    0.909  0.764  won
    -	    0.800  0.691  won
    -	    1.091  0.981  won
    -	    1.381  1.309  won
    -	    1.491  1.418  won
    -	    1.055  0.873  won
    -	    0.945  0.800  won
    -	    1.236  1.163  won
    -	    1.564  1.491  won
    -	    1.200  1.200  tied
    -	    1.454  1.381  won
    -	    1.599  1.454  won
    -	    1.236  1.164  won
    -	    0.800  0.655  won
    -	    0.836  0.655  won
    -	    1.236  1.163  won
    -	    1.236  1.200  won
    -	    1.055  0.982  won
    -	    1.127  0.982  won
    -	    1.381  1.236  won
    -	
    -	won  19 times
    -	tied  1 times
    -	lost  0 times
    -	
    -	total unique fn went from 284 to 260
    -	
    -2002-09-04 11:21  tim_one
    -
    -	* timtest.py (1.15):
    -
    -	Augmented the spam callback to display spams with low probability.
    -	
    -2002-09-04 09:53  tim_one
    -
    -	* Tester.py (1.3), timtest.py (1.14):
    -
    -	Added support for simple histograms of the probability distributions for
    -	ham and spam.
    -	
    -2002-09-03 12:13  tim_one
    -
    -	* timtest.py (1.13):
    -
    -	A reluctant "on principle" change no matter what it does to the stats:
    -	take a stab at removing HTML decorations from plain text msgs.  See
    -	comments for why it's *only* in plain text msgs.  This puts an end to
    -	false positives due to text msgs talking *about* HTML.  Surprisingly, it
    -	also gets rid of some false negatives.  Not surprisingly, it introduced
    -	another small class of false positives due to the dumbass regexp trick
    -	used to approximate HTML tag removal removing pieces of text that had
    -	nothing to do with HTML tags (e.g., this happened in the middle of a
    -	uuencoded .py file in such a way that it just happened to leave behind
    -	a string that "looked like" a spam phrase; but before this it looked
    -	like a pile of "too long" lines that didn't generate any tokens --
    -	it's a nonsense outcome either way).
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.025  lost
    -	    0.075  0.075  tied
    -	    0.050  0.025  won
    -	    0.025  0.025  tied
    -	    0.000  0.025  lost
    -	    0.050  0.075  lost
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   1 times
    -	tied 16 times
    -	lost  3 times
    -	
    -	total unique fp went from 8 to 9
    -	
    -	false negative percentages
    -	    0.945  0.909  won
    -	    0.836  0.800  won
    -	    1.200  1.091  won
    -	    1.418  1.381  won
    -	    1.455  1.491  lost
    -	    1.091  1.055  won
    -	    1.091  0.945  won
    -	    1.236  1.236  tied
    -	    1.564  1.564  tied
    -	    1.236  1.200  won
    -	    1.563  1.454  won
    -	    1.563  1.599  lost
    -	    1.236  1.236  tied
    -	    0.836  0.800  won
    -	    0.873  0.836  won
    -	    1.236  1.236  tied
    -	    1.273  1.236  won
    -	    1.018  1.055  lost
    -	    1.091  1.127  lost
    -	    1.490  1.381  won
    -	
    -	won  12 times
    -	tied  4 times
    -	lost  4 times
    -	
    -	total unique fn went from 292 to 284
    -	
    -2002-09-03 06:57  tim_one
    -
    -	* classifier.py (1.8):
    -
    -	Added a new xspamprob() method, which computes the combined probability
    -	"correctly", and a long comment block explaining what happened when I
    -	tried it.  There's something worth pursuing here (it greatly improves
    -	the false negative rate), but this change alone pushes too many marginal
    -	hams into the spam camp
    -	
    -2002-09-03 05:23  tim_one
    -
    -	* timtest.py (1.12):
    -
    -	Made "skip:" tokens shorter.
    -	
    -	Added a surprising treatment of Organization headers, with a tiny f-n
    -	benefit for a tiny cost.  No change in f-p stats.
    -	
    -	false negative percentages
    -	    1.091  0.945  won
    -	    0.945  0.836  won
    -	    1.236  1.200  won
    -	    1.454  1.418  won
    -	    1.491  1.455  won
    -	    1.091  1.091  tied
    -	    1.127  1.091  won
    -	    1.236  1.236  tied
    -	    1.636  1.564  won
    -	    1.345  1.236  won
    -	    1.672  1.563  won
    -	    1.599  1.563  won
    -	    1.236  1.236  tied
    -	    0.836  0.836  tied
    -	    1.018  0.873  won
    -	    1.236  1.236  tied
    -	    1.273  1.273  tied
    -	    1.055  1.018  won
    -	    1.091  1.091  tied
    -	    1.527  1.490  won
    -	
    -	won  13 times
    -	tied  7 times
    -	lost  0 times
    -	
    -	total unique fn went from 302 to 292
    -	
    -2002-09-03 02:18  tim_one
    -
    -	* timtest.py (1.11):
    -
    -	tokenize_word():  dropped the prefix from the signature; it's faster
    -	to let the caller do it, and this also repaired a bug in one place it
    -	was being used (well, a *conceptual* bug anyway, in that the code didn't
    -	do what I intended there).  This changes the stats in an insignificant
    -	way.  The f-p stats didn't change.  The f-n stats shifted by one message
    -	in a few cases:
    -	
    -	false negative percentages
    -	    1.091  1.091  tied
    -	    0.945  0.945  tied
    -	    1.200  1.236  lost
    -	    1.454  1.454  tied
    -	    1.491  1.491  tied
    -	    1.091  1.091  tied
    -	    1.091  1.127  lost
    -	    1.236  1.236  tied
    -	    1.636  1.636  tied
    -	    1.382  1.345  won
    -	    1.636  1.672  lost
    -	    1.599  1.599  tied
    -	    1.236  1.236  tied
    -	    0.836  0.836  tied
    -	    1.018  1.018  tied
    -	    1.236  1.236  tied
    -	    1.273  1.273  tied
    -	    1.055  1.055  tied
    -	    1.091  1.091  tied
    -	    1.527  1.527  tied
    -	
    -	won   1 times
    -	tied 16 times
    -	lost  3 times
    -	
    -	total unique unchanged
    -	
    -2002-09-02 19:30  tim_one
    -
    -	* timtest.py (1.10):
    -
    -	Don't ask me why this helps -- I don't really know!  When skipping "long
    -	words", generating a token with a brief hint about what and how much got
    -	skipped makes a definite improvement in the f-n rate, and doesn't affect
    -	the f-p rate at all.  Since experiment said it's a winner, I'm checking
    -	it in.  Before (left column) and after (right column):
    -	
    -	false positive percentages
    -	    0.000  0.000  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.075  0.075  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.050  0.050  tied
    -	    0.025  0.025  tied
    -	    0.025  0.025  tied
    -	    0.000  0.000  tied
    -	    0.025  0.025  tied
    -	    0.050  0.050  tied
    -	
    -	won   0 times
    -	tied 20 times
    -	lost  0 times
    -	
    -	total unique fp went from 8 to 8
    -	
    -	false negative percentages
    -	    1.236  1.091  won
    -	    1.164  0.945  won
    -	    1.454  1.200  won
    -	    1.599  1.454  won
    -	    1.527  1.491  won
    -	    1.236  1.091  won
    -	    1.163  1.091  won
    -	    1.309  1.236  won
    -	    1.891  1.636  won
    -	    1.418  1.382  won
    -	    1.745  1.636  won
    -	    1.708  1.599  won
    -	    1.491  1.236  won
    -	    0.836  0.836  tied
    -	    1.091  1.018  won
    -	    1.309  1.236  won
    -	    1.491  1.273  won
    -	    1.127  1.055  won
    -	    1.309  1.091  won
    -	    1.636  1.527  won
    -	
    -	won  19 times
    -	tied  1 times
    -	lost  0 times
    -	
    -	total unique fn went from 336 to 302
    -	
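The "brief hint" token can be as simple as recording the first character and a bucketed length of the skipped long word.  A guess at the shape of the idea; the cutoff and token format here are illustrative, not the project's actual values:

```python
MAX_WORD_LEN = 12   # illustrative cutoff for "long words"

def tokenize_word(word):
    """Yield the word itself, or -- for long words -- a short hint token
    recording roughly what and how much got skipped."""
    if len(word) <= MAX_WORD_LEN:
        yield word
    else:
        # Bucket the length to a multiple of 10 to keep the number of
        # distinct skip tokens small.
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

tokens = list(tokenize_word("supercalifragilistic"))
```

So a 20-character word starting with "s" collapses into the single token "skip:s 20", which the classifier can then learn from like any other word.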
    -2002-09-02 17:55  tim_one
    -
    -	* timtest.py (1.9):
    -
    -	Some comment changes and nesting reduction.
    -	
    -2002-09-02 11:18  tim_one
    -
    -	* timtest.py (1.8):
    -
    -	Fixed some out-of-date comments.
    -	
    -	Made URL clumping lumpier:  now distinguishes among just "first field",
    -	"second field", and "everything else".
    -	
    -	Changed tag names for email address fields (semantically neutral).
    -	
    -	Added "From:" line tagging.
    -	
    -	These add up to an almost pure win.  Before-and-after f-n rates across 20
    -	runs:
    -	
    -	1.418   1.236
    -	1.309   1.164
    -	1.636   1.454
    -	1.854   1.599
    -	1.745   1.527
    -	1.418   1.236
    -	1.381   1.163
    -	1.418   1.309
    -	2.109   1.891
    -	1.491   1.418
    -	1.854   1.745
    -	1.890   1.708
    -	1.818   1.491
    -	1.055   0.836
    -	1.164   1.091
    -	1.599   1.309
    -	1.600   1.491
    -	1.127   1.127
    -	1.164   1.309
    -	1.781   1.636
    -	
    -	It only increased in one run.  The variance appears to have been reduced
    -	too (I didn't bother to compute that, though).
    -	
    -	Before-and-after f-p rates across 20 runs:
    -	
    -	0.000   0.000
    -	0.000   0.000
    -	0.075   0.050
    -	0.000   0.000
    -	0.025   0.025
    -	0.050   0.025
    -	0.075   0.050
    -	0.025   0.025
    -	0.025   0.025
    -	0.025   0.000
    -	0.100   0.075
    -	0.050   0.050
    -	0.025   0.025
    -	0.000   0.000
    -	0.075   0.050
    -	0.025   0.025
    -	0.025   0.025
    -	0.000   0.000
    -	0.075   0.025
    -	0.100   0.050
    -	
    -	Note that 0.025% is a single message; it's really impossible to *measure*
    -	an improvement in the f-p rate anymore with 4000-msg ham sets.
    -	
    -	Across all 20 runs,
    -	
    -	the total # of unique f-n fell from 353 to 336
    -	the total # of unique f-p fell from 13 to 8
    -	
    -2002-09-02 10:06  tim_one
    -
    -	* timtest.py (1.7):
    -
    -	A number of changes.  The most significant is paying attention to the
    -	Subject line (I was wrong before when I said my c.l.py ham corpus was
    -	unusable for this due to Mailman-injected decorations).  In all, across
    -	my 20 test runs,
    -	
    -	the total # of unique false positives fell from 23 to 13
    -	the total # of unique false negatives rose from 337 to 353
    -	
    -	Neither result is statistically significant, although I bet the first
    -	one would be if I pissed away a few days trying to come up with a more
    -	realistic model for what "stat. sig." means here.
    -	
    -2002-09-01 17:22  tim_one
    -
    -	* classifier.py (1.7):
    -
    -	Added a comment block about HAMBIAS experiments.  There's no clearer
    -	example of trading off precision against recall, and you can favor either
    -	at the expense of the other to any degree you like by fiddling this knob.
    -	
    -2002-09-01 14:42  tim_one
    -
    -	* timtest.py (1.6):
    -
    -	Long new comment block summarizing all my experiments with character
    -	n-grams.  Bottom line is that they have nothing going for them, and a
    -	lot going against them, under Graham's scheme.  I believe there may
    -	still be a place for them in *part* of a word-based tokenizer, though.
    -	
    -2002-09-01 10:05  tim_one
    -
    -	* classifier.py (1.6):
    -
    -	spamprob():  Never count unique words more than once anymore.  Counting
    -	up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
    -	that's now a small drag instead.
    -	
    -2002-09-01 07:33  tim_one
    -
    -	* rebal.py (1.3), timtest.py (1.5):
    -
    -	Folding case is here to stay.  Read the new comments for why.  This may
    -	be a bad idea for other languages, though.
    -	
    -	Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
    -	http is spam-neutral, but https is a strong spam indicator.  That
    -	surprised me.
    -	
    -2002-09-01 06:47  tim_one
    -
    -	* classifier.py (1.5):
    -
    -	spamprob():  Removed useless check that wordstream isn't empty.  For one
    -	thing, it didn't work, since wordstream is often an iterator.  Even if
    -	it did work, it isn't needed -- the probability of an empty wordstream
    -	gets computed as 0.5 based on the total absence of evidence.
    -	
    -2002-09-01 05:37  tim_one
    -
    -	* timtest.py (1.4):
    -
    -	textparts():  Worm around what feels like a bug in msg.walk() (Barry has
    -	details).
    -	
    -2002-09-01 05:09  tim_one
    -
    -	* rebal.py (1.2):
    -
    -	Aha!  Staring at the checkin msg revealed a logic bug that explains why
    -	my ham directories sometimes remained unbalanced after running this --
    -	if the randomly selected reservoir msg turned out to be spam, it wasn't
    -	pushing the too-small directory on the stack again.
    -	
    -2002-09-01 04:56  tim_one
    -
    -	* timtest.py (1.3):
    -
    -	textparts():  This was failing to weed out redundant HTML in cases like
    -	this:
    -	
    -	    multipart/alternative
    -	        text/plain
    -	        multipart/related
    -	            text/html
    -	
    -	The tokenizer here also transforms everything to lowercase, but that's
    -	an accident due simply to the fact that I'm testing that now.  Can't say for
    -	sure until the test runs end, but so far it looks like a bad idea for
    -	the false positive rate.
    -	
    -2002-09-01 04:52  tim_one
    -
    -	* rebal.py (1.1):
    -
    -	A little script I use to rebalance the ham corpora after deleting what
    -	turns out to be spam.  I have another Ham/reservoir directory with a
    -	few thousand randomly selected msgs from the presumably-good archive.
    -	These aren't used in scoring or training.  This script marches over all
    -	the ham corpora directories that are used, and if any have gotten too
    -	big (this never happens anymore) deletes msgs at random from them, and
    -	if any have gotten too small plugs the holes by moving in random
    -	msgs from the reservoir.
    -	
    -2002-09-01 03:25  tim_one
    -
    -	* classifier.py (1.4), timtest.py (1.2):
    -
    -	Boost UNKNOWN_SPAMPROB.
    -	# The spam probability assigned to words never seen before.  Graham used
    -	# 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
    -	# Tim's content-only tests (no headers), boosting to 0.5 cut the false
    -	# negative rate by over 1/3.  The f-p rate increased, but there were so few
    -	# f-ps that the increase wasn't statistically significant.  It also caught
    -	# 13 more spams erroneously classified as ham.  By eyeball (and common
    -	# sense), this has most effect on very short messages, where there
    -	# simply aren't many high-value words.  A word with prob 0.5 is (in effect)
    -	# completely ignored by spamprob(), in favor of *any* word with *any* prob
    -	# differing from 0.5.  At 0.2, an unknown word favors ham at the expense
    -	# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
    -	# on the face of it.
    -	
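    Why a word at exactly 0.5 is "completely ignored" falls straight out of
    the Graham-style combining formula.  A minimal sketch, assuming the
    bare product form of the combiner (illustrative only, not the actual
    classifier.py code):

```python
from functools import reduce
import operator

def spamprob(probs):
    """Graham-style combining of per-word spam probabilities
    (bare product form, for illustration only)."""
    p = reduce(operator.mul, probs, 1.0)
    q = reduce(operator.mul, (1.0 - x for x in probs), 1.0)
    return p / (p + q)

# A word at exactly 0.5 multiplies numerator and denominator products
# by the same factor, so it cancels out of the ratio: effectively ignored.
assert abs(spamprob([0.9, 0.5]) - spamprob([0.9])) < 1e-9

# At 0.2, an unknown word actively drags the score toward ham.
assert spamprob([0.9, 0.2]) < spamprob([0.9])
```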
    -2002-08-31 16:50  tim_one
    -
    -	* timtest.py (1.1):
    -
    -	This is a driver I've been using for test runs.  It's specific to my
    -	corpus directories, but has useful stuff in it all the same.
    -	
    -2002-08-31 16:49  tim_one
    -
    -	* classifier.py (1.3):
    -
    -	The explanation for these changes was on Python-Dev.  You'll find out
    -	why if the moderator approves the msg.
    -	
    -2002-08-29 07:04  tim_one
    -
    -	* Tester.py (1.2), classifier.py (1.2):
    -
    -	Tester.py:  Repaired a comment.  The false_{positive,negative}_rate()
    -	functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
    -	too hard to get motivated to reduce 0.01 <0.1 wink>).
    -	
    -	GrahamBayes.spamprob:  New optional bool argument; when true, a list of
    -	the 15 strongest (word, probability) pairs is returned as well as the
    -	overall probability (this is how to find out why a message scored as it
    -	did).
    -	
    -2002-08-28 13:45  montanaro
    -
    -	* GBayes.py (1.15):
    -
    -	ehh - it actually didn't work all that well.  the spurious report that it
    -	did well was pilot error.  besides, tim's report suggests that a simple
    -	str.split() may be the best tokenizer anyway.
    -	
    -2002-08-28 10:45  montanaro
    -
    -	* setup.py (1.1):
    -
    -	trivial little setup.py file - i don't expect most people will be interested
    -	in this, but it makes it a tad simpler to work with now that there are two
    -	files
    -	
    -2002-08-28 10:43  montanaro
    -
    -	* GBayes.py (1.14):
    -
    -	add simple trigram tokenizer - this seems to yield the best results I've
    -	seen so far (but has not been extensively tested)
    -	
    -2002-08-28 08:10  tim_one
    -
    -	* Tester.py (1.1):
    -
    -	A start at a testing class.  There isn't a lot here, but it automates
    -	much of the tedium, and as the doctest shows it can already do
    -	useful things, like remembering which inputs were misclassified.
    -	
    -2002-08-27 06:45  tim_one
    -
    -	* mboxcount.py (1.5):
    -
    -	Updated stats to what Barry and I both get now.  Fiddled output.
    -	
    -2002-08-27 05:09  bwarsaw
    -
    -	* split.py (1.5), splitn.py (1.2):
    -
    -	_factory(): Return the empty string instead of None in the except
    -	clauses, so that for-loops won't break prematurely.  mailbox.py's base
    -	class defines an __iter__() that raises a StopIteration on None
    -	return.
    -	
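    To see why returning None broke the for-loops, here is a hedged,
    minimal stand-in for the iteration contract described above (Python 3
    spelling; the real mailbox.py class differs): the factory returning
    None ends iteration, so a parse failure must yield a placeholder such
    as the empty string to keep the loop alive.

```python
class Box:
    """Minimal stand-in for mailbox.py's contract: iteration stops
    as soon as the factory returns None."""
    def __init__(self, raws, factory):
        self._raws = iter(raws)
        self._factory = factory

    def __iter__(self):
        return self

    def __next__(self):
        msg = self._factory(next(self._raws))
        if msg is None:            # None is treated as end-of-mailbox
            raise StopIteration
        return msg

bad = lambda raw: None if raw == "broken" else raw   # old behaviour
good = lambda raw: "" if raw == "broken" else raw    # the fix

assert list(Box(["a", "broken", "b"], bad)) == ["a"]           # breaks early
assert list(Box(["a", "broken", "b"], good)) == ["a", "", "b"]
```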
    -2002-08-27 04:55  tim_one
    -
    -	* GBayes.py (1.13), mboxcount.py (1.4):
    -
    -	Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
    -	
    -2002-08-27 04:40  bwarsaw
    -
    -	* mboxcount.py (1.3):
    -
    -	Some stats after splitting b/w good messages and unparseable messages
    -	
    -2002-08-27 04:23  bwarsaw
    -
    -	* mboxcount.py (1.2):
    -
    -	_factory(): Use a marker object to distinguish between good messages and
    -	unparseable messages.  For some reason, returning None from the except
    -	clause in _factory() caused Python 2.2.1 to exit early out of the for
    -	loop.
    -	
    -	main(): Print statistics about both the number of good messages and
    -	the number of unparseable messages.
    -	
    -2002-08-27 03:06  tim_one
    -
    -	* cleanarch (1.2):
    -
    -	"From " is a header more than a separator, so don't bump the msg count
    -	at the end.
    -	
    -2002-08-24 01:42  tim_one
    -
    -	* GBayes.py (1.12), classifier.py (1.1):
    -
    -	Moved all the interesting code that was in the *original* GBayes.py into
    -	a new classifier.py.  It was designed to have a very clean interface,
    -	and there's no reason to keep slamming everything into one file.  The
    -	ever-growing tokenizer stuff should probably also be split out, leaving
    -	GBayes.py a pure driver.
    -	
    -	Also repaired _test() (Skip's checkin left it without a binding for
    -	the tokenize function).
    -	
    -2002-08-24 01:17  tim_one
    -
    -	* splitn.py (1.1):
    -
    -	Utility to split an mbox into N random pieces in one gulp.  This gives
    -	a convenient way to break a giant corpus into multiple files that can
    -	then be used independently across multiple training and testing runs.
    -	It's important to do multiple runs on different random samples to avoid
    -	drawing conclusions based on accidents in a single random training corpus;
    -	if the algorithm is robust, it should have similar performance across
    -	all runs.
    -	
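    A sketch of the idea -- the name, signature, and round-robin dealing
    strategy are illustrative assumptions, not the actual splitn.py code:

```python
import random

def splitn(msgs, n, seed=None):
    """Shuffle the messages, then deal them into n random pieces."""
    rng = random.Random(seed)
    msgs = list(msgs)
    rng.shuffle(msgs)
    return [msgs[i::n] for i in range(n)]

pieces = splitn(range(100), 5, seed=1)
# Every message lands in exactly one piece, and pieces are equal-sized.
assert sorted(m for piece in pieces for m in piece) == list(range(100))
assert all(len(piece) == 20 for piece in pieces)
```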
    -2002-08-24 00:25  montanaro
    -
    -	* GBayes.py (1.11):
    -
    -	Allow command line specification of tokenize functions
    -	    run w/ -t flag to override default tokenize function
    -	    run w/ -H flag to see list of tokenize functions
    -	
    -	When adding a new tokenizer, make docstring a short description and add a
    -	key/value pair to the tokenizers dict.  The key is what the user specifies.
    -	The value is a tokenize function.
    -	
    -	Added two new tokenizers - tokenize_wordpairs_foldcase and
    -	tokenize_words_and_pairs.  It's not obvious that either is better than any
    -	of the preexisting functions.
    -	
    -	Should probably add info to the pickle which indicates the tokenizing
    -	function used to build it.  This could then be the default for spam
    -	detection runs.
    -	
    -	Next step is to drive this with spam/non-spam corpora, selecting each of the
    -	various tokenizer functions, and presenting the results in tabular form.
    -	
    -2002-08-23 13:10  tim_one
    -
    -	* GBayes.py (1.10):
    -
    -	spamprob():  Commented some subtleties.
    -	
    -	clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
    -	is that you can't delete entries from a dict that's being crawled over
    -	by .iteritems(), which is why I (I suddenly recall) materialized a
    -	list of words to be deleted the first time I wrote this.  It's a lot
    -	better to materialize a list of to-be-deleted words than to materialize
    -	the entire database in a dict.items() list.
    -	
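    The safe pattern described -- materialize only the to-be-deleted words,
    then delete -- looks like this.  A sketch under assumed names; the real
    clearjunk() works on richer word-info records than bare counts.

```python
def clearjunk(wordinfo, min_count=10):
    """Drop rare words.  Materialize the doomed keys first: deleting
    from a dict while iterating over it is an error, and a short
    doomed-list is far cheaper than snapshotting every item with
    dict.items() as a list."""
    doomed = [w for w, count in wordinfo.items() if count < min_count]
    for w in doomed:
        del wordinfo[w]
    return wordinfo

db = {"cheap": 3, "v1agra": 1, "python": 250}
assert clearjunk(db) == {"python": 250}
```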
    -2002-08-23 12:36  tim_one
    -
    -	* mboxcount.py (1.1):
    -
    -	Utility to count and display the # of msgs in (one or more) Unix mboxes.
    -	
    -2002-08-23 12:11  tim_one
    -
    -	* split.py (1.4):
    -
    -	Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
    -	corpus vanishes on Windows.  Also use file.write() instead of print>>, as
    -	the latter invents an extra newline.
    -	
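    The extra-newline point is easy to demonstrate: print appends its own
    line terminator, so mbox lines that already end in a newline get
    doubled, while file.write() emits exactly what it is given.  (Shown
    with Python 3's print(..., file=...) in place of the old print>>
    syntax from the log entry.)

```python
import io

line = "From alice@example.com\n"   # mbox lines already carry a newline

buf = io.StringIO()
print(line, file=buf)               # print appends another newline...
assert buf.getvalue() == line + "\n"

buf = io.StringIO()
buf.write(line)                     # ...write() does not
assert buf.getvalue() == line
```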
    -2002-08-22 07:01  tim_one
    -
    -	* GBayes.py (1.9):
    -
    -	Renamed "modtime" to "atime", to better reflect its meaning, and added a
    -	comment block to explain that better.
    -	
    -2002-08-21 08:07  bwarsaw
    -
    -	* split.py (1.3):
    -
    -	Guido suggests a different order for the positional args.
    -	
    -2002-08-21 07:37  bwarsaw
    -
    -	* split.py (1.2):
    -
    -	Get rid of the -1 and -2 arguments and make them positional.
    -	
    -2002-08-21 07:18  bwarsaw
    -
    -	* split.py (1.1):
    -
    -	A simple mailbox splitter
    -	
    -2002-08-21 06:42  tim_one
    -
    -	* GBayes.py (1.8):
    -
    -	Added a bunch of simple tokenizers.  The originals are renamed to
    -	tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
    -	New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
    -	tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
    -	any of these to be the last word.  When Barry has the test corpus
    -	set up it should be easy to let the data tell us which "pure" strategy
    -	works best.  Straight character n-grams are very appealing because
    -	they're the simplest and most language-neutral; I didn't have any luck
    -	with them over the weekend, but the size of my training data was
    -	trivial.
    -	
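    A character n-gram tokenizer of the kind listed is only a few lines.
    This sketch is not the original GBayes.py code; it just folds case the
    way the *_foldcase variants did:

```python
def tokenize_ngram_foldcase(text, n=5):
    """Straight character n-grams, lowercased: the simplest and most
    language-neutral of the strategies mentioned above."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

assert tokenize_ngram_foldcase("Spam!", 3) == ["spa", "pam", "am!"]
```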
    -2002-08-21 05:08  bwarsaw
    -
    -	* cleanarch (1.1):
    -
    -	An archive cleaner, adapted from the Mailman 2.1b3 version, but
    -	de-Mailman-ified.
    -	
    -2002-08-21 04:44  gvanrossum
    -
    -	* GBayes.py (1.7):
    -
    -	Indent repair in clearjunk().
    -	
    -2002-08-21 04:22  gvanrossum
    -
    -	* GBayes.py (1.6):
    -
    -	Some minor cleanup:
    -	
    -	- Move the identifying comment to the top, clarify it a bit, and add
    -	  author info.
    -	
    -	- There's no reason for _time and _heapreplace to be hidden names;
    -	  change these back to time and heapreplace.
    -	
    -	- Rename main1() to _test() and main2() to main(); when main() sees
    -	  there are no options or arguments, it runs _test().
    -	
    -	- Get rid of a list comprehension from clearjunk().
    -	
    -	- Put wordinfo.get as a local variable in _add_msg().
    -	
    -2002-08-20 15:16  tim_one
    -
    -	* GBayes.py (1.5):
    -
    -	Neutral typo repairs, except that clearjunk() has a better chance of
    -	not blowing up immediately now.
    -	
    -2002-08-20 13:49  montanaro
    -
    -	* GBayes.py (1.4):
    -
    -	help make it more easily executable... ;-)
    -	
    -2002-08-20 09:32  bwarsaw
    -
    -	* GBayes.py (1.3):
    -
    -	Lots of hacks great and small to the main() program, but I didn't
    -	touch the guts of the algorithm.
    -	
    -	Added a module docstring/usage message.
    -	
    -	Added a bunch of switches to train the system on an mbox of known good
    -	and known spam messages (using PortableUnixMailbox only for now).
    -	Uses the email package but does no decoding of message bodies.  Also
    -	allows you to specify a file for pickling the training data, and for
    -	setting a threshold, above which messages get an X-Bayes-Score
    -	header.  Also outputs messages (marked and unmarked) to an output file
    -	for retraining.
    -	
    -	Print some statistics at the end.
    -	
    -2002-08-20 05:43  tim_one
    -
    -	* GBayes.py (1.2):
    -
    -	Turned off debugging vrbl mistakenly checked in at True.
    -	
    -	unlearn():  Gave this an update_probabilities=True default arg, for
    -	symmetry with learn().
    -	
    -2002-08-20 03:33  tim_one
    -
    -	* GBayes.py (1.1):
    -
    -	An implementation of Paul Graham's Bayes-like spam classifier.
    -
    -
Modified: trunk/website/unix.ht
===================================================================
--- trunk/website/unix.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/unix.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -9,8 +9,8 @@
 href="download.html">recent enough version of Python is installed,
 then install the Spambayes source either as a bundled
-package or from
-CVS, then choose the Spambayes application which best fits into your
+package or from
+Subversion, then choose the Spambayes application which best fits into your
 mail setup.

    Procmail

@@ -52,8 +52,8 @@
 Additional details are available in the Hammie
-readme.
+href="http://spambayes.svn.sourceforge.net/viewvc/*checkout*/spambayes/trunk/spambayes/README.txt">README
+file.

    POP3

@@ -208,8 +208,8 @@

    Training

 See the Hammie
-readme for a detailed discussion of the many training options
+href="http://spambayes.svn.sourceforge.net/viewvc/*checkout*/spambayes/trunk/spambayes/README.txt">
+README file for a detailed discussion of the many training options
 on Unix systems.

    Notes

Modified: trunk/website/windows.ht
===================================================================
--- trunk/website/windows.ht	2007-07-24 00:04:32 UTC (rev 3154)
+++ trunk/website/windows.ht	2007-07-25 13:49:42 UTC (rev 3155)
@@ -35,7 +35,7 @@
 report and add any useful information you may have, or to open a new
 bug report if your problem seems to be a new one.  Please be sure to
 go through the
+href="http://spambayes.svn.sourceforge.net/viewvc/*checkout*/spambayes/trunk/spambayes/Outlook2000/docs/troubleshooting.html">
 troubleshooting.html file that is installed with the plugin.

    Installing the Outlook Client From Source

@@ -56,10 +56,10 @@
 The SpamBayes source, either as a zip
- file or via
- CVS.  The zip file will probably be easier to handle, but there may
- be improvements to the code which make the CVS version a
- viable option (though you will have to have a CVS client for Windows
+ file or via
+ Subversion.  The zip file will probably be easier to handle, but there may
+ be improvements to the code which make the Subversion version a
+ viable option (though you will have to have a Subversion client for Windows
 installed).
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

From montanaro at users.sourceforge.net Fri Jul 27 15:34:45 2007
From: montanaro at users.sourceforge.net (montanaro at users.sourceforge.net)
Date: Fri, 27 Jul 2007 06:34:45 -0700
Subject: [Spambayes-checkins] SF.net SVN: spambayes: [3157] trunk/website
Message-ID:

Revision: 3157
          http://spambayes.svn.sourceforge.net/spambayes/?rev=3157&view=rev
Author:   montanaro
Date:     2007-07-27 06:34:45 -0700 (Fri, 27 Jul 2007)

Log Message:
-----------
fix a couple more CVS->Subversion problems

Modified Paths:
--------------
    trunk/website/download.ht
    trunk/website/links.h

Modified: trunk/website/download.ht
===================================================================
--- trunk/website/download.ht	2007-07-25 13:51:11 UTC (rev 3156)
+++ trunk/website/download.ht	2007-07-27 13:34:45 UTC (rev 3157)
@@ -108,6 +108,6 @@

    Subversion Access

-The code is currently available the SourceForge Subversion server.
+The code is currently available from the SourceForge Subversion server.

Modified: trunk/website/links.h
===================================================================
--- trunk/website/links.h	2007-07-25 13:51:11 UTC (rev 3156)
+++ trunk/website/links.h	2007-07-27 13:34:45 UTC (rev 3157)
@@ -13,4 +13,4 @@
  • Mac OS

    Getting the code

   • Releases
-  • CVS access
+  • Subversion access

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.