[Spambayes-checkins] spambayes README.txt,1.17,1.18

Tim Peters tim_one@users.sourceforge.net
Sat, 14 Sep 2002 15:18:27 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12179

Modified Files:
	README.txt 
Log Message:
Various comment updates.


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** README.txt	14 Sep 2002 00:03:51 -0000	1.17
--- README.txt	14 Sep 2002 22:18:24 -0000	1.18
***************
*** 105,108 ****
--- 105,112 ----
      the script for an operational definition of "loose".
  
+ rebal.py
+     Evens out the number of messages in "standard" test data folders (see
+     below).  Needs generalization (e.g., Ham and 4000 are hardcoded now).
+ 
  mboxcount.py
      Count the number of messages (both parseable and unparseable) in
***************
*** 117,127 ****
      Like splitn.py (above), but splits an mbox into one message per file in
      "the standard" directory structure (see below).  This does an
!     approximate split; rebal.by (below) can be used afterwards to even out
      the number of messages per folder.
  
- rebal.py
-     Evens out the number of messages in "standard" test data folders (see
-     below).  Needs generalization (e.g., Ham and 4000 are hardcoded now).
- 
  
  Standard Test Data Setup
--- 121,127 ----
      Like splitn.py (above), but splits an mbox into one message per file in
      "the standard" directory structure (see below).  This does an
!     approximate split; rebal.py (above) can be used afterwards to even out
      the number of messages per folder.
  
  
  Standard Test Data Setup
***************
*** 133,156 ****
  random when testing reveals spam mistakenly called ham (and vice versa),
  etc -- even pasting examples into email is much easier when it's one msg
! per file (and the test driver makes it easy to print a msg's file path).
  
  The directory structure under my spambayes directory looks like so:
- [But due to a better testing infrastructure, I'm going to spread this
-  across 20 subdirectories under Spam and under Ham, and use groups
-  of 10 for 10-fold cross validation]
  
  Data/
      Spam/
!         Set1/ (contains 2750 spam .txt files)
          Set2/            ""
          Set3/            ""
          Set4/            ""
          Set5/            ""
      Ham/
!         Set1/ (contains 4000 ham .txt files)
          Set2/            ""
          Set3/            ""
          Set4/            ""
          Set5/            ""
          reservoir/ (contains "backup ham")
  
--- 133,163 ----
  random when testing reveals spam mistakenly called ham (and vice versa),
  etc -- even pasting examples into email is much easier when it's one msg
! per file (and the test drivers make it easy to print a msg's file path).
  
  The directory structure under my spambayes directory looks like so:
  
  Data/
      Spam/
!         Set1/ (contains 1375 spam .txt files)
          Set2/            ""
          Set3/            ""
          Set4/            ""
          Set5/            ""
+         Set6/            ""
+         Set7/            ""
+         Set8/            ""
+         Set9/            ""
+         Set10/           ""
      Ham/
!         Set1/ (contains 2000 ham .txt files)
          Set2/            ""
          Set3/            ""
          Set4/            ""
          Set5/            ""
+         Set6/            ""
+         Set7/            ""
+         Set8/            ""
+         Set9/            ""
+         Set10/           ""
          reservoir/ (contains "backup ham")
  
***************
*** 159,166 ****
  want at least a few hundred messages in each one.  The "reservoir" directory
  contains a few thousand other random hams.  When a ham is found that's
! really spam, I delete it, and then the rebal.py utility moves in a message
! at random from the reservoir to replace it.  If I had it to do over
! again, I think I'd move such spam into a Spam set (chosen at random),
! instead of deleting it.
  
  The hams are 20,000 msgs selected at random from a python-list archive.
--- 166,171 ----
  want at least a few hundred messages in each one.  The "reservoir" directory
  contains a few thousand other random hams.  When a ham is found that's
! really spam, move it into a spam directory, and then the rebal.py utility
! moves in a random message from the reservoir to replace it.
  
  The hams are 20,000 msgs selected at random from a python-list archive.
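
The reservoir replacement step described in this hunk can be sketched
roughly as follows.  This is not the code in rebal.py: the function name
rebalance(), its arguments, and the per-directory target count are
assumptions made for illustration, and the sketch only tops directories
up (it never moves surplus messages back to the reservoir).

```python
import os
import random
import shutil


def rebalance(set_dirs, reservoir, target, rng=random):
    """Top up each directory in set_dirs to `target` files.

    Hypothetical helper, not the real rebal.py.  Replacement messages
    are picked at random from `reservoir` and moved (not copied).
    Assumes file names are unique across all directories.
    Returns the number of files moved.
    """
    moved = 0
    for d in set_dirs:
        shortfall = target - len(os.listdir(d))
        if shortfall <= 0:
            continue  # already at or above the target count
        # Re-read the reservoir each time, since earlier moves shrink it.
        pool = sorted(os.listdir(reservoir))
        for name in rng.sample(pool, min(shortfall, len(pool))):
            shutil.move(os.path.join(reservoir, name),
                        os.path.join(d, name))
            moved += 1
    return moved
```

For the layout shown above, set_dirs would be the Data/Ham/Set* paths,
reservoir would be Data/Ham/reservoir, and target would be the desired
number of messages per Set directory.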
***************
*** 171,176 ****
  The sets are grouped into pairs in the obvious way:  Spam/Set1 with
  Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
! that pair, then runs predictions on each of the other 4 pairs.  In effect,
! it's a 5x5 test grid, skipping the diagonal.  There's no particular reason
  to avoid predicting against the same set trained on, except that it
  takes more time and seems the least interesting thing to try.
--- 176,181 ----
  The sets are grouped into pairs in the obvious way:  Spam/Set1 with
  Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
! that pair, then runs predictions on each of the other pairs.  In effect,
! it's an NxN test grid, skipping the diagonal.  There's no particular reason
  to avoid predicting against the same set trained on, except that it
  takes more time and seems the least interesting thing to try.
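
To make the grid in this hunk concrete, here is a small sketch that
enumerates the (train, test) pairs of an NxN grid with the diagonal
skipped.  It only lists the combinations; it is not how timtest.py is
written, and grid_runs() is a hypothetical name.

```python
def grid_runs(n_sets):
    """Yield (train, test) set numbers for an NxN grid, diagonal skipped.

    Set numbers start at 1 to match Set1, Set2, ... in the layout.
    """
    for train in range(1, n_sets + 1):
        for test in range(1, n_sets + 1):
            if test != train:  # never predict against the training pair
                yield train, test


runs = list(grid_runs(5))
print(len(runs))  # 20 = 5 * 4
```

With 10 sets the same enumeration gives 90 prediction runs, although
only 10 classifiers are trained, since each trained classifier is reused
for all of the other pairs.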
***************
*** 178,182 ****
  Later, support for N-fold cross validation testing was added, which allows
  more accurate measurement of error rates with smaller amounts of training
! data.  That's recommended now.
  
  CAUTION:  The parititioning of your corpora across directories should
--- 183,189 ----
  Later, support for N-fold cross validation testing was added, which allows
  more accurate measurement of error rates with smaller amounts of training
! data.  That's recommended now.  timcv.py is to cross-validation testing
! as the older timtest.py is to grid testing.  timcv.py has grown additional
! arguments to allow using only a random subset of messages in each Set.
  
  CAUTION:  The parititioning of your corpora across directories should