[Spambayes-checkins] spambayes Options.py,1.36,1.37 classifier.py,1.23,1.24

Tim Peters tim_one@users.sourceforge.net
Sat, 28 Sep 2002 00:41:16 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12459

Modified Files:
	Options.py classifier.py 
Log Message:
New option

[Classifier]
count_duplicates_only_once_in_training: False

Please try it on your data with the option set to True.  Because it
decreases both the ham and spam mean scores, you'll probably need a smaller
spam_cutoff value too.  Various biases in the Graham scheme made this a
loser there, but it may do better under the Robinson scheme.  Something I
haven't tried: a smaller value of robinson_probability_s *may* also help
when this is enabled (then again, it may hurt too ...).
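
To make the effect of the option concrete, here is a minimal sketch of how
per-word training counts change when duplicates within one message are
counted only once.  It is not the classifier's code: the helper name
train_counts and the toy messages are hypothetical, and it uses the built-in
set where the checkin uses Set.

```python
from collections import Counter


def train_counts(messages, count_duplicates_only_once=False):
    """Return per-word training counts over a list of tokenized messages.

    Hypothetical helper for illustration only; it mimics the idea of the
    new option, not the real classifier's data structures.
    """
    counts = Counter()
    for tokens in messages:
        if count_duplicates_only_once:
            # Collapse repeats so each word adds at most 1 per message.
            tokens = set(tokens)
        counts.update(tokens)
    return counts


# Two toy "spam" messages; "free" appears three times in the first one.
spam = [["free", "free", "free", "money"], ["free", "offer"]]

every = train_counts(spam)
once = train_counts(spam, count_duplicates_only_once=True)

print(every["free"], once["free"])  # 4 2
```

With the option off, "free" is credited four times from two messages; with
it on, the count equals the number of messages containing the word, which
matches how repeats are already ignored during scoring.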


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** Options.py	28 Sep 2002 03:41:10 -0000	1.36
--- Options.py	28 Sep 2002 07:41:13 -0000	1.37
***************
*** 199,202 ****
--- 199,213 ----
  robinson_minimum_prob_strength: 0.1
  
+ # There's a strange asymmetry in the scheme, where multiple occurrences of
+ # a word in a msg are ignored during scoring, but all add to the spamcount
+ # (or hamcount) during training.  This imbalance couldn't be altered without
+ # hurting results under the Graham scheme, but it may well be better to
+ # treat things the same way during training under the Robinson scheme.  Set
+ # this to true to try that.
+ # NOTE:  In Tim's tests this decreased both the ham and spam mean scores,
+ # the former more than the latter.  Therefore you'll probably want a smaller
+ # spam_cutoff value when this is enabled.
+ count_duplicates_only_once_in_training: False
+ 
  ###########################################################################
  # Speculative options for Gary Robinson's central-limit ideas.  These may go
***************
*** 257,260 ****
--- 268,272 ----
                     'robinson_probability_s': float_cracker,
                     'robinson_minimum_prob_strength': float_cracker,
+                    'count_duplicates_only_once_in_training': boolean_cracker,
  
                     'use_central_limit': boolean_cracker,

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** classifier.py	28 Sep 2002 03:41:10 -0000	1.23
--- classifier.py	28 Sep 2002 07:41:13 -0000	1.24
***************
*** 282,285 ****
--- 282,287 ----
          wordinfoget = wordinfo.get
          now = time.time()
+         if options.count_duplicates_only_once_in_training:
+             wordstream = Set(wordstream)
          for word in wordstream:
              record = wordinfoget(word)
***************
*** 304,307 ****
--- 306,311 ----
  
          wordinfoget = self.wordinfo.get
+         if options.count_duplicates_only_once_in_training:
+             wordstream = Set(wordstream)
          for word in wordstream:
              record = wordinfoget(word)
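
The same two-line guard is added in two places in classifier.py.  Assuming
those are the paths that add and remove a message's words (the hunks do not
show the method names), the sketch below illustrates why both need it:
removal must collapse duplicates the same way training did, or the counts
will not return to where they started.  TinyCounter and its methods are
hypothetical, and the built-in set stands in for Set.

```python
class TinyCounter:
    """Toy stand-in for per-word spam counts; not the real classifier."""

    def __init__(self, dedupe=False):
        self.dedupe = dedupe
        self.spamcount = {}

    def _tokens(self, wordstream):
        # Apply the same duplicate handling on every path.
        return set(wordstream) if self.dedupe else wordstream

    def add_msg(self, wordstream):
        for word in self._tokens(wordstream):
            self.spamcount[word] = self.spamcount.get(word, 0) + 1

    def remove_msg(self, wordstream):
        for word in self._tokens(wordstream):
            remaining = self.spamcount[word] - 1
            if remaining:
                self.spamcount[word] = remaining
            else:
                # Drop words whose count has fallen to zero.
                del self.spamcount[word]


c = TinyCounter(dedupe=True)
msg = ["free", "free", "money"]

c.add_msg(msg)
after_add = dict(c.spamcount)      # each word counted once

c.remove_msg(msg)
after_remove = dict(c.spamcount)   # back to the empty starting state
```

If only the add path collapsed duplicates, removing the same message would
try to subtract "free" twice from a count of one.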