[Spambayes-checkins]
spambayes Options.py,1.36,1.37 classifier.py,1.23,1.24
Tim Peters
tim_one@users.sourceforge.net
Sat, 28 Sep 2002 00:41:16 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12459
Modified Files:
Options.py classifier.py
Log Message:
New option:

    [Classifier]
    count_duplicates_only_once_in_training: False

Please try it on your data with True. Because it decreases both ham
and spam mean scores, you'll probably need a smaller spam_cutoff value
too. Various biases in the Graham scheme made this a loser there, but
it may be better under the Robinson scheme. Something I haven't tried:
a smaller value of robinson_probability_s *may* also help when this is
enabled (then again, it may hurt too ...).
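
A minimal sketch of what the option changes during training, assuming a toy corpus of already-tokenized messages. The function `train_counts` and the sample messages are hypothetical illustrations, not spambayes code:

```python
from collections import Counter


def train_counts(messages, count_duplicates_only_once=False):
    """Accumulate per-word training counts over tokenized messages.

    Hypothetical helper: with the flag off, every occurrence of a word
    adds to its count; with it on, each message contributes at most 1.
    """
    counts = Counter()
    for tokens in messages:
        if count_duplicates_only_once:
            # Collapse repeats within a single message before counting.
            tokens = set(tokens)
        counts.update(tokens)
    return counts


# Two made-up spam messages; "free" appears three times in the first.
spam = [["free", "free", "free", "money"], ["free", "offer"]]

all_occurrences = train_counts(spam)
once_per_msg = train_counts(spam, count_duplicates_only_once=True)
```

With the flag off, "free" ends up with a count of 4; with it on, the count is 2, one per message containing the word. Words that occur once per message are unaffected.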
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** Options.py 28 Sep 2002 03:41:10 -0000 1.36
--- Options.py 28 Sep 2002 07:41:13 -0000 1.37
***************
*** 199,202 ****
--- 199,213 ----
  robinson_minimum_prob_strength: 0.1
+ # There's a strange asymmetry in the scheme, where multiple occurrences of
+ # a word in a msg are ignored during scoring, but all add to the spamcount
+ # (or hamcount) during training. This imbalance couldn't be altered without
+ # hurting results under the Graham scheme, but it may well be better to
+ # treat things the same way during training under the Robinson scheme. Set
+ # this to true to try that.
+ # NOTE: In Tim's tests this decreased both the ham and spam mean scores,
+ # the former more than the latter. Therefore you'll probably want a smaller
+ # spam_cutoff value when this is enabled.
+ count_duplicates_only_once_in_training: False
+
  ###########################################################################
  # Speculative options for Gary Robinson's central-limit ideas. These may go
***************
*** 257,260 ****
--- 268,272 ----
  'robinson_probability_s': float_cracker,
  'robinson_minimum_prob_strength': float_cracker,
+ 'count_duplicates_only_once_in_training': boolean_cracker,
  'use_central_limit': boolean_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** classifier.py 28 Sep 2002 03:41:10 -0000 1.23
--- classifier.py 28 Sep 2002 07:41:13 -0000 1.24
***************
*** 282,285 ****
--- 282,287 ----
  wordinfoget = wordinfo.get
  now = time.time()
+ if options.count_duplicates_only_once_in_training:
+     wordstream = Set(wordstream)
  for word in wordstream:
      record = wordinfoget(word)
***************
*** 304,307 ****
--- 306,311 ----
  wordinfoget = self.wordinfo.get
+ if options.count_duplicates_only_once_in_training:
+     wordstream = Set(wordstream)
  for word in wordstream:
      record = wordinfoget(word)
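
For readers trying the pattern on a current Python, here is a rough sketch of reading the new flag and applying it before a counting loop. It makes several assumptions: the standard `configparser` module stands in for the project's own Options machinery, the built-in `set` stands in for the `Set` class used in the diff, and `load_count_once`, `learn`, and the record layout are hypothetical names made up for illustration:

```python
import configparser

# Same section and key as the new option, switched on for the experiment.
CONFIG_TEXT = """
[Classifier]
count_duplicates_only_once_in_training: True
"""


def load_count_once(text):
    """Read the boolean flag, defaulting to False when it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getboolean(
        "Classifier", "count_duplicates_only_once_in_training", fallback=False
    )


def learn(wordinfo, wordstream, is_spam, count_once):
    """Hypothetical training step: bump one counter per word seen."""
    if count_once:
        # Token order is lost here, which is harmless for plain counting.
        wordstream = set(wordstream)
    key = "spamcount" if is_spam else "hamcount"
    for word in wordstream:
        record = wordinfo.setdefault(word, {"spamcount": 0, "hamcount": 0})
        record[key] += 1


count_once = load_count_once(CONFIG_TEXT)
wordinfo = {}
learn(wordinfo, ["viagra", "viagra", "now"], is_spam=True, count_once=count_once)
```

After this call the record for "viagra" holds a spamcount of 1 rather than 2. Because the deduplication happens once, ahead of the loop, the loop body itself needs no changes, which matches the shape of the two hunks above.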