[Spambayes-checkins] spambayes/spambayes CostCounter.py, 1.3,
1.4 ImapUI.py, 1.12, 1.13 Options.py, 1.53, 1.54 ProxyUI.py,
1.10, 1.11
Tony Meyer
anadelonbrin at users.sourceforge.net
Sun May 25 18:00:56 EDT 2003
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv7707/spambayes
Modified Files:
CostCounter.py ImapUI.py Options.py ProxyUI.py
Log Message:
Update CostCounter to new options style.
Expose experimental ham/spam imbalance option to
pop3proxy/imapfilter users, and update doc to be easier to
understand (thanks to Paul Moore).
Add a couple of notes to incremental.txt
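The "new options style" in this checkin replaces attribute access (options.ham_cutoff) with lookups keyed by a (section, option) pair, as the CostCounter.py diff below shows. A minimal sketch of how such a container could behave, assuming a plain dict underneath; SketchOptions and its default values are hypothetical stand-ins for illustration, not the real class or defaults in Options.py:

```python
class SketchOptions:
    """Hypothetical stand-in for a (section, option)-keyed options object."""

    def __init__(self):
        # Values keyed by (section, option) tuples. These defaults are
        # illustrative assumptions, not taken from Options.py.
        self._values = {
            ("Categorization", "ham_cutoff"): 0.2,
            ("Categorization", "spam_cutoff"): 0.9,
        }

    def __getitem__(self, key):
        # options["Categorization", "ham_cutoff"] arrives here as a 2-tuple.
        section, option = key
        return self._values[section, option]

    def __setitem__(self, key, value):
        section, option = key
        self._values[section, option] = value


options = SketchOptions()
options["Categorization", "spam_cutoff"] = 0.7
print(options["Categorization", "spam_cutoff"])
```

Python passes a comma-separated subscript to __getitem__ as a single tuple, which is why the two-part key needs no extra parentheses at the call site.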
Index: CostCounter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/CostCounter.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** CostCounter.py 29 Jan 2003 03:23:34 -0000 1.3
--- CostCounter.py 26 May 2003 00:00:52 -0000 1.4
***************
*** 73,79 ****
          self._total += 1
          self._spam += 1
!         if scr < options.ham_cutoff:
              self._fn += 1
!         elif scr < options.spam_cutoff:
              self._unsure += 1
              self._unsurespam += 1
--- 73,79 ----
          self._total += 1
          self._spam += 1
!         if scr < options["Categorization", "ham_cutoff"]:
              self._fn += 1
!         elif scr < options["Categorization", "spam_cutoff"]:
              self._unsure += 1
              self._unsurespam += 1
***************
*** 84,90 ****
          self._total += 1
          self._ham += 1
!         if scr > options.spam_cutoff:
              self._fp += 1
!         elif scr > options.ham_cutoff:
              self._unsure += 1
              self._unsureham += 1
--- 84,90 ----
          self._total += 1
          self._ham += 1
!         if scr > options["Categorization", "spam_cutoff"]:
              self._fp += 1
!         elif scr > options["Categorization", "ham_cutoff"]:
              self._unsure += 1
              self._unsureham += 1
***************
*** 118,156 ****
      name = "Standard Cost"
      def spam(self, scr):
!         if scr < options.ham_cutoff:
!             self.total += options.best_cutoff_fn_weight
!         elif scr < options.spam_cutoff:
!             self.total += options.best_cutoff_unsure_weight
      def ham(self, scr):
!         if scr > options.spam_cutoff:
!             self.total += options.best_cutoff_fp_weight
!         elif scr > options.ham_cutoff:
!             self.total += options.best_cutoff_unsure_weight
  class FlexCostCounter(CostCounter):
      name = "Flex Cost"
      def _lambda(self, scr):
!         if scr < options.ham_cutoff:
              return 0
!         elif scr > options.spam_cutoff:
              return 1
          else:
!             return (scr - options.ham_cutoff) / (
!                 options.spam_cutoff - options.ham_cutoff)
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr)) * options.best_cutoff_fn_weight
      def ham(self, scr):
!         self.total += self._lambda(scr) * options.best_cutoff_fp_weight
  class Flex2CostCounter(FlexCostCounter):
      name = "Flex**2 Cost"
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr))**2 * options.best_cutoff_fn_weight
      def ham(self, scr):
!         self.total += self._lambda(scr)**2 * options.best_cutoff_fp_weight
  def default():
--- 118,161 ----
      name = "Standard Cost"
      def spam(self, scr):
!         if scr < options["Categorization", "ham_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_fn_weight"]
!         elif scr < options["Categorization", "spam_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_unsure_weight"]
      def ham(self, scr):
!         if scr > options["Categorization", "spam_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_fp_weight"]
!         elif scr > options["Categorization", "ham_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_unsure_weight"]
  class FlexCostCounter(CostCounter):
      name = "Flex Cost"
      def _lambda(self, scr):
!         if scr < options["Categorization", "ham_cutoff"]:
              return 0
!         elif scr > options["Categorization", "spam_cutoff"]:
              return 1
          else:
!             return (scr - options["Categorization", "ham_cutoff"]) / (
!                 options["Categorization", "spam_cutoff"] \
!                 - options["Categorization", "ham_cutoff"])
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr)) * options["TestDriver",
!                                                         "best_cutoff_fn_weight"]
      def ham(self, scr):
!         self.total += self._lambda(scr) * options["TestDriver",
!                                                   "best_cutoff_fp_weight"]
  class Flex2CostCounter(FlexCostCounter):
      name = "Flex**2 Cost"
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr))**2 * options["TestDriver",
!                                                            "best_cutoff_fn_weight"]
      def ham(self, scr):
!         self.total += self._lambda(scr)**2 * options["TestDriver",
!                                                      "best_cutoff_fp_weight"]
  def default():
***************
*** 182,186 ****
      cc.ham(0.5)
      cc.spam(0.5)
!     options.spam_cutoff=0.7
!     options.ham_cutoff=0.4
      print cc
--- 187,191 ----
      cc.ham(0.5)
      cc.spam(0.5)
!     options["Categorization", "spam_cutoff"]=0.7
!     options["Categorization", "ham_cutoff"]=0.4
      print cc
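The Flex cost counters in the CostCounter.py diff above charge an error in proportion to where the score sits between the two cutoffs, rather than a flat weight. A standalone sketch of that interpolation follows, written for Python 3 with the cutoffs and weights passed as parameters. The function names, parameter names, and default values are assumptions for illustration; the real code reads these values from options.

```python
def flex_lambda(score, ham_cutoff, spam_cutoff):
    """Return 0 below ham_cutoff, 1 above spam_cutoff, linear in between."""
    if score < ham_cutoff:
        return 0.0
    elif score > spam_cutoff:
        return 1.0
    else:
        # Assumes ham_cutoff < spam_cutoff; equal cutoffs would divide by zero
        # when the score lands exactly on them.
        return (score - ham_cutoff) / (spam_cutoff - ham_cutoff)


def flex_cost(score, is_spam, ham_cutoff=0.25, spam_cutoff=0.75,
              fn_weight=1.0, fp_weight=10.0, power=1):
    """Cost of one message: power=1 mirrors Flex, power=2 mirrors Flex**2."""
    lam = flex_lambda(score, ham_cutoff, spam_cutoff)
    if is_spam:
        # A spam scored low (lam near 0) is close to a false negative.
        return (1 - lam) ** power * fn_weight
    # A ham scored high (lam near 1) is close to a false positive.
    return lam ** power * fp_weight


print(flex_cost(0.5, is_spam=True))
print(flex_cost(0.5, is_spam=False))
```

With these illustrative weights, a ham sitting midway between the cutoffs costs ten times as much as a spam at the same score, which reflects a false-positive weight set higher than the false-negative weight.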
Index: ImapUI.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImapUI.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** ImapUI.py 22 May 2003 05:21:16 -0000 1.12
--- ImapUI.py 26 May 2003 00:00:52 -0000 1.13
***************
*** 79,82 ****
--- 79,83 ----
('Categorization', 'ham_cutoff'),
('Categorization', 'spam_cutoff'),
+ ('Classifier', 'experimental_ham_spam_imbalance_adjustment'),
)
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.53
retrieving revision 1.54
diff -C2 -d -r1.53 -r1.54
*** Options.py 14 May 2003 00:18:19 -0000 1.53
--- Options.py 26 May 2003 00:00:53 -0000 1.54
***************
*** 556,572 ****
BOOLEAN, RESTORE),
! ("experimental_ham_spam_imbalance_adjustment", "Correct for imbalanced ham/spam ratio", False,
! """If the # of ham and spam in training data are out of balance, the
! spamprob guesses can get stronger in the direction of the category
! with more training msgs. In one sense this must be so, since the more
! data we have of one flavor, the more we know about that flavor. But
! that allows the accidental appearance of a strong word of that flavor
! in a msg of the other flavor much more power than an accident in the
! other direction. Enable experimental_ham_spam_imbalance_adjustment if
! you have more ham than spam training data (or more spam than ham), and
! the Bayesian probability adjustment won't 'believe' raw counts more
! than min(# ham trained on, # spam trained on) justifies. I *expect*
! this option will go away (and become the default), but people *with*
! strong imbalance need to test it first.""",
BOOLEAN, RESTORE),
),
--- 556,582 ----
BOOLEAN, RESTORE),
! # If the # of ham and spam in training data are out of balance, the
! # spamprob guesses can get stronger in the direction of the category
! # with more training msgs. In one sense this must be so, since the more
! # data we have of one flavor, the more we know about that flavor. But
! # that allows the accidental appearance of a strong word of that flavor
! # in a msg of the other flavor much more power than an accident in the
! # other direction. Enable experimental_ham_spam_imbalance_adjustment if
! # you have more ham than spam training data (or more spam than ham), and
! # the Bayesian probability adjustment won't 'believe' raw counts more
! # than min(# ham trained on, # spam trained on) justifies. I *expect*
! # this option will go away (and become the default), but people *with*
! # strong imbalance need to test it first.
!
! ("experimental_ham_spam_imbalance_adjustment", "Compensate for unequal numbers of spam and ham", False,
! """If your training database has significantly (3 times) more ham than
! spam, or vice versa, you may start seeing an increase in incorrect
! classifications (messages put in the wrong category, not just marked
! as unsure). If so, this option allows you to compensate for this, at
! the cost of increasing the number of messages classified as "unsure".
!
! Note that the effect is subtle, and you should experiment with both
! settings to choose the option that suits you best. You do not have
! to retrain your database if you change this option.""",
BOOLEAN, RESTORE),
),
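One way to read the comment block in the Options.py diff above: when the training counts are unequal, the word counts from the larger category are scaled down before the Bayesian adjustment is applied, so raw counts are trusted no more than the smaller category would justify. The following is a rough Python 3 sketch of that idea, not the actual classifier code. The function name, the parameters s (strength given to the prior) and x (assumed probability for an unseen word), and their default values are all assumptions.

```python
def adjusted_spamprob(ham_count, spam_count, nham, nspam,
                      s=0.45, x=0.5, imbalance_adjustment=True):
    """Sketch of a per-word spam probability with optional imbalance scaling.

    ham_count / spam_count: messages of each kind containing the word.
    nham / nspam: total messages of each kind trained on.
    """
    if nham <= 0 or nspam <= 0 or ham_count + spam_count <= 0:
        raise ValueError("need training data and at least one sighting")

    # Raw estimate from the fraction of each category containing the word.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    prob = spam_ratio / (ham_ratio + spam_ratio)

    if imbalance_adjustment:
        # Scale down counts from whichever category has more training data.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Effective evidence count, then shrink the raw estimate toward x.
    n = ham_count * spam2ham + spam_count * ham2spam
    return (s * x + n * prob) / (s + n)


# A word seen in 10 of 1000 ham and 0 of 100 spam messages.
print(adjusted_spamprob(10, 0, 1000, 100, imbalance_adjustment=False))
print(adjusted_spamprob(10, 0, 1000, 100, imbalance_adjustment=True))
```

In this sketch the adjusted probability stays closer to the neutral value x, which is consistent with the new help text: fewer strong clues means fewer wrong classifications but more messages landing in "unsure".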
Index: ProxyUI.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ProxyUI.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** ProxyUI.py 22 May 2003 05:21:16 -0000 1.10
--- ProxyUI.py 26 May 2003 00:00:53 -0000 1.11
***************
*** 104,107 ****
--- 104,108 ----
('Categorization', 'ham_cutoff'),
('Categorization', 'spam_cutoff'),
+ ('Classifier', 'experimental_ham_spam_imbalance_adjustment'),
)