[Spambayes-checkins] spambayes/spambayes Stats.py,1.6,1.7

Tue Nov 2 07:33:26 CET 2004

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28614/spambayes

Modified Files:
	Stats.py 
Log Message:
Improve the web interface statistics.

This is the format that Mark Moraes and Kenny Pitt devised on spambayes-dev quite
some time ago (but which was never checked in; maybe we were feature frozen then?).
This is my own code, though, not the patch that Mark submitted, which added
unnecessary counters.

At some point I'll copy across the Outlook code that lets the number of decimal
places for the percentages be specified.  The Outlook stats could perhaps be changed
to look more like this too (or the damn code could be centralised), except that there
isn't much room in the dialog for a lot of text.  Maybe Kenny has a patch for that?
(A spambayes-dev message indicated that he might.)
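
For illustration only, a minimal sketch of what a configurable number of decimal
places could look like.  This is not the Outlook code (which is not shown here);
the `format_percentage` function and its `decimal_places` argument are hypothetical
names.

```python
def format_percentage(value, decimal_places=1):
    """Format a percentage with a configurable number of decimal places.

    Both the function name and the `decimal_places` argument are
    hypothetical; they are not taken from the Outlook plug-in.
    """
    # "%.*f" takes the precision from the argument tuple, so the number
    # of decimal places can be chosen at run time.
    return "%.*f%%" % (decimal_places, value)


# 827 of 1223 messages, shown with one and then with no decimal places.
print(format_percentage(100.0 * 827 / 1223))     # 67.6%
print(format_percentage(100.0 * 827 / 1223, 0))  # 68%
```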

The new stats should look something like this:

        SpamBayes has classified a total of 1223 messages:
            827 ham (67.6% of total)
            333 spam (27.2% of total)
            63 unsure (5.2% of total)

        1125 messages were classified correctly (92.0% of total)
        35 messages were classified incorrectly (2.9% of total)
            0 false positives (0.0% of total)
            35 false negatives (2.9% of total)

        6 unsures trained as ham (9.5% of unsures)
        56 unsures trained as spam (88.9% of unsures)
        1 unsure was not trained (1.6% of unsures)

        A total of 760 messages have been trained:
            346 ham (98.3% ham, 1.7% unsure, 0.0% false positives)
            414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives)
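
The arithmetic behind the sample above can be sketched as follows.  The counter
names (cls_ham, fp, trn_unsure_ham and so on) are taken from Stats.py; the
`summarise` function, its `perc` helper and the result keys are hypothetical.  The
sketch assumes, as the sample figures suggest, that "incorrect" means false
positives plus false negatives, and that the trained ham and spam totals already
include the unsures that were trained.

```python
def summarise(cls_ham, cls_spam, cls_unsure, fp, fn,
              trn_unsure_ham, trn_unsure_spam, trn_ham, trn_spam):
    """Derive the figures of the sample report from the raw counters."""
    def perc(part, whole):
        # Avoid dividing by zero when nothing has been counted yet.
        if not whole:
            return 0.0
        return 100.0 * part / whole

    total = cls_ham + cls_spam + cls_unsure
    incorrect = fp + fn                       # assumption: unsures excluded
    correct = total - cls_unsure - incorrect
    not_trn_unsure = cls_unsure - trn_unsure_ham - trn_unsure_spam
    trn_perc_unsure_ham = perc(trn_unsure_ham, trn_ham)
    trn_perc_fp = perc(fp, trn_ham)
    trn_perc_unsure_spam = perc(trn_unsure_spam, trn_spam)
    trn_perc_fn = perc(fn, trn_spam)
    return {
        "total": total,
        "correct": correct,
        "incorrect": incorrect,
        "not_trn_unsure": not_trn_unsure,
        "trn_total": trn_ham + trn_spam,
        "perc_correct": perc(correct, total),
        "perc_incorrect": perc(incorrect, total),
        "unsure_ham_perc": perc(trn_unsure_ham, cls_unsure),
        "unsure_spam_perc": perc(trn_unsure_spam, cls_unsure),
        "unsure_not_perc": perc(not_trn_unsure, cls_unsure),
        "trn_perc_ham": 100.0 - (trn_perc_unsure_ham + trn_perc_fp),
        "trn_perc_spam": 100.0 - (trn_perc_unsure_spam + trn_perc_fn),
    }


# The counters behind the sample report.
sample = summarise(cls_ham=827, cls_spam=333, cls_unsure=63, fp=0, fn=35,
                   trn_unsure_ham=6, trn_unsure_spam=56,
                   trn_ham=346, trn_spam=414)
print("%(correct)d correct (%(perc_correct).1f%% of total)" % sample)
```

With these counters the sketch gives 1223 classified, 1125 correct, 35 incorrect,
1 untrained unsure and 760 trained messages, matching the sample.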

Index: Stats.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Stats.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** Stats.py	15 Feb 2004 02:15:51 -0000	1.6
--- Stats.py	2 Nov 2004 06:33:23 -0000	1.7
***************
*** 25,34 ****
  """

! # This module is part of the spambayes project, which is Copyright 2002-3
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.

  __author__ = "Tony Meyer <ta-meyer at ihug.co.nz>"
! __credits__ = "Mark Hammond, all the spambayes folk."

  from spambayes.message import msginfoDB
--- 25,42 ----
  """

! # This module is part of the spambayes project, which is Copyright 2002-4
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.

  __author__ = "Tony Meyer <ta-meyer at ihug.co.nz>"
! __credits__ = "Kenny Pitt, Mark Hammond, all the spambayes folk."
! 
! try:
!     True, False
! except NameError:
!     # Maintain compatibility with Python 2.2
!     True, False = 1, 0
! 
! import types

  from spambayes.message import msginfoDB
***************
*** 62,81 ****
              msginfoDB._getState(m)
              if m.c == 's':
                  self.cls_spam += 1
!                 if m.t == 0:
                      self.fp += 1
              elif m.c == 'h':
                  self.cls_ham += 1
!                 if m.t == 1:
                      self.fn += 1
              elif m.c == 'u':
                  self.cls_unsure += 1
!                 if m.t == 0:
                      self.trn_unsure_ham += 1
!                 elif m.t == 1:
                      self.trn_unsure_spam += 1
!             if m.t == 1:
                  self.trn_spam += 1
!             elif m.t == 0:
                  self.trn_ham += 1

--- 70,94 ----
              msginfoDB._getState(m)
              if m.c == 's':
+                 # Classified as spam.
                  self.cls_spam += 1
!                 if m.t == False:
!                     # False positive (classified as spam, trained as ham)
                      self.fp += 1
              elif m.c == 'h':
+                 # Classified as ham.
                  self.cls_ham += 1
!                 if m.t == True:
!                     # False negative (classified as ham, trained as spam)
                      self.fn += 1
              elif m.c == 'u':
+                 # Classified as unsure.
                  self.cls_unsure += 1
!                 if m.t == False:
                      self.trn_unsure_ham += 1
!                 elif m.t == True:
                      self.trn_unsure_spam += 1
!             if m.t == True:
                  self.trn_spam += 1
!             elif m.t == False:
                  self.trn_ham += 1

***************
*** 85,128 ****
          chunks = []
          push = chunks.append
!         perc_ham = 100.0 * self.cls_ham / self.total
!         perc_spam = 100.0 * self.cls_spam / self.total
!         perc_unsure = 100.0 * self.cls_unsure / self.total
          format_dict = {
!             'perc_spam': perc_spam,
!             'perc_ham': perc_ham,
!             'perc_unsure': perc_unsure,
!             'num_seen': self.total
              }
          format_dict.update(self.__dict__)
          # Figure out plurals
!         for num, key in [(self.total, "sp1"), (self.trn_ham, "sp2"),
!                          (self.trn_spam, "sp3"),
!                          (self.trn_unsure_ham, "sp4"),
!                          (self.fp, "sp5"), (self.fn, "sp6")]:
!             if num == 1:
                  format_dict[key] = ''
              else:
                  format_dict[key] = 's'
!         for num, key in [(self.fp, "wp1"), (self.fn, "wp2")]:
!             if num == 1:
!                 format_dict[key] = 'was a'
              else:
                  format_dict[key] = 'were'

!         push("SpamBayes has processed %(num_seen)d message%(sp1)s - " \
!              "%(cls_ham)d (%(perc_ham).0f%%) good, " \
!              "%(cls_spam)d (%(perc_spam).0f%%) spam " \
!              "and %(cls_unsure)d (%(perc_unsure)d%%) unsure." % format_dict)
!         push("%(trn_ham)d message%(sp2)s were manually " \
!              "classified as good (%(fp)d %(wp1)s false positive%(sp5)s)." \
!              % format_dict)
!         push("%(trn_spam)d message%(sp3)s were manually " \
!              "classified as spam (%(fn)d %(wp2)s false negative%(sp6)s)." \
!              % format_dict)
!         push("%(trn_unsure_ham)d unsure message%(sp4)s were manually " \
!              "identified as good, and %(trn_unsure_spam)d as spam." \
!              % format_dict)
          return chunks

  if __name__=='__main__':
      s = Stats()
--- 98,238 ----
          chunks = []
          push = chunks.append
!         not_trn_unsure = self.cls_unsure - self.trn_unsure_ham - \
!                          self.trn_unsure_spam
!         if self.cls_unsure:
!             unsure_ham_perc = 100.0 * self.trn_unsure_ham / self.cls_unsure
!             unsure_spam_perc = 100.0 * self.trn_unsure_spam / self.cls_unsure
!             unsure_not_perc = 100.0 * not_trn_unsure / self.cls_unsure
!         else:
!             unsure_ham_perc = 0.0 # Not correct, really!
!             unsure_spam_perc = 0.0 # Not correct, really!
!             unsure_not_perc = 0.0 # Not correct, really!
!         if self.trn_ham:
!             trn_perc_unsure_ham = 100.0 * self.trn_unsure_ham / \
!                                   self.trn_ham
!             trn_perc_fp = 100.0 * self.fp / self.trn_ham
!             trn_perc_ham = 100.0 - (trn_perc_unsure_ham + trn_perc_fp)
!         else:
!             trn_perc_ham = 0.0 # Not correct, really!
!             trn_perc_unsure_ham = 0.0 # Not correct, really!
!             trn_perc_fp = 0.0 # Not correct, really!
!         if self.trn_spam:
!             trn_perc_unsure_spam = 100.0 * self.trn_unsure_spam / \
!                                    self.trn_spam
!             trn_perc_fn = 100.0 * self.fn / self.trn_spam
!             trn_perc_spam = 100.0 - (trn_perc_unsure_spam + trn_perc_fn)
!         else:
!             trn_perc_spam = 0.0 # Not correct, really!
!             trn_perc_unsure_spam = 0.0 # Not correct, really!
!             trn_perc_fn = 0.0 # Not correct, really!
          format_dict = {
!             'num_seen' : self.total,
!             'correct' : self.total - (self.cls_unsure + self.fp + self.fn),
!             'incorrect' : self.fp + self.fn, # Unsures are reported separately.
!             'unsure_ham_perc' : unsure_ham_perc,
!             'unsure_spam_perc' : unsure_spam_perc,
!             'unsure_not_perc' : unsure_not_perc,
!             'not_trn_unsure' : not_trn_unsure,
!             # trn_ham and trn_spam already include the trained unsures.
!             'trn_total' : self.trn_ham + self.trn_spam,
!             'trn_perc_ham' : trn_perc_ham,
!             'trn_perc_unsure_ham' : trn_perc_unsure_ham,
!             'trn_perc_fp' : trn_perc_fp,
!             'trn_perc_spam' : trn_perc_spam,
!             'trn_perc_unsure_spam' : trn_perc_unsure_spam,
!             'trn_perc_fn' : trn_perc_fn,
              }
          format_dict.update(self.__dict__)
+ 
+         # Add percentages of everything.
+         for key, val in format_dict.items():
+             perc_key = "perc_" + key
+             if self.total and isinstance(val, types.IntType):
+                 format_dict[perc_key] = 100.0 * val / self.total
+             else:
+                 format_dict[perc_key] = 0.0 # Not correct, really!
+ 
          # Figure out plurals
!         for num, key in [("num_seen", "sp1"),
!                          ("correct", "sp2"),
!                          ("incorrect", "sp3"),
!                          ("fp", "sp4"),
!                          ("fn", "sp5"),
!                          ("trn_unsure_ham", "sp6"),
!                          ("trn_unsure_spam", "sp7"),
!                          ("not_trn_unsure", "sp8"),
!                          ("trn_total", "sp9"),
!                          ]:
!             if format_dict[num] == 1:
                  format_dict[key] = ''
              else:
                  format_dict[key] = 's'
!         for num, key in [("correct", "wp1"),
!                          ("incorrect", "wp2"),
!                          ("not_trn_unsure", "wp3"),
!                          ]:
!             if format_dict[num] == 1:
!                 format_dict[key] = 'was'
              else:
                  format_dict[key] = 'were'

! ##        Our result should look something like this:
! ##        (devised by Mark Moraes and Kenny Pitt)
! ##
! ##        SpamBayes has classified a total of 1223 messages:
! ##            827 ham (67.6% of total)
! ##            333 spam (27.2% of total)
! ##            63 unsure (5.2% of total)
! ##
! ##        1125 messages were classified correctly (92.0% of total)
! ##        35 messages were classified incorrectly (2.9% of total)
! ##            0 false positives (0.0% of total)
! ##            35 false negatives (2.9% of total)
! ##
! ##        6 unsures trained as ham (9.5% of unsures)
! ##        56 unsures trained as spam (88.9% of unsures)
! ##        1 unsure was not trained (1.6% of unsures)
! ##
! ##        A total of 760 messages have been trained:
! ##            346 ham (98.3% ham, 1.7% unsure, 0.0% false positives)
! ##            414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives)
! 
!         push("SpamBayes has classified a total of " \
!              "%(num_seen)d message%(sp1)s:" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(cls_ham)d " \
!              "(%(perc_cls_ham).0f%% of total) good" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(cls_spam)d " \
!              "(%(perc_cls_spam).0f%% of total) spam" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(cls_unsure)d " \
!              "(%(perc_cls_unsure).0f%% of total) unsure." % \
!              format_dict)
!         push("%(correct)d message%(sp2)s %(wp1)s classified correctly " \
!              "(%(perc_correct).0f%% of total)" \
!              "<br/>%(incorrect)d message%(sp3)s %(wp2)s classified " \
!              "incorrectly " \
!              "(%(perc_incorrect).0f%% of total)" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(fp)d false positive%(sp4)s " \
!              "(%(perc_fp).0f%% of total)" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(fn)d false negative%(sp5)s " \
!              "(%(perc_fn).0f%% of total)" % \
!              format_dict)
!         push("%(trn_unsure_ham)d unsure%(sp6)s trained as good " \
!              "(%(unsure_ham_perc).0f%% of unsures)" \
!              "<br/>%(trn_unsure_spam)d unsure%(sp7)s trained as spam " \
!              "(%(unsure_spam_perc).0f%% of unsures)" \
!              "<br/>%(not_trn_unsure)d unsure%(sp8)s %(wp3)s not trained " \
!              "(%(unsure_not_perc).0f%% of unsures)" % \
!              format_dict)
!         push("A total of %(trn_total)d message%(sp9)s have been trained:" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(trn_ham)d good " \
!              "(%(trn_perc_ham)0.f%% good, %(trn_perc_unsure_ham)0.f%% " \
!              "unsure, %(trn_perc_fp).0f%% false positives)" \
!              "<br/>&nbsp;&nbsp;&nbsp;&nbsp;%(trn_spam)d spam " \
!              "(%(trn_perc_spam)0.f%% spam, %(trn_perc_unsure_spam)0.f%% " \
!              "unsure, %(trn_perc_fn).0f%% false negatives)" % \
!              format_dict)
          return chunks

+ 
  if __name__=='__main__':
      s = Stats()
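
For readers who want to try the counting logic from the second hunk outside
SpamBayes, here is a minimal standalone sketch.  It works on plain
(classification, trained) pairs instead of message objects; the `tally` function
is hypothetical, and the use of None for "never trained" is an assumption rather
than something the diff shows.

```python
def tally(messages):
    """Count outcomes from (classification, trained) pairs.

    classification is 's', 'h' or 'u'; trained is True (spam),
    False (ham) or None (assumed here to mean "never trained").
    """
    names = ["cls_spam", "cls_ham", "cls_unsure", "fp", "fn",
             "trn_unsure_ham", "trn_unsure_spam", "trn_spam", "trn_ham"]
    counts = dict.fromkeys(names, 0)
    for c, t in messages:
        if c == 's':
            counts["cls_spam"] += 1
            if t is False:
                # False positive (classified as spam, trained as ham)
                counts["fp"] += 1
        elif c == 'h':
            counts["cls_ham"] += 1
            if t is True:
                # False negative (classified as ham, trained as spam)
                counts["fn"] += 1
        elif c == 'u':
            counts["cls_unsure"] += 1
            if t is False:
                counts["trn_unsure_ham"] += 1
            elif t is True:
                counts["trn_unsure_spam"] += 1
        # Training totals count every trained message, whatever its class.
        if t is True:
            counts["trn_spam"] += 1
        elif t is False:
            counts["trn_ham"] += 1
    return counts


counts = tally([('s', True), ('s', False), ('h', True), ('h', None),
                ('u', False), ('u', True), ('u', None)])
print(counts["fp"], counts["fn"], counts["trn_spam"], counts["trn_ham"])
```

Because the training totals are updated for every trained message, an unsure
trained as ham is counted in both trn_unsure_ham and trn_ham.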