[Spambayes-checkins] spambayes proxytrainer.py,1.1,1.2

Skip Montanaro montanaro at users.sourceforge.net
Fri Jan 17 12:40:32 EST 2003


Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv5960

Modified Files:
	proxytrainer.py 
Log Message:
Whole buncha changes...

* slight tweak to css and table layout to make sure discard/defer/ham/spam
  labels and radio buttons line up
* darken the stripe a bit so the alternating lines are a bit more distinct
  (this will probably quickly deteriorate into a matter of personal taste
  and display properties, but I could barely tell the difference between the
  "light" and "dark" lines on my Powerbook)
* first cut at restricted review (no more than 20 per section) - see below
  for why.  Paging with "next"/"prev" isn't supported yet.
* pre-classify messages being displayed if they are currently "unsure".
  This is fairly costly, hence the above view restriction.  Classifying
  them when they arrive is painful as well, because those messages may be
  coming from proxytee running from a local delivery agent like procmail,
  which you generally want to run quickly, especially when lots of mail
  arrives at the same time (think fetchmail, POP, etc).
* allow user to view raw message contents (/view url, onView() method)
* now that the __getattr__ bug has been fixed, dump the
  sys.setrecursionlimit() call.
* delete a number of leftover bits from pop3proxy's testing mode.
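
For reference, the restricted review amounts to showing one window of each
section's messages.  A minimal standalone sketch of that idea, not the code
in the diff: review_window() is a made-up name, the sample data is invented,
and it is written for a current Python rather than the Python 2 the module
uses.

```python
from itertools import islice


def review_window(keyed_messages, start_at=0, how_many=20):
    """Return at most how_many (key, message) pairs, skipping the first
    start_at pairs.  Hypothetical helper, for illustration only."""
    return list(islice(keyed_messages, start_at, start_at + how_many))


# Invented sample data: 50 (key, message) pairs.
messages = [("key%d" % i, "message %d" % i) for i in range(50)]

first_page = review_window(messages)               # items 0..19
second_page = review_window(messages, start_at=20)  # items 20..39
last_page = review_window(messages, start_at=40)    # items 40..49, a short page
```

Passing the next start_at back in a hidden form field, as the diff does with
startAt and howMany, is one way to let a later request pick up where this
page stopped.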

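The pre-classification step can be pictured as follows.  This is only a
sketch under assumptions: button_label() and fake_score() are invented names,
the scorer stands in for the real tokenizer and classifier, and the cutoff
values are placeholders rather than whatever your options file says.

```python
def button_label(label, message, score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Pick which radio button to pre-check for a message.

    Only messages currently labelled 'Unsure' are rescored, so the cost of
    calling score() is paid for those rows alone.  Hypothetical helper.
    """
    if label != "Unsure":
        return label
    prob = score(message)
    if prob < ham_cutoff:
        return "Ham"
    if prob >= spam_cutoff:
        return "Spam"
    return "Unsure"


def fake_score(message):
    """Stand-in for a real classifier: returns a canned spam probability."""
    canned = {"cheap meds": 0.99, "lunch?": 0.01}
    return canned.get(message, 0.5)


labels = [button_label("Unsure", m, fake_score)
          for m in ("cheap meds", "lunch?", "hmm")]
```

Messages that still score between the two cutoffs keep the "Unsure" label, so
their "defer" button stays checked.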

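On the raw message view: the onView() method in this revision only replaces
"<" before pushing the text inside <pre>.  A fuller escape is sketched below
for a current Python, using the standard library's html.escape();
render_raw_message() is a made-up name and not part of the module.

```python
import html


def render_raw_message(headers, payload):
    """Return headers and payload wrapped in <pre>, with &, <, > and quotes
    escaped so the message text cannot be read as markup.  Hypothetical
    helper, for illustration only."""
    return "<pre>%s\n%s</pre>" % (html.escape(headers), html.escape(payload))


page = render_raw_message("Subject: a < b & c", "<b>hi</b>")
```

Escaping "&" as well keeps text such as "&lt;" in a message body from being
displayed as "<".
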
Index: proxytrainer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/proxytrainer.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** proxytrainer.py	16 Jan 2003 17:40:13 -0000	1.1
--- proxytrainer.py	17 Jan 2003 20:40:27 -0000	1.2
***************
*** 230,235 ****
                                 font-weight: bold }
               .sectionbody { padding: 1em }
!              .reviewheaders a { color: #000000 }
!              .stripe_on td { background: #f4f4f4 }
               </style>
               </head>\n"""
--- 230,235 ----
                                 font-weight: bold }
               .sectionbody { padding: 1em }
!              .reviewheaders a { color: #000000; font-weight: bold }
!              .stripe_on td { background: #dddddd }
               </style>
               </head>\n"""
***************
*** 284,287 ****
--- 284,289 ----
                         <input type='hidden' name='prior' value='%d'>
                         <input type='hidden' name='next' value='%d'>
+                        <input type='hidden' name='startAt' value='%d'>
+                        <input type='hidden' name='howMany' value='%d'>
                         <table border='0' cellpadding='0' cellspacing='0'>
                         <tr><td><input type='submit' name='go'
***************
*** 321,330 ****
          """<tr><td><b>Messages classified as %s:</b></td>
            <td><b>From:</b></td>
!           <td class='reviewheaders' nowrap><b>
!               <a href='javascript: onHeader("%s", "Discard");'>Discard</a> /
!               <a href='javascript: onHeader("%s", "Defer");'>Defer</a> /
!               <a href='javascript: onHeader("%s", "Ham");'>Ham</a> /
!               <a href='javascript: onHeader("%s", "Spam");'>Spam</a>
!           </b></td></tr>"""
  
      upload = """<form action='%s' method='POST'
--- 323,331 ----
          """<tr><td><b>Messages classified as %s:</b></td>
            <td><b>From:</b></td>
!           <td class='reviewheaders'><a href='javascript: onHeader("%s", "Discard");'>Discard</a></td>
!           <td class='reviewheaders'><a href='javascript: onHeader("%s", "Defer");'>Defer</a></td>
!           <td class='reviewheaders'><a href='javascript: onHeader("%s", "Ham");'>Ham</a></td>
!           <td class='reviewheaders'><a href='javascript: onHeader("%s", "Spam");'>Spam</a></td>
!           </tr>"""
  
      upload = """<form action='%s' method='POST'
***************
*** 657,669 ****
          return keys, date, prior, start, end
  
!     def appendMessages(self, lines, keyedMessages, label):
          """Appends the lines of a table of messages to 'lines'."""
          buttons = \
!           """<input type='radio' name='classify:%s:%s' value='discard'>&nbsp;
!              <input type='radio' name='classify:%s:%s' value='defer' %s>&nbsp;
!              <input type='radio' name='classify:%s:%s' value='ham' %s>&nbsp;
!              <input type='radio' name='classify:%s:%s' value='spam' %s>"""
          stripe = 0
          for key, message in keyedMessages:
              # Parse the message and get the relevant headers and the first
              # part of the body if we can.
--- 658,677 ----
          return keys, date, prior, start, end
  
!     def appendMessages(self, lines, keyedMessages, label, startAt, howMany):
          """Appends the lines of a table of messages to 'lines'."""
          buttons = \
!           """<td align='center'><input type='radio' name='classify:%s:%s' value='discard'></td>
!              <td align='center'><input type='radio' name='classify:%s:%s' value='defer' %s></td>
!              <td align='center'><input type='radio' name='classify:%s:%s' value='ham' %s></td>
!              <td align='center'><input type='radio' name='classify:%s:%s' value='spam' %s></td>"""
          stripe = 0
+         i = -1
          for key, message in keyedMessages:
+             i += 1
+             if i < startAt:
+                 continue
+             if i >= startAt+howMany:
+                 break
+ 
              # Parse the message and get the relevant headers and the first
              # part of the body if we can.
***************
*** 687,706 ****
              text = self.trimAndQuote(text.strip(), 200, True)
  
              # Output the table row for this message.
              defer = ham = spam = ""
!             if label == 'Spam':
                  spam='checked'
!             elif label == 'Ham':
                  ham='checked'
!             elif label == 'Unsure':
                  defer='checked'
!             subject = "<span title=\"%s\">%s</span>" % (text, subject)
!             radioGroup = buttons % (label, key,
!                                     label, key, defer,
!                                     label, key, ham,
!                                     label, key, spam)
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             <td align='center'>%s</td></tr>""" % \
                              (stripeClass, subject, from_, radioGroup))
              stripe = stripe ^ 1
--- 695,728 ----
              text = self.trimAndQuote(text.strip(), 200, True)
  
+             buttonLabel = label
+             # classify unsure messages
+             if buttonLabel == 'Unsure':
+                 tokens = tokenizer.tokenize(message)
+                 prob, clues = state.bayes.spamprob(tokens, evidence=True)
+                 if prob < options.ham_cutoff:
+                     buttonLabel = 'Ham'
+                 elif prob >= options.spam_cutoff:
+                     buttonLabel = 'Spam'
+ 
              # Output the table row for this message.
              defer = ham = spam = ""
!             if buttonLabel == 'Spam':
                  spam='checked'
!             elif buttonLabel == 'Ham':
                  ham='checked'
!             elif buttonLabel == 'Unsure':
                  defer='checked'
!             subject = ('<span title="%s">'
!                        '<a target=_top href="/view?key=%s&corpus=%s">'
!                        '%s'
!                        '</a>'
!                        '</span>') % (text, key, label, subject)
!             radioGroup = buttons % (buttonLabel, key,
!                                     buttonLabel, key, defer,
!                                     buttonLabel, key, ham,
!                                     buttonLabel, key, spam)
              stripeClass = ['stripe_on', 'stripe_off'][stripe]
              lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
!                             %s</tr>""" % \
                              (stripeClass, subject, from_, radioGroup))
              stripe = stripe ^ 1
***************
*** 712,717 ****
          numTrained = 0
          numDeferred = 0
          for key, value in params.items():
!             if key.startswith('classify:'):
                  id = key.split(':')[2]
                  if value == 'spam':
--- 734,745 ----
          numTrained = 0
          numDeferred = 0
+         startAt = 0
+         howMany = 20
          for key, value in params.items():
!             if key == 'startAt':
!                 startAt = int(value)
!             elif key == 'howMany':
!                 howMany = int(value)
!             elif key.startswith('classify:'):
                  id = key.split(':')[2]
                  if value == 'spam':
***************
*** 797,811 ****
                  nextState = 'disabled'
              lines = [self.onReviewHeader,
!                      self.reviewHeader % (prior, next, priorState, nextState)]
              for header, label in ((options.header_spam_string, 'Spam'),
                                    (options.header_ham_string, 'Ham'),
                                    (options.header_unsure_string, 'Unsure')):
                  if keyedMessages[header]:
!                     lines.append("<tr><td>&nbsp;</td><td></td><td></td></tr>")
                      lines.append(self.reviewSubheader %
                                   (label, label, label, label, label))
!                     self.appendMessages(lines, keyedMessages[header], label)
  
!             lines.append("""<tr><td></td><td></td><td align='center'>&nbsp;<br>
                              <input type='submit' value='Train'></td></tr>""")
              lines.append("</table></form>")
--- 825,842 ----
                  nextState = 'disabled'
              lines = [self.onReviewHeader,
!                      self.reviewHeader % (prior, next,
!                                           startAt+howMany, howMany,
!                                           priorState, nextState)]
              for header, label in ((options.header_spam_string, 'Spam'),
                                    (options.header_ham_string, 'Ham'),
                                    (options.header_unsure_string, 'Unsure')):
                  if keyedMessages[header]:
!                     lines.append("<tr><td>&nbsp;</td><td></td></tr>")
                      lines.append(self.reviewSubheader %
                                   (label, label, label, label, label))
!                     self.appendMessages(lines, keyedMessages[header], label,
!                                         startAt, howMany)
  
!             lines.append("""<tr><td></td><td></td><td align='center' colspan='4'>&nbsp;<br>
                              <input type='submit' value='Train'></td></tr>""")
              lines.append("</table></form>")
***************
*** 853,856 ****
--- 884,906 ----
          self.push(body)
  
+     def onView(self, params):
+         msgkey = corpus = None
+         for key, value in params.items():
+             if key == 'key':
+                 msgkey = value
+             elif key == 'corpus':
+                 corpus = value
+             if msgkey is not None and corpus is not None:
+                 message = state.unknownCorpus.get(msgkey)
+                 if message is None:
+                     self.push("<p>Can't find message %s.\n" % msgkey)
+                     self.push("Maybe it expired.</p>\n")
+                 else:
+                     self.push("<pre>")
+                     self.push(message.hdrtxt.replace("<", "&lt;"))
+                     self.push("\n")
+                     self.push(message.payload.replace("<", "&lt;"))
+                     self.push("</pre>")
+                 msgkey = corpus = None
  
  # This keeps the global state of the module - the command-line options,
***************
*** 878,882 ****
          self.unknownCache = options.pop3proxy_unknown_cache
          self.runTestServer = False
-         self.isTest = False
          if self.gzipCache:
              factory = GzipFileMessageFactory()
--- 928,931 ----
***************
*** 904,937 ****
          print "Done."
  
!         # Don't set up the caches and training objects when running the
!         # self-test, so as not to clutter the filesystem.
!         if not self.isTest:
!             def ensureDir(dirname):
!                 try:
!                     os.mkdir(dirname)
!                 except OSError, e:
!                     if e.errno != errno.EEXIST:
!                         raise
! 
!             # Create/open the Corpuses.
!             map(ensureDir, [self.spamCache, self.hamCache, self.unknownCache])
!             if self.gzipCache:
!                 factory = GzipFileMessageFactory()
!             else:
!                 factory = FileMessageFactory()
!             age = options.pop3proxy_cache_expiry_days*24*60*60
!             self.spamCorpus = ExpiryFileCorpus(age, factory, self.spamCache)
!             self.hamCorpus = ExpiryFileCorpus(age, factory, self.hamCache)
!             self.unknownCorpus = FileCorpus(factory, self.unknownCache)
  
!             # Expire old messages from the trained corpuses.
!             self.spamCorpus.removeExpiredMessages()
!             self.hamCorpus.removeExpiredMessages()
  
!             # Create the Trainers.
!             self.spamTrainer = storage.SpamTrainer(self.bayes)
!             self.hamTrainer = storage.HamTrainer(self.bayes)
!             self.spamCorpus.addObserver(self.spamTrainer)
!             self.hamCorpus.addObserver(self.hamTrainer)
  
  state = State()
--- 953,977 ----
          print "Done."
  
!         def ensureDir(dirname):
!             try:
!                 os.mkdir(dirname)
!             except OSError, e:
!                 if e.errno != errno.EEXIST:
!                     raise
  
!         # Create/open the Corpuses.
!         map(ensureDir, [self.spamCache, self.hamCache, self.unknownCache])
!         if self.gzipCache:
!             factory = GzipFileMessageFactory()
!         else:
!             factory = FileMessageFactory()
!         age = options.pop3proxy_cache_expiry_days*24*60*60
!         self.spamCorpus = ExpiryFileCorpus(age, factory, self.spamCache)
!         self.hamCorpus = ExpiryFileCorpus(age, factory, self.hamCache)
!         self.unknownCorpus = FileCorpus(factory, self.unknownCache)
  
!         # Expire old messages from the trained corpuses.
!         self.spamCorpus.removeExpiredMessages()
!         self.hamCorpus.removeExpiredMessages()
  
  state = State()
***************
*** 949,989 ****
  
  # ===================================================================
- # Test code.
- # ===================================================================
- 
- # One example of spam and one of ham - both are used to train, and are
- # then classified.  Not a good test of the classifier, but a perfectly
- # good test of the POP3 proxy.  The bodies of these came from the
- # spambayes project, and I added the headers myself because the
- # originals had no headers.
- 
- spam1 = """From: friend at public.com
- Subject: Make money fast
- 
- Hello tim_chandler , Want to save money ?
- Now is a good time to consider refinancing. Rates are low so you can cut
- your current payments and save money.
- 
- http://64.251.22.101/interest/index%38%30%300%2E%68t%6D
- 
- Take off list on site [s5]
- """
- 
- good1 = """From: chris at example.com
- Subject: ZPT and DTML
- 
- Jean Jordaan wrote:
- > 'Fraid so ;>  It contains a vintage dtml-calendar tag.
- >   http://www.zope.org/Members/teyc/CalendarTag
- >
- > Hmm I think I see what you mean: one needn't manually pass on the
- > namespace to a ZPT?
- 
- Yeah, Page Templates are a bit more clever, sadly, DTML methods aren't :-(
- 
- Chris
- """
- 
- # ===================================================================
  # __main__ driver.
  # ===================================================================
--- 989,992 ----
***************
*** 1017,1020 ****
  
  if __name__ == '__main__':
-     sys.setrecursionlimit(100)
      run()
--- 1020,1022 ----




