[Spambayes-checkins] spambayes proxytrainer.py,1.1,1.2
Skip Montanaro
montanaro at users.sourceforge.net
Fri Jan 17 12:40:32 EST 2003
Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv5960
Modified Files:
proxytrainer.py
Log Message:
Whole buncha changes...
* slight tweak to css and table layout to make sure discard/defer/ham/spam
labels and radio buttons line up
* darken the stripe a bit so the alternating lines are a bit more distinct
(this will probably quickly deteriorate into a matter of personal taste
and display properties, but I could barely tell the difference between the
"light" and "dark" lines on my Powerbook)
* First cut at restricted review (no more than 20 per section) - see below
for why. Can't page "next", "prev" yet.
* pre-classify displayed messages that are currently "unsure". This is
fairly costly, hence the view restriction above. Classifying them on
arrival is painful as well, because those messages may come from proxytee
run by a local delivery agent like procmail, which you generally want to
finish quickly, especially when lots of mail arrives at the same time
(think fetchmail, POP, etc).
* allow user to view raw message contents (/view url, onView() method)
* now that the __getattr__ bug has been fixed, dump the
sys.setrecursionlimit() call.
* delete a number of leftover bits from pop3proxy's testing mode.
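
The review restriction and the pre-classification of "unsure" messages work together: only the messages inside the visible window get scored, so the cost is bounded by the window size. A minimal sketch of that idea, assuming Python 3 and itertools.islice rather than the manual counter used in the diff. The names review_rows, preselect and fake_score are hypothetical, and the cutoff values are illustrative assumptions, not taken from this checkin:

```python
from itertools import islice

HAM_CUTOFF = 0.2   # assumed value, for illustration only
SPAM_CUTOFF = 0.9  # assumed value, for illustration only


def preselect(prob):
    """Map a spam probability to the radio button to pre-select."""
    if prob < HAM_CUTOFF:
        return "Ham"
    if prob >= SPAM_CUTOFF:
        return "Spam"
    return "Unsure"


def review_rows(keyed_messages, label, score, start_at=0, how_many=20):
    """Yield (key, button_label) pairs for one window of a section.

    Only messages inside the window are passed to score(), and only
    when the section label is "Unsure".
    """
    window = islice(keyed_messages, start_at, start_at + how_many)
    for key, message in window:
        button = label
        if label == "Unsure":
            button = preselect(score(message))
        yield key, button


# Stand-in scorer: each fake "message" is simply its own probability.
scored = []

def fake_score(message):
    scored.append(message)
    return message

messages = [("k%d" % i, i / 100.0) for i in range(100)]
rows = list(review_rows(messages, "Unsure", fake_score))
tail = list(review_rows(messages, "Unsure", fake_score, start_at=85))
```

Here rows holds the first 20 of 100 messages and the scorer ran 20 times for it, not 100. A real scorer would tokenize the message and ask the classifier for a probability, as the appendMessages hunk does.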
Index: proxytrainer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/proxytrainer.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** proxytrainer.py 16 Jan 2003 17:40:13 -0000 1.1
--- proxytrainer.py 17 Jan 2003 20:40:27 -0000 1.2
***************
*** 230,235 ****
font-weight: bold }
.sectionbody { padding: 1em }
! .reviewheaders a { color: #000000 }
! .stripe_on td { background: #f4f4f4 }
</style>
</head>\n"""
--- 230,235 ----
font-weight: bold }
.sectionbody { padding: 1em }
! .reviewheaders a { color: #000000; font-weight: bold }
! .stripe_on td { background: #dddddd }
</style>
</head>\n"""
***************
*** 284,287 ****
--- 284,289 ----
<input type='hidden' name='prior' value='%d'>
<input type='hidden' name='next' value='%d'>
+ <input type='hidden' name='startAt' value='%d'>
+ <input type='hidden' name='howMany' value='%d'>
<table border='0' cellpadding='0' cellspacing='0'>
<tr><td><input type='submit' name='go'
***************
*** 321,330 ****
"""<tr><td><b>Messages classified as %s:</b></td>
<td><b>From:</b></td>
! <td class='reviewheaders' nowrap><b>
! <a href='javascript: onHeader("%s", "Discard");'>Discard</a> /
! <a href='javascript: onHeader("%s", "Defer");'>Defer</a> /
! <a href='javascript: onHeader("%s", "Ham");'>Ham</a> /
! <a href='javascript: onHeader("%s", "Spam");'>Spam</a>
! </b></td></tr>"""
upload = """<form action='%s' method='POST'
--- 323,331 ----
"""<tr><td><b>Messages classified as %s:</b></td>
<td><b>From:</b></td>
! <td class='reviewheaders'><a href='javascript: onHeader("%s", "Discard");'>Discard</a></td>
! <td class='reviewheaders'><a href='javascript: onHeader("%s", "Defer");'>Defer</a></td>
! <td class='reviewheaders'><a href='javascript: onHeader("%s", "Ham");'>Ham</a></td>
! <td class='reviewheaders'><a href='javascript: onHeader("%s", "Spam");'>Spam</a></td>
! </tr>"""
upload = """<form action='%s' method='POST'
***************
*** 657,669 ****
return keys, date, prior, start, end
! def appendMessages(self, lines, keyedMessages, label):
"""Appends the lines of a table of messages to 'lines'."""
buttons = \
! """<input type='radio' name='classify:%s:%s' value='discard'>
! <input type='radio' name='classify:%s:%s' value='defer' %s>
! <input type='radio' name='classify:%s:%s' value='ham' %s>
! <input type='radio' name='classify:%s:%s' value='spam' %s>"""
stripe = 0
for key, message in keyedMessages:
# Parse the message and get the relevant headers and the first
# part of the body if we can.
--- 658,677 ----
return keys, date, prior, start, end
! def appendMessages(self, lines, keyedMessages, label, startAt, howMany):
"""Appends the lines of a table of messages to 'lines'."""
buttons = \
! """<td align='center'><input type='radio' name='classify:%s:%s' value='discard'></td>
! <td align='center'><input type='radio' name='classify:%s:%s' value='defer' %s></td>
! <td align='center'><input type='radio' name='classify:%s:%s' value='ham' %s></td>
! <td align='center'><input type='radio' name='classify:%s:%s' value='spam' %s></td>"""
stripe = 0
+ i = -1
for key, message in keyedMessages:
+ i += 1
+ if i < startAt:
+ continue
+ if i >= startAt+howMany:
+ break
+
# Parse the message and get the relevant headers and the first
# part of the body if we can.
***************
*** 687,706 ****
text = self.trimAndQuote(text.strip(), 200, True)
# Output the table row for this message.
defer = ham = spam = ""
! if label == 'Spam':
spam='checked'
! elif label == 'Ham':
ham='checked'
! elif label == 'Unsure':
defer='checked'
! subject = "<span title=\"%s\">%s</span>" % (text, subject)
! radioGroup = buttons % (label, key,
! label, key, defer,
! label, key, ham,
! label, key, spam)
stripeClass = ['stripe_on', 'stripe_off'][stripe]
lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
! <td align='center'>%s</td></tr>""" % \
(stripeClass, subject, from_, radioGroup))
stripe = stripe ^ 1
--- 695,728 ----
text = self.trimAndQuote(text.strip(), 200, True)
+ buttonLabel = label
+ # classify unsure messages
+ if buttonLabel == 'Unsure':
+ tokens = tokenizer.tokenize(message)
+ prob, clues = state.bayes.spamprob(tokens, evidence=True)
+ if prob < options.ham_cutoff:
+ buttonLabel = 'Ham'
+ elif prob >= options.spam_cutoff:
+ buttonLabel = 'Spam'
+
# Output the table row for this message.
defer = ham = spam = ""
! if buttonLabel == 'Spam':
spam='checked'
! elif buttonLabel == 'Ham':
ham='checked'
! elif buttonLabel == 'Unsure':
defer='checked'
! subject = ('<span title="%s">'
! '<a target=_top href="/view?key=%s&corpus=%s">'
! '%s'
! '</a>'
! '</span>') % (text, key, label, subject)
! radioGroup = buttons % (buttonLabel, key,
! buttonLabel, key, defer,
! buttonLabel, key, ham,
! buttonLabel, key, spam)
stripeClass = ['stripe_on', 'stripe_off'][stripe]
lines.append("""<tr class='%s'><td>%s</td><td>%s</td>
! %s</tr>""" % \
(stripeClass, subject, from_, radioGroup))
stripe = stripe ^ 1
***************
*** 712,717 ****
numTrained = 0
numDeferred = 0
for key, value in params.items():
! if key.startswith('classify:'):
id = key.split(':')[2]
if value == 'spam':
--- 734,745 ----
numTrained = 0
numDeferred = 0
+ startAt = 0
+ howMany = 20
for key, value in params.items():
! if key == 'startAt':
! startAt = int(value)
! elif key == 'howMany':
! howMany = int(value)
! elif key.startswith('classify:'):
id = key.split(':')[2]
if value == 'spam':
***************
*** 797,811 ****
nextState = 'disabled'
lines = [self.onReviewHeader,
! self.reviewHeader % (prior, next, priorState, nextState)]
for header, label in ((options.header_spam_string, 'Spam'),
(options.header_ham_string, 'Ham'),
(options.header_unsure_string, 'Unsure')):
if keyedMessages[header]:
! lines.append("<tr><td>&nbsp;</td><td></td><td></td></tr>")
lines.append(self.reviewSubheader %
(label, label, label, label, label))
! self.appendMessages(lines, keyedMessages[header], label)
! lines.append("""<tr><td></td><td></td><td align='center'>&nbsp;<br>
<input type='submit' value='Train'></td></tr>""")
lines.append("</table></form>")
--- 825,842 ----
nextState = 'disabled'
lines = [self.onReviewHeader,
! self.reviewHeader % (prior, next,
! startAt+howMany, howMany,
! priorState, nextState)]
for header, label in ((options.header_spam_string, 'Spam'),
(options.header_ham_string, 'Ham'),
(options.header_unsure_string, 'Unsure')):
if keyedMessages[header]:
! lines.append("<tr><td>&nbsp;</td><td></td></tr>")
lines.append(self.reviewSubheader %
(label, label, label, label, label))
! self.appendMessages(lines, keyedMessages[header], label,
! startAt, howMany)
! lines.append("""<tr><td></td><td></td><td align='center' colspan='4'>&nbsp;<br>
<input type='submit' value='Train'></td></tr>""")
lines.append("</table></form>")
***************
*** 853,856 ****
--- 884,906 ----
self.push(body)
+ def onView(self, params):
+ msgkey = corpus = None
+ for key, value in params.items():
+ if key == 'key':
+ msgkey = value
+ elif key == 'corpus':
+ corpus = value
+ if msgkey is not None and corpus is not None:
+ message = state.unknownCorpus.get(msgkey)
+ if message is None:
+ self.push("<p>Can't find message %s.\n" % msgkey)
+ self.push("Maybe it expired.</p>\n")
+ else:
+ self.push("<pre>")
+ self.push(message.hdrtxt.replace("<", "&lt;"))
+ self.push("\n")
+ self.push(message.payload.replace("<", "&lt;"))
+ self.push("</pre>")
+ msgkey = corpus = None
# This keeps the global state of the module - the command-line options,
***************
*** 878,882 ****
self.unknownCache = options.pop3proxy_unknown_cache
self.runTestServer = False
- self.isTest = False
if self.gzipCache:
factory = GzipFileMessageFactory()
--- 928,931 ----
***************
*** 904,937 ****
print "Done."
! # Don't set up the caches and training objects when running the
! # self-test, so as not to clutter the filesystem.
! if not self.isTest:
! def ensureDir(dirname):
! try:
! os.mkdir(dirname)
! except OSError, e:
! if e.errno != errno.EEXIST:
! raise
!
! # Create/open the Corpuses.
! map(ensureDir, [self.spamCache, self.hamCache, self.unknownCache])
! if self.gzipCache:
! factory = GzipFileMessageFactory()
! else:
! factory = FileMessageFactory()
! age = options.pop3proxy_cache_expiry_days*24*60*60
! self.spamCorpus = ExpiryFileCorpus(age, factory, self.spamCache)
! self.hamCorpus = ExpiryFileCorpus(age, factory, self.hamCache)
! self.unknownCorpus = FileCorpus(factory, self.unknownCache)
! # Expire old messages from the trained corpuses.
! self.spamCorpus.removeExpiredMessages()
! self.hamCorpus.removeExpiredMessages()
! # Create the Trainers.
! self.spamTrainer = storage.SpamTrainer(self.bayes)
! self.hamTrainer = storage.HamTrainer(self.bayes)
! self.spamCorpus.addObserver(self.spamTrainer)
! self.hamCorpus.addObserver(self.hamTrainer)
state = State()
--- 953,977 ----
print "Done."
! def ensureDir(dirname):
! try:
! os.mkdir(dirname)
! except OSError, e:
! if e.errno != errno.EEXIST:
! raise
! # Create/open the Corpuses.
! map(ensureDir, [self.spamCache, self.hamCache, self.unknownCache])
! if self.gzipCache:
! factory = GzipFileMessageFactory()
! else:
! factory = FileMessageFactory()
! age = options.pop3proxy_cache_expiry_days*24*60*60
! self.spamCorpus = ExpiryFileCorpus(age, factory, self.spamCache)
! self.hamCorpus = ExpiryFileCorpus(age, factory, self.hamCache)
! self.unknownCorpus = FileCorpus(factory, self.unknownCache)
! # Expire old messages from the trained corpuses.
! self.spamCorpus.removeExpiredMessages()
! self.hamCorpus.removeExpiredMessages()
state = State()
***************
*** 949,989 ****
# ===================================================================
- # Test code.
- # ===================================================================
-
- # One example of spam and one of ham - both are used to train, and are
- # then classified. Not a good test of the classifier, but a perfectly
- # good test of the POP3 proxy. The bodies of these came from the
- # spambayes project, and I added the headers myself because the
- # originals had no headers.
-
- spam1 = """From: friend at public.com
- Subject: Make money fast
-
- Hello tim_chandler , Want to save money ?
- Now is a good time to consider refinancing. Rates are low so you can cut
- your current payments and save money.
-
- http://64.251.22.101/interest/index%38%30%300%2E%68t%6D
-
- Take off list on site [s5]
- """
-
- good1 = """From: chris at example.com
- Subject: ZPT and DTML
-
- Jean Jordaan wrote:
- > 'Fraid so ;> It contains a vintage dtml-calendar tag.
- > http://www.zope.org/Members/teyc/CalendarTag
- >
- > Hmm I think I see what you mean: one needn't manually pass on the
- > namespace to a ZPT?
-
- Yeah, Page Templates are a bit more clever, sadly, DTML methods aren't :-(
-
- Chris
- """
-
- # ===================================================================
# __main__ driver.
# ===================================================================
--- 989,992 ----
***************
*** 1017,1020 ****
if __name__ == '__main__':
- sys.setrecursionlimit(100)
run()
--- 1020,1022 ----
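
The onView() method added above escapes only "<" before pushing the raw message inside a pre element. A hedged sketch of a fuller escape, assuming Python 3 and its standard html.escape (which is not available in the Python 2 this checkin targets); render_raw is a hypothetical helper, not part of proxytrainer.py:

```python
from html import escape


def render_raw(hdrtxt, payload):
    """Return an HTML fragment that shows a raw message verbatim.

    escape() replaces "&", "<" and ">", so entities and markup in the
    message are displayed as text rather than interpreted.
    """
    return "<pre>%s\n%s</pre>" % (
        escape(hdrtxt, quote=False),
        escape(payload, quote=False),
    )


page = render_raw("Subject: a < b & c", "<b>not bold</b>")
```

Escaping "&" as well keeps text such as "&lt;" in a message body from being displayed as "<".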