[Spambayes] Classify issue with pop3proxy

Mon Nov 18 14:19:35 2002

If I cut and paste a message into the box, it gets classified. If I open it
through the [file...] button, I get the following result:

===========================
Spam probability: 0.52423052

Clues:

*H*    0.58188342
*S*    0.63034446
x-mailer:none    0.21414650
content-type:text/plain    0.24312113
message-id:invalid    0.93478261
===========================

My guess is that this is a MacOS line-ending issue. However, training works
both ways. The difference I see is line 774 in onTrain, which has no
counterpart in onClassify. I suggest adding it at line 793.

Tested here, it works.
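
For illustration, here is a minimal sketch of what that line does. The helper
name normalize_newlines and the sample message are hypothetical and are not
part of pop3proxy.py; the replace chain itself is the one used in onTrain. In
onClassify the same normalization would presumably run on the message just
before it is passed to tokenizer.tokenize.

```python
def normalize_newlines(message):
    """Convert CRLF (DOS) and bare CR (classic MacOS) line endings to LF."""
    # Order matters: handle '\r\n' first, otherwise each CRLF pair
    # would end up as two newlines.
    return message.replace('\r\n', '\n').replace('\r', '\n')


# A made-up message saved with classic MacOS line endings (bare CR).
mac_message = "Subject: hello\rX-Mailer: Example\r\rBody text\r"
normalized = normalize_newlines(mac_message)

# After normalization the header block can be split into separate lines.
header_lines = normalized.split('\n\n', 1)[0].split('\n')
```

Without this step, a file with bare CR endings contains no '\n' at all, so
code that splits on '\n' sees the whole message as one line. That would be
consistent with the clues above, where almost no headers were recognized.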

From this morning's CVS, pop3proxy.py, line 763 et seq.:

    def onTrain(self, params):
        """Train on an uploaded or pasted message."""
        # Upload or paste?  Spam or ham?
        message = params.get('file') or params.get('text')
        isSpam = (params['which'] == 'Train as Spam')

        # Append the message to a file, to make it easier to rebuild
        # the database later.   This is a temporary implementation -
        # it should keep a Corpus (from Tim Stone's forthcoming message
        # management module) to manage a cache of messages.  It needs
        # to keep them for the HTML retraining interface anyway.
        message = message.replace('\r\n', '\n').replace('\r', '\n') #<====
        if isSpam:
            f = open("_pop3proxyspam.mbox", "a")
        else:
            f = open("_pop3proxyham.mbox", "a")
        f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
        f.write(message)
        f.write("\n\n")
        f.close()

        # Train on the message.
        tokens = tokenizer.tokenize(message)
        self.bayes.learn(tokens, isSpam, True)
        self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
        self.push(self.pageSection % ('Train another', self.train))

    def onClassify(self, params):
        """Classify an uploaded or pasted message."""
        message = params.get('file') or params.get('text')
        tokens = tokenizer.tokenize(message)               #<====
        prob, clues = self.bayes.spamprob(tokens, evidence=True)
        self.push("<p>Spam probability: <b>%.8f</b></p>" % prob)
        self.push("<table class='sectiontable' cellspacing='0'>")
        self.push("<tr><td class='sectionheading'>Clues:</td></tr>\n")
        self.push("<tr><td class='sectionbody'><table>")
        for w, p in clues:
            self.push("<tr><td>%s</td><td>%.8f</td></tr>\n" % (w, p))
        self.push("</table></td></tr></table>")
        self.push("<p>Return <a href='/'>Home</a> or classify another:</p>")
        self.push(self.pageSection % ('Classify another', self.classify))

-- 
Mail is a means of communication. People should ask themselves
about the political implications of their choices (or non-choices)
of tools and technologies. For clean mail:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>