From noreply at sourceforge.net Mon Sep 15 15:31:47 2008
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon, 15 Sep 2008 13:31:47 +0000
Subject: [spambayes-bugs] [ spambayes-Bugs-1600821 ] Classifier UnicodeDecodeError on wrong transfer encoding
Message-ID:

Bugs item #1600821, was opened at 2006-11-22 00:59
Message generated for change (Comment added) made by gelato
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1600821&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request, not
just the latest update.

Category: imapfilter
Group: 1.0.1
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Ivan Vilata i Balaguer (ivilata)
Assigned to: Skip Montanaro (montanaro)
Summary: Classifier UnicodeDecodeError on wrong transfer encoding

Initial Comment:
Running ``sb_imapfilter.py`` 1.0.1 seems to raise the following
``UnicodeDecodeError`` when it comes across a mail with 7-bit content
transfer encoding with 8-bit characters in it while classifying::

  Traceback (most recent call last):
    File "/usr/bin/sb_imapfilter.py", line 924, in ?
      run()
    File "/usr/bin/sb_imapfilter.py", line 914, in run
      imap_filter.Filter()
    File "/usr/bin/sb_imapfilter.py", line 785, in Filter
      self.unsure_folder)
    File "/usr/bin/sb_imapfilter.py", line 703, in Filter
      evidence=True)
    File "/usr/lib/python2.4/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
      clues = self._getclues(wordstream)
    File "/usr/lib/python2.4/site-packages/spambayes/classifier.py", line 496, in _getclues
      clues.sort()
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

I'm attaching the mail which caused this. I know it is not
properly-formatted, but it is a legitimate mail produced by a popular
MUA (Thunderbird 1.5).
Spam is surely formatted even worse.

Someone reported the same problem on the list:
http://www.mail-archive.com/spambayes at python.org/msg04543.html

----------------------------------------------------------------------

Comment By: Sergio Gelato (gelato)
Date: 2008-09-15 15:31

Message:
I've had the same problem, with a similar traceback (also using
spambayes 1.0.4). I was able to identify the exact word in the input
data that triggered the problem. It turns out, however, that changing
the database even slightly (I trained on a portion of the offending
message) makes the symptoms disappear.

In my case, the offending word was "Enk=E4t" (in a qp-encoded,
charset="iso-8859-1" text/plain subpart of a message/rfc822 subpart of
a multipart-mixed message). There were other similarly encoded words
with non-ASCII data earlier in the message (even in the same body
part), but only this one triggered the problem. (I established this by
truncating the input message after a variable number of lines and
noting which inputs caused it to fail.) Extracting the message/rfc822
part and running it alone through sb_filter.py did not trigger the
problem.

While inspecting the spambayes source code, I noticed that tokenizer.py
doesn't seem to take the MIME charset into account. I'm not necessarily
saying that it should; in fact, spambayes must be able to cope with
malformed input data. But the result is that the words coming out of
the tokenizer are not in any well-defined encoding.

clues is a list of (distance, prob, word, record) tuples. When there is
a tie on prob (and therefore also on distance=abs(0.5-prob)), the
sort() method will need to compare the word strings. This is where an
implicit word.decode('ascii') may take place, namely when one of the
operands is of type 'str' and the other one is of type 'unicode'.
Training one more message will change the probabilities and make the
symptoms disappear (or move somewhere else).
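
The tie-breaking behaviour described above can be sketched as follows.
This is an illustration only, not SpamBayes code: make_clue, sort_clues,
sort_clues_safely and normalize_clues are hypothetical names, and the
sample words and probabilities are made up. It is written for Python 3,
where comparing bytes with str raises TypeError; under Python 2 the
analogous comparison of a non-ASCII 'str' with a 'unicode' raised the
UnicodeDecodeError seen in the tracebacks.

```python
# Sketch of why a tie on (distance, prob) exposes mixed token types.
# All names here are hypothetical; this is not the SpamBayes implementation.

def make_clue(prob, word, record=None):
    """Build a (distance, prob, word, record) tuple as described in the thread."""
    return (abs(0.5 - prob), prob, word, record)

def sort_clues(clues):
    """Naive sort: tuples compare element by element, so tied
    (distance, prob) pairs fall through to comparing the words."""
    return sorted(clues)

def sort_clues_safely(clues):
    """Sort on (distance, prob) only, so the words are never compared."""
    return sorted(clues, key=lambda clue: (clue[0], clue[1]))

def normalize_clues(clues, encoding="utf-8"):
    """Coerce every word to bytes so that all tokens share one type.
    The choice of utf-8 is an assumption made for this sketch."""
    return [
        (dist, prob, word.encode(encoding) if isinstance(word, str) else word, rec)
        for dist, prob, word, rec in clues
    ]

# Two clues with the same probability: the sort must compare the words,
# one of which is bytes and the other str.
tied = [make_clue(0.9, b"Enk\xe4t"), make_clue(0.9, "subject")]

# Same words, different probabilities: the words are never compared.
# This mirrors the report that retraining makes the symptom disappear.
untied = [make_clue(0.9, b"Enk\xe4t"), make_clue(0.8, "subject")]

if __name__ == "__main__":
    try:
        sort_clues(tied)
    except TypeError as exc:
        print("tied clues fail:", exc)
    print("untied clues sort:", [c[2] for c in sort_clues(untied)])
    print("key-based sort:", [c[2] for c in sort_clues_safely(tied)])
    print("normalized sort:", [c[2] for c in sort_clues(normalize_clues(tied))])
```

Either workaround avoids the failing comparison: sorting on the numeric
fields alone sidesteps the words entirely, while normalizing the tokens
to a single type keeps the plain tuple sort usable.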
I'd guess that some of the elements of the wordstream returned by the
tokenizer are of the wrong class. They should either all be of type str
or all of type unicode; probably the former.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-10-02 12:55

Message:
Logged In: YES
user_id=44345
Originator: NO

None of these make the current version of sb_filter.py barf. I wonder
if there's something peculiar about the way the mail is transmitted via
IMAP? (Just a wild guess.)

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-10-02 02:06

Message:
Logged In: YES
user_id=44345
Originator: NO

File Added: mailbox

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-10-02 00:46

Message:
Logged In: YES
user_id=97460
Originator: NO

Three examples sent to skip at pobox.com.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-26 04:48

Message:
Logged In: YES
user_id=44345
Originator: NO

jcea,

Do you have an email message I can work with? If so, zip it and send it
to me as an attachment (skip at pobox.com).
Thx,

Skip

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-26 04:02

Message:
Logged In: YES
user_id=97460
Originator: NO

My version is 1.0.4 and the traceback is:

"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.5/site-packages/Milter/__init__.py", line 203, in <lambda>
    milter.set_eom_callback(lambda ctx: ctx.getpriv().eom())
  File "antispam.py", line 513, in eom
    prob=hammiedb.score(msg)
  File "/usr/local/lib/python2.5/site-packages/spambayes/hammie.py", line 62, in score
    return self._scoremsg(msg, evidence)
  File "/usr/local/lib/python2.5/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.5/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.5/site-packages/spambayes/classifier.py", line 496, in _getclues
    clues.sort()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfa in position 0: ordinal not in range(128)
"""

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-17 17:37

Message:
Logged In: YES
user_id=97460
Originator: NO

My version is 1.0.4.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-07 04:11

Message:
Logged In: YES
user_id=44345
Originator: NO

I ran the submitted email through the current sb_filter.py in
Subversion (probably the same classifier as in 1.1a4). It worked for
me. While I don't use the IMAP filter, any of the SpamBayes
applications should use the same classifier code. I'm not sure this is
a problem in the current code. What version of SpamBayes are you using?
----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-05 19:23

Message:
Logged In: YES
user_id=44345
Originator: NO

Do you have a traceback? What version of SpamBayes are you using?

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-05 16:59

Message:
Logged In: YES
user_id=97460
Originator: NO

I'm seeing a lot of current spam (more than one message per hour on my
system) crashing spambayes because the messages are marked as "ascii"
but the body is actually 8-bit.

When my milter spam filter crashes, sendmail disables milter filtering
for 50 seconds because of the failure (that is my configuration, and I
would rather not touch it), so a lot of spam gets through: about
30-100 spams every time this bug hits.

Please increase the priority of this bug a bit... It is hitting. Hard.

----------------------------------------------------------------------