From noreply at sourceforge.net  Fri Dec  5 10:35:59 2008
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri, 05 Dec 2008 09:35:59 +0000
Subject: [spambayes-bugs] [ spambayes-Bugs-1600821 ] Classifier UnicodeDecodeError on wrong transfer encoding
Message-ID:

Bugs item #1600821, was opened at 2006-11-22 00:59
Message generated for change (Comment added) made by gelato
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1600821&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: imapfilter
Group: 1.0.1
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Ivan Vilata i Balaguer (ivilata)
Assigned to: Skip Montanaro (montanaro)
Summary: Classifier UnicodeDecodeError on wrong transfer encoding

Initial Comment:
Running ``sb_imapfilter.py`` 1.0.1 seems to raise the following
``UnicodeDecodeError`` when it comes across a mail with 7-bit content
transfer encoding with 8-bit characters in it while classifying::

  Traceback (most recent call last):
    File "/usr/bin/sb_imapfilter.py", line 924, in ?
      run()
    File "/usr/bin/sb_imapfilter.py", line 914, in run
      imap_filter.Filter()
    File "/usr/bin/sb_imapfilter.py", line 785, in Filter
      self.unsure_folder)
    File "/usr/bin/sb_imapfilter.py", line 703, in Filter
      evidence=True)
    File "/usr/lib/python2.4/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
      clues = self._getclues(wordstream)
    File "/usr/lib/python2.4/site-packages/spambayes/classifier.py", line 496, in _getclues
      clues.sort()
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

I'm attaching the mail which caused this.  I know it is not properly
formatted, but it is a legitimate mail produced by a popular MUA
(Thunderbird 1.5).
Spam is surely formatted even worse.

Someone talked about the same problem on the list:
http://www.mail-archive.com/spambayes at python.org/msg04543.html

----------------------------------------------------------------------

Comment By: Sergio Gelato (gelato)
Date: 2008-12-05 10:35

Message:
I now have the pleasure of submitting a very simple patch for this issue.
I've just had a chance to test it (on spambayes 1.0.4). The only drawback
is that it bumps the minimum required Python version to 2.4, but hopefully
that's not too much of a problem nowadays.

In a nutshell: only sort on the first component of the tuple, like this.

-        clues.sort()
+        clues.sort(key=lambda x:x[0])

----------------------------------------------------------------------

Comment By: Sergio Gelato (gelato)
Date: 2008-09-15 15:31

Message:
I've had the same problem, with a similar traceback (also using spambayes
1.0.4). I was able to identify the exact word in the input data that
triggered the problem. It turns out, however, that changing the database
even slightly (I trained on a portion of the offending message) makes the
symptoms disappear.

In my case, the offending word was "Enk=E4t" (in a qp-encoded,
charset="iso-8859-1" text/plain subpart of a message/rfc822 subpart of a
multipart-mixed message). There were other similarly encoded words with
non-ASCII data earlier in the message (even in the same body part), but
only this one triggered the problem. (I established this by truncating the
input message after a variable number of lines and noting which inputs
were causing it to fail.) Extracting the message/rfc822 part and running
it alone through sb_filter.py did not trigger the problem.

In inspecting the spambayes source code, I noticed that tokenizer.py
doesn't seem to take into account the MIME charset. I'm not necessarily
saying that it should; in fact, spambayes must be able to cope with
malformed input data.
But the result is that the words out of the tokenizer are not in any
well-defined encoding.

clues is a list of (distance, prob, word, record) tuples. When there is a
tie on prob (and therefore also on distance=abs(0.5-prob)), the sort()
method will need to compare the word strings. This is where an implicit
word.decode('ascii') may take place, especially when one of the operands
is of type 'str' and the other one is of type 'unicode'. Training one more
message will change the probabilities and make the symptoms disappear (or
move somewhere else).

I'd guess that some of the elements of the wordstream returned by the
tokenizer are of the wrong class. They should either all be of type str or
all of type unicode; probably the former.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-10-02 12:55

Message:
Logged In: YES
user_id=44345
Originator: NO

None of these make the current version of sb_filter.py barf.  I wonder if
there's something peculiar about the way the mail is transmitted via
IMAP?  (Just a wild guess.)

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-10-02 02:06

Message:
Logged In: YES
user_id=44345
Originator: NO

File Added: mailbox

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-10-02 00:46

Message:
Logged In: YES
user_id=97460
Originator: NO

Three examples sent to skip at pobox.com.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-26 04:48

Message:
Logged In: YES
user_id=44345
Originator: NO

jcea,

Do you have an email message I can work with?  If so, zip it and send
it to me as an attachment (skip at pobox.com).

Thx,

Skip

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-26 04:02

Message:
Logged In: YES
user_id=97460
Originator: NO

My version is 1.0.4 and the traceback is:

"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.5/site-packages/Milter/__init__.py", line 203, in <lambda>
    milter.set_eom_callback(lambda ctx: ctx.getpriv().eom())
  File "antispam.py", line 513, in eom
    prob=hammiedb.score(msg)
  File "/usr/local/lib/python2.5/site-packages/spambayes/hammie.py", line 62, in score
    return self._scoremsg(msg, evidence)
  File "/usr/local/lib/python2.5/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.5/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.5/site-packages/spambayes/classifier.py", line 496, in _getclues
    clues.sort()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfa in position 0: ordinal not in range(128)
"""

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-17 17:37

Message:
Logged In: YES
user_id=97460
Originator: NO

My version is 1.0.4.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-07 04:11

Message:
Logged In: YES
user_id=44345
Originator: NO

I ran the submitted email through the current sb_filter.py in Subversion
(probably the same classifier as in 1.1a4).  It worked for me.  While I
don't use the IMAP filter, any of the SpamBayes applications should use
the same classifier code.  I'm not sure this is a problem in the current
code.  What version of SpamBayes are you using?
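
[Editor's illustration] The tie-breaking behaviour discussed in this thread can be reproduced in miniature. This is only a sketch under stated assumptions, not SpamBayes code: it runs on Python 3, where ordering bytes against str raises TypeError (the thread concerns Python 2, where mixing str and unicode raised UnicodeDecodeError instead), and the clue tuples and helper names are hypothetical and simplified to three fields.

```python
# Python 3 analogue of the failure in clues.sort(); hypothetical data.

def sort_clues_naive(clues):
    """Sort whole tuples: a tie on the leading fields falls through to the word."""
    return sorted(clues)


def sort_clues_by_distance(clues):
    """Sort on the first tuple component only, in the spirit of the patch."""
    return sorted(clues, key=lambda x: x[0])


# Simplified clue tuples of the form (distance, prob, word).
clues = [
    (0.4, 0.9, "Enk\xe4t"),    # text string
    (0.4, 0.9, b"Enk\xe4t"),   # byte string with the same distance and prob
    (0.1, 0.6, "hello"),
]

try:
    sort_clues_naive(clues)
    naive_failed = False
except TypeError:
    # The two tied tuples force a comparison between str and bytes.
    naive_failed = True

ordered = sort_clues_by_distance(clues)
print(naive_failed, [c[0] for c in ordered])
```

With a key function the words are never compared, so mixed string types cannot trigger the error; tuples that tie on distance simply keep their input order, because the sort is stable.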

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2007-09-05 19:23

Message:
Logged In: YES
user_id=44345
Originator: NO

Do you have a traceback?  What version of SpamBayes are you using?

----------------------------------------------------------------------

Comment By: Jesús Cea Avión (jcea)
Date: 2007-09-05 16:59

Message:
Logged In: YES
user_id=97460
Originator: NO

I'm seeing a lot of current spam (>1 per hour on my system) crashing
spambayes because the messages are marked as "ascii" but the body is
actually 8-bit.

Since my milter spam filter crashes and sendmail disables the milter
filtering for 50 seconds because of the failure (my configuration, and I
wouldn't like to touch it), a lot of spam is getting through: about
30-100 spams every time this bug hits.

Please, increase the priority of this bug a bit... It is hitting. Hard.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1600821&group_id=61702

From noreply at sourceforge.net  Fri Dec  5 20:11:29 2008
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri, 05 Dec 2008 19:11:29 +0000
Subject: [spambayes-bugs] [ spambayes-Bugs-2393311 ] GOCR.EXE Has Encountered A Problem
Message-ID:

Bugs item #2393311, was opened at 2008-12-05 14:11
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=2393311&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Outlook
Group: 1.1.x
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Paul I (piorio)
Assigned to: Nobody/Anonymous (nobody)
Summary: GOCR.EXE Has Encountered A Problem

Initial Comment:
During the filtering process, the error dialog "GOCR.EXE has encountered a
problem and needs to close" appears.  Filtering is paused until response
is made.  Originally GOCR.EXE V0.43 but it also happens with V0.46
(10/22/2008).

Log file contains:

Message 'Delivery Status Notification (Failure)' in 'Personal Folders/Inbox' had a Spam classification of 'Unsure'
warning: gocr failed with exit code -1073741676
command line was: 'C:\\PROGRA~1\\SPAMBA~1\\bin\\gocr.exe "c:\\docume~1\\paul\\locals~1\\temp\\tmpnperst-spambayes-image" 2>nul'
warning: gocr failed with exit code -1073741676
command line was: 'C:\\PROGRA~1\\SPAMBA~1\\bin\\gocr.exe "c:\\docume~1\\paul\\locals~1\\temp\\tmpobm2_p-spambayes-image" 2>nul'

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=2393311&group_id=61702

From noreply at sourceforge.net  Fri Dec  5 21:18:55 2008
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri, 05 Dec 2008 20:18:55 +0000
Subject: [spambayes-bugs] [ spambayes-Bugs-2393311 ] GOCR.EXE Has Encountered A Problem
Message-ID:

Bugs item #2393311, was opened at 2008-12-05 13:11
Message generated for change (Comment added) made by montanaro
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=2393311&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Outlook
Group: 1.1.x
>Status: Closed
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: Paul I (piorio)
Assigned to: Nobody/Anonymous (nobody)
Summary: GOCR.EXE Has Encountered A Problem

Initial Comment:
During the filtering process, the error dialog "GOCR.EXE has encountered a
problem and needs to close" appears.  Filtering is paused until response
is made.  Originally GOCR.EXE V0.43 but it also happens with V0.46
(10/22/2008).

Log file contains:

Message 'Delivery Status Notification (Failure)' in 'Personal Folders/Inbox' had a Spam classification of 'Unsure'
warning: gocr failed with exit code -1073741676
command line was: 'C:\\PROGRA~1\\SPAMBA~1\\bin\\gocr.exe "c:\\docume~1\\paul\\locals~1\\temp\\tmpnperst-spambayes-image" 2>nul'
warning: gocr failed with exit code -1073741676
command line was: 'C:\\PROGRA~1\\SPAMBA~1\\bin\\gocr.exe "c:\\docume~1\\paul\\locals~1\\temp\\tmpobm2_p-spambayes-image" 2>nul'

----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2008-12-05 14:18

Message:
I think you're going to need to take this up with the gocr development
team.  While we use it for optical character recognition in SpamBayes, we
don't support it.  The gocr home page is

    http://jocr.sourceforge.net/

(yes, spelled "jocr" instead of "gocr")

Skip

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=2393311&group_id=61702
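
[Editor's illustration] The negative exit code in the log for item #2393311 is easier to read as an unsigned 32-bit value, which is how Windows NTSTATUS codes are normally written. The sketch below is not part of SpamBayes or gocr; the helper name and the small lookup table are assumptions made for illustration, and the table lists only a few well-known NTSTATUS values.

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit process exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


# A few well-known NTSTATUS values (hypothetical, deliberately tiny table).
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

code = to_ntstatus(-1073741676)
name = KNOWN_STATUS.get(code, "unknown")
print(hex(code), name)  # prints: 0xc0000094 STATUS_INTEGER_DIVIDE_BY_ZERO
```

This only names the exception the process reported on exit. It does not say which input image or code path inside gocr caused it, so it could serve as a detail to pass along to the gocr developers rather than as a diagnosis.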