[spambayes-bugs] [ spambayes-Bugs-1175439 ] UnicodeEncodeError raised for bogus Content-Type header

SourceForge.net noreply at sourceforge.net
Sat Apr 2 19:09:10 CEST 2005


Bugs item #1175439, was opened at 2005-04-02 12:09
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1175439&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jim Correia (correia)
Assigned to: Nobody/Anonymous (nobody)
Summary: UnicodeEncodeError raised for bogus Content-Type header

Initial Comment:
Running sb_mboxtrain.py or sb_filter.py on the message below raises 
a UnicodeEncodeError because of its bogus Content-Type header. The 
message is a contrived sample based on an actual message in my mail 
spool from March.

I'm using the SpamBayes 1.0.4 release.

Exception Backtrace:

$ ./scripts/sb_filter.py ~/Desktop/msg.txt 
Traceback (most recent call last):
  File "./scripts/sb_filter.py", line 257, in ?
    main()
  File "./scripts/sb_filter.py", line 248, in main
    action(msg)
  File "./scripts/sb_filter.py", line 180, in filter
    return self.h.filter(msg)
  File "/usr/local/lib/python2.4/site-packages/spambayes/hammie.py", line 109, in filter
    prob, clues = self._scoremsg(msg, True)
  File "/usr/local/lib/python2.4/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.4/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.4/site-packages/spambayes/classifier.py", line 492, in _getclues
    for word in Set(wordstream):
  File "/usr/local/lib/python2.4/sets.py", line 429, in __init__
    self._update(iterable)
  File "/usr/local/lib/python2.4/sets.py", line 383, in _update
    for element in iterable:
  File "/usr/local/lib/python2.4/site-packages/spambayes/tokenizer.py", line 1224, in tokenize
    for tok in self.tokenize_headers(msg):
  File "/usr/local/lib/python2.4/site-packages/spambayes/tokenizer.py", line 1235, in tokenize_headers
    for w in crack_content_xyz(x):
  File "/usr/local/lib/python2.4/site-packages/spambayes/tokenizer.py", line 823, in crack_content_xyz
    for x in msg.get_charsets(None):
  File "/usr/local/lib/python2.4/email/Message.py", line 804, in get_charsets
    return [part.get_content_charset(failobj) for part in self.walk()]
  File "/usr/local/lib/python2.4/email/Message.py", line 784, in get_content_charset
    charset = unicode(charset[2], pcharset).encode('us-ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 0: ordinal not in range(128)
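
The last frame of the backtrace can be reproduced in isolation. What 
follows is only a minimal sketch, written for Python 3 rather than the 
Python 2.4 shown above; the variable names are illustrative and are not 
taken from the email package. It splits the RFC 2231 parameter value 
from the sample header, decodes it with the declared charset, and then 
attempts the same us-ascii encode that fails in get_content_charset.

```python
from urllib.parse import unquote_to_bytes

# RFC 2231 extended parameter value from the sample Content-Type header,
# in the form charset'language'percent-encoded-text.
raw_param = "ISO-8859-1''%E6N%C0%00%00%00%00%15"
pcharset, language, encoded = raw_param.split("'", 2)

# Percent-decode to bytes, then decode those bytes with the declared charset.
decoded = unquote_to_bytes(encoded).decode(pcharset)

# A charset name is expected to be plain ASCII; this value is not,
# so the encode step raises UnicodeEncodeError.
try:
    decoded.encode("us-ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(decoded))
print(error)
```

The first decoded character is U+00E6, which is why the error reports 
position 0.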

===

Sample message:

>From ???@??? Sat Mar  5 09:56:25 2005 -0500
Content-Type: text/plain; charset*=ISO-8859-1''%E6N%C0%00%00%00%00%15;
	format=flowed
To: user at example.com
From: User <user at example.com>
Subject: example
Date: Sat, 5 Mar 2005 09:55:34 -0500

The contents of the body don't matter, just the bogus Content-Type 
header.

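One possible way for a caller to guard against this, sketched here in 
Python 3 under stated assumptions: safe_charsets is a hypothetical 
helper, not part of SpamBayes or the email package, and it is not the 
fix adopted for this bug. It treats a charset parameter that cannot be 
decoded as if it were missing. The mbox "From " line is omitted so that 
the string parses as a plain message.

```python
import email

# The sample message from this report, minus the mbox "From " line.
RAW_MESSAGE = (
    "Content-Type: text/plain; charset*=ISO-8859-1''%E6N%C0%00%00%00%00%15;\n"
    "\tformat=flowed\n"
    "To: user@example.com\n"
    "From: User <user@example.com>\n"
    "Subject: example\n"
    "Date: Sat, 5 Mar 2005 09:55:34 -0500\n"
    "\n"
    "The contents of the body don't matter.\n"
)


def safe_charsets(msg, failobj=None):
    """Hypothetical wrapper: return msg.get_charsets(failobj), or
    [failobj] if a bogus charset parameter raises a UnicodeError."""
    try:
        return msg.get_charsets(failobj)
    except UnicodeError:
        # UnicodeEncodeError is a subclass of UnicodeError.
        return [failobj]


msg = email.message_from_string(RAW_MESSAGE)
charsets = safe_charsets(msg)
print(charsets)
```

A caller such as crack_content_xyz could then iterate over the result 
without the exception propagating up through tokenize.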
