[spambayes-bugs] [ spambayes-Bugs-1175439 ] UnicodeEncodeError raised for bogus Content-Type header
SourceForge.net
noreply at sourceforge.net
Sat Apr 2 19:09:10 CEST 2005
Bugs item #1175439, was opened at 2005-04-02 12:09
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1175439&group_id=61702
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jim Correia (correia)
Assigned to: Nobody/Anonymous (nobody)
Summary: UnicodeEncodeError raised for bogus Content-Type header
Initial Comment:
Running sb_mboxtrain.py or sb_filter.py on the message below (a
contrived sample based on an actual message in my mail spool from
March) raises a UnicodeEncodeError because of the bogus Content-Type
header.
I'm using the SpamBayes 1.0.4 release.
Exception Backtrace:
$ ./scripts/sb_filter.py ~/Desktop/msg.txt
Traceback (most recent call last):
  File "./scripts/sb_filter.py", line 257, in ?
    main()
  File "./scripts/sb_filter.py", line 248, in main
    action(msg)
  File "./scripts/sb_filter.py", line 180, in filter
    return self.h.filter(msg)
  File "/usr/local/lib/python2.4/site-packages/spambayes/hammie.py", line 109, in filter
    prob, clues = self._scoremsg(msg, True)
  File "/usr/local/lib/python2.4/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.4/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.4/site-packages/spambayes/classifier.py", line 492, in _getclues
    for word in Set(wordstream):
  File "/usr/local/lib/python2.4/sets.py", line 429, in __init__
    self._update(iterable)
  File "/usr/local/lib/python2.4/sets.py", line 383, in _update
    for element in iterable:
  File "/usr/local/lib/python2.4/site-packages/spambayes/tokenizer.py", line 1224, in tokenize
    for tok in self.tokenize_headers(msg):
  File "/usr/local/lib/python2.4/site-packages/spambayes/tokenizer.py", line 1235, in tokenize_headers
    for w in crack_content_xyz(x):
  File "/usr/local/lib/python2.4/site-packages/spambayes/tokenizer.py", line 823, in crack_content_xyz
    for x in msg.get_charsets(None):
  File "/usr/local/lib/python2.4/email/Message.py", line 804, in get_charsets
    return [part.get_content_charset(failobj) for part in self.walk()]
  File "/usr/local/lib/python2.4/email/Message.py", line 784, in get_content_charset
    charset = unicode(charset[2], pcharset).encode('us-ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 0: ordinal not in range(128)
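
The last frame of the traceback can be illustrated in isolation. The following is only a rough Python 3 sketch of what that line appears to do with an RFC 2231 style charset*= value: percent-decode it, interpret the bytes in the declared charset, then force the result into us-ascii. The function name decode_rfc2231_charset is hypothetical and is not part of the email package or SpamBayes.

```python
from urllib.parse import unquote_to_bytes


def decode_rfc2231_charset(param_value):
    """Hypothetical helper mimicking the failing step from the traceback.

    param_value has the RFC 2231 form: charset'language'percent-encoded-text
    """
    pcharset, _language, encoded = param_value.split("'", 2)
    # Percent-decode to raw bytes, then interpret them in the declared charset.
    text = unquote_to_bytes(encoded).decode(pcharset or "us-ascii")
    # A charset name is expected to be plain ASCII; this is where it blows up.
    return text.encode("us-ascii")


try:
    decode_rfc2231_charset("ISO-8859-1''%E6N%C0%00%00%00%00%15")
    failure = None
except UnicodeEncodeError as exc:
    # %E6 decodes to U+00E6 at position 0, which us-ascii cannot represent.
    failure = exc
```

With the value from the sample message, the first decoded character is u'\xe6', which matches the character and position reported in the exception.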
===
Sample message:
>From ???@??? Sat Mar 5 09:56:25 2005 -0500
Content-Type: text/plain; charset*=ISO-8859-1''%E6N%C0%00%00%00%00%15;
 format=flowed
To: user at example.com
From: User <user at example.com>
Subject: example
Date: Sat, 5 Mar 2005 09:55:34 -0500

The contents of the body don't matter, just the bogus Content-Type
header.
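
One possible workaround on the caller's side, sketched here in Python 3 and not taken from SpamBayes itself, is to collect the charsets part by part and treat a UnicodeError as "no usable charset". The helper name safe_get_charsets is hypothetical, and the test message below is a trimmed copy of the sample above with only two headers kept.

```python
import email


def safe_get_charsets(msg, failobj=None):
    """Hypothetical stand-in for msg.get_charsets(failobj).

    Walks the message like get_charsets() does, but substitutes failobj
    when a malformed charset parameter raises UnicodeError.
    """
    charsets = []
    for part in msg.walk():
        try:
            charsets.append(part.get_content_charset(failobj))
        except UnicodeError:
            # Bogus charset parameter: fall back instead of aborting.
            charsets.append(failobj)
    return charsets


# Trimmed version of the sample message (assumption: the header was
# originally a single folded Content-Type line).
SAMPLE = (
    "Content-Type: text/plain; charset*=ISO-8859-1''%E6N%C0%00%00%00%00%15;\n"
    " format=flowed\n"
    "Subject: example\n"
    "\n"
    "The contents of the body don't matter.\n"
)

msg = email.message_from_string(SAMPLE)
charsets = safe_get_charsets(msg)
```

Whether the tokenizer should swallow the error like this or emit a dedicated token for the broken header is a separate design choice; the sketch only shows that the exception can be contained at the point where the charsets are gathered.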