[Tutor] unicode encoding hell

David Bear david.bear at asu.edu
Thu Sep 6 05:51:10 CEST 2007


I'm using universal feed parser to grab an rss feed.

I'm careful not to use any sys.stdout, print, file write ops, etc., UNLESS I
use a decode('utf-8') to convert the unicode string I get from feed parser
to utf-8. However, I'm still getting the blasted decode error stating that
one of the items in the unicode string is out of range. I've checked the
encoding from the feed and it does indeed say it is utf-8. The content-type
header is set to application/rss+xml. I am using the following syntax on a
feedparser object:

feedp.entry.title.decode('utf-8', 'xmlcharrefreplace')

I assume it would take any unicode character and 'do the right thing',
including replacing higher ordinal chars with xml entity refs. But I still
get

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 31: ordinal not in range(128)

Clearly, I completely do not understand how unicode is working here. Can
anyone enlighten me?
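
For reference, a minimal sketch of the distinction that seems to be at issue
here. In Python 2, calling decode() on something that is already a unicode
object first encodes it implicitly with the ascii codec, which would explain
an 'ascii' UnicodeEncodeError; and 'xmlcharrefreplace' is an error handler
for encode(), not decode(). The title value below is a hypothetical stand-in
for feedp.entry.title, not data from the actual feed.

```python
# Hypothetical stand-in for the unicode title returned by the feed parser.
# It contains U+2019, the character named in the traceback.
title = u"Here\u2019s a headline"

# unicode -> bytes is an *encode* step. UTF-8 can represent every character.
utf8_bytes = title.encode("utf-8")

# If the target must be plain ASCII, 'xmlcharrefreplace' swaps each
# non-ASCII character for a numeric XML character reference.
ascii_bytes = title.encode("ascii", "xmlcharrefreplace")

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```

Under that reading, decode() would only be needed when starting from raw
bytes, and encode() is the step to apply just before writing or printing.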


--
David Bear
College of Public Programs at Arizona State University


