UnicodeDecodeError help please?

Robin Haswell rob at digital-crocus.com
Fri Apr 7 12:27:24 EDT 2006


Okay, I'm getting really frustrated with Python's Unicode handling. I'm
trying everything I can think of and I can't escape Unicode(En|De)codeError
no matter what I try.

Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May  6 2005, 11:22:24) 
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> import htmlentitydefs
>>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
>>> print char
©
>>> str = u"Apple"
>>> print str
Apple
>>> str + char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
>>> a = str+char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
>>> 
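For reference, a minimal sketch of what appears to be going wrong in the
session above. On Python 2, entitydefs["copy"] is a one-byte Latin-1 string
('\xa9'), and adding it to a unicode object makes Python decode it with the
default encoding, here UTF-8, where a lone 0xa9 byte is not valid. The sketch
is written for Python 3, where the bytes/text split is explicit and the
module is html.entities; on 2.4 the same table should be available as
htmlentitydefs.name2codepoint, with unichr() in place of chr(). The variable
names are only illustrative.

```python
from html.entities import name2codepoint

# The Latin-1 byte that entitydefs["copy"] holds on Python 2
raw = b"\xa9"

# Decoding that byte as UTF-8 fails; this is what the implicit
# conversion in `str + char` attempted
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the byte is actually in works
copy_from_latin1 = raw.decode("latin-1")

# Alternatively, skip the byte string and build the character
# directly from its code point
copy_from_codepoint = chr(name2codepoint["copy"])

text = "Apple" + copy_from_codepoint
```

Either route gives a real Unicode character, so the concatenation no longer
depends on the default encoding.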

Basically my app is a search engine. I'm grabbing content from pages
using HTMLParser and storing it in a database, but I'm running into these
problems all over the shop (from decoding the entities to calling
str.lower()). I don't know what encoding my pages are coming in as; I'm
just happy enough to accept that they're either UTF-8 or latin-1 with
entities.
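A rough sketch of the "either UTF-8 or latin-1 with entities" assumption,
decoding once at the boundary so that everything afterwards (entity
replacement, lower(), database storage) works on Unicode text. The helper
names to_text and replace_entities are hypothetical, and the code is written
for Python 3; on 2.4 the table would come from htmlentitydefs and chr()
would be unichr().

```python
import re
from html.entities import name2codepoint


def to_text(data):
    """Decode page bytes as UTF-8, falling back to Latin-1.

    Assumption: only those two encodings occur in the input.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail
        return data.decode("latin-1")


def replace_entities(text):
    """Replace named entities such as &copy; with their characters."""
    def _sub(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Leave unknown entities untouched
        return match.group(0)
    return re.sub(r"&(\w+);", _sub, text)


page = replace_entities(to_text(b"Apple &copy; caf\xe9"))
```

The fallback is only a guess: Latin-1 bytes that happen to form valid UTF-8
would be decoded as UTF-8, so an encoding declared by the page or the HTTP
headers should take priority where one is available.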

Any help would be great, I just hope that I have a brainwave over the
weekend because I've lost two days to Unicode errors now. It's even worse
that I've written the same app in PHP before with none of these problems -
and PHP4 doesn't even support Unicode.

Cheers

-Rob


