UnicodeDecodeError help please?

Fri Apr 7 12:52:52 EDT 2006

Robin Haswell wrote:

> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May  6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'

that's bad.  do not hack the default encoding.  it'll only make you sorry
when you try to port your code to some other python installation, or use
a library that relies on the factory setting ("ascii") being what it's
supposed to be.  do not hack the default encoding.

back to your code:

> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> ©

that's a standard (8-bit) string:

>>> type(char)
<type 'str'>
>>> ord(char)
169
>>> len(char)
1

one byte that contains the value 169.  looks like ISO-8859-1 (Latin-1) to me.
let's see what the documentation says:

entitydefs
    A dictionary mapping XHTML 1.0 entity definitions to their replacement
    text in ISO Latin-1.

alright, so it's an ISO Latin-1 string.

> >>> str = u"Apple"
> >>> print str
> Apple

>>> type(str)
<type 'unicode'>
>>> len(str)
5

that's a 5-character unicode string.

> >>> str + char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte

you're trying to combine an 8-bit string with a Unicode string, and you've
told Python (by hacking the site module) to treat all 8-bit strings as if they
contain UTF-8.  UTF-8 != ISO-Latin-1: the byte 0xa9 is a complete character
in Latin-1, but it is not a valid UTF-8 sequence on its own.
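
a minimal sketch of that mismatch, written so that it should run on both
Python 2 and Python 3.  the variable names are only illustrative:

```python
raw = b"\xa9"  # the single byte that entitydefs["copy"] holds on Python 2

# Latin-1 maps each byte to the code point with the same number
as_latin1 = raw.decode("iso-8859-1")

# decoding the same byte as UTF-8 raises UnicodeDecodeError
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# the UTF-8 encoding of the copyright sign is two bytes, not one
as_utf8_bytes = as_latin1.encode("utf-8")
```

this is the same decode step that Python 2 performs implicitly when you add
an 8-bit string to a Unicode string, using whatever the default encoding is.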

so, you can of course convert the string you got from the entitydefs dict
to a unicode string before you combine the two strings:

    >>> unicode(char, "iso-8859-1") + str
    u'\xa9Apple'

but the htmlentitydefs module offers a better alternative:

name2codepoint
    A dictionary that maps HTML entity names to the Unicode
    codepoints. New in version 2.3.

which allows you to do

>>> char = unichr(htmlentitydefs.name2codepoint["copy"])
>>> char
u'\xa9'
>>> char + str
u'\xa9Apple'

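without having to deal with things like entitydefs values that aren't single
characters (entities outside Latin-1 come back as character references such
as "&#8594;"):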

>>> len(htmlentitydefs.entitydefs["copy"])
1
>>> len(htmlentitydefs.entitydefs["rarr"])
7
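
if you look up entities in many places, the name2codepoint approach can be
wrapped in a small helper.  this is only a sketch: entity_to_unicode is a
hypothetical name, not something the standard library provides, and the
try/except imports are there so the same code should also run on Python 3,
where the module is called html.entities:

```python
try:
    from htmlentitydefs import name2codepoint   # Python 2
except ImportError:
    from html.entities import name2codepoint    # Python 3

try:
    unichr          # Python 2 builtin
except NameError:
    unichr = chr    # Python 3: chr() already returns Unicode text


def entity_to_unicode(name):
    """Return the one-character Unicode string for an HTML entity name.

    Hypothetical helper for illustration; raises KeyError for names
    that are not in name2codepoint.
    """
    return unichr(name2codepoint[name])
```

with this, entity_to_unicode("copy") and entity_to_unicode("rarr") both give
a Unicode string of length 1, whether or not the character exists in Latin-1.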

> Basically my app is a search engine - I'm grabbing content from pages
> using HTMLParser and storing it in a database but I'm running in to these
> problems all over the shop (from decoding the entities to calling
> str.lower()) - I don't know what encoding my pages are coming in as, I'm
> just happy enough to accept that they're either UTF-8 or latin-1 with
> entities.

UTF-8 and Latin-1 are two different things, so your (international) users
will hate you if you don't do this right.
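
if you really have no encoding information for a page, one common heuristic
(an assumption on my part, not a complete solution) is to try UTF-8 first and
fall back to Latin-1, then work with Unicode strings from there on.
decode_page is a hypothetical helper name; when the HTTP headers or the page
itself declare a charset, you should use that instead of guessing:

```python
def decode_page(raw):
    """Decode raw page bytes to a Unicode string, guessing the encoding.

    Hypothetical helper: tries strict UTF-8 first, because text in other
    encodings rarely happens to be valid UTF-8, then falls back to Latin-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this
        # decode cannot fail (though it may guess wrong for other encodings)
        return raw.decode("iso-8859-1")
```

once everything is Unicode, operations like lower() and concatenation with
the entity characters no longer depend on the default encoding.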

> It's even worse that I've written the same app in PHP before with none of
> these problems - and PHP4 doesn't even support Unicode.

a PHP4 application without I18N problems?  I'm not sure I believe you... ;-)

</F>