UnicodeDecodeError help please?
Fredrik Lundh
fredrik at pythonware.com
Fri Apr 7 12:52:52 EDT 2006
Robin Haswell wrote:
> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May 6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'
that's bad. do not hack the default encoding. it'll only make you sorry
when you try to port your code to some other python installation, or use
a library that relies on the factory settings being what they're supposed
to be. do not hack the default encoding.
back to your code:
> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> ©
that's a standard (8-bit) string:
>>> type(char)
<type 'str'>
>>> ord(char)
169
>>> len(char)
1
one byte that contains the value 169. looks like ISO-8859-1 (Latin-1) to me.
let's see what the documentation says:
    entitydefs
        A dictionary mapping XHTML 1.0 entity definitions to their replacement
        text in ISO Latin-1.
alright, so it's an ISO Latin-1 string.
> >>> str = u"Apple"
> >>> print str
> Apple
>>> type(str)
<type 'unicode'>
>>> len(str)
5
that's a 5-character unicode string.
> >>> str + char
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte
you're trying to combine an 8-bit string with a Unicode string, and you've
told Python (by hacking the site module) to treat all 8-bit strings as if they
contain UTF-8. UTF-8 != ISO-Latin-1: a lone 0xa9 byte is © in Latin-1, but
it isn't a valid UTF-8 sequence, hence the error.
so, you can of course convert the string you got from the entitydefs dict
to a unicode string before you combine the two strings:
>>> unicode(char, "iso-8859-1") + str
u'\xa9Apple'
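For readers on Python 3, where bytes and text are separate types, the same point can be sketched as below. This is only an illustration, not code from the thread: the variable names are made up, and the byte value is copied by hand from the session above rather than read from the module.

```python
# Python 3 sketch: the byte/text mix-up has to be resolved by an explicit decode.
raw = b"\xa9"  # the single byte that entitydefs["copy"] holds on Python 2

# Latin-1 maps every byte to the code point with the same number.
as_latin1 = raw.decode("iso-8859-1")

# In UTF-8, 0xA9 is a continuation byte and cannot stand on its own.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The actual UTF-8 encoding of the copyright sign takes two bytes.
as_utf8_bytes = "\u00a9".encode("utf-8")

# Once decoded, concatenation with other text is unproblematic.
combined = as_latin1 + "Apple"
```

Decoding with the codec the data was really written in is the whole trick; the default encoding never enters into it.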
but the htmlentitydefs module offers a better alternative:
    name2codepoint
        A dictionary that maps HTML entity names to the Unicode
        codepoints. New in version 2.3.
which allows you to do
>>> char = unichr(htmlentitydefs.name2codepoint["copy"])
>>> char
u'\xa9'
>>> char + str
u'\xa9Apple'
without having to deal with things like
>>> len(htmlentitydefs.entitydefs["copy"])
1
>>> len(htmlentitydefs.entitydefs["rarr"])
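7
(entities that don't fit in Latin-1 are stored as character references, so
"rarr" comes back as the 7-byte string "&#8594;" rather than as a character.)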
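A rough Python 3 equivalent of the name2codepoint approach, assuming the module's current home in html.entities (htmlentitydefs was renamed there). The helper name entity_to_text is hypothetical and exists only for this sketch.

```python
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs


def entity_to_text(name):
    """Return the one-character string for an HTML entity name such as "copy".

    Raises KeyError for names that are not in the table.
    """
    return chr(name2codepoint[name])


copy_sign = entity_to_text("copy")
arrow = entity_to_text("rarr")
```

Both results are single characters, whether or not the entity fits in Latin-1, so there is no special case to handle for things like "rarr".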
> Basically my app is a search engine - I'm grabbing content from pages
> using HTMLParser and storing it in a database but I'm running in to these
> problems all over the shop (from decoding the entities to calling
> str.lower()) - I don't know what encoding my pages are coming in as, I'm
> just happy enough to accept that they're either UTF-8 or latin-1 with
> entities.
UTF-8 and Latin-1 are two different things, so your (international) users
will hate you if you don't do this right.
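One common heuristic for the "either UTF-8 or Latin-1" situation, sketched in Python 3 under stated assumptions: try strict UTF-8 first and fall back to Latin-1 only if that fails. This is not something proposed in the thread, decode_page is a made-up name, and a real crawler should prefer whatever charset the HTTP headers or the page itself declare before guessing.

```python
def decode_page(raw, fallback="iso-8859-1"):
    """Decode page bytes to text, returning (text, encoding_used).

    Strict UTF-8 is tried first because non-ASCII Latin-1 text is rarely
    valid UTF-8 by accident; Latin-1 accepts any byte sequence, so the
    fallback itself cannot fail.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


utf8_result = decode_page(b"caf\xc3\xa9")
latin1_result = decode_page(b"caf\xe9")
```

The guess can still be wrong (any other 8-bit encoding will be read as Latin-1), which is why it should be the last resort and not the first step.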
> It's even worse that I've written the same app in PHP before with none of
> these problems - and PHP4 doesn't even support Unicode.
a PHP4 application without I18N problems? I'm not sure I believe you... ;-)
</F>