UnicodeDecodeError help please?

Ben C spamspam at spam.eggs
Fri Apr 7 13:04:36 EDT 2006


On 2006-04-07, Robin Haswell <rob at digital-crocus.com> wrote:
> Okay I'm getting really frustrated with Python's Unicode handling, I'm
> trying everything I can think of and I can't escape Unicode(En|De)codeError
> no matter what I try.
>
> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May  6 2005, 11:22:24) 
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'
> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> ©
> >>> str = u"Apple"
> >>> print str
> Apple
> >>> str + char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
> >>> a = str+char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

Try this:

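import htmlentitydefs

# entitydefs values are Latin-1 byte strings, so decode them as Latin-1
char = htmlentitydefs.entitydefs["copy"]
char = unicode(char, "Latin1")

str = u"Apple"  # note: this name shadows the builtin str
print str
print str + char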

htmlentitydefs.entitydefs is "A dictionary mapping XHTML 1.0 entity
definitions to their replacement text in ISO Latin-1".

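So you get "char" back as a Latin-1 byte string (a plain str). When you
add it to u"Apple", Python first tries to decode it with the default
encoding, which is UTF-8 on your system, and a lone 0xa9 byte is not
valid UTF-8. Hence the UnicodeDecodeError. Decoding it explicitly with
the builtin function unicode gives you a unicode string (which, as I
understand it, has no encoding of its own; it's just unicode). That can
be added to u"Apple" and printed out.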
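
To make the failure and the fix concrete, here is a minimal sketch. It is
written for Python 3 (an assumption on my part, the session above is
Python 2.4), where bytes.decode() plays the role of unicode(). The
variable names are only illustrative.

```python
# Python 3 sketch: bytes.decode() stands in for Python 2's unicode().
raw = b"\xa9"  # the single Latin-1 byte for the copyright sign

# Decoding as UTF-8 fails, because 0xa9 on its own is not a valid
# UTF-8 sequence. This mirrors the traceback in the quoted session.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes are really in succeeds.
char = raw.decode("latin-1")
combined = "Apple" + char

print(utf8_ok)  # False
print(combined)
```

The point is that the decode step has to name the encoding the bytes were
actually written in; nothing in the bytes themselves says which one that is.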

It prints out OK on a UTF-8 terminal, but you can print it in other
encodings using encode:

print (str + char).encode("Latin1")

for example.
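
Going the other way, a small sketch of what encode produces for different
target encodings. Again this is Python 3 syntax (my assumption, not the
original poster's setup), and the variable names are made up for the
example.

```python
# One unicode string, encoded three different ways.
text = "Apple\u00a9"  # "Apple" followed by the copyright sign

as_latin1 = text.encode("latin-1")  # the sign becomes one byte, 0xa9
as_utf8 = text.encode("utf-8")      # the sign becomes two bytes, 0xc2 0xa9

# ASCII has no copyright sign, so a strict encode would raise
# UnicodeEncodeError; errors="replace" substitutes "?" instead.
as_ascii = text.encode("ascii", errors="replace")

print(as_latin1)
print(as_utf8)
print(as_ascii)
```

Which of these is right depends entirely on what the terminal or file on
the receiving end expects.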

For your search engine you should look at server headers, meta tags,
BOMs, and guesswork, in roughly that order, to determine the encoding of
the source document. Convert everything to unicode (using the builtin
function unicode) as soon as you read it, build your indexes etc. from
that, and encode only when you write results out, in whatever encoding
you need (probably UTF-8).
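
A rough sketch of that lookup order, in Python 3 (an assumption, as
above). The function names detect_encoding and to_text, the regular
expressions, the 2048-byte window for the meta tag search, and the
UTF-8-then-Latin-1 guess are all my own illustrative choices, not
anything standard. A real crawler would want something more careful.

```python
import codecs
import re

# Hypothetical patterns for pulling a charset name out of a header
# value or a <meta> tag. They are deliberately simple.
_HEADER_RE = re.compile(r"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)
_META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""",
                      re.IGNORECASE)

# Byte order marks and the codec that consumes each of them.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def _known(name):
    """Return name if Python has a codec for it, otherwise None."""
    try:
        codecs.lookup(name)
        return name
    except LookupError:
        return None


def detect_encoding(raw, content_type=None):
    """Pick an encoding for the bytes in raw, checking in the order
    server header, meta tag, BOM, guesswork."""
    # 1. Server header, e.g. "text/html; charset=ISO-8859-1"
    if content_type:
        match = _HEADER_RE.search(content_type)
        if match and _known(match.group(1)):
            return match.group(1)
    # 2. A <meta ... charset=...> tag near the top of the document
    match = _META_RE.search(raw[:2048])
    if match:
        name = match.group(1).decode("ascii")
        if _known(name):
            return name
    # 3. Byte order mark
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 4. Guesswork: valid UTF-8 is unlikely to happen by accident, and
    #    Latin-1 can decode any byte sequence, so it is the last resort.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def to_text(raw, content_type=None):
    """Decode raw bytes to a unicode string using detect_encoding."""
    return raw.decode(detect_encoding(raw, content_type), errors="replace")


print(detect_encoding(b"\xa9"))
print(to_text(b"caf\xe9", "text/html; charset=ISO-8859-1"))
```

Everything downstream of to_text then works on unicode strings only, and
the encode step happens once, at output time.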

HTH.


