Treating a unicode string as latin-1

Thu Jan 3 08:52:08 EST 2008

Simon Willison wrote:

> Hello,
> 
> I'm using ElementTree to parse an XML file which includes some data
> encoded as cp1252, for example:
> 
> <name>Bob\x92s Breakfast</name>
> 
> If this was a regular bytestring, I would convert it to utf8 using the
> following:
> 
> >>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Bob's Breakfast
> 
> But ElementTree gives me back a unicode string, so I get the following
> error:
> 
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in
> position 3: ordinal not in range(128)
> 
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

I don't get your problem. You get a unicode object, which means ET has
already decoded it for you, as any XML parser must do.
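
To illustrate the decoding step, here is a minimal sketch, assuming Python 3 (where str is the unicode type) and two made-up documents that are not from the original post. Both contain the same raw byte 0x92; only the declared encoding differs. It suggests that the parser decodes according to the XML declaration, so a u'\x92' in the result would be consistent with data that was declared (or defaulted) as something other than cp1252.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: identical bytes, different declared encodings.
declared_cp1252 = (
    b'<?xml version="1.0" encoding="cp1252"?>'
    b'<name>Bob\x92s Breakfast</name>'
)
declared_latin1 = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b'<name>Bob\x92s Breakfast</name>'
)

# The parser hands back already-decoded text in both cases.
text_cp1252 = ET.fromstring(declared_cp1252).text
text_latin1 = ET.fromstring(declared_latin1).text

# Byte 0x92 decoded as cp1252 is U+2019 (right single quotation mark);
# decoded as ISO-8859-1 it is the control character U+0092.
print(repr(text_cp1252))
print(repr(text_latin1))
```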

So why don't you get rid of that .decode('cp1252') and happily encode it
to utf-8?
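
For background on the traceback: in Python 2, calling .decode() on a unicode object first encodes it implicitly with the ascii codec, which is where the UnicodeEncodeError comes from. The sketch below, written for Python 3 and using the string from the question, shows what encoding directly produces. It also shows one possible repair, on the assumption that the text really was cp1252 read as latin-1: round-trip through latin-1 to get the original bytes back, then decode them as cp1252. The variable names are illustrative only.

```python
# The already-decoded string from the question (U+0092 is a C1 control character).
text = 'Bob\x92s Breakfast'

# Encoding directly works, but it preserves U+0092 as-is.
direct = text.encode('utf-8')

# Possible repair, assuming the source bytes were cp1252 misread as latin-1:
# latin-1 maps code points 0-255 back to the same byte values.
repaired = text.encode('latin-1').decode('cp1252')
fixed = repaired.encode('utf-8')

print(repr(direct))
print(repr(repaired))
print(repr(fixed))
```

If the repair assumption does not hold (for example, the string contains characters above U+00FF), the encode('latin-1') step raises UnicodeEncodeError rather than silently producing wrong output.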

Diez