Treating a unicode string as latin-1
Diez B. Roggisch
deets at nospam.web.de
Thu Jan 3 08:52:08 EST 2008
Simon Willison wrote:
> Hello,
>
> I'm using ElementTree to parse an XML file which includes some data
> encoded as cp1252, for example:
>
> <name>Bob\x92s Breakfast</name>
>
> If this was a regular bytestring, I would convert it to utf8 using the
> following:
>
> >>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Bob’s Breakfast
>
> But ElementTree gives me back a unicode string, so I get the following
> error:
>
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)
>
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?
I don't understand your problem. You get a unicode object, which means that
ET has already decoded it for you, as any XML parser must do.
So why don't you drop that .decode('cp1252') and simply encode it
to utf-8?
Diez
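
For readers following the thread, here is a minimal sketch of both situations. In Python 2, calling .decode() on a unicode object first encodes it implicitly as ASCII, which is where the UnicodeEncodeError above comes from. The sample document below is hypothetical: it assumes the bytes were really cp1252 but were parsed as ISO-8859-1, which would explain a u'\x92' showing up in the parsed text. The names raw, text, and repaired are illustrative only, and the code is written to run on Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the payload byte \x92 is cp1252 for a curly
# apostrophe, but the document declares ISO-8859-1 (an assumption
# made for this sketch, not something stated in the thread).
raw = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
       b'<name>Bob\x92s Breakfast</name>')

# The parser always hands back decoded (unicode) text.
text = ET.fromstring(raw).text

# Diez's suggestion: no .decode() call, just encode the unicode text.
as_utf8 = text.encode('utf-8')

# If the text was mis-decoded as latin-1, recover the original bytes
# and decode them with the codec that actually applies.
repaired = text.encode('latin-1').decode('cp1252')
repaired_utf8 = repaired.encode('utf-8')
```

The latin-1 round trip works because latin-1 maps each code point below 256 to the byte with the same value, so encoding with it gives back the original bytes unchanged. It is only appropriate when the string is known to have been mis-decoded that way.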