Treating a unicode string as latin-1

Simon Willison simon at simonwillison.net
Thu Jan 3 08:31:46 EST 2008


Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this were a regular bytestring, I would convert it to UTF-8 using the
following:

>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Bob’s Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:

>>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Thanks,

Simon Willison
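
One possible approach, offered as a sketch rather than a confirmed answer: the traceback above occurs because Python 2 implicitly encodes a unicode string to ASCII before it can decode it. Since latin-1 maps code points U+0000 through U+00FF one-to-one onto bytes 0x00 through 0xFF, encoding with latin-1 should recover the original bytes, which can then be decoded as cp1252. The helper name reinterpret_as_cp1252 is hypothetical, and the sketch assumes every character in the string is below U+0100.

```python
def reinterpret_as_cp1252(text):
    """Hypothetical helper: treat a mis-decoded unicode string as cp1252 bytes.

    Assumes every character in `text` is in the range U+0000..U+00FF,
    i.e. the string was effectively decoded as latin-1.
    """
    # latin-1 maps each code point below U+0100 to the byte with the same value,
    # so this step recovers the raw bytes without changing them.
    raw_bytes = text.encode('latin-1')
    # Decode those bytes with the encoding the data was really written in.
    return raw_bytes.decode('cp1252')


fixed = reinterpret_as_cp1252(u'Bob\x92s Breakfast')
# The cp1252 byte 0x92 is the right single quotation mark, U+2019.
utf8_bytes = fixed.encode('utf8')
```

If the string contains any character above U+00FF, the latin-1 step raises UnicodeEncodeError, so this only suits data that was mis-decoded as latin-1 in the first place.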


