[Tutor] encoding question
Alex Kleider
akleider at sonic.net
Mon Jan 6 00:03:01 CET 2014
On 2014-01-05 14:26, Steven D'Aprano wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>
>> Danny walked you through the XML. Note that he didn't decode the
>> response. It includes an encoding on the first line:
>>
>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that
> was wishful thinking.
>
>> tree = ET.fromstring(response.read())
I believe you were correct the first time.
In my experience, even though the XML declares its encoding as
ISO-8859-1 (which I believe is synonymous with Latin-1), my script
(specifically Python's XML parser, xml.etree.ElementTree) didn't work
until I decoded the response from Latin-1 into Unicode and then encoded
it as UTF-8 before parsing. Here's the snippet, with comments recording
the painful lessons learned:
"""
import urllib2
import xml.etree.ElementTree as ET

# url_format_str and ip_address are defined elsewhere in the script.
response = urllib2.urlopen(url_format_str % (ip_address,))
encoding = response.headers.getparam('charset')
info = response.read().decode(encoding)
# <info> is now <type 'unicode'>.
n = info.find('\n')
xml = info[n+1:]  # Get rid of a header line.
# root = ET.fromstring(xml)  # This causes error:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
# in position 456: ordinal not in range(128)
# Passing UTF-8 bytes instead of a unicode object avoids the error.
root = ET.fromstring(xml.encode("utf-8"))
"""
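For anyone who wants to try this without the web service, here is a minimal sketch of the same decode, strip, re-encode sequence on an in-memory sample. The sample bytes, the element names, and the assumption that the stripped "header line" is the XML declaration are all mine, not taken from the real response. It also suggests why stripping that line matters: with no declaration left, the parser falls back to UTF-8, whereas UTF-8 bytes that still carry the ISO-8859-1 declaration come out garbled.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.read(): Latin-1 bytes with a declaration.
raw = (b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
       b'<Response><City>Bogot\xe1</City></Response>')

text = raw.decode("iso-8859-1")       # bytes -> Unicode text
body = text[text.find("\n") + 1:]     # drop the first line (the declaration)

# No declaration remains, so the parser assumes UTF-8 and reads this cleanly.
root = ET.fromstring(body.encode("utf-8"))
city = root.find("City").text

# Keeping the ISO-8859-1 declaration while sending UTF-8 bytes mislabels
# the data: each UTF-8 byte is read as a separate Latin-1 character.
mislabeled = ET.fromstring(text.encode("utf-8")).find("City").text
```

Here `city` holds the correctly accented name, while `mislabeled` holds two wrong characters in place of the accented one.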
>
> In other words, leave it to ElementTree to manage the decoding and
> encoding itself. Nice -- I like that solution.
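
A minimal sketch of that suggestion, assuming the response body is handed to the parser as undecoded bytes: the parser reads the encoding declaration on the first line and decodes the content itself. The sample bytes, the element names, and the helper name parse_xml_bytes are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(raw_bytes):
    """Parse undecoded XML bytes; the parser honors the declared encoding."""
    return ET.fromstring(raw_bytes)


# Hypothetical stand-in for response.read(): Latin-1 bytes, declaration intact.
sample = (b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
          b'<Response><City>Bogot\xe1</City></Response>')

root = parse_xml_bytes(sample)
city = root.find("City").text  # Unicode text, decoded by the parser
```

With the real service, the equivalent call would be ET.fromstring(response.read()), with no explicit decode or encode step in between.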