[Tutor] encoding question

Mon Jan 6 00:03:01 CET 2014

On 2014-01-05 14:26, Steven D'Aprano wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
> 
>> Danny walked you through the XML. Note that he didn't decode the
>> response. It includes an encoding on the first line:
>> 
>>     <?xml version="1.0" encoding="ISO-8859-1" ?>
> 
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that
> was wishful thinking.
> 
>>         tree = ET.fromstring(response.read())

I believe you were correct the first time.
In my experience, even though the XML is advertised as encoded in 
ISO-8859-1 (which I believe is synonymous with Latin-1), my script 
(specifically Python's XML parser, xml.etree.ElementTree) didn't work 
until I decoded the XML from Latin-1 into Unicode and then encoded it 
as UTF-8. Here's the snippet, with comments noting the painful lessons 
learned:
"""
     import urllib2
     import xml.etree.ElementTree as ET

     # url_format_str and ip_address are defined elsewhere in the script.
     response = urllib2.urlopen(url_format_str % (ip_address,))
     # Charset parameter taken from the Content-Type response header.
     encoding = response.headers.getparam('charset')
     info = response.read().decode(encoding)
     # <info> is now <type 'unicode'>.
     n = info.find('\n')
     xml = info[n+1:]  # Get rid of a header line.
     # root = ET.fromstring(xml)  # This causes error:
     # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
     # in position 456: ordinal not in range(128)
     root = ET.fromstring(xml.encode("utf-8"))
"""


> 
> In other words, leave it to ElementTree to manage the decoding and
> encoding itself. Nice -- I like that solution.
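
A minimal Python 3 sketch of that suggestion, assuming the bytes still 
carry their XML declaration. The sample document and its element names 
are invented for illustration; in the real script the bytes would come 
from response.read().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.read(): undecoded Latin-1 bytes,
# declaration included.
raw = ('<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
       '<place><city>Bogot\u00e1</city></place>').encode("latin-1")

# Hand the bytes straight to the parser; it reads the declared encoding
# and returns text nodes as ordinary strings.
root = ET.fromstring(raw)
city = root.findtext("city")
```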