writing Unicode objects to XML
Alessio Pace
puccio_13 at yahoo.it
Tue May 6 05:37:01 EDT 2003
>Exactly the utf-8 encoding, as we can see. Can you reproduce
>this tiny example and see what it gives you? Then, go look for
>the differences between this and what you WERE doing.
HERE it is:
>>> from xml.dom import minidom
>>> s = '<?xml version="1.0" encoding="utf-8"?><foo>aèb</foo>'
>>> s
'<?xml version="1.0" encoding="utf-8"?><foo>aèb</foo>'
>>> xmldoc = minidom.parseString(s)
>>> f = open('file.xml', 'w')
>>> f.write(xmldoc.toxml(encoding='utf-8'))
>>> f.close()
>>> s1 = open('file.xml').read()
>>> s1
'<?xml version="1.0" encoding="utf-8"?>\n<foo>a\xc3\xa8b</foo>'
So, as I said, it works. I thought it didn't because what I saw in
file.xml after writing was two characters: --> Ú (I am pasting them; I don't
know how they will be displayed). They looked incorrectly encoded to me,
but after the very detailed explanation you gave below I understand that
they are correct.
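
For anyone following along, here is a minimal sketch of the same round trip
in Python 3 syntax (the session above is Python 2). It skips the file and
just looks at the serialized bytes. The input uses the numeric character
reference &#232; for the accented e, which is my assumption for illustration;
the variable names are made up too.

```python
from xml.dom import minidom

# Input document; &#232; is the character reference for U+00E8
# (assumed here for illustration).
source = b'<?xml version="1.0" encoding="utf-8"?><foo>a&#232;b</foo>'
doc = minidom.parseString(source)

# The parser resolves the reference to a single Unicode character.
text = doc.documentElement.firstChild.data

# Serializing as UTF-8 yields bytes; U+00E8 becomes the two bytes c3 a8.
serialized = doc.toxml(encoding="utf-8")

# A viewer that wrongly assumes Latin-1 shows those two bytes as two
# separate characters, which only looks like a broken encoding.
misread = serialized.decode("latin-1")

# Parsing the serialized bytes again gives back the same text.
reparsed = minidom.parseString(serialized).documentElement.firstChild.data
```

So the two odd characters in the file are probably just a UTF-8 sequence
shown by something that does not decode it as UTF-8.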
Thank you again. I think I will not have any problems now :-)
>> The weird thing is that when I parse file.xml again I get the same
>> Unicode objects as when I read it for the first time with character
>> references as content. So, in the end, it works...
>> But if instead of the accented e character reference (è) I write the
>> UTF-8 encoded sequence (\xc3\xa8 is the correct form to write there
>In a Python string, it would be. In a textfile, definitely not -- what
>you need to have there are the two BYTES of hexadecimal values c3 and a8,
>while apparently what you're putting in the file are eight bytes (a
>slash, an x, a c, a 3, another slash, another x, an a, and an 8), which
>is a COMPLETELY different thing.
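
The difference Alex describes can be checked directly. This is only a sketch
in Python 3 syntax, with variable names I made up:

```python
# The real UTF-8 encoding of U+00E8 (accented e): two bytes, c3 and a8.
two_bytes = "\xe8".encode("utf-8")

# What ends up in the file if the escape is typed literally into a text
# editor: a backslash, x, c, 3, another backslash, x, a, 8.
eight_bytes = rb"\xc3\xa8"

print(len(two_bytes), len(eight_bytes))

# Decoding shows how different they are: one character versus the
# eight-character text of the escape itself.
decoded_two = two_bytes.decode("utf-8")
decoded_eight = eight_bytes.decode("utf-8")
```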
[.....]
Alex
--
bye
Alessio Pace