writing Unicode objects to XML

Tue May 6 05:37:01 EDT 2003

>Exactly the utf-8 encoding, as we can see.  Can you reproduce
>this tiny example and see what it gives you?  Then, go look for
>the differences between this and what you WERE doing.

Here it is:
>>> from xml.dom import minidom
>>> s = '<?xml version="1.0" encoding="utf-8"?><foo>a&#xe8;b</foo>'
>>> s
'<?xml version="1.0" encoding="utf-8"?><foo>a&#xe8;b</foo>'
>>> xmldoc = minidom.parseString(s)
>>> f = open('file.xml', 'w')
>>> f.write(xmldoc.toxml(encoding='utf-8'))
>>> f.close()
>>> s1 = open('file.xml').read()
>>> s1
'<?xml version="1.0" encoding="utf-8"?>\n<foo>a\xc3\xa8b</foo>'
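
For anyone repeating this today, here is a rough sketch of the same round
trip under Python 3. That is an assumption on my part, since the session
above is Python 2. The main difference I know of is that toxml() with an
explicit encoding returns bytes, so the file has to be opened in binary
mode. The temporary directory is only there so that nothing in the
current directory gets overwritten.

```python
import os
import tempfile
from xml.dom import minidom

source = '<?xml version="1.0" encoding="utf-8"?><foo>a&#xe8;b</foo>'
doc = minidom.parseString(source)

# With an explicit encoding, toxml() returns bytes in Python 3.
payload = doc.toxml(encoding="utf-8")

# Write and read back in binary mode so no implicit re-encoding happens.
path = os.path.join(tempfile.mkdtemp(), "file.xml")
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    raw = f.read()

# Parsing the file again gives back the same single character U+00E8.
reparsed = minidom.parseString(raw)
text = reparsed.documentElement.firstChild.data
print(repr(raw))
print(repr(text))
```

The character reference in the source becomes the two bytes c3 a8 in the
file, and parsing the file again yields the original one-character text.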

So, as I said, it works. I thought it didn't because what I saw in
file.xml after writing was two characters: --> Ãš  (I am pasting them; I
don't know how they will be displayed). They looked incorrectly encoded to
me, but after the very detailed explanation you gave below I understand
that they are correct.
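
A small sketch of why two characters show up, written for Python 3 (an
assumption; the session above is Python 2). It also assumes that whatever
displayed the file was using ISO-8859-1 or ISO-8859-15, which is only a
guess at the viewer's settings.

```python
# U+00E8 (e with grave accent) takes two bytes in UTF-8.
utf8_bytes = "\u00e8".encode("utf-8")

# Reading those two bytes with a one-byte-per-character encoding
# produces two characters instead of one.
as_latin1 = utf8_bytes.decode("iso-8859-1")
as_latin9 = utf8_bytes.decode("iso-8859-15")

# Reading them as UTF-8 gives the single original character back.
as_utf8 = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(ascii(as_latin1), ascii(as_latin9), ascii(as_utf8))
```

So the file is fine; only the display interprets the bytes with the wrong
encoding.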

Thank you again. I think I will not have any problems now  :-)

>> the weird thing is that parsing over again the file.xml I get the same
>> Unicode objects as when I read it for the first time with characters
>> references as content. So, in the end, it works...
>> But if instead of the accented e character reference(&#xe8;)  I write the
>> UTF-8 encoded sequence (\xc3\xa8 is the correct form to write there

>In a Python string, it would be.  In a textfile, definitely not -- what
>you need to have there are the two BYTES of hexadecimal values c3 and a8,
>while apparently what you're putting in the file are eight bytes (a
>slash, an x, a c, a 3, another slash, another x, an a, and an 8), which
>is a COMPLETELY different thing.
[.....]
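
To make the two-bytes versus eight-bytes point concrete, here is a
minimal sketch, again assuming Python 3. The helper name foo_text is made
up for this illustration.

```python
from xml.dom import minidom

header = b'<?xml version="1.0" encoding="utf-8"?>'

# Two real bytes, c3 and a8, between "a" and "b".
two_bytes = header + b"<foo>a\xc3\xa8b</foo>"

# Eight literal characters: slash, x, c, 3, slash, x, a, 8.
eight_bytes = header + rb"<foo>a\xc3\xa8b</foo>"


def foo_text(xml_bytes):
    """Return the text content of the root element (hypothetical helper)."""
    return minidom.parseString(xml_bytes).documentElement.firstChild.data


decoded = foo_text(two_bytes)    # "a", U+00E8, "b"
literal = foo_text(eight_bytes)  # the escape sequence kept as plain text

print(len(decoded), ascii(decoded))
print(len(literal), ascii(literal))
```

The parser never interprets a backslash escape; that is Python string
syntax, not XML. In the file itself you need either the raw bytes or a
character reference such as &#xe8;.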

Alex

-- 
bye
Alessio Pace