writing Unicode objects to XML

Tue May 6 04:46:00 EDT 2003

<posted & mailed>

Alessio Pace wrote:

> First, thank you so much for the detailed explanation.
> 
> Second: having (valid) character references such as &#xe8; the parsing
> phase is ok, I get a good Unicode object(--> u'\xe8' is it the right one,
> isnt it??) but when I do :
>    f = open("file.xml", "w")
>    f.write(xmldoc.toxml(encoding="utf-8")
>    f.close()
> in the file.xml I get, as I expect, a 2 byte sequence, but they are not
> the "expected" UTF-8 characters. I mean, they are strange characters, but

Please let's work from specific small examples, because discursive
details are clearly leading us astray -- please do *NOT* "summarize in
your own words" (clearly you aren't just copying and pasting working
code here: the f.write line is missing a closing parenthesis, for example).

So, let's work at the interactive prompt to keep the examples small:

>>> import xml.dom.minidom
>>> s
'<?xml version="1.0" encoding="utf-8"?><foo>a&#xe8;b</foo>'
>>> xmldoc = xml.dom.minidom.parseString(s)

OK so far?  OK, let's now write it to file:

>>> f = open('file.xml', 'w')
>>> f.write(xmldoc.toxml(encoding='utf-8'))
>>> f.close()

AND check what we have written:

>>> s1 = open('file.xml').read()
>>> s1
'<?xml version="1.0" encoding="utf-8"?>\n<foo>a\xc3\xa8b</foo>'

Exactly the utf-8 encoding, as we can see.  Can you reproduce
this tiny example and see what it gives you?  Then, go look for
the differences between this and what you WERE doing.
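
For anyone who prefers a script to the prompt, here is a minimal sketch of
the same round trip. It assumes Python 3, where toxml(encoding=...) returns
bytes and the file therefore has to be opened in binary mode; the temporary
directory is only a hypothetical scratch location, not something from the
session above.

```python
import os
import tempfile
import xml.dom.minidom

# Same document as in the session above; &#xe8; is a character
# reference for U+00E8 (e with grave accent).
source = '<?xml version="1.0" encoding="utf-8"?><foo>a&#xe8;b</foo>'
xmldoc = xml.dom.minidom.parseString(source)

# Assumption: Python 3, where toxml(encoding=...) returns bytes.
encoded = xmldoc.toxml(encoding="utf-8")

# Hypothetical scratch location for the output file.
path = os.path.join(tempfile.mkdtemp(), "file.xml")
with open(path, "wb") as f:
    f.write(encoded)

# Read the raw bytes back, without any decoding, and look at them.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

The thing to look for in the printed bytes is a\xc3\xa8b inside the foo
element: two bytes for the accented character, nothing else.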

> the weird thing is that parsing over again the file.xml I get the same
> Unicode objects as when I read it for the first time with characters
> references as content. So, in the end, it works...
> But if instead of the accented e character reference(&#xe8;)  I write the
> UTF-8 encoded sequence (\xc3\xa8 is the correct form to write there

In a Python string, it would be.  In a textfile, definitely not -- what
you need to have there are the two BYTES of hexadecimal values c3 and a8,
while apparently what you're putting in the file are eight bytes (a
backslash, an x, a c, a 3, another backslash, another x, an a, and an 8),
which is a COMPLETELY different thing.

Check:

>>> s='<?xml version="1.0" encoding="utf-8"?><foo>a\xc3\xa8b</foo>'
>>> ord(s[44])
195
>>> ord(s[45])
168

See?  When I write IN THE PYTHON LITERAL STRING the escape sequence
\xc3, the Python parser puts in the string object ONE byte with a
decimal value of 195, i.e. a hex value of 0xc3, same thing.  In a
file, what you need to have is exactly a byte with a decimal value
of 195 followed by one with a decimal value of 168, *NOT* the escape
sequences that represent those values in some language or other.
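
The difference between the two bytes and the eight can be sketched as
follows. This assumes Python 3 bytes literals; the names two_bytes and
eight_bytes are just illustrative.

```python
# Two bytes: the UTF-8 encoding of U+00E8.
two_bytes = b"\xc3\xa8"

# Eight bytes: a raw literal keeps the backslashes, so this is the text
# backslash, x, c, 3, backslash, x, a, 8.
eight_bytes = rb"\xc3\xa8"

print(len(two_bytes), list(two_bytes))      # 2 [195, 168]
print(len(eight_bytes))                     # 8
print(two_bytes.decode("utf-8") == "\xe8")  # True
print(eight_bytes.decode("utf-8"))          # \xc3\xa8
```

Only the first of these is what a UTF-8 file should contain for that
character; the second is what ends up in the file if the escape sequence
is typed into a text editor.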

> right?) when I parse the xml document with minidom I get Unicode objects
> of kind u'\\xc3\\xa8' and so it does not match at all with the previous
> example.....Why??? The XML document is declared to be encoded UTF-8 :-(

The encoding in question does not specify any special interpretation
for backslashes -- that's just a convention used in many programming
languages such as Python and C, *NOT* in the utf-8 encoding!
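
To see what the parser makes of each kind of file content, here is a small
sketch, again assuming Python 3 (the helper text_of and the names good and
bad are made up for this illustration):

```python
import xml.dom.minidom

header = b'<?xml version="1.0" encoding="utf-8"?>'

# File content with the two real bytes c3 a8 between 'a' and 'b'.
good = header + b"<foo>a\xc3\xa8b</foo>"
# File content with the escape sequence typed out as eight ordinary
# characters (a raw literal keeps the backslashes as they are).
bad = header + rb"<foo>a\xc3\xa8b</foo>"


def text_of(data):
    """Return the text inside the root element of an XML byte string."""
    doc = xml.dom.minidom.parseString(data)
    return doc.documentElement.firstChild.data


print(ascii(text_of(good)))  # 'a\xe8b'
print(ascii(text_of(bad)))   # 'a\\xc3\\xa8b'
```

The parser is doing its job in both cases: it hands back exactly the
characters the bytes of the document encode, backslashes included.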

> On the console I did a simple unicode('\xc3\xa8', 'utf-8') and I got
> u'\xe8' so the problem is in the parsing of the XML not in the way I write
> it I guess...

Wrong: it's exactly in the way you write it.  If you write the s
string of the latest example to a file, the file will be exactly
as it should be, with the right values in the bytes.

Alex