writing Unicode objects to XML

Mon May 5 14:22:11 EDT 2003

First, thank you so much for the detailed explanation.

Second: with (valid) character references such as &#xe8;, the parsing
phase is OK: I get a good Unicode object (u'\xe8', which is the right one,
isn't it?), but when I do:
   f = open("file.xml", "w")
   f.write(xmldoc.toxml(encoding="utf-8"))
   f.close()
in file.xml I get, as I expect, a 2-byte sequence, but the bytes are not the
"expected" UTF-8 characters. I mean, they show up as strange characters, but
the weird thing is that when I parse file.xml again I get the same Unicode
objects as when I first read it with character references as content. So,
in the end, it works...
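
A minimal sketch of that round trip, written for Python 3 (where u'\xe8' is
simply '\xe8' and toxml(encoding=...) returns bytes). The temporary path and
variable names are only illustrative. The ISO-8859-1 decoding at the end is an
assumption about how the file is being viewed, which may be why the two bytes
look strange:

```python
import os
import tempfile
from xml.dom import minidom

# A document whose text node holds the character reference for e-grave.
source = b'<?xml version="1.0" encoding="utf-8"?><doc>&#xe8;</doc>'
xmldoc = minidom.parseString(source)
text = xmldoc.documentElement.firstChild.data  # one character, U+00E8

# Serialising with an explicit encoding yields UTF-8 bytes (0xC3 0xA8).
encoded = xmldoc.toxml(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.xml")
    # Binary mode: the data is already encoded, so write it untouched.
    with open(path, "wb") as f:
        f.write(encoded)
    # Parsing the written file gives back the same single character.
    text_again = minidom.parse(path).documentElement.firstChild.data

# Assumption: a viewer that treats the file as ISO-8859-1 shows the two
# UTF-8 bytes as two separate characters instead of one e-grave.
seen_as_latin1 = b"\xc3\xa8".decode("iso-8859-1")
```

If this is what is happening, the file itself is fine and only the display of
it is misleading.
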
But if, instead of the accented e character reference (&#xe8;), I write the
UTF-8 encoded sequence (\xc3\xa8 is the correct form to write there, right?),
then when I parse the XML document with minidom I get Unicode objects like
u'\\xc3\\xa8', which does not match the previous example at all. Why? The
XML document is declared to be encoded in UTF-8 :-(

On the console I did a simple unicode('\xc3\xa8', 'utf-8') and I got u'\xe8',
so I guess the problem is in the parsing of the XML, not in the way I write
it...
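
To make the difference concrete, here is a small Python 3 sketch comparing the
two cases. It assumes the sequence was typed into the file as literal text
(a backslash, "x", "c", "3", and so on), which is what a result like
u'\\xc3\\xa8' suggests. The variable names are only illustrative:

```python
from xml.dom import minidom

# Case 1: the file contains the eight ASCII characters \xc3\xa8,
# typed literally (note the doubled backslashes in this bytes literal).
typed = b'<?xml version="1.0" encoding="utf-8"?><doc>\\xc3\\xa8</doc>'

# Case 2: the file contains the two raw bytes 0xC3 0xA8.
raw = b'<?xml version="1.0" encoding="utf-8"?><doc>\xc3\xa8</doc>'

typed_text = minidom.parseString(typed).documentElement.firstChild.data
raw_text = minidom.parseString(raw).documentElement.firstChild.data

# Python 3 equivalent of the console check unicode('\xc3\xa8', 'utf-8').
decoded = b"\xc3\xa8".decode("utf-8")
```

In the first case the parser sees ordinary ASCII text and returns it
unchanged. Only the second case decodes to the single character U+00E8. The
\x notation is Python string-literal syntax, not something an XML parser
interprets.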

Thanks for further help.

Paul Boddie wrote:

> Alessio Pace <puccio_13 at yahoo.it> wrote in message
> news:<UNqta.38692$DN.951106 at tornado.fastwebnet.it>...
>>
>> Maybe I am missing something, because I tried but in the resulting new
>> XML file I don't see what I expect. Starting again, I have an XML file
>> declared encoded in UTF-8 (anyway, is it the default if I don't specify
>> anything?)
> 
> I guess we have to assume that your file actually contains valid UTF-8
> byte sequences, but let's do that.
> 
>> and which contains character references such as &#xe8; and some others in
>> the Text nodes.
> 
> It occurs to me that you could quite easily use ASCII as an encoding
> if you're intent on using character references all over the place. The
> above question about default encodings could be disregarded for our
> purposes.
> 
>> I parse it with xml.dom.minidom.parse(pathToFile) and get a reference to
>> a DOM tree, let's call this variable 'xmldoc'.
> 
> If you do what had been previously recommended...
> 
>   xmldoc.toxml()
> 
> ...you'll probably see a string prefixed with "u" and with "\xe8"
> where your "&#xe8;" reference was.
> 
>> Now, let's say I want to store again this DOM tree (because my
>> application will have to modify some parameters in it). I thought I had
>> to do just: f = codecs.open('file.xml', 'w', 'utf8')
> 
> The help for codecs.open suggests that you need to supply objects to
> the file object's write method which can be converted to the stated
> encoding. In other words, given an object passed to f.write():
> 
>   f.write(s) # write my "string" object
> 
> ...that object will be processed in such a way that the encode method
> is effectively called upon it:
> 
>   s.encode("utf-8") # when we opened a file with "utf-8" as the encoding
> 
> This is what appears to happen internally in the object returned by
> codecs.open, but this also means that the encode method should produce
> a valid result (and not raise an exception). Now, writing Unicode
> objects to your file does seem to work:
> 
>   f.write(xmldoc.toxml())
> 
> This shouldn't be a surprise, since the result of xmldoc.toxml() is a
> Unicode object (as we saw above). However, you are attempting to convert
> this Unicode object to a Python string (or byte stream) first:
> 
>> f.write(xmldoc.toxml(encoding='utf-8'))
>> f.close()
>> But the result is not the original xml....
> 
> I'm surprised the program even finishes, as I'll attempt to explain.
> 
> Here, the object that gets passed to the write method is a normal
> Python string as can be observed by the output of the following
> statement:
> 
>   xmldoc.toxml(encoding='utf-8')
> 
> You'll see that two characters are generated for your single "&#xe8;"
> character. Moreover, attempting to convert either or both of these
> characters to UTF-8 gives the dreaded "UnicodeError: ASCII decoding
> error: ordinal not in range(128)" as Python refuses to guess which
> encoding you might employ for normal Python strings (as that's what
> the decode method on strings is for).
> 
> Consequently, if you consider your input object to the write method as
> follows:
> 
>   s = xmldoc.toxml(encoding="utf-8")
> 
> You'll realise that when the write method does the equivalent of this
> operation...
> 
>   s.encode("utf-8")
> 
> ...an identical exception is raised to that described above.
> 
> Anyway, you can either use codecs.open and send the Unicode result of
> xmldoc.toxml():
> 
>   f = codecs.open("file.xml", "w", "utf-8")
>   f.write(xmldoc.toxml())
>   f.close()
> 
> Or you can do this:
> 
>   f = open("file.xml", "w")
>   f.write(xmldoc.toxml(encoding="utf-8"))
>   f.close()
> 
> I prefer to do the second thing, because I can imagine that minidom
> introduces the correct encoding declaration into the XML. I cannot
> imagine the same occurring in the first case, where you rely on
> "utf-8" as the default encoding: if you change your codecs.open
> encoding parameter, the XML parser will become confused about the
> encoding when it tries to read the resulting file back in at a later
> point in time.
> 
>> My sys.defaultencoding is iso-8859-1, specified in the sitecustomize.py
>> script in python site-packages directory.
> 
> I can't remember if the locale or any other information makes any
> difference to the interpretation of characters in Python strings, but
> you shouldn't need to even let that interfere with your processing.
> The XML modules produce Unicode objects in virtually all cases, as far
> as I have seen, and unless you have a need to extract or introduce
> data in a particular encoding, you should not concern yourself with
> working in such encodings.
> 
> Paul
> 
> P.S. I hope that I've represented the state of various technologies
> correctly. I'm not familiar with the internals of codecs.open or the
> XML processing tools.
> 
> P.P.S. Don't try and reply by mail from this message as the return
> address is outdated - somewhat antisocial, I suppose, but a Google
> search is surely within the powers of the determined.
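
The two ways of writing the file described above can be sketched in Python 3
terms as follows. This is an approximation rather than the original Python 2
code: the built-in open() with an encoding argument stands in for
codecs.open, the file names are made up, and mixing the two approaches raises
TypeError here rather than the UnicodeError mentioned in the message:

```python
import os
import tempfile
from xml.dom import minidom

xmldoc = minidom.parseString(b"<doc>&#xe8;</doc>")

as_text = xmldoc.toxml()                   # str, no encoding declaration
as_bytes = xmldoc.toxml(encoding="utf-8")  # bytes, declaration names utf-8

with tempfile.TemporaryDirectory() as tmp:
    # Option 1: the file object does the encoding, so pass it the str.
    path1 = os.path.join(tmp, "text_mode.xml")
    with open(path1, "w", encoding="utf-8") as f:
        f.write(as_text)

    # Option 2: minidom does the encoding, so write bytes in binary mode.
    path2 = os.path.join(tmp, "binary_mode.xml")
    with open(path2, "wb") as f:
        f.write(as_bytes)

    # Mixing them (already-encoded bytes into an encoding file) fails.
    mixed_error = None
    try:
        with open(os.path.join(tmp, "mixed.xml"), "w", encoding="utf-8") as f:
            f.write(as_bytes)
    except TypeError as exc:
        mixed_error = exc

    # Both files parse back to the same single character.
    text1 = minidom.parse(path1).documentElement.firstChild.data
    text2 = minidom.parse(path2).documentElement.firstChild.data
```

Only the second option records the encoding in the XML declaration, which
matches the reason given above for preferring it.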

-- 
bye
Alessio Pace