writing Unicode objects to XML

Paul Boddie paul at boddie.net
Mon May 5 13:32:14 EDT 2003


Alessio Pace <puccio_13 at yahoo.it> wrote in message news:<UNqta.38692$DN.951106 at tornado.fastwebnet.it>...
>
> Maybe I am missing something, because I tried but in the resulting new XML
> file I dont' see what I expect.. Starting again, I have an XML file
> declared encoded in UTF-8 (anyway, is it the default if I don't specify
> anything?)

I guess we have to assume that your file actually contains valid UTF-8
byte sequences, but let's do that.

> and which contains character references such as &#xe8; and some others in the
> Text nodes.

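It occurs to me that you could quite easily use ASCII as an encoding
if you're intent on using character references all over the place. The
above question about default encodings could then be disregarded for
our purposes (although, for the record, an XML parser does assume
UTF-8 when no encoding is declared and there is no byte order mark).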

> I parse it with xml.dom.minidom.parse(pathToFile) and get a reference to a DOM
> tree, let's call this variable 'xmldoc'.

If you do what had been previously recommended...

  xmldoc.toxml()

...you'll probably see a string prefixed with "u" and with "\xe8"
where your "&#xe8;" reference was.
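
To try this without your file to hand, here is a minimal sketch. The
sample document (a <doc> element containing "caff&#xe8;") is made up
for illustration, and parseString stands in for parse:

```python
from xml.dom import minidom

# Hypothetical stand-in for the real file: ASCII-only bytes that use a
# character reference for the accented letter.
source = b'<?xml version="1.0" encoding="utf-8"?><doc>caff&#xe8;</doc>'
xmldoc = minidom.parseString(source)

text = xmldoc.toxml()

# The parser has resolved the reference to the single character U+00E8,
# so the reference itself no longer appears in the serialised text.
has_character = u"\xe8" in text
has_reference = u"&#xe8;" in text
print(repr(text))
```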

> Now, let's say I want to store again this DOM tree (because my application
> will have to modify some parameters in it). I thought I had to do just:
> f = codecs.open('file.xml', 'w', 'utf8')

The help for codecs.open suggests that you need to supply objects to
the file object's write method which can be converted to the stated
encoding. In other words, given an object passed to f.write():

  f.write(s) # write my "string" object

...that object will be processed in such a way that the encode method
is effectively called upon it:

  s.encode("utf-8") # when we opened a file with "utf-8" as the encoding

This is what appears to happen internally in the object returned by
codecs.open, but this also means that the encode method should produce
a valid result (and not raise an exception). Now, writing Unicode
objects to your file does seem to work:

  f.write(xmldoc.toxml())

This shouldn't be a surprise, since the result of xmldoc.toxml() is a
Unicode object (as we saw above). However, you are attempting to
convert this Unicode object to a Python string (or byte stream) first:

> f.write(xmldoc.toxml(encoding='utf-8'))
> f.close()
> But the result is not the original xml....

I'm surprised the program even finishes, as I'll attempt to explain.

Here, the object that gets passed to the write method is a normal
Python string as can be observed by the output of the following
statement:

  xmldoc.toxml(encoding='utf-8')

You'll see that two characters (bytes, really) are generated for your
single "&#xe8;" character. Moreover, attempting to convert either or
both of these characters to UTF-8 gives the dreaded "UnicodeError:
ASCII decoding error: ordinal not in range(128)". Before it can encode
a normal Python string, Python must first decode it to Unicode, and it
refuses to guess which encoding you might have employed (as that's
what the decode method on strings is for), falling back on the default
encoding, which is normally ASCII.

Consequently, if you consider your input object to the write method as
follows:

  s = xmldoc.toxml(encoding="utf-8")

You'll realise that when the write method does the equivalent of this
operation...

  s.encode("utf-8")

...an identical exception is raised to that described above.
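
The failing step can be reproduced in isolation. This is only a
sketch: the sample document is made up, and the ASCII decoding which
the write method would trigger implicitly is written out here as an
explicit decode call so that it can be seen on its own:

```python
from xml.dom import minidom

# Hypothetical sample document, not the original poster's file.
xmldoc = minidom.parseString(b'<doc>caff&#xe8;</doc>')
s = xmldoc.toxml(encoding="utf-8")

# U+00E8 becomes two bytes in UTF-8, neither of which is ASCII.
encoded = u"\xe8".encode("utf-8")
is_two_bytes = (len(encoded) == 2)

# Decoding those bytes as ASCII is the operation that fails.
try:
    s.decode("ascii")
    failed = False
except UnicodeError as exc:
    failed = True
    print(exc)
```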

Anyway, you can either use codecs.open and send the Unicode result of
xmldoc.toxml():

  f = codecs.open("file.xml", "w", "utf-8")
  f.write(xmldoc.toxml())
  f.close()

Or you can do this:

  f = open("file.xml", "w")
  f.write(xmldoc.toxml(encoding="utf-8"))
  f.close()
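
As a rough check that already encoded output survives a round trip,
here is a sketch using a made-up sample document and a throwaway
temporary file. Opening the file in binary mode ("wb") is my own
choice, so that the bytes are written untouched:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample document, not the original poster's file.
xmldoc = minidom.parseString(b'<doc>caff&#xe8;</doc>')

# Throwaway location so that nothing real gets overwritten.
path = os.path.join(tempfile.mkdtemp(), "file.xml")

f = open(path, "wb")  # binary mode: an assumption, see above
f.write(xmldoc.toxml(encoding="utf-8"))
f.close()

# Parse the file again and pull out the text of the root element.
reparsed = minidom.parse(path)
value = reparsed.documentElement.firstChild.data
print(repr(value))
```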

I prefer to do the second thing, because I can imagine that minidom
introduces the correct encoding declaration into the XML, whereas I
cannot imagine the same occurring in the first case. There, you rely
on "utf-8" being the default encoding, and if you change your
codecs.open encoding parameter, the XML parser will become confused
about the encoding when it tries to read the resulting file back in
later.
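
To illustrate that concern, here is a sketch with a made-up sample
document. It imitates what codecs.open would do with "iso-8859-1" by
encoding the Unicode result by hand, and then compares that with
letting minidom do the encoding:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample document, not the original poster's file.
xmldoc = minidom.parseString(b'<doc>caff&#xe8;</doc>')

no_decl = xmldoc.toxml()                         # names no encoding
with_decl = xmldoc.toxml(encoding="iso-8859-1")  # names iso-8859-1

# Imitate codecs.open(..., "w", "iso-8859-1"): the bytes are Latin-1,
# but nothing in the declaration tells the parser so.
mismatched = no_decl.encode("iso-8859-1")

try:
    minidom.parseString(mismatched)
    confused = False
except ExpatError:
    confused = True

# With the declaration in place, the same encoding reads back fine.
round_trip = minidom.parseString(with_decl).documentElement.firstChild.data
```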

> My sys.defaultencoding  is iso-8859-1, specified in the sitecustomize.py
> script in python site-packages directory.

I can't remember if the locale or any other information makes any
difference to the interpretation of characters in Python strings, but
you shouldn't need to even let that interfere with your processing.
The XML modules produce Unicode objects in virtually all cases, as far
as I have seen, and unless you have a need to extract or introduce
data in a particular encoding, you should not concern yourself with
working in such encodings.

Paul

P.S. I hope that I've represented the state of various technologies
correctly. I'm not familiar with the internals of codecs.open or the
XML processing tools.

P.P.S. Don't try and reply by mail from this message as the return
address is outdated - somewhat antisocial, I suppose, but a Google
search is surely within the powers of the determined.



