writing Unicode objects to XML

Tue May 6 06:44:40 EDT 2003

Alessio Pace <puccio_13 at yahoo.it> wrote in message news:<E0yta.39396$DN.968490 at tornado.fastwebnet.it>...
> First, thank you so much for the detailed explanation.
> 
> Second: having (valid) character references such as &#xe8; the parsing
> phase is ok, I get a good Unicode object (--> u'\xe8' is the right one,
> isn't it??) but when I do:
>    f = open("file.xml", "w")
>    f.write(xmldoc.toxml(encoding="utf-8"))
>    f.close()
> in the file.xml I get, as I expect, a 2 byte sequence, but they are not the
> "expected" UTF-8 characters.

That's what you think. ;-) Martin v. Löwis gave some good advice about
editors and consoles that you definitely should take in. You see, the
two bytes contain byte values and cannot be trivially compared to
character values - they are the actual byte values used in UTF-8 to
represent the character whose Unicode value is "e8". Python will
report this Unicode value when asked to display a Unicode object -
u"\xe8" is what you see - but this is not a dump of byte values by any
means.
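
To make the distinction concrete, here is a minimal sketch. It assumes
Python 3, where byte strings are shown with a b prefix, and the
variable names are only illustrative:

```python
# Sketch: one character, its Unicode value, and its UTF-8 bytes.
ch = u"\xe8"                       # e with a grave accent

code_point = ord(ch)               # the Unicode value of the character
utf8_bytes = ch.encode("utf-8")    # the bytes that end up in the file

print(hex(code_point))                           # the character value
print([hex(b) for b in bytearray(utf8_bytes)])   # the byte values
print(len(ch), len(utf8_bytes))                  # characters vs. bytes
```

The Unicode object holds one character; the encoded form holds two
bytes, and neither byte has the value of the character itself.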

It can be confusing to examine the converted form of u"\xe8" in other
character sets, especially if we choose ISO-8859-1:

  u"\xe8".encode("iso-8859-1")

In this case, we get the Python (byte) string '\xe8' as a result. You
may have heard that certain ISO-8859-1 characters "have the same value
as" certain Unicode characters, but remember that while the character
value inside the Unicode string is "the same as" the value inside this
Python string, we stop talking about "straight Unicode" when
converting to particular byte-oriented representations of Unicode
characters, e.g. UTF-8.
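
The "same value" coincidence can be checked directly. This is only a
sketch, assuming Python 3 and its built-in "iso-8859-1" codec; the
names same_value and all_match are mine:

```python
# Sketch: in ISO-8859-1, the byte value equals the Unicode value.
# This is a property of that one encoding, not of Unicode strings.
ch = u"\xe8"
latin1_bytes = ch.encode("iso-8859-1")

# Compare the single byte with the character's Unicode value.
same_value = bytearray(latin1_bytes)[0] == ord(ch)

# The coincidence holds for every value ISO-8859-1 can represent.
all_match = all(
    bytearray(chr(n).encode("iso-8859-1"))[0] == n for n in range(256)
)

print(same_value, all_match)
```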

I don't know if you expected to see this '\xe8' character in your
file, but remember that this byte value is the ISO-8859-1
representation of u"\xe8" and you asked for UTF-8 in your file. Now
consider the converted form of u"\xe8" in UTF-8:

  u"\xe8".encode("utf-8")

In this case, we get the Python (byte) string '\xc3\xa8'. What happens
is that the Unicode character value "e8" is explicitly converted to a
two byte sequence. When you view your output file, which contains this
sequence, in an editor, how these bytes are represented really depends
on how your editor is set up. For example, I can fire up "vim" and see
the following characters:

  Ã¨ (an A accented with ~, and an umlaut/double-dot symbol on its
own)

However, this is really the editor's interpretation of those byte
values as characters in an assumed encoding - ISO-8859-15 in my case.
In other words, the editor isn't showing me the true byte values at
all - it's showing me what those bytes mean in my default encoding.
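
The editor's behaviour can be imitated in Python by decoding the same
two bytes in two different ways. A small sketch, assuming Python 3; I
use ISO-8859-1 here because it yields exactly the two characters
described above, and other single-byte encodings can map the second
byte to a different character:

```python
# Sketch: the same two bytes, read under two different assumptions.
data = u"\xe8".encode("utf-8")          # the bytes written to the file

as_latin1 = data.decode("iso-8859-1")   # what a Latin-1 viewer displays
as_utf8 = data.decode("utf-8")          # what the file actually contains

print(len(as_latin1))   # number of characters the Latin-1 viewer shows
print(len(as_utf8))     # number of characters really in the document
```

The bytes never change; only the interpretation applied to them does.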

To have "vim" interpret the byte values correctly as UTF-8, while
still displaying the resulting characters in my default encoding, do
something like this:

  vi -c "e ++enc=utf-8" file.xml

Consequently, I see the following character:

  è (an e with a grave accent)

> I mean, they are strange characters, but the
> weird thing is that parsing over again the file.xml I get the same Unicode
> objects as when I read it for the first time with characters references as
> content. So, in the end, it works...

It works because your editor is lying to you, but Python isn't so
easily confused - see above.
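
The round trip can be reproduced without any editor involved. The
following is a sketch assuming Python 3 and a made-up one-element
document; the element name doc is purely illustrative:

```python
from xml.dom import minidom

# Sketch: a character reference survives a UTF-8 round trip.
source = b'<?xml version="1.0" encoding="utf-8"?><doc>&#xe8;</doc>'
xmldoc = minidom.parseString(source)
text_before = xmldoc.documentElement.firstChild.data

# Serialise as UTF-8: the character is now stored as two bytes.
serialized = xmldoc.toxml(encoding="utf-8")

# Parse the serialised bytes again and compare the text content.
reparsed = minidom.parseString(serialized)
text_after = reparsed.documentElement.firstChild.data

print(text_before == text_after)
```

Both parses produce the same one-character Unicode object, whether the
input used the character reference or the two UTF-8 bytes.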

> But if instead of the accented e character reference(&#xe8;)  I write the
> UTF-8 encoded sequence (\xc3\xa8 is the correct form to write there right?)
> when I parse the xml document with minidom I get Unicode objects of kind
> u'\\xc3\\xa8' and so it does not match at all with the previous
> example.....Why??? The XML document is declared to be encoded UTF-8 :-(

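It all depends on how you wrote those bytes. The u'\\xc3\\xa8' you
report contains literal backslashes, which suggests that the escape
sequence itself was typed into the file as text rather than the two
bytes it stands for. If you open the file as a "normal" file in
Python, insert those bytes using normal Python strings (not Unicode
strings), and then write the file, I would expect it to work.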
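
The difference between the two ways of writing the sequence can be
sketched as follows. This assumes Python 3, and the two documents are
invented for illustration; I have not seen the original file:

```python
from xml.dom import minidom

header = b'<?xml version="1.0" encoding="utf-8"?>'

# Case 1: the two actual bytes 0xC3 0xA8 inside the element.
real_bytes = header + b"<doc>\xc3\xa8</doc>"

# Case 2: the eight ASCII characters \ x c 3 \ x a 8, as they would
# appear if the escape sequence were typed into an editor.
typed_escape = header + b"<doc>\\xc3\\xa8</doc>"

text_real = minidom.parseString(real_bytes).documentElement.firstChild.data
text_typed = minidom.parseString(typed_escape).documentElement.firstChild.data

print(len(text_real), len(text_typed))
```

Only the first case gives the single accented character; the second
gives back the typed escape sequence, backslashes included.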

> On the console I did a simple unicode('\xc3\xa8', 'utf-8') and I got u'\xe8' 
> so the problem is in the parsing of the XML not in the way I write it I
> guess...

It could well be the way that you're writing it, though. My advice
when editing UTF-8 is to avoid tools which force you to see the text
as bytes interpreted as characters in other encodings. Indeed, avoid
using editors which don't present the UTF-8 as characters which truly
represent the actual Unicode characters in the document. If you are
more comfortable writing text in encodings such as ISO-8859-1, and
your editor works properly in such encodings, then declare such
encodings in your XML documents and don't attempt to hand edit UTF-8.
Should you be given UTF-8 documents, use Python to convert them to
your favourite encoding before trying to hand edit them:

  # Binary mode, since toxml(encoding=...) produces encoded bytes.
  f = open("new_file.xml", "wb")
  f.write(xmldoc.toxml(encoding="iso-8859-1"))
  f.close()

Then hope that ISO-8859-1, for example, can represent all the
characters found in that document.
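
Rather than only hoping, you can test a document's text before
converting it. A rough sketch, assuming Python 3; can_encode is a
hypothetical helper of my own, not part of minidom, and the sample
document is invented:

```python
from xml.dom import minidom

def can_encode(text, encoding):
    """Return True if every character in text exists in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# A UTF-8 document containing an accented e and a euro sign.
utf8_source = u"<doc>\xe8 \u20ac</doc>".encode("utf-8")
xmldoc = minidom.parseString(utf8_source)
text = xmldoc.documentElement.firstChild.data

fits_latin1 = can_encode(text, "iso-8859-1")   # euro sign is not in ISO-8859-1
fits_utf8 = can_encode(text, "utf-8")

print(fits_latin1, fits_utf8)
```

This only checks a single text node; a real document would need the
same check applied to all of its text and attribute values.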

Good luck!

Paul