[XML-SIG] XML Unicode and UTF-8

Paul Boddie paul.boddie at ementor.no
Mon Aug 9 12:07:28 CEST 2004


Neil Youngman [mailto:n.youngman at ntlworld.com] wrote:
>
> OK. I think we're starting from different assumptions here. The data
> comes from decoding an RFC1522 header. It is therefore assumed to be
> text, albeit in a non-ASCII character set. It should not be an
> arbitrary chunk of binary data.

That's why I was slightly puzzled by the remark about invalid Unicode
values. But then I wasn't following the discussion that closely.

> I'm assuming, possibly incorrectly, that the standards are set up in
> such a way that if it's valid text, it should be possible to insert
> the UTF-8 equivalent in XML.

I think it's best to think of the problem with the following
terminology:

 * The original text is a normal Python string with a known encoding.
   We refer to that as a byte string.

 * You want to convert that string to a Unicode object and insert it
   into a DOM representation of an XML document. We refer to this as
   Unicode in the DOM.

 * You want to serialise the document using a UTF-8 encoding. We can
   refer to the content as UTF-8 in XML.

As has been mentioned already, you might well be able to put UTF-8
encoded byte strings into the DOM, but then you'll experience problems
with serialisation. If you put Unicode objects into the DOM,
serialisation should proceed successfully.
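
The three stages above can be sketched roughly as follows. This is only
an illustration, not a definitive recipe: it assumes a current Python 3
interpreter (where a byte string is a "bytes" object and a Unicode
object is a "str"), uses xml.dom.minidom as the DOM implementation, and
the sample text, the ISO-8859-1 source encoding and the element names
"message" and "subject" are all made up for the example.

```python
from xml.dom.minidom import getDOMImplementation

# Byte string in a known encoding. ISO-8859-1 is an assumption made
# purely for illustration; use whatever the header decoding reported.
raw = b"Gr\xfc\xdfe aus K\xf6ln"

# Byte string -> Unicode, using the known encoding.
text = raw.decode("iso-8859-1")

# Unicode in the DOM: only Unicode text goes into the text node.
doc = getDOMImplementation().createDocument(None, "message", None)
subject = doc.createElement("subject")
subject.appendChild(doc.createTextNode(text))
doc.documentElement.appendChild(subject)

# UTF-8 in XML: asking for an encoding yields bytes, with a matching
# encoding declaration written at the top of the document.
serialised = doc.toxml(encoding="utf-8")
```

In this sketch the only explicit decoding happens once, at the boundary
where the byte string enters the program; the toolkit takes care of the
encoding on the way out.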

And as far as opening a file and serialising to it is concerned, I've
had most success with the following sequence of operations:

 * Open a file using Python's "open" built-in function - this exposes
   an output stream which should be considered as accepting byte
   values (as opposed to streams exposed by "codecs.open" which
   accept Unicode values).

 * Serialise to the stream using the various XML toolkit functions or
   methods. These functions or methods are able to produce an
   encoding declaration in the serialised document consistent with
   the actual encoding employed. They will also convert the Unicode
   values to the appropriate byte sequences for the output stream.

 * Close the file. ;-)

There may be a better way of doing this, but that's the most sane way
I've discovered so far.
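
For what it's worth, that sequence might look something like the sketch
below. It again assumes a current Python 3 interpreter and
xml.dom.minidom; there, a stream accepting byte values means opening
the file in binary mode ("wb"). The file name, the temporary directory
and the document content are invented for the example.

```python
import os
import tempfile
from xml.dom.minidom import getDOMImplementation

# A small document holding Unicode text (hypothetical content).
doc = getDOMImplementation().createDocument(None, "message", None)
doc.documentElement.appendChild(doc.createTextNode("K\u00f6ln"))

# Hypothetical output location, chosen so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "message.xml")

# Open a byte-oriented stream, let the toolkit produce the encoded
# bytes and the encoding declaration, then close the file (the "with"
# block does the closing).
with open(path, "wb") as stream:
    stream.write(doc.toxml(encoding="utf-8"))

# Read the raw bytes back to see what actually reached the file.
with open(path, "rb") as stream:
    data = stream.read()
```

The point of writing bytes to a binary stream is that the encoding
named in the declaration and the encoding actually used for the bytes
both come from the same serialisation call, so they cannot drift apart.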

Paul


