[XML-SIG] XML Unicode and UTF-8
Paul Boddie
paul.boddie at ementor.no
Mon Aug 9 12:07:28 CEST 2004
Neil Youngman [mailto:n.youngman at ntlworld.com] wrote:
>
> OK. I think we're starting from different assumptions here. The data
> comes from decoding an RFC1522 header. It is therefore assumed to be
> text, albeit in a non-ASCII character set. It should not be an
> arbitrary chunk of binary data.
That's why I was slightly puzzled by the remark about invalid Unicode
values. But then I wasn't following the discussion that closely.
> I'm assuming, possibly incorrectly, that the standards are set up in
> such a way that if it's valid text, it should be possible to insert
> the UTF-8 equivalent in XML.
I think it's best to think of the problem with the following
terminology:
* The original text is a normal Python string with a known encoding.
  We refer to that as a byte string.
* You want to convert that string to a Unicode object and insert it
  into a DOM representation of an XML document. We refer to this as
  Unicode in the DOM.
* You want to serialise the document using a UTF-8 encoding. We can
  refer to the content as UTF-8 in XML.
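
A minimal sketch of those three stages, assuming Python 3 (where the
byte string is a "bytes" object and the Unicode object is a "str") and
the standard library's xml.dom.minidom rather than any particular
toolkit from this thread. The sample bytes, the ISO-8859-1 charset and
the "subject" element name are made up for illustration:

```python
from xml.dom.minidom import getDOMImplementation

# Stage 1: a byte string with a known encoding.
# Assumption: ISO-8859-1, as might be named in a decoded mail header.
raw = b"Gr\xfc\xdfe"
known_encoding = "iso-8859-1"

# Stage 2: convert to Unicode and put that into the DOM.
text = raw.decode(known_encoding)
doc = getDOMImplementation().createDocument(None, "subject", None)
doc.documentElement.appendChild(doc.createTextNode(text))

# Stage 3: serialise as UTF-8. With an explicit encoding, toxml()
# returns bytes and writes a matching encoding declaration.
xml_bytes = doc.toxml(encoding="utf-8")
```

The decode step is the only place where the original character set
matters; after that the DOM holds characters, not bytes.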
As has been mentioned already, you might well be able to put UTF-8
encoded byte strings into the DOM, but then you'll experience problems
with serialisation. If you put Unicode objects into the DOM,
serialisation should proceed successfully.
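
To illustrate the difference, here is a small sketch, again assuming
Python 3 and xml.dom.minidom. As far as I can tell, that combination
refuses a byte string at insertion time, so the failure appears
earlier than the serialisation stage described above; other toolkits
and older Python versions may behave differently. The variable names
are illustrative only:

```python
from xml.dom.minidom import getDOMImplementation

doc = getDOMImplementation().createDocument(None, "subject", None)
utf8_bytes = "Gr\u00fc\u00dfe".encode("utf-8")  # UTF-8 encoded byte string

# Attempt 1: hand the encoded bytes straight to the DOM.
try:
    doc.createTextNode(utf8_bytes)
    bytes_accepted = True
except TypeError:
    # minidom on Python 3 only accepts str for node contents.
    bytes_accepted = False

# Attempt 2: decode first, so the DOM holds Unicode text.
node = doc.createTextNode(utf8_bytes.decode("utf-8"))
doc.documentElement.appendChild(node)
serialised = doc.toxml(encoding="utf-8")
```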
And as far as opening a file and serialising to it is concerned, I've
had most success with the following sequence of operations:
* Open a file using Python's "open" built-in function - this exposes
  an output stream which should be considered as accepting byte
  values (as opposed to streams exposed by "codecs.open" which
  accept Unicode values).
* Serialise to the stream using the various XML toolkit functions or
  methods. These functions or methods are able to produce an
  encoding declaration in the serialised document consistent with
  the actual encoding employed. They will also convert the Unicode
  values to the appropriate byte sequences for the output stream.
* Close the file. ;-)
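
The sequence above might look roughly like this on Python 3 with
xml.dom.minidom, which is an assumption on my part and not necessarily
the toolkit in use here. The temporary directory and the "out.xml"
file name are hypothetical:

```python
import os
import tempfile
from xml.dom.minidom import getDOMImplementation, parse

doc = getDOMImplementation().createDocument(None, "subject", None)
doc.documentElement.appendChild(doc.createTextNode("Gr\u00fc\u00dfe"))

# Hypothetical output location, so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "out.xml")

# Open a plain byte stream ("wb"), not a Unicode-accepting one.
# The with-block also takes care of closing the file.
with open(path, "wb") as stream:
    # toxml(encoding=...) converts the Unicode content to bytes and
    # emits an encoding declaration that matches.
    stream.write(doc.toxml(encoding="utf-8"))

# Read the result back to check the declaration and the round trip.
with open(path, "rb") as stream:
    written = stream.read()
round_trip = parse(path).documentElement.firstChild.data
```

Letting the serialiser do the encoding keeps the declaration and the
actual bytes in agreement, which is the point of writing to a byte
stream rather than wrapping the file in a second encoding layer.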
There may be a better way of doing this, but that's the most sane way
I've discovered so far.
Paul