[XML-SIG] XML Unicode and UTF-8

Paul Boddie paul.boddie at ementor.no
Thu Aug 5 15:26:34 CEST 2004


n.youngman at ntlworld.com wrote:
>
> OK. That kind of makes sense, but I now have to figure out what is in the
> byte string and how to transform it to UTF-8. I guess that it's actually
> raw data in the character set given by the other part of the pair.
> Assuming it's a string in koi8-r, I have to get a codec that will
> transform koi8-r to UTF-8, probably via unicode.

I've only been following this thread in a vague way, but the easiest way
to approach this problem and many others that you might have with
character encodings is to convert input data to Unicode objects as soon
as possible. Note that there's a distinction between Unicode (which you
can think of as a scheme where every character has its own number, or
code point) and UTF-8 (which is one way of serialising those code points
as a byte stream). When you're converting to Unicode you aren't
converting to UTF-8 or any other such representation - you're actually
putting the data in Python Unicode objects. Meanwhile, UTF-8 is a side
issue which you only need to think about when you're reading or
producing textual data for other systems to process - you should be able
to keep UTF-8 data out of your program.
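
As a minimal sketch of that "decode early, encode late" approach: the
example below assumes Python 3, where str plays the role of the Unicode
object and bytes is the byte string. The sample text and the variable
names (raw, charset, text, output) are made up for illustration.

```python
# Assumption: Python 3, where str is the Unicode type and bytes is the
# byte string. The sample data below is purely illustrative.
raw = "Привет".encode("koi8-r")    # stand-in for bytes arriving from outside
charset = "koi8-r"                 # the character set declared for those bytes

text = raw.decode(charset)         # decode once, at the boundary
shouted = text.upper()             # work on characters inside the program

output = shouted.encode("utf-8")   # encode only when handing data onwards
```

Only the first and last steps ever mention an encoding; everything in
between deals with characters.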

> OK. I read the opaque documentation^W^W fine manual for a while, then
> googled for a while, and finally decided to just hack about with what I
> had.
>
> I now have
>
>     charset_tag.appendChild( doc.createTextNode( segment[1] ) )
>     unicode = segment[0].decode( segment[1] ).encode( "utf-8")

This actually produces a byte (normal Python) string containing a UTF-8
representation of the text. This is not the same as having that text in
a Unicode object, which is the most useful form to have it in. If you
check the length of that byte string, for example, you get the number of
bytes, which won't necessarily be the number of characters. (Moreover,
by assigning to the name unicode, you're trampling on the built-in
unicode function here.)
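
The length mismatch can be seen in a small sketch, again assuming
Python 3 (str for Unicode text, bytes for the encoded form) and a made-up
sample string:

```python
# Assumption: Python 3; the sample text is illustrative only.
text = "Привет"                      # six Cyrillic characters
utf8_bytes = text.encode("utf-8")    # what .decode(...).encode("utf-8") yields

char_count = len(text)               # counts characters: 6
byte_count = len(utf8_bytes)         # counts bytes: 12, two per character here
```

The encoded form is therefore a poor thing to index, slice or hand to a
DOM that expects text.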

Do this instead:

      utext = segment[0].decode( segment[1] )

>     unicode_tag = doc.createElement( 'unicode' )
>     unicode_tag.appendChild( doc.createTextNode( unicode ) )

And this:

      unicode_tag.appendChild( doc.createTextNode( utext ) )

When you need to serialise this, the serialiser should then be able to
choose a suitable character encoding (eg. UTF-8) without running into
the problems you were experiencing.
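
Putting the pieces together, here is a hedged sketch that assumes
Python 3 and the standard library's xml.dom.minidom; the thread does not
say which DOM implementation is in use, so treat that as an assumption.
The root element name "segments", the element name "charset" and the
sample segment are hypothetical.

```python
from xml.dom.minidom import getDOMImplementation

# Hypothetical (bytes, charset) pair standing in for the real segment.
segment = ("Привет".encode("koi8-r"), "koi8-r")

# Assumption: minidom; "segments" is an invented root element name.
doc = getDOMImplementation().createDocument(None, "segments", None)

# Decode to Unicode text and leave it that way inside the DOM.
utext = segment[0].decode(segment[1])

charset_tag = doc.createElement("charset")
charset_tag.appendChild(doc.createTextNode(segment[1]))
unicode_tag = doc.createElement("unicode")
unicode_tag.appendChild(doc.createTextNode(utext))
doc.documentElement.appendChild(charset_tag)
doc.documentElement.appendChild(unicode_tag)

# The encoding is chosen only here, at serialisation time.
xml_bytes = doc.toxml(encoding="utf-8")
```

With minidom, passing encoding to toxml returns a byte string and writes
the matching encoding declaration into the XML prolog.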

Paul


