[XML-SIG] XML Unicode and UTF-8

Thu Aug 5 14:22:43 CEST 2004

> From: "Martin v. Löwis" <martin at v.loewis.de>
> Date: 2004/08/05 Thu AM 11:35:18 GMT
> To: n.youngman at ntlworld.com
> CC: xml-sig at python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8
> 
> n.youngman at ntlworld.com wrote:
> > First Pass:
> > 
> > segment_tag.appendChild( charset_tag )
> > unicode_tag = doc.createElement( 'unicode' )
> > unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
> > segment_tag.appendChild( unicode_tag )
> > 
> > Inserts binary data into the segment/unicode tag
> 
> What is segment[0] here? In XML, there is no notion of "binary data".

Sorry, I left out a key point. segment[0] is the decoded part of the output from email.Header.decode_header(). I believed this was a unicode string, but on checking the documentation again, it doesn't actually say that. So I guess at least part of the problem is that I'm getting some sort of binary data, which I thought was Unicode but isn't.
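To illustrate what decode_header() hands back, here is a minimal sketch. It assumes Python 3, where the module is spelled email.header and byte strings are the `bytes` type. The sample text, and the use of Header to build an encoded header, are assumptions made only to keep the example self-contained.

```python
from email.header import Header, decode_header

sample = "Привет"  # hypothetical sample text, not from the original message

# Build an RFC 2047 encoded header in koi8-r so there is something to decode.
raw_header = Header(sample, "koi8-r").encode()

# decode_header() returns a list of (data, charset) pairs.
parts = decode_header(raw_header)
for data, charset in parts:
    # For an encoded word, data is a byte string in the named charset,
    # not a Unicode string.
    print(type(data).__name__, charset)
```

The pair therefore carries undecoded bytes plus the name of the codec needed to interpret them.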

> > Leaves binary data in the document. I have assumed that this was raw
> > Unicode; maybe that's a flawed assumption?
> 
> There is nothing that could be called "raw Unicode", either. Again,
> XML does not support binary data.

XML doesn't, but Python does. If I ask Python to print a string without encoding it, I don't know whether the data is passed through unchanged. "Raw Unicode" seems to me like a reasonable term for the data in a unicode string.

> > consumed = self.encode(object, self.errors) UnicodeDecodeError:
> > 'ascii' codec can't decode byte 0xee in position 0: ordinal not in
> > range(128)
> > 
> > I hoped this would convert everything to UTF-8 and save it. The
> > appearance of an ASCII codec was a complete surprise to me.
> 
> You can only encode Unicode objects. Since you apparently have put
> a byte string object (<type 'str'>) into the DOM tree, it needs to
> convert the byte string into a Unicode string first, before it
> can encode the Unicode string as UTF-8. For that, it uses the system
> default encoding, which is us-ascii.
> 
> Now, the byte string contains the byte '\xee', which is not supported
> in ASCII.
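The failure described above can be reproduced in isolation. This is a sketch assuming Python 3, which never decodes implicitly, so the ASCII decode is written out explicitly here. The byte value comes from the traceback, and koi8-r is the charset assumed later in this message.

```python
data = b"\xee"  # the byte reported in the traceback

# Decoding as ASCII fails, because ASCII only covers bytes 0x00-0x7f.
try:
    data.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)
    print(message)

# Naming the codec the bytes were really written in succeeds.
text = data.decode("koi8-r")
print(repr(text))
```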

OK. That kind of makes sense, but I now have to figure out what is in the byte string and how to transform it to UTF-8. I guess that it's actually raw data in the character set given by the other part of the pair. Assuming it's a string in koi8-r, I need a codec that will transform koi8-r to UTF-8, probably via Unicode.
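The koi8-r to UTF-8 route via Unicode can be sketched as two codec steps. This assumes Python 3. The helper name to_utf8 and the sample text are hypothetical.

```python
def to_utf8(raw, charset):
    """Transcode bytes in `charset` to UTF-8 bytes via a Unicode string."""
    text = raw.decode(charset)   # bytes in the source charset -> Unicode
    return text.encode("utf-8")  # Unicode -> UTF-8 bytes


sample = "Привет"                # hypothetical sample text
koi8 = sample.encode("koi8-r")   # stands in for segment[0]
utf8 = to_utf8(koi8, "koi8-r")   # "koi8-r" stands in for segment[1]
print(len(koi8), len(utf8))
```

There is no direct koi8-r to UTF-8 codec in this sketch. Each codec only maps between bytes and Unicode, so the conversion always goes through the intermediate Unicode string.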

OK. I read the opaque documentation^W^W fine manual for a while, then googled for a while, and finally decided to just hack about with what I had.

I now have

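    # segment[1] is the charset name paired with the data by decode_header()
    charset_tag.appendChild( doc.createTextNode( segment[1] ) )
    # decode the byte string to Unicode, then re-encode it as UTF-8
    unicode = segment[0].decode( segment[1] ).encode( "utf-8")
    unicode_tag = doc.createElement( 'unicode' )
    unicode_tag.appendChild( doc.createTextNode( unicode ) )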

This appears to be working, or at least it doesn't generate any errors.
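One way to read Martin's point is that the text node should receive the decoded Unicode string, with the UTF-8 encoding left to serialization. The following is a sketch of that variant, not the code used above. It assumes Python 3 and xml.dom.minidom. The helper add_segment, the element name 'charset', the root element name, and the ASCII fallback for parts without a charset are all assumptions.

```python
from xml.dom.minidom import getDOMImplementation


def add_segment(doc, parent, data, charset):
    """Append <charset> and <unicode> children for one decode_header() pair."""
    # charset is None for unencoded parts; falling back to ASCII is an assumption.
    if isinstance(data, bytes):
        text = data.decode(charset or "ascii")
    else:
        text = data
    charset_tag = doc.createElement("charset")
    charset_tag.appendChild(doc.createTextNode(charset or "us-ascii"))
    unicode_tag = doc.createElement("unicode")
    # The text node holds Unicode, never encoded bytes.
    unicode_tag.appendChild(doc.createTextNode(text))
    parent.appendChild(charset_tag)
    parent.appendChild(unicode_tag)


doc = getDOMImplementation().createDocument(None, "segment", None)
add_segment(doc, doc.documentElement, "Привет".encode("koi8-r"), "koi8-r")

# Encoding happens once, when the whole document is serialized.
xml_bytes = doc.toxml(encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```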

Martin,

You have neatly pinpointed where I was confused. Your assistance is much appreciated.

Many Thanks

Neil Youngman
