[XML-SIG] XML Unicode and UTF-8

Neil Youngman n.youngman at ntlworld.com
Sat Aug 7 21:36:58 CEST 2004


On Saturday 07 Aug 2004 6:59 pm, Mike Brown wrote:
> Neil Youngman wrote:
> > On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> > > The resulting Unicode object may contain characters which are not
> > > allowed in XML, and thus the text may not be serializable (at least not
> > > in a way that would produce well-formed XML).
> >
> > Yes, but it's being written out through a UTF-8 codec
>
> Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML
> places restrictions on what characters can be in the *decoded* (Unicode)
> version of the document. The encoded version of the document is just an
> alternative representation of the Unicode one.
>
> In Python's notation, each character in the document must be one of:
> \t  (tab)
> \n  (linefeed)
> \r  (carriage return)
> \u0020-\ud7ff
> \ue000-\ufffd
> \U00010000-\U0010ffff
>
> You are not allowed to have any other characters in your document, not even
> by reference (e.g., you can't write &#0; to represent \u0000).
>
> So let's say you have 256 bytes of binary data, just byte values 0-255:
> >>> bytestring = ''.join(map(chr,range(256)))

OK. I think we're starting from different assumptions here. The data comes 
from decoding an RFC1522 header. It is therefore assumed to be text, albeit 
in a non-ASCII character set. It should not be an arbitrary chunk of binary 
data. 
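
As a rough sketch of the decoding step I mean, assuming Python 3 and the standard library's email.header module (the helper name header_to_text is made up for illustration, and parts with no declared charset are assumed to be ASCII):

```python
from email.header import decode_header


def header_to_text(raw_header):
    """Decode an RFC 1522 style header value into one Unicode string."""
    parts = []
    for chunk, charset in decode_header(raw_header):
        if isinstance(chunk, bytes):
            # Assumption: parts with no declared charset are ASCII.
            # Undecodable bytes become U+FFFD instead of raising.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


# An encoded word in ISO-8859-1 comes back as ordinary Unicode text.
subject = header_to_text("=?iso-8859-1?q?caf=E9?=")
```

A header that names a charset Python has no codec for would raise LookupError here; this sketch does not try to handle that case.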

I'm assuming, possibly incorrectly, that the standards are set up in such a 
way that if it's valid text, it should be possible to insert the UTF-8 
equivalent in XML. 

While I theoretically could get something that's not valid text, encoded in an 
RFC1522 header, it's only going to cause me real concern if it's a security 
flaw. If we can't adequately process invalid data, that's not a major concern 
for me. If you are saying that there may be text in character sets supported 
in Python (with the CJK codecs) that I can't insert as plain UTF-8 into a 
UTF-8 XML document, that would be a concern.
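
One way to test this, sketched under the assumption of Python 3, is to search the decoded text for anything outside the character ranges quoted above. The function name first_illegal_xml_char is hypothetical, and the sample strings are only illustrations, not data from my application:

```python
import re

# Matches any character outside the ranges XML 1.0 allows:
# tab, linefeed, carriage return, U+0020-U+D7FF, U+E000-U+FFFD,
# and U+10000-U+10FFFF.
_XML_ILLEGAL = re.compile(
    "[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def first_illegal_xml_char(text):
    """Return (index, char) for the first disallowed character, or None."""
    match = _XML_ILLEGAL.search(text)
    if match is None:
        return None
    return match.start(), match.group()


# Japanese text round-tripped through a CJK codec is all legal.
cjk = "\u65e5\u672c\u8a9e".encode("euc_jp").decode("euc_jp")
cjk_result = first_illegal_xml_char(cjk)

# A form feed is a control character that XML 1.0 does not allow.
ff_result = first_illegal_xml_char("page one\x0cpage two")
```

Here cjk_result is None and ff_result points at the form feed. If that holds generally, ordinary text from a CJK codec should be safe, and the cases to worry about are stray control characters rather than whole character sets.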

Neil Youngman


