[XML-SIG] Re: Parsing a unicode string

Mike Brown mike at skew.org
Tue Oct 5 21:10:16 CEST 2004


Fredrik Lundh wrote:
> > I'd also expect parsers to accept unicode string objects with no encoding specification 
> > whatsoever. Decoding a Unicode encoding and parsing XML are two distinct steps
> 
> not really; XML is defined in terms of encoded bytestreams.

To clarify for Konrad's benefit -

XML syntax is defined in terms of ISO/IEC 10646 characters.
XML parsing is defined in terms of encoded byte streams.

If the XML spec weren't so strict about what a parser must do, a parser would 
be able to operate on pre-decoded character streams. But as it is, the 
lowest-level parser must play dumb, and any Unicode-friendliness must be 
provided by a higher layer. SAX, for example, does accept Unicode character 
streams as entities and specifies that any encoding declaration appearing in 
the stream will be ignored, which is technically a violation of a couple of 
rules, e.g. that the declaration must be accurate :)
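
A rough sketch of that SAX behaviour, assuming Python 3.5 or later, where the
standard-library xml.sax expat reader accepts character streams. The names
TextCollector, parse_text and DOC are made up for illustration. The same
document is parsed twice: once as a character stream, where the encoding
declaration should be ignored, and once as UTF-8 bytes that are mislabelled
as iso-8859-1, where the parser has to believe the declaration.

```python
import io
import xml.sax
from xml.sax.xmlreader import InputSource


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them.
        self.chunks.append(content)


def parse_text(source):
    """Parse an InputSource and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parse(source, handler)
    return "".join(handler.chunks)


# The declaration claims iso-8859-1; the content is U+00E9 (e with acute).
DOC = '<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>'

# Higher layer: hand SAX an already-decoded character stream.
char_source = InputSource()
char_source.setCharacterStream(io.StringIO(DOC))
text_from_chars = parse_text(char_source)

# Lowest layer: hand the parser bytes. They are really UTF-8, but the
# parser only has the (inaccurate) declaration to go on.
byte_source = InputSource()
byte_source.setByteStream(io.BytesIO(DOC.encode("utf-8")))
text_from_bytes = parse_text(byte_source)

print(ascii(text_from_chars))
print(ascii(text_from_bytes))
```

With the character stream the text should come back as the single character
U+00E9, declaration notwithstanding. With the byte stream the two UTF-8 bytes
should be read as two Latin-1 characters, which is the parser playing dumb
exactly as the spec requires.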

