[XML-SIG] Re: Parsing a unicode string

Andrew Clover and-xml at doxdesk.com
Sun Oct 10 19:52:07 CEST 2004


> It really only makes sense to describe XML parsing in terms of byte
> streams.

Certainly this has traditionally been the case.

In DOM Level 3 LS, however, LSInput can now specify a character input 
source (characterStream or stringData properties) in which no attempt is 
made to do byte-to-character decoding.

There was a bit of a kerfuffle over what inputEncoding such Documents 
should report; 'utf-16' was decided on, as this is DOM's native string 
type. Unfortunately this doesn't sit well with Python, where a 
DOM-acceptable string might be narrow or, in the case where Python is 
compiled with wide chars, made of 32-bit characters. (pxdom plumps for 
reporting 'utf-8' and 'utf-32' in these cases, but it's not really 
clear-cut.)

Anyway, as a consequence pxdom can indeed accept Unicode strings passed 
to parseString, but this can't be relied upon for other implementations, 
especially DOM Level 2 ones.
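
For implementations where character input can't be relied upon, one 
workaround is to encode the string to bytes yourself and hand the parser 
a byte stream. A minimal sketch, written for current Python and using the 
standard library's xml.dom.minidom rather than pxdom; the helper name 
parse_text is made up for illustration, and it assumes the text carries 
no XML declaration naming some encoding other than UTF-8:

```python
from xml.dom import minidom


def parse_text(text):
    """Parse XML held in a character string by giving the parser UTF-8 bytes.

    Hypothetical helper. Assumes `text` has no XML declaration that names
    an encoding other than UTF-8, since the parser would trust that
    declaration over the bytes it is actually given.
    """
    data = text.encode("utf-8")  # character string -> byte string
    return minidom.parseString(data)


doc = parse_text("<greeting>caf\u00e9</greeting>")
root = doc.documentElement
content = root.firstChild.data  # text node under <greeting>
```

The parser then does its usual byte-to-character decoding, so the result 
doesn't depend on whether the implementation accepts character strings.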

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/