[XML-SIG] Strings or Unicode ?

M.-A. Lemburg mal@lemburg.com
Thu, 08 Nov 2001 17:18:35 +0100


"Martin v. Loewis" wrote:
> 
> > 1. find out the XML encoding by looking at the header,
> >    decode the data into Unicode,
> >    run the parser over the Unicode string and let it
> >    generate Unicode tag names, attributes, etc.
> 
> That is what xmlproc does. It supports chunked input (i.e. feeding),
> and converts any new chunk using the established encoding.
> 
> Processing the first chunk is tricky: it first tries to do encoding
> autodetection. If that does not give any clue, it parses the xml
> declaration as a byte string until it sees the encoding
> declaration. It then recodes the first chunk using the established
> encoding, and trusts that the current position in the string is good
> for the Unicode string also. Since XML only supports ASCII supersets
> as encodings (*), I think this is a reliable assumption.

Thanks for the insight. 

The question still remains, though: is this an acceptable approach
in practice? (Converting Unicode back to 8-bit strings has its cost,
so it might be worthwhile to have the 8-bit string approach available
as well.)
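
For reference, the first-chunk handling described above could be
sketched roughly as follows. This is not xmlproc's actual code, only
a minimal illustration written for current Python 3. The names
sniff_encoding and ChunkDecoder are made up for this example, the
UTF-8 fallback is an assumption, and the sketch assumes the whole XML
declaration arrives in the first chunk. EBCDIC autodetection is left
out.

```python
import codecs
import re

# Byte-order marks, checked before the XML declaration. UTF-32 comes
# first because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# The declaration is matched on the raw bytes. This relies on the
# encoding being an ASCII superset, as discussed above.
_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_encoding(first_chunk, default="utf-8"):
    """Guess the encoding from the first chunk of a byte stream."""
    for bom, name in _BOMS:
        if first_chunk.startswith(bom):
            return name
    match = _DECL.match(first_chunk)
    if match:
        return match.group(1).decode("ascii")
    return default  # assumption: no BOM and no declaration means UTF-8


class ChunkDecoder:
    """Decode fed byte chunks using the encoding found in the first one."""

    def __init__(self):
        self._decoder = None

    def feed(self, chunk):
        if self._decoder is None:
            encoding = sniff_encoding(chunk)
            self._decoder = codecs.getincrementaldecoder(encoding)()
        # The incremental decoder buffers a multi-byte character that
        # is split across two chunks.
        return self._decoder.decode(chunk)
```

A parser working on Unicode would then call feed() on each incoming
chunk and parse the returned text; the 8-bit alternative would skip
the decoding step and hand the raw chunks to the parser directly.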

> (*) Other encodings apparently are only supported when some
> higher-level protocol already reports the encoding used, or if EBCDIC
> is autodetected.

Thanks,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/