[XML-SIG] Strings or Unicode ?
M.-A. Lemburg
mal@lemburg.com
Thu, 08 Nov 2001 17:18:35 +0100
"Martin v. Loewis" wrote:
>
> > 1. find out the XML encoding by looking at the header,
> > decode the data into Unicode,
> > run the parser over the Unicode string and let it
> > generate Unicode tag names, attributes, etc.
>
> That is what xmlproc does. It supports chunked input (i.e. feeding),
> and converts any new chunk using the established encoding.
>
> Processing the first chunk is tricky: it first tries to do encoding
> autodetection. If that does not give any clue, it parses the xml
> declaration as a byte string until it sees the encoding
> declaration. It then recodes the first chunk using the established
> encoding, and trusts that the current position in the string is good
> for the Unicode string also. Since XML only supports ASCII supersets
> as encodings (*), I think this is a reliable assumption.
Thanks for the insight.
The question still remains, though: is this an acceptable approach
in practice? Converting Unicode back to 8-bit strings has its cost,
so it might be worthwhile to keep the 8-bit string approach
available as well.
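
To make the discussion concrete, here is a rough sketch of the
first-chunk handling described in the quoted text: try autodetection
first, fall back to reading the XML declaration as a byte string, then
decode every chunk with the established encoding. This is not
xmlproc's actual code. The names sniff_encoding and UnicodeFeeder are
made up for illustration, and the sketch assumes the whole XML
declaration arrives in the first chunk and that UTF-8 is the default
when nothing else is found.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
# Works on bytes because the declaration is ASCII in any ASCII superset.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(first_chunk):
    """Guess the encoding of an XML document from its first bytes.

    Hypothetical helper: assumes the XML declaration, if present,
    is entirely contained in first_chunk.
    """
    # Step 1: autodetection from a byte order mark or UTF-16 byte pattern.
    if first_chunk.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if first_chunk.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if first_chunk.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if first_chunk.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # Step 2: no clue yet, so read the declaration as a byte string.
    match = _DECL_RE.match(first_chunk)
    if match:
        return match.group(1).decode("ascii")
    # Step 3: assumed default when nothing is declared.
    return "utf-8"


class UnicodeFeeder:
    """Hypothetical chunked decoder: bytes in, Unicode text out."""

    def __init__(self):
        self._decoder = None

    def feed(self, chunk):
        # The first chunk establishes the encoding; later chunks reuse it.
        if self._decoder is None:
            encoding = sniff_encoding(chunk)
            self._decoder = codecs.getincrementaldecoder(encoding)()
        # An incremental decoder buffers any multi-byte sequence that is
        # split across chunk boundaries.
        return self._decoder.decode(chunk)
```

A parser built this way would only ever see Unicode text; the cost in
question is whatever the application pays to encode the results back
into 8-bit strings afterwards.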
> (*) Other encodings apparently are only supported when some
> higher-level protocol already reports the encoding used, or if EBCDIC
> is autodetected.
Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/