[XML-SIG] Strings or Unicode ?

Martin v. Loewis martin@v.loewis.de
Thu, 8 Nov 2001 17:04:59 +0100


> 1. find out the XML encoding by looking at the header,
>    decode the data into Unicode,
>    run the parser over the Unicode string and let it
>    generate Unicode tag names, attributes, etc.

That is what xmlproc does. It supports chunked input (i.e. feeding),
and converts any new chunk using the established encoding.
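
To illustrate the feeding part, here is a minimal sketch in current
Python. It is not xmlproc's code: the helper name decode_chunks is
made up for this example, and it assumes the encoding has already been
established. An incremental decoder holds back a multi-byte character
that is split across two chunks until the rest of it arrives.

```python
import codecs

def decode_chunks(chunks, encoding):
    """Yield Unicode text for each byte chunk, using one fixed encoding.

    Hypothetical helper for illustration; not part of xmlproc.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Bytes of an incomplete character are buffered inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; this raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

data = "<a>é</a>".encode("utf-8")
chunks = [data[:4], data[4:]]  # the split falls inside the two-byte 'é'
print(list(decode_chunks(chunks, "utf-8")))  # ['<a>', 'é</a>']
```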

Processing the first chunk is tricky: it first attempts encoding
autodetection. If that does not give any clue, it parses the XML
declaration as a byte string until it sees the encoding
declaration. It then recodes the first chunk using the established
encoding, and trusts that the current position in the byte string is
also the right position in the Unicode string. In an ASCII superset,
every character of the declaration occupies exactly one byte, so byte
offsets and character offsets agree up to that point. Since XML only
supports ASCII supersets as encodings (*), I think this is a reliable
assumption.
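
The first-chunk logic could look roughly like the sketch below. Again,
this is not xmlproc's code. The names start_decoding, _BOMS and _DECL
are invented for the example, autodetection is reduced to a
byte-order-mark check (UTF-32 and EBCDIC are left out), UTF-8 is
assumed as the fallback, and the first chunk is assumed to contain the
whole XML declaration and to end on a character boundary.

```python
import codecs
import re

# Simplified autodetection: byte order marks only (an assumption of this sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Match the XML declaration as bytes, up to the end of the encoding value.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def start_decoding(first_chunk, default="utf-8"):
    """Return (text, pos): the decoded first chunk and the parse position.

    Hypothetical helper for illustration; not part of xmlproc.
    """
    for bom, name in _BOMS:
        if first_chunk.startswith(bom):
            # Autodetection succeeded: drop the BOM and start at the beginning.
            return first_chunk[len(bom):].decode(name), 0
    m = _DECL.match(first_chunk)
    encoding = m.group(1).decode("ascii") if m else default
    # Everything scanned so far was ASCII, so the byte offset is also
    # the character offset in the decoded text.
    pos = m.end() if m else 0
    return first_chunk.decode(encoding), pos

first = '<?xml version="1.0" encoding="iso-8859-1"?><a>é</a>'.encode("iso-8859-1")
text, pos = start_decoding(first)
print(repr(text[pos:]))  # '?><a>é</a>'
```

The parser can then continue at text[pos:] without rescanning the
part of the declaration it has already read as bytes.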

Regards,
Martin

(*) Other encodings are apparently only supported when some
higher-level protocol already reports the encoding used, or when
EBCDIC is autodetected.