[XML-SIG] unicode data

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 7 Nov 2000 00:12:38 +0100


> Actually, this part of the spec is basically confused and confusing.
> The character encoding used only matters in primitive programming
> languages where there is no suitable wide string type.  This basically
> means C and C++.
> 
> That the requirement for UTF-16 fits Java, tcl and Python is mostly
> pure luck, since both UTF-8 (used by Perl) and UCS-4 (used by gcc) are
> credible alternatives.
> 
> In most languages, the character encoding used in wide strings are
> something the DOM should keep quiet about.

The real underlying requirement is: strings in the DOM are sequences
of Unicode code points (i.e. Unicode is the character set), and the
data type should allow access to individual characters (rather than to
individual bytes of an encoding). IOW, DOM applications don't need to
care about multiple character sets.

In that sense, UTF-16 and UCS-4 certainly qualify (if accessible on a
per-character basis). I don't know about Perl, but I think UTF-8
encoded byte strings would not be suitable in Python.
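
To illustrate the per-character versus per-byte distinction, here is a
minimal sketch. It assumes a present-day Python 3 interpreter, where
str is indexed by code point; it is not the Python 2.0 unicode type
under discussion here, and the sample word is only an example.

```python
# Sketch only: assumes Python 3, where str is indexed by code point.
text = "na\u00efve"            # 5 characters; the third is U+00EF
utf8 = text.encode("utf-8")    # the same text as a UTF-8 byte string

char_count = len(text)         # counts characters
byte_count = len(utf8)         # counts bytes; U+00EF takes two of them

third_char = text[2]           # a whole character: U+00EF
third_byte = utf8[2]           # only the first byte of U+00EF (0xC3)

print(char_count, byte_count, repr(third_char), hex(third_byte))
```

Indexing the byte string lands in the middle of a character, which is
why a UTF-8 byte string by itself does not give DOM-style
per-character access.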

Regards,
Martin

P.S. The Unicode standard (even Unicode 3.0) is stupid enough to
rule out UCS-4 as an in-memory representation for Unicode. So it'll
take a while until others (like W3C) get it really right. As a
starting point, I think DOMString was the right thing to define.