[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
24 Sep 2002 00:13:55 +0200


Eric van der Vlist <vdv@dyomedea.com> writes:

> > No. Why do you think this? Strictly speaking, XML 1.0 defines a
> > "character" as defined by ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000.
> > This means only characters in the Basic Multilingual Plane are allowed
> > in XML. James Clark's document is, strictly speaking, ill-formed.
> 
> That's weird...

I'm not surprised. James is interested in funny and strange cases. He
is, as usual, ahead of his time, and predicts the future - most likely
correctly. He does not care about strict conformance, but acts as an
early adopter, making things work that aren't supposed to work just
yet.

You should use his test suite only if you can follow his principles.

> And I need to do the same in Python...

Not necessarily. You can

1. Ignore the problem. This is probably fine: nobody is using non-BMP
   characters right now. Most systems have serious problems displaying
   them, since font systems are restricted to 64k glyphs, and, in many
   cases, to displaying characters in the BMP only.

2. Declare that this works correctly in UCS-4 builds of Python
   only. People who need such characters will use a UCS-4 build of
   Python anyway; Guido expects Chinese users to be early adopters
   here. Notice that James has no such option: Java is inherently tied
   to UTF-16.

3. Implement it properly. Please understand that you will be trading
   efficiency for correctness.
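
To make options 2 and 3 a bit more concrete, here is a rough sketch,
not a definitive implementation. It is written for a current Python,
and the names is_wide_build, to_surrogates and code_points are made up
for this illustration; none of them exist in the standard library or
in PyXML. The only real API used is sys.maxunicode, which is 0xFFFF on
a UCS-2 build and 0x10FFFF on a UCS-4 build. The pairing arithmetic is
the standard UTF-16 surrogate scheme.

```python
import sys


def is_wide_build():
    # Option 2: sys.maxunicode is 0xFFFF on a narrow (UCS-2/UTF-16)
    # build and 0x10FFFF on a wide (UCS-4) build.
    return sys.maxunicode > 0xFFFF


def to_surrogates(cp):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def code_points(units):
    """Option 3: yield code points from UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    code point; anything else (including a lone surrogate) is passed
    through unchanged.
    """
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1
```

With these definitions, to_surrogates(0x10800) gives the pair
(0xD802, 0xDC00), and code_points([0x41, 0xD802, 0xDC00, 0x42]) yields
0x41, 0x10800, 0x42. The extra test on every code unit is the kind of
cost meant by trading efficiency for correctness.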

> > Notice also that U+10800 is unassigned even in Unicode 3.2.
> 
> I wonder why he has picked this value!

Out of the blue. He is not really interested in non-BMP characters,
but this particular value is "even", so a good choice for a test case.

Regards,
Martin