Unicode Newbie

Tue Sep 16 17:18:30 EDT 2003

[Martin von Löwis]
> Manuel Huesser <sylphaleya at hta.fhz.ch> writes:
> 
> > Yep Unicode supports less characters than there are possible with
> > utf-8 (ucs range = 2 ** 31).
> > 
> > so there is no possibility to support the full range of the ucs
> > character set with python?
> 
> The ucs range (for UCS-4) is *not* 2**31; it is 17*2**16. It was 2**32
> in ISO/IEC 10646:1993 (I believe), but it got constrained in 10646:2000.

I think UCS-4 is (or at least was) defined for 2**31 code points only.  I
do not know why the sign bit was excluded (maybe to avoid problems with
negative values for code points?), but if you consider the logic of
UTF-8, you will see that one more full byte would be needed to support
the 32nd bit.  This does not mean it was the reason; I do not know.
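
To make the UTF-8 arithmetic concrete, here is a rough sketch, assuming
the original UTF-8 layout that allowed sequences of up to six bytes.  The
function name is made up for illustration, and the seven-byte case is
purely hypothetical (no such form was ever defined); it only shows what
the extra byte would buy.

```python
def utf8_payload_bits(n):
    """Payload bits in an n-byte sequence of the original UTF-8 scheme."""
    if n == 1:
        return 7                     # 0xxxxxxx
    lead_bits = 7 - n                # lead byte: n one-bits, a zero, then payload
    return lead_bits + 6 * (n - 1)   # each 10xxxxxx continuation byte carries 6

# n = 7 is hypothetical: the lead byte would be 11111110, with no payload bits.
for n in range(1, 8):
    print(n, utf8_payload_bits(n))
```

Six bytes give exactly 31 bits, so a 32nd bit does not fit without going
to a seventh byte.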

UTF-16 has 17*2**16 code points.  I have not studied the legal verses
recently, but my overall impression is that UTF-16 has been more or less
integrated into UCS-2 in more recent Unicode versions, and made official.
I do not know exactly what UCS-2 means nowadays, as it does not really
exist anymore as originally defined (with the intent of being fixed
width).  Unless UCS-2 is 2**16 - 2**11 code points?  The surrogate areas
cannot sensibly be part of it, at least nowadays.  Hmph!  I should
really read recent legal texts whenever I dive into such areas... :-)
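
For what it is worth, the numbers above can be checked with a few lines
of Python.  This is only a sketch of the arithmetic, assuming the usual
surrogate ranges (high D800..DBFF, low DC00..DFFF); the variable names
are made up for illustration.

```python
PLANE_SIZE = 2 ** 16                       # code points per plane

high_surrogates = 0xDBFF - 0xD800 + 1      # 1024
low_surrogates = 0xDFFF - 0xDC00 + 1       # 1024
surrogates = high_surrogates + low_surrogates   # 2048 == 2**11

# Each (high, low) pair addresses one code point beyond the first plane.
supplementary = high_surrogates * low_surrogates   # 2**20 == 16 planes

utf16_range = PLANE_SIZE + supplementary
print(utf16_range)                 # 1114112 == 17 * 2**16

# The first plane with the surrogate areas taken out.
print(PLANE_SIZE - surrogates)     # 63488 == 2**16 - 2**11
```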

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard