Unicode Newbie
François Pinard
pinard at iro.umontreal.ca
Tue Sep 16 17:18:30 EDT 2003
[Martin von Löwis]
> Manuel Huesser <sylphaleya at hta.fhz.ch> writes:
>
> > Yep Unicode supports less characters than there are possible with
> > utf-8 (ucs range = 2 ** 31).
> >
> > so there is no possibility to support the full range of the ucs
> > character set with python?
>
> The ucs range (for UCS-4) is *not* 2**31; it is 17*2**16. It was 2**32
> in ISO/IEC 10646:1993 (I believe), but it got constrained in 10646:2000.

I think UCS-4 is (or at least was) defined for 2**31 code points only. I
do not know why the sign bit was excluded (maybe to avoid problems with
negative values for code points?), but if you consider the logic of
UTF-8, you will see that one more full byte would be needed to support
the 32nd bit. This does not mean that was the reason; I do not know.
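
To make the UTF-8 argument concrete, here is a small sketch of how many
payload bits each sequence length carries in the original (up to six
bytes) UTF-8 scheme. The helper name utf8_payload_bits is mine, and the
seven-byte case is purely hypothetical: it was never part of any UTF-8
definition, it only shows what a 32nd bit would cost.

```python
def utf8_payload_bits(n_bytes):
    """Payload bits of an n-byte sequence in the original UTF-8 layout."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits, one zero bit, then (7 - n) payload bits.
    # Each continuation byte (10xxxxxx) contributes 6 more bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)

# Sequence lengths 1..6 as originally specified for UTF-8.
payload = {n: utf8_payload_bits(n) for n in range(1, 7)}

# The longest original form tops out at 31 bits ...
max_bits = payload[6]

# ... so a 32nd bit would need a (hypothetical) seventh byte.
bits_with_seventh_byte = utf8_payload_bits(7)
```

The six-byte form stops at exactly 31 bits, which matches the 2**31
figure, whatever the actual reason for it was.
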

UTF-16 has 17*2**16 code points. I have not studied the legal verses
recently, but my overall impression is that UTF-16 has been more or less
integrated into UCS-2 in more recent Unicode versions, and made official.
I do not know exactly what UCS-2 means nowadays, as it does not really
exist anymore as originally defined (with the intent of being fixed
width). Unless UCS-2 is 2**16 - 2**11 code points? The surrogate areas
cannot sensibly be part of it, at least nowadays. Hmph! I should
really read recent legal texts when I get to dive into such areas... :-)
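
For what it is worth, the arithmetic behind these two figures can be
checked quickly. This is only a sketch: the variable names are mine, and
the surrogate ranges used are the standard D800-DBFF (high) and
DC00-DFFF (low) blocks.

```python
BMP = 2**16  # the 16-bit plane, U+0000..U+FFFF

# Surrogate blocks inside the 16-bit plane.
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # 1024
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # 1024
SURROGATES = HIGH_SURROGATES + LOW_SURROGATES  # 2048 == 2**11

# Each (high, low) pair addresses one code point beyond the 16-bit plane.
supplementary = HIGH_SURROGATES * LOW_SURROGATES  # 2**20

# Code point range reachable through UTF-16: U+0000..U+10FFFF.
utf16_range = BMP + supplementary

# A fixed-width 16-bit set with the surrogate areas taken out.
ucs2_without_surrogates = BMP - SURROGATES

print(utf16_range, ucs2_without_surrogates)
```

So 17*2**16 is the 16-bit plane plus the 16 planes reachable through
surrogate pairs, and 2**16 - 2**11 is what remains of the 16-bit plane
once the surrogate areas are excluded.
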
--
François Pinard http://www.iro.umontreal.ca/~pinard