[Tutor] why is unichr(sys.maxunicode) blank?

eryksun eryksun at gmail.com
Sat May 18 16:23:45 CEST 2013


On Sat, May 18, 2013 at 6:01 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> East Asian languages. But later on Joel Spolsky's "standard" page about unicode
> I read that it goes to 6 bytes. That's what I implied when I mentioned "utf8".

Each surrogate in a UTF-16 surrogate pair carries 10 bits of payload,
for a total of 20 bits. Thus UTF-16 sets the upper bound on the number
of code points at 2**20 + 2**16 (the BMP), i.e. 0x110000, with U+10FFFF
as the highest code point. UTF-8 needs at most 4 bytes per code point
to cover this range.
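
The arithmetic can be checked with a short sketch. It assumes Python 3
(where chr() covers the full range), and the names to_surrogate_pair,
BMP_SIZE, etc. are made up for illustration:

```python
# Python 3 sketch; all names here are illustrative, not from any library.
BMP_SIZE = 2**16             # code points reachable with one 16-bit unit
SUPPLEMENTARY_SIZE = 2**20   # code points reachable via a surrogate pair
TOTAL_CODE_POINTS = BMP_SIZE + SUPPLEMENTARY_SIZE
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1


def to_surrogate_pair(cp):
    """Split a code point above the BMP into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Number of UTF-8 bytes needed for the highest code point.
utf8_len = len(chr(MAX_CODE_POINT).encode("utf-8"))

print(hex(TOTAL_CODE_POINTS), hex(MAX_CODE_POINT), utf8_len)
print([hex(u) for u in to_surrogate_pair(MAX_CODE_POINT)])
```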

> A certain locale implies a certain codepage (on Windows), but where does the locale
> category LC_CTYPE fit in this story?

LC_CTYPE is the locale category that classifies characters. In Debian
Linux, the English-language locales copy LC_CTYPE from the i18n
(internationalization) locale:

short: http://goo.gl/Hs8RD
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/locales/i18n?view=markup

Here's the mapping between the symbolic Unicode names in the latter
(e.g. <U0020>) and UTF-8:

short: http://goo.gl/cZ3dS
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/charmaps/UTF-8?view=markup
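
As a rough illustration of what that mapping expresses, here is a
Python 3 sketch that turns a symbolic name into its UTF-8 bytes. It
assumes the name has the form <U followed by 4 to 8 hex digits and >;
the helper symbolic_to_utf8 is hypothetical and does not parse the
charmap file itself:

```python
import re

# Assumed form of a symbolic name: <U0020>, <U00E9>, <U00010000>, ...
_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def symbolic_to_utf8(name):
    """Return the UTF-8 bytes for a symbolic name such as '<U0020>'."""
    match = _SYMBOL.fullmatch(name)
    if match is None:
        raise ValueError("not a symbolic Unicode name: %r" % name)
    code_point = int(match.group(1), 16)
    return chr(code_point).encode("utf-8")


for name in ("<U0020>", "<U00E9>", "<U00010000>"):
    print(name, symbolic_to_utf8(name))
```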

The i18n locale is defined by ISO/IEC technical report 14652 as an
instance of the FDCC-set (i.e. Set of Formal Definitions of Cultural
Conventions), an upward-compatible extension to the POSIX locale
specification. Here it is in all its glory, if you like reading
technical reports:

http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf

If that's not enough, here's the POSIX 1003.1 locale spec:

short: http://goo.gl/aOJUx
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

> Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?

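Narrow builds create UTF-16 surrogate pairs from \U literals for code
points above U+FFFF, but a pair isn't treated as an atomic unit for
slicing, iteration, or string length. Each surrogate counts as a
separate character, so such a string has a length of 2.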
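
A narrow build isn't needed to see the two code units. The sketch
below assumes Python 3.3 or later, where str is indexed by code point,
and looks at the UTF-16 encoding instead; the name utf16_units is made
up for this example:

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of ints."""
    data = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)


s = "\U0001F600"          # one code point above the BMP
units = utf16_units(s)

print(len(s))                    # code points (Python 3.3+)
print(len(units))                # UTF-16 code units, as a narrow build counts
print([hex(u) for u in units])   # the high and low surrogate
```

On a narrow build, indexing or slicing the string works on the
individual code units, so s[0] and s[1] would each be a lone surrogate.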

