Support for "wide" Unicode characters

Fri Jun 29 14:18:15 EDT 2001

I like this PEP except for a few little details:

>   * codecs will be upgraded to support "wide characters"
>    (represented directly in UCS-4, as surrogate pairs in UTF-16 and
>    as multi-byte sequences in UTF-8). On narrow Python builds, the
>    codecs will generate surrogate pairs, on wide Python builds they
>    will generate a single character. This is the main part of the
>    implementation left to be done.

I would prefer that on narrow builds the codecs raise exceptions
when they encounter characters that would need surrogate pairs. Having
surrogate pairs in your unicode objects is going to make life very
unpleasant for most people, because indexing and interpretation
become difficult.
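
To make the indexing problem concrete, here is a minimal sketch. It
assumes Python 3.3 or later, where there are no narrow builds, so the
storage of a narrow build is simulated as a list of 16-bit code units.
The helper name to_utf16_units is hypothetical and is not part of the
PEP or of any codec.

```python
def to_utf16_units(text):
    """Return the 16-bit code units a narrow build would store for text."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: one code unit.
            units.append(cp)
        else:
            # Non-BMP character: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return units


text = "a\U00010400b"           # three characters, the middle one non-BMP
units = to_utf16_units(text)

print(len(text), len(units))    # character count vs. code-unit count
print(hex(units[1]))            # index 1 is only half of a character
```

The string has 3 characters but 4 code units, and units[1] is 0xd801,
a lone high surrogate, so length, indexing and slicing no longer
correspond to characters.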

BTW, it looks like the UTF-8 and unicode-escape decoders insert
surrogate pairs into the unicode object, while the UTF-16 decoder
raises an exception (UnicodeError: code pairs are not supported).

>   There is a new configure option:
>
>       --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
>                             wchar_t if it fits
>       --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
>                             wchar_t if it fits
>       --enable-unicode      same as "=ucs2"

So what is the proposed behavior if sizeof(Py_UNICODE) >
sizeof(wchar_t)? When an extension module asks for wide characters,
will Python attempt to down-encode the buffer? If that is not
possible, will it insert surrogate pairs? Will it generate an
exception?
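
The two fallbacks asked about above can be sketched as follows. This
is only an illustration of the choice, not a proposal for the actual
C API: the function to_wchar16 and its policy argument are
hypothetical, and a 16-bit wchar_t buffer is modelled as a list of
integers.

```python
import struct


def to_wchar16(text, policy="strict"):
    """Hypothetical: convert text to a list of 16-bit wchar_t values.

    policy="strict"     raise if a character does not fit in 16 bits
    policy="surrogates" emit a surrogate pair for such characters
    """
    if policy == "strict":
        for ch in text:
            if ord(ch) > 0xFFFF:
                raise ValueError(
                    "U+%04X does not fit in a 16-bit wchar_t" % ord(ch)
                )
    elif policy != "surrogates":
        raise ValueError("unknown policy: %r" % (policy,))
    # UTF-16 encoding turns each non-BMP character into a surrogate pair.
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


print(to_wchar16("ab"))
print(to_wchar16("a\U00010400", policy="surrogates"))
```

With the strict policy the second call would raise ValueError instead.
Whichever behavior is chosen, the extension module needs to know which
one it will get, since the surrogate variant changes the buffer length.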