unicode by default

Thu May 12 18:25:24 EDT 2011

On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard has long considered 'UCS-2' obsolete. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_ it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."
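
As an aside, here is a rough way to check which of the two build types
a given interpreter is. This is only a sketch: it relies on
sys.maxunicode, and the helper name unicode_build is made up for
illustration. On interpreters that do not have the 16-bit/32-bit build
distinction at all, it will simply report "wide".

```python
import sys

def unicode_build():
    """Guess the build type from the largest code point the interpreter allows.

    Hypothetical helper, not part of any library. A 16-bit ("narrow") build
    reports sys.maxunicode == 0xFFFF; a 32-bit ("wide") build reports 0x10FFFF.
    """
    if sys.maxunicode == 0xFFFF:
        return "narrow"
    return "wide"

print(unicode_build())
```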

PEP 100 says:

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits).  Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals.  The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption.  The
    constructor does not check the data for Unicode compliance or use
    of surrogates.

I'm getting out of my depth here, but to me that implies that while
Python stores UTF-16 and can correctly convert it to and from UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs, treating each half of a pair as
a separate character.
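
To make the UCS-2 versus UTF-16 difference concrete, here is a small
sketch that lists the 16-bit code units UTF-16 uses for a character.
The helper name utf16_code_units is made up for illustration; it only
uses the standard "utf-16-be" codec and the struct module. It shows
what the encoding looks like, not how any particular Python build
stores strings internally.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of the big-endian UTF-16 encoding of text.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-16-be")
    # Each UTF-16 code unit is one unsigned 16-bit integer ("H").
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

bmp = u"\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF, in a supplementary plane

print([hex(u) for u in utf16_code_units(bmp)])     # ['0x20ac']
print([hex(u) for u in utf16_code_units(astral)])  # ['0xd834', '0xdd1e']
```

The BMP character fits in a single code unit, so UCS-2 and UTF-16
agree on it. The supplementary-plane character needs a surrogate pair,
which UCS-2 has no way to express. On a 16-bit build, len(astral)
would be 2, because each surrogate is counted as its own character; on
a 32-bit build it would be 1.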

I'm not sure how much of this has changed since the original
implementation, though, especially in Python 3.