[Python-Dev] len(chr(i)) = 2?

Sat Nov 20 10:05:38 CET 2010

On 20.11.2010 05:11, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> 
>  > The term "UCS-2" is a character set that can encode only 65536
>  > characters; it thus refers to Unicode 1.1. According to the Unicode
>  > Consortium's FAQ, the term UCS-2 should be avoided these days.
> 
> So what do you propose we call the Python implementation?

A technically correct description would be to say that Python uses either
16-bit code units or 32-bit code units; for brevity, these can be called
narrow and wide code units.
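
As a rough illustration of the narrow/wide distinction, the following
sketch counts how many code units one code point needs at each width.
The function name code_units is hypothetical, chosen for this example
only; it is plain arithmetic and does not inspect the build of the
running interpreter.

```python
def code_units(code_point, unit_bits):
    """Return how many code units one code point occupies.

    Hypothetical helper for illustration: unit_bits is 16 for a
    narrow representation and 32 for a wide one.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if unit_bits == 32:
        # Every code point fits in a single 32-bit unit.
        return 1
    if unit_bits == 16:
        # Code points above U+FFFF need a surrogate pair.
        return 2 if code_point > 0xFFFF else 1
    raise ValueError("unit_bits must be 16 or 32")


narrow = code_units(0x10000, 16)
wide = code_units(0x10000, 32)
```

With 16-bit code units, a single non-BMP character occupies two units,
which is the len(chr(i)) == 2 case from the subject line.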

> Strictly speaking, internally Python only encodes 65536 characters in
> 2-octet builds.  Its (Unicode) string-handling code does not know
> about surrogates at all, AFAIK

Here you are mistaken: it does indeed know about UTF-16 and surrogates
in several places, e.g. in the UTF-8 codec, or in the repr()
implementation; likewise in the parser.
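
To make the surrogate handling concrete, here is a small sketch of the
UTF-16 surrogate-pair arithmetic, checked against the standard codecs.
The helper to_surrogates is hypothetical and written for this example;
it is not the code used inside the interpreter.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF, a character outside the BMP.
clef = chr(0x1D11E)
pair = to_surrogates(0x1D11E)

# The UTF-16 codec emits the same two 16-bit units; the UTF-8 codec
# emits one four-byte sequence for the character, not two three-byte
# sequences for the individual surrogates.
utf16_bytes = clef.encode("utf-16-be")
utf8_bytes = clef.encode("utf-8")
```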

> and therefore is not UTF-16 conforming.

I disagree. Python does "conform" to "UTF-16" (certainly in the
sense that no UTF-16 specification ever mandates a certain Python
API, and that Python follows all general requirements of the
UTF-16 specification).

> AFAIK this was not supposed to change in Python 3; indexing and
> slicing go by code unit (isomorphic to UCS-n), not character, and due
> to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
> can produce output that conforms to Unicode not at all (as a user
> option, of course, but it's still non-conformant).

What behavior specifically do you consider non-conforming, and what
specific specification do you think it is not conforming to? For
example, it *is* fully conforming with UTF-8.
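
For reference, this is a minimal sketch of the PEP 383 behavior under
discussion, using the "surrogateescape" error handler. The variable
names are illustrative only, and the example is not meant to settle
the conformance question by itself.

```python
# A byte string that is not valid UTF-8 (0xFF can never occur in UTF-8).
raw = b"abc\xff"

# With the PEP 383 handler, the undecodable byte becomes the lone
# surrogate U+DCFF instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only when the caller asks for the handler again are the original
# bytes reproduced.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

The non-UTF-8 output appears only when the error handler is requested
explicitly; the default codec rejects the string.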

Regards,
Martin