[Python-Dev] PEP 393 Summer of Code Project

Thu Aug 25 02:36:14 CEST 2011

Guido van Rossum writes:

 > I see nothing wrong with having the language's fundamental data types
 > (i.e., the unicode object, and even the re module) to be defined in
 > terms of codepoints, not characters, and I see nothing wrong with
 > len() returning the number of codepoints (as long as it is advertised
 > as such).

In fact, the Unicode Standard, Version 6, goes further and defines
strings in terms of code units rather than code points:

    2.7  Unicode Strings

    A Unicode string data type is simply an ordered sequence of code
    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
    code units, a Unicode 16-bit string is an ordered sequence of
    16-bit code units, and a Unicode 32-bit string is an ordered
    sequence of 32-bit code units. 

    Depending on the programming environment, a Unicode string may or
    may not be required to be in the corresponding Unicode encoding
    form. For example, strings in Java, C#, or ECMAScript are Unicode
    16-bit strings, but are not necessarily well-formed UTF-16
    sequences.

(p. 32).
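
To make the code point / code unit distinction concrete, here is a
rough sketch. It assumes a Python where len() on a str counts code
points (3.3 or later, i.e. with PEP 393 in place; an older narrow build
would report 2 for the first string). The helper names code_units and
is_well_formed_utf16 are made up for illustration and are not part of
any library.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
s = "\U0001D11E"


def code_units(text, encoding, unit_size):
    """Count the code units `text` occupies in the given encoding form.

    Hypothetical helper: encodes the string and divides the byte length
    by the size of one code unit (1, 2 or 4 bytes).
    """
    return len(text.encode(encoding)) // unit_size


def is_well_formed_utf16(text):
    """Return True if `text` can be encoded as well-formed UTF-16.

    Hypothetical helper: relies on the codec rejecting lone surrogates
    under the default 'strict' error handler.
    """
    try:
        text.encode("utf-16-le")
    except UnicodeEncodeError:
        return False
    return True


# One code point, but a different number of code units per encoding form.
print(len(s))                         # code points
print(code_units(s, "utf-8", 1))      # 8-bit code units
print(code_units(s, "utf-16-le", 2))  # 16-bit code units
print(code_units(s, "utf-32-le", 4))  # 32-bit code units

# A lone high surrogate: a legal element of a "Unicode 16-bit string" in
# the sense quoted above, but not a well-formed UTF-16 sequence.
lone = "\ud834"
print(is_well_formed_utf16(s))
print(is_well_formed_utf16(lone))
```

The last two lines show the case the Standard mentions for Java, C#
and ECMAScript: a sequence of 16-bit code units can be a valid string
value without being well-formed UTF-16.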