[Python-Dev] len(chr(i)) = 2?

Mon Nov 22 12:22:35 CET 2010

On 22.11.2010 11:47, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> 
>  > More interesting (and relevant to the subject) is chr: how did you arrive
>  > at C9 banning Python3's definition of chr? This chr function puts
>  > the code sequence into well-formed UTF-16; that's the whole point of
>  > UTF-16.
> 
> No, it doesn't, in the specific case of surrogate code points.  In
> 3.1.2 from MacPorts on an iBook G4 and from Gentoo on AMD64,
> chr(0xd800) returns "\ud800".

Ah, I see - this is *not* the subject's issue, right?

> 
> I don't know if that's by design (eg, so that it can be used in the
> implementation of the surrogateescape error handler) or a correctable
> oversight, but it's not conformant.

I disagree: Quoting from Unicode 5.0, section 5.4:

# The individual components of implementations may have different
# levels of support for surrogates, as long as those components are
# assembled and communicate correctly. Low-level string processing,
# where a Unicode string is not interpreted but is handled simply as an
# array of code units, may ignore surrogate pairs. With such strings,
# for example, a truncation operation with an arbitrary offset might
# break a surrogate pair. (For further discussion, see Section 2.7,
# Unicode Strings.) For performance in string operations, such behavior
# is reasonable at a low level, but it requires higher-level processes
# to ensure that offsets are on character boundaries so as to guarantee
# the integrity of surrogate pairs.
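
To make the quoted passage concrete, here is a minimal sketch in Python.
It models a string as a plain list of 16-bit code units rather than as a
Python str. The helper names (utf16_units, safe_truncate) are made up
for illustration and are not part of any library.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def safe_truncate(units, n):
    """Keep at most n code units (n >= 0) without splitting a surrogate pair."""
    if 0 < n < len(units) and is_high_surrogate(units[n - 1]) \
            and is_low_surrogate(units[n]):
        n -= 1  # back off so the pair stays together (and is dropped whole)
    return units[:n]

units = utf16_units("a\U0001F600")  # 'a' followed by one non-BMP code point

naive = units[:2]                   # low-level cut at an arbitrary offset
careful = safe_truncate(units, 2)   # higher-level cut on a character boundary
```

The naive slice ends in a lone high surrogate, which is the low-level
behaviour the standard tolerates; safe_truncate plays the role of the
higher-level process that keeps offsets on character boundaries.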

So lower-level routines (of which I claim chr() is one) are allowed
to create lone surrogates. The formal requirement behind this is C1:

# A process shall not interpret a high-surrogate code point or a
# low-surrogate code point as an abstract character.

I also claim that Python, in both narrow and wide mode, conforms to
this requirement. Notice that the requirement is a ban on interpreting
the code point as a character. In particular, unicodedata.category
reports the code point's general category as Cs (surrogate), which I
consider conforming.
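
A short sketch of what this looks like from the interpreter, using only
chr() and the standard unicodedata module. The variable names are
illustrative. The strict UTF-8 check is an extra observation not
mentioned above: it shows the lone surrogate being rejected once the
string leaves the low-level layer.

```python
import unicodedata

lone = chr(0xD800)                        # a lone high surrogate, one code point
category = unicodedata.category(lone)     # general category from the UCD
name = unicodedata.name(lone, None)       # surrogates have no character name

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```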

By the same line of reasoning, it is also OK that chr() allows the
creation of unassigned code points, even though C2 says that they
must not be interpreted as abstract characters.
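
The same can be sketched for unassigned code points. Which code points
are unassigned depends on the version of the Unicode database bundled
with the interpreter, so the sketch searches for one instead of
hard-coding it; the variable names are illustrative only.

```python
import unicodedata

# Find the first code point that the bundled UCD reports as Cn
# (unassigned or noncharacter). The result varies with the UCD version.
unassigned_cp = next(
    cp for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Cn"
)

ch = chr(unassigned_cp)               # chr() creates it without complaint
category = unicodedata.category(ch)   # ... and the UCD still says "Cn"
name = unicodedata.name(ch, None)     # no character name is attached
```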

The rationale for supporting these code points in chr() goes back much
further than the surrogateescape handler: as Python unicode strings
are sequences of code points, it would be impractical if you couldn't
create some of them, or had to consult the UCD first to determine
whether they can be created.
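
As a rough illustration of that design choice, the sketch below checks
that chr() accepts every code point in the Unicode code space and that
ord() gives it back, with no UCD lookup involved. It assumes Python 3.3
or later, where a str holds one element per code point; on the narrow
builds discussed in this thread, len(chr(i)) is 2 for i above 0xFFFF.

```python
# Every code point in U+0000..U+10FFFF can be created and read back.
round_trips = all(ord(chr(cp)) == cp for cp in range(0x110000))

# That includes the whole surrogate range U+D800..U+DFFF.
surrogates_ok = all(len(chr(cp)) == 1 for cp in range(0xD800, 0xE000))

# Only values outside the code space are rejected.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```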

Regards,
Martin