[Python-Dev] len(chr(i)) = 2?

Tue Nov 23 17:16:55 CET 2010

"Martin v. Löwis" writes:

 > I disagree: Quoting from Unicode 5.0, section 5.4:
 > 
 > # The individual components of implementations may have different
 > # levels of support for surrogates, as long as those components are
 > # assembled and communicate correctly.

"Assembly" is the problem.  If chr() or a slice creates a lone
surrogate and surrogateescape passes it back out, Python as a whole is
non-conforming.
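
To make that concrete, here is a minimal sketch, assuming a current
CPython 3 with the standard "utf-8" codec and the "surrogateescape"
error handler.  The variable names are mine, purely for illustration.
It only shows that a lone surrogate made by chr() can leave the process
as bytes that are not well-formed UTF-8; it is not a statement about
what any particular application does.

```python
# A lone surrogate is a legal Python str of length 1.
lone = chr(0xDC80)

# The strict UTF-8 codec refuses to encode it ...
try:
    lone.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# ... but surrogateescape maps U+DC80..U+DCFF back to the raw bytes
# 0x80..0xFF, so bytes that are not well-formed UTF-8 get out.
escaped = lone.encode("utf-8", "surrogateescape")

# Feeding those bytes to a strict UTF-8 decoder fails.
try:
    escaped.decode("utf-8")
    round_trip_error = None
except UnicodeDecodeError as exc:
    round_trip_error = exc
```

Note that surrogateescape only covers U+DC80..U+DCFF; other lone
surrogates still raise on encoding.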

Technically, you can hide behind "none of slicing, chr(), or
surrogateescape promises to conform", and maybe that would fly to a
standards lawyer; I'd have to see the precise statement.

Here's a more convincing example.  A user specifies "utf8" as her
locale charset.  Then she specifies a string containing a non-BMP
character as the "description" of a file, and internal code munges
this via slicing into a file name conforming to some specification
(e.g., length limit + uniquifier if needed).  Then if the non-BMP
character is in the "right" place, the slice splits its surrogate
pair and she gets a broken file name, which will either be written
to disk or raise an exception, depending on whether the munging
program has enabled surrogateescape.
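
A rough sketch of the truncation step, under stated assumptions: the
description string, the 8-unit length limit, and the helper name
utf16_units() are all hypothetical.  On builds where str indexes code
points rather than UTF-16 code units, a plain slice cannot split a
pair, so the sketch simulates code-unit slicing by hand.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# Hypothetical description; U+1F600 needs two UTF-16 code units.
description = "report-\U0001F600-final"
units = utf16_units(description)

# Hypothetical length limit of 8 code units: the cut lands between
# the high and low surrogate of U+1F600.
limit = 8
truncated = "".join(chr(u) for u in units[:limit])

# The result ends in a lone high surrogate, so the strict codec
# rejects it as a file name.
try:
    truncated.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
```

With a limit of 7 or 9 the cut misses the pair and nothing goes wrong,
which is why the character has to be in the "right" place.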

I claim both of those results are non-conforming to the specification
of UTF-16, and therefore Python Unicode processing as a whole must be
considered non-conforming.

It's still pretty damn good.  But I've elaborated that point
elsewhere.

 > The rationale for supporting these characters in chr() goes back much
 > further than the surrogateescape handler - as Python unicode strings
 > are sequences of code points, it would be impractical if you couldn't
 > create some of them, or even would have to consult the UCD before
 > determining whether they can be created.

The Zen is irrelevant to determining conformance to Unicode, which has
its own Zen.