[Python-Dev] len(chr(i)) = 2?

Mon Nov 22 06:28:13 CET 2010

"Martin v. Löwis" writes:

 > Chapter and verse?

Unicode 5.0, Chapter 3, verse C9:

    When a process generates a code unit sequence which purports to be
    in a Unicode character encoding form, it shall not emit ill-formed
    code unit sequences.

I think anything called "UTF-8 something" is likely to be taken to
"purport".  Furthermore, users don't necessarily see which error
handlers are being used.  A user who specifies "utf8" as the output
codec is likely to be rather surprised if non-UTF-8 is emitted because
the app specified surrogateescape.  E.g., consider a script which munges
file descriptions into reasonable-length file names on Unix.  Yes,
technically the non-Unicode output is the app's fault, but I expect
many users will put some blame on Python.
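
To make the surprise concrete, here is a minimal sketch, assuming
Python 3 and a made-up file name.  It shows an undecodable byte riding
through a str via surrogateescape and coming back out of a "utf-8"
encode call as something that is not UTF-8.  The helper
is_well_formed_utf8 is hypothetical, written only for this
illustration.

```python
# Sketch: how a "utf8" output path can emit non-UTF-8 bytes.
# The file name below is a made-up example.
raw = b"report-\xff.txt"  # 0xFF never occurs in well-formed UTF-8

# surrogateescape maps the undecodable byte to the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# The strict codec refuses to encode the lone surrogate.
try:
    text.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# With surrogateescape on output, the original byte 0xFF is emitted again.
leaked = text.encode("utf-8", errors="surrogateescape")


def is_well_formed_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes under strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(repr(text), strict_rejects, leaked, is_well_formed_utf8(leaked))
```

The user only ever asked for "utf-8"; the error handler that let the
byte through was chosen elsewhere in the app.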

I am in full agreement with you about the technicalities, but I am
looking for ways to clue in users that (a) the technicalities matter,
and (b) that Python does a *very* good job of making things as safe as
possible without becoming unable to handle bytes.  I think "wide"
vs. "narrow" fails at both.  It focuses on storage issues, which of
course are important, but at the cost of ignoring the fact that for
users of non-BMP characters 32-bit code units are much safer.  Users
who need non-BMP characters are relatively few, and at least at the
present time most are painfully aware of the need to care for
technicalities.  I expect them to be pleasantly surprised by how easy
it is to get reasonably safe behavior even from a 16-bit build.

 > > Python's internal coding does not conform to UTF-16, and that internal
 > > coding can, under certain conditions, escape to the outside world as
 > > invalid "Unicode" output.
 > 
 > I'm fairly certain there are provisions in the Unicode standard for such
 > behavior (taking into account "certain conditions").

Sure.  There's nothing in the Unicode standard that says you have to
conform to it unless you claim to conform to it.

So it is valid to say that Python's Unicode codecs without
surrogateescape do conform.  The point is that Python as a whole does
not, even if all of the input is valid Unicode, because it provides
surrogateescape and does no Unicode conformance-checking in certain
internal functionality like chr() and slicing.
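
A small sketch of the chr() case, assuming Python 3.  No
surrogateescape is involved: the lone surrogate is created without
complaint, and the problem only shows up when the string reaches a
strict codec.

```python
# chr() accepts a surrogate code point without any conformance check.
lone = chr(0xD800)  # a high surrogate with no partner; no error here
s = "abc" + lone    # concatenation is not checked either

# The error appears only at the output stage, far from the chr() call.
try:
    s.encode("utf-8")
    failed_at_output = False
except UnicodeEncodeError:
    failed_at_output = True

print(len(lone), failed_at_output)
```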

You can say "we don't make any such claim", but IMO the distinction in
question is too fine a point for most users, and requires a very large
amount of Unicode knowledge (not to mention standards geekiness) to
even understand the precise statement.

"Unicode support" to users should mean that Python does the right
thing, not that if you look hard enough in the documentation you will
discover that Python doesn't claim to do the right thing even though
in practice it mostly does.  IMO, "UCS-2" is a pretty good description
of what the user can leave up to Python in perfect safety.  RDM's
reply worries me a little, but I'll reply to his message separately.

 > *Any* Unicode implementation will do that, since they all have to
 > support legacy encodings in some form. This is certainly conforming to
 > the Unicode standard, and in fact one of the primary Unicode design
 > principles.

No.  Support for legacy encodings takes you outside of the realm of
Unicode conformance by definition.  Their names tell you that,
however.  "UTF-8 with surrogate escapes" on the other hand is an
entirely different kettle of fish.  It pretends to be UTF-8, but
isn't.  I think that users who give Python valid input should be able
to expect valid output, but they can't.

Chapter 3, verse C7:

    When a process purports not to modify the interpretation of a
    valid coded character sequence, it shall make no change to that
    coded character sequence other than the possible replacement of
    character sequences by their canonical-equivalent sequences, or
    the deletion of *noncharacter* code points.

Sure, you can tell users the truth: "Python may modify your Unicode
characters if you slice or index Unicode strings.  It may even
silently turn them into invalid codes which will eventually raise
Errors."  Then you are conformant, but why would anyone want to use
such a program?
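
The slicing hazard can be sketched as follows.  This is an emulation,
not a narrow build: CPython 3.3 and later no longer have the
narrow/wide distinction, so the code works on explicit lists of 16-bit
code units obtained from the utf-16-le codec.  The helper names
to_utf16_units and from_utf16_units are hypothetical.

```python
def to_utf16_units(s: str) -> list:
    """Hypothetical helper: the UTF-16 code units of s, as integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def from_utf16_units(units) -> str:
    """Hypothetical helper: strictly decode a list of 16-bit code units."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le")


s = "a\U00010000b"          # one non-BMP character between two ASCII ones
units = to_utf16_units(s)   # four code units: 'a', high, low, 'b'

# Slicing by code unit, as a 16-bit build does, can cut a surrogate pair.
head = units[:2]            # 'a' plus the high surrogate only

try:
    from_utf16_units(head)
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print([hex(u) for u in units], slice_is_valid)
```

The input was valid, the operation was an ordinary slice, and the
result is a code unit sequence that a strict UTF-16 decoder rejects.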

If you tell them "UCS-2[sic] Python is safe to use with *no* extra
care if you use only UCS-2 [or BMP] characters", suddenly Python looks
very nice indeed again.  "UCS-4" Python is even better; all you have
to do is to avoid surrogateescape codecs.  However, you're still
vulnerable to hard-to-diagnose errors at the output stage in case of
program bugs, because not enough checking of values is done by Python
itself.
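
One way an application might do that checking itself is sketched
below.  The functions find_surrogates and checked_encode are
hypothetical, not anything Python provides; the idea is simply to
report stray surrogates by index before the string reaches a codec.

```python
def find_surrogates(s: str) -> list:
    """Hypothetical helper: indices of surrogate code points in s."""
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]


def checked_encode(s: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: encode s, naming any surrogates it contains."""
    bad = find_surrogates(s)
    if bad:
        raise ValueError(f"surrogate code points at indices {bad}")
    return s.encode(encoding)


print(find_surrogates("plain"), find_surrogates("a\udcffb"))
```

This does not make the output stage any stricter than the strict codec
already is; it only makes the diagnosis point at the offending
positions instead of at the codec.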

 > > A Unicode-conforming Python implementation would error at the
 > > chr() call, or perhaps would not provide surrogateescape error
 > > handlers.
 > 
 > Chapter and verse?

Chapter 3, verse C9 again.

 > > "Although practicality beats purity."
 > 
 > The Unicode standard itself is based on practicality. It wouldn't
 > have received the success it did if it was based on purity only
 > (and indeed, was often rejected in cases where it put purity over
 > practicality, e.g. with the Hangul syllables).

Python practicality is very different from Unicode practicality.