[Python-Dev] PEP 393 Summer of Code Project

Thu Aug 25 11:39:46 CEST 2011

"Martin v. Löwis" writes:

 > No, that's explicitly *not* what C6 says. Instead, it says that a
 > process that treats s1 and s2 differently shall not assume that others
 > will do the same, i.e. that it is ok to treat them the same even though
 > they have different code points. Treating them differently is also
 > conforming.

Then what requirement does C6 impose, in your opinion?  It sounds like
you don't think it imposes any, in practice.

Note that in the discussion of C6, the standard says,

- Ideally, an implementation would *always* interpret two
  canonical-equivalent sequences *identically*.  There are practical
  circumstances under which implementations may reasonably distinguish
  them.  (Emphasis mine.)

The examples given are things like "inspecting memory representation
structure" (which properly speaking is really outside of Unicode
conformance) and "ignoring collation behavior of combining sequences
outside the repertoire of a specified language."  That sounds like
"Special cases aren't special enough to break the rules. Although
practicality beats purity." to me.  Treating them differently is an
exceptional case that requires sufficient justification.

My understanding is that if those strings are exchanged with
another process, then whether or not treating them differently is
allowed depends on whether the results will be output to another
process, and what the definition of our process is.  Sometimes it will
be allowed, but mostly it won't.  Take file names as an example.

If our process is working with an external process (the OS's file
system driver) whose definition includes the statement that "File
names are sequences of Unicode characters", then C6 says our process
must compare canonically equivalent sequences that it takes to be file
names as the same, whether or not they are in the same normalized
form, or normalized at all, because we can't assume the file system
will treat them as different.  If we do treat them as different, our
users will get very upset (e.g., if we don't signal a duplicate file
name input by the user, and then the OS proceeds to overwrite an
existing file).
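
To make the duplicate-name check concrete, here is a minimal sketch of
what "compare canonically equivalent sequences as the same" could look
like on our side.  The helper names (same_file_name, find_duplicate)
are hypothetical, and normalizing to NFC is just one way to test
canonical equivalence; whether a real file system folds such names
together varies, which is rather the point.

```python
import unicodedata


def same_file_name(a, b):
    """Return True if a and b are canonically equivalent.

    Hypothetical helper: both names are normalized to NFC before the
    code point comparison, so composed and decomposed spellings match.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find_duplicate(new_name, existing_names):
    """Return the existing name that new_name collides with, or None."""
    key = unicodedata.normalize("NFC", new_name)
    for name in existing_names:
        if unicodedata.normalize("NFC", name) == key:
            return name
    return None


# Two spellings that display the same on a monitor.
composed = "caf\u00e9.txt"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301.txt"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

if find_duplicate(decomposed, [composed]) is not None:
    print("duplicate file name; refusing to overwrite")
```

A plain == on the two names would report no collision here, which is
exactly the case where the user gets upset.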

Dually, having made the statement that file names are Unicode, C6 says
that the OS driver must return the same file given two canonically
equivalent strings that happen to have different code points in them,
because it may not assume that *we* will treat those strings as
different names of different files.

*Users* will certainly take the viewpoint that two strings that
display the same on their monitor should identify the same file when
they use them as file names.

Now, I'm *not* saying that Python's strings *should* conform to the
Unicode standard in this respect yet (or ever, for that matter; I'm
with Guido on that).  I'm simply saying that the current
implementation of strings, as improved by PEP 393, cannot be said to
be conforming.
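
For anyone who wants to see the behavior I mean, a small illustration
follows.  It only shows that str comparison and len() work on code
points; canonically_equivalent is a hypothetical helper built on the
stdlib unicodedata module, not something str does for you.

```python
import unicodedata

s1 = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
s2 = "A\u030a"    # 'A' followed by COMBINING RING ABOVE
s3 = "\u212b"     # ANGSTROM SIGN, canonically equivalent to the two above

# str compares code point by code point, so these differ.
code_point_equal = (s1 == s2)

# len() counts code points, not user-perceived characters.
lengths = (len(s1), len(s2), len(s3))


def canonically_equivalent(a, b):
    """Hypothetical helper: equal after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(code_point_equal, lengths)
print(canonically_equivalent(s1, s2), canonically_equivalent(s1, s3))
```

All three strings render as the same character, yet the built-in
operations distinguish them unless the application normalizes first.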

I would like to see something much more conformant done as a separate
library (the Python Components for Unicode, say), intended to support
users who need character-based behavior, Unicode-ly correct collation,
etc., more than efficiency.  Applications that need both will have to
make their own way at first, either by contributing improvements to
the library or by using application-specific algorithms.