[Python-3000] PEP: Supporting Non-ASCII Identifiers

Stephen J. Turnbull stephen at xemacs.org
Mon Jun 4 03:53:21 CEST 2007


Rauli Ruohonen writes:

 > On 6/3/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
 > > Sure - but how can Python tell whether a non-normalized string was
 > > intentionally put into the source, or as a side effect of the editor
 > > modifying it?
 > 
 > It can't, but does it really need to? It could always assume the latter.

No, it can't.  One might want to write Python code that implements
normalization algorithms, for example, and there will be "binary
strings" whose contents must not be altered.  Normalization is only
allowed in the context of Unicode text.

This would require Python to internally distinguish between Unicode
text files and other files.

[example of a dictionary application using Unicode strings]

 > Now if these are written by two different people using different
 > editors, one might be normalized in a different way than the other,
 > and the code would look all right but mysteriously fail to work.

It seems to me that once we have a proper separation between bytes
objects and unicode objects, the latter should always be compared,
internally to the dictionary, using the kinds of techniques described
in UTS #10 and UTR #30.  External normalization is not the right way
to handle this issue.
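
As a rough illustration of comparing inside the dictionary rather than
normalizing the source, here is a minimal Python 3 sketch.  The names
canonical and CanonicalDict are hypothetical, and NFC normalization of
keys is only a stand-in for the UTS #10 / UTR #30 techniques, not an
implementation of them.

```python
import unicodedata


def canonical(key):
    """Hypothetical helper: NFC-normalize str keys, pass others through."""
    if isinstance(key, str):
        return unicodedata.normalize("NFC", key)
    return key


class CanonicalDict(dict):
    """Hypothetical dict that treats canonically equivalent str keys as equal.

    Only item assignment, item lookup and membership tests are covered;
    the constructor and update() bypass __setitem__ in this sketch.
    """

    def __setitem__(self, key, value):
        super().__setitem__(canonical(key), value)

    def __getitem__(self, key):
        return super().__getitem__(canonical(key))

    def __contains__(self, key):
        return super().__contains__(canonical(key))


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

plain = {composed: 1}
print(decomposed in plain)   # False: looks the same, compares unequal

smart = CanonicalDict()
smart[composed] = 1
print(decomposed in smart)   # True: equivalence is decided at lookup time
```

The string literals themselves are left exactly as written; only the
lookup treats the two spellings as the same key.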

 > But a partial solution is better than no solution.

Not if it leads to unexpected failures that are hard to diagnose,
especially in the face of human belief that this problem has been
"solved".

 > The line ending there is '\r\n', and Python normalizes it when
 > reading in the source code, even though '\r\n' matters even less
 > than doing NFC normalization.

That's not a Python language normalization; that's an artifact of the
line-reading function.  It's deliberate, of course, but it's not
really a character-level transformation; it's a line-level one.  If I
start up an interpreter and enter the control characters directly,
they are preserved:

>>> a = """^V^M^V^J"""
>>> repr(a)
"'\\r\\n'"

(That's on my Mac; on other systems the quoting character for key
entry of control characters is probably different.)
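
The same distinction can be sketched without a terminal.  This assumes
Python 3, where compile() accepts source text with Windows line
endings; the helper name run is hypothetical.  A literal CR LF inside
the source is translated as the source is read, while the escape
sequences \r\n in a string literal are left alone.

```python
# Source text with a literal CR LF inside a triple-quoted string.
literal_src = 's = """a\r\nb"""'

# Source text that spells the same characters as escape sequences.
escaped_src = 's = "a\\r\\nb"'


def run(src):
    """Compile and execute src, returning the value it binds to s."""
    namespace = {}
    exec(compile(src, "<example>", "exec"), namespace)
    return namespace["s"]


literal_value = run(literal_src)
escaped_value = run(escaped_src)

print(repr(literal_value))   # 'a\nb': the source reader translated CR LF
print(repr(escaped_value))   # 'a\r\nb': the escapes are untouched
```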


