[Python-3000] PEP: Supporting Non-ASCII Identifiers

Rauli Ruohonen rauli.ruohonen at gmail.com
Mon Jun 4 07:52:14 CEST 2007


On 6/4/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> No, it can't.  One might want to write Python code that implements
> normalization algorithms, for example, and there will be "binary
> strings".  Only in the context of Unicode text are you allowed to
> do those things.

But Python files are text and should be readable to humans. Differences
in code that are significant but invisible aren't good practice -
I think that was well established in the PEP 3131 discussion :-)
Is there some reason normalization algorithm implementations can't
use escapes (which are ASCII and thus not normalized) for non-NFC
strings? Note that editors are allowed to normalize as they will
(though the ones I use don't). From the Unicode standard, chapter 3:

:C9 A process shall not assume that the interpretations of two
:   canonical-equivalent character sequences are distinct.
:
: - The implications of this conformance clause are twofold. First,
:   a process is never required to give different interpretations
:   to two different, but canonical-equivalent character sequences.
:   Second, no process can assume that another process will make
:   a distinction between two different, but canonical-equivalent
:   character sequences.

As other programs processing Python source code files cannot be
assumed to distinguish between normalization forms, depending on
them to do so (in normalization algorithm source code or elsewhere)
is a bit disquieting.
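
The escape idea can be sketched as follows. This is only an illustration:
editor_roundtrip is a hypothetical stand-in for an editor that saves files
as NFC, and ast.literal_eval stands in for the compiler reading a string
literal.

```python
import ast
import unicodedata

# Two ways to write the same decomposed (NFD) string literal in source code.
literal_source = "'cafe\u0301'"    # source text contains a raw combining accent
escaped_source = r"'cafe\u0301'"   # source text is pure ASCII: \ u 0 3 0 1


def editor_roundtrip(source):
    """Hypothetical editor that normalizes a file to NFC when saving it."""
    return unicodedata.normalize("NFC", source)


literal_after = ast.literal_eval(editor_roundtrip(literal_source))
escaped_after = ast.literal_eval(editor_roundtrip(escaped_source))

print(len(literal_after))  # 4: the raw literal was silently composed
print(len(escaped_after))  # 5: the escaped literal kept its decomposed form
```

The escaped spelling is ASCII, so no normalizing tool has anything to
change in it.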

> It seems to me that once we have a proper separation between bytes
> objects and unicode objects, that the latter should always be
> compared internally to the dictionary using the kinds of techniques
> described in UTS#10 and UTR#30.

This sounds good if it's feasible performance-wise.

> External normalization is not the right way to handle this issue.

It depends on what problem you're solving. What I'm concerned about
most is that there may be rare (because NFC is so ubiquitous) but
annoying heisenbugs whose immediate cause is an invisible difference
in the source code. Such a class of problems shouldn't exist without
a good reason, and the reason "someone might want to write code that
depends on invisible easter eggs in the source code" doesn't sound
like a good reason to me.
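
A minimal sketch of the kind of bug I mean, with escapes used so that the
difference is visible here. The dictionary name is made up for the example.

```python
import unicodedata

# The same visible word in two canonically equivalent spellings.
composed = "caf\u00e9"      # NFC: e-acute as a single code point
decomposed = "cafe\u0301"   # NFD: 'e' followed by a combining acute accent

prices = {composed: 3}      # hypothetical lookup table

print(composed == decomposed)   # False
print(decomposed in prices)     # False, although both render identically


def nfc(text):
    """Normalize to NFC before comparing or looking up."""
    return unicodedata.normalize("NFC", text)


print(nfc(decomposed) in prices)  # True
```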

Collation also doesn't solve all of the problem for naive users.
E.g. is len('ばしょ') 3 or 4? It depends on the normalization.
Whether each index in it is a hiragana character or not also
depends on the normalization. Same for e.g. 'café'.
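
To make that concrete, here is a small sketch using the standard
unicodedata module, with the word written as escapes so that its
starting form is unambiguous:

```python
import unicodedata

word = "\u3070\u3057\u3087"  # the three hiragana above, in composed form

forms = {}
for form in ("NFC", "NFD"):
    s = unicodedata.normalize(form, word)
    forms[form] = s
    # Print the length and the Unicode name of each code point.
    print(form, len(s), [unicodedata.name(ch) for ch in s])

# In NFD the first syllable becomes a base letter plus a combining mark,
# so index 1 is no longer a hiragana letter.
print(unicodedata.category(forms["NFD"][1]))  # Mn (nonspacing mark)
```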

>  > But a partial solution is better than no solution.
>
> Not if it leads to unexpected failures that are hard to diagnose,
> especially in the face of human belief that this problem has been
> "solved".

Sure, the concatenation of two normalized strings is not necessarily
a normalized string because you can have a string with a
combining character at the beginning, but people who deal with such
things know (or at least really, really, should!) how to fend for
themselves. There's nothing you can do to help them either, except
education.
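
For the record, the concatenation case looks like this. The helper is_nfc
is written out by hand for the sketch:

```python
import unicodedata


def is_nfc(text):
    """True if normalizing to NFC leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


left = "e"
right = "\u0301"        # a lone combining acute accent
joined = left + right

print(is_nfc(left), is_nfc(right))  # True True
print(is_nfc(joined))               # False: the pair composes to U+00E9
```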

There's value in keeping simple things simple and ensuring nothing
unexpected happens with simple things. In a large class of use
cases you really don't need to care that it's a complex world.
This is the case with many legacy encodings (such as Latin-1), and
the users of those will surely be surprised if switching to UTF-8
causes single characters to sometimes be split into multiple parts
depending on the phase of the Moon.
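
A rough sketch of the difference, assuming the text arrives as bytes and
is decoded with the standard codecs:

```python
import unicodedata

# Every possible Latin-1 byte, decoded to text.
latin1_text = bytes(range(256)).decode("latin-1")

# One byte always becomes exactly one code point...
print(len(latin1_text))  # 256

# ...and the result is already in NFC.
print(unicodedata.normalize("NFC", latin1_text) == latin1_text)  # True

# UTF-8 input carries no such guarantee: both spellings decode fine.
composed = b"caf\xc3\xa9".decode("utf-8")      # ends in U+00E9
decomposed = b"cafe\xcc\x81".decode("utf-8")   # ends in 'e' + U+0301
print(len(composed), len(decomposed))  # 4 5
```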

> If I start up an interpreter and type
>
> >>> a = """^V^M^V^J"""
> >>> repr(a)
> "'\\r\\n'"

What the interpreter prompt does is less of an issue, as the
code is not long-lived and the programmer is there all the time
observing what the code does.

Anyway, the deadline for PEPs for py3k has passed and there's no
PEP this one would fit in, so I guess this wart will have to stay.
It's not a pressing issue, as everyone who's sane uses NFC
anyway, and if someone edits your code with an NFD-normalizing editor
you can just beat them over the head with a stick and force them to
use vim as a penance :-)

