PEP 3131: Supporting Non-ASCII Identifiers

Wed May 16 04:08:36 EDT 2007

Martin v. Lowis wrote:
> Lorenzo Gatti wrote:
>> Not providing an explicit listing of allowed characters is inexcusable
>> sloppiness.

> That is a deliberate part of the specification. It is intentional that
> it does *not* specify a precise list, but instead defers that list
> to the version of the Unicode standard used (in the unicodedata
> module).

OK, maybe you considered listing the characters and earnestly decided
to follow an authority instead; but this reliance on the Unicode
standard is not a merit: it hands a foundation of Python syntax over to
an external entity (UAX 31 and the Unicode database).
The obvious purpose of Unicode Annex 31 is to define a framework for
parsing the identifiers of arbitrary programming languages; it is only,
in its own words, "specifications for recommended defaults for the use
of Unicode in the definitions of identifiers and in pattern-based
syntax". It suggests an orderly way to add tens of thousands of exotic
characters to programming language grammars, but it doesn't prove that
doing so would be wise.
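
To make the dependence on the Unicode database concrete, here is a
minimal sketch (not part of the PEP; the helper name is_assigned is
made up for illustration). It only shows that the answer to "is this
code point assigned?" comes from whatever database version the running
interpreter's unicodedata module bundles, so the set of usable
characters can differ between Python releases.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
# It changes between Python releases, and the character set changes with it.
print("Unicode database version:", unicodedata.unidata_version)


def is_assigned(ch):
    """Hypothetical helper: True if this interpreter's database gives
    ch a general category other than Cn (unassigned)."""
    return unicodedata.category(ch) != "Cn"


# The same call may give different answers under different database
# versions for code points that were assigned in between.
print(is_assigned("a"))
```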

You seem to like Unicode Annex 31, but keep in mind that:
- it draws on very limited resources (only the Unicode standard, i.e.
lists and properties of characters, with no input from sensible
programming language design, software design, etc.)
- it is culturally biased in favour of supporting as much of the
Unicode character set as possible, disregarding the practical
consequences and assuming without discussion that programming language
designers want to do so
- it is also culturally biased towards the typical Unicode patterns of
providing well explained general algorithms, ensuring forward
compatibility, and relying on existing Unicode standards (in this
case, character types) rather than introducing new data (but the
character list of Table 3 is unavoidable); the net result is caring
even less for actual usage.

>> The XML standard is an example of how listings of large parts of the
>> Unicode character set can be provided clearly, exactly and (almost)
>> concisely.

> And, indeed, this is now recognized as one of the bigger mistakes
> of the XML recommendation: they provide an explicit list, and fail
> to consider characters that are unassigned. In XML 1.1, they try
> to address this issue, by now allowing unassigned characters in
> XML names even though it's not certain yet what those characters
> mean (until they are assigned).

XML 1.1 is, for practical purposes, not used except by mistake. I
challenge you to show me XML languages or documents of some importance
that need XML 1.1 because they use non-ASCII names.
XML 1.1 is supported by many tools and standards because of buzzword
compliance, enthusiastic obedience to the W3C and low cost of
implementation, but this doesn't mean that its features are an
improvement over XML 1.0.

>>> ``ID_Continue`` is defined as all characters in ``ID_Start``, plus
>>> nonspacing marks (Mn), spacing combining marks (Mc), decimal number
>>> (Nd), and connector punctuations (Pc).
>>
>> Am I the first to notice how unsuitable these characters are?

> Probably. Nobody in the Unicode consortium noticed, but what
> do they know about suitability of Unicode characters...

Don't be silly. These characters are suitable for writing text, not
for use in identifiers; the fact that UAX 31 allows them merely proves
how disconnected from actual programming language needs that document
is.
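
For readers who want to see what the four classes in the quoted
definition contain, here is a small sketch using the standard
unicodedata module. The sample characters are picked for illustration
only, and the helper name continue_only_category is made up; it checks
general categories and does not implement ID_Start or ID_Continue.

```python
import unicodedata

# The four general categories that the quoted definition adds on top of
# ID_Start to form ID_Continue.
EXTRA_CONTINUE_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def continue_only_category(ch):
    """Hypothetical helper: return the general category of ch if it is
    one of the four added classes, otherwise None."""
    cat = unicodedata.category(ch)
    return cat if cat in EXTRA_CONTINUE_CATEGORIES else None


# One illustrative sample per class, plus a variation selector.
samples = ["\u0301", "\u0903", "\u0663", "\u203f", "\ufe00"]
for ch in samples:
    print("U+%04X %-28s %s" % (ord(ch), unicodedata.name(ch),
                               continue_only_category(ch)))
```

Several of these are invisible on their own or attach to the preceding
character, which is the property being objected to here.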

In typical word processing, which characters are used is the editor's
problem and the only thing that matters is the correctness of the
printed result; program code is much more demanding, as it needs to do
more (exact comparisons, easy reading...) with less (straightforward
keyboard input and monospaced fonts instead of complex input systems
and WYSIWYG graphical text). The only way to work with program text
successfully is to limit its complexity.
Hard-to-input characters, hard-to-see characters, ambiguities and
uncertainty in the sequence of characters, sets of hard-to-distinguish
glyphs and similar problems are unacceptable.
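
Two of these problems are easy to reproduce. The sketch below is only
an illustration with made-up sample strings and a hypothetical helper
name (same_after_nfc); it shows an ambiguous character sequence (one
visible word, two encodings) and a pair of letters from different
scripts that most fonts draw identically.

```python
import unicodedata

# Ambiguity in the sequence of characters: both strings normally
# display as the same word.
composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # plain 'e' followed by COMBINING ACUTE ACCENT


def same_after_nfc(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                # exact comparison
print(same_after_nfc(composed, decomposed))  # comparison after NFC

# Hard-to-distinguish glyphs: normalization does not help here, because
# these really are different letters.
latin_a = "a"
cyrillic_a = "\u0430"
print(unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
print(same_after_nfc(latin_a, cyrillic_a))
```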

It seems I'm not the first to notice that a lot of Unicode characters
are unsuitable for identifiers. Appendix I of the XML 1.1 standard
recommends avoiding variation selectors, interlinear annotations (I
missed them...), various decomposable characters, and "names which are
nonsensical, unpronounceable, hard to read, or easily confusable with
other names".
The whole of Appendix I is a clear admission of self-defeat, probably
the result of committee compromises. Do you think you could do better?

Regards,
Lorenzo Gatti