[Python-3000] Support for PEP 3131

Stephen J. Turnbull stephen at xemacs.org
Wed May 23 07:05:05 CEST 2007


Jim Jewett writes:

 > On 5/22/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
 > 
 > > That's why Java and C++ use \u, so you would write L\u00F6wis
 > > as an identifier. ...
 > > I think you are really arguing for \u escapes in identifiers here.
 > 
 > Yes, that is effectively what I was suggesting.
 > 
 > > *This* is truly unambiguous. I claim that it is also useless.
 > 
 > It means users could see the usability benefits of PEP3131, but the
 > python internals could still work with ASCII only.

But this reasoning is not coherent.  Python internals will have no
problems with non-ASCII; in fact, they would have no problems with
tokens containing Cf characters or even reserved code points.  Just
give an unambiguous grammar for tokens composed of code points.  It's
only when a human enters the loop (i.e., when an identifier is
presented on an output stream) that such identifiers cause problems.
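
To make that concrete, here is a minimal sketch, assuming an
interpreter that implements the PEP; the compiler and symbol table
need no special machinery for the non-ASCII identifier:

    # Sketch only: assumes PEP 3131 is accepted by the interpreter.
    source = "Löwis = 42\nprint(Löwis)"
    exec(compile(source, "<test>", "exec"))   # prints 42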

It's *users* who are at risk, not the Python translator, and if there
are any usability benefits to be taken advantage of by *presenting*
identifiers that don't stick to ASCII, the risks of confusing or
deliberately obfuscated code inhere in that very presentation, not in
the internals.  For example:

 > It simplifies checking for identifiers that *don't* stick to ASCII,

Only if you assume that people will actually perceive the 10-character
string "L\u00F6wis" as an identifier, regardless of the fact that any
programmable editor can be trained to display the 5-character string
"Löwis" in a very small amount of code.  Conversely, any programmable
editor can easily be trained to take the internal representation
"Löwis" and display it as "L\u00F6wis", giving all the benefits of the
representation you propose.  But who would ever enable it?  (I suppose
this is what Martin means by "useless".)
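
As a rough illustration, both transformations fit in a few lines.
The helper names below are invented for the example, and the escaper
only handles code points below U+10000 (\U escapes would need another
branch):

    import re

    def display(escaped):
        # Expand \uXXXX escapes into the characters they denote.
        return re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)), escaped)

    def escape(displayed):
        # The inverse: re-escape everything outside ASCII.
        return "".join(c if ord(c) < 128 else "\\u%04X" % ord(c)
                       for c in displayed)

    assert display(r"L\u00F6wis") == "Löwis"   # 10 chars in, 5 out
    assert escape("Löwis") == r"L\u00F6wis"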

 > which reduces some of the concerns about confusable characters, and
 > which ones to allow.

For the reasons given above, it reduces no concerns at all, except to
the extent that it makes use of human-readable identifiers as Python
identifiers inconvenient.

I conclude that IMO PEP 3131 is precisely correct in scope as far as
it goes.  The only issues PEP 3131 should be concerned with *defining*
are those that cause problems with canonicalization, and the range of
characters and languages allowed in the standard library.
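
(For reference, the canonicalization PEP 3131 specifies is conversion
to NFKC normal form while parsing, so visually identical but
differently encoded spellings collapse to one identifier.  A
two-assertion sketch:)

    import unicodedata

    composed   = "Löwis"          # precomposed U+00F6
    decomposed = "Lo\u0308wis"    # 'o' + U+0308 COMBINING DIAERESIS
    assert composed != decomposed
    assert unicodedata.normalize("NFKC", composed) == \
           unicodedata.normalize("NFKC", decomposed)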

I propose that it would be useful to provide a standard mechanism for
auditing the input stream.  There would be one implementation for the
stdlib that complains[1] about non-ASCII characters and possibly
non-English words, and IMO that should be the default (for the reasons
Ka-Ping gives for opposing the whole PEP).  A second one should
provide a very conservative Unicode set, with provision for amendment
as experience shows restriction to be desirable or extension to be
safe.  A third, allowing any character that can be canonicalized into
the form that PEP 3131 allows internally, is left as an exercise for
the reader wild 'n' crazy enough to want to use it.
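
A rough sketch of the first auditor, built on the tokenize module
(the function name and the reporting behaviour are invented for
illustration; note that because it works on tokens, comments and
strings pass unchecked for free):

    import io
    import tokenize

    def audit_ascii(source):
        # Hypothetical stdlib auditor: complain about identifiers
        # containing non-ASCII characters.
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and not tok.string.isascii():
                print("line %d: non-ASCII identifier %r"
                      % (tok.start[0], tok.string))

    audit_ascii("Löwis = 1  # a comment may say Löwis freely\n")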

For user convenience, it would be nice if these were implemented using
the codec interface, although if applied to raw input there would need
to be some duplication of parsing logic (specifically, comments and
strings would have to be passed unchecked).  I suppose it would be too
expensive to use the codec interface at the point of interning an
identifier (but maybe not, since it only needs to happen when adding
an identifier to the symbol table; later occurrences would be
short-circuited by probing the table and finding the token there).
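
A hedged sketch of the codec-based hook (the codec name "utf-8-audit"
is invented; applied to raw input it would, as noted, also flag
non-ASCII in comments and strings):

    import codecs
    import warnings

    def _audit_search(name):
        # Hypothetical codec: decode as UTF-8, but warn about any
        # non-ASCII characters in the decoded text.
        if name != "utf-8-audit":
            return None
        utf8 = codecs.lookup("utf-8")
        def decode(data, errors="strict"):
            text, consumed = utf8.decode(data, errors)
            if not text.isascii():
                warnings.warn("non-ASCII characters in input")
            return text, consumed
        return codecs.CodecInfo(utf8.encode, decode, name="utf-8-audit")

    codecs.register(_audit_search)
    b"L\xc3\xb6wis = 1".decode("utf-8-audit")   # warns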


Footnotes: 
[1]  I'm not sure what "complain" would mean in practice, since the
PEP acknowledges use cases for both non-ASCII and non-English in the
stdlib.


