PEP 3131: Supporting Non-ASCII Identifiers

Marco Colombo marco.colombo at gmail.com
Mon May 14 07:42:21 EDT 2007


I suggest we keep focused on the main issue here, which is "shoud non-
ascii identifiers be allowed, given that we already allow non-ascii
strings literals and comments?"

Most arguments against this proposal really fall into the category
"ascii-only source files". If you want to promote code-sharing, then
you should enfore quite restrictive policies:
- 7-bit only source files, so that everyone is able to correctly
display and _print_ them (somehow I feel that printing foreign glyphs
can be harder than displaying them) ;
- English-only, readable comments _and_ identifiers (if you think of
it, it's really the same issue, readability... I know no Coding Style
that requires good commenting but allows meaningless identifiers).

Now, why in the first place one should be allowed to violate those
policies? One reason is freedom. Let me write my code the way I like
it, and don't force me writing it the way you like it (unless it's
supposed to be part of _your_ project, then have me follow _your_
style).

Another reason is that readability is quite a relative term...
comments that won't make any sense in a real world program, may be
appropriate in a 'getting started with' guide example:

# this is another way to increment variable 'a'
a += 1

we know a comment like that is totally useless (and thus harmful) to
any programmer (makes me think "thanks, but i knew that already"), but
it's perfectly appropriate if you're introducing that += operator for
the first time to a newbie.

You could even say that most string literals are best made English-
only:

print "Ciao Mondo!"

it's better written:

print _("Hello World!")

or with any other mean to allow the i18n of the output. The Italian
version should be implemented with a .po file or whatever.

Yet, we support non-ascii encodings for source files. That's in order
to give authors more freedom. And freedom comes at a price, of course,
as non-ascii string literals, comments and identifiers are all harmful
to some extents and in some contexts.

What I fail to see is a context in which it makes sense to allow non-
ascii literals and non-ascii comments but _not_ non-ascii identifiers.
Or a context in which it makes sense to rule out non-ascii identifiers
but not string literals and comments. E.g. would you accept a patch
with comments you don't understand (or even that you are not able to
display correctly)? How can you make sure the patch is correct, if you
can't read and understand the string literals it adds?

My point being that most public open source projects already have
plenty of good reasons to enforce an English-only, ascii-only policy
on source files. I don't think that allowing non-ascii indentifiers at
language level would hinder thier ability to enforce such a policy
more than allowing non-ascii comments or literals did.

OTOH, I won't be able to contribute much to a project that already
uses, say, Chinese for comments and strings. Even if I manage to
display the source code correctly here, still I won't understand much
of it. So I'm not losing much by allowing them to use Chinese for
identifiers too.
And whether it was a mistake on their part not to choose an "English
only, ascii only" policy it's their call, not ours, IMHO.

.TM.




More information about the Python-list mailing list