PEP 3131: Supporting Non-ASCII Identifiers

Sun May 13 18:03:28 EDT 2007

On May 13, 11:44 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> python-3... at python.org
>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").
>
> I believe this PEP differs from other Py3k PEPs in that it really
> requires feedback from people with different cultural background
> to evaluate it fully - most other PEPs are culture-neutral.
>
> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported? why?
> - would you use them if it was possible to do so? in what cases?
>
> Regards,
> Martin
>
> PEP: 3131
> Title: Supporting Non-ASCII Identifiers
> Version: $Revision: 55059 $
> Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
> Author: Martin v. Löwis <mar... at v.loewis.de>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 1-May-2007
> Python-Version: 3.0
> Post-History:
>
> Abstract
> ========
>
> This PEP proposes supporting non-ASCII letters (such as accented
> characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.
>
> Rationale
> =========
>
> Python code is written by many people in the world who are not familiar
> with the English language, or even well-acquainted with the Latin
> writing system.  Such developers often desire to define classes and
> functions with names in their native languages, rather than having to
> come up with an (often incorrect) English translation of the concept
> they want to name.
>
> For some languages, common transliteration systems exist (in particular,
> for the Latin-based writing systems).  For other languages, users have
> greater difficulty using Latin to write their native words.
>
> Common Objections
> =================
>
> Some objections are often raised against proposals similar to this one.
>
> People claim that they will not be able to use a library if to do so
> they have to use characters they cannot type on their keyboards.
> However, it is the choice of the designer of the library to decide on
> various constraints for using the library: people may not be able to use
> the library because they cannot get physical access to the source code
> (because it is not published), or because licensing prohibits usage, or
> because the documentation is in a language they cannot understand.  A
> developer wishing to make a library widely available needs to make a
> number of explicit choices (such as publication, licensing, language
> of documentation, and language of identifiers).  It should always be the
> choice of the author to make these decisions - not the choice of the
> language designers.
>
> In particular, projects wishing to have wide usage might want to
> establish a policy that all identifiers, comments, and documentation
> are written in English (see the GNU coding style guide for an example of
> such a policy). Restricting the language to ASCII-only identifiers does
> not force comments and documentation to be in English, or the identifiers
> actually to be English words, so an additional policy is necessary
> anyway.
>
> Specification of Language Changes
> =================================
>
> The syntax of identifiers in Python will be based on the Unicode
> standard annex UAX-31 [1]_, with elaboration and changes as defined
> below.
>
> Within the ASCII range (U+0001..U+007F), the valid characters for
> identifiers are the same as in Python 2.5.  This specification only
> introduces additional characters from outside the ASCII range.  For
> other characters, the classification uses the version of the Unicode
> Character Database as included in the ``unicodedata`` module.
>
> The identifier syntax is ``<ID_Start> <ID_Continue>*``.
>
> ``ID_Start`` is defined as all characters having one of the general
> categories uppercase letters (Lu), lowercase letters (Ll), titlecase
> letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
> (Nl), plus the underscore (XXX what are "stability extensions" listed in
> UAX 31).
>
> ``ID_Continue`` is defined as all characters in ``ID_Start``, plus
> nonspacing marks (Mn), spacing combining marks (Mc), decimal number
> (Nd), and connector punctuations (Pc).
>
> All identifiers are converted into the normal form NFC while parsing;
> comparison of identifiers is based on NFC.
>
> Policy Specification
> ====================
>
> As an addition to the Python Coding style, the following policy is
> prescribed: All identifiers in the Python standard library MUST use
> ASCII-only identifiers, and SHOULD use English words wherever feasible.
>
> As an option, this specification can be applied to Python 2.x.  In that
> case, ASCII-only identifiers would continue to be represented as byte
> string objects in namespace dictionaries; identifiers with non-ASCII
> characters would be represented as Unicode strings.
>
> Implementation
> ==============
>
> The following changes will need to be made to the parser:
>
> 1. If a non-ASCII character is found in the UTF-8 representation of the
>    source code, a forward scan is made to find the first ASCII
>    non-identifier character (e.g. a space or punctuation character)
>
> 2. The entire UTF-8 string is passed to a function to normalize the
>    string to NFC, and then verify that it follows the identifier syntax.
>    No such callout is made for pure-ASCII identifiers, which continue to
>    be parsed the way they are today.
>
> 3. If this specification is implemented for 2.x, reflective libraries
>    (such as pydoc) must be verified to continue to work when Unicode
>    strings appear in ``__dict__`` slots as keys.
>
> References
> ==========
>
> .. [1] http://www.unicode.org/reports/tr31/
>
> Copyright
> =========
>
> This document has been placed in the public domain.
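
For anyone who wants to experiment with the identifier rule quoted above,
here is a rough sketch of it in Python. It is only an approximation under
stated assumptions: is_draft_identifier is a made-up helper name, the
classification uses whatever Unicode database the running interpreter's
unicodedata module ships, and it ignores keywords and the "stability
extensions" that the draft leaves open.

```python
import unicodedata

# General categories the draft lists for ID_Start (the underscore is added separately).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional categories the draft allows in ID_Continue.
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_draft_identifier(name):
    """Hypothetical helper: apply <ID_Start> <ID_Continue>* after NFC normalization."""
    name = unicodedata.normalize("NFC", name)
    if not name:
        return False
    return is_id_start(name[0]) and all(is_id_continue(ch) for ch in name[1:])


for candidate in ["Löffelstiel", "changé", "ошибка", "売り場", "2x", "a-b"]:
    print(candidate, is_draft_identifier(candidate))
```

The four example names from the summary pass, while a leading digit or a
hyphen is rejected, just as in the ASCII-only case.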

I don't think that supporting non-ASCII characters in identifiers
would cause any problems. Most people won't use them anyway. People who
use non-English identifiers in their project and hope for it to be
popular worldwide will probably just fail because of their foolish
coding style policy. I put that kind of choice in the same
ballpark as deciding to use Hungarian notation for Python code.

As for malicious patch submissions, I think this is a non-issue.
A tool that detects any identifier containing non-ASCII characters in a
file should be a trivial script to write.
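
Something along these lines would probably do. This is a minimal sketch,
assuming a Python 3 interpreter whose tokenize module already accepts
non-ASCII names; non_ascii_identifiers is just a name I picked for
illustration.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line, column, name) for each identifier with a non-ASCII character."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; comments and string
        # literals get their own token types and are therefore skipped.
        if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits


sample = "ошибка = 1\ncount = ошибка + 1\n"
for line, col, name in non_ascii_identifiers(sample):
    print(f"line {line}, column {col}: {name}")
```

A project with an ASCII-only policy could run a check like this over every
submitted patch and reject anything it reports.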

I say that if there is a demand for it, let's do it.