[Python-3000] PEP: Supporting Non-ASCII Identifiers

Tue May 1 12:52:02 CEST 2007

PEP: 31xx
Title: Supporting Non-ASCII Identifiers
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin at v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:

Abstract
========

This PEP suggests to support Non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not
familiar with the English language, or even well-acquainted with the
Latin writing system. Such developers often desire to define classes
and functions with names in their native languages, rather than
having to come up with an (often incorrect) English translation
of the concept they want to name.

For some languages, common transliteration systems exists (in
particular, for the Latin-bases writing systems). For other languages,
users have larger difficulties to use Latin to write their native
words.

Common Objections
=================

Some objections are often raised agains proposals similar to this one.

People claim that they will not be able to use a library if to do so
they have to use characters they cannot type on their
keyboards. However, it is the choice of the designer of the library to
decide on various constraints for using the library: people may not be
able to use the library because they cannot get physical access to the
source code (because it is not published), or because licensing
prohibits usage, or because the documentation is in a language they
cannot understand. A developer wishing to make a library widely
available needs to make a number of explicit choices (such as
publication, licensing, language of documentation, and language of
identifiers). It should always be the choice of the author to make
these decisions - not the choice of the language designers.

In particular, projects wishing to have wide usage probably might want
to establish a policy that all identifiers, comments, and
documentation is written in English (see the GNU coding style guide
for an example of such a policy). Restricting the language to
ASCII-only identifiers does not enforce comments and documentation to
be English, or the identifiers actually to be English words, so an
additional policy is necessary, anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the unicodedata module.

The identifier syntax is <ID_Start> <ID_Continue>\*.

ID_Start is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter
numbers (Nl), plus the underscore (XXX what are "stability extensions
listed in UAX 31).

ID_Continue is defined as all characters in ID_Start, plus nonspacing
marks (Mn), spacing combining marks (Mc), decimal number (Nd), and
connector punctuations (Pc).

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words whereever
feasible.

As an option, this specification can be applied to Python 2.x.  In
that case, ASCII-only identifiers would continue to be represented as
byte string objects in namespace dictionaries; identifiers with
non-ASCII characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of
   the source code, a forward scan is made to find the first ASCII
   non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
   string to NFC, and then verify that it follows the identifier
   syntax. No such callout is made for pure-ASCII identifiers, which
   continue to be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
   (such as pydoc) must be verified to continue to work when Unicode
   strings appear in __dict__ slots as keys.

References
==========

.. [1] http://www.unicode.org/reports/tr31/

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: