[Python-ideas] Support Unicode code point notation

Alexander Belopolsky alexander.belopolsky at gmail.com
Fri Aug 2 08:17:43 CEST 2013


On Thu, Aug 1, 2013 at 11:30 PM, Stephen J. Turnbull <stephen at xemacs.org>
wrote:
>
>
> -1.  The obvious way forward is \N{U+1FFFF}.  That *looks* like an
> algorithmically generated name, and (wow!) that's what it *is*.[1]


The only problem is that "U+1FFFF" is not a conforming character name
according to the Unicode Standard.

The standard is very explicit in its recommendation on how the names should
be generated:

"Use in APIs. APIs which return the value of a Unicode “character name” for
a given code point might vary somewhat in their behavior. An API which is
defined as strictly returning the value of the Unicode Name property (the
“na” attribute), should return a null string for any Unicode code point
other than graphic or format characters, as that is the actual value of the
property for such code points. On the other hand, an API which returns a
name for Unicode code points, but which is expected to provide useful,
unique labels for unassigned, reserved code points and other special code
point types, should return the value of the Unicode Name property for any
code point for which it is non-null, but should otherwise construct a code
point label to stand in for a character name."
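
As a rough illustration of the second kind of API described in that passage,
here is a minimal sketch built on the standard unicodedata module.  The
function names (is_noncharacter, name_or_label) are hypothetical, and the
mapping from general categories to label prefixes is my reading of the
standard's code point label rules, not something unicodedata provides.

```python
import unicodedata


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def name_or_label(cp):
    """Return the Unicode Name if non-null, else a constructed label.

    Hypothetical helper: the label prefixes below are assumed from the
    standard's description of code point labels.
    """
    ch = chr(cp)
    # With a default argument, unicodedata.name() returns it instead of
    # raising ValueError when the code point has no name.
    name = unicodedata.name(ch, "")
    if name:
        return name

    category = unicodedata.category(ch)
    if category == "Cc":
        kind = "control"
    elif category == "Co":
        kind = "private-use"
    elif category == "Cs":
        kind = "surrogate"
    elif is_noncharacter(cp):
        kind = "noncharacter"
    else:
        # Assumption: anything else without a name is unassigned (Cn).
        kind = "reserved"
    return "<%s-%04X>" % (kind, cp)


if __name__ == "__main__":
    for cp in (0x41, 0x09, 0xE000, 0x1FFFF):
        print("U+%04X" % cp, name_or_label(cp))
```

The label is only a stand-in for display; it is not a value of the Name
property, and unicodedata.lookup() will not accept it.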

The recommendation on what should be accepted as a valid name is more
relaxed: "... it can be more effective for a user interface to use names
that were translated or otherwise adjusted to meet the expectations of the
targeted user community. By also listing the formal character name, a user
interface could ensure that users can unambiguously refer to the character
by the name documented in the Unicode Standard."  This does not literally
preclude treating U+NNNN as a character name, but it looks like such use is
discouraged: "A constructed code point label is distinguished from the
designation of the code point itself (for example, “U+0009” or “U+FFFF”),
which is also a unique identifier."
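
To make the distinction concrete, here is a sketch of what accepting a code
point designation alongside real names could look like.  lookup_extended is
a hypothetical helper, not an existing or proposed Python API; it treats
"U+NNNN" as a designation handled separately and sends everything else to
unicodedata.lookup(), so the designation never has to pretend to be a name.

```python
import re
import unicodedata

# "U+" followed by four to six hexadecimal digits (assumed input syntax).
_DESIGNATION = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def lookup_extended(name):
    """Resolve either a code point designation or a character name.

    Hypothetical helper for illustration only.
    """
    match = _DESIGNATION.fullmatch(name)
    if match:
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF:
            # Outside the Unicode code space; mirror lookup()'s KeyError.
            raise KeyError(name)
        return chr(cp)
    # Formal names and aliases go through the normal lookup, which raises
    # KeyError for anything it does not know.
    return unicodedata.lookup(name)


if __name__ == "__main__":
    print(ascii(lookup_extended("U+1FFFF")))
    print(ascii(lookup_extended("LATIN SMALL LETTER A")))
```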

> [1]  Of course, it's also an invalid code point in any Unicode stream. ;-)

This is not accurate.  U+1FFFF is a valid code point and its generated
label is <noncharacter-1FFFF>.  Noncharacters "are forbidden for use in
open interchange of Unicode text data. ... Applications are free to use any
of these noncharacter code points internally but should never attempt to
exchange them." (See Section 16.7, Noncharacters.)

In Python, 0x1FFFF is a valid code point:

>>> chr(0x1FFFF)
'\U0001ffff'

An application written in Python can use strings containing '\U0001ffff'
internally, but should not interchange such strings with other applications.
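
One way an application might enforce that boundary is to check for
noncharacters just before text leaves the process.  This is only a sketch:
is_noncharacter, contains_noncharacter and export_text are hypothetical
names, and raising ValueError is an arbitrary policy choice.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def contains_noncharacter(text):
    """Report whether any code point in text is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in text)


def export_text(text):
    """Encode text for interchange, refusing noncharacters (assumed policy)."""
    if contains_noncharacter(text):
        raise ValueError("noncharacter in text intended for interchange")
    return text.encode("utf-8")


if __name__ == "__main__":
    internal = "abc" + chr(0x1FFFF)   # fine for internal use
    print(contains_noncharacter(internal))
    print(contains_noncharacter("abc"))
```

Note that the codec itself does not object: Python will encode '\U0001ffff'
to UTF-8 without complaint, so the check has to be the application's own.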