[Python-ideas] Support Unicode code point notation

Stephen J. Turnbull stephen at xemacs.org
Fri Aug 2 05:30:29 CEST 2013


Alexander Belopolsky writes:
 > On Thu, Aug 1, 2013 at 9:46 PM, MRAB <python at mrabarnett.plus.com> wrote:

 >> We could follow Perl or Ruby, or both of them, or even allow
 >> braces with any of the hex escapes.

 > That choice is unfortunately precluded by backwards compatibility
 > because both "\u1FFFF" and "\x1FFFF" are valid strings. (Are braces
 > optional in Perl's \x{..} or Ruby's \u{..}?) Also, the upper-case U
 > is more in-line with U+ notation and \N escape.  If we are looking
 > for "one obvious way," I think it should be \U with \x and \u
 > remaining the other less obvious ways.

-1.  The obvious way forward is \N{U+1FFFF}.  That *looks* like an
algorithmically generated name, and (wow!) that's what it *is*.[1]
The existing \U, \u, and \x escapes are fine as they are.  They can't
really be deprecated because they're needed for portability to older
Python versions which won't have any of the proposed extensions.
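
To make the compatibility point concrete, here is a small sketch of how
the existing fixed-width escapes parse today (assuming Python 3.3 or
later, where a string is a sequence of code points).  Nothing here is
new syntax; the variable names are only for illustration.

```python
# Each existing escape consumes a fixed number of hex digits:
# \x takes 2, \u takes 4, \U takes 8.  Any further digits are ordinary text.
s_x = "\x1FFFF"      # U+001F followed by the three characters "FFF"
s_u = "\u1FFFF"      # U+1FFF followed by the single character "F"
s_U = "\U0001FFFF"   # the single code point U+1FFFF

print(len(s_x), len(s_u), len(s_U))   # 4 2 1
print([hex(ord(c)) for c in s_u])     # ['0x1fff', '0x46']
```

Because "\u1FFFF" and "\x1FFFF" already mean something, giving \u or \x
a variable-width reading would silently change existing strings.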

Changing the syntax of \U to allow braces with a variable-width
hexadecimal argument is only a minor compatibility break, but please
have pity on the folks who support python-list.  They'll forever be
dealing with questions like "I know I've seen other people write
'\U3bb', why do I get a weird syntax error?" and "I use Python 3.3.
Why do I get a syntax error with '\U{3BB}'?"
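
A rough way to see what those users would run into is to ask the
compiler directly.  The helper name literal_compiles is made up for this
sketch, and the results in the comments are what I would expect from
Python 3 releases that lack the proposed extension.

```python
def literal_compiles(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True

# Raw strings keep the backslashes intact, so the compiler sees the literal.
print(literal_compiles(r"'\U000003bb'"))  # True: full 8-digit form
print(literal_compiles(r"'\U3bb'"))       # False: truncated \UXXXXXXXX escape
print(literal_compiles(r"'\U{3BB}'"))     # False: braces are not accepted by \U
```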

On the other hand, \N{U+1FFFF} currently gets a lookup failure.  I
think that's OK: code already needs to be prepared for a \N lookup to
fail, since an unknown name raises an error, and users will be used to
it because it's easy to typo Unicode names when typing from memory --
they're pretty regular but not 100% so.
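
As an illustration only, the lookup side of this could be sketched as
below.  The helper lookup_name and its regular expression are
hypothetical and are not part of unicodedata; as far as I know,
unicodedata.lookup itself does not accept the U+ form, and the second
name in the loop is the kind of typo I mean (the official name is
spelled LAMDA).

```python
import re
import unicodedata

# Hypothetical pattern for the proposed "U+XXXX" form: 1 to 6 hex digits.
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{1,6})")

def lookup_name(name):
    """Hypothetical helper: resolve 'U+XXXX' or a Unicode character name."""
    match = _CODE_POINT.fullmatch(name)
    if match:
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(match.group(1), 16))
    # Fall back to the standard lookup, which raises KeyError if it fails.
    return unicodedata.lookup(name)

print(lookup_name("U+3BB") == lookup_name("GREEK SMALL LETTER LAMDA"))

# What the stock lookup does with the same inputs today.
for name in ("U+1FFFF", "GREEK SMALL LETTER LAMBDA"):
    try:
        unicodedata.lookup(name)
    except KeyError as exc:
        print("lookup failed:", exc)
```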

Footnotes: 
[1]  Of course, it's also an invalid code point in any Unicode stream. ;-)


