[Python-ideas] Support Unicode code point notation

MRAB python at mrabarnett.plus.com
Sat Jul 27 17:28:14 CEST 2013


On 27/07/2013 12:17, Chris “Kwpolska” Warrick wrote:
> On Sat, Jul 27, 2013 at 12:01 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> Unicode's standard notation for code points is U+ followed by a 4, 5 or 6
>> hex digit string, such as π = U+03C0. This notation is found throughout the
>> Unicode Consortium's website, e.g.:
>>
>> http://www.unicode.org/versions/corrigendum2.html
>>
>> as well as in third party sites that have reason to discuss Unicode code
>> points, e.g.:
>>
>> https://en.wikipedia.org/wiki/Eth#Computer_input
>>
>> I propose that Python strings support this as the preferred escape notation
>> for Unicode code points:
>>
>> '\U+03C0'
>> => 'π'
>>
>> The existing \U and \u variants must be kept for backwards compatibility,
>> but should be (mildly) discouraged in new code.
>
> As Marc-Andre Lemburg said, C, C++ and Java use the same notation as
> Python does.
>
> And there is NO programming language implementing the U+ syntax.  Why
> should we?  Why should we violate de-facto standards?
>
> Existing programming languages use one or more of:
>
> a) \uHHHH
> b) \UHHHHHHHH
> c) \u{H..HHHHHH} (e.g. Ruby)
> d) \xH..HH
> e) \x{H..HHHHHH}
> f) \O..OOO
>
> and probably some more variants I am not aware of or forgot about, but
> there is probably no programming language that does \U+{H..HHHHHH}, so
> why should we?
>
[snip]

Perl supports "\N{U+1234}" and "\x{1234}", and I believe that some languages
also support "\x{41 42 43}" as an abbreviation for "\x{41}\x{42}\x{43}".
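
For comparison, here is a rough sketch of the nearest spellings in Python
today. It uses only the existing escapes; the variable names are just for
illustration. Note that Python's "\N{...}" takes a character name rather than
a U+ number, and there is no braced "\x{...}" form.

```python
# Existing Python escapes: each is fixed-width or explicitly delimited.
pi = "\u03c0"                             # \u takes exactly 4 hex digits
assert pi == "\U000003c0"                 # \U takes exactly 8 hex digits
assert pi == "\N{GREEK SMALL LETTER PI}"  # \N{...} is delimited by braces
assert pi == chr(0x03C0)

# Each \x takes exactly 2 hex digits, so three bytes' worth of code
# points need three separate escapes.
abc = "\x41\x42\x43"
assert abc == "ABC"
```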

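As others have said, "\U+1234" suffers from the same problem as octal
escapes: it is variable-length with no terminator, so when more hex digits
follow it is unclear where the escape ends.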
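
To make the octal comparison concrete, here is a minimal sketch. The
possible_readings helper is hypothetical: it only models the proposed
"4, 5 or 6 hex digits" rule and is not part of Python or of any actual
implementation of the proposal.

```python
# Octal escapes take up to three octal digits, so you have to count
# digits to see where the escape stops.
s = "\1234"               # read as "\123" followed by a literal "4"
assert s == "S4"
assert len(s) == 2

# The fixed-width \u escape has exactly one reading.
assert "\u03C0DE" == "πDE"

HEX_DIGITS = set("0123456789abcdefABCDEF")

def possible_readings(body):
    """Hypothetical: every (character, remaining text) split for the text
    that follows a '\\U+' prefix, allowing 4, 5 or 6 hex digits."""
    readings = []
    for width in (4, 5, 6):
        digits = body[:width]
        if len(digits) == width and all(c in HEX_DIGITS for c in digits):
            code = int(digits, 16)
            if code <= 0x10FFFF:
                readings.append((chr(code), body[width:]))
    return readings

# "\U+03C0DE" could mean U+03C0 then "DE", U+3C0D then "E", or U+3C0DE.
readings = possible_readings("03C0DE")
for char, rest in readings:
    print(f"U+{ord(char):04X} followed by {rest!r}")
```

A parser would have to pick one of these by fiat (longest match, say), and
the reader would have to remember which.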

-1


