Benefits of unicode identifiers (was: Allow additional separator in identifiers)

Thu Nov 23 17:15:16 EST 2017

On 11/23/17 4:31 PM, Chris Angelico wrote:
> On Fri, Nov 24, 2017 at 8:19 AM, Richard Damon <Richard at damon-family.org> wrote:
>> On 11/23/17 2:46 PM, Thomas Jollans wrote:
>>> On 23/11/17 19:42, Mikhail V wrote:
>>>> I mean for a real practical situation - for example for an average
>>>> Python programmer or someone who seeks a programmer job.
>>>> And who does not have a 500-key keyboard,
>>> I don't think it's too much to ask for a programmer to have the
>>> technology and expertise necessary to type their own language in its
>>> proper alphabet.
>>
>> My personal feeling is that the language needs to be fully usable with just
>> ASCII, so the - character (HYPHEN/MINUS) is the subtraction/negation
>> operator, not an in-name hyphen. This also means the main library should use
>> just the ASCII character set.
>>
>> I do also realize that it could be very useful for programmers whose
>> native language is not English to be able to use words in their native
>> language for their own symbols, and thus useful to use their
>> own character sets. Yes, doing so may add difficulty to the programmers, as
>> they may need to be switching keyboard layouts (especially if not using a
>> LATIN based language), but that is THEIR decision to do so. It also may make
>> it harder for outside programmers to help, but again, that is the team's
>> decision to make.
>>
>> The Unicode Standard provides a fairly good classification of the
>> characters, and it would make sense to specify that any character that is
>> classified as a 'Letter' or a 'Number', along with some classes of
>> Punctuation (connector and dash), be allowed in identifiers.
> That's exactly how Python's identifiers are defined (modulo special
> handling of some of the ASCII set, for reasons of backward
> compatibility).
>
>> Fully implementing this may be more complicated than it is worth. An interim
>> simple solution would be to just allow ALL (or maybe most, excluding a limited
>> number of obvious exceptions) of the characters above the ASCII set, with a
>> warning that only those classified as above are promised to remain valid,
>> and that other characters, while currently not generating a syntax error,
>> may do so in the future. It should also be stated that while currently no
>> character normalization is being done, it may be added in the future, so
>> identifiers that differ only by code point sequences that are defined as
>> being equivalent, might in the future not be distinct.
> No, that would be a bad idea; some of those characters are more
> logically operators or brackets, and some are most definitely
> whitespace. Also, it's easier to *expand* the valid character set than
> to *restrict* it, so it's better to start with only those characters
> that you know for sure make sense, and then add more later. If the
> xid_start and xid_continue classes didn't exist, it might be
> reasonable to use "Letter, any" and "Number, any" as substitutes; but
> those classes DO exist, so Python uses them.
>
> But broadly speaking, yes; it's not hard to allow a bunch of
> characters as part of Python source code. Actual language syntax (eg
> keywords) is restricted to ASCII and to those symbols that can easily
> be typed on most keyboards, but your identifiers are your business.
>
> ChrisA
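
As a rough illustration of the rule described above, here is a small
sketch that asks Python itself whether a string is a legal identifier
and prints the Unicode general category of each character. The helper
name describe() and the sample strings are made up for this example;
str.isidentifier() and unicodedata.category() are standard.

```python
import unicodedata

def describe(name):
    """Return (is_identifier, general category of each character).

    describe() is a hypothetical helper for this illustration only.
    """
    return name.isidentifier(), [unicodedata.category(ch) for ch in name]

# Sample strings chosen for illustration; escapes keep the file pure ASCII.
samples = [
    "na\u00efve",   # Latin letter with diaeresis (category Ll)
    "a\u203fb",     # UNDERTIE, connector punctuation (category Pc)
    "a-b",          # ASCII HYPHEN-MINUS (category Pd)
    "a\u2010b",     # HYPHEN (category Pd)
    "x\u00a0y",     # NO-BREAK SPACE (category Zs)
]

for s in samples:
    ok, cats = describe(s)
    print(ascii(s), "identifier:", ok, "categories:", cats)
```

On the interpreters I am aware of, letters and connector punctuation are
accepted, while dash punctuation and whitespace are not, so the "dash"
class mentioned earlier in the thread is not part of what Python allows.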

My thought is that you define as legal only those Unicode characters that
the defined classification would normally allow, but perhaps the first
implementation doesn't diagnose many of the illegal combinations. If that
isn't Pythonic, then yes, a fuller implementation of the classification
would be needed. That might also mean the normalization questions would
need to be decided too.
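
To make the normalization question concrete, here is a minimal sketch of
two spellings of one name that differ only in their code point sequences.
The helper same_identifier() and the example name are invented for
illustration. As far as I understand PEP 3131, Python 3 converts
identifiers to NFKC while parsing, and the exec() call below is only a
way to observe that.

```python
import unicodedata

composed = "caf\u00e9"       # e-acute as a single code point
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

def same_identifier(a, b):
    """Hypothetical helper: compare two names after NFKC normalization."""
    nfkc = lambda s: unicodedata.normalize("NFKC", s)
    return nfkc(a) == nfkc(b)

print(composed == decomposed)                  # the raw strings differ
print(same_identifier(composed, decomposed))   # equal once normalized

# Assign through the decomposed spelling, then see which key was stored.
namespace = {}
exec(decomposed + " = 1", namespace)
print(composed in namespace, decomposed in namespace)
```

If that reading of the PEP is right, the name is stored under its
composed form, so the two spellings cannot name distinct variables.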

-- 
Richard Damon