[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Rauli Ruohonen rauli.ruohonen at gmail.com
Wed Jun 6 09:09:43 CEST 2007


On 6/6/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> No.  The point is that people want to use their current tools; they
> may not be able to easily specify normalization.

> Please look through the list (I've already done so; I'm speaking from
> detailed examination of the data) and state what compatibility
> characters you want to keep.

I can't really speak for code points I'm not familiar with, but I
wouldn't use any of the ones I do know in identifiers. The only
compatibility characters in ID_Continue I have used myself are,
I think, halfwidth katakana and fullwidth alphanumerics. Examples:

ﾀ -> タ # halfwidth katakana
ｘ -> x # fullwidth alphabetic
１ -> 1 # fullwidth numeric
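
For concreteness, here's a small sketch showing how Python's
unicodedata module treats exactly these characters: NFKC folds them,
while NFC leaves them alone.

import unicodedata

# NFKC applies compatibility decompositions, folding the examples above:
print(unicodedata.normalize('NFKC', 'ﾀ'))   # halfwidth katakana -> 'タ'
print(unicodedata.normalize('NFKC', 'ｘ'))  # fullwidth alphabetic -> 'x'
print(unicodedata.normalize('NFKC', '１'))  # fullwidth numeric -> '1'

# NFC applies only canonical (de)compositions, so all three pass through:
print(unicodedata.normalize('NFC', 'ﾀｘ１'))  # -> 'ﾀｘ１', unchanged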

Practically speaking, I won't be using such things in my code. I don't
like them, but if it's more pragmatic to allow them, then I guess it
can't be helped.

There are some cases where users might in the future want to
distinguish between "compatibility" characters and their canonical
counterparts, such as these:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
If some day everyone writes their TeX using such things, then it'd make
sense to allow and distinguish them in Python, too. For this reason
I think that any compatibility transformation should be applied only
to characters where there's a practical reason for it; for all other
cases, punting (= syntax error) is safest. When in doubt, refuse the
temptation to guess.
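
To make the "fold a whitelist, punt on the rest" idea concrete, here
is a hypothetical sketch; the function name, the whitelist and the
choice of raising SyntaxError are all illustrative, not a concrete
proposal for the implementation.

import unicodedata

# Compatibility ranges we choose to fold (an illustrative whitelist):
FOLDED_RANGES = [
    (0xFF10, 0xFF19),  # fullwidth digits ０-９
    (0xFF21, 0xFF3A),  # fullwidth Ａ-Ｚ
    (0xFF41, 0xFF5A),  # fullwidth ａ-ｚ
    (0xFF66, 0xFF9F),  # halfwidth katakana
]

def normalize_identifier(name):
    out = []
    for ch in name:
        if any(lo <= ord(ch) <= hi for lo, hi in FOLDED_RANGES):
            out.append(unicodedata.normalize('NFKC', ch))
        elif (unicodedata.normalize('NFKC', ch) !=
              unicodedata.normalize('NFC', ch)):
            # Any other compatibility character, e.g. U+1D400
            # MATHEMATICAL BOLD CAPITAL A: refuse to guess.
            raise SyntaxError('compatibility character %r in identifier'
                              % ch)
        else:
            out.append(ch)
    return ''.join(out)

print(normalize_identifier('ｘ１'))  # -> 'x1', folded
# normalize_identifier('\U0001D400') would raise SyntaxError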

> as a daily user of several Japanese input methods, I can tell you it
> would be a massive pain in the ass if Python doesn't convert those,
> and errors would be an on-the-minute-every-minute annoyance.

I use two Japanese input methods (MS IME and scim/anthy), but only the
latter daily. When I type text that mixes Japanese and other languages,
I switch the input mode off when not typing Japanese. For code that
uses a lot of Japanese this may not be convenient, but then you'd want
to set your input method to use ASCII for ASCII anyway, as ASCII would
still be required in literals (０ｘ１５ or "ａ" won't work) and
punctuation (ａ「１５」。ｆｏｏ＝（５、６） won't work).
Code mixing fullwidth and halfwidth alphanumerics also looks
horrible, but that's just a coding style issue :-)
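
Under the NFKC-for-identifiers proposal, the split would look like
this sketch (assuming identifiers are normalized while literals and
operators are not):

ｆｏｏ = 5          # OK: the identifier NFKC-folds to foo
print(foo)          # -> 5, the same binding

# ０ｘ１５            # SyntaxError: literals are not normalized
# ｆｏｏ＝（５、６）    # SyntaxError: fullwidth ＝ and （ are not operators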

>  > Unicode, and adding extra equivalences (whether it's "FoO" == "foo",
>  > "ｶｷ" == "カキ" or "Ａ１２３" == "A123") is surprising.
>
> How many Japanese documents do you deal with on a daily basis?

Far fewer than you, as I don't live in Japan. I read a fair amount
but don't type long texts in Japanese. When I do type, I usually use
fullwidth alphanumerics except for foreign words that aren't acronyms,
e.g. ＦＢＩ but not ａｌｐｈａｂｅｔ. For code, consistently using
ASCII for ASCII would be the most predictable rule (TOOWTDI).

You have to go out of your way to type halfwidth katakana, and it
isn't really useful in identifiers IMHO.

> They are treated as font variants, not different characters, by *all*
> users.

I think programmers in general expect identifier identity to behave
the same way as string identity; in this respect they are a special
class of users. (Those who use case-insensitive programming languages
have all my sympathy :-)
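
A sketch of the surprise, assuming identifiers are NFKC-folded while
strings keep their exact code points:

Ａ１２３ = 'bound via the fullwidth spelling'

# The two *identifiers* collide (both normalize to A123)...
print(A123)                  # -> 'bound via the fullwidth spelling'

# ...but the corresponding *strings* stay distinct:
print('Ａ１２３' == 'A123')  # -> False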

> I would like this code to return "KK".  This might be an unpleasant
> surprise, once, and there would need to be a warning on the box for
> distribution in Japan (and other cultures with compatibility
> decompositions).

This won't have a big impact if it's applied only to carefully
selected code points, and done that way it sounds like a viable choice.
Asking your students for input, as you suggested, is surely a good idea.

