[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Sun Jul 15 14:39:29 EDT 2018


Robert Vanden Eynde writes:

 > Not for control characters.

There's a standard convention for "naming" control characters (U+0000,
U+0001, etc), which is recommended by the Unicode Standard (in
slightly generalized form) for characters that otherwise don't have
names, as "code point labels".  This has been suggested by MRAB in the
past.  Personally I would generalize Steven d'Aprano's function a
bit, and provide a "CONTROL-" prefix for these instead of "U+".
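
A minimal sketch of what that might look like.  The helper name
name_or_label is hypothetical (it is not part of unicodedata), and the
"CONTROL-" and "U+" formats simply follow the suggestion above rather
than any existing API:

```python
import unicodedata


def name_or_label(ch):
    """Return the Unicode name of ch, or a label if it has none.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    try:
        return unicodedata.name(ch)
    except ValueError:
        # unicodedata.name() raises ValueError when no name is defined
        # and no default is given.
        cp = ord(ch)
        # General category "Cc" covers the C0 and C1 controls and DEL.
        prefix = "CONTROL-" if unicodedata.category(ch) == "Cc" else "U+"
        return "%s%04X" % (prefix, cp)
```

With this, name_or_label("\x07") gives "CONTROL-0007", while an unnamed
non-control such as a private-use code point falls back to the "U+"
form.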

I don't see why even the C0 ASCII control function aliases should be
particularly privileged, especially since the main alias is the
spelled-out name, not the more commonly used 2- or 3-character
abbreviation (will people associate "alarm" with "BEL"? I don't).
Many are just meaningless (the 4 "device control" codes).  And some
are actively misleading: U+0018 (^X) "cancel" and U+001A (^Z)
"substitute", which are generally interpreted as "exit" (an
interactive program) and "end of file" (on Windows), or as "cut" and
"revert" in CUA UI.  I for one would find it more useful if they
aliased to "ctrl-c-prefix" and "zap-up-to-char".[1]

And nobody's ever heard of the C1 ISO 6429 control characters (not to
mention that three of them are literally figments of somebody's
imagination, and never standardized).

So I think using NameAliases.txt for this purpose is silly.  If we're
going to provide aliases based on the traditional control functions, I
would use only the NameAliases.txt aliases for the following: NUL,
BEL, BS, HT, LF, VT, FF, CR, ESC, SP, DEL, NEL, NBSP, and SHY.  (NEL
is included because it's recommended that it be treated as a newline
function in the Unicode standard.)  For the rest, I would use
CONTROL-<code>, which is more likely to make sense in most
contexts.[2]
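
Roughly, and only as a sketch: the table below hard-codes the
abbreviations listed above instead of parsing NameAliases.txt, and the
names SHORT_NAMES and control_label are made up for illustration.

```python
import unicodedata

# The abbreviations listed above, keyed by code point.  Hard-coded here
# for illustration; a real implementation would presumably read them
# from NameAliases.txt.
SHORT_NAMES = {
    0x00: "NUL", 0x07: "BEL", 0x08: "BS", 0x09: "HT", 0x0A: "LF",
    0x0B: "VT", 0x0C: "FF", 0x0D: "CR", 0x1B: "ESC", 0x20: "SP",
    0x7F: "DEL", 0x85: "NEL", 0xA0: "NBSP", 0xAD: "SHY",
}


def control_label(ch):
    """Return a short alias, a CONTROL-<code> label, or None.

    Hypothetical helper; not part of unicodedata.
    """
    cp = ord(ch)
    if cp in SHORT_NAMES:
        return SHORT_NAMES[cp]
    if unicodedata.category(ch) == "Cc":
        # Any other control character gets the generic label.
        return "CONTROL-%04X" % cp
    return None
```

So ESC stays "ESC", but U+0018 becomes "CONTROL-0018" rather than
"cancel".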

 > About the Han case, they all have a
 > unicodedata.name don't they ? (Sorry if I
 > misread your message)

Yes, they have names, constructed algorithmically from the code point:
"CJK UNIFIED IDEOGRAPH-4E00".  I know what that one is (the character
that denotes the number 1).  But that's the only one that I know
offhand.
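
For concreteness, this is how the current unicodedata.name behaves for
a Han character versus a control character (the variable names are
just for illustration):

```python
import unicodedata

ch = "\u4e00"

# Han characters get an algorithmic name built from the code point.
han_name = unicodedata.name(ch)
derived = "CJK UNIFIED IDEOGRAPH-%04X" % ord(ch)

# Control characters have no name, so the default (here None) comes back.
control_name = unicodedata.name("\x07", None)
```

The name tells you nothing that the code point did not already tell
you.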

I think Han characters (which are named daily, surely millions, if
not billions, of times) should be treated at least as well as controls
(which even programmers rarely bother to name, especially those that
don't have standard escape sequences).  That's why I strongly advocate that
there be provision for extension, and that the databases at least be
provided by a module that can be updated far more frequently than the
stdlib is.

Footnotes: 
[1]  Those are the commands they are bound to in Emacs.

[2]  There are a few others that I personally would find useful and
unambiguous because they're used in multilingual ISO 2022 encodings,
but that's rather far into the weeds.  They're rarely seen in
practice; most of the time 7-bit codes with escape sequences are used,
or 8-bit codes without control sequences.

