Best GUI for Python

Mon Apr 27 12:31:01 EDT 2015

On Tue, 28 Apr 2015 12:54 am, Christian Gollwitzer wrote:

> Am 27.04.15 um 09:15 schrieb Steven D'Aprano:
>> On Monday 27 April 2015 16:55, Christian Gollwitzer wrote:
>>
>>> YMMV. Is non-BMP needed for any living non-esoteric language? I agree
>>> that it is a big flaw, but still is useful for very many projects.
>>
>> Yes.
>>
>> The Unicode Supplementary Multilingual Planes (SMPs) are used for rare
>> but still current East Asian characters (which means that some of your
>> Chinese users may not be able to write their names without it), also some
>> mathematics symbols (okay, that's *slightly* esoteric), as well as emoji.
> 
> OK current Chinese characters count in. Were these available in
> pre-unicode 16bit encodings? 

Certainly not :-)

Unicode currently encodes over 74,000 CJK (Chinese/Japanese/Korean)
ideographs, which is comfortably larger than 2**16, so no 16-bit encoding
can handle the complete range of CJK characters. In fact, Unicode has
projected that at least another five thousand characters will be added in
version 8, and probably more beyond that.
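
As a rough illustration of that arithmetic, here is a minimal Python 3 sketch. The 74,000 figure is simply the approximate count quoted above, and the sample code point U+20000 (the start of CJK Unified Ideographs Extension B) is an assumption picked for illustration, not something from the original discussion:

```python
# Number of distinct values a single 16-bit code unit can represent.
BMP_SLOTS = 2 ** 16

# Approximate count of encoded CJK ideographs, taken from the text above.
CJK_IDEOGRAPHS = 74000

# Ideographs left over even if every 16-bit value were spent on CJK alone.
overflow = CJK_IDEOGRAPHS - BMP_SLOTS

# U+20000 lies outside the Basic Multilingual Plane (illustrative choice).
sample = "\U00020000"

# Each UTF-16 code unit is 2 bytes; a non-BMP character needs a surrogate pair.
utf16_units = len(sample.encode("utf-16-le")) // 2

print(BMP_SLOTS, overflow)                # 65536 8464
print(ord(sample) > 0xFFFF, utf16_units)  # True 2
```

So even a 16-bit encoding devoted entirely to CJK would fall several thousand characters short, and software limited to single 16-bit units cannot represent a character like the one above at all.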

> If not, how did they cope? 

Mostly badly :-)

Actually, that's probably unfair. The situation wasn't as bad as it could
have been for a number of reasons:

- In Korea, hanja (literally "Chinese characters") are mostly used for older
historical documents and some names, so most new documents would be written
entirely in hangul instead of Chinese ideographs.

- In Japan, kanji (also literally "Chinese characters") are not the only
option either. There is a choice between kanji and three other systems:

  hiragana is used for syllables and words for which there is either 
  no kanji available, or the author doesn't know the kanji;

  katakana is used for foreign words, loan words, scientific and 
  technical terms, the names of companies, for emphasis, and 
  onomatopoeia (words that sound like the sound of the thing they
  describe, e.g. bling, splash, pow, fizz, cock-a-doodle-doo, oink);

  rōmaji (literally "Roman letters"), although it is rare to use
  that to write Japanese; it's mostly used for communication with
  non-Japanese, and as an input system for entering kanji.

- 16 bits is enough to encode the most common Chinese characters. For less
common characters, people can re-word the phrase, use an alternative
spelling, or use a custom font or image, depending on how important the
exact correct character is considered.

- When it comes to names, sometimes people can use a similar character. The
nearest approximation in Latin languages would be if the preferred spelling
of your name was Zöe Weiß and you wrote Zoe Weiss instead.

- Vietnam hasn't used Chinese characters (Nôm) since the 1920s, except for
limited decorative and ceremonial purposes.

Disclaimer: nearly all of the above is taken from Wikipedia; my personal
knowledge of Chinese and Japanese is limited to "yum cha" and "banzai".
Everything else is all Greek to me :-) Consequently actual speakers of CJK
languages may have different opinions.

> Not trying to
> defend the BMP, but still wondering whether this is a new issue due to
> the switch from 16bit to unicode, or if people can finally use all
> characters thanks to unicode (software with full support). Emoji and
> rare math is somewhat more esoteric (given the limited codepoint space)

No, it is not a new issue. Legacy East-Asian encodings have the same
problems. It will probably take many more years before the entire CJK
character set is added to Unicode, simply because the characters left to
add are obscure and rare. Some may never be added. E.g. in 2007 Taiwan
withdrew a submission to add 6,545 characters used as personal names as
they were deemed to no longer be in use.

-- 
Steven