Coding systems are political (was Extended ASCII and code pages)

Random832 random832 at fastmail.com
Sat May 28 14:05:40 EDT 2016


On Sat, May 28, 2016, at 00:46, Rustom Mody wrote:
> Which also means that if the Chinese were to have more say in the
> design of Unicode/ UTF-8 they would likely not waste swathes of prime
> real-estate for almost never used control characters just in the name
> of ASCII compliance

There are only 128 code points in the single-byte range of UTF-8, and
only 32 of those are control characters, almost-never-used or
otherwise. What do you imagine they would have put there instead?
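
For anyone who wants to check that count, here is a rough sketch using
the standard unicodedata module. The variable names are mine. Note that
Unicode also classes DEL (U+007F) as a control, so a strict count of the
single-byte range gives 33, of which 32 are the C0 block.

```python
import unicodedata

# The single-byte range of UTF-8 is exactly the ASCII range, U+0000..U+007F.
single_byte = range(0x80)

# General category "Cc" is Unicode's "control" category.
controls = [cp for cp in single_byte if unicodedata.category(chr(cp)) == "Cc"]

# The C0 block proper is U+0000..U+001F; DEL (U+007F) sits outside it.
c0 = [cp for cp in controls if cp < 0x20]

print(len(single_byte), len(controls), len(c0))  # 128 33 32
```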

At least Unicode doesn't do as badly as the first-draft ISO-UCS, which
didn't allow a C0/C1 control value in *any* byte position in UCS-2 or
UCS-4. UCS-2 could therefore encode only 192*192=36,864 code points as
two bytes (plus 64 control characters as one byte), as opposed to
UTF-16's 63,488 two-byte characters (including all control characters).
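
The arithmetic can be reproduced in a few lines. This is only a sketch:
it assumes the draft rule was simply "no byte may fall in 0x00-0x1F or
0x80-0x9F", which is my reading of the description above and not a
quote from the draft itself.

```python
# Byte values excluded by the (assumed) draft rule: C0 and C1 controls.
C0 = set(range(0x00, 0x20))
C1 = set(range(0x80, 0xA0))

# Every other byte value is allowed in any position.
allowed = [b for b in range(256) if b not in C0 | C1]

draft_two_byte = len(allowed) ** 2       # both bytes drawn from the allowed set
controls_one_byte = len(C0) + len(C1)    # controls kept as single bytes

# UTF-16: every 16-bit value except the surrogates D800..DFFF stands alone.
utf16_two_byte = 0x10000 - (0xDFFF - 0xD800 + 1)

print(len(allowed), draft_two_byte, controls_one_byte, utf16_two_byte)
# 192 36864 64 63488
```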

For completeness, I'll note that conventional East Asian character
coding systems do have a higher information density compared to UTF-8,
but at a cost of not being self-synchronizing. And their single-byte
characters are in fact ASCII and C0/C1 controls, with only Japanese
Shift-JIS encodings additionally having Katakana as single-byte
characters.
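
To make the density and self-synchronization trade-off concrete, here
is a small sketch using Python's built-in shift_jis codec. The sample
characters are my own choice: full-width katakana SO (U+30BD) is a
well-known case whose Shift-JIS trail byte collides with ASCII
backslash.

```python
text = "\u30bd"  # full-width katakana SO

sjis = text.encode("shift_jis")
utf8 = text.encode("utf-8")

# Density: two bytes in Shift-JIS versus three in UTF-8.
print(len(sjis), len(utf8))

# Shift-JIS trail bytes can fall in the ASCII range. Here the second byte
# is 0x5C (backslash), so a reader starting mid-stream cannot tell a
# trail byte from a real ASCII character.
ascii_lookalikes = [b for b in sjis if b < 0x80]

# In UTF-8 every byte after the lead byte has the form 0b10xxxxxx,
# so character boundaries can be found from any starting offset.
continuation = [b for b in utf8[1:] if b & 0xC0 == 0x80]

# Half-width katakana SO (U+FF7F) is a single byte in Shift-JIS.
halfwidth = "\uff7f".encode("shift_jis")

print(sjis.hex(), utf8.hex(), halfwidth.hex())
```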


