Coding systems are political (was Extended ASCII and code pages)

Steven D'Aprano steve at pearwood.info
Sun May 29 01:37:35 EDT 2016


On Sat, 28 May 2016 02:46 pm, Rustom Mody wrote:

[...]
> In idealized, simplified models like Turing models where
> 3 is 111
> 7 is 1111111
> 100, 8364 etc I wont try to write but you get the idea!
> its quite clear that bigger numbers cost more than smaller ones

I'm not sure that a tally (base-1, unary) is a good model for memory usage
in any computing system available today. And I thought that the Turing
model was based on binary: the machine could both mark a cell and erase the
mark, which corresponds to a bit.
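
To put rough numbers on the difference between a tally and binary, here is a
small sketch. The helper names unary_cost and binary_cost are made up for
this illustration, and "cost" just means the number of symbols written down,
which is an assumption, not a model of any real machine.

```python
def unary_cost(n):
    """Tally marks needed to write n in base 1: one mark per unit."""
    return n


def binary_cost(n):
    """Bits needed to write n in base 2 (zero still takes one bit)."""
    return max(1, n.bit_length())


# The numbers mentioned in the quoted text above.
costs = {n: (unary_cost(n), binary_cost(n)) for n in (3, 7, 100, 8364)}

for n, (marks, bits) in costs.items():
    print(f"{n}: {marks} tally marks, {bits} bits")
```

In the tally model the cost grows linearly with the value; in binary it grows
only with the logarithm of the value.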



> With current hardware it would seem to be a flat characteristic for
> everything < 2³² (or even 2⁶⁴)
> 
> But thats only an optical illusion because after that the characteristic
> will rise jaggedly, slowly but monotonically, typically log-linearly
> [which AIUI is jmf's principal error]

Can you be more specific about what you are trying to say? You seem to think
that you're saying something profound here, but I don't know what it is.



> Which also means that if the Chinese were to have more say in the design
> of Unicode/ UTF-8 they would likely not waste swathes of prime real-estate
> for almost never used control characters just in the name of ASCII
> compliance

There is this meme going around that Unicode is a Western imperialistic
conspiracy against Asians. For example, there was a blog post a year or so
ago by somebody bitterly complaining that he could draw a pile of poo in
Unicode but not write his own name, blaming Westerners for this horrible
state of affairs.

But like most outrage on the Internet, his complaint was nonsense. He *can*
write his name -- he just has to use a combining character to add an
accent(?) to a base character. (Or possibly a better analogy is that of a
ligature.) His complaint came down to the fact that because his name
included a character which was unusual even in his own language (Bengali),
he had to use two Unicode code points rather than one to represent it. This
is, of course, the second worst[1] kind of discrimination.

https://news.ycombinator.com/item?id=9219162
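
As a sketch of what "two code points rather than one" means in practice,
here is the same idea using a Latin letter as a stand-in. The choice of "é"
is an assumption made for illustration; it is not the Bengali character from
the blog post.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, two code points

# The two strings render the same but differ in length and compare unequal.
print(len(precomposed), len(combining))
print(precomposed == combining)

# NFC normalization composes the pair into the single precomposed code point.
composed = unicodedata.normalize("NFC", combining)
print(composed == precomposed)
```

Either spelling is a perfectly valid way to write the character; the only
difference is how many code points it takes.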

Likewise the hoo-ha over CJK unification. Some people believe that this is
the evil Western imperialists forcing their ignorant views on the Chinese,
Japanese and Koreans, but the reality is that the Unicode Consortium merely
follows the decisions made by the Ideographic Rapporteur Group (IRG),
originally the CJK Joint Research Group (CJK-JRG). That is a multinational
group set up by the Chinese and Japanese, now including other East Asians
(both Koreas, Singapore, Vietnam) to decide on a common set of Han
characters.

Anyway, I digress.

Given that there are tens of thousands of Han characters (with unification),
more than will fit in 16 bits, the 64 control characters in Unicode are not
going to make any practical difference. In some hypothetical world where
Han speakers got to claim code points U+0000-001F and U+0080-009F for
ideographs, pushing the control characters out into the astral planes, all
they would gain is *sixty-four* code points. They would still need multiple
thousands of astral characters.
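
The arithmetic can be checked with the unicodedata module. This is only a
sketch: the Han count depends on which Unicode version your Python ships
with, and matching on the name prefix "CJK UNIFIED IDEOGRAPH" is my own
shortcut, which leaves out the compatibility ideographs.

```python
import unicodedata


def is_control(cp):
    """True if the code point has general category Cc (control)."""
    return unicodedata.category(chr(cp)) == "Cc"


# The two ranges mentioned above: U+0000-001F and U+0080-009F.
c0 = sum(is_control(cp) for cp in range(0x00, 0x20))
c1 = sum(is_control(cp) for cp in range(0x80, 0xA0))

# Count unified Han ideographs known to this Python's Unicode database.
han = sum(
    unicodedata.name(chr(cp), "").startswith("CJK UNIFIED IDEOGRAPH")
    for cp in range(0x110000)
)

print("control characters reclaimed:", c0 + c1)
print("unified Han ideographs:", han)
print("fits in 16 bits?", han <= 2**16)
```

Reclaiming the control characters barely dents the total, so the overflow
into the astral planes remains.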

Besides, some level of ASCII compatibility is useful even for Han speakers.
Their own native-designed standard encodings like Big5 and Shift-JIS (which
predate Unicode) keep byte-compatibility with the 32 ASCII control
characters. (I'm not sure about the 32 "C1" control characters.) Since the
Chinese and Japanese national standards pre-dating Unicode chose to keep
compatibility with the ASCII control characters, I don't think that there
is any good reason to think they would have made a different decision when
it came to Unicode had they had more of a say than they already did.

Which was, and still is, considerable. Both China and Japan are very
influential in the Unicode Consortium, driving the addition of many new Han
characters and emoji. The idea that a bunch of Western corporations and
academics are pushing them around is laughable.
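
Going back to the point about Big5 and Shift-JIS keeping the ASCII control
characters: here is a quick check using Python's built-in "big5" and
"shift_jis" codecs. It assumes those codecs are a fair representation of the
standards, and it only looks at the 32 C0 controls, not the C1 set.

```python
# The 32 ASCII (C0) control characters, U+0000 through U+001F.
controls = "".join(chr(cp) for cp in range(0x20))
expected = bytes(range(0x20))

# For each codec, do the control characters encode to the same single
# bytes as they do in ASCII?
results = {
    codec: controls.encode(codec) == expected
    for codec in ("ascii", "big5", "shift_jis")
}

for codec, same in results.items():
    print(codec, "keeps the ASCII control bytes:", same)
```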




[1] The worst being that my US English keyboard doesn't have a proper curly
apostrophe, forcing me to use a straight ' mark in my name like some sort
of animal.

-- 
Steven



