Coding systems are political (was Extended ASCII and code pages)

Rustom Mody rustompmody at gmail.com
Sun May 29 02:12:25 EDT 2016


On Sunday, May 29, 2016 at 11:07:51 AM UTC+5:30, Steven D'Aprano wrote:
> On Sat, 28 May 2016 02:46 pm, Rustom Mody wrote:
> 
> [...]
> > In idealized, simplified models like Turing models where
> > 3 is 111
> > 7 is 1111111
> > 100, 8364 etc I won't try to write, but you get the idea!
> > it's quite clear that bigger numbers cost more than smaller ones
> 
> I'm not sure that a tally (base-1, unary) is a good model for memory usage
> in any computing system available today. And I thought that the Turing
> model was based on binary: the machine could both mark a cell and erase the
> mark, which corresponds to a bit.

Well, you can take your pick.
See unary here: http://jeapostrophe.github.io/2013-10-29-tmadd-post.html
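To make the tally-vs-binary point concrete, here is a small Python sketch (nothing beyond the stdlib; the function names are mine, not from the linked post). A unary tally of n needs n cells, while binary needs only about log2(n) digits -- which is exactly why "bigger numbers cost more" is obvious in the Turing-tally model and hidden on fixed-width hardware:

```python
def unary_cost(n: int) -> int:
    """A tally representation of n needs n marks (tape cells)."""
    return n

def binary_cost(n: int) -> int:
    """Binary needs bit_length(n) digits, i.e. roughly log2(n) + 1."""
    return n.bit_length() if n > 0 else 1

# The numbers from the quoted example: cost rises linearly in unary,
# only logarithmically in binary.
for n in (3, 7, 100, 8364):
    print(n, unary_cost(n), binary_cost(n))
```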

> 
> 
> 
> > With current hardware it would seem to be a flat characteristic for
> > everything < 2³² (or even 2⁶⁴)
> > 
> > But thats only an optical illusion because after that the characteristic
> > will rise jaggedly, slowly but monotonically, typically log-linearly
> > [which AIUI is jmf's principal error]
> 
> Can you be more specific at what you are trying to say? You seem to think
> that you're saying something profound here, but I don't know what it is.

I think that you seem to think that you know what I seem to think... but I digress.

Big numbers are big, i.e. expensive.
Small numbers are cheap.
Easy so far?

Then there is technology, making arbitrary decisions, e.g. that a word is 32 bits.
This just muddies the discussion but does not change the speed of light -- i.e.
properties of the universe are invariant in the face of committee decisions
-- even those of international consortiums.

So it SEEMS (to people like jmf) that a million is no more costly than ten.

However, consider an 8-bit machine (e.g. the 8088):
the natural size
- for fitting 25 is a byte
- for 1000 is 2 bytes
- for a million is 3 or 4 bytes, depending on what we mean by 'natural'
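Those "natural sizes" are easy to check in Python (a sketch; `natural_bytes` is my name for it, and it gives the minimal whole-byte count, not the power-of-two word a compiler might round up to):

```python
def natural_bytes(n: int) -> int:
    """Smallest whole number of bytes that can hold n as an unsigned int."""
    return max(1, (n.bit_length() + 7) // 8)

# 25 fits in a byte, 1000 in two, a million in three
# (or four, if 'natural' means padding up to a 32-bit word).
for n in (25, 1000, 1_000_000):
    print(n, natural_bytes(n))
```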

In short, that a € costs more than a $ is a combination of two factors:
- a natural cause -- there are a million chars to encode (let's assume that the
million of Unicode is somehow God-given AS A SET)
- an artificial, political one -- out of the million-factorial permutations of
that million, the one the Unicode consortium chose is the one that satisfies the
equation: keep ASCII users undisturbed and happy
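The $-vs-€ cost is directly observable from Python (stdlib only; `utf8_cost` is just my wrapper name). Because UTF-8 keeps the ASCII block intact, '$' (U+0024) costs one byte, while '€' (U+20AC), sitting further out, pays three:

```python
def utf8_cost(ch: str) -> int:
    """Number of bytes UTF-8 spends on a single character."""
    return len(ch.encode('utf-8'))

# dollar (ASCII), pound, euro: one, two, and three bytes respectively --
# the price of the "keep ASCII users happy" ordering.
for ch in ('$', '\u00a3', '\u20ac'):
    print(ch, hex(ord(ch)), utf8_cost(ch))
```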
> 
> 
> 
> > Which also means that if the Chinese were to have more say in the design
> > of Unicode/ UTF-8 they would likely not waste swathes of prime real-estate
> > for almost never used control characters just in the name of ASCII
> > compliance
> 
> There is this meme going around that Unicode is a Western imperialistic
> conspiracy against Asians. For example, there was a blog post a year or so
> ago by somebody bitterly complaining that he could draw a pile of poo in
> Unicode but not write his own name, blaming Westerners for this horrible
> state of affairs.
> 
> But like most outrage on the Internet, his complaint was nonsense. He *can*
> write his name -- he just has to use a combining character to add an
> accent(?) to a base character. (Or possibly a better analogy is that of a
> ligature.) His complaint came down to the fact that because his name
> included a character which was unusual even in his own language (Bengali),
> he had to use two Unicode code points rather than one to represent it. This
> is, of course, the second worst[1] kind of discrimination.
> 
> https://news.ycombinator.com/item?id=9219162
> 
> Likewise the hoo-har over CJK unification. Some people believe that this is
> the evil Western imperialists forcing their ignorant views on the Chinese,
> Japanese and Koreans, but the reality is that the Unicode Consortium merely
> follows the decisions made by the Ideographic Rapporteur Group (IRG),
> originally the CJK-JRG group. That is a multinational group set up by the
> Chinese and Japanese, now including other East Asians (both Koreas,
> Singapore, Vietnam) to decide on a common set of Han characters.
> 
> Anyway, I digress.
> 
> Given that there are tens of thousands of Han characters (with unification),
> more than will fit in 16 bits, the 64 control characters in Unicode is not
> going to make any practical difference. In some hypothetical world where
> Han speakers got to claim code points U+0000-001F and U+0080-009F for
> ideographs, pushing the control characters out into the astral planes, all
> they would gain is *sixty four* code points. They would still need multiple
> thousands of astral characters.
> 
> Besides, some level of ASCII compatibility is useful even for Han speakers.
> Their own native-designed standard encodings like Big5 and Shift-JIS (which
> predate Unicode) keep byte-compatibility with the 32 ASCII control
> characters. (I'm not sure about the 32 "C1" control characters.) Since the
> Chinese and Japanese national standards pre-dating Unicode choose to keep
> compatibility with the ASCII control characters, I don't think that there
> is any good reason to think they would have made a different decision when
> it came to Unicode had they had more of a say than they already did.
> 
> Which was, and still is, considerable. Both China and Japan are very
> influential in the Unicode Consortium, driving the addition of many new Han
> characters and emoji. The idea that a bunch of Western corporations and
> academics are pushing them around is laughable.
> 
> 
> 
> 
> [1] The worst being that my US English keyboard doesn't have a proper curly
> apostrophe, forcing me to use a straight ' mark in my name like some sort
> of animal.
> 
> -- 
> Steven
