Coding systems are political (was Extended ASCII and code pages)

Rustom Mody rustompmody at gmail.com
Sat May 28 00:46:56 EDT 2016


On Friday, May 27, 2016 at 9:39:19 PM UTC+5:30, Random832 wrote:
> On Fri, May 27, 2016, at 11:53, Rustom Mody wrote:
> > And coding systems are VERY political.
> > Sure what characters are put in (and not) is political
> > But more invisible but equally political is the collating order.
> > 
> > eg No one understands what jmf's gripes are... My guess is that a Euro
> > costs 3 times a Dollar.
> > 
> > >>> "€".encode("UTF-8")
> > b'\xe2\x82\xac'
> > >>> "$".encode("UTF-8")
> > b'$'
> > 
> > [It's another matter that this is not the evil deed of Python but of
> > UTF-8!]
> 
> AIUI jmf's issue is that python's string type (nothing to do with UTF-8)
> doesn't treat all strings equally. Strings that are only in Latin-1
> (including your dollar example) have only one byte per character,
> whereas strings with BMP characters have two bytes per character (he
> also has some more difficult to understand objections to the large fixed
> overhead and the cached UTF-8 version [which ASCII strings don't have])

Yeah, I know, and my choice of the UTF-8 encode example was probably not felicitous.

Consider instead:
>>> ord('$')
36
>>> ord('€')
8364
>>> bin(ord('$'))
'0b100100'
>>> bin(ord('€'))
'0b10000010101100'
>>> 

This shows that '$' costs 6 bits,
whereas '€' costs 14.
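
The same count can be had without reading digits off the bin() output. A minimal sketch, assuming only the built-in int.bit_length(); the helper name bits_needed is mine, purely for illustration:

```python
# Number of binary digits needed to write a character's code point
# (no leading zeros). bits_needed is a hypothetical helper name.
def bits_needed(ch: str) -> int:
    return ord(ch).bit_length()

for ch in "$€":
    print(ch, ord(ch), bits_needed(ch))
```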

In idealized, simplified models like Turing-machine models, where numbers are written in unary:
3 is 111
7 is 1111111
100, 8364 etc. I won't try to write out, but you get the idea!
It's quite clear that bigger numbers cost more than smaller ones.
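
To put rough numbers on that, here is a small sketch comparing the unary cost (one mark per unit) with the binary cost of the same values. The function names are made up for this example, and "cost" here just means the number of symbols written:

```python
# Unary: n is written as n marks, so the cost grows linearly with n.
def unary(n: int) -> str:
    return "1" * n

def unary_cost(n: int) -> int:
    return len(unary(n))

# Binary: the cost is the number of bits, which grows with log2(n).
def binary_cost(n: int) -> int:
    return n.bit_length()

for n in (3, 7, 36, 100, 8364):
    print(n, unary_cost(n), binary_cost(n))
```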

With current hardware the cost would seem to be flat for everything
< 2³² (or even 2⁶⁴)

But that's only an optical illusion, because beyond that point the cost
will rise jaggedly, slowly but monotonically, typically log-linearly
[which AIUI is jmf's principal error]
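
Python's own integers give a rough feel for that stepwise rise, since they have no fixed width. A sketch, assuming CPython, where sys.getsizeof reports the storage of an int object; the exact byte counts are implementation details and differ between versions, and int_sizes is a name invented here:

```python
import sys

# Assumption: CPython. Other implementations may not support sys.getsizeof
# or may report different figures.
def int_sizes(max_bits: int = 256, step: int = 32) -> list[tuple[int, int]]:
    """Return (bit_length, size_in_bytes) pairs for the values 2**k - 1."""
    return [(k, sys.getsizeof((1 << k) - 1))
            for k in range(step, max_bits + 1, step)]

for bits, size in int_sizes():
    print(f"{bits:4d} bits -> {size} bytes")
```

The sizes stay flat over a range of values, then jump by a fixed amount, so the overall trend follows the number of bits rather than the value itself.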

Which also means that if the Chinese were to have more say in the design of
Unicode/UTF-8, they would likely not waste swathes of prime real estate
on almost-never-used control characters just in the name of ASCII compliance.
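
A quick way to see who gets the cheap seats in UTF-8: count the encoded bytes for a few sample characters. The sample set and the helper name utf8_len are my own choices for illustration:

```python
# Bytes needed to encode a single character in UTF-8.
def utf8_len(ch: str) -> int:
    return len(ch.encode("utf-8"))

samples = {
    "NUL (control)": "\x00",
    "BEL (control)": "\x07",
    "dollar sign": "$",
    "euro sign": "€",
    "CJK ideograph": "中",
}

for name, ch in samples.items():
    print(f"{name:15s} U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```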

IOW ANY coding standard makes choices that are essentially political
Unicode just happens to be (currently) politically correct


