Newbie question about text encoding

Marko Rauhamaa marko at pacujo.net
Sat Mar 7 06:53:10 EST 2015


Steven D'Aprano <steve+comp.lang.python at pearwood.info>:

> Rustom Mody wrote:
>> My conclusion: Early adopters of unicode -- Windows and Java -- were
>> punished for their early adoption. You can blame the unicode
>> consortium, you can blame the babel of human languages, particularly
>> that some use characters and some only (the equivalent of) what we
>> call words.
>
> I see you are blaming everyone except the people actually to blame.

I don't think you need to blame anybody. I think the UCS-2 mistake was
both deplorable and very understandable. At the time it looked like the
magic bullet to get out of the 8-bit mess. While 16-bit-wide wchar_t's
looked like a steep price, it was deemed forward-looking to
pay it anyway to resolve the character set problem once and for all.

Linux was lucky to join the fray late enough to benefit from the bad
UCS-2 experience. That said, UTF-8 does suffer badly from not being a
bijective mapping: not every byte sequence is valid UTF-8, so arbitrary
bytes cannot always be decoded to text and back.
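
To make the non-bijectivity concrete, here is a minimal Python 3 sketch.
The helper name is_valid_utf8 is made up for illustration. It shows that
some byte strings have no strict UTF-8 decoding, and that the
"surrogateescape" error handler is one way to carry such bytes through a
str without losing them:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


valid = "é".encode("utf-8")   # b'\xc3\xa9', a well-formed two-byte sequence
stray = b"\xe9"               # Latin-1 'é': a lead byte with no continuation
overlong = b"\xc0\xaf"        # overlong form of '/', rejected by the decoder

# surrogateescape maps each undecodable byte to a lone surrogate on
# decoding and restores the original byte on encoding.
roundtrip = stray.decode("utf-8", "surrogateescape").encode(
    "utf-8", "surrogateescape"
)
```

Here is_valid_utf8(valid) is true, the other two are false, and
roundtrip equals stray again.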

(Linux didn't quite dodge the bullet with pthreads, threads being
another sad fad of the 1990s. The hippies who cooked up the fork
system call should be awarded the next Millennium Prize. That foresight
or stroke of luck has withstood the challenge of half a century.)

> But there's nothing wrong with the design of the SMP. It allows the
> great majority of text, probably 99% or more, to use two bytes
> (UTF-16) or no more than three bytes (UTF-8), while only relatively
> specialised uses need four bytes for some code points.

The main dream was a fixed-width encoding scheme. People thought 16 bits
would be enough. The dream is so precious and true to us in the West
that people don't want to give it up.
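
As a rough illustration of why 16 bits turned out not to be enough, the
following Python 3 sketch counts the bytes one code point takes in each
encoding. The function name encoded_widths is an assumption of mine, and
the "-le" codec variants are used only to keep the byte order mark out
of the count:

```python
def encoded_widths(ch: str) -> dict:
    """Bytes needed for one code point in each encoding (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# ASCII, Latin-1 range, rest of the BMP, and a code point beyond the BMP
for ch in ("A", "é", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_widths(ch))
```

The last sample lies outside the Basic Multilingual Plane, so UTF-16
needs a surrogate pair (four bytes) for it, and the fixed-width
property is gone.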

It may yet be that UTF-32 replaces all previous schemes since it has all
the benefits of ASCII and only one drawback: redundancy. Maybe one day
we'll declare the byte 32 bits wide and be done with it. In so many
other respects, 32-bit "bytes" are the de facto reality already. Even C
coders routinely use 32 bits to express boolean values.
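
The ASCII-like benefit meant here is presumably constant-time indexing.
A small Python 3 sketch, with code_point_at as a hypothetical helper,
shows that in UTF-32 the n-th code point sits at byte offset 4*n:

```python
def code_point_at(buf: bytes, n: int) -> str:
    """Return the n-th code point of a UTF-32-LE buffer by offset arithmetic."""
    return buf[4 * n : 4 * n + 4].decode("utf-32-le")


text = "a€\U0001F600z"
buf = text.encode("utf-32-le")   # four bytes per code point, no BOM
third = code_point_at(buf, 2)    # no scanning of the preceding bytes
```

Note that this indexes code points, not what a reader perceives as
characters; a base letter followed by a combining mark still occupies
two slots.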

> And when Roy's customers demand that his product support emoji, or
> complain that they cannot spell their own name because of his
> parochial and ignorant idea of "crap", perhaps he will consider doing
> what he should have done from the beginning:

That's a recurring theme: Why didn't we do IPv6 from the get-go? Why
didn't we do multi-user from the get-go? Why didn't we do localization
from the get-go?

There comes a point when you have to release to start making money. You
then suffer the consequences until your company goes bankrupt.


Marko


