Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Thu Feb 26 17:13:57 EST 2015


On Fri, Feb 27, 2015 at 4:59 AM, Rustom Mody <rustompmody at gmail.com> wrote:
> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main point.
>
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

So what, seven million people need it? Sounds pretty useful to me. And
your figure is an exaggeration; a lot more people than that use
emoji/emoticons.

>> > Also, how does assigning meanings to codepoints "waste storage"? As
>> > soon as Unicode 2.0 hit and 16-bit code units stopped being
>> > sufficient, everyone needed to allocate storage - either 32 bits per
>> > character, or some other system - and the fact that some codepoints
>> > were unassigned had absolutely no impact on that. This is decidedly
>> > NOT unprofessional, and it's not wasteful either.
>>
>> I agree.
>
> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
>
> So I dont get how 4 bytes is not more expensive than 2.
> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more than
> exclusive BMP.

Sure, UCS-2 is cheaper than the current Unicode spec. But Unicode 2.0
was when that changed, and the change was because 65536 characters
clearly wouldn't be enough - and that was due to the number of
characters needed for other things than those you're complaining
about. Every spec since then has not changed anything that affects
storage. There are still, today, quite a lot of unallocated blocks of
characters (we're really using only about two planes' worth so far,
maybe three), but even if Unicode specified just two planes of 64K
characters each, you wouldn't be able to save much on transmission
(UTF-8 is already flexible and uses only what you need; if a future
Unicode spec allows 64K planes, UTF-8 transmission will cost exactly
the same for all existing characters), and on an eight-bit-byte
system, the very best you'll be able to do is three bytes - which you
can do today, too; you already know 21 bits will do. So since the BMP
was proven insufficient (back in 1996), no subsequent changes have had
any costs in storage.

> Still if I were to expand on the criticisms here are some examples:
>
> Math-Greek: Consider the math-alpha block
> http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
>
> Now imagine a beginning student not getting the difference between font, glyph,
> character.  To me this block represents this same error cast into concrete and
> dignified by the (supposed) authority of the unicode consortium.
>
> There are probably dozens of other such stupidities like distinguishing kelvin K from latin K as if that is the business of the unicode consortium

A lot of these kinds of characters come from a need to unambiguously
transliterate text stored in other encodings. I don't personally
profess to understand the reasoning behind the various
indistinguishable characters, but I'm aware that there are a lot of
tricky questions to be decided; and if once the Consortium decides to
allocate a character, that character must remain forever allocated.

> My real reservations about unicode come from their work in areas that I happen to know something about
>
> Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪ ♫ is perhaps ok
> However all this stuff http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written in staff notation) is inherently 2 dimensional --  multi-voiced, multi-staff, chordal

The placement on the page is up to the display library. You can
produce a PDF that places the note symbols at their correct positions,
and requires no images to render sheet music.

> Sanskrit/Devanagari:
> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari
>
> Contrariwise an important letter in vedic pronunciation the double-udatta is missing
> http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html
>
> All of which adds up to the impression that the unicode consortium occasionally fails to do due diligence

Which proves that they're not perfect. Don't forget, they can always
add more characters later.

ChrisA



More information about the Python-list mailing list