Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Thu Feb 26 12:59:05 EST 2015


On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> Wrote something up on why we should stop using ASCII:
> >> http://blog.languager.org/2015/02/universal-unicode.html
> 
> I think that the main point of the post, that many Unicode chars are 
> truly planetary rather than just national/regional, is excellent.
> 
> > From that post:
> >
> > """
> > 5.1 Gibberish
> >
> > When going from the original 2-byte unicode (around version 3?) to the
> > one having supplemental planes, the unicode consortium added blocks
> > such as
> >
> > * Egyptian hieroglyphs
> > * Cuneiform
> > * Shavian
> > * Deseret
> > * Mahjong
> > * Klingon
> >
> > To me (a layman) it looks unprofessional – as though they are playing
> > games – that billions of computing devices, each having billions of
> > storage words should have their storage wasted on blocks such as
> > these.
> > """
> >
> > The shift from Unicode as a 16-bit code to having multiple planes came
> > in with Unicode 2.0, but the various blocks were assigned separately:
> > * Egyptian hieroglyphs: Unicode 5.2
> > * Cuneiform: Unicode 5.0
> > * Shavian: Unicode 4.0
> > * Deseret: Unicode 3.1
> > * Mahjong Tiles: Unicode 5.1
> > * Klingon: Not part of any current standard
> 
> You should add emoticons, but not call them or the above 'gibberish'.


Emoticons (or is it emoji?) seem to have some (regional?) uptake. I don't know.
In any case I'd like to steer clear of political (or politicizable) questions.


> I think that this part of your post is more 'unprofessional' than the 
> character blocks.  It is very jarring and seems contrary to your main point.

Ok, I need a word for:
1. I have no need for this
2. 99.9% of the people (living) on this planet also have no need for this

> 
> > However, I don't think historians will appreciate you calling all of
> > these "gibberish". To adequately describe and discuss old texts
> > without these Unicode blocks, we'd have to either do everything with
> > images, or craft some kind of reversible transliteration system and
> > have dedicated software to render the texts on screen. Instead, what
> > we have is a well-known and standardized system for transliterating
> > all of these into numbers (code points), and rendering them becomes a
> > simple matter of installing an appropriate font.
> >
> > Also, how does assigning meanings to codepoints "waste storage"? As
> > soon as Unicode 2.0 hit and 16-bit code units stopped being
> > sufficient, everyone needed to allocate storage - either 32 bits per
> > character, or some other system - and the fact that some codepoints
> > were unassigned had absolutely no impact on that. This is decidedly
> > NOT unprofessional, and it's not wasteful either.
> 
> I agree.

I am clearly more enthusiastic than knowledgeable about Unicode.
But I know my basic CS well enough (as I am sure you and Chris also do).

So I don't get how 4 bytes is not more expensive than 2.
Yes, I know you can squeeze a Unicode char into 3 bytes, or even 21 bits,
and you can use a clever representation like UTF-8 or the FSR.
But I don't see how you escape the conclusion that full Unicode costs more
than BMP-only Unicode.
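
To make that concrete with CPython's own FSR (a quick sketch; exact byte
counts vary a little across versions and platforms):

import sys

# PEP 393 (the FSR): each str is stored with 1, 2 or 4 bytes per
# character, depending on the widest codepoint it contains.
ascii_s  = 'a' * 100           # codepoints < 256    -> 1 byte/char
bmp_s    = '\u0905' * 100      # BMP codepoint       -> 2 bytes/char
astral_s = '\U00013000' * 100  # beyond the BMP      -> 4 bytes/char

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))

A single astral character in an otherwise-BMP string doubles the storage
of the whole string, because the widest codepoint sets the rate for all
of them.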

E.g. consider the case of 32- vs 64-bit executables.
The 64-bit executable is generally larger than the 32-bit one.
Now consider a machine that has, say, 2GB RAM and a 64-bit processor.
You could -- I think -- make a reasonable case that all those all-zero
high-address words are 'waste'.
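
(A toy illustration of the analogy; the address value below is made up:)

import struct

# On a 64-bit build every pointer occupies 8 bytes, yet on a 2GB
# machine all valid addresses fit in 4 -- so half of every pointer
# is permanently zero.
print(struct.calcsize('P'))           # 8 on a 64-bit build, 4 on 32-bit

addr = 0x7FFF0000                     # some address below the 2GB mark
print(struct.pack('<Q', addr).hex())  # trailing (high-order) bytes: all zero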

And you've captured the general sense best so far:
> I think that the main point of the post, that many Unicode chars are
> truly planetary rather than just national/regional, 

If the general tone/tenor of what I have written is not getting across
because of some words (like 'gibberish'?), then I'll try to reword.

However, let me try to clarify: the whole of section 5 is 'iffy', with 5.1
only being the most extreme part. I've not spelled these criticisms out
because the point of that post is not to criticise Unicode but to highlight
the universal(isable) parts.

Still, if I were to expand on the criticisms, here are some examples:

Math-Greek: Consider the math-alpha block:
http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block

Now imagine a beginning student not getting the difference between font,
glyph, and character. To me this block represents that same error cast in
concrete and dignified by the (supposed) authority of the Unicode consortium.

There are probably dozens of other such stupidities, like distinguishing the
Kelvin sign K (U+212A) from the Latin K, as if that were the business of the
Unicode consortium.
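
In fact NFKC compatibility normalization folds both of these cases back
onto the plain Latin letters, which rather suggests the distinction is one
of font, not character. A quick check (runnable in any Python 3):

import unicodedata as ud

# Both the math-bold A and the Kelvin sign carry compatibility
# decompositions back to the ordinary Latin letters.
for ch in ('\U0001D400',   # MATHEMATICAL BOLD CAPITAL A
           '\u212A'):      # KELVIN SIGN
    print(ud.name(ch), '->', ud.normalize('NFKC', ch))

which prints plain 'A' and 'K' on the right-hand side.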

My real reservations about Unicode come from their work in areas that I
happen to know something about.

Music: To render music simply as a few mostly-meaningless 'dingbats' like
♩ ♪ ♫ is perhaps OK. However, all this stuff --
http://xahlee.info/comp/unicode_music_symbols.html --
makes no sense (to me), given that music (i.e. standard Western music
written in staff notation) is inherently two-dimensional: multi-voiced,
multi-staff, chordal.
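
A small sample of that block, with names from Python's unicodedata (just a
sketch to show what is and isn't there):

import unicodedata as ud

# The Musical Symbols block (U+1D100..U+1D1FF) encodes clefs, notes
# and barlines as standalone codepoints -- but nothing in it can say
# which staff or voice a symbol sits on, or that two notes sound
# together as a chord.
for cp in (0x1D11E, 0x1D15F, 0x1D100):
    print(hex(cp), ud.name(chr(cp)))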

Sanskrit/Devanagari:
The block contains bogus letters that don't exist in Devanagari.
The letter ऄ (U+0904) is found here: http://unicode.org/charts/PDF/U0900.pdf
but not here: http://en.wikipedia.org/wiki/Devanagari#Vowels
So I call it bogus-Devanagari.
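
(And the character is fully installed in the standard -- e.g. in Python:)

import unicodedata

# U+0904 has an official Unicode name and properties, whatever its
# standing in the traditional vowel inventory.
print(unicodedata.name('\u0904'))   # DEVANAGARI LETTER SHORT A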

Contrariwise, an important letter in Vedic pronunciation, the double udatta,
is missing:
http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

All of which adds up to the impression that the Unicode consortium
occasionally fails to do due diligence.

In any case, all of the above is contrary to (or irrelevant to) my post,
which is about identifying the more universal parts of Unicode.


