Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Feb 26 20:05:10 EST 2015


Rustom Mody wrote:

> Emoticons (or is it emoji) seems to have some (regional?) takeup?? Dunno…
> In any case I'd like to stay clear of political(izable) questions

Emoji is the term used in Japan, gradually spreading to the rest of the
world. Emoticons, I believe, should be restricted to the practice of using
ASCII-only digraphs and trigraphs such as :-) (colon, hyphen, right-parens)
to indicate "smileys".

I believe that emoji will eventually lead to Unicode's victory. People will
want smileys and piles of poo on their mobile phones, and from there it
will gradually spread to everywhere. All they need to do to make victory
inevitable is add cartoon genitals...


>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main
>> point.
> 
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

0.1% of the living is seven million people. I'll tell you what, you tell me
which seven million people should be relegated to second-class status, and
I'll tell them where you live.

:-)


[...]
> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
> 
> So I dont get how 4 bytes is not more expensive than 2.

Obviously it is. But it's only twice as expensive, and in computer science
terms that counts as "close enough". It's quite common for data structures
to "waste" space by using "no more than twice as much space as needed",
e.g. Python dicts and lists.
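
You can watch that over-allocation happen with sys.getsizeof (a rough
sketch; the exact sizes depend on the CPython version and platform, but the
pattern of growth is the same):

    import sys
    lst = []
    for i in range(20):
        lst.append(i)
        # getsizeof reports the allocated size, which grows in jumps
        # larger than strictly needed so that appends stay amortised O(1).
        print(len(lst), sys.getsizeof(lst))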

The whole Unicode range U+0000 to U+10FFFF needs only 21 bits, which fits
into three bytes. Nevertheless, there's no three-byte UTF encoding, because
on modern hardware it is more efficient to "waste" an entire extra byte per
code point and work with a nice power-of-two width of four bytes.
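
You can see the trade-off directly by encoding a BMP character and a
supplementary-plane character with the standard codecs (a minimal sketch):

    # U+0041 is plain ASCII; U+1F4A9 PILE OF POO lies outside the BMP.
    for ch in ("\u0041", "\U0001F4A9"):
        for enc in ("utf-8", "utf-16-le", "utf-32-le"):
            print(repr(ch), enc, len(ch.encode(enc)), "bytes")
    # 'A' takes 1/2/4 bytes; the astral character takes 4/4/4 bytes
    # (UTF-16 spends its 4 bytes on a surrogate pair).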


> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more
> than exclusive BMP.

Are you missing a word there? Costs "no more" perhaps?
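
Assuming you did mean "costs no more": under CPython 3.3's Flexible String
Representation a string only pays for the widest character it actually
contains, which you can see with sys.getsizeof (a rough sketch; the exact
overheads vary by platform and version):

    import sys
    # 1, 2 or 4 bytes per character, depending on the widest code point:
    for s in ("a" * 100, "\u0101" * 100, "\U0001F600" * 100):
        print(sys.getsizeof(s))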


> eg consider the case of 32 vs 64 bit executables.
> The 64 bit executable is generally larger than the 32 bit one
> Now consider the case of a machine that has say 2GB RAM and a 64-bit
> processor. You could -- I think -- make a reasonable case that all those
> all-zero hi-address-words are 'waste'.

Sure. The whole point of 64-bit processors is to enable the use of more than
2GB of RAM. One might as well say that using 32-bit processors is wasteful
if you only have 64K of memory. Yes it is, but the only things which use
16-bit or 8-bit processors these days are embedded devices.
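
If you ever need to check which kind of build you're running, a quick
sketch using the documented pointer-size and maxsize checks:

    import struct, sys
    print(struct.calcsize("P") * 8, "bit pointers")  # 32 or 64
    print(sys.maxsize > 2**32)  # True only on a 64-bit build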


[...] 
> Math-Greek: Consider the math-alpha block
>
> http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
> 
> Now imagine a beginning student not getting the difference between font,
> glyph,
> character.  To me this block represents this same error cast into concrete 
> and dignified by the (supposed) authority of the unicode consortium.

Not being privy to the internal deliberations of the Consortium, I find it
difficult to tell why two symbols are sometimes declared to be mere
different glyphs for the same character, and other times declared to be
worthy of being separate characters.

E.g. I think we should all agree that the English "A" and the French "A"
shouldn't count as separate characters, although the Greek "Α" and
Russian "А" do.
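
They certainly look identical in most fonts, but they are distinct code
points, as a quick check with the standard library shows (a small sketch):

    import unicodedata
    for ch in ("A", "\u0391", "\u0410"):  # Latin, Greek, Cyrillic
        print("U+%04X" % ord(ch), unicodedata.name(ch))
    # U+0041 LATIN CAPITAL LETTER A
    # U+0391 GREEK CAPITAL LETTER ALPHA
    # U+0410 CYRILLIC CAPITAL LETTER A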

In the case of the maths symbols, it isn't obvious to me what the deciding
factors were. I know that one of the considerations they use is to consider
whether or not users of the symbols have a tradition of treating the
symbols as mere different glyphs, i.e. stylistic variations. In this case,
I'm pretty sure that mathematicians would *not* consider:

U+2115 DOUBLE-STRUCK CAPITAL N "ℕ"
U+004E LATIN CAPITAL LETTER N "N"

as mere stylistic variations. If you defined a matrix called ℕ, you would
probably be told off for using the wrong symbol, not for using the wrong
formatting.

On the other hand, I'm not so sure about 

U+210E PLANCK CONSTANT "ℎ"

versus a mere lowercase h (possibly in italic).
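
For what it's worth, Unicode itself gives both of those code points
compatibility decompositions back to the plain Latin letters, which you can
check from Python (a small sketch using the stdlib unicodedata module):

    import unicodedata
    for ch in ("\u2115", "\u210E"):  # double-struck N, Planck constant
        print(unicodedata.name(ch), "->",
              repr(unicodedata.normalize("NFKC", ch)))
    # DOUBLE-STRUCK CAPITAL N -> 'N'
    # PLANCK CONSTANT -> 'h'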


> There are probably dozens of other such stupidities like distinguishing
> kelvin K from latin K as if that is the business of the unicode consortium

But it *is* the business of the Unicode consortium. They have at least two
important aims:

- to be able to represent every possible human-language character;

- to allow lossless round-trip conversion to all existing legacy encodings
  (for the subset of Unicode handled by that encoding).


The second aim is why Unicode includes code points for degree-Celsius and
degree-Fahrenheit, rather than just using °C and °F like sane people:
some idiot^W code-page designer back in the 1980s or 90s decided to add
single characters for ℃ and ℉, so now Unicode has to be able to round-trip
(say) "°C℃" without loss.

I imagine that the same applies to U+212A KELVIN SIGN K.
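
You can see their "compatibility" status from Python as well: NFKC
normalization folds them back to the ordinary characters, and the Celsius
sign still round-trips through the Japanese legacy encodings it was
borrowed from (a small sketch):

    import unicodedata
    for ch in ("\u2103", "\u212A"):  # DEGREE CELSIUS, KELVIN SIGN
        print(unicodedata.name(ch), "->",
              repr(unicodedata.normalize("NFKC", ch)))
    # DEGREE CELSIUS -> '°C'   (U+00B0 followed by an ordinary C)
    # KELVIN SIGN -> 'K'
    print("\u2103".encode("shift_jis"))  # a single legacy code point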


> My real reservations about unicode come from their work in areas that I
> happen to know something about
> 
> Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪
> ♫ is perhaps ok However all this stuff
> http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written
> in staff notation) is inherently 2 dimensional --  multi-voiced,
> multi-staff, chordal

(1) Text can also be two dimensional.
(2) Where you put the symbol on the page is a separate question from whether
or not the symbol exists.


> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari

Hmm, well I love Wikipedia as much as the next guy, but I think that even
Jimmy Wales would suggest that Wikipedia is not a primary source for what
counts as Devanagari vowels. What makes you think that Wikipedia is right
and Unicode is wrong?
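
For the record, here is what the Unicode Character Database itself says the
character is, straight from the stdlib (a small sketch):

    import unicodedata
    ch = "\u0904"
    print(unicodedata.name(ch))      # DEVANAGARI LETTER SHORT A
    print(unicodedata.category(ch))  # Lo (letter, other)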

That's not to say that Unicode hasn't made some mistakes. There are a few
deprecated code points, or code points that have been given the wrong name.
Oops. Mistakes happen.


> Contrariwise an important letter in vedic pronunciation the double-udatta
> is missing
>
> http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

I quote:


    I do not see any need for a "double udaatta". Perhaps "double 
    ANudaatta" is meant here?


I don't know Sanskrit, but if somebody suggested that Unicode doesn't
support English because the important letter "double-oh" (as
in "moon", "spoon", "croon" etc.) was missing, I wouldn't be terribly
impressed. We have a "double-u" letter, why not "double-oh"?


Another quote:

    I should strongly recommend not to hurry with a standardization
    proposal until the text collection of Vedic texts has been finished


In other words, even the experts in Vedic texts don't yet know all the
characters which they may or may not need.



-- 
Steven



