Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Mar 3 21:54:27 EST 2015


Rustom Mody wrote:

> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
>> >> Wrote something up on why we should stop using ASCII:
>> >> http://blog.languager.org/2015/02/universal-unicode.html
>> 
>> I think that the main point of the post, that many Unicode chars are
>> truly planetary rather than just national/regional, is excellent.
> 
> <snipped>
> 
>> You should add emoticons, but not call them or the above 'gibberish'.
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main
>> point.
> 
> Ok Done
> 
> References to gibberish removed from
> http://blog.languager.org/2015/02/universal-unicode.html

I consider it unethical to make semantic changes to a published work in
place without acknowledgement. Fixing minor typos or spelling errors, or
dead links, is okay. But any edit that changes the meaning should be
commented on, either by an explicit note on the page itself, or by striking
out the previous content and inserting the new.

As for the content of the essay, it is currently rather unfocused. It
appears to be more of a list of "here are some Unicode characters I think
are interesting, divided into subgroups, oh and here are some I personally
don't have any use for, which makes them silly" than any sort of discussion
about the universality of Unicode. That makes it rather idiosyncratic and
parochial. Why should obscure maths symbols be given more importance than
obscure historical languages?

I think that the universality of Unicode could be explained in a single
sentence:

"It is the aim of Unicode to be the one character set anyone needs to
represent every character, ideogram or symbol (but not necessarily distinct
glyph) from any existing or historical human language."

I can expand on that, but in a nutshell that is it.


You state:

"APL and Z Notation are two notable languages APL is a programming language
and Z a specification language that did not tie themselves down to a
restricted charset ..."


but I don't think that is correct. I'm pretty sure that neither APL nor Z
allowed you to define new characters. They might not have used ASCII alone,
but they still had a restricted character set. It was merely less
restricted than ASCII.

You make a comment about Cobol's relative unpopularity, but (1) Cobol
doesn't require you to write out numbers as English words, and (2) Cobol is
still in use: there are uncounted billions of lines of Cobol code in
production, and even if there are fewer Cobol programmers now than there
were 16 years ago, there are still a lot of them. Academics and FOSS
programmers don't think much of Cobol, but it has to count as one of the
most amazing success stories in the field of programming languages, despite
its lousy design.

You list ideographs such as Cuneiform under "Icons". They are not icons.
They are a mixture of logophonetic, consonantal-alphabetic and syllabic
signs. That sits them firmly in the same categories as modern consonantal
alphabets, ideographic scripts like Chinese, and syllabaries like Cherokee.

Just because native readers of Cuneiform are all dead doesn't make Cuneiform
unimportant. There are probably more people who need to write Cuneiform
than people who need to write APL source code.

You make a comment:

"To me – a unicode-layman – it looks unprofessional… Billions of computing
devices world over, each having billions of storage words having their
storage wasted on blocks such as these??"

But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
Why are you so worried about an (illusory) minor optimization?

Whether code points are allocated or not doesn't affect how much space they
take up. There are millions of unused Unicode code points today. If they
are allocated tomorrow, the space your documents take up will not increase
one byte.

Allocating code points to Cuneiform has not increased the space needed by
Unicode at all. Two bytes alone are not enough even for existing human
languages (thanks, China). For hardware-related reasons, it is faster and
more efficient to use four bytes than three, so the obvious and "dumb" (in
the sense of the simplest thing that will work) way to store Unicode is
UTF-32, which takes a full four bytes per code point, regardless of whether
there are 65537 code points or 1114112. That makes it less expensive than
floating point numbers, which take eight bytes each. Would you like to argue
that floating point doubles are "unprofessional" and wasteful?
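
If you want to see that for yourself, here is a quick check in Python 3. I
am assuming U+50000 is still an unassigned code point, but the result is the
same either way, which is rather the point:

>>> len(chr(0x41).encode('utf-32-le'))     # LATIN CAPITAL LETTER A
4
>>> len(chr(0x12000).encode('utf-32-le'))  # CUNEIFORM SIGN A
4
>>> len(chr(0x50000).encode('utf-32-le'))  # an (assumed) unassigned code point
4

Assigned or unassigned, ancient or modern, every code point costs exactly
four bytes in UTF-32.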

As Dave pointed out, and you apparently agreed with him enough to quote him
TWICE (once in each of two blog posts), the history of computing is full of
premature optimizations for space. (In fact, some of these may have been
justified by the technical limitations of the day.) Technically Unicode is
also limited, but it is limited to over one million code points, 1114112 to
be exact. Some of them are reserved as invalid for technical reasons, but
there is no indication that we'll ever run out of space in Unicode.

In practice, there are three common Unicode encodings that nearly all
Unicode documents will use.

* UTF-8 uses between one and four bytes per code point. For Western 
  European languages, that will be mostly one or two bytes per character.

* UTF-16 uses a fixed two bytes per code point in the Basic Multilingual 
  Plane, which is enough for nearly all Western European writing and 
  much East Asian writing as well. For the rest, it uses a fixed four 
  bytes per code point.

* UTF-32 uses a fixed four bytes per code point. Hardly anyone uses 
  this as a storage format.
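
To put rough numbers on that, here is a quick Python 3 comparison. (The
exact counts obviously depend on the sample text; I use the little-endian
codecs so that no BOM is counted.)

for name, text in [('English', 'hello world'),
                   ('Western European', 'h\xe9llo w\xf6rld'),
                   ('Cuneiform', '\U00012000\U00012001')]:
    print(name, len(text), 'code points:',
          len(text.encode('utf-8')), 'bytes in UTF-8,',
          len(text.encode('utf-16-le')), 'in UTF-16,',
          len(text.encode('utf-32-le')), 'in UTF-32')

That reports 11/22/44 bytes for the English sample, 13/22/44 for the
accented one, and 8/8/8 for the two cuneiform signs.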


In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
doesn't change the space used. If you actually include a few hieroglyphs in
your document, the space increases only by the actual space used by those
hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
single hieroglyph in your document force you to expand the non-hieroglyph
characters to use more space.


> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html

You have at least two broken links, referring to a non-existent page:

http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

This essay seems to be even more rambling and unfocused than the first. What
does the cost of semi-conductor plants have to do with whether or not
programmers support Unicode in their applications?

Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
it isn't so silly. If your text begins with the UTF-8 mark, treat it as
UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
or text editors' encoding cookies.
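
Treating it as a signature takes only a few lines of code. Here is a rough
sketch (the function name and the fallback encoding are just placeholders of
my choosing); Python's own 'utf-8-sig' codec does much the same job:

import codecs

def read_text(path, fallback='latin-1'):
    # If the file starts with the UTF-8 signature, decode it as UTF-8
    # (dropping the signature); otherwise fall back to some other guess.
    with open(path, 'rb') as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode('utf-8')
    return raw.decode(fallback)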

Your discussion of "complexifiers and simplifiers" doesn't seem to be
terribly relevant, or at least if it is relevant, you don't give any reason
for it. The whole thing about Moore's Law and the cost of semi-conductor
plants seems irrelevant to Unicode except in the most over-generalised
sense of "things are bigger today than in the past, we've gone from
five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

You agree that 16 bits are not enough, and yet you criticise Unicode for
using more than 16 bits on wasteful, whimsical gibberish like Cuneiform? That is
an inconsistent position to take.

UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.

The problem is when your language treats UTF-16 as a fixed-width two-byte
format instead of a variable-width, two- or four-byte format. (That's more
or less like the old, obsolete, UCS-2 standard.) There are all sorts of
good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
If some programming language or piece of software fails to do so, that is a
bug in the language or software, not a flaw in UTF-16.
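
For what it's worth, Python 3 itself gets this right:

>>> s = '\U0001D11E'              # MUSICAL SYMBOL G CLEF, outside the BMP
>>> len(s)                        # one code point, as it should be
1
>>> len(s.encode('utf-16-le'))    # a surrogate pair: four bytes in UTF-16
4
>>> len(s.encode('utf-8'))        # and four bytes in UTF-8 too
4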

After explaining that 16 bits are not enough, you then propose a 16 bit
standard. /face-palm

UTF-16 cannot break the fixed-width invariant, because it has no fixed-width
invariant. That's like arguing against UTF-8 because it breaks the
fixed-width invariant "all characters are single-byte ASCII characters".

If you cannot handle SMP characters, you are not supporting Unicode.


You suggest that Chinese users should be looking at Big5 or GB. I really,
really don't think so.

- Neither is universal. What makes you think that Chinese writers need 
  to use maths symbols, or include (say) Thai or Russian in their work 
  any less than Western writers do?

- Neither even supports all of Chinese. Big5 supports Traditional 
  Chinese, but not Simplified Chinese. GB supports Simplified 
  Chinese, but not Traditional Chinese. 

- Big5 likewise fails to cover many placenames, personal names, and other
  less common Chinese characters.

- Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
  of data corruption issues.

- There is no one single Big5 standard, but a whole lot of vendor 
  extensions.
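
Don't take my word for it: Python makes it trivial to test whether a legacy
codec can represent a given piece of text. Here is a tiny sketch; try it
with your own sample characters, since I am deliberately not asserting
exactly which ones each codec rejects:

def can_encode(text, encoding):
    # True if `encoding` can represent every character in `text`.
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

for encoding in ('big5', 'gb2312', 'utf-8'):
    for sample in ('\u0e01',       # THAI CHARACTER KO KAI
                   '\u0416',       # CYRILLIC CAPITAL LETTER ZHE
                   '\u2200'):      # FOR ALL (a maths symbol)
        print(encoding, repr(sample), can_encode(sample, encoding))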


You say:

"I just want to suggest that the Unicode consortium going overboard in
adding zillions of codepoints of nearly zero usefulness, is in fact
undermining unicode’s popularity and spread."

Can you demonstrate this? Can you show somebody who says "Well, I was going
to support full Unicode, but since they added a snowman, I'm going to stick
to ASCII"?

The "whimsical" characters you are complaining about were important enough
to somebody to spend significant amounts of time and money to write up a
proposal, have it go through the Unicode Consortium bureaucracy, and
eventually have it accepted. That's not easy or cheap, and people didn't
add a snowman on a whim. They did it because there are a whole lot of
people who want a shared standard for map symbols.

It is easy to mock what is not important to you. I daresay kids adding emoji
to their 10 character tweets would mock all the useless maths symbols in
Unicode too.




-- 
Steven



