Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Tue Mar 3 23:05:08 EST 2015


On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> 
> > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> >> Wrote something up on why we should stop using ASCII:
> >> >> http://blog.languager.org/2015/02/universal-unicode.html
> >> 
> >> I think that the main point of the post, that many Unicode chars are
> >> truly planetary rather than just national/regional, is excellent.
> > 
> > <snipped>
> > 
> >> You should add emoticons, but not call them or the above 'gibberish'.
> >> I think that this part of your post is more 'unprofessional' than the
> >> character blocks.  It is very jarring and seems contrary to your main
> >> point.
> > 
> > Ok Done
> > 
> > References to gibberish removed from
> > http://blog.languager.org/2015/02/universal-unicode.html
> 
> I consider it unethical to make semantic changes to a published work in
> place without acknowledgement. Fixing minor typos or spelling errors, or
> dead links, is okay. But any edit that changes the meaning should be
> commented on, either by an explicit note on the page itself, or by striking
> out the previous content and inserting the new.

Dunno what you are grumping about…

Anyway the attribution is made more explicit – footnote 5 in
 http://blog.languager.org/2015/03/whimsical-unicode.html.

Note that Terry Reedy, who mainly objected, was already acked earlier.
I've just added one more ack¹.
And JFTR the 'publication' (O how archaic!) is the whole blog, not a single page, just as it is for any other dead-tree publication.

> 
> As for the content of the essay, it is currently rather unfocused.

True.

> It
> appears to be more of a list of "here are some Unicode characters I think
> are interesting, divided into subgroups, oh and here are some I personally
> don't have any use for, which makes them silly" than any sort of discussion
> about the universality of Unicode. That makes it rather idiosyncratic and
> parochial. Why should obscure maths symbols be given more importance than
> obscure historical languages?

Idiosyncratic ≠ parochial


> 
> I think that the universality of Unicode could be explained in a single
> sentence:
> 
> "It is the aim of Unicode to be the one character set anyone needs to
> represent every character, ideogram or symbol (but not necessarily distinct
> glyph) from any existing or historical human language."
> 
> I can expand on that, but in a nutshell that is it.
> 
> 
> You state:
> 
> "APL and Z Notation are two notable languages APL is a programming language
> and Z a specification language that did not tie themselves down to a
> restricted charset ..."

Tsk tsk – dishonest snipping. I wrote

| APL and Z Notation are two notable languages APL is a programming language 
| and Z a specification language that did not tie themselves down to a 
| restricted charset even in the day that ASCII ruled.

so it's clear that the 'restricted charset' refers to ASCII.
> 
> You list ideographs such as Cuneiform under "Icons". They are not icons.
> They are a mixture of symbols used for consonants, syllables, and
> logophonetic, consonantal alphabetic and syllabic signs. That sits them
> firmly in the same categories as modern languages with consonants, ideogram
> languages like Chinese, and syllabary languages like Cheyenne.

Ok, changed to 'iconic'.
Obviously 2-3 millennia ago, when people wrote in hieroglyphs or cuneiform, these were languages.
In 2015, when someone sees and recognizes them, they are 'those things that the
Sumerians/Egyptians wrote'. No one except a rare expert knows those languages.

> 
> Just because native readers of Cuneiform are all dead doesn't make Cuneiform
> unimportant. There are probably more people who need to write Cuneiform
> than people who need to write APL source code.
> 
> You make a comment:
> 
> "To me – a unicode-layman – it looks unprofessional… Billions of computing
> devices world over, each having billions of storage words having their
> storage wasted on blocks such as these??"
> 
> But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
> Why are you so worried about an (illusionary) minor optimization?

2 (bytes) < 4 (bytes) as far as I am concerned.
[If you disagree, one man's illusionary is another's waking.]

> 
> Whether code points are allocated or not doesn't affect how much space they
> take up. There are millions of unused Unicode code points today. If they
> are allocated tomorrow, the space your documents take up will not increase
> one byte.
> 
> Allocating code points to Cuneiform has not increased the space needed by
> Unicode at all. Two bytes alone is not enough for even existing human
> languages (thanks China). For hardware related reasons, it is faster and
> more efficient to use four bytes than three, so the obvious and "dumb" (in
> the simplest thing which will work) way to store Unicode is UTF-32, which
> takes a full four bytes per code point, regardless of whether there are
> 65537 code points or 1114112. That makes it less expensive than floating
> point numbers, which take eight. Would you like to argue that floating
> point doubles are "unprofessional" and wasteful?
> 
> As Dave pointed out, and you apparently agreed with him enough to quote him
> TWICE (once in each of two blog posts), history of computing is full of
> premature optimizations for space. (In fact, some of these may have been
> justified by the technical limitations of the day.) Technically Unicode is
> also limited, but it is limited to over one million code points, 1114112 to
> be exact, although some of them are reserved as invalid for technical
> reasons, and there is no indication that we'll ever run out of space in
> Unicode.
> 
> In practice, there are three common Unicode encodings that nearly all
> Unicode documents will use.
> 
> * UTF-8 will use between one and (by memory) four bytes per code 
>   point. For Western European languages, that will be mostly one 
>   or two bytes per character.
> 
> * UTF-16 uses a fixed two bytes per code point in the Basic Multilingual 
>   Plane, which is enough for nearly all Western European writing and 
>   much East Asian writing as well. For the rest, it uses a fixed four 
>   bytes per code point.
> 
> * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses 
>   this as a storage format.
> 
> 
> In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
> doesn't change the space used. If you actually include a few hieroglyphs to
> your document, the space increases only by the actual space used by those
> hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
> single hieroglyph in your document force you to expand the non-hieroglyph
> characters to use more space.
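
For the record, here is what that arithmetic looks like in Python 3 (3.3 or
later; the particular hieroglyph is picked only for illustration):

    # Per-code-point cost of the three UTFs, with and without an SMP character.
    bmp_text = "café"                        # BMP-only text
    smp_text = bmp_text + "\U00013000"       # plus one Egyptian hieroglyph (SMP)

    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        base  = len(bmp_text.encode(name))
        extra = len(smp_text.encode(name)) - base
        print(name, base, "bytes; the hieroglyph adds", extra, "more")

The hieroglyph adds exactly 4 bytes in every case; none of the existing
characters grow.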
> 
> 
> > What I was trying to say expanded here
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> 
> You have at least two broken links, referring to a non-existent page:
> 
> http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

Thanks corrected

> 
> This essay seems to be even more rambling and unfocused than the first. What
> does the cost of semi-conductor plants have to do with whether or not
> programmers support Unicode in their applications?
> 
> Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
> Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
> it isn't so silly. If your text begins with the UTF-8 mark, treat it as
> UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
> or text editor's encoding cookies.
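
For reference, Python's stdlib already reads the mark that way: the
'utf-8-sig' codec strips the signature on decode and adds it on encode. A
minimal sketch (the sniffing line at the end is just one possible heuristic):

    # The UTF-8 "BOM" used purely as an encoding signature.
    raw = b"\xef\xbb\xbfhello"            # a file that starts with the UTF-8 mark

    print(repr(raw.decode("utf-8-sig")))  # 'hello'        – signature stripped
    print(repr(raw.decode("utf-8")))      # '\ufeffhello'  – plain utf-8 keeps it

    # A crude sniffing heuristic, much like an HTML charset tag or editor cookie:
    encoding = "utf-8-sig" if raw.startswith(b"\xef\xbb\xbf") else "utf-8"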
> 
> Your discussion of "complexifiers and simplifiers" doesn't seem to be
> terribly relevant, or at least if it is relevant, you don't give any reason
> for it. The whole thing about Moore's Law and the cost of semi-conductor
> plants seems irrelevant to Unicode except in the most over-generalised
> sense of "things are bigger today than in the past, we've gone from
> five-bit Baudot codes to 21-bit Unicode". Yeah, okay. So what's your point?

- Most people need only 16 bits.
- Many notable examples of software fail going from 16 to 21 bits.
- If you are a software writer and you fail going from 16 to 21, it's ok, but
  try to give useful errors (a sketch of one such check follows below).
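
Something along these lines – the function name and wording are mine, purely
illustrative – beats silent corruption or a cryptic traceback:

    # Hypothetical guard for a BMP-only backend: fail loudly and usefully
    # instead of mangling data when an astral (non-BMP) character turns up.
    def check_bmp_only(text):
        for i, ch in enumerate(text):
            if ord(ch) > 0xFFFF:
                raise ValueError(
                    "character %r (U+%05X) at index %d is outside the BMP; "
                    "this store only handles 16-bit code points"
                    % (ch, ord(ch), i))
        return text

    check_bmp_only("plain BMP text is fine")
    # check_bmp_only("astral \U0001F600")   # ValueError with a useful message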

> 
> You agree that 16 bits are not enough, and yet you criticise Unicode for using
> more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is
> an inconsistent position to take.

| ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support – 
| ASCII.  BMP-only Unicode is universal enough but within practical limits 
| whereas full (7.0) Unicode is 'really' universal at a cost of performance and 
| whimsicality.

Do you disagree that BMP-only = 16 bits?

> 
> UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.
> 
> The problem is when your language treats UTF-16 as a fixed-width two-byte
> format instead of a variable-width, two- or four-byte format. (That's more
> or less like the old, obsolete, UCS-2 standard.) There are all sorts of
> good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
> If some programming languages or software fails to do so, they are buggy,
> not UTF-16.
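
A small Python 3 (3.3+) illustration of that distinction:

    # UTF-16 is variable width: BMP code points take 2 bytes, SMP ones take 4
    # (a surrogate pair).  A correct decoder rejoins the pair into one character.
    ch = "\U0001F600"                        # an SMP character
    data = ch.encode("utf-16-le")

    print(len(data))                         # 4 bytes, i.e. two 16-bit code units
    print(len(data.decode("utf-16-le")))     # 1 character – the pair rejoined

    # Treating those 4 bytes as two independent 2-byte characters is the
    # UCS-2-style bug; that flaw is in such code, not in UTF-16.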
> 
> After explaining that 16 bits are not enough, you then propose a 16 bit
> standard. /face-palm
> 
> UTF-16 cannot break the fixed width invariant, because it has no fixed width
> invariant. That's like arguing against UTF-8 because it breaks the fixed
> width invariant "all characters are single byte ASCII characters".
> 
> If you cannot handle SMP characters, you are not supporting Unicode.


Not supporting Unicode 7.0, that is.

> 
> 
> You suggest that Chinese users should be looking at Big5 or GB. I really,
> really don't think so.
> 
> - Neither is universal. What makes you think that Chinese writers need 
>   to use maths symbols, or include (say) Thai or Russian in their work 
>   any less than Western writers do?
> 
> - Neither even support all of Chinese. Big5 supports Traditional 
>   Chinese, but not Simplified Chinese. GB supports Simplified 
>   Chinese, but not Traditional Chinese. 
> 
> - Big5 likewise doesn't support placenames, many people's names, and
>   other less common parts of Chinese.
> 
> - Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
>   of data corruption issues.
> 
> - There is no one single Big5 standard, but a whole lot of vendor 
>   extensions.
> 
> 
> You say:
> 
> "I just want to suggest that the Unicode consortium going overboard in
> adding zillions of codepoints of nearly zero usefulness, is in fact
> undermining unicode’s popularity and spread."
> 
> Can you demonstrate this? Can you show somebody who says "Well, I was going
> to support full Unicode, but since they added a snowman, I'm going to stick
> to ASCII"?

I gave a list of software packages which goof/break going from BMP-only to full (7.0) Unicode.
> 
> The "whimsical" characters you are complaining about were important enough
> to somebody to spend significant amounts of time and money to write up a
> proposal, have it go through the Unicode Consortium bureaucracy, and
> eventually have it accepted. That's not easy or cheap, and people didn't
> add a snowman on a whim. They did it because there are a whole lot of
> people who want a shared standard for map symbols.
> 
> It is easy to mock what is not important to you. I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

The head para of section 5 has:
| However (the following) are (in the standard)! So lets use them!
Does that look like mocking to you?

The only mocking is at 5.1. And even there I don't mock the users of these blocks
– now or millennia ago. I only mock the Unicode consortium for putting them into
Unicode.

----------------------
¹ And somewhere around here we get into Gödelian problems – known to programmers
in the form "Write a program that prints itself". Likewise acks.
I am going to deal with the Gödel-loop by this device:
- Address real issues/objections
- Smile at grumpiness


