Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Tue Mar 3 23:16:02 EST 2015


On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote:
> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> > Rustom Mody wrote:
> > 
> > > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> > >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> > >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> > >> >> Wrote something up on why we should stop using ASCII:
> > >> >> http://blog.languager.org/2015/02/universal-unicode.html
> > >> 
> > >> I think that the main point of the post, that many Unicode chars are
> > >> truly planetary rather than just national/regional, is excellent.
> > > 
> > > <snipped>
> > > 
> > >> You should add emoticons, but not call them or the above 'gibberish'.
> > >> I think that this part of your post is more 'unprofessional' than the
> > >> character blocks.  It is very jarring and seems contrary to your main
> > >> point.
> > > 
> > > Ok Done
> > > 
> > > References to gibberish removed from
> > > http://blog.languager.org/2015/02/universal-unicode.html
> > 
> > I consider it unethical to make semantic changes to a published work in
> > place without acknowledgement. Fixing minor typos or spelling errors, or
> > dead links, is okay. But any edit that changes the meaning should be
> > commented on, either by an explicit note on the page itself, or by striking
> > out the previous content and inserting the new.
> 
> Dunno what you are grumping about…
> 
> Anyway the attribution is made more explicit – footnote 5 in
>  http://blog.languager.org/2015/03/whimsical-unicode.html.
> 
> Note that Terry Reedy, who mainly objected, was already acked earlier.
> I've just added one more ack¹.
> And JFTR the 'publication' (O how archaic!) is the whole blog, not a single page, just as it is for any dead-tree publication.
> 
> > 
> > As for the content of the essay, it is currently rather unfocused.
> 
> True.
> 
> > It
> > appears to be more of a list of "here are some Unicode characters I think
> > are interesting, divided into subgroups, oh and here are some I personally
> > don't have any use for, which makes them silly" than any sort of discussion
> > about the universality of Unicode. That makes it rather idiosyncratic and
> > parochial. Why should obscure maths symbols be given more importance than
> > obscure historical languages?
> 
> Idiosyncratic ≠ parochial
> 
> 
> > 
> > I think that the universality of Unicode could be explained in a single
> > sentence:
> > 
> > "It is the aim of Unicode to be the one character set anyone needs to
> > represent every character, ideogram or symbol (but not necessarily distinct
> > glyph) from any existing or historical human language."
> > 
> > I can expand on that, but in a nutshell that is it.
> > 
> > 
> > You state:
> > 
> > "APL and Z Notation are two notable languages APL is a programming language
> > and Z a specification language that did not tie themselves down to a
> > restricted charset ..."
> 
> Tsk Tsk – dishonest snipping. I wrote
> 
> | APL and Z Notation are two notable languages APL is a programming language 
> | and Z a specification language that did not tie themselves down to a 
> | restricted charset even in the day that ASCII ruled.
> 
> so it's clear that 'restricted' refers to ASCII.
> > 
> > You list ideographs such as Cuneiform under "Icons". They are not icons.
> > They are a mixture of symbols used for consonants, syllables, and
> > logophonetic, consonantal alphabetic and syllabic signs. That sits them
> > firmly in the same categories as modern languages with consonants, ideogram
> > languages like Chinese, and syllabary languages like Cheyenne.
> 
> Ok, changed to iconic.
> Obviously 2-3 millennia ago, when people wrote in hieroglyphs or cuneiform, these were languages.
> In 2015, when someone sees them and recognizes them, they are 'those things that
> Sumerians/Egyptians wrote'. No one except a rare expert knows those languages.
> 
> > 
> > Just because native readers of Cuneiform are all dead doesn't make Cuneiform
> > unimportant. There are probably more people who need to write Cuneiform
> > than people who need to write APL source code.
> > 
> > You make a comment:
> > 
> > "To me – a unicode-layman – it looks unprofessional… Billions of computing
> > devices world over, each having billions of storage words having their
> > storage wasted on blocks such as these??"
> > 
> > But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
> > Why are you so worried about an (illusionary) minor optimization?
> 
> 2 < 4 as far as I am concerned.
> [If you disagree, one man's illusionary is another's waking]
> 
> > 
> > Whether code points are allocated or not doesn't affect how much space they
> > take up. There are millions of unused Unicode code points today. If they
> > are allocated tomorrow, the space your documents take up will not increase
> > one byte.
> > 
> > Allocating code points to Cuneiform has not increased the space needed by
> > Unicode at all. Two bytes alone is not enough for even existing human
> > languages (thanks China). For hardware related reasons, it is faster and
> > more efficient to use four bytes than three, so the obvious and "dumb" (in
> > the sense of "the simplest thing which will work") way to store Unicode is UTF-32, which
> > takes a full four bytes per code point, regardless of whether there are
> > 65537 code points or 1114112. That makes it less expensive than floating
> > point numbers, which take eight. Would you like to argue that floating
> > point doubles are "unprofessional" and wasteful?
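
A quick Python 3 sanity-check of that 4-vs-8 arithmetic (just a sketch; the sample characters are my own picks):

    import struct

    # Every code point costs the same four bytes in UTF-32,
    # whether it comes from an ASCII, CJK or Cuneiform block.
    for ch in ("A", "\u4e2d", "\U00012000"):
        print(ch, len(ch.encode("utf-32-be")))    # 4, 4, 4

    # One C double for comparison: eight bytes.
    print(len(struct.pack("d", 3.14)))            # 8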
> > 
> > As Dave pointed out, and you apparently agreed with him enough to quote him
> > TWICE (once in each of two blog posts), the history of computing is full of
> > premature optimizations for space. (In fact, some of these may have been
> > justified by the technical limitations of the day.) Technically Unicode is
> > also limited, but it is limited to over one million code points, 1114112 to
> > be exact, although some of them are reserved as invalid for technical
> > reasons, and there is no indication that we'll ever run out of space in
> > Unicode.
> > 
> > In practice, there are three common Unicode encodings that nearly all
> > Unicode documents will use.
> > 
> > * UTF-8 will use between one and (by memory) four bytes per code 
> >   point. For Western European languages, that will be mostly one 
> >   or two bytes per character.
> > 
> > * UTF-16 uses a fixed two bytes per code point in the Basic Multilingual 
> >   Plane, which is enough for nearly all Western European writing and 
> >   much East Asian writing as well. For the rest, it uses a fixed four 
> >   bytes per code point.
> > 
> > * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses 
> >   this as a storage format.
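
Easy to verify in Python 3 (a sketch; BOM-free codec variants used so the counts reflect the code points alone):

    samples = ["hello", "héllo", "日本語", "\U00010348"]   # last: GOTHIC LETTER HWAIR
    for s in samples:
        print(s,
              len(s.encode("utf-8")),       # 5, 6, 9, 4
              len(s.encode("utf-16-le")),   # 10, 10, 6, 4
              len(s.encode("utf-32-le")))   # 20, 20, 12, 4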
> > 
> > 
> > In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
> > doesn't change the space used. If you actually include a few hieroglyphs to
> > your document, the space increases only by the actual space used by those
> > hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
> > single hieroglyph in your document force you to expand the non-hieroglyph
> > characters to use more space.
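
Again checkable in a couple of lines (sketch; the hieroglyph is chosen arbitrarily):

    text = "plain ASCII text"
    with_glyph = text + "\U00013000"          # EGYPTIAN HIEROGLYPH A001
    print(len(text.encode("utf-8")))          # 16
    print(len(with_glyph.encode("utf-8")))    # 20: only the glyph itself added bytes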
> > 
> > 
> > > What I was trying to say expanded here
> > > http://blog.languager.org/2015/03/whimsical-unicode.html
> > 
> > You have at least two broken links, referring to a non-existent page:
> > 
> > http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html
> 
> Thanks, corrected.
> 
> > 
> > This essay seems to be even more rambling and unfocused than the first. What
> > does the cost of semi-conductor plants have to do with whether or not
> > programmers support Unicode in their applications?
> > 
> > Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
> > Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
> > it isn't so silly. If your text begins with the UTF-8 mark, treat it as
> > UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
> > or text editor's encoding cookies.
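
FWIW Python itself already treats it that way, via the 'utf-8-sig' codec:

    data = "héllo".encode("utf-8-sig")
    print(data[:3])                  # b'\xef\xbb\xbf' (the signature)
    print(data.decode("utf-8-sig"))  # héllo (signature stripped on decode)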
> > 
> > Your discussion of "complexifiers and simplifiers" doesn't seem to be
> > terribly relevant, or at least if it is relevant, you don't give any reason
> > for it. The whole thing about Moore's Law and the cost of semi-conductor
> > plants seems irrelevant to Unicode except in the most over-generalised
> > sense of "things are bigger today than in the past, we've gone from
> > five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?
> 
> - Most people need only 16 bits.
> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer and you fail going from 16 to 23, it's ok, but try to
> give useful errors.

Uh… 21.
That's what makes 3 chars per 64-bit word a possibility.
A possibility that can become realistic if/when Intel decides to add 'packed-unicode' string instructions.
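
The arithmetic behind that, sketched in Python (illustration only; no such packed-unicode hardware or library exists):

    # Three 21-bit code points fit in 63 bits, with one bit to spare.
    def pack3(a, b, c):
        return (ord(a) << 42) | (ord(b) << 21) | ord(c)

    def unpack3(word):
        mask = 0x1FFFFF                       # low 21 bits
        return (chr((word >> 42) & mask),
                chr((word >> 21) & mask),
                chr(word & mask))

    w = pack3("n", "中", "\U00012000")
    print(hex(w), unpack3(w))                 # comfortably within 64 bits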


