Glyphs and graphemes [was Re: Cult-like behaviour]

Richard Damon Richard at Damon-family.org
Tue Jul 17 07:29:20 EDT 2018


> On Jul 17, 2018, at 3:44 AM, Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> 
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
> 
>>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano
>>> <steve+comp.lang.python at pearwood.info> wrote:
>>> 
>>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>>> 
>>>> You are defining a variable/fixed width codepoint set. Many others
>>>> want to deal with CHARACTER sets.
>>> 
>>> Good luck coming up with a universal, objective, language-neutral,
>>> consistent definition for a character.
>>> 
>> Who says there needs to be one? A good engineer will use the definition
>> that is most appropriate to the task at hand. Some things need very
>> solid definitions, and some things don’t.
> 
> Then the problem is solved: we have a perfectly good de facto definition 
> of character: it is a synonym for "code point", and every single one of 
> Marko's objections disappears.
> 
Which is a ‘changed’ definition! Do you agree that the concept of a variable-width encoding vastly predates the creation of Unicode? Can you also find any use of the word “codepoint” that predates the development of Unicode?
Code points and code words are an invention of the Unicode consortium, and as such should really only be used when talking about Unicode, not about other encodings. I believe Unicode also introduced the idea of storing composed characters as a series of code points, instead of the composition being done in the input routine and the character set having to define a character code for every needed composed character.
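
To make the composed-character point concrete, here is a minimal sketch (my own example, standard-library Python only) of the same visible letter stored two different ways:

import unicodedata

precomposed = "\u00e9"    # 'é' as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))                         # 1 2
print(precomposed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True once normalized
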
> 
>> This goes back to my original point, where I said some people consider
>> UTF-32 as a variable width encoding. For very many things, practically,
>> the ‘codepoint’ isn’t the important thing, 
> 
> Ah, is this another one of those "let's pick a definition that nobody 
> else uses, and state it as a fact" like UTF-32 being variable width?
> 
> If by "very many things", you mean "not very many things", I agree with 
> you. In my experience, dealing with code points is "good enough", 
> especially if you use Western European alphabets, and even more so if 
> you're willing to do a normalization step before processing text.
> 
Ah, that is the rub: you only deal with the parts of Unicode that are simple and regular. That is EXACTLY what you blame the people who want to stick with ASCII or code pages for doing; this is just the next step in the same evolution.

One problem with normalization is that for Western European characters it can usually convert every ‘character’ to a single code point, but in some corner cases, especially in other languages, it can’t. I am not just talking about digraphs like ‘ch’ that have been mentioned, but genuinely composed characters: a base glyph with marks above, below, or through it. Unicode gives many of them a precomposed code point, but nowhere near all of them.
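
As a rough illustration of that corner case (my own example, not anything out of the Unicode documents): NFC can fold ‘n’ plus a combining tilde into the single code point for ‘ñ’, but there is no precomposed code point for ‘q’ with a tilde, so that sequence stays two code points long even after normalization.

import unicodedata

composable   = "n\u0303"   # n + COMBINING TILDE: a precomposed U+00F1 exists
uncomposable = "q\u0303"   # q + COMBINING TILDE: no precomposed form in Unicode

print(len(unicodedata.normalize("NFC", composable)))     # 1
print(len(unicodedata.normalize("NFC", uncomposable)))   # 2
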

If you actually read the Unicode documents, they do talk about characters, and admit that characters aren’t necessarily single code points, so if you actually want to talk about a CHARACTER set, then Unicode, even in UTF-32, sometimes needs to be treated as variable width.
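
In code-point terms versus character terms, that looks something like the sketch below. Counting the graphemes here leans on the third-party regex module (not the standard re), which understands \X for extended grapheme clusters; that choice of library is mine, not anything mandated by Unicode.

import regex   # pip install regex; the stdlib re module has no \X

text = "cafe\u0301"                        # 'café' with a decomposed final é
print(len(text))                           # 5 code points
print(len(regex.findall(r"\X", text)))     # 4 user-perceived characters
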

> But of course other people's experience may vary. I'm interested in 
> learning about the library you use to process graphemes in your software.
> 
> 
>> so the fact that every UTF-32
>> code point takes the same number of bytes or code words isn’t that
>> important. They are dealing with something that needs to be rendered, and
>> preserving larger units, like the grapheme, is important.
> 
> If you're writing a text widget or a shell, you need to worry about 
> rendering glyphs. Everyone else just delegates to their text widget, GUI 
> framework, or shell.
> 
But someone needs to write that text widget, or the one you have might not do exactly what you want, say, wrapping the text around obstacles already placed on the screen/page.

And try using that text widget to find the ‘middle’ (as displayed) of a text string, other than by iterating with repeated calls to it until you find the spot.
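
One facet of that problem, sketched with a made-up example of my own: slicing at the code point midpoint can land between a base letter and its combining mark, leaving a dangling accent at the start of the second half.

text = "ne\u0301e"        # "née" in decomposed form: n, e, combining acute, e
mid = len(text) // 2      # 2 -- a code point index, not a character index

left, right = text[:mid], text[mid:]
print(repr(left))         # 'ne'
print(repr(right))        # starts with a bare COMBINING ACUTE ACCENT
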

Unicode made the processing of code points simpler, but it made the processing of actual rendered text much more complicated if you want to handle everything right. 
> 
>>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>>> the magical cure that some were hoping for.
>>> 
>>> Nobody ever claimed it was, except for the people railing that since it
> isn't a magical system we ought to go back to the Good Old Days of
>>> code page hell, or even further back when everyone just used ASCII.
>>> 
>> Sometimes ASCII is good enough, especially on a small machine with
>> limited resources.
> 
> I doubt that there are many general purpose computers with resources 
> *that* limited. Even MicroPython supports Unicode, and that runs on 
> embedded devices with memory measured in kilobytes. 8K is considered the 
> smallest amount of memory usable with MicroPython, although 128K is more 
> realistic as the *practical* lower limit.
> 
> In the mid 1980s, I was using computers with 128K of RAM, and they were 
> still able to deal with more than just ASCII. I think the "limited 
> resources" argument is bogus.
> 
I regularly use processors with 8 KB of RAM and 32 KB of flash. I will admit that I wouldn’t think of using Python there, as the overhead would be excessive. Yes, if I needed to I could put in a bigger processor, but it would cost space, dollars, and power, so I don’t. The applications there can get by with just ASCII, so that is what I use. On such a processor, really processing Unicode would be out of reach: even a function as simple as isdigit wouldn’t fit if you wanted a proper Unicode definition, and tolower would be out of the question.
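
Not that any of this would fit on an 8 KB part, but CPython’s own str.isdigit shows why a proper Unicode definition needs sizeable tables, while the ASCII-only check is a couple of comparisons (my illustration, not anything specific to those micros):

samples = ["7", "\u0663", "\u09e9", "\u0e53", "\uff17"]   # ASCII, Arabic-Indic, Bengali, Thai, fullwidth digits
for ch in samples:
    print(hex(ord(ch)), ch.isdigit())                     # True for every one of them

def ascii_isdigit(ch):
    # the whole ASCII-only test: two comparisons, no tables
    return "0" <= ch <= "9"
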

> 
> -- 
> Steven D'Aprano
> "Ever since I learned about confirmation bias, I've been seeing
> it everywhere." -- Jon Ronson
> 
> -- 
> https://mail.python.org/mailman/listinfo/python-list



