Glyphs and graphemes [was Re: Cult-like behaviour]

Tue Jul 17 03:44:47 EDT 2018

On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano
>> <steve+comp.lang.python at pearwood.info> wrote:
>> 
>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>> 
>>> You are defining a variable/fixed width codepoint set. Many others
>>> want to deal with CHARACTER sets.
>> 
>> Good luck coming up with a universal, objective, language-neutral,
>> consistent definition for a character.
>> 
> Who says there needs to be one? A good engineer will use the definition
> that is most appropriate to the task at hand. Some things need very
> solid definitions, and some things don’t.

Then the problem is solved: we have a perfectly good de facto definition 
of character: it is a synonym for "code point", and every single one of 
Marko's objections disappears.

> This goes back to my original point, where I said some people consider
> UTF-32 as a variable width encoding. For very many things, practically,
> the ‘codepoint’ isn’t the important thing, 

Ah, is this another one of those "let's pick a definition that nobody 
else uses, and state it as a fact" like UTF-32 being variable width?
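
For anyone who wants to check the fixed-width claim for themselves, here 
is a minimal sketch. The sample characters are my own arbitrary choice, 
and I use the "utf-32-le" codec only so that no byte order mark gets 
counted:

```python
# Four arbitrary sample code points: ASCII, Latin-1, BMP, and astral.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# In UTF-32, every code point is encoded as exactly four bytes.
widths = [len(ch.encode("utf-32-le")) for ch in samples]
print(widths)  # [4, 4, 4, 4]

# The same code points take a varying number of bytes in UTF-8.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_widths)  # [1, 2, 3, 4]
```

That is fixed width in the sense everyone else uses the term: a fixed 
number of code units per code point. It says nothing about graphemes.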

If by "very many things", you mean "not very many things", I agree with 
you. In my experience, dealing with code points is "good enough", 
especially if you use Western European alphabets, and even more so if 
you're willing to do a normalization step before processing text.
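
To illustrate what I mean by a normalization step, here is a rough 
sketch using the standard library's unicodedata module. The example 
strings are made up for the purpose, and NFC is just one reasonable 
choice of normalization form:

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Compared code point by code point, the two strings differ.
raw_equal = composed == decomposed

# After normalizing both to NFC, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# The decomposed form has 5 code points; its NFC form has 4.
nfc_length = len(unicodedata.normalize("NFC", decomposed))
print(raw_equal, nfc_equal, len(decomposed), nfc_length)  # False True 5 4
```

Normalization only helps where a precomposed form exists, so this is 
"good enough" rather than a general solution for graphemes.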

But of course other people's experience may vary. I'm interested in 
learning about the library you use to process graphemes in your software.

> so the fact that every UTF-32
> code point takes the same number of bytes or code words isn’t that
> important. They are dealing with something that needs to be rendered and
> preserving larger units, like the grapheme is important.

If you're writing a text widget or a shell, you need to worry about 
rendering glyphs. Everyone else just delegates to their text widget, GUI 
framework, or shell.

>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>> the magical cure that some were hoping for.
>> 
>> Nobody ever claimed it was, except for the people railing that since it
>> isn't a magical system we ought to go back to the Good Old Days of
>> code page hell, or even further back when everyone just used ASCII.
>> 
> Sometimes ASCII is good enough, especially on a small machine with
> limited resources.

I doubt that there are many general purpose computers with resources 
*that* limited. Even MicroPython supports Unicode, and that runs on 
embedded devices with memory measured in kilobytes. 8K is considered the 
smallest amount of memory usable with MicroPython, although 128K is more 
realistic as the *practical* lower limit.

In the mid 1980s, I was using computers with 128K of RAM, and they were 
still able to deal with more than just ASCII. I think the "limited 
resources" argument is bogus.

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson