Glyphs and graphemes [was Re: Cult-like behaviour]

Mon Jul 16 14:22:27 EDT 2018

> On Jul 16, 2018, at 1:36 PM, Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> 
> On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:
> 
>>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano
>>> <steve+comp.lang.python at pearwood.info> wrote:
>>> 
>>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>>> 
>>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>>> after all, is a variable-width encoding.
>>> 
>>> Python's strings aren't UTF-32. They are sequences of abstract code
>>> points.
>>> 
>>> UTF-32 is not a variable-width encoding.
>>> 
>>> --
>>> Steven D'Aprano
>>> 
>>> 
>> Many consider that UTF-32 is a variable-width encoding because of the
>> combining characters. It can take multiple ‘codepoints’ to define what
>> should be a single ‘character’ for display.
> 
> Ah, well if we're going to start making up our own definitions of terms, 
> then ASCII is a variable-width encoding too.
> 
> "Ch" (a single letter of the alphabet in a number of European languages, 
> including Welsh and Czech) requires two code points in ASCII. Even in 
> English, "qu" could be considered a two-byte "character" (grapheme), and 
> for ASCII users, (c) is a THREE code point character for what ought to be 
> a single character ©.
> 
> The standard definition of variable- and fixed-width encodings refers to 
> how many *code units* are required to make up a single *code point*.
> 
> Under that standard definition, UTF-8 and UTF-16 are variable-width, and 
> UTF-32 is fixed-width. 
> 
> But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII 
> is too.
> 
> -- 
> Steven D'Aprano
> 

But I am not talking about that sort of character, or about ligatures, but about ‘characters’ that are built up from a base character and one or more combining diacritical marks (like accents). Unicode defines precomposed code points for the more common of these, but many others have none and can only be written as a sequence of code points.
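
As a rough illustration of the point about combining marks, here is a small Python 3 sketch using the standard unicodedata module. The variable names and the choice of "q" plus an acute accent as an example with no precomposed form are mine, picked only for illustration.

```python
import unicodedata

# "é" as one precomposed code point (U+00E9), and as "e" followed by
# COMBINING ACUTE ACCENT (U+0301). Both display as a single character.
precomposed = "\u00e9"
decomposed = "e\u0301"

# Python's len() counts code points, not displayed characters.
precomposed_len = len(precomposed)   # 1
decomposed_len = len(decomposed)     # 2

# NFC normalization composes the pair, because a precomposed code point exists.
nfc_e = unicodedata.normalize("NFC", decomposed)

# "q" plus a combining acute accent has no precomposed code point
# (illustrative choice), so NFC has to leave it as two code points.
q_acute = "q\u0301"
nfc_q = unicodedata.normalize("NFC", q_acute)

# In UTF-32 every code point is exactly one 4-byte code unit, so the
# encoding is fixed-width in the code-unit sense even though the displayed
# character here takes two code points.
utf32_bytes = len(q_acute.encode("utf-32-le"))   # 2 code points * 4 bytes

print(precomposed_len, decomposed_len, len(nfc_e), len(nfc_q), utf32_bytes)
```

So both views hold at once: UTF-32 is fixed-width per code point, yet what a reader sees as one character may still span several code points, and normalization only helps when a precomposed form happens to exist.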