Glyphs and graphemes [was Re: Cult-like behaviour]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jul 16 13:36:07 EDT 2018


On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano
>> <steve+comp.lang.python at pearwood.info> wrote:
>> 
>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>> 
>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>> after all, is a variable-width encoding.
>> 
>> Python's strings aren't UTF-32. They are sequences of abstract code
>> points.
>> 
>> UTF-32 is not a variable-width encoding.
>> 
>> --
>> Steven D'Aprano
>> 
>> 
> Many consider that UTF-32 is a variable-width encoding because of the
> combining characters. It can take multiple ‘codepoints’ to define what
> should be a single ‘character’ for display.

Ah, well if we're going to start making up our own definitions of terms, 
then ASCII is a variable-width encoding too.

"Ch" (a single letter of the alphabet in a number of European languages, 
including Welsh and Czech) requires two code points in ASCII. Even in 
English, "qu" could be considered a two-byte "character" (grapheme), and 
for ASCII users, (c) is a THREE code point character for what ought to be 
a single character ©.
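
The gap between "what the reader sees as one character" and "code points" 
can be sketched in a few lines of Python 3. This is only an illustration: 
the sample strings are my own choice, and it assumes nothing beyond the 
standard library's unicodedata module.

```python
import unicodedata

# The same visible letter, spelled two different ways.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))                        # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The ASCII stand-in for the copyright sign is three code points,
# and three bytes once encoded as ASCII.
ascii_copyright = "(c)"
print(len(ascii_copyright), len(ascii_copyright.encode("ascii")))  # 3 3

# The real copyright sign is a single code point.
print(len("\u00a9"))                                         # 1
```

In both cases len() counts code points, not graphemes, whatever encoding 
the text eventually ends up in.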

The standard definition of variable- and fixed-width encodings refers to 
how many *code units* are required to make up a single *code point*.

Under that standard definition, UTF-8 and UTF-16 are variable-width, and 
UTF-32 is fixed-width. 
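
As a rough check of that definition, the following sketch counts code 
units per code point for a handful of sample characters. The helper 
code_units() and the UNIT_BYTES table are names made up for this example; 
the little-endian codec names are used only so that Python doesn't prepend 
a byte order mark.

```python
# Size in bytes of one code unit in each encoding form.
UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(ch, encoding):
    """Return how many code units one code point needs in `encoding`."""
    return len(ch.encode(encoding)) // UNIT_BYTES[encoding]

# U+0041, U+00E9, U+20AC, U+1F600
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in SAMPLES:
    counts = {enc: code_units(ch, enc) for enc in UNIT_BYTES}
    print(f"U+{ord(ch):04X}", counts)

# U+0041  {'utf-8': 1, 'utf-16-le': 1, 'utf-32-le': 1}
# U+00E9  {'utf-8': 2, 'utf-16-le': 1, 'utf-32-le': 1}
# U+20AC  {'utf-8': 3, 'utf-16-le': 1, 'utf-32-le': 1}
# U+1F600 {'utf-8': 4, 'utf-16-le': 2, 'utf-32-le': 1}
```

The UTF-8 and UTF-16 counts vary from one code point to the next; the 
UTF-32 count is always one.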

But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII 
is too.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



