Glyphs and graphemes [was Re: Cult-like behaviour]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jul 16 21:27:18 EDT 2018


On Mon, 16 Jul 2018 15:28:51 -0400, Terry Reedy wrote:

> On 7/16/2018 1:11 PM, Richard Damon wrote:
> 
>> Many consider that UTF-32 is a variable-width encoding because of the
>> combining characters. It can take multiple ‘codepoints’ to define what
>> should be a single ‘character’ for display.
> 
> I hope you realize that this is not the standard meaning of
> 'variable-width encoding', which is 'variable number of bytes for a
> codepoint'.

A minor correction, Terry: it is the number of code units per code point,
not the number of bytes.

UTF-8 uses 1-byte code units, and from 1 to 4 code units per code point;

UTF-16 uses 2-byte code units (a 16-bit word), and 1 or 2 code units per
code point;

UTF-32 uses 4-byte code units (a 32-bit word), and only ever a single 
code unit for every code point.
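
To make those counts concrete, here is a minimal sketch in Python. The
helper name code_units and the four sample characters are only
illustrative choices. The little-endian codec names are used so that
Python does not prepend a byte order mark, which would otherwise inflate
the byte count.

```python
def code_units(ch, encoding, unit_size):
    """Return how many code units encode the single code point ch."""
    # Encoded length in bytes divided by the size of one code unit.
    return len(ch.encode(encoding)) // unit_size


# One code point from each UTF-8 length class (1 to 4 bytes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8 = code_units(ch, "utf-8", 1)       # 1-byte code units
    utf16 = code_units(ch, "utf-16-le", 2)  # 2-byte code units
    utf32 = code_units(ch, "utf-32-le", 4)  # 4-byte code units
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

Only U+1F600, which lies outside the Basic Multilingual Plane, should need
two UTF-16 code units (a surrogate pair); the UTF-32 column should be 1
for every sample.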



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



