Glyphs and graphemes [was Re: Cult-like behaviour]

Mon Jul 16 16:11:27 EDT 2018

On Tue, Jul 17, 2018 at 5:51 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>> Under that standard definition, UTF-8 and UTF-16 are variable-width,
>> and UTF-32 is fixed-width.
>>
>> But I'll accept that UTF-32 is variable-width if Marko accepts that
>> ASCII is too.
>
> If that makes you happy, fine. The point is, UTF-32 has no advantages
> over UTF-8. And I'm referring to the text abstraction as seen by the
> programmer. It has nothing to do with the layout of bytes inside
> CPython.
>
> I use UTF-8 in my C programs and sense no disadvantage. I have never
> felt a need for wchar_t. Similarly, I had a small Python2 program that
> quizzed me about Hebrew vocabulary with Finnish translations and
> Esperanto pronunciation instructions. All UTF-8. No unicode strings. (I
> *have* converted that to Python3 just to be on the bleeding edge, but it
> didn't give me any advantages over Python2.)

Challenge: Reverse a string in UTF-8.

Challenge: Center text in UTF-8.

Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes,
find the immediately preceding character.

All of these are inherently difficult, but if you index by code points,
you eliminate one level of difficulty; indexing by bytes retains all of
the existing difficulty and adds another layer on top.
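
As a rough illustration of the three challenges, here is a minimal
Python 3 sketch. The helper name prev_char_start and the sample strings
are made up for this example, and it assumes the buffer holds valid
UTF-8 with precomposed characters only. With combining characters even
the code-point versions go wrong, which is the part that stays
difficult either way.

```python
def prev_char_start(buf: bytes, i: int) -> int:
    """Return the byte offset where the character before offset i starts.

    Hypothetical helper: assumes buf is valid UTF-8 and that i is the
    start of a non-initial character.
    """
    if i <= 0:
        raise ValueError("no preceding character")
    i -= 1
    # Continuation bytes look like 0b10xxxxxx; step back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


# "naive" with a precomposed i-diaeresis, a space, then a Hebrew word.
text = "na\u00efve \u05e9\u05dc\u05d5\u05dd"
data = text.encode("utf-8")

# Preceding character, indexing by code points: just subtract one.
k = text.index("v")
assert text[k - 1] == "\u00ef"

# Preceding character, indexing by bytes: scan backwards first.
b = data.index(b"v")
start = prev_char_start(data, b)
assert data[start:b].decode("utf-8") == "\u00ef"

# Reversing by code points keeps every encoded character intact...
reversed_text = text[::-1]

# ...while reversing the raw bytes yields invalid UTF-8.
try:
    data[::-1].decode("utf-8")
    bytes_reverse_ok = True
except UnicodeDecodeError:
    bytes_reverse_ok = False

# Centering: str.center counts code points, bytes.center counts bytes,
# so the byte version pads a 4-character, 8-byte word far too little.
word = "\u05e9\u05dc\u05d5\u05dd"
centered = word.center(10, "*")
byte_centered = word.encode("utf-8").center(10, b"*").decode("utf-8")
```

Here len(centered) is 10 while len(byte_centered) is only 6. Neither
version accounts for display width or combining marks, so this only
shows the extra layer that byte indexing adds, not a complete solution.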

ChrisA