Glyphs and graphemes [was Re: Cult-like behaviour]

Marko Rauhamaa marko at pacujo.net
Tue Jul 17 04:27:34 EDT 2018


Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
>> Who says there needs to be one. A good engineer will use the
>> definition that is most appropriate to the task at hand. Some things
>> need very solid definitions, and some things don’t.
>
> Then the problem is solved: we have a perfectly good de facto definition
> of character: it is a synonym for "code point", and every single one of 
> Marko's objections disappears.

I admit it. Python3 is the perfect medium for your codepoint delivery
needs.

What you don't seem to understand about my objections is that no
programmer needs codepoints per se. Also, Python2's strings do as good a
job at delivering codepoints as Python3. Simultaneously, Python2's
strings are a better fit for the Unix system and network programming
model.

>> This goes back to my original point, where I said some people
>> consider UTF-32 as a variable width encoding. For very many things,
>> practically, the ‘codepoint’ isn’t the important thing,
>
> Ah, is this another one of those "let's pick a definition that nobody
> else uses, and state it as a fact" like UTF-32 being variable width?

   Each 32-bit value in UTF-32 represents one Unicode code point and is
   exactly equal to that code point's numerical value.

   <URL: https://en.wikipedia.org/wiki/UTF-32>

That is called a bijection. Even more, it's a homomorphism. A
homomorphism is a very high degree of sameness.

It is essential for people to understand that the very same issues that
plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that
fact.
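
A minimal sketch of that point, assuming Python 3 and only the standard
library. The helper name utf32_units and the sample string are
illustrative, not something from this thread. It shows that the UTF-32
code units are numerically the code points, and that a single
user-perceived character can still take more than one unit in UTF-32,
just as it takes more than one byte in UTF-8.

```python
import struct

def utf32_units(text):
    """Return the 32-bit units of text encoded as UTF-32 (little-endian, no BOM)."""
    data = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(data) // 4), data))

# "e" followed by COMBINING ACUTE ACCENT: one grapheme, two code points.
sample = "e\u0301"

units = utf32_units(sample)
code_points = [ord(ch) for ch in sample]

# Each UTF-32 unit is exactly the numerical value of a code point.
assert units == code_points

# The grapheme is "variable width" in both encodings.
utf32_unit_count = len(units)                 # 2 units of 32 bits
utf8_byte_count = len(sample.encode("utf-8")) # 3 bytes
```

Fixed-width code units do not give fixed-width characters in the sense
a reader would recognize.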

> If by "very many things", you mean "not very many things", I agree
> with you. In my experience, dealing with code points is "good enough",
> especially if you use Western European alphabets, and even more so if
> you're willing to do a normalization step before processing text.

Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
one big advantage: it can usually deal with non-Unicode data without any
extra considerations, whereas Python3's strings force you to take
elaborate measures to handle those special cases. Why, even print() must
be guarded against UnicodeEncodeError when the printed string is not in
the programmer's control.
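
To make the special case concrete, here is a hedged sketch assuming
Python 3. The byte string and the helper safe_encode are made up for
illustration. It uses the standard "surrogateescape" and
"backslashreplace" error handlers: the first smuggles undecodable bytes
into a str, the second is one possible guard when such a str has to be
written out.

```python
# Latin-1 encoded bytes; 0xE9 on its own is not valid UTF-8.
raw = b"caf\xe9.txt"

# As bytes, the data can be sliced and compared without any decoding step.
assert raw.endswith(b".txt")

# As a str, the stray byte becomes a lone surrogate (U+DCE9).
name = raw.decode("utf-8", "surrogateescape")

def safe_encode(text, encoding="utf-8"):
    """Encode text, escaping whatever the codec refuses instead of raising."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return text.encode(encoding, "backslashreplace")

# A strict encode of name raises UnicodeEncodeError; the guard does not.
escaped = safe_encode(name)

# The original bytes come back only if the same handler is used again.
round_trip = name.encode("utf-8", "surrogateescape")
```

The guard trades the exception for an escaped representation, so the
output is no longer byte-for-byte what came in unless the caller knows
to use "surrogateescape" on the way out.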

> But of course other people's experience may vary. I'm interested in 
> learning about the library you use to process graphemes in your software.

For me, the issue is where to produce a line break in my text output.
Currently, I'm just counting codepoints to estimate the width of the
output.
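
For comparison, a rough sketch of a slightly better estimate than a bare
codepoint count, assuming Python 3 and only the standard unicodedata
module. The function name estimated_width is hypothetical, and the rules
are a simplification: combining marks count as zero columns, East Asian
wide and fullwidth characters count as two, everything else as one. It
does not handle control characters, emoji sequences, or terminals that
disagree with these rules.

```python
import unicodedata

def estimated_width(text):
    """Estimate the number of terminal columns text occupies (heuristic)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters usually take two columns.
            width += 2
        else:
            width += 1
    return width

accented = "e\u0301"      # two code points, one column
ideographs = "\u65e5\u672c"  # two code points, four columns
```

Here len(accented) overestimates the width and len(ideographs)
underestimates it, which is where a line-breaking decision based on
codepoint counts goes wrong.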


Marko


