Glyphs and graphemes [was Re: Cult-like behaviour]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jul 16 12:51:25 EDT 2018


On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:

> if your new system used Python3's UTF-32 strings as a foundation, that
> would be an equally naïve misstep. You'd need to reach a notch higher
> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
> after all, is a variable-width encoding.

Python's strings aren't UTF-32. They are sequences of abstract code 
points.
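
A minimal sketch of what that means in practice (the sample strings
are my own, chosen only for illustration): len() and indexing count
code points, whatever the interpreter uses for storage internally.

```python
# Sample strings are illustrative assumptions, not from the thread.
s = "a\U0001F600b"   # 'a', U+1F600 (outside the BMP), 'b'
t = "e\u0301"        # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(s))                          # counts code points, not bytes
print([f"U+{ord(c):04X}" for c in s])  # one entry per code point
print(s[1] == "\U0001F600")            # indexing yields a whole code point
print(len(t))                          # two code points, one visible "character"
```

The second string is the case Marko is presumably worried about: one
user-perceived character made of two code points. That is a fact about
graphemes, not about the encoding.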

UTF-32 is not a variable-width encoding.
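
This is easy to check. In the sketch below (the sample characters are
arbitrary choices of mine), every code point encodes to exactly four
bytes in UTF-32, while UTF-8 and UTF-16 vary:

```python
# Sample code points are arbitrary; the "-le" codecs are used so that
# no byte order mark is prepended to the encoded output.
samples = ["a", "\u00ef", "\u20ac", "\U0001F600"]

widths = {}
for ch in samples:
    widths[ch] = (
        len(ch.encode("utf-32-le")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-8")),
    )
    u32, u16, u8 = widths[ch]
    print(f"U+{ord(ch):04X}  utf-32: {u32}  utf-16: {u16}  utf-8: {u8}")
```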

I don't know what *you* mean by "semiotic atoms" (possibly you mean 
graphemes?), but "glyphs" are the visual images of characters, and there's 
a virtual infinity of those for each character, differing in type-face, 
size, and style (roman, italic, bold, reverse-oblique, etc).

There is no evidence aside from your say-so that a programming language 
needs to support "glyphs" as a native data type, or even graphemes. For 
starters, such a system would be exceedingly complex: graphemes are both 
language and context dependent.

English, for example, has around 250 distinct graphemes:

https://books.google.com.au/books?id=QrBQAmfXYooC&pg=PT238&lpg=PT238&dq=250+graphemes&source=bl&ots=abiymnQ5pq&sig=eq3k06BkuGfpuGC6wKqPkCR_8Bw&hl=en&sa=X&ei=HAdyUbfULpCnqwGRi4DYAg&redir_esc=y


Certainly it would be utterly impractical for a programming language 
designer, knowing nothing but a few half-remembered jargon terms, to try 
to design a native string type that matched the grapheme rules for the 
hundreds of human languages around the world. Or even just for English. 
Let third-party libraries blaze that trail first.


By no means is Unicode the last word in text processing. It might not 
even be the last word in native string types for programming languages. 
But it is a true international standard which provides a universal 
character set and a selection of useful algorithms able to be used as 
powerful building blocks for text-processing libraries.
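
As one small example of such a building block, here is a sketch of a
caseless comparison built from the standard library's normalization
and case-folding routines. The helper name caseless_key is my own
invention, and this is an illustration rather than a complete
text-matching solution:

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Return a key for caseless comparison (hypothetical helper).

    Normalize to NFD, case-fold, then normalize again, because case
    folding can produce text that is no longer normalized.
    """
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

# Precomposed versus combining-sequence spellings compare equal.
print(caseless_key("na\u00efve") == caseless_key("NAI\u0308VE"))
# Case folding maps the German sharp s to "ss".
print(caseless_key("Stra\u00dfe") == caseless_key("STRASSE"))
```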

Honestly Marko, your argument strikes me as akin to somebody who insists 
that because Python's float data type doesn't support a full CAS (computer 
algebra system) and theorem prover, it's useless and a step backwards and 
we should abandon IEEE-754 float semantics and let users implement their 
own floating point maths using nothing but fixed 1-byte integers.

A float, after all, is nothing but 8 bytes.
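
For the literal-minded, a quick sketch using the struct module (the
value 0.1 is an arbitrary choice): the eight bytes are there, but on
their own they carry none of the float's semantics.

```python
import struct

x = 0.1                            # arbitrary sample value
raw = struct.pack("<d", x)         # IEEE-754 binary64, little-endian

print(len(raw))                    # size of the packed double in bytes
print(list(raw))                   # the same float seen as small integers
(y,) = struct.unpack("<d", raw)    # only the format code restores meaning
print(y == x)
```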




-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



