Glyphs and graphemes [was Re: Cult-like behaviour]
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Jul 16 12:51:25 EDT 2018
On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
> if your new system used Python3's UTF-32 strings as a foundation, that
> would be an equally naïve misstep. You'd need to reach a notch higher
> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
> after all, is a variable-width encoding.
Python's strings aren't UTF-32. They are sequences of abstract code
points.
UTF-32 is not a variable-width encoding.
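Both points are easy to demonstrate at the interpreter (a quick sketch of my own, any Python 3 will do): strings index by code point, and UTF-32 spends exactly four bytes on every code point.

```python
# Python 3 strings are sequences of abstract code points, not bytes
# in any particular encoding.
s = "naïve"
print(len(s))                      # 5 code points
print([hex(ord(c)) for c in s])   # the code points themselves

# UTF-32 really is fixed-width: 4 bytes per code point, no exceptions.
data = s.encode("utf-32-be")      # big-endian, no BOM
print(len(data))                  # 5 * 4 = 20 bytes
```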
I don't know what *you* mean by "semiotic atoms" (possibly you mean
graphemes?), but "glyphs" are the visual images of characters, and there's
a virtual infinity of those for each character, differing in typeface,
size, and style (roman, italic, bold, reverse-oblique, etc).
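The code point/grapheme distinction is easy to see with nothing but the stdlib (a small illustration of mine using the unicodedata module):

```python
import unicodedata

# One user-perceived character (a grapheme) may be several code points.
g = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(g))            # 2 code points, but one grapheme on screen
print(g == "\u00e9")     # False: different code point sequences

# Unicode normalisation (NFC) folds this pair into one precomposed
# code point -- but only where a precomposed form exists.
print(len(unicodedata.normalize("NFC", g)))   # 1
```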
There is no evidence, aside from your say-so, that a programming language
"needs" to support glyphs as a native data type, or even graphemes. For
starters, such a system would be exceedingly complex: graphemes are both
language- and context-dependent.
English, for example, has around 250 distinct graphemes:
https://books.google.com.au/books?id=QrBQAmfXYooC&pg=PT238&lpg=PT238&dq=250+graphemes&source=bl&ots=abiymnQ5pq&sig=eq3k06BkuGfpuGC6wKqPkCR_8Bw&hl=en&sa=X&ei=HAdyUbfULpCnqwGRi4DYAg&redir_esc=y
Certainly it would be utterly impractical for a programming language
designer, knowing nothing but a few half-remembered jargon terms, to try
to design a native string type that matched the grapheme rules for the
hundreds of human languages around the world. Or even just for English.
Let third-party libraries blaze that trail first.
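To hint at why the trail is hard, here is a deliberately naive sketch (my own toy code, not any real library's algorithm) that just attaches combining marks to their base character -- and it already fails on ZWJ emoji sequences, Hangul jamo, regional indicator pairs, and most of UAX #29:

```python
import unicodedata

def naive_graphemes(text):
    """Very rough grapheme grouping: glue combining marks to their base.

    This ignores most of Unicode's real segmentation rules, which is
    exactly why full grapheme support belongs in a dedicated library.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch     # combining mark: extend previous cluster
        else:
            clusters.append(ch)    # base character: start a new cluster
    return clusters

print(naive_graphemes("cafe\u0301"))   # 4 clusters, not 5 code points
```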
By no means is Unicode the last word in text processing. It might not
even be the last word in native string types for programming languages.
But it is a true international standard which provides a universal
character set and a selection of useful algorithms able to be used as
powerful building blocks for text-processing libraries.
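For example, the normalisation and case-folding algorithms already exposed in the stdlib turn robust text comparison into a few lines (canonical_equal is just a hypothetical helper name for illustration):

```python
import unicodedata

def canonical_equal(a, b):
    # Two strings are canonically equivalent if they normalise to the
    # same NFC form, regardless of composed/decomposed input.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonical_equal("\u00e9", "e\u0301"))         # True
print("Straße".casefold() == "STRASSE".casefold())  # True
```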
Honestly Marko, your argument strikes me as akin to somebody who insists
that because Python's float data type doesn't support a full CAS (computer
algebra system) and theorem prover, it's useless and a step backwards, and
we should abandon IEEE-754 float semantics and let users implement their
own floating-point maths using nothing but fixed 1-byte integers.
A float, after all, is nothing but 8 bytes.
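And at the representation level that really is all it is, as the struct module will confirm:

```python
import struct

# A CPython float is an IEEE-754 double-precision value: exactly 8 bytes.
packed = struct.pack("<d", 3.14159)
print(len(packed))                                  # 8
# Those 8 bytes round-trip back to the identical float.
print(struct.unpack("<d", packed)[0] == 3.14159)    # True
```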
--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson