Glyphs and graphemes [was Re: Cult-like behaviour]

Chris Angelico rosuav at gmail.com
Tue Jul 17 04:46:12 EDT 2018


On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> But of course other people's experience may vary. I'm interested in
>> learning about the library you use to process graphemes in your software.
>
> For me, the issue is where do I produce a line break in my text output?
> Currently, I'm just counting codepoints to estimate the width of the
> output.

Well, that's just flat out wrong, then. Counting graphemes isn't going
to make it any better. Grab a well-known library like Pango and let it
do your measurements for you, *in pixels*. Or better still, just poke
your text to a dedicated text-display widget and let it display it
correctly.
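
To illustrate why a code point count is a poor width estimate, here is a
minimal sketch using only the standard library. The helper names
(codepoints, rough_cells) are made up for this example, and rough_cells
is only a crude terminal-cell guess. It is not a substitute for letting
a layout library measure the text; it just shows how quickly the count
and the displayed width drift apart.

```python
import unicodedata

def codepoints(s):
    """Number of code points, which is what len() reports for a str."""
    return len(s)

def rough_cells(s):
    """Crude monospace-cell estimate (an assumption for illustration only):
    combining marks take 0 cells, East Asian wide/fullwidth characters
    take 2 cells, everything else takes 1."""
    total = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining mark is drawn on top of the previous character
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

samples = [
    "cafe",
    "cafe\u0301",          # 'e' followed by COMBINING ACUTE ACCENT
    "\u65e5\u672c\u8a9e",  # three CJK ideographs
]

for s in samples:
    print(repr(s), codepoints(s), rough_cells(s))
```

Even the cell estimate falls over as soon as the font is proportional,
which is why measuring in pixels is the safer route.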

Back in the early 2000s, I built a program that displayed text in a
monospaced font, and it was riddled with assumptions that "one byte ==
one character == N pixels of width" (for some value of N that changed
only when you changed fonts). It was easier to throw it out completely
and start over than to try to "bolt on" true Unicode support. The
replacement program uses GTK and Pango to do all its display work, and
while it still has a lot of complexities (because it has to handle
colour codes, highlighting, point-to-word, and such, all of which get
very complicated when you mix LTR and RTL text), at least it can 100%
dependably say "wrap to this point". For the convenience of the human
using it, it specifies a wrap width in characters, but in the fine
print, the wrap width is defined as "the width of that many of the
letter 'n' in the chosen font". At no point do I ever count bytes,
code units, code points, grapheme clusters, or blue-faced baboons, to
try to pretend that I know the width of the string. All of them are
wrong for the wrapping of text.
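
A rough sketch of the "width of that many of the letter 'n'" rule
follows. This is not the code from the program described above. The
names wrap_by_width, toy_measure and TOY_WIDTHS are hypothetical, and
the pixel widths are invented stand-ins for a real font. In a real
program the measure callable would ask the layout library for the
rendered width, and bidirectional text would need far more care than
this greedy loop gives it.

```python
def wrap_by_width(text, measure, columns):
    """Greedy word wrap. A line may be as wide as `columns` copies of the
    letter 'n', as reported by `measure` (a callable: str -> width)."""
    limit = measure("n" * columns)
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > limit:
            lines.append(current)   # the line is full, start a new one
            current = word
        else:
            current = candidate     # an over-long single word still gets its own line
    if current:
        lines.append(current)
    return lines

# Hypothetical pixel widths standing in for a proportional font.
TOY_WIDTHS = {"i": 3, "l": 3, "m": 12, "w": 12, " ": 4}

def toy_measure(s):
    """Sum of per-character widths; unknown characters count as 8 pixels."""
    return sum(TOY_WIDTHS.get(ch, 8) for ch in s)

print(wrap_by_width("iiii iiii iiii", toy_measure, 5))
print(wrap_by_width("mmmm mmmm mmmm", toy_measure, 5))
```

Both inputs have the same number of characters, yet they wrap
differently under the toy widths, because the decision depends on
measured width and never on a count of anything.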

ChrisA


