Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Sat Jul 15 10:49:57 EDT 2017


On Sun, Jul 16, 2017 at 12:08 AM, Rick Johnson
<rantingrickjohnson at gmail.com> wrote:
> On Friday, July 14, 2017 at 2:40:43 AM UTC-5, Chris Angelico wrote:
>> [...]
>> What is the length of a string? How often do you actually
>> care about the number of grapheme clusters - and not, for
>> example, about the pixel width? (To columnate text, for
>> instance, you need to know about its width in pixels or
>> millimeters, not the number of characters in the line.)
>
> Not in the case of a fixed width font!

Yes, of course. How silly of me. Hold on, let me just grab my MUD
client, which is already using a fixed width font...

Here's a piece of text, copied and pasted straight from the client.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
忘掉那 無形鎖
الثلج لا يشعرني بإكتئاب
הקור לא מפריע לי, לא חודר
U+1680 is " "
U+200B is ""
U+180E is "᠎"
다 잊어 다 잊어
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

And here's how it renders.

http://imgur.com/1xTT1s0

It's so easy! Monospaced fonts solve everything. Every single
character gets the exact same number of pixels of width, because
that's how the standard stipulates it.
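
For anyone who wants to poke at this without a screenshot, here is a
minimal sketch of why the number of characters and the number of
columns can differ even in a fixed width font. It uses Python's
unicodedata module. The approx_columns helper and its rules (wide and
fullwidth characters take two cells; nonspacing marks, enclosing marks
and format characters take none) are assumptions made up for
illustration, not what any particular terminal or MUD client does.

```python
import unicodedata

def approx_columns(text):
    """Rough guess at terminal cells used; real renderers may disagree."""
    cols = 0
    for ch in text:
        # Assumption: marks and format characters occupy no cell of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Assumption: East Asian Wide/Fullwidth characters occupy two cells.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

for sample in ["hello", "忘掉那", "e\u0301", "a\u200bb"]:
    # len() counts code points; approx_columns() guesses at display cells.
    print(ascii(sample), len(sample), approx_columns(sample))
```

Even this guess ignores right-to-left text, font fallback, and
characters whose width the renderer decides for itself, so treat it as
a lower bound on the trouble.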

>> And if you're going to group code points together because
>> some of them are combining characters, would you also group
>> them together because there's a zero-width joiner in the
>> middle? The answer will sometimes be "yes of course" and
>> sometimes "of course not".
>
> Consistency is the key. And we must remember that he who
> assembled such inconsistent strings can only blame herself.

Except that it's the same string in different contexts. There is no
inconsistency in the string.
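
One possible illustration, as a sketch only: an emoji sequence built
with U+200D ZERO WIDTH JOINER. The choice of sequence is mine, and
whether it shows up as one glyph or three depends entirely on the font
and renderer, not on the string.

```python
import unicodedata

ZWJ = "\u200d"
# MAN, ZWJ, WOMAN, ZWJ, GIRL: one string, however it ends up being drawn.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

print(len(family))            # number of code points in the string
print(unicodedata.name(ZWJ))  # the joiner is an ordinary code point
for part in family.split(ZWJ):
    # What a renderer without support for the sequence would draw separately.
    print(ascii(part), unicodedata.name(part))
```

Whether those five code points count as one "character" or three is a
question about the context displaying them; the string itself does not
change.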

ChrisA


