Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Sat Jul 15 20:28:40 EDT 2017


On Sun, Jul 16, 2017 at 9:50 AM, Gregory Ewing
<greg.ewing at canterbury.ac.nz> wrote:
> Chris Angelico wrote:
>>
>> Hold on, let me just grab my MUD
>> client, which is already using a fixed width font...
>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>> 忘掉那 無形鎖
>> الثلج لا يشعرني بإكتئاب
>> הקור לא מפריע לי, לא חודר
>> U+1680 is " "
>> U+200B is ""
>> U+180E is "᠎"
>> 다 잊어 다 잊어
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
>
> I suspect that different lines in that example are actually
> being rendered in different fonts. Characters within the *same*
> monospaced font should have the same width (otherwise it's not
> really a monospaced font!), but there are no guarantees between
> different fonts.
>
> Perhaps the meta-problem here is that Unicode being so big has
> made it impractical to have a single font that encompasses all
> the characters you might ever want to render, so you often have
> to make do with a hodgepodge of fonts that don't play well
> together.

That could explain some of it. However, Chinese characters have a
well-defined width that is significantly greater than what most
monospaced fonts use for Latin characters, so giving every character
that width would look ugly for most text in Western European
languages. Also, that doesn't deal with U+200B or U+180E, which have
well-defined widths *smaller* than typical Latin letters. (U+200B is a
zero-width space. Is it a character?) Hebrew text is rendered
right-to-left, which makes
columnar alignment *very* interesting. Arabic text, in addition to
being RTL, is written in a joined/running style, so individual letters
aren't rendered the same way that an entire word is. And in the Korean
example, half the glyphs are represented as composed syllables (U+B2E4
HANGUL SYLLABLE DA) and half are decomposed letters (U+1103 HANGUL
CHOSEONG TIKEUT followed by U+1161 HANGUL JUNGSEONG A). These are not
combining characters - they are legitimate characters in their own
right. (At least, I can't find anything in the Unicode data files that
indicates that they aren't letters. I can use them individually in
Python identifiers, for instance.)
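
For anyone who wants to check the Korean case, here is a minimal
sketch using only the standard unicodedata module. The names
composed, decomposed and describe are mine, purely for illustration,
and the exact properties reported depend on the Unicode version your
Python ships with.

```python
import unicodedata

composed = "\uB2E4"          # HANGUL SYLLABLE DA
decomposed = "\u1103\u1161"  # CHOSEONG TIKEUT + JUNGSEONG A


def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch))


for ch in composed + decomposed:
    name, category, ccc = describe(ch)
    # A category of "Lo" (Letter, other) and a combining class of 0 mean
    # the database does not treat the code point as a combining mark.
    print("U+%04X %-24s %s ccc=%d identifier=%s"
          % (ord(ch), name, category, ccc, ch.isidentifier()))

# NFC normalization folds the two jamo into the precomposed syllable.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

All three code points report as letters with combining class 0, yet
the two jamo normalize to the single syllable, which is the tension
described above.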

So even if someone were to create a single font with every Unicode
character represented, it couldn't actually give every character the
same width, because that would result in incorrect rendering for many
scripts.
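
To make the width problem concrete, here is a rough sketch of the
kind of per-code-point guess a terminal-oriented program might make.
The function name estimated_columns is hypothetical, and the rules
(two cells for East Asian Wide or Fullwidth, none for combining marks
and format characters, one for everything else) are a deliberately
crude assumption. It knows nothing about Arabic shaping, bidi
reordering, or jamo sequences that render as one syllable.

```python
import unicodedata


def estimated_columns(text):
    """Crude guess at how many terminal cells `text` occupies."""
    columns = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters: no cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2  # East Asian Wide / Fullwidth
        else:
            columns += 1
    return columns


print(estimated_columns("abc"))           # three Latin letters
print(estimated_columns("忘掉那"))         # three wide ideographs
print(estimated_columns("\u200B"))        # zero-width space
print(estimated_columns("\uB2E4"))        # composed DA
print(estimated_columns("\u1103\u1161"))  # decomposed DA
```

The composed and decomposed spellings of DA come out different under
this heuristic even though they are meant to display as the same
syllable, so no single per-character width table can be right for
both.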

ChrisA


