Grapheme clusters, a.k.a. real characters

Thomas Jollans tjol at tjol.eu
Wed Jul 19 13:41:10 EDT 2017


On 19/07/17 04:19, Rustom Mody wrote:
> On Wednesday, July 19, 2017 at 3:00:21 AM UTC+5:30, Marko Rauhamaa wrote:
>> Chris Angelico :
>>
>>> Let me give you one concrete example: the letter "ö". In English, it
>>> is (very occasionally) used to indicate diaeresis, where a pair of
>>> letters is not a double letter - for example, "coöperate". (You can
>>> also hyphenate, "co-operate".) In German, it is the letter "o" with a
>>> pronunciation mark (umlaut), and is considered the same letter as "o".
>>> In Swedish, it is a distinct letter, alphabetized last (following z,
>>> å, and ä, in that order). But in all these languages, it's represented
>>> the exact same way.
>> The German Wikipedia entry on "ä" calls "ä" a letter ("Buchstabe"):
>>
>>    Der Buchstabe Ä (kleingeschrieben ä) ist ein Buchstabe des
>>    lateinischen Schriftsystems.
>>
>> Furthermore, it makes a distinction between "ä" the letter and "ä" the
>> "a with a diaeresis":
>>
>>    In guten Druckschriften unterscheiden sich die Umlautpunkte von den
>>    zwei Punkten des Tremas: Die Umlautpunkte sind kleiner, stehen näher
>>    zusammen und liegen etwas tiefer.
>>
>>    In good fonts umlaut dots are different from the two dots of a
>>    diaeresis: the umlaut dots are smaller and closer to each other and
>>    lie a little lower. [translation mine]
>>
> Very interesting!
> And may I take it that the two different variants of ü, u-umlaut and
> u-diaeresis, are not (yet) given a seat in Unicode?

Yes, the tréma/diæresis and the umlaut are two historically distinct
beasts that share appearances and codepoints. (And the question of
whether ÄÖÜẞ are letters in German is rather more subtle than whether
ÅÄÖ are letters in Swedish.)

For added confusion there are languages like Dutch which use both the
umlaut (in German loanwords like ‘überhaupt’) and the tréma (in words
like vacuüm).
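
To see the shared code point concretely, here is a minimal sketch using
only the standard unicodedata module. The describe() helper and the
variable names are hypothetical, chosen just for illustration; the two
sample words are the ones mentioned above, written with escapes so the
result does not depend on how the mail was encoded.

```python
import unicodedata


def describe(ch):
    """Return the code point, Unicode name, and NFD component names of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        [unicodedata.name(c) for c in decomposed],
    )


german_loanword = "\u00fcberhaupt"  # ü as umlaut
dutch_word = "vacu\u00fcm"          # ü as tréma

u_umlaut = german_loanword[0]
u_trema = dutch_word[4]

# Both uses are the very same character as far as Unicode is concerned.
print(u_umlaut == u_trema)
print(describe(u_umlaut))
```

Unicode names the character after the diaeresis in both cases, and the
canonical decomposition is the base letter followed by U+0308 COMBINING
DIAERESIS; nothing in the string records which of the two was meant.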

Other languages, like Turkish, use the umlaut symbol for separate vowels
that are not umlauts in the historical sense (i.e. shifted vowels, as in
mouse - mice / Maus - Mäuse).

So let's just pretend that characters in general have no meaning?

> Now compare with:
> - hyphen-minus 0x2D
> − minus sign 0x2212
> ‐ hyphen 0x2010
> – en dash 0x2013
> — em dash 0x2014
> ― horizontal bar 0x2015
> … And perhaps another half-dozen

… but then again there's the whole business of Han unification. 
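
For comparison, the dashes in the quoted list really are kept apart. A
small sketch, again assuming nothing beyond the standard unicodedata
module (DASH_CODE_POINTS and dash_names() are hypothetical names of my
own), that looks up the official name of each code point listed above:

```python
import unicodedata

# Code points taken from the quoted list above.
DASH_CODE_POINTS = [0x2D, 0x2212, 0x2010, 0x2013, 0x2014, 0x2015]


def dash_names(code_points=DASH_CODE_POINTS):
    """Map each code point to its official Unicode character name."""
    return {cp: unicodedata.name(chr(cp)) for cp in code_points}


for cp, name in dash_names().items():
    print(f"U+{cp:04X} {name}")
```

Each of the six code points comes back with its own distinct name,
unlike the umlaut and the tréma.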


-- Thomas




