Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Sun Jul 16 01:45:12 EDT 2017


On Sun, Jul 16, 2017 at 2:33 PM, Rustom Mody <rustompmody at gmail.com> wrote:
> On Sunday, July 16, 2017 at 4:09:16 AM UTC+5:30, Mikhail V wrote:
>> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>> > Random access to code points is as uninteresting as random access to
>> > UTF-8 bytes.
>> > I might want random access to the "Grapheme clusters, a.k.a. real
>> > characters".
>>
>> What _real_ characters are you referring to?
>> If your data has "á" (U+00E1), then it is one real character,
>> if you have "a" (U+0061) and "ˊ" (U+02CA) then it is _two_
>> real characters. So in both cases you have access to code points =
>> real characters.
>
> Right now in an adjacent mailing list (debian) I see someone signed off with a
>
> grüß
>
> I guess the third character is a u with some ‘dirt’
> What's the fourth?

It's a "sharp S".

Tell me, is "å" an a with some 'dirt', or is it a separate character?
Is "i" an ı with some dirt, or a separate letter? Oh wait, you
probably think that "i" is a letter, and "ı" is the same letter but
with some dirt missing. What about "p"? Is that just "d" written the
wrong way up? At what point does something merit being called a
different letter?

ALL of these are unique characters. If you look up the alphabetization
rules for Norwegian, Turkish, and English, you'll see that "å" is not
"a", that "ı" is not "i", and that "p" is not "d".
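
If you want to check this from the interpreter, here is a minimal sketch
(assuming Python 3 and only the standard unicodedata module; the variable
names are purely illustrative). It looks up the Unicode name of each letter
mentioned above, and also shows that "å" can be stored either as one code
point or as "a" plus a combining ring, which is where the grapheme cluster
question in this thread comes from.

```python
import unicodedata

# The letters discussed above, written as escapes so the source file's own
# encoding or normalization cannot change what is being tested.
letters = ["a", "\u00e5", "i", "\u0131", "p", "d", "\u00df"]

# Map each letter to its official Unicode name.
names = {ch: unicodedata.name(ch) for ch in letters}
for ch, name in names.items():
    print(f"U+{ord(ch):04X}  {ch}  {name}")

# "å" as a single precomposed code point...
composed = "\u00e5"
# ...and the canonically equivalent decomposed form: "a" + COMBINING RING ABOVE.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # number of code points in each form
print(composed == decomposed)          # raw comparison of the two strings
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Each letter comes back with its own name and its own code point; none of
them is reported as another letter plus a modifier. The two spellings of
"å" differ in length and compare unequal until they are normalized to the
same form.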

ChrisA


