Grapheme clusters, a.k.a. real characters

Sat Jul 15 10:31:38 EDT 2017

On Sun, Jul 16, 2017 at 12:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Steve D'Aprano <steve+python at pearwood.info>:
>
>> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>>> I might want random access to the "Grapheme clusters, a.k.a.real
>>> characters".
>>
>> That would be nice to have, but the truth is that for most coders,
>> Unicode code points are the low-hanging fruit that get you 95% of the
>> way, and for many applications that's "close enough".
>
> I think "close enough" is actually dangerous. We shouldn't encourage
> that practice.
>
>> Support for the Unicode grapheme breaking algorithm would get you
>> probably 90% of the rest of the way. And then some sort of
>> configurable system where defaults were based on the locale would
>> probably get you a fairly complete grapheme-based text library.

Okay. So here's your challenge: don't get "close enough", get perfect.
Divide the following strings into "characters" by your definition;
give me a list of one-character strings. Make sure you are perfect and
consistent. I'll start with an easy one.

1) "Giờ\u00A0ra\u00A0đi, một\u00A0mình\u00A0ta"
2) "לעזוב, לעזוב"
3) "اطلقي سرك"
4) "「別讓他們進來看見」"
5) "다 잊어 다 잊어"

Your locale, should this matter, is your choice of en_AU.utf8,
en_US.utf8, tr_TR.utf8, or sv_SE.utf8.

In case the information is lost in transmission, here are the same
strings, as sequences of codepoints.

1) U+0047 U+0069 U+1EDD U+00A0 U+0072 U+0061 U+00A0 U+0111 U+0069
U+002C U+0020 U+006D U+1ED9 U+0074 U+00A0 U+006D U+00EC U+006E U+0068
U+00A0 U+0074 U+0061
2) U+05DC U+05E2 U+05D6 U+05D5 U+05D1 U+002C U+0020 U+05DC U+05E2
U+05D6 U+05D5 U+05D1
3) U+0627 U+0637 U+0644 U+0642 U+064A U+0020 U+0633 U+0631 U+0643
4) U+300C U+5225 U+8B93 U+4ED6 U+5011 U+9032 U+4F86 U+770B U+898B U+300D
5) U+B2E4 U+0020 U+C78A U+C5B4 U+0020 U+1103 U+1161 U+0020 U+110B
U+1175 U+11BD U+110B U+1165
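
As a rough illustration of why the code point listings matter, here is a
small sketch that rebuilds string 5 from its listing and compares it with
its NFC-normalized form. The helper name parse_codepoints is made up for
this example; only unicodedata.normalize comes from the standard library.
The second half of string 5 is spelled with conjoining jamo, so it looks
the same as the first half but does not compare equal until normalized.

```python
import unicodedata

def parse_codepoints(listing: str) -> str:
    """Hypothetical helper: turn 'U+0047 U+0069 ...' into a str."""
    return "".join(chr(int(tok[2:], 16)) for tok in listing.split())

# String 5, exactly as listed above.
korean = parse_codepoints(
    "U+B2E4 U+0020 U+C78A U+C5B4 U+0020 U+1103 U+1161 U+0020 "
    "U+110B U+1175 U+11BD U+110B U+1165"
)

# NFC composes the conjoining jamo into precomposed syllables.
composed = unicodedata.normalize("NFC", korean)

print(len(korean))                   # 13 code points as listed
print(len(composed))                 # 9 code points after NFC
print(korean[:4] == korean[5:])      # False: same text, different code points
print(composed[:4] == composed[5:])  # True once both halves are composed
```

So even before any grapheme rules come into play, "how many characters"
already depends on which normalization form the string happens to be in.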

Once this is solved, you can propose adding an iteration function that
follows these rules. Probably to the unicodedata module, although it'd
most likely have to go via PyPI first.
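
To make the shape of such a function concrete, here is a deliberately
naive sketch. It is NOT the Unicode grapheme breaking algorithm (UAX #29)
and not a proposed API; the name naive_clusters is hypothetical. It only
attaches combining marks (general category M*) to the preceding code
point, which is an assumption that breaks down quickly.

```python
import unicodedata

def naive_clusters(text):
    """Hypothetical sketch: glue combining marks onto the previous code point.

    This ignores the real UAX #29 rules (Hangul jamo sequences, ZWJ
    sequences, regional indicators, and so on).
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' followed by U+0301 COMBINING ACUTE ACCENT stays together.
print(len(naive_clusters("e\u0301x")))     # 2

# Conjoining jamo are category Lo, so this approach splits a syllable
# that the UAX #29 Hangul rules would keep as one cluster.
print(len(naive_clusters("\u1103\u1161")))  # 2
```

The second call shows the gap between "close enough" and correct: string
5 above would come out differently from its precomposed spelling.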

ChrisA