Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Wed Jul 19 11:59:11 EDT 2017


On Thu, Jul 20, 2017 at 1:45 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> So let's assume we will expand str to accommodate the requirements of
> grapheme clusters.
>
> All existing code would still produce only traditional strings. The only
> way to introduce the new "super code points" is by invoking the
> str.canonical() method:
>
>     text = "hyvää yötä".canonical()
>
> In this case text would still be a fully traditional string because both
> ä and ö are represented by a single code point in NFC. However:
>
>     >>> q = unicodedata.normalize("NFC", "aq̈u")
>     >>> len(q)
>     4
>     >>> text = q.canonical()
>     >>> len(text)
>     3
>     >>> text[0]
>     'a'
>     >>> text[1]
>     'q̈'
>     >>> text[2]
>     'u'
>     >>> q2 = unicodedata.normalize("NFC", text)
>     >>> len(q2)
>     4
>     >>> text.encode()
>     b'aq\xcc\x88u'
>     >>> q.encode()
>     b'aq\xcc\x88u'

Ahh, I see what you're looking at. This is fundamentally very similar
to what was suggested a few hundred posts ago: a function in the
unicodedata module which yields a string's combined characters as
units. So you only see this when you actually want it, and the process
of creating it is a form of iterating over the string.

This could easily be done, as a class or function in unicodedata,
without any language-level support. It might even already exist on
PyPI.
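As a rough illustration of the idea, here is a minimal sketch of such an iterator. It groups each base character with its trailing combining marks using `unicodedata.combining`; note this is a simplification of full UAX #29 grapheme cluster segmentation (it ignores ZWJ sequences, Hangul jamo, regional indicators, and so on), and the function name `iter_graphemes` is hypothetical:

```python
import unicodedata

def iter_graphemes(s):
    """Yield base characters together with their trailing combining marks.

    A simplified sketch of grapheme cluster iteration: full UAX #29
    segmentation also handles ZWJ sequences, Hangul jamo, regional
    indicators, etc.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch  # attach the combining mark to the current base
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

# 'a', 'q' + COMBINING DIAERESIS (no precomposed form in NFC), 'u'
q = unicodedata.normalize("NFC", "aq\u0308u")
print(len(q))                    # 4 code points
print(list(iter_graphemes(q)))   # 3 grapheme clusters: ['a', 'q̈', 'u']
```

For production use, the third-party `regex` module on PyPI already supports matching grapheme clusters with the `\X` pattern, which follows the full Unicode segmentation rules.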

ChrisA


