Grapheme clusters, a.k.a. real characters

Ben Finney ben+python at benfinney.id.au
Thu Jul 13 22:18:20 EDT 2017


Steve D'Aprano <steve+python at pearwood.info> writes:

> From time to time, people discover that Python's string algorithms work on code
> points rather than "real characters", which can lead to anomalies like the
> following:
>
> s = 'xäex'
> s = unicodedata.normalize('NFD', s)
> print(s)
> print(s[::-1])
>
>
> which results in:
>
> xäex
> xëax

> If you're interested in this issue

Note that the anomaly depends on the difference between two apparently
identical strings::

    >>> import unicodedata
    >>> s1 = 'xäex'
    >>> s2 = unicodedata.normalize('NFD', s1)
    >>> s1, s2
    ('xäex', 'xäex')

The strings are different, and the items you get when iterating over them
are different::

    >>> len(s1), len(s2)
    (4, 5)
    >>> [unicodedata.name(c) for c in s1]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER A WITH DIAERESIS',
     'LATIN SMALL LETTER E',
     'LATIN SMALL LETTER X']
    >>> [unicodedata.name(c) for c in s2]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER A',
     'COMBINING DIAERESIS',
     'LATIN SMALL LETTER E',
     'LATIN SMALL LETTER X']
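
As a quick check of the same point, here is a minimal sketch using only the
standard library. The escapes ``\u00e4`` (precomposed) and ``\u0308``
(combining diaeresis) are written out so the difference survives copy and
paste; the names ``s1`` and ``s2`` follow the session above.

```python
import unicodedata

# Precomposed form: 'ä' is the single code point U+00E4.
s1 = 'x\u00e4ex'
# Decomposed form: 'a' followed by U+0308 COMBINING DIAERESIS.
s2 = unicodedata.normalize('NFD', s1)

# The two strings render alike but do not compare equal.
print(s1 == s2)

# Normalising back to NFC recomposes the pair into U+00E4.
print(unicodedata.normalize('NFC', s2) == s1)

# Show the code points actually stored in the decomposed string.
print([hex(ord(c)) for c in s2])
```

Comparing strings only after normalising both to the same form avoids this
particular surprise, though it does not help with reversal.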

which explains why they're different when reversed::

    >>> [unicodedata.name(c) for c in reversed(s1)]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER E',
     'LATIN SMALL LETTER A WITH DIAERESIS',
     'LATIN SMALL LETTER X']
    >>> "".join(reversed(s1))
    'xeäx'
    >>> [unicodedata.name(c) for c in reversed(s2)]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER E',
     'COMBINING DIAERESIS',
     'LATIN SMALL LETTER A',
     'LATIN SMALL LETTER X']
    >>> "".join(reversed(s2))
    'xëax'
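
For illustration only, here is a rough sketch of reversing by clusters rather
than by code points. The helpers ``simple_clusters`` and ``reverse_clusters``
are hypothetical names, not part of any library, and the grouping rule
(attach each combining mark to the preceding character) is a simplification:
it is not the full grapheme cluster segmentation defined in Unicode Standard
Annex #29, so it will mishandle things like emoji sequences and Hangul jamo.

```python
import unicodedata


def simple_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified assumption: a character with a non-zero canonical
    combining class belongs to the cluster started before it.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Combining mark: attach it to the current cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


def reverse_clusters(text):
    """Reverse the order of clusters, keeping each cluster intact."""
    return "".join(reversed(simple_clusters(text)))


s2 = unicodedata.normalize('NFD', 'x\u00e4ex')
reversed_s2 = reverse_clusters(s2)

# The diaeresis stays attached to the 'a' instead of moving to the 'e'.
print([unicodedata.name(c) for c in reversed_s2])
```

With this approach the combining diaeresis travels with its base letter, so
the reversed decomposed string matches the reversed precomposed one after
normalisation.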

-- 
 \           “I know that we can never get rid of religion …. But that |
  `\   doesn’t mean I shouldn’t hate the lie of faith consistently and |
_o__)                     without apology.” —Paul Z. Myers, 2011-12-28 |
Ben Finney



