[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE
Matthew Barnett
report at bugs.python.org
Mon Aug 14 13:57:37 EDT 2017
Matthew Barnett added the comment:
The re module works with codepoints; it doesn't understand canonical equivalence.
For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}".
This is true for Python in general, except for identifiers, which the parser normalises (to NFKC):
>>> "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
'É'
>>> É = 0
>>> "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
'É'
>>> É
0
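A minimal sketch of the same point for strings and the re module, assuming both sides are normalised explicitly with unicodedata before matching (the variable names are only for illustration):

```python
import re
import unicodedata

decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"

# Different codepoint sequences, so plain comparison and re both see a mismatch.
print(decomposed == precomposed)                   # False
print(re.fullmatch(precomposed, decomposed))       # None

# Normalising to a common form (NFC here) makes the two comparable.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                          # True
print(re.fullmatch(precomposed, nfc) is not None)  # True
```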
This also means that, say, '.' will match only one _codepoint_, not a whole base-plus-combining sequence.
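A short illustration, using the same decomposed string as an assumed example:

```python
import re

decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"

# The string looks like one character but holds two codepoints.
print(len(decomposed))                              # 2

# '.' consumes only the base letter; the combining accent is left over.
first = re.match(".", decomposed)
print(first.group() == "E")                         # True
print(re.fullmatch(".", decomposed))                # None

# Two dots are needed to cover the whole sequence.
print(re.fullmatch("..", decomposed) is not None)   # True
```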
----------
nosy: +mrabarnett
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue31193>
_______________________________________