[New-bugs-announce] [issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

David MacIver report at bugs.python.org
Sun Aug 13 09:43:47 EDT 2017


New submission from David MacIver:

chr(304).lower() is a two-character string: a lower case i followed by the combining character chr(775) ('COMBINING DOT ABOVE').
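For reference, a minimal sketch that inspects the lowered string on Python 3 (the variable names are illustrative and not taken from the attached file):

```python
import unicodedata

# chr(304) is U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.
capital = chr(304)
lowered = capital.lower()

# Python 3 applies the full Unicode case mapping here, so the result
# is two code points rather than one.
print(len(lowered))
for ch in lowered:
    print("U+{:04X}".format(ord(ch)), unicodedata.name(ch))
```

On Python 3 this should list LATIN SMALL LETTER I followed by COMBINING DOT ABOVE.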

The re module does not seem to understand the combining character: a regex compiled with IGNORECASE will erroneously match a single lower case i without the required combining character. The attached file demonstrates this. I've tested this on Python 3.6.1 with my locale set to ('en_GB', 'UTF-8'). I don't know whether the locale matters for reproducing this, but I know it can affect how lower/upper work, so I'm including it for the sake of completeness.
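A minimal sketch of the check described above. This is not the attached casing.py, whose contents are not reproduced here; it assumes the pattern is the single character chr(304), and the variable names are illustrative. The result of the plain 'i' check may depend on the Python version, so it is printed rather than asserted:

```python
import re

capital = chr(304)         # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = capital.lower()  # 'i' followed by chr(775) on Python 3

# Assumed pattern: the capital letter itself, compiled case-insensitively.
pattern = re.compile(capital, re.IGNORECASE)

# A character always matches itself, with or without IGNORECASE.
matches_capital = pattern.fullmatch(capital) is not None

# The behaviour in question: does a bare 'i', which is not the str.lower()
# result for this character, match as well?
matches_plain_i = pattern.fullmatch("i") is not None

print("str.lower() result is a bare 'i':", lowered == "i")
print("pattern matches chr(304):", matches_capital)
print("pattern matches bare 'i':", matches_plain_i)
```

If the last line prints True while the first prints False, the regex engine and str.lower() disagree about the lower case form of this character, which is the discrepancy reported here.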

The problem does not reproduce on Python 2.7.13 because in that case chr(304).lower() is 'i' without the combining character, so it fails earlier.

This is presumably related to #12728, but since that issue is closed as fixed while this one still reproduces, I don't believe it's a duplicate.

----------
components: Library (Lib)
files: casing.py
messages: 300219
nosy: David MacIver
priority: normal
severity: normal
status: open
title: re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE
versions: Python 3.6
Added file: http://bugs.python.org/file47080/casing.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue31193>
_______________________________________

