Case-insensitive string equality

Chris Angelico rosuav at gmail.com
Fri Sep 1 20:53:32 EDT 2017


On Sat, Sep 2, 2017 at 10:31 AM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:
>
>> Aside from lower(), which returns the string unchanged, the case
>> conversion rules say that this contains two letters.
>
> Do you have a reference to that?
>
> I mean, where in the Unicode case conversion rules is that stated? You cannot
> take the behaviour of Python as necessarily correct here -- it may be that the
> behaviour of Python is erroneous.

Yep! It's all in here.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
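
As a quick illustration of what those special-casing rules mean in practice, here is a minimal sketch using Python's own str methods. It assumes a Python 3 interpreter whose str.upper() and str.casefold() apply the full (multi-character) mappings; the variable names are only for illustration.

```python
# The ligature and sharp s are single code points that expand under
# the full Unicode case mappings.
FI = "\N{LATIN SMALL LIGATURE FI}"
SHARP_S = "\N{LATIN SMALL LETTER SHARP S}"

fi_lower = FI.lower()        # already lowercase, so returned unchanged
fi_upper = FI.upper()        # expands to the two letters "FI"
fi_folded = FI.casefold()    # full case folding gives "fi"
sharp_s_upper = SHARP_S.upper()  # expands to "SS"

print(ascii(fi_lower), fi_upper, fi_folded, sharp_s_upper)
```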

> For what it's worth, even under Unicode's own rules, there are always going to be
> odd corner cases that surprise people. The most obvious cases are:
>
> You can't keep everybody happy. Doesn't mean we can't meet 99% of the use cases.
>
> After all, what do you think the regex case insensitive matching does?

Honestly, I don't know what it does without checking. But code is
often *wrong* due to backward compatibility concerns. Then you have to
decide whether, for a brand new API, it's better to "do the same as
the regex module" or to "do what the Unicode consortium says".

As it turns out, the Python 're' module doesn't match the letters
against the ligature:

>>> import re
>>> re.search("F", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("f", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("I", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("i", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("S", "\N{LATIN SMALL LETTER SHARP S}", re.IGNORECASE)
>>> re.search("s", "\N{LATIN SMALL LETTER SHARP S}", re.IGNORECASE)
>>>

Every one of those searches returns None. I would consider that
behaviour to be incorrect.
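
For comparison, here is a rough sketch of a caseless match built on full case folding instead. The helper names equal_caseless and contains_caseless are hypothetical, not part of 're' or any other library, and the sketch deliberately skips Unicode normalization, so it is not a complete implementation of the Unicode caseless-matching definitions.

```python
def equal_caseless(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after full case folding."""
    return a.casefold() == b.casefold()


def contains_caseless(needle: str, haystack: str) -> bool:
    """Hypothetical helper: substring test after full case folding."""
    return needle.casefold() in haystack.casefold()


# Unlike the searches above, the letters are found in the ligature,
# because casefold() expands it to "fi" before the comparison.
print(contains_caseless("F", "\N{LATIN SMALL LIGATURE FI}"))
print(contains_caseless("S", "\N{LATIN SMALL LETTER SHARP S}"))
print(equal_caseless("FI", "\N{LATIN SMALL LIGATURE FI}"))
```

Folding both sides keeps the comparison symmetric, at the cost of building two temporary strings per call.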

ChrisA


