Case-insensitive string equality

Tim Chase python.list at tim.thechases.com
Mon Sep 4 16:27:50 EDT 2017


On 2017-09-02 12:21, Steve D'Aprano wrote:
> On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:
> > I'd want to have an optional parameter to take locale into
> > consideration.  E.g.  
> 
> Does regular case-sensitive equality take the locale into
> consideration?

No.  Python says that .casefold()

https://docs.python.org/3/library/stdtypes.html#str.casefold

implements the Unicode case-folding specification

ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt

which calls out the additional processing for Turkic languages:

# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used
#      instead of the normal mapping for these characters.
#      Note that the Turkic mappings do not maintain canonical
#      equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard
#      for more information.

So it looks like what Python lacks is that "additional processing",
after which .casefold() should solve the problems.
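
A rough sketch of what that "additional processing" might look like.
This is only my reading of the note above: it assumes that NFC
normalization is enough to restore canonical equivalence (so that "I"
followed by U+0307 COMBINING DOT ABOVE is treated the same as U+0130),
and turkic_casefold() is a hypothetical helper name, not anything in
the stdlib.

```python
import unicodedata

# The two Turkic "T" mappings listed in CaseFolding.txt:
#   0049; T; 0131   (LATIN CAPITAL LETTER I -> dotless i)
#   0130; T; 0069   (LATIN CAPITAL LETTER I WITH DOT ABOVE -> i)
_TURKIC_FOLD = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkic_casefold(s):
    """Hypothetical helper: case-fold s using the Turkic (tr, az) mappings."""
    # Assumption: composing first lets "I" + U+0307 fold like U+0130.
    s = unicodedata.normalize("NFC", s)
    # Apply the T mappings, then let .casefold() handle everything else.
    return s.translate(_TURKIC_FOLD).casefold()


print(turkic_casefold("I") == "\u0131")
print(turkic_casefold("\u0130") == turkic_casefold("I\u0307"))
```

Both lines should print True if the assumption about NFC holds.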

According to my reading, if locale doesn't play a part in the equation,

   s1.casefold() == s2.casefold()

should suffice.  Any case-insensitive code using .upper() or .lower()
instead of .casefold() becomes a code-smell.
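
A quick illustration of why, using the German sharp s (the function
name equal_ignore_case() is just something I made up for the example):

```python
def equal_ignore_case(s1, s2):
    """Locale-agnostic case-insensitive equality via .casefold()."""
    return s1.casefold() == s2.casefold()


# .casefold() maps the sharp s to "ss", so the two spellings match
print(equal_ignore_case("Straße", "STRASSE"))

# .lower() leaves the sharp s alone, so the comparison fails
print("Straße".lower() == "STRASSE".lower())
```

The first comparison prints True and the second prints False.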

> If regular case-sensitive string comparisons don't support the
> locale, why should case-insensitive comparisons be required to?

Adding new code to Python that just does what is already available is
indeed poor justification. But adding *new* functionality that handles
locale-aware case-insensitive comparison could be justified.

> As far as I'm concerned, the only "must have" is that ASCII letters
> do the right thing. Everything beyond that is a "quality of
> implementation" issue.

But for this use-case, we already have .casefold() which does the job
and even extends beyond plain 7-bit ASCII to most of the typical
i18n/Unicode use-cases.

-tkc