Case-insensitive string equality

Thu Aug 31 10:27:58 EDT 2017

On Fri, 1 Sep 2017 12:03 am, Chris Angelico wrote:

> On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>> Chris Angelico <rosuav at gmail.com> writes:

>>>The method you proposed seems a little odd - it steps through the
>>>strings character by character and casefolds them separately. How is
>>>it superior to the two-line function?
>>
>>   When the strings are long, casefolding both strings
>>   just to be able to tell that the first character of
>>   the left string is »;« while the first character of
>>   the right string is »'« and so the result is »False«
>>   might be slower than necessary.

Thanks Stefan, that was my reasoning.

Also, if the implementation were in C, doing the comparison character by
character is unlikely to be slower than doing the comparison all at once, since
the "all at once" comparison is itself character by character under the hood.
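
For illustration only, here is a rough pure-Python sketch of the
character-by-character idea. The name casefold_equal is hypothetical, not an
existing str method, and a real implementation would presumably be in C. It
folds lazily and stops at the first mismatch, so no folded copy of either
string is ever built.

```python
from itertools import chain, zip_longest

def casefold_equal(a, b):
    """Hypothetical helper: lazy case-insensitive equality of two strings."""
    # Fold one character at a time. A single character can fold to
    # several (for example 'ß' folds to 'ss'), so flatten the pieces.
    folded_a = chain.from_iterable(c.casefold() for c in a)
    folded_b = chain.from_iterable(c.casefold() for c in b)
    # zip_longest pads the shorter stream with None, so strings whose
    # folded lengths differ compare unequal. all() stops at the first
    # mismatch, which is the early exit discussed above.
    return all(x == y for x, y in zip_longest(folded_a, folded_b))

print(casefold_equal("Straße", "STRASSE"))
print(casefold_equal(";" + "x" * 10**6, "'" + "x" * 10**6))
```

In pure Python the per-character overhead will likely outweigh the saving for
short strings; the sketch only shows the shape of the algorithm.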

>> [chomp]
>>   However, premature optimization is the root of all evil!
> 
> Fair enough.

It's not premature optimization to plan ahead for obvious scenarios.

Sometimes we may want to compare two large strings. Calling casefold on them
temporarily doubles the memory taken by the two strings, which can be
significant. Assuming the method were written in C, you would be very unlikely
to build up two large temporary case-folded arrays before doing the comparison.

If I were subclassing str in pure Python, I wouldn't bother. The tradeoffs are
different.

> However, I'm more concerned about the possibility of a semantic
> difference between the two. Is it at all possible for the case folding
> of an entire string to differ from the concatenation of the case
> foldings of its individual characters?

I don't believe so, but I welcome correction.
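
A spot check, not a proof: the sketch below compares the fold of a whole
string with the concatenation of the folds of its characters, for a handful
of strings chosen because they contain awkward characters. The helper name
folds_piecewise and the sample list are made up for this example, and the
result only reflects how str.casefold behaves in the Python it is run on.

```python
def folds_piecewise(s):
    """Hypothetical check: whole-string fold equals per-character folds."""
    whole = s.casefold()
    piecewise = "".join(c.casefold() for c in s)
    return whole == piecewise

samples = [
    "Stra\u00dfe",               # contains LATIN SMALL LETTER SHARP S
    "\u038c\u03a3\u039f\u03a3",  # Greek capitals ending in SIGMA
    "\u0130stanbul",             # LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\ufb01nance",               # LATIN SMALL LIGATURE FI
    "\u01c5ungla",               # titlecase digraph D WITH SMALL Z WITH CARON
]

for s in samples:
    print(ascii(s), folds_piecewise(s))
```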

> Additionally: a proper "case insensitive comparison" should almost
> certainly start with a Unicode normalization. But should it be NFC/NFD
> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
> application.

Normalisation is orthogonal to comparisons and searches. Python doesn't
automatically normalise strings, as people have pointed out a bazillion times
in the past, and it happily compares 

'ö' LATIN SMALL LETTER O WITH DIAERESIS

'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS

as unequal. I don't propose to change that just so that we can get 'a'
equals 'A' :-)
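
To make the two spellings concrete, here is a small example using the
standard unicodedata module. The variable names are just for illustration.
The two strings compare unequal as they stand and equal once both are
normalised to the same form (NFC is used here, but NFD would do as well for
this pair).

```python
import unicodedata

composed = "\u00f6"     # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # LATIN SMALL LETTER O + COMBINING DIAERESIS

# Plain comparison looks at code points, so these differ.
print(composed == decomposed)

# After normalising both to NFC they are the same string.
nfc_composed = unicodedata.normalize("NFC", composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
print(nfc_composed == nfc_decomposed)
```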

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.