python 2.7.12 on Linux behaving differently than on Windows

Fri Dec 9 02:41:51 EST 2016

On Fri, 9 Dec 2016 01:52 pm, Chris Angelico wrote:

> On Fri, Dec 9, 2016 at 12:34 PM, BartC <bc at freeuk.com> wrote:
>> With a case-sensitive file system, how do you search only for 'harry',
>> not knowing what combinations of upper and lower case have been used?
>> (It's a good thing Google search isn't case sensitive!)
> 
> This is handled by "case correlations", which aren't quite the same as
> case conversions. 

I think you mean to say, this *should* be handled by case correlations.

I expect that there is a lot of software on the planet that just blindly
converts both strings to lowercase, or both to uppercase, and then checks
equality. And even more that assumes ASCII case conversions and just does
bit-twiddling to compare letters.
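
To make the difference concrete, here is a rough sketch of the three
approaches side by side. The function names are my own, made up for
illustration, and str.casefold() needs Python 3.3 or later, so none of
this applies to 2.7:

```python
def naive_equal(a, b):
    # Blindly lowercase both strings and compare them.
    return a.lower() == b.lower()


def ascii_equal(a, b):
    # ASCII-only bit-twiddling: setting bit 0x20 maps A-Z onto a-z.
    # Every other character is compared unchanged.
    def fold(ch):
        return chr(ord(ch) | 0x20) if "A" <= ch <= "Z" else ch

    return [fold(c) for c in a] == [fold(c) for c in b]


def casefold_equal(a, b):
    # Compare the casefolded forms, which are designed for caseless matching.
    return a.casefold() == b.casefold()


print(naive_equal("Harry", "HARRY"))         # plain ASCII: all three agree
print(naive_equal("straße", "STRASSE"))      # lower() leaves ß alone
print(ascii_equal("É", "é"))                 # non-ASCII letters are not folded
print(casefold_equal("straße", "STRASSE"))   # ß casefolds to ss
```

The first and last calls report a match; the middle two do not, even
though a human reader would call both pairs "the same word".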

(By the way, case correlation... I've never come across that term before,
and googling doesn't find anything useful. Are you sure that's the right
term?)

> In Python, that's the .casefold() method. You 
> casefold the string "harry", and then casefold every file name that
> might potentially match, and see if they become the same. Taking my
> original example string:
> 
>>>> "ßẞıİÅσςσ".casefold()
> 'ssssıi̇åσσσ'

Your example string is a good demonstration of mojibake. Or possibly a *bad*
demonstration of mojibake, since I cannot imagine any real words that would
generate that via encoding/decoding into the wrong character set :-)

I'm not really sure what point you were trying to make with Bart. Globbing
doesn't support case-insensitive matches on any platform I know of, so I
don't think this is really relevant.

> Any string that casefolds to that same string should be considered a
> match. This does NOT stop you from having multiple such files in the
> directory.
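
As I understand the recipe, it amounts to something like the following
sketch. The helper name and the list of file names are invented for the
example; in real use the names might come from os.listdir():

```python
def caseless_matches(wanted, names):
    # Casefold the search term once, then compare it against the
    # casefolded form of every candidate name.
    key = wanted.casefold()
    return [name for name in names if name.casefold() == key]


# Hypothetical directory listing.
names = ["harry", "Harry", "HARRY.txt", "HaRrY", "harold"]
print(caseless_matches("harry", names))
```

Note that this can return more than one name, which is exactly the
"multiple such files in the directory" situation.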

If you want to support multilingual string comparisons, you have to do more
than just casefold(). You need to normalise the strings, otherwise 'café'
(with the precomposed é, U+00E9) and 'café' (with a plain e followed by the
combining acute accent, U+0301) will be treated as distinct.
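
A minimal sketch of what I mean, using the standard unicodedata module.
The function name is made up; the NFD, casefold, NFD sequence follows what
I believe the Unicode standard calls canonical caseless matching, but
treat it as a starting point rather than the last word:

```python
import unicodedata


def caseless_key(s):
    # Decompose, casefold, then decompose again, since casefolding can
    # produce characters that are no longer in normalised form.
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


composed = "caf\u00e9"     # é as the single code point U+00E9
decomposed = "cafe\u0301"  # e followed by combining acute accent U+0301

print(composed == decomposed)                              # False
print(composed.casefold() == decomposed.casefold())        # False
print(caseless_key(composed) == caseless_key(decomposed))  # True
```

casefold() on its own does not help here, because both strings are
already lowercase; only the normalisation step makes them compare equal.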

Frankly, I think that Apple HFS+ is the only modern file system that gets
Unicode right. Not only does it restrict file names to valid UTF-8
sequences, but it forces them to a canonical form to avoid the é é gotcha,
and treats file names as case preserving but case insensitive.

Finally, there's one last fly in the ointment for multilingual case
comparisons: Turkish i. Unfortunately, there's no clean way to do case
comparisons that works for "any arbitrary language".

Turkish, and one or two other languages, want dotless and dotted I to be
treated as distinct: ıI go together, and iİ go together. But other
languages want iI to go together, meaning that the standard case
conversions are lossy:

py> 'ıIiİ'.lower()
'ıiii'
py> 'ıIiİ'.upper()
'IIIİ'

Maybe it would have been better if the standard had kept ıİ together, and
iI, so at least the case conversion was lossless. Alas, too late now.
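
If you know in advance that your text is Turkish, you can work around it
by hand. Python's str methods take no locale argument, so the two helpers
below are purely hypothetical names of my own, and they handle only the
four I characters, nothing else that a real Turkish collation might need:

```python
def turkish_lower(s):
    # Map the two capital letters to their Turkish lowercase partners
    # before letting str.lower() handle everything else.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(s):
    # Map the two small letters to their Turkish uppercase partners
    # before letting str.upper() handle everything else.
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


pair = "\u0131i"  # dotless ı followed by dotted i

# The standard conversions merge the two letters on a round trip.
print(pair.upper().lower() == pair)                 # False

# The Turkish-aware helpers keep them apart.
print(turkish_lower(turkish_upper(pair)) == pair)   # True
```

Of course this gives the wrong answer for English text, which is the
whole problem: you have to know the language before you can pick the rule.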

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.