python 2.7.12 on Linux behaving differently than on Windows

Fri Dec 9 05:51:26 EST 2016

On Fri, Dec 9, 2016 at 6:41 PM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> On Fri, 9 Dec 2016 01:52 pm, Chris Angelico wrote:
>
>> On Fri, Dec 9, 2016 at 12:34 PM, BartC <bc at freeuk.com> wrote:
>>> With a case-sensitive file system, how do you search only for 'harry',
>>> not knowing what combinations of upper and lower case have been used?
>>> (It's a good thing Google search isn't case sensitive!)
>>
>> This is handled by "case correlations", which aren't quite the same as
>> case conversions.
>
> I think you mean to say, this *should* be handled by case correlations.
>
> I expect that there is a lot of software on the planet that just blindly
> converts the strings to both lower- or both uppercase and then checks
> equality. And even more that assumes ASCII case conversions and just does
> bit-twiddling to compare letters.

This is true. However, I would consider that to be buggy software, not
flawed design of file systems.

> (By the way, case correlation... I've never come across that term before,
> and googling doesn't find anything useful. Are you sure that's the right
> term?)

Hmm, now you mention it, I'm actually not sure where I got that term
from. But it doesn't matter what you call it; the point is that there
is a "case insensitive comparison equivalency" that is not the same as
merely converting the string to upper/lower. It's allowed to be lossy;
in fact, I would start the process with NFKC or NFKD normalization, to
eliminate any problems with ligatures and such.
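
To make that concrete, here's a minimal sketch of such a comparison,
assuming Python 3.3+ (str.casefold() doesn't exist in 2.7). The names
caseless_key() and caseless_equal() are made up for illustration, and
the normalize/casefold/normalize order is just one reasonable choice,
not the only way to do it:

```python
import unicodedata


def caseless_key(name):
    """Return a lossy comparison key for a name (hypothetical helper)."""
    # NFKD first, so ligatures and other compatibility forms are expanded.
    decomposed = unicodedata.normalize("NFKD", name)
    # casefold() is the aggressive, comparison-only cousin of lower().
    folded = decomposed.casefold()
    # Casefolding can produce text that is no longer normalized,
    # so normalize once more before comparing.
    return unicodedata.normalize("NFKD", folded)


def caseless_equal(a, b):
    """True if the two names should be treated as the same name."""
    return caseless_key(a) == caseless_key(b)
```

With this, "Straße" and "STRASSE" compare equal, as do "harry" and a
fullwidth "ＨＡＲＲＹ". The key is only for comparing; you'd still store
and display the original name.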

>> In Python, that's the .casefold() method. You
>> casefold the string "harry", and then casefold every file name that
>> might potentially match, and see if they become the same. Taking my
>> original example string:
>>
>>>>> "ßẞıİÅσςσ".casefold()
>> 'ssssıi̇åσσσ'
>
> Your example string is a good demonstration of mojibake. Or possibly a *bad*
> demonstration of mojibake, since I cannot imagine any real words that would
> generate that via encoding/decoding into the wrong character set :-)

Heh. It's more of a "stress test" file name, picking up edge cases
from several languages. It's plausible for a file name to have one or
two of those characters in it, and for different file names to have
different selections from that set, and for a single file system to
have to cope with all of those examples.

> I'm not really sure what point you were trying to make with Bart. Globbing
> doesn't support case-insensitive matches on any platform I know of, so I
> don't think this is really relevant.

Windows does case-insensitive matches. It has for as long as I've known it.
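
On a case-sensitive file system you can get a similar effect yourself.
A rough sketch, assuming Python 3.3+ and a hypothetical helper called
glob_caseless(); it folds both the pattern and each candidate name and
then does a case-sensitive fnmatch on the folded forms:

```python
import fnmatch


def glob_caseless(names, pattern):
    """Return the names matching pattern, ignoring case (hypothetical helper)."""
    # Caveat: folding the pattern also folds character classes such as
    # [A-Z], so this is only suitable for simple wildcard patterns.
    folded_pattern = pattern.casefold()
    return [
        name
        for name in names
        if fnmatch.fnmatchcase(name.casefold(), folded_pattern)
    ]
```

In practice you'd feed it os.listdir(); it takes a plain list here so
that it doesn't depend on any particular directory.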

>> Any string that casefolds to that same string should be considered a
>> match. This does NOT stop you from having multiple such files in the
>> directory.
>
> If you want to support multilingual string comparisons, you have to do more
> than just casefold(). You need to normalise the strings, otherwise 'café'
> and 'café' will be treated as distinct.

IMO you should be able to NFC normalize file names before they get
stored. The only reason Linux file systems currently can't is backward
compatibility: the file names might not represent text. But they are
*names*. Fundamentally, they are supposed to be meaningful. Mandating
that they be UTF-8 byte streams representing Unicode text is not
unreasonable.
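
Here's roughly what I mean, as a sketch only. normalize_for_storage()
is a hypothetical function, not something any real file system does;
it rejects byte strings that aren't valid UTF-8 and stores everything
else in NFC form:

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same text to a human, but different code points and different bytes.
same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed


def normalize_for_storage(raw_name):
    """Validate and NFC-normalize a file name given as bytes (hypothetical)."""
    # Strict decoding raises UnicodeDecodeError if the bytes are not text.
    text = raw_name.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

Both spellings of the name end up as the same bytes on disk, and a
name like b"\xff\xfe" is refused outright, which is exactly the
backward-compatibility break mentioned above.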

> Frankly, I think that Apple HFS+ is the only modern file system that gets
> Unicode right. Not only does it restrict file systems to valid UTF-8
> sequences, but it forces them to a canonical form to avoid the é é gotcha,
> and treats file names as case preserving but case insensitive.

Agreed. Other file systems and operating systems should pick this up.

> Lastly, there's one last fly in the ointment for multilingual case
> comparisons: Turkish i. Unfortunately, there's no clean way to do case
> comparisons that works for "any arbitrary language".

You may notice that my original string has a couple of examples of that :)
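
For anyone who wants to poke at it, a small sketch (Python 3.3+) of
how the language-neutral casefold treats the four letters involved.
The variable names are mine, and none of this is locale-aware:

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ

# ASCII I/i fold together, as everyone expects.
ascii_pair_matches = "I".casefold() == "i".casefold()

# Dotless ı folds to itself, so it never matches I, even though a
# Turkish reader considers I to be its uppercase form.
dotless_matches_capital = DOTLESS_SMALL_I.casefold() == "I".casefold()

# Dotted İ folds to 'i' plus COMBINING DOT ABOVE, so it does not match
# a plain 'i' either, although i is its lowercase form in Turkish.
dotted_folded = DOTTED_CAPITAL_I.casefold()
dotted_matches_small = dotted_folded == "i".casefold()
```

So the default fold gives the English answer for I/i and keeps the two
Turkish letters apart from both, which is wrong for Turkish text.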

ChrisA