Case-insensitive sorting of strings (Python newbie)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Jan 23 12:56:53 EST 2015


John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.

I don't think so. You're going to have to show an example.

I *think* what you might be running into is an artifact of printing to a
terminal, which may (or may not) interpret some byte sequences as UTF-8
characters, but I can't replicate it. So I'll have to see an example.
Please state what OS you are running on, and what encoding your terminal is
set to. Also, are you opening the file in text mode or binary mode?


> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
> 
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

Best practice is to always convert to Unicode, even if you know your text is
pure ASCII. You *may* be able to get away with not doing so if you know you
have ASCII, but that's still the lazy way. And of course you need to know
what encoding has been used.
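
As a minimal sketch of why the encoding matters (the byte string below is
made up for illustration, and I'm assuming the data is UTF-8), decoding the
same bytes with the right and the wrong codec gives quite different text:

```python
# Hypothetical sample: the UTF-8 bytes for an upper-case E-acute plus "COLE".
raw = b"\xc3\x89COLE"

right = raw.decode("utf-8")    # one accented letter followed by "COLE"
wrong = raw.decode("latin-1")  # two unrelated characters followed by "COLE"

# Lowercasing the properly decoded text handles the accented letter too.
lowered = right.lower()
print(lowered == u"\xe9cole")
```

Only the correctly decoded version lowercases the accented letter the way
a reader would expect.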

There is some overhead in decoding to Unicode, so if performance really is
critical, *and* your needs are simple, you may be able to get away with
just treating the strings as ASCII byte strings:

with open("my file.txt") as f:
    for line in f:
        # Trailing comma: each line already ends with a newline.
        print line.lower(),


will correctly lowercase ASCII strings. It won't lowercase non-ASCII
letters, and there's a good chance that they may display as raw bytes in
some encoding. Otherwise, I think the best way to approach this may be:


import io
with io.open("my file.txt", encoding='utf-8') as f:
    for line in f:
        # Each line is now a Unicode string; the trailing comma avoids
        # printing a second newline.
        print line.lower(),


Assuming the file actually is encoded with UTF-8, that ought to work
perfectly.
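
For the sorting itself, here is a rough sketch rather than a definitive
recipe. The helper names read_lines and sorted_case_insensitive are my own
inventions, and I'm assuming the file is UTF-8 with one string per line.
Once everything is Unicode, you can pass a key function to sorted:

```python
import io

def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines as Unicode strings."""
    with io.open(path, encoding=encoding) as f:
        # Strip the line ending so it doesn't take part in the comparison.
        return [line.rstrip(u"\r\n") for line in f]

def sorted_case_insensitive(strings):
    """Return a new list sorted by the lowercased form of each string."""
    return sorted(strings, key=lambda s: s.lower())

names = [u"banana", u"Apple", u"cherry"]
print(sorted_case_insensitive(names))
```

The strings keep their original case in the result; only the comparison
uses the lowercased form. Note that this orders by code point, not by the
rules of any particular language, so accented letters sort after "z".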

But to really know what is going on we will need more information.


-- 
Steven



