Case-insensitive sorting of strings (Python newbie)

Peter Otten __peter__ at web.de
Fri Jan 23 12:53:03 EST 2015


John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
> 
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

The standard recommendation is to convert bytes to unicode as early as
possible and only manipulate unicode. This is more likely to give correct
results when slicing a string or changing its case, because a byte string
knows nothing about multi-byte characters.
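
To see why slicing is safer on unicode, here is a minimal sketch (not part of
the original session; the variable names are made up for illustration). It
assumes UTF-8, where ä occupies two bytes, and should run on both Python 2.7
and Python 3:

```python
# -*- coding: utf-8 -*-
text = u"ähnlich"             # unicode: 7 characters
raw = text.encode("utf-8")    # bytes: 8 bytes, because "ä" needs two in UTF-8

first_char = text[:1]         # a complete character, u"ä"
first_byte = raw[:1]          # only the first half of the encoded "ä"

# The one-byte slice is not valid UTF-8 on its own.
try:
    first_byte.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False
```

Slicing the unicode string always yields whole characters, while the same
slice on the byte string can cut a character in half.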

$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
... 
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich
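
The last line of that output can be reproduced without a file. A small sketch,
assuming UTF-8 input: lower() on a byte string only changes the ASCII letters
A-Z (in Python 2 it may additionally depend on the locale), so the two bytes
that encode Ä pass through unchanged.

```python
# -*- coding: utf-8 -*-
raw = u"ÄHNLICH".encode("utf-8")   # what a plain open() hands you in Python 2

lowered_bytes = raw.lower()                 # touches only the ASCII letters
lowered_text = raw.decode("utf-8").lower()  # decode first, then lowercase

print(lowered_bytes.decode("utf-8"))  # Ähnlich
print(lowered_text)                   # ähnlich
```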

Now the same with unicode. To read text with a specific encoding use either
codecs.open() or io.open() instead of the built-in open() (replace utf-8 with
your actual encoding):

>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"): 
...     line = line.strip()
...     print line, line.lower()
... 
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich
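
A self-contained variant of that session, as a sketch: it writes a small UTF-8
file to a temporary location (the file and its two words are placeholders, not
the tmp.txt from the example) and reads it back with io.open(), so every line
arrives as unicode.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Create a throwaway UTF-8 file so the example does not depend on tmp.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"ÄHNLICH\nMaß\n")

# io.open() decodes while reading; each line is unicode, not bytes.
with io.open(path, encoding="utf-8") as f:
    words = [line.strip() for line in f]
os.remove(path)

lowered = [w.lower() for w in words]
```

For reading, codecs.open(path, encoding="utf-8") can be used in the same way.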

Unfortunately, sorting on the lowercased strings will not give the order that
you (or a German speaker, in the example below) will probably expect:

>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich
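
The reason for this order is that unicode strings are compared code point by
code point, and ä (U+00E4) has a higher code point than z (U+007A). A short
sketch with a hand-picked subset of the words:

```python
# -*- coding: utf-8 -*-
words = [u"ähnlich", u"möglich", u"Masse"]

# Lowercasing removes the case difference but not the code point order.
by_code_point = sorted(words, key=lambda w: w.lower())

# "ä" sorts after every ASCII letter, including "z".
umlaut_after_z = ord(u"ä") > ord(u"z")   # 228 > 122
```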

For case-insensitive sorting you get better results with locale.strxfrm() --
but in Python 2 this doesn't accept non-ASCII unicode:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

As a workaround you can sort the undecoded byte strings first:

>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich

You should still decode the sorted lines to unicode if you want to do further
processing in Python.
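
If you would rather keep unicode strings throughout, one possible alternative
(a sketch, not something tried in the session above) is a key function that
encodes each string only for the comparison. The helper names locale_key and
locale_sorted and the default encoding are assumptions for illustration; the
encoding must match the one your locale uses. On Python 3, locale.strxfrm()
accepts text directly, so the helper passes it through unchanged.

```python
import locale
import sys

def locale_key(text, encoding="utf-8"):
    """Hypothetical helper: collation key for a unicode string."""
    if sys.version_info[0] < 3:
        # Python 2: strxfrm() wants a byte string in the locale's encoding.
        return locale.strxfrm(text.encode(encoding))
    # Python 3: strxfrm() takes text.
    return locale.strxfrm(text)

def locale_sorted(words, encoding="utf-8"):
    """Sort unicode strings by the current LC_COLLATE rules."""
    return sorted(words, key=lambda w: locale_key(w, encoding))
```

Call locale.setlocale(locale.LC_ALL, "") once beforehand; without it the
default "C" locale applies, which knows nothing about German collation. The
resulting order depends on the locales installed on your system.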



