Case-insensitive sorting of strings (Python newbie)
Peter Otten
__peter__ at web.de
Fri Jan 23 12:53:03 EST 2015
John Sampson wrote:
> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.
The standard recommendation is to convert bytes to unicode as early as
possible and only manipulate unicode. This is more likely to give correct
results when slicing or converting a string.
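As a small illustration of why slicing is safer on unicode, here is a sketch that is not part of the original session. The sample bytes are made up for the example, and it assumes the data is UTF-8:

```python
# UTF-8 bytes for "Ma" followed by sharp s (U+00DF), as they would come from a file.
raw = b"Ma\xc3\x9f"
# Decode once, right after reading (assumes the file is UTF-8).
text = raw.decode("utf-8")

print(len(raw))    # counts bytes
print(len(text))   # counts characters

# Slicing the byte string can cut a multi-byte character in half ...
broken = raw[:3]
# ... while slicing the unicode string always keeps whole characters.
whole = text[:3]

print(repr(broken))
print(repr(whole))
```

The byte slice ends in a lone \xc3, which is not valid UTF-8 on its own; the unicode slice contains the complete word.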
$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich
Note the last line: applied to a byte string, lower() leaves the Ä unchanged.
Now the same with unicode. To read text with a specific encoding use either
codecs.open() or io.open() instead of the built-in open() (replace utf-8 with
your actual encoding):
>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich
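For comparison, here is a sketch of the codecs.open() alternative mentioned above. It writes its own temporary sample file (a stand-in for tmp.txt, not the file from the session) so that it can be run on its own:

```python
import codecs
import io
import os
import tempfile

# Create a small UTF-8 sample file (hypothetical stand-in for tmp.txt).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"\xc4HNLICH\nMa\xdf\n".encode("utf-8"))

# Both calls decode while reading, so every line is a unicode string.
with io.open(path, encoding="utf-8") as f:
    via_io = [line.strip().lower() for line in f]

with codecs.open(path, encoding="utf-8") as f:
    via_codecs = [line.strip().lower() for line in f]

os.remove(path)
print(via_io == via_codecs)
```

Either way the lines are unicode, lower() also handles the non-ASCII letters, and the method can be used as a sort key.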
Unfortunately, sorting with unicode.lower as the key will not give the order
that you (or a German speaker, in the example below) would probably expect:
>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich
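The order above comes from comparing the lowercased strings code point by code point. A minimal sketch with made-up sample words (not the contents of tmp.txt) shows the effect:

```python
# Sample words chosen for illustration only.
words = [u"\xe4hnlich", u"M\xf6glich", u"zebra", u"Masse"]

# Case-insensitive sort; the comparison is still by code point.
by_code_point = sorted(words, key=lambda s: s.lower())

# a-umlaut is U+00E4 (228), which is greater than "z" (122),
# so it ends up after every unaccented letter.
print(ord(u"\xe4"), ord(u"z"))
print(by_code_point)
```

Here the word starting with ä lands after "zebra", which is not what a dictionary would do.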
For case-insensitive sorting you get better results with locale.strxfrm() --
but in Python 2 this doesn't accept non-ASCII unicode strings:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
As a workaround you can sort the byte strings first and decode afterwards:
>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich
You should still decode the sorted byte strings to unicode if you want to do
further processing in Python.
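A variant of this workaround is to keep the strings as unicode and encode them only inside the sort key. The following is a sketch, not something from the session above: the name locale_sorted and its encoding parameter are made up for illustration, the encoding is assumed to match the locale's, and the Python 3 branch is only there so the sketch runs on either version.

```python
import locale
import sys


def locale_sorted(lines, encoding="utf-8"):
    """Sort unicode strings by the current locale's collation rules.

    Hypothetical helper; `encoding` should match the locale's encoding.
    """
    if sys.version_info[0] == 2:
        # Python 2: strxfrm() fails on non-ASCII unicode,
        # so encode each string for the key only.
        key = lambda s: locale.strxfrm(s.encode(encoding))
    else:
        # Python 3: strxfrm() takes str directly.
        key = locale.strxfrm
    return sorted(lines, key=key)


# Usage (assumes a suitable locale such as de_DE.UTF-8 is installed):
# locale.setlocale(locale.LC_ALL, "")
# words = locale_sorted([u"\xfcblich", u"Masse", u"\xe4hnlich"])
print(locale_sorted([u"b", u"a", u"c"]))
```

The list elements stay unicode throughout, so no separate decoding step is needed after sorting.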