Case-insensitive sorting of strings (Python newbie)
wxjmfauth at gmail.com
Sat Jan 24 05:34:43 EST 2015
On Friday, 23 January 2015 18:54:11 UTC+1, Peter Otten wrote:
> John Sampson wrote:
>
> > I notice that the string method 'lower' seems to convert some strings
> > (input from a text file) to Unicode but not others.
> > This messes up sorting if it is used on arguments of 'sorted' since
> > Unicode strings come before ordinary ones.
> >
> > Is there a better way of case-insensitive sorting of strings in a list?
> > Is it necessary to convert strings read from a plaintext file
> > to Unicode? If so, how? This is Python 2.7.8.
>
> The standard recommendation is to convert bytes to unicode as early as
> possible and only manipulate unicode. This is more likely to give correct
> results when slicing or converting a string.
>
> $ cat tmp.txt
> ähnlich
> üblich
> nötig
> möglich
> Maß
> Maße
> Masse
> ÄHNLICH
> $ python
> Python 2.7.6 (default, Mar 22 2014, 22:59:56)
> [GCC 4.8.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> for line in open("tmp.txt"):
> ...     line = line.strip()
> ...     print line, line.lower()
> ...
> ähnlich ähnlich
> üblich üblich
> nötig nötig
> möglich möglich
> Maß maß
> Maße maße
> Masse masse
> ÄHNLICH Ähnlich
>
> Now the same with unicode. To read text with a specific encoding use either
> codecs.open() or io.open() instead of the built-in (replace utf-8 with your
> actual encoding):
>
> >>> import io
> >>> for line in io.open("tmp.txt", encoding="utf-8"):
> ...     line = line.strip()
> ...     print line, line.lower()
> ...
> ähnlich ähnlich
> üblich üblich
> nötig nötig
> möglich möglich
> Maß maß
> Maße maße
> Masse masse
> ÄHNLICH ähnlich
>
> Unfortunately this will not give the order that you (or a German speaker in
> the example below) will probably expect:
>
> >>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
> Masse
> Maß
> Maße
> möglich
> nötig
> ähnlich
> ÄHNLICH
> üblich
>
> For case-insensitive sorting you get better results with locale.strxfrm() --
> but this doesn't accept unicode:
>
> >>> import locale
> >>> locale.setlocale(locale.LC_ALL, "")
> 'de_DE.UTF-8'
> >>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position
> 0: ordinal not in range(128)
>
> As a workaround you can sort the byte strings first:
>
> >>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
> ähnlich
> ÄHNLICH
> Maß
> Masse
> Maße
> möglich
> nötig
> üblich
>
> You should still convert the result to unicode if you want to do further
> processing in Python.
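
For comparison, a minimal sketch of the same workflow on Python 3, where
locale.strxfrm() accepts text strings, so decoding on input and sorting by
locale can be combined. The helper names read_words and locale_sorted are
hypothetical, the sample file is written to a temporary path only to keep
the sketch self-contained, and the resulting order depends on the locale
configured in the environment (with a fallback to "C"), so no particular
order is claimed here.

```python
import io
import locale
import os
import tempfile

WORDS = ["ähnlich", "üblich", "nötig", "möglich",
         "Maß", "Maße", "Masse", "ÄHNLICH"]


def read_words(path, encoding="utf-8"):
    # Decode at the boundary: io.open() yields text lines, not bytes.
    with io.open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


def locale_sorted(words):
    # Assumption: the environment names a usable locale; if it does
    # not, fall back to the "C" locale (plain code point order).
    try:
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        locale.setlocale(locale.LC_COLLATE, "C")
    return sorted(words, key=locale.strxfrm)


# Write the sample data to a temporary file (stand-in for tmp.txt).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("\n".join(WORDS) + "\n")
    path = tmp.name

words = read_words(path)
os.remove(path)

result = locale_sorted(words)
print(result)
```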
-------
Hard drive archeology. Python 2 and Python 3.
One way (among others) to do this is to use the Unicode
Collation Algorithm (UCA) with the Default Unicode Collation
Element Table (DUCET).
Here it is in action with a reduced character set (Latin only,
somewhat more than 1000 code points) taken from allkeys.txt.
Dirty work.
I added the French word éléphant.
code:
[...]
li = ['ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße',
      'möglich', 'nötig', 'üblich']
li.insert(0, 'éléphant')
print(li)
r = sorted(li, key=c.tri)
print(r)
[...]
output:
>c:\python32\pythonw -u "unicodecollation.py"
['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße', 'möglich', 'nötig', 'üblich']
['ähnlich', 'ÄHNLICH', 'éléphant', 'Maß', 'Maße', 'Masse', 'möglich', 'nötig', 'üblich']
>Exit code: 0
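
The elided code is not reproduced here. As a rough illustration of the
multi-level comparison the UCA performs (letters first, then accents, then
case), here is a minimal sketch that does not use allkeys.txt at all: it
approximates the three levels with unicodedata decomposition and
str.casefold(). The function name collation_key and the "*" marker for
expanding letters are assumptions of this sketch, not part of the UCA or of
the code above, and it needs Python 3.3 or later for casefold().

```python
import unicodedata


def collation_key(text):
    """Hypothetical three-level key: letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent is a secondary difference on the preceding letter.
            if secondary:
                secondary[-1] += ch
            continue
        folded = ch.casefold()  # 'A' -> 'a', 'ß' -> 'ss'
        # Assumption of this sketch: a letter that expands when folded
        # (such as 'ß') gets a non-empty secondary marker, so it sorts
        # after the plain spelling when the letters are otherwise equal.
        marker = "*" if len(folded) > 1 else ""
        for base in folded:
            primary.append(base)
            secondary.append(marker)
            tertiary.append(ch.isupper())
    return "".join(primary), tuple(secondary), tuple(tertiary)


li = ['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße',
      'möglich', 'nötig', 'üblich']
r = sorted(li, key=collation_key)
print(r)
```

With this sketch Masse sorts before Maße, as in the locale.strxfrm() result
quoted above, which differs from the order in the output printed above.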
---
Why continue to waste time with this product?
With its ridiculous(?), absurd(?), ASCII-centric, non-standard,
(definitely) buggy Unicode implementation?
It has become useful only for pedagogical purposes.
jmf