Case-insensitive sorting of strings (Python newbie)

Sat Jan 24 05:34:43 EST 2015

On Friday, 23 January 2015 at 18:54:11 UTC+1, Peter Otten wrote:
> John Sampson wrote:
> 
> > I notice that the string method 'lower' seems to convert some strings
> > (input from a text file) to Unicode but not others.
> > This messes up sorting if it is used on arguments of 'sorted' since
> > Unicode strings come before ordinary ones.
> > 
> > Is there a better way of case-insensitive sorting of strings in a list?
> > Is it necessary to convert strings read from a plaintext file
> > to Unicode? If so, how? This is Python 2.7.8.
> 
> The standard recommendation is to convert bytes to unicode as early as 
> possible and only manipulate unicode. This is more likely to give correct 
> results when slicing or converting a string.
> 
> $ cat tmp.txt
> ähnlich
> üblich
> nötig
> möglich
> Maß
> Maße
> Masse
> ÄHNLICH
> $ python
> Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
> [GCC 4.8.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> for line in open("tmp.txt"):
> ...     line = line.strip()
> ...     print line, line.lower()
> ... 
> ähnlich ähnlich
> üblich üblich
> nötig nötig
> möglich möglich
> Maß maß
> Maße maße
> Masse masse
> ÄHNLICH Ähnlich
> 
> Now the same with unicode. To read text with a specific encoding use either 
> codecs.open() or io.open() instead of the built-in (replace utf-8 with your 
> actual encoding):
> 
> >>> import io
> >>> for line in io.open("tmp.txt", encoding="utf-8"): 
> ...     line = line.strip()
> ...     print line, line.lower()
> ... 
> ähnlich ähnlich
> üblich üblich
> nötig nötig
> möglich möglich
> Maß maß
> Maße maße
> Masse masse
> ÄHNLICH ähnlich
> 
> Unfortunately this will not give the order that you (or a German speaker in 
> the example below) will probably expect:
> 
> >>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
> Masse
> Maß
> Maße
> möglich
> nötig
> ähnlich
> ÄHNLICH
> üblich
> 
> For case-insensitive sorting you get better results with locale.strxfrm() -- 
> but this doesn't accept unicode:
> 
> >>> import locale
> >>> locale.setlocale(locale.LC_ALL, "")
> 'de_DE.UTF-8'
> >>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 
> 0: ordinal not in range(128)
> 
> As a workaround you can sort the undecoded byte strings first:
> 
> >>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
> ähnlich
> ÄHNLICH
> Maß
> Masse
> Maße
> möglich
> nötig
> üblich
> 
> You should still convert the result to unicode if you want to do further 
> processing in Python.

-------
Hard drive archeology. Python 2 and Python 3.
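
On the Python 3 side, the "decode early, then only handle text" advice
from the quoted session can be sketched as below. This is an
illustration, not code from the thread: the byte string `raw` is a
stand-in for tmp.txt, and UTF-8 is an assumed encoding. It also shows
that a case-insensitive key alone still sorts the umlauts after 'z'.

```python
# Stand-in for the contents of tmp.txt (assumption: the file is UTF-8).
raw = "ähnlich\nüblich\nnötig\nmöglich\nMaß\nMaße\nMasse\nÄHNLICH\n".encode("utf-8")

# Decode once, at the boundary, and work with str from here on.
words = raw.decode("utf-8").splitlines()

# str.casefold() is a stronger lower() meant for caseless comparison;
# for example it maps 'ß' to 'ss'.
by_casefold = sorted(words, key=str.casefold)

# Case is ignored, but the order is still by code point, so the words
# starting with 'ä' and 'ü' end up at the bottom of the list.
print(by_casefold)
```

The key only removes case differences; it does nothing about collation,
which is what the rest of this post is about.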

One way (among others) to handle this is the Unicode
Collation Algorithm (UCA) with its Default Unicode Collation
Element Table (DUCET).

Here it is in action with a reduced character set (Latin only,
more than ~1000 code points) taken from allkeys.txt. Dirty work.
I added the French word éléphant.

code:

[...]

    li = ['ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße',
          'möglich', 'nötig', 'üblich']
    li.insert(0, 'éléphant')
    print(li)
    r = sorted(li, key=c.tri)
    print(r)

[...]

output:

>c:\python32\pythonw -u "unicodecollation.py"

['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße', 'möglich', 'nötig', 'üblich']
['ähnlich', 'ÄHNLICH', 'éléphant', 'Maß', 'Maße', 'Masse', 'möglich', 'nötig', 'üblich']
>Exit code: 0
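
Since the script itself is elided, here is a rough Python 3 sketch of
the multi-level comparison idea behind the UCA. It is NOT the DUCET and
not the code that produced the output above: `collation_key` is a
hypothetical helper that derives its weights from NFD decomposition and
str.casefold() instead of reading allkeys.txt. Base letters are compared
first, then accents, then case.

```python
import unicodedata

def collation_key(text):
    """Three-level sort key loosely modelled on the UCA (not the real DUCET)."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent belongs to the preceding base letter (level 2).
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        folded = ch.casefold()  # 'M' -> 'm', 'ß' -> 'ss'
        for unit in folded:
            primary.append(unit)              # level 1: base letter
            secondary.append(())              # level 2: accents, filled in later
            tertiary.append(ch != folded)     # level 3: unchanged letters first
    return primary, secondary, tertiary

li = ['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße',
      'möglich', 'nötig', 'üblich']
ordered = sorted(li, key=collation_key)
print(ordered)
```

This sketch treats 'ß' as a case-level variant of 'ss', so it places
Masse before Maße, unlike the output above. That is a consequence of the
simplification chosen here, not a statement about what the DUCET
prescribes.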

---

Why continue to waste time with this product, with its
ridiculous(?), absurd(?), ASCII-centric, non-standard,
(definitely) buggy Unicode implementation?

It is now only good for pedagogical purposes.

jmf