Sorting strings containing special characters (German 'Umlaute')

Peter Otten __peter__ at web.de
Fri Mar 2 09:25:49 EST 2007


DierkErdmann at mail.com wrote:

> I know that this topic has been discussed in the past, but I could not
> find a working solution for my problem: sorting (lists of) strings
> containing special characters like "ä", "ü", ... (German umlauts).
> Consider the following list:
> l = ["Aber", "Beere", "Ärger"]
> 
> For sorting the letter "Ä" is supposed to be treated like "Ae",

I don't think so. With the German locale active, "Ä" is sorted like "A",
not like "Ae":

>>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
['Ara', '\xc3\x84rger', 'Ast']

>>> sorted(["Ast", "Aerger", "Ara"])
['Aerger', 'Ara', 'Ast']
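
If you really do want "Ä" to sort like "Ae" (the phone book convention
rather than the dictionary one), the locale collation shown here won't give
you that, and you would have to expand the umlauts yourself. A minimal
sketch in Python 3 syntax; `phonebook_key` and its table are hypothetical
names made up for this example, not part of the locale module:

```python
# Hypothetical helper: expand umlauts before comparing,
# so that "Ä" is compared as "Ae".
_EXPANSIONS = str.maketrans({
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
})

def phonebook_key(word):
    # casefold() makes the comparison case-insensitive
    return word.translate(_EXPANSIONS).casefold()

words = ["Aber", "Beere", "Ärger"]
result = sorted(words, key=phonebook_key)
print(result)
```

This does not depend on the locale at all, which also means it ignores every
other collation rule the locale would apply.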

> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]
> 
> I know about the module locale and its method strcoll(string1,
> string2), but currently this does not work correctly for me. Consider
>      >>> locale.strcoll("Ärger", "Beere")
>      1
> 
> Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
> Can someone help?
> 
> Btw: I'm using WinXP (german) and
>      >>> locale.getdefaultlocale()
> prints
>    ('de_DE', 'cp1252')

Python does not apply the default locale automatically; you have to set it
explicitly:

>>> import locale
>>> locale.strcoll("Ärger", "Beere")
1
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> locale.strcoll("Ärger", "Beere")
-1
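
The same steps wrapped in a small function, in Python 3 syntax, in case the
locale name differs on your machine. This is only a sketch: `sort_german`
and the list of candidate names are my own assumptions, and which names are
accepted depends on the operating system.

```python
import locale

def sort_german(words, candidates=("de_DE.UTF-8", "de_DE", "")):
    """Sort words using the first collation locale that can be set."""
    for name in candidates:
        try:
            # "" means: use the user's default locale from the environment
            locale.setlocale(locale.LC_COLLATE, name)
            break
        except locale.Error:
            continue  # not available here, try the next candidate
    # If none could be set, the previously active locale stays in effect.
    return sorted(words, key=locale.strxfrm)

result = sort_german(["Aber", "Beere", "Ärger"])
print(result)
```

Where "Ärger" ends up depends on which locale could actually be set, so no
expected output is given here.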

By the way, you will avoid a lot of "Ärger"* if you use unicode right from
the start.
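
To illustrate: the '\xc3\x84rger' in the sorted() output is the UTF-8 byte
sequence for "Ärger". In Python 3 notation (where str is already unicode),
decoding once at the input boundary looks roughly like this; the variable
names are made up for the example:

```python
raw = b"\xc3\x84rger"         # bytes as they might arrive from a UTF-8 file
text = raw.decode("utf-8")    # decode once, as early as possible

assert text == "Ärger"
assert len(raw) == 6 and len(text) == 5   # bytes vs. characters

# The same word in cp1252, the code page from the original question:
cp1252_bytes = text.encode("cp1252")
assert cp1252_bytes == b"\xc4rger"
```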

Finally, for efficient sorting, a key function is preferable over a cmp
function:

>>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
['Ara', '\xc3\x84rger', 'Ast']
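
The reason is that a key function is called once per item, while a cmp
function is called once per comparison. A rough illustration in Python 3,
where the cmp argument is gone and `functools.cmp_to_key` stands in for it.
The counting wrappers are mine, and the words are plain ASCII so the result
does not depend on the active locale:

```python
import functools
import locale

calls = {"key": 0, "cmp": 0}

def counting_key(s):
    calls["key"] += 1
    return locale.strxfrm(s)

def counting_cmp(a, b):
    calls["cmp"] += 1
    return locale.strcoll(a, b)

words = ["Ast", "Beere", "Ara", "Aber", "Zug", "Haus", "Dach", "Eis"]

by_key = sorted(words, key=counting_key)
by_cmp = sorted(words, key=functools.cmp_to_key(counting_cmp))

assert by_key == by_cmp
assert calls["key"] == len(words)       # exactly one call per item
assert calls["cmp"] >= len(words) - 1   # at least n - 1 comparisons
```

For a handful of words the difference is negligible; it grows with the
length of the list.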

Peter

(*) German for "trouble"


