case-insensitive and internationalized sort

Thu Dec 19 17:27:52 EST 2002

"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:attdhe$bot$05$1 at news.t-online.com...
> > The fact that the built-in list sort method is case-sensitive seems to
be a
> > recurring topic. If a named parameter is going to be added to the sort
> > method, it would probably require a PEP and discussion on python-dev
before
> > it was accepted. But since sort() doesn't do what a lot of people expect
I
> > would like to discuss the issues here first.
>
> I disagree that sorting a list of strings doesn't do what most people
> expect; I also disagree that sorting in a case-insensitive manner would
> be the right solution.
>
> Sorting L produces an order such that, for every n, L[n] <= L[n+1]. I
> think this is what most people expect when they talk about sorting.
>
> So it is the comparison that you are unhappy with, not the sorting.
> While the definition of string comparison may not be intuitive, the
> lexicographical ordering that Python implements is really the only
> sensible thing.

Correct, it is the compare that I take issue with. I've been looking through
my code and I can't find any instances where when comparing strings I was
interested in an ordinal value comparison. Whether it is filenames or lists
in HTML, lists of strings for a UI, etc. ignoring case is what the end user
expects to see when dealing with strings. Sorting strings with the default:

>>> s = ['c', 'C', 'b', 'a', 'A', 'B']
>>> s.sort()
>>> s
['A', 'B', 'C', 'a', 'b', 'c']

is just so 1970s. ;-)

> >   s.sort(lambda a, b: cmp(a.upper(), b.upper()))
> >
> > Having a separate function may be easier to read, but are there speed
> > differences or other trade-offs?
>
> There won't be any difference between a a lambda function and a proper
> function. Notice, however, that this computes L[n].upper() quite often,
> so you may want to avoid the cost of uppercasing every string over and
> over again. Why not make a list of pairs
>
> s = [(x.upper(), x) for x in s]
>
> It turns out that the builtin comparison of such tuples does what you
> want:
>    cmp ((x1upper, x1),(x2upper,x2)) is cmp(x1upper,x2upper) if this
>    is nonzero, and is cmp(x1,x2) if both upper-case strings compare
>    equal
>
> So this will sort the list in a way that is case insensitive; if two
> strings differ only in case, it will sort the strings in a
> case-sensitive manner.

That looks good, but I'm a bit slow. I'm not seeing how to create just a
compare function so I can sort in place. This is the closest I could I do
and it requires a standalone function and the temporary creation of another
list:

>>> def caseinsensitive(list):
...     list = [(x.upper(), x) for x in list]
...     list.sort()
...     return [x[1] for x in list]
...
>>> s = [u'ö', u'ä', u'Ä', 'b', 'a', 'B', u'a', 'A']
>>> s = caseinsensitive(s)
>>> s
['A', 'a', u'a', 'B', 'b', u'\xc4', u'\xe4', u'\xf6']

Is there a simpler way I'm just not seeing one that avoids a lot of extra
upper() calls? The list creation overhead may be worse than extra upper
calls. I didn't have any succes trying to incorporate locale.strcoll or
locale.strxfrm. I kept getting an exception

UnicodeError: ASCII encoding error: ordinal not in range(128)

So, some other string encoding transformation must be necessary before
list.sort() is called in the function? Those functions were new to me, but
they sound like what I want since locale-aware sorting would be a good
addition.

> > Now the next point is that it would be nice to be able to get a
> > case-insensitive sort, which seems to be the most likely thing you want
to
> > do when sorting strings. If an optional casesensitive arg was added to
> > sort() then without breaking any old code you could do:
> >
> > s.sort(casesensitive=False)
>
> This won't fly. What if you are not sorting a list of strings, but, say,
> a list of numbers?

I retract the suggestion since it is the default cmp when applied to strings
that I take issue with.

> > The final point is that the solution above still doesn't handle the
> > characters a umlaut and A umlaut correctly.
>
> Define "correctly". Different languages sort accented characters in
> different ways: either put them after all other letters (Swedish, I
> believe, and French for their accented characters), or sort them
> immediately after the unaccented form (German DIN sorting), or
> sort them as-if transliterated (ie. sort Löwis as Loewis, old German
> phonebook sorting). Since we have two sorting orders in Germany for
> accented characters, I always forget how they work, and have to look in
> the dictionary to see what sorting procedure it uses.
>
> In any case, if you want to use locale-aware sorting, you need to use
> the locale.strcoll function (use setlocale before using that function).
> Since strcoll is expensive, you can use locale.strxfrm on each string
> to transform it into a byte string for which lexicographical sorting
> honors collation order.
>
> > Anyone up for a casesensitive flag addition to the sort() method?
>
> Please, no.
>
> Notice that strcoll and strxfrm aren't the most recent approach to
> string collation: If you use Unicode and the IBM ICU library, they'll
> offer extensive collation support where you can chose between various
> collation conventions at run-time.
>
> Regards,
> Martin

Thanks Martin,

ka