Sorting strings.

Thu Feb 8 07:29:21 EST 2001

"Gaute B Strokkenes" <gs234 at cam.ac.uk> wrote in message
news:4a1yt93frj.fsf at kern.srcf.societies.cam.ac.uk...
    [snip]
> After all this is done, it is time to sort the entries in appropriate
> order before we print them out again.  Currently I do this:
>
> def entry_cmp(a, b):
>     if a.author < b.author:
>         return -1
>     if a.author > b.author:
>         return 1
>     if a.title < b.title:
>         return -1
>     if a.title > b.title:
>         return 1
>     if a.prefix < b.prefix:
>         return -1
>     if a.prefix > b.prefix:
>         return 1
>     if a.code < b.code:
>         return -1
>     if a.code > b.code:
>         return 1
>     return 0
>
> ...
>
> ll.sort(entry_cmp)
>
> The problem with this approach is that string comparison doesn't quite
> have the behaviour that I would like.  For instance, I'd like numbers
> to come after letters, and I'd like to be case insensitive.  Is there
> a reasonably clean way to do this?

Not only there is, but it may make your whole program faster.  The
key idea is to apply sort, *WITHOUT* an argument, to an _auxiliary_
list, whose items are tuples -- each corresponding to a "suitably
transformed version" of the corresponding item you want to sort,
followed by said 'corresponding item'.  After the sort, you pick
the 'corresponding item's only, discarding the auxiliary fields
that you computed once just for sorting purposes.  (I'm used to
calling this design pattern a 'Schwartzian transform' but every time
I do so somebody nitpicks about it:-).

To ensure comparison is case-insensitive you need to map all
letters to (e.g.) uppercase, and to ensure digits come after
letters you must map them to characters that follow the letters
in the alphabet (if you map letters to uppercase, you may at
the same time map digits 0=9 to e.g. letters a-j, which do
come after the uppercase ones).

For these transformations, you use two functions in the
string module -- maketrans prepares a transformation matrix,
translate applies it.  You'll have problems in locales where
the numbers of uppercase and lowercase letters differ, but
that is a complication best met separately.  In normal locales:

import string
assert(len(string.lowercase) == len(string.uppercase))
from =  string.lowercase + string.digits
onto = (string.uppercase + string.lowercase)[:len(from)]
transform = string.maketrans(from, onto)

and now you can call string.translate(x, transform) to
get the transformed version of any string x.

So, your sort task can now become:

def mysort(alist):
    def ok(x):
        return string.translate(x, transform)
    temp = [(ok(x.author), ok(x.title),
        ok(x.prefix), ok(x.code), x) for x in alist]
    temp.sort()
    return [ x[-1] for x in temp ]

and you'll call it as

li = mysort(li)

The order of items with identical (transformed) author,
title, prefix and code will vary between the two versions,
but since .sort is not a stable sort you may not care all
that much anyway.  It's easy to make THIS sort stable,
by the way -- just consider "index" as the penultimate
entry in each tuple, e.g.:

    temp = [(ok(alist[i].author), ok(alist[i].title),
        ok(alist[i].prefix), ok(alist[i].code), i)
            for i in range(len(alist))]
    temp.sort()
    return [ alist[x[-1]] for x in temp ]

I would consider this 'reasonably clean' or better... it
makes runtime generally faster, and source code more compact,
while adding more flexibility to how sorting is performed...
few program-design 'trade-offs' are as one-sided as this!-)

Alex