Need Help sorting alphabetically.

Alex Martelli aleaxit at yahoo.com
Tue Dec 19 04:59:56 EST 2000


"Chris Arai" <chris at araidesign.com> wrote in message
news:3A3F1A91.A7ED9182 at araidesign.com...
> Hi,
>
> This seems like it should be obvious, but it is evading me!
>
> I would like to sort alphabetically (I'm not sure that is the correct
> term) so that strings of alpha numerics sort alphabetically instead of
> by ASCII order.
>
> eg:
> >>> m=['Pg95_100.GIF',
> 'Pg9_100.GIF','Pg95A_100.gif','Pg9A_100.gif','Pg10_100.gif']
> >>> m.sort()
> >>> m
> ['Pg10_100.gif', 'Pg95A_100.gif', 'Pg95_100.GIF', 'Pg9A_100.gif',
> 'Pg9_100.GIF']
>
>
> I would like the order to be
> ['Pg9_100.GIF', 'Pg9A_100.gif', 'Pg10_100.gif, 'Pg95_100.GIF',
> Pg95A_100.gif' ]
>
> What is the obvious answer??

First: you have to *precisely* define, for each string, what
'data equivalent to it' you ARE comparing for your sorting
purposes.  This is no doubt the major task you face.

For your specific example, e.g., it would seem that the
'equivalent data' to each string is an integer made up
from the digits between a leading 'Pg' prefix and some
following non-digits, then the following non-digits if
any up to but not including a '_', then the rest - but
your criteria might well be very different.  First of all,
nail down those criteria.

Extraction of 'equivalent data' can be done by using
regular expressions, or in other ways.  RE's can be
very compact and pretty fast, but there may be other
simpler ways, depending on what exactly you're doing.

Assuming RE's, then, for example:

import re
cre = re.compile('(\D*)(\d+)([^_]*)_(.*)')

We match 4 parenthesized groups:
    0 or more non-digits (the prefix),
    1 or more digits (the number),
    0 or more non-underlines,
    whatever comes after the underling
Of these, the second field needs to be
converted to an int, the other will
remain strings.

def eqv1(astring):
    mo = cre.match(astring)
    if not mo: raise ValueError, "no match: %r"%astring
    groups = mo.groups()
    return (groups[0], int(groups[1])) + groups[2:]

if __name__=='__main__':
    for s in m:
        print s, eqv1(s)

Running this will confirm we're extracting the right
'equivalent' data:

D:\PySym>python ccc.py
Pg95_100.GIF ('Pg', 95, '', '100.GIF')
Pg9_100.GIF ('Pg', 9, '', '100.GIF')
Pg95A_100.gif ('Pg', 95, 'A', '100.gif')
Pg9A_100.gif ('Pg', 9, 'A', '100.gif')
Pg10_100.gif ('Pg', 10, '', '100.gif')

D:\PySym>

Actually, you'll probably want to lowercase some of
this, too, always for comparison purposes -- I suspect
you want it to be irrelevant to the sort whether a
string ends in '.GIF' or '.gif', for example, or
whether the prefix is spelled mixed, upper, lowercase.

So, you could change the last line in eqv1 (after
also changing the import to
    import re, string
of course), to:

    return (groups[0].lower(), int(groups[1])
        ) + tuple(map(string.lower,groups[2:]))

and now the test would emit:

D:\PySym>python ccc.py
Pg95_100.GIF ('pg', 95, '', '100.gif')
Pg9_100.GIF ('pg', 9, '', '100.gif')
Pg95A_100.gif ('pg', 95, 'a', '100.gif')
Pg9A_100.gif ('pg', 9, 'a', '100.gif')
Pg10_100.gif ('pg', 10, '', '100.gif')

D:\PySym>

more likely to be right.

Anyway, use this kind of exploratory approach to
make sure you have the right 'equivalence extraction'
function for your purposes; be sure to test quite
extensively on data which IS representative of the
scenarios you will want to handle!


Once you *DO* have a good 'eqv' function for your
purposes -- one that, applied to any string you
may be sorting, returns the correct 'equivalent'
tuple you want to use for comparison -- there are
basically two approaches to using it.  One involves
passing a comparison-function to the sort method,
and I would not advise it -- if the extraction
function does anything 'heavy', sorting lists of
substantial length can be quite slow this way.

A better approach (which I _think_ should be
called 'Schwartzian transform', from the name of
its inventor, but there is anything but consensus
on this group regarding the name, or the priority
of Randall Schwartz' "invention" of this...!):

def richSort(aList, eqv):
    auxList = [ (eqv(x), x) for x in aList ]
    auxList.sort()
    return [ x for (junk, x) in auxList ]


Changing the unit-test at the end of the module to:

if __name__=='__main__':
    for s in m:
        print s, eqv1(s)
    x = richSort(m, eqv1)
    print x

we now have the following test-output when just
running the module:

D:\PySym>python ccc.py
Pg95_100.GIF ('pg', 95, '', '100.gif')
Pg9_100.GIF ('pg', 9, '', '100.gif')
Pg95A_100.gif ('pg', 95, 'a', '100.gif')
Pg9A_100.gif ('pg', 9, 'a', '100.gif')
Pg10_100.gif ('pg', 10, '', '100.gif')
['Pg9_100.GIF', 'Pg9A_100.gif', 'Pg10_100.gif', 'Pg95_100.GIF',
'Pg95A_100.gif']

D:\PySym>

which seems a good match for the required functionality.

Function richSort is not complex.  It build an auxiliary
list, where each item's "sorting-equivalent" is prepended
to the item itself; sorts the auxiliary-lists with its
own sort method, which ('lexicographically'...:-) will
consider the comparison of the 'sorting-equivalents' to
have precedence; extract the results by junking all the
sorting-equivalents previously prepended.  It's much
clearer and more concise to express this in Python, of
course, as above done, than in English, at least my
personal, somewhat verbose idiolect of it:-).


If you *had* to do it with comparison-function-argument-
to-sort, it could be, for example:

def richSortAlternate(aList, eqv):
    def compare(a, b, eqv=eqv):
        return cmp(eqv(a), eqv(b))
    aList.sort(compare)
    return aList

and it would operate in-place.  But I would strongly
advise the other approach...!


In any case, developing the 'eqv' data-extraction is
the key step, and also the one depending strongly on
the specific problems you want to solve, while the
ways you apply that function to the sorting tasks are
general-purpose Python patterns/idioms.


Alex






More information about the Python-list mailing list