Flexible Collating (feedback please)

Wed Oct 18 02:42:45 EDT 2006

I put together the following module today and would like some feedback on any
obvious problems, or even opinions on whether or not it is a good approach.

While collating is not difficult for experienced programmers, I have seen quite
a lot of poorly sorted lists in commercial applications, so it seems it would be
good to have an easy-to-use, ready-made API for collating.

I tried to make this both easy to use and flexible.  My first thought was to
target actual uses such as phone directory sorting or library sorting, but
using keywords to alter the behavior seemed both easier and more flexible.
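
A minimal sketch of the keyword approach: each option is a distinct bit, so a
named use case is just a combination of keywords.  The flag values are copied
from the module; the PHONE_DIRECTORY and PARTS_LIST presets and the has_option
helper are hypothetical names used only for illustration.

```python
# Flag values copied from the module; each is a distinct bit.
CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32

# Hypothetical presets (not part of the module): a named use case is
# just a particular combination of keywords.
PHONE_DIRECTORY = HYPHEN_AS_SPACE | IGNORE_LEADING_WS
PARTS_LIST = NUMERICAL | COMMA_IN_NUMERALS

def has_option(flags, option):
    """Return True if the given option bit is set in flags."""
    return bool(flags & option)

print(has_option(PHONE_DIRECTORY, HYPHEN_AS_SPACE))  # True
print(has_option(PHONE_DIRECTORY, NUMERICAL))        # False
```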

I think the regular expressions I used to parse leading and trailing numerals 
could be improved. They work, but you will probably get inconsistent results if 
the strings are not well formed.  Any suggestions on this would be appreciated.
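
One possible direction, offered as a sketch rather than a drop-in replacement:
split the whole string into alternating text and number chunks instead of
matching only a leading and a trailing numeral.  The names _NUMBER and
numeric_key are hypothetical, and this version does not apply locale.strxfrm
or handle commas.  Each chunk is tagged so numbers and text are never compared
directly.

```python
import re

# Matches an integer or decimal such as "40" or "10.2"; a lone "." is text.
_NUMBER = re.compile(r'(\d+(?:\.\d+)?)')

def numeric_key(s):
    """Split s into tagged text and number chunks for sorting (sketch)."""
    key = []
    for i, chunk in enumerate(_NUMBER.split(s)):
        if not chunk:
            continue                       # skip empty pieces around numbers
        if i % 2:                          # odd positions are captured numbers
            key.append((0, float(chunk)))  # tag 0: numbers sort before text
        else:
            key.append((1, chunk))         # tag 1: plain text
    return key

t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
print(sorted(t, key=numeric_key))
# ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']
```

Because every digit run is matched wherever it occurs, a string such as
'1.2.3' still produces a usable key instead of depending on how a single
anchored pattern happens to match.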

Should I try to extend it to cover dates and currency sorting?  Probably those 
types should be converted before sorting, but maybe sometimes it's useful
not to?
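
For comparison, converting before sorting might look like the sketch below.
The date format ('%m/%d/%Y'), the '$' sign, and the helper names date_key and
currency_key are all assumptions made for the example, not part of the module.

```python
from datetime import datetime
from decimal import Decimal

def date_key(s, fmt='%m/%d/%Y'):
    """Parse a date string so it sorts chronologically (format is assumed)."""
    return datetime.strptime(s, fmt)

def currency_key(s):
    """Strip an assumed '$' sign and thousands commas, then parse the amount."""
    return Decimal(s.replace('$', '').replace(',', ''))

dates = ['10/18/2006', '02/03/2006', '12/01/2005']
prices = ['$1,200.00', '$15.50', '$300']

print(sorted(dates, key=date_key))
# ['12/01/2005', '02/03/2006', '10/18/2006']
print(sorted(prices, key=currency_key))
# ['$15.50', '$300', '$1,200.00']
```

The original strings are returned untouched; only the sort key is converted.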

Another variation is collating Dewey Decimal strings.  It should be easy to add
if someone thinks that might be useful.

I haven't tested this in *anything* yet, so don't plug it into production code 
of any type.  I also haven't done any performance testing.

See the doc tests below for examples of how it's used.

Cheers,
    Ron Adam

"""
     Collate.py

     A general purpose configurable collate module.

     Collation can be modified with the following keywords:

         CAPS_FIRST              -> Aaa, aaa, Bbb, bbb
         HYPHEN_AS_SPACE         -> Hyphens as white space
         UNDERSCORE_AS_SPACE     -> Underscores as white space
         IGNORE_LEADING_WS       -> Disregard leading white space
         NUMERICAL               -> Digit sequences as numerals
         COMMA_IN_NUMERALS       -> Allow commas in numerals

     * See doctests for examples.

     Author: Ron Adam, ron at ronadam.com, 10/18/2006

"""
import re
import locale

locale.setlocale(locale.LC_ALL, '')  # use current locale settings

#  The above line may change the string constants in the string
#  module.  This may have unintended effects if your program
#  assumes they are always the ASCII defaults.

CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32

class Collate(object):
     """ A general purpose and configurable collator class.
     """
     def __init__(self, flag):
         self.flag = flag
     def transform(self, s):
         """ Transform a string for collating.
         """
         if self.flag & CAPS_FIRST:
             s = s.swapcase()
         if self.flag & HYPHEN_AS_SPACE:
             s = s.replace('-', ' ')
         if self.flag & UNDERSCORE_AS_SPACE:
             s = s.replace('_', ' ')
         if self.flag & IGNORE_LEADING_WS:
             s = s.lstrip()
         if self.flag & NUMERICAL:
             if self.flag & COMMA_IN_NUMERALS:
                 rex = re.compile(r'^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
                                  re.LOCALE)
             else:
                 rex = re.compile(r'^(\d*\.?\d*)(\D*)(\d*\.?\d*)', re.LOCALE)
             slist = rex.split(s)
             for i, x in enumerate(slist):
                 if self.flag & COMMA_IN_NUMERALS:
                     x = x.replace(',', '')
                 try:
                     slist[i] = float(x)
                 except ValueError:  # not a number, so collate as text
                     slist[i] = locale.strxfrm(x)
             return slist
         return locale.strxfrm(s)

     def __call__(self, a, b):
         """ This allows a Collate instance to work as a comparison function.

                 USE: list.sort(Collate(flags))
         """
         return cmp(self.transform(a), self.transform(b))

def collate(slist, flags=0):
     """ Collate list of strings in place.
     """
     return slist.sort(Collate(flags))

def collated(slist, flags=0):
     """ Return a collated list of strings.

         This is a decorate-undecorate collate.
     """
     collator = Collate(flags)
     dd = [(collator.transform(x), x) for x in slist]
     dd.sort()
     return [b for (a, b) in dd]

def _test():
     """
     DOC TESTS AND EXAMPLES:

     Sort (and sorted) normally order all words beginning with caps
     before all words beginning with lower case.

         >>> t = ['tuesday', 'Tuesday', 'Monday', 'monday']
         >>> sorted(t)     # regular sort
         ['Monday', 'Tuesday', 'monday', 'tuesday']

     Locale collation puts words beginning with caps after words
     beginning with lower case of the same letter.

         >>> collated(t)
         ['monday', 'Monday', 'tuesday', 'Tuesday']

     The CAPS_FIRST option can be used to put all words beginning
     with caps before words beginning in lowercase of the same letter.

         >>> collated(t, CAPS_FIRST)
         ['Monday', 'monday', 'Tuesday', 'tuesday']

     The HYPHEN_AS_SPACE option causes hyphens to be treated as spaces.

         >>> t = ['a-b', 'b-a', 'aa-b', 'bb-a']
         >>> collated(t)
         ['aa-b', 'a-b', 'b-a', 'bb-a']

         >>> collated(t, HYPHEN_AS_SPACE)
         ['a-b', 'aa-b', 'b-a', 'bb-a']

     The IGNORE_LEADING_WS and UNDERSCORE_AS_SPACE options can be
     used together to improve ordering in some situations.

         >>> t = ['sum', '__str__', 'about', '  round']
         >>> collated(t)
         ['  round', '__str__', 'about', 'sum']

         >>> collated(t, IGNORE_LEADING_WS)
         ['__str__', 'about', '  round', 'sum']

         >>> collated(t, UNDERSCORE_AS_SPACE)
         ['  round', '__str__', 'about', 'sum']

         >>> collated(t, IGNORE_LEADING_WS|UNDERSCORE_AS_SPACE)
         ['about', '  round', '__str__', 'sum']

     The NUMERICAL option orders leading and trailing digits as numerals.

         >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
         >>> collated(t, NUMERICAL)
         ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

     The COMMA_IN_NUMERALS option ignores commas instead of using them to
     separate numerals.

         >>> t = ['a5', 'a4,000', '500b', '100,000b']
         >>> collated(t, NUMERICAL|COMMA_IN_NUMERALS)
         ['500b', '100,000b', 'a5', 'a4,000']

     Collating also can be done in place using collate() instead of collated().

         >>> t = ['Fred', 'Ron', 'Carol', 'Bob']
         >>> collate(t)
         >>> t
         ['Bob', 'Carol', 'Fred', 'Ron']

     """
     import doctest
     doctest.testmod()

if __name__ == '__main__':
     _test()
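
The module passes a comparison function to list.sort(), which is different
from passing a key function.  On Python versions that provide
functools.cmp_to_key, the difference can be sketched as follows.  This is a
standalone illustration under that assumption, not part of Collate.py; the
compare helper is a hypothetical stand-in for Collate.__call__.

```python
import locale
from functools import cmp_to_key

def compare(a, b):
    """Old-style comparison: negative, zero, or positive."""
    ka, kb = locale.strxfrm(a), locale.strxfrm(b)
    return (ka > kb) - (ka < kb)

names = ['Fred', 'Ron', 'Carol', 'Bob']

# A key function takes one item and returns a value to sort by.
by_key = sorted(names, key=locale.strxfrm)

# A comparison function takes two items, so it has to be wrapped.
by_cmp = sorted(names, key=cmp_to_key(compare))

print(by_key == by_cmp)  # True
```

The key form calls the transform once per item, while the comparison form
calls it twice per comparison, which is the same saving collated() gets from
its decorate-undecorate approach.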