Flexible Collating (feedback please)

Wed Oct 18 20:36:44 EDT 2006

Gabriel Genellina wrote:
> At Wednesday 18/10/2006 03:42, Ron Adam wrote:
> 
>> I put together the following module today and would like some feedback
>> on any obvious problems.  Or even opinions of whether or not it is a
>> good approach.
>>          if self.flag & CAPS_FIRST:
>>              s = s.swapcase()
> 
> This is just coincidental; it relies on (lowercase)<(uppercase) on the 
> locale collating sequence, and I don't see why it should be always so.

The LC_COLLATE structure (in the python.exe C code, I think) controls the order 
of upper and lower case during collating.  Unfortunately, I don't know if there 
is any way to examine it.

If there was a way to change the LC_COLLATE structure, I wouldn't need to resort 
to tricks like s.swapcase().  But without that info, I don't know of another way.
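
For what it's worth, here is a minimal sketch of one way to control the case
order without depending on the collating sequence at all.  It is only an
illustration, not the module's code: case_key is a made-up name, and it assumes
plain str.lower() and code point comparison rather than locale.strxfrm().

```python
def case_key(s, caps_first=True):
    """Two-level sort key: the letters decide first, case only breaks ties.

    Hypothetical helper for illustration; it ignores the locale and
    compares the lowercased text by code point.
    """
    # False sorts before True, so the flag is False for characters in
    # whichever case should come first.
    case_flags = [ch.islower() == caps_first for ch in s]
    return (s.lower(), case_flags)


words = ['b', 'A', 'a', 'B']
caps_first = sorted(words, key=case_key)
lower_first = sorted(words, key=lambda s: case_key(s, caps_first=False))
print(caps_first)   # ['A', 'a', 'B', 'b']
print(lower_first)  # ['a', 'A', 'b', 'B']
```

Because case is only a tie-breaker here, 'aa' still sorts before 'Ab' in
either mode.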

Maybe renaming CAPS_FIRST to REVERSE_CAPS_ORDER would do?

>>          if self.flag & IGNORE_LEADING_WS:
>>              s = s.strip()
> 
> This ignores trailing ws too. (lstrip?)

I'm not sure if this would make any visible difference.  It might determine the 
order of two strings that are otherwise the same, where one has white space at 
the end and the other doesn't.

They run at the same speed either way, so I'll go ahead and change it.  Thanks.

> 
>>          if self.flag & NUMERICAL:
>>              if self.flag & COMMA_IN_NUMERALS:
>>                  rex = re.compile('^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
>>                                   re.LOCALE)
>>              else:
>>                  rex = re.compile('^(\d*\.?\d*)(\D*)(\d*\.?\d*)',
>>                                   re.LOCALE)
>>              slist = rex.split(s)
>>              for i, x in enumerate(slist):
>>                  if self.flag & COMMA_IN_NUMERALS:
>>                      x = x.replace(',', '')
>>                  try:
>>                      slist[i] = float(x)
>>                  except:
>>                      slist[i] = locale.strxfrm(x)
>>              return slist
>>          return locale.strxfrm(s)
> 
> You should try to make this part a bit more generic. If you are 
> concerned about locales, do not use "comma" explicitly. In other 
> countries 10*100=1.000 - and 1,234 is a fraction between 1 and 2.

See the most recent version of this I posted.  It is a bit more generic.

       news://news.cox.net:119/PNxZg.6714$fl.4591@dukeread08

Maybe a 'comma_is_decimal' option?
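
A rough sketch of how such an option might work, assuming the separators are
passed in explicitly.  parse_number and its parameter names are made up for
illustration; in real use the values could come from locale.localeconv(),
which reports 'decimal_point' and 'thousands_sep' for the current locale.

```python
def parse_number(text, decimal_point='.', thousands_sep=','):
    """Convert a numeral string to float using the given separators.

    Hypothetical helper: the defaults match the US convention, and the
    caller is assumed to know which convention the data uses.
    """
    if thousands_sep:
        # Grouping separators carry no value, so drop them first.
        text = text.replace(thousands_sep, '')
    if decimal_point != '.':
        # float() only understands '.', so normalize the decimal mark.
        text = text.replace(decimal_point, '.')
    return float(text)


print(parse_number('1,234.5'))                                   # 1234.5
print(parse_number('1.000', decimal_point=',', thousands_sep='.'))  # 1000.0
print(parse_number('1,234', decimal_point=',', thousands_sep='.'))  # 1.234
```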

Options are cheap so it's no problem to add them as long as they make sense. ;-)

These options are what I refer to as mid-level options.  The programmer does 
still need to know something about the data they are collating.  They may still 
need to do some preprocessing even with this, but maybe not as much.

In a higher level collation routine, I think you would just need to specify a 
named sort type, such as 'dictionary', 'directory', or 'inventory', and it would 
set the options accordingly.  The problem with that approach is the higher level 
definitions may be different depending on locale or even the field it is used in.
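
A sketch of what that higher-level layer could look like.  The flag names are
the ones from the module, but the numeric values, the PRESETS table, and the
choice of flags for each name are all assumptions made up for this example.

```python
# Flag values are placeholders; only the names come from the module.
CAPS_FIRST = 1
IGNORE_LEADING_WS = 2
NUMERICAL = 4
COMMA_IN_NUMERALS = 8

# Hypothetical mapping from a named sort type to a set of options.
# A real table would probably need to vary by locale or field.
PRESETS = {
    'dictionary': IGNORE_LEADING_WS,
    'directory': CAPS_FIRST | NUMERICAL,
    'inventory': NUMERICAL | COMMA_IN_NUMERALS,
}


def flags_for(sort_type):
    """Return the option flags for a named sort type."""
    try:
        return PRESETS[sort_type]
    except KeyError:
        raise ValueError('unknown sort type: %r' % (sort_type,)) from None


print(flags_for('directory'))  # 5
```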

>>      The NUMERICAL option orders leading and trailing digits as numerals.
>>
>>          >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
>>          >>> collated(t, NUMERICAL)
>>          ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']
> 
> From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5, 
> a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
> Maybe GROUP_NUMBERS... but I don't like that too much either...

How about 'VALUE_ORDERING' ?

The term I've seen before is natural ordering, but that is more general and can 
include dates, Roman numerals, and other types as well.
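
For reference, a simplified sketch of this kind of ordering as a sort key.
value_key is a hypothetical name, not the module's function; it compares the
text parts by code point instead of locale.strxfrm(), and it only recognizes
unsigned numbers with '.' as the decimal point.

```python
import re

# One capturing group, so re.split() keeps the numbers in the result.
_NUMBER = re.compile(r'(\d+(?:\.\d+)?)')


def value_key(s):
    """Split s into text and number parts so digits compare by value."""
    parts = _NUMBER.split(s)
    # The split result alternates text, number, text, ... so the odd
    # positions always hold the numeric strings.  Converting only those
    # keeps str compared with str and float with float.
    parts[1::2] = [float(n) for n in parts[1::2]]
    return parts


t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
print(sorted(t, key=value_key))
# ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']
```

This gives the same order as the NUMERICAL example quoted above.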

Cheers,
    Ron