Custom alphabetical sort

Pander Musubi pander.musubi at gmail.com
Mon Dec 24 11:40:22 EST 2012


> > Hi all,
> 
> >
> 
> > I would like to sort according to this order:
> 
> >
> 
> > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',
> 
> > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',
> 
> > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',
> 
> > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',
> 
> > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',
> 
> > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',
> 
> > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',
> 
> > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> 
> >
> 
> > How can I do this? The default sorted() does not give the desired result.
> 
> 
> 
> I'm assuming that doesn't correspond to some standard locale's collating 
> 
> order, so we really do need to roll our own encoding (and that you have 
> 
> a good reason for wanting to do this).

It is for creating a Dutch dictionary. This sorting order is not to be found in an existing locale.

>  I'm also assuming that what I'm 
> 
> seeing as question marks are really accented characters in some encoding 
> 
> that my news reader just isn't dealing with (it seems to think your post 
> 
> was in ISO-2022-CN (Simplified Chinese).
> 
> 
> 
> I'm further assuming that you're starting with a list of unicode 
> 
> strings, the contents of which are limited to the above alphabet.

Correct.

>  I'm 
> 
> even further assuming that the volume of data you need to sort is small 
> 
> enough that efficiency is not a huge concern.

Well, it is for 200,000 - 450,000 words but the code is allowed be slow. It will not be used for web application or something which requires a quick response.

> Given all that, I would start by writing some code which turned your 
> 
> alphabet into a pair of dicts.  One maps from the code point to a 
> 
> collating sequence number (i.e. ordinals), the other maps back.  
> 
> Something like (for python 2.7):
> 
> 
> 
> alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
> 
>             '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
> 
>             [...]
> 
>             'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
> 
> 
> 
> map1 = {c: n for n, c in enumerate(alphabet)}
> 
> map2 = {n: c for n, c in enumerate(alphabet)}

OK, similar to Thomas' proposal.

> Next, I would write some functions which encode your strings as lists of 
> 
> ordinals (and back again)
> 
> 
> 
> def encode(s):
> 
>    "encode('foo') ==> [34, 19, 19]"  # made-up ordinals
> 
>    return [map1[c] for c in s]
> 
> 
> 
> def decode(l):
> 
>    "decode([34, 19, 19]) ==> 'foo'"
> 
>     return ''.join(map2[i] for i in l)
> 
> 
> 
> Use these to convert your strings to lists of ints which will sort as 
> 
> per your specified collating order, and then back again:
> 
> 
> 
> encoded_strings = [encode(s) for s in original_list]
> 
> encoded_strings.sort()
> 
> sorted_strings = [decode(l) for l in encoded_strings]
> 
> 
> 
> That's just a rough sketch, and completely untested, but it should get 
> 
> you headed in the right direction.  Or at least one plausible direction.  
> 
> Old-time perl hackers will recognize this as the Schwartzian Transform.

I will test it and let you know. :) Pander



More information about the Python-list mailing list