Custom sorting (including digraphs)

Erik Max Francis max at alcyone.com
Wed Mar 12 20:39:18 EST 2003


Peter Clark wrote:

> I'm creating a dictionary creation program (as in lexicons, not
> hashes) and have run into a small problem: I would like to implement
> customizable sorting, which would include the ability to sort multiple
> characters as one. For example:
> 
> order = "a, b, c, ch, d, e, (etc)"
> 
> would sort "cat, celery, cherry" as "cat, cherry, celery". In other
> words, all occurrences of "ch" would be sorted as following "c".

Hmmm, are you going to want gh, hh, jh, and sh, too?  :-)

> Naturally, this also needs to be able to take Unicode data into
> account (which I think wouldn't be a problem, but you never know), be
> able to scale upwards to trigraphs ("thl" for example) and quadgraphs
> ("shch"), as well as be as speedy as possible, since I expect to be
> using it to alphabetize several thousand entries.
> 
> I've looked over the Cookbook, and while it has some nice recipes
> for custom sorting, none of them take into consideration treating
> multiple characters as a single unit. TIA,

The most general way I can think of, particularly if you want the system
to generalize, is to convert each string (Unicode or not) to a sequence
of numeric tokens (for the purposes of sorting, at least), and then have
a transition function which maps a string ("ehhoshanghe chiujhaude") to
a sequence of these tokens in such a way that the numerical values of
the tokens sort in the order you want.  So, in a simplified example, if the
letters A through Z have the numeric values 10 through 260 (at 10
increments), then maybe CH would be 35 (between C and D), SH would be
195 (between S and T), and SHCH would be 197 (between SH and T).
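
To make the numbering scheme above concrete, here is a minimal sketch in
Python.  The names WEIGHTS, MAX_LEN and sort_key are made up for
illustration, the weight for unknown characters is an arbitrary
assumption, and the key= argument to sorted() assumes a reasonably
recent Python.  It tries the longest known sequence first at each
position, so "shch" wins over "sh" and "sh" wins over "s".

```python
import string

# Hypothetical weight table: a..z get 10, 20, ..., 260, as described above.
WEIGHTS = {ch: 10 * (i + 1) for i, ch in enumerate(string.ascii_lowercase)}
# Multigraphs are slotted in between their neighbours.
WEIGHTS.update({"ch": 35, "sh": 195, "shch": 197})

# Longest sequence we ever need to look for.
MAX_LEN = max(len(k) for k in WEIGHTS)

def sort_key(word):
    """Translate a word into a list of numeric tokens (longest match first)."""
    word = word.lower()
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate at this position, then shorter ones.
        for n in range(min(MAX_LEN, len(word) - i), 0, -1):
            weight = WEIGHTS.get(word[i:i + n])
            if weight is not None:
                tokens.append(weight)
                i += n
                break
        else:
            # Assumption: characters not in the table sort after all known ones.
            tokens.append(1000 + ord(word[i]))
            i += 1
    return tokens

words = ["cherry", "dog", "cat", "celery", "shchi", "ship", "sat", "tea"]
print(sorted(words, key=sort_key))
```

Since the key is computed once per word, sorting a few thousand entries
this way should not be a bottleneck, though that is an expectation
rather than something measured here.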

Given that you know the multi-character sequences in advance, it
shouldn't be hard to generate the tokenizing function automatically as
a little finite state machine (useful if, for instance, the number of
tokens you want to discriminate is really enormous).
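
One way such a state machine might be generated, as a rough sketch
only: build a trie (nested dicts) from the table of known sequences and
walk it one character at a time, remembering the last accepting state
seen.  The names build_trie and tokenize, the small weight table, and
the default weight of 0 for unknown characters are all assumptions made
for this illustration.

```python
def build_trie(weights):
    """Build a nested-dict trie; the key None marks an accepting state."""
    root = {}
    for token, weight in weights.items():
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = weight
    return root

def tokenize(word, trie, unknown=0):
    """Split word into weights, always taking the longest known sequence."""
    out = []
    i = 0
    while i < len(word):
        node = trie
        best, best_end = None, i + 1
        j = i
        # Follow transitions as far as the input allows.
        while j < len(word) and word[j] in node:
            node = node[word[j]]
            j += 1
            if None in node:
                # Remember the longest complete token seen so far.
                best, best_end = node[None], j
        out.append(unknown if best is None else best)
        i = best_end
    return out

# Hypothetical table, using the same example values as above.
weights = {"a": 10, "c": 30, "h": 80, "s": 190, "t": 200,
           "ch": 35, "sh": 195, "shch": 197}
trie = build_trie(weights)
print(tokenize("shcha", trie))
print(tokenize("shca", trie))
```

In the second call the machine gets as far as "shc", finds no complete
token there, and falls back to "sh" followed by "c", which is the
behaviour you would want for trigraphs and quadgraphs that share
prefixes.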

-- 
 Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
 __ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/  \ Life is an effort that deserves a better cause.
\__/ Karl Kraus
    The laws list / http://www.alcyone.com/max/physics/laws/
 Laws, rules, principles, effects, paradoxes, etc. in physics.



