Sorting a list of Unicode strings?

Sun Aug 19 14:09:42 EDT 2007

oliver at obeattie.com <oliver at obeattie.com> wrote:
   ...
> > > Maybe I'm missing something fundamental here, but if I have a list of
> > > Unicode strings, and I want to sort these alphabetically, then it
> > > places those that begin with unicode characters at the bottom.
   ...
> Anyway, I know _why_ it does this, but I really do need it to sort
> them correctly based on how humans would look at it.

Depending on the nationality of those humans, you may need very
different sorting criteria; indeed, in some countries, different sorting
criteria apply to different use cases (such as sorting surnames versus
sorting book titles, etc; sorry, I don't recall specific examples, but
if you delve into sites about i18n issues you'll find some).

In both Swedish and Danish, I believe, A-with-ring sorts AFTER the
letter Z in the alphabet; so, having Aaland (where I'm using Aa for
A-with-ring, since this newsreader has some problem in letting me enter
non-ascii characters;-) sort "right at the bottom", while it "doesn't
look right" to YOU (maybe an English-speaker?) may look right to the
inhabitants of that locality (be they Danes or Swedes -- but I believe
Norwegian may also work similarly in terms of sorting).

The Unicode consortium does define a standard collation algorithm (UCA)
and table (DUCET) to use when you need a locale-independent ordering; at
<http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm>
you'll be able to obtain James Tauber's Python implementation of UCA, to
work with the DUCET, which the Unicode Consortium publishes at
<http://www.unicode.org/Public/UCA/latest/allkeys.txt>.

I suspect you won't like the collation order you obtain this way, but
you might start from there, subsetting and tweaking the DUCET into an
OUCET (Oliver Unicode Collation Element Table;-) that suits you better.
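To make the idea of a collation element table a bit more concrete, here is
a toy sketch in Python 3. The table and its weights are invented for this
example (they are NOT taken from DUCET), and only two levels are modelled:
a primary weight for the base letter and a secondary weight for the accent.
The names TOY_TABLE and sort_key are hypothetical.

```python
# Toy collation element table: character -> (primary, secondary) weights.
# The weights are invented for illustration and are NOT taken from DUCET.
TOY_TABLE = {
    "a": (1, 0), "á": (1, 1),
    "e": (2, 0), "é": (2, 1),
    "z": (3, 0),
}

def sort_key(s):
    """Build a two-level key: all primary weights, then all secondary ones."""
    elements = [TOY_TABLE[ch] for ch in s]  # KeyError for unlisted characters
    primary = tuple(p for p, _ in elements)
    secondary = tuple(w for _, w in elements)
    return (primary, secondary)

words = ["éz", "ea", "ez", "áz", "az"]
print(sorted(words, key=sort_key))
```

Because every primary weight is compared before any secondary weight,
accents only break ties between strings whose base letters are identical.
Tweaking a table into something like an "OUCET" amounts to changing the
weights, for example giving an accented letter its own primary weight.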

A simpler, rougher approach, if you think the "right" collation is
obtained by ignoring accents, diacritics, etc (even though the speakers
of many languages that include diacritics, &c, disagree;-) is to use the
key=coll argument in your sorting call, passing a function coll that
maps any Unicode string to what you _think_ it should be like for
sorting purposes.  The .translate method of Unicode string objects may
help there: it takes a dict mapping Unicode ordinals to ordinals or
strings (or None for characters you want to delete as part of the
translation).
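As a minimal illustration of the three kinds of values such a dict can
hold, here is a sketch written for Python 3, where str.translate accepts
the same kind of mapping. The table contents and the word list are made up
for the example.

```python
# Python 3 sketch: str.translate takes a dict keyed by code point (int).
# Values may be another code point, a replacement string, or None.
table = {
    ord("é"): ord("e"),   # ordinal -> ordinal
    ord("ß"): "ss",       # ordinal -> string
    ord("-"): None,       # ordinal -> None deletes the character
}

def coll(s):
    """Return the string as it should be seen for sorting purposes."""
    return s.translate(table)

words = ["zebra", "étude", "e-mail", "straße", "eagle"]
print(sorted(words))            # plain code-point order
print(sorted(words, key=coll))  # order under the toy mapping
```

The key function is called once per item, and the original strings are
what end up in the sorted list; only the comparison sees the translated
form.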

For example, suppose that what we want is the following somewhat silly
collation: we only care about ISO-8859-1 characters, and want to ignore
for sorting purposes any accent (be it grave, acute or circumflex),
umlauts, slashes through letters, tildes, cedillas.  htmlentitydefs has
a useful dict called codepoint2name that helps us identify those "weirdly
decorated foreign characters".

def make_transdict():
    import htmlentitydefs
    cp2n = htmlentitydefs.codepoint2name
    suffixes = 'acute grave circ uml slash tilde cedil'.split()
    td = {}
    for x in range(128, 256):
        if x not in cp2n: continue
        n = cp2n[x]
        for s in suffixes:
            if n.endswith(s):
                # 'eacute' -> u'e': keep what precedes the suffix.
                # Bare marks such as 'uml' map to u'' and get dropped.
                td[x] = unicode(n[:-len(s)])
                break
    return td

# td is built once, when the def statement is executed
def coll(us, td=make_transdict()):
    return us.translate(td)

listofus.sort(key=coll)

I haven't tested this code, but it should be reasonably easy to fix any
problems it might have, as well as making make_transdict "richer" to
meet your goals.  Just be aware that the resulting collation (e.g.,
sorting a-ring just as if it was a plain a) will be ABSOLUTELY WEIRD to
anybody who knows something about Scandinavian languages...!!!-)
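For readers on Python 3, where htmlentitydefs has become html.entities and
the unicode type is plain str, the same approach might look like the
following sketch. The one-letter-base check and the sample names are
additions made for this example, not part of the code above.

```python
from html.entities import codepoint2name  # Python 3 home of htmlentitydefs

SUFFIXES = ("acute", "grave", "circ", "uml", "slash", "tilde", "cedil")

def make_transdict():
    """Map decorated ISO-8859-1 letters to their undecorated base letter."""
    td = {}
    for cp in range(128, 256):
        name = codepoint2name.get(cp)
        if name is None:
            continue
        for suffix in SUFFIXES:
            base = name[:-len(suffix)]
            # Require a one-letter base so that standalone marks such as
            # 'uml' or 'cedil' (which have no base letter) are left alone.
            if name.endswith(suffix) and len(base) == 1:
                td[cp] = base
                break
    return td

TRANSDICT = make_transdict()

def coll(s):
    return s.translate(TRANSDICT)

names = ["Zoë", "Émile", "Eric", "Ñandú", "Nora"]
print(sorted(names))            # accented initials land at the bottom
print(sorted(names, key=coll))  # accents ignored for ordering
```

Since 'ring' is not among the suffixes, a-ring is left untouched here;
adding it to SUFFIXES would fold it into plain a, with the caveat already
given above.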

Alex