sorting slovak utf

Jeff Epler jepler at unpythonic.net
Mon Dec 8 12:31:29 EST 2003


There are several separate problems going on at the same time:
First, you have to select a locale that your OS supports.  Python can't
fix the fact that your installation of Windows doesn't support the
sk_SK.utf-8 locale.  It may support the sk_SK locale in another
encoding, though.  You could try this sequence:
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, "sk_SK")
    'sk_SK'
    >>> locale.getlocale()
    ('sk_SK', 'utf') # my system chooses utf-8 for this locale
    >>> enc = _[1]   # use whatever encoding the system chose
                     # _ is special in interactive prompt, it
                     # holds the value of the last expression
                     # use enc = locale.getlocale()[1] normally
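
If you are not sure which locale names your OS accepts, one option is to
try a few in order and keep the first that works.  This is only a sketch,
written for Python 3; pick_locale is a made-up helper name, and the
candidate names (in particular the Windows-style one) are guesses that
may not match your system:

```python
import locale

def pick_locale(candidates, category=locale.LC_COLLATE):
    """Return the first name in candidates that setlocale accepts, else None."""
    for name in candidates:
        try:
            locale.setlocale(category, name)
        except locale.Error:
            continue  # this OS does not provide that locale; try the next
        return name
    return None

# Candidate names are guesses; Windows and Unix spell locale names differently.
# "C" is listed last as a fallback that every system provides.
chosen = pick_locale(["sk_SK.UTF-8", "sk_SK", "Slovak_Slovakia.1250", "C"])
print("using locale:", chosen)
```

Note that the helper leaves the process locale set to whichever name it
returns, so falling through to "C" means you get plain ordering rather
than Slovak collation.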

Now, read the file and re-encode it into the encoding of the system's locale:
    >>> f = open("aaa.txt")
    >>> l = [line.decode("utf-8").encode(enc) for line in f]
    >>> f.close()
(Note that there is a pitfall here if you want to sort utf-8 data
according to sk_SK conventions while that data contains unicode
characters not expressible in the character set that you chose in the
first step: the .encode(enc) call will raise a UnicodeError for such
lines!)
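
To see in advance which lines would trip over that pitfall, something
like the following could be used.  It is a sketch for Python 3, where the
lines are already decoded text; find_unencodable and the sample data are
invented for illustration, and "iso8859-2" just stands in for whatever
encoding your locale reported:

```python
def find_unencodable(lines, encoding):
    """Return the lines that cannot be represented in the given encoding."""
    bad = []
    for line in lines:
        try:
            line.encode(encoding)
        except UnicodeEncodeError:
            bad.append(line)  # at least one character has no mapping
    return bad

sample = ["čaj", "chlieb", "日本"]  # hypothetical sample data
print(find_unencodable(sample, "iso8859-2"))
```

What to do with the offending lines (skip them, replace characters, or
pick a wider encoding) depends on your data.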

Now, you have to use a locale-sensitive function when you sort your list
of strings, because 'l.sort()' without a comparison function always
compares strings by byte value, no matter what your locale is.  You have
two choices.  First, locale.strcoll is suitable for use as the list.sort
comparison function:
    >>> l.sort(locale.strcoll)
    >>> print "".join(l).decode(enc).encode("utf-8")
    # prints the locale-aware sorted version of l
    # assuming your terminal is utf-8
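
In Python 3, list.sort no longer takes a comparison function directly, so
the strcoll approach needs functools.cmp_to_key.  A minimal sketch,
assuming Python 3; the word list is invented, and the "C" locale is used
only so the example runs anywhere (substitute the Slovak locale name your
system accepted):

```python
import functools
import locale

# "C" is used here only because every system has it; substitute the
# Slovak locale name that setlocale accepted on your system.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["ryba", "auto", "dom"]  # hypothetical sample data
# cmp_to_key wraps the two-argument strcoll so it can serve as a sort key.
words.sort(key=functools.cmp_to_key(locale.strcoll))
print(words)
```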

Second, you can instead use the DSU (decorate-sort-undecorate) pattern and
locale.strxfrm:
    >>> m = [(locale.strxfrm(line), line) for line in l]
    >>> m.sort()
    >>> l[:] = [i[1] for i in m]
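
On Python 3 the same idea can be written with a sort key, which does the
decorate and undecorate steps for you.  This is a sketch under that
assumption; sort_with_locale is a hypothetical helper, and it defaults to
the "C" locale only so that it runs on any system:

```python
import locale

def sort_with_locale(lines, locale_name="C"):
    """Return lines sorted by the collation rules of locale_name."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    locale.setlocale(locale.LC_COLLATE, locale_name)
    try:
        # strxfrm maps each string to a form whose plain comparison
        # follows the collation order of the active locale.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore old setting

print(sort_with_locale(["c", "a", "b"]))
```

Restoring the previous setting keeps the change from leaking into the
rest of the program, since setlocale affects the whole process.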

Jeff




