sorting slovak utf
Jeff Epler
jepler at unpythonic.net
Mon Dec 8 12:31:29 EST 2003
There are several problems all going on at the same time:
First, you have to select a locale that your OS supports. Python can't
fix the fact that your installation of Windows doesn't support the
sk_SK.utf-8 locale. It may support the sk_SK locale in another
encoding, though. You could try this sequence:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "sk_SK")
'sk_SK'
>>> locale.getlocale()
('sk_SK', 'utf')   # my system chooses utf-8 for this locale
>>> enc = _[1]     # use whatever encoding the system chose
                   # _ is special at the interactive prompt: it
                   # holds the value of the last expression.
                   # Normally, use enc = locale.getlocale()[1]
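Since the set of available locales differs from one OS to the next, one option is to try a few names in turn and keep the first one that works. The following is only a sketch, written for Python 3; the helper name pick_locale and the candidate spellings are assumptions for illustration, not names your system is guaranteed to accept.

```python
import locale

def pick_locale(candidates, category=locale.LC_COLLATE):
    """Return the first name in candidates that setlocale accepts, else None."""
    for name in candidates:
        try:
            locale.setlocale(category, name)
        except locale.Error:
            continue  # this OS does not provide that locale; try the next one
        return name
    return None

# Hypothetical candidate spellings; check what your own OS provides.
chosen = pick_locale(["sk_SK.UTF-8", "sk_SK", "Slovak_Slovakia.1250"])
print("using locale:", chosen)
```

If chosen comes back as None, none of the candidates is installed, and no amount of Python code will give you Slovak collation through the locale module.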
Now, read the file and re-encode each line from utf-8 into the locale's encoding:
>>> f = open("aaa.txt")
>>> l = [line.decode("utf-8").encode(enc) for line in f]
>>> f.close()
(Note that there are pitfalls here if you want to sort utf-8 data
according to sk_SK conventions while that data contains Unicode
characters not expressible in the encoding that was chosen in the
first step: the encode call will then raise UnicodeEncodeError!)
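To make that pitfall concrete, here is a small sketch in Python 3 (where str is already Unicode, unlike the Python 2 session above). The sample text is made up, and ISO-8859-2 stands in for whatever 8-bit encoding your system might pick for sk_SK; both are assumptions.

```python
text = "ďakujem €"  # made-up sample; the euro sign is not in ISO-8859-2

try:
    text.encode("iso-8859-2")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False  # the default (strict) error handler refuses the euro sign

# errors="replace" substitutes "?" for anything the encoding cannot express
lossy = text.encode("iso-8859-2", errors="replace")
print(strict_ok, lossy)
```

Replacing characters keeps the program running, but the replaced characters will no longer sort where they should, so it is a workaround rather than a fix.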
Now, you have to use a locale-sensitive function when you sort your list
of strings, because 'l.sort()' without a comparison function will use
the same ordering by character value no matter your locale. You have
two choices. First, locale.strcoll is suitable for use as the list.sort
comparison function:
>>> l.sort(locale.strcoll)
>>> print "".join(l).decode(enc).encode("utf-8")
# prints the locale-aware sorted version of l
# assuming your terminal is utf-8
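In Python 3, list.sort no longer takes a comparison function, so the strcoll approach needs functools.cmp_to_key. A minimal sketch, assuming Python 3 and that the locale has already been set as described earlier; the function name sort_with_strcoll is made up for illustration:

```python
import functools
import locale

def sort_with_strcoll(lines):
    """Return lines sorted by the rules of the current LC_COLLATE locale."""
    return sorted(lines, key=functools.cmp_to_key(locale.strcoll))

print(sort_with_strcoll(["c", "a", "b"]))
```

In Python 3, locale.strcoll accepts ordinary (Unicode) strings, so the encode and decode steps are not needed there.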
Second, you can instead use the DSU pattern and locale.strxfrm:
>>> m = [(locale.strxfrm(line), line) for line in l]
>>> m.sort()
>>> l[:] = [i[1] for i in m]
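In later Python versions the same decorate-sort-undecorate idea is available through the key argument, which computes the key once per element. A sketch, assuming Python 3 and a locale that has already been set; the function name sort_with_strxfrm is made up for illustration:

```python
import locale

def sort_with_strxfrm(lines):
    """Return lines sorted by the rules of the current LC_COLLATE locale."""
    # locale.strxfrm maps each string to a key whose plain comparison
    # matches what locale.strcoll would report for the originals.
    return sorted(lines, key=locale.strxfrm)

print(sort_with_strxfrm(["c", "a", "b"]))
```

For long lists this tends to be the cheaper of the two choices, because each string is transformed once instead of being compared through strcoll many times.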
Jeff