Problem with lower() for unicode strings in Russian

"Martin v. Löwis" martin at v.loewis.de
Sun Oct 5 17:30:09 EDT 2008


> I have a set of strings (all letters are capitalized) at utf-8,

That's the problem. If these are really utf-8 encoded byte strings,
then .lower likely won't work. It uses the C library's tolower API,
which works on a byte level, i.e. can't work for multi-byte encodings.
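
To make the byte-level point concrete, here is a small sketch. The sample
word and the variable name are just my illustration, and what .lower does
to non-ASCII bytes in Python 2 depends on the C locale, so take the last
line as what you'd see on a default setup rather than a guarantee.

```python
# "ПРИВЕТ" encoded as UTF-8, written with escapes so the source
# encoding of this file doesn't matter.
raw = b"\xd0\x9f\xd0\xa0\xd0\x98\xd0\x92\xd0\x95\xd0\xa2"

# Six letters, but twelve bytes: each Cyrillic letter takes two bytes.
print(len(raw))

# A byte-wise lower only sees individual bytes. None of them is an
# ASCII letter, so in the default locale nothing gets changed.
print(raw.lower() == raw)
```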

What you need to do is to operate on Unicode strings. I.e. instead
of

  s.lower()

do

  s.decode("utf-8").lower()

or (if you need byte strings back)

  s.decode("utf-8").lower().encode("utf-8")

If you find that you write the latter, I recommend that you redesign
your application. Don't use byte strings to represent text, but use
Unicode strings all the time, except at the system boundary (where
you decode/encode as appropriate).
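
A minimal sketch of that layout, assuming the input arrives as a list of
UTF-8 byte strings. The helper names read_names and write_names are made
up for the example; in a real program the boundary would be wherever you
read from or write to files, sockets, and so on.

```python
def read_names(raw_lines):
    # System boundary (input): decode once, right where the bytes arrive.
    return [line.decode("utf-8") for line in raw_lines]


def write_names(names):
    # System boundary (output): encode once, right before the bytes leave.
    return [name.encode("utf-8") for name in names]


# "ПРИВЕТ" and "МИР" as UTF-8 bytes (escaped to keep the source ASCII).
raw_lines = [
    b"\xd0\x9f\xd0\xa0\xd0\x98\xd0\x92\xd0\x95\xd0\xa2",
    b"\xd0\x9c\xd0\x98\xd0\xa0",
]

names = read_names(raw_lines)              # Unicode from here on
lowered = [name.lower() for name in names] # text operations on Unicode only
out = write_names(lowered)                 # back to bytes at the very end
```

Everything between the two helpers only ever sees Unicode strings, so
.lower, len, slicing etc. all work per character instead of per byte.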

Unicode .lower has some limitations too (specifically, SpecialCasing.txt
is not considered), but I don't think they affect Russian.

HTH,
Martin


