Is unicode.lower() locale-independent?

John Machin sjmachin at lexicon.net
Sat Jan 12 05:46:13 EST 2008


On Jan 12, 8:25 pm, Robert Kern <robert.k... at gmail.com> wrote:
> The section on "String Methods"[1] in the Python documentation states that for
> the case conversion methods like str.lower(), "For 8-bit strings, this method is
> locale-dependent." Is there a guarantee that unicode.lower() is
> locale-*in*dependent?
>
> The section on "Case Conversion" in PEP 100 suggests this, but the code itself
> looks like it may call the C function towlower() if it is available. On OS X
> Leopard, the manpage for towlower(3) states that it "uses the current locale"
> though it doesn't say exactly *how* it uses it.
>
> This is the bug I'm trying to fix:
>
>    http://scipy.org/scipy/numpy/ticket/643
>    http://dev.laptop.org/ticket/5559
>
> [1]http://docs.python.org/lib/string-methods.html
> [2]http://www.python.org/dev/peps/pep-0100/
>

The Unicode standard says that case mappings are language-dependent.
It gives the example of the Turkish dotted capital letter I and
dotless small letter i that "caused" the numpy problem. See
http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

Here is what the Python 2.5.1 unicode implementation does in an
English-language locale:

>>> import unicodedata as ucd
>>> eyes = u"Ii\u0130\u0131"
>>> for eye in eyes:
...     print repr(eye), ucd.name(eye)
...
u'I' LATIN CAPITAL LETTER I
u'i' LATIN SMALL LETTER I
u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
u'\u0131' LATIN SMALL LETTER DOTLESS I
>>> for eye in eyes:
...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
...                            eye.capitalize())
...
u'I' u'I' u'i' u'I'
u'i' u'I' u'i' u'I'
u'\u0130' u'\u0130' u'i' u'\u0130'
u'\u0131' u'I' u'\u0131' u'I'

The conversions for I and i are not correct for a Turkish locale.
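
For comparison, here is a minimal sketch of what Turkish-style casing would
look like if done by hand: capital I pairs with dotless small i (U+0131), and
capital I with dot above (U+0130) pairs with plain small i. The names
turkish_lower and turkish_upper are hypothetical helpers, not anything in the
standard library, and the code is written in Python 3 syntax (under 2.5 the
same idea would use unicode objects and u"" literals).

```python
# Hypothetical helpers illustrating Turkish-style case pairs.
# They special-case the four "eye" characters first, then fall back
# to the default casing for everything else.

DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def turkish_lower(s):
    # Map the dotted capital first so it cannot be affected by the
    # default lowercasing, then send plain I to dotless small i.
    s = s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return s.lower()


def turkish_upper(s):
    # Plain small i gains a dot; dotless small i becomes plain I.
    s = s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return s.upper()
```

Applied to an English keyword such as VOID, turkish_lower yields a string
containing U+0131 rather than "void", which is the kind of mismatch the
tickets describe.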

I don't know how to repeat the above in a Turkish locale.
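
One way to attempt it, as a sketch only, is to switch LC_CTYPE with
locale.setlocale() and rerun the conversions. The locale name passed in (for
example "tr_TR.UTF-8") is an assumption: names vary by platform and the locale
may not be installed, in which case the hypothetical helper below returns None.
The code uses Python 3 syntax; what a given 2.5.1 build returns under such a
locale is exactly the open question.

```python
import locale


def casing_under_locale(text, locale_name):
    """Return (char, upper, lower) triples computed while LC_CTYPE is
    set to locale_name, or None if that locale cannot be selected.

    casing_under_locale is a hypothetical helper for experimentation.
    """
    # Calling setlocale() with only a category queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, locale_name)
        except locale.Error:
            # The requested locale is not available on this system.
            return None
        return [(ch, ch.upper(), ch.lower()) for ch in text]
    finally:
        # Always restore whatever locale was active before the call.
        locale.setlocale(locale.LC_CTYPE, saved)
```

Comparing the result for "Ii\u0130\u0131" under a Turkish locale with the
result under the default locale would show whether the casing methods consult
the locale at all.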

However, it appears from your bug ticket that you have a much narrower
problem (case-shifting a small known list of English words like VOID)
and can work around it by writing your own locale-independent casing
functions. Do you still need to find out whether Python unicode
casings are locale-dependent?
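
Such casing functions could be as small as the following sketch, which
touches only the 26 ASCII letters and leaves every other character alone, so
neither the locale nor the Unicode case tables come into it. The names
ascii_lower and ascii_upper are made up for illustration, and the code assumes
Python 3, where str.maketrans() is available.

```python
import string

# Translation tables covering only A-Z and a-z.
_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_TO_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_lower(s):
    """Lowercase ASCII letters only; all other characters pass through."""
    return s.translate(_TO_LOWER)


def ascii_upper(s):
    """Uppercase ASCII letters only; all other characters pass through."""
    return s.translate(_TO_UPPER)
```

With these, ascii_lower("VOID") is "void" whatever the locale, and the
Turkish letters U+0130 and U+0131 come back unchanged.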

Cheers,
John




