Is unicode.lower() locale-independent?

Sat Jan 12 06:39:17 EST 2008

John Machin wrote:
> On Jan 12, 8:25 pm, Robert Kern <robert.k... at gmail.com> wrote:
>> The section on "String Methods"[1] in the Python documentation states that for
>> the case conversion methods like str.lower(), "For 8-bit strings, this method is
>> locale-dependent." Is there a guarantee that unicode.lower() is
>> locale-*in*dependent?
>>
>> The section on "Case Conversion" in PEP 100[2] suggests this, but the code itself
>> looks like it may call the C function towlower() if it is available. On OS X
>> Leopard, the manpage for towlower(3) states that it "uses the current locale"
>> though it doesn't say exactly *how* it uses it.
>>
>> This is the bug I'm trying to fix:
>>
>>    http://scipy.org/scipy/numpy/ticket/643
>>    http://dev.laptop.org/ticket/5559
>>
>> [1]http://docs.python.org/lib/string-methods.html
>> [2]http://www.python.org/dev/peps/pep-0100/
> 
> The Unicode standard says that case mappings are language-dependent.
> It gives the example of the Turkish dotted capital letter I and
> dotless small letter i that "caused" the numpy problem. See
> http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

I don't think that determines the behavior of unicode.lower(). The standard 
specifies semantics for text in a given language in the abstract; it does not 
specify concrete behavior with respect to the locale setting on a real 
computer. For example, my strings 'VOID', 'INT', etc. are all English, and I 
want English case behavior. The language of the data, and of the 
transformations I want to apply to it, is English even though the user may 
have set the locale to something else.
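
As a minimal sketch of what I mean by English case behavior: restrict the 
mapping to A-Z so that neither the locale nor the Unicode tables get a say. 
This assumes a current Python 3, where str is what this thread calls unicode, 
and the name ascii_lower is made up for illustration; it is not anything in 
numpy or the standard library.

```python
import string

# Map only the 26 ASCII capitals to their lowercase forms; every other
# character is passed through unchanged.
_ASCII_LOWER_TABLE = str.maketrans(string.ascii_uppercase,
                                   string.ascii_lowercase)


def ascii_lower(text):
    """Lowercase A-Z only, without consulting the locale (hypothetical helper)."""
    return text.translate(_ASCII_LOWER_TABLE)


print(ascii_lower("VOID"))
print(ascii_lower("INT"))
```

Non-ASCII characters such as U+0130 pass through untouched. That is the point 
for keywords like these, but it would be wrong for real Turkish text.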

> Here is what the Python 2.5.1 unicode implementation does in an
> English-language locale:
> 
> >>> import unicodedata as ucd
> >>> eyes = u"Ii\u0130\u0131"
> >>> for eye in eyes:
> ...     print repr(eye), ucd.name(eye)
> ...
> u'I' LATIN CAPITAL LETTER I
> u'i' LATIN SMALL LETTER I
> u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
> u'\u0131' LATIN SMALL LETTER DOTLESS I
> >>> for eye in eyes:
> ...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(), eye.capitalize())
> ...
> u'I' u'I' u'i' u'I'
> u'i' u'I' u'i' u'I'
> u'\u0130' u'\u0130' u'i' u'\u0130'
> u'\u0131' u'I' u'\u0131' u'I'
> 
> The conversions for I and i are not correct for a Turkish locale.
> 
> I don't know how to repeat the above in a Turkish locale.

If you have the correct locale data in your operating system, this should be 
sufficient, I believe:

$ LANG=tr_TR python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import locale
 >>> locale.setlocale(locale.LC_ALL, '')
'tr_TR'
 >>> 'VOID'.lower()
'vo\xfdd'
 >>> 'VOID'.lower().decode('iso-8859-9')
u'vo\u0131d'
 >>> u'VOID'.lower()
u'void'
 >>>
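
The same check can be scripted rather than typed into an interactive session. 
This is only a sketch, assuming Python 3 (where str plays the role of unicode) 
and assuming a locale named 'tr_TR.UTF-8', which may or may not be installed 
on a given machine. lower_under_locale is a hypothetical helper name. It 
returns None when the locale is unavailable rather than guessing.

```python
import locale


def lower_under_locale(text, locale_name):
    """Return text.lower() computed while LC_CTYPE is set to locale_name.

    Returns None if the operating system does not provide that locale.
    The previous LC_CTYPE setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, locale_name)
        except locale.Error:
            return None
        return text.lower()
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


# 'tr_TR.UTF-8' is an assumed locale name; adjust for the local system.
print(repr(lower_under_locale("VOID", "tr_TR.UTF-8")))
```

On Python 3 I would expect 'void' whenever the locale is installed, since 
str.lower() there uses Python's own Unicode tables. That says nothing about 
the 2.x unicode type being asked about here.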

> However it appears from your bug ticket that you have a much narrower
> problem (case-shifting a small known list of English words like VOID)
> and can work around it by writing your own locale-independent casing
> functions. Do you still need to find out whether Python unicode
> casings are locale-dependent?

I would still like to know. There are other places where .lower() is used in 
numpy, not to mention the rest of my code.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco