Is unicode.lower() locale-independent?
Robert Kern
robert.kern at gmail.com
Sat Jan 12 06:39:17 EST 2008
John Machin wrote:
> On Jan 12, 8:25 pm, Robert Kern <robert.k... at gmail.com> wrote:
>> The section on "String Methods"[1] in the Python documentation states that for
>> the case conversion methods like str.lower(), "For 8-bit strings, this method is
>> locale-dependent." Is there a guarantee that unicode.lower() is
>> locale-*in*dependent?
>>
>> The section on "Case Conversion" in PEP 100[2] suggests this, but the code itself
>> looks like it may call the C function towlower() if it is available. On OS X
>> Leopard, the manpage for towlower(3) states that it "uses the current locale"
>> though it doesn't say exactly *how* it uses it.
>>
>> This is the bug I'm trying to fix:
>>
>> http://scipy.org/scipy/numpy/ticket/643
>> http://dev.laptop.org/ticket/5559
>>
>> [1]http://docs.python.org/lib/string-methods.html
>> [2]http://www.python.org/dev/peps/pep-0100/
>
> The Unicode standard says that case mappings are language-dependent.
> It gives the example of the Turkish dotted capital letter I and
> dotless small letter i that "caused" the numpy problem. See
> http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180
I don't think that determines the behavior of unicode.lower(). It specifies
semantics for dealing with a given language in the abstract; it doesn't specify
concrete behavior with respect to a given locale setting on a real computer. For
example, my strings 'VOID', 'INT', etc. are all English, and I want English case
behavior. The language of the data and of the transformations I want to apply to
it is English, even though the user may have set the locale to something else.
> Here is what the Python 2.5.1 unicode implementation does in an
> English-language locale:
>
> >>> import unicodedata as ucd
> >>> eyes = u"Ii\u0130\u0131"
> >>> for eye in eyes:
> ...     print repr(eye), ucd.name(eye)
> ...
> u'I' LATIN CAPITAL LETTER I
> u'i' LATIN SMALL LETTER I
> u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
> u'\u0131' LATIN SMALL LETTER DOTLESS I
> >>> for eye in eyes:
> ...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
> ...                            eye.capitalize())
> ...
> u'I' u'I' u'i' u'I'
> u'i' u'I' u'i' u'I'
> u'\u0130' u'\u0130' u'i' u'\u0130'
> u'\u0131' u'I' u'\u0131' u'I'
>
> The conversions for I and i are not correct for a Turkish locale.
>
> I don't know how to repeat the above in a Turkish locale.
If you have the correct locale data in your operating system, this should be
sufficient, I believe:
$ LANG=tr_TR python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'tr_TR'
>>> 'VOID'.lower()
'vo\xfdd'
>>> 'VOID'.lower().decode('iso-8859-9')
u'vo\u0131d'
>>> u'VOID'.lower()
u'void'
>>>
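To repeat a check like this without leaving the process in the Turkish locale, something along these lines might do. It is only a sketch in Python 3 spelling, and lower_under_locale is a name made up for illustration; 'tr_TR' is the locale name used above and may not be installed on a given system. Note that in Python 3, str.lower() works from the Unicode database rather than the C locale, so this should give 'void' whenever the locale is available; the 8-bit string result shown above is specific to Python 2.

```python
import locale


def lower_under_locale(text, locale_name):
    """Lowercase text while locale_name is active, then restore the old locale.

    Returns None if the operating system does not provide locale_name.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        # The locale data is not installed; nothing was changed.
        return None
    try:
        return text.lower()
    finally:
        # Put the previous locale back even if lower() raised.
        locale.setlocale(locale.LC_ALL, saved)


print(repr(lower_under_locale("VOID", "tr_TR")))
```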
> However it appears from your bug ticket that you have a much narrower
> problem (case-shifting a small known list of English words like VOID)
> and can work around it by writing your own locale-independent casing
> functions. Do you still need to find out whether Python unicode
> casings are locale-dependent?
I would still like to know. There are other places where .lower() is used in
numpy, not to mention the rest of my code.
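For the narrow case of known English keywords, the kind of locale-independent casing function suggested above could be sketched roughly as follows. This is an assumption about what such a workaround might look like, not what numpy actually does; ascii_lower is a hypothetical name, and the code is in Python 3 spelling (in Python 2 the table would come from string.maketrans instead).

```python
import string

# Map only the 26 ASCII capitals to their lowercase forms. Every other
# character is left untouched, so the current locale never enters into it.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(s):
    """Lowercase the ASCII letters A-Z in s and leave everything else alone."""
    return s.translate(_ASCII_LOWER)


print(ascii_lower("VOID"))
```

Because the table covers only A-Z, 'I' always maps to 'i', and characters such as U+0130 pass through unchanged.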
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco