Is unicode.lower() locale-independent?

John Machin sjmachin at lexicon.net
Sat Jan 12 16:12:26 EST 2008


On Jan 12, 10:51 pm, Fredrik Lundh <fred... at pythonware.com> wrote:
> Robert Kern wrote:
> >> However it appears from your bug ticket that you have a much narrower
> >> problem (case-shifting a small known list of English words like VOID)
> >> and can work around it by writing your own locale-independent casing
> >> functions. Do you still need to find out whether Python unicode
> >> casings are locale-dependent?
>
> > I would still like to know. There are other places where .lower() is used in
> > numpy, not to mention the rest of my code.
>
> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
>      http://www.unicode.org/Public/4.1.0/ucd/UCD.html

of which the relevant part is
"""
Case Mappings

There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII. For more
information, see Chapter 3 in Unicode 4.0.

For compatibility with existing parsers, UnicodeData.txt only contains
case mappings for characters where they are one-to-one mappings; it
also omits information about context-sensitive case mappings.
Information about these special cases can be found in a separate data
file, SpecialCasing.txt.
"""

It seems that Python doesn't use the SpecialCasing.txt file. Effects
include:
(a) one-to-many mappings don't happen, e.g. for LATIN SMALL LETTER
SHARP S, u'\xdf'.upper() produces u'\xdf' instead of u'SS'
(b) language-sensitive mappings don't happen, e.g. dotted/dotless I/i
for Turkish and Azeri
(c) context-sensitive mappings don't happen, e.g. the lower case of
GREEK CAPITAL LETTER SIGMA depends on whether it is the last letter in
a word.
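
A quick way to check the three effects on whatever interpreter you
have is sketched below. The helper name casing_probe is made up for
illustration, and the results depend on the Python version and the
Unicode database it ships with, so no expected output is shown.

```python
def casing_probe():
    """Return (label, input, result) tuples for the three special-casing cases."""
    return [
        # (a) one-to-many: SHARP S has no single-character upper case
        ("sharp s upper", u'\xdf', u'\xdf'.upper()),
        # (b) language-sensitive: Turkish would want dotless i (U+0131) here
        ("capital I lower", u'I', u'I'.lower()),
        # (c) context-sensitive: a word-final sigma should become U+03C2
        ("final sigma lower", u'\u039f\u03a3', u'\u039f\u03a3'.lower()),
    ]

for label, src, result in casing_probe():
    print("%-18s %r -> %r" % (label, src, result))
```

If the first result is still u'\xdf', or the last one ends in
u'\u03c3' rather than u'\u03c2', the interpreter is applying only the
simple one-to-one mappings from UnicodeData.txt.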



>
> afaik, changing the locale has no influence whatsoever on Python's
> Unicode subsystem.
>
> </F>
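
For anyone who wants to check that claim on their own machine, here is
a minimal sketch. The function name lower_under_locales and the list
of locale names are my own assumptions; locales that are not installed
are simply skipped, so the result may cover only 'C'.

```python
import locale

def lower_under_locales(word, locale_names):
    """Lower-case `word` under each settable locale; return {name: result}."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    results = {}
    try:
        for name in locale_names:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this locale is not available here
            results[name] = word.lower()
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore original locale
    return results

print(lower_under_locales(u'VOID', ['C', 'tr_TR.UTF-8', 'tr_TR', 'Turkish']))
```

If the Unicode casing really is locale-independent, every value in the
returned dict should be u'void', including under a Turkish locale.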



