Can upper() or lower() ever change the length of a string?

MRAB python at mrabarnett.plus.com
Mon May 24 10:42:06 EDT 2010


Mark Dickinson wrote:
> On May 24, 1:13 pm, Steven D'Aprano <st... at REMOVE-THIS-
> cybersource.com.au> wrote:
>> Do unicode.lower() or unicode.upper() ever change the length of the
>> string?
>>
>> The Unicode standard allows for case conversions that change length, e.g.
>> sharp-S in German should convert to SS:
>>
>> http://unicode.org/faq/casemap_charprop.html#6
>>
>> but I see that Python doesn't do that:
>>
>> >>> s = "Paßstraße"
>> >>> s.upper()
>> 'PAßSTRAßE'
>>
>> The more I think about this, the more I think that upper/lower/title case
>> conversions should change length (at least sometimes) and if Python
>> doesn't do so, that's a bug. Any thoughts?
> 
> Digging a bit deeper, it looks like these methods are using the
> Simple_{Upper,Lower,Title}case_Mapping functions described at
> http://www.unicode.org/Public/5.1.0/ucd/UCD.html fields 12, 13 and 14
> of the unicode data;  you can see this in the source in Tools/unicode/
> makeunicodedata.py, which is the Python code that generates the
> database of unicode properties.  It contains code like:
> 
> if record[12]:
>     upper = int(record[12], 16)
> else:
>     upper = char
> if record[13]:
>     lower = int(record[13], 16)
> else:
>     lower = char
> if record[14]:
>     title = int(record[14], 16)
> 
> ... and so on.
> 
> I agree that it might be desirable for these operations to produce the
> multicharacter equivalents.  That idea looks like a tough sell,
> though:  apart from backwards compatibility concerns (which could
> probably be worked around somehow), it looks as though it would
> require significant effort to implement.
> 
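To make the difference concrete, here is a minimal sketch of an
uppercasing step that is allowed to return more than one character.
The names SPECIAL_UPPER, full_upper and changes_length are made up for
illustration and are not part of Python; the table holds only the
sharp-S entry from the example above, whereas a real implementation
would presumably be driven by the Unicode SpecialCasing.txt data.

```python
# Hypothetical table, for illustration only: one multi-character
# uppercase mapping (sharp s -> "SS"), as in the example above.
SPECIAL_UPPER = {
    "\u00df": "SS",  # LATIN SMALL LETTER SHARP S
}


def full_upper(text):
    """Uppercase text, letting the table override the per-character result."""
    return "".join(SPECIAL_UPPER.get(ch, ch.upper()) for ch in text)


def changes_length(text):
    """Report whether the table-based uppercasing changes the length."""
    return len(full_upper(text)) != len(text)
```

With this table, full_upper("Paßstraße") gives "PASSSTRASSE", which is
two characters longer than the input, so any code that assumes
len(s.upper()) == len(s) would have to be revisited.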
If we were to make such a change, I think we should also cater for
locale-specific case changes (passing the locale to 'upper', 'lower' and
'title').

For example, normally "i".upper() returns "I", but in Turkish
"i".upper() should return "İ" (the uppercase version of lowercase dotted
i is uppercase dotted I).
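
As a rough sketch of what passing the locale might look like (the
function upper_for, its lang parameter and the UPPER_OVERRIDES table
are all hypothetical, not an existing Python API), a per-language
override table could be consulted before falling back to the default
mapping:

```python
# Hypothetical per-language overrides, for illustration only. The
# Turkish entries follow the dotted/dotless i rule described above.
UPPER_OVERRIDES = {
    "tr": {
        "i": "\u0130",  # dotted i  -> LATIN CAPITAL LETTER I WITH DOT ABOVE
        "\u0131": "I",  # dotless i -> plain capital I
    },
}


def upper_for(text, lang=None):
    """Uppercase text, applying any overrides registered for lang first."""
    overrides = UPPER_OVERRIDES.get(lang, {})
    return "".join(overrides.get(ch, ch.upper()) for ch in text)
```

Here upper_for("i") still returns "I", while upper_for("i", "tr")
returns "İ". A language with no entry in the table just gets the
default behaviour.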


