Can upper() or lower() ever change the length of a string?

Mon May 24 08:44:21 EDT 2010

On May 24, 1:13 pm, Steven D'Aprano <st... at REMOVE-THIS-
cybersource.com.au> wrote:
> Do unicode.lower() or unicode.upper() ever change the length of the
> string?
>
> The Unicode standard allows for case conversions that change length, e.g.
> sharp-S in German should convert to SS:
>
> http://unicode.org/faq/casemap_charprop.html#6
>
> but I see that Python doesn't do that:
>
> >>> s = "Paßstraße"
> >>> s.upper()
>
> 'PAßSTRAßE'
>
> The more I think about this, the more I think that upper/lower/title case
> conversions should change length (at least sometimes) and if Python
> doesn't do so, that's a bug. Any thoughts?

Digging a bit deeper, it looks like these methods use the
Simple_{Upper,Lower,Title}case_Mapping properties described at
http://www.unicode.org/Public/5.1.0/ucd/UCD.html, which are fields 12,
13 and 14 of each record in the Unicode data. You can see this in
Tools/unicode/makeunicodedata.py, the Python script that generates the
database of Unicode properties. It contains code like:

if record[12]:
    upper = int(record[12], 16)
else:
    upper = char
if record[13]:
    lower = int(record[13], 16)
else:
    lower = char
if record[14]:
    title = int(record[14], 16)

... and so on.
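
To make the consequence concrete, here is a minimal sketch (not the
actual makeunicodedata.py logic) that builds an uppercase table from
field 12 alone. The two sample records are written in the
semicolon-separated UnicodeData.txt layout as an illustration, and the
helper names simple_upper_table and simple_upper are made up for this
example. Because each code point maps to exactly one code point, the
result can never change length:

```python
# Illustrative records in the UnicodeData.txt layout (assumed samples,
# not read from a real data file).
SAMPLE_RECORDS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
]


def simple_upper_table(lines):
    """Map each code point to its simple uppercase code point."""
    table = {}
    for line in lines:
        record = line.split(";")
        char = int(record[0], 16)
        # Field 12 is the simple uppercase mapping; an empty field
        # means the character maps to itself.
        table[char] = int(record[12], 16) if record[12] else char
    return table


def simple_upper(text, table):
    """Uppercase text one code point at a time using the table."""
    return "".join(chr(table.get(ord(c), ord(c))) for c in text)


table = simple_upper_table(SAMPLE_RECORDS)
result = simple_upper("aß", table)
```

Here "a" picks up the mapping to U+0041, while sharp-S has an empty
field 12 and so stays as it is.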

I agree that it might be desirable for these operations to produce the
multicharacter equivalents.  That idea looks like a tough sell,
though:  apart from backwards compatibility concerns (which could
probably be worked around somehow), it looks as though it would
require significant effort to implement.
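
For a rough idea of what the multicharacter behaviour would involve,
here is a small sketch that layers a one-to-many table over the
built-in per-character conversion. The sample line follows the
SpecialCasing.txt layout (code; lower; title; upper; comment) as I
understand it, and parse_special_casing and full_upper are
hypothetical helpers, not anything that exists in Python. It ignores
conditional mappings and locale rules entirely:

```python
# Illustrative line in the SpecialCasing.txt layout (assumed sample).
SPECIAL_CASING_SAMPLE = (
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)


def parse_special_casing(line):
    """Return (character, full uppercase string) for one record."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    char = chr(int(fields[0], 16))
    # The upper field is a space-separated list of code points.
    upper = "".join(chr(int(cp, 16)) for cp in fields[3].split())
    return char, upper


def full_upper(text, special):
    """Uppercase text, preferring one-to-many mappings when present."""
    pieces = []
    for c in text:
        if c in special:
            pieces.append(special[c])
        else:
            # Fall back to the built-in conversion for this character.
            pieces.append(c.upper())
    return "".join(pieces)


char, upper = parse_special_casing(SPECIAL_CASING_SAMPLE)
special = {char: upper}
converted = full_upper("Paßstraße", special)
```

With this table the nine-character input becomes the eleven-character
"PASSSTRASSE", which is exactly the length change under discussion.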

--
Mark