Python Unicode handling wins again -- mostly

Fri Nov 29 23:21:49 EST 2013

On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:

> In article <529934dc$0$29993$c3e8da3$5496439d at news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> 
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>> 
>> Like most other languages, Python 3.2 fails:
>> 
>> py> 'baffle'.upper()
>> 'BAfflE'

You edited my text to remove the ligature? That's... unfortunate.

>> but Python 3.3 passes:
>> 
>> py> 'baffle'.upper()
>> 'BAFFLE'
> 
> I disagree.
> 
> The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to all
languages that have ligatures. It also partly depends on how you define
ligatures. For example, would you consider the ampersand & to be a
ligature? These days, I would consider & to be a distinct character, but
it began as a ligature for "et" (Latin for "and").

But let's skip such corner cases, as they provide much heat but no 
illumination, and I'll agree that when it comes to ligatures like fl, fi 
and ffl, they are purely typographic.

> The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).
> 
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do. Thus,
> I would argue that it's wrong to say that calling upper() on an ffl
> ligature should yield FFL.

Your conclusion doesn't follow from the argument you are making. If the
ffl ligature ﬄ is purely a typographical feature, then it should
uppercase to FFL, since there is no corresponding typographic ligature
for uppercase FFL.

Consider the examples shown above, where you or your software 
unfortunately edited out the ligature and replaced it with ASCII "ffl". 
Or perhaps I should say *fortunately*, since it demonstrates the problem.

Since we agree that the ﬄ ligature is merely a typographic artifact of 
some type designer's whimsy, we can expect that the word "baﬄe" is
semantically exactly the same as the word "baffle". How foolish Python 
would look if it did this:

py> 'baffle'.upper()
'BAfflE'

Replace the 'ffl' with the ligature, and the conclusion remains:

py> 'baﬄe'.upper()
'BAﬄE'

would be equally wrong.

Now, I accept that this picture isn't entirely black and white. For 
example, we might argue that if ﬄ is purely typographical in nature, 
surely we would also want 'baffle' == 'baﬄe' too? Or maybe not. This 
indicates that capturing *all* the rules for text across the many 
languages, writing systems and conventions is impossible.

There are some circumstances where we would want 'baffle' and 'baﬄe' to
compare unequal, and others where we would want them to compare equal.
Python gives us both:

py> "baffle" == "baﬄe"
False
py> "baffle" == unicodedata.normalize("NFKC", "baﬄe")
True

but frankly I'm baffled *wink* that you think there are any circumstances 
where you would want the uppercase of ﬄ to be anything but FFL.
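
For anyone who wants to try the two kinds of comparison side by side,
here is a minimal sketch. The helper name same_text is made up for
illustration, and str.casefold() assumes Python 3.3 or later. The
ligature is written as an escape so it survives software that mangles
non-ASCII text.

```python
import unicodedata

def same_text(a, b):
    """Hypothetical helper: compare two strings after NFKC normalisation."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

ligature_word = "ba\ufb04e"   # 'baﬄe', using U+FB04 LATIN SMALL LIGATURE FFL
plain_word = "baffle"

# Exact comparison looks at code points, so the ligature is different.
print(ligature_word == plain_word)

# Compatibility normalisation decomposes the ligature into f, f, l.
print(same_text(ligature_word, plain_word))

# Case folding (Python 3.3+) also expands the ligature.
print(ligature_word.casefold() == plain_word.casefold())
```

Which of these is "right" depends on whether you care about the exact
characters or the text they represent.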

> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.

You would expect wrongly. You are over-generalising from English, and if 
you include ligatures and other special cases, not even all of English.

See, for example:

http://www.unicode.org/faq/casemap_charprop.html#7a
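
You can also check the claimed identity by brute force. The following
is a rough sketch, assuming Python 3.3 or later (older versions use
simpler case mappings and will give different results). The function
name roundtrip_failures is my own invention. The exact number of
failures depends on the Unicode version your Python was built with, so
I don't quote one here.

```python
import sys

def roundtrip_failures():
    """Yield every character c where c.lower() != c.upper().lower()."""
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        if c.lower() != c.upper().lower():
            yield c

failures = list(roundtrip_failures())

# How many single code points break the identity on this interpreter.
print(len(failures))

# Spot checks: sharp s, the ffl ligature and long s, written as escapes.
print("\u00df" in failures, "\ufb04" in failures, "\u017f" in failures)
```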

Apart from ligatures, some examples of troublesome characters with regard 
to case are:

* German Eszett (sharp-S) ß can be uppercased to SS, SZ or ẞ depending
  on context, particularly when dealing with placenames and family names.

  (That last character, LATIN CAPITAL LETTER SHARP S, goes back to at
  least the 1930s, although the official rules of German orthography
  still insist on uppercasing ß to SS.)

* The English long-s ſ is uppercased to regular S.

* Turkish dotted and dotless I (İ and i, I and ı) use the same Latin
  letters I and i, but the case conversion rules are different.

* Both the Greek sigma σ and final sigma ς uppercase to Σ.
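
The characters in the list above can be poked at directly. This is
only a sketch of what the built-in str methods do, assuming Python 3.3
or later. Those methods apply the default Unicode mappings and know
nothing about locale, so the Turkish rules in particular are *not*
applied. The variable names are just labels I picked.

```python
# All checks use plain str methods: default, locale-independent mappings.
sharp_s = "\u00df"            # ß
capital_sharp_s = "\u1e9e"    # ẞ
long_s = "\u017f"             # ſ
dotless_i = "\u0131"          # ı
dotted_capital_i = "\u0130"   # İ
sigma = "\u03c3"              # σ
final_sigma = "\u03c2"        # ς

print(sharp_s.upper())                        # default mapping is SS, not ẞ
print(capital_sharp_s.lower() == sharp_s)     # ẞ still lowercases to ß
print(long_s.upper())                         # plain S
print(sigma.upper() == final_sigma.upper())   # both give capital sigma

# No Turkish tailoring: I lowercases to dotted i, ı uppercases to I,
# and İ lowercases to i followed by U+0307 COMBINING DOT ABOVE.
print("I".lower() == "i", dotless_i.upper() == "I")
print([hex(ord(c)) for c in dotted_capital_i.lower()])
```

If you need Turkish or Azeri behaviour, you have to handle it yourself
or reach for a library that supports locale-tailored case mapping.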

That last one is especially interesting: Python 3.3 gets it right, while 
older Pythons do not. In Python 3.2:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύσ (Odysseus)'

while in 3.3 it roundtrips correctly:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύς (Odysseus)'

So... case conversions are not as simple as they appear at first glance. 
They aren't always reversible, nor do they always roundtrip. Titlecase is 
not necessarily the same as "uppercase the first letter and lowercase the 
rest". Case conversions can be context or locale sensitive.
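
To make the titlecase point concrete, here is a small sketch, again
assuming Python 3.3 or later. The naive_title function is a
hypothetical stand-in for the "uppercase the first letter and
lowercase the rest" rule, and the word "ǆep" is just a convenient test
string starting with the digraph letter U+01C6.

```python
def naive_title(word):
    """Hypothetical rule: uppercase the first character, lowercase the rest."""
    return word[:1].upper() + word[1:].lower()

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has three case forms:
# lower U+01C6, title U+01C5 and upper U+01C4.
word = "\u01c6ep"
print([hex(ord(c)) for c in word.title()])
print([hex(ord(c)) for c in naive_title(word)])

# The ffl ligature titlecases differently from how it uppercases.
ligature = "\ufb04"
print(ligature.title(), ligature.upper())

# Context sensitivity: capital sigma lowercases to final sigma only at
# the end of a word. The string spells ODYSSEUS in Greek capitals
# (without accents).
odysseus_upper = "\u039f\u0394\u03a5\u03a3\u03a3\u0395\u03a5\u03a3"
print([hex(ord(c)) for c in odysseus_upper.lower()])
```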

Anyway... even if you disagree with everything I have said, it is a fact 
that Python has committed to following the Unicode standard, and the 
Unicode standard requires that certain ligatures, including FFL, FL and 
FI, are decomposed when converted to uppercase.
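
If you want to see this for the Latin ligatures in one go, the
following sketch lists the code points U+FB00 to U+FB06 with their
uppercase forms. It assumes Python 3.3 or later. The dictionary name
uppercased is only there so the results can be inspected afterwards.

```python
import unicodedata

# U+FB00..U+FB06 are the Latin ligatures in Alphabetic Presentation Forms.
for cp in range(0xFB00, 0xFB07):
    lig = chr(cp)
    print("U+%04X %-32s upper() -> %r" % (cp, unicodedata.name(lig), lig.upper()))

# Collect the mappings; every one expands to more than one character.
uppercased = {chr(cp): chr(cp).upper() for cp in range(0xFB00, 0xFB07)}
```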

-- 
Steven