Python Unicode handling wins again -- mostly

Sat Nov 30 19:22:28 EST 2013

On Sun, 01 Dec 2013 11:37:30 +1300, Gregory Ewing wrote:

> Which makes it even sillier to have an 'ffi' character in this day and
> age, when you can simply space the characters so that they overlap.

It's in Unicode to support legacy character sets that included it[1]. 
There are a bunch of similar cases:

* LATIN CAPITAL LETTER A WITH RING ABOVE versus ANGSTROM SIGN
* KELVIN SIGN versus LATIN CAPITAL LETTER K
* DEGREE CELSIUS and DEGREE FAHRENHEIT
* the whole set of full-width and half-width forms
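
One way to see that these are compatibility duplicates is to run them 
through NFKC normalization, which folds each one back to its ordinary 
equivalent. This is only a sketch: the "samples" dict is my own choice of 
illustrative code points, not an exhaustive list.

```python
import unicodedata

# Compatibility character -> the ordinary character(s) it duplicates.
# The selection below is illustrative only.
samples = {
    '\ufb03': 'ffi',      # LATIN SMALL LIGATURE FFI
    '\u212b': '\u00c5',   # ANGSTROM SIGN vs LATIN CAPITAL LETTER A WITH RING ABOVE
    '\u212a': 'K',        # KELVIN SIGN vs LATIN CAPITAL LETTER K
    '\u2103': '\u00b0C',  # DEGREE CELSIUS vs DEGREE SIGN followed by C
    '\uff21': 'A',        # FULLWIDTH LATIN CAPITAL LETTER A
}

for compat, plain in samples.items():
    # NFKC applies compatibility decomposition, then canonical composition.
    folded = unicodedata.normalize('NFKC', compat)
    print(unicodedata.name(compat), '->', ascii(folded), folded == plain)
```

Every line should end in True. NFC alone folds only the canonical 
duplicates (ANGSTROM SIGN and KELVIN SIGN here), and leaves the ligature, 
DEGREE CELSIUS and the full-width letter untouched.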

On the other hand, there are cases which to a naive reader might look 
like needless duplication but actually aren't. For example, there are a 
bunch of visually indistinguishable characters[2] in European languages, 
like AΑА and BΒВ. The reason for this becomes more obvious[3] when you 
lowercase them:

py> 'AΑА BΒВ'.lower()
'aαа bβв'

Sorting and case-conversion rules would become insanely complicated, and 
context-sensitive, if Unicode only included a single code point per thing-
that-looks-the-same.
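
A quick check with the unicodedata module shows that the look-alikes 
really are separate characters from separate scripts, and that 
normalization does not merge them the way it merges the compatibility 
characters. Again a sketch only; the "lookalikes" string is just the three 
capital A's from the example above.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: three code points, one shape.
lookalikes = 'A\u0391\u0410'

for ch in lookalikes:
    low = ch.lower()
    print('U+%04X %s -> U+%04X %s' % (
        ord(ch), unicodedata.name(ch),
        ord(low), unicodedata.name(low)))

# Unlike the compatibility characters, NFKC leaves these alone.
print(unicodedata.normalize('NFKC', lookalikes) == lookalikes)
```

Each capital lowercases within its own script, which is exactly what a 
single shared code point could not do without knowing the language of the 
surrounding text.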

The rules for deciding what is and what isn't a distinct character can be 
quite complex, and often politically charged. There's a lot of opposition 
to Unicode in East Asian countries because it unifies Han ideograms that 
look and behave the same in Chinese, Japanese and Korean. The Unicode 
Consortium unifies them for the same reason that Unicode doesn't 
distinguish between (say) English A, German A and French A. One reason 
some East Asians want the distinction is the same reason you or I might 
wish to flag one section of text as English and another as German, and 
have them displayed in slightly different typefaces and spell-checked 
with different dictionaries. The Unicode Consortium's answer is that this 
is beyond the remit of a character set, and is best handled by markup or 
higher-level formatting.

(Another reason for opposing Han unification is, let's be frank, pure 
nationalism.)

[1] As far as I can tell, the only character supported by legacy 
character sets which is not included in Unicode is the Apple logo from 
Mac charsets.

[2] The actual glyphs depend on the typeface used.

[3] Again, modulo the typeface you're using to view them.

-- 
Steven