encoding problems (é and è)

"Martin v. Löwis" martin at v.loewis.de
Fri Mar 24 17:52:39 EST 2006


John Machin wrote:
>> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH 
>> YIWN), it does not even work.
> 
> Sorry, I don't understand.
> 0565 is stand-alone ECH
> 0582 is stand-alone YIWN
> 0587 is the ligature.
> What doesn't work? At first guess, in the absence of an Armenian 
> informant, for pre-matching normalisation, I'd replace 0587 by the two 
> constituents -- just like 00DF would be expanded to "ss" (before 
> upshifting and before not caring too much about differences caused by 
> doubled letters).

Looking at UnicodeData.txt helps here:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

So U+0587 is a compatibility character for the sequence U+0565, U+0582.
I am not sure what the rationale for *this* compatibility character is,
but in many cases such characters are in Unicode only for compatibility
with some existing encoding: had they gone through proper unification,
they would not have been introduced as separate characters.
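
The same decomposition field can be read from Python rather than from
the data file. This is only a small sketch, assuming Python 3 and the
standard unicodedata module; describe() is a helper name made up for
this illustration.

```python
import unicodedata

def describe(ch):
    """Hypothetical helper: return (name, decomposition field) for ch."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)

# U+00DF has an empty decomposition field; U+0587 carries a <compat> one.
sharp_s = describe('\u00df')
ech_yiwn = describe('\u0587')

print(sharp_s)
print(ech_yiwn)
```

An empty string in the second position means the character has no
decomposition mapping at all.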

In many cases, ligature characters exist for typographical reasons; 
other examples are

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;

In these cases, it is the font designers who want code points for these
characters: the glyph of a ligature cannot be derived automatically
from the glyphs of the individual characters. I can only guess that the
issue with the Armenian ligature is similar.
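
For pre-matching normalisation, the Latin ligatures listed above can be
expanded with compatibility decomposition. A minimal sketch, assuming
Python 3 and the standard unicodedata module (the variable names are
mine):

```python
import unicodedata

# U+FB00..U+FB04: the ff, fi, fl, ffi and ffl ligatures.
ligatures = ['\ufb00', '\ufb01', '\ufb02', '\ufb03', '\ufb04']

# NFKD replaces each ligature by its constituent letters.
expanded = [unicodedata.normalize('NFKD', ch) for ch in ligatures]

print(expanded)
```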

Notice that the issue of U+00DF is entirely different: it is a character
on its own, not a ligature. That a common transliteration for this
character exists is again a different story.
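
To illustrate the difference: normalisation leaves U+00DF alone, so an
expansion to "ss" has to come from somewhere else. The sketch below
assumes Python 3.3 or later, where str.casefold() exists; whether case
folding is the right tool for a given matching task is a separate
question.

```python
import unicodedata

sharp_s = '\u00df'

# No decomposition mapping, so NFKD returns the character unchanged.
unchanged = unicodedata.normalize('NFKD', sharp_s) == sharp_s

# Case folding (not normalisation) is what maps it to "ss".
folded = sharp_s.casefold()

print(unchanged, folded)
```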

Now, as to what might not work: while compatibility decomposition
(NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
supported. This is intentional, of course: there is not, in general, a
single "canonical" compatibility character for a decomposed sequence.
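
The one-way nature of this can be checked directly. A small sketch,
assuming Python 3 and the standard unicodedata module; the canonical
case (e plus combining acute) is included only for contrast:

```python
import unicodedata

ligature = '\u0587'      # ARMENIAN SMALL LIGATURE ECH YIWN
pair = '\u0565\u0582'    # ECH followed by YIWN

# Compatibility decomposition splits the ligature ...
decomposed = unicodedata.normalize('NFKD', ligature)

# ... but NFKC does not put it back together.
recomposed = unicodedata.normalize('NFKC', decomposed)

# Canonical composition, by contrast, does round-trip:
# 'e' + COMBINING ACUTE ACCENT composes to U+00E9.
e_acute = unicodedata.normalize('NFC', 'e\u0301')

print(decomposed == pair, recomposed == ligature, e_acute == '\u00e9')
```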

Regards,
Martin


