Re: encoding problems (é and è)

Serge Orlov Serge.Orlov at gmail.com
Fri Mar 24 23:24:20 EST 2006


Martin v. Löwis wrote:
> John Machin wrote:
> >> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
> >> YIWN), it does not even work.
> >
> > Sorry, I don't understand.
> > 0565 is stand-alone ECH
> > 0582 is stand-alone YIWN
> > 0587 is the ligature.
> > What doesn't work? At first guess, in the absence of an Armenian
> > informant, for pre-matching normalisation, I'd replace 0587 by the two
> > constituents -- just like 00DF would be expanded to "ss" (before
> > upshifting and before not caring too much about differences caused by
> > doubled letters).
>
> Looking at the UnicodeData helps here:
>
> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
> 0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;
>
> So U+0587 is a compatibility character for U+0565,U+0582. Not sure
> what the rationale for *this* compatibility character is, but in many
> cases, they are in Unicode only for compatibility with some existing
> encoding - if they had gone through the proper Unification, they should
> not have been introduced as separate characters.

The problem is that U+0587 is a ligature in the Western Armenian dialect
(hy locale) but a character in its own right in the Eastern Armenian
dialect (hy_AM locale). It is strange that the code point is marked as a
compatibility character; it is either a mistake or a political decision.
It used to be a ligature before the orthographic reform carried out in
the 1930s by the communist government in Armenia, and then it became a
character. After the end of the Soviet Union (1991) people started to
think about going back to the old orthography, but that hasn't happened
and it's not clear if it ever will. So U+0587 is a character. By the
way, this char/ligature is present on both Western and Eastern Armenian
keyboard layouts:
http://www.datacal.com/products/armenian-western-layout.htm
It is between 9 and (. In Eastern Armenian this character is used in
the words և ("and" in English), արև ("sun" in English) and hundreds of
others. Needless to say, a great many documents contain this character.
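
For anyone who wants to check the UnicodeData fields quoted above from
Python, here is a small sketch using the standard unicodedata module. The
names ECH_YIWN, SUN and describe() are mine, made up for illustration, and
I wrote it in Python 3 syntax (on 2.x you would use u'...' literals). The
exact values depend on the Unicode database shipped with your Python.

```python
import unicodedata

# U+0587 and an Eastern Armenian word that contains it, written with
# escapes so the snippet does not depend on the source file encoding.
ECH_YIWN = "\u0587"          # the character itself, also the word "and"
SUN = "\u0561\u0580\u0587"   # the word for "sun"


def describe(ch):
    """Return (code point, name, category, decomposition) for one character."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


if __name__ == "__main__":
    # Compare the Armenian character with U+00DF (sharp s).
    for ch in (ECH_YIWN, "\u00df"):
        print(describe(ch))
    # The word stores the single code point, not the pair U+0565 U+0582.
    print(ECH_YIWN in SUN, "\u0565\u0582" in SUN)
```

The decomposition field for U+0587 should come back with the <compat>
tag, while U+00DF should have an empty decomposition, which matches the
two UnicodeData lines Martin quoted.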

>
> In many cases, ligature characters exist for typographical reasons;
> other examples are
>
> FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
> FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
> FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
> FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
> FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;
>
> In these cases, it is the font designers which want to have code points
> for these characters: the glyphs of the ligature cannot be automatically
> derived from the glyphs of the individual characters. I can only guess
> that the issue with that Armenian ligature is similar.
>
> Notice that the issue of U+00DF is entirely different: it is a character
> on its own, not a ligature. That a common transliteration for this
> character exists is again a different story.
>
> Now, as to what might not work: While compatibility decomposition
> (NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
> supported. This is intentional, of course: there is no "canonical"
> compatibility character for every decomposed code point.

Seems like NFKD will damage Eastern Armenian text (there are millions
of such documents). The result will be readable but the text will look
strange to the person who wrote the text.
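
To make the concern concrete, here is a rough sketch (Python 3 syntax,
standard unicodedata module only; the helper name survives() and the
sample word are my own choices, not anything from the library) that checks
which normalization forms leave an Eastern Armenian word untouched. As far
as I understand the forms, the canonical ones (NFC, NFD) should keep
U+0587 and the compatibility ones (NFKC, NFKD) should replace it, with no
way back.

```python
import unicodedata

# The Eastern Armenian word for "sun": U+0561 U+0580 U+0587.
SUN = "\u0561\u0580\u0587"


def survives(text, form):
    """True if normalizing `text` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, text) == text


# One entry per normalization form: does the word come through intact?
RESULTS = {form: survives(SUN, form) for form in ("NFC", "NFD", "NFKC", "NFKD")}

# Decompose with NFKD, then try to get the original back with NFKC.
DECOMPOSED = unicodedata.normalize("NFKD", SUN)
ROUND_TRIP = unicodedata.normalize("NFKC", DECOMPOSED)

if __name__ == "__main__":
    print(RESULTS)
    print([hex(ord(c)) for c in DECOMPOSED])
    print(ROUND_TRIP == SUN)
```

So if text has to be normalized before storage, sticking to NFC or NFD
looks like the safer choice for Armenian; the compatibility forms seem
better kept for throwaway search keys than for the stored text itself.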

  Serge.



