python 2.7 and unicode (one more time)

Tim Chase python.list at tim.thechases.com
Fri Nov 21 11:00:10 EST 2014


On 2014-11-22 02:23, Steven D'Aprano wrote:
> LATIN SMALL LETTER E
> COMBINING CIRCUMFLEX ACCENT
> 
> then my application should treat that as a single "character" and
> display it as:
> 
> LATIN SMALL LETTER E WITH CIRCUMFLEX
> 
> which looks like this: ê
> 
> rather than two distinct "characters" eˆ
> 
> Now, that specific example is a no-brainer, because the Unicode
> normalization routines will handle the conversion. But not every
> combination of accented characters has a canonical combined form.
> What about something like this?
> 
> 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING
> CARON}'
> 
> If I insert a character into my string, I want to be able to insert
> before the w or after the caron, but not in the middle of those
> three code points.

Things get even weirder if you have

 (u'\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}'
  u'\N{COMBINING OGONEK}\N{COMBINING CARON}')

and when you try to do comparisons like

 # u'' prefix so that \N{...} is also interpreted on Python 2.7
 s1 = u'\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
 s2 = u'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
 s3 = u'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'
 print(s1 == s2)  # False
 print(s1 == s3)  # False
 print(s2 == s3)  # False
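
All three compare unequal as raw code point sequences, even though they
are canonically equivalent. A minimal sketch of comparing them after
normalization, assuming NFC is the form you want to settle on;
`canon_eq` is a made-up helper name, not something from the stdlib:

```python
from __future__ import print_function  # harmless on Python 3
import unicodedata

s1 = u'\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
s2 = u'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
s3 = u'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'


def canon_eq(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    nfc_a = unicodedata.normalize('NFC', a)
    nfc_b = unicodedata.normalize('NFC', b)
    return nfc_a == nfc_b


# Raw comparison looks only at the code point sequences.
print(s1 == s2, s1 == s3, s2 == s3)    # False False False

# After normalization all three become e-with-ogonek + combining circumflex.
print(canon_eq(s1, s2), canon_eq(s1, s3), canon_eq(s2, s3))    # True True True
```

Normalization reorders the combining marks by their combining class,
which is why s2 and s3 end up identical despite the different order.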

Then you also have the case where you want to edit text and the user
wants to remove the COMBINING OGONEK from the character, so you *do*
want to do something akin to

 s4 = u''.join(c for c in s3 if c != u'\N{COMBINING OGONEK}')

And yet, weird things happen if you try to remove the circumflex:

 for test in (s1, s2, s3):
     stripped = u''.join(
         c for c in test if c != u'\N{COMBINING CIRCUMFLEX ACCENT}')
     # True for s1 (nothing gets removed), False for s2 and s3
     print(test == stripped)
They all make sense if you understand what's going on under the hood,
but from a visual/conceptual perspective, something feels amiss.
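
To peek under the hood, something like the following lists what each
string is actually made of. This is only a sketch; `describe` is a
hypothetical helper, while `unicodedata.combining()` is the stdlib call
that returns the canonical combining class (0 for base characters):

```python
from __future__ import print_function  # harmless on Python 3
import unicodedata

s1 = u'\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
s2 = u'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
s3 = u'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'


def describe(text):
    """Hypothetical helper: (code point, combining class, name) per code point."""
    return [
        (u'U+%04X' % ord(c), unicodedata.combining(c), unicodedata.name(c))
        for c in text
    ]


for label, s in (('s1', s1), ('s2', s2), ('s3', s3)):
    print(label, len(s))
    for codepoint, ccc, name in describe(s):
        print('   ', codepoint, ccc, name)
```

s1 has only two code points and contains no separate circumflex at
all, which is why filtering on COMBINING CIRCUMFLEX ACCENT leaves it
unchanged.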

-tkc