[issue10254] unicodedata.normalize('NFC', s) regression

Fri Dec 17 09:47:00 CET 2010

Martin v. Löwis <martin at v.loewis.de> added the comment:

> The logic suggested by Martin in msg120018 looks right to me, but the
> whole code seems to be unnecessarily complex.  (And comb1==comb may
> need to be changed to comb1>=comb.) I don't understand why linear
> search through "skipped" array is needed.  At the very least instead
> of adding their positions to the "skipped" list, used combining
> characters can be replaced by a non-character to be later skipped.

The skipped array keeps track of what characters have been integrated
into a base character, as they must not appear in the output.
Assume you have a sequence B,C,N,C,N,B (B: base character, C: combined,
N: not combined). You need to remember not to output C, whereas you
still need to output N. I don't think replacing them with a
non-character can work: which one would you choose (that cannot also
appear in the input)?
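
To make the bookkeeping concrete, here is a rough Python sketch of the
idea (not the C code in unicodedata.c). It assumes the input is already
decomposed and in canonical order, ignores starter+starter compositions,
and uses a tiny hypothetical COMPOSE table in place of the real
generated one. The names compose, COMPOSE and skipped are for
illustration only.

```python
import unicodedata

# Hypothetical, tiny composition table for illustration; the real table
# is generated from the Unicode database.
COMPOSE = {
    ("a", "\u0323"): "\u1ea1",       # a + dot below
    ("\u1ea1", "\u0302"): "\u1ead",  # a-with-dot-below + circumflex
}

def compose(text):
    """Simplified composition pass over decomposed, canonically ordered text."""
    out = []
    i, n = 0, len(text)
    while i < n:
        base = text[i]
        skipped = set()   # positions of marks already merged into base
        prev_ccc = 0      # combining class of the last mark left in place
        j = i + 1
        while j < n:
            ccc = unicodedata.combining(text[j])
            if ccc == 0:
                break     # next base character starts here
            # A mark is blocked if an earlier, unmerged mark has a
            # combining class that is at least as high.
            blocked = prev_ccc >= ccc
            composed = COMPOSE.get((base, text[j]))
            if composed is not None and not blocked:
                base = composed
                skipped.add(j)    # C: must not appear in the output
            else:
                prev_ccc = ccc    # N: still has to be output
            j += 1
        out.append(base)
        out.extend(text[k] for k in range(i + 1, j) if k not in skipped)
        i = j
    return "".join(out)

# B, C, N, C, N, B: 'a', dot below, diaeresis below, circumflex, dot above, 'b'
sample = "a\u0323\u0324\u0302\u0307b"
result = compose(sample)
```

In the sample, the dot below and the circumflex end up in skipped,
while the diaeresis below and the dot above are copied to the output
after the composed base character.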

The worst case (wrt. cskipped) is the maximum number of characters that
can get combined into a single base character. It used to be (and I
hope still is) 20 (decomposition of U+FDFA).
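
As a quick sanity check of that bound, one can ask the unicodedata
module how long the compatibility decomposition of U+FDFA is in the
Unicode database the interpreter ships with. This only measures the
decomposition; it is an assumption on my part that this is the right
quantity to compare against the size of the skipped array.

```python
import unicodedata

# Length of the compatibility (NFKD) decomposition of U+FDFA in the
# Unicode database bundled with this interpreter.
decomposed = unicodedata.normalize("NFKD", "\ufdfa")
length = len(decomposed)
print(length)
```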

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10254>
_______________________________________