[Python-3000] string C API

Sat Sep 16 20:01:28 CEST 2006

Nick Coghlan wrote:
> If an 8-bit encoding other than latin-1 is used for the internal buffer,
> then every comparison operation would have to decode the string to
> Unicode in order to compare code points.
> 
> It seems much simpler to me to ensure that what is stored internally is
> *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> determined by the largest code point in the string.
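
To make the quoted proposal concrete, here is a minimal sketch of how the
storage width could be chosen from the largest code point. The name
storage_width is hypothetical and exists only for this illustration; it is
not part of any proposed C API.

```python
def storage_width(text: str) -> int:
    """Return the bytes per code point (1, 2 or 4) needed for text.

    Hypothetical helper: the width is picked from the largest code
    point in the string, as the quoted proposal describes.
    """
    largest = max(map(ord, text), default=0)
    if largest < 0x100:      # fits in one byte (the latin-1 range)
        return 1
    if largest < 0x10000:    # fits in two bytes
        return 2
    return 4                 # anything above U+FFFF


print(storage_width("abc"), storage_width("\u20ac"), storage_width("\U0001F600"))
```

With a 1-byte buffer, each stored byte is the code point itself, which is
why latin-1 is the one 8-bit encoding that needs no decoding before a
comparison.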

Just try implementing comparison some time. You can end up implementing
the same algorithm six times at least, once for each pair (1,1), (1,2),
(1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
you can't reduce (2,1) to (1,2)), you need 9 different versions of the
algorithm. That sounds more complicated than always decoding.
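
The counts above can be checked with a short sketch. It only enumerates the
width pairs; the names WIDTHS, symmetric_pairs and ordered_pairs are made up
for this illustration.

```python
from itertools import combinations_with_replacement, product

# The three per-character storage widths under discussion, in bytes.
WIDTHS = (1, 2, 4)

# Symmetric algorithm: (2,1) can be handled by the (1,2) version,
# so only unordered pairs need their own implementation.
symmetric_pairs = list(combinations_with_replacement(WIDTHS, 2))

# Asymmetric algorithm: every ordered pair needs its own version.
ordered_pairs = list(product(WIDTHS, repeat=2))

print(symmetric_pairs)
print(len(symmetric_pairs), len(ordered_pairs))
```

This prints the six pairs listed above, followed by the counts 6 and 9.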

Regards,
Martin