[Python-3000] string C API

Nick Coghlan ncoghlan at gmail.com
Sat Sep 16 18:49:36 CEST 2006


Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
>> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:
> 
> That's true, but that this makes a good choice for a special case
> doesn't follow. Instead, frequency of occurrence of the special case
> makes it a good choice.

If an 8-bit encoding other than latin-1 is used for the internal buffer, then 
every comparison operation would have to decode the string to Unicode in order 
to compare code points.
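
As a rough illustration of the point (Python is used here purely for 
convenience, the helper name bytes_match_code_points is made up for this 
example, and cp1252 is just one arbitrary "other" 8-bit encoding):

```python
def bytes_match_code_points(text, encoding):
    """Return True if every encoded byte equals the code point it stands for.

    Hypothetical helper for illustration only; it is not part of any
    existing or proposed API.
    """
    return list(text.encode(encoding)) == [ord(ch) for ch in text]


# latin-1: byte values are the code points, so raw bytes can be compared
# directly against code points stored in a wider buffer.
print(bytes_match_code_points("caf\u00e9", "latin-1"))

# cp1252: the euro sign is stored as byte 0x80 but is code point U+20AC,
# so a comparison would have to decode first.
print(bytes_match_code_points("\u20ac", "cp1252"))
```

With latin-1 the stored bytes can be compared as they are; with the other 
encoding each comparison needs a decoding step first.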

It seems much simpler to me to ensure that what is stored internally is 
*always* the Unicode code points, with the width (1, 2 or 4 bytes) determined 
by the largest code point in the string. The 2- and 4-byte widths are the 
UCS-2 and UCS-4 formats that are compile-time selectable for unicode strings 
in Python 2.x, but I'm not aware of any name other than 'latin-1' for the 
1-byte case, where all of the code points are less than 256.
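
A minimal sketch of that width rule, assuming nothing beyond what is 
described above (the function name storage_width is hypothetical, and 
treating the empty string as 1 byte wide is my own assumption):

```python
def storage_width(text):
    """Bytes per character needed to store every code point directly.

    Hypothetical helper: the width depends only on the largest code
    point in the string. An empty string is assumed to use 1 byte.
    """
    largest = max(map(ord, text), default=0)
    if largest < 0x100:      # fits in one byte: the 'latin-1' case
        return 1
    if largest < 0x10000:    # fits in two bytes: the UCS-2 case
        return 2
    return 4                 # anything larger: the UCS-4 case


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
    print(repr(sample), storage_width(sample))
```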

> Hardly. Instead, the codec would have to create the string of the right
> width; a codec written in C would make two passes, rather than
> temporarily allocating memory to actually represent the UCS-4 codes.

Indeed, that does make more sense - one pass to figure out the number of 
characters and the largest code point, and a second to copy the characters to 
the allocated buffer.
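
Something along these lines, sketched in Python rather than C to keep it 
short. The names are hypothetical, the input is assumed to be well-formed 
UTF-8 (there is no validation), and the little-endian layout of the output 
buffer is an arbitrary choice for the example:

```python
def _code_points(data):
    """Yield code points from well-formed UTF-8 bytes (no validation)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, cp = 0, lead
        elif lead < 0xE0:
            extra, cp = 1, lead & 0x1F
        elif lead < 0xF0:
            extra, cp = 2, lead & 0x0F
        else:
            extra, cp = 3, lead & 0x07
        # Fold in the 6 payload bits of each continuation byte.
        for byte in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (byte & 0x3F)
        yield cp
        i += 1 + extra


def decode_two_pass(data):
    """Return (width, buffer) for UTF-8 input, scanning the input twice."""
    # Pass 1: count the characters and find the largest code point.
    # No temporary UCS-4 buffer is allocated.
    length = 0
    largest = 0
    for cp in _code_points(data):
        length += 1
        largest = max(largest, cp)
    width = 1 if largest < 0x100 else 2 if largest < 0x10000 else 4

    # Pass 2: allocate the final buffer once, then copy the code points in.
    buf = bytearray(length * width)
    for index, cp in enumerate(_code_points(data)):
        buf[index * width:(index + 1) * width] = cp.to_bytes(width, "little")
    return width, bytes(buf)


print(decode_two_pass("caf\u00e9".encode("utf-8")))
```

The cost is decoding the input twice, in exchange for never holding more 
than the final buffer in memory.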

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

