[Python-Dev] PEP 393: Flexible String Representation

Thu Jan 27 22:24:34 CET 2011

James Y Knight, 27.01.2011 21:26:
> On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:
>> "Martin v. Löwis", 24.01.2011 21:17:
>>> The Py_UNICODE type is still supported but deprecated. It is always
>>> defined as a typedef for wchar_t, so the wstr representation can
>>> double as Py_UNICODE representation.
>>
>> It's too bad this isn't initialised by default, though. Py_UNICODE is
>> the only representation that can be used efficiently from C code and
>> Cython relies on it for fast text processing. This proposal will
>> therefore likely have a pretty negative performance impact on
>> extensions written in Cython as the compiler could no longer expect
>> this representation to be available instantaneously.
>
> But the whole point of the exercise is so that it doesn't have to store
> a 4byte-per-char representation when a 1byte-per-char rep would do.

I am well aware of that. But I'm arguing that the current simpler internal 
representation has had its advantages for CPython as a platform.

> If cython wants to work most efficiently with this proposal, it should
> learn to deal with the three possible raw representations.

I agree. After all, CPython is lucky to have it available. It wouldn't be 
the first time that we duplicate looping code based on the input type. 
However, in addition to the looping code, it will also complicate all
indexing code, which at runtime always needs to test which of the
representations is in use before it can read a character. Currently, all of
this is a compile-time decision. This will necessarily have a performance impact.
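
To make the difference concrete, here is a rough sketch in plain C. It does
not use the actual PEP 393 API; the names flex_str, str_kind, flex_read and
flex_count are made up for illustration, and the struct is only a stand-in
for a string object that can hold 1, 2 or 4 bytes per character. A loop can
test the representation once and then run a specialised copy, whereas a
random-access read has to branch on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a string with three possible raw layouts. */
typedef enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 } str_kind;

typedef struct {
    str_kind kind;      /* bytes per character in `data` */
    size_t length;      /* number of characters */
    const void *data;   /* uint8_t[], uint16_t[] or uint32_t[] */
} flex_str;

/* Indexing: the kind has to be tested at runtime on every single read. */
static uint32_t flex_read(const flex_str *s, size_t i)
{
    assert(i < s->length);
    switch (s->kind) {
    case KIND_1BYTE: return ((const uint8_t *)s->data)[i];
    case KIND_2BYTE: return ((const uint16_t *)s->data)[i];
    default:         return ((const uint32_t *)s->data)[i];
    }
}

/* Looping: the kind is tested once, then a specialised loop runs.
   The loop body is duplicated three times, once per representation. */
static size_t flex_count(const flex_str *s, uint32_t ch)
{
    size_t i, n = 0;
    switch (s->kind) {
    case KIND_1BYTE: {
        const uint8_t *p = (const uint8_t *)s->data;
        for (i = 0; i < s->length; i++) n += (p[i] == ch);
        break;
    }
    case KIND_2BYTE: {
        const uint16_t *p = (const uint16_t *)s->data;
        for (i = 0; i < s->length; i++) n += (p[i] == ch);
        break;
    }
    default: {
        const uint32_t *p = (const uint32_t *)s->data;
        for (i = 0; i < s->length; i++) n += (p[i] == ch);
        break;
    }
    }
    return n;
}
```

With a single fixed-width representation, flex_read would collapse to one
array access whose element size is known when the extension is compiled.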

Stefan