[Python-Dev] PEP 393: Flexible String Representation

Sat Jan 29 07:33:54 CET 2011

"Martin v. Löwis", 28.01.2011 22:49:
> And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8
> representation for such a loop. Instead, it should access the str
> representation

Sure.

>> Regarding Cython specifically, the above will still be *possible* under
>> the proposal, given that the memory layout of the strings will still
>> represent the Unicode code points. It will just be trickier to implement
>> in Cython's type system as there is no longer a (user visible) C type
>> representation for those code units.
>
> There is: Py_UCS4 remains available.

Thanks for that pointer. I had always thought that all "*UCS4*" names were 
platform-specific and had completely missed that type. It's a lot nicer 
than Py_UNICODE because it allows users to fold surrogate pairs back into 
the character value.
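
To illustrate what I mean by folding, here is a minimal sketch in plain
Python (not Cython or the C-API). It assumes the input is a sequence of
16-bit code units as found in a narrow build; the function name
fold_surrogates is made up for this example.

```python
def fold_surrogates(units):
    """Yield code points from 16-bit code units, folding surrogate pairs.

    Hypothetical helper for illustration only; 'units' stands in for the
    Py_UNICODE buffer of a 16-bit Unicode build.
    """
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the high and low surrogate into one code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit, or a lone surrogate passed through unchanged.
            yield unit
            i += 1
```

A 16-bit type can only hold each half of the pair separately, whereas a
32-bit type like Py_UCS4 has room for the folded value.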

It's completely missing from the docs, BTW. Google doesn't give me a single 
mention for all of docs.python.org, even though the type has existed at 
least since Python 2.3, Cython's oldest supported runtime (and likely long 
before that).

If I had known about that type earlier, I could have ended up making that 
the native Unicode character type in Cython instead of bothering with 
Py_UNICODE. But this can still be changed, I think. Since type inference was 
available before native Py_UNICODE support, it's unlikely that users will 
have Py_UNICODE written in their code explicitly. So I can make the switch 
under the hood.

Just to explain, a native CPython C type is much better than an arbitrary 
integer type, because it allows Cython to apply specific coercion rules 
from and to Python object types. Like Py_UNICODE does currently, Py_UCS4 
would obviously coerce from and to a 1-character Unicode string, but it 
could additionally handle surrogate pair splitting and combining 
automatically on current 16-bit Unicode builds, so that you'd get a Unicode 
string with two code units (a surrogate pair) on coercion to Python.
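
The splitting direction could look roughly like the following sketch. It is
plain Python rather than what Cython would actually generate, and the name
ucs4_to_utf16_units is an assumption made up for this example.

```python
def ucs4_to_utf16_units(code_point):
    """Return the 16-bit code units for one code point.

    Hypothetical helper for illustration only: it mimics what coercing a
    Py_UCS4 value to a string would have to produce on a 16-bit build.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code_point < 0x10000:
        # Fits into a single 16-bit unit, no splitting needed.
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return [high, low]
```

For example, ucs4_to_utf16_units(0x1F600) gives [0xD83D, 0xDE00], i.e. a
string of length 2 on a 16-bit build.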

>> While I'm somewhat confident that I'll
>> find a way to fix this in Cython, my point is just that this adds a
>> certain level of complexity to C code using the new memory layout that
>> simply wasn't there before.
>
> Understood. However, I think it is easier than you think it is.

Let's see about the implications once there is an implementation.

Stefan