[Python-Dev] PEP 393: Flexible String Representation

Fri Jan 28 11:30:33 CET 2011

Florian Weimer, 28.01.2011 10:35:
> * Stefan Behnel:
>> "Martin v. Löwis", 24.01.2011 21:17:
>>> The Py_UNICODE type is still supported but deprecated. It is always
>>> defined as a typedef for wchar_t, so the wstr representation can double
>>> as Py_UNICODE representation.
>>
>> It's too bad this isn't initialised by default, though. Py_UNICODE is
>> the only representation that can be used efficiently from C code
>
> Is this really true?  I don't think I've seen any C API which actually
> uses wchar_t, beyond that what is provided by libc.  UTF-8 and even
> UTF-16 are much, much more common.

They are also much harder to use, unless you are really only interested in 
7-bit ASCII data - which is the case for most C libraries, so I believe 
that's what you meant here. However, this is the CPython runtime with 
built-in Unicode support, not the C runtime where it comes as an add-on at 
best, and where Unicode processing without being Unicode aware is common.

The nice thing about Py_UNICODE is that it basically gives you native 
Unicode code points directly, without needing to decode UTF-8 byte runs and 
the like. In Cython, it allows you to do things like this:

     def test_for_those_characters(unicode s):
         for c in s:
             # warning: randomly chosen Unicode escapes ahead
             if c in u"\u0356\u1012\u3359\u4567":
                 return True
         else:
             return False

The loop runs in plain C, using the somewhat obvious implementation with a 
loop over Py_UNICODE characters and a switch statement for the comparison. 
This would look a *lot* uglier with UTF-8 encoded byte strings.
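
As a rough illustration of that "somewhat obvious implementation", here is a 
hand-written C sketch. It is not Cython's actual generated code. The type 
name code_point_t and the function name contains_any_target are made up for 
this example, and a fixed 32-bit unit stands in for Py_UNICODE so that the 
snippet compiles without Python.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_UNICODE: one array element per code point.
 * Assumed here to be 32 bits wide so the sketch is self-contained. */
typedef uint32_t code_point_t;

/* Return 1 if any element of s[0..length) is one of the four (randomly
 * chosen) code points from the Cython example, otherwise 0. */
static int contains_any_target(const code_point_t *s, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        switch (s[i]) {
        case 0x0356:
        case 0x1012:
        case 0x3359:
        case 0x4567:
            return 1;
        default:
            break;
        }
    }
    return 0;
}
```

Each array element is already a complete code point, so the comparison is a 
plain integer switch with no decoding step in between.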

Regarding Cython specifically, the above will still be *possible* under the 
proposal, given that the memory layout of the strings will still represent 
the Unicode code points. It will just be trickier to implement in Cython's 
type system as there is no longer a (user visible) C type representation 
for those code units. It can be any of uchar, ushort16 or uint32, none 
of which is necessarily a 'native' representation of a Unicode character in 
CPython. While I'm somewhat confident that I'll find a way to fix this in 
Cython, my point is just that this adds a certain level of complexity to C 
code using the new memory layout that simply wasn't there before.
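
To make that extra complexity concrete, here is a minimal C sketch of the 
same character test when the code unit width is only known at runtime. The 
enum, its constants and both function names are hypothetical and chosen for 
this example only. They are not taken from the PEP or from any CPython 
header, and the sketch assumes the buffer is suitably aligned for its unit 
width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags for the width of one code unit, in bytes. */
enum unit_kind { UNIT_1BYTE = 1, UNIT_2BYTE = 2, UNIT_4BYTE = 4 };

/* Read element i of a buffer whose element width is given by kind and
 * widen it to a 32-bit code point value. */
static uint32_t read_code_point(const void *data, enum unit_kind kind,
                                size_t i)
{
    switch (kind) {
    case UNIT_1BYTE:
        return ((const uint8_t *)data)[i];
    case UNIT_2BYTE:
        return ((const uint16_t *)data)[i];
    default: /* UNIT_4BYTE */
        return ((const uint32_t *)data)[i];
    }
}

/* Same test as in the Cython example, but every access has to go through
 * the width dispatch above instead of indexing a single C type. */
static int contains_any_target_flex(const void *data, enum unit_kind kind,
                                    size_t length)
{
    for (size_t i = 0; i < length; i++) {
        switch (read_code_point(data, kind, i)) {
        case 0x0356:
        case 0x1012:
        case 0x3359:
        case 0x4567:
            return 1;
        default:
            break;
        }
    }
    return 0;
}
```

The loop body is unchanged, but the element type is no longer visible in the 
function signature, which is the part that is awkward to express in a static 
type system. An alternative would be to write three specialised loops, one 
per width, and pick one up front.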

Stefan