[Python-Dev] PEP 393: Flexible String Representation
Stefan Behnel
stefan_ml at behnel.de
Fri Jan 28 11:30:33 CET 2011
Florian Weimer, 28.01.2011 10:35:
> * Stefan Behnel:
>> "Martin v. Löwis", 24.01.2011 21:17:
>>> The Py_UNICODE type is still supported but deprecated. It is always
>>> defined as a typedef for wchar_t, so the wstr representation can double
>>> as Py_UNICODE representation.
>>
>> It's too bad this isn't initialised by default, though. Py_UNICODE is
>> the only representation that can be used efficiently from C code
>
> Is this really true? I don't think I've seen any C API which actually
> uses wchar_t, beyond that what is provided by libc. UTF-8 and even
> UTF-16 are much, much more common.
They are also much harder to use, unless you are really only interested in
7-bit ASCII data - which is the case for most C libraries, so I believe
that's what you meant here. However, this is the CPython runtime with
built-in Unicode support, not the C runtime where it comes as an add-on at
best, and where Unicode processing without being Unicode aware is common.
The nice thing about Py_UNICODE is that it basically gives you native
Unicode code points directly, without needing to decode UTF-8 byte runs and
the like. In Cython, it allows you to do things like this:
    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False
The loop runs in plain C, using the somewhat obvious implementation with a
loop over Py_UNICODE characters and a switch statement for the comparison.
This would look a *lot* more ugly with UTF-8 encoded byte strings.
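To make the comparison concrete, here is a rough hand-written C sketch of both variants. It is not what Cython actually generates; the names code_point_t, test_fixed_width and test_utf8 are made up for illustration, code_point_t merely stands in for a wide Py_UNICODE, and the UTF-8 version assumes well-formed input and does no validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_UNICODE: one array element per code point. */
typedef uint32_t code_point_t;

/* Return 1 if c is one of the four (randomly chosen) code points. */
static int is_wanted(uint32_t c)
{
    switch (c) {
    case 0x0356:
    case 0x1012:
    case 0x3359:
    case 0x4567:
        return 1;
    default:
        return 0;
    }
}

/* Fixed-width variant: a plain loop, one element per character. */
static int test_fixed_width(const code_point_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (is_wanted(s[i]))
            return 1;
    }
    return 0;
}

/* UTF-8 variant: each character must first be decoded from 1 to 4 bytes.
 * Assumes well-formed UTF-8; a real decoder would have to validate. */
static int test_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint32_t c = s[i++];
        size_t extra = 0;           /* number of continuation bytes */
        if ((c & 0xE0) == 0xC0) {
            c &= 0x1F; extra = 1;
        } else if ((c & 0xF0) == 0xE0) {
            c &= 0x0F; extra = 2;
        } else if ((c & 0xF8) == 0xF0) {
            c &= 0x07; extra = 3;
        }
        for (size_t k = 0; k < extra && i < len; k++)
            c = (c << 6) | (s[i++] & 0x3F);
        if (is_wanted(c))
            return 1;
    }
    return 0;
}
```

Even with validation left out, the UTF-8 loop needs the extra decoding step before the comparison can happen, which is the added ugliness referred to above.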
Regarding Cython specifically, the above will still be *possible* under the
proposal, given that the memory layout of the strings will still represent
the Unicode code points. It will just be trickier to implement in Cython's
type system as there is no longer a (user visible) C type representation
for those code units. It can be any of uchar, ushort16 or uint32, none
of which is necessarily a 'native' representation of a Unicode character in
CPython. While I'm somewhat confident that I'll find a way to fix this in
Cython, my point is just that this adds a certain level of complexity to C
code using the new memory layout that simply wasn't there before.
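
As a sketch of the kind of extra complexity meant here, C code that wants to scan such a string has to dispatch on the width of the code units before it can read a single character. The following is only an illustration under assumptions: unit_width, read_unit and contains_code_point are hypothetical names, not the API from the PEP, and the buffer is assumed to hold len units of the stated width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag for the width of one code unit, in bytes. */
enum unit_width { WIDTH_1 = 1, WIDTH_2 = 2, WIDTH_4 = 4 };

/* Read the i-th code unit from a buffer whose element size is only
 * known at runtime, widening it to 32 bits. */
static uint32_t read_unit(enum unit_width width, const void *data, size_t i)
{
    switch (width) {
    case WIDTH_1:
        return ((const uint8_t *)data)[i];
    case WIDTH_2:
        return ((const uint16_t *)data)[i];
    default:
        return ((const uint32_t *)data)[i];
    }
}

/* Return 1 if the string contains the code point `wanted`, else 0. */
static int contains_code_point(enum unit_width width, const void *data,
                               size_t len, uint32_t wanted)
{
    for (size_t i = 0; i < len; i++) {
        if (read_unit(width, data, i) == wanted)
            return 1;
    }
    return 0;
}
```

The loop itself stays simple, but the element type is no longer visible in the function signature, which is what makes it awkward to express in a static type system.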
Stefan