[Cython] Py_UNICODE* string support

Stefan Behnel stefan_ml at behnel.de
Sun Mar 3 10:32:36 CET 2013


Nikita Nemkin, 03.03.2013 09:25:
> On Sun, 03 Mar 2013 13:52:49 +0600, Stefan Behnel wrote:
>> Are you aware that Py_UNICODE is deprecated as of Py3.3?
>>
>> http://docs.python.org/3.4/c-api/unicode.html
>>
>> Your changes look a bit excessive for supporting something that's
>> inefficient in recent Python versions and basically "dead".
> 
> Yes, I'm well aware of Py3.3 changes, but consider this:
> 
> 1. _All_ system APIs on Windows, old, new and in-between, use UTF-16 in the
>    form of zero-terminated 2-byte wchar_t* strings (on Windows Py_UNICODE is
>    _always_ aliased to wchar_t specifically for this reason).
>    Whatever happens to Python internals, the need to interoperate with
>    UTF-16 based platforms won't go away.

Ok, fine with me.

Your changes look fairly reasonable, especially for a first try. I have the
following comments.

1) I would like to get rid of UnicodeConst. A Py_UNICODE* is no different
from any other C array, except that it can coerce to and from Unicode
strings. So the representation of a literal should be a (properly reference
counted) Python Unicode object, and users would be allowed to cast it to
<Py_UNICODE*>, just as we support it for <char*> and bytes.

2) Non-BMP literals should be supported by representing them as normal
Unicode strings and creating the Py_UNICODE representation on demand (i.e.
explicitly through a cast, at runtime). Py_UNICODE[] literals are simply
not portable, because the width of Py_UNICODE (2 or 4 bytes) depends on
the platform and the Python build.

3) __Pyx_Py_UNICODE_strlen() is ok, but only for the special case that all
we have is a Py_UNICODE*. As long as we are dealing with Unicode string
objects, that won't be needed, so len() should be constant time in the
normal case instead of linear time.

4) Most of the changes in PyrexTypes.py and ExprNodes.py look ok. I would
eventually like to see a couple of refactorings in these sections (because
the special cases add up over time), but that's not required for this change.
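
To illustrate the portability problem from point 2, here is a minimal sketch
in plain Python 3 (not Cython, and not part of the patch). It only counts how
many code units the same string needs when the unit is 2 bytes wide (UTF-16,
as with wchar_t on Windows) and when it is 4 bytes wide. The sample string is
an arbitrary choice for illustration.

```python
# One BMP character (U+00E9) followed by one non-BMP character (U+1F600).
text = "\u00e9\U0001F600"

# Code units needed with a 2-byte unit (UTF-16): the non-BMP character
# has to be split into a surrogate pair.
utf16_units = len(text.encode("utf-16-le")) // 2

# Code units needed with a 4-byte unit (UTF-32): one unit per code point.
utf32_units = len(text.encode("utf-32-le")) // 4

print(utf16_units, utf32_units)  # prints "3 2"
```

The array contents and length differ between the two layouts, which is why a
literal baked into the C source as a fixed Py_UNICODE[] cannot be correct
everywhere, while a Unicode object converted at runtime can.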

So, the basic idea would be to use Unicode strings and their (optional)
internal representation as Py_UNICODE[] instead of making Py_UNICODE[] a
first class data type. And then go from there and optimise certain things
to use the unpacked array directly, so that users won't need to put
explicit C-API calls into their code.
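
As a rough analogy for this design and for point 3, the following sketch uses
plain Python with ctypes rather than Cython. The Unicode object is the
primary value, the zero-terminated wide character array is derived from it
only when needed, and a length scan is only required when nothing but the raw
array is available. wide_strlen() is a hypothetical helper written for this
example, not the __Pyx_Py_UNICODE_strlen() from the patch.

```python
import ctypes


def wide_strlen(buf):
    """Count code units up to the terminating zero (linear time)."""
    n = 0
    while buf[n] != "\0":
        n += 1
    return n


# The Unicode object is the source of truth.
text = "hello"

# Derive a zero-terminated wchar_t array from it on demand.
buf = ctypes.create_unicode_buffer(text)

# With the object at hand, the length is stored and len() is constant time.
known_length = len(text)

# With only the raw array, the length must be found by scanning for the zero.
scanned_length = wide_strlen(buf)

print(known_length, scanned_length)  # prints "5 5"
```

The sample text is ASCII only, so the two lengths agree on every platform.
With non-BMP characters and a 2-byte wchar_t they would differ.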

Stefan


