[capi-sig] Unicode compatibility

Sun May 23 19:51:09 CEST 2010

Daniel Stutzbach, 21.05.2010 16:34:
> If you try to load an extension module that:
> - uses any of Python's Unicode functions, and
> - was compiled by a Python with the opposite Unicode setting (UCS2 vs UCS4)
> then you get an ugly "undefined symbol" error from the linker.

Well known problem, yes.
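
For readers who have not run into it: the error comes from the Unicode API
names being remapped by the headers to width-specific symbols
(PyUnicodeUCS2_* or PyUnicodeUCS4_*). Here is a minimal sketch of that
mechanism. All demo_* names and the DEMO_WIDE_UNICODE switch are made up for
illustration and are not part of CPython.

```c
#include <string.h>

/* Stand-in for the header trick: one public name, two possible symbols.
 * DEMO_WIDE_UNICODE is a hypothetical build switch. */
#ifdef DEMO_WIDE_UNICODE
#  define demo_unicode_length demo_unicodeUCS4_length
#else
#  define demo_unicode_length demo_unicodeUCS2_length
#endif

/* "Interpreter" side: only the variant matching its own build exists. */
size_t demo_unicode_length(const char *s)
{
    return strlen(s);
}

/* "Extension" side: this call is compiled against the remapped name, so
 * an interpreter built with the other width has no symbol to resolve it. */
size_t demo_extension_call(const char *s)
{
    return demo_unicode_length(s);
}

/* Expose the symbol name the extension actually references. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)
const char *demo_linked_symbol(void)
{
    return DEMO_STR(demo_unicode_length);
}
```

Compiling the extension side with the opposite switch would leave it looking
for a symbol that the interpreter side never defined.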

> By default, extensions will compile in a "Unicode-agnostic" mode, where
> Py_UNICODE is an incomplete type. The extension's code can pass Py_UNICODE
> pointers back and forth between Python API functions, but it cannot
> dereference them nor use sizeof(Py_UNICODE).  Unicode-agnostic modules will
> load and run in both UCS2 and UCS4 interpreters.  Most extensions fall into
> this category.

This is a pretty bad default for Cython code. Starting with version 0.13, 
Cython will try to infer Py_UNICODE for single character unicode strings 
and use that whenever possible, e.g. when for-looping over unicode strings 
and during character comparisons. Making Py_UNICODE an incomplete type will 
render this impossible.
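
To illustrate the difference, here is a rough sketch in plain C. It uses a
stand-in type, demo_unicode_t, instead of the real Py_UNICODE, and a
hypothetical DEMO_AGNOSTIC switch that plays the role of the proposed default.
The 4-byte width is only an assumption for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Py_UNICODE; all demo_* names are hypothetical. */
#ifdef DEMO_AGNOSTIC
typedef struct demo_unicode_opaque demo_unicode_t;  /* incomplete type */
#else
typedef uint32_t demo_unicode_t;  /* complete type, width assumed here */
#endif

/* Fine in both modes: the pointer is only passed along, never read. */
const demo_unicode_t *demo_pass_through(const demo_unicode_t *s)
{
    return s;
}

#ifndef DEMO_AGNOSTIC
/* Needs the complete type: indexing and comparing single characters
 * dereferences the pointer, which an incomplete type does not allow. */
size_t demo_count_char(const demo_unicode_t *s, size_t len,
                       demo_unicode_t ch)
{
    size_t i, n = 0;
    for (i = 0; i < len; i++) {
        if (s[i] == ch)
            n++;
    }
    return n;
}
#endif
```

With DEMO_AGNOSTIC defined, demo_count_char cannot be compiled at all, which
is the situation the character loops and comparisons generated by Cython
would be in.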

> If a module needs to dereference Py_UNICODE, it can define
> PY_REAL_PY_UNICODE before including Python.h to make Py_UNICODE a complete
> type

So that would be an option that all Cython modules (or at least those that 
use Py_UNICODE and/or single unicode characters somewhere) would use 
automatically. Not much to gain here.

> Attempting to load such a module into a mismatched interpreter will
> cause an ImportError (instead of an ugly linker error).  If an extension
> uses PY_REAL_PY_UNICODE in any .c file, it must also use it in the .c file
> that calls PyModule_Create to ensure the Unicode width is stored in the
> module's information.

Cython modules should normally be self-contained, but there is no guarantee 
that a module wrapping C code that uses Py_UNICODE will also use Py_UNICODE 
somewhere in its own code, which is what Cython would need to see in order 
to enable that option automatically. Cython would therefore be forced to 
enable the option for basically all code that calls into C code.
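
For reference, the load-time check described in the proposal could look
roughly like the sketch below. The record layout, the function name and the
convention that 0 means "Unicode-agnostic" are my assumptions, since the
proposal text quoted here does not spell them out.

```c
#include <stddef.h>

/* Hypothetical per-module record; not an actual CPython structure. */
typedef struct {
    const char *name;
    int unicode_width;  /* 0 = agnostic, 2 = UCS2, 4 = UCS4 (assumed) */
} demo_module_info;

/* Returns 0 if the module may be loaded and -1 on a width mismatch,
 * which is where a loader would raise ImportError. */
int demo_check_unicode_width(const demo_module_info *mod,
                             int interpreter_width)
{
    if (mod->unicode_width == 0)
        return 0;  /* agnostic modules load everywhere */
    return (mod->unicode_width == interpreter_width) ? 0 : -1;
}
```

The check only helps if the recorded width is correct, which is why every
module that touches C code using Py_UNICODE would have to set it.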

> 2) Would you prefer the default be reversed?  i.e, that Py_UNICODE be a
> complete type by default, and an extension must have a #define to compile in
> Unicode-agnostic mode?

Absolutely. IMHO, the only platform that always requires binaries due to 
incomplete operating system support for source distributions is MS Windows, 
where Py_UNICODE equals wchar_t anyway. In some cases, Mac OS X is broken 
enough to require binary releases, too, but the normal target on that 
platform is the system Python, which has a universal setting for the 
Py_UNICODE size as well.

So the only remaining platforms that suffer from binary incompatibility 
problems here are Linux and Unix systems, where the Py_UNICODE size differs 
between installations and distributions. Given that these systems are best 
targeted with a source distribution, it sounds like a bad default to 
complicate the usage of Py_UNICODE for everyone, unless users explicitly 
disable this behaviour. It's much better to provide this as an option for 
extension writers who really want (or need) to provide portable binary 
distributions for whatever reason.

Personally, I think the drawbacks totally outweigh the single advantage, 
though, so I could absolutely live without this change. It's easy enough to 
drop the linkage error message into a web search engine.

Stefan