[Python-Dev] PEP 393 Summer of Code Project

Victor Stinner victor.stinner at haypocalc.com
Fri Aug 26 23:37:42 CEST 2011


On Friday 26 August 2011 02:01:42, Dino Viehland wrote:
> The biggest difficulty for IronPython here would be dealing w/ .NET
> interop. We can certainly introduce either an IronPython specific string
> class which is similar to CPython's PyUnicodeObject or we could have
> multiple distinct .NET types (IronPython.Runtime.AsciiString,
> System.String, and
> IronPython.Runtime.Ucs4String) which all appear as the same type to Python.
> 
> But when Python is calling a .NET API it's always going to return a
> System.String which is UTF-16.  If we had to check and convert all of
> those strings when they cross into Python it would be very bad for
> performance.  Presumably we could have a 4th type of "interop" string
> which lazily computes this but if we start wrapping .Net strings we could
> also get into object identity issues.

Python 3 encodes all Unicode strings to the OS encoding (and decodes the 
result) for all syscalls and calls to libraries: to the locale encoding on 
UNIX, to UTF-16 on Windows. On Windows, Py_UNICODE is currently wchar_t, which 
is 16 bits, so a Py_UNICODE* is already a UTF-16 string.

I don't know whether the overhead of PEP 393 (encoding to UTF-16 on Windows) 
for these calls is significant or not. But on UNIX, pure ASCII strings don't 
have to be encoded anymore if the locale encoding is UTF-8 or ASCII.
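
To illustrate the point above, here is a minimal sketch in pure Python (it 
does not touch the C API). The filename is a made-up example. It only shows 
why a pure ASCII string needs no real conversion for an ASCII or UTF-8 locale, 
while UTF-16 always needs two bytes per character:

```python
# Hypothetical pure-ASCII string that would be passed to a syscall.
text = "hello.txt"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # explicit byte order, so no BOM is added

# ASCII and UTF-8 produce the same bytes for pure ASCII text, so a 1-byte
# internal representation can be handed to the OS unchanged on such locales.
same_on_unix = ascii_bytes == utf8_bytes

# UTF-16 uses one 16-bit code unit per BMP character, so a conversion
# (and an allocation) is needed when the internal storage is 1 byte/char.
utf16_len = len(utf16_bytes)

print(same_on_unix, len(utf8_bytes), utf16_len)
```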

IronPython can wait to see how CPython+PEP 393 handles these problems, and how 
much slower it is.

> But it's a huge change - it'll almost certainly touch every single source
> file in IronPython.

With PEP 393, it's transparent: PyUnicode_AS_UNICODE encodes the string to 
UTF-16 (allocating memory, etc.). The one difference is that applications 
should now check whether an error occurred (check for NULL).

> I would think we'd get 3.2 done first and then think
> about what to do here.

I don't think that IronPython needs to support non-BMP characters without 
using surrogates. Bug reports about non-BMP characters usually don't come 
with use cases; they just want to make Python perfect. There is no need to 
hurry.
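
For readers unfamiliar with surrogates, here is a small sketch of how a 
non-BMP code point is split into two 16-bit code units in UTF-16. The helper 
name to_surrogates is mine, not something from CPython or IronPython, and the 
code point is just an example:

```python
def to_surrogates(code_point):
    """Split a non-BMP code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Cross-check against the UTF-16 codec for one example character.
ch = "\U0001F600"
raw = ch.encode("utf-16-le")
code_units = [int.from_bytes(raw[i:i + 2], "little")
              for i in range(0, len(raw), 2)]

print([hex(u) for u in code_units])
```

A string type that stores UTF-16 sees this character as two units, which is 
what a narrow build or System.String exposes.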

PEP 393 tries to reduce the memory footprint. The effect on non-BMP characters 
is just a *nice* side effect. Or was the PEP designed to solve narrow build 
issues?
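
The memory effect can be observed roughly as follows. This is only a sketch 
and it assumes a CPython that implements PEP 393 (3.3 or later); 
sys.getsizeof is implementation specific, and the helper name bytes_per_char 
is mine. It measures the growth between two lengths of the same string so 
that the fixed object header cancels out:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of `ch`.

    Comparing two lengths of the same kind of string removes the
    fixed header size from the measurement.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


ascii_cost = bytes_per_char("a")             # code points below U+0080
bmp_cost = bytes_per_char("\u20ac")          # a BMP character above U+00FF
astral_cost = bytes_per_char("\U0001F600")   # a non-BMP character

print(ascii_cost, bmp_cost, astral_cost)
```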

Victor


