[Python-Dev] unicode_internal codec and the PEP 393

Wed Nov 9 22:03:52 CET 2011

> The unicode_internal decoder doesn't decode surrogate pairs and so 
> test_unicode.UnicodeTest.test_codecs() is failing on Windows (16-bit wchar_t). 
> I don't know if this codec is still relevant with PEP 393 because the
> internal representation now depends on the maximum character (Py_UCS1*,
> Py_UCS2* or Py_UCS4*), whereas it was a fixed size with Python <= 3.2
> (Py_UNICODE*).

The current status is the way it is because we (Torsten and I) didn't
bother figuring out the purpose of the internal codec.

> Should we:
> 
>  * Drop this codec (public and documented, but I don't know if it is used)
>  * Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2, and 
> so fix the decoder to handle surrogate pairs
>  * Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)
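
To make the surrogate-pair issue in the second option concrete, here is a
small sketch in pure Python. It uses the standard utf-16-le and utf-32-le
codecs only as stand-ins for 16-bit and 32-bit code units; the helper
combine_surrogates() is a hypothetical name, not anything in CPython.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (illustrative helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


ch = "\U0001F600"  # a character outside the BMP

# With 16-bit code units the character takes two units (a surrogate pair).
units16 = ch.encode("utf-16-le")
high = int.from_bytes(units16[0:2], "little")
low = int.from_bytes(units16[2:4], "little")

# With 32-bit code units it takes a single unit.
units32 = ch.encode("utf-32-le")

# A decoder working on 16-bit units has to do this step explicitly.
code_point = combine_surrogates(high, low)
```

A decoder that skips the combining step would yield two separate surrogate
code points instead of the one original character.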

It's described as "Return the internal representation of the operand".
That would suggest that the last choice (i.e. return the real internal
representation) would be best, except that this doesn't round-trip:
the raw bytes alone don't say which of the three widths was used.
Adding a prefix byte indicating the kind (and perhaps also the ASCII
flag) would then be closest to the real representation.
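
A minimal sketch of the prefix-byte idea, in pure Python rather than C.
Everything here is an assumption made for illustration: the function names,
the choice of 1/2/4 as the kind values, and the fixed little-endian byte
order (the real internal buffer would be in native order). It is not the
actual codec.

```python
def kind_of(s):
    """Width in bytes needed per character: 1, 2 or 4 (hypothetical helper)."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4


def encode_with_kind(s):
    """Emit one kind byte followed by fixed-width code units."""
    kind = kind_of(s)
    body = b"".join(ord(ch).to_bytes(kind, "little") for ch in s)
    return bytes([kind]) + body


def decode_with_kind(data):
    """Inverse of encode_with_kind(); the kind byte selects the unit width."""
    if not data:
        raise ValueError("missing kind byte")
    kind, body = data[0], data[1:]
    if kind not in (1, 2, 4):
        raise ValueError("unknown kind byte")
    if len(body) % kind:
        raise ValueError("truncated data")
    return "".join(
        chr(int.from_bytes(body[i:i + kind], "little"))
        for i in range(0, len(body), kind)
    )
```

Without the prefix the ambiguity is easy to see: in this sketch both "ab"
and "\u6261" produce the body bytes b"ab", and only the leading kind byte
(1 versus 2) tells them apart.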

As that is likely not very useful, and might break some applications
of the encoding (if there are any at all) which might expect to
pass unicode-internal strings across Python versions, I would then
also deprecate the encoding.

Regards,
Martin