[Python-Dev] unicode_internal codec and the PEP 393

Wed Nov 9 11:14:50 CET 2011

Hi,

The unicode_internal decoder doesn't decode surrogate pairs, so 
test_unicode.UnicodeTest.test_codecs() is failing on Windows (16-bit wchar_t). 
I don't know if this codec is still relevant with PEP 393, because the 
internal representation now depends on the maximum character (Py_UCS1*, 
Py_UCS2* or Py_UCS4*), whereas it had a fixed size with Python <= 3.2 
(Py_UNICODE*).
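
To illustrate how the PEP 393 representation depends on the maximum 
character, here is a rough Python sketch (not the actual C code; the 
function name storage_kind is made up for this example):

```python
def storage_kind(text):
    """Return the bytes per character a PEP 393 string would use (1, 2 or 4).

    Hypothetical helper for illustration; CPython makes this choice in C.
    """
    max_char = max(map(ord, text), default=0)
    if max_char < 0x100:
        return 1  # fits in Py_UCS1
    if max_char < 0x10000:
        return 2  # fits in Py_UCS2
    return 4      # needs Py_UCS4
```

So a single non-BMP character is enough to switch the whole string to 
4 bytes per character, and the bytes seen by unicode_internal would change 
accordingly if the codec exposed the real representation.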

Should we:

 * Drop this codec (it is public and documented, but I don't know if it is used)
 * Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2, and 
so fix the decoder to handle surrogate pairs
 * Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)

?
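
For the wchar_t* option, handling surrogate pairs means combining each 
high/low pair of 16-bit units into one code point. A minimal Python sketch 
of that step, assuming the input is already a list of 16-bit integers 
(join_surrogates is a hypothetical name, not an existing function):

```python
def join_surrogates(units):
    """Combine UTF-16 surrogate pairs in a sequence of 16-bit code units.

    Hypothetical helper for illustration; lone surrogates are kept as-is.
    """
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if (0xD800 <= unit <= 0xDBFF and has_next
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by a low surrogate: one non-BMP character.
            low = units[i + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            chars.append(chr(code_point))
            i += 2
        else:
            # BMP character or unpaired surrogate: copy unchanged.
            chars.append(chr(unit))
            i += 1
    return "".join(chars)
```

For example, join_surrogates([0xD880, 0xDC03]) gives '\U00030003'.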

The failure on Windows:

FAIL: test_codecs (test.test_unicode.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\Buildslave\3.x.moore-windows\build\lib\test\test_unicode.py", line 1408, in test_codecs
    self.assertEqual(str(u.encode(encoding),encoding), u)
AssertionError: '\ud800\udc01\ud840\udc02\ud880\udc03\ud8c0\udc04\ud900\udc05' != '\U00030003\U00040004\U00050005'

Victor