[Python-ideas] Processing surrogates in

Stephen J. Turnbull stephen at xemacs.org
Thu May 14 16:38:57 CEST 2015


Andrew Barnert via Python-ideas writes:

 > > And yet one source of surrogates -- Python sources. eval(), etc.

Yep:

$ python3.4
Python 3.4.3 (default, Mar 10 2015, 14:53:35) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> chr((16*13+8)*256)
'\ud800'
>>> '\ud800'
'\ud800'
>>> '\ud834\udd1e'
'\ud834\udd1e'
>>> 

 > If I type '\uD834\uDD1E' in Python 3.4 source, am I actually going
 > to get an illegal Unicode string made of 2 surrogate code points
 > instead of either an error or the single-character string
 > '\U0001D11E'?

Yes.  How else do you propose to test the surrogateescape error
handler?  Now, are you sitting down?  If not, you should before
looking at the next example. ;-)

>>> '\U0000d834\U0000dd1e'
'\ud834\udd1e'
>>> 

Isn't that disgusting?  But in Python, str is an array of code points,
and surrogates are code points like any other.  Literals and chr() can
be used to produce str containing surrogates, as can codec error
handlers.
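
For concreteness, here is a minimal sketch of those three sources of
surrogates and of what happens when you try to get them back out of a
str.  It assumes CPython 3.4 or later (surrogatepass for the UTF-16
codecs is not available before that), and the variable names are purely
illustrative.

```python
# Three ways to end up with surrogate code points in a str.
from_literal = '\ud834\udd1e'          # escape sequences in a literal
from_chr = chr(0xD834) + chr(0xDD1E)   # chr() accepts surrogate values
# An undecodable byte 0xNN is smuggled in as the lone surrogate U+DCNN.
from_codec = b'\xff'.decode('utf-8', 'surrogateescape')

# The literal is two code points, not the single character U+1D11E.
pair_is_two = len(from_literal) == 2 and from_literal != '\U0001D11E'

# The strict UTF-8 encoder refuses surrogates.
try:
    from_literal.encode('utf-8')
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

# surrogateescape restores the original undecodable byte on encoding.
round_trip = from_codec.encode('utf-8', 'surrogateescape')

# One way to merge the pair: write both surrogates out as UTF-16 code
# units with surrogatepass, then decode the result normally.
combined = from_literal.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
```

The UTF-16 round trip is only one possible way to combine a pair; it
is shown here because it needs nothing outside the standard codecs.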



