[Python-ideas] Processing surrogates in

Thu May 14 16:38:57 CEST 2015

Andrew Barnert via Python-ideas writes:

 > > And yet one source of surrogates -- Python sources. eval(), etc.

Yep:

$ python3.4
Python 3.4.3 (default, Mar 10 2015, 14:53:35) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> chr((16*13+8)*256)
'\ud800'
>>> '\ud800'
'\ud800'
>>> '\ud834\udd1e'
'\ud834\udd1e'
>>> 

 > If I type '\uD834\uDD1E' in Python 3.4 source, am I actually going
 > to get an illegal Unicode string made of 2 surrogate code points
 > instead of either an error or the single-character string
 > '\U0001D11E'?

Yes.  How else do you propose to test the surrogateescape error
handler?  Now, are you sitting down?  If not, you should before
looking at the next example. ;-)

>>> '\U0000d834\U0000dd1e'
'\ud834\udd1e'
>>> 

Isn't that disgusting?  But in Python, str is an array of code units.
Literals and chr() can be used to produce str containing surrogates,
as well as codec error handling.