[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 01:57:21 CEST 2014

Jim Baker writes:

 > Given that Jython uses UTF-16 as its representation, it is possible to
 > frequently smuggle isolated surrogates in it. A surrogate pair must be a
 > low surrogate in range (D800, DC00), then a high surrogate in range(DC00,
 > E000).
 > 
 > Of course, if you do actually have a smuggled isolated low
 > surrogate FOLLOWED by a smuggled isolated high surrogate - guess
 > what, the only interpretation is a codepoint. Or perhaps more
 > likely garbage. Of course it doesn't happen so often, so maybe we
 > are fine with the occasional bug ;)

The CPython representation uses trailing surrogates only[1], so it's
never possible to interpret them as anything but non-characters -- as
soon as you encounter them you know that it's a lone surrogate.
Surely you can do the same.

As long as the Java string manipulation functions don't check for
surrogates, you should be fine with this representation.  Of course I
suppose your matching functions (etc) don't check for them either, so
you will be somewhat vulnerable to bugs due to treating them as
characters.  But the same is true for CPython, AFAIK.

Footnotes: 
[1]  Only 128 bytes are necessary since the 128 ASCII characters are
embedded in Unicode as-is.