[Python-ideas] Add "has_surrogates" flags to string object

Stephen J. Turnbull stephen at xemacs.org
Tue Oct 8 15:31:07 CEST 2013


Masklinn writes:

 > I don't know the details of the flexible string representation, but I
 > believed the names fit what was actually in memory. UCS2 does not
 > have surrogate pairs, thus surrogate codes make no sense in UCS2,
 > they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not
 > codepoints, they have no reason to appear in either UCS2 or UCS4
 > outside of encoding errors.

True, but Python doesn't actually use UCS2 or UCS4 internally.  It
uses UCS2 or UCS4 plus a row of codes from the surrogate area to
represent undecodable bytes.  This feature is optional (enabled by
passing errors="surrogateescape" to the codec), but I don't
suppose it's going to go away.
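
To make that concrete, here is a minimal sketch of how a lone surrogate
ends up inside a str.  It assumes the handler meant above is
"surrogateescape" (PEP 383); the has_surrogates() helper is a
hypothetical pure-Python stand-in for the proposed flag, not an
existing API.

```python
# Bytes that are not valid UTF-8: 0xFF can never appear in UTF-8 data.
raw = b"abc\xff"

# With errors="surrogateescape", each undecodable byte 0xNN (0x80..0xFF)
# is mapped to the lone surrogate code U+DCNN instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = text[-1]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")


def has_surrogates(s):
    """Hypothetical helper: True if s contains any code in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def encodes_strictly(s):
    """True if s can be encoded to UTF-8 with the default strict handler."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

So the string is four code units long, the last one is a surrogate
code, and a strict encoder refuses it.  That is the case a
has_surrogates flag would let you detect without scanning the string.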


