[Python-ideas] Add "has_surrogates" flags to string object
Stephen J. Turnbull
stephen at xemacs.org
Tue Oct 8 15:31:07 CEST 2013
Masklinn writes:
> I don't know the details of the flexible string representation, but I
> believed the names fit what was actually in memory. UCS2 does not
> have surrogate pairs, thus surrogate codes make no sense in UCS2,
> they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not
> codepoints, they have no reason to appear in either UCS2 or UCS4
> outside of encoding errors.
True, but Python doesn't actually use UCS2 or UCS4 internally. It
uses UCS2 or UCS4 plus a row of codes from the low surrogate area
(U+DC80..U+DCFF) to represent undecodable bytes. This feature is
optional (enabled by passing errors="surrogateescape" to the codec),
but I don't suppose it's going to go away.
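
As a rough illustration of the point above (not taken from the original
thread), here is a minimal sketch of how a lone surrogate can end up in a
str. It assumes Python 3 and the standard "surrogateescape" error
handler; the sample bytes and the variable names are made up for the
example.

```python
# Sample input, chosen for illustration: 0xFF is never valid in UTF-8.
raw = b"abc\xff"

# With surrogateescape, the undecodable byte 0xFF is mapped to the lone
# low surrogate U+DC00 + 0xFF = U+DCFF instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# A naive scan for surrogate code points; this is the kind of check a
# cached "has_surrogates" flag would let the interpreter skip.
has_lone_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the lone surrogate cannot be encoded.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

So a str produced this way holds a code point that is not a valid
Unicode scalar value, which is why "no surrogates in UCS2/UCS4" cannot
be assumed for Python strings.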