[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 13:58:20 CEST 2013

On 2013-10-08, at 13:43 , Serhiy Storchaka wrote:

> 08.10.13 14:38, Masklinn wrote:
>> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote:
>> 
>>> Here is an idea about adding a mark to the PyUnicode object that allows a fast answer to the question of whether a string contains surrogate codes. This mark has one of three possible states:
>>> 
>>> * String doesn't contain surrogates.
>>> * String contains surrogates.
>>> * It is still unknown.
>>> 
>>> We can combine this with "is_ascii" flag in 2-bit value:
>>> 
>>> * String is ASCII-only (and doesn't contain surrogates).
>>> * String is not ASCII-only and doesn't contain surrogates.
>>> * String is not ASCII-only and contains surrogates.
>>> * String is not ASCII-only and it is still unknown whether it contains surrogates.
>> 
>> Isn't that redundant with the kind under shortest form representation?
> 
> No, it isn't redundant. '\udc80' is a UCS2 string with a surrogate code, and '\udc80\U00010000' is a UCS4 string with a surrogate code.

I don't know the details of the flexible string representation, but I
believed the names fit what was actually in memory. UCS2 does not
have surrogate pairs, thus surrogate codes make no sense in UCS2,
they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not
codepoints, they have no reason to appear in either UCS2 or UCS4
outside of encoding errors.
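
For reference, a minimal Python sketch of what is being discussed,
assuming CPython 3.4 or later. Unicode does classify U+D800..U+DFFF as
code points (just not as scalar values), and a str can store them. The
names has_surrogates and encodes_strictly are hypothetical helpers
written for this example; has_surrogates is a plain linear scan, not
something CPython provides.

```python
def has_surrogates(s):
    """Hypothetical helper: True if s holds any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

def encodes_strictly(s):
    """True if s can be encoded to UTF-16 with the default 'strict' handler."""
    try:
        s.encode('utf-16-le')
    except UnicodeEncodeError:
        return False
    return True

lone = '\udc80'               # every code point fits in 16 bits
mixed = '\udc80\U00010000'    # U+10000 needs more than 16 bits

# Print each string with the result of both checks.
for s in (lone, mixed, '\U00010000', 'abc'):
    print(ascii(s), has_surrogates(s), encodes_strictly(s))
```

Both example strings from the quoted message report a surrogate, while
'\U00010000' on its own does not: inside a str it is a single code
point, not a pair.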

> A UCS2 string without surrogate codes can be encoded in UTF-16 by memcpy().

Surrogate codes prevent that (modulo the objections above) for slicing,
though I don't think that's a big issue: a guard can just check whether
the slice falls within a surrogate pair, which only requires checking
the first and last 2 bytes of the range. But not for concatenation,
right?
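
A rough sketch of the memcpy() claim and of the concatenation case,
assuming CPython 3.4 or later. raw_ucs2 is a hypothetical stand-in for
the in-memory 2-byte buffer, rebuilt with struct because the real
buffer is not reachable from pure Python.

```python
import struct

def raw_ucs2(s):
    """Hypothetical stand-in for the 2-byte buffer: one little-endian
    16-bit unit per code point (only valid for code points <= U+FFFF)."""
    return b''.join(struct.pack('<H', ord(ch)) for ch in s)

# Without surrogates, the raw units already are the UTF-16 encoding.
text = 'h\u00e9llo \u4e16\u754c'
same_bytes = raw_ucs2(text) == text.encode('utf-16-le')

# Concatenating two lone surrogates yields raw units that look like a
# valid UTF-16 pair, so decoding gives one code point instead of two.
joined = '\ud800' + '\udc00'
decoded = raw_ucs2(joined).decode('utf-16-le')
round_trips = decoded == joined

print(same_bytes, len(joined), ascii(decoded), round_trips)
```

If this reading is right, concatenation is not automatically safe
either: two strings that each hold half a pair produce a buffer whose
plain copy no longer decodes back to the original string.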