[Python-ideas] Add "has_surrogates" flags to string object

Serhiy Storchaka storchaka at gmail.com
Tue Oct 8 13:17:59 CEST 2013


Here is an idea: add a flag to the PyUnicode object that gives a fast
answer to the question of whether a string contains surrogate code
points. This flag has one of three possible states:

* String doesn't contain surrogates.
* String contains surrogates.
* It is still unknown.

We can combine this with the "is_ascii" flag into a 2-bit value:

* String is ASCII-only (and doesn't contain surrogates).
* String is not ASCII-only and doesn't contain surrogates.
* String is not ASCII-only and contains surrogates.
* String is not ASCII-only and it is still unknown whether it contains
  surrogates.
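
The four states above fit in two bits. A rough Python sketch of that
encoding follows; the names SurrogateState and classify are made up for
illustration, the numeric values are an arbitrary choice, and nothing
here is actual CPython code:

```python
from enum import IntEnum


class SurrogateState(IntEnum):
    """Hypothetical 2-bit value combining "is_ascii" and the surrogate flag."""
    ASCII = 0           # ASCII-only, therefore no surrogates
    NO_SURROGATES = 1   # not ASCII-only, known to have no surrogates
    HAS_SURROGATES = 2  # not ASCII-only, known to contain surrogates
    UNKNOWN = 3         # not ASCII-only, not scanned yet


def contains_surrogates(s: str) -> bool:
    # Surrogate code points occupy the range U+D800..U+DFFF.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def classify(s: str) -> SurrogateState:
    """Resolve the state with a full scan (what the first request would do)."""
    if s.isascii():
        return SurrogateState.ASCII
    if contains_surrogates(s):
        return SurrogateState.HAS_SURROGATES
    return SurrogateState.NO_SURROGATES
```

Every value is in the range 0..3, so the state needs no more storage
than two flag bits.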

By default a string is created in the "unknown" state (if it is UCS2 or
UCS4). After the first request it can be switched to "has surrogates" or
"has no surrogates". The state of the result of concatenation or slicing
can be determined from the states of the input strings.
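
To make the lazy evaluation and the propagation rules concrete, here is
a minimal Python model. The FlaggedStr class is purely hypothetical, and
the propagation rules are my reading of the paragraph above: a
concatenation has surrogates if either operand is known to have them and
has none only if both operands are known to have none; a slice of a
surrogate-free string is surrogate-free, while a slice of any other
string goes back to "unknown".

```python
class FlaggedStr:
    """Toy model of a string carrying a tri-state surrogate flag.

    The flag is True, False, or None (None meaning "still unknown").
    """

    def __init__(self, value, flag=None):
        self.value = value
        self.flag = flag

    def has_surrogates(self):
        # Scan only on the first request, then cache the answer.
        if self.flag is None:
            self.flag = any(0xD800 <= ord(c) <= 0xDFFF for c in self.value)
        return self.flag

    def __add__(self, other):
        if self.flag is True or other.flag is True:
            flag = True       # one known surrogate is enough
        elif self.flag is False and other.flag is False:
            flag = False      # both parts are known to be clean
        else:
            flag = None       # cannot tell without scanning
        return FlaggedStr(self.value + other.value, flag)

    def __getitem__(self, index):
        # A piece of a clean string is clean; otherwise the surrogates
        # may or may not fall inside the slice.
        flag = False if self.flag is False else None
        return FlaggedStr(self.value[index], flag)
```

In this model concatenation and slicing never scan the data themselves;
the scan happens only when has_surrogates() is first called on an
object whose state is still unknown.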

This will allow faster UTF-16 and UTF-32 encoding (and perhaps even
slightly faster UTF-8 encoding) and faster conversion to wchar_t* if the
string has no surrogates (which is true in most cases).
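
The gain comes from skipping a per-character check. The sketch below
shows the idea for UTF-32 in pure Python; the function encode_utf32_le
and its has_surrogates parameter are hypothetical, and the real encoders
are written in C, so this only illustrates the control flow, not the
actual speedup:

```python
import struct


def encode_utf32_le(s, has_surrogates):
    """Encode s as little-endian UTF-32 without a BOM.

    has_surrogates stands in for the proposed flag: False means the
    caller already knows that s contains no surrogates.
    """
    if not has_surrogates:
        # Fast path: no validation needed, pack all code points at once.
        return struct.pack("<%dI" % len(s), *map(ord, s))

    # Slow path: every code point must be checked, because surrogates
    # cannot be encoded in strict mode.
    out = bytearray()
    for i, ch in enumerate(s):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise UnicodeEncodeError(
                "utf-32-le", s, i, i + 1, "surrogates not allowed")
        out += struct.pack("<I", cp)
    return bytes(out)
```

For a surrogate-free string the fast path produces the same bytes as
s.encode("utf-32-le"), and the slow path fails on a surrogate in the
same way the built-in strict encoder does.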


