[Python-ideas] Add "has_surrogates" flags to string object
Serhiy Storchaka
storchaka at gmail.com
Tue Oct 8 13:17:59 CEST 2013
Here is an idea: add a mark to the PyUnicode object that gives a fast
answer to the question of whether a string contains surrogate code
points. The mark has one of three possible states:
* String doesn't contain surrogates.
* String contains surrogates.
* It is still unknown.
We can combine this with the "is_ascii" flag in a 2-bit value:
* String is ASCII-only (and doesn't contain surrogates).
* String is not ASCII-only and doesn't contain surrogates.
* String is not ASCII-only and contains surrogates.
* String is not ASCII-only and it is still unknown whether it contains surrogates.
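As a rough illustration of the four combined states, here is a minimal Python sketch. The enum name SurrogateState, the numeric values, and the helper scan_state are all hypothetical; the real flag would live in the C-level PyUnicode structure, and this only models the classification.

```python
from enum import IntEnum


class SurrogateState(IntEnum):
    # Hypothetical 2-bit encoding; the numeric values are an assumption.
    ASCII = 0           # ASCII-only (and therefore no surrogates)
    NO_SURROGATES = 1   # not ASCII-only, no surrogates
    HAS_SURROGATES = 2  # not ASCII-only, contains surrogates
    UNKNOWN = 3         # not ASCII-only, not scanned yet


def is_surrogate(ch):
    """Return True if ch is a surrogate code point (U+D800..U+DFFF)."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def scan_state(text):
    """Resolve the state of text with a full scan (never returns UNKNOWN)."""
    if text.isascii():
        return SurrogateState.ASCII
    if any(is_surrogate(ch) for ch in text):
        return SurrogateState.HAS_SURROGATES
    return SurrogateState.NO_SURROGATES
```

All four values fit in two bits, so the flag would need only one bit more than a plain "is_ascii" flag.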
By default a string is created in the "unknown" state (if it is UCS2 or
UCS4). After the first request it can be switched to "has surrogates" or
"has no surrogates". The state of the result of concatenation or slicing
can be determined from the states of the input strings.
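The lazy resolution and the propagation rules could look roughly like the following Python sketch. The names State, FlaggedStr, concat_state and slice_state are hypothetical, and the rules are one plausible reading of the idea (for example, a slice of a string that has surrogates falls back to "unknown" because the slice may have missed them), not a description of any actual implementation.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical 2-bit encoding of the combined flag.
    ASCII = 0
    NO_SURROGATES = 1
    HAS_SURROGATES = 2
    UNKNOWN = 3


def concat_state(a, b):
    """State of a + b, derived from the operand states without scanning."""
    if a is State.ASCII and b is State.ASCII:
        return State.ASCII
    if State.HAS_SURROGATES in (a, b):
        return State.HAS_SURROGATES
    if State.UNKNOWN in (a, b):
        return State.UNKNOWN
    return State.NO_SURROGATES


def slice_state(parent, result_is_ascii):
    """State of a slice; result_is_ascii is assumed to be known at creation."""
    if result_is_ascii:
        return State.ASCII
    if parent is State.NO_SURROGATES:
        return State.NO_SURROGATES
    # A slice of a string with (or possibly with) surrogates may or may
    # not include them, so the answer has to be recomputed on demand.
    return State.UNKNOWN


class FlaggedStr:
    """Toy stand-in for a string object that carries the flag."""

    def __init__(self, text, state=None):
        self.text = text
        if state is None:
            # Non-ASCII strings start in the "unknown" state.
            state = State.ASCII if text.isascii() else State.UNKNOWN
        self.state = state

    def has_surrogates(self):
        # The first request scans the string and caches the answer.
        if self.state is State.UNKNOWN:
            found = any(0xD800 <= ord(c) <= 0xDFFF for c in self.text)
            self.state = State.HAS_SURROGATES if found else State.NO_SURROGATES
        return self.state is State.HAS_SURROGATES

    def __add__(self, other):
        return FlaggedStr(self.text + other.text,
                          concat_state(self.state, other.state))
```

In this sketch a resolved state is never lost by concatenation, so repeated joins of already-checked strings do not trigger new scans.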
This will allow faster UTF-16 and UTF-32 encoding (and perhaps even
slightly faster UTF-8 encoding) and faster conversion to wchar_t* when
the string has no surrogates (which is true in most cases).
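To show where the saving would come from, here is a hedged Python sketch of a UTF-32 encoder with a fast path. The function encode_utf32_le and its known_no_surrogates parameter are hypothetical; in C the fast path could be something close to a plain copy of a UCS4 buffer, which this model only approximates.

```python
import struct


def encode_utf32_le(text, known_no_surrogates):
    """Encode text as UTF-32-LE; the flag selects the fast path."""
    if known_no_surrogates:
        # Fast path: every code point is valid, so no per-character check.
        return struct.pack("<%dI" % len(text), *map(ord, text))
    # Slow path: each code point must be checked before it is written.
    out = bytearray()
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise UnicodeEncodeError("utf-32-le", text, i, i + 1,
                                     "surrogates not allowed")
        out += cp.to_bytes(4, "little")
    return bytes(out)
```

Both paths produce the same bytes for valid input; the flag only removes the validity check from the inner loop.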