Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Thu Mar 28 20:54:41 EDT 2013

On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
> strings. It's only strings in the SMPs that could need surrogate pairs,
> and they don't need them in Python's implementation since it's a full 32-
> bit implementation. So where do the surrogate pairs come into this?

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""
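
A quick way to see the "shortest-form" data representation from the
Python side, as a rough sketch only: it assumes CPython 3.3 or later,
and sys.getsizeof reports whole-object size, so only the growth between
the three cases is meaningful. The dictionary keys are labels I made up.

```python
import sys

n = 1000
samples = {
    "latin-1": "a" * n,           # every code point fits in 1 byte
    "bmp": "\u0100" * n,          # needs 2 bytes per code point
    "smp": "\U00010000" * n,      # needs 4 bytes per code point
}

# Same number of code points each time; only the storage width changes.
sizes = {name: sys.getsizeof(s) for name, s in samples.items()}
for name, size in sizes.items():
    print(name, len(samples[name]), size)
```

All three strings report the same len(), but the reported size should
grow as the widest code point in the string grows.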

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...

> I also wonder why the implementation bothers keeping a UTF-8
> representation. That sounds like premature optimization to me. Surely you
> only need it when writing to a file with UTF-8 encoding? For most
> strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not otherwise. A
lot of content will go out in the same encoding it arrived in, so it
makes sense to hang onto it where possible.
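
To make "wstr_length differs from length" concrete, here's a small
sketch. It doesn't touch the internal wstr or utf8 fields at all; it
just encodes at the Python level and counts code units, which is what a
16-bit wchar_t form would have to hold. The helper names utf16_units and
utf8_bytes are mine, not anything from the PEP.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def utf8_bytes(s):
    """Number of bytes needed to store s as UTF-8 (no terminator)."""
    return len(s.encode("utf-8"))

bmp = "caf\u00e9"        # every code point is below U+10000
smp = "a\U0001F600"      # one BMP character plus one SMP character

for s in (bmp, smp):
    # code points, UTF-16 code units, UTF-8 bytes
    print(len(s), utf16_units(s), utf8_bytes(s))
```

For the BMP-only string the code point count and the UTF-16 unit count
agree; for the SMP string the UTF-16 count is one higher, because the
single SMP character becomes a surrogate pair.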

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?
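
At the Python level, at least, an embedded \0 survives a UTF-8 round
trip, as this sketch shows. My guess (and it is only a guess from the
quote) is that the separate utf8_length field is what makes this work,
with the terminator there as a convenience for C code. The split() line
merely imitates what a strlen()-style consumer would see.

```python
s = "spam\0eggs"
encoded = s.encode("utf-8")

# The NUL is an ordinary character as far as str and bytes are concerned.
print(len(s), len(encoded))

# A consumer that stops at the first NUL would only see the first part.
truncated = encoded.split(b"\0")[0]
print(truncated)

# A consumer that uses the explicit length gets everything back.
round_tripped = encoded.decode("utf-8")
print(round_tripped == s)
```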

Minor nitpick, btw:
> (in which cast wstr_length differs form length)
Should be "in which case" and "from". Who has the power to correct
typos in PEPs?

ChrisA