Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Thu Mar 28 22:00:24 EDT 2013

On 29/03/2013 00:54, Chris Angelico wrote:
> On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
>> strings. It's only strings in the SMPs that could need surrogate pairs,
>> and they don't need them in Python's implementation since it's a full 32-
>> bit implementation. So where do the surrogate pairs come into this?
>
> PEP 393 says:
> """
> wstr_length, wstr: representation in platform's wchar_t
> (null-terminated). If wchar_t is 16-bit, this form may use surrogate
> pairs (in which cast wstr_length differs form length). wstr_length
> differs from length only if there are surrogate pairs in the
> representation.
>
> utf8_length, utf8: UTF-8 representation (null-terminated).
>
> data: shortest-form representation of the unicode string. The string
> is null-terminated (in its respective representation).
>
> All three representations are optional, although the data form is
> considered the canonical representation which can be absent only while
> the string is being created. If the representation is absent, the
> pointer is NULL, and the corresponding length field may contain
> arbitrary data.
> """
>
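
To make the "wstr_length differs from length" remark concrete, here is a
rough sketch in pure Python (not taken from the PEP, and assuming Python 3.3
or later). It does not inspect the C-level wstr field; it only shows that one
code point outside the BMP needs two 16-bit units in UTF-16, which is the
situation a 16-bit wchar_t would be in. The variable names are illustrative.

```python
# One code point outside the BMP (U+1F600).
ch = "\U0001F600"

# On Python 3.3+ the string length counts code points, so this is 1.
code_points = len(ch)

# Encode to UTF-16 (little-endian, no BOM) and count 16-bit units.
utf16 = ch.encode("utf-16-le")
utf16_units = len(utf16) // 2

# Split the four bytes into the two 16-bit units of the surrogate pair.
high = int.from_bytes(utf16[:2], "little")
low = int.from_bytes(utf16[2:], "little")

print(code_points, utf16_units, hex(high), hex(low))
```

The first unit falls in the high-surrogate range (0xD800-0xDBFF) and the
second in the low-surrogate range (0xDC00-0xDFFF).
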
> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip. Same with...
>
>> I also wonder why the implementation bothers keeping a UTF-8
>> representation. That sounds like premature optimization to me. Surely you
>> only need it when writing to a file with UTF-8 encoding? For most
>> strings, that will never happen.
>
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.
>
> Though, from the same quote: The UTF-8 representation is
> null-terminated. Does this mean that it can't be used if there might
> be a \0 in the string?
>
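You could ask the same question about any encoding. The lengths are
stored alongside the buffers (utf8_length, wstr_length), so Python itself
doesn't have to rely on the terminator.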

It's only an issue if it's passed to a C function which expects a
null-terminated string.
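
A small sketch of the difference, using ctypes to stand in for "a C function
which expects a null-terminated string". This is only an illustration under
that assumption, not how CPython uses its internal buffers; the names are
made up for the example.

```python
import ctypes

text = "spam\0eggs"  # a Python string with an embedded U+0000

# U+0000 becomes at least one zero byte in each of these encodings,
# so the question is not specific to UTF-8.
encodings = ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le")
has_zero_byte = {name: b"\0" in text.encode(name) for name in encodings}

utf8 = text.encode("utf-8")
buf = ctypes.create_string_buffer(utf8)  # copies the bytes, appends a NUL

# Length-based view: the whole payload, embedded zero byte included.
python_view = buf.raw[:len(utf8)]

# NUL-terminated view: stops at the first zero byte, as C's strlen would.
c_view = buf.value

print(has_zero_byte)
print(python_view, c_view)
```

The length-based view keeps all nine bytes; the null-terminated view sees
only the part before the first zero byte.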

> Minor nitpick, btw:
>> (in which cast wstr_length differs form length)
> Should be "in which case" and "from". Who has the power to correct
> typos in PEPs?
>