Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Thu Mar 28 22:37:55 EDT 2013

On Fri, 29 Mar 2013 11:54:41 +1100, Chris Angelico wrote:

> On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
>> strings. It's only strings in the SMPs that could need surrogate pairs,
>> and they don't need them in Python's implementation since it's a full
>> 32- bit implementation. So where do the surrogate pairs come into this?
> 
> PEP 393 says:
> """
> wstr_length, wstr: representation in platform's wchar_t
> (null-terminated). If wchar_t is 16-bit, this form may use surrogate
> pairs (in which cast wstr_length differs form length). wstr_length
> differs from length only if there are surrogate pairs in the
> representation.
> 
> utf8_length, utf8: UTF-8 representation (null-terminated).
> 
> data: shortest-form representation of the unicode string. The string is
> null-terminated (in its respective representation).
> 
> All three representations are optional, although the data form is
> considered the canonical representation which can be absent only while
> the string is being created. If the representation is absent, the
> pointer is NULL, and the corresponding length field may contain
> arbitrary data.
> """

All the words are in English (well, most of them...) but what does it 
mean?

> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip.

Under what circumstances will a string be created from a wchar_t string? 
How, and why, would such a string be created? Why would Python still 
support strings containing surrogates when it now has a nice, shiny, 
surrogate-free flexible representation?

>> I also wonder why the implementation bothers keeping a UTF-8
>> representation. That sounds like premature optimization to me. Surely
>> you only need it when writing to a file with UTF-8 encoding? For most
>> strings, that will never happen.
> 
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.

Not to me. That almost doubles the size of the string, on the off-chance 
that you'll need the UTF-8 encoding. Which for many uses, you don't, and 
even if you do, it seems like premature optimization to keep it around 
just in case. Encoding to UTF-8 will be fast for small N, and for large 
N, why carry around (potentially) multiple megabytes of duplicated data 
just in case the encoded version is needed some time?

> Though, from the same quote: The UTF-8 representation is
> null-terminated. Does this mean that it can't be used if there might be
> a \0 in the string?
> 
> Minor nitpick, btw:
>> (in which cast wstr_length differs form length)
> Should be "in which case" and "from". Who has the power to correct typos
> in PEPs?
> 
> ChrisA