Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Fri Mar 29 02:22:08 EDT 2013

On Fri, Mar 29, 2013 at 12:11 AM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
> From the PEP:
>
> """
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible (which
> generates a new string object every time). APIs that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use PyUnicode_AsUTF8 to compute a conversion.
> """
>
> So the utf8 representation is not populated when the string is
> created, but when a utf8 representation is requested, and only when
> requested by the API that returns a char*, not by the API that returns
> a bytes object.

Since the PEP specifically mentions ParseTuple string conversion, I
suspect that this is the motivation for caching it.  A string that is
passed into a C function (one that uses any of the UTF-8 char* format
specifiers) is fairly likely to be passed into that function again at
some point, so the UTF-8 representation is kept around to avoid
recomputing it on each call.
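
For anyone who wants to poke at this, here is a rough sketch that
calls both functions through ctypes.pythonapi.  It assumes CPython 3.3
or later; the variable names are mine, and comparing the returned
pointers is only an indirect way of observing the cache, not something
the PEP promises.

```python
import ctypes

# PyUnicode_AsUTF8 returns a char* into a buffer owned by the str object.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_void_p  # keep the raw address, do not copy into bytes

# PyUnicode_AsUTF8String returns a new bytes object on each call.
as_utf8_string = ctypes.pythonapi.PyUnicode_AsUTF8String
as_utf8_string.argtypes = [ctypes.py_object]
as_utf8_string.restype = ctypes.py_object

# A non-ASCII string, so the UTF-8 form differs from the internal storage.
s = "na\u00efve caf\u00e9 \U0001f40d"

# char* API: the second call should hand back the same buffer.
p1 = as_utf8(s)
p2 = as_utf8(s)
same_buffer = (p1 == p2)
cached_bytes = ctypes.string_at(p1)  # reads up to the terminating NUL

# bytes API: equal contents, but a separate object each time.
b1 = as_utf8_string(s)
b2 = as_utf8_string(s)
distinct_objects = b1 is not b2

print("same char* on both calls:", same_buffer)
print("bytes objects are distinct:", distinct_objects)
```

If the caching works the way the PEP describes, the two char* values
should match while the two bytes objects should not be the same
object.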