Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Fri Mar 29 02:11:37 EDT 2013

On Thu, Mar 28, 2013 at 8:37 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
>>> I also wonder why the implementation bothers keeping a UTF-8
>>> representation. That sounds like premature optimization to me. Surely
>>> you only need it when writing to a file with UTF-8 encoding? For most
>>> strings, that will never happen.
>>
>> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
>> of content will go out in the same encoding it came in in, so it makes
>> sense to hang onto it where possible.
>
> Not to me. That almost doubles the size of the string, on the off-chance
> that you'll need the UTF-8 encoding. Which for many uses, you don't, and
> even if you do, it seems like premature optimization to keep it around
> just in case. Encoding to UTF-8 will be fast for small N, and for large
> N, why carry around (potentially) multiple megabytes of duplicated data
> just in case the encoded version is needed some time?

From the PEP (PEP 393):

"""
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible (which
generates a new string object every time). APIs that implicitly
converts a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
"""

So the UTF-8 representation is not populated when the string is
created. It is computed only on request, and only when the request
goes through the API that returns a char* (PyUnicode_AsUTF8), not the
one that returns a bytes object (PyUnicode_AsUTF8String).
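
For anyone who wants to poke at this, here is a rough sketch and not
anything from the PEP itself. It assumes CPython 3.3 or later, that
ctypes.pythonapi exposes PyUnicode_AsUTF8, and that sys.getsizeof
counts the cached UTF-8 buffer (my understanding of CPython's
str.__sizeof__). The helper name utf8_cache_growth is made up for
illustration.

```python
import ctypes
import sys


def utf8_cache_growth(text):
    """Return how much sys.getsizeof(text) grows, in bytes, after
    (a) str.encode and (b) a direct PyUnicode_AsUTF8 call."""
    base = sys.getsizeof(text)

    # str.encode builds a new bytes object; it should not attach a
    # UTF-8 copy to the string itself.
    text.encode("utf-8")
    after_encode = sys.getsizeof(text)

    # Call the char*-returning C API directly (CPython-specific).
    as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
    as_utf8.argtypes = [ctypes.py_object]
    as_utf8.restype = ctypes.c_char_p
    as_utf8(text)
    after_char_ptr = sys.getsizeof(text)

    return after_encode - base, after_char_ptr - base


# Built at runtime so we get a fresh non-ASCII string object.
sample = "".join(["\u20ac"] * 1000)
encode_growth, char_ptr_growth = utf8_cache_growth(sample)
print(encode_growth, char_ptr_growth)
```

If those assumptions hold, the first number should be zero and the
second positive, since the cached copy stays attached to the string
until it is released. A pure-ASCII string should show no growth in
either case, because its compact data already is valid UTF-8.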