String changing size on failure?

Wed Nov 1 16:38:14 EDT 2017

On Thu, Nov 2, 2017 at 7:34 AM, Ned Batchelder <ned at nedbatchelder.com> wrote:
> On 11/1/17 4:17 PM, MRAB wrote:
>>
>> On 2017-11-01 19:26, Ned Batchelder wrote:
>>>
>>>   From David Beazley
>>> (https://twitter.com/dabeaz/status/925787482515533830):
>>>
>>>       >>> a = 'n'
>>>       >>> b = 'ñ'
>>>       >>> sys.getsizeof(a)
>>>      50
>>>       >>> sys.getsizeof(b)
>>>      74
>>>       >>> float(b)
>>>      Traceback (most recent call last):
>>>         File "<stdin>", line 1, in <module>
>>>      ValueError: could not convert string to float: 'ñ'
>>>       >>> sys.getsizeof(b)
>>>      77
>>>
>>> Huh?
>>>
>> It's all explained in PEP 393.
>>
>> It's creating an additional representation (UTF-8 + zero-byte terminator)
>> of the value and is caching that, so there'll then be the bytes for 'ñ' and
>> the bytes for the UTF-8 (0xC3 0xB1 0x00).
>>
>> When the string is ASCII, the bytes of the UTF-8 representation are
>> identical to those of the original string, so it can share them.
>
>
> That explains why b is larger than a to begin with, but it doesn't explain
> why float(b) is changing the size of b.

b doesn't initially even _have_ a UTF-8 representation. When float()
tries to parse the string, it asks for the UTF-8 form, and that form
gets saved into the string object in case it's needed later.
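
The 3-byte jump (74 to 77) can be checked with a small sketch. It assumes
CPython with the PEP 393 layout, and utf8_cache_cost is a hypothetical
helper written for illustration, not part of any library. It computes how
many bytes a cached UTF-8 copy would add, then measures the object around
a failing float() call.

```python
import sys


def utf8_cache_cost(s: str) -> int:
    """Bytes a cached UTF-8 copy would add to a str (hypothetical helper).

    Assumes the PEP 393 layout: a pure-ASCII string shares its existing
    buffer (cost 0); otherwise the cache holds the UTF-8 bytes plus one
    zero-byte terminator.
    """
    if all(ord(ch) < 128 for ch in s):
        return 0
    return len(s.encode("utf-8")) + 1


a = "n"
b = "ñ"

print(utf8_cache_cost(a))  # 0
print(utf8_cache_cost(b))  # 3  (0xC3 0xB1 0x00)

# Measure the real object before and after the failed conversion.
# sys.getsizeof on str is CPython-specific.
before = sys.getsizeof(b)
try:
    float(b)
except ValueError:
    pass
after = sys.getsizeof(b)

# The transcript above shows a growth of 3 here. Other CPython versions
# may print 0 if float() does not request the UTF-8 form from this object.
print(after - before)
```

The first two prints are just arithmetic on the encoded length, so they
hold anywhere. The last one depends on the interpreter's internals and is
only an observation, not a guarantee.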

ChrisA