New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

Sun Aug 19 04:56:36 EDT 2012

On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

> Steven D'Aprano wrote:

>> I don't know where people are getting this myth that PEP 393 uses
>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>> that 1-byte formats are only used for ASCII strings.
> 
> From
> 
> Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) [GCC
> 4.6.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import sys
>>>> [sys.getsizeof("é"*i) for i in range(10)]
> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because 
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]

py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]
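For anyone wanting to settle the 32-bit versus 64-bit question on their own interpreter, here is a minimal sketch. It relies only on the standard struct and sys modules; the name pointer_bits is mine, purely for illustration. The fixed per-string overhead also varies between CPython versions, so compare only the trend, not the exact base size.

```python
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on 64-bit builds, giving a second check.
is_64bit = sys.maxsize > 2**32

print(pointer_bits, is_64bit)
print(sys.getsizeof(""))  # fixed overhead of an empty str on this build
```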

On re-reading the PEP more closely, it looks like I did misunderstand the 
internal implementation, and strings which fit exactly in Latin-1 will 
also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and
4-byte data. So I stand corrected.
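The per-character cost can be read off without caring about the fixed header, by taking the difference between two lengths of the same repeated character. This is only a sketch: bytes_per_char is a hypothetical helper of my own, and it assumes CPython 3.3 or later, since sys.getsizeof is implementation-specific and other interpreters may report something else or nothing at all.

```python
import sys

def bytes_per_char(ch, n=8):
    """Hypothetical helper: storage added by one more copy of ch.

    Differencing two sizes cancels the fixed per-object overhead,
    so the result does not depend on 32-bit versus 64-bit builds.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),             # é
    ("BMP", "\u20ac"),               # €
    ("non-BMP", chr(0xFFFF + 1)),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On a PEP 393 build this should report 1, 1, 2 and 4 bytes respectively, which matches the step sizes in the lists above (38 to 39, 40 to 42, 44 to 48).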

-- 
Steven