New internal string format in 3.3

Dave Angel d at davea.name
Sun Aug 19 08:35:23 EDT 2012


(pardon the resend, but I accidentally omitted a couple of words)
On 08/19/2012 08:14 AM, wxjmfauth at gmail.com wrote:
> Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit :
>> <SNIP>
>>
>>
>> No, it uses Unicode, and as an optimization, attempts to store the
>> codepoints in less than four bytes for most strings. The fact that a
>> one-byte storage format happens to look like latin-1 is rather
>> coincidental.
>>
> And this is the common basic mistake. You do not push your
> argumentation far enough. A character may "fall" accidentally in a latin-1.
> The problem lies in these european characters, which can not fall in this
> coding. This *is* the cause of the negative side effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The critical
> part is the character which may cause this side effect.
> You should think "character set" and not encoded "code point", considering
> this kind of expression has a sense in 8-bits coding scheme.
>
> jmf

But that choice was made decades ago when Unicode picked its second 128
characters.  The internal form used in this PEP is simply the low-order
byte of the Unicode code point.  Scanning the string to decide whether
converting to cp1252 (for example) would work would be a much more
expensive operation than finding the largest code point and seeing how
many bytes it takes.
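
To make that concrete, here is a rough sketch of the width decision in
plain Python.  It is only an illustration, not the actual CPython code;
the name storage_width is made up for this example, and the thresholds
are the 1-, 2- and 4-byte cutoffs the PEP describes.

```python
def storage_width(s):
    """Bytes per code point needed to hold the largest code point in s.

    Hypothetical helper for illustration; the real decision is made in C
    inside the interpreter.
    """
    largest = max(map(ord, s), default=0)  # a single pass over the string
    if largest < 0x100:
        return 1   # every code point fits in one byte
    if largest < 0x10000:
        return 2   # every code point fits in two bytes
    return 4       # at least one code point lies beyond U+FFFF


print(storage_width("abc"))         # all code points below 256
print(storage_width("\u20ac"))      # EURO SIGN, U+20AC
print(storage_width("\U0001F600"))  # a code point beyond U+FFFF
```

Nothing in there needs to know about any codec; it only compares
integers.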

The 8 bit form is used if all the code points are less than 256.  That
is a simple description, and simple code.  As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.
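
If it helps, the coincidence can be checked from Python itself.  This is
a sketch under my own assumptions: low_order_bytes is a made-up helper
that mimics the one-byte form by keeping each code point as a single
byte, and it is not how the interpreter stores anything.

```python
def low_order_bytes(s):
    """Keep each code point as one byte; only valid if all are below 256.

    Hypothetical helper for illustration only.
    """
    if any(ord(ch) >= 256 for ch in s):
        raise ValueError("string needs more than one byte per code point")
    return bytes(ord(ch) for ch in s)


text = "caf\u00e9 \u00b1\u00bd"  # every code point is below 256

# The raw code-point bytes happen to equal the latin-1 encoding, because
# latin-1 maps byte N to code point N.
print(low_order_bytes(text) == text.encode("latin-1"))

# cp1252 does not have that property: EURO SIGN is U+20AC, yet cp1252
# encodes it as the single byte 0x80.
print(hex(ord("\u20ac")), "\u20ac".encode("cp1252"))
```

So matching cp1252 instead would mean a real table lookup per
character, rather than just looking at the code point's value.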

-- 

DaveA



