New internal string format in 3.3

Sun Aug 19 08:29:17 EDT 2012

On 08/19/2012 08:14 AM, wxjmfauth at gmail.com wrote:
> Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit :
>> On Sun, Aug 19, 2012 at 8:19 PM,  <wxjmfauth at gmail.com> wrote:
>>
>>> This is precicely the weak point of this flexible
>>> representation. It uses latin-1 and latin-1 is for
>>> most users simply unusable.
>>
>>
>> No, it uses Unicode, and as an optimization, attempts to store the
>>
>> codepoints in less than four bytes for most strings. The fact that a
>>
>> one-byte storage format happens to look like latin-1 is rather
>>
>> coincidental.
>>
> And this this is the common basic mistake. You do not push your
> argumentation far enough. A character may "fall" accidentally in a latin-1.
> The problem lies in these european characters, which can not fall in this
> coding. This *is* the cause of the negative side effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The critical
> part is the character which may cause this side effect.
> You should think "character set" and not encoded "code point", considering
> this kind of expression has a sense in 8-bits coding scheme.
>
> jmf

But that choice was made decades ago when Unicode picked its second 128
characters.  The internal form used in this PEP is simply the low-order
byte of the Unicode code point.  Trying to scan the string deciding if
converting to cp1252 (for example) would be a much more expensive
operation than seeing how many bytes it'd take for the largest code point.

-- 

DaveA