Flexible string representation, unicode, typography, ...

Frank Millman frank at chagford.com
Sat Aug 25 05:46:34 EDT 2012


On 25/08/2012 10:58, Mark Lawrence wrote:
> On 25/08/2012 08:27, wxjmfauth at gmail.com wrote:
>>
>> Unicode design: a flat table of code points, where all code
>> points are "equals".
>> As soon as one attempts to escape from this rule, one has to
>> "pay" for it.
>> The creator of this machinery (flexible string representation)
>> can not even benefit from it in his native language (I think
>> I'm correctly informed).
>>
>> Hint: Google -> "Das grosse Eszett"
>>
>> jmf
>>
>
> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> still baffled as to the point if any.  Could someone please enlighten me?
>

Here's what I think he is saying. I am posting this to test the water. I 
am also confused, and if I have got it wrong hopefully someone will 
correct me.

In Python 3.3, unicode strings are now stored as follows -
   if all characters can be represented by 1 byte, the entire string is 
composed of 1-byte characters
   else if all characters can be represented by 1 or 2 bytes, the entire 
string is composed of 2-byte characters
   else the entire string is composed of 4-byte characters
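
To check my understanding of the three storage widths, here is a rough 
sketch that measures them. It assumes CPython 3.3 or later, where 
sys.getsizeof reports the memory used by a string object. The helper name 
bytes_per_char and the sample characters are my own, chosen only for 
illustration.

```python
import sys

def bytes_per_char(ch):
    """Estimate storage per character for a string made only of ch."""
    # Subtracting the sizes of two strings of different lengths cancels
    # the fixed per-object header, leaving only the character storage.
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

samples = {
    "a": "ASCII letter",
    "\xe9": "e with acute accent, code point below 256",
    "\u20ac": "euro sign, code point below 65536",
    "\U0001F600": "emoji, code point above 65535",
}

widths = {ch: bytes_per_char(ch) for ch in samples}
for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {widths[ch]} byte(s) per character")
```

On CPython I would expect this to report 1, 1, 2 and 4 bytes 
respectively. Other implementations may store strings differently.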

There is an overhead in making this choice, because the characters have 
to be examined to detect the lowest number of bytes required.
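
To make the overhead concrete, here is a minimal pure-Python sketch of 
the kind of check involved. It is not CPython's actual code (that is 
written in C); the function name storage_width is hypothetical, and the 
thresholds are simply the largest values that fit in 1 and 2 bytes.

```python
def storage_width(text):
    """Return 1, 2 or 4: the bytes per character needed to hold text."""
    # Finding the highest code point means looking at every character,
    # which is where the extra work comes from.
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # every character fits in one byte
        return 1
    if highest < 0x10000:    # every character fits in two bytes
        return 2
    return 4                 # at least one character needs four bytes

print(storage_width("hello"))             # 1
print(storage_width("caf\xe9"))           # 1
print(storage_width("price: 5 \u20ac"))   # 2
print(storage_width("smile \U0001F600"))  # 4
```

Note that a single wide character anywhere in the string is enough to 
push the whole string into the wider format.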

jmfauth believes that this only benefits 'English-speaking' users, as 
the rest of the world will tend to have strings where at least one 
character requires 2 or 4 bytes. So they incur the overhead, without 
getting any benefit.

Therefore, I think he is saying that he would have preferred that Python 
standardise on 4-byte characters, on the grounds that the saving in 
memory does not justify the performance overhead.
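
For anyone who wants to see the memory side of that trade-off, here is a 
rough comparison, again assuming CPython 3.3 or later. I use the length of 
the UTF-32 encoding as a stand-in for what a fixed 4-byte representation 
would need for the characters alone. The helper fixed_width_payload and 
the sample strings are my own invention. sys.getsizeof also counts the 
object header, so the two columns are not strictly like for like.

```python
import sys

def fixed_width_payload(text):
    """Bytes needed for the characters alone at 4 bytes per character."""
    # UTF-32 without a byte-order mark uses exactly 4 bytes per code point.
    return len(text.encode("utf-32-le"))

texts = {
    "all ASCII": "a" * 1000,
    "one euro sign": "a" * 999 + "\u20ac",
    "one emoji": "a" * 999 + "\U0001F600",
}

for label, text in texts.items():
    actual = sys.getsizeof(text)          # whole object, header included
    fixed = fixed_width_payload(text)     # characters only, 4 bytes each
    print(f"{label}: actual {actual} bytes, fixed 4-byte payload {fixed} bytes")
```

This only shows the memory figures. It says nothing about the 
performance overhead, which would need separate timing.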

Frank Millman




