Flexible string representation, unicode, typography, ...

Mark Lawrence breamoreboy at yahoo.co.uk
Sat Aug 25 07:05:08 EDT 2012


On 25/08/2012 10:46, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
>> On 25/08/2012 08:27, wxjmfauth at gmail.com wrote:
>>>
>>> Unicode design: a flat table of code points, where all code
>>> points are "equals".
>>> As soon as one attempts to escape from this rule, one has to
>>> "pay" for it.
>>> The creator of this machinery (flexible string representation)
>>> can not even benefit from it in his native language (I think
>>> I'm correctly informed).
>>>
>>> Hint: Google -> "Das grosse Eszett"
>>>
>>> jmf
>>>
>>
>> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
>> still baffled as to the point, if any.  Could someone please enlighten me?
>>
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In Python 3.3, unicode strings are now stored as follows -
>    if all characters can be represented by 1 byte, the entire string is
> composed of 1-byte characters
>    else if all characters can be represented by 1 or 2 bytes, the entire
> string is composed of 2-byte characters
>    else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'English-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that Python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman
>
>

I thought Terry Reedy had shot down any claims about performance 
overhead, and that the memory savings in many cases must be substantial 
and therefore worthwhile.  Or have I misread something?  Or what?
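
For anyone who wants to check the storage rule Frank describes, here is a
minimal sketch, assuming CPython 3.3 or later. The helper name
bytes_per_char is made up for illustration; sys.getsizeof is the only
library call used. It measures how much a string grows per added
character, so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Per-character storage for a string built only from `ch`.

    Hypothetical helper: compares two lengths so the fixed header
    size cancels. Assumes CPython 3.3+ (flexible string representation).
    """
    short = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n

samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 e-acute", "\xe9"),
    ("BMP euro sign", "\u20ac"),
    ("non-BMP emoji", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch), "byte(s) per character")

# A single wide character widens the whole string.
narrow = "a" * 1001
widened = "a" * 1000 + "\u20ac"
print(sys.getsizeof(narrow), sys.getsizeof(widened))
```

Note that the 1-byte case covers code points up to U+00FF, not only
ASCII, so text in several Western European languages also lands in the
narrowest form. The absolute sizes printed by the last line vary between
Python versions and platforms, which is why only the differences are
worth comparing.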

-- 
Cheers.

Mark Lawrence.



