Flexible string representation, unicode, typography, ...

Sat Aug 25 11:47:52 EDT 2012

On Saturday, 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
> > On 25/08/2012 08:27, wxjmfauth at gmail.com wrote:
> >>
> >> Unicode design: a flat table of code points, where all code
> >> points are "equals".
> >> As soon as one attempts to escape from this rule, one has to
> >> "pay" for it.
> >> The creator of this machinery (flexible string representation)
> >> can not even benefit from it in his native language (I think
> >> I'm correctly informed).
> >>
> >> Hint: Google -> "Das grosse Eszett"
> >>
> >> jmf
> >>
> >
> > It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> > still baffled as to the point if any.  Could someone please enlighten me?
> >
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In python 3.3, unicode strings are now stored as follows -
>
>    if all characters can be represented by 1 byte, the entire string is
>    composed of 1-byte characters
>    else if all characters can be represented by 1 or 2 bytes, the entire
>    string is composed of 2-byte characters
>    else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman

Very well explained. Thanks.
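
The storage rule Frank describes can be sketched in a few lines. This is
only an illustration, assuming CPython 3.3 or later: storage_width() is a
hypothetical helper that models the rule (it is not CPython's own code), and
measured_width() is a hypothetical helper that estimates the real per-character
cost from sys.getsizeof, which is an implementation detail and may differ on
other interpreters.

```python
import sys


def storage_width(s):
    """Bytes per character under the rule quoted above (a model only)."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:      # every code point fits in 1 byte
        return 1
    if highest < 0x10000:    # every code point fits in 2 bytes
        return 2
    return 4                 # at least one code point needs more than 2 bytes


def measured_width(ch):
    """Estimate bytes per character from sys.getsizeof (CPython-specific)."""
    # Comparing two lengths of the same character cancels the fixed header.
    return (sys.getsizeof(ch * 1002) - sys.getsizeof(ch * 2)) // 1000


for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), storage_width(ch), measured_width(ch))

# A single wide character widens the whole string.
print(storage_width("abc"), storage_width("abc\u20ac"))
```

The last line shows the point under discussion: one character above U+00FF
is enough to move the entire string to the wider representation.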

More precisely, it is not a matter of 'english-speaking' users
only: everyone who uses characters outside latin-1 is affected.
(See the title of this topic, ... typography).
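
To make the typography point concrete, here is a small sketch that checks
which strings stay inside latin-1. The helper name fits_latin1() and the
sample strings are made up for illustration; the check itself relies only
on str.encode().

```python
def fits_latin1(text):
    """Return True if every character of text is in the latin-1 range."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


samples = [
    "caf\u00e9",   # e with acute accent, U+00E9: inside latin-1
    "c\u0153ur",   # oe ligature, U+0153: outside latin-1
    "10 \u20ac",   # euro sign, U+20AC: outside latin-1
    "it\u2019s",   # typographic apostrophe, U+2019: outside latin-1
    "\u1e9e",      # capital sharp s ("grosses Eszett"), U+1E9E: outside latin-1
]

for s in samples:
    print(ascii(s), fits_latin1(s))
```

So an otherwise plain English or French text leaves latin-1 as soon as it
uses a typographic apostrophe, a euro sign or an oe ligature.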

Being latin-1 compliant and unicode compliant at the same
time is a plain absurdity in the mathematical sense.

---

For those who do not know, the Go language has introduced
the rune type. As far as I know, nobody is complaining; I
have not even seen a discussion related to this subject.

100% Unicode compliant from day 0. Congratulations.

jmf