New internal string format in 3.3

wxjmfauth at gmail.com
Sun Aug 19 10:09:14 EDT 2012


On Sunday, 19 August 2012 15:46:34 UTC+2, Mark Lawrence wrote:
> On 19/08/2012 13:59, wxjmfauth at gmail.com wrote:
> > On Sunday, 19 August 2012 14:29:17 UTC+2, Dave Angel wrote:
> >> On 08/19/2012 08:14 AM, wxjmfauth at gmail.com wrote:
> >>> On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
> >>>> On Sun, Aug 19, 2012 at 8:19 PM, <wxjmfauth at gmail.com> wrote:
> >>>>> This is precisely the weak point of this flexible
> >>>>> representation. It uses latin-1 and latin-1 is for
> >>>>> most users simply unusable.
> >>>>
> >>>> No, it uses Unicode, and as an optimization, attempts to store the
> >>>> codepoints in less than four bytes for most strings. The fact that a
> >>>> one-byte storage format happens to look like latin-1 is rather
> >>>> coincidental.
> >>>
> >>> And this is the common basic mistake. You do not push your
> >>> argumentation far enough. A character may "fall" accidentally in latin-1.
> >>> The problem lies in those European characters which cannot fall in this
> >>> coding. This *is* the cause of the negative side effects.
> >>> If you are using a correct coding scheme, like cp1252, mac-roman or
> >>> iso-8859-15, you will never see such a negative side effect.
> >>> Again, the problem is not the result, the encoded character. The critical
> >>> part is the character which may cause this side effect.
> >>> You should think "character set" and not encoded "code point", considering
> >>> this kind of expression makes sense in an 8-bit coding scheme.
> >>>
> >>> jmf
> >>
> >> But that choice was made decades ago when Unicode picked its second 128
> >> characters.  The internal form used in this PEP is simply the low-order
> >> byte of the Unicode code point.  Scanning the string to decide whether it
> >> could be converted to cp1252 (for example) would be a much more expensive
> >> operation than seeing how many bytes it'd take for the largest code point.
> >
> > You are absolutely right. (I'm quite comfortable with Unicode).
> > If Python wishes to perpetuate this, let's call it, design mistake
> > or annoyance, it will continue to live with problems.
>
> Please give a precise description of the design mistake and what you
> would do to correct it.
>
> > People (tools) who chose pure utf-16 or utf-32 are not suffering
> > from this issue.
> >
> > *My* final comment on this thread.
> >
> > In August 2012, after 20 years of development, Python is not
> > able to display a piece of text correctly on a Windows console
> > (eg cp65001).
>
> Examples please.
>
> > I downloaded the Go language and, with zero experience, I did not
> > manage to display a piece of text incorrectly. (This is by the way *the*
> > reason why I tested it). Where the problems are coming from, I have
> > no idea.
> >
> > I find this situation quite comic. Python is able to
> > produce this:
> >
> > >>> (1.1).hex()
> > '0x1.199999999999ap+0'
> >
> > but it is not able to display a piece of text!
>
> So you keep saying, but when asked for examples or evidence nothing gets
> produced.
>
> > Try to convince end users IEEE 754 is more important than the
> > ability to read/write a piece of text, which a 6-year-old kid has
> > learned at school :-)
> >
> > (I'm not suffering from this kind of effect; as a Windows user,
> > I'm always working via a GUI. Still, the problem exists.)
>
> Windows is a law unto itself.  Its problems are hardly specific to Python.
>
> > Regards,
> > jmf
>
> Now two or three times you've said you're going but have come back.  If
> you come again could you please provide examples and/or evidence of what
> you're on about, because you still have me baffled.
>
> --
> Cheers.
>
> Mark Lawrence.

Yesterday, I went to bed.
More seriously.

I can not give you more numbers than those I gave.
As an end user, I noticed in my experiments that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

It is up to you, the core developers, to give an explanation
of this behaviour.
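
A comparison of this kind could be reproduced with a small timing script run unchanged under each interpreter. The sketch below is only an illustration: the three workloads in `CASES` and the helper `run_cases` are hypothetical choices, not the tests referred to in this thread, and the absolute numbers will depend on the machine and the build.

```python
import sys
import timeit

# Hypothetical workloads (assumptions for illustration only): the same
# operation on ASCII text, on text inside the latin-1 range, and on text
# containing a character outside latin-1 (the euro sign, U+20AC).
CASES = {
    "ascii only": "('a' * 1000).replace('a', 'b')",
    "latin-1 range": "('\\xe9' * 1000).replace('\\xe9', 'b')",
    "outside latin-1": "('\\u20ac' * 1000).replace('\\u20ac', 'b')",
}

def run_cases(number=1000, repeat=3):
    """Return the best time in seconds for each workload."""
    results = {}
    for label, stmt in CASES.items():
        # Take the minimum of several runs to reduce timing noise.
        results[label] = min(timeit.repeat(stmt, number=number, repeat=repeat))
    return results

# Print the interpreter version first so runs can be told apart.
print(sys.version)
for label, seconds in sorted(run_cases().items()):
    print("%-16s %.6f s" % (label, seconds))
```

Running the same file with each installed Python version and comparing the printed lines gives a like-for-like ratio per workload.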

As I understand a little about the coding of characters,
I pointed out that this is most probably due to this flexible
string representation (with arguments scattered across
the various messages, mainly about latin-1).
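
For reference, the quoted replies describe the flexible string representation (PEP 393) as choosing the storage width from the largest code point in the string. A minimal sketch of that rule, assuming CPython 3.3 or later; `storage_width` and `fits` are hypothetical helpers written for this illustration, not part of any library:

```python
import sys

def storage_width(text):
    """Bytes per code point selected by the rule described in the thread.

    Hypothetical helper: the width follows the largest code point.
    """
    largest = max(map(ord, text)) if text else 0
    if largest < 0x100:      # code points U+0000..U+00FF, same range as latin-1
        return 1
    if largest < 0x10000:    # Basic Multilingual Plane
        return 2
    return 4

def fits(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ascii() keeps the output printable on consoles with a limited code page.
for sample in ["abc", "caf\u00e9", "\u20ac", "\u0153", "\U0001F600"]:
    print(ascii(sample), storage_width(sample), sys.getsizeof(sample),
          fits(sample, "latin-1"), fits(sample, "cp1252"))
```

The euro sign (U+20AC) and the oe ligature (U+0153) are the kind of characters at issue here: both can be encoded in cp1252, neither in latin-1, so a string containing one of them moves to the two-byte form.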

I can not do more.

(I stupidly spoke about factors 0.1 to ...; you should
of course read 1.1 to ...)

jmf



More information about the Python-list mailing list