RE Module Performance

wxjmfauth at gmail.com wxjmfauth at gmail.com
Sun Jul 28 14:13:29 EDT 2013


On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth at gmail.com> wrote:
> 
> > Back to UTFs. UTFs are not only elements of a unique set of encoded
> > code points. They have an interesting feature: each "UTF chunk"
> > intrinsically holds the character (in fact, the code point) it is
> > supposed to represent. In UTF-32, the obvious case, it is just
> > the code point. In UTF-8, it is the first chunk that helps, and
> > UTF-16 is a mixed case (UTF-8 / UTF-32). In other words, in an
> > implementation using bytes, for any pointer position it is always
> > possible to find the corresponding encoded code point and from this
> > the corresponding character without any "programmed" information. See
> > my editor example: how to find the char under the caret? In fact,
> > a silly example: how can the caret be positioned or moved if
> > the underlying corresponding encoded code point cannot be
> > discerned?
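
The property described above can be sketched for UTF-8 as follows. This is only an illustration, not code from the thread: the helper name `code_point_at` is hypothetical, and it assumes the buffer holds well-formed UTF-8.

```python
def code_point_at(buf: bytes, pos: int):
    """Return (start_offset, character) for the code point covering byte `pos`.

    Assumes `buf` is well-formed UTF-8 (hypothetical helper, for illustration).
    """
    start = pos
    # Continuation bytes look like 10xxxxxx; step back until a lead byte is found.
    while start > 0 and (buf[start] & 0xC0) == 0x80:
        start -= 1
    lead = buf[start]
    # The lead byte alone tells how many bytes the sequence occupies.
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    else:
        length = 4
    return start, buf[start:start + length].decode("utf-8")


data = "a\u20acb".encode("utf-8")   # 'a', EURO SIGN (3 bytes), 'b'
print(code_point_at(data, 2))       # byte 2 is in the middle of the euro sign
```

The backward scan touches at most three bytes, so the lookup is constant time for a given byte position.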
> 
> 
> 
> Yes, given a pointer location into a UTF-8 or UTF-16 string, it is
> easy to determine the identity of the code point at that location.
> But this is not often a useful operation, save for resynchronization
> in the case that the string data is corrupted.  The caret of an editor
> does not conceptually correspond to a pointer location, but to a
> character index.  Given a particular character index (e.g. 127504), an
> editor must be able to determine the identity and/or the memory
> location of the character at that index, and for UTF-8 and UTF-16
> without an auxiliary data structure that is an O(n) operation.
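
To make the O(n) cost concrete, here is a minimal sketch of mapping a character index to a byte offset in UTF-8 without any auxiliary index. The function name `byte_offset_of_index` is an assumption made for this example, and the input is assumed to be well-formed UTF-8.

```python
def byte_offset_of_index(buf: bytes, index: int) -> int:
    """Return the byte offset of code point number `index` (0-based) in `buf`.

    Hypothetical helper: with no side table, the only option is a linear scan.
    """
    seen = 0
    for offset, byte in enumerate(buf):
        # Any byte that is not a continuation byte (10xxxxxx) starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


# 2-byte, 3-byte, and 4-byte characters followed by an ASCII letter
encoded = "\u00e9\u20ac\U0001D11Ex".encode("utf-8")
print(byte_offset_of_index(encoded, 3))   # offset of 'x'
```

Every byte before the target has to be inspected, so the work grows with the index; a fixed-width representation replaces the scan with a multiplication.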
> 
> 
> 
> > 2) Take a look at this. Get rid of the overhead.
> >
> > >>> sys.getsizeof('b'*1000000 + 'c')
> > 1000026
> > >>> sys.getsizeof('b'*1000000 + '€')
> > 2000040
> >
> > What does it mean? It means that Python has to
> > re-encode a str every time it is necessary because
> > it works with multiple codings.
> 
> 
> 
> Large strings in practical usage do not need to be resized like this
> often.  Python 3.3 has been in production use for months now, and you
> have yet to produce any real-world application code that
> demonstrates a performance regression.  If there is no real-world
> regression, then there is no problem.
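
For readers following the numbers in point 2: under PEP 393 the storage width of a str is chosen from its largest code point, which is consistent with roughly one byte per character in the first measurement and two in the second. The sketch below only models that rule; `fsr_width` is a hypothetical name, not an interpreter API.

```python
def fsr_width(s: str) -> int:
    """Bytes per code point that PEP 393 would select for `s` (1, 2, or 4).

    Illustrative model only; the interpreter makes this choice internally.
    """
    top = max(map(ord, s), default=0)
    if top < 0x100:        # fits in Latin-1
        return 1
    if top < 0x10000:      # fits in the Basic Multilingual Plane
        return 2
    return 4


print(fsr_width("b" * 10 + "c"))        # ASCII only
print(fsr_width("b" * 10 + "\u20ac"))   # one euro sign widens the whole string
```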
> 
> 
> 
> > 3) Unicode compliance. We know retrospectively that latin-1
> > was a bad choice. Unusable for 17 European languages.
> > Believe it or not, 20 years of Unicode incubation is not
> > long enough to learn it. When discussing once with a French
> > Python core dev, one with commit access, he did not know one
> > cannot use latin-1 for the French language!
> 
> 
> 
> Probably because for many French strings, one can.  As far as I am
> aware, the only characters that are missing from Latin-1 are the Euro
> sign (an unfortunate victim of history), the ligature œ (I have no
> doubt that many users just type oe anyway), and the rare capital Ÿ
> (the minuscule version is present in Latin-1).  All French strings
> that are fortunate enough to lack these characters can be
> represented in Latin-1 and so will have a 1-byte width in the FSR.
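
Whether a given French string fits in Latin-1 is easy to check, since Latin-1 covers exactly the code points U+0000 to U+00FF. A minimal sketch follows; the helper name `missing_from_latin1` and the sample strings are made up for illustration.

```python
def missing_from_latin1(text: str) -> list:
    """Return the distinct characters of `text` that Latin-1 cannot encode."""
    # Latin-1 (ISO 8859-1) maps one-to-one onto code points 0x00..0xFF.
    return sorted({ch for ch in text if ord(ch) > 0xFF})


# Sample strings invented for this example
fits = "d\u00e9j\u00e0 vu, gar\u00e7on, No\u00ebl, l'ha\u00ff"
does_not_fit = "c\u0153ur \u00e0 5 \u20ac, L'HA\u0178"

print(missing_from_latin1(fits))
print(missing_from_latin1(does_not_fit))
```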

------

Latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39
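
The figures above mix a fixed per-object cost with a per-character cost: going from 'a' to 'aa' and from 'ü' to 'aü' each adds one byte, while the 12-byte gap appears once per object. A sketch for separating the two is given below. It assumes CPython 3.3 or later, the helper name `bytes_per_char` is hypothetical, and absolute sizes differ between builds, so only differences are compared.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Marginal storage per character for repetitions of `ch` (CPython-specific).

    Subtracting two sizes cancels the fixed object header, whatever its size.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


for ch in ("a", "\u00fc", "\u20ac"):
    print(repr(ch), bytes_per_char(ch))
```

On CPython, ASCII-only strings use a smaller header than other strings, which would account for a constant gap that does not grow with length.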


jmf



