RE Module Performance

Ian Kelly ian.g.kelly at gmail.com
Sat Jul 27 23:53:22 EDT 2013


On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth at gmail.com> wrote:
> Back to utf. utfs are not only elements of a unique set of encoded
> code points. They have an interesting feature. Each "utf chunk"
> intrinsically holds the character (in fact the code point) it is
> supposed to represent. In utf-32, the obvious case, it is just
> the code point. In utf-8, that's the first chunk which helps and
> utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> implementation using bytes, for any pointer position it is always
> possible to find the corresponding encoded code point and from this
> the corresponding character without any "programmed" information. See
> my editor example: how to find the char under the caret? In fact,
> a silly example: how can the caret be positioned or moved if
> the underlying encoded code point cannot be
> discerned!

Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted.  The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index.  Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
that is an O(n) operation unless an auxiliary data structure is kept.
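
To make that concrete, here is a minimal sketch (plain Python, and
nothing like how a real editor would actually do it) of what answering
"which character sits at index i?" costs in UTF-8 when no auxiliary
index is kept -- you have to walk the bytes from the start:

    def char_at(utf8_bytes, index):
        """Return the character at `index` in UTF-8 data: an O(n) scan."""
        count = -1
        for pos, byte in enumerate(utf8_bytes):
            if byte & 0xC0 != 0x80:          # a lead byte, not a continuation
                count += 1
                if count == index:
                    end = pos + 1
                    while end < len(utf8_bytes) and utf8_bytes[end] & 0xC0 == 0x80:
                        end += 1
                    return utf8_bytes[pos:end].decode('utf-8')
        raise IndexError(index)

    char_at('caf\u00e9 \u20ac'.encode('utf-8'), 5)   # -> '€'

Resynchronizing from an arbitrary byte position, by contrast, is just a
matter of stepping back to the nearest lead byte, which is the cheap
operation you are describing.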

> 2) Take a look at this. Get rid of the overhead.
>
>>>> sys.getsizeof('b'*1000000 + 'c')
> 1000026
>>>> sys.getsizeof('b'*1000000 + '€')
> 2000040
>
> What does it mean? It means that Python has to
> reencode a str every time it is necessary because
> it works with multiple codings.

In practical usage, large strings do not often need to be rebuilt like
this.  Python 3.3 has been in production use for months now, and you
have yet to produce any real-world application code that demonstrates
a performance regression.  If there is no real-world regression, then
there is no problem.
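
The size jump you show is just the FSR (PEP 393) picking a wider
per-character storage when the new string object is built; a str is
immutable, so nothing is ever re-encoded in place.  Roughly (the exact
byte counts vary across builds):

    import sys

    ascii_only = 'b' * 1000                # all code points < 256: 1 byte/char
    bmp = ascii_only + '\u20ac'            # euro sign: new string is 2 bytes/char
    astral = ascii_only + '\U0001f40d'     # non-BMP char: new string is 4 bytes/char

    for s in (ascii_only, bmp, astral):
        print(len(s), sys.getsizeof(s))

The wider width applies only to strings that actually contain such a
character; pure-ASCII data stays at one byte per character.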

> 3) Unicode compliance. We know retrospectively that latin-1
> was a bad choice. Unusable for 17 European languages.
> Believe it or not, 20 years of Unicode incubation is not
> long enough to learn it. When discussing once with a French
> Python core dev, one with commit access, he did not know one
> cannot use latin-1 for the French language!

Probably because for many French strings, one can.  As far as I am
aware, the only characters missing from Latin-1 are the Euro sign
(an unfortunate victim of history), the ligature œ (I have no doubt
that many users just type oe anyway), and the rare capital Ÿ (the
minuscule version is present in Latin-1).  All French strings
fortunate enough to lack these characters can be represented in
Latin-1 and so will have a 1-byte width in the FSR.
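
A quick way to check both points for a given piece of French text
(made-up sample strings; the exact getsizeof totals depend on the
build):

    import sys

    fits = 'Ça va très bien'       # every character exists in Latin-1
    needs_more = 'un cœur, 10 €'   # œ and € are not in Latin-1

    fits.encode('latin-1')         # succeeds
    try:
        needs_more.encode('latin-1')
    except UnicodeEncodeError as exc:
        print(exc)                 # names the first character Latin-1 cannot hold

    # Under the FSR the first string is stored at 1 byte per character,
    # the second at 2 bytes per character.
    print(sys.getsizeof(fits), sys.getsizeof(needs_more))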


