RE Module Performance

wxjmfauth at gmail.com wxjmfauth at gmail.com
Tue Jul 30 10:01:02 EDT 2013


Le dimanche 28 juillet 2013 05:53:22 UTC+2, Ian a écrit :
> On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth at gmail.com> wrote:
> 
> > Back to utf. utfs are not only elements of a unique set of encoded
> 
> > code points. They have an interesting feature. Each "utf chunk"
> 
> > holds intrisically the character (in fact the code point) it is
> 
> > supposed to represent. In utf-32, the obvious case, it is just
> 
> > the code point. In utf-8, that's the first chunk which helps and
> 
> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> 
> > implementation using bytes, for any pointer position it is always
> 
> > possible to find the corresponding encoded code point and from this
> 
> > the corresponding character without any "programmed" information. See
> 
> > my editor example, how to find the char under the caret? In fact,
> 
> > a silly example, how can the caret can be positioned or moved, if
> 
> > the underlying corresponding encoded code point can not be
> 
> > dicerned!
> 
> 
> 
> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> 
> easy to determine the identity of the code point at that location.
> 
> But this is not often a useful operation, save for resynchronization
> 
> in the case that the string data is corrupted.  The caret of an editor
> 
> does not conceptually correspond to a pointer location, but to a
> 
> character index.  Given a particular character index (e.g. 127504), an
> 
> editor must be able to determine the identity and/or the memory
> 
> location of the character at that index, and for UTF-8 and UTF-16
> 
> without an auxiliary data structure that is a O(n) operation.
> 
> 
------

Same conceptual mistake as Steven's example with its buffers,
the buffer does not know it holds characters.
This is not the point to discuss.

-----

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy the buffer of your
editor does not waste time in reencoding the buffer as
soon as you enter an €, the 125505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies twice a reencoding.

Somebody wrote "FSR" is just an optimization. Yes, but in case
of an editor à la FSR, this optimization take place everytime you
enter a char. Your poor editor, in fact the FSR, is finally
spending its time in optimizing and finally it optimizes nothing.
(It is even worse).

If you type correctly a z instead of an €, it is not necessary
to reencode the buffer. Problem, you do you know that you do
not have to reencode? simple just check it, and by just checking
it wastes time to test it you have to optimized or not and hurt
a little bit more what is supposed to be an optimization.

Do not confuse the process of optimisation and the result of
optimization (funny, it's like the utf's).

There is a trick to make the editor to know if it has
to be "optimized". Just put some flag somewhere. Then
you fall on the "Houston" syndrome. Houston, we got a
problem, our buffer consumes much more bytes than expected.

>>> sys.getsizeof('€')
40
>>> sys.getsizeof('a')
26

Now the good news. In an editor à la FSR, the
"composition" is not so important. You know,
"practicality beats purity". The hard job
is the text rendering engine and the handling
of the font (even in a raw unicode editor).
And as these tools are luckily not woking à la FSR
(probably because they understand the coding
of the characters), your editor is still working
not so badly.

jmf




More information about the Python-list mailing list