RE Module Performance

wxjmfauth at gmail.com
Sat Jul 27 14:21:14 EDT 2013


On Saturday 27 July 2013 04:05:03 UTC+2, Michael Torrie wrote:
> On 07/26/2013 07:21 AM, wxjmfauth at gmail.com wrote:
> >>>> sys.getsizeof('––') - sys.getsizeof('–')
> >
> > I have already explained / commented this.
>
> Maybe it got lost in translation, but I don't understand your point
> with that.
>
> > Hint: To understand Unicode (and every coding scheme), you should
> > understand "utf". The how and the *why*.
>
> Hmm, so if python used utf-8 internally to represent unicode strings
> would not that punish *all* users (not just non-ascii users) since
> searching a string for a certain character position requires an O(n)
> operation?  UTF-32 I could see (and indeed that's essentially what FSR
> uses when necessary does it not?), but not utf-8 or utf-16.

------

Did you read my previous link, the Unicode Character Encoding Model?
Did you understand it?

Unicode only, no FSR. (I skip some points, and I still attempt to
be correct.)

Unicode is a four-step process.
[ {unique set of characters}  --> {unique set of code points, the
"labels"} -->  {unique set of encoded code points} ] --> implementation
(bytes)

First point to notice: "pure Unicode", [...], is different from
the "implementation". *This is a deliberate choice.*

The critical step is the path {unique set of characters} --->
{unique set of encoded code points}, done in such a way that
the implementation can "work comfortably" with this *unique* set
of encoded code points. Conceptually, the implementation works
with a unique set of "already prepared encoded code points".
This is a very critical step. To explain it in a dirty way:
by that point in the chain above, this problem is "already"
eliminated and solved, just as in byte/char coding schemes, where
this step is a no-op.
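
To make the chain concrete, a minimal interactive sketch (Python 3
at the prompt; the euro sign is just an example character):

>>> c = '€'                # {character}
>>> ord(c)                 # its {code point}, the "label"
8364
>>> hex(ord(c))
'0x20ac'
>>> c.encode('utf-8')      # one {encoded code point}: three utf-8 chunks
b'\xe2\x82\xac'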

Now, if you wish, this is a separate/different problem.
To create this unique set of encoded code points, "Unicode"
uses these "utf(s)". I repeat, a confusing name for both the
process and the result of the process. (I neglect ucs.)
What are these? Chunks of bits: groups of 8/16/32 bits, words.
It is up to the implementation to convert these sequences
of bits into bytes, ***if you wish to convert them into bytes
at all!*** Surprise! Why not put two of the 32-bit words into one
64-bit "machine word"? (See golang / rune / int32.)

Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature. Each "utf chunk"
intrinsically holds the character (in fact the code point) it is
supposed to represent. In utf-32, the obvious case, it is just
the code point. In utf-8, it is the first chunk which helps, and
utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
implementation using bytes, for any pointer position it is always
possible to find the corresponding encoded code point, and from this
the corresponding character, without any "programmed" information. See
my editor example: how to find the char under the caret? In fact,
a silly example: how can the caret be positioned or moved at all, if
the underlying encoded code point cannot be discerned?
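
A minimal sketch of this caret argument, assuming a utf-8 byte
buffer (the function name is mine, purely for illustration): every
utf-8 continuation byte matches the bit pattern 10xxxxxx, so from
any byte position one can walk back to the lead byte of the current
encoded code point without any external bookkeeping.

def codepoint_start(buf, i):
    # Continuation bytes are 0b10xxxxxx; lead bytes are not.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = 'abc€def'.encode('utf-8')        # '€' occupies bytes 3..5
start = codepoint_start(buf, 4)        # byte 4 is mid-character
print(start)                           # -> 3
print(buf[start:].decode('utf-8')[0])  # -> €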

Next step, and yet another separate problem.
Why all these utf versions? It is always the
same story. Some prefer universality (utf-32) and
some prefer, well, some kind of conservatism. utf-8 is
more complicated; it demands more work and, logically
and expectedly, some performance regression.
utf-8 is more suited to producing bytes, utf-16/32 to
internal processing. utf-8 had no choice but to give up
direct indexing. And so on.
Fact: all these coding schemes work with a unique
set of encoded code points (surprise again: it's like a byte
string!). The performance loss of utf-8 is very small
compared to the loss one can get with a multiple-coding
scheme. This kind of work has been done,
and if my information is correct, even by the creators
of utf-8. (There are sometimes good scientists.)
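
For whoever wants to check this "performance regression" claim on
their own machine, a rough sketch (absolute numbers vary with the
machine and the Python version; the string and codecs here are just
examples):

import timeit

s = 'abcdéfghij' * 100000
for codec in ('latin-1', 'utf-8', 'utf-16', 'utf-32'):
    t = timeit.timeit(lambda: s.encode(codec), number=100)
    print(codec, round(t, 3))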

There are plenty of advantages in using utf instead of
something else, and advantages in fields other than just
the pure coding.
The utf-16/32 schemes have the advantage of ditching ascii
for ever. The ascii concept no longer exists.

One should also understand that all this stuff was
not created from scratch. It was a balance between
existing technologies. MS stuck with the idea: no more
ascii, let's use ucs-2, while the *x world resisted Unicode
adoption as long as possible. utf-8 is one of the compromises made
for the adoption of Unicode. Retrospectively, not such a good
compromise.

Computer scientists are funny scientists. They do love
to solve the problems they created themselves.

-----

Quickly: sys.getsizeof() in the light of what I explained.

1) As this FSR works with multiple encodings, it has to keep
track of the encoding. It puts it in the overhead of the str
class (overhead = real overhead + encoding), in such
an absurd way that a

>>> sys.getsizeof('€')
40

needs 14 bytes more than a

>>> sys.getsizeof('z')
26

You may vary the length of the str; the problem is
still there. Not bad for a coding scheme.
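
One way to see both the overhead and the per-character width at
once (the byte counts quoted above are from a 32-bit build; 64-bit
builds report larger ones):

import sys

for s in ('z', '€', '\U0001d11e'):
    # The second number is the per-character width: 1, 2 or 4 bytes.
    print(repr(s), sys.getsizeof(s), sys.getsizeof(s * 2) - sys.getsizeof(s))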

2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*1000000 + 'c')
1000026
>>> sys.getsizeof('b'*1000000 + '€')
2000040

What does it mean? It means that Python has to
reencode a str every time it is necessary, because
it works with multiple codings.

This FSR is not even a copy of utf-8.
>>> len(('b'*1000000 + '€').encode('utf-8'))
1000003

utf-8, or any utf, never needs to and never spends its time
reencoding.
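
A rough way to time that effect (again, only a sketch; numbers vary
by build):

import timeit

base = 'b' * 1000000
print(timeit.timeit(lambda: base + 'c', number=100))  # stays 1 byte/char
print(timeit.timeit(lambda: base + '€', number=100))  # widens to 2 bytes/char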

3) Unicode compliance. We know retrospectively that latin-1
was a bad choice: unusable for 17 European languages.
Believe it or not, 20 years of Unicode incubation is not
long enough to learn that. When discussing once with a French
Python core dev, one with commit access, he did not know one
cannot use latin-1 for the French language! BTW, I proposed
to the French devs to test the FSR with the set of characters
recognized by the "Imprimerie Nationale", some kind of
legal French authority regarding characters and typography.
Never heard back about it. Of course, I did it myself.


In short:
FSR = bad performance + bad memory management + non-Unicode
compliance.

Good point: the FSR is a nice tool for those who wish to teach
Unicode. It is not every day one has such an opportunity.

---------

I'm practically no longer programming or writing applications.
But I have been active and observing, for more than a decade, all this
Unicode world: languages (Go, C#, Python, Ruby), text
processing systems (esp. Unicode TeX engines) and font technology.
Very, very interesting.


jmf



