Abuse of subject, was Re: Abuse of Big Oh notation

wxjmfauth at gmail.com wxjmfauth at gmail.com
Tue Aug 21 13:16:06 EDT 2012


On Tuesday, 21 August 2012 09:52:09 UTC+2, Peter Otten wrote:
> wxjmfauth at gmail.com wrote:
>
> > By chance and luckily, first attempt.
> >
> > c:\python32\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
> > 1000000 loops, best of 3: 1.48 usec per loop
> > c:\python33\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
> > 100000 loops, best of 3: 7.62 usec per loop
>
> OK, that is roughly a factor of 5. Let's see what I get:
>
> $ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 100000 loops, best of 3: 1.8 usec per loop
> $ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 10000 loops, best of 3: 9.11 usec per loop
>
> That is a factor of 5, too. So I can replicate your measurement on an AMD64
> Linux system with self-built 3.3 versus system 3.2.
>
> > Note
> > The used characters are not members of the latin-1 coding
> > scheme (btw an *unusable* coding).
> > They are however characters in cp1252 and mac-roman.
>
> You seem to imply that the slowdown is connected to the inability of latin-1
> to encode "œ" and "€" (to take the examples relevant to the above
> microbench). So let's repeat with latin-1 characters:
>
> $ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 100000 loops, best of 3: 1.76 usec per loop
> $ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 10000 loops, best of 3: 10.3 usec per loop
>
> Hm, the slowdown is even a tad bigger. So we can safely dismiss your theory
> that an unfortunate choice of the 8-bit encoding is causing it. Do you
> agree?

- I do not care too much about the exact numbers. The
point is to illustrate the principle.

- The reason for considering latin-1 a bad coding
is that it is simply unusable for some scripts /
languages. It has mainly to do with source/text
file encodings. This is not really the point here.

- Now, the technical aspect. This "coding" (latin-1)
can be seen as the pseudo-coding covering the Unicode
code point range 128..255. Unfortunately, this "coding"
is not optimal when you work with the full range of
Unicode, although it is fine when one works in pure
latin-1, with only 256 characters.
The range 128..255 is always the critical part
(whatever coding is considered), and it probably
holds the most frequently used non-ASCII characters
(see the probe below).
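
One can check that boundary concretely: CPython 3.3's new
string implementation (PEP 393) stores a string with 1 byte
per character as long as every code point is below 256, and
widens to 2 bytes per character as soon as a character such
as '€' (U+20AC) appears. A minimal probe, assuming a standard
CPython 3.3 build, where sys.getsizeof reflects the
per-character cost:

>>> import sys
>>> # 'ä' is U+00E4, inside 0..255: one extra character costs 1 byte
>>> sys.getsizeof('ä' * 101) - sys.getsizeof('ä' * 100)
1
>>> # '€' is U+20AC, above 255: one extra character costs 2 bytes
>>> sys.getsizeof('€' * 101) - sys.getsizeof('€' * 100)
2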

I hope that was not too confusing.

I have no proof for my theory, but from my experience in
this field I strongly suspect this is the bottleneck.
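
One way to narrow this down would be to time the string
construction separately from the replace; a sketch in the
same interactive style as the sessions below, to be run
under both interpreters:

>>> import timeit
>>> # construction only
>>> timeit.repeat("'€'*100+'€'*100")
>>> # construction plus replace, as in the original benchmark
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")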

Same OS as before.

Py 3.2.3
>>> import timeit
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.5384088242603358, 1.532421642233382, 1.5327445924545433]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[1.561762063667686, 1.5443503206462594, 1.5458670051605168]


3.3.0b2
>>> import timeit
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.701523104134512, 7.720358191179441, 7.614549852683501]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[4.887939423990709, 4.868787294350611, 4.865697999795991]

Quite mysterious!

In any case, it is a regression.
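
A pure-ASCII control would also help separate a 128..255
effect from a general replace() slowdown; something along
these lines (same form as the runs above, results still to
be collected):

$ python3.2 -m timeit '("a"*100+"a"*100).replace("a", "b")'
$ python3.3 -m timeit '("a"*100+"a"*100).replace("a", "b")'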

jmf


