From armin.rigo at gmail.com Wed Nov 2 06:04:25 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 11:04:25 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 19 October 2016 at 18:55, Victor Stinner wrote: > 3) new --duplication option to perf timeit This is never a good idea on top of PyPy, so I wouldn't mind if using this option on top of PyPy threw an error. A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 06:50:39 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 2 Nov 2016 11:50:39 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: 2016-11-02 11:04 GMT+01:00 Armin Rigo : > On 19 October 2016 at 18:55, Victor Stinner wrote: >> 3) new --duplication option to perf timeit > > This is never a good idea on top of PyPy, so I wouldn't mind if using > this option on top of PyPy threw an error. Can you please elaborate? Victor From armin.rigo at gmail.com Wed Nov 2 07:12:59 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 12:12:59 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 2 November 2016 at 11:50, Victor Stinner wrote: > 2016-11-02 11:04 GMT+01:00 Armin Rigo : >> On 19 October 2016 at 18:55, Victor Stinner wrote: >>> 3) new --duplication option to perf timeit >> >> This is never a good idea on top of PyPy, so I wouldn't mind if using >> this option on top of PyPy threw an error. > > Can you please elaborate? Yes, exactly :-) Consider a benchmark written like that: for i in range(lots): z = a + b z = a + b z = a + b z = a + b z = a + b What you are really measuring by running PyPy on this is completely different from what you *think* you are measuring---in this case, mostly everything is optimized away. If you try to make it actually do something so that it's not optimized away, then the problem of duplicating lines becomes of making the tracing JIT compiler not happy at all. If you duplicate the lines too many times, the loop body becomes too long for the JIT compiler to swallow---never duplicates 1000 times, that's always too much! But even if you duplicate only 10 times, then the more subtle problem is: assume that each line can follow *two* control flow paths (even internally, e.g. because of some condition done in RPython). (It is likely the case, if you try to do something non-trivially-optimizable-away.) Then if you duplicate the line 10 times, there are suddenly 2**10 control flow paths. That means the JIT will never be able to warm up completely. Suddenly you are measuring the JIT compiler's performance and not at all your code's. The --duplication option on PyPy is thus either useless or limited to use cases where you definitely know there is only one code path ever followed, and don't duplicate too much, and know for sure that multiple repetitions of the same line won't cause cross-line optimizations. That's not possible to explain without going very technical. A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 08:00:26 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 2 Nov 2016 13:00:26 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hum, so for an usability point of view, I think that the best to do is to ignore the option if Python has a JIT. On CPython, --duplicate makes sense (no?). 
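The idea behind --duplicate, shown as a rough standalone sketch using plain timeit rather than perf's actual implementation (STMT, LOOPS and DUP are arbitrary values chosen only for illustration):

    import timeit

    STMT = "[1, 2] * 1000"
    LOOPS = 10000
    DUP = 100   # arbitrary duplication factor, for illustration only

    # One statement per loop iteration: the for-loop bookkeeping is part
    # of every measurement.
    plain = timeit.Timer(STMT).timeit(number=LOOPS) / LOOPS

    # Repeat the statement DUP times inside the loop body, then divide by
    # LOOPS * DUP: the loop overhead is spread over DUP statements.
    dup_stmt = "\n".join([STMT] * DUP)
    duplicated = timeit.Timer(dup_stmt).timeit(number=LOOPS) / (LOOPS * DUP)

    print("per statement, plain:      %.3f us" % (plain * 1e6))
    print("per statement, duplicated: %.3f us" % (duplicated * 1e6))

On CPython the duplicated figure is usually a little lower because the loop's own bytecode is amortized over many statements; on a tracing JIT the duplication changes what gets compiled, which is exactly Armin's objection above.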
So for example, the following command should use duplicate on CPython but not on PyPy: python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy Victor From gmludo at gmail.com Tue Nov 1 19:34:13 2016 From: gmludo at gmail.com (Ludovic Gasc) Date: Wed, 2 Nov 2016 00:34:13 +0100 Subject: [Speed] [Python-Dev] Benchmarking Python and micro-optimizations In-Reply-To: References: Message-ID: Hi, Thanks first for that, it's very interesting. About to enrich benchmark suite, I might have a suggestion: We might add REST/JSON scenarios, because a lot of people use Python for that. It isn't certainly not the best REST/JSON scenarios, because they have a small payload, but better than nothing: https://www.techempower.com/benchmarks/#section=code&hw=peak&test=fortune Moreover, we already have several implementations for the most populars Web frameworks: https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Python The drawback is that a lot of tests need a database. I can help if you're interested in. Have a nice week. -------------- next part -------------- An HTML attachment was scrubbed... URL: From armin.rigo at gmail.com Wed Nov 2 10:20:44 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 15:20:44 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 2 November 2016 at 13:00, Victor Stinner wrote: > On CPython, --duplicate makes sense (no?). So for example, the > following command should use duplicate on CPython but not on PyPy: > > python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy This example means "compare CPython where the data cache gets extra pressure from reading a strangely large code object, and PyPy where the multiplication might be entirely removed for all I know". Is that really the kind of examples you want to put forward? A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 11:53:45 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 2 Nov 2016 16:53:45 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: 2016-11-02 15:20 GMT+01:00 Armin Rigo : > Is that really the kind of examples you want to put forward? I am not a big fan of timeit, but we must use it sometimes to micro-optimizations in CPython to check if an optimize really makes CPython faster or not. I am only trying to enhance timeit. Understanding results require to understand how the statements are executed. > This example means "compare CPython where the data cache gets extra pressure from reading a strangely large code object, I wrote --duplicate option to benchmark "x+y" with "x=1; y=2". I know, it's an extreme and stupid benchmark, but many people spend a lot of time on trying to optimize this in Python/ceval.c: https://bugs.python.org/issue21955 I tried multiple values of --duplicate when benchmarking x+y, and x+y seems "faster" when using a larger --duplicate value. I understand that the cost of the outer loop is higher than the cost of "reading a strangely large code object". I provide a tool and I try to document how to use it. But it's hard to prevent users to use it for stupid things. For example, recently I spent time trying to optimize bytes%args in Python 3 after reading an article, but then I realized that the Python 2 benchmark was meaningless: https://atleastfornow.net/blog/not-all-bytes/ def bytes_plus(): b"hi" + b" " + b"there" ... benchmark(bytes_plus) ... 
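The same micro-benchmark again, with a dis call added purely for illustration to show what the compiled function actually contains:

    import dis

    def bytes_plus():
        b"hi" + b" " + b"there"

    # On the CPython versions discussed in this thread, the peephole
    # optimizer folds the constant concatenation, so the function body
    # boils down to loading one precomputed constant and discarding it.
    dis.dis(bytes_plus)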
bytes_plus() is optimized by the _compiler_, so the benchmark measure the cost of LOAD_CONST :-) The issue was not the tool but the usage of the tool :-D Victor From armin.rigo at gmail.com Wed Nov 2 12:03:32 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 17:03:32 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 2 November 2016 at 16:53, Victor Stinner wrote: > 2016-11-02 15:20 GMT+01:00 Armin Rigo : >> Is that really the kind of examples you want to put forward? > > I am not a big fan of timeit, but we must use it sometimes to > micro-optimizations in CPython to check if an optimize really makes > CPython faster or not. I am only trying to enhance timeit. > Understanding results require to understand how the statements are > executed. Don't get me wrong, I understand the point of the following usage of timeit: python2 -m perf timeit '[1,2]*1000' --duplicate=1000 What I'm criticizing here is this instead: python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy because you're very unlikely to get any relevant information from such a comparison. I stand by my original remark: I would say it should be an error or at least a big fat warning to use --duplicate and PyPy in the same invocation. This is as opposed to silently ignoring --duplicate for PyPy, which is just adding more confusion imho. A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 21:30:34 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 3 Nov 2016 02:30:34 +0100 Subject: [Speed] Tune the system the run benchmarks Message-ID: Hi, Last months, I used various shell and Python scripts to tune a system to run benchmarks. There are many parameters to be set to "fully" configure a Linux: * Turbo Boost of Intel CPU * CPU scaling governor * CPU speed * Isolate CPU * Disable kernel RCU on isolated CPUs * etc. I added a new "sytem tune" command to the newly released perf 0.8.4. I implemented many operations: http://perf.readthedocs.io/en/latest/cli.html#system Right now, intel_pstate is better supported. I'm not sure about the CPU scaling governor when intel_pstate is not used, so this is one is not implemented yet. In my old Python script, I used the "userland" governor and a fixed speed for the CPUs. My old Python script also disabled interruptions (IRQ) on isolated CPUs. I will also implement that later. I don't know if setting the default CPU mask for IRQ is enough, or if it's better to set the CPU mask of all invididual IRQs. Example on the speed.python.org server: ----- haypo at speed-python$ sudo python3 -m perf system tune CPU Frequency: Minimum frequency of CPU 1 set to the maximum frequency CPU Frequency: Minimum frequency of CPU 3 set to the maximum frequency ... 
CPU Frequency: Minimum frequency of CPU 23 set to the maximum frequency Turbo Boost (MSR): Turbo Boost disabled on CPU 0: MSR 0x1a0 set to 0x4000850089 Turbo Boost (MSR): Turbo Boost disabled on CPU 1: MSR 0x1a0 set to 0x4000850089 ASLR: Full randomization Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 CPU Frequency: 0,2,4,6,8,10,12,14,16,18,20,22=min=1600 MHz, max=3333 MHz; 1,3,5,7,9,11,13,15,17,19,21,23=min=3333 MHz, max=3333 MHz Turbo Boost (MSR): CPU 0-23: disabled ----- "Reset" the config: ----- haypo at speed-python$ sudo python3 -m perf system reset CPU Frequency: Minimum frequency of CPU 1 reset to the minimum frequency CPU Frequency: Minimum frequency of CPU 3 reset to the minimum frequency ... CPU Frequency: Minimum frequency of CPU 23 reset to the minimum frequency Turbo Boost (MSR): Turbo Boost enabled on CPU 0: MSR 0x1a0 set to 0x850089 Turbo Boost (MSR): Turbo Boost enabled on CPU 1: MSR 0x1a0 set to 0x850089 ASLR: Full randomization Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 CPU Frequency: 0-23=min=1600 MHz, max=3333 MHz Turbo Boost (MSR): CPU 0-23: enabled ----- Victor From victor.stinner at gmail.com Fri Nov 4 08:12:42 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 13:12:42 +0100 Subject: [Speed] New benchmarks results on speed.python.org Message-ID: Hi, Good news, I regenerated all benchmark results of CPython using the latest versions of perf and perfomance and the results look much more reliable. Sadly, I didn't kept a screenshot of old benchmarks, so you should trust me, I cannot show you the old unstable timeline. -- I regenerated all benchmark results of speed.python.org using performance 0.3.2. I now have an (almost) fully automated script to run benchmarks (compile python, run benchmarks, etc.) using a list of Python revisions and/or branches. Only the last step, upload the JSON, is still manual, but it's nothing to automate this part ;-) https://github.com/python/performance/tree/master/scripts Python is compiled using LTO, but not PGO. The compilation with PGO fails with an internal GCC bug, speed-python uses Ubuntu 14.04, the GCC bug seems to be known (and fixed upstream...). Because of various bugs (including a bug in the Linux kernel ;-) NOHZ_FULL+intel_pstate), I didn't have time to analyze the impact of compilation options (-O2, -O3, LTO, PGO, etc.) on the stability of benchmark results. I isolated all CPUs of the NUMA node 1 (the CPU has two NUMA nodes): I added the following parameters to the the Linux kernel command line of the speed-python server: isolcpus=1,3,5,7,9,11,13,15,17,19,21,23 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23 Before running the benchmarks, I used the "python3 -m perf system tune" command (of the development version of perf) to tune the server. 
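As a quick sanity check that a benchmark process really runs on the isolated CPUs, something like the following can be used. This is a minimal sketch: the CPU list is the one used on this server, and perf's own CPU-affinity handling normally makes the explicit pinning unnecessary.

    import os

    # CPUs passed to isolcpus= on this server; adjust for another machine.
    ISOLATED = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23}

    allowed = os.sched_getaffinity(0)   # CPUs this process may run on
    print("allowed CPUs:", sorted(allowed))

    # Pin the current process (and future children, e.g. benchmark
    # workers) to the isolated CPUs only.
    os.sched_setaffinity(0, ISOLATED)
    print("pinned to:", sorted(os.sched_getaffinity(0)))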
Results of the tuning: ------------------------- $ sudo python3 -m perf system System state ============ ASLR: Full randomization Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 CPU Frequency: 0,2,4,6,8,10,12,14,16,18,20,22=min=1600 MHz, max=3333 MHz; 1,3,5,7,9,11,13,15,17,19,21,23=min=max=3333 MHz Turbo Boost (MSR): CPU 0,2,4,6,8,10,12,14,16,18,20,22: enabled, CPU 1,3,5,7,9,11,13,15,17,19,21,23: disabled IRQ affinity: irqbalance service: inactive IRQ affinity: Default IRQ affinity: CPU 0,2,4,6,8,10,12,14,16,18,20,22 IRQ affinity: IRQ affinity: 0,2=0-23, 1,3-15,17,20,22-23,67-82=0,2,4,6,8,10,12,14,16,18,20,22 ------------------------- I don't well yet the hardware of the speed-python server. The CPU is a "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz": * I only disabled Turbo Boost on the CPUs used to run benchmarks. Maybe I should disable Turbo Boost on all CPUs? On my computers using intel_pstate, Turbo Boost is disabled globally (for all CPUs) using an option of the intel_pstate driver. * I didn't tune the CPU scaling governor yet: all CPUs use "ondemand" * Maybe I should use a fixed CPU frequency on all CPUs and use the "userland" scaling governor? Results seem more stable, but it's still not perfect yet (see below). See [Timeline] (x) Display all in grid: https://speed.python.org/timeline/#/?exe=4&ben=grid&env=1&revs=50&equid=off&quarts=on&extr=on There are still some hiccups: (*) call_method: temporary peak of 29 ms for October 19, whereas all other revisions are around 17 ms: https://speed.python.org/timeline/#/?exe=4&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on (*) python_startup increased from 21 ms to 27.5 ms between Sept 9 and Sept 15... The problem is that this one is not a temporary hiccup, but seems like a real performance regression: there are 4 points at 21 ms (Sept 4-Sept 9) and 7 points at 27.5 ms (Sept 15-Nov 3). But I was unable yet to reproduce the slowndown on my laptop. https://speed.python.org/timeline/#/?exe=4&ben=python_startup&env=1&revs=50&equid=off&quarts=on&extr=on Victor From victor.stinner at gmail.com Fri Nov 4 08:28:36 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 13:28:36 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance Message-ID: Hi, You may know that I'm working on benchmarks. I regenerated all benchmark results of speed.python.org using performance 0.3.2 (benchmark suite). I started to analyze results. All results are available online on the website: https://speed.python.org/ To communicate on my work on benchmarks, I tweeted two pictures: "sympy benchmarks: Python 3.6 is between 8% and 48% faster than Python 2.7 #python #benchmark": https://twitter.com/VictorStinner/status/794289596683210760 "Python 3.6 is between 25% and 54% slower than Python 2.7 in the following benchmarks": https://twitter.com/VictorStinner/status/794305065708376069 Many people were disappointed that Python 3.6 can be up to 54% slower than Python 2.7. In fact, I know many reasons which explain that, but it's hard to summarize them in 140 characters ;-) For example, Python 3.6 is 54% slower than Python 2.7 on the benchmark pycrypto_aes. This benchmark tests a pure Python implementation of the crypto cipher AES. You may know that CPython is slow for CPU intensive functions, especially on integer and floatting point numbers. 
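To give an idea of the kind of code this benchmark exercises, here is a toy loop with the same flavour of small-integer shift/xor/mask operations. It is not the real crypto_pyaes code, just an illustration of work where per-operation interpreter overhead dominates:

    # Toy stand-in for the inner loop of a byte-oriented cipher:
    # nothing but small-integer shifts, xors and masks.
    def xtime_all(state):
        out = []
        for b in state:
            doubled = b << 1
            if doubled & 0x100:      # reduce by the AES polynomial
                doubled ^= 0x11B
            out.append(doubled & 0xFF)
        return out

    state = list(range(256))
    for _ in range(1000):
        state = xtime_all(state)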
"int" in Python 3 is now "long integers" by default, which is known to be a little bit slower than "short int" of Python 2. On a more realistic benchmark (see other benchmarks), the overhead of Python 3 "long int" is negligible. AES is a typical example stressing integers. For me, it's a dummy benchmark: it doesn't make sense to use Python for AES: modern CPUs have an *hardware* implemention which is super fast. Well, I didn't have time to analyze in depth individual benchmarks. If you want to help me, here is the source code of benchmarks: https://github.com/python/performance/blob/master/performance/benchmarks/ Raw results of Python 3.6 compared to Python 2.7: ------------------- $ python3 -m perf compare_to 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz 2016-11-03_15-38-3.6-c4319c0d0131.json.gz -G --min-speed=5 Slower (40): - python_startup: 7.74 ms +- 0.28 ms -> 26.9 ms +- 0.6 ms: 3.47x slower - python_startup_no_site: 4.43 ms +- 0.08 ms -> 10.4 ms +- 0.4 ms: 2.36x slower - unpickle_pure_python: 417 us +- 3 us -> 918 us +- 14 us: 2.20x slower - call_method: 16.3 ms +- 0.2 ms -> 28.6 ms +- 0.8 ms: 1.76x slower - call_method_slots: 16.2 ms +- 0.4 ms -> 28.3 ms +- 0.7 ms: 1.75x slower - call_method_unknown: 18.4 ms +- 0.2 ms -> 30.8 ms +- 0.8 ms: 1.67x slower - crypto_pyaes: 161 ms +- 2 ms -> 249 ms +- 2 ms: 1.54x slower - xml_etree_parse: 201 ms +- 5 ms -> 298 ms +- 8 ms: 1.49x slower - logging_simple: 26.4 us +- 0.3 us -> 38.4 us +- 0.7 us: 1.46x slower - logging_format: 31.3 us +- 0.4 us -> 45.5 us +- 0.8 us: 1.45x slower - pickle_pure_python: 986 us +- 9 us -> 1.41 ms +- 0.03 ms: 1.43x slower - spectral_norm: 208 ms +- 2 ms -> 287 ms +- 2 ms: 1.38x slower - logging_silent: 660 ns +- 7 ns -> 865 ns +- 31 ns: 1.31x slower - chaos: 240 ms +- 2 ms -> 314 ms +- 4 ms: 1.31x slower - go: 490 ms +- 2 ms -> 640 ms +- 26 ms: 1.31x slower - xml_etree_iterparse: 178 ms +- 2 ms -> 230 ms +- 5 ms: 1.29x slower - sqlite_synth: 8.29 us +- 0.16 us -> 10.6 us +- 0.2 us: 1.28x slower - xml_etree_process: 210 ms +- 6 ms -> 268 ms +- 14 ms: 1.28x slower - django_template: 387 ms +- 4 ms -> 484 ms +- 5 ms: 1.25x slower - fannkuch: 830 ms +- 32 ms -> 1.04 sec +- 0.03 sec: 1.25x slower - hexiom: 20.2 ms +- 0.1 ms -> 24.7 ms +- 0.2 ms: 1.22x slower - chameleon: 26.1 ms +- 0.2 ms -> 31.9 ms +- 0.4 ms: 1.22x slower - regex_compile: 395 ms +- 2 ms -> 482 ms +- 6 ms: 1.22x slower - json_dumps: 25.8 ms +- 0.2 ms -> 31.0 ms +- 0.5 ms: 1.20x slower - nqueens: 229 ms +- 2 ms -> 274 ms +- 2 ms: 1.20x slower - genshi_text: 81.9 ms +- 0.6 ms -> 97.8 ms +- 1.1 ms: 1.19x slower - raytrace: 1.17 sec +- 0.03 sec -> 1.39 sec +- 0.03 sec: 1.19x slower - scimark_monte_carlo: 240 ms +- 7 ms -> 282 ms +- 10 ms: 1.17x slower - scimark_sor: 441 ms +- 8 ms -> 517 ms +- 12 ms: 1.17x slower - deltablue: 17.4 ms +- 0.1 ms -> 20.1 ms +- 0.6 ms: 1.16x slower - sqlalchemy_declarative: 310 ms +- 3 ms -> 354 ms +- 6 ms: 1.14x slower - call_simple: 12.2 ms +- 0.2 ms -> 13.9 ms +- 0.2 ms: 1.14x slower - scimark_fft: 613 ms +- 19 ms -> 694 ms +- 23 ms: 1.13x slower - meteor_contest: 191 ms +- 1 ms -> 215 ms +- 2 ms: 1.13x slower - pathlib: 46.9 ms +- 0.4 ms -> 52.6 ms +- 0.9 ms: 1.12x slower - richards: 181 ms +- 1 ms -> 201 ms +- 6 ms: 1.11x slower - genshi_xml: 191 ms +- 2 ms -> 209 ms +- 2 ms: 1.10x slower - float: 290 ms +- 5 ms -> 310 ms +- 7 ms: 1.07x slower - scimark_sparse_mat_mult: 8.19 ms +- 0.22 ms -> 8.74 ms +- 0.15 ms: 1.07x slower - xml_etree_generate: 302 ms +- 3 ms -> 320 ms +- 8 ms: 1.06x slower Faster (15): - telco: 707 ms +- 22 
ms -> 22.1 ms +- 0.4 ms: 32.04x faster - unpickle_list: 15.0 us +- 0.3 us -> 7.86 us +- 0.16 us: 1.90x faster - pickle_list: 14.7 us +- 0.2 us -> 9.12 us +- 0.38 us: 1.61x faster - json_loads: 98.7 us +- 2.3 us -> 62.3 us +- 0.7 us: 1.58x faster - pickle: 40.4 us +- 0.6 us -> 27.1 us +- 0.5 us: 1.49x faster - sympy_sum: 361 ms +- 10 ms -> 244 ms +- 7 ms: 1.48x faster - sympy_expand: 1.68 sec +- 0.02 sec -> 1.15 sec +- 0.03 sec: 1.47x faster - regex_v8: 62.0 ms +- 0.5 ms -> 47.2 ms +- 0.6 ms: 1.31x faster - sympy_str: 699 ms +- 22 ms -> 537 ms +- 15 ms: 1.30x faster - regex_effbot: 6.67 ms +- 0.04 ms -> 5.23 ms +- 0.05 ms: 1.28x faster - mako: 61.5 ms +- 0.7 ms -> 49.7 ms +- 2.5 ms: 1.24x faster - html5lib: 298 ms +- 7 ms -> 261 ms +- 6 ms: 1.14x faster - sympy_integrate: 55.9 ms +- 0.3 ms -> 51.8 ms +- 1.0 ms: 1.08x faster - pickle_dict: 69.4 us +- 0.9 us -> 65.2 us +- 3.2 us: 1.06x faster - scimark_lu: 551 ms +- 26 ms -> 523 ms +- 18 ms: 1.05x faster Benchmark hidden because not significant (8): 2to3, dulwich_log, nbody, pidigits, regex_dna, tornado_http, unpack_sequence, unpickle Ignored benchmarks (3) of 2016-11-03_15-36-2.7-91f024fc9b3a.json: hg_startup, pyflate, spambayes ------------------- Please ignore call_method, call_method_slots, call_method_unknown benchmarks: it seems like I had an issue on the benchmark server. I was unable to reproduce he 70% slowdown on my laptop. I attached the two compressed JSON files to this email if you want to analyze them yourself. I hope that my work on benchmarks will motive some developers to look closer at Python 3 performance to find interesting optimizations ;-) Victor -------------- next part -------------- A non-text attachment was scrubbed... Name: 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz Type: application/x-gzip Size: 107594 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 2016-11-03_15-38-3.6-c4319c0d0131.json.gz Type: application/x-gzip Size: 102546 bytes Desc: not available URL: From tobami at gmail.com Fri Nov 4 15:18:48 2016 From: tobami at gmail.com (Miquel Torres) Date: Fri, 04 Nov 2016 19:18:48 +0000 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: Nice! For the record, I'll be giving a talk in PyCon Ireland about Codespeed. Would you mind me citing those tweets and screenshots, to highlight usage on speed.python.org? You mentioned new more reliable vs old results. How close are we to have an stable setup that gives us benchmarks numbers regularly? Cheers, Miquel El El vie, 4 nov 2016 a las 12:30, Victor Stinner escribi?: > Hi, > > You may know that I'm working on benchmarks. I regenerated all > benchmark results of speed.python.org using performance 0.3.2 > (benchmark suite). I started to analyze results. > > All results are available online on the website: > > https://speed.python.org/ > > > To communicate on my work on benchmarks, I tweeted two pictures: > > "sympy benchmarks: Python 3.6 is between 8% and 48% faster than Python > 2.7 #python #benchmark": > https://twitter.com/VictorStinner/status/794289596683210760 > > "Python 3.6 is between 25% and 54% slower than Python 2.7 in the > following benchmarks": > https://twitter.com/VictorStinner/status/794305065708376069 > > > Many people were disappointed that Python 3.6 can be up to 54% slower > than Python 2.7. 
In fact, I know many reasons which explain that, but > it's hard to summarize them in 140 characters ;-) > > For example, Python 3.6 is 54% slower than Python 2.7 on the benchmark > pycrypto_aes. This benchmark tests a pure Python implementation of the > crypto cipher AES. You may know that CPython is slow for CPU intensive > functions, especially on integer and floatting point numbers. > > "int" in Python 3 is now "long integers" by default, which is known to > be a little bit slower than "short int" of Python 2. On a more > realistic benchmark (see other benchmarks), the overhead of Python 3 > "long int" is negligible. > > AES is a typical example stressing integers. For me, it's a dummy > benchmark: it doesn't make sense to use Python for AES: modern CPUs > have an *hardware* implemention which is super fast. > > > Well, I didn't have time to analyze in depth individual benchmarks. If > you want to help me, here is the source code of benchmarks: > https://github.com/python/performance/blob/master/performance/benchmarks/ > > > Raw results of Python 3.6 compared to Python 2.7: > ------------------- > $ python3 -m perf compare_to 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz > 2016-11-03_15-38-3.6-c4319c0d0131.json.gz -G --min-speed=5 > Slower (40): > - python_startup: 7.74 ms +- 0.28 ms -> 26.9 ms +- 0.6 ms: 3.47x slower > - python_startup_no_site: 4.43 ms +- 0.08 ms -> 10.4 ms +- 0.4 ms: 2.36x > slower > - unpickle_pure_python: 417 us +- 3 us -> 918 us +- 14 us: 2.20x slower > - call_method: 16.3 ms +- 0.2 ms -> 28.6 ms +- 0.8 ms: 1.76x slower > - call_method_slots: 16.2 ms +- 0.4 ms -> 28.3 ms +- 0.7 ms: 1.75x slower > - call_method_unknown: 18.4 ms +- 0.2 ms -> 30.8 ms +- 0.8 ms: 1.67x slower > - crypto_pyaes: 161 ms +- 2 ms -> 249 ms +- 2 ms: 1.54x slower > - xml_etree_parse: 201 ms +- 5 ms -> 298 ms +- 8 ms: 1.49x slower > - logging_simple: 26.4 us +- 0.3 us -> 38.4 us +- 0.7 us: 1.46x slower > - logging_format: 31.3 us +- 0.4 us -> 45.5 us +- 0.8 us: 1.45x slower > - pickle_pure_python: 986 us +- 9 us -> 1.41 ms +- 0.03 ms: 1.43x slower > - spectral_norm: 208 ms +- 2 ms -> 287 ms +- 2 ms: 1.38x slower > - logging_silent: 660 ns +- 7 ns -> 865 ns +- 31 ns: 1.31x slower > - chaos: 240 ms +- 2 ms -> 314 ms +- 4 ms: 1.31x slower > - go: 490 ms +- 2 ms -> 640 ms +- 26 ms: 1.31x slower > - xml_etree_iterparse: 178 ms +- 2 ms -> 230 ms +- 5 ms: 1.29x slower > - sqlite_synth: 8.29 us +- 0.16 us -> 10.6 us +- 0.2 us: 1.28x slower > - xml_etree_process: 210 ms +- 6 ms -> 268 ms +- 14 ms: 1.28x slower > - django_template: 387 ms +- 4 ms -> 484 ms +- 5 ms: 1.25x slower > - fannkuch: 830 ms +- 32 ms -> 1.04 sec +- 0.03 sec: 1.25x slower > - hexiom: 20.2 ms +- 0.1 ms -> 24.7 ms +- 0.2 ms: 1.22x slower > - chameleon: 26.1 ms +- 0.2 ms -> 31.9 ms +- 0.4 ms: 1.22x slower > - regex_compile: 395 ms +- 2 ms -> 482 ms +- 6 ms: 1.22x slower > - json_dumps: 25.8 ms +- 0.2 ms -> 31.0 ms +- 0.5 ms: 1.20x slower > - nqueens: 229 ms +- 2 ms -> 274 ms +- 2 ms: 1.20x slower > - genshi_text: 81.9 ms +- 0.6 ms -> 97.8 ms +- 1.1 ms: 1.19x slower > - raytrace: 1.17 sec +- 0.03 sec -> 1.39 sec +- 0.03 sec: 1.19x slower > - scimark_monte_carlo: 240 ms +- 7 ms -> 282 ms +- 10 ms: 1.17x slower > - scimark_sor: 441 ms +- 8 ms -> 517 ms +- 12 ms: 1.17x slower > - deltablue: 17.4 ms +- 0.1 ms -> 20.1 ms +- 0.6 ms: 1.16x slower > - sqlalchemy_declarative: 310 ms +- 3 ms -> 354 ms +- 6 ms: 1.14x slower > - call_simple: 12.2 ms +- 0.2 ms -> 13.9 ms +- 0.2 ms: 1.14x slower > - scimark_fft: 613 ms +- 19 ms -> 694 ms +- 23 ms: 
1.13x slower > - meteor_contest: 191 ms +- 1 ms -> 215 ms +- 2 ms: 1.13x slower > - pathlib: 46.9 ms +- 0.4 ms -> 52.6 ms +- 0.9 ms: 1.12x slower > - richards: 181 ms +- 1 ms -> 201 ms +- 6 ms: 1.11x slower > - genshi_xml: 191 ms +- 2 ms -> 209 ms +- 2 ms: 1.10x slower > - float: 290 ms +- 5 ms -> 310 ms +- 7 ms: 1.07x slower > - scimark_sparse_mat_mult: 8.19 ms +- 0.22 ms -> 8.74 ms +- 0.15 ms: > 1.07x slower > - xml_etree_generate: 302 ms +- 3 ms -> 320 ms +- 8 ms: 1.06x slower > > Faster (15): > - telco: 707 ms +- 22 ms -> 22.1 ms +- 0.4 ms: 32.04x faster > - unpickle_list: 15.0 us +- 0.3 us -> 7.86 us +- 0.16 us: 1.90x faster > - pickle_list: 14.7 us +- 0.2 us -> 9.12 us +- 0.38 us: 1.61x faster > - json_loads: 98.7 us +- 2.3 us -> 62.3 us +- 0.7 us: 1.58x faster > - pickle: 40.4 us +- 0.6 us -> 27.1 us +- 0.5 us: 1.49x faster > - sympy_sum: 361 ms +- 10 ms -> 244 ms +- 7 ms: 1.48x faster > - sympy_expand: 1.68 sec +- 0.02 sec -> 1.15 sec +- 0.03 sec: 1.47x faster > - regex_v8: 62.0 ms +- 0.5 ms -> 47.2 ms +- 0.6 ms: 1.31x faster > - sympy_str: 699 ms +- 22 ms -> 537 ms +- 15 ms: 1.30x faster > - regex_effbot: 6.67 ms +- 0.04 ms -> 5.23 ms +- 0.05 ms: 1.28x faster > - mako: 61.5 ms +- 0.7 ms -> 49.7 ms +- 2.5 ms: 1.24x faster > - html5lib: 298 ms +- 7 ms -> 261 ms +- 6 ms: 1.14x faster > - sympy_integrate: 55.9 ms +- 0.3 ms -> 51.8 ms +- 1.0 ms: 1.08x faster > - pickle_dict: 69.4 us +- 0.9 us -> 65.2 us +- 3.2 us: 1.06x faster > - scimark_lu: 551 ms +- 26 ms -> 523 ms +- 18 ms: 1.05x faster > > Benchmark hidden because not significant (8): 2to3, dulwich_log, > nbody, pidigits, regex_dna, tornado_http, unpack_sequence, unpickle > Ignored benchmarks (3) of 2016-11-03_15-36-2.7-91f024fc9b3a.json: > hg_startup, pyflate, spambayes > ------------------- > > Please ignore call_method, call_method_slots, call_method_unknown > benchmarks: it seems like I had an issue on the benchmark server. I > was unable to reproduce he 70% slowdown on my laptop. > > I attached the two compressed JSON files to this email if you want to > analyze them yourself. > > I hope that my work on benchmarks will motive some developers to look > closer at Python 3 performance to find interesting optimizations ;-) > > Victor > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yselivanov.ml at gmail.com Fri Nov 4 15:21:31 2016 From: yselivanov.ml at gmail.com (Yury Selivanov) Date: Fri, 4 Nov 2016 15:21:31 -0400 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: I'm curious why call_* benchmarks became slower on 3.x? Yury On 2016-11-04 8:28 AM, Victor Stinner wrote: > Hi, > > You may know that I'm working on benchmarks. I regenerated all > benchmark results of speed.python.org using performance 0.3.2 > (benchmark suite). I started to analyze results. 
> > All results are available online on the website: > > https://speed.python.org/ > > > To communicate on my work on benchmarks, I tweeted two pictures: > > "sympy benchmarks: Python 3.6 is between 8% and 48% faster than Python > 2.7 #python #benchmark": > https://twitter.com/VictorStinner/status/794289596683210760 > > "Python 3.6 is between 25% and 54% slower than Python 2.7 in the > following benchmarks": > https://twitter.com/VictorStinner/status/794305065708376069 > > > Many people were disappointed that Python 3.6 can be up to 54% slower > than Python 2.7. In fact, I know many reasons which explain that, but > it's hard to summarize them in 140 characters ;-) > > For example, Python 3.6 is 54% slower than Python 2.7 on the benchmark > pycrypto_aes. This benchmark tests a pure Python implementation of the > crypto cipher AES. You may know that CPython is slow for CPU intensive > functions, especially on integer and floatting point numbers. > > "int" in Python 3 is now "long integers" by default, which is known to > be a little bit slower than "short int" of Python 2. On a more > realistic benchmark (see other benchmarks), the overhead of Python 3 > "long int" is negligible. > > AES is a typical example stressing integers. For me, it's a dummy > benchmark: it doesn't make sense to use Python for AES: modern CPUs > have an *hardware* implemention which is super fast. > > > Well, I didn't have time to analyze in depth individual benchmarks. If > you want to help me, here is the source code of benchmarks: > https://github.com/python/performance/blob/master/performance/benchmarks/ > > > Raw results of Python 3.6 compared to Python 2.7: > ------------------- > $ python3 -m perf compare_to 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz > 2016-11-03_15-38-3.6-c4319c0d0131.json.gz -G --min-speed=5 > Slower (40): > - python_startup: 7.74 ms +- 0.28 ms -> 26.9 ms +- 0.6 ms: 3.47x slower > - python_startup_no_site: 4.43 ms +- 0.08 ms -> 10.4 ms +- 0.4 ms: 2.36x slower > - unpickle_pure_python: 417 us +- 3 us -> 918 us +- 14 us: 2.20x slower > - call_method: 16.3 ms +- 0.2 ms -> 28.6 ms +- 0.8 ms: 1.76x slower > - call_method_slots: 16.2 ms +- 0.4 ms -> 28.3 ms +- 0.7 ms: 1.75x slower > - call_method_unknown: 18.4 ms +- 0.2 ms -> 30.8 ms +- 0.8 ms: 1.67x slower > - crypto_pyaes: 161 ms +- 2 ms -> 249 ms +- 2 ms: 1.54x slower > - xml_etree_parse: 201 ms +- 5 ms -> 298 ms +- 8 ms: 1.49x slower > - logging_simple: 26.4 us +- 0.3 us -> 38.4 us +- 0.7 us: 1.46x slower > - logging_format: 31.3 us +- 0.4 us -> 45.5 us +- 0.8 us: 1.45x slower > - pickle_pure_python: 986 us +- 9 us -> 1.41 ms +- 0.03 ms: 1.43x slower > - spectral_norm: 208 ms +- 2 ms -> 287 ms +- 2 ms: 1.38x slower > - logging_silent: 660 ns +- 7 ns -> 865 ns +- 31 ns: 1.31x slower > - chaos: 240 ms +- 2 ms -> 314 ms +- 4 ms: 1.31x slower > - go: 490 ms +- 2 ms -> 640 ms +- 26 ms: 1.31x slower > - xml_etree_iterparse: 178 ms +- 2 ms -> 230 ms +- 5 ms: 1.29x slower > - sqlite_synth: 8.29 us +- 0.16 us -> 10.6 us +- 0.2 us: 1.28x slower > - xml_etree_process: 210 ms +- 6 ms -> 268 ms +- 14 ms: 1.28x slower > - django_template: 387 ms +- 4 ms -> 484 ms +- 5 ms: 1.25x slower > - fannkuch: 830 ms +- 32 ms -> 1.04 sec +- 0.03 sec: 1.25x slower > - hexiom: 20.2 ms +- 0.1 ms -> 24.7 ms +- 0.2 ms: 1.22x slower > - chameleon: 26.1 ms +- 0.2 ms -> 31.9 ms +- 0.4 ms: 1.22x slower > - regex_compile: 395 ms +- 2 ms -> 482 ms +- 6 ms: 1.22x slower > - json_dumps: 25.8 ms +- 0.2 ms -> 31.0 ms +- 0.5 ms: 1.20x slower > - nqueens: 229 ms +- 2 ms -> 274 ms +- 2 
ms: 1.20x slower > - genshi_text: 81.9 ms +- 0.6 ms -> 97.8 ms +- 1.1 ms: 1.19x slower > - raytrace: 1.17 sec +- 0.03 sec -> 1.39 sec +- 0.03 sec: 1.19x slower > - scimark_monte_carlo: 240 ms +- 7 ms -> 282 ms +- 10 ms: 1.17x slower > - scimark_sor: 441 ms +- 8 ms -> 517 ms +- 12 ms: 1.17x slower > - deltablue: 17.4 ms +- 0.1 ms -> 20.1 ms +- 0.6 ms: 1.16x slower > - sqlalchemy_declarative: 310 ms +- 3 ms -> 354 ms +- 6 ms: 1.14x slower > - call_simple: 12.2 ms +- 0.2 ms -> 13.9 ms +- 0.2 ms: 1.14x slower > - scimark_fft: 613 ms +- 19 ms -> 694 ms +- 23 ms: 1.13x slower > - meteor_contest: 191 ms +- 1 ms -> 215 ms +- 2 ms: 1.13x slower > - pathlib: 46.9 ms +- 0.4 ms -> 52.6 ms +- 0.9 ms: 1.12x slower > - richards: 181 ms +- 1 ms -> 201 ms +- 6 ms: 1.11x slower > - genshi_xml: 191 ms +- 2 ms -> 209 ms +- 2 ms: 1.10x slower > - float: 290 ms +- 5 ms -> 310 ms +- 7 ms: 1.07x slower > - scimark_sparse_mat_mult: 8.19 ms +- 0.22 ms -> 8.74 ms +- 0.15 ms: > 1.07x slower > - xml_etree_generate: 302 ms +- 3 ms -> 320 ms +- 8 ms: 1.06x slower > > Faster (15): > - telco: 707 ms +- 22 ms -> 22.1 ms +- 0.4 ms: 32.04x faster > - unpickle_list: 15.0 us +- 0.3 us -> 7.86 us +- 0.16 us: 1.90x faster > - pickle_list: 14.7 us +- 0.2 us -> 9.12 us +- 0.38 us: 1.61x faster > - json_loads: 98.7 us +- 2.3 us -> 62.3 us +- 0.7 us: 1.58x faster > - pickle: 40.4 us +- 0.6 us -> 27.1 us +- 0.5 us: 1.49x faster > - sympy_sum: 361 ms +- 10 ms -> 244 ms +- 7 ms: 1.48x faster > - sympy_expand: 1.68 sec +- 0.02 sec -> 1.15 sec +- 0.03 sec: 1.47x faster > - regex_v8: 62.0 ms +- 0.5 ms -> 47.2 ms +- 0.6 ms: 1.31x faster > - sympy_str: 699 ms +- 22 ms -> 537 ms +- 15 ms: 1.30x faster > - regex_effbot: 6.67 ms +- 0.04 ms -> 5.23 ms +- 0.05 ms: 1.28x faster > - mako: 61.5 ms +- 0.7 ms -> 49.7 ms +- 2.5 ms: 1.24x faster > - html5lib: 298 ms +- 7 ms -> 261 ms +- 6 ms: 1.14x faster > - sympy_integrate: 55.9 ms +- 0.3 ms -> 51.8 ms +- 1.0 ms: 1.08x faster > - pickle_dict: 69.4 us +- 0.9 us -> 65.2 us +- 3.2 us: 1.06x faster > - scimark_lu: 551 ms +- 26 ms -> 523 ms +- 18 ms: 1.05x faster > > Benchmark hidden because not significant (8): 2to3, dulwich_log, > nbody, pidigits, regex_dna, tornado_http, unpack_sequence, unpickle > Ignored benchmarks (3) of 2016-11-03_15-36-2.7-91f024fc9b3a.json: > hg_startup, pyflate, spambayes > ------------------- > > Please ignore call_method, call_method_slots, call_method_unknown > benchmarks: it seems like I had an issue on the benchmark server. I > was unable to reproduce he 70% slowdown on my laptop. > > I attached the two compressed JSON files to this email if you want to > analyze them yourself. > > I hope that my work on benchmarks will motive some developers to look > closer at Python 3 performance to find interesting optimizations ;-) > > Victor > > > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From victor.stinner at gmail.com Fri Nov 4 16:56:21 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 21:56:21 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: 2016-11-04 20:18 GMT+01:00 Miquel Torres : > Nice! For the record, I'll be giving a talk in PyCon Ireland about > Codespeed. Would you mind me citing those tweets and screenshots, to > highlight usage on speed.python.org? Sure. Keep me in touch in you publish your slides later. > You mentioned new more reliable vs old results. 
How close are we to have an > stable setup that gives us benchmarks numbers regularly? My plan for the short term is to analyze last (latest?) benchmarks hiccups and try to fix them. The fully automated script to run benchmarks is already written: https://github.com/python/performance/tree/master/scripts Then, the plan we decided with Zachary Ware is to run a script in a loop which compiles the default branch of CPython. Later, we may also do the same for 2.7 and 3.6 branches. And then add PyPy (and PyPy 3). Victor From victor.stinner at gmail.com Fri Nov 4 16:58:19 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 21:58:19 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: 2016-11-04 20:21 GMT+01:00 Yury Selivanov : > I'm curious why call_* benchmarks became slower on 3.x? It's almost the same between 2.7 and default. For 3.6, it looks like an issue on the benchmark runner, not on Python itself: >> Please ignore call_method, call_method_slots, call_method_unknown >> benchmarks: it seems like I had an issue on the benchmark server. I >> was unable to reproduce he 70% slowdown on my laptop. Victor From victor.stinner at gmail.com Fri Nov 4 18:35:26 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 23:35:26 +0100 Subject: [Speed] Performance difference in call_method() Message-ID: Hi, I noticed a temporary performance peak in the call_method: https://speed.python.org/timeline/#/?exe=4&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on The difference is major: 17 ms => 29 ms, 70% slower! I expected a temporary issue on the server used to run benchmarks, but... I reproduced the result on the server. Recently, the performance of call_method() changed in CPython default from 17 ms to 28 ms (well, the exact value is variable: 25 ms, 28 ms, 29 ms, ...) and then back to 17 ms: (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms https://hg.python.org/cpython/rev/83877018ef97 (2) 3e073e7b4460: 28 ms => 204a43c452cc (Oct 22): 17 ms https://hg.python.org/cpython/rev/204a43c452cc None of these revisions modify code used in the call_method() benchmark, so I guess that it's yet another compiler joke. On my laptop and my desktop PC, I'm unable to reproduce the issue: the performance is the same (I tested ce85a1f129e3, 83877018ef97, 204a43c452cc). These PC uses Fedora 24, GCC 6.2.1. CPUs: * laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz * desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz The speed-python runs Ubuntu 14.04, GCC 4.8.4-2ubuntu1~14.04. CPU: "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz". call_method() benchmark is a microbenchmark which seems to depend a lot of very low level stuff like CPU L1 cache. Maybe the impact on the compiler is more important on speed-python which has an older CPU, than my more recent hardware. Maybe GCC 6.2 produces more efficient machine code than GCC 4.8. I expect that PGO would "fix" the call_method() performance issue, but PGO compilation fails on Ubuntu 14.04 with a compiler error :-p A solution would be to upgrade the OS of this server. Victor From victor.stinner at gmail.com Fri Nov 4 19:20:48 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 00:20:48 +0100 Subject: [Speed] Performance difference in call_method() In-Reply-To: References: Message-ID: I found some interesting differences using the Linux perf tool. 
# perf stat -e L1-icache-loads,L1-icache-load-misses ./python performance/benchmarks/bm_call_method.py --inherit=PYTHONPATH -v --worker -l1 -n 25 -w0 2016-11-04 23:35 GMT+01:00 Victor Stinner : > (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms > > https://hg.python.org/cpython/rev/83877018ef97 Comparison of metrics of rev ce85a1f129e3 (fast) => rev 83877018ef97 (slow): L1-icache-load-misses: 0.06% => 8.41% of all L1-icache hits Instructions per cycle: 2.38 => 1.41 stalled-cycles-frontend: 12.99% => 42.85% frontend cycles idle stalled-cycles-backend: 2.28% => 21.36% backend cycles idle So it confirms what I expected: call_method() is highly impacted by the CPU L1 instruction cache. I don't know exactly why the revision 83877018ef97 has an impact on the CPU L1 cache. Victor From victor.stinner at gmail.com Fri Nov 4 20:31:22 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 01:31:22 +0100 Subject: [Speed] Performance difference in call_method() In-Reply-To: References: Message-ID: I proposed a patch which fixes the issue: http://bugs.python.org/issue28618 "Decorate hot functions using __attribute__((hot)) to optimize Python" Victor From victor.stinner at gmail.com Fri Nov 4 22:23:15 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 03:23:15 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: 2016-11-04 21:58 GMT+01:00 Victor Stinner : > 2016-11-04 20:21 GMT+01:00 Yury Selivanov : >> I'm curious why call_* benchmarks became slower on 3.x? > > It's almost the same between 2.7 and default. For 3.6, it looks like > an issue on the benchmark runner, not on Python itself: (...) Aha, it seems to be a compiler performance issue. I proposed a patch to fix the issue: http://bugs.python.org/issue28618 Victor From ncoghlan at gmail.com Sat Nov 5 10:56:27 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 6 Nov 2016 00:56:27 +1000 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: On 3 November 2016 at 02:03, Armin Rigo wrote: > Hi Victor, > > On 2 November 2016 at 16:53, Victor Stinner wrote: >> 2016-11-02 15:20 GMT+01:00 Armin Rigo : >>> Is that really the kind of examples you want to put forward? >> >> I am not a big fan of timeit, but we must use it sometimes to >> micro-optimizations in CPython to check if an optimize really makes >> CPython faster or not. I am only trying to enhance timeit. >> Understanding results require to understand how the statements are >> executed. > > Don't get me wrong, I understand the point of the following usage of timeit: > > python2 -m perf timeit '[1,2]*1000' --duplicate=1000 > > What I'm criticizing here is this instead: > > python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy > > because you're very unlikely to get any relevant information from such > a comparison. I stand by my original remark: I would say it should be > an error or at least a big fat warning to use --duplicate and PyPy in > the same invocation. This is as opposed to silently ignoring > --duplicate for PyPy, which is just adding more confusion imho. Since the use case for --duplicate is to reduce the relative overhead of the outer loop when testing a micro-optimisation within a *given* interpreter, perhaps the error should be for combining --duplicate and --compare-to at all? 
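Roughly, the check could look like this (a sketch using argparse, not perf's real option-handling code):

    import argparse

    # Usage sketch: sketch.py '[1,2]*1000' --duplicate=1000 --compare-to=pypy
    parser = argparse.ArgumentParser(prog="perf timeit")
    parser.add_argument("stmt")
    parser.add_argument("--duplicate", type=int, default=1)
    parser.add_argument("--compare-to", dest="compare_to", metavar="PYTHON")
    args = parser.parse_args()

    # Reject the combination outright instead of silently ignoring
    # --duplicate for one of the two interpreters.
    if args.compare_to and args.duplicate > 1:
        parser.error("--duplicate cannot be combined with --compare-to")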
And then it would just be up to developers of a *particular* implementation to know if "--duplicate" is relevant to them. Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From victor.stinner at gmail.com Sat Nov 5 11:34:47 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 16:34:47 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: 2016-11-05 15:56 GMT+01:00 Nick Coghlan : > Since the use case for --duplicate is to reduce the relative overhead > of the outer loop when testing a micro-optimisation within a *given* > interpreter, perhaps the error should be for combining --duplicate and > --compare-to at all? And then it would just be up to developers of a > *particular* implementation to know if "--duplicate" is relevant to > them. Hum, I think that using "timeit --compare-to=python --duplicate=1000" makes sense when you compare two versions of CPython. If I understood correctly Armin, the usage of --duplicate on a Python using a JIT must fail with an error. It's in my (long) TODO list ;-) Victor From ncoghlan at gmail.com Sat Nov 5 12:35:47 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 6 Nov 2016 02:35:47 +1000 Subject: [Speed] New benchmarks results on speed.python.org In-Reply-To: References: Message-ID: On 4 November 2016 at 22:12, Victor Stinner wrote: > I don't well yet the hardware of the speed-python server. The CPU is a > "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz": This is still the system HP contributed a few years back, so the full system specs can be found at https://speed.python.org/about/ Once you get the benchmark suite up and running reliably there, it could be interesting to get it running under Beaker [1] and then let it loose as an automated job in Red Hat's hardware compatibility testing environment :) Cheers, Nick. [1] https://beaker-project.org/ -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From kmod at dropbox.com Mon Nov 7 22:59:12 2016 From: kmod at dropbox.com (Kevin Modzelewski) Date: Mon, 7 Nov 2016 19:59:12 -0800 Subject: [Speed] Performance difference in call_method() In-Reply-To: References: Message-ID: Code layout matters a lot and you can get lucky or unlucky with it. I wasn't able to make it to this talk but the slides look quite interesting: https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes- of-performance-instability-due-to-code-placement-in-x86 I'm not sure how much us mere mortals can debug this sort of thing, but I know the intel folks have at one point expressed interest in making sure that Python runs quickly on their processors so they might be willing to give advice (the deck even says "if all else fails, ask Intel"). On Fri, Nov 4, 2016 at 3:35 PM, Victor Stinner wrote: > Hi, > > I noticed a temporary performance peak in the call_method: > > https://speed.python.org/timeline/#/?exe=4&ben=call_ > method&env=1&revs=50&equid=off&quarts=on&extr=on > > The difference is major: 17 ms => 29 ms, 70% slower! > > I expected a temporary issue on the server used to run benchmarks, > but... I reproduced the result on the server. > > Recently, the performance of call_method() changed in CPython default > from 17 ms to 28 ms (well, the exact value is variable: 25 ms, 28 ms, > 29 ms, ...) 
and then back to 17 ms: > > (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms > > https://hg.python.org/cpython/rev/83877018ef97 > > (2) 3e073e7b4460: 28 ms => 204a43c452cc (Oct 22): 17 ms > > https://hg.python.org/cpython/rev/204a43c452cc > > None of these revisions modify code used in the call_method() > benchmark, so I guess that it's yet another compiler joke. > > > On my laptop and my desktop PC, I'm unable to reproduce the issue: the > performance is the same (I tested ce85a1f129e3, 83877018ef97, > 204a43c452cc). These PC uses Fedora 24, GCC 6.2.1. CPUs: > > * laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz > * desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz > > > The speed-python runs Ubuntu 14.04, GCC 4.8.4-2ubuntu1~14.04. CPU: > "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz". > > > call_method() benchmark is a microbenchmark which seems to depend a > lot of very low level stuff like CPU L1 cache. Maybe the impact on the > compiler is more important on speed-python which has an older CPU, > than my more recent hardware. Maybe GCC 6.2 produces more efficient > machine code than GCC 4.8. > > > I expect that PGO would "fix" the call_method() performance issue, but > PGO compilation fails on Ubuntu 14.04 with a compiler error :-p A > solution would be to upgrade the OS of this server. > > Victor > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed > -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul at paulgraydon.co.uk Thu Nov 10 12:01:20 2016 From: paul at paulgraydon.co.uk (Paul Graydon) Date: Thu, 10 Nov 2016 17:01:20 +0000 Subject: [Speed] Ubuntu 16.04 speed issues Message-ID: <20161110170120.GA13009@paulgraydon.co.uk> I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm completely failing to find it in my emails. The OpenStack-Ansible project has noticed that performance on Ubuntu 16.04 is quite significantly slower than on 14.04. At the moment it's looking like *possibly* a GCC related bug. https://bugs.launchpad.net/ubuntu/+source/python2.7/+bug/1638695 From victor.stinner at gmail.com Thu Nov 10 16:31:47 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 10 Nov 2016 22:31:47 +0100 Subject: [Speed] Ubuntu 16.04 speed issues In-Reply-To: <20161110170120.GA13009@paulgraydon.co.uk> References: <20161110170120.GA13009@paulgraydon.co.uk> Message-ID: Hello, > The OpenStack-Ansible project has noticed that performance on Ubuntu 16.04 is quite significantly slower than on 14.04. > At the moment it's looking like *possibly* a GCC related bug. Is it exactly the same Python version? What is the full version? Try to get compiler flags: python2 -c 'import sysconfig; print(sysconfig.get_config_var("CFLAGS"))' 2016-11-10 18:01 GMT+01:00 Paul Graydon : > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm > completely failing to find it in my emails. You might run https://github.com/python/performance on Ubuntu 14.04 and 16.04 on the same hardware, or at least similar hardware, to compare performance. 
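Something along these lines should work, assuming the performance module is installed on both machines (exact option names may vary between performance releases):

    # On the 14.04 machine (repeat on 16.04 with a different output name):
    $ python3 -m performance run -o ubuntu-14.04.json

    # Copy both JSON files to one machine, then:
    $ python3 -m perf compare_to ubuntu-14.04.json ubuntu-16.04.json --min-speed=5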
Victor From ncoghlan at gmail.com Mon Nov 14 09:20:18 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 15 Nov 2016 00:20:18 +1000 Subject: [Speed] Ubuntu 16.04 speed issues In-Reply-To: <20161110170120.GA13009@paulgraydon.co.uk> References: <20161110170120.GA13009@paulgraydon.co.uk> Message-ID: On 11 November 2016 at 03:01, Paul Graydon wrote: > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm > completely failing to find it in my emails. You may be thinking of the PGO-related issue that Victor found on *14*.04: https://mail.python.org/pipermail/speed/2016-November/000471.html Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From paul at paulgraydon.co.uk Mon Nov 14 15:19:24 2016 From: paul at paulgraydon.co.uk (Paul Graydon) Date: Mon, 14 Nov 2016 20:19:24 +0000 Subject: [Speed] Ubuntu 16.04 speed issues In-Reply-To: References: <20161110170120.GA13009@paulgraydon.co.uk> Message-ID: <20161114201924.GA16889@paulgraydon.co.uk> On Tue, Nov 15, 2016 at 12:20:18AM +1000, Nick Coghlan wrote: > On 11 November 2016 at 03:01, Paul Graydon wrote: > > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm > > completely failing to find it in my emails. > > You may be thinking of the PGO-related issue that Victor found on > *14*.04: https://mail.python.org/pipermail/speed/2016-November/000471.html > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia I think you might be right there. Too many bugs going bouncing around at work, and on other projects, I guess I'm losing track :D Paul From victor.stinner at gmail.com Fri Nov 18 20:32:26 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 19 Nov 2016 02:32:26 +0100 Subject: [Speed] Analysis of a Python performance issue Message-ID: Hi, I'm happy because I just finished an article putting the most important things that I learnt this year on the most silly issue with Python performance: code placement. https://haypo.github.io/analysis-python-performance-issue.html I explain how to debug such issue and my attempt to fix it in CPython. I hate code placement issues :-) I hate performance slowdowns caused by random unrelated changes... Victor From sguelton at quarkslab.com Sat Nov 19 15:29:35 2016 From: sguelton at quarkslab.com (serge guelton) Date: Sat, 19 Nov 2016 21:29:35 +0100 Subject: [Speed] Analysis of a Python performance issue In-Reply-To: References: Message-ID: <20161119202935.bkczl4nvyyl3zwgh@lakota> On Sat, Nov 19, 2016 at 02:32:26AM +0100, Victor Stinner wrote: > Hi, > > I'm happy because I just finished an article putting the most > important things that I learnt this year on the most silly issue with > Python performance: code placement. > > https://haypo.github.io/analysis-python-performance-issue.html > > I explain how to debug such issue and my attempt to fix it in CPython. > > I hate code placement issues :-) I hate performance slowdowns caused > by random unrelated changes... > > Victor Thanks *a lot* victor for this great article. You not only very accurately describe the method you used to track the performance bug, but also give very convincing results. I still wonder what the conclusion should be: - (this) Micro benchmarks are not relevant at all, they are sensible to minor factors that are not relevant to bigger applications - There is a generally good code layout that favors most applications? 
Maybe some core function from the interpreter ? Why does PGO fails to ``find'' them? Serge From victor.stinner at gmail.com Sat Nov 19 18:54:41 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sun, 20 Nov 2016 00:54:41 +0100 Subject: [Speed] Analysis of a Python performance issue In-Reply-To: <20161119202935.bkczl4nvyyl3zwgh@lakota> References: <20161119202935.bkczl4nvyyl3zwgh@lakota> Message-ID: Le 19 nov. 2016 21:29, "serge guelton" a ?crit : > Thanks *a lot* victor for this great article. You not only very > accurately describe the method you used to track the performance bug, > but also give very convincing results. You're welcome. I'm not 100% sure that adding the hot attrbute makes the performance of call_method reliable at 100%. My hope is that the 70% slowdown doesn't reoccur. > I still wonder what the conclusion should be: > > - (this) Micro benchmarks are not relevant at all, they are sensible to minor > factors that are not relevant to bigger applications Other benchmarks had peaks: logging_silent and json_loads. I'm unable to say if microbenchmarks must be used or not to cehck for performance regression or test the performance of a patch. So I try instead to analyze and fix performance issues. At least I can say that temporary peaks are higher and more frequent on microbenchmark. Homework: define what is a microbenchmark :-) > - There is a generally good code layout that favors most applications? This is an hard question. I don't know the answer. The hot attributes put tagged functions in a separated ELF section, but I understand that inside the section, order is not deterministic. Maybe the size of a function code matters too. What happens if a function grows? Does it impact other functions? > Maybe some core function from the interpreter ? I chose to only tag the most famous functions of the core right now. I'm testing tagging functions of extensions like json but I'm not sure that the result is significant. > Why does PGO fails to > ``find'' them? I don't use PGO on speed-python. I'm not sure that is PGO is reliable neither (reproductible performance). Victor -------------- next part -------------- An HTML attachment was scrubbed... URL: From kmod at dropbox.com Sat Nov 19 20:58:19 2016 From: kmod at dropbox.com (Kevin Modzelewski) Date: Sat, 19 Nov 2016 17:58:19 -0800 Subject: [Speed] Analysis of a Python performance issue In-Reply-To: <20161119202935.bkczl4nvyyl3zwgh@lakota> References: <20161119202935.bkczl4nvyyl3zwgh@lakota> Message-ID: I think it's safe to not reinvent the wheel here. Some searching gives: http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort Pyston takes a different approach where we pull the list of hot functions from the PGO build, ie defer all the hard work to the C compiler. On Sat, Nov 19, 2016 at 12:29 PM, serge guelton wrote: > On Sat, Nov 19, 2016 at 02:32:26AM +0100, Victor Stinner wrote: > > Hi, > > > > I'm happy because I just finished an article putting the most > > important things that I learnt this year on the most silly issue with > > Python performance: code placement. > > > > https://haypo.github.io/analysis-python-performance-issue.html > > > > I explain how to debug such issue and my attempt to fix it in CPython. > > > > I hate code placement issues :-) I hate performance slowdowns caused > > by random unrelated changes... 
> > by random unrelated changes...
> >
> > Victor
>
> Thanks *a lot* Victor for this great article. You not only very
> accurately describe the method you used to track the performance bug,
> but also give very convincing results.
>
> I still wonder what the conclusion should be:
>
> - (this) Microbenchmarks are not relevant at all; they are sensitive to
>   minor factors that are not relevant to bigger applications
>
> - There is a generally good code layout that favors most applications?
>   Maybe some core function from the interpreter? Why does PGO fail to
>   ``find'' them?
>
> Serge
>
> _______________________________________________
> Speed mailing list
> Speed at python.org
> https://mail.python.org/mailman/listinfo/speed
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kmod at dropbox.com  Mon Nov 21 18:26:19 2016
From: kmod at dropbox.com (Kevin Modzelewski)
Date: Mon, 21 Nov 2016 15:26:19 -0800
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: <20161121193908.thax2o3faxa5pxfx@lakota>
References: <20161119202935.bkczl4nvyyl3zwgh@lakota> <20161121193908.thax2o3faxa5pxfx@lakota>
Message-ID: 

Oh sorry, I was unclear: yes, this is for the Pyston binary itself, and
yes, PGO does a better job and I definitely think it should be used.

Separately, we often use non-PGO builds for quick checks, so we also have
the system I described that makes our non-PGO build more reliable by using
the function ordering from the PGO build.

On Mon, Nov 21, 2016 at 11:39 AM, serge guelton <
serge.guelton at telecom-bretagne.eu> wrote:

> On Sat, Nov 19, 2016 at 05:58:19PM -0800, Kevin Modzelewski wrote:
> > I think it's safe to not reinvent the wheel here. Some searching gives:
> > http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf
> > http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf
> > https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort
>
> Thanks Kevin for the pointers! I'm new to this area of optimization...
> another source of fun and weirdness :-$
>
> > Pyston takes a different approach where we pull the list of hot functions
> > from the PGO build, i.e. defer all the hard work to the C compiler.
>
> You're talking about the build of Pyston itself, not the JIT-generated
> code, right? In that case, how is it different from a regular
> -fprofile-generate followed by several runs and then -fprofile-use?
>
> PGO builds should perform better than marking some functions as hot, as
> they also include info for better branch prediction, right?
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kmod at dropbox.com  Sat Nov 26 17:16:54 2016
From: kmod at dropbox.com (Kevin Modzelewski)
Date: Sat, 26 Nov 2016 14:16:54 -0800
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: <20161124110735.q2t5axix2llyfsd5@lakota>
References: <20161119202935.bkczl4nvyyl3zwgh@lakota> <20161121193908.thax2o3faxa5pxfx@lakota> <20161124110735.q2t5axix2llyfsd5@lakota>
Message-ID: 

On Thu, Nov 24, 2016 at 3:07 AM, serge guelton wrote:

> On Mon, Nov 21, 2016 at 03:26:19PM -0800, Kevin Modzelewski wrote:
> > Oh sorry, I was unclear: yes, this is for the Pyston binary itself, and
> > yes, PGO does a better job and I definitely think it should be used.
>
> That raises a second question: do you collect branch / hotness info
> while running lower-tier jitted code, so as to improve the performance
> of higher tiers?
>
We don't (yet) do code placement optimizations.
We should be getting some basic amount of this, though, by our generated
code being grouped by "tier that compiled it", which is highly correlated
with hotness.

> > Separately, we often use non-PGO builds for quick checks, so we also have
> > the system I described that makes our non-PGO build more reliable by using
> > the function ordering from the PGO build.
>
> OK. Are you just "putting hot stuff in the hot section", or did you try
> to specify an ordering to further improve locality? (I don't know if it's
> possible; it's mentioned in one of the papers.)
>
We pull the function order from the PGO build and ask the non-PGO build to
use the same order, so it's up to whatever the C compiler did. Though to
keep things tractable we only do this for functions that have some
non-negligible hotness. I think this does help with the overall performance
of the non-PGO build, but our main goal was performance consistency.

> Thanks,
>
> Serge
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From serge.guelton at telecom-bretagne.eu  Mon Nov 21 14:39:08 2016
From: serge.guelton at telecom-bretagne.eu (serge guelton)
Date: Mon, 21 Nov 2016 20:39:08 +0100
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: 
References: <20161119202935.bkczl4nvyyl3zwgh@lakota>
Message-ID: <20161121193908.thax2o3faxa5pxfx@lakota>

On Sat, Nov 19, 2016 at 05:58:19PM -0800, Kevin Modzelewski wrote:
> I think it's safe to not reinvent the wheel here. Some searching gives:
> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf
> http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf
> https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort

Thanks Kevin for the pointers! I'm new to this area of optimization...
another source of fun and weirdness :-$

> Pyston takes a different approach where we pull the list of hot functions
> from the PGO build, i.e. defer all the hard work to the C compiler.

You're talking about the build of Pyston itself, not the JIT-generated
code, right? In that case, how is it different from a regular
-fprofile-generate followed by several runs and then -fprofile-use?

PGO builds should perform better than marking some functions as hot, as
they also include info for better branch prediction, right?

From sguelton at quarkslab.com  Thu Nov 24 06:07:35 2016
From: sguelton at quarkslab.com (serge guelton)
Date: Thu, 24 Nov 2016 12:07:35 +0100
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: 
References: <20161119202935.bkczl4nvyyl3zwgh@lakota> <20161121193908.thax2o3faxa5pxfx@lakota>
Message-ID: <20161124110735.q2t5axix2llyfsd5@lakota>

On Mon, Nov 21, 2016 at 03:26:19PM -0800, Kevin Modzelewski wrote:
> Oh sorry, I was unclear: yes, this is for the Pyston binary itself, and
> yes, PGO does a better job and I definitely think it should be used.

That raises a second question: do you collect branch / hotness info
while running lower-tier jitted code, so as to improve the performance
of higher tiers?

> Separately, we often use non-PGO builds for quick checks, so we also have
> the system I described that makes our non-PGO build more reliable by using
> the function ordering from the PGO build.

OK. Are you just "putting hot stuff in the hot section", or did you try
to specify an ordering to further improve locality? (I don't know if it's
possible; it's mentioned in one of the papers.)

Thanks,

Serge
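
For readers who want to see what the hot attribute discussed in this thread
looks like in practice, here is a minimal C sketch. It assumes GCC or Clang;
the function is a made-up example, and the exact section name (typically
.text.hot) depends on the compiler and its flags.

    /* Minimal sketch of the GCC/Clang "hot" attribute: the compiler
       optimizes the function more aggressively and emits it into a
       dedicated hot-text subsection (typically .text.hot), so that
       frequently called functions end up close together in memory. */
    #include <stdio.h>

    __attribute__((hot))
    static long sum_upto(long n)   /* hypothetical example function */
    {
        long total = 0;
        for (long i = 0; i < n; i++)
            total += i;
        return total;
    }

    int main(void)
    {
        printf("%ld\n", sum_upto(1000000));
        return 0;
    }

Whether the function actually landed in a hot section can be checked by
inspecting the resulting binary with a tool such as readelf -S or nm.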
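
For context, the -fprofile-generate / -fprofile-use cycle serge refers to
is, in outline, the sequence below. The file names and the training workload
are placeholders for illustration, not the actual Pyston or CPython build
steps.

    # Instrumented build: the binary writes profile data when it runs.
    gcc -O2 -fprofile-generate app.c -o app
    # Run a representative workload to collect the profile.
    ./app < training-input.txt
    # Rebuild using the collected profile (branch probabilities,
    # hot/cold function information, etc.).
    gcc -O2 -fprofile-use app.c -o app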
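
Finally, a rough sketch of the ordering trick Kevin describes (reusing the
function layout of a PGO binary in a non-PGO build). This is only an
illustration of the idea under assumed tooling: the nm/awk pipeline and the
lld --symbol-ordering-file option are a plausible way to do it, not a
description of what Pyston actually uses.

    # Record the layout of the PGO binary: text symbols in address order.
    nm --numeric-sort ./app-pgo | awk '$2 ~ /[tT]/ { print $3 }' > order.txt

    # Rebuild the non-PGO binary with one section per function and ask the
    # linker to lay the functions out in the recorded order (lld syntax).
    clang -O2 -ffunction-sections app.c -o app-plain \
          -fuse-ld=lld -Wl,--symbol-ordering-file=order.txt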