From victor.stinner at gmail.com Wed Mar 1 12:05:25 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 1 Mar 2017 18:05:25 +0100 Subject: [Speed] perf 0.9.4 released Message-ID: Hi, I released the version 0.9.4 of my Python perf module: * Add --compare-to option to the Runner CLI * compare_to command: Add --table option to render a table http://perf.readthedocs.io/en/latest/ I used the --table feature to write this FASTCALL microbenchmarks article: https://haypo.github.io/fastcall-microbenchmarks.html Example: +---------------------+---------+------------------------------+ | Benchmark | 3.5 | 3.7 | +=====================+=========+==============================+ | struct.pack("i", 1) | 105 ns | 77.6 ns: 1.36x faster (-26%) | +---------------------+---------+------------------------------+ | getattr(1, "real") | 79.4 ns | 64.4 ns: 1.23x faster (-19%) | +---------------------+---------+------------------------------+ Use --quiet for smaller table. The --compare-to command is the generalization to any perf script of the existing perf timeit --compare-to option, to quickly compare two Python binaries. Example with timeit (because I'm too lazy to write a perf script!): --- $ ./python -m perf timeit 'int(0)' --duplicate=100 --compare-to ../master-ref/python -p3 /home/haypo/prog/bench_python/master-ref/python: .... 112 ns +- 1 ns /home/haypo/prog/bench_python/master/python: .... 108 ns +- 1 ns Median +- std dev: [/home/haypo/prog/bench_python/master-ref/python] 112 ns +- 1 ns -> [/home/haypo/prog/bench_python/master/python] 108 ns +- 1 ns: 1.04x faster (-3%) ---- Hum, I should write an option to allow to specify the name of python binaries, to replace [/home/haypo/prog/bench_python/master-ref/python] just with [ref]. Victor From victor.stinner at gmail.com Wed Mar 1 12:32:06 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 1 Mar 2017 18:32:06 +0100 Subject: [Speed] perf 0.9.4 released In-Reply-To: References: Message-ID: The perf API, command line interface (API) and JSON file format are now complete enough for *my* needs. I plan to use the version 1.0 for the next release and stabilize the API. I still have a long list of enhancement ideas (see the TODO.rst file in the Git repository), but none has a major impact on the API. If you see a major flaw in the API or CLI, please speak up! I know that the PyPy support is very limited, but again, fixing PyPy support shouldn't impact the CLI or API, and so can be done later. 2017-03-01 18:05 GMT+01:00 Victor Stinner : > Hum, I should write an option to allow to specify the name of python > binaries, to replace [/home/haypo/prog/bench_python/master-ref/python] > just with [ref]. Ok, I just added a --python-names option ;-) I annoyed me to have to modify manually perf output when posting to bugs.python.org to replace long [...] with short [ref] and [patch]. Victor From victor.stinner at gmail.com Mon Mar 6 18:37:03 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 7 Mar 2017 00:37:03 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? Message-ID: Hi, Serhiy Storchaka opened a bug report in my perf module: perf displays Median +- std dev, whereas median absolute deviation (MAD) should be displayed instead: https://github.com/haypo/perf/issues/20 I just modified perf to display Median +- MAD, but I'm not sure that it's better than Mean +- std dev. The question is important when a benchmark is unstable (has a lot of outliers). 
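For reference, both summaries are easy to compute with the statistics module of the standard library; the MAD is just the median of the absolute deviations from the median. Rough sketch (the timings are made up, only to show that a few outliers pull the mean and the standard deviation a lot while barely moving the median and the MAD):

---
import statistics

def mad(values):
    # median absolute deviation: median of |value - median|
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)

# made-up timings in ns: mostly around 276 ns, plus two outliers
values = [270, 274, 276, 277, 279, 281, 640, 820]

print("Median +- MAD:   %.0f ns +- %.0f ns"
      % (statistics.median(values), mad(values)))
print("Mean +- std dev: %.0f ns +- %.0f ns"
      % (statistics.mean(values), statistics.stdev(values)))
---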
There is good example below with "Median +- MAD: 276 ns +- 10 ns" and "Mean +- std dev: 371 ns +- 196 ns". The goal of perf is to get reproductible benchmark results. So the question is what should be displayed (median or mean?) to get the most reproductible output? Median +- MAD "hides" outliers. In my experience, outliers are not "reproductible", but caused by "noise" of the system and other applications. I feel that Median +- MAD is what I want, but I would feel more confortable if someone can confirm with his/her experience :-) ----------------- haypo at selma$ PYTHONPATH=~/prog/GIT/perf ./python -m perf show --hist --stats bench.json.gz 234 ns: 3 # 264 ns: 114 ################################################## 293 ns: 9 #### 322 ns: 2 # 351 ns: 0 | 381 ns: 0 | 410 ns: 0 | 439 ns: 1 | 469 ns: 0 | 498 ns: 1 | 527 ns: 1 | 557 ns: 0 | 586 ns: 1 | 615 ns: 1 | 644 ns: 1 | 674 ns: 2 # 703 ns: 1 | 732 ns: 1 | 762 ns: 2 # 791 ns: 15 ####### 820 ns: 5 ## Total duration: 1 min 14.5 sec Start date: 2017-03-06 23:30:49 End date: 2017-03-06 23:33:11 Raw sample minimum: 137 ms Raw sample maximum: 444 ms Number of runs: 42 Total number of samples: 160 Number of samples per run: 4 Number of warmups per run: 2 Loop iterations per sample: 2^19 (128 outer-loops x 4096 inner-loops) Minimum: 262 ns (-5%) Median +- MAD: 276 ns +- 10 ns Mean +- std dev: 371 ns +- 196 ns Maximum: 847 ns (+207%) ERROR: the benchmark is very unstable, the standard deviation is very high (stdev/mean: 53%)! Try to rerun the benchmark with more runs, samples and/or loops Median +- MAD: 276 ns +- 10 ns ----------------- See attached bench.json.gz for full data. Victor -------------- next part -------------- A non-text attachment was scrubbed... Name: bench.json.gz Type: application/x-gzip Size: 6108 bytes Desc: not available URL: From victor.stinner at gmail.com Mon Mar 6 19:03:23 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 7 Mar 2017 01:03:23 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: Message-ID: Another example on the same computer. It's interesting: * MAD and std dev is the half of result 1 * the benchmark is less unstable * median is very close to result 1 * mean changed much more than median Benchmark result 1: Median +- MAD: 276 ns +- 10 ns Mean +- std dev: 371 ns +- 196 ns Benchmark result 2: Median +- MAD: 278 ns +- 5 ns Mean +- std dev: 303 ns +- 103 ns If the goal is to get reproductible results, Median +- MAD seems better. --- haypo at selma$ PYTHONPATH=~/prog/GIT/perf ./python -m perf show --hist --stats bench2.json.gz 250 ns: 75 ########################################################### 278 ns: 73 ######################################################### 306 ns: 3 ## 333 ns: 0 | 361 ns: 0 | 389 ns: 0 | 417 ns: 0 | 445 ns: 0 | 472 ns: 1 # 500 ns: 1 # 528 ns: 0 | 556 ns: 0 | 584 ns: 0 | 611 ns: 1 # 639 ns: 0 | 667 ns: 0 | 695 ns: 1 # 722 ns: 0 | 750 ns: 1 # 778 ns: 1 # 806 ns: 3 ## Total duration: 1 min 4.0 sec Start date: 2017-03-07 00:39:03 End date: 2017-03-07 00:41:05 Raw sample minimum: 140 ms Raw sample maximum: 431 ms Number of runs: 42 Total number of samples: 160 Number of samples per run: 4 Number of warmups per run: 2 Loop iterations per sample: 2^19 (128 outer-loops x 4096 inner-loops) Minimum: 266 ns (-4%) Median +- MAD: 278 ns +- 5 ns Mean +- std dev: 303 ns +- 103 ns Maximum: 822 ns (+195%) ERROR: the benchmark is very unstable, the standard deviation is very high (stdev/mean: 34%)! 
Try to rerun the benchmark with more runs, samples and/or loops Median +- MAD: 278 ns +- 5 ns --- -------------- next part -------------- A non-text attachment was scrubbed... Name: bench2.json.gz Type: application/x-gzip Size: 6011 bytes Desc: not available URL: From solipsis at pitrou.net Mon Mar 13 16:38:57 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 13 Mar 2017 21:38:57 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: Message-ID: <20170313213857.23d5a783@fsol> On Tue, 7 Mar 2017 01:03:23 +0100 Victor Stinner wrote: > Another example on the same computer. It's interesting: > * MAD and std dev is the half of result 1 > * the benchmark is less unstable > * median is very close to result 1 > * mean changed much more than median > > Benchmark result 1: > > Median +- MAD: 276 ns +- 10 ns > Mean +- std dev: 371 ns +- 196 ns > > Benchmark result 2: > > Median +- MAD: 278 ns +- 5 ns > Mean +- std dev: 303 ns +- 103 ns > > If the goal is to get reproductible results, Median +- MAD seems better. Getting reproducible results is only half of the goal. Getting meaningful (i.e. informative) results is the other half. The mean approximates the expected performance over multiple runs (note "expected" is a rigorously defined term in statistics here: see https://en.wikipedia.org/wiki/Expected_value). The median doesn't tell you anything about the expected value (*). So the mean is more informative for the task at hand. Additionally, while mean and std dev are generally quite well understood, the properties of the median absolute deviation are generally little known. So my vote goes to mean +/- std dev. (*) Quick example: let's say your runtimes in seconds are [1, 1, 1, 1, 1, 1, 10, 10, 10, 10]. Evidently, there are four outliers (over 10 measurements) that indicate a huge performance regression occurring at random points. However, the median here is 1 and the median absolute deviation (the median of absolute deviations from the median, i.e. the median of [0, 0, 0, 0, 0, 0, 9, 9, 9, 9]) is 0: the information about possible performance regressions is entirely lost, and the numbers (median +/- MAD) make it look like the benchmark reliably takes 1 s. to run. Regards Antoine. From storchaka at gmail.com Tue Mar 14 03:14:45 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 14 Mar 2017 09:14:45 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170313213857.23d5a783@fsol> References: <20170313213857.23d5a783@fsol> Message-ID: On 13.03.17 22:38, Antoine Pitrou wrote: > The mean approximates the expected performance over multiple runs (note > "expected" is a rigorously defined term in statistics here: see > https://en.wikipedia.org/wiki/Expected_value). The median doesn't tell > you anything about the expected value (*). So the mean is more > informative for the task at hand. The median tells you that results of a half of runs will be less than the median and results of other half will be larger. This is pretty informative and even more informative than the mean for some applications. > Additionally, while mean and std dev are generally quite well > understood, the properties of the median absolute deviation are > generally little known. Std dev is well understood for the distribution close to normal. 
But when the distribution is too skewed or multimodal (as in your quick example) common assumptions (that 2/3 of samples are in the range of the std dev, 95% of samples are in the range of two std devs, 99% of samples are in the range of three std devs) are no longer valid. From ncoghlan at gmail.com Tue Mar 14 10:42:07 2017 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 15 Mar 2017 00:42:07 +1000 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: On 14 March 2017 at 17:14, Serhiy Storchaka wrote: > On 13.03.17 22:38, Antoine Pitrou wrote: > >> Additionally, while mean and std dev are generally quite well >> understood, the properties of the median absolute deviation are >> generally little known. >> > > Std dev is well understood for the distribution close to normal. But when > the distribution is too skewed or multimodal (as in your quick example) > common assumptions (that 2/3 of samples are in the range of the std dev, > 95% of samples are in the range of two std devs, 99% of samples are in the > range of three std devs) are no longer valid. That would suggest that the implicit assumption of a measure-of-centrality with a measure-of-symmetric-deviation may need to be challenged, as at least some meaningful performance problems are going to show up as non-normal distributions in the benchmark results. Network services typically get around the "inherent variance" problem by looking at a few key percentiles like 50%, 90% and 95%. Perhaps that would be appropriate here as well? Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Mar 14 13:05:30 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 14 Mar 2017 18:05:30 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170314180530.4b37cd7e@fsol> On Tue, 14 Mar 2017 09:14:45 +0200 Serhiy Storchaka wrote: > The median tells you that results of a half of runs will be less than > the median and results of other half will be larger. This is pretty > informative and even more informative than the mean for some > applications. How so? Whether a measurement is below or above the median is a pointless piece of information in itself, because you don't know by how much. If a sample is 0.05% below the median, it might just as well be 0.05% above for all I care. If half of the samples are 1% below the median and half of the samples are 50% above, it's not the same thing at all as if half of the samples are 50% below and half of the samples are 1% above. Yet "median +/- MAD" gives the exact same results in both cases. > > Additionally, while mean and std dev are generally quite well > > understood, the properties of the median absolute deviation are > > generally little known. > > Std dev is well understood for the distribution close to normal. But > when the distribution is too skewed or multimodal (as in your quick > example) common assumptions (that 2/3 of samples are in the range of the > std dev, 95% of samples are in the range of two std devs, 99% of samples > are in the range of three std devs) are no longer valid. Not for individual samples, but for expected performance over a large enough number of runs, yes, you can more or less use common assumptions (thanks to the central limit theorem). And expected performance is a rather important piece of information. 
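To make that concrete, here are the numbers for the toy example from my previous message (rough sketch with the statistics module, not perf output):

---
import statistics

# toy runtimes in seconds: 6 fast runs, 4 slow runs
runs = [1, 1, 1, 1, 1, 1, 10, 10, 10, 10]

med = statistics.median(runs)
mad = statistics.median(abs(x - med) for x in runs)
print("median +- MAD:   %.1f +- %.1f" % (med, mad))        # 1.0 +- 0.0
print("mean +- std dev: %.1f +- %.1f"
      % (statistics.mean(runs), statistics.stdev(runs)))   # 4.6 +- 4.6
---

The mean/std dev pair at least shows that something slow happens regularly, while median/MAD reports a perfectly stable 1 s. benchmark.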
Regards Antoine. From solipsis at pitrou.net Tue Mar 14 13:13:59 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 14 Mar 2017 18:13:59 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170314181359.7251e493@fsol> On Wed, 15 Mar 2017 00:42:07 +1000 Nick Coghlan wrote: > > That would suggest that the implicit assumption of a measure-of-centrality > with a measure-of-symmetric-deviation may need to be challenged, as at > least some meaningful performance problems are going to show up as > non-normal distributions in the benchmark results. Well, the real issue here is that an important contributor to non-normality is not the benchmark itself, but measurement noise due to various issues (such as system noise, which has of course a highly skewed distribution). Victor is trying to eliminate the effects of system noise by using the median, but if that's the primary goal, using the minimum is arguably better, since the system noise is always a positive contributor (i.e. it can only increase the runtimes). The median is arguably a bastardized solution, which satisfies neither the requirement of eliminating system noise, nor the requirement of faithfully representing performance variations due to non-deterministic effects in the Python runtime and/or benchmark itself. Regards Antoine. From storchaka at gmail.com Wed Mar 15 02:41:47 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 15 Mar 2017 08:41:47 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170314180530.4b37cd7e@fsol> References: <20170313213857.23d5a783@fsol> <20170314180530.4b37cd7e@fsol> Message-ID: On 14.03.17 19:05, Antoine Pitrou wrote: > On Tue, 14 Mar 2017 09:14:45 +0200 > Serhiy Storchaka > wrote: >> The median tells you that results of a half of runs will be less than >> the median and results of other half will be larger. This is pretty >> informative and even more informative than the mean for some >> applications. > > How so? Whether a measurement is below or above the median is a > pointless piece of information in itself, because you don't know by how > much. If a sample is 0.05% below the median, it might just as well be > 0.05% above for all I care. If half of the samples are 1% below the > median and half of the samples are 50% above, it's not the same thing > at all as if half of the samples are 50% below and half of the samples > are 1% above. Yet "median +/- MAD" gives the exact same results in > both cases. "half of the samples are 1% below the median and half of the samples are 50% above" -- this is unrealistic example. In real examples samples are distributed around some point, with the skew and outliers. The median is close to the mean, but less affected by outliers. For benchmarking purpose the absolute value is not important. The change between two measurements of two builds is important. The median is more stable and that means that we have less chance to get the false result. From storchaka at gmail.com Wed Mar 15 02:54:44 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 15 Mar 2017 08:54:44 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? 
In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: On 14.03.17 16:42, Nick Coghlan wrote: > That would suggest that the implicit assumption of a > measure-of-centrality with a measure-of-symmetric-deviation may need to > be challenged, as at least some meaningful performance problems are > going to show up as non-normal distributions in the benchmark results. > > Network services typically get around the "inherent variance" problem by > looking at a few key percentiles like 50%, 90% and 95%. Perhaps that > would be appropriate here as well? Yes, quantiles would be useful, but I suppose they are less stable. If you have have only 20 samples, it is not enough to determine the 95% percentile. But absolute values are not important for the purposes of our benchmarking. We need only know whether one build is faster or slower than others. I suggested to calculate the probability of one build be faster than the other when compare two builds. This is just one number and it doesn't depend on assumptions about the normality of distributions. From solipsis at pitrou.net Wed Mar 15 09:52:32 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 15 Mar 2017 14:52:32 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170315145232.265ed0fe@fsol> On Wed, 15 Mar 2017 08:54:44 +0200 Serhiy Storchaka wrote: > > But absolute values are not important for the purposes of our > benchmarking. We need only know whether one build is faster or slower > than others. Not really. If you don't know by how much it is faster or slower, it is often useless in itself (because being 0.1% faster doesn't matter, even if that's a very reproduceable speedup). Really, the idea that actual values don't matter and only ordering does is broken. Of course actual values matter, because by how much something is faster is a much more useful piece of information than simply "it is faster". If changing for another interpreter makes some benchmark 3x faster, I may go for it. If it makes some benchmark 3% faster, I won't bother. Regards Antoine. From solipsis at pitrou.net Wed Mar 15 10:24:06 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 15 Mar 2017 15:24:06 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> <20170314180530.4b37cd7e@fsol> Message-ID: <20170315152406.4c18e1ff@fsol> On Wed, 15 Mar 2017 08:41:47 +0200 Serhiy Storchaka wrote: > > "half of the samples are 1% below the median and half of the samples are > 50% above" -- this is unrealistic example. I was inventing an extreme example for the sake of clarity. You can easily derive more "realistic" examples from the same principle and get similar results at the end: non-negligible variations being totally unrepresented in the "median +- MAD" aggregate. > In real examples samples are > distributed around some point, with the skew and outliers. If you're assuming the benchmark itself is stable and variations are due to outside system noise, then you should really take the minimum, which has the most chance of ignoring system noise. If you're mainly worried about outliers, you can first insert a data preparation (or cleanup) phase before computing the mean. But you have to decide up front whether an outlier is due to system noise or actual benchmark instability (which can be due to non-determinism in the runtime, e.g. hash randomization). 
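Such a cleanup phase can be as simple as dropping values too far from the median before averaging. A rough sketch (the cut-off is arbitrary and this is not something perf does):

---
import statistics

def trimmed_mean(values, k=3.0):
    # drop values further than k * MAD from the median, then average;
    # k is an arbitrary cut-off chosen for this example
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return statistics.mean(values)
    kept = [x for x in values if abs(x - med) <= k * mad]
    return statistics.mean(kept)
---

But whatever the cut-off, you still have to classify each outlier as system noise or as genuine benchmark behaviour.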
For that, you may want to collect additional system data while running the benchmark (for example, if total CPU occupation during the benchmark is much more than the benchmark's own CPU times, you might decide the system wasn't idle enough and the result may be classified as outlier). Regards Antoine. From victor.stinner at gmail.com Wed Mar 15 12:32:49 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:32:49 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170313213857.23d5a783@fsol> References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-13 21:38 GMT+01:00 Antoine Pitrou : >> If the goal is to get reproductible results, Median +- MAD seems better. > > Getting reproducible results is only half of the goal. Getting > meaningful (i.e. informative) results is the other half. If the system is tuned for benchmarks (run "python3 -m perf system tune"), you get almost no outlier on CPU-bound functions. In this case, mean/median and stdev/MAD are similar. The problem is when people don't tune their system to run benchmarks, which is likely the most common case. In this case, the distribution is never normal :-) It's always skewed (positive skew, the right part contains more points). Reproductibility is a very concrete and practical issue for me. > Additionally, while mean and std dev are generally quite well > understood, the properties of the median absolute deviation are > generally little known. A friend suggested me to display sigma = 1.48 * MAD, instead of displaying directly MAD, to get a value close to the standard deviation without outliers. I don't know if it makes sense :-) Victor From victor.stinner at gmail.com Wed Mar 15 12:36:06 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:36:06 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-14 8:14 GMT+01:00 Serhiy Storchaka : > Std dev is well understood for the distribution close to normal. But when > the distribution is too skewed or multimodal (as in your quick example) > common assumptions (that 2/3 of samples are in the range of the std dev, 95% > of samples are in the range of two std devs, 99% of samples are in the range > of three std devs) are no longer valid. The Python timeit module only displays the minimum. I chose to display also the standard deviation in perf to give an idea of the stability of the benchmark. For example, "10 +- 1 ms" is quite stable, whereas "10 ms +- 15 ms" seems not reliable at all. MAD contains 50% of samples, whereas std dev contains 66% of samples. If I only look at percentage, I prefer std dev because it gives a better estimation of the stability of the benchmark. Victor From victor.stinner at gmail.com Wed Mar 15 12:39:33 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:39:33 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-14 15:42 GMT+01:00 Nick Coghlan : > That would suggest that the implicit assumption of a measure-of-centrality > with a measure-of-symmetric-deviation may need to be challenged, as at least > some meaningful performance problems are going to show up as non-normal > distributions in the benchmark results. > > Network services typically get around the "inherent variance" problem by > looking at a few key percentiles like 50%, 90% and 95%. Perhaps that would > be appropriate here as well? 
Right now, there is almost no visualisation tool for perf :-( It started to list projects that may be reused to visualize benchmark results, to "see" the distribution. A first step would be to add these "key percentiles like 50%, 90% and 95%" to the perf stats command. I don't know how to compute them. But my question is for the most important summary: the result of the "perf show" command, which is what most users see except if they use more advanced commands. Victor From victor.stinner at gmail.com Wed Mar 15 12:41:59 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:41:59 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170314181359.7251e493@fsol> References: <20170313213857.23d5a783@fsol> <20170314181359.7251e493@fsol> Message-ID: 2017-03-14 18:13 GMT+01:00 Antoine Pitrou : > Victor is trying to eliminate the effects of system noise by using the > median, but if that's the primary goal, using the minimum is arguably > better, since the system noise is always a positive contributor (i.e. > it can only increase the runtimes). > > The median is arguably a bastardized solution, which satisfies neither > the requirement of eliminating system noise, nor the requirement of > faithfully representing performance variations due to non-deterministic > effects in the Python runtime and/or benchmark itself. The Python timeit module uses the minimum an in my experience, it's far not reproductible at all. Mean or median provides a more reproductible value. Victor From victor.stinner at gmail.com Wed Mar 15 12:59:07 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:59:07 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: While I like the "automatic removal of outliers feature" of median and MAD ("robust" statistics), I'm not confortable with these numbers. They are new to me and uncommon in other benchmark tools. It's not easy to compare MAD to standard deviation. It seems like MAD can even be misleading when reading the "1 ms" part of "10 ms +- 1 ms". The perf module has already a function to emit warnings if a benchmark is considered as "unstable". A warning is emitted if stdev/mean is greater than 0.10. I chose this threshold arbitrary. Maybe we need another check to emit a warning when mean and median, or std dev and MAD are too different? Maybe we need a new --median command line option to display median/MAD, instead of mean/stdev displayed by default? About the reproductibility, I should experiment mean vs median. Currently, perf doesn't use MAD nor std dev to compare two benchmark results. Victor From solipsis at pitrou.net Wed Mar 15 13:11:25 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 15 Mar 2017 18:11:25 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170315181125.268e5432@fsol> On Wed, 15 Mar 2017 17:59:07 +0100 Victor Stinner wrote: > > The perf module has already a function to emit warnings if a benchmark > is considered as "unstable". A warning is emitted if stdev/mean is > greater than 0.10. I chose this threshold arbitrary. It doesn't sound too bad :-) > Maybe we need another check to emit a warning when mean and median, or > std dev and MAD are too different? > > Maybe we need a new --median command line option to display > median/MAD, instead of mean/stdev displayed by default? I would say keep it simple. 
mean/stddev is informative enough, no need to add or maintain options of dubious utility. Regards Antoine. From victor.stinner at gmail.com Wed Mar 15 14:11:15 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 19:11:15 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170315181125.268e5432@fsol> References: <20170313213857.23d5a783@fsol> <20170315181125.268e5432@fsol> Message-ID: 2017-03-15 18:11 GMT+01:00 Antoine Pitrou : > I would say keep it simple. mean/stddev is informative enough, no need > to add or maintain options of dubious utility. Ok. I added a message to suggest to use perf stats to analyze results. Example of warnings with a benchmark result considered as unstable, python startup time measured by the new bench_command() function: --- $ python3 -m perf show startup1.json WARNING: the benchmark result may be unstable * the standard deviation (6.08 ms) is 16% of the mean (39.1 ms) * the minimum (23.6 ms) is 40% smaller than the mean (39.1 ms) Try to rerun the benchmark with more runs, values and/or loops. Run 'python3 -m perf system tune' command to reduce the system jitter. Use perf stats to analyze results, or --quiet to hide warnings. Median +- MAD: 40.7 ms +- 3.9 ms ---- Statistics of this result: --- $ python3 -m perf stats startup1.json -q Total duration: 37.2 sec Start date: 2017-03-15 18:02:46 End date: 2017-03-15 18:03:27 Raw value minimum: 189 ms Raw value maximum: 390 ms Number of runs: 25 Total number of values: 75 Number of values per run: 3 Number of warmups per run: 1 Loop iterations per value: 8 Minimum: 23.6 ms (-42% of the median) Median +- MAD: 40.7 ms +- 3.9 ms Mean +- std dev: 39.1 ms +- 6.1 ms Maximum: 48.7 ms (+20% of the median) --- Victor From storchaka at gmail.com Wed Mar 15 18:44:54 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 16 Mar 2017 00:44:54 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: On 15.03.17 18:59, Victor Stinner wrote: > It's not easy to compare MAD to standard deviation. It seems like MAD > can even be misleading when reading the "1 ms" part of "10 ms +- 1 > ms". Don't use the "+-" notation. It is misleading even for the stddev of normal distribution, because with the chance 1 against 2 the sample is out of the specified interval. Use "Mean: 10 ms Stddev: 1 ms" or "Median: 10 ms MAD: 1 ms" instead. > Maybe we need a new --median command line option to display > median/MAD, instead of mean/stdev displayed by default? Yes, make this configurable. And make median/MAD the default. ;) From victor.stinner at gmail.com Wed Mar 15 20:50:39 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 01:50:39 +0100 Subject: [Speed] ASLR Message-ID: 2017-03-16 1:38 GMT+01:00 Wang, Peter Xihong : > Hi All, > > I am attaching an image with comparison running the CALL_METHOD in the old Grand Unified Python Benchmark (GUPB) suite (https://hg.python.org/benchmarks), with and without ASLR disabled. This benchmark suite is now deprecated, please update to the new 'performance' benchmark suite: https://github.com/python/performance The old benchmark suite didn't spawn multiple processes and so was less reliable. By the way, maybe I should commit a change in hg.python.org/benchmarks to remove the code and only keep a README.txt? Code will still be accessible in Mercurial history. 
> You could see the run2run variation was reduced significantly, from data scattering all over the place, to just one single outlier, out of 30 repeated runs. > This effectively eliminated most of the variations for this micro-benchmark. > > On a Linux system, you could do this by: > as root > echo 0 > /proc/sys/kernel/randomize_va_space # to disable > echo 2 > /proc/sys/kernel/randomize_va_space # to enable > > If anyone still experiences run2run variation, I'd suggest to read on: > Based on my observation in our labs, a lot of factors could impact performance, including environment (yes, even a room temperature), I made my own experiment on the impact on temperature on performance, and above 100?C, I didn't notice anything: https://haypo.github.io/intel-cpus-part2.html "Impact of the CPU temperature on benchmarks" I tested a desktop and a laptop PC with an Intel CPU. > HW components or related such as platforms, chipset, memory DIMMs, CPU generations and stepping, BIOS version, kernels, the list goes on and on. > > Being said that, would it be helpful we work together, to identify the root cause, be it due to SW, or anything else? We could start with a specific micro-benchmark, with specific goal as what to measure. > After that, or in parallel after some baseline work is done, then focus on measurement process/methodology? > > Is this helpful? > > Thanks, > > Peter Note: Please open a new thread instead of replying to an email of an existing thread. Victor From victor.stinner at gmail.com Wed Mar 15 20:59:11 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 01:59:11 +0100 Subject: [Speed] perf 0.9.6 released Message-ID: Hi, I released perf 0.9.6 with many changes. First, "Mean +- std dev" is now displayed, instead of "Median +- std dev", as a result of the previous thread on this list. The median is still accessible via the stats command. By the way, the "stats" command now displays "Median +- MAD" instead of "Median +- std dev". I broke the API to fix an old mistake. I used the term "sample" for a single value, whereas a "sample" in statistics is a set of values (one or more), and so the term is misused. I replace "sample" with "value" and "samples" with "values" everywhere in perf. http://perf.readthedocs.io/en/latest/changelog.html#version-0-9-6-2017-03-15 Version 0.9.6 (2017-03-15) -------------------------- Major change: * Display ``Mean +- std dev`` instead of ``Median +- std dev`` Enhancements: * Add a new ``Runner.bench_command()`` method to measure the execution time of a command. * Add ``mean()``, ``median_abs_dev()`` and ``stdev()`` methods to ``Benchmark`` * ``check`` command: test also minimum and maximum compared to the mean Major API change, rename "sample" to "value": * Rename attributes and methods: - ``Benchmark.bench_sample_func()`` => ``Benchmark.bench_time_func()``. 
- ``Run.samples`` => ``Run.values`` - ``Benchmark.get_samples()`` => ``Benchmark.get_values()`` - ``get_nsample()`` => ``get_nvalue()`` - ``Benchmark.format_sample()`` => ``Benchmark.format_value()`` - ``Benchmark.format_samples()`` => ``Benchmark.format_values()`` * Rename Runner command line options: - ``--samples`` => ``--values`` - ``--debug-single-sample`` => ``--debug-single-value`` Changes: * ``convert``: Remove ``--remove-outliers`` option * ``check`` command now tests stdev/mean, instead of testing stdev/median * setup.py: statistics dependency is now installed using ``extras_require`` to support setuptools 18 and newer * Add setup.cfg to enable universal builds: same wheel package for Python 2 and Python 3 * Add ``perf.VERSION`` constant: tuple of int * JSON version 6: write metadata common to all benchmarks (common to all runs of all benchmarks) at the root; rename 'samples' to 'values' in runs. Victor From victor.stinner at gmail.com Wed Mar 15 21:22:59 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 02:22:59 +0100 Subject: [Speed] ASLR In-Reply-To: <371EBC7881C7844EAAF5556BFF21BCCC583F8B42@ORSMSX105.amr.corp.intel.com> References: <371EBC7881C7844EAAF5556BFF21BCCC583F8B42@ORSMSX105.amr.corp.intel.com> Message-ID: 2017-03-16 2:04 GMT+01:00 Wang, Peter Xihong : > Understood on the obsolete benchmark part. This was the work done before the new benchmark was created on github. I strongly advice you to move to performance. It also has a nice a API. It now produces a JSON file with *all* data, instead of just writing into summaries into stdout. > I thought this is related, and thus didn't open a new thread. The other thread was a discussion about statistics, how to summarize all timing into two numbers :-) > Maybe you could point me to one single micro-benchmark for the time being, and then we could compare result across? The "new" performance project is a fork of the old "benchmark" project. Benchmark names are very close or even the same for many benchmarks. If you would like to validate that your benchmark runner is stable: run call_method and call_simple microbenchmarks on different revisions of CPython, reboot sometimes the computer used to run benchmarks, and make sure that results are stable. Compare them with results of speed.python.org. call_method: https://speed.python.org/timeline/#/?exe=5&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on call_simple: https://speed.python.org/timeline/#/?exe=5&ben=call_simple&env=1&revs=50&equid=off&quarts=on&extr=on Around november and december 2016, you should notice a significant speedup on call_method. The best is to be able to avoid "temporary spikes" like this one: https://haypo.github.io/analysis-python-performance-issue.html The API of the perf project, PGO and LTO compilation, new performance using perf, "perf system tune" for system tuning, etc. helped to get more stable results. Victor From victor.stinner at gmail.com Wed Mar 15 21:27:25 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 02:27:25 +0100 Subject: [Speed] perf 0.9.6 released In-Reply-To: References: Message-ID: I updated performance for perf 0.9.6. I patched python_startup and hg_startup benchmarks to use the new bench_command() method. This new method uses the following Python script to measure the time to execute a command: https://github.com/haypo/perf/blob/master/perf/_process_time.py I wrote the _process_time.py script to be small and simple to reduce the overhead of the benchmark itself. 
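The core of the measurement is basically the following (simplified sketch, not the actual script):

---
import subprocess, sys, time

# spawn the command and measure its wall-clock duration
start = time.perf_counter()
proc = subprocess.Popen([sys.executable, "-c", "pass"])
proc.wait()
dt = time.perf_counter() - start
print("%.1f ms" % (dt * 1e3))
---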
It's similar to the "real time" line of UNIX 'time' command, but it works on Windows too. I chose to use time.perf_counter(), wall clock, instead of using getrusage() which provides CPU time. It's easy for me to understand wall clock time rather than CPU time, and it's more consistent with other perf methods. Victor 2017-03-16 1:59 GMT+01:00 Victor Stinner : > Hi, > > I released perf 0.9.6 with many changes. First, "Mean +- std dev" is > now displayed, instead of "Median +- std dev", as a result of the > previous thread on this list. The median is still accessible via the > stats command. By the way, the "stats" command now displays "Median +- > MAD" instead of "Median +- std dev". > > I broke the API to fix an old mistake. I used the term "sample" for a > single value, whereas a "sample" in statistics is a set of values (one > or more), and so the term is misused. I replace "sample" with "value" > and "samples" with "values" everywhere in perf. > > http://perf.readthedocs.io/en/latest/changelog.html#version-0-9-6-2017-03-15 > > Version 0.9.6 (2017-03-15) > -------------------------- > > Major change: > > * Display ``Mean +- std dev`` instead of ``Median +- std dev`` > > Enhancements: > > * Add a new ``Runner.bench_command()`` method to measure the execution time of > a command. > * Add ``mean()``, ``median_abs_dev()`` and ``stdev()`` methods to ``Benchmark`` > * ``check`` command: test also minimum and maximum compared to the mean > > Major API change, rename "sample" to "value": > > * Rename attributes and methods: > > - ``Benchmark.bench_sample_func()`` => ``Benchmark.bench_time_func()``. > - ``Run.samples`` => ``Run.values`` > - ``Benchmark.get_samples()`` => ``Benchmark.get_values()`` > - ``get_nsample()`` => ``get_nvalue()`` > - ``Benchmark.format_sample()`` => ``Benchmark.format_value()`` > - ``Benchmark.format_samples()`` => ``Benchmark.format_values()`` > > * Rename Runner command line options: > > - ``--samples`` => ``--values`` > - ``--debug-single-sample`` => ``--debug-single-value`` > > Changes: > > * ``convert``: Remove ``--remove-outliers`` option > * ``check`` command now tests stdev/mean, instead of testing stdev/median > * setup.py: statistics dependency is now installed using ``extras_require`` to > support setuptools 18 and newer > * Add setup.cfg to enable universal builds: same wheel package for Python 2 > and Python 3 > * Add ``perf.VERSION`` constant: tuple of int > * JSON version 6: write metadata common to all benchmarks (common to all runs > of all benchmarks) at the root; rename 'samples' to 'values' in runs. > > Victor From solipsis at pitrou.net Thu Mar 16 05:22:00 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 16 Mar 2017 10:22:00 +0100 Subject: [Speed] ASLR References: Message-ID: <20170316102200.746f709a@fsol> On Thu, 16 Mar 2017 01:50:39 +0100 Victor Stinner wrote: > > I made my own experiment on the impact on temperature on performance, > and above 100?C, I didn't notice anything: > https://haypo.github.io/intel-cpus-part2.html > "Impact of the CPU temperature on benchmarks" I suspect temperature can have an impact on performance if Turbo is enabled (or, as you noticed, if CPU cooling is deficient). Note that tweaking a system for benchmarking (disabling Turbo, disabling ASLR, etc.) may make the results more reproducible, but it may also make them less representative of real-world conditions (because few people disable Turbo or ASLR, except precisely on benchmarking machines :-)). It's a delicate balancing act! Regards Antoine. 
From victor.stinner at gmail.com Thu Mar 16 07:19:05 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 12:19:05 +0100 Subject: [Speed] ASLR In-Reply-To: <20170316102200.746f709a@fsol> References: <20170316102200.746f709a@fsol> Message-ID: 2017-03-16 10:22 GMT+01:00 Antoine Pitrou : > I suspect temperature can have an impact on performance if Turbo is > enabled (or, as you noticed, if CPU cooling is deficient). Oh sure, I now always start by disabling Turbo Boost. It's common that I run benchmarks on my desktop PC with Firefox running in the background. Variable workload on other CPUs is very likely to change the peak CPU frequency on the CPUs used for benhcmarks, even if CPU isolation and CPU pinning is used. > Note that tweaking a system for benchmarking (disabling Turbo, > disabling ASLR, etc.) may make the results more reproducible, but it > may also make them less representative of real-world conditions > (because few people disable Turbo or ASLR, except precisely on > benchmarking machines :-)). It's a delicate balancing act! Yeah, that's also why I chose to enable ASLR. I fear that disabling ASLR will put me a "local minimum" which is not representative of average performance when ASLR is enabled and benchmark run using multiple processes (to test multiple address layouts). Victor From peter.xihong.wang at intel.com Wed Mar 15 20:38:14 2017 From: peter.xihong.wang at intel.com (Wang, Peter Xihong) Date: Thu, 16 Mar 2017 00:38:14 +0000 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> <20170315181125.268e5432@fsol> Message-ID: <371EBC7881C7844EAAF5556BFF21BCCC583F8AED@ORSMSX105.amr.corp.intel.com> Hi All, I am attaching an image with comparison running the CALL_METHOD in the old Grand Unified Python Benchmark (GUPB) suite (https://hg.python.org/benchmarks), with and without ASLR disabled. You could see the run2run variation was reduced significantly, from data scattering all over the place, to just one single outlier, out of 30 repeated runs. This effectively eliminated most of the variations for this micro-benchmark. On a Linux system, you could do this by: as root echo 0 > /proc/sys/kernel/randomize_va_space # to disable echo 2 > /proc/sys/kernel/randomize_va_space # to enable If anyone still experiences run2run variation, I'd suggest to read on: Based on my observation in our labs, a lot of factors could impact performance, including environment (yes, even a room temperature), HW components or related such as platforms, chipset, memory DIMMs, CPU generations and stepping, BIOS version, kernels, the list goes on and on. Being said that, would it be helpful we work together, to identify the root cause, be it due to SW, or anything else? We could start with a specific micro-benchmark, with specific goal as what to measure. After that, or in parallel after some baseline work is done, then focus on measurement process/methodology? Is this helpful? Thanks, Peter ? -----Original Message----- From: Speed [mailto:speed-bounces+peter.xihong.wang=intel.com at python.org] On Behalf Of Victor Stinner Sent: Wednesday, March 15, 2017 11:11 AM To: Antoine Pitrou Cc: speed at python.org Subject: Re: [Speed] Median +- MAD or Mean +- std dev? 2017-03-15 18:11 GMT+01:00 Antoine Pitrou : > I would say keep it simple. mean/stddev is informative enough, no > need to add or maintain options of dubious utility. Ok. I added a message to suggest to use perf stats to analyze results. 
Example of warnings with a benchmark result considered as unstable, python startup time measured by the new bench_command() function: --- $ python3 -m perf show startup1.json WARNING: the benchmark result may be unstable * the standard deviation (6.08 ms) is 16% of the mean (39.1 ms) * the minimum (23.6 ms) is 40% smaller than the mean (39.1 ms) Try to rerun the benchmark with more runs, values and/or loops. Run 'python3 -m perf system tune' command to reduce the system jitter. Use perf stats to analyze results, or --quiet to hide warnings. Median +- MAD: 40.7 ms +- 3.9 ms ---- Statistics of this result: --- $ python3 -m perf stats startup1.json -q Total duration: 37.2 sec Start date: 2017-03-15 18:02:46 End date: 2017-03-15 18:03:27 Raw value minimum: 189 ms Raw value maximum: 390 ms Number of runs: 25 Total number of values: 75 Number of values per run: 3 Number of warmups per run: 1 Loop iterations per value: 8 Minimum: 23.6 ms (-42% of the median) Median +- MAD: 40.7 ms +- 3.9 ms Mean +- std dev: 39.1 ms +- 6.1 ms Maximum: 48.7 ms (+20% of the median) --- Victor _______________________________________________ Speed mailing list Speed at python.org https://mail.python.org/mailman/listinfo/speed -------------- next part -------------- A non-text attachment was scrubbed... Name: ASLR_disabled_enabled_comparison.jpg Type: image/jpeg Size: 79494 bytes Desc: ASLR_disabled_enabled_comparison.jpg URL: From peter.xihong.wang at intel.com Wed Mar 15 21:04:14 2017 From: peter.xihong.wang at intel.com (Wang, Peter Xihong) Date: Thu, 16 Mar 2017 01:04:14 +0000 Subject: [Speed] ASLR In-Reply-To: References: Message-ID: <371EBC7881C7844EAAF5556BFF21BCCC583F8B42@ORSMSX105.amr.corp.intel.com> Hi Victor, Understood on the obsolete benchmark part. This was the work done before the new benchmark was created on github. I thought this is related, and thus didn't open a new thread. Maybe you could point me to one single micro-benchmark for the time being, and then we could compare result across? ? Regards, Peter -----Original Message----- From: Victor Stinner [mailto:victor.stinner at gmail.com] Sent: Wednesday, March 15, 2017 5:51 PM To: speed at python.org; Wang, Peter Xihong Subject: ASLR 2017-03-16 1:38 GMT+01:00 Wang, Peter Xihong : > Hi All, > > I am attaching an image with comparison running the CALL_METHOD in the old Grand Unified Python Benchmark (GUPB) suite (https://hg.python.org/benchmarks), with and without ASLR disabled. This benchmark suite is now deprecated, please update to the new 'performance' benchmark suite: https://github.com/python/performance The old benchmark suite didn't spawn multiple processes and so was less reliable. By the way, maybe I should commit a change in hg.python.org/benchmarks to remove the code and only keep a README.txt? Code will still be accessible in Mercurial history. > You could see the run2run variation was reduced significantly, from data scattering all over the place, to just one single outlier, out of 30 repeated runs. > This effectively eliminated most of the variations for this micro-benchmark. 
> > On a Linux system, you could do this by: > as root > echo 0 > /proc/sys/kernel/randomize_va_space # to disable > echo 2 > /proc/sys/kernel/randomize_va_space # to enable > > If anyone still experiences run2run variation, I'd suggest to read on: > Based on my observation in our labs, a lot of factors could impact > performance, including environment (yes, even a room temperature), I made my own experiment on the impact on temperature on performance, and above 100?C, I didn't notice anything: https://haypo.github.io/intel-cpus-part2.html "Impact of the CPU temperature on benchmarks" I tested a desktop and a laptop PC with an Intel CPU. > HW components or related such as platforms, chipset, memory DIMMs, CPU generations and stepping, BIOS version, kernels, the list goes on and on. > > Being said that, would it be helpful we work together, to identify the root cause, be it due to SW, or anything else? We could start with a specific micro-benchmark, with specific goal as what to measure. > After that, or in parallel after some baseline work is done, then focus on measurement process/methodology? > > Is this helpful? > > Thanks, > > Peter Note: Please open a new thread instead of replying to an email of an existing thread. Victor From brett at python.org Thu Mar 16 12:19:35 2017 From: brett at python.org (Brett Cannon) Date: Thu, 16 Mar 2017 16:19:35 +0000 Subject: [Speed] ASLR In-Reply-To: References: Message-ID: On Wed, 15 Mar 2017 at 17:54 Victor Stinner wrote: > 2017-03-16 1:38 GMT+01:00 Wang, Peter Xihong >: > > Hi All, > > > > I am attaching an image with comparison running the CALL_METHOD in the > old Grand Unified Python Benchmark (GUPB) suite ( > https://hg.python.org/benchmarks), with and without ASLR disabled. > > This benchmark suite is now deprecated, please update to the new > 'performance' benchmark suite: > https://github.com/python/performance > > The old benchmark suite didn't spawn multiple processes and so was > less reliable. > > By the way, maybe I should commit a change in hg.python.org/benchmarks > to remove the code and only keep a README.txt? Code will still be > accessible in Mercurial history. > Since we might not shut down hg.python.org for a long time I say go ahead and commit such a change. -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.stinner at gmail.com Thu Mar 16 13:28:40 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 18:28:40 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-15 23:44 GMT+01:00 Serhiy Storchaka : > Don't use the "+-" notation. It is misleading even for the stddev of normal > distribution, because with the chance 1 against 2 the sample is out of the > specified interval. Use "Mean: 10 ms Stddev: 1 ms" or "Median: 10 ms MAD: > 1 ms" instead. I know that it's an abuse of "value +- range" notation. Since I already changed the default formatting of a benchmark multiple times and it seems like Serhiy doesn't like the current format, a first action is to remove the public methods to format a benchmark :-) https://github.com/haypo/perf/commit/881a282cdac7969e3c759ff344ad766b3ae0f065 So at least, I will not break the API if I change the format again in the future. 
Victor From peter.xihong.wang at intel.com Thu Mar 16 19:00:19 2017 From: peter.xihong.wang at intel.com (Wang, Peter Xihong) Date: Thu, 16 Mar 2017 23:00:19 +0000 Subject: [Speed] ASLR In-Reply-To: <20170316102200.746f709a@fsol> References: <20170316102200.746f709a@fsol> Message-ID: <371EBC7881C7844EAAF5556BFF21BCCC583FA12E@ORSMSX105.amr.corp.intel.com> [Wang, Peter Xihong] I am wondering what others are using micro-benchmarks for, or if there is a usage statistics somewhere about these benchmarks. For me, it's optimization delta driven. e.g., if I expect my optimization to boost performance by 5%, but the variation reaches up to or greater than 5%, then I am getting lost, and the perf data cannot be trusted.? In addition to turbo boost, I also turned off hyperthreading, and c-state, p-state, on Intel CPUs. Regards, Peter > -----Original Message----- > From: Speed [mailto:speed- > bounces+peter.xihong.wang=intel.com at python.org] On Behalf Of Antoine > Pitrou > Sent: Thursday, March 16, 2017 2:22 AM > To: speed at python.org > Subject: Re: [Speed] ASLR > > On Thu, 16 Mar 2017 01:50:39 +0100 > Victor Stinner > wrote: > > > > I made my own experiment on the impact on temperature on performance, > > and above 100?C, I didn't notice anything: > > https://haypo.github.io/intel-cpus-part2.html > > "Impact of the CPU temperature on benchmarks" > > I suspect temperature can have an impact on performance if Turbo is enabled > (or, as you noticed, if CPU cooling is deficient). > > Note that tweaking a system for benchmarking (disabling Turbo, disabling ASLR, > etc.) may make the results more reproducible, but it may also make them less > representative of real-world conditions (because few people disable Turbo or > ASLR, except precisely on benchmarking machines :-)). It's a delicate > balancing act! > > Regards > > Antoine. > > > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From victor.stinner at gmail.com Thu Mar 16 22:07:35 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 17 Mar 2017 03:07:35 +0100 Subject: [Speed] perf 1.0 released: with a stable API Message-ID: Hi, After 9 months of development, the perf API became stable with the awaited "1.0" version. The perf module has now a complete API to write, run and analyze benchmarks and a nice documentation explaining traps of benchmarking and how to avoid, or even, fix them. http://perf.readthedocs.io/ Last days, I rewrote the documentation, hid a few more functions to prevent API changes after the 1.0 release, and I made last backward incompatible changes to fix old design issues. I don't expect the module to be perfect. It's more a milestone to freeze the API and focus on features instead ;-) Changes between 0.9.6 and 1.0: Enhancements: * ``stats`` command now displays percentiles * ``hist`` command now also checks the benchmark stability by default * dump command now displays raw value of calibration runs. * Add ``Benchmark.percentile()`` method Backward incompatible changes: * Remove the ``compare`` command to only keep the ``compare_to`` command which is better defined * Run warmup values must now be normalized per loop iteration. * Remove ``format()`` and ``__str__()`` methods from Benchmark. These methods were too opiniated. * Rename ``--name=NAME`` option to ``--benchmark=NAME`` * Remove ``perf.monotonic_clock()`` since it wasn't monotonic on Python 2.7. 
* Remove ``is_significant()`` from the public API Other changes: * check command now only complains if min/max is 50% smaller/larger than the mean, instead of 25%. Note: I already updated the performance project to perf 1.0. Victor From victor.stinner at gmail.com Thu Mar 16 22:11:14 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 17 Mar 2017 03:11:14 +0100 Subject: [Speed] ASLR In-Reply-To: <371EBC7881C7844EAAF5556BFF21BCCC583FA12E@ORSMSX105.amr.corp.intel.com> References: <20170316102200.746f709a@fsol> <371EBC7881C7844EAAF5556BFF21BCCC583FA12E@ORSMSX105.amr.corp.intel.com> Message-ID: 2017-03-17 0:00 GMT+01:00 Wang, Peter Xihong : > In addition to turbo boost, I also turned off hyperthreading, and c-state, p-state, on Intel CPUs. My "python3 -m perf system tune" command sets the minimum frequency of CPUs used for benchmarks to the maximum frequency. I expect that it reduces or even avoid changes on P-state and C-state. See my documentation on How to tune a system for benchmarking: http://perf.readthedocs.io/en/latest/system.html Victor From victor.stinner at gmail.com Thu Mar 16 22:29:19 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 17 Mar 2017 03:29:19 +0100 Subject: [Speed] pymicrobench: collection of CPython microbenchmarks Message-ID: Hi, I started to create a collection of microbenchmarks for CPython from scripts found on the bug tracker: https://github.com/haypo/pymicrobench I'm not sure that this collection is used yet, but some of you may want to take a look :-) I know that some people have random microbenchmarks in a local directory. Maybe you want to share them? I don't really care to sort them or group them. My plan is first to populate the repository, and later see what to do with it :-) Victor From victor.stinner at gmail.com Sun Mar 26 18:12:21 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 27 Mar 2017 00:12:21 +0200 Subject: [Speed] speed.python.org: move to Git, remove old previous results Message-ID: Hi, I'm going to remove old previous benchmark results from speed.python.org. As we discussed previously, there is no plan to keep old results when we need to change something. In this case, CPython moved from Mercurial to Git, and I'm too lazy to upgrade the revisions in database. I prefer to run again benchmarks :-) My plan: * Remove all previous benchmark results * Run benchmarks on master, 2.7, 3.6 and 3.5 branches * Run benchmarks on one revision per year quarter on the last 2 years * Then see if we should run benchmarks on even older revisions and/or if we need more than one plot per quarter. * Maybe one point per month at least? The problem is that the UI is limited to 50 points on the "Display all in a grid" view of the Timeline. I would like to be able to render 2 years on this view. For each year quarter, I will use the first commit of the master branch on this period. Victor From victor.stinner at gmail.com Mon Mar 27 10:43:37 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 27 Mar 2017 16:43:37 +0200 Subject: [Speed] speed.python.org: move to Git, remove old previous results In-Reply-To: References: Message-ID: Zachary Ware told me on IRC that it's ok for him to drop old data. If nobody else complains, I will remove old data tomorrow (tuesday). I already validated that the patched scripts work on Git. I released new versions of perf and performance to make sure that the latest version of the code is released and used. 
From victor.stinner at gmail.com  Thu Mar 16 22:29:19 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Fri, 17 Mar 2017 03:29:19 +0100
Subject: [Speed] pymicrobench: collection of CPython microbenchmarks
Message-ID:

Hi,

I started to create a collection of microbenchmarks for CPython from scripts
found on the bug tracker:

https://github.com/haypo/pymicrobench

I'm not sure yet how this collection will be used, but some of you may want
to take a look :-)

I know that some people have random microbenchmarks in a local directory.
Maybe you want to share them? I don't really care about sorting or grouping
them. My plan is to populate the repository first, and see later what to do
with it :-)

Victor

From victor.stinner at gmail.com  Sun Mar 26 18:12:21 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Mon, 27 Mar 2017 00:12:21 +0200
Subject: [Speed] speed.python.org: move to Git, remove old previous results
Message-ID:

Hi,

I'm going to remove the old benchmark results from speed.python.org. As we
discussed previously, there is no plan to keep old results when we need to
change something. In this case, CPython moved from Mercurial to Git, and I'm
too lazy to upgrade the revisions in the database. I prefer to run the
benchmarks again :-)

My plan:

* Remove all previous benchmark results
* Run benchmarks on the master, 2.7, 3.6 and 3.5 branches
* Run benchmarks on one revision per quarter over the last 2 years
* Then see if we should run benchmarks on even older revisions and/or if we
  need more than one plot per quarter
* Maybe one point per month at least? The problem is that the UI is limited
  to 50 points on the "Display all in a grid" view of the Timeline. I would
  like to be able to render 2 years on this view.

For each quarter, I will use the first commit on the master branch in that
period.

Victor

From victor.stinner at gmail.com  Mon Mar 27 10:43:37 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Mon, 27 Mar 2017 16:43:37 +0200
Subject: [Speed] speed.python.org: move to Git, remove old previous results
In-Reply-To:
References:
Message-ID:

Zachary Ware told me on IRC that it's OK for him to drop the old data.

If nobody else complains, I will remove the old data tomorrow (Tuesday).

I already validated that the patched scripts work with Git. I released new
versions of perf and performance to make sure that the latest version of the
code is released and used.

By the way, the newly released perf 1.1 gets a new "perf command" command to
measure the time of a command, like the Unix "time" command:

http://perf.readthedocs.io/en/latest/cli.html#command-cmd

$ python3 -m perf command -- python2 -c pass
.....................
command: Mean +- std dev: 21.2 ms +- 3.2 ms

Victor

2017-03-27 0:12 GMT+02:00 Victor Stinner:
> Hi,
>
> I'm going to remove the old benchmark results from speed.python.org. As we
> discussed previously, there is no plan to keep old results when we need to
> change something. In this case, CPython moved from Mercurial to Git, and I'm
> too lazy to upgrade the revisions in the database. I prefer to run the
> benchmarks again :-)
>
> My plan:
>
> * Remove all previous benchmark results
> * Run benchmarks on the master, 2.7, 3.6 and 3.5 branches
> * Run benchmarks on one revision per quarter over the last 2 years
> * Then see if we should run benchmarks on even older revisions and/or if we
>   need more than one plot per quarter
> * Maybe one point per month at least? The problem is that the UI is limited
>   to 50 points on the "Display all in a grid" view of the Timeline. I would
>   like to be able to render 2 years on this view.
>
> For each quarter, I will use the first commit on the master branch in that
> period.
>
> Victor

From victor.stinner at gmail.com  Mon Mar 27 19:17:26 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 28 Mar 2017 01:17:26 +0200
Subject: [Speed] ASLR
In-Reply-To:
References:
Message-ID:

2017-03-16 17:19 GMT+01:00 Brett Cannon:
>> By the way, maybe I should commit a change in hg.python.org/benchmarks
>> to remove the code and only keep a README.txt? Code will still be
>> accessible in Mercurial history.
>
> Since we might not shut down hg.python.org for a long time I say go ahead
> and commit such a change.

Ok, done!

https://hg.python.org/benchmarks/file/tip/README.txt
https://hg.python.org/benchmarks/file/tip

Victor

From tobami at gmail.com  Tue Mar 28 03:36:35 2017
From: tobami at gmail.com (Miquel Torres)
Date: Tue, 28 Mar 2017 07:36:35 +0000
Subject: [Speed] speed.python.org: move to Git, remove old previous results
In-Reply-To:
References:
Message-ID:

I can have a look into increasing the number of points displayed.

On Mon, 27 Mar 2017 at 15:44, Victor Stinner wrote:
> Zachary Ware told me on IRC that it's OK for him to drop the old data.
>
> If nobody else complains, I will remove the old data tomorrow (Tuesday).
>
> I already validated that the patched scripts work with Git. I released new
> versions of perf and performance to make sure that the latest version of
> the code is released and used. By the way, the newly released perf 1.1
> gets a new "perf command" command to measure the time of a command, like
> the Unix "time" command:
>
> http://perf.readthedocs.io/en/latest/cli.html#command-cmd
>
> $ python3 -m perf command -- python2 -c pass
> .....................
> command: Mean +- std dev: 21.2 ms +- 3.2 ms
>
> Victor
>
> 2017-03-27 0:12 GMT+02:00 Victor Stinner:
> > Hi,
> >
> > I'm going to remove the old benchmark results from speed.python.org. As
> > we discussed previously, there is no plan to keep old results when we
> > need to change something. In this case, CPython moved from Mercurial to
> > Git, and I'm too lazy to upgrade the revisions in the database.
> > I prefer to run the benchmarks again :-)
> >
> > My plan:
> >
> > * Remove all previous benchmark results
> > * Run benchmarks on the master, 2.7, 3.6 and 3.5 branches
> > * Run benchmarks on one revision per quarter over the last 2 years
> > * Then see if we should run benchmarks on even older revisions and/or if
> >   we need more than one plot per quarter
> > * Maybe one point per month at least? The problem is that the UI is
> >   limited to 50 points on the "Display all in a grid" view of the
> >   Timeline. I would like to be able to render 2 years on this view.
> >
> > For each quarter, I will use the first commit on the master branch in
> > that period.
> >
> > Victor
> _______________________________________________
> Speed mailing list
> Speed at python.org
> https://mail.python.org/mailman/listinfo/speed
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From victor.stinner at gmail.com  Tue Mar 28 07:05:06 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 28 Mar 2017 13:05:06 +0200
Subject: [Speed] speed.python.org: move to Git, remove old previous results
In-Reply-To:
References:
Message-ID:

2017-03-28 9:36 GMT+02:00 Miquel Torres:
> I can have a look into increasing the number of points displayed.

There is a "Show the last [50] results" widget, but it's disabled if you
select "(o) Display all in a grid". Maybe we should enable the first widget
but limit the maximum number of results when this specific view is selected?
Just keep 50 by default ;-)

Victor

From victor.stinner at gmail.com  Tue Mar 28 08:11:54 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 28 Mar 2017 14:11:54 +0200
Subject: [Speed] Interesting Ruby pull request
Message-ID:

Hi,

It seems like Urabe, Shyouhei succeeded in writing an efficient optimizer for
Ruby:

https://github.com/ruby/ruby/pull/1419

Since the Ruby and CPython designs are similar, maybe we can pick up some
ideas. It seems like the optimizer is not done yet; the PR has not been
merged yet. I don't understand how the optimizer works.

An interesting commit:
https://github.com/ruby/ruby/pull/1419/commits/d7b376949eb1626b9e5088f907db4cda5698ac1b

---
basic optimization infrastructure

This commit adds on-the-fly ISeq analyzer. It detects an ISeq's purity, i.e.
if that ISeq has side-effect or not. Purity is the key concept of whole
optimization techniques in general, but in Ruby it is yet more important
because there is a method called eval. A pure ISeq is free from eval, while
those not pure are stuck in the limbo where any of its side effects _could_
result in (possibly aliased) call to eval. So an optimization tend not be
possible against them.

Note however, that the analyzer cannot statically say if the ISeq in question
is pure or not. It categorizes an ISeq into 3 states namely pure, not pure,
or "unpredictable". The last category is used when for instance there are
branches yet to be analyzed, or method calls to another unpredictable ISeq.

An ISeq's purity changes over time, not only by redefinition of methods, but
by other optimizations, like, by entering a rarely-taken branch of a
formerly-unpredictable ISeq to kick analyzer to fix its purity. Such change
propagates to its callers.

* optimize.c: new file.
* optimize.h: new file.
* common.mk (COMMONOBJS): dependencies for new files.
* iseq.h (ISEQ_NEEDS_ANALYZE): new flag to denote the iseq in question might
  need (re)analyzing.
---

I had this link in my bookmarks for months, but I forgot about it. This email
is so that I don't forget it again ;-) Someone may find it useful!

Victor
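Since the quoted commit message is fairly dense, here is a toy Python sketch of the three-state
purity idea it describes: every function is classified as pure, impure or unknown, a call to eval
(or to something not analyzed yet) keeps the caller out of the pure bucket, and a function's
classification depends on its callees. The call graph and the rules below are entirely made up
for illustration and do not correspond to the actual Ruby implementation in the PR:

---
# Toy model of the pure / not pure / "unpredictable" classification described
# in the commit message above. Purely illustrative: the names, the call graph
# and the rules are invented, nothing here mirrors the Ruby code.
PURE, IMPURE, UNKNOWN = "pure", "impure", "unknown"

# Hypothetical call graph: name -> (has direct side effects?, callees)
CALL_GRAPH = {
    "add":     (False, []),                          # no side effects, no calls
    "log":     (True,  []),                          # writes somewhere: impure
    "compute": (False, ["add"]),                     # only calls pure code
    "dynamic": (False, ["eval"]),                    # reaches eval: impure
    "handler": (False, ["compute", "plugin_hook"]),  # callee not analyzed yet
}


def purity(name, cache=None):
    if cache is None:
        cache = {}
    if name == "eval":
        return IMPURE
    if name not in CALL_GRAPH:
        return UNKNOWN                # not analyzed yet: "unpredictable"
    if name in cache:
        return cache[name]
    cache[name] = UNKNOWN             # provisional value for recursive calls
    has_side_effects, callees = CALL_GRAPH[name]
    result = IMPURE if has_side_effects else PURE
    for callee in callees:
        sub = purity(callee, cache)
        if sub == IMPURE:
            result = IMPURE
            break
        if sub == UNKNOWN:
            result = UNKNOWN
    cache[name] = result
    return result


for func in sorted(CALL_GRAPH):
    print("%-8s -> %s" % (func, purity(func)))
---

The interesting property, and presumably why the PR tracks purity on the fly rather than in a
single static pass, is the last paragraph of the commit message: when new information arrives (a
branch is finally taken, a method is redefined), only the affected function's classification and
that of its callers needs to be updated.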
From victor.stinner at gmail.com  Tue Mar 28 19:22:31 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Wed, 29 Mar 2017 01:22:31 +0200
Subject: [Speed] Results of CPython benchmarks on 2016
Message-ID:

Hi,

Before removing everything from the speed.python.org database, I took
screenshots of the interesting pages:

https://haypo.github.io/speed-python-org-march-2017.html

* Benchmarks where Python 3.7 is faster than Python 2.7
* Benchmarks where Python 3.7 is slower than Python 2.7
* Significant optimizations
* etc.

CPython became faster on many benchmarks in 2016:

* call_method
* float
* hexiom
* nqueens
* pickle_list
* richards
* scimark_lu
* scimark_sor
* sympy_sum
* telco
* unpickle_list

I now have to analyze what made these benchmarks faster for my future
PyCon US talk "Optimizations which made Python 3.6 faster than Python 3.5" ;-)

I also kept many screenshots showing that the benchmarks are now stable!

Victor

From victor.stinner at gmail.com  Fri Mar 31 18:47:35 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Sat, 1 Apr 2017 00:47:35 +0200
Subject: [Speed] Issues to run benchmarks on Python before 2015-04-01
Message-ID:

Hi,

I'm trying to run benchmarks on revisions between 2014-01-01 and today, but I
hit two different issues: see below. I'm now looking for workarounds :-/
Because of these bugs, I'm unable to get benchmark results before 2015-04-01
(from 2015-04-01 on, benchmarks work again).

(1) 2014-01-01: "python3 -m pip install performance" fails with a TypeError:
"charset argument must be specified when non-ASCII characters are used in the
payload"

It's a regression introduced in a Python 3.4 beta:
http://bugs.python.org/issue20531

(2) 2014-04-01, 2014-07-01, 2014-10-01, 2015-01-01: "venv/bin/python -m pip
install" fails in extract_stack() of pyparsing

---
haypo at selma$ /home/haypo/prog/bench_python/tmpdir/prefix/bin/python3
Python 3.5.0a0 (default, Apr 1 2017, 00:01:30)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pip
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/__init__.py", line 26, in <module>
    from pip.utils import get_installed_distributions, get_prog
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/utils/__init__.py", line 27, in <module>
    from pip._vendor import pkg_resources
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pkg_resources/__init__.py", line 74, in <module>
    __import__('pip._vendor.packaging.requirements')
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/packaging/requirements.py", line 9, in <module>
    from pip._vendor.pyparsing import (
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 4715, in <module>
    _escapedPunc = Word( _bslash, r"\[]-*.$+^?()~ ", exact=2 ).setParseAction(lambda s,l,t:t[0][1])
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 1261, in setParseAction
    self.parseAction = list(map(_trim_arity, list(fns)))
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 1043, in _trim_arity
    this_line = extract_stack(limit=2)[-1]
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 1028, in extract_stack
    return [(frame_summary.filename, frame_summary.lineno)]
AttributeError: 'tuple' object has no attribute 'filename'
---

Note: I get the same error with the pip program itself (e.g. "prefix/bin/pip --version").

Victor
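A guess about issue (2), based only on the traceback above, so treat it as an assumption rather
than a verified diagnosis: the traceback module only grew FrameSummary objects during the 3.5
development cycle, so a CPython built from a 2014 revision still returns plain tuples from
traceback.extract_stack() while reporting itself as 3.5.0a0; pyparsing selects its code path from
the version number and then expects .filename on a tuple. A feature-detecting variant of that
check, as a sketch of one possible workaround:

---
# Sketch of the suspected mismatch: select the code path by feature detection
# instead of by sys.version_info, so an early "3.5.0a0" build that predates
# traceback.FrameSummary still takes the tuple path. Assumption based on the
# traceback above, not verified against those exact 2014 revisions.
import traceback

frames = traceback.extract_stack(limit=2)
last = frames[-1]

if hasattr(last, "filename"):
    # Modern path: FrameSummary objects with named attributes.
    print(last.filename, last.lineno)
else:
    # Old path: plain (filename, lineno, name, line) tuples, which is what a
    # 3.5 pre-alpha built before the traceback rewrite still returns.
    print(last[0], last[1])
---

If that diagnosis is right, pinning an older pip (with an older vendored pyparsing) in the
virtual environment, or patching the vendored pyparsing to feature-detect as above, would be
possible workarounds.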