From victor.stinner at gmail.com Mon Jul 4 04:53:25 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 10:53:25 +0200 Subject: [Speed] bm_pickle: why testing protocol 2? Message-ID: Hi, performance/bm_pickle.py of the CPython benchmark suite uses the pickle protocol 2 by default. Why not always test the highest protocol? In Python 3.5, the highest protocol is 4, which is more efficient than protocol 2. Is it a deliberate choice to test exactly the same thing between Python 2 and Python 3? Victor From solipsis at pitrou.net Mon Jul 4 05:38:11 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 4 Jul 2016 11:38:11 +0200 Subject: [Speed] bm_pickle: why testing protocol 2? References: Message-ID: <20160704113811.3f90a109@fsol> On Mon, 4 Jul 2016 10:53:25 +0200 Victor Stinner wrote: > Hi, > > performance/bm_pickle.py of the CPython benchmark suite uses the > pickle protocol 2 by default. Why not always test the highest > protocol? I think this comes from the Unladen Swallow benchmark suite, and Unladen Swallow was Python 2-only, so protocol 2 *was* the highest protocol in those circumstances. Regards Antoine. From victor.stinner at gmail.com Mon Jul 4 10:17:23 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 16:17:23 +0200 Subject: [Speed] New CPython benchmark suite based on perf Message-ID: Hi, I modified the CPython benchmark suite to use my perf module: https://hg.python.org/sandbox/benchmarks_perf Changes: * use statistics.median() rather than mean() to compute the "average" of samples. Example: Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower * replace compat.py with external six dependency * replace util.py with perf * replace explicit warmups with perf automatic warmup * add name metadata * for benchmarks taking parameters, save the parameters in metadata * avoid nested loops, prefer a single level of loop: perf is responsible for calling the sample function enough times to collect enough samples * store the django and mako versions in metadata * use JSON format to exchange timings between benchmarks and runner.py perf adds more features: * run each benchmark in multiple processes (25 by default, 50 in rigorous mode) * calibrate each benchmark to compute the number of loops to get a sample between 100 ms and 1 second TODO: * Right now the calibration is done twice: in the reference python and in the changed python. It should only be done once, in the reference python * runner.py should write results in a JSON file. Currently, data are not written on disk (a pipe is used with child processes) * Drop external dependencies and create a virtual environment per python * Port more Python 2-only benchmarks to Python 3 * Add more benchmarks from PyPy, Pyston and Pyjion benchmark suites: unify again the benchmark suites :-) perf has built-in tools to analyze the distribution of samples: * add --hist option to a benchmark to display a histogram in text mode * add --stats option to a benchmark to display statistics: number of samples, shortest raw sample, min, max, etc. * the "python3 -m perf" CLI has many commands to analyze a benchmark: http://perf.readthedocs.io/en/latest/cli.html Right now, perf JSON format is only able to store one benchmark. I will extend the format to be able to store a list of benchmarks. So it will be possible to store all results of a python version into a single file. 
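[Editorial aside: here is a minimal sketch of how such a "Median +- Std dev" comparison line can be computed with the stdlib statistics module. It is only an illustration: the sample values are invented and this is not the actual runner.py/perf code.]
---
# Illustrative only: invented timing samples, not the real runner.py code.
import statistics

def summarize(samples):
    # perf reports the median; stdev() is centered on the mean, but it
    # still gives a usable idea of the spread.
    return statistics.median(samples), statistics.stdev(samples)

base = [0.256, 0.255, 0.258, 0.257, 0.256]     # reference python, seconds
changed = [0.262, 0.261, 0.265, 0.263, 0.262]  # changed python, seconds

base_med, base_dev = summarize(base)
new_med, new_dev = summarize(changed)
print("Median +- Std dev: %.0f ms +- %.0f ms -> %.0f ms +- %.0f ms: %.2fx slower"
      % (base_med * 1e3, base_dev * 1e3,
         new_med * 1e3, new_dev * 1e3, new_med / base_med))
---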
By the way, I also want to change the runner.py CLI to be able to run the benchmarks on a single python version and then use a second command to compare two files, rather than always running each benchmark twice (reference python, changed python). The PyPy runner also works like that if I recall correctly. Victor From victor.stinner at gmail.com Mon Jul 4 11:08:06 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 17:08:06 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: References: Message-ID: 2016-07-04 16:17 GMT+02:00 Victor Stinner : > I modified the CPython benchmark suite to use my perf module: > https://hg.python.org/sandbox/benchmarks_perf Hum, you need the development version of perf to test it: git clone https://github.com/haypo/perf.git > Changes: > > * replace explicit warmups with perf automatic warmup > (...) > * avoid nested loops, prefer a single level of loop: perf is > responsible for calling the sample function enough times to collect enough > samples Concrete example with performance/bm_go.py. Before:
-------------------------
def main(n, timer):
    times = []
    for i in range(5):
        versus_cpu()  # warmup
    for i in range(n):
        t1 = timer()
        versus_cpu()
        t2 = timer()
        times.append(t2 - t1)
    return times
-------------------------
After:
-------------------------
def main(loops):
    t0 = perf.perf_counter()
    for _ in xrange(loops):
        versus_cpu()
    return perf.perf_counter() - t0
-------------------------
Example of go benchmark output: --- $ python3 benchmarks_perf/performance/bm_go.py -v calibration: 1 loop: 599 ms calibration: use 1 loop Run 1/25: warmup (1): 601 ms; raw samples (3): 593 ms, 593 ms, 593 ms Run 2/25: warmup (1): 609 ms; raw samples (3): 609 ms, 610 ms, 608 ms Run 3/25: warmup (1): 599 ms; raw samples (3): 598 ms, 606 ms, 598 ms (...) Run 25/25: warmup (1): 606 ms; raw samples (3): 591 ms, 590 ms, 591 ms Median +- std dev: 598 ms +- 8 ms --- The warmup samples ("warmup (1): ... ms") are not used to compute the median or std dev. Another example to show fancy features of perf: --- $ python3 benchmarks_perf/performance/bm_telco.py -v --hist --stats --metadata -n5 -p50 calibration: 1 loop: 34.6 ms calibration: 2 loops: 57.8 ms calibration: 4 loops: 105 ms calibration: use 4 loops Run 1/50: warmup (1): 116 ms; raw samples (5): 106 ms, 106 ms, 105 ms, 106 ms, 106 ms Run 2/50: warmup (1): 107 ms; raw samples (5): 107 ms, 107 ms, 106 ms, 106 ms, 106 ms Run 3/50: warmup (1): 107 ms; raw samples (5): 106 ms, 106 ms, 106 ms, 106 ms, 106 ms (...) 
Run 50/50: warmup (1): 106 ms; raw samples (5): 104 ms, 105 ms, 105 ms, 106 ms, 105 ms Metadata: - aslr: enabled - cpu_count: 4 - cpu_model_name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - date: 2016-07-04T17:00:33 - description: Test the performance of the Telco decimal benchmark - duration: 35.6 sec - hostname: smithers - name: telco - perf_version: 0.6 - platform: Linux-4.5.7-300.fc24.x86_64-x86_64-with-fedora-24-Twenty_Four - python_executable: /usr/bin/python3 - python_implementation: cpython - python_version: 3.5.1 (64bit) - timer: clock_gettime(CLOCK_MONOTONIC), resolution: 1.00 ns 25.8 ms: 1 ## 25.9 ms: 2 ##### 26.0 ms: 4 ########## 26.0 ms: 13 ############################### 26.1 ms: 27 ################################################################# 26.2 ms: 28 ################################################################### 26.3 ms: 21 ################################################## 26.3 ms: 25 ############################################################ 26.4 ms: 32 ############################################################################# 26.5 ms: 33 ############################################################################### 26.6 ms: 18 ########################################### 26.6 ms: 13 ############################### 26.7 ms: 8 ################### 26.8 ms: 8 ################### 26.8 ms: 7 ################# 26.9 ms: 4 ########## 27.0 ms: 4 ########## 27.1 ms: 1 ## 27.1 ms: 0 | 27.2 ms: 0 | 27.3 ms: 1 ## Number of samples: 250 (50 runs x 5 samples; 1 warmup) Standard deviation / median: 1% Shortest raw sample: 103 ms (4 loops) Minimum: 25.9 ms (-2.1%) Median +- std dev: 26.4 ms +- 0.2 ms Maximum: 27.3 ms (+3.4%) Median +- std dev: 26.4 ms +- 0.2 ms --- I used " -n5 -p50" to compute 5 samples per process and use 50 processes. It helps to get a nicer histogram :-) (to have a better uniform distribution) For histogram, I like using telco because it generates a regular gaussian curve :-) Victor From brett at python.org Mon Jul 4 13:32:52 2016 From: brett at python.org (Brett Cannon) Date: Mon, 04 Jul 2016 17:32:52 +0000 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: References: Message-ID: I just wanted to quickly say, Victor, this all sounds great! On Mon, 4 Jul 2016 at 07:17 Victor Stinner wrote: > Hi, > > I modified the CPython benchmark suite to use my perf module: > https://hg.python.org/sandbox/benchmarks_perf > > > Changes: > > * use statistics.median() rather than mean() to compute of "average" > of samples. Example: > > Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower > > * replace compat.py with external six dependency > * replace util.py with perf > * replace explicit warmups with perf automatic warmup > * add name metadata > * for benchmark taking parameters, save parameters in metadata > * avoid nested loops, prefer a single level of loop: perf is > responsible to call the sample function enough times to collect enough > samples > * store django and mako version in metadata > * use JSON format to exchange timings between benchmarks and runner.py > > > perf adds more features: > > * run each benchmark in multiple processes (25 by default, 50 in rigorous > mode) > * calibrate each benchmark to compute the number of loops to get a > sample between 100 ms and 1 second > > > TODO: > > * Right now the calibration in done twice: in the reference python and > in the changed python. It should only be once in the reference python > * runner.py should write results in a JSON file. 
Currently, data are > not written on disk (a pipe is used with child processes) > * Drop external dependencies and create a virtual environment per python > * Port more Python 2-only benchmarks to Python 3 > * Add more benchmarks from PyPy, Pyston and Pyjion benchmark suites: > unify again the benchmark suites :-) > > > perf has built-in tools to analyze the distribution of samples: > > * add --hist option to a benchmark to display a histogram in text mode > * add --stats option to a benchmark to display statistics: number of > samples, shortest raw sample, min, max, etc. > * the "python3 -m perf" CLI has many commands to analyze a benchmark: > http://perf.readthedocs.io/en/latest/cli.html > > > Right now, perf JSON format is only able to store one benchmark. I > will extend the format to be able to store a list of benchmarks. So it > will be possible to store all results of a python version into a > single file. > > By the way, I also want to change the runner.py CLI to be able to run the > benchmarks on a single python version and then use a second command to > compare two files, rather than always running each benchmark twice > (reference python, changed python). The PyPy runner also works like that > if I recall correctly. > > Victor From solipsis at pitrou.net Mon Jul 4 13:49:52 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 4 Jul 2016 19:49:52 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: Message-ID: <20160704194952.3d30dc79@fsol> On Mon, 4 Jul 2016 16:17:23 +0200 Victor Stinner wrote: > Changes: > > * use statistics.median() rather than mean() to compute the "average" > of samples. Example: > > Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower That doesn't sound like a terrific idea. Why do you think the median gives a more interesting figure here? (please note that median() doesn't compute an "average" at all...) > * replace compat.py with external six dependency I would suggest vendoring six, to avoid adding dependencies. > * use JSON format to exchange timings between benchmarks and runner.py That's a very nice improvement. > TODO: > > * Right now the calibration is done twice: in the reference python and > in the changed python. It should only be done once, in the reference python I think doing calibration in each interpreter is the right thing to do, because the two interpreters may have very different performance characteristics (say one is 10x faster than the other). Regards Antoine. From victor.stinner at gmail.com Mon Jul 4 16:51:11 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 22:51:11 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160704194952.3d30dc79@fsol> References: <20160704194952.3d30dc79@fsol> Message-ID: 2016-07-04 19:49 GMT+02:00 Antoine Pitrou : >> Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower > > That doesn't sound like a terrific idea. Why do you think the median > gives a more interesting figure here? When the distribution is symmetric, mean and median are the same. In my experience with Python benchmarks, the curve is usually skewed: the right tail is much longer. When the system noise is high, the skewness is much larger. In this case, the median looks "more correct". IMO it helps to reduce the system noise. 
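[Editorial aside: a concrete illustration of the skewness argument, using purely invented sample values rather than real benchmark data. A long right tail drags the mean upward, while the median stays close to the typical timing.]
---
# Illustrative sketch only: invented sample values.
import statistics

# 18 "quiet" samples around 100 ms plus 2 noisy outliers (long right tail)
samples = [100, 101, 99, 100, 102, 100, 101, 100, 99, 100,
           101, 100, 100, 102, 99, 100, 101, 100, 150, 180]

print("mean:   %.1f ms" % statistics.mean(samples))    # 106.8 ms, dragged up by the tail
print("median: %.1f ms" % statistics.median(samples))  # 100.0 ms
---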
See the graphics and the discussion for the details: https://github.com/haypo/perf/issues/1 >> * replace compat.py with external six dependency > > I would suggest vendoring six, to avoid adding dependencies. Ah, that's a different topic. I'm more in favor of dropping vendored copies of libraries and rather getting them from PyPI using a virtualenv. It should make the benchmark repository smaller and allow upgrading dependencies more easily. What do you think? >> TODO: >> >> * Right now the calibration is done twice: in the reference python and >> in the changed python. It should only be done once, in the reference python > > I think doing calibration in each interpreter is the right thing to do, > because the two interpreters may have very different performance > characteristics (say one is 10x faster than the other). Ah yes, maybe. It's true that the telco benchmark is *much* faster on Python 3. Anyway, the result is normalized per loop iteration: raw sample / loops. By the way, perf has an "inner-loops" parameter for micro-benchmarks which duplicates an instruction N times to reduce the overhead of loops. Victor From solipsis at pitrou.net Tue Jul 5 04:08:52 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 10:08:52 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> Message-ID: <20160705100852.39358967@fsol> On Mon, 4 Jul 2016 22:51:11 +0200 Victor Stinner wrote: > 2016-07-04 19:49 GMT+02:00 Antoine Pitrou : > >> Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower > > > > That doesn't sound like a terrific idea. Why do you think the median > > gives a more interesting figure here? > > When the distribution is symmetric, mean and median are the same. In my > experience with Python benchmarks, the curve is usually skewed: the > right tail is much longer. > > When the system noise is high, the skewness is much larger. In this > case, the median looks "more correct". It "looks" more correct? Let's say your Python implementation has a flaw: it is almost always fast, but every 10 runs, it becomes 3x slower. Taking the mean will reflect the occasional slowness. Taking the median will completely hide it. Then of course, since you have several processes and several runs per process, you could try something more convoluted, such as mean-of-medians or mean-of-mins or... However, if you're concerned by system noise, there may be other ways to avoid it. For example, measure both CPU time and wall time, and if CPU time < 0.9 * wall time (for example), ignore the number and take another measurement. (this assumes all benchmarks are CPU-bound - which they should be here - and single-threaded - which they *probably* are, except in a hypothetical parallelizing Python implementation ;-))) Regards Antoine. From victor.stinner at gmail.com Tue Jul 5 05:35:30 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 5 Jul 2016 11:35:30 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160705100852.39358967@fsol> References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> Message-ID: 2016-07-05 10:08 GMT+02:00 Antoine Pitrou : >> When the system noise is high, the skewness is much larger. In this >> case, the median looks "more correct". > > It "looks" more correct? My main worry is to get reproducible, "stable" benchmark results. I started to work on perf because most results of the CPython benchmark suite just looked like pure noise. 
It became very hard for me to decide whether it's my fault, or whether my change makes Python slower or faster. I'm not talking about specific benchmarks which are obviously much faster or much slower, but about all the small changes between -5% and +5%. It looks like the median helps to reduce the effect of outliers. > Let's say your Python implementation has a flaw: it is almost always > fast, but every 10 runs, it becomes 3x slower. Taking the mean will > reflect the occasional slowness. Taking the median will completely > hide it. I'm not sure that the median will completely hide such behaviour. Moreover, I modified the benchmark suite to always display the standard deviation just after the median. The standard deviation should help to detect a large variation. In practice, it almost never occurs to have all samples with the same value. There is always a statistic distribution, usually as a gaussian curse. The question is what is the best way to "summarize" a curve with two numbers. I add a constraint: I also want to reduce the system noise. > Then of course, since you have several processes and several runs per > process, you could try something more convoluted, such as > mean-of-medians or mean-of-mins or... I don't know these functions. I also prefer to consider each sample individually and to only apply a function on the whole series of samples. > However, if you're concerned by system noise, there may be other ways > to avoid it. For example, measure both CPU time and wall time, and if > CPU time < 0.9 * wall time (for example), ignore the number and take > another measurement. > > (this assumes all benchmarks are CPU-bound - which they should be here > - and single-threaded - which they *probably* are, except in a > hypothetical parallelizing Python implementation ;-))) CPU isolation helps a lot to reduce the system noise, but it requires "complex" system tuning. I don't expect that users will do it, especially users of timeit. I don't think that CPU time is generic enough to put it in the perf module. I would prefer not to restrict myself to CPU-bound benchmarks. But the perf module already warns users when it detects that the benchmark looks too unstable. See the example at the end of: http://perf.readthedocs.io/en/latest/perf.html#runs-samples-warmups-outter-and-inner-loops Or try: "python3 -m perf.timeit --loops=10 pass". Currently, I'm using the shortest raw sample (>= 1 ms) and standard deviation / median (< 10%). Someone suggested that I compare the minimum and the maximum to the median. You can already see that using perf stats: ------------------ $ python3 -m perf show --stats perf/tests/telco.json Number of samples: 250 (50 runs x 5 samples; 1 warmup) Standard deviation / median: 1% Shortest raw sample: 264 ms (10 loops) Minimum: 26.4 ms (-1.8%) Median +- std dev: 26.9 ms +- 0.2 ms Maximum: 27.3 ms (+1.7%) Median +- std dev: 26.9 ms +- 0.2 ms ------------------ => -1.8% and +1.7% numbers for the minimum and maximum When you get outliers, the numbers are up to 20% for the maximum, or much more. Victor From solipsis at pitrou.net Tue Jul 5 06:08:37 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 12:08:37 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> Message-ID: <20160705120837.5608d5ae@fsol> On Tue, 5 Jul 2016 11:35:30 +0200 Victor Stinner wrote: > > It looks like the median helps to reduce the effect of outliers. 
If you want to reduce the effect of the outliers, you can just remove them: for example, ignore the 5% shortest samples and the 5% longest ones. The median will not only reduce the effect of outliers but also completely ignore the value of most samples *except* the median sample. > In practice, it almost never occurs to have all samples with the same > value. There is always a statistic distribution, usually as a gaussian > curse. If it's a gaussian curve (not a curse, probably :-)), then you can summarize it with two values: the mean and the stddev. But it's probably not a gaussian, because of system noise and other factors, so your assumption is wrong :-) Regards Antoine. From victor.stinner at gmail.com Tue Jul 5 07:55:39 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 5 Jul 2016 13:55:39 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160705120837.5608d5ae@fsol> References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: 2016-07-05 12:08 GMT+02:00 Antoine Pitrou : > If it's a gaussian curve (not a curse, probably :-)), then you can > summarize it with two values: the mean and the stddev. But it's > probably not a gaussian, because of system noise and other factors, so > your assumption is wrong :-) What do you propose? Revert to the average (arithmetic mean) + std dev (centered on the average)? Victor From ncoghlan at gmail.com Tue Jul 5 08:04:51 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 5 Jul 2016 22:04:51 +1000 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160705120837.5608d5ae@fsol> References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: On 5 July 2016 at 20:08, Antoine Pitrou wrote: > On Tue, 5 Jul 2016 11:35:30 +0200 > Victor Stinner > wrote: >> In practice, it almost never occurs to have all samples with the same >> value. There is always a statistic distribution, usually as a gaussian >> curse. > > If it's a gaussian curve (not a curse, probably :-)), then you can > summarize it with two values: the mean and the stddev. But it's > probably not a gaussian, because of system noise and other factors, so > your assumption is wrong :-) If you haven't already, I highly recommend reading the discussion in https://github.com/haypo/perf/issues/1 that led to Victor adopting the current median + stddev approach. As Mahmoud noted there, in terms of really understanding the benchmark results, there's no substitute for actually looking at the histograms with the result distributions. The numeric results are never going to be able to do more than provide a "flavour" for those results, since the distributions aren't Gaussian, but trying to characterise and describe them properly would inevitably confuse folks that aren't already expert statisticians. The median + stddev approach helps convey a "typical" result better than the minimum or mean do, while also providing an indication when the variation in results is too high for the median to really be meaningful. Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From solipsis at pitrou.net Tue Jul 5 08:07:49 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 14:07:49 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: <20160705140749.5f70622d@fsol> On Tue, 5 Jul 2016 13:55:39 +0200 Victor Stinner wrote: > 2016-07-05 12:08 GMT+02:00 Antoine Pitrou : > > If it's a gaussian curve (not a curse, probably :-)), then you can > > summarize it with two values: the mean and the stddev. But it's > > probably not a gaussian, because of system noise and other factors, so > > your assumption is wrong :-) > > What do you propose? Revert to the average (arithmetic mean) + std dev > (centered on the average)? Yes. And if you want to ignore outliers, just remove them: remove the 5% smallest and 5% largest samples. Regards Antoine. From solipsis at pitrou.net Tue Jul 5 08:10:15 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 14:10:15 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: <20160705141015.3855420f@fsol> On Tue, 5 Jul 2016 22:04:51 +1000 Nick Coghlan wrote: > > The median + stddev approach helps convey a "typical" result better > than the minimum or mean do, while also providing an indication when > the variation in results is too high for the median to really be > meaningful. This is missing the primary goal, which is to compare results between implementations. For this you need a single number, and the median is a poor indication of overall performance (because it totally ignores the actual distribution shape). Providing detailed statistical information (median, mean, deviation, quartiles, etc.) about each benchmark run is useful in itself, but a secondary concern for most uses of the benchmark suite. Regards Antoine. From victor.stinner at gmail.com Wed Jul 6 12:17:49 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 6 Jul 2016 18:17:49 +0200 Subject: [Speed] perf 0.6 released Message-ID: Hi, I'm pleased to announce the release of the Python perf module version 0.6. The main difference is that the JSON format and the perf command line tools now support benchmark suites, not only individual benchmarks. I added a new "convert" command which can modify a benchmark file: remove benchmarks, remove benchmark runs, and a special "remove outliers" operation. I'm not sure that removing outliers is a good practice; I will have to play with it to give you feedback :-) I added --fast and --rigorous options: simple options to configure the number of processes and the number of samples per process. The idea of these options comes from the CPython benchmark suite. I added --hist and --stats options to TextRunner, so it's now possible to directly render a histogram and compute statistics on a benchmark (without having to use a file). Finally, the --json-append option appends a benchmark to an existing benchmark suite file. It makes it possible to "concatenate" multiple benchmarks into a single JSON file. timeit example showing the new features: --- $ python3 -m perf timeit -s 'x=" abc"' 'x.strip()' --stats --hist -v --rigorous --json=timeit.json calibration: 1 loop: 2.40 us calibration: 2 loops: 1.60 us (...) 
calibration: 2^20 loops: 118 ms calibration: use 2^20 loops Run 1/20: warmup (1): 116 ms; raw samples (5): 116 ms, 118 ms, 116 ms, 116 ms, 136 ms (+17%) Run 2/20: warmup (1): 119 ms; raw samples (5): 115 ms, 116 ms, 121 ms, 118 ms, 128 ms (+9%) Run 3/20: warmup (1): 272 ms; raw samples (5): 208 ms (+76%), 117 ms, 121 ms, 122 ms, 119 ms (...) Run 20/20: warmup (1): 140 ms; raw samples (5): 116 ms, 115 ms, 115 ms, 116 ms, 115 ms 106 ns: 38 ################################# 111 ns: 44 ###################################### 115 ns: 8 ####### 119 ns: 2 ## 124 ns: 5 #### 128 ns: 2 ## 133 ns: 0 | 137 ns: 0 | 141 ns: 0 | 146 ns: 0 | 150 ns: 0 | 155 ns: 0 | 159 ns: 0 | 164 ns: 0 | 168 ns: 0 | 172 ns: 0 | 177 ns: 0 | 181 ns: 0 | 186 ns: 0 | 190 ns: 0 | 195 ns: 1 # Number of samples: 100 (20 runs x 5 samples; 1 warmup) Loop iterations per sample: 2^20 Raw sample minimum: 115 ms Raw sample maximum: 208 ms Minimum: 110 ns (-1%) Median +- std dev: 111 ns +- 10 ns Mean +- std dev: 114 ns +- 10 ns Maximum: 198 ns (+79%) Median +- std dev: 111 ns +- 10 ns --- The list of runs now highlight outliers by showing the percent for samples out of the range [median - 5%; median + 5%]. Example: "raw samples (5): 208 ms (+76%)". Example of removing outliers: --- $ python3 -m perf convert timeit.json --remove-outliers -o timeit2.json haypo at selma$ python3 -m perf show --hist --stats -v timeit2.json Run 1/12: warmup (1): 124 ms; raw samples (5): 117 ms, 119 ms, 118 ms, 117 ms, 118 ms Run 2/12: warmup (1): 119 ms; raw samples (5): 119 ms, 117 ms, 117 ms, 118 ms, 117 ms Run 3/12: warmup (1): 116 ms; raw samples (5): 117 ms, 116 ms, 116 ms, 117 ms, 118 ms (...) Run 12/12: warmup (1): 140 ms; raw samples (5): 116 ms, 115 ms, 115 ms, 116 ms, 115 ms 110 ns: 10 ########################### 110 ns: 14 ###################################### 110 ns: 10 ########################### 111 ns: 7 ################### 111 ns: 2 ##### 111 ns: 1 ### 112 ns: 5 ############## 112 ns: 3 ######## 112 ns: 1 ### 112 ns: 1 ### 113 ns: 3 ######## 113 ns: 0 | 113 ns: 1 ### 114 ns: 1 ### 114 ns: 0 | 114 ns: 0 | 115 ns: 0 | 115 ns: 0 | 115 ns: 0 | 115 ns: 0 | 116 ns: 1 ### Number of samples: 60 (12 runs x 5 samples; 1 warmup) Loop iterations per sample: 2^20 Raw sample minimum: 115 ms Raw sample maximum: 121 ms Minimum: 110 ns (-1%) Median +- std dev: 111 ns +- 1 ns Mean +- std dev: 111 ns +- 1 ns Maximum: 116 ns (+5%) Median +- std dev: 111 ns +- 1 ns --- Without outliers, the histogram "looks better" but it changed a lot the standard deviation (11 ns => 1 ns). Victor From victor.stinner at gmail.com Wed Jul 6 12:25:59 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 6 Jul 2016 18:25:59 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: References: Message-ID: 2016-07-04 16:17 GMT+02:00 Victor Stinner : > I modified the CPython benchmark suite to use my perf module: > https://hg.python.org/sandbox/benchmarks_perf Updates with the release of perf 0.6. runner.py now has 3 commands: run, compare, run_compare * "run" runs benchmarks on a single python, result can be written into a file * "compare" takes two JSON files as input and compares them * "run_compare" is the previous default behaviour: run benchmarks on two python versions and then compare results. The results can also be saved into two JSON files The main advantage is that it's now possible to only run the benchmark suite once on the baseline python, rather than having to run it each time. 
So each comparison to a changed python (run+compare) should simply be twice as fast. It also becomes possible to exchange full benchmark results (all samples of all processes) as files, rather than just summaries (median +- std dev lines) as text. TODO: * update the remaining benchmarks (3 special benchmarks are currently broken) * rework the code to compare benchmarks * repair the memory tracking feature? * continue the implementation using virtual environments and external dependencies Victor From solipsis at pitrou.net Wed Jul 6 12:41:05 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 Jul 2016 18:41:05 +0200 Subject: [Speed] perf 0.6 released References: Message-ID: <20160706184105.27a73b88@fsol> On Wed, 6 Jul 2016 18:17:49 +0200 Victor Stinner wrote: > > The list of runs now highlights outliers by showing the percent for > samples out of the range [median - 5%; median + 5%]. Example: "raw > samples (5): 208 ms (+76%)". I'm not sure this is meant to implement my suggestion from the other thread, but if so, there is a misunderstanding: I did not suggest to remove the samples outside of the range [median - 5%; median + 5%]. I suggested to remove the 5% smallest and the 5% largest samples. Regards Antoine. From victor.stinner at gmail.com Wed Jul 6 16:16:43 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 6 Jul 2016 22:16:43 +0200 Subject: [Speed] perf 0.6 released In-Reply-To: <20160706184105.27a73b88@fsol> References: <20160706184105.27a73b88@fsol> Message-ID: 2016-07-06 18:41 GMT+02:00 Antoine Pitrou : > I'm not sure this is meant to implement my suggestion from the other > thread, Yes, I implemented this after the discussion we had in the other thread. > but if so, there is a misunderstanding: I did not suggest to > remove the samples outside of the range [median - 5%; median + 5%]. I > suggested to remove the 5% smallest and the 5% largest samples. I tried something to remove outliers, but I didn't try to implement what you suggested. 5% smallest / 5% largest: do you mean something like sorting all samples and removing items from the two tails? Something like sorted(samples)[3:-3] ? Victor From solipsis at pitrou.net Wed Jul 6 16:24:40 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 Jul 2016 22:24:40 +0200 Subject: [Speed] perf 0.6 released References: <20160706184105.27a73b88@fsol> Message-ID: <20160706222440.027ec686@fsol> On Wed, 6 Jul 2016 22:16:43 +0200 Victor Stinner wrote: > > but if so, there is a misunderstanding: I did not suggest to > > remove the samples outside of the range [median - 5%; median + 5%]. I > > suggested to remove the 5% smallest and the 5% largest samples. > > I tried something to remove outliers. I didn't try to implement what > you suggested. > > 5% smallest/5% largest: do you mean something like sorting all > samples, remove items from the two tails? > > Something like sorted(samples)[3:-3] ? Yes. Regards Antoine. From victor.stinner at gmail.com Wed Jul 6 19:21:34 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 7 Jul 2016 01:21:34 +0200 Subject: [Speed] perf 0.6 released In-Reply-To: <20160706222440.027ec686@fsol> References: <20160706184105.27a73b88@fsol> <20160706222440.027ec686@fsol> Message-ID: 2016-07-06 22:24 GMT+02:00 Antoine Pitrou : >> 5% smallest/5% largest: do you mean something like sorting all >> samples, remove items from the two tails? >> >> Something like sorted(samples)[3:-3] ? > > Yes. Hum, it may work if the distribution is uniform (symmetric), but usually the right tail is much longer. 
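[Editorial aside: a small sketch of the trimming idea discussed above, with invented sample values; this is not part of perf.]
---
# Illustrative sketch: drop the 5% smallest and 5% largest samples before averaging.
import statistics

def trimmed(samples, percent=5):
    n = len(samples)
    k = n * percent // 100        # how many samples to drop on each tail
    s = sorted(samples)
    return s[k:n - k] if k else s

# 100 samples in ms, with a long right tail
samples = [100] * 55 + [101] * 30 + [102] * 10 + [140] * 5

print("%.1f" % statistics.mean(samples))           # 102.5: pulled up by the tail
print("%.1f" % statistics.mean(trimmed(samples)))  # 100.6: tail mostly removed
print("%.1f" % statistics.median(samples))         # 100.0
---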
Victor From victor.stinner at gmail.com Mon Jul 18 18:49:07 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 19 Jul 2016 00:49:07 +0200 Subject: [Speed] perf module 0.7 released Message-ID: Hi, I released perf 0.7 (quickly followed by a 0.7.1 bugfix): http://perf.readthedocs.io/ I wrote this new version to collect more data in each process. It now reads (and stores) the CPU config, CPU temperature, CPU frequency, system load average, etc. Later we can add, for example, the process RSS peak or other useful metrics. Oh, and the timestamp is now stored per process (run). Again, it's no longer global. I noticed a temporary slowdown which might be caused by a cron task, I'm not sure yet. At least, timestamps should help to debug such issues. I added many CPU metrics because I wanted to analyze why *sometimes* a benchmark suddenly becomes 50% slower (up to 100% slower). It may be related to the CPU temperature or Intel Turbo Boost, I don't know yet exactly. The previous perf design didn't allow storing information per process, only globally per benchmark. perf 0.7 now has much better support for benchmark suites (not only individual benchmarks) and a really working --append command. A benchmark file doesn't have enough runs? Run it again with --append! Changes: * new "pybench" command (similar to "python3 -m perf ...") * --append is now safer and works on benchmark suites * most perf commands now support multiple files and support benchmark suites (not only individual benchmarks) * new dump command and --dump option to display runs * new metadata: cpu_config, cpu_freq, cpu_temp, load_avg_1min In the meantime, I also completed and updated my fork of the CPython benchmark suite: https://hg.python.org/sandbox/benchmarks_perf Victor From victor.stinner at gmail.com Thu Jul 28 13:19:13 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 28 Jul 2016 19:19:13 +0200 Subject: [Speed] Tracking memory usage Message-ID: Hi, tl;dr Do you use "perf.py --track_memory"? If yes, for which purpose? Are you using it on Windows or Linux? I'm working on the CPython benchmark suite. It has a --track_memory command line option to measure the peak of the memory usage. A main process runs worker processes and tracks their memory usage. On Linux, the main process reads the "private data" from /proc/pid/smaps of a worker process. It uses a busy loop: it reads /proc/pid/smaps as fast as possible (with no sleep)! On Windows, the PeakPagefileUsage field of GetProcessMemoryInfo(process_handle) is used. It uses a loop with a 1 ms sleep. Do you think that the Linux implementation is reliable? What happens if the worker process only reaches its peak for 1 ms but the main process (the watcher) reads the memory usage every 10 ms? The exact value probably also depends a lot on how the operating system computes the memory usage. RSS is very different from PSS (proportional set size), for example. Linux also has "USS" (unshared memory)... I would prefer to implement the code to track memory in the worker process directly. On Windows, it looks reliable to get the peak after each run. On Linux, it is less clear. Should I use a thread reading /proc/self/smaps in a busy loop? For me, the most reliable option is to use tracemalloc to get the peak of the *Python* memory usage. But this module is only available on Python 3.4 and newer. Another issue is that it slows down the code a lot (something like 2x slower!). 
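[Editorial aside: a minimal sketch of the tracemalloc approach mentioned above (Python 3.4+), measuring the peak of traced Python allocations inside the worker process itself. The workload() function is only a placeholder, not code from the benchmark suite.]
---
# Minimal sketch (Python 3.4+): peak of traced Python allocations,
# measured in the worker process itself. workload() is a placeholder.
import tracemalloc

def workload():
    # stand-in for the benchmarked code
    data = [bytes(1000) for _ in range(10000)]
    return len(data)

tracemalloc.start()
workload()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print("peak of traced Python allocations: %.1f MiB" % (peak / (1024.0 * 1024.0)))
---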
I guess that there are two use cases: - read coarse memory usage without hurting performance - read precise memory usage, ignoring performance Victor From victor.stinner at gmail.com Thu Jul 28 13:24:44 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 28 Jul 2016 19:24:44 +0200 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. Message-ID: Hi, I updated all benchmarks of the CPython Benchmark Suite to use my perf module. So you get the timings of all individual runs of *all* benchmarks and can store them in JSON to analyze them in detail. Each benchmark has a full CLI; for example, it gets a new --output option to store results as JSON directly. But it also gets nice functions like --hist for histograms, --stats for statistics, etc. The two remaining questions are: * Should it support --track_memory? It doesn't support it correctly right now, it's less precise than before (memory is tracked directly in worker processes, no longer by the main process) => discuss this point in my previous email * Should we remove vendor copies of libraries and work with virtual environments? Not all libraries are available on PyPI :-/ See the requirements.txt file and TODO. My repository: https://hg.python.org/sandbox/benchmarks_perf I would like to push my work as a single giant commit. Brett also proposed moving the benchmarks repository to GitHub (and so converting it to Git). I don't know if it's appropriate to do all these things at once. What do you think? Reminder: My final goal is to once again merge all the benchmark suites (CPython, PyPy, Pyston, Pyjion, ...) into one unique project! Victor From victor.stinner at gmail.com Fri Jul 29 12:58:19 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 29 Jul 2016 18:58:19 +0200 Subject: [Speed] Tracking memory usage In-Reply-To: References: Message-ID: I modified my perf module to add two new options: --tracemalloc and --track-memory. --tracemalloc enables tracemalloc and gets the peak of the traced Python memory allocations: the peak is computed per process. --track-memory is similar but reads PeakPagefileUsage of GetProcessMemoryInfo() on Windows or private data from /proc/self/smaps on Linux. The read is done every millisecond (1 ms) in a thread, in the worker process. It's not perfect, but it should be "as good" as the "old" CPython benchmark suite. And it makes the benchmark suite simpler because tracking memory usage is now done automatically by the perf module. Victor From brett at python.org Fri Jul 29 13:03:10 2016 From: brett at python.org (Brett Cannon) Date: Fri, 29 Jul 2016 17:03:10 +0000 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: On Thu, 28 Jul 2016 at 10:25 Victor Stinner wrote: > Hi, > > I updated all benchmarks of the CPython Benchmark Suite to use my perf > module. So you get the timings of all individual runs of *all* benchmarks > and can store them in JSON to analyze them in detail. Each benchmark has a > full CLI; for example, it gets a new --output option to store results as > JSON directly. But it also gets nice functions like --hist for > histograms, --stats for statistics, etc. > > The two remaining questions are: > > * Should it support --track_memory? It doesn't support it correctly > right now, it's less precise than before (memory is tracked directly > in worker processes, no longer by the main process) => discuss this > point in my previous email > I don't have an opinion as I have never gotten to use the old feature. 
> > * Should we remove vendor copies of libraries and work with virtual > environments? Not all libraries are available on PyPI :-/ See the > requirements.txt file and TODO. > If they are not on PyPI then we should just drop the benchmark. And I say we do use virtual environments to keep the repo size down. > > My repository: > https://hg.python.org/sandbox/benchmarks_perf > > I would like to push my work as a single giant commit. > > Brett also proposed moving the benchmarks repository to GitHub > (and so converting it to Git). I don't know if it's appropriate to do all > these things at once. What do you think? > I say just start a new repo from scratch. There isn't a ton of magical history in the hg repo that I think we need to have carried around in the git repo. Plus, if we stop shipping project source with the repo, then it will be immensely smaller if we start from scratch. > > Reminder: My final goal is to once again merge all the benchmark suites > (CPython, PyPy, Pyston, Pyjion, ...) into one unique project! I hope this happens! From victor.stinner at gmail.com Fri Jul 29 13:09:43 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 29 Jul 2016 19:09:43 +0200 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: 2016-07-29 19:03 GMT+02:00 Brett Cannon : >> * Should it support --track_memory? It doesn't support it correctly >> right now, it's less precise than before (memory is tracked directly >> in worker processes, no longer by the main process) => discuss this >> point in my previous email > > I don't have an opinion as I have never gotten to use the old feature. As I wrote in my other email, I implemented this feature in perf, so the benchmark suite will get it for free. The implementation is not complete, but it's working ;-) >> * Should we remove vendor copies of libraries and work with virtual >> environments? Not all libraries are available on PyPI :-/ See the >> requirements.txt file and TODO. > > If they are not on PyPI then we should just drop the benchmark. And I say we > do use virtual environments to keep the repo size down. Right. We can start with "a subset" of benchmarks and enlarge the test suite later, and even reimport old benchmarks whose dependencies are not on PyPI on a case-by-case basis. Right now, I only wrote requirements.txt. I didn't touch the Python code which still "hardcodes" PYTHONPATH to get the local copy of dependencies. It works if you manually run benchmarks in a virtual environment. I will write some glue to automate things and make the code "just work" before starting a new thing on GitHub. Victor From zachary.ware+pydev at gmail.com Fri Jul 29 14:51:34 2016 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Fri, 29 Jul 2016 13:51:34 -0500 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: On Fri, Jul 29, 2016 at 12:03 PM, Brett Cannon wrote: > On Thu, 28 Jul 2016 at 10:25 Victor Stinner > wrote: >> * Should we remove vendor copies of libraries and work with virtual >> environments? Not all libraries are available on PyPI :-/ See the >> requirements.txt file and TODO. > > If they are not on PyPI then we should just drop the benchmark. And I say we > do use virtual environments to keep the repo size down. 
I think rather than using virtual environments, which aren't truly supported by <3.3 anyway, we should instead make use of pip's --target, --root, and/or --prefix flags (whatever combination it takes, I haven't looked into it deeply) to install the packages into a particular dir which is then added to each benchmarked interpreter's PYTHONPATH. This way, we're sure that each interpreter is running exactly the same code. Either way, I'm for not vendoring libraries. If the library disappears from PyPI, it's probably not an important workload anymore anyway. >> Reminder: My final goal is to once again merge all the benchmark suites >> (CPython, PyPy, Pyston, Pyjion, ...) into one unique project! > > I hope this happens! Me too! I'd also like to get the benchmark runner for speed.python.org set up to build and benchmark as many interpreters as possible. -- Zach From victor.stinner at gmail.com Fri Jul 29 17:42:03 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 29 Jul 2016 23:42:03 +0200 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: 2016-07-29 20:51 GMT+02:00 Zachary Ware : > I think rather than using virtual environments which aren't truly > supported by <3.3 anyway, ... What do you mean? I'm building and destroying dozens of venvs every day at work using tox on Python 2.7. The virtualenv command works well, no? Do you have issues with it? Victor From zachary.ware+pydev at gmail.com Fri Jul 29 17:55:05 2016 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Fri, 29 Jul 2016 16:55:05 -0500 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: On Fri, Jul 29, 2016 at 4:42 PM, Victor Stinner wrote: > 2016-07-29 20:51 GMT+02:00 Zachary Ware : >> I think rather than using virtual environments which aren't truly >> supported by <3.3 anyway, ... > > What do you mean? I'm building and destroying dozens of venvs every day > at work using tox on Python 2.7. The virtualenv command works well, > no? Do you have issues with it? Not in particular, just that 3.3+ have official support for venvs whereas virtualenv is a bit of a hack by necessity. However, the second point is the real reason I'd rather avoid venvs for this: to make sure that the interpreters actually use the same exact code, so that there can't be any setup.py shenanigans that do things differently between versions/implementations. -- Zach From arigo at tunes.org Sat Jul 30 13:48:42 2016 From: arigo at tunes.org (Armin Rigo) Date: Sat, 30 Jul 2016 19:48:42 +0200 Subject: [Speed] Tracking memory usage In-Reply-To: References: Message-ID: Hi Victor, Fwiw, there is some per-OS (and even apparently per-Linux-distribution) solution mentioned here: http://stackoverflow.com/questions/774556/peak-memory-usage-of-a-linux-unix-process For me on Arch Linux, "/usr/bin/time -v CMD" returns a reasonable value in "Maximum resident set size (kbytes)". I guess that on OSes where this works, it gives a zero-overhead, exact answer. A bientôt, Armin.
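[Editorial aside: following up on Armin's point about the maximum resident set size, the same number is also available from inside a Unix process via the stdlib resource module, with no polling overhead. This is only a hedged sketch, not code from the benchmark suite; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.]
---
# Sketch only (Unix): read the peak RSS of the current process.
import resource
import sys

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
if sys.platform == "darwin":
    peak //= 1024   # macOS reports bytes; convert to KiB like Linux
print("peak RSS: %d KiB" % peak)
---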