From victor.stinner at gmail.com Mon Jul 4 04:53:25 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 10:53:25 +0200 Subject: [Speed] bm_pickle: why testing protocol 2? Message-ID: Hi, performance/bm_pickle.py of the CPython benchmark suite uses the pickle protocol 2 by default. Why not always test the highest protocol? In Python 3.5, the highest protocol is 4, which is more efficient than protocol 2. Is it a deliberate choice to test exactly the same thing between Python 2 and Python 3? Victor From solipsis at pitrou.net Mon Jul 4 05:38:11 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 4 Jul 2016 11:38:11 +0200 Subject: [Speed] bm_pickle: why testing protocol 2? References: Message-ID: <20160704113811.3f90a109@fsol> On Mon, 4 Jul 2016 10:53:25 +0200 Victor Stinner wrote: > Hi, > > performance/bm_pickle.py of the CPython benchmark suite uses the > pickle protocol 2 by default. Why not always test the highest > protocol? I think this comes from the Unladen Swallow benchmark suite, and Unladen Swallow was Python 2-only, so protocol 2 *was* the highest protocol in those circumstances. Regards Antoine. From victor.stinner at gmail.com Mon Jul 4 10:17:23 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 16:17:23 +0200 Subject: [Speed] New CPython benchmark suite based on perf Message-ID: Hi, I modified the CPython benchmark suite to use my perf module: https://hg.python.org/sandbox/benchmarks_perf Changes: * use statistics.median() rather than mean() to compute the "average" of samples. Example: Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower * replace compat.py with external six dependency * replace util.py with perf * replace explicit warmups with perf automatic warmup * add name metadata * for benchmarks taking parameters, save the parameters in metadata * avoid nested loops, prefer a single level of loop: perf is responsible for calling the sample function enough times to collect enough samples * store the django and mako versions in metadata * use JSON format to exchange timings between benchmarks and runner.py perf adds more features: * run each benchmark in multiple processes (25 by default, 50 in rigorous mode) * calibrate each benchmark to compute the number of loops to get a sample between 100 ms and 1 second TODO: * Right now the calibration is done twice: in the reference python and in the changed python. It should only be done once, in the reference python * runner.py should write results in a JSON file. Currently, data are not written on disk (a pipe is used with child processes) * Drop external dependencies and create a virtual environment per python * Port more Python 2-only benchmarks to Python 3 * Add more benchmarks from PyPy, Pyston and Pyjion benchmark suites: unify again the benchmark suites :-) perf has built-in tools to analyze the distribution of samples: * add --hist option to a benchmark to display a histogram in text mode * add --stats option to a benchmark to display statistics: number of samples, shortest raw sample, min, max, etc. * the "python3 -m perf" CLI has many commands to analyze a benchmark: http://perf.readthedocs.io/en/latest/cli.html Right now, perf JSON format is only able to store one benchmark. I will extend the format to be able to store a list of benchmarks. So it will be possible to store all results of a python version into a single file. 
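[Editorial aside: here is a minimal sketch of how such a "Median +- Std dev" comparison line can be computed with the stdlib statistics module. It is only an illustration: the sample values are invented and this is not the actual runner.py/perf code.]
---
# Illustrative only: invented timing samples, not the real runner.py code.
import statistics

def summarize(samples):
    # perf reports the median; stdev() is centered on the mean, but it
    # still gives a usable idea of the spread.
    return statistics.median(samples), statistics.stdev(samples)

base = [0.256, 0.255, 0.258, 0.257, 0.256]     # reference python, seconds
changed = [0.262, 0.261, 0.265, 0.263, 0.262]  # changed python, seconds

base_med, base_dev = summarize(base)
new_med, new_dev = summarize(changed)
print("Median +- Std dev: %.0f ms +- %.0f ms -> %.0f ms +- %.0f ms: %.2fx slower"
      % (base_med * 1e3, base_dev * 1e3,
         new_med * 1e3, new_dev * 1e3, new_med / base_med))
---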
By the way, I also want to change the runner.py CLI to be able to run the benchmarks on a single python version and then use a second command to compare two files, rather than always running each benchmark twice (reference python, changed python). The PyPy runner also works like that if I recall correctly. Victor From victor.stinner at gmail.com Mon Jul 4 11:08:06 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 17:08:06 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: References: Message-ID: 2016-07-04 16:17 GMT+02:00 Victor Stinner : > I modified the CPython benchmark suite to use my perf module: > https://hg.python.org/sandbox/benchmarks_perf Hum, you need the development version of perf to test it: git clone https://github.com/haypo/perf.git > Changes: > > * replace explicit warmups with perf automatic warmup > (...) > * avoid nested loops, prefer a single level of loop: perf is > responsible for calling the sample function enough times to collect enough > samples Concrete example with performance/bm_go.py. Before:
-------------------------
def main(n, timer):
    times = []
    for i in range(5):
        versus_cpu()  # warmup
    for i in range(n):
        t1 = timer()
        versus_cpu()
        t2 = timer()
        times.append(t2 - t1)
    return times
-------------------------
After:
-------------------------
def main(loops):
    t0 = perf.perf_counter()
    for _ in xrange(loops):
        versus_cpu()
    return perf.perf_counter() - t0
-------------------------
Example of go benchmark output: --- $ python3 benchmarks_perf/performance/bm_go.py -v calibration: 1 loop: 599 ms calibration: use 1 loop Run 1/25: warmup (1): 601 ms; raw samples (3): 593 ms, 593 ms, 593 ms Run 2/25: warmup (1): 609 ms; raw samples (3): 609 ms, 610 ms, 608 ms Run 3/25: warmup (1): 599 ms; raw samples (3): 598 ms, 606 ms, 598 ms (...) Run 25/25: warmup (1): 606 ms; raw samples (3): 591 ms, 590 ms, 591 ms Median +- std dev: 598 ms +- 8 ms --- The warmup samples ("warmup (1): ... ms") are not used to compute the median or std dev. Another example to show fancy features of perf: --- $ python3 benchmarks_perf/performance/bm_telco.py -v --hist --stats --metadata -n5 -p50 calibration: 1 loop: 34.6 ms calibration: 2 loops: 57.8 ms calibration: 4 loops: 105 ms calibration: use 4 loops Run 1/50: warmup (1): 116 ms; raw samples (5): 106 ms, 106 ms, 105 ms, 106 ms, 106 ms Run 2/50: warmup (1): 107 ms; raw samples (5): 107 ms, 107 ms, 106 ms, 106 ms, 106 ms Run 3/50: warmup (1): 107 ms; raw samples (5): 106 ms, 106 ms, 106 ms, 106 ms, 106 ms (...) 
Run 50/50: warmup (1): 106 ms; raw samples (5): 104 ms, 105 ms, 105 ms, 106 ms, 105 ms Metadata: - aslr: enabled - cpu_count: 4 - cpu_model_name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - date: 2016-07-04T17:00:33 - description: Test the performance of the Telco decimal benchmark - duration: 35.6 sec - hostname: smithers - name: telco - perf_version: 0.6 - platform: Linux-4.5.7-300.fc24.x86_64-x86_64-with-fedora-24-Twenty_Four - python_executable: /usr/bin/python3 - python_implementation: cpython - python_version: 3.5.1 (64bit) - timer: clock_gettime(CLOCK_MONOTONIC), resolution: 1.00 ns 25.8 ms: 1 ## 25.9 ms: 2 ##### 26.0 ms: 4 ########## 26.0 ms: 13 ############################### 26.1 ms: 27 ################################################################# 26.2 ms: 28 ################################################################### 26.3 ms: 21 ################################################## 26.3 ms: 25 ############################################################ 26.4 ms: 32 ############################################################################# 26.5 ms: 33 ############################################################################### 26.6 ms: 18 ########################################### 26.6 ms: 13 ############################### 26.7 ms: 8 ################### 26.8 ms: 8 ################### 26.8 ms: 7 ################# 26.9 ms: 4 ########## 27.0 ms: 4 ########## 27.1 ms: 1 ## 27.1 ms: 0 | 27.2 ms: 0 | 27.3 ms: 1 ## Number of samples: 250 (50 runs x 5 samples; 1 warmup) Standard deviation / median: 1% Shortest raw sample: 103 ms (4 loops) Minimum: 25.9 ms (-2.1%) Median +- std dev: 26.4 ms +- 0.2 ms Maximum: 27.3 ms (+3.4%) Median +- std dev: 26.4 ms +- 0.2 ms --- I used " -n5 -p50" to compute 5 samples per process and use 50 processes. It helps to get a nicer histogram :-) (to have a better uniform distribution) For histogram, I like using telco because it generates a regular gaussian curve :-) Victor From brett at python.org Mon Jul 4 13:32:52 2016 From: brett at python.org (Brett Cannon) Date: Mon, 04 Jul 2016 17:32:52 +0000 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: References: Message-ID: I just wanted to quickly say, Victor, this all sounds great! On Mon, 4 Jul 2016 at 07:17 Victor Stinner wrote: > Hi, > > I modified the CPython benchmark suite to use my perf module: > https://hg.python.org/sandbox/benchmarks_perf > > > Changes: > > * use statistics.median() rather than mean() to compute of "average" > of samples. Example: > > Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower > > * replace compat.py with external six dependency > * replace util.py with perf > * replace explicit warmups with perf automatic warmup > * add name metadata > * for benchmark taking parameters, save parameters in metadata > * avoid nested loops, prefer a single level of loop: perf is > responsible to call the sample function enough times to collect enough > samples > * store django and mako version in metadata > * use JSON format to exchange timings between benchmarks and runner.py > > > perf adds more features: > > * run each benchmark in multiple processes (25 by default, 50 in rigorous > mode) > * calibrate each benchmark to compute the number of loops to get a > sample between 100 ms and 1 second > > > TODO: > > * Right now the calibration in done twice: in the reference python and > in the changed python. It should only be once in the reference python > * runner.py should write results in a JSON file. 
Currently, data are > not written on disk (a pipe is used with child processes) > * Drop external dependencies and create a virtual environment per python > * Port more Python 2-only benchmarks to Python 3 > * Add more benchmarks from PyPy, Pyston and Pyjion benchmark suites: > unify again the benchmark suites :-) > > > perf has built-in tools to analyze the distribution of samples: > > * add --hist option to a benchmark to display a histogram in text mode > * add --stats option to a benchmark to display statistics: number of > samples, shortest raw sample, min, max, etc. > * the "python3 -m perf" CLI has many commands to analyze a benchmark: > http://perf.readthedocs.io/en/latest/cli.html > > > Right now, perf JSON format is only able to store one benchmark. I > will extend the format to be able to store a list of benchmarks. So it > will be possible to store all results of a python version into a > single file. > > By the way, I also want to change the runner.py CLI to be able to run the > benchmarks on a single python version and then use a second command to > compare two files, rather than always running each benchmark twice > (reference python, changed python). The PyPy runner also works like that > if I recall correctly. > > Victor From solipsis at pitrou.net Mon Jul 4 13:49:52 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 4 Jul 2016 19:49:52 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: Message-ID: <20160704194952.3d30dc79@fsol> On Mon, 4 Jul 2016 16:17:23 +0200 Victor Stinner wrote: > Changes: > > * use statistics.median() rather than mean() to compute the "average" > of samples. Example: > > Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower That doesn't sound like a terrific idea. Why do you think the median gives a more interesting figure here? (please note that median() doesn't compute an "average" at all...) > * replace compat.py with external six dependency I would suggest vendoring six, to avoid adding dependencies. > * use JSON format to exchange timings between benchmarks and runner.py That's a very nice improvement. > TODO: > > * Right now the calibration is done twice: in the reference python and > in the changed python. It should only be done once, in the reference python I think doing calibration in each interpreter is the right thing to do, because the two interpreters may have very different performance characteristics (say one is 10x faster than the other). Regards Antoine. From victor.stinner at gmail.com Mon Jul 4 16:51:11 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 4 Jul 2016 22:51:11 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160704194952.3d30dc79@fsol> References: <20160704194952.3d30dc79@fsol> Message-ID: 2016-07-04 19:49 GMT+02:00 Antoine Pitrou : >> Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower > > That doesn't sound like a terrific idea. Why do you think the median > gives a more interesting figure here? When the distribution is symmetric, mean and median are the same. In my experience with Python benchmarks, the curve is usually skewed: the right tail is much longer. When the system noise is high, the skewness is much larger. In this case, the median looks "more correct". IMO it helps to reduce the system noise. 
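[Editorial aside: a concrete illustration of the skewness argument, using purely invented sample values rather than real benchmark data. A long right tail drags the mean upward, while the median stays close to the typical timing.]
---
# Illustrative sketch only: invented sample values.
import statistics

# 18 "quiet" samples around 100 ms plus 2 noisy outliers (long right tail)
samples = [100, 101, 99, 100, 102, 100, 101, 100, 99, 100,
           101, 100, 100, 102, 99, 100, 101, 100, 150, 180]

print("mean:   %.1f ms" % statistics.mean(samples))    # 106.8 ms, dragged up by the tail
print("median: %.1f ms" % statistics.median(samples))  # 100.0 ms
---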
See the graphics and the discussion for the details: https://github.com/haypo/perf/issues/1 >> * replace compat.py with external six dependency > > I would suggest vendoring six, to avoid adding dependencies. Ah, that's a different topic. I'm more in favor of dropping vendored copies of libraries and rather getting them from PyPI using a virtualenv. It should make the benchmark repository smaller and allow upgrading dependencies more easily. What do you think? >> TODO: >> >> * Right now the calibration is done twice: in the reference python and >> in the changed python. It should only be done once, in the reference python > > I think doing calibration in each interpreter is the right thing to do, > because the two interpreters may have very different performance > characteristics (say one is 10x faster than the other). Ah yes, maybe. It's true that the telco benchmark is *much* faster on Python 3. Anyway, the result is normalized per loop iteration: raw sample / loops. By the way, perf has an "inner-loops" parameter for micro-benchmarks which duplicates an instruction N times to reduce the overhead of loops. Victor From solipsis at pitrou.net Tue Jul 5 04:08:52 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 10:08:52 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> Message-ID: <20160705100852.39358967@fsol> On Mon, 4 Jul 2016 22:51:11 +0200 Victor Stinner wrote: > 2016-07-04 19:49 GMT+02:00 Antoine Pitrou : > >> Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower > > > > That doesn't sound like a terrific idea. Why do you think the median > > gives a more interesting figure here? > > When the distribution is symmetric, mean and median are the same. In my > experience with Python benchmarks, the curve is usually skewed: the > right tail is much longer. > > When the system noise is high, the skewness is much larger. In this > case, the median looks "more correct". It "looks" more correct? Let's say your Python implementation has a flaw: it is almost always fast, but every 10 runs, it becomes 3x slower. Taking the mean will reflect the occasional slowness. Taking the median will completely hide it. Then of course, since you have several processes and several runs per process, you could try something more convoluted, such as mean-of-medians or mean-of-mins or... However, if you're concerned by system noise, there may be other ways to avoid it. For example, measure both CPU time and wall time, and if CPU time < 0.9 * wall time (for example), ignore the number and take another measurement. (this assumes all benchmarks are CPU-bound - which they should be here - and single-threaded - which they *probably* are, except in a hypothetical parallelizing Python implementation ;-))) Regards Antoine. From victor.stinner at gmail.com Tue Jul 5 05:35:30 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 5 Jul 2016 11:35:30 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160705100852.39358967@fsol> References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> Message-ID: 2016-07-05 10:08 GMT+02:00 Antoine Pitrou : >> When the system noise is high, the skewness is much larger. In this >> case, the median looks "more correct". > > It "looks" more correct? My main worry is to get reproducible, "stable" benchmark results. I started to work on perf because most results of the CPython benchmark suite just looked like pure noise. 
It became very hard for me to decide whether it's my fault, or whether my change makes Python slower or faster. I'm not talking about specific benchmarks which are obviously much faster or much slower, but about all the small changes between -5% and +5%. It looks like the median helps to reduce the effect of outliers. > Let's say your Python implementation has a flaw: it is almost always > fast, but every 10 runs, it becomes 3x slower. Taking the mean will > reflect the occasional slowness. Taking the median will completely > hide it. I'm not sure that the median will completely hide such behaviour. Moreover, I modified the benchmark suite to always display the standard deviation just after the median. The standard deviation should help to detect a large variation. In practice, it almost never occurs to have all samples with the same value. There is always a statistic distribution, usually as a gaussian curse. The question is what is the best way to "summarize" a curve with two numbers. I add a constraint: I also want to reduce the system noise. > Then of course, since you have several processes and several runs per > process, you could try something more convoluted, such as > mean-of-medians or mean-of-mins or... I don't know these functions. I also prefer to consider each sample individually and to only apply a function on the whole series of samples. > However, if you're concerned by system noise, there may be other ways > to avoid it. For example, measure both CPU time and wall time, and if > CPU time < 0.9 * wall time (for example), ignore the number and take > another measurement. > > (this assumes all benchmarks are CPU-bound - which they should be here > - and single-threaded - which they *probably* are, except in a > hypothetical parallelizing Python implementation ;-))) CPU isolation helps a lot to reduce the system noise, but it requires "complex" system tuning. I don't expect that users will do it, especially users of timeit. I don't think that CPU time is generic enough to put it in the perf module. I would prefer not to restrict myself to CPU-bound benchmarks. But the perf module already warns users when it detects that the benchmark looks too unstable. See the example at the end of: http://perf.readthedocs.io/en/latest/perf.html#runs-samples-warmups-outter-and-inner-loops Or try: "python3 -m perf.timeit --loops=10 pass". Currently, I'm using the shortest raw sample (>= 1 ms) and standard deviation / median (< 10%). Someone suggested that I compare the minimum and the maximum to the median. You can already see that using perf stats: ------------------ $ python3 -m perf show --stats perf/tests/telco.json Number of samples: 250 (50 runs x 5 samples; 1 warmup) Standard deviation / median: 1% Shortest raw sample: 264 ms (10 loops) Minimum: 26.4 ms (-1.8%) Median +- std dev: 26.9 ms +- 0.2 ms Maximum: 27.3 ms (+1.7%) Median +- std dev: 26.9 ms +- 0.2 ms ------------------ => -1.8% and +1.7% numbers for the minimum and maximum When you get outliers, the numbers are up to 20% for the maximum, or much more. Victor From solipsis at pitrou.net Tue Jul 5 06:08:37 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 12:08:37 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> Message-ID: <20160705120837.5608d5ae@fsol> On Tue, 5 Jul 2016 11:35:30 +0200 Victor Stinner wrote: > > It looks like the median helps to reduce the effect of outliers. 
If you want to reduce the effect of the outliers, you can just remove them: for example, ignore the 5% shortest samples and the 5% longest ones. The median will not only reduce the effect of outliers but also completely ignore the value of most samples *except* the median sample. > In practice, it almost never occurs to have all samples with the same > value. There is always a statistic distribution, usually as a gaussian > curse. If it's a gaussian curve (not a curse, probably :-)), then you can summarize it with two values: the mean and the stddev. But it's probably not a gaussian, because of system noise and other factors, so your assumption is wrong :-) Regards Antoine. From victor.stinner at gmail.com Tue Jul 5 07:55:39 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 5 Jul 2016 13:55:39 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160705120837.5608d5ae@fsol> References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: 2016-07-05 12:08 GMT+02:00 Antoine Pitrou : > If it's a gaussian curve (not a curse, probably :-)), then you can > summarize it with two values: the mean and the stddev. But it's > probably not a gaussian, because of system noise and other factors, so > your assumption is wrong :-) What do you propose? Revert to the average (arithmetic mean) + std dev (centered on the average)? Victor From ncoghlan at gmail.com Tue Jul 5 08:04:51 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 5 Jul 2016 22:04:51 +1000 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: <20160705120837.5608d5ae@fsol> References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: On 5 July 2016 at 20:08, Antoine Pitrou wrote: > On Tue, 5 Jul 2016 11:35:30 +0200 > Victor Stinner > wrote: >> In practice, it almost never occurs to have all samples with the same >> value. There is always a statistic distribution, usually as a gaussian >> curse. > > If it's a gaussian curve (not a curse, probably :-)), then you can > summarize it with two values: the mean and the stddev. But it's > probably not a gaussian, because of system noise and other factors, so > your assumption is wrong :-) If you haven't already, I highly recommend reading the discussion in https://github.com/haypo/perf/issues/1 that led to Victor adopting the current median + stddev approach. As Mahmoud noted there, in terms of really understanding the benchmark results, there's no substitute for actually looking at the histograms with the result distributions. The numeric results are never going to be able to do more than provide a "flavour" for those results, since the distributions aren't Gaussian, but trying to characterise and describe them properly would inevitably confuse folks that aren't already expert statisticians. The median + stddev approach helps convey a "typical" result better than the minimum or mean do, while also providing an indication when the variation in results is too high for the median to really be meaningful. Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From solipsis at pitrou.net Tue Jul 5 08:07:49 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 14:07:49 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: <20160705140749.5f70622d@fsol> On Tue, 5 Jul 2016 13:55:39 +0200 Victor Stinner wrote: > 2016-07-05 12:08 GMT+02:00 Antoine Pitrou : > > If it's a gaussian curve (not a curse, probably :-)), then you can > > summarize it with two values: the mean and the stddev. But it's > > probably not a gaussian, because of system noise and other factors, so > > your assumption is wrong :-) > > What do you propose? Revert to the average (arithmetic mean) + std dev > (centered on the average)? Yes. And if you want to ignore outliers, just remove them: remove the 5% smallest and 5% largest samples. Regards Antoine. From solipsis at pitrou.net Tue Jul 5 08:10:15 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 5 Jul 2016 14:10:15 +0200 Subject: [Speed] New CPython benchmark suite based on perf References: <20160704194952.3d30dc79@fsol> <20160705100852.39358967@fsol> <20160705120837.5608d5ae@fsol> Message-ID: <20160705141015.3855420f@fsol> On Tue, 5 Jul 2016 22:04:51 +1000 Nick Coghlan wrote: > > The median + stddev approach helps convey a "typical" result better > than the minimum or mean do, while also providing an indication when > the variation in results is too high for the median to really be > meaningful. This is missing the primary goal, which is to compare results between implementations. For this you need a single number, and the median is a poor indication of overall performance (because it totally ignores the actual distribution shape). Providing detailed statistical information (median, mean, deviation, quartiles, etc.) about each benchmark run is useful in itself, but a secondary concern for most uses of the benchmark suite. Regards Antoine. From victor.stinner at gmail.com Wed Jul 6 12:17:49 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 6 Jul 2016 18:17:49 +0200 Subject: [Speed] perf 0.6 released Message-ID: Hi, I'm pleased to announce the release of the Python perf module version 0.6. The main difference is that the JSON format and the perf command line tools now support benchmark suites, not only individual benchmarks. I added a new "convert" command which can modify a benchmark file: remove benchmarks, remove benchmark runs, and a special "remove outliers" operation. I'm not sure that removing outliers is a good practice; I will have to play with it to give you feedback :-) I added --fast and --rigorous options: simple options to configure the number of processes and the number of samples per process. The idea of these options comes from the CPython benchmark suite. I added --hist and --stats options to TextRunner, so it's now possible to directly render a histogram and compute statistics on a benchmark (without having to use a file). Finally, the --json-append option appends a benchmark to an existing benchmark suite file. It makes it possible to "concatenate" multiple benchmarks into a single JSON file. timeit example showing the new features: --- $ python3 -m perf timeit -s 'x=" abc"' 'x.strip()' --stats --hist -v --rigorous --json=timeit.json calibration: 1 loop: 2.40 us calibration: 2 loops: 1.60 us (...) 
calibration: 2^20 loops: 118 ms calibration: use 2^20 loops Run 1/20: warmup (1): 116 ms; raw samples (5): 116 ms, 118 ms, 116 ms, 116 ms, 136 ms (+17%) Run 2/20: warmup (1): 119 ms; raw samples (5): 115 ms, 116 ms, 121 ms, 118 ms, 128 ms (+9%) Run 3/20: warmup (1): 272 ms; raw samples (5): 208 ms (+76%), 117 ms, 121 ms, 122 ms, 119 ms (...) Run 20/20: warmup (1): 140 ms; raw samples (5): 116 ms, 115 ms, 115 ms, 116 ms, 115 ms 106 ns: 38 ################################# 111 ns: 44 ###################################### 115 ns: 8 ####### 119 ns: 2 ## 124 ns: 5 #### 128 ns: 2 ## 133 ns: 0 | 137 ns: 0 | 141 ns: 0 | 146 ns: 0 | 150 ns: 0 | 155 ns: 0 | 159 ns: 0 | 164 ns: 0 | 168 ns: 0 | 172 ns: 0 | 177 ns: 0 | 181 ns: 0 | 186 ns: 0 | 190 ns: 0 | 195 ns: 1 # Number of samples: 100 (20 runs x 5 samples; 1 warmup) Loop iterations per sample: 2^20 Raw sample minimum: 115 ms Raw sample maximum: 208 ms Minimum: 110 ns (-1%) Median +- std dev: 111 ns +- 10 ns Mean +- std dev: 114 ns +- 10 ns Maximum: 198 ns (+79%) Median +- std dev: 111 ns +- 10 ns --- The list of runs now highlight outliers by showing the percent for samples out of the range [median - 5%; median + 5%]. Example: "raw samples (5): 208 ms (+76%)". Example of removing outliers: --- $ python3 -m perf convert timeit.json --remove-outliers -o timeit2.json haypo at selma$ python3 -m perf show --hist --stats -v timeit2.json Run 1/12: warmup (1): 124 ms; raw samples (5): 117 ms, 119 ms, 118 ms, 117 ms, 118 ms Run 2/12: warmup (1): 119 ms; raw samples (5): 119 ms, 117 ms, 117 ms, 118 ms, 117 ms Run 3/12: warmup (1): 116 ms; raw samples (5): 117 ms, 116 ms, 116 ms, 117 ms, 118 ms (...) Run 12/12: warmup (1): 140 ms; raw samples (5): 116 ms, 115 ms, 115 ms, 116 ms, 115 ms 110 ns: 10 ########################### 110 ns: 14 ###################################### 110 ns: 10 ########################### 111 ns: 7 ################### 111 ns: 2 ##### 111 ns: 1 ### 112 ns: 5 ############## 112 ns: 3 ######## 112 ns: 1 ### 112 ns: 1 ### 113 ns: 3 ######## 113 ns: 0 | 113 ns: 1 ### 114 ns: 1 ### 114 ns: 0 | 114 ns: 0 | 115 ns: 0 | 115 ns: 0 | 115 ns: 0 | 115 ns: 0 | 116 ns: 1 ### Number of samples: 60 (12 runs x 5 samples; 1 warmup) Loop iterations per sample: 2^20 Raw sample minimum: 115 ms Raw sample maximum: 121 ms Minimum: 110 ns (-1%) Median +- std dev: 111 ns +- 1 ns Mean +- std dev: 111 ns +- 1 ns Maximum: 116 ns (+5%) Median +- std dev: 111 ns +- 1 ns --- Without outliers, the histogram "looks better" but it changed a lot the standard deviation (11 ns => 1 ns). Victor From victor.stinner at gmail.com Wed Jul 6 12:25:59 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 6 Jul 2016 18:25:59 +0200 Subject: [Speed] New CPython benchmark suite based on perf In-Reply-To: References: Message-ID: 2016-07-04 16:17 GMT+02:00 Victor Stinner : > I modified the CPython benchmark suite to use my perf module: > https://hg.python.org/sandbox/benchmarks_perf Updates with the release of perf 0.6. runner.py now has 3 commands: run, compare, run_compare * "run" runs benchmarks on a single python, result can be written into a file * "compare" takes two JSON files as input and compares them * "run_compare" is the previous default behaviour: run benchmarks on two python versions and then compare results. The results can also be saved into two JSON files The main advantage is that it's now possible to only run the benchmark suite once on the baseline python, rather than having to run it each time. 
So each comparison to a changed python (run+compare) should simply be twice as fast. It also becomes possible to exchange full benchmark results (all samples of all processes) as files, rather than just summaries (median +- std dev lines) as text. TODO: * update the remaining benchmarks (3 special benchmarks are currently broken) * rework the code to compare benchmarks * repair the memory tracking feature? * continue the implementation using virtual environments and external dependencies Victor From solipsis at pitrou.net Wed Jul 6 12:41:05 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 Jul 2016 18:41:05 +0200 Subject: [Speed] perf 0.6 released References: Message-ID: <20160706184105.27a73b88@fsol> On Wed, 6 Jul 2016 18:17:49 +0200 Victor Stinner wrote: > > The list of runs now highlights outliers by showing the percent for > samples out of the range [median - 5%; median + 5%]. Example: "raw > samples (5): 208 ms (+76%)". I'm not sure this is meant to implement my suggestion from the other thread, but if so, there is a misunderstanding: I did not suggest to remove the samples outside of the range [median - 5%; median + 5%]. I suggested to remove the 5% smallest and the 5% largest samples. Regards Antoine. From victor.stinner at gmail.com Wed Jul 6 16:16:43 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 6 Jul 2016 22:16:43 +0200 Subject: [Speed] perf 0.6 released In-Reply-To: <20160706184105.27a73b88@fsol> References: <20160706184105.27a73b88@fsol> Message-ID: 2016-07-06 18:41 GMT+02:00 Antoine Pitrou : > I'm not sure this is meant to implement my suggestion from the other > thread, Yes, I implemented this after the discussion we had in the other thread. > but if so, there is a misunderstanding: I did not suggest to > remove the samples outside of the range [median - 5%; median + 5%]. I > suggested to remove the 5% smallest and the 5% largest samples. I tried something to remove outliers, but I didn't try to implement what you suggested. 5% smallest / 5% largest: do you mean something like sorting all samples and removing items from the two tails? Something like sorted(samples)[3:-3] ? Victor From solipsis at pitrou.net Wed Jul 6 16:24:40 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 Jul 2016 22:24:40 +0200 Subject: [Speed] perf 0.6 released References: <20160706184105.27a73b88@fsol> Message-ID: <20160706222440.027ec686@fsol> On Wed, 6 Jul 2016 22:16:43 +0200 Victor Stinner wrote: > > but if so, there is a misunderstanding: I did not suggest to > > remove the samples outside of the range [median - 5%; median + 5%]. I > > suggested to remove the 5% smallest and the 5% largest samples. > > I tried something to remove outliers. I didn't try to implement what > you suggested. > > 5% smallest/5% largest: do you mean something like sorting all > samples, remove items from the two tails? > > Something like sorted(samples)[3:-3] ? Yes. Regards Antoine. From victor.stinner at gmail.com Wed Jul 6 19:21:34 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 7 Jul 2016 01:21:34 +0200 Subject: [Speed] perf 0.6 released In-Reply-To: <20160706222440.027ec686@fsol> References: <20160706184105.27a73b88@fsol> <20160706222440.027ec686@fsol> Message-ID: 2016-07-06 22:24 GMT+02:00 Antoine Pitrou : >> 5% smallest/5% largest: do you mean something like sorting all >> samples, remove items from the two tails? >> >> Something like sorted(samples)[3:-3] ? > > Yes. Hum, it may work if the distribution is uniform (symmetric), but usually the right tail is much longer. 
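[Editorial aside: a small sketch of the trimming idea discussed above, with invented sample values; this is not part of perf.]
---
# Illustrative sketch: drop the 5% smallest and 5% largest samples before averaging.
import statistics

def trimmed(samples, percent=5):
    n = len(samples)
    k = n * percent // 100        # how many samples to drop on each tail
    s = sorted(samples)
    return s[k:n - k] if k else s

# 100 samples in ms, with a long right tail
samples = [100] * 55 + [101] * 30 + [102] * 10 + [140] * 5

print("%.1f" % statistics.mean(samples))           # 102.5: pulled up by the tail
print("%.1f" % statistics.mean(trimmed(samples)))  # 100.6: tail mostly removed
print("%.1f" % statistics.median(samples))         # 100.0
---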
Victor From victor.stinner at gmail.com Mon Jul 18 18:49:07 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 19 Jul 2016 00:49:07 +0200 Subject: [Speed] perf module 0.7 released Message-ID: Hi, I released perf 0.7 (quickly followed by a 0.7.1 bugfix): http://perf.readthedocs.io/ I wrote this new version to collect more data in each process. It now reads (and stores) the CPU config, CPU temperature, CPU frequency, system load average, etc. Later we can add, for example, the process RSS peak or other useful metrics. Oh, and the timestamp is now stored per process (run). Again, it's no longer global. I noticed a temporary slowdown which might be caused by a cron task, I'm not sure yet. At least, timestamps should help to debug such issues. I added many CPU metrics because I wanted to analyze why *sometimes* a benchmark suddenly becomes 50% slower (up to 100% slower). It may be related to the CPU temperature or Intel Turbo Boost, I don't know yet exactly. The previous perf design didn't allow storing information per process, only globally per benchmark. perf 0.7 now has much better support for benchmark suites (not only individual benchmarks) and a really working --append command. A benchmark file doesn't have enough runs? Run it again with --append! Changes: * new "pybench" command (similar to "python3 -m perf ...") * --append is now safer and works on benchmark suites * most perf commands now support multiple files and support benchmark suites (not only individual benchmarks) * new dump command and --dump option to display runs * new metadata: cpu_config, cpu_freq, cpu_temp, load_avg_1min In the meantime, I also completed and updated my fork of the CPython benchmark suite: https://hg.python.org/sandbox/benchmarks_perf Victor From victor.stinner at gmail.com Thu Jul 28 13:19:13 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 28 Jul 2016 19:19:13 +0200 Subject: [Speed] Tracking memory usage Message-ID: Hi, tl;dr Do you use "perf.py --track_memory"? If yes, for which purpose? Are you using it on Windows or Linux? I'm working on the CPython benchmark suite. It has a --track_memory command line option to measure the peak of the memory usage. A main process runs worker processes and tracks their memory usage. On Linux, the main process reads the "private data" from /proc/pid/smaps of a worker process. It uses a busy loop: it reads /proc/pid/smaps as fast as possible (with no sleep)! On Windows, the PeakPagefileUsage field of GetProcessMemoryInfo(process_handle) is used. It uses a loop with a 1 ms sleep. Do you think that the Linux implementation is reliable? What happens if the worker process only reaches its peak for 1 ms but the main process (the watcher) reads the memory usage every 10 ms? The exact value probably also depends a lot on how the operating system computes the memory usage. RSS is very different from PSS (proportional set size), for example. Linux also has "USS" (unshared memory)... I would prefer to implement the code to track memory in the worker process directly. On Windows, it looks reliable to get the peak after each run. On Linux, it is less clear. Should I use a thread reading /proc/self/smaps in a busy loop? For me, the most reliable option is to use tracemalloc to get the peak of the *Python* memory usage. But this module is only available on Python 3.4 and newer. Another issue is that it slows down the code a lot (something like 2x slower!). 
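[Editorial aside: a minimal sketch of the tracemalloc approach mentioned above (Python 3.4+), measuring the peak of traced Python allocations inside the worker process itself. The workload() function is only a placeholder, not code from the benchmark suite.]
---
# Minimal sketch (Python 3.4+): peak of traced Python allocations,
# measured in the worker process itself. workload() is a placeholder.
import tracemalloc

def workload():
    # stand-in for the benchmarked code
    data = [bytes(1000) for _ in range(10000)]
    return len(data)

tracemalloc.start()
workload()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print("peak of traced Python allocations: %.1f MiB" % (peak / (1024.0 * 1024.0)))
---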
I guess that there are two use cases: - read coarse memory usage without hurting performance - read precise memory usage, ignoring performance Victor From victor.stinner at gmail.com Thu Jul 28 13:24:44 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 28 Jul 2016 19:24:44 +0200 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. Message-ID: Hi, I updated all benchmarks of the CPython Benchmark Suite to use my perf module. So you get the timings of all individual runs of *all* benchmarks and can store them in JSON to analyze them in detail. Each benchmark has a full CLI; for example, it gets a new --output option to store results as JSON directly. But it also gets nice functions like --hist for histograms, --stats for statistics, etc. The two remaining questions are: * Should it support --track_memory? It doesn't support it correctly right now, it's less precise than before (memory is tracked directly in worker processes, no longer by the main process) => discuss this point in my previous email * Should we remove vendor copies of libraries and work with virtual environments? Not all libraries are available on PyPI :-/ See the requirements.txt file and TODO. My repository: https://hg.python.org/sandbox/benchmarks_perf I would like to push my work as a single giant commit. Brett also proposed moving the benchmarks repository to GitHub (and so converting it to Git). I don't know if it's appropriate to do all these things at once. What do you think? Reminder: My final goal is to once again merge all the benchmark suites (CPython, PyPy, Pyston, Pyjion, ...) into one unique project! Victor From victor.stinner at gmail.com Fri Jul 29 12:58:19 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 29 Jul 2016 18:58:19 +0200 Subject: [Speed] Tracking memory usage In-Reply-To: References: Message-ID: I modified my perf module to add two new options: --tracemalloc and --track-memory. --tracemalloc enables tracemalloc and gets the peak of the traced Python memory allocations: the peak is computed per process. --track-memory is similar but reads PeakPagefileUsage of GetProcessMemoryInfo() on Windows or private data from /proc/self/smaps on Linux. The read is done every millisecond (1 ms) in a thread, in the worker process. It's not perfect, but it should be "as good" as the "old" CPython benchmark suite. And it makes the benchmark suite simpler because tracking memory usage is now done automatically by the perf module. Victor From brett at python.org Fri Jul 29 13:03:10 2016 From: brett at python.org (Brett Cannon) Date: Fri, 29 Jul 2016 17:03:10 +0000 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: On Thu, 28 Jul 2016 at 10:25 Victor Stinner wrote: > Hi, > > I updated all benchmarks of the CPython Benchmark Suite to use my perf > module. So you get the timings of all individual runs of *all* benchmarks > and can store them in JSON to analyze them in detail. Each benchmark has a > full CLI; for example, it gets a new --output option to store results as > JSON directly. But it also gets nice functions like --hist for > histograms, --stats for statistics, etc. > > The two remaining questions are: > > * Should it support --track_memory? It doesn't support it correctly > right now, it's less precise than before (memory is tracked directly > in worker processes, no longer by the main process) => discuss this > point in my previous email > I don't have an opinion as I have never gotten to use the old feature. 
> > * Should we remove vendor copies of libraries and work with virtual > environments? Not all libraries are available on PyPI :-/ See the > requirements.txt file and TODO. > If they are not on PyPI then we should just drop the benchmark. And I say we do use virtual environments to keep the repo size down. > > My repository: > https://hg.python.org/sandbox/benchmarks_perf > > I would like to push my work as a single giant commit. > > Brett also proposed moving the benchmarks repository to GitHub > (and so converting it to Git). I don't know if it's appropriate to do all > these things at once. What do you think? > I say just start a new repo from scratch. There isn't a ton of magical history in the hg repo that I think we need to have carried around in the git repo. Plus, if we stop shipping project source with the repo, then it will be immensely smaller if we start from scratch. > > Reminder: My final goal is to once again merge all the benchmark suites > (CPython, PyPy, Pyston, Pyjion, ...) into one unique project! I hope this happens! From victor.stinner at gmail.com Fri Jul 29 13:09:43 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 29 Jul 2016 19:09:43 +0200 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: 2016-07-29 19:03 GMT+02:00 Brett Cannon : >> * Should it support --track_memory? It doesn't support it correctly >> right now, it's less precise than before (memory is tracked directly >> in worker processes, no longer by the main process) => discuss this >> point in my previous email > > I don't have an opinion as I have never gotten to use the old feature. As I wrote in my other email, I implemented this feature in perf, so the benchmark suite will get it for free. The implementation is not complete, but it's working ;-) >> * Should we remove vendor copies of libraries and work with virtual >> environments? Not all libraries are available on PyPI :-/ See the >> requirements.txt file and TODO. > > If they are not on PyPI then we should just drop the benchmark. And I say we > do use virtual environments to keep the repo size down. Right. We can start with "a subset" of benchmarks and enlarge the test suite later, and even reimport old benchmarks whose dependencies are not on PyPI on a case-by-case basis. Right now, I only wrote requirements.txt. I didn't touch the Python code which still "hardcodes" PYTHONPATH to get the local copy of dependencies. It works if you manually run benchmarks in a virtual environment. I will write some glue to automate things and make the code "just work" before starting a new thing on GitHub. Victor From zachary.ware+pydev at gmail.com Fri Jul 29 14:51:34 2016 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Fri, 29 Jul 2016 13:51:34 -0500 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: On Fri, Jul 29, 2016 at 12:03 PM, Brett Cannon wrote: > On Thu, 28 Jul 2016 at 10:25 Victor Stinner > wrote: >> * Should we remove vendor copies of libraries and work with virtual >> environments? Not all libraries are available on PyPI :-/ See the >> requirements.txt file and TODO. > > If they are not on PyPI then we should just drop the benchmark. And I say we > do use virtual environments to keep the repo size down. 
I think rather than using virtual environments, which aren't truly supported by <3.3 anyway, we should instead make use of pip's --target, --root, and/or --prefix flags (whatever combination it takes, I haven't looked into it deeply) to install the packages into a particular dir which is then added to each benchmarked interpreter's PYTHONPATH. This way, we're sure that each interpreter is running exactly the same code. Either way, I'm for not vendoring libraries. If the library disappears from PyPI, it's probably not an important workload anymore anyway. >> Reminder: My final goal is to once again merge all the benchmark suites >> (CPython, PyPy, Pyston, Pyjion, ...) into one unique project! > > I hope this happens! Me too! I'd also like to get the benchmark runner for speed.python.org set up to build and benchmark as many interpreters as possible. -- Zach From victor.stinner at gmail.com Fri Jul 29 17:42:03 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 29 Jul 2016 23:42:03 +0200 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: 2016-07-29 20:51 GMT+02:00 Zachary Ware : > I think rather than using virtual environments which aren't truly > supported by <3.3 anyway, ... What do you mean? I'm building and destroying dozens of venvs every day at work using tox on Python 2.7. The virtualenv command works well, no? Do you have issues with it? Victor From zachary.ware+pydev at gmail.com Fri Jul 29 17:55:05 2016 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Fri, 29 Jul 2016 16:55:05 -0500 Subject: [Speed] CPython Benchmark Suite using my perf module, GitHub, etc. In-Reply-To: References: Message-ID: On Fri, Jul 29, 2016 at 4:42 PM, Victor Stinner wrote: > 2016-07-29 20:51 GMT+02:00 Zachary Ware : >> I think rather than using virtual environments which aren't truly >> supported by <3.3 anyway, ... > > What do you mean? I'm building and destroying dozens of venvs every day > at work using tox on Python 2.7. The virtualenv command works well, > no? Do you have issues with it? Not in particular, just that 3.3+ have official support for venvs whereas virtualenv is a bit of a hack by necessity. However, the second point is the real reason I'd rather avoid venvs for this: to make sure that the interpreters actually use the same exact code, so that there can't be any setup.py shenanigans that do things differently between versions/implementations. -- Zach From arigo at tunes.org Sat Jul 30 13:48:42 2016 From: arigo at tunes.org (Armin Rigo) Date: Sat, 30 Jul 2016 19:48:42 +0200 Subject: [Speed] Tracking memory usage In-Reply-To: References: Message-ID: Hi Victor, Fwiw, there is some per-OS (and even apparently per-Linux-distribution) solution mentioned here: http://stackoverflow.com/questions/774556/peak-memory-usage-of-a-linux-unix-process For me on Arch Linux, "/usr/bin/time -v CMD" returns a reasonable value in "Maximum resident set size (kbytes)". I guess that on OSes where this works, it gives a zero-overhead, exact answer. A bientôt, Armin.
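[Editorial aside: following up on Armin's point about the maximum resident set size, the same number is also available from inside a Unix process via the stdlib resource module, with no polling overhead. This is only a hedged sketch, not code from the benchmark suite; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.]
---
# Sketch only (Unix): read the peak RSS of the current process.
import resource
import sys

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
if sys.platform == "darwin":
    peak //= 1024   # macOS reports bytes; convert to KiB like Linux
print("peak RSS: %d KiB" % peak)
---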