From victor.stinner at gmail.com Wed Mar 1 12:05:25 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 1 Mar 2017 18:05:25 +0100 Subject: [Speed] perf 0.9.4 released Message-ID: Hi, I released the version 0.9.4 of my Python perf module: * Add --compare-to option to the Runner CLI * compare_to command: Add --table option to render a table http://perf.readthedocs.io/en/latest/ I used the --table feature to write this FASTCALL microbenchmarks article: https://haypo.github.io/fastcall-microbenchmarks.html Example: +---------------------+---------+------------------------------+ | Benchmark | 3.5 | 3.7 | +=====================+=========+==============================+ | struct.pack("i", 1) | 105 ns | 77.6 ns: 1.36x faster (-26%) | +---------------------+---------+------------------------------+ | getattr(1, "real") | 79.4 ns | 64.4 ns: 1.23x faster (-19%) | +---------------------+---------+------------------------------+ Use --quiet for smaller table. The --compare-to command is the generalization to any perf script of the existing perf timeit --compare-to option, to quickly compare two Python binaries. Example with timeit (because I'm too lazy to write a perf script!): --- $ ./python -m perf timeit 'int(0)' --duplicate=100 --compare-to ../master-ref/python -p3 /home/haypo/prog/bench_python/master-ref/python: .... 112 ns +- 1 ns /home/haypo/prog/bench_python/master/python: .... 108 ns +- 1 ns Median +- std dev: [/home/haypo/prog/bench_python/master-ref/python] 112 ns +- 1 ns -> [/home/haypo/prog/bench_python/master/python] 108 ns +- 1 ns: 1.04x faster (-3%) ---- Hum, I should write an option to allow to specify the name of python binaries, to replace [/home/haypo/prog/bench_python/master-ref/python] just with [ref]. Victor From victor.stinner at gmail.com Wed Mar 1 12:32:06 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 1 Mar 2017 18:32:06 +0100 Subject: [Speed] perf 0.9.4 released In-Reply-To: References: Message-ID: The perf API, command line interface (API) and JSON file format are now complete enough for *my* needs. I plan to use the version 1.0 for the next release and stabilize the API. I still have a long list of enhancement ideas (see the TODO.rst file in the Git repository), but none has a major impact on the API. If you see a major flaw in the API or CLI, please speak up! I know that the PyPy support is very limited, but again, fixing PyPy support shouldn't impact the CLI or API, and so can be done later. 2017-03-01 18:05 GMT+01:00 Victor Stinner : > Hum, I should write an option to allow to specify the name of python > binaries, to replace [/home/haypo/prog/bench_python/master-ref/python] > just with [ref]. Ok, I just added a --python-names option ;-) I annoyed me to have to modify manually perf output when posting to bugs.python.org to replace long [...] with short [ref] and [patch]. Victor From victor.stinner at gmail.com Mon Mar 6 18:37:03 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 7 Mar 2017 00:37:03 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? Message-ID: Hi, Serhiy Storchaka opened a bug report in my perf module: perf displays Median +- std dev, whereas median absolute deviation (MAD) should be displayed instead: https://github.com/haypo/perf/issues/20 I just modified perf to display Median +- MAD, but I'm not sure that it's better than Mean +- std dev. The question is important when a benchmark is unstable (has a lot of outliers). 
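For reference, both summaries are easy to compute with the statistics module of the standard library; the MAD is just the median of the absolute deviations from the median. Rough sketch (the timings are made up, only to show that a few outliers pull the mean and the standard deviation a lot while barely moving the median and the MAD):

---
import statistics

def mad(values):
    # median absolute deviation: median of |value - median|
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)

# made-up timings in ns: mostly around 276 ns, plus two outliers
values = [270, 274, 276, 277, 279, 281, 640, 820]

print("Median +- MAD:   %.0f ns +- %.0f ns"
      % (statistics.median(values), mad(values)))
print("Mean +- std dev: %.0f ns +- %.0f ns"
      % (statistics.mean(values), statistics.stdev(values)))
---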
There is good example below with "Median +- MAD: 276 ns +- 10 ns" and "Mean +- std dev: 371 ns +- 196 ns". The goal of perf is to get reproductible benchmark results. So the question is what should be displayed (median or mean?) to get the most reproductible output? Median +- MAD "hides" outliers. In my experience, outliers are not "reproductible", but caused by "noise" of the system and other applications. I feel that Median +- MAD is what I want, but I would feel more confortable if someone can confirm with his/her experience :-) ----------------- haypo at selma$ PYTHONPATH=~/prog/GIT/perf ./python -m perf show --hist --stats bench.json.gz 234 ns: 3 # 264 ns: 114 ################################################## 293 ns: 9 #### 322 ns: 2 # 351 ns: 0 | 381 ns: 0 | 410 ns: 0 | 439 ns: 1 | 469 ns: 0 | 498 ns: 1 | 527 ns: 1 | 557 ns: 0 | 586 ns: 1 | 615 ns: 1 | 644 ns: 1 | 674 ns: 2 # 703 ns: 1 | 732 ns: 1 | 762 ns: 2 # 791 ns: 15 ####### 820 ns: 5 ## Total duration: 1 min 14.5 sec Start date: 2017-03-06 23:30:49 End date: 2017-03-06 23:33:11 Raw sample minimum: 137 ms Raw sample maximum: 444 ms Number of runs: 42 Total number of samples: 160 Number of samples per run: 4 Number of warmups per run: 2 Loop iterations per sample: 2^19 (128 outer-loops x 4096 inner-loops) Minimum: 262 ns (-5%) Median +- MAD: 276 ns +- 10 ns Mean +- std dev: 371 ns +- 196 ns Maximum: 847 ns (+207%) ERROR: the benchmark is very unstable, the standard deviation is very high (stdev/mean: 53%)! Try to rerun the benchmark with more runs, samples and/or loops Median +- MAD: 276 ns +- 10 ns ----------------- See attached bench.json.gz for full data. Victor -------------- next part -------------- A non-text attachment was scrubbed... Name: bench.json.gz Type: application/x-gzip Size: 6108 bytes Desc: not available URL: From victor.stinner at gmail.com Mon Mar 6 19:03:23 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 7 Mar 2017 01:03:23 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: Message-ID: Another example on the same computer. It's interesting: * MAD and std dev is the half of result 1 * the benchmark is less unstable * median is very close to result 1 * mean changed much more than median Benchmark result 1: Median +- MAD: 276 ns +- 10 ns Mean +- std dev: 371 ns +- 196 ns Benchmark result 2: Median +- MAD: 278 ns +- 5 ns Mean +- std dev: 303 ns +- 103 ns If the goal is to get reproductible results, Median +- MAD seems better. --- haypo at selma$ PYTHONPATH=~/prog/GIT/perf ./python -m perf show --hist --stats bench2.json.gz 250 ns: 75 ########################################################### 278 ns: 73 ######################################################### 306 ns: 3 ## 333 ns: 0 | 361 ns: 0 | 389 ns: 0 | 417 ns: 0 | 445 ns: 0 | 472 ns: 1 # 500 ns: 1 # 528 ns: 0 | 556 ns: 0 | 584 ns: 0 | 611 ns: 1 # 639 ns: 0 | 667 ns: 0 | 695 ns: 1 # 722 ns: 0 | 750 ns: 1 # 778 ns: 1 # 806 ns: 3 ## Total duration: 1 min 4.0 sec Start date: 2017-03-07 00:39:03 End date: 2017-03-07 00:41:05 Raw sample minimum: 140 ms Raw sample maximum: 431 ms Number of runs: 42 Total number of samples: 160 Number of samples per run: 4 Number of warmups per run: 2 Loop iterations per sample: 2^19 (128 outer-loops x 4096 inner-loops) Minimum: 266 ns (-4%) Median +- MAD: 278 ns +- 5 ns Mean +- std dev: 303 ns +- 103 ns Maximum: 822 ns (+195%) ERROR: the benchmark is very unstable, the standard deviation is very high (stdev/mean: 34%)! 
Try to rerun the benchmark with more runs, samples and/or loops Median +- MAD: 278 ns +- 5 ns --- -------------- next part -------------- A non-text attachment was scrubbed... Name: bench2.json.gz Type: application/x-gzip Size: 6011 bytes Desc: not available URL: From solipsis at pitrou.net Mon Mar 13 16:38:57 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 13 Mar 2017 21:38:57 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: Message-ID: <20170313213857.23d5a783@fsol> On Tue, 7 Mar 2017 01:03:23 +0100 Victor Stinner wrote: > Another example on the same computer. It's interesting: > * MAD and std dev is the half of result 1 > * the benchmark is less unstable > * median is very close to result 1 > * mean changed much more than median > > Benchmark result 1: > > Median +- MAD: 276 ns +- 10 ns > Mean +- std dev: 371 ns +- 196 ns > > Benchmark result 2: > > Median +- MAD: 278 ns +- 5 ns > Mean +- std dev: 303 ns +- 103 ns > > If the goal is to get reproductible results, Median +- MAD seems better. Getting reproducible results is only half of the goal. Getting meaningful (i.e. informative) results is the other half. The mean approximates the expected performance over multiple runs (note "expected" is a rigorously defined term in statistics here: see https://en.wikipedia.org/wiki/Expected_value). The median doesn't tell you anything about the expected value (*). So the mean is more informative for the task at hand. Additionally, while mean and std dev are generally quite well understood, the properties of the median absolute deviation are generally little known. So my vote goes to mean +/- std dev. (*) Quick example: let's say your runtimes in seconds are [1, 1, 1, 1, 1, 1, 10, 10, 10, 10]. Evidently, there are four outliers (over 10 measurements) that indicate a huge performance regression occurring at random points. However, the median here is 1 and the median absolute deviation (the median of absolute deviations from the median, i.e. the median of [0, 0, 0, 0, 0, 0, 9, 9, 9, 9]) is 0: the information about possible performance regressions is entirely lost, and the numbers (median +/- MAD) make it look like the benchmark reliably takes 1 s. to run. Regards Antoine. From storchaka at gmail.com Tue Mar 14 03:14:45 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 14 Mar 2017 09:14:45 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170313213857.23d5a783@fsol> References: <20170313213857.23d5a783@fsol> Message-ID: On 13.03.17 22:38, Antoine Pitrou wrote: > The mean approximates the expected performance over multiple runs (note > "expected" is a rigorously defined term in statistics here: see > https://en.wikipedia.org/wiki/Expected_value). The median doesn't tell > you anything about the expected value (*). So the mean is more > informative for the task at hand. The median tells you that results of a half of runs will be less than the median and results of other half will be larger. This is pretty informative and even more informative than the mean for some applications. > Additionally, while mean and std dev are generally quite well > understood, the properties of the median absolute deviation are > generally little known. Std dev is well understood for the distribution close to normal. 
But when the distribution is too skewed or multimodal (as in your quick example) common assumptions (that 2/3 of samples are in the range of the std dev, 95% of samples are in the range of two std devs, 99% of samples are in the range of three std devs) are no longer valid. From ncoghlan at gmail.com Tue Mar 14 10:42:07 2017 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 15 Mar 2017 00:42:07 +1000 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: On 14 March 2017 at 17:14, Serhiy Storchaka wrote: > On 13.03.17 22:38, Antoine Pitrou wrote: > >> Additionally, while mean and std dev are generally quite well >> understood, the properties of the median absolute deviation are >> generally little known. >> > > Std dev is well understood for the distribution close to normal. But when > the distribution is too skewed or multimodal (as in your quick example) > common assumptions (that 2/3 of samples are in the range of the std dev, > 95% of samples are in the range of two std devs, 99% of samples are in the > range of three std devs) are no longer valid. That would suggest that the implicit assumption of a measure-of-centrality with a measure-of-symmetric-deviation may need to be challenged, as at least some meaningful performance problems are going to show up as non-normal distributions in the benchmark results. Network services typically get around the "inherent variance" problem by looking at a few key percentiles like 50%, 90% and 95%. Perhaps that would be appropriate here as well? Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Mar 14 13:05:30 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 14 Mar 2017 18:05:30 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170314180530.4b37cd7e@fsol> On Tue, 14 Mar 2017 09:14:45 +0200 Serhiy Storchaka wrote: > The median tells you that results of a half of runs will be less than > the median and results of other half will be larger. This is pretty > informative and even more informative than the mean for some > applications. How so? Whether a measurement is below or above the median is a pointless piece of information in itself, because you don't know by how much. If a sample is 0.05% below the median, it might just as well be 0.05% above for all I care. If half of the samples are 1% below the median and half of the samples are 50% above, it's not the same thing at all as if half of the samples are 50% below and half of the samples are 1% above. Yet "median +/- MAD" gives the exact same results in both cases. > > Additionally, while mean and std dev are generally quite well > > understood, the properties of the median absolute deviation are > > generally little known. > > Std dev is well understood for the distribution close to normal. But > when the distribution is too skewed or multimodal (as in your quick > example) common assumptions (that 2/3 of samples are in the range of the > std dev, 95% of samples are in the range of two std devs, 99% of samples > are in the range of three std devs) are no longer valid. Not for individual samples, but for expected performance over a large enough number of runs, yes, you can more or less use common assumptions (thanks to the central limit theorem). And expected performance is a rather important piece of information. 
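To make that concrete, here are the numbers for the toy example from my previous message (rough sketch with the statistics module, not perf output):

---
import statistics

# toy runtimes in seconds: 6 fast runs, 4 slow runs
runs = [1, 1, 1, 1, 1, 1, 10, 10, 10, 10]

med = statistics.median(runs)
mad = statistics.median(abs(x - med) for x in runs)
print("median +- MAD:   %.1f +- %.1f" % (med, mad))        # 1.0 +- 0.0
print("mean +- std dev: %.1f +- %.1f"
      % (statistics.mean(runs), statistics.stdev(runs)))   # 4.6 +- 4.6
---

The mean/std dev pair at least shows that something slow happens regularly, while median/MAD reports a perfectly stable 1 s. benchmark.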
Regards Antoine. From solipsis at pitrou.net Tue Mar 14 13:13:59 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 14 Mar 2017 18:13:59 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170314181359.7251e493@fsol> On Wed, 15 Mar 2017 00:42:07 +1000 Nick Coghlan wrote: > > That would suggest that the implicit assumption of a measure-of-centrality > with a measure-of-symmetric-deviation may need to be challenged, as at > least some meaningful performance problems are going to show up as > non-normal distributions in the benchmark results. Well, the real issue here is that an important contributor to non-normality is not the benchmark itself, but measurement noise due to various issues (such as system noise, which has of course a highly skewed distribution). Victor is trying to eliminate the effects of system noise by using the median, but if that's the primary goal, using the minimum is arguably better, since the system noise is always a positive contributor (i.e. it can only increase the runtimes). The median is arguably a bastardized solution, which satisfies neither the requirement of eliminating system noise, nor the requirement of faithfully representing performance variations due to non-deterministic effects in the Python runtime and/or benchmark itself. Regards Antoine. From storchaka at gmail.com Wed Mar 15 02:41:47 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 15 Mar 2017 08:41:47 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170314180530.4b37cd7e@fsol> References: <20170313213857.23d5a783@fsol> <20170314180530.4b37cd7e@fsol> Message-ID: On 14.03.17 19:05, Antoine Pitrou wrote: > On Tue, 14 Mar 2017 09:14:45 +0200 > Serhiy Storchaka > wrote: >> The median tells you that results of a half of runs will be less than >> the median and results of other half will be larger. This is pretty >> informative and even more informative than the mean for some >> applications. > > How so? Whether a measurement is below or above the median is a > pointless piece of information in itself, because you don't know by how > much. If a sample is 0.05% below the median, it might just as well be > 0.05% above for all I care. If half of the samples are 1% below the > median and half of the samples are 50% above, it's not the same thing > at all as if half of the samples are 50% below and half of the samples > are 1% above. Yet "median +/- MAD" gives the exact same results in > both cases. "half of the samples are 1% below the median and half of the samples are 50% above" -- this is unrealistic example. In real examples samples are distributed around some point, with the skew and outliers. The median is close to the mean, but less affected by outliers. For benchmarking purpose the absolute value is not important. The change between two measurements of two builds is important. The median is more stable and that means that we have less chance to get the false result. From storchaka at gmail.com Wed Mar 15 02:54:44 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 15 Mar 2017 08:54:44 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? 
In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: On 14.03.17 16:42, Nick Coghlan wrote: > That would suggest that the implicit assumption of a > measure-of-centrality with a measure-of-symmetric-deviation may need to > be challenged, as at least some meaningful performance problems are > going to show up as non-normal distributions in the benchmark results. > > Network services typically get around the "inherent variance" problem by > looking at a few key percentiles like 50%, 90% and 95%. Perhaps that > would be appropriate here as well? Yes, quantiles would be useful, but I suppose they are less stable. If you have have only 20 samples, it is not enough to determine the 95% percentile. But absolute values are not important for the purposes of our benchmarking. We need only know whether one build is faster or slower than others. I suggested to calculate the probability of one build be faster than the other when compare two builds. This is just one number and it doesn't depend on assumptions about the normality of distributions. From solipsis at pitrou.net Wed Mar 15 09:52:32 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 15 Mar 2017 14:52:32 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170315145232.265ed0fe@fsol> On Wed, 15 Mar 2017 08:54:44 +0200 Serhiy Storchaka wrote: > > But absolute values are not important for the purposes of our > benchmarking. We need only know whether one build is faster or slower > than others. Not really. If you don't know by how much it is faster or slower, it is often useless in itself (because being 0.1% faster doesn't matter, even if that's a very reproduceable speedup). Really, the idea that actual values don't matter and only ordering does is broken. Of course actual values matter, because by how much something is faster is a much more useful piece of information than simply "it is faster". If changing for another interpreter makes some benchmark 3x faster, I may go for it. If it makes some benchmark 3% faster, I won't bother. Regards Antoine. From solipsis at pitrou.net Wed Mar 15 10:24:06 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 15 Mar 2017 15:24:06 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> <20170314180530.4b37cd7e@fsol> Message-ID: <20170315152406.4c18e1ff@fsol> On Wed, 15 Mar 2017 08:41:47 +0200 Serhiy Storchaka wrote: > > "half of the samples are 1% below the median and half of the samples are > 50% above" -- this is unrealistic example. I was inventing an extreme example for the sake of clarity. You can easily derive more "realistic" examples from the same principle and get similar results at the end: non-negligible variations being totally unrepresented in the "median +- MAD" aggregate. > In real examples samples are > distributed around some point, with the skew and outliers. If you're assuming the benchmark itself is stable and variations are due to outside system noise, then you should really take the minimum, which has the most chance of ignoring system noise. If you're mainly worried about outliers, you can first insert a data preparation (or cleanup) phase before computing the mean. But you have to decide up front whether an outlier is due to system noise or actual benchmark instability (which can be due to non-determinism in the runtime, e.g. hash randomization). 
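Such a cleanup phase can be as simple as dropping values too far from the median before averaging. A rough sketch (the cut-off is arbitrary and this is not something perf does):

---
import statistics

def trimmed_mean(values, k=3.0):
    # drop values further than k * MAD from the median, then average;
    # k is an arbitrary cut-off chosen for this example
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return statistics.mean(values)
    kept = [x for x in values if abs(x - med) <= k * mad]
    return statistics.mean(kept)
---

But whatever the cut-off, you still have to classify each outlier as system noise or as genuine benchmark behaviour.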
For that, you may want to collect additional system data while running the benchmark (for example, if total CPU occupation during the benchmark is much more than the benchmark's own CPU times, you might decide the system wasn't idle enough and the result may be classified as outlier). Regards Antoine. From victor.stinner at gmail.com Wed Mar 15 12:32:49 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:32:49 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170313213857.23d5a783@fsol> References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-13 21:38 GMT+01:00 Antoine Pitrou : >> If the goal is to get reproductible results, Median +- MAD seems better. > > Getting reproducible results is only half of the goal. Getting > meaningful (i.e. informative) results is the other half. If the system is tuned for benchmarks (run "python3 -m perf system tune"), you get almost no outlier on CPU-bound functions. In this case, mean/median and stdev/MAD are similar. The problem is when people don't tune their system to run benchmarks, which is likely the most common case. In this case, the distribution is never normal :-) It's always skewed (positive skew, the right part contains more points). Reproductibility is a very concrete and practical issue for me. > Additionally, while mean and std dev are generally quite well > understood, the properties of the median absolute deviation are > generally little known. A friend suggested me to display sigma = 1.48 * MAD, instead of displaying directly MAD, to get a value close to the standard deviation without outliers. I don't know if it makes sense :-) Victor From victor.stinner at gmail.com Wed Mar 15 12:36:06 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:36:06 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-14 8:14 GMT+01:00 Serhiy Storchaka : > Std dev is well understood for the distribution close to normal. But when > the distribution is too skewed or multimodal (as in your quick example) > common assumptions (that 2/3 of samples are in the range of the std dev, 95% > of samples are in the range of two std devs, 99% of samples are in the range > of three std devs) are no longer valid. The Python timeit module only displays the minimum. I chose to display also the standard deviation in perf to give an idea of the stability of the benchmark. For example, "10 +- 1 ms" is quite stable, whereas "10 ms +- 15 ms" seems not reliable at all. MAD contains 50% of samples, whereas std dev contains 66% of samples. If I only look at percentage, I prefer std dev because it gives a better estimation of the stability of the benchmark. Victor From victor.stinner at gmail.com Wed Mar 15 12:39:33 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:39:33 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-14 15:42 GMT+01:00 Nick Coghlan : > That would suggest that the implicit assumption of a measure-of-centrality > with a measure-of-symmetric-deviation may need to be challenged, as at least > some meaningful performance problems are going to show up as non-normal > distributions in the benchmark results. > > Network services typically get around the "inherent variance" problem by > looking at a few key percentiles like 50%, 90% and 95%. Perhaps that would > be appropriate here as well? 
Right now, there is almost no visualisation tool for perf :-( It started to list projects that may be reused to visualize benchmark results, to "see" the distribution. A first step would be to add these "key percentiles like 50%, 90% and 95%" to the perf stats command. I don't know how to compute them. But my question is for the most important summary: the result of the "perf show" command, which is what most users see except if they use more advanced commands. Victor From victor.stinner at gmail.com Wed Mar 15 12:41:59 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:41:59 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170314181359.7251e493@fsol> References: <20170313213857.23d5a783@fsol> <20170314181359.7251e493@fsol> Message-ID: 2017-03-14 18:13 GMT+01:00 Antoine Pitrou : > Victor is trying to eliminate the effects of system noise by using the > median, but if that's the primary goal, using the minimum is arguably > better, since the system noise is always a positive contributor (i.e. > it can only increase the runtimes). > > The median is arguably a bastardized solution, which satisfies neither > the requirement of eliminating system noise, nor the requirement of > faithfully representing performance variations due to non-deterministic > effects in the Python runtime and/or benchmark itself. The Python timeit module uses the minimum an in my experience, it's far not reproductible at all. Mean or median provides a more reproductible value. Victor From victor.stinner at gmail.com Wed Mar 15 12:59:07 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 17:59:07 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: While I like the "automatic removal of outliers feature" of median and MAD ("robust" statistics), I'm not confortable with these numbers. They are new to me and uncommon in other benchmark tools. It's not easy to compare MAD to standard deviation. It seems like MAD can even be misleading when reading the "1 ms" part of "10 ms +- 1 ms". The perf module has already a function to emit warnings if a benchmark is considered as "unstable". A warning is emitted if stdev/mean is greater than 0.10. I chose this threshold arbitrary. Maybe we need another check to emit a warning when mean and median, or std dev and MAD are too different? Maybe we need a new --median command line option to display median/MAD, instead of mean/stdev displayed by default? About the reproductibility, I should experiment mean vs median. Currently, perf doesn't use MAD nor std dev to compare two benchmark results. Victor From solipsis at pitrou.net Wed Mar 15 13:11:25 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 15 Mar 2017 18:11:25 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? References: <20170313213857.23d5a783@fsol> Message-ID: <20170315181125.268e5432@fsol> On Wed, 15 Mar 2017 17:59:07 +0100 Victor Stinner wrote: > > The perf module has already a function to emit warnings if a benchmark > is considered as "unstable". A warning is emitted if stdev/mean is > greater than 0.10. I chose this threshold arbitrary. It doesn't sound too bad :-) > Maybe we need another check to emit a warning when mean and median, or > std dev and MAD are too different? > > Maybe we need a new --median command line option to display > median/MAD, instead of mean/stdev displayed by default? I would say keep it simple. 
mean/stddev is informative enough, no need to add or maintain options of dubious utility. Regards Antoine. From victor.stinner at gmail.com Wed Mar 15 14:11:15 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 15 Mar 2017 19:11:15 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: <20170315181125.268e5432@fsol> References: <20170313213857.23d5a783@fsol> <20170315181125.268e5432@fsol> Message-ID: 2017-03-15 18:11 GMT+01:00 Antoine Pitrou : > I would say keep it simple. mean/stddev is informative enough, no need > to add or maintain options of dubious utility. Ok. I added a message to suggest to use perf stats to analyze results. Example of warnings with a benchmark result considered as unstable, python startup time measured by the new bench_command() function: --- $ python3 -m perf show startup1.json WARNING: the benchmark result may be unstable * the standard deviation (6.08 ms) is 16% of the mean (39.1 ms) * the minimum (23.6 ms) is 40% smaller than the mean (39.1 ms) Try to rerun the benchmark with more runs, values and/or loops. Run 'python3 -m perf system tune' command to reduce the system jitter. Use perf stats to analyze results, or --quiet to hide warnings. Median +- MAD: 40.7 ms +- 3.9 ms ---- Statistics of this result: --- $ python3 -m perf stats startup1.json -q Total duration: 37.2 sec Start date: 2017-03-15 18:02:46 End date: 2017-03-15 18:03:27 Raw value minimum: 189 ms Raw value maximum: 390 ms Number of runs: 25 Total number of values: 75 Number of values per run: 3 Number of warmups per run: 1 Loop iterations per value: 8 Minimum: 23.6 ms (-42% of the median) Median +- MAD: 40.7 ms +- 3.9 ms Mean +- std dev: 39.1 ms +- 6.1 ms Maximum: 48.7 ms (+20% of the median) --- Victor From storchaka at gmail.com Wed Mar 15 18:44:54 2017 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 16 Mar 2017 00:44:54 +0200 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: On 15.03.17 18:59, Victor Stinner wrote: > It's not easy to compare MAD to standard deviation. It seems like MAD > can even be misleading when reading the "1 ms" part of "10 ms +- 1 > ms". Don't use the "+-" notation. It is misleading even for the stddev of normal distribution, because with the chance 1 against 2 the sample is out of the specified interval. Use "Mean: 10 ms Stddev: 1 ms" or "Median: 10 ms MAD: 1 ms" instead. > Maybe we need a new --median command line option to display > median/MAD, instead of mean/stdev displayed by default? Yes, make this configurable. And make median/MAD the default. ;) From victor.stinner at gmail.com Wed Mar 15 20:50:39 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 01:50:39 +0100 Subject: [Speed] ASLR Message-ID: 2017-03-16 1:38 GMT+01:00 Wang, Peter Xihong : > Hi All, > > I am attaching an image with comparison running the CALL_METHOD in the old Grand Unified Python Benchmark (GUPB) suite (https://hg.python.org/benchmarks), with and without ASLR disabled. This benchmark suite is now deprecated, please update to the new 'performance' benchmark suite: https://github.com/python/performance The old benchmark suite didn't spawn multiple processes and so was less reliable. By the way, maybe I should commit a change in hg.python.org/benchmarks to remove the code and only keep a README.txt? Code will still be accessible in Mercurial history. 
> You could see the run2run variation was reduced significantly, from data scattering all over the place, to just one single outlier, out of 30 repeated runs. > This effectively eliminated most of the variations for this micro-benchmark. > > On a Linux system, you could do this by: > as root > echo 0 > /proc/sys/kernel/randomize_va_space # to disable > echo 2 > /proc/sys/kernel/randomize_va_space # to enable > > If anyone still experiences run2run variation, I'd suggest to read on: > Based on my observation in our labs, a lot of factors could impact performance, including environment (yes, even a room temperature), I made my own experiment on the impact on temperature on performance, and above 100?C, I didn't notice anything: https://haypo.github.io/intel-cpus-part2.html "Impact of the CPU temperature on benchmarks" I tested a desktop and a laptop PC with an Intel CPU. > HW components or related such as platforms, chipset, memory DIMMs, CPU generations and stepping, BIOS version, kernels, the list goes on and on. > > Being said that, would it be helpful we work together, to identify the root cause, be it due to SW, or anything else? We could start with a specific micro-benchmark, with specific goal as what to measure. > After that, or in parallel after some baseline work is done, then focus on measurement process/methodology? > > Is this helpful? > > Thanks, > > Peter Note: Please open a new thread instead of replying to an email of an existing thread. Victor From victor.stinner at gmail.com Wed Mar 15 20:59:11 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 01:59:11 +0100 Subject: [Speed] perf 0.9.6 released Message-ID: Hi, I released perf 0.9.6 with many changes. First, "Mean +- std dev" is now displayed, instead of "Median +- std dev", as a result of the previous thread on this list. The median is still accessible via the stats command. By the way, the "stats" command now displays "Median +- MAD" instead of "Median +- std dev". I broke the API to fix an old mistake. I used the term "sample" for a single value, whereas a "sample" in statistics is a set of values (one or more), and so the term is misused. I replace "sample" with "value" and "samples" with "values" everywhere in perf. http://perf.readthedocs.io/en/latest/changelog.html#version-0-9-6-2017-03-15 Version 0.9.6 (2017-03-15) -------------------------- Major change: * Display ``Mean +- std dev`` instead of ``Median +- std dev`` Enhancements: * Add a new ``Runner.bench_command()`` method to measure the execution time of a command. * Add ``mean()``, ``median_abs_dev()`` and ``stdev()`` methods to ``Benchmark`` * ``check`` command: test also minimum and maximum compared to the mean Major API change, rename "sample" to "value": * Rename attributes and methods: - ``Benchmark.bench_sample_func()`` => ``Benchmark.bench_time_func()``. 
- ``Run.samples`` => ``Run.values`` - ``Benchmark.get_samples()`` => ``Benchmark.get_values()`` - ``get_nsample()`` => ``get_nvalue()`` - ``Benchmark.format_sample()`` => ``Benchmark.format_value()`` - ``Benchmark.format_samples()`` => ``Benchmark.format_values()`` * Rename Runner command line options: - ``--samples`` => ``--values`` - ``--debug-single-sample`` => ``--debug-single-value`` Changes: * ``convert``: Remove ``--remove-outliers`` option * ``check`` command now tests stdev/mean, instead of testing stdev/median * setup.py: statistics dependency is now installed using ``extras_require`` to support setuptools 18 and newer * Add setup.cfg to enable universal builds: same wheel package for Python 2 and Python 3 * Add ``perf.VERSION`` constant: tuple of int * JSON version 6: write metadata common to all benchmarks (common to all runs of all benchmarks) at the root; rename 'samples' to 'values' in runs. Victor From victor.stinner at gmail.com Wed Mar 15 21:22:59 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 02:22:59 +0100 Subject: [Speed] ASLR In-Reply-To: <371EBC7881C7844EAAF5556BFF21BCCC583F8B42@ORSMSX105.amr.corp.intel.com> References: <371EBC7881C7844EAAF5556BFF21BCCC583F8B42@ORSMSX105.amr.corp.intel.com> Message-ID: 2017-03-16 2:04 GMT+01:00 Wang, Peter Xihong : > Understood on the obsolete benchmark part. This was the work done before the new benchmark was created on github. I strongly advice you to move to performance. It also has a nice a API. It now produces a JSON file with *all* data, instead of just writing into summaries into stdout. > I thought this is related, and thus didn't open a new thread. The other thread was a discussion about statistics, how to summarize all timing into two numbers :-) > Maybe you could point me to one single micro-benchmark for the time being, and then we could compare result across? The "new" performance project is a fork of the old "benchmark" project. Benchmark names are very close or even the same for many benchmarks. If you would like to validate that your benchmark runner is stable: run call_method and call_simple microbenchmarks on different revisions of CPython, reboot sometimes the computer used to run benchmarks, and make sure that results are stable. Compare them with results of speed.python.org. call_method: https://speed.python.org/timeline/#/?exe=5&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on call_simple: https://speed.python.org/timeline/#/?exe=5&ben=call_simple&env=1&revs=50&equid=off&quarts=on&extr=on Around november and december 2016, you should notice a significant speedup on call_method. The best is to be able to avoid "temporary spikes" like this one: https://haypo.github.io/analysis-python-performance-issue.html The API of the perf project, PGO and LTO compilation, new performance using perf, "perf system tune" for system tuning, etc. helped to get more stable results. Victor From victor.stinner at gmail.com Wed Mar 15 21:27:25 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 02:27:25 +0100 Subject: [Speed] perf 0.9.6 released In-Reply-To: References: Message-ID: I updated performance for perf 0.9.6. I patched python_startup and hg_startup benchmarks to use the new bench_command() method. This new method uses the following Python script to measure the time to execute a command: https://github.com/haypo/perf/blob/master/perf/_process_time.py I wrote the _process_time.py script to be small and simple to reduce the overhead of the benchmark itself. 
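The core of the measurement is basically the following (simplified sketch, not the actual script):

---
import subprocess, sys, time

# spawn the command and measure its wall-clock duration
start = time.perf_counter()
proc = subprocess.Popen([sys.executable, "-c", "pass"])
proc.wait()
dt = time.perf_counter() - start
print("%.1f ms" % (dt * 1e3))
---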
It's similar to the "real time" line of UNIX 'time' command, but it works on Windows too. I chose to use time.perf_counter(), wall clock, instead of using getrusage() which provides CPU time. It's easy for me to understand wall clock time rather than CPU time, and it's more consistent with other perf methods. Victor 2017-03-16 1:59 GMT+01:00 Victor Stinner : > Hi, > > I released perf 0.9.6 with many changes. First, "Mean +- std dev" is > now displayed, instead of "Median +- std dev", as a result of the > previous thread on this list. The median is still accessible via the > stats command. By the way, the "stats" command now displays "Median +- > MAD" instead of "Median +- std dev". > > I broke the API to fix an old mistake. I used the term "sample" for a > single value, whereas a "sample" in statistics is a set of values (one > or more), and so the term is misused. I replace "sample" with "value" > and "samples" with "values" everywhere in perf. > > http://perf.readthedocs.io/en/latest/changelog.html#version-0-9-6-2017-03-15 > > Version 0.9.6 (2017-03-15) > -------------------------- > > Major change: > > * Display ``Mean +- std dev`` instead of ``Median +- std dev`` > > Enhancements: > > * Add a new ``Runner.bench_command()`` method to measure the execution time of > a command. > * Add ``mean()``, ``median_abs_dev()`` and ``stdev()`` methods to ``Benchmark`` > * ``check`` command: test also minimum and maximum compared to the mean > > Major API change, rename "sample" to "value": > > * Rename attributes and methods: > > - ``Benchmark.bench_sample_func()`` => ``Benchmark.bench_time_func()``. > - ``Run.samples`` => ``Run.values`` > - ``Benchmark.get_samples()`` => ``Benchmark.get_values()`` > - ``get_nsample()`` => ``get_nvalue()`` > - ``Benchmark.format_sample()`` => ``Benchmark.format_value()`` > - ``Benchmark.format_samples()`` => ``Benchmark.format_values()`` > > * Rename Runner command line options: > > - ``--samples`` => ``--values`` > - ``--debug-single-sample`` => ``--debug-single-value`` > > Changes: > > * ``convert``: Remove ``--remove-outliers`` option > * ``check`` command now tests stdev/mean, instead of testing stdev/median > * setup.py: statistics dependency is now installed using ``extras_require`` to > support setuptools 18 and newer > * Add setup.cfg to enable universal builds: same wheel package for Python 2 > and Python 3 > * Add ``perf.VERSION`` constant: tuple of int > * JSON version 6: write metadata common to all benchmarks (common to all runs > of all benchmarks) at the root; rename 'samples' to 'values' in runs. > > Victor From solipsis at pitrou.net Thu Mar 16 05:22:00 2017 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 16 Mar 2017 10:22:00 +0100 Subject: [Speed] ASLR References: Message-ID: <20170316102200.746f709a@fsol> On Thu, 16 Mar 2017 01:50:39 +0100 Victor Stinner wrote: > > I made my own experiment on the impact on temperature on performance, > and above 100?C, I didn't notice anything: > https://haypo.github.io/intel-cpus-part2.html > "Impact of the CPU temperature on benchmarks" I suspect temperature can have an impact on performance if Turbo is enabled (or, as you noticed, if CPU cooling is deficient). Note that tweaking a system for benchmarking (disabling Turbo, disabling ASLR, etc.) may make the results more reproducible, but it may also make them less representative of real-world conditions (because few people disable Turbo or ASLR, except precisely on benchmarking machines :-)). It's a delicate balancing act! Regards Antoine. 
From victor.stinner at gmail.com Thu Mar 16 07:19:05 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 12:19:05 +0100 Subject: [Speed] ASLR In-Reply-To: <20170316102200.746f709a@fsol> References: <20170316102200.746f709a@fsol> Message-ID: 2017-03-16 10:22 GMT+01:00 Antoine Pitrou : > I suspect temperature can have an impact on performance if Turbo is > enabled (or, as you noticed, if CPU cooling is deficient). Oh sure, I now always start by disabling Turbo Boost. It's common that I run benchmarks on my desktop PC with Firefox running in the background. Variable workload on other CPUs is very likely to change the peak CPU frequency on the CPUs used for benhcmarks, even if CPU isolation and CPU pinning is used. > Note that tweaking a system for benchmarking (disabling Turbo, > disabling ASLR, etc.) may make the results more reproducible, but it > may also make them less representative of real-world conditions > (because few people disable Turbo or ASLR, except precisely on > benchmarking machines :-)). It's a delicate balancing act! Yeah, that's also why I chose to enable ASLR. I fear that disabling ASLR will put me a "local minimum" which is not representative of average performance when ASLR is enabled and benchmark run using multiple processes (to test multiple address layouts). Victor From peter.xihong.wang at intel.com Wed Mar 15 20:38:14 2017 From: peter.xihong.wang at intel.com (Wang, Peter Xihong) Date: Thu, 16 Mar 2017 00:38:14 +0000 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> <20170315181125.268e5432@fsol> Message-ID: <371EBC7881C7844EAAF5556BFF21BCCC583F8AED@ORSMSX105.amr.corp.intel.com> Hi All, I am attaching an image with comparison running the CALL_METHOD in the old Grand Unified Python Benchmark (GUPB) suite (https://hg.python.org/benchmarks), with and without ASLR disabled. You could see the run2run variation was reduced significantly, from data scattering all over the place, to just one single outlier, out of 30 repeated runs. This effectively eliminated most of the variations for this micro-benchmark. On a Linux system, you could do this by: as root echo 0 > /proc/sys/kernel/randomize_va_space # to disable echo 2 > /proc/sys/kernel/randomize_va_space # to enable If anyone still experiences run2run variation, I'd suggest to read on: Based on my observation in our labs, a lot of factors could impact performance, including environment (yes, even a room temperature), HW components or related such as platforms, chipset, memory DIMMs, CPU generations and stepping, BIOS version, kernels, the list goes on and on. Being said that, would it be helpful we work together, to identify the root cause, be it due to SW, or anything else? We could start with a specific micro-benchmark, with specific goal as what to measure. After that, or in parallel after some baseline work is done, then focus on measurement process/methodology? Is this helpful? Thanks, Peter ? -----Original Message----- From: Speed [mailto:speed-bounces+peter.xihong.wang=intel.com at python.org] On Behalf Of Victor Stinner Sent: Wednesday, March 15, 2017 11:11 AM To: Antoine Pitrou Cc: speed at python.org Subject: Re: [Speed] Median +- MAD or Mean +- std dev? 2017-03-15 18:11 GMT+01:00 Antoine Pitrou : > I would say keep it simple. mean/stddev is informative enough, no > need to add or maintain options of dubious utility. Ok. I added a message to suggest to use perf stats to analyze results. 
Example of warnings with a benchmark result considered as unstable, python startup time measured by the new bench_command() function: --- $ python3 -m perf show startup1.json WARNING: the benchmark result may be unstable * the standard deviation (6.08 ms) is 16% of the mean (39.1 ms) * the minimum (23.6 ms) is 40% smaller than the mean (39.1 ms) Try to rerun the benchmark with more runs, values and/or loops. Run 'python3 -m perf system tune' command to reduce the system jitter. Use perf stats to analyze results, or --quiet to hide warnings. Median +- MAD: 40.7 ms +- 3.9 ms ---- Statistics of this result: --- $ python3 -m perf stats startup1.json -q Total duration: 37.2 sec Start date: 2017-03-15 18:02:46 End date: 2017-03-15 18:03:27 Raw value minimum: 189 ms Raw value maximum: 390 ms Number of runs: 25 Total number of values: 75 Number of values per run: 3 Number of warmups per run: 1 Loop iterations per value: 8 Minimum: 23.6 ms (-42% of the median) Median +- MAD: 40.7 ms +- 3.9 ms Mean +- std dev: 39.1 ms +- 6.1 ms Maximum: 48.7 ms (+20% of the median) --- Victor _______________________________________________ Speed mailing list Speed at python.org https://mail.python.org/mailman/listinfo/speed -------------- next part -------------- A non-text attachment was scrubbed... Name: ASLR_disabled_enabled_comparison.jpg Type: image/jpeg Size: 79494 bytes Desc: ASLR_disabled_enabled_comparison.jpg URL: From peter.xihong.wang at intel.com Wed Mar 15 21:04:14 2017 From: peter.xihong.wang at intel.com (Wang, Peter Xihong) Date: Thu, 16 Mar 2017 01:04:14 +0000 Subject: [Speed] ASLR In-Reply-To: References: Message-ID: <371EBC7881C7844EAAF5556BFF21BCCC583F8B42@ORSMSX105.amr.corp.intel.com> Hi Victor, Understood on the obsolete benchmark part. This was the work done before the new benchmark was created on github. I thought this is related, and thus didn't open a new thread. Maybe you could point me to one single micro-benchmark for the time being, and then we could compare result across? ? Regards, Peter -----Original Message----- From: Victor Stinner [mailto:victor.stinner at gmail.com] Sent: Wednesday, March 15, 2017 5:51 PM To: speed at python.org; Wang, Peter Xihong Subject: ASLR 2017-03-16 1:38 GMT+01:00 Wang, Peter Xihong : > Hi All, > > I am attaching an image with comparison running the CALL_METHOD in the old Grand Unified Python Benchmark (GUPB) suite (https://hg.python.org/benchmarks), with and without ASLR disabled. This benchmark suite is now deprecated, please update to the new 'performance' benchmark suite: https://github.com/python/performance The old benchmark suite didn't spawn multiple processes and so was less reliable. By the way, maybe I should commit a change in hg.python.org/benchmarks to remove the code and only keep a README.txt? Code will still be accessible in Mercurial history. > You could see the run2run variation was reduced significantly, from data scattering all over the place, to just one single outlier, out of 30 repeated runs. > This effectively eliminated most of the variations for this micro-benchmark. 
> > On a Linux system, you could do this by: > as root > echo 0 > /proc/sys/kernel/randomize_va_space # to disable > echo 2 > /proc/sys/kernel/randomize_va_space # to enable > > If anyone still experiences run2run variation, I'd suggest to read on: > Based on my observation in our labs, a lot of factors could impact > performance, including environment (yes, even a room temperature), I made my own experiment on the impact on temperature on performance, and above 100?C, I didn't notice anything: https://haypo.github.io/intel-cpus-part2.html "Impact of the CPU temperature on benchmarks" I tested a desktop and a laptop PC with an Intel CPU. > HW components or related such as platforms, chipset, memory DIMMs, CPU generations and stepping, BIOS version, kernels, the list goes on and on. > > Being said that, would it be helpful we work together, to identify the root cause, be it due to SW, or anything else? We could start with a specific micro-benchmark, with specific goal as what to measure. > After that, or in parallel after some baseline work is done, then focus on measurement process/methodology? > > Is this helpful? > > Thanks, > > Peter Note: Please open a new thread instead of replying to an email of an existing thread. Victor From brett at python.org Thu Mar 16 12:19:35 2017 From: brett at python.org (Brett Cannon) Date: Thu, 16 Mar 2017 16:19:35 +0000 Subject: [Speed] ASLR In-Reply-To: References: Message-ID: On Wed, 15 Mar 2017 at 17:54 Victor Stinner wrote: > 2017-03-16 1:38 GMT+01:00 Wang, Peter Xihong >: > > Hi All, > > > > I am attaching an image with comparison running the CALL_METHOD in the > old Grand Unified Python Benchmark (GUPB) suite ( > https://hg.python.org/benchmarks), with and without ASLR disabled. > > This benchmark suite is now deprecated, please update to the new > 'performance' benchmark suite: > https://github.com/python/performance > > The old benchmark suite didn't spawn multiple processes and so was > less reliable. > > By the way, maybe I should commit a change in hg.python.org/benchmarks > to remove the code and only keep a README.txt? Code will still be > accessible in Mercurial history. > Since we might not shut down hg.python.org for a long time I say go ahead and commit such a change. -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.stinner at gmail.com Thu Mar 16 13:28:40 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 16 Mar 2017 18:28:40 +0100 Subject: [Speed] Median +- MAD or Mean +- std dev? In-Reply-To: References: <20170313213857.23d5a783@fsol> Message-ID: 2017-03-15 23:44 GMT+01:00 Serhiy Storchaka : > Don't use the "+-" notation. It is misleading even for the stddev of normal > distribution, because with the chance 1 against 2 the sample is out of the > specified interval. Use "Mean: 10 ms Stddev: 1 ms" or "Median: 10 ms MAD: > 1 ms" instead. I know that it's an abuse of "value +- range" notation. Since I already changed the default formatting of a benchmark multiple times and it seems like Serhiy doesn't like the current format, a first action is to remove the public methods to format a benchmark :-) https://github.com/haypo/perf/commit/881a282cdac7969e3c759ff344ad766b3ae0f065 So at least, I will not break the API if I change the format again in the future. 
Victor From peter.xihong.wang at intel.com Thu Mar 16 19:00:19 2017 From: peter.xihong.wang at intel.com (Wang, Peter Xihong) Date: Thu, 16 Mar 2017 23:00:19 +0000 Subject: [Speed] ASLR In-Reply-To: <20170316102200.746f709a@fsol> References: <20170316102200.746f709a@fsol> Message-ID: <371EBC7881C7844EAAF5556BFF21BCCC583FA12E@ORSMSX105.amr.corp.intel.com> [Wang, Peter Xihong] I am wondering what others are using micro-benchmarks for, or if there is a usage statistics somewhere about these benchmarks. For me, it's optimization delta driven. e.g., if I expect my optimization to boost performance by 5%, but the variation reaches up to or greater than 5%, then I am getting lost, and the perf data cannot be trusted.? In addition to turbo boost, I also turned off hyperthreading, and c-state, p-state, on Intel CPUs. Regards, Peter > -----Original Message----- > From: Speed [mailto:speed- > bounces+peter.xihong.wang=intel.com at python.org] On Behalf Of Antoine > Pitrou > Sent: Thursday, March 16, 2017 2:22 AM > To: speed at python.org > Subject: Re: [Speed] ASLR > > On Thu, 16 Mar 2017 01:50:39 +0100 > Victor Stinner > wrote: > > > > I made my own experiment on the impact on temperature on performance, > > and above 100?C, I didn't notice anything: > > https://haypo.github.io/intel-cpus-part2.html > > "Impact of the CPU temperature on benchmarks" > > I suspect temperature can have an impact on performance if Turbo is enabled > (or, as you noticed, if CPU cooling is deficient). > > Note that tweaking a system for benchmarking (disabling Turbo, disabling ASLR, > etc.) may make the results more reproducible, but it may also make them less > representative of real-world conditions (because few people disable Turbo or > ASLR, except precisely on benchmarking machines :-)). It's a delicate > balancing act! > > Regards > > Antoine. > > > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From victor.stinner at gmail.com Thu Mar 16 22:07:35 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 17 Mar 2017 03:07:35 +0100 Subject: [Speed] perf 1.0 released: with a stable API Message-ID: Hi, After 9 months of development, the perf API became stable with the awaited "1.0" version. The perf module has now a complete API to write, run and analyze benchmarks and a nice documentation explaining traps of benchmarking and how to avoid, or even, fix them. http://perf.readthedocs.io/ Last days, I rewrote the documentation, hid a few more functions to prevent API changes after the 1.0 release, and I made last backward incompatible changes to fix old design issues. I don't expect the module to be perfect. It's more a milestone to freeze the API and focus on features instead ;-) Changes between 0.9.6 and 1.0: Enhancements: * ``stats`` command now displays percentiles * ``hist`` command now also checks the benchmark stability by default * dump command now displays raw value of calibration runs. * Add ``Benchmark.percentile()`` method Backward incompatible changes: * Remove the ``compare`` command to only keep the ``compare_to`` command which is better defined * Run warmup values must now be normalized per loop iteration. * Remove ``format()`` and ``__str__()`` methods from Benchmark. These methods were too opiniated. * Rename ``--name=NAME`` option to ``--benchmark=NAME`` * Remove ``perf.monotonic_clock()`` since it wasn't monotonic on Python 2.7. 
* Remove ``is_significant()`` from the public API Other changes: * check command now only complains if min/max is 50% smaller/larger than the mean, instead of 25%. Note: I already updated the performance project to perf 1.0. Victor From victor.stinner at gmail.com Thu Mar 16 22:11:14 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 17 Mar 2017 03:11:14 +0100 Subject: [Speed] ASLR In-Reply-To: <371EBC7881C7844EAAF5556BFF21BCCC583FA12E@ORSMSX105.amr.corp.intel.com> References: <20170316102200.746f709a@fsol> <371EBC7881C7844EAAF5556BFF21BCCC583FA12E@ORSMSX105.amr.corp.intel.com> Message-ID: 2017-03-17 0:00 GMT+01:00 Wang, Peter Xihong : > In addition to turbo boost, I also turned off hyperthreading, and c-state, p-state, on Intel CPUs. My "python3 -m perf system tune" command sets the minimum frequency of CPUs used for benchmarks to the maximum frequency. I expect that it reduces or even avoid changes on P-state and C-state. See my documentation on How to tune a system for benchmarking: http://perf.readthedocs.io/en/latest/system.html Victor From victor.stinner at gmail.com Thu Mar 16 22:29:19 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 17 Mar 2017 03:29:19 +0100 Subject: [Speed] pymicrobench: collection of CPython microbenchmarks Message-ID: Hi, I started to create a collection of microbenchmarks for CPython from scripts found on the bug tracker: https://github.com/haypo/pymicrobench I'm not sure that this collection is used yet, but some of you may want to take a look :-) I know that some people have random microbenchmarks in a local directory. Maybe you want to share them? I don't really care to sort them or group them. My plan is first to populate the repository, and later see what to do with it :-) Victor From victor.stinner at gmail.com Sun Mar 26 18:12:21 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 27 Mar 2017 00:12:21 +0200 Subject: [Speed] speed.python.org: move to Git, remove old previous results Message-ID: Hi, I'm going to remove old previous benchmark results from speed.python.org. As we discussed previously, there is no plan to keep old results when we need to change something. In this case, CPython moved from Mercurial to Git, and I'm too lazy to upgrade the revisions in database. I prefer to run again benchmarks :-) My plan: * Remove all previous benchmark results * Run benchmarks on master, 2.7, 3.6 and 3.5 branches * Run benchmarks on one revision per year quarter on the last 2 years * Then see if we should run benchmarks on even older revisions and/or if we need more than one plot per quarter. * Maybe one point per month at least? The problem is that the UI is limited to 50 points on the "Display all in a grid" view of the Timeline. I would like to be able to render 2 years on this view. For each year quarter, I will use the first commit of the master branch on this period. Victor From victor.stinner at gmail.com Mon Mar 27 10:43:37 2017 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 27 Mar 2017 16:43:37 +0200 Subject: [Speed] speed.python.org: move to Git, remove old previous results In-Reply-To: References: Message-ID: Zachary Ware told me on IRC that it's ok for him to drop old data. If nobody else complains, I will remove old data tomorrow (tuesday). I already validated that the patched scripts work on Git. I released new versions of perf and performance to make sure that the latest version of the code is released and used. 
From victor.stinner at gmail.com  Thu Mar 16 22:29:19 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Fri, 17 Mar 2017 03:29:19 +0100
Subject: [Speed] pymicrobench: collection of CPython microbenchmarks
Message-ID:

Hi,

I started to create a collection of microbenchmarks for CPython from scripts
found on the bug tracker:

https://github.com/haypo/pymicrobench

I'm not sure yet how this collection will be used, but some of you may want
to take a look :-)

I know that some people have random microbenchmarks in a local directory.
Maybe you want to share them? I don't really care about sorting or grouping
them. My plan is to populate the repository first, and see later what to do
with it :-)

Victor

From victor.stinner at gmail.com  Sun Mar 26 18:12:21 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Mon, 27 Mar 2017 00:12:21 +0200
Subject: [Speed] speed.python.org: move to Git, remove old previous results
Message-ID:

Hi,

I'm going to remove the old benchmark results from speed.python.org. As we
discussed previously, there is no plan to keep old results when we need to
change something. In this case, CPython moved from Mercurial to Git, and I'm
too lazy to upgrade the revisions in the database. I prefer to run the
benchmarks again :-)

My plan:

* Remove all previous benchmark results
* Run benchmarks on the master, 2.7, 3.6 and 3.5 branches
* Run benchmarks on one revision per quarter over the last 2 years
* Then see if we should run benchmarks on even older revisions and/or if we
  need more than one plot per quarter
* Maybe one point per month at least? The problem is that the UI is limited
  to 50 points on the "Display all in a grid" view of the Timeline. I would
  like to be able to render 2 years on this view.

For each quarter, I will use the first commit on the master branch in that
period.

Victor

From victor.stinner at gmail.com  Mon Mar 27 10:43:37 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Mon, 27 Mar 2017 16:43:37 +0200
Subject: [Speed] speed.python.org: move to Git, remove old previous results
In-Reply-To:
References:
Message-ID:

Zachary Ware told me on IRC that it's OK for him to drop the old data.

If nobody else complains, I will remove the old data tomorrow (Tuesday).

I already validated that the patched scripts work with Git. I released new
versions of perf and performance to make sure that the latest version of the
code is released and used.

By the way, the newly released perf 1.1 gets a new "perf command" command to
measure the time of a command, like the Unix "time" command:

http://perf.readthedocs.io/en/latest/cli.html#command-cmd

$ python3 -m perf command -- python2 -c pass
.....................
command: Mean +- std dev: 21.2 ms +- 3.2 ms

Victor

2017-03-27 0:12 GMT+02:00 Victor Stinner:
> Hi,
>
> I'm going to remove the old benchmark results from speed.python.org. As we
> discussed previously, there is no plan to keep old results when we need to
> change something. In this case, CPython moved from Mercurial to Git, and I'm
> too lazy to upgrade the revisions in the database. I prefer to run the
> benchmarks again :-)
>
> My plan:
>
> * Remove all previous benchmark results
> * Run benchmarks on the master, 2.7, 3.6 and 3.5 branches
> * Run benchmarks on one revision per quarter over the last 2 years
> * Then see if we should run benchmarks on even older revisions and/or if we
>   need more than one plot per quarter
> * Maybe one point per month at least? The problem is that the UI is limited
>   to 50 points on the "Display all in a grid" view of the Timeline. I would
>   like to be able to render 2 years on this view.
>
> For each quarter, I will use the first commit on the master branch in that
> period.
>
> Victor

From victor.stinner at gmail.com  Mon Mar 27 19:17:26 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 28 Mar 2017 01:17:26 +0200
Subject: [Speed] ASLR
In-Reply-To:
References:
Message-ID:

2017-03-16 17:19 GMT+01:00 Brett Cannon:
>> By the way, maybe I should commit a change in hg.python.org/benchmarks
>> to remove the code and only keep a README.txt? Code will still be
>> accessible in Mercurial history.
>
> Since we might not shut down hg.python.org for a long time I say go ahead
> and commit such a change.

Ok, done!

https://hg.python.org/benchmarks/file/tip/README.txt
https://hg.python.org/benchmarks/file/tip

Victor

From tobami at gmail.com  Tue Mar 28 03:36:35 2017
From: tobami at gmail.com (Miquel Torres)
Date: Tue, 28 Mar 2017 07:36:35 +0000
Subject: [Speed] speed.python.org: move to Git, remove old previous results
In-Reply-To:
References:
Message-ID:

I can have a look into increasing the number of points displayed.

On Mon, 27 Mar 2017 at 15:44, Victor Stinner wrote:
> Zachary Ware told me on IRC that it's OK for him to drop the old data.
>
> If nobody else complains, I will remove the old data tomorrow (Tuesday).
>
> I already validated that the patched scripts work with Git. I released new
> versions of perf and performance to make sure that the latest version of
> the code is released and used. By the way, the newly released perf 1.1
> gets a new "perf command" command to measure the time of a command, like
> the Unix "time" command:
>
> http://perf.readthedocs.io/en/latest/cli.html#command-cmd
>
> $ python3 -m perf command -- python2 -c pass
> .....................
> command: Mean +- std dev: 21.2 ms +- 3.2 ms
>
> Victor
>
> 2017-03-27 0:12 GMT+02:00 Victor Stinner:
> > Hi,
> >
> > I'm going to remove the old benchmark results from speed.python.org. As
> > we discussed previously, there is no plan to keep old results when we
> > need to change something. In this case, CPython moved from Mercurial to
> > Git, and I'm too lazy to upgrade the revisions in the database.
> > I prefer to run the benchmarks again :-)
> >
> > My plan:
> >
> > * Remove all previous benchmark results
> > * Run benchmarks on the master, 2.7, 3.6 and 3.5 branches
> > * Run benchmarks on one revision per quarter over the last 2 years
> > * Then see if we should run benchmarks on even older revisions and/or if
> >   we need more than one plot per quarter
> > * Maybe one point per month at least? The problem is that the UI is
> >   limited to 50 points on the "Display all in a grid" view of the
> >   Timeline. I would like to be able to render 2 years on this view.
> >
> > For each quarter, I will use the first commit on the master branch in
> > that period.
> >
> > Victor
> _______________________________________________
> Speed mailing list
> Speed at python.org
> https://mail.python.org/mailman/listinfo/speed
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From victor.stinner at gmail.com  Tue Mar 28 07:05:06 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 28 Mar 2017 13:05:06 +0200
Subject: [Speed] speed.python.org: move to Git, remove old previous results
In-Reply-To:
References:
Message-ID:

2017-03-28 9:36 GMT+02:00 Miquel Torres:
> I can have a look into increasing the number of points displayed.

There is a "Show the last [50] results" widget, but it's disabled if you
select "(o) Display all in a grid". Maybe we should enable the first widget
but limit the maximum number of results when this specific view is selected?
Just keep 50 by default ;-)

Victor

From victor.stinner at gmail.com  Tue Mar 28 08:11:54 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 28 Mar 2017 14:11:54 +0200
Subject: [Speed] Interesting Ruby pull request
Message-ID:

Hi,

It seems like Urabe, Shyouhei succeeded in writing an efficient optimizer for
Ruby:

https://github.com/ruby/ruby/pull/1419

Since the Ruby and CPython designs are similar, maybe we can pick up some
ideas. It seems like the optimizer is not done yet; the PR has not been
merged yet. I don't understand how the optimizer works.

An interesting commit:
https://github.com/ruby/ruby/pull/1419/commits/d7b376949eb1626b9e5088f907db4cda5698ac1b

---
basic optimization infrastructure

This commit adds on-the-fly ISeq analyzer. It detects an ISeq's purity, i.e.
if that ISeq has side-effect or not. Purity is the key concept of whole
optimization techniques in general, but in Ruby it is yet more important
because there is a method called eval. A pure ISeq is free from eval, while
those not pure are stuck in the limbo where any of its side effects _could_
result in (possibly aliased) call to eval. So an optimization tend not be
possible against them.

Note however, that the analyzer cannot statically say if the ISeq in question
is pure or not. It categorizes an ISeq into 3 states namely pure, not pure,
or "unpredictable". The last category is used when for instance there are
branches yet to be analyzed, or method calls to another unpredictable ISeq.

An ISeq's purity changes over time, not only by redefinition of methods, but
by other optimizations, like, by entering a rarely-taken branch of a
formerly-unpredictable ISeq to kick analyzer to fix its purity. Such change
propagates to its callers.

* optimize.c: new file.
* optimize.h: new file.
* common.mk (COMMONOBJS): dependencies for new files.
* iseq.h (ISEQ_NEEDS_ANALYZE): new flag to denote the iseq in question might
  need (re)analyzing.
---

I had this link in my bookmarks for months, but I forgot about it. This email
is so that I don't forget it again ;-) Someone may find it useful!

Victor
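Since the quoted commit message is fairly dense, here is a toy Python sketch of the three-state
purity idea it describes: every function is classified as pure, impure or unknown, a call to eval
(or to something not analyzed yet) keeps the caller out of the pure bucket, and a function's
classification depends on its callees. The call graph and the rules below are entirely made up
for illustration and do not correspond to the actual Ruby implementation in the PR:

---
# Toy model of the pure / not pure / "unpredictable" classification described
# in the commit message above. Purely illustrative: the names, the call graph
# and the rules are invented, nothing here mirrors the Ruby code.
PURE, IMPURE, UNKNOWN = "pure", "impure", "unknown"

# Hypothetical call graph: name -> (has direct side effects?, callees)
CALL_GRAPH = {
    "add":     (False, []),                          # no side effects, no calls
    "log":     (True,  []),                          # writes somewhere: impure
    "compute": (False, ["add"]),                     # only calls pure code
    "dynamic": (False, ["eval"]),                    # reaches eval: impure
    "handler": (False, ["compute", "plugin_hook"]),  # callee not analyzed yet
}


def purity(name, cache=None):
    if cache is None:
        cache = {}
    if name == "eval":
        return IMPURE
    if name not in CALL_GRAPH:
        return UNKNOWN                # not analyzed yet: "unpredictable"
    if name in cache:
        return cache[name]
    cache[name] = UNKNOWN             # provisional value for recursive calls
    has_side_effects, callees = CALL_GRAPH[name]
    result = IMPURE if has_side_effects else PURE
    for callee in callees:
        sub = purity(callee, cache)
        if sub == IMPURE:
            result = IMPURE
            break
        if sub == UNKNOWN:
            result = UNKNOWN
    cache[name] = result
    return result


for func in sorted(CALL_GRAPH):
    print("%-8s -> %s" % (func, purity(func)))
---

The interesting property, and presumably why the PR tracks purity on the fly rather than in a
single static pass, is the last paragraph of the commit message: when new information arrives (a
branch is finally taken, a method is redefined), only the affected function's classification and
that of its callers needs to be updated.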
From victor.stinner at gmail.com  Tue Mar 28 19:22:31 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Wed, 29 Mar 2017 01:22:31 +0200
Subject: [Speed] Results of CPython benchmarks on 2016
Message-ID:

Hi,

Before removing everything from the speed.python.org database, I took
screenshots of the interesting pages:

https://haypo.github.io/speed-python-org-march-2017.html

* Benchmarks where Python 3.7 is faster than Python 2.7
* Benchmarks where Python 3.7 is slower than Python 2.7
* Significant optimizations
* etc.

CPython became faster on many benchmarks in 2016:

* call_method
* float
* hexiom
* nqueens
* pickle_list
* richards
* scimark_lu
* scimark_sor
* sympy_sum
* telco
* unpickle_list

I now have to analyze what made these benchmarks faster for my future
PyCon US talk "Optimizations which made Python 3.6 faster than Python 3.5" ;-)

I also kept many screenshots showing that the benchmarks are now stable!

Victor

From victor.stinner at gmail.com  Fri Mar 31 18:47:35 2017
From: victor.stinner at gmail.com (Victor Stinner)
Date: Sat, 1 Apr 2017 00:47:35 +0200
Subject: [Speed] Issues to run benchmarks on Python before 2015-04-01
Message-ID:

Hi,

I'm trying to run benchmarks on revisions between 2014-01-01 and today, but I
hit two different issues: see below. I'm now looking for workarounds :-/
Because of these bugs, I'm unable to get benchmark results before 2015-04-01
(from 2015-04-01 on, benchmarks work again).

(1) 2014-01-01: "python3 -m pip install performance" fails with a TypeError:
"charset argument must be specified when non-ASCII characters are used in the
payload"

It's a regression introduced in a Python 3.4 beta:
http://bugs.python.org/issue20531

(2) 2014-04-01, 2014-07-01, 2014-10-01, 2015-01-01: "venv/bin/python -m pip
install" fails in extract_stack() of pyparsing

---
haypo at selma$ /home/haypo/prog/bench_python/tmpdir/prefix/bin/python3
Python 3.5.0a0 (default, Apr 1 2017, 00:01:30)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pip
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/__init__.py", line 26, in <module>
    from pip.utils import get_installed_distributions, get_prog
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/utils/__init__.py", line 27, in <module>
    from pip._vendor import pkg_resources
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pkg_resources/__init__.py", line 74, in <module>
    __import__('pip._vendor.packaging.requirements')
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/packaging/requirements.py", line 9, in <module>
    from pip._vendor.pyparsing import (
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 4715, in <module>
    _escapedPunc = Word( _bslash, r"\[]-*.$+^?()~ ", exact=2 ).setParseAction(lambda s,l,t:t[0][1])
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 1261, in setParseAction
    self.parseAction = list(map(_trim_arity, list(fns)))
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 1043, in _trim_arity
    this_line = extract_stack(limit=2)[-1]
  File "/home/haypo/prog/bench_python/tmpdir/prefix/lib/python3.5/site-packages/pip/_vendor/pyparsing.py", line 1028, in extract_stack
    return [(frame_summary.filename, frame_summary.lineno)]
AttributeError: 'tuple' object has no attribute 'filename'
---

Note: I get the same error with the pip program itself (e.g. "prefix/bin/pip --version").

Victor
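A guess about issue (2), based only on the traceback above, so treat it as an assumption rather
than a verified diagnosis: the traceback module only grew FrameSummary objects during the 3.5
development cycle, so a CPython built from a 2014 revision still returns plain tuples from
traceback.extract_stack() while reporting itself as 3.5.0a0; pyparsing selects its code path from
the version number and then expects .filename on a tuple. A feature-detecting variant of that
check, as a sketch of one possible workaround:

---
# Sketch of the suspected mismatch: select the code path by feature detection
# instead of by sys.version_info, so an early "3.5.0a0" build that predates
# traceback.FrameSummary still takes the tuple path. Assumption based on the
# traceback above, not verified against those exact 2014 revisions.
import traceback

frames = traceback.extract_stack(limit=2)
last = frames[-1]

if hasattr(last, "filename"):
    # Modern path: FrameSummary objects with named attributes.
    print(last.filename, last.lineno)
else:
    # Old path: plain (filename, lineno, name, line) tuples, which is what a
    # 3.5 pre-alpha built before the traceback rewrite still returns.
    print(last[0], last[1])
---

If that diagnosis is right, pinning an older pip (with an older vendored pyparsing) in the
virtual environment, or patching the vendored pyparsing to feature-detect as above, would be
possible workarounds.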