From fijall at gmail.com Wed Apr 13 14:57:35 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 13 Apr 2016 20:57:35 +0200 Subject: [Speed] Adapting pypy benchmark suite Message-ID: Hi I have a radical idea: to take a pypy benchmark suite, update the libraries to newer ones and replace python benchmarks with that. The main reason being that pypy has a much better coverage of things that are not microbenchmarks, the list (in json): http://paste.pound-python.org/show/4YVq0fv6pv8rVOSmCTag/ Which is much more extensive than this: https://hg.python.org/benchmarks/file/tip/performance I'm willing to put *some* effort, what do people think? Cheers, fijal From zachary.ware+pydev at gmail.com Wed Apr 13 16:33:22 2016 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Wed, 13 Apr 2016 15:33:22 -0500 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: Message-ID: On Wed, Apr 13, 2016 at 1:57 PM, Maciej Fijalkowski wrote: > Hi > > I have a radical idea: to take a pypy benchmark suite, update the > libraries to newer ones and replace python benchmarks with that. The > main reason being that pypy has a much better coverage of things that > are not microbenchmarks, the list (in json): > > http://paste.pound-python.org/show/4YVq0fv6pv8rVOSmCTag/ > > Which is much more extensive than this: > > https://hg.python.org/benchmarks/file/tip/performance > > I'm willing to put *some* effort, what do people think? I'm in favor. My support has two conditions, though: 1) at least a majority of the benchmarks must be Python3 compatible. Preferably 2/3 compatible, but I assume all of the PyPy benchmarks are 2 compatible anyway. 2) and there should be an easy way to run the benchmarks against exactly 1 interpreter (for use with speed.python.org). I initially tried to set up speed.python.org using the PyPy benchmarks, but quickly ran into issues with trying to use 'nullpython.py' as the baseline Python. When I switched to using h.p.o/benchmarks, I added the '--raw' flag to perf.py which allows the benchmarks to be run on one interpreter instead of two. It was just a quick hack, though; I have no problems with that feature completely changing (even invoking it a different way is ok), so long as it exists. This project could probably start its life as github.com/python/benchmarks and save us from having to migrate h.p.o/benchmarks to GitHub. -- Zach From fijall at gmail.com Wed Apr 13 18:00:25 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 14 Apr 2016 00:00:25 +0200 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: Message-ID: On Wed, Apr 13, 2016 at 10:33 PM, Zachary Ware wrote: > On Wed, Apr 13, 2016 at 1:57 PM, Maciej Fijalkowski wrote: >> Hi >> >> I have a radical idea: to take a pypy benchmark suite, update the >> libraries to newer ones and replace python benchmarks with that. The >> main reason being that pypy has a much better coverage of things that >> are not microbenchmarks, the list (in json): >> >> http://paste.pound-python.org/show/4YVq0fv6pv8rVOSmCTag/ >> >> Which is much more extensive than this: >> >> https://hg.python.org/benchmarks/file/tip/performance >> >> I'm willing to put *some* effort, what do people think? > > I'm in favor. My support has two conditions, though: > > 1) at least a majority of the benchmarks must be Python3 compatible. > Preferably 2/3 compatible, but I assume all of the PyPy benchmarks are > 2 compatible anyway. 
The 3-compatible is likely about updating the libs > > 2) and there should be an easy way to run the benchmarks against > exactly 1 interpreter (for use with speed.python.org). I initially > tried to set up speed.python.org using the PyPy benchmarks, but > quickly ran into issues with trying to use 'nullpython.py' as the > baseline Python. When I switched to using h.p.o/benchmarks, I added > the '--raw' flag to perf.py which allows the benchmarks to be run on > one interpreter instead of two. It was just a quick hack, though; I > have no problems with that feature completely changing (even invoking > it a different way is ok), so long as it exists. That is something that we're tackling on "single-run" branch, check it out, I will finish it, maybe that's a good reason to finish it > > This project could probably start its life as > github.com/python/benchmarks and save us from having to migrate > h.p.o/benchmarks to GitHub. > > -- > Zach > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From brett at python.org Thu Apr 14 12:51:12 2016 From: brett at python.org (Brett Cannon) Date: Thu, 14 Apr 2016 16:51:12 +0000 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: Message-ID: On Wed, 13 Apr 2016 at 13:33 Zachary Ware wrote: > On Wed, Apr 13, 2016 at 1:57 PM, Maciej Fijalkowski > wrote: > > Hi > > > > I have a radical idea: to take a pypy benchmark suite, update the > > libraries to newer ones and replace python benchmarks with that. The > > main reason being that pypy has a much better coverage of things that > > are not microbenchmarks, the list (in json): > > > > http://paste.pound-python.org/show/4YVq0fv6pv8rVOSmCTag/ > > > > Which is much more extensive than this: > > > > https://hg.python.org/benchmarks/file/tip/performance > > > > I'm willing to put *some* effort, what do people think? > > I'm in favor. My support has two conditions, though: > > 1) at least a majority of the benchmarks must be Python3 compatible. > Preferably 2/3 compatible, but I assume all of the PyPy benchmarks are > 2 compatible anyway. > Agreed (although I don't care about the 2/3 compatibility, just the 3 compat ;) . > > 2) and there should be an easy way to run the benchmarks against > exactly 1 interpreter (for use with speed.python.org). I initially > tried to set up speed.python.org using the PyPy benchmarks, but > quickly ran into issues with trying to use 'nullpython.py' as the > baseline Python. When I switched to using h.p.o/benchmarks, I added > the '--raw' flag to perf.py which allows the benchmarks to be run on > one interpreter instead of two. It was just a quick hack, though; I > have no problems with that feature completely changing (even invoking > it a different way is ok), so long as it exists. > > This project could probably start its life as > github.com/python/benchmarks and save us from having to migrate > h.p.o/benchmarks to GitHub. > Yep, I'm willing to postpone moving the benchmarks repo from hg.python.org if works starts on this idea and then not move the old repo at all if this succeeds. We could then make the people who care about the benchmarks the maintainers of the new repository and contribute to it directly (which means interested people from CPython, PyPy, Pyston, IronPython, and Jython). That way we all get exposure to everyone's benchmarks and there's no more benchmark fragmentation because some people disagree with another's approach (i.e. 
no more benchmark silos among the Python implementations). And just because we're talking a new repo, would it be worth considering relying on pip to grab libraries instead of embedding them in the repository, hence shrinking the overall size of the repo? -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Thu Apr 14 13:05:17 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 14 Apr 2016 19:05:17 +0200 Subject: [Speed] Adapting pypy benchmark suite References: Message-ID: <20160414190517.073ea4cf@fsol> On Wed, 13 Apr 2016 20:57:35 +0200 Maciej Fijalkowski wrote: > Hi > > I have a radical idea: to take a pypy benchmark suite, update the > libraries to newer ones and replace python benchmarks with that. The > main reason being that pypy has a much better coverage of things that > are not microbenchmarks, the list (in json): So why not consolidate all benchmarks together, instead of throwing away work already done? Regards Antoine. From fijall at gmail.com Thu Apr 14 13:49:37 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 14 Apr 2016 19:49:37 +0200 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: Message-ID: On Thu, Apr 14, 2016 at 6:51 PM, Brett Cannon wrote: > > > On Wed, 13 Apr 2016 at 13:33 Zachary Ware > wrote: >> >> On Wed, Apr 13, 2016 at 1:57 PM, Maciej Fijalkowski >> wrote: >> > Hi >> > >> > I have a radical idea: to take a pypy benchmark suite, update the >> > libraries to newer ones and replace python benchmarks with that. The >> > main reason being that pypy has a much better coverage of things that >> > are not microbenchmarks, the list (in json): >> > >> > http://paste.pound-python.org/show/4YVq0fv6pv8rVOSmCTag/ >> > >> > Which is much more extensive than this: >> > >> > https://hg.python.org/benchmarks/file/tip/performance >> > >> > I'm willing to put *some* effort, what do people think? >> >> I'm in favor. My support has two conditions, though: >> >> 1) at least a majority of the benchmarks must be Python3 compatible. >> Preferably 2/3 compatible, but I assume all of the PyPy benchmarks are >> 2 compatible anyway. > > > Agreed (although I don't care about the 2/3 compatibility, just the 3 compat > ;) . > >> >> >> 2) and there should be an easy way to run the benchmarks against >> exactly 1 interpreter (for use with speed.python.org). I initially >> tried to set up speed.python.org using the PyPy benchmarks, but >> quickly ran into issues with trying to use 'nullpython.py' as the >> baseline Python. When I switched to using h.p.o/benchmarks, I added >> the '--raw' flag to perf.py which allows the benchmarks to be run on >> one interpreter instead of two. It was just a quick hack, though; I >> have no problems with that feature completely changing (even invoking >> it a different way is ok), so long as it exists. >> >> This project could probably start its life as >> github.com/python/benchmarks and save us from having to migrate >> h.p.o/benchmarks to GitHub. > > > Yep, I'm willing to postpone moving the benchmarks repo from hg.python.org > if works starts on this idea and then not move the old repo at all if this > succeeds. We could then make the people who care about the benchmarks the > maintainers of the new repository and contribute to it directly (which means > interested people from CPython, PyPy, Pyston, IronPython, and Jython). 
That > way we all get exposure to everyone's benchmarks and there's no more > benchmark fragmentation because some people disagree with another's approach > (i.e. no more benchmark silos among the Python implementations). > > And just because we're talking a new repo, would it be worth considering > relying on pip to grab libraries instead of embedding them in the > repository, hence shrinking the overall size of the repo? > Both make sense - to benchmark against the latest lib X and to benchmark against a pinned lib X From fijall at gmail.com Thu Apr 14 13:49:56 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 14 Apr 2016 19:49:56 +0200 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: <20160414190517.073ea4cf@fsol> References: <20160414190517.073ea4cf@fsol> Message-ID: On Thu, Apr 14, 2016 at 7:05 PM, Antoine Pitrou wrote: > On Wed, 13 Apr 2016 20:57:35 +0200 > Maciej Fijalkowski > wrote: >> Hi >> >> I have a radical idea: to take a pypy benchmark suite, update the >> libraries to newer ones and replace python benchmarks with that. The >> main reason being that pypy has a much better coverage of things that >> are not microbenchmarks, the list (in json): > > So why not consolidate all benchmarks together, instead of throwing > away work already done? > > Regards > > Antoine. Yeah, you can call it that too. From victor.stinner at gmail.com Sun Apr 24 18:49:20 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 25 Apr 2016 00:49:20 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks Message-ID: Hi, Last months, I spent a lot of time on microbenchmarks. Probably too much time :-) I found a great Linux config to get a much more stable system to get reliable microbenchmarks: https://haypo-notes.readthedocs.org/microbenchmark.html * isolate some CPU cores * force CPU to performance * disable ASLR * block IRQ on isolated CPU cores With such Linux config, the system load doesn't impact benchmark results at all. Last days, I almost lost my mind trying to figure out why a very tiny change in C code makes a difference up to 8% slower. My main issue was to get reliable benchmark since running the same microbenchmark using perf.py gave me "random" results. I finished to run directly the underlying script bm_call_simple.py: taskset -c 7 ./python ../benchmarks/performance/bm_call_simple.py -n 5 --timer perf_counter In a single run, timings of each loop iteration is very stable. Example: 0.22682707803323865 0.22741253697313368 0.227521265973337 0.22750743699725717 0.22752994997426867 0.22753606992773712 0.22742654103785753 0.22750875598285347 0.22752253606449813 0.22718404198531061 Problem: each new run gives a different result. Example: * run 1: 0.226... * run 2: 0.255... * run 3: 0.248... * run 4: 0.258... * etc. I saw 3 groups of values: ~0.226, ~0.248, ~0.255. I didn't understand how running the same program can give so different result. The reply is the randomization of the Python hash function. Aaaaaaah! The last source of entropy in my microbenchmark! The performance difference can be seen by forcing a specific hash function: PYTHONHASHSEED=2 => 0.254... PYTHONHASHSEED=1 => 0.246... PYTHONHASHSEED=5 => 0.228... Sadly, perf.py and timeit don't disable hash randomization for me. I hacked perf.py to set PYTHONHASHSEED=0 and magically the result became super stable! 
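(For illustration only -- this is not the actual perf.py patch, just a minimal
sketch of what "set PYTHONHASHSEED=0 for the child interpreter" can look like;
the helper name and arguments are hypothetical:)

    import os, subprocess

    def run_benchmark(python, script):
        # Copy the parent environment and pin the hash seed so every run
        # of the child interpreter uses the same hash function.
        env = dict(os.environ, PYTHONHASHSEED="0")
        # "python" is the interpreter under test, "script" the bm_*.py file.
        subprocess.check_call([python, script], env=env)
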
Multiple runs of the command: $ taskset_isolated.py python3 perf.py ../default/python-ref ../default/python -b call_simple --fast Outputs: ### call_simple ### Min: 0.232621 -> 0.247904: 1.07x slower Avg: 0.232628 -> 0.247941: 1.07x slower Significant (t=-591.78) Stddev: 0.00001 -> 0.00010: 13.7450x larger ### call_simple ### Min: 0.232619 -> 0.247904: 1.07x slower Avg: 0.232703 -> 0.247955: 1.07x slower Significant (t=-190.58) Stddev: 0.00029 -> 0.00011: 2.6336x smaller ### call_simple ### Min: 0.232621 -> 0.247903: 1.07x slower Avg: 0.232629 -> 0.247918: 1.07x slower Significant (t=-5896.14) Stddev: 0.00001 -> 0.00001: 1.3350x larger Even with --fast, the result is *very* stable. See the very good standard deviation. In 3 runs, I got exactly the same "1.07x". Average timings are the same +/-1 up to 4 digits! No need to use the ultra slow --rigourous option. This option is probably designed to hide the noise of a very unstable system. But using my Linux config, it doesn't seem to be needed anymore, at least on this very specific microbenchmark. Ok, now I can investigate why my change on the C code introduced a performance regression :-D Victor From fijall at gmail.com Mon Apr 25 02:25:20 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 25 Apr 2016 08:25:20 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hi Victor The problem with disabled ASLR is that you change the measurment from a statistical distribution, to one draw from a statistical distribution repeatedly. There is no going around doing multiple runs and doing an average on that. Essentially for the same reason why using min is much worse than using average, with ASLR say you get: 2.0+-0.3 which we run 5 times and 1.8, 1.9, 2.2, 2.1, 2.1 now if you disable ASLR, you get one draw repeated 5 times, which might be 2.0, but also might be 1.8, 5 times. That just hides the problem, but does not actually fix it (because if you touch something, stuff might be allocated in a different order and then you get a different draw) On Mon, Apr 25, 2016 at 12:49 AM, Victor Stinner wrote: > Hi, > > Last months, I spent a lot of time on microbenchmarks. Probably too > much time :-) I found a great Linux config to get a much more stable > system to get reliable microbenchmarks: > https://haypo-notes.readthedocs.org/microbenchmark.html > > * isolate some CPU cores > * force CPU to performance > * disable ASLR > * block IRQ on isolated CPU cores > > With such Linux config, the system load doesn't impact benchmark results at all. > > Last days, I almost lost my mind trying to figure out why a very tiny > change in C code makes a difference up to 8% slower. > > My main issue was to get reliable benchmark since running the same > microbenchmark using perf.py gave me "random" results. > > I finished to run directly the underlying script bm_call_simple.py: > > taskset -c 7 ./python ../benchmarks/performance/bm_call_simple.py -n 5 > --timer perf_counter > > In a single run, timings of each loop iteration is very stable. Example: > > 0.22682707803323865 > 0.22741253697313368 > 0.227521265973337 > 0.22750743699725717 > 0.22752994997426867 > 0.22753606992773712 > 0.22742654103785753 > 0.22750875598285347 > 0.22752253606449813 > 0.22718404198531061 > > Problem: each new run gives a different result. Example: > > * run 1: 0.226... > * run 2: 0.255... > * run 3: 0.248... > * run 4: 0.258... > * etc. > > I saw 3 groups of values: ~0.226, ~0.248, ~0.255. 
> > I didn't understand how running the same program can give so different > result. The reply is the randomization of the Python hash function. > Aaaaaaah! The last source of entropy in my microbenchmark! > > The performance difference can be seen by forcing a specific hash function: > > PYTHONHASHSEED=2 => 0.254... > PYTHONHASHSEED=1 => 0.246... > PYTHONHASHSEED=5 => 0.228... > > Sadly, perf.py and timeit don't disable hash randomization for me. I > hacked perf.py to set PYTHONHASHSEED=0 and magically the result became > super stable! > > Multiple runs of the command: > > $ taskset_isolated.py python3 perf.py ../default/python-ref > ../default/python -b call_simple --fast > > Outputs: > > ### call_simple ### > Min: 0.232621 -> 0.247904: 1.07x slower > Avg: 0.232628 -> 0.247941: 1.07x slower > Significant (t=-591.78) > Stddev: 0.00001 -> 0.00010: 13.7450x larger > > ### call_simple ### > Min: 0.232619 -> 0.247904: 1.07x slower > Avg: 0.232703 -> 0.247955: 1.07x slower > Significant (t=-190.58) > Stddev: 0.00029 -> 0.00011: 2.6336x smaller > > ### call_simple ### > Min: 0.232621 -> 0.247903: 1.07x slower > Avg: 0.232629 -> 0.247918: 1.07x slower > Significant (t=-5896.14) > Stddev: 0.00001 -> 0.00001: 1.3350x larger > > Even with --fast, the result is *very* stable. See the very good > standard deviation. In 3 runs, I got exactly the same "1.07x". Average > timings are the same +/-1 up to 4 digits! > > No need to use the ultra slow --rigourous option. This option is > probably designed to hide the noise of a very unstable system. But > using my Linux config, it doesn't seem to be needed anymore, at least > on this very specific microbenchmark. > > Ok, now I can investigate why my change on the C code introduced a > performance regression :-D > > Victor > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From victor.stinner at gmail.com Mon Apr 25 03:52:39 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 25 Apr 2016 09:52:39 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hi, The problem with not cheating on the statistical distribution is that I would like to get a quick feedback on my changes to know if my change is faster or not. Having to wait 1 hour to check a single change is not really convenient. I prefer to get a feedback is less than 5 minutes. Tune Linux and disable random hash allows to run less iterations and so take less time. If you don't tune anything, you need *a lot* of iterations to reduce the "noise" of the benchmark. 2016-04-25 8:25 GMT+02:00 Maciej Fijalkowski : > The problem with disabled ASLR is that you change the measurment from > a statistical distribution, to one draw from a statistical > distribution repeatedly. The problem is that perf.py only runs one process per benchmark and per Python binary. Let's that the binary A is run with no hash collision, all dict access succeed at the first iteration, whereas the binary B runs with many "hash collision" so get worse performance. Is it fair to compare only these two specific runs? What can we conclude from the result? Note: The dict type of CPython uses open addressing, so even if two keys get the different hash value, a dict lookup may need more than one iteration to retrieve a dict entry. Right now, using ASLR and randomized hash function with perf.py is not fair. I'm talking about only running perf.py once. 
I'm unable to combine/compare manually multiple runs of perf.py. Getting multiple different results is very confusing for me. If you want to use ASRL and hash randomization, perf.py must be modified to run multiple processes (sequentially) to get a better statistical distribution. No? I don't know how many processes do we have to run. Accoding to my quick analysis there are 3 different cases, the strict minimum would be to run 3 processes for my specific case. For bm_call_simple, the number of loops is 15 for fast, 150 by default, 300 using rigorous. Maybe we should only run one loop iteration per process? FYI right now, I'm not using perf.py to prove that my patch makes CPython faster, but more to analyze why my change makes CPython slower :-) It *looks* like Python function calls are between 2 and 7% slower, but it also looks that bm_call_simple is an unstable microbenchmark :-( > There is no going around doing multiple runs > and doing an average on that. Essentially for the same reason why > using min is much worse than using average, with ASLR say you get: > 2.0+-0.3 which we run 5 times and 1.8, 1.9, 2.2, 2.1, 2.1 now if you > disable ASLR, you get one draw repeated 5 times, which might be 2.0, > but also might be 1.8, 5 times. That just hides the problem, but does > not actually fix it (because if you touch something, stuff might be > allocated in a different order and then you get a different draw) My practical problem is to get reliable benchmark. If you don't tune Linux and don't disable hash randomization, the results look "random". Example of output without tuning. I ran "python3 perf.py ../default/python-revert ../default/python-commit -b call_simple -v --fast" 3 times. [ fast, 15 iterations ] ### call_simple ### Min: 0.235318 -> 0.247203: 1.05x slower Avg: 0.237601 -> 0.251384: 1.06x slower Significant (t=-6.32) Stddev: 0.00214 -> 0.00817: 3.8069x larger ### call_simple ### Min: 0.234191 -> 0.247109: 1.06x slower Avg: 0.234660 -> 0.247480: 1.05x slower Significant (t=-36.14) Stddev: 0.00102 -> 0.00093: 1.0967x smaller ### call_simple ### Min: 0.235790 -> 0.247089: 1.05x slower Avg: 0.238978 -> 0.247562: 1.04x slower Significant (t=-9.38) Stddev: 0.00342 -> 0.00094: 3.6504x smaller You ask to ignore the Min line. Ok. But the average line say 1.04, 1.05 and 1.06x slower. Which one is the "good" result? :-) Usually, the difference is much larger like between 1.02x slower and 1.07x slower. [ rigorous, 300 iterations ] The --fast option of perf.py is just a toy, right? Serious dev must use the super slow --rigorous mode! Ok, let's try it. ### call_simple ### Min: 0.234102 -> 0.248098: 1.06x slower Avg: 0.236218 -> 0.254318: 1.08x slower Significant (t=-30.32) Stddev: 0.00561 -> 0.00869: 1.5475x larger ### call_simple ### Min: 0.234109 -> 0.248024: 1.06x slower Avg: 0.240069 -> 0.255194: 1.06x slower Significant (t=-15.21) Stddev: 0.01126 -> 0.01304: 1.1584x larger ### call_simple ### Min: 0.235272 -> 0.248225: 1.06x slower Avg: 0.244106 -> 0.258349: 1.06x slower Significant (t=-13.27) Stddev: 0.00830 -> 0.01663: 2.0053x larger Again, I ignore the Min line. Average: hum... is it 1.06x or 1.08x slower? For me, it's not really the same :-/ In the reference Python, the average timing changes between 0.236 and 0.244 in the 3 runs. Since the difference between the reference and the patched Python is tiny, the stability of the microbenchmark matters. FYI I'm running benchmarks on my desktop PC. 
It's more convenient to run benchmarks locally than transfering changes to an hypotetical dedicated benchmark server (tuned to run more reliable (micro)benchmarks). Since benchmarks are slow, I'm doing something else while the benchmark is running. Victor From arigo at tunes.org Tue Apr 26 04:56:59 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 26 Apr 2016 10:56:59 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hi, On 25 April 2016 at 08:25, Maciej Fijalkowski wrote: > The problem with disabled ASLR is that you change the measurment from > a statistical distribution, to one draw from a statistical > distribution repeatedly. There is no going around doing multiple runs > and doing an average on that. You should mention that it is usually enough to do the following: instead of running once with PYTHONHASHSEED=0, run five or ten times with PYTHONHASHSEED in range(5 or 10). In this way, you get all benefits: not-too-long benchmarking, no randomness, but still some statistically relevant sampling. A bient?t, Armin. From anto.cuni at gmail.com Tue Apr 26 05:01:06 2016 From: anto.cuni at gmail.com (Antonio Cuni) Date: Tue, 26 Apr 2016 11:01:06 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hello Victor, On Mon, Apr 25, 2016 at 12:49 AM, Victor Stinner wrote: > Hi, > > Last months, I spent a lot of time on microbenchmarks. Probably too > much time :-) I found a great Linux config to get a much more stable > system to get reliable microbenchmarks: > https://haypo-notes.readthedocs.org/microbenchmark.html > > * isolate some CPU cores > you might be interested in cpusets and the cset utility: in theory, they allow you to isolate one CPU without having to reboot to change the kernel parameters: http://skebanga.blogspot.it/2012/06/cset-shield-easily-configure-cpusets.html https://github.com/lpechacek/cpuset ? ?However, I never did a scientific comparison between cpusets and isolcpu to see if the former behaves exactly like the latter. ?ciao, Anto? -------------- next part -------------- An HTML attachment was scrubbed... URL: From anto.cuni at gmail.com Tue Apr 26 05:03:14 2016 From: anto.cuni at gmail.com (Antonio Cuni) Date: Tue, 26 Apr 2016 11:03:14 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hi Armin, On Tue, Apr 26, 2016 at 10:56 AM, Armin Rigo wrote: > Hi, > > On 25 April 2016 at 08:25, Maciej Fijalkowski wrote: > > The problem with disabled ASLR is that you change the measurment from > > a statistical distribution, to one draw from a statistical > > distribution repeatedly. There is no going around doing multiple runs > > and doing an average on that. > > You should mention that it is usually enough to do the following: > instead of running once with PYTHONHASHSEED=0, run five or ten times > with PYTHONHASHSEED in range(5 or 10). In this way, you get all > benefits: not-too-long benchmarking, no randomness, but still some > statistically relevant sampling. > ?note that here there are two sources of "randomness": one is PYTHONHASHSEED (which you can control with the env variable), the other is ASLR? which, AFAIK, you cannot control in the same fine way: you can only either enable or disable it. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From victor.stinner at gmail.com Tue Apr 26 05:46:49 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 26 Apr 2016 11:46:49 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hi, 2016-04-26 10:56 GMT+02:00 Armin Rigo : > Hi, > > On 25 April 2016 at 08:25, Maciej Fijalkowski wrote: >> The problem with disabled ASLR is that you change the measurment from >> a statistical distribution, to one draw from a statistical >> distribution repeatedly. There is no going around doing multiple runs >> and doing an average on that. > > You should mention that it is usually enough to do the following: > instead of running once with PYTHONHASHSEED=0, run five or ten times > with PYTHONHASHSEED in range(5 or 10). In this way, you get all > benefits: not-too-long benchmarking, no randomness, but still some > statistically relevant sampling. I guess that the number of required runs to get a nice distribution depends on the size of the largest dictionary in the benchmark. I mean, the dictionaries that matter in performance. The best would be to handle this transparently in perf.py. Either disable all source of randomness, or run mutliple processes to have an uniform distribution, rather than on only having one sample for one specific config. Maybe it could be an option: by default, run multiple processes, but have an option to only run one process using PYTHONHASHSEED=0. By the way, timeit has a very similar issue. I'm quite sure that most Python developers run "python -m timeit ..." at least 3 times and take the minimum. "python -m timeit" could maybe be modified to also spawn child processes to get a better distribution, and maybe also modified to display the minimum, the average and the standard deviation? (not only the minimum) Well, the question is also if it's a good thing to have such really tiny microbenchmark like bm_call_simple in the Python benchmark suite. I spend 2 or 3 days to analyze CPython running bm_call_simple with Linux perf tool, callgrind and cachegrind. I'm still unable to understand the link between my changes on the C code and the result. IMHO this specific benchmark depends on very low-level things like the CPU L1 cache. Maybe bm_call_simple helps in some very specific use cases, like trying to make Python function calls faster. But in other cases, it can be a source of noise, confusion and frustration... Victor From victor.stinner at gmail.com Tue Apr 26 06:06:58 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 26 Apr 2016 12:06:58 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: Hi, 2016-04-26 11:01 GMT+02:00 Antonio Cuni : > On Mon, Apr 25, 2016 at 12:49 AM, Victor Stinner > wrote: >> Last months, I spent a lot of time on microbenchmarks. Probably too >> much time :-) I found a great Linux config to get a much more stable >> system to get reliable microbenchmarks: >> https://haypo-notes.readthedocs.org/microbenchmark.html >> >> * isolate some CPU cores > > you might be interested in cpusets and the cset utility: in theory, they > allow you to isolate one CPU without having to reboot to change the kernel > parameters: > > http://skebanga.blogspot.it/2012/06/cset-shield-easily-configure-cpusets.html > https://github.com/lpechacek/cpuset Ah, I didn't know this tool. Basically, it looks similar to the Linux isolcpus command line parameter, but done in userpace. 
I see an advantage, it can be used temporary without having to reboot the kernel. > However, I never did a scientific comparison between cpusets and isolcpu to > see if the former behaves exactly like the latter. I have a simple test: * run a benchmark when the system is idle * run a benchmark when the system is *very* busy (ex: system load > 5) Using CPU isolation + nohz_full + blocking IRQ on isolated CPUs, the benchmark result is the *same* in two cases. Try on a Linux without any specific config to see a huge difference. For example, performance divided by two. I'm using CPU isolation to be able to run benchmarks while I'm still working on my PC: use firefox, thunderbird, run heavy unit tests, compile C code, etc. Right code, I dedicated 2 physical cores to benchmarks and kept 2 physical cores for regular work. Maybe it's too much. It looks like almost all benchmarks only use logical core in practice (whereas 2 physical cores give me 4 logical cores). Next time I will probably only dedicate 1 physical core. The advantage of having two dedicated physical cores is to be able to run two "isolated" benchmarks in parallel ;-) I wrote a simple tool to get a system load larger than a minimum: https://bitbucket.org/haypo/misc/src/tip/bin/system_load.py I also started to write a script to configure a system for CPU isolation: https://bitbucket.org/haypo/misc/src/tip/bin/isolcpus.py * Block IRQ on isolated CPu cores * Disable ASLR * Force performance CPU speed on isolated cores, but not on other cores. I don't want to burn my PC :-) Intel P-state is still enabled on all CPU cores, so the power state of isolated cores still change dynamically in practice. You can see it using powertop for example. CPU isolation is not perfect, you still have random source of noises. There are also System Management Interrupt (SMI) and other low-level things. I hope that running multiple iterations of the benchmark is be enough to reduce (or remove) other sources of noise. By the way, search "Linux realtime" to find good information about "sources of noise" on Linux. Example: https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application#Hardware Hopefully, my requirements on timing are more cool than hard realtime ;-) Victor From fijall at gmail.com Tue Apr 26 12:28:32 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 26 Apr 2016 18:28:32 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: References: Message-ID: On Tue, Apr 26, 2016 at 11:46 AM, Victor Stinner wrote: > Hi, > > 2016-04-26 10:56 GMT+02:00 Armin Rigo : >> Hi, >> >> On 25 April 2016 at 08:25, Maciej Fijalkowski wrote: >>> The problem with disabled ASLR is that you change the measurment from >>> a statistical distribution, to one draw from a statistical >>> distribution repeatedly. There is no going around doing multiple runs >>> and doing an average on that. >> >> You should mention that it is usually enough to do the following: >> instead of running once with PYTHONHASHSEED=0, run five or ten times >> with PYTHONHASHSEED in range(5 or 10). In this way, you get all >> benefits: not-too-long benchmarking, no randomness, but still some >> statistically relevant sampling. > > I guess that the number of required runs to get a nice distribution > depends on the size of the largest dictionary in the benchmark. I > mean, the dictionaries that matter in performance. > > The best would be to handle this transparently in perf.py. 
Either > disable all source of randomness, or run mutliple processes to have an > uniform distribution, rather than on only having one sample for one > specific config. Maybe it could be an option: by default, run multiple > processes, but have an option to only run one process using > PYTHONHASHSEED=0. > > By the way, timeit has a very similar issue. I'm quite sure that most > Python developers run "python -m timeit ..." at least 3 times and take > the minimum. "python -m timeit" could maybe be modified to also spawn > child processes to get a better distribution, and maybe also modified > to display the minimum, the average and the standard deviation? (not > only the minimum) taking the minimum is a terrible idea anyway, none of the statistical discussion makes sense if you do that > > Well, the question is also if it's a good thing to have such really > tiny microbenchmark like bm_call_simple in the Python benchmark suite. > I spend 2 or 3 days to analyze CPython running bm_call_simple with > Linux perf tool, callgrind and cachegrind. I'm still unable to > understand the link between my changes on the C code and the result. > IMHO this specific benchmark depends on very low-level things like the > CPU L1 cache. Maybe bm_call_simple helps in some very specific use > cases, like trying to make Python function calls faster. But in other > cases, it can be a source of noise, confusion and frustration... > > Victor maybe it's just a terrible benchmark (it surely is for pypy for example) From solipsis at pitrou.net Tue Apr 26 12:36:34 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 26 Apr 2016 18:36:34 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks References: Message-ID: <20160426183634.007a161d@fsol> On Tue, 26 Apr 2016 18:28:32 +0200 Maciej Fijalkowski wrote: > > taking the minimum is a terrible idea anyway, none of the statistical > discussion makes sense if you do that The minimum is a reasonable metric for quick throwaway benchmarks as timeit is designed for, as it has a better hope of alleviating the impact of system load (as such throwaway benchmarks are often run on the developer's workstation). For a persistent benchmarks suite, where we can afford longer benchmark runtimes and are able to keep system noise to a minimum, we might prefer another metric. Regards Antoine. From fijall at gmail.com Tue Apr 26 13:21:10 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 26 Apr 2016 19:21:10 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: <20160426183634.007a161d@fsol> References: <20160426183634.007a161d@fsol> Message-ID: On Tue, Apr 26, 2016 at 6:36 PM, Antoine Pitrou wrote: > On Tue, 26 Apr 2016 18:28:32 +0200 > Maciej Fijalkowski > wrote: >> >> taking the minimum is a terrible idea anyway, none of the statistical >> discussion makes sense if you do that > > The minimum is a reasonable metric for quick throwaway benchmarks as > timeit is designed for, as it has a better hope of alleviating the > impact of system load (as such throwaway benchmarks are often run on > the developer's workstation). > > For a persistent benchmarks suite, where we can afford longer > benchmark runtimes and are able to keep system noise to a minimum, we > might prefer another metric. > > Regards > > Antoine. No, it's not Antoine. Minimum is not better than one random measurment. We had this discussion before, but you guys are happily dismissing all the papers written on the subject. 
It *does* get rid of random system stuff, but it *also* does get rid of all the effects related to gc/malloc/caches and infinite details that are not working in the same predictable fashion. From victor.stinner at gmail.com Tue Apr 26 15:11:40 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 26 Apr 2016 21:11:40 +0200 Subject: [Speed] Disable hash randomization to get reliable benchmarks In-Reply-To: <20160426183634.007a161d@fsol> References: <20160426183634.007a161d@fsol> Message-ID: 2016-04-26 18:36 GMT+02:00 Antoine Pitrou : > The minimum is a reasonable metric for quick throwaway benchmarks as > timeit is designed for, as it has a better hope of alleviating the > impact of system load (as such throwaway benchmarks are often run on > the developer's workstation). IMHO we must at least display the standard deviation. Maybe we can do better and provide 4 numbers: * Average * Standard deviation * Minimum * Maximum The maximum helps to detect rare events like Maciej said (something in the OS, GC collection, etc.). For example, we can use this format: Average: 293.5 ms +/- 143.2 ms (min: 213.9 ms, max: 629.7 ms) It's the result of still the same microbenchmark, bm_call_simple.py, run on my laptop. As you can see, there is a large deviation: 143 ms / 293 ms is 49%, the benchmark is unstable. Maybe we should say explicitly that the result is not significant? Example: Average: 293.5 ms +/- 143.2 ms (min: 213.9 ms, max: 629.7 ms) -- not significant The benchmark is unstable, maybe the system is heavily loaded? By the way, "293.5 ms +/- 143.2 ms" is misleading. Maybe we should display it as "0.3 sec +/- 0.1 sec" to not show inaccurate digits? Another example, same laptop but using CPU isolation: Average: 219.5 ms +/- 1.6 ms (min: 215.9 ms, max: 223.8 ms) In this example, we can see that "+/- 1.6" is is the standard deviation, it's unrelated to minimum and maximum. Victor From victor.stinner at gmail.com Wed Apr 27 11:06:06 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 27 Apr 2016 17:06:06 +0200 Subject: [Speed] When CPython performance depends on dead code... Message-ID: Hi, I'm working on an experimental change of CPython introducing a new "fast call" calling convention for Python and C functions. It pass an array of PyObject* and a number of arguments as an C int (PyObject **stack, int nargs) instead of using a temporary tuple (PyObject *args). The expectation is that avoiding the creation makes Python faster. http://bugs.python.org/issue26814 First microbenchmarks on optimized code are promising: between 18% and 44% faster. http://bugs.python.org/issue26814#msg263999 http://bugs.python.org/issue26814#msg264003 But I was quickly blocked on "macrobenchmarks" (?): running the Python benchmark suite says that many benchmarks are between 2% and 15% slower. I spent hours (days) to investigate the issue using Cachegrind, Callgrind, Linux perf, strace, ltrace, etc., but I was unable to understand how my change can makes CPython slower. My change is quite big: "34 files changed, 3301 insertions(+), 730 deletions(-)". In fact, the performance regression can be reproduced easily with a few lines of C code: see attached patches. You only have to add some *unused* (dead) code to see a "glitch" in performance. It's even worse: the performance change depends on the size of unused code. I done my best to isolate the microbenchmark to make it as reliable as possible. 
Results of bm_call_simple on my desktop PC: (a) Reference: Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms) (b) Add 2 unused functions, based on (a): Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms) (c) Add 1 unused short function ("return NULL;"), based on (a): Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms) (b) and (c) are 2 versions only adding unused code to (a). The difference between (b) and (c) is the size of unused code. The problem is that (b) makes the code slower and (c) makes the code faster (!), whereas I would not expect any performance change. A sane person should ignore such minor performance delta (+72 ms = +6% // -31.4 ms = -3%). Right. But for optimization patches on CPython, we use the CPython benchmark suite as a proof that yeah, the change really makes CPython faster, as announced. I compiled the C code using GCC (5.3) and Clang (3.7) using various options: -O0, -O3, -fno-align-functions, -falign-functions=N (with N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc. In short, the performance looks "random". I'm unable to correlate the performance with any Linux perf event. IMHO the performance depends on something low level like L1 cache, CPU pipeline, branch prediction, etc. As I wrote, I'm unable to verify that. To reproduce my issue, you can use the following commands: --------------------------- hg clone https://hg.python.org/cpython fastcall # or: "hg clone (...)/cpython fastcall" # if you already have a local copy of cpython ;-) cd fastcall ./configure -C # build reference binary hg up -C -r 496e094f4734 patch -p1 < prepare.patch make && mv python python-ref # build binary with deadcode 1 hg up -C -r 496e094f4734 patch -p1 < prepare.patch patch -p1 < deadcode1.patch make && mv python python-deadcode1 # build binary with deadcode 2 hg up -C -r 496e094f4734 patch -p1 < prepare.patch patch -p1 < deadcode2.patch make && mv python python-deadcode2 # run benchmark PYTHONHASHSEED=0 ./python-ref bm_call_simple.py PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py --------------------------- It suggest you to isolate at least one CPU and run the benchmark on isolated CPUs to get reliable timings: --------------------------- # run benchmark on the CPU #2 PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py --------------------------- My notes on CPU isolation: http://haypo-notes.readthedocs.org/microbenchmark.html If you don't want to try CPU isolation, try to get an idle system and/or run the benchmark many times until the standard deviation (the "+/- ..." part) looks small enough... Don't try to run the microbenchmark without PYTHONHASHSEED=0 or you will get random results depending on the secret hash key used by the randomized hash function. (Or modify the code to spawn enough child process to get an uniform distribution ;-)) I don't expect that you get the same numbers than me. 
For example, on my laptop, the delta is very small (+/- 1%): $ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms) $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms) => +1% (+13 ms) $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms) => -2% (-24 ms) CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4 physical cores with hyper-threading CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2 physical cores with hyper-threading I modified bm_call_simple.py to call foo() 100 times rather than 20 in the loop to see the issue more easily. I also removed dependencies and changed the output format to display average, standard deviation, minimum and maximum. For more benchmarks, see attached deadcode1.log and deadcode2.log: results of the CPython benchmark to compare deadcode1 VS reference, and deadcode2 VS reference run on my desktop PC (perf.py --fast & CPU isolation). Again, deadcode1 looks slower in most cases, whereas deadcode2 looks faster in most cases, whereas the difference is still dead code... Victor, disappointed -------------- next part -------------- A non-text attachment was scrubbed... Name: prepare.patch Type: text/x-patch Size: 5293 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: deadcode1.patch Type: text/x-patch Size: 3318 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: deadcode2.patch Type: text/x-patch Size: 1229 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bm_call_simple.py Type: text/x-python Size: 4298 bytes Desc: not available URL: From victor.stinner at gmail.com Wed Apr 27 11:07:34 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 27 Apr 2016 17:07:34 +0200 Subject: [Speed] When CPython performance depends on dead code... In-Reply-To: References: Message-ID: > For more benchmarks, see attached deadcode1.log and deadcode2.log: > results of the CPython benchmark to compare deadcode1 VS reference, > and deadcode2 VS reference run on my desktop PC (perf.py --fast & CPU > isolation). Again, deadcode1 looks slower in most cases, whereas > deadcode2 looks faster in most cases, whereas the difference is still > dead code... Sorry, I forgot to attach these two files. They are now attached to this new email. Victor -------------- next part -------------- A non-text attachment was scrubbed... Name: deadcode1.log Type: application/binary Size: 17183 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: deadcode2.log Type: application/binary Size: 14486 bytes Desc: not available URL: From brett at python.org Wed Apr 27 14:30:23 2016 From: brett at python.org (Brett Cannon) Date: Wed, 27 Apr 2016 18:30:23 +0000 Subject: [Speed] When CPython performance depends on dead code... In-Reply-To: References: Message-ID: My first intuition is some cache somewhere is unhappy w/ the varying sizes. Have you tried any of this on another machine to see if the results are consistent? On Wed, 27 Apr 2016 at 08:06 Victor Stinner wrote: > Hi, > > I'm working on an experimental change of CPython introducing a new > "fast call" calling convention for Python and C functions. 
It pass an > array of PyObject* and a number of arguments as an C int (PyObject > **stack, int nargs) instead of using a temporary tuple (PyObject > *args). The expectation is that avoiding the creation makes Python > faster. > http://bugs.python.org/issue26814 > > First microbenchmarks on optimized code are promising: between 18% and > 44% faster. > http://bugs.python.org/issue26814#msg263999 > http://bugs.python.org/issue26814#msg264003 > > But I was quickly blocked on "macrobenchmarks" (?): running the Python > benchmark suite says that many benchmarks are between 2% and 15% > slower. I spent hours (days) to investigate the issue using > Cachegrind, Callgrind, Linux perf, strace, ltrace, etc., but I was > unable to understand how my change can makes CPython slower. > > My change is quite big: "34 files changed, 3301 insertions(+), 730 > deletions(-)". In fact, the performance regression can be reproduced > easily with a few lines of C code: see attached patches. You only have > to add some *unused* (dead) code to see a "glitch" in performance. > It's even worse: the performance change depends on the size of unused > code. > > I done my best to isolate the microbenchmark to make it as reliable as > possible. Results of bm_call_simple on my desktop PC: > > (a) Reference: > Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms) > > (b) Add 2 unused functions, based on (a): > Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms) > > (c) Add 1 unused short function ("return NULL;"), based on (a): > Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms) > > (b) and (c) are 2 versions only adding unused code to (a). The > difference between (b) and (c) is the size of unused code. The problem > is that (b) makes the code slower and (c) makes the code faster (!), > whereas I would not expect any performance change. > > A sane person should ignore such minor performance delta (+72 ms = +6% > // -31.4 ms = -3%). Right. But for optimization patches on CPython, > we use the CPython benchmark suite as a proof that yeah, the change > really makes CPython faster, as announced. > > I compiled the C code using GCC (5.3) and Clang (3.7) using various > options: -O0, -O3, -fno-align-functions, -falign-functions=N (with > N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc. In short, the > performance looks "random". I'm unable to correlate the performance > with any Linux perf event. IMHO the performance depends on something > low level like L1 cache, CPU pipeline, branch prediction, etc. As I > wrote, I'm unable to verify that. 
> > To reproduce my issue, you can use the following commands: > --------------------------- > hg clone https://hg.python.org/cpython fastcall > # or: "hg clone (...)/cpython fastcall" > # if you already have a local copy of cpython ;-) > cd fastcall > ./configure -C > > # build reference binary > hg up -C -r 496e094f4734 > patch -p1 < prepare.patch > make && mv python python-ref > > # build binary with deadcode 1 > hg up -C -r 496e094f4734 > patch -p1 < prepare.patch > patch -p1 < deadcode1.patch > make && mv python python-deadcode1 > > # build binary with deadcode 2 > hg up -C -r 496e094f4734 > patch -p1 < prepare.patch > patch -p1 < deadcode2.patch > make && mv python python-deadcode2 > > # run benchmark > PYTHONHASHSEED=0 ./python-ref bm_call_simple.py > PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py > PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py > --------------------------- > > It suggest you to isolate at least one CPU and run the benchmark on > isolated CPUs to get reliable timings: > --------------------------- > # run benchmark on the CPU #2 > PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py > PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py > PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py > --------------------------- > My notes on CPU isolation: > http://haypo-notes.readthedocs.org/microbenchmark.html > > If you don't want to try CPU isolation, try to get an idle system > and/or run the benchmark many times until the standard deviation (the > "+/- ..." part) looks small enough... > > Don't try to run the microbenchmark without PYTHONHASHSEED=0 or you > will get random results depending on the secret hash key used by the > randomized hash function. (Or modify the code to spawn enough child > process to get an uniform distribution ;-)) > > I don't expect that you get the same numbers than me. For example, on > my laptop, the delta is very small (+/- 1%): > > $ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py > Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms) > > $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py > Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms) > => +1% (+13 ms) > > $ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py > Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms) > => -2% (-24 ms) > > CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4 > physical cores with hyper-threading > CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2 > physical cores with hyper-threading > > I modified bm_call_simple.py to call foo() 100 times rather than 20 in > the loop to see the issue more easily. I also removed dependencies and > changed the output format to display average, standard deviation, > minimum and maximum. > > For more benchmarks, see attached deadcode1.log and deadcode2.log: > results of the CPython benchmark to compare deadcode1 VS reference, > and deadcode2 VS reference run on my desktop PC (perf.py --fast & CPU > isolation). Again, deadcode1 looks slower in most cases, whereas > deadcode2 looks faster in most cases, whereas the difference is still > dead code... > > Victor, disappointed > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From victor.stinner at gmail.com Thu Apr 28 04:27:11 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 28 Apr 2016 10:27:11 +0200 Subject: [Speed] When CPython performance depends on dead code... In-Reply-To: References: Message-ID: Hi, 2016-04-27 20:30 GMT+02:00 Brett Cannon : > My first intuition is some cache somewhere is unhappy w/ the varying sizes. > Have you tried any of this on another machine to see if the results are > consistent? On my laptop, the performance when I add deadcode doesn't seem to change much: the delta is smaller than 1%. I found a fix for my deadcode issue! Use "make profile-opt" rather than "make". Using PGO, GCC reorders hot functions to make them closer. I also read that it records statistics on branches to emit first the most frequent branch. I also modified bm_call_simple.py to use multiple processes and to use random hash seeds, rather than using a single process and disabling hash randomization. Comparison reference => fastcall (my whole fork, not just the tiny patches adding deadcode) using make (gcc -O3): Average: 1183.5 ms +/- 6.1 ms (min: 1173.3 ms, max: 1201.9 ms) - 15 processes x 5 loops => Average: 1121.2 ms +/- 7.4 ms (min: 1106.5 ms, max: 1142.0 ms) - 15 processes x 5 loops Comparison reference => fastcall using make profile-opt (PGO): Average: 962.7 ms +/- 17.8 ms (min: 952.6 ms, max: 998.6 ms) - 15 processes x 5 loops => Average: 961.1 ms +/- 18.6 ms (min: 949.0 ms, max: 1011.3 ms) - 15 processes x 5 loops Using make, fastcall *seems* to be faster, but in fact it looks more like random noise of deadcode. Using PGO, fastcall doesn't change performance at all. I expected fastcall to be faster, but it's the purpose of benchmarks: get real performance, not expectations :-) Next step: modify most benchmarks of perf.py to run multiple processes rather than a single process to test using multiple hash seeds. Victor
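
(For illustration, a minimal sketch of the "multiple child processes, multiple
hash seeds" approach discussed in this thread, reporting average, standard
deviation, minimum and maximum. It is not the real perf.py code: the benchmark
script name and the assumption that the script prints a single timing in
seconds are hypothetical.)

    import os, statistics, subprocess

    def bench(python, script="bm_call_simple.py", seeds=range(10)):
        # Run the benchmark once per hash seed, each time in a fresh child
        # interpreter, and collect one timing (in seconds) per run.
        timings = []
        for seed in seeds:
            env = dict(os.environ, PYTHONHASHSEED=str(seed))
            out = subprocess.check_output([python, script], env=env)
            # Assumes the benchmark script prints its timing last, in seconds.
            timings.append(float(out.decode().split()[-1]))
        return (statistics.mean(timings), statistics.stdev(timings),
                min(timings), max(timings))

    avg, dev, lo, hi = bench("./python-ref")
    print("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
          % (avg * 1e3, dev * 1e3, lo * 1e3, hi * 1e3))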