From brett at python.org Mon May 2 12:25:58 2016 From: brett at python.org (Brett Cannon) Date: Mon, 02 May 2016 16:25:58 +0000 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: <20160414190517.073ea4cf@fsol> Message-ID: On Thu, 14 Apr 2016 at 10:50 Maciej Fijalkowski wrote: > On Thu, Apr 14, 2016 at 7:05 PM, Antoine Pitrou > wrote: > > On Wed, 13 Apr 2016 20:57:35 +0200 > > Maciej Fijalkowski > > wrote: > >> Hi > >> > >> I have a radical idea: to take a pypy benchmark suite, update the > >> libraries to newer ones and replace python benchmarks with that. The > >> main reason being that pypy has a much better coverage of things that > >> are not microbenchmarks, the list (in json): > > > > So why not consolidate all benchmarks together, instead of throwing > > away work already done? > > > > Regards > > > > Antoine. > > Yeah, you can call it that too. > I also reached out to Pyston at https://gitter.im/dropbox/pyston over the weekend to see if they would want to participate as well. So are we actually going to try and make this happen? I guess we should get people to vote on whether they like the idea enough before we hash out how we want to structure the new repository and benchmark suite. I'm +1 on the idea, but I currently don't have the time to help beyond helping drive the email conversation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kmod at dropbox.com Mon May 2 18:18:24 2016 From: kmod at dropbox.com (Kevin Modzelewski) Date: Mon, 2 May 2016 15:18:24 -0700 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: <20160414190517.073ea4cf@fsol> Message-ID: I'm definitely interested and willing to clean up + contribute our benchmarks. On a side note, I'm a bit skeptical that there can be a single benchmark suite that satisfies everyone. I would imagine that there will still be projects with specific use-cases they prioritize (such as Pyston with webserver workloads), or that have some idea that their users will be "non-representative" in some way. One example of that is the emphasis on warmup vs steady-state performance, which can be reflected in different measurement methodologies -- I don't think there's a single right answer to the question "how much does warmup matter". But anyway, I'm still definitely +1 on the idea of merging all the benchmarks together, and I think that that will be better than the current situation. I'm imagining that we can at least have a common language for discussing these things ("Pyston prefers to use the flags `--webserver --include-warmup`"). I also see quite a few blog posts / academic papers on Python performance that seem to get led astray by the confusing benchmark situation, and I think having a blessed set of benchmarks (even if different people use them in different ways) would still be a huge step forward. kmod On Mon, May 2, 2016 at 9:25 AM, Brett Cannon wrote: > > > On Thu, 14 Apr 2016 at 10:50 Maciej Fijalkowski wrote: > >> On Thu, Apr 14, 2016 at 7:05 PM, Antoine Pitrou >> wrote: >> > On Wed, 13 Apr 2016 20:57:35 +0200 >> > Maciej Fijalkowski >> > wrote: >> >> Hi >> >> >> >> I have a radical idea: to take a pypy benchmark suite, update the >> >> libraries to newer ones and replace python benchmarks with that. The >> >> main reason being that pypy has a much better coverage of things that >> >> are not microbenchmarks, the list (in json): >> > >> > So why not consolidate all benchmarks together, instead of throwing >> > away work already done? 
>> > >> > Regards >> > >> > Antoine. >> >> Yeah, you can call it that too. >> > > I also reached out to Pyston at https://gitter.im/dropbox/pyston over the > weekend to see if they would want to participate as well. > > So are we actually going to try and make this happen? I guess we should > get people to vote on whether they like the idea enough before we hash out > how we want to structure the new repository and benchmark suite. > > I'm +1 on the idea, but I currently don't have the time to help beyond > helping drive the email conversation. > > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Mon May 2 18:24:16 2016 From: brett at python.org (Brett Cannon) Date: Mon, 02 May 2016 22:24:16 +0000 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: <20160414190517.073ea4cf@fsol> Message-ID: On Mon, 2 May 2016 at 15:18 Kevin Modzelewski wrote: > I'm definitely interested and willing to clean up + contribute our > benchmarks. > > On a side note, I'm a bit skeptical that there can be a single benchmark > suite that satisfies everyone. I would imagine that there will still be > projects with specific use-cases they prioritize (such as Pyston with > webserver workloads), or that have some idea that their users will be > "non-representative" in some way. One example of that is the emphasis on > warmup vs steady-state performance, which can be reflected in different > measurement methodologies -- I don't think there's a single right answer to > the question "how much does warmup matter". > Totally agree. I think the general thinking is to at have a central repository and a flexible enough benchmark runner that people can benchmark whatever they find important to them. That way if e.g. Pyston adds nice web server benchmarks other implementations can use them or users can decide that's a workload they care about and make an informed decision of what Python implementations may work for them (before testing their own workload :). -Brett > > But anyway, I'm still definitely +1 on the idea of merging all the > benchmarks together, and I think that that will be better than the current > situation. I'm imagining that we can at least have a common language for > discussing these things ("Pyston prefers to use the flags `--webserver > --include-warmup`"). I also see quite a few blog posts / academic papers > on Python performance that seem to get led astray by the confusing > benchmark situation, and I think having a blessed set of benchmarks (even > if different people use them in different ways) would still be a huge step > forward. > > kmod > > On Mon, May 2, 2016 at 9:25 AM, Brett Cannon wrote: > >> >> >> On Thu, 14 Apr 2016 at 10:50 Maciej Fijalkowski wrote: >> >>> On Thu, Apr 14, 2016 at 7:05 PM, Antoine Pitrou >>> wrote: >>> > On Wed, 13 Apr 2016 20:57:35 +0200 >>> > Maciej Fijalkowski >>> > wrote: >>> >> Hi >>> >> >>> >> I have a radical idea: to take a pypy benchmark suite, update the >>> >> libraries to newer ones and replace python benchmarks with that. The >>> >> main reason being that pypy has a much better coverage of things that >>> >> are not microbenchmarks, the list (in json): >>> > >>> > So why not consolidate all benchmarks together, instead of throwing >>> > away work already done? >>> > >>> > Regards >>> > >>> > Antoine. >>> >>> Yeah, you can call it that too. 
>>> >> >> I also reached out to Pyston at https://gitter.im/dropbox/pyston over >> the weekend to see if they would want to participate as well. >> >> So are we actually going to try and make this happen? I guess we should >> get people to vote on whether they like the idea enough before we hash out >> how we want to structure the new repository and benchmark suite. >> >> I'm +1 on the idea, but I currently don't have the time to help beyond >> helping drive the email conversation. >> >> _______________________________________________ >> Speed mailing list >> Speed at python.org >> https://mail.python.org/mailman/listinfo/speed >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue May 3 04:33:52 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 3 May 2016 10:33:52 +0200 Subject: [Speed] Adapting pypy benchmark suite In-Reply-To: References: <20160414190517.073ea4cf@fsol> Message-ID: Hi I'm willing to put some work after I'm back from holiday (mid-May) On Tue, May 3, 2016 at 12:24 AM, Brett Cannon wrote: > > > On Mon, 2 May 2016 at 15:18 Kevin Modzelewski wrote: >> >> I'm definitely interested and willing to clean up + contribute our >> benchmarks. >> >> On a side note, I'm a bit skeptical that there can be a single benchmark >> suite that satisfies everyone. I would imagine that there will still be >> projects with specific use-cases they prioritize (such as Pyston with >> webserver workloads), or that have some idea that their users will be >> "non-representative" in some way. One example of that is the emphasis on >> warmup vs steady-state performance, which can be reflected in different >> measurement methodologies -- I don't think there's a single right answer to >> the question "how much does warmup matter". > > > Totally agree. I think the general thinking is to at have a central > repository and a flexible enough benchmark runner that people can benchmark > whatever they find important to them. That way if e.g. Pyston adds nice web > server benchmarks other implementations can use them or users can decide > that's a workload they care about and make an informed decision of what > Python implementations may work for them (before testing their own workload > :). > > -Brett > >> >> >> But anyway, I'm still definitely +1 on the idea of merging all the >> benchmarks together, and I think that that will be better than the current >> situation. I'm imagining that we can at least have a common language for >> discussing these things ("Pyston prefers to use the flags `--webserver >> --include-warmup`"). I also see quite a few blog posts / academic papers on >> Python performance that seem to get led astray by the confusing benchmark >> situation, and I think having a blessed set of benchmarks (even if different >> people use them in different ways) would still be a huge step forward. >> >> kmod >> >> On Mon, May 2, 2016 at 9:25 AM, Brett Cannon wrote: >>> >>> >>> >>> On Thu, 14 Apr 2016 at 10:50 Maciej Fijalkowski wrote: >>>> >>>> On Thu, Apr 14, 2016 at 7:05 PM, Antoine Pitrou >>>> wrote: >>>> > On Wed, 13 Apr 2016 20:57:35 +0200 >>>> > Maciej Fijalkowski >>>> > wrote: >>>> >> Hi >>>> >> >>>> >> I have a radical idea: to take a pypy benchmark suite, update the >>>> >> libraries to newer ones and replace python benchmarks with that. 
The >>>> >> main reason being that pypy has a much better coverage of things that >>>> >> are not microbenchmarks, the list (in json): >>>> > >>>> > So why not consolidate all benchmarks together, instead of throwing >>>> > away work already done? >>>> > >>>> > Regards >>>> > >>>> > Antoine. >>>> >>>> Yeah, you can call it that too. >>> >>> >>> I also reached out to Pyston at https://gitter.im/dropbox/pyston over the >>> weekend to see if they would want to participate as well. >>> >>> So are we actually going to try and make this happen? I guess we should >>> get people to vote on whether they like the idea enough before we hash out >>> how we want to structure the new repository and benchmark suite. >>> >>> I'm +1 on the idea, but I currently don't have the time to help beyond >>> helping drive the email conversation. >>> >>> _______________________________________________ >>> Speed mailing list >>> Speed at python.org >>> https://mail.python.org/mailman/listinfo/speed >>> > From victor.stinner at gmail.com Tue May 17 10:44:04 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 17 May 2016 16:44:04 +0200 Subject: [Speed] CPU speed of one core changes for unknown reason Message-ID: Hi, I'm still having fun with microbenchmarks. I disabled Power States (pstate) of my Intel CPU and forced the frequency for 3.4 GHz. I isolated 2 physical cores on a total of 4. Timings are very stable *but* sometimes, I get impressive slowdown: like 60% or 80% slower, but only for a short time. Do you know which CPU feature can explain such temporary slowdown? I tried cpupower & powertop tools to try to learn more about internal CPU states, but I don't see anything obvious. I also noticed that powertop has a major side effect: it changes the speed of my CPU cores! Since the CPU cores used to run benchmarks are isolated, powertop uses a low speed (like 1.6 GHz, half speed) while benchmarks are running, probably because the kernel doesn't "see" the benchmark processes. My CPU model is: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz I'm using "userspace" scaling governor for isolated CPU cores, but "ondemand" for other CPU cores. I disabled pstate (kernel parameter: intel_pstate=disable), the CPU scaling driver is "acpi-cpufreq". CPUs 2,3,6,7 are isolated. In the following examples, the same microbenchmark takes ~196 ms on all cores, except of the core 3 on the first example. Example 1: --- $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py -n 1 --timer perf_counter; done === CPU 0 === 0.19619656700160704 === CPU 1 === 0.19547197800056892 === CPU 2 === 0.19512042699716403 === CPU 3 === 0.35738898099953076 === CPU 4 === 0.19744606299718725 === CPU 5 === 0.195480646998476 === CPU 6 === 0.19495172200186062 === CPU 7 === 0.19495161599843414 --- Example 2: --- $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py -n 1 --timer perf_counter; done === CPU 0 === 0.19725238799946965 === CPU 1 === 0.19552089699936914 === CPU 2 === 0.19495758999983082 === CPU 3 === 0.19517506799820694 === CPU 4 === 0.1963375539999106 === CPU 5 === 0.19575440099652042 === CPU 6 === 0.19582506000006106 === CPU 7 === 0.19503543600148987 --- If I repeat the same test, timings are always ~196 ms on all cores. It looks like some cores decide to sleep. 
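
For anyone who wants to reproduce this kind of per-core comparison without
the shell loop over taskset, here is a minimal sketch of the same idea in
pure Python (it assumes Linux for os.sched_setaffinity(), and the workload
below is a trivial stand-in, not the real bm_call_simple):

---
# Sketch: pin the current process to one core at a time and time the same
# call-heavy loop on each of them, to spot a core that is unexpectedly slow.
import os
import time

def workload(n=200000):
    # trivial function calls, roughly in the spirit of call_simple
    def f():
        pass
    for _ in range(n):
        f(); f(); f(); f(); f()

def time_on_cpu(cpu):
    os.sched_setaffinity(0, {cpu})   # restrict this process to a single core
    t0 = time.perf_counter()
    workload()
    return time.perf_counter() - t0

if __name__ == "__main__":
    # the allowed-CPU set is read once, before the affinity is narrowed
    for cpu in sorted(os.sched_getaffinity(0)):
        print("=== CPU %d === %.6f s" % (cpu, time_on_cpu(cpu)))
---
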
Victor From victor.stinner at gmail.com Tue May 17 17:11:50 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 17 May 2016 23:11:50 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" Message-ID: Hi, I'm still (!) investigating the reasons why the benchmark call_simple (ok, let's be honest: the *micro*benchmark) gets different results for unknown reasons. (*) Collisions in hash tables: perf.py already calls the benchmark with PYTHONHASHSEED=1 to test the same hash function. A more generic solution is to use multiple processes to test multiple hash seeds to get a better uniform distribution. (*) System load => CPU isolation, disable ASLR, set CPU affinity on IRQs, etc. work around this issue -- http://haypo-notes.readthedocs.io/microbenchmark.html (*) CPU heat => disable CPU Turbo Mode works around this issue (*) Locale, size of the command line and/or the current working directory => WTF?! Examples with a system tuned to get reliable benchmark. Example 1 using different locales: --- $ env -i PYTHONHASHSEED=1 LANG=$LANG taskset -c 3 ../fastcall/pgo/python performance/bm_call_simple.py -n 2 --timer perf_counter 0.1914542349995827 0.1914668690005783 $ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python performance/bm_call_simple.py -n 2 --timer perf_counter 0.2037885540003117 0.20376207399931445 --- Example 2 using a different command line (the "xxx" is ignored but changes the benchmark result): -- $ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python performance/bm_call_simple.py -n 2 --timer perf_counter 0.20377227199969639 0.20376165899961052 $ env -i PYTHONHASHSEED=1 taskset -c 3 ../fastcall/pgo/python performance/bm_call_simple.py -n 2 --timer perf_counter xxx 0.20814169400000537 0.20804374700037442 --- => My bet is that the locale, current working directory, command line, etc. impact how the heap memory is allocated, and this specific benchmark depends on the locality of memory allocated on the heap... For a microbenchmark, 191 ms, 203 ms or 208 ms are not the same numbers... Such very subtle difference impacts the final "NNNx slower" or "NNNx faster" line of perf.py. I tried different values of $LANG environment variable and differerent lengths of command lines. When the performance decreases, the stalled-cycles-frontend Linux perf event increases while the LLC-loads even increases. => The performance of the benchmark depends on the usage of low-level memory caches (L1, L2, L3). I understand that in some cases, more memory fits into the fatest caches, and so the benchmark is faster. But sometimes, all memory doesn't fit, and so the benchmark is slower. Maybe the problem is that memory is close to memory pages boundaries, or doesn't fit into L1 cache lines, or something like that. Victor From victor.stinner at gmail.com Tue May 17 17:21:29 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 17 May 2016 23:21:29 +0200 Subject: [Speed] CPU speed of one core changes for unknown reason In-Reply-To: References: Message-ID: According to a friend, my CPU model "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" has a "Turbo Mode" which is enabled by default. The CPU tries to use the Turbo Mode whener possible, but disables it when the CPU is too hot. The change should be visible with the exact CPU frequency (the change can be a single MHz: 3400 => 3401). I didn't notice such minor CPU frequency change, but I didn't check carefully. Anyway, I disabled the Turbo Mode and Hyperthreading in the EFI. 
It should avoid the strange performance "drop". Victor 2016-05-17 16:44 GMT+02:00 Victor Stinner : > Hi, > > I'm still having fun with microbenchmarks. I disabled Power States > (pstate) of my Intel CPU and forced the frequency for 3.4 GHz. I > isolated 2 physical cores on a total of 4. Timings are very stable > *but* sometimes, I get impressive slowdown: like 60% or 80% slower, > but only for a short time. > > Do you know which CPU feature can explain such temporary slowdown? > > I tried cpupower & powertop tools to try to learn more about internal > CPU states, but I don't see anything obvious. I also noticed that > powertop has a major side effect: it changes the speed of my CPU > cores! Since the CPU cores used to run benchmarks are isolated, > powertop uses a low speed (like 1.6 GHz, half speed) while benchmarks > are running, probably because the kernel doesn't "see" the benchmark > processes. > > My CPU model is: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz > > I'm using "userspace" scaling governor for isolated CPU cores, but > "ondemand" for other CPU cores. > > I disabled pstate (kernel parameter: intel_pstate=disable), the CPU > scaling driver is "acpi-cpufreq". > > CPUs 2,3,6,7 are isolated. > > In the following examples, the same microbenchmark takes ~196 ms on > all cores, except of the core 3 on the first example. > > Example 1: > --- > $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 > taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py > -n 1 --timer perf_counter; done > === CPU 0 === > 0.19619656700160704 > === CPU 1 === > 0.19547197800056892 > === CPU 2 === > 0.19512042699716403 > === CPU 3 === > 0.35738898099953076 > === CPU 4 === > 0.19744606299718725 > === CPU 5 === > 0.195480646998476 > === CPU 6 === > 0.19495172200186062 > === CPU 7 === > 0.19495161599843414 > --- > > Example 2: > --- > $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 > taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py > -n 1 --timer perf_counter; done > === CPU 0 === > 0.19725238799946965 > === CPU 1 === > 0.19552089699936914 > === CPU 2 === > 0.19495758999983082 > === CPU 3 === > 0.19517506799820694 > === CPU 4 === > 0.1963375539999106 > === CPU 5 === > 0.19575440099652042 > === CPU 6 === > 0.19582506000006106 > === CPU 7 === > 0.19503543600148987 > --- > > If I repeat the same test, timings are always ~196 ms on all cores. > > It looks like some cores decide to sleep. > > Victor From fijall at gmail.com Wed May 18 02:55:46 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 18 May 2016 08:55:46 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: > => The performance of the benchmark depends on the usage of low-level > memory caches (L1, L2, L3). > > I understand that in some cases, more memory fits into the fatest > caches, and so the benchmark is faster. But sometimes, all memory > doesn't fit, and so the benchmark is slower. > > Maybe the problem is that memory is close to memory pages boundaries, > or doesn't fit into L1 cache lines, or something like that. > > Victor I think you misunderstand how caches work. The way caches work depends on the addresses of memory (their value) which even with ASLR disabled can differ between runs. Then you either do or don't have cache collisions. How about you just accept the fact that there is a statistical distribution of the results on not the concrete "right" result? 
I tried to explain to you before that even if you get the "right" result, it'll still be at best just one sample of the statistics. From arigo at tunes.org Wed May 18 04:45:46 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 18 May 2016 10:45:46 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: Hi Victor, On 17 May 2016 at 23:11, Victor Stinner wrote: > with PYTHONHASHSEED=1 to test the same hash function. A more generic > solution is to use multiple processes to test multiple hash seeds to > get a better uniform distribution. What you say in the rest of the mail just shows that this "generic solution" should be applied not only to PYTHONHASHSEED, but also to other variables that seem to introduce deterministic noise. You've just found three more: the locale, the size of the command line, and the working directory. I guess the mere size of the environment also plays a role. So I guess, ideally, you'd run a large number of times with random values in all these parameters. (In practice it might be enough to run a smaller fixed number of times with known values in the parameters.) A bient?t, Armin. From victor.stinner at gmail.com Wed May 18 07:05:09 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 18 May 2016 13:05:09 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: 2016-05-18 8:55 GMT+02:00 Maciej Fijalkowski : > I think you misunderstand how caches work. The way caches work depends > on the addresses of memory (their value) which even with ASLR disabled > can differ between runs. Then you either do or don't have cache > collisions. How about you just accept the fact that there is a > statistical distribution of the results on not the concrete "right" > result? Slowly, I understood that running multiple processes are needed to get a better statistical distribution. Ok. But I found a very specific case where the result depends on the command line, and the command line is constant. Running the benchmark once or 1 million of times doesn't reduce the effect of this parameter, since the effect is constant. Victor From victor.stinner at gmail.com Wed May 18 07:13:16 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 18 May 2016 13:13:16 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: 2016-05-18 10:45 GMT+02:00 Armin Rigo : > On 17 May 2016 at 23:11, Victor Stinner wrote: >> with PYTHONHASHSEED=1 to test the same hash function. A more generic >> solution is to use multiple processes to test multiple hash seeds to >> get a better uniform distribution. > > What you say in the rest of the mail just shows that this "generic > solution" should be applied not only to PYTHONHASHSEED, but also to > other variables that seem to introduce deterministic noise. Right. ... or ensure that these other parameters are not changed when testing two versions of the code ;-) perf.py already starts the process with an empty environment and set PYTHONHASHSEED: the environment is fixed (constant). I noticed the difference of performance with the environment because I failed to reproduce the benchmark (I got different numbers) when I ran again the benchmark manually. > You've > just found three more: the locale, the size of the command line, and > the working directory. I guess the mere size of the environment also > plays a role. 
So I guess, ideally, you'd run a large number of times > with random values in all these parameters. (In practice it might be > enough to run a smaller fixed number of times with known values in the > parameters.) Right, I have to think about that, try to find a way to randomize these "parameters" (or find a way to make them constants): directories, name of the binary, etc. As I wrote, the environment is easy to control. The working directory and the command line, it's more complex. It's convenient to be able to pass links to two different Python binaries compiled in two different directories. FYI I'm using a "reference python" compiled in one directory, and my "patched python" in a different directory. Both are compiled using the same compiler options (I'm using -O0 for debug, -O3 for quick benchmark, -O3 with PGO and LTO for reliable benchmarks). -- Another option for microbenchmarks would be to *ignore* (hide) differences smaller than +/- 10%, since such kind of benchmark depends too much on external parameters. I did that in my custom microbenchmark runner, it helps to ignore noise and focus on major speedup (or slowdown!). Victor From victor.stinner at gmail.com Wed May 18 07:16:40 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 18 May 2016 13:16:40 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: 2016-05-18 8:55 GMT+02:00 Maciej Fijalkowski : > I think you misunderstand how caches work. The way caches work depends > on the addresses of memory (their value) which even with ASLR disabled > can differ between runs. Then you either do or don't have cache > collisions. Ok. I'm not sure yet that it's feasible to get exactly the same memory addresses for "hot" objects allocated by Python between two versions of the code (especially when testing a small patch). Not only the addresses look to depend on external parameters, but the patch can also adds or avoids some memory allocations. The concrete problem is that the benchmark depends on such low-level CPU feature and the perf.py doesn't ignore minor delta in performance, no? Victor From fijall at gmail.com Wed May 18 14:54:25 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 18 May 2016 20:54:25 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: On Wed, May 18, 2016 at 1:16 PM, Victor Stinner wrote: > 2016-05-18 8:55 GMT+02:00 Maciej Fijalkowski : >> I think you misunderstand how caches work. The way caches work depends >> on the addresses of memory (their value) which even with ASLR disabled >> can differ between runs. Then you either do or don't have cache >> collisions. > > Ok. I'm not sure yet that it's feasible to get exactly the same memory > addresses for "hot" objects allocated by Python between two versions > of the code (especially when testing a small patch). Not only the > addresses look to depend on external parameters, but the patch can > also adds or avoids some memory allocations. > > The concrete problem is that the benchmark depends on such low-level > CPU feature and the perf.py doesn't ignore minor delta in performance, > no? > > Victor Well the answer is to do more statistics really in my opinion. That is, perf should report average over multiple runs in multiple processes. I started a branch for pypy benchmarks for that, but never finished it actually. 
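
To make it concrete, the rough shape of what I have in mind is something
like the sketch below (not what perf.py or my unfinished pypy branch
actually do; it assumes the benchmark script prints one timing per line,
as in the runs quoted earlier in the thread):

---
# Sketch: run the benchmark in several fresh processes, each with its own
# random hash seed, collect all timings and report mean +/- standard
# deviation instead of a single minimum.
import os
import statistics
import subprocess
import sys

def run_once(python, script, loops=3):
    # one fresh process; the script prints one float per line
    env = dict(os.environ, PYTHONHASHSEED="random")
    out = subprocess.check_output(
        [python, script, "-n", str(loops), "--timer", "perf_counter"],
        env=env)
    return [float(x) for x in out.decode().split()]

def bench(python, script, processes=10):
    timings = []
    for _ in range(processes):
        timings.extend(run_once(python, script))
    print("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
          % (statistics.mean(timings) * 1e3, statistics.stdev(timings) * 1e3,
             min(timings) * 1e3, max(timings) * 1e3))

if __name__ == "__main__":
    bench(sys.executable, "performance/bm_call_simple.py")
---
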
From victor.stinner at gmail.com Wed May 18 15:30:19 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 18 May 2016 21:30:19 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: 2016-05-18 20:54 GMT+02:00 Maciej Fijalkowski : >> Ok. I'm not sure yet that it's feasible to get exactly the same memory >> addresses for "hot" objects allocated by Python between two versions >> of the code (...) > > Well the answer is to do more statistics really in my opinion. That > is, perf should report average over multiple runs in multiple > processes. I started a branch for pypy benchmarks for that, but never > finished it actually. I'm not sure that I understood you correctly. As I wrote, running the same benchmark twice using two processes gives exactly the same timing. I already modified perf.py locally to run multiple processes and focus on the average + std dev rather than min of a single process. Example: run 10 process x 3 loops (total: 30) Run average: 205.4 ms +/- 0.1 ms (min: 205.3 ms, max: 205.4 ms) Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms) Run average: 205.2 ms +/- 0.0 ms (min: 205.2 ms, max: 205.3 ms) Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms) Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms) Run average: 205.4 ms +/- 0.1 ms (min: 205.3 ms, max: 205.4 ms) Run average: 205.3 ms +/- 0.2 ms (min: 205.1 ms, max: 205.4 ms) Run average: 205.2 ms +/- 0.1 ms (min: 205.1 ms, max: 205.2 ms) Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms) Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms) Total average: 205.3 ms +/- 0.1 ms (min: 205.1 ms, max: 205.4 ms) The "total" concatenates all lists of timings. Note: Oh, by the way, the timing also depends on the presence of .pyc files ;-) I modified perf.py to add a first run with a single iteration just to rebuild .pyc, since the benchmark always start by removing alll .pyc files... Victor From paul at paulgraydon.co.uk Wed May 18 17:05:11 2016 From: paul at paulgraydon.co.uk (Paul Graydon) Date: Wed, 18 May 2016 21:05:11 +0000 Subject: [Speed] CPU speed of one core changes for unknown reason In-Reply-To: References: Message-ID: <20160518210511.GA7407@paulgraydon.co.uk> Bear in mind that what you see by way of CPU Speed is based on *sampling*, and the CPU can be switched speeds very quickly. Far faster than you'd necessarily see in your periodic updates. Also note that if your cooling isn't up to scratch for handling the CPU running permanently at its top normal speed, thermal throttling will cause the system to slow down independently of anything happening OS side. That's embedded within the chip and can't be disabled. FWIW microbenchmarks are inherently unstable and susceptible to jitter on the system side. There's all sorts of things that could be interfering outside the scope of your tests, and because the benchmark is over and done with so quickly, if something does happen it's going to skew the entire benchmark run. If microbenchmarking really is the right thing for your needs, you should look at running enough runs to be able to get a fair idea of realistic performance. Think hundreds etc, then eliminating particularly fast and/or slow runs from your consideration, and whatever other things you might consider for statistical significance. 
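
As a toy illustration of that last step (made-up numbers, and no claim that
a 25% trim is the right rule for your workload):

---
import statistics

def trimmed_summary(timings, trim=0.25):
    # sort, drop the fastest and slowest `trim` fraction, summarise the rest
    data = sorted(timings)
    k = int(len(data) * trim)
    kept = data[k:len(data) - k] if k else data
    return kept, statistics.mean(kept), statistics.stdev(kept)

# made-up timings in seconds, including one obvious outlier (0.357)
samples = [0.195, 0.196, 0.195, 0.203, 0.196, 0.357, 0.195, 0.194]
kept, mean, stdev = trimmed_summary(samples)
print("%d/%d samples kept: %.1f ms +/- %.1f ms"
      % (len(kept), len(samples), mean * 1e3, stdev * 1e3))
---
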
I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, and that you're at risk of optimising towards an environment the code won't actually run in, and might even end up pursuing the wrong optimisations. Paul On Tue, May 17, 2016 at 11:21:29PM +0200, Victor Stinner wrote: > According to a friend, my CPU model "Intel(R) Core(TM) i7-2600 CPU @ > 3.40GHz" has a "Turbo Mode" which is enabled by default. The CPU tries > to use the Turbo Mode whener possible, but disables it when the CPU is > too hot. The change should be visible with the exact CPU frequency > (the change can be a single MHz: 3400 => 3401). I didn't notice such > minor CPU frequency change, but I didn't check carefully. > > Anyway, I disabled the Turbo Mode and Hyperthreading in the EFI. It > should avoid the strange performance "drop". > > Victor > > 2016-05-17 16:44 GMT+02:00 Victor Stinner : > > Hi, > > > > I'm still having fun with microbenchmarks. I disabled Power States > > (pstate) of my Intel CPU and forced the frequency for 3.4 GHz. I > > isolated 2 physical cores on a total of 4. Timings are very stable > > *but* sometimes, I get impressive slowdown: like 60% or 80% slower, > > but only for a short time. > > > > Do you know which CPU feature can explain such temporary slowdown? > > > > I tried cpupower & powertop tools to try to learn more about internal > > CPU states, but I don't see anything obvious. I also noticed that > > powertop has a major side effect: it changes the speed of my CPU > > cores! Since the CPU cores used to run benchmarks are isolated, > > powertop uses a low speed (like 1.6 GHz, half speed) while benchmarks > > are running, probably because the kernel doesn't "see" the benchmark > > processes. > > > > My CPU model is: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz > > > > I'm using "userspace" scaling governor for isolated CPU cores, but > > "ondemand" for other CPU cores. > > > > I disabled pstate (kernel parameter: intel_pstate=disable), the CPU > > scaling driver is "acpi-cpufreq". > > > > CPUs 2,3,6,7 are isolated. > > > > In the following examples, the same microbenchmark takes ~196 ms on > > all cores, except of the core 3 on the first example. > > > > Example 1: > > --- > > $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 > > taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py > > -n 1 --timer perf_counter; done > > === CPU 0 === > > 0.19619656700160704 > > === CPU 1 === > > 0.19547197800056892 > > === CPU 2 === > > 0.19512042699716403 > > === CPU 3 === > > 0.35738898099953076 > > === CPU 4 === > > 0.19744606299718725 > > === CPU 5 === > > 0.195480646998476 > > === CPU 6 === > > 0.19495172200186062 > > === CPU 7 === > > 0.19495161599843414 > > --- > > > > Example 2: > > --- > > $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 > > taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py > > -n 1 --timer perf_counter; done > > === CPU 0 === > > 0.19725238799946965 > > === CPU 1 === > > 0.19552089699936914 > > === CPU 2 === > > 0.19495758999983082 > > === CPU 3 === > > 0.19517506799820694 > > === CPU 4 === > > 0.1963375539999106 > > === CPU 5 === > > 0.19575440099652042 > > === CPU 6 === > > 0.19582506000006106 > > === CPU 7 === > > 0.19503543600148987 > > --- > > > > If I repeat the same test, timings are always ~196 ms on all cores. > > > > It looks like some cores decide to sleep. 
> > > > Victor > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From victor.stinner at gmail.com Wed May 18 19:39:57 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 19 May 2016 01:39:57 +0200 Subject: [Speed] CPU speed of one core changes for unknown reason In-Reply-To: <20160518210511.GA7407@paulgraydon.co.uk> References: <20160518210511.GA7407@paulgraydon.co.uk> Message-ID: 2016-05-18 23:05 GMT+02:00 Paul Graydon : > Bear in mind that what you see by way of CPU Speed is based on *sampling*, and the CPU can be switched speeds very > quickly. Far faster than you'd necessarily see in your periodic updates. Also note that if your cooling isn't up to > scratch for handling the CPU running permanently at its top normal speed, thermal throttling will cause the system to > slow down independently of anything happening OS side. That's embedded within the chip and can't be disabled. I checked the temperature of my CPU cores using the "sensors" command and it was somewhere around ~50?C which doesn't seem "too hot" to me. A better bet is that I was close the temperature switching between Turbo Mode or not. I disabled Turbo Mode and Hyperthreading on my CPU and I didn't reproduce the random slowdown anymore. I also misunderstood how Turbo Mode works. By default, a CPU uses the Turbo Mode, but disables it automatically if the CPU is too hot. I expected that the CPU doesn't use Turbo Mode, but start to use it after a few seconds if the CPU usage is high. It looks like the performance also depends on the number of cores currently used: https://en.wikipedia.org/wiki/Intel_Turbo_Boost#Example > FWIW microbenchmarks are inherently unstable and susceptible to jitter on the system side. Using CPU isolation helps a lot to reduce the noise coming from the "system". > If microbenchmarking really is the right thing for your needs, (...) Someone asked me to check the perfomance of my patches using perf.py, so I'm using it. The accuracy of some specific benchmark of this benchmark suite is still an open question ;-) > ... you should look at running enough runs to be able to get a fair idea of realistic performance. Right, this idea was already discussed in other threads and already implemented in the PyPy flavor of perf.py. I also patched locally my perf.py to do that. > I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, and that you're > at risk of optimising towards an environment the code won't actually run in, and might even end up pursuing the wrong > optimisations. Yeah, that's an excellent remark :-) It's not the first time that I read it. I think that it's ok to use CPU isolation and tune CPU options (ex: disable Turbo Mode) to reduce the noise. Other parameters like disabling hash randomization or disabling ASLR is more an open question. It seems to me that disabling randomization (hash function, ASLR) introduces a risk of leading to the invalidate conclusion (patch makes Python faster / slower). But I read this advice many times, and perf.py currently explicitly disables hash randomization. The most common trend in benchmarking is to disable all sources of noice and only care of the minimum (smallest timing). In my experience (of last weeks), it just doesn't work, at least for microbenchmarks. Victor From rdmurray at bitdance.com Wed May 18 18:04:40 2016 From: rdmurray at bitdance.com (R. 
David Murray) Date: Wed, 18 May 2016 18:04:40 -0400 Subject: [Speed] CPU speed of one core changes for unknown reason In-Reply-To: <20160518210511.GA7407@paulgraydon.co.uk> References: <20160518210511.GA7407@paulgraydon.co.uk> Message-ID: <20160518220441.CA97CB1401C@webabinitio.net> On Wed, 18 May 2016 21:05:11 -0000, Paul Graydon wrote: > I do have some concerns that you're increasingly creating a synthetic > environment to benchmark against, and that you're at risk of > optimising towards an environment the code won't actually run in, and > might even end up pursuing the wrong > optimisations. My understanding is that Victor isn't using this to guide optimization, but rather to have a quick-as-possible way to find out that he screwed up when he made a code change. I'm sure he's using much longer benchmarks runs for actually looking at the performance impact of the complete changeset. --David From victor.stinner at gmail.com Thu May 19 05:49:40 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 19 May 2016 11:49:40 +0200 Subject: [Speed] CPU speed of one core changes for unknown reason In-Reply-To: <20160518220441.CA97CB1401C@webabinitio.net> References: <20160518210511.GA7407@paulgraydon.co.uk> <20160518220441.CA97CB1401C@webabinitio.net> Message-ID: FYI I'm running the CPython Benchmark Suite with: taskset -c 1,3 python3 -u perf.py --rigorous ../ref_python/pgo/python ../fastcall/pgo/python -b all I was asked to use --rigorous and -b all when I worked on other patches, like: https://bugs.python.org/issue21955#msg259431 2016-05-19 0:04 GMT+02:00 R. David Murray : > On Wed, 18 May 2016 21:05:11 -0000, Paul Graydon wrote: >> I do have some concerns that you're increasingly creating a synthetic >> environment to benchmark against, and that you're at risk of >> optimising towards an environment the code won't actually run in, and >> might even end up pursuing the wrong >> optimisations. > > My understanding is that Victor isn't using this to guide optimization, > but rather to have a quick-as-possible way to find out that he screwed up > when he made a code change. I'm sure he's using much longer benchmarks > runs for actually looking at the performance impact of the complete > changeset. Right, I don't use the benchmark suite to choose which parts of the code should be optimized, but only to ensure that my optimizations make Python faster, as expected :-) But I understood what Paul wrote. He says that modifying a random parameter to make it constant (like random hash function) can lead to wrong conclusion on the patch. Depending on the chosen fixed value, the benchmark can say that the patch makes Pyhon faster or slower. Well, at least in corner cases, especially microbenchmarks like call_simple. Victor From victor.stinner at gmail.com Thu May 19 06:12:19 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 19 May 2016 12:12:19 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: References: Message-ID: 2016-05-17 23:11 GMT+02:00 Victor Stinner : > (...) > > (*) System load => CPU isolation, disable ASLR, set CPU affinity on > IRQs, etc. work around this issue -- > http://haypo-notes.readthedocs.io/microbenchmark.html > > (... > > (*) Locale, size of the command line and/or the current working > directory => WTF?! > (...) > => My bet is that the locale, current working directory, command line, > etc. 
impact how the heap memory is allocated, and this specific > benchmark depends on the locality of memory allocated on the heap... > (...) I tried to find a tool to "randomize" memory allocations, but I failed to find a popular and simple tool. I found the following tool, but it seems overkill and not realistic to me: https://emeryberger.com/research/stabilizer/ This tool randomizes everything and "re-randomize" the code at runtime, every 500 ms. IMHO it's not realistic because PGO+LTO use a specific link order to group "hot code" to make hot functions close. It seems like (enabling) ASLR "hides" the effects of the comand line, current working directory, environment variables, etc. Using ASLR + statistics (compute mean + standard deviation, use multiple processes to get a better distribution) fixes my issue. Slowly, I understand better why using the minimum and disabling legit sources of randomness is wrong. I mean that slowly I'm able to explain why :-) It looks like disabling ASLR and focusing on the minimum timing is just wrong. I'm surprised because disabling ASLR is a common practice in benchmarking. For example, on this mailing list, 2 months ago, Alecsandru Patrascu from Intel suggested to disable ASLR: https://mail.python.org/pipermail/speed/2016-February/000289.html (and also to disable Turbo, Hyper Threading and use a fixed CPU frequency which are good advices ;-)) By the way, I'm interested to know how the server running speed.python.org is tuned: CPU tuning, OS tuning, etc. For example, Zachary Ware wrote that perf.py was not run with --rigorous when he launched the website. I will probably write a blog post to explain my issues with benchmarks. Later, I will propose more concrete changes to perf.py and write doc explaining how perf.py should be used (give advices how to get reliable results). Victor From solipsis at pitrou.net Mon May 30 04:14:10 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 30 May 2016 10:14:10 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" References: Message-ID: <20160530101410.024be303@fsol> On Tue, 17 May 2016 23:11:50 +0200 Victor Stinner wrote: > Hi, > > I'm still (!) investigating the reasons why the benchmark call_simple > (ok, let's be honest: the *micro*benchmark) gets different results for > unknown reasons. Try to define MCACHE_STATS in Objects/typeobject.c and observe the statistics from run to run. It might give some hints. Regards Antoine. From victor.stinner at gmail.com Tue May 31 08:41:55 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 31 May 2016 14:41:55 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" In-Reply-To: <20160530101410.024be303@fsol> References: <20160530101410.024be303@fsol> Message-ID: 2016-05-30 10:14 GMT+02:00 Antoine Pitrou : >> I'm still (!) investigating the reasons why the benchmark call_simple >> (ok, let's be honest: the *micro*benchmark) gets different results for >> unknown reasons. > > Try to define MCACHE_STATS in Objects/typeobject.c and observe the > statistics from run to run. It might give some hints. call_simple only uses regular functions, not methods, so the type cache should not have any effect on it. No? I already noticed that the exact layout of items in globals() dict matters. Depending on the PYTHONHASHSEED, you get or not hash collision, the effect is visible on such microbenchmark. 
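
A quick way to see that effect (only a sketch with a few hypothetical names,
not the real globals of the benchmark): the initial slot of a key in a small
dict comes from the low bits of its hash, and str hashes change with
PYTHONHASHSEED, so the collision pattern changes from seed to seed:

---
import subprocess
import sys

CODE = r"""
names = ["foo", "bar", "call_simple", "perf_counter"]
# 8 is the smallest dict table in CPython: the initial slot of a key is
# hash(key) modulo the table size, i.e. the low 3 bits here
slots = [hash(name) % 8 for name in names]
print(sorted(slots))
"""

for seed in ("0", "1", "2", "3"):
    out = subprocess.check_output([sys.executable, "-c", CODE],
                                  env={"PYTHONHASHSEED": seed})
    print("PYTHONHASHSEED=%s -> slots %s" % (seed, out.decode().strip()))
---

Two names landing in the same slot means an extra probe on every lookup of
one of them, which is exactly the kind of constant, seed-dependent offset
that shows up in a microbenchmark like call_simple.
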
Victor From solipsis at pitrou.net Tue May 31 08:47:53 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 31 May 2016 14:47:53 +0200 Subject: [Speed] External sources of noise changing call_simple "performance" References: <20160530101410.024be303@fsol> Message-ID: <20160531144753.09041ba0@fsol> On Tue, 31 May 2016 14:41:55 +0200 Victor Stinner wrote: > 2016-05-30 10:14 GMT+02:00 Antoine Pitrou : > >> I'm still (!) investigating the reasons why the benchmark call_simple > >> (ok, let's be honest: the *micro*benchmark) gets different results for > >> unknown reasons. > > > > Try to define MCACHE_STATS in Objects/typeobject.c and observe the > > statistics from run to run. It might give some hints. > > call_simple only uses regular functions, not methods, so the type > cache should not have any effect on it. No? Indeed, sorry for the mistake. Regards Antoine.