From armin.rigo at gmail.com Wed Nov 2 06:04:25 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 11:04:25 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 19 October 2016 at 18:55, Victor Stinner wrote: > 3) new --duplication option to perf timeit This is never a good idea on top of PyPy, so I wouldn't mind if using this option on top of PyPy threw an error. A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 06:50:39 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 2 Nov 2016 11:50:39 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: 2016-11-02 11:04 GMT+01:00 Armin Rigo : > On 19 October 2016 at 18:55, Victor Stinner wrote: >> 3) new --duplication option to perf timeit > > This is never a good idea on top of PyPy, so I wouldn't mind if using > this option on top of PyPy threw an error. Can you please elaborate? Victor From armin.rigo at gmail.com Wed Nov 2 07:12:59 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 12:12:59 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 2 November 2016 at 11:50, Victor Stinner wrote: > 2016-11-02 11:04 GMT+01:00 Armin Rigo : >> On 19 October 2016 at 18:55, Victor Stinner wrote: >>> 3) new --duplication option to perf timeit >> >> This is never a good idea on top of PyPy, so I wouldn't mind if using >> this option on top of PyPy threw an error. > > Can you please elaborate? Yes, exactly :-) Consider a benchmark written like that: for i in range(lots): z = a + b z = a + b z = a + b z = a + b z = a + b What you are really measuring by running PyPy on this is completely different from what you *think* you are measuring---in this case, mostly everything is optimized away. If you try to make it actually do something so that it's not optimized away, then the problem of duplicating lines becomes of making the tracing JIT compiler not happy at all. If you duplicate the lines too many times, the loop body becomes too long for the JIT compiler to swallow---never duplicates 1000 times, that's always too much! But even if you duplicate only 10 times, then the more subtle problem is: assume that each line can follow *two* control flow paths (even internally, e.g. because of some condition done in RPython). (It is likely the case, if you try to do something non-trivially-optimizable-away.) Then if you duplicate the line 10 times, there are suddenly 2**10 control flow paths. That means the JIT will never be able to warm up completely. Suddenly you are measuring the JIT compiler's performance and not at all your code's. The --duplication option on PyPy is thus either useless or limited to use cases where you definitely know there is only one code path ever followed, and don't duplicate too much, and know for sure that multiple repetitions of the same line won't cause cross-line optimizations. That's not possible to explain without going very technical. A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 08:00:26 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 2 Nov 2016 13:00:26 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hum, so for an usability point of view, I think that the best to do is to ignore the option if Python has a JIT. On CPython, --duplicate makes sense (no?). 
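The idea behind --duplicate, shown as a rough standalone sketch using plain timeit rather than perf's actual implementation (STMT, LOOPS and DUP are arbitrary values chosen only for illustration):

    import timeit

    STMT = "[1, 2] * 1000"
    LOOPS = 10000
    DUP = 100   # arbitrary duplication factor, for illustration only

    # One statement per loop iteration: the for-loop bookkeeping is part
    # of every measurement.
    plain = timeit.Timer(STMT).timeit(number=LOOPS) / LOOPS

    # Repeat the statement DUP times inside the loop body, then divide by
    # LOOPS * DUP: the loop overhead is spread over DUP statements.
    dup_stmt = "\n".join([STMT] * DUP)
    duplicated = timeit.Timer(dup_stmt).timeit(number=LOOPS) / (LOOPS * DUP)

    print("per statement, plain:      %.3f us" % (plain * 1e6))
    print("per statement, duplicated: %.3f us" % (duplicated * 1e6))

On CPython the duplicated figure is usually a little lower because the loop's own bytecode is amortized over many statements; on a tracing JIT the duplication changes what gets compiled, which is exactly Armin's objection above.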
So for example, the following command should use duplicate on CPython but not on PyPy: python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy Victor From gmludo at gmail.com Tue Nov 1 19:34:13 2016 From: gmludo at gmail.com (Ludovic Gasc) Date: Wed, 2 Nov 2016 00:34:13 +0100 Subject: [Speed] [Python-Dev] Benchmarking Python and micro-optimizations In-Reply-To: References: Message-ID: Hi, Thanks first for that, it's very interesting. About to enrich benchmark suite, I might have a suggestion: We might add REST/JSON scenarios, because a lot of people use Python for that. It isn't certainly not the best REST/JSON scenarios, because they have a small payload, but better than nothing: https://www.techempower.com/benchmarks/#section=code&hw=peak&test=fortune Moreover, we already have several implementations for the most populars Web frameworks: https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Python The drawback is that a lot of tests need a database. I can help if you're interested in. Have a nice week. -------------- next part -------------- An HTML attachment was scrubbed... URL: From armin.rigo at gmail.com Wed Nov 2 10:20:44 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 15:20:44 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 2 November 2016 at 13:00, Victor Stinner wrote: > On CPython, --duplicate makes sense (no?). So for example, the > following command should use duplicate on CPython but not on PyPy: > > python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy This example means "compare CPython where the data cache gets extra pressure from reading a strangely large code object, and PyPy where the multiplication might be entirely removed for all I know". Is that really the kind of examples you want to put forward? A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 11:53:45 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 2 Nov 2016 16:53:45 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: 2016-11-02 15:20 GMT+01:00 Armin Rigo : > Is that really the kind of examples you want to put forward? I am not a big fan of timeit, but we must use it sometimes to micro-optimizations in CPython to check if an optimize really makes CPython faster or not. I am only trying to enhance timeit. Understanding results require to understand how the statements are executed. > This example means "compare CPython where the data cache gets extra pressure from reading a strangely large code object, I wrote --duplicate option to benchmark "x+y" with "x=1; y=2". I know, it's an extreme and stupid benchmark, but many people spend a lot of time on trying to optimize this in Python/ceval.c: https://bugs.python.org/issue21955 I tried multiple values of --duplicate when benchmarking x+y, and x+y seems "faster" when using a larger --duplicate value. I understand that the cost of the outer loop is higher than the cost of "reading a strangely large code object". I provide a tool and I try to document how to use it. But it's hard to prevent users to use it for stupid things. For example, recently I spent time trying to optimize bytes%args in Python 3 after reading an article, but then I realized that the Python 2 benchmark was meaningless: https://atleastfornow.net/blog/not-all-bytes/ def bytes_plus(): b"hi" + b" " + b"there" ... benchmark(bytes_plus) ... 
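The same micro-benchmark again, with a dis call added purely for illustration to show what the compiled function actually contains:

    import dis

    def bytes_plus():
        b"hi" + b" " + b"there"

    # On the CPython versions discussed in this thread, the peephole
    # optimizer folds the constant concatenation, so the function body
    # boils down to loading one precomputed constant and discarding it.
    dis.dis(bytes_plus)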
bytes_plus() is optimized by the _compiler_, so the benchmark measure the cost of LOAD_CONST :-) The issue was not the tool but the usage of the tool :-D Victor From armin.rigo at gmail.com Wed Nov 2 12:03:32 2016 From: armin.rigo at gmail.com (Armin Rigo) Date: Wed, 2 Nov 2016 17:03:32 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: Hi Victor, On 2 November 2016 at 16:53, Victor Stinner wrote: > 2016-11-02 15:20 GMT+01:00 Armin Rigo : >> Is that really the kind of examples you want to put forward? > > I am not a big fan of timeit, but we must use it sometimes to > micro-optimizations in CPython to check if an optimize really makes > CPython faster or not. I am only trying to enhance timeit. > Understanding results require to understand how the statements are > executed. Don't get me wrong, I understand the point of the following usage of timeit: python2 -m perf timeit '[1,2]*1000' --duplicate=1000 What I'm criticizing here is this instead: python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy because you're very unlikely to get any relevant information from such a comparison. I stand by my original remark: I would say it should be an error or at least a big fat warning to use --duplicate and PyPy in the same invocation. This is as opposed to silently ignoring --duplicate for PyPy, which is just adding more confusion imho. A bient?t, Armin. From victor.stinner at gmail.com Wed Nov 2 21:30:34 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 3 Nov 2016 02:30:34 +0100 Subject: [Speed] Tune the system the run benchmarks Message-ID: Hi, Last months, I used various shell and Python scripts to tune a system to run benchmarks. There are many parameters to be set to "fully" configure a Linux: * Turbo Boost of Intel CPU * CPU scaling governor * CPU speed * Isolate CPU * Disable kernel RCU on isolated CPUs * etc. I added a new "sytem tune" command to the newly released perf 0.8.4. I implemented many operations: http://perf.readthedocs.io/en/latest/cli.html#system Right now, intel_pstate is better supported. I'm not sure about the CPU scaling governor when intel_pstate is not used, so this is one is not implemented yet. In my old Python script, I used the "userland" governor and a fixed speed for the CPUs. My old Python script also disabled interruptions (IRQ) on isolated CPUs. I will also implement that later. I don't know if setting the default CPU mask for IRQ is enough, or if it's better to set the CPU mask of all invididual IRQs. Example on the speed.python.org server: ----- haypo at speed-python$ sudo python3 -m perf system tune CPU Frequency: Minimum frequency of CPU 1 set to the maximum frequency CPU Frequency: Minimum frequency of CPU 3 set to the maximum frequency ... 
CPU Frequency: Minimum frequency of CPU 23 set to the maximum frequency Turbo Boost (MSR): Turbo Boost disabled on CPU 0: MSR 0x1a0 set to 0x4000850089 Turbo Boost (MSR): Turbo Boost disabled on CPU 1: MSR 0x1a0 set to 0x4000850089 ASLR: Full randomization Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 CPU Frequency: 0,2,4,6,8,10,12,14,16,18,20,22=min=1600 MHz, max=3333 MHz; 1,3,5,7,9,11,13,15,17,19,21,23=min=3333 MHz, max=3333 MHz Turbo Boost (MSR): CPU 0-23: disabled ----- "Reset" the config: ----- haypo at speed-python$ sudo python3 -m perf system reset CPU Frequency: Minimum frequency of CPU 1 reset to the minimum frequency CPU Frequency: Minimum frequency of CPU 3 reset to the minimum frequency ... CPU Frequency: Minimum frequency of CPU 23 reset to the minimum frequency Turbo Boost (MSR): Turbo Boost enabled on CPU 0: MSR 0x1a0 set to 0x850089 Turbo Boost (MSR): Turbo Boost enabled on CPU 1: MSR 0x1a0 set to 0x850089 ASLR: Full randomization Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 CPU Frequency: 0-23=min=1600 MHz, max=3333 MHz Turbo Boost (MSR): CPU 0-23: enabled ----- Victor From victor.stinner at gmail.com Fri Nov 4 08:12:42 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 13:12:42 +0100 Subject: [Speed] New benchmarks results on speed.python.org Message-ID: Hi, Good news, I regenerated all benchmark results of CPython using the latest versions of perf and perfomance and the results look much more reliable. Sadly, I didn't kept a screenshot of old benchmarks, so you should trust me, I cannot show you the old unstable timeline. -- I regenerated all benchmark results of speed.python.org using performance 0.3.2. I now have an (almost) fully automated script to run benchmarks (compile python, run benchmarks, etc.) using a list of Python revisions and/or branches. Only the last step, upload the JSON, is still manual, but it's nothing to automate this part ;-) https://github.com/python/performance/tree/master/scripts Python is compiled using LTO, but not PGO. The compilation with PGO fails with an internal GCC bug, speed-python uses Ubuntu 14.04, the GCC bug seems to be known (and fixed upstream...). Because of various bugs (including a bug in the Linux kernel ;-) NOHZ_FULL+intel_pstate), I didn't have time to analyze the impact of compilation options (-O2, -O3, LTO, PGO, etc.) on the stability of benchmark results. I isolated all CPUs of the NUMA node 1 (the CPU has two NUMA nodes): I added the following parameters to the the Linux kernel command line of the speed-python server: isolcpus=1,3,5,7,9,11,13,15,17,19,21,23 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23 Before running the benchmarks, I used the "python3 -m perf system tune" command (of the development version of perf) to tune the server. 
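As a quick sanity check that a benchmark process really runs on the isolated CPUs, something like the following can be used. This is a minimal sketch: the CPU list is the one used on this server, and perf's own CPU-affinity handling normally makes the explicit pinning unnecessary.

    import os

    # CPUs passed to isolcpus= on this server; adjust for another machine.
    ISOLATED = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23}

    allowed = os.sched_getaffinity(0)   # CPUs this process may run on
    print("allowed CPUs:", sorted(allowed))

    # Pin the current process (and future children, e.g. benchmark
    # workers) to the isolated CPUs only.
    os.sched_setaffinity(0, ISOLATED)
    print("pinned to:", sorted(os.sched_getaffinity(0)))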
Results of the tuning: ------------------------- $ sudo python3 -m perf system System state ============ ASLR: Full randomization Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23 CPU Frequency: 0,2,4,6,8,10,12,14,16,18,20,22=min=1600 MHz, max=3333 MHz; 1,3,5,7,9,11,13,15,17,19,21,23=min=max=3333 MHz Turbo Boost (MSR): CPU 0,2,4,6,8,10,12,14,16,18,20,22: enabled, CPU 1,3,5,7,9,11,13,15,17,19,21,23: disabled IRQ affinity: irqbalance service: inactive IRQ affinity: Default IRQ affinity: CPU 0,2,4,6,8,10,12,14,16,18,20,22 IRQ affinity: IRQ affinity: 0,2=0-23, 1,3-15,17,20,22-23,67-82=0,2,4,6,8,10,12,14,16,18,20,22 ------------------------- I don't well yet the hardware of the speed-python server. The CPU is a "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz": * I only disabled Turbo Boost on the CPUs used to run benchmarks. Maybe I should disable Turbo Boost on all CPUs? On my computers using intel_pstate, Turbo Boost is disabled globally (for all CPUs) using an option of the intel_pstate driver. * I didn't tune the CPU scaling governor yet: all CPUs use "ondemand" * Maybe I should use a fixed CPU frequency on all CPUs and use the "userland" scaling governor? Results seem more stable, but it's still not perfect yet (see below). See [Timeline] (x) Display all in grid: https://speed.python.org/timeline/#/?exe=4&ben=grid&env=1&revs=50&equid=off&quarts=on&extr=on There are still some hiccups: (*) call_method: temporary peak of 29 ms for October 19, whereas all other revisions are around 17 ms: https://speed.python.org/timeline/#/?exe=4&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on (*) python_startup increased from 21 ms to 27.5 ms between Sept 9 and Sept 15... The problem is that this one is not a temporary hiccup, but seems like a real performance regression: there are 4 points at 21 ms (Sept 4-Sept 9) and 7 points at 27.5 ms (Sept 15-Nov 3). But I was unable yet to reproduce the slowndown on my laptop. https://speed.python.org/timeline/#/?exe=4&ben=python_startup&env=1&revs=50&equid=off&quarts=on&extr=on Victor From victor.stinner at gmail.com Fri Nov 4 08:28:36 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 13:28:36 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance Message-ID: Hi, You may know that I'm working on benchmarks. I regenerated all benchmark results of speed.python.org using performance 0.3.2 (benchmark suite). I started to analyze results. All results are available online on the website: https://speed.python.org/ To communicate on my work on benchmarks, I tweeted two pictures: "sympy benchmarks: Python 3.6 is between 8% and 48% faster than Python 2.7 #python #benchmark": https://twitter.com/VictorStinner/status/794289596683210760 "Python 3.6 is between 25% and 54% slower than Python 2.7 in the following benchmarks": https://twitter.com/VictorStinner/status/794305065708376069 Many people were disappointed that Python 3.6 can be up to 54% slower than Python 2.7. In fact, I know many reasons which explain that, but it's hard to summarize them in 140 characters ;-) For example, Python 3.6 is 54% slower than Python 2.7 on the benchmark pycrypto_aes. This benchmark tests a pure Python implementation of the crypto cipher AES. You may know that CPython is slow for CPU intensive functions, especially on integer and floatting point numbers. 
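To give an idea of the kind of code this benchmark exercises, here is a toy loop with the same flavour of small-integer shift/xor/mask operations. It is not the real crypto_pyaes code, just an illustration of work where per-operation interpreter overhead dominates:

    # Toy stand-in for the inner loop of a byte-oriented cipher:
    # nothing but small-integer shifts, xors and masks.
    def xtime_all(state):
        out = []
        for b in state:
            doubled = b << 1
            if doubled & 0x100:      # reduce by the AES polynomial
                doubled ^= 0x11B
            out.append(doubled & 0xFF)
        return out

    state = list(range(256))
    for _ in range(1000):
        state = xtime_all(state)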
"int" in Python 3 is now "long integers" by default, which is known to be a little bit slower than "short int" of Python 2. On a more realistic benchmark (see other benchmarks), the overhead of Python 3 "long int" is negligible. AES is a typical example stressing integers. For me, it's a dummy benchmark: it doesn't make sense to use Python for AES: modern CPUs have an *hardware* implemention which is super fast. Well, I didn't have time to analyze in depth individual benchmarks. If you want to help me, here is the source code of benchmarks: https://github.com/python/performance/blob/master/performance/benchmarks/ Raw results of Python 3.6 compared to Python 2.7: ------------------- $ python3 -m perf compare_to 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz 2016-11-03_15-38-3.6-c4319c0d0131.json.gz -G --min-speed=5 Slower (40): - python_startup: 7.74 ms +- 0.28 ms -> 26.9 ms +- 0.6 ms: 3.47x slower - python_startup_no_site: 4.43 ms +- 0.08 ms -> 10.4 ms +- 0.4 ms: 2.36x slower - unpickle_pure_python: 417 us +- 3 us -> 918 us +- 14 us: 2.20x slower - call_method: 16.3 ms +- 0.2 ms -> 28.6 ms +- 0.8 ms: 1.76x slower - call_method_slots: 16.2 ms +- 0.4 ms -> 28.3 ms +- 0.7 ms: 1.75x slower - call_method_unknown: 18.4 ms +- 0.2 ms -> 30.8 ms +- 0.8 ms: 1.67x slower - crypto_pyaes: 161 ms +- 2 ms -> 249 ms +- 2 ms: 1.54x slower - xml_etree_parse: 201 ms +- 5 ms -> 298 ms +- 8 ms: 1.49x slower - logging_simple: 26.4 us +- 0.3 us -> 38.4 us +- 0.7 us: 1.46x slower - logging_format: 31.3 us +- 0.4 us -> 45.5 us +- 0.8 us: 1.45x slower - pickle_pure_python: 986 us +- 9 us -> 1.41 ms +- 0.03 ms: 1.43x slower - spectral_norm: 208 ms +- 2 ms -> 287 ms +- 2 ms: 1.38x slower - logging_silent: 660 ns +- 7 ns -> 865 ns +- 31 ns: 1.31x slower - chaos: 240 ms +- 2 ms -> 314 ms +- 4 ms: 1.31x slower - go: 490 ms +- 2 ms -> 640 ms +- 26 ms: 1.31x slower - xml_etree_iterparse: 178 ms +- 2 ms -> 230 ms +- 5 ms: 1.29x slower - sqlite_synth: 8.29 us +- 0.16 us -> 10.6 us +- 0.2 us: 1.28x slower - xml_etree_process: 210 ms +- 6 ms -> 268 ms +- 14 ms: 1.28x slower - django_template: 387 ms +- 4 ms -> 484 ms +- 5 ms: 1.25x slower - fannkuch: 830 ms +- 32 ms -> 1.04 sec +- 0.03 sec: 1.25x slower - hexiom: 20.2 ms +- 0.1 ms -> 24.7 ms +- 0.2 ms: 1.22x slower - chameleon: 26.1 ms +- 0.2 ms -> 31.9 ms +- 0.4 ms: 1.22x slower - regex_compile: 395 ms +- 2 ms -> 482 ms +- 6 ms: 1.22x slower - json_dumps: 25.8 ms +- 0.2 ms -> 31.0 ms +- 0.5 ms: 1.20x slower - nqueens: 229 ms +- 2 ms -> 274 ms +- 2 ms: 1.20x slower - genshi_text: 81.9 ms +- 0.6 ms -> 97.8 ms +- 1.1 ms: 1.19x slower - raytrace: 1.17 sec +- 0.03 sec -> 1.39 sec +- 0.03 sec: 1.19x slower - scimark_monte_carlo: 240 ms +- 7 ms -> 282 ms +- 10 ms: 1.17x slower - scimark_sor: 441 ms +- 8 ms -> 517 ms +- 12 ms: 1.17x slower - deltablue: 17.4 ms +- 0.1 ms -> 20.1 ms +- 0.6 ms: 1.16x slower - sqlalchemy_declarative: 310 ms +- 3 ms -> 354 ms +- 6 ms: 1.14x slower - call_simple: 12.2 ms +- 0.2 ms -> 13.9 ms +- 0.2 ms: 1.14x slower - scimark_fft: 613 ms +- 19 ms -> 694 ms +- 23 ms: 1.13x slower - meteor_contest: 191 ms +- 1 ms -> 215 ms +- 2 ms: 1.13x slower - pathlib: 46.9 ms +- 0.4 ms -> 52.6 ms +- 0.9 ms: 1.12x slower - richards: 181 ms +- 1 ms -> 201 ms +- 6 ms: 1.11x slower - genshi_xml: 191 ms +- 2 ms -> 209 ms +- 2 ms: 1.10x slower - float: 290 ms +- 5 ms -> 310 ms +- 7 ms: 1.07x slower - scimark_sparse_mat_mult: 8.19 ms +- 0.22 ms -> 8.74 ms +- 0.15 ms: 1.07x slower - xml_etree_generate: 302 ms +- 3 ms -> 320 ms +- 8 ms: 1.06x slower Faster (15): - telco: 707 ms +- 22 
ms -> 22.1 ms +- 0.4 ms: 32.04x faster - unpickle_list: 15.0 us +- 0.3 us -> 7.86 us +- 0.16 us: 1.90x faster - pickle_list: 14.7 us +- 0.2 us -> 9.12 us +- 0.38 us: 1.61x faster - json_loads: 98.7 us +- 2.3 us -> 62.3 us +- 0.7 us: 1.58x faster - pickle: 40.4 us +- 0.6 us -> 27.1 us +- 0.5 us: 1.49x faster - sympy_sum: 361 ms +- 10 ms -> 244 ms +- 7 ms: 1.48x faster - sympy_expand: 1.68 sec +- 0.02 sec -> 1.15 sec +- 0.03 sec: 1.47x faster - regex_v8: 62.0 ms +- 0.5 ms -> 47.2 ms +- 0.6 ms: 1.31x faster - sympy_str: 699 ms +- 22 ms -> 537 ms +- 15 ms: 1.30x faster - regex_effbot: 6.67 ms +- 0.04 ms -> 5.23 ms +- 0.05 ms: 1.28x faster - mako: 61.5 ms +- 0.7 ms -> 49.7 ms +- 2.5 ms: 1.24x faster - html5lib: 298 ms +- 7 ms -> 261 ms +- 6 ms: 1.14x faster - sympy_integrate: 55.9 ms +- 0.3 ms -> 51.8 ms +- 1.0 ms: 1.08x faster - pickle_dict: 69.4 us +- 0.9 us -> 65.2 us +- 3.2 us: 1.06x faster - scimark_lu: 551 ms +- 26 ms -> 523 ms +- 18 ms: 1.05x faster Benchmark hidden because not significant (8): 2to3, dulwich_log, nbody, pidigits, regex_dna, tornado_http, unpack_sequence, unpickle Ignored benchmarks (3) of 2016-11-03_15-36-2.7-91f024fc9b3a.json: hg_startup, pyflate, spambayes ------------------- Please ignore call_method, call_method_slots, call_method_unknown benchmarks: it seems like I had an issue on the benchmark server. I was unable to reproduce he 70% slowdown on my laptop. I attached the two compressed JSON files to this email if you want to analyze them yourself. I hope that my work on benchmarks will motive some developers to look closer at Python 3 performance to find interesting optimizations ;-) Victor -------------- next part -------------- A non-text attachment was scrubbed... Name: 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz Type: application/x-gzip Size: 107594 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 2016-11-03_15-38-3.6-c4319c0d0131.json.gz Type: application/x-gzip Size: 102546 bytes Desc: not available URL: From tobami at gmail.com Fri Nov 4 15:18:48 2016 From: tobami at gmail.com (Miquel Torres) Date: Fri, 04 Nov 2016 19:18:48 +0000 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: Nice! For the record, I'll be giving a talk in PyCon Ireland about Codespeed. Would you mind me citing those tweets and screenshots, to highlight usage on speed.python.org? You mentioned new more reliable vs old results. How close are we to have an stable setup that gives us benchmarks numbers regularly? Cheers, Miquel El El vie, 4 nov 2016 a las 12:30, Victor Stinner escribi?: > Hi, > > You may know that I'm working on benchmarks. I regenerated all > benchmark results of speed.python.org using performance 0.3.2 > (benchmark suite). I started to analyze results. > > All results are available online on the website: > > https://speed.python.org/ > > > To communicate on my work on benchmarks, I tweeted two pictures: > > "sympy benchmarks: Python 3.6 is between 8% and 48% faster than Python > 2.7 #python #benchmark": > https://twitter.com/VictorStinner/status/794289596683210760 > > "Python 3.6 is between 25% and 54% slower than Python 2.7 in the > following benchmarks": > https://twitter.com/VictorStinner/status/794305065708376069 > > > Many people were disappointed that Python 3.6 can be up to 54% slower > than Python 2.7. 
In fact, I know many reasons which explain that, but > it's hard to summarize them in 140 characters ;-) > > For example, Python 3.6 is 54% slower than Python 2.7 on the benchmark > pycrypto_aes. This benchmark tests a pure Python implementation of the > crypto cipher AES. You may know that CPython is slow for CPU intensive > functions, especially on integer and floatting point numbers. > > "int" in Python 3 is now "long integers" by default, which is known to > be a little bit slower than "short int" of Python 2. On a more > realistic benchmark (see other benchmarks), the overhead of Python 3 > "long int" is negligible. > > AES is a typical example stressing integers. For me, it's a dummy > benchmark: it doesn't make sense to use Python for AES: modern CPUs > have an *hardware* implemention which is super fast. > > > Well, I didn't have time to analyze in depth individual benchmarks. If > you want to help me, here is the source code of benchmarks: > https://github.com/python/performance/blob/master/performance/benchmarks/ > > > Raw results of Python 3.6 compared to Python 2.7: > ------------------- > $ python3 -m perf compare_to 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz > 2016-11-03_15-38-3.6-c4319c0d0131.json.gz -G --min-speed=5 > Slower (40): > - python_startup: 7.74 ms +- 0.28 ms -> 26.9 ms +- 0.6 ms: 3.47x slower > - python_startup_no_site: 4.43 ms +- 0.08 ms -> 10.4 ms +- 0.4 ms: 2.36x > slower > - unpickle_pure_python: 417 us +- 3 us -> 918 us +- 14 us: 2.20x slower > - call_method: 16.3 ms +- 0.2 ms -> 28.6 ms +- 0.8 ms: 1.76x slower > - call_method_slots: 16.2 ms +- 0.4 ms -> 28.3 ms +- 0.7 ms: 1.75x slower > - call_method_unknown: 18.4 ms +- 0.2 ms -> 30.8 ms +- 0.8 ms: 1.67x slower > - crypto_pyaes: 161 ms +- 2 ms -> 249 ms +- 2 ms: 1.54x slower > - xml_etree_parse: 201 ms +- 5 ms -> 298 ms +- 8 ms: 1.49x slower > - logging_simple: 26.4 us +- 0.3 us -> 38.4 us +- 0.7 us: 1.46x slower > - logging_format: 31.3 us +- 0.4 us -> 45.5 us +- 0.8 us: 1.45x slower > - pickle_pure_python: 986 us +- 9 us -> 1.41 ms +- 0.03 ms: 1.43x slower > - spectral_norm: 208 ms +- 2 ms -> 287 ms +- 2 ms: 1.38x slower > - logging_silent: 660 ns +- 7 ns -> 865 ns +- 31 ns: 1.31x slower > - chaos: 240 ms +- 2 ms -> 314 ms +- 4 ms: 1.31x slower > - go: 490 ms +- 2 ms -> 640 ms +- 26 ms: 1.31x slower > - xml_etree_iterparse: 178 ms +- 2 ms -> 230 ms +- 5 ms: 1.29x slower > - sqlite_synth: 8.29 us +- 0.16 us -> 10.6 us +- 0.2 us: 1.28x slower > - xml_etree_process: 210 ms +- 6 ms -> 268 ms +- 14 ms: 1.28x slower > - django_template: 387 ms +- 4 ms -> 484 ms +- 5 ms: 1.25x slower > - fannkuch: 830 ms +- 32 ms -> 1.04 sec +- 0.03 sec: 1.25x slower > - hexiom: 20.2 ms +- 0.1 ms -> 24.7 ms +- 0.2 ms: 1.22x slower > - chameleon: 26.1 ms +- 0.2 ms -> 31.9 ms +- 0.4 ms: 1.22x slower > - regex_compile: 395 ms +- 2 ms -> 482 ms +- 6 ms: 1.22x slower > - json_dumps: 25.8 ms +- 0.2 ms -> 31.0 ms +- 0.5 ms: 1.20x slower > - nqueens: 229 ms +- 2 ms -> 274 ms +- 2 ms: 1.20x slower > - genshi_text: 81.9 ms +- 0.6 ms -> 97.8 ms +- 1.1 ms: 1.19x slower > - raytrace: 1.17 sec +- 0.03 sec -> 1.39 sec +- 0.03 sec: 1.19x slower > - scimark_monte_carlo: 240 ms +- 7 ms -> 282 ms +- 10 ms: 1.17x slower > - scimark_sor: 441 ms +- 8 ms -> 517 ms +- 12 ms: 1.17x slower > - deltablue: 17.4 ms +- 0.1 ms -> 20.1 ms +- 0.6 ms: 1.16x slower > - sqlalchemy_declarative: 310 ms +- 3 ms -> 354 ms +- 6 ms: 1.14x slower > - call_simple: 12.2 ms +- 0.2 ms -> 13.9 ms +- 0.2 ms: 1.14x slower > - scimark_fft: 613 ms +- 19 ms -> 694 ms +- 23 ms: 
1.13x slower > - meteor_contest: 191 ms +- 1 ms -> 215 ms +- 2 ms: 1.13x slower > - pathlib: 46.9 ms +- 0.4 ms -> 52.6 ms +- 0.9 ms: 1.12x slower > - richards: 181 ms +- 1 ms -> 201 ms +- 6 ms: 1.11x slower > - genshi_xml: 191 ms +- 2 ms -> 209 ms +- 2 ms: 1.10x slower > - float: 290 ms +- 5 ms -> 310 ms +- 7 ms: 1.07x slower > - scimark_sparse_mat_mult: 8.19 ms +- 0.22 ms -> 8.74 ms +- 0.15 ms: > 1.07x slower > - xml_etree_generate: 302 ms +- 3 ms -> 320 ms +- 8 ms: 1.06x slower > > Faster (15): > - telco: 707 ms +- 22 ms -> 22.1 ms +- 0.4 ms: 32.04x faster > - unpickle_list: 15.0 us +- 0.3 us -> 7.86 us +- 0.16 us: 1.90x faster > - pickle_list: 14.7 us +- 0.2 us -> 9.12 us +- 0.38 us: 1.61x faster > - json_loads: 98.7 us +- 2.3 us -> 62.3 us +- 0.7 us: 1.58x faster > - pickle: 40.4 us +- 0.6 us -> 27.1 us +- 0.5 us: 1.49x faster > - sympy_sum: 361 ms +- 10 ms -> 244 ms +- 7 ms: 1.48x faster > - sympy_expand: 1.68 sec +- 0.02 sec -> 1.15 sec +- 0.03 sec: 1.47x faster > - regex_v8: 62.0 ms +- 0.5 ms -> 47.2 ms +- 0.6 ms: 1.31x faster > - sympy_str: 699 ms +- 22 ms -> 537 ms +- 15 ms: 1.30x faster > - regex_effbot: 6.67 ms +- 0.04 ms -> 5.23 ms +- 0.05 ms: 1.28x faster > - mako: 61.5 ms +- 0.7 ms -> 49.7 ms +- 2.5 ms: 1.24x faster > - html5lib: 298 ms +- 7 ms -> 261 ms +- 6 ms: 1.14x faster > - sympy_integrate: 55.9 ms +- 0.3 ms -> 51.8 ms +- 1.0 ms: 1.08x faster > - pickle_dict: 69.4 us +- 0.9 us -> 65.2 us +- 3.2 us: 1.06x faster > - scimark_lu: 551 ms +- 26 ms -> 523 ms +- 18 ms: 1.05x faster > > Benchmark hidden because not significant (8): 2to3, dulwich_log, > nbody, pidigits, regex_dna, tornado_http, unpack_sequence, unpickle > Ignored benchmarks (3) of 2016-11-03_15-36-2.7-91f024fc9b3a.json: > hg_startup, pyflate, spambayes > ------------------- > > Please ignore call_method, call_method_slots, call_method_unknown > benchmarks: it seems like I had an issue on the benchmark server. I > was unable to reproduce he 70% slowdown on my laptop. > > I attached the two compressed JSON files to this email if you want to > analyze them yourself. > > I hope that my work on benchmarks will motive some developers to look > closer at Python 3 performance to find interesting optimizations ;-) > > Victor > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yselivanov.ml at gmail.com Fri Nov 4 15:21:31 2016 From: yselivanov.ml at gmail.com (Yury Selivanov) Date: Fri, 4 Nov 2016 15:21:31 -0400 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: I'm curious why call_* benchmarks became slower on 3.x? Yury On 2016-11-04 8:28 AM, Victor Stinner wrote: > Hi, > > You may know that I'm working on benchmarks. I regenerated all > benchmark results of speed.python.org using performance 0.3.2 > (benchmark suite). I started to analyze results. 
> > All results are available online on the website: > > https://speed.python.org/ > > > To communicate on my work on benchmarks, I tweeted two pictures: > > "sympy benchmarks: Python 3.6 is between 8% and 48% faster than Python > 2.7 #python #benchmark": > https://twitter.com/VictorStinner/status/794289596683210760 > > "Python 3.6 is between 25% and 54% slower than Python 2.7 in the > following benchmarks": > https://twitter.com/VictorStinner/status/794305065708376069 > > > Many people were disappointed that Python 3.6 can be up to 54% slower > than Python 2.7. In fact, I know many reasons which explain that, but > it's hard to summarize them in 140 characters ;-) > > For example, Python 3.6 is 54% slower than Python 2.7 on the benchmark > pycrypto_aes. This benchmark tests a pure Python implementation of the > crypto cipher AES. You may know that CPython is slow for CPU intensive > functions, especially on integer and floatting point numbers. > > "int" in Python 3 is now "long integers" by default, which is known to > be a little bit slower than "short int" of Python 2. On a more > realistic benchmark (see other benchmarks), the overhead of Python 3 > "long int" is negligible. > > AES is a typical example stressing integers. For me, it's a dummy > benchmark: it doesn't make sense to use Python for AES: modern CPUs > have an *hardware* implemention which is super fast. > > > Well, I didn't have time to analyze in depth individual benchmarks. If > you want to help me, here is the source code of benchmarks: > https://github.com/python/performance/blob/master/performance/benchmarks/ > > > Raw results of Python 3.6 compared to Python 2.7: > ------------------- > $ python3 -m perf compare_to 2016-11-03_15-36-2.7-91f024fc9b3a.json.gz > 2016-11-03_15-38-3.6-c4319c0d0131.json.gz -G --min-speed=5 > Slower (40): > - python_startup: 7.74 ms +- 0.28 ms -> 26.9 ms +- 0.6 ms: 3.47x slower > - python_startup_no_site: 4.43 ms +- 0.08 ms -> 10.4 ms +- 0.4 ms: 2.36x slower > - unpickle_pure_python: 417 us +- 3 us -> 918 us +- 14 us: 2.20x slower > - call_method: 16.3 ms +- 0.2 ms -> 28.6 ms +- 0.8 ms: 1.76x slower > - call_method_slots: 16.2 ms +- 0.4 ms -> 28.3 ms +- 0.7 ms: 1.75x slower > - call_method_unknown: 18.4 ms +- 0.2 ms -> 30.8 ms +- 0.8 ms: 1.67x slower > - crypto_pyaes: 161 ms +- 2 ms -> 249 ms +- 2 ms: 1.54x slower > - xml_etree_parse: 201 ms +- 5 ms -> 298 ms +- 8 ms: 1.49x slower > - logging_simple: 26.4 us +- 0.3 us -> 38.4 us +- 0.7 us: 1.46x slower > - logging_format: 31.3 us +- 0.4 us -> 45.5 us +- 0.8 us: 1.45x slower > - pickle_pure_python: 986 us +- 9 us -> 1.41 ms +- 0.03 ms: 1.43x slower > - spectral_norm: 208 ms +- 2 ms -> 287 ms +- 2 ms: 1.38x slower > - logging_silent: 660 ns +- 7 ns -> 865 ns +- 31 ns: 1.31x slower > - chaos: 240 ms +- 2 ms -> 314 ms +- 4 ms: 1.31x slower > - go: 490 ms +- 2 ms -> 640 ms +- 26 ms: 1.31x slower > - xml_etree_iterparse: 178 ms +- 2 ms -> 230 ms +- 5 ms: 1.29x slower > - sqlite_synth: 8.29 us +- 0.16 us -> 10.6 us +- 0.2 us: 1.28x slower > - xml_etree_process: 210 ms +- 6 ms -> 268 ms +- 14 ms: 1.28x slower > - django_template: 387 ms +- 4 ms -> 484 ms +- 5 ms: 1.25x slower > - fannkuch: 830 ms +- 32 ms -> 1.04 sec +- 0.03 sec: 1.25x slower > - hexiom: 20.2 ms +- 0.1 ms -> 24.7 ms +- 0.2 ms: 1.22x slower > - chameleon: 26.1 ms +- 0.2 ms -> 31.9 ms +- 0.4 ms: 1.22x slower > - regex_compile: 395 ms +- 2 ms -> 482 ms +- 6 ms: 1.22x slower > - json_dumps: 25.8 ms +- 0.2 ms -> 31.0 ms +- 0.5 ms: 1.20x slower > - nqueens: 229 ms +- 2 ms -> 274 ms +- 2 
ms: 1.20x slower > - genshi_text: 81.9 ms +- 0.6 ms -> 97.8 ms +- 1.1 ms: 1.19x slower > - raytrace: 1.17 sec +- 0.03 sec -> 1.39 sec +- 0.03 sec: 1.19x slower > - scimark_monte_carlo: 240 ms +- 7 ms -> 282 ms +- 10 ms: 1.17x slower > - scimark_sor: 441 ms +- 8 ms -> 517 ms +- 12 ms: 1.17x slower > - deltablue: 17.4 ms +- 0.1 ms -> 20.1 ms +- 0.6 ms: 1.16x slower > - sqlalchemy_declarative: 310 ms +- 3 ms -> 354 ms +- 6 ms: 1.14x slower > - call_simple: 12.2 ms +- 0.2 ms -> 13.9 ms +- 0.2 ms: 1.14x slower > - scimark_fft: 613 ms +- 19 ms -> 694 ms +- 23 ms: 1.13x slower > - meteor_contest: 191 ms +- 1 ms -> 215 ms +- 2 ms: 1.13x slower > - pathlib: 46.9 ms +- 0.4 ms -> 52.6 ms +- 0.9 ms: 1.12x slower > - richards: 181 ms +- 1 ms -> 201 ms +- 6 ms: 1.11x slower > - genshi_xml: 191 ms +- 2 ms -> 209 ms +- 2 ms: 1.10x slower > - float: 290 ms +- 5 ms -> 310 ms +- 7 ms: 1.07x slower > - scimark_sparse_mat_mult: 8.19 ms +- 0.22 ms -> 8.74 ms +- 0.15 ms: > 1.07x slower > - xml_etree_generate: 302 ms +- 3 ms -> 320 ms +- 8 ms: 1.06x slower > > Faster (15): > - telco: 707 ms +- 22 ms -> 22.1 ms +- 0.4 ms: 32.04x faster > - unpickle_list: 15.0 us +- 0.3 us -> 7.86 us +- 0.16 us: 1.90x faster > - pickle_list: 14.7 us +- 0.2 us -> 9.12 us +- 0.38 us: 1.61x faster > - json_loads: 98.7 us +- 2.3 us -> 62.3 us +- 0.7 us: 1.58x faster > - pickle: 40.4 us +- 0.6 us -> 27.1 us +- 0.5 us: 1.49x faster > - sympy_sum: 361 ms +- 10 ms -> 244 ms +- 7 ms: 1.48x faster > - sympy_expand: 1.68 sec +- 0.02 sec -> 1.15 sec +- 0.03 sec: 1.47x faster > - regex_v8: 62.0 ms +- 0.5 ms -> 47.2 ms +- 0.6 ms: 1.31x faster > - sympy_str: 699 ms +- 22 ms -> 537 ms +- 15 ms: 1.30x faster > - regex_effbot: 6.67 ms +- 0.04 ms -> 5.23 ms +- 0.05 ms: 1.28x faster > - mako: 61.5 ms +- 0.7 ms -> 49.7 ms +- 2.5 ms: 1.24x faster > - html5lib: 298 ms +- 7 ms -> 261 ms +- 6 ms: 1.14x faster > - sympy_integrate: 55.9 ms +- 0.3 ms -> 51.8 ms +- 1.0 ms: 1.08x faster > - pickle_dict: 69.4 us +- 0.9 us -> 65.2 us +- 3.2 us: 1.06x faster > - scimark_lu: 551 ms +- 26 ms -> 523 ms +- 18 ms: 1.05x faster > > Benchmark hidden because not significant (8): 2to3, dulwich_log, > nbody, pidigits, regex_dna, tornado_http, unpack_sequence, unpickle > Ignored benchmarks (3) of 2016-11-03_15-36-2.7-91f024fc9b3a.json: > hg_startup, pyflate, spambayes > ------------------- > > Please ignore call_method, call_method_slots, call_method_unknown > benchmarks: it seems like I had an issue on the benchmark server. I > was unable to reproduce he 70% slowdown on my laptop. > > I attached the two compressed JSON files to this email if you want to > analyze them yourself. > > I hope that my work on benchmarks will motive some developers to look > closer at Python 3 performance to find interesting optimizations ;-) > > Victor > > > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed From victor.stinner at gmail.com Fri Nov 4 16:56:21 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 21:56:21 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: 2016-11-04 20:18 GMT+01:00 Miquel Torres : > Nice! For the record, I'll be giving a talk in PyCon Ireland about > Codespeed. Would you mind me citing those tweets and screenshots, to > highlight usage on speed.python.org? Sure. Keep me in touch in you publish your slides later. > You mentioned new more reliable vs old results. 
How close are we to have an > stable setup that gives us benchmarks numbers regularly? My plan for the short term is to analyze last (latest?) benchmarks hiccups and try to fix them. The fully automated script to run benchmarks is already written: https://github.com/python/performance/tree/master/scripts Then, the plan we decided with Zachary Ware is to run a script in a loop which compiles the default branch of CPython. Later, we may also do the same for 2.7 and 3.6 branches. And then add PyPy (and PyPy 3). Victor From victor.stinner at gmail.com Fri Nov 4 16:58:19 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 21:58:19 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: 2016-11-04 20:21 GMT+01:00 Yury Selivanov : > I'm curious why call_* benchmarks became slower on 3.x? It's almost the same between 2.7 and default. For 3.6, it looks like an issue on the benchmark runner, not on Python itself: >> Please ignore call_method, call_method_slots, call_method_unknown >> benchmarks: it seems like I had an issue on the benchmark server. I >> was unable to reproduce he 70% slowdown on my laptop. Victor From victor.stinner at gmail.com Fri Nov 4 18:35:26 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 4 Nov 2016 23:35:26 +0100 Subject: [Speed] Performance difference in call_method() Message-ID: Hi, I noticed a temporary performance peak in the call_method: https://speed.python.org/timeline/#/?exe=4&ben=call_method&env=1&revs=50&equid=off&quarts=on&extr=on The difference is major: 17 ms => 29 ms, 70% slower! I expected a temporary issue on the server used to run benchmarks, but... I reproduced the result on the server. Recently, the performance of call_method() changed in CPython default from 17 ms to 28 ms (well, the exact value is variable: 25 ms, 28 ms, 29 ms, ...) and then back to 17 ms: (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms https://hg.python.org/cpython/rev/83877018ef97 (2) 3e073e7b4460: 28 ms => 204a43c452cc (Oct 22): 17 ms https://hg.python.org/cpython/rev/204a43c452cc None of these revisions modify code used in the call_method() benchmark, so I guess that it's yet another compiler joke. On my laptop and my desktop PC, I'm unable to reproduce the issue: the performance is the same (I tested ce85a1f129e3, 83877018ef97, 204a43c452cc). These PC uses Fedora 24, GCC 6.2.1. CPUs: * laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz * desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz The speed-python runs Ubuntu 14.04, GCC 4.8.4-2ubuntu1~14.04. CPU: "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz". call_method() benchmark is a microbenchmark which seems to depend a lot of very low level stuff like CPU L1 cache. Maybe the impact on the compiler is more important on speed-python which has an older CPU, than my more recent hardware. Maybe GCC 6.2 produces more efficient machine code than GCC 4.8. I expect that PGO would "fix" the call_method() performance issue, but PGO compilation fails on Ubuntu 14.04 with a compiler error :-p A solution would be to upgrade the OS of this server. Victor From victor.stinner at gmail.com Fri Nov 4 19:20:48 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 00:20:48 +0100 Subject: [Speed] Performance difference in call_method() In-Reply-To: References: Message-ID: I found some interesting differences using the Linux perf tool. 
# perf stat -e L1-icache-loads,L1-icache-load-misses ./python performance/benchmarks/bm_call_method.py --inherit=PYTHONPATH -v --worker -l1 -n 25 -w0 2016-11-04 23:35 GMT+01:00 Victor Stinner : > (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms > > https://hg.python.org/cpython/rev/83877018ef97 Comparison of metrics of rev ce85a1f129e3 (fast) => rev 83877018ef97 (slow): L1-icache-load-misses: 0.06% => 8.41% of all L1-icache hits Instructions per cycle: 2.38 => 1.41 stalled-cycles-frontend: 12.99% => 42.85% frontend cycles idle stalled-cycles-backend: 2.28% => 21.36% backend cycles idle So it confirms what I expected: call_method() is highly impacted by the CPU L1 instruction cache. I don't know exactly why the revision 83877018ef97 has an impact on the CPU L1 cache. Victor From victor.stinner at gmail.com Fri Nov 4 20:31:22 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 01:31:22 +0100 Subject: [Speed] Performance difference in call_method() In-Reply-To: References: Message-ID: I proposed a patch which fixes the issue: http://bugs.python.org/issue28618 "Decorate hot functions using __attribute__((hot)) to optimize Python" Victor From victor.stinner at gmail.com Fri Nov 4 22:23:15 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 03:23:15 +0100 Subject: [Speed] Benchmarks: Comparison between Python 2.7 and Python 3.6 performance In-Reply-To: References: Message-ID: 2016-11-04 21:58 GMT+01:00 Victor Stinner : > 2016-11-04 20:21 GMT+01:00 Yury Selivanov : >> I'm curious why call_* benchmarks became slower on 3.x? > > It's almost the same between 2.7 and default. For 3.6, it looks like > an issue on the benchmark runner, not on Python itself: (...) Aha, it seems to be a compiler performance issue. I proposed a patch to fix the issue: http://bugs.python.org/issue28618 Victor From ncoghlan at gmail.com Sat Nov 5 10:56:27 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 6 Nov 2016 00:56:27 +1000 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: On 3 November 2016 at 02:03, Armin Rigo wrote: > Hi Victor, > > On 2 November 2016 at 16:53, Victor Stinner wrote: >> 2016-11-02 15:20 GMT+01:00 Armin Rigo : >>> Is that really the kind of examples you want to put forward? >> >> I am not a big fan of timeit, but we must use it sometimes to >> micro-optimizations in CPython to check if an optimize really makes >> CPython faster or not. I am only trying to enhance timeit. >> Understanding results require to understand how the statements are >> executed. > > Don't get me wrong, I understand the point of the following usage of timeit: > > python2 -m perf timeit '[1,2]*1000' --duplicate=1000 > > What I'm criticizing here is this instead: > > python2 -m perf timeit '[1,2]*1000' --duplicate=1000 --compare-to=pypy > > because you're very unlikely to get any relevant information from such > a comparison. I stand by my original remark: I would say it should be > an error or at least a big fat warning to use --duplicate and PyPy in > the same invocation. This is as opposed to silently ignoring > --duplicate for PyPy, which is just adding more confusion imho. Since the use case for --duplicate is to reduce the relative overhead of the outer loop when testing a micro-optimisation within a *given* interpreter, perhaps the error should be for combining --duplicate and --compare-to at all? 
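Roughly, the check could look like this (a sketch using argparse, not perf's real option-handling code):

    import argparse

    # Usage sketch: sketch.py '[1,2]*1000' --duplicate=1000 --compare-to=pypy
    parser = argparse.ArgumentParser(prog="perf timeit")
    parser.add_argument("stmt")
    parser.add_argument("--duplicate", type=int, default=1)
    parser.add_argument("--compare-to", dest="compare_to", metavar="PYTHON")
    args = parser.parse_args()

    # Reject the combination outright instead of silently ignoring
    # --duplicate for one of the two interpreters.
    if args.compare_to and args.duplicate > 1:
        parser.error("--duplicate cannot be combined with --compare-to")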
And then it would just be up to developers of a *particular* implementation to know if "--duplicate" is relevant to them. Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From victor.stinner at gmail.com Sat Nov 5 11:34:47 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 5 Nov 2016 16:34:47 +0100 Subject: [Speed] Latest enhancements of perf 0.8.1 and performance 0.3.1 In-Reply-To: References: Message-ID: 2016-11-05 15:56 GMT+01:00 Nick Coghlan : > Since the use case for --duplicate is to reduce the relative overhead > of the outer loop when testing a micro-optimisation within a *given* > interpreter, perhaps the error should be for combining --duplicate and > --compare-to at all? And then it would just be up to developers of a > *particular* implementation to know if "--duplicate" is relevant to > them. Hum, I think that using "timeit --compare-to=python --duplicate=1000" makes sense when you compare two versions of CPython. If I understood correctly Armin, the usage of --duplicate on a Python using a JIT must fail with an error. It's in my (long) TODO list ;-) Victor From ncoghlan at gmail.com Sat Nov 5 12:35:47 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 6 Nov 2016 02:35:47 +1000 Subject: [Speed] New benchmarks results on speed.python.org In-Reply-To: References: Message-ID: On 4 November 2016 at 22:12, Victor Stinner wrote: > I don't well yet the hardware of the speed-python server. The CPU is a > "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz": This is still the system HP contributed a few years back, so the full system specs can be found at https://speed.python.org/about/ Once you get the benchmark suite up and running reliably there, it could be interesting to get it running under Beaker [1] and then let it loose as an automated job in Red Hat's hardware compatibility testing environment :) Cheers, Nick. [1] https://beaker-project.org/ -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From kmod at dropbox.com Mon Nov 7 22:59:12 2016 From: kmod at dropbox.com (Kevin Modzelewski) Date: Mon, 7 Nov 2016 19:59:12 -0800 Subject: [Speed] Performance difference in call_method() In-Reply-To: References: Message-ID: Code layout matters a lot and you can get lucky or unlucky with it. I wasn't able to make it to this talk but the slides look quite interesting: https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes- of-performance-instability-due-to-code-placement-in-x86 I'm not sure how much us mere mortals can debug this sort of thing, but I know the intel folks have at one point expressed interest in making sure that Python runs quickly on their processors so they might be willing to give advice (the deck even says "if all else fails, ask Intel"). On Fri, Nov 4, 2016 at 3:35 PM, Victor Stinner wrote: > Hi, > > I noticed a temporary performance peak in the call_method: > > https://speed.python.org/timeline/#/?exe=4&ben=call_ > method&env=1&revs=50&equid=off&quarts=on&extr=on > > The difference is major: 17 ms => 29 ms, 70% slower! > > I expected a temporary issue on the server used to run benchmarks, > but... I reproduced the result on the server. > > Recently, the performance of call_method() changed in CPython default > from 17 ms to 28 ms (well, the exact value is variable: 25 ms, 28 ms, > 29 ms, ...) 
and then back to 17 ms: > > (1) ce85a1f129e3: 17 ms => 83877018ef97 (Oct 18): 25 ms > > https://hg.python.org/cpython/rev/83877018ef97 > > (2) 3e073e7b4460: 28 ms => 204a43c452cc (Oct 22): 17 ms > > https://hg.python.org/cpython/rev/204a43c452cc > > None of these revisions modify code used in the call_method() > benchmark, so I guess that it's yet another compiler joke. > > > On my laptop and my desktop PC, I'm unable to reproduce the issue: the > performance is the same (I tested ce85a1f129e3, 83877018ef97, > 204a43c452cc). These PC uses Fedora 24, GCC 6.2.1. CPUs: > > * laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz > * desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz > > > The speed-python runs Ubuntu 14.04, GCC 4.8.4-2ubuntu1~14.04. CPU: > "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz". > > > call_method() benchmark is a microbenchmark which seems to depend a > lot of very low level stuff like CPU L1 cache. Maybe the impact on the > compiler is more important on speed-python which has an older CPU, > than my more recent hardware. Maybe GCC 6.2 produces more efficient > machine code than GCC 4.8. > > > I expect that PGO would "fix" the call_method() performance issue, but > PGO compilation fails on Ubuntu 14.04 with a compiler error :-p A > solution would be to upgrade the OS of this server. > > Victor > _______________________________________________ > Speed mailing list > Speed at python.org > https://mail.python.org/mailman/listinfo/speed > -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul at paulgraydon.co.uk Thu Nov 10 12:01:20 2016 From: paul at paulgraydon.co.uk (Paul Graydon) Date: Thu, 10 Nov 2016 17:01:20 +0000 Subject: [Speed] Ubuntu 16.04 speed issues Message-ID: <20161110170120.GA13009@paulgraydon.co.uk> I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm completely failing to find it in my emails. The OpenStack-Ansible project has noticed that performance on Ubuntu 16.04 is quite significantly slower than on 14.04. At the moment it's looking like *possibly* a GCC related bug. https://bugs.launchpad.net/ubuntu/+source/python2.7/+bug/1638695 From victor.stinner at gmail.com Thu Nov 10 16:31:47 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 10 Nov 2016 22:31:47 +0100 Subject: [Speed] Ubuntu 16.04 speed issues In-Reply-To: <20161110170120.GA13009@paulgraydon.co.uk> References: <20161110170120.GA13009@paulgraydon.co.uk> Message-ID: Hello, > The OpenStack-Ansible project has noticed that performance on Ubuntu 16.04 is quite significantly slower than on 14.04. > At the moment it's looking like *possibly* a GCC related bug. Is it exactly the same Python version? What is the full version? Try to get compiler flags: python2 -c 'import sysconfig; print(sysconfig.get_config_var("CFLAGS"))' 2016-11-10 18:01 GMT+01:00 Paul Graydon : > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm > completely failing to find it in my emails. You might run https://github.com/python/performance on Ubuntu 14.04 and 16.04 on the same hardware, or at least similar hardware, to compare performance. 
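Something along these lines should work, assuming the performance module is installed on both machines (exact option names may vary between performance releases):

    # On the 14.04 machine (repeat on 16.04 with a different output name):
    $ python3 -m performance run -o ubuntu-14.04.json

    # Copy both JSON files to one machine, then:
    $ python3 -m perf compare_to ubuntu-14.04.json ubuntu-16.04.json --min-speed=5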
Victor From ncoghlan at gmail.com Mon Nov 14 09:20:18 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 15 Nov 2016 00:20:18 +1000 Subject: [Speed] Ubuntu 16.04 speed issues In-Reply-To: <20161110170120.GA13009@paulgraydon.co.uk> References: <20161110170120.GA13009@paulgraydon.co.uk> Message-ID: On 11 November 2016 at 03:01, Paul Graydon wrote: > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm > completely failing to find it in my emails. You may be thinking of the PGO-related issue that Victor found on *14*.04: https://mail.python.org/pipermail/speed/2016-November/000471.html Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From paul at paulgraydon.co.uk Mon Nov 14 15:19:24 2016 From: paul at paulgraydon.co.uk (Paul Graydon) Date: Mon, 14 Nov 2016 20:19:24 +0000 Subject: [Speed] Ubuntu 16.04 speed issues In-Reply-To: References: <20161110170120.GA13009@paulgraydon.co.uk> Message-ID: <20161114201924.GA16889@paulgraydon.co.uk> On Tue, Nov 15, 2016 at 12:20:18AM +1000, Nick Coghlan wrote: > On 11 November 2016 at 03:01, Paul Graydon wrote: > > I've a niggling feeling there was discussion about some performance drops on 16.04 not all that long ago, but I'm > > completely failing to find it in my emails. > > You may be thinking of the PGO-related issue that Victor found on > *14*.04: https://mail.python.org/pipermail/speed/2016-November/000471.html > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia I think you might be right there. Too many bugs going bouncing around at work, and on other projects, I guess I'm losing track :D Paul From victor.stinner at gmail.com Fri Nov 18 20:32:26 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 19 Nov 2016 02:32:26 +0100 Subject: [Speed] Analysis of a Python performance issue Message-ID: Hi, I'm happy because I just finished an article putting the most important things that I learnt this year on the most silly issue with Python performance: code placement. https://haypo.github.io/analysis-python-performance-issue.html I explain how to debug such issue and my attempt to fix it in CPython. I hate code placement issues :-) I hate performance slowdowns caused by random unrelated changes... Victor From sguelton at quarkslab.com Sat Nov 19 15:29:35 2016 From: sguelton at quarkslab.com (serge guelton) Date: Sat, 19 Nov 2016 21:29:35 +0100 Subject: [Speed] Analysis of a Python performance issue In-Reply-To: References: Message-ID: <20161119202935.bkczl4nvyyl3zwgh@lakota> On Sat, Nov 19, 2016 at 02:32:26AM +0100, Victor Stinner wrote: > Hi, > > I'm happy because I just finished an article putting the most > important things that I learnt this year on the most silly issue with > Python performance: code placement. > > https://haypo.github.io/analysis-python-performance-issue.html > > I explain how to debug such issue and my attempt to fix it in CPython. > > I hate code placement issues :-) I hate performance slowdowns caused > by random unrelated changes... > > Victor Thanks *a lot* victor for this great article. You not only very accurately describe the method you used to track the performance bug, but also give very convincing results. I still wonder what the conclusion should be: - (this) Micro benchmarks are not relevant at all, they are sensible to minor factors that are not relevant to bigger applications - There is a generally good code layout that favors most applications? 
Maybe some core function from the interpreter ? Why does PGO fails to ``find'' them? Serge From victor.stinner at gmail.com Sat Nov 19 18:54:41 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Sun, 20 Nov 2016 00:54:41 +0100 Subject: [Speed] Analysis of a Python performance issue In-Reply-To: <20161119202935.bkczl4nvyyl3zwgh@lakota> References: <20161119202935.bkczl4nvyyl3zwgh@lakota> Message-ID: Le 19 nov. 2016 21:29, "serge guelton" a ?crit : > Thanks *a lot* victor for this great article. You not only very > accurately describe the method you used to track the performance bug, > but also give very convincing results. You're welcome. I'm not 100% sure that adding the hot attrbute makes the performance of call_method reliable at 100%. My hope is that the 70% slowdown doesn't reoccur. > I still wonder what the conclusion should be: > > - (this) Micro benchmarks are not relevant at all, they are sensible to minor > factors that are not relevant to bigger applications Other benchmarks had peaks: logging_silent and json_loads. I'm unable to say if microbenchmarks must be used or not to cehck for performance regression or test the performance of a patch. So I try instead to analyze and fix performance issues. At least I can say that temporary peaks are higher and more frequent on microbenchmark. Homework: define what is a microbenchmark :-) > - There is a generally good code layout that favors most applications? This is an hard question. I don't know the answer. The hot attributes put tagged functions in a separated ELF section, but I understand that inside the section, order is not deterministic. Maybe the size of a function code matters too. What happens if a function grows? Does it impact other functions? > Maybe some core function from the interpreter ? I chose to only tag the most famous functions of the core right now. I'm testing tagging functions of extensions like json but I'm not sure that the result is significant. > Why does PGO fails to > ``find'' them? I don't use PGO on speed-python. I'm not sure that is PGO is reliable neither (reproductible performance). Victor -------------- next part -------------- An HTML attachment was scrubbed... URL: From kmod at dropbox.com Sat Nov 19 20:58:19 2016 From: kmod at dropbox.com (Kevin Modzelewski) Date: Sat, 19 Nov 2016 17:58:19 -0800 Subject: [Speed] Analysis of a Python performance issue In-Reply-To: <20161119202935.bkczl4nvyyl3zwgh@lakota> References: <20161119202935.bkczl4nvyyl3zwgh@lakota> Message-ID: I think it's safe to not reinvent the wheel here. Some searching gives: http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort Pyston takes a different approach where we pull the list of hot functions from the PGO build, ie defer all the hard work to the C compiler. On Sat, Nov 19, 2016 at 12:29 PM, serge guelton wrote: > On Sat, Nov 19, 2016 at 02:32:26AM +0100, Victor Stinner wrote: > > Hi, > > > > I'm happy because I just finished an article putting the most > > important things that I learnt this year on the most silly issue with > > Python performance: code placement. > > > > https://haypo.github.io/analysis-python-performance-issue.html > > > > I explain how to debug such issue and my attempt to fix it in CPython. > > > > I hate code placement issues :-) I hate performance slowdowns caused > > by random unrelated changes... 
> > by random unrelated changes...
> >
> > Victor
>
> Thanks *a lot* Victor for this great article. You not only very
> accurately describe the method you used to track the performance bug,
> but also give very convincing results.
>
> I still wonder what the conclusion should be:
>
> - (this) Microbenchmarks are not relevant at all; they are sensitive to
>   minor factors that are not relevant to bigger applications
>
> - There is a generally good code layout that favors most applications?
>   Maybe some core function from the interpreter? Why does PGO fail to
>   ``find'' them?
>
> Serge
>
> _______________________________________________
> Speed mailing list
> Speed at python.org
> https://mail.python.org/mailman/listinfo/speed
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kmod at dropbox.com  Mon Nov 21 18:26:19 2016
From: kmod at dropbox.com (Kevin Modzelewski)
Date: Mon, 21 Nov 2016 15:26:19 -0800
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: <20161121193908.thax2o3faxa5pxfx@lakota>
References: <20161119202935.bkczl4nvyyl3zwgh@lakota> <20161121193908.thax2o3faxa5pxfx@lakota>
Message-ID: 

Oh sorry, I was unclear: yes, this is for the Pyston binary itself, and
yes, PGO does a better job and I definitely think it should be used.

Separately, we often use non-PGO builds for quick checks, so we also have
the system I described that makes our non-PGO build more reliable by using
the function ordering from the PGO build.

On Mon, Nov 21, 2016 at 11:39 AM, serge guelton <
serge.guelton at telecom-bretagne.eu> wrote:

> On Sat, Nov 19, 2016 at 05:58:19PM -0800, Kevin Modzelewski wrote:
> > I think it's safe to not reinvent the wheel here. Some searching gives:
> > http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf
> > http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf
> > https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort
>
> Thanks Kevin for the pointers! I'm new to this area of optimization...
> another source of fun and weirdness :-$
>
> > Pyston takes a different approach where we pull the list of hot functions
> > from the PGO build, i.e. defer all the hard work to the C compiler.
>
> You're talking about the build of Pyston itself, not the JIT-generated
> code, right? In that case, how is it different from a regular
> -fprofile-generate followed by several runs and then -fprofile-use?
>
> PGO builds should perform better than marking some functions as hot, as
> they also include info for better branch prediction, right?
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kmod at dropbox.com  Sat Nov 26 17:16:54 2016
From: kmod at dropbox.com (Kevin Modzelewski)
Date: Sat, 26 Nov 2016 14:16:54 -0800
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: <20161124110735.q2t5axix2llyfsd5@lakota>
References: <20161119202935.bkczl4nvyyl3zwgh@lakota> <20161121193908.thax2o3faxa5pxfx@lakota> <20161124110735.q2t5axix2llyfsd5@lakota>
Message-ID: 

On Thu, Nov 24, 2016 at 3:07 AM, serge guelton wrote:

> On Mon, Nov 21, 2016 at 03:26:19PM -0800, Kevin Modzelewski wrote:
> > Oh sorry, I was unclear: yes, this is for the Pyston binary itself, and
> > yes, PGO does a better job and I definitely think it should be used.
>
> That raises a second question: do you collect branch / hotness info
> while running lower-tier jitted code, so as to improve the performance
> of higher tiers?
>
We don't (yet) do code placement optimizations.
We should be getting some basic amount of this, though, by our generated
code being grouped by "tier that compiled it", which is highly correlated
with hotness.

> > Separately, we often use non-PGO builds for quick checks, so we also have
> > the system I described that makes our non-PGO build more reliable by using
> > the function ordering from the PGO build.
>
> OK. Are you just "putting hot stuff in the hot section", or did you try
> to specify an ordering to further improve locality? (I don't know if it's
> possible; it's mentioned in one of the papers.)
>
We pull the function order from the PGO build and ask the non-PGO build to
use the same order, so it's up to whatever the C compiler did. Though to
keep things tractable we only do this for functions that have some
non-negligible hotness. I think this does help with the overall performance
of the non-PGO build, but our main goal was performance consistency.

> Thanks,
>
> Serge
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From serge.guelton at telecom-bretagne.eu  Mon Nov 21 14:39:08 2016
From: serge.guelton at telecom-bretagne.eu (serge guelton)
Date: Mon, 21 Nov 2016 20:39:08 +0100
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: 
References: <20161119202935.bkczl4nvyyl3zwgh@lakota>
Message-ID: <20161121193908.thax2o3faxa5pxfx@lakota>

On Sat, Nov 19, 2016 at 05:58:19PM -0800, Kevin Modzelewski wrote:
> I think it's safe to not reinvent the wheel here. Some searching gives:
> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf
> http://www.cs.utexas.edu/users/mckinley/papers/dcm-vee-2006.pdf
> https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort

Thanks Kevin for the pointers! I'm new to this area of optimization...
another source of fun and weirdness :-$

> Pyston takes a different approach where we pull the list of hot functions
> from the PGO build, i.e. defer all the hard work to the C compiler.

You're talking about the build of Pyston itself, not the JIT-generated
code, right? In that case, how is it different from a regular
-fprofile-generate followed by several runs and then -fprofile-use?

PGO builds should perform better than marking some functions as hot, as
they also include info for better branch prediction, right?

From sguelton at quarkslab.com  Thu Nov 24 06:07:35 2016
From: sguelton at quarkslab.com (serge guelton)
Date: Thu, 24 Nov 2016 12:07:35 +0100
Subject: [Speed] Analysis of a Python performance issue
In-Reply-To: 
References: <20161119202935.bkczl4nvyyl3zwgh@lakota> <20161121193908.thax2o3faxa5pxfx@lakota>
Message-ID: <20161124110735.q2t5axix2llyfsd5@lakota>

On Mon, Nov 21, 2016 at 03:26:19PM -0800, Kevin Modzelewski wrote:
> Oh sorry, I was unclear: yes, this is for the Pyston binary itself, and
> yes, PGO does a better job and I definitely think it should be used.

That raises a second question: do you collect branch / hotness info
while running lower-tier jitted code, so as to improve the performance
of higher tiers?

> Separately, we often use non-PGO builds for quick checks, so we also have
> the system I described that makes our non-PGO build more reliable by using
> the function ordering from the PGO build.

OK. Are you just "putting hot stuff in the hot section", or did you try
to specify an ordering to further improve locality? (I don't know if it's
possible; it's mentioned in one of the papers.)

Thanks,

Serge
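
For readers who want to see what the hot attribute discussed in this thread
looks like in practice, here is a minimal C sketch. It assumes GCC or Clang;
the function is a made-up example, and the exact section name (typically
.text.hot) depends on the compiler and its flags.

    /* Minimal sketch of the GCC/Clang "hot" attribute: the compiler
       optimizes the function more aggressively and emits it into a
       dedicated hot-text subsection (typically .text.hot), so that
       frequently called functions end up close together in memory. */
    #include <stdio.h>

    __attribute__((hot))
    static long sum_upto(long n)   /* hypothetical example function */
    {
        long total = 0;
        for (long i = 0; i < n; i++)
            total += i;
        return total;
    }

    int main(void)
    {
        printf("%ld\n", sum_upto(1000000));
        return 0;
    }

Whether the function actually landed in a hot section can be checked by
inspecting the resulting binary with a tool such as readelf -S or nm.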
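
For context, the -fprofile-generate / -fprofile-use cycle serge refers to
is, in outline, the sequence below. The file names and the training workload
are placeholders for illustration, not the actual Pyston or CPython build
steps.

    # Instrumented build: the binary writes profile data when it runs.
    gcc -O2 -fprofile-generate app.c -o app
    # Run a representative workload to collect the profile.
    ./app < training-input.txt
    # Rebuild using the collected profile (branch probabilities,
    # hot/cold function information, etc.).
    gcc -O2 -fprofile-use app.c -o app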
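
Finally, a rough sketch of the ordering trick Kevin describes (reusing the
function layout of a PGO binary in a non-PGO build). This is only an
illustration of the idea under assumed tooling: the nm/awk pipeline and the
lld --symbol-ordering-file option are a plausible way to do it, not a
description of what Pyston actually uses.

    # Record the layout of the PGO binary: text symbols in address order.
    nm --numeric-sort ./app-pgo | awk '$2 ~ /[tT]/ { print $3 }' > order.txt

    # Rebuild the non-PGO binary with one section per function and ask the
    # linker to lay the functions out in the recorded order (lld syntax).
    clang -O2 -ffunction-sections app.c -o app-plain \
          -fuse-ld=lld -Wl,--symbol-ordering-file=order.txt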