[pypy-commit] extradoc extradoc: some final improvements from my side

Thu Jul 25 03:26:21 EDT 2019

Author: Remi Meier <remi.meier at gmail.com>
Branch: extradoc
Changeset: r5955:e71b2e056b2f
Date: 2019-07-25 09:25 +0200
http://bitbucket.org/pypy/extradoc/changeset/e71b2e056b2f/

Log:	some final improvements from my side

diff --git a/blog/draft/2019-07-arm64.rst b/blog/draft/2019-07-arm64.rst
--- a/blog/draft/2019-07-arm64.rst
+++ b/blog/draft/2019-07-arm64.rst
@@ -2,59 +2,64 @@
 
 We are pleased to announce the availability of the new PyPy for AArch64. This
 port brings PyPy's high-performance just-in-time compiler to the AArch64
-platform, also known as 64-bit ARM. This work was funded by ARM Holdings Ltd.
-and Crossbar.io.
+platform, also known as 64-bit ARM. With the addition of AArch64, PyPy now
+supports a total of 6 architectures: x86 (32 & 64bit), ARM (32 & 64bit), PPC64,
+and s390x. The AArch64 work was funded by ARM Holdings Ltd. and Crossbar.io.
 
-To show how well the new PyPy port performs, we compare the performance of PyPy
-against CPython on a set of benchmarks. As a point of comparison, we include the
-results of PyPy on x86_64. Note, however, that the results presented here were
-measured on a Graviton A1 machine from AWS, which comes with a very serious
-word of warning: Graviton A1's are
-virtual machines and as such, are not suitable for benchmarking. If someone
-has access to a beefy enough (16G) ARM64 server and is willing to give
-us access to it, we are happy to redo the benchmarks on a real machine.
-Our main concern here is that while a virtual CPU is 1-to-1 with a real CPU, it's
-not clear to us how caches are shared, and how they cross CPU boundaries.
+PyPy has a good record of boosting the performance of Python programs on the
+existing platforms. To show how well the new PyPy port performs, we compare the
+performance of PyPy against CPython on a set of benchmarks. As a point of
+comparison, we include the results of PyPy on x86_64.
 
-The following graph shows the speedup of PyPy (hg id 2417f925ce94) compared to
-CPython (2.7.15) on AArch64, as well as the speedups on a x86_64 Linux laptop,
+Note, however, that the results presented here were measured on a Graviton A1
+machine from AWS, which comes with a very serious word of warning: Graviton A1's
+are virtual machines, and, as such, they are not suitable for benchmarking. If
+someone has access to a beefy enough (16G) ARM64 server and is willing to give
+us access to it, we are happy to redo the benchmarks on a real machine. One
+major concern is that while a virtual CPU is 1-to-1 with a real CPU, it is not
+clear to us how CPU caches are shared across virtual CPUs. Also, note that by no
+means is this benchmark suite representative enough to average the results. Read
+the numbers individually per benchmark.
+
+The following graph shows the speedups on AArch64 of PyPy (hg id 2417f925ce94) compared to
+CPython (2.7.15), as well as the speedups on a x86_64 Linux laptop
 comparing the most recent release, PyPy 7.1.1, to CPython 2.7.16.
 
 .. image:: 2019-07-arm64-speedups.png
 
 In the majority of benchmarks, the speedups achieved on AArch64 match those
 achieved on the x86_64 laptop. Over CPython, PyPy on AArch64 achieves speedups
-between 0.6x to 44.9x. These speedups are comparable to x86_64, where they are
-between 0.6x and 58.9x.
+between 0.6x to 44.9x. These speedups are comparable to x86_64, where the
+numbers are between 0.6x and 58.9x.
 
 The next graph compares between the speedups achieved on AArch64 to the speedups
-achieved on x86_64, i.e., how much the speedup is on AArch64 vs. the same
-benchmark on x86_64. Note that by no means is this benchmark suite
-representative enough to average the results. Read the numbers individually per
-benchmark.
+achieved on x86_64, i.e., how great the speedup is on AArch64 vs. the same
+benchmark on x86_64. This comparison should give a rough idea about the
+quality of the generated code for the new platform.
 
 .. image:: 2019-07-arm64-relative.png
 
-Note also that we see a wide variance. There are generally three groups of
+Note that we see a large variance: There are generally three groups of
 benchmarks - those that run at more or less the same speed, those that
-run at 2x the speed and those that run at 0.5x the speed of x86_64.
+run at 2x the speed, and those that run at 0.5x the speed of x86_64.
 
 The variance and disparity are likely related to a variety of issues, mostly due
-to differences in architecture. What *is* however interesting is that compared
+to differences in architecture. What *is* however interesting is that, compared
 to measurements performed on older ARM boards, the branch predictor on the
 Graviton A1 machine appears to have improved. As a result, the speedups achieved
-by PyPy over CPython are smaller: "sophisticated" branchy code, like CPython
-itself, simply runs a lot faster.
+by PyPy over CPython are smaller than on older ARM boards: sufficiently branchy
+code, like CPython itself, simply runs a lot faster. As a result, the advantage
+of the non-branchy code generated by PyPy's just-in-time compiler is smaller.
 
-One takeaway here is that there is a lot of improvement left to be done
-in PyPy. This is true for both of the above platforms, but probably more
-so for AArch64, which comes with a large number of registers. The PyPy
-backend was written with x86 (the 32-bit variant) in mind, which is very
-register poor. We think we can improve somewhat in the area of emitting
-more modern machine code, which should have more impact for AArch64
-than for x86_64. There are also still a few missing features in the AArch64
-backend, which are implemented as calls instead of inlined instructions;
-something we hope to improve.
+One takeaway here is that many possible improvements for PyPy yet to be
+implemented. This is true for both of the above platforms, but probably more so
+for AArch64, which comes with a large number of CPU registers. The PyPy backend
+was written with x86 (the 32-bit variant) in mind, which has a really low number
+of registers. We think that we can improve in the area of emitting more modern
+machine code, which may have a higher impact on AArch64 than on x86_64. There is
+also a number of missing features in the AArch64 backend. These features are
+currently implemented as expensive function calls instead of inlined native
+instructions, something we intend to improve.
 
 Best,
 Maciej Fijalkowski, Armin Rigo and the PyPy team