[pypy-dev] Differences performance Julia / PyPy on very similar codes

David Edelsohn dje.gcc at gmail.com
Mon Dec 21 17:47:22 EST 2020


You did not say exactly what system you are running the experiment
on, but "a factor of 4" seems very close to the auto-vectorization
speedup for a vector of floats.

> I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor of 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.

If the performance gap is caused by auto-vectorization, I would
recommend that you consider NumPy with Numba's LLVM-based JIT.  Or,
for a "pure" Python solution, you can experiment with an older release
of PyPy and NumPyPy.
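
For illustration, here is a minimal sketch of the NumPy + Numba route
(the structure-of-arrays layout and all names are only illustrative,
not taken from the nbabel repository):

import numpy as np
from numba import njit

# Structure-of-arrays layout: one contiguous float64 array per
# coordinate, which is what usually lets LLVM auto-vectorize the loop.
@njit(fastmath=True)
def norm_square(x, y, z, out):
    for i in range(x.shape[0]):
        out[i] = x[i] ** 2 + y[i] ** 2 + z[i] ** 2

n = 1024
x = np.random.rand(n)
y = np.random.rand(n)
z = np.random.rand(n)
out = np.empty(n)
norm_square(x, y, z, out)

The point is less the exact code than the layout: with three flat
arrays, Numba's LLVM backend can typically emit packed SIMD
instructions for the loop.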

If the problem is the abstraction penalty, then the suggestion from
Anto should help.

But, for the question of why, you can examine the code for the inner
loop generated by Julia and the code for the inner loop generated by
PyPy and analyze the reason for the performance gap.  It should be
evident whether the difference is abstraction or SIMD.
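
Concretely, PyPy can dump its optimized traces and the code emitted by
the backend through the PYPYLOG environment variable, and on the Julia
side @code_llvm / @code_native show what LLVM emitted for a function.
A rough sketch of the PyPy side (from memory, so check the PyPy docs
for the exact log categories):

# Run the benchmark under PyPy with JIT logging enabled:
#
#   PYPYLOG=jit-log-opt,jit-backend:pypy-jit.log pypy3 microbench_pypy4.py
#
# "jit-log-opt" records the optimized traces of the hot loops and
# "jit-backend" the code actually emitted by the backend.  If the inner
# loop is made of scalar float operations rather than packed (SIMD)
# ones, the gap with Julia is vectorization rather than abstraction.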

Thanks, David


On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER
<pierre.augier at univ-grenoble-alpes.fr> wrote:
>
>
> ----- Original Message -----
> > From: "David Edelsohn" <dje.gcc at gmail.com>
> > To: "PIERRE AUGIER" <pierre.augier at univ-grenoble-alpes.fr>
> > Cc: "pypy-dev" <pypy-dev at python.org>
> > Sent: Friday, 18 December 2020 21:00:42
> > Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
>
> > Does Julia, based on LLVM, auto-vectorize the code?  I assume yes
> > because you specifically mention the SIMD design of the data structure.
>
> Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?
>
> > Have you tried NumPyPy?  Development on NumPyPy has not continued, but
> > it probably would give a better comparison with Julia of what PyPy with
> > auto-vectorization could accomplish.
>
> I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
>
> Anyway, for this experiment, my aim was to stay in pure Python and to compare with what is done in pure Julia.
>
> I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor of 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
>
> Moreover, I would really be interested to know if an extension compatible with PyPy (or better, not only compatible with PyPy) could be written to make such code faster (code involving an array of instances of a very simple class). Could we gain anything compared to using a Python list?
>
> Are there some tools to understand what is done by PyPy to speed up some code? Or to learn more about the data structures used under the hood by PyPy?
>
> For example,
>
> class Point3D:
>     def __init__(self, x, y, z):
>         self.x = x
>         self.y = y
>         self.z = z
>
>     def norm_square(self):
>         return self.x**2 + self.y**2 + self.z**2
>
> I guess it would be good for efficiency to store the 3 floats as native floats aligned in memory and to vectorize the power computation. How can one know what is done by PyPy for a particular code?
>
> Pierre
>
> >
> > Thanks, David
> >
> > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER
> > <pierre.augier at univ-grenoble-alpes.fr> wrote:
> >>
> >> Hi,
> >>
> >> I'm posting on this list a message written in the PyPy issue tracker
> >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about
> >> some experiments I did on writing efficient implementations of the N-body
> >> problem (https://github.com/paugier/nbabel), potentially to answer this article
> >> https://arxiv.org/pdf/2009.11295.pdf.
> >>
> >> I got from a PR an [interesting optimized implementation in
> >> Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl).
> >> It is very fast (even slightly faster than the Pythran version). One idea is to
> >> store the 3 floats of a 3D physical vector, (x, y, z), in a struct `Point4D`
> >> containing 4 floats to better use SIMD instructions.
> >>
> >> I added a pure Python implementation inspired by this new Julia implementation
> >> (but with a simple `Point3D` with 3 floats, because with PyPy the `Point4D`
> >> does not make the code faster) and, good news, with PyPy it is a bit faster than
> >> our previous PyPy implementations (only 3 times slower than the old C++
> >> implementation).
> >>
> >> However, it is much slower than with Julia (while the code is very similar). I
> >> coded a simplified version in Julia with nearly nothing beyond what can be
> >> written in pure Python (in particular, no `@inbounds` and `@simd` macros). It
> >> seems to me that the comparison of these 2 versions could be interesting. So I
> >> again simplified these 2 versions to keep only what is important for
> >> performance, which gives
> >>
> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl
> >>
> >> The results are summarized in
> >> https://github.com/paugier/nbabel/blob/master/py/microbench.md
> >>
> >> An important point is that with `Point3D` (a mutable class in Python and an
> >> immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code and
> >> nothing really fancy in Julia, so I guess that PyPy might be missing some
> >> optimization opportunities. At least it would be interesting to understand what
> >> is slower in PyPy (and why). I have to admit that I don't know how to get
> >> interesting information on timing and on what is happening with the PyPy JIT in
> >> a particular case. I only used cProfile and it is of course clearly not enough.
> >> I can run vmprof but I'm not able to visualize the data because the website
> >> http://vmprof.com/ is down. I don't know whether I can trust values given by
> >> IPython `%timeit` for particular instructions, since I don't know whether the
> >> PyPy JIT does the same thing in `%timeit` and in the function
> >> `compute_accelerations`.
> >>
> >> I also feel that what I really miss in pure Python is an efficient fixed-size
> >> homogeneous mutable sequence (a "Vector" in Julia terms) that can contain basic
> >> numerical types (as Python's `array.array` does) but also instances of
> >> user-defined classes and instances of Vectors. The Python code uses a [pure
> >> Python implementation using a
> >> list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it
> >> would be reasonable to have a good implementation highly compatible with PyPy
> >> (and potentially other Python implementations) in a package on PyPI. It would
> >> really help for writing PyPy-compatible numerical codes. What would be the
> >> right tool to implement such a package? HPy? I wonder whether we can get some
> >> speedup compared to the pure Python version with lists. For very simple classes
> >> like `Point3D` and `Point4D`, I wonder if the data could be stored contiguously
> >> in memory and if some operations could be done without boxing/unboxing.
> >>
> >> However, I really don't know what is slower in PyPy / faster in Julia.
> >>
> >> I would be very interested to get the points of view of people who know PyPy
> >> well.
> >>
> >> Pierre
> >> _______________________________________________
> >> pypy-dev mailing list
> >> pypy-dev at python.org
> >> https://mail.python.org/mailman/listinfo/pypy-dev
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

