From johnfouf at gmail.com  Fri Dec 11 06:55:17 2020
From: johnfouf at gmail.com (Ioannis Foufoulas)
Date: Fri, 11 Dec 2020 13:55:17 +0200
Subject: [pypy-dev] Pickling generators
Message-ID:

Hi,

While in PyPy it was possible to pickle a generator and resume it after unpickling, this does not happen with PyPy3:

  File "/snap/pypy3/72/lib-python/3/pickle.py", line 942, in save_global
    (obj, module_name, name))
pickle.PicklingError: Can't pickle <generator object ...>: it's not found as builtins.generator

Is this a bug, or has this feature been removed in PyPy3?

Thanks,
Yannis

From matti.picus at gmail.com  Fri Dec 11 08:21:18 2020
From: matti.picus at gmail.com (Matti Picus)
Date: Fri, 11 Dec 2020 15:21:18 +0200
Subject: [pypy-dev] Pickling generators
In-Reply-To: References:
Message-ID: <481969e4-743d-f080-0e8b-1d42f874770d@gmail.com>

According to issue 3150 https://foss.heptapod.net/pypy/pypy/-/issues/3150, this is on purpose, and brings us into feature compatibility with CPython:

Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> def f():
...     yield 10
...
>>> gen = f()
>>> pickle.dumps(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'generator' object

I guess we could improve the error message to be more helpful. Do you have a concrete use case for this?

Matti.

On 12/11/20 1:55 PM, Ioannis Foufoulas wrote:
> Hi,
> While in PyPy it was possible to pickle a generator and resume it after unpickling, this does not happen with PyPy3:
>
>   File "/snap/pypy3/72/lib-python/3/pickle.py", line 942, in save_global
>     (obj, module_name, name))
> pickle.PicklingError: Can't pickle <generator object ...>: it's not found as builtins.generator
>
> Is this a bug, or has this feature been removed in PyPy3?
> > Thanks, > Yannis > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From steve at pearwood.info Fri Dec 11 15:49:48 2020 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 12 Dec 2020 07:49:48 +1100 Subject: [pypy-dev] Pickling generators In-Reply-To: <481969e4-743d-f080-0e8b-1d42f874770d@gmail.com> References: <481969e4-743d-f080-0e8b-1d42f874770d@gmail.com> Message-ID: <20201211204948.GC4576@ando.pearwood.info> On Fri, Dec 11, 2020 at 03:21:18PM +0200, Matti Picus wrote: > According to issue 3150 > https://foss.heptapod.net/pypy/pypy/-/issues/3150, this is on purpose, > and brings us into feature compatibility with CPython: The inability to pickle an object is not a feature, it is the lack of a feature. I think this is a case where PyPy initially did the right thing and then threw it away. I'm not satisfied by the arguments given here: http://peadrop.com/blog/2009/12/29/why-you-cannot-pickle-generators/ https://bugs.python.org/issue1092962 for CPython to not pickle generators, but pickling generators is not forbidden by the language, it is a quality of implementation issue. In this instance, both Stackless and PyPy 2 had better implementations than CPython. > I guess we could improve the error message to be more helpful. Do you > have a concrete use case for this? You have a generator which you are iterating through, and you want to stop your program and resume later. Or resume immediately but from another process. 
-- Steve From cfbolz at gmx.de Fri Dec 11 16:16:55 2020 From: cfbolz at gmx.de (Carl Friedrich Bolz-Tereick) Date: Fri, 11 Dec 2020 22:16:55 +0100 Subject: [pypy-dev] Pickling generators In-Reply-To: <20201211204948.GC4576@ando.pearwood.info> References: <481969e4-743d-f080-0e8b-1d42f874770d@gmail.com> <20201211204948.GC4576@ando.pearwood.info> Message-ID: <6d389ecc-a136-cba5-40f7-676d06eb0a58@gmx.de> Hi Steven, hi all, I don't think it's a huge intentional thing that we don't support this feature in PyPy3. It's mostly that nobody was motivated enough to implement it (as Armin said in the linked issue). So if somebody wants to tackle it, we'd be happy to merge it! Cheers, CF On 11.12.20 21:49, Steven D'Aprano wrote: > On Fri, Dec 11, 2020 at 03:21:18PM +0200, Matti Picus wrote: >> According to issue 3150 >> https://foss.heptapod.net/pypy/pypy/-/issues/3150, this is on purpose, >> and brings us into feature compatibility with CPython: > > The inability to pickle an object is not a feature, it is the lack of a > feature. I think this is a case where PyPy initially did the right thing > and then threw it away. > > I'm not satisfied by the arguments given here: > > http://peadrop.com/blog/2009/12/29/why-you-cannot-pickle-generators/ > > https://bugs.python.org/issue1092962 > > for CPython to not pickle generators, but pickling generators is not > forbidden by the language, it is a quality of implementation issue. In > this instance, both Stackless and PyPy 2 had better implementations than > CPython. > > >> I guess we could improve the error message to be more helpful. Do you >> have a concrete use case for this? > > You have a generator which you are iterating through, and you want to > stop your program and resume later. Or resume immediately but from > another process. 
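[In the meantime, the stop-and-resume use case discussed above can be approximated in portable pure Python by writing the iterator as an explicit class whose state lives in ordinary attributes; unlike generators, such objects pickle on CPython and PyPy alike. A minimal sketch — the `Countdown` class is purely illustrative, not anything from PyPy:]

```python
import pickle

class Countdown:
    """Generator-like iterator whose state is plain attributes,
    so it can be pickled on any Python implementation."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

it = Countdown(5)
first_two = [next(it), next(it)]   # consume part of the iteration
blob = pickle.dumps(it)            # checkpoint: works, unlike a generator
resumed = pickle.loads(blob)       # e.g. in another process
rest = list(resumed)               # continues where the checkpoint left off

print(first_two)  # [5, 4]
print(rest)       # [3, 2, 1]
```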
>
>
From muke101 at protonmail.com  Thu Dec 17 13:13:12 2020
From: muke101 at protonmail.com (muke101)
Date: Thu, 17 Dec 2020 18:13:12 +0000
Subject: [pypy-dev] Contributing Polyhedral Optimisations in PyPy
Message-ID: <1CdAb8DZ8jmSE-75kiASlGgHA96yRAGV1M9SJEYHYGkJlnOEircR3ybjk_Vk_Z05l00ouHJGqX1MJwnAb5nyCfmtdDIn9gHI5ckj8rMXnSs=@protonmail.com>

I'm doing a computer science master's and am looking for an appropriate dissertation project related to polyhedral optimisations. Talking to my professor, we both think that implementing the model and its loop transformations in PyPy's JIT optimiser could be a good project to pursue. Before committing to anything, though, I wanted to run the idea by the devs here, who might be able to point out hurdles I'd quickly come across that could prove difficult to solve at just a master's level, or tell me whether these optimisations are actually already implemented (I have tried to google whether this is the case and found nothing, but can't be sure).

I think this could have some good real-world impact too: a lot of scientific code is written in Python and run on PyPy, and the polyhedral model can offer substantial performance improvements, in the form of auto-parallelization, for these types of codes. That is why I'm interested in working on this for PyPy rather than CPython, although if anyone has a good reason that I might want to look at CPython for this over PyPy, please let me know.

Appreciate any and all advice, thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From william.leslie.ttg at gmail.com Thu Dec 17 17:48:42 2020 From: william.leslie.ttg at gmail.com (William ML Leslie) Date: Fri, 18 Dec 2020 09:48:42 +1100 Subject: [pypy-dev] Contributing Polyhedral Optimisations in PyPy In-Reply-To: <1CdAb8DZ8jmSE-75kiASlGgHA96yRAGV1M9SJEYHYGkJlnOEircR3ybjk_Vk_Z05l00ouHJGqX1MJwnAb5nyCfmtdDIn9gHI5ckj8rMXnSs=@protonmail.com> References: <1CdAb8DZ8jmSE-75kiASlGgHA96yRAGV1M9SJEYHYGkJlnOEircR3ybjk_Vk_Z05l00ouHJGqX1MJwnAb5nyCfmtdDIn9gHI5ckj8rMXnSs=@protonmail.com> Message-ID: On Fri, 18 Dec 2020 at 05:14, muke101 via pypy-dev wrote: > > I'm doing a computer science masters and am looking for an appropriate project to take on for a dissertation related to Polyhedral optimisations. Talking to my professor, we both think trying to implement the model and it's loop transformations in PyPy's JIT optimiser could be a good project to pursue, but before committing to anything I wanted to run this idea by the devs here who might be able to point out any hurdles I'd be likely to quickly come across that could prove difficult to solve at just a masters level, or whether or not these optimisations are actually already implemented in the first place (I have tried to google if this is the case and hadn't found anything, but can't be sure). I think this could have some good real world impact too as a lot of scientific code is written in Python and run on PyPy, and the Polyhedral model can offer substantial performance improvements in the form of auto-parallelization for these types of codes, which is why I'm interested in working on this for PyPy rather than CPython, although if anyone has good reason that I might want to look at CPython for this over PyPy please let me know. > > Appreciate any and all advice, thanks. Hi! That's a great topic. The challenge with implementing this in the pypy JIT at this point is that the JIT only sees one control flow path. That is, one loop, and the branches taken within that loop. 
It does not find out about the outer loop usually until later, and may not ever find out about the content of other control flow paths if they aren't taken. This narrows the amount of information available about effects and possible aliases quite a bit, making semantic-preserving cross-loop transformations difficult in many cases. On the other hand, since you can deal with precise types in the JIT, it's possible to narrow down the domain of discourse, which might make it possible to rule out problematic side-effects. Nevertheless, please dig and experiment. You might find that a combination of custom annotations and JIT work get you what you need. -- William Leslie Q: What is your boss's password? A: "Authentication", clearly Notice: Likely much of this email is, by the nature of copyright, covered under copyright law. You absolutely MAY reproduce any part of it in accordance with the copyright law of the nation you are reading this in. Any attempt to DENY YOU THOSE RIGHTS would be illegal without prior contractual agreement. From armin.rigo at gmail.com Fri Dec 18 13:03:49 2020 From: armin.rigo at gmail.com (Armin Rigo) Date: Fri, 18 Dec 2020 19:03:49 +0100 Subject: [pypy-dev] Contributing Polyhedral Optimisations in PyPy In-Reply-To: References: <1CdAb8DZ8jmSE-75kiASlGgHA96yRAGV1M9SJEYHYGkJlnOEircR3ybjk_Vk_Z05l00ouHJGqX1MJwnAb5nyCfmtdDIn9gHI5ckj8rMXnSs=@protonmail.com> Message-ID: Hi, On Thu, 17 Dec 2020 at 23:48, William ML Leslie wrote: > The challenge with implementing this in the pypy JIT at this point is > that the JIT only sees one control flow path. That is, one loop, and > the branches taken within that loop. It does not find out about the > outer loop usually until later, and may not ever find out about the > content of other control flow paths if they aren't taken. Note that strictly speaking, the problem is not that you haven't seen yet other code paths. 
It's Python, so you never know what may happen in the future---maybe another code path will be taken, or maybe someone will do crazy things with `sys._getframe()` or with the debugger `pdb`. So merely seeing all paths in a function doesn't really buy you a lot. No, the problem is that emitting machine code is incremental at the granularity of code paths. At the point where we see a new code path, all previously-seen code paths have already been completely optimized and turned into machine code, and we don't keep much information about them.

To go beyond this simple model, what we have so far is that we can "invalidate" previous code paths at any point, when we figure out that they were compiled using assumptions that no longer hold. Using that, it would be possible in theory to do any amount of global optimization: save enough additional information as you see each code path; use it later in the optimization of additional code paths; invalidate some of the old code paths if you figure out that their optimizations are no longer valid (but invalidate only, don't write a new version yet); and when you later see the old code path being generated again, optimize it differently. It's all doable, but theoretical so far: I don't know of any JIT compiler that seriously does things like that. It's certainly worth a research paper IMHO. It also looks like quite some work. It's certainly not just "take some ideas from [ahead-of-time or full-method] compiler X and apply them to PyPy".

À bientôt,

Armin.

From muke101 at protonmail.com  Fri Dec 18 13:15:10 2020
From: muke101 at protonmail.com (muke101)
Date: Fri, 18 Dec 2020 18:15:10 +0000
Subject: [pypy-dev] Contributing Polyhedral Optimisations in PyPy
In-Reply-To: References: <1CdAb8DZ8jmSE-75kiASlGgHA96yRAGV1M9SJEYHYGkJlnOEircR3ybjk_Vk_Z05l00ouHJGqX1MJwnAb5nyCfmtdDIn9gHI5ckj8rMXnSs=@protonmail.com>
Message-ID:

Thanks, both of you, for getting back to me; these definitely seem like problems worth thinking about first.
Looking into it, there has actually been some research already on implementing polyhedral optimisations in a JIT optimiser, specifically in JavaScript. Its paper (http://impact.gforge.inria.fr/impact2018/papers/polyhedral-javascript.pdf) seems to point out the same problems you both bring up, like SCoP detection and aliasing, and how it worked around them.

For now, then, I'll try to work out how ambitious replicating these solutions would be and whether they would map cleanly from JS into PyPy - please let me know if any other hurdles come to mind in the meantime though.

Thanks again for the advice.

------- Original Message -------
On Friday, 18 December 2020 18:03, Armin Rigo wrote:

> Hi,
>
> On Thu, 17 Dec 2020 at 23:48, William ML Leslie
> william.leslie.ttg at gmail.com wrote:
>
> > The challenge with implementing this in the pypy JIT at this point is
> > that the JIT only sees one control flow path. That is, one loop, and
> > the branches taken within that loop. It does not find out about the
> > outer loop usually until later, and may not ever find out about the
> > content of other control flow paths if they aren't taken.
>
> Note that strictly speaking, the problem is not that you haven't seen
> yet other code paths. It's Python, so you never know what may happen
> in the future---maybe another code path will be taken, or maybe
> someone will do crazy things with `sys._getframe()` or with the
> debugger `pdb`. So merely seeing all paths in a function doesn't
> really buy you a lot. No, the problem is that emitting machine code
> is incremental at the granularity of code paths. At the point where
> we see a new code path, all previously-seen code paths have already
> been completely optimized and turned into machine code, and we don't
> keep much information about them.
> > To go beyond this simple model, what we have so far is that we can > "invalidate" previous code paths at any point, when we figure out that > they were compiled using assumptions that no longer hold. So using > it, it would be possible in theory to do any amount of global > optimizations: save enough additional information as you see each code > path; use it later in the optimization of additional code paths; > invalidate some of the old code paths if you figure out that its > optimizations are no longer valid (but invalidate only, not write a > new version yet); and when you later see the old code path being > generated again, optimize it differently. It's all doable, but > theoretical so far: I don't know of any JIT compiler that seriously > does things like that. It's certainly worth a research paper IMHO. > It also looks like quite some work. It's certainly not just "take > some ideas from [ahead-of-time or full-method] compiler X and apply > them to PyPy". > > A bient?t, > > Armin. From pierre.augier at univ-grenoble-alpes.fr Fri Dec 18 14:48:27 2020 From: pierre.augier at univ-grenoble-alpes.fr (PIERRE AUGIER) Date: Fri, 18 Dec 2020 20:48:27 +0100 (CET) Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes Message-ID: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> Hi, I post on this list a message written in PyPy issue tracker (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about some experiments I did on writing efficient implementations of the NBody problem https://github.com/paugier/nbabel to potentially answer to this article https://arxiv.org/pdf/2009.11295.pdf. I get from a PR an [interesting optimized implementation in Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl). It is very fast (even slightly faster than in Pythran). 
One idea is to store the 3 floats of a 3D physical vector, (x, y, z), in a struct `Point4D` containing 4 floats, to better use SIMD instructions.

I added a pure Python implementation inspired by this new Julia implementation (but with a simple `Point3D` with 3 floats, because with PyPy the `Point4D` does not make the code faster) and, good news, with PyPy it is a bit faster than our previous PyPy implementations (only 3 times slower than the old C++ implementation).

However, it is much slower than with Julia (while the code is very similar). I coded a simplified version in Julia with nearly nothing else than what can be written in pure Python (in particular, no `@inbounds` and `@simd` macros). It seems to me that the comparison of these 2 versions could be interesting. So I again simplified these 2 versions to keep only what is important for performance, which gives

- https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
- https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl

The results are summarized in https://github.com/paugier/nbabel/blob/master/py/microbench.md

An important point is that with `Point3D` (a mutable class in Python and an immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code, and nothing really fancy in Julia, so I guess that PyPy might be missing some optimization opportunities. At least it would be interesting to understand what is slower in PyPy (and why). I have to admit that I don't know how to get interesting information on timing and on what is happening with the PyPy JIT in a particular case. I only used cProfile and it's of course clearly not enough. I can run vmprof, but I'm not able to visualize the data because the website http://vmprof.com/ is down. I don't know if I can trust values given by IPython `%timeit` for particular instructions, since I don't know if the PyPy JIT does the same thing in `%timeit` and in the function `compute_accelerations`.
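[As a much-simplified, hypothetical stand-in for the linked microbenchmarks (not the actual code from the repository), the kind of measurement involved looks like this; under a JIT it helps to time several whole passes and keep the later ones, since the first passes include warm-up and compilation:]

```python
import time

class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm_square(self):
        return self.x ** 2 + self.y ** 2 + self.z ** 2

# Arbitrary data, just to have something to iterate over.
points = [Point3D(0.1 * i, 0.2 * i, 0.3 * i) for i in range(1000)]

def total_norm_square(points):
    total = 0.0
    for p in points:
        total += p.norm_square()
    return total

# Time several whole passes: under a JIT the first passes include
# warm-up, so only the later/minimum ones reflect steady state.
timings = []
for _ in range(5):
    t0 = time.perf_counter()
    for _ in range(100):
        result = total_norm_square(points)
    timings.append(time.perf_counter() - t0)

print(result)        # ~46596690.0
print(min(timings))  # steady-state estimate, in seconds
```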
I also feel that I really miss in pure Python an efficient fixed size homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic numerical types (as Python `array.array`) but also instances of user-defined classes and instances of Vectors. The Python code uses a [pure Python implementation using a list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it would be reasonable to have a good implementation highly compatible with PyPy (and potentially other Python implementations) in a package on PyPI. It would really help to write PyPy compatible numerical codes. What would be the good tool to implement such package? HPy? I wonder whether we can get some speedup compared to the pure Python version with lists. For very simple classes like `Point3d` and `Point4d`, I wonder if the data could be saved continuously in memory and if some operations could be done without boxing/unboxing. However, I really don't know what is slower in PyPy / faster in Julia. I would be very interested to get the points of view of people knowing well PyPy. Pierre From dje.gcc at gmail.com Fri Dec 18 15:00:42 2020 From: dje.gcc at gmail.com (David Edelsohn) Date: Fri, 18 Dec 2020 15:00:42 -0500 Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes In-Reply-To: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> Message-ID: Does Julia based on LLVM auto-vectorize the code? I assume yes because you specifically mention SIMD design of the data structure. Have you tried NumPyPy? Development on NumPyPy has not continued, but it probably would be a better comparison of what PyPy with auto-vectorization could accomplish to compare with Julia. 
Thanks, David On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER wrote: > > Hi, > > I post on this list a message written in PyPy issue tracker (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about some experiments I did on writing efficient implementations of the NBody problem https://github.com/paugier/nbabel to potentially answer to this article https://arxiv.org/pdf/2009.11295.pdf. > > I get from a PR an [interesting optimized implementation in Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl). It is very fast (even slightly faster than in Pythran). One idea is to store the 3 floats of a 3d physical vector, (x, y, z), in a struct `Point4D` containing 4 floats to better use simd instructions. > > I added a pure Python implementation inspired by this new Julia implementation (but with a simple `Point3D` with 3 floats because with PyPy, the `Point4D` does not make the code faster) and good news it is with PyPy a bit faster than our previous PyPy implementations (only 3 times slower than the old C++ implementation). > > However, it is much slower than with Julia (while the code is very similar). I coded a simplified version in Julia with nearly nothing else that what can be written in pure Python (in particular, no `@inbounds` and `@simd` macros). It seems to me that the comparison of these 2 versions could be interesting. So I again simplified these 2 versions to keep only what is important for performance, which gives > > - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py > - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl > > The results are summarized in https://github.com/paugier/nbabel/blob/master/py/microbench.md > > An important point is that with `Point3D` (a mutable class in Python and an immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code and nothing really fancy in Julia so I guess that PyPy might be missing some optimization opportunities. 
At least it would be interesting to understand what is slower in PyPy (and why). I have to admit that I don't know how to get interesting information on timing and what is happening with PyPy JIT in a particular case. I only used cProfile and it's of course clearly not enough. I can run vmprof but I'm not able to visualize the data because the website http://vmprof.com/ is down. I don't know if I can trust values given by IPython `%timeit` for particular instructions since I don't know if PyPy JIT does the same thing in `%timeit` and in the function `compute_accelerations`. > > I also feel that I really miss in pure Python an efficient fixed size homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic numerical types (as Python `array.array`) but also instances of user-defined classes and instances of Vectors. The Python code uses a [pure Python implementation using a list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I think it would be reasonable to have a good implementation highly compatible with PyPy (and potentially other Python implementations) in a package on PyPI. It would really help to write PyPy compatible numerical codes. What would be the good tool to implement such package? HPy? I wonder whether we can get some speedup compared to the pure Python version with lists. For very simple classes like `Point3d` and `Point4d`, I wonder if the data could be saved continuously in memory and if some operations could be done without boxing/unboxing. > > However, I really don't know what is slower in PyPy / faster in Julia. > > I would be very interested to get the points of view of people knowing well PyPy. 
> > Pierre > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From armin.rigo at gmail.com Fri Dec 18 17:17:23 2020 From: armin.rigo at gmail.com (Armin Rigo) Date: Fri, 18 Dec 2020 23:17:23 +0100 Subject: [pypy-dev] Contributing Polyhedral Optimisations in PyPy In-Reply-To: References: <1CdAb8DZ8jmSE-75kiASlGgHA96yRAGV1M9SJEYHYGkJlnOEircR3ybjk_Vk_Z05l00ouHJGqX1MJwnAb5nyCfmtdDIn9gHI5ckj8rMXnSs=@protonmail.com> Message-ID: Hi, On Fri, 18 Dec 2020 at 19:15, muke101 wrote: > Thanks both of you for getting back to me, these definitely seem like problems worth thinking about first. Looking into it, there has actually been some research already on implementing Polyhedral optimisations in a JIT optimiser, specifically in JavaScript. It's paper (http://impact.gforge.inria.fr/impact2018/papers/polyhedral-javascript.pdf) seems to point out the same problems you both bring up, like SCoP detection and aliasing, and how it worked around them. > > For now then I'll try and consider how ambitious replicating these solutions would be and if they would map into PyPy from JS cleanly - please let me know if any other hurdles come to mind in the meantime though. I assume that by "JavaScript" you mean JavaScript with a method-based JIT compiler. At this level, that's the main difference with PyPy, which contains RPython's tracing JIT compiler instead. The fact that they are about the JavaScript or Python language is not that important. Here's another idea about how to do more advanced optimizations in a tracing JIT a la PyPy. The idea would be to keep enough metadata for the various pieces of machine code that the current backend produces, and add logic to detect when this machine code runs for long enough. At that point, we would involve a (completely new) second level backend, which would consolidate the pieces into a better-optimized whole. 
This is an idea that exists in method JITs but that should also work in tracing JITs: the second-level backend can see all the common paths at once, instead of one after the other. The second level can be slower (within reason), and it can even know how common each path is, which might give it an edge over ahead-of-time compilers.

À bientôt,

Armin.

From pierre.augier at univ-grenoble-alpes.fr  Mon Dec 21 17:19:30 2020
From: pierre.augier at univ-grenoble-alpes.fr (PIERRE AUGIER)
Date: Mon, 21 Dec 2020 23:19:30 +0100 (CET)
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To: References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID: <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr>

----- Mail original -----
> De: "David Edelsohn"
> À: "PIERRE AUGIER"
> Cc: "pypy-dev"
> Envoyé: Vendredi 18 Décembre 2020 21:00:42
> Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> Does Julia based on LLVM auto-vectorize the code? I assume yes
> because you specifically mention SIMD design of the data structure.

Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?

> Have you tried NumPyPy? Development on NumPyPy has not continued, but
> it probably would be a better comparison of what PyPy with
> auto-vectorization could accomplish to compare with Julia.

I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.

Anyway, for this experiment, my aim was to stay in pure Python and to compare with what is done in pure Julia.

I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor of 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
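[As a sketch of the flat, unboxed container idea raised earlier in the thread, one can back a fixed set of 3D points with a single `array.array` of doubles in pure Python. The class and method names here are purely illustrative, and whether this is actually faster than a list of `Point3D` instances under PyPy would have to be measured:]

```python
from array import array

class PointsArray:
    """Fixed-size sequence of 3D points stored contiguously in one
    array.array('d') -- a sketch, not a drop-in replacement for a list."""

    def __init__(self, coords):
        # coords: an iterable of (x, y, z) tuples, flattened into one buffer
        self._data = array('d', (c for xyz in coords for c in xyz))

    def __len__(self):
        return len(self._data) // 3

    def get(self, i):
        # Return point i as a plain (x, y, z) tuple.
        d = self._data
        return d[3 * i], d[3 * i + 1], d[3 * i + 2]

    def norm_square(self, i):
        x, y, z = self.get(i)
        return x * x + y * y + z * z

pts = PointsArray([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
print(len(pts))            # 2
print(pts.norm_square(0))  # 14.0
```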
Moreover, I would really be interested to know if an extension compatible with PyPy (better, not only compatible with PyPy) could be written to make such code faster (a code involving an array of instances of a very simple class). Could we gain anything compare to using a Python list? Are there some tools to understand what is done by PyPy to speedup some code? Or to know more on the data structures used under the hood by PyPy? For example, class Point3D: def __init__(self, x, y, z): self.x = x self.y = y self.z = z def norm_square(self): return self.x**2 + self.y**2 + self.z**2 I guess it would be good for efficiency to store the 3 floats as native floats aligned in memory and to vectorized the power computation. How can one know what is done by PyPy for a particular code? Pierre > > Thanks, David > > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER > wrote: >> >> Hi, >> >> I post on this list a message written in PyPy issue tracker >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is about >> some experiments I did on writing efficient implementations of the NBody >> problem https://github.com/paugier/nbabel to potentially answer to this article >> https://arxiv.org/pdf/2009.11295.pdf. >> >> I get from a PR an [interesting optimized implementation in >> Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl). >> It is very fast (even slightly faster than in Pythran). One idea is to store >> the 3 floats of a 3d physical vector, (x, y, z), in a struct `Point4D` >> containing 4 floats to better use simd instructions. >> >> I added a pure Python implementation inspired by this new Julia implementation >> (but with a simple `Point3D` with 3 floats because with PyPy, the `Point4D` >> does not make the code faster) and good news it is with PyPy a bit faster than >> our previous PyPy implementations (only 3 times slower than the old C++ >> implementation). 
>> >> However, it is much slower than with Julia (while the code is very similar). I >> coded a simplified version in Julia with nearly nothing else that what can be >> written in pure Python (in particular, no `@inbounds` and `@simd` macros). It >> seems to me that the comparison of these 2 versions could be interesting. So I >> again simplified these 2 versions to keep only what is important for >> performance, which gives >> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py >> - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl >> >> The results are summarized in >> https://github.com/paugier/nbabel/blob/master/py/microbench.md >> >> An important point is that with `Point3D` (a mutable class in Python and an >> immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same code and >> nothing really fancy in Julia so I guess that PyPy might be missing some >> optimization opportunities. At least it would be interesting to understand what >> is slower in PyPy (and why). I have to admit that I don't know how to get >> interesting information on timing and what is happening with PyPy JIT in a >> particular case. I only used cProfile and it's of course clearly not enough. I >> can run vmprof but I'm not able to visualize the data because the website >> http://vmprof.com/ is down. I don't know if I can trust values given by IPython >> `%timeit` for particular instructions since I don't know if PyPy JIT does the >> same thing in `%timeit` and in the function `compute_accelerations`. >> >> I also feel that I really miss in pure Python an efficient fixed size >> homogeneous mutable sequence (a "Vector" in Julia words) that can contain basic >> numerical types (as Python `array.array`) but also instances of user-defined >> classes and instances of Vectors. The Python code uses a [pure Python >> implementation using a >> list](https://github.com/paugier/nbabel/blob/master/py/vector.py). 
I think it >> would be reasonable to have a good implementation highly compatible with PyPy >> (and potentially other Python implementations) in a package on PyPI. It would >> really help to write PyPy compatible numerical codes. What would be the good >> tool to implement such package? HPy? I wonder whether we can get some speedup >> compared to the pure Python version with lists. For very simple classes like >> `Point3d` and `Point4d`, I wonder if the data could be saved continuously in >> memory and if some operations could be done without boxing/unboxing. >> >> However, I really don't know what is slower in PyPy / faster in Julia. >> >> I would be very interested to get the points of view of people knowing well >> PyPy. >> >> Pierre >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev From anto.cuni at gmail.com Mon Dec 21 17:25:41 2020 From: anto.cuni at gmail.com (Antonio Cuni) Date: Mon, 21 Dec 2020 23:25:41 +0100 Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes In-Reply-To: <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> Message-ID: On Mon, Dec 21, 2020 at 11:19 PM PIERRE AUGIER < pierre.augier at univ-grenoble-alpes.fr> wrote: > class Point3D: > def __init__(self, x, y, z): > self.x = x > self.y = y > self.z = z > > def norm_square(self): > return self.x**2 + self.y**2 + self.z**2 > you could try to store x, y and z inside a list instead of 3 different attributes: PyPy will use the specialized implementation which stores them unboxed, which might help the subsequent code. 
You can even use @property to expose them as .x, .y and .z, since the JIT should happily remove the abstraction away.

From dje.gcc at gmail.com Mon Dec 21 17:47:22 2020
From: dje.gcc at gmail.com (David Edelsohn)
Date: Mon, 21 Dec 2020 17:47:22 -0500
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To: <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr>
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID:

You did not state on exactly what system you are conducting the experiment, but "a factor of 4" seems very close to the auto-vectorization speedup of a vector of floats.

> I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.

If the performance gap is caused by auto-vectorization, I would recommend that you consider NumPy with the Numba LLVM-based JIT. Or, for a "pure" Python solution, you can experiment with an older release of PyPy and NumPyPy.

If the problem is the abstraction penalty, then the suggestion from Anto should help.

But, for the question of why, you can examine the code for the inner loop generated by Julia and the code for the inner loop generated by PyPy and analyze the reason for the performance gap. It should be evident if the difference is abstraction or SIMD.

Thanks, David

On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER wrote:
> > ----- Mail original -----
> > De: "David Edelsohn"
> > À: "PIERRE AUGIER"
> > Cc: "pypy-dev"
> > Envoyé: Vendredi 18 Décembre 2020 21:00:42
> > Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
> >
> > Does Julia based on LLVM auto-vectorize the code?
> > I assume yes because you specifically mention SIMD design of the data structure.
>
> Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some case?
>
> > Have you tried NumPyPy? Development on NumPyPy has not continued, but it probably would be a better comparison of what PyPy with auto-vectorization could accomplish to compare with Julia.
>
> I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
>
> Anyway, for this experiment, my attempt was to stay in pure Python and to compare with what is done in pure Julia.
>
> I think it would be very interesting to understand why PyPy is much slower than Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if it is an issue of the language or a limitation of the implementation.
>
> Moreover, I would really be interested to know if an extension compatible with PyPy (better, not only compatible with PyPy) could be written to make such code faster (a code involving an array of instances of a very simple class). Could we gain anything compared to using a Python list?
>
> Are there some tools to understand what is done by PyPy to speed up some code? Or to know more about the data structures used under the hood by PyPy?
>
> For example,
>
> class Point3D:
>     def __init__(self, x, y, z):
>         self.x = x
>         self.y = y
>         self.z = z
>
>     def norm_square(self):
>         return self.x**2 + self.y**2 + self.z**2
>
> I guess it would be good for efficiency to store the 3 floats as native floats aligned in memory and to vectorize the power computation. How can one know what is done by PyPy for a particular code?
>
> Pierre
>
> > Thanks, David
> >
> > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER wrote:
> >>
> >> Hi,
> >>
> >> I post on this list a message written in the PyPy issue tracker
> >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255).
> >> [...]
> >> Pierre
> >> _______________________________________________
> >> pypy-dev mailing list
> >> pypy-dev at python.org
> >> https://mail.python.org/mailman/listinfo/pypy-dev
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

From pierre.augier at univ-grenoble-alpes.fr Tue Dec 22 10:34:23 2020
From: pierre.augier at univ-grenoble-alpes.fr (PIERRE AUGIER)
Date: Tue, 22 Dec 2020 16:34:23 +0100 (CET)
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To:
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID: <1078901941.2791191.1608651263208.JavaMail.zimbra@univ-grenoble-alpes.fr>

----- Mail original -----
> De: "David Edelsohn"
> À: "PIERRE AUGIER"
> Cc: "pypy-dev"
> Envoyé: Lundi 21 Décembre 2020 23:47:22
> Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> You did not state on exactly what system you are conducting the
> experiment, but "a factor of 4" seems very close to the
> auto-vectorization speedup of a vector of floats.

The problem is described in detail in the repository https://github.com/paugier/nbabel and in the related issue https://foss.heptapod.net/pypy/pypy/-/issues/3349

>> I think it would be very interesting to understand why PyPy is much slower than
>> Julia in this case (a factor 4 slower than very simple Julia). I'm wondering if
>> it is an issue of the language or a limitation of the implementation.
>
> If the performance gap is caused by auto-vectorization, I would
> recommend that you consider NumPy with the Numba LLVM-based JIT. Or,
> for a "pure" Python solution, you can experiment with an older release
> of PyPy and NumPyPy.
There is already an implementation based on Numba (which is slower and from my point of view less elegant than what can be done with Transonic-Pythran). Here, it is really about what can be done with PyPy, nowadays and in future.

About NumPyPy, I'm sorry about this story, but I'm not interested in playing with an unsupported project.

> If the problem is the abstraction penalty, then the suggestion from
> Anto should help.

I tried to use a list to store the data but unfortunately, it's slower (1.5 times slower than with attributes and 6 times slower than Julia on my slow laptop):

Measurements with Julia (https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl):

pierre at voyage ~/Dev/nbabel/py master $ julia microbench_ju4.jl
Main.NB.MutablePoint3D 17.833 ms (1048576 allocations: 32.00 MiB)
Main.NB.Point3D 5.737 ms (0 allocations: 0 bytes)
Main.NB.Point4D 4.984 ms (0 allocations: 0 bytes)

Measurements with PyPy, objects with x, y, z attributes (like Julia, https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py):

pierre at voyage ~/Dev/nbabel/py master $ pypy microbench_pypy4.py
Point3D: 22.503 ms
Point4D: 45.127 ms

Measurements with PyPy, lists and @property (https://github.com/paugier/nbabel/blob/master/py/microbench_pypy_list.py):

pierre at voyage ~/Dev/nbabel/py master $ pypy microbench_pypy_list.py
Point3D: 34.115 ms
Point4D: 59.646 ms

> But, for the question of why, you can examine the code for the inner
> loop generated by Julia and the code for the inner loop generated by
> PyPy and analyze the reason for the performance gap. It should be
> evident if the difference is abstraction or SIMD.

Sorry for this naive question, but how can I examine the code for the inner loop generated by PyPy?
Pierre

> [...]

From cfbolz at gmx.de Tue Dec 22 15:50:19 2020
From: cfbolz at gmx.de (Carl Friedrich Bolz-Tereick)
Date: Tue, 22 Dec 2020 21:50:19 +0100
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To: <1078901941.2791191.1608651263208.JavaMail.zimbra@univ-grenoble-alpes.fr>
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> <1078901941.2791191.1608651263208.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID: <52cb6482-9d7b-1b1e-2a04-71c0f4e564cd@gmx.de>

On 22.12.20 16:34, PIERRE AUGIER wrote:
> Here, it is really about what can be done with PyPy, nowadays and in future.

Hi Pierre,

A few somewhat random comments from me.

First note is that you shouldn't run two different implementations that you are comparing (Point3D and Point4D in this case) within the same process, since they can influence each other.
If I run them in the same process I get this:

Point3D: 11.426 ms
Point4D: 21.572 ms

In separate processes the latter speeds up:

Point4D: 13.136 ms

(but it doesn't become faster than Point3D, indeed because we don't have any real SIMD support in the JIT.)

Next: some information about how to look at the generated code with PyPy. What I do is look at the JIT IR (which is very close to machine code, but one abstraction level above it). You get it like this:

PYPYLOG=jit-log-opt,jit-summary,jit-backend-counts:out pypy3 microbench_pypy4.py

This produces a file called 'out' with different sections. I usually start by looking at the bottom, which shows how often each trace is entered. This way, you can find the hottest trace:

[26f0c8566379] {jit-backend-counts
...
TargetToken(140179837690368):43692970
TargetToken(140179837690448):74923530
...
[26f0c8567905] jit-backend-counts}

Now I search for the address of the hottest trace to find its IR. The IR shows traced Python bytecodes interspersed with IR instructions (takes a bit of time to get used to reading it, but it's not super hard).

Looking through that, it's my opinion that the trace looks quite good. There are many small inefficiencies (a bit too much pointer chasing, a bit too much type checking everywhere, a few allocations that aren't necessary), but no single missed optimization that could immediately give a 5x speedup.

Which also follows my expectations of how I suspect a shootout between Julia and PyPy to end up: PyPy is much faster than CPython for algorithmic pure Python code (~150x on my laptop! that's really good :-)). But it can't really beat a "serious" ahead-of-time compiler for a statically typed language that specifically targets numerical code.
That is for several reasons, the most important ones being that 1) PyPy has a lot less time to produce code given that it does it at runtime and 2) PyPy has to support the full dynamically typed language Python, where really random things can be done at runtime, and PyPy must still always observe the Python semantics.

That said, I can understand that 5x slower is still a somewhat disappointing result and I suspect given enough effort we could maybe get it down to around 3x slower.

Cheers,

Carl Friedrich

From yury at shurup.com Tue Dec 22 16:08:14 2020
From: yury at shurup.com (Yury V. Zaytsev)
Date: Tue, 22 Dec 2020 22:08:14 +0100 (CET)
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To: <52cb6482-9d7b-1b1e-2a04-71c0f4e564cd@gmx.de>
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> <1078901941.2791191.1608651263208.JavaMail.zimbra@univ-grenoble-alpes.fr> <52cb6482-9d7b-1b1e-2a04-71c0f4e564cd@gmx.de>
Message-ID: <14a6c954-187f-2341-702a-b599455a6f8@shurup.com>

On Tue, 22 Dec 2020, Carl Friedrich Bolz-Tereick wrote:

> That said, I can understand that 5x slower is still a somewhat
> disappointing result and I suspect given enough effort we could maybe
> get it down to around 3x slower.

Just to clarify, if I understand you correctly, you mean that by investing some serious effort into optimising those "small" inefficiencies one could improve the situation from 5x to 3x. However, I wonder if anything could be done on the SIMD front in a rather generic way with a reasonable investment of time, but without going the full NumPyPy way, e.g. by doing something special for tight loops performing math on objects with a special layout (lists, arrays)...

--
Sincerely yours,
Yury V.
Zaytsev

From pierre.augier at univ-grenoble-alpes.fr Wed Dec 23 08:42:23 2020
From: pierre.augier at univ-grenoble-alpes.fr (PIERRE AUGIER)
Date: Wed, 23 Dec 2020 14:42:23 +0100 (CET)
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To:
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID: <1561344788.2885955.1608730943563.JavaMail.zimbra@univ-grenoble-alpes.fr>

----- Mail original -----
> De: "David Edelsohn"
> À: "PIERRE AUGIER"
> Cc: "pypy-dev"
> Envoyé: Lundi 21 Décembre 2020 23:47:22
> Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> You did not state on exactly what system you are conducting the
> experiment, but "a factor of 4" seems very close to the
> auto-vectorization speedup of a vector of floats.

I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:

```python
def sum_x(positions):
    result = 0.0
    for i in range(len(positions)):
        result += positions[i].x
    return result
```

The scripts are:

- https://github.com/paugier/nbabel/blob/master/py/microbench_sum_x.py
- https://github.com/paugier/nbabel/blob/master/py/microbench_sum_x.jl

Even in this case, Julia is again notably (~2.7 times) faster:

```
$ julia microbench_sum_x.jl
  1.208 µs (1 allocation: 16 bytes)

In [1]: run microbench_sum_x.py
sum_x(positions)
3.29 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sum_x(positions_list)
14.5 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

For `positions_list`, each `point` contains a list to store the 3 floats.

How can I analyze these performance differences? How can I get more information on what happens for this code with PyPy?
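For reference, the attribute-access variant of this microbenchmark can be reproduced with the standard library alone; a minimal sketch (the linked scripts do the actual timing with IPython's %timeit and a larger data set):

```python
import timeit

class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

def sum_x(positions):
    # One attribute read per element, as in the benchmark above.
    result = 0.0
    for i in range(len(positions)):
        result += positions[i].x
    return result

positions = [Point3D(float(i), 0.0, 0.0) for i in range(1000)]
print(sum_x(positions))  # 0.0 + 1.0 + ... + 999.0 = 499500.0

# Absolute timings depend on the interpreter (CPython vs PyPy).
print(timeit.timeit(lambda: sum_x(positions), number=10000))
```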
From pierre.augier at univ-grenoble-alpes.fr Wed Dec 23 17:35:00 2020
From: pierre.augier at univ-grenoble-alpes.fr (PIERRE AUGIER)
Date: Wed, 23 Dec 2020 23:35:00 +0100 (CET)
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To:
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID: <1516345526.2933283.1608762900737.JavaMail.zimbra@univ-grenoble-alpes.fr>

I think I understood that what is very slow compared to Julia is looping over a list of Python objects.

def loop_over_list_of_objects(l):
    for o in l:
        o

loop_over_list_of_objects([object() for _ in range(1000)])

See https://github.com/paugier/nbabel/blob/master/py/microbench_sum_x.py

Is there a better way to store Python objects (homogeneous in type) to be able to loop over them more efficiently? Would it be possible to store them in a contiguous array (if it makes sense for Python objects)?

----- Mail original -----
> De: "David Edelsohn"
> À: "PIERRE AUGIER"
> Cc: "pypy-dev"
> Envoyé: Lundi 21 Décembre 2020 23:47:22
> Objet: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> [...]

From cfbolz at gmx.de Thu Dec 24 01:06:43 2020
From: cfbolz at gmx.de (Carl Friedrich Bolz-Tereick)
Date: Thu, 24 Dec 2020 07:06:43 +0100
Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes
In-Reply-To: <1561344788.2885955.1608730943563.JavaMail.zimbra@univ-grenoble-alpes.fr>
References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> <1561344788.2885955.1608730943563.JavaMail.zimbra@univ-grenoble-alpes.fr>
Message-ID:

On 23.12.20 14:42, PIERRE AUGIER wrote:
> I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:
>
> ```python
> def sum_x(positions):
>     result = 0.0
>     for i in range(len(positions)):
>         result += positions[i].x
>     return result
> ```

This benchmark probably really shows the crux of the problem. In Python, the various Points instances (whether with lists, or with direct attributes) are vastly more complex beasts than the structs in Julia.
There, you can declare a struct with a certain number of Float64 fields and be done. Thus, reading .x from such a struct is just a pointer dereference. In Python, due to dynamic typing, the ability to add more fields later and even the ability to change the class of an instance, the actual memory layout of a Point3D type is much more complex, with various indirections and boxing. Reading .x out of such a thing is done in several steps:

1) check that positions[i] is an instance
2) check that it's an instance of Point3D
3) read its x field
4) check that the field is a float
5) read the float's value

All of these steps involve a pointer read. Improving this situation is probably possible (there's even a paper on how to get rid of steps 1 and 2: https://www.csl.cornell.edu/~cbatten/pdfs/cheng-type-freezing-cgo2020.pdf but the work wasn't merged). But there are problems:

- basically every single one of these steps needs to be addressed, and every one is its own optimization
- it's extremely delicate to get the balance and the trade-offs right, because the object system is so central in getting good performance for Python code across a wide variety of areas (not just numerical algorithms).

Another approach would indeed be (as you say in the other mail) to add support for telling PyPy explicitly that some list can contain only instances of a specific class and (more importantly) that a class is not to be considered "dynamic", meaning that its fields are fixed and of specific types. So far, we have not really gone in such directions, because that is language design and we leave that to the CPython devs ;-). Note that some of your other benchmarks are not measuring what you hope! E.g. I suspect that get_objects, get_xs and loop_over_list_of_objects from your other mail get completely removed by the Julia compiler, since they don't have side effects and don't compute anything. PyPy isn't actually able to remove empty loops.
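Carl's warning about removed loops can be made concrete. The sketch below is a hypothetical reconstruction of the kind of side-effect-free benchmark function he mentions (the real loop_over_list_of_objects lives in Pierre's nbabel repository; this version is an assumption, not his exact code):

```python
# Hypothetical reconstruction of a side-effect-free benchmark function.
# It mutates nothing and returns nothing, so an aggressive ahead-of-time
# compiler like Julia's can legally delete the whole loop, while PyPy
# still executes every (empty) iteration.

class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

def loop_over_list_of_objects(objects):
    for obj in objects:
        obj  # the loaded reference is immediately discarded

points = [Point3D(float(i), 0.0, 0.0) for i in range(1000)]
loop_over_list_of_objects(points)  # measures loop overhead only, not work
```

Timing such a function therefore compares PyPy's loop overhead against Julia code that may have been optimized away entirely.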
So you are comparing empty loops in PyPy with no code at all in Julia. Cheers, Carl Friedrich From yury at shurup.com Thu Dec 24 05:14:19 2020 From: yury at shurup.com (Yury V. Zaytsev) Date: Thu, 24 Dec 2020 11:14:19 +0100 (CET) Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes In-Reply-To: References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> <1561344788.2885955.1608730943563.JavaMail.zimbra@univ-grenoble-alpes.fr> Message-ID: <856f99bc-510-526b-2ce-9ee77a63fec5@shurup.com> On Thu, 24 Dec 2020, Carl Friedrich Bolz-Tereick wrote: > Another approach would indeed be (as you say in the other mail) to add > support for telling PyPy explicitly that some list can contain only > instances of a specific class and (more importantly) that a class is not > to be considered to be "dynamic" meaning that its fields are fixed and > of specific types. So far, we have not really gone in such directions, > because that is language design and we leave that to the CPython devs > ;-). Hmmm, how about dataclasses ;-) Maybe those can be used as optimization targets under some reasonable assumptions:

@dataclass
class Point3D:
    x: float
    y: float
    z: float

Feels pythonic much to me... -- Sincerely yours, Yury V.
Zaytsev From bokr at bokr.com Thu Dec 24 04:52:48 2020 From: bokr at bokr.com (Bengt Richter) Date: Thu, 24 Dec 2020 10:52:48 +0100 Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes In-Reply-To: References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> <1561344788.2885955.1608730943563.JavaMail.zimbra@univ-grenoble-alpes.fr> Message-ID: <20201224095248.GA17425@LionPure> Hi Carl, On +2020-12-24 07:06:43 +0100, Carl Friedrich Bolz-Tereick wrote:
> On 23.12.20 14:42, PIERRE AUGIER wrote:
> > I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:
> >
> > ```python
> > def sum_x(positions):
> >     result = 0.0
> >     for i in range(len(positions)):
> >         result += positions[i].x
> >     return result
> > ```
>
> This benchmark probably really shows the crux of the problem. In Python, > the various Points instances (whether with lists, or with direct > attributes) are vastly more complex beasts than the structs in Julia. > There, you can declare a struct with a certain number of Float64 fields > and be done. Thus, reading .x from such a struct is just a pointer > dereference. > > In Python, due to dynamic typing, the ability to add more fields later > and even the ability to change the class of an instance, the actual > memory layout of a Point3D type is much more complex with various > indirections and boxing. Reading .x out of such a thing is done in > several steps:
>
> 1) check that positions[i] is an instance
> 2) check that it's an instance of Point3D
> 3) read its x field
> 4) check that the field is a float
> 5) read the float's value
>
> All of these steps involve a pointer read.

Could crafted asserts tell the compiler that dynamic stuff is not happening? Is the compiler listening? :)

> Improving this situation is probably possible (there's even a paper how > to get rid of steps 1 and 2: > https://www.csl.cornell.edu/~cbatten/pdfs/cheng-type-freezing-cgo2020.pdf but > the work wasn't merged). But there are problems:
> - basically every single one of these steps needs to be addressed, and > every one is its own optimization
> - it's extremely delicate to get the balance and the trade-offs right, > because the object system is so central in getting good performance for > Python code across a wide variety of areas (not just numerical algorithms).
>
> Another approach would indeed be (as you say in the other mail) to add > support for telling PyPy explicitly that some list can contain only > instances of a specific class and (more importantly) that a class is not > to be considered to be "dynamic" meaning that its fields are fixed and > of specific types. So far, we have not really gone in such directions, > because that is language design and we leave that to the CPython devs ;-).

assert is already there :)

> Note that some of your other benchmarks are not measuring what you hope! > eg I suspect that get_objects, get_xs and loop_over_list_of_objects from > your other mail get completely removed by the Julia compiler, since they > don't have side effects and don't compute anything. PyPy isn't actually > able to remove empty loops. So you are comparing empty loops in PyPy > with no code at all in Julia.

And benchmarks that don't segregate outliers, or clusters, and just average are terrible. Scattergrams are much more informative :) E.g., in a scattergram you can discover that a CPU shortcutting multiply by zero affects a subset of your computations.

> Cheers, > > Carl Friedrich > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev

Happy Holidays!
-- Regards, Bengt Richter From pierre.augier at univ-grenoble-alpes.fr Sat Dec 26 17:23:14 2020 From: pierre.augier at univ-grenoble-alpes.fr (PIERRE AUGIER) Date: Sat, 26 Dec 2020 23:23:14 +0100 (CET) Subject: [pypy-dev] Differences performance Julia / PyPy on very similar codes In-Reply-To: References: <488740334.2327734.1608320907501.JavaMail.zimbra@univ-grenoble-alpes.fr> <1590364666.2670740.1608589170202.JavaMail.zimbra@univ-grenoble-alpes.fr> <1561344788.2885955.1608730943563.JavaMail.zimbra@univ-grenoble-alpes.fr> Message-ID: <1177860345.3070181.1609021394915.JavaMail.zimbra@univ-grenoble-alpes.fr> ----- Original Message -----
> From: "Carl Friedrich Bolz-Tereick"
> To: "PIERRE AUGIER" , "pypy-dev"
> Sent: Thursday, 24 December 2020 07:06:43
> Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> On 23.12.20 14:42, PIERRE AUGIER wrote:
>> I wrote another very simple benchmark that should not depend on auto-vectorization. The bench function is:
>>
>> ```python
>> def sum_x(positions):
>>     result = 0.0
>>     for i in range(len(positions)):
>>         result += positions[i].x
>>     return result
>> ```
>
> This benchmark probably really shows the crux of the problem. In Python, > the various Points instances (whether with lists, or with direct > attributes) are vastly more complex beasts than the structs in Julia. > There, you can declare a struct with a certain number of Float64 fields > and be done. Thus, reading .x from such a struct is just a pointer > dereference. > > In Python, due to dynamic typing, the ability to add more fields later > and even the ability to change the class of an instance, the actual > memory layout of a Point3D type is much more complex with various > indirections and boxing. Reading .x out of such a thing is done in several steps:
>
> 1) check that positions[i] is an instance
> 2) check that it's an instance of Point3D
> 3) read its x field
> 4) check that the field is a float
> 5) read the float's value
>
> All of these steps involve a pointer read.
>
> Improving this situation is probably possible (there's even a paper how > to get rid of steps 1 and 2: > https://www.csl.cornell.edu/~cbatten/pdfs/cheng-type-freezing-cgo2020.pdf but > the work wasn't merged). But there are problems:
> - basically every single one of these steps needs to be addressed, and > every one is its own optimization
> - it's extremely delicate to get the balance and the trade-offs right, > because the object system is so central in getting good performance for > Python code across a wide variety of areas (not just numerical algorithms).
>
> Another approach would indeed be (as you say in the other mail) to add > support for telling PyPy explicitly that some list can contain only > instances of a specific class and (more importantly) that a class is not > to be considered to be "dynamic" meaning that its fields are fixed and > of specific types. So far, we have not really gone in such directions, > because that is language design and we leave that to the CPython devs ;-).

Thanks a lot Carl for your very interesting answers. I'm wondering if it could be possible to write an extension that would improve the situation for such numerical codes. I wrote a first description here: https://github.com/paugier/nbabel/blob/master/py/vector.md (more about the Python API). I think that if something like this extension could exist and be very efficient with PyPy, it would greatly help in writing very efficient numerical codes in "pure Python style". For the case of the NBabel problem, the code would be very nice and it seems to me that we could reach very good performance compared to Julia and other compiled languages.
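For readers who want to reproduce the numbers locally, the micro-benchmark at the heart of this thread can be run as a self-contained script. The Point3D class below is a plain-attribute sketch assumed for illustration, not the exact class from the nbabel repository:

```python
# Self-contained version of the sum_x micro-benchmark discussed above.
# Point3D is a plain-attribute sketch (an assumption), not Pierre's code.

class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

def sum_x(positions):
    # Each positions[i].x read goes through the attribute-lookup steps
    # Carl describes; PyPy's JIT specializes them after warm-up.
    result = 0.0
    for i in range(len(positions)):
        result += positions[i].x
    return result

points = [Point3D(float(i), 0.0, 0.0) for i in range(1000)]
print(sum_x(points))  # 499500.0
```

Running it under both CPython and PyPy (with a much larger list and repeated calls, so the JIT warms up) gives a feel for the attribute-read cost being discussed.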
I would be very interested to get some feedback on this proposition. Do you think that HPy could be used to implement such an extension? Could such an extension be fully compatible with the PyPy JIT without modification in PyPy? Cheers, Pierre Augier From theloniusmonster at icloud.com Thu Dec 31 12:09:36 2020 From: theloniusmonster at icloud.com (Peter Vessenes) Date: Thu, 31 Dec 2020 09:09:36 -0800 Subject: [pypy-dev] I'm happy to donate an M1 Mac to you In-Reply-To: <947188B4-5DD0-4B70-B1B8-E6EF04B95A05@icloud.com> References: <947188B4-5DD0-4B70-B1B8-E6EF04B95A05@icloud.com> Message-ID: P.s. weird, this went out from some random icloud email. I'm cc'ing my main one. Sent from my iPad > On Dec 31, 2020, at 9:07 AM, Peter Vessenes wrote: > > Hey, > > Happy New Year -- pypy is great. Please let me know where I can send an M1 Mac Mini, and I'll ship it to you ASAP. > > A quick thanks to me as a follow-up edit would be nice, but not required. :) > > Peter Vessenes > > > Sent from my iPad From theloniusmonster at icloud.com Thu Dec 31 12:07:37 2020 From: theloniusmonster at icloud.com (Peter Vessenes) Date: Thu, 31 Dec 2020 09:07:37 -0800 Subject: [pypy-dev] I'm happy to donate an M1 Mac to you Message-ID: <947188B4-5DD0-4B70-B1B8-E6EF04B95A05@icloud.com> Hey, Happy New Year -- pypy is great. Please let me know where I can send an M1 Mac Mini, and I'll ship it to you ASAP. A quick thanks to me as a follow-up edit would be nice, but not required. :) Peter Vessenes Sent from my iPad