From kevinar18 at hotmail.com Sun Aug 1 04:09:28 2010 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Sat, 31 Jul 2010 22:09:28 -0400 Subject: [pypy-dev] FW: Would the following shared memory model be possible? In-Reply-To: References: , <20100727062702.GE12699@tunixman.com>, , , , , , , , , Message-ID: > > I have no idea what I did you warrant you hateful replies towards me, but > > they really are not appropriate (in public or private email). > > I had absolutely no intention of offending you, and am deeply sorry > for any offense that I may have caused you. I must admit, I'm rather surprised by your reply -- and also thank you. I'm sorry for the trouble I caused you with this. I had hoped for a good conversation about the issues related to Kamaelia, yet everytime I got a reply back, it seemed like you were mad at me for some unknown reason. As a simple example of what I mean. In your first email, you mentioned a lot of different programming styles related to FBP and Kamaelia. Since I am interested in parallel "research" I put those words into google and made a whole bookmark section so that I would have them for future study. When I replied back, I figured that this would be a good way to lighten the mood in the email, so I thanked you for the info and asked for any more links/ideas you might want to mention. A shared point of interest might be a good way to foster a nice friendly atmosphere. Unfortunately, I am assuming you must have misunderstood me, because instead of stirring up a friendly interest, I received several paragraphs about me being inconsiderate (not searching google for something) and putting an undue burden on you. At this point, it would be really unfair to talk about it further. I guess to sum things up, I got the impression that you were mad at me for some unknown reason: it was like each successive email was going further and further down hill -- and I didn't know why. However, in the end, I am glad that the whole situation could be resolved the way it has been. > I'll bow out at this point. I wouldn't want you to have to do that; your input can be very useful to people. I apologized, you apolgized.... Some stuff was cleared up, etc.... I don't think anybody here is holding a grude or going rehash the topic again (me and you included). You have very specific knowledge related to Kamaelia that could be useful to people exploring micro-threading implementations, parallel computing, etc.... --------------- Now, to change the topic slightly (and hopeful in a positive way). I'm not sure if it really matters to you, but I have been considering another possible way to make a parallel tasklet (like for FBP and Kamaelia) in PyPy... but I don't have 3+ months to spend ironing out the flaws, learning PyPy, writing an implemenation, etc.... ... and to be honest, I would not feel comfortable asking someone else (here or otherwise) to try and make something for my benefit. On another note... something that might actually interest you: I have done some work on a graphical front-end for FBP ... nothing super special, mind you, but I could keep you informed in the future if is something of interest to you. Anyways, hope this email turns out on a positive note for you and everyone else. Kevin > > I have no idea what I did you warrant you hateful replies towards me, but > > they really are not appropriate (in public or private email). > > I had absolutely no intention of offending you, and am deeply sorry > for any offense that I may have caused you. 
> > In my reply I merely wanted to flag that I don't have time to go into > everything (like most people), that asking questions in a public realm > is better because you may then get answers from multiple people, and > that people who appear to do some research first tend to get better > answers. I also tried to give an example, but that doesn't appear to > have been helpful. (I'm fallible like everyone else) > > My intention there was to be helpful and to explain why I have that > view of only replying on list, and it appears to have offended you > instead, and I apologise. (one person's direct and helpful speech in > one place can be a mortal insult somewhere else) > > After those couple of paragraphs, I tried to add to your discussion by > replying to your specific points which you asked about parallel > execution, noting places and examples where it is possible today. (to > varying degrees of satisfaction) I then also tried to answer your > point of "if something extra could be done, what would probably be > generally useful". To that I noted that *my* talk there was cheap, and > that execution was hard. > > Somehow along the way, my intent to try to be helpful to you has > resulted in offending and upsetting you, and for that I am truly sorry > - life is simply too short for people to upset each other, and in no > way was my post intended as "hateful", and once again, my apologies. > In future please assume good intentions - I assumed good intentions on > your part. > > I'll bow out at this point. > > Best Regards, > > > Michael. > > > > >> Date: Sat, 31 Jul 2010 02:08:49 +0100 > >> Subject: Re: [pypy-dev] FW: Would the following shared memory model be > >> possible? > >> From: sparks.m at gmail.com > >> To: kevinar18 at hotmail.com > >> CC: pypy-dev at codespeak.net > >> > >> On Thu, Jul 29, 2010 at 6:44 PM, Kevin Ar18 wrote: > >> > You brought up a lot of topics. I went ahead and sent you a private > >> > email. > >> > There's always lots of interesting things I can add to my list of things > >> > to > >> > learn about. :) > >> > >> Yes, there are lots of interesting things. I have a limited amount of > >> time however (I should be in bed, it's very late here, but I do /try/ > >> to reply to on-list mails), so cannot spoon feed you. Mailing me > >> directly rather than a (relevant) list precludes you getting answers > >> from someone other than me. Not being on lists also precludes you > >> getting answers to questions by chance. Changing emails and names in > >> email headers also makes keeping track of people hard... > >> > >> (For example you asked off list last year about Kamaelia's license > >> from a different email address. Since it wasn't searchable I > >> completely forgot. You also asked all sorts of questions but didn't > >> want the answers public, so I didn't reply. If instead you'd > >> subscribed to the list, and asked there, you'd've found out that > >> Kamaelia's license changed - to the Apache Software License v2 ...) > >> > >> If I mention something you find interesting, please Google first and > >> then ask publicly somewhere relevant. (the answer and question are > >> then googleable, and you're doing the community a service IMO if you > >> ask q's that way - if your question is somewhere relevant and shows > >> you've already googled prior work as far as you can... People are
always willing to help people who show willing to help themselves in > >> my experience.) > >> > >> >> just looks to me that you're tying yourself up in knots over things > >> >> that aren't problems, when there are some things which could be useful > >> >> (in practice) & interesting in this space. > >> > The particular issue in this situation is that there is no way to make > >> > Kamaelia, FBP, or other concurrency concepts run in parallel (unless you > >> > are > >> > willing to accept lots of overhead like with the multiprocessing > >> > queues). > >> > > >> > Since you have worked with Kamaelia code a lot... you understand a lot > >> > more > >> > about implementation details. Do you think the previous shared memory > >> > concept or something like it would let you make Kamaelia parallel? > >> > If not, can you think of any method that would let you make Kamaelia > >> > parallel? > >> > >> Kamaelia already CAN run components in parallel in different processes > >> (has been able to do so for quite some time) or on different > >> processors. Indeed, all you do is use a ProcessPipeline or > >> ProcessGraphline rather than Pipeline or Graphline, and the components > >> in the top level are spread across processes. I still view the code as > >> experimental, but it does work, and when needed is very useful. > >> > >> Kamaelia running on Iron Python can run on separate processors sharing > >> data efficiently (due to lack of GIL there) happily too. Threaded > >> components there do that naturally - I don't use IronPython, but it > >> does run on Iron Python. On windows this is easiest, though Mono works > >> just as well. > >> > >> I believe Jython also is GIL free, and Kamaelia's Axon runs there > >> cleanly too. As a result because Kamaelia is pure python, it runs > >> truly in parallel there too (based on hearing from people using > >> kamaelia on jython). Cpython is the exception (and a rather big one at > >> that). (Pypy has a choice IIUC) > >> > >> Personally, I think if PyPy worked with generators better (which is > >> why I keep an eye on PyPy) and cpyext was improved, it'd provide a > >> really compelling platform for me. (I was rather gutted at Europython > >> to hear that PyPy's generator support was still ... problematic) > >> > >> Regarding the *efficiency* and *enforcement* of the approach taken, I > >> feel you're chasing the wrong tree, but let's go there. > >> > >> What approach does baseline (non-Iron Python running) kamaelia take > >> for multi-process work? > >> > >> For historical reasons, it builds on top of pprocess rather than > >> multiprocessing module based. This means for interprocess > >> communications objects are pickled before being sent over operating > >> system pipes.
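As a minimal, stdlib-only sketch of the mechanism just described (POSIX-only because of os.fork, and deliberately not Kamaelia or pprocess code): the parent pickles an object and writes the bytes into an OS pipe, and the child process unpickles its own private copy on the other end.

    import os, pickle

    message = {"frame": 42, "samples": [1.0, 2.0, 3.0]}

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # child: read until EOF, then unpickle a *copy* of the object
        os.close(write_fd)
        reader = os.fdopen(read_fd, "rb")
        copy = pickle.loads(reader.read())
        reader.close()
        print "child received:", copy
        os._exit(0)
    else:
        # parent: pickle the object and push the bytes through the pipe
        os.close(read_fd)
        writer = os.fdopen(write_fd, "wb")
        writer.write(pickle.dumps(message))
        writer.close()
        os.waitpid(pid, 0)

Every message pays the serialisation cost on one side and the deserialisation cost on the other, which is the overhead discussed next.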
> >> > >> This provides an obvious communications overhead - and this isn't > >> really kamaelia specific at this point. > >> > >> However, shifting data from one CPU to another is expensive, and only > >> worth doing in some circumstances. (Consider a machine with several > >> physical CPUs - each has a local CPU cache, and the data needs to be > >> transferred from one to another, which is why partly people worry > >> about thread/CPU affinity etc) > >> > >> Basically, if you can manage it, you don't want to shift data between > >> CPUs, you want to partition the processing. > >> > >> ie you may want to start caring about the size of messages and number > >> of messages going between processes. Sending small and few between > >> processes is going to be preferable to sending large and many for > >> throughput purposes. > >> > >> In the case of small and few, the approach of pickling and sending > >> across OS pipes isn't such a bad idea. It works. > >> > >> If you do want to share data between CPUs, and it sounds like you do, > >> then most OSs already provide a means of doing that - threads. The > >> conventions people use for using threads are where they become > >> unpicked, but as a mechanism, threads do generally work, and work > >> well. > >> > >> As well as channels/boxes, you can use an STM approach, such as than > >> in Axon.STM ... > >> * http://www.kamaelia.org/STM.html > >> * > >> http://code.google.com/p/kamaelia/source/browse/trunk/Code/Python/Bindings/STM/ > >> > >> ...which is logically very similar to version control for variables. A > >> downside of STM (at least with this approach) however, is that for it > >> to work, you need either copy on write semantics for objects, or full > >> copying of objects or similar. Personally I use a biological metaphor > >> here, in that channels/boxes and components, and similar perform a > >> similar function to axons and neurons in the body, and that STM is > >> akin to the hormonal system for maintaining and controlling system > >> state. (I modelled biological tree growth many moons ago) > >> > >> Anyhow, coming back to threads, that brings us back to python, and > >> implementations with a GIL, and those without. > >> > >> For implementations with a GIL, you then have a choice: do I choose to > >> try and implement a memory model that _enforces_ data locality? that > >> is if a piece of data is in use inside a single "process" or "thread" > >> (from hereon I'll use "task" as a generic phrase) that trying to use > >> it inside another causes a problem for the task attempting to breach > >> the model. > >> > >> In order to enforce this, I personally believe you'd need to use > >> multiple processes, and only share data through dedicated code > >> managing shared memory. You could of course do this outside user code. > >> To do this you'd need an abstraction that made sense, and something > >> like stackless' channels or kamaelia's (in/out) box model makes sense > >> there. (The CELL API uses a mailbox metaphor as well for reference) > >> > >> In that case, you have a choice. You either copy the data into shared > >> memory, or you share the data in situ. The former gives you back > >> precisely the same overhead previously described, or the latter > >> fragments your memory (since you can no longer access it). You could > >> also have compaction. > >> > >> However, personally, I think any possible benefits here are outweighed > >> by the costs and complexity. > >> > >> The alternative is to _encourage_ data locality. 
That is encourage the > >> usage and sharing of data such that whilst you could share data > >> between tasks and cause corruption that the common way of using the > >> system discourages such actions. In essence that's what I try to do in > >> Kamaelia, and it seems to work. Specifically, the model says: > >> > >> * If I take a piece of data from an inbox, I own it and can do anything > >> with it that I like. If you think of a physical piece of paper and > >> I take it from an intray, then that really is the case. > >> > >> * If I put a piece of data in an outbox, I no longer own it and should > >> not attempt to do anything more with it. Again, using a physical > >> metaphor, and naming scheme helps here. In particular, if I put a > >> piece of paper in the post, I can no longer modify it. How it gets > >> to its recipient is not my concern either. > >> > >> In practice this does actually work. If you add in immutable tuples, > >> and immutable strings then it becomes a lot clearer how this can work. > >> > >> Is there a risk here of accidental modification? Yes. However, the > >> size and general simplicity of components tends to lead to such > >> problems being picked up early. It also enables component level > >> acceptance tests. (We tend to build small examples of usage, which in > >> turn effectively form acceptance tests) > >> > >> [ An alternative is to make the "send" primitive make a copy on send. > >> That would be quite an overhead, and also limit the types of data you > >> can send. ] > >> > >> In practical terms, it works. (Stackless proves this as well IMO, > >> since despite some differences, there's also lots of similarities) > >> > >> The other question that arises, is "isn't the GIL a problem with > >> threads?". Well, the answer to that really depends on what you're > >> doing. David Beazely's talk on what happens on mixing different sorts > >> of threads shows that it isn't ideal, and if you're hitting that > >> behaviour, then actually switching to real processes makes sense. > >> However if you're doing CPU intensive work inside a C extension which > >> releases the GIL (eg numpy), then it's less of an issue in practice. > >> Custom extensions can do the same. > >> > >> So, for example, picking something which I know colleagues [1] at work > >> do, you can use a DVS broadcast capture card to capture video frames, > >> pass those between threads which are doing processing on them, and > >> inside those threads use c extensions to process the data efficiently > >> (since image processing does take time...), and those release the GIL > >> boosting throughput. > >> > >> [1] On this project : > >> http://www.bbc.co.uk/rd/projects/2009/10/i3dlive.shtml > >> > >> So, that makes it all sound great - ie things can, after various > >> fashions, run in parallel on various versions of python, to practical > >> benefit. But obviously it could be improved. > >> > >> Personally, I think the project most likely to make a difference here > >> is actually pypy. Now, talk is very cheap, and easy, and I'm not > >> likely to implement this, so I'll aim to be brief. Execution is hard. > >> > >> In particular, what I think is most likely to be beneficial is > >> something _like_ this: > >> > >> Assume pypy runs without a GIL. Then allow the creation of a green > >> process. 
A green process is implemented using threads, but with data > >> created on the heap such that it defaults to being marked private to > >> the thread (ie ala thread local storage, but perhaps implemented > >> slightly differently - via references from the thread local storage > >> into the heap) rather than shared. Sharing between green processes > >> (for channels or boxes) would "simply" be detagged as being owned by > >> one thread, and passed to another. > >> > >> In particular this would mean that you need a mechanism for doing > >> this. Simply attempting to call another green process (or thread) from > >> another with mutable data types would be sufficient to raise the > >> equivalent of a segmentation fault. > >> > >> Secondly, improve cpyext to the extent that each cpython extension > >> gets it's own version of the GIL. (ie each extension runs with its own > >> logical runtime, and thinks that it has its own GIL which it can lock > >> and release. In practice it's faked by the PyPy runtime. This is > >> essentially similar conceptually to creating green processes. > >> > >> It's worth considering that the Linux kernel went through similar > >> changes, in that in the 2.0 days there was a large single big lock, > >> which was replaced by ever granular locks. I personally think that > >> since there are so many extensions that rely on the existence of the > >> GIL simply waving a wand to get rid of it isn't likely. However > >> logically providing a GIL per C-Extension may be plausible, and _may_ > >> be sufficient. > >> > >> However, I don't know - it might well not - I've not looked at the > >> code, and talk is cheap - execution is hard. > >> > >> Hopefully the above (cheap :) comments are in some small way useful. > >> > >> Regards, > >> > >> > >> Michael. > > From holger at merlinux.eu Sun Aug 1 13:50:29 2010 From: holger at merlinux.eu (holger krekel) Date: Sun, 1 Aug 2010 13:50:29 +0200 Subject: [pypy-dev] py.test/debian and pypy issue Message-ID: <20100801115029.GL1914@trillke.net> Hi all, just for you information: if you are running Debian (e.g. Ubuntu 10.04) and install "py.test" (codespeak-python-lib) from there you get the 9-month old py.test-1.1 which cannot run PyPy's trunk-test suite. Solutions: * uninstall the debian version. install 'py' from PyPI with e.g. "pip install py" or "easy_install py" - this should get you the 1.3.3 version which should work fine. * uninstall the debian version, don't install any other and then alias "py.test" to "trunk/pypy/py/bin/py.test" which means you use the pypy-included py version, currently version 1.3.1 which is also the version used in nightly test runs etc. sidenote: Fedora 13 ships 1.3.2 and Gentoo ships 1.3.3 so you mostly only get the issues on debian-based systems, i guess. best, holger From glavoie at gmail.com Sun Aug 1 22:04:29 2010 From: glavoie at gmail.com (Gabriel Lavoie) Date: Sun, 1 Aug 2010 16:04:29 -0400 Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? In-Reply-To: References: Message-ID: Sorry for the late answer, I was unavailable in the last few days. About send() and receive(), it depends on if the communication is local or not. For a local communication, anything can be passed since only the reference is sent. This is the base model for Stackless channels. For a remote communication (between two interpreters), any picklable object (a copy will then be made) and it includes channels and tasklets (for which a reference will automatically be created). 
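As a minimal sketch of the local case just described (plain "stackless" module code, runnable on Stackless Python or with PyPy's stackless module; the function names are invented for illustration): channel.send() hands the receiving tasklet a reference to the very same object, so nothing is copied or pickled.

    import stackless

    def producer(channel):
        data = {"payload": [1, 2, 3]}
        channel.send(data)             # blocks until a receiver is ready

    def consumer(channel):
        received = channel.receive()   # gets a reference, not a copy
        received["payload"].append(4)  # mutates the sender's object in place
        print "consumer sees:", received

    ch = stackless.channel()
    stackless.tasklet(producer)(ch)
    stackless.tasklet(consumer)(ch)
    stackless.run()

In the remote case described above, the same send() would instead involve pickling a copy, which is where the extra cost and the copy semantics come in.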
The use of the PyPy proxy object space is to make remote communication more Stackless like by passing object by reference. If a ref_object is made, only a reference will be passed when a tasklet is moved or the object is sent on a channel. The object always resides where it was created. A move() operation will also be implemented on those objects so they can be moved around like tasklets. I hope it helps, Gabriel 2010/7/29 Kevin Ar18 > > > Hello Kevin, > > I don't know if it can be a solution to your problem but for my > > Master Thesis I'm working on making Stackless Python distributed. What > > I did is working but not complete and I'm right now in the process of > > writing the thesis (in french unfortunately). My code currently works > > with PyPy's "stackless" module onlyis and use some PyPy specific > > things. Here's what I added to Stackless: > > > > - Possibility to move tasklets easily (ref_tasklet.move(node_id)). A > > node is an instance of an interpreter. > > - Each tasklet has its global namespace (to avoid sharing of data). The > > state is also easier to move to another interpreter this way. > > - Distributed channels: All requests are known by all nodes using the > > channel. > > - Distributed objets: When a reference is sent to a remote node, the > > object is not copied, a reference is created using PyPy's proxy object > > space. > > - Automated dependency recovery when an object or a tasklet is loaded > > on another interpreter > > > > With a proper scheduler, many tasklets could be automatically spread in > > multiple interpreters to use multiple cores or on multiple computers. A > > bit like the N:M threading model where N lightweight threads/coroutines > > can be executed on M threads. > > Was able to have a look at the API... > If others don't mind my asking this on the mailing list: > > * .send() and .receive() > What type of data can you send and receive between the tasklets? Can you > pass entire Python objects? > > * .send() and .receive() memory model > When you send data between tasklets (pass messages) or whateve you want to > call it, how is this implemented under the hood? Does it use shared memory > under the hood or does it involve a more costly copying of the data? I > realize that if it is on another machine you have to copy the data, but what > about between two threads? You mentioned PyPy's proxy object.... guess I'll > need to read up on that. > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > -- Gabriel Lavoie glavoie at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex.gaynor at gmail.com Mon Aug 2 04:11:15 2010 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Sun, 1 Aug 2010 22:11:15 -0400 Subject: [pypy-dev] I broke stackless Message-ID: The work I did changing CALL_METHOD to support keyword arguments moved some rstack.resume_point calls around, and it seems to have inadvertantly broken stackless on trunk. The latest translation fail can be found here: http://buildbot.pypy.org/builders/pypy-c-stackless-app-level-linux-x86-32/builds/597/steps/translate/logs/stdio. Anyone have a suggestion as to what exactly I need to do to get this working? Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Voltaire "The people's good is the highest law." 
-- Cicero "Code can always be simpler than you think, but never as simple as you want" -- Me From benjamin at python.org Mon Aug 2 04:52:34 2010 From: benjamin at python.org (Benjamin Peterson) Date: Sun, 1 Aug 2010 21:52:34 -0500 Subject: [pypy-dev] I broke stackless In-Reply-To: References: Message-ID: 2010/8/1 Alex Gaynor : > The work I did changing CALL_METHOD to support keyword arguments moved > some rstack.resume_point calls around, and it seems to have > inadvertantly broken stackless on trunk. ?The latest translation fail > can be found here: > http://buildbot.pypy.org/builders/pypy-c-stackless-app-level-linux-x86-32/builds/597/steps/translate/logs/stdio. > ?Anyone have a suggestion as to what exactly I need to do to get this > working? Revert it! :) -- Regards, Benjamin From todd.a.anderson at intel.com Tue Aug 3 19:29:49 2010 From: todd.a.anderson at intel.com (Anderson, Todd A) Date: Tue, 3 Aug 2010 10:29:49 -0700 Subject: [pypy-dev] Percentage Python as RPython. Message-ID: <9662F248D13E8C45B097A77F005E9729B0C39B74@orsmsx503.amr.corp.intel.com> Sorry if this has been asked before. I did some searching of the archive and didn't see anything but I might have missed it. I am curious what percentage of real-world Python programs in use are also RPython programs. I know that the FAQ says that the translator is not intended for Python programs in general but only for the PyPy interpreter itself but I've also seen a few mentions (on other sites) of attempting to translate Python to C. I've been thinking about adding a backend to the translator but would only want to do so if a significant amount of Python programs could use it. thanks, Todd -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Aug 3 20:52:54 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 3 Aug 2010 20:52:54 +0200 Subject: [pypy-dev] Percentage Python as RPython. In-Reply-To: <9662F248D13E8C45B097A77F005E9729B0C39B74@orsmsx503.amr.corp.intel.com> References: <9662F248D13E8C45B097A77F005E9729B0C39B74@orsmsx503.amr.corp.intel.com> Message-ID: On Tue, Aug 3, 2010 at 7:29 PM, Anderson, Todd A wrote: > Sorry if this has been asked before. ?I did some searching of the archive > and didn?t see anything but I might have missed it. > > > > I am curious what percentage of real-world Python programs in use are also > RPython programs. ?I know that the FAQ says that the translator is not > intended for Python programs in general but only for the PyPy interpreter > itself but I?ve also seen a few mentions (on other sites) of attempting to > translate Python to C.? I?ve been thinking about adding a backend to the > translator but would only want to do so if a significant amount of Python > programs could use it. > 0 - 0.5% (generally, none. You write programs for RPython in a different manner). > > > thanks, > > > > Todd > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From ademan555 at gmail.com Tue Aug 3 22:08:07 2010 From: ademan555 at gmail.com (Dan Roberts) Date: Tue, 3 Aug 2010 13:08:07 -0700 Subject: [pypy-dev] Percentage Python as RPython. 
In-Reply-To: <9662F248D13E8C45B097A77F005E9729B0C39B74@orsmsx503.amr.corp.intel.com> References: <9662F248D13E8C45B097A77F005E9729B0C39B74@orsmsx503.amr.corp.intel.com> Message-ID: Hi Todd, I'm not sure what your goals are, but my position is that if you write a translator backend and a JIT backend (please do) you can have fast (and improving) python on platform X. What were you hoping to target with your backend? Cheers, Dan On Aug 3, 2010 10:43 AM, "Anderson, Todd A" wrote: Sorry if this has been asked before. I did some searching of the archive and didn?t see anything but I might have missed it. I am curious what percentage of real-world Python programs in use are also RPython programs. I know that the FAQ says that the translator is not intended for Python programs in general but only for the PyPy interpreter itself but I?ve also seen a few mentions (on other sites) of attempting to translate Python to C. I?ve been thinking about adding a backend to the translator but would only want to do so if a significant amount of Python programs could use it. thanks, Todd _______________________________________________ pypy-dev at codespeak.net http://codespeak.net/mailman/listinfo/pypy-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From bhartsho at yahoo.com Wed Aug 4 10:21:10 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Wed, 4 Aug 2010 01:21:10 -0700 (PDT) Subject: [pypy-dev] demoting method, cannot follow, call result degenerated Message-ID: <267005.44211.qm@web114012.mail.gq1.yahoo.com> I'm still struggling to learn all the rules of RPython, i have read the coding guide, and the PDF's PyGirl and Ancona's RPython paper, but still i feel i'm not fully grasping everything. I have a function that returns different classes that all share a common base class. It works until i introduce a new subclass that has some methods of the same name. Then i get the demotion, can not follow, degenerated error. I googled, but all i can find is an IRC log where Fijal seems to taking talking about my problem. http://www.tismer.com/pypy/irc-logs/pypy/%23pypy.log.20070125 pedronis: if function can return (in rpython) set of classes with common superclass, than all methods that I call later must be defined on that superclass, right? [11:30] [15:01] yes, unless you assert a specific subclass So i just need to use an assert statement before the function return, and assert the class i am returning? I am blogging about my progress while learning RPython, i have posted about meta-programming in Rpython which is a new concept to me. http://pyppet.blogspot.com/2010/08/meta-programming-in-rpython.html -brett From fijall at gmail.com Wed Aug 4 10:25:59 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 4 Aug 2010 10:25:59 +0200 Subject: [pypy-dev] demoting method, cannot follow, call result degenerated In-Reply-To: <267005.44211.qm@web114012.mail.gq1.yahoo.com> References: <267005.44211.qm@web114012.mail.gq1.yahoo.com> Message-ID: Hey. If at any place in code you want to call methods on a thing that can't be proven to be of a specific subclass, they have to be defined on a superclass (even dummy versions). 
If you are however sure that this object will be of a specific subclass, write: assert isinstance(x, MySubclass) x.specific_method that's fine On Wed, Aug 4, 2010 at 10:21 AM, Hart's Antler wrote: > I'm still struggling to learn all the rules of RPython, i have read the coding guide, and the PDF's PyGirl and Ancona's RPython paper, but still i feel i'm not fully grasping everything. > > I have a function that returns different classes that all share a common base class. ?It works until i introduce a new subclass that has some methods of the same name. ?Then i get the demotion, can not follow, degenerated error. > > I googled, but all i can find is an IRC log where Fijal seems to taking talking about my problem. > http://www.tismer.com/pypy/irc-logs/pypy/%23pypy.log.20070125 > > pedronis: if function can return (in rpython) set of classes with common superclass, than all methods that I call later must be defined on that superclass, right? > > [11:30] [15:01] yes, unless you assert a specific subclass > > So i just need to use an assert statement before the function return, and assert the class i am returning? > > I am blogging about my progress while learning RPython, i have posted about meta-programming in Rpython which is a new concept to me. > > http://pyppet.blogspot.com/2010/08/meta-programming-in-rpython.html > > -brett > > > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From bhartsho at yahoo.com Thu Aug 5 03:04:00 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Wed, 4 Aug 2010 18:04:00 -0700 (PDT) Subject: [pypy-dev] Percentage Python as RPython. Message-ID: <805862.40025.qm@web114009.mail.gq1.yahoo.com> Todd, I think you want a frontend, not a backend. The frontend would take in normal Python and convert it to RPython. RPython seems to have infinite meta-programming possibilities, so its just a matter of how hard would it be to make the meta frontend. Probably too hard since python is so dynamic, but maybe its possible with a new subset of Python halfway to RPython, developers would then only have to port to Not-So-Restricted-Python, and then the frontend does the final job of converting to RPython. -brett From bhartsho at yahoo.com Thu Aug 5 03:14:10 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Wed, 4 Aug 2010 18:14:10 -0700 (PDT) Subject: [pypy-dev] demoting method, cannot follow, call result degenerated In-Reply-To: Message-ID: <836035.68548.qm@web114016.mail.gq1.yahoo.com> Thanks for clarifying Fijal, putting dummy functions on the base class fixes the demotion errors. But now i have a new problem, from the bookkeeper, unpackiterable. pypy.annotation.bookkeeper.CallPatternTooComplex': '*' argument must be SomeTuple .. v2 = call_args(v0, ((0, (), True, False)), v1) .. '(rbpy:1)BPY_Object_MESH.GET_location' I checked the object, instead of SomeTuple it is SomeObject. I'm trying to understand what causes the CallPatternTooComplex error, i can not reproduce it with a simple model that is close to what my actual code is doing. class T(object): def hi( self, *args ): pass class TA( T ): def hi( self, a,b,c ): pass class TB( T ): def hi( self, y ): pass def pypy_entrypoint(): t = T() ta = TA() tb = TB() ta.hi(1,2,'x') tb.hi() tb.hi('xxx') print 'too complex test' the above translates just fine, no TooComplex error. 
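Pulling together the two fixes suggested earlier in this thread for the demotion error (a dummy method on the common superclass, and an isinstance assert to narrow the annotation before calling a subclass-only method), here is a small RPython-style sketch; the class and method names are invented for illustration and are not taken from the code discussed above.

    class Base(object):
        def name(self):                # dummy version so every value the
            return "base"              # annotator sees has this method

    class Mesh(Base):
        def name(self):
            return "mesh"
        def vertex_count(self):        # exists only on the subclass
            return 8

    def load(kind):
        if kind == "mesh":
            return Mesh()              # result is annotated as Base-or-Mesh
        return Base()

    def entry_point(argv):
        kind = "base"
        if len(argv) > 1:
            kind = argv[1]
        obj = load(kind)
        print obj.name()               # fine: declared on the superclass
        if kind == "mesh":
            assert isinstance(obj, Mesh)   # narrows the annotation to Mesh
            print obj.vertex_count()       # now allowed
        return 0

    def target(driver, args):
        return entry_point, None

Without the assert (or without the dummy name() on the base class) the annotator reports exactly the kind of demotion error described above.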
-brett --- On Wed, 8/4/10, Maciej Fijalkowski wrote: > From: Maciej Fijalkowski > Subject: Re: [pypy-dev] demoting method, cannot follow, call result degenerated > To: "Hart's Antler" > Cc: pypy-dev at codespeak.net > Date: Wednesday, 4 August, 2010, 1:25 AM > Hey. > > If at any place in code you want to call methods on a thing > that can't > be proven to be of a specific subclass, they have to be > defined on a > superclass (even dummy versions). > > If you are however sure that this object will be of a > specific subclass, write: > assert isinstance(x, MySubclass) > x.specific_method > > that's fine > > On Wed, Aug 4, 2010 at 10:21 AM, Hart's Antler > wrote: > > I'm still struggling to learn all the rules of > RPython, i have read the coding guide, and the PDF's PyGirl > and Ancona's RPython paper, but still i feel i'm not fully > grasping everything. > > > > I have a function that returns different classes that > all share a common base class. ?It works until i introduce > a new subclass that has some methods of the same name. > ?Then i get the demotion, can not follow, degenerated > error. > > > > I googled, but all i can find is an IRC log where > Fijal seems to taking talking about my problem. > > http://www.tismer.com/pypy/irc-logs/pypy/%23pypy.log.20070125 > > > > pedronis: if function can return (in > rpython) set of classes with common superclass, than all > methods that I call later must be defined on that > superclass, right? > > > > [11:30] [15:01] yes, > unless you assert a specific subclass > > > > So i just need to use an assert statement before the > function return, and assert the class i am returning? > > > > I am blogging about my progress while learning > RPython, i have posted about meta-programming in Rpython > which is a new concept to me. > > > > http://pyppet.blogspot.com/2010/08/meta-programming-in-rpython.html > > > > -brett > > > > > > > > _______________________________________________ > > pypy-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/pypy-dev > > > From kevinar18 at hotmail.com Fri Aug 6 04:30:27 2010 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Thu, 5 Aug 2010 22:30:27 -0400 Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? In-Reply-To: References: , , , Message-ID: Note: Gabriel, do you think we should discuss this on another mailing list (or in private) as I'm not sure this related to PyPy dev anymore? Anywyas, what are your future plans for the project? Is it just an experiment for school ... maybe in the hopes that others would maintaining it if it was found to be interesting? ... are you planning actual future development, maintenance, promotion of it yourself? ----------- On a personal note... the concept has a lot of similarities to what I am exploring. However, I would have to make so many additional modifications. Perhaps you can give some thoughts on whether it would take me a long time to add such things? Some examples: * Two additional message passing styles (in addition to your own) Queues - multiple tasklets can push onto queue, only one tasklet can pop.... multiple tasklets can access the property to find out if there is any data in the queue. Queues can be set to an infite size or set with a max # of entries allowed. Streams - I'm not sure of the exact name, but kind of like an infinite stream/buffer ... useful for passing infinite amounts of data. Only one tasklet can write/add data. Only one tasklet can read/extract data. 
* Message passing When you create a tasklet, you assign a set number of queues or streams to it (it can have many) and whether they extract data from them or write to them (they can only either extract or write to it as noted above). The tasklet's global namespace has access to these queues or streams and can extract or add data to them. In my case, I look at message passing from the perspective of the tasklet. A tasklet can either be assigned a certain number of "in ports" and a certain number of "out ports." In this case the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() part of a queue or stream. * Scheduler For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that I would look at all the "in ports" that a tasklet has and make sure each one has some data. Only then would the tasklet be scheduled to run by the scheduler. ------------ On another note, I am curious how you handled the issue of "nested" objects. Consider send() and receive() that you use to pass objects around in your project. Am I correct in that these objects cannot contain references outside of themselves? Also, how do you handle extracting out of the tree and making sure there are not references outside the object? For example, consider the following object, where "->" means it has a reference to that object Object 1 -> Object 2 Object 2 -> Object 3 Object 2 -> Object 4 Object 4 -> Object 2 Now, let's say I have a tasklet like the following: .... -> incoming data = pointer/reference to Object 1 1. read incoming data (get Object 1 reference) 2. remove Object 3 3. send Object 3 to tasklet B 4. send Object 1 to tasklet C Result: tasklet B now has this object: pointer/reference to Object 1, which contains the following tree: Object 1 -> Object 2 Object 2 -> Object 4 Object 4 -> Object 2 tasklet C now has this object: pointer/reference to Object 3, which contains the following tree: Object 3 On the other hand, consider the following scenario: 1. read incoming data (get Object 1 reference) 2. remove Object 4 ERROR: this would not be possible, as it refers to Object 2 > Sorry for the late answer, I was unavailable in the last few days. > > About send() and receive(), it depends on if the communication is local > or not. For a local communication, anything can be passed since only > the reference is sent. This is the base model for Stackless channels. > For a remote communication (between two interpreters), any picklable > object (a copy will then be made) and it includes channels and tasklets > (for which a reference will automatically be created). > > The use of the PyPy proxy object space is to make remote communication > more Stackless like by passing object by reference. If a ref_object is > made, only a reference will be passed when a tasklet is moved or the > object is sent on a channel. The object always resides where it was > created. A move() operation will also be implemented on those objects > so they can be moved around like tasklets. > > I hope it helps, > > Gabriel > > 2010/7/29 Kevin Ar18> > >> Hello Kevin, >> I don't know if it can be a solution to your problem but for my >> Master Thesis I'm working on making Stackless Python distributed. What >> I did is working but not complete and I'm right now in the process of >> writing the thesis (in french unfortunately). My code currently works >> with PyPy's "stackless" module onlyis and use some PyPy specific >> things. 
Here's what I added to Stackless: >> >> - Possibility to move tasklets easily (ref_tasklet.move(node_id)). A >> node is an instance of an interpreter. >> - Each tasklet has its global namespace (to avoid sharing of data). The >> state is also easier to move to another interpreter this way. >> - Distributed channels: All requests are known by all nodes using the >> channel. >> - Distributed objets: When a reference is sent to a remote node, the >> object is not copied, a reference is created using PyPy's proxy object >> space. >> - Automated dependency recovery when an object or a tasklet is loaded >> on another interpreter >> >> With a proper scheduler, many tasklets could be automatically spread in >> multiple interpreters to use multiple cores or on multiple computers. A >> bit like the N:M threading model where N lightweight threads/coroutines >> can be executed on M threads. > > Was able to have a look at the API... > If others don't mind my asking this on the mailing list: > > * .send() and .receive() > What type of data can you send and receive between the tasklets? Can > you pass entire Python objects? > > * .send() and .receive() memory model > When you send data between tasklets (pass messages) or whateve you want > to call it, how is this implemented under the hood? Does it use shared > memory under the hood or does it involve a more costly copying of the > data? I realize that if it is on another machine you have to copy the > data, but what about between two threads? You mentioned PyPy's proxy > object.... guess I'll need to read up on that. > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > > > > -- > Gabriel Lavoie > glavoie at gmail.com From glavoie at gmail.com Fri Aug 6 05:31:15 2010 From: glavoie at gmail.com (Gabriel Lavoie) Date: Thu, 5 Aug 2010 23:31:15 -0400 Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? In-Reply-To: References: Message-ID: I don't mind replying to the mailing list unless it annoys someone? Maybe some people could be interested by this discussion. You have a lot of questions! :) My answers are inline. 2010/8/5 Kevin Ar18 > > Note: Gabriel, do you think we should discuss this on another mailing list > (or in private) as I'm not sure this related to PyPy dev anymore? > > > Anywyas, what are your future plans for the project? > Is it just an experiment for school ... maybe in the hopes that others > would maintaining it if it was found to be interesting? > ... > are you planning actual future development, maintenance, promotion of it > yourself? > Based on the interest and time I'll and and other people will have I plan to debug this as much as possible. If people are interested to join in after my thesis, I'll be more than open to welcome then in the project. Right now, I'm writing my report and I'm also looking for a job. I won't have much time to touch again to the code before next month to prepare it for my presentation, along with a lot of examples and use cases. > > > ----------- > > On a personal note... the concept has a lot of similarities to what I am > exploring. However, I would have to make so many additional modifications. > Perhaps you can give some thoughts on whether it would take me a long time > to add such things? > Allright, my plan was to make all the needed lower level constructs that can be used to build more complex things. 
For example, a mix of tasklet and sync channels could be wrapped in an API to create async channels. I know this is far from complete and I have a few ideas on how it could be improved in the future but it's currently not needed for my project. For now, the idea was to stay as close as possible to standard Stackless Python and only add the needed APIs and functionalities to support distributing tasklets between multiple interpreters. > > Some examples: > > * Two additional message passing styles (in addition to your own) > Queues - multiple tasklets can push onto queue, only one tasklet can > pop.... multiple tasklets can access the property to find out if there is > any data in the queue. Queues can be set to an infite size or set with a max > # of entries allowed. This could easily be implemented using a standard channel and by starting multiple tasklets to send data. With some helper methods on a channel it could be possible to know how many tasklets are waiting to send their data. A channel already have a built-in queue for send/receive requests. This queue contains a list of all tasklets waiting for a send/receive operation. Tasklets are supposed to be lightweight enough to support something like this. > Streams - I'm not sure of the exact name, but kind of like an infinite > stream/buffer ... useful for passing infinite amounts of data. Only one > tasklet can write/add data. Only one tasklet can read/extract data. > Like a UNIX pipe()? Async? Again, some code wrapping standard channels could be used for this. > > > * Message passing > When you create a tasklet, you assign a set number of queues or streams to > it (it can have many) and whether they extract data from them or write to > them (they can only either extract or write to it as noted above). The > tasklet's global namespace has access to these queues or streams and can > extract or add data to them. > > In my case, I look at message passing from the perspective of the tasklet. > A tasklet can either be assigned a certain number of "in ports" and a > certain number of "out ports." In this case the "in ports" are the .read() > end of a queue or stream and the "out ports" are the .send() part of a queue > or stream. > > Sorry, I don't really understand what you're trying to explain here. Maybe an example could be helpful? :) > > * Scheduler > For the scheduler, I would need to control when a tasklet runs. Currently, > I am thinking that I would look at all the "in ports" that a tasklet has and > make sure each one has some data. Only then would the tasklet be scheduled > to run by the scheduler. > > Couldn't all those ports (channels) be read one at a time, then the processing could be done? I don't exactly see the need to play with the scheduler. Channels are blocking. A tasklet will be anyway unscheduled when it tries to read on a channel in which no data is available. > > > ------------ > On another note, I am curious how you handled the issue of "nested" > objects. Consider send() and receive() that you use to pass objects around > in your project. Am I correct in that these objects cannot contain > references outside of themselves? Also, how do you handle extracting out of > the tree and making sure there are not references outside the object? > Right now, I did not really dig too far with this problem. With a local communication, a reference to the object is sent through a channel. The receiver tasklet will have the same access to the object and all the sub-object as the sender tasklet. 
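Going back to the queue idea a few paragraphs up, here is a rough sketch of the kind of wrapper suggested there: a multi-producer queue layered on top of a plain Stackless channel. The class and its methods are invented for illustration (they are not part of Stackless or of the distributed-Stackless prototype), and the bounded/max-size case is left out.

    import stackless

    class ChannelQueue(object):
        """Many tasklets may push(); pop() blocks only while empty."""
        def __init__(self):
            self._items = []
            self._channel = stackless.channel()

        def push(self, item):
            if self._channel.balance < 0:   # a consumer is already blocked
                self._channel.send(item)    # hand the item over directly
            else:
                self._items.append(item)    # otherwise just buffer it

        def pop(self):
            if self._items:
                return self._items.pop(0)
            return self._channel.receive()  # block until someone pushes

        def __len__(self):                  # the "is there any data?" check
            return len(self._items)

A bounded variant could make push() block on a second channel once len(self._items) reaches the limit, which would give the max-entries behaviour asked about above.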
For remote communications, pickling is involved. The object to send must be picklable. It excludes any I/O object unless the programmer creates its own pickling protocol for those. A copy of all the object tree will then be made. Sometime it's good (small objects), sometime it's bad (really complex, big objects, I/O objects, etc.). This is why I added the concept of ref_object() using PyPy's proxy object space. For such objects, a proxy can be made and only a reference object will be sent to the remote side. This object will have the same type as the original object but all operations will be forwarded to the host node. All replies will also be wrapped by proxies when sent back to the remote reference object. The only case where a proxy object is not created is with atomic types (string, int, float, etc). It's useless for those because they are immutable anyway. A remote access to those would introduce useless latency. With ref_object(), the object tree always stay on the initial node. A move() operation will also be added to those ref_object()s to be able to move them between interpreters if needed. > > For example, consider the following object, where "->" means it has a > reference to that object > > Object 1 -> Object 2 > > Object 2 -> Object 3 Object 2 -> Object 4 > Object 4 -> Object 2 > > > Now, let's say I have a tasklet like the following: > > .... -> incoming data = pointer/reference to Object 1 > > 1. read incoming data (get Object 1 reference) > 2. remove Object 3 > 3. send Object 3 to tasklet B > 4. send Object 1 to tasklet C > > Result: > tasklet B now has this object: > pointer/reference to Object 1, which contains the following tree: Object 1 -> Object 2 > Object 2 -> Object 4 > Object 4 -> Object 2 > > > tasklet C now has this object: > pointer/reference to Object 3, which contains the following tree: > Object 3 > > I think you swapped tasklet B and tasklet C for the end result! ;) > > > On the other hand, consider the following scenario: > > 1. read incoming data (get Object 1 reference) > 2. remove Object 4 > ERROR: this would not be possible, as it refers to Object 2 > Why isn't it possible? By removing "Object 4" I guess you mean removing this link: Object 2 -> Object 4? This is the only way Object 4 could be removed. > > > Sorry for the late answer, I was unavailable in the last few days. > > > > About send() and receive(), it depends on if the communication is local > > or not. For a local communication, anything can be passed since only > > the reference is sent. This is the base model for Stackless channels. > > For a remote communication (between two interpreters), any picklable > > object (a copy will then be made) and it includes channels and tasklets > > (for which a reference will automatically be created). > > > > The use of the PyPy proxy object space is to make remote communication > > more Stackless like by passing object by reference. If a ref_object is > > made, only a reference will be passed when a tasklet is moved or the > > object is sent on a channel. The object always resides where it was > > created. A move() operation will also be implemented on those objects > > so they can be moved around like tasklets. > > > > I hope it helps, > > > > Gabriel > > > > 2010/7/29 Kevin Ar18> > > > >> Hello Kevin, > >> I don't know if it can be a solution to your problem but for my > >> Master Thesis I'm working on making Stackless Python distributed. 
What > >> I did is working but not complete and I'm right now in the process of > >> writing the thesis (in french unfortunately). My code currently works > >> with PyPy's "stackless" module onlyis and use some PyPy specific > >> things. Here's what I added to Stackless: > >> > >> - Possibility to move tasklets easily (ref_tasklet.move(node_id)). A > >> node is an instance of an interpreter. > >> - Each tasklet has its global namespace (to avoid sharing of data). The > >> state is also easier to move to another interpreter this way. > >> - Distributed channels: All requests are known by all nodes using the > >> channel. > >> - Distributed objets: When a reference is sent to a remote node, the > >> object is not copied, a reference is created using PyPy's proxy object > >> space. > >> - Automated dependency recovery when an object or a tasklet is loaded > >> on another interpreter > >> > >> With a proper scheduler, many tasklets could be automatically spread in > >> multiple interpreters to use multiple cores or on multiple computers. A > >> bit like the N:M threading model where N lightweight threads/coroutines > >> can be executed on M threads. > > > > Was able to have a look at the API... > > If others don't mind my asking this on the mailing list: > > > > * .send() and .receive() > > What type of data can you send and receive between the tasklets? Can > > you pass entire Python objects? > > > > * .send() and .receive() memory model > > When you send data between tasklets (pass messages) or whateve you want > > to call it, how is this implemented under the hood? Does it use shared > > memory under the hood or does it involve a more costly copying of the > > data? I realize that if it is on another machine you have to copy the > > data, but what about between two threads? You mentioned PyPy's proxy > > object.... guess I'll need to read up on that. > > _______________________________________________ > > pypy-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/pypy-dev > > > > > > > > -- > > Gabriel Lavoie > > glavoie at gmail.com > By the way, if you come to #pypy on FreeNode, I'm WildChild! I'm always there though not alway available. I'm in the EST timezone (UTC-5). See ya, Gabriel -- Gabriel Lavoie glavoie at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From bhartsho at yahoo.com Tue Aug 10 08:32:31 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Mon, 9 Aug 2010 23:32:31 -0700 (PDT) Subject: [pypy-dev] rstruct where is pack? Message-ID: <250182.36782.qm@web114016.mail.gq1.yahoo.com> Seems like struct.pack is not RPython? I see the examples for unpack in the tests folder, but not for packing. From benjamin at python.org Tue Aug 10 15:14:04 2010 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 10 Aug 2010 08:14:04 -0500 Subject: [pypy-dev] rstruct where is pack? In-Reply-To: <250182.36782.qm@web114016.mail.gq1.yahoo.com> References: <250182.36782.qm@web114016.mail.gq1.yahoo.com> Message-ID: 2010/8/10 Hart's Antler : > Seems like struct.pack is not RPython? ?I see the examples for unpack in the tests folder, but not for packing. struct.pack() is implemented in pypy/module/rstruct/. -- Regards, Benjamin From anto.cuni at gmail.com Tue Aug 10 15:32:29 2010 From: anto.cuni at gmail.com (Antonio Cuni) Date: Tue, 10 Aug 2010 15:32:29 +0200 Subject: [pypy-dev] rstruct where is pack? 
In-Reply-To: References: <250182.36782.qm@web114016.mail.gq1.yahoo.com> Message-ID: <4C6154ED.1070904@gmail.com> On 10/08/10 15:14, Benjamin Peterson wrote: > 2010/8/10 Hart's Antler : >> Seems like struct.pack is not RPython? I see the examples for unpack in the tests folder, but not for packing. > > struct.pack() is implemented in pypy/module/rstruct/. I suppose you mean pypy/module/struct. But if the OP is looking for an rpython lib to use in his rpython program, this is not exactly what he looks for, although I agree it could be adapted and ported to rlib. ciao, Anto From bhartsho at yahoo.com Wed Aug 11 03:08:51 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Tue, 10 Aug 2010 18:08:51 -0700 (PDT) Subject: [pypy-dev] rstruct where is pack? In-Reply-To: <4C6154ED.1070904@gmail.com> Message-ID: <951074.37494.qm@web114002.mail.gq1.yahoo.com> I have made a RPython replacement for struct pack/unpack that could go in rlib. It is not a drop in replacement, and for some reason i can't get long to work, but for simple packing and unpacking it will work. Posted the code on my blog if anybody ever runs into the same problem: http://pyppet.blogspot.com/2010/08/rpython-struct.html --- On Tue, 8/10/10, Antonio Cuni wrote: > From: Antonio Cuni > Subject: Re: [pypy-dev] rstruct where is pack? > To: "Benjamin Peterson" > Cc: "Hart's Antler" , pypy-dev at codespeak.net > Date: Tuesday, 10 August, 2010, 6:32 AM > On 10/08/10 15:14, Benjamin Peterson > wrote: > > 2010/8/10 Hart's Antler : > >> Seems like struct.pack is not RPython?? I see > the examples for unpack in the tests folder, but not for > packing. > > > > struct.pack() is implemented in pypy/module/rstruct/. > > I suppose you mean pypy/module/struct. > > But if the OP is looking for an rpython lib to use in his > rpython program, > this is not exactly what he looks for, although I agree it > could be adapted > and ported to rlib. > > ciao, > Anto > From arigo at tunes.org Wed Aug 11 14:20:31 2010 From: arigo at tunes.org (Armin Rigo) Date: Wed, 11 Aug 2010 14:20:31 +0200 Subject: [pypy-dev] I broke stackless In-Reply-To: References: Message-ID: <20100811122031.GA2733@code0.codespeak.net> Hi, For reference, after IRC discussions I fixed it in r76475. Armin From arigo at tunes.org Wed Aug 11 14:24:20 2010 From: arigo at tunes.org (Armin Rigo) Date: Wed, 11 Aug 2010 14:24:20 +0200 Subject: [pypy-dev] Percentage Python as RPython. In-Reply-To: <805862.40025.qm@web114009.mail.gq1.yahoo.com> References: <805862.40025.qm@web114009.mail.gq1.yahoo.com> Message-ID: <20100811122420.GB2733@code0.codespeak.net> Hi Hart, On Wed, Aug 04, 2010 at 06:04:00PM -0700, Hart's Antler wrote: > I think you want a frontend, not a backend. The frontend would take > in normal Python and convert it to RPython. I think the chances of getting this to work are "0 - 0.5 %", as per fijal's previous excellent answer. Writing in RPython requires a different state of mind than writing in normal Python (unless, maybe, you are a Java programmer that writes Java with the Python syntax; for that case, I would suggest that writing in Java in the first place is just as easy). A bientot, Armin. From stefan_ml at behnel.de Thu Aug 12 08:49:09 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 12 Aug 2010 08:49:09 +0200 Subject: [pypy-dev] What can Cython do for PyPy? Message-ID: Hi, there has recently been a move towards a .NET/IronPython port of Cython, mostly driven by the need for a fast NumPy port. 
During the related discussion, the question came up how much it would take to let Cython also target other runtimes, including PyPy. Given that PyPy already has a CPython C-API compatibility layer, I doubt that it would be hard to enable that. With my limited knowledge about the internals of that layer, I guess the question thus becomes: is there anything Cython could do to the C code it generates that would make the Cython generated extension modules run faster/better/safer on PyPy than they would currently? I never tried to make a Cython module actually run on PyPy (simply because I don't use PyPy), but I have my doubts that they'd run perfectly out of the box. While generally portable, I'm pretty sure the C code relies on some specific internals of CPython that PyPy can't easily (or efficiently) provide. Stefan From fijall at gmail.com Thu Aug 12 10:05:01 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 12 Aug 2010 10:05:01 +0200 Subject: [pypy-dev] What can Cython do for PyPy? In-Reply-To: References: Message-ID: Hi Stefan. CPython extension compatibility layer is in alpha at best. I heavily doubt that anything would run out of the box. However, this is a cpython compatiblity layer anyway, it's not meant to be used as a long term solutions. First of all it's inneficient (and unclear if will ever be), but it's also unjitable. This means that to JIT, cpython extension is like a black box which should not be touched. Also, several concepts, like refcounting are completely alien to pypy and emulated. For example for numpy, I think a rewrite is necessary to make it fast (and as experiments have shown, it's possible to make it really fast), so I would not worry about using cython for speeding things up. In theory you should not need it and the boundary layer between cython-compiled code and JITted code would make you suffer anyway. There is another usecase for using cython for providing access to C libraries. This is a bit harder question and I don't have a good answer for that, but maybe cpython compatibility layer would be good enough in this case? I can't see how Cython can produce a "native" C code instead of CPython C code without some major effort. Cheers, fijal On Thu, Aug 12, 2010 at 8:49 AM, Stefan Behnel wrote: > Hi, > > there has recently been a move towards a .NET/IronPython port of Cython, > mostly driven by the need for a fast NumPy port. During the related > discussion, the question came up how much it would take to let Cython also > target other runtimes, including PyPy. > > Given that PyPy already has a CPython C-API compatibility layer, I doubt > that it would be hard to enable that. With my limited knowledge about the > internals of that layer, I guess the question thus becomes: is there > anything Cython could do to the C code it generates that would make the > Cython generated extension modules run faster/better/safer on PyPy than > they would currently? I never tried to make a Cython module actually run on > PyPy (simply because I don't use PyPy), but I have my doubts that they'd > run perfectly out of the box. While generally portable, I'm pretty sure the > C code relies on some specific internals of CPython that PyPy can't easily > (or efficiently) provide. 
> > Stefan > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From stefan_ml at behnel.de Thu Aug 12 11:25:18 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 12 Aug 2010 11:25:18 +0200 Subject: [pypy-dev] What can Cython do for PyPy? In-Reply-To: References: Message-ID: Maciej Fijalkowski, 12.08.2010 10:05: > On Thu, Aug 12, 2010 at 8:49 AM, Stefan Behnel wrote: >> there has recently been a move towards a .NET/IronPython port of Cython, >> mostly driven by the need for a fast NumPy port. During the related >> discussion, the question came up how much it would take to let Cython also >> target other runtimes, including PyPy. >> >> Given that PyPy already has a CPython C-API compatibility layer, I doubt >> that it would be hard to enable that. With my limited knowledge about the >> internals of that layer, I guess the question thus becomes: is there >> anything Cython could do to the C code it generates that would make the >> Cython generated extension modules run faster/better/safer on PyPy than >> they would currently? I never tried to make a Cython module actually run on >> PyPy (simply because I don't use PyPy), but I have my doubts that they'd >> run perfectly out of the box. While generally portable, I'm pretty sure the >> C code relies on some specific internals of CPython that PyPy can't easily >> (or efficiently) provide. > > CPython extension compatibility layer is in alpha at best. I heavily > doubt that anything would run out of the box. However, this is a > cpython compatiblity layer anyway, it's not meant to be used as a long > term solutions. First of all it's inneficient (and unclear if will > ever be) If you only use it to call into non-trivial Cython code (e.g. some heavy calculations on NumPy tables), the call overhead should be mostly negligible, maybe even close to that in CPython. You could even provide some kind of fast-path to 'cpdef' functions (i.e. functions that are callable from both C and Python) and 'api' functions (which are currently exported at the module API level using the PyCapsule mechanism). That would reduce the call overhead to that of a C call. Then, a lot of Cython code doesn't do much ref-counting and the like but simply runs in plain C. So, often enough, there won't be that much overhead involved in the code itself either, especially in tight loops where users prune away all CPython interaction anyway. > but it's also unjitable. This means that to JIT, cpython > extension is like a black box which should not be touched. Well, unless both sides learn about each other, that is. It won't necessarily impact the JIT, but then again, a JIT usually won't have a noticeable impact on the performance of Cython code anyway. > Also, several concepts, like refcounting are completely alien to pypy > and emulated. Sure. That's why I asked if there is anything that Cython can help to improve here. For example, the code it generates for INCREF/DECREF operations is not only configurable at the C preprocessor level. > For example for numpy, I think a rewrite is necessary to make it fast > (and as experiments have shown, it's possible to make it really fast), > so I would not worry about using cython for speeding things up. This isn't only about making things fast when being rewritten. This is also about accessing and reusing existing code in a new environment. 
Cython is becoming increasingly popular in the numerics community, and a lot of Cython code is being written as we speak, not only in the SciPy/NumPy environment. People even find it attractive enough to start rewriting their CPython extension modules (most often library wrappers) from C in Cython, both for performance and TCO reasons. > There is another usecase for using cython for providing access to C > libraries. This is a bit harder question and I don't have a good > answer for that, but maybe cpython compatibility layer would be good > enough in this case? I can't see how Cython can produce a "native" C > code instead of CPython C code without some major effort. Native (standalone) C code isn't the goal, just something that adapts well to what PyPy can provide as a CPython compatibility layer. If Cython modules work across independent Python implementations, that would be the most simple way by far to make lots of them available cross-platform, thus making it a lot simpler to switch between different implementations. Stefan From santagada at gmail.com Thu Aug 12 16:31:01 2010 From: santagada at gmail.com (Leonardo Santagada) Date: Thu, 12 Aug 2010 11:31:01 -0300 Subject: [pypy-dev] What can Cython do for PyPy? In-Reply-To: References: Message-ID: <303CCB23-07C0-4D0F-90D2-0DD908DB4043@gmail.com> On Aug 12, 2010, at 3:49 AM, Stefan Behnel wrote: > Hi, > > there has recently been a move towards a .NET/IronPython port of Cython, > mostly driven by the need for a fast NumPy port. During the related > discussion, the question came up how much it would take to let Cython also > target other runtimes, including PyPy. > > Given that PyPy already has a CPython C-API compatibility layer, I doubt > that it would be hard to enable that. With my limited knowledge about the > internals of that layer, I guess the question thus becomes: is there > anything Cython could do to the C code it generates that would make the > Cython generated extension modules run faster/better/safer on PyPy than > they would currently? I never tried to make a Cython module actually run on > PyPy (simply because I don't use PyPy), but I have my doubts that they'd > run perfectly out of the box. While generally portable, I'm pretty sure the > C code relies on some specific internals of CPython that PyPy can't easily > (or efficiently) provide. A possible solution I think would be to do an oo backend for cython. That could be made to generate C# or RPython code. The problem remains that pypy still doesn't have separate compilation so you cannot make a external module for the pypy interpreter after it is translated. So it is hard, maybe harder than anyone on cython would like, but I still think it is a good solution. (Unless I'm mistaken in any of my assumptions, and then it is a terrible solution :) -- Leonardo Santagada santagada at gmail.com From p.giarrusso at gmail.com Thu Aug 12 17:35:40 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Thu, 12 Aug 2010 17:35:40 +0200 Subject: [pypy-dev] What can Cython do for PyPy? In-Reply-To: References: Message-ID: I agree with the motivations given by Stefan - two interesting possibilities would be to: a) first, test the compatibility layer with Cython generated code b) possibly, allow users to use the Python API while replacing refcounting with another, more meaningful, PyPy-specific API* for a garbage collected heap. However, such an API is radically different. I'm also not sure how well such an API would mesh with the CPython API, actually. 
If Cython could support such an API, that would be great. But I'm unsure whether this is worth it, for Cython, and more in general for other modules (one could easily and elegantly support both CPython and PyPy with preprocessor tricks). See further below about why call overhead is not the biggest performance problem when not inlining. * I thought the Java Native Interface (JNI) design of local and global references (http://download.oracle.com/javase/6/docs/technotes/guides/jni/spec/design.html#wp16785) would work here, with some adaptation. However, if your moving GCs support pinning of objects, as I expect to be necessary to interact with CPython code, I would do an important change to that API: instead of having object references be pointers to (movable by the GC) pointers to objects, like in the JNI API, PyPy should use plain pinned pointers. The pinning would not be apparent in the type, but that should be fine I guess. Problems arise when PyPy-aware code calls code which still uses the refcounting API. It is mostly safe to ignore the refcounting (even decreases) for local references, but I'm unsure about persistent references, even if it's probably still the best solution, so that the PyPy-aware code handles the lifecycle by itself. On Thu, Aug 12, 2010 at 11:25, Stefan Behnel wrote: > Maciej Fijalkowski, 12.08.2010 10:05: >> On Thu, Aug 12, 2010 at 8:49 AM, Stefan Behnel wrote: > If you only use it to call into non-trivial Cython code (e.g. some heavy > calculations on NumPy tables), the call overhead should be mostly > negligible, maybe even close to that in CPython. You could even provide > some kind of fast-path to 'cpdef' functions (i.e. functions that are > callable from both C and Python) and 'api' functions (which are currently > exported at the module API level using the PyCapsule mechanism). That would > reduce the call overhead to that of a C call. >> but it's also unjitable. This means that to JIT, cpython >> extension is like a black box which should not be touched. > Well, unless both sides learn about each other, that is. It won't > necessarily impact the JIT, but then again, a JIT usually won't have a > noticeable impact on the performance of Cython code anyway. Call overhead is not the biggest problem, I guess (well, if it's bigger than that in C, it might be); it's IMHO the minor problem when you can't inline. Inlining is important because it allows to do more optimizations on the combined code. Now, it might or might not apply to your typical use cases (present and future), you should just keep this issue in mind, too. Whenever you say "If you only use it to call into non-trivial Cython code", you imply that some kind of functional abstraction, the one where you write short functions, such as accessors, are not efficiently supported. For instance, if you call two functions, each containing a parallel for loops, fusing the loops requires inlining the functions to expose the loops. Inlining accessors (getters and setters) allows to recognize that they often don't need to be called over and over again, i.e., common subexpression elimination, which you can't do on a normal (impure) function. To make a particularly dramatic example (since it comes from C) of a quadratic-to-linear optimization: a loop like for (i = 0; i < strlen(s); i++) { //do something on s without modifying it } takes quadratic time, because strlen takes linear time and is called at each loop. Can the optimizer fix this? 
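For illustration, here is the same pitfall transliterated into Python, a sketch added for clarity rather than part of the original message; expensive_length is a made-up stand-in for any pure but linear-time helper such as C's strlen:

def expensive_length(s):
    # deliberately linear-time, like strlen() walking to the terminating NUL
    n = 0
    for _ in s:
        n += 1
    return n

def shout_quadratic(s):
    out = []
    i = 0
    while i < expensive_length(s):   # re-evaluated on every iteration: O(n**2) overall
        out.append(s[i].upper())
        i += 1
    return ''.join(out)

def shout_linear(s):
    out = []
    n = expensive_length(s)          # hoisted by hand: safe only because s never changes below
    i = 0
    while i < n:
        out.append(s[i].upper())
        i += 1
    return ''.join(out)

A compiler can only turn the first form into the second automatically if it can prove that expensive_length is pure and that s is not modified inside the loop, which is exactly the kind of information inlining exposes.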
The simplest way for it is to inline everything, then it could notice that calculating strlen only once is safe. In C with GCC extensions, one could annotate strlen as pure, and use functions which take s as a const parameter (but I'm unsure if it actually works). In Python (and even in Java), anything such should work without annotations. Of course, one can't rely on this quadratic-linear optimization unless it's guaranteed to work (like tail call elimination), so I wouldn't do it in this case; this point relates to the wider issue of unreliable optimizations and "sufficiently smart compilers", better discussed at http://prog21.dadgum.com/40.html (not mine). -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From jbaker at zyasoft.com Thu Aug 12 17:41:39 2010 From: jbaker at zyasoft.com (Jim Baker) Date: Thu, 12 Aug 2010 09:41:39 -0600 Subject: [pypy-dev] What can Cython do for PyPy? In-Reply-To: References: Message-ID: [crossposting to jython-dev] Because of some conversations I had with Maciej (mostly at Folsom Coffee in Boulder :) ), we are considering adding support for the CPython C-Extension API for Jython, modeling what has already been done in PyPy and IronPython. Although I think it may make a lot of sense to port NumPy to Java, and have argued for it in the past, being pragmatic suggests it's better to work with the tide of NumPy/Cython than against it. Also, this can bring in a large swath of existing libraries to work with Jython, including those coded against SWIG, at the cost that it will not run under most security manager policies. I think that's a reasonable tradeoff. Similar concerns that Maciej raises apply to Jython. No Java JIT will inline such native code, marshaling from the Java domain to the native one will be expensive, etc. But this is (mostly) true of Jython today, from Python code to Java (although invokedynamic will at least reduce some of those costs). But users can still take advantage of Java to achieve much better performance from Jython, if they are careful about structuring the execution of their code. At the end of the day, Jython to C code, including that produced by Cython should see a similar performance profile to CPython to C code, as long as they don't hammer the INCREF/DECREF *functions*. (JRuby is implementing something similar, and we probably can borrow their "refcounting" support.) But of course that's exactly what one needs to avoid to write performant extension code anyway in CPython, at least if it's to be multithreaded. One interesting part of this discussion is whether we can support lock eliding. This is one part of JIT inlining that you don't want to give up for multithreaded performance. Rather than having C code callback into Java to release the GIL (which is only global for such C code!), it would be better to have a marker on the C code that allows for immediate release, or perhaps some other inlinable Java stub. I could imagine this could be readily supported by Cython (and perhaps already is). Lastly, I want to emphasize again that if/when Jython adds support for the C extension API, the "GIL" and "refcounting" support will only be for such C code! We like our concurrency support and we are not giving it up :) - Jim On Thu, Aug 12, 2010 at 3:25 AM, Stefan Behnel wrote: > Maciej Fijalkowski, 12.08.2010 10:05: > > On Thu, Aug 12, 2010 at 8:49 AM, Stefan Behnel wrote: > >> there has recently been a move towards a .NET/IronPython port of Cython, > >> mostly driven by the need for a fast NumPy port. 
During the related > >> discussion, the question came up how much it would take to let Cython > also > >> target other runtimes, including PyPy. > >> > >> Given that PyPy already has a CPython C-API compatibility layer, I doubt > >> that it would be hard to enable that. With my limited knowledge about > the > >> internals of that layer, I guess the question thus becomes: is there > >> anything Cython could do to the C code it generates that would make the > >> Cython generated extension modules run faster/better/safer on PyPy than > >> they would currently? I never tried to make a Cython module actually run > on > >> PyPy (simply because I don't use PyPy), but I have my doubts that they'd > >> run perfectly out of the box. While generally portable, I'm pretty sure > the > >> C code relies on some specific internals of CPython that PyPy can't > easily > >> (or efficiently) provide. > > > > CPython extension compatibility layer is in alpha at best. I heavily > > doubt that anything would run out of the box. However, this is a > > cpython compatiblity layer anyway, it's not meant to be used as a long > > term solutions. First of all it's inneficient (and unclear if will > > ever be) > > If you only use it to call into non-trivial Cython code (e.g. some heavy > calculations on NumPy tables), the call overhead should be mostly > negligible, maybe even close to that in CPython. You could even provide > some kind of fast-path to 'cpdef' functions (i.e. functions that are > callable from both C and Python) and 'api' functions (which are currently > exported at the module API level using the PyCapsule mechanism). That would > reduce the call overhead to that of a C call. > > Then, a lot of Cython code doesn't do much ref-counting and the like but > simply runs in plain C. So, often enough, there won't be that much overhead > involved in the code itself either, especially in tight loops where users > prune away all CPython interaction anyway. > > > > but it's also unjitable. This means that to JIT, cpython > > extension is like a black box which should not be touched. > > Well, unless both sides learn about each other, that is. It won't > necessarily impact the JIT, but then again, a JIT usually won't have a > noticeable impact on the performance of Cython code anyway. > > > > Also, several concepts, like refcounting are completely alien to pypy > > and emulated. > > Sure. That's why I asked if there is anything that Cython can help to > improve here. For example, the code it generates for INCREF/DECREF > operations is not only configurable at the C preprocessor level. > > > > For example for numpy, I think a rewrite is necessary to make it fast > > (and as experiments have shown, it's possible to make it really fast), > > so I would not worry about using cython for speeding things up. > > This isn't only about making things fast when being rewritten. This is also > about accessing and reusing existing code in a new environment. Cython is > becoming increasingly popular in the numerics community, and a lot of > Cython code is being written as we speak, not only in the SciPy/NumPy > environment. People even find it attractive enough to start rewriting their > CPython extension modules (most often library wrappers) from C in Cython, > both for performance and TCO reasons. > > > > There is another usecase for using cython for providing access to C > > libraries. 
This is a bit harder question and I don't have a good > > answer for that, but maybe cpython compatibility layer would be good > > enough in this case? I can't see how Cython can produce a "native" C > > code instead of CPython C code without some major effort. > > Native (standalone) C code isn't the goal, just something that adapts well > to what PyPy can provide as a CPython compatibility layer. If Cython > modules work across independent Python implementations, that would be the > most simple way by far to make lots of them available cross-platform, thus > making it a lot simpler to switch between different implementations. > > Stefan > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anto.cuni at gmail.com Thu Aug 12 20:31:06 2010 From: anto.cuni at gmail.com (Antonio Cuni) Date: Thu, 12 Aug 2010 20:31:06 +0200 Subject: [pypy-dev] [pypy-svn] r76608 - in pypy/branch/jit-bounds/pypy/jit/metainterp: . test In-Reply-To: <20100812170214.45AC5282B9E@codespeak.net> References: <20100812170214.45AC5282B9E@codespeak.net> Message-ID: <4C643DEA.9020809@gmail.com> On 12/08/10 19:02, hakanardo at codespeak.net wrote: > + def boundint_gt(self, val): > + if val is None: return > + self.minint = val + 1 what happens if val == sys.maxint? ciao, Anto From kevinar18 at hotmail.com Fri Aug 13 01:42:54 2010 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Thu, 12 Aug 2010 19:42:54 -0400 Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? In-Reply-To: References: , , , , , Message-ID: Sorry for not gettin back to you sooner. I don't mind replying to the mailing list unless it annoys someone? Maybe some people could be interested by this discussion. You have a lot of questions! :) My answers are inline. * Message passing When you create a tasklet, you assign a set number of queues or streams to it (it can have many) and whether they extract data from them or write to them (they can only either extract or write to it as noted above). The tasklet's global namespace has access to these queues or streams and can extract or add data to them. In my case, I look at message passing from the perspective of the tasklet. A tasklet can either be assigned a certain number of "in ports" and a certain number of "out ports." In this case the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() part of a queue or stream. Sorry, I don't really understand what you're trying to explain here. Maybe an example could be helpful? :) * Scheduler For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that I would look at all the "in ports" that a tasklet has and make sure each one has some data. Only then would the tasklet be scheduled to run by the scheduler. Couldn't all those ports (channels) be read one at a time, then the processing could be done? I don't exactly see the need to play with the scheduler. Channels are blocking. A tasklet will be anyway unscheduled when it tries to read on a channel in which no data is available. http://www.jpaulmorrison.com/fbp/concepts.htm Figure 3.6 and Figure 3.7 are a good example. Let's say Figure 3.7 is the tasklet (the one with a local namespace and no access to global memory or memory in other tasklets). IN, ACC, REJ are pointers to a shared memory location (from an implementation standpoint). 
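For illustration, a minimal sketch (added for clarity, not code from the thread) of how such a component can already be expressed with the stackless channel API; the IN/ACC/REJ names follow the FBP example, while the predicate and the None end-of-stream sentinel are assumptions of the sketch:

import stackless

def selector(in_ch, acc_ch, rej_ch, predicate):
    # one FBP-style component: pull records from IN, route each to ACC or REJ
    while True:
        record = in_ch.receive()      # blocks, so the scheduler suspends us until data arrives
        if record is None:            # end-of-stream sentinel
            acc_ch.send(None)
            rej_ch.send(None)
            return
        if predicate(record):
            acc_ch.send(record)
        else:
            rej_ch.send(record)

if __name__ == '__main__':
    IN, ACC, REJ = stackless.channel(), stackless.channel(), stackless.channel()

    def producer():
        for value in [3, -1, 7, None]:
            IN.send(value)

    def consumer(name, ch):
        while True:
            value = ch.receive()
            if value is None:
                return
            print name, value

    stackless.tasklet(producer)()
    stackless.tasklet(selector)(IN, ACC, REJ, lambda r: r >= 0)
    stackless.tasklet(consumer)('ACC', ACC)
    stackless.tasklet(consumer)('REJ', REJ)
    stackless.run()

The blocking receive() is what ties this to the scheduling question discussed here: a component with no input available simply stays blocked on its channel and never needs to be polled.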
IN, ACC, REJ are either a queue or buffer/pipe/steam (from the perspective of the programmer). The tasklet can only read/extract data from IN. The tasklet can only write to ACC and REJ. > Couldn't all those ports (channels) be read one at a time, then the processing could be done? Not sure exactly, what you mean, but as shown in Figure 3.7, different parts of code will read or write to different ports at different times. > A tasklet will be anyway unscheduled when it tries to read on a channel in which no data is available. Good idea. If there's no data to read, the tasklet can yield. ... but I need to know when the tasklet can be put back into the scheduler queue Then again, I don't know how I will want to do the scheduler... and would like the low level primitives to explore different styles. Anyways, at this point, I guess this whole discussion is not that important. I should probably make something simpler for now just to try things out. Then maybe I'll know if I want to even bother working on something better. However, if you would like me to keep you up to date, I can contact you via email a few months from now. (Let me know and I'll give you a different email to use). -------------- next part -------------- An HTML attachment was scrubbed... URL: From bhartsho at yahoo.com Fri Aug 13 03:42:23 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Thu, 12 Aug 2010 18:42:23 -0700 (PDT) Subject: [pypy-dev] Wrapping C, void pointer Message-ID: <986277.65536.qm@web114019.mail.gq1.yahoo.com> I am wrapping PortAudio for RPython. Following the source code in RSDL as a guide i can clearly see what to do with constants, structs, functions and so on. So far so good until i reached something new in PortAudio i am not sure how to deal with, a void typedef and a void pointer. In the RSDL example pointers were defined by Ptr = lltype.Ptr(lltype.ForwardReference()), then in the CConfig class the struct was defined and rffi_platform.configure parses, finally the Ptr is told TO become the type - given the output of platform.configure. How do we deal with this situation from PortAudio? # from portaudio.h typedef void PaStream; PaError Pa_OpenStream( PaStream** stream, const PaStreamParameters *inputParameters, const PaStreamParameters *outputParameters, double sampleRate, unsigned long framesPerBuffer, PaStreamFlags streamFlags, PaStreamCallback *streamCallback, void *userData ); ############################### I have tried the following but it fails when i try to malloc the void pointer. OpenDefaultStream = external( 'Pa_OpenDefaultStream', [ rffi.VOIDPP, # PaStream** rffi.INT, # numInputChannels rffi.INT, # numOutputChannels rffi.INT, # sampleFormat rffi.INT, # sampleRate rffi.INT, # framesPerBuffer rffi.INT, #streamcallback rffi.VOIDP, #userData ], rffi.INT ) Stream = lltype.Void #rffi.VOIDP def test(): print 'portaudio version %s' %GetVersion() assert Initialize() == 0 # paNoError = 0, error code is returned on init fail. 
stream = lltype.malloc(Stream, flavor='raw') try: ok = OpenDefaultStream( stream, 1, 1, Int16, 22050, FramesPerBufferUnspecified, 0 ) finally: lltype.free(stream, flavor='raw') Terminate() -brett From arigo at tunes.org Fri Aug 13 10:39:49 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 13 Aug 2010 10:39:49 +0200 Subject: [pypy-dev] Wrapping C, void pointer In-Reply-To: <986277.65536.qm@web114019.mail.gq1.yahoo.com> References: <986277.65536.qm@web114019.mail.gq1.yahoo.com> Message-ID: <20100813083948.GA20768@code0.codespeak.net> Hi Hart, On Thu, Aug 12, 2010 at 06:42:23PM -0700, Hart's Antler wrote: > I am wrapping PortAudio for RPython. Why? Writing it in standard ctypes would give really bad performance? Armin From bhartsho at yahoo.com Fri Aug 13 15:02:12 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Fri, 13 Aug 2010 06:02:12 -0700 (PDT) Subject: [pypy-dev] Wrapping C, void pointer In-Reply-To: <20100813083948.GA20768@code0.codespeak.net> Message-ID: <634594.45561.qm@web114019.mail.gq1.yahoo.com> Hi Armin, i wanted something faster than ctypes, i think thats why Hubert Pham used the Python C API when doing pyaudio before, also i want to do DSP on the samples and want to option to do as many effects as possible in real-time. I figured out my problem, rpyportaudio is on google code now, http://code.google.com/p/rpyportaudio/ --- On Fri, 8/13/10, Armin Rigo wrote: > From: Armin Rigo > Subject: Re: [pypy-dev] Wrapping C, void pointer > To: "Hart's Antler" > Cc: pypy-dev at codespeak.net > Date: Friday, 13 August, 2010, 1:39 AM > Hi Hart, > > On Thu, Aug 12, 2010 at 06:42:23PM -0700, Hart's Antler > wrote: > > I am wrapping PortAudio for RPython. > > Why?? Writing it in standard ctypes would give really > bad performance? > > > Armin > From andrewfr_ice at yahoo.com Fri Aug 13 18:52:14 2010 From: andrewfr_ice at yahoo.com (Andrew Francis) Date: Fri, 13 Aug 2010 09:52:14 -0700 (PDT) Subject: [pypy-dev] pypy-dev Digest, Vol 361, Issue 5 In-Reply-To: Message-ID: <66172.2718.qm@web120712.mail.ne1.yahoo.com> Hi Kevin: Message: 4 Date: Thu, 12 Aug 2010 19:42:54 -0400 From: Kevin Ar18 Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? To: Message-ID: Content-Type: text/plain; charset="iso-8859-1" >I don't mind replying to the mailing list unless it annoys someone? Maybe >some people could be interested by this discussion. I am finding it a bit difficult to follow this thread. I am not sure who is saying what. Also I don't know if you are talking about an entirely new system or the stackless.py module. >In my case, I look at message passing from the perspective of the >tasklet. A tasklet can either be assigned a certain number of "in ports" >and a certain number of "out ports." In this case the "in ports" are the >.read() end of a queue or stream and the "out ports" are the .send() part >of a queue or stream. A part of the model that Stackless uses is that tasklets have channels. Channels have send() and receive() operations. >For the scheduler, I would need to control when a tasklet runs. >Currently, I am thinking that I would look at all the "in ports" that a >tasklet has and make sure each one has some data. Only then would the >tasklet be scheduled to run by the scheduler. The current scheduler already does this. However there are no in or out ports, just operations that can proceed. >Couldn't all those ports (channels) be read one at a time, then the >processing could be done? 
If you are using stackless.py - the tasklet will block if it encounters a channel with no target on the other side. I wrote a select() function that allows monitoring on multiple channels. >Good idea. If there's no data to read, the tasklet can yield. ... but I >need to know when the tasklet can be put back into the scheduler queue I don't want to toot my horn but I gave a talk that covers how rendez-vous semantics works at EuroPython: http://andrewfr.wordpress.com/2010/07/24/prototyping-gos-select-and-beyond/ Cheers, Andrew From bhartsho at yahoo.com Sat Aug 14 03:51:00 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Fri, 13 Aug 2010 18:51:00 -0700 (PDT) Subject: [pypy-dev] RPython function callback from C Message-ID: <662548.18463.qm@web114020.mail.gq1.yahoo.com> I have the PortAudio blocking API working, simple reading and writing to the sound card works. PortAudio also has an async API where samples are fed to a callback as they stream in. But i'm not sure how to define a RPython function that will be called as a callback from C, is this even possible? I see some references in the source of rffi that seems to suggest it is possible. Full source code is here http://pastebin.com/6YHbT7CU I'm passing the callback like this: def stream_callback( *args ): print 'stream callback' return 0 # 0=continue, 1=complete, 2=abort stream_callback_ptr = rffi.CCallback([], rffi.INT) OpenDefaultStream = rffi.llexternal( 'Pa_OpenDefaultStream', [ StreamRefPtr, # PaStream** rffi.INT, # numInputChannels rffi.INT, # numOutputChannels rffi.INT, # sampleFormat rffi.DOUBLE, # double sampleRate rffi.INT, # unsigned long framesPerBuffer #rffi.VOIDP, #PaStreamCallback *streamCallback stream_callback_ptr, rffi.VOIDP, #void *userData ], rffi.INT, # return compilation_info=eci, _callable=stream_callback ) entrypoint(): ... callback = lltype.nullptr( stream_callback_ptr.TO ) ok = OpenDefaultStream( streamptr, 2, 2, Int16, 22050.0, 512, callback, userdata ) From kevinar18 at hotmail.com Sat Aug 14 05:29:15 2010 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Fri, 13 Aug 2010 23:29:15 -0400 Subject: [pypy-dev] ongoing microthread discussions In-Reply-To: <66172.2718.qm@web120712.mail.ne1.yahoo.com> References: , <66172.2718.qm@web120712.mail.ne1.yahoo.com> Message-ID: > >I don't mind replying to the mailing list unless it annoys someone? Maybe >some people could be interested by this discussion. > > I am finding it a bit difficult to follow this thread. I am not sure who is saying what. Also I don't know if you are talking about an entirely new system or the stackless.py module. An entirely new system/way of doing things -- meaning I don't think the stackless style would fit. Originally, I was hoping for some way to achieve what I want in Python across multiple cores, but I'm finding there is no such primitives to do that effectively. I know the basics of how I would do it in a lower level language. Yes, there are many different topics that this brought up. Here's a summary: * I wanted to work on a different way of doing things (different than stackless)... but I needed lower level primitives that allowed me to pass data back and forth between threads using shared memory queues or pipes (instead of the current method that copies the data back and forth) * I then asked about the difficulty in doing some form of limited shared memory (one that wouldn't involve a GIL overhaul) * A branch of the discussion involved people discuss various locking problems that might cause... 
* The author of Kamaelia posted a message and we had a brief discussion down that road. (His project is very similar to what I want to do.) * Gabriel mentioned his project and we had a brief discussion. His project has some similarities ... but still is probably too different for my needs, but maybe would be very interesting to other people here. * In one of the emails, I brought up a possible solution to offering shared memory "message passing" that would not require locks of locking issues... but it really is too much for me to get involved with now. ... and I guess by now the discussion has pretty much died off as there was really nothing more.... -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbaker at zyasoft.com Sat Aug 14 08:13:26 2010 From: jbaker at zyasoft.com (Jim Baker) Date: Sat, 14 Aug 2010 00:13:26 -0600 Subject: [pypy-dev] ongoing microthread discussions In-Reply-To: References: <66172.2718.qm@web120712.mail.ne1.yahoo.com> Message-ID: Kevin, You may want to broaden your candidates. Jython already supports multiple cores with no GIL and shared memory with well-defined memory semantics derived directly from Java's memory model (and compatible with the informal memory model that we see in CPython). Because JRuby needs it for efficient support of Ruby 1.9 generators, which are more general than Python's (non-nested yields), there has been substantial attention paid to the MLVM coroutine support which has demonstrated 1M+ microthread scalability in a single JVM process. It would be amazing if someone spent some time looking at this in Jython. - Jim On Fri, Aug 13, 2010 at 9:29 PM, Kevin Ar18 wrote: > > >I don't mind replying to the mailing list unless it annoys someone? > Maybe >some people could be interested by this discussion. > > > > I am finding it a bit difficult to follow this thread. I am not sure who > is saying what. Also I don't know if you are talking about an entirely new > system or the stackless.py module. > An entirely new system/way of doing things -- meaning I don't think the > stackless style would fit. > > Originally, I was hoping for some way to achieve what I want in Python > across multiple cores, but I'm finding there is no such primitives to do > that effectively. I know the basics of how I would do it in a lower level > language. > > Yes, there are many different topics that this brought up. Here's a > summary: > * I wanted to work on a different way of doing things (different than > stackless)... but I needed lower level primitives that allowed me to pass > data back and forth between threads using shared memory queues or pipes > (instead of the current method that copies the data back and forth) > * I then asked about the difficulty in doing some form of limited shared > memory (one that wouldn't involve a GIL overhaul) > * A branch of the discussion involved people discuss various locking > problems that might cause... > * The author of Kamaelia posted a message and we had a brief discussion > down that road. (His project is very similar to what I want to do.) > * Gabriel mentioned his project and we had a brief discussion. His project > has some similarities ... but still is probably too different for my needs, > but maybe would be very interesting to other people here. > * In one of the emails, I brought up a possible solution to offering shared > memory "message passing" that would not require locks of locking issues... > but it really is too much for me to get involved with now. > > ... 
and I guess by now the discussion has pretty much died off as there was > really nothing more.... > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Sat Aug 14 15:50:02 2010 From: arigo at tunes.org (Armin Rigo) Date: Sat, 14 Aug 2010 15:50:02 +0200 Subject: [pypy-dev] PyPy speed center not updating any more Message-ID: <20100814135002.GA1941@code0.codespeak.net> Hi all, The PyPy speed center does not display any update more recent than July 29. The buildbot infrastructure correctly puts them into files codespeak.net:~buildmaster/bench_results/REV.json, but the web site at http://speed.pypy.org/ does not get updated. Help please! A bientot, Armin. From arigo at tunes.org Sat Aug 14 16:23:21 2010 From: arigo at tunes.org (Armin Rigo) Date: Sat, 14 Aug 2010 16:23:21 +0200 Subject: [pypy-dev] PyPy speed center not updating any more In-Reply-To: <20100814135002.GA1941@code0.codespeak.net> References: <20100814135002.GA1941@code0.codespeak.net> Message-ID: <20100814142321.GA4071@code0.codespeak.net> Hi all, On Sat, Aug 14, 2010 at 03:50:02PM +0200, Armin Rigo wrote: > The PyPy speed center does not display any update more recent than July 29. Wrong (thanks Antonio). It's only the twisted_web benchmark that stops at July 29; it was certainly removed at that date. For the others it works as expected. The most recent results of today (76624) have been run on the kill-caninline branch. Armin. From kevinar18 at hotmail.com Sun Aug 15 02:24:20 2010 From: kevinar18 at hotmail.com (Kevin Ar18) Date: Sat, 14 Aug 2010 20:24:20 -0400 Subject: [pypy-dev] ongoing microthread discussions In-Reply-To: References: , <66172.2718.qm@web120712.mail.ne1.yahoo.com> , Message-ID: You may want to broaden your candidates. Jython already supports multiple cores with no GIL and shared memory with well-defined memory semantics derived directly from Java's memory model (and compatible with the informal memory model that we see in CPython). Because JRuby needs it for efficient support of Ruby 1.9 generators, which are more general than Python's (non-nested yields), there has been substantial attention paid to the MLVM coroutine support which has demonstrated 1M+ microthread scalability in a single JVM process. It would be amazing if someone spent some time looking at this in Jython. For me, anything based on the Java VM or copyleft code it out of question. However, you are quite right in that it is not necessary that I use PyPy. For example, if Unladen Swallow had the primitives I needed, that would be great too. As a side note, PyPy does have two advantages: speed and that it is coded in RPython: which might even allow me to just hack PyPy itself at some point. :) BTW, thanks for the suggestion. Now that you brought up the topic of different implementations, I should probably check on what is going on in regards to Unladen Swallow, etc.... -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbaker at zyasoft.com Sun Aug 15 16:31:38 2010 From: jbaker at zyasoft.com (Jim Baker) Date: Sun, 15 Aug 2010 08:31:38 -0600 Subject: [pypy-dev] ongoing microthread discussions In-Reply-To: References: <66172.2718.qm@web120712.mail.ne1.yahoo.com> Message-ID: To clarify, there are numerous implementations of the JVM that are not copyleft, such as Apache Harmony. Of course the MLVM work I citedis not one of them. 
Jython itself is licensed under the Python Software License. On Sat, Aug 14, 2010 at 6:24 PM, Kevin Ar18 wrote: > You may want to broaden your candidates. Jython already supports multiple > cores with no GIL and shared memory with well-defined memory semantics > derived directly from Java's memory model (and compatible with the informal > memory model that we see in CPython). Because JRuby needs it for efficient > support of Ruby 1.9 generators, which are more general than Python's > (non-nested yields), there has been substantial attention paid to the MLVM > coroutine support which has demonstrated 1M+ microthread scalability in a > single JVM process. > > It would be amazing if someone spent some time looking at this in Jython. > > > For me, anything based on the Java VM or copyleft code it out of question. > However, you are quite right in that it is not necessary that I use PyPy. > For example, if Unladen Swallow had the primitives I needed, that would be > great too. > > As a side note, PyPy does have two advantages: speed and that it is coded > in RPython: which might even allow me to just hack PyPy itself at some > point. :) > > BTW, thanks for the suggestion. Now that you brought up the topic of > different implementations, I should probably check on what is going on in > regards to Unladen Swallow, etc.... > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amauryfa at gmail.com Sun Aug 15 21:32:51 2010 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Sun, 15 Aug 2010 21:32:51 +0200 Subject: [pypy-dev] RPython function callback from C In-Reply-To: <662548.18463.qm@web114020.mail.gq1.yahoo.com> References: <662548.18463.qm@web114020.mail.gq1.yahoo.com> Message-ID: Hi, Le 14 ao?t 2010 03:51:00 UTC+2, Hart's Antler a ?crit : > I have the PortAudio blocking API working, simple reading and writing to the > sound card works. ?PortAudio also has an async API where samples are fed to > a callback as they stream in. ?But i'm not sure how to define a RPython > function that will be called as a callback from C, is this even possible? ?I > see some references in the source of rffi that seems to suggest it is > possible. Yes this is possible. See an example in pypy/rpython/lltypesystem/test/test_ll2ctypes.py in the function test_qsort_callback(). Why are you passing _callable=stream_callback? It should be enough to pass stream_callback directly as a function argument. -- Amaury Forgeot d'Arc From ademan555 at gmail.com Mon Aug 16 01:18:10 2010 From: ademan555 at gmail.com (Dan Roberts) Date: Sun, 15 Aug 2010 16:18:10 -0700 Subject: [pypy-dev] JIT Failure on lltype.Array access Message-ID: As best I can tell, the JIT cannot handle my code properly, it corrupts memory and returns 0.0 for float arrays. I don't know whether the true problem is in my code or the JIT, but I need to get this resolved quickly. I know the JIT and my code are interacting badly because py.py works fine (though slow) and translated pypy-c with jit and --jit threshold=9999999 both work fine. Here's what I've tried to resolve the issue: Removing my _immutable_fields_ hints. Hand implementing bh_{get,set}arrayitem_raw_{r,i,f} (though I don't know my implementation was right, I simply copied the gc version and removed the first offset (since raw arrays have no header right? 
Although I expect that the gc version would have simply gotten 0 for the header size... I tried it anyways) A few thoughts: descr.py alludes to a FloatArrayDescr which I never raw defined Could the asm backend be part of the problem? Rather than the code in llmodel.py? Unfortunately I'm ill equipped to resolve this issue, so any help is appreciated (I'm on my phone but I'll happily furnish exact errors upon request) -------------- next part -------------- An HTML attachment was scrubbed... URL: From tobami at googlemail.com Mon Aug 16 08:19:33 2010 From: tobami at googlemail.com (Miquel Torres) Date: Mon, 16 Aug 2010 08:19:33 +0200 Subject: [pypy-dev] PyPy speed center not updating any more In-Reply-To: <20100814142321.GA4071@code0.codespeak.net> References: <20100814135002.GA1941@code0.codespeak.net> <20100814142321.GA4071@code0.codespeak.net> Message-ID: Hi Armin, are all results going to be run on a branch now?. If you run results on a branch, but don't change the config on codespeed, the commit logs won't work because it will try to pull them from trunk 2010/8/14 Armin Rigo : > Hi all, > > On Sat, Aug 14, 2010 at 03:50:02PM +0200, Armin Rigo wrote: >> The PyPy speed center does not display any update more recent than July 29. > > Wrong (thanks Antonio). ?It's only the twisted_web benchmark that stops > at July 29; it was certainly remov ed at that date. ?For the others it > works as expected. > > The most recent results of today (76624) have been run on the > kill-caninline branch. > > > Armin. > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From fijall at gmail.com Mon Aug 16 09:00:52 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 16 Aug 2010 09:00:52 +0200 Subject: [pypy-dev] PyPy speed center not updating any more In-Reply-To: References: <20100814135002.GA1941@code0.codespeak.net> <20100814142321.GA4071@code0.codespeak.net> Message-ID: I disabled twisted_web because of run out of TCP connection problem. Regarding branches - how can we have branches visible with trunk side-by-side, submit that as a different interpreter? On Mon, Aug 16, 2010 at 8:19 AM, Miquel Torres wrote: > Hi Armin, > > are all results going to be run on a branch now?. > > If you run results on a branch, but don't change the config on > codespeed, the commit logs won't work because it will try to pull them > from trunk > > > 2010/8/14 Armin Rigo : >> Hi all, >> >> On Sat, Aug 14, 2010 at 03:50:02PM +0200, Armin Rigo wrote: >>> The PyPy speed center does not display any update more recent than July 29. >> >> Wrong (thanks Antonio). ?It's only the twisted_web benchmark that stops >> at July 29; it was certainly remov ed at that date. ?For the others it >> works as expected. >> >> The most recent results of today (76624) have been run on the >> kill-caninline branch. >> >> >> Armin. 
>> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev >> > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From arigo at tunes.org Mon Aug 16 15:06:48 2010 From: arigo at tunes.org (Armin Rigo) Date: Mon, 16 Aug 2010 15:06:48 +0200 Subject: [pypy-dev] PyPy speed center not updating any more In-Reply-To: References: <20100814135002.GA1941@code0.codespeak.net> <20100814142321.GA4071@code0.codespeak.net> Message-ID: <20100816130648.GA15483@code0.codespeak.net> Hi Miquel, On Mon, Aug 16, 2010 at 08:19:33AM +0200, Miquel Torres wrote: > are all results going to be run on a branch now?. No no, I just ran manually twice on a branch. A bientot, Armin. From arigo at tunes.org Mon Aug 16 15:10:33 2010 From: arigo at tunes.org (Armin Rigo) Date: Mon, 16 Aug 2010 15:10:33 +0200 Subject: [pypy-dev] JIT Failure on lltype.Array access In-Reply-To: References: Message-ID: <20100816131033.GB15483@code0.codespeak.net> Hi Dan, The issue was that the JIT was silently and incorrectly accepting the type lltype.Array(), which is a non-GC but with-length-prefix array, and it was (by mistake) considering it to be a GC array. That's where the errors come from. Now the JIT explicitly refuses to work with such arrays. As explained on IRC, you need anyway in micronumpy to use the type rffi.CArray(), which does not contain the length prefix. A bientot, Armin. From tobami at googlemail.com Mon Aug 16 16:15:39 2010 From: tobami at googlemail.com (Miquel Torres) Date: Mon, 16 Aug 2010 16:15:39 +0200 Subject: [pypy-dev] PyPy speed center not updating any more In-Reply-To: <20100816130648.GA15483@code0.codespeak.net> References: <20100814135002.GA1941@code0.codespeak.net> <20100814142321.GA4071@code0.codespeak.net> <20100816130648.GA15483@code0.codespeak.net> Message-ID: Maciej: sorry, we had this issue pending for a long time already. The best way would be to add a new project per branch. So instead of project = 'PyPy' save as project = 'experimental_branchX' then in the admin (the project entry will be created when the first results are saved), choose whether to "track" the project (show or hide in the changes view), and customize the commit log info (pull logs from the corresponding subdir in svn instead of trunk). Note: to avoid confusion, executables names are unique, so exe (interpreter) names will need to be different as well (it could be changed if needed) Cheers, Miquel 2010/8/16 Armin Rigo : > Hi Miquel, > > On Mon, Aug 16, 2010 at 08:19:33AM +0200, Miquel Torres wrote: >> are all results going to be run on a branch now?. > > No no, I just ran manually twice on a branch. > > > A bientot, > > Armin. > From bhartsho at yahoo.com Thu Aug 19 03:25:02 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Wed, 18 Aug 2010 18:25:02 -0700 (PDT) Subject: [pypy-dev] JIT'ed function performance degrades Message-ID: <786138.15701.qm@web114011.mail.gq1.yahoo.com> I am starting to learn how to use the JIT, and i'm confused why my function gets slower over time, twice as slow after running for a few minutes. Using a virtualizable did speed up my code, but it still has the degrading performance problem. I have yesterdays SVN and using 64bit with boehm. I understand boehm is slower, but overall my JIT'ed function is many times slower than un-jitted, is this expected behavior from boehm? 
code is here: http://pastebin.com/9VGJHpNa From sakesun at gmail.com Thu Aug 19 06:25:42 2010 From: sakesun at gmail.com (sakesun roykiatisak) Date: Thu, 19 Aug 2010 11:25:42 +0700 Subject: [pypy-dev] =?windows-1252?q?What=27s_wrong_with_=3E=3E=3E_open=28?= =?windows-1252?q?=92xxx=92=2C_=92w=92=29=2Ewrite=28=92stuff=92=29_?= =?windows-1252?q?=3F?= Message-ID: Hi, I encountered this quite a few times when learning pypy from internet resources: the code like this >>> open(?xxx?, ?w?).write(?stuff?) This code is not working on pypy because it rely on CPython refcounting behaviour. I don't get it. Why ? I thought the code should be similar to storing the file object in temporary variable like this >>> f = open('xxx', 'w') >>> f.write('stuff') >>> del f Also, I've tried that with both Jython and IronPython and they all work fine. Why does this cause problem to pypy ? Do I have to avoid writing code like this in the future ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From sakesun at gmail.com Thu Aug 19 06:49:36 2010 From: sakesun at gmail.com (sakesun roykiatisak) Date: Thu, 19 Aug 2010 11:49:36 +0700 Subject: [pypy-dev] =?windows-1252?q?What=27s_wrong_with_=3E=3E=3E_open=28?= =?windows-1252?q?=92xxx=92=2C_=92w=92=29=2Ewrite=28=92stuff=92=29_?= =?windows-1252?q?=3F?= In-Reply-To: References: Message-ID: That's make sense. I've tried on both IronPython and Jython with: ipy -c "open(?xxx?, ?w?).write(?stuff?)" jython -c "open(?xxx?, ?w?).write(?stuff?)" When the interpreter terminate the file is closed. That's why it didn't cause any problem. Perhaps, I should always use "with" statement from now on. >>> with open('xxx', 'w') as f: f.write('stuff') Thanks On Thu, Aug 19, 2010 at 11:40 AM, Aaron DeVore wrote: > If I understand correctly, PyPy will garbage collect (and close) the > file object at an indeterminate time. That time could be as long as > until the program exits. Because CPython uses reference counting, it > closes the file immediately after the file object goes out of scope. > > Of course, I may be entirely wrong. > > -Aaron DeVore > > On Wed, Aug 18, 2010 at 9:25 PM, sakesun roykiatisak > wrote: > > Hi, > > I encountered this quite a few times when learning pypy from internet > > resources: > > the code like this > >>>> open(?xxx?, ?w?).write(?stuff?) > > This code is not working on pypy because it rely on CPython refcounting > > behaviour. > > I don't get it. Why ? I thought the code should be similar to storing > the > > file object in temporary variable like this > >>>> f = open('xxx', 'w') > >>>> f.write('stuff') > >>>> del f > > Also, I've tried that with both Jython and IronPython and they all work > > fine. > > Why does this cause problem to pypy ? Do I have to avoid writing code > like > > this in the future ? > > _______________________________________________ > > pypy-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/pypy-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sakesun at gmail.com Thu Aug 19 07:07:47 2010 From: sakesun at gmail.com (sakesun roykiatisak) Date: Thu, 19 Aug 2010 12:07:47 +0700 Subject: [pypy-dev] =?windows-1252?q?What=27s_wrong_with_=3E=3E=3E_open=28?= =?windows-1252?q?=92xxx=92=2C_=92w=92=29=2Ewrite=28=92stuff=92=29_?= =?windows-1252?q?=3F?= In-Reply-To: References: Message-ID: A little problem is that, "with" statement is yet to work in pypy. :) On Thu, Aug 19, 2010 at 11:49 AM, sakesun roykiatisak wrote: > That's make sense. 
I've tried on both IronPython and Jython with: > > ipy -c "open(?xxx?, ?w?).write(?stuff?)" > jython -c "open(?xxx?, ?w?).write(?stuff?)" > > When the interpreter terminate the file is closed. That's why it didn't > cause any problem. > > Perhaps, I should always use "with" statement from now on. > > >>> with open('xxx', 'w') as f: f.write('stuff') > > Thanks > > On Thu, Aug 19, 2010 at 11:40 AM, Aaron DeVore wrote: > >> If I understand correctly, PyPy will garbage collect (and close) the >> file object at an indeterminate time. That time could be as long as >> until the program exits. Because CPython uses reference counting, it >> closes the file immediately after the file object goes out of scope. >> >> Of course, I may be entirely wrong. >> >> -Aaron DeVore >> >> On Wed, Aug 18, 2010 at 9:25 PM, sakesun roykiatisak >> wrote: >> > Hi, >> > I encountered this quite a few times when learning pypy from internet >> > resources: >> > the code like this >> >>>> open(?xxx?, ?w?).write(?stuff?) >> > This code is not working on pypy because it rely on CPython refcounting >> > behaviour. >> > I don't get it. Why ? I thought the code should be similar to storing >> the >> > file object in temporary variable like this >> >>>> f = open('xxx', 'w') >> >>>> f.write('stuff') >> >>>> del f >> > Also, I've tried that with both Jython and IronPython and they all work >> > fine. >> > Why does this cause problem to pypy ? Do I have to avoid writing code >> like >> > this in the future ? >> > _______________________________________________ >> > pypy-dev at codespeak.net >> > http://codespeak.net/mailman/listinfo/pypy-dev >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alex.gaynor at gmail.com Thu Aug 19 07:09:25 2010 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Thu, 19 Aug 2010 00:09:25 -0500 Subject: [pypy-dev] =?utf-8?b?V2hhdCdzIHdyb25nIHdpdGggPj4+IG9wZW4o4oCZeHh4?= =?utf-8?b?4oCZLCDigJl34oCZKS53cml0ZSjigJlzdHVmZuKAmSkgPw==?= In-Reply-To: References: Message-ID: On Thu, Aug 19, 2010 at 12:07 AM, sakesun roykiatisak wrote: > > A little problem is that, "with" statement is yet to work in pypy. > :) > > On Thu, Aug 19, 2010 at 11:49 AM, sakesun roykiatisak > wrote: >> >> That's make sense. ?I've tried on both IronPython and Jython with: >> ipy -c "open(?xxx?, ?w?).write(?stuff?)" >> jython -c "open(?xxx?, ?w?).write(?stuff?)" >> When the interpreter terminate the file is closed. That's why it didn't >> cause any problem. >> Perhaps, I should always use "with" statement from now on. >> >>> with open('xxx', 'w') as f: f.write('stuff') >> Thanks >> >> On Thu, Aug 19, 2010 at 11:40 AM, Aaron DeVore >> wrote: >>> >>> If I understand correctly, PyPy will garbage collect (and close) the >>> file object at an indeterminate time. That time could be as long as >>> until the program exits. Because CPython uses reference counting, it >>> closes the file immediately after the file object goes out of scope. >>> >>> Of course, I may be entirely wrong. >>> >>> -Aaron DeVore >>> >>> On Wed, Aug 18, 2010 at 9:25 PM, sakesun roykiatisak >>> wrote: >>> > Hi, >>> > ?I encountered this quite a few times when learning pypy from internet >>> > resources: >>> > ??the code like this >>> >>>> open(?xxx?, ?w?).write(?stuff?) >>> > This code is not working on pypy because it rely on CPython refcounting >>> > behaviour. >>> > I don't get it. Why ? 
?I thought the code should be similar to storing >>> > the >>> > file object in temporary variable like this >>> >>>> f = open('xxx', 'w') >>> >>>> f.write('stuff') >>> >>>> del f >>> > Also, I've tried that with both Jython and IronPython and they all work >>> > fine. >>> > Why does this cause problem to pypy ? ?Do I have to avoid writing code >>> > like >>> > this in the future ? >>> > _______________________________________________ >>> > pypy-dev at codespeak.net >>> > http://codespeak.net/mailman/listinfo/pypy-dev >>> > >> > > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > Since PyPy implements Python 2.5 at present you'll need to use `from __future__ import with_statement` to ues it. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Voltaire "The people's good is the highest law." -- Cicero "Code can always be simpler than you think, but never as simple as you want" -- Me From sakesun at gmail.com Thu Aug 19 07:12:46 2010 From: sakesun at gmail.com (sakesun roykiatisak) Date: Thu, 19 Aug 2010 12:12:46 +0700 Subject: [pypy-dev] =?windows-1252?q?What=27s_wrong_with_=3E=3E=3E_open=28?= =?windows-1252?q?=92xxx=92=2C_=92w=92=29=2Ewrite=28=92stuff=92=29_?= =?windows-1252?q?=3F?= In-Reply-To: References: Message-ID: Wow, thanks. Pypy is a really precise implementation. > Since PyPy implements Python 2.5 at present you'll need to use `from > __future__ import with_statement` to ues it. > > Alex > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From william.leslie.ttg at gmail.com Thu Aug 19 07:13:36 2010 From: william.leslie.ttg at gmail.com (William Leslie) Date: Thu, 19 Aug 2010 15:13:36 +1000 Subject: [pypy-dev] =?windows-1252?q?What=27s_wrong_with_=3E=3E=3E_open=28?= =?windows-1252?q?=92xxx=92=2C_=92w=92=29=2Ewrite=28=92stuff=92=29_?= =?windows-1252?q?=3F?= In-Reply-To: References: Message-ID: A good resource I recently read on this is this entry in Raymond Chen's blog: http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047586.aspx Together with the following entry, which explains why the lifetime of the variable has nothing to do with the lifetime of the object, this should help you understand. You should consider automatically closing a file to be an implementation detail, even cpython may not respect such semantics in future. That is why the with statement was created. -- William Leslie From sakesun at gmail.com Thu Aug 19 07:20:35 2010 From: sakesun at gmail.com (sakesun roykiatisak) Date: Thu, 19 Aug 2010 12:20:35 +0700 Subject: [pypy-dev] =?windows-1252?q?What=27s_wrong_with_=3E=3E=3E_open=28?= =?windows-1252?q?=92xxx=92=2C_=92w=92=29=2Ewrite=28=92stuff=92=29_?= =?windows-1252?q?=3F?= In-Reply-To: References: Message-ID: Thanks. Interestingly, this is not the first time I was suggested to pursue further reading with Raymond Chen's blog. http://www.mail-archive.com/users at lists.ironpython.com/msg05792.html :) On Thu, Aug 19, 2010 at 12:13 PM, William Leslie < william.leslie.ttg at gmail.com> wrote: > A good resource I recently read on this is this entry in Raymond Chen's > blog: > > http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047586.aspx > > Together with the following entry, which explains why the lifetime of > the variable has nothing to do with the lifetime of the object, this > should help you understand. 
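Putting Alex's note and the suggestions above together, a minimal sketch of the portable ways to make sure the data actually reaches the file without relying on the garbage collector (file name 'xxx' as in the thread; the __future__ import is only needed on Python 2.5-level interpreters such as PyPy at that time):

    from __future__ import with_statement  # Python 2.5 only; harmless on 2.6+

    # explicit close, works on every implementation
    f = open('xxx', 'w')
    try:
        f.write('stuff')
    finally:
        f.close()

    # the with statement closes the file automatically, even on errors
    with open('xxx', 'w') as f:
        f.write('stuff')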
> > You should consider automatically closing a file to be an > implementation detail, even cpython may not respect such semantics in > future. That is why the with statement was created. > > -- > William Leslie > -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.giarrusso at gmail.com Thu Aug 19 08:48:05 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Thu, 19 Aug 2010 08:48:05 +0200 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: <786138.15701.qm@web114011.mail.gq1.yahoo.com> References: <786138.15701.qm@web114011.mail.gq1.yahoo.com> Message-ID: On Thu, Aug 19, 2010 at 03:25, Hart's Antler wrote: > I am starting to learn how to use the JIT, and i'm confused why my function gets slower over time, twice as slow after running for a few minutes. ?Using a virtualizable did speed up my code, but it still has the degrading performance problem. ?I have yesterdays SVN and using 64bit with boehm. ?I understand boehm is slower, but overall my JIT'ed function is many times slower than un-jitted, is this expected behavior from boehm? > > code is here: > http://pastebin.com/9VGJHpNa I think this has nothing to do with Boehm. Is it swapping? If yes, that explains the slowdown. Is memory usage growing over time? I expect yes, and it's a misbehavior which could be explained by my analysis below. Is it JITting code? I think no, or not to an advantage, but that's a more complicated guess. BTW, when debugging such things, _always_ ask and answer these questions yourself. Moreover, I'm not sure you need to use the JIT yourself. - Your code is RPython, so you could as well just translate it without JIT annotations, and it will be compiled to C code. - Otherwise, you could write that as a app-level function, i.e. in normal Python, and pass it to a translated PyPy-JIT interpreter. Did you try and benchmark the code? Can I ask you why you did not write that as a app-level function, i.e. as normal Python code, to use PyPy's JIT directly, without needing detailed understanding of the JIT? It would be interesting to see a comparison (and have it on the web, after some code review). Especially, I'm not sure that as currently written you're getting any speedup, and I seriously wonder whether the JIT could give an additional speedup over RPython here (the regexp interpreter is a completely different case, since it compiles a regexp, but why do you compile an array?). I think just raw CPython can be 340x slower than C (I assume NumPy uses C), and since your code is RPython, there must be something basic wrong. I think you have too many green variables in your code: "At runtime, for a given value of the green variables, one piece of machine code will be generated. This piece of machine code can therefore assume that the value of the green variable is constant." [1] So, every time you change the value of a green variable, the JIT will have to recompile again the function. Note that actually, I think, for each new value of the variable, first a given number of iterations have to occur (1000? 10 000? I'm not sure), then the JIT will spend time creating a trace and compiling it. The length of the involved arrays is maybe around the threshold, maybe smaller, so you get "all pain, and no gain". 
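Before the actual driver is quoted just below, it may help to sketch how greens and reds are normally split for a small interpreter-style loop (a hypothetical toy example, not the pastebin code; the import path and keyword-argument calling convention are the ones used in the 2010 PyPy tree):

    from pypy.rlib.jit import JitDriver

    # green: constant for one compiled trace (the "program" and the position in it)
    # red:   everything that changes from iteration to iteration
    driver = JitDriver(greens=['pc', 'program'], reds=['acc', 'data'])

    def run(program, data):
        pc = 0
        acc = 0.0
        while pc < len(program):
            driver.can_enter_jit(pc=pc, program=program, acc=acc, data=data)
            driver.jit_merge_point(pc=pc, program=program, acc=acc, data=data)
            op = program[pc]
            if op == '+':
                acc += data[pc % len(data)]
            elif op == '*':
                acc *= 2.0
            pc += 1
        return acc

With this split, one piece of machine code is generated per (pc, program) pair, i.e. per loop in the interpreted program, and the red values stay free to change on every pass.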
>From your code: complex_dft_jitdriver = JitDriver( greens = 'index length accum array'.split(), reds = 'k a b J'.split(), virtualizables = 'a'.split() #can_inline=True ) The only acceptable green variable are IMHO array and length there, because in the calling code, the other change for each invocation I think. I also think that only length should be green (and that could give a speedup), and that marking array as green gives neglibible or no speedup. Marking length as green allows specializing the function on the size of the array - something one would not do in C probably, but that one could do in C++. Whether it is worth it depends on the specific code & optimizations available - I think here the speedup should be small. Best regards [1] http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.html -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From fijall at gmail.com Thu Aug 19 12:03:17 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 19 Aug 2010 12:03:17 +0200 Subject: [pypy-dev] =?utf-8?b?V2hhdCdzIHdyb25nIHdpdGggPj4+IG9wZW4o4oCZeHh4?= =?utf-8?b?4oCZLCDigJl34oCZKS53cml0ZSjigJlzdHVmZuKAmSkgPw==?= In-Reply-To: References: Message-ID: Hi. Yes, those two things are equivalent and they both work. However, if you try to read the file immediately after deleting the variable, you'll find out that the file is empty on any implementation but cpython. On Thu, Aug 19, 2010 at 6:25 AM, sakesun roykiatisak wrote: > Hi, > ?I encountered this quite a few times when learning pypy from internet > resources: > ??the code like this >>>> open(?xxx?, ?w?).write(?stuff?) > This code is not working on pypy because it rely on CPython refcounting > behaviour. > I don't get it. Why ? ?I thought the code should be similar to storing the > file object in temporary variable like this >>>> f = open('xxx', 'w') >>>> f.write('stuff') >>>> del f > Also, I've tried that with both Jython and IronPython and they all work > fine. > Why does this cause problem to pypy ? ?Do I have to avoid writing code like > this in the future ? > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From fijall at gmail.com Thu Aug 19 12:11:29 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 19 Aug 2010 12:11:29 +0200 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: References: <786138.15701.qm@web114011.mail.gq1.yahoo.com> Message-ID: Hi On Thu, Aug 19, 2010 at 8:48 AM, Paolo Giarrusso wrote: > On Thu, Aug 19, 2010 at 03:25, Hart's Antler wrote: >> I am starting to learn how to use the JIT, and i'm confused why my function gets slower over time, twice as slow after running for a few minutes. ?Using a virtualizable did speed up my code, but it still has the degrading performance problem. ?I have yesterdays SVN and using 64bit with boehm. ?I understand boehm is slower, but overall my JIT'ed function is many times slower than un-jitted, is this expected behavior from boehm? >> >> code is here: >> http://pastebin.com/9VGJHpNa > > I think this has nothing to do with Boehm. I don't think as well > Moreover, I'm not sure you need to use the JIT yourself. > - Your code is RPython, so you could as well just translate it without > JIT annotations, and it will be compiled to C code. > - Otherwise, you could write that as a app-level function, i.e. in > normal Python, and pass it to a translated PyPy-JIT interpreter. Did > you try and benchmark the code? 
> Can I ask you why you did not write that as a app-level function, i.e. > as normal Python code, to use PyPy's JIT directly, without needing > detailed understanding of the JIT? > It would be interesting to see a comparison (and have it on the web, > after some code review). JIT can essentially speed up based on constant folding based on bytecode. Bytecode should be the only green variable here and all others (that you don't want to specialize over) should be red and not promoted. In your case it's very likely you compile new loop very often (overspecialization). > > Especially, I'm not sure that as currently written you're getting any > speedup, and I seriously wonder whether the JIT could give an > additional speedup over RPython here (the regexp interpreter is a > completely different case, since it compiles a regexp, but why do you > compile an array?). That's silly, our python interpreter is an RPython program. Anything that can have a meaningfully defined "bytecode" or a "compile time constant" can be sped up by the JIT. For example a templating language. > I think just raw CPython can be 340x slower than C (I assume NumPy > uses C) You should check more and have less assumptions. > > So, every time you change the value of a green variable, the JIT will > have to recompile again the function. Note that actually, I think, for > each new value of the variable, first a given number of iterations > have to occur (1000? 10 000? I'm not sure), then the JIT will spend > time creating a trace and compiling it. The length of the involved > arrays is maybe around the threshold, maybe smaller, so you get "all > pain, and no gain". > to be precise for each combination of green variables there has to be a 1000 (by default) iterations. If there is no such thing, you'll never compile code and simply spend time bookkeeping. Cheers, fijal From p.giarrusso at gmail.com Thu Aug 19 13:34:12 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Thu, 19 Aug 2010 13:34:12 +0200 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: References: <786138.15701.qm@web114011.mail.gq1.yahoo.com> Message-ID: Hi Maciej, I think you totally misunderstood me, possibly because I was not clear, see below. In short, I was wondering whether the approach of the original code made any sense, and my guess was "mostly not", exactly because there is little constant folding possible in the code, as it is written. [Hart, I don't think that any O(N^2) implementation of DFT (what is in the code), i.e. two nested for loops, should be written to explicitly take advantage of the JIT. I don't know about the FFT algorithm, but a few vague ideas say "yes", because constant folding the length could _maybe_ allow constant folding the permutations applied to data in the Cooley?Tukey FFT algorithm.] On Thu, Aug 19, 2010 at 12:11, Maciej Fijalkowski wrote: > Hi >> Moreover, I'm not sure you need to use the JIT yourself. >> - Your code is RPython, so you could as well just translate it without >> JIT annotations, and it will be compiled to C code. >> - Otherwise, you could write that as a app-level function, i.e. in >> normal Python, and pass it to a translated PyPy-JIT interpreter. Did >> you try and benchmark the code? >> Can I ask you why you did not write that as a app-level function, i.e. >> as normal Python code, to use PyPy's JIT directly, without needing >> detailed understanding of the JIT? >> It would be interesting to see a comparison (and have it on the web, >> after some code review). 
> > JIT can essentially speed up based on constant folding based on > bytecode. Bytecode should be the only green variable here and all > others (that you don't want to specialize over) should be red and not > promoted. In your case it's very likely you compile new loop very > often (overspecialization). I see no bytecode in the example - it's a DFT implementation. For each combination of green variables, there are 1024 iterations, and there are 1024 such combinations, so overspecialization is almost guaranteed. My next question, inspired from the specific code, is: is JITted code ever thrown away, if too much is generated? Even for valid use cases, most JITs can generate too much code, and they need then to choose what to keep and what to throw away. >> Especially, I'm not sure that as currently written you're getting any >> speedup, and I seriously wonder whether the JIT could give an >> additional speedup over RPython here (the regexp interpreter is a >> completely different case, since it compiles a regexp, but why do you >> compile an array?). > > That's silly, our python interpreter is an RPython program. Anything > that can have a meaningfully defined "bytecode" or a "compile time > constant" can be sped up by the JIT. For example a templating > language. You misunderstood me, I totally agree with you, and my understanding is that in the given program (which I read almost fully) constant folding makes little sense. Since that program is written with RPython + JIT, but it has green variables which are not at all "compile time constants", "I wonder seriously" was meant as "I wonder seriously whether what you are trying makes any sense". As I argued, the only constant folding possible is for the array length. And again, I wonder whether it's worth it, my guess tends towards "no", but a benchmark is needed (there will be some improvement probably). I was just a bit vaguer because I just studied docs on PyPy (and papers about tracing compilation). But your answer confirms that my original analysis is correct, and that I should write more clearly maybe. >> I think just raw CPython can be 340x slower than C (I assume NumPy >> uses C) > You should check more and have less assumptions. I did some checks, on PyPy's blog actually, not definitive though, and I stand by what I meant (see below). Without reading the pastie in full, however, my comments are out of context. I guess your tone is fine, since you thought I wrote nonsense. But in general, I have yet to see a guideline forbidding "IIRC" and similar ways of discussing (the above was an _educated_ guess), especially when the writer remembers correctly (as in this case). Having said that, I'm always happy to see counterexamples and learn something, if they exist. In this case, for what I actually meant (and wrote, IMHO), a counterexample would be a RPython or a JITted program >= 340x slower than C. For the speed ratio, the code pastie writes that RPython JITted code is 340x slower than NumPy code, and I was writing that it's unreasonable; in this case, it happens because of overspecialization caused by misuse of the JIT. For speed ratios among CPython, C, RPython, I was comparing to http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.html. What I meant is that JITted code can't be so much slower than C. For NumPy, I had read this: http://morepypy.blogspot.com/2009/07/pypy-numeric-experiments.html, and it mostly implies that NumPy is written in C (it actually says "NumPy's C version", but I missed it). 
And for the specific discussed microbenchmark, the performance gap between NumPy and CPython is ~100x. Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From fijall at gmail.com Thu Aug 19 13:55:00 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 19 Aug 2010 13:55:00 +0200 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: References: <786138.15701.qm@web114011.mail.gq1.yahoo.com> Message-ID: On Thu, Aug 19, 2010 at 1:34 PM, Paolo Giarrusso wrote: > Hi Maciej, > I think you totally misunderstood me, possibly because I was not > clear, see below. In short, I was wondering whether the approach of > the original code made any sense, and my guess was "mostly not", > exactly because there is little constant folding possible in the code, > as it is written. That's always possible :) > > [Hart, I don't think that any O(N^2) implementation of DFT (what is in > the code), i.e. two nested for loops, should be written to explicitly > take advantage of the JIT. I don't know about the FFT algorithm, but a > few vague ideas say "yes", because constant folding the length could > _maybe_ allow constant folding the permutations applied to data in the > Cooley?Tukey FFT algorithm.] > > On Thu, Aug 19, 2010 at 12:11, Maciej Fijalkowski wrote: >> Hi > > >>> Moreover, I'm not sure you need to use the JIT yourself. >>> - Your code is RPython, so you could as well just translate it without >>> JIT annotations, and it will be compiled to C code. >>> - Otherwise, you could write that as a app-level function, i.e. in >>> normal Python, and pass it to a translated PyPy-JIT interpreter. Did >>> you try and benchmark the code? >>> Can I ask you why you did not write that as a app-level function, i.e. >>> as normal Python code, to use PyPy's JIT directly, without needing >>> detailed understanding of the JIT? >>> It would be interesting to see a comparison (and have it on the web, >>> after some code review). >> >> JIT can essentially speed up based on constant folding based on >> bytecode. Bytecode should be the only green variable here and all >> others (that you don't want to specialize over) should be red and not >> promoted. In your case it's very likely you compile new loop very >> often (overspecialization). > > I see no bytecode in the example - it's a DFT implementation. > For each combination of green variables, there are 1024 iterations, > and there are 1024 such combinations, so overspecialization is almost > guaranteed. Agreed. > > My next question, inspired from the specific code, is: is JITted code > ever thrown away, if too much is generated? Even for valid use cases, > most JITs can generate too much code, and they need then to choose > what to keep and what to throw away. No, as of now, never. In general in case of Python it would have to be a heuristic anyway (since code objects are mostly immortal and you can't decide whether certain combination of assumptions will occur in the future or not). We have some ideas which code will never run any more and besides that, we need to implement some heuristics when to throw away code. > >>> Especially, I'm not sure that as currently written you're getting any >>> speedup, and I seriously wonder whether the JIT could give an >>> additional speedup over RPython here (the regexp interpreter is a >>> completely different case, since it compiles a regexp, but why do you >>> compile an array?). >> >> That's silly, our python interpreter is an RPython program. 
Anything >> that can have a meaningfully defined "bytecode" or a "compile time >> constant" can be sped up by the JIT. For example a templating >> language. > > You misunderstood me, I totally agree with you, and my understanding > is that in the given program (which I read almost fully) constant > folding makes little sense. Great :) I might have misunderstood you. > Since that program is written with RPython + JIT, but it has green > variables which are not at all "compile time constants", "I wonder > seriously" was meant as "I wonder seriously whether what you are > trying makes any sense". As I argued, the only constant folding > possible is for the array length. And again, I wonder whether it's > worth it, my guess tends towards "no", but a benchmark is needed > (there will be some improvement probably). I guess the answer is "hell no", simply because if you don't constant fold our assembler would not be nearly as good as gcc's one (if nothing else). > > I was just a bit vaguer because I just studied docs on PyPy (and > papers about tracing compilation). But your answer confirms that my > original analysis is correct, and that I should write more clearly > maybe. > >>> I think just raw CPython can be 340x slower than C (I assume NumPy >>> uses C) > >> You should check more and have less assumptions. > > I did some checks, on PyPy's blog actually, not definitive though, and > I stand by what I meant (see below). Without reading the pastie in > full, however, my comments are out of context. > I guess your tone is fine, since you thought I wrote nonsense. But in > general, I have yet to see a guideline forbidding "IIRC" and similar > ways of discussing (the above was an _educated_ guess), especially > when the writer remembers correctly (as in this case). > Having said that, I'm always happy to see counterexamples and learn > something, if they exist. In this case, for what I actually meant (and > wrote, IMHO), a counterexample would be a RPython or a JITted program >>= 340x slower than C. My comment was merely about "numpy is written in C". > > For the speed ratio, the code pastie writes that RPython JITted code > is 340x slower than NumPy code, and I was writing that it's > unreasonable; in this case, it happens because of overspecialization > caused by misuse of the JIT. Yes. > > For speed ratios among CPython, C, RPython, I was comparing to > http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.html. > What I meant is that JITted code can't be so much slower than C. > > For NumPy, I had read this: > http://morepypy.blogspot.com/2009/07/pypy-numeric-experiments.html, > and it mostly implies that NumPy is written in C (it actually says > "NumPy's C version", but I missed it). And for the specific discussed > microbenchmark, the performance gap between NumPy and CPython is > ~100x. Yes, there is a slight difference :-) numpy is written mostly in C (at least glue code), but a lot of algorithms call back to some other stuff (depending what you have installed) which as far as I'm concerned might be whatever (most likely fortran or SSE assembler at some level.) > > Best regards > -- > Paolo Giarrusso - Ph.D. 
Student > http://www.informatik.uni-marburg.de/~pgiarrusso/ > From bhartsho at yahoo.com Fri Aug 20 08:04:48 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Thu, 19 Aug 2010 23:04:48 -0700 (PDT) Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: Message-ID: <23133.28490.qm@web114006.mail.gq1.yahoo.com> Hi Paolo, thanks for your in-depth response, i tried your suggestions and noticed a big speed improvement with no more degrading performance, i didn't realize having more green is bad. However it still runs 4x slower than just plain old compiled RPython, i checked if the JIT was really running, and your right its not actually using any JIT'ed code, it only traces and then aborts, though now i can not figure out why it aborts after trying several things. I didn't write this as an app-level function because i wanted to understand how the JIT works on a deeper level and with RPython. I had seen the blog post before by Carl Friedrich Bolz about JIT'ing and that he was able to speed things up 22x faster than plain RPython translated to C, so that got me curious about the JIT. Now i understand that that was an exceptional case, but what other cases might RPython+JIT be useful? And its good to see here what if any speed up there will be in the worst case senairo. Sorry about all the confusion about numpy being 340x faster, i should have added in that note that i compared numpy fast fourier transform to Rpython direct fourier transform, and direct is known to be hundreds of times slower. (numpy lacks a DFT to compare to) updated code with only the length as green: http://pastebin.com/DnJikXze The jitted function now checks jit.we_are_jitted(), and prints 'unjitted' if there is no jitting. abort: trace too long seems to happen every trace, so we_are_jitted() is never true, and the 4x overhead compared to compiled RPython is then understandable. trace_limit is set to its maximum, so why is it aborting? Here is my settings: jitdriver.set_param('threshold', 4) jitdriver.set_param('trace_eagerness', 4) jitdriver.set_param('trace_limit', sys.maxint) jitdriver.set_param('debug', 3) Tracing: 80 1.019871 Backend: 0 0.000000 Running asm: 0 Blackhole: 80 TOTAL: 16.785704 ops: 1456160 recorded ops: 1200000 calls: 99080 guards: 430120 opt ops: 0 opt guards: 0 forcings: 0 abort: trace too long: 80 abort: compiling: 0 abort: vable escape: 0 nvirtuals: 0 nvholes: 0 nvreused: 0 --- On Thu, 8/19/10, Maciej Fijalkowski wrote: > From: Maciej Fijalkowski > Subject: Re: [pypy-dev] JIT'ed function performance degrades > To: "Paolo Giarrusso" > Cc: "Hart's Antler" , pypy-dev at codespeak.net > Date: Thursday, 19 August, 2010, 4:55 AM > On Thu, Aug 19, 2010 at 1:34 PM, > Paolo Giarrusso > wrote: > > Hi Maciej, > > I think you totally misunderstood me, possibly because > I was not > > clear, see below. In short, I was wondering whether > the approach of > > the original code made any sense, and my guess was > "mostly not", > > exactly because there is little constant folding > possible in the code, > > as it is written. > > That's always possible :) > > > > > [Hart, I don't think that any O(N^2) implementation of > DFT (what is in > > the code), i.e. two nested for loops, should be > written to explicitly > > take advantage of the JIT. I don't know about the FFT > algorithm, but a > > few vague ideas say "yes", because constant folding > the length could > > _maybe_ allow constant folding the permutations > applied to data in the > > Cooley?Tukey FFT algorithm.] 
> > > > On Thu, Aug 19, 2010 at 12:11, Maciej Fijalkowski > > wrote: > >> Hi > > > > > >>> Moreover, I'm not sure you need to use the JIT > yourself. > >>> - Your code is RPython, so you could as well > just translate it without > >>> JIT annotations, and it will be compiled to C > code. > >>> - Otherwise, you could write that as a > app-level function, i.e. in > >>> normal Python, and pass it to a translated > PyPy-JIT interpreter. Did > >>> you try and benchmark the code? > >>> Can I ask you why you did not write that as a > app-level function, i.e. > >>> as normal Python code, to use PyPy's JIT > directly, without needing > >>> detailed understanding of the JIT? > >>> It would be interesting to see a comparison > (and have it on the web, > >>> after some code review). > >> > >> JIT can essentially speed up based on constant > folding based on > >> bytecode. Bytecode should be the only green > variable here and all > >> others (that you don't want to specialize over) > should be red and not > >> promoted. In your case it's very likely you > compile new loop very > >> often (overspecialization). > > > > I see no bytecode in the example - it's a DFT > implementation. > > For each combination of green variables, there are > 1024 iterations, > > and there are 1024 such combinations, so > overspecialization is almost > > guaranteed. > > Agreed. > > > > > My next question, inspired from the specific code, is: > is JITted code > > ever thrown away, if too much is generated? Even for > valid use cases, > > most JITs can generate too much code, and they need > then to choose > > what to keep and what to throw away. > > No, as of now, never. In general in case of Python it would > have to be > a heuristic anyway (since code objects are mostly immortal > and you > can't decide whether certain combination of assumptions > will occur in > the future or not). We have some ideas which code will > never run any > more and besides that, we need to implement some heuristics > when to > throw away code. > > > > >>> Especially, I'm not sure that as currently > written you're getting any > >>> speedup, and I seriously wonder whether the > JIT could give an > >>> additional speedup over RPython here (the > regexp interpreter is a > >>> completely different case, since it compiles a > regexp, but why do you > >>> compile an array?). > >> > >> That's silly, our python interpreter is an RPython > program. Anything > >> that can have a meaningfully defined "bytecode" or > a "compile time > >> constant" can be sped up by the JIT. For example a > templating > >> language. > > > > You misunderstood me, I totally agree with you, and my > understanding > > is that in the given program (which I read almost > fully) constant > > folding makes little sense. > > Great :) I might have misunderstood you. > > > Since that program is written with RPython + JIT, but > it has green > > variables which are not at all "compile time > constants", "I wonder > > seriously" was meant as "I wonder seriously whether > what you are > > trying makes any sense". As I argued, the only > constant folding > > possible is for the array length. And again, I wonder > whether it's > > worth it, my guess tends towards "no", but a benchmark > is needed > > (there will be some improvement probably). > > I guess the answer is "hell no", simply because if you > don't constant > fold our assembler would not be nearly as good as gcc's one > (if > nothing else). 
> > > > > I was just a bit vaguer because I just studied docs on > PyPy (and > > papers about tracing compilation). But your answer > confirms that my > > original analysis is correct, and that I should write > more clearly > > maybe. > > > >>> I think just raw CPython can be 340x slower > than C (I assume NumPy > >>> uses C) > > > >> You should check more and have less assumptions. > > > > I did some checks, on PyPy's blog actually, not > definitive though, and > > I stand by what I meant (see below). Without reading > the pastie in > > full, however, my comments are out of context. > > I guess your tone is fine, since you thought I wrote > nonsense. But in > > general, I have yet to see a guideline forbidding > "IIRC" and similar > > ways of discussing (the above was an _educated_ > guess), especially > > when the writer remembers correctly (as in this > case). > > Having said that, I'm always happy to see > counterexamples and learn > > something, if they exist. In this case, for what I > actually meant (and > > wrote, IMHO), a counterexample would be a RPython or a > JITted program > >>= 340x slower than C. > > My comment was merely about "numpy is written in C". > > > > > For the speed ratio, the code pastie writes that > RPython JITted code > > is 340x slower than NumPy code, and I was writing that > it's > > unreasonable; in this case, it happens because of > overspecialization > > caused by misuse of the JIT. > > Yes. > > > > > For speed ratios among CPython, C, RPython, I was > comparing to > > http://morepypy.blogspot.com/2010/06/jit-for-regular-expression-matching.html. > > What I meant is that JITted code can't be so much > slower than C. > > > > For NumPy, I had read this: > > http://morepypy.blogspot.com/2009/07/pypy-numeric-experiments.html, > > and it mostly implies that NumPy is written in C (it > actually says > > "NumPy's C version", but I missed it). And for the > specific discussed > > microbenchmark, the performance gap between NumPy and > CPython is > > ~100x. > > Yes, there is a slight difference :-) numpy is written > mostly in C (at > least glue code), but a lot of algorithms call back to some > other > stuff (depending what you have installed) which as far as > I'm > concerned might be whatever (most likely fortran or SSE > assembler at > some level.) > > > > > Best regards > > -- > > Paolo Giarrusso - Ph.D. Student > > http://www.informatik.uni-marburg.de/~pgiarrusso/ > > > From timon.elviejo at gmail.com Fri Aug 20 09:58:07 2010 From: timon.elviejo at gmail.com (=?ISO-8859-1?Q?Jorge_Tim=F3n?=) Date: Fri, 20 Aug 2010 09:58:07 +0200 Subject: [pypy-dev] gpgpu and pypy Message-ID: Hi, I'm just curious about the feasibility of running python code in a gpu by extending pypy. I don't have the time (and probably the knowledge neither) to develop that pypy extension, but I just want to know if it's possible. I'm interested in languages like openCL and nvidia's CUDA because I think the future of supercomputing is going to be GPGPU. There's people working in bringing GPGPU to python: http://mathema.tician.de/software/pyopencl http://mathema.tician.de/software/pycuda Would it be possible to run python code in parallel without the need (for the developer) of actively parallelizing the code? I'm not talking about code of hard concurrency, but of code with intrinsic parallelism (let's say matrix multiplication). Would a JIT compilation be capable of detecting parallelism? Would it be interesting or that's a job we must leave to humans by now? What do you think? 
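For reference, the explicit-parallelism style those two libraries expose looks roughly like the following PyOpenCL sketch (adapted from the kind of demo shipped with pyopencl; it assumes numpy, pyopencl and a working OpenCL driver, and API details may differ between versions):

    import numpy
    import pyopencl as cl

    a = numpy.random.rand(50000).astype(numpy.float32)
    b = numpy.random.rand(50000).astype(numpy.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # the kernel itself is still written in OpenCL C, not Python
    prg = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *c) {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, c_buf)

    c = numpy.empty_like(a)
    cl.enqueue_copy(queue, c, c_buf)  # older releases spell this enqueue_read_buffer

The parallel part has to be written out explicitly as a kernel; nothing here is inferred automatically from ordinary Python code.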
I don't know if I had explain myself because English is not my first language. Cheers, Jorge Tim?n -------------- next part -------------- An HTML attachment was scrubbed... URL: From william.leslie.ttg at gmail.com Fri Aug 20 10:05:50 2010 From: william.leslie.ttg at gmail.com (William Leslie) Date: Fri, 20 Aug 2010 18:05:50 +1000 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: <23133.28490.qm@web114006.mail.gq1.yahoo.com> References: <23133.28490.qm@web114006.mail.gq1.yahoo.com> Message-ID: On 20 August 2010 16:04, Hart's Antler wrote: > Hi Paolo, > > thanks for your in-depth response, i tried your suggestions and noticed a big speed improvement with no more degrading performance, i didn't realize having more green is bad. ?However it still runs 4x slower than just plain old compiled RPython, i checked if the JIT was really running, and your right its not actually using any JIT'ed code, it only traces and then aborts, though now i can not figure out why it aborts after trying several things. > > I didn't write this as an app-level function because i wanted to understand how the JIT works on a deeper level and with RPython. ?I had seen the blog post before by Carl Friedrich Bolz about JIT'ing and that he was able to speed things up 22x faster than plain RPython translated to C, so that got me curious about the JIT. ?Now i understand that that was an exceptional case, but what other cases might RPython+JIT be useful? ?And its good to see here what if any speed up there will be in the worst case senairo. > > Sorry about all the confusion about numpy being 340x faster, i should have added in that note that i compared numpy fast fourier transform to Rpython direct fourier transform, and direct is known to be hundreds of times slower. ?(numpy lacks a DFT to compare to) > > updated code with only the length as green: http://pastebin.com/DnJikXze > > The jitted function now checks jit.we_are_jitted(), and prints 'unjitted' if there is no jitting. > abort: trace too long seems to happen every trace, so we_are_jitted() is never true, and the 4x overhead compared to compiled RPython is then understandable. > > trace_limit is set to its maximum, so why is it aborting? ?Here is my settings: > ? ? ? ?jitdriver.set_param('threshold', 4) > ? ? ? ?jitdriver.set_param('trace_eagerness', 4) > ? ? ? ?jitdriver.set_param('trace_limit', sys.maxint) > ? ? ? ?jitdriver.set_param('debug', 3) > > > Tracing: ? ? ? ?80 ? ? ?1.019871 > Backend: ? ? ? ?0 ? ? ? 0.000000 > Running asm: ? ? ? ? ? ?0 > Blackhole: ? ? ? ? ? ? ?80 > TOTAL: ? ? ? ? ? ? ? ? ?16.785704 > ops: ? ? ? ? ? ? ? ? ? ?1456160 > recorded ops: ? ? ? ? ? 1200000 > ?calls: ? ? ? ? ? ? ? ?99080 > guards: ? ? ? ? ? ? ? ? 430120 > opt ops: ? ? ? ? ? ? ? ?0 > opt guards: ? ? ? ? ? ? 0 > forcings: ? ? ? ? ? ? ? 0 > abort: trace too long: ?80 > abort: compiling: ? ? ? 0 > abort: vable escape: ? ?0 > nvirtuals: ? ? ? ? ? ? ?0 > nvholes: ? ? ? ? ? ? ? ?0 > nvreused: ? ? ? ? ? ? ? 0 This application probably isn't a very good use for the jit because it has very little control flow. It may unroll the loop, but you're probably not gaining anything there. As long as the methods get inlined (as there is no polymorphic dispatch here that I can see), jit can't improve on this much. What optimisations do you expect it to make? 
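For context, the kind of loop under discussion is essentially the following (a naive O(N^2) DFT sketch in plain Python; the pastebin code works on flat float arrays, but the control-flow shape is the same -- two nested loops and no data-dependent branches):

    import math

    def dft(signal):
        n = len(signal)
        out = []
        for k in range(n):
            re = 0.0
            im = 0.0
            for t in range(n):
                angle = -2.0 * math.pi * k * t / n
                re += signal[t] * math.cos(angle)
                im += signal[t] * math.sin(angle)
            out.append((re, im))
        return out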
-- William Leslie From arigo at tunes.org Fri Aug 20 11:31:58 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 20 Aug 2010 11:31:58 +0200 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: <786138.15701.qm@web114011.mail.gq1.yahoo.com> References: <786138.15701.qm@web114011.mail.gq1.yahoo.com> Message-ID: <20100820093158.GA16244@code0.codespeak.net> Hi Hart, On Wed, Aug 18, 2010 at 06:25:02PM -0700, Hart's Antler wrote: > I am starting to learn how to use the JIT, and i'm confused why my > function gets slower over time, twice as slow after running for a few > minutes. Using a virtualizable did speed up my code, but it still has > the degrading performance problem. I have yesterdays SVN and using > 64bit with boehm. I understand boehm is slower, but overall my JIT'ed > function is many times slower than un-jitted, is this expected > behavior from boehm? It seems that there are still issues with the 64-bit JIT -- it could be something along the line of "the guards are not correctly overwritten", or likely something more subtle along these lines, causing more and more assembler to be produced. We have observed "infinite"-looking memory usage for long-running programs, too. Note that in the example you posted, you are doing the common mistake of putting some code (the looping condition) between can_enter_jit and jit_merge_point. We should really do something about checking that people don't do that. It mostly works, except in some cases where it doesn't :-( The issue is more precisely: while x < y: my_jit_driver.jit_merge_point(...) ...loop body... my_jit_driver.can_enter_jit(...) In this case, the "x < y" is evaluated between can_enter_jit and jit_merge_point, and that's the mistake. You should rewrite your examples as: while x < y: my_jit_driver.can_enter_jit(...) my_jit_driver.jit_merge_point(...) ...loop body... A bientot, Armin. From arigo at tunes.org Fri Aug 20 11:45:24 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 20 Aug 2010 11:45:24 +0200 Subject: [pypy-dev] JIT'ed function performance degrades In-Reply-To: <23133.28490.qm@web114006.mail.gq1.yahoo.com> References: <23133.28490.qm@web114006.mail.gq1.yahoo.com> Message-ID: <20100820094524.GB16244@code0.codespeak.net> Hi Hart, On Thu, Aug 19, 2010 at 11:04:48PM -0700, Hart's Antler wrote: > I had seen the blog post before by Carl Friedrich Bolz about JIT'ing > and that he was able to speed things up 22x faster than plain RPython > translated to C, so that got me curious about the JIT. You cannot expect any program to get 22x faster with RPython+JIT than it is with just RPython. That would be like saying that any C program can get 22x faster if we apply some special JIT on it. For a general C program, such a statement makes no sense -- no JIT can help. The PyPy JIT can help *only* if the RPython program in question is some kind of interpreter, with a loose definition of interpreter. That's why we can apply the PyPy JIT to the Python interpreter written in RPython; or to some other examples, like Carl Friedrich's blog post about the regular expressions "interpreter". A bientot, Armin. From arigo at tunes.org Fri Aug 20 11:57:21 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 20 Aug 2010 11:57:21 +0200 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? 
In-Reply-To: References: Message-ID: <20100820095721.GC16244@code0.codespeak.net> Hi Sakesun, On Thu, Aug 19, 2010 at 11:25:42AM +0700, sakesun roykiatisak wrote: > >>> f = open('xxx', 'w') > >>> f.write('stuff') > >>> del f > > Also, I've tried that with both Jython and IronPython and they all work > fine. I guess that you didn't try exactly the same thing. If I do: arigo at tannit ~ $ jython Jython 2.2.1 on java1.6.0_20 Type "copyright", "credits" or "license" for more information. >>> open('x', 'w').write('hello') >>> Then "cat x" in another terminal shows an empty file. The file "x" is only filled when I exit Jython. It is exactly the same behavior as I get on PyPy. Maybe I missed something, and there is a different way to do things such that it works on Jython but not on PyPy; if so, can you describe it more precisely? Thanks! A bientot, Armin. From donny.viszneki at gmail.com Fri Aug 20 12:23:26 2010 From: donny.viszneki at gmail.com (Donny Viszneki) Date: Fri, 20 Aug 2010 06:23:26 -0400 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: <20100820095721.GC16244@code0.codespeak.net> References: <20100820095721.GC16244@code0.codespeak.net> Message-ID: Armin: Sakesun used "del f" and it appears you did not. In Python IIRC, an explicit call to del should kick off the finalizer to flush and close the file! open('x', 'w').write('hello') alone does not imply the file instance (return value of open()) has been finalized because the garbage collector may not have hit it yet. Jython and IronPython are pretty much guaranteed to behave differently under a wide variety of circumstances when it comes to the garbage collector. Do not rely on the garbage collector for program semantics! Because Sakesun has used "del f" it should be quite a concern that the file has not been finalized properly! On Fri, Aug 20, 2010 at 5:57 AM, Armin Rigo wrote: > Hi Sakesun, > > On Thu, Aug 19, 2010 at 11:25:42AM +0700, sakesun roykiatisak wrote: >> >>> f = open('xxx', 'w') >> >>> f.write('stuff') >> >>> del f >> >> Also, I've tried that with both Jython and IronPython and they all work >> fine. > > I guess that you didn't try exactly the same thing. ?If I do: > > ? ?arigo at tannit ~ $ jython > ? ?Jython 2.2.1 on java1.6.0_20 > ? ?Type "copyright", "credits" or "license" for more information. > ? ?>>> open('x', 'w').write('hello') > ? ?>>> > > Then "cat x" in another terminal shows an empty file. ?The file "x" is > only filled when I exit Jython. ?It is exactly the same behavior as I > get on PyPy. ?Maybe I missed something, and there is a different way to > do things such that it works on Jython but not on PyPy; if so, can you > describe it more precisely? ?Thanks! > > > A bientot, > > Armin. > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > -- http://codebad.com/ From william.leslie.ttg at gmail.com Fri Aug 20 12:32:34 2010 From: william.leslie.ttg at gmail.com (William Leslie) Date: Fri, 20 Aug 2010 20:32:34 +1000 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: References: <20100820095721.GC16244@code0.codespeak.net> Message-ID: It seems you too have missed the difference between deleting some reference to the object (as del does) and finalising. On 20/08/2010 8:23 PM, "Donny Viszneki" wrote: Armin: Sakesun used "del f" and it appears you did not. In Python IIRC, an explicit call to del should kick off the finalizer to flush and close the file! 
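A small self-contained check of what `del` actually does (the replies below make the same point): `del` only removes a name binding; when the finalizer runs is a property of the garbage collector, and only CPython's reference counting makes it immediate.

    class Noisy(object):
        def __del__(self):
            print 'finalized'

    x = Noisy()
    y = x
    del x   # prints nothing on any implementation: the object is still reachable via y
    del y   # CPython prints 'finalized' here, when the refcount hits zero;
            # PyPy, Jython and IronPython may only do it at some later GC, if at all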
open('x', 'w').write('hello') alone does not imply the file instance (return value of open()) has been finalized because the garbage collector may not have hit it yet. Jython and IronPython are pretty much guaranteed to behave differently under a wide variety of circumstances when it comes to the garbage collector. Do not rely on the garbage collector for program semantics! Because Sakesun has used "del f" it should be quite a concern that the file has not been finalized properly! On Fri, Aug 20, 2010 at 5:57 AM, Armin Rigo wrote: > Hi Sakesun, > > On Thu, Aug ... -- http://codebad.com/ _______________________________________________ pypy-dev at codespeak.net http://codespeak.net/mailman/... -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Fri Aug 20 13:06:49 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 20 Aug 2010 13:06:49 +0200 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: References: <20100820095721.GC16244@code0.codespeak.net> Message-ID: <20100820110649.GA23268@code0.codespeak.net> Hi Donny, On Fri, Aug 20, 2010 at 06:23:26AM -0400, Donny Viszneki wrote: > Armin: Sakesun used "del f" and it appears you did not. As explained earlier this makes no difference. E.g. in any Python version, the following code would not call the __del__ method of the object x either: >>> x = SomeClassWithADel() >>> y = x >>> del x A bientot, Armin. From p.giarrusso at gmail.com Fri Aug 20 15:39:22 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 20 Aug 2010 15:39:22 +0200 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: References: <20100820095721.GC16244@code0.codespeak.net> Message-ID: On Fri, Aug 20, 2010 at 12:23, Donny Viszneki wrote: > Armin: Sakesun used "del f" and it appears you did not. Actually, he didn't either. He said "I think that open(?xxx?, ?w?).write(?stuff?)" is equivalent to using del (which he thought would work), and the equivalence was correct. Anyway, in the _first reply_ message, he realized that using: ipy -c "open(?xxx?, ?w?).write(?stuff?)" jython -c "open(?xxx?, ?w?).write(?stuff?)" made a difference (because the interpreter exited), so that problem was solved. His mail implies that on PyPy he typed the code at the prompt, rather than at -c. > In Python > IIRC, an explicit call to del should kick off the finalizer to flush > and close the file! No, as shown by Armin. del will just clear the reference, which on CPython means decreasing the refcount. Refcounting will then finalize the object immediately, a GC at some later point, if it runs at all - there's no such guarantee on Java and .NET. For Java, that's unless you do special unsafe setup (System.runFinalizersOnExit(), it's discouraged for a number of reasons, see docs). On .NET, I expect a such method to exist, too, since they were so unaware of problems wiith finalizers in .NET 1.0 to give them the syntax of destructors. But .NET 2.0 has SafeHandles, which guarantee release of critical resources if the "finalization" code follows some restriction, using _reference counting_: http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.safehandle.aspx http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.safehandle.dangerousaddref.aspx > open('x', 'w').write('hello') alone does not imply the file instance > (return value of open()) has been finalized because the garbage > collector may not have hit it yet. 
On CPython, you have such an implication, because of refcounting semantics. > On Fri, Aug 20, 2010 at 5:57 AM, Armin Rigo wrote: >> Hi Sakesun, >> >> On Thu, Aug 19, 2010 at 11:25:42AM +0700, sakesun roykiatisak wrote: >>> >>> f = open('xxx', 'w') >>> >>> f.write('stuff') >>> >>> del f >>> >>> Also, I've tried that with both Jython and IronPython and they all work >>> fine. >> >> I guess that you didn't try exactly the same thing. ?If I do: >> >> ? ?arigo at tannit ~ $ jython >> ? ?Jython 2.2.1 on java1.6.0_20 >> ? ?Type "copyright", "credits" or "license" for more information. >> ? ?>>> open('x', 'w').write('hello') >> ? ?>>> >> >> Then "cat x" in another terminal shows an empty file. ?The file "x" is >> only filled when I exit Jython. ?It is exactly the same behavior as I >> get on PyPy. ?Maybe I missed something, and there is a different way to >> do things such that it works on Jython but not on PyPy; if so, can you >> describe it more precisely? ?Thanks! -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From arigo at tunes.org Fri Aug 20 16:06:31 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 20 Aug 2010 16:06:31 +0200 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: References: <20100820095721.GC16244@code0.codespeak.net> Message-ID: <20100820140631.GA3513@code0.codespeak.net> Hi Donny, On Fri, Aug 20, 2010 at 06:23:26AM -0400, Donny Viszneki wrote: > Armin: Sakesun used "del f" and it appears you did not. In Python > IIRC, an explicit call to del should kick off the finalizer to flush > and close the file! No, you are wrong. Try for example: >>> f = open('xxx') >>> g = f >>> del f After this, 'g' still refers to the file, and it is still open. If you want the file to be flushed and closed, then call 'f.close()' :-) A bientot, Armin. From p.giarrusso at gmail.com Fri Aug 20 19:01:07 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 20 Aug 2010 19:01:07 +0200 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: 2010/8/20 Jorge Tim?n : > Hi, I'm just curious about the feasibility of running python code in a gpu > by extending pypy. Disclaimer: I am not a PyPy developer, even if I've been following the project with interest. Nor am I an expert of GPU - I provide links to the literature I've read. Yet, I believe that such an attempt is unlikely to be interesting. Quoting Wikipedia's synthesis: "Unlike CPUs however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very fast." And significant optimizations are needed anyway to get performance for GPU code (and if you don't need the last bit of performance, why bother with a GPU?), so I think that the need to use a C-like language is the smallest problem. > I don't have the time (and probably the knowledge neither) to develop that > pypy extension, but I just want to know if it's possible. > I'm interested in languages like openCL and nvidia's CUDA because I think > the future of supercomputing is going to be GPGPU. I would like to point out that while for some cases it might be right, the importance of GPGPU is probably often exaggerated: http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# Researchers in the field are mostly aware of the fact that GPGPU is the way to go only for a very restricted category of code. For that code, fine. 
Thus, instead of running Python code in a GPU, designing from scratch an easy way to program a GPU efficiently, for those task, is better, and projects for that already exist (i.e. what you cite). Additionally, it would take probably a different kind of JIT to exploit GPUs. No branch prediction, very small non-coherent caches, no efficient synchronization primitives, as I read from this paper... I'm no expert, but I guess you'd need to rearchitecture from scratch the needed optimizations. And it took 20-30 years to get from the first, slow Lisp (1958) to, say, Self (1991), a landmark in performant high-level languages, derived from SmallTalk. Most of that would have to be redone. So, I guess that the effort to compile Python code for a GPU is not worth it. There might be further reasons due to the kind of code a JIT generates, since a GPU has no branch predictor, no caches, and so on, but I'm no GPU expert and I would have to check again. Finally, for general purpose code, exploiting the big expected number of CPUs on our desktop systems is already a challenge. > There's people working in > bringing GPGPU to python: > > http://mathema.tician.de/software/pyopencl > http://mathema.tician.de/software/pycuda > > Would it be possible to run python code in parallel without the need (for > the developer) of actively parallelizing the code? I would say that Python is not yet the language to use to write efficient parallel code, because of the Global Interpreter Lock (Google for "Python GIL"). The two implementations having no GIL are IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, and the current focus is not on removing it. Scientific computing uses external libraries (like NumPy) - for the supported algorithms, one could introduce parallelism at that level. If that's enough for your application, good. If you want to write a parallel algorithm in Python, we're not there yet. > I'm not talking about code of hard concurrency, but of code with intrinsic > parallelism (let's say matrix multiplication). Automatic parallelization is hard, see: http://en.wikipedia.org/wiki/Automatic_parallelization Lots of scientists have tried, lots of money has been invested, but it's still hard. The only practical approaches still require the programmer to introduce parallelism, but in ways much simpler than using multithreading directly. Google OpenMP and Cilk. > Would a JIT compilation be capable of detecting parallelism? Summing up what is above, probably not. Moreover, matrix multiplication may not be so easy as one might think. I do not know how to write it for a GPU, but in the end I reference some suggestions from that paper (where it is one of the benchmarks). But here, I explain why writing it for a CPU is complicated. You can multiply two matrixes with a triply nested for, but such an algorithm has poor performance for big matrixes because of bad cache locality. GPUs, according to the above mentioned paper, provide no caches and hides latency in other ways. See here for the two main alternative ideas which allow solving this problem of writing an efficient matrix multiplication algorithm: http://en.wikipedia.org/wiki/Cache_blocking http://en.wikipedia.org/wiki/Cache-oblivious_algorithm Then, you need to parallelize the resulting code yourself, which might or might not be easy (depending on the interactions between the parallel blocks that are found there). 
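As a rough illustration of the cache-blocking idea, a sketch in plain Python (the block size 64 is an arbitrary placeholder; a real implementation tunes it to the cache sizes and would not be written in pure Python in the first place):

    def blocked_matmul(A, B, n, bs=64):
        # C = A * B for n x n matrices stored as lists of lists,
        # computed block by block so each block of B stays in cache while it is reused
        C = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, bs):
            for kk in range(0, n, bs):
                for jj in range(0, n, bs):
                    for i in range(ii, min(ii + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            a_ik = A[i][k]
                            row_b = B[k]
                            row_c = C[i]
                            for j in range(jj, min(jj + bs, n)):
                                row_c[j] += a_ik * row_b[j]
        return C

The three outer loops walk over blocks and the three inner loops do an ordinary small multiplication inside one block; the per-block work is independent, which is what makes the parallelization step comparatively easy.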
In that paper, where matrix multiplication is called as SGEMM (the BLAS routine implementing it), they suggest using a cache-blocked version of matrix multiplication for both CPUs and GPUs, and argue that parallelization is then easy. Cheers, -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From jbaker at zyasoft.com Fri Aug 20 20:20:17 2010 From: jbaker at zyasoft.com (Jim Baker) Date: Fri, 20 Aug 2010 12:20:17 -0600 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: <20100820140631.GA3513@code0.codespeak.net> References: <20100820095721.GC16244@code0.codespeak.net> <20100820140631.GA3513@code0.codespeak.net> Message-ID: Obviously please close the file, ideally using something like the with-statement or at least finally. But for perhaps the convenience of scripters, and the sorrow of everyone else ;), Jython will close the file upon clean termination of the JVM via registering a closer of such files with Runtime#addShutdownHook This is currently part of the most important outstanding bugin Jython 2.5.2, and something that has to be resolved for 2.5.2 beta 2, because of how it interacts with classloaders and prevents their class GC upon reload (thus potentially exhausting permgen). On Fri, Aug 20, 2010 at 8:06 AM, Armin Rigo wrote: > Hi Donny, > > On Fri, Aug 20, 2010 at 06:23:26AM -0400, Donny Viszneki wrote: > > Armin: Sakesun used "del f" and it appears you did not. In Python > > IIRC, an explicit call to del should kick off the finalizer to flush > > and close the file! > > No, you are wrong. Try for example: > > >>> f = open('xxx') > >>> g = f > >>> del f > > After this, 'g' still refers to the file, and it is still open. > > If you want the file to be flushed and closed, then call 'f.close()' :-) > > > A bientot, > > Armin. > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbaker at zyasoft.com Fri Aug 20 20:25:11 2010 From: jbaker at zyasoft.com (Jim Baker) Date: Fri, 20 Aug 2010 12:25:11 -0600 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: Jython single-threaded performance has little to do with a lack of the GIL. Probably the only direct manifestation is seen in the overhead of allocating __dict__ (or dict) objects because Python attributes have volatile memory semantics, which is ensured by the backing of a ConcurrentHashMap, which can be expensive to allocate. There are workarounds. 2010/8/20 Paolo Giarrusso > 2010/8/20 Jorge Tim?n : > > Hi, I'm just curious about the feasibility of running python code in a > gpu > > by extending pypy. > Disclaimer: I am not a PyPy developer, even if I've been following the > project with interest. Nor am I an expert of GPU - I provide links to > the literature I've read. > Yet, I believe that such an attempt is unlikely to be interesting. > Quoting Wikipedia's synthesis: > "Unlike CPUs however, GPUs have a parallel throughput architecture > that emphasizes executing many concurrent threads slowly, rather than > executing a single thread very fast." > And significant optimizations are needed anyway to get performance for > GPU code (and if you don't need the last bit of performance, why > bother with a GPU?), so I think that the need to use a C-like language > is the smallest problem. 
> > > I don't have the time (and probably the knowledge neither) to develop > that > > pypy extension, but I just want to know if it's possible. > > I'm interested in languages like openCL and nvidia's CUDA because I think > > the future of supercomputing is going to be GPGPU. > > I would like to point out that while for some cases it might be right, > the importance of GPGPU is probably often exaggerated: > > > http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# > > Researchers in the field are mostly aware of the fact that GPGPU is > the way to go only for a very restricted category of code. For that > code, fine. > Thus, instead of running Python code in a GPU, designing from scratch > an easy way to program a GPU efficiently, for those task, is better, > and projects for that already exist (i.e. what you cite). > > Additionally, it would take probably a different kind of JIT to > exploit GPUs. No branch prediction, very small non-coherent caches, no > efficient synchronization primitives, as I read from this paper... I'm > no expert, but I guess you'd need to rearchitecture from scratch the > needed optimizations. > And it took 20-30 years to get from the first, slow Lisp (1958) to, > say, Self (1991), a landmark in performant high-level languages, > derived from SmallTalk. Most of that would have to be redone. > > So, I guess that the effort to compile Python code for a GPU is not > worth it. There might be further reasons due to the kind of code a JIT > generates, since a GPU has no branch predictor, no caches, and so on, > but I'm no GPU expert and I would have to check again. > > Finally, for general purpose code, exploiting the big expected number > of CPUs on our desktop systems is already a challenge. > > > There's people working in > > bringing GPGPU to python: > > > > http://mathema.tician.de/software/pyopencl > > http://mathema.tician.de/software/pycuda > > > > Would it be possible to run python code in parallel without the need (for > > the developer) of actively parallelizing the code? > > I would say that Python is not yet the language to use to write > efficient parallel code, because of the Global Interpreter Lock > (Google for "Python GIL"). The two implementations having no GIL are > IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, > and the current focus is not on removing it. > Scientific computing uses external libraries (like NumPy) - for the > supported algorithms, one could introduce parallelism at that level. > If that's enough for your application, good. > If you want to write a parallel algorithm in Python, we're not there yet. > > > I'm not talking about code of hard concurrency, but of code with > intrinsic > > parallelism (let's say matrix multiplication). > > Automatic parallelization is hard, see: > http://en.wikipedia.org/wiki/Automatic_parallelization > > Lots of scientists have tried, lots of money has been invested, but > it's still hard. > The only practical approaches still require the programmer to > introduce parallelism, but in ways much simpler than using > multithreading directly. Google OpenMP and Cilk. > > > Would a JIT compilation be capable of detecting parallelism? > Summing up what is above, probably not. > > Moreover, matrix multiplication may not be so easy as one might think. > I do not know how to write it for a GPU, but in the end I reference > some suggestions from that paper (where it is one of the benchmarks). 
> But here, I explain why writing it for a CPU is complicated. You can > multiply two matrixes with a triply nested for, but such an algorithm > has poor performance for big matrixes because of bad cache locality. > GPUs, according to the above mentioned paper, provide no caches and > hides latency in other ways. > > See here for the two main alternative ideas which allow solving this > problem of writing an efficient matrix multiplication algorithm: > http://en.wikipedia.org/wiki/Cache_blocking > http://en.wikipedia.org/wiki/Cache-oblivious_algorithm > > Then, you need to parallelize the resulting code yourself, which might > or might not be easy (depending on the interactions between the > parallel blocks that are found there). > In that paper, where matrix multiplication is called as SGEMM (the > BLAS routine implementing it), they suggest using a cache-blocked > version of matrix multiplication for both CPUs and GPUs, and argue > that parallelization is then easy. > > Cheers, > -- > Paolo Giarrusso - Ph.D. Student > http://www.informatik.uni-marburg.de/~pgiarrusso/ > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.giarrusso at gmail.com Fri Aug 20 22:45:21 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 20 Aug 2010 22:45:21 +0200 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: 2010/8/20 Jim Baker : > Jython single-threaded performance has little to do with a lack of the GIL. Never implied that - I do believe that a GIL-less fast Python is possible. I just meant we don't have one yet. > Probably the only direct manifestation is seen in the overhead of allocating > __dict__ (or dict) objects because Python attributes have volatile memory > semantics Uh? "Jython memory model" doesn't seem to find anything. Is there any docs on this, with the rationale for the choice you describe? I've only found the Unladen Swallow proposals for a memory model: http://code.google.com/p/unladen-swallow/wiki/MemoryModel (and python-safethread, which I don't like). As a Java programmer using Jython, I wouldn't expect to have any volatile field ever, but I would expect to be able to act on different fields indipendently - the race conditions we have to protect from are the ones on structual modification (unless the table uses open addressing). _This_ can be implemented through ConcurrentHashMap (which also makes all fields volatile), but an implementation not guaranteeing volatile semantics (if possible) would have been equally valid. I am interested because I want to experiment with alternatives. Of course, you can offer stronger semantics, but then you should also advertise that fields are volatile, thus I don't need a lock to pass a reference. > , which is ensured by the backing of a ConcurrentHashMap, which can > be expensive to allocate. There are workarounds. I'm also curious about such workarounds - are they currently implemented or speculations? > 2010/8/20 Paolo Giarrusso >> >> 2010/8/20 Jorge Tim?n : >> > Hi, I'm just curious about the feasibility of running python code in a >> > gpu >> > by extending pypy. >> Disclaimer: I am not a PyPy developer, even if I've been following the >> project with interest. Nor am I an expert of GPU - I provide links to >> the literature I've read. >> Yet, I believe that such an attempt is unlikely to be interesting. 
>> Quoting Wikipedia's synthesis: >> "Unlike CPUs however, GPUs have a parallel throughput architecture >> that emphasizes executing many concurrent threads slowly, rather than >> executing a single thread very fast." >> And significant optimizations are needed anyway to get performance for >> GPU code (and if you don't need the last bit of performance, why >> bother with a GPU?), so I think that the need to use a C-like language >> is the smallest problem. >> >> > I don't have the time (and probably the knowledge neither) to develop >> > that >> > pypy extension, but I just want to know if it's possible. >> > I'm interested in languages like openCL and nvidia's CUDA because I >> > think >> > the future of supercomputing is going to be GPGPU. >> >> I would like to point out that while for some cases it might be right, >> the importance of GPGPU is probably often exaggerated: >> >> >> http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# >> >> Researchers in the field are mostly aware of the fact that GPGPU is >> the way to go only for a very restricted category of code. For that >> code, fine. >> Thus, instead of running Python code in a GPU, designing from scratch >> an easy way to program a GPU efficiently, for those task, is better, >> and projects for that already exist (i.e. what you cite). >> >> Additionally, it would take probably a different kind of JIT to >> exploit GPUs. No branch prediction, very small non-coherent caches, no >> efficient synchronization primitives, as I read from this paper... I'm >> no expert, but I guess you'd need to rearchitecture from scratch the >> needed optimizations. >> And it took 20-30 years to get from the first, slow Lisp (1958) to, >> say, Self (1991), a landmark in performant high-level languages, >> derived from SmallTalk. Most of that would have to be redone. >> >> So, I guess that the effort to compile Python code for a GPU is not >> worth it. There might be further reasons due to the kind of code a JIT >> generates, since a GPU has no branch predictor, no caches, and so on, >> but I'm no GPU expert and I would have to check again. >> >> Finally, for general purpose code, exploiting the big expected number >> of CPUs on our desktop systems is already a challenge. >> >> > There's people working in >> > bringing GPGPU to python: >> > >> > http://mathema.tician.de/software/pyopencl >> > http://mathema.tician.de/software/pycuda >> > >> > Would it be possible to run python code in parallel without the need >> > (for >> > the developer) of actively parallelizing the code? >> >> I would say that Python is not yet the language to use to write >> efficient parallel code, because of the Global Interpreter Lock >> (Google for "Python GIL"). The two implementations having no GIL are >> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, >> and the current focus is not on removing it. >> Scientific computing uses external libraries (like NumPy) - for the >> supported algorithms, one could introduce parallelism at that level. >> If that's enough for your application, good. >> If you want to write a parallel algorithm in Python, we're not there yet. >> >> > I'm not talking about code of hard concurrency, but of code with >> > intrinsic >> > parallelism (let's say matrix multiplication). >> >> Automatic parallelization is hard, see: >> http://en.wikipedia.org/wiki/Automatic_parallelization >> >> Lots of scientists have tried, lots of money has been invested, but >> it's still hard. 
>> The only practical approaches still require the programmer to >> introduce parallelism, but in ways much simpler than using >> multithreading directly. Google OpenMP and Cilk. >> >> > Would a JIT compilation be capable of detecting parallelism? >> Summing up what is above, probably not. >> >> Moreover, matrix multiplication may not be so easy as one might think. >> I do not know how to write it for a GPU, but in the end I reference >> some suggestions from that paper (where it is one of the benchmarks). >> But here, I explain why writing it for a CPU is complicated. You can >> multiply two matrixes with a triply nested for, but such an algorithm >> has poor performance for big matrixes because of bad cache locality. >> GPUs, according to the above mentioned paper, provide no caches and >> hides latency in other ways. >> >> See here for the two main alternative ideas which allow solving this >> problem of writing an efficient matrix multiplication algorithm: >> http://en.wikipedia.org/wiki/Cache_blocking >> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm >> >> Then, you need to parallelize the resulting code yourself, which might >> or might not be easy (depending on the interactions between the >> parallel blocks that are found there). >> In that paper, where matrix multiplication is called as SGEMM (the >> BLAS routine implementing it), they suggest using a cache-blocked >> version of matrix multiplication for both CPUs and GPUs, and argue >> that parallelization is then easy. >> >> Cheers, >> -- >> Paolo Giarrusso - Ph.D. Student >> http://www.informatik.uni-marburg.de/~pgiarrusso/ >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev > -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From fijall at gmail.com Fri Aug 20 22:51:42 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 20 Aug 2010 22:51:42 +0200 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: 2010/8/20 Paolo Giarrusso : > 2010/8/20 Jorge Tim?n : >> Hi, I'm just curious about the feasibility of running python code in a gpu >> by extending pypy. > Disclaimer: I am not a PyPy developer, even if I've been following the > project with interest. Nor am I an expert of GPU - I provide links to > the literature I've read. > Yet, I believe that such an attempt is unlikely to be interesting. > Quoting Wikipedia's synthesis: > "Unlike CPUs however, GPUs have a parallel throughput architecture > that emphasizes executing many concurrent threads slowly, rather than > executing a single thread very fast." > And significant optimizations are needed anyway to get performance for > GPU code (and if you don't need the last bit of performance, why > bother with a GPU?), so I think that the need to use a C-like language > is the smallest problem. > >> I don't have the time (and probably the knowledge neither) to develop that >> pypy extension, but I just want to know if it's possible. >> I'm interested in languages like openCL and nvidia's CUDA because I think >> the future of supercomputing is going to be GPGPU. Python is a very different language than CUDA or openCL, hence it's not completely to map python's semantics to something that will make sense for GPU. 
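To illustrate that gap (example written from memory, so treat the details as approximate): even with the pycuda bindings mentioned elsewhere in this thread, the part that actually runs on the GPU is still a CUDA C kernel embedded as a string in the Python program, roughly like the standard pycuda example:

    import numpy
    import pycuda.autoinit            # picks a GPU and sets up a context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # The GPU part is plain CUDA C, compiled at run time by pycuda.
    mod = SourceModule("""
    __global__ void add_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] + b[i];
    }
    """)
    add_them = mod.get_function("add_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    # Python only prepares the data and launches the kernel.
    add_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1), grid=(1, 1))

Python drives the computation, but none of Python's semantics (objects, dicts, exceptions, the GIL) exist inside the kernel - that is the gap a Python-on-GPU JIT would have to bridge.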
> > I would like to point out that while for some cases it might be right, > the importance of GPGPU is probably often exaggerated: > > http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# > > Researchers in the field are mostly aware of the fact that GPGPU is > the way to go only for a very restricted category of code. For that > code, fine. > Thus, instead of running Python code in a GPU, designing from scratch > an easy way to program a GPU efficiently, for those task, is better, > and projects for that already exist (i.e. what you cite). > > Additionally, it would take probably a different kind of JIT to > exploit GPUs. No branch prediction, very small non-coherent caches, no > efficient synchronization primitives, as I read from this paper... I'm > no expert, but I guess you'd need to rearchitecture from scratch the > needed optimizations. > And it took 20-30 years to get from the first, slow Lisp (1958) to, > say, Self (1991), a landmark in performant high-level languages, > derived from SmallTalk. Most of that would have to be redone. > > So, I guess that the effort to compile Python code for a GPU is not > worth it. There might be further reasons due to the kind of code a JIT > generates, since a GPU has no branch predictor, no caches, and so on, > but I'm no GPU expert and I would have to check again. > > Finally, for general purpose code, exploiting the big expected number > of CPUs on our desktop systems is already a challenge. > >> There's people working in >> bringing GPGPU to python: >> >> http://mathema.tician.de/software/pyopencl >> http://mathema.tician.de/software/pycuda >> >> Would it be possible to run python code in parallel without the need (for >> the developer) of actively parallelizing the code? > > I would say that Python is not yet the language to use to write > efficient parallel code, because of the Global Interpreter Lock > (Google for "Python GIL"). The two implementations having no GIL are > IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, > and the current focus is not on removing it. > Scientific computing uses external libraries (like NumPy) - for the > supported algorithms, one could introduce parallelism at that level. > If that's enough for your application, good. > If you want to write a parallel algorithm in Python, we're not there yet. > >> I'm not talking about code of hard concurrency, but of code with intrinsic >> parallelism (let's say matrix multiplication). > > Automatic parallelization is hard, see: > http://en.wikipedia.org/wiki/Automatic_parallelization > > Lots of scientists have tried, lots of money has been invested, but > it's still hard. > The only practical approaches still require the programmer to > introduce parallelism, but in ways much simpler than using > multithreading directly. Google OpenMP and Cilk. > >> Would a JIT compilation be capable of detecting parallelism? > Summing up what is above, probably not. > > Moreover, matrix multiplication may not be so easy as one might think. > I do not know how to write it for a GPU, but in the end I reference > some suggestions from that paper (where it is one of the benchmarks). > But here, I explain why writing it for a CPU is complicated. You can > multiply two matrixes with a triply nested for, but such an algorithm > has poor performance for big matrixes because of bad cache locality. > GPUs, according to the above mentioned paper, provide no caches and > hides latency in other ways. 
> > See here for the two main alternative ideas which allow solving this > problem of writing an efficient matrix multiplication algorithm: > http://en.wikipedia.org/wiki/Cache_blocking > http://en.wikipedia.org/wiki/Cache-oblivious_algorithm > > Then, you need to parallelize the resulting code yourself, which might > or might not be easy (depending on the interactions between the > parallel blocks that are found there). > In that paper, where matrix multiplication is called as SGEMM (the > BLAS routine implementing it), they suggest using a cache-blocked > version of matrix multiplication for both CPUs and GPUs, and argue > that parallelization is then easy. What's interesting in using GPU and a JIT is optimizing numpy vectorized operations to speed up things like big_array_a + big_array_b using SSE and GPU. However, I don't think anyone plans to work on it in a near future and if you don't have time this stays as a topic of interest only :) > > Cheers, > -- > Paolo Giarrusso - Ph.D. Student > http://www.informatik.uni-marburg.de/~pgiarrusso/ > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev From jonah at eecs.berkeley.edu Fri Aug 20 23:05:15 2010 From: jonah at eecs.berkeley.edu (Jeff Anderson-Lee) Date: Fri, 20 Aug 2010 14:05:15 -0700 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: <4C6EEE0B.7060500@eecs.berkeley.edu> On 8/20/2010 1:51 PM, Maciej Fijalkowski wrote: > 2010/8/20 Paolo Giarrusso: >> 2010/8/20 Jorge Tim?n: >>> Hi, I'm just curious about the feasibility of running python code in a gpu >>> by extending pypy. >> Disclaimer: I am not a PyPy developer, even if I've been following the >> project with interest. Nor am I an expert of GPU - I provide links to >> the literature I've read. >> Yet, I believe that such an attempt is unlikely to be interesting. >> Quoting Wikipedia's synthesis: >> "Unlike CPUs however, GPUs have a parallel throughput architecture >> that emphasizes executing many concurrent threads slowly, rather than >> executing a single thread very fast." >> And significant optimizations are needed anyway to get performance for >> GPU code (and if you don't need the last bit of performance, why >> bother with a GPU?), so I think that the need to use a C-like language >> is the smallest problem. >> >>> I don't have the time (and probably the knowledge neither) to develop that >>> pypy extension, but I just want to know if it's possible. >>> I'm interested in languages like openCL and nvidia's CUDA because I think >>> the future of supercomputing is going to be GPGPU. > Python is a very different language than CUDA or openCL, hence it's > not completely to map python's semantics to something that will make > sense for GPU. Try googling: copperhead cuda Also look at: http://code.google.com/p/copperhead/wiki/Installing From fijall at gmail.com Fri Aug 20 23:18:12 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 20 Aug 2010 23:18:12 +0200 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: <4C6EEE0B.7060500@eecs.berkeley.edu> References: <4C6EEE0B.7060500@eecs.berkeley.edu> Message-ID: On Fri, Aug 20, 2010 at 11:05 PM, Jeff Anderson-Lee wrote: > ?On 8/20/2010 1:51 PM, Maciej Fijalkowski wrote: >> 2010/8/20 Paolo Giarrusso: >>> 2010/8/20 Jorge Tim?n: >>>> Hi, I'm just curious about the feasibility of running python code in a gpu >>>> by extending pypy. >>> Disclaimer: I am not a PyPy developer, even if I've been following the >>> project with interest. 
Nor am I an expert of GPU - I provide links to >>> the literature I've read. >>> Yet, I believe that such an attempt is unlikely to be interesting. >>> Quoting Wikipedia's synthesis: >>> "Unlike CPUs however, GPUs have a parallel throughput architecture >>> that emphasizes executing many concurrent threads slowly, rather than >>> executing a single thread very fast." >>> And significant optimizations are needed anyway to get performance for >>> GPU code (and if you don't need the last bit of performance, why >>> bother with a GPU?), so I think that the need to use a C-like language >>> is the smallest problem. >>> >>>> I don't have the time (and probably the knowledge neither) to develop that >>>> pypy extension, but I just want to know if it's possible. >>>> I'm interested in languages like openCL and nvidia's CUDA because I think >>>> the future of supercomputing is going to be GPGPU. >> Python is a very different language than CUDA or openCL, hence it's >> not completely to map python's semantics to something that will make >> sense for GPU. > Try googling: copperhead cuda > Also look at: > > http://code.google.com/p/copperhead/wiki/Installing > What's the point of posting here project which has not released any code? From jbaker at zyasoft.com Fri Aug 20 23:27:20 2010 From: jbaker at zyasoft.com (Jim Baker) Date: Fri, 20 Aug 2010 15:27:20 -0600 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: The Unladen Swallow doc, which was derived from a PEP that Jeff proposed, seems to be a fair descriptive outline of Python memory models in general, and Jython's in specific. Obviously the underlying implementation in the JVM is happens-before consistency; everything else derives from there. The CHM provides additional consistency constraints that should imply sequential consistency for a (vast) subset of Python programs. However, I can readily construct a program that violates sequential consistency: maybe it uses slots (stored in a Java array), or the array module (which also just wraps Java arrays), or by accesses local variables in a frame from another thread (same storage, same problem). Likewise I can also create Python programs that access Java classes (since this is Jython!), and they too will only see happens-before consistency. Naturally, the workarounds I mentioned for improving performance in object allocation all rely on not using CHM and its (modestly) expensive semantics. So this would mean using a Java class in some way, possibly a HashMap (especially one that's been exposed through our type expose mechanism to avoid reflection overhead), or directly using a Java class of some kind (again exposing is best, much like are builtin types like PyInteger), possibly with all fields marked as volatile. Hope this helps! If you are interested in studying this problem in more depth for Jython, or other implementations, and the implications of our hybrid model, it would certainly be most welcome. Unfortunately, it's not something that Jython development itself will be working on (standard time constraints apply here). - Jim 2010/8/20 Paolo Giarrusso > 2010/8/20 Jim Baker : > > Jython single-threaded performance has little to do with a lack of the > GIL. > > Never implied that - I do believe that a GIL-less fast Python is > possible. I just meant we don't have one yet. > > > Probably the only direct manifestation is seen in the overhead of > allocating > > __dict__ (or dict) objects because Python attributes have volatile memory > > semantics > Uh? 
"Jython memory model" doesn't seem to find anything. Is there any > docs on this, with the rationale for the choice you describe? > > I've only found the Unladen Swallow proposals for a memory model: > http://code.google.com/p/unladen-swallow/wiki/MemoryModel (and > python-safethread, which I don't like). > > As a Java programmer using Jython, I wouldn't expect to have any > volatile field ever, but I would expect to be able to act on different > fields indipendently - the race conditions we have to protect from are > the ones on structual modification (unless the table uses open > addressing). > _This_ can be implemented through ConcurrentHashMap (which also makes > all fields volatile), but an implementation not guaranteeing volatile > semantics (if possible) would have been equally valid. > I am interested because I want to experiment with alternatives. > > Of course, you can offer stronger semantics, but then you should also > advertise that fields are volatile, thus I don't need a lock to pass a > reference. > > > , which is ensured by the backing of a ConcurrentHashMap, which can > > be expensive to allocate. There are workarounds. > > I'm also curious about such workarounds - are they currently > implemented or speculations? > > > 2010/8/20 Paolo Giarrusso > >> > >> 2010/8/20 Jorge Tim?n : > >> > Hi, I'm just curious about the feasibility of running python code in a > >> > gpu > >> > by extending pypy. > >> Disclaimer: I am not a PyPy developer, even if I've been following the > >> project with interest. Nor am I an expert of GPU - I provide links to > >> the literature I've read. > >> Yet, I believe that such an attempt is unlikely to be interesting. > >> Quoting Wikipedia's synthesis: > >> "Unlike CPUs however, GPUs have a parallel throughput architecture > >> that emphasizes executing many concurrent threads slowly, rather than > >> executing a single thread very fast." > >> And significant optimizations are needed anyway to get performance for > >> GPU code (and if you don't need the last bit of performance, why > >> bother with a GPU?), so I think that the need to use a C-like language > >> is the smallest problem. > >> > >> > I don't have the time (and probably the knowledge neither) to develop > >> > that > >> > pypy extension, but I just want to know if it's possible. > >> > I'm interested in languages like openCL and nvidia's CUDA because I > >> > think > >> > the future of supercomputing is going to be GPGPU. > >> > >> I would like to point out that while for some cases it might be right, > >> the importance of GPGPU is probably often exaggerated: > >> > >> > >> > http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# > >> > >> Researchers in the field are mostly aware of the fact that GPGPU is > >> the way to go only for a very restricted category of code. For that > >> code, fine. > >> Thus, instead of running Python code in a GPU, designing from scratch > >> an easy way to program a GPU efficiently, for those task, is better, > >> and projects for that already exist (i.e. what you cite). > >> > >> Additionally, it would take probably a different kind of JIT to > >> exploit GPUs. No branch prediction, very small non-coherent caches, no > >> efficient synchronization primitives, as I read from this paper... I'm > >> no expert, but I guess you'd need to rearchitecture from scratch the > >> needed optimizations. 
> >> And it took 20-30 years to get from the first, slow Lisp (1958) to, > >> say, Self (1991), a landmark in performant high-level languages, > >> derived from SmallTalk. Most of that would have to be redone. > >> > >> So, I guess that the effort to compile Python code for a GPU is not > >> worth it. There might be further reasons due to the kind of code a JIT > >> generates, since a GPU has no branch predictor, no caches, and so on, > >> but I'm no GPU expert and I would have to check again. > >> > >> Finally, for general purpose code, exploiting the big expected number > >> of CPUs on our desktop systems is already a challenge. > >> > >> > There's people working in > >> > bringing GPGPU to python: > >> > > >> > http://mathema.tician.de/software/pyopencl > >> > http://mathema.tician.de/software/pycuda > >> > > >> > Would it be possible to run python code in parallel without the need > >> > (for > >> > the developer) of actively parallelizing the code? > >> > >> I would say that Python is not yet the language to use to write > >> efficient parallel code, because of the Global Interpreter Lock > >> (Google for "Python GIL"). The two implementations having no GIL are > >> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, > >> and the current focus is not on removing it. > >> Scientific computing uses external libraries (like NumPy) - for the > >> supported algorithms, one could introduce parallelism at that level. > >> If that's enough for your application, good. > >> If you want to write a parallel algorithm in Python, we're not there > yet. > >> > >> > I'm not talking about code of hard concurrency, but of code with > >> > intrinsic > >> > parallelism (let's say matrix multiplication). > >> > >> Automatic parallelization is hard, see: > >> http://en.wikipedia.org/wiki/Automatic_parallelization > >> > >> Lots of scientists have tried, lots of money has been invested, but > >> it's still hard. > >> The only practical approaches still require the programmer to > >> introduce parallelism, but in ways much simpler than using > >> multithreading directly. Google OpenMP and Cilk. > >> > >> > Would a JIT compilation be capable of detecting parallelism? > >> Summing up what is above, probably not. > >> > >> Moreover, matrix multiplication may not be so easy as one might think. > >> I do not know how to write it for a GPU, but in the end I reference > >> some suggestions from that paper (where it is one of the benchmarks). > >> But here, I explain why writing it for a CPU is complicated. You can > >> multiply two matrixes with a triply nested for, but such an algorithm > >> has poor performance for big matrixes because of bad cache locality. > >> GPUs, according to the above mentioned paper, provide no caches and > >> hides latency in other ways. > >> > >> See here for the two main alternative ideas which allow solving this > >> problem of writing an efficient matrix multiplication algorithm: > >> http://en.wikipedia.org/wiki/Cache_blocking > >> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm > >> > >> Then, you need to parallelize the resulting code yourself, which might > >> or might not be easy (depending on the interactions between the > >> parallel blocks that are found there). > >> In that paper, where matrix multiplication is called as SGEMM (the > >> BLAS routine implementing it), they suggest using a cache-blocked > >> version of matrix multiplication for both CPUs and GPUs, and argue > >> that parallelization is then easy. 
> >> > >> Cheers, > >> -- > >> Paolo Giarrusso - Ph.D. Student > >> http://www.informatik.uni-marburg.de/~pgiarrusso/ > >> _______________________________________________ > >> pypy-dev at codespeak.net > >> http://codespeak.net/mailman/listinfo/pypy-dev > > > > > > -- > Paolo Giarrusso - Ph.D. Student > http://www.informatik.uni-marburg.de/~pgiarrusso/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonah at eecs.berkeley.edu Fri Aug 20 23:28:14 2010 From: jonah at eecs.berkeley.edu (Jeff Anderson-Lee) Date: Fri, 20 Aug 2010 14:28:14 -0700 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: <4C6EEE0B.7060500@eecs.berkeley.edu> Message-ID: <4C6EF36E.5000507@eecs.berkeley.edu> On 8/20/2010 2:18 PM, Maciej Fijalkowski wrote: > On Fri, Aug 20, 2010 at 11:05 PM, Jeff Anderson-Lee > wrote: >> On 8/20/2010 1:51 PM, Maciej Fijalkowski wrote: >>> 2010/8/20 Paolo Giarrusso: >>>> 2010/8/20 Jorge Tim?n: >>>>> Hi, I'm just curious about the feasibility of running python code in a gpu >>>>> by extending pypy. >>>> Disclaimer: I am not a PyPy developer, even if I've been following the >>>> project with interest. Nor am I an expert of GPU - I provide links to >>>> the literature I've read. >>>> Yet, I believe that such an attempt is unlikely to be interesting. >>>> Quoting Wikipedia's synthesis: >>>> "Unlike CPUs however, GPUs have a parallel throughput architecture >>>> that emphasizes executing many concurrent threads slowly, rather than >>>> executing a single thread very fast." >>>> And significant optimizations are needed anyway to get performance for >>>> GPU code (and if you don't need the last bit of performance, why >>>> bother with a GPU?), so I think that the need to use a C-like language >>>> is the smallest problem. >>>> >>>>> I don't have the time (and probably the knowledge neither) to develop that >>>>> pypy extension, but I just want to know if it's possible. >>>>> I'm interested in languages like openCL and nvidia's CUDA because I think >>>>> the future of supercomputing is going to be GPGPU. >>> Python is a very different language than CUDA or openCL, hence it's >>> not completely to map python's semantics to something that will make >>> sense for GPU. >> Try googling: copperhead cuda >> Also look at: >> >> http://code.google.com/p/copperhead/wiki/Installing >> > What's the point of posting here project which has not released any code? 1) He is packaging it up for release this month: > Comment by bryan.catanzaro > , Aug 05, 2010 > > Before the end of August. I'm working on packaging it up right now. =) > 2) Bryan's got a good head on his shoulders and has been working on this problem or some time. Rather than (or at least before) starting off in a completely new direction, its worth looking at something that has been in the works for a while now and is attaining some maturity. 3) You are welcome to ignore it, but some folks might be interested, and at least they now know it is there and where to look for more information and forthcoming code. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncbray at gmail.com Sat Aug 21 00:46:53 2010 From: ncbray at gmail.com (Nick Bray) Date: Fri, 20 Aug 2010 17:46:53 -0500 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: I can't speak for GPGPU, but I have compiled a subset of Python onto the GPU for real-time rendering. 
The subset is a little broader than RPython in some ways (for example, attributes are semantically identical to Python) and a little narrower in some ways (many forms of recursion are disallowed.) This big idea is that it allows you to create a real-time rendering system with a single code base, and transparently share functions and data structures between the CPU and GPU. http://www.ncbray.com/pystream.html http://www.ncbray.com/ncbray-dissertation.pdf It's at least ~100,000x faster than interpreting Python on the CPU. "At least" because the measurements neglect doing things on the CPU like texture sampling. This speedup is pretty obscene, but if you break it down it isn't too unbelievable... 100x for interpreted -> compiled, 10x for abstraction overhead of using floats instead of doubles, 100x for using the GPU and using it for a task it was built for. Parallelism issues are sidestepped by explicitly identifying the parallel sections (one function processes every vertex, one function processes every fragment), requiring the parallel sections have no global side effects, and that certain I/O conventions are followed. Sorry, no big answers here - it's essentially Pythonic stream programming. The biggest issues with getting Python onto the GPU is memory. I was actually targeting GLSL, not CUDA (it can't access the full rendering pipeline), so pointers were not available. To work around this, the code is optimized to an extreme degree to remove as many memory operations as possible. The remaining memory operations are emulated by splitting the heap into regions, indirecting through arrays, and copying constant data wherever possible. From what I've seen this is where PyPy would have the most trouble: its analysis algorithms are good enough for inferring types and allowing compilation / translation... they aren't designed to enable aggressive optimization of memory operations (there's not a huge reason to do this if you're translating RPython into C... the C compiler will do it for you). In general, GPU programming doesn't work well with memory access (too many functional units, too little bandwidth). Most of the "C-like" GPU languages are designed to they can easily boil down into code operating out of registers. Python, on the other hand, is addicted to heap memory. Even if you target CUDA, eliminating memory operations will be a huge win. I'll freely admit there's some ugly things going on, such as the lack of recursion, reliance on exhaustive inlining, requiring GPU code follow a specific form, and not working well with container objects in certain situations (it needs to bound the size of the heap). In the end, however, it's a talking dog... the grammar may not be perfect, but the dog talks! If anyone has questions, either private or on the list, I'd be happy to answer them. I have not done enough to advertise my project, and this seems like a good place to start. - Nick Bray 2010/8/20 Paolo Giarrusso : > 2010/8/20 Jorge Tim?n : >> Hi, I'm just curious about the feasibility of running python code in a gpu >> by extending pypy. > Disclaimer: I am not a PyPy developer, even if I've been following the > project with interest. Nor am I an expert of GPU - I provide links to > the literature I've read. > Yet, I believe that such an attempt is unlikely to be interesting. > Quoting Wikipedia's synthesis: > "Unlike CPUs however, GPUs have a parallel throughput architecture > that emphasizes executing many concurrent threads slowly, rather than > executing a single thread very fast." 
> And significant optimizations are needed anyway to get performance for > GPU code (and if you don't need the last bit of performance, why > bother with a GPU?), so I think that the need to use a C-like language > is the smallest problem. > >> I don't have the time (and probably the knowledge neither) to develop that >> pypy extension, but I just want to know if it's possible. >> I'm interested in languages like openCL and nvidia's CUDA because I think >> the future of supercomputing is going to be GPGPU. > > I would like to point out that while for some cases it might be right, > the importance of GPGPU is probably often exaggerated: > > http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# > > Researchers in the field are mostly aware of the fact that GPGPU is > the way to go only for a very restricted category of code. For that > code, fine. > Thus, instead of running Python code in a GPU, designing from scratch > an easy way to program a GPU efficiently, for those task, is better, > and projects for that already exist (i.e. what you cite). > > Additionally, it would take probably a different kind of JIT to > exploit GPUs. No branch prediction, very small non-coherent caches, no > efficient synchronization primitives, as I read from this paper... I'm > no expert, but I guess you'd need to rearchitecture from scratch the > needed optimizations. > And it took 20-30 years to get from the first, slow Lisp (1958) to, > say, Self (1991), a landmark in performant high-level languages, > derived from SmallTalk. Most of that would have to be redone. > > So, I guess that the effort to compile Python code for a GPU is not > worth it. There might be further reasons due to the kind of code a JIT > generates, since a GPU has no branch predictor, no caches, and so on, > but I'm no GPU expert and I would have to check again. > > Finally, for general purpose code, exploiting the big expected number > of CPUs on our desktop systems is already a challenge. > >> There's people working in >> bringing GPGPU to python: >> >> http://mathema.tician.de/software/pyopencl >> http://mathema.tician.de/software/pycuda >> >> Would it be possible to run python code in parallel without the need (for >> the developer) of actively parallelizing the code? > > I would say that Python is not yet the language to use to write > efficient parallel code, because of the Global Interpreter Lock > (Google for "Python GIL"). The two implementations having no GIL are > IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, > and the current focus is not on removing it. > Scientific computing uses external libraries (like NumPy) - for the > supported algorithms, one could introduce parallelism at that level. > If that's enough for your application, good. > If you want to write a parallel algorithm in Python, we're not there yet. > >> I'm not talking about code of hard concurrency, but of code with intrinsic >> parallelism (let's say matrix multiplication). > > Automatic parallelization is hard, see: > http://en.wikipedia.org/wiki/Automatic_parallelization > > Lots of scientists have tried, lots of money has been invested, but > it's still hard. > The only practical approaches still require the programmer to > introduce parallelism, but in ways much simpler than using > multithreading directly. Google OpenMP and Cilk. > >> Would a JIT compilation be capable of detecting parallelism? > Summing up what is above, probably not. 
> > Moreover, matrix multiplication may not be so easy as one might think. > I do not know how to write it for a GPU, but in the end I reference > some suggestions from that paper (where it is one of the benchmarks). > But here, I explain why writing it for a CPU is complicated. You can > multiply two matrixes with a triply nested for, but such an algorithm > has poor performance for big matrixes because of bad cache locality. > GPUs, according to the above mentioned paper, provide no caches and > hides latency in other ways. > > See here for the two main alternative ideas which allow solving this > problem of writing an efficient matrix multiplication algorithm: > http://en.wikipedia.org/wiki/Cache_blocking > http://en.wikipedia.org/wiki/Cache-oblivious_algorithm > > Then, you need to parallelize the resulting code yourself, which might > or might not be easy (depending on the interactions between the > parallel blocks that are found there). > In that paper, where matrix multiplication is called as SGEMM (the > BLAS routine implementing it), they suggest using a cache-blocked > version of matrix multiplication for both CPUs and GPUs, and argue > that parallelization is then easy. > > Cheers, > -- > Paolo Giarrusso - Ph.D. Student > http://www.informatik.uni-marburg.de/~pgiarrusso/ > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev From p.giarrusso at gmail.com Sat Aug 21 01:46:28 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sat, 21 Aug 2010 01:46:28 +0200 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: 2010/8/20 Jim Baker : > The Unladen Swallow doc, which was derived from a PEP that Jeff proposed, > seems to be a fair descriptive outline of Python memory models in general, > and Jython's in specific. > Obviously the underlying implementation in the JVM is happens-before > consistency; everything else derives from there. The CHM ?provides > additional consistency constraints that should imply sequential consistency > for a (vast) subset of Python programs. However, I can readily construct a > program that violates sequential consistency: maybe it uses slots (stored in > a Java array), or the array module (which also just wraps Java arrays), or > by accesses local variables in a frame from another thread (same storage, > same problem). Likewise I can also create Python programs that access Java > classes (since this is Jython!), and they too will only see happens-before > consistency. OK, I guess that volatile semantics for fields were just a side effect. As far as I can see, you get sequential consistency only in practice, not in theory - you have happens-before edges only when a reader and a writer touch the same field. In practice, the few cases where it matters can't apply here as far as I know, because a hash function decides to which of the submaps a mapping belongs. Your mention of slots is very cool! You made me recall that once you get shadow classes in Python, you can not only do inline caching, but you also have the _same_ object layout as in slots, because adding a member causes a hidden class transition, getting rid of any kind of dictionary _after compilation_. Two exceptions: * an immutable dictionary mapping field names to offsets is used both during JIT compilation and when inline caching fails, for * a fallback case for when __dict__ is used, I guess, is needed. 
Not necessarily a dictionary must be used though: one could also make __dict__ usage just cause class transitions. * beyond a certain member count, i.e., if __dict__ is used as a general-purpose dictionary, one might want to switch back to a dictionary representation. This only applies if this is done in Pythonic code (guess not) - I remember this case from V8, for JavaScript, where the expected usage is different. > Naturally, the workarounds I mentioned for improving performance in object > allocation all rely on not using CHM and its (modestly) expensive semantics. > So this would mean using a Java class in some way, possibly a HashMap > (especially one that's been exposed through our type expose mechanism to > avoid reflection overhead), or directly using a Java class of some kind > (again exposing is best, much like are builtin types like PyInteger), > possibly with all fields marked as volatile. > Hope this helps! If you are interested in studying this problem in more > depth for Jython, or other implementations, and the implications of our > hybrid model, it would certainly be most welcome. Unfortunately, it's not > something that Jython development itself will be working on (standard time > constraints apply here). Such constraints apply to me too - but I hope this to work on that. > - Jim > 2010/8/20 Paolo Giarrusso >> >> 2010/8/20 Jim Baker : >> > Jython single-threaded performance has little to do with a lack of the >> > GIL. >> >> Never implied that - I do believe that a GIL-less fast Python is >> possible. I just meant we don't have one yet. >> >> > Probably the only direct manifestation is seen in the overhead of >> > allocating >> > __dict__ (or dict) objects because Python attributes have volatile >> > memory >> > semantics >> Uh? "Jython memory model" doesn't seem to find anything. Is there any >> docs on this, with the rationale for the choice you describe? >> >> I've only found the Unladen Swallow proposals for a memory model: >> http://code.google.com/p/unladen-swallow/wiki/MemoryModel (and >> python-safethread, which I don't like). >> >> As a Java programmer using Jython, I wouldn't expect to have any >> volatile field ever, but I would expect to be able to act on different >> fields indipendently - the race conditions we have to protect from are >> the ones on structual modification (unless the table uses open >> addressing). >> _This_ can be implemented through ConcurrentHashMap (which also makes >> all fields volatile), but an implementation not guaranteeing volatile >> semantics (if possible) would have been equally valid. >> I am interested because I want to experiment with alternatives. >> >> Of course, you can offer stronger semantics, but then you should also >> advertise that fields are volatile, thus I don't need a lock to pass a >> reference. >> >> > , which is ensured by the backing of a ConcurrentHashMap, which can >> > be expensive to allocate. There are workarounds. >> >> I'm also curious about such workarounds - are they currently >> implemented or speculations? >> >> > 2010/8/20 Paolo Giarrusso >> >> >> >> 2010/8/20 Jorge Tim?n : >> >> > Hi, I'm just curious about the feasibility of running python code in >> >> > a >> >> > gpu >> >> > by extending pypy. >> >> Disclaimer: I am not a PyPy developer, even if I've been following the >> >> project with interest. Nor am I an expert of GPU - I provide links to >> >> the literature I've read. >> >> Yet, I believe that such an attempt is unlikely to be interesting. 
>> >> Quoting Wikipedia's synthesis: >> >> "Unlike CPUs however, GPUs have a parallel throughput architecture >> >> that emphasizes executing many concurrent threads slowly, rather than >> >> executing a single thread very fast." >> >> And significant optimizations are needed anyway to get performance for >> >> GPU code (and if you don't need the last bit of performance, why >> >> bother with a GPU?), so I think that the need to use a C-like language >> >> is the smallest problem. >> >> >> >> > I don't have the time (and probably the knowledge neither) to develop >> >> > that >> >> > pypy extension, but I just want to know if it's possible. >> >> > I'm interested in languages like openCL and nvidia's CUDA because I >> >> > think >> >> > the future of supercomputing is going to be GPGPU. >> >> >> >> I would like to point out that while for some cases it might be right, >> >> the importance of GPGPU is probably often exaggerated: >> >> >> >> >> >> >> >> http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1# >> >> >> >> Researchers in the field are mostly aware of the fact that GPGPU is >> >> the way to go only for a very restricted category of code. For that >> >> code, fine. >> >> Thus, instead of running Python code in a GPU, designing from scratch >> >> an easy way to program a GPU efficiently, for those task, is better, >> >> and projects for that already exist (i.e. what you cite). >> >> >> >> Additionally, it would take probably a different kind of JIT to >> >> exploit GPUs. No branch prediction, very small non-coherent caches, no >> >> efficient synchronization primitives, as I read from this paper... I'm >> >> no expert, but I guess you'd need to rearchitecture from scratch the >> >> needed optimizations. >> >> And it took 20-30 years to get from the first, slow Lisp (1958) to, >> >> say, Self (1991), a landmark in performant high-level languages, >> >> derived from SmallTalk. Most of that would have to be redone. >> >> >> >> So, I guess that the effort to compile Python code for a GPU is not >> >> worth it. There might be further reasons due to the kind of code a JIT >> >> generates, since a GPU has no branch predictor, no caches, and so on, >> >> but I'm no GPU expert and I would have to check again. >> >> >> >> Finally, for general purpose code, exploiting the big expected number >> >> of CPUs on our desktop systems is already a challenge. >> >> >> >> > There's people working in >> >> > bringing GPGPU to python: >> >> > >> >> > http://mathema.tician.de/software/pyopencl >> >> > http://mathema.tician.de/software/pycuda >> >> > >> >> > Would it be possible to run python code in parallel without the need >> >> > (for >> >> > the developer) of actively parallelizing the code? >> >> >> >> I would say that Python is not yet the language to use to write >> >> efficient parallel code, because of the Global Interpreter Lock >> >> (Google for "Python GIL"). The two implementations having no GIL are >> >> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, >> >> and the current focus is not on removing it. >> >> Scientific computing uses external libraries (like NumPy) - for the >> >> supported algorithms, one could introduce parallelism at that level. >> >> If that's enough for your application, good. >> >> If you want to write a parallel algorithm in Python, we're not there >> >> yet. 
>> >> >> >> > I'm not talking about code of hard concurrency, but of code with >> >> > intrinsic >> >> > parallelism (let's say matrix multiplication). >> >> >> >> Automatic parallelization is hard, see: >> >> http://en.wikipedia.org/wiki/Automatic_parallelization >> >> >> >> Lots of scientists have tried, lots of money has been invested, but >> >> it's still hard. >> >> The only practical approaches still require the programmer to >> >> introduce parallelism, but in ways much simpler than using >> >> multithreading directly. Google OpenMP and Cilk. >> >> >> >> > Would a JIT compilation be capable of detecting parallelism? >> >> Summing up what is above, probably not. >> >> >> >> Moreover, matrix multiplication may not be so easy as one might think. >> >> I do not know how to write it for a GPU, but in the end I reference >> >> some suggestions from that paper (where it is one of the benchmarks). >> >> But here, I explain why writing it for a CPU is complicated. You can >> >> multiply two matrixes with a triply nested for, but such an algorithm >> >> has poor performance for big matrixes because of bad cache locality. >> >> GPUs, according to the above mentioned paper, provide no caches and >> >> hides latency in other ways. >> >> >> >> See here for the two main alternative ideas which allow solving this >> >> problem of writing an efficient matrix multiplication algorithm: >> >> http://en.wikipedia.org/wiki/Cache_blocking >> >> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm >> >> >> >> Then, you need to parallelize the resulting code yourself, which might >> >> or might not be easy (depending on the interactions between the >> >> parallel blocks that are found there). >> >> In that paper, where matrix multiplication is called as SGEMM (the >> >> BLAS routine implementing it), they suggest using a cache-blocked >> >> version of matrix multiplication for both CPUs and GPUs, and argue >> >> that parallelization is then easy. >> >> >> >> Cheers, >> >> -- >> >> Paolo Giarrusso - Ph.D. Student >> >> http://www.informatik.uni-marburg.de/~pgiarrusso/ >> >> _______________________________________________ >> >> pypy-dev at codespeak.net >> >> http://codespeak.net/mailman/listinfo/pypy-dev >> > >> >> >> >> -- >> Paolo Giarrusso - Ph.D. Student >> http://www.informatik.uni-marburg.de/~pgiarrusso/ > > -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From sakesun at gmail.com Sat Aug 21 05:20:11 2010 From: sakesun at gmail.com (sakesun roykiatisak) Date: Sat, 21 Aug 2010 10:20:11 +0700 Subject: [pypy-dev] What's wrong with >>> open(?xxx?, ?w?).write(?stuff?) ? In-Reply-To: References: <20100820095721.GC16244@code0.codespeak.net> Message-ID: This discussion is getting a little too long than necessary, at least for me. :) Most of pypy talk video is in pretty poor recording quality. Most of the time I try to discern barely from the slides. I always understand the difference between resource lifetime and object lifetime. Actually, in my most recent years, my sole python interpreter is the non-refcounting IronPython already. And I always wrap file operation inside try/finally or with statement. The problem is the example that claim to cause problem: >>> open('xxx', 'w').write('stuff') I misinterpret that the problem is caused in the "write" methods. The above statement cause no problem, but the subsequent usage of the file will. That's what I missed. In fact, it might be more intuitive to demonstrate in a little longer sample. 
>>> open('xxx', 'w').write('stuff') >>> assert open('xxx').read() == 'stuff' # Might fail ! The first file might not be closed yet ! Cheers. On Fri, Aug 20, 2010 at 8:39 PM, Paolo Giarrusso wrote: > On Fri, Aug 20, 2010 at 12:23, Donny Viszneki > wrote: > > Armin: Sakesun used "del f" and it appears you did not. > Actually, he didn't either. He said "I think that open(?xxx?, > ?w?).write(?stuff?)" is equivalent to using del (which he thought > would work), and the equivalence was correct. > > Anyway, in the _first reply_ message, he realized that using: > > ipy -c "open(?xxx?, ?w?).write(?stuff?)" > jython -c "open(?xxx?, ?w?).write(?stuff?)" > > made a difference (because the interpreter exited), so that problem > was solved. His mail implies that on PyPy he typed the code at the > prompt, rather than at -c. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hakan at debian.org Sat Aug 21 09:06:10 2010 From: hakan at debian.org (Hakan Ardo) Date: Sat, 21 Aug 2010 09:06:10 +0200 Subject: [pypy-dev] gpgpu and pypy In-Reply-To: References: Message-ID: Hi, here is a another effort allowing you to write GPU kernels using python, targeted at gpgpu. The programmer has to explicitly state the parallelism and there are restrictions on what kind of constructs are allowed in the kernels, but it's pretty cool: http://www.cs.lth.se/home/Calle_Lejdfors/pygpu/ On Sat, Aug 21, 2010 at 12:46 AM, Nick Bray wrote: > I can't speak for GPGPU, but I have compiled a subset of Python onto > the GPU for real-time rendering. ?The subset is a little broader than > RPython in some ways (for example, attributes are semantically > identical to Python) and a little narrower in some ways (many forms of > recursion are disallowed.) ?This big idea is that it allows you to > create a real-time rendering system with a single code base, and > transparently share functions and data structures between the CPU and > GPU. > > http://www.ncbray.com/pystream.html > http://www.ncbray.com/ncbray-dissertation.pdf > > It's at least ~100,000x faster than interpreting Python on the CPU. > "At least" because the measurements neglect doing things on the CPU > like texture sampling. ?This speedup is pretty obscene, but if you > break it down it isn't too unbelievable... 100x for interpreted -> > compiled, 10x for abstraction overhead of using floats instead of > doubles, 100x for using the GPU and using it for a task it was built > for. > > Parallelism issues are sidestepped by explicitly identifying the > parallel sections (one function processes every vertex, one function > processes every fragment), requiring the parallel sections have no > global side effects, and that certain I/O conventions are followed. > Sorry, no big answers here - it's essentially Pythonic stream > programming. > > The biggest issues with getting Python onto the GPU is memory. ?I was > actually targeting GLSL, not CUDA (it can't access the full rendering > pipeline), so pointers were not available. ?To work around this, the > code is optimized to an extreme degree to remove as many memory > operations as possible. ?The remaining memory operations are emulated > by splitting the heap into regions, indirecting through arrays, and > copying constant data wherever possible. ?From what I've seen this is > where PyPy would have the most trouble: its analysis algorithms are > good enough for inferring types and ?allowing compilation / > translation... 
> they aren't designed to enable aggressive optimization of memory operations (there's not a huge reason to do this if you're translating RPython into C... the C compiler will do it for you). In general, GPU programming doesn't work well with memory access (too many functional units, too little bandwidth). Most of the "C-like" GPU languages are designed so they can easily boil down into code operating out of registers. Python, on the other hand, is addicted to heap memory. Even if you target CUDA, eliminating memory operations will be a huge win.
>
> I'll freely admit there are some ugly things going on, such as the lack of recursion, the reliance on exhaustive inlining, requiring that GPU code follow a specific form, and not working well with container objects in certain situations (it needs to bound the size of the heap). In the end, however, it's a talking dog... the grammar may not be perfect, but the dog talks! If anyone has questions, either in private or on the list, I'd be happy to answer them. I have not done enough to advertise my project, and this seems like a good place to start.
>
> - Nick Bray
>
> 2010/8/20 Paolo Giarrusso :
>> 2010/8/20 Jorge Timón :
>>> Hi, I'm just curious about the feasibility of running Python code on a GPU by extending PyPy.
>>
>> Disclaimer: I am not a PyPy developer, even if I've been following the project with interest. Nor am I a GPU expert - I provide links to the literature I've read. Yet, I believe that such an attempt is unlikely to be interesting. Quoting Wikipedia's synthesis: "Unlike CPUs however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very fast." And significant optimizations are needed anyway to get performance for GPU code (and if you don't need the last bit of performance, why bother with a GPU?), so I think that the need to use a C-like language is the smallest problem.
>>
>>> I don't have the time (and probably not the knowledge either) to develop that PyPy extension, but I just want to know if it's possible. I'm interested in languages like OpenCL and NVIDIA's CUDA because I think the future of supercomputing is going to be GPGPU.
>>
>> I would like to point out that while for some cases it might be right, the importance of GPGPU is probably often exaggerated:
>>
>> http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1#
>>
>> Researchers in the field are mostly aware of the fact that GPGPU is the way to go only for a very restricted category of code. For that code, fine. Thus, instead of running Python code on a GPU, designing from scratch an easy way to program a GPU efficiently for those tasks is better, and projects for that already exist (i.e. what you cite).
>>
>> Additionally, it would probably take a different kind of JIT to exploit GPUs. No branch prediction, very small non-coherent caches, no efficient synchronization primitives, as I read from this paper... I'm no expert, but I guess you'd need to re-architect the needed optimizations from scratch. And it took 20-30 years to get from the first, slow Lisp (1958) to, say, Self (1991), a landmark in performant high-level languages, derived from Smalltalk. Most of that would have to be redone.
>>
>> So, I guess that the effort to compile Python code for a GPU is not worth it.
>> There might be further reasons due to the kind of code a JIT generates, since a GPU has no branch predictor, no caches, and so on, but I'm no GPU expert and I would have to check again.
>>
>> Finally, for general-purpose code, exploiting the big expected number of CPUs on our desktop systems is already a challenge.
>>
>>> There are people working on bringing GPGPU to Python:
>>>
>>> http://mathema.tician.de/software/pyopencl
>>> http://mathema.tician.de/software/pycuda
>>>
>>> Would it be possible to run Python code in parallel without the need (for the developer) to actively parallelize the code?
>>
>> I would say that Python is not yet the language to use to write efficient parallel code, because of the Global Interpreter Lock (Google for "Python GIL"). The two implementations without a GIL are IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL, and the current focus is not on removing it. Scientific computing uses external libraries (like NumPy) - for the supported algorithms, one could introduce parallelism at that level. If that's enough for your application, good. If you want to write a parallel algorithm in Python, we're not there yet.
>>
>>> I'm not talking about code of hard concurrency, but of code with intrinsic parallelism (let's say matrix multiplication).
>>
>> Automatic parallelization is hard, see:
>> http://en.wikipedia.org/wiki/Automatic_parallelization
>>
>> Lots of scientists have tried, lots of money has been invested, but it's still hard. The only practical approaches still require the programmer to introduce parallelism, but in ways much simpler than using multithreading directly. Google OpenMP and Cilk.
>>
>>> Would a JIT compilation be capable of detecting parallelism?
>>
>> Summing up what is above, probably not.
>>
>> Moreover, matrix multiplication may not be as easy as one might think. I do not know how to write it for a GPU, but in the end I reference some suggestions from that paper (where it is one of the benchmarks). Here, though, I explain why writing it for a CPU is complicated. You can multiply two matrices with a triply nested for loop, but such an algorithm has poor performance for big matrices because of bad cache locality. GPUs, according to the above-mentioned paper, provide no caches and hide latency in other ways.
>>
>> See here for the two main alternative ideas which allow solving this problem of writing an efficient matrix multiplication algorithm:
>> http://en.wikipedia.org/wiki/Cache_blocking
>> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
>>
>> Then, you need to parallelize the resulting code yourself, which might or might not be easy (depending on the interactions between the parallel blocks that are found there). In that paper, where matrix multiplication is called SGEMM (the BLAS routine implementing it), they suggest using a cache-blocked version of matrix multiplication for both CPUs and GPUs, and argue that parallelization is then easy.
>>
>> Cheers,
>> --
>> Paolo Giarrusso - Ph.D. Student
>> http://www.informatik.uni-marburg.de/~pgiarrusso/

--
Håkan Ardö
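[Editorial note, not part of the original thread: to make the cache-blocking point above concrete, here is a minimal sketch in pure Python. It is illustrative only - real code would call into NumPy/BLAS, and the block size of 64 is an arbitrary assumption, not a tuned value.]

    # Cache-blocked ("tiled") matrix multiplication, illustrative sketch only.
    # The three outer loops walk over blocks; within one block triple the
    # operands are small enough to stay cache-resident.
    def blocked_matmul(a, b, block=64):
        """Multiply square matrices a and b, given as lists of lists."""
        n = len(a)
        c = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, block):            # blocks of rows of a / c
            for kk in range(0, n, block):        # blocks of the shared dimension
                for jj in range(0, n, block):    # blocks of columns of b / c
                    for i in range(ii, min(ii + block, n)):
                        row_c = c[i]
                        for k in range(kk, min(kk + block, n)):
                            a_ik = a[i][k]
                            row_b = b[k]
                            for j in range(jj, min(jj + block, n)):
                                row_c[j] += a_ik * row_b[j]
        return c

Parallelizing such code then mostly amounts to distributing the outermost block loop across threads (or GPU work groups), which is the "parallelization is then easy" argument made in the quoted mail.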
From cfbolz at gmx.de Sat Aug 21 10:25:39 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Sat, 21 Aug 2010 10:25:39 +0200
Subject: [pypy-dev] gpgpu and pypy
In-Reply-To: References: Message-ID: <4C6F8D83.8060709@gmx.de>

Hi Paolo,

On 08/21/2010 01:46 AM, Paolo Giarrusso wrote:
[...]
> Your mention of slots is very cool! You made me recall that once you get shadow classes in Python, you can not only do inline caching, but you also have the _same_ object layout as in slots, because adding a member causes a hidden class transition, getting rid of any kind of dictionary _after compilation_. Two exceptions:
> * an immutable dictionary mapping field names to offsets is used both during JIT compilation and when inline caching fails;
> * a fallback case for when __dict__ is used is, I guess, needed. Not necessarily a dictionary, though: one could also make __dict__ usage just cause class transitions;
> * beyond a certain member count, i.e. if __dict__ is used as a general-purpose dictionary, one might want to switch back to a dictionary representation. This only applies if this is done in Pythonic code (guess not) - I remember this case from V8, for JavaScript, where the expected usage is different.

Just as a note: PyPy's Python interpreter does all this already, and I am working on making it even cooler :-).

[...]

Cheers,

Carl Friedrich

From hakan at debian.org Sat Aug 28 15:05:11 2010
From: hakan at debian.org (Hakan Ardo)
Date: Sat, 28 Aug 2010 15:05:11 +0200
Subject: [pypy-dev] Loop invariants
Message-ID:

Hi,
some time ago there was some discussion about loop invariants, but no conclusion. What do you think about the following approach:

- Let optimize_loop mark the arguments in loop.inputargs as invariant if they appear at the same position in the jump instruction at the end, before calling propagate_forward.

- Let the optimize_... methods emit operations that only use invariant arguments to some preamble, instead of emitting them to self.newoperations, whenever that is safe. Also, the results of these operations should probably be marked as invariant.

- Insert the created preamble at every point where the loop is called, right before the jump.

- When compiling a bridge from a failing guard, run the preamble through propagate_forward and discard the emitted operations, to inherit that part of the state of the Optimizer.

This should place the invariant instructions at the end of the entry bridge, which is a suitable place, right? At the end of a bridge from a failing guard that maintains the invariants, the optimizer should remove the inserted preamble again, right? And at the end of a bridge that invalidates them, enough of the preamble will be kept to maintain correct behavior, right?

--
Håkan Ardö

From william.leslie.ttg at gmail.com Sun Aug 29 00:05:06 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Sun, 29 Aug 2010 08:05:06 +1000
Subject: [pypy-dev] Loop invariants
In-Reply-To: References: Message-ID:

The other part of the work is the algorithm that finds loop variants. It is similar to the algorithm for variable colour inference, so you do have a starting point.

On 28/08/2010 11:12 PM, "Hakan Ardo" wrote:

Hi,
some time ago there was some discussion about loop invariants, but no conclusion.
What do you think about the following approach:

- Let optimize_loop mark the arguments in loop.inputargs as invariant if they appear at the same position in the jump instruction at the end, before calling propagate_forward.

- Let the optimize_... methods emit operations that only use invariant arguments to some preamble, instead of emitting them to self.newoperations, whenever that is safe. Also, the results of these operations should probably be marked as invariant.

- Insert the created preamble at every point where the loop is called, right before the jump.

- When compiling a bridge from a failing guard, run the preamble through propagate_forward and discard the emitted operations, to inherit that part of the state of the Optimizer.

This should place the invariant instructions at the end of the entry bridge, which is a suitable place, right? At the end of a bridge from a failing guard that maintains the invariants, the optimizer should remove the inserted preamble again, right? And at the end of a bridge that invalidates them, enough of the preamble will be kept to maintain correct behavior, right?

--
Håkan Ardö

From cfbolz at gmx.de Sun Aug 29 12:32:23 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Sun, 29 Aug 2010 12:32:23 +0200
Subject: [pypy-dev] Loop invariants
In-Reply-To: References: Message-ID: <4C7A3737.50902@gmx.de>

Hi Håkan,

thanks for taking up the topic.

On 08/28/2010 03:05 PM, Hakan Ardo wrote:
> - Let optimize_loop mark the arguments in loop.inputargs as invariant if they appear at the same position in the jump instruction at the end, before calling propagate_forward.

sounds good.

> - Let the optimize_... methods emit operations that only use invariant arguments to some preamble, instead of emitting them to self.newoperations, whenever that is safe. Also, the results of these operations should probably be marked as invariant.

Need to be a bit careful about operations with side effects, but basically yes.

> - Insert the created preamble at every point where the loop is called, right before the jump.

This part makes sense to me. The code would have to be careful to match the variables in the trace and in the preamble.

> - When compiling a bridge from a failing guard, run the preamble through propagate_forward and discard the emitted operations, to inherit that part of the state of the Optimizer.

... but I don't see why this is needed. Wouldn't you rather need the whole trace of the loop including the preamble up to the failing guard? This would be bad, because you need to store the full trace then.

> This should place the invariant instructions at the end of the entry bridge, which is a suitable place, right? At the end of a bridge from a failing guard that maintains the invariants, the optimizer should remove the inserted preamble again, right? And at the end of a bridge that invalidates them, enough of the preamble will be kept to maintain correct behavior, right?

Yes to all the questions, at least as far as I can see. I guess in practice there might be complications.

Cheers,

Carl Friedrich

P.S.: A bit unrelated, but a comment on the jit-bounds branch: I think it would be good if the bounds-related optimizations could move out of optimizeopt.py to their own file, because otherwise optimizeopt.py is getting really unwieldy. Does that make sense?
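[Editorial note, not part of the original thread: a toy sketch of the hoisting step being discussed. All names (Op, split_preamble, ...) are invented for illustration and do not match PyPy's actual optimizer classes; the point is only that operations whose arguments are all invariant move to a preamble, and their results become invariant in turn.]

    # Invented, simplified model of hoisting loop-invariant operations.
    class Op:
        def __init__(self, result, name, args, has_side_effects=False):
            self.result = result
            self.name = name
            self.args = args
            self.has_side_effects = has_side_effects

    def split_preamble(inputargs, jumpargs, operations):
        # An input argument is invariant if the jump at the end of the loop
        # passes it along unchanged in the same position.
        invariant = set(a for a, j in zip(inputargs, jumpargs) if a == j)
        preamble, body = [], []
        for op in operations:
            if not op.has_side_effects and all(a in invariant for a in op.args):
                preamble.append(op)
                invariant.add(op.result)   # results of invariant ops are invariant too
            else:
                body.append(op)
        return preamble, body

    # i0 and i1 are passed through unchanged, so i3 = i0 * i1 can be hoisted;
    # i4 and i5 depend on the varying i2 and stay in the loop body.
    ops = [Op('i3', 'int_mul', ['i0', 'i1']),
           Op('i4', 'int_add', ['i2', 'i3']),
           Op('i5', 'int_sub', ['i4', 'i0'])]
    preamble, body = split_preamble(['i0', 'i1', 'i2'], ['i0', 'i1', 'i4'], ops)
    assert [op.result for op in preamble] == ['i3']
    assert [op.result for op in body] == ['i4', 'i5']

The questions left open in the thread - what happens at bridges, and whether the optimizer state can be recreated from the preamble alone - are exactly what such a sketch leaves out.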
From arigo at tunes.org Sun Aug 29 13:04:11 2010
From: arigo at tunes.org (Armin Rigo)
Date: Sun, 29 Aug 2010 13:04:11 +0200
Subject: [pypy-dev] Loop invariants
In-Reply-To: References: Message-ID: <20100829110411.GA13704@code0.codespeak.net>

Hi,

On Sat, Aug 28, 2010 at 03:05:11PM +0200, Hakan Ardo wrote:
> some time ago there was some discussion about loop invariants, but no conclusion.

A general answer to that question: there are two kinds of goals we can have when optimizing. One is to get the fastest possible code for small Python loops, e.g. doing numerical computations. The other is to get reasonably good code for large and complicated loops, e.g. the dispatch loop of some network application. Although loop-invariant code motion would definitely be great for the first kind of loops, it's unclear that it helps on the second kind of loops.

As a similar consideration, I am thinking about trying to remove the optimization that passes "virtuals" from one iteration of the loop to the next one. Although it has good effects on small loops, it actually has a negative effect on large loops, because the loop taking virtual arguments cannot be directly jumped to from the interpreter.

I'm not saying that loop-invariant code motion could also have a negative effect on large loops; I think it's a pure win, so it's probably worth a try. I'm just giving a warning: it may not help much in the case of a "general Python program doing lots of stuff", but only in the case of small numerical computation loops.

A bientot,

Armin.

From hakan at debian.org Sun Aug 29 13:49:23 2010
From: hakan at debian.org (Hakan Ardo)
Date: Sun, 29 Aug 2010 13:49:23 +0200
Subject: [pypy-dev] Loop invariants
In-Reply-To: <4C7A3737.50902@gmx.de> References: <4C7A3737.50902@gmx.de> Message-ID:

On Sun, Aug 29, 2010 at 12:32 PM, Carl Friedrich Bolz wrote:
> ... but I don't see why this is needed.

My thinking was that this would be needed for the preamble to be removed from the end of a bridge that maintains the invariants. But I might be mistaken?

> Wouldn't you rather need the whole trace of the loop including the preamble up to the failing guard? This would be bad, because you need to store the full trace then.

OK, so that might be a problem. Maybe it would be possible to extract which part of the state it would be safe to inherit even if only the preamble has been processed, i.e. self.pure_operations might be OK?

> P.S.: A bit unrelated, but a comment on the jit-bounds branch: I think it would be good if the bounds-related optimizations could move out of optimizeopt.py to their own file, because otherwise optimizeopt.py is getting really unwieldy. Does that make sense?

Well, class IntBound and the propagate_bounds_ methods could probably be moved elsewhere, but a lot of the work is done in the optimize_... methods, which I'm not so sure it would make sense to split up.

--
Håkan Ardö

From cfbolz at gmx.de Sun Aug 29 14:03:37 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Sun, 29 Aug 2010 14:03:37 +0200
Subject: [pypy-dev] jit-bounds branch (was: Loop invariants)
In-Reply-To: References: <4C7A3737.50902@gmx.de> Message-ID: <4C7A4C99.2050803@gmx.de>

On 08/29/2010 01:49 PM, Hakan Ardo wrote:
>> P.S.: A bit unrelated, but a comment on the jit-bounds branch: I think it would be good if the bounds-related optimizations could move out of optimizeopt.py to their own file, because otherwise optimizeopt.py is getting really unwieldy. Does that make sense?
>
> Well, class IntBound and the propagate_bounds_ methods could probably be moved elsewhere, but a lot of the work is done in the optimize_... methods, which I'm not so sure it would make sense to split up.

I guess then the things that can be sanely moved should move. The file is nearly 2000 lines, which is way too big. I guess the heap optimizations could also go to their own file.

Carl Friedrich

From fijall at gmail.com Sun Aug 29 22:05:49 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Sun, 29 Aug 2010 22:05:49 +0200
Subject: [pypy-dev] jit-bounds branch (was: Loop invariants)
In-Reply-To: <4C7A4C99.2050803@gmx.de> References: <4C7A3737.50902@gmx.de> <4C7A4C99.2050803@gmx.de> Message-ID:

On Sun, Aug 29, 2010 at 2:03 PM, Carl Friedrich Bolz wrote:
> On 08/29/2010 01:49 PM, Hakan Ardo wrote:
>>> P.S.: A bit unrelated, but a comment on the jit-bounds branch: I think it would be good if the bounds-related optimizations could move out of optimizeopt.py to their own file, because otherwise optimizeopt.py is getting really unwieldy. Does that make sense?
>>
>> Well, class IntBound and the propagate_bounds_ methods could probably be moved elsewhere, but a lot of the work is done in the optimize_... methods, which I'm not so sure it would make sense to split up.
>
> I guess then the things that can be sanely moved should move. The file is nearly 2000 lines, which is way too big. I guess the heap optimizations could also go to their own file.
>
> Carl Friedrich

How about a couple of files (preferably small), each containing a self-contained optimization if possible? (maybe a package?)

From hakan at debian.org Tue Aug 31 09:25:13 2010
From: hakan at debian.org (Hakan Ardo)
Date: Tue, 31 Aug 2010 09:25:13 +0200
Subject: [pypy-dev] jit-bounds branch (was: Loop invariants)
In-Reply-To: References: <4C7A3737.50902@gmx.de> <4C7A4C99.2050803@gmx.de> Message-ID:

OK, so we split it up into a set of Optimization classes in separate files, each containing a subset of the optimize_... methods. Then we have the propagate_forward method iterate over the instructions, passing them to one Optimization after the other? That way we keep the single iteration over the instructions. Would it be preferable to separate them even more and have each Optimization contain its own loop over the instructions?

On Sun, Aug 29, 2010 at 10:05 PM, Maciej Fijalkowski wrote:
> On Sun, Aug 29, 2010 at 2:03 PM, Carl Friedrich Bolz wrote:
>> On 08/29/2010 01:49 PM, Hakan Ardo wrote:
>>>> P.S.: A bit unrelated, but a comment on the jit-bounds branch: I think it would be good if the bounds-related optimizations could move out of optimizeopt.py to their own file, because otherwise optimizeopt.py is getting really unwieldy. Does that make sense?
>>>
>>> Well, class IntBound and the propagate_bounds_ methods could probably be moved elsewhere, but a lot of the work is done in the optimize_... methods, which I'm not so sure it would make sense to split up.
>>
>> I guess then the things that can be sanely moved should move. The file is nearly 2000 lines, which is way too big. I guess the heap optimizations could also go to their own file.
>>
>> Carl Friedrich
>
> How about a couple of files (preferably small), each containing a self-contained optimization if possible? (maybe a package?)

--
Håkan Ardö
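[Editorial note, not part of the original thread: a toy sketch of the structure being discussed - several small Optimization objects, each of which could live in its own module, chained behind a single propagate_forward pass so the trace is still walked only once. All class names and the operation format are invented for illustration; PyPy's real optimizer looks different.]

    # Invented, simplified model of chaining optimization passes over a trace.
    class Optimization:
        def propagate_forward(self, op, emit):
            emit(op)                    # default: pass the operation through unchanged

    class ConstantFoldAdd(Optimization):
        def propagate_forward(self, op, emit):
            if op[0] == 'int_add' and isinstance(op[1], int) and isinstance(op[2], int):
                emit(('const', op[1] + op[2]))   # fold an addition of two constants
            else:
                emit(op)

    class RemoveDuplicateGuards(Optimization):
        def __init__(self):
            self.seen = set()
        def propagate_forward(self, op, emit):
            if op[0] == 'guard_true':
                if op[1] in self.seen:
                    return              # this value was already guarded: drop the guard
                self.seen.add(op[1])
            emit(op)

    def propagate_forward_chain(optimizations, operations):
        """Single iteration over the trace; each op flows through every pass."""
        result = []
        emit = result.append
        for opt in reversed(optimizations):
            emit = (lambda op, nxt=emit, opt=opt: opt.propagate_forward(op, nxt))
        for op in operations:
            emit(op)
        return result

    trace = [('int_add', 2, 3), ('guard_true', 'i0'), ('guard_true', 'i0')]
    assert propagate_forward_chain([ConstantFoldAdd(), RemoveDuplicateGuards()],
                                   trace) == [('const', 5), ('guard_true', 'i0')]

Whether each pass should instead run its own full loop over the instructions - the alternative raised above - is then mostly a trade-off between simplicity and walking the trace several times.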
From hakan at debian.org Tue Aug 31 09:20:15 2010
From: hakan at debian.org (Hakan Ardo)
Date: Tue, 31 Aug 2010 09:20:15 +0200
Subject: [pypy-dev] Loop invariants
In-Reply-To: <20100829110411.GA13704@code0.codespeak.net> References: <20100829110411.GA13704@code0.codespeak.net> Message-ID:

On Sun, Aug 29, 2010 at 1:04 PM, Armin Rigo wrote:
> I'm not saying that loop-invariant code motion could also have a negative effect on large loops; I think it's a pure win, so it's probably worth a try. I'm just giving a warning: it may not help much in the case of a "general Python program doing lots of stuff", but only in the case of small numerical computation loops.

Right. I write a lot of numerical computation loops these days, both small and somewhat bigger, and I am typically forced to write them in C to get decent performance. So the motivation here would rather be to broaden the usability of Python than to improve the performance of existing Python programs.

Another motivation might be to help PyPy developers focus on the important instructions while staring at traces, i.e. by hiding the instructions that will be inserted only once :)

--
Håkan Ardö

From fijall at gmail.com Tue Aug 31 10:38:22 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 31 Aug 2010 10:38:22 +0200
Subject: [pypy-dev] Loop invariants
In-Reply-To: References: <20100829110411.GA13704@code0.codespeak.net> Message-ID:

On Tue, Aug 31, 2010 at 9:20 AM, Hakan Ardo wrote:
> On Sun, Aug 29, 2010 at 1:04 PM, Armin Rigo wrote:
>> I'm not saying that loop-invariant code motion could also have a negative effect on large loops; I think it's a pure win, so it's probably worth a try. I'm just giving a warning: it may not help much in the case of a "general Python program doing lots of stuff", but only in the case of small numerical computation loops.
>
> Right. I write a lot of numerical computation loops these days, both small and somewhat bigger, and I am typically forced to write them in C to get decent performance. So the motivation here would rather be to broaden the usability of Python than to improve the performance of existing Python programs.
>
> Another motivation might be to help PyPy developers focus on the important instructions while staring at traces, i.e. by hiding the instructions that will be inserted only once :)

I second Hakan here - small loops are not uninteresting, since this broadens the areas where you can use Python, not limiting yourself to existing Python programs.