From hakan at debian.org Thu Dec 4 10:15:34 2008
From: hakan at debian.org (Hakan Ardo)
Date: Thu, 4 Dec 2008 10:15:34 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References:
Message-ID:

Hi,
I've started to play around with the pypy codebase with the intention
to make obj[i] act like obj.__getitem__(i) for rpython objects. The
approach I tried was to add:

    class __extend__(pairtype(SomeInstance, SomeObject)):
        def getitem((s_array, s_index)):
            s=SomeString()
            s.const="__getitem__"
            p=s_array.getattr(s)
            return p.simple_call(s_index)

and then do something like:

    class __extend__(pairtype(AbstractInstanceRepr, Repr)):
        def rtype_getitem((r_array, r_key), hop):
            hop2=hop.copy()
            ...
            hop2.forced_opname = 'getattr'
            hop2.dispatch()
            hop3=hop.copy()
            ...
            hop3.forced_opname = 'simple_call'
            hop3.dispatch()

But I am having a hard time understanding the rtyper and if this is
the right approach? Is there anything similar in the code/docs I
could look at to get a better understanding on how to write this?

Would it be a better solution to add an intermediate step between the
annotator and the rtyper that converts getitem(SomeInstance,SomeObject)
into vx=getattr(SomeInstance,'__getitem__'); simple_call(vx,SomeObject)?

Any suggestions will be appreciated. Thanx!

--
Håkan Ardö

From simon at arrowtheory.com Fri Dec 5 07:15:36 2008
From: simon at arrowtheory.com (Simon Burton)
Date: Fri, 5 Dec 2008 17:15:36 +1100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References:
Message-ID: <20081205171536.14185653.simon@arrowtheory.com>

On Thu, 4 Dec 2008 10:15:34 +0100
"Hakan Ardo" wrote:

> Hi,
> I've started to play around with the pypy codebase with the intention
> to make obj[i] act like obj.__getitem__(i) for rpython objects.

Woohoo!!
> The
> approach I tried was to add:
>
>     class __extend__(pairtype(SomeInstance, SomeObject)):
>         def getitem((s_array, s_index)):
>             s=SomeString()
>             s.const="__getitem__"
>             p=s_array.getattr(s)
>             return p.simple_call(s_index)
>
> and then do something like:
>
>     class __extend__(pairtype(AbstractInstanceRepr, Repr)):
>         def rtype_getitem((r_array, r_key), hop):
>             hop2=hop.copy()
>             ...
>             hop2.forced_opname = 'getattr'
>             hop2.dispatch()
>             hop3=hop.copy()
>             ...
>             hop3.forced_opname = 'simple_call'
>             hop3.dispatch()

um...

>
> But I am having a hard time understanding the rtyper and if this is
> the right approach? Is there anything similar in the code/docs I
> could look at to get a better understanding on how to write this?

I had lots of fun last year writing code in rpython/numpy .
There is plenty of getitem goodness there. Unfortunately it is
probably impenetrable. Um. Perhaps you could revert back to when the
code was sane (but much less functional).

I also had a whack at __str__ for classes but failed horribly.

Keep pestering me and I'm likely to become interested in this stuff again.

Cheering from afar,

Simon.

From arigo at tunes.org Sun Dec 7 18:04:59 2008
From: arigo at tunes.org (Armin Rigo)
Date: Sun, 7 Dec 2008 18:04:59 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References:
Message-ID: <20081207170458.GA5599@code0.codespeak.net>

Hi Hakan,

On Thu, Dec 04, 2008 at 10:15:34AM +0100, Hakan Ardo wrote:
> I've started to play around with the pypy codebase with the intention
> to make obj[i] act like obj.__getitem__(i) for rpython objects. The
> approach I tried was to add:

Calling __getitem__ in this way is not really supported, but here is how
I managed to get it to work anyway.
It's only slightly more complicated:

    class __extend__(pairtype(SomeInstance, SomeObject)):
        def getitem((s_array, s_index)):
            # first generate a pseudo call to the helper
            bk = getbookkeeper()
            s_callable = bk.immutablevalue(do_getitem)
            args_s = [s_array, s_index]
            bk.emulate_pbc_call(('instance_getitem', s_array.knowntype),
                                s_callable, args_s)
            # then use your own trick to get the correct result
            s=SomeString()
            s.const="__getitem__"
            p=s_array.getattr(s)
            return p.simple_call(s_index)

    # this is the helper
    def do_getitem(array, key):
        return array.__getitem__(key)
    do_getitem._annspecialcase_ = 'specialize:argtype(0)'
    # ^^^ specialization; not sure I have done it right...

and then for the code in rclass:

    from pypy.annotation import binaryop
    from pypy.objspace.flow.model import Constant

    class __extend__(pairtype(AbstractInstanceRepr, Repr)):
        def rtype_getitem((r_array, r_key), hop):
            # call the helper do_getitem...
            hop2 = hop.copy()
            bk = r_array.rtyper.annotator.bookkeeper
            v_func = Constant(binaryop.do_getitem)
            s_func = bk.immutablevalue(v_func.value)
            hop2.v_s_insertfirstarg(v_func, s_func)
            hop2.forced_opname = 'simple_call'
            return hop2.dispatch()

I think the _annspecialcase_ should be able to sort out between multiple
unrelated calls to the helper. The code for the annotator is a bit
bogus, btw, because it emulates a call to the function but also computes
the result explicitly; but I couldn't figure out a better way.

A bientot,

Armin.

From arigo at tunes.org Sun Dec 7 18:59:32 2008
From: arigo at tunes.org (Armin Rigo)
Date: Sun, 7 Dec 2008 18:59:32 +0100
Subject: [pypy-dev] Sprint dates
Message-ID: <20081207175932.GA11777@code0.codespeak.net>

Hi,

About the February sprint, the proposed dates (mostly the original ones)
are: 7-14th. After sorting out the Duesseldorf situation, these dates
could be ok too. Anyone has strong objections?

A bientot,

Armin.
From anto.cuni at gmail.com Sun Dec 7 23:59:52 2008
From: anto.cuni at gmail.com (Antonio Cuni)
Date: Sun, 07 Dec 2008 23:59:52 +0100
Subject: [pypy-dev] Sprint dates
In-Reply-To: <20081207175932.GA11777@code0.codespeak.net>
References: <20081207175932.GA11777@code0.codespeak.net>
Message-ID: <493C5568.5060403@gmail.com>

Armin Rigo wrote:
> Hi,
>
> About the February sprint, the proposed dates (mostly the original ones)
> are: 7-14th. After sorting out the Duesseldorf situation, these dates
> could be ok too. Anyone has strong objections?

the dates should be ok for me, though I don't promise I will come
because I'm already doing a lot of travels in that period and I'm not
sure I feel like doing another one... I'll decide later.

ciao,
Anto

From hakan at debian.org Tue Dec 9 20:14:35 2008
From: hakan at debian.org (Hakan Ardo)
Date: Tue, 9 Dec 2008 20:14:35 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: <20081207170458.GA5599@code0.codespeak.net>
References: <20081207170458.GA5599@code0.codespeak.net>
Message-ID:

Hi,
thanx for all help. For anyone else interested I've placed a few files on
http://hakan.ardoe.net/pypy/ namely:

    getitem_support.py    - The suggested implementation
    getsetitem_support.py - Generalisation to handle __setitem__ as well
    special_methods.py    - Generalisation to handle several __xxx__ methods
    test_getitem.py       - Tests for __getitem__
    test_matrix.py        - Tests using __getitem__, __setitem__ and __add__

> do_getitem._annspecialcase_ = 'specialize:argtype(0)'
> # ^^^ specialization; not sure I have done it right...
>
> I think the _annspecialcase_ should be able to sort out between multiple
> unrelated calls to the helper.

It seems to be doing its job. But if I try to apply the same trick to
the __getitem__ method it does not seem to work, e.g. if I try to
compile the code below it only works if I either do a[i] or a[i,j]
calls not a mix of the two.

    class arr2d:
        def __init__(self,w,h):
            self.width=w
            self.height=h
            self.data=[i for i in range(w*h)]
        def __getitem__(self,i):
            if isinstance(i,int):
                return self.data[i]
            elif len(i)==2:
                return self.data[i[1]*self.width + i[0]]
            else:
                raise TypeError
        __getitem__._annspecialcase_ = 'specialize:argtype(0)'

--
Håkan Ardö

From holger at merlinux.eu Wed Dec 10 12:41:30 2008
From: holger at merlinux.eu (holger krekel)
Date: Wed, 10 Dec 2008 12:41:30 +0100
Subject: [pypy-dev] Sprint dates
In-Reply-To: <20081207175932.GA11777@code0.codespeak.net>
References: <20081207175932.GA11777@code0.codespeak.net>
Message-ID: <20081210114130.GE2219@trillke.net>

On Sun, Dec 07, 2008 at 18:59 +0100, Armin Rigo wrote:
> Hi,
>
> About the February sprint, the proposed dates (mostly the original ones)
> are: 7-14th. After sorting out the Duesseldorf situation, these dates
> could be ok too. Anyone has strong objections?

i am fine with it.

holger

From arigo at tunes.org Thu Dec 11 14:53:47 2008
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 11 Dec 2008 14:53:47 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References: <20081207170458.GA5599@code0.codespeak.net>
Message-ID: <20081211135347.GA6713@code0.codespeak.net>

Hi Hakan,

On Tue, Dec 09, 2008 at 08:14:35PM +0100, Hakan Ardo wrote:
>     class arr2d:
>         def __init__(self,w,h):
>             self.width=w
>             self.height=h
>             self.data=[i for i in range(w*h)]
>         def __getitem__(self,i):
>             if isinstance(i,int):
>                 return self.data[i]
>             elif len(i)==2:
>                 return self.data[i[1]*self.width + i[0]]
>             else:
>                 raise TypeError
>         __getitem__._annspecialcase_ = 'specialize:argtype(0)'

That's the wrong annotation. For this case, it should be
'specialize:argtype(1)' in order to get two versions of __getitem__,
compiled for the two types that can be seen: integers and tuples.

A bientot,

Armin.
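
A rough plain-Python picture of what 'specialize:argtype(1)' is meant to achieve for the arr2d case: one version of __getitem__ per type of the index argument (argument 0 being self). This is only a sketch of the effect, not of how the annotator does it; the names Arr2D, _getitem_int and _getitem_tuple are hypothetical and do not appear in the thread or in PyPy.

```python
class Arr2D(object):
    """Toy 2D array; all names here are hypothetical illustrations."""

    def __init__(self, w, h):
        self.width = w
        self.height = h
        self.data = list(range(w * h))

    # Stand-in for the version specialised for an integer index.
    def _getitem_int(self, i):
        return self.data[i]

    # Stand-in for the version specialised for an (x, y) tuple index.
    def _getitem_tuple(self, xy):
        x, y = xy
        return self.data[y * self.width + x]

    def __getitem__(self, i):
        # Argument 0 is self, argument 1 is the index: the version that
        # runs depends only on the type of argument 1.
        if isinstance(i, int):
            return self._getitem_int(i)
        if isinstance(i, tuple) and len(i) == 2:
            return self._getitem_tuple(i)
        raise TypeError("index must be an int or an (x, y) tuple")


a = Arr2D(3, 2)
flat = a[4]       # integer index into the flat data
cell = a[1, 1]    # tuple index addressing the same element
```

Specialising on argument 0 instead would key on the type of self, which is the same for a[i] and a[i,j], so both index types would still meet in a single version.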
From hakan at debian.org Fri Dec 12 16:49:17 2008
From: hakan at debian.org (Hakan Ardo)
Date: Fri, 12 Dec 2008 16:49:17 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: <20081207170458.GA5599@code0.codespeak.net>
References: <20081207170458.GA5599@code0.codespeak.net>
Message-ID:

On Sun, Dec 7, 2008 at 6:04 PM, Armin Rigo wrote:
>
>     class __extend__(pairtype(SomeInstance, SomeObject)):
>         def getitem((s_array, s_index)):
>             # first generate a pseudo call to the helper
>             bk = getbookkeeper()
>             s_callable = bk.immutablevalue(do_getitem)
>             args_s = [s_array, s_index]
>             bk.emulate_pbc_call(('instance_getitem', s_array.knowntype),
>                                 s_callable, args_s)
>             # then use your own trick to get the correct result
>             s=SomeString()
>             s.const="__getitem__"
>             p=s_array.getattr(s)
>             return p.simple_call(s_index)
>
> unrelated calls to the helper. The code for the annotator is a bit
> bogus, btw, because it emulates a call to the function but also computes
> the result explicitly; but I couldn't figure out a better way.

How about instead doing:

    class __extend__(pairtype(SomeInstance, SomeObject)):
        def getitem((s_array, s_index)):
            return call_helper('do_getitem', (s_array, s_index))

    def call_helper(name,s_args):
        bk = getbookkeeper()
        s_callable = bk.immutablevalue(eval(name))
        s_ret=bk.emulate_pbc_call(('instance_'+name,)+tuple([s.knowntype for s in s_args]),
                                  s_callable, s_args)
        for graph in bk.annotator.pendingblocks.values():
            if graph.name[0:len(name)]==name:
                bk.annotator.notify[graph.returnblock][bk.position_key]=1
        return s_ret;

Is there some way to get hold of the mangled function name of the
created graph? The above code might add too many notifies if there are
several graphs in pendingblocks with a name starting with the name of
the helper. Or is there some better way to get hold of the created
graph object if any was created?

> > __getitem__._annspecialcase_ = 'specialize:argtype(0)'
>
> That's the wrong annotation.
> For this case, it should be
> 'specialize:argtype(1)' in order to get two versions of __getitem__,

Right. Sorry about that.

--
Håkan Ardö

From arigo at tunes.org Fri Dec 12 18:11:47 2008
From: arigo at tunes.org (Armin Rigo)
Date: Fri, 12 Dec 2008 18:11:47 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References: <20081207170458.GA5599@code0.codespeak.net>
Message-ID: <20081212171146.GA9294@code0.codespeak.net>

Hi Hakan,

On Fri, Dec 12, 2008 at 04:49:17PM +0100, Hakan Ardo wrote:
> How about instead doing:
>
> (...)

Ah, using 'notify' to force a reflow. Obscure :-/

> Is there some way to get hold of the mangled function name of the
> created graph?

Don't look up graphs by name; the name is only there to get information
about it when printing the graph. You should probably pass the function
object instead of a string giving the name into your helper. Then you
can get from the function to the graph(s) with the translator.

Armin

From getxsick at gmail.com Mon Dec 15 12:53:45 2008
From: getxsick at gmail.com (Bartosz SKOWRON)
Date: Mon, 15 Dec 2008 12:53:45 +0100
Subject: [pypy-dev] Wroclaw (PL) sprint - announcement
Message-ID: <77887e110812150353m63e2f122h9241a392a3aa29e6@mail.gmail.com>

hi,

i would like to announce that I've already reserved a room for the
upcoming sprint.

dates: 7-14.02.2009 (we have access between 8am - 9 pm but it should
be flexible if needed)
venue: Wroclaw University of Technology, Poland

I've already seen the room and it's pretty nice. Size is similar or a
bit larger than the sprint room from last EuroPython. The room is
dedicated for 30 people so there should be enough space there.
Whiteboard and video projector are included.

More information after Xmas.

cheers!

From hakan at debian.org Mon Dec 15 20:47:26 2008
From: hakan at debian.org (Hakan Ardo)
Date: Mon, 15 Dec 2008 20:47:26 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: <20081212171146.GA9294@code0.codespeak.net> References: <20081207170458.GA5599@code0.codespeak.net> <20081212171146.GA9294@code0.codespeak.net> Message-ID: On Fri, Dec 12, 2008 at 6:11 PM, Armin Rigo wrote: > Hi Hakan, > > On Fri, Dec 12, 2008 at 04:49:17PM +0100, Hakan Ardo wrote: >> How about instead doing: >> >> (...) > > Ah, using 'notify' to force a reflow. Obscure :-/ OK, what's the intended use of the notify feature? The reflow is happening with the previous solution as well. Presumable because p.simple_call(s_index) gets the getitem opperation registered as a call site of the __getitem__ method? Maybe a better solution is to register as a call site of the helper? The following (from rpython/controllerentry.py) seems to do the trick: def call_helper(func,s_args): bk = getbookkeeper() s_callable = bk.immutablevalue(func) return bk.emulate_pbc_call(bk.position_key, s_callable, s_args, callback = bk.position_key) At http://hakan.ardoe.net/pypy/ there is now an implementation of __add__/__radd__ combination in getsetitem_support.py that calls the correct method in all cases I could come up with (test_add.py). It cannot yet handle that the methods return NotImplemented. Would it be possible to handle that in a similar manner to how None is handled? That would remove all unneeded tests if the annotator can prove that a call will always/never return NotImplemented, right? -- H?kan Ard? From amauryfa at gmail.com Fri Dec 19 17:50:04 2008 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Fri, 19 Dec 2008 17:50:04 +0100 Subject: [pypy-dev] pypy's windows buildbot In-Reply-To: <693bc9ab0812190751l42d872f8v68926d056567453b@mail.gmail.com> References: <693bc9ab0812190322h47e55da6q4fcce535dd324df4@mail.gmail.com> <693bc9ab0812190751l42d872f8v68926d056567453b@mail.gmail.com> Message-ID: Maciej Fijalkowski wrote: > > pypy-c -S > >>>> import site > > [...] 
> > RuntimeError: internal error: > > right, that's a known issue that you cannot do a debug build without > -O2 at least. I suppose we need to adapt makefiles and such (it sucks, > doesn't it?). > > The reason for that is that we have tons of local vars which are > temporary and go into registers (but by default, without any > optimisations, compiler does not remove them from stack space) I suggest to modify the compiler options in pypy.translator.platform, on windows first. - use the /STACK linker option to reserve a stack large enough for pypy (this affects only .exe, not dlls) - set the MAX_STACK_SIZE preprocessor symbol used in by LL_stack_too_big() in src/stack.h, to a value slightly lower than the one above. For optimized build, stacksize=512Kb is enough (this is the default value in src/stack.h) For debug build stacksize=4Mb works on win32. I don't know the gcc options very well. What do you think of adding these two: "-DMAX_STACK_SIZE=%d" % (stacksize - 1024) "-Wl,--stack,%d" % stacksize -- Amaury Forgeot d'Arc From p.giarrusso at gmail.com Sun Dec 21 07:06:33 2008 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sun, 21 Dec 2008 07:06:33 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) Message-ID: Hi all, after the completion of our student project, I have enough experience to say something more. We wrote in C an interpreter for a Python subset and we could make it much faster than the Python interpreter (50%-60% faster). That was due to the usage of indirect threading, tagged pointers and unboxed integers, and a real GC (a copying one, which is enough for the current benchmarks we have - a generational GC would be more realistic but we didn't manage to do it). While we lack some runtime checks, conditional branches are often almost for free if you have to do an indirect branch soon and likely flush the pipeline anyway, especially if just one of the possibilities lays in the fast path. 
That means that runtime checks for invalid code have the cost of a normal instruction and not the cost of a conditional branch (which can be much more expensive). I obviously do not expect everybody to believe me on this. So, if you prefer, ignore this "vaporware" and just consider my proposals. In fact, we are at least 50% faster on anything we can run, but also on this benchmark, with the usage of unboxed integers and tagged pointer (we tag pointers with 1 and integers with 0, like V8 does and SELF did, so you can add integers without untagging them): def f(): y = 0 x = 10000000 while x: x-=1 y+=3 return y - 29999873 And since we do overflow checking (by checking EFLAGS in assembly, even if the code could be improved even more, probably), I don't think a comparison on this is unfair in any way. Also, when adding overflow checking we didn't notice any important slowdown. I think that's enough to say that we are faster also because of design choices and implementation quality, and the V8 lead developer, Lars Bak, our professor, had a similar feeling. On Mon, Nov 17, 2008 at 15:05, Antonio Cuni wrote: > Paolo Giarrusso wrote: >> specialized bytecode can be significant, I guess, only if the >> interpreter is really fast (either a threaded one, or a code-copying >> one). Is the PyPy interpreter threaded? > sometime ago I tried to measure if/how much we can gain with a threaded > interpreter. I manually modified the produced C code to make the main loop > threaded, but we didn't gain anything; I think there are three possible > reasons: > 1) in Python a lot of opcodes are quite complex and time-consuming, That's wrong for a number of reasons - the most common opcodes are probably loads from the constant pool, and loads and stores to/from the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed the dispatch after the LOAD_FAST opcode. And those opcodes are quite simple. 
The hotspot used to be LOAD_CONST, but to optimize it, I just had to save the pointer to the constant pool. Instead of accessing current_function->co_consts, I save it into a local, which can be cached into a register even through function calls: raw_array *current_function_consts = current_function->co_consts; That made a 4% speedup on our benchmarks. And we didn't write a microbenchmark for LOAD_CONST. This is the handling code: LABEL_BEGIN(LOAD_CONST); PUSH(raw_array_skip_header(current_function_consts)[parameter]); LABEL_END(LOAD_CONST); (LABEL_BEGIN/LABEL_END bracket all opcode cases, they expand to labels for indirect threading and to the dispatch code itself). After seeing this, you can see why one extra memory load makes a difference. > so the > time spent to dispatch to them is a little percentage of the total time > spent for the execution That's your problem - threading helps when you spend most of the time on dispatch, and efficient interpreters get to that point. Threading helps even Prolog interpreters, like Yap, so it can help even for Python. Also, according to measurements "The Structure and Performance of Efficient Interpreters", efficient interpreters tend to execute no more than 30 instructions per opcode on their RISC machine simulator. The Python interpreter could still maybe be considered efficient, since it uses ~30 instructions per opcode, but that's on a CISC processor (i.e. my laptop), so the numbers are not directly comparable. I'll maybe measure the ratio, for the interpreters they benchmark, again on my laptop, to do a fair comparison. > 2) due to Python's semantics, it's not possible to just jump from one opcode > to the next, as we need to do a lot of bookkeeping, like remembering what > was the last line executed, etc. No, you don't need that, and not even CPython does it. For exception handling, just _when an exception is thrown_, you can take the current opcode index and find the last executed line with a lookup table. 
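
A minimal sketch of the lookup described here, assuming a per-function table that maps the first bytecode offset of each source line to its line number. The table contents and the name line_for_offset are hypothetical; CPython keeps a comparable per-code-object table (co_lnotab), though in a more compact encoding than this list of pairs.

```python
import bisect

# Hypothetical table: (first bytecode offset, source line), sorted by offset.
LINE_TABLE = [(0, 1), (6, 2), (14, 4), (30, 5)]
OFFSETS = [offset for offset, _ in LINE_TABLE]


def line_for_offset(offset):
    """Return the source line whose bytecode range contains `offset`."""
    if offset < 0:
        raise ValueError("negative bytecode offset")
    # Index of the last table entry starting at or before `offset`.
    index = bisect.bisect_right(OFFSETS, offset) - 1
    return LINE_TABLE[index][1]


# Only consulted once an exception is actually raised, so the
# dispatch loop itself never has to record the current line.
faulting_line = line_for_offset(17)
```

The cost is a binary search per raised exception rather than a store per executed opcode.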
Their source code has a slowpath for tracing support, but that's one just conditional jump to a slowpath, which is for free in any case if the predictor guesses it right (and it should). So, can you make some more examples? > This means that the trampolines at the end > of each opcode contains a lot code duplication, leading to a bigger main > loop, with possibly bad effects on the cache (didn't measure this, though) If the interpreter loop is able to overflow the Icache, that should be fought through __builtin_expect first, to give hint for jump prediction and lay out slowpaths out-of-line. > 3) it's possible that I did something wrong, so in that case my measurements > are completely useless :-). If anyone wants to try again, it cannot hurt. Well, my plan is first to try, at some point, to implant threading into the Python interpreter and benchmark the difference - it shouldn't take long but it has a low priority currently. Regards -- Paolo Giarrusso From cfbolz at gmx.de Sun Dec 21 19:13:04 2008 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Sun, 21 Dec 2008 19:13:04 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: References: Message-ID: <494E8730.7050307@gmx.de> Hi Paolo, Paolo Giarrusso wrote: > after the completion of our student project, I have enough experience > to say something more. > We wrote in C an interpreter for a Python subset and we could make it > much faster than the Python interpreter (50%-60% faster). That was due > to the usage of indirect threading, tagged pointers and unboxed > integers, and a real GC (a copying one, which is enough for the > current benchmarks we have - a generational GC would be more realistic > but we didn't manage to do it). Interesting, but it sounds like you are comparing apples to oranges. What sort of subset of Python are you implementing, i.e. what things don't work? 
It has been shown time and time again that implementing only a subset of Python makes it possible to get interesting speedups compared to CPython. Then, as more and more features are implemented, the difference gets smaller and smaller. This was true for a number of Python implementations (e.g. IronPython). I think to get really meaningful comparisons it would be good to modify an existing Python implementation and compare that. Yes, I know this can be a lot of work. On your actual techniques used I don't have an opinion. I am rather sure that a copying GC helped performance ? it definitely did for PyPy. Tagged pointers make PyPy slower, but then, we tag integers with 1, not with 0. This could be changed, wouldn't even be too much work. About better implementations of the bytecode dispatch I am unsure. Note however, that a while ago we did measurements to see how large the bytecode dispatch overhead is. I don't recall the exact number, but I think it was below 10%. That means that even if you somehow manage to reduce that to no overhead at all, you would still only get 10% performance win. Cheers, Carl Friedrich From p.giarrusso at gmail.com Sun Dec 21 19:41:54 2008 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sun, 21 Dec 2008 19:41:54 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: <494E8730.7050307@gmx.de> References: <494E8730.7050307@gmx.de> Message-ID: On Sun, Dec 21, 2008 at 19:13, Carl Friedrich Bolz wrote: > Hi Paolo, > > Paolo Giarrusso wrote: >> after the completion of our student project, I have enough experience >> to say something more. >> We wrote in C an interpreter for a Python subset and we could make it >> much faster than the Python interpreter (50%-60% faster). 
That was due >> to the usage of indirect threading, tagged pointers and unboxed >> integers, and a real GC (a copying one, which is enough for the >> current benchmarks we have - a generational GC would be more realistic >> but we didn't manage to do it). > > Interesting, but it sounds like you are comparing apples to oranges. > What sort of subset of Python are you implementing, i.e. what things > don't work? It has been shown time and time again that implementing only > a subset of Python makes it possible to get interesting speedups > compared to CPython. Then, as more and more features are implemented, > the difference gets smaller and smaller. This was true for a number of > Python implementations (e.g. IronPython). Yes, I remember - that's part of the motivation of your project, or something like that. Still, on the example of the arithmetic loop there couldn't be anything to add I feel; I forgot to note that we support, beyond overflow checking, operator overloading; i.e. this is our BINARY_ADD implementation: LABEL_BEGIN(BINARY_ADD); { tagged_el_t a = POP; tagged_el_t b = POP; if (!(unlikely(is_pointer(a)) || unlikely(is_pointer(b)))) { long sum; if (likely(!overflow_checked_add(&sum, TAGGED_EL_RAW_VALUE(a), TAGGED_EL_RAW_VALUE(b)))) { PUSH(raw_long_to_tagged_el(sum)); } else { //XXX box to unlimited precision integers. PUSH(long_to_tagged_el(INT_MAX)); } } else { Object *bp = get_ptr(b); Object *(*add)(Object *, Object *) = bp->type->add; if (add != NULL) { PUSH(ptr_to_tagged_el(add(get_ptr(b), get_ptr(a)))); } else { assert(false); } } } LABEL_END(BINARY_ADD); As you can see, the slowpaths still need a lot of implementation work (we don't support __radd__ for instance), but I tried to ensure that nothing is missing from the fastpath I'm benchmarking. That's why overflow checking is implemented even if overflow is not handled properly (and the given example does _not_ overflow). 
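
A small Python sketch of the range check behind the integer fast path above, assuming a 64-bit word with one bit spent on the tag. The constants and the (overflowed, sum) return convention are assumptions made for illustration; the C helper in the message takes an output pointer instead and reads the processor flags rather than comparing against bounds.

```python
WORD_BITS = 64                        # assumed machine word size
INT_BITS = WORD_BITS - 1              # one bit is spent on the tag
INT_MAX = (1 << (INT_BITS - 1)) - 1
INT_MIN = -(1 << (INT_BITS - 1))


def overflow_checked_add(a, b):
    """Return (overflowed, total); total is None if it does not fit."""
    total = a + b                     # Python ints never wrap around
    if total < INT_MIN or total > INT_MAX:
        return True, None
    return False, total


# The benchmark loop only ever reaches y = 3 * 10000000, far below
# INT_MAX, so every addition stays on the non-overflowing path.
overflowed, last_y = overflow_checked_add(3 * 10000000 - 3, 3)
result = last_y - 29999873
```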
> I think to get really meaningful comparisons it would be good to modify > an existing Python implementation and compare that. Yes, I know this can > be a lot of work. Yes, I'm aware of that, and that's why I plan to test threading in Python. > On your actual techniques used I don't have an opinion. I am rather sure > that a copying GC helped performance ? it definitely did for PyPy. > Tagged pointers make PyPy slower, but then, we tag integers with 1, not > with 0. This could be changed, wouldn't even be too much work. The major problem in changing it is maybe to convert pointers, since in your representation pointers can be used directly (it depends on your macros). But since offseting a pointer is directly supported by the processor, it shouldn't have any major cost (except 1-extra byte of output, I guess, for each pointer access). > About better implementations of the bytecode dispatch I am unsure. Note > however, that a while ago we did measurements to see how large the > bytecode dispatch overhead is. I don't recall the exact number, but I > think it was below 10%. That means that even if you somehow manage to > reduce that to no overhead at all, you would still only get 10% > performance win. Sure - but my point, discussed in the mail, and taken from the paper I mentioned, is that in an efficient interpreter, bytecode dispatch becomes much more important than the other activities. Don't believe me, believe that paper (which is quite respected). They report speedups given by threading up to 2x :-), on efficient interpreters. Also, Antonio Cuni made just one example of additional overhead for dispatch: source-line tracking, for exception handling I guess. And as I explained, as far as I understand that's simply a mistake, even compared to CPython - so I think there may be other low-hanging fruits in the interpreter. But I'm still waiting for answers and further explainations on this point. 
Regards -- Paolo Giarrusso From anto.cuni at gmail.com Sun Dec 21 23:49:18 2008 From: anto.cuni at gmail.com (Antonio Cuni) Date: Sun, 21 Dec 2008 23:49:18 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: References: Message-ID: <494EC7EE.7000205@gmail.com> Paolo Giarrusso wrote: > Hi all, Hi! > after the completion of our student project, I have enough experience > to say something more. > We wrote in C an interpreter for a Python subset and we could make it > much faster than the Python interpreter (50%-60% faster). That was due > to the usage of indirect threading, tagged pointers and unboxed > integers, and a real GC (a copying one, which is enough for the > current benchmarks we have - a generational GC would be more realistic > but we didn't manage to do it). That's interesting but says little about the benefit of threaded interpretation itself, as the speedup could be given by the other optimizations. For example, I suspect that for the benchmark you showed most of the speedup is because of tagged pointers and the better GC. Is it possible to make you eval loop non-threaded? Measuring the difference with and without indirect threading could give a good hint of how much you gain with it. What kind of bytecode you use? The same as CPython or a custom one? E.g. I found that if we want to handle properly the EXTENDED_ARG CPython opcode it is necessary to insert a lot of code before jumping to the next opcode. Moreover, tagging pointer with 1 helps a lot for numerical benchmarks, but it is possible that causes a slowdown for other kind of operations. Do you have non-numerical benchmarks? (though I know that it's hard to get fair comparison, because the Python object model is complex and it's not easy to write a subset of it in a way that is not cheating) Finally, as Carl said, it would be nice to know which kind of subset it is. E.g. does it support exceptions, sys.settrace() and sys._getframe()? 
> In fact, we are at least 50% faster on anything we can run, but also > on this benchmark, with the usage of unboxed integers and tagged > pointer (we tag pointers with 1 and integers with 0, like V8 does and > SELF did, so you can add integers without untagging them): > > def f(): > y = 0 > x = 10000000 > while x: > x-=1 > y+=3 > return y - 29999873 > > And since we do overflow checking (by checking EFLAGS in assembly, > even if the code could be improved even more, probably), I don't think > a comparison on this is unfair in any way. is your subset large enough to handle e.g. pystone? What is the result? >> 1) in Python a lot of opcodes are quite complex and time-consuming, > > That's wrong for a number of reasons - the most common opcodes are > probably loads from the constant pool, and loads and stores to/from > the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed > the dispatch after the LOAD_FAST opcode. if you do benchmarks as the one showed above, I agree with you. If you consider real world applications, unfortunately there is more than LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more time consuming than LOAD_{FAST,CONST} > That's your problem - threading helps when you spend most of the time > on dispatch, and efficient interpreters get to that point. the question is: is it possible for a full python interpreter to be "efficient" as you define it? >> 2) due to Python's semantics, it's not possible to just jump from one opcode >> to the next, as we need to do a lot of bookkeeping, like remembering what >> was the last line executed, etc. > > No, you don't need that, and not even CPython does it. For exception > handling, just _when an exception is thrown_, [cut] Sorry, I made a typo: it is needed to remember the last *bytecode* executed, not the last line. This is necessary to implement properly sys.settrace(). I never mentioned exception handling, that was your (wrong :-)) guess. 
> If the interpreter loop is able to overflow the Icache, that should be
> fought through __builtin_expect first, to give hint for jump
> prediction and lay out slowpaths out-of-line.

I think that Armin once tried to use __builtin_expect, but I don't remember the outcome. Armin, what was it?

> Well, my plan is first to try, at some point, to implant threading
> into the Python interpreter and benchmark the difference - it
> shouldn't take long but it has a low priority currently.

That would be cool, tell us when you are done :-).

From anto.cuni at gmail.com Mon Dec 22 09:41:21 2008
From: anto.cuni at gmail.com (Antonio Cuni)
Date: Mon, 22 Dec 2008 09:41:21 +0100
Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)
In-Reply-To: <494E8730.7050307@gmx.de>
References: <494E8730.7050307@gmx.de>
Message-ID: <494F52B1.2080802@gmail.com>

Carl Friedrich Bolz wrote:
> About better implementations of the bytecode dispatch I am unsure. Note
> however, that a while ago we did measurements to see how large the
> bytecode dispatch overhead is. I don't recall the exact number, but I
> think it was below 10%.

I think it's somewhat more. There is the 'rbench' module that contains geninterpreted versions of both richards and pystone; IIRC, last time I tried they were ~50% faster than their interpreted counterparts, on both pypy-c and pypy-cli. Of course, with geninterp you remove more than just the interpretation overhead, as e.g. locals are stored on the stack instead of in a frame.

ciao,
Anto

From arigo at tunes.org Tue Dec 23 12:23:44 2008
From: arigo at tunes.org (Armin Rigo)
Date: Tue, 23 Dec 2008 12:23:44 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: References: <20081207170458.GA5599@code0.codespeak.net> <20081212171146.GA9294@code0.codespeak.net> Message-ID: <20081223112344.GA15154@code0.codespeak.net> Hi Hakan, On Mon, Dec 15, 2008 at 08:47:26PM +0100, Hakan Ardo wrote: > cannot yet handle that the methods return NotImplemented. Would it be > possible to handle that in a similar manner to how None is handled? Not easily. The annotation framework of PyPy was never meant to handle the full Python language, but only a subset reasonable for writing interpreters. Anyway, None-or-integer is not supported either, simply because there is no way to represent that in a single machine word. A bientot, Armin. From p.giarrusso at gmail.com Tue Dec 23 12:29:01 2008 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Tue, 23 Dec 2008 12:29:01 +0100 Subject: [pypy-dev] Support for __getitem__ in rpython? In-Reply-To: <20081223112344.GA15154@code0.codespeak.net> References: <20081207170458.GA5599@code0.codespeak.net> <20081212171146.GA9294@code0.codespeak.net> <20081223112344.GA15154@code0.codespeak.net> Message-ID: On Tue, Dec 23, 2008 at 12:23, Armin Rigo wrote: > Hi Hakan, > > On Mon, Dec 15, 2008 at 08:47:26PM +0100, Hakan Ardo wrote: >> cannot yet handle that the methods return NotImplemented. Would it be >> possible to handle that in a similar manner to how None is handled? > > Not easily. The annotation framework of PyPy was never meant to handle > the full Python language, but only a subset reasonable for writing > interpreters. Anyway, None-or-integer is not supported either, simply > because there is no way to represent that in a single machine word. There are at least two ways, once you have a singleton (maybe static) None object around: - box all integers and use only pointers - the slow one; - tagged integers/pointers that you already use elsewhere. So integers of up to 31/63 bits get represented directly, while the other ones are through pointers. 
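A minimal sketch of the tagging scheme described above, assuming a 64-bit word, integers tagged with a low bit of 0 and pointers with a low bit of 1 (the convention mentioned earlier in this thread). All names here are hypothetical, and Python's unbounded ints only simulate a machine word; a real interpreter would do this in C and check the CPU overflow flag instead of comparing ranges.

```python
WORD_BITS = 64
INT_BITS = WORD_BITS - 1                 # one bit is spent on the tag
MAX_SMALL = (1 << (INT_BITS - 1)) - 1    # largest directly represented int
MIN_SMALL = -(1 << (INT_BITS - 1))       # smallest directly represented int


def tag_int(n):
    """Encode a small integer: shift left, leaving the low (tag) bit at 0."""
    if not MIN_SMALL <= n <= MAX_SMALL:
        raise OverflowError("does not fit in a tagged word; must be boxed")
    return n << 1


def untag_int(word):
    """Decode a tagged integer with an arithmetic shift right."""
    return word >> 1


def tag_pointer(address):
    """Encode a pointer: addresses are assumed at least 2-byte aligned."""
    return address | 1


def untag_pointer(word):
    """Clear the tag bit to recover the original address."""
    return word & ~1


def is_small_int(word):
    """A word is a tagged integer when its low bit is 0."""
    return word & 1 == 0


def add_tagged(a, b):
    """Add two tagged integers without untagging them.

    (x << 1) + (y << 1) == (x + y) << 1, so the sum is already tagged.
    The range check stands in for the hardware overflow check.
    """
    result = a + b
    if not (MIN_SMALL << 1) <= result <= (MAX_SMALL << 1):
        raise OverflowError("result must fall back to a boxed integer")
    return result
```

For example, untag_int(add_tagged(tag_int(20), tag_int(22))) gives 42, and adding 1 to MAX_SMALL raises OverflowError, which is where an interpreter would switch to the pointer representation.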
-- Paolo Giarrusso From p.giarrusso at gmail.com Tue Dec 23 06:16:32 2008 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Tue, 23 Dec 2008 06:16:32 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: <494EC7EE.7000205@gmail.com> References: <494EC7EE.7000205@gmail.com> Message-ID: On Sun, Dec 21, 2008 at 23:49, Antonio Cuni wrote: > Paolo Giarrusso wrote: Thanks for the interesting mail. It really goes into detail about the issues I raised, and it does suggest what I needed: stuff needing support inside the interpreter fast path, if just for unlikely branches. I started the required coding, but it will take a while (also because I have little time, given Christmas). But I'm answering with some preliminary results. First, I'd answer this question: >> That's your problem - threading helps when you spend most of the time >> on dispatch, and efficient interpreters get to that point. > the question is: is it possible for a full python interpreter to be > "efficient" as you define it? Well, my guess is "if Prolog, Scheme and so on can, why can't Python"? I know this is an interesting question, and I and my colleague, during the past two months, tried to get an answer. And it wasn't really two months for two people, I would say that much less time was actually spent. So for instance we don't have support for modules, exception handling, and the basic support for object (through Self/V8 style maps) is not yet complete (but almost). Then, for a couple of points, it isn't possible. Unboxed integers work better with the Python 3.x specs - if you have them and you still want a 32-bit integer, which is boxed, to be an int instead of a long, you get some problems. Also, the semantics of Python destructors are too restrictive for a proper GC, and they can create unreclaimable reference cycles (when all objects have destructors). 
In fact, Java, unlike C++, has instead finalizers - the difference is that you can finalize an object after finalizing the objects it points to. That's why I prefer not to say "Python finalizers" like the official docs. However, for both cases I feel that the semantics are wrong. Even finalizers are a bad idea - it's interesting that .NET is even more misguided than Java about them (and that's the opinion of Hans Boehm, not mine). So, those semantics fit well only reference counting - and reference counting, in turn, makes support for multithreading impossible (the patches for "free threading", without the global interpreter lock, gave a 2x slowdown I think). Somebody on the python-dev ML realized that CPython simply gets both things wrong (reference counting and threading support), but most people went to say "oh, but people shouldn't use threading anyway, it's too complicated to get right, they should use multiprocess stuff..." . That was one of the most (negatively) astonishing things I saw. Oh, no, the worst was something like "oh, we should allow using more interpreters in the same process, but we should move lots of statics to the interpreter context, and it's a lot of work so it's not worth it". Since I've participated to a community where bigger patches are committed every week, this looks like plain laziness; it could be because much less people are paid to work on Python, or maybe because the general community mood is that one. In fact, one of the reason I started this project _at all_ so early is that the Python community clearly shows _ignorance_ about years of VM research and little attention about VM performance. == Benchmarks about dispatch overhead == And while you don't look like that, the mention of "tracking the last line executed" seemed quite weird. And even tracking the last bytecode executed looks weird, even if it is not maybe. I'm inspecting CPython's Python/ceval.c, and the overhead for instruction dispatch looks comparable. 
The only real problem I'm getting right now is committing the last bytecode executed to memory. If I store it into a local, I have no problem at all, if I store it into the interpreter context, it's a store to memory, so it hurts performance a lot - I'm still wondering about the right road to go. Runtime of the while benchmark, with 10x more iterations, user time: baseline (now, with tracking of f_lasti inside a local): ~4.6s add ->f_lasti tracking _into memory_: ~5.6s disable indirect threading: ~5.6s both slowdown together: ~7s Python 2.5.1: ~11.2s Other benchmarks use fatter opcodes, so the impact is still visible, but is lower relative to the overall time. Even if f_lasti must be updated at all the time, could it be stored in memory just in some occasions? Is it needed during the call to settrace(), or only to the next opcode after that, i.e. after the return of the interpreter loop? The second case would be enough. Storing the variable into memory just for that would be simply excellent for performance, maybe even in the stock Python interpreter. Note: all this is on a 64bit box, with 16 registers; we're also faster on a 32-bit one, and in some cases the difference wrt. Python is bigger on 32-bit (on the GC stress-test, for instance, due to smaller pointers to update, and on the recursion test, maybe for the same reason when using the stack, since the stack element size is 64bit on 64bit machines), while on the others the extra registers (I guess) give advantage wrt. Python to the 64bit interpreter (obviously, the comparison is always with Python on the same machine and architecture). See slide 8 of pres.pdf. > That's interesting but says little about the benefit of threaded > interpretation itself, as the speedup could be given by the other > optimizations. For example, I suspect that for the benchmark you showed > most of the speedup is because of tagged pointers and the better GC. > Is it possible to make you eval loop non-threaded? 
Measuring the difference > with and without indirect threading could give a good hint of how much you > gain with it. Ok, just done it, the speedup given by indirect threading seems to be about 18% (see also above). More proper benchmarks are needed though. And as you say in the other mail, the overhead given by dispatch is quite more than 50% (maybe). Am I correct in assuming that "geninterpret"ing _basically_ pastes the opcode handlers together? I guess with your infrastructure, you can even embed easily the opcode parameters inside the handlers, it's just a trivial partial evaluation - I tried to apply code copying of machine code to my interpreter, but I would have had to keep parameter fetching separate (getting the output I needed from GCC was not easy - I could make code copying work just for an early prototype). About keeping variables into the stack instead that in the frame, that's even stranger to me, given this argument. > What kind of bytecode you use? The same as CPython or a custom one? E.g. I > found that if we want to handle properly the EXTENDED_ARG CPython opcode it > is necessary to insert a lot of code before jumping to the next opcode. We handle a subset of the Python's opcodes, but what you say means that we'll have to handle EXTENDED_ARG properly, because of our mission. Oh, done, I would say (tests needed though). I think that "lot of code" can only refer to the requirement of tracking the last bytecode executed, but I'll have to look at your and CPython sources for that. That seems to have an impact of a few percent points (like 4-5% slower than before), but I have to do real benchmarking for these stuff. I fear I'll have to calculate confidence interval to give any meaningful number (I never needed that when comparing us with CPython), because the impact seems under statistical noise. However, just having the code laid out the wrong way can make it much worse. 
As soon as I wrote it, it almost doubled the runtime of the arithmetic benchmark (while it had a much lower impact on the other). It seems this was because GCC decided that HAS_ARG(opcode) had become unlikely; adding "likely" fixed it.

In fact, one of the reasons we are so fast is that we started with a small interpreter and tracked down every performance regression as soon as it happened. At some point we lost half of our performance just because the address of the program_counter local variable escaped, so that variable had to be allocated on the stack and not in a register.

== Opcode set ==

We really wanted to have our own opcode set, but we didn't manage. In particular, I was refreshed when I read about your LOAD_METHOD optimization. Also, invoking __init__ is a real mess; Java handles it much more nicely, and I don't see the point of not using JVM-style 2-step class construction: new Class dup <-- invokespecial Class. The existence of bound methods prevents simply pushing the receiver as the first argument on the stack; however, I can't see a similar argument for class construction, even if __init__ can be a bound method. Generating similar bytecode would avoid the need to rotate the stack. In Python, one would get something like:

    ALLOC_INSTANCE Class   // <--- new opcode
    DUP_TOP                // Copy the value to return
    LOAD_METHOD __init__
    INVOKE_METHOD
    POP_TOP                // Remove the None returned by the constructor

However, the dispatch cost of the additional DUP_TOP and POP_TOP might be a problem. I guess such bytecode will certainly be more efficient to JIT-compile, but for actual interpretation, benchmarks would be needed.

> Moreover, tagging pointer with 1 helps a lot for numerical benchmarks, but
> it is possible that causes a slowdown for other kind of operations. Do you
> have non-numerical benchmarks?
(though I know that it's hard to get fair > comparison, because the Python object model is complex and it's not easy to > write a subset of it in a way that is not cheating) I agree with that. The choice was done following the one done by V8 people, and was already tested in the Self system, but I'll answer you with those tests. We have a couple of recursion test, with a recursive function and a recursive method (but this one is unfair). And a test with list access - we are ~50% faster even there. For the Python object model, we have the opposite problem, because it's much more code to write and optimize: our INPLACE_ADD is equal to BINARY_ADD, so given a list l, a loop with "l += [1]" is able to kill us completely (since for now we create a list each time). Indeed, "l = l + [1]" is just as slow with our interp, but gets much slower in Python. Even here we are still faster, but most of the time is spent adding lists, so we are just 15% faster. I guess we'd get more advantage if INPLACE_ADD was implemented, and that's not a big deal. > Finally, as Carl said, it would be nice to know which kind of subset it is. I just need to find the time to write it down, sorry for the wait. I'm attaching our final presentation, the report we wrote for the course, and the _benchmarks_ we have. But basically, the subset we implemented is _much_ smaller. > E.g. does it support exceptions, sys.settrace() and sys._getframe()? Exception handling is not supported, but stack unwind should be possible without impact on the fastpath. We didn't know about the other two ones. I saw code which I guess is for handling settrace in CPython, but that's on the slowpath, I think I should be able to easily simulate the impact of that by adding one unlikely conditional jump. Is it required to track the last bytecode even when tracing is not active, as asked above? About _getframe(), I'm quite curious about how it works and how your support for it works. 
From the next mail you sent, it seems that you construct the needed frames during the function calls (and I saw some documentation about them). Well, in the Smalltalk age, developers found that constructing the frame object only when reflective access to the frame is needed is the only reasonable solution, if performance is important. _If_ the frame object returned is modifiable, I still think one can intercept field modifications and modify the original stack to track the modified object. In fact, to get the current performance, we don't allocate a locals dictionary unless needed; and currently it seems that doing it only when STORE_NAME is invoked gives the correct semantics (and if that were wrong, computing a NEEDS_LOCAL_DICT flag during compilation would be trivial). Otherwise, the recursion test would be slower than the one in Python. Also, we didn't bother with Unicode, and that's complicated to handle - even Unicode handling in V8 has problems, and string operations are slower than in TraceMonkey. > is your subset large enough to handle e.g. pystone? What is the result? We didn't have the time to even download pystone. >>> 1) in Python a lot of opcodes are quite complex and time-consuming, >> >> That's wrong for a number of reasons - the most common opcodes are >> probably loads from the constant pool, and loads and stores to/from >> the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed >> the dispatch after the LOAD_FAST opcode. > if you do benchmarks as the one showed above, I agree with you. If you > consider real world applications, unfortunately there is more than > LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more > time consuming than LOAD_{FAST,CONST} Yep, but well, if you want to implement a low-level algorithm in Python, at some point you'll write such code in an inner loop. 
But yes, we've been working also on attribute fetch - we've been implementing V8 maps just for that (we worked on that for a couple of days, and I'm fixing the remaining bugs - but it's not really a lot of code). For GET/SET_ATTR, I will need to do "inline caching" of lookup results (even if the term doesn't really apply to an interpreter). Making it faster than dictionary lookup will be hard however, since our strings cache their hash code. >>> 2) due to Python's semantics, it's not possible to just jump from one >>> opcode >>> to the next, as we need to do a lot of bookkeeping, like remembering what >>> was the last line executed, etc. >> >> No, you don't need that, and not even CPython does it. For exception >> handling, just _when an exception is thrown_, > [cut] > Sorry, I made a typo: it is needed to remember the last *bytecode* executed, > not the last line. This is necessary to implement properly sys.settrace(). > I never mentioned exception handling, that was your (wrong :-)) guess. Indeed, and I was quite surprised to see something like that. I'm quite happier now, but see above for comment on its impact. >> If the interpreter loop is able to overflow the Icache, that should be >> fought through __builtin_expect first, to give hint for jump >> prediction and lay out slowpaths out-of-line. > I think that Armin tried once to use __builtin_expect, but I don't remember > the outcome. Armin, what was it? On suitable benchmarks, for me it's easy to see that the wrong expectation can give horrible results (not always, but often). And sometimes GCC does get the wrong result. But once, on a short opcode, fixing code laid out the wrong way didn't make any important difference. I guess that to execute one simple opcode one maybe doesn't manage to fill a 20-stage pipeline before of the next dispatch, which will flush it. So, if you have 10 or 20 instructions, it doesn't matter that much - 20 cycles are needed anyway. 
In that case, fixing the layout probably made the same instructions execute faster, but they had to wait longer. Also, I guess that dynamic branch prediction could compensate for the wrong code layout (not perfectly though - a predicted taken branch costs more than one not taken).

Note: to give you actual numbers, BINARY_ADD uses 16 assembler instructions, and then 11 are needed for dispatching, plus another 7 if the next opcode has a parameter; the slow paths use some more instructions, in case of EXTENDED_ARG (but they are laid out-of-line). DUP_TOP is three assembly instructions, POP_TOP is one, and the stack pointer is kept in RBX.

Note 2: by suitable benchmarks, I don't mean unfair ones. It would be easy (and unfair) to give hints for the exact code you have in your benchmark.

>> Well, my plan is first to try, at some point, to implant threading
>> into the Python interpreter and benchmark the difference - it
>> shouldn't take long but it has a low priority currently.

> That would be cool, tell us when you have done :-).

For sure I will let you know, just don't hold your breath on that :-).

Regards
--
Paolo Giarrusso
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pres.pdf
Type: application/pdf
Size: 56344 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report.pdf.bz2
Type: application/x-bzip2
Size: 105640 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench.tar.bz2
Type: application/x-bzip2
Size: 2549 bytes

From arigo at tunes.org Tue Dec 23 15:44:26 2008
From: arigo at tunes.org (Armin Rigo)
Date: Tue, 23 Dec 2008 15:44:26 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: References: <20081207170458.GA5599@code0.codespeak.net> <20081212171146.GA9294@code0.codespeak.net> <20081223112344.GA15154@code0.codespeak.net> Message-ID: <20081223144426.GA1401@code0.codespeak.net> Hi Paolo, On Tue, Dec 23, 2008 at 12:29:01PM +0100, Paolo Giarrusso wrote: > There are at least two ways, once you have a singleton (maybe static) > None object around: > - box all integers and use only pointers - the slow one; > - tagged integers/pointers that you already use elsewhere. So integers > of up to 31/63 bits get represented directly, while the other ones are > through pointers. Yes, we're using both ways, but for app-level integers, not for regular RPython-level integers. That would be a major slow-down. A bientot, Armin. From arigo at tunes.org Tue Dec 23 16:01:14 2008 From: arigo at tunes.org (Armin Rigo) Date: Tue, 23 Dec 2008 16:01:14 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: References: <494EC7EE.7000205@gmail.com> Message-ID: <20081223150113.GA1522@code0.codespeak.net> Hi Paolo, Just a quick note as an answer to your long and detailed e-mail (thanks for it, btw). On the whole you're making quite some efforts to get Python fast, starting with a subset of Python and adding feature after feature until it is more or less complete, while benchmarking at every step. This is not a new approach: it has been tried before for Python. Usually, this kind of project ends up being not used and forgotten, because it's "only" 80% or 90% compatible but not 99% -- and people care much more, on average, about 99% compatibility than about 50% performance improvement. PyPy on the other hand starts from the path of 99% compatibility and then tries to improve performance (which started as 10000 times slower... and is now roughly 1.5 or 2 times slower). Just saying that the approach is completely different... 
And I have not much interest in it -- because you change the language and have to start again from scratch. A strong point of PyPy is that you don't have to; e.g. we have, in addition to the Python interpreter, a JavaScript, a Smalltalk, etc... A bientot, Armin. From anto.cuni at gmail.com Tue Dec 23 15:59:37 2008 From: anto.cuni at gmail.com (Antonio Cuni) Date: Tue, 23 Dec 2008 15:59:37 +0100 Subject: [pypy-dev] Support for __getitem__ in rpython? In-Reply-To: References: <20081207170458.GA5599@code0.codespeak.net> <20081212171146.GA9294@code0.codespeak.net> <20081223112344.GA15154@code0.codespeak.net> Message-ID: <4950FCD9.8050104@gmail.com> Paolo Giarrusso wrote: > There are at least two ways, once you have a singleton (maybe static) > None object around: > - box all integers and use only pointers - the slow one; > - tagged integers/pointers that you already use elsewhere. So integers > of up to 31/63 bits get represented directly, while the other ones are > through pointers. I think you are confusing level: here we are talking about RPython, i.e. the language which our Python interpreter is implemented in. Hence, RPython ints are really like C ints, and you don't want to manipulate C ints as tagged pointer, do you? ciao, Anto From p.giarrusso at gmail.com Tue Dec 23 17:49:20 2008 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Tue, 23 Dec 2008 17:49:20 +0100 Subject: [pypy-dev] Support for __getitem__ in rpython? In-Reply-To: <4950FCD9.8050104@gmail.com> References: <20081207170458.GA5599@code0.codespeak.net> <20081212171146.GA9294@code0.codespeak.net> <20081223112344.GA15154@code0.codespeak.net> <4950FCD9.8050104@gmail.com> Message-ID: On Tue, Dec 23, 2008 at 15:59, Antonio Cuni wrote: > Paolo Giarrusso wrote: > >> There are at least two ways, once you have a singleton (maybe static) >> None object around: >> - box all integers and use only pointers - the slow one; >> - tagged integers/pointers that you already use elsewhere. 
So integers
>> of up to 31/63 bits get represented directly, while the other ones are
>> through pointers.
>
> I think you are confusing level: here we are talking about RPython, i.e. the
> language which our Python interpreter is implemented in. Hence, RPython
> ints are really like C ints, and you don't want to manipulate C ints as
> tagged pointer, do you?

I understood the difference, but "there's no way to represent both of them in a machine word" was a statement that prompted me to write something - actually, I was thinking just of the return convention of __add__ and __radd__. If those methods start returning NotImplemented or None, any _sound_ static type analysis won't assign type "int" to them, so it looks (to me, and I'm aware that I don't know the internals of RPython) like it may be possible to do this without tagging _all_ integers. And there are examples of compiled languages with tagged integers (I know at least of OCaml).

But can you currently live in RPython without anything which could be a pointer or an integer? Can you have a list like [1, None] in RPython?

Then I wonder how you get a homogeneous call interface for all __add__ methods (i.e. how to force the one returning just integers to also have type NotImplementedOrInteger). And I also wonder if the RPython compiler can inline the __add__ call and optimize the tagging away.

That said, I do not know if what I'm suggesting is implementable in RPython, or if it would be a good idea. Just my 2 cents, since this might be what Hakan is looking for.
Regards -- Paolo Giarrusso From anto.cuni at gmail.com Wed Dec 24 22:29:33 2008 From: anto.cuni at gmail.com (Antonio Cuni) Date: Wed, 24 Dec 2008 22:29:33 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: References: <494EC7EE.7000205@gmail.com> Message-ID: <4952A9BD.6060903@gmail.com> Paolo Giarrusso wrote: >> the question is: is it possible for a full python interpreter to be >> "efficient" as you define it? > > Well, my guess is "if Prolog, Scheme and so on can, why can't Python"? a possible answer is that python is much more complex than prolog; for example, in PyPy we also have an rpython implementation of both prolog and scheme (though I don't know how much complete is the latter one). I quickly counted the number of lines for the interpreters, excluding the builtin types/functions, and we have 28188 non-empty lines for python, 5376 for prolog and 1707 for scheme. I know that the number of lines does not mean anything, but I think it's a good hint about the relative complexities of the languages. I also know that being more complex does not necessarily mean that it's impossible to write an "efficient" interpreter for it, it's an open question. Thanks for the interesting email, but unfortunately I don't have time to answer right now (xmas is coming :-)), I just drop few quick notes: > And while you don't look like that, the mention of "tracking the last > line executed" seemed quite weird. > And even tracking the last bytecode executed looks weird, even if it > is not maybe. I'm inspecting CPython's Python/ceval.c, and the > overhead for instruction dispatch looks comparable. > The only real problem I'm getting right now is committing the last > bytecode executed to memory. If I store it into a local, I have no > problem at all, if I store it into the interpreter context, it's a > store to memory, so it hurts performance a lot - I'm still wondering > about the right road to go. 
by "tracking the last bytecode executed" I was really referring to the equivalent of f_lasti; are you sure you can store it in a local and still implement sys.settrace()?

> Ok, just done it, the speedup given by indirect threading seems to be
> about 18% (see also above). More proper benchmarks are needed though.

that's interesting, thanks for trying. I wonder whether I should try indirect threading in pypy again sooner or later.

Btw, are the sources for your project available somewhere?

> And as you say in the other mail, the overhead given by dispatch is
> quite more than 50% (maybe).

no, it's less. 50% is the total speedup given by geninterp, which removes dispatch overhead but also does other things, like storing variables on the stack and turning Python-level flow control into C-level flow control (so e.g. loops are expressed as C loops).

> Am I correct in assuming that
> "geninterpret"ing _basically_ pastes the opcode handlers together? I
> guess with your infrastructure, you can even embed easily the opcode
> parameters inside the handlers, it's just a trivial partial evaluation

that's (part of) what our JIT is doing/will do. But it does much more than that, of course.

Merry Christmas to you and all pypyers on the list!

ciao,
Anto

From p.giarrusso at gmail.com Thu Dec 25 00:42:18 2008
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Thu, 25 Dec 2008 00:42:18 +0100
Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)
In-Reply-To: <20081223150113.GA1522@code0.codespeak.net>
References: <494EC7EE.7000205@gmail.com> <20081223150113.GA1522@code0.codespeak.net>
Message-ID:

Hi Armin,
first, thanks for your answer.

On Tue, Dec 23, 2008 at 16:01, Armin Rigo wrote:
> On the whole you're making quite some efforts to get Python fast,
> starting with a subset of Python and adding feature after feature until
> it is more or less complete, while benchmarking at every step.
This is
> not a new approach: it has been tried before for Python. Usually, this
> kind of project ends up being not used and forgotten, because it's
> "only" 80% or 90% compatible but not 99% -- and people care much more,
> on average, about 99% compatibility than about 50% performance
> improvement. PyPy on the other hand starts from the path of 99%
> compatibility and then tries to improve performance (which started as
> 10000 times slower... and is now roughly 1.5 or 2 times slower).

> Just saying that the approach is completely different... And I have not
> much interest in it -- because you change the language and have to start
> again from scratch. A strong point of PyPy is that you don't have to;
> e.g. we have, in addition to the Python interpreter, a JavaScript, a
> Smalltalk, etc...

I never said we're going to turn this into a full-featured Python interpreter and rewrite all the libraries for it. So, just a few clarifications:

1) this is a _student project_ which is currently "completed" and has been handed in. It was written by two students and was our first interpreter ever (and for one of us, the first really big C project). I knew that locals are faster than structure fields, but I had absolutely no idea why and by how much before I started experimenting with this.

2) it is intended to be a way to learn how to write an interpreter, and a proof of concept about how Python can be made faster. The first two things I'll try to optimize are the assignment to ->f_lasti and the addition of indirect threading (even if right now I'd guess an impact around 5%, if anything, because of refcounting). If I want to try something without refcounting, I guess I'd turn to PyPy, but don't hold your breath for that. The fact that indirect threading didn't work, that you're 1.5-2x slower than CPython, and that you store locals in frame objects all show that the abstraction overhead of the interpreter is too high.
Since you have different type of frame objects, I guess you might use virtuals to access them (even though I hope not), or that you have anyhow some virtuals. And that'd be a problem as well. 3) still, I do believe that working on it was interesting to get experience about how to optimize an interpreter. And the original idea was to show that real multithreading (without a global interpreter lock) cannot be done in Python just because of the big design mistakes of CPython. Regards -- Paolo Giarrusso From p.giarrusso at gmail.com Fri Dec 26 05:52:06 2008 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 26 Dec 2008 05:52:06 +0100 Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: <4952A9BD.6060903@gmail.com> References: <494EC7EE.7000205@gmail.com> <4952A9BD.6060903@gmail.com> Message-ID: Hi! This time, I'm trying to answer shortly Is this the geninterp you're talking about? http://codespeak.net/pypy/dist/pypy/doc/geninterp.html Is the geninterpreted version RPython code? I'm almost sure, except for the """NOT_RPYTHON""" doc string in the geninterpreted source snippet. I guess it's there because the _source_ of it is not RPython code. On Wed, Dec 24, 2008 at 22:29, Antonio Cuni wrote: > Paolo Giarrusso wrote: > I quickly counted the number of lines for the interpreters, excluding the > builtin types/functions, and we have 28188 non-empty lines for python, 5376 > for prolog and 1707 for scheme. > I know that the number of lines does not mean anything, but I think it's a > good hint about the relative complexities of the languages. Also about the amount of Python-specific optimizations you did :-). > I also know > that being more complex does not necessarily mean that it's impossible to > write an "efficient" interpreter for it, it's an open question. The 90-10 rule should apply anyway, but overhead for obscure features might be a problem. 
Well, reflection on the call stack can have a big runtime impact, but that's also present in Smalltalk as far as I know and that can be handled as well. Anyway, if Python developers are not able to implement efficient multithreading in the interpreter because of the excessive performance impact and they don't decide to drop refcounting, saying "there's space for optimizations" looks like a safe bet; the source of the idea is what I've been taught in the course, but I'm also noticing this by myself. > Thanks for the interesting email, but unfortunately I don't have time to > answer right now (xmas is coming :-)), I just drop few quick notes: Yeah, for me as well, plus I'm in the last month of my Erasmus study time :-) >> Ok, just done it, the speedup given by indirect threading seems to be >> about 18% (see also above). More proper benchmarks are needed though. > that's interesting, thanks for having tried. I wonder I should try again > with indirect threading in pypy soon or later. I would do it together with OProfile benchmarking of indirect branches and of their mispredictions (see the presentation for the OProfile commands on the Core 2 processor). > Btw, are the sources for your project available somewhere? They'll be sooner or later. There are a few bugs I should fix, and a few other interesting things to do. But if you are interested in trying to do benchmarking even if it's a student project, it's not feature complete, and it's likely buggy, I might publish it earlier. >> And as you say in the other mail, the overhead given by dispatch is >> quite more than 50% (maybe). > no, it's less. Yeah, sorry, I remember you wrote geninterp also does other stuff. 
> 50% is the total speedup given by geninterp, which removes > dispatch overhead but also other things, like storing variables on the stack I wonder why that's not done by your stock interpreter - the CPython frame object has a pointer to a real stack frame; I'm not sure, but I guess this can increase stack locality since a 32/64-byte cacheline is much bigger than a typical stack frame and has space for the operand stack (and needless to say we store locals on the stack, like JVMs do). The right benchmark for this, I guess, would be oprofiling cache misses on a recursion test like factorial or Fibonacci. > and turning python level flow control into C-level flow control (so e.g. > loops are expressed as C loops). Looking at the geninterpreted code, it's amazing that the RPython translator can do this. Can it also already specialize the interpreter for each of the object spaces and save the virtual calls? == About F_LASTI == > by "tracking the last bytecode executed" I was really referring to the > equivalent of f_lasti; are you sure you can store it in a local and still > implement sys.settrace()? Not really, I didn't even start studying its proper semantics, but now I know it's worth a try and some additional complexity, at least in an interpreter with GC. If one write to memory has such a horrible impact, I'm frightened by the possible impact of refcounting; on the other side, I wouldn't be surprised if saving the f_lasti write had no impact on CPython. My current approach would be that if I can identify code paths where no code can even look at it (and I guess that most simple opcodes are such paths), I can copy f_lasti to a global structure only in the other paths; if f_lasti is just passed to the code tracing routine and it's called only from the interpreter loop, I could even turn it into a parameter to that routine (it may be faster with a register calling convention, but anyway IMHO one gets code which is easier to follow). 
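Editorial sketch, not from the thread: the idea in the paragraph above (keep the instruction index in a local, hand it to the tracing routine as a parameter, and write it to the frame only where other code can look at it) could look roughly like this in a toy stack interpreter. Frame, run, PROGRAM and the three opcodes are hypothetical names invented for the illustration; they are not CPython, PyPy or Jython APIs.

```python
# Toy interpreter: all names below are hypothetical, for illustration only.
PUSH, ADD, HALT = range(3)

class Frame(object):
    def __init__(self, code, trace=None):
        self.code = code      # list of (opcode, argument) pairs
        self.stack = []
        self.f_lasti = -1     # index *before* the first instruction
        self.trace = trace    # optional callable(frame, lasti)

def run(frame):
    code, stack, trace = frame.code, frame.stack, frame.trace
    pc = 0                    # index of the next instruction, kept in a local
    while True:
        lasti = pc
        op, arg = code[pc]
        pc += 1
        if trace is not None:
            trace(frame, lasti)       # the index travels as a parameter
        if op == PUSH:
            stack.append(arg)
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == HALT:
            frame.f_lasti = lasti     # written back only on the exit path
            return stack.pop()
        else:
            raise ValueError("unknown opcode %r" % (op,))

PROGRAM = [(PUSH, 2), (PUSH, 3), (ADD, None), (HALT, None)]
```

The sketch reads the hook once on entry, so it deliberately does not handle a hook that gets installed while the loop is already running.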
Actually, I even wonder if I can just set it when tracing is active, but since that'd be trivial to do, I guess that when you return from a call to settrace, you discover (without being able to anticipate it) that now you need to discover the previous opcode, that's why it's not already fixed. Still, a local can do even for that, or more complicated algorithms can do as well (basically, the predecessor is always known at compile time except for jumps, so only jump opcodes really need to compute f_lasti).

Regards
--
Paolo Giarrusso

From hakan at debian.org Fri Dec 26 11:31:53 2008
From: hakan at debian.org (Hakan Ardo)
Date: Fri, 26 Dec 2008 11:31:53 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References:
Message-ID:

On Tue, Dec 23, 2008 at 12:23 PM, Armin Rigo wrote:
> On Mon, Dec 15, 2008 at 08:47:26PM +0100, Hakan Ardo wrote:
>> cannot yet handle that the methods return NotImplemented. Would it be
>> possible to handle that in a similar manner to how None is handled?
>
> Not easily. The annotation framework of PyPy was never meant to handle
> the full Python language, but only a subset reasonable for writing

OK, that makes sense.

On Tue, Dec 23, 2008 at 5:49 PM, Paolo Giarrusso wrote:
> And I also wonder if the
> RPython compiler can inline the __add__ call and optimize the tagging
> away.
>
> That said, I do not know if what I'm suggesting is implementable in
> RPython, or if it would be a good idea. Just my 2 cents, since this
> might be what Hakan is looking for.

Yes, in most cases of interest, the annotator can determine whether the __add__ method always/never returns NotImplemented, and thus the test will be removed. If we are not able to handle the most general case, where the choice between __add__ and __radd__ has to be made at runtime, that's fine.

If I use None instead of NotImplemented I get the behaviour I want. The following code, for example, does compile:

    class mystr(object):
        def __init__(self,s):
            self.s=str(s)
        def __str__(self):
            return self.s
        def __add__(self,other):
            return mystr(self.s+str(other))
        def __radd__(self,other):
            return mystr(str(other)+self.s)
        __add__._annspecialcase_ = 'specialize:argtype(1)'
        __radd__._annspecialcase_ = 'specialize:argtype(1)'

    class pair(object):
        def __init__(self,a,b):
            self.a=a
            self.b=b
        def __str__(self):
            return "(%d,%d)"%(self.a,self.b)
        def __add__(self,other):
            if isinstance(other,pair):
                return pair(self.a+other.a, self.b+other.b)
            else:
                return None
        __add__._annspecialcase_ = 'specialize:argtype(1)'
        __radd__=__add__

    def dotst_notimplemented():
        a=mystr('a')
        b=pair(1,2)
        return (str(b+a), str(a+b))

if I use the following helper to call the method:

    def do_add_radd(lop,rop):
        if isinstance(rop,lop.__class__) and not isinstance(lop,rop.__class__):
            r=rop.__radd__(lop)
            if r is None:
                return lop.__add__(rop)
            return r
        else:
            r=lop.__add__(rop)
            if r is None:
                return rop.__radd__(lop)
            return r

But if I replace None with NotImplemented, it does not compile anymore. So, can we get the annotator to treat NotImplemented in a similar manner as it treats None?

--
Håkan Ardö

From jbaker at zyasoft.com Fri Dec 26 16:19:36 2008
From: jbaker at zyasoft.com (Jim Baker)
Date: Fri, 26 Dec 2008 08:19:36 -0700
Subject: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)
In-Reply-To:
References: <494EC7EE.7000205@gmail.com> <4952A9BD.6060903@gmail.com>
Message-ID:

Interesting discussion. Just a note:

in Jython, f_lasti is only used to manage exit/entry points, specifically for coroutines/generators, so this is not at the level of bytecode granularity. We also set f_lineno, at the level of Python code of course. HotSpot apparently optimizes this access nicely in any event. (There are other problems with supporting call frames, but this is not one of them it seems.)
Java also offers a debugging interface, which in conjunction with a C++ agent, allows for more fine-grained access to these internals, potentially with lower overhead. This is something Tobias Ivarsson has been exploring.

- Jim

--
Jim Baker
jbaker at zyasoft.com

From arigo at tunes.org Sat Dec 27 19:45:13 2008
From: arigo at tunes.org (Armin Rigo)
Date: Sat, 27 Dec 2008 19:45:13 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References:
Message-ID: <20081227184512.GA20899@code0.codespeak.net>

Hi,

On Fri, Dec 26, 2008 at 11:31:53AM +0100, Hakan Ardo wrote:
> But if I replace None with NotImplemented, it does not compile
> anymore. So, can we get the annotater to treat NotImplemented in a
> similar manner as it treats None?

You can try, but it's messy. It's not a problem for the annotator but for the later RTyper. None is implemented as a NULL pointer by the RTyper; I don't know how you would distinguish between None-as-a-NULL and NotImplemented. You could try to go for something like ((TYPE*) 1), but this doesn't work on top of ootype, where you really have only one NULL value.

A bientot,

Armin.

From p.giarrusso at gmail.com Sat Dec 27 19:56:51 2008
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Sat, 27 Dec 2008 19:56:51 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: <20081227184512.GA20899@code0.codespeak.net>
References: <20081227184512.GA20899@code0.codespeak.net>
Message-ID:

On 27/12/2008, Armin Rigo wrote:
> Hi,
>
> On Fri, Dec 26, 2008 at 11:31:53AM +0100, Hakan Ardo wrote:
> > But if I replace None with NotImplemented, it does not compile
> > anymore. So, can we get the annotater to treat NotImplemented in a
> > similar manner as it treats None?
> You can try, but it's messy. It's not a problem for the annotator but
> for the later RTyper. None is implemented as a NULL pointer by the
> RTyper; I don't know how you would distinguish between None-as-a-NULL
> and NotImplemented.
You could try to go for something like ((TYPE*) 1), > but this doesn't work on top of ootype, where you really have only one > NULL value. I don't know ootype, but why not having a NotImplemented singleton type, and returning a pointer to its instance? Python has a None singleton type as well, so it makes sense. Regards -- Paolo Giarrusso From arigo at tunes.org Sat Dec 27 21:51:40 2008 From: arigo at tunes.org (Armin Rigo) Date: Sat, 27 Dec 2008 21:51:40 +0100 Subject: [pypy-dev] Support for __getitem__ in rpython? In-Reply-To: References: <20081227184512.GA20899@code0.codespeak.net> Message-ID: <20081227205140.GA31712@code0.codespeak.net> Hi Paolo, On Sat, Dec 27, 2008 at 07:56:51PM +0100, Paolo Giarrusso wrote: > I don't know ootype, but why not having a NotImplemented singleton > type, and returning a pointer to its instance? Python has a None > singleton type as well, so it makes sense. See our doc: http://codespeak.net/pypy/dist/pypy/doc/translation.html and http://codespeak.net/pypy/dist/pypy/doc/rtyper.html . A bientot, Armin. From jbaker at zyasoft.com Sat Dec 27 22:08:46 2008 From: jbaker at zyasoft.com (Jim Baker) Date: Sat, 27 Dec 2008 14:08:46 -0700 Subject: [pypy-dev] Fwd: Threaded interpretation (was: Re: compiler optimizations: collecting ideas) In-Reply-To: References: <494EC7EE.7000205@gmail.com> <4952A9BD.6060903@gmail.com> Message-ID: forgot to reply-all ---------- Forwarded message ---------- From: Jim Baker Date: Sat, Dec 27, 2008 at 2:08 PM Subject: Re: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas) To: Paolo Giarrusso I'm only speaking of Jython 2.5, since that's what we're working on, but I believe it was the same for 2.2. (I personally regard 2.5 as more robust, we certainly test it more extensively, although external interfaces may still have some change before our forthcoming release.) We have limited support for tracing in terms of its events interface. 
It seems usable enough, although we don't produce quite the same traces.

Fidelity to things we can't readily and efficiently support is not a goal for Jython. This is especially so when they don't bear on running interesting applications. So we don't support ref counting, a GIL, and certain other internal details. We have found that when code does rely on these details, we have been able to push changes into the appropriate projects. Often the same changes are needed for alternative implementations like PyPy; we saw this with Django support. However, we do support frame introspection, the standard Python obj model (including classic classes), and even the rather mixed up unicode/str model.

Having said that, we do plan to support a Python bytecode (PBC) VM running in Jython in a future release (possibly 2.5.1). At that point, we may support f_lasti at the level of the PBC instruction, just like CPython. Reasons for supporting PBC include scenarios where we can't dynamically generate and then load Java bytecode (unsigned applets, for example, or Android), greenlets (which really need f_lasti, although it's possible to get a subset of greenlet functionality w/o it), and various components like Jinja2, a templating engine, that directly emit PBC. It's also rather cool to do so I think.

- Jim

On Sat, Dec 27, 2008 at 12:58 PM, Paolo Giarrusso wrote:
> On 26/12/2008, Jim Baker wrote:
> > Interesting discussion. Just a note:
> >
> > in Jython, f_lasti is only used to manage exit/entry points, specifically
> > for coroutines/generators, so this is not at the level of bytecode
> > granularity.
>
> But is it used to support sys.settrace()? Also, its past usage
> (before Python 2.3) for generators is mentioned in CPython source code
> comments, and the last stable Jython release is 2.2.1. What happens in
> 2.5-beta could be more interesting.
>
> Here's the comment from /Python/ceval.c:
>
> f->f_lasti now refers to the index of the last instruction
> executed.
> You might think this was obvious from the name, but
> this wasn't always true before 2.3! PyFrame_New now sets
> f->f_lasti to -1 (i.e. the index *before* the first instruction)
> and YIELD_VALUE doesn't fiddle with f_lasti any more. So this
> does work. Promise.
>
> > We also set f_lineno, at the level of Python code of course.
>
> Hmm, CPython is already able to do this only when needed (i.e. when
> calling trace functions), "as of Python 2.3":
>
> /* As of 2.3 f_lineno is only valid when tracing is active (i.e. when
> f_trace is set) -- at other times use PyCode_Addr2Line instead. */
> int f_lineno; /* Current line number */
>
> > HotSpot apparently optimizes this access nicely in any event. (There are
> > other problems with supporting call frames, but this is not one of them
> > it seems.)
>
> Do you mean you inspected the code generated by the Hotspot native
> compiler?
> Hmm, it can't save a write to memory if they are fields of the frame
> object, unless it can prove that a pointer to the object does not
> escape the local function through escape analysis (which has been
> added in Java 1.6 but is said _somewhere_ to notice too few cases).
>
> > Java also offers a debugging interface, which in conjunction with a C++
> > agent, allows for more fine-grained access to these internals,
> > potentially with lower overhead. This is something Tobias Ivarsson has
> > been exploring.
>
> That sounds interesting, even if strange (and not applicable to
> CPython nor PyPy) - do you want to offer an alternate debug interface
> or to implement settrace through this?
>
> > - Jim
>
> --
> Paolo Giarrusso

--
Jim Baker
jbaker at zyasoft.com

From hakan at debian.org Mon Dec 29 16:15:09 2008
From: hakan at debian.org (Hakan Ardo)
Date: Mon, 29 Dec 2008 16:15:09 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To: <20081227184512.GA20899@code0.codespeak.net>
References: <20081227184512.GA20899@code0.codespeak.net>
Message-ID:

On Sat, Dec 27, 2008 at 7:45 PM, Armin Rigo wrote:
>
> You can try, but it's messy. It's not a problem for the annotator but
> for the later RTyper. None is implemented as a NULL pointer by the

Attached is a small patch for the annotator that makes it treat None and NotImplemented alike. This is all that is needed for most cases as all NotImplemented are typically removed by the optimisations performed by the annotator.

At http://hakan.ardoe.net/pypy/ I have placed special_methods.py that adds support for the methods listed below together with 42 tests including the relevant parts of test_augassign.py and test_binop.py from the cpython source (somewhat modified to work).
The methods currently supported are:

    __str__, __repr__, __len__, __getitem__, __setitem__, __add__,
    __mul__, __sub__, __div__, __floordiv__, __mod__, __xor__,
    __rshift__, __lshift__, __radd__, __rmul__, __rsub__, __rdiv__,
    __rfloordiv__, __rmod__, __rxor__, __rrshift__, __rlshift__,
    __iadd__, __imul__, __isub__, __idiv__, __ifloordiv__, __imod__,
    __ixor__, __irshift__, __ilshift__

With this implementation, the operation str(o) calls o.__str__(), but the operation "%s"%o does not. I don't know why.

--
Håkan Ardö
-------------- next part --------------
A non-text attachment was scrubbed...
Name: annotator.patch
Type: text/x-patch
Size: 1370 bytes
Desc: not available
URL:

From p.giarrusso at gmail.com Mon Dec 29 20:55:52 2008
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Mon, 29 Dec 2008 20:55:52 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References: <20081227184512.GA20899@code0.codespeak.net>
Message-ID:

On Mon, Dec 29, 2008 at 16:15, Hakan Ardo wrote:
> On Sat, Dec 27, 2008 at 7:45 PM, Armin Rigo wrote:
>>
>> You can try, but it's messy. It's not a problem for the annotator but
>> for the later RTyper. None is implemented as a NULL pointer by the
>
> Attached is a small patch for the annotator that makes it treat None
> and NotImplemented alike. This is all that is needed for most cases as
> all NotImplemented are typically removed by the optimisations
> performed by the annotator.

That can be made to work, but if such a method returns None you get completely different semantics (trying again with something else) from CPython (which will maybe return a failure, or return None for the result of such an operation), so you have to restrict the allowed semantics in RPython to the "most cases" you are referring to.

Basically, I'd propose that in RPython with your patch, those methods (__add__ etc.)
cannot return None (I can't think of a possible use case for an
addition returning None), while I guess NotImplemented can only be
returned by them, and not by any other function or method (do the
current sources ever use NotImplemented?). There must be no function
which can choose to return None or NotImplemented.

Regards
--
Paolo Giarrusso

From hakan at debian.org Mon Dec 29 23:50:37 2008
From: hakan at debian.org (Hakan Ardo)
Date: Mon, 29 Dec 2008 23:50:37 +0100
Subject: [pypy-dev] Support for __getitem__ in rpython?
In-Reply-To:
References: <20081227184512.GA20899@code0.codespeak.net>
Message-ID:

On Mon, Dec 29, 2008 at 8:55 PM, Paolo Giarrusso wrote:
>>
>> Attached is a small patch for the annotator that makes it treat None
>> and NotImplemented alike. This is all that is needed for most cases as
>> all NotImplemented are typically removed by the optimisations
>> performed by the annotator.
> That can be made to work, but if such a method returns None you get
> completely different semantics (trying again with something else) from
> CPython (which will maybe return a failure, or return None for the
> result of such an operation), so you have to restrict the allowed

No, the patch does distinguish between None and NotImplemented. What I
mean is that NotImplemented is treated in a manner similar to how None
is treated. The following crazy construction does compile and generates
the same result as in CPython ('OK', 'String', 'None', 'None'):

    class mystr:
        def __init__(self,s):
            self.s=s
        def __str__(self):
            return self.s
        def __add__(self,other):
            # decline for mystr operands so that __radd__ gets tried
            if isinstance(other,mystr):
                return NotImplemented
            s=self.s+other
            # None is an ordinary result here, not a "try again" signal
            if s=='None':
                return None
            else:
                return s
        __add__._annspecialcase_ = 'specialize:argtype(1)'
        def __radd__(self,other):
            return str(other)+self.s
        __radd__._annspecialcase_ = 'specialize:argtype(1)'

    def dotst_nonestr():
        s1=mystr('No')+'ne'
        if s1 is None:
            s1='OK'
        s2=mystr('Str')+'ing'
        s3=mystr('No')+mystr('ne')
        s4='No'+mystr('ne')
        return (s1,s2,s3,s4)

--
Håkan Ardö
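
For readers trying this on a current CPython, the dispatch rule that the
example above relies on can be sketched as below. This is only an
illustration in plain Python 3, not RPython and not part of the patch;
the names Left, Right and demo are hypothetical. Two different classes
are used on purpose: as far as I can tell, Python 3 does not try the
reflected method when both operands have the same type, unlike the
old-style Python 2 class in the message above, so mystr('No')+mystr('ne')
would raise TypeError there.

```python
class Left:
    """Hypothetical left operand; its __add__ may decline or return None."""

    def __init__(self, s):
        self.s = s

    def __add__(self, other):
        if isinstance(other, Right):
            # Declining: the interpreter then tries other.__radd__(self).
            return NotImplemented
        s = self.s + other
        # None is an ordinary return value and triggers no fallback.
        return None if s == "None" else s


class Right:
    """Hypothetical right operand that only implements the reflected add."""

    def __init__(self, s):
        self.s = s

    def __radd__(self, other):
        left = other.s if isinstance(other, Left) else str(other)
        return left + self.s


def demo():
    r1 = Left("No") + "ne"          # Left.__add__ returns None
    r2 = Left("Str") + "ing"        # Left.__add__ returns a string
    r3 = Left("No") + Right("ne")   # Left declines, Right.__radd__ runs
    r4 = "No" + Right("ne")         # str cannot add a Right, __radd__ runs
    return (r1, r2, r3, r4)


print(demo())  # (None, 'String', 'None', 'None')
```

The point of the sketch is that None and NotImplemented play different
roles: only NotImplemented makes the interpreter look for another
method, which is the distinction the patch is said to preserve.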
From hakan at debian.org Tue Dec 30 12:15:18 2008
From: hakan at debian.org (Hakan Ardo)
Date: Tue, 30 Dec 2008 12:15:18 +0100
Subject: [pypy-dev] String or None segmentation fault
Message-ID:

Hi,
the following code compiles into something that segfaults. Attached is
a patch that tries to fix that.

    def nstr(i):
        if i==0:
            return None
        return str(i)

    def fn(i):
        return (str(nstr(0)), str(nstr(i)))

--
Håkan Ardö
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nonesegv.patch
Type: text/x-patch
Size: 1611 bytes
Desc: not available
URL:

From anto.cuni at gmail.com Tue Dec 30 23:37:52 2008
From: anto.cuni at gmail.com (Antonio Cuni)
Date: Tue, 30 Dec 2008 23:37:52 +0100
Subject: [pypy-dev] [Fwd: [Fwd: Re: Threaded interpretation]]
Message-ID: <495AA2C0.9050105@gmail.com>

Hi,
Antoine Pitrou told me that his mail got rejected by the mailing list,
so I'm forwarding it.

-------- Forwarded message --------
From: Antoine Pitrou
To: pypy-dev at codespeak.net
Subject: Re: Threaded interpretation
Date: Fri, 26 Dec 2008 21:16:36 +0000 (UTC)

Hi people,

By reading this thread I had the idea to write a threaded interpretation
patch for py3k. Speedup on pybench and pystone is 15-20%.

http://bugs.python.org/issue4753

Regards

Antoine.