From cfbolz at gmx.de Thu Jul 1 10:06:46 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Thu, 01 Jul 2010 10:06:46 +0200
Subject: [pypy-dev] CGO 2011 Conference
Message-ID: <4C2C4C96.9030808@gmx.de>

Hi all,

I think this conference could be interesting to us:

http://www.cgo.org/cgo2011/call_papers.html

From the call for papers:

* Techniques for efficient execution of dynamically typed languages

Deadline is the 15th of September.

Cheers,

Carl Friedrich

From fijall at gmail.com Thu Jul 1 12:02:31 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 1 Jul 2010 04:02:31 -0600
Subject: [pypy-dev] PyPy 1.3 released
In-Reply-To:
References:
Message-ID:

On Wed, Jun 30, 2010 at 3:24 PM, Phyo Arkar wrote:
> So far, python-mysql is still not working..
>
> Has anyone successfully got it to work?

Hey.

I'm not aware of anyone who had any success. You can come to #pypy on
irc.freenode.net and we can see how to solve the problem.

>
> On Fri, Jun 25, 2010 at 11:27 PM, Maciej Fijalkowski
> wrote:
>>
>> =======================
>> PyPy 1.3: Stabilization
>> =======================
>>
>> Hello.
>>
>> We're pleased to announce the release of PyPy 1.3. This release has two
>> major improvements. First of all, we stabilized the JIT compiler since
>> the 1.2 release, answered user issues, fixed bugs, and generally
>> improved speed.
>>
>> We're also pleased to announce alpha support for loading CPython extension
>> modules written in C. While the main purpose of this release is increased
>> stability, this feature is in alpha stage and it is not yet suited for
>> production environments.
>>
>> Highlights of this release
>> ==========================
>>
>> * We introduced support for CPython extension modules written in C. As of
>>   now, this support is in alpha, and it's very unlikely that unaltered C
>>   extensions will work out of the box, due to missing functions or
>>   refcounting details. The support is disabled by default, so you have to do::
>>
>>    import cpyext
>>
>>   before trying to import any .so file. Also, libraries are
>>   source-compatible and not binary-compatible. That means you need to
>>   recompile binaries, using for example::
>>
>>    python setup.py build
>>
>>   Details may vary, depending on your build system. Make sure you include
>>   the above line at the beginning of setup.py or put it in your
>>   PYTHONSTARTUP.
>>
>>   This is an alpha feature. It'll likely segfault. You have been warned!
>>
>> * JIT bugfixes. A lot of bugs reported for the JIT have been fixed, and
>>   its stability has greatly improved since the 1.2 release.
>>
>> * Various small improvements have been added to the JIT code, as well as
>>   a great speedup of compilation time.
>>
>> Cheers,
>> Maciej Fijalkowski, Armin Rigo, Alex Gaynor, Amaury Forgeot d'Arc and
>> the PyPy team
>> _______________________________________________
>> pypy-dev at codespeak.net
>> http://codespeak.net/mailman/listinfo/pypy-dev
>
>

From hakan at debian.org Thu Jul 1 16:02:30 2010
From: hakan at debian.org (Hakan Ardo)
Date: Thu, 1 Jul 2010 16:02:30 +0200
Subject: [pypy-dev] array performace?
Message-ID:

Hi,
are there any python construct that the jit will be able to compile
into c-type array accesses?
Consider the following test: l=0.0 for i in xrange(640,640*480): l+=img[i] intimg[i]=intimg[i-640]+l With the 1.3 release of the jit it executes about 20 times slower than a similar construction in C if I create the arrays using: import _rawffi RAWARRAY = _rawffi.Array('d') img=RAWARRAY(640*480, autofree=True) intimg=RAWARRAY(640*480, autofree=True) Using a list is about 40 times slower and using array.array is about 400 times slower. Any suggestion on how to improve the performance of these kind of constructions? Thanx. -- H?kan Ard? From fijall at gmail.com Thu Jul 1 16:46:09 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 1 Jul 2010 08:46:09 -0600 Subject: [pypy-dev] array performace? In-Reply-To: References: Message-ID: Hey. There is a variety of reasons why those behave like this (array module is in PyPy written in Python for example, using _rawffi). There is a branch that plans to fix that for all lists, but that's not finished yet. On Thu, Jul 1, 2010 at 8:02 AM, Hakan Ardo wrote: > Hi, > are there any python construct that the jit will be able to compile > into c-type array accesses? Consider the following test: > > ? ?l=0.0 > ? ?for i in xrange(640,640*480): > ? ? ? ?l+=img[i] > ? ? ? ?intimg[i]=intimg[i-640]+l > > With the 1.3 release of the jit it executes about 20 times slower than > a similar construction in C if I create the arrays using: > > ? ?import _rawffi > ? ?RAWARRAY = _rawffi.Array('d') > ? ?img=RAWARRAY(640*480, autofree=True) > ? ?intimg=RAWARRAY(640*480, autofree=True) > > Using a list is about 40 times slower and using array.array is about > 400 times slower. Any suggestion on how to improve the performance of > these kind of constructions? > > ?Thanx. > > -- > H?kan Ard? > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > > > > From arigo at tunes.org Thu Jul 1 17:28:27 2010 From: arigo at tunes.org (Armin Rigo) Date: Thu, 1 Jul 2010 17:28:27 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: Message-ID: <20100701152827.GA30661@code0.codespeak.net> Hi, On Thu, Jul 01, 2010 at 04:02:30PM +0200, Hakan Ardo wrote: > are there any python construct that the jit will be able to compile > into c-type array accesses? Consider the following test: > > l=0.0 > for i in xrange(640,640*480): > l+=img[i] > intimg[i]=intimg[i-640]+l This is still implemented as a list of Python objects (as expected, because the JIT cannot prove that we won't suddenly try to put something else than a float in the same list). Using _rawffi.Array('d') directly is the best option right now. I'm not sure why the array.array module is 400 times slower, but it's definitely slower given that it's implemented at app-level using a _rawffi.Array('c') and doing the conversion by itself (for some partially stupid reasons like doing the right kind of error checking). A bientot, Armin. From fijall at gmail.com Thu Jul 1 17:35:17 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 1 Jul 2010 09:35:17 -0600 Subject: [pypy-dev] array performace? In-Reply-To: <20100701152827.GA30661@code0.codespeak.net> References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Thu, Jul 1, 2010 at 9:28 AM, Armin Rigo wrote: > Hi, > > On Thu, Jul 01, 2010 at 04:02:30PM +0200, Hakan Ardo wrote: >> are there any python construct that the jit will be able to compile >> into c-type array accesses? Consider the following test: >> >> ? ? l=0.0 >> ? ? for i in xrange(640,640*480): >> ? ? ? ? l+=img[i] >> ? 
? ? ? intimg[i]=intimg[i-640]+l > > This is still implemented as a list of Python objects (as expected, > because the JIT cannot prove that we won't suddenly try to put something > else than a float in the same list). > > Using _rawffi.Array('d') directly is the best option right now. ?I'm not > sure why the array.array module is 400 times slower, but it's definitely > slower given that it's implemented at app-level using a _rawffi.Array('c') > and doing the conversion by itself (for some partially stupid reasons like > doing the right kind of error checking). > > > A bientot, > > Armin. The main reason why _rawffi.Array is slow is that JIT does not look into that module, so there is wrapping and unwrapping going on. Relatively easy to fix I suppose, but _rawffi.Array was not meant to be used like that (array.array looks like a better candidate). From alex.gaynor at gmail.com Thu Jul 1 17:40:38 2010 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Thu, 1 Jul 2010 10:40:38 -0500 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Thu, Jul 1, 2010 at 10:35 AM, Maciej Fijalkowski wrote: > On Thu, Jul 1, 2010 at 9:28 AM, Armin Rigo wrote: >> Hi, >> >> On Thu, Jul 01, 2010 at 04:02:30PM +0200, Hakan Ardo wrote: >>> are there any python construct that the jit will be able to compile >>> into c-type array accesses? Consider the following test: >>> >>> ? ? l=0.0 >>> ? ? for i in xrange(640,640*480): >>> ? ? ? ? l+=img[i] >>> ? ? ? ? intimg[i]=intimg[i-640]+l >> >> This is still implemented as a list of Python objects (as expected, >> because the JIT cannot prove that we won't suddenly try to put something >> else than a float in the same list). >> >> Using _rawffi.Array('d') directly is the best option right now. ?I'm not >> sure why the array.array module is 400 times slower, but it's definitely >> slower given that it's implemented at app-level using a _rawffi.Array('c') >> and doing the conversion by itself (for some partially stupid reasons like >> doing the right kind of error checking). >> >> >> A bientot, >> >> Armin. > > The main reason why _rawffi.Array is slow is that JIT does not look > into that module, so there is wrapping and unwrapping going on. > Relatively easy to fix I suppose, but _rawffi.Array was not meant to > be used like that (array.array looks like a better candidate). > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev If array.array performance is important to your work, the array.py module looks like a good target for writing at interp level, and it's not too much code. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Voltaire "The people's good is the highest law." -- Cicero "Code can always be simpler than you think, but never as simple as you want" -- Me From glavoie at gmail.com Thu Jul 1 20:57:43 2010 From: glavoie at gmail.com (Gabriel Lavoie) Date: Thu, 1 Jul 2010 14:57:43 -0400 Subject: [pypy-dev] Improving Stackless/Coroutines implementation In-Reply-To: References: Message-ID: Hello everyone, the change is implemented in r75735. I also added a coroutine.throw() method to raise any exception inside any coroutine. I don't know if some people need this but I personnally do. For now, it's implemented approximately like greenlet.throw(). The documentation for stackless.html page was updated in pypy/doc/stackless.txt. 
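At app-level the intended usage now looks roughly like this (an untested
sketch: I'm assuming the coroutine API described in the stackless docs plus
the CoroutineExit name that r75735 exports to __builtins__):

    from _stackless import coroutine

    def worker():
        try:
            main.switch()          # hand control back to the main coroutine
        except CoroutineExit:      # now catchable at app-level
            print "worker killed, cleanup can run here"

    main = coroutine.getcurrent()
    co = coroutine()
    co.bind(worker)
    co.switch()                    # run worker until it switches back to main
    co.kill()                      # raises CoroutineExit inside worker

The same pattern is what TaskletExit should eventually follow once
tasklet.kill() is made to reuse this machinery.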
If someone could review the changes and possibly update the documentation on the website it would be appreciated. ;) Gabriel 2010/6/29 Gabriel Lavoie > Hello everyone, > as a few knows here, I'm working heavily with PyPy's "stackless" > module for my Master degree project to make it more distributed. Since I > started to work full time on this project I've encountered a few bugs > (mostly related to pickling of tasklets) and missing implementation details > in the module. The latest problem I've encountered is to be able to detect > when tasklet.kill() is called, within the tasklet being killed. With > Stackless CPython, TaskletExit is raised and can be caught but this part > wasn't really implemented in PyPy's stackless module. Since the module is > implemented on top of coroutines and since coroutine.kill() is called within > tasklet.kill(), the exception thrown by the coroutine implementation needs > to be caught. Here's the problem: > > http://codespeak.net/pypy/dist/pypy/doc/stackless.html#coroutines > > - > > coro.kill() > > Kill coro by sending an exception to it. (At the moment, the exception > is not visible to app-level, which means that you cannot catch it, and that > try: finally: clauses are not honored. This will be fixed in the > future.) > > > The exception is not thrown at app level and a coroutine dies silently. > Took a look at the code and I've been able to expose a CoroutineExit > exception to app level on which I intend implementing TaskletExit correctly. > I'm also able to catch the exception as expected but the code is not yet > complete. > > Right now, I have a question on how to expose correctly the CoroutineExit > and TaskletExit exceptions to app level. Here's what I did: > > W_CoroutineExit = _new_exception('CoroutineExit', W_Exception, 'Exit > requested...') > > class AppCoroutine(Coroutine): # XXX, StacklessFlags): > > def __init__(self, space, state=None): > # Some other code here > > # Exporting new exception to __builtins__ and "exceptions" modules > self.w_CoroutineExit = space.gettypefor(W_CoroutineExit) > space.setitem( > space.exceptions_module.w_dict, > space.new_interned_str('CoroutineExit'), > self.w_CoroutineExit) > space.setitem(space.builtin.w_dict, > space.new_interned_str('CoroutineExit'), > self.w_CoroutineExit) > > I talked about this on #pypy (IRC) but people weren't sure about exporting > new names to __builtins__. On my side I wanted to make it look as most as > possible as how Stackless CPython did it with TaskletExit, which is directly > available in __builtins__. This would make code compatible with both > Stackless Python and PyPy's stackless module. Also, exporting names this way > would only make them appear in __builtins__ when the "_stackless" module is > enabled (pypy-c built with --stackless). > > What are your opinions about it? (Maciej, I already know about yours! ;) > > Thank you very much, > > Gabriel (WildChild) > > -- > Gabriel Lavoie > glavoie at gmail.com > -- Gabriel Lavoie glavoie at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From hakan at debian.org Fri Jul 2 07:24:07 2010 From: hakan at debian.org (Hakan Ardo) Date: Fri, 2 Jul 2010 07:24:07 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: OK, so making an interpreter level implementation of array.array seams like a good idea. 
Would it be possible to get the jit to remove the wrapping/unwrapping in that case to get better performance than _rawffi.Array('d'), which is already an interpreter level implementation? Are there some docs to get me started at writing interpreter level objects? I've had a look at _rawffi/array.py and am a bit confused about the W_Array.typedef = TypeDef('Array',...) construction. Maybe there is a easier example to start with? On Thu, Jul 1, 2010 at 5:40 PM, Alex Gaynor wrote: > On Thu, Jul 1, 2010 at 10:35 AM, Maciej Fijalkowski wrote: >> On Thu, Jul 1, 2010 at 9:28 AM, Armin Rigo wrote: >>> Hi, >>> >>> On Thu, Jul 01, 2010 at 04:02:30PM +0200, Hakan Ardo wrote: >>>> are there any python construct that the jit will be able to compile >>>> into c-type array accesses? Consider the following test: >>>> >>>> ? ? l=0.0 >>>> ? ? for i in xrange(640,640*480): >>>> ? ? ? ? l+=img[i] >>>> ? ? ? ? intimg[i]=intimg[i-640]+l >>> >>> This is still implemented as a list of Python objects (as expected, >>> because the JIT cannot prove that we won't suddenly try to put something >>> else than a float in the same list). >>> >>> Using _rawffi.Array('d') directly is the best option right now. ?I'm not >>> sure why the array.array module is 400 times slower, but it's definitely >>> slower given that it's implemented at app-level using a _rawffi.Array('c') >>> and doing the conversion by itself (for some partially stupid reasons like >>> doing the right kind of error checking). >>> >>> >>> A bientot, >>> >>> Armin. >> >> The main reason why _rawffi.Array is slow is that JIT does not look >> into that module, so there is wrapping and unwrapping going on. >> Relatively easy to fix I suppose, but _rawffi.Array was not meant to >> be used like that (array.array looks like a better candidate). >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev > > If array.array performance is important to your work, the array.py > module looks like a good target for writing at interp level, and it's > not too much code. > > Alex > > -- > "I disapprove of what you say, but I will defend to the death your > right to say it." -- Voltaire > "The people's good is the highest law." -- Cicero > "Code can always be simpler than you think, but never as simple as you > want" -- Me > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev -- H?kan Ard? From alex.gaynor at gmail.com Fri Jul 2 07:40:21 2010 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Fri, 2 Jul 2010 00:40:21 -0500 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Fri, Jul 2, 2010 at 12:24 AM, Hakan Ardo wrote: > OK, so making an interpreter level implementation of array.array seams > like a good idea. Would it be possible to get the jit to remove the > wrapping/unwrapping in that case to get better performance than > _rawffi.Array('d'), which is already an interpreter level > implementation? > > Are there some docs to get me started at writing interpreter level > objects? I've had a look at _rawffi/array.py and am a bit confused > about the W_Array.typedef = TypeDef('Array',...) ?construction. Maybe > there is a easier example to start with? 
> > On Thu, Jul 1, 2010 at 5:40 PM, Alex Gaynor wrote: >> On Thu, Jul 1, 2010 at 10:35 AM, Maciej Fijalkowski wrote: >>> On Thu, Jul 1, 2010 at 9:28 AM, Armin Rigo wrote: >>>> Hi, >>>> >>>> On Thu, Jul 01, 2010 at 04:02:30PM +0200, Hakan Ardo wrote: >>>>> are there any python construct that the jit will be able to compile >>>>> into c-type array accesses? Consider the following test: >>>>> >>>>> ? ? l=0.0 >>>>> ? ? for i in xrange(640,640*480): >>>>> ? ? ? ? l+=img[i] >>>>> ? ? ? ? intimg[i]=intimg[i-640]+l >>>> >>>> This is still implemented as a list of Python objects (as expected, >>>> because the JIT cannot prove that we won't suddenly try to put something >>>> else than a float in the same list). >>>> >>>> Using _rawffi.Array('d') directly is the best option right now. ?I'm not >>>> sure why the array.array module is 400 times slower, but it's definitely >>>> slower given that it's implemented at app-level using a _rawffi.Array('c') >>>> and doing the conversion by itself (for some partially stupid reasons like >>>> doing the right kind of error checking). >>>> >>>> >>>> A bientot, >>>> >>>> Armin. >>> >>> The main reason why _rawffi.Array is slow is that JIT does not look >>> into that module, so there is wrapping and unwrapping going on. >>> Relatively easy to fix I suppose, but _rawffi.Array was not meant to >>> be used like that (array.array looks like a better candidate). >>> _______________________________________________ >>> pypy-dev at codespeak.net >>> http://codespeak.net/mailman/listinfo/pypy-dev >> >> If array.array performance is important to your work, the array.py >> module looks like a good target for writing at interp level, and it's >> not too much code. >> >> Alex >> >> -- >> "I disapprove of what you say, but I will defend to the death your >> right to say it." -- Voltaire >> "The people's good is the highest law." -- Cicero >> "Code can always be simpler than you think, but never as simple as you >> want" -- Me >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev > > > > -- > H?kan Ard? > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > I'd take a look at the cStringIO module, it's a decent example of the APIs (and not too much code). FWIW one thing to note is that array uses the struct module, which is also pure python. I believe it's possible to still use that with an interp-level module, but it may just become another bottle neck, just something to consider. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Voltaire "The people's good is the highest law." -- Cicero "Code can always be simpler than you think, but never as simple as you want" -- Me From fijall at gmail.com Fri Jul 2 08:04:26 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 00:04:26 -0600 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Thu, Jul 1, 2010 at 1:18 PM, Hakan Ardo wrote: > OK, so making an interpreter level implementation of array.array seams > like a good idea. Would it be possible to get the jit to remove the > wrapping/unwrapping in that case to get better performance than > _rawffi.Array('d'), which is already an interpreter level > implementation? it should work mostly out of the box (you can also try this for _rawffi.array part of module, if you want to). 
It's probably enough to enable module in pypy/module/pypyjit/policy.py so JIT can have a look there. In case of _rawffi, probably a couple of hints for the jit to not look inside some functions (which do external calls for example) should also be needed, since for example JIT as of now does not support raw mallocs (using C malloc and not our GC). Still, making an array module interp-level is probably the sanest approach. > Are there some docs to get me started at writing interpreter level > objects? I've had a look at _rawffi/array.py and am a bit confused > about the W_Array.typedef = TypeDef('Array',...) ?construction. Maybe > there is a easier example to start with? TypeDef is a way to expose interpreter level (RPython) object to app-level (Python). It tells what methods there are what properties and what attributes. > > On Thu, Jul 1, 2010 at 5:40 PM, Alex Gaynor wrote: >> On Thu, Jul 1, 2010 at 10:35 AM, Maciej Fijalkowski wrote: >>> On Thu, Jul 1, 2010 at 9:28 AM, Armin Rigo wrote: >>>> Hi, >>>> >>>> On Thu, Jul 01, 2010 at 04:02:30PM +0200, Hakan Ardo wrote: >>>>> are there any python construct that the jit will be able to compile >>>>> into c-type array accesses? Consider the following test: >>>>> >>>>> ? ? l=0.0 >>>>> ? ? for i in xrange(640,640*480): >>>>> ? ? ? ? l+=img[i] >>>>> ? ? ? ? intimg[i]=intimg[i-640]+l >>>> >>>> This is still implemented as a list of Python objects (as expected, >>>> because the JIT cannot prove that we won't suddenly try to put something >>>> else than a float in the same list). >>>> >>>> Using _rawffi.Array('d') directly is the best option right now. ?I'm not >>>> sure why the array.array module is 400 times slower, but it's definitely >>>> slower given that it's implemented at app-level using a _rawffi.Array('c') >>>> and doing the conversion by itself (for some partially stupid reasons like >>>> doing the right kind of error checking). >>>> >>>> >>>> A bientot, >>>> >>>> Armin. >>> >>> The main reason why _rawffi.Array is slow is that JIT does not look >>> into that module, so there is wrapping and unwrapping going on. >>> Relatively easy to fix I suppose, but _rawffi.Array was not meant to >>> be used like that (array.array looks like a better candidate). >>> _______________________________________________ >>> pypy-dev at codespeak.net >>> http://codespeak.net/mailman/listinfo/pypy-dev >> >> If array.array performance is important to your work, the array.py >> module looks like a good target for writing at interp level, and it's >> not too much code. >> >> Alex >> >> -- >> "I disapprove of what you say, but I will defend to the death your >> right to say it." -- Voltaire >> "The people's good is the highest law." -- Cicero >> "Code can always be simpler than you think, but never as simple as you >> want" -- Me >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev > > > > -- > H?kan Ard? > From fijall at gmail.com Fri Jul 2 08:45:15 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 00:45:15 -0600 Subject: [pypy-dev] [pypy-svn] r75683 - in pypy/trunk: include lib-python/modified-2.5.2/distutils lib-python/modified-2.5.2/distutils/command pypy/_interfaces pypy/module/cpyext pypy/module/cpyext/test In-Reply-To: <20100630145114.AB57C282BE3@codespeak.net> References: <20100630145114.AB57C282BE3@codespeak.net> Message-ID: Hey. Any reason why we should copy .h files during translation and can't just have them there? 
Cheers, fijal On Wed, Jun 30, 2010 at 8:51 AM, wrote: > Author: antocuni > Date: Wed Jun 30 16:51:13 2010 > New Revision: 75683 > > Added: > ? pypy/trunk/include/ ? (props changed) > ? pypy/trunk/include/README > Removed: > ? pypy/trunk/pypy/_interfaces/ > Modified: > ? pypy/trunk/lib-python/modified-2.5.2/distutils/command/build_ext.py > ? pypy/trunk/lib-python/modified-2.5.2/distutils/sysconfig_pypy.py > ? pypy/trunk/pypy/module/cpyext/api.py > ? pypy/trunk/pypy/module/cpyext/test/test_api.py > Log: > create a directory trunk/include to contains all the headers file. They are > automatically copied there from cpyext/include during translation. The > generated pypy_decl.h and pypy_macros.h are also put there, instead of the > now-gone pypy/_interfaces. > > The goal is to have the svn checkout as similar as possible as release > tarballs and virtualenvs, which have an include/ dir at the top > > > > Added: pypy/trunk/include/README > ============================================================================== > --- (empty file) > +++ pypy/trunk/include/README ? Wed Jun 30 16:51:13 2010 > @@ -0,0 +1,7 @@ > +This directory contains all the include files needed to build cpython > +extensions with PyPy. ?Note that these are just copies of the original headers > +that are in pypy/module/cpyext/include: they are automatically copied from > +there during translation. > + > +Moreover, pypy_decl.h and pypy_macros.h are automatically generated, also > +during translation. > > Modified: pypy/trunk/lib-python/modified-2.5.2/distutils/command/build_ext.py > ============================================================================== > --- pypy/trunk/lib-python/modified-2.5.2/distutils/command/build_ext.py (original) > +++ pypy/trunk/lib-python/modified-2.5.2/distutils/command/build_ext.py Wed Jun 30 16:51:13 2010 > @@ -167,7 +167,7 @@ > ? ? ? ? # for Release and Debug builds. > ? ? ? ? # also Python's library directory must be appended to library_dirs > ? ? ? ? if os.name == 'nt': > - ? ? ? ? ? ?self.library_dirs.append(os.path.join(sys.prefix, 'pypy', '_interfaces')) > + ? ? ? ? ? ?self.library_dirs.append(os.path.join(sys.prefix, 'include')) > ? ? ? ? ? ? if self.debug: > ? ? ? ? ? ? ? ? self.build_temp = os.path.join(self.build_temp, "Debug") > ? ? ? ? ? ? else: > > Modified: pypy/trunk/lib-python/modified-2.5.2/distutils/sysconfig_pypy.py > ============================================================================== > --- pypy/trunk/lib-python/modified-2.5.2/distutils/sysconfig_pypy.py ? ?(original) > +++ pypy/trunk/lib-python/modified-2.5.2/distutils/sysconfig_pypy.py ? ?Wed Jun 30 16:51:13 2010 > @@ -13,12 +13,7 @@ > > ?def get_python_inc(plat_specific=0, prefix=None): > ? ? from os.path import join as j > - ? ?cand = j(sys.prefix, 'include') > - ? ?if os.path.exists(cand): > - ? ? ? ?return cand > - ? ?if plat_specific: > - ? ? ? ?return j(sys.prefix, "pypy", "_interfaces") > - ? ?return j(sys.prefix, 'pypy', 'module', 'cpyext', 'include') > + ? ?return j(sys.prefix, 'include') > > ?def get_python_version(): > ? ? """Return a string containing the major and minor Python version, > > Modified: pypy/trunk/pypy/module/cpyext/api.py > ============================================================================== > --- pypy/trunk/pypy/module/cpyext/api.py ? ? ? ?(original) > +++ pypy/trunk/pypy/module/cpyext/api.py ? ? ? 
?Wed Jun 30 16:51:13 2010 > @@ -45,11 +45,9 @@ > ?pypydir = py.path.local(autopath.pypydir) > ?include_dir = pypydir / 'module' / 'cpyext' / 'include' > ?source_dir = pypydir / 'module' / 'cpyext' / 'src' > -interfaces_dir = pypydir / "_interfaces" > ?include_dirs = [ > ? ? include_dir, > ? ? udir, > - ? ?interfaces_dir, > ? ? ] > > ?class CConfig: > @@ -100,9 +98,16 @@ > ?udir.join('pypy_macros.h').write("/* Will be filled later */") > ?globals().update(rffi_platform.configure(CConfig_constants)) > > -def copy_header_files(): > +def copy_header_files(dstdir): > + ? ?assert dstdir.check(dir=True) > + ? ?headers = include_dir.listdir('*.h') + include_dir.listdir('*.inl') > ? ? for name in ("pypy_decl.h", "pypy_macros.h"): > - ? ? ? ?udir.join(name).copy(interfaces_dir / name) > + ? ? ? ?headers.append(udir.join(name)) > + ? ?for header in headers: > + ? ? ? ?header.copy(dstdir) > + ? ? ? ?target = dstdir.join(header.basename) > + ? ? ? ?target.chmod(0444) # make the file read-only, to make sure that nobody > + ? ? ? ? ? ? ? ? ? ? ? ? ? # edits it by mistake > > ?_NOT_SPECIFIED = object() > ?CANNOT_FAIL = object() > @@ -881,7 +886,8 @@ > ? ? ? ? deco(func.get_wrapper(space)) > > ? ? setup_init_functions(eci) > - ? ?copy_header_files() > + ? ?trunk_include = pypydir.dirpath() / 'include' > + ? ?copy_header_files(trunk_include) > > ?initfunctype = lltype.Ptr(lltype.FuncType([], lltype.Void)) > ?@unwrap_spec(ObjSpace, str, str) > > Modified: pypy/trunk/pypy/module/cpyext/test/test_api.py > ============================================================================== > --- pypy/trunk/pypy/module/cpyext/test/test_api.py ? ? ?(original) > +++ pypy/trunk/pypy/module/cpyext/test/test_api.py ? ? ?Wed Jun 30 16:51:13 2010 > @@ -1,3 +1,4 @@ > +import py > ?from pypy.conftest import gettestobjspace > ?from pypy.rpython.lltypesystem import rffi, lltype > ?from pypy.interpreter.baseobjspace import W_Root > @@ -68,3 +69,13 @@ > ? ? ? ? api.PyPy_GetWrapped(space.w_None) > ? ? ? ? api.PyPy_GetReference(space.w_None) > > + > +def test_copy_header_files(tmpdir): > + ? ?api.copy_header_files(tmpdir) > + ? ?def check(name): > + ? ? ? ?f = tmpdir.join(name) > + ? ? ? ?assert f.check(file=True) > + ? ? ? ?py.test.raises(py.error.EACCES, "f.open('w')") # check that it's not writable > + ? ?check('Python.h') > + ? ?check('modsupport.inl') > + ? ?check('pypy_decl.h') > _______________________________________________ > pypy-svn mailing list > pypy-svn at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-svn > From fijall at gmail.com Fri Jul 2 09:28:15 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 01:28:15 -0600 Subject: [pypy-dev] [pypy-svn] r75683 - in pypy/trunk: include lib-python/modified-2.5.2/distutils lib-python/modified-2.5.2/distutils/command pypy/_interfaces pypy/module/cpyext pypy/module/cpyext/test In-Reply-To: <4C2D9369.7030004@gmail.com> References: <20100630145114.AB57C282BE3@codespeak.net> <4C2D9369.7030004@gmail.com> Message-ID: On Fri, Jul 2, 2010 at 1:21 AM, Antonio Cuni wrote: > On 02/07/10 08:45, Maciej Fijalkowski wrote: >> >> Hey. >> >> Any reason why we should copy .h files during translation and can't >> just have them there? >> > > I talked with Amaury and he told me that he prefers to keep all the > cpyext-related files together, which I think makes sense. ?Moreover, we need > to generate© pypy_decl.h and pypy_macros.h anyway, so we can copy the > others as well while we are at it. > > ciao, > Anto > Fine by me. Can you fix test_package then? 
It assumes there is Python.h in include (which might not be there). From anto.cuni at gmail.com Fri Jul 2 09:21:13 2010 From: anto.cuni at gmail.com (Antonio Cuni) Date: Fri, 02 Jul 2010 09:21:13 +0200 Subject: [pypy-dev] [pypy-svn] r75683 - in pypy/trunk: include lib-python/modified-2.5.2/distutils lib-python/modified-2.5.2/distutils/command pypy/_interfaces pypy/module/cpyext pypy/module/cpyext/test In-Reply-To: References: <20100630145114.AB57C282BE3@codespeak.net> Message-ID: <4C2D9369.7030004@gmail.com> On 02/07/10 08:45, Maciej Fijalkowski wrote: > Hey. > > Any reason why we should copy .h files during translation and can't > just have them there? > I talked with Amaury and he told me that he prefers to keep all the cpyext-related files together, which I think makes sense. Moreover, we need to generate© pypy_decl.h and pypy_macros.h anyway, so we can copy the others as well while we are at it. ciao, Anto From anto.cuni at gmail.com Fri Jul 2 09:30:30 2010 From: anto.cuni at gmail.com (Antonio Cuni) Date: Fri, 02 Jul 2010 09:30:30 +0200 Subject: [pypy-dev] [pypy-svn] r75683 - in pypy/trunk: include lib-python/modified-2.5.2/distutils lib-python/modified-2.5.2/distutils/command pypy/_interfaces pypy/module/cpyext pypy/module/cpyext/test In-Reply-To: References: <20100630145114.AB57C282BE3@codespeak.net> <4C2D9369.7030004@gmail.com> Message-ID: <4C2D9596.5050105@gmail.com> On 02/07/10 09:28, Maciej Fijalkowski wrote: > Fine by me. Can you fix test_package then? It assumes there is > Python.h in include (which might not be there). ah right... because when we run own-test translation didn't happen, so .h are not there. Ok, I'll fix it later. ciao, Anto From tobami at googlemail.com Fri Jul 2 09:27:10 2010 From: tobami at googlemail.com (Miquel Torres) Date: Fri, 2 Jul 2010 09:27:10 +0200 Subject: [pypy-dev] New speed.pypy.org version In-Reply-To: References: Message-ID: Hi Paolo, hey! I think it is a great idea. With logs you get both: correct normalized totals AND the ability to display the individual stacked series, which necessarily add arithmetically. But it strikes me, hasn't anyone written a paper about that method already? or at least documented it? Anyway I need to check that the math is right (hopefully today), and then I would go and implement it. I'll tell you how it goes. Cheers, Miquel 2010/6/30 Paolo Giarrusso : > Hi Miquel, > I'm quite busy (because of a paper deadline next Tuesday), sorry for > not answering earlier. > > I was just struck by an idea: there is a stacked bar plot where the > total bar is related to the geometric mean, such that it is > normalization-invariant. But this graph _is_ complicated. > > It is a stacked plot of _logarithms_ of performance ratios? This way, > the complete stacked bar shows the logarithm of the product, rather > than their sum, i.e. the log of the (geometric mean)^N rather than > their arithmetic mean. log of the (geometric mean)^N = N*log of the > (geometric mean). > > Some simple maths (I didn't write it out, so please recheck!) seems to > show that showing (a+b*log (ratio)), instead of log(ratio), gives > still a fair comparison, obtaining N*a+b*N*log(geomean) = > \Theta(log(geomean)). You need to put a and b because showing if the > ratio is 1, log(1) is zero (b is the representation scale which is > always there). > > About your workaround: I would like a table with the geometric mean of > the ratios, where we get the real global performance ratio among the > interpreters. 
As far as the results of your solution do not contradict > that _real_ table, it should be a reasonable workaround (but I would > embed the check in the code - otherwise other projects _will be_ > bitten by that). Probably, I would like the website to offer such a > table to users, and I would like a graph of the overall performance > ratio over time (actually revisions). > > Finally, the docs of your web application should at the very least > reference the paper and this conversation (if there's a public archive > of the ML, as I think), and ideally explain the issue. > > Sorry for being too dense, maybe - if I was unclear, please tell me > and I'll answer next week. > > Best regards, > Paolo > > On Mon, Jun 28, 2010 at 11:21, Miquel Torres wrote: >> Hi Paolo, >> >> I read the paper, very interesting. It is perfectly clear that to >> calculate a normalized total only the geometric mean makes sense. >> >> However, a stacked bars plot shows the individual benchmarks so it >> implicitly is an arithmetic mean. The only solution (apart from >> removing the stacked charts and only offering total bars) is the >> weighted approach. >> >> External weights are not very practical though. Codespeed is used by >> other projects so an extra option would need to be added to the >> settings to allow the introducing of arbitrary weights to benchmarks. >> A bit cumbersome. I have an idea that may work. Take the weights from >> a defined baseline so that the run times are equal, which is the same >> as normalizing to a baseline. It would be the same as now, only that >> you can't choose the normalization, it will be weighted (normalized) >> according the default baseline (which you already can already >> configure in the settings). >> >> You may say that it is still an arithmetic mean, but there won't be >> conflicting results because there is only a single normalization. For >> PyPy that would be cpython, and everything would make sense. >> I know it is a work around, not a solution. If you think it is a bad >> idea, the only other possibility is not to have stacked bars (as in >> "showing individual benchmarks"). But I find them useful. Yes you can >> see the individual benchmark results better in the normal bars chart, >> but there you don't see visually which benchmarks take the biggest >> part of the pie, which helps visualize what parts of your program need >> most improving. >> >> What do you think? >> >> Regards, >> Miquel >> >> >> 2010/6/25 Paolo Giarrusso : >>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres wrote: >>>> Hi Paolo, >>>> >>>> I am aware of the problem with calculating benchmark means, but let me >>>> explain my point of view. >>>> >>>> You are correct in that it would be preferable to have absolute times. Well, >>>> you actually can, but see what it happens: >>>> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars >>> >>> Ahah! I didn't notice that I could skip normalization! This does not >>> fully invalidate my point, however. >>> >>>> Absolute values would only work if we had carefully chosen benchmaks >>>> runtimes to be very similar (for our cpython baseline). As it is, html5lib, >>>> spitfire and spitfire_cstringio completely dominate the cummulative time. >>> >>> I acknowledge that (btw, it should be cumulative time, with one 'm', >>> both here and in the website). >>> >>>> And not because the interpreter is faster or slower but because the >>>> benchmark was arbitrarily designed to run that long. 
Any improvement in the >>>> long running benchmarks will carry much more weight than in the short >>>> running. >>> >>>> What is more useful is to have comparable slices of time so that the >>>> improvements can be seen relatively over time. >>> >>> If you want to sum up times (but at this point, I see no reason for >>> it), you should rather have externally derived weights, as suggested >>> by the paper (in Rule 3). >>> As soon as you take weights from the data, lots of maths that you need >>> is not going to work any more - that's generally true in many cases in >>> statistics. >>> And the only way making sense to have external weights is to gather >>> them from real world programs. Since that's not going to happen >>> easily, just stick with the geometric mean. Or set an arbitrarily low >>> weight, manually, without any math, so that the long-running >>> benchmarks stop dominating the res. It's no fraud, since the current >>> graph is less valid anyway. >>> >>>> Normalizing does that i >>>> think. >>> Not really. >>> >>>> It just says: we have 21 tasks which take 1 second to run each on >>>> interpreter X (cpython in the default case). Then we see how other >>>> executables compare to that. What would the geometric mean achieve here, >>>> exactly, for the end user? >>> >>> You actually need the geomean to do that. Don't forget that the >>> geomean is still a mean: it's a mean performance ratio which averages >>> individual performance ratios. >>> If PyPy's geomean is 0.5, it means that PyPy is going to run that task >>> in 11.5 seconds instead of 21. To me, this sounds exactly like what >>> you want to achieve. Moreover, it actually works, unlike what you use. >>> >>> For instance, ignore PyPy-JIT, and look only CPython and pypy-c (no >>> JIT). Then, change the normalization among the two: >>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars >>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars >>> with the current data, you get that in one case cpython is faster, in >>> the other pypy-c is faster. >>> It can't happen with the geomean. This is the point of the paper. >>> >>> I could even construct a normalization baseline $base such that >>> CPython seems faster than PyPy-JIT. Such a base should be very fast >>> on, say, ai (where CPython is slower), so that $cpython.ai/$base.ai >>> becomes 100 and $pypyjit.ai/$base.ai becomes 200, and be very slow on >>> other benchmarks (so that they disappear in the sum). >>> >>> So, the only difference I see is that geomean works, arithm. mean >>> doesn't. That's why Real Benchmarkers use geomean. >>> >>> Moreover, you are making a mistake quite common among non-physicists. >>> What you say makes sense under the implicit assumption that dividing >>> two times gives something you can use as a time. When you say "Pypy's >>> runtime for a 1 second task", you actually want to talk about a >>> performance ratio, not about the time. In the same way as when you say >>> "this bird runs 3 meters long in one second", a physicist would sum >>> that up as "3 m/s" rather than "3 m". >>> >>>> I am not really calculating any mean. You can see that I carefully avoided >>>> to display any kind of total bar which would indeed incur in the problem you >>>> mention. 
That a stacked chart implicitly displays a total is something you >>>> can not avoid, and for that kind of chart I still think normalized results >>>> is visually the best option. >>> >>> But on a stacked bars graph, I'm not going to look at individual bars >>> at all, just at the total: it's actually less convenient than in >>> "normal bars" to look at the result of a particular benchmark. >>> >>> I hope I can find guidelines against stacked plots, I have a PhD >>> colleague reading on how to make graphs. >>> >>> Best regards >>> -- >>> Paolo Giarrusso - Ph.D. Student >>> http://www.informatik.uni-marburg.de/~pgiarrusso/ >>> >> > > > > -- > Paolo Giarrusso - Ph.D. Student > http://www.informatik.uni-marburg.de/~pgiarrusso/ > From hakan at debian.org Fri Jul 2 09:37:03 2010 From: hakan at debian.org (Hakan Ardo) Date: Fri, 2 Jul 2010 09:37:03 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: Hi, I've got a simple implementation of array now, wrapping lltype.malloc with no error checking yet (cStringIO was great help, thx). How can I test this with the jit? Do I need to translate the entire pypy or is there a quicker way? > there. In case of _rawffi, probably a couple of hints for the jit to > not look inside some functions (which do external calls for example) > should also be needed, since for example JIT as of now does not > support raw mallocs (using C malloc and not our GC). Still, making an > array module interp-level is probably the sanest approach. Do I need to guard the lltype.malloc call with such hints? What is the syntax? -- H?kan Ard? From p.giarrusso at gmail.com Fri Jul 2 09:47:57 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 2 Jul 2010 09:47:57 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Fri, Jul 2, 2010 at 08:04, Maciej Fijalkowski wrote: > On Thu, Jul 1, 2010 at 1:18 PM, Hakan Ardo wrote: >> OK, so making an interpreter level implementation of array.array seams >> like a good idea. Would it be possible to get the jit to remove the >> wrapping/unwrapping in that case to get better performance than >> _rawffi.Array('d'), which is already an interpreter level >> implementation? > > it should work mostly out of the box (you can also try this for > _rawffi.array part of module, if you want to). It's probably enough to > enable module in pypy/module/pypyjit/policy.py so JIT can have a look > there. In case of _rawffi, probably a couple of hints for the jit to > not look inside some functions (which do external calls for example) > should also be needed, since for example JIT as of now does not > support raw mallocs (using C malloc and not our GC). > Still, making an > array module interp-level is probably the sanest approach. That might be a bad sign. For CPython, people recommend to write extensions in C for performance, i.e. to make them less maintainable and understandable for performance. A good JIT should make this unnecessary in as many cases as possible. Of course, the array module might be an exception, if it's a single case. But performance 20x slower than C, with a JIT, is a big warning, since fast interpreters are documented to be (in general) just 10x slower than C. In this case, the JIT should be instructed to look into that module; if the result is still slow, the missing optimizations need to be traced down and added. 
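(For reference, the loop whose slowdown is being quoted here is the one from
the start of the thread; wrapped in a trivial timing harness it would look
like the sketch below — illustrative only, this is Hakan's snippet plus
time.time(), not the original measurement code:

    import time
    import _rawffi

    RAWARRAY = _rawffi.Array('d')
    img = RAWARRAY(640 * 480, autofree=True)
    intimg = RAWARRAY(640 * 480, autofree=True)

    start = time.time()
    l = 0.0
    for i in xrange(640, 640 * 480):
        l += img[i]
        intimg[i] = intimg[i - 640] + l
    print 'elapsed:', time.time() - start, 'seconds'

Repeating the measurement with the _rawffi array, a plain list and
array.array is the kind of comparison behind the 20x / 40x / 400x figures
relative to C quoted earlier in the thread.)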
Also, it seems that at some point in the future, the JIT should in general look into the whole standard library by default _and_ learn to be careful to such external calls. Isn't it? Comments appreciated. Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From p.giarrusso at gmail.com Fri Jul 2 09:53:04 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 2 Jul 2010 09:53:04 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Fri, Jul 2, 2010 at 09:47, Paolo Giarrusso wrote: > On Fri, Jul 2, 2010 at 08:04, Maciej Fijalkowski wrote: >> On Thu, Jul 1, 2010 at 1:18 PM, Hakan Ardo wrote: >>> OK, so making an interpreter level implementation of array.array seams >>> like a good idea. Would it be possible to get the jit to remove the >>> wrapping/unwrapping in that case to get better performance than >>> _rawffi.Array('d'), which is already an interpreter level >>> implementation? >> >> it should work mostly out of the box (you can also try this for >> _rawffi.array part of module, if you want to). It's probably enough to >> enable module in pypy/module/pypyjit/policy.py so JIT can have a look >> there. In case of _rawffi, probably a couple of hints for the jit to >> not look inside some functions (which do external calls for example) >> should also be needed, since for example JIT as of now does not >> support raw mallocs (using C malloc and not our GC). > >> Still, making an >> array module interp-level is probably the sanest approach. > > That might be a bad sign. > For CPython, people recommend to write extensions in C for > performance, i.e. to make them less maintainable and understandable > for performance. Here, I forgot to state explicitly that having to rewrite a module at the interpreter level is somehow similar. Imagine that was suggested, the day PyPy will be standard, to application authors. > A good JIT should make this unnecessary in as many cases as possible. > Of course, the array module might be an exception, if it's a single > case. > But performance 20x slower than C, with a JIT, is a big warning, since > fast interpreters are documented to be (in general) just 10x slower > than C. > In this case, the JIT should be instructed to look into that module; > if the result is still slow, the missing optimizations need to be > traced down and added. > Also, it seems that at some point in the future, the JIT should in > general look into the whole standard library by default _and_ learn to > be careful to such external calls. Isn't it? > Comments appreciated. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From p.giarrusso at gmail.com Fri Jul 2 09:58:29 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 2 Jul 2010 09:58:29 +0200 Subject: [pypy-dev] New speed.pypy.org version In-Reply-To: References: Message-ID: On Fri, Jul 2, 2010 at 09:27, Miquel Torres wrote: > Hi Paolo, > > hey! I think it is a great idea. With logs you get both: correct > normalized totals AND the ability to display the individual stacked > series, which necessarily add arithmetically. But it strikes me, > hasn't anyone written a paper about that method already? or at least > documented it? I guess the problem is that the graph is weird enough, and that you need arbitrary a and b to make it work, since the logarithm might get negative, and arbitrarily big. log 0 = - inf. 
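In code, the invariant I have in mind is just this (a toy sketch with made-up
ratios; a and b are the arbitrary offset/scale mentioned above, chosen here
so that all segments stay positive for this data):

    import math

    a, b = 1.0, 0.5            # arbitrary plot offset/scale
    ratios = [0.5, 2.0, 0.2]   # per-benchmark performance ratios of one executable
    n = len(ratios)

    segments = [a + b * math.log(r) for r in ratios]  # one stacked segment per benchmark
    total = sum(segments)                             # height of the whole stacked bar
    geomean = math.exp(sum(math.log(r) for r in ratios) / n)

    # total == N*a + b*N*log(geomean), so at fixed N, a and b, comparing
    # stacked totals is the same as comparing geometric means
    assert abs(total - (n * a + b * n * math.log(geomean))) < 1e-9
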
I still think that's fair and makes sense, but it's somewhat hard to sell. > Anyway I need to check that the math is right (hopefully today), and > then I would go and implement it. > I'll tell you how it goes. > > Cheers, > Miquel > > > > 2010/6/30 Paolo Giarrusso : >> Hi Miquel, >> I'm quite busy (because of a paper deadline next Tuesday), sorry for >> not answering earlier. >> >> I was just struck by an idea: there is a stacked bar plot where the >> total bar is related to the geometric mean, such that it is >> normalization-invariant. But this graph _is_ complicated. >> >> It is a stacked plot of _logarithms_ of performance ratios? This way, >> the complete stacked bar shows the logarithm of the product, rather >> than their sum, i.e. the log of the (geometric mean)^N rather than >> their arithmetic mean. log of the (geometric mean)^N = N*log of the >> (geometric mean). >> >> Some simple maths (I didn't write it out, so please recheck!) seems to >> show that showing (a+b*log (ratio)), instead of log(ratio), gives >> still a fair comparison, obtaining N*a+b*N*log(geomean) = >> \Theta(log(geomean)). You need to put a and b because showing if the >> ratio is 1, log(1) is zero (b is the representation scale which is >> always there). >> >> About your workaround: I would like a table with the geometric mean of >> the ratios, where we get the real global performance ratio among the >> interpreters. As far as the results of your solution do not contradict >> that _real_ table, it should be a reasonable workaround (but I would >> embed the check in the code - otherwise other projects _will be_ >> bitten by that). Probably, I would like the website to offer such a >> table to users, and I would like a graph of the overall performance >> ratio over time (actually revisions). >> >> Finally, the docs of your web application should at the very least >> reference the paper and this conversation (if there's a public archive >> of the ML, as I think), and ideally explain the issue. >> >> Sorry for being too dense, maybe - if I was unclear, please tell me >> and I'll answer next week. >> >> Best regards, >> Paolo >> >> On Mon, Jun 28, 2010 at 11:21, Miquel Torres wrote: >>> Hi Paolo, >>> >>> I read the paper, very interesting. It is perfectly clear that to >>> calculate a normalized total only the geometric mean makes sense. >>> >>> However, a stacked bars plot shows the individual benchmarks so it >>> implicitly is an arithmetic mean. The only solution (apart from >>> removing the stacked charts and only offering total bars) is the >>> weighted approach. >>> >>> External weights are not very practical though. Codespeed is used by >>> other projects so an extra option would need to be added to the >>> settings to allow the introducing of arbitrary weights to benchmarks. >>> A bit cumbersome. I have an idea that may work. Take the weights from >>> a defined baseline so that the run times are equal, which is the same >>> as normalizing to a baseline. It would be the same as now, only that >>> you can't choose the normalization, it will be weighted (normalized) >>> according the default baseline (which you already can already >>> configure in the settings). >>> >>> You may say that it is still an arithmetic mean, but there won't be >>> conflicting results because there is only a single normalization. For >>> PyPy that would be cpython, and everything would make sense. >>> I know it is a work around, not a solution. 
If you think it is a bad >>> idea, the only other possibility is not to have stacked bars (as in >>> "showing individual benchmarks"). But I find them useful. Yes you can >>> see the individual benchmark results better in the normal bars chart, >>> but there you don't see visually which benchmarks take the biggest >>> part of the pie, which helps visualize what parts of your program need >>> most improving. >>> >>> What do you think? >>> >>> Regards, >>> Miquel >>> >>> >>> 2010/6/25 Paolo Giarrusso : >>>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres wrote: >>>>> Hi Paolo, >>>>> >>>>> I am aware of the problem with calculating benchmark means, but let me >>>>> explain my point of view. >>>>> >>>>> You are correct in that it would be preferable to have absolute times. Well, >>>>> you actually can, but see what it happens: >>>>> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars >>>> >>>> Ahah! I didn't notice that I could skip normalization! This does not >>>> fully invalidate my point, however. >>>> >>>>> Absolute values would only work if we had carefully chosen benchmaks >>>>> runtimes to be very similar (for our cpython baseline). As it is, html5lib, >>>>> spitfire and spitfire_cstringio completely dominate the cummulative time. >>>> >>>> I acknowledge that (btw, it should be cumulative time, with one 'm', >>>> both here and in the website). >>>> >>>>> And not because the interpreter is faster or slower but because the >>>>> benchmark was arbitrarily designed to run that long. Any improvement in the >>>>> long running benchmarks will carry much more weight than in the short >>>>> running. >>>> >>>>> What is more useful is to have comparable slices of time so that the >>>>> improvements can be seen relatively over time. >>>> >>>> If you want to sum up times (but at this point, I see no reason for >>>> it), you should rather have externally derived weights, as suggested >>>> by the paper (in Rule 3). >>>> As soon as you take weights from the data, lots of maths that you need >>>> is not going to work any more - that's generally true in many cases in >>>> statistics. >>>> And the only way making sense to have external weights is to gather >>>> them from real world programs. Since that's not going to happen >>>> easily, just stick with the geometric mean. Or set an arbitrarily low >>>> weight, manually, without any math, so that the long-running >>>> benchmarks stop dominating the res. It's no fraud, since the current >>>> graph is less valid anyway. >>>> >>>>> Normalizing does that i >>>>> think. >>>> Not really. >>>> >>>>> It just says: we have 21 tasks which take 1 second to run each on >>>>> interpreter X (cpython in the default case). Then we see how other >>>>> executables compare to that. What would the geometric mean achieve here, >>>>> exactly, for the end user? >>>> >>>> You actually need the geomean to do that. Don't forget that the >>>> geomean is still a mean: it's a mean performance ratio which averages >>>> individual performance ratios. >>>> If PyPy's geomean is 0.5, it means that PyPy is going to run that task >>>> in 11.5 seconds instead of 21. To me, this sounds exactly like what >>>> you want to achieve. Moreover, it actually works, unlike what you use. >>>> >>>> For instance, ignore PyPy-JIT, and look only CPython and pypy-c (no >>>> JIT). 
Then, change the normalization among the two: >>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars >>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars >>>> with the current data, you get that in one case cpython is faster, in >>>> the other pypy-c is faster. >>>> It can't happen with the geomean. This is the point of the paper. >>>> >>>> I could even construct a normalization baseline $base such that >>>> CPython seems faster than PyPy-JIT. Such a base should be very fast >>>> on, say, ai (where CPython is slower), so that $cpython.ai/$base.ai >>>> becomes 100 and $pypyjit.ai/$base.ai becomes 200, and be very slow on >>>> other benchmarks (so that they disappear in the sum). >>>> >>>> So, the only difference I see is that geomean works, arithm. mean >>>> doesn't. That's why Real Benchmarkers use geomean. >>>> >>>> Moreover, you are making a mistake quite common among non-physicists. >>>> What you say makes sense under the implicit assumption that dividing >>>> two times gives something you can use as a time. When you say "Pypy's >>>> runtime for a 1 second task", you actually want to talk about a >>>> performance ratio, not about the time. In the same way as when you say >>>> "this bird runs 3 meters long in one second", a physicist would sum >>>> that up as "3 m/s" rather than "3 m". >>>> >>>>> I am not really calculating any mean. You can see that I carefully avoided >>>>> to display any kind of total bar which would indeed incur in the problem you >>>>> mention. That a stacked chart implicitly displays a total is something you >>>>> can not avoid, and for that kind of chart I still think normalized results >>>>> is visually the best option. >>>> >>>> But on a stacked bars graph, I'm not going to look at individual bars >>>> at all, just at the total: it's actually less convenient than in >>>> "normal bars" to look at the result of a particular benchmark. >>>> >>>> I hope I can find guidelines against stacked plots, I have a PhD >>>> colleague reading on how to make graphs. >>>> >>>> Best regards >>>> -- >>>> Paolo Giarrusso - Ph.D. Student >>>> http://www.informatik.uni-marburg.de/~pgiarrusso/ >>>> >>> >> >> >> >> -- >> Paolo Giarrusso - Ph.D. Student >> http://www.informatik.uni-marburg.de/~pgiarrusso/ >> > -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From fijall at gmail.com Fri Jul 2 10:14:36 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 02:14:36 -0600 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Fri, Jul 2, 2010 at 1:47 AM, Paolo Giarrusso wrote: > On Fri, Jul 2, 2010 at 08:04, Maciej Fijalkowski wrote: >> On Thu, Jul 1, 2010 at 1:18 PM, Hakan Ardo wrote: >>> OK, so making an interpreter level implementation of array.array seams >>> like a good idea. Would it be possible to get the jit to remove the >>> wrapping/unwrapping in that case to get better performance than >>> _rawffi.Array('d'), which is already an interpreter level >>> implementation? >> >> it should work mostly out of the box (you can also try this for >> _rawffi.array part of module, if you want to). It's probably enough to >> enable module in pypy/module/pypyjit/policy.py so JIT can have a look >> there. 
In case of _rawffi, probably a couple of hints for the jit to >> not look inside some functions (which do external calls for example) >> should also be needed, since for example JIT as of now does not >> support raw mallocs (using C malloc and not our GC). > >> Still, making an >> array module interp-level is probably the sanest approach. > > That might be a bad sign. > For CPython, people recommend to write extensions in C for > performance, i.e. to make them less maintainable and understandable > for performance. > A good JIT should make this unnecessary in as many cases as possible. > Of course, the array module might be an exception, if it's a single > case. > But performance 20x slower than C, with a JIT, is a big warning, since > fast interpreters are documented to be (in general) just 10x slower > than C. There is a lot of unsupported claims in your sentences, however, that's not my point. array module is the main source in Python for single-type arrays (including C types which are not available under Python). The other would be numpy. That makes sense to write in C/RPython, since it's lower-level than Python has. From fijall at gmail.com Fri Jul 2 10:16:13 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 02:16:13 -0600 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: On Fri, Jul 2, 2010 at 1:37 AM, Hakan Ardo wrote: > Hi, > I've got a simple implementation of array now, wrapping lltype.malloc > with no error checking yet (cStringIO was great help, thx). How can I > test this with the jit? Do I need to translate the entire pypy or is > there a quicker way? > >> there. In case of _rawffi, probably a couple of hints for the jit to >> not look inside some functions (which do external calls for example) >> should also be needed, since for example JIT as of now does not >> support raw mallocs (using C malloc and not our GC). Still, making an >> array module interp-level is probably the sanest approach. > > Do I need to guard the lltype.malloc call with such hints? What is the syntax? > I can see into making raw_malloc just a call from JIT. That shouldn't be a big issue. For now you can either: a) use from pypy.rlib import rgc and use rgc.malloc_nonmovable (not sure if jit'll like it), so you'll get a gc-managed non-movable memory b) just wrap call to malloc in a function with decorator dont_look_inside (from pypy.rlib.jit) From arigo at tunes.org Fri Jul 2 10:17:04 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 2 Jul 2010 10:17:04 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: <20100702081704.GA12280@code0.codespeak.net> Hi Fijal, On Thu, Jul 01, 2010 at 09:35:17AM -0600, Maciej Fijalkowski wrote: > The main reason why _rawffi.Array is slow is that JIT does not look > into that module, so there is wrapping and unwrapping going on. > Relatively easy to fix I suppose, but _rawffi.Array was not meant to > be used like that (array.array looks like a better candidate). If you mean "better candidate" for being fast right now, then you missed my point: our array.array module is implemented on top of _rawffi.Array. If you mean "better candidate" for being optimizable given some work, then yes, I agree that the array module is a good target. A bientot, Armin. From arigo at tunes.org Fri Jul 2 10:18:59 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 2 Jul 2010 10:18:59 +0200 Subject: [pypy-dev] array performace? 
In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> Message-ID: <20100702081859.GB12280@code0.codespeak.net> Hi Alex, On Fri, Jul 02, 2010 at 12:40:21AM -0500, Alex Gaynor wrote: > FWIW one thing to note is that array > uses the struct module, which is also pure python. No: we have a pure Python version, but in a normally compiled pypy-c, there is an interp-level version of 'struct' too. A bientot, Armin. From fijall at gmail.com Fri Jul 2 10:19:02 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 02:19:02 -0600 Subject: [pypy-dev] array performace? In-Reply-To: <20100702081704.GA12280@code0.codespeak.net> References: <20100701152827.GA30661@code0.codespeak.net> <20100702081704.GA12280@code0.codespeak.net> Message-ID: On Fri, Jul 2, 2010 at 2:17 AM, Armin Rigo wrote: > Hi Fijal, > > On Thu, Jul 01, 2010 at 09:35:17AM -0600, Maciej Fijalkowski wrote: >> The main reason why _rawffi.Array is slow is that JIT does not look >> into that module, so there is wrapping and unwrapping going on. >> Relatively easy to fix I suppose, but _rawffi.Array was not meant to >> be used like that (array.array looks like a better candidate). > > If you mean "better candidate" for being fast right now, then you missed > my point: our array.array module is implemented on top of > _rawffi.Array. ?If you mean "better candidate" for being optimizable > given some work, then yes, I agree that the array module is a good > target. > By "better candidate" I mean that having JIT see _rawffi might mean some struggle for it to understand what's going on with raw pointers and writing array in interp-level would be better. From arigo at tunes.org Fri Jul 2 10:23:10 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 2 Jul 2010 10:23:10 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <20100702081704.GA12280@code0.codespeak.net> Message-ID: <20100702082310.GC12280@code0.codespeak.net> Hi Fijal, On Fri, Jul 02, 2010 at 02:19:02AM -0600, Maciej Fijalkowski wrote: > By "better candidate" I mean that having JIT see _rawffi might mean > some struggle for it to understand what's going on with raw pointers > and writing array in interp-level would be better. Ah, right. Armin. From fijall at gmail.com Fri Jul 2 10:35:20 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 02:35:20 -0600 Subject: [pypy-dev] array performace? In-Reply-To: <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: On Fri, Jul 2, 2010 at 2:26 AM, wrote: >> On Fri, Jul 2, 2010 at 1:47 AM, Paolo Giarrusso > >> wrote: >> > On Fri, Jul 2, 2010 at 08:04, Maciej Fijalkowski > wrote: >> >> On Thu, Jul 1, 2010 at 1:18 PM, Hakan Ardo wrote: >> >>> OK, so making an interpreter level implementation of array.array > seams >> >>> like a good idea. Would it be possible to get the jit to remove > the >> >>> wrapping/unwrapping in that case to get better performance than >> >>> _rawffi.Array('d'), which is already an interpreter level >> >>> implementation? >> >> >> >> it should work mostly out of the box (you can also try this for >> >> _rawffi.array part of module, if you want to). It's probably enough > to >> >> enable module in pypy/module/pypyjit/policy.py so JIT can have a > look >> >> there. 
In case of _rawffi, probably a couple of hints for the jit > to >> >> not look inside some functions (which do external calls for > example) >> >> should also be needed, since for example JIT as of now does not >> >> support raw mallocs (using C malloc and not our GC). >> > >> >> Still, making an >> >> array module interp-level is probably the sanest approach. >> > >> > That might be a bad sign. >> > For CPython, people recommend to write extensions in C for >> > performance, i.e. to make them less maintainable and understandable >> > for performance. >> > A good JIT should make this unnecessary in as many cases as > possible. >> > Of course, the array module might be an exception, if it's a single >> > case. >> > But performance 20x slower than C, with a JIT, is a big warning, > since >> > fast interpreters are documented to be (in general) just 10x slower >> > than C. >> >> There is a lot of unsupported claims in your sentences, however, >> that's not my point. >> > > That's a little harsh. When the JIT was originally developed it was > envisaged that it would be faster to re-write code to app level to give > speed-ups. If that's changed that's fine, but it's not an "unsupported > claim" > > Ben > Unsupported claim is for example that fast interpreters are 10x slower than C. On what exactly? Did he write this particular benchmark in C and in fast interpreter to compare? Another unsupported claim is that JIT is 20x slower than C here. Array module is not even JITted, because it's based on _rawffi which itself operates on low-level pointers which JIT does not want to deal with. That's exactly the reason why JIT doesn't look into _rawffi module and making it look there doesn't sound like a good idea (instead, we're trying to replace it with something JIT-friendly that knows how to do FFI calls into C, there is a summer of code project). All I'm trying to say is that there are valid reasons that array module should be on interpreter level and none of this has anything to do with incapabilities of the JIT. Cheers, fijal From Ben.Young at sungard.com Fri Jul 2 11:26:18 2010 From: Ben.Young at sungard.com (Ben.Young at sungard.com) Date: Fri, 2 Jul 2010 10:26:18 +0100 Subject: [pypy-dev] PyPy Speed Message-ID: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> http://speed.pypy.org/overview/ seems to have been unavailable for the last couple of days. It gives a 500 whenever I visit it Ben Young - Senior Software Engineer SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ CONFIDENTIALITY: This email (including any attachments) may contain confidential, proprietary and privileged information, and unauthorized disclosure or use is prohibited. If you received this email in error, please notify the sender and delete this email from your system. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Fri Jul 2 11:28:03 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 03:28:03 -0600 Subject: [pypy-dev] PyPy Speed In-Reply-To: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> References: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: Hey. I know miquel was talking about rolling in new version. 
Apparently, did not work :) On Fri, Jul 2, 2010 at 3:26 AM, wrote: > http://speed.pypy.org/overview/ seems to have been unavailable for the last > couple of days. It gives a 500 whenever I visit it > > > > Ben Young - Senior Software Engineer > > SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR > > Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ > > > > CONFIDENTIALITY:? This email (including any attachments) may contain > confidential, proprietary and privileged information, and unauthorized > disclosure or use is prohibited.? If you received this email in error, > please notify the sender and delete this email from your system.? Thank you. > > > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From Ben.Young at sungard.com Fri Jul 2 11:36:28 2010 From: Ben.Young at sungard.com (Ben.Young at sungard.com) Date: Fri, 2 Jul 2010 10:36:28 +0100 Subject: [pypy-dev] PyPy Speed In-Reply-To: References: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: <01781CA2CC22B145B230504679ECF48C01AC449C@EMEA-EXCHANGE03.internal.sungard.corp> Ok thanks :) -----Original Message----- From: Maciej Fijalkowski [mailto:fijall at gmail.com] Sent: 02 July 2010 10:28 To: Young, Ben Cc: pypy-dev at codespeak.net Subject: Re: [pypy-dev] PyPy Speed Hey. I know miquel was talking about rolling in new version. Apparently, did not work :) On Fri, Jul 2, 2010 at 3:26 AM, wrote: > http://speed.pypy.org/overview/ seems to have been unavailable for the last > couple of days. It gives a 500 whenever I visit it > > > > Ben Young - Senior Software Engineer > > SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR > > Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ > > > > CONFIDENTIALITY:? This email (including any attachments) may contain > confidential, proprietary and privileged information, and unauthorized > disclosure or use is prohibited.? If you received this email in error, > please notify the sender and delete this email from your system.? Thank you. > > > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From tobami at googlemail.com Fri Jul 2 11:49:53 2010 From: tobami at googlemail.com (Miquel Torres) Date: Fri, 2 Jul 2010 11:49:53 +0200 Subject: [pypy-dev] PyPy Speed In-Reply-To: <01781CA2CC22B145B230504679ECF48C01AC449C@EMEA-EXCHANGE03.internal.sungard.corp> References: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC449C@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: Hi Ben, no, that is not the case, the new version has been online for a week without problems. The reason is the renaming of the "overview" to "changes". Maybe I should have left the URL /overview/ active with a redirection to /changes/, sorry. You would have seen that if you had checked the root URL (speed.pypy.org) btw. Anyway thanks for pointing it out. Cheers, Miquel 2010/7/2 : > Ok thanks :) > > -----Original Message----- > From: Maciej Fijalkowski [mailto:fijall at gmail.com] > Sent: 02 July 2010 10:28 > To: Young, Ben > Cc: pypy-dev at codespeak.net > Subject: Re: [pypy-dev] PyPy Speed > > Hey. > > I know miquel was talking about rolling in new version. Apparently, > did not work :) > > On Fri, Jul 2, 2010 at 3:26 AM, ? 
wrote: >> http://speed.pypy.org/overview/ seems to have been unavailable for the last >> couple of days. It gives a 500 whenever I visit it >> >> >> >> Ben Young - Senior Software Engineer >> >> SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR >> >> Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ >> >> >> >> CONFIDENTIALITY:? This email (including any attachments) may contain >> confidential, proprietary and privileged information, and unauthorized >> disclosure or use is prohibited.? If you received this email in error, >> please notify the sender and delete this email from your system.? Thank you. >> >> >> >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev >> > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev From Ben.Young at sungard.com Fri Jul 2 11:51:36 2010 From: Ben.Young at sungard.com (Ben.Young at sungard.com) Date: Fri, 2 Jul 2010 10:51:36 +0100 Subject: [pypy-dev] PyPy Speed In-Reply-To: References: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp><01781CA2CC22B145B230504679ECF48C01AC449C@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: <01781CA2CC22B145B230504679ECF48C01AC44B1@EMEA-EXCHANGE03.internal.sungard.corp> Ah, ok thanks. I had bookmarked the other page, so I just clicked and assumed it was broken Thanks, Ben -----Original Message----- From: Miquel Torres [mailto:tobami at googlemail.com] Sent: 02 July 2010 10:50 To: Young, Ben Cc: pypy-dev at codespeak.net Subject: Re: [pypy-dev] PyPy Speed Hi Ben, no, that is not the case, the new version has been online for a week without problems. The reason is the renaming of the "overview" to "changes". Maybe I should have left the URL /overview/ active with a redirection to /changes/, sorry. You would have seen that if you had checked the root URL (speed.pypy.org) btw. Anyway thanks for pointing it out. Cheers, Miquel 2010/7/2 : > Ok thanks :) > > -----Original Message----- > From: Maciej Fijalkowski [mailto:fijall at gmail.com] > Sent: 02 July 2010 10:28 > To: Young, Ben > Cc: pypy-dev at codespeak.net > Subject: Re: [pypy-dev] PyPy Speed > > Hey. > > I know miquel was talking about rolling in new version. Apparently, > did not work :) > > On Fri, Jul 2, 2010 at 3:26 AM, ? wrote: >> http://speed.pypy.org/overview/ seems to have been unavailable for the last >> couple of days. It gives a 500 whenever I visit it >> >> >> >> Ben Young - Senior Software Engineer >> >> SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR >> >> Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ >> >> >> >> CONFIDENTIALITY:? This email (including any attachments) may contain >> confidential, proprietary and privileged information, and unauthorized >> disclosure or use is prohibited.? If you received this email in error, >> please notify the sender and delete this email from your system.? Thank you. 
>> >> >> >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev >> > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev From fijall at gmail.com Fri Jul 2 12:20:18 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 04:20:18 -0600 Subject: [pypy-dev] PyPy Speed In-Reply-To: <01781CA2CC22B145B230504679ECF48C01AC44B1@EMEA-EXCHANGE03.internal.sungard.corp> References: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC449C@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC44B1@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: To be fair it's not like it said "404 not found" to me On Fri, Jul 2, 2010 at 3:51 AM, wrote: > Ah, ok thanks. I had bookmarked the other page, so I just clicked and assumed it was broken > > Thanks, > Ben > > -----Original Message----- > From: Miquel Torres [mailto:tobami at googlemail.com] > Sent: 02 July 2010 10:50 > To: Young, Ben > Cc: pypy-dev at codespeak.net > Subject: Re: [pypy-dev] PyPy Speed > > Hi Ben, > > no, that is not the case, the new version has been online for a week > without problems. > > The reason is the renaming of the "overview" to "changes". Maybe I > should have left the URL /overview/ active with a redirection to > /changes/, sorry. You would have seen that if you had checked the root > URL (speed.pypy.org) btw. > > Anyway thanks for pointing it out. > > Cheers, > Miquel > > > 2010/7/2 ?: >> Ok thanks :) >> >> -----Original Message----- >> From: Maciej Fijalkowski [mailto:fijall at gmail.com] >> Sent: 02 July 2010 10:28 >> To: Young, Ben >> Cc: pypy-dev at codespeak.net >> Subject: Re: [pypy-dev] PyPy Speed >> >> Hey. >> >> I know miquel was talking about rolling in new version. Apparently, >> did not work :) >> >> On Fri, Jul 2, 2010 at 3:26 AM, ? wrote: >>> http://speed.pypy.org/overview/ seems to have been unavailable for the last >>> couple of days. It gives a 500 whenever I visit it >>> >>> >>> >>> Ben Young - Senior Software Engineer >>> >>> SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR >>> >>> Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ >>> >>> >>> >>> CONFIDENTIALITY:? This email (including any attachments) may contain >>> confidential, proprietary and privileged information, and unauthorized >>> disclosure or use is prohibited.? If you received this email in error, >>> please notify the sender and delete this email from your system.? Thank you. >>> >>> >>> >>> _______________________________________________ >>> pypy-dev at codespeak.net >>> http://codespeak.net/mailman/listinfo/pypy-dev >>> >> >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > From p.giarrusso at gmail.com Fri Jul 2 14:08:35 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 2 Jul 2010 14:08:35 +0200 Subject: [pypy-dev] array performace? 
In-Reply-To: <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: On Fri, Jul 2, 2010 at 10:55, wrote: >> On Fri, Jul 2, 2010 at 2:26 AM, ? wrote: >> >> On Fri, Jul 2, 2010 at 1:47 AM, Paolo Giarrusso >> > >> >> wrote: >> >> > On Fri, Jul 2, 2010 at 08:04, Maciej Fijalkowski >> > wrote: >> >> >> On Thu, Jul 1, 2010 at 1:18 PM, Hakan Ardo wrote: >> >> >>> OK, so making an interpreter level implementation of array.array >> > seams >> >> >>> like a good idea. Would it be possible to get the jit to remove >> > the >> >> >>> wrapping/unwrapping in that case to get better performance than >> >> >>> _rawffi.Array('d'), which is already an interpreter level >> >> >>> implementation? >> >> >> >> >> >> it should work mostly out of the box (you can also try this for >> >> >> _rawffi.array part of module, if you want to). It's probably enough >> > to >> >> >> enable module in pypy/module/pypyjit/policy.py so JIT can have a >> > look >> >> >> there. In case of _rawffi, probably a couple of hints for the jit >> > to >> >> >> not look inside some functions (which do external calls for >> > example) >> >> >> should also be needed, since for example JIT as of now does not >> >> >> support raw mallocs (using C malloc and not our GC). >> >> > >> >> >> Still, making an >> >> >> array module interp-level is probably the sanest approach. >> >> > >> >> > That might be a bad sign. >> >> > For CPython, people recommend to write extensions in C for >> >> > performance, i.e. to make them less maintainable and understandable >> >> > for performance. >> >> > A good JIT should make this unnecessary in as many cases as >> > possible. >> >> > Of course, the array module might be an exception, if it's a single >> >> > case. >> >> > But performance 20x slower than C, with a JIT, is a big warning, >> > since >> >> > fast interpreters are documented to be (in general) just 10x slower >> >> > than C. >> >> >> >> There is a lot of unsupported claims in your sentences, however, >> >> that's not my point. >> >> >> > >> > That's a little harsh. When the JIT was originally developed it was >> > envisaged that it would be faster to re-write code to app level to give >> > speed-ups. If that's changed that's fine, but it's not an "unsupported >> > claim" >> > >> > Ben >> > >> >> Unsupported claim is for example that fast interpreters are 10x slower >> than C. That's the only unsupported claim, but it comes from "The Structure and Performance of E?cient Interpreters". I studied that as a student on VM, you are writing one, so I (unconsciously) guessed that everybody knows that paper - I know that's a completely broken way of writing, but I didn't spot it. >>On what exactly? Did he write this particular benchmark in C >> and in fast interpreter to compare? Another unsupported claim is that >> JIT is 20x slower than C here. I did not claim that - I am aware that it is not even JITted. I complain against the lack of JITting. >> Array module is not even JITted, >> because it's based on _rawffi which itself operates on low-level >> pointers which JIT does not want to deal with. I would say that instead of doing manual annotations or rewriting at the interp-level (which doesn't scale), it would be overall simpler to make the JIT learn itself how to deal with those calls (i.e. 
inline everything around, leave the external call as a call), once and for all. What you suggest below might be a way to do it. >> That's exactly the >> reason why JIT doesn't look into _rawffi module and making it look >> there doesn't sound like a good idea (instead, we're trying to replace >> it with something JIT-friendly that knows how to do FFI calls into C, >> there is a summer of code project). Well, at the abstraction level I'm speaking, it sounds like there in the end, the JIT will be able to do what is needed. I am not aware of the details. But then, at the end of that project, it seems to me that it should be possible to write the array module in pure Python using this new FFI interface and have the JIT look at it, shouldn't it? I do not concentrate on array specifically - rewriting a few modules at interpreter level is fine. But as a Python developer I should have no need for that. >> All I'm trying to say is that there are valid reasons that array >> module should be on interpreter level and none of this has anything to >> do with incapabilities of the JIT. > Fair enough, and I do see your point, but I think Paolo comment was not aimed at array, just the implication (in this case) that to get performance you need to re-write in rpython. I think his point in general is correct, even if he picked the wrong example to mention it :) (and his 20x claim comes from the original email, so I don't think it's entirely unsupported) Thanks for understanding my point. I'm unsure whether an ideal JIT could allow leaving array at the app-level (and I noted also in the original mail that I was unsure on this). > Of course in this case I'm sure there are good reasons, but it is certainly interesting to see the push towards more rpython code than app-level. I guess that's because the JIT can "see" and accelerate rpython code too I believe, so it?s win-win (because of the code size issues and things like that) > Incidentally, is there a reason that geninterped code is so bloated compared to rpython code that looks like it could have been generated from the app-level equivalent? Would there be a way of annotating the app-level code so that when it's geninterped it's as tight as the equivalent rpython? -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From tobami at googlemail.com Fri Jul 2 16:16:14 2010 From: tobami at googlemail.com (Miquel Torres) Date: Fri, 2 Jul 2010 16:16:14 +0200 Subject: [pypy-dev] PyPy Speed In-Reply-To: References: <01781CA2CC22B145B230504679ECF48C01AC448A@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC449C@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC44B1@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: > To be fair it's not like it said "404 not found" to me right, that is wrong 2010/7/2 Maciej Fijalkowski : > To be fair it's not like it said "404 not found" to me > > On Fri, Jul 2, 2010 at 3:51 AM, ? wrote: >> Ah, ok thanks. I had bookmarked the other page, so I just clicked and assumed it was broken >> >> Thanks, >> Ben >> >> -----Original Message----- >> From: Miquel Torres [mailto:tobami at googlemail.com] >> Sent: 02 July 2010 10:50 >> To: Young, Ben >> Cc: pypy-dev at codespeak.net >> Subject: Re: [pypy-dev] PyPy Speed >> >> Hi Ben, >> >> no, that is not the case, the new version has been online for a week >> without problems. >> >> The reason is the renaming of the "overview" to "changes". 
Maybe I >> should have left the URL /overview/ active with a redirection to >> /changes/, sorry. You would have seen that if you had checked the root >> URL (speed.pypy.org) btw. >> >> Anyway thanks for pointing it out. >> >> Cheers, >> Miquel >> >> >> 2010/7/2 ?: >>> Ok thanks :) >>> >>> -----Original Message----- >>> From: Maciej Fijalkowski [mailto:fijall at gmail.com] >>> Sent: 02 July 2010 10:28 >>> To: Young, Ben >>> Cc: pypy-dev at codespeak.net >>> Subject: Re: [pypy-dev] PyPy Speed >>> >>> Hey. >>> >>> I know miquel was talking about rolling in new version. Apparently, >>> did not work :) >>> >>> On Fri, Jul 2, 2010 at 3:26 AM, ? wrote: >>>> http://speed.pypy.org/overview/ seems to have been unavailable for the last >>>> couple of days. It gives a 500 whenever I visit it >>>> >>>> >>>> >>>> Ben Young - Senior Software Engineer >>>> >>>> SunGard - Enterprise House, Vision Park, Histon, Cambridge, CB24 9ZR >>>> >>>> Tel +44 1223 266042 - Main +44 1223 266100 - http://www.sungard.com/ >>>> >>>> >>>> >>>> CONFIDENTIALITY:? This email (including any attachments) may contain >>>> confidential, proprietary and privileged information, and unauthorized >>>> disclosure or use is prohibited.? If you received this email in error, >>>> please notify the sender and delete this email from your system.? Thank you. >>>> >>>> >>>> >>>> _______________________________________________ >>>> pypy-dev at codespeak.net >>>> http://codespeak.net/mailman/listinfo/pypy-dev >>>> >>> >>> _______________________________________________ >>> pypy-dev at codespeak.net >>> http://codespeak.net/mailman/listinfo/pypy-dev >> >> >> _______________________________________________ >> pypy-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/pypy-dev >> > From cfbolz at gmx.de Fri Jul 2 20:35:46 2010 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 02 Jul 2010 20:35:46 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> Message-ID: <4C2E3182.4020307@gmx.de> Hi Paolo, On 07/02/2010 02:08 PM, Paolo Giarrusso wrote: >>> Unsupported claim is for example that fast interpreters are 10x >>> slower than C. > That's the only unsupported claim, but it comes from "The Structure > and Performance of E?cient Interpreters". I studied that as a > student on VM, you are writing one, so I (unconsciously) guessed > that everybody knows that paper - I know that's a completely broken > way of writing, but I didn't spot it. Even if something is claimed by a well-known paper, it doesn't necessarily have to be true. The paper considers a class of interpreters where each specific bytecode does very little work (the paper does not make this assumption explicit). This is not the case for Python at all, so I think that the conclusions of the paper don't apply directly. This is explained quite clearly in the following paper: Virtual-Machine Abstraction and Optimization Techniques by Stefan Brunthaler in Bytecode 2009. [...] > Well, at the abstraction level I'm speaking, it sounds like there in > the end, the JIT will be able to do what is needed. I am not aware > of the details. But then, at the end of that project, it seems to me > that it should be possible to write the array module in pure Python > using this new FFI interface and have the JIT look at it, shouldn't > it? 
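As a rough illustration of that idea, an application-level array with flat,
unboxed storage can already be sketched today with ctypes standing in for the
future JIT-friendly FFI (this is only a sketch, not the proposed
implementation, and the class name is made up):

    import ctypes

    class DoubleArray(object):
        """Pure-Python array('d')-like object backed by a flat C buffer."""

        def __init__(self, size):
            self._buf = (ctypes.c_double * size)()   # unboxed storage
            self._size = size

        def __len__(self):
            return self._size

        def __getitem__(self, i):
            # the value gets boxed on read, but the storage itself stays flat
            return self._buf[i]

        def __setitem__(self, i, value):
            self._buf[i] = value

Whether a JIT can then remove the remaining boxing around such accesses is
the open question.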
I do not concentrate on array specifically - rewriting a few > modules at interpreter level is fine. But as a Python developer I > should have no need for that. That's a noble goal :-). I agree with the goal, but I still wanted to point out that the case of array is really quite outside of the range of possibilities of typical JIT compilers. Consider the hypothetical problem of having to write a pure-Python array module without using any other module, only builtin types. Then you would have to map arrays to be normal Python lists, and you would have no way to circumvent the fact that all objects in the lists are boxed. The JIT is now not helping you at all, because it only optimizes on a code level, and cannot change the way your data is structured in memory. I know that this is not at all how you are proposing the array module should be written, but I still wanted to point out that current JITs don't help you much if your data is represented in a bad way. We have some ideas how data representations could be optimized at runtime, but nothing implemented yet. Cheers, Carl Friedrich From p.giarrusso at gmail.com Fri Jul 2 21:35:55 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Fri, 2 Jul 2010 21:35:55 +0200 Subject: [pypy-dev] array performace? In-Reply-To: <4C2E3182.4020307@gmx.de> References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: On Fri, Jul 2, 2010 at 20:35, Carl Friedrich Bolz wrote: > Hi Paolo, > > On 07/02/2010 02:08 PM, Paolo Giarrusso wrote: >>>> Unsupported claim is for example that fast interpreters are 10x >>>> slower than C. >> That's the only unsupported claim, but it comes from "The Structure >> and Performance of E?cient Interpreters". I studied that as a >> student on VM, you are writing one, so I (unconsciously) guessed >> that everybody knows that paper - I know that's a completely broken >> way of writing, but I didn't spot it. > > Even if something is claimed by a well-known paper, it doesn't > necessarily have to be true. The paper considers a class of interpreters > where each specific bytecode does very little work (the paper does not > make this assumption explicit). This is not the case for Python at all, > so I think that the conclusions of the paper don't apply directly. Well, actually what I mention is not a conclusion of that paper, but what you say probably applies to the original paper which is referenced, so it doesn't matter. > This is explained quite clearly in the following paper: > > Virtual-Machine Abstraction and Optimization Techniques by Stefan > Brunthaler in Bytecode 2009. I already mentioned that paper, a couple of years ago, when discussing threading in PyPy, and my point was dismissed on general arguments. I'm happy to see now a paper stating your point, so that it can be discussed more precisely. But the obvious question is: given the mixed characteristics of the Lua interpreter, what is the instruction subdivision in that case? They write it's in the same class without any measurement, while it can complete an addition in 5 instructions instead of 3, and avoiding the need for separate loads. In Python, instead, refcounting alone is a very expensive operation. Beyond that, that paper also acknowledges that a virtual machine for Prolog, even if using dynamic types like Python, was in the same efficiency class as lower-level VMs. 
I agree however that other optimizations are needed first. I would expect Lua to seem more 'low-level' also from this point of view, and thus able to benefit more from threading. And with Python 3.0, where the distinction between int and long is gone, the Lua implementation would be almost fine, if one uses tagged integer and optimizes overflow checking through assembler (it's two lines of assembly code on x86/x86_64). > That's a noble goal :-). I agree with the goal, but I still wanted to > point out that the case of array is really quite outside of the range of > possibilities of typical JIT compilers. Consider the hypothetical > problem of having to write a pure-Python array module without using any > other module, only builtin types. Then you would have to map arrays to > be normal Python lists, and you would have no way to circumvent the fact > that all objects in the lists are boxed. The JIT is now not helping you > at all, because it only optimizes on a code level, and cannot change the > way your data is structured in memory. > I know that this is not at all how you are proposing the array module > should be written, but I still wanted to point out that current JITs > don't help you much if your data is represented in a bad way. We have > some ideas how data representations could be optimized at runtime, but > nothing implemented yet. OK, agreed. It would still be generally useful if the JIT _could_ optimize such cases, but that's hard enough. Especially, trying to recognize that the list is used with homogeneous element does not look easy in such a setting. However, again, what about tagged integers? They wouldn't allow optimizing all uses of arrays, but they would be generally useful on at least 31-bit integers and narrow characters. If I had more free time, and then also enough disk space to translate PyPy (I recall I hadn't when I conceived trying), I could maybe try doing that myself, with some help. Don't hold your breath for that, though. Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From hakan at debian.org Fri Jul 2 21:59:01 2010 From: hakan at debian.org (Hakan Ardo) Date: Fri, 2 Jul 2010 21:59:01 +0200 Subject: [pypy-dev] Interpreter level array implementation Message-ID: Hi, we got the simplest possible interpreter level implementation of an array-like object running (in the interplevel-array branch) and it executes my previous example about 2 times slower than optimized C. Attached is the trace generated by the following example: img=array(640*480); l=0; i=0; while i<640*480: l+=img[i] i+=1 a simplified version of that trace is: 1. [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i] 2. i14 = int_lt(i, 307200) 3. guard_true(i14, descr=) 4. guard_nonnull_class(p10, 145745952, descr=) 5. img = getfield_gc(p10, descr=) 6. f17 = getarrayitem_gc(img, i, descr=) 7. f18 = float_add(f11, f17) 8. i20 = int_add_ovf(i, 1) 9. guard_no_overflow(, descr=) # 10. i23 = getfield_raw(149604768, descr=) 11. i25 = int_add(i23, 1) 12. setfield_raw(149604768, i25, descr=) 13. i28 = int_and(i25, -2131755008) 14. i29 = int_is_true(i28) 15. guard_false(i29, descr=) 16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32), ConstPtr(ptr33), p8, p9, p10, f18, i20) Does these operation more or less correspond to assembler instructions? I guess that the extra overhead here as compared to the the C version would be line 4, 5, 9 and 10-15. What's 10-15 all about? 
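For what it's worth, lines 8 and 9 of the trace (int_add_ovf followed by
guard_no_overflow) are the overflow check on i+=1; spelled out in plain
Python for a machine word, they amount to something like the following
sketch:

    import sys

    def int_add_ovf(a, b):
        result = a + b
        if not (-sys.maxint - 1 <= result <= sys.maxint):
            # guard_no_overflow fails: execution leaves the trace and falls
            # back to the general code path (which would produce a long)
            raise OverflowError
        return result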
I guess that most of these additional operation would not affect the performance of more complicated loops as they will only occur once per loop (although combining the guard on line 9 with line 3 might be a possible optimization)? Line 4 will appear once for each array used in the loop and line 5 once for every array access, right? Can the array implementation be designed in someway that would not generate line 5 above? Or would it be possible to get rid of it by some optimization? -- H?kan Ard? -------------- next part -------------- A non-text attachment was scrubbed... Name: log Type: application/octet-stream Size: 2316 bytes Desc: not available URL: From alex.gaynor at gmail.com Fri Jul 2 22:12:19 2010 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Fri, 2 Jul 2010 15:12:19 -0500 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: On Fri, Jul 2, 2010 at 2:59 PM, Hakan Ardo wrote: > Hi, > we got the simplest possible interpreter level implementation of an > array-like object running (in the interplevel-array branch) and it > executes my previous example about 2 times slower than optimized C. > Attached is the trace generated by the following example: > > ? ?img=array(640*480); ? l=0; ? i=0; > ? ?while i<640*480: > ? ? ? ?l+=img[i] > ? ? ? ?i+=1 > > a simplified version of that trace is: > > ? 1. [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i] > ? 2. i14 = int_lt(i, 307200) > ? 3. ? guard_true(i14, descr=) > ? 4. ? guard_nonnull_class(p10, 145745952, descr=) > ? 5. img = getfield_gc(p10, descr=) > ? 6. f17 = getarrayitem_gc(img, i, descr=) > ? 7. f18 = float_add(f11, f17) > ? 8. i20 = int_add_ovf(i, 1) > ? 9. ? guard_no_overflow(, descr=) # > ?10. i23 = getfield_raw(149604768, descr=) > ?11. i25 = int_add(i23, 1) > ?12. setfield_raw(149604768, i25, descr=) > ?13. i28 = int_and(i25, -2131755008) > ?14. i29 = int_is_true(i28) > ?15. ? guard_false(i29, descr=) > ?16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32), > ? ? ? ? ? ConstPtr(ptr33), p8, p9, p10, f18, i20) > > Does these operation more or less correspond to assembler > instructions? I guess that the extra overhead here as compared to the > the C version would be line 4, 5, 9 and 10-15. What's 10-15 all about? > I guess that most of these additional operation would not affect the > performance of more complicated loops as they will only occur once per > loop (although combining the guard on line 9 with line 3 might be a > possible optimization)? Line 4 will appear once for each array used in > the loop and line 5 once for every array access, right? > > Can the array implementation be designed in someway that would not > generate line 5 above? Or would it be possible to get rid of it by > some optimization? > > -- > H?kan Ard? > > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > In addition to the things you noted, I guess the int overflow check can be optimized out, since i+=1 can never cause it to overflow given that i is bounded at 640*480. I suppose in general that would require more dataflow analysis. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Voltaire "The people's good is the highest law." 
-- Cicero "Code can always be simpler than you think, but never as simple as you want" -- Me From fijall at gmail.com Fri Jul 2 23:16:36 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 15:16:36 -0600 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: [snip] > the need for separate loads. In Python, instead, refcounting alone is > a very expensive operation. How does that apply to pypy? From fijall at gmail.com Fri Jul 2 23:21:17 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 15:21:17 -0600 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: General note - we consider 2x optimized C a pretty good result :) Details below On Fri, Jul 2, 2010 at 1:59 PM, Hakan Ardo wrote: > Hi, > we got the simplest possible interpreter level implementation of an > array-like object running (in the interplevel-array branch) and it > executes my previous example about 2 times slower than optimized C. > Attached is the trace generated by the following example: > > ? ?img=array(640*480); ? l=0; ? i=0; > ? ?while i<640*480: > ? ? ? ?l+=img[i] > ? ? ? ?i+=1 > > a simplified version of that trace is: > > ? 1. [p0, p1, p2, p3, i4, p5, p6, p7, p8, p9, p10, f11, i] > ? 2. i14 = int_lt(i, 307200) > ? 3. ? guard_true(i14, descr=) > ? 4. ? guard_nonnull_class(p10, 145745952, descr=) > ? 5. img = getfield_gc(p10, descr=) > ? 6. f17 = getarrayitem_gc(img, i, descr=) > ? 7. f18 = float_add(f11, f17) > ? 8. i20 = int_add_ovf(i, 1) > ? 9. ? guard_no_overflow(, descr=) # > ?10. i23 = getfield_raw(149604768, descr=) > ?11. i25 = int_add(i23, 1) > ?12. setfield_raw(149604768, i25, descr=) > ?13. i28 = int_and(i25, -2131755008) > ?14. i29 = int_is_true(i28) > ?15. ? guard_false(i29, descr=) > ?16. jump(p0, p1, p2, p3, 27, ConstPtr(ptr31), ConstPtr(ptr32), > ? ? ? ? ? ConstPtr(ptr33), p8, p9, p10, f18, i20) > > Does these operation more or less correspond to assembler > instructions? Yes. Use PYPYJITLOG=log pypy-c ... to get assembler. View using pypy/jit/backend/x86/tool/viewcode.py > I guess that the extra overhead here as compared to the > the C version would be line 4, 5, 9 and 10-15. What's 10-15 all about? It's about a couple of things that python interpreter has to perform. Notably asynchronous signal checking and thread swapping with GIL. > I guess that most of these additional operation would not affect the > performance of more complicated loops as they will only occur once per > loop (although combining the guard on line 9 with line 3 might be a > possible optimization)? Line 4 will appear once for each array used in > the loop and line 5 once for every array access, right? Yes. We don't do loop invariant optimizations for some reasons, the best of it being the fact that to loop you can always add a bridge which will invalidate this invariant. > > Can the array implementation be designed in someway that would not > generate line 5 above? Or would it be possible to get rid of it by > some optimization? No, it's about optimizations of JIT itself (it's an artifact of python looping rather than array module). > > -- > H?kan Ard? 
> > _______________________________________________ > pypy-dev at codespeak.net > http://codespeak.net/mailman/listinfo/pypy-dev > Cheers, fijal From bokr at oz.net Sat Jul 3 00:56:39 2010 From: bokr at oz.net (Bengt Richter) Date: Fri, 02 Jul 2010 15:56:39 -0700 Subject: [pypy-dev] array performace? In-Reply-To: <4C2E3182.4020307@gmx.de> References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: On 07/02/2010 11:35 AM Carl Friedrich Bolz wrote: > Hi Paolo, > > On 07/02/2010 02:08 PM, Paolo Giarrusso wrote: >>>> Unsupported claim is for example that fast interpreters are 10x >>>> slower than C. >> That's the only unsupported claim, but it comes from "The Structure >> and Performance of E???cient Interpreters". I studied that as a >> student on VM, you are writing one, so I (unconsciously) guessed >> that everybody knows that paper - I know that's a completely broken >> way of writing, but I didn't spot it. > > Even if something is claimed by a well-known paper, it doesn't > necessarily have to be true. The paper considers a class of interpreters > where each specific bytecode does very little work (the paper does not > make this assumption explicit). This is not the case for Python at all, > so I think that the conclusions of the paper don't apply directly. > > This is explained quite clearly in the following paper: > > Virtual-Machine Abstraction and Optimization Techniques by Stefan > Brunthaler in Bytecode 2009. > > > [...] >> Well, at the abstraction level I'm speaking, it sounds like there in >> the end, the JIT will be able to do what is needed. I am not aware >> of the details. But then, at the end of that project, it seems to me >> that it should be possible to write the array module in pure Python >> using this new FFI interface and have the JIT look at it, shouldn't >> it? I do not concentrate on array specifically - rewriting a few >> modules at interpreter level is fine. But as a Python developer I >> should have no need for that. > > That's a noble goal :-). I agree with the goal, but I still wanted to > point out that the case of array is really quite outside of the range of > possibilities of typical JIT compilers. Consider the hypothetical > problem of having to write a pure-Python array module without using any > other module, only builtin types. Then you would have to map arrays to > be normal Python lists, and you would have no way to circumvent the fact > that all objects in the lists are boxed. The JIT is now not helping you > at all, because it only optimizes on a code level, and cannot change the > way your data is structured in memory. > > I know that this is not at all how you are proposing the array module > should be written, but I still wanted to point out that current JITs > don't help you much if your data is represented in a bad way. We have > some ideas how data representations could be optimized at runtime, but > nothing implemented yet. A thought/question: Could/does JIT make use of information in an assert statement? E.g., could we write assert set(type(x) for x in img) == set([float]) and len(img)==640*480 in front of a loop operating on img and have JIT use the info as assumed true even when "if __debug__:" suites are optimized away? Could such assertions allow e.g. a list to be implemented as a homogeneous vector of unboxed representations? 
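For what it's worth, the same hint can be spelled without building an
intermediate set, which at least keeps the cost of the check itself down:

    assert len(img) == 640*480 and all(type(x) is float for x in img)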
What kind of guidelines for writing assertions would have to exist to make them useful to JIT most easily? Regards, Bengt Richter From amauryfa at gmail.com Sat Jul 3 01:14:40 2010 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Sat, 3 Jul 2010 01:14:40 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: Hi, 2010/7/3 Bengt Richter : > A thought/question: > > Could/does JIT make use of information in an assert statement? E.g., could we write > ? ? assert set(type(x) for x in img) == set([float]) and len(img)==640*480 > in front of a loop operating on img and have JIT use the info as assumed true > even when "if __debug__:" suites are optimized away? > > Could such assertions allow e.g. a list to be implemented as a homogeneous vector > of unboxed representations? > > What kind of guidelines for writing assertions would have to exist to make them > useful to JIT most easily? If efficient python code needs this, I'd better write the loop in C and explicitly choose the types. The C code could be inlined in the python script, and compiled on demand. At least you'll know what you get. -- Amaury Forgeot d'Arc From bokr at oz.net Sat Jul 3 02:38:16 2010 From: bokr at oz.net (Bengt Richter) Date: Fri, 02 Jul 2010 17:38:16 -0700 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: <4C2E8678.5070208@oz.net> On 07/02/2010 04:14 PM Amaury Forgeot d'Arc wrote: > Hi, > > 2010/7/3 Bengt Richter : >> A thought/question: >> >> Could/does JIT make use of information in an assert statement? E.g., could we write >> assert set(type(x) for x in img) == set([float]) and len(img)==640*480 >> in front of a loop operating on img and have JIT use the info as assumed true >> even when "if __debug__:" suites are optimized away? >> >> Could such assertions allow e.g. a list to be implemented as a homogeneous vector >> of unboxed representations? >> >> What kind of guidelines for writing assertions would have to exist to make them >> useful to JIT most easily? > > If efficient python code needs this, I'd better write the loop in C > and explicitly choose the types. > The C code could be inlined in the python script, and compiled on demand. > At least you'll know what you get. > Well, even C accepts hints like 'register' (and may ignore you, so you are not truly sure what you get ;-) The point of using assert would be to let the user remain within the python language, while still passing useful hints to the compiler. If I wanted to mix languages (not uninteresting!), I'd go with racket (the star formerly known as PLT-scheme) http://www.racket-lang.org/ They have extended programmability right down to the reader/tokenizer, so it might well be possible for them to accept literal C as a translated sub/macro-language, given the appropriate syntax definitions written in racket. 
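The "inline C, compiled on demand" suggestion above can already be
approximated with nothing but the standard library; a rough sketch, assuming
a Linux box with gcc on the PATH (error handling omitted for brevity):

    import ctypes, subprocess, tempfile

    C_SOURCE = """
    double total(double *img, int n) {
        double l = 0.0;
        for (int i = 0; i < n; i++)
            l += img[i];
        return l;
    }
    """

    def compile_inline_c(source):
        src = tempfile.NamedTemporaryFile(suffix='.c', delete=False)
        src.write(source)
        src.close()
        soname = src.name[:-2] + '.so'
        subprocess.check_call(['gcc', '-O2', '-std=c99', '-shared', '-fPIC',
                               src.name, '-o', soname])
        return ctypes.CDLL(soname)

    lib = compile_inline_c(C_SOURCE)
    lib.total.restype = ctypes.c_double
    lib.total.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]

    img = (ctypes.c_double * 5)(1.0, 2.0, 3.0, 4.0, 5.0)
    print lib.total(img, 5)    # prints 15.0

Of course this is exactly the "leave Python for performance" pattern the
thread is trying to avoid; it just keeps the C text next to the Python that
uses it.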
For more, see http://docs.racket-lang.org/guide/languages.html and more specifically http://docs.racket-lang.org/guide/hash-reader.html Regards, Bengt Richter From fijall at gmail.com Sat Jul 3 07:00:33 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 2 Jul 2010 23:00:33 -0600 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: On Fri, Jul 2, 2010 at 4:56 PM, Bengt Richter wrote: > On 07/02/2010 11:35 AM Carl Friedrich Bolz wrote: >> Hi Paolo, >> >> On 07/02/2010 02:08 PM, Paolo Giarrusso wrote: >>>>> Unsupported claim is for example that fast interpreters are 10x >>>>> slower than C. >>> That's the only unsupported claim, but it comes from "The Structure >>> and Performance of E???cient Interpreters". I studied that as a >>> student on VM, you are writing one, so I (unconsciously) guessed >>> that everybody knows that paper - I know that's a completely broken >>> way of writing, but I didn't spot it. >> >> Even if something is claimed by a well-known paper, it doesn't >> necessarily have to be true. The paper considers a class of interpreters >> where each specific bytecode does very little work (the paper does not >> make this assumption explicit). This is not the case for Python at all, >> so I think that the conclusions of the paper don't apply directly. >> >> This is explained quite clearly in the following paper: >> >> Virtual-Machine Abstraction and Optimization Techniques by Stefan >> Brunthaler in Bytecode 2009. >> >> >> [...] >>> Well, at the abstraction level I'm speaking, it sounds like there in >>> the end, the JIT will be able to do what is needed. I am not aware >>> of the details. But then, at the end of that project, it seems to me >>> that it should be possible to write the array module in pure Python >>> using this new FFI interface and have the JIT look at it, shouldn't >>> it? I do not concentrate on array specifically - rewriting a few >>> modules at interpreter level is fine. But as a Python developer I >>> should have no need for that. >> >> That's a noble goal :-). I agree with the goal, but I still wanted to >> point out that the case of array is really quite outside of the range of >> possibilities of typical JIT compilers. Consider the hypothetical >> problem of having to write a pure-Python array module without using any >> other module, only builtin types. Then you would have to map arrays to >> be normal Python lists, and you would have no way to circumvent the fact >> that all objects in the lists are boxed. The JIT is now not helping you >> at all, because it only optimizes on a code level, and cannot change the >> way your data is structured in memory. >> >> I know that this is not at all how you are proposing the array module >> should be written, but I still wanted to point out that current JITs >> don't help you much if your data is represented in a bad way. We have >> some ideas how data representations could be optimized at runtime, but >> nothing implemented yet. > > A thought/question: > > Could/does JIT make use of information in an assert statement? E.g., could we write > ? ? assert set(type(x) for x in img) == set([float]) and len(img)==640*480 > in front of a loop operating on img and have JIT use the info as assumed true > even when "if __debug__:" suites are optimized away? 
> > Could such assertions allow e.g. a list to be implemented as a homogeneous vector > of unboxed representations? > > What kind of guidelines for writing assertions would have to exist to make them > useful to JIT most easily? > > Regards, > Bengt Richter if you look closer this assertion is insanely complex to derive any informations from (you even used a generator expression). Besides, nothing stops you from changing that assumption later. You would need some sort of static analyzis which is either very hard or plain impossible in Python. Instead, we rather pursue ways of getting some runtime profiling data to get usage patterns. Cheers, fijal From hakan at debian.org Sat Jul 3 08:14:00 2010 From: hakan at debian.org (Hakan Ardo) Date: Sat, 3 Jul 2010 08:14:00 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski wrote: > General note - we consider 2x optimized C a pretty good result :) Details below As do I :) I just want to make this as jit-friendly as possible without rely knowing what's jit-friendly... > Yes. We don't do loop invariant optimizations for some reasons, the > best of it being the fact that to loop you can always add a bridge > which will invalidate this invariant. Are you telling me that you probably never will include that kind of optimization because of the limitations it imposes on other parts of the jit or just that it would be a lot of work to get it in place? What is a bridge? -- H?kan Ard? From fijall at gmail.com Sat Jul 3 08:20:01 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sat, 3 Jul 2010 00:20:01 -0600 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: On Sat, Jul 3, 2010 at 12:14 AM, Hakan Ardo wrote: > On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski wrote: >> General note - we consider 2x optimized C a pretty good result :) Details below > > As do I :) I just want ?to make this as jit-friendly as possible > without rely knowing what's jit-friendly... I think it's fairly JIT friendly. You can look into traces (as you did), but seems fine to me. > >> Yes. We don't do loop invariant optimizations for some reasons, the >> best of it being the fact that to loop you can always add a bridge >> which will invalidate this invariant. > > Are you telling me that you probably never will include that kind of > optimization because of the limitations it imposes on other parts of > the jit or just that it would be a lot of work to get it in place? It requires thinking. It's harder to do because we don't know statically upfront how many paths we'll compile to assembler, but I can think about ways to mitigate that. > > What is a bridge? If guard fails often enough, it's traced and compiled to assembler. That's a bridge > > -- > H?kan Ard? > From p.giarrusso at gmail.com Sat Jul 3 08:58:34 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sat, 3 Jul 2010 08:58:34 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: On Sat, Jul 3, 2010 at 08:20, Maciej Fijalkowski wrote: > On Sat, Jul 3, 2010 at 12:14 AM, Hakan Ardo wrote: >> On Fri, Jul 2, 2010 at 11:21 PM, Maciej Fijalkowski wrote: >>> General note - we consider 2x optimized C a pretty good result :) Details below >> >> As do I :) I just want ?to make this as jit-friendly as possible >> without rely knowing what's jit-friendly... > > I think it's fairly JIT friendly. 
You can look into traces (as you > did), but seems fine to me. >>> Yes. We don't do loop invariant optimizations for some reasons, the >>> best of it being the fact that to loop you can always add a bridge >>> which will invalidate this invariant. >> >> Are you telling me that you probably never will include that kind of >> optimization because of the limitations it imposes on other parts of >> the jit or just that it would be a lot of work to get it in place? > > It requires thinking. It's harder to do because we don't know > statically upfront how many paths we'll compile to assembler, but I > can think about ways to mitigate that. Isn't there some existing research about that in the 'tracing' community? As far as I remember, the theory is that traces are assembled in trace trees, and that each time a (simplified*) SSA optimization pass is applied to the trace tree to compile it. Not sure whether they do it also for Javascript, since there compilation times have to be very fast, but I guess they did so in their Java compiler. Also, in other cases the general JIT approach is 'optimize and invalidate if needed'. For instance, if a Java class has no subclass, it's not safe to assume this will hold forever to perform optimization; but the optimization is performed and a hook is installed so that class loading will undo the optimization. Another issue: what is i4 for? It's not used at all in the loop, but it is reset to 27 at the end of it, each time. Doesn't such a var waste some (little) time? * SSA on trace trees took advantage of their simpler structure compared to graphs for some operations. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From arigo at tunes.org Sat Jul 3 09:14:02 2010 From: arigo at tunes.org (Armin Rigo) Date: Sat, 3 Jul 2010 09:14:02 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: <20100703071402.GA19649@code0.codespeak.net> Hi Alex, On Fri, Jul 02, 2010 at 03:12:19PM -0500, Alex Gaynor wrote: > In addition to the things you noted, I guess the int overflow check > can be optimized out, since i+=1 can never cause it to overflow given > that i is bounded at 640*480. I suppose in general that would require > more dataflow analysis. Hakan mentioned this. It's actually an easy optimization in our linear code; I guess I will give it a try. A bientot, Armin. From arigo at tunes.org Sat Jul 3 09:28:14 2010 From: arigo at tunes.org (Armin Rigo) Date: Sat, 3 Jul 2010 09:28:14 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: <20100703072814.GB19649@code0.codespeak.net> Hi Paolo, On Sat, Jul 03, 2010 at 08:58:34AM +0200, Paolo Giarrusso wrote: > Isn't there some existing research about that in the 'tracing' > community? (...) Not sure > whether they do it also for Javascript, since there compilation times > have to be very fast, but I guess they did so in their Java compiler. We are not very good at mentioning existing research, but at least for the case of tracing JITs I think we know pretty much everything published, which you might find by googling for tracing JIT. (It's always a better approach than doing "guesses" in an unrelated project's mailing list.) 
For how PyPy's tracing JIT compares to existing approaches, there is a PyPy paper at: http://codespeak.net/svn/pypy/extradoc/talk/icooolps2009/ As well as the start of a draft about virtuals at: http://codespeak.net/svn/pypy/extradoc/talk/s3-2010/ And you should not miss Benjamin's great summary at: http://codespeak.net/pypy/dist/pypy/doc/jit/pyjitpl5.html A bientot, Armin. From anto.cuni at gmail.com Sat Jul 3 09:52:37 2010 From: anto.cuni at gmail.com (Antonio Cuni) Date: Sat, 03 Jul 2010 09:52:37 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: <4C2EEC45.3090205@gmail.com> On 03/07/10 08:14, Hakan Ardo wrote: > What is a bridge? you might be interested to read the chapter of my PhD thesis which explains exactly that, with diagrams: http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf In particular, section 6.4 explains the difference between loops, bridges and entry bridges. ciao, Anto From cfbolz at gmx.de Sat Jul 3 10:03:27 2010 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Sat, 3 Jul 2010 10:03:27 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: Hi Paolo, 2010/7/3 Paolo Giarrusso : >> It requires thinking. It's harder to do because we don't know >> statically upfront how many paths we'll compile to assembler, but I >> can think about ways to mitigate that. > > Isn't there some existing research about that in the 'tracing' > community? As far as I remember, the theory is that traces are > assembled in trace trees, and that each time a (simplified*) SSA > optimization pass is applied to the trace tree to compile it. Not sure > whether they do it also for Javascript, since there compilation times > have to be very fast, but I guess they did so in their Java compiler. There are two ways to deal with attaching now traces to existing ones. On the one hand there are trace trees, which recompile the whole tree of traces when a new one is added. This can be costly. On the other hand, there is trace stitching, which just patches the existing trace to jump to the new one. PyPy (and TraceMonkey, I think) uses trace stitching. The problem with loop-invarian code motion is that when you stitch in a new trace (what we call a bridge) it is not clear that the code that was invariant so far is invariant on the new path as well. Cheers, Carl Friedrich From santagada at gmail.com Sat Jul 3 09:57:51 2010 From: santagada at gmail.com (Leonardo Santagada) Date: Sat, 3 Jul 2010 04:57:51 -0300 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: On Jul 3, 2010, at 3:58 AM, Paolo Giarrusso wrote: > Another issue: what is i4 for? It's not used at all in the loop, but > it is reset to 27 at the end of it, each time. Doesn't such a var > waste some (little) time? This I found interesting. Do anyone know the answer? -- Leonardo Santagada santagada at gmail.com From p.giarrusso at gmail.com Sat Jul 3 10:14:34 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sat, 3 Jul 2010 10:14:34 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: <20100703072814.GB19649@code0.codespeak.net> References: <20100703072814.GB19649@code0.codespeak.net> Message-ID: On Sat, Jul 3, 2010 at 09:28, Armin Rigo wrote: > Hi Paolo, > > On Sat, Jul 03, 2010 at 08:58:34AM +0200, Paolo Giarrusso wrote: >> Isn't there some existing research about that in the 'tracing' >> community? ?(...) ? 
Not sure >> whether they do it also for Javascript, since there compilation times >> have to be very fast, but I guess they did so in their Java compiler. > > We are not very good at mentioning existing research, but at least for > the case of tracing JITs I think we know pretty much everything > published, which you might find by googling for tracing JIT. ?(It's > always a better approach than doing "guesses" in an unrelated project's > mailing list.) If you had read the next sentence you'd have found out that I did read some papers about that (where I learned about trace trees). My guess was just about whether their Java compiler used trace trees or the other possibility, i.e., trace stitching (as I now learned). But thanks for the references, I'll have a look later. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From william.leslie.ttg at gmail.com Sat Jul 3 16:20:46 2010 From: william.leslie.ttg at gmail.com (William Leslie) Date: Sun, 4 Jul 2010 00:20:46 +1000 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: On 3 July 2010 08:56, Bengt Richter wrote: > On 07/02/2010 11:35 AM Carl Friedrich Bolz wrote: > A thought/question: > > Could/does JIT make use of information in an assert statement? E.g., could we write > ? ? assert set(type(x) for x in img) == set([float]) and len(img)==640*480 > in front of a loop operating on img and have JIT use the info as assumed true > even when "if __debug__:" suites are optimized away? There are several reasons we can't make use of such information from the JIT at the moment. It requires more information that we have, and it is difficult to analyse quickly. If img is visible from outside the current thread, for example, the ad-hoc memory model of the python language means we would have to order writes and reads to img from other threads with the JIT's own accesses. Similarly, functions that we call may insert objects that break this invariant. Determining when this may occur requires analysing a lot of code - for example, if *one* type was not int, it could implement a __radd__ method that broke the invariant. It's typically faster to just execute the code than to find out. In the presence of whole-program optimisation this sort of thing is possible, with the right analysis it may be possible within the JIT, but the question remains as to if it will be profitable. (This is an area I have been exploring, but don't hold your breath for results.) On 3 July 2010 10:38, Bengt Richter wrote: > On 07/02/2010 04:14 PM Amaury Forgeot d'Arc wrote: >> If efficient python code needs this, I'd better write the loop in C >> and explicitly choose the types. >> The C code could be inlined in the python script, and compiled on demand. >> At least you'll know what you get. >> > Well, even C accepts hints like 'register' (and may ignore you, so you are not truly sure what you get ;-) > > The point of using assert would be to let the user remain within the python language, while still passing > useful hints to the compiler. Interesting you mention racket. Racket comes with a static language that integrates with their usual dynamic Scheme. Many common lisp implementations provide optional typing. 
Paolo recently bemoaned the trend toward writing modules at interp level for speed* - I'm not really sure if it is a trend now or not - but at some point it might be fun looking at optional typing annotations that compile the case for those assumptions. It might be a precursor to cython or pyrex support. * with justification : though ok for the stdlib, translating pypy every time you add an extension module is going to get old. fast. > Could such assertions allow e.g. a list to be implemented as a homogeneous vector > of unboxed representations? Pypy is already great in terms of data layout, for example pypy uses shadow classes in the form of 'structures', but supporting more complicated layout optimisations (such as row or column order storage for structures so the JIT can do relational algebra) would probably be unique. It doesn't seem so far off considering that in the progression (list int) -> (list unpacked tuple int) -> (list unpacked homogenous structure), the first step, limiting or otherwise determining the item type, is the most complicated. > If I wanted to mix languages (not uninteresting!), I'd go with > racket (the star formerly known as PLT-scheme) -- possible can of worms -- As for mixing languages, that is the pinnacle of awesome; but this is probably not the list for it. MLVMs such as JVM+JSR-292, Racket, GNU Guile, and Parrot; it seems to me that once you settle on an execution / object model and / or bytecode format, you've already decided what languages (where the 's' seems superfluous) support is going to be first class for. Don't get me wrong, I find each of these really exciting, but good multi-platform integration is a much harder problem than writing a few compilers with a common bytecode format; and even the common bytecode format is probably not a good idea, because different languages need (really) different primatives, as pirate has bought out. Other impedance mismatches, such as calling conventions (eg, javascript and lua functions silently accepting an incorrect number of arguments), reduction methods (applicative vs normal order vs call-by-name), mutable strings, TCE, various type systems involving structural types, Oliviera/Sulzmann classes, existential types, dependant types, value types, single and multiple inheretance, and the completely insane (prolog) make implementing real multi-language platforms a mammoth task. And even if you manage to get that working, how do you make exception hierarchies work? Why can't I cast my Java ArrayList as a C# ArrayList? etc. Sure, you could probably hook up a few of the bundled VMs, IO or E would make for a great twisted integration DSL. But actually convincing people to lock themselves into an unstandardised, unproven chimera? Lets just say that doing multi-language right is NP-hard. Doing it while targeting JVM and CLI, offering platform integration while supporting exotic language constructs like real continuations? Likely impossible. It's a nice idea, but probably out of Pypy's scope. -- William Leslie From p.giarrusso at gmail.com Sat Jul 3 18:51:49 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sat, 3 Jul 2010 18:51:49 +0200 Subject: [pypy-dev] array performace? 
In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: On Fri, Jul 2, 2010 at 23:16, Maciej Fijalkowski wrote: > [snip] > >> the need for separate loads. In Python, instead, refcounting alone is >> a very expensive operation. > > > How does that apply to pypy? I was talking about the original paper. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From p.giarrusso at gmail.com Sat Jul 3 19:22:54 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sat, 3 Jul 2010 19:22:54 +0200 Subject: [pypy-dev] array performace? In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: On Sat, Jul 3, 2010 at 16:20, William Leslie wrote: > On 3 July 2010 08:56, Bengt Richter wrote: >> On 07/02/2010 11:35 AM Carl Friedrich Bolz wrote: > Paolo recently bemoaned the > trend toward writing modules at interp level for speed* - I'm not > really sure if it is a trend now or not - but at some point it might > be fun looking at optional typing annotations that compile the case > for those assumptions. It might be a precursor to cython or pyrex > support. > * with justification : though ok for the stdlib, translating pypy > every time you add an extension module is going to get old. fast. That's one point, but it's not the biggest one. I guess that if that happens often enough, at some point one will need to implement separate compilation for RPython as well (at least for development). I mean, whole-program optimization (which one would maybe lose) is optional in other languages. 1) The real problem is that you don't want users to need interp-level coding for their program. If they need, there's something wrong (and I now think/hope it's not the case). 2) Another instance of the same issue happens when Python developers are suggested to write extensions in C or to perform inlining by hand. 3) The last case is users avoiding Python (or another high-level language) altogether because of bad performance. The common factor is that in all cases, a weakness of the implementation makes the abstraction less desirable, and thus user programs are hand-optimized and become less maintainable. That's why efficient JITs (including PyPy) are important. It is interesting that 2) stems also from the desire of Guido van Rossum to keep CPython simple, while complicating life for its users. >> Could such assertions allow e.g. a list to be implemented as a homogeneous vector >> of unboxed representations? > Pypy is already great in terms of data layout, for example pypy uses > shadow classes in the form of 'structures', but supporting more > complicated layout optimisations (such as row or column order storage > for structures so the JIT can do relational algebra) would probably be > unique. It doesn't seem so far off considering that in the progression > (list int) -> (list unpacked tuple int) -> (list unpacked homogenous > structure), the first step, limiting or otherwise determining the item > type, is the most complicated. > As for mixing languages, that is the pinnacle of awesome; but this is > probably not the list for it. 
MLVMs such as JVM+JSR-292, Racket, GNU > Guile, and Parrot; it seems to me that once you settle on an execution > / object model and / or bytecode format, you've already decided what > languages (where the 's' seems superfluous) support is going to be > first class for. You are right about "first class support". But assembly doesn't offer first class support for anything, and still you can make it work. Of course, bytecodes are more limited, but sometimes you might manage. I had 3 colleague students who implemented, for instance, a Python-to-JVM bytecode compiler which was way faster than Jython. Which was the trick? Python methods were encoded as Java classes (maybe with static methods), and they performed inline-caching in bytecode, i.e., each call was converted to something like if (target.class() == this_class) specificMethodClass.perform(target, args) else (perform normal method resolution, and possibly regenerate the class). I'm unsure about the actual call produced for the call - either they used static classes, or they just relied on inline-caching/inlining by the underlying JIT. Another detail (I guess) is that you need some form of shadow classes (like Self, V8, and also PyPy I guess - if you talk about the same thing). Unfortunately, I don't know whether they published their code - it was for a term project for a course held by Lars Bak (the V8 author) in Aarhus. It worked quite well, and there was still potential for optimization. I don't know how feature-complete they were, though; still, they managed to perform a meta-implementation of Inline-Caching (and the same trick allows also polymorphic inline-caching), where meta- is used like in meta-interpreter. I guess it would still be possible to interoperate with Java classes - you can still provide, I think, a conventional interface (where methods become just... methods), even if possibly it will be slower. > Other impedance mismatches, such as calling conventions (eg, > javascript and lua functions silently accepting an incorrect number of > arguments), reduction methods (applicative vs normal order vs > call-by-name), mutable strings, TCE, various type systems involving > structural types, Oliviera/Sulzmann classes, existential types, > dependant types, value types, single and multiple inheretance, and the > completely insane (prolog) make implementing real multi-language > platforms a mammoth task. And even if you manage to get that working, > how do you make exception hierarchies work? > Why can't I cast my Java > ArrayList as a C# ArrayList? etc. Well, this latter question seems somehow solved by .NET, even if they don't really support the original libraries. Or you just use the VM and write conversion functions for that. > Sure, you could probably hook up a few of the bundled VMs, IO or E IO? E? > would make for a great twisted integration DSL. But actually > convincing people to lock themselves into an unstandardised, unproven > chimera? Lets just say that doing multi-language right is NP-hard. > Doing it while targeting JVM and CLI, offering platform integration > while supporting exotic language constructs like real continuations? Now that you mention it, I wonder about how Scala's future support (in next release) for (delimited) continuations will work. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From anto.cuni at gmail.com Sat Jul 3 21:23:32 2010 From: anto.cuni at gmail.com (Antonio Cuni) Date: Sat, 03 Jul 2010 21:23:32 +0200 Subject: [pypy-dev] array performace? 
In-Reply-To: References: <20100701152827.GA30661@code0.codespeak.net> <01781CA2CC22B145B230504679ECF48C01AC4415@EMEA-EXCHANGE03.internal.sungard.corp> <01781CA2CC22B145B230504679ECF48C01AC445A@EMEA-EXCHANGE03.internal.sungard.corp> <4C2E3182.4020307@gmx.de> Message-ID: <4C2F8E34.3030701@gmail.com> On 03/07/10 19:22, Paolo Giarrusso wrote: > I had 3 colleague students who implemented, for instance, a > Python-to-JVM bytecode compiler which was way faster than Jython. > Which was the trick? [cut] I'm ready to bet that they did not implement a Python compiler, but a simil-Python language that implements 80/90/95% of Python features. The web is full of projects like this. I'm not saying that the techniques used for that project are not worth of attention, just that probably "the trick" was not to support the features of Python that are hardest to implement efficiently. ciao, Anto From hakan at debian.org Sun Jul 4 10:50:25 2010 From: hakan at debian.org (Hakan Ardo) Date: Sun, 4 Jul 2010 10:50:25 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: On Sat, Jul 3, 2010 at 8:20 AM, Maciej Fijalkowski wrote: >>> Yes. We don't do loop invariant optimizations for some reasons, the >>> best of it being the fact that to loop you can always add a bridge >>> which will invalidate this invariant. >> >> Are you telling me that you probably never will include that kind of >> optimization because of the limitations it imposes on other parts of >> the jit or just that it would be a lot of work to get it in place? > > It requires thinking. It's harder to do because we don't know > statically upfront how many paths we'll compile to assembler, but I > can think about ways to mitigate that. Could it be treated similar to how you handle: s=0 i=0 while i<100000: s+=i i+=1 if i>50000: i=float(i) which nicely generates two separate traces I believe... -- H?kan Ard? From p.giarrusso at gmail.com Sun Jul 4 11:04:01 2010 From: p.giarrusso at gmail.com (Paolo Giarrusso) Date: Sun, 4 Jul 2010 11:04:01 +0200 Subject: [pypy-dev] Interpreter level array implementation In-Reply-To: References: Message-ID: Hi Carl, first, thanks for reading and for your explanation. On Sat, Jul 3, 2010 at 10:03, Carl Friedrich Bolz wrote: > 2010/7/3 Paolo Giarrusso : >>> It requires thinking. It's harder to do because we don't know >>> statically upfront how many paths we'll compile to assembler, but I >>> can think about ways to mitigate that. >> >> Isn't there some existing research about that in the 'tracing' >> community? As far as I remember, the theory is that traces are >> assembled in trace trees, and that each time a (simplified*) SSA >> optimization pass is applied to the trace tree to compile it. Not sure >> whether they do it also for Javascript, since there compilation times >> have to be very fast, but I guess they did so in their Java compiler. > > There are two ways to deal with attaching now traces to existing ones. > On the one hand there are trace trees, which recompile the whole tree > of traces when a new one is added. This can be costly. On the other > hand, there is trace stitching, which just patches the existing trace > to jump to the new one. PyPy (and TraceMonkey, I think) uses trace > stitching. For TraceMonkey, response times suggest the usage of trace stitching. The original Java compiler used trace trees. 
But if I have a Python application server, I'm probably willing to accept the bigger compilation time, especially if compilation is performed by a background thread. Would it be possible to accommodate this case? > The problem with loop-invarian code motion is that when you stitch in > a new trace (what we call a bridge) it is not clear that the code that > was invariant so far is invariant on the new path as well. I see - but what about noting potential modifications to the involved objects and invalidating the old traces, similarly to how classloading invalidates other optimizations? Of course, some heuristics and tuning would be needed I guess, since I expect that invalidations here would be much more frequent otherwise. Such heuristics would probably approximate a solution to the problem mentioned by Maciej: > It requires thinking. It's harder to do because we don't know > statically upfront how many paths we'll compile to assembler, but I > can think about ways to mitigate that. However, I still wonder how easy it is to recognize a potential write. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/ From fijall at gmail.com Sun Jul 4 22:25:30 2010 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 4 Jul 2010 14:25:30 -0600 Subject: [pypy-dev] [pypy-svn] r75824 - in pypy/branch/interplevel-array/pypy/module/array: . test In-Reply-To: <20100704190622.4F935282B9D@codespeak.net> References: <20100704190622.4F935282B9D@codespeak.net> Message-ID: > + > + ? ?def item_w(self, w_item): > + ? ? ? ?space=self.space > + ? ? ? ?if self.typecode == 'c': > + ? ? ? ? ? ?return self.space.str_w(w_item) > + ? ? ? ?elif self.typecode == 'u': > + ? ? ? ? ? ?return self.space.unicode_w(w_item) > + > + ? ? ? ?elif self.typecode == 'b': > + ? ? ? ? ? ?item=self.space.int_w(w_item) > + ? ? ? ? ? ?if item<-128: > + ? ? ? ? ? ? ? ?msg='signed char is less than minimum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?elif item>127: > + ? ? ? ? ? ? ? ?msg='signed char is greater than maximum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?return rffi.cast(rffi.SIGNEDCHAR, item) > + ? ? ? ?elif self.typecode == 'B': > + ? ? ? ? ? ?item=self.space.int_w(w_item) > + ? ? ? ? ? ?if item<0: > + ? ? ? ? ? ? ? ?msg='unsigned byte integer is less than minimum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?elif item>255: > + ? ? ? ? ? ? ? ?msg='unsigned byte integer is greater than maximum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?return rffi.cast(rffi.UCHAR, item) > + > + ? ? ? ?elif self.typecode == 'h': > + ? ? ? ? ? ?item=self.space.int_w(w_item) > + ? ? ? ? ? ?if item<-32768: > + ? ? ? ? ? ? ? ?msg='signed short integer is less than minimum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?elif item>32767: > + ? ? ? ? ? ? ? ?msg='signed short integer is greater than maximum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?return rffi.cast(rffi.SHORT, item) > + ? ? ? ?elif self.typecode == 'H': > + ? ? ? ? ? ?item=self.space.int_w(w_item) > + ? ? ? ? ? ?if item<0: > + ? ? ? ? ? ? ? ?msg='unsigned short integer is less than minimum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?elif item>65535: > + ? ? ? ? ? ? ? 
?msg='unsigned short integer is greater than maximum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?return rffi.cast(rffi.USHORT, item) > + > + ? ? ? ?elif self.typecode in ('i', 'l'): > + ? ? ? ? ? ?item=self.space.int_w(w_item) > + ? ? ? ? ? ?if item<-2147483648: > + ? ? ? ? ? ? ? ?msg='signed integer is less than minimum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?elif item>2147483647: > + ? ? ? ? ? ? ? ?msg='signed integer is greater than maximum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?return rffi.cast(lltype.Signed, item) > + ? ? ? ?elif self.typecode in ('I', 'L'): > + ? ? ? ? ? ?item=self.space.int_w(w_item) > + ? ? ? ? ? ?if item<0: > + ? ? ? ? ? ? ? ?msg='unsigned integer is less than minimum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?elif item>4294967295: > + ? ? ? ? ? ? ? ?msg='unsigned integer is greater than maximum' > + ? ? ? ? ? ? ? ?raise OperationError(space.w_OverflowError, space.wrap(msg)) > + ? ? ? ? ? ?return rffi.cast(lltype.Unsigned, item) > + > + ? ? ? ?elif self.typecode == 'f': > + ? ? ? ? ? ?item=self.space.float_w(w_item) > + ? ? ? ? ? ?return rffi.cast(lltype.SingleFloat, item) > + ? ? ? ?elif self.typecode == 'd': > + ? ? ? ? ? ?return self.space.float_w(w_item) > + Hey. This looks a bit ugly, you can definitely do it with some constant dict or something (we have special support for iterating over constants and unrolling the iteration, look for unrolling_iterable). Also, annotator can fold a bunch of ifs into a switch, but not if "in" operator is used (or is fine though). From hakan at debian.org Mon Jul 5 07:54:59 2010 From: hakan at debian.org (Hakan Ardo) Date: Mon, 5 Jul 2010 07:54:59 +0200 Subject: [pypy-dev] [pypy-svn] r75824 - in pypy/branch/interplevel-array/pypy/module/array: . test In-Reply-To: References: Message-ID: On Sun, Jul 4, 2010 at 10:25 PM, Maciej Fijalkowski wrote: > > Hey. This looks a bit ugly, ?It does, doesn't it :) > ?you can definitely do it with some > constant dict or something Yes, there is an overflow check needed on the integer types but not on the character an float types, but I guess that could be solved with a flag in the dict. I was actually considering to introduce separate subclasses for each typecode overriding intem_w and descr_getitem. That would get rid of the typecode attribute lookup all together. > (we have special support for iterating over > constants and unrolling the iteration, look for unrolling_iterable). > Also, annotator can fold a bunch of ifs into a switch, but not if "in" > operator is used (or is fine though). That's nice features, good to know about. Thanx. -- H?kan Ard? From hakan at debian.org Mon Jul 5 08:53:20 2010 From: hakan at debian.org (Hakan Ardo) Date: Mon, 5 Jul 2010 08:53:20 +0200 Subject: [pypy-dev] [pypy-svn] r75824 - in pypy/branch/interplevel-array/pypy/module/array: . test In-Reply-To: References: <20100704190622.4F935282B9D@codespeak.net> Message-ID: I've checked in a dict-based version. Not sure it became that clean after all. Is the getattr(space, tc.unwrap) construction ok? On Mon, Jul 5, 2010 at 7:26 AM, Hakan Ardo wrote: > On Sun, Jul 4, 2010 at 10:25 PM, Maciej Fijalkowski wrote: >> >> Hey. 
This looks a bit ugly, > > ?It does, doesn't it :) > >> ?you can definitely do it with some >> constant dict or something > > Yes, there is an overflow check needed on the integer types but not on > the character an float types, but I guess that could be solved with a > flag in the dict. > > I was actually considering to introduce separate subclasses for each > typecode overriding intem_w and descr_getitem. That would get rid of > the typecode attribute lookup all together. > >> (we have special support for iterating over >> constants and unrolling the iteration, look for unrolling_iterable). >> Also, annotator can fold a bunch of ifs into a switch, but not if "in" >> operator is used (or is fine though). > > That's nice features, good to know about. Thanx. > > > > -- > H?kan Ard? > -- H?kan Ard? From bhartsho at yahoo.com Fri Jul 9 04:08:19 2010 From: bhartsho at yahoo.com (Hart's Antler) Date: Thu, 8 Jul 2010 19:08:19 -0700 (PDT) Subject: [pypy-dev] Interactive Translation and JIT Message-ID: <756045.59228.qm@web114009.mail.gq1.yahoo.com> I'm using Jason Creighton branch, and i am trying to test the JIT from interactive translation. Is it now allowed? I'm getting this error: NotImplementedError: --gcrootfinder=asmgcc requires standalone Or am i not setting the options correctly on the translator, here is how i'm translating. from pypy.translator.interactive import Translation t = Translation( pypy_entry_point ) t.config.translation.suggest(jit=True, jit_debug='steps', jit_backend='x86', gc='boehm') t.annotate() t.rtype() f = t.compile_c() f() complete code: http://pastebin.com/T42cqSbz demo: http://www.youtube.com/watch?v=HwbDG3Rdi_Q -brett From arigo at tunes.org Fri Jul 9 09:43:51 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 9 Jul 2010 09:43:51 +0200 Subject: [pypy-dev] Interactive Translation and JIT In-Reply-To: <756045.59228.qm@web114009.mail.gq1.yahoo.com> References: <756045.59228.qm@web114009.mail.gq1.yahoo.com> Message-ID: <20100709074351.GA8538@code0.codespeak.net> Hi Brett, On Thu, Jul 08, 2010 at 07:08:19PM -0700, Hart's Antler wrote: > I'm using Jason Creighton branch, and i am trying to test the JIT from > interactive translation. Is it now allowed? I'm getting this error: > NotImplementedError: --gcrootfinder=asmgcc requires standalone Indeed, it is not allowed. As far as I know, the interactive translation does not support making standalone programs. You need to run translate.py as described e.g. here: http://codespeak.net/pypy/dist/pypy/doc/getting-started-python.html#translating-the-pypy-python-interpreter A bientot, Armin. From arigo at tunes.org Fri Jul 9 09:47:19 2010 From: arigo at tunes.org (Armin Rigo) Date: Fri, 9 Jul 2010 09:47:19 +0200 Subject: [pypy-dev] Interactive Translation and JIT In-Reply-To: <20100709074351.GA8538@code0.codespeak.net> References: <756045.59228.qm@web114009.mail.gq1.yahoo.com> <20100709074351.GA8538@code0.codespeak.net> Message-ID: <20100709074719.GB8538@code0.codespeak.net> Re-hi, On Fri, Jul 09, 2010 at 09:43:51AM +0200, Armin Rigo wrote: > You need to run translate.py as described e.g. here: ... or to use pypy/jit/tl/pypyjit.py for a quick test of the JIT running on top of PyPy -- although you won't get any assembler, but only the so-called 'llgraph' backend, which emulates assembler by hand using higher-level type-safe operations. (It should be possible in theory to tweak pypyjit.py to really use the x86 backend.) A bientot, Armin. 
From Dave.Cross at cdl.co.uk Tue Jul 13 13:09:56 2010 From: Dave.Cross at cdl.co.uk (Dave Cross) Date: Tue, 13 Jul 2010 12:09:56 +0100 Subject: [pypy-dev] Windows binaries Message-ID: Hi, Is there a likely delivery date for Windows binaries of PyPy 1.3? Dave.

**********************************************************************
Please consider the environment - do you really need to print this email?

This email is intended only for the person(s) named above and may contain private and confidential information. If it has come to you in error, please destroy and permanently delete any copy in your possession and contact us on +44 (0) 161 480 4420. The information in this email is copyright © CDL Group Holdings Limited. We cannot accept any liability for any loss or damage sustained as a result of software viruses. It is your responsibility to carry out such virus checking as is necessary before opening any attachment.
Cheshire Datasystems Limited uses software which automatically screens incoming emails for inappropriate content and attachments. If the software identifies such content or attachment, the email will be forwarded to our Technology Department for checking. You should be aware that any email which you send to Cheshire Datasystems Limited is subject to this procedure.
Cheshire Datasystems Limited, Strata House, Kings Reach Road, Stockport SK4 2HD
Registered in England and Wales with Company Number 3991057
VAT registration: 727 1188 33

 

From fijall at gmail.com  Tue Jul 13 13:57:31 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 13 Jul 2010 13:57:31 +0200
Subject: [pypy-dev] Windows binaries
In-Reply-To: 
References: 
Message-ID: 

On Tue, Jul 13, 2010 at 1:09 PM, Dave Cross  wrote:
> Hi,
>
>
>
> Is there a likely delivery date for Windows binaries of PyPy 1.3?
>

Eh, sorry, my fault, will upload them today.

>
>
> Dave.
>
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>


From fijall at gmail.com  Sun Jul 18 18:36:24 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Sun, 18 Jul 2010 18:36:24 +0200
Subject: [pypy-dev] [pypy-svn] r76268 - pypy/branch/micronumpy/pypy/tool
In-Reply-To: <20100716224104.25C49282BD4@codespeak.net>
References: <20100716224104.25C49282BD4@codespeak.net>
Message-ID: 

Benchmarks generally should go to the pypy/benchmarks directory in the
main source tree (that is svn+ssh://codespeak.net/svn/pypy/benchmarks)

On Sat, Jul 17, 2010 at 12:41 AM,   wrote:
> Author: dan
> Date: Sat Jul 17 00:41:02 2010
> New Revision: 76268
>
> Added:
>    pypy/branch/micronumpy/pypy/tool/convolve.py
> Modified:
>    pypy/branch/micronumpy/pypy/tool/numpybench.py
> Log:
> Oops, I forgot the most important part of the benchmark!
>
> Added: pypy/branch/micronumpy/pypy/tool/convolve.py
> ==============================================================================
> --- (empty file)
> +++ pypy/branch/micronumpy/pypy/tool/convolve.py        Sat Jul 17 00:41:02 2010
> @@ -0,0 +1,43 @@
> +from __future__ import division
> +from __main__ import numpy as np
> +
> +def naive_convolve(f, g):
> +    # f is an image and is indexed by (v, w)
> +    # g is a filter kernel and is indexed by (s, t),
> +    #   it needs odd dimensions
> +    # h is the output image and is indexed by (x, y),
> +    #   it is not cropped
> +    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
> +        raise ValueError("Only odd dimensions on filter supported")
> +    # smid and tmid are number of pixels between the center pixel
> +    # and the edge, ie for a 5x5 filter they will be 2.
> +    #
> +    # The output size is calculated by adding smid, tmid to each
> +    # side of the dimensions of the input image.
> +    vmax = f.shape[0]
> +    wmax = f.shape[1]
> +    smax = g.shape[0]
> +    tmax = g.shape[1]
> +    smid = smax // 2
> +    tmid = tmax // 2
> +    xmax = vmax + 2*smid
> +    ymax = wmax + 2*tmid
> +    # Allocate result image.
> +    h = np.zeros([xmax, ymax], dtype=f.dtype)
> +    # Do convolution
> +    for x in range(xmax):
> +        for y in range(ymax):
> +            # Calculate pixel value for h at (x,y). Sum one component
> +            # for each pixel (s, t) of the filter g.
> +            s_from = max(smid - x, -smid)
> +            s_to = min((xmax - x) - smid, smid + 1)
> +            t_from = max(tmid - y, -tmid)
> +            t_to = min((ymax - y) - tmid, tmid + 1)
> +            value = 0
> +            for s in range(s_from, s_to):
> +                for t in range(t_from, t_to):
> +                    v = x - smid + s
> +                    w = y - tmid + t
> +                    value += g[smid - s, tmid - t] * f[v, w]
> +            h[x, y] = value
> +    return h
>
> Modified: pypy/branch/micronumpy/pypy/tool/numpybench.py
> ==============================================================================
> --- pypy/branch/micronumpy/pypy/tool/numpybench.py      (original)
> +++ pypy/branch/micronumpy/pypy/tool/numpybench.py      Sat Jul 17 00:41:02 2010
> @@ -21,13 +21,29 @@
>      return numpy.array(kernel)
>
>  if __name__ == '__main__':
> -    from sys import argv as args
> -    width, height, kwidth, kheight = [int(x) for x in args[1:]]
> +    from optparse import OptionParser
> +
> +    option_parser = OptionParser()
> +    option_parser.add_option('--kernel-size', dest='kernel', default='3x3',
> +                             help="The size of the convolution kernel, given as WxH. ie 3x3"
> +                                  "Note that both dimensions must be odd.")
> +    option_parser.add_option('--image-size', dest='image', default='256x256',
> +                             help="The size of the image, given as WxH. ie. 256x256")
> +    option_parser.add_option('--runs', '--count', dest='count', default=1000,
> +                             help="The number of times to run the convolution filter")
> +
> +    options, args = option_parser.parse_args()
> +
> +    def parse_dimension(arg):
> +        return [int(s.strip()) for s in arg.split('x')]
> +
> +    width, height = parse_dimension(options.image)
> +    kwidth, kheight = parse_dimension(options.kernel)
> +    count = int(options.count)
>
>      image = generate_image(width, height)
>      kernel = generate_kernel(kwidth, kheight)
>
>      from timeit import Timer
>      convolve_timer = Timer('naive_convolve(image, kernel)', 'from convolve import naive_convolve; from __main__ import image, kernel; gc.enable()')
> -    count = 100
>      print "%.5f sec/pass" % (convolve_timer.timeit(number=count)/count)
> _______________________________________________
> pypy-svn mailing list
> pypy-svn at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-svn
>


From jcreigh at gmail.com  Thu Jul 22 15:34:55 2010
From: jcreigh at gmail.com (Jason Creighton)
Date: Thu, 22 Jul 2010 09:34:55 -0400
Subject: [pypy-dev] Building a shared library on x86-64 fails due to static
	linking of libffi
Message-ID: 

Hello,

While working on asmgcc-64, I ran into this issue. For some reason, PyPy
wants to link libffi statically on some platforms, Linux included. But when
compiling with the "shared" option (as is done in some asmgcroot tests), you
get link errors like:

/usr/bin/ld: /usr/lib/libffi.a(ffi64.o): relocation R_X86_64_32S against
`.rodata' can not be used when making a shared object; recompile with -fPIC
/usr/lib/libffi.a: could not read symbols: Bad value

I interpret this to mean that since we are building a shared library, the
resulting library must be position independent, so we can't link in non-PIC
code such as is found in the static version of libffi on my system. (Ubuntu
10.04, x86-64). And indeed, if I switch to linking dynamically, the error
goes away and things seem to work.

However, I don't want to just blindly enable dynamic linking, because there
must be a reason it was configured to link statically in the first place.
What is that reason?

Also, what steps should I take here? I think I need to enable dynamic
linking of libffi on x86-64 Linux when building a shared library at the very
least, but to reduce the number of code paths, I'm somewhat inclined to link
dynamically whether we're building a library or not. What do you guys think?

Thanks,

Jason

From amauryfa at gmail.com  Thu Jul 22 17:03:57 2010
From: amauryfa at gmail.com (Amaury Forgeot d'Arc)
Date: Thu, 22 Jul 2010 17:03:57 +0200
Subject: [pypy-dev] Building a shared library on x86-64 fails due to
	static linking of libffi
In-Reply-To: 
References: 
Message-ID: 

Hi,

2010/7/22 Jason Creighton :
> Hello,
>
> While working on asmgcc-64, I ran into this issue. For some reason, PyPy
> wants to link libffi statically on some platforms, Linux included. But when
> compiling with the "shared" option (as is done in some asmgcroot tests), you
> get link errors like:
>
> /usr/bin/ld: /usr/lib/libffi.a(ffi64.o): relocation R_X86_64_32S against
> `.rodata' can not be used when making a shared object; recompile with -fPIC
> /usr/lib/libffi.a: could not read symbols: Bad value
>
> I interpret this to mean that since we building a shared library, the
> resulting library must be position independent, so we can't link in non-PIC
> such as is found in the static version of libffi on my system. (Ubuntu
> 10.04, x86-64). And indeed, if I switch to linking dynamically, the error
> goes away and things seem to work.

Exactly

> However, I don't want to just blindly enable dynamic linking, because there
> must be a reason it was configured to link statically in the first place.
> What is that reason?
>
> Also, what steps should I take here? I think I need to enable dynamic
> linking of libffi on x86-64 Linux when building a shared library at the very
> least, but to reduce the number of code paths, I'm somewhat inclined to link
> dynamically whether we're building a library or not. What do you guys think?

The reason is actually in the code: pypy/rlib/libffi.py

    # On some platforms, we try to link statically libffi, which is small
    # anyway and avoids endless troubles for installing.  On other platforms
    # libffi.a is typically not there, so we link dynamically.

Probably static linking to libffi should be disabled on 64-bit platforms.
Or just skip the test: as far as I know, --shared is not really needed
on Unix platforms.
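
For illustration, a minimal sketch of the kind of policy check that could
drive this choice; the helper name and the exact condition are hypothetical,
not the actual code in pypy/rlib/libffi.py:

    import sys
    import platform

    def want_static_libffi(building_shared):
        # Hypothetical policy sketch: only link libffi.a statically where a
        # non-PIC static archive is still usable in practice, i.e. 32-bit
        # x86 Linux, and only when we are not producing a shared library.
        if building_shared:
            return False
        return (sys.platform.startswith('linux') and
                platform.machine() in ('i386', 'i486', 'i586', 'i686'))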

-- 
Amaury Forgeot d'Arc


From ndbecker2 at gmail.com  Thu Jul 22 17:59:37 2010
From: ndbecker2 at gmail.com (Neal Becker)
Date: Thu, 22 Jul 2010 11:59:37 -0400
Subject: [pypy-dev] Building a shared library on x86-64 fails due to
	static linking of libffi
References: 
	
Message-ID: 

AFAIK, i386 is the only platform that allows building a shared lib linked 
with a static lib.



From bhartsho at yahoo.com  Fri Jul 23 06:49:53 2010
From: bhartsho at yahoo.com (Hart's Antler)
Date: Thu, 22 Jul 2010 21:49:53 -0700 (PDT)
Subject: [pypy-dev] rpython questions, **kw, __call__, __getattr__
Message-ID: <92687.49811.qm@web114018.mail.gq1.yahoo.com>

Looking through the pypy source code i see **kw, __call__ and __getattr__ are used, but when i try to write my own rpython code that uses these conventions, i get translation errors.  Do i need to borrow from "application space" in order to do this or can i just give hints to the annotator?
Thanks,
-brett



#this is allowed
def func(*args): print(args)

#but this is not?
def func(**kw): print(kw)
#error call pattern too complex

#this class fails to translate, are we not allowed to define our own __call__ and __getattr__ in rpython?
class A(object):
  def __call__(self, *args): print(args)
  def __getattr__(self, name): print(name)







From fijall at gmail.com  Fri Jul 23 10:20:40 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Fri, 23 Jul 2010 10:20:40 +0200
Subject: [pypy-dev] rpython questions, **kw, __call__, __getattr__
In-Reply-To: <92687.49811.qm@web114018.mail.gq1.yahoo.com>
References: <92687.49811.qm@web114018.mail.gq1.yahoo.com>
Message-ID: 

Hello.

__call__ and __getattr__ won't work. You see them in the PyPy source code
because not all of the PyPy source code is RPython (in fact, Python is a
metaprogramming language for RPython). The same goes for **kw.
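
For illustration only, a sketch of how such code is usually restructured so
the annotator sees a fixed call shape - explicit parameters and plain methods
instead of **kw, __call__ and __getattr__ (the names below are made up):

    class A(object):
        def __init__(self, x):
            self.x = x

        def call(self, args):         # plain method instead of __call__
            for arg in args:          # args is a list with one known item type
                print arg

        def getfield(self, name):     # explicit dispatch instead of __getattr__
            if name == 'x':
                return self.x
            raise KeyError(name)

    def func(width=0, height=0):      # explicit keywords instead of **kw
        print width, height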

On Fri, Jul 23, 2010 at 6:49 AM, Hart's Antler  wrote:
> Looking through the pypy source code i see **kw, __call__ and __getattr__ are used, but when i try to write my own rpython code that uses these conventions, i get translation errors.  Do i need to borrow from "application space" in order to do this or can i just give hints to the annotator?
> Thanks,
> -brett
>
>
>
> #this is allowed
> def func(*args): print(args)
>
> #but this is not?
> def func(**kw): print(args)
> #error call pattern too complex
>
> #this class fails to translate, are we not allowed to define our own __call__ and __getattr__ in rpython?
> class A(object):
>   __call__(*args): print(args)
>   __getattr__(self,name): print(name)
>
>
>
>
>
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>


From cfbolz at gmx.de  Fri Jul 23 10:23:15 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Fri, 23 Jul 2010 10:23:15 +0200
Subject: [pypy-dev] rpython questions, **kw, __call__, __getattr__
In-Reply-To: <92687.49811.qm@web114018.mail.gq1.yahoo.com>
References: <92687.49811.qm@web114018.mail.gq1.yahoo.com>
Message-ID: <4C495173.9070107@gmx.de>

On 07/23/2010 06:49 AM, Hart's Antler wrote:
> Looking through the pypy source code i see **kw, __call__ and
> __getattr__ are used,

Where exactly are they used? Not all of the code in PyPy is RPython.

> but when i try to write my own rpython code
> that uses these conventions, i get translation errors.  Do i need to
> borrow from "application space" in order to do this or can i just
> give hints to the annotator? Thanks, -brett
>
>
>
> #this is allowed
 > def func(*args):
 >     print(args)
>
> #but this is not?
 > def func(**kw):
 >     print(args)
 > #error call pattern too complex
>
> #this class fails to translate, are we not allowed to define our own
> __call__ and __getattr__ in rpython?


> class A(object):
 >     __call__(*args):
 >          print(args)
 >     __getattr__(self,name):
 >          print(name)

You cannot use any __xxx__ functions in RPython, only __init__ and 
__del__. Anyway, you cannot translate a class, so "fails to translate" 
has no meaning :-).

Cheers,

Carl Friedrich


From bhartsho at yahoo.com  Sat Jul 24 11:06:09 2010
From: bhartsho at yahoo.com (Hart's Antler)
Date: Sat, 24 Jul 2010 02:06:09 -0700 (PDT)
Subject: [pypy-dev] PyPyGTK v0.1
Message-ID: <114704.12351.qm@web114014.mail.gq1.yahoo.com>

http://pastebin.com/UhnEurqb

The above is a crude way to run pygtk from RPython (by talking to CPython over a pipe), but at least it partially works.  Callbacks are limited to quoted lambdas, but returning some simple types back to RPython is possible - I'm going to try that next.  There is no support for dynamic attribute access, but most of pygtk involves function calls.  Where attribute access is required, I guess extra proxy functions could be written.
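
Roughly the idea, as a plain-Python sketch of the CPython side of such a pipe
protocol (a hypothetical illustration, not the code in the paste):

    # gtk_worker.py -- hypothetical CPython helper: read one call per line
    # from stdin, evaluate it against pygtk, and print the result back so
    # the RPython side can read it from the pipe.
    import sys
    import gtk

    namespace = {'gtk': gtk}
    for line in sys.stdin:
        try:
            result = eval(line, namespace)
        except Exception, e:
            result = 'ERROR: %s' % e
        print result
        sys.stdout.flush()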

-brett 





From cfbolz at gmx.de  Sat Jul 24 11:21:28 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Sat, 24 Jul 2010 11:21:28 +0200
Subject: [pypy-dev] PyPyGTK v0.1
In-Reply-To: <114704.12351.qm@web114014.mail.gq1.yahoo.com>
References: <114704.12351.qm@web114014.mail.gq1.yahoo.com>
Message-ID: <4C4AB098.1010306@gmx.de>

Hi Brett,

On 07/24/2010 11:06 AM, Hart's Antler wrote:
> http://pastebin.com/UhnEurqb

Nice. Did you also see this?: 
http://morepypy.blogspot.com/2009/11/using-cpython-extension-modules-with.html
I guess it could be used for GTK as well.

BTW, I guess if we ever wanted "real" GTK support, without proxying 
CPython, we should use the GObject-introspection features, which should 
make the wrapping rather simple.

Cheers,

Carl Friedrich


From bhartsho at yahoo.com  Mon Jul 26 07:47:02 2010
From: bhartsho at yahoo.com (Hart's Antler)
Date: Sun, 25 Jul 2010 22:47:02 -0700 (PDT)
Subject: [pypy-dev] PyPy Proxy
Message-ID: <345280.58625.qm@web114015.mail.gq1.yahoo.com>

The code from PyPyGTK has been generalized so that it can work with PyGame and PyODE.  Function calls are improved so that different argument types can be accepted by defining custom wrappers per function.  Custom wrappers can also be made for the return values, so different types can be handled as well (it seems that RPython requires a function to always return values of the same type).  Proxy objects can move back and forth between CPython and RPython.  The wrappers for pygtk, pyode, and pygame are by no means complete, but some basic tests are working.  Callbacks are limited to quoted lambdas, but this could be improved.

http://pastebin.com/rWEfgMSN

I had seen the other proxy method before, but found few examples.  How does it work - from RPython or the PyPy interpreter?
http://morepypy.blogspot.com/2009/11/using-cpython-extension-modules-with.html





From kevinar18 at hotmail.com  Tue Jul 27 04:09:16 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Mon, 26 Jul 2010 22:09:16 -0400
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
Message-ID: 


Might as well warn you: This is going to be a rather long post.
I'm not sure if this is appropriate to post here or if it would fit right in with the mailing list. Sorry if it is the wrong place to post about this.


I've looked through the documentation (http://codespeak.net/pypy/dist/pypy/doc/stackless.html) and didn't really see what I was looking for. I've also investigated several options in the default CPython.

What I'm trying to accomplish:
I am trying to write a particular threading scenario that follows these rules. It is partly an experiment and partly for actual production code.

1. Hundreds or thousands of micro-threads that are essentially small self-contained programs (not really, but you can think of them that way).
2. No shared state - data is passed around from one micro-thread to another; only one micro-thread has access to the data at a time. (although the programmer gets the impression there is no shared state, in reality, the underlying implementation uses shared memory / shared state for speed; the data does not move; you just pass around a reference/pointer to some shared memory)
3. The micro-threads can run in parallel on different cpu cores, get moved to a different core, etc....
4. The micro-threads are truly pre-emptive (uses hardware interrupt pre-emption).
5. It is my intention to write my own scheduler that will suspend the micro-threads, start them, control the sharing of data, assign them to different CPU cores etc.... In fact, for my purposes, I MUST write my own scheduler as I have very specific requirements on when they should and should not run.


Now, I have spent some time trying to find a way to achieve this ... and I can implement a rather poor version using default Python. However, I don't see any way to implement my ideal version. Maybe someone here might have some pointers for me.

Shared Memory between parallel processes
----------------------------------------
Quick Question: Do queues from the multiprocessing module use shared memory? If the answer is YES, you can just skip this section, because that would solve this particular problem.

(For simplicity, let's assume a quad core CPU)
It is my intent to create 4 threads/processes (one per core) and use the scheduler to assign a micro-thread (of which there may be hundreds) to one of the 4 threads/processes. However, the micro-threads need to exchange data quickly; to do that I need shared memory -- and that is where I'm having some trouble.
Normally, 4 threads would be the ideal solution -- as they can run in parallel and use shared memory. However, because of the Python GIL, I can't use threads in this way; thus, I have to use 4 processes, which are not set up to share memory.

Question: How can I share Python Objects between processes USING SHARED MEMORY? I do not want to have to copy or "pass" data back and forth between processes or have to use a proxy "server" process. These are both too much of a performance hit for my needs; shared memory is what I need.

The multiprocessing module offers me 4 options: "queues", "pipes", "shared memory map", and a "server process".
"Shared memory map" won't work as it only handles C values and arrays (not Python objects or variables).
"Server Process" sounds like a bad idea. Am I correct in that this option requires extra processing power and does not even use shared memory? If so, that would be a very bad choice for me.
The big question then... do "queues" and "pipes" use shared memory or do they pass data back and forth between processes? (if they use shared memory, then that would be perfect)

Does PyPy have any other options for me?
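
For reference, a minimal sketch of what the docs mean by C values and arrays
here - a flat array of doubles that both processes see without copying, but
no Python objects:

    from multiprocessing import Process, Array

    def worker(buf):
        # The child writes into the same shared memory the parent sees.
        for i in range(len(buf)):
            buf[i] = buf[i] * 2.0

    if __name__ == '__main__':
        shared = Array('d', [1.0, 2.0, 3.0])   # C doubles in shared memory
        p = Process(target=worker, args=(shared,))
        p.start()
        p.join()
        print shared[:]                        # [2.0, 4.0, 6.0]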


True Pre-emptive scheduling?

----------------------------

Any way to get pre-emptive micro-threads? Stackless (the real
Stackless, not the one in PyPy) has the ability to suspend them after a 
certain number of interpreter instructions; however, this is prone to 
problems because it can run much longer than expected. Ideally, I would
like to have true pre-emptive scheduling using 
hardware interrupts based on timing or CPU cycles (like the OS does for 
real threads).

I am currently not aware of any way to achieve this in CPython, PyPy, Unladen Swallow, Stackless, etc....


Are there detailed docs on why the Python GIL exists?
-----------------------------------------------------
I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it".
It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?




Thanks,
Kevin
 		 	   		  

From evan at theunixman.com  Tue Jul 27 08:27:03 2010
From: evan at theunixman.com (Evan Cofsky)
Date: Mon, 26 Jul 2010 23:27:03 -0700
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
 message passing?
In-Reply-To: 
References: 
Message-ID: <20100727062702.GE12699@tunixman.com>

On 07/26 22:09, Kevin Ar18 wrote:
> What I'm trying to accomplish:
>
> I am trying to write a particular threading scenario that follows these
> rules. It is partly an experiment and partly for actual
> production code.

This is actually interesting to me as well. I can't count the number of
times I've had to implement something like this for projects. It would be
nice to be able to use a public module instead of writing it all
yet again.

> Now, I have spent some time trying to find a way to achieve this ... and
> I can implement a rather poor version using default Python. However, I
> don't see any way to implement my ideal version. Maybe someone here
> might have some pointers for me.

> Shared Memory between parallel processes

This is the way I usually implement it. I'm currently mulling over some
sort of byte-addressable abstraction that can use a buffer or any sequence
as a backing store, which would make it useful for mmap objects as well.
And I'm thinking about using the class definitions and inheritance to
handle nested structures in some way.

> Quick Question: Do queues from the multiprocessing module use shared
> memory? If the answer is YES, you can just skip this section, because
> that would solve this particular problem.

I can't imagine it wouldn't, but I haven't checked the source yet.

> Question: How can I share Python Objects between processes USING SHARED
> MEMORY? I do not want to have to copy or "pass" data back and forth
> between processes or have to use a proxy "server" process. These are
> both too much of a performance hit for my needs; shared memory is what
> I need.

Anonymous memory-mapped regions would work, with a suitable data
abstraction. Or even memory-mapped files, which aren't really all that
different on systems anymore.
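
A minimal sketch of the anonymous-mapping idea, assuming a Unix system and a plain os.fork(); the data abstraction layered on top is the part that takes real work:

    import mmap, os

    region = mmap.mmap(-1, 4096)   # anonymous mapping; shared with forked children on Unix

    pid = os.fork()
    if pid == 0:
        region[0:5] = b"hello"     # child writes into the shared region
        os._exit(0)
    else:
        os.waitpid(pid, 0)         # parent waits, then reads what the child wrote
        print(region[0:5])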

> The multiprocessing module offers me 4 options: "queues", "pipes", "shared memory map", and a "server process".
> "Shared memory map" won't work as it only handles C values and arrays (not Python objects or variables).

cPickle could help. But then there's a serialization/deserialization step
which wouldn't really be too fast. It's not slow, but the cost of copying
the data is far outweighed by the cost of the dumps/loads, and if you need
to share multiple copies you're really going to feel it.
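
A hypothetical sketch of that hand-off; the dumps/loads pair is where the time goes, on top of the copy into the shared region (the shared-region plumbing itself is elided here):

    import cPickle as pickle   # the module is just "pickle" on Python 3

    data = {"frame": 42, "pixels": range(1000)}
    blob = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)   # serialize: the expensive part
    # ... write blob into the shared region; the other process finds it there ...
    restored = pickle.loads(blob)                        # and pays again to deserialize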

> "Server Process" sounds like a bad idea.? Am I correct in that this
> option requires extra processing power and does not even use
> shared memory??

Not really. It depends on how you would implement it.

> The big question then... do "queues" and "pipes" use shared memory or
> do they pass data back and forth between processes? (if they use
> shared memory, then that would be perfect)
 
Queues most likely do, pipes absolutely do not.

> Does PyPy have any other options for me?

I wonder if it could be done with an object space, or similarly done
"behind the scenes" in the PyPy interpreter, sort of the way ZODB works
semi-transparently. Only in this case completely transparently.

> True Pre-emptive scheduling?

This wouldn't really be difficult, although doing it efficiently might
very well be difficult without some serious black magic. But PyPy may also be the
right tool for that since the black magic can be written in Python or
RPython instead of C.

> Any way to get pre-emptive micro-threads? Stackless (the real
> Stackless, not the one in PyPy) has the ability to suspend them after a
> certain number of interpreter instructions; however, this is prone to
> problems because it can run much longer than expected. Ideally, I would
> like to have true pre-emptive scheduling using hardware interrupts based
> on timing or CPU cycles (like the OS does for real threads).

By using a process for each thread, and some shared memory arena for the
bulk of the application data structures, this is probably quite possible
without reimplementing the OS in Python.

> I am currently not aware of any way to achieve this in CPython, PyPy,
> Unladen Swallow, Stackless, etc....

I've done this a number of times, both with threads and with processes.
Processes ironically give you finer control over scheduling since you
aren't stuck behind the GIL, but as you are finding, you need some way to
share data.

> Are there detailed docs on why the Python GIL exists?

Here is the page from the Python Wiki:

http://wiki.python.org/moin/GlobalInterpreterLock

And here is an interesting article on the GIL problem:

http://blog.ianbicking.org/gil-of-doom.html

-- 
Evan Cofsky "The UNIX Man" 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: Digital signature
URL: 

From fijall at gmail.com  Tue Jul 27 11:43:50 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 27 Jul 2010 11:43:50 +0200
Subject: [pypy-dev] PyPy Proxy
In-Reply-To: <345280.58625.qm@web114015.mail.gq1.yahoo.com>
References: <345280.58625.qm@web114015.mail.gq1.yahoo.com>
Message-ID: 

Hey.

Does it come with tests? Or how can I look how is it working?

On Mon, Jul 26, 2010 at 7:47 AM, Hart's Antler  wrote:
> The code from PyPyGTK has been generalized so that it can work with PyGame, and PyODE. Function calls are improved so that different arg types can be accepted by defining custom wrappers per function. Custom wrappers can also be made for the return values, so different types can be handled as well (it seems that rpython restricts what can be returned from a function to the same types). Proxy objects can move back and forth from CPython to RPython. The wrappers for pygtk, pyode, and pygame are by no means complete, but some basic tests are working. Callbacks are limited to quoted lambdas, but it could be improved.
>
> http://pastebin.com/rWEfgMSN
>
> I had seen the other proxy method before, but found few examples, how does it work, from Rpython or the PyPy interpreter?
> http://morepypy.blogspot.com/2009/11/using-cpython-extension-modules-with.html
>
>
>
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>


From fijall at gmail.com  Tue Jul 27 11:48:57 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 27 Jul 2010 11:48:57 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
Message-ID: 

On Tue, Jul 27, 2010 at 4:09 AM, Kevin Ar18  wrote:
>
> Might as well warn you: This is going to be a rather long post.
> I'm not sure if this is appropriate to post here or if it would fit right in with the mailing list. Sorry if it is the wrong place to post about this.
>

This is a relevant list for some of the questions below. I'll try to answer them.

> Quick Question: Do queues from the multiprocessing module use shared memory? If the answer is YES, you can just skip this section, because that would solve this particular problem.

PyPy has no multiprocessing module so far (besides, I think it's an
ugly hack, but that's another issue).

>
> Does PyPy have any other options for me?
>

Right now, no. But there are ways in which you can experiment. Truly
concurrent threads (depends on implicit vs explicit shared memory)
might require a truly concurrent GC to achieve performance. This is
work (although not as big as removing refcounting from CPython for
example).

>
> True Pre-emptive scheduling?
>
> ----------------------------
>
> Any way to get pre-emptive micro-threads? Stackless (the real
> Stackless, not the one in PyPy) has the ability to suspend them after a
> certain number of interpreter instructions; however, this is prone to
> problems because it can run much longer than expected. Ideally, I would
> like to have true pre-emptive scheduling using
> hardware interrupts based on timing or CPU cycles (like the OS does for
> real threads).
>
> I am currently not aware of any way to achieve this in CPython, PyPy, Unladen Swallow, Stackless, etc....
>

Sounds relatively easy, but you would need to write this part in
RPython (however, that does not mean you get rid of the GIL).

>
> Are there detailed docs on why the Python GIL exists?
> -----------------------------------------------------
> I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it".
> It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?

The short answer is "yes". The long answer is that it's much easier to
write an interpreter assuming the GIL is around. For fine-grained locking to
work and be efficient, you would need:

* Some sort of concurrent GC (not specifically running in a separate
thread, but having different pools of memory to allocate from)
* Possibly a JIT optimization that would remove some locking.
* The aforementioned locking, to ensure that it's not that easy to
screw things up.

So, in short, "work".


From bhartsho at yahoo.com  Tue Jul 27 14:43:47 2010
From: bhartsho at yahoo.com (Hart's Antler)
Date: Tue, 27 Jul 2010 05:43:47 -0700 (PDT)
Subject: [pypy-dev] PyPy Proxy
In-Reply-To: 
Message-ID: <929795.85917.qm@web114016.mail.gq1.yahoo.com>

Hi Maciej,

Yes, it comes with its own test; just save the file from pastebin and run it.  You should see two gtk windows pop up and a pygame window that draws a circle.

I have a new version with proxy support for one module from Blender 2.5 (bpy.ops); note that pygtk, ode, and pygame are broken in this version because I had to change some things so it can run in Blender's embedded Python 3.1.

http://pastebin.com/TsYNqd8p



--- On Tue, 7/27/10, Maciej Fijalkowski  wrote:

> From: Maciej Fijalkowski 
> Subject: Re: [pypy-dev] PyPy Proxy
> To: "Hart's Antler" 
> Cc: pypy-dev at codespeak.net
> Date: Tuesday, 27 July, 2010, 2:43 AM
> Hey.
> 
> Does it come with tests? Or how can I look how is it
> working?
> 
> On Mon, Jul 26, 2010 at 7:47 AM, Hart's Antler 
> wrote:
> > The code from PyPyGTK has been generalized so that it
> can work with PyGame, and PyODE. Function calls are
> improved so that different arg types can be accepted by
> defining custom wrappers per function. Custom wrappers can
> also be made for the return values, so different types can
> be handled as well (it seems that rpython restricts what can
> be returned from a function to the same types). Proxy
> objects can move back and forth from CPython to RPython.
> The wrappers for pygtk, pyode, and pygame are by no means
> complete, but some basic tests are working. Callbacks are
> limited to quoted lambdas, but it could be improved.
> >
> > http://pastebin.com/rWEfgMSN
> >
> > I had seen the other proxy method before, but found
> few examples, how does it work, from Rpython or the PyPy
> interpreter?
> > http://morepypy.blogspot.com/2009/11/using-cpython-extension-modules-with.html
> >
> >
> >
> > _______________________________________________
> > pypy-dev at codespeak.net
> > http://codespeak.net/mailman/listinfo/pypy-dev
> >
> 





From p.giarrusso at gmail.com  Tue Jul 27 15:17:29 2010
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Tue, 27 Jul 2010 15:17:29 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: <20100727062702.GE12699@tunixman.com>
References: 
	<20100727062702.GE12699@tunixman.com>
Message-ID: 

On Tue, Jul 27, 2010 at 08:27, Evan Cofsky  wrote:
> On 07/26 22:09, Kevin Ar18 wrote:
>> Are there detailed docs on why the Python GIL exists?
>
> Here is the page from the Python Wiki:
>
> http://wiki.python.org/moin/GlobalInterpreterLock

To keep it short, CPython uses refcounting, and without the GIL the
refcount incs and decs would need to be atomic, with a huge
performance impact (that's discussed in the links below).

However, you can look at this answer from Guido van Rossum:
http://www.artima.com/weblogs/viewpost.jsp?thread=214235

And these two attempts to remove the GIL:
http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Global_Interpreter_Lock
http://code.google.com/p/python-safethread/

PyPy does not have this problem, but you still need to make
the dictionaries holding each object's members thread-safe. You don't
need to make lists thread-safe, I think, because the programmer is
supposed to lock them, but you want to allow a thread to add a member
to an object while another thread performs a method call.

Anyway, all this just explains why the GIL is still there, which is a
slightly different question from the original one. With
state-of-the-art technology, it is bad on every front, except
simplicity of implementation.

> And here is an interesting article on the GIL problem:
>
> http://blog.ianbicking.org/gil-of-doom.html

Given that processor frequencies aren't going to keep increasing the
way they used to, while the number of cores is going to increase much
more, this article seems outdated nowadays - see also
http://atlee.ca/blog/2006/06/27/python-warts-2/.

This other link (http://poshmodule.sourceforge.net/) used to be
interesting for the problem you are discussing, but seems also dead -
there are other modules here:
http://wiki.python.org/moin/ParallelProcessing.

Best regards
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


From fijall at gmail.com  Tue Jul 27 15:42:55 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 27 Jul 2010 15:42:55 +0200
Subject: [pypy-dev] rotting buildbot infrastructure
Message-ID: 

Hello.

According to current buildbot status, both osx and win machines are
offline. No clue how to get them back. Anyway, our OS X machine is
unable to translate pypy, so it's not exactly the best buildbot ever.
Can anyone contribute any machine for one of those buildbots?

Cheers
fijal


From p.giarrusso at gmail.com  Tue Jul 27 16:36:26 2010
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Tue, 27 Jul 2010 16:36:26 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	
Message-ID: 

Hi all!

I am possibly interested in doing work on this, even if not in the
immediate future.

On Tue, Jul 27, 2010 at 11:48, Maciej Fijalkowski  wrote:
> On Tue, Jul 27, 2010 at 4:09 AM, Kevin Ar18  wrote:

> Truly
> concurrent threads (depends on implicit vs explicit shared memory)
> might require a truly concurrent GC to achieve performance. This is
> work (although not as big as removing refcounting from CPython for
> example).

>> Are there detailed docs on why the Python GIL exists?
>> -----------------------------------------------------
>> I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it".
>> It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?

> The short answer is "yes". The long answer is that it's much easier to
> write interpreter assuming GIL is around. For fine-grained locking to
> work and be efficient, you would need:

> * The aforementioned locking, to ensure that it's not that easy to
> screw things up.
I've wondered about the guarantees we need to offer to the
programmer, and my guess was that Jython's memory model is similar.
I've been concentrating on the dictionary of objects, on the
assumption that lists and most other built-in structures should be
locked by the programmer in case of concurrent modifications.

However, we don't want to require locking to support something like:
Thread 1:
obj.newmember=1;
Thread 2:
a = obj.oldmember;

Looking for Jython memory model on Google produces some garbage and
then this document from Unladen Swallow:
http://code.google.com/p/unladen-swallow/wiki/MemoryModel
It implicitly agrees on what's above (since Jython and IronPython both
use thread-safe dictionaries), and then delves into issues about
allowed reorderings.
However, it requires that even racy code does not make the interpreter crash.

> * Possibly a JIT optimization that would remove some locking.
Any more specific ideas on this?
> * Some sort of concurrent GC (not specifically running in a separate
> thread, but having different pools of memory to allocate from)

Among all points, this seems the easiest design-wise. Having
per-thread pools is nowadays standard, so it's _just_ work (as opposed
to 'complicated design'). Parallel GCs become important just when lots
of garbage must be reclaimed.
A GC is called concurrent, rather than parallel, when it runs
concurrently with the mutator, and this usually reduces both pause
times and throughput, so you probably don't want this as default (it
is useful for particular programs, such as heavily interactive
programs or videogames, I guess), do you?

More details are here:
http://www.ibm.com/developerworks/java/library/j-jtp11253/

The trick used in the (mostly) concurrent collector of Hotspot seems
interesting: it uses two short-stop-the-world phases and lets the
program run in between. I think I'll look for a paper on it.

Cheers,
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


From fijall at gmail.com  Tue Jul 27 17:07:59 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 27 Jul 2010 17:07:59 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	 
	
Message-ID: 

On Tue, Jul 27, 2010 at 4:36 PM, Paolo Giarrusso  wrote:
> Hi all!
>
> I am possibly interested in doing work on this, even if not in the
> immediate future.

Well, talk is cheap. Would be great to see some work done of course.

Cheers,
fijal


From fijall at gmail.com  Tue Jul 27 17:11:43 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Tue, 27 Jul 2010 17:11:43 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	 
	
Message-ID: 

>
>> Truly
>> concurrent threads (depends on implicit vs explicit shared memory)
>> might require a truly concurrent GC to achieve performance. This is
>> work (although not as big as removing refcounting from CPython for
>> example).
>
>>> Are there detailed docs on why the Python GIL exists?
>>> -----------------------------------------------------
>>> I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it".
>>> It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?
>
>> The short answer is "yes". The long answer is that it's much easier to
>> write interpreter assuming GIL is around. For fine-grained locking to
>> work and be efficient, you would need:
>
>> * The aforementioned locking, to ensure that it's not that easy to
>> screw things up.
> I've wondered around the guarantees we need to offer to the
> programmer, and my guess was that Jython's memory model is similar.
> I've been concentrating on the dictionary of objects, on the
> assumption that lists and most other built-in structures should be
> locked by the programmer in case of concurrent modifications.
>
> However, we don't want to require locking to support something like:
> Thread 1:
> obj.newmember=1;
> Thread 2:
> a = obj.oldmember;
>
> Looking for Jython memory model on Google produces some garbage and
> then this document from Unladen Swallow:
> http://code.google.com/p/unladen-swallow/wiki/MemoryModel
> It implicitly agrees on what's above (since Jython and IronPython both
> use thread-safe dictionaries), and then delves into issues about
> allowed reorderings.
> However, it requires that even racy code does not make the interpreter crash.

I guess the main constraint is "interpreter should not crash" indeed.

>
>> * Possibly a JIT optimization that would remove some locking.
> Any more specific ideas on this?

Well, yes. Determining when an object is local so you don't need to do
any locking, even though it escapes (this is also "just work", since
it has been done before).

>> * Some sort of concurrent GC (not specifically running in a separate
>> thread, but having different pools of memory to allocate from)
>
> Among all points, this seems the easiest design-wise. Having
> per-thread pools is nowadays standard, so it's _just_ work (as opposed
> to 'complicated design'). Parallel GCs become important just when lots
> of garbage must be reclaimed.
> A GC is called concurrent, rather than parallel, when it runs
> concurrently with the mutator, and this usually reduces both pause
> times and throughput, so you probably don't want this as default (it
> is useful for particular programs, such as heavily interactive
> programs or videogames, I guess), do you?

I guess I meant parallel then.

>
> More details are here:
> http://www.ibm.com/developerworks/java/library/j-jtp11253/
>
> The trick used in the (mostly) concurrent collector of Hotspot seems
> interesting: it uses two short-stop-the-world phases and lets the
> program run in between. I think I'll look for a paper on it.

Would be interested in that.

>
> Cheers,
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
>


From holger at merlinux.eu  Tue Jul 27 18:05:49 2010
From: holger at merlinux.eu (holger krekel)
Date: Tue, 27 Jul 2010 18:05:49 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared
	memory	message passing?
In-Reply-To: 
References: 
	
	
	
Message-ID: <20100727160548.GJ14601@trillke.net>

On Tue, Jul 27, 2010 at 17:07 +0200, Maciej Fijalkowski wrote:
> On Tue, Jul 27, 2010 at 4:36 PM, Paolo Giarrusso  wrote:
> > Hi all!
> >
> > I am possibly interested in doing work on this, even if not in the
> > immediate future.
> 
> Well, talk is cheap. Would be great to see some work done of course.

Well, I think it can be useful to state intentions and interest.  At least
for my projects I feel a difference if people express interest (even through
negative feedback or broken code) or if they are indifferent,
not saying or doing anything.

best,
holger


From jbaker at zyasoft.com  Tue Jul 27 19:58:17 2010
From: jbaker at zyasoft.com (Jim Baker)
Date: Tue, 27 Jul 2010 11:58:17 -0600
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: <20100727160548.GJ14601@trillke.net>
References: 
	 
	 
	 
	<20100727160548.GJ14601@trillke.net>
Message-ID: 

A much shorter version of the Jython memory model can be found in my book:
http://jythonpodcast.hostjava.net/jythonbook/en/1.0/Concurrency.html#python-memory-model

In general, I would think the coroutine mechanism being implemented by Lukas
Stadler for the MLVM version of the HotSpot JVM might be a good option; you
can directly control the scheduling, although I don't think you can change the
mapping from one hardware thread to another. (That's probably not
interesting.)

There are good results with JRuby; it would be nice to replicate them with Jython
- and it should be really straightforward to do that. See
http://classparser.blogspot.com/

- Jim

On Tue, Jul 27, 2010 at 10:05 AM, holger krekel  wrote:

> On Tue, Jul 27, 2010 at 17:07 +0200, Maciej Fijalkowski wrote:
> > On Tue, Jul 27, 2010 at 4:36 PM, Paolo Giarrusso 
> wrote:
> > > Hi all!
> > >
> > > I am possibly interested in doing work on this, even if not in the
> > > immediate future.
> >
> > Well, talk is cheap. Would be great to see some work done of course.
>
> Well, I think it can be useful to state intentions and interest.  At least
> for my projects i feel a difference if people express interest (even
> through
> negative feedback or broken code) or if they are indifferent,
> not saying or doing anything.
>
> best,
> holger
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kevinar18 at hotmail.com  Tue Jul 27 20:20:10 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Tue, 27 Jul 2010 14:20:10 -0400
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
 message passing?
In-Reply-To: <20100727062702.GE12699@tunixman.com>
References: ,
	<20100727062702.GE12699@tunixman.com>
Message-ID: 


I won't even bother giving individual replies. It's
going to take me some time to go through all that information on the
GIL, so I guess there's not much of a reply I can give anyways. :) Let me explain what this is all about in greater detail.



BTW, if there are more links on the GIL, feel free to post.

> Anonymous memory-mapped regions would work, with a suitable data
> abstraction. Or even memory-mapped files, which aren't really all that
> different on systems anymore.
I considered that... however, that would mean writing a significant library to convert Python data types to C/machine types and I wasn't looking forward to that prospect... although after some experimenting, maybe I will find that it won't be that big a deal for my particular situation.

-----------------------
What this is all about:
-----------------------
I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.

Consider p74 and p75 of the FBP book (http://www.jpaulmorrison.com/fbp/book.pdf). Programs essentially consist of many "black boxes" connected together. A box receives data, processes it and passes it along to another box, to output, or drops/deletes it. Each box is like a mini-program written in a traditional programming language (like C++ or Python).

The process of connecting the boxes together was actually designed to be programmed visually, as you can see from the examples in the book (I have no idea if it works well, as I am merely starting to experiment with it).

Each box, being a self-contained "program," has access to only 3 kinds of data:
(1) its own internal variables
(2) The "in ports": these are connections from other boxes allowing the box to receive data to be processed (very similar to the arguments in a function call)
(3) The "out ports": after processing the data, the box sends results to various "out ports" (which, in turn, go to another box's "in port" or to system output). There is no "return" like in functions... and a box can continually generate many pieces of data on the "out ports", unlike a function which only generates one return.


------------------------
At this point, my understanding of the FBP concept is extremely limited. Unfortunately, the author does not have very detailed documentation on the implementation details. So, I am going to try exploring the concept on my own and see if I can actually use it in some production code.


Implementation of FBP requires a custom scheduler for several reasons:
(1) A box can only run if it has actual data on the "in port(s)". Thus, the scheduler would only schedule boxes to run when they can actually process some data.
(2) In theory, it may be possible to end up with hundreds or thousands of these lightweight boxes. Using heavyweight OS threads or processes for every one is out of the question.


The Kamaelia website describes a simplistic single-threaded way to write a scheduler in Python that would work for the FBP concept (even though they never heard of FBP when they designed Kamaelia). Based on that, it seems like writing a simple scheduler would be rather easy:


In a perfect world, here's what I might do:
* Assume a quad core cpu
(1) Spawn 1 process
(2) Spawn 4 threads & assign each thread to only 1 core -- in other words, don't let the OS handle moving threads around to different cores
(3) Inside each thread, have a mini scheduler that switches back and forth between the many micro-threads (or "boxes") -- note that the OS should not handle any of the switching between micro-threads/boxes as it does it all wrong (and too heavyweight) for this situation.
(4) Using a shared memory queue, each of the 4 schedulers can get the next box to run... or add more boxes to the schedule queue.

(5) Each box has access to its "in ports" and "out ports" only -- and nothing else. These can be implemented as shared memory for speed.
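
Here is a rough structural sketch of points (1)-(4) above, with made-up names; under CPython or current PyPy the GIL means the four worker threads won't actually execute Python code in parallel, so this only shows the shape of the idea, not the performance I'm after:

    import threading, Queue   # the module is called "queue" on Python 3

    runqueue = Queue.Queue()                  # boxes that are ready to run

    def counter(n):                           # a toy "box": just counts down
        while n:
            n -= 1
            yield                             # hand control back to the scheduler

    def worker():
        while True:
            try:
                box = runqueue.get(timeout=0.5)   # crude shutdown: idle means we're done
            except Queue.Empty:
                return
            try:
                next(box)                     # run the box for one step
                runqueue.put(box)             # reschedule (real code would check its in-ports)
            except StopIteration:
                pass                          # the box finished; drop it

    for _ in range(100):                      # hundreds of micro-threads is the whole point
        runqueue.put(counter(10))

    workers = [threading.Thread(target=worker) for _ in range(4)]   # one per core, ideally pinned
    for w in workers:
        w.start()
    for w in workers:
        w.join()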


Some notes:
Garbage Collection - I noticed that one of the issues mentioned about the GIL was garbage collection. Within the FBP concept, this MIGHT be easily solved: (a) only 1 running piece of code (1 box) can access a piece of data at a time, so there are no worries about whether there are dangling pointers to the var/object somewhere, etc... (b) data must be manually "dropped" inside a box to get rid of it; thus, there is no need to go checking for data that is not used anymore

Threading protection - In theory, there are significantly fewer threading issues since: (a) only one box can control/access data at a time (b) the only place where there is contention is when you push/pop from the in/out ports ... and that is trivial to protect against.



Anyways, I appreciate the replies. At this point, I guess I'll just go for a simplistic implementation to get a feel for how things work. Then, maybe I can check whether something better can be done in PyPy.
 		 	   		  

From cfbolz at gmx.de  Tue Jul 27 23:56:26 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Tue, 27 Jul 2010 23:56:26 +0200
Subject: [pypy-dev] rotting buildbot infrastructure
In-Reply-To: 
References: 
Message-ID: <4C4F560A.6080101@gmx.de>

On 07/27/2010 03:42 PM, Maciej Fijalkowski wrote:
> Hello.
>
> According to current buildbot status, both osx and win machines are
> offline. No clue how to get them back. Anyway, our OS X machine is
> unable to translate pypy, so it's not exactly the best buildbot ever.
> Can anyone contribute any machine for one of those buildbots?

Sorry, I will only be able to look at the OS X machine in August. Why 
can't it translate PyPy?

Carl Friedrich


From fijall at gmail.com  Wed Jul 28 08:42:22 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Wed, 28 Jul 2010 08:42:22 +0200
Subject: [pypy-dev] rotting buildbot infrastructure
In-Reply-To: <4C4F560A.6080101@gmx.de>
References:  
	<4C4F560A.6080101@gmx.de>
Message-ID: 

On Tue, Jul 27, 2010 at 11:56 PM, Carl Friedrich Bolz  wrote:
> On 07/27/2010 03:42 PM, Maciej Fijalkowski wrote:
>> Hello.
>>
>> According to current buildbot status, both osx and win machines are
>> offline. No clue how to get them back. Anyway, our OS X machine is
>> unable to translate pypy, so it's not exactly the best buildbot ever.
>> Can anyone contribute any machine for one of those buildbots?
>
> Sorry, I will only be able to look at the OS X machine in August. Why
> can't it translate PyPy?

There is not enough memory (the build times out after 4 or 5 hours).

>
> Carl Friedrich
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>


From stephen at thorne.id.au  Wed Jul 28 09:29:51 2010
From: stephen at thorne.id.au (Stephen Thorne)
Date: Wed, 28 Jul 2010 17:29:51 +1000
Subject: [pypy-dev] rotting buildbot infrastructure
In-Reply-To: 
References: 
	<4C4F560A.6080101@gmx.de>
	
Message-ID: <20100728072951.GF1338@thorne.id.au>

On 2010-07-28, Maciej Fijalkowski wrote:
> On Tue, Jul 27, 2010 at 11:56 PM, Carl Friedrich Bolz  wrote:
> > On 07/27/2010 03:42 PM, Maciej Fijalkowski wrote:
> >> Hello.
> >>
> >> According to current buildbot status, both osx and win machines are
> >> offline. No clue how to get them back. Anyway, our OS X machine is
> >> unable to translate pypy, so it's not exactly the best buildbot ever.
> >> Can anyone contribute any machine for one of those buildbots?
> >
> > Sorry, I will only be able to look at the OS X machine in August. Why
> > can't it translate PyPy?
> 
> There is not enough memory (the build times out after 4 or 5 hours).

I have a quad core ppc OSX (10.4) machine that isn't currently operating. It
only has a few of its RAM slots filled. If anyone wanted to fill it with RAM it
would make a reasonable build machine.

-- 
Regards,
Stephen Thorne
Development Engineer
Netbox Blue


From william.leslie.ttg at gmail.com  Wed Jul 28 14:54:39 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Wed, 28 Jul 2010 22:54:39 +1000
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
Message-ID: 

On 28 July 2010 04:20, Kevin Ar18  wrote:
> I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.

Does anyone know if there is a central resource for incompatible
python memory model proposals? I know of Jython, Python-Safethread,
and Mont-E.

I do like the idea of MiniAxon, but let me mention a topic that has
slowly been bubbling to the front of my mind for the last few months.

Concurrency in the face of shared mutable state is hard. It makes it
trivial to introduce bugs all over the place. Nondeterminacy-related
bugs are so much harder to test, diagnose, and fix than anything else that
I would almost mandate static verification (via optional typing,
probably) of task noninterference if I were moving to a concurrent
environment with shared mutable state. There might be a reasonable
middle ground where, if a task attempts to violate the required static
semantics, it fails dynamically. At least then, latent bugs make
plenty of noise. An example for MiniAxon (as I understand it, which is
not very well) would be verification that a "task" (including
functions that the task calls) never closes over and yields the same
mutable objects, and never mutates globally reachable objects.

I wonder if you could close such tasks off with a clever subclass of
the proxy object space that detects and rejects such memory model
violations? With only semantics that make the program deterministic?

The moral equivalent would be cooperating processes with a large
global (or not) shared memory store for immutable objects, queues for
communication, and the additional semantic that objects in a queue are
either immutable or the queue holds their only reference. The trouble
is that it is so hard to work out what immutable really means.
Non-optional annotations would be not very pythonian.

-- 
William Leslie


From p.giarrusso at gmail.com  Wed Jul 28 15:12:40 2010
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Wed, 28 Jul 2010 15:12:40 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
Message-ID: 

On Tue, Jul 27, 2010 at 20:20, Kevin Ar18  wrote:
>
> I won't even bother giving individual replies. It's
> going to take me some time to go through all that information on the
> GIL, so I guess there's not much of a reply I can give anyways. :) Let me explain what this is all about in greater detail.

> BTW, if there are more links on the GIL, feel free to post.
>
>> Anonymous memory-mapped regions would work, with a suitable data
>> abstraction. Or even memory-mapped files, which aren't really all that
>> different on systems anymore.
> I considered that... however, that would mean writing a significant library to convert Python data types to C/machine types and I wasn't looking forward to that prospect... although after some experimenting, maybe I will find that it won't be that big a deal for my particular situation.

> I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.
Except for the "visual programming" part, the general idea you
describe stems from CSP (Communicating Sequential Processes) and is
also found at least in the Scala actor library and in Google's Go with
goroutines.

In both languages you can easily pretend that no memory is shared by
avoiding sharing any pointers (unlike C, even buggy code can't modify
a pointer which wasn't shared), and Go recommends programming this
way. A difference is that this is only a convention.

For the "visual programming", it looks like a particular case of what
the Eclipse Modeling Framework is doing (they allow you to define
types of diagrams, called metamodels, and a way to convert them to
code, and generate a diagram editor and other support stuff. I'm not
an expert on that).
From what you describe, FBP seems to give nothing new, except the
combination among "visual programming" with this idea. Disclaimer: I
did not read the book.

> Consider p74 and p75 of the FBP book (http://www.jpaulmorrison.com/fbp/book.pdf). Programs essentially consist of many "black boxes" connected together. A box receives data, processes it and passes it along to another box, to output, or drops/deletes it. Each box is like a mini-program written in a traditional programming language (like C++ or Python).
>
> The process of connecting the boxes together was actually designed to be programmed visually, as you can see from the examples in the book (I have no idea if it works well, as I am merely starting to experiment with it).
>
> Each box, being a self-contained "program," has access to only 3 kinds of data:
> (1) its own internal variables
> (2) The "in ports": these are connections from other boxes allowing the box to receive data to be processed (very similar to the arguments in a function call)
> (3) The "out ports": after processing the data, the box sends results to various "out ports" (which, in turn, go to another box's "in port" or to system output). There is no "return" like in functions... and a box can continually generate many pieces of data on the "out ports", unlike a function which only generates one return.
>
>
> ------------------------
> At this point, my understanding of the FBP concept is extremely limited. Unfortunately, the author does not have very detailed documentation on the implementation details. So, I am going to try exploring the concept on my own and see if I can actually use it in some production code.
>
>
> Implementation of FBP requires a custom scheduler for several reasons:
> (1) A box can only run if it has actual data on the "in port(s)". Thus, the scheduler would only schedule boxes to run when they can actually process some data.
> (2) In theory, it may be possible to end up with hundreds or thousands of these lightweight boxes. Using heavyweight OS threads or processes for every one is out of the question.
>
>
> The Kamaelia website describes a simplistic single-threaded way to write a scheduler in Python that would work for the FBP concept (even though they never heard of FBP when they designed Kamaelia). Based on that, it seems like writing a simple scheduler would be rather easy:

> In a perfect world, here's what I might do:
> * Assume a quad core cpu
> (1) Spawn 1 process
> (2) Spawn 4 threads & assign each thread to only 1 core -- in other words, don't let the OS handle moving threads around to different cores
> (3) Inside each thread, have a mini scheduler that switches back and forth between the many micro-threads (or "boxes") -- note that the OS should not handle any of the switching between micro-threads/boxes as it does it all wrong (and too heavyweight) for this situation.
> (4) Using a shared memory queue, each of the 4 schedulers can get the next box to run... or add more boxes to the schedule queue.

Most of this is usual or standard - even if somebody might not
set thread-CPU affinity, perhaps because they don't know about the
syscalls to do it, i.e. sched_setaffinity. IIRC, this was not
mentioned in the paper I read about the Scala actor library.
Look for 'N:M threading library' (without quotes) on Google.

> (5) Each box has access to its "in ports" and "out ports" only -- and nothing else. These can be implemented as shared memory for speed.

> Some notes:
> Garbage Collection - I noticed that one of the issues mentioned about the GIL was garbage collection. Within the FBP concept, this MIGHT be easily solved: (a) only 1 running piece of code (1 box) can access a piece of data at a time, so there are no worries about whether there are dangling pointers to the var/object somewhere, etc...

> (b) data must be manually "dropped" inside a box to get rid of it; thus, there is no need to go checking for data that is not used anymore

A "piece of data" can point to other objects, and the pointer can be
modified. So you need GC anyway: having that, requiring data to be
dropped explicitly seems just an annoyance (there might be deeper
reasons, however).

> Threading protection - In theory, there are significantly fewer threading issues since: (a) only one box can control/access data at a time (b) the only place where there is contention is when you push/pop from the in/out ports ... and that is trivial to protect against.
Agreed.
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


From p.giarrusso at gmail.com  Wed Jul 28 15:37:07 2010
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Wed, 28 Jul 2010 15:37:07 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
	
Message-ID: 

On Wed, Jul 28, 2010 at 14:54, William Leslie
 wrote:
> On 28 July 2010 04:20, Kevin Ar18  wrote:
>> I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.

> Does anyone know if there is a central resource for incompatible
> python memory model proposals? I know of Jython, Python-Safethread,
> and Mont-E.

Add Unladen Swallow to your list - the "Jython memory model" is undocumented.
I don't know of Mont-E, can't find its website through Google (!), and
there seems to be no such central resource.

> I do like the idea of MiniAxon, but let me mention a topic that has
> slowly been bubbling to the front of my mind for the last few months.

> Concurrency in the face of shared mutable state is hard. It makes it
> trivial to introduce bugs all over the place. Nondeterminacy related
> bugs are far harder to test, diagnose, and fix than anything else that
> I would almost mandate static verification (via optional typing,
> probably) of task noninterference if I was moving to a concurrent
> environment with shared mutable state.

This is a general issue with concurrency, and I try to address it
with more up-front pencil-and-paper design than usual.

> There might be a reasonable
> middle ground where, if a task attempts to violate the required static
> semantics, it fails dynamically. At least then, latent bugs make
> plenty of noise.

In general, I've seen lots of research on this, and something
implemented in Valgrind - see here for links:
http://blaisorbladeprog.blogspot.com/2010/07/automatic-race-detection.html.
Given the interest in this, the lack of complete tools might mean that
it is just too hard currently.

> An example for MiniAxon (as I understand it, which is
> not very well) would be verification that a "task" (including
> functions that the task calls) never closes over and yields the same
> mutable objects, and never mutates globally reachable objects.

I guess that 'close over' here means 'getting as input'.

> I wonder if you could close such tasks off with a clever subclass of
> the proxy object space that detects and rejects such memory model
> violations? With only semantics that make the program deterministic?

> The moral equivalent would be cooperating processes with a large
> global (or not) shared memory store for immutable objects, queues for
> communication, and the additional semantic that objects in a queue are
> either immutable or the queue holds their only reference.

In C++, auto_ptr does it, but that's hard in Python.

> The trouble
> is that it is so hard to work out what immutable really means.
> Non-optional annotations would be not very pythonian.

If you want static guarantees, you need a statically typed language.
The usual argument for dynamic languages is that instead of static
typing, you need to write unit tests, and since you must do that
anyway, dynamic languages are a win. We have two incomplete attempts
to make programs correct:
- Types give strong guarantees against a subclass of errors (you
_never_ get certain errors from a program which compiles)
- Testing gives weak guarantees (which go just as far as you test),
but covers all classes of errors
- The middle ground would be to require annotations to prove
properties. One would need (once and for all) to annotate even strings
as immutable!

Cheers,
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


From william.leslie.ttg at gmail.com  Wed Jul 28 16:56:43 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Thu, 29 Jul 2010 00:56:43 +1000
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
Message-ID: 

On 28 July 2010 23:37, Paolo Giarrusso  wrote:
> On Wed, Jul 28, 2010 at 14:54, William Leslie
>  wrote:
>> Does anyone know if there is a central resource for incompatible
>> python memory model proposals? I know of Jython, Python-Safethread,
>> and Mont-E.
>
> Add Unladen Swallow to your list - the "Jython memory model" is undocumented.
> I don't know of Mont-E, can't find its website through Google (!), and
> there seems to be no such central resource.

Mont-E was, for a long time, the hypothetical capability-secure subset
of python based on E and discussed on cap-talk. A handful of people
started work on it in earnest as a cpython fork fairly recently, but
it does seem to be pretty quiet, and documentation free. I did find a
repository and a presentation:
  http://bytebucket.org/habnabit/mont-e/overview
  https://docs.google.com/present/view?id=d9wrrrq_15ch78nq9n

> This is a general issue with concurrency, and usually I try to solve
> this using more pencil-and-paper design than usual.

I found the following paper pretty interesting. The motivating study
is some concurrency experts implementing software for proving the lack
of deadlock in Java. Even with the sort of dedication that only a
researcher with no life can provide, their deadlock inference software
itself deadlocked after many years of use.
www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

>> An example for MiniAxon (as I understand it, which is
>> not very well) would be verification that a "task" (including
>> functions that the task calls) never closes over and yields the same
>> mutable objects, and never mutates globally reachable objects.
>
> I guess that 'close over' here means 'getting as input'.

I mean that it keeps a reference to the objects between invocations.
Hence, sharing mutable state.

>> The trouble
>> is that it is so hard to work out what immutable really means.
>> Non-optional annotations would be not very pythonian.
>
> If you want static guarantees, you need a statically typed language.
> The usual argument for dynamic languages is that instead of static
> typing, you need to write unit tests, and since you must do that
> anyway, dynamic languages are a win.

One thing that many even very experienced hackers miss is that
(static) types (and typesystems) actually cover a broad range of
usages, and many of them are very different to the structural
typesafety systems you are used to in C# and Java. A typesystem can
prove anything that is statically computable, from the noninterference
of effects to program termination, the ability to stack allocate data
structures, and that privileged information can't be tainted.

It's important to realise that these are orthogonal to, not supersets
of, typesystems that validate structural safety. So it can be
reasonable, if yet a little more difficult, to apply them to dynamic
languages.

-- 
William Leslie


From glavoie at gmail.com  Wed Jul 28 21:32:38 2010
From: glavoie at gmail.com (Gabriel Lavoie)
Date: Wed, 28 Jul 2010 15:32:38 -0400
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: 
References: 
Message-ID: 

Hello Kevin,
     I don't know if it can be a solution to your problem, but for my Master's
Thesis I'm working on making Stackless Python distributed. What I did is
working but not complete, and I'm right now in the process of writing the
thesis (in French, unfortunately). My code currently works with PyPy's
"stackless" module only and uses some PyPy-specific things. Here's what I
added to Stackless:

- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is
an instance of an interpreter.
- Each tasklet has its global namespace (to avoid sharing of data). The
state is also easier to move to another interpreter this way.
- Distributed channels: All requests are known by all nodes using the
channel.
- Distributed objects: When a reference is sent to a remote node, the object
is not copied; a reference is created using PyPy's proxy object space.
- Automated dependency recovery when an object or a tasklet is loaded on
another interpreter

With a proper scheduler, many tasklets could be automatically spread in
multiple interpreters to use multiple cores or on multiple computers. A bit
like the N:M threading model where N lightweight threads/coroutines can be
executed on M threads.

The API is described here in french but it's pretty straightforward:
https://w3.mutehq.net/wiki/maitrise/API_DStackless

The code is available here (Just click on the Download link next to the
trunk folder):
https://w3.mutehq.net/websvn/wildchild/dstackless/trunk/

You need pypy-c built with --stackless. The code is a bit buggy right now
though...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kevinar18 at hotmail.com  Thu Jul 29 02:59:21 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Wed, 28 Jul 2010 20:59:21 -0400
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
 message passing?
In-Reply-To: 
References: ,
	
Message-ID: 

> I don't know if it can be a solution to your problem but for my Master
> Thesis I'm working on making Stackless Python distributed.

It might be of use.  Thanks for the heads up.  I do have several questions:

1) Is it PyPy's stackless module or Stackless Python (stackless.com)?  Or are they the same module?
2) Do you have a non-https version of the site or one with a publically signed certificate?

P.S. You can send your reply over private email if you want, so as to not bother the list. :)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kevinar18 at hotmail.com  Thu Jul 29 03:33:04 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Wed, 28 Jul 2010 21:33:04 -0400
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: ,
	<20100727062702.GE12699@tunixman.com>,
	,
	
Message-ID: 

As a followup to my earlier post:
"pre-emptive micro-threads utilizing shared memory message passing?"

I am actually finding that the biggest hurdle to accomplishing what I want is the lack of ANY type of shared memory -- even if it is limited.  I wonder if I might ask a question:

Would the following be a possible way to offer a limited type of shared memory:

Summary: create a system very, very similar to POSH, but with differences:

In detail, here's what I mean:
* unlike POSH, utilize OS threads and shared memory (not processes)
* Create a special shared memory location where you can place Python objects
* Each Python object you place into this location can only be accessed (modified) by 1 thread.
* You must manually assign ownership of an object to a particular thread.
* The thread that "owns" the object is the only one that can modify it.
* You can transfer ownership to another thread (but, as always only the owner can modify it).

* There is no GIL when a thread interacts with these special objects.  You can have true thread parallelism if your code uses a lot of these special objects.
* The GIL remains in place for all other data access.
* If your code has a mixture of access to the special objects and regular data, then once you hit a point where a thread starts to interact with data not in the special storage, then that thread must follow GIL rules.

Granted, there might be some difficulty with the GIL part... but I thought I might ask anyways. :)
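Purely to illustrate the shape of this proposal (nothing below exists: the sharedstore module, the Store class and every method name are invented for this sketch):

import threading
import sharedstore                      # hypothetical module implementing the proposal

store = sharedstore.Store()             # the special shared-memory region
buf = store.add([0.0] * 1024)           # a Python object placed into the region
store.set_owner(buf, threading.current_thread())   # only the owner may touch it

def worker(obj):
    # This thread may modify obj only because ownership was transferred to it,
    # and (per the proposal) no GIL would be taken while it works on such objects.
    obj[0] = 42.0

t = threading.Thread(target=worker, args=(buf,))
store.transfer_ownership(buf, t)        # hand the object over before starting the thread
t.start()
t.join()
store.transfer_ownership(buf, threading.current_thread())  # take it back to read the result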

> Date: Wed, 28 Jul 2010 22:54:39 +1000
> Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared memory 	message passing?
> From: william.leslie.ttg at gmail.com
> To: kevinar18 at hotmail.com
> CC: pypy-dev at codespeak.net
> 
> On 28 July 2010 04:20, Kevin Ar18  wrote:
> > I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.
> 
> Does anyone know if there is a central resource for incompatible
> python memory model proposals? I know of Jython, Python-Safethread,
> and Mont-E.
> 
> I do like the idea of MiniAxon, but let me mention a topic that has
> slowly been bubbling to the front of my mind for the last few months.
> 
> Concurrency in the face of shared mutable state is hard. It makes it
> trivial to introduce bugs all over the place. Nondeterminacy related
> bugs are far harder to test, diagnose, and fix than anything else that
> I would almost mandate static verification (via optional typing,
> probably) of task noninterference if I was moving to a concurrent
> environment with shared mutable state. There might be a reasonable
> middle ground where, if a task attempts to violate the required static
> semantics, it fails dynamically. At least then, latent bugs make
> plenty of noise. An example for MiniAxon (as I understand it, which is
> not very well) would be verification that a "task" (including
> functions that the task calls) never closes over and yields the same
> mutable objects, and never mutates globally reachable objects.
> 
> I wonder if you could close such tasks off with a clever subclass of
> the proxy object space that detects and rejects such memory model
> violations? With only semantics that make the program deterministic?
> 
> The moral equivalent would be cooperating processes with a large
> global (or not) shared memory store for immutable objects, queues for
> communication, and the additional semantic that objects in a queue are
> either immutable or the queue holds their only reference. The trouble
> is that it is so hard to work out what immutable really means.
> Non-optional annotations would be not very pythonian.
> 
> -- 
> William Leslie
 		 	   		  
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From alex.gaynor at gmail.com  Thu Jul 29 03:44:23 2010
From: alex.gaynor at gmail.com (Alex Gaynor)
Date: Wed, 28 Jul 2010 20:44:23 -0500
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
Message-ID: 

On Wed, Jul 28, 2010 at 8:33 PM, Kevin Ar18  wrote:
> As a followup to my earlier post:
> "pre-emptive micro-threads utilizing shared memory message passing?"
>
> I am actually finding that the biggest hurdle to accomplishing what I want
> is the lack of ANY type of shared memory -- even if it is limited.  I wonder
> if I might ask a question:
>
> Would the following be a possible way to offer a limited type of shared
> memory:
>
> Summary: create a system very, very similar to POSH, but with differences:
>
> In detail, here's what I mean:
> * unlike POSH, utilize OS threads and shared memory (not processes)
> * Create a special shared memory location where you can place Python objects
> * Each Python object you place into this location can only be accessed
> (modified) by 1 thread.
> * You must manually assign ownership of an object to a particular thread.
> * The thread that "owns" the object is the only one that can modify it.
> * You can transfer ownership to another thread (but, as always only the
> owner can modify it).
>
> * There is no GIL when a thread interacts with these special objects.  You
> can have true thread parallelism if your code uses a lot of these special
> objects.
> * The GIL remains in place for all other data access.
> * If your code has a mixture of access to the special objects and regular
> data, then once you hit a point where a thread starts to interact with data
> not in the special storage, then that thread must follow GIL rules.
>
> Granted, there might be some difficulty with the GIL part... but I thought I
> might ask anyways. :)
>
>> Date: Wed, 28 Jul 2010 22:54:39 +1000
>> Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared memory
>> message passing?
>> From: william.leslie.ttg at gmail.com
>> To: kevinar18 at hotmail.com
>> CC: pypy-dev at codespeak.net
>>
>> On 28 July 2010 04:20, Kevin Ar18  wrote:
>> > I am attempting to experiment with FBP - Flow Based Programming
>> > (http://www.jpaulmorrison.com/fbp/ and book:
>> > http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar
>> > in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some
>> > similarities to Erlang - the share nothing memory model... and on some very
>> > broad levels, there are similarities that can be found in functional
>> > languages.
>>
>> Does anyone know if there is a central resource for incompatible
>> python memory model proposals? I know of Jython, Python-Safethread,
>> and Mont-E.
>>
>> I do like the idea of MiniAxon, but let me mention a topic that has
>> slowly been bubbling to the front of my mind for the last few months.
>>
>> Concurrency in the face of shared mutable state is hard. It makes it
>> trivial to introduce bugs all over the place. Nondeterminacy related
>> bugs are far harder to test, diagnose, and fix than anything else that
>> I would almost mandate static verification (via optional typing,
>> probably) of task noninterference if I was moving to a concurrent
>> environment with shared mutable state. There might be a reasonable
>> middle ground where, if a task attempts to violate the required static
>> semantics, it fails dynamically. At least then, latent bugs make
>> plenty of noise. An example for MiniAxon (as I understand it, which is
>> not very well) would be verification that a "task" (including
>> functions that the task calls) never closes over and yields the same
>> mutable objects, and never mutates globally reachable objects.
>>
>> I wonder if you could close such tasks off with a clever subclass of
>> the proxy object space that detects and rejects such memory model
>> violations? With only semantics that make the program deterministic?
>>
>> The moral equivalent would be cooperating processes with a large
>> global (or not) shared memory store for immutable objects, queues for
>> communication, and the additional semantic that objects in a queue are
>> either immutable or the queue holds their only reference. The trouble
>> is that it is so hard to work out what immutable really means.
>> Non-optional annotations would be not very pythonian.
>>
>> --
>> William Leslie
>
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev
>

Honestly, that sounds really difficult, out and out removing the GIL
would probably be easier.

Alex

-- 
"I disapprove of what you say, but I will defend to the death your
right to say it." -- Voltaire
"The people's good is the highest law." -- Cicero
"Code can always be simpler than you think, but never as simple as you
want" -- Me


From kevinar18 at hotmail.com  Thu Jul 29 04:07:57 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Wed, 28 Jul 2010 22:07:57 -0400
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: ,
	<20100727062702.GE12699@tunixman.com>,
	,
	,
	,
	
Message-ID: 


> Honestly, that sounds really difficult, out and out removing the GIL
> would probably be easier.
Based on the extremely limited info on the GIL, the big issue I noticed was two pieces of code trying to modify the same object at the same time, because of the way objects are stored internally in Python and because of garbage collection.
I figured that if you have special objects which cannot be accessed simultaneously, maybe that would be possible.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From william.leslie.ttg at gmail.com  Thu Jul 29 07:18:57 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Thu, 29 Jul 2010 15:18:57 +1000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
Message-ID: 

On 29 July 2010 11:33, Kevin Ar18  wrote:
> In detail, here's what I mean:
> * unlike POSH, utilize OS threads and shared memory (not processes)
> * Create a special shared memory location where you can place Python objects
> * Each Python object you place into this location can only be accessed
> (modified) by 1 thread.
> * You must manually assign ownership of an object to a particular thread.
> * The thread that "owns" the object is the only one that can modify it.
> * You can transfer ownership to another thread (but, as always only the
> owner can modify it).

When an object is mutable, it must be visible to at most one thread.
This means it can participate in return values, arguments and queues,
but the sender cannot keep a reference to an object it sends, because
if the receiver mutates the object, this will need to be reflected in
the sender's thread to ensure internal consistency. Well, you could
ignore internal consistency, require explicit locking, and have it
segfault when the change to the length of your list has propagated but
not the element you have added, but that wouldn't be much fun. The
alternative, implicitly writing updates back to memory as soon as
possible and reading them out of memory every time, can be hundreds or
more times slower. So you really can't have two tasks sharing mutable
objects, ever.

-- 
William Leslie


From fijall at gmail.com  Thu Jul 29 09:27:05 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 29 Jul 2010 09:27:05 +0200
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
	 
	
	
Message-ID: 

On Thu, Jul 29, 2010 at 7:18 AM, William Leslie
 wrote:
> On 29 July 2010 11:33, Kevin Ar18  wrote:
>> In detail, here's what I mean:
>> * unlike POSH, utilize OS threads and shared memory (not processes)
>> * Create a special shared memory location where you can place Python objects
>> * Each Python object you place into this location can only be accessed
>> (modified) by 1 thread.
>> * You must manually assign ownership of an object to a particular thread.
>> * The thread that "owns" the object is the only one that can modify it.
>> * You can transfer ownership to another thread (but, as always only the
>> owner can modify it).
>
> When an object is mutable, it must be visible to at most one thread.
> This means it can participate in return values, arguments and queues,
> but the sender cannot keep a reference to an object it sends, because
> if the receiver mutates the object, this will need to be reflected in
> the sender's thread to ensure internal consistency. Well, you could
> ignore internal consistency, require explicit locking, and have it
> segfault when the change to the length of your list has propogated but
> not the element you have added, but that wouldn't be much fun. The
> alternative, implicitly writing updates back to memory as soon as
> possible and reading them out of memory every time, can be hundreds or
> more times slower. So you really can't have two tasks sharing mutable
> objects, ever.
>
> --
> William Leslie

Hi.

Do you have any data points supporting your claim?

Cheers,
fijal


From william.leslie.ttg at gmail.com  Thu Jul 29 09:32:57 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Thu, 29 Jul 2010 17:32:57 +1000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
Message-ID: 

On 29 July 2010 17:27, Maciej Fijalkowski  wrote:
> On Thu, Jul 29, 2010 at 7:18 AM, William Leslie
>  wrote:
>> When an object is mutable, it must be visible to at most one thread.
>> This means it can participate in return values, arguments and queues,
>> but the sender cannot keep a reference to an object it sends, because
>> if the receiver mutates the object, this will need to be reflected in
>> the sender's thread to ensure internal consistency. Well, you could
>> ignore internal consistency, require explicit locking, and have it
>> segfault when the change to the length of your list has propogated but
>> not the element you have added, but that wouldn't be much fun. The
>> alternative, implicitly writing updates back to memory as soon as
>> possible and reading them out of memory every time, can be hundreds or
>> more times slower. So you really can't have two tasks sharing mutable
>> objects, ever.
>>
>> --
>> William Leslie
>
> Hi.
>
> Do you have any data points supporting your claim?

About the performance of programs that involve a cache miss on every
memory access, or internal consistency?

-- 
William Leslie


From fijall at gmail.com  Thu Jul 29 09:40:21 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 29 Jul 2010 09:40:21 +0200
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
	 
	
	 
	 
	
Message-ID: 

On Thu, Jul 29, 2010 at 9:32 AM, William Leslie
 wrote:
> On 29 July 2010 17:27, Maciej Fijalkowski  wrote:
>> On Thu, Jul 29, 2010 at 7:18 AM, William Leslie
>>  wrote:
>>> When an object is mutable, it must be visible to at most one thread.
>>> This means it can participate in return values, arguments and queues,
>>> but the sender cannot keep a reference to an object it sends, because
>>> if the receiver mutates the object, this will need to be reflected in
>>> the sender's thread to ensure internal consistency. Well, you could
>>> ignore internal consistency, require explicit locking, and have it
>>> segfault when the change to the length of your list has propogated but
>>> not the element you have added, but that wouldn't be much fun. The
>>> alternative, implicitly writing updates back to memory as soon as
>>> possible and reading them out of memory every time, can be hundreds or
>>> more times slower. So you really can't have two tasks sharing mutable
>>> objects, ever.
>>>
>>> --
>>> William Leslie
>>
>> Hi.
>>
>> Do you have any data points supporting your claim?
>
> About the performance of programs that involve a cache miss on every
> memory access, or internal consistency?
>

I think I lost some implication here. Did I get you right - you claim
that per-object locking in case threads share objects is very
expensive, is that correct? If not, I completely misunderstood you and
my question makes no sense, please explain. If yes, why does it mean a
cache miss on every read/write?

Cheers,
fijal


From william.leslie.ttg at gmail.com  Thu Jul 29 09:57:58 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Thu, 29 Jul 2010 17:57:58 +1000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
	
	
Message-ID: 

On 29 July 2010 17:40, Maciej Fijalkowski  wrote:
> On Thu, Jul 29, 2010 at 9:32 AM, William Leslie
>  wrote:
>> On 29 July 2010 17:27, Maciej Fijalkowski  wrote:
>>> On Thu, Jul 29, 2010 at 7:18 AM, William Leslie
>>>  wrote:
>>>> When an object is mutable, it must be visible to at most one thread.
>>>> This means it can participate in return values, arguments and queues,
>>>> but the sender cannot keep a reference to an object it sends, because
>>>> if the receiver mutates the object, this will need to be reflected in
>>>> the sender's thread to ensure internal consistency. Well, you could
>>>> ignore internal consistency, require explicit locking, and have it
>>>> segfault when the change to the length of your list has propogated but
>>>> not the element you have added, but that wouldn't be much fun. The
>>>> alternative, implicitly writing updates back to memory as soon as
>>>> possible and reading them out of memory every time, can be hundreds or
>>>> more times slower. So you really can't have two tasks sharing mutable
>>>> objects, ever.
>>>>
>>>> --
>>>> William Leslie
>>>
>>> Hi.
>>>
>>> Do you have any data points supporting your claim?
>>
>> About the performance of programs that involve a cache miss on every
>> memory access, or internal consistency?
>>
>
> I think I lost some implication here. Did I get you right - you claim
> that per-object locking in case threads share obejcts are very
> expensive, is that correct? If not, I completely misunderstood you and
> my question makes no sense, please explain. If yes, why does it mean a
> cache miss on every read/write?

I claim that there are two alternatives in the face of one thread
mutating an object and the other observing:

0. You can give up consistency and do fine-grained locking, which is
reasonably fast but error prone, or
1. Expect python to handle all of this for you, effectively not making
a change to the memory model. You could do this with implicit
per-object locks which might be reasonably fast in the absence of
contention, but not when several threads are trying to use the object.

Queues already are in a sense your per-object-lock,
one-thread-mutating, but usually one thread has acquire semantics and
one has release semantics, and that combination actually works. It's
when you expect to have a full memory barrier that is the problem.
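To make that queue discipline concrete, a minimal stdlib sketch (nothing PyPy-specific is assumed): the put is the release, the get is the acquire, and the sender drops its reference so only one thread ever mutates the object at a time.

import threading
import Queue                     # named "queue" on Python 3

q = Queue.Queue()

def sender():
    frame = {"pixels": [0] * 16}
    q.put(frame)                 # release: the queue's lock publishes our writes
    frame = None                 # drop the reference; we must not touch it again

def receiver():
    frame = q.get()              # acquire: we now see everything the sender wrote
    frame["pixels"][0] = 255     # safe: we hold the only reference by convention

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start()
t1.join(); t2.join()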

Come to think of it, you might be right Kevin: as long as only one
thread mutates the object, the mutating thread never /needs/ to
acquire, as it knows that it has the latest revision.

Have I missed something?

-- 
William Leslie


From fijall at gmail.com  Thu Jul 29 10:02:30 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 29 Jul 2010 10:02:30 +0200
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
	 
	
	 
	 
	 
	 
	
Message-ID: 

On Thu, Jul 29, 2010 at 9:57 AM, William Leslie
 wrote:
> On 29 July 2010 17:40, Maciej Fijalkowski  wrote:
>> On Thu, Jul 29, 2010 at 9:32 AM, William Leslie
>>  wrote:
>>> On 29 July 2010 17:27, Maciej Fijalkowski  wrote:
>>>> On Thu, Jul 29, 2010 at 7:18 AM, William Leslie
>>>>  wrote:
>>>>> When an object is mutable, it must be visible to at most one thread.
>>>>> This means it can participate in return values, arguments and queues,
>>>>> but the sender cannot keep a reference to an object it sends, because
>>>>> if the receiver mutates the object, this will need to be reflected in
>>>>> the sender's thread to ensure internal consistency. Well, you could
>>>>> ignore internal consistency, require explicit locking, and have it
>>>>> segfault when the change to the length of your list has propogated but
>>>>> not the element you have added, but that wouldn't be much fun. The
>>>>> alternative, implicitly writing updates back to memory as soon as
>>>>> possible and reading them out of memory every time, can be hundreds or
>>>>> more times slower. So you really can't have two tasks sharing mutable
>>>>> objects, ever.
>>>>>
>>>>> --
>>>>> William Leslie
>>>>
>>>> Hi.
>>>>
>>>> Do you have any data points supporting your claim?
>>>
>>> About the performance of programs that involve a cache miss on every
>>> memory access, or internal consistency?
>>>
>>
>> I think I lost some implication here. Did I get you right - you claim
>> that per-object locking in case threads share obejcts are very
>> expensive, is that correct? If not, I completely misunderstood you and
>> my question makes no sense, please explain. If yes, why does it mean a
>> cache miss on every read/write?
>
> I claim that there are two alternatives in the face of one thread
> mutating an object and the other observing:
>
> 0. You can give up consistency and do fine-grained locking, which is
> reasonably fast but error prone, or
> 1. Expect python to handle all of this for you, effectively not making
> a change to the memory model. You could do this with implicit
> per-object locks which might be reasonably fast in the absence of
> contention, but not when several threads are trying to use the object.
>
> Queues already are in a sense your per-object-lock,
> one-thread-mutating, but usually one thread has acquire semantics and
> one has release semantics, and that combination actually works. It's
> when you expect to have a full memory barrier that is the problem.
>
> Come to think of it, you might be right Kevin: as long as only one
> thread mutates the object, the mutating thread never /needs/ to
> acquire, as it knows that it has the latest revision.
>
> Have I missed something?
>
> --
> William Leslie
>

So my question is why you think 1. is really expensive (can you find
evidence)? I don't see what it has to do with cache misses. Besides,
in Python you cannot guarantee much about the mutability of objects. So
you don't know if an object passed in a queue is mutable or not, unless
you restrict yourself to some very simple types (in which case there
is no shared memory, since you only pass immutable objects).

Cheers,
fijal


From william.leslie.ttg at gmail.com  Thu Jul 29 10:50:52 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Thu, 29 Jul 2010 18:50:52 +1000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
	
	
	
	
Message-ID: 

On 29 July 2010 18:02, Maciej Fijalkowski  wrote:
> On Thu, Jul 29, 2010 at 9:57 AM, William Leslie
>  wrote:
>> I claim that there are two alternatives in the face of one thread
>> mutating an object and the other observing:
>>
>> 0. You can give up consistency and do fine-grained locking, which is
>> reasonably fast but error prone, or
>> 1. Expect python to handle all of this for you, effectively not making
>> a change to the memory model. You could do this with implicit
>> per-object locks which might be reasonably fast in the absence of
>> contention, but not when several threads are trying to use the object.
>>
>> Queues already are in a sense your per-object-lock,
>> one-thread-mutating, but usually one thread has acquire semantics and
>> one has release semantics, and that combination actually works. It's
>> when you expect to have a full memory barrier that is the problem.
>>
>> Come to think of it, you might be right Kevin: as long as only one
>> thread mutates the object, the mutating thread never /needs/ to
>> acquire, as it knows that it has the latest revision.
>>
>> Have I missed something?
>>
>> --
>> William Leslie
>>
>
> So my question is why you think 1. is really expensive (can you find
> evidence). I don't see what is has to do with cache misses. Besides,
> in python you cannot guarantee much about mutability of objects. So
> you don't know if object passed in a queue is mutable or not, unless
> you restrict yourself to some very simlpe types (in which case there
> is no shared memory, since you only pass immutable objects).

If task X expects that task Y will mutate some object it has, it needs
to go back to the source for every read. This means that if you do use
mutation of some shared object for communication, it needs to be
synchronised before every access. What this means for us is that every
read from a possibly mutable object requires an acquire, and every
write requires a release. It's as if every reference in the program is
implemented with a volatile pointer. Even if the object is never
mutated, there can be a lot of unnecessary bus chatter waiting for
MESI to tell us so.

-- 
William Leslie


From fijall at gmail.com  Thu Jul 29 10:55:25 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 29 Jul 2010 10:55:25 +0200
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
	 
	
	 
	 
	 
	 
	 
	 
	
Message-ID: 

On Thu, Jul 29, 2010 at 10:50 AM, William Leslie
 wrote:
> On 29 July 2010 18:02, Maciej Fijalkowski  wrote:
>> On Thu, Jul 29, 2010 at 9:57 AM, William Leslie
>>  wrote:
>>> I claim that there are two alternatives in the face of one thread
>>> mutating an object and the other observing:
>>>
>>> 0. You can give up consistency and do fine-grained locking, which is
>>> reasonably fast but error prone, or
>>> 1. Expect python to handle all of this for you, effectively not making
>>> a change to the memory model. You could do this with implicit
>>> per-object locks which might be reasonably fast in the absence of
>>> contention, but not when several threads are trying to use the object.
>>>
>>> Queues already are in a sense your per-object-lock,
>>> one-thread-mutating, but usually one thread has acquire semantics and
>>> one has release semantics, and that combination actually works. It's
>>> when you expect to have a full memory barrier that is the problem.
>>>
>>> Come to think of it, you might be right Kevin: as long as only one
>>> thread mutates the object, the mutating thread never /needs/ to
>>> acquire, as it knows that it has the latest revision.
>>>
>>> Have I missed something?
>>>
>>> --
>>> William Leslie
>>>
>>
>> So my question is why you think 1. is really expensive (can you find
>> evidence). I don't see what is has to do with cache misses. Besides,
>> in python you cannot guarantee much about mutability of objects. So
>> you don't know if object passed in a queue is mutable or not, unless
>> you restrict yourself to some very simlpe types (in which case there
>> is no shared memory, since you only pass immutable objects).
>
> If task X expects that task Y will mutate some object it has, it needs
> to go back to the source for every read. This means that if you do use
> mutation of some shared object for communication, it needs to be
> synchronised before every access. What this means for us is that every
> read from a possibly mutable object requires an acquire, and every
> write requires a release. It's as if every reference in the program is
> implemented with a volatile pointer. Even if the object is never
> mutated, there can be a lot of unnecessary bus chatter waiting for
> MESI to tell us so.
>

I do agree there is an overhead. Can you provide some data on how much
this overhead is? Python is not a very simple language and a lot of
things are complex and time consuming, so I wonder how it compares to
locking per object.


From sparks.m at gmail.com  Thu Jul 29 11:44:52 2010
From: sparks.m at gmail.com (Michael Sparks)
Date: Thu, 29 Jul 2010 10:44:52 +0100
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
Message-ID: 

Would comments from a project using this approach in real systems be
of interest/use/help? Whilst I didn't know about Morrison's FBP
(Balzer's work predates him btw - don't listen to hype) I had heard of
(and played with) Occam among other more influential things, and
Kamaelia is a real tool. Also there is already a pre-existing FBP tool
for Stackless, and then historically there's also MASCOT & friends. It
just looks to me that you're tying yourself up in knots over things
that aren't problems, when there are some things which could be useful
(in practice) & interesting in this space.

Oh, incidentally, Mini Axon is a toy/teaching/testing system - as the
name suggests. The main Axon is more complete -- in the areas we've
needed - it's been driven by real system needs.

(for those who don't know me, Kamaelia is my project, I don't bite,
but I do sometimes talk or type fast)

Regards,

Michael Sparks
--
http://www.kamaelia.org/PragmaticConcurrency.html
http://yeoldeclue.com/blog

On 7/29/10, Kevin Ar18  wrote:
> As a followup to my earlier post:
> "pre-emptive micro-threads utilizing shared memory message passing?"
>
> I am actually finding that the biggest hurdle to accomplishing what I want
> is the lack of ANY type of shared memory -- even if it is limited.  I wonder
> if I might ask a question:
>
> Would the following be a possible way to offer a limited type of shared
> memory:
>
> Summary: create a system very, very similar to POSH, but with differences:
>
> In detail, here's what I mean:
> * unlike POSH, utilize OS threads and shared memory (not processes)
> * Create a special shared memory location where you can place Python objects
> * Each Python object you place into this location can only be accessed
> (modified) by 1 thread.
> * You must manually assign ownership of an object to a particular thread.
> * The thread that "owns" the object is the only one that can modify it.
> * You can transfer ownership to another thread (but, as always only the
> owner can modify it).
>
> * There is no GIL when a thread interacts with these special objects.  You
> can have true thread parallelism if your code uses a lot of these special
> objects.
> * The GIL remains in place for all other data access.
> * If your code has a mixture of access to the special objects and regular
> data, then once you hit a point where a thread starts to interact with data
> not in the special storage, then that thread must follow GIL rules.
>
> Granted, there might be some difficulty with the GIL part... but I thought I
> might ask anyways. :)
>
>> Date: Wed, 28 Jul 2010 22:54:39 +1000
>> Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared memory
>> 	message passing?
>> From: william.leslie.ttg at gmail.com
>> To: kevinar18 at hotmail.com
>> CC: pypy-dev at codespeak.net
>>
>> On 28 July 2010 04:20, Kevin Ar18  wrote:
>> > I am attempting to experiment with FBP - Flow Based Programming
>> > (http://www.jpaulmorrison.com/fbp/ and book:
>> > http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very
>> > similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there
>> > are some similarities to Erlang - the share nothing memory model... and
>> > on some very broad levels, there are similarities that can be found in
>> > functional languages.
>>
>> Does anyone know if there is a central resource for incompatible
>> python memory model proposals? I know of Jython, Python-Safethread,
>> and Mont-E.
>>
>> I do like the idea of MiniAxon, but let me mention a topic that has
>> slowly been bubbling to the front of my mind for the last few months.
>>
>> Concurrency in the face of shared mutable state is hard. It makes it
>> trivial to introduce bugs all over the place. Nondeterminacy related
>> bugs are far harder to test, diagnose, and fix than anything else that
>> I would almost mandate static verification (via optional typing,
>> probably) of task noninterference if I was moving to a concurrent
>> environment with shared mutable state. There might be a reasonable
>> middle ground where, if a task attempts to violate the required static
>> semantics, it fails dynamically. At least then, latent bugs make
>> plenty of noise. An example for MiniAxon (as I understand it, which is
>> not very well) would be verification that a "task" (including
>> functions that the task calls) never closes over and yields the same
>> mutable objects, and never mutates globally reachable objects.
>>
>> I wonder if you could close such tasks off with a clever subclass of
>> the proxy object space that detects and rejects such memory model
>> violations? With only semantics that make the program deterministic?
>>
>> The moral equivalent would be cooperating processes with a large
>> global (or not) shared memory store for immutable objects, queues for
>> communication, and the additional semantic that objects in a queue are
>> either immutable or the queue holds their only reference. The trouble
>> is that it is so hard to work out what immutable really means.
>> Non-optional annotations would be not very pythonian.
>>
>> --
>> William Leslie
>


From william.leslie.ttg at gmail.com  Thu Jul 29 15:15:32 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Thu, 29 Jul 2010 23:15:32 +1000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
	
	
	
	
	
	
Message-ID: 

On 29 July 2010 18:55, Maciej Fijalkowski  wrote:
> On Thu, Jul 29, 2010 at 10:50 AM, William Leslie
>  wrote:
>> If task X expects that task Y will mutate some object it has, it needs
>> to go back to the source for every read. This means that if you do use
>> mutation of some shared object for communication, it needs to be
>> synchronised before every access. What this means for us is that every
>> read from a possibly mutable object requires an acquire, and every
>> write requires a release. It's as if every reference in the program is
>> implemented with a volatile pointer. Even if the object is never
>> mutated, there can be a lot of unnecessary bus chatter waiting for
>> MESI to tell us so.
>>
>
> I do agree there is an overhead. Can you provide some data how much
> this overhead is? Python is not a very simple language and a lot of
> things are complex and time consuming, so I wonder how it compares to
> locking per object.

It *is* locking per object, but you also spend time looking for the
data if someone else has invalidated your cache line.

Come to think of it, that isn't as bad as it first seemed to me. If
the sender never mutates the object, it will Just Work on any machine
with a fairly flat cache architecture.

Sorry. Carry on.

-- 
William Leslie


From kevinar18 at hotmail.com  Thu Jul 29 18:56:28 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Thu, 29 Jul 2010 12:56:28 -0400
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: ,
	<20100727062702.GE12699@tunixman.com>,
	,
	,
	,
	,
	,
	,
	,
	
Message-ID: 


> I claim that there are two alternatives in the face of one thread
> mutating an object and the other observing:
Well, I did consider the possibility of one thread being able to change an object while the others observe, but I have no idea if that is too complicated, as you are suggesting.
However, that is not even necessary.  An even more limited form would work fine (at least for me):
 
Two possible modes:
Read/Write from 1 thread:
* ONLY one thread can change and observe (read) -- no other threads have access of any kind or even know of its existence until you transfer control to another thread (then only the thread you transferred control to has access).
(Optional) read-only from all threads:
* Optionally, you could have objects that are in read-only mode that all threads can observe.
 
To make things easier, maybe special GIL-free threads could be added.  (They would still be OS-level threads, but with special properties in Python.) These threads would have the property that they could ONLY access data stored in the special object store to which they have read/write privilege.  They can't access other objects not in the special store.  As a result, these special threads would be free of the GIL and could run in parallel.

> Queues already are in a sense your per-object-lock,
> one-thread-mutating, but usually one thread has acquire semantics and
> one has release semantics, and that combination actually works. It's
> when you expect to have a full memory barrier that is the problem.

Now you brought up something interesting: queues
To be honest, something like queues and pipes would be good enough for my purposes -- if they used shared memory.  Currently, the implementation of queues and pipes in the multiprocessing module seems rather costly, as they use processes and require copying data back and forth.
In particular, what would be useful:
 
* A queue that holds self-contained Python objects (with no pointers/references to other data not in the queue so as to prevent threading issues)
* The queue can be accessed by all special threads simultaneously (in parallel).  You would only need locks around queue operations, but that is pretty easy to do -- unless there is some hidden Interpreter problem that would make this easy task hard.
* Streaming buffers -- like a file buffer or something similar, so you can send data from one thread to another as it comes in (when you don't know when it will end or it may never end).  Only two threads have access: one to put data in, the other to extract it.
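A rough stdlib sketch of that last item -- a bounded streaming buffer owned by exactly two threads, with locking confined to the buffer operations. This only illustrates the interface; it is still subject to the GIL, which is precisely what is being complained about.

import collections
import threading

class StreamBuffer(object):
    """One writer thread, one reader thread, bounded capacity."""

    def __init__(self, maxsize=1024):
        self._buf = collections.deque()
        self._maxsize = maxsize
        self._cond = threading.Condition()

    def put(self, chunk):
        with self._cond:
            while len(self._buf) >= self._maxsize:
                self._cond.wait()        # writer blocks while the buffer is full
            self._buf.append(chunk)
            self._cond.notify()

    def get(self):
        with self._cond:
            while not self._buf:
                self._cond.wait()        # reader blocks until data arrives
            chunk = self._buf.popleft()
            self._cond.notify()
            return chunk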
 
> 0. You can give up consistency and do fine-grained locking, which is
> reasonably fast but error prone, or
> 1. Expect python to handle all of this for you, effectively not making
> a change to the memory model. You could do this with implicit
> per-object locks which might be reasonably fast in the absence of
> contention, but not when several threads are trying to use the object.
> 
...
> 
> Come to think of it, you might be right Kevin: as long as only one
> thread mutates the object, the mutating thread never /needs/ to
> acquire, as it knows that it has the latest revision.
> 
> Have I missed something?
I'm afraid I don't know enough about Python's Interpreter to say much.  The only way would be for me to do some studying on interpreters/compilers and get digging into the codebase -- and I'm not sure how much time I have to do that right now. :)
Perhaps the part about one thread only having read & write changes the situation?
 
One possible implementation might be similar to how POSH does it:
Now, I'm not suggesting this because I know enough to say it is possible, but just to put something out there that might work.
Create a special virtual memory address or lookup table for each thread.  When you assign a read+write object to a thread, it gets added to the virtual address/memory table.
Optionally, it could be up to the programmer to make sure they don't try to access data from a thread that does not have ownership/control of that object.  If a programmer does try to access it, it would fail as the memory address would point to nowhere/bad data/etc....
 
Of course, there are probably other, better ways to do it that are not as fickle as this... but I don't know if the limitations of the Python Interpreter and GIL would allow better methods. 		 	   		  
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From andrewfr_ice at yahoo.com  Thu Jul 29 18:56:52 2010
From: andrewfr_ice at yahoo.com (Andrew Francis)
Date: Thu, 29 Jul 2010 09:56:52 -0700 (PDT)
Subject: [pypy-dev] pypy-dev Digest, Vol 360, Issue 13
In-Reply-To: 
Message-ID: <680164.96893.qm@web120007.mail.ne1.yahoo.com>

Hi Kevin:

Message: 1
Date: Tue, 27 Jul 2010 14:20:10 -0400
From: Kevin Ar18 
Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared
    memory message passing?
To: 
Message-ID: 
Content-Type: text/plain; charset="iso-8859-1"


> I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf)  There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html  Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.

I just came back from EuroPython. A lot of discussion on concurrency....

Well, in functional languages (like Erlang), variables tend to be immutable. This is a bonus in a concurrent system - it makes it easier to reason about the system - and helps to avoid various race conditions. As for shared memory: I think there is a difference between whether things are shared at the application-programmer level, or under the hood, controlled by the system. Programmers tend to be bad at the former.

>http://www.kamaelia.org/MiniAxon.html

I took a quick look. Maybe I am biased but Stackless Python gives you most of that. Also tasklets and channels can do everything a generator can and more (a generator is more specialised than a coroutine). Also it is easy to mimic asynchrony with a CSP style messaging system where microthreads and channels are cheap. A line from the book "Actors: A Model of Concurrent Computation in Distributed Systems" by Gul A. Agha comes to mind: "synchrony is mere buffered asynchrony."

> The process of connecting the boxes together was actually designed to be programmed visually, as you can see from the examples in the book (I have no idea if it works well, as I am merely starting to experiment with it).

What bought me to Stackless Python and PyPy was work concerning WS-BPEL. Allegedly, WS-BPEL/XLang/WSFL (Web-Services Flow Language) are based on formalisms like pi calculus.

Since I don't own a multi-core machine and I am not doing CPU intense stuff, I never really cared. However I have been doing things where I needed to impose logical orderings upon processes (i.e., process C can only run after process A and B are finished). My initial native uses of Stackless (easy to do in anything system based on CSP), resulted in deadlocking the system. So I found understanding deadlock to be very important.

> Each box, being a self contained "program," the only data it has access to is 3 parts:

> Implementation of FBP requires a custom scheduler for several reasons:
> (1) A box can only run if it has actual data on the "in port(s)"  Thus, the scheduler would only schedule boxes to run when they can actually process some data.

Stackless Python already works like this. No custom scheduler needed. I would recommend you read Rob Pike's paper "The Implementation of Newsqueak" or some of the Cardelli papers to understand how CSP constructs with channels work. And if you need to customize schedulers, you have two routes: 1) use the pre-existing classes and API, or 2) experiment with PyPy's stackless.py.

> (2) In theory, it may be possible to end up with hundreds or thousands of these light weight boxes.  Using heavy-weight OS threads or processes for every one is out of the question.

Stackless Python.

> In a perfect world, here's what I might do:
> * Assume a quad core cpu
> (1) Spawn 1 process
> (2) Spawn 4 threads & assign each thread to only 1 core -- in other words, don't let the OS handle moving threads around to different cores
> (3) Inside each thread, have a mini scheduler that switches back and forth between the many micro-threads (or "boxes") -- note that the OS should not handle any of the switching between micro-threads/boxes as it does it all wrong (and is too heavyweight) for this situation.
> (4) Using a shared memory queue, each of the 4 schedulers can get the next box to run... or add more boxes to the schedule queue.

My advice: get stuff properly working under a single threaded model first so you understand the machinery. That said, I think Carlos Eduardo de Paula a few years ago played with adapting Stackless for multi-processing.

Second piece of advice: start looking at how Go does things. Stackless Python and Go share a common ancestor. However Go does much more on the multi-core front.

Cheers,
Andrew


      



From kevinar18 at hotmail.com  Thu Jul 29 19:35:14 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Thu, 29 Jul 2010 13:35:14 -0400
Subject: [pypy-dev] pypy-dev Digest, Vol 360, Issue 13
In-Reply-To: <680164.96893.qm@web120007.mail.ne1.yahoo.com>
References: ,
	<680164.96893.qm@web120007.mail.ne1.yahoo.com>
Message-ID: 


> Well functional languages (like Erlang), variables tend to be immutable. This is a bonus in a concurrent system - makes it easier to reason about the system - and helps to avoid various race conditions. As for the shared memory. I think there is a difference between whether things are shared at the application programmer level, or under the hood controlled by the system. Programmers tend to beare bad at the former. 

You're right... and I am actually talking about non-shared memory from the perspective of the programmer, but under the hood, it MUST use shared memory for the implementation.  The problem I am running into is that there is no way to implement it under the hood, because there is no way to do shared memory in Python.
 
Thanks for bringing that up.  Maybe that will clarify what I was going on about. :)
 
> I took a quick look. Maybe I am biased but Stackless Python gives you most of that. Also tasklets and channels can do everything a generator can and more (a generator is more specialised than a coroutine). Also it is easy to mimic asynchrony with a CSP style messaging system where microthreads and channels are cheap. A line from the book "Actors: A Model of Concurrent Computation in Distributed Systems" by Gul A. Agha comes to mind: "synchrony is mere buffered asynchrony."

Agreed.  Stuff like the stackless module in PyPy, greenlets, Twisted, and others do offer some useful options that are even better than generators...  I could definitely make use of them for some of the broader implementation details.  However, the problem is always that there is no way to make them parallel within Python itself, because there is no shared memory that I can use for the "under the hood" implementation.
 
Now, if there is a true parallel implementation of stackless, greenlets, twisted, etc... maybe it could fit my purposes... but I'd have to check.  I did some basic searching on various Python threading implementations in the past and didn't really find one that did... but, like you suggested, maybe there is one out there somewhere.
 
> >The process of connecting the boxes together was actually designed to be >programmed visually, as you can see from the examples in the book (I have >no idea if it works well, as I am merely starting to experiment with it).
> 
> What bought me to Stackless Python and PyPy was work concerning WS-BPEL. Allegedly, WS-BPEL/XLang/WSFL (Web-Services Flow Language) are based on formalisms like pi calculus.
> 
> Since I don't own a multi-core machine and I am not doing CPU intense stuff, I never really cared. However I have been doing things where I needed to impose logical orderings upon processes (i.e., process C can only run after process A and B are finished). My initial native uses of Stackless (easy to do in anything system based on CSP), resulted in deadlocking the system. So I found understanding deadlock to be very important.
> 
Thanks... and, uh, about all I can do is bookmark this for later.  Really, thanks for the links; I may very well want to research each and every one of these at some point and see what I can learn from each one.  If you have more stuff like that, feel free to let me know. :)
 
> My advice: get stuff properly working under a single threaded model first so you understand the machinery. That said, I think Carlos Eduardo de Paula a few years ago played with adapting Stackless for multi-processing.
Yeah, I've been considering that.  Maybe I'll just go ahead with a single threaded implementation... and if I feel like it, I could always try to edit PyPy or one of the other implementations later (although I probably never will due to time constraints :) ).  Still, I figured I might as well ask around and see if it was possible to do a parallel implementation sooner.
 
Or... what I may end up doing is using the slow multiprocessing module and queues.  Granted, it will probably be slow since it doesn't use shared memory "under the hood", but it would be parallel.
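For reference, that fallback with the stdlib multiprocessing module might look roughly like the sketch below; the data really is pickled and copied through the queues, which is exactly the overhead in question, but the two "boxes" do run in parallel.

from multiprocessing import Process, Queue

def box(inport, outport):
    # A self-contained "box": read from the in-port, write to the out-port.
    while True:
        item = inport.get()
        if item is None:
            outport.put(None)
            break
        outport.put(item * 2)

if __name__ == "__main__":
    q_in, q_out = Queue(), Queue()
    p = Process(target=box, args=(q_in, q_out))
    p.start()
    for i in range(3):
        q_in.put(i)
    q_in.put(None)                 # sentinel: no more work
    results = []
    while True:
        r = q_out.get()
        if r is None:
            break
        results.append(r)
    p.join()
    print(results)                 # [0, 2, 4]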
 
> Second piece of advice: start looking at how Go does things. Stackless Python and Go share a common ancestor. However Go does much more on the multi-core front.
I have looked at Go's goroutines... albeit briefly.  I noticed that they are co-operative like stackless and, based on your comments, I'm guessing they work on multiple cores?  I was really disappointed that they were not pre-emptive, however.  I haven't really looked much into it beyond that, but maybe I'll give it another look; but using it would mean not using Python. :(
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kevinar18 at hotmail.com  Thu Jul 29 19:44:39 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Thu, 29 Jul 2010 13:44:39 -0400
Subject: [pypy-dev] FW: Would the following shared memory model be possible?
In-Reply-To: 
References: ,
	<20100727062702.GE12699@tunixman.com>,
	,
	,
	,
	,
	
Message-ID: 


> Would comments from a project using this approach in real systems be
> of interest/use/help? Whilst I didn't know about Morrison's FBP
> (Balzer's work predates him btw - don't listen to hype) I had heard of
> (and played with) Occam among other more influential things, and
> Kamaelia is a real tool. Also there is already a pre-existing FBP tool
> for Stackless, and then historically there's also MASCOT & friends. It

You brought up a lot of topics.  I went ahead and sent you a private email.  There's always lots of interesting things I can add to my list of things to learn about. :)
 
> just looks to me that you're tieing yourself up in knots over things
> that aren't problems, when there are some things which could be useful
> (in practice) & interesting in this space.
The particular issue in this situation is that there is no way to make Kamaelia, FBP, or other concurrency concepts run in parallel (unless you are willing to accept lots of overhead like with the multiprocessing queues).
 
Since you have worked with Kamaelia code a lot... you understand a lot more about implementation details.  Do you think the previous shared memory concept or something like it would let you make Kamaelia parallel?
If not, can you think of any method that would let you make Kamaelia parallel?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kevinar18 at hotmail.com  Thu Jul 29 20:02:38 2010
From: kevinar18 at hotmail.com (Kevin Ar18)
Date: Thu, 29 Jul 2010 14:02:38 -0400
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
 message passing?
In-Reply-To: 
References: ,
	
Message-ID: 


> Hello Kevin,
> I don't know if it can be a solution to your problem but for my
> Master Thesis I'm working on making Stackless Python distributed. What
> I did is working but not complete and I'm right now in the process of
> writing the thesis (in french unfortunately). My code currently works
> with PyPy's "stackless" module onlyis and use some PyPy specific
> things. Here's what I added to Stackless:
>
> - Possibility to move tasklets easily (ref_tasklet.move(node_id)). A
> node is an instance of an interpreter.
> - Each tasklet has its global namespace (to avoid sharing of data). The
> state is also easier to move to another interpreter this way.
> - Distributed channels: All requests are known by all nodes using the
> channel.
> - Distributed objets: When a reference is sent to a remote node, the
> object is not copied, a reference is created using PyPy's proxy object
> space.
> - Automated dependency recovery when an object or a tasklet is loaded
> on another interpreter
>
> With a proper scheduler, many tasklets could be automatically spread in
> multiple interpreters to use multiple cores or on multiple computers. A
> bit like the N:M threading model where N lightweight threads/coroutines
> can be executed on M threads.

Was able to have a look at the API...
If others don't mind my asking this on the mailing list:
 
* .send() and .receive()
What type of data can you send and receive between the tasklets?  Can you pass entire Python objects?
 
* .send() and .receive() memory model
When you send data between tasklets (pass messages), or whatever you want to call it, how is this implemented under the hood?  Does it use shared memory under the hood, or does it involve a more costly copying of the data?  I realize that if it is on another machine you have to copy the data, but what about between two threads?  You mentioned PyPy's proxy object.... guess I'll need to read up on that.

From sparks.m at gmail.com  Thu Jul 29 19:21:25 2010
From: sparks.m at gmail.com (Michael Sparks)
Date: Thu, 29 Jul 2010 18:21:25 +0100
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	
	
Message-ID: <201007291821.26318.sparks.m@gmail.com>

I make it a point these days to only reply on-list. It leads to endless 
repetition otherwise. If you repost this cc'ing the pypy-dev list I'll reply. 
If you think it's off topic there, then I see no point.


Michael.

On Thursday 29 July 2010 18:05:27 you wrote:
> Thanks for the reply.
> 
> > Would comments from a project using this approach in real systems be
> > of interest/use/help?
> 
> I contacted someone from Kamaelia a while back (probably you).
> Yes, use of the dataflow concept would be really useful (no
> MIT/BSD/Python/PD license).  However, licensing was an issues, so I went
> it on my own.  I find the concept rather interesting both to maybe learn
> from and to actually try and use in an actual application.
> 
> > Whilst I didn't know about Morrison's FBP
> > (Balzer's work predates him btw - don't listen to hype) I had heard of
> > (and played with) Occam among other more influential things, and
> > Kamaelia is a real tool.
> 
> What is this Balzer and Occam? :)  Do you have any links I can look at?
> 
> > Also there is already a pre-existing FBP tool
> > for Stackless
> 
> The problem is that Stackless is not parallel, which is what I would really
> like to do.
> 
> > , and then historically there's also MASCOT & friends.
> 
> Do you have a link about this?

-- 
>>>


From andrewfr_ice at yahoo.com  Thu Jul 29 22:39:16 2010
From: andrewfr_ice at yahoo.com (Andrew Francis)
Date: Thu, 29 Jul 2010 13:39:16 -0700 (PDT)
Subject: [pypy-dev] Would the following shared memory model be possible?
Message-ID: <557968.57023.qm@web120009.mail.ne1.yahoo.com>

Hi Michael:

--- On Thu, 7/29/10, Michael Sparks  wrote:

> It's a pity we didn't get a chance to chat at the
> conference. (I was the one videoing everything for upload after
> transcoding :)

Yes, I noticed. I gave the talk "Prototyping Go's Select with stackless.py for Stackless Python." Much of that talk dealt with rendezvous semantics, courtesy of synchronous channels.

I will post the original slides and the revised version (mistakes corrected) in a day or two.
 
> > >http://www.kamaelia.org/MiniAxon.html
> > 

> I'm biased towards Kamaelia (naturally :-), but I agree. MiniAxon is just
> a toy/tutorial. Early in Kamaelia's history we considered using Stackless,
> but rejected it simply because we wanted to work mainly with mainline
> Python, rather than a specialised version.

Fair enough. Currently Stackless Python is being integrated with Psyco and will be available as a module. 

> Other things in Stackless's favour (IIRC) include the fact that you can
> pickle generators, and send them across a network connection, unpickle
> them and let them continue running. I don't know if you do the same with
> tasklets, but I wouldn't be surprised if you do :)

As long as you do not have a C Frame involved, you can pickle a tasklet.
That was the subject of my "Silly Stackless Python Trick" lightning talk.
I was going to demonstrate a version of the Sieve of Eratosthenes that could be pickled and resumed on another machine. However my HP Netbook had a non-standard VGA output connection and I needed to install Stackless
on a loaner ThinkPad that died as I hooked it up. However you saw all
that :-(
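A minimal sketch of that trick, assuming Stackless Python (or pypy-c --stackless with its stackless module); in principle the pickled bytes could just as well be written to disk or shipped to another machine and resumed there.

import pickle
import stackless

ch = stackless.channel()

def recurse(depth, level=1):
    if level >= depth:
        ch.send(level)              # pause here, several pure-Python frames deep
    else:
        recurse(depth, level + 1)

t = stackless.tasklet(recurse)(9)
deepest = ch.receive()              # runs the tasklet until it reaches the send()
blob = pickle.dumps(t)              # serialize the paused tasklet (no C frames involved)
t.kill()
clone = pickle.loads(blob)          # could equally happen in another interpreter
clone.run()                         # the clone resumes after the send and unwinds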

> That means you have potential for process migration.

Yep. Gabriel Lavoie does a lot of work in that area with PyPy (thanks Gabriel !)

> Doing that sensibly though IMO would require better understanding in the
> system of what the user is trying to achieve and what they're sending.
> (It's easy to think of examples where this causes more pain than it's
> worth after all)

You have to understand what can be pickled. Occasionally you are in for
a surprise (e.g. functools).

>You could argue in that case that the biggest _real_ difference is 
>that we try to use a unified API for different concurrency
>methods. 

Well, I would argue that Stackless has a simple, elegant model. The addition of select just adds more power. Stackless channels can also serve as generators (they are iterable). I recently took a stab at writing the Sleeping Barber problem. I think in Stackless the basic solution was about 30 lines. Very little clutter.

> One **highly subjective** other thing in our favour is
> that generators are limited to a single level of control flow
> (ie non-nestable without a trampoline). This doesn't sound like
> an advantage, but it tends to lead to simpler components which are
> in turn reusable. (and that I view as useful :)

Okay. I attended Ray Hettinger's talk on Monocle. In the past
I have encountered situations where I bumped up against the nesting problem.
If I recall, the problem involved request handlers that had an RPC style AND made additional Twisted deferred calls:

class MyRequestHandler(...):
   
    @defer.inlineCallbacks
    def process(self):
        try:
            result = yield client.getPage("http://www.google.com")
        except Exception, err:
            log.err(err, "process getPage call failed")
        else:
            # do some processing with the result 
            return result

looks reasonable, but Python will balk: nested generators. The only way around it was to hope that the Twisted protocol was properly written and to chain deferreds.

> Have fun,

I do :-)

Cheers,
Andrew



      



From p.giarrusso at gmail.com  Thu Jul 29 22:53:22 2010
From: p.giarrusso at gmail.com (Paolo Giarrusso)
Date: Thu, 29 Jul 2010 22:53:22 +0200
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com> 
	
	 
	
	 
	 
	 
	 
	 
	 
	 
	 
	
Message-ID: 

On Thu, Jul 29, 2010 at 15:15, William Leslie
 wrote:
> On 29 July 2010 18:55, Maciej Fijalkowski  wrote:
>> On Thu, Jul 29, 2010 at 10:50 AM, William Leslie
>>  wrote:
>>> If task X expects that task Y will mutate some object it has, it needs
>>> to go back to the source for every read. This means that if you do use
>>> mutation of some shared object for communication, it needs to be
>>> synchronised before every access. What this means for us is that every
>>> read from a possibly mutable object requires an acquire, and every
>>> write requires a release. It's as if every reference in the program is
>>> implemented with a volatile pointer. Even if the object is never
>>> mutated, there can be a lot of unnecessary bus chatter waiting for
>>> MESI to tell us so.
>>>
>>
>> I do agree there is an overhead. Can you provide some data how much
>> this overhead is? Python is not a very simple language and a lot of
>> things are complex and time consuming, so I wonder how it compares to
>> locking per object.

Below I try to prove that locking is still too expensive, even for an
interpreter.
Also, for many things, the clever optimizations you already do can make
those costs small, at least for the average case / fast path. I have
been taught to consider clever optimizations as required. With JIT
compilation, specialization and shadow classes, are method calls much
more expensive than a guard and (if no inlining is done, as might
happen in PyPy in the worst case for big functions) an assembler
'call' opcode, and possibly stack shuffling? How many cycles is that?
How much more expensive is that than optimized JavaScript (which is not far
from C, the only difference being the guard)? You can assume the case
of plain calls without keyword arguments and so on (and with inlining,
keyword arguments should pay no runtime cost).

Also, the free threading patches, which tried to remove the GIL, gave an
unacceptable (IIRC 2x) slowdown to CPython back in the days of CPython
1.5. And I don't think they even tried to lock every object, just what you
need to lock (which included refcounts).
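
To get a rough feel for the uncontended per-read cost (just a sketch, not a rigorous benchmark - absolute numbers vary a lot across machines and interpreters, and this only measures the fast path with no contention at all):

import timeit

plain = timeit.timeit(
    "x = data[5]",
    setup="data = range(10)",
    number=1000000)

locked = timeit.timeit(
    "lock.acquire(); x = data[5]; lock.release()",
    setup="import threading; lock = threading.Lock(); data = range(10)",
    number=1000000)

# the ratio gives an idea of how much an acquire/release pair adds per read
print "plain: %.3fs  locked: %.3fs  ratio: %.1fx" % (plain, locked, locked / plain)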

> It *is* locking per object, but you also spend time looking for the
> data if someone else has invalidated your cache line.

That overhead is already there in locking per object, I think (locking
can be much more expensive than a cache miss, see below).
However, locking per object does not prevent race conditions unless
you make atomic regions as big as actually needed (locking per
statement does not work); it just prevents data races (a conflict
between a write and a memory operation that are not synchronized
with each other). And you can't extend atomic regions indefinitely,
as that implies starvation. Even software transactional memory
requires the programmer to declare which regions have to be atomic.

Given the additional cost (discussed elsewhere in this mail), and
given that there is not much benefit, I think locking-per-object is
not worth it (but I'd still love to know more about why the effort on
python-safethread was halted).

> Come to think of it, that isn't as bad as it first seemed to me. If
> the sender never mutates the object, it will Just Work on any machine
> with a fairly flat cache architecture.

You first wrote: "The alternative, implicitly writing updates back to
memory as soon as possible and reading them out of memory every time,
can be hundreds or more times slower."
This is not "locking per object", it is just semantically close to it,
and becomes equivalent if only one thread has a reference at any time.

They are very different though performance-wise, and each of them is
better for some usages. In the Linux kernel (which I consider quite
authoritative here, on what you can do in C) both are used for valid
performance reasons, and a JIT compiler could choose between them.
Here, first I describe the two alternatives mentioned. Finally, I go
to the combination for the "unshared case".

- What you first described (memory barriers or uncached R/Ws) can be
faster for small updates, depending on the access pattern. An uncached
memory area does not disturb other memory traffic, unlike memory
barriers which are global, but I don't think an unprivileged process
is allowed to obtain one (by modifying MSRs or PATs, for x86).

Cost: each memory op goes to main memory and is thus as slow as a
cache miss (hundreds of clock cycles). When naively reading a Python
field, many such reads can be possible, but a JIT compiler can bring
it down to the equivalent of a C access with shadow classes and
specialization, and this would pay even more here (V8 does it for
JavaScript and I think PyPy already does most or all of it).

- Locking per object (monitors): slow upfront, but you can do each r/w
out of your cache, so if the object is kept locked for some time, this
is more efficient.
How slow? A system call to perform locking can cost tens of thousands
of cycles. But Java locks, and nowadays even Linux futexes (and
Windows locks), perform everything in userspace in as many cases as
possible (the slowpath is when there is actually contention on the
lock, but it's uncommon with locking-per-object). I won't sum up here
the literature on this.

- Since no contention is expected here, a simple pair of memory
barriers is needed on send/receive (a write barrier for send, a read
one for receive, IIRC). Allowing read-only access to another thread
brings you back to a mixture of the above two solutions. However,
in the 1st solution, using memory barriers, you'd need a write barrier
for every write, but you could save on read barriers.
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


From exarkun at twistedmatrix.com  Thu Jul 29 23:24:58 2010
From: exarkun at twistedmatrix.com (exarkun at twistedmatrix.com)
Date: Thu, 29 Jul 2010 21:24:58 -0000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: <557968.57023.qm@web120009.mail.ne1.yahoo.com>
References: <557968.57023.qm@web120009.mail.ne1.yahoo.com>
Message-ID: <20100729212458.2188.24074246.divmod.xquotient.34@localhost.localdomain>

On 08:39 pm, andrewfr_ice at yahoo.com wrote:
>
>Okay. I attended Ray's Hettinger's talk on Monocle. In the past
>I have encountered situations where I bumped up with the nesting 
>problem.
>If I recall, the problem involved request handlers that had a RPC style 
>AND made additional Twisted deferred calls:
>
>class MyRequestHandler(...):
>
>    @defer.inlineCallbacks
>    def process(self):
>        try:
>            result = yield client.getPage("http://www.google.com")
>        except Exception, err:
>            log.err(err, "process getPage call failed")
>        else:
>            # do some processing with the result
>            return result
>
>looks reasonable but Python will balk.

Aside from the "return result" (should be defer.returnValue(result), 
generators can't return with a value), this looks fine to me too.  Why 
do you say Python will balk?

Jean-Paul


From william.leslie.ttg at gmail.com  Fri Jul 30 09:35:29 2010
From: william.leslie.ttg at gmail.com (William Leslie)
Date: Fri, 30 Jul 2010 17:35:29 +1000
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
	
	
	
	
	
	
	
	
Message-ID: 

On 30 July 2010 06:53, Paolo Giarrusso  wrote:
>> Come to think of it, that isn't as bad as it first seemed to me. If
>> the sender never mutates the object, it will Just Work on any machine
>> with a fairly flat cache architecture.
>
> You first wrote: "The alternative, implicitly writing updates back to
> memory as soon as possible and reading them out of memory every time,
> can be hundreds or more times slower."
> This is not "locking per object", it is just semantically close to it,
> and becomes equivalent if only one thread has a reference at any time.

Yes, direct memory access was misdirection (sorry), as the cache
already handles consistency even in NUMA systems of the same size that
sit on most desktops today, and most significantly you still need to
lock objects in many cases, such as looking up an entry in a dict,
which can change size while probing. Not only are uncached accesses
needlessly slow in the typical case, but they are not sufficient to
ensure consistency of some resizable rpython data structures.

-- 
William Leslie


From evan at theunixman.com  Fri Jul 30 21:36:28 2010
From: evan at theunixman.com (Evan Cofsky)
Date: Fri, 30 Jul 2010 12:36:28 -0700
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
 message passing?
In-Reply-To: 
References: 
	
Message-ID: <20100730193627.GB2082@tunixman.com>

On 07/27 11:48, Maciej Fijalkowski wrote:
> Right now, no. But there are ways in which you can experiment. Truly
> concurrent threads (depends on implicit vs explicit shared memory)
> might require a truly concurrent GC to achieve performance. This is
> work (although not as big as removing refcounting from CPython for
> example).

Would starting to remove the GIL then be a useful project for someone
(like me, for example) to undertake? It might be a good start to
experimentation with other kinds of concurrency. I've been interested in
Software Transactional Memory
(http://en.wikipedia.org/wiki/Software_transactional_memory).

-- 
Evan Cofsky 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: Digital signature
URL: 

From fijall at gmail.com  Fri Jul 30 21:40:35 2010
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Fri, 30 Jul 2010 21:40:35 +0200
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
	message passing?
In-Reply-To: <20100730193627.GB2082@tunixman.com>
References: 
	 
	<20100730193627.GB2082@tunixman.com>
Message-ID: 

On Fri, Jul 30, 2010 at 9:36 PM, Evan Cofsky  wrote:
> On 07/27 11:48, Maciej Fijalkowski wrote:
>> Right now, no. But there are ways in which you can experiment. Truly
>> concurrent threads (depends on implicit vs explicit shared memory)
>> might require a truly concurrent GC to achieve performance. This is
>> work (although not as big as removing refcounting from CPython for
>> example).
>
> Would starting to remove the GIL then be a useful project for someone
> (like me, for example) to undertake? It might be a good start to
> experimentation with other kinds of concurrency. I've been interested in
> Software Transactional Memory
> (http://en.wikipedia.org/wiki/Software_transactional_memory).
>
> --
> Evan Cofsky 
>

I think removing the GIL is not a good place to start. It's far too
complex without knowing the codebase (it's fairly complex even when you
know the codebase). There are many related projects which are smaller in
size and might eventually lead to some idea of how to remove the GIL.
If you're interested, come to #pypy on IRC to discuss.

Cheers,
fijal


From evan at theunixman.com  Fri Jul 30 21:54:09 2010
From: evan at theunixman.com (Evan Cofsky)
Date: Fri, 30 Jul 2010 12:54:09 -0700
Subject: [pypy-dev] pre-emptive micro-threads utilizing shared memory
 message passing?
In-Reply-To: 
References: 
	
	<20100730193627.GB2082@tunixman.com>
	
Message-ID: <20100730195408.GC2082@tunixman.com>

On 07/30 21:40, Maciej Fijalkowski wrote:
> If you're interested, come to #pypy on IRC to discuss.

Sounds reasonable enough. I'll hang out on #pypy and see what happens.

Thanks

-- 
Evan Cofsky 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: Digital signature
URL: 

From sparks.m at gmail.com  Sat Jul 31 03:08:49 2010
From: sparks.m at gmail.com (Michael Sparks)
Date: Sat, 31 Jul 2010 02:08:49 +0100
Subject: [pypy-dev] FW: Would the following shared memory model be
	possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
	
Message-ID: 

On Thu, Jul 29, 2010 at 6:44 PM, Kevin Ar18  wrote:
> You brought up a lot of topics.  I went ahead and sent you a private email.
> There's always lots of interesting things I can add to my list of things to
> learn about. :)

Yes, there are lots of interesting things. I have a limited amount of
time however (I should be in bed, it's very late here, but I do /try/
to reply to on-list mails), so I cannot spoon-feed you. Mailing me
directly rather than a (relevant) list precludes you getting answers
from someone other than me. Not being on lists also precludes you
getting answers to questions by chance. Changing emails and names in
email headers also makes keeping track of people hard...

(For example you asked off list last year about Kamaelia's license
from a different email address. Since it wasn't searchable I
completely forgot. You also asked all sorts of questions but didn't
want the answers public, so I didn't reply. If instead you'd
subscribed to the list, and asked there, you'd've found out that
Kamaelia's license changed - to the Apache Software License v2 ...)

If I mention something you find interesting, please Google first and
then ask publicly somewhere relevant. (the answer and question are
then googleable, and you're doing the community a service IMO if you
ask q's that way - if your question is somewhere relevant and shows
you've already googled prior work as far as you can... People are
always willing to help people who show willing to help themselves in
my experience.)

>> just looks to me that you're tying yourself up in knots over things
>> that aren't problems, when there are some things which could be useful
>> (in practice) & interesting in this space.
> The particular issue in this situation is that there is no way to make
> Kamaelia, FBP, or other concurrency concepts run in parallel (unless you are
> willing to accept lots of overhead like with the multiprocessing queues).
>
> Since you have worked with Kamaelia code a lot... you understand a lot more
> about implementation details.  Do you think the previous shared memory
> concept or something like it would let you make Kamaelia parallel?
> If not, can you think of any method that would let you make Kamaelia
> parallel?

Kamaelia already CAN run components in parallel in different processes
(has been able to do so for quite some time) or on different
processors. Indeed, all you do is use a ProcessPipeline or
ProcessGraphline rather than Pipeline or Graphline, and the components
in the top level are spread across processes. I still view the code as
experimental, but it does work, and when needed is very useful.

Kamaelia running on IronPython can run on separate processors sharing
data efficiently (due to the lack of a GIL there) happily too. Threaded
components there do that naturally - I don't use IronPython myself, but
Kamaelia does run on it. On Windows this is easiest, though Mono works
just as well.

I believe Jython is also GIL free, and Kamaelia's Axon runs there
cleanly too. As a result, because Kamaelia is pure python, it runs
truly in parallel there too (based on hearing from people using
Kamaelia on Jython). CPython is the exception (and a rather big one at
that). (PyPy has a choice IIUC)

Personally, I think if PyPy worked with generators better (which is
why I keep an eye on PyPy) and cpyext was improved, it'd provide a
really compelling platform for me. (I was rather gutted at Europython
to hear that PyPy's generator support was still ... problematic)

Regarding the *efficiency* and *enforcement* of the approach taken, I
feel you're barking up the wrong tree, but let's go there.

What approach does baseline (non-Iron Python running) kamaelia take
for multi-process work?

For historical reasons, it builds on top of pprocess rather than
multiprocessing module based. This means for interprocess
communications objects are pickled before being sent over operating
system pipes.

This provides an obvious communications overhead - and this isn't
really kamaelia specific at this point.

However, shifting data from one CPU to another is expensive, and only
worth doing in some circumstances. (Consider a machine with several
physical CPUs - each has a local CPU cache, and the data needs to be
transferred from one to another, which is partly why people worry
about thread/CPU affinity etc)

Basically, if you can manage it, you don't want to shift data between
CPUs, you want to partition the processing.

ie you may want to start caring about the size of messages and number
of messages going between processes. Sending small and few between
processes is going to be preferable to sending large and many for
throughput purposes.

In the case of small and few, the approach of pickling and sending
across OS pipes isn't such a bad idea. It works.

If you do want to share data between CPUs, and it sounds like you do,
then most OSs already provide a means of doing that - threads. The
conventions people use for using threads are where they become
unpicked, but as a mechanism, threads do generally work, and work
well.

As well as channels/boxes, you can use an STM approach, such as the one
in Axon.STM ...
    * http://www.kamaelia.org/STM.html
    * http://code.google.com/p/kamaelia/source/browse/trunk/Code/Python/Bindings/STM/

...which is logically very similar to version control for variables. A
downside of STM (at least with this approach) however, is that for it
to work, you need either copy on write semantics for objects, or full
copying of objects or similar. Personally I use a biological metaphor
here, in that channels/boxes and components, and similar perform a
similar function to axons and neurons in the body, and that STM is
akin to the hormonal system for maintaining and controlling system
state. (I modelled biological tree growth many moons ago)
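
To make the "version control for variables" analogy concrete, here is a toy illustration - deliberately NOT the Axon.STM API, just the bare idea of checking a value out, working on a copy, and committing only if nobody else has committed in the meantime:

import threading

class ConcurrentUpdate(Exception):
    pass

class ToyStore(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}                      # name -> (version, value)

    def checkout(self, name, default=None):
        with self._lock:
            return self._values.get(name, (0, default))

    def commit(self, name, version, value):
        with self._lock:
            current, _ = self._values.get(name, (0, None))
            if current != version:
                raise ConcurrentUpdate(name)   # someone got there first
            self._values[name] = (version + 1, value)

store = ToyStore()
while True:                                    # retry loop: read, modify, try to commit
    version, count = store.checkout("counter", 0)
    try:
        store.commit("counter", version, count + 1)
        break
    except ConcurrentUpdate:
        continue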

Anyhow, coming back to threads, that brings us back to python, and
implementations with a GIL, and those without.

For implementations with a GIL, you then have a choice: do I choose to
try and implement a memory model that _enforces_ data locality? That
is, if a piece of data is in use inside a single "process" or "thread"
(from here on I'll use "task" as a generic term), then trying to use
it inside another causes a problem for the task attempting to breach
the model.

In order to enforce this, I personally believe you'd need to use
multiple processes, and only share data through dedicated code
managing shared memory. You could of course do this outside user code.
To do this you'd need an abstraction that made sense, and something
like stackless' channels or kamaelia's (in/out) box model makes sense
there. (The CELL API uses a mailbox metaphor as well for reference)

In that case, you have a choice. You either copy the data into shared
memory, or you share the data in situ. The former gives you back
precisely the same overhead previously described, while the latter
fragments your memory (since you can no longer access it). You could
also have compaction.

However, personally, I think any possible benefits here are outweighed
by the costs and complexity.

The alternative is to _encourage_ data locality. That is, encourage the
usage and sharing of data such that, whilst you could share data
between tasks and cause corruption, the common way of using the
system discourages such actions. In essence that's what I try to do in
Kamaelia, and it seems to work. Specifically, the model says:

    * If I take a piece of data from an inbox, I own it and can do anything
      with it that I like. If you think of a physical piece of paper and
      I take it from an intray, then that really is the case.

    * If I put a piece of data in an outbox, I no longer own it and should
      not attempt to do anything more with it. Again, using a physical
      metaphor, and naming scheme helps here. In particular, if I put a
      piece of paper in the post, I can no longer modify it. How it gets
      to its recipient is not my concern either.

In practice this does actually work. If you add in immutable tuples,
and immutable strings then it becomes a lot clearer how this can work.
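
To make those two rules concrete, a minimal component looks something like this (written from memory against the Axon API, so treat the exact names as illustrative rather than definitive):

from Axon.Component import component

class Doubler(component):
    def main(self):
        while True:
            if self.dataReady("inbox"):
                msg = self.recv("inbox")      # taken from the in-tray: I own it now
                self.send(msg * 2, "outbox")  # posted: I don't touch it again
            yield 1                           # hand control back to the scheduler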

Is there a risk here of accidental modification? Yes. However, the
size and general simplicity of components tends to lead to such
problems being picked up early. It also enables component level
acceptance tests. (We tend to build small examples of usage, which in
turn effectively form acceptance tests)

[ An alternative is to make the "send" primitive make a copy on send.
That would be quite an overhead, and also limit the types of data you
can send. ]

In practical terms, it works. (Stackless proves this as well IMO,
since despite some differences, there's also lots of similarities)

The other question that arises, is "isn't the GIL a problem with
threads?". Well, the answer to that really depends on what you're
doing. David Beazley's talk on what happens when mixing different sorts
of threads shows that it isn't ideal, and if you're hitting that
behaviour, then actually switching to real processes makes sense.
However if you're doing CPU intensive work inside a C extension which
releases the GIL (eg numpy), then it's less of an issue in practice.
Custom extensions can do the same.

So, for example, picking something which I know colleagues [1] at work
do, you can use a DVS broadcast capture card to capture video frames,
pass those between threads which are doing processing on them, and
inside those threads use c extensions to process the data efficiently
(since image processing does take time...), and those release the GIL
boosting throughput.

   [1] On this project : http://www.bbc.co.uk/rd/projects/2009/10/i3dlive.shtml

So, that makes it all sound great - ie things can, after various
fashions, run in parallel on various versions of python, to practical
benefit. But obviously it could be improved.

Personally, I think the project most likely to make a difference here
is actually pypy. Now, talk is very cheap, and easy, and I'm not
likely to implement this, so I'll aim to be brief. Execution is hard.

In particular, what I think is most likely to be beneficial is
something _like_ this:

Assume pypy runs without a GIL. Then allow the creation of a green
process. A green process is implemented using threads, but with data
created on the heap such that it defaults to being marked private to
the thread (ie ala thread local storage, but perhaps implemented
slightly differently - via references from the thread local storage
into the heap) rather than shared. Sharing between green processes
(for channels or boxes) would "simply" be detagged as being owned by
one thread, and passed to another.

In particular this would mean that you need a mechanism for doing
this. Simply attempting to call another green process (or thread) from
another with mutable data types would be sufficient to raise the
equivalent of a segmentation fault.

Secondly, improve cpyext to the extent that each CPython extension
gets its own version of the GIL. (ie each extension runs with its own
logical runtime, and thinks that it has its own GIL which it can lock
and release. In practice it's faked by the PyPy runtime.) This is
conceptually similar to creating green processes.

It's worth considering that the Linux kernel went through similar
changes, in that in the 2.0 days there was a large single big lock,
which was replaced by ever more granular locks. I personally think that
since there are so many extensions that rely on the existence of the
GIL simply waving a wand to get rid of it isn't likely. However
logically providing a GIL per C-Extension may be plausible, and _may_
be sufficient.

However, I don't know - it might well not - I've not looked at the
code, and talk is cheap - execution is hard.

Hopefully the above (cheap :) comments are in some small way useful.

Regards,


Michael.


From cfbolz at gmx.de  Sat Jul 31 08:34:49 2010
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Sat, 31 Jul 2010 08:34:49 +0200
Subject: [pypy-dev] S3 2010 deadline extension
Message-ID: <4C53C409.1060101@gmx.de>

The S3 2010 paper deadline has been extended by two weeks, and is now
August 13, 2010.


*** Workshop on Self-sustaining Systems (S3) 2010 ***

September 27-28, 2010
The University of Tokyo, Japan
http://www.hpi.uni-potsdam.de/swa/s3/s3-10/

In cooperation with ACM SIGPLAN

=== Call for papers ===

The Workshop on Self-sustaining Systems (S3) is a forum for discussion 
of topics relating to computer systems and languages that are able to 
bootstrap, implement, modify, and maintain themselves. One property of 
these systems is that their implementation is based on small but 
powerful abstractions; examples include (amongst others) 
Squeak/Smalltalk, COLA, Klein/Self, PyPy/Python, Rubinius/Ruby, and 
Lisp. Such systems are the engines of their own replacement, giving 
researchers and developers great power to experiment with, and explore 
future directions from within, their own small language kernels.

S3 will take place September 27-28, 2010 at The University of Tokyo, 
Japan. It is an exciting opportunity for researchers and practitioners 
interested in self-sustaining systems to meet and share their knowledge, 
experience, and ideas for future research and development.

--- Submissions and proceedings ---

S3 invites submissions of high-quality papers reporting original 
research, or describing innovative contributions to, or experience with, 
self-sustaining systems, their implementation, and their application. 
Papers that depart significantly from established ideas and practices 
are particularly welcome.

Submissions must not have been published previously and must not be 
under review for any other refereed event or publication. The program 
committee will evaluate each contributed paper based on its relevance, 
significance, clarity, and originality. Revised papers will be published 
as post-proceedings in the ACM Digital Library.

Papers should be submitted electronically via EasyChair at 
http://www.easychair.org/conferences/?conf=s32010 in PDF format. 
Submissions must be written in English (the official language of the 
workshop) and must not exceed 10 pages. They should use the ACM SIGPLAN 
10 point format, templates for which are available at 
http://www.acm.org/sigs/sigplan/authorInformation.htm.

--- Venue ---

The University of Tokyo, Komaba Campus, Japan

--- Important dates ---

Submission of papers: *EXTENDED* August 13, 2010
Author notification: August 27, 2010
Early registration: September 3, 2010
Revised papers: September 10, 2010
S3 workshop: September 27-28, 2010
Final papers for ACM-DL post-proceedings: October 15, 2010

--- Invited talks ---

Yukihiro Matsumoto: "From Lisp to Ruby to Rubinius"
Takashi Ikegami: "Sustainable Autonomy and Designing Mind Time"

--- Chairs ---

Robert Hirschfeld (Hasso-Plattner-Institut Potsdam, Germany)
hirschfeld at hpi.uni-potsdam.de
Hidehiko Masuhara (The University of Tokyo, Japan)
masuhara at graco.c.u-tokyo.ac.jp
Kim Rose (Viewpoints Research Institute, USA)
kim.rose at vpri.org

--- Program committee ---

Carl Friedrich Bolz, University of Duesseldorf, Germany
Johan Brichau, Universite Catholique de Louvain, Belgium
Shigeru Chiba, Tokyo Institute of Technology, Japan
Brian Demsky, University of California, Irvine, USA
Marcus Denker, INRIA Lille, France
Richard P. Gabriel, IBM Research, USA
Michael Haupt, Hasso-Plattner-Institut, Germany
Robert Hirschfeld, Hasso-Plattner-Institut, Germany (co-chair)
Atsushi Igarashi, University of Kyoto, Japan
David Lorenz, The Open University, Israel
Hidehiko Masuhara, University of Tokyo, Japan (co-chair)
Eliot Miranda, Teleplace, USA
Ian Piumarta, Viewpoints Research Institute, USA
Martin Rinard, MIT, USA
Antero Taivalsaari, Nokia, Finland
David Ungar, IBM, USA

_______________________________________________
fonc mailing list
fonc at vpri.org
http://vpri.org/mailman/listinfo/fonc


From andrewfr_ice at yahoo.com  Sat Jul 31 12:00:49 2010
From: andrewfr_ice at yahoo.com (Andrew Francis)
Date: Sat, 31 Jul 2010 03:00:49 -0700 (PDT)
Subject: [pypy-dev] Would the following shared memory model be possible?
In-Reply-To: 
Message-ID: <597371.6380.qm@web120001.mail.ne1.yahoo.com>

Hi JP:

Message: 1
Date: Thu, 29 Jul 2010 21:24:58 -0000
From: exarkun at twistedmatrix.com
Subject: Re: [pypy-dev] Would the following shared memory model be
    possible?
To: pypy-dev at codespeak.net
Message-ID:
    <20100729212458.2188.24074246.divmod.xquotient.34 at localhost.localdomain>
   
Content-Type: text/plain; charset="utf-8"; format="flowed"

On 08:39 pm, andrewfr_ice at yahoo.com wrote:
>
>Okay. I attended Ray's Hettinger's talk on Monocle. In the past
>I have encountered situations where I bumped up with the nesting
>problem.
>If I recall, the problem involved request handlers that had a RPC style
>AND made additional Twisted deferred calls:
>
>class MyRequestHandler(...):
>
>    @defer.inlineCallbacks
>    def process(self):
>        try:
>            result = yield client.getPage("http://www.google.com")
>        except Exception, err:
>            log.err(err, "process getPage call failed")
>        else:
>            # do some processing with the result
>            return result
>
>looks reasonable but Python will balk.

JP>Aside from the "return result" (should be defer.returnValue(result),
JP>generators can't return with a value), this looks fine to me too.  Why
JP>do you say Python will balk?

Well, the return with a value was the deal breaker. I used this example because this is where I came face-to-face with nested generators - and developed a mistrust of them for exotic uses. There was something else about the real example (I am having a hard time finding the posts - somewhere in 2007) - I think it was a very early version of PyAMF and it really wanted a return value (plain HTTP is okay). I believe that under the hood, if the protocol returns a deferred or None, the reactor will expect further output in the future.
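
For completeness, the shape JP describes - the same handler, but with the value handed back via defer.returnValue instead of a bare return - would look roughly like this (a sketch against the Twisted API of the time; the object base class is a placeholder for the one elided in my original snippet):

from twisted.internet import defer
from twisted.python import log
from twisted.web.client import getPage

class MyRequestHandler(object):

    @defer.inlineCallbacks
    def process(self):
        try:
            result = yield getPage("http://www.google.com")
        except Exception, err:
            log.err(err, "process getPage call failed")
        else:
            # do some processing with the result
            defer.returnValue(result)  # instead of a bare return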

Cheers,
Andrew



      



From sparks.m at gmail.com  Sat Jul 31 19:43:32 2010
From: sparks.m at gmail.com (Michael Sparks)
Date: Sat, 31 Jul 2010 18:43:32 +0100
Subject: [pypy-dev] FW: Would the following shared memory model be
	possible?
In-Reply-To: 
References: 
	<20100727062702.GE12699@tunixman.com>
	
	
	
	
	
	
	
	
Message-ID: 

[ cc'ing the list in case anyone else took my words the same way as Kevin :-( ]

On Sat, Jul 31, 2010 at 5:26 PM, Kevin Ar18  wrote:
> I have no idea what I did to warrant your hateful replies towards me, but
> they really are not appropriate (in public or private email).

I had absolutely no intention of offending you, and am deeply sorry
for any offense that I may have caused you.

In my reply I merely wanted to flag that I don't have time to go into
everything (like most people), that asking questions in a public realm
is better because you may then get answers from multiple people, and
that people who appear to do some research first tend to get better
answers. I also tried to give an example, but that doesn't appear to
have been helpful. (I'm fallible like everyone else)

My intention there was to be helpful and to explain why I have that
view of only replying on list, and it appears to have offended you
instead, and I apologise. (one person's direct and helpful speech in
one place can be a mortal insult somewhere else)

After those couple of paragraphs, I tried to add to your discussion by
replying to the specific points you raised about parallel
execution, noting places and examples where it is possible today. (to
varying degrees of satisfaction) I then also tried to answer your
point of "if something extra could be done, what would probably be
generally useful". To that I noted that *my* talk there was cheap, and
that execution was hard.

Somehow along the way, my intent to try to be helpful to you has
resulted in offending and upsetting you, and for that I am truly sorry
- life is simply too short for people to upset each other, and in no
way was my post intended as "hateful", and once again, my apologies.
In future please assume good intentions - I assumed good intentions on
your part.

I'll bow out at this point.

Best Regards,


Michael.

>
>> Date: Sat, 31 Jul 2010 02:08:49 +0100
>> Subject: Re: [pypy-dev] FW: Would the following shared memory model be
>> possible?
>> From: sparks.m at gmail.com
>> To: kevinar18 at hotmail.com
>> CC: pypy-dev at codespeak.net
>>
>> On Thu, Jul 29, 2010 at 6:44 PM, Kevin Ar18  wrote:
>> > You brought up a lot of topics. I went ahead and sent you a private
>> > email.
>> > There's always lots of interesting things I can add to my list of things
>> > to
>> > learn about. :)
>>
>> Yes, there are lots of interesting things. I have a limited amount of
>> time however (I should be in bed, it's very late here, but I do /try/
>> to reply to on-list mails), so cannot spood feed you. Mailing me
>> directly rather than a (relevant) list precludes you getting answers
>> from someone other than me. Not being on lists also precludes you
>> getting answers to questions by chance. Changing emails and names in
>> email headers also makes keeping track of people hard...
>>
>> (For example you asked off list last year about Kamaelia's license
>> from a different email address. Since it wasn't searchable I
>> completely forgot. You also asked all sorts of questions but didn't
>> want the answers public, so I didn't reply. If instead you'd
>> subscribed to the list, and asked there, you'd've found out that
>> Kamaelia's license changed - to the Apache Software License v2 ...)
>>
>> If I mention something you find interesting, please Google first and
>> then ask publicly somewhere relevant. (the answer and question are
>> then googleable, and you're doing the community a service IMO if you
>> ask q's that way - if you're question is somewhere relevant and shows
>> you've already googled prior work as far as you can... People are
>> always willing to help people who show willing to help themselves in
>> my experience.)
>>
>> >> just looks to me that you're tieing yourself up in knots over things
>> >> that aren't problems, when there are some things which could be useful
>> >> (in practice) & interesting in this space.
>> > The particular issue in this situation is that there is no way to make
>> > Kamaelia, FBP, or other concurrency concepts run in parallel (unless you
>> > are
>> > willing to accept lots of overhead like with the multiprocessing
>> > queues).
>> >
>> > Since you have worked with Kamaelia code a lot... you understand a lot
>> > more
>> > about implementation details. Do you think the previous shared memory
>> > concept or something like it would let you make Kamaelia parallel?
>> > If not, can you think of any method that would let you make Kamaelia
>> > parallel?
>>
>> Kamaelia already CAN run components in parallel in different processes
>> (has been able to do so for quite some time) or on different
>> processors. Indeed, all you do is use a ProcessPipeline or
>> ProcessGraphline rather than Pipeline or Graphline, and the components
>> in the top level are spread across processes. I still view the code as
>> experimental, but it does work, and when needed is very useful.
>>
>> Kamaelia running on Iron Python can run on seperate processors sharing
>> data efficiently (due to lack of GIL there) happily too. Threaded
>> components there do that naturally - I don't use IronPython, but it
>> does run on Iron Python. On windows this is easiest, though Mono works
>> just as well.
>>
>> I believe Jython also is GIL free, and Kamaelia's Axon runs there
>> cleanly too. As a result because Kamaelia is pure python, it runs
>> truly in parallel there too (based on hearing from people using
>> kamaelia on jython). Cpython is the exception (and a rather big one at
>> that). (Pypy has a choice IIUC)
>>
>> Personally, I think if PyPy worked with generators better (which is
>> why I keep an eye on PyPy) and cpyext was improved, it'd provide a
>> really compelling platform for me. (I was rather gutted at Europython
>> to hear that PyPy's generator support was still ... problematic)
>>
>> Regarding the *efficiency* and *enforcement* of the approach taken, I
>> feel you're chasing the wrong tree, but let's go there.
>>
>> What approach does baseline (non-Iron Python running) kamaelia take
>> for multi-process work?
>>
>> For historical reasons, it builds on top of pprocess rather than
>> multiprocessing module based. This means for interprocess
>> communications objects are pickled before being sent over operating
>> system pipes.
>>
>> This provides an obvious communications overhead - and this isn't
>> really kamaelia specific at this point.
>>
>> However, shifting data from one CPU to another is expensive, and only
>> worth doing in some circumstances. (Consider a machine with several
>> physical CPUs - each has a local CPU cache, and the data needs to be
>> transferred from one to another, which is why partly people worry
>> about thread/CPU affinity etc)
>>
>> Basically, if you can manage it, you don't want to shift data between
>> CPUs, you want to partition the processing.
>>
>> ie you may want to start caring about the size of messages and number
>> of messages going between processes. Sending small and few between
>> processes is going to be preferable to sending large and many for
>> throughput purposes.
>>
>> In the case of small and few, the approach of pickling and sending
>> across OS pipes isn't such a bad idea. It works.
>>
>> If you do want to share data between CPUs, and it sounds like you do,
>> then most OSs already provide a means of doing that - threads. The
>> conventions people use for using threads are where they become
>> unpicked, but as a mechanism, threads do generally work, and work
>> well.
>>
>> As well as channels/boxes, you can use an STM approach, such as than
>> in Axon.STM ...
>> * http://www.kamaelia.org/STM.html
>> *
>> http://code.google.com/p/kamaelia/source/browse/trunk/Code/Python/Bindings/STM/
>>
>> ...which is logically very similar to version control for variables. A
>> downside of STM (at least with this approach) however, is that for it
>> to work, you need either copy on write semantics for objects, or full
>> copying of objects or similar. Personally I use a biological metaphor
>> here, in that channels/boxes and components, and similar perform a
>> similar function to axons and neurons in the body, and that STM is
>> akin to the hormonal system for maintaining and controlling system
>> state. (I modelled biological tree growth many moons ago)
>>
>> Anyhow, coming back to threads, that brings us back to python, and
>> implementations with a GIL, and those without.
>>
>> For implementations with a GIL, you then have a choice: do I choose to
>> try and implement a memory model that _enforces_ data locality? that
>> is if a piece of data is in use inside a single "process" or "thread"
>> (from hereon I'll use "task" as a generic phrase) that trying to use
>> it inside another causes a problem for the task attempting to breach
>> the model.
>>
>> In order to enforce this, I personally believe you'd need to use
>> multiple processes, and only share data through dedicated code
>> managing shared memory. You could of course do this outside user code.
>> To do this you'd need an abstraction that made sense, and something
>> like stackless' channels or kamaelia's (in/out) box model makes sense
>> there. (The CELL API uses a mailbox metaphor as well for reference)
>>
>> In that case, you have a choice. You either copy the data into shared
>> memory, or you share the data in situ. The former gives you back
>> precisely the same overhead previously described, or the latter
>> fragments your memory (since you can no longer access it). You could
>> also have compaction.
>>
>> However, personally, I think any possible benefits here are outweighed
>> by the costs and complexity.
>>
>> The alternative is to _encourage_ data locality. That is encourage the
>> usage and sharing of data such that whilst you could share data
>> between tasks and cause corruption that the common way of using the
>> system discourages such actions. In essence that's what I try to do in
>> Kamaelia, and it seems to work. Specifically, the model says:
>>
>> * If I take a piece of data from an inbox, I own it and can do anything
>> with it that I like. If you think of a physical piece of paper and
>> I take it from an intray, then that really is the case.
>>
>> * If I put a piece of data in an outbox, I no longer own it and should
>> not attempt to do anything more with it. Again, using a physical
>> metaphor, and naming scheme helps here. In particular, if I put a
>> piece of paper in the post, I can no longer modify it. How it gets
>> to its recipient is not my concern either.
>>
>> In practice this does actually work. If you add in immutable tuples,
>> and immutable strings then it becomes a lot clearer how this can work.
>>
>> Is there a risk here of accidental modification? Yes. However, the
>> size and general simplicity of components tends to lead to such
>> problems being picked up early. It also enables component level
>> acceptance tests. (We tend to build small examples of usage, which in
>> turn effectively form acceptance tests)
>>
>> [ An alternative is to make the "send" primitive make a copy on send.
>> That would be quite an overhead, and also limit the types of data you
>> can send. ]
>>
>> In practical terms, it works. (Stackless proves this as well IMO,
>> since despite some differences, there's also lots of similarities)
>>
>> The other question that arises, is "isn't the GIL a problem with
>> threads?". Well, the answer to that really depends on what you're
>> doing. David Beazely's talk on what happens on mixing different sorts
>> of threads shows that it isn't ideal, and if you're hitting that
>> behaviour, then actually switching to real processes makes sense.
>> However if you're doing CPU intensive work inside a C extension which
>> releases the GIL (eg numpy), then it's less of an issue in practice.
>> Custom extensions can do the same.
>>
>> So, for example, picking something which I know colleagues [1] at work
>> do, you can use a DVS broadcast capture card to capture video frames,
>> pass those between threads which are doing processing on them, and
>> inside those threads use c extensions to process the data efficiently
>> (since image processing does take time...), and those release the GIL
>> boosting throughput.
>>
>> [1] On this project :
>> http://www.bbc.co.uk/rd/projects/2009/10/i3dlive.shtml
>>
>> So, that makes it all sound great - ie things can, after various
>> fashions, run in parallel on various versions of python, to practical
>> benefit. But obviously it could be improved.
>>
>> Personally, I think the project most likely to make a difference here
>> is actually pypy. Now, talk is very cheap, and easy, and I'm not
>> likely to implement this, so I'll aim to be brief. Execution is hard.
>>
>> In particular, what I think is most likely to be beneficial is
>> something _like_ this:
>>
>> Assume pypy runs without a GIL. Then allow the creation of a green
>> process. A green process is implemented using threads, but with data
>> created on the heap such that it defaults to being marked private to
>> the thread (ie ala thread local storage, but perhaps implemented
>> slightly differently - via references from the thread local storage
>> into the heap) rather than shared. Sharing between green processes
>> (for channels or boxes) would "simply" be detagged as being owned by
>> one thread, and passed to another.
>>
>> In particular this would mean that you need a mechanism for doing
>> this. Simply attempting to call another green process (or thread) from
>> another with mutable data types would be sufficient to raise the
>> equivalent of a segmentation fault.
>>
>> Secondly, improve cpyext to the extent that each cpython extension
>> gets it's own version of the GIL. (ie each extension runs with its own
>> logical runtime, and thinks that it has its own GIL which it can lock
>> and release. In practice it's faked by the PyPy runtime. This is
>> essentially similar conceptually to creating green processes.
>>
>> It's worth considering that the Linux kernel went through similar
>> changes, in that in the 2.0 days there was a large single big lock,
>> which was replaced by ever granular locks. I personally think that
>> since there are so many extensions that rely on the existence of the
>> GIL simply waving a wand to get rid of it isn't likely. However
>> logically providing a GIL per C-Extension may be plausible, and _may_
>> be sufficient.
>>
>> However, I don't know - it might well not - I've not looked at the
>> code, and talk is cheap - execution is hard.
>>
>> Hopefully the above (cheap :) comments are in some small way useful.
>>
>> Regards,
>>
>>
>> Michael.
>