From brecht at mos6581.org Sat Mar 1 23:34:17 2014 From: brecht at mos6581.org (Brecht Machiels) Date: Sat, 01 Mar 2014 23:34:17 +0100 Subject: [pypy-dev] RinohType and PyPy2 Message-ID: <1431737446.616651.1393713257918.JavaMail.sas1@[172.29.252.247]> Hello, I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). Results on my Celeron T3000 (Arch Linux x86_64): CPython 3.3.4 14 s PyPy3 2.1.0-beta1 61 s CPython 2.7.6 15 s PyPy 2.2.1 35 s If you want to give it a try (no external dependencies): git clone --branch pypy2 https://github.com/brechtm/rinohtype.git cd rinohtype/examples/rfic2009 rm -rf template.ptc; PYTHONPATH=../.. pypy template.py While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? Best regards, Brecht From fijall at gmail.com Sun Mar 2 08:19:35 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 2 Mar 2014 09:19:35 +0200 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> Message-ID: The first obvious thing that jumps out at me is your casual use of sys._getframe - the JIT aborts in this case and proceeds to the interpreter (so you pay the price for JITting, while you also pay the price for not having compiled assembler). That probably does not explain everything, but please don't use sys._getframe in production code if you want the JIT to be fast. On Sun, Mar 2, 2014 at 12:34 AM, Brecht Machiels wrote: > Hello, > > I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). 
> > Results on my Celeron T3000 (Arch Linux x86_64): > CPython 3.3.4 14 s > PyPy3 2.1.0-beta1 61 s > CPython 2.7.6 15 s > PyPy 2.2.1 35 s > > If you want to give it a try (no external dependencies): > > git clone --branch pypy2 https://github.com/brechtm/rinohtype.git > cd rinohtype/examples/rfic2009 > rm -rf template.ptc; PYTHONPATH=../.. pypy template.py > > > While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? > > Best regards, > Brecht > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From brecht at mos6581.org Sun Mar 2 11:11:43 2014 From: brecht at mos6581.org (Brecht Machiels) Date: Sun, 02 Mar 2014 11:11:43 +0100 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> Message-ID: <1689107550.632557.1393755103839.JavaMail.sas1@[172.29.252.247]> Thanks Maciej, sys._getframe was introduced by "magicsuper", which I used to avoid refactoring all super() calls. I've done that now and there shouldn't be any more sys._getframe calls. You can pull in this commit from the pypy2 branch. Unfortunately, this didn't improve performance much. PyPy now takes 26 seconds. Any other ideas? Best regards, Brecht ---- On Sun, 02 Mar 2014 08:19:35 +0100 Maciej Fijalkowski wrote ---- >the first obvious thing that jumps at me is your casual use of >sys._getframe - the JIT aborts in this case and proceeds to the >interpreter (so you pay the price for JITting, while you also pay the >prace for not having compiled assembler). That probably does not >explain everything, but please don't use sys._getframe in production >code if you want the JIT to be fast. 
> >On Sun, Mar 2, 2014 at 12:34 AM, Brecht Machiels wrote: >> Hello, >> >> I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). >> >> Results on my Celeron T3000 (Arch Linux x86_64): >> CPython 3.3.4 14 s >> PyPy3 2.1.0-beta1 61 s >> CPython 2.7.6 15 s >> PyPy 2.2.1 35 s >> >> If you want to give it a try (no external dependencies): >> >> git clone --branch pypy2 https://github.com/brechtm/rinohtype.git >> cd rinohtype/examples/rfic2009 >> rm -rf template.ptc; PYTHONPATH=../.. pypy template.py >> >> >> While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? >> >> Best regards, >> Brecht >> >> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev > From numerodix at gmail.com Sun Mar 2 13:03:47 2014 From: numerodix at gmail.com (Martin Matusiak) Date: Sun, 2 Mar 2014 13:03:47 +0100 Subject: [pypy-dev] pypy in python3? Message-ID: Hi, I'm wondering whether there are any plans to port pypy itself to python 3 at some point. And what the benefits of that might be (other than having a more recent host language). Is there anything in python 3 that would make it easier/harder for pypy? Thanks, Martin From arigo at tunes.org Sun Mar 2 16:41:15 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 2 Mar 2014 16:41:15 +0100 Subject: [pypy-dev] pypy in python3? In-Reply-To: References: Message-ID: Hi Martin, On 2 March 2014 13:03, Martin Matusiak wrote: > I'm wondering whether there are any plans to port pypy itself to > python 3 at some point. And what the benefits of that might be (other > than having a more recent host language). Is there anything in python > 3 that would make it easier/harder for pypy? 
Just to make it clear to readers: this is about the language in which PyPy is implemented; this is not about the fact that PyPy itself implements Python 2.7 and 3.2 (currently). If we were starting today, then we could certainly use some small new features, like the ability to decorate function arguments rather than the whole function. However, that's about it as far as advantages go. There are small disadvantages too, like the unicode-everywhere model; you'd have to write byte strings explicitly everywhere in order to implement Python 2, or almost any small language you want to play with. That's the main difference from Python 2, ignoring new things in the stdlib which we cannot use from RPython anyway. But we're not starting today, and we have a very large code base already. As far as I'm concerned, Python 2 works nicely, is going to stay around for a long time, and is stable --- i.e. does not require us to adapt our code base every 2 years when a new Python 3.x version goes out (even if the required work is usually minimal, as far as our experience goes, from 2.3 to 2.7). For this reason I imagine that PyPy is going to be Python 2 forever. As it runs on PyPy itself, we won't even require a working CPython 2.x to get started, although I'm sure these will also remain available forever. A bientôt, Armin. From arigo at tunes.org Sun Mar 2 17:01:32 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 2 Mar 2014 17:01:32 +0100 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> Message-ID: Hi Brecht, On 1 March 2014 23:34, Brecht Machiels wrote: > While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? It's not really helpful, but the warm-up time is the first issue here. If I edit template.py to run it e.g. 
10 times instead of only once, the speed grows quickly by a factor of 4. It means your code, for some reason, exhibits slow warm-ups (not the worst we've seen, but I agree it's a lot). It would be interesting to know if you have a similar speed-up when processing a single 10-times-larger document instead of 10 times the same small document :-) A bientôt, Armin. From fijall at gmail.com Sun Mar 2 22:15:28 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 2 Mar 2014 23:15:28 +0200 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> Message-ID: Hi Brecht. I must say I've been trying to understand what's going on and I'm failing so far. Thanks for a valuable benchmark! And yes, we're working on improving the warmup time (ETA unknown though) On Sun, Mar 2, 2014 at 12:11 PM, Brecht Machiels wrote: > Thanks Maciej, > > sys._getframe was introduced by "magicsuper", which I used to avoid refactoring all super() calls. I've done that now and there shouldn't be any more sys._getframe calls. You can pull in this commit from the pypy2 branch. > > Unfortunately, this didn't improve performance much. PyPy now takes 26 seconds. Any other ideas? > > Best regards, > Brecht > > ---- On Sun, 02 Mar 2014 08:19:35 +0100 Maciej Fijalkowski wrote ---- > >>the first obvious thing that jumps at me is your casual use of >>sys._getframe - the JIT aborts in this case and proceeds to the >>interpreter (so you pay the price for JITting, while you also pay the >>price for not having compiled assembler). That probably does not >>explain everything, but please don't use sys._getframe in production >>code if you want the JIT to be fast. 
>> >On Sun, Mar 2, 2014 at 12:34 AM, Brecht Machiels > wrote: >>> Hello, >>> >>> I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). >>> >>> Results on my Celeron T3000 (Arch Linux x86_64): >>> CPython 3.3.4 14 s >>> PyPy3 2.1.0-beta1 61 s >>> CPython 2.7.6 15 s >>> PyPy 2.2.1 35 s >>> >>> If you want to give it a try (no external dependencies): >>> >>> git clone --branch pypy2 https://github.com/brechtm/rinohtype.git >>> cd rinohtype/examples/rfic2009 >>> rm -rf template.ptc; PYTHONPATH=../.. pypy template.py >>> >>> >>> While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? >>> >>> Best regards, >>> Brecht >>> >>> >>> _______________________________________________ >>> pypy-dev mailing list >>> pypy-dev at python.org >>> https://mail.python.org/mailman/listinfo/pypy-dev >> > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From johan.rade at gmail.com Tue Mar 4 15:43:55 2014 From: johan.rade at gmail.com (=?ISO-8859-1?Q?Johan_R=E5de?=) Date: Tue, 04 Mar 2014 15:43:55 +0100 Subject: [pypy-dev] pypy 2.2.1 win32 In-Reply-To: <53107B9C.4000705@gmx.de> References: <366629207.1251178.1393571016417.JavaMail.ngmail@webmail11.arcor-online.net> <53107B9C.4000705@gmx.de> Message-ID: <5315E6AB.3080906@gmail.com> Hi Carl Friedrich, What kind of benchmark do you prefer? A benchmark that shows how great PyPy is compared with C-Python? Then you might use Sunfish, https://github.com/thomasahle/sunfish. Sunfish does not have any official benchmarks, but I think you could use test.selfplay() as a benchmark. (It has Sunfish play against itself, and it plays the same 84-move game each time.) This benchmark shows that PyPy is 3.5 times faster than C-Python. 
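A minimal timing harness for a run like test.selfplay() might look like the sketch below. The engine call is replaced by a stand-in workload so the snippet is self-contained; nothing about Sunfish's interface is assumed beyond "a callable".

```python
import time

def time_runs(workload, repeats=3):
    """Time `workload` a few times and return per-run wall-clock seconds."""
    results = []
    for _ in range(repeats):
        start = time.time()
        workload()
        results.append(time.time() - start)
    return results

def workload():
    # Stand-in for an engine call such as Sunfish's test.selfplay().
    total = 0
    for i in range(100000):
        total += i * i
    return total

print(time_runs(workload))
```

Repeating the run a few times also makes JIT warm-up visible: on PyPy the first entry tends to be the slowest.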
Or do you want a benchmark that shows how poor PyPy is, and maybe suggests where some improvement might be needed? Then you could use PyChess, http://code.google.com/p/pychess. PyChess comes with an official benchmark, pychess.Utils.lutils.Benchmark.benchmark(). It shows that PyPy is only 25% faster than C-Python. Sunfish is a bit of a toy program (but what a nice toy!). PyChess is a real chess program, maybe the leading chess program in Python. Best wishes, Johan On 2014-02-28 13:05, Carl Friedrich Bolz wrote: > Hi Norbert, > > On 28/02/14 08:03, norbert.raimund.leisner at arcor.de wrote: >> I ask you because a chess program "Sunfish" > https://github.com/thomasahle/sunfish/ is using pypy. > > Unrelated to your actual question, this sounds like a very cool > addition to our benchmark set. Somebody feel like adding it? > > Cheers, > > Carl Friedrich > From len-l at telus.net Thu Mar 6 07:16:57 2014 From: len-l at telus.net (Lenard Lindstrom) Date: Wed, 05 Mar 2014 22:16:57 -0800 Subject: [pypy-dev] RPython question about the lifetime of global state Message-ID: <531812D9.4030207@telus.net> Hi everyone, I am developing a new image blit system for Pygame 2.0 - the SDL2 edition. A blitter prototype project is maintained at https://bitbucket.org/llindstrom/blitter. The prototype implements a blit loop JIT; Pixel format specific blit code is generated dynamically as needed. The prototype is written in RPython as an interpreter for executing array copies. The JIT comes automatically from the RPython tool chain, of course. The prototype blitter is built as a stand-alone shared library with flags -Ojit --gcrootfinder=shadowstack. It has no Python dependencies. There are two entrypoint C functions, both decorated with rpython.rlib.entrypoint.entrypoint. Python side code uses CFFI to access the library. The library is initialized with a single call to rpython_startup_code at load time. 
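The load-time initialization contract (one call to rpython_startup_code, after which the RPython state lives for the life of the library) can be captured with a small guard on the Python side. This is only a sketch: FakeLib stands in for the real CFFI handle, and rpython_startup_code is the only name taken from the message above.

```python
import threading

_started = False
_start_lock = threading.Lock()

def ensure_started(lib):
    # rpython_startup_code must run exactly once per process; after that the
    # RPython globals, JIT caches and GC persist until the library is unloaded.
    global _started
    with _start_lock:
        if not _started:
            lib.rpython_startup_code()
            _started = True

class FakeLib(object):
    # Stand-in for the CFFI-loaded blitter library.
    def __init__(self):
        self.calls = 0
    def rpython_startup_code(self):
        self.calls += 1

lib = FakeLib()
ensure_started(lib)
ensure_started(lib)  # second call is a no-op: the guard has tripped
print(lib.calls)
```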
The blitter library is meant to be an embedded interpreter, with initialize, configure, and execute functions. So my question: does the RPython tool chain explicitly support embedded interpreters? I ask because I have only seen secondary entry points used as callbacks into an interpreter (PyPy's cpyext interface). So I wish to confirm that the lifetime of an RPython global namespace, the JIT caches, and the garbage collector are that of the loaded library, and not just that of an entry point function call. Thanks in advance, Lenard Lindstrom From arigo at tunes.org Thu Mar 6 07:55:22 2014 From: arigo at tunes.org (Armin Rigo) Date: Thu, 6 Mar 2014 07:55:22 +0100 Subject: [pypy-dev] RPython question about the lifetime of global state In-Reply-To: <531812D9.4030207@telus.net> References: <531812D9.4030207@telus.net> Message-ID: Hi Lenard, On 6 March 2014 07:16, Lenard Lindstrom wrote: > The prototype is written in RPython as an interpreter for executing > array copies. The JIT comes automatically from the RPython tool chain, of > course. Cool :-) RPython can certainly be used in this way, although critics might rightfully argue that you're getting a very big framework around a very small piece of code. You're getting for free a JIT that knows all about optimizing temporary allocations and tons of other things typical in a dynamic language, none of which really applies in your case. As long as the interpreter to JIT is only a few lines of source, I'd recommend at least having a look at other libraries (LibJIT for example). It would come with a smaller footprint (in code size, in memory usage, and in warm-up time) for similar results. It only works if the interpreter to JIT is small or if you have tons of time on your hands :-) > So I wish to confirm that the lifetime of an RPython > global namespace, the JIT caches, and the garbage collector are that of the > loaded library, and not just that of an entry point function call. 
Yes: it must be initialized only once, and then everything stays around. A bientôt, Armin. From norbert.raimund.leisner at arcor.de Fri Mar 7 07:45:29 2014 From: norbert.raimund.leisner at arcor.de (norbert.raimund.leisner at arcor.de) Date: Fri, 7 Mar 2014 07:45:29 +0100 (CET) Subject: [pypy-dev] pypy 2.2.1 win32 In-Reply-To: References: <366629207.1251178.1393571016417.JavaMail.ngmail@webmail11.arcor-online.net> Message-ID: <1261519382.77187.1394174729955.JavaMail.ngmail@webmail09.arcor-online.net> Hello Maciej, hello support-team! my hardware is: Intel Core 2 Duo E6600 (2x2,4 GHz) - 1 GB RAM - 512 MB graphics card operating system: Windows XP SP3 32-bit Would your recommendation be PyPy3 2.1 beta 1 win32, or is Python 3.3.4 x86 MSI your first choice in this case? http://www.python.org/download/releases/3.3.4/ I use it for Sunfish https://github.com/thomasahle/sunfish/ and Shatranj http://code.google.com/p/shatranjpy/ (two chess engines) cf. WinBoard/CECP-protocol http://www.open-aurec.com/wbforum/WinBoard/engine-intf.html and WinBoard-GUI http://www.open-aurec.com/wbforum/viewtopic.php?f=19&t=51528 Best wishes, Norbert ----- Original Message ---- From: Maciej Fijalkowski To: norbert.raimund.leisner at arcor.de Date: 28.02.2014 08:39 Subject: Re: [pypy-dev] pypy 2.2.1 win32 > On Fri, Feb 28, 2014 at 9:03 AM, wrote: > > Hello support-team, > > > > I have installed pypy 2.2.1 win32 for my OS Windows XP SP 3 -32 bit, > Python 2.7, MSI Microsoft Visual C++ 2008 SP1 Redistributable Package > (x86). > > > > Now my question: > > Must Python 2.7 be uninstalled and replaced by Python 2.7.6, or not? > > As far as I understand your question, the answer is no. Various > versions of Python (and PyPy) can happily coexist next to each other. 
> > Cheers, > fijal > From johan.rade at gmail.com Fri Mar 7 19:25:20 2014 From: johan.rade at gmail.com (=?ISO-8859-1?Q?Johan_R=E5de?=) Date: Fri, 07 Mar 2014 19:25:20 +0100 Subject: [pypy-dev] pypy 2.2.1 win32 In-Reply-To: <1261519382.77187.1394174729955.JavaMail.ngmail@webmail09.arcor-online.net> References: <366629207.1251178.1393571016417.JavaMail.ngmail@webmail11.arcor-online.net> <1261519382.77187.1394174729955.JavaMail.ngmail@webmail09.arcor-online.net> Message-ID: Hi Norbert, It is not easy to answer questions like that. We have not tested every Python program with every Python version. Why don't you just try yourself? If you run into problems when you try with PyPy, feel free to ask for advice here. But OK, I happen to have Sunfish and four different 32-bit Python versions installed on my computer. Here are the results I got when I timed the Sunfish function test.selfplay(): PyPy 2.2.1: 106.2 s PyPy 3.2.1 beta: 395.2 s CPython 2.7.6: 363.9 s CPython 3.4.0 RC2: 426.5 s So it seems that you can use any of these four, but PyPy 2.2.1 is fastest. (I think the author of Sunfish has optimized the code using PyPy 2.) And let's get the terminology straight: Python is a language. PyPy, CPython, IronPython and Jython are different implementations of that language. The software that you call Python 3.3.4 should be called CPython 3.3.4. Cheers, Johan On 2014-03-07 07:45, norbert.raimund.leisner at arcor.de wrote: > Hello Maciej, hello support-team! > > my hardware is : Intel Core 2 Duo E6600 (2x2,4 Ghz) - 1 GB RAM - 512 MB graphical card > operation system: Windows XP SP3 32-bit > > Would yoru recommendation be PyPy 3 2.1 beta 1 win32 or is Python 3.3.4 x86 MSI at this case your first choice? > http://www.python.org/download/releases/3.3.4/ > > I use it for Sunfish https://github.com/thomasahle/sunfish/ and Shatranj http://code.google.com/p/shatranjpy/ (two chess engines) cf. 
WinBoard/CECP-protocol http://www.open-aurec.com/wbforum/WinBoard/engine-intf.html > and WinBoard-GUI http://www.open-aurec.com/wbforum/viewtopic.php?f=19&t=51528 > > Best wishes, > Norbert > > > ----- Original Nachricht ---- > Von: Maciej Fijalkowski > An: norbert.raimund.leisner at arcor.de > Datum: 28.02.2014 08:39 > Betreff: Re: [pypy-dev] pypy 2.2.1 win32 > >> On Fri, Feb 28, 2014 at 9:03 AM, wrote: >>> Hello support-team, >>> >>> I have installed pypy 2.2.1 win32 for my OS Windows XP SP 3 -32 bit, >> Python 2.7, MSI Microsoft Visual C++ 2008 SP1 Redistributable Package >> (x86). >>> >>> Now my question: >>> Must be Python 2.7 deinstalled and replaced by Pythonv2.7.6 or not? >> >> As far as I understand your question, the answer is no. Various >> versions of Python (and PyPy) can happily coexist next to each other. >> >> Cheers, >> fijal >> From johan.rade at gmail.com Sat Mar 8 14:42:52 2014 From: johan.rade at gmail.com (=?ISO-8859-1?Q?Johan_R=E5de?=) Date: Sat, 08 Mar 2014 14:42:52 +0100 Subject: [pypy-dev] Possibly a PyPy C-API bug Message-ID: Hi everyone, I think I might have found a bug in the PyPy C-API. It seems that PyType_Type.tp_new is broken. Here is a minimal example that reproduces the bug. Instructions: Compile Foo3.c as a python extension module named Foo3. Set up the paths so that Test3.py can find Foo3. Run Test3.py Expected result and observed result with CPython 2.7.6: Test3.py runs Observed result with PyPy 2.2.1: Test3.py crashes. (It gets into an infinite recursive loop where PyType_Type.tp_new and Foo3Type_Type.tp_new keep calling each other.) Fixing this bug, or finding a workaround, would get me one step closer to getting PySide to run with PyPy. Cheers, Johan -------------- next part --------------
#include <Python.h>

PyObject* foo3type_tp_new(PyTypeObject* metatype, PyObject* args, PyObject* kwds)
{
    // In a more realistic example we might do some preprocessing of args and kwargs here ... 
    PyObject* newType = PyType_Type.tp_new(metatype, args, kwds);
    // ... and some postprocessing of newType here
    return newType;
}

PyTypeObject Foo3Type_Type = {
    PyVarObject_HEAD_INIT(0, 0)
    /*tp_name*/             "Foo3.Type",
    /*tp_basicsize*/        sizeof(PyTypeObject),
    /*tp_itemsize*/         0,
    /*tp_dealloc*/          0,
    /*tp_print*/            0,
    /*tp_getattr*/          0,
    /*tp_setattr*/          0,
    /*tp_compare*/          0,
    /*tp_repr*/             0,
    /*tp_as_number*/        0,
    /*tp_as_sequence*/      0,
    /*tp_as_mapping*/       0,
    /*tp_hash*/             0,
    /*tp_call*/             0,
    /*tp_str*/              0,
    /*tp_getattro*/         0,
    /*tp_setattro*/         0,
    /*tp_as_buffer*/        0,
    /*tp_flags*/            Py_TPFLAGS_DEFAULT,
    /*tp_doc*/              0,
    /*tp_traverse*/         0,
    /*tp_clear*/            0,
    /*tp_richcompare*/      0,
    /*tp_weaklistoffset*/   0,
    /*tp_iter*/             0,
    /*tp_iternext*/         0,
    /*tp_methods*/          0,
    /*tp_members*/          0,
    /*tp_getset*/           0,
    /*tp_base*/             0, // set to &PyType_Type in module init function (why can it not be done here?)
    /*tp_dict*/             0,
    /*tp_descr_get*/        0,
    /*tp_descr_set*/        0,
    /*tp_dictoffset*/       0,
    /*tp_init*/             0,
    /*tp_alloc*/            0,
    /*tp_new*/              foo3type_tp_new,
    /*tp_free*/             0,
    /*tp_is_gc*/            0,
    /*tp_bases*/            0,
    /*tp_mro*/              0,
    /*tp_cache*/            0,
    /*tp_subclasses*/       0,
    /*tp_weaklist*/         0
};

static PyMethodDef sbkMethods[] = {{NULL, NULL, 0, NULL}};

#ifdef _WIN32
__declspec(dllexport) void  // PyMODINIT_FUNC is broken on PyPy/Windows
#else
PyMODINIT_FUNC
#endif
initFoo3(void)
{
    PyObject* mod = Py_InitModule("Foo3", sbkMethods);
    Foo3Type_Type.tp_base = &PyType_Type;
    PyType_Ready(&Foo3Type_Type);
    PyModule_AddObject(mod, "Type", (PyObject*)&Foo3Type_Type);
}
-------------- next part --------------
import Foo3

class X(object):
    __metaclass__ = Foo3.Type
    pass
From arigo at tunes.org Sun Mar 9 08:26:53 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 9 Mar 2014 08:26:53 +0100 Subject: [pypy-dev] Possibly a PyPy C-API bug In-Reply-To: References: Message-ID: Hi Johan, On 8 March 2014 14:42, Johan Råde wrote: > I think I might have found a bug in the PyPy C-API. > It seems that PyType_Type.tp_new is broken. Indeed. 
I tried to look, but either I missed something or it looks like it won't be that obvious to fix. For reference, the built-in types like PyType_Type are generated automatically and all their slots (or maybe only tp_new?) seem to be subtly wrong: they are done with slot_tp_new(), which calls the instance's generic operation; the latter is possibly overridden in a subtype, thus leading to infinite recursion in cases like you report. Can you post this to the bug tracker? Otherwise it will likely be forgotten. A bientôt, Armin. From matti.picus at gmail.com Sun Mar 9 23:37:07 2014 From: matti.picus at gmail.com (Matti Picus) Date: Mon, 10 Mar 2014 00:37:07 +0200 Subject: [pypy-dev] win32 failures on own tests Message-ID: <531CED13.7020401@gmail.com> An HTML attachment was scrubbed... URL: From arigo at tunes.org Mon Mar 10 08:00:37 2014 From: arigo at tunes.org (Armin Rigo) Date: Mon, 10 Mar 2014 08:00:37 +0100 Subject: [pypy-dev] win32 failures on own tests In-Reply-To: <531CED13.7020401@gmail.com> References: <531CED13.7020401@gmail.com> Message-ID: Hi Matti, On 9 March 2014 23:37, Matti Picus wrote: > id(x) returning a long where an int is expected in rlib\objectmodel.py You're right, both CPython and PyPy return an unsigned integer which may not fit into an "int". A bientôt, Armin. From dimaqq at gmail.com Mon Mar 10 09:38:57 2014 From: dimaqq at gmail.com (Dima Tisnek) Date: Mon, 10 Mar 2014 09:38:57 +0100 Subject: [pypy-dev] slow-ish multithreaded primitives In-Reply-To: References: Message-ID: Can I try to make a case for _py3k_acquire inclusion when using the context manager API? Let's say a well-formed Python program always uses context managers, and thus timeouts are only supplied to condition.wait():

c = threading.Condition()
with c:
    while something:
        c.wait(some time)
        change state

with c:
    c.notifyAll()

What is the semantic difference in the choice of the underlying implementation of c._Condition__lock._RLock__block.acquire vs _py3k_acquire? 
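For reference, that wait/notify shape runs end to end as plain Python. This is a runnable sketch using only the stdlib threading module; notify_all() is the Python 3 spelling of 2.x's notifyAll().

```python
import threading

items = []
c = threading.Condition()

def consumer(out):
    with c:                  # __enter__ -> acquire() with no timeout argument
        while not items:
            c.wait(1.0)      # only wait() takes the timeout
        out.append(items.pop())

result = []
t = threading.Thread(target=consumer, args=(result,))
t.start()

with c:
    items.append("work")
    c.notify_all()           # spelled notifyAll() under Python 2

t.join()
print(result)  # ['work']
```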
what could go wrong if c._Condition_lock.__enter__ was mapped to _py3k_acquire instead? AFAIK context manager API doesn't allow user to pass blocking=0 here. Thus lock acquisition cannot time out. Seems pretty solid to me... That still leaves signal handling. Is the concern here about the context in which signal handler executes? the behaviour of user program because signal may be caught earlier? unexpected exception site for KeyboardInterrupt? d. On 27 February 2014 15:54, Armin Rigo wrote: > Hi Dima, > > On 25 February 2014 16:45, Dima Tisnek wrote: >> Armin, is there really a semantical change? >> Consider invocations valid in 2.7, (i.e. without timeout argument), is >> it not the same then? > > It's different: Python 3.x acquire() can be interrupted by signals, > whereas Python 2.x acquire() cannot. > >> should this code be in nightly builds? > > Yes. > > Armin From dimaqq at gmail.com Mon Mar 10 10:19:38 2014 From: dimaqq at gmail.com (Dima Tisnek) Date: Mon, 10 Mar 2014 10:19:38 +0100 Subject: [pypy-dev] slow-ish multithreaded primitives In-Reply-To: References: Message-ID: Oh, so sorry to have jumped the gun. now that I properly tested the nightly build I see that the performance issue I saw is gone and that condition.acquire actually calls _py3k_acquire when timeout argument is present. d. On 10 March 2014 09:38, Dima Tisnek wrote: > Can I try to make a case for _py3k_acquire inclusion when using > context manager API? > > Let's say a well-formed Python program always context managers, and > thus timeouts are only supplied to condition,wait(): > > c = threading.Condition() > with c: > while something: > c.wait(some time) > change state > > with c: > c.notifyAll() > > What is the semantic difference in the choice of the underlying > implementation of c._Condition__lock._RLock__block.acquire vs > _py3k_acquire? > > what could go wrong if c._Condition_lock.__enter__ was mapped to > _py3k_acquire instead? 
> > AFAIK context manager API doesn't allow user to pass blocking=0 here. > Thus lock acquisition cannot time out. Seems pretty solid to me... > > That still leaves signal handling. Is the concern here about the > context in which signal handler executes? the behaviour of user > program because signal may be caught earlier? unexpected exception > site for KeyboardInterrupt? > > d. > > On 27 February 2014 15:54, Armin Rigo wrote: >> Hi Dima, >> >> On 25 February 2014 16:45, Dima Tisnek wrote: >>> Armin, is there really a semantical change? >>> Consider invocations valid in 2.7, (i.e. without timeout argument), is >>> it not the same then? >> >> It's different: Python 3.x acquire() can be interrupted by signals, >> whereas Python 2.x acquire() cannot. >> >>> should this code be in nightly builds? >> >> Yes. >> >> Armin From naylor.b.david at gmail.com Mon Mar 10 18:26:19 2014 From: naylor.b.david at gmail.com (David Naylor) Date: Mon, 10 Mar 2014 20:26:19 +0300 Subject: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault] In-Reply-To: References: <201108102127.13752.naylor.b.david@gmail.com> <201108252144.09934.naylor.b.david@gmail.com> Message-ID: <3514347.BF9MiKfKNF@dragon.dg> On Friday, 26 August 2011 06:37:30 Armin Rigo wrote: > Hi David, > > On Thu, Aug 25, 2011 at 9:44 PM, David Naylor wrote: > > Below is the patch, and results, for my proposed hash methods for > > datetime.datetime (and easily adaptable to include tzinfo and the other > > datetime objects). I tried to make the hash safe for both 32bit and 64bit > > systems, and beyond. > > Yes, the patch looks good to me. I can definitely see how it can be a > huge improvement in performance :-) > > If you can also "fix" the other __hash__ methods in the same way, it > would be great. To follow up on a very old email. 
The latest results are:

# python2.7 iforkey.py
ifdict: [2.110611915588379, 2.12678599357605, 2.1126320362091064]
keydict: [2.1322460174560547, 2.098900079727173, 2.0998198986053467]
defaultdict: [3.184070110321045, 3.2007319927215576, 3.188380002975464]

# pypy2.2 iforkey.py
ifdict: [0.510915994644165, 0.23750996589660645, 0.2241990566253662]
keydict: [0.23270201683044434, 0.18279695510864258, 0.18002104759216309]
defaultdict: [3.4535930156707764, 3.3697848320007324, 3.392897129058838]

And using the latest datetime.py:

# pypy iforkey.py
ifdict: [0.2814958095550537, 0.23425602912902832, 0.22999906539916992]
keydict: [0.23637700080871582, 0.18506789207458496, 0.1831810474395752]
defaultdict: [2.8174121379852295, 2.74626088142395, 2.7308008670806885]

Excellent, thank you :-) Regards -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 964 bytes Desc: This is a digitally signed message part. URL: From brecht at mos6581.org Tue Mar 11 21:49:15 2014 From: brecht at mos6581.org (Brecht Machiels) Date: Tue, 11 Mar 2014 21:49:15 +0100 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> Message-ID: <144b2e80935.-6422525578898360305.-4494664970611521086@mos6581.org> Hello Maciej and Armin, Glad you think this is a valuable benchmark, since I provided it mostly for selfish reasons ;) I've done a quick test similar to Armin's, rendering the original 4-page document over and over again. While I can see the speed improving, it still doesn't reach CPython's performance. I haven't found the time yet to try with a longer document. I'll render a book from project Gutenberg soon and report back here. Let me know if there's anything else I can do. 
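The warm-up behaviour under discussion is easiest to see by timing each repetition separately rather than the run as a whole. A sketch; render_document is a stand-in workload, not RinohType's API:

```python
import time

def render_document():
    # Stand-in workload; under a JIT, early iterations include compile time.
    return sum(i * i for i in range(50000))

per_run = []
for _ in range(5):
    start = time.time()
    render_document()
    per_run.append(time.time() - start)

# Under a warming JIT the first entries should dominate; under CPython
# the times stay roughly flat.
print(["%.4f" % t for t in per_run])
```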
Bengt Richter raised an interesting question (but his message didn't seem to make it to the list): > Is there any way that jit results could be cached to some degree, in one > or more files, to give the next execution of a program a warmer start? I remember seeing a similar question before. IIRC one suggestion was to spawn a daemon process. I suppose that could work for RinohType, but I'm also interested to hear if it would be possible to have PyPy save the JIT state to a file on termination. Cheers, Brecht ---- On Sun, 02 Mar 2014 22:15:28 +0100 Maciej Fijalkowski wrote ---- > I must say I've been trying to understand what's going on and I'm > failing so far. Thanks for a valuable benchmark! And yes, we're > working on improving the warmup time (ETA unknown though) ---- On Sun, 02 Mar 2014 17:01:32 +0100 Armin Rigo wrote ---- > On 1 March 2014 23:34, Brecht Machiels wrote: > > While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? > > It's not really helpful, but the warm-up time is the first issue here. > If I edit template.py to run it e.g. 10 times instead of only once, > the speed grows quickly by a factor of 4. It means your code, for > some reason, exhibits slow warm-ups (not the worst we've seen, but I > agree it's a lot). It would be interesting to know if you have a > similar speed-up when processing a single 10-times-larger document > instead of 10 times the same small document :-) > > > A bient?t, > > Armin. 
From taavi.burns at gmail.com Tue Mar 11 23:26:45 2014 From: taavi.burns at gmail.com (Taavi Burns) Date: Tue, 11 Mar 2014 18:26:45 -0400 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <144b2e80935.-6422525578898360305.-4494664970611521086@mos6581.org> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> <144b2e80935.-6422525578898360305.-4494664970611521086@mos6581.org> Message-ID: <5E16145E-0A70-4A75-8743-C9E4D685DBFB@gmail.com> > On Mar 11, 2014, at 16:49, Brecht Machiels wrote: > > >> Is there any way that jit results could be cached to some degree, in one >> or more files, to give the next execution of a program a warmer start? > > I remember seeing a similar question before. IIRC one suggestion was to spawn a daemon process. I suppose that could work for RinohType, but I'm also interested to hear if it would be possible to have PyPy save the JIT state to a file on termination. There's a FAQ entry for that! :) http://pypy.readthedocs.org/en/improve-docs/faq.html#couldn-t-the-jit-dump-and-reload-already-compiled-machine-code -- taa /*eof*/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mak at issuu.com Wed Mar 12 23:06:19 2014 From: mak at issuu.com (Martin Koch) Date: Wed, 12 Mar 2014 23:06:19 +0100 Subject: [pypy-dev] Pypy garbage collection Message-ID: Hi List I'm running a server (written in python, executed with pypy) that holds a large graph (55GB, millions of nodes and edges) in memory and responds to queries by traversing the graph. The graph is mutated a few times a second, and there are hundreds of read-only requests a second. My problem is that I have no control over garbage collection. Thus, a major GC might kick in while serving a query, and with this amount of data, the GC takes around 2 minutes. 
I have tried mitigating this by guessing when a GC might be due, and proactively starting the garbage collector while not serving a request (this is ok, as duplicate servers will respond to requests while this one is collecting).

What I would really like is to be able to disable garbage collection for the old generation. This is because the graph is fairly static, and I can live with leaking memory from the relatively few and small mutations that occur. Any queries are only likely to generate objects in the new generation, and it is fine to collect these. Also, by design, the process is periodically restarted in order to re-synchronize it with an authoritative source (thus rebuilding the graph from scratch), so slight leakage is not an issue here.

I have tried experimenting with setting environment variables as well as the 'gc' module, but nothing seems to give me what I want.

If disabling gc for certain generations is not possible, it would be nice to be able to get a hint when a major collection is about to occur, so I can stop serving requests.

I'm using the following pypy version:
Python 2.7.3 (2.2.1+dfsg-1, Jan 24 2014, 10:12:37)
[PyPy 2.2.1 with GCC 4.6.3] on linux2

An additional question: pypy 2.2.1 should have incremental GC; shouldn't that avoid long pauses due to garbage collection?

Thanks,
/Martin Koch - Senior Systems Architect - issuu.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From fijall at gmail.com Thu Mar 13 00:56:34 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 13 Mar 2014 01:56:34 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: On Thu, Mar 13, 2014 at 12:06 AM, Martin Koch wrote: > Hi List > > I'm running a server (written in python, executed with pypy) that holds a > large graph (55GB, millions of nodes and edges) in memory and responds to > queries by traversing the graph.The graph is mutated a few times a second, > and there are hundreds of read-only requests a second. > > My problem is that I no control over garbage collection. Thus, a major GC > might kick in while serving a query, and with this amount of data, the GC > takes around 2 minutes. I have tried mitigating this by guessing when a GC > might be due, and proactively starting the garbage collector while not > serving a request (this is ok, as duplicate servers will respond to requests > while this one is collecting). > > What I would really like is to be able to disable garbage collection for the > old generation. This is because the graph is fairly static, and I can live > with leaking memory from the relatively few and small mutations that occur. > Any queries are only likely to generate objects in the new generation, and > it is fine to collect these. Also, by design, the process is periodically > restarted in order to re-synchronize it with an authoritative source (thus > rebuilding the graph from scratch), so slight leakage is not an issue here. > > I have tried experimenting with setting environment variables as well as the > 'gc' module, but nothing seems to give me what I want. > > If disabling gc for certain generations is not possible, it would be nice to > be able to get a hint when a major collection is about to occur, so I can > stop serving requests. 
>
> I'm using the following pypy version:
> Python 2.7.3 (2.2.1+dfsg-1, Jan 24 2014, 10:12:37)
> [PyPy 2.2.1 with GCC 4.6.3] on linux2
>
> An additional question: pypy 2.2.1 should have incremental GC; shouldn't
> that avoid long pauses due to garbage collection?

Yes, it totally should. If your pauses are not incremental, we would like to be able to execute it. Since it's 55G, do you think you can make us an example that can run on a normal machine?

From arigo at tunes.org Thu Mar 13 12:29:50 2014
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 13 Mar 2014 12:29:50 +0100
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To:
References:
Message-ID:

Hi Martin,

On 13 March 2014 00:56, Maciej Fijalkowski wrote:
> Yes, it totally should. If your pauses are not incremental, we would
> like to be able to execute it. Since it's 55G, do you think you can
> make us an example that can run on a normal machine?

I think the request is not very clear. We do have a machine with 100GB of RAM, so that part should not be a problem. The question of Maciej can probably be rephrased as: can you give us a reproducible example? Even if the large pauses appear to occur on any example you try (which they shouldn't), please give us one such example.

Also, maybe we should have anyway a way to give the GC a hint: "now is a good time to run if you need to".


À bientôt,

Armin.

From mak at issuu.com Thu Mar 13 12:45:04 2014
From: mak at issuu.com (Martin Koch)
Date: Thu, 13 Mar 2014 12:45:04 +0100
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To:
References:
Message-ID:

Hi Armin, Maciej

Thanks for responding.

I'm in the process of trying to determine what (if any) of the code I'm in a position to share, and I'll get back to you.

Allowing hinting to the GC would be good.
Even better would be a means to allow me to (transparently) allocate objects in unmanaged memory, but I would expect that to be a tall order :) Thanks, /Martin On Thu, Mar 13, 2014 at 12:29 PM, Armin Rigo wrote: > Hi Martin, > > On 13 March 2014 00:56, Maciej Fijalkowski wrote: > > Yes, it totally should. If your pauses are not incremental, we would > > like to be able to execute it. Since it's 55G, do you think you can > > make us an example that can run on a normal machine? > > I think the request is not very clear. We do have a machine with > 100GB of RAM, so that part should not be a problem. The question of > Maciej can probably be rephrased as: can you give us a reproducible > example? Even if the large pauses appear to occur on any example you > try (which they shouldn't), please give us one such example. > > Also, maybe we should have anyway a way to give the GC a hint: "now is > a good time to run if you need to". > > > A bient?t, > > Armin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Thu Mar 13 19:45:44 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 13 Mar 2014 20:45:44 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: > Hi Armin, Maciej > > Thanks for responding. > > I'm in the process of trying to determine what (if any) of the code I'm in a > position to share, and I'll get back to you. > > Allowing hinting to the GC would be good. Even better would be a means to > allow me to (transparently) allocate objects in unmanaged memory, but I > would expect that to be a tall order :) > > Thanks, > /Martin Hi Martin. Note that in case you want us to do the work of isolating the problem, we do offer paid support to do that (then we can sign NDAs and stuff). 
Otherwise we would be more than happy to fix bugs once you isolate a part you can share freely :) From mak at issuu.com Fri Mar 14 16:19:56 2014 From: mak at issuu.com (Martin Koch) Date: Fri, 14 Mar 2014 16:19:56 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: We have hacked up a small sample that seems to exhibit the same issue. We basically generate a linked list of objects. To increase connectedness, elements in the list hold references (dummy_links) to 10 randomly chosen previous elements in the list. We then time a function that traverses 50000 elements from the list from a random start point. If the traversal reaches the end of the list, we instead traverse one of the dummy links. Thus, exactly 50K elements are traversed every time. To generate some garbage, we build a list holding the traversed elements and a dummy list of characters. Timings for the last 100 runs are stored in a circular buffer. If the elapsed time for the last run is more than twice the average time, we print out a line with the elapsed time, the threshold, and the 90% runtime (we would like to see that the mean runtime does not increase with the number of elements in the list, but that the max time does increase (linearly with the number of object, i guess); traversing 50K elements should be independent of the memory size). We have tried monitoring memory consumption by external inspection, but cannot consistently verify that memory is deallocated at the same time that we see slow requests. Perhaps the pypy runtime doesn't always return freed pages back to the OS? Using top, we observe that 10M elements allocates around 17GB after building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly after building). 
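The "slow run" definition used in the benchmark boils down to two small helpers (a restatement of the script's logic for clarity; the function names are mine, not from the script):

```python
def slow_threshold(timings):
    # A run counts as slow if it takes more than twice the mean
    # of the recent runs held in the circular buffer.
    return 2.0 * sum(timings) / len(timings)

def quantile(timings, q=0.9):
    # The 90th-percentile runtime of the recent runs.
    return sorted(timings)[int(len(timings) * q)]

recent = [0.4] * 99 + [13.2]   # 99 normal runs and one slow one
print(slow_threshold(recent))  # well below 13.2, so that run is "slow"
print(quantile(recent))        # 0.4
```

Because the threshold is relative to the recent mean, a single multi-second pause stands out sharply against a buffer of sub-second runs, which is exactly the pattern reported below.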
Here is output from a few runs with different number of elements:

*pypy mem.py 10000000*
start build
end build 84.142424
that took a long time elapsed: 13.230586 slow_threshold: 1.495401 90th_quantile_runtime: 0.421558
that took a long time elapsed: 13.016531 slow_threshold: 1.488160 90th_quantile_runtime: 0.423441
that took a long time elapsed: 13.032537 slow_threshold: 1.474563 90th_quantile_runtime: 0.419817

*pypy mem.py 20000000*
start build
end build 180.823105
that took a long time elapsed: 27.346064 slow_threshold: 2.295146 90th_quantile_runtime: 0.434726
that took a long time elapsed: 26.028852 slow_threshold: 2.283927 90th_quantile_runtime: 0.374190
that took a long time elapsed: 25.432279 slow_threshold: 2.279631 90th_quantile_runtime: 0.371502

*pypy mem.py 30000000*
start build
end build 276.217811
that took a long time elapsed: 40.993855 slow_threshold: 3.188464 90th_quantile_runtime: 0.459891
that took a long time elapsed: 41.693553 slow_threshold: 3.183003 90th_quantile_runtime: 0.393654
that took a long time elapsed: 39.679769 slow_threshold: 3.190782 90th_quantile_runtime: 0.393677
that took a long time elapsed: 43.573411 slow_threshold: 3.239637 90th_quantile_runtime: 0.393654

*Code below*
*--------------------------------------------------------------*

import time
from random import randint, choice
import sys


allElems = {}

class Node:
    def __init__(self, v_):
        self.v = v_
        self.next = None
        self.dummy_data = [randint(0,100)
                           for _ in xrange(randint(50,100))]
        allElems[self.v] = self
        if self.v > 0:
            self.dummy_links = [allElems[randint(0, self.v-1)] for _ in xrange(10)]
        else:
            self.dummy_links = [self]

    def set_next(self, l):
        self.next = l


def follow(node):
    acc = []
    count = 0
    cur = node
    assert node.v is not None
    assert cur is not None
    while count < 50000:
        # return a value; generate some garbage
        acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in xrange(100)]))

        # if we have reached the end, choose a random link
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
        count += 1

    return acc


def build(num_elems):
    start = time.time()
    print "start build"
    root = Node(0)
    cur = root
    for x in xrange(1, num_elems):
        e = Node(x)
        cur.next = e
        cur = e
    print "end build %f" % (time.time() - start)
    return root


num_timings = 100
if __name__ == "__main__":
    num_elems = int(sys.argv[1])
    build(num_elems)
    total = 0
    timings = [0.0] * num_timings  # run times for the last num_timings runs
    i = 0
    beginning = time.time()
    while time.time() - beginning < 600:
        start = time.time()
        elem = allElems[randint(0, num_elems - 1)]
        assert(elem is not None)

        lst = follow(elem)

        total += choice(lst)[0]  # use the return value for something

        end = time.time()

        elapsed = end - start
        timings[i % num_timings] = elapsed
        if (i > num_timings):
            slow_time = 2 * sum(timings) / num_timings  # slow defined as > 2*avg run time
            if (elapsed > slow_time):
                print "that took a long time elapsed: %f slow_threshold: %f 90th_quantile_runtime: %f" % \
                    (elapsed, slow_time, sorted(timings)[int(num_timings*.9)])
        i += 1
    print total

On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski wrote:
> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote:
> > Hi Armin, Maciej
> >
> > Thanks for responding.
> >
> > I'm in the process of trying to determine what (if any) of the code I'm in a
> > position to share, and I'll get back to you.
> >
> > Allowing hinting to the GC would be good. Even better would be a means to
> > allow me to (transparently) allocate objects in unmanaged memory, but I
> > would expect that to be a tall order :)
> >
> > Thanks,
> > /Martin
>
> Hi Martin.
>
> Note that in case you want us to do the work of isolating the problem,
> we do offer paid support to do that (then we can sign NDAs and stuff).
> Otherwise we would be more than happy to fix bugs once you isolate a
> part you can share freely :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mak at issuu.com Sun Mar 16 22:34:51 2014 From: mak at issuu.com (Martin Koch) Date: Sun, 16 Mar 2014 22:34:51 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: I have tried getting the pypy source and building my own version of pypy. I have modified rpython/memory/gc/incminimark.py:major_collection_step() to print out when it starts and when it stops. Apparently, the slow queries do NOT occur during major_collection_step; at least, I have not observed major step output during a query execution. So, apparently, something else is blocking. This could be another aspect of the GC, but it could also be anything else. Just to be sure, I have tried running the same application in python with garbage collection disabled. I don't see the problem there, so it is somehow related to either GC or the runtime somehow. Cheers, /Martin On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: > We have hacked up a small sample that seems to exhibit the same issue. > > We basically generate a linked list of objects. To increase connectedness, > elements in the list hold references (dummy_links) to 10 randomly chosen > previous elements in the list. > > We then time a function that traverses 50000 elements from the list from a > random start point. If the traversal reaches the end of the list, we > instead traverse one of the dummy links. Thus, exactly 50K elements are > traversed every time. To generate some garbage, we build a list holding the > traversed elements and a dummy list of characters. > > Timings for the last 100 runs are stored in a circular buffer. 
If the > elapsed time for the last run is more than twice the average time, we print > out a line with the elapsed time, the threshold, and the 90% runtime (we > would like to see that the mean runtime does not increase with the number > of elements in the list, but that the max time does increase (linearly with > the number of object, i guess); traversing 50K elements should be > independent of the memory size). > > We have tried monitoring memory consumption by external inspection, but > cannot consistently verify that memory is deallocated at the same time that > we see slow requests. Perhaps the pypy runtime doesn't always return freed > pages back to the OS? > > Using top, we observe that 10M elements allocates around 17GB after > building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly > after building). > > Here is output from a few runs with different number of elements: > > > *pypy mem.py 10000000* > start build > end build 84.142424 > that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > 90th_quantile_runtime: 0.421558 > that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > 90th_quantile_runtime: 0.423441 > that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > 90th_quantile_runtime: 0.419817 > > *pypy mem.py 20000000* > start build > end build 180.823105 > that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > 90th_quantile_runtime: 0.434726 > that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > 90th_quantile_runtime: 0.374190 > that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > 90th_quantile_runtime: 0.371502 > > *pypy mem.py 30000000* > start build > end build 276.217811 > that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > 90th_quantile_runtime: 0.459891 > that took a long time elapsed: 41.693553 slow_threshold: 3.183003 > 90th_quantile_runtime: 0.393654 > that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > 
90th_quantile_runtime: 0.393677 > that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > 90th_quantile_runtime: 0.393654 > > *Code below* > *--------------------------------------------------------------* > import time > from random import randint, choice > import sys > > > allElems = {} > > class Node: > def __init__(self, v_): > self.v = v_ > self.next = None > self.dummy_data = [randint(0,100) > for _ in xrange(randint(50,100))] > allElems[self.v] = self > if self.v > 0: > self.dummy_links = [allElems[randint(0, self.v-1)] for _ in > xrange(10)] > else: > self.dummy_links = [self] > > def set_next(self, l): > self.next = l > > > def follow(node): > acc = [] > count = 0 > cur = node > assert node.v is not None > assert cur is not None > while count < 50000: > # return a value; generate some garbage > acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in > xrange(100)])) > > # if we have reached the end, chose a random link > cur = choice(cur.dummy_links) if cur.next is None else cur.next > count += 1 > > return acc > > > def build(num_elems): > start = time.time() > print "start build" > root = Node(0) > cur = root > for x in xrange(1, num_elems): > e = Node(x) > cur.next = e > cur = e > print "end build %f" % (time.time() - start) > return root > > > num_timings = 100 > if __name__ == "__main__": > num_elems = int(sys.argv[1]) > build(num_elems) > total = 0 > timings = [0.0] * num_timings # run times for the last num_timings runs > i = 0 > beginning = time.time() > while time.time() - beginning < 600: > start = time.time() > elem = allElems[randint(0, num_elems - 1)] > assert(elem is not None) > > lst = follow(elem) > > total += choice(lst)[0] # use the return value for something > > end = time.time() > > elapsed = end-start > timings[i % num_timings] = elapsed > if (i > num_timings): > slow_time = 2 * sum(timings)/num_timings # slow defined as > > 2*avg run time > if (elapsed > slow_time): > print "that took a long time elapsed: %f 
slow_threshold: > %f 90th_quantile_runtime: %f" % \ > (elapsed, slow_time, > sorted(timings)[int(num_timings*.9)]) > i += 1 > print total > > > > > > On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski wrote: > >> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >> > Hi Armin, Maciej >> > >> > Thanks for responding. >> > >> > I'm in the process of trying to determine what (if any) of the code I'm >> in a >> > position to share, and I'll get back to you. >> > >> > Allowing hinting to the GC would be good. Even better would be a means >> to >> > allow me to (transparently) allocate objects in unmanaged memory, but I >> > would expect that to be a tall order :) >> > >> > Thanks, >> > /Martin >> >> Hi Martin. >> >> Note that in case you want us to do the work of isolating the problem, >> we do offer paid support to do that (then we can sign NDAs and stuff). >> Otherwise we would be more than happy to fix bugs once you isolate a >> part you can share freely :) >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 08:21:25 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 09:21:25 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: there is an environment variable PYPYLOG=gc:- (where - is stdout) which will do that for you btw. maybe you can find out what's that using profiling or valgrind? On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > I have tried getting the pypy source and building my own version of pypy. I > have modified rpython/memory/gc/incminimark.py:major_collection_step() to > print out when it starts and when it stops. Apparently, the slow queries do > NOT occur during major_collection_step; at least, I have not observed major > step output during a query execution. So, apparently, something else is > blocking. This could be another aspect of the GC, but it could also be > anything else. 
> > Just to be sure, I have tried running the same application in python with > garbage collection disabled. I don't see the problem there, so it is somehow > related to either GC or the runtime somehow. > > Cheers, > /Martin > > > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: >> >> We have hacked up a small sample that seems to exhibit the same issue. >> >> We basically generate a linked list of objects. To increase connectedness, >> elements in the list hold references (dummy_links) to 10 randomly chosen >> previous elements in the list. >> >> We then time a function that traverses 50000 elements from the list from a >> random start point. If the traversal reaches the end of the list, we instead >> traverse one of the dummy links. Thus, exactly 50K elements are traversed >> every time. To generate some garbage, we build a list holding the traversed >> elements and a dummy list of characters. >> >> Timings for the last 100 runs are stored in a circular buffer. If the >> elapsed time for the last run is more than twice the average time, we print >> out a line with the elapsed time, the threshold, and the 90% runtime (we >> would like to see that the mean runtime does not increase with the number of >> elements in the list, but that the max time does increase (linearly with the >> number of object, i guess); traversing 50K elements should be independent of >> the memory size). >> >> We have tried monitoring memory consumption by external inspection, but >> cannot consistently verify that memory is deallocated at the same time that >> we see slow requests. Perhaps the pypy runtime doesn't always return freed >> pages back to the OS? >> >> Using top, we observe that 10M elements allocates around 17GB after >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly >> after building). 
>> >> Here is output from a few runs with different number of elements: >> >> >> pypy mem.py 10000000 >> start build >> end build 84.142424 >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 >> 90th_quantile_runtime: 0.421558 >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 >> 90th_quantile_runtime: 0.423441 >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 >> 90th_quantile_runtime: 0.419817 >> >> pypy mem.py 20000000 >> start build >> end build 180.823105 >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 >> 90th_quantile_runtime: 0.434726 >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 >> 90th_quantile_runtime: 0.374190 >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 >> 90th_quantile_runtime: 0.371502 >> >> pypy mem.py 30000000 >> start build >> end build 276.217811 >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 >> 90th_quantile_runtime: 0.459891 >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 >> 90th_quantile_runtime: 0.393654 >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 >> 90th_quantile_runtime: 0.393677 >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 >> 90th_quantile_runtime: 0.393654 >> >> Code below >> -------------------------------------------------------------- >> import time >> from random import randint, choice >> import sys >> >> >> allElems = {} >> >> class Node: >> def __init__(self, v_): >> self.v = v_ >> self.next = None >> self.dummy_data = [randint(0,100) >> for _ in xrange(randint(50,100))] >> allElems[self.v] = self >> if self.v > 0: >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ in >> xrange(10)] >> else: >> self.dummy_links = [self] >> >> def set_next(self, l): >> self.next = l >> >> >> def follow(node): >> acc = [] >> count = 0 >> cur = node >> assert node.v is not None >> assert cur is not None >> while count < 50000: >> # 
return a value; generate some garbage >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in >> xrange(100)])) >> >> # if we have reached the end, chose a random link >> cur = choice(cur.dummy_links) if cur.next is None else cur.next >> count += 1 >> >> return acc >> >> >> def build(num_elems): >> start = time.time() >> print "start build" >> root = Node(0) >> cur = root >> for x in xrange(1, num_elems): >> e = Node(x) >> cur.next = e >> cur = e >> print "end build %f" % (time.time() - start) >> return root >> >> >> num_timings = 100 >> if __name__ == "__main__": >> num_elems = int(sys.argv[1]) >> build(num_elems) >> total = 0 >> timings = [0.0] * num_timings # run times for the last num_timings >> runs >> i = 0 >> beginning = time.time() >> while time.time() - beginning < 600: >> start = time.time() >> elem = allElems[randint(0, num_elems - 1)] >> assert(elem is not None) >> >> lst = follow(elem) >> >> total += choice(lst)[0] # use the return value for something >> >> end = time.time() >> >> elapsed = end-start >> timings[i % num_timings] = elapsed >> if (i > num_timings): >> slow_time = 2 * sum(timings)/num_timings # slow defined as > >> 2*avg run time >> if (elapsed > slow_time): >> print "that took a long time elapsed: %f slow_threshold: >> %f 90th_quantile_runtime: %f" % \ >> (elapsed, slow_time, >> sorted(timings)[int(num_timings*.9)]) >> i += 1 >> print total >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> wrote: >>> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >>> > Hi Armin, Maciej >>> > >>> > Thanks for responding. >>> > >>> > I'm in the process of trying to determine what (if any) of the code I'm >>> > in a >>> > position to share, and I'll get back to you. >>> > >>> > Allowing hinting to the GC would be good. 
Even better would be a means >>> > to >>> > allow me to (transparently) allocate objects in unmanaged memory, but I >>> > would expect that to be a tall order :) >>> > >>> > Thanks, >>> > /Martin >>> >>> Hi Martin. >>> >>> Note that in case you want us to do the work of isolating the problem, >>> we do offer paid support to do that (then we can sign NDAs and stuff). >>> Otherwise we would be more than happy to fix bugs once you isolate a >>> part you can share freely :) >> >> > From fijall at gmail.com Mon Mar 17 12:09:09 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 13:09:09 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: The number of lines is nonsense. This is a timestamp in hex. On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: > Based On Maciej's suggestion, I tried the following > > PYPYLOG=- pypy mem.py 10000000 > out > > This generates a logfile which looks something like this > > start--> > [2b99f1981b527e] {gc-minor > [2b99f1981ba680] {gc-minor-walkroots > [2b99f1981c2e02] gc-minor-walkroots} > [2b99f19890d750] gc-minor} > [snip] > ... > <--stop > > > It turns out that the culprit is a lot of MINOR collections. > > I base this on the following observations: > > I can't understand the format of the timestamp on each logline (the > "[2b99f1981b527e]"). From what I can see in the code, this should be output > from time.clock(), but that doesn't return a number like that when I run > pypy interactively > Instead, I count the number of debug lines between start--> and the > corresponding <--stop. > Most runs have a few hundred lines of output between start/stop > All slow runs have very close to 57800 lines out output between start/stop > One such sample does 9609 gc-collect-step operations, 9647 gc-minor > operations, and 9647 gc-minor-walkroots operations. 
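The counting described above (debug lines and gc events between each `start-->`/`<--stop` pair) can be done with a few lines over the PYPYLOG output (a hypothetical sketch based on the log excerpt shown, not an official PyPy tool):

```python
def section_stats(log_lines):
    # Tally lines and gc-event openings between start-->/<--stop markers.
    stats = []
    current = None
    for line in log_lines:
        if line.startswith("start-->"):
            current = {"lines": 0, "gc-minor": 0, "gc-collect-step": 0}
        elif line.startswith("<--stop"):
            if current is not None:
                stats.append(current)
            current = None
        elif current is not None:
            current["lines"] += 1
            for event in ("gc-minor", "gc-collect-step"):
                # "{event-name" marks the opening of a debug section
                if "{" + event in line and "walkroots" not in line:
                    current[event] += 1
    return stats

sample = [
    "start-->",
    "[2b99f1981b527e] {gc-minor",
    "[2b99f1981ba680] {gc-minor-walkroots",
    "[2b99f1981c2e02] gc-minor-walkroots}",
    "[2b99f19890d750] gc-minor}",
    "<--stop",
]
print(section_stats(sample))  # [{'lines': 4, 'gc-minor': 1, 'gc-collect-step': 0}]
```

Sections with tens of thousands of lines then stand out immediately, matching the ~57800-line slow runs observed here.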
> > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > wrote: >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >> which will do that for you btw. >> >> maybe you can find out what's that using profiling or valgrind? >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >> > I have tried getting the pypy source and building my own version of >> > pypy. I >> > have modified rpython/memory/gc/incminimark.py:major_collection_step() >> > to >> > print out when it starts and when it stops. Apparently, the slow queries >> > do >> > NOT occur during major_collection_step; at least, I have not observed >> > major >> > step output during a query execution. So, apparently, something else is >> > blocking. This could be another aspect of the GC, but it could also be >> > anything else. >> > >> > Just to be sure, I have tried running the same application in python >> > with >> > garbage collection disabled. I don't see the problem there, so it is >> > somehow >> > related to either GC or the runtime somehow. >> > >> > Cheers, >> > /Martin >> > >> > >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: >> >> >> >> We have hacked up a small sample that seems to exhibit the same issue. >> >> >> >> We basically generate a linked list of objects. To increase >> >> connectedness, >> >> elements in the list hold references (dummy_links) to 10 randomly >> >> chosen >> >> previous elements in the list. >> >> >> >> We then time a function that traverses 50000 elements from the list >> >> from a >> >> random start point. If the traversal reaches the end of the list, we >> >> instead >> >> traverse one of the dummy links. Thus, exactly 50K elements are >> >> traversed >> >> every time. To generate some garbage, we build a list holding the >> >> traversed >> >> elements and a dummy list of characters. >> >> >> >> Timings for the last 100 runs are stored in a circular buffer. 
If the >> >> elapsed time for the last run is more than twice the average time, we >> >> print >> >> out a line with the elapsed time, the threshold, and the 90% runtime >> >> (we >> >> would like to see that the mean runtime does not increase with the >> >> number of >> >> elements in the list, but that the max time does increase (linearly >> >> with the >> >> number of object, i guess); traversing 50K elements should be >> >> independent of >> >> the memory size). >> >> >> >> We have tried monitoring memory consumption by external inspection, but >> >> cannot consistently verify that memory is deallocated at the same time >> >> that >> >> we see slow requests. Perhaps the pypy runtime doesn't always return >> >> freed >> >> pages back to the OS? >> >> >> >> Using top, we observe that 10M elements allocates around 17GB after >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >> >> shortly >> >> after building). >> >> >> >> Here is output from a few runs with different number of elements: >> >> >> >> >> >> pypy mem.py 10000000 >> >> start build >> >> end build 84.142424 >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 >> >> 90th_quantile_runtime: 0.421558 >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 >> >> 90th_quantile_runtime: 0.423441 >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 >> >> 90th_quantile_runtime: 0.419817 >> >> >> >> pypy mem.py 20000000 >> >> start build >> >> end build 180.823105 >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 >> >> 90th_quantile_runtime: 0.434726 >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 >> >> 90th_quantile_runtime: 0.374190 >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 >> >> 90th_quantile_runtime: 0.371502 >> >> >> >> pypy mem.py 30000000 >> >> start build >> >> end build 276.217811 >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 >> 
>> 90th_quantile_runtime: 0.459891 >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 >> >> 90th_quantile_runtime: 0.393654 >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 >> >> 90th_quantile_runtime: 0.393677 >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> Code below >> >> -------------------------------------------------------------- >> >> import time >> >> from random import randint, choice >> >> import sys >> >> >> >> >> >> allElems = {} >> >> >> >> class Node: >> >> def __init__(self, v_): >> >> self.v = v_ >> >> self.next = None >> >> self.dummy_data = [randint(0,100) >> >> for _ in xrange(randint(50,100))] >> >> allElems[self.v] = self >> >> if self.v > 0: >> >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ in >> >> xrange(10)] >> >> else: >> >> self.dummy_links = [self] >> >> >> >> def set_next(self, l): >> >> self.next = l >> >> >> >> >> >> def follow(node): >> >> acc = [] >> >> count = 0 >> >> cur = node >> >> assert node.v is not None >> >> assert cur is not None >> >> while count < 50000: >> >> # return a value; generate some garbage >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x >> >> in >> >> xrange(100)])) >> >> >> >> # if we have reached the end, chose a random link >> >> cur = choice(cur.dummy_links) if cur.next is None else cur.next >> >> count += 1 >> >> >> >> return acc >> >> >> >> >> >> def build(num_elems): >> >> start = time.time() >> >> print "start build" >> >> root = Node(0) >> >> cur = root >> >> for x in xrange(1, num_elems): >> >> e = Node(x) >> >> cur.next = e >> >> cur = e >> >> print "end build %f" % (time.time() - start) >> >> return root >> >> >> >> >> >> num_timings = 100 >> >> if __name__ == "__main__": >> >> num_elems = int(sys.argv[1]) >> >> build(num_elems) >> >> total = 0 >> >> timings = [0.0] * num_timings # run times for the last num_timings >> >> runs >> >> i = 0 >> >> 
beginning = time.time() >> >> while time.time() - beginning < 600: >> >> start = time.time() >> >> elem = allElems[randint(0, num_elems - 1)] >> >> assert(elem is not None) >> >> >> >> lst = follow(elem) >> >> >> >> total += choice(lst)[0] # use the return value for something >> >> >> >> end = time.time() >> >> >> >> elapsed = end-start >> >> timings[i % num_timings] = elapsed >> >> if (i > num_timings): >> >> slow_time = 2 * sum(timings)/num_timings # slow defined as >> >> > >> >> 2*avg run time >> >> if (elapsed > slow_time): >> >> print "that took a long time elapsed: %f >> >> slow_threshold: >> >> %f 90th_quantile_runtime: %f" % \ >> >> (elapsed, slow_time, >> >> sorted(timings)[int(num_timings*.9)]) >> >> i += 1 >> >> print total >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >> wrote: >> >>> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >> >>> > Hi Armin, Maciej >> >>> > >> >>> > Thanks for responding. >> >>> > >> >>> > I'm in the process of trying to determine what (if any) of the code >> >>> > I'm >> >>> > in a >> >>> > position to share, and I'll get back to you. >> >>> > >> >>> > Allowing hinting to the GC would be good. Even better would be a >> >>> > means >> >>> > to >> >>> > allow me to (transparently) allocate objects in unmanaged memory, >> >>> > but I >> >>> > would expect that to be a tall order :) >> >>> > >> >>> > Thanks, >> >>> > /Martin >> >>> >> >>> Hi Martin. >> >>> >> >>> Note that in case you want us to do the work of isolating the problem, >> >>> we do offer paid support to do that (then we can sign NDAs and stuff). 
>> >>> Otherwise we would be more than happy to fix bugs once you isolate a >> >>> part you can share freely :) >> >> >> >> >> > > > From fijall at gmail.com Mon Mar 17 13:53:20 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 14:53:20 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: I think it's the cycles of your CPU On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: > What is the unit? Perhaps I'm being thick here, but I can't correlate it > with seconds (which the program does print out). Slow runs are around 13 > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. from > 0x2b994c9d31889c to 0x2b9944ab8c4f49). > > > > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski > wrote: >> >> The number of lines is nonsense. This is a timestamp in hex. >> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: >> > Based On Maciej's suggestion, I tried the following >> > >> > PYPYLOG=- pypy mem.py 10000000 > out >> > >> > This generates a logfile which looks something like this >> > >> > start--> >> > [2b99f1981b527e] {gc-minor >> > [2b99f1981ba680] {gc-minor-walkroots >> > [2b99f1981c2e02] gc-minor-walkroots} >> > [2b99f19890d750] gc-minor} >> > [snip] >> > ... >> > <--stop >> > >> > >> > It turns out that the culprit is a lot of MINOR collections. >> > >> > I base this on the following observations: >> > >> > I can't understand the format of the timestamp on each logline (the >> > "[2b99f1981b527e]"). From what I can see in the code, this should be >> > output >> > from time.clock(), but that doesn't return a number like that when I run >> > pypy interactively >> > Instead, I count the number of debug lines between start--> and the >> > corresponding <--stop. 
>> > Most runs have a few hundred lines of output between start/stop >> > All slow runs have very close to 57800 lines out output between >> > start/stop >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor >> > operations, and 9647 gc-minor-walkroots operations. >> > >> > >> > Thanks, >> > /Martin >> > >> > >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> > wrote: >> >> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >> >> which will do that for you btw. >> >> >> >> maybe you can find out what's that using profiling or valgrind? >> >> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >> >> > I have tried getting the pypy source and building my own version of >> >> > pypy. I >> >> > have modified >> >> > rpython/memory/gc/incminimark.py:major_collection_step() >> >> > to >> >> > print out when it starts and when it stops. Apparently, the slow >> >> > queries >> >> > do >> >> > NOT occur during major_collection_step; at least, I have not observed >> >> > major >> >> > step output during a query execution. So, apparently, something else >> >> > is >> >> > blocking. This could be another aspect of the GC, but it could also >> >> > be >> >> > anything else. >> >> > >> >> > Just to be sure, I have tried running the same application in python >> >> > with >> >> > garbage collection disabled. I don't see the problem there, so it is >> >> > somehow >> >> > related to either GC or the runtime somehow. >> >> > >> >> > Cheers, >> >> > /Martin >> >> > >> >> > >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: >> >> >> >> >> >> We have hacked up a small sample that seems to exhibit the same >> >> >> issue. >> >> >> >> >> >> We basically generate a linked list of objects. To increase >> >> >> connectedness, >> >> >> elements in the list hold references (dummy_links) to 10 randomly >> >> >> chosen >> >> >> previous elements in the list. 
>> >> >> >> >> >> We then time a function that traverses 50000 elements from the list >> >> >> from a >> >> >> random start point. If the traversal reaches the end of the list, we >> >> >> instead >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are >> >> >> traversed >> >> >> every time. To generate some garbage, we build a list holding the >> >> >> traversed >> >> >> elements and a dummy list of characters. >> >> >> >> >> >> Timings for the last 100 runs are stored in a circular buffer. If >> >> >> the >> >> >> elapsed time for the last run is more than twice the average time, >> >> >> we >> >> >> print >> >> >> out a line with the elapsed time, the threshold, and the 90% runtime >> >> >> (we >> >> >> would like to see that the mean runtime does not increase with the >> >> >> number of >> >> >> elements in the list, but that the max time does increase (linearly >> >> >> with the >> >> >> number of object, i guess); traversing 50K elements should be >> >> >> independent of >> >> >> the memory size). >> >> >> >> >> >> We have tried monitoring memory consumption by external inspection, >> >> >> but >> >> >> cannot consistently verify that memory is deallocated at the same >> >> >> time >> >> >> that >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always return >> >> >> freed >> >> >> pages back to the OS? >> >> >> >> >> >> Using top, we observe that 10M elements allocates around 17GB after >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >> >> >> shortly >> >> >> after building). 
>> >> >> >> >> >> Here is output from a few runs with different number of elements: >> >> >> >> >> >> >> >> >> pypy mem.py 10000000 >> >> >> start build >> >> >> end build 84.142424 >> >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 >> >> >> 90th_quantile_runtime: 0.421558 >> >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 >> >> >> 90th_quantile_runtime: 0.423441 >> >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 >> >> >> 90th_quantile_runtime: 0.419817 >> >> >> >> >> >> pypy mem.py 20000000 >> >> >> start build >> >> >> end build 180.823105 >> >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 >> >> >> 90th_quantile_runtime: 0.434726 >> >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 >> >> >> 90th_quantile_runtime: 0.374190 >> >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 >> >> >> 90th_quantile_runtime: 0.371502 >> >> >> >> >> >> pypy mem.py 30000000 >> >> >> start build >> >> >> end build 276.217811 >> >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 >> >> >> 90th_quantile_runtime: 0.459891 >> >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 >> >> >> 90th_quantile_runtime: 0.393677 >> >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> >> >> Code below >> >> >> -------------------------------------------------------------- >> >> >> import time >> >> >> from random import randint, choice >> >> >> import sys >> >> >> >> >> >> >> >> >> allElems = {} >> >> >> >> >> >> class Node: >> >> >> def __init__(self, v_): >> >> >> self.v = v_ >> >> >> self.next = None >> >> >> self.dummy_data = [randint(0,100) >> >> >> for _ in xrange(randint(50,100))] >> >> >> allElems[self.v] = self >> >> >> if self.v > 
0: >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ >> >> >> in >> >> >> xrange(10)] >> >> >> else: >> >> >> self.dummy_links = [self] >> >> >> >> >> >> def set_next(self, l): >> >> >> self.next = l >> >> >> >> >> >> >> >> >> def follow(node): >> >> >> acc = [] >> >> >> count = 0 >> >> >> cur = node >> >> >> assert node.v is not None >> >> >> assert cur is not None >> >> >> while count < 50000: >> >> >> # return a value; generate some garbage >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for >> >> >> x >> >> >> in >> >> >> xrange(100)])) >> >> >> >> >> >> # if we have reached the end, chose a random link >> >> >> cur = choice(cur.dummy_links) if cur.next is None else >> >> >> cur.next >> >> >> count += 1 >> >> >> >> >> >> return acc >> >> >> >> >> >> >> >> >> def build(num_elems): >> >> >> start = time.time() >> >> >> print "start build" >> >> >> root = Node(0) >> >> >> cur = root >> >> >> for x in xrange(1, num_elems): >> >> >> e = Node(x) >> >> >> cur.next = e >> >> >> cur = e >> >> >> print "end build %f" % (time.time() - start) >> >> >> return root >> >> >> >> >> >> >> >> >> num_timings = 100 >> >> >> if __name__ == "__main__": >> >> >> num_elems = int(sys.argv[1]) >> >> >> build(num_elems) >> >> >> total = 0 >> >> >> timings = [0.0] * num_timings # run times for the last >> >> >> num_timings >> >> >> runs >> >> >> i = 0 >> >> >> beginning = time.time() >> >> >> while time.time() - beginning < 600: >> >> >> start = time.time() >> >> >> elem = allElems[randint(0, num_elems - 1)] >> >> >> assert(elem is not None) >> >> >> >> >> >> lst = follow(elem) >> >> >> >> >> >> total += choice(lst)[0] # use the return value for something >> >> >> >> >> >> end = time.time() >> >> >> >> >> >> elapsed = end-start >> >> >> timings[i % num_timings] = elapsed >> >> >> if (i > num_timings): >> >> >> slow_time = 2 * sum(timings)/num_timings # slow defined >> >> >> as >> >> >> > >> >> >> 2*avg run time >> >> >> if (elapsed > slow_time): >> >> >> 
print "that took a long time elapsed: %f >> >> >> slow_threshold: >> >> >> %f 90th_quantile_runtime: %f" % \ >> >> >> (elapsed, slow_time, >> >> >> sorted(timings)[int(num_timings*.9)]) >> >> >> i += 1 >> >> >> print total >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >> >> >> >> >> wrote: >> >> >>> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >> >> >>> > Hi Armin, Maciej >> >> >>> > >> >> >>> > Thanks for responding. >> >> >>> > >> >> >>> > I'm in the process of trying to determine what (if any) of the >> >> >>> > code >> >> >>> > I'm >> >> >>> > in a >> >> >>> > position to share, and I'll get back to you. >> >> >>> > >> >> >>> > Allowing hinting to the GC would be good. Even better would be a >> >> >>> > means >> >> >>> > to >> >> >>> > allow me to (transparently) allocate objects in unmanaged memory, >> >> >>> > but I >> >> >>> > would expect that to be a tall order :) >> >> >>> > >> >> >>> > Thanks, >> >> >>> > /Martin >> >> >>> >> >> >>> Hi Martin. >> >> >>> >> >> >>> Note that in case you want us to do the work of isolating the >> >> >>> problem, >> >> >>> we do offer paid support to do that (then we can sign NDAs and >> >> >>> stuff). >> >> >>> Otherwise we would be more than happy to fix bugs once you isolate >> >> >>> a >> >> >>> part you can share freely :) >> >> >> >> >> >> >> >> > >> > >> > > > From mak at issuu.com Mon Mar 17 13:48:23 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 13:48:23 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: What is the unit? Perhaps I'm being thick here, but I can't correlate it with seconds (which the program does print out). Slow runs are around 13 seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. from 0x2b994c9d31889c to 0x2b9944ab8c4f49). On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski wrote: > The number of lines is nonsense. This is a timestamp in hex. 
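[Archive note: the tick-to-seconds arithmetic discussed above can be sketched as follows. This is a hedged illustration, not part of the original thread: it assumes the bracketed PYPYLOG values are raw CPU timestamp-counter reads, and the ~2.5 GHz clock rate is the thread's own guess, not something PyPy documents.]

```python
# Sketch: convert two bracketed PYPYLOG timestamps (hex TSC ticks) into
# elapsed seconds.  CPU_HZ is an ASSUMPTION taken from the thread's
# "works out to around 2.5GHz" estimate.
CPU_HZ = 2.5e9  # assumed clock frequency, in ticks per second

def elapsed_seconds(start_hex, stop_hex, hz=CPU_HZ):
    """Elapsed wall time between two '[....]' log timestamps, in seconds."""
    return (int(stop_hex, 16) - int(start_hex, 16)) / hz

# Under this assumption, Martin's figures line up with his wall-clock
# observations: ~34e9 ticks / 2.5e9 Hz is about 13.6 s for a slow run,
# and ~1e9 ticks / 2.5e9 Hz is 0.4 s for a normal one.
```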
> > On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: > > Based On Maciej's suggestion, I tried the following > > > > PYPYLOG=- pypy mem.py 10000000 > out > > > > This generates a logfile which looks something like this > > > > start--> > > [2b99f1981b527e] {gc-minor > > [2b99f1981ba680] {gc-minor-walkroots > > [2b99f1981c2e02] gc-minor-walkroots} > > [2b99f19890d750] gc-minor} > > [snip] > > ... > > <--stop > > > > > > It turns out that the culprit is a lot of MINOR collections. > > > > I base this on the following observations: > > > > I can't understand the format of the timestamp on each logline (the > > "[2b99f1981b527e]"). From what I can see in the code, this should be > output > > from time.clock(), but that doesn't return a number like that when I run > > pypy interactively > > Instead, I count the number of debug lines between start--> and the > > corresponding <--stop. > > Most runs have a few hundred lines of output between start/stop > > All slow runs have very close to 57800 lines out output between > start/stop > > One such sample does 9609 gc-collect-step operations, 9647 gc-minor > > operations, and 9647 gc-minor-walkroots operations. > > > > > > Thanks, > > /Martin > > > > > > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > > wrote: > >> > >> there is an environment variable PYPYLOG=gc:- (where - is stdout) > >> which will do that for you btw. > >> > >> maybe you can find out what's that using profiling or valgrind? > >> > >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > >> > I have tried getting the pypy source and building my own version of > >> > pypy. I > >> > have modified rpython/memory/gc/incminimark.py:major_collection_step() > >> > to > >> > print out when it starts and when it stops. Apparently, the slow > queries > >> > do > >> > NOT occur during major_collection_step; at least, I have not observed > >> > major > >> > step output during a query execution. So, apparently, something else > is > >> > blocking. 
This could be another aspect of the GC, but it could also be > >> > anything else. > >> > > >> > Just to be sure, I have tried running the same application in python > >> > with > >> > garbage collection disabled. I don't see the problem there, so it is > >> > somehow > >> > related to either GC or the runtime somehow. > >> > > >> > Cheers, > >> > /Martin > >> > > >> > > >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: > >> >> > >> >> We have hacked up a small sample that seems to exhibit the same > issue. > >> >> > >> >> We basically generate a linked list of objects. To increase > >> >> connectedness, > >> >> elements in the list hold references (dummy_links) to 10 randomly > >> >> chosen > >> >> previous elements in the list. > >> >> > >> >> We then time a function that traverses 50000 elements from the list > >> >> from a > >> >> random start point. If the traversal reaches the end of the list, we > >> >> instead > >> >> traverse one of the dummy links. Thus, exactly 50K elements are > >> >> traversed > >> >> every time. To generate some garbage, we build a list holding the > >> >> traversed > >> >> elements and a dummy list of characters. > >> >> > >> >> Timings for the last 100 runs are stored in a circular buffer. If the > >> >> elapsed time for the last run is more than twice the average time, we > >> >> print > >> >> out a line with the elapsed time, the threshold, and the 90% runtime > >> >> (we > >> >> would like to see that the mean runtime does not increase with the > >> >> number of > >> >> elements in the list, but that the max time does increase (linearly > >> >> with the > >> >> number of object, i guess); traversing 50K elements should be > >> >> independent of > >> >> the memory size). > >> >> > >> >> We have tried monitoring memory consumption by external inspection, > but > >> >> cannot consistently verify that memory is deallocated at the same > time > >> >> that > >> >> we see slow requests. 
Perhaps the pypy runtime doesn't always return > >> >> freed > >> >> pages back to the OS? > >> >> > >> >> Using top, we observe that 10M elements allocates around 17GB after > >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB > >> >> shortly > >> >> after building). > >> >> > >> >> Here is output from a few runs with different number of elements: > >> >> > >> >> > >> >> pypy mem.py 10000000 > >> >> start build > >> >> end build 84.142424 > >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > >> >> 90th_quantile_runtime: 0.421558 > >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > >> >> 90th_quantile_runtime: 0.423441 > >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > >> >> 90th_quantile_runtime: 0.419817 > >> >> > >> >> pypy mem.py 20000000 > >> >> start build > >> >> end build 180.823105 > >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > >> >> 90th_quantile_runtime: 0.434726 > >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > >> >> 90th_quantile_runtime: 0.374190 > >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > >> >> 90th_quantile_runtime: 0.371502 > >> >> > >> >> pypy mem.py 30000000 > >> >> start build > >> >> end build 276.217811 > >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > >> >> 90th_quantile_runtime: 0.459891 > >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 > >> >> 90th_quantile_runtime: 0.393654 > >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > >> >> 90th_quantile_runtime: 0.393677 > >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > >> >> 90th_quantile_runtime: 0.393654 > >> >> > >> >> Code below > >> >> -------------------------------------------------------------- > >> >> import time > >> >> from random import randint, choice > >> >> import sys > >> >> > >> >> > >> >> allElems = {} > >> 
>> > >> >> class Node: > >> >> def __init__(self, v_): > >> >> self.v = v_ > >> >> self.next = None > >> >> self.dummy_data = [randint(0,100) > >> >> for _ in xrange(randint(50,100))] > >> >> allElems[self.v] = self > >> >> if self.v > 0: > >> >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ > in > >> >> xrange(10)] > >> >> else: > >> >> self.dummy_links = [self] > >> >> > >> >> def set_next(self, l): > >> >> self.next = l > >> >> > >> >> > >> >> def follow(node): > >> >> acc = [] > >> >> count = 0 > >> >> cur = node > >> >> assert node.v is not None > >> >> assert cur is not None > >> >> while count < 50000: > >> >> # return a value; generate some garbage > >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for > x > >> >> in > >> >> xrange(100)])) > >> >> > >> >> # if we have reached the end, chose a random link > >> >> cur = choice(cur.dummy_links) if cur.next is None else > cur.next > >> >> count += 1 > >> >> > >> >> return acc > >> >> > >> >> > >> >> def build(num_elems): > >> >> start = time.time() > >> >> print "start build" > >> >> root = Node(0) > >> >> cur = root > >> >> for x in xrange(1, num_elems): > >> >> e = Node(x) > >> >> cur.next = e > >> >> cur = e > >> >> print "end build %f" % (time.time() - start) > >> >> return root > >> >> > >> >> > >> >> num_timings = 100 > >> >> if __name__ == "__main__": > >> >> num_elems = int(sys.argv[1]) > >> >> build(num_elems) > >> >> total = 0 > >> >> timings = [0.0] * num_timings # run times for the last > num_timings > >> >> runs > >> >> i = 0 > >> >> beginning = time.time() > >> >> while time.time() - beginning < 600: > >> >> start = time.time() > >> >> elem = allElems[randint(0, num_elems - 1)] > >> >> assert(elem is not None) > >> >> > >> >> lst = follow(elem) > >> >> > >> >> total += choice(lst)[0] # use the return value for something > >> >> > >> >> end = time.time() > >> >> > >> >> elapsed = end-start > >> >> timings[i % num_timings] = elapsed > >> >> if (i > num_timings): > >> >> 
slow_time = 2 * sum(timings)/num_timings # slow defined > as > >> >> > > >> >> 2*avg run time > >> >> if (elapsed > slow_time): > >> >> print "that took a long time elapsed: %f > >> >> slow_threshold: > >> >> %f 90th_quantile_runtime: %f" % \ > >> >> (elapsed, slow_time, > >> >> sorted(timings)[int(num_timings*.9)]) > >> >> i += 1 > >> >> print total > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski < > fijall at gmail.com> > >> >> wrote: > >> >>> > >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: > >> >>> > Hi Armin, Maciej > >> >>> > > >> >>> > Thanks for responding. > >> >>> > > >> >>> > I'm in the process of trying to determine what (if any) of the > code > >> >>> > I'm > >> >>> > in a > >> >>> > position to share, and I'll get back to you. > >> >>> > > >> >>> > Allowing hinting to the GC would be good. Even better would be a > >> >>> > means > >> >>> > to > >> >>> > allow me to (transparently) allocate objects in unmanaged memory, > >> >>> > but I > >> >>> > would expect that to be a tall order :) > >> >>> > > >> >>> > Thanks, > >> >>> > /Martin > >> >>> > >> >>> Hi Martin. > >> >>> > >> >>> Note that in case you want us to do the work of isolating the > problem, > >> >>> we do offer paid support to do that (then we can sign NDAs and > stuff). > >> >>> Otherwise we would be more than happy to fix bugs once you isolate a > >> >>> part you can share freely :) > >> >> > >> >> > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 14:20:57 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 15:20:57 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: are you *sure* it's the walkroots that take that long and not something else (like gc-minor)? More of those mean that you allocate a lot more surviving objects. 
Can you do two things: a) take a max of gc-minor (and gc-minor-stackwalk), per request b) take the sum of those and plot them On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: > Well, then it works out to around 2.5GHz, which seems reasonable. But it > doesn't alter the conclusion from the previous email: The slow queries then > all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or > .4 seconds at this conversion. Also, the log shows that a slow query > performs many more gc-minor operations than a 'normal' one: 9600 > gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. > > So the question becomes: Why do we get this large spike in > gc-minor-walkroots, and, in particular, is there any way to avoid it :) ? > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski > wrote: >> >> I think it's the cycles of your CPU >> >> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: >> > What is the unit? Perhaps I'm being thick here, but I can't correlate it >> > with seconds (which the program does print out). Slow runs are around 13 >> > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. >> > from >> > 0x2b994c9d31889c to 0x2b9944ab8c4f49). >> > >> > >> > >> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> > wrote: >> >> >> >> The number of lines is nonsense. This is a timestamp in hex. >> >> >> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: >> >> > Based On Maciej's suggestion, I tried the following >> >> > >> >> > PYPYLOG=- pypy mem.py 10000000 > out >> >> > >> >> > This generates a logfile which looks something like this >> >> > >> >> > start--> >> >> > [2b99f1981b527e] {gc-minor >> >> > [2b99f1981ba680] {gc-minor-walkroots >> >> > [2b99f1981c2e02] gc-minor-walkroots} >> >> > [2b99f19890d750] gc-minor} >> >> > [snip] >> >> > ... >> >> > <--stop >> >> > >> >> > >> >> > It turns out that the culprit is a lot of MINOR collections. 
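[Archive note: a sketch of the per-request aggregation Maciej asks for above, added for illustration. The line format, the start-->/<--stop request markers, and the section nesting are all assumed from the PYPYLOG sample quoted in this thread, and the 2.5 GHz tick rate is the thread's guess.]

```python
import re

# Matches both "[2b99f1981b527e] {gc-minor" (open) and
# "[2b99f19890d750] gc-minor}" (close) lines from a PYPYLOG dump.
LINE = re.compile(r"\[([0-9a-f]+)\] (\{)?([\w-]+)(\})?")
CPU_HZ = 2.5e9  # assumed timestamp-counter frequency

def per_request_stats(log_lines, hz=CPU_HZ):
    """Return a list of (sum, max) of gc-minor durations in seconds for
    each start-->/<--stop delimited request in a PYPYLOG dump."""
    stats = []
    total = biggest = 0.0
    opened = {}  # section name -> opening timestamp (ticks)
    for line in log_lines:
        line = line.strip()
        if line == "start-->":
            total = biggest = 0.0
            continue
        if line == "<--stop":
            stats.append((total, biggest))
            continue
        m = LINE.match(line)
        if not m:
            continue  # skip "[snip]", "..." and other non-log lines
        ts = int(m.group(1), 16)
        name = m.group(3)
        if m.group(2):                        # "{name" opens a section
            opened[name] = ts
        elif m.group(4) and name in opened:   # "name}" closes it
            start = opened.pop(name)
            if name == "gc-minor":
                dt = (ts - start) / hz
                total += dt
                biggest = max(biggest, dt)
    return stats
```

Plotting the returned sums against request index (with any plotting tool) would then show whether the 13-second outliers coincide with large accumulated gc-minor time, which is what Maciej wants to distinguish from gc-minor-walkroots cost.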
>> >> > >> >> > I base this on the following observations: >> >> > >> >> > I can't understand the format of the timestamp on each logline (the >> >> > "[2b99f1981b527e]"). From what I can see in the code, this should be >> >> > output >> >> > from time.clock(), but that doesn't return a number like that when I >> >> > run >> >> > pypy interactively >> >> > Instead, I count the number of debug lines between start--> and the >> >> > corresponding <--stop. >> >> > Most runs have a few hundred lines of output between start/stop >> >> > All slow runs have very close to 57800 lines out output between >> >> > start/stop >> >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor >> >> > operations, and 9647 gc-minor-walkroots operations. >> >> > >> >> > >> >> > Thanks, >> >> > /Martin >> >> > >> >> > >> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> >> > >> >> > wrote: >> >> >> >> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >> >> >> which will do that for you btw. >> >> >> >> >> >> maybe you can find out what's that using profiling or valgrind? >> >> >> >> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >> >> >> > I have tried getting the pypy source and building my own version >> >> >> > of >> >> >> > pypy. I >> >> >> > have modified >> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() >> >> >> > to >> >> >> > print out when it starts and when it stops. Apparently, the slow >> >> >> > queries >> >> >> > do >> >> >> > NOT occur during major_collection_step; at least, I have not >> >> >> > observed >> >> >> > major >> >> >> > step output during a query execution. So, apparently, something >> >> >> > else >> >> >> > is >> >> >> > blocking. This could be another aspect of the GC, but it could >> >> >> > also >> >> >> > be >> >> >> > anything else. 
>> >> >> > >> >> >> > Just to be sure, I have tried running the same application in >> >> >> > python >> >> >> > with >> >> >> > garbage collection disabled. I don't see the problem there, so it >> >> >> > is >> >> >> > somehow >> >> >> > related to either GC or the runtime somehow. >> >> >> > >> >> >> > Cheers, >> >> >> > /Martin >> >> >> > >> >> >> > >> >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >> >> >> > wrote: >> >> >> >> >> >> >> >> We have hacked up a small sample that seems to exhibit the same >> >> >> >> issue. >> >> >> >> >> >> >> >> We basically generate a linked list of objects. To increase >> >> >> >> connectedness, >> >> >> >> elements in the list hold references (dummy_links) to 10 randomly >> >> >> >> chosen >> >> >> >> previous elements in the list. >> >> >> >> >> >> >> >> We then time a function that traverses 50000 elements from the >> >> >> >> list >> >> >> >> from a >> >> >> >> random start point. If the traversal reaches the end of the list, >> >> >> >> we >> >> >> >> instead >> >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are >> >> >> >> traversed >> >> >> >> every time. To generate some garbage, we build a list holding the >> >> >> >> traversed >> >> >> >> elements and a dummy list of characters. >> >> >> >> >> >> >> >> Timings for the last 100 runs are stored in a circular buffer. If >> >> >> >> the >> >> >> >> elapsed time for the last run is more than twice the average >> >> >> >> time, >> >> >> >> we >> >> >> >> print >> >> >> >> out a line with the elapsed time, the threshold, and the 90% >> >> >> >> runtime >> >> >> >> (we >> >> >> >> would like to see that the mean runtime does not increase with >> >> >> >> the >> >> >> >> number of >> >> >> >> elements in the list, but that the max time does increase >> >> >> >> (linearly >> >> >> >> with the >> >> >> >> number of object, i guess); traversing 50K elements should be >> >> >> >> independent of >> >> >> >> the memory size). 
>> >> >> >> >> >> >> >> We have tried monitoring memory consumption by external >> >> >> >> inspection, >> >> >> >> but >> >> >> >> cannot consistently verify that memory is deallocated at the same >> >> >> >> time >> >> >> >> that >> >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always >> >> >> >> return >> >> >> >> freed >> >> >> >> pages back to the OS? >> >> >> >> >> >> >> >> Using top, we observe that 10M elements allocates around 17GB >> >> >> >> after >> >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >> >> >> >> shortly >> >> >> >> after building). >> >> >> >> >> >> >> >> Here is output from a few runs with different number of elements: >> >> >> >> >> >> >> >> >> >> >> >> pypy mem.py 10000000 >> >> >> >> start build >> >> >> >> end build 84.142424 >> >> >> >> that took a long time elapsed: 13.230586 slow_threshold: >> >> >> >> 1.495401 >> >> >> >> 90th_quantile_runtime: 0.421558 >> >> >> >> that took a long time elapsed: 13.016531 slow_threshold: >> >> >> >> 1.488160 >> >> >> >> 90th_quantile_runtime: 0.423441 >> >> >> >> that took a long time elapsed: 13.032537 slow_threshold: >> >> >> >> 1.474563 >> >> >> >> 90th_quantile_runtime: 0.419817 >> >> >> >> >> >> >> >> pypy mem.py 20000000 >> >> >> >> start build >> >> >> >> end build 180.823105 >> >> >> >> that took a long time elapsed: 27.346064 slow_threshold: >> >> >> >> 2.295146 >> >> >> >> 90th_quantile_runtime: 0.434726 >> >> >> >> that took a long time elapsed: 26.028852 slow_threshold: >> >> >> >> 2.283927 >> >> >> >> 90th_quantile_runtime: 0.374190 >> >> >> >> that took a long time elapsed: 25.432279 slow_threshold: >> >> >> >> 2.279631 >> >> >> >> 90th_quantile_runtime: 0.371502 >> >> >> >> >> >> >> >> pypy mem.py 30000000 >> >> >> >> start build >> >> >> >> end build 276.217811 >> >> >> >> that took a long time elapsed: 40.993855 slow_threshold: >> >> >> >> 3.188464 >> >> >> >> 90th_quantile_runtime: 0.459891 >> >> >> >> that took a long time elapsed: 
41.693553 slow_threshold: >> >> >> >> 3.183003 >> >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> that took a long time elapsed: 39.679769 slow_threshold: >> >> >> >> 3.190782 >> >> >> >> 90th_quantile_runtime: 0.393677 >> >> >> >> that took a long time elapsed: 43.573411 slow_threshold: >> >> >> >> 3.239637 >> >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> >> >> >> >> Code below >> >> >> >> -------------------------------------------------------------- >> >> >> >> import time >> >> >> >> from random import randint, choice >> >> >> >> import sys >> >> >> >> >> >> >> >> >> >> >> >> allElems = {} >> >> >> >> >> >> >> >> class Node: >> >> >> >> def __init__(self, v_): >> >> >> >> self.v = v_ >> >> >> >> self.next = None >> >> >> >> self.dummy_data = [randint(0,100) >> >> >> >> for _ in xrange(randint(50,100))] >> >> >> >> allElems[self.v] = self >> >> >> >> if self.v > 0: >> >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] >> >> >> >> for _ >> >> >> >> in >> >> >> >> xrange(10)] >> >> >> >> else: >> >> >> >> self.dummy_links = [self] >> >> >> >> >> >> >> >> def set_next(self, l): >> >> >> >> self.next = l >> >> >> >> >> >> >> >> >> >> >> >> def follow(node): >> >> >> >> acc = [] >> >> >> >> count = 0 >> >> >> >> cur = node >> >> >> >> assert node.v is not None >> >> >> >> assert cur is not None >> >> >> >> while count < 50000: >> >> >> >> # return a value; generate some garbage >> >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") >> >> >> >> for >> >> >> >> x >> >> >> >> in >> >> >> >> xrange(100)])) >> >> >> >> >> >> >> >> # if we have reached the end, chose a random link >> >> >> >> cur = choice(cur.dummy_links) if cur.next is None else >> >> >> >> cur.next >> >> >> >> count += 1 >> >> >> >> >> >> >> >> return acc >> >> >> >> >> >> >> >> >> >> >> >> def build(num_elems): >> >> >> >> start = time.time() >> >> >> >> print "start build" >> >> >> >> root = Node(0) >> >> >> >> cur = root >> >> >> >> for x in xrange(1, num_elems): >> 
>> >> >> e = Node(x) >> >> >> >> cur.next = e >> >> >> >> cur = e >> >> >> >> print "end build %f" % (time.time() - start) >> >> >> >> return root >> >> >> >> >> >> >> >> >> >> >> >> num_timings = 100 >> >> >> >> if __name__ == "__main__": >> >> >> >> num_elems = int(sys.argv[1]) >> >> >> >> build(num_elems) >> >> >> >> total = 0 >> >> >> >> timings = [0.0] * num_timings # run times for the last >> >> >> >> num_timings >> >> >> >> runs >> >> >> >> i = 0 >> >> >> >> beginning = time.time() >> >> >> >> while time.time() - beginning < 600: >> >> >> >> start = time.time() >> >> >> >> elem = allElems[randint(0, num_elems - 1)] >> >> >> >> assert(elem is not None) >> >> >> >> >> >> >> >> lst = follow(elem) >> >> >> >> >> >> >> >> total += choice(lst)[0] # use the return value for >> >> >> >> something >> >> >> >> >> >> >> >> end = time.time() >> >> >> >> >> >> >> >> elapsed = end-start >> >> >> >> timings[i % num_timings] = elapsed >> >> >> >> if (i > num_timings): >> >> >> >> slow_time = 2 * sum(timings)/num_timings # slow >> >> >> >> defined >> >> >> >> as >> >> >> >> > >> >> >> >> 2*avg run time >> >> >> >> if (elapsed > slow_time): >> >> >> >> print "that took a long time elapsed: %f >> >> >> >> slow_threshold: >> >> >> >> %f 90th_quantile_runtime: %f" % \ >> >> >> >> (elapsed, slow_time, >> >> >> >> sorted(timings)[int(num_timings*.9)]) >> >> >> >> i += 1 >> >> >> >> print total >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >> >> >> >> >> >> >> wrote: >> >> >> >>> >> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >> >> >> >>> wrote: >> >> >> >>> > Hi Armin, Maciej >> >> >> >>> > >> >> >> >>> > Thanks for responding. >> >> >> >>> > >> >> >> >>> > I'm in the process of trying to determine what (if any) of the >> >> >> >>> > code >> >> >> >>> > I'm >> >> >> >>> > in a >> >> >> >>> > position to share, and I'll get back to you. 
>> >> >> >>> > >> >> >> >>> > Allowing hinting to the GC would be good. Even better would be >> >> >> >>> > a >> >> >> >>> > means >> >> >> >>> > to >> >> >> >>> > allow me to (transparently) allocate objects in unmanaged >> >> >> >>> > memory, >> >> >> >>> > but I >> >> >> >>> > would expect that to be a tall order :) >> >> >> >>> > >> >> >> >>> > Thanks, >> >> >> >>> > /Martin >> >> >> >>> >> >> >> >>> Hi Martin. >> >> >> >>> >> >> >> >>> Note that in case you want us to do the work of isolating the >> >> >> >>> problem, >> >> >> >>> we do offer paid support to do that (then we can sign NDAs and >> >> >> >>> stuff). >> >> >> >>> Otherwise we would be more than happy to fix bugs once you >> >> >> >>> isolate >> >> >> >>> a >> >> >> >>> part you can share freely :) >> >> >> >> >> >> >> >> >> >> >> > >> >> > >> >> > >> > >> > > > From mak at issuu.com Mon Mar 17 14:18:07 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 14:18:07 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: Well, then it works out to around 2.5GHz, which seems reasonable. But it doesn't alter the conclusion from the previous email: The slow queries then all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or .4 seconds at this conversion. Also, the log shows that a slow query performs many more gc-minor operations than a 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. So the question becomes: Why do we get this large spike in gc-minor-walkroots, and, in particular, is there any way to avoid it :) ? Thanks, /Martin On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski wrote: > I think it's the cycles of your CPU > > On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: > > What is the unit? Perhaps I'm being thick here, but I can't correlate it > > with seconds (which the program does print out). Slow runs are around 13 > > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. 
> from > > 0x2b994c9d31889c to 0x2b9944ab8c4f49). > > > > > > > > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski > > wrote: > >> > >> The number of lines is nonsense. This is a timestamp in hex. > >> > >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: > >> > Based On Maciej's suggestion, I tried the following > >> > > >> > PYPYLOG=- pypy mem.py 10000000 > out > >> > > >> > This generates a logfile which looks something like this > >> > > >> > start--> > >> > [2b99f1981b527e] {gc-minor > >> > [2b99f1981ba680] {gc-minor-walkroots > >> > [2b99f1981c2e02] gc-minor-walkroots} > >> > [2b99f19890d750] gc-minor} > >> > [snip] > >> > ... > >> > <--stop > >> > > >> > > >> > It turns out that the culprit is a lot of MINOR collections. > >> > > >> > I base this on the following observations: > >> > > >> > I can't understand the format of the timestamp on each logline (the > >> > "[2b99f1981b527e]"). From what I can see in the code, this should be > >> > output > >> > from time.clock(), but that doesn't return a number like that when I > run > >> > pypy interactively > >> > Instead, I count the number of debug lines between start--> and the > >> > corresponding <--stop. > >> > Most runs have a few hundred lines of output between start/stop > >> > All slow runs have very close to 57800 lines out output between > >> > start/stop > >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor > >> > operations, and 9647 gc-minor-walkroots operations. > >> > > >> > > >> > Thanks, > >> > /Martin > >> > > >> > > >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > > >> > wrote: > >> >> > >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) > >> >> which will do that for you btw. > >> >> > >> >> maybe you can find out what's that using profiling or valgrind? > >> >> > >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > >> >> > I have tried getting the pypy source and building my own version of > >> >> > pypy. 
I > >> >> > have modified > >> >> > rpython/memory/gc/incminimark.py:major_collection_step() > >> >> > to > >> >> > print out when it starts and when it stops. Apparently, the slow > >> >> > queries > >> >> > do > >> >> > NOT occur during major_collection_step; at least, I have not > observed > >> >> > major > >> >> > step output during a query execution. So, apparently, something > else > >> >> > is > >> >> > blocking. This could be another aspect of the GC, but it could also > >> >> > be > >> >> > anything else. > >> >> > > >> >> > Just to be sure, I have tried running the same application in > python > >> >> > with > >> >> > garbage collection disabled. I don't see the problem there, so it > is > >> >> > somehow > >> >> > related to either GC or the runtime somehow. > >> >> > > >> >> > Cheers, > >> >> > /Martin > >> >> > > >> >> > > >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch > wrote: > >> >> >> > >> >> >> We have hacked up a small sample that seems to exhibit the same > >> >> >> issue. > >> >> >> > >> >> >> We basically generate a linked list of objects. To increase > >> >> >> connectedness, > >> >> >> elements in the list hold references (dummy_links) to 10 randomly > >> >> >> chosen > >> >> >> previous elements in the list. > >> >> >> > >> >> >> We then time a function that traverses 50000 elements from the > list > >> >> >> from a > >> >> >> random start point. If the traversal reaches the end of the list, > we > >> >> >> instead > >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are > >> >> >> traversed > >> >> >> every time. To generate some garbage, we build a list holding the > >> >> >> traversed > >> >> >> elements and a dummy list of characters. > >> >> >> > >> >> >> Timings for the last 100 runs are stored in a circular buffer. 
If > >> >> >> the > >> >> >> elapsed time for the last run is more than twice the average time, > >> >> >> we > >> >> >> print > >> >> >> out a line with the elapsed time, the threshold, and the 90% > runtime > >> >> >> (we > >> >> >> would like to see that the mean runtime does not increase with the > >> >> >> number of > >> >> >> elements in the list, but that the max time does increase > (linearly > >> >> >> with the > >> >> >> number of object, i guess); traversing 50K elements should be > >> >> >> independent of > >> >> >> the memory size). > >> >> >> > >> >> >> We have tried monitoring memory consumption by external > inspection, > >> >> >> but > >> >> >> cannot consistently verify that memory is deallocated at the same > >> >> >> time > >> >> >> that > >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always > return > >> >> >> freed > >> >> >> pages back to the OS? > >> >> >> > >> >> >> Using top, we observe that 10M elements allocates around 17GB > after > >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB > >> >> >> shortly > >> >> >> after building). 
> >> >> >> > >> >> >> Here is output from a few runs with different number of elements: > >> >> >> > >> >> >> > >> >> >> pypy mem.py 10000000 > >> >> >> start build > >> >> >> end build 84.142424 > >> >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > >> >> >> 90th_quantile_runtime: 0.421558 > >> >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > >> >> >> 90th_quantile_runtime: 0.423441 > >> >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > >> >> >> 90th_quantile_runtime: 0.419817 > >> >> >> > >> >> >> pypy mem.py 20000000 > >> >> >> start build > >> >> >> end build 180.823105 > >> >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > >> >> >> 90th_quantile_runtime: 0.434726 > >> >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > >> >> >> 90th_quantile_runtime: 0.374190 > >> >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > >> >> >> 90th_quantile_runtime: 0.371502 > >> >> >> > >> >> >> pypy mem.py 30000000 > >> >> >> start build > >> >> >> end build 276.217811 > >> >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > >> >> >> 90th_quantile_runtime: 0.459891 > >> >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 > >> >> >> 90th_quantile_runtime: 0.393654 > >> >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > >> >> >> 90th_quantile_runtime: 0.393677 > >> >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > >> >> >> 90th_quantile_runtime: 0.393654 > >> >> >> > >> >> >> Code below > >> >> >> -------------------------------------------------------------- > >> >> >> import time > >> >> >> from random import randint, choice > >> >> >> import sys > >> >> >> > >> >> >> > >> >> >> allElems = {} > >> >> >> > >> >> >> class Node: > >> >> >> def __init__(self, v_): > >> >> >> self.v = v_ > >> >> >> self.next = None > >> >> >> self.dummy_data = 
[randint(0,100) > >> >> >> for _ in xrange(randint(50,100))] > >> >> >> allElems[self.v] = self > >> >> >> if self.v > 0: > >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] > for _ > >> >> >> in > >> >> >> xrange(10)] > >> >> >> else: > >> >> >> self.dummy_links = [self] > >> >> >> > >> >> >> def set_next(self, l): > >> >> >> self.next = l > >> >> >> > >> >> >> > >> >> >> def follow(node): > >> >> >> acc = [] > >> >> >> count = 0 > >> >> >> cur = node > >> >> >> assert node.v is not None > >> >> >> assert cur is not None > >> >> >> while count < 50000: > >> >> >> # return a value; generate some garbage > >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") > for > >> >> >> x > >> >> >> in > >> >> >> xrange(100)])) > >> >> >> > >> >> >> # if we have reached the end, chose a random link > >> >> >> cur = choice(cur.dummy_links) if cur.next is None else > >> >> >> cur.next > >> >> >> count += 1 > >> >> >> > >> >> >> return acc > >> >> >> > >> >> >> > >> >> >> def build(num_elems): > >> >> >> start = time.time() > >> >> >> print "start build" > >> >> >> root = Node(0) > >> >> >> cur = root > >> >> >> for x in xrange(1, num_elems): > >> >> >> e = Node(x) > >> >> >> cur.next = e > >> >> >> cur = e > >> >> >> print "end build %f" % (time.time() - start) > >> >> >> return root > >> >> >> > >> >> >> > >> >> >> num_timings = 100 > >> >> >> if __name__ == "__main__": > >> >> >> num_elems = int(sys.argv[1]) > >> >> >> build(num_elems) > >> >> >> total = 0 > >> >> >> timings = [0.0] * num_timings # run times for the last > >> >> >> num_timings > >> >> >> runs > >> >> >> i = 0 > >> >> >> beginning = time.time() > >> >> >> while time.time() - beginning < 600: > >> >> >> start = time.time() > >> >> >> elem = allElems[randint(0, num_elems - 1)] > >> >> >> assert(elem is not None) > >> >> >> > >> >> >> lst = follow(elem) > >> >> >> > >> >> >> total += choice(lst)[0] # use the return value for > something > >> >> >> > >> >> >> end = time.time() > >> >> >> > >> 
>> >> elapsed = end-start > >> >> >> timings[i % num_timings] = elapsed > >> >> >> if (i > num_timings): > >> >> >> slow_time = 2 * sum(timings)/num_timings # slow > defined > >> >> >> as > >> >> >> > > >> >> >> 2*avg run time > >> >> >> if (elapsed > slow_time): > >> >> >> print "that took a long time elapsed: %f > >> >> >> slow_threshold: > >> >> >> %f 90th_quantile_runtime: %f" % \ > >> >> >> (elapsed, slow_time, > >> >> >> sorted(timings)[int(num_timings*.9)]) > >> >> >> i += 1 > >> >> >> print total > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski > >> >> >> > >> >> >> wrote: > >> >> >>> > >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch > wrote: > >> >> >>> > Hi Armin, Maciej > >> >> >>> > > >> >> >>> > Thanks for responding. > >> >> >>> > > >> >> >>> > I'm in the process of trying to determine what (if any) of the > >> >> >>> > code > >> >> >>> > I'm > >> >> >>> > in a > >> >> >>> > position to share, and I'll get back to you. > >> >> >>> > > >> >> >>> > Allowing hinting to the GC would be good. Even better would be > a > >> >> >>> > means > >> >> >>> > to > >> >> >>> > allow me to (transparently) allocate objects in unmanaged > memory, > >> >> >>> > but I > >> >> >>> > would expect that to be a tall order :) > >> >> >>> > > >> >> >>> > Thanks, > >> >> >>> > /Martin > >> >> >>> > >> >> >>> Hi Martin. > >> >> >>> > >> >> >>> Note that in case you want us to do the work of isolating the > >> >> >>> problem, > >> >> >>> we do offer paid support to do that (then we can sign NDAs and > >> >> >>> stuff). > >> >> >>> Otherwise we would be more than happy to fix bugs once you > isolate > >> >> >>> a > >> >> >>> part you can share freely :) > >> >> >> > >> >> >> > >> >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
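The bracketed values in the PYPYLOG output discussed above are raw CPU timestamps in hex, so they can be turned into wall-clock durations once a tick frequency is assumed. A minimal sketch (the 2.5 GHz figure is the estimate from the thread, not something PyPy reports; adjust for your CPU):

```python
import re

# PYPYLOG lines look like "[2b99f1981b527e] {gc-minor" (section open)
# and "[2b99f19890d750] gc-minor}" (section close).  The bracketed
# value is a raw CPU timestamp in hex.
LINE_RE = re.compile(r"\[([0-9a-f]+)\] (\{)?([\w-]+)")

def section_durations(log_lines, ticks_per_second=2.5e9):
    """Return {section: total_seconds} aggregated over a log fragment."""
    open_stamps = {}   # section name -> timestamp at which it opened
    totals = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        stamp = int(m.group(1), 16)
        is_open, name = m.group(2), m.group(3)
        if is_open:
            open_stamps[name] = stamp
        elif name in open_stamps:
            ticks = stamp - open_stamps.pop(name)
            totals[name] = totals.get(name, 0.0) + ticks / ticks_per_second
    return totals
```

Nested sections such as gc-minor-walkroots inside gc-minor work because the open timestamps are keyed by section name.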
URL: From fijall at gmail.com Mon Mar 17 14:23:57 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 15:23:57 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski wrote: > are you *sure* it's the walkroots that take that long and not > something else (like gc-minor)? More of those mean that you allocate a > lot more surviving objects. Can you do two things: > > a) take a max of gc-minor (and gc-minor-stackwalk), per request > b) take the sum of those > > and plot them ^^^ or just paste the results actually > > On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >> Well, then it works out to around 2.5GHz, which seems reasonable. But it >> doesn't alter the conclusion from the previous email: The slow queries then >> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or >> .4 seconds at this conversion. Also, the log shows that a slow query >> performs many more gc-minor operations than a 'normal' one: 9600 >> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >> >> So the question becomes: Why do we get this large spike in >> gc-minor-walkroots, and, in particular, is there any way to avoid it :) ? >> >> Thanks, >> /Martin >> >> >> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >> wrote: >>> >>> I think it's the cycles of your CPU >>> >>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: >>> > What is the unit? Perhaps I'm being thick here, but I can't correlate it >>> > with seconds (which the program does print out). Slow runs are around 13 >>> > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. >>> > from >>> > 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>> > >>> > >>> > >>> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>> > wrote: >>> >> >>> >> The number of lines is nonsense. This is a timestamp in hex. 
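A small script along these lines can produce the per-request max and overall sum of gc-minor (and gc-minor-walkroots) events that Maciej asks for in (a) and (b). It is only a sketch, and it assumes the log format shown earlier in the thread, with requests delimited by start--> and <--stop markers:

```python
def collate_requests(log_lines):
    """Count how many times each debug section (gc-minor,
    gc-minor-walkroots, ...) opens inside every start-->/<--stop pair,
    then report the per-request max and the overall sum."""
    requests = []      # one {section: count} dict per request
    current = None
    for line in log_lines:
        if "start-->" in line:
            current = {}
        elif "<--stop" in line:
            if current is not None:
                requests.append(current)
            current = None
        elif current is not None and "{" in line:
            # "[2b99f1981b527e] {gc-minor" -> "gc-minor"
            name = line.split("{", 1)[1].strip()
            current[name] = current.get(name, 0) + 1
    sections = set()
    for r in requests:
        sections.update(r)
    return {s: {"max": max(r.get(s, 0) for r in requests),
                "sum": sum(r.get(s, 0) for r in requests)}
            for s in sections}
```

Feeding it the full PYPYLOG output and printing the result would give exactly the two numbers requested, one line per section.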
>>> >> >>> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: >>> >> > Based On Maciej's suggestion, I tried the following >>> >> > >>> >> > PYPYLOG=- pypy mem.py 10000000 > out >>> >> > >>> >> > This generates a logfile which looks something like this >>> >> > >>> >> > start--> >>> >> > [2b99f1981b527e] {gc-minor >>> >> > [2b99f1981ba680] {gc-minor-walkroots >>> >> > [2b99f1981c2e02] gc-minor-walkroots} >>> >> > [2b99f19890d750] gc-minor} >>> >> > [snip] >>> >> > ... >>> >> > <--stop >>> >> > >>> >> > >>> >> > It turns out that the culprit is a lot of MINOR collections. >>> >> > >>> >> > I base this on the following observations: >>> >> > >>> >> > I can't understand the format of the timestamp on each logline (the >>> >> > "[2b99f1981b527e]"). From what I can see in the code, this should be >>> >> > output >>> >> > from time.clock(), but that doesn't return a number like that when I >>> >> > run >>> >> > pypy interactively >>> >> > Instead, I count the number of debug lines between start--> and the >>> >> > corresponding <--stop. >>> >> > Most runs have a few hundred lines of output between start/stop >>> >> > All slow runs have very close to 57800 lines out output between >>> >> > start/stop >>> >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor >>> >> > operations, and 9647 gc-minor-walkroots operations. >>> >> > >>> >> > >>> >> > Thanks, >>> >> > /Martin >>> >> > >>> >> > >>> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>> >> > >>> >> > wrote: >>> >> >> >>> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >>> >> >> which will do that for you btw. >>> >> >> >>> >> >> maybe you can find out what's that using profiling or valgrind? >>> >> >> >>> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >>> >> >> > I have tried getting the pypy source and building my own version >>> >> >> > of >>> >> >> > pypy. 
I >>> >> >> > have modified >>> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() >>> >> >> > to >>> >> >> > print out when it starts and when it stops. Apparently, the slow >>> >> >> > queries >>> >> >> > do >>> >> >> > NOT occur during major_collection_step; at least, I have not >>> >> >> > observed >>> >> >> > major >>> >> >> > step output during a query execution. So, apparently, something >>> >> >> > else >>> >> >> > is >>> >> >> > blocking. This could be another aspect of the GC, but it could >>> >> >> > also >>> >> >> > be >>> >> >> > anything else. >>> >> >> > >>> >> >> > Just to be sure, I have tried running the same application in >>> >> >> > python >>> >> >> > with >>> >> >> > garbage collection disabled. I don't see the problem there, so it >>> >> >> > is >>> >> >> > somehow >>> >> >> > related to either GC or the runtime somehow. >>> >> >> > >>> >> >> > Cheers, >>> >> >> > /Martin >>> >> >> > >>> >> >> > >>> >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>> >> >> > wrote: >>> >> >> >> >>> >> >> >> We have hacked up a small sample that seems to exhibit the same >>> >> >> >> issue. >>> >> >> >> >>> >> >> >> We basically generate a linked list of objects. To increase >>> >> >> >> connectedness, >>> >> >> >> elements in the list hold references (dummy_links) to 10 randomly >>> >> >> >> chosen >>> >> >> >> previous elements in the list. >>> >> >> >> >>> >> >> >> We then time a function that traverses 50000 elements from the >>> >> >> >> list >>> >> >> >> from a >>> >> >> >> random start point. If the traversal reaches the end of the list, >>> >> >> >> we >>> >> >> >> instead >>> >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are >>> >> >> >> traversed >>> >> >> >> every time. To generate some garbage, we build a list holding the >>> >> >> >> traversed >>> >> >> >> elements and a dummy list of characters. >>> >> >> >> >>> >> >> >> Timings for the last 100 runs are stored in a circular buffer. 
If >>> >> >> >> the >>> >> >> >> elapsed time for the last run is more than twice the average >>> >> >> >> time, >>> >> >> >> we >>> >> >> >> print >>> >> >> >> out a line with the elapsed time, the threshold, and the 90% >>> >> >> >> runtime >>> >> >> >> (we >>> >> >> >> would like to see that the mean runtime does not increase with >>> >> >> >> the >>> >> >> >> number of >>> >> >> >> elements in the list, but that the max time does increase >>> >> >> >> (linearly >>> >> >> >> with the >>> >> >> >> number of object, i guess); traversing 50K elements should be >>> >> >> >> independent of >>> >> >> >> the memory size). >>> >> >> >> >>> >> >> >> We have tried monitoring memory consumption by external >>> >> >> >> inspection, >>> >> >> >> but >>> >> >> >> cannot consistently verify that memory is deallocated at the same >>> >> >> >> time >>> >> >> >> that >>> >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always >>> >> >> >> return >>> >> >> >> freed >>> >> >> >> pages back to the OS? >>> >> >> >> >>> >> >> >> Using top, we observe that 10M elements allocates around 17GB >>> >> >> >> after >>> >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >>> >> >> >> shortly >>> >> >> >> after building). 
>>> >> >> >> >>> >> >> >> Here is output from a few runs with different number of elements: >>> >> >> >> >>> >> >> >> >>> >> >> >> pypy mem.py 10000000 >>> >> >> >> start build >>> >> >> >> end build 84.142424 >>> >> >> >> that took a long time elapsed: 13.230586 slow_threshold: >>> >> >> >> 1.495401 >>> >> >> >> 90th_quantile_runtime: 0.421558 >>> >> >> >> that took a long time elapsed: 13.016531 slow_threshold: >>> >> >> >> 1.488160 >>> >> >> >> 90th_quantile_runtime: 0.423441 >>> >> >> >> that took a long time elapsed: 13.032537 slow_threshold: >>> >> >> >> 1.474563 >>> >> >> >> 90th_quantile_runtime: 0.419817 >>> >> >> >> >>> >> >> >> pypy mem.py 20000000 >>> >> >> >> start build >>> >> >> >> end build 180.823105 >>> >> >> >> that took a long time elapsed: 27.346064 slow_threshold: >>> >> >> >> 2.295146 >>> >> >> >> 90th_quantile_runtime: 0.434726 >>> >> >> >> that took a long time elapsed: 26.028852 slow_threshold: >>> >> >> >> 2.283927 >>> >> >> >> 90th_quantile_runtime: 0.374190 >>> >> >> >> that took a long time elapsed: 25.432279 slow_threshold: >>> >> >> >> 2.279631 >>> >> >> >> 90th_quantile_runtime: 0.371502 >>> >> >> >> >>> >> >> >> pypy mem.py 30000000 >>> >> >> >> start build >>> >> >> >> end build 276.217811 >>> >> >> >> that took a long time elapsed: 40.993855 slow_threshold: >>> >> >> >> 3.188464 >>> >> >> >> 90th_quantile_runtime: 0.459891 >>> >> >> >> that took a long time elapsed: 41.693553 slow_threshold: >>> >> >> >> 3.183003 >>> >> >> >> 90th_quantile_runtime: 0.393654 >>> >> >> >> that took a long time elapsed: 39.679769 slow_threshold: >>> >> >> >> 3.190782 >>> >> >> >> 90th_quantile_runtime: 0.393677 >>> >> >> >> that took a long time elapsed: 43.573411 slow_threshold: >>> >> >> >> 3.239637 >>> >> >> >> 90th_quantile_runtime: 0.393654 >>> >> >> >> >>> >> >> >> Code below >>> >> >> >> -------------------------------------------------------------- >>> >> >> >> import time >>> >> >> >> from random import randint, choice >>> >> >> >> import 
sys >>> >> >> >> >>> >> >> >> >>> >> >> >> allElems = {} >>> >> >> >> >>> >> >> >> class Node: >>> >> >> >> def __init__(self, v_): >>> >> >> >> self.v = v_ >>> >> >> >> self.next = None >>> >> >> >> self.dummy_data = [randint(0,100) >>> >> >> >> for _ in xrange(randint(50,100))] >>> >> >> >> allElems[self.v] = self >>> >> >> >> if self.v > 0: >>> >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] >>> >> >> >> for _ >>> >> >> >> in >>> >> >> >> xrange(10)] >>> >> >> >> else: >>> >> >> >> self.dummy_links = [self] >>> >> >> >> >>> >> >> >> def set_next(self, l): >>> >> >> >> self.next = l >>> >> >> >> >>> >> >> >> >>> >> >> >> def follow(node): >>> >> >> >> acc = [] >>> >> >> >> count = 0 >>> >> >> >> cur = node >>> >> >> >> assert node.v is not None >>> >> >> >> assert cur is not None >>> >> >> >> while count < 50000: >>> >> >> >> # return a value; generate some garbage >>> >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") >>> >> >> >> for >>> >> >> >> x >>> >> >> >> in >>> >> >> >> xrange(100)])) >>> >> >> >> >>> >> >> >> # if we have reached the end, chose a random link >>> >> >> >> cur = choice(cur.dummy_links) if cur.next is None else >>> >> >> >> cur.next >>> >> >> >> count += 1 >>> >> >> >> >>> >> >> >> return acc >>> >> >> >> >>> >> >> >> >>> >> >> >> def build(num_elems): >>> >> >> >> start = time.time() >>> >> >> >> print "start build" >>> >> >> >> root = Node(0) >>> >> >> >> cur = root >>> >> >> >> for x in xrange(1, num_elems): >>> >> >> >> e = Node(x) >>> >> >> >> cur.next = e >>> >> >> >> cur = e >>> >> >> >> print "end build %f" % (time.time() - start) >>> >> >> >> return root >>> >> >> >> >>> >> >> >> >>> >> >> >> num_timings = 100 >>> >> >> >> if __name__ == "__main__": >>> >> >> >> num_elems = int(sys.argv[1]) >>> >> >> >> build(num_elems) >>> >> >> >> total = 0 >>> >> >> >> timings = [0.0] * num_timings # run times for the last >>> >> >> >> num_timings >>> >> >> >> runs >>> >> >> >> i = 0 >>> >> >> >> beginning = 
time.time() >>> >> >> >> while time.time() - beginning < 600: >>> >> >> >> start = time.time() >>> >> >> >> elem = allElems[randint(0, num_elems - 1)] >>> >> >> >> assert(elem is not None) >>> >> >> >> >>> >> >> >> lst = follow(elem) >>> >> >> >> >>> >> >> >> total += choice(lst)[0] # use the return value for >>> >> >> >> something >>> >> >> >> >>> >> >> >> end = time.time() >>> >> >> >> >>> >> >> >> elapsed = end-start >>> >> >> >> timings[i % num_timings] = elapsed >>> >> >> >> if (i > num_timings): >>> >> >> >> slow_time = 2 * sum(timings)/num_timings # slow >>> >> >> >> defined >>> >> >> >> as >>> >> >> >> > >>> >> >> >> 2*avg run time >>> >> >> >> if (elapsed > slow_time): >>> >> >> >> print "that took a long time elapsed: %f >>> >> >> >> slow_threshold: >>> >> >> >> %f 90th_quantile_runtime: %f" % \ >>> >> >> >> (elapsed, slow_time, >>> >> >> >> sorted(timings)[int(num_timings*.9)]) >>> >> >> >> i += 1 >>> >> >> >> print total >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>> >> >> >> >>> >> >> >> wrote: >>> >> >> >>> >>> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>> >> >> >>> wrote: >>> >> >> >>> > Hi Armin, Maciej >>> >> >> >>> > >>> >> >> >>> > Thanks for responding. >>> >> >> >>> > >>> >> >> >>> > I'm in the process of trying to determine what (if any) of the >>> >> >> >>> > code >>> >> >> >>> > I'm >>> >> >> >>> > in a >>> >> >> >>> > position to share, and I'll get back to you. >>> >> >> >>> > >>> >> >> >>> > Allowing hinting to the GC would be good. Even better would be >>> >> >> >>> > a >>> >> >> >>> > means >>> >> >> >>> > to >>> >> >> >>> > allow me to (transparently) allocate objects in unmanaged >>> >> >> >>> > memory, >>> >> >> >>> > but I >>> >> >> >>> > would expect that to be a tall order :) >>> >> >> >>> > >>> >> >> >>> > Thanks, >>> >> >> >>> > /Martin >>> >> >> >>> >>> >> >> >>> Hi Martin. 
>>> >> >> >>> >>> >> >> >>> Note that in case you want us to do the work of isolating the >>> >> >> >>> problem, >>> >> >> >>> we do offer paid support to do that (then we can sign NDAs and >>> >> >> >>> stuff). >>> >> >> >>> Otherwise we would be more than happy to fix bugs once you >>> >> >> >>> isolate >>> >> >> >>> a >>> >> >> >>> part you can share freely :) >>> >> >> >> >>> >> >> >> >>> >> >> > >>> >> > >>> >> > >>> > >>> > >> >> From mak at issuu.com Mon Mar 17 11:46:17 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 11:46:17 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: Based On Maciej's suggestion, I tried the following PYPYLOG=- pypy mem.py 10000000 > out This generates a logfile which looks something like this start--> [2b99f1981b527e] {gc-minor [2b99f1981ba680] {gc-minor-walkroots [2b99f1981c2e02] gc-minor-walkroots} [2b99f19890d750] gc-minor} [snip] ... <--stop It turns out that the culprit is a lot of MINOR collections. I base this on the following observations: - I can't understand the format of the timestamp on each logline (the " [2b99f1981b527e]"). From what I can see in the code, this should be output from time.clock(), but that doesn't return a number like that when I run pypy interactively - Instead, I count the number of debug lines between start--> and the corresponding <--stop. - Most runs have a few hundred lines of output between start/stop - All slow runs have very close to 57800 lines out output between start/stop - One such sample does 9609 gc-collect-step operations, 9647 gc-minor operations, and 9647 gc-minor-walkroots operations. Thanks, /Martin On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski wrote: > there is an environment variable PYPYLOG=gc:- (where - is stdout) > which will do that for you btw. > > maybe you can find out what's that using profiling or valgrind? 
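Since each step of follow() in mem.py allocates a fresh 100-character throwaway list, one quick experiment is to run a garbage-free variant and see whether the gc-minor spikes shrink. This is only a sketch suggested by the discussion, not code from the thread; it assumes Node objects with v, next and dummy_links attributes as in mem.py:

```python
from random import choice

def follow_low_garbage(node, steps=50000):
    """Like follow() from the benchmark, but keeps only the visited
    values and skips the per-step throwaway character list, to test
    whether that garbage is what drives the extra minor collections."""
    acc = []
    cur = node
    for _ in range(steps):
        acc.append(cur.v)   # keep just the value; no dummy garbage
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
    return acc
```

If the spikes persist with this variant, the surviving linked-list objects themselves, rather than the per-iteration garbage, are the more likely cause.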
> > On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > > I have tried getting the pypy source and building my own version of > pypy. I > > have modified rpython/memory/gc/incminimark.py:major_collection_step() to > > print out when it starts and when it stops. Apparently, the slow queries > do > > NOT occur during major_collection_step; at least, I have not observed > major > > step output during a query execution. So, apparently, something else is > > blocking. This could be another aspect of the GC, but it could also be > > anything else. > > > > Just to be sure, I have tried running the same application in python with > > garbage collection disabled. I don't see the problem there, so it is > somehow > > related to either GC or the runtime somehow. > > > > Cheers, > > /Martin > > > > > > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: > >> > >> We have hacked up a small sample that seems to exhibit the same issue. > >> > >> We basically generate a linked list of objects. To increase > connectedness, > >> elements in the list hold references (dummy_links) to 10 randomly chosen > >> previous elements in the list. > >> > >> We then time a function that traverses 50000 elements from the list > from a > >> random start point. If the traversal reaches the end of the list, we > instead > >> traverse one of the dummy links. Thus, exactly 50K elements are > traversed > >> every time. To generate some garbage, we build a list holding the > traversed > >> elements and a dummy list of characters. > >> > >> Timings for the last 100 runs are stored in a circular buffer. 
If the > >> elapsed time for the last run is more than twice the average time, we > print > >> out a line with the elapsed time, the threshold, and the 90% runtime (we > >> would like to see that the mean runtime does not increase with the > number of > >> elements in the list, but that the max time does increase (linearly > with the > >> number of object, i guess); traversing 50K elements should be > independent of > >> the memory size). > >> > >> We have tried monitoring memory consumption by external inspection, but > >> cannot consistently verify that memory is deallocated at the same time > that > >> we see slow requests. Perhaps the pypy runtime doesn't always return > freed > >> pages back to the OS? > >> > >> Using top, we observe that 10M elements allocates around 17GB after > >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB > shortly > >> after building). > >> > >> Here is output from a few runs with different number of elements: > >> > >> > >> pypy mem.py 10000000 > >> start build > >> end build 84.142424 > >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > >> 90th_quantile_runtime: 0.421558 > >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > >> 90th_quantile_runtime: 0.423441 > >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > >> 90th_quantile_runtime: 0.419817 > >> > >> pypy mem.py 20000000 > >> start build > >> end build 180.823105 > >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > >> 90th_quantile_runtime: 0.434726 > >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > >> 90th_quantile_runtime: 0.374190 > >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > >> 90th_quantile_runtime: 0.371502 > >> > >> pypy mem.py 30000000 > >> start build > >> end build 276.217811 > >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > >> 90th_quantile_runtime: 0.459891 > >> that took a long time elapsed: 
41.693553 slow_threshold: 3.183003 > >> 90th_quantile_runtime: 0.393654 > >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > >> 90th_quantile_runtime: 0.393677 > >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > >> 90th_quantile_runtime: 0.393654 > >> > >> Code below > >> -------------------------------------------------------------- > >> import time > >> from random import randint, choice > >> import sys > >> > >> > >> allElems = {} > >> > >> class Node: > >> def __init__(self, v_): > >> self.v = v_ > >> self.next = None > >> self.dummy_data = [randint(0,100) > >> for _ in xrange(randint(50,100))] > >> allElems[self.v] = self > >> if self.v > 0: > >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ in > >> xrange(10)] > >> else: > >> self.dummy_links = [self] > >> > >> def set_next(self, l): > >> self.next = l > >> > >> > >> def follow(node): > >> acc = [] > >> count = 0 > >> cur = node > >> assert node.v is not None > >> assert cur is not None > >> while count < 50000: > >> # return a value; generate some garbage > >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x > in > >> xrange(100)])) > >> > >> # if we have reached the end, chose a random link > >> cur = choice(cur.dummy_links) if cur.next is None else cur.next > >> count += 1 > >> > >> return acc > >> > >> > >> def build(num_elems): > >> start = time.time() > >> print "start build" > >> root = Node(0) > >> cur = root > >> for x in xrange(1, num_elems): > >> e = Node(x) > >> cur.next = e > >> cur = e > >> print "end build %f" % (time.time() - start) > >> return root > >> > >> > >> num_timings = 100 > >> if __name__ == "__main__": > >> num_elems = int(sys.argv[1]) > >> build(num_elems) > >> total = 0 > >> timings = [0.0] * num_timings # run times for the last num_timings > >> runs > >> i = 0 > >> beginning = time.time() > >> while time.time() - beginning < 600: > >> start = time.time() > >> elem = allElems[randint(0, num_elems - 1)] > >> 
assert(elem is not None) > >> > >> lst = follow(elem) > >> > >> total += choice(lst)[0] # use the return value for something > >> > >> end = time.time() > >> > >> elapsed = end-start > >> timings[i % num_timings] = elapsed > >> if (i > num_timings): > >> slow_time = 2 * sum(timings)/num_timings # slow defined as > > >> 2*avg run time > >> if (elapsed > slow_time): > >> print "that took a long time elapsed: %f > slow_threshold: > >> %f 90th_quantile_runtime: %f" % \ > >> (elapsed, slow_time, > >> sorted(timings)[int(num_timings*.9)]) > >> i += 1 > >> print total > >> > >> > >> > >> > >> > >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski > >> wrote: > >>> > >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: > >>> > Hi Armin, Maciej > >>> > > >>> > Thanks for responding. > >>> > > >>> > I'm in the process of trying to determine what (if any) of the code > I'm > >>> > in a > >>> > position to share, and I'll get back to you. > >>> > > >>> > Allowing hinting to the GC would be good. Even better would be a > means > >>> > to > >>> > allow me to (transparently) allocate objects in unmanaged memory, > but I > >>> > would expect that to be a tall order :) > >>> > > >>> > Thanks, > >>> > /Martin > >>> > >>> Hi Martin. > >>> > >>> Note that in case you want us to do the work of isolating the problem, > >>> we do offer paid support to do that (then we can sign NDAs and stuff). > >>> Otherwise we would be more than happy to fix bugs once you isolate a > >>> part you can share freely :) > >> > >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mak at issuu.com Mon Mar 17 15:19:28 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 15:19:28 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines i get. I.e. 
there were 668 runs
that printed 58 loglines that contain "{gc-minor" which was eventually
followed by "gc-minor}". I have also counted if the query was slow;
interestingly, not all the queries with many gc-minors were slow (but all
slow queries had a gc-minor).

Please let me know if this is unclear :)

 668 gc-minor:58 gc-minor-walkroots:58
  10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
 140 gc-minor:59 gc-minor-walkroots:59
   1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
   1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
   9 gc-minor:9643 *slow*:1 gc-minor-walkroots:9643 gc-collect-step:9589
   1 gc-minor:9644 *slow*:1 gc-minor-walkroots:9644 gc-collect-step:9590
  10 gc-minor:9647 *slow*:1 gc-minor-walkroots:9647 gc-collect-step:9609
   1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
   1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
   1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
   1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
   1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
   2 jit-log-compiling-loop:1 jit-backend-dump:78
jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104 Thanks, /Martin On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski wrote: > On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski > wrote: > > are you *sure* it's the walkroots that take that long and not > > something else (like gc-minor)? More of those mean that you allocate a > > lot more surviving objects. Can you do two things: > > > > a) take a max of gc-minor (and gc-minor-stackwalk), per request > > b) take the sum of those > > > > and plot them > > ^^^ or just paste the results actually > > > > > On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: > >> Well, then it works out to around 2.5GHz, which seems reasonable. But it > >> doesn't alter the conclusion from the previous email: The slow queries > then > >> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 > units, or > >> .4 seconds at this conversion. Also, the log shows that a slow query > >> performs many more gc-minor operations than a 'normal' one: 9600 > >> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. > >> > >> So the question becomes: Why do we get this large spike in > >> gc-minor-walkroots, and, in particular, is there any way to avoid it :) > ? 
> >> > >> Thanks, > >> /Martin > >> > >> > >> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski > >> wrote: > >>> > >>> I think it's the cycles of your CPU > >>> > >>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: > >>> > What is the unit? Perhaps I'm being thick here, but I can't > correlate it > >>> > with seconds (which the program does print out). Slow runs are > around 13 > >>> > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units > (e.g. > >>> > from > >>> > 0x2b994c9d31889c to 0x2b9944ab8c4f49). > >>> > > >>> > > >>> > > >>> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski < > fijall at gmail.com> > >>> > wrote: > >>> >> > >>> >> The number of lines is nonsense. This is a timestamp in hex. > >>> >> > >>> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch > wrote: > >>> >> > Based On Maciej's suggestion, I tried the following > >>> >> > > >>> >> > PYPYLOG=- pypy mem.py 10000000 > out > >>> >> > > >>> >> > This generates a logfile which looks something like this > >>> >> > > >>> >> > start--> > >>> >> > [2b99f1981b527e] {gc-minor > >>> >> > [2b99f1981ba680] {gc-minor-walkroots > >>> >> > [2b99f1981c2e02] gc-minor-walkroots} > >>> >> > [2b99f19890d750] gc-minor} > >>> >> > [snip] > >>> >> > ... > >>> >> > <--stop > >>> >> > > >>> >> > > >>> >> > It turns out that the culprit is a lot of MINOR collections. > >>> >> > > >>> >> > I base this on the following observations: > >>> >> > > >>> >> > I can't understand the format of the timestamp on each logline > (the > >>> >> > "[2b99f1981b527e]"). From what I can see in the code, this should > be > >>> >> > output > >>> >> > from time.clock(), but that doesn't return a number like that > when I > >>> >> > run > >>> >> > pypy interactively > >>> >> > Instead, I count the number of debug lines between start--> and > the > >>> >> > corresponding <--stop. 
> >>> >> > Most runs have a few hundred lines of output between start/stop > >>> >> > All slow runs have very close to 57800 lines out output between > >>> >> > start/stop > >>> >> > One such sample does 9609 gc-collect-step operations, 9647 > gc-minor > >>> >> > operations, and 9647 gc-minor-walkroots operations. > >>> >> > > >>> >> > > >>> >> > Thanks, > >>> >> > /Martin > >>> >> > > >>> >> > > >>> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > >>> >> > > >>> >> > wrote: > >>> >> >> > >>> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) > >>> >> >> which will do that for you btw. > >>> >> >> > >>> >> >> maybe you can find out what's that using profiling or valgrind? > >>> >> >> > >>> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch > wrote: > >>> >> >> > I have tried getting the pypy source and building my own > version > >>> >> >> > of > >>> >> >> > pypy. I > >>> >> >> > have modified > >>> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() > >>> >> >> > to > >>> >> >> > print out when it starts and when it stops. Apparently, the > slow > >>> >> >> > queries > >>> >> >> > do > >>> >> >> > NOT occur during major_collection_step; at least, I have not > >>> >> >> > observed > >>> >> >> > major > >>> >> >> > step output during a query execution. So, apparently, something > >>> >> >> > else > >>> >> >> > is > >>> >> >> > blocking. This could be another aspect of the GC, but it could > >>> >> >> > also > >>> >> >> > be > >>> >> >> > anything else. > >>> >> >> > > >>> >> >> > Just to be sure, I have tried running the same application in > >>> >> >> > python > >>> >> >> > with > >>> >> >> > garbage collection disabled. I don't see the problem there, so > it > >>> >> >> > is > >>> >> >> > somehow > >>> >> >> > related to either GC or the runtime somehow. 
> >>> >> >> >
> >>> >> >> > Cheers,
> >>> >> >> > /Martin
> >>> >> >> >
> >>> >> >> > [remainder of quoted thread snipped -- identical to the
> >>> >> >> > benchmark description, code, timings, and Mar 13 exchange
> >>> >> >> > quoted in full above]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From fijall at gmail.com  Mon Mar 17 15:21:38 2014
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Mon, 17 Mar 2014 16:21:38 +0200
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To: 
References: 
Message-ID: 

eh, this is not what I need. I need a max of the TIME it took for a
gc-minor, and the TOTAL time it took for gc-minor, per query (ideally the
same for gc-walkroots and gc-collect-step).

On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote:
> Here are the collated results of running each query. For each run, I count
> how many of each of the pypy debug lines I get. I.e. there were 668 runs
> that printed 58 loglines that contain "{gc-minor" which was eventually
> followed by "gc-minor}". I have also counted if the query was slow;
> interestingly, not all the queries with many gc-minors were slow (but all
> slow queries had a gc-minor).
>
> Please let me know if this is unclear :)
>
>  668 gc-minor:58 gc-minor-walkroots:58
>   10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
>  140 gc-minor:59 gc-minor-walkroots:59
>    1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
>    1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
>    9 gc-minor:9643 *slow*:1 gc-minor-walkroots:9643 gc-collect-step:9589
>    1 gc-minor:9644 *slow*:1 gc-minor-walkroots:9644 gc-collect-step:9590
>   10 gc-minor:9647 *slow*:1 gc-minor-walkroots:9647 gc-collect-step:9609
>    1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
>    1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
>    1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78
> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030
> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6
> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1
> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2
> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2
> jit-resume:84
>    1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1
>
jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60
> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1
> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1
> jit-resume:14
>
> [rest of quoted message and earlier thread snipped -- quoted in full
> above]
Koch >> >>> >> >> >>> wrote: >> >>> >> >> >>> > Hi Armin, Maciej >> >>> >> >> >>> > >> >>> >> >> >>> > Thanks for responding. >> >>> >> >> >>> > >> >>> >> >> >>> > I'm in the process of trying to determine what (if any) of >> >>> >> >> >>> > the >> >>> >> >> >>> > code >> >>> >> >> >>> > I'm >> >>> >> >> >>> > in a >> >>> >> >> >>> > position to share, and I'll get back to you. >> >>> >> >> >>> > >> >>> >> >> >>> > Allowing hinting to the GC would be good. Even better >> >>> >> >> >>> > would be >> >>> >> >> >>> > a >> >>> >> >> >>> > means >> >>> >> >> >>> > to >> >>> >> >> >>> > allow me to (transparently) allocate objects in unmanaged >> >>> >> >> >>> > memory, >> >>> >> >> >>> > but I >> >>> >> >> >>> > would expect that to be a tall order :) >> >>> >> >> >>> > >> >>> >> >> >>> > Thanks, >> >>> >> >> >>> > /Martin >> >>> >> >> >>> >> >>> >> >> >>> Hi Martin. >> >>> >> >> >>> >> >>> >> >> >>> Note that in case you want us to do the work of isolating >> >>> >> >> >>> the >> >>> >> >> >>> problem, >> >>> >> >> >>> we do offer paid support to do that (then we can sign NDAs >> >>> >> >> >>> and >> >>> >> >> >>> stuff). >> >>> >> >> >>> Otherwise we would be more than happy to fix bugs once you >> >>> >> >> >>> isolate >> >>> >> >> >>> a >> >>> >> >> >>> part you can share freely :) >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> > >> >>> >> > >> >>> >> > >> >>> > >> >>> > >> >> >> >> > > From mak at issuu.com Mon Mar 17 16:35:33 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 16:35:33 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Message-ID: Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow. 
*Totals*: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 *Max*: gc-minor:10 gc-collect-step:247
*Totals*: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 *Max*: gc-minor:10 gc-collect-step:245
*Totals*: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 *Max*: gc-minor:11 gc-collect-step:244
*Totals*: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 *Max*: gc-minor:17 gc-collect-step:244
*Totals*: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 *Max*: gc-minor:11 gc-collect-step:248
*Totals*: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 *Max*: gc-minor:8 gc-collect-step:299
*Totals*: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 *Max*: gc-minor:11 gc-collect-step:246
*Totals*: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 *Max*: gc-minor:36 gc-collect-step:248
*Totals*: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245
*Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 *Max*: gc-minor:38 gc-collect-step:244
*Totals*: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 *Max*: gc-minor:23 gc-collect-step:245
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 *Max*: gc-minor:8 gc-collect-step:246
*Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 *Max*: gc-minor:9 gc-collect-step:244
*Totals*: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 *Max*: gc-minor:8 gc-collect-step:246
*Totals*: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 *Max*: gc-minor:8 gc-collect-step:248
*Totals*: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 *Max*: gc-minor:8 gc-collect-step:250
*Totals*: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 *Max*: gc-minor:8 gc-collect-step:245
*Totals*: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 *Max*: gc-minor:543 gc-collect-step:244
*Totals*: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 *Max*: gc-minor:20 gc-collect-step:246
*Totals*: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 *Max*: gc-minor:25 gc-collect-step:245

Thanks,
/Martin

On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote:
> Ah. I had misunderstood. I'll get back to you on that :) thanks
>
> /Martin
>
> > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote:
> >
> > eh, this is not what I need
> >
> > I need a max of TIME it took for a gc-minor and the TOTAL time it took
> > for a gc-minor (per query) (ideally same for gc-walkroots and
> > gc-collect-step)
> >
> >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote:
> >> Here are the collated results of running each query. For each run, I count
> >> how many of each of the pypy debug lines I get. I.e. there were 668 runs
> >> that printed 58 loglines that contain "{gc-minor" which was eventually
> >> followed by "gc-minor}". I have also counted whether the query was slow;
> >> interestingly, not all the queries with many gc-minors were slow (but all
> >> slow queries had a gc-minor).
> >>
> >> Please let me know if this is unclear :)
> >>
> >>  668 gc-minor:58 gc-minor-walkroots:58
> >>   10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
> >>  140 gc-minor:59 gc-minor-walkroots:59
> >>    1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
> >>    1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
> >>    9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589
> >>    1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590
> >>   10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609
> >>    1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
> >>    1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
> >>    1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
> >>    1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
> >>    1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
> >>    2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
> >>    1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
> >>
> >>
> >> Thanks,
> >> /Martin
> >>
> >>
> >>
> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski wrote:
> >>>
> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski wrote:
> >>>> are you *sure* it's the walkroots that take that long and not
> >>>> something else (like gc-minor)? More of those mean that you allocate a
> >>>> lot more surviving objects. Can you do two things:
> >>>>
> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request
> >>>> b) take the sum of those
> >>>>
> >>>> and plot them
> >>>
> >>> ^^^ or just paste the results actually
> >>>
> >>>>
> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote:
> >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. But
> >>>>> it doesn't alter the conclusion from the previous email: The slow
> >>>>> queries then all have a duration around 34*10^9 units, 'normal'
> >>>>> queries 1*10^9 units, or .4 seconds at this conversion. Also, the log
> >>>>> shows that a slow query performs many more gc-minor operations than a
> >>>>> 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots
> >>>>> operations vs 58.
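[Editor's note] The unit arithmetic quoted above is easy to sanity-check. A minimal worked example, taking the 2.5 GHz clock rate at face value (it is an inference made in the exchange, not a confirmed spec of the machine):

```python
# Timestamp deltas reported in the thread, interpreted as CPU cycles.
CPU_HZ = 2.5e9        # assumed clock rate, as inferred above
SLOW_UNITS = 34e9     # timestamp delta observed for a slow query
NORMAL_UNITS = 1e9    # timestamp delta observed for a 'normal' query

slow_seconds = SLOW_UNITS / CPU_HZ      # 13.6 s -- matches the ~13 s slow runs
normal_seconds = NORMAL_UNITS / CPU_HZ  # 0.4 s  -- matches the reported .4 s
```

Both conversions land on the durations the program itself printed, which supports the cycles-as-units interpretation.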
> >>>>>
> >>>>> So the question becomes: Why do we get this large spike in
> >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
> >>>>>
> >>>>> Thanks,
> >>>>> /Martin
> >>>>>
> >>>>>
> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski <fijall at gmail.com> wrote:
> >>>>>>
> >>>>>> I think it's the cycles of your CPU
> >>>>>>
> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote:
> >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't
> >>>>>>> correlate it with seconds (which the program does print out). Slow
> >>>>>>> runs are around 13 seconds, but are around 34*10^9(dec),
> >>>>>>> 0x800000000 timestamp units (e.g. from 0x2b994c9d31889c to
> >>>>>>> 0x2b9944ab8c4f49).
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski wrote:
> >>>>>>>>
> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex.
> >>>>>>>>
> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote:
> >>>>>>>>> Based on Maciej's suggestion, I tried the following
> >>>>>>>>>
> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out
> >>>>>>>>>
> >>>>>>>>> This generates a logfile which looks something like this
> >>>>>>>>>
> >>>>>>>>> start-->
> >>>>>>>>> [2b99f1981b527e] {gc-minor
> >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots
> >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots}
> >>>>>>>>> [2b99f19890d750] gc-minor}
> >>>>>>>>> [snip]
> >>>>>>>>> ...
> >>>>>>>>> <--stop
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections.
> >>>>>>>>>
> >>>>>>>>> I base this on the following observations:
> >>>>>>>>>
> >>>>>>>>> I can't understand the format of the timestamp on each logline
> >>>>>>>>> (the "[2b99f1981b527e]"). From what I can see in the code, this
> >>>>>>>>> should be output from time.clock(), but that doesn't return a
> >>>>>>>>> number like that when I run pypy interactively.
> >>>>>>>>> Instead, I count the number of debug lines between start--> and
> >>>>>>>>> the corresponding <--stop.
> >>>>>>>>> Most runs have a few hundred lines of output between start/stop.
> >>>>>>>>> All slow runs have very close to 57800 lines of output between
> >>>>>>>>> start/stop.
> >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647
> >>>>>>>>> gc-minor operations, and 9647 gc-minor-walkroots operations.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> /Martin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski wrote:
> >>>>>>>>>>
> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is
> >>>>>>>>>> stdout) which will do that for you btw.
> >>>>>>>>>>
> >>>>>>>>>> maybe you can find out what's that using profiling or valgrind?
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote:
> >>>>>>>>>>> I have tried getting the pypy source and building my own
> >>>>>>>>>>> version of pypy. I have modified
> >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() to
> >>>>>>>>>>> print out when it starts and when it stops. Apparently, the
> >>>>>>>>>>> slow queries do NOT occur during major_collection_step; at
> >>>>>>>>>>> least, I have not observed major step output during a query
> >>>>>>>>>>> execution. So, apparently, something else is blocking. This
> >>>>>>>>>>> could be another aspect of the GC, but it could also be
> >>>>>>>>>>> anything else.
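[Editor's note] The counting and timing methodology described in the thread can be automated in a few lines. A hypothetical sketch (the function name is mine; the regex assumes the `[hex-timestamp] {section` / `[hex-timestamp] section}` line format shown above, and durations stay in raw timestamp units):

```python
import re
from collections import defaultdict

# Matches "[2b99f1981b527e] {gc-minor" (open) and "[2b99f19890d750] gc-minor}" (close).
DEBUG = re.compile(r"\[([0-9a-f]+)\] (\{)?([\w-]+)(\})?")

def analyze_query(lines):
    """For the log lines of one start-->/<--stop window, return
    (counts, totals, maxima) per debug section."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    maxima = defaultdict(int)
    opened = {}  # section name -> timestamp at which it was opened
    for line in lines:
        m = DEBUG.search(line)
        if not m:
            continue  # skip start-->/<--stop markers and other noise
        ts = int(m.group(1), 16)
        name = m.group(3)
        if m.group(2):                         # "{gc-minor" opens a section
            counts[name] += 1
            opened[name] = ts
        elif m.group(4) and name in opened:    # "gc-minor}" closes it
            d = ts - opened.pop(name)
            totals[name] += d
            maxima[name] = max(maxima[name], d)
    return dict(counts), dict(totals), dict(maxima)
```

Splitting the full log on `start-->`/`<--stop` and feeding each window through this would produce per-query tables like the ones posted above.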
> >>>>>>>>>>> Just to be sure, I have tried running the same application in
> >>>>>>>>>>> python with garbage collection disabled. I don't see the
> >>>>>>>>>>> problem there, so it is somehow related to either the GC or
> >>>>>>>>>>> the runtime.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> /Martin
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the
> >>>>>>>>>>>> same issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We basically generate a linked list of objects. To increase
> >>>>>>>>>>>> connectedness, elements in the list hold references
> >>>>>>>>>>>> (dummy_links) to 10 randomly chosen previous elements in the
> >>>>>>>>>>>> list.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [snip -- benchmark description, timings, code, and earlier
> >>>>>>>>>>>> replies quoted in full above]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mak at issuu.com Mon Mar 17 16:37:45 2014
From: mak at issuu.com (Martin Koch)
Date: Mon, 17 Mar 2014 16:37:45 +0100
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To: 
References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com>
Message-ID: 

Ah - it just occurred to me that the first runs may be slow anyway: since we
take the average of the last 100 runs as the benchmark, the first 100 runs
are never classified as slow. Indeed, the first three runs with many
collections are among the first 100 runs.

On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote:
> Here are the total and max times in millions of units; 30000 units is
> approximately 13 seconds. I have extracted the runs where there are many
> gc-collect-steps. These are in execution order, so the first runs with many
> gc-collect-steps aren't slow.
> > *Totals*: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 *Max*: > gc-minor:10 gc-collect-step:247 > *Totals*: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 *Max*: > gc-minor:10 gc-collect-step:245 > *Totals*: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 *Max*: > gc-minor:11 gc-collect-step:244 > *Totals*: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 > *Max*: gc-minor:17 gc-collect-step:244 > *Totals*: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 > *Max*: gc-minor:11 gc-collect-step:248 > *Totals*: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 > *Max*: gc-minor:8 gc-collect-step:299 > *Totals*: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 > *Max*: gc-minor:11 gc-collect-step:246 > *Totals*: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 > *Max*: gc-minor:8 gc-collect-step:244 > *Totals*: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 > *Max*: gc-minor:36 gc-collect-step:248 > *Totals*: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 > *Max*: gc-minor:8 gc-collect-step:244 > *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 > *Max*: gc-minor:8 gc-collect-step:245 > *Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 > *Max*: gc-minor:8 gc-collect-step:244 > *Totals*: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 > *Max*: gc-minor:38 gc-collect-step:244 > *Totals*: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 > *Max*: gc-minor:23 gc-collect-step:245 > *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 > *Max*: gc-minor:8 gc-collect-step:246 > *Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 > *Max*: gc-minor:9 gc-collect-step:244 > *Totals*: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 > *Max*: gc-minor:8 gc-collect-step:246 > *Totals*: gc-minor:388 slow:1 gc-minor-walkroots:0 
gc-collect-step:31179 > *Max*: gc-minor:8 gc-collect-step:248 > *Totals*: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 > *Max*: gc-minor:8 gc-collect-step:250 > *Totals*: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 > *Max*: gc-minor:8 gc-collect-step:245 > *Totals*: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 > *Max*: gc-minor:543 gc-collect-step:244 > *Totals*: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 > *Max*: gc-minor:20 gc-collect-step:246 > *Totals*: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 > *Max*: gc-minor:25 gc-collect-step:245 > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: > >> Ah. I had misunderstood. I'll get back to you on that :) thanks >> >> /Martin >> >> >> > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: >> > >> > eh, this is not what I need >> > >> > I need a max of TIME it took for a gc-minor and the TOTAL time it took >> > for a gc-minor (per query) (ideally same for gc-walkroots and >> > gc-collect-step) >> > >> >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >> >> Here are the collated results of running each query. For each run, I >> count >> >> how many of each of the pypy debug lines i get. I.e. there were 668 >> runs >> >> that printed 58 loglines that contain "{gc-minor" which was eventually >> >> followed by "gc-minor}". I have also counted if the query was slow; >> >> interestingly, not all the queries with many gc-minors were slow (but >> all >> >> slow queries had a gc-minor). 
>> >> >> >> Please let me know if this is unclear :) >> >> >> >> 668 gc-minor:58 gc-minor-walkroots:58 >> >> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >> >> 140 gc-minor:59 gc-minor-walkroots:59 >> >> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >> >> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >> >> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >> gc-collect-step:9589 >> >> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >> gc-collect-step:9590 >> >> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >> gc-collect-step:9609 >> >> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >> >> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >> >> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >> jit-backend-dump:78 >> >> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 >> >> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >> >> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >> >> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >> >> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 >> >> jit-resume:84 >> >> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >> >> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >> >> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >> >> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >> >> jit-resume:14 >> >> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >> >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >> >> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> jit-abort:3 >> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> jit-log-opt-bridge:2 >> >> jit-log-compiling-bridge:2 jit-resume:84 >> >> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >> >> jit-log-noopt-loop:6 
jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >> >> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> jit-abort:3 >> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> jit-log-opt-bridge:2 >> >> jit-log-compiling-bridge:2 jit-resume:84 >> >> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >> >> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >> gc-minor:61 >> >> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >> >> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >> jit-abort:3 >> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >> jit-log-opt-bridge:2 >> >> jit-log-compiling-bridge:2 jit-resume:104 >> >> >> >> >> >> Thanks, >> >> /Martin >> >> >> >> >> >> >> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >> >> wrote: >> >>> >> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski > > >> >>> wrote: >> >>>> are you *sure* it's the walkroots that take that long and not >> >>>> something else (like gc-minor)? More of those mean that you allocate >> a >> >>>> lot more surviving objects. Can you do two things: >> >>>> >> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >> >>>> b) take the sum of those >> >>>> >> >>>> and plot them >> >>> >> >>> ^^^ or just paste the results actually >> >>> >> >>>> >> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >> >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. >> But >> >>>>> it >> >>>>> doesn't alter the conclusion from the previous email: The slow >> queries >> >>>>> then >> >>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >> >>>>> units, or >> >>>>> .4 seconds at this conversion. Also, the log shows that a slow query >> >>>>> performs many more gc-minor operations than a 'normal' one: 9600 >> >>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. 
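
[Editor's note: the unit arithmetic in Martin's message checks out. Treating the bracketed log values as CPU timestamp-counter readings and assuming a ~2.5 GHz counter frequency (an assumption, as discussed, not something recorded in the log) turns a slow run's span into roughly the observed 13 seconds. Note the two readings as quoted appear in end-first order, hence the absolute difference:]

```python
# Hex timestamp readings quoted from the log for one slow run.
t_a = int("2b994c9d31889c", 16)
t_b = int("2b9944ab8c4f49", 16)
delta = abs(t_a - t_b)          # roughly 34e9 counter units, close to 0x800000000

assumed_hz = 2.5e9              # assumed CPU frequency; not recorded in the log
seconds = delta / assumed_hz    # roughly the ~13 s observed for slow runs
```
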
>> >>>>> >> >>>>> So the question becomes: Why do we get this large spike in >> >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid >> it :) >> >>>>> ? >> >>>>> >> >>>>> Thanks, >> >>>>> /Martin >> >>>>> >> >>>>> >> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski < >> fijall at gmail.com> >> >>>>> wrote: >> >>>>>> >> >>>>>> I think it's the cycles of your CPU >> >>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >> wrote: >> >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >> >>>>>>> correlate it >> >>>>>>> with seconds (which the program does print out). Slow runs are >> >>>>>>> around 13 >> >>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >> >>>>>>> (e.g. >> >>>>>>> from >> >>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> >>>>>>> >> >>>>>>> wrote: >> >>>>>>>> >> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex. >> >>>>>>>> >> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >> >>>>>>>> wrote: >> >>>>>>>>> Based On Maciej's suggestion, I tried the following >> >>>>>>>>> >> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >> >>>>>>>>> >> >>>>>>>>> This generates a logfile which looks something like this >> >>>>>>>>> >> >>>>>>>>> start--> >> >>>>>>>>> [2b99f1981b527e] {gc-minor >> >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >> >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >> >>>>>>>>> [2b99f19890d750] gc-minor} >> >>>>>>>>> [snip] >> >>>>>>>>> ... >> >>>>>>>>> <--stop >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections. >> >>>>>>>>> >> >>>>>>>>> I base this on the following observations: >> >>>>>>>>> >> >>>>>>>>> I can't understand the format of the timestamp on each logline >> >>>>>>>>> (the >> >>>>>>>>> "[2b99f1981b527e]"). 
From what I can see in the code, this >> should >> >>>>>>>>> be >> >>>>>>>>> output >> >>>>>>>>> from time.clock(), but that doesn't return a number like that >> >>>>>>>>> when I >> >>>>>>>>> run >> >>>>>>>>> pypy interactively >> >>>>>>>>> Instead, I count the number of debug lines between start--> and >> >>>>>>>>> the >> >>>>>>>>> corresponding <--stop. >> >>>>>>>>> Most runs have a few hundred lines of output between start/stop >> >>>>>>>>> All slow runs have very close to 57800 lines out output between >> >>>>>>>>> start/stop >> >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >> >>>>>>>>> gc-minor >> >>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Thanks, >> >>>>>>>>> /Martin >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> >>>>>>>>> >> >>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >> >>>>>>>>>> stdout) >> >>>>>>>>>> which will do that for you btw. >> >>>>>>>>>> >> >>>>>>>>>> maybe you can find out what's that using profiling or valgrind? >> >>>>>>>>>> >> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >> >>>>>>>>>> wrote: >> >>>>>>>>>>> I have tried getting the pypy source and building my own >> >>>>>>>>>>> version >> >>>>>>>>>>> of >> >>>>>>>>>>> pypy. I >> >>>>>>>>>>> have modified >> >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >> >>>>>>>>>>> to >> >>>>>>>>>>> print out when it starts and when it stops. Apparently, the >> >>>>>>>>>>> slow >> >>>>>>>>>>> queries >> >>>>>>>>>>> do >> >>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >> >>>>>>>>>>> observed >> >>>>>>>>>>> major >> >>>>>>>>>>> step output during a query execution. So, apparently, >> >>>>>>>>>>> something >> >>>>>>>>>>> else >> >>>>>>>>>>> is >> >>>>>>>>>>> blocking. 
This could be another aspect of the GC, but it could >> >>>>>>>>>>> also >> >>>>>>>>>>> be >> >>>>>>>>>>> anything else. >> >>>>>>>>>>> >> >>>>>>>>>>> Just to be sure, I have tried running the same application in >> >>>>>>>>>>> python >> >>>>>>>>>>> with >> >>>>>>>>>>> garbage collection disabled. I don't see the problem there, so >> >>>>>>>>>>> it >> >>>>>>>>>>> is >> >>>>>>>>>>> somehow >> >>>>>>>>>>> related to either GC or the runtime somehow. >> >>>>>>>>>>> >> >>>>>>>>>>> Cheers, >> >>>>>>>>>>> /Martin >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >> >>>>>>>>>>> wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >> >>>>>>>>>>>> same >> >>>>>>>>>>>> issue. >> >>>>>>>>>>>> >> >>>>>>>>>>>> We basically generate a linked list of objects. To increase >> >>>>>>>>>>>> connectedness, >> >>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >> >>>>>>>>>>>> randomly >> >>>>>>>>>>>> chosen >> >>>>>>>>>>>> previous elements in the list. >> >>>>>>>>>>>> >> >>>>>>>>>>>> We then time a function that traverses 50000 elements from >> >>>>>>>>>>>> the >> >>>>>>>>>>>> list >> >>>>>>>>>>>> from a >> >>>>>>>>>>>> random start point. If the traversal reaches the end of the >> >>>>>>>>>>>> list, >> >>>>>>>>>>>> we >> >>>>>>>>>>>> instead >> >>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >> >>>>>>>>>>>> are >> >>>>>>>>>>>> traversed >> >>>>>>>>>>>> every time. To generate some garbage, we build a list holding >> >>>>>>>>>>>> the >> >>>>>>>>>>>> traversed >> >>>>>>>>>>>> elements and a dummy list of characters. >> >>>>>>>>>>>> >> >>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >> >>>>>>>>>>>> buffer. 
If >> >>>>>>>>>>>> the >> >>>>>>>>>>>> elapsed time for the last run is more than twice the average >> >>>>>>>>>>>> time, >> >>>>>>>>>>>> we >> >>>>>>>>>>>> print >> >>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >> >>>>>>>>>>>> runtime >> >>>>>>>>>>>> (we >> >>>>>>>>>>>> would like to see that the mean runtime does not increase >> >>>>>>>>>>>> with >> >>>>>>>>>>>> the >> >>>>>>>>>>>> number of >> >>>>>>>>>>>> elements in the list, but that the max time does increase >> >>>>>>>>>>>> (linearly >> >>>>>>>>>>>> with the >> >>>>>>>>>>>> number of object, i guess); traversing 50K elements should be >> >>>>>>>>>>>> independent of >> >>>>>>>>>>>> the memory size). >> >>>>>>>>>>>> >> >>>>>>>>>>>> We have tried monitoring memory consumption by external >> >>>>>>>>>>>> inspection, >> >>>>>>>>>>>> but >> >>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >> >>>>>>>>>>>> same >> >>>>>>>>>>>> time >> >>>>>>>>>>>> that >> >>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't always >> >>>>>>>>>>>> return >> >>>>>>>>>>>> freed >> >>>>>>>>>>>> pages back to the OS? >> >>>>>>>>>>>> >> >>>>>>>>>>>> Using top, we observe that 10M elements allocates around 17GB >> >>>>>>>>>>>> after >> >>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >> >>>>>>>>>>>> 35GB >> >>>>>>>>>>>> shortly >> >>>>>>>>>>>> after building). 
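
[Editor's note: the timing scheme Martin describes (last 100 run times in a circular buffer, "slow" meaning more than twice the window average, with a 90th-percentile figure reported alongside) can be isolated into a small helper. This is a sketch in Python 3, not code from the thread; the thread's sample is Python 2:]

```python
class SlowRunDetector:
    """Keep the last `window` run times in a circular buffer; flag a run
    as slow once the buffer is full and the run exceeds twice the average."""

    def __init__(self, window=100):
        self.window = window
        self.timings = [0.0] * window
        self.count = 0

    def record(self, elapsed):
        """Store one run time; return (threshold, 90th-percentile time)
        if the run was slow, else None."""
        self.timings[self.count % self.window] = elapsed
        self.count += 1
        if self.count <= self.window:   # warm-up: average not meaningful yet
            return None
        threshold = 2.0 * sum(self.timings) / self.window
        if elapsed <= threshold:
            return None
        q90 = sorted(self.timings)[int(self.window * 0.9)]
        return threshold, q90
```

One consequence of this scheme, which Martin notes later in the thread, is that slow runs among the first `window` samples are never flagged at all.
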
>> >>>>>>>>>>>> >> >>>>>>>>>>>> Here is output from a few runs with different number of >> >>>>>>>>>>>> elements: >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> pypy mem.py 10000000 >> >>>>>>>>>>>> start build >> >>>>>>>>>>>> end build 84.142424 >> >>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >> >>>>>>>>>>>> 1.495401 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >> >>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >> >>>>>>>>>>>> 1.488160 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >> >>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >> >>>>>>>>>>>> 1.474563 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >> >>>>>>>>>>>> >> >>>>>>>>>>>> pypy mem.py 20000000 >> >>>>>>>>>>>> start build >> >>>>>>>>>>>> end build 180.823105 >> >>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >> >>>>>>>>>>>> 2.295146 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >> >>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >> >>>>>>>>>>>> 2.283927 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >> >>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >> >>>>>>>>>>>> 2.279631 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >> >>>>>>>>>>>> >> >>>>>>>>>>>> pypy mem.py 30000000 >> >>>>>>>>>>>> start build >> >>>>>>>>>>>> end build 276.217811 >> >>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >> >>>>>>>>>>>> 3.188464 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >> >>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >> >>>>>>>>>>>> 3.183003 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >> >>>>>>>>>>>> 3.190782 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >> >>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >> >>>>>>>>>>>> 3.239637 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>> >> >>>>>>>>>>>> Code below >> 
>>>>>>>>>>>> >> >>>>>>>>>>>> >> -------------------------------------------------------------- >> >>>>>>>>>>>> import time >> >>>>>>>>>>>> from random import randint, choice >> >>>>>>>>>>>> import sys >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> allElems = {} >> >>>>>>>>>>>> >> >>>>>>>>>>>> class Node: >> >>>>>>>>>>>> def __init__(self, v_): >> >>>>>>>>>>>> self.v = v_ >> >>>>>>>>>>>> self.next = None >> >>>>>>>>>>>> self.dummy_data = [randint(0,100) >> >>>>>>>>>>>> for _ in xrange(randint(50,100))] >> >>>>>>>>>>>> allElems[self.v] = self >> >>>>>>>>>>>> if self.v > 0: >> >>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >> >>>>>>>>>>>> self.v-1)] >> >>>>>>>>>>>> for _ >> >>>>>>>>>>>> in >> >>>>>>>>>>>> xrange(10)] >> >>>>>>>>>>>> else: >> >>>>>>>>>>>> self.dummy_links = [self] >> >>>>>>>>>>>> >> >>>>>>>>>>>> def set_next(self, l): >> >>>>>>>>>>>> self.next = l >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> def follow(node): >> >>>>>>>>>>>> acc = [] >> >>>>>>>>>>>> count = 0 >> >>>>>>>>>>>> cur = node >> >>>>>>>>>>>> assert node.v is not None >> >>>>>>>>>>>> assert cur is not None >> >>>>>>>>>>>> while count < 50000: >> >>>>>>>>>>>> # return a value; generate some garbage >> >>>>>>>>>>>> acc.append((cur.v, >> >>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >> >>>>>>>>>>>> for >> >>>>>>>>>>>> x >> >>>>>>>>>>>> in >> >>>>>>>>>>>> xrange(100)])) >> >>>>>>>>>>>> >> >>>>>>>>>>>> # if we have reached the end, chose a random link >> >>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >> >>>>>>>>>>>> else >> >>>>>>>>>>>> cur.next >> >>>>>>>>>>>> count += 1 >> >>>>>>>>>>>> >> >>>>>>>>>>>> return acc >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> def build(num_elems): >> >>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>> print "start build" >> >>>>>>>>>>>> root = Node(0) >> >>>>>>>>>>>> cur = root >> >>>>>>>>>>>> for x in xrange(1, num_elems): >> >>>>>>>>>>>> e = Node(x) >> >>>>>>>>>>>> cur.next = e >> >>>>>>>>>>>> cur = e >> >>>>>>>>>>>> print "end build 
%f" % (time.time() - start) >> >>>>>>>>>>>> return root >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> num_timings = 100 >> >>>>>>>>>>>> if __name__ == "__main__": >> >>>>>>>>>>>> num_elems = int(sys.argv[1]) >> >>>>>>>>>>>> build(num_elems) >> >>>>>>>>>>>> total = 0 >> >>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >> >>>>>>>>>>>> num_timings >> >>>>>>>>>>>> runs >> >>>>>>>>>>>> i = 0 >> >>>>>>>>>>>> beginning = time.time() >> >>>>>>>>>>>> while time.time() - beginning < 600: >> >>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >> >>>>>>>>>>>> assert(elem is not None) >> >>>>>>>>>>>> >> >>>>>>>>>>>> lst = follow(elem) >> >>>>>>>>>>>> >> >>>>>>>>>>>> total += choice(lst)[0] # use the return value for >> >>>>>>>>>>>> something >> >>>>>>>>>>>> >> >>>>>>>>>>>> end = time.time() >> >>>>>>>>>>>> >> >>>>>>>>>>>> elapsed = end-start >> >>>>>>>>>>>> timings[i % num_timings] = elapsed >> >>>>>>>>>>>> if (i > num_timings): >> >>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >> >>>>>>>>>>>> defined >> >>>>>>>>>>>> as >> >>>>>>>>>>>>> >> >>>>>>>>>>>> 2*avg run time >> >>>>>>>>>>>> if (elapsed > slow_time): >> >>>>>>>>>>>> print "that took a long time elapsed: %f >> >>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >> >>>>>>>>>>>> (elapsed, slow_time, >> >>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >> >>>>>>>>>>>> i += 1 >> >>>>>>>>>>>> print total >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >>>>>>>>>>>> >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch > > >> >>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>> Hi Armin, Maciej >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks for responding. 
>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >> >>>>>>>>>>>>>> the >> >>>>>>>>>>>>>> code >> >>>>>>>>>>>>>> I'm >> >>>>>>>>>>>>>> in a >> >>>>>>>>>>>>>> position to share, and I'll get back to you. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >> >>>>>>>>>>>>>> would be >> >>>>>>>>>>>>>> a >> >>>>>>>>>>>>>> means >> >>>>>>>>>>>>>> to >> >>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >> >>>>>>>>>>>>>> memory, >> >>>>>>>>>>>>>> but I >> >>>>>>>>>>>>>> would expect that to be a tall order :) >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Hi Martin. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Note that in case you want us to do the work of isolating >> >>>>>>>>>>>>> the >> >>>>>>>>>>>>> problem, >> >>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >> >>>>>>>>>>>>> and >> >>>>>>>>>>>>> stuff). >> >>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >> >>>>>>>>>>>>> isolate >> >>>>>>>>>>>>> a >> >>>>>>>>>>>>> part you can share freely :) >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >> >>>>> >> >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 16:41:16 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 17:41:16 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Message-ID: ok. so as you can probably see, the max is not that big, which means the GC is really incremental. What happens is you get tons of garbage that survives minor collection every now and then. I don't exactly know why, but you should look what objects can potentially survive for too long. 
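
[Editor's note: one concrete way to act on Maciej's advice in the quoted mem.py sample: follow() pins all 50,000 (value, junk-list) tuples in `acc` until the traversal finishes, so every minor collection that runs meanwhile finds them alive and must keep them. A hypothetical restructuring (Python 3; `follow_iter` and the minimal `Node` stand-in below are names invented for this sketch, not from the thread) yields each step's result so its garbage can die young:]

```python
import random

class Node:
    """Minimal stand-in for the Node class in the quoted sample."""
    def __init__(self, v):
        self.v = v
        self.next = None
        self.dummy_links = [self]

def follow_iter(node, steps=50000):
    """Generator variant of follow(): yields one (value, junk) pair per
    step instead of accumulating all of them in a list, so each step's
    temporaries can die in the nursery rather than survive the minor
    collections that run during the traversal."""
    cur = node
    for _ in range(steps):
        # same per-step garbage as the original, but now short-lived
        junk = [random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(100)]
        yield cur.v, junk
        cur = random.choice(cur.dummy_links) if cur.next is None else cur.next
```

The caller can then keep only the one element it actually needs (e.g. reservoir-sample a value for `total`) instead of materialising the whole `lst`; whether this removes the gc-collect-step spikes is something the thread's benchmark would have to confirm.
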
On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: > Ah - it just occured to me that the first runs may be slow anyway: Since we > take the average of the last 100 runs as the benchmark, then the first 100 > runs are not classified as slow. Indeed, the first three runs with many > collections are in the first 100 runs. > > > On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >> >> Here are the total and max times in millions of units; 30000 units is >> approximately 13 seconds. I have extracted the runs where there are many >> gc-collect-steps. These are in execution order, so the first runs with many >> gc-collect-steps aren't slow. >> >> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >> gc-minor:10 gc-collect-step:247 >> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >> gc-minor:10 gc-collect-step:245 >> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >> gc-minor:11 gc-collect-step:244 >> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 >> Max: gc-minor:17 gc-collect-step:244 >> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 >> Max: gc-minor:11 gc-collect-step:248 >> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 >> Max: gc-minor:8 gc-collect-step:299 >> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 >> Max: gc-minor:11 gc-collect-step:246 >> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 >> Max: gc-minor:8 gc-collect-step:244 >> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 >> Max: gc-minor:36 gc-collect-step:248 >> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 >> Max: gc-minor:8 gc-collect-step:244 >> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 >> Max: gc-minor:8 gc-collect-step:245 >> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 >> Max: gc-minor:8 gc-collect-step:244 >> Totals: 
gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 >> Max: gc-minor:38 gc-collect-step:244 >> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 >> Max: gc-minor:23 gc-collect-step:245 >> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 >> Max: gc-minor:8 gc-collect-step:246 >> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 >> Max: gc-minor:9 gc-collect-step:244 >> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 >> Max: gc-minor:8 gc-collect-step:246 >> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 >> Max: gc-minor:8 gc-collect-step:248 >> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 >> Max: gc-minor:8 gc-collect-step:250 >> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 >> Max: gc-minor:8 gc-collect-step:245 >> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 >> Max: gc-minor:543 gc-collect-step:244 >> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 >> Max: gc-minor:20 gc-collect-step:246 >> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 >> Max: gc-minor:25 gc-collect-step:245 >> >> Thanks, >> /Martin >> >> >> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: >>> >>> Ah. I had misunderstood. I'll get back to you on that :) thanks >>> >>> /Martin >>> >>> >>> > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: >>> > >>> > eh, this is not what I need >>> > >>> > I need a max of TIME it took for a gc-minor and the TOTAL time it took >>> > for a gc-minor (per query) (ideally same for gc-walkroots and >>> > gc-collect-step) >>> > >>> >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >>> >> Here are the collated results of running each query. For each run, I >>> >> count >>> >> how many of each of the pypy debug lines i get. I.e. 
there were 668 >>> >> runs >>> >> that printed 58 loglines that contain "{gc-minor" which was eventually >>> >> followed by "gc-minor}". I have also counted if the query was slow; >>> >> interestingly, not all the queries with many gc-minors were slow (but >>> >> all >>> >> slow queries had a gc-minor). >>> >> >>> >> Please let me know if this is unclear :) >>> >> >>> >> 668 gc-minor:58 gc-minor-walkroots:58 >>> >> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >>> >> 140 gc-minor:59 gc-minor-walkroots:59 >>> >> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >>> >> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >>> >> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >>> >> gc-collect-step:9589 >>> >> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >>> >> gc-collect-step:9590 >>> >> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >>> >> gc-collect-step:9609 >>> >> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >>> >> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >>> >> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >>> >> jit-backend-dump:78 >>> >> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 >>> >> gc-minor:9030 >>> >> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >>> >> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >>> >> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >>> >> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 >>> >> jit-resume:84 >>> >> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >>> >> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >>> >> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >>> >> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >>> >> jit-resume:14 >>> >> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >>> >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >>> >> gc-minor-walkroots:60 
jit-optimize:6 jit-log-short-preamble:2 >>> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>> >> jit-abort:3 >>> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>> >> jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 jit-resume:84 >>> >> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >>> >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >>> >> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >>> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>> >> jit-abort:3 >>> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>> >> jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 jit-resume:84 >>> >> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >>> >> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >>> >> gc-minor:61 >>> >> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >>> >> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >>> >> jit-abort:3 >>> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >>> >> jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 jit-resume:104 >>> >> >>> >> >>> >> Thanks, >>> >> /Martin >>> >> >>> >> >>> >> >>> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >>> >> wrote: >>> >>> >>> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >>> >>> >>> >>> wrote: >>> >>>> are you *sure* it's the walkroots that take that long and not >>> >>>> something else (like gc-minor)? More of those mean that you allocate >>> >>>> a >>> >>>> lot more surviving objects. Can you do two things: >>> >>>> >>> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >>> >>>> b) take the sum of those >>> >>>> >>> >>>> and plot them >>> >>> >>> >>> ^^^ or just paste the results actually >>> >>> >>> >>>> >>> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >>> >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. 
>>> >>>>> But >>> >>>>> it >>> >>>>> doesn't alter the conclusion from the previous email: The slow >>> >>>>> queries >>> >>>>> then >>> >>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >>> >>>>> units, or >>> >>>>> .4 seconds at this conversion. Also, the log shows that a slow >>> >>>>> query >>> >>>>> performs many more gc-minor operations than a 'normal' one: 9600 >>> >>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >>> >>>>> >>> >>>>> So the question becomes: Why do we get this large spike in >>> >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid >>> >>>>> it :) >>> >>>>> ? >>> >>>>> >>> >>>>> Thanks, >>> >>>>> /Martin >>> >>>>> >>> >>>>> >>> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >>> >>>>> >>> >>>>> wrote: >>> >>>>>> >>> >>>>>> I think it's the cycles of your CPU >>> >>>>>> >>> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >>> >>>>>>> wrote: >>> >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >>> >>>>>>> correlate it >>> >>>>>>> with seconds (which the program does print out). Slow runs are >>> >>>>>>> around 13 >>> >>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >>> >>>>>>> (e.g. >>> >>>>>>> from >>> >>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>> >>>>>>> >>> >>>>>>> wrote: >>> >>>>>>>> >>> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex. 
>>> >>>>>>>> >>> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >>> >>>>>>>> wrote: >>> >>>>>>>>> Based On Maciej's suggestion, I tried the following >>> >>>>>>>>> >>> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >>> >>>>>>>>> >>> >>>>>>>>> This generates a logfile which looks something like this >>> >>>>>>>>> >>> >>>>>>>>> start--> >>> >>>>>>>>> [2b99f1981b527e] {gc-minor >>> >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >>> >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >>> >>>>>>>>> [2b99f19890d750] gc-minor} >>> >>>>>>>>> [snip] >>> >>>>>>>>> ... >>> >>>>>>>>> <--stop >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections. >>> >>>>>>>>> >>> >>>>>>>>> I base this on the following observations: >>> >>>>>>>>> >>> >>>>>>>>> I can't understand the format of the timestamp on each logline >>> >>>>>>>>> (the >>> >>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this >>> >>>>>>>>> should >>> >>>>>>>>> be >>> >>>>>>>>> output >>> >>>>>>>>> from time.clock(), but that doesn't return a number like that >>> >>>>>>>>> when I >>> >>>>>>>>> run >>> >>>>>>>>> pypy interactively >>> >>>>>>>>> Instead, I count the number of debug lines between start--> and >>> >>>>>>>>> the >>> >>>>>>>>> corresponding <--stop. >>> >>>>>>>>> Most runs have a few hundred lines of output between start/stop >>> >>>>>>>>> All slow runs have very close to 57800 lines out output between >>> >>>>>>>>> start/stop >>> >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >>> >>>>>>>>> gc-minor >>> >>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> Thanks, >>> >>>>>>>>> /Martin >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>> >>>>>>>>> >>> >>>>>>>>> wrote: >>> >>>>>>>>>> >>> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >>> >>>>>>>>>> stdout) >>> >>>>>>>>>> which will do that for you btw. 
>>> >>>>>>>>>> >>> >>>>>>>>>> maybe you can find out what's that using profiling or >>> >>>>>>>>>> valgrind? >>> >>>>>>>>>> >>> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >>> >>>>>>>>>> wrote: >>> >>>>>>>>>>> I have tried getting the pypy source and building my own >>> >>>>>>>>>>> version >>> >>>>>>>>>>> of >>> >>>>>>>>>>> pypy. I >>> >>>>>>>>>>> have modified >>> >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>> >>>>>>>>>>> to >>> >>>>>>>>>>> print out when it starts and when it stops. Apparently, the >>> >>>>>>>>>>> slow >>> >>>>>>>>>>> queries >>> >>>>>>>>>>> do >>> >>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >>> >>>>>>>>>>> observed >>> >>>>>>>>>>> major >>> >>>>>>>>>>> step output during a query execution. So, apparently, >>> >>>>>>>>>>> something >>> >>>>>>>>>>> else >>> >>>>>>>>>>> is >>> >>>>>>>>>>> blocking. This could be another aspect of the GC, but it >>> >>>>>>>>>>> could >>> >>>>>>>>>>> also >>> >>>>>>>>>>> be >>> >>>>>>>>>>> anything else. >>> >>>>>>>>>>> >>> >>>>>>>>>>> Just to be sure, I have tried running the same application in >>> >>>>>>>>>>> python >>> >>>>>>>>>>> with >>> >>>>>>>>>>> garbage collection disabled. I don't see the problem there, >>> >>>>>>>>>>> so >>> >>>>>>>>>>> it >>> >>>>>>>>>>> is >>> >>>>>>>>>>> somehow >>> >>>>>>>>>>> related to either GC or the runtime somehow. >>> >>>>>>>>>>> >>> >>>>>>>>>>> Cheers, >>> >>>>>>>>>>> /Martin >>> >>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>> >>>>>>>>>>> wrote: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >>> >>>>>>>>>>>> same >>> >>>>>>>>>>>> issue. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We basically generate a linked list of objects. 
To increase >>> >>>>>>>>>>>> connectedness, >>> >>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >>> >>>>>>>>>>>> randomly >>> >>>>>>>>>>>> chosen >>> >>>>>>>>>>>> previous elements in the list. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We then time a function that traverses 50000 elements from >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> list >>> >>>>>>>>>>>> from a >>> >>>>>>>>>>>> random start point. If the traversal reaches the end of the >>> >>>>>>>>>>>> list, >>> >>>>>>>>>>>> we >>> >>>>>>>>>>>> instead >>> >>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >>> >>>>>>>>>>>> are >>> >>>>>>>>>>>> traversed >>> >>>>>>>>>>>> every time. To generate some garbage, we build a list >>> >>>>>>>>>>>> holding >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> traversed >>> >>>>>>>>>>>> elements and a dummy list of characters. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >>> >>>>>>>>>>>> buffer. If >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> elapsed time for the last run is more than twice the average >>> >>>>>>>>>>>> time, >>> >>>>>>>>>>>> we >>> >>>>>>>>>>>> print >>> >>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >>> >>>>>>>>>>>> runtime >>> >>>>>>>>>>>> (we >>> >>>>>>>>>>>> would like to see that the mean runtime does not increase >>> >>>>>>>>>>>> with >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> number of >>> >>>>>>>>>>>> elements in the list, but that the max time does increase >>> >>>>>>>>>>>> (linearly >>> >>>>>>>>>>>> with the >>> >>>>>>>>>>>> number of object, i guess); traversing 50K elements should >>> >>>>>>>>>>>> be >>> >>>>>>>>>>>> independent of >>> >>>>>>>>>>>> the memory size). 
>>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We have tried monitoring memory consumption by external >>> >>>>>>>>>>>> inspection, >>> >>>>>>>>>>>> but >>> >>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >>> >>>>>>>>>>>> same >>> >>>>>>>>>>>> time >>> >>>>>>>>>>>> that >>> >>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't >>> >>>>>>>>>>>> always >>> >>>>>>>>>>>> return >>> >>>>>>>>>>>> freed >>> >>>>>>>>>>>> pages back to the OS? >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Using top, we observe that 10M elements allocates around >>> >>>>>>>>>>>> 17GB >>> >>>>>>>>>>>> after >>> >>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>> >>>>>>>>>>>> 35GB >>> >>>>>>>>>>>> shortly >>> >>>>>>>>>>>> after building). >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Here is output from a few runs with different number of >>> >>>>>>>>>>>> elements: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> pypy mem.py 10000000 >>> >>>>>>>>>>>> start build >>> >>>>>>>>>>>> end build 84.142424 >>> >>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>> >>>>>>>>>>>> 1.495401 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >>> >>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>> >>>>>>>>>>>> 1.488160 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >>> >>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>> >>>>>>>>>>>> 1.474563 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> pypy mem.py 20000000 >>> >>>>>>>>>>>> start build >>> >>>>>>>>>>>> end build 180.823105 >>> >>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>> >>>>>>>>>>>> 2.295146 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >>> >>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>> >>>>>>>>>>>> 2.283927 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >>> >>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>> >>>>>>>>>>>> 2.279631 >>> >>>>>>>>>>>> 
90th_quantile_runtime: 0.371502 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> pypy mem.py 30000000 >>> >>>>>>>>>>>> start build >>> >>>>>>>>>>>> end build 276.217811 >>> >>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>> >>>>>>>>>>>> 3.188464 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >>> >>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>> >>>>>>>>>>>> 3.183003 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>> >>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>> >>>>>>>>>>>> 3.190782 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >>> >>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>> >>>>>>>>>>>> 3.239637 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Code below >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> -------------------------------------------------------------- >>> >>>>>>>>>>>> import time >>> >>>>>>>>>>>> from random import randint, choice >>> >>>>>>>>>>>> import sys >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> allElems = {} >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> class Node: >>> >>>>>>>>>>>> def __init__(self, v_): >>> >>>>>>>>>>>> self.v = v_ >>> >>>>>>>>>>>> self.next = None >>> >>>>>>>>>>>> self.dummy_data = [randint(0,100) >>> >>>>>>>>>>>> for _ in xrange(randint(50,100))] >>> >>>>>>>>>>>> allElems[self.v] = self >>> >>>>>>>>>>>> if self.v > 0: >>> >>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >>> >>>>>>>>>>>> self.v-1)] >>> >>>>>>>>>>>> for _ >>> >>>>>>>>>>>> in >>> >>>>>>>>>>>> xrange(10)] >>> >>>>>>>>>>>> else: >>> >>>>>>>>>>>> self.dummy_links = [self] >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> def set_next(self, l): >>> >>>>>>>>>>>> self.next = l >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> def follow(node): >>> >>>>>>>>>>>> acc = [] >>> >>>>>>>>>>>> count = 0 >>> >>>>>>>>>>>> cur = node >>> >>>>>>>>>>>> assert node.v is not None >>> >>>>>>>>>>>> assert cur is not None >>> >>>>>>>>>>>> while count < 50000: >>> 
>>>>>>>>>>>> # return a value; generate some garbage >>> >>>>>>>>>>>> acc.append((cur.v, >>> >>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>> >>>>>>>>>>>> for >>> >>>>>>>>>>>> x >>> >>>>>>>>>>>> in >>> >>>>>>>>>>>> xrange(100)])) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> # if we have reached the end, chose a random link >>> >>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>> >>>>>>>>>>>> else >>> >>>>>>>>>>>> cur.next >>> >>>>>>>>>>>> count += 1 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> return acc >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> def build(num_elems): >>> >>>>>>>>>>>> start = time.time() >>> >>>>>>>>>>>> print "start build" >>> >>>>>>>>>>>> root = Node(0) >>> >>>>>>>>>>>> cur = root >>> >>>>>>>>>>>> for x in xrange(1, num_elems): >>> >>>>>>>>>>>> e = Node(x) >>> >>>>>>>>>>>> cur.next = e >>> >>>>>>>>>>>> cur = e >>> >>>>>>>>>>>> print "end build %f" % (time.time() - start) >>> >>>>>>>>>>>> return root >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> num_timings = 100 >>> >>>>>>>>>>>> if __name__ == "__main__": >>> >>>>>>>>>>>> num_elems = int(sys.argv[1]) >>> >>>>>>>>>>>> build(num_elems) >>> >>>>>>>>>>>> total = 0 >>> >>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >>> >>>>>>>>>>>> num_timings >>> >>>>>>>>>>>> runs >>> >>>>>>>>>>>> i = 0 >>> >>>>>>>>>>>> beginning = time.time() >>> >>>>>>>>>>>> while time.time() - beginning < 600: >>> >>>>>>>>>>>> start = time.time() >>> >>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >>> >>>>>>>>>>>> assert(elem is not None) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> lst = follow(elem) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> total += choice(lst)[0] # use the return value for >>> >>>>>>>>>>>> something >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> end = time.time() >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> elapsed = end-start >>> >>>>>>>>>>>> timings[i % num_timings] = elapsed >>> >>>>>>>>>>>> if (i > num_timings): >>> >>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>> >>>>>>>>>>>> defined >>> 
>>>>>>>>>>>> as >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>> 2*avg run time >>> >>>>>>>>>>>> if (elapsed > slow_time): >>> >>>>>>>>>>>> print "that took a long time elapsed: %f >>> >>>>>>>>>>>> slow_threshold: >>> >>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >>> >>>>>>>>>>>> (elapsed, slow_time, >>> >>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >>> >>>>>>>>>>>> i += 1 >>> >>>>>>>>>>>> print total >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> wrote: >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> wrote: >>> >>>>>>>>>>>>>> Hi Armin, Maciej >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Thanks for responding. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >>> >>>>>>>>>>>>>> the >>> >>>>>>>>>>>>>> code >>> >>>>>>>>>>>>>> I'm >>> >>>>>>>>>>>>>> in a >>> >>>>>>>>>>>>>> position to share, and I'll get back to you. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >>> >>>>>>>>>>>>>> would be >>> >>>>>>>>>>>>>> a >>> >>>>>>>>>>>>>> means >>> >>>>>>>>>>>>>> to >>> >>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>> >>>>>>>>>>>>>> memory, >>> >>>>>>>>>>>>>> but I >>> >>>>>>>>>>>>>> would expect that to be a tall order :) >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Thanks, >>> >>>>>>>>>>>>>> /Martin >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Hi Martin. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>> >>>>>>>>>>>>> the >>> >>>>>>>>>>>>> problem, >>> >>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>> >>>>>>>>>>>>> and >>> >>>>>>>>>>>>> stuff). 
>>> >>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>> >>>>>>>>>>>>> isolate >>> >>>>>>>>>>>>> a >>> >>>>>>>>>>>>> part you can share freely :) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>>> >>> >> >>> >> >> >> > From mak at issuu.com Mon Mar 17 17:05:15 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 17:05:15 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Message-ID: <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Thanks :) /Martin > On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: > > ok. > > so as you can probably see, the max is not that big, which means the > GC is really incremental. What happens is you get tons of garbage that > survives minor collection every now and then. I don't exactly know > why, but you should look what objects can potentially survive for too > long. > >> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: >> Ah - it just occurred to me that the first runs may be slow anyway: Since we >> take the average of the last 100 runs as the benchmark, then the first 100 >> runs are not classified as slow. Indeed, the first three runs with many >> collections are in the first 100 runs. >> >> >>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >>> >>> Here are the total and max times in millions of units; 30000 units is >>> approximately 13 seconds. I have extracted the runs where there are many >>> gc-collect-steps. These are in execution order, so the first runs with many >>> gc-collect-steps aren't slow. 
>>> >>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >>> gc-minor:10 gc-collect-step:247 >>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >>> gc-minor:10 gc-collect-step:245 >>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >>> gc-minor:11 gc-collect-step:244 >>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 >>> Max: gc-minor:17 gc-collect-step:244 >>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 >>> Max: gc-minor:11 gc-collect-step:248 >>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 >>> Max: gc-minor:8 gc-collect-step:299 >>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 >>> Max: gc-minor:11 gc-collect-step:246 >>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 >>> Max: gc-minor:8 gc-collect-step:244 >>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 >>> Max: gc-minor:36 gc-collect-step:248 >>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 >>> Max: gc-minor:8 gc-collect-step:244 >>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 >>> Max: gc-minor:8 gc-collect-step:245 >>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 >>> Max: gc-minor:8 gc-collect-step:244 >>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 >>> Max: gc-minor:38 gc-collect-step:244 >>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 >>> Max: gc-minor:23 gc-collect-step:245 >>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 >>> Max: gc-minor:8 gc-collect-step:246 >>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 >>> Max: gc-minor:9 gc-collect-step:244 >>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 >>> Max: gc-minor:8 gc-collect-step:246 >>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 
gc-collect-step:31179 >>> Max: gc-minor:8 gc-collect-step:248 >>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 >>> Max: gc-minor:8 gc-collect-step:250 >>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 >>> Max: gc-minor:8 gc-collect-step:245 >>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 >>> Max: gc-minor:543 gc-collect-step:244 >>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 >>> Max: gc-minor:20 gc-collect-step:246 >>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 >>> Max: gc-minor:25 gc-collect-step:245 >>> >>> Thanks, >>> /Martin >>> >>> >>>> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: >>>> >>>> Ah. I had misunderstood. I'll get back to you on that :) thanks >>>> >>>> /Martin >>>> >>>> >>>>> On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: >>>>> >>>>> eh, this is not what I need >>>>> >>>>> I need a max of TIME it took for a gc-minor and the TOTAL time it took >>>>> for a gc-minor (per query) (ideally same for gc-walkroots and >>>>> gc-collect-step) >>>>> >>>>>> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >>>>>> Here are the collated results of running each query. For each run, I >>>>>> count >>>>>> how many of each of the pypy debug lines i get. I.e. there were 668 >>>>>> runs >>>>>> that printed 58 loglines that contain "{gc-minor" which was eventually >>>>>> followed by "gc-minor}". I have also counted if the query was slow; >>>>>> interestingly, not all the queries with many gc-minors were slow (but >>>>>> all >>>>>> slow queries had a gc-minor). 
>>>>>> >>>>>> Please let me know if this is unclear :) >>>>>> >>>>>> 668 gc-minor:58 gc-minor-walkroots:58 >>>>>> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >>>>>> 140 gc-minor:59 gc-minor-walkroots:59 >>>>>> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >>>>>> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >>>>>> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >>>>>> gc-collect-step:9589 >>>>>> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >>>>>> gc-collect-step:9590 >>>>>> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >>>>>> gc-collect-step:9609 >>>>>> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >>>>>> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >>>>>> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >>>>>> jit-backend-dump:78 >>>>>> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 >>>>>> gc-minor:9030 >>>>>> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >>>>>> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >>>>>> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >>>>>> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 >>>>>> jit-resume:84 >>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >>>>>> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >>>>>> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >>>>>> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >>>>>> jit-resume:14 >>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >>>>>> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>>>>> jit-abort:3 >>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>>>>> jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 jit-resume:84 >>>>>> 2 jit-log-compiling-loop:1 
jit-backend-dump:78 jit-backend:3 >>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >>>>>> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>>>>> jit-abort:3 >>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>>>>> jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 jit-resume:84 >>>>>> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >>>>>> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >>>>>> gc-minor:61 >>>>>> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >>>>>> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >>>>>> jit-abort:3 >>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >>>>>> jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 jit-resume:104 >>>>>> >>>>>> >>>>>> Thanks, >>>>>> /Martin >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >>>>>> wrote: >>>>>>> >>>>>>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >>>>>>> >>>>>>> wrote: >>>>>>>> are you *sure* it's the walkroots that take that long and not >>>>>>>> something else (like gc-minor)? More of those mean that you allocate >>>>>>>> a >>>>>>>> lot more surviving objects. Can you do two things: >>>>>>>> >>>>>>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >>>>>>>> b) take the sum of those >>>>>>>> >>>>>>>> and plot them >>>>>>> >>>>>>> ^^^ or just paste the results actually >>>>>>> >>>>>>>> >>>>>>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >>>>>>>>> Well, then it works out to around 2.5GHz, which seems reasonable. >>>>>>>>> But >>>>>>>>> it >>>>>>>>> doesn't alter the conclusion from the previous email: The slow >>>>>>>>> queries >>>>>>>>> then >>>>>>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >>>>>>>>> units, or >>>>>>>>> .4 seconds at this conversion. 
Also, the log shows that a slow >>>>>>>>> query >>>>>>>>> performs many more gc-minor operations than a 'normal' one: 9600 >>>>>>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >>>>>>>>> >>>>>>>>> So the question becomes: Why do we get this large spike in >>>>>>>>> gc-minor-walkroots, and, in particular, is there any way to avoid >>>>>>>>> it :) >>>>>>>>> ? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> /Martin >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >>>>>>>>> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> I think it's the cycles of your CPU >>>>>>>>>> >>>>>>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >>>>>>>>>>> wrote: >>>>>>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >>>>>>>>>>> correlate it >>>>>>>>>>> with seconds (which the program does print out). Slow runs are >>>>>>>>>>> around 13 >>>>>>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >>>>>>>>>>> (e.g. >>>>>>>>>>> from >>>>>>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>>>>>>>>>> >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> The number of lines is nonsense. This is a timestamp in hex. >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >>>>>>>>>>>> wrote: >>>>>>>>>>>>> Based On Maciej's suggestion, I tried the following >>>>>>>>>>>>> >>>>>>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >>>>>>>>>>>>> >>>>>>>>>>>>> This generates a logfile which looks something like this >>>>>>>>>>>>> >>>>>>>>>>>>> start--> >>>>>>>>>>>>> [2b99f1981b527e] {gc-minor >>>>>>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >>>>>>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >>>>>>>>>>>>> [2b99f19890d750] gc-minor} >>>>>>>>>>>>> [snip] >>>>>>>>>>>>> ... >>>>>>>>>>>>> <--stop >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> It turns out that the culprit is a lot of MINOR collections. 
>>>>>>>>>>>>> >>>>>>>>>>>>> I base this on the following observations: >>>>>>>>>>>>> >>>>>>>>>>>>> I can't understand the format of the timestamp on each logline >>>>>>>>>>>>> (the >>>>>>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this >>>>>>>>>>>>> should >>>>>>>>>>>>> be >>>>>>>>>>>>> output >>>>>>>>>>>>> from time.clock(), but that doesn't return a number like that >>>>>>>>>>>>> when I >>>>>>>>>>>>> run >>>>>>>>>>>>> pypy interactively >>>>>>>>>>>>> Instead, I count the number of debug lines between start--> and >>>>>>>>>>>>> the >>>>>>>>>>>>> corresponding <--stop. >>>>>>>>>>>>> Most runs have a few hundred lines of output between start/stop >>>>>>>>>>>>> All slow runs have very close to 57800 lines out output between >>>>>>>>>>>>> start/stop >>>>>>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >>>>>>>>>>>>> gc-minor >>>>>>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> /Martin >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>>>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >>>>>>>>>>>>>> stdout) >>>>>>>>>>>>>> which will do that for you btw. >>>>>>>>>>>>>> >>>>>>>>>>>>>> maybe you can find out what's that using profiling or >>>>>>>>>>>>>> valgrind? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> I have tried getting the pypy source and building my own >>>>>>>>>>>>>>> version >>>>>>>>>>>>>>> of >>>>>>>>>>>>>>> pypy. I >>>>>>>>>>>>>>> have modified >>>>>>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> print out when it starts and when it stops. 
Apparently, the >>>>>>>>>>>>>>> slow >>>>>>>>>>>>>>> queries >>>>>>>>>>>>>>> do >>>>>>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >>>>>>>>>>>>>>> observed >>>>>>>>>>>>>>> major >>>>>>>>>>>>>>> step output during a query execution. So, apparently, >>>>>>>>>>>>>>> something >>>>>>>>>>>>>>> else >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>> blocking. This could be another aspect of the GC, but it >>>>>>>>>>>>>>> could >>>>>>>>>>>>>>> also >>>>>>>>>>>>>>> be >>>>>>>>>>>>>>> anything else. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just to be sure, I have tried running the same application in >>>>>>>>>>>>>>> python >>>>>>>>>>>>>>> with >>>>>>>>>>>>>>> garbage collection disabled. I don't see the problem there, >>>>>>>>>>>>>>> so >>>>>>>>>>>>>>> it >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>> somehow >>>>>>>>>>>>>>> related to either GC or the runtime somehow. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> /Martin >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >>>>>>>>>>>>>>>> same >>>>>>>>>>>>>>>> issue. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We basically generate a linked list of objects. To increase >>>>>>>>>>>>>>>> connectedness, >>>>>>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >>>>>>>>>>>>>>>> randomly >>>>>>>>>>>>>>>> chosen >>>>>>>>>>>>>>>> previous elements in the list. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We then time a function that traverses 50000 elements from >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> list >>>>>>>>>>>>>>>> from a >>>>>>>>>>>>>>>> random start point. If the traversal reaches the end of the >>>>>>>>>>>>>>>> list, >>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>> instead >>>>>>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>> traversed >>>>>>>>>>>>>>>> every time. 
To generate some garbage, we build a list >>>>>>>>>>>>>>>> holding >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> traversed >>>>>>>>>>>>>>>> elements and a dummy list of characters. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >>>>>>>>>>>>>>>> buffer. If >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> elapsed time for the last run is more than twice the average >>>>>>>>>>>>>>>> time, >>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>> print >>>>>>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >>>>>>>>>>>>>>>> runtime >>>>>>>>>>>>>>>> (we >>>>>>>>>>>>>>>> would like to see that the mean runtime does not increase >>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> number of >>>>>>>>>>>>>>>> elements in the list, but that the max time does increase >>>>>>>>>>>>>>>> (linearly >>>>>>>>>>>>>>>> with the >>>>>>>>>>>>>>>> number of object, i guess); traversing 50K elements should >>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>> independent of >>>>>>>>>>>>>>>> the memory size). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We have tried monitoring memory consumption by external >>>>>>>>>>>>>>>> inspection, >>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >>>>>>>>>>>>>>>> same >>>>>>>>>>>>>>>> time >>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't >>>>>>>>>>>>>>>> always >>>>>>>>>>>>>>>> return >>>>>>>>>>>>>>>> freed >>>>>>>>>>>>>>>> pages back to the OS? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Using top, we observe that 10M elements allocates around >>>>>>>>>>>>>>>> 17GB >>>>>>>>>>>>>>>> after >>>>>>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>>>>>>>>>>>>>>> 35GB >>>>>>>>>>>>>>>> shortly >>>>>>>>>>>>>>>> after building). 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is output from a few runs with different number of >>>>>>>>>>>>>>>> elements: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pypy mem.py 10000000 >>>>>>>>>>>>>>>> start build >>>>>>>>>>>>>>>> end build 84.142424 >>>>>>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>>>>>>>>>>>>>>> 1.495401 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >>>>>>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>>>>>>>>>>>>>>> 1.488160 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >>>>>>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>>>>>>>>>>>>>>> 1.474563 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pypy mem.py 20000000 >>>>>>>>>>>>>>>> start build >>>>>>>>>>>>>>>> end build 180.823105 >>>>>>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>>>>>>>>>>>>>>> 2.295146 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >>>>>>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>>>>>>>>>>>>>>> 2.283927 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >>>>>>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>>>>>>>>>>>>>>> 2.279631 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pypy mem.py 30000000 >>>>>>>>>>>>>>>> start build >>>>>>>>>>>>>>>> end build 276.217811 >>>>>>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>>>>>>>>>>>>>>> 3.188464 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >>>>>>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>>>>>>>>>>>>>>> 3.183003 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>>>>>>>>>>>>>>> 3.190782 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >>>>>>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>>>>>>>>>>>>>>> 3.239637 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Code below >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>>>>>>> import time >>>>>>>>>>>>>>>> from random import randint, choice >>>>>>>>>>>>>>>> import sys >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> allElems = {} >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> class Node: >>>>>>>>>>>>>>>> def __init__(self, v_): >>>>>>>>>>>>>>>> self.v = v_ >>>>>>>>>>>>>>>> self.next = None >>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) >>>>>>>>>>>>>>>> for _ in xrange(randint(50,100))] >>>>>>>>>>>>>>>> allElems[self.v] = self >>>>>>>>>>>>>>>> if self.v > 0: >>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >>>>>>>>>>>>>>>> self.v-1)] >>>>>>>>>>>>>>>> for _ >>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>> xrange(10)] >>>>>>>>>>>>>>>> else: >>>>>>>>>>>>>>>> self.dummy_links = [self] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> def set_next(self, l): >>>>>>>>>>>>>>>> self.next = l >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> def follow(node): >>>>>>>>>>>>>>>> acc = [] >>>>>>>>>>>>>>>> count = 0 >>>>>>>>>>>>>>>> cur = node >>>>>>>>>>>>>>>> assert node.v is not None >>>>>>>>>>>>>>>> assert cur is not None >>>>>>>>>>>>>>>> while count < 50000: >>>>>>>>>>>>>>>> # return a value; generate some garbage >>>>>>>>>>>>>>>> acc.append((cur.v, >>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>> x >>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>> xrange(100)])) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> # if we have reached the end, chose a random link >>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>> cur.next >>>>>>>>>>>>>>>> count += 1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> return acc >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> def build(num_elems): >>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>> print "start build" >>>>>>>>>>>>>>>> root = Node(0) >>>>>>>>>>>>>>>> cur = root >>>>>>>>>>>>>>>> for x in xrange(1, 
num_elems): >>>>>>>>>>>>>>>> e = Node(x) >>>>>>>>>>>>>>>> cur.next = e >>>>>>>>>>>>>>>> cur = e >>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) >>>>>>>>>>>>>>>> return root >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> num_timings = 100 >>>>>>>>>>>>>>>> if __name__ == "__main__": >>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) >>>>>>>>>>>>>>>> build(num_elems) >>>>>>>>>>>>>>>> total = 0 >>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >>>>>>>>>>>>>>>> num_timings >>>>>>>>>>>>>>>> runs >>>>>>>>>>>>>>>> i = 0 >>>>>>>>>>>>>>>> beginning = time.time() >>>>>>>>>>>>>>>> while time.time() - beginning < 600: >>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >>>>>>>>>>>>>>>> assert(elem is not None) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> lst = follow(elem) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for >>>>>>>>>>>>>>>> something >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> end = time.time() >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> elapsed = end-start >>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed >>>>>>>>>>>>>>>> if (i > num_timings): >>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>>>>>>>>>>>>>>> defined >>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2*avg run time >>>>>>>>>>>>>>>> if (elapsed > slow_time): >>>>>>>>>>>>>>>> print "that took a long time elapsed: %f >>>>>>>>>>>>>>>> slow_threshold: >>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >>>>>>>>>>>>>>>> (elapsed, slow_time, >>>>>>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >>>>>>>>>>>>>>>> i += 1 >>>>>>>>>>>>>>>> print total >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> Hi Armin, Maciej 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for responding. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> code >>>>>>>>>>>>>>>>>> I'm >>>>>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >>>>>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>> means >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>>>>>>>>>>> memory, >>>>>>>>>>>>>>>>>> but I >>>>>>>>>>>>>>>>>> would expect that to be a tall order :) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> /Martin >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Martin. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> problem, >>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> stuff). >>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>>>>>>>>>>> isolate >>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>> part you can share freely :) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>> >>>>>> >>> >>> >> From mak at issuu.com Mon Mar 17 15:24:46 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 15:24:46 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Ah. I had misunderstood. 
I'll get back to you on that :) thanks /Martin > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: > > eh, this is not what I need > > I need a max of TIME it took for a gc-minor and the TOTAL time it took > for a gc-minor (per query) (ideally same for gc-walkroots and > gc-collect-step) > >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >> Here are the collated results of running each query. For each run, I count >> how many of each of the pypy debug lines i get. I.e. there were 668 runs >> that printed 58 loglines that contain "{gc-minor" which was eventually >> followed by "gc-minor}". I have also counted if the query was slow; >> interestingly, not all the queries with many gc-minors were slow (but all >> slow queries had a gc-minor). >> >> Please let me know if this is unclear :) >> >> 668 gc-minor:58 gc-minor-walkroots:58 >> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >> 140 gc-minor:59 gc-minor-walkroots:59 >> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589 >> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590 >> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609 >> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >> 1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 >> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 >> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 >> jit-resume:84 >> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >> jit-optimize:2 
jit-log-short-preamble:1 jit-backend-addr:1 >> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >> jit-resume:14 >> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 jit-resume:84 >> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 jit-resume:84 >> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 >> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 jit-resume:104 >> >> >> Thanks, >> /Martin >> >> >> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >> wrote: >>> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >>> wrote: >>>> are you *sure* it's the walkroots that take that long and not >>>> something else (like gc-minor)? More of those mean that you allocate a >>>> lot more surviving objects. 
Can you do two things: >>>> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >>>> b) take the sum of those >>>> >>>> and plot them >>> >>> ^^^ or just paste the results actually >>> >>>> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. But >>>>> it >>>>> doesn't alter the conclusion from the previous email: The slow queries >>>>> then >>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >>>>> units, or >>>>> .4 seconds at this conversion. Also, the log shows that a slow query >>>>> performs many more gc-minor operations than a 'normal' one: 9600 >>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >>>>> >>>>> So the question becomes: Why do we get this large spike in >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid it :) >>>>> ? >>>>> >>>>> Thanks, >>>>> /Martin >>>>> >>>>> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >>>>> wrote: >>>>>> >>>>>> I think it's the cycles of your CPU >>>>>> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >>>>>>> correlate it >>>>>>> with seconds (which the program does print out). Slow runs are >>>>>>> around 13 >>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >>>>>>> (e.g. >>>>>>> from >>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>>>>>> >>>>>>> wrote: >>>>>>>> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex. 
>>>>>>>> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >>>>>>>> wrote: >>>>>>>>> Based on Maciej's suggestion, I tried the following >>>>>>>>> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >>>>>>>>> >>>>>>>>> This generates a logfile which looks something like this >>>>>>>>> >>>>>>>>> start--> >>>>>>>>> [2b99f1981b527e] {gc-minor >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >>>>>>>>> [2b99f19890d750] gc-minor} >>>>>>>>> [snip] >>>>>>>>> ... >>>>>>>>> <--stop >>>>>>>>> >>>>>>>>> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections. >>>>>>>>> >>>>>>>>> I base this on the following observations: >>>>>>>>> >>>>>>>>> I can't understand the format of the timestamp on each logline >>>>>>>>> (the >>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this should >>>>>>>>> be >>>>>>>>> output >>>>>>>>> from time.clock(), but that doesn't return a number like that >>>>>>>>> when I >>>>>>>>> run >>>>>>>>> pypy interactively >>>>>>>>> Instead, I count the number of debug lines between start--> and >>>>>>>>> the >>>>>>>>> corresponding <--stop. >>>>>>>>> Most runs have a few hundred lines of output between start/stop >>>>>>>>> All slow runs have very close to 57800 lines of output between >>>>>>>>> start/stop >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >>>>>>>>> gc-minor >>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> /Martin >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>>>>>>>> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >>>>>>>>>> stdout) >>>>>>>>>> which will do that for you btw. >>>>>>>>>> >>>>>>>>>> maybe you can find out what that is using profiling or valgrind?
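The collation described above (counting the debug sections between `start-->` and `<--stop`) and the per-query sum/max that Maciej asks for can be sketched as a small parser. This is an editorial illustration, not code from the thread; it assumes the log format shown above, section names without spaces, and the thread's ~2.5 GHz estimate for converting the bracketed hex timestamps to seconds:

```python
import re
from collections import Counter, defaultdict

TSC_HZ = 2.5e9  # assumption: timestamp ticks ~= CPU cycles at ~2.5 GHz (thread's estimate)

OPEN = re.compile(r'\[([0-9a-f]+)\] \{(\S+)')    # e.g. "[2b99f1981b527e] {gc-minor"
CLOSE = re.compile(r'\[([0-9a-f]+)\] (\S+)\}')   # e.g. "[2b99f19890d750] gc-minor}"

def collate(lines):
    """Yield (counts, totals, maxes) per query, where counts maps each debug
    section (gc-minor, gc-collect-step, ...) to how often it opened, and
    totals/maxes map it to the summed/largest duration in seconds."""
    counts = totals = maxes = opened = None
    for line in lines:
        line = line.strip()
        if line.startswith('start-->'):          # query begins (driver's marker)
            counts, totals, maxes, opened = Counter(), defaultdict(float), defaultdict(float), {}
        elif line.startswith('<--stop'):         # query ends: emit its stats
            yield counts, dict(totals), dict(maxes)
            counts = None
        elif counts is not None:
            m = OPEN.match(line)
            if m:
                counts[m.group(2)] += 1
                opened[m.group(2)] = int(m.group(1), 16)
                continue
            m = CLOSE.match(line)
            if m and m.group(2) in opened:
                dt = (int(m.group(1), 16) - opened.pop(m.group(2))) / TSC_HZ
                totals[m.group(2)] += dt
                maxes[m.group(2)] = max(maxes[m.group(2)], dt)
```

Feeding it the `out` file from `PYPYLOG=- pypy mem.py 10000000 > out` would give, per query, both the section counts collated in this thread and the per-section total/max times requested by Maciej.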
>>>>>>>>>> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >>>>>>>>>> wrote: >>>>>>>>>>> I have tried getting the pypy source and building my own >>>>>>>>>>> version >>>>>>>>>>> of >>>>>>>>>>> pypy. I >>>>>>>>>>> have modified >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>>>>>>>>>> to >>>>>>>>>>> print out when it starts and when it stops. Apparently, the >>>>>>>>>>> slow >>>>>>>>>>> queries >>>>>>>>>>> do >>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >>>>>>>>>>> observed >>>>>>>>>>> major >>>>>>>>>>> step output during a query execution. So, apparently, >>>>>>>>>>> something >>>>>>>>>>> else >>>>>>>>>>> is >>>>>>>>>>> blocking. This could be another aspect of the GC, but it could >>>>>>>>>>> also >>>>>>>>>>> be >>>>>>>>>>> anything else. >>>>>>>>>>> >>>>>>>>>>> Just to be sure, I have tried running the same application in >>>>>>>>>>> python >>>>>>>>>>> with >>>>>>>>>>> garbage collection disabled. I don't see the problem there, so >>>>>>>>>>> it >>>>>>>>>>> is >>>>>>>>>>> somehow >>>>>>>>>>> related to either GC or the runtime somehow. >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> /Martin >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >>>>>>>>>>>> same >>>>>>>>>>>> issue. >>>>>>>>>>>> >>>>>>>>>>>> We basically generate a linked list of objects. To increase >>>>>>>>>>>> connectedness, >>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >>>>>>>>>>>> randomly >>>>>>>>>>>> chosen >>>>>>>>>>>> previous elements in the list. >>>>>>>>>>>> >>>>>>>>>>>> We then time a function that traverses 50000 elements from >>>>>>>>>>>> the >>>>>>>>>>>> list >>>>>>>>>>>> from a >>>>>>>>>>>> random start point. If the traversal reaches the end of the >>>>>>>>>>>> list, >>>>>>>>>>>> we >>>>>>>>>>>> instead >>>>>>>>>>>> traverse one of the dummy links. 
Thus, exactly 50K elements >>>>>>>>>>>> are >>>>>>>>>>>> traversed >>>>>>>>>>>> every time. To generate some garbage, we build a list holding >>>>>>>>>>>> the >>>>>>>>>>>> traversed >>>>>>>>>>>> elements and a dummy list of characters. >>>>>>>>>>>> >>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >>>>>>>>>>>> buffer. If >>>>>>>>>>>> the >>>>>>>>>>>> elapsed time for the last run is more than twice the average >>>>>>>>>>>> time, >>>>>>>>>>>> we >>>>>>>>>>>> print >>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >>>>>>>>>>>> runtime >>>>>>>>>>>> (we >>>>>>>>>>>> would like to see that the mean runtime does not increase >>>>>>>>>>>> with >>>>>>>>>>>> the >>>>>>>>>>>> number of >>>>>>>>>>>> elements in the list, but that the max time does increase >>>>>>>>>>>> (linearly >>>>>>>>>>>> with the >>>>>>>>>>>> number of object, i guess); traversing 50K elements should be >>>>>>>>>>>> independent of >>>>>>>>>>>> the memory size). >>>>>>>>>>>> >>>>>>>>>>>> We have tried monitoring memory consumption by external >>>>>>>>>>>> inspection, >>>>>>>>>>>> but >>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >>>>>>>>>>>> same >>>>>>>>>>>> time >>>>>>>>>>>> that >>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't always >>>>>>>>>>>> return >>>>>>>>>>>> freed >>>>>>>>>>>> pages back to the OS? >>>>>>>>>>>> >>>>>>>>>>>> Using top, we observe that 10M elements allocates around 17GB >>>>>>>>>>>> after >>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>>>>>>>>>>> 35GB >>>>>>>>>>>> shortly >>>>>>>>>>>> after building). 
>>>>>>>>>>>> >>>>>>>>>>>> Here is output from a few runs with different number of >>>>>>>>>>>> elements: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> pypy mem.py 10000000 >>>>>>>>>>>> start build >>>>>>>>>>>> end build 84.142424 >>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>>>>>>>>>>> 1.495401 >>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>>>>>>>>>>> 1.488160 >>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>>>>>>>>>>> 1.474563 >>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >>>>>>>>>>>> >>>>>>>>>>>> pypy mem.py 20000000 >>>>>>>>>>>> start build >>>>>>>>>>>> end build 180.823105 >>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>>>>>>>>>>> 2.295146 >>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>>>>>>>>>>> 2.283927 >>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>>>>>>>>>>> 2.279631 >>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >>>>>>>>>>>> >>>>>>>>>>>> pypy mem.py 30000000 >>>>>>>>>>>> start build >>>>>>>>>>>> end build 276.217811 >>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>>>>>>>>>>> 3.188464 >>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>>>>>>>>>>> 3.183003 >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>>>>>>>>>>> 3.190782 >>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>>>>>>>>>>> 3.239637 >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>> >>>>>>>>>>>> Code below >>>>>>>>>>>> >>>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>>> import time >>>>>>>>>>>> from random import randint, 
choice
>>>>>>>>>>>> import sys
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> allElems = {}
>>>>>>>>>>>>
>>>>>>>>>>>> class Node:
>>>>>>>>>>>>     def __init__(self, v_):
>>>>>>>>>>>>         self.v = v_
>>>>>>>>>>>>         self.next = None
>>>>>>>>>>>>         self.dummy_data = [randint(0,100) for _ in xrange(randint(50,100))]
>>>>>>>>>>>>         allElems[self.v] = self
>>>>>>>>>>>>         if self.v > 0:
>>>>>>>>>>>>             self.dummy_links = [allElems[randint(0, self.v-1)] for _ in xrange(10)]
>>>>>>>>>>>>         else:
>>>>>>>>>>>>             self.dummy_links = [self]
>>>>>>>>>>>>
>>>>>>>>>>>>     def set_next(self, l):
>>>>>>>>>>>>         self.next = l
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> def follow(node):
>>>>>>>>>>>>     acc = []
>>>>>>>>>>>>     count = 0
>>>>>>>>>>>>     cur = node
>>>>>>>>>>>>     assert node.v is not None
>>>>>>>>>>>>     assert cur is not None
>>>>>>>>>>>>     while count < 50000:
>>>>>>>>>>>>         # return a value; generate some garbage
>>>>>>>>>>>>         acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in xrange(100)]))
>>>>>>>>>>>>
>>>>>>>>>>>>         # if we have reached the end, choose a random link
>>>>>>>>>>>>         cur = choice(cur.dummy_links) if cur.next is None else cur.next
>>>>>>>>>>>>         count += 1
>>>>>>>>>>>>
>>>>>>>>>>>>     return acc
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> def build(num_elems):
>>>>>>>>>>>>     start = time.time()
>>>>>>>>>>>>     print "start build"
>>>>>>>>>>>>     root = Node(0)
>>>>>>>>>>>>     cur = root
>>>>>>>>>>>>     for x in xrange(1, num_elems):
>>>>>>>>>>>>         e = Node(x)
>>>>>>>>>>>>         cur.next = e
>>>>>>>>>>>>         cur = e
>>>>>>>>>>>>     print "end build %f" % (time.time() - start)
>>>>>>>>>>>>     return root
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> num_timings = 100
>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>     num_elems = int(sys.argv[1])
>>>>>>>>>>>>     build(num_elems)
>>>>>>>>>>>>     total = 0
>>>>>>>>>>>>     timings = [0.0] * num_timings  # run times for the last num_timings runs
>>>>>>>>>>>>     i = 0
>>>>>>>>>>>>     beginning = time.time()
>>>>>>>>>>>>     while time.time() - beginning < 600:
>>>>>>>>>>>>         start = time.time()
>>>>>>>>>>>>         elem = allElems[randint(0, num_elems - 1)]
>>>>>>>>>>>>         assert(elem is not None)
>>>>>>>>>>>>
>>>>>>>>>>>>         lst = follow(elem)
>>>>>>>>>>>>
>>>>>>>>>>>>         total += choice(lst)[0]  # use the return value for something
>>>>>>>>>>>>
>>>>>>>>>>>>         end = time.time()
>>>>>>>>>>>>
>>>>>>>>>>>>         elapsed = end-start
>>>>>>>>>>>>         timings[i % num_timings] = elapsed
>>>>>>>>>>>>         if (i > num_timings):
>>>>>>>>>>>>             slow_time = 2 * sum(timings)/num_timings  # slow defined as > 2*avg run time
>>>>>>>>>>>>             if (elapsed > slow_time):
>>>>>>>>>>>>                 print "that took a long time elapsed: %f slow_threshold: %f 90th_quantile_runtime: %f" % \
>>>>>>>>>>>>                     (elapsed, slow_time, sorted(timings)[int(num_timings*.9)])
>>>>>>>>>>>>         i += 1
>>>>>>>>>>>>     print total
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote:
>>>>>>>>>>>>>> Hi Armin, Maciej
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for responding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of the code I'm in a position to share, and I'll get back to you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Allowing hinting to the GC would be good.
Even better >>>>>>>>>>>>>> would be >>>>>>>>>>>>>> a >>>>>>>>>>>>>> means >>>>>>>>>>>>>> to >>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>>>>>>> memory, >>>>>>>>>>>>>> but I >>>>>>>>>>>>>> would expect that to be a tall order :) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> /Martin >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Martin. >>>>>>>>>>>>> >>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>>>>>>> the >>>>>>>>>>>>> problem, >>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>>>>>>> and >>>>>>>>>>>>> stuff). >>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>>>>>>> isolate >>>>>>>>>>>>> a >>>>>>>>>>>>> part you can share freely :) From rajul.iitkgp at gmail.com Mon Mar 17 17:34:37 2014 From: rajul.iitkgp at gmail.com (Rajul Srivastava) Date: Mon, 17 Mar 2014 22:04:37 +0530 Subject: [pypy-dev] GSOC: Introduction and Interested in Numpy Improvements Project Message-ID: Hi all, My name is Rajul, and I am a final year undergraduate student at the Indian Institute of Technology Kharagpur. I wish to participate in Google Summer of Code 2014, and while going through the list of organisations, I came across PyPy. I am proficient in the programming languages C/C++, Python, Java, Groovy, and Ruby. I am very interested in the fields of Algorithms, Computational Sciences, and Software Engineering. I have always been interested in programming and in the past I have participated in Google Summer of Code 2012, with the organisation Network Time Foundation, working on the project "Improving the Logging/Debugging System of Network Time Protocol Software". I have also interned in the Global Technology division of Barclays, during the summer of 2013, working with the Market Risk IT team.
Besides, I have worked on a few research projects in the fields of Computational Finance, Complex Networks, and Computational Chemistry. I am currently working on my thesis project in the field of Computational Sciences, titled "Network Analysis of Chemical Reactions". I have taken courses in Programming and Data Structures, Complex Networks, Distributed Systems, Algorithms, and Operations Research. I have gone through the list of project ideas and found all of them very interesting. Although I find all the listed projects worthwhile, I am particularly interested in the "Numpy Improvements" project. I suppose that my programming background is suitable for these projects. I shall be grateful if anyone can help me with references to the literature that I may use, and also shed some light on how I can go about making a successful proposal. Thanks!! Best Regards, Rajul From fijall at gmail.com Mon Mar 17 17:39:13 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 18:39:13 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: not sure how much more we can help without looking into the code On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch wrote: > Thanks :) > > /Martin > > >> On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: >> >> ok. >> >> so as you can probably see, the max is not that big, which means the >> GC is really incremental. What happens is you get tons of garbage that >> survives minor collection every now and then. I don't exactly know >> why, but you should look at what objects can potentially survive for too >> long.
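Maciej's diagnosis, that slow queries coincide with lots of objects surviving minor collection, comes down to an allocation pattern like the toy sketch below (editorial illustration, not code from the thread): objects that are dropped immediately die young and are nearly free for a generational GC, while objects kept reachable, like mem.py's `acc` list and its multi-gigabyte node graph, survive the nursery and must be traced and promoted during every gc-minor:

```python
def churn(n, keep_alive):
    """Allocate n small lists. With keep_alive=True every allocation stays
    reachable, so it survives the nursery and the GC must trace and promote
    it; with keep_alive=False each one dies young and minor collections
    stay cheap."""
    survivors = []
    for i in range(n):
        obj = [i] * 10  # fresh nursery allocation
        if keep_alive:
            survivors.append(obj)  # keeps obj reachable past the next gc-minor
    return survivors
```

Running both variants under `PYPYLOG=gc:-` would be one way to compare how the two patterns affect gc-minor times.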
>> >>> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: >>> Ah - it just occurred to me that the first runs may be slow anyway: Since we >>> take the average of the last 100 runs as the benchmark, then the first 100 >>> runs are not classified as slow. Indeed, the first three runs with many >>> collections are in the first 100 runs. >>> >>> >>>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >>>> >>>> Here are the total and max times in millions of units; 30000 units is >>>> approximately 13 seconds. I have extracted the runs where there are many >>>> gc-collect-steps. These are in execution order, so the first runs with many >>>> gc-collect-steps aren't slow. >>>> >>>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >>>> gc-minor:10 gc-collect-step:247 >>>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >>>> gc-minor:10 gc-collect-step:245 >>>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >>>> gc-minor:11 gc-collect-step:244 >>>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 >>>> Max: gc-minor:17 gc-collect-step:244 >>>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 >>>> Max: gc-minor:11 gc-collect-step:248 >>>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 >>>> Max: gc-minor:8 gc-collect-step:299 >>>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 >>>> Max: gc-minor:11 gc-collect-step:246 >>>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 >>>> Max: gc-minor:8 gc-collect-step:244 >>>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 >>>> Max: gc-minor:36 gc-collect-step:248 >>>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 >>>> Max: gc-minor:8 gc-collect-step:244 >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 >>>> Max: gc-minor:8 gc-collect-step:245 >>>> Totals: gc-minor:387 slow:1
gc-minor-walkroots:0 gc-collect-step:30837 >>>> Max: gc-minor:8 gc-collect-step:244 >>>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 >>>> Max: gc-minor:38 gc-collect-step:244 >>>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 >>>> Max: gc-minor:23 gc-collect-step:245 >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 >>>> Max: gc-minor:8 gc-collect-step:246 >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 >>>> Max: gc-minor:9 gc-collect-step:244 >>>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 >>>> Max: gc-minor:8 gc-collect-step:246 >>>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 >>>> Max: gc-minor:8 gc-collect-step:248 >>>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 >>>> Max: gc-minor:8 gc-collect-step:250 >>>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 >>>> Max: gc-minor:8 gc-collect-step:245 >>>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 >>>> Max: gc-minor:543 gc-collect-step:244 >>>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 >>>> Max: gc-minor:20 gc-collect-step:246 >>>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 >>>> Max: gc-minor:25 gc-collect-step:245 >>>> >>>> Thanks, >>>> /Martin
>>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Code below >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>>>>>>>> import time >>>>>>>>>>>>>>>>> from random import randint, choice >>>>>>>>>>>>>>>>> import sys >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> allElems = {} >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> class Node: >>>>>>>>>>>>>>>>> def __init__(self, v_): >>>>>>>>>>>>>>>>> self.v = v_ >>>>>>>>>>>>>>>>> self.next = None >>>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) >>>>>>>>>>>>>>>>> for _ in xrange(randint(50,100))] >>>>>>>>>>>>>>>>> allElems[self.v] = self >>>>>>>>>>>>>>>>> if self.v > 0: >>>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >>>>>>>>>>>>>>>>> self.v-1)] >>>>>>>>>>>>>>>>> for _ >>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>> xrange(10)] >>>>>>>>>>>>>>>>> else: >>>>>>>>>>>>>>>>> self.dummy_links = [self] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> def set_next(self, l): >>>>>>>>>>>>>>>>> self.next = l >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> def follow(node): >>>>>>>>>>>>>>>>> acc = [] >>>>>>>>>>>>>>>>> count = 0 >>>>>>>>>>>>>>>>> cur = node >>>>>>>>>>>>>>>>> assert node.v is not None >>>>>>>>>>>>>>>>> assert cur is not None >>>>>>>>>>>>>>>>> while count < 50000: >>>>>>>>>>>>>>>>> # return a value; generate some garbage >>>>>>>>>>>>>>>>> acc.append((cur.v, >>>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>> x >>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>> xrange(100)])) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> # if we have reached the end, chose a random link >>>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>> cur.next >>>>>>>>>>>>>>>>> count += 1 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> return acc >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> def build(num_elems): >>>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>>> print 
"start build" >>>>>>>>>>>>>>>>> root = Node(0) >>>>>>>>>>>>>>>>> cur = root >>>>>>>>>>>>>>>>> for x in xrange(1, num_elems): >>>>>>>>>>>>>>>>> e = Node(x) >>>>>>>>>>>>>>>>> cur.next = e >>>>>>>>>>>>>>>>> cur = e >>>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) >>>>>>>>>>>>>>>>> return root >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> num_timings = 100 >>>>>>>>>>>>>>>>> if __name__ == "__main__": >>>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) >>>>>>>>>>>>>>>>> build(num_elems) >>>>>>>>>>>>>>>>> total = 0 >>>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >>>>>>>>>>>>>>>>> num_timings >>>>>>>>>>>>>>>>> runs >>>>>>>>>>>>>>>>> i = 0 >>>>>>>>>>>>>>>>> beginning = time.time() >>>>>>>>>>>>>>>>> while time.time() - beginning < 600: >>>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >>>>>>>>>>>>>>>>> assert(elem is not None) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> lst = follow(elem) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for >>>>>>>>>>>>>>>>> something >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> end = time.time() >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> elapsed = end-start >>>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed >>>>>>>>>>>>>>>>> if (i > num_timings): >>>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>>>>>>>>>>>>>>>> defined >>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2*avg run time >>>>>>>>>>>>>>>>> if (elapsed > slow_time): >>>>>>>>>>>>>>>>> print "that took a long time elapsed: %f >>>>>>>>>>>>>>>>> slow_threshold: >>>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >>>>>>>>>>>>>>>>> (elapsed, slow_time, >>>>>>>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >>>>>>>>>>>>>>>>> i += 1 >>>>>>>>>>>>>>>>> print total >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 
wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> Hi Armin, Maciej >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for responding. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> code >>>>>>>>>>>>>>>>>>> I'm >>>>>>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >>>>>>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>>> means >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>>>>>>>>>>>> memory, >>>>>>>>>>>>>>>>>>> but I >>>>>>>>>>>>>>>>>>> would expect that to be a tall order :) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> /Martin >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Martin. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> problem, >>>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>> stuff). 
>>>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>>>>>>>>>>>> isolate >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>> part you can share freely :) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>>>> >>>> >>>> >>> From mak at issuu.com Mon Mar 17 20:04:47 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 20:04:47 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: Well, it would appear that we have the problem because we're generating a lot of garbage in the young generation, just like we're doing in the example we've been studying here. I'm unsure how we can avoid that in our real implementation. Can we force gc of the young generation? Either by gc.collect() or implicitly somehow (does the gc e.g. kick in across function calls?). Thanks, /Martin On Mon, Mar 17, 2014 at 5:39 PM, Maciej Fijalkowski wrote: > not sure how more we can help without looking into the code > > On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch wrote: > > Thanks :) > > > > /Martin > > > > > >> On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: > >> > >> ok. > >> > >> so as you can probably see, the max is not that big, which means the > >> GC is really incremental. What happens is you get tons of garbage that > >> survives minor collection every now and then. I don't exactly know > >> why, but you should look what objects can potentially survive for too > >> long. > >> > >>> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: > >>> Ah - it just occurred to me that the first runs may be slow anyway: > Since we > >>> take the average of the last 100 runs as the benchmark, then the first > 100 > >>> runs are not classified as slow. Indeed, the first three runs with many > >>> collections are in the first 100 runs. 
> >>> > >>> > >>>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: > >>>> > >>>> Here are the total and max times in millions of units; 30000 units is > >>>> approximately 13 seconds. I have extracted the runs where there are > many > >>>> gc-collect-steps. These are in execution order, so the first runs > with many > >>>> gc-collect-steps aren't slow. > >>>> > >>>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: > >>>> gc-minor:10 gc-collect-step:247 > >>>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: > >>>> gc-minor:10 gc-collect-step:245 > >>>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: > >>>> gc-minor:11 gc-collect-step:244 > >>>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 > >>>> Max: gc-minor:17 gc-collect-step:244 > >>>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 > >>>> Max: gc-minor:11 gc-collect-step:248 > >>>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 > >>>> Max: gc-minor:8 gc-collect-step:299 > >>>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 > >>>> Max: gc-minor:11 gc-collect-step:246 > >>>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 > >>>> Max: gc-minor:8 gc-collect-step:244 > >>>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 > >>>> Max: gc-minor:36 gc-collect-step:248 > >>>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 > >>>> Max: gc-minor:8 gc-collect-step:244 > >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 > >>>> Max: gc-minor:8 gc-collect-step:245 > >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 > >>>> Max: gc-minor:8 gc-collect-step:244 > >>>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 > >>>> Max: gc-minor:38 gc-collect-step:244 > >>>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 
gc-collect-step:30407 > >>>> Max: gc-minor:23 gc-collect-step:245 > >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 > >>>> Max: gc-minor:8 gc-collect-step:246 > >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 > >>>> Max: gc-minor:9 gc-collect-step:244 > >>>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 > >>>> Max: gc-minor:8 gc-collect-step:246 > >>>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 > >>>> Max: gc-minor:8 gc-collect-step:248 > >>>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 > >>>> Max: gc-minor:8 gc-collect-step:250 > >>>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 > >>>> Max: gc-minor:8 gc-collect-step:245 > >>>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 > >>>> Max: gc-minor:543 gc-collect-step:244 > >>>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 > >>>> Max: gc-minor:20 gc-collect-step:246 > >>>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 > >>>> Max: gc-minor:25 gc-collect-step:245 > >>>> > >>>> Thanks, > >>>> /Martin > >>>> > >>>> > >>>>> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: > >>>>> > >>>>> Ah. I had misunderstood. I'll get back to you on that :) thanks > >>>>> > >>>>> /Martin > >>>>> > >>>>> > >>>>>> On 17/03/2014, at 15.21, Maciej Fijalkowski > wrote: > >>>>>> > >>>>>> eh, this is not what I need > >>>>>> > >>>>>> I need a max of TIME it took for a gc-minor and the TOTAL time it > took > >>>>>> for a gc-minor (per query) (ideally same for gc-walkroots and > >>>>>> gc-collect-step) > >>>>>> > >>>>>>> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch > wrote: > >>>>>>> Here are the collated results of running each query. For each run, > I > >>>>>>> count > >>>>>>> how many of each of the pypy debug lines i get. I.e. 
there were 668 > >>>>>>> runs > >>>>>>> that printed 58 loglines that contain "{gc-minor" which was > eventually > >>>>>>> followed by "gc-minor}". I have also counted if the query was slow; > >>>>>>> interestingly, not all the queries with many gc-minors were slow > (but > >>>>>>> all > >>>>>>> slow queries had a gc-minor). > >>>>>>> > >>>>>>> Please let me know if this is unclear :) > >>>>>>> > >>>>>>> 668 gc-minor:58 gc-minor-walkroots:58 > >>>>>>> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 > >>>>>>> 140 gc-minor:59 gc-minor-walkroots:59 > >>>>>>> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 > >>>>>>> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 > >>>>>>> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 > >>>>>>> gc-collect-step:9589 > >>>>>>> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 > >>>>>>> gc-collect-step:9590 > >>>>>>> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 > >>>>>>> gc-collect-step:9609 > >>>>>>> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 > >>>>>>> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 > >>>>>>> 1 jit-log-compiling-loop:1 gc-collect-step:8991 > >>>>>>> jit-backend-dump:78 > >>>>>>> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 > >>>>>>> gc-minor:9030 > >>>>>>> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 > >>>>>>> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 > >>>>>>> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 > >>>>>>> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 > >>>>>>> jit-resume:84 > >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 > >>>>>>> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 > gc-minor-walkroots:60 > >>>>>>> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 > >>>>>>> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 > jit-log-rewritten-loop:1 > >>>>>>> jit-resume:14 > >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:73 
jit-backend:3 > >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 > jit-tracing:3 > >>>>>>> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 > >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 > >>>>>>> jit-abort:3 > >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 > >>>>>>> jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 > >>>>>>> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 > >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 > jit-tracing:3 > >>>>>>> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 > >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 > >>>>>>> jit-abort:3 > >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 > >>>>>>> jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 > >>>>>>> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 > >>>>>>> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 > >>>>>>> gc-minor:61 > >>>>>>> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 > >>>>>>> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 > >>>>>>> jit-abort:3 > >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 > >>>>>>> jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 jit-resume:104 > >>>>>>> > >>>>>>> > >>>>>>> Thanks, > >>>>>>> /Martin > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski < > fijall at gmail.com> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski > >>>>>>>> > >>>>>>>> wrote: > >>>>>>>>> are you *sure* it's the walkroots that take that long and not > >>>>>>>>> something else (like gc-minor)? More of those mean that you > allocate > >>>>>>>>> a > >>>>>>>>> lot more surviving objects. 
Can you do two things: > >>>>>>>>> > >>>>>>>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request > >>>>>>>>> b) take the sum of those > >>>>>>>>> > >>>>>>>>> and plot them > >>>>>>>> > >>>>>>>> ^^^ or just paste the results actually > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch > wrote: > >>>>>>>>>> Well, then it works out to around 2.5GHz, which seems > reasonable. > >>>>>>>>>> But > >>>>>>>>>> it > >>>>>>>>>> doesn't alter the conclusion from the previous email: The slow > >>>>>>>>>> queries > >>>>>>>>>> then > >>>>>>>>>> all have a duration around 34*10^9 units, 'normal' queries > 1*10^9 > >>>>>>>>>> units, or > >>>>>>>>>> .4 seconds at this conversion. Also, the log shows that a slow > >>>>>>>>>> query > >>>>>>>>>> performs many more gc-minor operations than a 'normal' one: 9600 > >>>>>>>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. > >>>>>>>>>> > >>>>>>>>>> So the question becomes: Why do we get this large spike in > >>>>>>>>>> gc-minor-walkroots, and, in particular, is there any way to > avoid > >>>>>>>>>> it :) > >>>>>>>>>> ? > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> /Martin > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski > >>>>>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> I think it's the cycles of your CPU > >>>>>>>>>>> > >>>>>>>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> What is the unit? Perhaps I'm being thick here, but I can't > >>>>>>>>>>>> correlate it > >>>>>>>>>>>> with seconds (which the program does print out). Slow runs are > >>>>>>>>>>>> around 13 > >>>>>>>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp > units > >>>>>>>>>>>> (e.g. > >>>>>>>>>>>> from > >>>>>>>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). 
> >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski > >>>>>>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> The number of lines is nonsense. This is a timestamp in hex. > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch > > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> Based On Maciej's suggestion, I tried the following > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> This generates a logfile which looks something like this > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> start--> > >>>>>>>>>>>>>> [2b99f1981b527e] {gc-minor > >>>>>>>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots > >>>>>>>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} > >>>>>>>>>>>>>> [2b99f19890d750] gc-minor} > >>>>>>>>>>>>>> [snip] > >>>>>>>>>>>>>> ... > >>>>>>>>>>>>>> <--stop > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It turns out that the culprit is a lot of MINOR collections. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I base this on the following observations: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I can't understand the format of the timestamp on each > logline > >>>>>>>>>>>>>> (the > >>>>>>>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this > >>>>>>>>>>>>>> should > >>>>>>>>>>>>>> be > >>>>>>>>>>>>>> output > >>>>>>>>>>>>>> from time.clock(), but that doesn't return a number like > that > >>>>>>>>>>>>>> when I > >>>>>>>>>>>>>> run > >>>>>>>>>>>>>> pypy interactively > >>>>>>>>>>>>>> Instead, I count the number of debug lines between start--> > and > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>> corresponding <--stop. 
> >>>>>>>>>>>>>> Most runs have a few hundred lines of output between > start/stop > >>>>>>>>>>>>>> All slow runs have very close to 57800 lines out output > between > >>>>>>>>>>>>>> start/stop > >>>>>>>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 > >>>>>>>>>>>>>> gc-minor > >>>>>>>>>>>>>> operations, and 9647 gc-minor-walkroots operations. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> /Martin > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is > >>>>>>>>>>>>>>> stdout) > >>>>>>>>>>>>>>> which will do that for you btw. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> maybe you can find out what's that using profiling or > >>>>>>>>>>>>>>> valgrind? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch < > mak at issuu.com> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>> I have tried getting the pypy source and building my own > >>>>>>>>>>>>>>>> version > >>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>> pypy. I > >>>>>>>>>>>>>>>> have modified > >>>>>>>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() > >>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>> print out when it starts and when it stops. Apparently, > the > >>>>>>>>>>>>>>>> slow > >>>>>>>>>>>>>>>> queries > >>>>>>>>>>>>>>>> do > >>>>>>>>>>>>>>>> NOT occur during major_collection_step; at least, I have > not > >>>>>>>>>>>>>>>> observed > >>>>>>>>>>>>>>>> major > >>>>>>>>>>>>>>>> step output during a query execution. So, apparently, > >>>>>>>>>>>>>>>> something > >>>>>>>>>>>>>>>> else > >>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>> blocking. This could be another aspect of the GC, but it > >>>>>>>>>>>>>>>> could > >>>>>>>>>>>>>>>> also > >>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>> anything else. 
> >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Just to be sure, I have tried running the same > application in > >>>>>>>>>>>>>>>> python > >>>>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>> garbage collection disabled. I don't see the problem > there, > >>>>>>>>>>>>>>>> so > >>>>>>>>>>>>>>>> it > >>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>> somehow > >>>>>>>>>>>>>>>> related to either GC or the runtime somehow. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>>>>> /Martin > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch < > mak at issuu.com> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We have hacked up a small sample that seems to exhibit > the > >>>>>>>>>>>>>>>>> same > >>>>>>>>>>>>>>>>> issue. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We basically generate a linked list of objects. To > increase > >>>>>>>>>>>>>>>>> connectedness, > >>>>>>>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 > >>>>>>>>>>>>>>>>> randomly > >>>>>>>>>>>>>>>>> chosen > >>>>>>>>>>>>>>>>> previous elements in the list. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We then time a function that traverses 50000 elements > from > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> list > >>>>>>>>>>>>>>>>> from a > >>>>>>>>>>>>>>>>> random start point. If the traversal reaches the end of > the > >>>>>>>>>>>>>>>>> list, > >>>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>> instead > >>>>>>>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K > elements > >>>>>>>>>>>>>>>>> are > >>>>>>>>>>>>>>>>> traversed > >>>>>>>>>>>>>>>>> every time. To generate some garbage, we build a list > >>>>>>>>>>>>>>>>> holding > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> traversed > >>>>>>>>>>>>>>>>> elements and a dummy list of characters. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Timings for the last 100 runs are stored in a circular > >>>>>>>>>>>>>>>>> buffer. 
If > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> elapsed time for the last run is more than twice the > average > >>>>>>>>>>>>>>>>> time, > >>>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>> print > >>>>>>>>>>>>>>>>> out a line with the elapsed time, the threshold, and the > 90% > >>>>>>>>>>>>>>>>> runtime > >>>>>>>>>>>>>>>>> (we > >>>>>>>>>>>>>>>>> would like to see that the mean runtime does not increase > >>>>>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> number of > >>>>>>>>>>>>>>>>> elements in the list, but that the max time does increase > >>>>>>>>>>>>>>>>> (linearly > >>>>>>>>>>>>>>>>> with the > >>>>>>>>>>>>>>>>> number of object, i guess); traversing 50K elements > should > >>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>> independent of > >>>>>>>>>>>>>>>>> the memory size). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We have tried monitoring memory consumption by external > >>>>>>>>>>>>>>>>> inspection, > >>>>>>>>>>>>>>>>> but > >>>>>>>>>>>>>>>>> cannot consistently verify that memory is deallocated at > the > >>>>>>>>>>>>>>>>> same > >>>>>>>>>>>>>>>>> time > >>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't > >>>>>>>>>>>>>>>>> always > >>>>>>>>>>>>>>>>> return > >>>>>>>>>>>>>>>>> freed > >>>>>>>>>>>>>>>>> pages back to the OS? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Using top, we observe that 10M elements allocates around > >>>>>>>>>>>>>>>>> 17GB > >>>>>>>>>>>>>>>>> after > >>>>>>>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and > grows to > >>>>>>>>>>>>>>>>> 35GB > >>>>>>>>>>>>>>>>> shortly > >>>>>>>>>>>>>>>>> after building). 
> >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Here is output from a few runs with different number of > >>>>>>>>>>>>>>>>> elements: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> pypy mem.py 10000000 > >>>>>>>>>>>>>>>>> start build > >>>>>>>>>>>>>>>>> end build 84.142424 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: > >>>>>>>>>>>>>>>>> 1.495401 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.421558 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: > >>>>>>>>>>>>>>>>> 1.488160 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.423441 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: > >>>>>>>>>>>>>>>>> 1.474563 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.419817 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> pypy mem.py 20000000 > >>>>>>>>>>>>>>>>> start build > >>>>>>>>>>>>>>>>> end build 180.823105 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: > >>>>>>>>>>>>>>>>> 2.295146 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.434726 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: > >>>>>>>>>>>>>>>>> 2.283927 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.374190 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: > >>>>>>>>>>>>>>>>> 2.279631 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.371502 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> pypy mem.py 30000000 > >>>>>>>>>>>>>>>>> start build > >>>>>>>>>>>>>>>>> end build 276.217811 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: > >>>>>>>>>>>>>>>>> 3.188464 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.459891 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: > >>>>>>>>>>>>>>>>> 3.183003 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: > >>>>>>>>>>>>>>>>> 3.190782 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393677 > >>>>>>>>>>>>>>>>> that took 
a long time elapsed: 43.573411 slow_threshold: > >>>>>>>>>>>>>>>>> 3.239637 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Code below > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > -------------------------------------------------------------- > >>>>>>>>>>>>>>>>> import time > >>>>>>>>>>>>>>>>> from random import randint, choice > >>>>>>>>>>>>>>>>> import sys > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> allElems = {} > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> class Node: > >>>>>>>>>>>>>>>>> def __init__(self, v_): > >>>>>>>>>>>>>>>>> self.v = v_ > >>>>>>>>>>>>>>>>> self.next = None > >>>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) > >>>>>>>>>>>>>>>>> for _ in > xrange(randint(50,100))] > >>>>>>>>>>>>>>>>> allElems[self.v] = self > >>>>>>>>>>>>>>>>> if self.v > 0: > >>>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, > >>>>>>>>>>>>>>>>> self.v-1)] > >>>>>>>>>>>>>>>>> for _ > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>> xrange(10)] > >>>>>>>>>>>>>>>>> else: > >>>>>>>>>>>>>>>>> self.dummy_links = [self] > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> def set_next(self, l): > >>>>>>>>>>>>>>>>> self.next = l > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> def follow(node): > >>>>>>>>>>>>>>>>> acc = [] > >>>>>>>>>>>>>>>>> count = 0 > >>>>>>>>>>>>>>>>> cur = node > >>>>>>>>>>>>>>>>> assert node.v is not None > >>>>>>>>>>>>>>>>> assert cur is not None > >>>>>>>>>>>>>>>>> while count < 50000: > >>>>>>>>>>>>>>>>> # return a value; generate some garbage > >>>>>>>>>>>>>>>>> acc.append((cur.v, > >>>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") > >>>>>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>> x > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>> xrange(100)])) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> # if we have reached the end, chose a random link > >>>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None > >>>>>>>>>>>>>>>>> else > >>>>>>>>>>>>>>>>> cur.next > >>>>>>>>>>>>>>>>> count += 1 > 
>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> return acc > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> def build(num_elems): > >>>>>>>>>>>>>>>>> start = time.time() > >>>>>>>>>>>>>>>>> print "start build" > >>>>>>>>>>>>>>>>> root = Node(0) > >>>>>>>>>>>>>>>>> cur = root > >>>>>>>>>>>>>>>>> for x in xrange(1, num_elems): > >>>>>>>>>>>>>>>>> e = Node(x) > >>>>>>>>>>>>>>>>> cur.next = e > >>>>>>>>>>>>>>>>> cur = e > >>>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) > >>>>>>>>>>>>>>>>> return root > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> num_timings = 100 > >>>>>>>>>>>>>>>>> if __name__ == "__main__": > >>>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) > >>>>>>>>>>>>>>>>> build(num_elems) > >>>>>>>>>>>>>>>>> total = 0 > >>>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last > >>>>>>>>>>>>>>>>> num_timings > >>>>>>>>>>>>>>>>> runs > >>>>>>>>>>>>>>>>> i = 0 > >>>>>>>>>>>>>>>>> beginning = time.time() > >>>>>>>>>>>>>>>>> while time.time() - beginning < 600: > >>>>>>>>>>>>>>>>> start = time.time() > >>>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] > >>>>>>>>>>>>>>>>> assert(elem is not None) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> lst = follow(elem) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for > >>>>>>>>>>>>>>>>> something > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> end = time.time() > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> elapsed = end-start > >>>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed > >>>>>>>>>>>>>>>>> if (i > num_timings): > >>>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow > >>>>>>>>>>>>>>>>> defined > >>>>>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2*avg run time > >>>>>>>>>>>>>>>>> if (elapsed > slow_time): > >>>>>>>>>>>>>>>>> print "that took a long time elapsed: %f > >>>>>>>>>>>>>>>>> slow_threshold: > >>>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ > >>>>>>>>>>>>>>>>> (elapsed, slow_time, > >>>>>>>>>>>>>>>>> 
sorted(timings)[int(num_timings*.9)]) > >>>>>>>>>>>>>>>>> i += 1 > >>>>>>>>>>>>>>>>> print total > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> Hi Armin, Maciej > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks for responding. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if > any) of > >>>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> code > >>>>>>>>>>>>>>>>>>> I'm > >>>>>>>>>>>>>>>>>>> in a > >>>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better > >>>>>>>>>>>>>>>>>>> would be > >>>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>> means > >>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in > unmanaged > >>>>>>>>>>>>>>>>>>> memory, > >>>>>>>>>>>>>>>>>>> but I > >>>>>>>>>>>>>>>>>>> would expect that to be a tall order :) > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>> /Martin > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Martin. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Note that in case you want us to do the work of > isolating > >>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> problem, > >>>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign > NDAs > >>>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>> stuff). 
> >>>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once > you > >>>>>>>>>>>>>>>>>> isolate > >>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>> part you can share freely :) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>>>>> > >>>> > >>>> > >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 20:51:04 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 21:51:04 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: no, the problem here is definitely that you're generating a lot of garbage that survives the young generation. you can always try playing with the PYPY_GC_NURSERY environment var (defaults to 4M I think) On Mon, Mar 17, 2014 at 9:04 PM, Martin Koch wrote: > Well, it would appear that we have the problem because we're generating a > lot of garbage in the young generation, just like we're doing in the example > we've been studying here. I'm unsure how we can avoid that in our real > implementation. Can we force gc of the young generation? Either by > gc.collect() or implicitly somehow (does the gc e.g. kick in across function > calls?). > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 5:39 PM, Maciej Fijalkowski > wrote: >> >> not sure how much more we can help without looking into the code >> >> On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch wrote: >> > Thanks :) >> > >> > /Martin >> > >> > >> >> On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: >> >> >> >> ok. >> >> >> >> so as you can probably see, the max is not that big, which means the >> >> GC is really incremental. What happens is you get tons of garbage that >> >> survives minor collection every now and then. 
I don't exactly know >> >> why, but you should look what objects can potentially survive for too >> >> long. >> >> >> >>> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: >> >>> Ah - it just occured to me that the first runs may be slow anyway: >> >>> Since we >> >>> take the average of the last 100 runs as the benchmark, then the first >> >>> 100 >> >>> runs are not classified as slow. Indeed, the first three runs with >> >>> many >> >>> collections are in the first 100 runs. >> >>> >> >>> >> >>>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >> >>>> >> >>>> Here are the total and max times in millions of units; 30000 units is >> >>>> approximately 13 seconds. I have extracted the runs where there are >> >>>> many >> >>>> gc-collect-steps. These are in execution order, so the first runs >> >>>> with many >> >>>> gc-collect-steps aren't slow. >> >>>> >> >>>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >> >>>> gc-minor:10 gc-collect-step:247 >> >>>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >> >>>> gc-minor:10 gc-collect-step:245 >> >>>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >> >>>> gc-minor:11 gc-collect-step:244 >> >>>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31270 >> >>>> Max: gc-minor:17 gc-collect-step:244 >> >>>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30365 >> >>>> Max: gc-minor:11 gc-collect-step:248 >> >>>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31235 >> >>>> Max: gc-minor:8 gc-collect-step:299 >> >>>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31124 >> >>>> Max: gc-minor:11 gc-collect-step:246 >> >>>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30541 >> >>>> Max: gc-minor:8 gc-collect-step:244 >> >>>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31427 >> >>>> Max: gc-minor:36 
gc-collect-step:248 >> >>>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30743 >> >>>> Max: gc-minor:8 gc-collect-step:244 >> >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30207 >> >>>> Max: gc-minor:8 gc-collect-step:245 >> >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30837 >> >>>> Max: gc-minor:8 gc-collect-step:244 >> >>>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30898 >> >>>> Max: gc-minor:38 gc-collect-step:244 >> >>>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30407 >> >>>> Max: gc-minor:23 gc-collect-step:245 >> >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30591 >> >>>> Max: gc-minor:8 gc-collect-step:246 >> >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31193 >> >>>> Max: gc-minor:9 gc-collect-step:244 >> >>>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30026 >> >>>> Max: gc-minor:8 gc-collect-step:246 >> >>>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31179 >> >>>> Max: gc-minor:8 gc-collect-step:248 >> >>>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30674 >> >>>> Max: gc-minor:8 gc-collect-step:250 >> >>>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30413 >> >>>> Max: gc-minor:8 gc-collect-step:245 >> >>>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30830 >> >>>> Max: gc-minor:543 gc-collect-step:244 >> >>>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31153 >> >>>> Max: gc-minor:20 gc-collect-step:246 >> >>>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:29815 >> >>>> Max: gc-minor:25 gc-collect-step:245 >> >>>> >> >>>> Thanks, >> >>>> /Martin >> >>>> >> >>>> >> >>>>> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: >> >>>>> >> >>>>> Ah. 
I had misunderstood. I'll get back to you on that :) thanks >> >>>>> >> >>>>> /Martin >> >>>>> >> >>>>> >> >>>>>> On 17/03/2014, at 15.21, Maciej Fijalkowski >> >>>>>> wrote: >> >>>>>> >> >>>>>> eh, this is not what I need >> >>>>>> >> >>>>>> I need a max of TIME it took for a gc-minor and the TOTAL time it >> >>>>>> took >> >>>>>> for a gc-minor (per query) (ideally same for gc-walkroots and >> >>>>>> gc-collect-step) >> >>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch >> >>>>>>> wrote: >> >>>>>>> Here are the collated results of running each query. For each run, >> >>>>>>> I >> >>>>>>> count >> >>>>>>> how many of each of the pypy debug lines i get. I.e. there were >> >>>>>>> 668 >> >>>>>>> runs >> >>>>>>> that printed 58 loglines that contain "{gc-minor" which was >> >>>>>>> eventually >> >>>>>>> followed by "gc-minor}". I have also counted if the query was >> >>>>>>> slow; >> >>>>>>> interestingly, not all the queries with many gc-minors were slow >> >>>>>>> (but >> >>>>>>> all >> >>>>>>> slow queries had a gc-minor). 
>> >>>>>>> >> >>>>>>> Please let me know if this is unclear :) >> >>>>>>> >> >>>>>>> 668 gc-minor:58 gc-minor-walkroots:58 >> >>>>>>> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >> >>>>>>> 140 gc-minor:59 gc-minor-walkroots:59 >> >>>>>>> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >> >>>>>>> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >> >>>>>>> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >> >>>>>>> gc-collect-step:9589 >> >>>>>>> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >> >>>>>>> gc-collect-step:9590 >> >>>>>>> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >> >>>>>>> gc-collect-step:9609 >> >>>>>>> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >> >>>>>>> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >> >>>>>>> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >> >>>>>>> jit-backend-dump:78 >> >>>>>>> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 >> >>>>>>> gc-minor:9030 >> >>>>>>> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >> >>>>>>> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >> >>>>>>> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >> >>>>>>> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 >> >>>>>>> jit-resume:84 >> >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >> >>>>>>> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 >> >>>>>>> gc-minor-walkroots:60 >> >>>>>>> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >> >>>>>>> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 >> >>>>>>> jit-log-rewritten-loop:1 >> >>>>>>> jit-resume:14 >> >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >> >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 >> >>>>>>> jit-tracing:3 >> >>>>>>> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >> >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> >>>>>>> jit-abort:3 >> 
>>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> >>>>>>> jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 >> >>>>>>> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >> >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 >> >>>>>>> jit-tracing:3 >> >>>>>>> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >> >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> >>>>>>> jit-abort:3 >> >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> >>>>>>> jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 >> >>>>>>> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >> >>>>>>> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >> >>>>>>> gc-minor:61 >> >>>>>>> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >> >>>>>>> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >> >>>>>>> jit-abort:3 >> >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >> >>>>>>> jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 jit-resume:104 >> >>>>>>> >> >>>>>>> >> >>>>>>> Thanks, >> >>>>>>> /Martin >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >> >>>>>>> >> >>>>>>> wrote: >> >>>>>>>> >> >>>>>>>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >> >>>>>>>> >> >>>>>>>> wrote: >> >>>>>>>>> are you *sure* it's the walkroots that take that long and not >> >>>>>>>>> something else (like gc-minor)? More of those mean that you >> >>>>>>>>> allocate >> >>>>>>>>> a >> >>>>>>>>> lot more surviving objects. 
Can you do two things: >> >>>>>>>>> >> >>>>>>>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >> >>>>>>>>> b) take the sum of those >> >>>>>>>>> >> >>>>>>>>> and plot them >> >>>>>>>> >> >>>>>>>> ^^^ or just paste the results actually >> >>>>>>>> >> >>>>>>>>> >> >>>>>>>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch >> >>>>>>>>>> wrote: >> >>>>>>>>>> Well, then it works out to around 2.5GHz, which seems >> >>>>>>>>>> reasonable. >> >>>>>>>>>> But >> >>>>>>>>>> it >> >>>>>>>>>> doesn't alter the conclusion from the previous email: The slow >> >>>>>>>>>> queries >> >>>>>>>>>> then >> >>>>>>>>>> all have a duration around 34*10^9 units, 'normal' queries >> >>>>>>>>>> 1*10^9 >> >>>>>>>>>> units, or >> >>>>>>>>>> .4 seconds at this conversion. Also, the log shows that a slow >> >>>>>>>>>> query >> >>>>>>>>>> performs many more gc-minor operations than a 'normal' one: >> >>>>>>>>>> 9600 >> >>>>>>>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >> >>>>>>>>>> >> >>>>>>>>>> So the question becomes: Why do we get this large spike in >> >>>>>>>>>> gc-minor-walkroots, and, in particular, is there any way to >> >>>>>>>>>> avoid >> >>>>>>>>>> it :) >> >>>>>>>>>> ? >> >>>>>>>>>> >> >>>>>>>>>> Thanks, >> >>>>>>>>>> /Martin >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >> >>>>>>>>>> >> >>>>>>>>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>> I think it's the cycles of your CPU >> >>>>>>>>>>> >> >>>>>>>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >> >>>>>>>>>>>> correlate it >> >>>>>>>>>>>> with seconds (which the program does print out). Slow runs >> >>>>>>>>>>>> are >> >>>>>>>>>>>> around 13 >> >>>>>>>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp >> >>>>>>>>>>>> units >> >>>>>>>>>>>> (e.g. >> >>>>>>>>>>>> from >> >>>>>>>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). 
>> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> >>>>>>>>>>>> >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> The number of lines is nonsense. This is a timestamp in hex. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>> Based On Maciej's suggestion, I tried the following >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> This generates a logfile which looks something like this >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> start--> >> >>>>>>>>>>>>>> [2b99f1981b527e] {gc-minor >> >>>>>>>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >> >>>>>>>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >> >>>>>>>>>>>>>> [2b99f19890d750] gc-minor} >> >>>>>>>>>>>>>> [snip] >> >>>>>>>>>>>>>> ... >> >>>>>>>>>>>>>> <--stop >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> It turns out that the culprit is a lot of MINOR >> >>>>>>>>>>>>>> collections. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I base this on the following observations: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I can't understand the format of the timestamp on each >> >>>>>>>>>>>>>> logline >> >>>>>>>>>>>>>> (the >> >>>>>>>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this >> >>>>>>>>>>>>>> should >> >>>>>>>>>>>>>> be >> >>>>>>>>>>>>>> output >> >>>>>>>>>>>>>> from time.clock(), but that doesn't return a number like >> >>>>>>>>>>>>>> that >> >>>>>>>>>>>>>> when I >> >>>>>>>>>>>>>> run >> >>>>>>>>>>>>>> pypy interactively >> >>>>>>>>>>>>>> Instead, I count the number of debug lines between start--> >> >>>>>>>>>>>>>> and >> >>>>>>>>>>>>>> the >> >>>>>>>>>>>>>> corresponding <--stop. 
>> >>>>>>>>>>>>>> Most runs have a few hundred lines of output between >> >>>>>>>>>>>>>> start/stop >> >>>>>>>>>>>>>> All slow runs have very close to 57800 lines out output >> >>>>>>>>>>>>>> between >> >>>>>>>>>>>>>> start/stop >> >>>>>>>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >> >>>>>>>>>>>>>> gc-minor >> >>>>>>>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >> >>>>>>>>>>>>>>> stdout) >> >>>>>>>>>>>>>>> which will do that for you btw. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> maybe you can find out what's that using profiling or >> >>>>>>>>>>>>>>> valgrind? >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>> I have tried getting the pypy source and building my own >> >>>>>>>>>>>>>>>> version >> >>>>>>>>>>>>>>>> of >> >>>>>>>>>>>>>>>> pypy. I >> >>>>>>>>>>>>>>>> have modified >> >>>>>>>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >> >>>>>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>> print out when it starts and when it stops. Apparently, >> >>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>> slow >> >>>>>>>>>>>>>>>> queries >> >>>>>>>>>>>>>>>> do >> >>>>>>>>>>>>>>>> NOT occur during major_collection_step; at least, I have >> >>>>>>>>>>>>>>>> not >> >>>>>>>>>>>>>>>> observed >> >>>>>>>>>>>>>>>> major >> >>>>>>>>>>>>>>>> step output during a query execution. So, apparently, >> >>>>>>>>>>>>>>>> something >> >>>>>>>>>>>>>>>> else >> >>>>>>>>>>>>>>>> is >> >>>>>>>>>>>>>>>> blocking. 
This could be another aspect of the GC, but it >> >>>>>>>>>>>>>>>> could >> >>>>>>>>>>>>>>>> also >> >>>>>>>>>>>>>>>> be >> >>>>>>>>>>>>>>>> anything else. >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Just to be sure, I have tried running the same >> >>>>>>>>>>>>>>>> application in >> >>>>>>>>>>>>>>>> python >> >>>>>>>>>>>>>>>> with >> >>>>>>>>>>>>>>>> garbage collection disabled. I don't see the problem >> >>>>>>>>>>>>>>>> there, >> >>>>>>>>>>>>>>>> so >> >>>>>>>>>>>>>>>> it >> >>>>>>>>>>>>>>>> is >> >>>>>>>>>>>>>>>> somehow >> >>>>>>>>>>>>>>>> related to either GC or the runtime somehow. >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Cheers, >> >>>>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We have hacked up a small sample that seems to exhibit >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> same >> >>>>>>>>>>>>>>>>> issue. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We basically generate a linked list of objects. To >> >>>>>>>>>>>>>>>>> increase >> >>>>>>>>>>>>>>>>> connectedness, >> >>>>>>>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >> >>>>>>>>>>>>>>>>> randomly >> >>>>>>>>>>>>>>>>> chosen >> >>>>>>>>>>>>>>>>> previous elements in the list. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We then time a function that traverses 50000 elements >> >>>>>>>>>>>>>>>>> from >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> list >> >>>>>>>>>>>>>>>>> from a >> >>>>>>>>>>>>>>>>> random start point. If the traversal reaches the end of >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> list, >> >>>>>>>>>>>>>>>>> we >> >>>>>>>>>>>>>>>>> instead >> >>>>>>>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K >> >>>>>>>>>>>>>>>>> elements >> >>>>>>>>>>>>>>>>> are >> >>>>>>>>>>>>>>>>> traversed >> >>>>>>>>>>>>>>>>> every time. 
To generate some garbage, we build a list >> >>>>>>>>>>>>>>>>> holding >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> traversed >> >>>>>>>>>>>>>>>>> elements and a dummy list of characters. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >> >>>>>>>>>>>>>>>>> buffer. If >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> elapsed time for the last run is more than twice the >> >>>>>>>>>>>>>>>>> average >> >>>>>>>>>>>>>>>>> time, >> >>>>>>>>>>>>>>>>> we >> >>>>>>>>>>>>>>>>> print >> >>>>>>>>>>>>>>>>> out a line with the elapsed time, the threshold, and the >> >>>>>>>>>>>>>>>>> 90% >> >>>>>>>>>>>>>>>>> runtime >> >>>>>>>>>>>>>>>>> (we >> >>>>>>>>>>>>>>>>> would like to see that the mean runtime does not >> >>>>>>>>>>>>>>>>> increase >> >>>>>>>>>>>>>>>>> with >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> number of >> >>>>>>>>>>>>>>>>> elements in the list, but that the max time does >> >>>>>>>>>>>>>>>>> increase >> >>>>>>>>>>>>>>>>> (linearly >> >>>>>>>>>>>>>>>>> with the >> >>>>>>>>>>>>>>>>> number of object, i guess); traversing 50K elements >> >>>>>>>>>>>>>>>>> should >> >>>>>>>>>>>>>>>>> be >> >>>>>>>>>>>>>>>>> independent of >> >>>>>>>>>>>>>>>>> the memory size). >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We have tried monitoring memory consumption by external >> >>>>>>>>>>>>>>>>> inspection, >> >>>>>>>>>>>>>>>>> but >> >>>>>>>>>>>>>>>>> cannot consistently verify that memory is deallocated at >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> same >> >>>>>>>>>>>>>>>>> time >> >>>>>>>>>>>>>>>>> that >> >>>>>>>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't >> >>>>>>>>>>>>>>>>> always >> >>>>>>>>>>>>>>>>> return >> >>>>>>>>>>>>>>>>> freed >> >>>>>>>>>>>>>>>>> pages back to the OS? 
>> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Using top, we observe that 10M elements allocates around >> >>>>>>>>>>>>>>>>> 17GB >> >>>>>>>>>>>>>>>>> after >> >>>>>>>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and >> >>>>>>>>>>>>>>>>> grows to >> >>>>>>>>>>>>>>>>> 35GB >> >>>>>>>>>>>>>>>>> shortly >> >>>>>>>>>>>>>>>>> after building). >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Here is output from a few runs with different number of >> >>>>>>>>>>>>>>>>> elements: >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> pypy mem.py 10000000 >> >>>>>>>>>>>>>>>>> start build >> >>>>>>>>>>>>>>>>> end build 84.142424 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.230586 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 1.495401 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.016531 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 1.488160 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.032537 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 1.474563 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> pypy mem.py 20000000 >> >>>>>>>>>>>>>>>>> start build >> >>>>>>>>>>>>>>>>> end build 180.823105 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 27.346064 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 2.295146 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 26.028852 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 2.283927 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 25.432279 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 2.279631 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> pypy mem.py 30000000 >> >>>>>>>>>>>>>>>>> start build >> >>>>>>>>>>>>>>>>> end build 
276.217811 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 40.993855 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.188464 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 41.693553 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.183003 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 39.679769 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.190782 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 43.573411 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.239637 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Code below >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> -------------------------------------------------------------- >> >>>>>>>>>>>>>>>>> import time >> >>>>>>>>>>>>>>>>> from random import randint, choice >> >>>>>>>>>>>>>>>>> import sys >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> allElems = {} >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> class Node: >> >>>>>>>>>>>>>>>>> def __init__(self, v_): >> >>>>>>>>>>>>>>>>> self.v = v_ >> >>>>>>>>>>>>>>>>> self.next = None >> >>>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) >> >>>>>>>>>>>>>>>>> for _ in >> >>>>>>>>>>>>>>>>> xrange(randint(50,100))] >> >>>>>>>>>>>>>>>>> allElems[self.v] = self >> >>>>>>>>>>>>>>>>> if self.v > 0: >> >>>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >> >>>>>>>>>>>>>>>>> self.v-1)] >> >>>>>>>>>>>>>>>>> for _ >> >>>>>>>>>>>>>>>>> in >> >>>>>>>>>>>>>>>>> xrange(10)] >> >>>>>>>>>>>>>>>>> else: >> >>>>>>>>>>>>>>>>> self.dummy_links = [self] >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> def set_next(self, l): >> >>>>>>>>>>>>>>>>> self.next = l >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> def follow(node): >> >>>>>>>>>>>>>>>>> acc = [] >> 
>>>>>>>>>>>>>>>>> count = 0 >> >>>>>>>>>>>>>>>>> cur = node >> >>>>>>>>>>>>>>>>> assert node.v is not None >> >>>>>>>>>>>>>>>>> assert cur is not None >> >>>>>>>>>>>>>>>>> while count < 50000: >> >>>>>>>>>>>>>>>>> # return a value; generate some garbage >> >>>>>>>>>>>>>>>>> acc.append((cur.v, >> >>>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >> >>>>>>>>>>>>>>>>> for >> >>>>>>>>>>>>>>>>> x >> >>>>>>>>>>>>>>>>> in >> >>>>>>>>>>>>>>>>> xrange(100)])) >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> # if we have reached the end, chose a random link >> >>>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >> >>>>>>>>>>>>>>>>> else >> >>>>>>>>>>>>>>>>> cur.next >> >>>>>>>>>>>>>>>>> count += 1 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> return acc >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> def build(num_elems): >> >>>>>>>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>>>>>>> print "start build" >> >>>>>>>>>>>>>>>>> root = Node(0) >> >>>>>>>>>>>>>>>>> cur = root >> >>>>>>>>>>>>>>>>> for x in xrange(1, num_elems): >> >>>>>>>>>>>>>>>>> e = Node(x) >> >>>>>>>>>>>>>>>>> cur.next = e >> >>>>>>>>>>>>>>>>> cur = e >> >>>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) >> >>>>>>>>>>>>>>>>> return root >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> num_timings = 100 >> >>>>>>>>>>>>>>>>> if __name__ == "__main__": >> >>>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) >> >>>>>>>>>>>>>>>>> build(num_elems) >> >>>>>>>>>>>>>>>>> total = 0 >> >>>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >> >>>>>>>>>>>>>>>>> num_timings >> >>>>>>>>>>>>>>>>> runs >> >>>>>>>>>>>>>>>>> i = 0 >> >>>>>>>>>>>>>>>>> beginning = time.time() >> >>>>>>>>>>>>>>>>> while time.time() - beginning < 600: >> >>>>>>>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >> >>>>>>>>>>>>>>>>> assert(elem is not None) >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> lst = follow(elem) >> >>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for >> >>>>>>>>>>>>>>>>> something >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> end = time.time() >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> elapsed = end-start >> >>>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed >> >>>>>>>>>>>>>>>>> if (i > num_timings): >> >>>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # >> >>>>>>>>>>>>>>>>> slow >> >>>>>>>>>>>>>>>>> defined >> >>>>>>>>>>>>>>>>> as >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> 2*avg run time >> >>>>>>>>>>>>>>>>> if (elapsed > slow_time): >> >>>>>>>>>>>>>>>>> print "that took a long time elapsed: %f >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >> >>>>>>>>>>>>>>>>> (elapsed, slow_time, >> >>>>>>>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >> >>>>>>>>>>>>>>>>> i += 1 >> >>>>>>>>>>>>>>>>> print total >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>>> Hi Armin, Maciej >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> Thanks for responding. >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if >> >>>>>>>>>>>>>>>>>>> any) of >> >>>>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>>>> code >> >>>>>>>>>>>>>>>>>>> I'm >> >>>>>>>>>>>>>>>>>>> in a >> >>>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. 
Even better >> >>>>>>>>>>>>>>>>>>> would be >> >>>>>>>>>>>>>>>>>>> a >> >>>>>>>>>>>>>>>>>>> means >> >>>>>>>>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in >> >>>>>>>>>>>>>>>>>>> unmanaged >> >>>>>>>>>>>>>>>>>>> memory, >> >>>>>>>>>>>>>>>>>>> but I >> >>>>>>>>>>>>>>>>>>> would expect that to be a tall order :) >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> Hi Martin. >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> Note that in case you want us to do the work of >> >>>>>>>>>>>>>>>>>> isolating >> >>>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>>> problem, >> >>>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign >> >>>>>>>>>>>>>>>>>> NDAs >> >>>>>>>>>>>>>>>>>> and >> >>>>>>>>>>>>>>>>>> stuff). >> >>>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once >> >>>>>>>>>>>>>>>>>> you >> >>>>>>>>>>>>>>>>>> isolate >> >>>>>>>>>>>>>>>>>> a >> >>>>>>>>>>>>>>>>>> part you can share freely :) >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>> >> >>>> >> >>> > > From cfbolz at gmx.de Tue Mar 18 09:47:12 2014 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Tue, 18 Mar 2014 09:47:12 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: <53280810.6060602@gmx.de> On 17/03/14 20:04, Martin Koch wrote: > Well, it would appear that we have the problem because we're generating > a lot of garbage in the young generation, just like we're doing in the > example we've been studying here. No, I think it's because you're generating a lot of garbage in the *old* generation. Meaning objects which survive one minor collection but then die. 
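The pattern described here — an object surviving one minor collection and being promoted to an older generation, where only a more expensive major collection can reclaim it — can be sketched with CPython's own generational collector. This is only a rough analogy (PyPy's incminimark collector works quite differently, and gc.get_objects(generation=...) needs CPython 3.8 or later):

```python
import gc

class Blob(object):
    """A plain container object, tracked by the generational collector."""
    pass

def promoted_after_minor():
    obj = Blob()   # freshly allocated: starts out in the youngest generation
    gc.collect(0)  # a "minor" collection: scans only generation 0
    # obj was still reachable, so it survived the minor collection and was
    # promoted out of generation 0 into an older generation -- where it can
    # only be reclaimed by a (more expensive) collection of that generation.
    return not any(o is obj for o in gc.get_objects(generation=0))
```

Garbage of this shape — promoted first, dead shortly after — is exactly what a minor collection cannot reclaim, which is why growing the nursery (so that short-lived objects die before they are promoted) is the usual tuning knob.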
> I'm unsure how we can avoid that in > our real implementation. Can we force gc of the young generation? Either > by gc.collect() or implcitly somehow (does the gc e.g. kick in across > function calls?). That would make matters worse, because increasing the frequency of minor collects means *more* objects get moved to the old generation (where they cause problems). So indeed, maybe in your case making the new generation bigger might help. This can be done using PYPY_GC_NURSERY, I think (nursery is the space reserved for young objects). The risk is that minor collections become unreasonably slow. Anyway, if the example code you gave us also shows the problem I think we should eventually look into it. It's not really fair to say "but you're allocating too much!" to explain why the GC takes a lot of time. Cheers, Carl Friedrich From mak at issuu.com Tue Mar 18 10:37:30 2014 From: mak at issuu.com (Martin Koch) Date: Tue, 18 Mar 2014 10:37:30 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: <53280810.6060602@gmx.de> References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: Thanks, Carl. This bit of code certainly exhibits the surprising property that some runs unpredictably stall for a very long time. Further, it seems that this stall time can be made arbitrarily large by increasing the number of nodes generated (== more data in the old generation == more stuff to traverse if lots of garbage is generated and survives the young generation?). As a user of an incremental garbage collector, I would expect that there are pauses due to GC, but that these are predictable and small. I tried running PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 but that seemed to have no effect. 
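The per-section totals and maxima quoted earlier in the thread (e.g. "Totals: gc-minor:418 ... Max: gc-collect-step:247") can be computed from a PYPYLOG trace mechanically rather than by counting lines. A minimal sketch, assuming only what the thread established about the format: an opening line looks like `[2b99f1981b527e] {gc-minor`, the matching closing line looks like `[2b99f19890d750] gc-minor}`, and the bracketed value is a raw hex timestamp counter:

```python
import re

# "[2b99f1981b527e] {gc-minor" opens a section; "[2b99f19890d750] gc-minor}"
# closes it.  The bracketed value is a raw hex timestamp counter.
LINE = re.compile(r"\[([0-9a-f]+)\] (\{)?([a-z-]+)(\})?")

def summarize(log_lines):
    """Return {section: (total_elapsed, max_elapsed)} in raw timestamp units."""
    open_stamps = {}             # section name -> timestamp of its pending open
    totals, maxima = {}, {}
    for line in log_lines:
        m = LINE.search(line)
        if not m:
            continue             # "start-->", "<--stop", payload lines, etc.
        stamp_hex, opening, name, closing = m.groups()
        stamp = int(stamp_hex, 16)
        if opening:
            open_stamps[name] = stamp
        elif closing and name in open_stamps:
            elapsed = stamp - open_stamps.pop(name)
            totals[name] = totals.get(name, 0) + elapsed
            maxima[name] = max(maxima.get(name, 0), elapsed)
    return dict((name, (totals[name], maxima[name])) for name in totals)
```

Sections with distinct names may nest (gc-minor-walkroots inside gc-minor, as in the traces above), which keying the pending-open table by name handles; same-name nesting would need a stack per name instead.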
I'm looking forward to the results of the Software Transactional Memory, btw :) /Martin On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz wrote: > On 17/03/14 20:04, Martin Koch wrote: > > Well, it would appear that we have the problem because we're generating > > a lot of garbage in the young generation, just like we're doing in the > > example we've been studying here. > > No, I think it's because your generating a lot of garbage in the *old* > generation. Meaning objects which survive one minor collection but then > die. > > > I'm unsure how we can avoid that in > > our real implementation. Can we force gc of the young generation? Either > > by gc.collect() or implcitly somehow (does the gc e.g. kick in across > > function calls?). > > That would make matters worse, because increasing the frequency of > minor collects means *more* objects get moved to the old generation > (where they cause problems). So indeed, maybe in your case making the > new generation bigger might help. This can be done using > PYPY_GC_NURSERY, I think (nursery is the space reserved for young > objects). The risk is that minor collections become unreasonably slow. > > Anyway, if the example code you gave us also shows the problem I think > we should eventually look into it. It's not really fair to say "but > you're allocating too much!" to explain why the GC takes a lot of time. > > Cheers, > > Carl Friedrich > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From cfbolz at gmx.de Tue Mar 18 11:23:38 2014 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Tue, 18 Mar 2014 11:23:38 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: Agreed, somehow this should not happen. Anyway, I'm not the person to look into this, but I filed a bug, so at least your example code does not get lost: https://bugs.pypy.org/issue1710 Cheers, Carl Friedrich Martin Koch wrote: >Thanks, Carl. > >This bit of code certainly exhibits the surprising property that some >runs >unpredictably stall for a very long time. Further, it seems that this >stall >time can be made arbitrarily large by increasing the number of nodes >generated (== more data in the old generation == more stuff to traverse >if >lots of garbage is generated and survives the young generation?). As a >user >of an incremental garbage collector, I would expect that there are >pauses >due to GC, but that these are predictable and small. > >I tried running > >PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 > >but that seemed to have no effect. > >I'm looking forward to the results of the Software Transactional >Memory, >btw :) > >/Martin > > >On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz >wrote: > >> On 17/03/14 20:04, Martin Koch wrote: >> > Well, it would appear that we have the problem because we're >generating >> > a lot of garbage in the young generation, just like we're doing in >the >> > example we've been studying here. >> >> No, I think it's because your generating a lot of garbage in the >*old* >> generation. Meaning objects which survive one minor collection but >then >> die. >> >> > I'm unsure how we can avoid that in >> > our real implementation. Can we force gc of the young generation? >Either >> > by gc.collect() or implcitly somehow (does the gc e.g. kick in >across >> > function calls?). 
>> >> That would make matters worse, because increasing the frequency of >> minor collects means *more* objects get moved to the old generation >> (where they cause problems). So indeed, maybe in your case making the >> new generation bigger might help. This can be done using >> PYPY_GC_NURSERY, I think (nursery is the space reserved for young >> objects). The risk is that minor collections become unreasonably slow. >> >> Anyway, if the example code you gave us also shows the problem I think >> we should eventually look into it. It's not really fair to say "but >> you're allocating too much!" to explain why the GC takes a lot of time. >> >> Cheers, >> >> Carl Friedrich >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> Carl Friedrich -------------- next part -------------- An HTML attachment was scrubbed... URL: From bokr at oz.net Tue Mar 18 11:41:07 2014 From: bokr at oz.net (Bengt Richter) Date: Tue, 18 Mar 2014 11:41:07 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: PMJI, but I wonder if some of these objects could come from trivial re-instantiations instead of re-use of mutable objects, e.g., fishing out one attribute to use together with a new value as init values for an (unnecessarily) new obj. obj = ObjClass(obj.someattr, chgval) where obj.chg = chgval would have done the job without creating garbage. I suspect this pattern can happen more subtly than above, especially if __new__ is defined to do something tricky with old instances. Also, creating a new object can be a tempting way to feel sure about its complete state, without having to write a custom (re)init method. On 03/18/2014 10:37 AM Martin Koch wrote: > Thanks, Carl.
> > This bit of code certainly exhibits the surprising property that some runs > unpredictably stall for a very long time. Further, it seems that this stall > time can be made arbitrarily large by increasing the number of nodes > generated (== more data in the old generation == more stuff to traverse if > lots of garbage is generated and survives the young generation?). As a user > of an incremental garbage collector, I would expect that there are pauses > due to GC, but that these are predictable and small. > > I tried running > > PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 > > but that seemed to have no effect. > > I'm looking forward to the results of the Software Transactional Memory, > btw :) > > /Martin > > > On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz wrote: > >> On 17/03/14 20:04, Martin Koch wrote: >>> Well, it would appear that we have the problem because we're generating >>> a lot of garbage in the young generation, just like we're doing in the >>> example we've been studying here. >> >> No, I think it's because your generating a lot of garbage in the *old* >> generation. Meaning objects which survive one minor collection but then >> die. >> >>> I'm unsure how we can avoid that in >>> our real implementation. Can we force gc of the young generation? Either >>> by gc.collect() or implcitly somehow (does the gc e.g. kick in across >>> function calls?). >> >> That would make matters worse, because increasing the frequency of >> minor collects means *more* objects get moved to the old generation >> (where they cause problems). So indeed, maybe in your case making the >> new generation bigger might help. This can be done using >> PYPY_GC_NURSERY, I think (nursery is the space reserved for young >> objects). The risk is that minor collections become unreasonably slow. >> >> Anyway, if the example code you gave us also shows the problem I think >> we should eventually look into it. It's not really fair to say "but >> you're allocating too much!" 
to explain why the GC takes a lot of time. >> >> Cheers, >> >> Carl Friedrich >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> > > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From techtonik at gmail.com Tue Mar 18 11:43:31 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Tue, 18 Mar 2014 13:43:31 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting Message-ID: Hi, I wonder if it possible to discard run-time changes to interpreter state and get back to some point in the past? One of the applications is forking to speed up unit tests - for it after interpreter is initialized and loaded unittest imports. -- anatoly t. From mak at issuu.com Tue Mar 18 11:55:36 2014 From: mak at issuu.com (Martin Koch) Date: Tue, 18 Mar 2014 11:55:36 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: Thanks, Carl I think that the part of the mail thread with the timing measurements that show that it is many gc-collect-steps and not one single major gc is also relevant for the bug, so that this information won't have to be rediscovered whenever someone gets the time to look at the bug :) I.e. that is the mail with lines like this one: *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245 It might also be relevant to include info on the command line to reproduce the problem: pypy mem.py 10000000 Thanks, /Martin On Tue, Mar 18, 2014 at 11:23 AM, Carl Friedrich Bolz wrote: > Agreed, somehow this should not happen. 
> > Anyway, I'm not the person to look into this, but I filed a bug, so at > least your example code does not get lost: > > https://bugs.pypy.org/issue1710 > > Cheers, > > Carl Friedrich > > > Martin Koch wrote: >> >> Thanks, Carl. >> >> This bit of code certainly exhibits the surprising property that some >> runs unpredictably stall for a very long time. Further, it seems that this >> stall time can be made arbitrarily large by increasing the number of nodes >> generated (== more data in the old generation == more stuff to traverse if >> lots of garbage is generated and survives the young generation?). As a user >> of an incremental garbage collector, I would expect that there are pauses >> due to GC, but that these are predictable and small. >> >> I tried running >> >> PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 >> >> but that seemed to have no effect. >> >> I'm looking forward to the results of the Software Transactional Memory, >> btw :) >> >> /Martin >> >> >> On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz wrote: >> >>> On 17/03/14 20:04, Martin Koch wrote: >>> > Well, it would appear that we have the problem because we're generating >>> > a lot of garbage in the young generation, just like we're doing in the >>> > example we've been studying here. >>> >>> No, I think it's because your generating a lot of garbage in the *old* >>> generation. Meaning objects which survive one minor collection but then >>> die. >>> >>> > I'm unsure how we can avoid that in >>> > our real implementation. Can we force gc of the young generation? >>> Either >>> > by gc.collect() or implcitly somehow (does the gc e.g. kick in across >>> > function calls?). >>> >>> That would make matters worse, because increasing the frequency of >>> minor collects means *more* objects get moved to the old generation >>> (where they cause problems). So indeed, maybe in your case making the >>> new generation bigger might help. 
This can be done using >>> PYPY_GC_NURSERY, I think (nursery is the space reserved for young >>> objects). The risk is that minor collections become unreasonably slow. >>> >>> Anyway, if the example code you gave us also shows the problem I think >>> we should eventually look into it. It's not really fair to say "but >>> you're allocating too much!" to explain why the GC takes a lot of time. >>> >>> Cheers, >>> >>> Carl Friedrich >>> _______________________________________________ >>> pypy-dev mailing list >>> pypy-dev at python.org >>> https://mail.python.org/mailman/listinfo/pypy-dev >>> >> >> > > Carl Friedrich > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Wed Mar 19 09:35:24 2014 From: matti.picus at gmail.com (matti picus) Date: Wed, 19 Mar 2014 10:35:24 +0200 Subject: [pypy-dev] ssl and win32 Message-ID: I have been slowly trying to work on the failure on win32 of ssl, as reported in issue 1696 [0]. The issue now includes a 10 line test, using the stdlib test_ftplib classes. i tried converting the test to run on untranslated python, but am running into a very strange failure mode similar to the own-testing failure of ssl_wrap [1]. For some reason a failing call to socket functions is setting errno==2, which is not a valid socket call error [3] I would love to get some ideas about how to progress, or even better that someone else fix this. Matti [0] https://bugs.pypy.org/issue1696 [1] http://buildbot.pypy.org/summary/longrepr?testname=AppTestSSL.%28%29.test_sslwrap&builder=own-win-x86-32&build=59&mod=module._ssl.test.test_ssl [3] http://msdn.microsoft.com/en-us/library/windows/desktop/ms740668(v=vs.85).aspx -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arigo at tunes.org Wed Mar 19 09:41:33 2014 From: arigo at tunes.org (Armin Rigo) Date: Wed, 19 Mar 2014 09:41:33 +0100 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Anatoly, On 18 March 2014 11:43, anatoly techtonik wrote: > I wonder if it possible to discard run-time changes to interpreter > state and get back to some point in the past? One of the applications > is forking to speed up unit tests - for it after interpreter is > initialized and loaded unittest imports. Unsure what you want to do, but isn't os.fork() the answer to your first question? Armin From techtonik at gmail.com Wed Mar 19 09:47:38 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Wed, 19 Mar 2014 11:47:38 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Armin. On Wed, Mar 19, 2014 at 11:41 AM, Armin Rigo wrote: > On 18 March 2014 11:43, anatoly techtonik wrote: >> I wonder if it possible to discard run-time changes to interpreter >> state and get back to some point in the past? One of the applications >> is forking to speed up unit tests - for it after interpreter is >> initialized and loaded unittest imports. > > Unsure what you want to do, but isn't os.fork() the answer to your > first question? Yes, but on a interpreter level, independent of underlying platform. From arigo at tunes.org Wed Mar 19 09:54:25 2014 From: arigo at tunes.org (Armin Rigo) Date: Wed, 19 Mar 2014 09:54:25 +0100 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Anatoly, On 19 March 2014 09:47, anatoly techtonik wrote: >> Unsure what you want to do, but isn't os.fork() the answer to your >> first question? > > Yes, but on a interpreter level, independent of underlying platform. What is the motivation for avoiding os.fork()? 
It's possible to do something like that in RPython, if you ignore all the additional complications like tracking raw-memory too; it looks like an infinite amount of painful work to me, but well, it's not my time :-) A bient?t, Armin. From techtonik at gmail.com Wed Mar 19 10:42:16 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Wed, 19 Mar 2014 12:42:16 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: On Wed, Mar 19, 2014 at 11:54 AM, Armin Rigo wrote: > Hi Anatoly, > > On 19 March 2014 09:47, anatoly techtonik wrote: >>> Unsure what you want to do, but isn't os.fork() the answer to your >>> first question? >> >> Yes, but on a interpreter level, independent of underlying platform. > > What is the motivation for avoiding os.fork()? I'd gladly use it as a quick hack to solve my unit-testing performance problem, but I am on Windows, so I had to think about ideal case. > It's possible to do something like that in RPython, if you ignore all > the additional complications like tracking raw-memory too; it looks > like an infinite amount of painful work to me, but well, it's not my > time :-) Fair point. =) I am thinking about bytecode machine. Virtualization software like virtualbox allow to save state at run-time and restore it later at the exact point - continue to run the system from the moment it was saved. And they do this in incremental way - keeping track of what memory and disk have been touched. So, can interpreter, while playing bytecode, do keep track of these things and save/restore the state the same way? Is that possible currently? If not, then why and what can be done? 
From arigo at tunes.org Wed Mar 19 11:21:17 2014 From: arigo at tunes.org (Armin Rigo) Date: Wed, 19 Mar 2014 11:21:17 +0100 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Anatoly, On 19 March 2014 10:42, anatoly techtonik wrote: >> It's possible to do something like that in RPython, if you ignore all >> the additional complications like tracking raw-memory too; it looks >> like an infinite amount of painful work to me, but well, it's not my >> time :-) > > Fair point. =) I am thinking about bytecode machine. Virtualization > software like virtualbox allow to save state at run-time and restore it > later at the exact point - continue to run the system from the moment > it was saved. And they do this in incremental way - keeping track of > what memory and disk have been touched. > > So, can interpreter, while playing bytecode, do keep track of these > things and save/restore the state the same way? Is that possible > currently? If not, then why and what can be done? It's not fundamentally easier or harder to do than it would be doing the same thing on CPython or any custom C program. While I can imagine coming up with a proof of concept very quickly, that would save and restore only the GC-managed objects; the real pain starts when needing to track changes done to general low-level memory, which is not possible in general. You would instead need some gross hack that copies the entire content of the memory of a process to emulate a fork(), which could also be done for CPython or any custom C program. How to do it concretely on a specific OS like Windows is left as an exercice to the reader, but as a starting point, look at how Cygwin implements fork(). The only advantage of PyPy, if you want, is that we can *add* an extra small complication on top of that, which is the aforementioned custom way to track the content of the GC objects. 
Given that this is hopefully the biggest part of the memory, doing so would give a boost to the performance of the fork() emulation written as described above. A bient?t, Armin. From techtonik at gmail.com Wed Mar 19 11:54:02 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Wed, 19 Mar 2014 13:54:02 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: On Wed, Mar 19, 2014 at 1:21 PM, Armin Rigo wrote: > Hi Anatoly, > > On 19 March 2014 10:42, anatoly techtonik wrote: >>> It's possible to do something like that in RPython, if you ignore all >>> the additional complications like tracking raw-memory too; it looks >>> like an infinite amount of painful work to me, but well, it's not my >>> time :-) >> >> Fair point. =) I am thinking about bytecode machine. Virtualization >> software like virtualbox allow to save state at run-time and restore it >> later at the exact point - continue to run the system from the moment >> it was saved. And they do this in incremental way - keeping track of >> what memory and disk have been touched. >> >> So, can interpreter, while playing bytecode, do keep track of these >> things and save/restore the state the same way? Is that possible >> currently? If not, then why and what can be done? > > It's not fundamentally easier or harder to do than it would be doing > the same thing on CPython or any custom C program. > > While I can imagine coming up with a proof of concept very quickly, > that would save and restore only the GC-managed objects; the real pain > starts when needing to track changes done to general low-level memory, > which is not possible in general. You would instead need some gross > hack that copies the entire content of the memory of a process to > emulate a fork(), which could also be done for CPython or any custom C > program. 
How to do it concretely on a specific OS like Windows is > left as an exercice to the reader, but as a starting point, look at > how Cygwin implements fork(). I don't know C well enough to read that code. Is it possible to describe this in C independent manner? If I understand correctly, the problem starts when you interact with some specific OS API calls and calls to .dll and .so modules that use low level memory API during dynamic imports? Are there other reasons? I'd like to get the idea what is the exact scope when the rollback is still possible? > The only advantage of PyPy, if you want, is that we can *add* an extra > small complication on top of that, which is the aforementioned custom > way to track the content of the GC objects. Given that this is > hopefully the biggest part of the memory, doing so would give a boost > to the performance of the fork() emulation written as described above. I don't feel confident that this is enough. Tracking GC memory is a cool thing, and it would help to understand the problem better it is also helpful to get notifications when something is done outside of interpreter sandbox. The goal is like to track that Python bytecode was safe to rollback up to a forking point (after unittest initialization is finished, for example). The next step would be to annotate the exact system calls to calm down the interpreter (and developers) and tell them what is the nature of these calls and how to deal with them on rollback. From romain.py at gmail.com Wed Mar 19 15:34:22 2014 From: romain.py at gmail.com (Romain Guillebert) Date: Wed, 19 Mar 2014 15:34:22 +0100 Subject: [pypy-dev] GSOC: Introduction and Interested in Numpy Improvements Project In-Reply-To: References: Message-ID: <20140319143422.GN11056@Plop> Hi Rajul I think you can start off by reading the FAQ http://doc.pypy.org/en/latest/faq.html. 
After that, you will need to familiarize yourself with the code base by making a non-trivial contribution, since you are interested in numpy, fixing this bug https://bugs.pypy.org/issue1590 would be a good way to start in my opinion. If you need help with this, feel free to ask me on irc, I'm rguillebert on freenode, you can also join #pypy there to meet everyone. Cheers Romain On 03/17, Rajul Srivastava wrote: > Hi all, > > My name in Rajul, and I am a final year undergraduate student at the Indian > Institute of Technology Kharagpur. I wish to participate in > Google Summer of Code 2014, and while going through the list of > organisations, I came across PyPy. I am proficient with programming > languages C/C++, Python, Java, Groovy, Ruby. I am very interested in the > fields of Algorithms, Computational Sciences, and Software Engineering. > > I have always been interested in programming and in the past I have > participated in Google Summer of Code 2012, with the organisation > Network Time Foundation,working on the project "improving the > Logging/Debugging System of Network Time Protocol Software". I have also > interned in the Global Technology division of Barclays, during the summers > of 2013, working with the Market Risk IT team. Besides I have worked on a > few Research projects in the fields > of Computational Finance, Complex Networks, and Computational Chemistry. I > am currently working on my Thesis project in the field > of Computational Sciences on a project titled "Network Analysis > of Chemical Reactions". I have had courses in the fields of Programming and > Data Structures, Complex Networks, Distributed Systems, > Algorithms, Operations Research in the past. > > I have gone through the list of project ideas and I found all of the > project ideas very interesting. Although I find all the projects listed > worth a while, I am particularly interested in the "Numpy Improvements" > project. 
I suppose that my programming background is suitable for these > projects. > > I shall be grateful if anyone can help me and give me reference to the > literature that I may use and also shed some light on how I can go about > making a successful proposal. > > Thanks!! > > Best Regards, > Rajul > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From koder.mail at gmail.com Fri Mar 21 15:27:03 2014 From: koder.mail at gmail.com (KoDer) Date: Fri, 21 Mar 2014 16:27:03 +0200 Subject: [pypy-dev] How to check, that jit is turned on? Message-ID: Hi all, I have a quite simple python script, which mostly don't use python dynamic features, but it almost two times slower under latest pypy than under python 2.6.7. Is there any way to check, that jit actually working? pypy is builded from source with > python ../../rpython/bin/rpython -Ojit targetpypystandalone [PyPy 2.2.1 with GCC 4.8.1] on linux2 Thanks -- K.Danilov aka koder Skype:koder.ua Tel:+38-050-4030512 -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Fri Mar 21 18:50:28 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 21 Mar 2014 19:50:28 +0200 Subject: [pypy-dev] How to check, that jit is turned on? In-Reply-To: References: Message-ID: can you share the script? On Fri, Mar 21, 2014 at 4:27 PM, KoDer wrote: > Hi all, > > I have a quite simple python script, which mostly don't use python dynamic > features, > but it almost two times slower under latest pypy than under python 2.6.7. > > Is there any way to check, that jit actually working?
> > pypy is builded from source with >> python ../../rpython/bin/rpython -Ojit targetpypystandalone > > [PyPy 2.2.1 with GCC 4.8.1] on linux2 > > > Thanks > -- > K.Danilov aka koder > Skype:koder.ua > Tel:+38-050-4030512 > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From koder.mail at gmail.com Fri Mar 21 21:05:45 2014 From: koder.mail at gmail.com (KoDer) Date: Fri, 21 Mar 2014 22:05:45 +0200 Subject: [pypy-dev] How to check, that jit is turned on? In-Reply-To: References: Message-ID: Sure. Run - python chess.py 2014-03-21 19:50 GMT+02:00 Maciej Fijalkowski : > can you share the script? > > On Fri, Mar 21, 2014 at 4:27 PM, KoDer wrote: > > Hi all, > > > > I have a quite simple python script, which mostly don't use python > dynamic > > features, > > but it almost two times slower unded latest pypy than under python 2.6.7. > > > > Is there any way to check, that jit actually working? > > > > pypy is builded from source with > >> python ../../rpython/bin/rpython -Ojit targetpypystandalone > > > > [PyPy 2.2.1 with GCC 4.8.1] on linux2 > > > > > > Thanks > > -- > > K.Danilov aka koder > > Skype:koder.ua > > Tel:+38-050-4030512 > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > -- K.Danilov aka koder ICQ:214286120 Skype:koder.ua Tel:+38-050-4030512 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: chess.py Type: text/x-python Size: 13098 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: chess_base.py Type: text/x-python Size: 754 bytes Desc: not available URL: From fijall at gmail.com Fri Mar 21 22:15:23 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 21 Mar 2014 23:15:23 +0200 Subject: [pypy-dev] How to check, that jit is turned on? In-Reply-To: References: Message-ID: hey. this exhibits a crazy amount of jit warmup (try to run it a few times in a loop) and then speeds up a little. It however shows a bug in our JIT (that tries to trace stuff again and again). We'll look into it, thanks for reporting! It should really be fast, we'll make sure it is :-) On Fri, Mar 21, 2014 at 10:05 PM, KoDer wrote: > Sure. Run - > > python chess.py > > > 2014-03-21 19:50 GMT+02:00 Maciej Fijalkowski : > >> can you share the script? >> >> On Fri, Mar 21, 2014 at 4:27 PM, KoDer wrote: >> > Hi all, >> > >> > I have a quite simple python script, which mostly don't use python >> > dynamic >> > features, >> > but it almost two times slower unded latest pypy than under python >> > 2.6.7. >> > >> > Is there any way to check, that jit actually working? >> > >> > pypy is builded from source with >> >> python ../../rpython/bin/rpython -Ojit targetpypystandalone >> > >> > [PyPy 2.2.1 with GCC 4.8.1] on linux2 >> > >> > >> > Thanks >> > -- >> > K.Danilov aka koder >> > Skype:koder.ua >> > Tel:+38-050-4030512 >> > >> > _______________________________________________ >> > pypy-dev mailing list >> > pypy-dev at python.org >> > https://mail.python.org/mailman/listinfo/pypy-dev >> > > > > > > -- > K.Danilov aka koder > ICQ:214286120 > Skype:koder.ua > Tel:+38-050-4030512 From matti.picus at gmail.com Sat Mar 22 23:06:08 2014 From: matti.picus at gmail.com (Matti Picus) Date: Sun, 23 Mar 2014 00:06:08 +0200 Subject: [pypy-dev] win32 and external function calls Message-ID: <532E0950.5020701@gmail.com> An HTML attachment was scrubbed... 
URL: From kmod at dropbox.com Wed Mar 26 00:19:24 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Tue, 25 Mar 2014 16:19:24 -0700 Subject: [pypy-dev] Question about extension support Message-ID: Hi all, I've been trying to learn about how PyPy supports (unmodified) Python extensions, and one thing I've heard is that it's much slower than cPython, and/or uses more memory. I tried finding some documentation about why, and all I could find is this, from 2010: https://bitbucket.org/pypy/compatibility/wiki/c-api Sorry if this should be obvious, but is there more up-to-date information about this stuff? And secondly, assuming the info that I linked to is still valid, is there a reason you guys settled on this method of bridging the refcount/tracing divide, as opposed to other possibilities (can you pin the objects in the GC)? I'm curious, since I've heard a number of people mention that extension modules are the primary reason that PyPy is slower than cPython for their code; definitely an improvement over "PyPy doesn't run my code at all", but it's made me curious about whether or not / why it has to be that way. kmod -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.m.camara at gmail.com Wed Mar 26 02:53:04 2014 From: john.m.camara at gmail.com (John Camara) Date: Tue, 25 Mar 2014 21:53:04 -0400 Subject: [pypy-dev] Question about extension support In-Reply-To: References: Message-ID: Hi Kevin, Here is another link about writing extensions for PyPy. http://doc.pypy.org/en/latest/extending.html John On Tue, Mar 25, 2014 at 9:48 PM, John Camara wrote: > Hi Kevin, > > More up to date information can be found on the FAQ page > > > http://doc.pypy.org/en/latest/faq.html#do-cpython-extension-modules-work-with-pypy > > The best approach for PyPy is either use a pure Python module if possible > or use a cffi wrapped extension instead of an extension that uses the > CPython CAPI. 
Often CPython CAPI extensions are wrapping some c library. > Creating a cffi wrapper for the library is actually much simpler than > writing a CPython CAPI wrapper. Quite a few CPython CAPI extensions have > already been wrapped for cffi so make sure to search for one before > creating your own wrapper. If you need to create a wrapper, refer to the > cffi documentation at > > http://cffi.readthedocs.org/en/release-0.8/ > > Extensions wrapped with cffi are compatible with both CPython and PyPy. > On CPython the performance is similar to what you would get if you used > ctypes. How every, under PyPy, the performance is much closer to a native > C call plus the overhead for releasing and acquiring the gil. > > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.m.camara at gmail.com Wed Mar 26 02:48:27 2014 From: john.m.camara at gmail.com (John Camara) Date: Tue, 25 Mar 2014 21:48:27 -0400 Subject: [pypy-dev] Question about extension support Message-ID: Hi Kevin, More up to date information can be found on the FAQ page http://doc.pypy.org/en/latest/faq.html#do-cpython-extension-modules-work-with-pypy The best approach for PyPy is either use a pure Python module if possible or use a cffi wrapped extension instead of an extension that uses the CPython CAPI. Often CPython CAPI extensions are wrapping some c library. Creating a cffi wrapper for the library is actually much simpler than writing a CPython CAPI wrapper. Quite a few CPython CAPI extensions have already been wrapped for cffi so make sure to search for one before creating your own wrapper. If you need to create a wrapper, refer to the cffi documentation at http://cffi.readthedocs.org/en/release-0.8/ Extensions wrapped with cffi are compatible with both CPython and PyPy. On CPython the performance is similar to what you would get if you used ctypes. 
However, under PyPy, the performance is much closer to a native C call plus the overhead for releasing and acquiring the GIL. John -------------- next part -------------- An HTML attachment was scrubbed... URL: From yury at shurup.com Wed Mar 26 09:31:26 2014 From: yury at shurup.com (Yury V. Zaytsev) Date: Wed, 26 Mar 2014 09:31:26 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: References: Message-ID: <1395822686.2779.12.camel@newpride> On Tue, 2014-03-25 at 16:19 -0700, Kevin Modzelewski wrote: > > I'm curious, since I've heard a number of people mention that > extension modules are the primary reason that PyPy is slower than > cPython for their code; definitely an improvement over "PyPy doesn't > run my code at all", but it's made me curious about whether or not / > why it has to be that way. In my opinion, it all depends on how you use CPyExt and what your extension modules are for. There are two scenarios here (or combinations thereof) that I think cover most of the use cases: 1) You use C extensions to make it faster. 2) You use C extensions to steer external processes. Ideally with PyPy you should be able to drop (1) altogether and write nice Python code that the JIT will be able to optimize sometimes even better than hand-written C code, so here the answer would be "don't use extensions". Now, if as a part of (2) you are doing some lengthy processing entirely outside PyPy, this might still be just as fast as with CPython and CPyExt, but if the calls to your foreign functions are short and/or you are transferring a lot of data C <-> PyPy, then there you go... Personally, I've been using CPyExt and I'm very happy about it, because the function calls take a long time, and whatever happens outside doesn't have much to do with objects in PyPy land.
However, if my requirements were different, I would rather have re-written everything using cffi; from what I understand, it can deliver comparable performance to cPython, and it also works for both PyPy and cPython, not just PyPy... -- Sincerely yours, Yury V. Zaytsev From kmod at dropbox.com Wed Mar 26 21:47:11 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Wed, 26 Mar 2014 13:47:11 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395822686.2779.12.camel@newpride> References: <1395822686.2779.12.camel@newpride> Message-ID: Hi all, thanks for the responses, but I guess I should have been more explicit -- I'm curious about *why* PyPy is slow on existing extension modules and why people are being steered away from them. I completely support the push to move away from CPython extension modules, but I'm not sure it's reasonable to expect that programmers will rewrite all the extension modules they use. Put another way, I understand how having a JIT-understandable cffi module will be faster on PyPy than an extension module, but what I don't quite understand is why CPython extension modules have to be slower on PyPy than they are on CPython. I'm not saying that extension modules should be sped up by PyPy, but I'm curious why they have a reputation for being slower. On Wed, Mar 26, 2014 at 1:31 AM, Yury V. Zaytsev wrote: > On Tue, 2014-03-25 at 16:19 -0700, Kevin Modzelewski wrote: > > > > I'm curious, since I've heard a number of people mention that > > extension modules are the primary reason that PyPy is slower than > > cPython for their code; definitely an improvement over "PyPy doesn't > > run my code at all", but it's made me curious about whether or not / > > why it has to be that way. > > In my opinion, it all depends on how you use CPyExt and what your > extension modules are for. There are two scenarios here (or combinations > thereof) that I think cover most of the use cases: > > 1) You use C extensions to make it faster.
> 2) You use C extensions to steer external processes. > > Ideally with PyPy you should be able to drop (1) altogether and write > nice Python code that JIT will be able to optimize sometimes even better > than hand-written C code, so here the answer would be "don't use > extensions". > > Now, if as a part of (2) you are doing some lengthy processing entirely > outside PyPy, this might still just as fast as with CPython with CPyExt, > but if the calls to your foreign functions are short and/or you are > transferring a lot of data C <-> PyPy, then there you go... > > Personally, I've been using CPyExt and I'm very happy about it, because > the function calls take a long time, and whatever happens outside > doesn't have much to do with objects in PyPy land. > > However, if my requirements were different, I would have rather > re-written everything using cffi, from what I understood it can deliver > comparable performance to cPython, and also it works both for PyPy and > cPython, not just PyPy... > > -- > Sincerely yours, > Yury V. Zaytsev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Wed Mar 26 21:52:39 2014 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 26 Mar 2014 13:52:39 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> Message-ID: <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> On Wed, Mar 26, 2014, at 13:47, Kevin Modzelewski wrote: > Hi all, thanks for the responses, but I guess I should have been more > explicit -- I'm curious about *why* PyPy is slow on existing extension > modules and why people are being steered away from them. I completely > support the push to move away from CPython extension modules, but I'm not > sure it's reasonable to expect that programmers will rewrite all the > extension modules they use. 
> > Put another way, I understand how having a JIT-understandable cffi module > will be faster on PyPy than an extension module, but what I don't quite > understand is why CPython extension modules have to be slower on PyPy > than > they are on CPython. I'm not saying that extension modules should be > sped > up by PyPy, but I'm curious why they have a reputation for being slower. There are several reasons. Two of the most important are: 1) PyPy's internal representation of objects is different from CPython's, so a conversion cost must be paid every time objects pass between pure Python and C. Unlike CPython, extensions with PyPy can't poke around directly in data structures. Macros like PyList_SET_ITEM have to become function calls. 2) Bridging the gap between PyPy's GC and CPython's ref counting requires a lot of bookkeeping. From lac at openend.se Wed Mar 26 23:29:33 2014 From: lac at openend.se (Laura Creighton) Date: Wed, 26 Mar 2014 23:29:33 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: Message from Kevin Modzelewski of "Wed, 26 Mar 2014 13:47:11 -0700." References: <1395822686.2779.12.camel@newpride> Message-ID: <201403262229.s2QMTXkK024725@fido.openend.se> Your C-extensions come all bundled up with a whole lot of gorp which is designed to make them play nicely in a ref-counting environment. Ref counting is a very slow way to do GC. Sometimes -- really, really, really hideously slow. You are sometimes _way_ better off writing python code instead -- pypy with the jit turned off outperforms CPython purely on the benefits of not doing ref-counting, and pypy really needs the jit to be fast. There is a bit of conceptual confusion here -- on the one hand, because C extensions often were written for reasons of performance when compared to CPython, there is a tendency to believe that C-extensions are, pretty much by definition, fast. And the other thing is a sort of reflexive belief that 'if it is in C (or C++) then it has to be fast'.
Both of these ideas are wrong. A whole lot of C extensions are actually really, really slow. They are just faster than CPython -- or perhaps 'faster than CPython was when I wrote this thing', which isn't, after all, that hard a target to meet. When PyPy finds a C extension which is working very hard to pretend it is a set of Python objects that can be refcounted, it isn't brilliant enough to be able to throw away all the ref-counting fakery, intuit what the code really is trying to do here, and just run that bit. That's too hard. Instead it decides to play along with the ref-counting faking. So we are at 'watch the elephant tap-dance' time ... it doesn't have to do a very good (read fast) job at this; it is amazing that it does it at all. Laura From kmod at dropbox.com Thu Mar 27 05:17:23 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Wed, 26 Mar 2014 21:17:23 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> Message-ID: On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson wrote: > > There are several reasons. Two of the most important are > 1) PyPy's internal representation of objects is different from > CPython's, so a conversion cost must be paid every time objects pass > between pure Python and C. Unlike CPython, extensions with PyPy can't > poke around directly in data structures. Macros like PyList_SET_ITEM > have to become function calls. > Hmm interesting... I'm not sure I follow, though, why a call to PyList_SET_ITEM on a PyPy list can't know about the PyPy object representation. Again, I understand how it's not necessarily going to be as fast as pure-python code, but I don't understand why PyList_SET_ITEM on PyPy needs to be slower than on CPython.
Is it because PyPy uses more complicated internal representations, expecting the overhead to be elided by the JIT? Also, I'm assuming that CPyExt gets to do a recompilation of the extension module; I could definitely understand how there could be significant overhead if this was being done as an ABI compatibility layer. 2) Bridging the gap between PyPy's GC and CPython's ref counting requires a lot of bookkeeping. > From a personal standpoint I'm also curious about how much of this overhead is fundamental, and how much could be alleviated with (potentially significant) implementation effort. I know PyPy has a precise GC, but I wonder if using a conservative GC could change the situation dramatically if you were able to hook the extension module's allocator and switch it to using the conservative GC. That's my plan, at least, which is one of the reasons I've been curious about the issues that PyPy has been running into, since I'm curious about how much will be applicable. kmod -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Thu Mar 27 05:32:05 2014 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 26 Mar 2014 21:32:05 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> Message-ID: <1395894725.517.99438153.216B4835@webmail.messagingengine.com> On Wed, Mar 26, 2014, at 21:17, Kevin Modzelewski wrote: > On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson > wrote: > > > > > There are several reasons. Two of the most important are > > 1) PyPy's internal representation of objects is different from > > CPython's, so a conversion cost must be paid every time objects pass > > between pure Python and C. Unlike CPython, extensions with PyPy can't > > poke around directly in data structures. Macros like PyList_SET_ITEM > > have to become function calls.
> > > > Hmm interesting... I'm not sure I follow, though, why the calling > PyList_SET_ITEM on a PyPy list can't know about the PyPy object > representation. Again, I understand how it's not necessarily going to be > as fast as pure-python code, but I don't understand why PyList_SET_ITEM > on > PyPy needs to be slower than on CPython. Is it because PyPy uses more > complicated internal representations, expecting the overhead to be elided > by the JIT? Let's continue with the list example. pypy lists use an array as the underlying data structure like CPython, but the similarity stops there. You can't just have random C code putting things in pypy lists. The internal representation of the list might be unwrapped integers, not pointers to int objects like CPython lists. There also need to be GC barriers. The larger picture is that building a robust CPython compatibility layer is difficult and error-prone compared to the solution of rewriting C extensions in Python (possibly with cffi). > > Also, I'm assuming that CPyExt gets to do a recompilation of the > extension > module; Yes > 2) Bridging the gap between PyPy's GC and CPython's ref counting > > requires a lot of bookkeeping. > > > > From a personal standpoint I'm also curious about how much of this > overhead > is fundamental, and how much could be alleviated with (potentially > significant) implementation effort. I know PyPy has a precise GC, but I > wonder if using a conservative GC could change the situation dramatically > if you were able to hook the extension module's allocator and switch it > to > using the conservative GC. That's my plan, at least, which is one of the > reasons I've been curious about the issues that PyPy has been running > into > since I'm curious about how much will be applicable. Conservative GCs are evil and slow. :) I don't know what you mean by the "extension module's allocator". That's a fairly global thing.
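[Editor's note: Benjamin's point about unwrapped integers can be pictured with a toy model of PyPy's list strategies. The real implementation lives in RPython inside the interpreter; the class below is only an illustration, with hypothetical names. While every element is an int, storage is a compact unboxed array; appending anything else silently switches to a generic object layout. C code compiled against one fixed struct layout, as a PyList_SET_ITEM macro would be, cannot follow such a switch.]

```python
from array import array

class StrategyList:
    """Toy model of a strategy-switching list (illustration only)."""

    def __init__(self):
        self.strategy = "int"
        self.storage = array("q", [])   # unboxed 64-bit machine integers

    def append(self, value):
        if self.strategy == "int":
            if isinstance(value, int):
                self.storage.append(value)   # stays unboxed
                return
            # Generalize: box the ints back into ordinary objects.
            self.storage = list(self.storage)
            self.strategy = "object"
        self.storage.append(value)

    def getitem(self, index):
        return self.storage[index]

lst = StrategyList()
lst.append(1)
lst.append(2)
print(lst.strategy)     # int
lst.append("hello")     # forces the switch to the generic layout
print(lst.strategy)     # object
print(lst.getitem(0))   # 1
```

A real PyList_SET_ITEM on PyPy therefore has to be a function call that first forces the list into a layout C code can see, which is part of the conversion cost Benjamin mentions.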
From kmod at dropbox.com Thu Mar 27 05:51:00 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Wed, 26 Mar 2014 21:51:00 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395894725.517.99438153.216B4835@webmail.messagingengine.com> References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> <1395894725.517.99438153.216B4835@webmail.messagingengine.com> Message-ID: On Wed, Mar 26, 2014 at 9:32 PM, Benjamin Peterson wrote: > > > On Wed, Mar 26, 2014, at 21:17, Kevin Modzelewski wrote: > > On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson > > wrote: > > > > > > > > There are several reasons. Two of the most important are > > > 1) PyPy's internal representation of objects is different from > > > CPython's, so a conversion cost must be payed every time objects pass > > > between pure Python and C. Unlike CPython, extensions with PyPy can't > > > poke around directly in data structures. Macros like PyList_SET_ITEM > > > have to become function calls. > > > > > > > Hmm interesting... I'm not sure I follow, though, why the calling > > PyList_SET_ITEM on a PyPy list can't know about the PyPy object > > representation. Again, I understand how it's not necessarily going to be > > as fast as pure-python code, but I don't understand why PyList_SET_ITEM > > on > > PyPy needs to be slower than on CPython. Is it because PyPy uses more > > complicated internal representations, expecting the overhead to be elided > > by the JIT? > > Let's continue with the list example. pypy lists use an array as the > underlying data structure like CPython, but the similarity stops there. > You can't just have random C code putting things in pypy lists. The > internal representation of the list might be unwrapped integers, not > points to int objects like CPython lists. There also needs to be GC > barriers. 
> > The larger picture is that building a robust CPython compatibility layer > is difficult and error-prone compared to the solution of rewriting C > extensions in Python (possibly with cffi). > Using that logic, I would counter that building a JIT for a dynamic language is difficult and error-prone compared to rewriting your dynamic language programs in a faster language :) The benefit to supporting it in your runtime is 1) you only do the work once, and 2) you get to support existing code out there. I'm writing not from the standpoint of saying "I have an extension module and I want it to run quickly", but rather "what do you guys think about the (presumed) situation of extension modules being a key blocker of PyPy adoption". While I'd love the world to migrate to a better solution overnight, I don't think that's realistic -- just look at the state of Python 3, which has a much larger constituency pushing much harder for it, and presumably has lower switching costs than rewriting C extensions in Python. > > > > Also, I'm assuming that CPyExt gets to do a recompilation of the > > extension > > module; > > Yes > > > 2) Bridging the gap between PyPy's GC and CPython's ref counting > > > > requires a lot of bookkeeping. > > > > > > > From a personal standpoint I'm also curious about how much of this > > overhead > > is fundamental, and how much could be alleviated with (potentially > > significant) implementation effort. I know PyPy has a precise GC, but I > > wonder if using a conservative GC could change the situation dramatically > > if you were able to hook the extension module's allocator and switch it > > to > > using the conservative GC. That's my plan, at least, which is one of the > > reasons I've been curious about the issues that PyPy has been running > > into > > since I'm curious about how much will be applicable. > > Conservative GCs are evil and slow. :) > > I don't know what you mean by the "extension module's allocator". 
That's > a fairly global thing. > I'm assuming that you can hook out malloc and mmap to be calls to the GC allocator; I've seen other projects do this, though I don't know how robust it is. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Thu Mar 27 06:05:28 2014 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 26 Mar 2014 22:05:28 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> <1395894725.517.99438153.216B4835@webmail.messagingengine.com> Message-ID: <1395896728.7256.99445357.0EBA473A@webmail.messagingengine.com> On Wed, Mar 26, 2014, at 21:51, Kevin Modzelewski wrote: > On Wed, Mar 26, 2014 at 9:32 PM, Benjamin Peterson > wrote: > > > > > > > On Wed, Mar 26, 2014, at 21:17, Kevin Modzelewski wrote: > > > On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson > > > wrote: > > > > > > > > > > > There are several reasons. Two of the most important are > > > > 1) PyPy's internal representation of objects is different from > > > > CPython's, so a conversion cost must be payed every time objects pass > > > > between pure Python and C. Unlike CPython, extensions with PyPy can't > > > > poke around directly in data structures. Macros like PyList_SET_ITEM > > > > have to become function calls. > > > > > > > > > > Hmm interesting... I'm not sure I follow, though, why the calling > > > PyList_SET_ITEM on a PyPy list can't know about the PyPy object > > > representation. Again, I understand how it's not necessarily going to be > > > as fast as pure-python code, but I don't understand why PyList_SET_ITEM > > > on > > > PyPy needs to be slower than on CPython. Is it because PyPy uses more > > > complicated internal representations, expecting the overhead to be elided > > > by the JIT? > > > > Let's continue with the list example. 
pypy lists use an array as the > > underlying data structure like CPython, but the similarity stops there. > > You can't just have random C code putting things in pypy lists. The > > internal representation of the list might be unwrapped integers, not > > pointers to int objects like CPython lists. There also need to be GC > > barriers. > > > > The larger picture is that building a robust CPython compatibility layer > > is difficult and error-prone compared to the solution of rewriting C > > extensions in Python (possibly with cffi). > > > > Using that logic, I would counter that building a JIT for a dynamic > language is difficult and error-prone compared to rewriting your dynamic > language programs in a faster language :) The benefit to supporting it > in > your runtime is 1) you only do the work once, and 2) you get to support > existing code out there. I don't want to argue that an amazingly fast CPython API compatibility layer isn't possible, but current experience suggests that creating it will be painful. It's hard to get excited about building compatibility layers when there are shiny JITs to be made. > > I'm writing not from the standpoint of saying "I have an extension module > and I want it to run quickly", but rather "what do you guys think about > the > (presumed) situation of extension modules being a key blocker of PyPy > adoption". While I'd love the world to migrate to a better solution > overnight, I don't think that's realistic -- just look at the state of > Python 3, which has a much larger constituency pushing much harder for > it, > and presumably has lower switching costs than rewriting C extensions in > Python. Yes, but you get to use PyPy and get super fast Python code, whereas your code gets no faster by porting to Python 3. Plus you get rid of C! The incentives are a bit better.
> > > > > > > > Also, I'm assuming that CPyExt gets to do a recompilation of the > > > extension > > > module; > > > > Yes > > > > > 2) Bridging the gap between PyPy's GC and CPython's ref counting > > > > > > requires a lot of bookkeeping. > > > > > > > > > > From a personal standpoint I'm also curious about how much of this > > > overhead > > > is fundamental, and how much could be alleviated with (potentially > > > significant) implementation effort. I know PyPy has a precise GC, but I > > > wonder if using a conservative GC could change the situation dramatically > > > if you were able to hook the extension module's allocator and switch it > > > to > > > using the conservative GC. That's my plan, at least, which is one of the > > > reasons I've been curious about the issues that PyPy has been running > > > into > > > since I'm curious about how much will be applicable. > > > > Conservative GCs are evil and slow. :) > > > > I don't know what you mean by the "extension module's allocator". That's > > a fairly global thing. > > > > I'm assuming that you can hook out malloc and mmap to be calls to the GC > allocator; I've seen other projects do this, though I don't know how > robust > it is. That's the easy part. The hard part is keeping your precise GC informed of native C doing arbitrary things. 
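[Editor's note: the bookkeeping Benjamin refers to can be pictured with a toy model. The names and structure below are hypothetical; cpyext's real machinery is far more involved. The idea: every object handed to C gets a handle with its own C-side refcount, and that entry pins the object so the tracing GC cannot collect it while C still holds references.]

```python
class RefBridge:
    """Toy model of bridging a tracing GC with C-style refcounting."""

    def __init__(self):
        self._pinned = {}      # handle -> [object, C-side refcount]
        self._next_handle = 1

    def to_c(self, obj):
        """Hand an object to C land: pin it with an initial refcount of 1."""
        handle = self._next_handle
        self._next_handle += 1
        self._pinned[handle] = [obj, 1]
        return handle

    def incref(self, handle):
        self._pinned[handle][1] += 1

    def decref(self, handle):
        entry = self._pinned[handle]
        entry[1] -= 1
        if entry[1] == 0:
            # C no longer references the object; unpin it so the
            # tracing GC is free to reclaim it again.
            del self._pinned[handle]

    def from_c(self, handle):
        """Look an object back up when C passes its handle to Python."""
        return self._pinned[handle][0]

bridge = RefBridge()
h = bridge.to_c([1, 2, 3])
bridge.incref(h)            # e.g. C stored the pointer somewhere
bridge.decref(h)
print(bridge.from_c(h))     # [1, 2, 3]
bridge.decref(h)            # count drops to 0: the pin is released
```

Every crossing of the Python/C boundary touches a table like this, which is one reason short, frequent calls into an extension are the worst case for cpyext.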
From arigo at tunes.org Fri Mar 28 08:59:12 2014 From: arigo at tunes.org (Armin Rigo) Date: Fri, 28 Mar 2014 08:59:12 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395896728.7256.99445357.0EBA473A@webmail.messagingengine.com> References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> <1395894725.517.99438153.216B4835@webmail.messagingengine.com> <1395896728.7256.99445357.0EBA473A@webmail.messagingengine.com> Message-ID: Hi all, I'd like to point Kevin to the thread "cpyext performance" of July-August 2012, in which we did some explanation of what is slow about cpyext and could potentially be improved. As others have mentioned here again, we can't reasonably hope to get them to the same speed as on CPython, but "someone" could at least work on lowering the difference. (Nobody did so far.) https://mail.python.org/pipermail/pypy-dev/2012-July/010263.html A small note about PyList_SET_ITEM(): this is impossible to keep as a macro or even to write as C code. There are some practical reasons as mentioned here, but the most fundamental reason imho is that doing so means throwing the flexibility of PyPy out of the window. I'm talking here about adding new implementations of list objects (we already have several ones, e.g. for lists-of-integers or for range()), about changing the GC, and so on. In other words: of course it is possible (if hard) to write the complete logic of PyList_SET_ITEM as C code, and even as a C macro. The point is that if we did that, then we'd give up on the possibility of ever changing any of these other aspects, or at least require painful adaptation every time we want to change them. (And yes, we do change them from time to time. For example, the STM branch we're working on has a different GC, and we had to look inside cpyext exactly zero times to make it work.) À bientôt, Armin.
From arigo at tunes.org Fri Mar 28 09:51:42 2014 From: arigo at tunes.org (Armin Rigo) Date: Fri, 28 Mar 2014 09:51:42 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <201403262229.s2QMTXkK024725@fido.openend.se> References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: Hi Laura, On 26 March 2014 23:29, Laura Creighton wrote: > really, really hideously slow. You are sometimes _way_ > better off writing python code instead -- pypy with the jit turned off > outperforms CPython purely on the benefits of not doing ref-counting, and > pypy really needs the jit to be fast. That detail is wrong. PyPy with the JIT turned off is 1.5x to 2x slower than CPython. (The reason is just the higher level of RPython versus hand-optimized C.) It is a common misconception to blame reference counting for the general slowness of standard interpreters of Python-like languages; I've seen it a few times in various contexts. But it's wrong: manual reference counting as in CPython doesn't cost much. This is probably because the reference counting increment/decrement end up as a very small fraction of the total runtime in this case. Also, the malloc() implementation used probably helps (it is very close to the one for the standard malloc() on Linux, which is really efficient for a lot of usage patterns). On the other hand, the reason PyPy doesn't include refcounting is that it's not reasonable at the level of the RPython implementation. Our RPython code assumes a GC *all the time*, e.g. when dealing with internal lists; by contrast, the CPython implementation assumes no GC and tries hard to put everything as stack objects, only occasionally "giving up" and using a malloc, like it is common in C. So incref/decref "everywhere in PyPy" means at many, many more places than "everywhere in CPython". In PyPy it is enough to kill performance completely (as we measured a long time ago). 
So I would say that the difference between refcounting and a "real GC" is that the latter scales nicely to higher levels of memory pressure than refcounting; but refcounting is fine if you can manually optimize where increfs are needed and where they are not, and if you manually avoid creating garbage as much as possible. À bientôt, Armin. From lac at openend.se Fri Mar 28 10:48:01 2014 From: lac at openend.se (Laura Creighton) Date: Fri, 28 Mar 2014 10:48:01 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: Message from Armin Rigo of "Fri, 28 Mar 2014 09:51:42 +0100." References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: <201403280948.s2S9m1Vb008983@fido.openend.se> In a message of Fri, 28 Mar 2014 09:51:42 +0100, Armin Rigo writes: >Hi Laura, Ah, thank you. I actually thought we had found the odd example where a better gc beat CPython performance even when the jit was off. Am I completely wrong about this? Or is it just that it is so rare it doesn't matter? Thank you for teaching me this, in any case. Sorry for the misinformation. Laura From mount.sarah at gmail.com Fri Mar 28 11:08:18 2014 From: mount.sarah at gmail.com (Sarah Mount) Date: Fri, 28 Mar 2014 10:08:18 +0000 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: Hi all, On Fri, Mar 28, 2014 at 8:51 AM, Armin Rigo wrote: > Hi Laura, > > On 26 March 2014 23:29, Laura Creighton wrote: > > really, really hideously slow. You are sometimes _way_ > > better off writing python code instead -- pypy with the jit turned off > > outperforms CPython purely on the benefits of not doing ref-counting, and > > pypy really needs the jit to be fast. > > That detail is wrong. PyPy with the JIT turned off is 1.5x to 2x > slower than CPython. (The reason is just the higher level of RPython > versus hand-optimized C.)
> > It is a common misconception to blame reference counting for the > general slowness of standard interpreters of Python-like languages; > I've seen it a few times in various contexts. But it's wrong: manual > reference counting as in CPython doesn't cost much. This is probably > because the reference counting increment/decrement end up as a very > small fraction of the total runtime in this case. Also, the malloc() > implementation used probably helps (it is very close to the one for > the standard malloc() on Linux, which is really efficient for a lot of > usage patterns). > > On the other hand, the reason PyPy doesn't include refcounting is that > it's not reasonable at the level of the RPython implementation. Our > RPython code assumes a GC *all the time*, e.g. when dealing with > internal lists; by contrast, the CPython implementation assumes no GC > and tries hard to put everything as stack objects, only occasionally > "giving up" and using a malloc, like it is common in C. So > incref/decref "everywhere in PyPy" means at many, many more places > than "everywhere in CPython". In PyPy it is enough to kill > performance completely (as we measured a long time ago). > > This is a really interesting discussion, thanks for spelling out the details so clearly. Did the measurements you refer to get published anywhere? Thanks, Sarah -- Sarah Mount, Senior Lecturer, University of Wolverhampton website: http://www.snim2.org/ twitter: @snim2 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From cfbolz at gmx.de Fri Mar 28 13:21:41 2014 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 28 Mar 2014 13:21:41 +0100 Subject: [pypy-dev] ICOOOLPS 2014 call for papers Message-ID: <53356955.9090003@gmx.de> ======================================================================== 9th ICOOOLPS Workshop Implementation, Compilation, Optimization of OO Languages, Programs and Systems July 28th 2014, Uppsala, Sweden Colocated with ECOOP http://soft-dev.org/events/icooolps14/ ======================================================================== Overview The ICOOOLPS workshop series brings together researchers and practitioners working in the field of OO languages implementation and optimization. ICOOOLPS key goal is to identify current and emerging issues relating to the efficient implementation, compilation and optimization of such languages, and outlining future challenges and research directions. Topics of interest for ICOOOLPS include, but are not limited to: implementation of fundamental OO and OO-like features (e.g. inheritance, parametric types, memory management, objects, prototypes), runtime systems (e.g. compilers, linkers, virtual machines, garbage collectors), optimizations (e.g. static or dynamic analyses, adaptive virtual machines), resource constraints (e.g. time for real-time systems, space or low-power for embedded systems) and relevant choices and tradeoffs (e.g. constant time vs. non-constant time mechanisms, separate compilation vs. global compilation, dynamic loading vs. global linking, dynamic checking vs. proof-carrying code...). Submissions ICOOOLPS is not a mini-conference; it is a workshop designed to facilitate discussion and the exchange of ideas between peers. ICOOOLPS therefore welcomes both position (1-4 pages) and research (max. 10 pages) papers. Position papers should outline interesting or unconventional ideas, which need not be fully fleshed out.
Research papers are expected to contain more complete ideas, but these need not necessarily be fully complete as with a traditional conference. Authors will be given the option to publish their papers (short or long) in the ACM Digital Library if they wish. Submissions must be written in English, formatted according to ACM SIG Proceedings style. Please submit via EasyChair (link on the ICOOOLPS website). Important dates Submission: May 5th 2014 (FIRM DEADLINE) Notification: May 26th 2014 Workshop: July 28th 2014 Programme chairs Laurence Tratt, King's College London, UK Olivier Zendra, INRIA Nancy, France e-mail: icooolps14 at easychair.org Programme committee Carl Friedrich Bolz, King's College London, UK Eric Jul, University of Copenhagen, DK José Manuel Redondo López, Universidad de Oviedo, ES Stefan Marr, INRIA Lille, FR Floréal Morandat, Labri, FR Todd Mytkowicz, Microsoft, US Tobias Pape, Hasso-Plattner-Institut Potsdam, DE Ian Rogers, Google, US Jeremy Singer, University of Glasgow, UK Jan Vitek, Purdue University, US Mario Wolczko, Oracle Labs, US From arigo at tunes.org Sat Mar 29 09:23:04 2014 From: arigo at tunes.org (Armin Rigo) Date: Sat, 29 Mar 2014 09:23:04 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <201403280948.s2S9m1Vb008983@fido.openend.se> References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> <201403280948.s2S9m1Vb008983@fido.openend.se> Message-ID: Hi Laura, On 28 March 2014 10:48, Laura Creighton wrote: > Ah, thank you. I actually thought we had found the odd example where > a better gc beat CPython performance even when the jit was off. I am > completely wrong about this? Or is it just that it is so rare it doesn't > matter? Ah, no, you are correct, sorry. Running "gcbench.py" we see a different behavior than CPython.
This benchmark runs 7 tests which, on CPython, take from 0.9 increasing to 3.5 seconds each; and on PyPy (without the JIT) it increases from 1.7 to 1.9 seconds only. The last three of these 7 tests are slower on CPython. I'm unsure if we ever had a precise explanation for why. In general it's one of the only examples we have. It seems to show that even CPython is, most of the time, not making demands on its GC that exceed what reference counting is good for. (There might be other cases that could appear in big applications, like creating tons and tons of cyclic garbage, but we don't really know, I think.) A bientôt, Armin. From arigo at tunes.org Sat Mar 29 09:29:11 2014 From: arigo at tunes.org (Armin Rigo) Date: Sat, 29 Mar 2014 09:29:11 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: Hi Sarah, On 28 March 2014 11:08, Sarah Mount wrote: > This is a really interesting discussion, thanks for spelling out the > details so clearly. Did the measurements you refer to get published > anywhere? Yes, in https://bitbucket.org/pypy/extradoc/raw/tip/eu-report/D07.1_Massive_Parallelism_and_Translation_Aspects-2007-02-28.pdf , section Reference counting. A bientôt, Armin. From lac at openend.se Sat Mar 29 09:52:56 2014 From: lac at openend.se (Laura Creighton) Date: Sat, 29 Mar 2014 09:52:56 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: Message from Armin Rigo of "Sat, 29 Mar 2014 09:23:04 +0100." References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> <201403280948.s2S9m1Vb008983@fido.openend.se> Message-ID: <201403290852.s2T8quRb015305@fido.openend.se> Ok -- the lesson I took from this is 'ref counting hurts performance'. Why was that the wrong inference to make?
Laura From arigo at tunes.org Sat Mar 29 10:57:35 2014 From: arigo at tunes.org (Armin Rigo) Date: Sat, 29 Mar 2014 10:57:35 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <201403290852.s2T8quRb015305@fido.openend.se> References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> <201403280948.s2S9m1Vb008983@fido.openend.se> <201403290852.s2T8quRb015305@fido.openend.se> Message-ID: Hi Laura, On 29 March 2014 09:52, Laura Creighton wrote: > Ok -- the lesson I took from this is 'ref counting hurts performance'. Why > was that the wrong inference to make? No, the lesson is "unoptimized refcounting hurts performance". As I've explained above, optimized refcounting like in CPython seems to be good enough for almost all use cases, with the sole known exception being some cases of gcbench.py. I think that since this 2007 report we never were really interested enough to know the details more precisely. If someone wants to go to the bottom of the question, he'd need to write an automatic optimizer to remove most incref/decref in PyPy, and add a cycle detector. Or do the opposite: scrap reference counting in CPython and replace the Py_Incref/Py_Decref with, say, shadow stack saving/restoring (search for _du_save/_du_restore in https://bitbucket.org/pypy/stmgc/raw/default/duhton for an example). Either way, it looks unlikely that he'd get something that is either simpler or more performant (here I'm ignoring CPython+Psyco which muddles the picture). I think that each of CPython and PyPy has right now the kind of GC that is best suited for them. A bientôt, Armin.
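The "tons and tons of cyclic garbage" case mentioned above is easy to sketch; pure reference counting can never reclaim such objects, which is why CPython backs its refcounts with a cycle detector (a minimal illustration, not related to gcbench.py itself):

```python
import gc

class Node(object):
    def __init__(self):
        self.ref = None

# Two objects pointing at each other: each keeps the other's refcount
# above zero, so reference counting alone can never free the pair.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b

# It is CPython's cyclic garbage collector that reclaims them.
gc.collect()  # returns the number of unreachable objects it found
```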
From techtonik at gmail.com Sat Mar 29 12:49:00 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sat, 29 Mar 2014 14:49:00 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans Message-ID: I know that the C in CFFI stands for the C way of doing things, so I hope people won't try to defend that position and instead try to think about what if we had to re-engineer ABI access from scratch, for an explicit and obvious debug binary interface. CFFI is not useful for Python programmers and here is why. The primary reason is that it requires you to know C. And knowing C requires you to know about OS architecture. And knowing about OS architecture requires you to know about the ABI, which is: http://stackoverflow.com/a/3784697 This is how the compiler builds an application. It defines things (but is not limited to): How parameters are passed to functions (registers/stack). Who cleans parameters from the stack (caller/callee). Where the return value is placed for return. How exceptions propagate. The problematic part of it is that you need to think of the OS ABI in terms of unusual C abstractions, coming through several levels of them. Suppose you know the OS ABI and you know that you need direct physical memory access to set bytes for a certain call in this way: 0024: 00 00 00 6C 33 33 74 00 How would you do this in Python? The most obvious way is with a byte string - \x00\x00\x00\x6c\x33\x33\x74\x00 - but that's not how you prepare the data for the call if, for example, 00 6C means anything to you. What is the Python way to convert 00 6C to a convenient Python data structure and back, and is it Pythonic (user friendly and intuitive)? import struct struct.unpack('wtf?', '\x00\x6C') If you try to look up the magic string in the struct docs: http://docs.python.org/2/library/struct.html#format-characters You'll notice that the mapping between possible combinations of these 2 bytes and some Python type is very mystic.
First it requires you to choose either "short" or "unsigned short", but that's not enough for parsing binary data - you need to figure out the proper "endianness" and make up a magic string for it. This is just for two bytes. Imagine a definition for a binary protocol with variable message size and nested data structures. You won't be able to understand it by reading Python code. More than that - Python *by default* uses platform specific "endianness", it is uncertain (implicit) about it, so not only should you care about "endianness", but you should also be an expert to find out which is the correct metric for you. Look at this: 0024: 00 00 00 6C 33 33 74 00 Where is "endianness", "alignment", "size" from this doc http://docs.python.org/2/library/struct.html#byte-order-size-and-alignment People need to *start* with this base and this concept and that's why it is harmful. CFFI proposes to provide a better interface to skip this complexity by getting back to the roots and using the C level. That's a pretty nice hack for C guys, I am sure it makes them completely happy, but for the academic side of the PyPy project, for the Python interpreter and other projects built over RPython, it is important to have a tool that allows experimenting with binary interfaces in a convenient, readable and direct way, and makes it easier for humans to understand (by reading Python code) how Python instructions are translated by the JIT into binary pieces in computer memory, pieces that will be processed by the operating system as a system function call on the ABI level. But let's not digress, and get back to the point that the struct module doesn't allow working with structured data. In Python the only alternative standard way to define a binary structure is ctypes.
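For what it's worth, the "magic string" can be made fully explicit about both width and byte order; parsing the eight example bytes above as, say, four big-endian 16-bit fields (the field layout here is an arbitrary assumption, purely for illustration) looks like this:

```python
import struct

data = b'\x00\x00\x00\x6C\x33\x33\x74\x00'

# '>' pins the byte order to big-endian and disables padding, so the
# result is identical on every platform; 'H' is an unsigned 16-bit int.
fields = struct.unpack('>4H', data)
print(fields)  # (0, 108, 13107, 29696)
```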
ctypes documentation is no better for a binary guy: http://docs.python.org/2/library/ctypes.html#fundamental-data-types See how that binary guy suffered to map binary data to Python structures through ctypes: https://bitbucket.org/techtonik/discovery/src/eacd864e6542f14039c9b31eecf94302f3ef99ec/graphics/gfxtablet/gfxtablet.py?at=default And I am saying that this is the best way available from the standard library. It is pretty close to Django models, but for binary data. ctypes still is worse than struct in one thing - looking into the docs, there are no size specifiers for any kind of C type, so there is no guarantee that 2 bytes won't be read as 4 bytes, or worse. By looking at the ctypes code it is hard to figure out the size of a structure and when it may change. I can hardly call the ctypes mapping process user friendly or the resulting code intuitive. Probably nobody could, and that's why CFFI was born. But CFFI took a different route - instead of trying to map C types to binary data (ABI level), it decided to go onto the API level. While it exposes many better tools, it basically means you are dealing with a C interface again - not with a Pythonic interface for binary data. I am not saying that CFFI is bad - I am saying that it is good, but not enough, and that it can be fixed with a cleanroom engineering approach for a broader scope of modern usage patterns for binary data than just calling the OS API in the C way. Why do we need it? I frankly think that the Stackless way of doing things without a C stack is the future, and the problem is that not many people can see how it works, or build alternative systems without the classic C stack with (R)Python. Can CFFI help with this? I doubt that. So, what am I proposing? Just an idea.
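(As an aside: ctypes does ship explicitly sized types - c_uint16 is two bytes and c_uint32 four bytes on every platform - so the Django-models-like mapping can be pinned down; a sketch with made-up field names, decoding the eight example bytes from earlier:)

```python
import ctypes

class Packet(ctypes.BigEndianStructure):
    _pack_ = 1  # no padding inserted between fields
    _fields_ = [
        ('tag',   ctypes.c_uint32),  # always 4 bytes
        ('value', ctypes.c_uint16),  # always 2 bytes
        ('flags', ctypes.c_uint16),  # always 2 bytes
    ]

p = Packet.from_buffer_copy(b'\x00\x00\x00\x6C\x33\x33\x74\x00')
print(p.tag, p.value, p.flags)  # 108 13107 29696
```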
Given the fact that I am mentally incapable of filling the 100-sheet requirement to get funding under H2020, the fact that no existing commercial body could be interested to support the development as an open source project and the fact that hacking on it alone might become boring, giving this idea is the least I can do. Cleanroom engineering. http://en.wikipedia.org/wiki/Cleanroom_software_engineering "The focus of the Cleanroom process is on defect prevention, rather than defect removal." When we talk about the Pythonic way of doing things, how can we define "a defect"? Basically, we are talking about user experience - the emotions that the user experiences when he uses Python for the given task. What is the task at hand? For me - it is working with binary data in Python - not just parsing save games, but creating binary commands such as OS system calls that are executed by a certain CPU, GPU or whatever is on the receiver end of whatever communication interface is used. This is a hardware independent and platform neutral way of doing things. So, the UX is the key, but the properties of an engineered product are not limited to a single task. The cleanroom approach allows concentrating on the defect - when the user experience starts to suffer because of the conflicts between tasks that users are trying to accomplish. For the PyPy project I see the value in a library for compositing binary structures in that these operations can be pipelined and optimized at run-time in a highly effective fashion. I think that a convenient binary tool is the missing brick in the basement of the academic PyPy infrastructure to enable universal interoperability from (R)Python with other digital systems by providing a direct interface to the binary world. I think that 1973-era views on "high level" and "low level" systems are a little bit outdated now that we have Python, Ruby, Erlang, etc. Now C is just not a very good intermediary for "low level" access.
But frankly, I do not think that with the advent of networking, binary can be called low level anymore. It is just another data format that can be as readable for humans as a program structure written in Python. P.S. I have some design ideas on how to make an attractive gameplay out of binary data by "coloring" regions and adding "multi-level context" to hex dumps. This falls out of the scope of this issue, and requires more drawing than texting, but if somebody wants to help me with sharing the vision - I would not object. It will help to make the binary world more accessible, especially for new people, who start coding with JavaScript and Python. -- anatoly t. From fijall at gmail.com Sat Mar 29 12:59:16 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sat, 29 Mar 2014 13:59:16 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: Message-ID: On Sat, Mar 29, 2014 at 1:49 PM, anatoly techtonik wrote: > I know what C in CFFI stands for C way of doing things, so > I hope people won't try to defend that position and instead > try to think about what if we have to re-engineer ABI access > from scratch, for explicit and obvious debug binary interface. > > > CFFI is not useful for Python programmers and here is why. Anatoly, such blank statements that are factually incorrect are the reason why people think your input is counterproductive. Please stop posting random ideas on pypy-dev if you have no interest in working on them. If you want to convince someone to implement your next idea, there are places to do that, but pypy-dev is about the development of PyPy and it's the wrong place to do so.
Cheers, fijal From markus at unterwaditzer.net Sat Mar 29 13:58:16 2014 From: markus at unterwaditzer.net (Markus Unterwaditzer) Date: Sat, 29 Mar 2014 13:58:16 +0100 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: Message-ID: <20140329125815.GA20269@untibox.unti> On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: > I know what C in CFFI stands for C way of doing things, so > I hope people won't try to defend that position and instead > try to think about what if we have to re-engineer ABI access > from scratch, for explicit and obvious debug binary interface. > > > CFFI is not useful for Python programmers and here is why. > > The primary reason is that it requires you to know C. You're using C if you're calling it from Python. Knowing the language (to some degree) when using it is inevitable. > And knowing C requires you to know about OS architecture. The PyPy team (especially fijal) has always strongly discouraged from porting Python code to C for performance. If you have a good reason to use C, it is not surprising that you're going to be confronted with the dangers of such a language. I am not sure if you're trying to make a point against C or CFFI here. I am also not sure if the rest of your post actually means anything, or if it is just way above my head. But given that you're throwing around with statements like "this is useless", i don't feel compelled or motivated to try to understand your ramblings. 
-- Markus From techtonik at gmail.com Sun Mar 30 10:49:40 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 11:49:40 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: Message-ID: On Sat, Mar 29, 2014 at 2:59 PM, Maciej Fijalkowski wrote: > On Sat, Mar 29, 2014 at 1:49 PM, anatoly techtonik wrote: >> I know what C in CFFI stands for C way of doing things, so >> I hope people won't try to defend that position and instead >> try to think about what if we have to re-engineer ABI access >> from scratch, for explicit and obvious debug binary interface. >> >> >> CFFI is not useful for Python programmers and here is why. > > Anatoly, such blank statements that are factually incorrect is the > reason why people think your input is counterproductive. This statement is incorrect out of context; the context in which it is correct is provided below "here is why.". If you know how to rephrase the statement differently to define the scope correctly, I am all ears. Unfortunately I am not a journalist or writer or a scientist who can write short and concise papers. I write how I think, and it is a problem to rewrite stuff in English to sound differently, mostly because of time constraints. > Please stop posting random ideas on pypy-dev, if you have no interest > in working on them. You're wrong in your assumption that I am not interested. Now that it is clear, let me restate that the problem is finding time to work on the problem, because productive hours are spent on paid chores. > If you want to convince someone to implement your > next idea, there are places to do that, but pypy-dev is about the > development of PyPy and it's the wrong place to do so. I just want to get feedback on the opinion, whether the next idea is good or not. Maybe there are people who had this idea before. I don't ask anybody to do this.
About others who can implement your ideas: I am convinced that people work on their own ideas in their free time, so collaboration only happens when ideas match. If you know places where this is not true, I'll be interested to know about them. If you can give a direct link - that will be productive, but I really doubt there are places where people spend time implementing the ideas of others just for good. -- anatoly t. From techtonik at gmail.com Sun Mar 30 11:05:06 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 12:05:06 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: <20140329125815.GA20269@untibox.unti> References: <20140329125815.GA20269@untibox.unti> Message-ID: On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer wrote: > On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: >> I know what C in CFFI stands for C way of doing things, so >> I hope people won't try to defend that position and instead >> try to think about what if we have to re-engineer ABI access >> from scratch, for explicit and obvious debug binary interface. >> >> >> CFFI is not useful for Python programmers and here is why. >> >> The primary reason is that it requires you to know C. > > You're using C if you're calling it from Python. Knowing the language (to some > degree) when using it is inevitable. This is the problem that I've tried to describe: All standard Python tools for ABI level access require C knowledge. >> And knowing C requires you to know about OS architecture. > > The PyPy team (especially fijal) has always strongly discouraged from > porting Python code to C for performance. If you have a good reason to use C, > it is not surprising that you're going to be confronted with the dangers of > such a language. I am not sure if you're trying to make a point against C or > CFFI here. Against C.
As I said, CFFI is good, but not enough to work conveniently with binary interfaces, and the reason for that is that it is C-centric. I support fijal - my position is that rewriting the same code in faster language is not a way to solve performance problems. Language as a problem is a failed smoke test for app architecture. > I am also not sure if the rest of your post actually means anything, or if it > is just way above my head. But given that you're throwing around with > statements like "this is useless", i don't feel compelled or motivated to try > to understand your ramblings. Fair point. Thanks for the feedback. Sometimes I feel like I should just stop wasting my time on ideas, and start eating some pills so that I could better concentrate on a mindless coding. -- anatoly t. From markus at unterwaditzer.net Sun Mar 30 11:20:11 2014 From: markus at unterwaditzer.net (Markus Unterwaditzer) Date: Sun, 30 Mar 2014 11:20:11 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: <20140330092011.GA28013@untibox.unti> On Sun, Mar 30, 2014 at 12:05:06PM +0300, anatoly techtonik wrote: > On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer > wrote: > > On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: > >> I know what C in CFFI stands for C way of doing things, so > >> I hope people won't try to defend that position and instead > >> try to think about what if we have to re-engineer ABI access > >> from scratch, for explicit and obvious debug binary interface. > >> > >> > >> CFFI is not useful for Python programmers and here is why. > >> > >> The primary reason is that it requires you to know C. > > > > You're using C if you're calling it from Python. Knowing the language (to some > > degree) when using it is inevitable. > > This is the problem that I've tried to describe: > > All standard Python tools for ABI level access require C knowledge. 
> > >> And knowing C requires you to know about OS architecture. > > > > The PyPy team (especially fijal) has always strongly discouraged from > > porting Python code to C for performance. If you have a good reason to use C, > > it is not surprising that you're going to be confronted with the dangers of > > such a language. I am not sure if you're trying to make a point against C or > > CFFI here. > > Against C. As I said, CFFI is good, but not enough to work conveniently with > binary interfaces, and the reason for that is that it is C-centric. I am not trying to dogmatize anything here, but i don't see a reason why efforts should be made to eliminate that property you're seeing as a problem, and i am not sure it'd be *worth it*. To me, the main usecase of CFFI seems to be embedding existing C libraries, not directly accessing ABIs. > > I support fijal - my position is that rewriting the same code in faster language > is not a way to solve performance problems. Language as a problem is a > failed smoke test for app architecture. > > > I am also not sure if the rest of your post actually means anything, or if it > > is just way above my head. But given that you're throwing around with > > statements like "this is useless", i don't feel compelled or motivated to try > > to understand your ramblings. > > Fair point. Thanks for the feedback. Sometimes I feel like I should just stop > wasting my time on ideas, and start eating some pills so that I could better > concentrate on a mindless coding. That's not what i meant. It doesn't matter whether your ideas are good or bad: The way you're formulating your ideas is incredibly insulting to authors of existing solutions. > -- > anatoly t. 
From roberto at unbit.it Sun Mar 30 11:26:01 2014 From: roberto at unbit.it (Roberto De Ioris) Date: Sun, 30 Mar 2014 11:26:01 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: <7f1bab2ea6358b5c15337bd15bf245d7.squirrel@manage.unbit.it> > On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer > wrote: >> On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: >>> I know what C in CFFI stands for C way of doing things, so >>> I hope people won't try to defend that position and instead >>> try to think about what if we have to re-engineer ABI access >>> from scratch, for explicit and obvious debug binary interface. >>> >>> >>> CFFI is not useful for Python programmers and here is why. >>> >>> The primary reason is that it requires you to know C. >> >> You're using C if you're calling it from Python. Knowing the language >> (to some >> degree) when using it is inevitable. > > This is the problem that I've tried to describe: > > All standard Python tools for ABI level access require C knowledge. > >>> And knowing C requires you to know about OS architecture. >> >> The PyPy team (especially fijal) has always strongly discouraged from >> porting Python code to C for performance. If you have a good reason to >> use C, >> it is not surprising that you're going to be confronted with the dangers >> of >> such a language. I am not sure if you're trying to make a point against >> C or >> CFFI here. > > Against C. As I said, CFFI is good, but not enough to work conveniently > with > binary interfaces, and the reason for that is that it is C-centric. > Hi, (disclaimer: i have worked a lot with cffi, and i basically love it because most of my projects are in C and i need to interface with them), i am not sure to follow you. You seem to mix C with binary/structures manipulation. 
CFFI is for integration between PyPy/Python and C interfaces, so using C (or something really near to it) is its main purpose. If your problem is with structure manipulation, i totally agree, the python world needs something better (unless i am missing some project regarding it), but this is totally irrelevant in the CFFI area. Regards -- Roberto De Ioris http://unbit.it From kennylevinsen at gmail.com Sun Mar 30 13:40:28 2014 From: kennylevinsen at gmail.com (Kenny Lasse Hoff Levinsen) Date: Sun, 30 Mar 2014 13:40:28 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: Okay, just to get things right: What you want is an only-ABI solution, which abstracts completely away from technical details, in a nice pythonic wrapper? Suggesting that idea is fine (although it is slightly off-topic on pypy-dev), but it has nothing to do with the C Foreign Function Interface (CFFI), which is C-centric as it focuses on interfacing with C. It allows for making very easy interfaces to C with very little code, which as a nice bonus also appears very clean. One can also easily make the argument that, when you're gluing two languages together, you have to know both to make the proper considerations about their use. You risk making improper use of returned memory if you don't know what's going on, and you'll have no clue how to debug it. But to sum it up: - You want a language independent ABI interface that looks completely pythonic from the user's point of view, and requires no knowledge of other languages - This has nothing to do with CFFI, which is very specifically - as the name implies - a C interface, which does its job very well. Correct? My personal opinion of the idea is that it is likely to be troublesome enough to be unfeasible and very unpleasant to code.
I also find it unlikely to work (without giving a load of trouble to the user), but that should not stop interested parties from trying. Regards, Kenny P.S.: Calling the work of others useless is a bad way to introduce an idea. (Unless it's an idea for "How to be hated for Dummies") > On 30/03/2014, at 11.05, anatoly techtonik wrote: > > On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer > wrote: >> On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: >>> I know what C in CFFI stands for C way of doing things, so >>> I hope people won't try to defend that position and instead >>> try to think about what if we have to re-engineer ABI access >>> from scratch, for explicit and obvious debug binary interface. >>> >>> >>> CFFI is not useful for Python programmers and here is why. >>> >>> The primary reason is that it requires you to know C. >> >> You're using C if you're calling it from Python. Knowing the language (to some >> degree) when using it is inevitable. > > This is the problem that I've tried to describe: > > All standard Python tools for ABI level access require C knowledge. > >>> And knowing C requires you to know about OS architecture. >> >> The PyPy team (especially fijal) has always strongly discouraged from >> porting Python code to C for performance. If you have a good reason to use C, >> it is not surprising that you're going to be confronted with the dangers of >> such a language. I am not sure if you're trying to make a point against C or >> CFFI here. > > Against C. As I said, CFFI is good, but not enough to work conveniently with > binary interfaces, and the reason for that is that it is C-centric. > > I support fijal - my position is that rewriting the same code in faster language > is not a way to solve performance problems. Language as a problem is a > failed smoke test for app architecture. > >> I am also not sure if the rest of your post actually means anything, or if it >> is just way above my head. 
But given that you're throwing around with >> statements like "this is useless", i don't feel compelled or motivated to try >> to understand your ramblings. > > Fair point. Thanks for the feedback. Sometimes I feel like I should just stop > wasting my time on ideas, and start eating some pills so that I could better > concentrate on a mindless coding. > -- > anatoly t. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From techtonik at gmail.com Sun Mar 30 13:47:24 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 14:47:24 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: <7f1bab2ea6358b5c15337bd15bf245d7.squirrel@manage.unbit.it> References: <20140329125815.GA20269@untibox.unti> <7f1bab2ea6358b5c15337bd15bf245d7.squirrel@manage.unbit.it> Message-ID: On Sun, Mar 30, 2014 at 12:26 PM, Roberto De Ioris wrote: > > Hi, (disclaimer: i have worked a lot with cffi, and i basically love it > because most of my projects are in C and i need to interface with them), i > am not sure to follow you. You seem to mix C with binary/structures > manipulation. CFFI is for integration between PyPy/Python and C > interfaces, so using C (or something really near to it) is its main > purpose. > > If your problem is with structures manipulations, i totally agree, the > python world need something better (unless i am missing some project > regarding it), but this is totally irrelevant in the CFFI area. I mix C with binary manipulation - that's right. I feel like the problem of efficiently making C calls is a subset of a more generic problem of converting Python data into corresponding binary data, because C call looks like a chunk of binary data in memory + register setup. 
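That view of a C call as a chunk of binary data in memory plus register setup is, in effect, what an ABI-level call boils down to; for the record, the standard library's ctypes can already make such a call at run time with no C source at all - a minimal sketch (Unix-only, since it pulls strlen out of the symbols of the running process):

```python
import ctypes

# Describe strlen's signature by hand and let ctypes assemble the
# machine-level call.  Get the signature wrong and memory is silently
# corrupted -- the danger that pushes CFFI toward its C-source API mode.
libc = ctypes.CDLL(None)  # symbols of the running process (libc on Unix)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```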
http://unixwiz.net/techtips/win32-callconv-asm.html I think that the PyPy JIT can do faster calls than C if it were possible to manipulate memory on this low level (for instance, providing pre-filled memory for a series of calls and then just modifying esp for each call). If CFFI included a toolset to construct such binary calls - it would not be limited to C anymore, and would open ways to engineer more effective paradigms (Stackless + channels) and better calling conventions for interop with other useful tools (Go etc.) on the binary level. I started the idea with CFFI, because there is no better base to start from. -- anatoly t. From techtonik at gmail.com Sun Mar 30 14:36:08 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 15:36:08 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: On Sun, Mar 30, 2014 at 2:40 PM, Kenny Lasse Hoff Levinsen wrote: > Okay, just to get things right: What you want is an only-ABI solution, which abstracts completely away from technical details, in a nice pythonic wrapper? Not really. I want a language independent ABI solution, yes. ABI-only implies that there is some alternative to that. I don't see any alternative - for me the ABI is the necessary basis for everything else on top. So doing ABI solution design without considering these use cases is impossible. I want a decoupled ABI level. A nice pythonic wrapper that abstracts completely from technical details is not the goal. The goal is to provide practical defaults for language and hardware independent abstraction. The primary object that the abstraction should work with is "platform-independent binary data", the method is "to be readable by humans". On the implementation level that means that by default there is no ambiguity in the syntax that defines binary data (size or endianness), and if there is a dependency on the platform (CPU bitness etc.)
it should be explicit, so that the behavior of structure should be clear (self-describing type names + type docs that list relevant platforms and effects on every platform). This approach inverts existing practice of using platform dependent binary structures by default. > But to sum it up: > - You want a language independent ABI interface that looks completely pythonic from the users point of view, and requires no knowledge of other languages Exactly. Python here means "intuitive and user-friendly" (which may appear different than indentations or YAML). > - This has nothing to do with CFFI, which is very specifically - as the name implies - a C interface, which does it's job very well. > > Correct? Yes CFFI does a perfect job on API level - I can't think of a better way to provide access to C API other than with C syntax. On ABI level tools can be more useful, and there is where idea intersects with CFFI. It doesn't cancel the fact that people need safe feet injury prevention interfaces. Look for my previous answer with the phrase "I mix C with binary manipulation" that covers this. > My personal opinion of the idea is that it is likely to be troublesome enough to be unfeasible and very unpleasant to code. I also find it unlikely to work (without giving a load of trouble to the user), but that should not stop interested parties from trying. Ack. Thanks for the feedback. From santagada at gmail.com Sun Mar 30 17:32:10 2014 From: santagada at gmail.com (Leonardo Santagada) Date: Sun, 30 Mar 2014 12:32:10 -0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: On Sun, Mar 30, 2014 at 9:36 AM, anatoly techtonik wrote: > On ABI level tools can be more useful, and there is where idea > intersects with CFFI. It doesn't cancel the fact that people need safe > feet injury prevention interfaces. 
> Look for my previous answer with
> the phrase "I mix C with binary manipulation" that covers this.
>

I think you want something like
https://pypi.python.org/pypi/construct/2.5.1right? It exists and is
pretty good... some guys in my company are using it to model a binary
audio control protocol, both to prototype and to test the embedded
implementation (which is in C).

--
Leonardo Santagada
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From njh at njhurst.com Sun Mar 30 17:29:05 2014
From: njh at njhurst.com (Nathan Hurst)
Date: Mon, 31 Mar 2014 02:29:05 +1100
Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans
In-Reply-To: 
References: <20140329125815.GA20269@untibox.unti>
Message-ID: <20140330152905.GA18471@ajhurst.org>

On Sun, Mar 30, 2014 at 03:36:08PM +0300, anatoly techtonik wrote:
> On Sun, Mar 30, 2014 at 2:40 PM, Kenny Lasse Hoff Levinsen
> wrote:
> > Okay, just to get things right: What you want is an only-ABI
> > solution, which abstracts completely away from technical details,
> > in a nice pythonic wrapper?
>
> Not really. I want a language-independent ABI solution, yes. ABI-only
> implies that there is some alternative to that. I don't see any
> alternative - for me, ABI is the necessary basis for everything else
> on top. So designing an ABI solution without considering these use
> cases is impossible. I want a decoupled ABI level.

If you want to work with pure binary data in Python, there are already
some good options: struct and numpy. The first is optimised for short
records, the latter for multidimensional arrays of binary data. I've
used both on occasion for binary files and cross-language
communication.

> A nice pythonic wrapper that abstracts completely from technical
> details is not the goal. The goal is to provide practical defaults
> for a language- and hardware-independent abstraction.
> The primary object
> the abstraction should work with is "platform-independent binary
> data"; the method is "to be readable by humans". On the
> implementation level that means that by default there is no ambiguity
> in the syntax that defines binary data (size or endianness), and if
> there is a dependency on the platform (CPU bitness etc.) it should be
> explicit, so that the behavior of the structure is clear
> (self-describing type names + type docs that list the relevant
> platforms and the effects on every platform). This approach inverts
> the existing practice of using platform-dependent binary structures
> by default.

So struct and numpy are existing solutions to this problem, but I
think you are thinking too low-level for most problems. It sounds like
what you really want is a schema-based serialisation protocol. There
are a million of these, all alike, but the two I've used the most are
msgpack and thrift. Generally you want to specify statistical
distributions rather than the specific encoding scheme; the choice
between 4-byte and 8-byte ints is not particularly useful at the
Python programmer level, but knowing whether the number is a small int
or a large one is. Thrift is the best implementation of a remote call
I've used, but it still leaves a lot of room for improvement.

If you are actually talking about building something that will talk to
any existing code in any existing language, then you will need
something more like CFFI. However, I don't think you want to take the
C out of CFFI. The reason is that the packing of the data is actually
one of the least important parts of that interface. As you've read on
pypy-dev recently, reference counting vs. gc is hard to get right. But
there are many other problems which you have to address for a truly
language-independent ABI:

Stack convention: what order are things pushed onto the stack (IIRC C
and Pascal push arguments onto the stack in opposite orders)? Are
things even pushed onto a stack?
(LISP stacks are implemented with linked lists; some languages don't
even bother with stack frames when they perform tail recursion
removal, using conditional gotos instead.)

Packing conventions: different CPUs like pointers to be even, or
aligned to 8 bytes. Your code won't be portable if you don't handle
this.

Exception handling: there are as many ways to handle exceptions as
there are compilers, all of them with subtle rules around the
lifetimes of all the objects that are being excepted over.

Virtual methods and the like: in most languages (C is actually
somewhat unusual here) methods or functions are called via a
dereference or two rather than at a direct memory location. For C++
virtual methods and Java methods this looks like a field lookup to
find the vtable, then a jump to a memory location taken from that
table. This is tedious and error-prone to implement correctly.

Generics and types: just working out which function to call is
difficult once you have C++ templates, and performing the correct type
checking is difficult for Java. Haskell's internal type specification
alone is probably larger than all of the CFFI interfaces put together.

There are no doubt whole areas I've missed here, but I hope this gives
you a taste of why language developers love CFFI - it's easy to
implement, easy enough to use, and hides all of these complexities by
pushing everything through the bottleneck which is the C ABI. Language
developers would rather make their own language wonderful; compiler
writers would rather make their own language efficient. Nobody really
cares enough to make supporting other languages directly a goal. The
biggest and best effort in this regard was .NET, and it also cheated
by folding everything through a small bottleneck (albeit a more
expressive one). This folding scars the languages around it - F# is
still less pure than Haskell, J# is incompatible with Java.
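To make the struct/numpy pointer from earlier in this thread concrete:
the stdlib struct module already gives the "explicit size and
endianness" layouts anatoly asks for, as soon as you pick a byte-order
prefix. A minimal sketch (the record layout here is made up for
illustration):

```python
import struct

# '<' selects little-endian with fixed, standard sizes and no padding:
# I = uint32, h = int16, B = uint8 -- 7 bytes total on every platform.
FMT = '<IhB'

packed = struct.pack(FMT, 1024, -3, 7)
assert len(packed) == struct.calcsize(FMT) == 7

# The same bytes decode to the same values regardless of the host CPU.
assert struct.unpack(FMT, packed) == (1024, -3, 7)
```

Without a '<' (or '>', '=', '!') prefix, struct falls back to native
byte order, sizes and alignment - exactly the platform-dependent
default being argued against above.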
If you want to tackle something in this area, I would encourage you to
look at the various serialisation tools out there and work out why
they are not as good as they could be. This is probably not the right
mailing list for this discussion though.

njh

From bokr at oz.net Sun Mar 30 19:23:39 2014
From: bokr at oz.net (Bengt Richter)
Date: Sun, 30 Mar 2014 19:23:39 +0200
Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans
In-Reply-To: 
References: <20140329125815.GA20269@untibox.unti>
Message-ID: 

On 03/30/2014 05:32 PM Leonardo Santagada wrote:
> On Sun, Mar 30, 2014 at 9:36 AM, anatoly techtonik wrote:
>
>> On the ABI level tools can be more useful, and that is where the
>> idea intersects with CFFI. It doesn't cancel the fact that people
>> need safe feet injury prevention interfaces. Look for my previous
>> answer with the phrase "I mix C with binary manipulation" that
>> covers this.
>>
>
> I think you want something like
> https://pypi.python.org/pypi/construct/2.5.1right? It exists and is

Need to strip/separate the "right?" to isolate and make the link work:
https://pypi.python.org/pypi/construct/2.5.1

> pretty good... some guys in my company are using it
> to model a binary audio control protocol, both to prototype and to
> test the embedded implementation (which is in C).
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

From estama at gmail.com Mon Mar 31 20:28:55 2014
From: estama at gmail.com (Eleytherios Stamatogiannakis)
Date: Mon, 31 Mar 2014 21:28:55 +0300
Subject: [pypy-dev] JIT not kicking in for callbacks
In-Reply-To: 
References: <20140329125815.GA20269@untibox.unti>
Message-ID: <5339B3E7.50804@gmail.com>

Hello,

I'm not sure, but I think that the JIT doesn't kick in when the loops
are "external", where C code does the loop and calls Python code for
each iteration, through CFFI callbacks.
In our code, we have SQLite calling Python code through callbacks.

Using the loop log:

PYPYLOG=log pypy mterm.py

I see this summary:

interpret                  91%
gc-collect-step             4%
gc-minor                    3%
jit-optimize                0%
gc-minor-walkroots          0%
jit-tracing                 0%
jit-backend                 0%
jit-log-compiling-bridge    0%
jit-resume                  0%
jit-backend-dump            0%
gc-set-nursery-size         0%
jit-log-virtualstate        0%
jit-backend-addr            0%
gc-hardware                 0%
jit-log-noopt-loop          0%
jit-mem-looptoken-alloc     0%
jit-log-opt-bridge          0%
jit-log-rewritten-loop      0%
jit-log-rewritten-bridge    0%
jit-log-compiling-loop      0%
jit-log-opt-loop            0%
jit-log-short-preamble      0%
jit-abort                   0%
jit-mem-collect             0%
jit-summary                 0%
jit-backend-counts          0%

If I read that correctly, most of the execution time is in the
interpreter and not in the JITed code. Is there some way I can "hint"
to PyPy that a callback is essentially the inside of a loop, so PyPy
can JIT it?

I've attached the simple log from which the above summary was
produced via:

python logparser.py print-summary log.log -

Regards,

l.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.log.gz
Type: application/gzip
Size: 282249 bytes
Desc: not available
URL: 

From fijall at gmail.com Mon Mar 31 20:33:15 2014
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Mon, 31 Mar 2014 20:33:15 +0200
Subject: [pypy-dev] JIT not kicking in for callbacks
In-Reply-To: <5339B3E7.50804@gmail.com>
References: <20140329125815.GA20269@untibox.unti> <5339B3E7.50804@gmail.com>
Message-ID: 

hi

"interpret" includes "in the jit". if you're having trouble with the
performance, please let us know how to reproduce it and we'll try to
help you pinpoint the problem.

On Mon, Mar 31, 2014 at 8:28 PM, Eleytherios Stamatogiannakis wrote:
> Hello,
>
> I'm not sure, but I think that the JIT doesn't kick in when the loops
> are "external", where C code does the loop and calls Python code for
> each iteration, through CFFI callbacks.
> In our code, we have SQLite calling Python code through callbacks.
>
> Using the loop log:
>
> PYPYLOG=log pypy mterm.py
>
> I see this summary:
>
> interpret                  91%
> gc-collect-step             4%
> gc-minor                    3%
> jit-optimize                0%
> gc-minor-walkroots          0%
> jit-tracing                 0%
> jit-backend                 0%
> jit-log-compiling-bridge    0%
> jit-resume                  0%
> jit-backend-dump            0%
> gc-set-nursery-size         0%
> jit-log-virtualstate        0%
> jit-backend-addr            0%
> gc-hardware                 0%
> jit-log-noopt-loop          0%
> jit-mem-looptoken-alloc     0%
> jit-log-opt-bridge          0%
> jit-log-rewritten-loop      0%
> jit-log-rewritten-bridge    0%
> jit-log-compiling-loop      0%
> jit-log-opt-loop            0%
> jit-log-short-preamble      0%
> jit-abort                   0%
> jit-mem-collect             0%
> jit-summary                 0%
> jit-backend-counts          0%
>
> If I read that correctly, most of the execution time is in the
> interpreter and not in the JITed code. Is there some way I can "hint"
> to PyPy that a callback is essentially the inside of a loop, so PyPy
> can JIT it?
>
> I've attached the simple log from which the above summary was
> produced via:
>
> python logparser.py print-summary log.log -
>
> Regards,
>
> l.
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>
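A self-contained way to reproduce the "C drives the loop, Python is
the callback body" pattern discussed above - not Eleytherios's SQLite
code, just a stdlib-only stand-in - is libc's qsort called through
ctypes: the sort loop runs entirely in C and re-enters Python once per
comparison.

```python
import ctypes
import ctypes.util

# Load the C library (assumes a Unix-like system where this lookup works).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.qsort.restype = None

# Callback type: int (*compar)(const void *, const void *), here for ints.
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    # Entered once per comparison; the loop around it lives inside qsort.
    return a[0] - b[0]

values = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int),
           CMPFUNC(py_cmp))

print(list(values))  # [1, 2, 3, 4, 5]
```

Pointing the PYPYLOG technique shown above at a script like this makes
it easy to see how time around such callbacks gets attributed.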