From brecht at mos6581.org Sat Mar 1 23:34:17 2014 From: brecht at mos6581.org (Brecht Machiels) Date: Sat, 01 Mar 2014 23:34:17 +0100 Subject: [pypy-dev] RinohType and PyPy2 Message-ID: <1431737446.616651.1393713257918.JavaMail.sas1@[172.29.252.247]> Hello, I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). Results on my Celeron T3000 (Arch Linux x86_64): CPython 3.3.4 14 s PyPy3 2.1.0-beta1 61 s CPython 2.7.6 15 s PyPy 2.2.1 35 s If you want to give it a try (no external dependencies): git clone --branch pypy2 https://github.com/brechtm/rinohtype.git cd rinohtype/examples/rfic2009 rm -rf template.ptc; PYTHONPATH=../.. pypy template.py While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? Best regards, Brecht From fijall at gmail.com Sun Mar 2 08:19:35 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 2 Mar 2014 09:19:35 +0200 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> Message-ID: The first obvious thing that jumps out at me is your casual use of sys._getframe - the JIT aborts in this case and proceeds to the interpreter (so you pay the price for JITting, while you also pay the price for not having compiled assembler). That probably does not explain everything, but please don't use sys._getframe in production code if you want the JIT to be fast. On Sun, Mar 2, 2014 at 12:34 AM, Brecht Machiels wrote: > Hello, > > I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). 
> > Results on my Celeron T3000 (Arch Linux x86_64): > CPython 3.3.4 14 s > PyPy3 2.1.0-beta1 61 s > CPython 2.7.6 15 s > PyPy 2.2.1 35 s > > If you want to give it a try (no external dependencies): > > git clone --branch pypy2 https://github.com/brechtm/rinohtype.git > cd rinohtype/examples/rfic2009 > rm -rf template.ptc; PYTHONPATH=../.. pypy template.py > > > While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? > > Best regards, > Brecht > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From brecht at mos6581.org Sun Mar 2 11:11:43 2014 From: brecht at mos6581.org (Brecht Machiels) Date: Sun, 02 Mar 2014 11:11:43 +0100 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> Message-ID: <1689107550.632557.1393755103839.JavaMail.sas1@[172.29.252.247]> Thanks Maciej, sys._getframe was introduced by "magicsuper", which I used to avoid refactoring all super() calls. I've done that now and there shouldn't be any more sys._getframe calls. You can pull in this commit from the pypy2 branch. Unfortunately, this didn't improve performance much. PyPy now takes 26 seconds. Any other ideas? Best regards, Brecht ---- On Sun, 02 Mar 2014 08:19:35 +0100 Maciej Fijalkowski wrote ---- >the first obvious thing that jumps at me is your casual use of >sys._getframe - the JIT aborts in this case and proceeds to the >interpreter (so you pay the price for JITting, while you also pay the >prace for not having compiled assembler). That probably does not >explain everything, but please don't use sys._getframe in production >code if you want the JIT to be fast. 
> >On Sun, Mar 2, 2014 at 12:34 AM, Brecht Machiels wrote: >> Hello, >> >> I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). >> >> Results on my Celeron T3000 (Arch Linux x86_64): >> CPython 3.3.4 14 s >> PyPy3 2.1.0-beta1 61 s >> CPython 2.7.6 15 s >> PyPy 2.2.1 35 s >> >> If you want to give it a try (no external dependencies): >> >> git clone --branch pypy2 https://github.com/brechtm/rinohtype.git >> cd rinohtype/examples/rfic2009 >> rm -rf template.ptc; PYTHONPATH=../.. pypy template.py >> >> >> While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? >> >> Best regards, >> Brecht >> >> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev > From numerodix at gmail.com Sun Mar 2 13:03:47 2014 From: numerodix at gmail.com (Martin Matusiak) Date: Sun, 2 Mar 2014 13:03:47 +0100 Subject: [pypy-dev] pypy in python3? Message-ID: Hi, I'm wondering whether there are any plans to port pypy itself to python 3 at some point. And what the benefits of that might be (other than having a more recent host language). Is there anything in python 3 that would make it easier/harder for pypy? Thanks, Martin From arigo at tunes.org Sun Mar 2 16:41:15 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 2 Mar 2014 16:41:15 +0100 Subject: [pypy-dev] pypy in python3? In-Reply-To: References: Message-ID: Hi Martin, On 2 March 2014 13:03, Martin Matusiak wrote: > I'm wondering whether there are any plans to port pypy itself to > python 3 at some point. And what the benefits of that might be (other > than having a more recent host language). Is there anything in python > 3 that would make it easier/harder for pypy? 
Just to make it clear to readers: this is about the language in which PyPy is implemented; this is not about the fact that PyPy itself implements Python 2.7 and 3.2 (currently). If we were starting today, then we could certainly use some small new features, like the ability to decorate function arguments rather than the whole function. However, that's about it as far as advantages go. There are small disadvantages too, like the unicode-everywhere model; you'd have to write byte strings explicitly everywhere in order to implement Python 2, or almost any small language you want to play with. That's the main difference from Python 2, ignoring new things in the stdlib which we cannot use from RPython anyway. But we're not starting today, and we have a very large code base already. As far as I'm concerned, Python 2 works nicely, is going to stay around for a long time, and is stable --- i.e. does not require us to adapt our code base every 2 years when a new Python 3.x version goes out (even if the required work is usually minimal, as far as our experience goes, from 2.3 to 2.7). For this reason I imagine that PyPy is going to be Python 2 forever. As it runs on PyPy itself, we won't even require a working CPython 2.x to get started, although I'm sure these will also remain available forever. A bientôt, Armin. From arigo at tunes.org Sun Mar 2 17:01:32 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 2 Mar 2014 17:01:32 +0100 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> Message-ID: Hi Brecht, On 1 March 2014 23:34, Brecht Machiels wrote: > While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? It's not really helpful, but the warm-up time is the first issue here. If I edit template.py to run it e.g. 
10 times instead of only once, the speed grows quickly by a factor of 4. It means your code, for some reason, exhibits slow warm-ups (not the worst we've seen, but I agree it's a lot). It would be interesting to know if you have a similar speed-up when processing a single 10-times-larger document instead of 10 times the same small document :-) A bientôt, Armin. From fijall at gmail.com Sun Mar 2 22:15:28 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 2 Mar 2014 23:15:28 +0200 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> Message-ID: Hi Brecht. I must say I've been trying to understand what's going on and I'm failing so far. Thanks for a valuable benchmark! And yes, we're working on improving the warmup time (ETA unknown though) On Sun, Mar 2, 2014 at 12:11 PM, Brecht Machiels wrote: > Thanks Maciej, > > sys._getframe was introduced by "magicsuper", which I used to avoid refactoring all super() calls. I've done that now and there shouldn't be any more sys._getframe calls. You can pull in this commit from the pypy2 branch. > > Unfortunately, this didn't improve performance much. PyPy now takes 26 seconds. Any other ideas? > > Best regards, > Brecht > > ---- On Sun, 02 Mar 2014 08:19:35 +0100 Maciej Fijalkowski wrote ---- > >>the first obvious thing that jumps at me is your casual use of >>sys._getframe - the JIT aborts in this case and proceeds to the >>interpreter (so you pay the price for JITting, while you also pay the >>price for not having compiled assembler). That probably does not >>explain everything, but please don't use sys._getframe in production >>code if you want the JIT to be fast. 
>> >On Sun, Mar 2, 2014 at 12:34 AM, Brecht Machiels > wrote: >>> Hello, >>> >>> I've managed to backport RinohType to Python 2 (took me only a couple of hours thankfully). >>> >>> Results on my Celeron T3000 (Arch Linux x86_64): >>> CPython 3.3.4 14 s >>> PyPy3 2.1.0-beta1 61 s >>> CPython 2.7.6 15 s >>> PyPy 2.2.1 35 s >>> >>> If you want to give it a try (no external dependencies): >>> >>> git clone --branch pypy2 https://github.com/brechtm/rinohtype.git >>> cd rinohtype/examples/rfic2009 >>> rm -rf template.ptc; PYTHONPATH=../.. pypy template.py >>> >>> >>> While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? >>> >>> Best regards, >>> Brecht >>> >>> >>> _______________________________________________ >>> pypy-dev mailing list >>> pypy-dev at python.org >>> https://mail.python.org/mailman/listinfo/pypy-dev >> > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From johan.rade at gmail.com Tue Mar 4 15:43:55 2014 From: johan.rade at gmail.com (=?ISO-8859-1?Q?Johan_R=E5de?=) Date: Tue, 04 Mar 2014 15:43:55 +0100 Subject: [pypy-dev] pypy 2.2.1 win32 In-Reply-To: <53107B9C.4000705@gmx.de> References: <366629207.1251178.1393571016417.JavaMail.ngmail@webmail11.arcor-online.net> <53107B9C.4000705@gmx.de> Message-ID: <5315E6AB.3080906@gmail.com> Hi Carl Friedrich, What kind of benchmark do you prefer? A benchmark that shows how great PyPy is compared with C-Python? Then you might use Sunfish, https://github.com/thomasahle/sunfish. Sunfish does not have any official benchmarks, but I think you could use test.selfplay() as a benchmark. (It has Sunfish play against itself, and it plays the same 84-move game each time.) This benchmark shows that PyPy is 3.5 times faster than C-Python. 
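A minimal timing harness for a run like test.selfplay() might look like the sketch below. The engine call is replaced by a stand-in workload so the snippet is self-contained; nothing about Sunfish's interface is assumed beyond "a callable".

```python
import time

def time_runs(workload, repeats=3):
    """Time `workload` a few times and return per-run wall-clock seconds."""
    results = []
    for _ in range(repeats):
        start = time.time()
        workload()
        results.append(time.time() - start)
    return results

def workload():
    # Stand-in for an engine call such as Sunfish's test.selfplay().
    total = 0
    for i in range(100000):
        total += i * i
    return total

print(time_runs(workload))
```

Repeating the run a few times also makes JIT warm-up visible: on PyPy the first entry tends to be the slowest.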
Or do you want a benchmark that shows how poor PyPy is, and maybe suggests where some improvement might be needed? Then you could use PyChess, http://code.google.com/p/pychess. PyChess comes with an official benchmark, pychess.Utils.lutils.Benchmark.benchmark(). It shows that PyPy is only 25% faster than C-Python. Sunfish is a bit of a toy program (but what a nice toy!). PyChess is a real chess program, maybe the leading chess program in Python. Best wishes, Johan On 2014-02-28 13:05, Carl Friedrich Bolz wrote: > Hi Norbert, > > On 28/02/14 08:03, norbert.raimund.leisner at arcor.de wrote: >> I ask you because a chess program "Sunfish" > https://github.com/thomasahle/sunfish/ is using pypy. > > Unrelated to your actual question, this sounds like a very cool > addition to our benchmark set. Somebody feel like adding it? > > Cheers, > > Carl Friedrich > From len-l at telus.net Thu Mar 6 07:16:57 2014 From: len-l at telus.net (Lenard Lindstrom) Date: Wed, 05 Mar 2014 22:16:57 -0800 Subject: [pypy-dev] RPython question about the lifetime of global state Message-ID: <531812D9.4030207@telus.net> Hi everyone, I am developing a new image blit system for Pygame 2.0 - the SDL2 edition. A blitter prototype project is maintained at https://bitbucket.org/llindstrom/blitter. The prototype implements a blit loop JIT; Pixel format specific blit code is generated dynamically as needed. The prototype is written in RPython as an interpreter for executing array copies. The JIT comes automatically from the RPython tool chain, of course. The prototype blitter is built as a stand-alone shared library with flags -Ojit --gcrootfinder=shadowstack. It has no Python dependencies. There are two entrypoint C functions, both decorated with rpython.rlib.entrypoint.entrypoint. Python side code uses CFFI to access the library. The library is initialized with a single call to rpython_startup_code at load time. 
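The load-time initialization contract (one call to rpython_startup_code, after which the RPython state lives for the life of the library) can be captured with a small guard on the Python side. This is only a sketch: FakeLib stands in for the real CFFI handle, and rpython_startup_code is the only name taken from the message above.

```python
import threading

_started = False
_start_lock = threading.Lock()

def ensure_started(lib):
    # rpython_startup_code must run exactly once per process; after that the
    # RPython globals, JIT caches and GC persist until the library is unloaded.
    global _started
    with _start_lock:
        if not _started:
            lib.rpython_startup_code()
            _started = True

class FakeLib(object):
    # Stand-in for the CFFI-loaded blitter library.
    def __init__(self):
        self.calls = 0
    def rpython_startup_code(self):
        self.calls += 1

lib = FakeLib()
ensure_started(lib)
ensure_started(lib)  # second call is a no-op: the guard has tripped
print(lib.calls)
```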
The blitter library is meant to be an embedded interpreter, with initialize, configure, and execute functions. So my question: does the RPython tool chain explicitly support embedded interpreters? I ask because I have only seen secondary entry points used as callbacks into an interpreter (PyPy's cpyext interface). So I wish to confirm that the lifetime of an RPython global namespace, the JIT caches, and the garbage collector are that of the loaded library, and not just that of an entry point function call. Thanks in advance, Lenard Lindstrom From arigo at tunes.org Thu Mar 6 07:55:22 2014 From: arigo at tunes.org (Armin Rigo) Date: Thu, 6 Mar 2014 07:55:22 +0100 Subject: [pypy-dev] RPython question about the lifetime of global state In-Reply-To: <531812D9.4030207@telus.net> References: <531812D9.4030207@telus.net> Message-ID: Hi Lenard, On 6 March 2014 07:16, Lenard Lindstrom wrote: > The prototype is written in RPython as an interpreter for executing > array copies. The JIT comes automatically from the RPython tool chain, of > course. Cool :-) RPython can certainly be used in this way, although critics might rightfully argue that you're getting a very big framework around a very small piece of code. You're getting for free a JIT that knows all about optimizing temporary allocations and tons of other things typical in a dynamic language, none of which really applies in your case. As long as the interpreter to JIT is only a few lines of source, I'd recommend at least having a look at other libraries (LibJIT for example). It would come with a smaller footprint (in code size, in memory usage, and in warm-up time) for similar results. It only works if the interpreter to JIT is small or if you have tons of time on your hands :-) > So I wish to confirm that the lifetime of an RPython > global namespace, the JIT caches, and the garbage collector are that of the > loaded library, and not just that of an entry point function call. 
Yes: it must be initialized only once, and then everything stays around. A bientôt, Armin. From norbert.raimund.leisner at arcor.de Fri Mar 7 07:45:29 2014 From: norbert.raimund.leisner at arcor.de (norbert.raimund.leisner at arcor.de) Date: Fri, 7 Mar 2014 07:45:29 +0100 (CET) Subject: [pypy-dev] pypy 2.2.1 win32 In-Reply-To: References: <366629207.1251178.1393571016417.JavaMail.ngmail@webmail11.arcor-online.net> Message-ID: <1261519382.77187.1394174729955.JavaMail.ngmail@webmail09.arcor-online.net> Hello Maciej, hello support-team! my hardware is: Intel Core 2 Duo E6600 (2x2,4 GHz) - 1 GB RAM - 512 MB graphics card operating system: Windows XP SP3 32-bit Would your recommendation be PyPy3 2.1 beta 1 win32, or is Python 3.3.4 x86 MSI your first choice in this case? http://www.python.org/download/releases/3.3.4/ I use it for Sunfish https://github.com/thomasahle/sunfish/ and Shatranj http://code.google.com/p/shatranjpy/ (two chess engines) cf. WinBoard/CECP-protocol http://www.open-aurec.com/wbforum/WinBoard/engine-intf.html and WinBoard-GUI http://www.open-aurec.com/wbforum/viewtopic.php?f=19&t=51528 Best wishes, Norbert ----- Original Message ---- From: Maciej Fijalkowski To: norbert.raimund.leisner at arcor.de Date: 28.02.2014 08:39 Subject: Re: [pypy-dev] pypy 2.2.1 win32 > On Fri, Feb 28, 2014 at 9:03 AM, wrote: > > Hello support-team, > > > > I have installed pypy 2.2.1 win32 for my OS Windows XP SP 3 -32 bit, > Python 2.7, MSI Microsoft Visual C++ 2008 SP1 Redistributable Package > (x86). > > > > Now my question: > > Must Python 2.7 be uninstalled and replaced by Python 2.7.6, or not? > > As far as I understand your question, the answer is no. Various > versions of Python (and PyPy) can happily coexist next to each other. 
> > Cheers, > fijal > From johan.rade at gmail.com Fri Mar 7 19:25:20 2014 From: johan.rade at gmail.com (=?ISO-8859-1?Q?Johan_R=E5de?=) Date: Fri, 07 Mar 2014 19:25:20 +0100 Subject: [pypy-dev] pypy 2.2.1 win32 In-Reply-To: <1261519382.77187.1394174729955.JavaMail.ngmail@webmail09.arcor-online.net> References: <366629207.1251178.1393571016417.JavaMail.ngmail@webmail11.arcor-online.net> <1261519382.77187.1394174729955.JavaMail.ngmail@webmail09.arcor-online.net> Message-ID: Hi Norbert, It is not easy to answer questions like that. We have not tested every Python program with every Python version. Why don't you just try yourself? If you run into problems when you try with PyPy, feel free to ask for advice here. But OK, I happen to have Sunfish and four different 32-bit Python versions installed on my computer. Here are the results I got when I timed the Sunfish function test.selfplay(): PyPy 2.2.1: 106.2 s PyPy 3.2.1 beta: 395.2 s CPython 2.7.6: 363.9 s CPython 3.4.0 RC2: 426.5 s So it seems that you can use any of these four, but PyPy 2.2.1 is fastest. (I think the author of Sunfish has optimized the code using PyPy 2.) And let's get the terminology straight: Python is a language. PyPy, CPython, IronPython and Jython are different implementations of that language. The software that you call Python 3.3.4 should be called CPython 3.3.4. Cheers, Johan On 2014-03-07 07:45, norbert.raimund.leisner at arcor.de wrote: > Hello Maciej, hello support-team! > > my hardware is : Intel Core 2 Duo E6600 (2x2,4 Ghz) - 1 GB RAM - 512 MB graphical card > operation system: Windows XP SP3 32-bit > > Would yoru recommendation be PyPy 3 2.1 beta 1 win32 or is Python 3.3.4 x86 MSI at this case your first choice? > http://www.python.org/download/releases/3.3.4/ > > I use it for Sunfish https://github.com/thomasahle/sunfish/ and Shatranj http://code.google.com/p/shatranjpy/ (two chess engines) cf. 
WinBoard/CECP-protocol http://www.open-aurec.com/wbforum/WinBoard/engine-intf.html > and WinBoard-GUI http://www.open-aurec.com/wbforum/viewtopic.php?f=19&t=51528 > > Best wishes, > Norbert > > > ----- Original Nachricht ---- > Von: Maciej Fijalkowski > An: norbert.raimund.leisner at arcor.de > Datum: 28.02.2014 08:39 > Betreff: Re: [pypy-dev] pypy 2.2.1 win32 > >> On Fri, Feb 28, 2014 at 9:03 AM, wrote: >>> Hello support-team, >>> >>> I have installed pypy 2.2.1 win32 for my OS Windows XP SP 3 -32 bit, >> Python 2.7, MSI Microsoft Visual C++ 2008 SP1 Redistributable Package >> (x86). >>> >>> Now my question: >>> Must be Python 2.7 deinstalled and replaced by Pythonv2.7.6 or not? >> >> As far as I understand your question, the answer is no. Various >> versions of Python (and PyPy) can happily coexist next to each other. >> >> Cheers, >> fijal >> From johan.rade at gmail.com Sat Mar 8 14:42:52 2014 From: johan.rade at gmail.com (=?ISO-8859-1?Q?Johan_R=E5de?=) Date: Sat, 08 Mar 2014 14:42:52 +0100 Subject: [pypy-dev] Possibly a PyPy C-API bug Message-ID: Hi everyone, I think I might have found a bug in the PyPy C-API. It seems that PyType_Type.tp_new is broken. Here is a minimal example that reproduces the bug. Instructions: Compile Foo3.c as a python extension module named Foo3. Set up the paths so that Test3.py can find Foo3. Run Test3.py Expected result and observed result with CPython 2.7.6: Test3.py runs Observed result with PyPy 2.2.1: Test3.py crashes. (It gets into an infinite recursive loop where PyType_Type.tp_new and Foo3Type_Type.tp_new keep calling each other.) Fixing this bug, or finding a workaround, would get me one step closer to getting PySide to run with PyPy. Cheers, Johan -------------- next part --------------
#include <Python.h>

PyObject* foo3type_tp_new(PyTypeObject* metatype, PyObject* args, PyObject* kwds)
{
    // In a more realistic example we might do some preprocessing of args and kwargs here ... 
    PyObject* newType = PyType_Type.tp_new(metatype, args, kwds);
    // ... and some postprocessing of newType here
    return newType;
}

PyTypeObject Foo3Type_Type = {
    PyVarObject_HEAD_INIT(0, 0)
    /*tp_name*/             "Foo3.Type",
    /*tp_basicsize*/        sizeof(PyTypeObject),
    /*tp_itemsize*/         0,
    /*tp_dealloc*/          0,
    /*tp_print*/            0,
    /*tp_getattr*/          0,
    /*tp_setattr*/          0,
    /*tp_compare*/          0,
    /*tp_repr*/             0,
    /*tp_as_number*/        0,
    /*tp_as_sequence*/      0,
    /*tp_as_mapping*/       0,
    /*tp_hash*/             0,
    /*tp_call*/             0,
    /*tp_str*/              0,
    /*tp_getattro*/         0,
    /*tp_setattro*/         0,
    /*tp_as_buffer*/        0,
    /*tp_flags*/            Py_TPFLAGS_DEFAULT,
    /*tp_doc*/              0,
    /*tp_traverse*/         0,
    /*tp_clear*/            0,
    /*tp_richcompare*/      0,
    /*tp_weaklistoffset*/   0,
    /*tp_iter*/             0,
    /*tp_iternext*/         0,
    /*tp_methods*/          0,
    /*tp_members*/          0,
    /*tp_getset*/           0,
    /*tp_base*/             0, // set to &PyType_Type in module init function (why can it not be done here?)
    /*tp_dict*/             0,
    /*tp_descr_get*/        0,
    /*tp_descr_set*/        0,
    /*tp_dictoffset*/       0,
    /*tp_init*/             0,
    /*tp_alloc*/            0,
    /*tp_new*/              foo3type_tp_new,
    /*tp_free*/             0,
    /*tp_is_gc*/            0,
    /*tp_bases*/            0,
    /*tp_mro*/              0,
    /*tp_cache*/            0,
    /*tp_subclasses*/       0,
    /*tp_weaklist*/         0
};

static PyMethodDef sbkMethods[] = {{NULL, NULL, 0, NULL}};

#ifdef _WIN32
__declspec(dllexport) void  // PyMODINIT_FUNC is broken on PyPy/Windows
#else
PyMODINIT_FUNC
#endif
initFoo3(void)
{
    PyObject* mod = Py_InitModule("Foo3", sbkMethods);
    Foo3Type_Type.tp_base = &PyType_Type;
    PyType_Ready(&Foo3Type_Type);
    PyModule_AddObject(mod, "Type", (PyObject*)&Foo3Type_Type);
}
-------------- next part --------------
import Foo3

class X(object):
    __metaclass__ = Foo3.Type
    pass
From arigo at tunes.org Sun Mar 9 08:26:53 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 9 Mar 2014 08:26:53 +0100 Subject: [pypy-dev] Possibly a PyPy C-API bug In-Reply-To: References: Message-ID: Hi Johan, On 8 March 2014 14:42, Johan Råde wrote: > I think I might have found a bug in the PyPy C-API. > It seems that PyType_Type.tp_new is broken. Indeed. 
I tried to look, but either I missed something or it looks like it won't be that obvious to fix. For reference, the built-in types like PyType_Type are generated automatically and all their slots (or maybe only tp_new?) seem to be subtly wrong: they are done with slot_tp_new(), which calls the instance's generic operation; the latter is possibly overridden in a subtype, thus leading to infinite recursion in cases like you report. Can you post this to the bug tracker? Otherwise it will likely be forgotten. A bientôt, Armin. From matti.picus at gmail.com Sun Mar 9 23:37:07 2014 From: matti.picus at gmail.com (Matti Picus) Date: Mon, 10 Mar 2014 00:37:07 +0200 Subject: [pypy-dev] win32 failures on own tests Message-ID: <531CED13.7020401@gmail.com> An HTML attachment was scrubbed... URL: From arigo at tunes.org Mon Mar 10 08:00:37 2014 From: arigo at tunes.org (Armin Rigo) Date: Mon, 10 Mar 2014 08:00:37 +0100 Subject: [pypy-dev] win32 failures on own tests In-Reply-To: <531CED13.7020401@gmail.com> References: <531CED13.7020401@gmail.com> Message-ID: Hi Matti, On 9 March 2014 23:37, Matti Picus wrote: > id(x) returning a long where an int is expected in rlib\objectmodel.py You're right, both CPython and PyPy return an unsigned integer which may not fit into an "int". A bientôt, Armin. From dimaqq at gmail.com Mon Mar 10 09:38:57 2014 From: dimaqq at gmail.com (Dima Tisnek) Date: Mon, 10 Mar 2014 09:38:57 +0100 Subject: [pypy-dev] slow-ish multithreaded primitives In-Reply-To: References: Message-ID: Can I try to make a case for _py3k_acquire inclusion when using the context manager API? Let's say a well-formed Python program always uses context managers, and thus timeouts are only supplied to condition.wait():

c = threading.Condition()
with c:
    while something:
        c.wait(some time)
        change state

with c:
    c.notifyAll()

What is the semantic difference in the choice of the underlying implementation of c._Condition__lock._RLock__block.acquire vs _py3k_acquire? 
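For reference, that wait/notify shape runs end to end as plain Python. This is a runnable sketch using only the stdlib threading module; notify_all() is the Python 3 spelling of 2.x's notifyAll().

```python
import threading

items = []
c = threading.Condition()

def consumer(out):
    with c:                  # __enter__ -> acquire() with no timeout argument
        while not items:
            c.wait(1.0)      # only wait() takes the timeout
        out.append(items.pop())

result = []
t = threading.Thread(target=consumer, args=(result,))
t.start()

with c:
    items.append("work")
    c.notify_all()           # spelled notifyAll() under Python 2

t.join()
print(result)  # ['work']
```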
what could go wrong if c._Condition_lock.__enter__ was mapped to _py3k_acquire instead? AFAIK context manager API doesn't allow user to pass blocking=0 here. Thus lock acquisition cannot time out. Seems pretty solid to me... That still leaves signal handling. Is the concern here about the context in which signal handler executes? the behaviour of user program because signal may be caught earlier? unexpected exception site for KeyboardInterrupt? d. On 27 February 2014 15:54, Armin Rigo wrote: > Hi Dima, > > On 25 February 2014 16:45, Dima Tisnek wrote: >> Armin, is there really a semantical change? >> Consider invocations valid in 2.7, (i.e. without timeout argument), is >> it not the same then? > > It's different: Python 3.x acquire() can be interrupted by signals, > whereas Python 2.x acquire() cannot. > >> should this code be in nightly builds? > > Yes. > > Armin From dimaqq at gmail.com Mon Mar 10 10:19:38 2014 From: dimaqq at gmail.com (Dima Tisnek) Date: Mon, 10 Mar 2014 10:19:38 +0100 Subject: [pypy-dev] slow-ish multithreaded primitives In-Reply-To: References: Message-ID: Oh, so sorry to have jumped the gun. now that I properly tested the nightly build I see that the performance issue I saw is gone and that condition.acquire actually calls _py3k_acquire when timeout argument is present. d. On 10 March 2014 09:38, Dima Tisnek wrote: > Can I try to make a case for _py3k_acquire inclusion when using > context manager API? > > Let's say a well-formed Python program always context managers, and > thus timeouts are only supplied to condition,wait(): > > c = threading.Condition() > with c: > while something: > c.wait(some time) > change state > > with c: > c.notifyAll() > > What is the semantic difference in the choice of the underlying > implementation of c._Condition__lock._RLock__block.acquire vs > _py3k_acquire? > > what could go wrong if c._Condition_lock.__enter__ was mapped to > _py3k_acquire instead? 
> > AFAIK context manager API doesn't allow user to pass blocking=0 here. > Thus lock acquisition cannot time out. Seems pretty solid to me... > > That still leaves signal handling. Is the concern here about the > context in which signal handler executes? the behaviour of user > program because signal may be caught earlier? unexpected exception > site for KeyboardInterrupt? > > d. > > On 27 February 2014 15:54, Armin Rigo wrote: >> Hi Dima, >> >> On 25 February 2014 16:45, Dima Tisnek wrote: >>> Armin, is there really a semantical change? >>> Consider invocations valid in 2.7, (i.e. without timeout argument), is >>> it not the same then? >> >> It's different: Python 3.x acquire() can be interrupted by signals, >> whereas Python 2.x acquire() cannot. >> >>> should this code be in nightly builds? >> >> Yes. >> >> Armin From naylor.b.david at gmail.com Mon Mar 10 18:26:19 2014 From: naylor.b.david at gmail.com (David Naylor) Date: Mon, 10 Mar 2014 20:26:19 +0300 Subject: [pypy-dev] Python vs pypy: interesting performance difference [dict.setdefault] In-Reply-To: References: <201108102127.13752.naylor.b.david@gmail.com> <201108252144.09934.naylor.b.david@gmail.com> Message-ID: <3514347.BF9MiKfKNF@dragon.dg> On Friday, 26 August 2011 06:37:30 Armin Rigo wrote: > Hi David, > > On Thu, Aug 25, 2011 at 9:44 PM, David Naylor wrote: > > Below is the patch, and results, for my proposed hash methods for > > datetime.datetime (and easily adaptable to include tzinfo and the other > > datetime objects). I tried to make the hash safe for both 32bit and 64bit > > systems, and beyond. > > Yes, the patch looks good to me. I can definitely see how it can be a > huge improvement in performance :-) > > If you can also "fix" the other __hash__ methods in the same way, it > would be great. To follow up on a very old email. 
The latest results are:

# python2.7 iforkey.py
ifdict: [2.110611915588379, 2.12678599357605, 2.1126320362091064]
keydict: [2.1322460174560547, 2.098900079727173, 2.0998198986053467]
defaultdict: [3.184070110321045, 3.2007319927215576, 3.188380002975464]

# pypy2.2 iforkey.py
ifdict: [0.510915994644165, 0.23750996589660645, 0.2241990566253662]
keydict: [0.23270201683044434, 0.18279695510864258, 0.18002104759216309]
defaultdict: [3.4535930156707764, 3.3697848320007324, 3.392897129058838]

And using the latest datetime.py:

# pypy iforkey.py
ifdict: [0.2814958095550537, 0.23425602912902832, 0.22999906539916992]
keydict: [0.23637700080871582, 0.18506789207458496, 0.1831810474395752]
defaultdict: [2.8174121379852295, 2.74626088142395, 2.7308008670806885]

Excellent, thank you :-) Regards -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 964 bytes Desc: This is a digitally signed message part. URL: From brecht at mos6581.org Tue Mar 11 21:49:15 2014 From: brecht at mos6581.org (Brecht Machiels) Date: Tue, 11 Mar 2014 21:49:15 +0100 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> Message-ID: <144b2e80935.-6422525578898360305.-4494664970611521086@mos6581.org> Hello Maciej and Armin, Glad you think this is a valuable benchmark, since I provided it mostly for selfish reasons ;) I've done a quick test similar to Armin's, rendering the original 4-page document over and over again. While I can see the speed improving, it still doesn't reach CPython's performance. I haven't found the time yet to try with a longer document. I'll render a book from project Gutenberg soon and report back here. Let me know if there's anything else I can do. 
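The warm-up behaviour under discussion is easiest to see by timing each repetition separately rather than the run as a whole. A sketch; render_document is a stand-in workload, not RinohType's API:

```python
import time

def render_document():
    # Stand-in workload; under a JIT, early iterations include compile time.
    return sum(i * i for i in range(50000))

per_run = []
for _ in range(5):
    start = time.time()
    render_document()
    per_run.append(time.time() - start)

# Under a warming JIT the first entries should dominate; under CPython
# the times stay roughly flat.
print(["%.4f" % t for t in per_run])
```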
Bengt Richter raised an interesting question (but his message didn't seem to make it to the list): > Is there any way that jit results could be cached to some degree, in one > or more files, to give the next execution of a program a warmer start? I remember seeing a similar question before. IIRC one suggestion was to spawn a daemon process. I suppose that could work for RinohType, but I'm also interested to hear if it would be possible to have PyPy save the JIT state to a file on termination. Cheers, Brecht ---- On Sun, 02 Mar 2014 22:15:28 +0100 Maciej Fijalkowski wrote ---- > I must say I've been trying to understand what's going on and I'm > failing so far. Thanks for a valuable benchmark! And yes, we're > working on improving the warmup time (ETA unknown though) ---- On Sun, 02 Mar 2014 17:01:32 +0100 Armin Rigo wrote ---- > On 1 March 2014 23:34, Brecht Machiels wrote: > > While PyPy2 performs better than PyPy3, it's still much slower than CPython. Is RinohType hitting a weak spot in PyPy? Any hints on what I can do to improve performance? > > It's not really helpful, but the warm-up time is the first issue here. > If I edit template.py to run it e.g. 10 times instead of only once, > the speed grows quickly by a factor of 4. It means your code, for > some reason, exhibits slow warm-ups (not the worst we've seen, but I > agree it's a lot). It would be interesting to know if you have a > similar speed-up when processing a single 10-times-larger document > instead of 10 times the same small document :-) > > > A bient?t, > > Armin. 
From taavi.burns at gmail.com Tue Mar 11 23:26:45 2014 From: taavi.burns at gmail.com (Taavi Burns) Date: Tue, 11 Mar 2014 18:26:45 -0400 Subject: [pypy-dev] RinohType and PyPy2 In-Reply-To: <144b2e80935.-6422525578898360305.-4494664970611521086@mos6581.org> References: <1431737446.616651.1393713257918.JavaMail.sas1@172.29.252.247> <1689107550.632557.1393755103839.JavaMail.sas1@172.29.252.247> <144b2e80935.-6422525578898360305.-4494664970611521086@mos6581.org> Message-ID: <5E16145E-0A70-4A75-8743-C9E4D685DBFB@gmail.com> > On Mar 11, 2014, at 16:49, Brecht Machiels wrote: > > >> Is there any way that jit results could be cached to some degree, in one >> or more files, to give the next execution of a program a warmer start? > > I remember seeing a similar question before. IIRC one suggestion was to spawn a daemon process. I suppose that could work for RinohType, but I'm also interested to hear if it would be possible to have PyPy save the JIT state to a file on termination. There's a FAQ entry for that! :) http://pypy.readthedocs.org/en/improve-docs/faq.html#couldn-t-the-jit-dump-and-reload-already-compiled-machine-code -- taa /*eof*/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mak at issuu.com Wed Mar 12 23:06:19 2014 From: mak at issuu.com (Martin Koch) Date: Wed, 12 Mar 2014 23:06:19 +0100 Subject: [pypy-dev] Pypy garbage collection Message-ID: Hi List I'm running a server (written in python, executed with pypy) that holds a large graph (55GB, millions of nodes and edges) in memory and responds to queries by traversing the graph. The graph is mutated a few times a second, and there are hundreds of read-only requests a second. My problem is that I have no control over garbage collection. Thus, a major GC might kick in while serving a query, and with this amount of data, the GC takes around 2 minutes. 
I have tried mitigating this by guessing when a GC might be due, and proactively starting the garbage collector while not serving a request (this is ok, as duplicate servers will respond to requests while this one is collecting).

What I would really like is to be able to disable garbage collection for the old generation. This is because the graph is fairly static, and I can live with leaking memory from the relatively few and small mutations that occur. Any queries are only likely to generate objects in the new generation, and it is fine to collect these. Also, by design, the process is periodically restarted in order to re-synchronize it with an authoritative source (thus rebuilding the graph from scratch), so slight leakage is not an issue here.

I have tried experimenting with setting environment variables as well as the 'gc' module, but nothing seems to give me what I want.

If disabling gc for certain generations is not possible, it would be nice to be able to get a hint when a major collection is about to occur, so I can stop serving requests.

I'm using the following pypy version:
Python 2.7.3 (2.2.1+dfsg-1, Jan 24 2014, 10:12:37)
[PyPy 2.2.1 with GCC 4.6.3] on linux2

An additional question: pypy 2.2.1 should have incremental GC; shouldn't that avoid long pauses due to garbage collection?

Thanks,
/Martin Koch - Senior Systems Architect - issuu.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From fijall at gmail.com Thu Mar 13 00:56:34 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 13 Mar 2014 01:56:34 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: On Thu, Mar 13, 2014 at 12:06 AM, Martin Koch wrote: > Hi List > > I'm running a server (written in python, executed with pypy) that holds a > large graph (55GB, millions of nodes and edges) in memory and responds to > queries by traversing the graph.The graph is mutated a few times a second, > and there are hundreds of read-only requests a second. > > My problem is that I no control over garbage collection. Thus, a major GC > might kick in while serving a query, and with this amount of data, the GC > takes around 2 minutes. I have tried mitigating this by guessing when a GC > might be due, and proactively starting the garbage collector while not > serving a request (this is ok, as duplicate servers will respond to requests > while this one is collecting). > > What I would really like is to be able to disable garbage collection for the > old generation. This is because the graph is fairly static, and I can live > with leaking memory from the relatively few and small mutations that occur. > Any queries are only likely to generate objects in the new generation, and > it is fine to collect these. Also, by design, the process is periodically > restarted in order to re-synchronize it with an authoritative source (thus > rebuilding the graph from scratch), so slight leakage is not an issue here. > > I have tried experimenting with setting environment variables as well as the > 'gc' module, but nothing seems to give me what I want. > > If disabling gc for certain generations is not possible, it would be nice to > be able to get a hint when a major collection is about to occur, so I can > stop serving requests. 
>
> I'm using the following pypy version:
> Python 2.7.3 (2.2.1+dfsg-1, Jan 24 2014, 10:12:37)
> [PyPy 2.2.1 with GCC 4.6.3] on linux2
>
> An additional question: pypy 2.2.1 should have incremental GC; shouldn't
> that avoid long pauses due to garbage collection?

Yes, it totally should. If your pauses are not incremental, we would like to be able to execute it. Since it's 55G, do you think you can make us an example that can run on a normal machine?

From arigo at tunes.org Thu Mar 13 12:29:50 2014
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 13 Mar 2014 12:29:50 +0100
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To:
References:
Message-ID:

Hi Martin,

On 13 March 2014 00:56, Maciej Fijalkowski wrote:
> Yes, it totally should. If your pauses are not incremental, we would
> like to be able to execute it. Since it's 55G, do you think you can
> make us an example that can run on a normal machine?

I think the request is not very clear. We do have a machine with 100GB of RAM, so that part should not be a problem. The question of Maciej can probably be rephrased as: can you give us a reproducible example? Even if the large pauses appear to occur on any example you try (which they shouldn't), please give us one such example.

Also, maybe we should have anyway a way to give the GC a hint: "now is a good time to run if you need to".


À bientôt,

Armin.

From mak at issuu.com Thu Mar 13 12:45:04 2014
From: mak at issuu.com (Martin Koch)
Date: Thu, 13 Mar 2014 12:45:04 +0100
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To:
References:
Message-ID:

Hi Armin, Maciej

Thanks for responding.

I'm in the process of trying to determine what (if any) of the code I'm in a position to share, and I'll get back to you.

Allowing hinting to the GC would be good.
Even better would be a means to allow me to (transparently) allocate objects in unmanaged memory, but I would expect that to be a tall order :) Thanks, /Martin On Thu, Mar 13, 2014 at 12:29 PM, Armin Rigo wrote: > Hi Martin, > > On 13 March 2014 00:56, Maciej Fijalkowski wrote: > > Yes, it totally should. If your pauses are not incremental, we would > > like to be able to execute it. Since it's 55G, do you think you can > > make us an example that can run on a normal machine? > > I think the request is not very clear. We do have a machine with > 100GB of RAM, so that part should not be a problem. The question of > Maciej can probably be rephrased as: can you give us a reproducible > example? Even if the large pauses appear to occur on any example you > try (which they shouldn't), please give us one such example. > > Also, maybe we should have anyway a way to give the GC a hint: "now is > a good time to run if you need to". > > > A bient?t, > > Armin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Thu Mar 13 19:45:44 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 13 Mar 2014 20:45:44 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: > Hi Armin, Maciej > > Thanks for responding. > > I'm in the process of trying to determine what (if any) of the code I'm in a > position to share, and I'll get back to you. > > Allowing hinting to the GC would be good. Even better would be a means to > allow me to (transparently) allocate objects in unmanaged memory, but I > would expect that to be a tall order :) > > Thanks, > /Martin Hi Martin. Note that in case you want us to do the work of isolating the problem, we do offer paid support to do that (then we can sign NDAs and stuff). 
Otherwise we would be more than happy to fix bugs once you isolate a part you can share freely :) From mak at issuu.com Fri Mar 14 16:19:56 2014 From: mak at issuu.com (Martin Koch) Date: Fri, 14 Mar 2014 16:19:56 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: We have hacked up a small sample that seems to exhibit the same issue. We basically generate a linked list of objects. To increase connectedness, elements in the list hold references (dummy_links) to 10 randomly chosen previous elements in the list. We then time a function that traverses 50000 elements from the list from a random start point. If the traversal reaches the end of the list, we instead traverse one of the dummy links. Thus, exactly 50K elements are traversed every time. To generate some garbage, we build a list holding the traversed elements and a dummy list of characters. Timings for the last 100 runs are stored in a circular buffer. If the elapsed time for the last run is more than twice the average time, we print out a line with the elapsed time, the threshold, and the 90% runtime (we would like to see that the mean runtime does not increase with the number of elements in the list, but that the max time does increase (linearly with the number of object, i guess); traversing 50K elements should be independent of the memory size). We have tried monitoring memory consumption by external inspection, but cannot consistently verify that memory is deallocated at the same time that we see slow requests. Perhaps the pypy runtime doesn't always return freed pages back to the OS? Using top, we observe that 10M elements allocates around 17GB after building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly after building). 
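The "slow run" definition used in the benchmark boils down to two small helpers (a restatement of the script's logic for clarity; the function names are mine, not from the script):

```python
def slow_threshold(timings):
    # A run counts as slow if it takes more than twice the mean
    # of the recent runs held in the circular buffer.
    return 2.0 * sum(timings) / len(timings)

def quantile(timings, q=0.9):
    # The 90th-percentile runtime of the recent runs.
    return sorted(timings)[int(len(timings) * q)]

recent = [0.4] * 99 + [13.2]   # 99 normal runs and one slow one
print(slow_threshold(recent))  # well below 13.2, so that run is "slow"
print(quantile(recent))        # 0.4
```

Because the threshold is relative to the recent mean, a single multi-second pause stands out sharply against a buffer of sub-second runs, which is exactly the pattern reported below.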
Here is output from a few runs with different number of elements:

*pypy mem.py 10000000*
start build
end build 84.142424
that took a long time elapsed: 13.230586 slow_threshold: 1.495401 90th_quantile_runtime: 0.421558
that took a long time elapsed: 13.016531 slow_threshold: 1.488160 90th_quantile_runtime: 0.423441
that took a long time elapsed: 13.032537 slow_threshold: 1.474563 90th_quantile_runtime: 0.419817

*pypy mem.py 20000000*
start build
end build 180.823105
that took a long time elapsed: 27.346064 slow_threshold: 2.295146 90th_quantile_runtime: 0.434726
that took a long time elapsed: 26.028852 slow_threshold: 2.283927 90th_quantile_runtime: 0.374190
that took a long time elapsed: 25.432279 slow_threshold: 2.279631 90th_quantile_runtime: 0.371502

*pypy mem.py 30000000*
start build
end build 276.217811
that took a long time elapsed: 40.993855 slow_threshold: 3.188464 90th_quantile_runtime: 0.459891
that took a long time elapsed: 41.693553 slow_threshold: 3.183003 90th_quantile_runtime: 0.393654
that took a long time elapsed: 39.679769 slow_threshold: 3.190782 90th_quantile_runtime: 0.393677
that took a long time elapsed: 43.573411 slow_threshold: 3.239637 90th_quantile_runtime: 0.393654

*Code below*
*--------------------------------------------------------------*

import time
from random import randint, choice
import sys


allElems = {}

class Node:
    def __init__(self, v_):
        self.v = v_
        self.next = None
        self.dummy_data = [randint(0,100)
                           for _ in xrange(randint(50,100))]
        allElems[self.v] = self
        if self.v > 0:
            self.dummy_links = [allElems[randint(0, self.v-1)] for _ in xrange(10)]
        else:
            self.dummy_links = [self]

    def set_next(self, l):
        self.next = l


def follow(node):
    acc = []
    count = 0
    cur = node
    assert node.v is not None
    assert cur is not None
    while count < 50000:
        # return a value; generate some garbage
        acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in xrange(100)]))

        # if we have reached the end, choose a random link
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
        count += 1

    return acc


def build(num_elems):
    start = time.time()
    print "start build"
    root = Node(0)
    cur = root
    for x in xrange(1, num_elems):
        e = Node(x)
        cur.next = e
        cur = e
    print "end build %f" % (time.time() - start)
    return root


num_timings = 100
if __name__ == "__main__":
    num_elems = int(sys.argv[1])
    build(num_elems)
    total = 0
    timings = [0.0] * num_timings  # run times for the last num_timings runs
    i = 0
    beginning = time.time()
    while time.time() - beginning < 600:
        start = time.time()
        elem = allElems[randint(0, num_elems - 1)]
        assert(elem is not None)

        lst = follow(elem)

        total += choice(lst)[0]  # use the return value for something

        end = time.time()

        elapsed = end - start
        timings[i % num_timings] = elapsed
        if (i > num_timings):
            slow_time = 2 * sum(timings) / num_timings  # slow defined as > 2*avg run time
            if (elapsed > slow_time):
                print "that took a long time elapsed: %f slow_threshold: %f 90th_quantile_runtime: %f" % \
                    (elapsed, slow_time, sorted(timings)[int(num_timings*.9)])
        i += 1
    print total

On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski wrote:
> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote:
> > Hi Armin, Maciej
> >
> > Thanks for responding.
> >
> > I'm in the process of trying to determine what (if any) of the code I'm in a
> > position to share, and I'll get back to you.
> >
> > Allowing hinting to the GC would be good. Even better would be a means to
> > allow me to (transparently) allocate objects in unmanaged memory, but I
> > would expect that to be a tall order :)
> >
> > Thanks,
> > /Martin
>
> Hi Martin.
>
> Note that in case you want us to do the work of isolating the problem,
> we do offer paid support to do that (then we can sign NDAs and stuff).
> Otherwise we would be more than happy to fix bugs once you isolate a
> part you can share freely :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mak at issuu.com Sun Mar 16 22:34:51 2014 From: mak at issuu.com (Martin Koch) Date: Sun, 16 Mar 2014 22:34:51 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: I have tried getting the pypy source and building my own version of pypy. I have modified rpython/memory/gc/incminimark.py:major_collection_step() to print out when it starts and when it stops. Apparently, the slow queries do NOT occur during major_collection_step; at least, I have not observed major step output during a query execution. So, apparently, something else is blocking. This could be another aspect of the GC, but it could also be anything else. Just to be sure, I have tried running the same application in python with garbage collection disabled. I don't see the problem there, so it is somehow related to either GC or the runtime somehow. Cheers, /Martin On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: > We have hacked up a small sample that seems to exhibit the same issue. > > We basically generate a linked list of objects. To increase connectedness, > elements in the list hold references (dummy_links) to 10 randomly chosen > previous elements in the list. > > We then time a function that traverses 50000 elements from the list from a > random start point. If the traversal reaches the end of the list, we > instead traverse one of the dummy links. Thus, exactly 50K elements are > traversed every time. To generate some garbage, we build a list holding the > traversed elements and a dummy list of characters. > > Timings for the last 100 runs are stored in a circular buffer. 
If the > elapsed time for the last run is more than twice the average time, we print > out a line with the elapsed time, the threshold, and the 90% runtime (we > would like to see that the mean runtime does not increase with the number > of elements in the list, but that the max time does increase (linearly with > the number of object, i guess); traversing 50K elements should be > independent of the memory size). > > We have tried monitoring memory consumption by external inspection, but > cannot consistently verify that memory is deallocated at the same time that > we see slow requests. Perhaps the pypy runtime doesn't always return freed > pages back to the OS? > > Using top, we observe that 10M elements allocates around 17GB after > building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly > after building). > > Here is output from a few runs with different number of elements: > > > *pypy mem.py 10000000* > start build > end build 84.142424 > that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > 90th_quantile_runtime: 0.421558 > that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > 90th_quantile_runtime: 0.423441 > that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > 90th_quantile_runtime: 0.419817 > > *pypy mem.py 20000000* > start build > end build 180.823105 > that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > 90th_quantile_runtime: 0.434726 > that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > 90th_quantile_runtime: 0.374190 > that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > 90th_quantile_runtime: 0.371502 > > *pypy mem.py 30000000* > start build > end build 276.217811 > that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > 90th_quantile_runtime: 0.459891 > that took a long time elapsed: 41.693553 slow_threshold: 3.183003 > 90th_quantile_runtime: 0.393654 > that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > 
90th_quantile_runtime: 0.393677 > that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > 90th_quantile_runtime: 0.393654 > > *Code below* > *--------------------------------------------------------------* > import time > from random import randint, choice > import sys > > > allElems = {} > > class Node: > def __init__(self, v_): > self.v = v_ > self.next = None > self.dummy_data = [randint(0,100) > for _ in xrange(randint(50,100))] > allElems[self.v] = self > if self.v > 0: > self.dummy_links = [allElems[randint(0, self.v-1)] for _ in > xrange(10)] > else: > self.dummy_links = [self] > > def set_next(self, l): > self.next = l > > > def follow(node): > acc = [] > count = 0 > cur = node > assert node.v is not None > assert cur is not None > while count < 50000: > # return a value; generate some garbage > acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in > xrange(100)])) > > # if we have reached the end, chose a random link > cur = choice(cur.dummy_links) if cur.next is None else cur.next > count += 1 > > return acc > > > def build(num_elems): > start = time.time() > print "start build" > root = Node(0) > cur = root > for x in xrange(1, num_elems): > e = Node(x) > cur.next = e > cur = e > print "end build %f" % (time.time() - start) > return root > > > num_timings = 100 > if __name__ == "__main__": > num_elems = int(sys.argv[1]) > build(num_elems) > total = 0 > timings = [0.0] * num_timings # run times for the last num_timings runs > i = 0 > beginning = time.time() > while time.time() - beginning < 600: > start = time.time() > elem = allElems[randint(0, num_elems - 1)] > assert(elem is not None) > > lst = follow(elem) > > total += choice(lst)[0] # use the return value for something > > end = time.time() > > elapsed = end-start > timings[i % num_timings] = elapsed > if (i > num_timings): > slow_time = 2 * sum(timings)/num_timings # slow defined as > > 2*avg run time > if (elapsed > slow_time): > print "that took a long time elapsed: %f 
slow_threshold: > %f 90th_quantile_runtime: %f" % \ > (elapsed, slow_time, > sorted(timings)[int(num_timings*.9)]) > i += 1 > print total > > > > > > On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski wrote: > >> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >> > Hi Armin, Maciej >> > >> > Thanks for responding. >> > >> > I'm in the process of trying to determine what (if any) of the code I'm >> in a >> > position to share, and I'll get back to you. >> > >> > Allowing hinting to the GC would be good. Even better would be a means >> to >> > allow me to (transparently) allocate objects in unmanaged memory, but I >> > would expect that to be a tall order :) >> > >> > Thanks, >> > /Martin >> >> Hi Martin. >> >> Note that in case you want us to do the work of isolating the problem, >> we do offer paid support to do that (then we can sign NDAs and stuff). >> Otherwise we would be more than happy to fix bugs once you isolate a >> part you can share freely :) >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 08:21:25 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 09:21:25 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: there is an environment variable PYPYLOG=gc:- (where - is stdout) which will do that for you btw. maybe you can find out what's that using profiling or valgrind? On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > I have tried getting the pypy source and building my own version of pypy. I > have modified rpython/memory/gc/incminimark.py:major_collection_step() to > print out when it starts and when it stops. Apparently, the slow queries do > NOT occur during major_collection_step; at least, I have not observed major > step output during a query execution. So, apparently, something else is > blocking. This could be another aspect of the GC, but it could also be > anything else. 
> > Just to be sure, I have tried running the same application in python with > garbage collection disabled. I don't see the problem there, so it is somehow > related to either GC or the runtime somehow. > > Cheers, > /Martin > > > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: >> >> We have hacked up a small sample that seems to exhibit the same issue. >> >> We basically generate a linked list of objects. To increase connectedness, >> elements in the list hold references (dummy_links) to 10 randomly chosen >> previous elements in the list. >> >> We then time a function that traverses 50000 elements from the list from a >> random start point. If the traversal reaches the end of the list, we instead >> traverse one of the dummy links. Thus, exactly 50K elements are traversed >> every time. To generate some garbage, we build a list holding the traversed >> elements and a dummy list of characters. >> >> Timings for the last 100 runs are stored in a circular buffer. If the >> elapsed time for the last run is more than twice the average time, we print >> out a line with the elapsed time, the threshold, and the 90% runtime (we >> would like to see that the mean runtime does not increase with the number of >> elements in the list, but that the max time does increase (linearly with the >> number of object, i guess); traversing 50K elements should be independent of >> the memory size). >> >> We have tried monitoring memory consumption by external inspection, but >> cannot consistently verify that memory is deallocated at the same time that >> we see slow requests. Perhaps the pypy runtime doesn't always return freed >> pages back to the OS? >> >> Using top, we observe that 10M elements allocates around 17GB after >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly >> after building). 
>> >> Here is output from a few runs with different number of elements: >> >> >> pypy mem.py 10000000 >> start build >> end build 84.142424 >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 >> 90th_quantile_runtime: 0.421558 >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 >> 90th_quantile_runtime: 0.423441 >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 >> 90th_quantile_runtime: 0.419817 >> >> pypy mem.py 20000000 >> start build >> end build 180.823105 >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 >> 90th_quantile_runtime: 0.434726 >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 >> 90th_quantile_runtime: 0.374190 >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 >> 90th_quantile_runtime: 0.371502 >> >> pypy mem.py 30000000 >> start build >> end build 276.217811 >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 >> 90th_quantile_runtime: 0.459891 >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 >> 90th_quantile_runtime: 0.393654 >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 >> 90th_quantile_runtime: 0.393677 >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 >> 90th_quantile_runtime: 0.393654 >> >> Code below >> -------------------------------------------------------------- >> import time >> from random import randint, choice >> import sys >> >> >> allElems = {} >> >> class Node: >> def __init__(self, v_): >> self.v = v_ >> self.next = None >> self.dummy_data = [randint(0,100) >> for _ in xrange(randint(50,100))] >> allElems[self.v] = self >> if self.v > 0: >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ in >> xrange(10)] >> else: >> self.dummy_links = [self] >> >> def set_next(self, l): >> self.next = l >> >> >> def follow(node): >> acc = [] >> count = 0 >> cur = node >> assert node.v is not None >> assert cur is not None >> while count < 50000: >> # 
return a value; generate some garbage >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in >> xrange(100)])) >> >> # if we have reached the end, chose a random link >> cur = choice(cur.dummy_links) if cur.next is None else cur.next >> count += 1 >> >> return acc >> >> >> def build(num_elems): >> start = time.time() >> print "start build" >> root = Node(0) >> cur = root >> for x in xrange(1, num_elems): >> e = Node(x) >> cur.next = e >> cur = e >> print "end build %f" % (time.time() - start) >> return root >> >> >> num_timings = 100 >> if __name__ == "__main__": >> num_elems = int(sys.argv[1]) >> build(num_elems) >> total = 0 >> timings = [0.0] * num_timings # run times for the last num_timings >> runs >> i = 0 >> beginning = time.time() >> while time.time() - beginning < 600: >> start = time.time() >> elem = allElems[randint(0, num_elems - 1)] >> assert(elem is not None) >> >> lst = follow(elem) >> >> total += choice(lst)[0] # use the return value for something >> >> end = time.time() >> >> elapsed = end-start >> timings[i % num_timings] = elapsed >> if (i > num_timings): >> slow_time = 2 * sum(timings)/num_timings # slow defined as > >> 2*avg run time >> if (elapsed > slow_time): >> print "that took a long time elapsed: %f slow_threshold: >> %f 90th_quantile_runtime: %f" % \ >> (elapsed, slow_time, >> sorted(timings)[int(num_timings*.9)]) >> i += 1 >> print total >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> wrote: >>> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >>> > Hi Armin, Maciej >>> > >>> > Thanks for responding. >>> > >>> > I'm in the process of trying to determine what (if any) of the code I'm >>> > in a >>> > position to share, and I'll get back to you. >>> > >>> > Allowing hinting to the GC would be good. 
Even better would be a means >>> > to >>> > allow me to (transparently) allocate objects in unmanaged memory, but I >>> > would expect that to be a tall order :) >>> > >>> > Thanks, >>> > /Martin >>> >>> Hi Martin. >>> >>> Note that in case you want us to do the work of isolating the problem, >>> we do offer paid support to do that (then we can sign NDAs and stuff). >>> Otherwise we would be more than happy to fix bugs once you isolate a >>> part you can share freely :) >> >> > From fijall at gmail.com Mon Mar 17 12:09:09 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 13:09:09 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: The number of lines is nonsense. This is a timestamp in hex. On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: > Based On Maciej's suggestion, I tried the following > > PYPYLOG=- pypy mem.py 10000000 > out > > This generates a logfile which looks something like this > > start--> > [2b99f1981b527e] {gc-minor > [2b99f1981ba680] {gc-minor-walkroots > [2b99f1981c2e02] gc-minor-walkroots} > [2b99f19890d750] gc-minor} > [snip] > ... > <--stop > > > It turns out that the culprit is a lot of MINOR collections. > > I base this on the following observations: > > I can't understand the format of the timestamp on each logline (the > "[2b99f1981b527e]"). From what I can see in the code, this should be output > from time.clock(), but that doesn't return a number like that when I run > pypy interactively > Instead, I count the number of debug lines between start--> and the > corresponding <--stop. > Most runs have a few hundred lines of output between start/stop > All slow runs have very close to 57800 lines out output between start/stop > One such sample does 9609 gc-collect-step operations, 9647 gc-minor > operations, and 9647 gc-minor-walkroots operations. 
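The counting described above (debug lines and gc events between each `start-->`/`<--stop` pair) can be done with a few lines over the PYPYLOG output (a hypothetical sketch based on the log excerpt shown, not an official PyPy tool):

```python
def section_stats(log_lines):
    # Tally lines and gc-event openings between start-->/<--stop markers.
    stats = []
    current = None
    for line in log_lines:
        if line.startswith("start-->"):
            current = {"lines": 0, "gc-minor": 0, "gc-collect-step": 0}
        elif line.startswith("<--stop"):
            if current is not None:
                stats.append(current)
            current = None
        elif current is not None:
            current["lines"] += 1
            for event in ("gc-minor", "gc-collect-step"):
                # "{event-name" marks the opening of a debug section
                if "{" + event in line and "walkroots" not in line:
                    current[event] += 1
    return stats

sample = [
    "start-->",
    "[2b99f1981b527e] {gc-minor",
    "[2b99f1981ba680] {gc-minor-walkroots",
    "[2b99f1981c2e02] gc-minor-walkroots}",
    "[2b99f19890d750] gc-minor}",
    "<--stop",
]
print(section_stats(sample))  # [{'lines': 4, 'gc-minor': 1, 'gc-collect-step': 0}]
```

Sections with tens of thousands of lines then stand out immediately, matching the ~57800-line slow runs observed here.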
> > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > wrote: >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >> which will do that for you btw. >> >> maybe you can find out what's that using profiling or valgrind? >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >> > I have tried getting the pypy source and building my own version of >> > pypy. I >> > have modified rpython/memory/gc/incminimark.py:major_collection_step() >> > to >> > print out when it starts and when it stops. Apparently, the slow queries >> > do >> > NOT occur during major_collection_step; at least, I have not observed >> > major >> > step output during a query execution. So, apparently, something else is >> > blocking. This could be another aspect of the GC, but it could also be >> > anything else. >> > >> > Just to be sure, I have tried running the same application in python >> > with >> > garbage collection disabled. I don't see the problem there, so it is >> > somehow >> > related to either GC or the runtime somehow. >> > >> > Cheers, >> > /Martin >> > >> > >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: >> >> >> >> We have hacked up a small sample that seems to exhibit the same issue. >> >> >> >> We basically generate a linked list of objects. To increase >> >> connectedness, >> >> elements in the list hold references (dummy_links) to 10 randomly >> >> chosen >> >> previous elements in the list. >> >> >> >> We then time a function that traverses 50000 elements from the list >> >> from a >> >> random start point. If the traversal reaches the end of the list, we >> >> instead >> >> traverse one of the dummy links. Thus, exactly 50K elements are >> >> traversed >> >> every time. To generate some garbage, we build a list holding the >> >> traversed >> >> elements and a dummy list of characters. >> >> >> >> Timings for the last 100 runs are stored in a circular buffer. 
If the >> >> elapsed time for the last run is more than twice the average time, we >> >> print >> >> out a line with the elapsed time, the threshold, and the 90% runtime >> >> (we >> >> would like to see that the mean runtime does not increase with the >> >> number of >> >> elements in the list, but that the max time does increase (linearly >> >> with the >> >> number of object, i guess); traversing 50K elements should be >> >> independent of >> >> the memory size). >> >> >> >> We have tried monitoring memory consumption by external inspection, but >> >> cannot consistently verify that memory is deallocated at the same time >> >> that >> >> we see slow requests. Perhaps the pypy runtime doesn't always return >> >> freed >> >> pages back to the OS? >> >> >> >> Using top, we observe that 10M elements allocates around 17GB after >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >> >> shortly >> >> after building). >> >> >> >> Here is output from a few runs with different number of elements: >> >> >> >> >> >> pypy mem.py 10000000 >> >> start build >> >> end build 84.142424 >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 >> >> 90th_quantile_runtime: 0.421558 >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 >> >> 90th_quantile_runtime: 0.423441 >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 >> >> 90th_quantile_runtime: 0.419817 >> >> >> >> pypy mem.py 20000000 >> >> start build >> >> end build 180.823105 >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 >> >> 90th_quantile_runtime: 0.434726 >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 >> >> 90th_quantile_runtime: 0.374190 >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 >> >> 90th_quantile_runtime: 0.371502 >> >> >> >> pypy mem.py 30000000 >> >> start build >> >> end build 276.217811 >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 >> 
>> 90th_quantile_runtime: 0.459891 >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 >> >> 90th_quantile_runtime: 0.393654 >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 >> >> 90th_quantile_runtime: 0.393677 >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> Code below >> >> -------------------------------------------------------------- >> >> import time >> >> from random import randint, choice >> >> import sys >> >> >> >> >> >> allElems = {} >> >> >> >> class Node: >> >> def __init__(self, v_): >> >> self.v = v_ >> >> self.next = None >> >> self.dummy_data = [randint(0,100) >> >> for _ in xrange(randint(50,100))] >> >> allElems[self.v] = self >> >> if self.v > 0: >> >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ in >> >> xrange(10)] >> >> else: >> >> self.dummy_links = [self] >> >> >> >> def set_next(self, l): >> >> self.next = l >> >> >> >> >> >> def follow(node): >> >> acc = [] >> >> count = 0 >> >> cur = node >> >> assert node.v is not None >> >> assert cur is not None >> >> while count < 50000: >> >> # return a value; generate some garbage >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x >> >> in >> >> xrange(100)])) >> >> >> >> # if we have reached the end, chose a random link >> >> cur = choice(cur.dummy_links) if cur.next is None else cur.next >> >> count += 1 >> >> >> >> return acc >> >> >> >> >> >> def build(num_elems): >> >> start = time.time() >> >> print "start build" >> >> root = Node(0) >> >> cur = root >> >> for x in xrange(1, num_elems): >> >> e = Node(x) >> >> cur.next = e >> >> cur = e >> >> print "end build %f" % (time.time() - start) >> >> return root >> >> >> >> >> >> num_timings = 100 >> >> if __name__ == "__main__": >> >> num_elems = int(sys.argv[1]) >> >> build(num_elems) >> >> total = 0 >> >> timings = [0.0] * num_timings # run times for the last num_timings >> >> runs >> >> i = 0 >> >> 
beginning = time.time() >> >> while time.time() - beginning < 600: >> >> start = time.time() >> >> elem = allElems[randint(0, num_elems - 1)] >> >> assert(elem is not None) >> >> >> >> lst = follow(elem) >> >> >> >> total += choice(lst)[0] # use the return value for something >> >> >> >> end = time.time() >> >> >> >> elapsed = end-start >> >> timings[i % num_timings] = elapsed >> >> if (i > num_timings): >> >> slow_time = 2 * sum(timings)/num_timings # slow defined as >> >> > >> >> 2*avg run time >> >> if (elapsed > slow_time): >> >> print "that took a long time elapsed: %f >> >> slow_threshold: >> >> %f 90th_quantile_runtime: %f" % \ >> >> (elapsed, slow_time, >> >> sorted(timings)[int(num_timings*.9)]) >> >> i += 1 >> >> print total >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >> wrote: >> >>> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >> >>> > Hi Armin, Maciej >> >>> > >> >>> > Thanks for responding. >> >>> > >> >>> > I'm in the process of trying to determine what (if any) of the code >> >>> > I'm >> >>> > in a >> >>> > position to share, and I'll get back to you. >> >>> > >> >>> > Allowing hinting to the GC would be good. Even better would be a >> >>> > means >> >>> > to >> >>> > allow me to (transparently) allocate objects in unmanaged memory, >> >>> > but I >> >>> > would expect that to be a tall order :) >> >>> > >> >>> > Thanks, >> >>> > /Martin >> >>> >> >>> Hi Martin. >> >>> >> >>> Note that in case you want us to do the work of isolating the problem, >> >>> we do offer paid support to do that (then we can sign NDAs and stuff). 
>> >>> Otherwise we would be more than happy to fix bugs once you isolate a >> >>> part you can share freely :) >> >> >> >> >> > > > From fijall at gmail.com Mon Mar 17 13:53:20 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 14:53:20 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: I think it's the cycles of your CPU On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: > What is the unit? Perhaps I'm being thick here, but I can't correlate it > with seconds (which the program does print out). Slow runs are around 13 > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. from > 0x2b994c9d31889c to 0x2b9944ab8c4f49). > > > > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski > wrote: >> >> The number of lines is nonsense. This is a timestamp in hex. >> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: >> > Based On Maciej's suggestion, I tried the following >> > >> > PYPYLOG=- pypy mem.py 10000000 > out >> > >> > This generates a logfile which looks something like this >> > >> > start--> >> > [2b99f1981b527e] {gc-minor >> > [2b99f1981ba680] {gc-minor-walkroots >> > [2b99f1981c2e02] gc-minor-walkroots} >> > [2b99f19890d750] gc-minor} >> > [snip] >> > ... >> > <--stop >> > >> > >> > It turns out that the culprit is a lot of MINOR collections. >> > >> > I base this on the following observations: >> > >> > I can't understand the format of the timestamp on each logline (the >> > "[2b99f1981b527e]"). From what I can see in the code, this should be >> > output >> > from time.clock(), but that doesn't return a number like that when I run >> > pypy interactively >> > Instead, I count the number of debug lines between start--> and the >> > corresponding <--stop. 
>> > Most runs have a few hundred lines of output between start/stop >> > All slow runs have very close to 57800 lines out output between >> > start/stop >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor >> > operations, and 9647 gc-minor-walkroots operations. >> > >> > >> > Thanks, >> > /Martin >> > >> > >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> > wrote: >> >> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >> >> which will do that for you btw. >> >> >> >> maybe you can find out what's that using profiling or valgrind? >> >> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >> >> > I have tried getting the pypy source and building my own version of >> >> > pypy. I >> >> > have modified >> >> > rpython/memory/gc/incminimark.py:major_collection_step() >> >> > to >> >> > print out when it starts and when it stops. Apparently, the slow >> >> > queries >> >> > do >> >> > NOT occur during major_collection_step; at least, I have not observed >> >> > major >> >> > step output during a query execution. So, apparently, something else >> >> > is >> >> > blocking. This could be another aspect of the GC, but it could also >> >> > be >> >> > anything else. >> >> > >> >> > Just to be sure, I have tried running the same application in python >> >> > with >> >> > garbage collection disabled. I don't see the problem there, so it is >> >> > somehow >> >> > related to either GC or the runtime somehow. >> >> > >> >> > Cheers, >> >> > /Martin >> >> > >> >> > >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: >> >> >> >> >> >> We have hacked up a small sample that seems to exhibit the same >> >> >> issue. >> >> >> >> >> >> We basically generate a linked list of objects. To increase >> >> >> connectedness, >> >> >> elements in the list hold references (dummy_links) to 10 randomly >> >> >> chosen >> >> >> previous elements in the list. 
>> >> >> >> >> >> We then time a function that traverses 50000 elements from the list >> >> >> from a >> >> >> random start point. If the traversal reaches the end of the list, we >> >> >> instead >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are >> >> >> traversed >> >> >> every time. To generate some garbage, we build a list holding the >> >> >> traversed >> >> >> elements and a dummy list of characters. >> >> >> >> >> >> Timings for the last 100 runs are stored in a circular buffer. If >> >> >> the >> >> >> elapsed time for the last run is more than twice the average time, >> >> >> we >> >> >> print >> >> >> out a line with the elapsed time, the threshold, and the 90% runtime >> >> >> (we >> >> >> would like to see that the mean runtime does not increase with the >> >> >> number of >> >> >> elements in the list, but that the max time does increase (linearly >> >> >> with the >> >> >> number of object, i guess); traversing 50K elements should be >> >> >> independent of >> >> >> the memory size). >> >> >> >> >> >> We have tried monitoring memory consumption by external inspection, >> >> >> but >> >> >> cannot consistently verify that memory is deallocated at the same >> >> >> time >> >> >> that >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always return >> >> >> freed >> >> >> pages back to the OS? >> >> >> >> >> >> Using top, we observe that 10M elements allocates around 17GB after >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >> >> >> shortly >> >> >> after building). 
>> >> >> >> >> >> Here is output from a few runs with different number of elements: >> >> >> >> >> >> >> >> >> pypy mem.py 10000000 >> >> >> start build >> >> >> end build 84.142424 >> >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 >> >> >> 90th_quantile_runtime: 0.421558 >> >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 >> >> >> 90th_quantile_runtime: 0.423441 >> >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 >> >> >> 90th_quantile_runtime: 0.419817 >> >> >> >> >> >> pypy mem.py 20000000 >> >> >> start build >> >> >> end build 180.823105 >> >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 >> >> >> 90th_quantile_runtime: 0.434726 >> >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 >> >> >> 90th_quantile_runtime: 0.374190 >> >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 >> >> >> 90th_quantile_runtime: 0.371502 >> >> >> >> >> >> pypy mem.py 30000000 >> >> >> start build >> >> >> end build 276.217811 >> >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 >> >> >> 90th_quantile_runtime: 0.459891 >> >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 >> >> >> 90th_quantile_runtime: 0.393677 >> >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> >> >> Code below >> >> >> -------------------------------------------------------------- >> >> >> import time >> >> >> from random import randint, choice >> >> >> import sys >> >> >> >> >> >> >> >> >> allElems = {} >> >> >> >> >> >> class Node: >> >> >> def __init__(self, v_): >> >> >> self.v = v_ >> >> >> self.next = None >> >> >> self.dummy_data = [randint(0,100) >> >> >> for _ in xrange(randint(50,100))] >> >> >> allElems[self.v] = self >> >> >> if self.v > 
0: >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ >> >> >> in >> >> >> xrange(10)] >> >> >> else: >> >> >> self.dummy_links = [self] >> >> >> >> >> >> def set_next(self, l): >> >> >> self.next = l >> >> >> >> >> >> >> >> >> def follow(node): >> >> >> acc = [] >> >> >> count = 0 >> >> >> cur = node >> >> >> assert node.v is not None >> >> >> assert cur is not None >> >> >> while count < 50000: >> >> >> # return a value; generate some garbage >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for >> >> >> x >> >> >> in >> >> >> xrange(100)])) >> >> >> >> >> >> # if we have reached the end, chose a random link >> >> >> cur = choice(cur.dummy_links) if cur.next is None else >> >> >> cur.next >> >> >> count += 1 >> >> >> >> >> >> return acc >> >> >> >> >> >> >> >> >> def build(num_elems): >> >> >> start = time.time() >> >> >> print "start build" >> >> >> root = Node(0) >> >> >> cur = root >> >> >> for x in xrange(1, num_elems): >> >> >> e = Node(x) >> >> >> cur.next = e >> >> >> cur = e >> >> >> print "end build %f" % (time.time() - start) >> >> >> return root >> >> >> >> >> >> >> >> >> num_timings = 100 >> >> >> if __name__ == "__main__": >> >> >> num_elems = int(sys.argv[1]) >> >> >> build(num_elems) >> >> >> total = 0 >> >> >> timings = [0.0] * num_timings # run times for the last >> >> >> num_timings >> >> >> runs >> >> >> i = 0 >> >> >> beginning = time.time() >> >> >> while time.time() - beginning < 600: >> >> >> start = time.time() >> >> >> elem = allElems[randint(0, num_elems - 1)] >> >> >> assert(elem is not None) >> >> >> >> >> >> lst = follow(elem) >> >> >> >> >> >> total += choice(lst)[0] # use the return value for something >> >> >> >> >> >> end = time.time() >> >> >> >> >> >> elapsed = end-start >> >> >> timings[i % num_timings] = elapsed >> >> >> if (i > num_timings): >> >> >> slow_time = 2 * sum(timings)/num_timings # slow defined >> >> >> as >> >> >> > >> >> >> 2*avg run time >> >> >> if (elapsed > slow_time): >> >> >> 
print "that took a long time elapsed: %f >> >> >> slow_threshold: >> >> >> %f 90th_quantile_runtime: %f" % \ >> >> >> (elapsed, slow_time, >> >> >> sorted(timings)[int(num_timings*.9)]) >> >> >> i += 1 >> >> >> print total >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >> >> >> >> >> wrote: >> >> >>> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: >> >> >>> > Hi Armin, Maciej >> >> >>> > >> >> >>> > Thanks for responding. >> >> >>> > >> >> >>> > I'm in the process of trying to determine what (if any) of the >> >> >>> > code >> >> >>> > I'm >> >> >>> > in a >> >> >>> > position to share, and I'll get back to you. >> >> >>> > >> >> >>> > Allowing hinting to the GC would be good. Even better would be a >> >> >>> > means >> >> >>> > to >> >> >>> > allow me to (transparently) allocate objects in unmanaged memory, >> >> >>> > but I >> >> >>> > would expect that to be a tall order :) >> >> >>> > >> >> >>> > Thanks, >> >> >>> > /Martin >> >> >>> >> >> >>> Hi Martin. >> >> >>> >> >> >>> Note that in case you want us to do the work of isolating the >> >> >>> problem, >> >> >>> we do offer paid support to do that (then we can sign NDAs and >> >> >>> stuff). >> >> >>> Otherwise we would be more than happy to fix bugs once you isolate >> >> >>> a >> >> >>> part you can share freely :) >> >> >> >> >> >> >> >> > >> > >> > > > From mak at issuu.com Mon Mar 17 13:48:23 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 13:48:23 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: What is the unit? Perhaps I'm being thick here, but I can't correlate it with seconds (which the program does print out). Slow runs are around 13 seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. from 0x2b994c9d31889c to 0x2b9944ab8c4f49). On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski wrote: > The number of lines is nonsense. This is a timestamp in hex. 
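[Archive note: the tick-to-seconds arithmetic discussed above can be sketched as follows. This is a hedged illustration, not part of the original thread: it assumes the bracketed PYPYLOG values are raw CPU timestamp-counter reads, and the ~2.5 GHz clock rate is the thread's own guess, not something PyPy documents.]

```python
# Sketch: convert two bracketed PYPYLOG timestamps (hex TSC ticks) into
# elapsed seconds.  CPU_HZ is an ASSUMPTION taken from the thread's
# "works out to around 2.5GHz" estimate.
CPU_HZ = 2.5e9  # assumed clock frequency, in ticks per second

def elapsed_seconds(start_hex, stop_hex, hz=CPU_HZ):
    """Elapsed wall time between two '[....]' log timestamps, in seconds."""
    return (int(stop_hex, 16) - int(start_hex, 16)) / hz

# Under this assumption, Martin's figures line up with his wall-clock
# observations: ~34e9 ticks / 2.5e9 Hz is about 13.6 s for a slow run,
# and ~1e9 ticks / 2.5e9 Hz is 0.4 s for a normal one.
```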
> > On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: > > Based On Maciej's suggestion, I tried the following > > > > PYPYLOG=- pypy mem.py 10000000 > out > > > > This generates a logfile which looks something like this > > > > start--> > > [2b99f1981b527e] {gc-minor > > [2b99f1981ba680] {gc-minor-walkroots > > [2b99f1981c2e02] gc-minor-walkroots} > > [2b99f19890d750] gc-minor} > > [snip] > > ... > > <--stop > > > > > > It turns out that the culprit is a lot of MINOR collections. > > > > I base this on the following observations: > > > > I can't understand the format of the timestamp on each logline (the > > "[2b99f1981b527e]"). From what I can see in the code, this should be > output > > from time.clock(), but that doesn't return a number like that when I run > > pypy interactively > > Instead, I count the number of debug lines between start--> and the > > corresponding <--stop. > > Most runs have a few hundred lines of output between start/stop > > All slow runs have very close to 57800 lines out output between > start/stop > > One such sample does 9609 gc-collect-step operations, 9647 gc-minor > > operations, and 9647 gc-minor-walkroots operations. > > > > > > Thanks, > > /Martin > > > > > > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > > wrote: > >> > >> there is an environment variable PYPYLOG=gc:- (where - is stdout) > >> which will do that for you btw. > >> > >> maybe you can find out what's that using profiling or valgrind? > >> > >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > >> > I have tried getting the pypy source and building my own version of > >> > pypy. I > >> > have modified rpython/memory/gc/incminimark.py:major_collection_step() > >> > to > >> > print out when it starts and when it stops. Apparently, the slow > queries > >> > do > >> > NOT occur during major_collection_step; at least, I have not observed > >> > major > >> > step output during a query execution. So, apparently, something else > is > >> > blocking. 
This could be another aspect of the GC, but it could also be > >> > anything else. > >> > > >> > Just to be sure, I have tried running the same application in python > >> > with > >> > garbage collection disabled. I don't see the problem there, so it is > >> > somehow > >> > related to either GC or the runtime somehow. > >> > > >> > Cheers, > >> > /Martin > >> > > >> > > >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: > >> >> > >> >> We have hacked up a small sample that seems to exhibit the same > issue. > >> >> > >> >> We basically generate a linked list of objects. To increase > >> >> connectedness, > >> >> elements in the list hold references (dummy_links) to 10 randomly > >> >> chosen > >> >> previous elements in the list. > >> >> > >> >> We then time a function that traverses 50000 elements from the list > >> >> from a > >> >> random start point. If the traversal reaches the end of the list, we > >> >> instead > >> >> traverse one of the dummy links. Thus, exactly 50K elements are > >> >> traversed > >> >> every time. To generate some garbage, we build a list holding the > >> >> traversed > >> >> elements and a dummy list of characters. > >> >> > >> >> Timings for the last 100 runs are stored in a circular buffer. If the > >> >> elapsed time for the last run is more than twice the average time, we > >> >> print > >> >> out a line with the elapsed time, the threshold, and the 90% runtime > >> >> (we > >> >> would like to see that the mean runtime does not increase with the > >> >> number of > >> >> elements in the list, but that the max time does increase (linearly > >> >> with the > >> >> number of object, i guess); traversing 50K elements should be > >> >> independent of > >> >> the memory size). > >> >> > >> >> We have tried monitoring memory consumption by external inspection, > but > >> >> cannot consistently verify that memory is deallocated at the same > time > >> >> that > >> >> we see slow requests. 
Perhaps the pypy runtime doesn't always return > >> >> freed > >> >> pages back to the OS? > >> >> > >> >> Using top, we observe that 10M elements allocates around 17GB after > >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB > >> >> shortly > >> >> after building). > >> >> > >> >> Here is output from a few runs with different number of elements: > >> >> > >> >> > >> >> pypy mem.py 10000000 > >> >> start build > >> >> end build 84.142424 > >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > >> >> 90th_quantile_runtime: 0.421558 > >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > >> >> 90th_quantile_runtime: 0.423441 > >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > >> >> 90th_quantile_runtime: 0.419817 > >> >> > >> >> pypy mem.py 20000000 > >> >> start build > >> >> end build 180.823105 > >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > >> >> 90th_quantile_runtime: 0.434726 > >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > >> >> 90th_quantile_runtime: 0.374190 > >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > >> >> 90th_quantile_runtime: 0.371502 > >> >> > >> >> pypy mem.py 30000000 > >> >> start build > >> >> end build 276.217811 > >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > >> >> 90th_quantile_runtime: 0.459891 > >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 > >> >> 90th_quantile_runtime: 0.393654 > >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > >> >> 90th_quantile_runtime: 0.393677 > >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > >> >> 90th_quantile_runtime: 0.393654 > >> >> > >> >> Code below > >> >> -------------------------------------------------------------- > >> >> import time > >> >> from random import randint, choice > >> >> import sys > >> >> > >> >> > >> >> allElems = {} > >> 
>> > >> >> class Node: > >> >> def __init__(self, v_): > >> >> self.v = v_ > >> >> self.next = None > >> >> self.dummy_data = [randint(0,100) > >> >> for _ in xrange(randint(50,100))] > >> >> allElems[self.v] = self > >> >> if self.v > 0: > >> >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ > in > >> >> xrange(10)] > >> >> else: > >> >> self.dummy_links = [self] > >> >> > >> >> def set_next(self, l): > >> >> self.next = l > >> >> > >> >> > >> >> def follow(node): > >> >> acc = [] > >> >> count = 0 > >> >> cur = node > >> >> assert node.v is not None > >> >> assert cur is not None > >> >> while count < 50000: > >> >> # return a value; generate some garbage > >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for > x > >> >> in > >> >> xrange(100)])) > >> >> > >> >> # if we have reached the end, chose a random link > >> >> cur = choice(cur.dummy_links) if cur.next is None else > cur.next > >> >> count += 1 > >> >> > >> >> return acc > >> >> > >> >> > >> >> def build(num_elems): > >> >> start = time.time() > >> >> print "start build" > >> >> root = Node(0) > >> >> cur = root > >> >> for x in xrange(1, num_elems): > >> >> e = Node(x) > >> >> cur.next = e > >> >> cur = e > >> >> print "end build %f" % (time.time() - start) > >> >> return root > >> >> > >> >> > >> >> num_timings = 100 > >> >> if __name__ == "__main__": > >> >> num_elems = int(sys.argv[1]) > >> >> build(num_elems) > >> >> total = 0 > >> >> timings = [0.0] * num_timings # run times for the last > num_timings > >> >> runs > >> >> i = 0 > >> >> beginning = time.time() > >> >> while time.time() - beginning < 600: > >> >> start = time.time() > >> >> elem = allElems[randint(0, num_elems - 1)] > >> >> assert(elem is not None) > >> >> > >> >> lst = follow(elem) > >> >> > >> >> total += choice(lst)[0] # use the return value for something > >> >> > >> >> end = time.time() > >> >> > >> >> elapsed = end-start > >> >> timings[i % num_timings] = elapsed > >> >> if (i > num_timings): > >> >> 
slow_time = 2 * sum(timings)/num_timings # slow defined > as > >> >> > > >> >> 2*avg run time > >> >> if (elapsed > slow_time): > >> >> print "that took a long time elapsed: %f > >> >> slow_threshold: > >> >> %f 90th_quantile_runtime: %f" % \ > >> >> (elapsed, slow_time, > >> >> sorted(timings)[int(num_timings*.9)]) > >> >> i += 1 > >> >> print total > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski < > fijall at gmail.com> > >> >> wrote: > >> >>> > >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: > >> >>> > Hi Armin, Maciej > >> >>> > > >> >>> > Thanks for responding. > >> >>> > > >> >>> > I'm in the process of trying to determine what (if any) of the > code > >> >>> > I'm > >> >>> > in a > >> >>> > position to share, and I'll get back to you. > >> >>> > > >> >>> > Allowing hinting to the GC would be good. Even better would be a > >> >>> > means > >> >>> > to > >> >>> > allow me to (transparently) allocate objects in unmanaged memory, > >> >>> > but I > >> >>> > would expect that to be a tall order :) > >> >>> > > >> >>> > Thanks, > >> >>> > /Martin > >> >>> > >> >>> Hi Martin. > >> >>> > >> >>> Note that in case you want us to do the work of isolating the > problem, > >> >>> we do offer paid support to do that (then we can sign NDAs and > stuff). > >> >>> Otherwise we would be more than happy to fix bugs once you isolate a > >> >>> part you can share freely :) > >> >> > >> >> > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 14:20:57 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 15:20:57 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: are you *sure* it's the walkroots that take that long and not something else (like gc-minor)? More of those mean that you allocate a lot more surviving objects. 
Can you do two things: a) take a max of gc-minor (and gc-minor-stackwalk), per request b) take the sum of those and plot them On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: > Well, then it works out to around 2.5GHz, which seems reasonable. But it > doesn't alter the conclusion from the previous email: The slow queries then > all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or > .4 seconds at this conversion. Also, the log shows that a slow query > performs many more gc-minor operations than a 'normal' one: 9600 > gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. > > So the question becomes: Why do we get this large spike in > gc-minor-walkroots, and, in particular, is there any way to avoid it :) ? > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski > wrote: >> >> I think it's the cycles of your CPU >> >> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: >> > What is the unit? Perhaps I'm being thick here, but I can't correlate it >> > with seconds (which the program does print out). Slow runs are around 13 >> > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. >> > from >> > 0x2b994c9d31889c to 0x2b9944ab8c4f49). >> > >> > >> > >> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> > wrote: >> >> >> >> The number of lines is nonsense. This is a timestamp in hex. >> >> >> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: >> >> > Based On Maciej's suggestion, I tried the following >> >> > >> >> > PYPYLOG=- pypy mem.py 10000000 > out >> >> > >> >> > This generates a logfile which looks something like this >> >> > >> >> > start--> >> >> > [2b99f1981b527e] {gc-minor >> >> > [2b99f1981ba680] {gc-minor-walkroots >> >> > [2b99f1981c2e02] gc-minor-walkroots} >> >> > [2b99f19890d750] gc-minor} >> >> > [snip] >> >> > ... >> >> > <--stop >> >> > >> >> > >> >> > It turns out that the culprit is a lot of MINOR collections. 
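[Archive note: a sketch of the per-request aggregation Maciej asks for above, added for illustration. The line format, the start-->/<--stop request markers, and the section nesting are all assumed from the PYPYLOG sample quoted in this thread, and the 2.5 GHz tick rate is the thread's guess.]

```python
import re

# Matches both "[2b99f1981b527e] {gc-minor" (open) and
# "[2b99f19890d750] gc-minor}" (close) lines from a PYPYLOG dump.
LINE = re.compile(r"\[([0-9a-f]+)\] (\{)?([\w-]+)(\})?")
CPU_HZ = 2.5e9  # assumed timestamp-counter frequency

def per_request_stats(log_lines, hz=CPU_HZ):
    """Return a list of (sum, max) of gc-minor durations in seconds for
    each start-->/<--stop delimited request in a PYPYLOG dump."""
    stats = []
    total = biggest = 0.0
    opened = {}  # section name -> opening timestamp (ticks)
    for line in log_lines:
        line = line.strip()
        if line == "start-->":
            total = biggest = 0.0
            continue
        if line == "<--stop":
            stats.append((total, biggest))
            continue
        m = LINE.match(line)
        if not m:
            continue  # skip "[snip]", "..." and other non-log lines
        ts = int(m.group(1), 16)
        name = m.group(3)
        if m.group(2):                        # "{name" opens a section
            opened[name] = ts
        elif m.group(4) and name in opened:   # "name}" closes it
            start = opened.pop(name)
            if name == "gc-minor":
                dt = (ts - start) / hz
                total += dt
                biggest = max(biggest, dt)
    return stats
```

Plotting the returned sums against request index (with any plotting tool) would then show whether the 13-second outliers coincide with large accumulated gc-minor time, which is what Maciej wants to distinguish from gc-minor-walkroots cost.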
>> >> > >> >> > I base this on the following observations: >> >> > >> >> > I can't understand the format of the timestamp on each logline (the >> >> > "[2b99f1981b527e]"). From what I can see in the code, this should be >> >> > output >> >> > from time.clock(), but that doesn't return a number like that when I >> >> > run >> >> > pypy interactively >> >> > Instead, I count the number of debug lines between start--> and the >> >> > corresponding <--stop. >> >> > Most runs have a few hundred lines of output between start/stop >> >> > All slow runs have very close to 57800 lines out output between >> >> > start/stop >> >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor >> >> > operations, and 9647 gc-minor-walkroots operations. >> >> > >> >> > >> >> > Thanks, >> >> > /Martin >> >> > >> >> > >> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> >> > >> >> > wrote: >> >> >> >> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >> >> >> which will do that for you btw. >> >> >> >> >> >> maybe you can find out what's that using profiling or valgrind? >> >> >> >> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >> >> >> > I have tried getting the pypy source and building my own version >> >> >> > of >> >> >> > pypy. I >> >> >> > have modified >> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() >> >> >> > to >> >> >> > print out when it starts and when it stops. Apparently, the slow >> >> >> > queries >> >> >> > do >> >> >> > NOT occur during major_collection_step; at least, I have not >> >> >> > observed >> >> >> > major >> >> >> > step output during a query execution. So, apparently, something >> >> >> > else >> >> >> > is >> >> >> > blocking. This could be another aspect of the GC, but it could >> >> >> > also >> >> >> > be >> >> >> > anything else. 
>> >> >> > >> >> >> > Just to be sure, I have tried running the same application in >> >> >> > python >> >> >> > with >> >> >> > garbage collection disabled. I don't see the problem there, so it >> >> >> > is >> >> >> > somehow >> >> >> > related to either GC or the runtime somehow. >> >> >> > >> >> >> > Cheers, >> >> >> > /Martin >> >> >> > >> >> >> > >> >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >> >> >> > wrote: >> >> >> >> >> >> >> >> We have hacked up a small sample that seems to exhibit the same >> >> >> >> issue. >> >> >> >> >> >> >> >> We basically generate a linked list of objects. To increase >> >> >> >> connectedness, >> >> >> >> elements in the list hold references (dummy_links) to 10 randomly >> >> >> >> chosen >> >> >> >> previous elements in the list. >> >> >> >> >> >> >> >> We then time a function that traverses 50000 elements from the >> >> >> >> list >> >> >> >> from a >> >> >> >> random start point. If the traversal reaches the end of the list, >> >> >> >> we >> >> >> >> instead >> >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are >> >> >> >> traversed >> >> >> >> every time. To generate some garbage, we build a list holding the >> >> >> >> traversed >> >> >> >> elements and a dummy list of characters. >> >> >> >> >> >> >> >> Timings for the last 100 runs are stored in a circular buffer. If >> >> >> >> the >> >> >> >> elapsed time for the last run is more than twice the average >> >> >> >> time, >> >> >> >> we >> >> >> >> print >> >> >> >> out a line with the elapsed time, the threshold, and the 90% >> >> >> >> runtime >> >> >> >> (we >> >> >> >> would like to see that the mean runtime does not increase with >> >> >> >> the >> >> >> >> number of >> >> >> >> elements in the list, but that the max time does increase >> >> >> >> (linearly >> >> >> >> with the >> >> >> >> number of object, i guess); traversing 50K elements should be >> >> >> >> independent of >> >> >> >> the memory size). 
>> >> >> >> >> >> >> >> We have tried monitoring memory consumption by external >> >> >> >> inspection, >> >> >> >> but >> >> >> >> cannot consistently verify that memory is deallocated at the same >> >> >> >> time >> >> >> >> that >> >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always >> >> >> >> return >> >> >> >> freed >> >> >> >> pages back to the OS? >> >> >> >> >> >> >> >> Using top, we observe that 10M elements allocates around 17GB >> >> >> >> after >> >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >> >> >> >> shortly >> >> >> >> after building). >> >> >> >> >> >> >> >> Here is output from a few runs with different number of elements: >> >> >> >> >> >> >> >> >> >> >> >> pypy mem.py 10000000 >> >> >> >> start build >> >> >> >> end build 84.142424 >> >> >> >> that took a long time elapsed: 13.230586 slow_threshold: >> >> >> >> 1.495401 >> >> >> >> 90th_quantile_runtime: 0.421558 >> >> >> >> that took a long time elapsed: 13.016531 slow_threshold: >> >> >> >> 1.488160 >> >> >> >> 90th_quantile_runtime: 0.423441 >> >> >> >> that took a long time elapsed: 13.032537 slow_threshold: >> >> >> >> 1.474563 >> >> >> >> 90th_quantile_runtime: 0.419817 >> >> >> >> >> >> >> >> pypy mem.py 20000000 >> >> >> >> start build >> >> >> >> end build 180.823105 >> >> >> >> that took a long time elapsed: 27.346064 slow_threshold: >> >> >> >> 2.295146 >> >> >> >> 90th_quantile_runtime: 0.434726 >> >> >> >> that took a long time elapsed: 26.028852 slow_threshold: >> >> >> >> 2.283927 >> >> >> >> 90th_quantile_runtime: 0.374190 >> >> >> >> that took a long time elapsed: 25.432279 slow_threshold: >> >> >> >> 2.279631 >> >> >> >> 90th_quantile_runtime: 0.371502 >> >> >> >> >> >> >> >> pypy mem.py 30000000 >> >> >> >> start build >> >> >> >> end build 276.217811 >> >> >> >> that took a long time elapsed: 40.993855 slow_threshold: >> >> >> >> 3.188464 >> >> >> >> 90th_quantile_runtime: 0.459891 >> >> >> >> that took a long time elapsed: 
41.693553 slow_threshold: >> >> >> >> 3.183003 >> >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> that took a long time elapsed: 39.679769 slow_threshold: >> >> >> >> 3.190782 >> >> >> >> 90th_quantile_runtime: 0.393677 >> >> >> >> that took a long time elapsed: 43.573411 slow_threshold: >> >> >> >> 3.239637 >> >> >> >> 90th_quantile_runtime: 0.393654 >> >> >> >> >> >> >> >> Code below >> >> >> >> -------------------------------------------------------------- >> >> >> >> import time >> >> >> >> from random import randint, choice >> >> >> >> import sys >> >> >> >> >> >> >> >> >> >> >> >> allElems = {} >> >> >> >> >> >> >> >> class Node: >> >> >> >> def __init__(self, v_): >> >> >> >> self.v = v_ >> >> >> >> self.next = None >> >> >> >> self.dummy_data = [randint(0,100) >> >> >> >> for _ in xrange(randint(50,100))] >> >> >> >> allElems[self.v] = self >> >> >> >> if self.v > 0: >> >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] >> >> >> >> for _ >> >> >> >> in >> >> >> >> xrange(10)] >> >> >> >> else: >> >> >> >> self.dummy_links = [self] >> >> >> >> >> >> >> >> def set_next(self, l): >> >> >> >> self.next = l >> >> >> >> >> >> >> >> >> >> >> >> def follow(node): >> >> >> >> acc = [] >> >> >> >> count = 0 >> >> >> >> cur = node >> >> >> >> assert node.v is not None >> >> >> >> assert cur is not None >> >> >> >> while count < 50000: >> >> >> >> # return a value; generate some garbage >> >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") >> >> >> >> for >> >> >> >> x >> >> >> >> in >> >> >> >> xrange(100)])) >> >> >> >> >> >> >> >> # if we have reached the end, chose a random link >> >> >> >> cur = choice(cur.dummy_links) if cur.next is None else >> >> >> >> cur.next >> >> >> >> count += 1 >> >> >> >> >> >> >> >> return acc >> >> >> >> >> >> >> >> >> >> >> >> def build(num_elems): >> >> >> >> start = time.time() >> >> >> >> print "start build" >> >> >> >> root = Node(0) >> >> >> >> cur = root >> >> >> >> for x in xrange(1, num_elems): >> 
>> >> >> e = Node(x) >> >> >> >> cur.next = e >> >> >> >> cur = e >> >> >> >> print "end build %f" % (time.time() - start) >> >> >> >> return root >> >> >> >> >> >> >> >> >> >> >> >> num_timings = 100 >> >> >> >> if __name__ == "__main__": >> >> >> >> num_elems = int(sys.argv[1]) >> >> >> >> build(num_elems) >> >> >> >> total = 0 >> >> >> >> timings = [0.0] * num_timings # run times for the last >> >> >> >> num_timings >> >> >> >> runs >> >> >> >> i = 0 >> >> >> >> beginning = time.time() >> >> >> >> while time.time() - beginning < 600: >> >> >> >> start = time.time() >> >> >> >> elem = allElems[randint(0, num_elems - 1)] >> >> >> >> assert(elem is not None) >> >> >> >> >> >> >> >> lst = follow(elem) >> >> >> >> >> >> >> >> total += choice(lst)[0] # use the return value for >> >> >> >> something >> >> >> >> >> >> >> >> end = time.time() >> >> >> >> >> >> >> >> elapsed = end-start >> >> >> >> timings[i % num_timings] = elapsed >> >> >> >> if (i > num_timings): >> >> >> >> slow_time = 2 * sum(timings)/num_timings # slow >> >> >> >> defined >> >> >> >> as >> >> >> >> > >> >> >> >> 2*avg run time >> >> >> >> if (elapsed > slow_time): >> >> >> >> print "that took a long time elapsed: %f >> >> >> >> slow_threshold: >> >> >> >> %f 90th_quantile_runtime: %f" % \ >> >> >> >> (elapsed, slow_time, >> >> >> >> sorted(timings)[int(num_timings*.9)]) >> >> >> >> i += 1 >> >> >> >> print total >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >> >> >> >> >> >> >> wrote: >> >> >> >>> >> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >> >> >> >>> wrote: >> >> >> >>> > Hi Armin, Maciej >> >> >> >>> > >> >> >> >>> > Thanks for responding. >> >> >> >>> > >> >> >> >>> > I'm in the process of trying to determine what (if any) of the >> >> >> >>> > code >> >> >> >>> > I'm >> >> >> >>> > in a >> >> >> >>> > position to share, and I'll get back to you. 
>> >> >> >>> > >> >> >> >>> > Allowing hinting to the GC would be good. Even better would be >> >> >> >>> > a >> >> >> >>> > means >> >> >> >>> > to >> >> >> >>> > allow me to (transparently) allocate objects in unmanaged >> >> >> >>> > memory, >> >> >> >>> > but I >> >> >> >>> > would expect that to be a tall order :) >> >> >> >>> > >> >> >> >>> > Thanks, >> >> >> >>> > /Martin >> >> >> >>> >> >> >> >>> Hi Martin. >> >> >> >>> >> >> >> >>> Note that in case you want us to do the work of isolating the >> >> >> >>> problem, >> >> >> >>> we do offer paid support to do that (then we can sign NDAs and >> >> >> >>> stuff). >> >> >> >>> Otherwise we would be more than happy to fix bugs once you >> >> >> >>> isolate >> >> >> >>> a >> >> >> >>> part you can share freely :) >> >> >> >> >> >> >> >> >> >> >> > >> >> > >> >> > >> > >> > > > From mak at issuu.com Mon Mar 17 14:18:07 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 14:18:07 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: Well, then it works out to around 2.5GHz, which seems reasonable. But it doesn't alter the conclusion from the previous email: The slow queries then all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or .4 seconds at this conversion. Also, the log shows that a slow query performs many more gc-minor operations than a 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. So the question becomes: Why do we get this large spike in gc-minor-walkroots, and, in particular, is there any way to avoid it :) ? Thanks, /Martin On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski wrote: > I think it's the cycles of your CPU > > On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: > > What is the unit? Perhaps I'm being thick here, but I can't correlate it > > with seconds (which the program does print out). Slow runs are around 13 > > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. 
> from > > 0x2b994c9d31889c to 0x2b9944ab8c4f49). > > > > > > > > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski > > wrote: > >> > >> The number of lines is nonsense. This is a timestamp in hex. > >> > >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: > >> > Based On Maciej's suggestion, I tried the following > >> > > >> > PYPYLOG=- pypy mem.py 10000000 > out > >> > > >> > This generates a logfile which looks something like this > >> > > >> > start--> > >> > [2b99f1981b527e] {gc-minor > >> > [2b99f1981ba680] {gc-minor-walkroots > >> > [2b99f1981c2e02] gc-minor-walkroots} > >> > [2b99f19890d750] gc-minor} > >> > [snip] > >> > ... > >> > <--stop > >> > > >> > > >> > It turns out that the culprit is a lot of MINOR collections. > >> > > >> > I base this on the following observations: > >> > > >> > I can't understand the format of the timestamp on each logline (the > >> > "[2b99f1981b527e]"). From what I can see in the code, this should be > >> > output > >> > from time.clock(), but that doesn't return a number like that when I > run > >> > pypy interactively > >> > Instead, I count the number of debug lines between start--> and the > >> > corresponding <--stop. > >> > Most runs have a few hundred lines of output between start/stop > >> > All slow runs have very close to 57800 lines out output between > >> > start/stop > >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor > >> > operations, and 9647 gc-minor-walkroots operations. > >> > > >> > > >> > Thanks, > >> > /Martin > >> > > >> > > >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > > >> > wrote: > >> >> > >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) > >> >> which will do that for you btw. > >> >> > >> >> maybe you can find out what's that using profiling or valgrind? > >> >> > >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > >> >> > I have tried getting the pypy source and building my own version of > >> >> > pypy. 
I > >> >> > have modified > >> >> > rpython/memory/gc/incminimark.py:major_collection_step() > >> >> > to > >> >> > print out when it starts and when it stops. Apparently, the slow > >> >> > queries > >> >> > do > >> >> > NOT occur during major_collection_step; at least, I have not > observed > >> >> > major > >> >> > step output during a query execution. So, apparently, something > else > >> >> > is > >> >> > blocking. This could be another aspect of the GC, but it could also > >> >> > be > >> >> > anything else. > >> >> > > >> >> > Just to be sure, I have tried running the same application in > python > >> >> > with > >> >> > garbage collection disabled. I don't see the problem there, so it > is > >> >> > somehow > >> >> > related to either GC or the runtime somehow. > >> >> > > >> >> > Cheers, > >> >> > /Martin > >> >> > > >> >> > > >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch > wrote: > >> >> >> > >> >> >> We have hacked up a small sample that seems to exhibit the same > >> >> >> issue. > >> >> >> > >> >> >> We basically generate a linked list of objects. To increase > >> >> >> connectedness, > >> >> >> elements in the list hold references (dummy_links) to 10 randomly > >> >> >> chosen > >> >> >> previous elements in the list. > >> >> >> > >> >> >> We then time a function that traverses 50000 elements from the > list > >> >> >> from a > >> >> >> random start point. If the traversal reaches the end of the list, > we > >> >> >> instead > >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are > >> >> >> traversed > >> >> >> every time. To generate some garbage, we build a list holding the > >> >> >> traversed > >> >> >> elements and a dummy list of characters. > >> >> >> > >> >> >> Timings for the last 100 runs are stored in a circular buffer. 
If > >> >> >> the > >> >> >> elapsed time for the last run is more than twice the average time, > >> >> >> we > >> >> >> print > >> >> >> out a line with the elapsed time, the threshold, and the 90% > runtime > >> >> >> (we > >> >> >> would like to see that the mean runtime does not increase with the > >> >> >> number of > >> >> >> elements in the list, but that the max time does increase > (linearly > >> >> >> with the > >> >> >> number of object, i guess); traversing 50K elements should be > >> >> >> independent of > >> >> >> the memory size). > >> >> >> > >> >> >> We have tried monitoring memory consumption by external > inspection, > >> >> >> but > >> >> >> cannot consistently verify that memory is deallocated at the same > >> >> >> time > >> >> >> that > >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always > return > >> >> >> freed > >> >> >> pages back to the OS? > >> >> >> > >> >> >> Using top, we observe that 10M elements allocates around 17GB > after > >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB > >> >> >> shortly > >> >> >> after building). 
> >> >> >> > >> >> >> Here is output from a few runs with different number of elements: > >> >> >> > >> >> >> > >> >> >> pypy mem.py 10000000 > >> >> >> start build > >> >> >> end build 84.142424 > >> >> >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > >> >> >> 90th_quantile_runtime: 0.421558 > >> >> >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > >> >> >> 90th_quantile_runtime: 0.423441 > >> >> >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > >> >> >> 90th_quantile_runtime: 0.419817 > >> >> >> > >> >> >> pypy mem.py 20000000 > >> >> >> start build > >> >> >> end build 180.823105 > >> >> >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > >> >> >> 90th_quantile_runtime: 0.434726 > >> >> >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > >> >> >> 90th_quantile_runtime: 0.374190 > >> >> >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > >> >> >> 90th_quantile_runtime: 0.371502 > >> >> >> > >> >> >> pypy mem.py 30000000 > >> >> >> start build > >> >> >> end build 276.217811 > >> >> >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > >> >> >> 90th_quantile_runtime: 0.459891 > >> >> >> that took a long time elapsed: 41.693553 slow_threshold: 3.183003 > >> >> >> 90th_quantile_runtime: 0.393654 > >> >> >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > >> >> >> 90th_quantile_runtime: 0.393677 > >> >> >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > >> >> >> 90th_quantile_runtime: 0.393654 > >> >> >> > >> >> >> Code below > >> >> >> -------------------------------------------------------------- > >> >> >> import time > >> >> >> from random import randint, choice > >> >> >> import sys > >> >> >> > >> >> >> > >> >> >> allElems = {} > >> >> >> > >> >> >> class Node: > >> >> >> def __init__(self, v_): > >> >> >> self.v = v_ > >> >> >> self.next = None > >> >> >> self.dummy_data = 
[randint(0,100) > >> >> >> for _ in xrange(randint(50,100))] > >> >> >> allElems[self.v] = self > >> >> >> if self.v > 0: > >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] > for _ > >> >> >> in > >> >> >> xrange(10)] > >> >> >> else: > >> >> >> self.dummy_links = [self] > >> >> >> > >> >> >> def set_next(self, l): > >> >> >> self.next = l > >> >> >> > >> >> >> > >> >> >> def follow(node): > >> >> >> acc = [] > >> >> >> count = 0 > >> >> >> cur = node > >> >> >> assert node.v is not None > >> >> >> assert cur is not None > >> >> >> while count < 50000: > >> >> >> # return a value; generate some garbage > >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") > for > >> >> >> x > >> >> >> in > >> >> >> xrange(100)])) > >> >> >> > >> >> >> # if we have reached the end, chose a random link > >> >> >> cur = choice(cur.dummy_links) if cur.next is None else > >> >> >> cur.next > >> >> >> count += 1 > >> >> >> > >> >> >> return acc > >> >> >> > >> >> >> > >> >> >> def build(num_elems): > >> >> >> start = time.time() > >> >> >> print "start build" > >> >> >> root = Node(0) > >> >> >> cur = root > >> >> >> for x in xrange(1, num_elems): > >> >> >> e = Node(x) > >> >> >> cur.next = e > >> >> >> cur = e > >> >> >> print "end build %f" % (time.time() - start) > >> >> >> return root > >> >> >> > >> >> >> > >> >> >> num_timings = 100 > >> >> >> if __name__ == "__main__": > >> >> >> num_elems = int(sys.argv[1]) > >> >> >> build(num_elems) > >> >> >> total = 0 > >> >> >> timings = [0.0] * num_timings # run times for the last > >> >> >> num_timings > >> >> >> runs > >> >> >> i = 0 > >> >> >> beginning = time.time() > >> >> >> while time.time() - beginning < 600: > >> >> >> start = time.time() > >> >> >> elem = allElems[randint(0, num_elems - 1)] > >> >> >> assert(elem is not None) > >> >> >> > >> >> >> lst = follow(elem) > >> >> >> > >> >> >> total += choice(lst)[0] # use the return value for > something > >> >> >> > >> >> >> end = time.time() > >> >> >> > >> 
>> >> elapsed = end-start > >> >> >> timings[i % num_timings] = elapsed > >> >> >> if (i > num_timings): > >> >> >> slow_time = 2 * sum(timings)/num_timings # slow > defined > >> >> >> as > >> >> >> > > >> >> >> 2*avg run time > >> >> >> if (elapsed > slow_time): > >> >> >> print "that took a long time elapsed: %f > >> >> >> slow_threshold: > >> >> >> %f 90th_quantile_runtime: %f" % \ > >> >> >> (elapsed, slow_time, > >> >> >> sorted(timings)[int(num_timings*.9)]) > >> >> >> i += 1 > >> >> >> print total > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski > >> >> >> > >> >> >> wrote: > >> >> >>> > >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch > wrote: > >> >> >>> > Hi Armin, Maciej > >> >> >>> > > >> >> >>> > Thanks for responding. > >> >> >>> > > >> >> >>> > I'm in the process of trying to determine what (if any) of the > >> >> >>> > code > >> >> >>> > I'm > >> >> >>> > in a > >> >> >>> > position to share, and I'll get back to you. > >> >> >>> > > >> >> >>> > Allowing hinting to the GC would be good. Even better would be > a > >> >> >>> > means > >> >> >>> > to > >> >> >>> > allow me to (transparently) allocate objects in unmanaged > memory, > >> >> >>> > but I > >> >> >>> > would expect that to be a tall order :) > >> >> >>> > > >> >> >>> > Thanks, > >> >> >>> > /Martin > >> >> >>> > >> >> >>> Hi Martin. > >> >> >>> > >> >> >>> Note that in case you want us to do the work of isolating the > >> >> >>> problem, > >> >> >>> we do offer paid support to do that (then we can sign NDAs and > >> >> >>> stuff). > >> >> >>> Otherwise we would be more than happy to fix bugs once you > isolate > >> >> >>> a > >> >> >>> part you can share freely :) > >> >> >> > >> >> >> > >> >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
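The bracketed values in the PYPYLOG output discussed above are raw CPU timestamps in hex, so they can be turned into wall-clock durations once a tick frequency is assumed. A minimal sketch (the 2.5 GHz figure is the estimate from the thread, not something PyPy reports; adjust for your CPU):

```python
import re

# PYPYLOG lines look like "[2b99f1981b527e] {gc-minor" (section open)
# and "[2b99f19890d750] gc-minor}" (section close).  The bracketed
# value is a raw CPU timestamp in hex.
LINE_RE = re.compile(r"\[([0-9a-f]+)\] (\{)?([\w-]+)")

def section_durations(log_lines, ticks_per_second=2.5e9):
    """Return {section: total_seconds} aggregated over a log fragment."""
    open_stamps = {}   # section name -> timestamp at which it opened
    totals = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        stamp = int(m.group(1), 16)
        is_open, name = m.group(2), m.group(3)
        if is_open:
            open_stamps[name] = stamp
        elif name in open_stamps:
            ticks = stamp - open_stamps.pop(name)
            totals[name] = totals.get(name, 0.0) + ticks / ticks_per_second
    return totals
```

Nested sections such as gc-minor-walkroots inside gc-minor work because the open timestamps are keyed by section name.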
URL: From fijall at gmail.com Mon Mar 17 14:23:57 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 15:23:57 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski wrote: > are you *sure* it's the walkroots that take that long and not > something else (like gc-minor)? More of those mean that you allocate a > lot more surviving objects. Can you do two things: > > a) take a max of gc-minor (and gc-minor-stackwalk), per request > b) take the sum of those > > and plot them ^^^ or just paste the results actually > > On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >> Well, then it works out to around 2.5GHz, which seems reasonable. But it >> doesn't alter the conclusion from the previous email: The slow queries then >> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or >> .4 seconds at this conversion. Also, the log shows that a slow query >> performs many more gc-minor operations than a 'normal' one: 9600 >> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >> >> So the question becomes: Why do we get this large spike in >> gc-minor-walkroots, and, in particular, is there any way to avoid it :) ? >> >> Thanks, >> /Martin >> >> >> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >> wrote: >>> >>> I think it's the cycles of your CPU >>> >>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: >>> > What is the unit? Perhaps I'm being thick here, but I can't correlate it >>> > with seconds (which the program does print out). Slow runs are around 13 >>> > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. >>> > from >>> > 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>> > >>> > >>> > >>> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>> > wrote: >>> >> >>> >> The number of lines is nonsense. This is a timestamp in hex. 
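A small script along these lines can produce the per-request max and overall sum of gc-minor (and gc-minor-walkroots) events that Maciej asks for in (a) and (b). It is only a sketch, and it assumes the log format shown earlier in the thread, with requests delimited by start--> and <--stop markers:

```python
def collate_requests(log_lines):
    """Count how many times each debug section (gc-minor,
    gc-minor-walkroots, ...) opens inside every start-->/<--stop pair,
    then report the per-request max and the overall sum."""
    requests = []      # one {section: count} dict per request
    current = None
    for line in log_lines:
        if "start-->" in line:
            current = {}
        elif "<--stop" in line:
            if current is not None:
                requests.append(current)
            current = None
        elif current is not None and "{" in line:
            # "[2b99f1981b527e] {gc-minor" -> "gc-minor"
            name = line.split("{", 1)[1].strip()
            current[name] = current.get(name, 0) + 1
    sections = set()
    for r in requests:
        sections.update(r)
    return {s: {"max": max(r.get(s, 0) for r in requests),
                "sum": sum(r.get(s, 0) for r in requests)}
            for s in sections}
```

Feeding it the full PYPYLOG output and printing the result would give exactly the two numbers requested, one line per section.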
>>> >> >>> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote: >>> >> > Based On Maciej's suggestion, I tried the following >>> >> > >>> >> > PYPYLOG=- pypy mem.py 10000000 > out >>> >> > >>> >> > This generates a logfile which looks something like this >>> >> > >>> >> > start--> >>> >> > [2b99f1981b527e] {gc-minor >>> >> > [2b99f1981ba680] {gc-minor-walkroots >>> >> > [2b99f1981c2e02] gc-minor-walkroots} >>> >> > [2b99f19890d750] gc-minor} >>> >> > [snip] >>> >> > ... >>> >> > <--stop >>> >> > >>> >> > >>> >> > It turns out that the culprit is a lot of MINOR collections. >>> >> > >>> >> > I base this on the following observations: >>> >> > >>> >> > I can't understand the format of the timestamp on each logline (the >>> >> > "[2b99f1981b527e]"). From what I can see in the code, this should be >>> >> > output >>> >> > from time.clock(), but that doesn't return a number like that when I >>> >> > run >>> >> > pypy interactively >>> >> > Instead, I count the number of debug lines between start--> and the >>> >> > corresponding <--stop. >>> >> > Most runs have a few hundred lines of output between start/stop >>> >> > All slow runs have very close to 57800 lines out output between >>> >> > start/stop >>> >> > One such sample does 9609 gc-collect-step operations, 9647 gc-minor >>> >> > operations, and 9647 gc-minor-walkroots operations. >>> >> > >>> >> > >>> >> > Thanks, >>> >> > /Martin >>> >> > >>> >> > >>> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>> >> > >>> >> > wrote: >>> >> >> >>> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) >>> >> >> which will do that for you btw. >>> >> >> >>> >> >> maybe you can find out what's that using profiling or valgrind? >>> >> >> >>> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: >>> >> >> > I have tried getting the pypy source and building my own version >>> >> >> > of >>> >> >> > pypy. 
I >>> >> >> > have modified >>> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() >>> >> >> > to >>> >> >> > print out when it starts and when it stops. Apparently, the slow >>> >> >> > queries >>> >> >> > do >>> >> >> > NOT occur during major_collection_step; at least, I have not >>> >> >> > observed >>> >> >> > major >>> >> >> > step output during a query execution. So, apparently, something >>> >> >> > else >>> >> >> > is >>> >> >> > blocking. This could be another aspect of the GC, but it could >>> >> >> > also >>> >> >> > be >>> >> >> > anything else. >>> >> >> > >>> >> >> > Just to be sure, I have tried running the same application in >>> >> >> > python >>> >> >> > with >>> >> >> > garbage collection disabled. I don't see the problem there, so it >>> >> >> > is >>> >> >> > somehow >>> >> >> > related to either GC or the runtime somehow. >>> >> >> > >>> >> >> > Cheers, >>> >> >> > /Martin >>> >> >> > >>> >> >> > >>> >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>> >> >> > wrote: >>> >> >> >> >>> >> >> >> We have hacked up a small sample that seems to exhibit the same >>> >> >> >> issue. >>> >> >> >> >>> >> >> >> We basically generate a linked list of objects. To increase >>> >> >> >> connectedness, >>> >> >> >> elements in the list hold references (dummy_links) to 10 randomly >>> >> >> >> chosen >>> >> >> >> previous elements in the list. >>> >> >> >> >>> >> >> >> We then time a function that traverses 50000 elements from the >>> >> >> >> list >>> >> >> >> from a >>> >> >> >> random start point. If the traversal reaches the end of the list, >>> >> >> >> we >>> >> >> >> instead >>> >> >> >> traverse one of the dummy links. Thus, exactly 50K elements are >>> >> >> >> traversed >>> >> >> >> every time. To generate some garbage, we build a list holding the >>> >> >> >> traversed >>> >> >> >> elements and a dummy list of characters. >>> >> >> >> >>> >> >> >> Timings for the last 100 runs are stored in a circular buffer. 
If >>> >> >> >> the >>> >> >> >> elapsed time for the last run is more than twice the average >>> >> >> >> time, >>> >> >> >> we >>> >> >> >> print >>> >> >> >> out a line with the elapsed time, the threshold, and the 90% >>> >> >> >> runtime >>> >> >> >> (we >>> >> >> >> would like to see that the mean runtime does not increase with >>> >> >> >> the >>> >> >> >> number of >>> >> >> >> elements in the list, but that the max time does increase >>> >> >> >> (linearly >>> >> >> >> with the >>> >> >> >> number of object, i guess); traversing 50K elements should be >>> >> >> >> independent of >>> >> >> >> the memory size). >>> >> >> >> >>> >> >> >> We have tried monitoring memory consumption by external >>> >> >> >> inspection, >>> >> >> >> but >>> >> >> >> cannot consistently verify that memory is deallocated at the same >>> >> >> >> time >>> >> >> >> that >>> >> >> >> we see slow requests. Perhaps the pypy runtime doesn't always >>> >> >> >> return >>> >> >> >> freed >>> >> >> >> pages back to the OS? >>> >> >> >> >>> >> >> >> Using top, we observe that 10M elements allocates around 17GB >>> >> >> >> after >>> >> >> >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB >>> >> >> >> shortly >>> >> >> >> after building). 
>>> >> >> >> >>> >> >> >> Here is output from a few runs with different number of elements: >>> >> >> >> >>> >> >> >> >>> >> >> >> pypy mem.py 10000000 >>> >> >> >> start build >>> >> >> >> end build 84.142424 >>> >> >> >> that took a long time elapsed: 13.230586 slow_threshold: >>> >> >> >> 1.495401 >>> >> >> >> 90th_quantile_runtime: 0.421558 >>> >> >> >> that took a long time elapsed: 13.016531 slow_threshold: >>> >> >> >> 1.488160 >>> >> >> >> 90th_quantile_runtime: 0.423441 >>> >> >> >> that took a long time elapsed: 13.032537 slow_threshold: >>> >> >> >> 1.474563 >>> >> >> >> 90th_quantile_runtime: 0.419817 >>> >> >> >> >>> >> >> >> pypy mem.py 20000000 >>> >> >> >> start build >>> >> >> >> end build 180.823105 >>> >> >> >> that took a long time elapsed: 27.346064 slow_threshold: >>> >> >> >> 2.295146 >>> >> >> >> 90th_quantile_runtime: 0.434726 >>> >> >> >> that took a long time elapsed: 26.028852 slow_threshold: >>> >> >> >> 2.283927 >>> >> >> >> 90th_quantile_runtime: 0.374190 >>> >> >> >> that took a long time elapsed: 25.432279 slow_threshold: >>> >> >> >> 2.279631 >>> >> >> >> 90th_quantile_runtime: 0.371502 >>> >> >> >> >>> >> >> >> pypy mem.py 30000000 >>> >> >> >> start build >>> >> >> >> end build 276.217811 >>> >> >> >> that took a long time elapsed: 40.993855 slow_threshold: >>> >> >> >> 3.188464 >>> >> >> >> 90th_quantile_runtime: 0.459891 >>> >> >> >> that took a long time elapsed: 41.693553 slow_threshold: >>> >> >> >> 3.183003 >>> >> >> >> 90th_quantile_runtime: 0.393654 >>> >> >> >> that took a long time elapsed: 39.679769 slow_threshold: >>> >> >> >> 3.190782 >>> >> >> >> 90th_quantile_runtime: 0.393677 >>> >> >> >> that took a long time elapsed: 43.573411 slow_threshold: >>> >> >> >> 3.239637 >>> >> >> >> 90th_quantile_runtime: 0.393654 >>> >> >> >> >>> >> >> >> Code below >>> >> >> >> -------------------------------------------------------------- >>> >> >> >> import time >>> >> >> >> from random import randint, choice >>> >> >> >> import 
sys >>> >> >> >> >>> >> >> >> >>> >> >> >> allElems = {} >>> >> >> >> >>> >> >> >> class Node: >>> >> >> >> def __init__(self, v_): >>> >> >> >> self.v = v_ >>> >> >> >> self.next = None >>> >> >> >> self.dummy_data = [randint(0,100) >>> >> >> >> for _ in xrange(randint(50,100))] >>> >> >> >> allElems[self.v] = self >>> >> >> >> if self.v > 0: >>> >> >> >> self.dummy_links = [allElems[randint(0, self.v-1)] >>> >> >> >> for _ >>> >> >> >> in >>> >> >> >> xrange(10)] >>> >> >> >> else: >>> >> >> >> self.dummy_links = [self] >>> >> >> >> >>> >> >> >> def set_next(self, l): >>> >> >> >> self.next = l >>> >> >> >> >>> >> >> >> >>> >> >> >> def follow(node): >>> >> >> >> acc = [] >>> >> >> >> count = 0 >>> >> >> >> cur = node >>> >> >> >> assert node.v is not None >>> >> >> >> assert cur is not None >>> >> >> >> while count < 50000: >>> >> >> >> # return a value; generate some garbage >>> >> >> >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") >>> >> >> >> for >>> >> >> >> x >>> >> >> >> in >>> >> >> >> xrange(100)])) >>> >> >> >> >>> >> >> >> # if we have reached the end, chose a random link >>> >> >> >> cur = choice(cur.dummy_links) if cur.next is None else >>> >> >> >> cur.next >>> >> >> >> count += 1 >>> >> >> >> >>> >> >> >> return acc >>> >> >> >> >>> >> >> >> >>> >> >> >> def build(num_elems): >>> >> >> >> start = time.time() >>> >> >> >> print "start build" >>> >> >> >> root = Node(0) >>> >> >> >> cur = root >>> >> >> >> for x in xrange(1, num_elems): >>> >> >> >> e = Node(x) >>> >> >> >> cur.next = e >>> >> >> >> cur = e >>> >> >> >> print "end build %f" % (time.time() - start) >>> >> >> >> return root >>> >> >> >> >>> >> >> >> >>> >> >> >> num_timings = 100 >>> >> >> >> if __name__ == "__main__": >>> >> >> >> num_elems = int(sys.argv[1]) >>> >> >> >> build(num_elems) >>> >> >> >> total = 0 >>> >> >> >> timings = [0.0] * num_timings # run times for the last >>> >> >> >> num_timings >>> >> >> >> runs >>> >> >> >> i = 0 >>> >> >> >> beginning = 
time.time() >>> >> >> >> while time.time() - beginning < 600: >>> >> >> >> start = time.time() >>> >> >> >> elem = allElems[randint(0, num_elems - 1)] >>> >> >> >> assert(elem is not None) >>> >> >> >> >>> >> >> >> lst = follow(elem) >>> >> >> >> >>> >> >> >> total += choice(lst)[0] # use the return value for >>> >> >> >> something >>> >> >> >> >>> >> >> >> end = time.time() >>> >> >> >> >>> >> >> >> elapsed = end-start >>> >> >> >> timings[i % num_timings] = elapsed >>> >> >> >> if (i > num_timings): >>> >> >> >> slow_time = 2 * sum(timings)/num_timings # slow >>> >> >> >> defined >>> >> >> >> as >>> >> >> >> > >>> >> >> >> 2*avg run time >>> >> >> >> if (elapsed > slow_time): >>> >> >> >> print "that took a long time elapsed: %f >>> >> >> >> slow_threshold: >>> >> >> >> %f 90th_quantile_runtime: %f" % \ >>> >> >> >> (elapsed, slow_time, >>> >> >> >> sorted(timings)[int(num_timings*.9)]) >>> >> >> >> i += 1 >>> >> >> >> print total >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>> >> >> >> >>> >> >> >> wrote: >>> >> >> >>> >>> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>> >> >> >>> wrote: >>> >> >> >>> > Hi Armin, Maciej >>> >> >> >>> > >>> >> >> >>> > Thanks for responding. >>> >> >> >>> > >>> >> >> >>> > I'm in the process of trying to determine what (if any) of the >>> >> >> >>> > code >>> >> >> >>> > I'm >>> >> >> >>> > in a >>> >> >> >>> > position to share, and I'll get back to you. >>> >> >> >>> > >>> >> >> >>> > Allowing hinting to the GC would be good. Even better would be >>> >> >> >>> > a >>> >> >> >>> > means >>> >> >> >>> > to >>> >> >> >>> > allow me to (transparently) allocate objects in unmanaged >>> >> >> >>> > memory, >>> >> >> >>> > but I >>> >> >> >>> > would expect that to be a tall order :) >>> >> >> >>> > >>> >> >> >>> > Thanks, >>> >> >> >>> > /Martin >>> >> >> >>> >>> >> >> >>> Hi Martin. 
>>> >> >> >>> >>> >> >> >>> Note that in case you want us to do the work of isolating the >>> >> >> >>> problem, >>> >> >> >>> we do offer paid support to do that (then we can sign NDAs and >>> >> >> >>> stuff). >>> >> >> >>> Otherwise we would be more than happy to fix bugs once you >>> >> >> >>> isolate >>> >> >> >>> a >>> >> >> >>> part you can share freely :) >>> >> >> >> >>> >> >> >> >>> >> >> > >>> >> > >>> >> > >>> > >>> > >> >> From mak at issuu.com Mon Mar 17 11:46:17 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 11:46:17 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: Based On Maciej's suggestion, I tried the following PYPYLOG=- pypy mem.py 10000000 > out This generates a logfile which looks something like this start--> [2b99f1981b527e] {gc-minor [2b99f1981ba680] {gc-minor-walkroots [2b99f1981c2e02] gc-minor-walkroots} [2b99f19890d750] gc-minor} [snip] ... <--stop It turns out that the culprit is a lot of MINOR collections. I base this on the following observations: - I can't understand the format of the timestamp on each logline (the " [2b99f1981b527e]"). From what I can see in the code, this should be output from time.clock(), but that doesn't return a number like that when I run pypy interactively - Instead, I count the number of debug lines between start--> and the corresponding <--stop. - Most runs have a few hundred lines of output between start/stop - All slow runs have very close to 57800 lines out output between start/stop - One such sample does 9609 gc-collect-step operations, 9647 gc-minor operations, and 9647 gc-minor-walkroots operations. Thanks, /Martin On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski wrote: > there is an environment variable PYPYLOG=gc:- (where - is stdout) > which will do that for you btw. > > maybe you can find out what's that using profiling or valgrind? 
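Since each step of follow() in mem.py allocates a fresh 100-character throwaway list, one quick experiment is to run a garbage-free variant and see whether the gc-minor spikes shrink. This is only a sketch suggested by the discussion, not code from the thread; it assumes Node objects with v, next and dummy_links attributes as in mem.py:

```python
from random import choice

def follow_low_garbage(node, steps=50000):
    """Like follow() from the benchmark, but keeps only the visited
    values and skips the per-step throwaway character list, to test
    whether that garbage is what drives the extra minor collections."""
    acc = []
    cur = node
    for _ in range(steps):
        acc.append(cur.v)   # keep just the value; no dummy garbage
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
    return acc
```

If the spikes persist with this variant, the surviving linked-list objects themselves, rather than the per-iteration garbage, are the more likely cause.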
> > On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote: > > I have tried getting the pypy source and building my own version of > pypy. I > > have modified rpython/memory/gc/incminimark.py:major_collection_step() to > > print out when it starts and when it stops. Apparently, the slow queries > do > > NOT occur during major_collection_step; at least, I have not observed > major > > step output during a query execution. So, apparently, something else is > > blocking. This could be another aspect of the GC, but it could also be > > anything else. > > > > Just to be sure, I have tried running the same application in python with > > garbage collection disabled. I don't see the problem there, so it is > somehow > > related to either GC or the runtime somehow. > > > > Cheers, > > /Martin > > > > > > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote: > >> > >> We have hacked up a small sample that seems to exhibit the same issue. > >> > >> We basically generate a linked list of objects. To increase > connectedness, > >> elements in the list hold references (dummy_links) to 10 randomly chosen > >> previous elements in the list. > >> > >> We then time a function that traverses 50000 elements from the list > from a > >> random start point. If the traversal reaches the end of the list, we > instead > >> traverse one of the dummy links. Thus, exactly 50K elements are > traversed > >> every time. To generate some garbage, we build a list holding the > traversed > >> elements and a dummy list of characters. > >> > >> Timings for the last 100 runs are stored in a circular buffer. 
If the > >> elapsed time for the last run is more than twice the average time, we > print > >> out a line with the elapsed time, the threshold, and the 90% runtime (we > >> would like to see that the mean runtime does not increase with the > number of > >> elements in the list, but that the max time does increase (linearly > with the > >> number of object, i guess); traversing 50K elements should be > independent of > >> the memory size). > >> > >> We have tried monitoring memory consumption by external inspection, but > >> cannot consistently verify that memory is deallocated at the same time > that > >> we see slow requests. Perhaps the pypy runtime doesn't always return > freed > >> pages back to the OS? > >> > >> Using top, we observe that 10M elements allocates around 17GB after > >> building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB > shortly > >> after building). > >> > >> Here is output from a few runs with different number of elements: > >> > >> > >> pypy mem.py 10000000 > >> start build > >> end build 84.142424 > >> that took a long time elapsed: 13.230586 slow_threshold: 1.495401 > >> 90th_quantile_runtime: 0.421558 > >> that took a long time elapsed: 13.016531 slow_threshold: 1.488160 > >> 90th_quantile_runtime: 0.423441 > >> that took a long time elapsed: 13.032537 slow_threshold: 1.474563 > >> 90th_quantile_runtime: 0.419817 > >> > >> pypy mem.py 20000000 > >> start build > >> end build 180.823105 > >> that took a long time elapsed: 27.346064 slow_threshold: 2.295146 > >> 90th_quantile_runtime: 0.434726 > >> that took a long time elapsed: 26.028852 slow_threshold: 2.283927 > >> 90th_quantile_runtime: 0.374190 > >> that took a long time elapsed: 25.432279 slow_threshold: 2.279631 > >> 90th_quantile_runtime: 0.371502 > >> > >> pypy mem.py 30000000 > >> start build > >> end build 276.217811 > >> that took a long time elapsed: 40.993855 slow_threshold: 3.188464 > >> 90th_quantile_runtime: 0.459891 > >> that took a long time elapsed: 
41.693553 slow_threshold: 3.183003 > >> 90th_quantile_runtime: 0.393654 > >> that took a long time elapsed: 39.679769 slow_threshold: 3.190782 > >> 90th_quantile_runtime: 0.393677 > >> that took a long time elapsed: 43.573411 slow_threshold: 3.239637 > >> 90th_quantile_runtime: 0.393654 > >> > >> Code below > >> -------------------------------------------------------------- > >> import time > >> from random import randint, choice > >> import sys > >> > >> > >> allElems = {} > >> > >> class Node: > >> def __init__(self, v_): > >> self.v = v_ > >> self.next = None > >> self.dummy_data = [randint(0,100) > >> for _ in xrange(randint(50,100))] > >> allElems[self.v] = self > >> if self.v > 0: > >> self.dummy_links = [allElems[randint(0, self.v-1)] for _ in > >> xrange(10)] > >> else: > >> self.dummy_links = [self] > >> > >> def set_next(self, l): > >> self.next = l > >> > >> > >> def follow(node): > >> acc = [] > >> count = 0 > >> cur = node > >> assert node.v is not None > >> assert cur is not None > >> while count < 50000: > >> # return a value; generate some garbage > >> acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x > in > >> xrange(100)])) > >> > >> # if we have reached the end, chose a random link > >> cur = choice(cur.dummy_links) if cur.next is None else cur.next > >> count += 1 > >> > >> return acc > >> > >> > >> def build(num_elems): > >> start = time.time() > >> print "start build" > >> root = Node(0) > >> cur = root > >> for x in xrange(1, num_elems): > >> e = Node(x) > >> cur.next = e > >> cur = e > >> print "end build %f" % (time.time() - start) > >> return root > >> > >> > >> num_timings = 100 > >> if __name__ == "__main__": > >> num_elems = int(sys.argv[1]) > >> build(num_elems) > >> total = 0 > >> timings = [0.0] * num_timings # run times for the last num_timings > >> runs > >> i = 0 > >> beginning = time.time() > >> while time.time() - beginning < 600: > >> start = time.time() > >> elem = allElems[randint(0, num_elems - 1)] > >> 
assert(elem is not None) > >> > >> lst = follow(elem) > >> > >> total += choice(lst)[0] # use the return value for something > >> > >> end = time.time() > >> > >> elapsed = end-start > >> timings[i % num_timings] = elapsed > >> if (i > num_timings): > >> slow_time = 2 * sum(timings)/num_timings # slow defined as > > >> 2*avg run time > >> if (elapsed > slow_time): > >> print "that took a long time elapsed: %f > slow_threshold: > >> %f 90th_quantile_runtime: %f" % \ > >> (elapsed, slow_time, > >> sorted(timings)[int(num_timings*.9)]) > >> i += 1 > >> print total > >> > >> > >> > >> > >> > >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski > >> wrote: > >>> > >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote: > >>> > Hi Armin, Maciej > >>> > > >>> > Thanks for responding. > >>> > > >>> > I'm in the process of trying to determine what (if any) of the code > I'm > >>> > in a > >>> > position to share, and I'll get back to you. > >>> > > >>> > Allowing hinting to the GC would be good. Even better would be a > means > >>> > to > >>> > allow me to (transparently) allocate objects in unmanaged memory, > but I > >>> > would expect that to be a tall order :) > >>> > > >>> > Thanks, > >>> > /Martin > >>> > >>> Hi Martin. > >>> > >>> Note that in case you want us to do the work of isolating the problem, > >>> we do offer paid support to do that (then we can sign NDAs and stuff). > >>> Otherwise we would be more than happy to fix bugs once you isolate a > >>> part you can share freely :) > >> > >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mak at issuu.com Mon Mar 17 15:19:28 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 15:19:28 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines i get. I.e. 
there were 668 runs
that printed 58 loglines that contain "{gc-minor" which was eventually
followed by "gc-minor}". I have also counted if the query was slow;
interestingly, not all the queries with many gc-minors were slow (but all
slow queries had a gc-minor).

Please let me know if this is unclear :)

 668 gc-minor:58 gc-minor-walkroots:58
  10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
 140 gc-minor:59 gc-minor-walkroots:59
   1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
   1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
   9 gc-minor:9643 *slow*:1 gc-minor-walkroots:9643 gc-collect-step:9589
   1 gc-minor:9644 *slow*:1 gc-minor-walkroots:9644 gc-collect-step:9590
  10 gc-minor:9647 *slow*:1 gc-minor-walkroots:9647 gc-collect-step:9609
   1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
   1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
   1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
   1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
   1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
   2 jit-log-compiling-loop:1 jit-backend-dump:78
jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104 Thanks, /Martin On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski wrote: > On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski > wrote: > > are you *sure* it's the walkroots that take that long and not > > something else (like gc-minor)? More of those mean that you allocate a > > lot more surviving objects. Can you do two things: > > > > a) take a max of gc-minor (and gc-minor-stackwalk), per request > > b) take the sum of those > > > > and plot them > > ^^^ or just paste the results actually > > > > > On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: > >> Well, then it works out to around 2.5GHz, which seems reasonable. But it > >> doesn't alter the conclusion from the previous email: The slow queries > then > >> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 > units, or > >> .4 seconds at this conversion. Also, the log shows that a slow query > >> performs many more gc-minor operations than a 'normal' one: 9600 > >> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. > >> > >> So the question becomes: Why do we get this large spike in > >> gc-minor-walkroots, and, in particular, is there any way to avoid it :) > ? 
> >> > >> Thanks, > >> /Martin > >> > >> > >> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski > >> wrote: > >>> > >>> I think it's the cycles of your CPU > >>> > >>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: > >>> > What is the unit? Perhaps I'm being thick here, but I can't > correlate it > >>> > with seconds (which the program does print out). Slow runs are > around 13 > >>> > seconds, but are around 34*10^9(dec), 0x800000000 timestamp units > (e.g. > >>> > from > >>> > 0x2b994c9d31889c to 0x2b9944ab8c4f49). > >>> > > >>> > > >>> > > >>> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski < > fijall at gmail.com> > >>> > wrote: > >>> >> > >>> >> The number of lines is nonsense. This is a timestamp in hex. > >>> >> > >>> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch > wrote: > >>> >> > Based On Maciej's suggestion, I tried the following > >>> >> > > >>> >> > PYPYLOG=- pypy mem.py 10000000 > out > >>> >> > > >>> >> > This generates a logfile which looks something like this > >>> >> > > >>> >> > start--> > >>> >> > [2b99f1981b527e] {gc-minor > >>> >> > [2b99f1981ba680] {gc-minor-walkroots > >>> >> > [2b99f1981c2e02] gc-minor-walkroots} > >>> >> > [2b99f19890d750] gc-minor} > >>> >> > [snip] > >>> >> > ... > >>> >> > <--stop > >>> >> > > >>> >> > > >>> >> > It turns out that the culprit is a lot of MINOR collections. > >>> >> > > >>> >> > I base this on the following observations: > >>> >> > > >>> >> > I can't understand the format of the timestamp on each logline > (the > >>> >> > "[2b99f1981b527e]"). From what I can see in the code, this should > be > >>> >> > output > >>> >> > from time.clock(), but that doesn't return a number like that > when I > >>> >> > run > >>> >> > pypy interactively > >>> >> > Instead, I count the number of debug lines between start--> and > the > >>> >> > corresponding <--stop. 
> >>> >> > Most runs have a few hundred lines of output between start/stop > >>> >> > All slow runs have very close to 57800 lines out output between > >>> >> > start/stop > >>> >> > One such sample does 9609 gc-collect-step operations, 9647 > gc-minor > >>> >> > operations, and 9647 gc-minor-walkroots operations. > >>> >> > > >>> >> > > >>> >> > Thanks, > >>> >> > /Martin > >>> >> > > >>> >> > > >>> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > >>> >> > > >>> >> > wrote: > >>> >> >> > >>> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout) > >>> >> >> which will do that for you btw. > >>> >> >> > >>> >> >> maybe you can find out what's that using profiling or valgrind? > >>> >> >> > >>> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch > wrote: > >>> >> >> > I have tried getting the pypy source and building my own > version > >>> >> >> > of > >>> >> >> > pypy. I > >>> >> >> > have modified > >>> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() > >>> >> >> > to > >>> >> >> > print out when it starts and when it stops. Apparently, the > slow > >>> >> >> > queries > >>> >> >> > do > >>> >> >> > NOT occur during major_collection_step; at least, I have not > >>> >> >> > observed > >>> >> >> > major > >>> >> >> > step output during a query execution. So, apparently, something > >>> >> >> > else > >>> >> >> > is > >>> >> >> > blocking. This could be another aspect of the GC, but it could > >>> >> >> > also > >>> >> >> > be > >>> >> >> > anything else. > >>> >> >> > > >>> >> >> > Just to be sure, I have tried running the same application in > >>> >> >> > python > >>> >> >> > with > >>> >> >> > garbage collection disabled. I don't see the problem there, so > it > >>> >> >> > is > >>> >> >> > somehow > >>> >> >> > related to either GC or the runtime somehow. 
> >>> >> >> >
> >>> >> >> > Cheers,
> >>> >> >> > /Martin
> >>> >> >> >
> >>> >> >> > [remainder of quoted thread snipped -- identical to the
> >>> >> >> > benchmark description, code, timings, and Mar 13 exchange
> >>> >> >> > quoted in full above]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From fijall at gmail.com  Mon Mar 17 15:21:38 2014
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Mon, 17 Mar 2014 16:21:38 +0200
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To: 
References: 
Message-ID: 

eh, this is not what I need. I need a max of the TIME it took for a
gc-minor, and the TOTAL time it took for gc-minor, per query (ideally the
same for gc-walkroots and gc-collect-step).

On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote:
> Here are the collated results of running each query. For each run, I count
> how many of each of the pypy debug lines I get. I.e. there were 668 runs
> that printed 58 loglines that contain "{gc-minor" which was eventually
> followed by "gc-minor}". I have also counted if the query was slow;
> interestingly, not all the queries with many gc-minors were slow (but all
> slow queries had a gc-minor).
>
> Please let me know if this is unclear :)
>
>  668 gc-minor:58 gc-minor-walkroots:58
>   10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
>  140 gc-minor:59 gc-minor-walkroots:59
>    1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
>    1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
>    9 gc-minor:9643 *slow*:1 gc-minor-walkroots:9643 gc-collect-step:9589
>    1 gc-minor:9644 *slow*:1 gc-minor-walkroots:9644 gc-collect-step:9590
>   10 gc-minor:9647 *slow*:1 gc-minor-walkroots:9647 gc-collect-step:9609
>    1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
>    1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
>    1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78
> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030
> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6
> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1
> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2
> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2
> jit-resume:84
>    1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1
>
jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60
> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1
> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1
> jit-resume:14
>
> [rest of quoted message and earlier thread snipped -- quoted in full
> above]
Koch >> >>> >> >> >>> wrote: >> >>> >> >> >>> > Hi Armin, Maciej >> >>> >> >> >>> > >> >>> >> >> >>> > Thanks for responding. >> >>> >> >> >>> > >> >>> >> >> >>> > I'm in the process of trying to determine what (if any) of >> >>> >> >> >>> > the >> >>> >> >> >>> > code >> >>> >> >> >>> > I'm >> >>> >> >> >>> > in a >> >>> >> >> >>> > position to share, and I'll get back to you. >> >>> >> >> >>> > >> >>> >> >> >>> > Allowing hinting to the GC would be good. Even better >> >>> >> >> >>> > would be >> >>> >> >> >>> > a >> >>> >> >> >>> > means >> >>> >> >> >>> > to >> >>> >> >> >>> > allow me to (transparently) allocate objects in unmanaged >> >>> >> >> >>> > memory, >> >>> >> >> >>> > but I >> >>> >> >> >>> > would expect that to be a tall order :) >> >>> >> >> >>> > >> >>> >> >> >>> > Thanks, >> >>> >> >> >>> > /Martin >> >>> >> >> >>> >> >>> >> >> >>> Hi Martin. >> >>> >> >> >>> >> >>> >> >> >>> Note that in case you want us to do the work of isolating >> >>> >> >> >>> the >> >>> >> >> >>> problem, >> >>> >> >> >>> we do offer paid support to do that (then we can sign NDAs >> >>> >> >> >>> and >> >>> >> >> >>> stuff). >> >>> >> >> >>> Otherwise we would be more than happy to fix bugs once you >> >>> >> >> >>> isolate >> >>> >> >> >>> a >> >>> >> >> >>> part you can share freely :) >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> > >> >>> >> > >> >>> >> > >> >>> > >> >>> > >> >> >> >> > > From mak at issuu.com Mon Mar 17 16:35:33 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 16:35:33 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Message-ID: Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow. 
*Totals*: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 *Max*: gc-minor:10 gc-collect-step:247
*Totals*: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 *Max*: gc-minor:10 gc-collect-step:245
*Totals*: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 *Max*: gc-minor:11 gc-collect-step:244
*Totals*: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 *Max*: gc-minor:17 gc-collect-step:244
*Totals*: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 *Max*: gc-minor:11 gc-collect-step:248
*Totals*: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 *Max*: gc-minor:8 gc-collect-step:299
*Totals*: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 *Max*: gc-minor:11 gc-collect-step:246
*Totals*: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 *Max*: gc-minor:36 gc-collect-step:248
*Totals*: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245
*Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 *Max*: gc-minor:38 gc-collect-step:244
*Totals*: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 *Max*: gc-minor:23 gc-collect-step:245
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 *Max*: gc-minor:8 gc-collect-step:246
*Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 *Max*: gc-minor:9 gc-collect-step:244
*Totals*: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 *Max*: gc-minor:8 gc-collect-step:246
*Totals*: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 *Max*: gc-minor:8 gc-collect-step:248
*Totals*: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 *Max*: gc-minor:8 gc-collect-step:250
*Totals*: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 *Max*: gc-minor:8 gc-collect-step:245
*Totals*: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 *Max*: gc-minor:543 gc-collect-step:244
*Totals*: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 *Max*: gc-minor:20 gc-collect-step:246
*Totals*: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 *Max*: gc-minor:25 gc-collect-step:245

Thanks,
/Martin

On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote:
> Ah. I had misunderstood. I'll get back to you on that :) thanks
>
> /Martin
>
> > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote:
> >
> > eh, this is not what I need
> >
> > I need a max of TIME it took for a gc-minor and the TOTAL time it took
> > for a gc-minor (per query) (ideally same for gc-walkroots and
> > gc-collect-step)
> >
> >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote:
> >> Here are the collated results of running each query. For each run, I count
> >> how many of each of the pypy debug lines I get. I.e. there were 668 runs
> >> that printed 58 loglines that contain "{gc-minor" which was eventually
> >> followed by "gc-minor}". I have also counted whether the query was slow;
> >> interestingly, not all the queries with many gc-minors were slow (but all
> >> slow queries had a gc-minor).
> >>
> >> Please let me know if this is unclear :)
> >>
> >>  668 gc-minor:58 gc-minor-walkroots:58
> >>   10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
> >>  140 gc-minor:59 gc-minor-walkroots:59
> >>    1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
> >>    1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
> >>    9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589
> >>    1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590
> >>   10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609
> >>    1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
> >>    1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
> >>    1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
> >>    1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
> >>    1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
> >>    2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
> >>    1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
> >>
> >>
> >> Thanks,
> >> /Martin
> >>
> >>
> >>
> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski wrote:
> >>>
> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski wrote:
> >>>> are you *sure* it's the walkroots that take that long and not
> >>>> something else (like gc-minor)? More of those mean that you allocate a
> >>>> lot more surviving objects. Can you do two things:
> >>>>
> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request
> >>>> b) take the sum of those
> >>>>
> >>>> and plot them
> >>>
> >>> ^^^ or just paste the results actually
> >>>
> >>>>
> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote:
> >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. But
> >>>>> it doesn't alter the conclusion from the previous email: The slow
> >>>>> queries then all have a duration around 34*10^9 units, 'normal'
> >>>>> queries 1*10^9 units, or .4 seconds at this conversion. Also, the log
> >>>>> shows that a slow query performs many more gc-minor operations than a
> >>>>> 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots
> >>>>> operations vs 58.
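[Editor's note] The unit arithmetic quoted above is easy to sanity-check. A minimal worked example, taking the 2.5 GHz clock rate at face value (it is an inference made in the exchange, not a confirmed spec of the machine):

```python
# Timestamp deltas reported in the thread, interpreted as CPU cycles.
CPU_HZ = 2.5e9        # assumed clock rate, as inferred above
SLOW_UNITS = 34e9     # timestamp delta observed for a slow query
NORMAL_UNITS = 1e9    # timestamp delta observed for a 'normal' query

slow_seconds = SLOW_UNITS / CPU_HZ      # 13.6 s -- matches the ~13 s slow runs
normal_seconds = NORMAL_UNITS / CPU_HZ  # 0.4 s  -- matches the reported .4 s
```

Both conversions land on the durations the program itself printed, which supports the cycles-as-units interpretation.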
> >>>>>
> >>>>> So the question becomes: Why do we get this large spike in
> >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
> >>>>>
> >>>>> Thanks,
> >>>>> /Martin
> >>>>>
> >>>>>
> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski <fijall at gmail.com> wrote:
> >>>>>>
> >>>>>> I think it's the cycles of your CPU
> >>>>>>
> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote:
> >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't
> >>>>>>> correlate it with seconds (which the program does print out). Slow
> >>>>>>> runs are around 13 seconds, but are around 34*10^9(dec),
> >>>>>>> 0x800000000 timestamp units (e.g. from 0x2b994c9d31889c to
> >>>>>>> 0x2b9944ab8c4f49).
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski wrote:
> >>>>>>>>
> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex.
> >>>>>>>>
> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch wrote:
> >>>>>>>>> Based on Maciej's suggestion, I tried the following
> >>>>>>>>>
> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out
> >>>>>>>>>
> >>>>>>>>> This generates a logfile which looks something like this
> >>>>>>>>>
> >>>>>>>>> start-->
> >>>>>>>>> [2b99f1981b527e] {gc-minor
> >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots
> >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots}
> >>>>>>>>> [2b99f19890d750] gc-minor}
> >>>>>>>>> [snip]
> >>>>>>>>> ...
> >>>>>>>>> <--stop
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections.
> >>>>>>>>>
> >>>>>>>>> I base this on the following observations:
> >>>>>>>>>
> >>>>>>>>> I can't understand the format of the timestamp on each logline
> >>>>>>>>> (the "[2b99f1981b527e]"). From what I can see in the code, this
> >>>>>>>>> should be output from time.clock(), but that doesn't return a
> >>>>>>>>> number like that when I run pypy interactively.
> >>>>>>>>> Instead, I count the number of debug lines between start--> and
> >>>>>>>>> the corresponding <--stop.
> >>>>>>>>> Most runs have a few hundred lines of output between start/stop.
> >>>>>>>>> All slow runs have very close to 57800 lines of output between
> >>>>>>>>> start/stop.
> >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647
> >>>>>>>>> gc-minor operations, and 9647 gc-minor-walkroots operations.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> /Martin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski wrote:
> >>>>>>>>>>
> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is
> >>>>>>>>>> stdout) which will do that for you btw.
> >>>>>>>>>>
> >>>>>>>>>> maybe you can find out what's that using profiling or valgrind?
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch wrote:
> >>>>>>>>>>> I have tried getting the pypy source and building my own
> >>>>>>>>>>> version of pypy. I have modified
> >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() to
> >>>>>>>>>>> print out when it starts and when it stops. Apparently, the
> >>>>>>>>>>> slow queries do NOT occur during major_collection_step; at
> >>>>>>>>>>> least, I have not observed major step output during a query
> >>>>>>>>>>> execution. So, apparently, something else is blocking. This
> >>>>>>>>>>> could be another aspect of the GC, but it could also be
> >>>>>>>>>>> anything else.
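[Editor's note] The counting and timing methodology described in the thread can be automated in a few lines. A hypothetical sketch (the function name is mine; the regex assumes the `[hex-timestamp] {section` / `[hex-timestamp] section}` line format shown above, and durations stay in raw timestamp units):

```python
import re
from collections import defaultdict

# Matches "[2b99f1981b527e] {gc-minor" (open) and "[2b99f19890d750] gc-minor}" (close).
DEBUG = re.compile(r"\[([0-9a-f]+)\] (\{)?([\w-]+)(\})?")

def analyze_query(lines):
    """For the log lines of one start-->/<--stop window, return
    (counts, totals, maxima) per debug section."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    maxima = defaultdict(int)
    opened = {}  # section name -> timestamp at which it was opened
    for line in lines:
        m = DEBUG.search(line)
        if not m:
            continue  # skip start-->/<--stop markers and other noise
        ts = int(m.group(1), 16)
        name = m.group(3)
        if m.group(2):                         # "{gc-minor" opens a section
            counts[name] += 1
            opened[name] = ts
        elif m.group(4) and name in opened:    # "gc-minor}" closes it
            d = ts - opened.pop(name)
            totals[name] += d
            maxima[name] = max(maxima[name], d)
    return dict(counts), dict(totals), dict(maxima)
```

Splitting the full log on `start-->`/`<--stop` and feeding each window through this would produce per-query tables like the ones posted above.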
> >>>>>>>>>>> Just to be sure, I have tried running the same application in
> >>>>>>>>>>> python with garbage collection disabled. I don't see the
> >>>>>>>>>>> problem there, so it is somehow related to either the GC or
> >>>>>>>>>>> the runtime.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> /Martin
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the
> >>>>>>>>>>>> same issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We basically generate a linked list of objects. To increase
> >>>>>>>>>>>> connectedness, elements in the list hold references
> >>>>>>>>>>>> (dummy_links) to 10 randomly chosen previous elements in the
> >>>>>>>>>>>> list.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [snip -- benchmark description, timings, code, and earlier
> >>>>>>>>>>>> replies quoted in full above]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mak at issuu.com Mon Mar 17 16:37:45 2014
From: mak at issuu.com (Martin Koch)
Date: Mon, 17 Mar 2014 16:37:45 +0100
Subject: [pypy-dev] Pypy garbage collection
In-Reply-To: 
References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com>
Message-ID: 

Ah - it just occurred to me that the first runs may be slow anyway: since we
take the average of the last 100 runs as the benchmark, the first 100 runs
are never classified as slow. Indeed, the first three runs with many
collections are among the first 100 runs.

On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote:
> Here are the total and max times in millions of units; 30000 units is
> approximately 13 seconds. I have extracted the runs where there are many
> gc-collect-steps. These are in execution order, so the first runs with many
> gc-collect-steps aren't slow.
> > *Totals*: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 *Max*: > gc-minor:10 gc-collect-step:247 > *Totals*: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 *Max*: > gc-minor:10 gc-collect-step:245 > *Totals*: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 *Max*: > gc-minor:11 gc-collect-step:244 > *Totals*: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 > *Max*: gc-minor:17 gc-collect-step:244 > *Totals*: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 > *Max*: gc-minor:11 gc-collect-step:248 > *Totals*: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 > *Max*: gc-minor:8 gc-collect-step:299 > *Totals*: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 > *Max*: gc-minor:11 gc-collect-step:246 > *Totals*: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 > *Max*: gc-minor:8 gc-collect-step:244 > *Totals*: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 > *Max*: gc-minor:36 gc-collect-step:248 > *Totals*: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 > *Max*: gc-minor:8 gc-collect-step:244 > *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 > *Max*: gc-minor:8 gc-collect-step:245 > *Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 > *Max*: gc-minor:8 gc-collect-step:244 > *Totals*: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 > *Max*: gc-minor:38 gc-collect-step:244 > *Totals*: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 > *Max*: gc-minor:23 gc-collect-step:245 > *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 > *Max*: gc-minor:8 gc-collect-step:246 > *Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 > *Max*: gc-minor:9 gc-collect-step:244 > *Totals*: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 > *Max*: gc-minor:8 gc-collect-step:246 > *Totals*: gc-minor:388 slow:1 gc-minor-walkroots:0 
gc-collect-step:31179 > *Max*: gc-minor:8 gc-collect-step:248 > *Totals*: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 > *Max*: gc-minor:8 gc-collect-step:250 > *Totals*: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 > *Max*: gc-minor:8 gc-collect-step:245 > *Totals*: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 > *Max*: gc-minor:543 gc-collect-step:244 > *Totals*: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 > *Max*: gc-minor:20 gc-collect-step:246 > *Totals*: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 > *Max*: gc-minor:25 gc-collect-step:245 > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: > >> Ah. I had misunderstood. I'll get back to you on that :) thanks >> >> /Martin >> >> >> > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: >> > >> > eh, this is not what I need >> > >> > I need a max of TIME it took for a gc-minor and the TOTAL time it took >> > for a gc-minor (per query) (ideally same for gc-walkroots and >> > gc-collect-step) >> > >> >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >> >> Here are the collated results of running each query. For each run, I >> count >> >> how many of each of the pypy debug lines i get. I.e. there were 668 >> runs >> >> that printed 58 loglines that contain "{gc-minor" which was eventually >> >> followed by "gc-minor}". I have also counted if the query was slow; >> >> interestingly, not all the queries with many gc-minors were slow (but >> all >> >> slow queries had a gc-minor). 
>> >> >> >> Please let me know if this is unclear :) >> >> >> >> 668 gc-minor:58 gc-minor-walkroots:58 >> >> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >> >> 140 gc-minor:59 gc-minor-walkroots:59 >> >> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >> >> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >> >> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >> gc-collect-step:9589 >> >> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >> gc-collect-step:9590 >> >> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >> gc-collect-step:9609 >> >> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >> >> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >> >> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >> jit-backend-dump:78 >> >> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 >> >> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >> >> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >> >> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >> >> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 >> >> jit-resume:84 >> >> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >> >> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >> >> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >> >> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >> >> jit-resume:14 >> >> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >> >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >> >> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> jit-abort:3 >> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> jit-log-opt-bridge:2 >> >> jit-log-compiling-bridge:2 jit-resume:84 >> >> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >> >> jit-log-noopt-loop:6 
jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >> >> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> jit-abort:3 >> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> jit-log-opt-bridge:2 >> >> jit-log-compiling-bridge:2 jit-resume:84 >> >> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >> >> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >> gc-minor:61 >> >> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >> >> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >> jit-abort:3 >> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >> jit-log-opt-bridge:2 >> >> jit-log-compiling-bridge:2 jit-resume:104 >> >> >> >> >> >> Thanks, >> >> /Martin >> >> >> >> >> >> >> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >> >> wrote: >> >>> >> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski > > >> >>> wrote: >> >>>> are you *sure* it's the walkroots that take that long and not >> >>>> something else (like gc-minor)? More of those mean that you allocate >> a >> >>>> lot more surviving objects. Can you do two things: >> >>>> >> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >> >>>> b) take the sum of those >> >>>> >> >>>> and plot them >> >>> >> >>> ^^^ or just paste the results actually >> >>> >> >>>> >> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >> >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. >> But >> >>>>> it >> >>>>> doesn't alter the conclusion from the previous email: The slow >> queries >> >>>>> then >> >>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >> >>>>> units, or >> >>>>> .4 seconds at this conversion. Also, the log shows that a slow query >> >>>>> performs many more gc-minor operations than a 'normal' one: 9600 >> >>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. 
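
[Editor's note: the unit arithmetic in Martin's message checks out. Treating the bracketed log values as CPU timestamp-counter readings and assuming a ~2.5 GHz counter frequency (an assumption, as discussed, not something recorded in the log) turns a slow run's span into roughly the observed 13 seconds. Note the two readings as quoted appear in end-first order, hence the absolute difference:]

```python
# Hex timestamp readings quoted from the log for one slow run.
t_a = int("2b994c9d31889c", 16)
t_b = int("2b9944ab8c4f49", 16)
delta = abs(t_a - t_b)          # roughly 34e9 counter units, close to 0x800000000

assumed_hz = 2.5e9              # assumed CPU frequency; not recorded in the log
seconds = delta / assumed_hz    # roughly the ~13 s observed for slow runs
```
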
>> >>>>> >> >>>>> So the question becomes: Why do we get this large spike in >> >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid >> it :) >> >>>>> ? >> >>>>> >> >>>>> Thanks, >> >>>>> /Martin >> >>>>> >> >>>>> >> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski < >> fijall at gmail.com> >> >>>>> wrote: >> >>>>>> >> >>>>>> I think it's the cycles of your CPU >> >>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >> wrote: >> >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >> >>>>>>> correlate it >> >>>>>>> with seconds (which the program does print out). Slow runs are >> >>>>>>> around 13 >> >>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >> >>>>>>> (e.g. >> >>>>>>> from >> >>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> >>>>>>> >> >>>>>>> wrote: >> >>>>>>>> >> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex. >> >>>>>>>> >> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >> >>>>>>>> wrote: >> >>>>>>>>> Based On Maciej's suggestion, I tried the following >> >>>>>>>>> >> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >> >>>>>>>>> >> >>>>>>>>> This generates a logfile which looks something like this >> >>>>>>>>> >> >>>>>>>>> start--> >> >>>>>>>>> [2b99f1981b527e] {gc-minor >> >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >> >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >> >>>>>>>>> [2b99f19890d750] gc-minor} >> >>>>>>>>> [snip] >> >>>>>>>>> ... >> >>>>>>>>> <--stop >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections. >> >>>>>>>>> >> >>>>>>>>> I base this on the following observations: >> >>>>>>>>> >> >>>>>>>>> I can't understand the format of the timestamp on each logline >> >>>>>>>>> (the >> >>>>>>>>> "[2b99f1981b527e]"). 
From what I can see in the code, this >> should >> >>>>>>>>> be >> >>>>>>>>> output >> >>>>>>>>> from time.clock(), but that doesn't return a number like that >> >>>>>>>>> when I >> >>>>>>>>> run >> >>>>>>>>> pypy interactively >> >>>>>>>>> Instead, I count the number of debug lines between start--> and >> >>>>>>>>> the >> >>>>>>>>> corresponding <--stop. >> >>>>>>>>> Most runs have a few hundred lines of output between start/stop >> >>>>>>>>> All slow runs have very close to 57800 lines out output between >> >>>>>>>>> start/stop >> >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >> >>>>>>>>> gc-minor >> >>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Thanks, >> >>>>>>>>> /Martin >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> >>>>>>>>> >> >>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >> >>>>>>>>>> stdout) >> >>>>>>>>>> which will do that for you btw. >> >>>>>>>>>> >> >>>>>>>>>> maybe you can find out what's that using profiling or valgrind? >> >>>>>>>>>> >> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >> >>>>>>>>>> wrote: >> >>>>>>>>>>> I have tried getting the pypy source and building my own >> >>>>>>>>>>> version >> >>>>>>>>>>> of >> >>>>>>>>>>> pypy. I >> >>>>>>>>>>> have modified >> >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >> >>>>>>>>>>> to >> >>>>>>>>>>> print out when it starts and when it stops. Apparently, the >> >>>>>>>>>>> slow >> >>>>>>>>>>> queries >> >>>>>>>>>>> do >> >>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >> >>>>>>>>>>> observed >> >>>>>>>>>>> major >> >>>>>>>>>>> step output during a query execution. So, apparently, >> >>>>>>>>>>> something >> >>>>>>>>>>> else >> >>>>>>>>>>> is >> >>>>>>>>>>> blocking. 
This could be another aspect of the GC, but it could >> >>>>>>>>>>> also >> >>>>>>>>>>> be >> >>>>>>>>>>> anything else. >> >>>>>>>>>>> >> >>>>>>>>>>> Just to be sure, I have tried running the same application in >> >>>>>>>>>>> python >> >>>>>>>>>>> with >> >>>>>>>>>>> garbage collection disabled. I don't see the problem there, so >> >>>>>>>>>>> it >> >>>>>>>>>>> is >> >>>>>>>>>>> somehow >> >>>>>>>>>>> related to either GC or the runtime somehow. >> >>>>>>>>>>> >> >>>>>>>>>>> Cheers, >> >>>>>>>>>>> /Martin >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >> >>>>>>>>>>> wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >> >>>>>>>>>>>> same >> >>>>>>>>>>>> issue. >> >>>>>>>>>>>> >> >>>>>>>>>>>> We basically generate a linked list of objects. To increase >> >>>>>>>>>>>> connectedness, >> >>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >> >>>>>>>>>>>> randomly >> >>>>>>>>>>>> chosen >> >>>>>>>>>>>> previous elements in the list. >> >>>>>>>>>>>> >> >>>>>>>>>>>> We then time a function that traverses 50000 elements from >> >>>>>>>>>>>> the >> >>>>>>>>>>>> list >> >>>>>>>>>>>> from a >> >>>>>>>>>>>> random start point. If the traversal reaches the end of the >> >>>>>>>>>>>> list, >> >>>>>>>>>>>> we >> >>>>>>>>>>>> instead >> >>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >> >>>>>>>>>>>> are >> >>>>>>>>>>>> traversed >> >>>>>>>>>>>> every time. To generate some garbage, we build a list holding >> >>>>>>>>>>>> the >> >>>>>>>>>>>> traversed >> >>>>>>>>>>>> elements and a dummy list of characters. >> >>>>>>>>>>>> >> >>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >> >>>>>>>>>>>> buffer. 
If >> >>>>>>>>>>>> the >> >>>>>>>>>>>> elapsed time for the last run is more than twice the average >> >>>>>>>>>>>> time, >> >>>>>>>>>>>> we >> >>>>>>>>>>>> print >> >>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >> >>>>>>>>>>>> runtime >> >>>>>>>>>>>> (we >> >>>>>>>>>>>> would like to see that the mean runtime does not increase >> >>>>>>>>>>>> with >> >>>>>>>>>>>> the >> >>>>>>>>>>>> number of >> >>>>>>>>>>>> elements in the list, but that the max time does increase >> >>>>>>>>>>>> (linearly >> >>>>>>>>>>>> with the >> >>>>>>>>>>>> number of object, i guess); traversing 50K elements should be >> >>>>>>>>>>>> independent of >> >>>>>>>>>>>> the memory size). >> >>>>>>>>>>>> >> >>>>>>>>>>>> We have tried monitoring memory consumption by external >> >>>>>>>>>>>> inspection, >> >>>>>>>>>>>> but >> >>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >> >>>>>>>>>>>> same >> >>>>>>>>>>>> time >> >>>>>>>>>>>> that >> >>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't always >> >>>>>>>>>>>> return >> >>>>>>>>>>>> freed >> >>>>>>>>>>>> pages back to the OS? >> >>>>>>>>>>>> >> >>>>>>>>>>>> Using top, we observe that 10M elements allocates around 17GB >> >>>>>>>>>>>> after >> >>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >> >>>>>>>>>>>> 35GB >> >>>>>>>>>>>> shortly >> >>>>>>>>>>>> after building). 
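
[Editor's note: the timing scheme Martin describes (last 100 run times in a circular buffer, "slow" meaning more than twice the window average, with a 90th-percentile figure reported alongside) can be isolated into a small helper. This is a sketch in Python 3, not code from the thread; the thread's sample is Python 2:]

```python
class SlowRunDetector:
    """Keep the last `window` run times in a circular buffer; flag a run
    as slow once the buffer is full and the run exceeds twice the average."""

    def __init__(self, window=100):
        self.window = window
        self.timings = [0.0] * window
        self.count = 0

    def record(self, elapsed):
        """Store one run time; return (threshold, 90th-percentile time)
        if the run was slow, else None."""
        self.timings[self.count % self.window] = elapsed
        self.count += 1
        if self.count <= self.window:   # warm-up: average not meaningful yet
            return None
        threshold = 2.0 * sum(self.timings) / self.window
        if elapsed <= threshold:
            return None
        q90 = sorted(self.timings)[int(self.window * 0.9)]
        return threshold, q90
```

One consequence of this scheme, which Martin notes later in the thread, is that slow runs among the first `window` samples are never flagged at all.
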
>> >>>>>>>>>>>> >> >>>>>>>>>>>> Here is output from a few runs with different number of >> >>>>>>>>>>>> elements: >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> pypy mem.py 10000000 >> >>>>>>>>>>>> start build >> >>>>>>>>>>>> end build 84.142424 >> >>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >> >>>>>>>>>>>> 1.495401 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >> >>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >> >>>>>>>>>>>> 1.488160 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >> >>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >> >>>>>>>>>>>> 1.474563 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >> >>>>>>>>>>>> >> >>>>>>>>>>>> pypy mem.py 20000000 >> >>>>>>>>>>>> start build >> >>>>>>>>>>>> end build 180.823105 >> >>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >> >>>>>>>>>>>> 2.295146 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >> >>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >> >>>>>>>>>>>> 2.283927 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >> >>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >> >>>>>>>>>>>> 2.279631 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >> >>>>>>>>>>>> >> >>>>>>>>>>>> pypy mem.py 30000000 >> >>>>>>>>>>>> start build >> >>>>>>>>>>>> end build 276.217811 >> >>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >> >>>>>>>>>>>> 3.188464 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >> >>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >> >>>>>>>>>>>> 3.183003 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >> >>>>>>>>>>>> 3.190782 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >> >>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >> >>>>>>>>>>>> 3.239637 >> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>> >> >>>>>>>>>>>> Code below >> 
>>>>>>>>>>>> >> >>>>>>>>>>>> >> -------------------------------------------------------------- >> >>>>>>>>>>>> import time >> >>>>>>>>>>>> from random import randint, choice >> >>>>>>>>>>>> import sys >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> allElems = {} >> >>>>>>>>>>>> >> >>>>>>>>>>>> class Node: >> >>>>>>>>>>>> def __init__(self, v_): >> >>>>>>>>>>>> self.v = v_ >> >>>>>>>>>>>> self.next = None >> >>>>>>>>>>>> self.dummy_data = [randint(0,100) >> >>>>>>>>>>>> for _ in xrange(randint(50,100))] >> >>>>>>>>>>>> allElems[self.v] = self >> >>>>>>>>>>>> if self.v > 0: >> >>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >> >>>>>>>>>>>> self.v-1)] >> >>>>>>>>>>>> for _ >> >>>>>>>>>>>> in >> >>>>>>>>>>>> xrange(10)] >> >>>>>>>>>>>> else: >> >>>>>>>>>>>> self.dummy_links = [self] >> >>>>>>>>>>>> >> >>>>>>>>>>>> def set_next(self, l): >> >>>>>>>>>>>> self.next = l >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> def follow(node): >> >>>>>>>>>>>> acc = [] >> >>>>>>>>>>>> count = 0 >> >>>>>>>>>>>> cur = node >> >>>>>>>>>>>> assert node.v is not None >> >>>>>>>>>>>> assert cur is not None >> >>>>>>>>>>>> while count < 50000: >> >>>>>>>>>>>> # return a value; generate some garbage >> >>>>>>>>>>>> acc.append((cur.v, >> >>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >> >>>>>>>>>>>> for >> >>>>>>>>>>>> x >> >>>>>>>>>>>> in >> >>>>>>>>>>>> xrange(100)])) >> >>>>>>>>>>>> >> >>>>>>>>>>>> # if we have reached the end, chose a random link >> >>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >> >>>>>>>>>>>> else >> >>>>>>>>>>>> cur.next >> >>>>>>>>>>>> count += 1 >> >>>>>>>>>>>> >> >>>>>>>>>>>> return acc >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> def build(num_elems): >> >>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>> print "start build" >> >>>>>>>>>>>> root = Node(0) >> >>>>>>>>>>>> cur = root >> >>>>>>>>>>>> for x in xrange(1, num_elems): >> >>>>>>>>>>>> e = Node(x) >> >>>>>>>>>>>> cur.next = e >> >>>>>>>>>>>> cur = e >> >>>>>>>>>>>> print "end build 
%f" % (time.time() - start) >> >>>>>>>>>>>> return root >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> num_timings = 100 >> >>>>>>>>>>>> if __name__ == "__main__": >> >>>>>>>>>>>> num_elems = int(sys.argv[1]) >> >>>>>>>>>>>> build(num_elems) >> >>>>>>>>>>>> total = 0 >> >>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >> >>>>>>>>>>>> num_timings >> >>>>>>>>>>>> runs >> >>>>>>>>>>>> i = 0 >> >>>>>>>>>>>> beginning = time.time() >> >>>>>>>>>>>> while time.time() - beginning < 600: >> >>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >> >>>>>>>>>>>> assert(elem is not None) >> >>>>>>>>>>>> >> >>>>>>>>>>>> lst = follow(elem) >> >>>>>>>>>>>> >> >>>>>>>>>>>> total += choice(lst)[0] # use the return value for >> >>>>>>>>>>>> something >> >>>>>>>>>>>> >> >>>>>>>>>>>> end = time.time() >> >>>>>>>>>>>> >> >>>>>>>>>>>> elapsed = end-start >> >>>>>>>>>>>> timings[i % num_timings] = elapsed >> >>>>>>>>>>>> if (i > num_timings): >> >>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >> >>>>>>>>>>>> defined >> >>>>>>>>>>>> as >> >>>>>>>>>>>>> >> >>>>>>>>>>>> 2*avg run time >> >>>>>>>>>>>> if (elapsed > slow_time): >> >>>>>>>>>>>> print "that took a long time elapsed: %f >> >>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >> >>>>>>>>>>>> (elapsed, slow_time, >> >>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >> >>>>>>>>>>>> i += 1 >> >>>>>>>>>>>> print total >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >>>>>>>>>>>> >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch > > >> >>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>> Hi Armin, Maciej >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks for responding. 
>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >> >>>>>>>>>>>>>> the >> >>>>>>>>>>>>>> code >> >>>>>>>>>>>>>> I'm >> >>>>>>>>>>>>>> in a >> >>>>>>>>>>>>>> position to share, and I'll get back to you. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >> >>>>>>>>>>>>>> would be >> >>>>>>>>>>>>>> a >> >>>>>>>>>>>>>> means >> >>>>>>>>>>>>>> to >> >>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >> >>>>>>>>>>>>>> memory, >> >>>>>>>>>>>>>> but I >> >>>>>>>>>>>>>> would expect that to be a tall order :) >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Hi Martin. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Note that in case you want us to do the work of isolating >> >>>>>>>>>>>>> the >> >>>>>>>>>>>>> problem, >> >>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >> >>>>>>>>>>>>> and >> >>>>>>>>>>>>> stuff). >> >>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >> >>>>>>>>>>>>> isolate >> >>>>>>>>>>>>> a >> >>>>>>>>>>>>> part you can share freely :) >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >> >>>>> >> >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 16:41:16 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 17:41:16 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Message-ID: ok. so as you can probably see, the max is not that big, which means the GC is really incremental. What happens is you get tons of garbage that survives minor collection every now and then. I don't exactly know why, but you should look what objects can potentially survive for too long. 
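
[Editor's note: one concrete way to act on Maciej's advice in the quoted mem.py sample: follow() pins all 50,000 (value, junk-list) tuples in `acc` until the traversal finishes, so every minor collection that runs meanwhile finds them alive and must keep them. A hypothetical restructuring (Python 3; `follow_iter` and the minimal `Node` stand-in below are names invented for this sketch, not from the thread) yields each step's result so its garbage can die young:]

```python
import random

class Node:
    """Minimal stand-in for the Node class in the quoted sample."""
    def __init__(self, v):
        self.v = v
        self.next = None
        self.dummy_links = [self]

def follow_iter(node, steps=50000):
    """Generator variant of follow(): yields one (value, junk) pair per
    step instead of accumulating all of them in a list, so each step's
    temporaries can die in the nursery rather than survive the minor
    collections that run during the traversal."""
    cur = node
    for _ in range(steps):
        # same per-step garbage as the original, but now short-lived
        junk = [random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(100)]
        yield cur.v, junk
        cur = random.choice(cur.dummy_links) if cur.next is None else cur.next
```

The caller can then keep only the one element it actually needs (e.g. reservoir-sample a value for `total`) instead of materialising the whole `lst`; whether this removes the gc-collect-step spikes is something the thread's benchmark would have to confirm.
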
On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: > Ah - it just occured to me that the first runs may be slow anyway: Since we > take the average of the last 100 runs as the benchmark, then the first 100 > runs are not classified as slow. Indeed, the first three runs with many > collections are in the first 100 runs. > > > On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >> >> Here are the total and max times in millions of units; 30000 units is >> approximately 13 seconds. I have extracted the runs where there are many >> gc-collect-steps. These are in execution order, so the first runs with many >> gc-collect-steps aren't slow. >> >> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >> gc-minor:10 gc-collect-step:247 >> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >> gc-minor:10 gc-collect-step:245 >> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >> gc-minor:11 gc-collect-step:244 >> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 >> Max: gc-minor:17 gc-collect-step:244 >> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 >> Max: gc-minor:11 gc-collect-step:248 >> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 >> Max: gc-minor:8 gc-collect-step:299 >> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 >> Max: gc-minor:11 gc-collect-step:246 >> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 >> Max: gc-minor:8 gc-collect-step:244 >> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 >> Max: gc-minor:36 gc-collect-step:248 >> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 >> Max: gc-minor:8 gc-collect-step:244 >> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 >> Max: gc-minor:8 gc-collect-step:245 >> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 >> Max: gc-minor:8 gc-collect-step:244 >> Totals: 
gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 >> Max: gc-minor:38 gc-collect-step:244 >> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 >> Max: gc-minor:23 gc-collect-step:245 >> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 >> Max: gc-minor:8 gc-collect-step:246 >> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 >> Max: gc-minor:9 gc-collect-step:244 >> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 >> Max: gc-minor:8 gc-collect-step:246 >> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 >> Max: gc-minor:8 gc-collect-step:248 >> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 >> Max: gc-minor:8 gc-collect-step:250 >> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 >> Max: gc-minor:8 gc-collect-step:245 >> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 >> Max: gc-minor:543 gc-collect-step:244 >> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 >> Max: gc-minor:20 gc-collect-step:246 >> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 >> Max: gc-minor:25 gc-collect-step:245 >> >> Thanks, >> /Martin >> >> >> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: >>> >>> Ah. I had misunderstood. I'll get back to you on that :) thanks >>> >>> /Martin >>> >>> >>> > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: >>> > >>> > eh, this is not what I need >>> > >>> > I need a max of TIME it took for a gc-minor and the TOTAL time it took >>> > for a gc-minor (per query) (ideally same for gc-walkroots and >>> > gc-collect-step) >>> > >>> >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >>> >> Here are the collated results of running each query. For each run, I >>> >> count >>> >> how many of each of the pypy debug lines i get. I.e. 
there were 668 >>> >> runs >>> >> that printed 58 loglines that contain "{gc-minor" which was eventually >>> >> followed by "gc-minor}". I have also counted if the query was slow; >>> >> interestingly, not all the queries with many gc-minors were slow (but >>> >> all >>> >> slow queries had a gc-minor). >>> >> >>> >> Please let me know if this is unclear :) >>> >> >>> >> 668 gc-minor:58 gc-minor-walkroots:58 >>> >> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >>> >> 140 gc-minor:59 gc-minor-walkroots:59 >>> >> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >>> >> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >>> >> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >>> >> gc-collect-step:9589 >>> >> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >>> >> gc-collect-step:9590 >>> >> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >>> >> gc-collect-step:9609 >>> >> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >>> >> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >>> >> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >>> >> jit-backend-dump:78 >>> >> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 >>> >> gc-minor:9030 >>> >> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >>> >> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >>> >> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >>> >> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 >>> >> jit-resume:84 >>> >> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >>> >> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >>> >> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >>> >> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >>> >> jit-resume:14 >>> >> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >>> >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >>> >> gc-minor-walkroots:60 
jit-optimize:6 jit-log-short-preamble:2 >>> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>> >> jit-abort:3 >>> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>> >> jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 jit-resume:84 >>> >> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >>> >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >>> >> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >>> >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>> >> jit-abort:3 >>> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>> >> jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 jit-resume:84 >>> >> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >>> >> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >>> >> gc-minor:61 >>> >> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >>> >> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >>> >> jit-abort:3 >>> >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >>> >> jit-log-opt-bridge:2 >>> >> jit-log-compiling-bridge:2 jit-resume:104 >>> >> >>> >> >>> >> Thanks, >>> >> /Martin >>> >> >>> >> >>> >> >>> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >>> >> wrote: >>> >>> >>> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >>> >>> >>> >>> wrote: >>> >>>> are you *sure* it's the walkroots that take that long and not >>> >>>> something else (like gc-minor)? More of those mean that you allocate >>> >>>> a >>> >>>> lot more surviving objects. Can you do two things: >>> >>>> >>> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >>> >>>> b) take the sum of those >>> >>>> >>> >>>> and plot them >>> >>> >>> >>> ^^^ or just paste the results actually >>> >>> >>> >>>> >>> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >>> >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. 
>>> >>>>> But >>> >>>>> it >>> >>>>> doesn't alter the conclusion from the previous email: The slow >>> >>>>> queries >>> >>>>> then >>> >>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >>> >>>>> units, or >>> >>>>> .4 seconds at this conversion. Also, the log shows that a slow >>> >>>>> query >>> >>>>> performs many more gc-minor operations than a 'normal' one: 9600 >>> >>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >>> >>>>> >>> >>>>> So the question becomes: Why do we get this large spike in >>> >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid >>> >>>>> it :) >>> >>>>> ? >>> >>>>> >>> >>>>> Thanks, >>> >>>>> /Martin >>> >>>>> >>> >>>>> >>> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >>> >>>>> >>> >>>>> wrote: >>> >>>>>> >>> >>>>>> I think it's the cycles of your CPU >>> >>>>>> >>> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >>> >>>>>>> wrote: >>> >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >>> >>>>>>> correlate it >>> >>>>>>> with seconds (which the program does print out). Slow runs are >>> >>>>>>> around 13 >>> >>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >>> >>>>>>> (e.g. >>> >>>>>>> from >>> >>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>> >>>>>>> >>> >>>>>>> wrote: >>> >>>>>>>> >>> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex. 
>>> >>>>>>>> >>> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >>> >>>>>>>> wrote: >>> >>>>>>>>> Based On Maciej's suggestion, I tried the following >>> >>>>>>>>> >>> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >>> >>>>>>>>> >>> >>>>>>>>> This generates a logfile which looks something like this >>> >>>>>>>>> >>> >>>>>>>>> start--> >>> >>>>>>>>> [2b99f1981b527e] {gc-minor >>> >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >>> >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >>> >>>>>>>>> [2b99f19890d750] gc-minor} >>> >>>>>>>>> [snip] >>> >>>>>>>>> ... >>> >>>>>>>>> <--stop >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections. >>> >>>>>>>>> >>> >>>>>>>>> I base this on the following observations: >>> >>>>>>>>> >>> >>>>>>>>> I can't understand the format of the timestamp on each logline >>> >>>>>>>>> (the >>> >>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this >>> >>>>>>>>> should >>> >>>>>>>>> be >>> >>>>>>>>> output >>> >>>>>>>>> from time.clock(), but that doesn't return a number like that >>> >>>>>>>>> when I >>> >>>>>>>>> run >>> >>>>>>>>> pypy interactively >>> >>>>>>>>> Instead, I count the number of debug lines between start--> and >>> >>>>>>>>> the >>> >>>>>>>>> corresponding <--stop. >>> >>>>>>>>> Most runs have a few hundred lines of output between start/stop >>> >>>>>>>>> All slow runs have very close to 57800 lines out output between >>> >>>>>>>>> start/stop >>> >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >>> >>>>>>>>> gc-minor >>> >>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> Thanks, >>> >>>>>>>>> /Martin >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>> >>>>>>>>> >>> >>>>>>>>> wrote: >>> >>>>>>>>>> >>> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >>> >>>>>>>>>> stdout) >>> >>>>>>>>>> which will do that for you btw. 
>>> >>>>>>>>>> >>> >>>>>>>>>> maybe you can find out what's that using profiling or >>> >>>>>>>>>> valgrind? >>> >>>>>>>>>> >>> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >>> >>>>>>>>>> wrote: >>> >>>>>>>>>>> I have tried getting the pypy source and building my own >>> >>>>>>>>>>> version >>> >>>>>>>>>>> of >>> >>>>>>>>>>> pypy. I >>> >>>>>>>>>>> have modified >>> >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>> >>>>>>>>>>> to >>> >>>>>>>>>>> print out when it starts and when it stops. Apparently, the >>> >>>>>>>>>>> slow >>> >>>>>>>>>>> queries >>> >>>>>>>>>>> do >>> >>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >>> >>>>>>>>>>> observed >>> >>>>>>>>>>> major >>> >>>>>>>>>>> step output during a query execution. So, apparently, >>> >>>>>>>>>>> something >>> >>>>>>>>>>> else >>> >>>>>>>>>>> is >>> >>>>>>>>>>> blocking. This could be another aspect of the GC, but it >>> >>>>>>>>>>> could >>> >>>>>>>>>>> also >>> >>>>>>>>>>> be >>> >>>>>>>>>>> anything else. >>> >>>>>>>>>>> >>> >>>>>>>>>>> Just to be sure, I have tried running the same application in >>> >>>>>>>>>>> python >>> >>>>>>>>>>> with >>> >>>>>>>>>>> garbage collection disabled. I don't see the problem there, >>> >>>>>>>>>>> so >>> >>>>>>>>>>> it >>> >>>>>>>>>>> is >>> >>>>>>>>>>> somehow >>> >>>>>>>>>>> related to either GC or the runtime somehow. >>> >>>>>>>>>>> >>> >>>>>>>>>>> Cheers, >>> >>>>>>>>>>> /Martin >>> >>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>> >>>>>>>>>>> wrote: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >>> >>>>>>>>>>>> same >>> >>>>>>>>>>>> issue. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We basically generate a linked list of objects. 
To increase >>> >>>>>>>>>>>> connectedness, >>> >>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >>> >>>>>>>>>>>> randomly >>> >>>>>>>>>>>> chosen >>> >>>>>>>>>>>> previous elements in the list. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We then time a function that traverses 50000 elements from >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> list >>> >>>>>>>>>>>> from a >>> >>>>>>>>>>>> random start point. If the traversal reaches the end of the >>> >>>>>>>>>>>> list, >>> >>>>>>>>>>>> we >>> >>>>>>>>>>>> instead >>> >>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >>> >>>>>>>>>>>> are >>> >>>>>>>>>>>> traversed >>> >>>>>>>>>>>> every time. To generate some garbage, we build a list >>> >>>>>>>>>>>> holding >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> traversed >>> >>>>>>>>>>>> elements and a dummy list of characters. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >>> >>>>>>>>>>>> buffer. If >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> elapsed time for the last run is more than twice the average >>> >>>>>>>>>>>> time, >>> >>>>>>>>>>>> we >>> >>>>>>>>>>>> print >>> >>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >>> >>>>>>>>>>>> runtime >>> >>>>>>>>>>>> (we >>> >>>>>>>>>>>> would like to see that the mean runtime does not increase >>> >>>>>>>>>>>> with >>> >>>>>>>>>>>> the >>> >>>>>>>>>>>> number of >>> >>>>>>>>>>>> elements in the list, but that the max time does increase >>> >>>>>>>>>>>> (linearly >>> >>>>>>>>>>>> with the >>> >>>>>>>>>>>> number of object, i guess); traversing 50K elements should >>> >>>>>>>>>>>> be >>> >>>>>>>>>>>> independent of >>> >>>>>>>>>>>> the memory size). 
>>> >>>>>>>>>>>> >>> >>>>>>>>>>>> We have tried monitoring memory consumption by external >>> >>>>>>>>>>>> inspection, >>> >>>>>>>>>>>> but >>> >>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >>> >>>>>>>>>>>> same >>> >>>>>>>>>>>> time >>> >>>>>>>>>>>> that >>> >>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't >>> >>>>>>>>>>>> always >>> >>>>>>>>>>>> return >>> >>>>>>>>>>>> freed >>> >>>>>>>>>>>> pages back to the OS? >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Using top, we observe that 10M elements allocates around >>> >>>>>>>>>>>> 17GB >>> >>>>>>>>>>>> after >>> >>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>> >>>>>>>>>>>> 35GB >>> >>>>>>>>>>>> shortly >>> >>>>>>>>>>>> after building). >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Here is output from a few runs with different number of >>> >>>>>>>>>>>> elements: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> pypy mem.py 10000000 >>> >>>>>>>>>>>> start build >>> >>>>>>>>>>>> end build 84.142424 >>> >>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>> >>>>>>>>>>>> 1.495401 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >>> >>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>> >>>>>>>>>>>> 1.488160 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >>> >>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>> >>>>>>>>>>>> 1.474563 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> pypy mem.py 20000000 >>> >>>>>>>>>>>> start build >>> >>>>>>>>>>>> end build 180.823105 >>> >>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>> >>>>>>>>>>>> 2.295146 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >>> >>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>> >>>>>>>>>>>> 2.283927 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >>> >>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>> >>>>>>>>>>>> 2.279631 >>> >>>>>>>>>>>> 
90th_quantile_runtime: 0.371502 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> pypy mem.py 30000000 >>> >>>>>>>>>>>> start build >>> >>>>>>>>>>>> end build 276.217811 >>> >>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>> >>>>>>>>>>>> 3.188464 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >>> >>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>> >>>>>>>>>>>> 3.183003 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>> >>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>> >>>>>>>>>>>> 3.190782 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >>> >>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>> >>>>>>>>>>>> 3.239637 >>> >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Code below >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> -------------------------------------------------------------- >>> >>>>>>>>>>>> import time >>> >>>>>>>>>>>> from random import randint, choice >>> >>>>>>>>>>>> import sys >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> allElems = {} >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> class Node: >>> >>>>>>>>>>>> def __init__(self, v_): >>> >>>>>>>>>>>> self.v = v_ >>> >>>>>>>>>>>> self.next = None >>> >>>>>>>>>>>> self.dummy_data = [randint(0,100) >>> >>>>>>>>>>>> for _ in xrange(randint(50,100))] >>> >>>>>>>>>>>> allElems[self.v] = self >>> >>>>>>>>>>>> if self.v > 0: >>> >>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >>> >>>>>>>>>>>> self.v-1)] >>> >>>>>>>>>>>> for _ >>> >>>>>>>>>>>> in >>> >>>>>>>>>>>> xrange(10)] >>> >>>>>>>>>>>> else: >>> >>>>>>>>>>>> self.dummy_links = [self] >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> def set_next(self, l): >>> >>>>>>>>>>>> self.next = l >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> def follow(node): >>> >>>>>>>>>>>> acc = [] >>> >>>>>>>>>>>> count = 0 >>> >>>>>>>>>>>> cur = node >>> >>>>>>>>>>>> assert node.v is not None >>> >>>>>>>>>>>> assert cur is not None >>> >>>>>>>>>>>> while count < 50000: >>> 
>>>>>>>>>>>> # return a value; generate some garbage >>> >>>>>>>>>>>> acc.append((cur.v, >>> >>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>> >>>>>>>>>>>> for >>> >>>>>>>>>>>> x >>> >>>>>>>>>>>> in >>> >>>>>>>>>>>> xrange(100)])) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> # if we have reached the end, chose a random link >>> >>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>> >>>>>>>>>>>> else >>> >>>>>>>>>>>> cur.next >>> >>>>>>>>>>>> count += 1 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> return acc >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> def build(num_elems): >>> >>>>>>>>>>>> start = time.time() >>> >>>>>>>>>>>> print "start build" >>> >>>>>>>>>>>> root = Node(0) >>> >>>>>>>>>>>> cur = root >>> >>>>>>>>>>>> for x in xrange(1, num_elems): >>> >>>>>>>>>>>> e = Node(x) >>> >>>>>>>>>>>> cur.next = e >>> >>>>>>>>>>>> cur = e >>> >>>>>>>>>>>> print "end build %f" % (time.time() - start) >>> >>>>>>>>>>>> return root >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> num_timings = 100 >>> >>>>>>>>>>>> if __name__ == "__main__": >>> >>>>>>>>>>>> num_elems = int(sys.argv[1]) >>> >>>>>>>>>>>> build(num_elems) >>> >>>>>>>>>>>> total = 0 >>> >>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >>> >>>>>>>>>>>> num_timings >>> >>>>>>>>>>>> runs >>> >>>>>>>>>>>> i = 0 >>> >>>>>>>>>>>> beginning = time.time() >>> >>>>>>>>>>>> while time.time() - beginning < 600: >>> >>>>>>>>>>>> start = time.time() >>> >>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >>> >>>>>>>>>>>> assert(elem is not None) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> lst = follow(elem) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> total += choice(lst)[0] # use the return value for >>> >>>>>>>>>>>> something >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> end = time.time() >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> elapsed = end-start >>> >>>>>>>>>>>> timings[i % num_timings] = elapsed >>> >>>>>>>>>>>> if (i > num_timings): >>> >>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>> >>>>>>>>>>>> defined >>> 
>>>>>>>>>>>> as >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>> 2*avg run time >>> >>>>>>>>>>>> if (elapsed > slow_time): >>> >>>>>>>>>>>> print "that took a long time elapsed: %f >>> >>>>>>>>>>>> slow_threshold: >>> >>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >>> >>>>>>>>>>>> (elapsed, slow_time, >>> >>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >>> >>>>>>>>>>>> i += 1 >>> >>>>>>>>>>>> print total >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> wrote: >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> wrote: >>> >>>>>>>>>>>>>> Hi Armin, Maciej >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Thanks for responding. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >>> >>>>>>>>>>>>>> the >>> >>>>>>>>>>>>>> code >>> >>>>>>>>>>>>>> I'm >>> >>>>>>>>>>>>>> in a >>> >>>>>>>>>>>>>> position to share, and I'll get back to you. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >>> >>>>>>>>>>>>>> would be >>> >>>>>>>>>>>>>> a >>> >>>>>>>>>>>>>> means >>> >>>>>>>>>>>>>> to >>> >>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>> >>>>>>>>>>>>>> memory, >>> >>>>>>>>>>>>>> but I >>> >>>>>>>>>>>>>> would expect that to be a tall order :) >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Thanks, >>> >>>>>>>>>>>>>> /Martin >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Hi Martin. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>> >>>>>>>>>>>>> the >>> >>>>>>>>>>>>> problem, >>> >>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>> >>>>>>>>>>>>> and >>> >>>>>>>>>>>>> stuff). 
>>> >>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>> >>>>>>>>>>>>> isolate >>> >>>>>>>>>>>>> a >>> >>>>>>>>>>>>> part you can share freely :) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>>> >>> >> >>> >> >> >> > From mak at issuu.com Mon Mar 17 17:05:15 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 17:05:15 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Message-ID: <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Thanks :) /Martin > On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: > > ok. > > so as you can probably see, the max is not that big, which means the > GC is really incremental. What happens is you get tons of garbage that > survives minor collection every now and then. I don't exactly know > why, but you should look what objects can potentially survive for too > long. > >> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: >> Ah - it just occurred to me that the first runs may be slow anyway: Since we >> take the average of the last 100 runs as the benchmark, then the first 100 >> runs are not classified as slow. Indeed, the first three runs with many >> collections are in the first 100 runs. >> >> >>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >>> >>> Here are the total and max times in millions of units; 30000 units is >>> approximately 13 seconds. I have extracted the runs where there are many >>> gc-collect-steps. These are in execution order, so the first runs with many >>> gc-collect-steps aren't slow. 
>>> >>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >>> gc-minor:10 gc-collect-step:247 >>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >>> gc-minor:10 gc-collect-step:245 >>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >>> gc-minor:11 gc-collect-step:244 >>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 >>> Max: gc-minor:17 gc-collect-step:244 >>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 >>> Max: gc-minor:11 gc-collect-step:248 >>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 >>> Max: gc-minor:8 gc-collect-step:299 >>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 >>> Max: gc-minor:11 gc-collect-step:246 >>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 >>> Max: gc-minor:8 gc-collect-step:244 >>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 >>> Max: gc-minor:36 gc-collect-step:248 >>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 >>> Max: gc-minor:8 gc-collect-step:244 >>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 >>> Max: gc-minor:8 gc-collect-step:245 >>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 >>> Max: gc-minor:8 gc-collect-step:244 >>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 >>> Max: gc-minor:38 gc-collect-step:244 >>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 >>> Max: gc-minor:23 gc-collect-step:245 >>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 >>> Max: gc-minor:8 gc-collect-step:246 >>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 >>> Max: gc-minor:9 gc-collect-step:244 >>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 >>> Max: gc-minor:8 gc-collect-step:246 >>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 
gc-collect-step:31179 >>> Max: gc-minor:8 gc-collect-step:248 >>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 >>> Max: gc-minor:8 gc-collect-step:250 >>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 >>> Max: gc-minor:8 gc-collect-step:245 >>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 >>> Max: gc-minor:543 gc-collect-step:244 >>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 >>> Max: gc-minor:20 gc-collect-step:246 >>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 >>> Max: gc-minor:25 gc-collect-step:245 >>> >>> Thanks, >>> /Martin >>> >>> >>>> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: >>>> >>>> Ah. I had misunderstood. I'll get back to you on that :) thanks >>>> >>>> /Martin >>>> >>>> >>>>> On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: >>>>> >>>>> eh, this is not what I need >>>>> >>>>> I need a max of TIME it took for a gc-minor and the TOTAL time it took >>>>> for a gc-minor (per query) (ideally same for gc-walkroots and >>>>> gc-collect-step) >>>>> >>>>>> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >>>>>> Here are the collated results of running each query. For each run, I >>>>>> count >>>>>> how many of each of the pypy debug lines i get. I.e. there were 668 >>>>>> runs >>>>>> that printed 58 loglines that contain "{gc-minor" which was eventually >>>>>> followed by "gc-minor}". I have also counted if the query was slow; >>>>>> interestingly, not all the queries with many gc-minors were slow (but >>>>>> all >>>>>> slow queries had a gc-minor). 
>>>>>> >>>>>> Please let me know if this is unclear :) >>>>>> >>>>>> 668 gc-minor:58 gc-minor-walkroots:58 >>>>>> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >>>>>> 140 gc-minor:59 gc-minor-walkroots:59 >>>>>> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >>>>>> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >>>>>> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >>>>>> gc-collect-step:9589 >>>>>> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >>>>>> gc-collect-step:9590 >>>>>> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >>>>>> gc-collect-step:9609 >>>>>> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >>>>>> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >>>>>> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >>>>>> jit-backend-dump:78 >>>>>> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 >>>>>> gc-minor:9030 >>>>>> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >>>>>> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >>>>>> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >>>>>> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 >>>>>> jit-resume:84 >>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >>>>>> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >>>>>> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >>>>>> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >>>>>> jit-resume:14 >>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >>>>>> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>>>>> jit-abort:3 >>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>>>>> jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 jit-resume:84 >>>>>> 2 jit-log-compiling-loop:1 
jit-backend-dump:78 jit-backend:3 >>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >>>>>> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >>>>>> jit-abort:3 >>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >>>>>> jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 jit-resume:84 >>>>>> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >>>>>> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >>>>>> gc-minor:61 >>>>>> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >>>>>> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >>>>>> jit-abort:3 >>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >>>>>> jit-log-opt-bridge:2 >>>>>> jit-log-compiling-bridge:2 jit-resume:104 >>>>>> >>>>>> >>>>>> Thanks, >>>>>> /Martin >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >>>>>> wrote: >>>>>>> >>>>>>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >>>>>>> >>>>>>> wrote: >>>>>>>> are you *sure* it's the walkroots that take that long and not >>>>>>>> something else (like gc-minor)? More of those mean that you allocate >>>>>>>> a >>>>>>>> lot more surviving objects. Can you do two things: >>>>>>>> >>>>>>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >>>>>>>> b) take the sum of those >>>>>>>> >>>>>>>> and plot them >>>>>>> >>>>>>> ^^^ or just paste the results actually >>>>>>> >>>>>>>> >>>>>>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >>>>>>>>> Well, then it works out to around 2.5GHz, which seems reasonable. >>>>>>>>> But >>>>>>>>> it >>>>>>>>> doesn't alter the conclusion from the previous email: The slow >>>>>>>>> queries >>>>>>>>> then >>>>>>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >>>>>>>>> units, or >>>>>>>>> .4 seconds at this conversion. 
Also, the log shows that a slow >>>>>>>>> query >>>>>>>>> performs many more gc-minor operations than a 'normal' one: 9600 >>>>>>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >>>>>>>>> >>>>>>>>> So the question becomes: Why do we get this large spike in >>>>>>>>> gc-minor-walkroots, and, in particular, is there any way to avoid >>>>>>>>> it :) >>>>>>>>> ? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> /Martin >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >>>>>>>>> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> I think it's the cycles of your CPU >>>>>>>>>> >>>>>>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >>>>>>>>>>> wrote: >>>>>>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >>>>>>>>>>> correlate it >>>>>>>>>>> with seconds (which the program does print out). Slow runs are >>>>>>>>>>> around 13 >>>>>>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >>>>>>>>>>> (e.g. >>>>>>>>>>> from >>>>>>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>>>>>>>>>> >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> The number of lines is nonsense. This is a timestamp in hex. >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >>>>>>>>>>>> wrote: >>>>>>>>>>>>> Based On Maciej's suggestion, I tried the following >>>>>>>>>>>>> >>>>>>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >>>>>>>>>>>>> >>>>>>>>>>>>> This generates a logfile which looks something like this >>>>>>>>>>>>> >>>>>>>>>>>>> start--> >>>>>>>>>>>>> [2b99f1981b527e] {gc-minor >>>>>>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >>>>>>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >>>>>>>>>>>>> [2b99f19890d750] gc-minor} >>>>>>>>>>>>> [snip] >>>>>>>>>>>>> ... >>>>>>>>>>>>> <--stop >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> It turns out that the culprit is a lot of MINOR collections. 
>>>>>>>>>>>>> >>>>>>>>>>>>> I base this on the following observations: >>>>>>>>>>>>> >>>>>>>>>>>>> I can't understand the format of the timestamp on each logline >>>>>>>>>>>>> (the >>>>>>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this >>>>>>>>>>>>> should >>>>>>>>>>>>> be >>>>>>>>>>>>> output >>>>>>>>>>>>> from time.clock(), but that doesn't return a number like that >>>>>>>>>>>>> when I >>>>>>>>>>>>> run >>>>>>>>>>>>> pypy interactively >>>>>>>>>>>>> Instead, I count the number of debug lines between start--> and >>>>>>>>>>>>> the >>>>>>>>>>>>> corresponding <--stop. >>>>>>>>>>>>> Most runs have a few hundred lines of output between start/stop >>>>>>>>>>>>> All slow runs have very close to 57800 lines out output between >>>>>>>>>>>>> start/stop >>>>>>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >>>>>>>>>>>>> gc-minor >>>>>>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> /Martin >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>>>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >>>>>>>>>>>>>> stdout) >>>>>>>>>>>>>> which will do that for you btw. >>>>>>>>>>>>>> >>>>>>>>>>>>>> maybe you can find out what's that using profiling or >>>>>>>>>>>>>> valgrind? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> I have tried getting the pypy source and building my own >>>>>>>>>>>>>>> version >>>>>>>>>>>>>>> of >>>>>>>>>>>>>>> pypy. I >>>>>>>>>>>>>>> have modified >>>>>>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> print out when it starts and when it stops. 
Apparently, the >>>>>>>>>>>>>>> slow >>>>>>>>>>>>>>> queries >>>>>>>>>>>>>>> do >>>>>>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >>>>>>>>>>>>>>> observed >>>>>>>>>>>>>>> major >>>>>>>>>>>>>>> step output during a query execution. So, apparently, >>>>>>>>>>>>>>> something >>>>>>>>>>>>>>> else >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>> blocking. This could be another aspect of the GC, but it >>>>>>>>>>>>>>> could >>>>>>>>>>>>>>> also >>>>>>>>>>>>>>> be >>>>>>>>>>>>>>> anything else. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just to be sure, I have tried running the same application in >>>>>>>>>>>>>>> python >>>>>>>>>>>>>>> with >>>>>>>>>>>>>>> garbage collection disabled. I don't see the problem there, >>>>>>>>>>>>>>> so >>>>>>>>>>>>>>> it >>>>>>>>>>>>>>> is >>>>>>>>>>>>>>> somehow >>>>>>>>>>>>>>> related to either GC or the runtime somehow. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> /Martin >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >>>>>>>>>>>>>>>> same >>>>>>>>>>>>>>>> issue. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We basically generate a linked list of objects. To increase >>>>>>>>>>>>>>>> connectedness, >>>>>>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >>>>>>>>>>>>>>>> randomly >>>>>>>>>>>>>>>> chosen >>>>>>>>>>>>>>>> previous elements in the list. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We then time a function that traverses 50000 elements from >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> list >>>>>>>>>>>>>>>> from a >>>>>>>>>>>>>>>> random start point. If the traversal reaches the end of the >>>>>>>>>>>>>>>> list, >>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>> instead >>>>>>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>> traversed >>>>>>>>>>>>>>>> every time. 
To generate some garbage, we build a list >>>>>>>>>>>>>>>> holding >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> traversed >>>>>>>>>>>>>>>> elements and a dummy list of characters. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >>>>>>>>>>>>>>>> buffer. If >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> elapsed time for the last run is more than twice the average >>>>>>>>>>>>>>>> time, >>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>> print >>>>>>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >>>>>>>>>>>>>>>> runtime >>>>>>>>>>>>>>>> (we >>>>>>>>>>>>>>>> would like to see that the mean runtime does not increase >>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> number of >>>>>>>>>>>>>>>> elements in the list, but that the max time does increase >>>>>>>>>>>>>>>> (linearly >>>>>>>>>>>>>>>> with the >>>>>>>>>>>>>>>> number of object, i guess); traversing 50K elements should >>>>>>>>>>>>>>>> be >>>>>>>>>>>>>>>> independent of >>>>>>>>>>>>>>>> the memory size). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We have tried monitoring memory consumption by external >>>>>>>>>>>>>>>> inspection, >>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >>>>>>>>>>>>>>>> same >>>>>>>>>>>>>>>> time >>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't >>>>>>>>>>>>>>>> always >>>>>>>>>>>>>>>> return >>>>>>>>>>>>>>>> freed >>>>>>>>>>>>>>>> pages back to the OS? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Using top, we observe that 10M elements allocates around >>>>>>>>>>>>>>>> 17GB >>>>>>>>>>>>>>>> after >>>>>>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>>>>>>>>>>>>>>> 35GB >>>>>>>>>>>>>>>> shortly >>>>>>>>>>>>>>>> after building). 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is output from a few runs with different number of >>>>>>>>>>>>>>>> elements: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pypy mem.py 10000000 >>>>>>>>>>>>>>>> start build >>>>>>>>>>>>>>>> end build 84.142424 >>>>>>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>>>>>>>>>>>>>>> 1.495401 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >>>>>>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>>>>>>>>>>>>>>> 1.488160 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >>>>>>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>>>>>>>>>>>>>>> 1.474563 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pypy mem.py 20000000 >>>>>>>>>>>>>>>> start build >>>>>>>>>>>>>>>> end build 180.823105 >>>>>>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>>>>>>>>>>>>>>> 2.295146 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >>>>>>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>>>>>>>>>>>>>>> 2.283927 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >>>>>>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>>>>>>>>>>>>>>> 2.279631 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pypy mem.py 30000000 >>>>>>>>>>>>>>>> start build >>>>>>>>>>>>>>>> end build 276.217811 >>>>>>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>>>>>>>>>>>>>>> 3.188464 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >>>>>>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>>>>>>>>>>>>>>> 3.183003 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>>>>>>>>>>>>>>> 3.190782 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >>>>>>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>>>>>>>>>>>>>>> 3.239637 >>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Code below >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>>>>>>> import time >>>>>>>>>>>>>>>> from random import randint, choice >>>>>>>>>>>>>>>> import sys >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> allElems = {} >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> class Node: >>>>>>>>>>>>>>>> def __init__(self, v_): >>>>>>>>>>>>>>>> self.v = v_ >>>>>>>>>>>>>>>> self.next = None >>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) >>>>>>>>>>>>>>>> for _ in xrange(randint(50,100))] >>>>>>>>>>>>>>>> allElems[self.v] = self >>>>>>>>>>>>>>>> if self.v > 0: >>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >>>>>>>>>>>>>>>> self.v-1)] >>>>>>>>>>>>>>>> for _ >>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>> xrange(10)] >>>>>>>>>>>>>>>> else: >>>>>>>>>>>>>>>> self.dummy_links = [self] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> def set_next(self, l): >>>>>>>>>>>>>>>> self.next = l >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> def follow(node): >>>>>>>>>>>>>>>> acc = [] >>>>>>>>>>>>>>>> count = 0 >>>>>>>>>>>>>>>> cur = node >>>>>>>>>>>>>>>> assert node.v is not None >>>>>>>>>>>>>>>> assert cur is not None >>>>>>>>>>>>>>>> while count < 50000: >>>>>>>>>>>>>>>> # return a value; generate some garbage >>>>>>>>>>>>>>>> acc.append((cur.v, >>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>> x >>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>> xrange(100)])) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> # if we have reached the end, chose a random link >>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>> cur.next >>>>>>>>>>>>>>>> count += 1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> return acc >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> def build(num_elems): >>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>> print "start build" >>>>>>>>>>>>>>>> root = Node(0) >>>>>>>>>>>>>>>> cur = root >>>>>>>>>>>>>>>> for x in xrange(1, 
num_elems): >>>>>>>>>>>>>>>> e = Node(x) >>>>>>>>>>>>>>>> cur.next = e >>>>>>>>>>>>>>>> cur = e >>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) >>>>>>>>>>>>>>>> return root >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> num_timings = 100 >>>>>>>>>>>>>>>> if __name__ == "__main__": >>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) >>>>>>>>>>>>>>>> build(num_elems) >>>>>>>>>>>>>>>> total = 0 >>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >>>>>>>>>>>>>>>> num_timings >>>>>>>>>>>>>>>> runs >>>>>>>>>>>>>>>> i = 0 >>>>>>>>>>>>>>>> beginning = time.time() >>>>>>>>>>>>>>>> while time.time() - beginning < 600: >>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >>>>>>>>>>>>>>>> assert(elem is not None) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> lst = follow(elem) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for >>>>>>>>>>>>>>>> something >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> end = time.time() >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> elapsed = end-start >>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed >>>>>>>>>>>>>>>> if (i > num_timings): >>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>>>>>>>>>>>>>>> defined >>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2*avg run time >>>>>>>>>>>>>>>> if (elapsed > slow_time): >>>>>>>>>>>>>>>> print "that took a long time elapsed: %f >>>>>>>>>>>>>>>> slow_threshold: >>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >>>>>>>>>>>>>>>> (elapsed, slow_time, >>>>>>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >>>>>>>>>>>>>>>> i += 1 >>>>>>>>>>>>>>>> print total >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> Hi Armin, Maciej 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for responding. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> code >>>>>>>>>>>>>>>>>> I'm >>>>>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >>>>>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>> means >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>>>>>>>>>>> memory, >>>>>>>>>>>>>>>>>> but I >>>>>>>>>>>>>>>>>> would expect that to be a tall order :) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> /Martin >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Martin. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> problem, >>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> stuff). >>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>>>>>>>>>>> isolate >>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>> part you can share freely :) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>> >>>>>> >>> >>> >> From mak at issuu.com Mon Mar 17 15:24:46 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 15:24:46 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: Message-ID: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> Ah. I had misunderstood. 
I'll get back to you on that :) thanks /Martin > On 17/03/2014, at 15.21, Maciej Fijalkowski wrote: > > eh, this is not what I need > > I need a max of TIME it took for a gc-minor and the TOTAL time it took > for a gc-minor (per query) (ideally same for gc-walkroots and > gc-collect-step) > >> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch wrote: >> Here are the collated results of running each query. For each run, I count >> how many of each of the pypy debug lines i get. I.e. there were 668 runs >> that printed 58 loglines that contain "{gc-minor" which was eventually >> followed by "gc-minor}". I have also counted if the query was slow; >> interestingly, not all the queries with many gc-minors were slow (but all >> slow queries had a gc-minor). >> >> Please let me know if this is unclear :) >> >> 668 gc-minor:58 gc-minor-walkroots:58 >> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >> 140 gc-minor:59 gc-minor-walkroots:59 >> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589 >> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590 >> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609 >> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >> 1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 >> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 >> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 >> jit-resume:84 >> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 >> jit-optimize:2 
jit-log-short-preamble:1 jit-backend-addr:1 >> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 >> jit-resume:14 >> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 >> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 jit-resume:84 >> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 >> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 jit-resume:84 >> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 >> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 >> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 >> jit-log-compiling-bridge:2 jit-resume:104 >> >> >> Thanks, >> /Martin >> >> >> >> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >> wrote: >>> >>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >>> wrote: >>>> are you *sure* it's the walkroots that take that long and not >>>> something else (like gc-minor)? More of those mean that you allocate a >>>> lot more surviving objects. 
Can you do two things: >>>> >>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >>>> b) take the sum of those >>>> >>>> and plot them >>> >>> ^^^ or just paste the results actually >>> >>>> >>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch wrote: >>>>> Well, then it works out to around 2.5GHz, which seems reasonable. But >>>>> it >>>>> doesn't alter the conclusion from the previous email: The slow queries >>>>> then >>>>> all have a duration around 34*10^9 units, 'normal' queries 1*10^9 >>>>> units, or >>>>> .4 seconds at this conversion. Also, the log shows that a slow query >>>>> performs many more gc-minor operations than a 'normal' one: 9600 >>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >>>>> >>>>> So the question becomes: Why do we get this large spike in >>>>> gc-minor-walkroots, and, in particular, is there any way to avoid it :) >>>>> ? >>>>> >>>>> Thanks, >>>>> /Martin >>>>> >>>>> >>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >>>>> wrote: >>>>>> >>>>>> I think it's the cycles of your CPU >>>>>> >>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch wrote: >>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >>>>>>> correlate it >>>>>>> with seconds (which the program does print out). Slow runs are >>>>>>> around 13 >>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >>>>>>> (e.g. >>>>>>> from >>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >>>>>>> >>>>>>> wrote: >>>>>>>> >>>>>>>> The number of lines is nonsense. This is a timestamp in hex. 
>>>>>>>> >>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >>>>>>>> wrote: >>>>>>>>> Based on Maciej's suggestion, I tried the following >>>>>>>>> >>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >>>>>>>>> >>>>>>>>> This generates a logfile which looks something like this >>>>>>>>> >>>>>>>>> start--> >>>>>>>>> [2b99f1981b527e] {gc-minor >>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >>>>>>>>> [2b99f19890d750] gc-minor} >>>>>>>>> [snip] >>>>>>>>> ... >>>>>>>>> <--stop >>>>>>>>> >>>>>>>>> >>>>>>>>> It turns out that the culprit is a lot of MINOR collections. >>>>>>>>> >>>>>>>>> I base this on the following observations: >>>>>>>>> >>>>>>>>> I can't understand the format of the timestamp on each logline >>>>>>>>> (the >>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this should >>>>>>>>> be >>>>>>>>> output >>>>>>>>> from time.clock(), but that doesn't return a number like that >>>>>>>>> when I >>>>>>>>> run >>>>>>>>> pypy interactively >>>>>>>>> Instead, I count the number of debug lines between start--> and >>>>>>>>> the >>>>>>>>> corresponding <--stop. >>>>>>>>> Most runs have a few hundred lines of output between start/stop >>>>>>>>> All slow runs have very close to 57800 lines of output between >>>>>>>>> start/stop >>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >>>>>>>>> gc-minor >>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> /Martin >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>>>>>>>> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >>>>>>>>>> stdout) >>>>>>>>>> which will do that for you btw. >>>>>>>>>> >>>>>>>>>> maybe you can find out what that is using profiling or valgrind?
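The collation described above (counting the debug sections between `start-->` and `<--stop`) and the per-query sum/max that Maciej asks for can be sketched as a small parser. This is an editorial illustration, not code from the thread; it assumes the log format shown above, section names without spaces, and the thread's ~2.5 GHz estimate for converting the bracketed hex timestamps to seconds:

```python
import re
from collections import Counter, defaultdict

TSC_HZ = 2.5e9  # assumption: timestamp ticks ~= CPU cycles at ~2.5 GHz (thread's estimate)

OPEN = re.compile(r'\[([0-9a-f]+)\] \{(\S+)')    # e.g. "[2b99f1981b527e] {gc-minor"
CLOSE = re.compile(r'\[([0-9a-f]+)\] (\S+)\}')   # e.g. "[2b99f19890d750] gc-minor}"

def collate(lines):
    """Yield (counts, totals, maxes) per query, where counts maps each debug
    section (gc-minor, gc-collect-step, ...) to how often it opened, and
    totals/maxes map it to the summed/largest duration in seconds."""
    counts = totals = maxes = opened = None
    for line in lines:
        line = line.strip()
        if line.startswith('start-->'):          # query begins (driver's marker)
            counts, totals, maxes, opened = Counter(), defaultdict(float), defaultdict(float), {}
        elif line.startswith('<--stop'):         # query ends: emit its stats
            yield counts, dict(totals), dict(maxes)
            counts = None
        elif counts is not None:
            m = OPEN.match(line)
            if m:
                counts[m.group(2)] += 1
                opened[m.group(2)] = int(m.group(1), 16)
                continue
            m = CLOSE.match(line)
            if m and m.group(2) in opened:
                dt = (int(m.group(1), 16) - opened.pop(m.group(2))) / TSC_HZ
                totals[m.group(2)] += dt
                maxes[m.group(2)] = max(maxes[m.group(2)], dt)
```

Feeding it the `out` file from `PYPYLOG=- pypy mem.py 10000000 > out` would give, per query, both the section counts collated in this thread and the per-section total/max times requested by Maciej.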
>>>>>>>>>> >>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >>>>>>>>>> wrote: >>>>>>>>>>> I have tried getting the pypy source and building my own >>>>>>>>>>> version >>>>>>>>>>> of >>>>>>>>>>> pypy. I >>>>>>>>>>> have modified >>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>>>>>>>>>> to >>>>>>>>>>> print out when it starts and when it stops. Apparently, the >>>>>>>>>>> slow >>>>>>>>>>> queries >>>>>>>>>>> do >>>>>>>>>>> NOT occur during major_collection_step; at least, I have not >>>>>>>>>>> observed >>>>>>>>>>> major >>>>>>>>>>> step output during a query execution. So, apparently, >>>>>>>>>>> something >>>>>>>>>>> else >>>>>>>>>>> is >>>>>>>>>>> blocking. This could be another aspect of the GC, but it could >>>>>>>>>>> also >>>>>>>>>>> be >>>>>>>>>>> anything else. >>>>>>>>>>> >>>>>>>>>>> Just to be sure, I have tried running the same application in >>>>>>>>>>> python >>>>>>>>>>> with >>>>>>>>>>> garbage collection disabled. I don't see the problem there, so >>>>>>>>>>> it >>>>>>>>>>> is >>>>>>>>>>> somehow >>>>>>>>>>> related to either GC or the runtime somehow. >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> /Martin >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> We have hacked up a small sample that seems to exhibit the >>>>>>>>>>>> same >>>>>>>>>>>> issue. >>>>>>>>>>>> >>>>>>>>>>>> We basically generate a linked list of objects. To increase >>>>>>>>>>>> connectedness, >>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >>>>>>>>>>>> randomly >>>>>>>>>>>> chosen >>>>>>>>>>>> previous elements in the list. >>>>>>>>>>>> >>>>>>>>>>>> We then time a function that traverses 50000 elements from >>>>>>>>>>>> the >>>>>>>>>>>> list >>>>>>>>>>>> from a >>>>>>>>>>>> random start point. If the traversal reaches the end of the >>>>>>>>>>>> list, >>>>>>>>>>>> we >>>>>>>>>>>> instead >>>>>>>>>>>> traverse one of the dummy links. 
Thus, exactly 50K elements >>>>>>>>>>>> are >>>>>>>>>>>> traversed >>>>>>>>>>>> every time. To generate some garbage, we build a list holding >>>>>>>>>>>> the >>>>>>>>>>>> traversed >>>>>>>>>>>> elements and a dummy list of characters. >>>>>>>>>>>> >>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >>>>>>>>>>>> buffer. If >>>>>>>>>>>> the >>>>>>>>>>>> elapsed time for the last run is more than twice the average >>>>>>>>>>>> time, >>>>>>>>>>>> we >>>>>>>>>>>> print >>>>>>>>>>>> out a line with the elapsed time, the threshold, and the 90% >>>>>>>>>>>> runtime >>>>>>>>>>>> (we >>>>>>>>>>>> would like to see that the mean runtime does not increase >>>>>>>>>>>> with >>>>>>>>>>>> the >>>>>>>>>>>> number of >>>>>>>>>>>> elements in the list, but that the max time does increase >>>>>>>>>>>> (linearly >>>>>>>>>>>> with the >>>>>>>>>>>> number of object, i guess); traversing 50K elements should be >>>>>>>>>>>> independent of >>>>>>>>>>>> the memory size). >>>>>>>>>>>> >>>>>>>>>>>> We have tried monitoring memory consumption by external >>>>>>>>>>>> inspection, >>>>>>>>>>>> but >>>>>>>>>>>> cannot consistently verify that memory is deallocated at the >>>>>>>>>>>> same >>>>>>>>>>>> time >>>>>>>>>>>> that >>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't always >>>>>>>>>>>> return >>>>>>>>>>>> freed >>>>>>>>>>>> pages back to the OS? >>>>>>>>>>>> >>>>>>>>>>>> Using top, we observe that 10M elements allocates around 17GB >>>>>>>>>>>> after >>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>>>>>>>>>>> 35GB >>>>>>>>>>>> shortly >>>>>>>>>>>> after building). 
>>>>>>>>>>>> >>>>>>>>>>>> Here is output from a few runs with different number of >>>>>>>>>>>> elements: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> pypy mem.py 10000000 >>>>>>>>>>>> start build >>>>>>>>>>>> end build 84.142424 >>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>>>>>>>>>>> 1.495401 >>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>>>>>>>>>>> 1.488160 >>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>>>>>>>>>>> 1.474563 >>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >>>>>>>>>>>> >>>>>>>>>>>> pypy mem.py 20000000 >>>>>>>>>>>> start build >>>>>>>>>>>> end build 180.823105 >>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>>>>>>>>>>> 2.295146 >>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>>>>>>>>>>> 2.283927 >>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>>>>>>>>>>> 2.279631 >>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >>>>>>>>>>>> >>>>>>>>>>>> pypy mem.py 30000000 >>>>>>>>>>>> start build >>>>>>>>>>>> end build 276.217811 >>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>>>>>>>>>>> 3.188464 >>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>>>>>>>>>>> 3.183003 >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>>>>>>>>>>> 3.190782 >>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >>>>>>>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>>>>>>>>>>> 3.239637 >>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>> >>>>>>>>>>>> Code below >>>>>>>>>>>> >>>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>>> import time >>>>>>>>>>>> from random import randint, 
choice
>>>>>>>>>>>> import sys
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> allElems = {}
>>>>>>>>>>>>
>>>>>>>>>>>> class Node:
>>>>>>>>>>>>     def __init__(self, v_):
>>>>>>>>>>>>         self.v = v_
>>>>>>>>>>>>         self.next = None
>>>>>>>>>>>>         self.dummy_data = [randint(0,100) for _ in xrange(randint(50,100))]
>>>>>>>>>>>>         allElems[self.v] = self
>>>>>>>>>>>>         if self.v > 0:
>>>>>>>>>>>>             self.dummy_links = [allElems[randint(0, self.v-1)] for _ in xrange(10)]
>>>>>>>>>>>>         else:
>>>>>>>>>>>>             self.dummy_links = [self]
>>>>>>>>>>>>
>>>>>>>>>>>>     def set_next(self, l):
>>>>>>>>>>>>         self.next = l
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> def follow(node):
>>>>>>>>>>>>     acc = []
>>>>>>>>>>>>     count = 0
>>>>>>>>>>>>     cur = node
>>>>>>>>>>>>     assert node.v is not None
>>>>>>>>>>>>     assert cur is not None
>>>>>>>>>>>>     while count < 50000:
>>>>>>>>>>>>         # return a value; generate some garbage
>>>>>>>>>>>>         acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz") for x in xrange(100)]))
>>>>>>>>>>>>
>>>>>>>>>>>>         # if we have reached the end, choose a random link
>>>>>>>>>>>>         cur = choice(cur.dummy_links) if cur.next is None else cur.next
>>>>>>>>>>>>         count += 1
>>>>>>>>>>>>
>>>>>>>>>>>>     return acc
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> def build(num_elems):
>>>>>>>>>>>>     start = time.time()
>>>>>>>>>>>>     print "start build"
>>>>>>>>>>>>     root = Node(0)
>>>>>>>>>>>>     cur = root
>>>>>>>>>>>>     for x in xrange(1, num_elems):
>>>>>>>>>>>>         e = Node(x)
>>>>>>>>>>>>         cur.next = e
>>>>>>>>>>>>         cur = e
>>>>>>>>>>>>     print "end build %f" % (time.time() - start)
>>>>>>>>>>>>     return root
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> num_timings = 100
>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>     num_elems = int(sys.argv[1])
>>>>>>>>>>>>     build(num_elems)
>>>>>>>>>>>>     total = 0
>>>>>>>>>>>>     timings = [0.0] * num_timings  # run times for the last num_timings runs
>>>>>>>>>>>>     i = 0
>>>>>>>>>>>>     beginning = time.time()
>>>>>>>>>>>>     while time.time() - beginning < 600:
>>>>>>>>>>>>         start = time.time()
>>>>>>>>>>>>         elem = allElems[randint(0, num_elems - 1)]
>>>>>>>>>>>>         assert(elem is not None)
>>>>>>>>>>>>
>>>>>>>>>>>>         lst = follow(elem)
>>>>>>>>>>>>
>>>>>>>>>>>>         total += choice(lst)[0]  # use the return value for something
>>>>>>>>>>>>
>>>>>>>>>>>>         end = time.time()
>>>>>>>>>>>>
>>>>>>>>>>>>         elapsed = end-start
>>>>>>>>>>>>         timings[i % num_timings] = elapsed
>>>>>>>>>>>>         if (i > num_timings):
>>>>>>>>>>>>             slow_time = 2 * sum(timings)/num_timings  # slow defined as > 2*avg run time
>>>>>>>>>>>>             if (elapsed > slow_time):
>>>>>>>>>>>>                 print "that took a long time elapsed: %f slow_threshold: %f 90th_quantile_runtime: %f" % \
>>>>>>>>>>>>                     (elapsed, slow_time, sorted(timings)[int(num_timings*.9)])
>>>>>>>>>>>>         i += 1
>>>>>>>>>>>>     print total
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch wrote:
>>>>>>>>>>>>>> Hi Armin, Maciej
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for responding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of the code I'm in a position to share, and I'll get back to you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Allowing hinting to the GC would be good.
Even better >>>>>>>>>>>>>> would be >>>>>>>>>>>>>> a >>>>>>>>>>>>>> means >>>>>>>>>>>>>> to >>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>>>>>>> memory, >>>>>>>>>>>>>> but I >>>>>>>>>>>>>> would expect that to be a tall order :) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> /Martin >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Martin. >>>>>>>>>>>>> >>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>>>>>>> the >>>>>>>>>>>>> problem, >>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>>>>>>> and >>>>>>>>>>>>> stuff). >>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>>>>>>> isolate >>>>>>>>>>>>> a >>>>>>>>>>>>> part you can share freely :) From rajul.iitkgp at gmail.com Mon Mar 17 17:34:37 2014 From: rajul.iitkgp at gmail.com (Rajul Srivastava) Date: Mon, 17 Mar 2014 22:04:37 +0530 Subject: [pypy-dev] GSOC: Introduction and Interested in Numpy Improvements Project Message-ID: Hi all, My name is Rajul, and I am a final year undergraduate student at the Indian Institute of Technology Kharagpur. I wish to participate in Google Summer of Code 2014, and while going through the list of organisations, I came across PyPy. I am proficient in the programming languages C/C++, Python, Java, Groovy, and Ruby. I am very interested in the fields of Algorithms, Computational Sciences, and Software Engineering. I have always been interested in programming and in the past I have participated in Google Summer of Code 2012, with the organisation Network Time Foundation, working on the project "Improving the Logging/Debugging System of Network Time Protocol Software". I have also interned in the Global Technology division of Barclays, during the summer of 2013, working with the Market Risk IT team.
Besides, I have worked on a few research projects in the fields of Computational Finance, Complex Networks, and Computational Chemistry. I am currently working on my thesis project in the field of Computational Sciences, titled "Network Analysis of Chemical Reactions". I have taken courses in Programming and Data Structures, Complex Networks, Distributed Systems, Algorithms, and Operations Research. I have gone through the list of project ideas and found all of them very interesting. Although I find all the listed projects worthwhile, I am particularly interested in the "Numpy Improvements" project. I suppose that my programming background is suitable for these projects. I shall be grateful if anyone can help me with references to the literature that I may use, and also shed some light on how I can go about making a successful proposal. Thanks!! Best Regards, Rajul From fijall at gmail.com Mon Mar 17 17:39:13 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 18:39:13 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: not sure how much more we can help without looking into the code On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch wrote: > Thanks :) > > /Martin > > >> On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: >> >> ok. >> >> so as you can probably see, the max is not that big, which means the >> GC is really incremental. What happens is you get tons of garbage that >> survives minor collection every now and then. I don't exactly know >> why, but you should look at what objects can potentially survive for too >> long.
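Maciej's diagnosis, that slow queries coincide with lots of objects surviving minor collection, comes down to an allocation pattern like the toy sketch below (editorial illustration, not code from the thread): objects that are dropped immediately die young and are nearly free for a generational GC, while objects kept reachable, like mem.py's `acc` list and its multi-gigabyte node graph, survive the nursery and must be traced and promoted during every gc-minor:

```python
def churn(n, keep_alive):
    """Allocate n small lists. With keep_alive=True every allocation stays
    reachable, so it survives the nursery and the GC must trace and promote
    it; with keep_alive=False each one dies young and minor collections
    stay cheap."""
    survivors = []
    for i in range(n):
        obj = [i] * 10  # fresh nursery allocation
        if keep_alive:
            survivors.append(obj)  # keeps obj reachable past the next gc-minor
    return survivors
```

Running both variants under `PYPYLOG=gc:-` would be one way to compare how the two patterns affect gc-minor times.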
>> >>> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: >>> Ah - it just occurred to me that the first runs may be slow anyway: Since we >>> take the average of the last 100 runs as the benchmark, then the first 100 >>> runs are not classified as slow. Indeed, the first three runs with many >>> collections are in the first 100 runs. >>> >>> >>>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >>>> >>>> Here are the total and max times in millions of units; 30000 units is >>>> approximately 13 seconds. I have extracted the runs where there are many >>>> gc-collect-steps. These are in execution order, so the first runs with many >>>> gc-collect-steps aren't slow. >>>> >>>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >>>> gc-minor:10 gc-collect-step:247 >>>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >>>> gc-minor:10 gc-collect-step:245 >>>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >>>> gc-minor:11 gc-collect-step:244 >>>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 >>>> Max: gc-minor:17 gc-collect-step:244 >>>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 >>>> Max: gc-minor:11 gc-collect-step:248 >>>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 >>>> Max: gc-minor:8 gc-collect-step:299 >>>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 >>>> Max: gc-minor:11 gc-collect-step:246 >>>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 >>>> Max: gc-minor:8 gc-collect-step:244 >>>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 >>>> Max: gc-minor:36 gc-collect-step:248 >>>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 >>>> Max: gc-minor:8 gc-collect-step:244 >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 >>>> Max: gc-minor:8 gc-collect-step:245 >>>> Totals: gc-minor:387 slow:1
gc-minor-walkroots:0 gc-collect-step:30837 >>>> Max: gc-minor:8 gc-collect-step:244 >>>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 >>>> Max: gc-minor:38 gc-collect-step:244 >>>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 >>>> Max: gc-minor:23 gc-collect-step:245 >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 >>>> Max: gc-minor:8 gc-collect-step:246 >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 >>>> Max: gc-minor:9 gc-collect-step:244 >>>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 >>>> Max: gc-minor:8 gc-collect-step:246 >>>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 >>>> Max: gc-minor:8 gc-collect-step:248 >>>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 >>>> Max: gc-minor:8 gc-collect-step:250 >>>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 >>>> Max: gc-minor:8 gc-collect-step:245 >>>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 >>>> Max: gc-minor:543 gc-collect-step:244 >>>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 >>>> Max: gc-minor:20 gc-collect-step:246 >>>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 >>>> Max: gc-minor:25 gc-collect-step:245 >>>> >>>> Thanks, >>>> /Martin
>>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Code below >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>>>>>>>> import time >>>>>>>>>>>>>>>>> from random import randint, choice >>>>>>>>>>>>>>>>> import sys >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> allElems = {} >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> class Node: >>>>>>>>>>>>>>>>> def __init__(self, v_): >>>>>>>>>>>>>>>>> self.v = v_ >>>>>>>>>>>>>>>>> self.next = None >>>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) >>>>>>>>>>>>>>>>> for _ in xrange(randint(50,100))] >>>>>>>>>>>>>>>>> allElems[self.v] = self >>>>>>>>>>>>>>>>> if self.v > 0: >>>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >>>>>>>>>>>>>>>>> self.v-1)] >>>>>>>>>>>>>>>>> for _ >>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>> xrange(10)] >>>>>>>>>>>>>>>>> else: >>>>>>>>>>>>>>>>> self.dummy_links = [self] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> def set_next(self, l): >>>>>>>>>>>>>>>>> self.next = l >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> def follow(node): >>>>>>>>>>>>>>>>> acc = [] >>>>>>>>>>>>>>>>> count = 0 >>>>>>>>>>>>>>>>> cur = node >>>>>>>>>>>>>>>>> assert node.v is not None >>>>>>>>>>>>>>>>> assert cur is not None >>>>>>>>>>>>>>>>> while count < 50000: >>>>>>>>>>>>>>>>> # return a value; generate some garbage >>>>>>>>>>>>>>>>> acc.append((cur.v, >>>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>> x >>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>> xrange(100)])) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> # if we have reached the end, chose a random link >>>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>> cur.next >>>>>>>>>>>>>>>>> count += 1 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> return acc >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> def build(num_elems): >>>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>>> print 
"start build" >>>>>>>>>>>>>>>>> root = Node(0) >>>>>>>>>>>>>>>>> cur = root >>>>>>>>>>>>>>>>> for x in xrange(1, num_elems): >>>>>>>>>>>>>>>>> e = Node(x) >>>>>>>>>>>>>>>>> cur.next = e >>>>>>>>>>>>>>>>> cur = e >>>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) >>>>>>>>>>>>>>>>> return root >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> num_timings = 100 >>>>>>>>>>>>>>>>> if __name__ == "__main__": >>>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) >>>>>>>>>>>>>>>>> build(num_elems) >>>>>>>>>>>>>>>>> total = 0 >>>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >>>>>>>>>>>>>>>>> num_timings >>>>>>>>>>>>>>>>> runs >>>>>>>>>>>>>>>>> i = 0 >>>>>>>>>>>>>>>>> beginning = time.time() >>>>>>>>>>>>>>>>> while time.time() - beginning < 600: >>>>>>>>>>>>>>>>> start = time.time() >>>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >>>>>>>>>>>>>>>>> assert(elem is not None) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> lst = follow(elem) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for >>>>>>>>>>>>>>>>> something >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> end = time.time() >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> elapsed = end-start >>>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed >>>>>>>>>>>>>>>>> if (i > num_timings): >>>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>>>>>>>>>>>>>>>> defined >>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2*avg run time >>>>>>>>>>>>>>>>> if (elapsed > slow_time): >>>>>>>>>>>>>>>>> print "that took a long time elapsed: %f >>>>>>>>>>>>>>>>> slow_threshold: >>>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >>>>>>>>>>>>>>>>> (elapsed, slow_time, >>>>>>>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >>>>>>>>>>>>>>>>> i += 1 >>>>>>>>>>>>>>>>> print total >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 
wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> Hi Armin, Maciej >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for responding. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if any) of >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> code >>>>>>>>>>>>>>>>>>> I'm >>>>>>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better >>>>>>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>>> means >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>>>>>>>>>>>> memory, >>>>>>>>>>>>>>>>>>> but I >>>>>>>>>>>>>>>>>>> would expect that to be a tall order :) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> /Martin >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Martin. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> problem, >>>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>> stuff). 
>>>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>>>>>>>>>>>> isolate >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>> part you can share freely :) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>>>> >>>> >>>> >>> From mak at issuu.com Mon Mar 17 20:04:47 2014 From: mak at issuu.com (Martin Koch) Date: Mon, 17 Mar 2014 20:04:47 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: Well, it would appear that we have the problem because we're generating a lot of garbage in the young generation, just like we're doing in the example we've been studying here. I'm unsure how we can avoid that in our real implementation. Can we force gc of the young generation? Either by gc.collect() or implicitly somehow (does the gc e.g. kick in across function calls?). Thanks, /Martin On Mon, Mar 17, 2014 at 5:39 PM, Maciej Fijalkowski wrote: > not sure how more we can help without looking into the code > > On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch wrote: > > Thanks :) > > > > /Martin > > > > > >> On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: > >> > >> ok. > >> > >> so as you can probably see, the max is not that big, which means the > >> GC is really incremental. What happens is you get tons of garbage that > >> survives minor collection every now and then. I don't exactly know > >> why, but you should look what objects can potentially survive for too > >> long. > >> > >>> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: > >>> Ah - it just occurred to me that the first runs may be slow anyway: > Since we > >>> take the average of the last 100 runs as the benchmark, then the first > 100 > >>> runs are not classified as slow. Indeed, the first three runs with many > >>> collections are in the first 100 runs. 
> >>> > >>> > >>>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: > >>>> > >>>> Here are the total and max times in millions of units; 30000 units is > >>>> approximately 13 seconds. I have extracted the runs where there are > many > >>>> gc-collect-steps. These are in execution order, so the first runs > with many > >>>> gc-collect-steps aren't slow. > >>>> > >>>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: > >>>> gc-minor:10 gc-collect-step:247 > >>>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: > >>>> gc-minor:10 gc-collect-step:245 > >>>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: > >>>> gc-minor:11 gc-collect-step:244 > >>>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 > >>>> Max: gc-minor:17 gc-collect-step:244 > >>>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 > >>>> Max: gc-minor:11 gc-collect-step:248 > >>>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 > >>>> Max: gc-minor:8 gc-collect-step:299 > >>>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 > >>>> Max: gc-minor:11 gc-collect-step:246 > >>>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 > >>>> Max: gc-minor:8 gc-collect-step:244 > >>>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 > >>>> Max: gc-minor:36 gc-collect-step:248 > >>>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 > >>>> Max: gc-minor:8 gc-collect-step:244 > >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 > >>>> Max: gc-minor:8 gc-collect-step:245 > >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 > >>>> Max: gc-minor:8 gc-collect-step:244 > >>>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 > >>>> Max: gc-minor:38 gc-collect-step:244 > >>>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 
gc-collect-step:30407 > >>>> Max: gc-minor:23 gc-collect-step:245 > >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 > >>>> Max: gc-minor:8 gc-collect-step:246 > >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 > >>>> Max: gc-minor:9 gc-collect-step:244 > >>>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 > >>>> Max: gc-minor:8 gc-collect-step:246 > >>>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 > >>>> Max: gc-minor:8 gc-collect-step:248 > >>>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 > >>>> Max: gc-minor:8 gc-collect-step:250 > >>>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 > >>>> Max: gc-minor:8 gc-collect-step:245 > >>>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 > >>>> Max: gc-minor:543 gc-collect-step:244 > >>>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 > >>>> Max: gc-minor:20 gc-collect-step:246 > >>>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 > >>>> Max: gc-minor:25 gc-collect-step:245 > >>>> > >>>> Thanks, > >>>> /Martin > >>>> > >>>> > >>>>> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: > >>>>> > >>>>> Ah. I had misunderstood. I'll get back to you on that :) thanks > >>>>> > >>>>> /Martin > >>>>> > >>>>> > >>>>>> On 17/03/2014, at 15.21, Maciej Fijalkowski > wrote: > >>>>>> > >>>>>> eh, this is not what I need > >>>>>> > >>>>>> I need a max of TIME it took for a gc-minor and the TOTAL time it > took > >>>>>> for a gc-minor (per query) (ideally same for gc-walkroots and > >>>>>> gc-collect-step) > >>>>>> > >>>>>>> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch > wrote: > >>>>>>> Here are the collated results of running each query. For each run, > I > >>>>>>> count > >>>>>>> how many of each of the pypy debug lines i get. I.e. 
there were 668 > >>>>>>> runs > >>>>>>> that printed 58 loglines that contain "{gc-minor" which was > eventually > >>>>>>> followed by "gc-minor}". I have also counted if the query was slow; > >>>>>>> interestingly, not all the queries with many gc-minors were slow > (but > >>>>>>> all > >>>>>>> slow queries had a gc-minor). > >>>>>>> > >>>>>>> Please let me know if this is unclear :) > >>>>>>> > >>>>>>> 668 gc-minor:58 gc-minor-walkroots:58 > >>>>>>> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 > >>>>>>> 140 gc-minor:59 gc-minor-walkroots:59 > >>>>>>> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 > >>>>>>> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 > >>>>>>> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 > >>>>>>> gc-collect-step:9589 > >>>>>>> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 > >>>>>>> gc-collect-step:9590 > >>>>>>> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 > >>>>>>> gc-collect-step:9609 > >>>>>>> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 > >>>>>>> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 > >>>>>>> 1 jit-log-compiling-loop:1 gc-collect-step:8991 > >>>>>>> jit-backend-dump:78 > >>>>>>> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 > >>>>>>> gc-minor:9030 > >>>>>>> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 > >>>>>>> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 > >>>>>>> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 > >>>>>>> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 > >>>>>>> jit-resume:84 > >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 > >>>>>>> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 > gc-minor-walkroots:60 > >>>>>>> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 > >>>>>>> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 > jit-log-rewritten-loop:1 > >>>>>>> jit-resume:14 > >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:73 
jit-backend:3 > >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 > jit-tracing:3 > >>>>>>> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 > >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 > >>>>>>> jit-abort:3 > >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 > >>>>>>> jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 > >>>>>>> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 > >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 > jit-tracing:3 > >>>>>>> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 > >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 > >>>>>>> jit-abort:3 > >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 > >>>>>>> jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 > >>>>>>> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 > >>>>>>> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 > >>>>>>> gc-minor:61 > >>>>>>> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 > >>>>>>> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 > >>>>>>> jit-abort:3 > >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 > >>>>>>> jit-log-opt-bridge:2 > >>>>>>> jit-log-compiling-bridge:2 jit-resume:104 > >>>>>>> > >>>>>>> > >>>>>>> Thanks, > >>>>>>> /Martin > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski < > fijall at gmail.com> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski > >>>>>>>> > >>>>>>>> wrote: > >>>>>>>>> are you *sure* it's the walkroots that take that long and not > >>>>>>>>> something else (like gc-minor)? More of those mean that you > allocate > >>>>>>>>> a > >>>>>>>>> lot more surviving objects. 
Can you do two things: > >>>>>>>>> > >>>>>>>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request > >>>>>>>>> b) take the sum of those > >>>>>>>>> > >>>>>>>>> and plot them > >>>>>>>> > >>>>>>>> ^^^ or just paste the results actually > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch > wrote: > >>>>>>>>>> Well, then it works out to around 2.5GHz, which seems > reasonable. > >>>>>>>>>> But > >>>>>>>>>> it > >>>>>>>>>> doesn't alter the conclusion from the previous email: The slow > >>>>>>>>>> queries > >>>>>>>>>> then > >>>>>>>>>> all have a duration around 34*10^9 units, 'normal' queries > 1*10^9 > >>>>>>>>>> units, or > >>>>>>>>>> .4 seconds at this conversion. Also, the log shows that a slow > >>>>>>>>>> query > >>>>>>>>>> performs many more gc-minor operations than a 'normal' one: 9600 > >>>>>>>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. > >>>>>>>>>> > >>>>>>>>>> So the question becomes: Why do we get this large spike in > >>>>>>>>>> gc-minor-walkroots, and, in particular, is there any way to > avoid > >>>>>>>>>> it :) > >>>>>>>>>> ? > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> /Martin > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski > >>>>>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> I think it's the cycles of your CPU > >>>>>>>>>>> > >>>>>>>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> What is the unit? Perhaps I'm being thick here, but I can't > >>>>>>>>>>>> correlate it > >>>>>>>>>>>> with seconds (which the program does print out). Slow runs are > >>>>>>>>>>>> around 13 > >>>>>>>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp > units > >>>>>>>>>>>> (e.g. > >>>>>>>>>>>> from > >>>>>>>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). 
> >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski > >>>>>>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> The number of lines is nonsense. This is a timestamp in hex. > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch > > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> Based On Maciej's suggestion, I tried the following > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> This generates a logfile which looks something like this > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> start--> > >>>>>>>>>>>>>> [2b99f1981b527e] {gc-minor > >>>>>>>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots > >>>>>>>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} > >>>>>>>>>>>>>> [2b99f19890d750] gc-minor} > >>>>>>>>>>>>>> [snip] > >>>>>>>>>>>>>> ... > >>>>>>>>>>>>>> <--stop > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It turns out that the culprit is a lot of MINOR collections. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I base this on the following observations: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I can't understand the format of the timestamp on each > logline > >>>>>>>>>>>>>> (the > >>>>>>>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this > >>>>>>>>>>>>>> should > >>>>>>>>>>>>>> be > >>>>>>>>>>>>>> output > >>>>>>>>>>>>>> from time.clock(), but that doesn't return a number like > that > >>>>>>>>>>>>>> when I > >>>>>>>>>>>>>> run > >>>>>>>>>>>>>> pypy interactively > >>>>>>>>>>>>>> Instead, I count the number of debug lines between start--> > and > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>> corresponding <--stop. 
> >>>>>>>>>>>>>> Most runs have a few hundred lines of output between > start/stop > >>>>>>>>>>>>>> All slow runs have very close to 57800 lines out output > between > >>>>>>>>>>>>>> start/stop > >>>>>>>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 > >>>>>>>>>>>>>> gc-minor > >>>>>>>>>>>>>> operations, and 9647 gc-minor-walkroots operations. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> /Martin > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is > >>>>>>>>>>>>>>> stdout) > >>>>>>>>>>>>>>> which will do that for you btw. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> maybe you can find out what's that using profiling or > >>>>>>>>>>>>>>> valgrind? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch < > mak at issuu.com> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>> I have tried getting the pypy source and building my own > >>>>>>>>>>>>>>>> version > >>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>> pypy. I > >>>>>>>>>>>>>>>> have modified > >>>>>>>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() > >>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>> print out when it starts and when it stops. Apparently, > the > >>>>>>>>>>>>>>>> slow > >>>>>>>>>>>>>>>> queries > >>>>>>>>>>>>>>>> do > >>>>>>>>>>>>>>>> NOT occur during major_collection_step; at least, I have > not > >>>>>>>>>>>>>>>> observed > >>>>>>>>>>>>>>>> major > >>>>>>>>>>>>>>>> step output during a query execution. So, apparently, > >>>>>>>>>>>>>>>> something > >>>>>>>>>>>>>>>> else > >>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>> blocking. This could be another aspect of the GC, but it > >>>>>>>>>>>>>>>> could > >>>>>>>>>>>>>>>> also > >>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>> anything else. 
> >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Just to be sure, I have tried running the same > application in > >>>>>>>>>>>>>>>> python > >>>>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>> garbage collection disabled. I don't see the problem > there, > >>>>>>>>>>>>>>>> so > >>>>>>>>>>>>>>>> it > >>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>> somehow > >>>>>>>>>>>>>>>> related to either GC or the runtime somehow. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>>>>> /Martin > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch < > mak at issuu.com> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We have hacked up a small sample that seems to exhibit > the > >>>>>>>>>>>>>>>>> same > >>>>>>>>>>>>>>>>> issue. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We basically generate a linked list of objects. To > increase > >>>>>>>>>>>>>>>>> connectedness, > >>>>>>>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 > >>>>>>>>>>>>>>>>> randomly > >>>>>>>>>>>>>>>>> chosen > >>>>>>>>>>>>>>>>> previous elements in the list. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We then time a function that traverses 50000 elements > from > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> list > >>>>>>>>>>>>>>>>> from a > >>>>>>>>>>>>>>>>> random start point. If the traversal reaches the end of > the > >>>>>>>>>>>>>>>>> list, > >>>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>> instead > >>>>>>>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K > elements > >>>>>>>>>>>>>>>>> are > >>>>>>>>>>>>>>>>> traversed > >>>>>>>>>>>>>>>>> every time. To generate some garbage, we build a list > >>>>>>>>>>>>>>>>> holding > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> traversed > >>>>>>>>>>>>>>>>> elements and a dummy list of characters. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Timings for the last 100 runs are stored in a circular > >>>>>>>>>>>>>>>>> buffer. 
If > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> elapsed time for the last run is more than twice the > average > >>>>>>>>>>>>>>>>> time, > >>>>>>>>>>>>>>>>> we > >>>>>>>>>>>>>>>>> print > >>>>>>>>>>>>>>>>> out a line with the elapsed time, the threshold, and the > 90% > >>>>>>>>>>>>>>>>> runtime > >>>>>>>>>>>>>>>>> (we > >>>>>>>>>>>>>>>>> would like to see that the mean runtime does not increase > >>>>>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> number of > >>>>>>>>>>>>>>>>> elements in the list, but that the max time does increase > >>>>>>>>>>>>>>>>> (linearly > >>>>>>>>>>>>>>>>> with the > >>>>>>>>>>>>>>>>> number of object, i guess); traversing 50K elements > should > >>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>> independent of > >>>>>>>>>>>>>>>>> the memory size). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> We have tried monitoring memory consumption by external > >>>>>>>>>>>>>>>>> inspection, > >>>>>>>>>>>>>>>>> but > >>>>>>>>>>>>>>>>> cannot consistently verify that memory is deallocated at > the > >>>>>>>>>>>>>>>>> same > >>>>>>>>>>>>>>>>> time > >>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't > >>>>>>>>>>>>>>>>> always > >>>>>>>>>>>>>>>>> return > >>>>>>>>>>>>>>>>> freed > >>>>>>>>>>>>>>>>> pages back to the OS? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Using top, we observe that 10M elements allocates around > >>>>>>>>>>>>>>>>> 17GB > >>>>>>>>>>>>>>>>> after > >>>>>>>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and > grows to > >>>>>>>>>>>>>>>>> 35GB > >>>>>>>>>>>>>>>>> shortly > >>>>>>>>>>>>>>>>> after building). 
> >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Here is output from a few runs with different number of > >>>>>>>>>>>>>>>>> elements: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> pypy mem.py 10000000 > >>>>>>>>>>>>>>>>> start build > >>>>>>>>>>>>>>>>> end build 84.142424 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.230586 slow_threshold: > >>>>>>>>>>>>>>>>> 1.495401 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.421558 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.016531 slow_threshold: > >>>>>>>>>>>>>>>>> 1.488160 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.423441 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.032537 slow_threshold: > >>>>>>>>>>>>>>>>> 1.474563 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.419817 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> pypy mem.py 20000000 > >>>>>>>>>>>>>>>>> start build > >>>>>>>>>>>>>>>>> end build 180.823105 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 27.346064 slow_threshold: > >>>>>>>>>>>>>>>>> 2.295146 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.434726 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 26.028852 slow_threshold: > >>>>>>>>>>>>>>>>> 2.283927 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.374190 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 25.432279 slow_threshold: > >>>>>>>>>>>>>>>>> 2.279631 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.371502 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> pypy mem.py 30000000 > >>>>>>>>>>>>>>>>> start build > >>>>>>>>>>>>>>>>> end build 276.217811 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 40.993855 slow_threshold: > >>>>>>>>>>>>>>>>> 3.188464 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.459891 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 41.693553 slow_threshold: > >>>>>>>>>>>>>>>>> 3.183003 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 > >>>>>>>>>>>>>>>>> that took a long time elapsed: 39.679769 slow_threshold: > >>>>>>>>>>>>>>>>> 3.190782 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393677 > >>>>>>>>>>>>>>>>> that took 
a long time elapsed: 43.573411 slow_threshold: > >>>>>>>>>>>>>>>>> 3.239637 > >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Code below > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > -------------------------------------------------------------- > >>>>>>>>>>>>>>>>> import time > >>>>>>>>>>>>>>>>> from random import randint, choice > >>>>>>>>>>>>>>>>> import sys > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> allElems = {} > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> class Node: > >>>>>>>>>>>>>>>>> def __init__(self, v_): > >>>>>>>>>>>>>>>>> self.v = v_ > >>>>>>>>>>>>>>>>> self.next = None > >>>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) > >>>>>>>>>>>>>>>>> for _ in > xrange(randint(50,100))] > >>>>>>>>>>>>>>>>> allElems[self.v] = self > >>>>>>>>>>>>>>>>> if self.v > 0: > >>>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, > >>>>>>>>>>>>>>>>> self.v-1)] > >>>>>>>>>>>>>>>>> for _ > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>> xrange(10)] > >>>>>>>>>>>>>>>>> else: > >>>>>>>>>>>>>>>>> self.dummy_links = [self] > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> def set_next(self, l): > >>>>>>>>>>>>>>>>> self.next = l > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> def follow(node): > >>>>>>>>>>>>>>>>> acc = [] > >>>>>>>>>>>>>>>>> count = 0 > >>>>>>>>>>>>>>>>> cur = node > >>>>>>>>>>>>>>>>> assert node.v is not None > >>>>>>>>>>>>>>>>> assert cur is not None > >>>>>>>>>>>>>>>>> while count < 50000: > >>>>>>>>>>>>>>>>> # return a value; generate some garbage > >>>>>>>>>>>>>>>>> acc.append((cur.v, > >>>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") > >>>>>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>> x > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>> xrange(100)])) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> # if we have reached the end, chose a random link > >>>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None > >>>>>>>>>>>>>>>>> else > >>>>>>>>>>>>>>>>> cur.next > >>>>>>>>>>>>>>>>> count += 1 > 
>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> return acc > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> def build(num_elems): > >>>>>>>>>>>>>>>>> start = time.time() > >>>>>>>>>>>>>>>>> print "start build" > >>>>>>>>>>>>>>>>> root = Node(0) > >>>>>>>>>>>>>>>>> cur = root > >>>>>>>>>>>>>>>>> for x in xrange(1, num_elems): > >>>>>>>>>>>>>>>>> e = Node(x) > >>>>>>>>>>>>>>>>> cur.next = e > >>>>>>>>>>>>>>>>> cur = e > >>>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) > >>>>>>>>>>>>>>>>> return root > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> num_timings = 100 > >>>>>>>>>>>>>>>>> if __name__ == "__main__": > >>>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) > >>>>>>>>>>>>>>>>> build(num_elems) > >>>>>>>>>>>>>>>>> total = 0 > >>>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last > >>>>>>>>>>>>>>>>> num_timings > >>>>>>>>>>>>>>>>> runs > >>>>>>>>>>>>>>>>> i = 0 > >>>>>>>>>>>>>>>>> beginning = time.time() > >>>>>>>>>>>>>>>>> while time.time() - beginning < 600: > >>>>>>>>>>>>>>>>> start = time.time() > >>>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] > >>>>>>>>>>>>>>>>> assert(elem is not None) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> lst = follow(elem) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for > >>>>>>>>>>>>>>>>> something > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> end = time.time() > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> elapsed = end-start > >>>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed > >>>>>>>>>>>>>>>>> if (i > num_timings): > >>>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # slow > >>>>>>>>>>>>>>>>> defined > >>>>>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2*avg run time > >>>>>>>>>>>>>>>>> if (elapsed > slow_time): > >>>>>>>>>>>>>>>>> print "that took a long time elapsed: %f > >>>>>>>>>>>>>>>>> slow_threshold: > >>>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ > >>>>>>>>>>>>>>>>> (elapsed, slow_time, > >>>>>>>>>>>>>>>>> 
sorted(timings)[int(num_timings*.9)]) > >>>>>>>>>>>>>>>>> i += 1 > >>>>>>>>>>>>>>>>> print total > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> Hi Armin, Maciej > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks for responding. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if > any) of > >>>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> code > >>>>>>>>>>>>>>>>>>> I'm > >>>>>>>>>>>>>>>>>>> in a > >>>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. Even better > >>>>>>>>>>>>>>>>>>> would be > >>>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>> means > >>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in > unmanaged > >>>>>>>>>>>>>>>>>>> memory, > >>>>>>>>>>>>>>>>>>> but I > >>>>>>>>>>>>>>>>>>> would expect that to be a tall order :) > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>> /Martin > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Martin. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Note that in case you want us to do the work of > isolating > >>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> problem, > >>>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign > NDAs > >>>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>> stuff). 
> >>>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once > you > >>>>>>>>>>>>>>>>>> isolate > >>>>>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>> part you can share freely :) > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>>>>> > >>>> > >>>> > >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 17 20:51:04 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 17 Mar 2014 21:51:04 +0200 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: no, the problem here is definitely that you're generating a lot of garbage that survives the young generation. you can always try playing with the PYPY_GC_NURSERY environment var (defaults to 4M I think) On Mon, Mar 17, 2014 at 9:04 PM, Martin Koch wrote: > Well, it would appear that we have the problem because we're generating a > lot of garbage in the young generation, just like we're doing in the example > we've been studying here. I'm unsure how we can avoid that in our real > implementation. Can we force gc of the young generation? Either by > gc.collect() or implicitly somehow (does the gc e.g. kick in across function > calls?). > > Thanks, > /Martin > > > On Mon, Mar 17, 2014 at 5:39 PM, Maciej Fijalkowski > wrote: >> >> not sure how much more we can help without looking into the code >> >> On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch wrote: >> > Thanks :) >> > >> > /Martin >> > >> > >> >> On 17/03/2014, at 16.41, Maciej Fijalkowski wrote: >> >> >> >> ok. >> >> >> >> so as you can probably see, the max is not that big, which means the >> >> GC is really incremental. What happens is you get tons of garbage that >> >> survives minor collection every now and then. 
I don't exactly know >> >> why, but you should look what objects can potentially survive for too >> >> long. >> >> >> >>> On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch wrote: >> >>> Ah - it just occured to me that the first runs may be slow anyway: >> >>> Since we >> >>> take the average of the last 100 runs as the benchmark, then the first >> >>> 100 >> >>> runs are not classified as slow. Indeed, the first three runs with >> >>> many >> >>> collections are in the first 100 runs. >> >>> >> >>> >> >>>> On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch wrote: >> >>>> >> >>>> Here are the total and max times in millions of units; 30000 units is >> >>>> approximately 13 seconds. I have extracted the runs where there are >> >>>> many >> >>>> gc-collect-steps. These are in execution order, so the first runs >> >>>> with many >> >>>> gc-collect-steps aren't slow. >> >>>> >> >>>> Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: >> >>>> gc-minor:10 gc-collect-step:247 >> >>>> Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: >> >>>> gc-minor:10 gc-collect-step:245 >> >>>> Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: >> >>>> gc-minor:11 gc-collect-step:244 >> >>>> Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31270 >> >>>> Max: gc-minor:17 gc-collect-step:244 >> >>>> Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30365 >> >>>> Max: gc-minor:11 gc-collect-step:248 >> >>>> Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31235 >> >>>> Max: gc-minor:8 gc-collect-step:299 >> >>>> Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31124 >> >>>> Max: gc-minor:11 gc-collect-step:246 >> >>>> Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30541 >> >>>> Max: gc-minor:8 gc-collect-step:244 >> >>>> Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31427 >> >>>> Max: gc-minor:36 
gc-collect-step:248 >> >>>> Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30743 >> >>>> Max: gc-minor:8 gc-collect-step:244 >> >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30207 >> >>>> Max: gc-minor:8 gc-collect-step:245 >> >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30837 >> >>>> Max: gc-minor:8 gc-collect-step:244 >> >>>> Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30898 >> >>>> Max: gc-minor:38 gc-collect-step:244 >> >>>> Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30407 >> >>>> Max: gc-minor:23 gc-collect-step:245 >> >>>> Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30591 >> >>>> Max: gc-minor:8 gc-collect-step:246 >> >>>> Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31193 >> >>>> Max: gc-minor:9 gc-collect-step:244 >> >>>> Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30026 >> >>>> Max: gc-minor:8 gc-collect-step:246 >> >>>> Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31179 >> >>>> Max: gc-minor:8 gc-collect-step:248 >> >>>> Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30674 >> >>>> Max: gc-minor:8 gc-collect-step:250 >> >>>> Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30413 >> >>>> Max: gc-minor:8 gc-collect-step:245 >> >>>> Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:30830 >> >>>> Max: gc-minor:543 gc-collect-step:244 >> >>>> Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:31153 >> >>>> Max: gc-minor:20 gc-collect-step:246 >> >>>> Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 >> >>>> gc-collect-step:29815 >> >>>> Max: gc-minor:25 gc-collect-step:245 >> >>>> >> >>>> Thanks, >> >>>> /Martin >> >>>> >> >>>> >> >>>>> On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch wrote: >> >>>>> >> >>>>> Ah. 
I had misunderstood. I'll get back to you on that :) thanks >> >>>>> >> >>>>> /Martin >> >>>>> >> >>>>> >> >>>>>> On 17/03/2014, at 15.21, Maciej Fijalkowski >> >>>>>> wrote: >> >>>>>> >> >>>>>> eh, this is not what I need >> >>>>>> >> >>>>>> I need a max of TIME it took for a gc-minor and the TOTAL time it >> >>>>>> took >> >>>>>> for a gc-minor (per query) (ideally same for gc-walkroots and >> >>>>>> gc-collect-step) >> >>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch >> >>>>>>> wrote: >> >>>>>>> Here are the collated results of running each query. For each run, >> >>>>>>> I >> >>>>>>> count >> >>>>>>> how many of each of the pypy debug lines i get. I.e. there were >> >>>>>>> 668 >> >>>>>>> runs >> >>>>>>> that printed 58 loglines that contain "{gc-minor" which was >> >>>>>>> eventually >> >>>>>>> followed by "gc-minor}". I have also counted if the query was >> >>>>>>> slow; >> >>>>>>> interestingly, not all the queries with many gc-minors were slow >> >>>>>>> (but >> >>>>>>> all >> >>>>>>> slow queries had a gc-minor). 
>> >>>>>>> >> >>>>>>> Please let me know if this is unclear :) >> >>>>>>> >> >>>>>>> 668 gc-minor:58 gc-minor-walkroots:58 >> >>>>>>> 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 >> >>>>>>> 140 gc-minor:59 gc-minor-walkroots:59 >> >>>>>>> 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 >> >>>>>>> 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 >> >>>>>>> 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 >> >>>>>>> gc-collect-step:9589 >> >>>>>>> 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 >> >>>>>>> gc-collect-step:9590 >> >>>>>>> 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 >> >>>>>>> gc-collect-step:9609 >> >>>>>>> 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 >> >>>>>>> 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 >> >>>>>>> 1 jit-log-compiling-loop:1 gc-collect-step:8991 >> >>>>>>> jit-backend-dump:78 >> >>>>>>> jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 >> >>>>>>> gc-minor:9030 >> >>>>>>> jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 >> >>>>>>> jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 >> >>>>>>> jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 >> >>>>>>> jit-log-rewritten-loop:1 jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 >> >>>>>>> jit-resume:84 >> >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 >> >>>>>>> jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 >> >>>>>>> gc-minor-walkroots:60 >> >>>>>>> jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 >> >>>>>>> jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 >> >>>>>>> jit-log-rewritten-loop:1 >> >>>>>>> jit-resume:14 >> >>>>>>> 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 >> >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 >> >>>>>>> jit-tracing:3 >> >>>>>>> gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 >> >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> >>>>>>> jit-abort:3 >> 
>>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> >>>>>>> jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 >> >>>>>>> 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 >> >>>>>>> jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 >> >>>>>>> jit-tracing:3 >> >>>>>>> gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 >> >>>>>>> jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 >> >>>>>>> jit-abort:3 >> >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 >> >>>>>>> jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 jit-resume:84 >> >>>>>>> 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 >> >>>>>>> jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 >> >>>>>>> gc-minor:61 >> >>>>>>> jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 >> >>>>>>> jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 >> >>>>>>> jit-abort:3 >> >>>>>>> jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 >> >>>>>>> jit-log-opt-bridge:2 >> >>>>>>> jit-log-compiling-bridge:2 jit-resume:104 >> >>>>>>> >> >>>>>>> >> >>>>>>> Thanks, >> >>>>>>> /Martin >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski >> >>>>>>> >> >>>>>>> wrote: >> >>>>>>>> >> >>>>>>>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski >> >>>>>>>> >> >>>>>>>> wrote: >> >>>>>>>>> are you *sure* it's the walkroots that take that long and not >> >>>>>>>>> something else (like gc-minor)? More of those mean that you >> >>>>>>>>> allocate >> >>>>>>>>> a >> >>>>>>>>> lot more surviving objects. 
Can you do two things: >> >>>>>>>>> >> >>>>>>>>> a) take a max of gc-minor (and gc-minor-stackwalk), per request >> >>>>>>>>> b) take the sum of those >> >>>>>>>>> >> >>>>>>>>> and plot them >> >>>>>>>> >> >>>>>>>> ^^^ or just paste the results actually >> >>>>>>>> >> >>>>>>>>> >> >>>>>>>>>> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch >> >>>>>>>>>> wrote: >> >>>>>>>>>> Well, then it works out to around 2.5GHz, which seems >> >>>>>>>>>> reasonable. >> >>>>>>>>>> But >> >>>>>>>>>> it >> >>>>>>>>>> doesn't alter the conclusion from the previous email: The slow >> >>>>>>>>>> queries >> >>>>>>>>>> then >> >>>>>>>>>> all have a duration around 34*10^9 units, 'normal' queries >> >>>>>>>>>> 1*10^9 >> >>>>>>>>>> units, or >> >>>>>>>>>> .4 seconds at this conversion. Also, the log shows that a slow >> >>>>>>>>>> query >> >>>>>>>>>> performs many more gc-minor operations than a 'normal' one: >> >>>>>>>>>> 9600 >> >>>>>>>>>> gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58. >> >>>>>>>>>> >> >>>>>>>>>> So the question becomes: Why do we get this large spike in >> >>>>>>>>>> gc-minor-walkroots, and, in particular, is there any way to >> >>>>>>>>>> avoid >> >>>>>>>>>> it :) >> >>>>>>>>>> ? >> >>>>>>>>>> >> >>>>>>>>>> Thanks, >> >>>>>>>>>> /Martin >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski >> >>>>>>>>>> >> >>>>>>>>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>> I think it's the cycles of your CPU >> >>>>>>>>>>> >> >>>>>>>>>>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>> What is the unit? Perhaps I'm being thick here, but I can't >> >>>>>>>>>>>> correlate it >> >>>>>>>>>>>> with seconds (which the program does print out). Slow runs >> >>>>>>>>>>>> are >> >>>>>>>>>>>> around 13 >> >>>>>>>>>>>> seconds, but are around 34*10^9(dec), 0x800000000 timestamp >> >>>>>>>>>>>> units >> >>>>>>>>>>>> (e.g. >> >>>>>>>>>>>> from >> >>>>>>>>>>>> 0x2b994c9d31889c to 0x2b9944ab8c4f49). 
>> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> >>>>>>>>>>>> >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> The number of lines is nonsense. This is a timestamp in hex. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>> Based On Maciej's suggestion, I tried the following >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> PYPYLOG=- pypy mem.py 10000000 > out >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> This generates a logfile which looks something like this >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> start--> >> >>>>>>>>>>>>>> [2b99f1981b527e] {gc-minor >> >>>>>>>>>>>>>> [2b99f1981ba680] {gc-minor-walkroots >> >>>>>>>>>>>>>> [2b99f1981c2e02] gc-minor-walkroots} >> >>>>>>>>>>>>>> [2b99f19890d750] gc-minor} >> >>>>>>>>>>>>>> [snip] >> >>>>>>>>>>>>>> ... >> >>>>>>>>>>>>>> <--stop >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> It turns out that the culprit is a lot of MINOR >> >>>>>>>>>>>>>> collections. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I base this on the following observations: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> I can't understand the format of the timestamp on each >> >>>>>>>>>>>>>> logline >> >>>>>>>>>>>>>> (the >> >>>>>>>>>>>>>> "[2b99f1981b527e]"). From what I can see in the code, this >> >>>>>>>>>>>>>> should >> >>>>>>>>>>>>>> be >> >>>>>>>>>>>>>> output >> >>>>>>>>>>>>>> from time.clock(), but that doesn't return a number like >> >>>>>>>>>>>>>> that >> >>>>>>>>>>>>>> when I >> >>>>>>>>>>>>>> run >> >>>>>>>>>>>>>> pypy interactively >> >>>>>>>>>>>>>> Instead, I count the number of debug lines between start--> >> >>>>>>>>>>>>>> and >> >>>>>>>>>>>>>> the >> >>>>>>>>>>>>>> corresponding <--stop. 
>> >>>>>>>>>>>>>> Most runs have a few hundred lines of output between >> >>>>>>>>>>>>>> start/stop >> >>>>>>>>>>>>>> All slow runs have very close to 57800 lines out output >> >>>>>>>>>>>>>> between >> >>>>>>>>>>>>>> start/stop >> >>>>>>>>>>>>>> One such sample does 9609 gc-collect-step operations, 9647 >> >>>>>>>>>>>>>> gc-minor >> >>>>>>>>>>>>>> operations, and 9647 gc-minor-walkroots operations. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> there is an environment variable PYPYLOG=gc:- (where - is >> >>>>>>>>>>>>>>> stdout) >> >>>>>>>>>>>>>>> which will do that for you btw. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> maybe you can find out what's that using profiling or >> >>>>>>>>>>>>>>> valgrind? >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>> I have tried getting the pypy source and building my own >> >>>>>>>>>>>>>>>> version >> >>>>>>>>>>>>>>>> of >> >>>>>>>>>>>>>>>> pypy. I >> >>>>>>>>>>>>>>>> have modified >> >>>>>>>>>>>>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >> >>>>>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>> print out when it starts and when it stops. Apparently, >> >>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>> slow >> >>>>>>>>>>>>>>>> queries >> >>>>>>>>>>>>>>>> do >> >>>>>>>>>>>>>>>> NOT occur during major_collection_step; at least, I have >> >>>>>>>>>>>>>>>> not >> >>>>>>>>>>>>>>>> observed >> >>>>>>>>>>>>>>>> major >> >>>>>>>>>>>>>>>> step output during a query execution. So, apparently, >> >>>>>>>>>>>>>>>> something >> >>>>>>>>>>>>>>>> else >> >>>>>>>>>>>>>>>> is >> >>>>>>>>>>>>>>>> blocking. 
This could be another aspect of the GC, but it >> >>>>>>>>>>>>>>>> could >> >>>>>>>>>>>>>>>> also >> >>>>>>>>>>>>>>>> be >> >>>>>>>>>>>>>>>> anything else. >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Just to be sure, I have tried running the same >> >>>>>>>>>>>>>>>> application in >> >>>>>>>>>>>>>>>> python >> >>>>>>>>>>>>>>>> with >> >>>>>>>>>>>>>>>> garbage collection disabled. I don't see the problem >> >>>>>>>>>>>>>>>> there, >> >>>>>>>>>>>>>>>> so >> >>>>>>>>>>>>>>>> it >> >>>>>>>>>>>>>>>> is >> >>>>>>>>>>>>>>>> somehow >> >>>>>>>>>>>>>>>> related to either GC or the runtime somehow. >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Cheers, >> >>>>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We have hacked up a small sample that seems to exhibit >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> same >> >>>>>>>>>>>>>>>>> issue. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We basically generate a linked list of objects. To >> >>>>>>>>>>>>>>>>> increase >> >>>>>>>>>>>>>>>>> connectedness, >> >>>>>>>>>>>>>>>>> elements in the list hold references (dummy_links) to 10 >> >>>>>>>>>>>>>>>>> randomly >> >>>>>>>>>>>>>>>>> chosen >> >>>>>>>>>>>>>>>>> previous elements in the list. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We then time a function that traverses 50000 elements >> >>>>>>>>>>>>>>>>> from >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> list >> >>>>>>>>>>>>>>>>> from a >> >>>>>>>>>>>>>>>>> random start point. If the traversal reaches the end of >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> list, >> >>>>>>>>>>>>>>>>> we >> >>>>>>>>>>>>>>>>> instead >> >>>>>>>>>>>>>>>>> traverse one of the dummy links. Thus, exactly 50K >> >>>>>>>>>>>>>>>>> elements >> >>>>>>>>>>>>>>>>> are >> >>>>>>>>>>>>>>>>> traversed >> >>>>>>>>>>>>>>>>> every time. 
To generate some garbage, we build a list >> >>>>>>>>>>>>>>>>> holding >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> traversed >> >>>>>>>>>>>>>>>>> elements and a dummy list of characters. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Timings for the last 100 runs are stored in a circular >> >>>>>>>>>>>>>>>>> buffer. If >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> elapsed time for the last run is more than twice the >> >>>>>>>>>>>>>>>>> average >> >>>>>>>>>>>>>>>>> time, >> >>>>>>>>>>>>>>>>> we >> >>>>>>>>>>>>>>>>> print >> >>>>>>>>>>>>>>>>> out a line with the elapsed time, the threshold, and the >> >>>>>>>>>>>>>>>>> 90% >> >>>>>>>>>>>>>>>>> runtime >> >>>>>>>>>>>>>>>>> (we >> >>>>>>>>>>>>>>>>> would like to see that the mean runtime does not >> >>>>>>>>>>>>>>>>> increase >> >>>>>>>>>>>>>>>>> with >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> number of >> >>>>>>>>>>>>>>>>> elements in the list, but that the max time does >> >>>>>>>>>>>>>>>>> increase >> >>>>>>>>>>>>>>>>> (linearly >> >>>>>>>>>>>>>>>>> with the >> >>>>>>>>>>>>>>>>> number of object, i guess); traversing 50K elements >> >>>>>>>>>>>>>>>>> should >> >>>>>>>>>>>>>>>>> be >> >>>>>>>>>>>>>>>>> independent of >> >>>>>>>>>>>>>>>>> the memory size). >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> We have tried monitoring memory consumption by external >> >>>>>>>>>>>>>>>>> inspection, >> >>>>>>>>>>>>>>>>> but >> >>>>>>>>>>>>>>>>> cannot consistently verify that memory is deallocated at >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> same >> >>>>>>>>>>>>>>>>> time >> >>>>>>>>>>>>>>>>> that >> >>>>>>>>>>>>>>>>> we see slow requests. Perhaps the pypy runtime doesn't >> >>>>>>>>>>>>>>>>> always >> >>>>>>>>>>>>>>>>> return >> >>>>>>>>>>>>>>>>> freed >> >>>>>>>>>>>>>>>>> pages back to the OS? 
>> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Using top, we observe that 10M elements allocates around >> >>>>>>>>>>>>>>>>> 17GB >> >>>>>>>>>>>>>>>>> after >> >>>>>>>>>>>>>>>>> building, 20M elements 26GB, 30M elements 28GB (and >> >>>>>>>>>>>>>>>>> grows to >> >>>>>>>>>>>>>>>>> 35GB >> >>>>>>>>>>>>>>>>> shortly >> >>>>>>>>>>>>>>>>> after building). >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Here is output from a few runs with different number of >> >>>>>>>>>>>>>>>>> elements: >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> pypy mem.py 10000000 >> >>>>>>>>>>>>>>>>> start build >> >>>>>>>>>>>>>>>>> end build 84.142424 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.230586 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 1.495401 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.421558 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.016531 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 1.488160 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.423441 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 13.032537 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 1.474563 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.419817 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> pypy mem.py 20000000 >> >>>>>>>>>>>>>>>>> start build >> >>>>>>>>>>>>>>>>> end build 180.823105 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 27.346064 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 2.295146 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.434726 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 26.028852 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 2.283927 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.374190 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 25.432279 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 2.279631 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.371502 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> pypy mem.py 30000000 >> >>>>>>>>>>>>>>>>> start build >> >>>>>>>>>>>>>>>>> end build 
276.217811 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 40.993855 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.188464 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.459891 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 41.693553 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.183003 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 39.679769 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.190782 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393677 >> >>>>>>>>>>>>>>>>> that took a long time elapsed: 43.573411 >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> 3.239637 >> >>>>>>>>>>>>>>>>> 90th_quantile_runtime: 0.393654 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Code below >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> -------------------------------------------------------------- >> >>>>>>>>>>>>>>>>> import time >> >>>>>>>>>>>>>>>>> from random import randint, choice >> >>>>>>>>>>>>>>>>> import sys >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> allElems = {} >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> class Node: >> >>>>>>>>>>>>>>>>> def __init__(self, v_): >> >>>>>>>>>>>>>>>>> self.v = v_ >> >>>>>>>>>>>>>>>>> self.next = None >> >>>>>>>>>>>>>>>>> self.dummy_data = [randint(0,100) >> >>>>>>>>>>>>>>>>> for _ in >> >>>>>>>>>>>>>>>>> xrange(randint(50,100))] >> >>>>>>>>>>>>>>>>> allElems[self.v] = self >> >>>>>>>>>>>>>>>>> if self.v > 0: >> >>>>>>>>>>>>>>>>> self.dummy_links = [allElems[randint(0, >> >>>>>>>>>>>>>>>>> self.v-1)] >> >>>>>>>>>>>>>>>>> for _ >> >>>>>>>>>>>>>>>>> in >> >>>>>>>>>>>>>>>>> xrange(10)] >> >>>>>>>>>>>>>>>>> else: >> >>>>>>>>>>>>>>>>> self.dummy_links = [self] >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> def set_next(self, l): >> >>>>>>>>>>>>>>>>> self.next = l >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> def follow(node): >> >>>>>>>>>>>>>>>>> acc = [] >> 
>>>>>>>>>>>>>>>>> count = 0 >> >>>>>>>>>>>>>>>>> cur = node >> >>>>>>>>>>>>>>>>> assert node.v is not None >> >>>>>>>>>>>>>>>>> assert cur is not None >> >>>>>>>>>>>>>>>>> while count < 50000: >> >>>>>>>>>>>>>>>>> # return a value; generate some garbage >> >>>>>>>>>>>>>>>>> acc.append((cur.v, >> >>>>>>>>>>>>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >> >>>>>>>>>>>>>>>>> for >> >>>>>>>>>>>>>>>>> x >> >>>>>>>>>>>>>>>>> in >> >>>>>>>>>>>>>>>>> xrange(100)])) >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> # if we have reached the end, chose a random link >> >>>>>>>>>>>>>>>>> cur = choice(cur.dummy_links) if cur.next is None >> >>>>>>>>>>>>>>>>> else >> >>>>>>>>>>>>>>>>> cur.next >> >>>>>>>>>>>>>>>>> count += 1 >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> return acc >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> def build(num_elems): >> >>>>>>>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>>>>>>> print "start build" >> >>>>>>>>>>>>>>>>> root = Node(0) >> >>>>>>>>>>>>>>>>> cur = root >> >>>>>>>>>>>>>>>>> for x in xrange(1, num_elems): >> >>>>>>>>>>>>>>>>> e = Node(x) >> >>>>>>>>>>>>>>>>> cur.next = e >> >>>>>>>>>>>>>>>>> cur = e >> >>>>>>>>>>>>>>>>> print "end build %f" % (time.time() - start) >> >>>>>>>>>>>>>>>>> return root >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> num_timings = 100 >> >>>>>>>>>>>>>>>>> if __name__ == "__main__": >> >>>>>>>>>>>>>>>>> num_elems = int(sys.argv[1]) >> >>>>>>>>>>>>>>>>> build(num_elems) >> >>>>>>>>>>>>>>>>> total = 0 >> >>>>>>>>>>>>>>>>> timings = [0.0] * num_timings # run times for the last >> >>>>>>>>>>>>>>>>> num_timings >> >>>>>>>>>>>>>>>>> runs >> >>>>>>>>>>>>>>>>> i = 0 >> >>>>>>>>>>>>>>>>> beginning = time.time() >> >>>>>>>>>>>>>>>>> while time.time() - beginning < 600: >> >>>>>>>>>>>>>>>>> start = time.time() >> >>>>>>>>>>>>>>>>> elem = allElems[randint(0, num_elems - 1)] >> >>>>>>>>>>>>>>>>> assert(elem is not None) >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> lst = follow(elem) >> >>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> total += choice(lst)[0] # use the return value for >> >>>>>>>>>>>>>>>>> something >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> end = time.time() >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> elapsed = end-start >> >>>>>>>>>>>>>>>>> timings[i % num_timings] = elapsed >> >>>>>>>>>>>>>>>>> if (i > num_timings): >> >>>>>>>>>>>>>>>>> slow_time = 2 * sum(timings)/num_timings # >> >>>>>>>>>>>>>>>>> slow >> >>>>>>>>>>>>>>>>> defined >> >>>>>>>>>>>>>>>>> as >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> 2*avg run time >> >>>>>>>>>>>>>>>>> if (elapsed > slow_time): >> >>>>>>>>>>>>>>>>> print "that took a long time elapsed: %f >> >>>>>>>>>>>>>>>>> slow_threshold: >> >>>>>>>>>>>>>>>>> %f 90th_quantile_runtime: %f" % \ >> >>>>>>>>>>>>>>>>> (elapsed, slow_time, >> >>>>>>>>>>>>>>>>> sorted(timings)[int(num_timings*.9)]) >> >>>>>>>>>>>>>>>>> i += 1 >> >>>>>>>>>>>>>>>>> print total >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>>> Hi Armin, Maciej >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> Thanks for responding. >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> I'm in the process of trying to determine what (if >> >>>>>>>>>>>>>>>>>>> any) of >> >>>>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>>>> code >> >>>>>>>>>>>>>>>>>>> I'm >> >>>>>>>>>>>>>>>>>>> in a >> >>>>>>>>>>>>>>>>>>> position to share, and I'll get back to you. >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> Allowing hinting to the GC would be good. 
Even better >> >>>>>>>>>>>>>>>>>>> would be >> >>>>>>>>>>>>>>>>>>> a >> >>>>>>>>>>>>>>>>>>> means >> >>>>>>>>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>>>>> allow me to (transparently) allocate objects in >> >>>>>>>>>>>>>>>>>>> unmanaged >> >>>>>>>>>>>>>>>>>>> memory, >> >>>>>>>>>>>>>>>>>>> but I >> >>>>>>>>>>>>>>>>>>> would expect that to be a tall order :) >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>>>>>>> /Martin >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> Hi Martin. >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> Note that in case you want us to do the work of >> >>>>>>>>>>>>>>>>>> isolating >> >>>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>>> problem, >> >>>>>>>>>>>>>>>>>> we do offer paid support to do that (then we can sign >> >>>>>>>>>>>>>>>>>> NDAs >> >>>>>>>>>>>>>>>>>> and >> >>>>>>>>>>>>>>>>>> stuff). >> >>>>>>>>>>>>>>>>>> Otherwise we would be more than happy to fix bugs once >> >>>>>>>>>>>>>>>>>> you >> >>>>>>>>>>>>>>>>>> isolate >> >>>>>>>>>>>>>>>>>> a >> >>>>>>>>>>>>>>>>>> part you can share freely :) >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>> >> >>>>>>> >> >>>> >> >>>> >> >>> > > From cfbolz at gmx.de Tue Mar 18 09:47:12 2014 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Tue, 18 Mar 2014 09:47:12 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> Message-ID: <53280810.6060602@gmx.de> On 17/03/14 20:04, Martin Koch wrote: > Well, it would appear that we have the problem because we're generating > a lot of garbage in the young generation, just like we're doing in the > example we've been studying here. No, I think it's because you're generating a lot of garbage in the *old* generation. Meaning objects which survive one minor collection but then die. 
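The pattern described here — an object surviving one minor collection and being promoted to an older generation, where only a more expensive major collection can reclaim it — can be sketched with CPython's own generational collector. This is only a rough analogy (PyPy's incminimark collector works quite differently, and gc.get_objects(generation=...) needs CPython 3.8 or later):

```python
import gc

class Blob(object):
    """A plain container object, tracked by the generational collector."""
    pass

def promoted_after_minor():
    obj = Blob()   # freshly allocated: starts out in the youngest generation
    gc.collect(0)  # a "minor" collection: scans only generation 0
    # obj was still reachable, so it survived the minor collection and was
    # promoted out of generation 0 into an older generation -- where it can
    # only be reclaimed by a (more expensive) collection of that generation.
    return not any(o is obj for o in gc.get_objects(generation=0))
```

Garbage of this shape — promoted first, dead shortly after — is exactly what a minor collection cannot reclaim, which is why growing the nursery (so that short-lived objects die before they are promoted) is the usual tuning knob.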
> I'm unsure how we can avoid that in > our real implementation. Can we force gc of the young generation? Either > by gc.collect() or implcitly somehow (does the gc e.g. kick in across > function calls?). That would make matters worse, because increasing the frequency of minor collects means *more* objects get moved to the old generation (where they cause problems). So indeed, maybe in your case making the new generation bigger might help. This can be done using PYPY_GC_NURSERY, I think (nursery is the space reserved for young objects). The risk is that minor collections become unreasonably slow. Anyway, if the example code you gave us also shows the problem I think we should eventually look into it. It's not really fair to say "but you're allocating too much!" to explain why the GC takes a lot of time. Cheers, Carl Friedrich From mak at issuu.com Tue Mar 18 10:37:30 2014 From: mak at issuu.com (Martin Koch) Date: Tue, 18 Mar 2014 10:37:30 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: <53280810.6060602@gmx.de> References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: Thanks, Carl. This bit of code certainly exhibits the surprising property that some runs unpredictably stall for a very long time. Further, it seems that this stall time can be made arbitrarily large by increasing the number of nodes generated (== more data in the old generation == more stuff to traverse if lots of garbage is generated and survives the young generation?). As a user of an incremental garbage collector, I would expect that there are pauses due to GC, but that these are predictable and small. I tried running PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 but that seemed to have no effect. 
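The per-section totals and maxima quoted earlier in the thread (e.g. "Totals: gc-minor:418 ... Max: gc-collect-step:247") can be computed from a PYPYLOG trace mechanically rather than by counting lines. A minimal sketch, assuming only what the thread established about the format: an opening line looks like `[2b99f1981b527e] {gc-minor`, the matching closing line looks like `[2b99f19890d750] gc-minor}`, and the bracketed value is a raw hex timestamp counter:

```python
import re

# "[2b99f1981b527e] {gc-minor" opens a section; "[2b99f19890d750] gc-minor}"
# closes it.  The bracketed value is a raw hex timestamp counter.
LINE = re.compile(r"\[([0-9a-f]+)\] (\{)?([a-z-]+)(\})?")

def summarize(log_lines):
    """Return {section: (total_elapsed, max_elapsed)} in raw timestamp units."""
    open_stamps = {}             # section name -> timestamp of its pending open
    totals, maxima = {}, {}
    for line in log_lines:
        m = LINE.search(line)
        if not m:
            continue             # "start-->", "<--stop", payload lines, etc.
        stamp_hex, opening, name, closing = m.groups()
        stamp = int(stamp_hex, 16)
        if opening:
            open_stamps[name] = stamp
        elif closing and name in open_stamps:
            elapsed = stamp - open_stamps.pop(name)
            totals[name] = totals.get(name, 0) + elapsed
            maxima[name] = max(maxima.get(name, 0), elapsed)
    return dict((name, (totals[name], maxima[name])) for name in totals)
```

Sections with distinct names may nest (gc-minor-walkroots inside gc-minor, as in the traces above), which keying the pending-open table by name handles; same-name nesting would need a stack per name instead.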
I'm looking forward to the results of the Software Transactional Memory, btw :) /Martin On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz wrote: > On 17/03/14 20:04, Martin Koch wrote: > > Well, it would appear that we have the problem because we're generating > > a lot of garbage in the young generation, just like we're doing in the > > example we've been studying here. > > No, I think it's because your generating a lot of garbage in the *old* > generation. Meaning objects which survive one minor collection but then > die. > > > I'm unsure how we can avoid that in > > our real implementation. Can we force gc of the young generation? Either > > by gc.collect() or implcitly somehow (does the gc e.g. kick in across > > function calls?). > > That would make matters worse, because increasing the frequency of > minor collects means *more* objects get moved to the old generation > (where they cause problems). So indeed, maybe in your case making the > new generation bigger might help. This can be done using > PYPY_GC_NURSERY, I think (nursery is the space reserved for young > objects). The risk is that minor collections become unreasonably slow. > > Anyway, if the example code you gave us also shows the problem I think > we should eventually look into it. It's not really fair to say "but > you're allocating too much!" to explain why the GC takes a lot of time. > > Cheers, > > Carl Friedrich > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From cfbolz at gmx.de Tue Mar 18 11:23:38 2014 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Tue, 18 Mar 2014 11:23:38 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: Agreed, somehow this should not happen. Anyway, I'm not the person to look into this, but I filed a bug, so at least your example code does not get lost: https://bugs.pypy.org/issue1710 Cheers, Carl Friedrich Martin Koch wrote: >Thanks, Carl. > >This bit of code certainly exhibits the surprising property that some >runs >unpredictably stall for a very long time. Further, it seems that this >stall >time can be made arbitrarily large by increasing the number of nodes >generated (== more data in the old generation == more stuff to traverse >if >lots of garbage is generated and survives the young generation?). As a >user >of an incremental garbage collector, I would expect that there are >pauses >due to GC, but that these are predictable and small. > >I tried running > >PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 > >but that seemed to have no effect. > >I'm looking forward to the results of the Software Transactional >Memory, >btw :) > >/Martin > > >On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz >wrote: > >> On 17/03/14 20:04, Martin Koch wrote: >> > Well, it would appear that we have the problem because we're >generating >> > a lot of garbage in the young generation, just like we're doing in >the >> > example we've been studying here. >> >> No, I think it's because your generating a lot of garbage in the >*old* >> generation. Meaning objects which survive one minor collection but >then >> die. >> >> > I'm unsure how we can avoid that in >> > our real implementation. Can we force gc of the young generation? >Either >> > by gc.collect() or implcitly somehow (does the gc e.g. kick in >across >> > function calls?). 
>> >> That would make matters worse, because increasing the frequency of >> minor collects means *more* objects get moved to the old generation >> (where they cause problems). So indeed, maybe in your case making the >> new generation bigger might help. This can be done using >> PYPY_GC_NURSERY, I think (nursery is the space reserved for young >> objects). The risk is that minor collections become unreasonably slow. >> >> Anyway, if the example code you gave us also shows the problem I think >> we should eventually look into it. It's not really fair to say "but >> you're allocating too much!" to explain why the GC takes a lot of time. >> >> Cheers, >> >> Carl Friedrich >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> Carl Friedrich -------------- next part -------------- An HTML attachment was scrubbed... URL: From bokr at oz.net Tue Mar 18 11:41:07 2014 From: bokr at oz.net (Bengt Richter) Date: Tue, 18 Mar 2014 11:41:07 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: PMJI, but I wonder if some of these objects could come from trivial re-instantiations instead of re-use of mutable objects, e.g., fishing out one attribute to use together with a new value as init values for an (unnecessarily) new obj. obj = ObjClass(obj.someattr, chgval) where obj.chg = chgval would have done the job without creating garbage. I suspect this pattern can happen more subtly than above, especially if __new__ is defined to do something tricky with old instances. Also, creating a new object can be a tempting way to feel sure about its complete state, without having to write a custom (re)init method. On 03/18/2014 10:37 AM Martin Koch wrote: > Thanks, Carl.
> > This bit of code certainly exhibits the surprising property that some runs > unpredictably stall for a very long time. Further, it seems that this stall > time can be made arbitrarily large by increasing the number of nodes > generated (== more data in the old generation == more stuff to traverse if > lots of garbage is generated and survives the young generation?). As a user > of an incremental garbage collector, I would expect that there are pauses > due to GC, but that these are predictable and small. > > I tried running > > PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 > > but that seemed to have no effect. > > I'm looking forward to the results of the Software Transactional Memory, > btw :) > > /Martin > > > On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz wrote: > >> On 17/03/14 20:04, Martin Koch wrote: >>> Well, it would appear that we have the problem because we're generating >>> a lot of garbage in the young generation, just like we're doing in the >>> example we've been studying here. >> >> No, I think it's because your generating a lot of garbage in the *old* >> generation. Meaning objects which survive one minor collection but then >> die. >> >>> I'm unsure how we can avoid that in >>> our real implementation. Can we force gc of the young generation? Either >>> by gc.collect() or implcitly somehow (does the gc e.g. kick in across >>> function calls?). >> >> That would make matters worse, because increasing the frequency of >> minor collects means *more* objects get moved to the old generation >> (where they cause problems). So indeed, maybe in your case making the >> new generation bigger might help. This can be done using >> PYPY_GC_NURSERY, I think (nursery is the space reserved for young >> objects). The risk is that minor collections become unreasonably slow. >> >> Anyway, if the example code you gave us also shows the problem I think >> we should eventually look into it. It's not really fair to say "but >> you're allocating too much!" 
to explain why the GC takes a lot of time. >> >> Cheers, >> >> Carl Friedrich >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> > > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From techtonik at gmail.com Tue Mar 18 11:43:31 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Tue, 18 Mar 2014 13:43:31 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting Message-ID: Hi, I wonder if it possible to discard run-time changes to interpreter state and get back to some point in the past? One of the applications is forking to speed up unit tests - for it after interpreter is initialized and loaded unittest imports. -- anatoly t. From mak at issuu.com Tue Mar 18 11:55:36 2014 From: mak at issuu.com (Martin Koch) Date: Tue, 18 Mar 2014 11:55:36 +0100 Subject: [pypy-dev] Pypy garbage collection In-Reply-To: References: <5115402B-A7B4-4CA0-8735-EDE0F5FA2403@issuu.com> <5797C257-4FD3-4E62-B8F9-F00551E4141D@issuu.com> <53280810.6060602@gmx.de> Message-ID: Thanks, Carl I think that the part of the mail thread with the timing measurements that show that it is many gc-collect-steps and not one single major gc is also relevant for the bug, so that this information won't have to be rediscovered whenever someone gets the time to look at the bug :) I.e. that is the mail with lines like this one: *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245 It might also be relevant to include info on the command line to reproduce the problem: pypy mem.py 10000000 Thanks, /Martin On Tue, Mar 18, 2014 at 11:23 AM, Carl Friedrich Bolz wrote: > Agreed, somehow this should not happen. 
> > Anyway, I'm not the person to look into this, but I filed a bug, so at > least your example code does not get lost: > > https://bugs.pypy.org/issue1710 > > Cheers, > > Carl Friedrich > > > Martin Koch wrote: >> >> Thanks, Carl. >> >> This bit of code certainly exhibits the surprising property that some >> runs unpredictably stall for a very long time. Further, it seems that this >> stall time can be made arbitrarily large by increasing the number of nodes >> generated (== more data in the old generation == more stuff to traverse if >> lots of garbage is generated and survives the young generation?). As a user >> of an incremental garbage collector, I would expect that there are pauses >> due to GC, but that these are predictable and small. >> >> I tried running >> >> PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000 >> >> but that seemed to have no effect. >> >> I'm looking forward to the results of the Software Transactional Memory, >> btw :) >> >> /Martin >> >> >> On Tue, Mar 18, 2014 at 9:47 AM, Carl Friedrich Bolz wrote: >> >>> On 17/03/14 20:04, Martin Koch wrote: >>> > Well, it would appear that we have the problem because we're generating >>> > a lot of garbage in the young generation, just like we're doing in the >>> > example we've been studying here. >>> >>> No, I think it's because your generating a lot of garbage in the *old* >>> generation. Meaning objects which survive one minor collection but then >>> die. >>> >>> > I'm unsure how we can avoid that in >>> > our real implementation. Can we force gc of the young generation? >>> Either >>> > by gc.collect() or implcitly somehow (does the gc e.g. kick in across >>> > function calls?). >>> >>> That would make matters worse, because increasing the frequency of >>> minor collects means *more* objects get moved to the old generation >>> (where they cause problems). So indeed, maybe in your case making the >>> new generation bigger might help. 
This can be done using >>> PYPY_GC_NURSERY, I think (nursery is the space reserved for young >>> objects). The risk is that minor collections become unreasonably slow. >>> >>> Anyway, if the example code you gave us also shows the problem I think >>> we should eventually look into it. It's not really fair to say "but >>> you're allocating too much!" to explain why the GC takes a lot of time. >>> >>> Cheers, >>> >>> Carl Friedrich >>> _______________________________________________ >>> pypy-dev mailing list >>> pypy-dev at python.org >>> https://mail.python.org/mailman/listinfo/pypy-dev >>> >> >> > > Carl Friedrich > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Wed Mar 19 09:35:24 2014 From: matti.picus at gmail.com (matti picus) Date: Wed, 19 Mar 2014 10:35:24 +0200 Subject: [pypy-dev] ssl and win32 Message-ID: I have been slowly trying to work on the failure on win32 of ssl, as reported in issue 1696 [0]. The issue now includes a 10 line test, using the stdlib test_ftplib classes. i tried converting the test to run on untranslated python, but am running into a very strange failure mode similar to the own-testing failure of ssl_wrap [1]. For some reason a failing call to socket functions is setting errno==2, which is not a valid socket call error [3] I would love to get some ideas about how to progress, or even better that someone else fix this. Matti [0] https://bugs.pypy.org/issue1696 [1] http://buildbot.pypy.org/summary/longrepr?testname=AppTestSSL.%28%29.test_sslwrap&builder=own-win-x86-32&build=59&mod=module._ssl.test.test_ssl [3] http://msdn.microsoft.com/en-us/library/windows/desktop/ms740668(v=vs.85).aspx -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arigo at tunes.org Wed Mar 19 09:41:33 2014 From: arigo at tunes.org (Armin Rigo) Date: Wed, 19 Mar 2014 09:41:33 +0100 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Anatoly, On 18 March 2014 11:43, anatoly techtonik wrote: > I wonder if it possible to discard run-time changes to interpreter > state and get back to some point in the past? One of the applications > is forking to speed up unit tests - for it after interpreter is > initialized and loaded unittest imports. Unsure what you want to do, but isn't os.fork() the answer to your first question? Armin From techtonik at gmail.com Wed Mar 19 09:47:38 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Wed, 19 Mar 2014 11:47:38 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Armin. On Wed, Mar 19, 2014 at 11:41 AM, Armin Rigo wrote: > On 18 March 2014 11:43, anatoly techtonik wrote: >> I wonder if it possible to discard run-time changes to interpreter >> state and get back to some point in the past? One of the applications >> is forking to speed up unit tests - for it after interpreter is >> initialized and loaded unittest imports. > > Unsure what you want to do, but isn't os.fork() the answer to your > first question? Yes, but on a interpreter level, independent of underlying platform. From arigo at tunes.org Wed Mar 19 09:54:25 2014 From: arigo at tunes.org (Armin Rigo) Date: Wed, 19 Mar 2014 09:54:25 +0100 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Anatoly, On 19 March 2014 09:47, anatoly techtonik wrote: >> Unsure what you want to do, but isn't os.fork() the answer to your >> first question? > > Yes, but on a interpreter level, independent of underlying platform. What is the motivation for avoiding os.fork()? 
It's possible to do something like that in RPython, if you ignore all the additional complications like tracking raw-memory too; it looks like an infinite amount of painful work to me, but well, it's not my time :-) A bient?t, Armin. From techtonik at gmail.com Wed Mar 19 10:42:16 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Wed, 19 Mar 2014 12:42:16 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: On Wed, Mar 19, 2014 at 11:54 AM, Armin Rigo wrote: > Hi Anatoly, > > On 19 March 2014 09:47, anatoly techtonik wrote: >>> Unsure what you want to do, but isn't os.fork() the answer to your >>> first question? >> >> Yes, but on a interpreter level, independent of underlying platform. > > What is the motivation for avoiding os.fork()? I'd gladly use it as a quick hack to solve my unit-testing performance problem, but I am on Windows, so I had to think about ideal case. > It's possible to do something like that in RPython, if you ignore all > the additional complications like tracking raw-memory too; it looks > like an infinite amount of painful work to me, but well, it's not my > time :-) Fair point. =) I am thinking about bytecode machine. Virtualization software like virtualbox allow to save state at run-time and restore it later at the exact point - continue to run the system from the moment it was saved. And they do this in incremental way - keeping track of what memory and disk have been touched. So, can interpreter, while playing bytecode, do keep track of these things and save/restore the state the same way? Is that possible currently? If not, then why and what can be done? 
From arigo at tunes.org Wed Mar 19 11:21:17 2014 From: arigo at tunes.org (Armin Rigo) Date: Wed, 19 Mar 2014 11:21:17 +0100 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: Hi Anatoly, On 19 March 2014 10:42, anatoly techtonik wrote: >> It's possible to do something like that in RPython, if you ignore all >> the additional complications like tracking raw-memory too; it looks >> like an infinite amount of painful work to me, but well, it's not my >> time :-) > > Fair point. =) I am thinking about bytecode machine. Virtualization > software like virtualbox allow to save state at run-time and restore it > later at the exact point - continue to run the system from the moment > it was saved. And they do this in incremental way - keeping track of > what memory and disk have been touched. > > So, can interpreter, while playing bytecode, do keep track of these > things and save/restore the state the same way? Is that possible > currently? If not, then why and what can be done? It's not fundamentally easier or harder to do than it would be doing the same thing on CPython or any custom C program. While I can imagine coming up with a proof of concept very quickly, that would save and restore only the GC-managed objects; the real pain starts when needing to track changes done to general low-level memory, which is not possible in general. You would instead need some gross hack that copies the entire content of the memory of a process to emulate a fork(), which could also be done for CPython or any custom C program. How to do it concretely on a specific OS like Windows is left as an exercice to the reader, but as a starting point, look at how Cygwin implements fork(). The only advantage of PyPy, if you want, is that we can *add* an extra small complication on top of that, which is the aforementioned custom way to track the content of the GC objects. 
Given that this is hopefully the biggest part of the memory, doing so would give a boost to the performance of the fork() emulation written as described above. A bient?t, Armin. From techtonik at gmail.com Wed Mar 19 11:54:02 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Wed, 19 Mar 2014 13:54:02 +0300 Subject: [pypy-dev] Rollback interpreter state to fork for unittesting In-Reply-To: References: Message-ID: On Wed, Mar 19, 2014 at 1:21 PM, Armin Rigo wrote: > Hi Anatoly, > > On 19 March 2014 10:42, anatoly techtonik wrote: >>> It's possible to do something like that in RPython, if you ignore all >>> the additional complications like tracking raw-memory too; it looks >>> like an infinite amount of painful work to me, but well, it's not my >>> time :-) >> >> Fair point. =) I am thinking about bytecode machine. Virtualization >> software like virtualbox allow to save state at run-time and restore it >> later at the exact point - continue to run the system from the moment >> it was saved. And they do this in incremental way - keeping track of >> what memory and disk have been touched. >> >> So, can interpreter, while playing bytecode, do keep track of these >> things and save/restore the state the same way? Is that possible >> currently? If not, then why and what can be done? > > It's not fundamentally easier or harder to do than it would be doing > the same thing on CPython or any custom C program. > > While I can imagine coming up with a proof of concept very quickly, > that would save and restore only the GC-managed objects; the real pain > starts when needing to track changes done to general low-level memory, > which is not possible in general. You would instead need some gross > hack that copies the entire content of the memory of a process to > emulate a fork(), which could also be done for CPython or any custom C > program. 
How to do it concretely on a specific OS like Windows is > left as an exercice to the reader, but as a starting point, look at > how Cygwin implements fork(). I don't know C well enough to read that code. Is it possible to describe this in C independent manner? If I understand correctly, the problem starts when you interact with some specific OS API calls and calls to .dll and .so modules that use low level memory API during dynamic imports? Are there other reasons? I'd like to get the idea what is the exact scope when the rollback is still possible? > The only advantage of PyPy, if you want, is that we can *add* an extra > small complication on top of that, which is the aforementioned custom > way to track the content of the GC objects. Given that this is > hopefully the biggest part of the memory, doing so would give a boost > to the performance of the fork() emulation written as described above. I don't feel confident that this is enough. Tracking GC memory is a cool thing, and it would help to understand the problem better it is also helpful to get notifications when something is done outside of interpreter sandbox. The goal is like to track that Python bytecode was safe to rollback up to a forking point (after unittest initialization is finished, for example). The next step would be to annotate the exact system calls to calm down the interpreter (and developers) and tell them what is the nature of these calls and how to deal with them on rollback. From romain.py at gmail.com Wed Mar 19 15:34:22 2014 From: romain.py at gmail.com (Romain Guillebert) Date: Wed, 19 Mar 2014 15:34:22 +0100 Subject: [pypy-dev] GSOC: Introduction and Interested in Numpy Improvements Project In-Reply-To: References: Message-ID: <20140319143422.GN11056@Plop> Hi Rajul I think you can start off by reading the FAQ http://doc.pypy.org/en/latest/faq.html. 
After that, you will need to familiarize yourself with the code base by making a non-trivial contribution, since you are interested in numpy, fixing this bug https://bugs.pypy.org/issue1590 would be a good way to start in my opinion. If you need help with this, feel free to ask me on irc, I'm rguillebert on freenode, you can also join #pypy there to meet everyone. Cheers Romain On 03/17, Rajul Srivastava wrote: > Hi all, > > My name in Rajul, and I am a final year undergraduate student at the Indian > Institute of Technology Kharagpur. I wish to participate in > Google Summer of Code 2014, and while going through the list of > organisations, I came across PyPy. I am proficient with programming > languages C/C++, Python, Java, Groovy, Ruby. I am very interested in the > fields of Algorithms, Computational Sciences, and Software Engineering. > > I have always been interested in programming and in the past I have > participated in Google Summer of Code 2012, with the organisation > Network Time Foundation,working on the project "improving the > Logging/Debugging System of Network Time Protocol Software". I have also > interned in the Global Technology division of Barclays, during the summers > of 2013, working with the Market Risk IT team. Besides I have worked on a > few Research projects in the fields > of Computational Finance, Complex Networks, and Computational Chemistry. I > am currently working on my Thesis project in the field > of Computational Sciences on a project titled "Network Analysis > of Chemical Reactions". I have had courses in the fields of Programming and > Data Structures, Complex Networks, Distributed Systems, > Algorithms, Operations Research in the past. > > I have gone through the list of project ideas and I found all of the > project ideas very interesting. Although I find all the projects listed > worth a while, I am particularly interested in the "Numpy Improvements" > project. 
I suppose that my programming background is suitable for these > projects. > > I shall be grateful if anyone can help me and give me reference to the > literature that I may use and also shed some light on how I can go about > making a successful proposal. > > Thanks!! > > Best Regards, > Rajul > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From koder.mail at gmail.com Fri Mar 21 15:27:03 2014 From: koder.mail at gmail.com (KoDer) Date: Fri, 21 Mar 2014 16:27:03 +0200 Subject: [pypy-dev] How to check, that jit is turned on? Message-ID: Hi all, I have a quite simple python script, which mostly don't use python dynamic features, but it almost two times slower under latest pypy than under python 2.6.7. Is there any way to check, that jit actually working? pypy is builded from source with > python ../../rpython/bin/rpython -Ojit targetpypystandalone [PyPy 2.2.1 with GCC 4.8.1] on linux2 Thanks -- K.Danilov aka koder Skype:koder.ua Tel:+38-050-4030512 -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Fri Mar 21 18:50:28 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 21 Mar 2014 19:50:28 +0200 Subject: [pypy-dev] How to check, that jit is turned on? In-Reply-To: References: Message-ID: can you share the script? On Fri, Mar 21, 2014 at 4:27 PM, KoDer wrote: > Hi all, > > I have a quite simple python script, which mostly don't use python dynamic > features, > but it almost two times slower under latest pypy than under python 2.6.7. > > Is there any way to check, that jit actually working?
> > pypy is builded from source with >> python ../../rpython/bin/rpython -Ojit targetpypystandalone > > [PyPy 2.2.1 with GCC 4.8.1] on linux2 > > > Thanks > -- > K.Danilov aka koder > Skype:koder.ua > Tel:+38-050-4030512 > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From koder.mail at gmail.com Fri Mar 21 21:05:45 2014 From: koder.mail at gmail.com (KoDer) Date: Fri, 21 Mar 2014 22:05:45 +0200 Subject: [pypy-dev] How to check, that jit is turned on? In-Reply-To: References: Message-ID: Sure. Run - python chess.py 2014-03-21 19:50 GMT+02:00 Maciej Fijalkowski : > can you share the script? > > On Fri, Mar 21, 2014 at 4:27 PM, KoDer wrote: > > Hi all, > > > > I have a quite simple python script, which mostly don't use python > dynamic > > features, > > but it almost two times slower unded latest pypy than under python 2.6.7. > > > > Is there any way to check, that jit actually working? > > > > pypy is builded from source with > >> python ../../rpython/bin/rpython -Ojit targetpypystandalone > > > > [PyPy 2.2.1 with GCC 4.8.1] on linux2 > > > > > > Thanks > > -- > > K.Danilov aka koder > > Skype:koder.ua > > Tel:+38-050-4030512 > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > -- K.Danilov aka koder ICQ:214286120 Skype:koder.ua Tel:+38-050-4030512 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: chess.py Type: text/x-python Size: 13098 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: chess_base.py Type: text/x-python Size: 754 bytes Desc: not available URL: From fijall at gmail.com Fri Mar 21 22:15:23 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 21 Mar 2014 23:15:23 +0200 Subject: [pypy-dev] How to check, that jit is turned on? In-Reply-To: References: Message-ID: hey. this exhibits a crazy amount of jit warmup (try to run it a few times in a loop) and then speeds up a little. It however shows a bug in our JIT (that tries to trace stuff again and again). We'll look into it, thanks for reporting! It should really be fast, we'll make sure it is :-) On Fri, Mar 21, 2014 at 10:05 PM, KoDer wrote: > Sure. Run - > > python chess.py > > > 2014-03-21 19:50 GMT+02:00 Maciej Fijalkowski : > >> can you share the script? >> >> On Fri, Mar 21, 2014 at 4:27 PM, KoDer wrote: >> > Hi all, >> > >> > I have a quite simple python script, which mostly don't use python >> > dynamic >> > features, >> > but it almost two times slower unded latest pypy than under python >> > 2.6.7. >> > >> > Is there any way to check, that jit actually working? >> > >> > pypy is builded from source with >> >> python ../../rpython/bin/rpython -Ojit targetpypystandalone >> > >> > [PyPy 2.2.1 with GCC 4.8.1] on linux2 >> > >> > >> > Thanks >> > -- >> > K.Danilov aka koder >> > Skype:koder.ua >> > Tel:+38-050-4030512 >> > >> > _______________________________________________ >> > pypy-dev mailing list >> > pypy-dev at python.org >> > https://mail.python.org/mailman/listinfo/pypy-dev >> > > > > > > -- > K.Danilov aka koder > ICQ:214286120 > Skype:koder.ua > Tel:+38-050-4030512 From matti.picus at gmail.com Sat Mar 22 23:06:08 2014 From: matti.picus at gmail.com (Matti Picus) Date: Sun, 23 Mar 2014 00:06:08 +0200 Subject: [pypy-dev] win32 and external function calls Message-ID: <532E0950.5020701@gmail.com> An HTML attachment was scrubbed... 
URL: From kmod at dropbox.com Wed Mar 26 00:19:24 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Tue, 25 Mar 2014 16:19:24 -0700 Subject: [pypy-dev] Question about extension support Message-ID: Hi all, I've been trying to learn about how PyPy supports (unmodified) Python extensions, and one thing I've heard is that it's much slower than cPython, and/or uses more memory. I tried finding some documentation about why, and all I could find is this, from 2010: https://bitbucket.org/pypy/compatibility/wiki/c-api Sorry if this should be obvious, but is there more up-to-date information about this stuff? And secondly, assuming the info that I linked to is still valid, is there a reason you guys settled on this method of bridging the refcount/tracing divide, as opposed to other possibilities (can you pin the objects in the GC)? I'm curious, since I've heard a number of people mention that extension modules are the primary reason that PyPy is slower than cPython for their code; definitely an improvement over "PyPy doesn't run my code at all", but it's made me curious about whether or not / why it has to be that way. kmod -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.m.camara at gmail.com Wed Mar 26 02:53:04 2014 From: john.m.camara at gmail.com (John Camara) Date: Tue, 25 Mar 2014 21:53:04 -0400 Subject: [pypy-dev] Question about extension support In-Reply-To: References: Message-ID: Hi Kevin, Here is another link about writing extensions for PyPy. http://doc.pypy.org/en/latest/extending.html John On Tue, Mar 25, 2014 at 9:48 PM, John Camara wrote: > Hi Kevin, > > More up to date information can be found on the FAQ page > > > http://doc.pypy.org/en/latest/faq.html#do-cpython-extension-modules-work-with-pypy > > The best approach for PyPy is either use a pure Python module if possible > or use a cffi wrapped extension instead of an extension that uses the > CPython CAPI. 
Often CPython CAPI extensions are wrapping some c library. > Creating a cffi wrapper for the library is actually much simpler than > writing a CPython CAPI wrapper. Quite a few CPython CAPI extensions have > already been wrapped for cffi so make sure to search for one before > creating your own wrapper. If you need to create a wrapper, refer to the > cffi documentation at > > http://cffi.readthedocs.org/en/release-0.8/ > > Extensions wrapped with cffi are compatible with both CPython and PyPy. > On CPython the performance is similar to what you would get if you used > ctypes. How every, under PyPy, the performance is much closer to a native > C call plus the overhead for releasing and acquiring the gil. > > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.m.camara at gmail.com Wed Mar 26 02:48:27 2014 From: john.m.camara at gmail.com (John Camara) Date: Tue, 25 Mar 2014 21:48:27 -0400 Subject: [pypy-dev] Question about extension support Message-ID: Hi Kevin, More up to date information can be found on the FAQ page http://doc.pypy.org/en/latest/faq.html#do-cpython-extension-modules-work-with-pypy The best approach for PyPy is either use a pure Python module if possible or use a cffi wrapped extension instead of an extension that uses the CPython CAPI. Often CPython CAPI extensions are wrapping some c library. Creating a cffi wrapper for the library is actually much simpler than writing a CPython CAPI wrapper. Quite a few CPython CAPI extensions have already been wrapped for cffi so make sure to search for one before creating your own wrapper. If you need to create a wrapper, refer to the cffi documentation at http://cffi.readthedocs.org/en/release-0.8/ Extensions wrapped with cffi are compatible with both CPython and PyPy. On CPython the performance is similar to what you would get if you used ctypes. 
However, under PyPy, the performance is much closer to a native C call plus the overhead for releasing and acquiring the GIL. John -------------- next part -------------- An HTML attachment was scrubbed... URL: From yury at shurup.com Wed Mar 26 09:31:26 2014 From: yury at shurup.com (Yury V. Zaytsev) Date: Wed, 26 Mar 2014 09:31:26 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: References: Message-ID: <1395822686.2779.12.camel@newpride> On Tue, 2014-03-25 at 16:19 -0700, Kevin Modzelewski wrote: > > I'm curious, since I've heard a number of people mention that > extension modules are the primary reason that PyPy is slower than > cPython for their code; definitely an improvement over "PyPy doesn't > run my code at all", but it's made me curious about whether or not / > why it has to be that way. In my opinion, it all depends on how you use CPyExt and what your extension modules are for. There are two scenarios here (or combinations thereof) that I think cover most of the use cases: 1) You use C extensions to make it faster. 2) You use C extensions to steer external processes. Ideally with PyPy you should be able to drop (1) altogether and write nice Python code that the JIT will be able to optimize sometimes even better than hand-written C code, so here the answer would be "don't use extensions". Now, if as a part of (2) you are doing some lengthy processing entirely outside PyPy, this might still be just as fast as with CPython and CPyExt, but if the calls to your foreign functions are short and/or you are transferring a lot of data C <-> PyPy, then there you go... Personally, I've been using CPyExt and I'm very happy about it, because the function calls take a long time, and whatever happens outside doesn't have much to do with objects in PyPy land.
However, if my requirements were different, I would rather have re-written everything using cffi; from what I understand, it can deliver comparable performance to cPython, and it also works for both PyPy and cPython, not just PyPy... -- Sincerely yours, Yury V. Zaytsev From kmod at dropbox.com Wed Mar 26 21:47:11 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Wed, 26 Mar 2014 13:47:11 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395822686.2779.12.camel@newpride> References: <1395822686.2779.12.camel@newpride> Message-ID: Hi all, thanks for the responses, but I guess I should have been more explicit -- I'm curious about *why* PyPy is slow on existing extension modules and why people are being steered away from them. I completely support the push to move away from CPython extension modules, but I'm not sure it's reasonable to expect that programmers will rewrite all the extension modules they use. Put another way, I understand how having a JIT-understandable cffi module will be faster on PyPy than an extension module, but what I don't quite understand is why CPython extension modules have to be slower on PyPy than they are on CPython. I'm not saying that extension modules should be sped up by PyPy, but I'm curious why they have a reputation for being slower. On Wed, Mar 26, 2014 at 1:31 AM, Yury V. Zaytsev wrote: > On Tue, 2014-03-25 at 16:19 -0700, Kevin Modzelewski wrote: > > > > I'm curious, since I've heard a number of people mention that > > extension modules are the primary reason that PyPy is slower than > > cPython for their code; definitely an improvement over "PyPy doesn't > > run my code at all", but it's made me curious about whether or not / > > why it has to be that way. > > In my opinion, it all depends on how you use CPyExt and what your > extension modules are for. There are two scenarios here (or combinations > thereof) that I think cover most of the use cases: > > 1) You use C extensions to make it faster.
> 2) You use C extensions to steer external processes. > > Ideally with PyPy you should be able to drop (1) altogether and write > nice Python code that JIT will be able to optimize sometimes even better > than hand-written C code, so here the answer would be "don't use > extensions". > > Now, if as a part of (2) you are doing some lengthy processing entirely > outside PyPy, this might still just as fast as with CPython with CPyExt, > but if the calls to your foreign functions are short and/or you are > transferring a lot of data C <-> PyPy, then there you go... > > Personally, I've been using CPyExt and I'm very happy about it, because > the function calls take a long time, and whatever happens outside > doesn't have much to do with objects in PyPy land. > > However, if my requirements were different, I would have rather > re-written everything using cffi, from what I understood it can deliver > comparable performance to cPython, and also it works both for PyPy and > cPython, not just PyPy... > > -- > Sincerely yours, > Yury V. Zaytsev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Wed Mar 26 21:52:39 2014 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 26 Mar 2014 13:52:39 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> Message-ID: <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> On Wed, Mar 26, 2014, at 13:47, Kevin Modzelewski wrote: > Hi all, thanks for the responses, but I guess I should have been more > explicit -- I'm curious about *why* PyPy is slow on existing extension > modules and why people are being steered away from them. I completely > support the push to move away from CPython extension modules, but I'm not > sure it's reasonable to expect that programmers will rewrite all the > extension modules they use. 
> > Put another way, I understand how having a JIT-understandable cffi module > will be faster on PyPy than an extension module, but what I don't quite > understand is why CPython extension modules have to be slower on PyPy > than > they are on CPython. I'm not saying that extension modules should be > sped > up by PyPy, but I'm curious why they have a reputation for being slower. There are several reasons. Two of the most important are: 1) PyPy's internal representation of objects is different from CPython's, so a conversion cost must be paid every time objects pass between pure Python and C. Unlike CPython, extensions with PyPy can't poke around directly in data structures. Macros like PyList_SET_ITEM have to become function calls. 2) Bridging the gap between PyPy's GC and CPython's ref counting requires a lot of bookkeeping. From lac at openend.se Wed Mar 26 23:29:33 2014 From: lac at openend.se (Laura Creighton) Date: Wed, 26 Mar 2014 23:29:33 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: Message from Kevin Modzelewski of "Wed, 26 Mar 2014 13:47:11 -0700." References: <1395822686.2779.12.camel@newpride> Message-ID: <201403262229.s2QMTXkK024725@fido.openend.se> Your C-extensions come all bundled up with a whole lot of gorp which is designed to make them play nicely in a ref-counting environment. Ref counting is a very slow way to do GC. Sometimes -- really, really, really hideously slow. You are sometimes _way_ better off writing python code instead -- pypy with the jit turned off outperforms CPython purely on the benefits of not doing ref-counting, and pypy really needs the jit to be fast. There is a bit of conceptual confusion here -- on the one hand, because C extensions often were written for reasons of performance when compared to CPython, there is a tendency to believe that C-extensions are, pretty much by definition, fast. And the other thing is a sort of reflexive belief that 'if it is in C (or C++) then it has to be fast'.
Both of these ideas are wrong. A whole lot of C extensions are actually really, really slow. They are just faster than CPython -- or perhaps 'faster than CPython was when I wrote this thing', which isn't, after all, that hard a target to meet. When PyPy finds a C extension which is working very hard to pretend it is a set of Python objects that can be refcounted, it isn't brilliant enough to be able to throw away all the ref-counting fakery, intuit what the code really is trying to do here, and just run that bit. That's too hard. Instead it decides to play along with the ref-counting faking. So we are at 'watch the elephant tap-dance' time ... it doesn't have to do a very good (read fast) job at this; it is amazing that it does it at all. Laura From kmod at dropbox.com Thu Mar 27 05:17:23 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Wed, 26 Mar 2014 21:17:23 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> Message-ID: On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson wrote: > > There are several reasons. Two of the most important are > 1) PyPy's internal representation of objects is different from > CPython's, so a conversion cost must be paid every time objects pass > between pure Python and C. Unlike CPython, extensions with PyPy can't > poke around directly in data structures. Macros like PyList_SET_ITEM > have to become function calls. > Hmm interesting... I'm not sure I follow, though, why a call to PyList_SET_ITEM on a PyPy list can't know about the PyPy object representation. Again, I understand how it's not necessarily going to be as fast as pure-python code, but I don't understand why PyList_SET_ITEM on PyPy needs to be slower than on CPython.
Is it because PyPy uses more complicated internal representations, expecting the overhead to be elided by the JIT? Also, I'm assuming that CPyExt gets to do a recompilation of the extension module; I could definitely understand how there could be significant overhead if this was being done as an ABI compatibility layer. 2) Bridging the gap between PyPy's GC and CPython's ref counting requires a lot of bookkeeping. > From a personal standpoint I'm also curious about how much of this overhead is fundamental, and how much could be alleviated with (potentially significant) implementation effort. I know PyPy has a precise GC, but I wonder if using a conservative GC could change the situation dramatically if you were able to hook the extension module's allocator and switch it to using the conservative GC. That's my plan, at least, which is one of the reasons I've been curious about the issues that PyPy has been running into, since I'm curious about how much will be applicable. kmod -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Thu Mar 27 05:32:05 2014 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 26 Mar 2014 21:32:05 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> Message-ID: <1395894725.517.99438153.216B4835@webmail.messagingengine.com> On Wed, Mar 26, 2014, at 21:17, Kevin Modzelewski wrote: > On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson > wrote: > > > > > There are several reasons. Two of the most important are > > 1) PyPy's internal representation of objects is different from > > CPython's, so a conversion cost must be paid every time objects pass > > between pure Python and C. Unlike CPython, extensions with PyPy can't > > poke around directly in data structures. Macros like PyList_SET_ITEM > > have to become function calls.
> > > > Hmm interesting... I'm not sure I follow, though, why the calling > PyList_SET_ITEM on a PyPy list can't know about the PyPy object > representation. Again, I understand how it's not necessarily going to be > as fast as pure-python code, but I don't understand why PyList_SET_ITEM > on > PyPy needs to be slower than on CPython. Is it because PyPy uses more > complicated internal representations, expecting the overhead to be elided > by the JIT? Let's continue with the list example. pypy lists use an array as the underlying data structure like CPython, but the similarity stops there. You can't just have random C code putting things in pypy lists. The internal representation of the list might be unwrapped integers, not pointers to int objects like CPython lists. There also need to be GC barriers. The larger picture is that building a robust CPython compatibility layer is difficult and error-prone compared to the solution of rewriting C extensions in Python (possibly with cffi). > > Also, I'm assuming that CPyExt gets to do a recompilation of the > extension > module; Yes > 2) Bridging the gap between PyPy's GC and CPython's ref counting > > requires a lot of bookkeeping. > > > > From a personal standpoint I'm also curious about how much of this > overhead > is fundamental, and how much could be alleviated with (potentially > significant) implementation effort. I know PyPy has a precise GC, but I > wonder if using a conservative GC could change the situation dramatically > if you were able to hook the extension module's allocator and switch it > to > using the conservative GC. That's my plan, at least, which is one of the > reasons I've been curious about the issues that PyPy has been running > into > since I'm curious about how much will be applicable. Conservative GCs are evil and slow. :) I don't know what you mean by the "extension module's allocator". That's a fairly global thing.
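[Editor's note: Benjamin's point about unwrapped integers can be pictured with a toy model of PyPy's list strategies. The real implementation lives in RPython inside the interpreter; the class below is only an illustration, with hypothetical names. While every element is an int, storage is a compact unboxed array; appending anything else silently switches to a generic object layout. C code compiled against one fixed struct layout, as a PyList_SET_ITEM macro would be, cannot follow such a switch.]

```python
from array import array

class StrategyList:
    """Toy model of a strategy-switching list (illustration only)."""

    def __init__(self):
        self.strategy = "int"
        self.storage = array("q", [])   # unboxed 64-bit machine integers

    def append(self, value):
        if self.strategy == "int":
            if isinstance(value, int):
                self.storage.append(value)   # stays unboxed
                return
            # Generalize: box the ints back into ordinary objects.
            self.storage = list(self.storage)
            self.strategy = "object"
        self.storage.append(value)

    def getitem(self, index):
        return self.storage[index]

lst = StrategyList()
lst.append(1)
lst.append(2)
print(lst.strategy)     # int
lst.append("hello")     # forces the switch to the generic layout
print(lst.strategy)     # object
print(lst.getitem(0))   # 1
```

A real PyList_SET_ITEM on PyPy therefore has to be a function call that first forces the list into a layout C code can see, which is part of the conversion cost Benjamin mentions.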
From kmod at dropbox.com Thu Mar 27 05:51:00 2014 From: kmod at dropbox.com (Kevin Modzelewski) Date: Wed, 26 Mar 2014 21:51:00 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395894725.517.99438153.216B4835@webmail.messagingengine.com> References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> <1395894725.517.99438153.216B4835@webmail.messagingengine.com> Message-ID: On Wed, Mar 26, 2014 at 9:32 PM, Benjamin Peterson wrote: > > > On Wed, Mar 26, 2014, at 21:17, Kevin Modzelewski wrote: > > On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson > > wrote: > > > > > > > > There are several reasons. Two of the most important are > > > 1) PyPy's internal representation of objects is different from > > > CPython's, so a conversion cost must be payed every time objects pass > > > between pure Python and C. Unlike CPython, extensions with PyPy can't > > > poke around directly in data structures. Macros like PyList_SET_ITEM > > > have to become function calls. > > > > > > > Hmm interesting... I'm not sure I follow, though, why the calling > > PyList_SET_ITEM on a PyPy list can't know about the PyPy object > > representation. Again, I understand how it's not necessarily going to be > > as fast as pure-python code, but I don't understand why PyList_SET_ITEM > > on > > PyPy needs to be slower than on CPython. Is it because PyPy uses more > > complicated internal representations, expecting the overhead to be elided > > by the JIT? > > Let's continue with the list example. pypy lists use an array as the > underlying data structure like CPython, but the similarity stops there. > You can't just have random C code putting things in pypy lists. The > internal representation of the list might be unwrapped integers, not > points to int objects like CPython lists. There also needs to be GC > barriers. 
> > The larger picture is that building a robust CPython compatibility layer > is difficult and error-prone compared to the solution of rewriting C > extensions in Python (possibly with cffi). > Using that logic, I would counter that building a JIT for a dynamic language is difficult and error-prone compared to rewriting your dynamic language programs in a faster language :) The benefit to supporting it in your runtime is 1) you only do the work once, and 2) you get to support existing code out there. I'm writing not from the standpoint of saying "I have an extension module and I want it to run quickly", but rather "what do you guys think about the (presumed) situation of extension modules being a key blocker of PyPy adoption". While I'd love the world to migrate to a better solution overnight, I don't think that's realistic -- just look at the state of Python 3, which has a much larger constituency pushing much harder for it, and presumably has lower switching costs than rewriting C extensions in Python. > > > > Also, I'm assuming that CPyExt gets to do a recompilation of the > > extension > > module; > > Yes > > > 2) Bridging the gap between PyPy's GC and CPython's ref counting > > > > requires a lot of bookkeeping. > > > > > > > From a personal standpoint I'm also curious about how much of this > > overhead > > is fundamental, and how much could be alleviated with (potentially > > significant) implementation effort. I know PyPy has a precise GC, but I > > wonder if using a conservative GC could change the situation dramatically > > if you were able to hook the extension module's allocator and switch it > > to > > using the conservative GC. That's my plan, at least, which is one of the > > reasons I've been curious about the issues that PyPy has been running > > into > > since I'm curious about how much will be applicable. > > Conservative GCs are evil and slow. :) > > I don't know what you mean by the "extension module's allocator". 
That's > a fairly global thing. > I'm assuming that you can hook out malloc and mmap to be calls to the GC allocator; I've seen other projects do this, though I don't know how robust it is. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Thu Mar 27 06:05:28 2014 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 26 Mar 2014 22:05:28 -0700 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> <1395894725.517.99438153.216B4835@webmail.messagingengine.com> Message-ID: <1395896728.7256.99445357.0EBA473A@webmail.messagingengine.com> On Wed, Mar 26, 2014, at 21:51, Kevin Modzelewski wrote: > On Wed, Mar 26, 2014 at 9:32 PM, Benjamin Peterson > wrote: > > > > > > > On Wed, Mar 26, 2014, at 21:17, Kevin Modzelewski wrote: > > > On Wed, Mar 26, 2014 at 1:52 PM, Benjamin Peterson > > > wrote: > > > > > > > > > > > There are several reasons. Two of the most important are > > > > 1) PyPy's internal representation of objects is different from > > > > CPython's, so a conversion cost must be payed every time objects pass > > > > between pure Python and C. Unlike CPython, extensions with PyPy can't > > > > poke around directly in data structures. Macros like PyList_SET_ITEM > > > > have to become function calls. > > > > > > > > > > Hmm interesting... I'm not sure I follow, though, why the calling > > > PyList_SET_ITEM on a PyPy list can't know about the PyPy object > > > representation. Again, I understand how it's not necessarily going to be > > > as fast as pure-python code, but I don't understand why PyList_SET_ITEM > > > on > > > PyPy needs to be slower than on CPython. Is it because PyPy uses more > > > complicated internal representations, expecting the overhead to be elided > > > by the JIT? > > > > Let's continue with the list example. 
pypy lists use an array as the > > underlying data structure like CPython, but the similarity stops there. > > You can't just have random C code putting things in pypy lists. The > > internal representation of the list might be unwrapped integers, not > > pointers to int objects like CPython lists. There also need to be GC > > barriers. > > > > The larger picture is that building a robust CPython compatibility layer > > is difficult and error-prone compared to the solution of rewriting C > > extensions in Python (possibly with cffi). > > > > Using that logic, I would counter that building a JIT for a dynamic > language is difficult and error-prone compared to rewriting your dynamic > language programs in a faster language :) The benefit to supporting it > in > your runtime is 1) you only do the work once, and 2) you get to support > existing code out there. I don't want to argue that an amazingly fast CPython API compatibility layer isn't possible, but current experience suggests that creating it will be painful. It's hard to get excited about building compatibility layers when there are shiny JITs to be made. > > I'm writing not from the standpoint of saying "I have an extension module > and I want it to run quickly", but rather "what do you guys think about > the > (presumed) situation of extension modules being a key blocker of PyPy > adoption". While I'd love the world to migrate to a better solution > overnight, I don't think that's realistic -- just look at the state of > Python 3, which has a much larger constituency pushing much harder for > it, > and presumably has lower switching costs than rewriting C extensions in > Python. Yes, but you get to use PyPy and get super fast Python code, whereas your code gets no faster by porting to Python 3. Plus you get rid of C! The incentives are a bit better.
> > > > > > > > Also, I'm assuming that CPyExt gets to do a recompilation of the > > > extension > > > module; > > > > Yes > > > > > 2) Bridging the gap between PyPy's GC and CPython's ref counting > > > > > > requires a lot of bookkeeping. > > > > > > > > > > From a personal standpoint I'm also curious about how much of this > > > overhead > > > is fundamental, and how much could be alleviated with (potentially > > > significant) implementation effort. I know PyPy has a precise GC, but I > > > wonder if using a conservative GC could change the situation dramatically > > > if you were able to hook the extension module's allocator and switch it > > > to > > > using the conservative GC. That's my plan, at least, which is one of the > > > reasons I've been curious about the issues that PyPy has been running > > > into > > > since I'm curious about how much will be applicable. > > > > Conservative GCs are evil and slow. :) > > > > I don't know what you mean by the "extension module's allocator". That's > > a fairly global thing. > > > > I'm assuming that you can hook out malloc and mmap to be calls to the GC > allocator; I've seen other projects do this, though I don't know how > robust > it is. That's the easy part. The hard part is keeping your precise GC informed of native C doing arbitrary things. 
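[Editor's note: the bookkeeping Benjamin refers to can be pictured with a toy model. The names and structure below are hypothetical; cpyext's real machinery is far more involved. The idea: every object handed to C gets a handle with its own C-side refcount, and that entry pins the object so the tracing GC cannot collect it while C still holds references.]

```python
class RefBridge:
    """Toy model of bridging a tracing GC with C-style refcounting."""

    def __init__(self):
        self._pinned = {}      # handle -> [object, C-side refcount]
        self._next_handle = 1

    def to_c(self, obj):
        """Hand an object to C land: pin it with an initial refcount of 1."""
        handle = self._next_handle
        self._next_handle += 1
        self._pinned[handle] = [obj, 1]
        return handle

    def incref(self, handle):
        self._pinned[handle][1] += 1

    def decref(self, handle):
        entry = self._pinned[handle]
        entry[1] -= 1
        if entry[1] == 0:
            # C no longer references the object; unpin it so the
            # tracing GC is free to reclaim it again.
            del self._pinned[handle]

    def from_c(self, handle):
        """Look an object back up when C passes its handle to Python."""
        return self._pinned[handle][0]

bridge = RefBridge()
h = bridge.to_c([1, 2, 3])
bridge.incref(h)            # e.g. C stored the pointer somewhere
bridge.decref(h)
print(bridge.from_c(h))     # [1, 2, 3]
bridge.decref(h)            # count drops to 0: the pin is released
```

Every crossing of the Python/C boundary touches a table like this, which is one reason short, frequent calls into an extension are the worst case for cpyext.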
From arigo at tunes.org Fri Mar 28 08:59:12 2014 From: arigo at tunes.org (Armin Rigo) Date: Fri, 28 Mar 2014 08:59:12 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <1395896728.7256.99445357.0EBA473A@webmail.messagingengine.com> References: <1395822686.2779.12.camel@newpride> <1395867159.28967.99312285.58A94BF6@webmail.messagingengine.com> <1395894725.517.99438153.216B4835@webmail.messagingengine.com> <1395896728.7256.99445357.0EBA473A@webmail.messagingengine.com> Message-ID: Hi all, I'd like to point Kevin to the thread "cpyext performance" of July-August 2012, in which we did some explanation of what is slow about cpyext and could potentially be improved. As others have mentioned here again, we can't reasonably hope to get them to the same speed as on CPython, but "someone" could at least work on lowering the difference. (Nobody did so far.) https://mail.python.org/pipermail/pypy-dev/2012-July/010263.html A small note about PyList_SET_ITEM(): this is impossible to keep as a macro or even to write as C code. There are some practical reasons as mentioned here, but the most fundamental reason imho is that doing so means throwing the flexibility of PyPy out of the window. I'm talking here about adding new implementations of list objects (we already have several ones, e.g. for lists-of-integers or for range()), about changing the GC, and so on. In other words: of course it is possible (if hard) to write the complete logic of PyList_SET_ITEM as C code, and even as a C macro. The point is that if we did that, then we'd give up on the possibility of ever changing any of these other aspects, or at least require painful adaptation every time we want to change them. (And yes, we do change them from time to time. For example, the STM branch we're working on has a different GC, and we had to look inside cpyext exactly zero times to make it work.) À bientôt, Armin.
From arigo at tunes.org Fri Mar 28 09:51:42 2014 From: arigo at tunes.org (Armin Rigo) Date: Fri, 28 Mar 2014 09:51:42 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <201403262229.s2QMTXkK024725@fido.openend.se> References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: Hi Laura, On 26 March 2014 23:29, Laura Creighton wrote: > really, really hideously slow. You are sometimes _way_ > better off writing python code instead -- pypy with the jit turned off > outperforms CPython purely on the benefits of not doing ref-counting, and > pypy really needs the jit to be fast. That detail is wrong. PyPy with the JIT turned off is 1.5x to 2x slower than CPython. (The reason is just the higher level of RPython versus hand-optimized C.) It is a common misconception to blame reference counting for the general slowness of standard interpreters of Python-like languages; I've seen it a few times in various contexts. But it's wrong: manual reference counting as in CPython doesn't cost much. This is probably because the reference counting increment/decrement end up as a very small fraction of the total runtime in this case. Also, the malloc() implementation used probably helps (it is very close to the one for the standard malloc() on Linux, which is really efficient for a lot of usage patterns). On the other hand, the reason PyPy doesn't include refcounting is that it's not reasonable at the level of the RPython implementation. Our RPython code assumes a GC *all the time*, e.g. when dealing with internal lists; by contrast, the CPython implementation assumes no GC and tries hard to put everything as stack objects, only occasionally "giving up" and using a malloc, like it is common in C. So incref/decref "everywhere in PyPy" means at many, many more places than "everywhere in CPython". In PyPy it is enough to kill performance completely (as we measured a long time ago). 
So I would say that the difference between refcounting and a "real GC" is that the latter scales nicely to higher levels of memory pressure than refcounting; but refcounting is fine if you can manually optimize where increfs are needed and where they are not, and if you manually avoid creating garbage as much as possible. À bientôt, Armin. From lac at openend.se Fri Mar 28 10:48:01 2014 From: lac at openend.se (Laura Creighton) Date: Fri, 28 Mar 2014 10:48:01 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: Message from Armin Rigo of "Fri, 28 Mar 2014 09:51:42 +0100." References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: <201403280948.s2S9m1Vb008983@fido.openend.se> In a message of Fri, 28 Mar 2014 09:51:42 +0100, Armin Rigo writes: >Hi Laura, Ah, thank you. I actually thought we had found the odd example where a better gc beat CPython performance even when the jit was off. Am I completely wrong about this? Or is it just that it is so rare it doesn't matter? Thank you for teaching me this, in any case. Sorry for the misinformation. Laura From mount.sarah at gmail.com Fri Mar 28 11:08:18 2014 From: mount.sarah at gmail.com (Sarah Mount) Date: Fri, 28 Mar 2014 10:08:18 +0000 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: Hi all, On Fri, Mar 28, 2014 at 8:51 AM, Armin Rigo wrote: > Hi Laura, > > On 26 March 2014 23:29, Laura Creighton wrote: > > really, really hideously slow. You are sometimes _way_ > > better off writing python code instead -- pypy with the jit turned off > > outperforms CPython purely on the benefits of not doing ref-counting, and > > pypy really needs the jit to be fast. > > That detail is wrong. PyPy with the JIT turned off is 1.5x to 2x > slower than CPython. (The reason is just the higher level of RPython > versus hand-optimized C.)
> > It is a common misconception to blame reference counting for the > general slowness of standard interpreters of Python-like languages; > I've seen it a few times in various contexts. But it's wrong: manual > reference counting as in CPython doesn't cost much. This is probably > because the reference counting increment/decrement end up as a very > small fraction of the total runtime in this case. Also, the malloc() > implementation used probably helps (it is very close to the one for > the standard malloc() on Linux, which is really efficient for a lot of > usage patterns). > > On the other hand, the reason PyPy doesn't include refcounting is that > it's not reasonable at the level of the RPython implementation. Our > RPython code assumes a GC *all the time*, e.g. when dealing with > internal lists; by contrast, the CPython implementation assumes no GC > and tries hard to put everything as stack objects, only occasionally > "giving up" and using a malloc, like it is common in C. So > incref/decref "everywhere in PyPy" means at many, many more places > than "everywhere in CPython". In PyPy it is enough to kill > performance completely (as we measured a long time ago). > > This is a really interesting discussion, thanks for spelling out the details so clearly. Did the measurements you refer to get published anywhere? Thanks, Sarah -- Sarah Mount, Senior Lecturer, University of Wolverhampton website: http://www.snim2.org/ twitter: @snim2 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From cfbolz at gmx.de Fri Mar 28 13:21:41 2014 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 28 Mar 2014 13:21:41 +0100 Subject: [pypy-dev] ICOOOLPS 2014 call for papers Message-ID: <53356955.9090003@gmx.de> ======================================================================== 9th ICOOOLPS Workshop Implementation, Compilation, Optimization of OO Languages, Programs and Systems July 28th 2014, Uppsala, Sweden Colocated with ECOOP http://soft-dev.org/events/icooolps14/ ======================================================================== Overview The ICOOOLPS workshop series brings together researchers and practitioners working in the field of OO languages implementation and optimization. ICOOOLPS key goal is to identify current and emerging issues relating to the efficient implementation, compilation and optimization of such languages, and outlining future challenges and research directions. Topics of interest for ICOOOLPS include, but are not limited to: implementation of fundamental OO and OO-like features (e.g. inheritance, parametric types, memory management, objects, prototypes), runtime systems (e.g. compilers, linkers, virtual machines, garbage collectors), optimizations (e.g. static or dynamic analyses, adaptive virtual machines), resource constraints (e.g. time for real-time systems, space or low-power for embedded systems) and relevant choices and tradeoffs (e.g. constant time vs. non-constant time mechanisms, separate compilation vs. global compilation, dynamic loading vs. global linking, dynamic checking vs. proof-carrying code...). Submissions ICOOOLPS is not a mini-conference; it is a workshop designed to facilitate discussion and the exchange of ideas between peers. ICOOOLPS therefore welcomes both position (1-4 pages) and research (max. 10 pages) papers. Position papers should outline interesting or unconventional ideas, which need not be fully fleshed out.
Research papers are expected to contain more complete ideas, but these need not necessarily be fully complete as with a traditional conference. Authors will be given the option to publish their papers (short or long) in the ACM Digital Library if they wish. Submissions must be written in English, formatted according to ACM SIG Proceedings style. Please submit via EasyChair (link on the ICOOOLPS website). Important dates Submission: May 5th 2014 (FIRM DEADLINE) Notification: May 26th 2014 Workshop: July 28th 2014 Programme chairs Laurence Tratt, King's College London, UK Olivier Zendra, INRIA Nancy, France e-mail: icooolps14 at easychair.org Programme committee Carl Friedrich Bolz, King's College London, UK Eric Jul, University of Copenhagen, DK José Manuel Redondo López, Universidad de Oviedo, ES Stefan Marr, INRIA Lille, FR Floréal Morandat, Labri, FR Todd Mytkowicz, Microsoft, US Tobias Pape, Hasso-Plattner-Institut Potsdam, DE Ian Rogers, Google, US Jeremy Singer, University of Glasgow, UK Jan Vitek, Purdue University, US Mario Wolczko, Oracle Labs, US From arigo at tunes.org Sat Mar 29 09:23:04 2014 From: arigo at tunes.org (Armin Rigo) Date: Sat, 29 Mar 2014 09:23:04 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <201403280948.s2S9m1Vb008983@fido.openend.se> References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> <201403280948.s2S9m1Vb008983@fido.openend.se> Message-ID: Hi Laura, On 28 March 2014 10:48, Laura Creighton wrote: > Ah, thank you. I actually thought we had found the odd example where > a better gc beat CPython performance even when the jit was off. I am > completely wrong about this? Or is it just that it is so rare it doesn't > matter? Ah, no, you are correct, sorry. Running "gcbench.py" we see a different behavior than CPython.
This benchmark runs 7 tests which, on CPython, take from 0.9 increasing to 3.5 seconds each; and on PyPy (without the JIT) it increases from 1.7 to 1.9 seconds only. The last three of these 7 tests are slower on CPython. I'm unsure if we ever had a precise explanation for why. In general it's one of the only examples we have. It seems to show that even CPython is, most of the time, not making demands on its GC that exceed what reference counting is good for. (There might be other cases that could appear in big applications, like creating tons and tons of cyclic garbage, but we don't really know, I think.) A bientôt, Armin. From arigo at tunes.org Sat Mar 29 09:29:11 2014 From: arigo at tunes.org (Armin Rigo) Date: Sat, 29 Mar 2014 09:29:11 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> Message-ID: Hi Sarah, On 28 March 2014 11:08, Sarah Mount wrote: > This is a really interesting discussion, thanks for spelling out the > details so clearly. Did the measurements you refer to get published > anywhere? Yes, in https://bitbucket.org/pypy/extradoc/raw/tip/eu-report/D07.1_Massive_Parallelism_and_Translation_Aspects-2007-02-28.pdf , section Reference counting. A bientôt, Armin. From lac at openend.se Sat Mar 29 09:52:56 2014 From: lac at openend.se (Laura Creighton) Date: Sat, 29 Mar 2014 09:52:56 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: Message from Armin Rigo of "Sat, 29 Mar 2014 09:23:04 +0100." References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> <201403280948.s2S9m1Vb008983@fido.openend.se> Message-ID: <201403290852.s2T8quRb015305@fido.openend.se> Ok -- the lesson I took from this is 'ref counting hurts performance'. Why was that the wrong inference to make?
Laura From arigo at tunes.org Sat Mar 29 10:57:35 2014 From: arigo at tunes.org (Armin Rigo) Date: Sat, 29 Mar 2014 10:57:35 +0100 Subject: [pypy-dev] Question about extension support In-Reply-To: <201403290852.s2T8quRb015305@fido.openend.se> References: <1395822686.2779.12.camel@newpride> <201403262229.s2QMTXkK024725@fido.openend.se> <201403280948.s2S9m1Vb008983@fido.openend.se> <201403290852.s2T8quRb015305@fido.openend.se> Message-ID: Hi Laura, On 29 March 2014 09:52, Laura Creighton wrote: > Ok -- the lesson I took from this is 'ref counting hurts performance'. Why > was that the wrong inference to make? No, the lesson is "unoptimized refcounting hurts performance". As I've explained above, optimized refcounting like in CPython seems to be good enough for almost all use cases, with the sole known exception being some cases of gcbench.py. I think that since this 2007 report we never were really interested enough to know the details more precisely. If someone wants to go to the bottom of the question, he'd need to write an automatic optimizer to remove most incref/decref in PyPy, and add a cycle detector. Or do the opposite: scrap reference counting in CPython and replace the Py_Incref/Py_Decref with, say, shadow stack saving/restoring (search for _du_save/_du_restore in https://bitbucket.org/pypy/stmgc/raw/default/duhton for an example). Either way, it looks unlikely that he'd get something that is either simpler or more performant (here I'm ignoring CPython+Psyco which muddles the picture). I think that each of CPython and PyPy has right now the kind of GC that is best suited for them. A bientôt, Armin.
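The "tons and tons of cyclic garbage" case mentioned above is easy to sketch; pure reference counting can never reclaim such objects, which is why CPython backs its refcounts with a cycle detector (a minimal illustration, not related to gcbench.py itself):

```python
import gc

class Node(object):
    def __init__(self):
        self.ref = None

# Two objects pointing at each other: each keeps the other's refcount
# above zero, so reference counting alone can never free the pair.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b

# It is CPython's cyclic garbage collector that reclaims them.
gc.collect()  # returns the number of unreachable objects it found
```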
From techtonik at gmail.com Sat Mar 29 12:49:00 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sat, 29 Mar 2014 14:49:00 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans Message-ID: I know that the C in CFFI stands for the C way of doing things, so I hope people won't try to defend that position and instead try to think about what if we had to re-engineer ABI access from scratch, for an explicit and obvious debug binary interface. CFFI is not useful for Python programmers and here is why. The primary reason is that it requires you to know C. And knowing C requires you to know about OS architecture. And knowing about OS architecture requires you to know about the ABI, which is: http://stackoverflow.com/a/3784697 This is how the compiler builds an application. It defines things (but is not limited to): How parameters are passed to functions (registers/stack). Who cleans parameters from the stack (caller/callee). Where the return value is placed for return. How exceptions propagate. The problematic part of it is that you need to think of the OS ABI in terms of unusual C abstractions, coming through several levels of them. Suppose you know the OS ABI and you know that you need direct physical memory access to set bytes for a certain call in this way: 0024: 00 00 00 6C 33 33 74 00 How would you do this in Python? The most obvious way is with a byte string - \x00\x00\x00\x6c\x33\x33\x74\x00 - but that's not how you prepare the data for the call if, for example, 00 6C means anything to you. What is the Python way to convert 00 6C to a convenient Python data structure and back, and is it Pythonic (user friendly and intuitive)? import struct struct.unpack('wtf?', '\x00\x6C') If you try to look up the magic string in the struct docs: http://docs.python.org/2/library/struct.html#format-characters You'll notice that the mapping between possible combinations of these 2 bytes and some Python type is very mystic.
First it requires you to choose either "short" or "unsigned short", but that's not enough for parsing binary data - you need to figure out the proper "endianness" and make up a magic string for it. This is just for two bytes. Imagine a definition for a binary protocol with variable message size and nested data structures. You won't be able to understand it by reading Python code. More than that - Python *by default* uses platform specific "endianness", it is uncertain (implicit) about it, so not only should you care about "endianness", but you should also be an expert to find out which is the correct metric for you. Look at this: 0024: 00 00 00 6C 33 33 74 00 Where is "endianness", "alignment", "size" from this doc http://docs.python.org/2/library/struct.html#byte-order-size-and-alignment People need to *start* with this base and this concept and that's why it is harmful. CFFI proposes to provide a better interface to skip this complexity by getting back to the roots and using the C level. That's a pretty nice hack for C guys, I am sure it makes them completely happy, but for the academic side of the PyPy project, for the Python interpreter and other projects built over RPython, it is important to have a tool that allows experimenting with binary interfaces in a convenient, readable and direct way, and makes it easier for humans to understand (by reading Python code) how Python instructions are translated by the JIT into binary pieces in computer memory, pieces that will be processed by the operating system as a system function call on the ABI level. But let's not digress, and get back to the point that the struct module doesn't allow working with structured data. In Python the only alternative standard way to define a binary structure is ctypes.
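For what it's worth, the "magic string" can be made fully explicit about both width and byte order; parsing the eight example bytes above as, say, four big-endian 16-bit fields (the field layout here is an arbitrary assumption, purely for illustration) looks like this:

```python
import struct

data = b'\x00\x00\x00\x6C\x33\x33\x74\x00'

# '>' pins the byte order to big-endian and disables padding, so the
# result is identical on every platform; 'H' is an unsigned 16-bit int.
fields = struct.unpack('>4H', data)
print(fields)  # (0, 108, 13107, 29696)
```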
ctypes documentation is no better for a binary guy: http://docs.python.org/2/library/ctypes.html#fundamental-data-types See how that binary guy suffered to map binary data to Python structures through ctypes: https://bitbucket.org/techtonik/discovery/src/eacd864e6542f14039c9b31eecf94302f3ef99ec/graphics/gfxtablet/gfxtablet.py?at=default And I am saying that this is the best way available from the standard library. It is pretty close to Django models, but for binary data. ctypes still is worse than struct in one thing - looking into the docs, there are no size specifiers for any kind of C type, so there is no guarantee that 2 bytes won't be read as 4 bytes, or worse. By looking at the ctypes code it is hard to figure out the size of a structure and when it may change. I can hardly call the ctypes mapping process user friendly or the resulting code intuitive. Probably nobody could, and that's why CFFI was born. But CFFI took a different route - instead of trying to map C types to binary data (ABI level), it decided to go onto the API level. While it exposes many better tools, it basically means you are dealing with a C interface again - not with a Pythonic interface for binary data. I am not saying that CFFI is bad - I am saying that it is good, but not enough, and that it can be fixed with a cleanroom engineering approach for a broader scope of modern usage patterns for binary data than just calling the OS API in the C way. Why do we need it? I frankly think that the Stackless way of doing things without a C stack is the future, and the problem is that not many people can see how it works, or build alternative systems without the classic C stack with (R)Python. Can CFFI help with this? I doubt that. So, what am I proposing? Just an idea.
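(As an aside: ctypes does ship explicitly sized types - c_uint16 is two bytes and c_uint32 four bytes on every platform - so the Django-models-like mapping can be pinned down; a sketch with made-up field names, decoding the eight example bytes from earlier:)

```python
import ctypes

class Packet(ctypes.BigEndianStructure):
    _pack_ = 1  # no padding inserted between fields
    _fields_ = [
        ('tag',   ctypes.c_uint32),  # always 4 bytes
        ('value', ctypes.c_uint16),  # always 2 bytes
        ('flags', ctypes.c_uint16),  # always 2 bytes
    ]

p = Packet.from_buffer_copy(b'\x00\x00\x00\x6C\x33\x33\x74\x00')
print(p.tag, p.value, p.flags)  # 108 13107 29696
```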
Given the fact that I am mentally incapable of filling the 100-sheet requirement to get funding under H2020, the fact that no existing commercial body could be interested to support the development as an open source project and the fact that hacking on it alone might become boring, giving this idea is the least I can do. Cleanroom engineering. http://en.wikipedia.org/wiki/Cleanroom_software_engineering "The focus of the Cleanroom process is on defect prevention, rather than defect removal." When we talk about the Pythonic way of doing things, how can we define "a defect"? Basically, we are talking about user experience - the emotions that the user experiences when he uses Python for the given task. What is the task at hand? For me - it is working with binary data in Python - not just parsing save games, but creating binary commands such as OS system calls that are executed by a certain CPU, GPU or whatever is on the receiver end of whatever communication interface is used. This is a hardware independent and platform neutral way of doing things. So, the UX is the key, but the properties of an engineered product are not limited to a single task. The cleanroom approach allows concentrating on the defect - when the user experience starts to suffer because of the conflicts between tasks that users are trying to accomplish. For the PyPy project I see the value in a library for compositing binary structures in that these operations can be pipelined and optimized at run-time in a highly effective fashion. I think that a convenient binary tool is the missing brick in the basement of the academic PyPy infrastructure to enable universal interoperability from (R)Python with other digital systems by providing a direct interface to the binary world. I think that 1973-era views on "high level" and "low level" systems are a little bit outdated now that we have Python, Ruby, Erlang, etc. Now C is just not a very good intermediary for "low level" access.
But frankly, I do not think that with the advent of networking, binary can be called low level anymore. It is just another data format that can be as readable for humans as a program structure written in Python. P.S. I have some design ideas on how to make an attractive gameplay out of binary data by "coloring" regions and adding "multi-level context" to hex dumps. This falls out of the scope of this issue, and requires more drawing than texting, but if somebody wants to help me with sharing the vision - I would not object. It will help to make the binary world more accessible, especially for new people, who start coding with JavaScript and Python. -- anatoly t. From fijall at gmail.com Sat Mar 29 12:59:16 2014 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sat, 29 Mar 2014 13:59:16 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: Message-ID: On Sat, Mar 29, 2014 at 1:49 PM, anatoly techtonik wrote: > I know what C in CFFI stands for C way of doing things, so > I hope people won't try to defend that position and instead > try to think about what if we have to re-engineer ABI access > from scratch, for explicit and obvious debug binary interface. > > > CFFI is not useful for Python programmers and here is why. Anatoly, such blank statements that are factually incorrect are the reason why people think your input is counterproductive. Please stop posting random ideas on pypy-dev if you have no interest in working on them. If you want to convince someone to implement your next idea, there are places to do that, but pypy-dev is about the development of PyPy and it's the wrong place to do so.
Cheers, fijal From markus at unterwaditzer.net Sat Mar 29 13:58:16 2014 From: markus at unterwaditzer.net (Markus Unterwaditzer) Date: Sat, 29 Mar 2014 13:58:16 +0100 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: Message-ID: <20140329125815.GA20269@untibox.unti> On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: > I know what C in CFFI stands for C way of doing things, so > I hope people won't try to defend that position and instead > try to think about what if we have to re-engineer ABI access > from scratch, for explicit and obvious debug binary interface. > > > CFFI is not useful for Python programmers and here is why. > > The primary reason is that it requires you to know C. You're using C if you're calling it from Python. Knowing the language (to some degree) when using it is inevitable. > And knowing C requires you to know about OS architecture. The PyPy team (especially fijal) has always strongly discouraged from porting Python code to C for performance. If you have a good reason to use C, it is not surprising that you're going to be confronted with the dangers of such a language. I am not sure if you're trying to make a point against C or CFFI here. I am also not sure if the rest of your post actually means anything, or if it is just way above my head. But given that you're throwing around with statements like "this is useless", i don't feel compelled or motivated to try to understand your ramblings. 
-- Markus From techtonik at gmail.com Sun Mar 30 10:49:40 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 11:49:40 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: Message-ID: On Sat, Mar 29, 2014 at 2:59 PM, Maciej Fijalkowski wrote: > On Sat, Mar 29, 2014 at 1:49 PM, anatoly techtonik wrote: >> I know what C in CFFI stands for C way of doing things, so >> I hope people won't try to defend that position and instead >> try to think about what if we have to re-engineer ABI access >> from scratch, for explicit and obvious debug binary interface. >> >> >> CFFI is not useful for Python programmers and here is why. > > Anatoly, such blank statements that are factually incorrect is the > reason why people think your input is counterproductive. This statement is incorrect out of context; the context in which it is correct is provided below "here is why.". If you know how to rephrase the statement differently to define the scope correctly, I am all ears. Unfortunately I am not a journalist or writer or a scientist who can write short and concise papers. I write how I think, and it is a problem to rewrite stuff in English to sound differently, mostly because of time constraints. > Please stop posting random ideas on pypy-dev, if you have no interest > in working on them. You're wrong in your assumption that I am not interested. Now that it is clear, let me restate that the problem is finding time to work on the problem, because productive hours are spent on paid chores. > If you want to convince someone to implement your > next idea, there are places to do that, but pypy-dev is about the > development of PyPy and it's the wrong place to do so. I just want to get feedback on the opinion, whether the next idea is good or not. Maybe there are people who had this idea before. I don't ask anybody to do this.
About others who can implement your ideas: I am convinced that people work on their own ideas in their free time, so collaboration only happens when ideas match. If you know places where this is not true, I'll be interested to know about them. If you can give a direct link - that will be productive, but I really doubt there are places where people spend time implementing the ideas of others just for good. -- anatoly t. From techtonik at gmail.com Sun Mar 30 11:05:06 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 12:05:06 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: <20140329125815.GA20269@untibox.unti> References: <20140329125815.GA20269@untibox.unti> Message-ID: On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer wrote: > On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: >> I know what C in CFFI stands for C way of doing things, so >> I hope people won't try to defend that position and instead >> try to think about what if we have to re-engineer ABI access >> from scratch, for explicit and obvious debug binary interface. >> >> >> CFFI is not useful for Python programmers and here is why. >> >> The primary reason is that it requires you to know C. > > You're using C if you're calling it from Python. Knowing the language (to some > degree) when using it is inevitable. This is the problem that I've tried to describe: All standard Python tools for ABI level access require C knowledge. >> And knowing C requires you to know about OS architecture. > > The PyPy team (especially fijal) has always strongly discouraged from > porting Python code to C for performance. If you have a good reason to use C, > it is not surprising that you're going to be confronted with the dangers of > such a language. I am not sure if you're trying to make a point against C or > CFFI here. Against C.
As I said, CFFI is good, but not enough to work conveniently with binary interfaces, and the reason for that is that it is C-centric. I support fijal - my position is that rewriting the same code in faster language is not a way to solve performance problems. Language as a problem is a failed smoke test for app architecture. > I am also not sure if the rest of your post actually means anything, or if it > is just way above my head. But given that you're throwing around with > statements like "this is useless", i don't feel compelled or motivated to try > to understand your ramblings. Fair point. Thanks for the feedback. Sometimes I feel like I should just stop wasting my time on ideas, and start eating some pills so that I could better concentrate on a mindless coding. -- anatoly t. From markus at unterwaditzer.net Sun Mar 30 11:20:11 2014 From: markus at unterwaditzer.net (Markus Unterwaditzer) Date: Sun, 30 Mar 2014 11:20:11 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: <20140330092011.GA28013@untibox.unti> On Sun, Mar 30, 2014 at 12:05:06PM +0300, anatoly techtonik wrote: > On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer > wrote: > > On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: > >> I know what C in CFFI stands for C way of doing things, so > >> I hope people won't try to defend that position and instead > >> try to think about what if we have to re-engineer ABI access > >> from scratch, for explicit and obvious debug binary interface. > >> > >> > >> CFFI is not useful for Python programmers and here is why. > >> > >> The primary reason is that it requires you to know C. > > > > You're using C if you're calling it from Python. Knowing the language (to some > > degree) when using it is inevitable. > > This is the problem that I've tried to describe: > > All standard Python tools for ABI level access require C knowledge. 
> > >> And knowing C requires you to know about OS architecture. > > > > The PyPy team (especially fijal) has always strongly discouraged from > > porting Python code to C for performance. If you have a good reason to use C, > > it is not surprising that you're going to be confronted with the dangers of > > such a language. I am not sure if you're trying to make a point against C or > > CFFI here. > > Against C. As I said, CFFI is good, but not enough to work conveniently with > binary interfaces, and the reason for that is that it is C-centric. I am not trying to dogmatize anything here, but i don't see a reason why efforts should be made to eliminate that property you're seeing as a problem, and i am not sure it'd be *worth it*. To me, the main usecase of CFFI seems to be embedding existing C libraries, not directly accessing ABIs. > > I support fijal - my position is that rewriting the same code in faster language > is not a way to solve performance problems. Language as a problem is a > failed smoke test for app architecture. > > > I am also not sure if the rest of your post actually means anything, or if it > > is just way above my head. But given that you're throwing around with > > statements like "this is useless", i don't feel compelled or motivated to try > > to understand your ramblings. > > Fair point. Thanks for the feedback. Sometimes I feel like I should just stop > wasting my time on ideas, and start eating some pills so that I could better > concentrate on a mindless coding. That's not what i meant. It doesn't matter whether your ideas are good or bad: The way you're formulating your ideas is incredibly insulting to authors of existing solutions. > -- > anatoly t. 
From roberto at unbit.it Sun Mar 30 11:26:01 2014 From: roberto at unbit.it (Roberto De Ioris) Date: Sun, 30 Mar 2014 11:26:01 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: <7f1bab2ea6358b5c15337bd15bf245d7.squirrel@manage.unbit.it> > On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer > wrote: >> On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: >>> I know what C in CFFI stands for C way of doing things, so >>> I hope people won't try to defend that position and instead >>> try to think about what if we have to re-engineer ABI access >>> from scratch, for explicit and obvious debug binary interface. >>> >>> >>> CFFI is not useful for Python programmers and here is why. >>> >>> The primary reason is that it requires you to know C. >> >> You're using C if you're calling it from Python. Knowing the language >> (to some >> degree) when using it is inevitable. > > This is the problem that I've tried to describe: > > All standard Python tools for ABI level access require C knowledge. > >>> And knowing C requires you to know about OS architecture. >> >> The PyPy team (especially fijal) has always strongly discouraged from >> porting Python code to C for performance. If you have a good reason to >> use C, >> it is not surprising that you're going to be confronted with the dangers >> of >> such a language. I am not sure if you're trying to make a point against >> C or >> CFFI here. > > Against C. As I said, CFFI is good, but not enough to work conveniently > with > binary interfaces, and the reason for that is that it is C-centric. > Hi, (disclaimer: i have worked a lot with cffi, and i basically love it because most of my projects are in C and i need to interface with them), i am not sure to follow you. You seem to mix C with binary/structures manipulation. 
CFFI is for integration between PyPy/Python and C interfaces, so using C (or something really near to it) is its main purpose. If your problem is with structure manipulation, i totally agree, the python world needs something better (unless i am missing some project regarding it), but this is totally irrelevant in the CFFI area. Regards -- Roberto De Ioris http://unbit.it From kennylevinsen at gmail.com Sun Mar 30 13:40:28 2014 From: kennylevinsen at gmail.com (Kenny Lasse Hoff Levinsen) Date: Sun, 30 Mar 2014 13:40:28 +0200 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: Okay, just to get things right: What you want is an only-ABI solution, which abstracts completely away from technical details, in a nice pythonic wrapper? Suggesting that idea is fine (although it is slightly off-topic on pypy-dev), but it has nothing to do with the C Foreign Function Interface (CFFI), which is C-centric as it focuses on interfacing with C. It allows for making very easy interfaces to C with very little code, which as a nice bonus also appears very clean. One can also easily make the argument that, when you're gluing two languages together, you have to know both to make the proper considerations about their use. You risk making improper use of returned memory if you don't know what's going on, and you'll have no clue how to debug it. But to sum it up: - You want a language independent ABI interface that looks completely pythonic from the user's point of view, and requires no knowledge of other languages - This has nothing to do with CFFI, which is very specifically - as the name implies - a C interface, which does its job very well. Correct? My personal opinion of the idea is that it is likely to be troublesome enough to be unfeasible and very unpleasant to code.
I also find it unlikely to work (without giving a load of trouble to the user), but that should not stop interested parties from trying. Regards, Kenny P.S.: Calling the work of others useless is a bad way to introduce an idea. (Unless it's an idea for "How to be hated for Dummies") > On 30/03/2014, at 11.05, anatoly techtonik wrote: > > On Sat, Mar 29, 2014 at 3:58 PM, Markus Unterwaditzer > wrote: >> On Sat, Mar 29, 2014 at 02:49:00PM +0300, anatoly techtonik wrote: >>> I know what C in CFFI stands for C way of doing things, so >>> I hope people won't try to defend that position and instead >>> try to think about what if we have to re-engineer ABI access >>> from scratch, for explicit and obvious debug binary interface. >>> >>> >>> CFFI is not useful for Python programmers and here is why. >>> >>> The primary reason is that it requires you to know C. >> >> You're using C if you're calling it from Python. Knowing the language (to some >> degree) when using it is inevitable. > > This is the problem that I've tried to describe: > > All standard Python tools for ABI level access require C knowledge. > >>> And knowing C requires you to know about OS architecture. >> >> The PyPy team (especially fijal) has always strongly discouraged from >> porting Python code to C for performance. If you have a good reason to use C, >> it is not surprising that you're going to be confronted with the dangers of >> such a language. I am not sure if you're trying to make a point against C or >> CFFI here. > > Against C. As I said, CFFI is good, but not enough to work conveniently with > binary interfaces, and the reason for that is that it is C-centric. > > I support fijal - my position is that rewriting the same code in faster language > is not a way to solve performance problems. Language as a problem is a > failed smoke test for app architecture. > >> I am also not sure if the rest of your post actually means anything, or if it >> is just way above my head. 
But given that you're throwing around with >> statements like "this is useless", i don't feel compelled or motivated to try >> to understand your ramblings. > > Fair point. Thanks for the feedback. Sometimes I feel like I should just stop > wasting my time on ideas, and start eating some pills so that I could better > concentrate on a mindless coding. > -- > anatoly t. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From techtonik at gmail.com Sun Mar 30 13:47:24 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 14:47:24 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: <7f1bab2ea6358b5c15337bd15bf245d7.squirrel@manage.unbit.it> References: <20140329125815.GA20269@untibox.unti> <7f1bab2ea6358b5c15337bd15bf245d7.squirrel@manage.unbit.it> Message-ID: On Sun, Mar 30, 2014 at 12:26 PM, Roberto De Ioris wrote: > > Hi, (disclaimer: i have worked a lot with cffi, and i basically love it > because most of my projects are in C and i need to interface with them), i > am not sure to follow you. You seem to mix C with binary/structures > manipulation. CFFI is for integration between PyPy/Python and C > interfaces, so using C (or something really near to it) is its main > purpose. > > If your problem is with structures manipulations, i totally agree, the > python world need something better (unless i am missing some project > regarding it), but this is totally irrelevant in the CFFI area. I mix C with binary manipulation - that's right. I feel like the problem of efficiently making C calls is a subset of a more generic problem of converting Python data into corresponding binary data, because C call looks like a chunk of binary data in memory + register setup. 
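That view of a C call as a chunk of binary data in memory plus register setup is, in effect, what an ABI-level call boils down to; for the record, the standard library's ctypes can already make such a call at run time with no C source at all - a minimal sketch (Unix-only, since it pulls strlen out of the symbols of the running process):

```python
import ctypes

# Describe strlen's signature by hand and let ctypes assemble the
# machine-level call.  Get the signature wrong and memory is silently
# corrupted -- the danger that pushes CFFI toward its C-source API mode.
libc = ctypes.CDLL(None)  # symbols of the running process (libc on Unix)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```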
http://unixwiz.net/techtips/win32-callconv-asm.html I think that the PyPy JIT can do faster calls than C if it were possible to manipulate memory on this low level (for instance, providing pre-filled memory for a series of calls and then just modifying esp for each call). If CFFI included a toolset to construct such binary calls - it would not be limited to C anymore, and would open ways to engineer more effective paradigms (Stackless + channels) and better calling conventions for interop with other useful tools (Go etc.) on the binary level. I started the idea with CFFI, because there is no better base to start from. -- anatoly t. From techtonik at gmail.com Sun Mar 30 14:36:08 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Sun, 30 Mar 2014 15:36:08 +0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: On Sun, Mar 30, 2014 at 2:40 PM, Kenny Lasse Hoff Levinsen wrote: > Okay, just to get things right: What you want is an only-ABI solution, which abstracts completely away from technical details, in a nice pythonic wrapper? Not really. I want a language independent ABI solution, yes. ABI-only implies that there is some alternative to that. I don't see any alternative - for me the ABI is the necessary basis for everything else on top. So doing ABI solution design without considering these use cases is impossible. I want a decoupled ABI level. A nice pythonic wrapper that abstracts completely from technical details is not the goal. The goal is to provide practical defaults for language and hardware independent abstraction. The primary object that the abstraction should work with is "platform-independent binary data", the method is "to be readable by humans". On the implementation level that means that by default there is no ambiguity in the syntax that defines binary data (size or endianness), and if there is a dependency on the platform (CPU bitness etc.)
it should be explicit, so that the behavior of structure should be clear (self-describing type names + type docs that list relevant platforms and effects on every platform). This approach inverts existing practice of using platform dependent binary structures by default. > But to sum it up: > - You want a language independent ABI interface that looks completely pythonic from the users point of view, and requires no knowledge of other languages Exactly. Python here means "intuitive and user-friendly" (which may appear different than indentations or YAML). > - This has nothing to do with CFFI, which is very specifically - as the name implies - a C interface, which does it's job very well. > > Correct? Yes CFFI does a perfect job on API level - I can't think of a better way to provide access to C API other than with C syntax. On ABI level tools can be more useful, and there is where idea intersects with CFFI. It doesn't cancel the fact that people need safe feet injury prevention interfaces. Look for my previous answer with the phrase "I mix C with binary manipulation" that covers this. > My personal opinion of the idea is that it is likely to be troublesome enough to be unfeasible and very unpleasant to code. I also find it unlikely to work (without giving a load of trouble to the user), but that should not stop interested parties from trying. Ack. Thanks for the feedback. From santagada at gmail.com Sun Mar 30 17:32:10 2014 From: santagada at gmail.com (Leonardo Santagada) Date: Sun, 30 Mar 2014 12:32:10 -0300 Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans In-Reply-To: References: <20140329125815.GA20269@untibox.unti> Message-ID: On Sun, Mar 30, 2014 at 9:36 AM, anatoly techtonik wrote: > On ABI level tools can be more useful, and there is where idea > intersects with CFFI. It doesn't cancel the fact that people need safe > feet injury prevention interfaces. 
> Look for my previous answer with
> the phrase "I mix C with binary manipulation" that covers this.
>

I think you want something like
https://pypi.python.org/pypi/construct/2.5.1right? It exists and is
pretty good... some guys in my company are using it to model a binary
audio control protocol, both to prototype and to test the embedded
implementation (which is in C).

--
Leonardo Santagada
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From njh at njhurst.com Sun Mar 30 17:29:05 2014
From: njh at njhurst.com (Nathan Hurst)
Date: Mon, 31 Mar 2014 02:29:05 +1100
Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans
In-Reply-To: 
References: <20140329125815.GA20269@untibox.unti>
Message-ID: <20140330152905.GA18471@ajhurst.org>

On Sun, Mar 30, 2014 at 03:36:08PM +0300, anatoly techtonik wrote:
> On Sun, Mar 30, 2014 at 2:40 PM, Kenny Lasse Hoff Levinsen
> wrote:
> > Okay, just to get things right: What you want is an only-ABI
> > solution, which abstracts completely away from technical details,
> > in a nice pythonic wrapper?
>
> Not really. I want a language-independent ABI solution, yes. ABI-only
> implies that there is some alternative to that. I don't see any
> alternative - for me, ABI is the necessary basis for everything else
> on top. So designing an ABI solution without considering these use
> cases is impossible. I want a decoupled ABI level.

If you want to work with pure binary data in Python, there are already
some good options: struct and numpy. The first is optimised for short
records, the latter for multidimensional arrays of binary data. I've
used both on occasion for binary files and cross-language
communication.

> A nice pythonic wrapper that abstracts completely from technical
> details is not the goal. The goal is to provide practical defaults
> for a language- and hardware-independent abstraction.
> The primary object
> the abstraction should work with is "platform-independent binary
> data"; the method is "to be readable by humans". On the
> implementation level that means that by default there is no ambiguity
> in the syntax that defines binary data (size or endianness), and if
> there is a dependency on the platform (CPU bitness etc.) it should be
> explicit, so that the behavior of the structure is clear
> (self-describing type names + type docs that list the relevant
> platforms and the effects on every platform). This approach inverts
> the existing practice of using platform-dependent binary structures
> by default.

So struct and numpy are existing solutions to this problem, but I
think you are thinking too low-level for most problems. It sounds like
what you really want is a schema-based serialisation protocol. There
are a million of these, all alike, but the two I've used the most are
msgpack and thrift. Generally you want to specify statistical
distributions rather than the specific encoding scheme; the choice
between 4-byte and 8-byte ints is not particularly useful at the
Python programmer level, but knowing whether the number is a small int
or a large one is. Thrift is the best implementation of a remote call
I've used, but it still leaves a lot of room for improvement.

If you are actually talking about building something that will talk to
any existing code in any existing language, then you will need
something more like CFFI. However, I don't think you want to take the
C out of CFFI. The reason is that the packing of the data is actually
one of the least important parts of that interface. As you've read on
pypy-dev recently, reference counting vs. gc is hard to get right. But
there are many other problems which you have to address for a truly
language-independent ABI:

Stack convention: what order are things pushed onto the stack (IIRC C
and Pascal push arguments onto the stack in opposite orders)? Are
things even pushed onto a stack?
(LISP stacks are implemented with linked lists; some languages don't
even bother with stack frames when they perform tail recursion
removal, using conditional gotos instead.)

Packing conventions: different CPUs like pointers to be even, or
aligned to 8 bytes. Your code won't be portable if you don't handle
this.

Exception handling: there are as many ways to handle exceptions as
there are compilers, all of them with subtle rules around the
lifetimes of all the objects that are being excepted over.

Virtual methods and the like: in most languages (C is actually
somewhat unusual here) methods or functions are called via a
dereference or two rather than at a direct memory location. For C++
virtual methods and Java methods this looks like a field lookup to
find the vtable, then a jump to a memory location taken from that
table. This is tedious and error-prone to implement correctly.

Generics and types: just working out which function to call is
difficult once you have C++ templates, and performing the correct type
checking is difficult for Java. Haskell's internal type specification
alone is probably larger than all of the CFFI interfaces put together.

There are no doubt whole areas I've missed here, but I hope this gives
you a taste of why language developers love CFFI - it's easy to
implement, easy enough to use, and hides all of these complexities by
pushing everything through the bottleneck which is the C ABI. Language
developers would rather make their own language wonderful; compiler
writers would rather make their own language efficient. Nobody really
cares enough to make supporting other languages directly a goal. The
biggest and best effort in this regard was .NET, and it also cheated
by folding everything through a small bottleneck (albeit a more
expressive one). This folding scars the languages around it - F# is
still less pure than Haskell, J# is incompatible with Java.
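To make the struct/numpy pointer from earlier in this thread concrete:
the stdlib struct module already gives the "explicit size and
endianness" layouts anatoly asks for, as soon as you pick a byte-order
prefix. A minimal sketch (the record layout here is made up for
illustration):

```python
import struct

# '<' selects little-endian with fixed, standard sizes and no padding:
# I = uint32, h = int16, B = uint8 -- 7 bytes total on every platform.
FMT = '<IhB'

packed = struct.pack(FMT, 1024, -3, 7)
assert len(packed) == struct.calcsize(FMT) == 7

# The same bytes decode to the same values regardless of the host CPU.
assert struct.unpack(FMT, packed) == (1024, -3, 7)
```

Without a '<' (or '>', '=', '!') prefix, struct falls back to native
byte order, sizes and alignment - exactly the platform-dependent
default being argued against above.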
If you want to tackle something in this area, I would encourage you to
look at the various serialisation tools out there and work out why
they are not as good as they could be. This is probably not the right
mailing list for this discussion though.

njh

From bokr at oz.net Sun Mar 30 19:23:39 2014
From: bokr at oz.net (Bengt Richter)
Date: Sun, 30 Mar 2014 19:23:39 +0200
Subject: [pypy-dev] Why CFFI is not useful - need direct ABI access 4 humans
In-Reply-To: 
References: <20140329125815.GA20269@untibox.unti>
Message-ID: 

On 03/30/2014 05:32 PM Leonardo Santagada wrote:
> On Sun, Mar 30, 2014 at 9:36 AM, anatoly techtonik wrote:
>
>> On the ABI level tools can be more useful, and that is where the
>> idea intersects with CFFI. It doesn't cancel the fact that people
>> need safe feet injury prevention interfaces. Look for my previous
>> answer with the phrase "I mix C with binary manipulation" that
>> covers this.
>>
>
> I think you want something like
> https://pypi.python.org/pypi/construct/2.5.1right? It exists and is

Need to strip/separate the "right?" to isolate and make the link work:
https://pypi.python.org/pypi/construct/2.5.1

> pretty good... some guys in my company are using it
> to model a binary audio control protocol, both to prototype and to
> test the embedded implementation (which is in C).
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

From estama at gmail.com Mon Mar 31 20:28:55 2014
From: estama at gmail.com (Eleytherios Stamatogiannakis)
Date: Mon, 31 Mar 2014 21:28:55 +0300
Subject: [pypy-dev] JIT not kicking in for callbacks
In-Reply-To: 
References: <20140329125815.GA20269@untibox.unti>
Message-ID: <5339B3E7.50804@gmail.com>

Hello,

I'm not sure, but I think that the JIT doesn't kick in when the loops
are "external", where C code does the loop and calls Python code for
each iteration, through CFFI callbacks.
In our code, we have SQLite calling Python code through callbacks.

Using the loop log:

PYPYLOG=log pypy mterm.py

I see this summary:

interpret                  91%
gc-collect-step             4%
gc-minor                    3%
jit-optimize                0%
gc-minor-walkroots          0%
jit-tracing                 0%
jit-backend                 0%
jit-log-compiling-bridge    0%
jit-resume                  0%
jit-backend-dump            0%
gc-set-nursery-size         0%
jit-log-virtualstate        0%
jit-backend-addr            0%
gc-hardware                 0%
jit-log-noopt-loop          0%
jit-mem-looptoken-alloc     0%
jit-log-opt-bridge          0%
jit-log-rewritten-loop      0%
jit-log-rewritten-bridge    0%
jit-log-compiling-loop      0%
jit-log-opt-loop            0%
jit-log-short-preamble      0%
jit-abort                   0%
jit-mem-collect             0%
jit-summary                 0%
jit-backend-counts          0%

If I read that correctly, most of the execution time is in the
interpreter and not in the JITed code. Is there some way I can "hint"
to PyPy that a callback is essentially the inside of a loop, so PyPy
can JIT it?

I've attached the simple log from which the above summary was
produced via:

python logparser.py print-summary log.log -

Regards,

l.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.log.gz
Type: application/gzip
Size: 282249 bytes
Desc: not available
URL: 

From fijall at gmail.com Mon Mar 31 20:33:15 2014
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Mon, 31 Mar 2014 20:33:15 +0200
Subject: [pypy-dev] JIT not kicking in for callbacks
In-Reply-To: <5339B3E7.50804@gmail.com>
References: <20140329125815.GA20269@untibox.unti> <5339B3E7.50804@gmail.com>
Message-ID: 

hi

"interpret" includes "in the jit". if you're having trouble with the
performance, please let us know how to reproduce it and we'll try to
help you pinpoint the problem.

On Mon, Mar 31, 2014 at 8:28 PM, Eleytherios Stamatogiannakis wrote:
> Hello,
>
> I'm not sure, but I think that the JIT doesn't kick in when the loops
> are "external", where C code does the loop and calls Python code for
> each iteration, through CFFI callbacks.
> In our code, we have SQLite calling Python code through callbacks.
>
> Using the loop log:
>
> PYPYLOG=log pypy mterm.py
>
> I see this summary:
>
> interpret                  91%
> gc-collect-step             4%
> gc-minor                    3%
> jit-optimize                0%
> gc-minor-walkroots          0%
> jit-tracing                 0%
> jit-backend                 0%
> jit-log-compiling-bridge    0%
> jit-resume                  0%
> jit-backend-dump            0%
> gc-set-nursery-size         0%
> jit-log-virtualstate        0%
> jit-backend-addr            0%
> gc-hardware                 0%
> jit-log-noopt-loop          0%
> jit-mem-looptoken-alloc     0%
> jit-log-opt-bridge          0%
> jit-log-rewritten-loop      0%
> jit-log-rewritten-bridge    0%
> jit-log-compiling-loop      0%
> jit-log-opt-loop            0%
> jit-log-short-preamble      0%
> jit-abort                   0%
> jit-mem-collect             0%
> jit-summary                 0%
> jit-backend-counts          0%
>
> If I read that correctly, most of the execution time is in the
> interpreter and not in the JITed code. Is there some way I can "hint"
> to PyPy that a callback is essentially the inside of a loop, so PyPy
> can JIT it?
>
> I've attached the simple log from which the above summary was
> produced via:
>
> python logparser.py print-summary log.log -
>
> Regards,
>
> l.
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>
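A self-contained way to reproduce the "C drives the loop, Python is
the callback body" pattern discussed above - not Eleytherios's SQLite
code, just a stdlib-only stand-in - is libc's qsort called through
ctypes: the sort loop runs entirely in C and re-enters Python once per
comparison.

```python
import ctypes
import ctypes.util

# Load the C library (assumes a Unix-like system where this lookup works).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.qsort.restype = None

# Callback type: int (*compar)(const void *, const void *), here for ints.
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    # Entered once per comparison; the loop around it lives inside qsort.
    return a[0] - b[0]

values = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int),
           CMPFUNC(py_cmp))

print(list(values))  # [1, 2, 3, 4, 5]
```

Pointing the PYPYLOG technique shown above at a script like this makes
it easy to see how time around such callbacks gets attributed.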