[pypy-dev] Pypy garbage collection

Maciej Fijalkowski fijall at gmail.com
Mon Mar 17 14:23:57 CET 2014


On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski <fijall at gmail.com> wrote:
> are you *sure* it's the walkroots that take that long and not
> something else (like gc-minor)? More of those mean that you allocate a
> lot more surviving objects. Can you do two things:
>
> a) take a max of gc-minor (and gc-minor-walkroots), per request
> b) take the sum of those
>
> and plot them

^^^ or just paste the results actually
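
(A rough sketch of how (a) and (b) could be computed from the PYPYLOG output,
assuming the format shown further down in the thread, i.e. requests bracketed
by the start-->/<--stop markers and bracketed hex timestamps on the gc lines;
the script name and regexes are only illustrative:)

    # summarize_gc.py: per-request sum and max of gc-minor section durations
    # usage: python summarize_gc.py <logfile>
    import re
    import sys

    open_re  = re.compile(r"\[([0-9a-f]+)\] \{(gc-minor(?:-walkroots)?)")
    close_re = re.compile(r"\[([0-9a-f]+)\] (gc-minor(?:-walkroots)?)\}")

    def summarize(path):
        started = {}                  # section name -> timestamp at '{'
        totals, maxima = {}, {}
        for line in open(path):
            if "start-->" in line:    # a new request begins
                totals, maxima = {}, {}
            elif "<--stop" in line:   # request done: report sum and max per section
                print "sum:", totals, "max:", maxima
            else:
                m = open_re.search(line)
                if m:
                    started[m.group(2)] = int(m.group(1), 16)
                    continue
                m = close_re.search(line)
                if m:
                    name = m.group(2)
                    delta = int(m.group(1), 16) - started.pop(name, 0)
                    totals[name] = totals.get(name, 0) + delta
                    maxima[name] = max(maxima.get(name, 0), delta)

    if __name__ == "__main__":
        summarize(sys.argv[1])
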

>
> On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch <mak at issuu.com> wrote:
>> Well, then it works out to around 2.5GHz, which seems reasonable. But it
>> doesn't alter the conclusion from the previous email: the slow queries
>> all have a duration of around 34*10^9 units, while 'normal' queries take
>> about 1*10^9 units, or .4 seconds at this conversion. Also, the log shows
>> that a slow query performs many more gc-minor operations than a 'normal'
>> one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58.
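
(For concreteness, the conversion assumed here, treating the bracketed values
as raw CPU timestamp counts at an estimated ~2.5 GHz; the clock rate itself is
only inferred from these numbers:)

    ticks_per_second = 2.5e9           # assumed clock rate
    print 1e9 / ticks_per_second       # 'normal' query: ~0.4 s
    print 34e9 / ticks_per_second      # slow query: ~13.6 s, close to the ~13 s observed
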
>>
>> So the question becomes: Why do we get this large spike in
>> gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
>>
>> Thanks,
>> /Martin
>>
>>
>> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski <fijall at gmail.com>
>> wrote:
>>>
>>> I think it's the cycles of your CPU
>>>
>>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch <mak at issuu.com> wrote:
>>> > What is the unit? Perhaps I'm being thick here, but I can't correlate it
>>> > with seconds (which the program does print out). Slow runs are around 13
>>> > seconds, but span around 34*10^9 (dec), i.e. 0x800000000, timestamp units
>>> > (e.g. from 0x2b9944ab8c4f49 to 0x2b994c9d31889c).
>>> >
>>> >
>>> >
>>> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski <fijall at gmail.com>
>>> > wrote:
>>> >>
>>> >> The number of lines is nonsense. This is a timestamp in hex.
>>> >>
>>> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch <mak at issuu.com> wrote:
>>> >> > Based on Maciej's suggestion, I tried the following
>>> >> >
>>> >> > PYPYLOG=- pypy mem.py 10000000 > out
>>> >> >
>>> >> > This generates a logfile which looks something like this
>>> >> >
>>> >> > start-->
>>> >> > [2b99f1981b527e] {gc-minor
>>> >> > [2b99f1981ba680] {gc-minor-walkroots
>>> >> > [2b99f1981c2e02] gc-minor-walkroots}
>>> >> > [2b99f19890d750] gc-minor}
>>> >> > [snip]
>>> >> > ...
>>> >> > <--stop
>>> >> >
>>> >> >
>>> >> > It turns out that the culprit is a lot of MINOR collections.
>>> >> >
>>> >> > I base this on the following observations:
>>> >> >
>>> >> > - I can't understand the format of the timestamp on each log line (the
>>> >> >   "[2b99f1981b527e]"). From what I can see in the code, this should be
>>> >> >   output from time.clock(), but that doesn't return a number like that
>>> >> >   when I run pypy interactively.
>>> >> > - Instead, I count the number of debug lines between start--> and the
>>> >> >   corresponding <--stop.
>>> >> > - Most runs have a few hundred lines of output between start/stop.
>>> >> > - All slow runs have very close to 57800 lines of output between
>>> >> >   start/stop.
>>> >> > - One such sample does 9609 gc-collect-step operations, 9647 gc-minor
>>> >> >   operations, and 9647 gc-minor-walkroots operations.
>>> >> >
>>> >> >
>>> >> > Thanks,
>>> >> > /Martin
>>> >> >
>>> >> >
>>> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski
>>> >> > <fijall at gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> there is an environment variable PYPYLOG=gc:- (where - is stdout)
>>> >> >> which will do that for you btw.
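
(For example, the gc prefix limits logging to the gc sections, and a file name
can be given in place of "-"; the exact invocations below are just
illustrations:)

    PYPYLOG=gc:- pypy mem.py 10000000        # gc sections only, to stdout
    PYPYLOG=gc:gclog pypy mem.py 10000000    # same, written to a file named "gclog"
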
>>> >> >>
>>> >> >> maybe you can find out what that is using profiling or valgrind?
>>> >> >>
>>> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch <mak at issuu.com> wrote:
>>> >> >> > I have tried getting the pypy source and building my own version of
>>> >> >> > pypy. I have modified
>>> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() to print out
>>> >> >> > when it starts and when it stops. Apparently, the slow queries do NOT
>>> >> >> > occur during major_collection_step; at least, I have not observed
>>> >> >> > major step output during a query execution. So, apparently, something
>>> >> >> > else is blocking. This could be another aspect of the GC, but it
>>> >> >> > could also be anything else.
>>> >> >> >
>>> >> >> > Just to be sure, I have tried running the same application in python
>>> >> >> > with garbage collection disabled. I don't see the problem there, so
>>> >> >> > it is somehow related to either the GC or the runtime.
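
(i.e. something along these lines at the top of the CPython run; note that
gc.disable() only switches off the cyclic collector, reference counting still
frees objects:)

    import gc
    gc.disable()   # turn off CPython's cyclic garbage collector for the comparison run
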
>>> >> >> >
>>> >> >> > Cheers,
>>> >> >> > /Martin
>>> >> >> >
>>> >> >> >
>>> >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch <mak at issuu.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> We have hacked up a small sample that seems to exhibit the same
>>> >> >> >> issue.
>>> >> >> >>
>>> >> >> >> We basically generate a linked list of objects. To increase
>>> >> >> >> connectedness, elements in the list hold references (dummy_links) to
>>> >> >> >> 10 randomly chosen previous elements in the list.
>>> >> >> >>
>>> >> >> >> We then time a function that traverses 50000 elements from the list
>>> >> >> >> from a random start point. If the traversal reaches the end of the
>>> >> >> >> list, we instead traverse one of the dummy links. Thus, exactly 50K
>>> >> >> >> elements are traversed every time. To generate some garbage, we build
>>> >> >> >> a list holding the traversed elements and a dummy list of characters.
>>> >> >> >>
>>> >> >> >> Timings for the last 100 runs are stored in a circular buffer. If the
>>> >> >> >> elapsed time for the last run is more than twice the average time, we
>>> >> >> >> print out a line with the elapsed time, the threshold, and the
>>> >> >> >> 90th-percentile runtime. (We would like to see that the mean runtime
>>> >> >> >> does not increase with the number of elements in the list, but that
>>> >> >> >> the max time does increase, linearly with the number of objects, I
>>> >> >> >> guess; traversing 50K elements should be independent of the memory
>>> >> >> >> size.)
>>> >> >> >>
>>> >> >> >> We have tried monitoring memory consumption by external inspection,
>>> >> >> >> but cannot consistently verify that memory is deallocated at the same
>>> >> >> >> time that we see slow requests. Perhaps the pypy runtime doesn't
>>> >> >> >> always return freed pages back to the OS?
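
(One way to do this kind of external inspection, besides watching top, is to
sample the resident set size from /proc; a Linux-only sketch, with target_pid
standing in for the pid of the pypy process:)

    import os, time

    def rss_gb(pid):
        # resident set size of <pid> in GB, read from /proc/<pid>/statm (Linux only)
        with open("/proc/%d/statm" % pid) as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024.0 ** 3)

    # e.g. poll once a second while the benchmark runs:
    # while True:
    #     print time.time(), rss_gb(target_pid)
    #     time.sleep(1)
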
>>> >> >> >>
>>> >> >> >> Using top, we observe that a run with 10M elements allocates around
>>> >> >> >> 17GB after building, 20M elements 26GB, and 30M elements 28GB (which
>>> >> >> >> grows to 35GB shortly after building).
>>> >> >> >>
>>> >> >> >> Here is output from a few runs with different numbers of elements:
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> pypy mem.py 10000000
>>> >> >> >> start build
>>> >> >> >> end build 84.142424
>>> >> >> >> that took a long time elapsed: 13.230586  slow_threshold: 1.495401  90th_quantile_runtime: 0.421558
>>> >> >> >> that took a long time elapsed: 13.016531  slow_threshold: 1.488160  90th_quantile_runtime: 0.423441
>>> >> >> >> that took a long time elapsed: 13.032537  slow_threshold: 1.474563  90th_quantile_runtime: 0.419817
>>> >> >> >>
>>> >> >> >> pypy mem.py 20000000
>>> >> >> >> start build
>>> >> >> >> end build 180.823105
>>> >> >> >> that took a long time elapsed: 27.346064  slow_threshold: 2.295146  90th_quantile_runtime: 0.434726
>>> >> >> >> that took a long time elapsed: 26.028852  slow_threshold: 2.283927  90th_quantile_runtime: 0.374190
>>> >> >> >> that took a long time elapsed: 25.432279  slow_threshold: 2.279631  90th_quantile_runtime: 0.371502
>>> >> >> >>
>>> >> >> >> pypy mem.py 30000000
>>> >> >> >> start build
>>> >> >> >> end build 276.217811
>>> >> >> >> that took a long time elapsed: 40.993855  slow_threshold: 3.188464  90th_quantile_runtime: 0.459891
>>> >> >> >> that took a long time elapsed: 41.693553  slow_threshold: 3.183003  90th_quantile_runtime: 0.393654
>>> >> >> >> that took a long time elapsed: 39.679769  slow_threshold: 3.190782  90th_quantile_runtime: 0.393677
>>> >> >> >> that took a long time elapsed: 43.573411  slow_threshold: 3.239637  90th_quantile_runtime: 0.393654
>>> >> >> >>
>>> >> >> >> Code below
>>> >> >> >> --------------------------------------------------------------
>>> >> >> >> import time
>>> >> >> >> from random import randint, choice
>>> >> >> >> import sys
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> allElems = {}
>>> >> >> >>
>>> >> >> >> class Node:
>>> >> >> >>     def __init__(self, v_):
>>> >> >> >>         self.v = v_
>>> >> >> >>         self.next = None
>>> >> >> >>         self.dummy_data = [randint(0,100)
>>> >> >> >>                            for _ in xrange(randint(50,100))]
>>> >> >> >>         allElems[self.v] = self
>>> >> >> >>         if self.v > 0:
>>> >> >> >>             self.dummy_links = [allElems[randint(0, self.v-1)]
>>> >> >> >>                                 for _ in xrange(10)]
>>> >> >> >>         else:
>>> >> >> >>             self.dummy_links = [self]
>>> >> >> >>
>>> >> >> >>     def set_next(self, l):
>>> >> >> >>         self.next = l
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> def follow(node):
>>> >> >> >>     acc = []
>>> >> >> >>     count = 0
>>> >> >> >>     cur = node
>>> >> >> >>     assert node.v is not None
>>> >> >> >>     assert cur is not None
>>> >> >> >>     while count < 50000:
>>> >> >> >>         # return a value; generate some garbage
>>> >> >> >>         acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz")
>>> >> >> >>                             for x in xrange(100)]))
>>> >> >> >>
>>> >> >> >>         # if we have reached the end, choose a random link
>>> >> >> >>         cur = choice(cur.dummy_links) if cur.next is None else cur.next
>>> >> >> >>         count += 1
>>> >> >> >>
>>> >> >> >>     return acc
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> def build(num_elems):
>>> >> >> >>     start = time.time()
>>> >> >> >>     print "start build"
>>> >> >> >>     root = Node(0)
>>> >> >> >>     cur = root
>>> >> >> >>     for x in xrange(1, num_elems):
>>> >> >> >>         e = Node(x)
>>> >> >> >>         cur.next = e
>>> >> >> >>         cur = e
>>> >> >> >>     print "end build %f" % (time.time() - start)
>>> >> >> >>     return root
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> num_timings = 100
>>> >> >> >> if __name__ == "__main__":
>>> >> >> >>     num_elems = int(sys.argv[1])
>>> >> >> >>     build(num_elems)
>>> >> >> >>     total = 0
>>> >> >> >>     timings = [0.0] * num_timings  # run times for the last num_timings runs
>>> >> >> >>     i = 0
>>> >> >> >>     beginning = time.time()
>>> >> >> >>     while time.time() - beginning < 600:
>>> >> >> >>         start = time.time()
>>> >> >> >>         elem = allElems[randint(0, num_elems - 1)]
>>> >> >> >>         assert(elem is not None)
>>> >> >> >>
>>> >> >> >>         lst = follow(elem)
>>> >> >> >>
>>> >> >> >>         total += choice(lst)[0]  # use the return value for something
>>> >> >> >>
>>> >> >> >>         end = time.time()
>>> >> >> >>
>>> >> >> >>         elapsed = end - start
>>> >> >> >>         timings[i % num_timings] = elapsed
>>> >> >> >>         if (i > num_timings):
>>> >> >> >>             slow_time = 2 * sum(timings)/num_timings  # slow defined as > 2*avg run time
>>> >> >> >>             if (elapsed > slow_time):
>>> >> >> >>                 print "that took a long time elapsed: %f  slow_threshold: %f  90th_quantile_runtime: %f" % \
>>> >> >> >>                     (elapsed, slow_time, sorted(timings)[int(num_timings*.9)])
>>> >> >> >>         i += 1
>>> >> >> >>     print total
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski
>>> >> >> >> <fijall at gmail.com>
>>> >> >> >> wrote:
>>> >> >> >>>
>>> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch <mak at issuu.com>
>>> >> >> >>> wrote:
>>> >> >> >>> > Hi Armin, Maciej
>>> >> >> >>> >
>>> >> >> >>> > Thanks for responding.
>>> >> >> >>> >
>>> >> >>> > I'm in the process of trying to determine what (if any) of the code
>>> >> >>> > I'm in a position to share, and I'll get back to you.
>>> >> >>> >
>>> >> >>> > Allowing hinting to the GC would be good. Even better would be a
>>> >> >>> > means to allow me to (transparently) allocate objects in unmanaged
>>> >> >>> > memory, but I would expect that to be a tall order :)
>>> >> >> >>> >
>>> >> >> >>> > Thanks,
>>> >> >> >>> > /Martin
>>> >> >> >>>
>>> >> >> >>> Hi Martin.
>>> >> >> >>>
>>> >> >>> Note that in case you want us to do the work of isolating the problem,
>>> >> >>> we do offer paid support to do that (then we can sign NDAs and stuff).
>>> >> >>> Otherwise we would be more than happy to fix bugs once you isolate a
>>> >> >>> part you can share freely :)
>>> >> >> >>
>>> >> >> >>
>>> >> >> >
>>> >> >
>>> >> >
>>> >
>>> >
>>
>>

