[pypy-svn] pypy documentation-cleanup: Document the minimark gc.

arigo commits-noreply at bitbucket.org
Thu Apr 28 19:11:10 CEST 2011


Author: Armin Rigo <arigo at tunes.org>
Branch: documentation-cleanup
Changeset: r43739:08e8a5d17a31
Date: 2011-04-28 11:46 +0200
http://bitbucket.org/pypy/pypy/changeset/08e8a5d17a31/

Log:	Document the minimark gc.

diff --git a/pypy/doc/garbage_collection.rst b/pypy/doc/garbage_collection.rst
--- a/pypy/doc/garbage_collection.rst
+++ b/pypy/doc/garbage_collection.rst
@@ -8,10 +8,11 @@
 Introduction
 ============
 
-**Warning**: The overview and description of our garbage collection
-strategy and framework is not here but in the `EU-report on this
-topic`_.  The present document describes the specific garbage collectors
-that we wrote in our framework.
+The overview and description of our garbage collection strategy and
+framework can be found in the `EU-report on this topic`_.  Please refer
+to that file for an old, but still more or less accurate, description.
+The present document describes the specific garbage collectors that we
+wrote in our framework.
 
 .. _`EU-report on this topic`: http://codespeak.net/pypy/extradoc/eu-report/D07.1_Massive_Parallelism_and_Translation_Aspects-2007-02-28.pdf
 
@@ -28,6 +29,9 @@
 For more details, see the `overview of command line options for
 translation`_.
 
+The following overview is written in chronological order, so the "best"
+GC (which is the default when translating) is the last one below.
+
 .. _`overview of command line options for translation`: config/commandline.html#translation
 
 Mark and Sweep
@@ -126,4 +130,90 @@
 More details are available as comments at the start of the source
 in `pypy/rpython/memory/gc/markcompact.py`_.
 
+Minimark GC
+-----------
+
+This is a simplification and rewrite of the ideas from the Hybrid GC.
+It uses a nursery for the young objects, and mark-and-sweep for the old
+objects.  This is a moving GC, but objects may only move once (from
+the nursery to the old stage).
+
+The main difference with the Hybrid GC is that the mark-and-sweep
+objects (the "old stage") are directly handled by the GC's custom
+allocator, instead of being handled by malloc() calls.  The gain is that
+it is then possible, during a major collection, to walk through all old
+generation objects without needing to store a list of pointers to them.
+So as a first approximation, when compared to the Hybrid GC, the
+Minimark GC saves one word of memory per old object.
+
+There are a number of environment variables that can be tweaked to
+influence the GC.  (Their default value should be ok for most usages.)
+You can read more about them at the start of
+`rpython/memory/gc/minimark.py`_.
+
+In more details:
+
+- The small newly malloced objects are allocated in the nursery (case 1).
+  All objects living in the nursery are "young".
+
+- The big objects are always handled directly by the system malloc().
+  But the big newly malloced objects are still "young" when they are
+  allocated (case 2), even though they don't live in the nursery.
+
+- When the nursery is full, we do a minor collection, i.e. we find
+  which "young" objects are still alive (from cases 1 and 2).  The
+  "young" flag is then removed.  The surviving case 1 objects are moved
+  to the old stage. The dying case 2 objects are immediately freed.
+
+- The old stage is an area of memory containing old (small) objects.  It
+  is handled by `rpython/memory/gc/minimarkpage.py`_.  It is organized
+  as "arenas" of 256KB or 512KB, subdivided into "pages" of 4KB or 8KB.
+  Each page can either be free, or contain small objects of all the same
+  size.  Furthermore at any point in time each object location can be
+  either allocated or freed.  The basic design comes from ``obmalloc.c``
+  from CPython (which itself comes from the same source as the Linux
+  system malloc()).
+
+- New objects are added to the old stage at every minor collection.
+  Immediately after a minor collection, when we reach some threshold, we
+  trigger a major collection.  This is the mark-and-sweep step.  It walks
+  over *all* objects (mark), and then frees some fraction of them (sweep).
+  This means that the only time when we want to free objects is while
+  walking over all of them; we never ask to free an object given just its
+  address.  This allows some simplifications and memory savings when
+  compared to ``obmalloc.c``.
+
+- As with all generational collectors, this GC needs a write barrier to
+  record which old objects have a reference to young objects.
+
+- Additionally, we found out that it is useful to handle the case of
+  big arrays specially: when we allocate a big array (with the system
+  malloc()), we reserve a small number of bytes before.  When the array
+  grows old, we use the extra bytes as a set of bits.  Each bit
+  represents 128 entries in the array.  Whenever the write barrier is
+  called to record a reference from the Nth entry of the array to some
+  young object, we set the bit number ``(N/128)`` to 1.  This can
+  considerably speed up minor collections, because we then only have to
+  scan 128 entries of the array instead of all of them.
+
+- As usual, we need special care about weak references, and objects with
+  finalizers.  Weak references are allocated in the nursery, and if they
+  survive they move to the old stage, as usual for all objects; the
+  difference is that the reference they contain must either follow the
+  object, or be set to NULL if the object dies.  And the objects with
+  finalizers, considered rare enough, are immediately allocated old to
+  simplify the design.  In particular their ``__del__`` method can only
+  be called just after a major collection.
+
+- The objects move once only, so we can use a trick to implement id()
+  and hash().  If the object is not in the nursery, it won't move any
+  more, so its id() and hash() are the object's address, cast to an
+  integer.  If the object is in the nursery, and we ask for its id()
+  or its hash(), then we pre-reserve a location in the old stage, and
+  return the address of that location.  If the object survives the
+  next minor collection, we move it there, and so its id() and hash()
+  are preserved.  If the object dies then the pre-reserved location
+  becomes free garbage, to be collected at the next major collection.
+
+
 .. include:: _ref.txt


More information about the Pypy-commit mailing list