[Python-Dev] Cyclic GC issues

Mon Oct 11 00:45:00 CEST 2004

[Jason Evans]
> Since the spring of 2003, I have been developing Crux, which is a computer
> program for phylogenetic inferencing (bioinformatics) research.  In March
> of 2004, I switched Crux to using Python from having used a different
> embeddable interpreter.  For the most part, I have been very happy with
> Python, but Python's garbage collector has been a major source of
> frustration.

Then you're not using it correctly -- it's easy if you are.  You
didn't show actual code, though, and I don't have time to guess what
you might have done.

> ...
> The important aspect of Crux is that it deals with trees.

The details shouldn't matter; Python's cyclic gc handles arbitrary
graph structures, including multiple self-loops if you want them.  The
Python core creates graph structures of boggling complexity, so it
might help you to study how core objects implement their tp_traverse
and tp_clear methods.  They're mostly very simple.  Where they're not
very simple, it's usually because someone had a silly idea of
"optimization" in mind.

...
> 1) I originally had lone rings refer to themselves (ob_refcnt started out
>   at 3; 2 self-references and one reference held by the associated edge).
>   This didn't work.

"Didn't work" leaves the reader guessing about everything.  Showing
code, and being specific about both what you expected and what you
observed, are necessary.  python-dev isn't really the right place for
that, though.

>   It appears that the cyclic GC does not actually  calculate the number of live
>   references to objects (references that are reachable by traversing all objects
>   accessible from the root set);

Python's gc has no "root set" concept, so thinking in those terms
won't help.  It won't grow a root-set concept, either:  "foreign"
extension modules (like Tcl/Tk, or library packages written in
Fortran) can hold what would normally be considered root-set objects
for Python, and that's fine.  They're not required to cooperate with
Python's memory management beyond respecting Python's refcount rules. 
Forcing modules (via some new API) to identify root-set objects they
own would be backward-incompatible (not to mention a major new burden
on extension authors and users), so won't happen.

>   instead it assumes that if tp_clear() doesn't drop the reference count
>   to a number that equals the number of references from live objects,
>   there must still be references from live objects.  Unfortunately,
>   visit()ing self does not work, so there is no way to convince Python
>   that all references are from unreachable objects.

Again I can't guess what "does not work" means.  But I can assure you
it "would work" if I wrote it <wink>.  It's easy:  an object's
tp_traverse must call the visit() callback exactly once for each
non-NULL PyObject* *directly* (in one step) reachable from the object,
and that's all.  If self is reachable from self via a refcounted
pointer held by self, then self's tp_traverse() implementation must
call visit(self, arg).

But I think you're getting off track even *caring* what your pointers
point at.  If some object type T has three Py_Object* pointers pa, pb,
pc, then using Python 2.4's Py_VISIT macro, T's tp_traverse should be
coded

static int
T_traverse(T *self, visitproc visit, void *arg)
{
    Py_VISIT(self->pa);
    Py_VISIT(self->pb);
    Py_VISIT(self->pc);
    return 0;
}

It doesn't matter what pa (etc) points at.  If pa points to self,
fine; if it doesn't, also fine.

>   Working around this in Crux requires a lot of extra reference counting
>   complexity, because there are three different cases for reference
>   counts, depending on how many members there are in a ring (1, 2, or 3+
>   members).

Please show code.  If, e.g., you have a vector v of length n holding
the PyObject* "members", then

    int i;
    for (i = 0; i < self->n; ++i)
        Py_VISIT(self->v[i]);
    return 0;

is an appropriate tp_traverse().  For example, that's exactly what the
tp_traverse() for Python's builtin list objects does, and a Python
list L can certainly be an element of itself:

>>> L = []
>>> L.append(L)
>>> L[0] is L
True
>>>

There's no special-casing needed for the number of containees, or for
the nature of what they do or don't point at.  It's also irrelevant
what can be reached *from* the direct containees.

> 2) This issue is really just a different manifestation of issue (1).
>
>   At the C (not Python object) level, each node only actually stores a
>   pointer to a single member of the associated ring.

Please show code.  If the C object contains exactly one pointer, then
its tp_traverse should call visit() exactly once; etc.  The correct
C-level code has entirely to do with what the C struct contains, and
nothing to do with the view presented at the Python level.

>   Given a single ring member, it is possible to traverse the ring and reach all
>   other ring members.

That's irrelevant to gc.  tp_traverse() is only responsible for
revealing the objects *directly* reachable from self, or, more
precisely, for revealing the PyObject* pointers on which self owns a
reference.  gc computes the transitive closure itself; tp_traverse()
isn't responsible for that.

>   As mentioned in issue (1), the cyclic GC expects tp_traverse()
>   to call visit() once for each reference held.

Yes.

>   It is not enough for a node to visit() one ring member;

Don't know what it means for "a node to visit()".  If the C struct has
exactly one pointer, then it is enough for visit() to be invoked
exactly once with that pointer.  It would be incorrect not to visit()
that pointer, and it would be incorrect to visit() more pointers than
just that one.

>   it must visit() all ring members,

If and only if for each ring member M, self contains a pointer
directly to M.  Showing code is much better than English.  View your
C-level objects as a directed graph.  tp_traverse at a node `self` is
reponsible for revealing self's outgoing edges, no less and no more
than that.

>   in order for the GC to count how many references are from unreachable
>   objects, versus reachable from the root set.
> 
>   In summary, issues (1) and (2) are due to how the cyclic GC does the
>   "marking" phase of mark/sweep GC.

It's not a mark/sweep collector.

>  My expectation of mark/sweep GC is

You expectation is irrelevant <wink>, since it's not a mark/sweep collector.

>   that it should be sufficient to assure that all objects reachable from
>   the root set are visit()ed at least once; it should not be important how
>   many times each unreachable object is visit()ed.

Forget roots sets and mark/sweep collectors.  If you want to know how
Python's gc actually works, read the comments in gcmodule.c's
collect() function, and drill down from there as deep as your
curiosity takes you.

>   I don't have a deep enough understanding of the Python interpreter to
>   give a detailed suggestion for improvement.  I suspect that determining
>   the root set is not an easy operation; if this is the case, then I think
>   we're stuck with the current design.

We are, but I don't feel "stuck" with it -- it's a very good design.

>   If determining the root set *is* easy (or even possible), then I would suggest
>   using a standard mark/sweep collector, where unreachable objects are
>   scheduled for destruction.

The root set is impossible to determine without active cooperation
from extension modules.  Python's gc can infer the existence of roots,
but not their identies;  it can infer the transitive closure of what's
reachable from the root set intersected with the set of objects that
have registered with gc, but knows nothing about the roots themselves.

>   tp_traverse(), tp_clear(), and tp_dealloc() would retain
>   the same structure; the only difference would be the logic that
>   determines which objects can be destroyed.
>
> 3) A strange thing can happen when tp_clear() is called.  Suppose that an
>   edge object is being cleared, and it drops its references to the
>   associated rings.  If ob_refcnt of one of the rings drops to 0 as a
>   result, Python will tp_dealloc() the ring *right* *now*, without ever
>   calling tp_clear() on the ring.

That depends on how the tp_dealloc() for the ring is coded, so is up
to you.  The only *necessary* thing for tp_clear() to do is to clear
enough pointers to break cycles.  Some people do only that much.  For
example, Python tuples don't even implement tp_clear, because a cycle
involving tuples necessarily involves non-tuple objects too.  Other
people write tp_clear() so that it can be called from their
tp_dealloc() function too.  That's up to them.  If you want to do the
latter, it's good to write tp_clear() in such a way that calling it
more than once is harmless; e.g.,

    if (self->pa) {
        Py_DECREF(self->pa);
        self->pa = NULL;
    }
    etc

or, using 2.4's Py_CLEAR macro,

    Py_CLEAR(self->pa);
    etc.

>   That means that I have to keep track of whether tp_clear() has been called
>   on objects, if it is at all important that tp_clear() be called, so that I can
>   manually do so in tp_dealloc(),

As above, make your tp_clear() idempotent and then you don't have to
keep track of anything.  It's usually good practice for a tp_clear()
to be robust against NULL pointers anyway.

>   if necessary.  It is in my opinion reasonable to have cleanup code in
>   tp_clear(), with the assumption that it will be called precisely once,
>   but Python makes me do extra work to make sure that this happens.

You won't get a guarantee that tp_clear() will be called exactly once.
 You have a guarantee that tp_dealloc() will be called no more than
once.  It's actually unusual to find a tp_dealloc() that calls its
type's tp_clear(), but I think it's a nice pattern when you can get
away with it.  In such cases, tp_clear() should be coded so that it
can be called multiple times, and that's generally very easy to do.

>   This should be pretty easy to change.  A single bit per object is needed
>   to keep track of whether tp_clear() has been called.  I think this only
>   needs to be done for object types that support cyclic GC.

There's no such thing as "one bit" in reality:  because of malloc
memory padding, "one more bit" would require adding 4 bytes to every
gc'able object.  That's an extraordinary expense.  Luckily, it's not
needed.

> 4) There is no way to control the order in which objects are
>   tp_dealloc()ed.

It's true that there's no direct way to control the order.

>   This is a problem for the R-E-R construct, since at a low level, these objects
>   are always linked together.

Many things are always linked together in the core too.  Why that's "a
problem" for you isn't clear.  It's not a problem in the core, for
example.

>   What I would like to do is defer tp_dealloc() on the edge until after both
>   rings have been deleted.  Instead, I am forced to use a reference-counted
>   deletion function.

Sorry, don't know what that means, and (as above) don't know why you care.

>   Not calling self->ob_type->tp_free() on the edge in tp_dealloc() until
>   later is not a reasonable option, because this defers deletion of the
>   edge until a later round of garbage collection.
>
>   This could be addressed in the Python interpreter by paying heed to the
>   return value of tp_dealloc().  If the return value is non-zero, move the
>   object to the end of the list of objects to be destroyed,

Except there is no list of objects to be destroyed.

>   so that destruction is tried later.  This allows the module to impose its own
>   destruction ordering.

That part would need clearer motivation, preferably in the form of a PEP.