Python's garbage collection was Re: Python reliability

Fri Oct 14 01:55:27 EDT 2005

skip at pobox.com writes:
> Folks, most common GC schemes have been tried as experiments over
> the years.  None have succeeeded, for various reasons.  I think one
> of the main reasons is that Python has to "play nice" with external
> libraries, many of which weren't written with GC beyond malloc and
> free in mind.

Most GC'd language implementations I've seen actually have friendlier
FFI's (foreign function interfaces) than Python does.  I find it very
easy to make errors with the reference counts in what little Python C
extension hacking I've done.  Maybe one gets better at it with
experience.

>     Tagged integers: 
>         http://mail.python.org/pipermail/python-dev/2004-July/046139.html

That actually says there was a speedup from tagged ints with no
obvious counterbalancing cost.  However, it looks like there's no
traditional GC, it stays with the same reference counts for heap objects.

>     Boehm GC:
>         http://mail.python.org/pipermail/python-dev/2005-January/051370.html

This mentions use of Python idioms like txt = open(filename).read()
expecting the file handle to get freed immediately.  I think there's
been some language extensions ("with" statement) discussed to replace
that idiom.  Note that the idiom is already not so safe in Jython.

The Boehm GC doesn't seem like the right choice for an interpreter
being built from scratch anyway, but who knows.  Maybe I'm being
premature but I'm already thinking of CPython as a legacy
implementation and PyPy as the more likely testbed for stuff like
this.  I sorta wanted to go to the current sprint but oh well.

> http://wiki.python.org/moin/CodingProjectIdeas/PythonGarbageCollected

This correctly describes difficulties of using a copying GC in
CPython.  Note that the Boehm GC is mark-and-sweep.  As Alex mentions,
that usually means there's a pause every so often while the GC scans
the entire heap, touching all data both live and dead (maybe the Boehm
GC got around this somehow).

I haven't been keeping up with this stuff in recent years so I have a
worse concern.  I don't know whether it's founded or not.  Basically
in the past decade or so, memory has gotten 100x larger and cpu's have
gotten 100x faster, but memory is less than 10x faster once you're out
of the cpu cache.  The mark phase of mark/sweep tends to have a very
random access pattern (at least for Lisp).  In the old days that
wasn't so bad, since a random memory access took maybe a couple of cpu
cycles, but today it takes hundreds of cycles.  So for applications
that use a lot of memory, simple mark/sweep may be a much worse dog
than it was in the Vax era, even if you don't mind the pauses.

>     Miscellaneous:
>       http://mail.python.org/pipermail/python-dev/2002-June/026032.html

That looks potentially kind of neat, though it's not obvious what it does.

>       http://mail.python.org/pipermail/python-dev/2003-November/040299.html

I'll try to read that sometime out of general interest.  I'm too
sleepy right now.

> And lest anyone here think they were the first to suggest getting rid of
> reference counting in Python:
> 
>     http://www.python.org/search/hypermail/python-1993/0554.html

I think the language suggestions in that message are sound in
principle, if not quite the right thing in every detail.  Lisp went
through about the same evolution in the 60's or 70's (before my time).

With PyPy, it looks like Python implementation methodology is getting
a lot more advanced, so hopefully some big performance gains are
possible, and the language will likely evolve to help things along.