[Python-ideas] Ideas towards GIL removal

Talin talin at acm.org
Sun Apr 15 19:12:04 CEST 2007


Greg Ewing wrote:
> 2) Objects owned by a thread
> 
> Python code creates and destroys temporary objects
> at a high rate -- stack frames, argument tuples,
> intermediate results, etc. If the code is executed
> by a thread, those objects are rarely if ever seen
> outside of that thread. It would be beneficial if
> refcount operations on such objects could be carried
> out by the thread that created them without locking.

While we are on the topic of reference counting, I'd like to direct your 
attention to Recycler, an IBM research project:

"The Recycler is a concurrent multiprocessor garbage collector with 
extremely low pause times (maximum of 6 milliseconds over eight 
benchmarks) while remaining competitive with the best throughput- 
oriented collectors in end-to-end execution times. This paper describes 
the overall architecture of the Recycler, including its use of reference 
counting and concurrent cycle collection, and presents extensive 
measurements of the system comparing it to a parallel, stop-the-world 
mark-and-sweep collector."

There are a bunch of research papers describing Recycler which can be 
found at the following link:

http://www.research.ibm.com/people/d/dfb/publications.html

I'd start with the papers entitled "Java without the Coffee Breaks: A 
Non-intrusive Multiprocessor Garbage Collector" and "Concurrent Cycle 
Collection in Reference Counted Systems".

Let me describe a bit about how the Recycler works and how it relates to 
what you've proposed.

The basic idea is that for each thread, there is a set of thread local 
data (TLD) that contains a pair of "refcount buffers", one buffer for 
increfs and one buffer for decrefs. Each refcount buffer is a flat array 
of pointers which starts empty and gradually fills up.

The "incref" operation does not actually touch the reference count field 
of the object. Instead, an "incref" appends a pointer to the object to 
the end of the incref buffer for that thread. Similarly, a decref 
operation appends a pointer to the object to the decref buffer. Since 
the refcount buffers are thread-local, there is no need for locking or 
synchronization.

When one of the buffers gets full, both buffers are swapped out for new 
ones, and the old buffers are placed on a queue which is processed by 
the collector thread. The collector thread is the only thread which is 
allowed to actually touch the reference counts of the individual 
objects, and its the only thread which is allowed to delete objects.

Processing the buffers is relatively simple: First, the incref buffer is 
processed. The collector thread scans through each pointer in the 
buffer, and increments the refcount of each object. Then the decref 
buffer is processed in a similar way, decrementing the refcount.

However, it also needs to process the buffers for the other threads 
before the memory can be reclaimed. Recycler defines a series of 
"epochs" (i.e. intervals between collections). Within a refcount buffer, 
each epoch is represented as a contiguous range of values within the 
array -- all of the increfs and decrefs which occurred during that 
epoch. An auxiliary array records the high water mark for each epoch.

Using this information, the collector thread is able to process only 
those increfs and decrefs for all threads which occurred before the 
current epoch. This does mean that objects whose refcount falls to zero 
during the current epoch will not be deleted until the next collection 
cycle.

Recycler also handles cyclic garbage via cycle detection, which is 
described in the paper. It does not use a "mark and sweep" type of 
algorithm, but is instead able to detect cycles locally without scanning 
the entire heap.

Thus, the Recycler's use of refcount buffers achieves what you were 
trying to achieve, which is refcounting without locking. However, it 
does require access to thread-local data for each incref / release 
operation. The performance of this scheme will greatly depend on how 
quickly the code can get access to thread local data. The fastest 
possible method would be to dedicate a register, but that's infeasible 
on most systems. Another idea is for large functions to look up the TLD 
and stuff it in a local variable at the beginning of the function. For 
older source code the existing, backwards-compatible incref and decref 
macros could each individually get access to the TLD, but these would be 
slower than the more optimized methods in which the TLD was supplied as 
a parameter.

-- Talin



More information about the Python-ideas mailing list