Queue cleanup

Lawrence D'Oliveiro ldo at geek-central.gen.new_zealand
Sat Sep 4 22:33:03 EDT 2010


In message <7x7hj2kyd6.fsf at ruckus.brouhaha.com>, Paul Rubin wrote:

> Lawrence D'Oliveiro <ldo at geek-central.gen.new_zealand> writes:
>
>> In message <7xmxs2uez1.fsf at ruckus.brouhaha.com>, Paul Rubin wrote:
>>
>>> GC's for large systems generally don't free (or even examine) individual
>>> garbage objects.  They copy the live objects to a new contiguous heap
>>> without ever touching the garbage, and then they release the old heap.
>>
>> And suddenly you’ve doubled the memory requirements. And on top of that,
>> since you’re moving the valid objects into different memory, you’re
>> forcing cache misses on all of them as well.
> 
> A minimal naive implementation indeed doubles the memory requirements,
> but from a Python perspective where every integer takes something like
> 24 bytes already, even that doesn't seem so terrible.

Doubling 24 is less terrible than doubling 4 or 8?? You’re kidding, right?

> More sophisticated implementations use multiple small heaps or other
> tricks.

More and more complications to patch up the idea. At some point, you have to 
admit there is something fundamentally flawed about the whole concept.

> The new heap is filled sequentially so accesses to it will have good
> locality.

Unfortunately, that‘s not how locality of reference works. It doesn’t matter 
whether the objects being accessed are close together in memory or far apart 
(not with modern fully-associative caches, anyway), what does matter is the 
frequency distribution of references, namely that the vast majority of 
references are to a tiny minority of objects.

Your generational garbage collector completely breaks this assumption, by 
regularly forcing an access to every single object in the heap. Cache-
thrashing, anyone?

> It's also the case that programs with very large memory consumption tend
> to use most of the memory for large arrays that don't contain pointers
> (think of a database server with a huge cache).  That means the gc
> doesn't really have to think about all that much of the memory.

But your generational garbage collector still has to copy those very large 
objects to the new heap, with all the cache-hostile consequences therefrom.

By the way, isn’t this the opposite of the array-of-pointers example you 
were using earlier to try to cast reference-counting in a bad light? It 
seems to me a reference count would work very well for such a large, simple 
object.

>> This is the continuing problem with garbage collection: all the attempts
>> to make it cheaper just end up moving the costs somewhere else.
> 
> Same thing with manual allocation.  That moves the costs off the
> computer and onto the programmer.  Not good, most of the time.

Unfortunately, your argument falls down. It is a truism that hardware costs 
continue to come down, while programmers remain expensive. As I said before, 
computing performance has improved by something like five orders of 
magnitude over the last half-century. This has rendered all kinds of 
techniques, like high-level languages, dynamic memory allocation, 
stacks, hardware floating-point, memory protection and so on, which were 
once considered controversial because of their expense, cheap enough to 
become commonplace.

But not garbage collection. This is because of the asymmetric way in which 
hardware has become faster: the biggest improvements have been in the parts 
that were already the fastest to begin with (the CPU), while RAM speeds have 
improved much less, and backing-store speeds least of all. Hence the need 
for intermediate layers of cache to bridge the gap. But the effectiveness of 
that caching depends crucially on certain assumptions about the runtime 
behaviour of the programs: and garbage collection breaks those assumptions.

> Really, I'm no gc expert, but the stuff you're saying about gc is quite
> ill-informed.  You might want to check out some current literature.

You may want to enlighten yourself by meditating on this seeming paradox of 
modern computing hardware: memory is cheap, but accessing memory is 
expensive.



More information about the Python-list mailing list