[Python-Dev] decorate-sort-undecorate

Tim Peters tim.one at comcast.net
Tue Oct 14 16:44:30 EDT 2003


[Fred]
>>>> We could allocate a second array of PyObject* to mirror the list
>>>> contents; that would have only the keys.  When two values are
>>>> switched in the sort, the values in both the key list and the value
>>>> list can be switched.  When done, we only need to decref the
>>>> computed keys and free the array of keys.

[Guido]
>>> I can't tell if that'll work, but if it does, it would be a great
>>> solution.

[Tim]
>> I mentioned that before -- doubling the amount of data movement
>> would hurt, at best by blowing cache all to hell.
>>
>> There's a related approach, though:  build a distinct vector of
>> custom objects, each containing:
>>
>> 1. A pointer to the key.
>> 2. The original index, as a C integer.
>>
>> This is similar to, but smaller than, something mentioned before.

[Guido]
> But wouldn't the memory allocated for all those tiny custom objects
> also be spread all over the place and hence blow away the cache?

Probably no more so than the data in the original list, which was already
spread all over the place.  The mergesort has (per merge) two input areas and
one output
area, which are contiguous vectors and accessed strictly left to right, one
slot at a time, a cache-friendly access pattern.  The real data pointed to
by the vectors is all over creation, but the vectors themselves are
contiguous.  We seem to get a lot of good out of that.  For example, the
version of weak heapsort I coded did very close to the theoretical minimum #
of comparisons on randomly ordered data, and was algorithmically much
simpler than the adaptive mergesort, yet ran much slower.  That can't be
explained by # of comparisons or by instruction count.  The one obvious
difference is that weak heapsort leaps all over the vector, in as
cache-hostile a way as imaginable.  The mergesort also has some success in
reusing small merged areas ASAP, while they're still likely to be in cache.

If we were to mirror loads and stores across two lists in lockstep, we'd be
dealing with 4 input areas and 2 output areas per merge.  That's a lot more
strain on a cache, even if the access pattern per area is cache-friendly on
its own.

Of course a large sort is going to blow the cache wrt what the program was
doing before the sort regardless.

> I guess another approach would be to in-line those objects so that we
> sort an array of structs like this:
>
>   struct {
>     PyObject *key;
>     long index;
>   }
>
> rather than an array of PyObject*.  But this would probably require
> all of the sort code to be cloned.

Following Neil, we could change that to store a pointer to the object rather
than an index.  I agree that would be cache-friendlier, but I don't know how
much better it might do.  Dereferencing key pointers acts in the other
direction in all schemes, because the memory holding the keys is likely all
over the place (and probably gets worse as the sort goes on, and the initial
order gets scrambled).  No way to know except to try it and time it.
