[Python-Dev] Re: heaps

Thu, 1 May 2003 02:13:51 -0400

[Raymond Hettinger]
>> I'm quite pleased with the version already in CVS.  It is a small
>> masterpiece of exposition, sophistication, simplicity, and speed.
>> A class based interface is not necessary for every algorithm.

[David Eppstein]
> It has some elegance, but omits basic operations that are necessary for
> many heap-based algorithms and are not provided by this interface.

I think Raymond was telling you it isn't intended to be "an interface",
rather it's quite up-front about being a collection of functions that
operate directly on a Python list, implementing a heap in a very
straightforward way, and deliberately not trying to hide any of that.  IOW,
it's a concrete data type, not an abstract one.  I asked, and it doesn't
feel like apologizing for being what it is <wink>.

That's not to say Python couldn't benefit from providing an abstract heap
API too, and umpteen different implementations specialized to different
kinds of heap applications.  It is saying that heapq isn't trying to be
that, so pointing out that it isn't falls kinda flat.

> Specifically, the three algorithms that use heaps in my upper-division
> undergraduate algorithms classes are heapsort (for which heapq works
> fine, but you would generally want to use L.sort() instead), Dijkstra's
> algorithm (and its relatives such as A* and Prim), which needs the
> ability to decrease keys, and event-queue-based plane sweep algorithms
> (e.g. for finding all crossing pairs in a set of line segments) which
> need the ability to delete items from other than the top.

Then some of those will want a different implementation of a heap.  The
algorithms in heapq are still suitable for many heap applications, such as
maintaining an N-best list (like retaining only the 10 best-scoring items in
a long sequence), and A* on a search tree (when there's only one path to a
node, decrease-key isn't needed; A* on a graph is harder).

> To see how important the lack of these operations is, I decided to
> compare two implementations of Dijkstra's algorithm.

I don't think anyone claimed-- or would claim --that a heapq is suitable for
all heap purposes.

> The priority-dict implementation from
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/119466 takes as
> input a graph, coded as nested dicts {vertex: {neighbor: edge length}}.
> This is a variation of a graph coding suggested in one of Guido's essays
> that, as Raymond suggests, avoids using a separate class based interface.
>
> Here's a simplification of my dictionary-based Dijkstra implementation:
>
> def Dijkstra(G,start,end=None):
>     D = {}   # dictionary of final distances
>     P = {}   # dictionary of predecessors
>     Q = priorityDictionary()   # est.dist. of non-final vert.
>     Q[start] = 0
>     for v in Q:
>         D[v] = Q[v]
>         for w in G[v]:
>             vwLength = D[v] + G[v][w]
>             if w not in D and (w not in Q or vwLength < Q[w]):
>                 Q[w] = vwLength
>                 P[w] = v
>    return (D,P)
>
> Here's a translation of the same implementation to heapq (untested
> since I'm not running 2.3).  Since there is no decrease in heapq, nor
> any way to find and remove old keys,

A heapq *is* a list, so you could loop over the list to find an old object.
I wouldn't recommend that in general <wink>, but it's easy, and if the need
is rare then the advertised fact that a heapq is a plain list can be very
convenient.  Deleting an object from "the interior" still isn't supported
directly, of course.  It's possible to do so efficiently with this
implementation of a heap, but since it doesn't support an efficient way to
find an old object to begin with, there seemed little point to providing an
efficient delete-by-index function.  Here's one such:

import heapq

def delete_obj_at_index(heap, pos):
    lastelt = heap.pop()
    if pos >= len(heap):
        return

    # The rest is a lightly fiddled variant of heapq._siftup.
    endpos = len(heap)
    # Bubble up the smaller child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child.
        rightpos = childpos + 1
        if rightpos < endpos and heap[rightpos] <= heap[childpos]:
            childpos = rightpos
        # Move the smaller child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now.  Put lastelt there, and and bubble
    # it up to its final resting place (by sifting its parents down).
    heap[pos] = lastelt
    heapq._siftdown(heap, 0, pos)

> I changed the algorithm to add new tuples for each new key, leaving the
> old tuples in place until they bubble up to the top of the heap.
>
> def Dijkstra(G,start,end=None):
>     D = {}   # dictionary of final distances
>     P = {}   # dictionary of predecessors
>     Q = [(0,None,start)]  # heap of (est.dist., pred., vert.)
>     while Q:
>         dist,pred,v = heappop(Q)
>         if v in D:
>             continue  # tuple outdated by decrease-key, ignore
>         D[v] = dist
>         P[v] = pred
>         for w in G[v]:
>             heappush(Q, (D[v] + G[v][w], v, w))
>     return (D,P)
>
> My analysis of the differences between the two implementations:
>
> - The heapq version is slightly complicated (the two lines
> if...continue) by the need to explicitly ignore tuples with outdated
> priorities.  This need for inserting low-level data structure
> maintenance code into higher-level algorithms is intrinsic to using
> heapq, since its data is not structured in a way that can support
> efficient decrease key operations.

It surprised me that you tried using heapq at all for this algorithm.  I was
also surprised that you succeeded <0.9 wink>.

> - Since the heap version had no way to determine when a new key was
> smaller than an old one, the heapq implementation needed two separate
> data structures to maintain predecessors (middle elements of tuples for
> items in queue, dictionary P for items already removed from queue).  In
> the dictionary implementation, both types of items stored their
> predecessors in P, so there was no need to transfer this information
> from one structure to another.
>
> - The dictionary version is slightly complicated by the need to look up
> old heap keys and compare them with the new ones instead of just
> blasting new tuples onto the heap.  So despite the more-flexible heap
> structure of the dictionary implementation, the overall code complexity
> of both implementations ends up being about the same.
>
> - Heapq forced me to build tuples of keys and items, while the
> dictionary based heap did not have the same object-creation overhead
> (unless it's hidden inside the creation of dictionary entries).

Rest easy, it's not.

> On the other hand, since I was already building tuples, it was
> convenient to also store predecessors in them instead of in some
> other structure.
>
> - The heapq version uses significantly more storage than the dictionary:
> proportional to the number of edges instead of the number of vertices.
>
> - The changes I made to Dijkstra's algorithm in order to use heapq might
> not have been obvious to a non-expert; more generally I think this lack
> of flexibility would make it more difficult to use heapq for
> cookbook-type implementation of textbook algorithms.

Depends on the specific algorithms in question, of course.  No single heap
implementation is the best choice for all algorithms, and heapq would be
misleading people if, e.g., it did offer a decrease_key function -- it
doesn't support an efficient way to do that, and it doesn't pretend to.

> - In Dijkstra's algorithm, it was easy to identify and ignore outdated
> heap entries, sidestepping the inability to decrease keys.  I'm not
> convinced that this would be as easy in other applications of heaps.

All that is explaining why this specific implementation of a heap isn't
suited to the task at hand.  I don't believe that was at issue, though.  An
implementation of a heap that is suited for this task may well be less
suited for other tasks.

> - One of the reasons to separate data structures from the algorithms
> that use them is that the data structures can be replaced by ones with
> equivalent behavior, without changing any of the algorithm code.  The
> heapq Dijkstra implementation is forced to include code based on the
> internal details of heapq (specifically, the line initializing the heap
> to be a one element list), making it less flexible for some uses.
> The usual reason one might want to replace a data structure is for
> efficiency, but there are others: for instance, I teach various
> algorithms classes and might want to use an implementation of Dijkstra's
> algorithm as a testbed for learning about different priority queue data
> structures.  I could do that with the dictionary-based implementation
> (since it shows nothing of the heap details) but not the heapq one.

You can wrap any interface you like around heapq (that's very easy to do in
Python), but it won't change that heapq's implementation is poorly suited to
this application.  priorityDictionary looks like an especially nice API for
this specific algorithm, but, e.g., impossible to use directly for
maintaining an N-best queue (priorityDictionary doesn't support multiple
values with the same priority, right?  if we're trying to find the 10,000
poorest people in America, counting only one as dead broke would be too
Republican for some peoples' tastes <wink>).  OTOH, heapq is easy and
efficient for *that* class of heap application.

> Overall, while heapq was usable for implementing Dijkstra, I think it
> has significant shortcomings that could be avoided by a more
> well-thought-out interface that provided a little more functionality and
> a little clearer separation between interface and implementation.

heapq isn't trying to separate them at all -- quite the contrary!  It's much
like the bisect module that way.  They find very good uses in practice.

I should note that I objected to heapq at the start, because there are
several important heap implementation techniques, and just one doesn't fit
anyone all the time.  My objection went away when Guido pointed out how much
like bisect it is:  since it doesn't pretend one whit to generality or
opaqueness, it can't be taken as promising more than it actually does, nor
can it interfere with someone (so inclined) defining a general heap API:
it's not even a class, just a handful of functions.  Useful, too, just as it
is.  A general heap API would be nice, but it wouldn't have much (possibly
nothing) to do with heapq.