[Python-Dev] Re: heaps

Tim Peters tim.one@comcast.net
Sun, 04 May 2003 01:26:09 -0400


[Tim]
>> ...
>> priorityDictionary looks like an especially nice API for this specific
>> algorithm, but, e.g., impossible to use directly for maintaining an N-
>> best queue (priorityDictionary doesn't support multiple values with
>> the same priority, right?

That was wrong:  the dict maps items to priorities, and I read it backwards.
Sorry!

>> if we're trying to find the 10,000 poorest people in America, counting
>> only one as dead broke would be too Republican for some peoples' tastes
>> <wink>).  OTOH, heapq is easy and efficient for *that* class of heap
>> application.

[David Eppstein]
> I agree with your main points (heapq's inability to handle
> certain priority  queue applications doesn't mean it's useless, and
> its implementation-specific API helps avoid fooling programmers into
> thinking it's any more than what it is).  But I am confused at this
> example.  Surely it's just as easy to store (income,identity) tuples in
> either data structure.

As above, I was inside out.

"Just as easy" can't be answered without trying to write actual code,
though.  Given that heapq and priorityDictionary are both min-heaps, to
avoid artificial pain let's look for the people with the N highest incomes
instead.

For an N-best queue using heapq, "the natural" thing is to define people
like so:

class Person:
    def __init__(self, income):
        self.income = income

    def __cmp__(self, other):
        return cmp(self.income, other.income)

and then the N-best calculation is as follows; it's normal in N-best
applications that N is much smaller than the number of items being ranked,
and you don't want to consume more than O(N) memory (for example, google
wants to show you the best-scoring 25 documents of the 6 million matches it
found):

"""
# N-best queue for people with the N largest incomes.
import heapq

dummy = Person(-1)  # effectively an income of -Inf
q = [dummy] * N     # it's fine to use the same object N times

for person in people:
    if person > q[0]:
        heapq.heapreplace(q, person)

# The result list isn't sorted.
result = [person for person in q if q is not dummy]
"""

I'm not as experienced with priorityDictionary.  For heapq, the natural
__cmp__ is the one that compares objects' priorities.  For
priorityDictionary, we can't use that, because Person instances will be used
as dict keys, and then two Persons with the same income couldn't be in the
queue at the same time.  So Person.__cmp__ will have to change in such a way
that distinct Persons never compare equal.  I also have to make sure that a
Person is hashable.  I see there's another subtlety, apparent only from
reading the implementation code:  in the heap maintained alongside the dict,
it's actually

    (priority, object)

tuples that get compared.  Since I expect to see Persons with equal income,
when two such tuples get compared, they'll tie on the priority, and go on to
compare the Persons.  So I have to be sure too that comparing two Persons is
cheap.

Pondering all that for a while, it seems best to make sure Person doesn't
define __cmp__ or __hash__ at all.  Then instances will get compared by
memory address, distinct Persons will never compare equal, comparing Persons
is cheap, and hashing by memory address is cheap too:

class Person:
    def __init__(self, income):
        self.income = income

The N-best code is then:

"""
q = priorityDictionary()

for dummy in xrange(N):
    q[Person(-1)] = -1   # have to ensure these are distinct Persons

for person in people:
    if person.income > q.smallest().income:
        del q[q.smallest()]
        q[person] = person.income

# The result list is sorted.
result = [person for person in q if person.income != -1]
"""

Perhaps paradoxically, I had to know essentially everything about how
priorityDictionary is implemented to write a correct and efficient algorithm
here.  That was true of heapq too, of course, but there were fewer
subtleties to trip over there, and heapq isn't trying to hide its
implementation.

BTW, there's a good use of heapq for you:  you could use it to maintain the
under-the-covers heap inside priorityDictionary!  It would save much of the
code, and possibly speed it too (heapq's implementation of popping usually
requires substantially fewer comparisons than priorityDictionary.smallest
uses; this is explained briefly in the comments before _siftup, deferring to
Knuth for the gory details).

> If you mean, you want to find the 10k smallest income values (rather than
> the people having those incomes), then it may be that a better data
> structure would be a simple list L in which the value of L[i] is
> the count of people with income i.

Well, leaving pennies out of it, incomes in the USA span 9 digits, so
something taking O(N) memory would still be most attractive.