Py2.3: Feedback on Sets

Sun Aug 17 23:01:39 EDT 2003

[Beni Cherniavsky]
>> `isdisjoint` sounds like a good idea since it can be implemented
>> more effeciently than ``not a & b``.

[Raymond Hettinger]
> True enough.  It is easy to putogether an early-out algorithm using
> itertools.  So, it can be done a C speed.  The question is whether
> there are sufficient use cases to warrant expanding an already fat API.

Early-out algorithms consume time to determine when the early-out condition
obtains, and so can be a net loss (via consuming more time testing for
early-out than is saved by getting out early).  See the long comment block
before _siftup in heapq.py for a classic case where early-out usually loses
big.  For isdisjoint() it's muddier (depends on whether your data is such
that most sets fed to it are or aren't disjoint; if they're usually
disjoint, code testing for early-out is a waste of time).

...

>>> * Do you care that sets can only contain hashable elements?

>> No more than I care for dicts containing only hashable keys.  How
>> else could you define it?

> The hard way.  Keep a separate ordered list for elements that
> define ordering operations but not a hash value.  Insert and Search
> using a binary search.
>
> Keep a third list for elements that only support equality comparision.
> Use linear search for that one.
>
> This complicates the heck out of the code but would expand the range
> of things storable in sets.  The only real life use case I can think
> of is using a third party package that supplied objects without a hash
> code.

Part of Zope's ZODB (persistent object-oriented database) is an elaborate
BTree package, supplying many flavors of containers built on more-or-less
(depending on how much you squint <wink>) classical disk-based B+ trees.
Sets of C integers and sets of Python objects are some of the container
types supported.  BTrees require a total ordering among keys, but don't use
hash codes.  Most operations on B+-tree based sets take time logarithmic in
the number of elements; this includes insertion and deletion, where using a
straight sorted list instead would take linear time (much worse).

There are effective ways to use hashing schemes for disk-based containers
too.  The strongest reason for using ordered B+ trees in ZODB is that
searches in database apps very often want to find everything in a *range* of
key values, and traversing a B+ tree in sorted order is trivial (it's just a
left-to-right traversal over a linked list of the leaf buckets; a pair of
log-time searches is done first to find the range's endpoints).  So, e.g.,
where a Python dict has

    .iterkeys()

a ZODB BTree has

    .iterkeys(min=None, max=None, excludemin=False, excludemax=False)

A prime use case for an ordered implementation of sets is that adaptive
strategies such as those used in 2.3's list.sort() can greatly speed union
and intersection operations on sets of non-random data.  For example, what's
the intersection of

  {1, 2, 3, ..., 1000000}

and

  {2000000, 2000001, 2000002, ..., 3000000}

?

If sets are stored in sorted order, a quite general set intersection
algorithm can easily deduce that the intersection is empty after just a few
comparisons.  Indeed, the inspiration for 2.3's sometimes-amazing
list.sort() came from reading

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

(easy to find via Google).  BTW, there's an extensive discussion of ways to
adaptively minimize the cost of early-out gimmicks when they're not paying
in Python's Objects/listsort.txt.