sorteddict [was a PEP proposal, but isn't anymore!]

Thu Sep 27 08:50:45 EDT 2007

I don't see a focused discussion of computational complexity of a
sorted dict; its API cannot be simpler than sorting a dictionary and
it has issues and complications that have already been discussed
without completely satisfactory solutions, so the only possible reason
to adopt a sorted dict is that some important use case for mapping
types becomes significantly cheaper.

With n entries, the size of a non-sorted hashtable, of a hashtable
plus lists or sets of keys, and of reasonable sorted dict
implementations with trees are all O(n). No substantial space
advantage can be obtained by sorting dictionaries.

Iterating through all n entries of a mapping, once, in sorted order,
is O(n) time and O(1) space with an unsorted hash table, a hash table
with a sorted list of the keys and all  types of tree that I know of.
If there is a performance gain, it must come from amortizing
insertions, deletions and index-building. (The other operation, value
updates for an existing key, doesn't matter: updates cause no
structural changes and they must not invalidate any iterator.)

Let's consider a very simple use case: n insertions followed by x
iterations through all entries and n*y lookups by key.

Cost for a hashtable and an ad hoc sorted list of the keys,
fundamentally equivalent to sorting a Python dict:

O(n) for insertions
O(n log n) for indexing
O(nx) for iterations
O(ny) for lookups

Cost for a tree:

O(n log n) for insertions
no indexing
O(nx) for iterations
O(ny log n) for lookups

The hashtable comes out ahead because of cheaper lookups, for any x
and y; note that without lookups there is no reason to use a mapping
instead of a list of (key,value) tuples.

With an equal number k of insertions and deletions between the
iterations, the hashtable must be reindexed x times:

O(n) for insertions
O(kx) for updates and deletions
O(nx log n) for indexing and reindexing
O(nx) for iterations
O(ny) for lookups

The tree might be cheaper:

O(n log n) for insertions
O(kx log n) for updates and deletions
no indexing and reindexing
O(nx) for iterations
O(ny log n) for lookups

For a fixed small k, or with k proportional to n, reindexing the
hashtable and lookups in the tree are equally mediocre.

Maybe we could make k changes in the middle of each iteration. For a
naively reindexed hashtable:

O(n) for insertions
O(kx) for updates and deletions
O(knx log n) for indexing and reindexing
O(nx) for iterations
O(ny) for lookups

For a tree, the costs remain as above: the new factor of n for the
hashtable is fatal. Clever updates of the existing index or use of a
heap would lower the cost, but they would need to be encapsulated as a
sorteddict implementation.

Is this a practical use case? When are sequential visits of all
elements in order frequently suspended to make insertions and
deletions, with a need for efficient lookup by key?
- Priority queues; but given the peculiar pattern of insertions and
deletions there are more suitable specialized data structures.
- A* and similar best-first algorithms.
It's a small but important niche; maybe it isn't important enough for
the standard library. Other absent datatypes like heaps, an immutable
mapping type similar to frozenset and tuple, or disjoint sets with
would be more fundamental and general, and a mapping that remembers
the order of insertion regardless of keys would be equally useful.
In the Java collections framework all these kinds of mapping and
others coexist peacefully, but Python doesn't have the same kitchen
sink approach to basic libraries.

Regarding the API, a sorted dict should not expose random access by an
entry's position in the sequence: it is a gratuitous difficulty for
the implementor and, more importantly, a perversion of the mapping
data type. For that purpose there are lists and tuples, or explicit
indices like those of the Boost multi-index containers (http://
www.boost.org/libs/multi_index).
The only differences with dict should be the constraint that items(),
keys(), values(), iteritems(), iterkeys(), itervalues() return entries
sorted by key.

Regards,

Lorenzo Gatti