Order of tuples in dict.items()

Paul Rubin
Sun Oct 14 19:01:42 EDT 2007


Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
> I've never seen the point of a sorted dictionary, it's easy to just say:
> for key, value in sorted(D.items())

You might want just a subrange of the dictionary (say the 100th
through 150th items in sorted order) without having to sort the entire
dictionary.
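
A structure that keeps its keys in order makes that kind of subrange
query cheap.  For instance (not part of the original discussion; this
uses the third-party sortedcontainers package):

    from itertools import islice
    from sortedcontainers import SortedDict   # third-party: pip install sortedcontainers

    d = SortedDict((i, i * i) for i in range(10000))   # items stay sorted by key
    middle = list(islice(d.items(), 100, 150))         # 100th-150th items, no full sort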

Also, with the right data structures you can support fast non-mutating
updates, i.e.

    e = d.updated(x=3)

is equivalent to:

    e = dict((k, v) for k, v in d.iteritems())   # copy all entries from d
    e['x'] = 3                                   # do the update

but the sorted version can work in O(log n) time instead of O(n).  The
new dictionary shares most of its contents with the old dictionary,
except that now you can no longer mutate one without mutating the other,
so it's better not to do any in-place updates.
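
To make that concrete, here's a bare-bones sketch of the idea (my
illustration, not anything from an existing library): a binary search
tree whose update copies only the nodes along the search path, so the
old and new trees share everything else.  A real implementation would
keep the tree balanced (red-black, AVL, treap, ...) to guarantee the
O(log n) bound; this one only shows the structure sharing:

    class Node(object):
        __slots__ = ('key', 'value', 'left', 'right')
        def __init__(self, key, value, left=None, right=None):
            self.key, self.value = key, value
            self.left, self.right = left, right

    def updated(node, key, value):
        # Return a new tree with key set to value; the old tree is untouched.
        # Only the nodes on the search path are copied, the rest is shared.
        if node is None:
            return Node(key, value)
        if key < node.key:
            return Node(node.key, node.value,
                        updated(node.left, key, value), node.right)
        if key > node.key:
            return Node(node.key, node.value,
                        node.left, updated(node.right, key, value))
        return Node(key, value, node.left, node.right)   # overwrite existing key

    def get(node, key, default=None):
        while node is not None:
            if key < node.key:
                node = node.left
            elif key > node.key:
                node = node.right
            else:
                return node.value
        return default

    def items(node):
        # In-order traversal: items come out already sorted by key.
        if node is not None:
            for item in items(node.left):
                yield item
            yield node.key, node.value
            for item in items(node.right):
                yield item

The in-order traversal at the end is also why the sorting "comes
naturally" with this kind of structure.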

Use case (inspired by Happs State, which was inspired by Prevayler):
You are writing a web app that has to remember a lot of request data
and you don't want to use an SQL database (too much overhead).  Your
app gets requests containing name/value pairs.  It responds with the
old value and then updates its stored value.  I.e. your server process
looks like:

   d = {}
   while True:
      name, new_value = get_request()
      old_value = d.get(name, None)
      d[name] = new_value
      send_response(old_value)

This is very efficient (just in-memory dictionary access, no SQL
garbage) but has an obvious problem: what if the computer crashes?
You have to store the stuff on disk, too.  So you put:

     log_to_disk(name, new_value)

right after the get_request.  If the computer crashes, reboot it and
play back the log from the beginning.
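
Spelled out, a minimal version of that might look like this (my sketch;
the file name and record format are made up, and a real version would
also fsync for durability):

    import json, time

    LOG_PATH = 'updates.log'     # hypothetical append-only log file

    def log_to_disk(name, new_value, ts=None):
        # One JSON record per line, so a crash can only damage the last line.
        with open(LOG_PATH, 'a') as f:
            f.write(json.dumps({'name': name, 'value': new_value,
                                'ts': ts if ts is not None else time.time()}) + '\n')

    def replay_log(d):
        # On reboot, re-apply every logged update in order to rebuild the dict.
        try:
            with open(LOG_PATH) as f:
                for line in f:
                    rec = json.loads(line)
                    d[rec['name']] = rec['value']
        except IOError:          # nothing logged yet
            pass
        return d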

But that's no good either: the server could run for years nonstop and
get billions of updates, and you don't want to play them all back from the
beginning.  You want to take a checkpoint now and then, so you can
restart from there:

   d = {}
   while True:
      name, new_value = get_request()
      log_to_disk(name, new_value, timestamp())
      if checkpoint_now():   # do this once an hour or so
          log_to_disk(pickle.dumps(d), timestamp())
      old_value = d.get(name, None)
      d[name] = new_value
      send_response(old_value)

Now to restore, you find the newest pickle, read it in, and then apply
any logged updates with a newer timestamp, i.e. just the last hour's worth.
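
In code, the restore might look something like this (again my sketch;
I'm assuming each checkpoint goes to its own timestamped pickle file
rather than into the update log, and that log records carry the 'ts'
field from the earlier sketch):

    import glob, json, pickle

    def restore(log_path='updates.log'):
        d, cutoff = {}, 0.0
        checkpoints = sorted(glob.glob('checkpoint-*.pkl'))
        if checkpoints:
            newest = checkpoints[-1]             # e.g. 'checkpoint-1192406502.pkl'
            cutoff = float(newest[len('checkpoint-'):-len('.pkl')])
            with open(newest, 'rb') as f:
                d = pickle.load(f)
        with open(log_path) as f:
            for line in f:
                rec = json.loads(line)
                if rec['ts'] > cutoff:           # only updates newer than the checkpoint
                    d[rec['name']] = rec['value']
        return d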

But this is no good either, because d can have millions of items, and
your server freezes while you're pickling and writing it.  You want a
separate thread doing that--except d is busily mutating while that
operation is in progress!  You can go crazy with locks, rollback
buffers, and all the insanity that SQL servers use, or with the sorted
immutable dictionary, you can simply write:

   d = frozendict()
   while True:
      name, new_value = get_request()
      log_to_disk(name, new_value, timestamp())
      if checkpoint_now():   # do this once an hour or so
          # pickle the (immutable) snapshot inside the other thread, not here
          in_new_thread(lambda snapshot=d:
                        log_to_disk(pickle.dumps(snapshot), timestamp()))
      old_value = d.get(name, None)        # look mom, no freeze-up
      d = d.updated(name, new_value)
      send_response(old_value)

By making d.updated() create a new immutable dict that's separate from
d, you now don't have to worry about locks or synchronization or any
of that crap.  When the pickler is done with the old d, its references
are gone and the GC reclaims its storage.  Of course you can do more
elaborate checkpointing on subsets of the data etc. with the same
methods.  
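
For completeness, frozendict above isn't a builtin; a thin wrapper
around the path-copying tree sketched earlier would do.  Again just an
illustration, reusing the updated/get/items functions from that sketch:

    class frozendict(object):
        # Immutable, dict-like wrapper around the path-copying tree above.
        def __init__(self, root=None):
            self._root = root
        def get(self, key, default=None):
            return get(self._root, key, default)       # module-level tree lookup
        def updated(self, key, value):
            # Returns a brand-new frozendict; self is never modified.
            return frozendict(updated(self._root, key, value))
        def iteritems(self):
            return items(self._root)                    # already in sorted key order

Each call to .updated() costs O(log n) (given a balanced tree) and
leaves every older snapshot intact, which is exactly what lets the
checkpoint thread pickle an old snapshot while the main loop keeps
going.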

Really it's the presence of the O(log n) .updated() operation, rather
than sorting, that makes this type of structure interesting, but the
usual implementation involves balanced tree structures, so the sorting
comes naturally.


