Trees

Tue Jan 20 13:02:17 EST 2015

Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
> Possibly because they aren't needed? Under what circumstances would
> you use a tree instead of a list or a dict or combination of both?

I've sometimes wanted a functional tree in the sense of functional
programming.  That means the tree structure is immutable and you insert
or delete nodes in O(log n) operations, by creating new trees that share
almost all their data with the old tree.  Once you release the old root,
garbage collection frees any nodes that were in the old tree but not the
new one.

Use case inspired by Haskell's Happstack-store (formerly called MACID):
you want a Prevayler-style in-memory key-value database.  First idea:
all read queries are served by pure in-memory lookups in a Python dict.
Write queries update the dict and also append the command to a serial
log on disk (one disk write, no seeks).  If the system crashes, restart
it empty, and play back the log from the beginning to restore the
in-memory data.

Problem with first idea: the database might be a few thousand entries
but take millions of updates, if entries are modified frequently.  You
don't want to reload the million updates from the beginning.  Next idea:
dump a snapshot of the dictionary now and then, and then just reload
updates starting from the last snapshot.  Trouble here is that the
dictionary is big enough that writing out the snapshot takes time, and
you don't want to lock the whole dict against more updates while the
snapshot is writing.  If you do this the Java way, welcome to
concurrency hell as you make finer and finer grained locks and deal with
resulting deadlocks, race conditions, etc.  And the dictionary still has
to be synchronized to avoid reads simultaneous with updates.

MACID solution: use a functional red-black tree instead of a dict.  To
add or update an entry, make a new tree that replaces the old tree by
updating a single pointer.  That allows unlimited read concurrency
(reading means getting the current tree pointer and navigating the tree
that you get: since that tree is immutable you can keep using it while
other threads make new trees).  Updates are managed by a single thread
that can take its time making the new tree and writing the log, and then
atomically updating the in-memory root pointer that's shared with other
threads.  Finally, to take a snapshot, you can just get the current root
and write out its contents: since it's immutable you can do that
concurrently with anything else (including updates) that might be
happening.  You just have a small interaction with the update thread to
remember where in the log to start reading from in case of a crash and
reload.

Finding out about this was one of the "wow" moments that made me realize
I had to learn Haskell.