Builtin classes list, set, dict reimplemented via B-trees

barnesc at engr.orst.edu barnesc at engr.orst.edu
Wed Sep 14 09:48:46 EDT 2005


>barnesc at engr.orst.edu wrote:
> > So my question is: are there any other *practical* applications of a
> > B-tree based list/set/dict ?  In other words, is this module totally
> > worth coding, or is it just academic wankery and theoretical flim
> > flam ? :)
>
>B-trees are specifically designed for disk storage. Seeking to a
>page takes much longer than reading a page of several kilobytes.
>B-trees read the keys for a many-way comparison in one seek.
>
>For in-memory comparison-based dictionaries, Red-Black trees,
>AVL trees, 2-3 trees, or skip-lists, are a better choice.
>
>
>--
>--Bryan

---------------------------------------------------------------------
Memory Usage
---------------------------------------------------------------------

Here, overhead is compared against a raw C array holding more than
1 million PyObject * values.

The B-tree is assumed to have a branch factor of ~1024.

A B-tree has an average of 1.5x overhead (2x at worst, 1x at best,
  using malloc).  This assumes that B-tree nodes contain a fixed-size
  C array for storing values (which will be half-full on
  average), and that the child array is allocated only for internal
  nodes, as needed.
Any balanced binary tree with left and right child pointers has
  a 6x overhead (using malloc, measured via a gcc win32 test program),
  or with a free list, a 3x overhead (though memory will not
  be returned until the data structure's destructor is called).
I haven't tried using PyObject_Malloc(), but it should fall
  somewhere between these two figures.
A skip list has n values and n next pointers on the bottom level.
  In the ideal case (with respect to memory usage), the level-promotion
  probability is made small enough that the higher levels take up a
  negligible amount of memory.  Thus we have a 4x overhead (using
  malloc, measured via a gcc win32 test program), or with a free list,
  an overhead slightly greater than 2x (again, memory is only
  returned when the destructor is called).  I didn't test
  PyObject_Malloc(), but it should give an overhead between 2x and 4x.

Thus, on average, a B-tree of more than 1 million elements uses 25%
less memory than any other balanced data structure I am aware of, and
50% more memory than a raw C array.
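
The arithmetic above can be sketched as a quick back-of-the-envelope
calculation (the per-node malloc factors are just the figures measured
above, not anything fundamental):

```python
# Overhead estimates in pointers-per-element, relative to a raw
# C array of PyObject *s (exactly 1 pointer per element).

def btree_overhead(fill=0.5):
    # Leaf nodes hold a fixed-size value array that is `fill`-full
    # on average, so each stored value costs 1/fill array slots:
    # 2x for half-full nodes (worst case), 1x for full nodes (best).
    return 1.0 / fill

def binary_tree_overhead(child_pointers=2, malloc_factor=2.0):
    # Each element gets its own node: 1 value + 2 child pointers,
    # inflated by allocator bookkeeping (~6x total with malloc,
    # ~3x with a free list, per the measurements above).
    return (1 + child_pointers) * malloc_factor

def skiplist_overhead(malloc_factor=2.0):
    # Bottom level: 1 value + 1 next pointer per element; higher
    # levels assumed negligible.  ~4x with malloc, ~2x free list.
    return (1 + 1) * malloc_factor

print(btree_overhead())                                  # 2.0 (worst)
print((btree_overhead(1.0) + btree_overhead(0.5)) / 2)   # 1.5 (average)
print(binary_tree_overhead())                            # 6.0
print(skiplist_overhead())                               # 4.0
```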


---------------------------------------------------------------------
Speed
---------------------------------------------------------------------

I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.

To do this, one would need two high-performance C implementations.

From this, the Python performance can presumably be deduced (although
caching causes unpredictable effects, the faster C algorithm should
also be the faster one from Python).

You could start with optimized parallel C implementations, such as
the ones by John-Mark Gurney:

 * http://resnet.uoregon.edu/~gurney_j/jmpc/btree.html
 * http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html

He indicates that the skip list is slower than the B-tree, but doesn't
give any solid numbers.  Of course, you'd have to check his code and
optimize it to death (perhaps representing a sorted set of integers
with each data structure would be a good test).
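
As a sketch of that "sorted set of integers" benchmark, here is the
measurement harness in Python, with a bisect-maintained list standing
in for the container under test (a real benchmark would swap in
wrappers around the C B-tree and skip list instead):

```python
import bisect
import random
import timeit

def build_sorted_set(keys):
    # Stand-in sorted-set container: a bisect-maintained Python list.
    # Replace this with the B-tree / skip list being benchmarked.
    items = []
    for k in keys:
        i = bisect.bisect_left(items, k)
        if i == len(items) or items[i] != k:  # set semantics: no dups
            items.insert(i, k)
    return items

keys = random.sample(range(10**6), 10**4)
elapsed = timeit.timeit(lambda: build_sorted_set(keys), number=3)
print(f"insert {len(keys)} random keys, 3 rounds: {elapsed:.3f}s")
```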

Also note that my B-tree code currently has optimized special cases
for block insertions, deletions, and retrievals, since this is easy
(er, well, relatively speaking) to do in the B-tree case.  In the skip
list case, block retrievals are easy (though not as fast as B-tree
block retrievals due to pointer indirection and cache problems).  I
have no idea how to do a fast block insert or delete on a skip list.
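
To illustrate why (this is a toy model, not the module's actual code):
in a B-tree, a block retrieval copies a few contiguous leaf arrays,
while a skip list must chase one pointer per element retrieved:

```python
class Leaf:
    # Toy B-tree leaf: values live in one contiguous array.
    def __init__(self, values):
        self.values = values

def btree_block_get(leaves, start, count):
    # Gather `count` values from global index `start` by slicing
    # whole leaf arrays: O(count / branch_factor) slice copies.
    out, skipped = [], 0
    for leaf in leaves:
        if skipped + len(leaf.values) <= start:
            skipped += len(leaf.values)
            continue
        lo = max(0, start - skipped)
        out.extend(leaf.values[lo:lo + count - len(out)])
        skipped += len(leaf.values)
        if len(out) == count:
            break
    return out

class Node:
    # Toy skip-list bottom level: one next pointer per value.
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def skiplist_block_get(head, start, count):
    # Must follow start + count pointers, one potential cache
    # miss each -- no way to grab a whole block at once.
    node = head
    for _ in range(start):
        node = node.next
    out = []
    for _ in range(count):
        out.append(node.value)
        node = node.next
    return out
```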

----------------------------------------------------------------------

Well, enough ivory tower silliness and academic blustering.

<on_topic>

Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?

</on_topic>

 - Connelly Barnes


