Balanced trees

Roy Smith roy at panix.com
Mon Mar 10 09:59:01 EDT 2014


In article <87eh2atw6s.fsf at elektro.pacujo.net>,
 Marko Rauhamaa <marko at pacujo.net> wrote:

> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> 
> > Proof: I create a hash table that accepts unsigned bytes as keys, where 
> 
> The O(f(n)) notation has no meaning when n is limited.
> 
> This thing is not just pedantry. The discussion was how a balanced tree
> fares in comparison with hash tables. A simplistic O(1) vs O(log n) was
> presented as a proof that balanced trees don't have a chance in practice
> or theory. I wasn't so easily convinced.
> 
> 
> Marko

On the other hand, log2 n, for n = 1 million, is just 20.  It's not hard 
to imagine a hash function which costs 20x what a node traversal does, 
in which case the tree's log n lookup is ahead for all n < 1 million.
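To make that break-even point concrete, here's a sketch of the cost model
being assumed: a balanced-tree lookup visits about log2(n) nodes, while a
hash lookup pays a fixed hashing cost worth some multiple of one node
traversal (the 20x figure is the hypothetical from above, not a measurement):

```python
import math

HASH_COST = 20  # assumed: hashing costs ~20x one node traversal

def tree_cost(n):
    """Approximate node traversals for a balanced-tree lookup of n items."""
    return math.log2(n)

# The tree wins whenever log2(n) < HASH_COST, i.e. n < 2**HASH_COST.
break_even = 2 ** HASH_COST
print(break_even)          # 1048576, i.e. roughly one million
print(tree_cost(break_even))  # 20.0
```

So under this (deliberately simplistic) model, the crossover sits right
around a million keys.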

Looking at the Songza source, I see we have one user-defined hash 
function:

    def __hash__(self):
        return hash((self.song,
                     self.user,
                     self.add_time,
                     self.delete_time,
                     self.play_first))

where those things are 2 bson.ObjectId's, 2 datetimes, and a boolean.  I 
wouldn't be surprised if that falls into the 20x node traversal bucket.  
In this case, convenience was more important than speed, so it doesn't 
matter.
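If you wanted to check whether a composite __hash__ like that really lands
in the 20x bucket, a rough timeit comparison would do it.  This sketch uses
plain ints as stand-ins for the bson.ObjectIds (bson may not be installed),
and a single integer comparison as a crude proxy for one tree-node traversal:

```python
import datetime
import timeit

# Hypothetical stand-in values for the five fields hashed above.
song, user = 12345, 67890
add_time = datetime.datetime(2014, 3, 1)
delete_time = datetime.datetime(2014, 3, 10)
play_first = True

def composite_hash():
    # Same shape as the Songza __hash__: a 5-tuple of mixed types.
    return hash((song, user, add_time, delete_time, play_first))

t_hash = timeit.timeit(composite_hash, number=100_000)
t_cmp = timeit.timeit(lambda: song < user, number=100_000)
print(t_hash / t_cmp)  # ratio varies by machine; typically well above 1
```

The exact ratio depends on the machine and the field types, but it shows
how cheaply the "is my hash 20x a comparison?" question can be answered.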

The hash vs. tree argument can get very complicated.  For example, if 
your tree is not completely resident in memory, the cost of paging in a 
node will swamp everything else, and improving lookup speed will boil 
down to reducing the number of I/O operations you need.  An expensive 
hash plus a linear walk through a collision chain which was resident in 
a single memory block will beat traversing two nodes, each of which had 
to be paged in separately.
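The paging argument can be reduced to a toy I/O-count model (an assumption
for illustration, not a benchmark): count each distinct page touched as one
I/O, so a collision chain packed into one memory block costs a single
page-in regardless of how long the linear scan is, while a tree path can
cost one page-in per node if nodes are scattered:

```python
def chain_lookup_ios(chain_length):
    # Whole collision chain resident in one memory block: one page-in,
    # then the linear scan is pure in-memory work.
    return 1

def tree_lookup_ios(depth, nodes_per_page=1):
    # Each node on the search path may live on its own page.
    return -(-depth // nodes_per_page)  # ceiling division

print(chain_lookup_ios(8))  # 1 I/O for an 8-entry chain
print(tree_lookup_ios(2))   # 2 I/Os to traverse two scattered nodes
```

This is exactly why on-disk structures pack many keys per node (B-trees)
rather than using binary nodes.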
