Balanced trees

Roy Smith roy at panix.com
Wed Mar 19 10:06:29 EDT 2014


In article <53299eac$0$29994$c3e8da3$5496439d at news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:

> If you have a million items, then the odds that those 
> million items happen to be sorted (the worst-case order) are 1 in a 
> million factorial. That's a rather small number, small enough that we can 
> discount it from ever happening, not in a million lifetimes of the 
> Universe.

If the items are coming from some inherently random process, then I 
agree.  But, not all data sources are random.

Imagine you had a web site, and wanted to store various bits of data 
from the stream of input requests.  You decided that the data structure 
you were going to use was a balanced tree.  If your keys were, say, a 
hash of the client's IP address, then it's a pretty good guess they're 
random.  But, what if the keys were the time the request was received?  
Those are obviously not random, and using them as a keys in a balanced 
tree would give you sub-optimum performance.

This is not a hypothetical scenario.  Songza uses MongoDB as our 
database.  The indexes are balanced trees.  One of our biggest 
collections has a multi-component key, one of the components being the 
request time truncated to the hour.  Insertion time into that collection 
has an obvious sawtooth shape, with performance degrading as each hour 
progresses and jumping back up as the time rolls over to the next hour.  
This is due (we believe :-)) to the constant rebalancing of the index 
trees.

Almost-sorted data sets are very common.  For example, here's a pipeline 
I run a lot (from memory, could have gotten some detail wrong):

grep pattern lofgile | awk '{print $7}' | sed 's/:[0-9][0-9]$//' | sort 
| uniq -c

Field 7 is the timestamp for when a request came in.  What I'm doing 
here is counting how many requests of a certain type came in during each 
minute of the day.  Logging isn't exactly in chronological order, so I 
need to sort the times before I count them.  But, it's in *almost* 
chronological order; a sort which had pathologically bad behavior on 
almost sorted data would be unusable here.



More information about the Python-list mailing list