efficient interval containment lookup

Tue Jan 13 01:15:41 EST 2009

On Jan 12, 9:34 pm, Per Freem <perfr... at yahoo.com> wrote:
> hi brent, thanks very much for your informative reply -- didn't
> realize this about the size of the interval.
>
> thanks for the bx-python link.  could you (or someone else) explain
> why the size of the interval makes such a big difference? i don't
> understand why it affects efficiency so much...
>
> thanks.
>
> On Jan 13, 12:24 am, brent <bpede... at gmail.com> wrote:
>
> > On Jan 12, 8:55 pm, Per Freem <perfr... at yahoo.com> wrote:
>
> > > On Jan 12, 10:58 pm, Steven D'Aprano
>
> > > <ste... at REMOVE.THIS.cybersource.com.au> wrote:
> > > > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
> > > > > thanks for your replies -- a few clarifications and questions. the
> > > > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a
> > > > >>= c and b <= d. Note that I am not looking for intervals that
> > > > > overlap... this is why interval trees seem to me to not be relevant, as
> > > > > the overlapping interval problem is way harder than what I am trying to
> > > > > do. Please correct me if I'm wrong on this...
>
> > > > To test for contained intervals:
> > > > a >= c and b <= d
>
> > > > To test for overlapping intervals:
>
> > > > not (b < c or a > d)
>
> > > > Not exactly what I would call "way harder".
>
> > > > --
> > > > Steven
>
> > > hi Steven,
>
> > > i found an implementation (which is exactly how i'd write it based on
> > > the description) here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html
>
> > > when i use this however, it comes out either significantly slower or
> > > equal to a naive search. my naive search just iterates through a
> > > smallish list of intervals and for each one says whether they overlap
> > > with each of a large set of intervals.
>
> > > here is the exact code i used to make the comparison, plus the code at
> > > the link i have above:
>
> > > class Interval():
> > >     def __init__(self, start, stop):
> > >         self.start = start
> > >         self.stop = stop
>
> > > import random
> > > import time
> > > num_ints = 30000
> > > init_intervals = []
> > > for n in range(0,
> > > num_ints):
> > >     start = int(round(random.random()
> > > *1000))
> > >     end = start + int(round(random.random()*500+1))
> > >     init_intervals.append(Interval(start, end))
> > > num_ranges = 900
> > > ranges = []
> > > for n in range(0, num_ranges):
> > >   start = int(round(random.random()
> > > *1000))
> > >   end = start + int(round(random.random()*500+1))
> > >   ranges.append((start, end))
> > > #print init_intervals
> > > tree = IntervalTree(init_intervals)
> > > t1 = time.time()
> > > for r in ranges:
> > >   tree.find(r[0], r[1])
> > > t2 = time.time()
> > > print "interval tree: %.3f" %((t2-t1)*1000.0)
> > > t1 = time.time()
> > > for r in ranges:
> > >   naive_find(init_intervals, r[0], r[1])
> > > t2 = time.time()
> > > print "brute force: %.3f" %((t2-t1)*1000.0)
>
> > > on one run, i get:
> > > interval tree: 8584.682
> > > brute force: 8201.644
>
> > > is there anything wrong with this implementation? it seems very right
> > > to me but i am no expert. any help on this would be relly helpful.
>
> > hi, the tree is inefficient when the interval is large. as the size of
> > the interval shrinks to much less than the expanse of the tree, the
> > tree will be faster. changing 500 to 50 in both cases in your script,
> > i get:
> > interval tree: 3233.404
> > brute force: 9807.787
>
> > so the tree will work for limited cases. but it's quite simple. check
> > the tree in bx-python:http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/opera...
> > for a more robust implementation.
> > -brentp
>
>

well, i think if your search interval covers the entire span of the
tree, you can't do better than just naive search as the tree just adds
overhead. as the len of the search interval decreases relative to the
span of the tree, the tree performs better.
if all the intervals in the tree itself are long and overlapping, that
tree just ... ugh ... doesnt work, because it has to do all the extra
checks in the find() method anyway. the bx-python tree probably does a
better job, but you set up a pretty rough test there.