efficient interval containment lookup

Tim Chase python.list at tim.thechases.com
Mon Jan 12 15:43:57 EST 2009


> suppose I have two lists of intervals, one significantly larger
> than the other.  For example listA = [(10, 30), (5, 25), (100, 200), ...]
> might contain thousands of elements while listB (of the same form)
> might contain hundreds of thousands or millions of elements.
> I want to count how many intervals in listB are contained within
> each interval of listA.  For example, if listA = [(10, 30), (600, 800)]
> and listB = [(20, 25), (12, 18)] is the input, then the output should
> be that (10, 30) has 2 intervals from listB contained within it,
> while (600, 800) has 0.  (Elements of listB can be contained within
> many intervals in listA, not just one.)
> 
> What is an efficient way to do this?  One simple way is:
> 
> for a_range in listA:
>   for b_range in listB:
>     if is_within(b_range, a_range):
>       # accumulate a counter here

Could you detail the is_within() function?  Most importantly, is 
it inclusive, such that "a1 <= b1 <= b2 <= a2", or is it merely 
overlapping, such that "a1 <= b2 and b1 <= a2"?  Additionally, do 
you want to count how many intervals of A overlap with intervals 
of B, or do you just want a count of how many intervals in B have 
*any* overlap with A?
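
For concreteness, here's a rough sketch of the two readings I 
have in mind, assuming inclusive endpoints (the overlaps() name 
is just mine):

   def is_within(b_range, a_range):
     # b lies entirely inside a
     b1, b2 = b_range
     a1, a2 = a_range
     return a1 <= b1 and b2 <= a2

   def overlaps(b_range, a_range):
     # b and a share at least one point
     b1, b2 = b_range
     a1, a2 = a_range
     return a1 <= b2 and b1 <= a2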

My leaning would be to make a custom "is this in A" function, 
iterate over B and test against an "appropriate subset" of A:

   listA = sorted([...])                    # A, ordered by left endpoint
   min_a1 = min(a1 for (a1, a2) in listA)   # earliest start in A
   max_a2 = max(a2 for (a1, a2) in listA)   # latest end in A
   def in_a(b1, b2):
     # test (b1, b2) only against the A intervals that could contain it
     for a1, a2 in smartly_chosen_subset(listA, b1, b2):
       if is_within((b1, b2), (a1, a2)): return True
     return False
   i = 0
   for b1, b2 in sorted(listB):
     if b2 < min_a1: continue   # b ends before any A begins
     if b1 > max_a2: break      # b (and all later b's) start after every A ends
     if in_a(b1, b2): i += 1

The in_a() function can be optimized to terminate early once 
a1 > b2 (since listA is sorted by its left endpoint) and to skip 
entries where a2 < b1 (perhaps even choosing a smart starting 
point for iterating).  Short-circuiting by returning True as soon 
as you know there's a match will trim some of the time.
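
One way to pick that starting point is to bisect on the sorted 
left endpoints; an untested sketch, again assuming inclusive 
containment:

   import bisect

   listA = sorted(listA)                    # order A by left endpoint
   a_starts = [a1 for (a1, a2) in listA]    # left endpoints, for bisecting

   def in_a(b1, b2):
     # only A intervals with a1 <= b1 can contain (b1, b2),
     # so bisect to find where those candidates end
     hi = bisect.bisect_right(a_starts, b1)
     for a1, a2 in listA[:hi]:
       if b2 <= a2:     # a1 <= b1 is already guaranteed by the bisect
         return True
     return False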

How distributed are the endpoints?  If there's a lot of 
commonality in the data, you might be able to cache results to 
cut the work even further.
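
If many of the B intervals repeat exactly, even a plain dict 
keyed on the endpoints would do (again, just a sketch on my part):

   cache = {}
   def in_a_cached(b1, b2):
     # remember previous answers so identical B intervals aren't re-checked
     try:
       return cache[(b1, b2)]
     except KeyError:
       result = cache[(b1, b2)] = in_a(b1, b2)
       return result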

Just a few ideas, and questions that might help develop better 
solutions.

-tkc
