merging intervals repeatedly

Robert Bossy Robert.Bossy at jouy.inra.fr
Wed Mar 12 04:49:26 EDT 2008


Magdoll wrote:
> Hi,
>
> I have to read through a file that will give me a bunch of intervals.
> My ultimate goal is to produce a final set of intervals such that not
> two intervals overlap by more than N, where N is a predetermined
> length.
>
> For example, I could read through this input:
> (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
>
> btw, the input is not guaranteed to be in any sorted order.
>
> say N = 5, so the final set should be
> (1,15), (20, 30), (29, 40), (50, 100)
>   
Hi,

The problem, as stated here, may have several solutions. For instance 
the following set of intervals also satisfies the constraint:
(1,15), (20,40), (50,100)

One question you should ask yourself is: do you want all solutions? or 
just one?
If you want just one, there's another question: which one? the one with 
the most intervals? any one?
If you want all of them, then I suggest using prolog rather than python 
(I hope I won't be flamed for advocating another language here).


> Is there already some existing code in Python that I can easily take
> advantage of to produce this? Right now I've written my own simple
> solution, which is just to maintain a list of the intervals. I can use
> the Interval module, but it doesn't really affect much. I read one
> interval from the input file at a time, and use bisect to insert it in
> order. The problem comes with merging, which sometimes can be
> cascading.
>
> ex:
> read (51,65) ==> put (51,65) in list
> read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> merge with (62,100)
The way this algorithm is presented suggests an additional constraint: 
you cannot merge two intervals if their overlap <= N. In that case, 
there is a single solution indeed...
Nitpick: you cannot merge (50,66) and (62,100) since their overlap is 
still <= 5.

If you have a reasonable number of intervals, you're algorithm seems 
fine. But it is O(n**2), so in the case you read a lot of intervals and 
you observe unsatisfying performances, you will have to store the 
intervals in a cleverer data structure, see one of these:
http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree


Cheers,
RB



More information about the Python-list mailing list