complaints about no replies last week

Tue Mar 31 11:46:08 EDT 2009

On Mar 31, 2:56 am, Arnaud Delobelle <arno... at googlemail.com> wrote:
> Arnaud Delobelle wrote:
> > prueba... at latinmail.com writes:
> > [...]
> > > I myself asked about how to write a library to efficiently do union
> > > and intersection of sets containing time intervals some time ago on
> > > this list and got little to no answers. It is a tricky problem. Since
> > > I was getting paid I got an O(n*n) solution working. People on this
> > > list on the other hand do not get paid and answer whatever strikes
> > > their fancy. Sometimes the question is hard or confusing and nobody is
> > > motivated enough to answer.
>
> > I wasn't around when you posted this I guess. Do you mean intervals sets
> > on the (real) number line such as:
>
> >       1-3, 6-10 meaning all numbers between 1 and 3 and all numbers
> >       between 6 and 10.
>
> > In this case I think you can achieve union and intersection in O(nlogn)
> > where n is the total number of intervals in the interval sets to unify
> > or intersect. There is an implementation below. I have chosen a very
> > simple data structure for interval sets: an interval set is the list of
> > its endpoints. E.g.
>
> >     1-3, 6-10 is the list [1, 3, 6, 10]
>
> > This means that I can't specify whether an interval is closed or open.
> > So in the implementation below all intervals are assumed to be open.
> > The method could be made to work for any kind of intervals with the same
> > complexity, there would just be a few more LOC.  I'm focusing on the
> > principle - here it is:
>
> > --------------------------------------------------
> > # Implementation of union and intersection of interval sets.
>
> > from itertools import *
>
> > def combine(threshold, intsets):
> >     endpoints = sorted(chain(*imap(izip, intsets, repeat(cycle([1,-1])))))
> >     height = 0
> >     compound = []
> >     for x, step in endpoints:
> >         old_height = height
> >         height += step
> >         if max(height, old_height) == threshold:
> >             compound.append(x)
> >     return compound
>
> > def union(*intsets):
> >     return combine(1, intsets)
>
> > def intersection(*intsets):
> >     return combine(len(intsets), intsets)
>
> > # tests
>
> > def pretty(a):
> >     a = iter(a)
> >     return ', '.join("%s-%s" % (a, b) for a, b in izip(a, a))
>
> > tests = [
> >     ([1, 5, 10, 15], [3, 11, 13, 20]),
> >     ([2, 4, 6, 8], [4, 7, 10, 11]),
> >     ([0, 11], [5, 10, 15, 25], [7, 12, 13, 15]),
> >     ]
>
> > for intsets in tests:
> >     print "sets: ", "; ".join(imap(pretty, intsets))
> >     print "union: ", pretty(union(*intsets))
> >     print "intersection: ", pretty(intersection(*intsets))
> >     print "-"*20
> > --------------------------------------------------
>
> > Is this what you were looking for?
>
> > --
> > Arnaud
>
> I realised after posting last night that I must be
>
> (1) solving the wrong problem
> (2) solving it badly
>
> - My implementation of the combine() function above is O(nlogn)
> (because of the sorted() call) whereas it could be O(n) by iterating
> over the interval in the parallel manner, hence (2).  This would make
> union() and intersection() O(n).
>
> - As the problem was solved by the OP in O(n^2) I must be solving the
> wrong problem (1).
>
> I apologise for this.
>
> However it was a nice and compact implementation IMHO :)
>
> --
> Arnaud

I am pretty sure the problem can be solved in O(n log n). I just
wasn't feeling overly smart when I was writing the algorithm. N is on
average 4 and it had eventually to be implemented inside a framework
using C++ anyway, so it is pretty fast. I can’t believe that no
programmer has come over the same kind of problem before, yet my
Google fu didn’t do anything for me.

Well since I attracted a couple people's attention I will describe the
problem in more detail. Describing the problem properly is probably as
hard as solving it, so excuse me if I struggle a bit.

The problem is for a health insurance company and involves the period
of time a person is covered. Most insurance companies allow not only
for the main member to be insured but his family: the spouse and the
dependents (children). This additional coverage costs extra but less
than a full new insurance. So for example if Alice buys an insurance
worth at 100 dollars a month, she can insure her husband Bob for an
additional 50 dollars. Under certain circumstances Alice may go off
the insurance and only Bob stays. In that case the price goes back to
100 dollars or maybe there is a deal for 80 or something like that. In
other words the cost of the insurance is dependent on the combination
of family members that participate in it. Additionally not only do we
have different family compositions but also different insurance
products. So you can get medical, dental and vision insurance.

All that data is stored in a database that is not very tidy and looks
something like this:

First Day of Coverage, Last Day of Coverage, Relationship, Product
5/3/2005, 5/3/2005, D, M
9/10/2005, 10/10/2005, S, V
3/15/2005, 7/15/2005, M, M
3/1/2005, 6/1/2005, S, M
5/15/2005, 7/20/2005, D, D
9/10/2005, 1/1/2140, M, V
2/1/2005, 5/3/2005, M, M

Where
Relationship: M=Main Member, S=Spouse, D=Dependent
Product: M=Medical, D=Dental, V=Vision

As far as I know at the present time there are no deals based on
combination of products purchased so we will handle each product
independently. What I want is a simple algorithm that allows me to
calculate something like this out of it (hopefully I didn’t make a
mistake):

Medical:
2/1/2005, 2/28/2005, M
3/1/2005, 5/2/2005, M&S
5/3/2005, 5/3/2005, M&S&D
5/4/2005, 6/1/2005, M&S
6/2/2005, 7/15/2005, M

Dental:
5/15/2005, 7/20/2005, D

Vision:
9/10/2005, 10/10/2005, M&S
10/11/2005, 1/1/2140, M