set partition question

Sun May 25 23:40:16 EDT 2008

On May 25, 7:46 pm, miller.pau... at gmail.com wrote:
> > I think this problem is related to integer partitioning, but it's not
<snip>
>
> If, by "integer partitioning," you mean the "subset sum
> problem" (given a finite set S of integers, does S contain a subset
> which sums up to some given integer k?), then you are right.  I'm
> reasonably sure there's a polynomial reduction from your problem to
> the subset sum problem.  That would make your problem NP-complete.

Wow, that's helpful. http://en.wikipedia.org/wiki/Subset_sum clarifies
what I'm trying to do. Yes, it probably is NP-complete. Fortunately my
use case doesn't require that the process finish, just that it produce
a lot. I can exclude trivial cases (e.g., f' = f(S) + s_1 - s_1) with
some sort of simplifier.

> Like I said, because this looks related to the subset sum problem
> (though harder, because you're asking for "all" such functions, for
> some rigorous definition of "all," as per the previous sentence), I
> suspect it's NP-complete.  However, there are probably some good
> heuristics, such as if s1 has a large intersection with s0, it's
> probably a good idea to use s1 in whatever formula you come up with.

There are a few real-world constraints which make the problem easier,
but only by linear factors, I suspect. More on this below.

> Other than that, I don't really have an idea.  Can you say what the
> application of this algorithm would be?  Knowing the use case might
> suggest a better approach.

This is a problem in statistical estimation. Given n records made up
of k variables, define a cell as the point in the cartesian product of
v_1 * v_2 * ... * v_k. I want to apply an estimator on the data in
each cell.

However, many cells are empty, and others are too sparse to support
estimation. One solution is to build lots of valid aggregations,
called "strata," for which estimates can be made for chunks of cells.
A "valid aggregation" is a sequence of *adjacent* cells (see below)
whose data are pooled (by whatever mechanism is relevant to the
estimator).

The cells are "elements" in my initial problem statement, and the
strata are the sets.

The constraints I mentioned are on the composition of strata: strata
can only contain sequences of adjacent cells, where adjacency is
defined for each variable. Two cells are adjacent if they are equal on
every variable except one, and on that one and by its definition, they
are adjacent.

Does this make it easier to understand, or more difficult? thanks --
PB.