Maximizing observations from sparse matrices

jepler at unpythonic.net
Mon Jul 15 19:59:24 EDT 2002


I'm going to first try to re-state your problem.

You are given a set of observations.  Each observation is in a particular
region and has several variables.

    class Observation:
        def __init__(self, region, variables):
            self.region = region
            self.variables = variables

A "None" value is used in the variables list when the observation is not
available.

A Dataset is a list of Observations.  In each observation in a dataset, the
indices of the variables are assumed to correspond.

The score of a dataset is:
    def score(dataset):
        return (len([ob for ob in dataset if None not in ob.variables]) *
                len(dataset[0].variables))
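For concreteness, here's the scoring in action on a tiny made-up dataset
(the region names and values are my own invention, just to illustrate):

```python
class Observation:
    def __init__(self, region, variables):
        self.region = region
        self.variables = variables

def score(dataset):
    # complete observations times number of variables
    return (len([ob for ob in dataset if None not in ob.variables]) *
            len(dataset[0].variables))

dataset = [
    Observation("north", [1.0, 2.0]),
    Observation("north", [1.5, None]),  # incomplete, contributes nothing
    Observation("south", [0.5, 3.0]),
]
print(score(dataset))  # 2 complete observations * 2 variables = 4
```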

I don't see why it's necessary to eliminate regions, since a region with no
complete observations contributes 0 to the score in any case.

So you're left with choosing the subset of "var" columns.  You could use
VariablePicker to choose only a subset of variables from an Observation:

    def VariablePicker(ob, columns):
        return Observation(ob.region,
                           [ob.variables[column] for column in columns])

    def prune(dataset, columns):
        return [VariablePicker(ob, columns) for ob in dataset]
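Putting those pieces together on a small made-up dataset (again, my example
data, not yours):

```python
class Observation:
    def __init__(self, region, variables):
        self.region = region
        self.variables = variables

def score(dataset):
    return (len([ob for ob in dataset if None not in ob.variables]) *
            len(dataset[0].variables))

def VariablePicker(ob, columns):
    return Observation(ob.region,
                       [ob.variables[column] for column in columns])

def prune(dataset, columns):
    return [VariablePicker(ob, columns) for ob in dataset]

dataset = [
    Observation("north", [1.0, None, 2.0]),
    Observation("south", [None, 4.0, 5.0]),
]
# every observation is missing something, so the full dataset scores 0,
# but keeping only column 2 makes both observations complete
print(score(dataset))              # 0
print(score(prune(dataset, [2])))  # 2 * 1 = 2
```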

With the preliminaries out of the way, the question is:
    for what value of columns is
        score(prune(dataset, columns))
    maximized?
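(For a small number of columns you can of course just try every subset.
Here's a brute-force sketch of my own, using a plain list of lists with
None for missing values; it's only viable while the column count stays small.)

```python
from itertools import combinations

def score_cols(rows, cols):
    # rows: lists of values, None marks a missing observation
    if not cols:
        return 0
    complete = sum(1 for r in rows if all(r[c] is not None for c in cols))
    return complete * len(cols)

def best_subset(rows):
    # enumerate all 2**n - 1 nonempty column subsets
    n = len(rows[0])
    best_cols, best_score = (), 0
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            s = score_cols(rows, cols)
            if s > best_score:
                best_cols, best_score = cols, s
    return best_cols, best_score

rows = [
    [1,    None, 3],
    [4,    5,    6],
    [None, 8,    9],
]
print(best_subset(rows))  # ((0, 2), 4): columns 0 and 2 keep two complete rows
```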

Well, there are 2**len(dataset[0].variables) different values to pick for
"columns".  I don't see an obvious way in which a greedy approach can be
guaranteed to find the best answer.  (E.g. start with an empty list of
columns; at each step, append the column that maximizes the score,
terminating when no "nearby" list is better.)  Consider a
configuration like:
    V1  V2  V3
            X
            X
            X
    X   X
    X   X
In the greedy algorithm I just described, column V3 would be chosen first
(score 3*1 = 3), and then no better column would be found.  Actually, the
best score comes from choosing [V1, V2] (score 2*2 = 4).
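Translating that table into code (1 for an X, None for a blank) makes the
trap concrete:

```python
rows = [
    [None, None, 1],
    [None, None, 1],
    [None, None, 1],
    [1,    1,    None],
    [1,    1,    None],
]

def score_cols(rows, cols):
    # complete rows (no None in the chosen columns) times columns kept
    complete = sum(1 for r in rows if all(r[c] is not None for c in cols))
    return complete * len(cols)

print(score_cols(rows, [2]))     # greedy's first pick: 3 * 1 = 3
print(score_cols(rows, [0, 2]))  # adding V1 to it drops the score to 0
print(score_cols(rows, [0, 1]))  # the true optimum: 2 * 2 = 4
```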

If I had to suggest an approach, I'd start with some randomized subset and
then, as long as the score kept improving, take a "random walk" to nearby
subsets, those formed by adding or removing a small number (or just 1) of
elements of the subset.  After trying several random starting points,
I'd just go with the best answer I'd seen so far.
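A sketch of that idea (my code, not a definitive recipe): toggling one
column at a time serves as the "nearby" move, and I seed the starting
points with the empty set and each singleton, plus random subsets from a
fixed seed, so a run is reproducible.

```python
import random

def score_cols(rows, cols):
    # rows use None for a missing value; score = complete rows * columns kept
    if not cols:
        return 0
    complete = sum(1 for r in rows if all(r[c] is not None for c in cols))
    return complete * len(cols)

def hill_climb(rows, cols):
    # move to the best neighboring subset (one column toggled) until stuck
    n = len(rows[0])
    s = score_cols(rows, cols)
    while True:
        candidates = [(score_cols(rows, cols ^ {c}), cols ^ {c})
                      for c in range(n)]
        best_s, best_nb = max(candidates, key=lambda t: t[0])
        if best_s <= s:
            return cols, s
        cols, s = best_nb, best_s

def random_restarts(rows, tries=20, seed=1):
    rng = random.Random(seed)
    n = len(rows[0])
    starts = ([set()] + [{c} for c in range(n)] +
              [{c for c in range(n) if rng.random() < 0.5}
               for _ in range(tries)])
    return max((hill_climb(rows, start) for start in starts),
               key=lambda t: t[1])

rows = [
    [None, None, 1],
    [None, None, 1],
    [None, None, 1],
    [1,    1,    None],
    [1,    1,    None],
]
print(random_restarts(rows))  # ({0, 1}, 4): the [V1, V2] optimum from above
```

Restarting from {2} would get stuck at score 3, which is exactly why
multiple starting points are needed.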

However, unless the number of variables is quite small, I don't see how it's
possible to exhaustively search the whole problem space or otherwise be sure
you've found the best answer.

Maybe somebody with a better background in this kind of algorithm question
could speak up.  You might have better luck asking in a language-neutral
group that knows about the problem domain and then figuring out how to
implement the algorithm in Python.

Jeff
PS if this dataset is really big, you're probably using Numeric to process
it.  In that case, my program snippets above won't directly help, but they
were helpful to me in thinking about the problem...




