[SciPy-User] 4-D gaussian mixture model.

Fri Nov 26 12:51:34 EST 2010

On Fri, Nov 26, 2010 at 05:00:10PM +0100, Éric Depagne wrote:
> I have a set of data that are made of 4 parameters : x, y, dx and dy

> I'd like to classify this set  the following way : put together all
> (x,y) that have similar (dx, dy).

OK, so you have a learning task with a multivariate output, is that
right?

> I've had a look at Gaussian mixture models implementation in scikit, and it 
> seems to be what I need. But the examples I've found here:
> http://scikit-learn.sourceforge.net/0.5/auto_examples/gmm/plot_gmm.html#
> only fit y vs x.

Yes, standard Gaussian mixture models do not model multivariate output. 

> In my case for instance, all my (x,y) would be in red, but some of the (dx, 
> dy) would point towards you, and some would point away from you, and I'd like 
> to sort the data according to this "parameter": the pointing direction.

Can you extract the 'parameter' that makes the most sense in your context?
This would make the problem much better posed, as the method would not
have to learn the relevant structure of the output space.
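
As a minimal sketch of what extracting such a parameter could look like,
assuming the "pointing direction" is taken to be the angle of the (dx, dy)
vector (this definition and the sample values are assumptions, not something
stated in the question):

```python
import numpy as np

# Hypothetical sample data: columns are x, y, dx, dy.
data = np.array([
    [0.0, 0.0,  1.0,  0.0],
    [1.0, 2.0,  0.0,  1.0],
    [2.0, 1.0, -1.0,  0.0],
    [3.0, 3.0,  0.0, -1.0],
])

dx, dy = data[:, 2], data[:, 3]

# Assumed definition of the pointing direction: the angle of (dx, dy),
# in radians, measured counter-clockwise from the positive x axis.
angle = np.arctan2(dy, dx)

# Length of the (dx, dy) vector, in case the magnitude also matters.
magnitude = np.hypot(dx, dy)
```

With a single scalar such as `angle` per point, the problem reduces to
grouping points by one well-defined quantity.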

> How can I modify the example so that it fits 2 dims, keeping the first two as 
> input ?

You can't. Not with the Gaussian mixture models in the scikit.

> And does it make sense to use this kind of method, my knowledge in 
> statistics is quite limited.

I am not an expert in structured output learning, but I would say that
GMM is probably not an excellent choice for that. On the other hand, if
you are interested in a clustering method, all the methods I know work on
unstructured output. In theory the GMM could probably be adapted to your
problem, but that would mean redoing the probabilistic model and the
update rules used in the computation.
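
If plain clustering turns out to be enough, one simple reading is to fit
the mixture on (dx, dy) only and then look at which (x, y) fall in each
cluster. The sketch below assumes this reading, uses synthetic data, and
relies on `GaussianMixture` from current scikit-learn releases rather than
the 0.5 API linked above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 200

# Synthetic positions (x, y); they play no role in the clustering itself.
xy = rng.uniform(0, 10, size=(2 * n, 2))

# Synthetic (dx, dy): two well-separated groups of directions (assumption).
d_a = rng.normal([1.0, 1.0], 0.1, size=(n, 2))
d_b = rng.normal([-1.0, -1.0], 0.1, size=(n, 2))
dxy = np.vstack([d_a, d_b])

# Fit a 2-component mixture on (dx, dy) only and get a label per point.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(dxy)

# Group the (x, y) positions by the cluster of their (dx, dy).
groups = [xy[labels == k] for k in range(2)]
```

This does not model (dx, dy) as a function of (x, y); it only groups the
points by similarity of their (dx, dy).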

For structured output, latent factor models that learn from both spaces,
such as canonical correlation analysis, are well-posed. But you would
need to formulate your problem in a way that fits in these frameworks.

What is your end problem? Do you want to classify or cluster? Can you
define the quantity that you are interested in?

HTH,

Gael