[SciPy-user] PyEM: custom (non-Euclidean) distance function?

David Cournapeau cournape at gmail.com
Mon Mar 16 20:24:32 EDT 2009


On Tue, Mar 17, 2009 at 8:39 AM, David Warde-Farley <dwf at cs.toronto.edu> wrote:
>
> On 16-Mar-09, at 12:46 PM, Emanuele Olivetti wrote:
>
>> You are right. I'm coming from K-means (MacKay's book) and
>> moving to GMMs; that's why I had custom distances in mind.
>> I can try to transform my data to be meaningful under the
>> Euclidean distance.
>
>
> Better yet, figure out what distribution might be the right one to use
> in the one-component case. What sort of data are you working with?
> MoGs aren't a magic bullet, and you might be better off putting some
> careful consideration into the form your data takes and choosing an
> appropriate base distribution.
>
> Pretty much any parametric distribution can be turned into a mixture
> distribution. The way a (finite) mixture works in the general case is
> that you have a discrete "hidden" random variable C that takes on
> values corresponding to one of the N clusters, and then N separate
> distributions from a parametric family (you can mix families too but
> that gets complicated and is rarely useful). Mixtures of Bernoulli,
> multinomial, Gamma, and Poisson distributions (for example) are all
> fairly common. EM will work for all of these cases, and many more; it
> relies on a fairly general set of assumptions, the details of which
> escape me at the moment.
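
To make the generative story concrete: sampling from a finite mixture
looks like the following (a rough numpy sketch; the weights and
Gaussian parameters here are made up purely for illustration):

    import numpy as np

    rng = np.random.RandomState(0)

    # illustrative 3-component mixture: weights w, means mu, stds sigma
    w = np.array([0.5, 0.3, 0.2])
    mu = np.array([-2.0, 0.0, 3.0])
    sigma = np.array([0.5, 1.0, 0.8])

    n = 1000
    # hidden variable C: which component generates each observation
    c = rng.multinomial(1, w, size=n).argmax(axis=1)
    # then each observation is drawn from its component's distribution
    x = rng.normal(mu[c], sigma[c])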

You need to be able to efficiently estimate the expected
log-likelihood of the complete data given the observations (which,
for mixtures, reduces to computing the responsibilities), and to be
able to maximize it w.r.t. the parameters. For complete data in the
exponential family with sufficient statistics s, the expected
log-likelihood reduces to averaged sufficient statistics, and the
re-estimated parameters are given by the Legendre transform of the
cumulant generating function.
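
For a GMM, for instance, this boils down to something like the
following (a rough sketch, not PyEM's actual code; the helper names
are mine):

    import numpy as np
    from scipy.stats import norm

    def responsibilities(x, w, mu, sigma):
        """E-step for a 1-d GMM: p(component k | x_i) for each point.

        x: (n,) data; w, mu, sigma: (K,) mixture parameters."""
        # unnormalized: w_k * N(x_i | mu_k, sigma_k)
        lik = w * norm.pdf(x[:, None], mu[None, :], sigma[None, :])
        return lik / lik.sum(axis=1)[:, None]

    def m_step(x, resp):
        """M-step: averaged sufficient statistics give the new
        parameters."""
        nk = resp.sum(axis=0)                      # effective counts
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk  # weighted means
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return w, mu, np.sqrt(var)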

> I haven't ever used PyEM so I don't know how general David's code is,
> but it might be a helpful guide.

It only handles GMMs. As you said, the machinery is much the same for
many mixtures, but that only helps formally. Concretely, the actual
formulations are quite different; a mixture of Poissons would be
simple to derive from the GMM case, but mixtures of multinomials
would be quite different. To get an "automatic" EM algorithm for any
given mixture, one could imagine some sort of formal compiler, the EM
algorithm being a meta-algorithm. Needless to say, I have not
attempted that; people smarter than me have tackled it in a more
general context (Infer.NET from Minka et al. at Microsoft Research).
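
To give an idea of the Poisson case: the E-step has exactly the same
shape as for the GMM, and the M-step is just a responsibility-weighted
mean, since the sufficient statistic of a Poisson is x itself (again a
hedged sketch, not code from PyEM):

    import numpy as np
    from scipy.stats import poisson

    def poisson_em_step(x, w, lam):
        """One EM iteration for a K-component Poisson mixture.

        x: (n,) integer counts; w, lam: (K,) weights and rates."""
        # E-step: responsibilities, same machinery as the Gaussian case
        lik = w * poisson.pmf(x[:, None], lam[None, :])
        resp = lik / lik.sum(axis=1)[:, None]
        # M-step: the new rate is a responsibility-weighted mean of x
        nk = resp.sum(axis=0)
        return nk / len(x), (resp * x[:, None]).sum(axis=0) / nk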

cheers,

David


