[SciPy-user] PyEM: custom (non-Euclidean) distance function?

Mon Mar 16 13:33:36 EDT 2009

On Tue, Mar 17, 2009 at 2:10 AM,  <josef.pktd at gmail.com> wrote:
> A comment on gmm_em.py:
>
> in
>       def _update_em_full(self, data, gamma, ngamma):
>
> there is a triple loop, the inner two loops are:
>
>            # This should be much faster than recursing on n...
>            for i in range(d):
>                for j in range(d):
>                    xx[i, j] = N.sum(data[:, i] * data[:, j] * gamma.T[c, :],
>                            axis = 0)
>
> in my reading data[:, i], data[:, j], and gamma.T[c, :] are all 1 dimensional.
> If this is correct, then to me this looks like
>
> xx = N.dot(data.T, data * gamma[:,c:c+1])
>
> I'm not completely sure about the shape of gamma, why you transposed it.

To be honest, this is not ideal code. It is actually the first code
I wrote in numpy (or even in python), as an exercise to learn python :)

>
> According to a numpy ticket using dot should be much faster than sum.

Yes, it is, because dot uses ATLAS (if available), which can be much
faster than sum.
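
For reference, here is a minimal sketch that checks the quoted double
loop against the suggested dot expression on random data. It assumes
data has shape (n, d) and gamma has shape (n, k); the function names
and the test sizes are made up for illustration and are not part of
gmm_em.py.

```python
import numpy as np

def weighted_xx_loop(data, gamma, c):
    """Double loop as quoted above: xx[i, j] = sum_n x_ni * x_nj * gamma_nc."""
    d = data.shape[1]
    xx = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            xx[i, j] = np.sum(data[:, i] * data[:, j] * gamma.T[c, :], axis=0)
    return xx

def weighted_xx_dot(data, gamma, c):
    """Same quantity as one matrix product.

    gamma[:, c:c+1] keeps shape (n, 1), so it broadcasts over the
    d columns of data before the product with data.T.
    """
    return np.dot(data.T, data * gamma[:, c:c + 1])

# Hypothetical sizes, chosen only for this check.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))            # (n, d)
gamma = rng.random(size=(200, 4))           # (n, k)
gamma /= gamma.sum(axis=1, keepdims=True)   # rows sum to 1, like responsibilities

same = np.allclose(weighted_xx_loop(data, gamma, 1),
                   weighted_xx_dot(data, gamma, 1))
```

If the shapes are as assumed, the two versions agree up to rounding,
and the dot version hands the whole computation to the BLAS library.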

> I don't know about EM applications, but from a maximum likelihood view
> point, it might be possible to find the distribution class for the
> mixture that corresponds to different kinds of distance measures or
> that is appropriate for discrete data.

EM (for MLE) is applicable to many models within the exponential
hidden family (that is, when the complete data follow a density in the
exponential family). So it is definitely much more general than GMM,
and can be applied to discrete data (for example, a mixture of
multinomials). In my own field, speech processing, the EM algorithm is
applied to both continuous data (GMM and HMM with GMM emission
densities for acoustic modelling) and discrete data (for language
modelling).
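
To make the discrete case concrete, here is a rough sketch of EM for a
mixture of multinomials on count vectors. It is not code from PyEM; the
function name, the random initialisation, the small smoothing constant
and the toy data are all assumptions made for illustration.

```python
import numpy as np

def em_multinomial_mixture(counts, k, n_iter=50, seed=0):
    """EM sketch for a k-component mixture of multinomials.

    counts: (n, v) array of non-negative counts.
    Returns mixture weights (k,), category probabilities (k, v) and the
    responsibilities (n, k) from the last E-step.
    """
    rng = np.random.default_rng(seed)
    n, v = counts.shape
    weights = np.full(k, 1.0 / k)
    probs = rng.dirichlet(np.ones(v), size=k)   # (k, v), rows sum to 1
    for _ in range(n_iter):
        # E-step: posterior of each component given each count vector.
        # The multinomial coefficient does not depend on the component,
        # so it cancels in the normalisation and is left out.
        log_r = np.log(weights) + counts @ np.log(probs).T   # (n, k)
        log_r -= log_r.max(axis=1, keepdims=True)            # numerical stability
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted relative frequencies.
        weights = resp.mean(axis=0)
        probs = resp.T @ counts + 1e-9   # assumed smoothing, avoids log(0)
        probs /= probs.sum(axis=1, keepdims=True)
    return weights, probs, resp

# Toy data (made up): two sources with different category probabilities.
rng = np.random.default_rng(1)
a = rng.multinomial(20, [0.7, 0.2, 0.1], size=50)
b = rng.multinomial(20, [0.1, 0.2, 0.7], size=50)
weights, probs, resp = em_multinomial_mixture(np.vstack([a, b]), k=2)
```

The structure is the same as for a GMM: only the component density in
the E-step and the sufficient statistics in the M-step change.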

I am still not sure I understand how distance would come into that
context, though.

cheers,

David