[scikit-learn] distances

Joel Nothman joel.nothman at gmail.com
Tue Mar 3 17:47:46 EST 2020


I noticed a comment by @amueller on Gitter re considering a project on our
distances implementations.

I think there's a lot of work that can be done in unifying distances
implementations... (though I'm not always sure of the benefit). I thought
I would summarise some of the issues below, as I was unsure what Andy
intended.

As @jeremiedbb said, making n_jobs more effective would be beneficial.
Reducing duplication between metrics.pairwise, neighbors._dist_metrics
and kmeans would be noble (especially with regard to parameters, where
scipy.spatial's mahalanobis, as exposed through sklearn.metrics, does not
accept V but sklearn.neighbors does), and it could offer more consistent
results and efficiencies.
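
To make the parameter mismatch concrete, a minimal sketch (API as of
0.22; the exact failure mode when passing V to sklearn.metrics may vary):

    import numpy as np
    from sklearn.metrics import pairwise_distances
    from sklearn.neighbors import DistanceMetric

    X = np.random.RandomState(0).rand(5, 3)
    V = np.cov(X.T)

    # sklearn.neighbors accepts the covariance matrix V directly
    dm = DistanceMetric.get_metric('mahalanobis', V=V)
    D1 = dm.pairwise(X)

    # sklearn.metrics delegates to scipy.spatial, which expects the
    # *inverse* covariance VI instead, so V must be inverted by hand
    D2 = pairwise_distances(X, metric='mahalanobis', VI=np.linalg.inv(V))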

We also have idioms in the code like "if the metric is euclidean, use
squared=True where we only need a ranking, then take the square root",
while the neighbors metrics abstract this with an API, providing rdist
and rdist_to_dist.
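
For illustration (a rough sketch; for euclidean, rdist is simply the
squared distance):

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.neighbors import DistanceMetric

    X = np.random.RandomState(0).rand(4, 2)

    # the hand-rolled idiom: rank on squared distances, sqrt at the end
    sq = euclidean_distances(X, squared=True)
    d = np.sqrt(sq)

    # the neighbors abstraction: rdist is a cheap, rank-preserving
    # surrogate, converted back only when true distances are needed
    dm = DistanceMetric.get_metric('euclidean')
    assert np.allclose(dm.rdist_to_dist(sq), d)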

There are issues about making sure that
pairwise_distances(metric='minkowski', p=2) is using the same
implementation as pairwise_distances(metric='euclidean'), etc.
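
That is, something like the following should agree numerically, and
ideally share a code path (a trivial check, nothing assumed beyond the
public API):

    import numpy as np
    from sklearn.metrics import pairwise_distances

    X = np.random.RandomState(0).rand(5, 3)
    D_minkowski = pairwise_distances(X, metric='minkowski', p=2)
    D_euclidean = pairwise_distances(X, metric='euclidean')
    # close numerically, but not guaranteed to take the same path
    assert np.allclose(D_minkowski, D_euclidean)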

We have issues with chunking and distributing computations in the case that
metric params are derived from the dataset (ideally a training set).

#16419 is a simple instance where the metric param is sample-aligned and
needs to be chunked up.
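
Schematically (row_param below is an invented stand-in, not the actual
parameter in #16419): pairwise_distances_chunked forwards metric kwargs
whole to every chunk, so anything aligned with the rows of X has to be
sliced by hand:

    import numpy as np
    from sklearn.metrics import pairwise_distances_chunked

    X = np.random.RandomState(0).rand(100, 3)
    row_param = np.ones(100)  # hypothetical, one value per sample

    start = 0
    for D_chunk in pairwise_distances_chunked(X, working_memory=0):
        stop = start + D_chunk.shape[0]
        # rows [start:stop) of X produced this chunk; a sample-aligned
        # param would need exactly row_param[start:stop] here
        start = stop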

In other cases, we precompute some metric param over all the data, then
pass it to each chunk worker, using _precompute_metric_params introduced in
#12672. This is also relevant to #9555.
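
Roughly, for seuclidean it does something like this (a paraphrase of the
private helper, not its exact code):

    import numpy as np

    def precompute_seuclidean_V(X, Y=None):
        # computed once over all the data, then handed to every chunk
        if Y is None:
            return np.var(X, axis=0, ddof=1)
        return np.var(np.vstack([X, Y]), axis=0, ddof=1)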

While that initial implementation in #12672 is helpful and aims to maintain
backwards compatibility, it makes some dubious choices.

Firstly, in terms of code structure, it is not a very modular approach:
each metric is handled with an if-then (one possible alternative is
sketched below). Secondly, it *only* handles the chunking case, relying
on the fact that these metrics are in scipy.spatial and have comparable
handling of V=None and VI=None. In the Gower Distances PR (#9555), when
implementing a metric locally rather than relying on scipy.spatial, we
needed to provide an implementation of these default parameters both when
the data is chunked and when the metric function is called directly.
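
One possible, more modular shape (purely a sketch under assumed names,
not a concrete API proposal): let each metric register how its
data-dependent defaults are derived, so the chunked and unchunked paths
can share it:

    import numpy as np

    _PARAM_PRECOMPUTERS = {}

    def register_precomputer(metric):
        def decorate(func):
            _PARAM_PRECOMPUTERS[metric] = func
            return func
        return decorate

    @register_precomputer('seuclidean')
    def _seuclidean_defaults(X, Y=None, **kwds):
        # fill in V only if the user didn't supply it
        if 'V' not in kwds:
            data = X if Y is None else np.vstack([X, Y])
            kwds = dict(kwds, V=np.var(data, axis=0, ddof=1))
        return kwds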

Thirdly, its approach to training vs test data is dubious. We don't
formally label X and Y in pairwise_distances as train/test, and perhaps
we should. To maintain backwards compat with scipy's seuclidean and
mahalanobis, our implementation stacks X and Y together if both are
provided, and then calculates their variance. This means that users may
be applying a different metric at train and at test time (if the variance
of X as train and Y as test is substantially different), which I consider
a silent error (illustrated below). We can either make the train/test
nature of X and Y more explicit, or we can require that data-based
parameters be given explicitly by the user rather than implicitly
computed. If I understand correctly, sklearn.neighbors will not compute V
or VI for you; they must be provided explicitly. (Requiring that the
scaling of each feature be given explicitly in Gower seems like an
unnecessary burden on the user, however.)
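
To illustrate the silent error (assuming the current pooled-variance
behaviour for seuclidean):

    import numpy as np
    from sklearn.metrics import pairwise_distances

    rng = np.random.RandomState(0)
    X_train = rng.rand(20, 3)
    X_test = rng.rand(20, 3) * 10  # deliberately different scale

    # self-distances use Var(X_train)...
    D_train = pairwise_distances(X_train, metric='seuclidean')
    # ...but train/test distances use Var(vstack([X_train, X_test])),
    # i.e. a different metric, with no warning to the user
    D_test = pairwise_distances(X_train, X_test, metric='seuclidean')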

Then there are issues like whether we should consistently set the diagonal
to zero in all metrics where Y=None.
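
For instance, the euclidean fast path currently zeroes the diagonal
explicitly when Y is None, but other metrics may leave floating-point
noise there (illustrative only):

    import numpy as np
    from sklearn.metrics import pairwise_distances

    X = np.random.RandomState(0).rand(5, 3)
    D = pairwise_distances(X, metric='euclidean')
    # enforced explicitly in the euclidean path when Y is None
    assert np.all(np.diag(D) == 0)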

In short, there are several projects in distances, and I'd support them
being considered for work... but it's a lot of engineering, even if
motivated by ML needs and consistency for users.

J