[scikit-learn] Inconsistencies in clustering documentations

Tom DLT tom.duprelatour at orange.fr
Wed May 23 08:01:47 EDT 2018


Hi Anaël,

Thanks for spotting these inconsistencies.
You are very welcome to open pull requests and/or issues on the GitHub
tracker (cf.
http://scikit-learn.org/stable/developers/contributing.html#contributing-code).
The documentation issue should be straightforward to fix.
The parameter renaming would need a proper deprecation cycle (cf.
http://scikit-learn.org/stable/developers/contributing.html#deprecation).
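
As a rough illustration of what such a deprecation cycle usually looks
like, here is a minimal sketch. It is not scikit-learn code: the class
ToyClusterer, the new parameter name 'metric' and the "deprecated"
sentinel are all hypothetical, and the warning class and wording the
project actually expects are described in the guide linked above.

```python
import warnings


class ToyClusterer:
    """Hypothetical estimator used only to illustrate a parameter rename."""

    def __init__(self, metric="euclidean", affinity="deprecated"):
        # 'metric' is the assumed new name; 'affinity' is the old name, kept
        # for a transition period with a sentinel default.
        self.metric = metric
        self.affinity = affinity

    def fit(self, X):
        metric = self.metric
        if self.affinity != "deprecated":
            # The caller still uses the old name: warn, but honour the value.
            warnings.warn(
                "'affinity' is deprecated and will be removed in a future "
                "release; use 'metric' instead.",
                FutureWarning,
            )
            metric = self.affinity
        # Store the resolved value; the actual clustering is omitted here.
        self.metric_ = metric
        return self
```

With this pattern, code that still passes affinity='precomputed' keeps
working during the transition and only gets a warning, while code that
passes metric='precomputed' runs silently.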

See you on GitHub,

Tom

2018-05-23 11:50 GMT+02:00 Beaugnon Anael <anael.beaugnon at ssi.gouv.fr>:

> Dear all,
>
> Three clustering algorithms can take as input distance or similarity
> matrices instead of the observations (AgglomerativeClustering
> <http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering>,
> AffinityPropagation
> <http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation>,
> and DBSCAN
> <http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN>),
> but there are inconsistencies in their documentation.
>
>
> *DBSCAN:*
>     The documentation explains clearly how to run DBSCAN with a precomputed
>     distance matrix.
>     Constructor:
>         metric: *If metric is “precomputed”, X is assumed to be a distance
>         matrix and must be square.*
>     fit / fit_predict:
>         X: *A feature array, or array of distances between samples if
>         metric='precomputed'.*
>
> *AffinityPropagation:*
>     Constructor:
>         affinity: *Which affinity to use. At the moment precomputed and
>         euclidean are supported. euclidean uses the negative squared
>         euclidean distance between points.*
>     fit:
>         X: *Data matrix or, if affinity is precomputed, matrix of
>         similarities / affinities.*
>     fit_predict:
>         X: *Input data.*
>         Can X also be a matrix of similarities here? Should fit and
>         fit_predict not share the same documentation for the input X?
>
>
>
> *AgglomerativeClustering:*
>     Constructor:
>         affinity: *Metric used to compute the linkage. Can be “euclidean”,
>         “l1”, “l2”, “manhattan”, “cosine”, or ‘precomputed’. If linkage is
>         “ward”, only “euclidean” is accepted.*
>         The name of the parameter 'affinity' seems misleading, since it
>         does not correspond to similarity functions, but to distance
>         functions.
>     fit:
>         X: *The samples a.k.a. observations.*
>     fit_predict:
>         X: *Input data.*
>         The documentation of fit and fit_predict does not specify that X
>         can also be a matrix of distances.
>
> Users may be unsure whether they should provide a distance matrix or a
> similarity matrix to AgglomerativeClustering.
> The documentation of fit and fit_predict can easily be updated. Renaming
> the 'affinity' parameter is more difficult, since it involves an API
> change.
>
>
> What do you think of these potential updates to the documentation?
>
> Cheers,
>
> Anaël Beaugnon