[scikit-learn] Issues with kmeans: Difference in centroid values

Andreas Mueller t3kcit at gmail.com
Mon Apr 16 17:36:14 EDT 2018



On 04/16/2018 04:07 PM, Sidak Pal Singh wrote:
> Hi everyone,
>
> I was using scikit-learn KMeans algorithm to cluster pretrained 
> word-vectors. There are a few things which I found to be surprising 
> and wanted to get some feedback on.
>
> - Based upon the 'labels_' assigned to each word-vector (i.e. cluster 
> memberships), I compute every cluster centroid as the average of the 
> word-vectors (corresponding to that cluster). Surprisingly, this seems 
> to be pretty different from the 'cluster_centers_'. Is there anything 
> that I am missing here?
If the algorithm did not fully converge, averaging the points by 
'labels_' amounts to one more update step, so the results are expected 
to differ from 'cluster_centers_'.
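
A minimal sketch of that comparison, assuming synthetic blob data in place of the word vectors. The helper centroids_from_labels and all parameter values are illustrative and not taken from the original post. It fits KMeans once with a single iteration and once until the labels stop changing, and measures how far the label-wise means are from 'cluster_centers_' in each case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def centroids_from_labels(X, labels, n_clusters):
    """Average the points assigned to each cluster (one k-means update step)."""
    return np.vstack([X[labels == k].mean(axis=0) for k in range(n_clusters)])


# Synthetic stand-in for the word vectors (an assumption, not the original data).
X, _ = make_blobs(n_samples=500, centers=5, n_features=10,
                  cluster_std=3.0, random_state=0)

# Stopped early: one iteration only, so the fit has very likely not converged
# (scikit-learn will emit a ConvergenceWarning here).
early = KMeans(n_clusters=5, init="random", n_init=1, max_iter=1,
               random_state=0).fit(X)
gap_early = np.abs(
    centroids_from_labels(X, early.labels_, 5) - early.cluster_centers_
).max()

# Run until the labels stop changing; tol=0 disables the center-shift stopping rule.
full = KMeans(n_clusters=5, init="random", n_init=1, max_iter=1000, tol=0.0,
              random_state=0).fit(X)
gap_full = np.abs(
    centroids_from_labels(X, full.labels_, 5) - full.cluster_centers_
).max()

print(f"max gap after 1 iteration: {gap_early:.3e}")
print(f"max gap at convergence:    {gap_full:.3e}")
```

If the gap stays large even with a generous max_iter and a small tol, the cause is probably something other than early stopping.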
>
> - I was later using the verbose option to see if the clustering has 
> converged or not. I saw on the console log messages such as "center 
> shift 7.994126e-04 within tolerance 1.243425e-06". It seems that this 
> corresponds to some code in *kmeans_elkan.pyx* 
> (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_k_means_elkan.pyx). 
>
> - Lastly, another thing that seems strange is that I hadn't set the 
> tolerance value. So the default of 1e-4 should have been used. But if 
> you look again at the above log, it says "within tolerance 
> 1.243425e-06" instead of 1e-4.
>
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py#L159
The tolerance is scaled by the variance of the data to be independent of 
the scale.
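
A rough sketch of that scaling in plain NumPy, assuming the effective tolerance is the user-facing 'tol' multiplied by the mean per-feature variance of the data. The function name scaled_tolerance and the random data are hypothetical, so check the linked source for the exact computation. Under that assumption, the logged value 1.243425e-06 with the default tol of 1e-4 would imply a mean per-feature variance of roughly 0.0124 for the word vectors.

```python
import numpy as np


def scaled_tolerance(X, tol=1e-4):
    """Scale a relative tolerance by the mean per-feature variance of X.

    Hypothetical helper for illustration; it is not part of scikit-learn's
    public API.
    """
    return np.mean(np.var(X, axis=0)) * tol


# Random data standing in for the word vectors (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(1000, 50))

print(f"effective tolerance:         {scaled_tolerance(X):.6e}")
# Multiplying the data by 10 multiplies the variance, and therefore the
# effective tolerance, by 100, so the stopping rule tracks the data scale.
print(f"effective tolerance (X*10):  {scaled_tolerance(10 * X):.6e}")
```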
