[scikit-learn] Applying clustering to cosine distance matrix

prince gosavi princegosavi12 at gmail.com
Mon Feb 12 16:29:21 EST 2018


Hi,
Thanks for those tips, Sebastian. That just saved my day.

Regards,
Rajkumar

On Tue, Feb 13, 2018 at 12:44 AM, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Hi,
>
> By default, the clustering classes in sklearn (e.g., DBSCAN) take a
> [num_examples, num_features] array as input, but some of them, DBSCAN
> included, also accept the distance matrix directly if you instantiate them
> with metric='precomputed':
>
> my_dbscan = DBSCAN(..., metric='precomputed')
> my_dbscan.fit(my_distance_matrix)
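
A minimal, self-contained sketch of the metric='precomputed' pattern above.
The toy data and the eps=0.1 and min_samples=3 settings are illustrative
assumptions, not values from this thread; they would need tuning on a real
14000 x 14000 matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.RandomState(0)

# Toy data (assumption): two tight groups of vectors pointing in
# nearly orthogonal directions, so their cosine distance is close to 1.
group_a = np.array([1.0, 0.0, 0.0]) + 0.01 * rng.rand(5, 3)
group_b = np.array([0.0, 1.0, 0.0]) + 0.01 * rng.rand(5, 3)
X = np.vstack([group_a, group_b])

# Square [num_examples, num_examples] matrix of pairwise cosine distances.
distance_matrix = cosine_distances(X)

# With metric='precomputed', fit() expects the distance matrix itself
# instead of the [num_examples, num_features] array.
my_dbscan = DBSCAN(eps=0.1, min_samples=3, metric='precomputed')
labels = my_dbscan.fit_predict(distance_matrix)

print(labels)
```

A label of -1 marks points that DBSCAN treats as noise; if most points end up
with that label, eps is probably too small for the distances in the matrix.
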
>
> I'm not sure whether it helps in this particular case (it depends on how
> many zero elements you have), but you can also use a sparse matrix in CSR
> format:
> https://docs.scipy.org/doc/scipy-1.0.0/reference/generated/scipy.sparse.csr_matrix.html
>
> Also, you don't need to loop over the rows to compute the pairwise
> distances; you can compute them on the complete array in one call, e.g.,
>
> from sklearn.metrics.pairwise import cosine_distances
> from scipy import sparse
>
> distance_matrix = cosine_distances(sparse.csr_matrix(X))
>
> where X is your "[num_examples, num_features]" array.
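
As a rough check of the point about avoiding the row loop, the sketch below
compares a double loop against a single cosine_distances call. The random toy
array and the use of scipy.spatial.distance.cosine for the looped version are
assumptions made for illustration; the original notebook may compute its
distances differently.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.RandomState(0)
X = rng.rand(6, 4)  # toy [num_examples, num_features] array (assumption)

# Loop version: one cosine distance per pair of rows.
n = X.shape[0]
looped = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        looped[i, j] = cosine(X[i], X[j])

# Vectorized version: all pairwise distances in one call.
vectorized = cosine_distances(X)

# Both approaches should agree up to floating-point error.
print(np.allclose(looped, vectorized))
```

The loop makes num_examples squared Python-level calls, whereas the vectorized
call does the same work inside NumPy, which is where most of the time saving
would be expected to come from.
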
>
> Best,
> Sebastian
>
>
> > On Feb 12, 2018, at 1:10 PM, prince gosavi <princegosavi12 at gmail.com> wrote:
> >
> > I have generated a cosine distance matrix and would like to apply a
> > clustering algorithm to it.
> > np.shape(distance_matrix)==(14000,14000)
> >
> > I would like to know which clustering algorithm suits this best, and
> > whether the data needs any further processing before a model can be
> > applied.
> > I would also appreciate any performance tips, as the matrix takes around
> > 3-4 hours of processing.
> > You can find my code here:
> > https://github.com/maxyodedara5/BE_Project/blob/master/main.ipynb
> > Code for READ ONLY PURPOSE.
> > --
> > Regards
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Regards