[scikit-learn] DBScan freezes my computer !!!
Andreas Mueller
t3kcit at gmail.com
Wed May 16 13:37:36 EDT 2018
You might also consider looking at hdbscan:
https://github.com/scikit-learn-contrib/hdbscan
On 05/13/2018 11:07 PM, Joel Nothman wrote:
> Note that this has long been documented under "Memory consumption for
> large sample sizes" at
> http://scikit-learn.org/stable/modules/clustering.html#dbscan
>
> On 14 May 2018 at 12:59, Joel Nothman <joel.nothman at gmail.com> wrote:
>
> This is quite a common issue with our implementation of DBSCAN,
> and improvements to documentation would be very, very welcome.
>
> The high memory cost comes from computing the radius neighborhoods
> of all points at once. If using a distance metric that cannot be
> indexed with a KD-tree or Ball Tree, a dense n x n distance matrix
> (n^2 floats) is stored in memory even before the radius neighbors
> are extracted.
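
To put the n^2 figure in perspective, here is a rough back-of-the-envelope sketch. It assumes 8-byte (float64) distances and counts only the dense matrix itself; the helper name dense_distance_matrix_gib is made up for illustration and is not part of scikit-learn.

```python
def dense_distance_matrix_gib(n_samples, bytes_per_float=8):
    """Approximate size in GiB of a dense n x n distance matrix.

    Assumes float64 entries (8 bytes each) unless told otherwise.
    """
    return n_samples ** 2 * bytes_per_float / 2 ** 30


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9,d} samples -> {dense_distance_matrix_gib(n):,.1f} GiB")
```

Under that assumption, 100,000 samples already need roughly 75 GiB for the distances alone, which is consistent with a machine locking up.
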
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph
> in chunks, so as to avoid all pairs being calculated and stored at
> once. This produces a sparse graph representation, which can be
> passed into dbscan with metric='precomputed'. (I've just seen
> Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical
> points would be merged but would have a sample_weight of 2).
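
Strategy 1 above could look roughly like the following. This is only a sketch, not the library's own recipe: the helper chunked_radius_graph, the chunk size, and the toy data are all made up for illustration. It relies on NearestNeighbors.radius_neighbors_graph with mode="distance" and on DBSCAN accepting a sparse matrix when metric="precomputed".

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def chunked_radius_graph(X, eps, chunk_size=1000):
    """Build the sparse eps-neighborhood distance graph a block of rows at a time.

    Hypothetical helper: only chunk_size query rows are processed per call,
    so the full dense n x n distance matrix is never held in memory.
    """
    nn = NearestNeighbors(radius=eps).fit(X)
    blocks = []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Sparse block of shape (len(chunk), n_samples) with distances <= eps.
        blocks.append(nn.radius_neighbors_graph(chunk, mode="distance"))
    return sparse.vstack(blocks, format="csr")


# Toy data (assumption): two well-separated blobs of 200 points each.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.1, (200, 2)), rng.normal(5, 0.1, (200, 2))])

eps = 0.5
graph = chunked_radius_graph(X, eps=eps, chunk_size=64)

# The precomputed sparse graph replaces the raw feature matrix.
labels = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit_predict(graph)
```

The eps passed to DBSCAN should not exceed the radius used to build the graph, since pairs farther apart than that radius are simply absent from it.
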
>
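
For strategy 2, one possible way to merge exact duplicates is np.unique with return_counts, feeding the counts in as sample_weight. Again a sketch under assumptions: the toy data is invented, and near-duplicates would need some rounding or binning step first, which is not shown here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumption): 50 distinct points, each repeated 20 times.
rng = np.random.RandomState(0)
base = np.vstack([rng.normal(0, 0.1, (25, 2)), rng.normal(5, 0.1, (25, 2))])
X = np.repeat(base, 20, axis=0)

# Collapse identical rows; counts[i] is how many original rows unique_X[i] represents.
unique_X, inverse, counts = np.unique(
    X, axis=0, return_inverse=True, return_counts=True
)
inverse = np.ravel(inverse)  # keep the index array 1-D across NumPy versions

# Cluster the reduced set, weighting each point by its multiplicity.
unique_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(
    unique_X, sample_weight=counts
)

# Map the labels back onto the original rows.
labels = unique_labels[inverse]
```

Here the neighborhood computation runs on 50 rows instead of 1000, while min_samples is still evaluated against the summed weights.
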
> There is also a proposal to offer an alternative memory-efficient
> mode at https://github.com/scikit-learn/scikit-learn/pull/6813.
> Feedback is welcome.
>
> Cheers,
>
> Joel