[scikit-learn] Can I evaluate clustering efficiency incrementally?

Thu May 16 03:06:37 EDT 2019

The contingency matrix (
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html)
counts how many times each (true cluster, predicted cluster) pair
occurs. It is a sufficient statistic for every "supervised" (i.e. ground
truth-based) clustering evaluation metric in scikit-learn. In an
incremental setting, you can simply add the counts from each new predicted
batch to a running contingency matrix, as long as the rows and columns
refer to the same cluster labels across batches. In
https://github.com/scikit-learn/scikit-learn/issues/8103 I proposed that we
provide an API for calculating clustering metrics from the sufficient
statistics alone, but that has not come to fruition.
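
To make the idea concrete, here is a minimal sketch, not an existing
scikit-learn API. It keeps the contingency counts in a plain Counter keyed
by (true label, predicted label), so it does not depend on the label set
being known in advance, and computes the adjusted Rand index from those
counts alone. The helper names update_contingency and
adjusted_rand_from_contingency are made up for illustration.

```python
from collections import Counter
from math import comb


def update_contingency(table, labels_true, labels_pred):
    """Add one batch of (true, predicted) label pairs to the running counts."""
    table.update(zip(labels_true, labels_pred))
    return table


def adjusted_rand_from_contingency(table):
    """Adjusted Rand index computed from the contingency counts alone."""
    row_sums = Counter()  # number of samples in each true cluster
    col_sums = Counter()  # number of samples in each predicted cluster
    for (t, p), n in table.items():
        row_sums[t] += n
        col_sums[p] += n

    n_samples = sum(table.values())
    total_pairs = comb(n_samples, 2)
    if total_pairs == 0:
        # Fewer than two samples: treat as perfect agreement (a convention).
        return 1.0

    sum_cells = sum(comb(n, 2) for n in table.values())
    sum_rows = sum(comb(n, 2) for n in row_sums.values())
    sum_cols = sum(comb(n, 2) for n in col_sums.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one cluster or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical stream of (labels_true, labels_pred) batches.
batches = [
    ([0, 0], [0, 0]),
    ([1, 1], [1, 2]),
]
table = Counter()
for y_true, y_pred in batches:
    update_contingency(table, y_true, y_pred)
print(adjusted_rand_from_contingency(table))
```

Only the counts are kept between batches, so memory use depends on the
number of distinct (true, predicted) pairs rather than on the number of
samples. The result should agree, up to floating-point error, with
sklearn.metrics.adjusted_rand_score applied to the concatenated labels.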

On Thu, 16 May 2019 at 11:47, lampahome <pahome.chen at mirlab.org> wrote:

> Joel Nothman <joel.nothman at gmail.com> wrote on Wed, 15 May 2019 at 12:16 PM:
>
>> Evaluating on large datasets is easy if the sufficient statistics are
>> just the contingency matrix.
>>
>>
> Sorry, I don't understand. Can you explain in detail?
> Do you mean we could evaluate on a subset of the samples if the subset is
> a contingency (normal distribution) matrix?