[Scipy-svn] r6884 - trunk/scipy/cluster
scipy-svn at scipy.org
scipy-svn at scipy.org
Sun Nov 14 04:58:41 EST 2010
Author: rgommers
Date: 2010-11-14 03:58:41 -0600 (Sun, 14 Nov 2010)
New Revision: 6884
Modified:
trunk/scipy/cluster/__init__.py
trunk/scipy/cluster/hierarchy.py
trunk/scipy/cluster/vq.py
Log:
DOC: merge wiki edits for cluster module.
Modified: trunk/scipy/cluster/__init__.py
===================================================================
--- trunk/scipy/cluster/__init__.py 2010-11-14 09:58:13 UTC (rev 6883)
+++ trunk/scipy/cluster/__init__.py 2010-11-14 09:58:41 UTC (rev 6884)
@@ -1,3 +1,20 @@
+"""
+Vector Quantization / Kmeans
+============================
+Clustering algorithms are useful in information theory, target detection,
+communications, compression, and other areas. The `vq` module only
+supports vector quantization and the k-means algorithms. Development of
+self-organizing maps (SOM) and other approaches is underway.
+
+Hierarchical Clustering
+=======================
+The `hierarchy` module provides functions for hierarchical and
+agglomerative clustering. Its features include generating hierarchical
+clusters from distance matrices, computing distance matrices from
+observation vectors, calculating statistics on clusters, cutting linkages
+to generate flat clusters, and visualizing clusters with dendrograms.
+
+"""
#
# spatial - Distances
#
Modified: trunk/scipy/cluster/hierarchy.py
===================================================================
--- trunk/scipy/cluster/hierarchy.py 2010-11-14 09:58:13 UTC (rev 6883)
+++ trunk/scipy/cluster/hierarchy.py 2010-11-14 09:58:41 UTC (rev 6884)
@@ -6,109 +6,75 @@
or find the roots of the forest formed by a cut by providing the flat
cluster ids of each observation.
-+------------------+-------------------------------------------------+
-|*Function* | *Description* |
-+------------------+-------------------------------------------------+
-|fcluster |forms flat clusters from hierarchical clusters. |
-+------------------+-------------------------------------------------+
-|fclusterdata |forms flat clusters directly from data. |
-+------------------+-------------------------------------------------+
-|leaders |singleton root nodes for flat cluster. |
-+------------------+-------------------------------------------------+
+.. autosummary::
+ :toctree: generated/
+ fcluster
+ fclusterdata
+ leaders
+
These are routines for agglomerative clustering.
-+------------------+-------------------------------------------------+
-|*Function* | *Description* |
-+------------------+-------------------------------------------------+
-|linkage |agglomeratively clusters original observations. |
-+------------------+-------------------------------------------------+
-|single |the single/min/nearest algorithm. (alias) |
-+------------------+-------------------------------------------------+
-|complete |the complete/max/farthest algorithm. (alias) |
-+------------------+-------------------------------------------------+
-|average |the average/UPGMA algorithm. (alias) |
-+------------------+-------------------------------------------------+
-|weighted |the weighted/WPGMA algorithm. (alias) |
-+------------------+-------------------------------------------------+
-|centroid |the centroid/UPGMC algorithm. (alias) |
-+------------------+-------------------------------------------------+
-|median |the median/WPGMC algorithm. (alias) |
-+------------------+-------------------------------------------------+
-|ward |the Ward/incremental algorithm. (alias) |
-+------------------+-------------------------------------------------+
+.. autosummary::
+ :toctree: generated/
+ linkage
+ single
+ complete
+ average
+ weighted
+ centroid
+ median
+ ward
+
These routines compute statistics on hierarchies.
-+------------------+-------------------------------------------------+
-|*Function* | *Description* |
-+------------------+-------------------------------------------------+
-|cophenet |computes the cophenetic distance between leaves. |
-+------------------+-------------------------------------------------+
-|from_mlab_linkage |converts a linkage produced by MATLAB(TM). |
-+------------------+-------------------------------------------------+
-|inconsistent |the inconsistency coefficients for cluster. |
-+------------------+-------------------------------------------------+
-|maxinconsts |the maximum inconsistency coefficient for each |
-| |cluster. |
-+------------------+-------------------------------------------------+
-|maxdists |the maximum distance for each cluster. |
-+------------------+-------------------------------------------------+
-|maxRstat |the maximum specific statistic for each cluster. |
-+------------------+-------------------------------------------------+
-|to_mlab_linkage |converts a linkage to one MATLAB(TM) can |
-| |understand. |
-+------------------+-------------------------------------------------+
+.. autosummary::
+ :toctree: generated/
+ cophenet
+ from_mlab_linkage
+ inconsistent
+ maxinconsts
+ maxdists
+ maxRstat
+ to_mlab_linkage
+
Routines for visualizing flat clusters.
-+------------------+-------------------------------------------------+
-|*Function* | *Description* |
-+------------------+-------------------------------------------------+
-|dendrogram |visualizes linkages (requires matplotlib). |
-+------------------+-------------------------------------------------+
+.. autosummary::
+ :toctree: generated/
+ dendrogram
+
These are data structures and routines for representing hierarchies as
tree objects.
-+------------------+-------------------------------------------------+
-|*Function* | *Description* |
-+------------------+-------------------------------------------------+
-|ClusterNode |represents cluster nodes in a cluster hierarchy. |
-+------------------+-------------------------------------------------+
-|leaves_list |a left-to-right traversal of the leaves. |
-+------------------+-------------------------------------------------+
-|to_tree |represents a linkage matrix as a tree object. |
-+------------------+-------------------------------------------------+
+.. autosummary::
+ :toctree: generated/
+ ClusterNode
+ leaves_list
+ to_tree
+
These are predicates for checking the validity of linkage and
inconsistency matrices as well as for checking isomorphism of two
flat cluster assignments.
-+------------------+-------------------------------------------------+
-|*Function* | *Description* |
-+------------------+-------------------------------------------------+
-|is_valid_im |checks for a valid inconsistency matrix. |
-+------------------+-------------------------------------------------+
-|is_valid_linkage |checks for a valid hierarchical clustering. |
-+------------------+-------------------------------------------------+
-|is_isomorphic |checks if two flat clusterings are isomorphic. |
-+------------------+-------------------------------------------------+
-|is_monotonic |checks if a linkage is monotonic. |
-+------------------+-------------------------------------------------+
-|correspond |checks whether a condensed distance matrix |
-| |corresponds with a linkage |
-+------------------+-------------------------------------------------+
-|num_obs_linkage |the number of observations corresponding to a |
-| |linkage matrix. |
-+------------------+-------------------------------------------------+
+.. autosummary::
+ :toctree: generated/
+ is_valid_im
+ is_valid_linkage
+ is_isomorphic
+ is_monotonic
+ correspond
+ num_obs_linkage
* MATLAB and MathWorks are registered trademarks of The MathWorks, Inc.
* Mathematica is a registered trademark of The Wolfram Research, Inc.
-
References
----------
@@ -1332,38 +1298,40 @@
def fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None):
"""
Forms flat clusters from the hierarchical clustering defined by
- the linkage matrix ``Z``. The threshold ``t`` is a required parameter.
+ the linkage matrix ``Z``.
- :Arguments:
+ Parameters
+ ----------
+ Z : ndarray
+ The hierarchical clustering encoded with the matrix returned
+ by the `linkage` function.
+ t : float
+ The threshold to apply when forming flat clusters.
+ criterion : str, optional
+ The criterion to use in forming flat clusters. This can
+ be any of the following values:
- - Z : ndarray
- The hierarchical clustering encoded with the matrix returned
- by the ``linkage`` function.
-
- - t : double
- The threshold to apply when forming flat clusters.
-
- - criterion : string (optional)
- The criterion to use in forming flat clusters. This can
- be any of the following values:
-
- * 'inconsistent': If a cluster node and all its
- decendents have an inconsistent value less than or equal
- to ``t`` then all its leaf descendents belong to the
+ 'inconsistent':
+ If a cluster node and all its
+ descendants have an inconsistent value less than or equal
+ to ``t`` then all its leaf descendants belong to the
same flat cluster. When no non-singleton cluster meets
this criterion, every node is assigned to its own
cluster. (Default)
- * 'distance': Forms flat clusters so that the original
+ 'distance':
+ Forms flat clusters so that the original
observations in each flat cluster have no greater a
cophenetic distance than ``t``.
- * 'maxclust': Finds a minimum threshold ``r`` so that
+ 'maxclust':
+ Finds a minimum threshold ``r`` so that
the cophenetic distance between any two original
observations in the same flat cluster is no more than
``r`` and no more than ``t`` flat clusters are formed.
- * 'monocrit': Forms a flat cluster from a cluster node c
+ 'monocrit':
+ Forms a flat cluster from a cluster node c
with index i when ``monocrit[j] <= t``.
For example, to threshold on the maximum mean distance
@@ -1373,38 +1341,38 @@
MR = maxRstat(Z, R, 3)
cluster(Z, t=0.8, criterion='monocrit', monocrit=MR)
- * 'maxclust_monocrit': Forms a flat cluster from a
+ 'maxclust_monocrit':
+ Forms a flat cluster from a
non-singleton cluster node ``c`` when ``monocrit[i] <=
r`` for all cluster indices ``i`` below and including
``c``. ``r`` is minimized such that no more than ``t``
flat clusters are formed. monocrit must be
monotonic. For example, to minimize the threshold t on
maximum inconsistency values so that no more than 3 flat
- clusters are formed, do:
+ clusters are formed, do::
MI = maxinconsts(Z, R)
cluster(Z, t=3, criterion='maxclust_monocrit', monocrit=MI)
- - depth : int (optional)
- The maximum depth to perform the inconsistency calculation.
- It has no meaning for the other criteria. (default=2)
+ depth : int, optional
+ The maximum depth to perform the inconsistency calculation.
+ It has no meaning for the other criteria. Default is 2.
+ R : ndarray, optional
+ The inconsistency matrix to use for the 'inconsistent'
+ criterion. This matrix is computed if not provided.
+ monocrit : ndarray, optional
+ An array of length n-1. ``monocrit[i]`` is the
+ statistics upon which non-singleton i is thresholded. The
+ monocrit vector must be monotonic, i.e. given a node c with
+ index i, for all node indices j corresponding to nodes
+ below c, ``monocrit[i] >= monocrit[j]``.
- - R : ndarray (optional)
- The inconsistency matrix to use for the 'inconsistent'
- criterion. This matrix is computed if not provided.
+ Returns
+ -------
+ fcluster : ndarray
+ An array of length n. T[i] is the flat cluster number to
+ which original observation i belongs.
- - monocrit : ndarray (optional)
- A ``(n-1)`` numpy vector of doubles. ``monocrit[i]`` is the
- statistics upon which non-singleton ``i`` is thresholded. The
- monocrit vector must be monotonic, i.e. given a node ``c`` with
- index ``i``, for all node indices j corresponding to nodes
- below ``c``, ``monocrit[i] >= monocrit[j]``.
-
- :Returns:
-
- - T : ndarray
- A vector of length ``n``. ``T[i]`` is the flat cluster number to
- which original observation ``i`` belongs.
"""
Z = np.asarray(Z, order='c')
is_valid_linkage(Z, throw=True, name='Z')
@@ -1444,67 +1412,56 @@
def fclusterdata(X, t, criterion='inconsistent', \
metric='euclidean', depth=2, method='single', R=None):
"""
- ``T = fclusterdata(X, t)``
+ Cluster observation data using a given metric.
- Clusters the original observations in the ``n`` by ``m`` data
- matrix ``X`` (``n`` observations in ``m`` dimensions), using the
- euclidean distance metric to calculate distances between original
- observations, performs hierarchical clustering using the single
- linkage algorithm, and forms flat clusters using the inconsistency
- method with t as the cut-off threshold.
+ Clusters the original observations in the n-by-m data
+ matrix X (n observations in m dimensions), using the euclidean
+ distance metric to calculate distances between original observations,
+ performs hierarchical clustering using the single linkage algorithm,
+ and forms flat clusters using the inconsistency method with `t` as the
+ cut-off threshold.
- A one-dimensional numpy array ``T`` of length ``n`` is
- returned. ``T[i]`` is the index of the flat cluster to which the
- original observation ``i`` belongs.
+ A one-dimensional array T of length n is returned. T[i] is the index
+ of the flat cluster to which the original observation i belongs.
- :Arguments:
+ Parameters
+ ----------
+ X : ndarray
+ n by m data matrix with n observations in m dimensions.
+ t : float
+ The threshold to apply when forming flat clusters.
+ criterion : str, optional
+ Specifies the criterion for forming flat clusters. Valid
+ values are 'inconsistent' (default), 'distance', or 'maxclust'
+ cluster formation algorithms. See `fcluster` for descriptions.
+ method : str, optional
+ The linkage method to use (single, complete, average,
+ weighted, median centroid, ward). See `linkage` for more
+ information. Default is "single".
+ metric : str, optional
+ The distance metric for calculating pairwise distances. See
+ `distance.pdist` for descriptions and linkage to verify
+ compatibility with the linkage method.
+ t : double, optional
+ The cut-off threshold for the cluster function or the
+ maximum number of clusters (criterion='maxclust').
+ depth : int, optional
+ The maximum depth for the inconsistency calculation. See
+ `inconsistent` for more information.
+ R : ndarray, optional
+ The inconsistency matrix. It will be computed if necessary
+ if it is not passed.
- - X : ndarray
- ``n`` by ``m`` data matrix with ``n`` observations in ``m``
- dimensions.
+ Returns
+ -------
+ T : ndarray
+ A vector of length n. T[i] is the flat cluster number to
+ which original observation i belongs.
- - t : double
- The threshold to apply when forming flat clusters.
-
- - criterion : string
- Specifies the criterion for forming flat clusters. Valid
- values are 'inconsistent', 'distance', or 'maxclust' cluster
- formation algorithms. See ``fcluster`` for descriptions.
-
- - method : string
- The linkage method to use (single, complete, average,
- weighted, median centroid, ward). See ``linkage`` for more
- information.
-
- - metric : string
- The distance metric for calculating pairwise distances. See
- distance.pdist for descriptions and linkage to verify
- compatibility with the linkage method.
-
- - t : double
- The cut-off threshold for the cluster function or the
- maximum number of clusters (criterion='maxclust').
-
- - depth : int
- The maximum depth for the inconsistency calculation. See
- ``inconsistent`` for more information.
-
- - R : ndarray
- The inconsistency matrix. It will be computed if necessary
- if it is not passed.
-
-
- :Returns:
-
- - T : ndarray
- A vector of length ``n``. ``T[i]`` is the flat cluster number to
- which original observation ``i`` belongs.
-
Notes
-----
+ This function is similar to the MATLAB function clusterdata.
- This function is similar to MATLAB(TM) clusterdata function.
-
"""
X = np.asarray(X, order='c', dtype=np.double)
Modified: trunk/scipy/cluster/vq.py
===================================================================
--- trunk/scipy/cluster/vq.py 2010-11-14 09:58:13 UTC (rev 6883)
+++ trunk/scipy/cluster/vq.py 2010-11-14 09:58:41 UTC (rev 6884)
@@ -1,71 +1,77 @@
-""" K-means Clustering and Vector Quantization Module
+"""
+K-means Clustering and Vector Quantization Module
- Provides routines for k-means clustering, generating code books
- from k-means models, and quantizing vectors by comparing them with
- centroids in a code book.
+Provides routines for k-means clustering, generating code books
+from k-means models, and quantizing vectors by comparing them with
+centroids in a code book.
- The k-means algorithm takes as input the number of clusters to
- generate, k, and a set of observation vectors to cluster. It
- returns a set of centroids, one for each of the k clusters. An
- observation vector is classified with the cluster number or
- centroid index of the centroid closest to it.
+The k-means algorithm takes as input the number of clusters to
+generate, k, and a set of observation vectors to cluster. It
+returns a set of centroids, one for each of the k clusters. An
+observation vector is classified with the cluster number or
+centroid index of the centroid closest to it.
- A vector v belongs to cluster i if it is closer to centroid i than
- any other centroids. If v belongs to i, we say centroid i is the
- dominating centroid of v. Common variants of k-means try to
- minimize distortion, which is defined as the sum of the distances
- between each observation vector and its dominating centroid. Each
- step of the k-means algorithm refines the choices of centroids to
- reduce distortion. The change in distortion is often used as a
- stopping criterion: when the change is lower than a threshold, the
- k-means algorithm is not making sufficient progress and
- terminates.
+A vector v belongs to cluster i if it is closer to centroid i than
+any other centroids. If v belongs to i, we say centroid i is the
+dominating centroid of v. The k-means algorithm tries to
+minimize distortion, which is defined as the sum of the squared distances
+between each observation vector and its dominating centroid. Each
+step of the k-means algorithm refines the choices of centroids to
+reduce distortion. The change in distortion is used as a
+stopping criterion: when the change is lower than a threshold, the
+k-means algorithm is not making sufficient progress and
+terminates. One can also define a maximum number of iterations.
- Since vector quantization is a natural application for k-means,
- information theory terminology is often used. The centroid index
- or cluster index is also referred to as a "code" and the table
- mapping codes to centroids and vice versa is often referred as a
- "code book". The result of k-means, a set of centroids, can be
- used to quantize vectors. Quantization aims to find an encoding of
- vectors that reduces the expected distortion.
+Since vector quantization is a natural application for k-means,
+information theory terminology is often used. The centroid index
+or cluster index is also referred to as a "code" and the table
+mapping codes to centroids and vice versa is often referred as a
+"code book". The result of k-means, a set of centroids, can be
+used to quantize vectors. Quantization aims to find an encoding of
+vectors that reduces the expected distortion.
- For example, suppose we wish to compress a 24-bit color image
- (each pixel is represented by one byte for red, one for blue, and
- one for green) before sending it over the web. By using a smaller
- 8-bit encoding, we can reduce the amount of data by two
- thirds. Ideally, the colors for each of the 256 possible 8-bit
- encoding values should be chosen to minimize distortion of the
- color. Running k-means with k=256 generates a code book of 256
- codes, which fills up all possible 8-bit sequences. Instead of
- sending a 3-byte value for each pixel, the 8-bit centroid index
- (or code word) of the dominating centroid is transmitted. The code
- book is also sent over the wire so each 8-bit code can be
- translated back to a 24-bit pixel value representation. If the
- image of interest was of an ocean, we would expect many 24-bit
- blues to be represented by 8-bit codes. If it was an image of a
- human face, more flesh tone colors would be represented in the
- code book.
+All routines expect obs to be a M by N array where the rows are
+the observation vectors. The codebook is a k by N array where the
+i'th row is the centroid of code word i. The observation vectors
+and centroids have the same feature dimension.
- All routines expect obs to be a M by N array where the rows are
- the observation vectors. The codebook is a k by N array where the
- i'th row is the centroid of code word i. The observation vectors
- and centroids have the same feature dimension.
+As an example, suppose we wish to compress a 24-bit color image
+(each pixel is represented by one byte for red, one for blue, and
+one for green) before sending it over the web. By using a smaller
+8-bit encoding, we can reduce the amount of data by two
+thirds. Ideally, the colors for each of the 256 possible 8-bit
+encoding values should be chosen to minimize distortion of the
+color. Running k-means with k=256 generates a code book of 256
+codes, which fills up all possible 8-bit sequences. Instead of
+sending a 3-byte value for each pixel, the 8-bit centroid index
+(or code word) of the dominating centroid is transmitted. The code
+book is also sent over the wire so each 8-bit code can be
+translated back to a 24-bit pixel value representation. If the
+image of interest was of an ocean, we would expect many 24-bit
+blues to be represented by 8-bit codes. If it was an image of a
+human face, more flesh tone colors would be represented in the
+code book.
- whiten(obs) --
- Normalize a group of observations so each feature has unit
- variance.
- vq(obs,code_book) --
- Calculate code book membership of a set of observation
- vectors.
- kmeans(obs,k_or_guess,iter=20,thresh=1e-5) --
- Clusters a set of observation vectors. Learns centroids with
- the k-means algorithm, trying to minimize distortion. A code
- book is generated that can be used to quantize vectors.
- kmeans2 --
- A different implementation of k-means with more methods for
- initializing centroids. Uses maximum number of iterations as
- opposed to a distortion threshold as its stopping criterion.
+Functions
+---------
+`whiten` :
+ Normalize a group of observations so each feature has unit
+ variance.
+`vq` :
+ Calculate code book membership of a set of observation
+ vectors.
+
+`kmeans` :
+ Clusters a set of observation vectors. Learns centroids with
+ the k-means algorithm, trying to minimize distortion. A code
+ book is generated that can be used to quantize vectors.
+
+`kmeans2` :
+ A different implementation of k-means with more methods for
+ initializing centroids. Uses maximum number of iterations as
+ opposed to a distortion threshold as its stopping criterion.
+
"""
__docformat__ = 'restructuredtext'
@@ -393,62 +399,68 @@
return code_book, avg_dist[-1]
def kmeans(obs, k_or_guess, iter=20, thresh=1e-5):
- """Performs k-means on a set of observation vectors forming k
- clusters. This yields a code book mapping centroids to codes
- and vice versa. The k-means algorithm adjusts the centroids
- until sufficient progress cannot be made, i.e. the change in
- distortion since the last iteration is less than some
- threshold.
+ """
+ Performs k-means on a set of observation vectors forming k clusters.
- :Parameters:
- obs : ndarray
- Each row of the M by N array is an observation vector. The
- columns are the features seen during each observation.
- The features must be whitened first with the whiten
- function.
+ The k-means algorithm adjusts the centroids until sufficient
+ progress cannot be made, i.e. the change in distortion since
+ the last iteration is less than some threshold. This yields
+ a code book mapping centroids to codes and vice versa.
- k_or_guess : int or ndarray
- The number of centroids to generate. A code is assigned to
- each centroid, which is also the row index of the centroid
- in the code_book matrix generated.
+ Distortion is defined as the sum of the squared differences
+ between the observations and the corresponding centroid.
- The initial k centroids are chosen by randomly selecting
- observations from the observation matrix. Alternatively,
- passing a k by N array specifies the initial k centroids.
+ Parameters
+ ----------
+ obs : ndarray
+ Each row of the M by N array is an observation vector. The
+ columns are the features seen during each observation.
+ The features must be whitened first with the `whiten` function.
- iter : int
- The number of times to run k-means, returning the codebook
- with the lowest distortion. This argument is ignored if
- initial centroids are specified with an array for the
- k_or_guess paramter. This parameter does not represent the
- number of iterations of the k-means algorithm.
+ k_or_guess : int or ndarray
+ The number of centroids to generate. A code is assigned to
+ each centroid, which is also the row index of the centroid
+ in the code_book matrix generated.
- thresh : float
- Terminates the k-means algorithm if the change in
- distortion since the last k-means iteration is less than
- thresh.
+ The initial k centroids are chosen by randomly selecting
+ observations from the observation matrix. Alternatively,
+ passing a k by N array specifies the initial k centroids.
- :Returns:
- codebook : ndarray
- A k by N array of k centroids. The i'th centroid
- codebook[i] is represented with the code i. The centroids
- and codes generated represent the lowest distortion seen,
- not necessarily the globally minimal distortion.
+ iter : int, optional
+ The number of times to run k-means, returning the codebook
+ with the lowest distortion. This argument is ignored if
+ initial centroids are specified with an array for the
+ ``k_or_guess`` parameter. This parameter does not represent the
+ number of iterations of the k-means algorithm.
- distortion : float
- The distortion between the observations passed and the
- centroids generated.
+ thresh : float, optional
+ Terminates the k-means algorithm if the change in
+ distortion since the last k-means iteration is less than
+ or equal to thresh.
- :SeeAlso:
- - kmeans2: a different implementation of k-means clustering
- with more methods for generating initial centroids but without
- using a distortion change threshold as a stopping criterion.
- - whiten: must be called prior to passing an observation matrix
- to kmeans.
+ Returns
+ -------
+ codebook : ndarray
+ A k by N array of k centroids. The i'th centroid
+ codebook[i] is represented with the code i. The centroids
+ and codes generated represent the lowest distortion seen,
+ not necessarily the globally minimal distortion.
+ distortion : float
+ The distortion between the observations passed and the
+ centroids generated.
+
+ See Also
+ --------
+ kmeans2 : a different implementation of k-means clustering
+ with more methods for generating initial centroids but without
+ using a distortion change threshold as a stopping criterion.
+
+ whiten : must be called prior to passing an observation matrix
+ to kmeans.
+
Examples
--------
-
>>> from numpy import array
>>> from scipy.cluster.vq import vq, kmeans, whiten
>>> features = array([[ 1.9,2.3],
More information about the Scipy-svn
mailing list