[SciPy-dev] Hierarchical clustering package

Fri Nov 23 22:12:14 EST 2007

Hi David,

Sorry for the late response. Thanksgiving festivities, holiday shopping, 
and my day job have all gotten in the way.

David Cournapeau wrote:
> Damian Eads wrote:
>> Hello,
>>
>> I developed a hierarchical clustering package, which I offer under the 
>> terms of the BSD License. It supports agglomerative clustering, plotting 
>> of dendrograms, flat cluster formation, computation of a few cluster 
>> statistics, and computing distances between vectors using a variety of 
>> distance metrics. The source code is available for your perusal at 
>> http://scipy-cluster.googlecode.com/svn/trunk/ and the API 
>> Documentation, http://www.soe.ucsc.edu/~eads/cluster.html . The 
>> interface is similar to the interface used in MATLAB's Statistics 
>> Toolbox to ease conversion of old MATLAB programs to Python. Eventually, 
>> I'd like to integrate it into Scipy (hence, naming my SVN repository 
>> scipy-cluster).
> Hi Damian,
> 
>     This looks great. I have a couple of questions:
> 
>     - do you think it would be possible to split out the reusable parts 
> of the package (in particular, the distance matrices: scipy.cluster and 
> a few other packages could reuse those)?

The distance functions are fairly self-contained, so I don't see why not. 
In fact, one would only need to move the *Python* distance functions 
from the hcluster.py file to the appropriate module file. Alternatively, 
the __init__.py file in scipy/cluster can import the distance functions 
into the scipy.cluster package without importing the other hierarchical 
clustering functions.
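
A rough sketch of the second option, runnable on its own. The package 
name demo_cluster and the functions euclidean and linkage are stand-ins 
made up for this illustration; they are not the actual hcluster or 
scipy.cluster layout. The script writes a throwaway package to a 
temporary directory so that the effect of the one-line __init__.py can 
be checked.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a throwaway package on disk (hypothetical name: demo_cluster).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_cluster"
pkg.mkdir()

# Stand-in for hcluster.py: one distance function, one clustering function.
(pkg / "hcluster.py").write_text(textwrap.dedent('''
    import math

    def euclidean(u, v):
        """Euclidean distance between two equal-length sequences."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def linkage(points):
        raise NotImplementedError("placeholder for the clustering routines")
'''))

# The package __init__.py re-exports only the distance function.
(pkg / "__init__.py").write_text("from .hcluster import euclidean\n")

sys.path.insert(0, str(root))
demo_cluster = importlib.import_module("demo_cluster")

print(demo_cluster.euclidean((0, 0), (3, 4)))  # 5.0
print(hasattr(demo_cluster, "linkage"))        # False
```

The clustering code stays reachable as demo_cluster.hcluster.linkage; 
only the distance function appears directly in the package namespace.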

>     - do you have some examples ?

I do. One is available at http://www.soe.ucsc.edu/~eads/iris.html. The 
hierarchical clustering examples in the MATLAB Statistics Toolbox 
documentation should work as well.

> I don't know what the opinions of others are on this, but maybe this 
> package could be added to scikits (there is already a scikits.learn 
> package for ML-related algorithms, ANN, EM for mixtures of Gaussians, 
> and SVM) ?

I'm fine with maintaining it as a separate package (hcluster) on my 
website unless others find that including it in Scipy would be useful.

>>    * The tests I'm writing require some data files to run. What is the 
>> convention for storing and retrieving data files when running a Scipy 
>> regression test? Presumably the test programs should be able to find the 
>> data files without regard to whether the data files are stored in 
>> /usr/share or in the src directory. One solution is to embed the data in 
>> the testing programs themselves but this is messy, and I'd like to know 
>> if there is a better solution.
>
> The convention is to have the datasets in the package. I am not sure I 
> understand why it is messy: isn't it good to have self-contained 
> regression tests?

I am not making an argument against self-contained tests. Rather, I am 
simply stating that putting the data in Python programs is a bit messy, 
especially when the data are large. It gives SVN diff unnecessary lines 
to process whenever a program containing a data set is changed, which 
is less of a problem when the data sets are stored in separate text 
files. If the Scipy convention is to put testing data in test programs 
then so be it. What's the convention?
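
To make the separate-text-file alternative concrete, here is a minimal 
sketch that finds a data file relative to the test module itself, so the 
lookup works wherever the package is installed. This is an assumption 
about one workable layout, not a statement of the Scipy convention; the 
helper names data_path and load_matrix and the file points.txt are 
hypothetical.

```python
import os
import tempfile


def data_path(test_file, name):
    """Return the path of data file `name` in the directory of `test_file`."""
    return os.path.join(os.path.dirname(os.path.abspath(test_file)), name)


def load_matrix(path):
    """Parse a whitespace-delimited text file into a list of float rows."""
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


# Demo: a temporary directory stands in for the tests directory. In a real
# test module, the first argument to data_path would be __file__.
tests_dir = tempfile.mkdtemp()
fake_test_module = os.path.join(tests_dir, "test_hcluster.py")
with open(os.path.join(tests_dir, "points.txt"), "w") as f:
    f.write("5.1 3.5\n4.9 3.0\n")

rows = load_matrix(data_path(fake_test_module, "points.txt"))
print(rows)  # [[5.1, 3.5], [4.9, 3.0]]
```

With this layout the data file is versioned on its own, so a change to 
the test code does not touch the data in the diff.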

Java has a facility for loading resources like images, text, and data 
files, which are all loaded in much the same way that classes are 
loaded. Does Python have a similar facility?

Cheers,

Damian