[SciPy-User] Help with clustering : Memory Error with large dataset
Abhishek Pratap
apratap at lbl.gov
Thu Apr 5 17:10:33 EDT 2012
Hey Guys
I am re-posting a message I sent to the numpy mailing list earlier. In
summary, I need help with clustering. My input dataset is about 1-2
million (x, y) coordinates, which I would like to cluster, for example
using the DBSCAN algorithm. I tried it on a small data set and it works fine.
When I increase the input size, it crashes. Can I be more efficient?
More details are copied below.
Thanks!
-Abhi
===message from numpy mailing list====
I am new to Python, and even more so to numpy. I am trying to cluster
close to 900K points using the DBSCAN algorithm. My input is a list of ~900K
tuples, each holding the (x, y) coordinates of one point. I am converting them
to a numpy array and passing it to the pdist function of
scipy.spatial.distance to calculate the distance between every pair of points.
Here is some size info on my numpy array:
shape of input array : (828575, 2)
Size : 6872000 bytes
I think the error has something to do with the default double dtype of
the numpy array used by the pdist function. I would appreciate it if you
could help me debug this. I am sure I am overlooking something naive here.
See the traceback below.
MemoryError Traceback (most recent call last)
/house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276>
in <module>()
36
37 print cleaned_senseBam
---> 38 cluster_pet_points_per_chromosome(sense_bamFile)
/house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276>
in cluster_pet_points_per_chromosome(bamFile)
30 print 'Size of list points is %d' % sys.getsizeof(points)
31 print 'Size of numpy array is %d' %
sys.getsizeof(points_array)
---> 32 cluster_points_DBSCAN(points_array)
33 #print points_array
34
/house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-72-77005d7cd900>
in cluster_points_DBSCAN(data_numpy_array)
9 def cluster_points_DBSCAN(data_numpy_array):
10 #eucledian distance calculation
---> 11 D = distance.pdist(data_numpy_array)
12 S = distance.squareform(D)
13 H = 1 - S/np.max(S)
/house/homedirs/a/apratap/playground/software/epd-7.2-2-rh5-x86_64/lib/python2.7/site-packages/scipy/spatial/distance.pyc
in pdist(X, metric, p, w, V, VI)
1155
1156 m, n = s
-> 1157 dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
1158
1159 wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
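For scale, here is a rough sketch of how much memory the np.zeros call at line 1157 of the traceback requests. It is plain arithmetic based on the m * (m - 1) / 2 expression and the np.double dtype shown above, not part of the original code. The helper names condensed_pdist_bytes and square_matrix_bytes are made up for illustration, and the sketch is written for Python 3 (hence the integer division operator), whereas the traceback comes from Python 2.7.

```python
def condensed_pdist_bytes(m, itemsize=8):
    """Bytes for the condensed vector of m * (m - 1) / 2 pairwise distances.

    itemsize=8 matches np.double, the dtype shown in the traceback.
    """
    n_pairs = m * (m - 1) // 2
    return n_pairs * itemsize


def square_matrix_bytes(m, itemsize=8):
    """Bytes for the full m x m matrix that squareform would build next."""
    return m * m * itemsize


# Number of points, taken from the array shape reported in the post.
m = 828575

condensed = condensed_pdist_bytes(m)
square = square_matrix_bytes(m)

print("number of pairs      : %d" % (m * (m - 1) // 2))
print("condensed vector size: %.2f TB" % (condensed / 1e12))
print("square matrix size   : %.2f TB" % (square / 1e12))
```

With 828575 points the condensed vector alone comes to roughly 2.75 TB, and the square form to about twice that. The allocation therefore fails before any distance is computed, and it would still fail with a smaller float dtype.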