clusters of numbers

Sat Dec 15 23:43:44 EST 2018

-----Original Message-----
From: Avi Gross <avigross at verizon.net> 
Sent: Saturday, December 15, 2018 11:27 PM
To: 'Marc Lucke' <marc at marcsnet.com>
Subject: RE: clusters of numbers

Marc,

There are k-means implementations in Python, R and other places. Most uses have two or more dimensions. You specify how many clusters to look for, and the algorithm iterates: it starts with randomly chosen existing points as centers, assigns each point to the nearest center, recomputes each center as the mean of its cluster, and repeats until things stabilize.
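To make that loop concrete, here is a minimal pure-Python sketch for 1-D data. The function name kmeans_1d and its parameters are made up for illustration, and it is not what any library does internally. It uses the small example from your post rather than your full list.

```python
import random

def kmeans_1d(points, k, seed=0, max_iter=100):
    """Hypothetical helper: plain k-means (Lloyd's iteration) on 1-D numbers."""
    rng = random.Random(seed)
    # Start from k distinct existing points, chosen at random.
    centers = rng.sample(sorted(set(points)), k)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # nothing moved, so the result is stable
            break
        centers = new_centers
    return centers, clusters

data = [17, 22, 20, 45, 47, 51, 82, 84, 83]
centers, clusters = kmeans_1d(data, k=3)
for center, members in zip(centers, clusters):
    print(round(center, 2), sorted(members))
```

Because the starting points are random, a different seed can settle on a different (sometimes worse) grouping, which is why real implementations usually try several starts.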

Your data is 1-D. Something simpler like a bar chart makes sense. But that may not show underlying patterns.

I am more familiar with doing graphics in R but you can see a tabular view of your data:

data
  1   2   3   5   6   7   8  10  11  12  14  15  16  17  19  20  21  23  24  25  26  29  35  43 
124 116  97  95  89  74  57  73  48  49  38  35  20  33  21  19  14   5   4   4   3   1   1   1

There are clear gaps and a bar chart (which I cannot attach but could send in private email) does show clusters visibly.

But those may largely be an artifact of the missing info.
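If you want to list the gaps rather than eyeball them, a short sketch like this works. The values are simply the distinct scores copied from the table above; nothing else is assumed.

```python
# Distinct score values, copied from the frequency table above.
values = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17,
          19, 20, 21, 23, 24, 25, 26, 29, 35, 43]

present = set(values)
# Every integer between the smallest and largest score that never occurs.
missing = [v for v in range(min(values), max(values) + 1) if v not in present]
print(missing)
```

Below 27 the scores that never occur are 4, 9, 13, 18 and 22, a fairly regular spacing.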

If you tell us more, we might be able to provide a better statistical answer. I assume you know how to get means and so on.

   vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 1021 7.82 6.01      6    7.12 5.93   1  43    42 1.04     1.23 0.19
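For the basic figures you do not need anything beyond the standard library. This sketch rebuilds the 1021 scores from the frequency table above (so the order differs from your original list, which does not matter here) and should agree with the n, mean, median and sd shown.

```python
import statistics

# Values and their counts, copied from the frequency table above.
values = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17,
          19, 20, 21, 23, 24, 25, 26, 29, 35, 43]
counts = [124, 116, 97, 95, 89, 74, 57, 73, 48, 49, 38, 35, 20, 33,
          21, 19, 14, 5, 4, 4, 3, 1, 1, 1]

# Expand the table back into one score per observation.
scores = [v for v, c in zip(values, counts) for _ in range(c)]

n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.stdev(scores)  # sample standard deviation

print(n, round(mean, 2), median, round(sd, 2))
```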

Yes, the above is hard to read as I cannot use tables or a constant width font in this forum.

I ran a kmeans asking for 3 clusters:

1 16.512097
2  1.919881
3  7.433486

The three clusters had these scores in them:

Cluster 1: 12 14 15 16 17 19 20 21 23 24 25 26 29 35 43
Cluster 2:  1 2 3
Cluster 3: 5  6  7  8 10 11
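To double-check which center belongs to which group of scores, you can compute each group's weighted mean from the frequency table. This is only a sanity check using the numbers above; the group labels and the helper name weighted_mean are made up for the example.

```python
# Values and their counts, copied from the frequency table above.
values = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17,
          19, 20, 21, 23, 24, 25, 26, 29, 35, 43]
counts = [124, 116, 97, 95, 89, 74, 57, 73, 48, 49, 38, 35, 20, 33,
          21, 19, 14, 5, 4, 4, 3, 1, 1, 1]
freq = dict(zip(values, counts))

def weighted_mean(members):
    """Mean of the scores in `members`, each weighted by how often it occurs."""
    total = sum(freq[v] for v in members)
    return sum(v * freq[v] for v in members) / total

groups = {
    "1-3": [1, 2, 3],
    "5-11": [5, 6, 7, 8, 10, 11],
    "12-43": [v for v in values if v >= 12],
}
group_means = {name: weighted_mean(members) for name, members in groups.items()}
for name, m in group_means.items():
    print(name, round(m, 6))
```

The low group averages about 1.92, the middle group about 7.43 and the high group about 16.51, which are the three centers reported.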

If I run it asking for say 5 clusters:

Centers:

1  6.295238
2 11.432692
3  1.483333
4  3.000000
5 18.478261

And here are your five clusters:

5 6 7 8
10 11 12 14
1 2
3
15 16 17 19 20 21 23 24 25 26 29 35 43

If you ran this for various numbers, you might see one that makes more sense to you.  Or, maybe not.

We could tell you what functions to use, but if you search using keywords like python (or another language) followed by k-means or kmeans you can find out what to install and use. In Python, you would need NumPy and probably SciPy, as well as scikit-learn (the sklearn package), which provides the KMeans class in sklearn.cluster. Note you can fine-tune the algorithm in multiple ways or run it several times, as the results can depend on the initial guesses. And you may want to make graphics showing the clusters, even though the data is 1-D.
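As a starting point, here is a rough sketch with scikit-learn, assuming numpy and scikit-learn are installed. It rebuilds the scores from the frequency table above and tries several cluster counts. The settings n_init=10 and random_state=0 are just choices for the example, and the centers it finds need not match the ones I got elsewhere, since results depend on the starting guesses.

```python
import numpy as np
from sklearn.cluster import KMeans

# Values and their counts, copied from the frequency table above.
values = [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17,
          19, 20, 21, 23, 24, 25, 26, 29, 35, 43]
counts = [124, 116, 97, 95, 89, 74, 57, 73, 48, 49, 38, 35, 20, 33,
          21, 19, 14, 5, 4, 4, 3, 1, 1, 1]

# KMeans expects a 2-D array, so 1-D data becomes a single column.
scores = np.repeat(values, counts).reshape(-1, 1)

results = {}
for k in (2, 3, 4, 5):
    # n_init=10 reruns the algorithm from 10 different starts and keeps the best.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores)
    centers = sorted(float(c) for c in model.cluster_centers_.ravel())
    results[k] = (centers, float(model.inertia_))
    print(k, [round(c, 2) for c in centers], round(model.inertia_, 1))
```

The last number printed is the inertia (the sum of squared distances to the nearest center). It shrinks as k grows, so look for the point where adding clusters stops helping much rather than for the smallest value.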

Good luck.

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On Behalf Of Marc Lucke
Sent: Saturday, December 15, 2018 7:55 PM
To: python-list at python.org
Subject: clusters of numbers

hey guys,

I have a hobby project that sorts my email automatically for me & I want to improve it.  There's data science and statistical info that I'm missing, & I always enjoy reading about the pythonic way to do things too.

I have a list of percentage scores:

(1,11,1,7,5,7,2,2,2,10,10,1,2,2,1,7,2,1,7,5,3,8,2,6,3,2,7,2,12,3,1,2,19,3,5,1,1,7,8,8,1,5,6,7,3,14,6,1,6,7,6,15,6,3,7,2,6,23,2,7,1,21,21,8,8,3,2,20,1,3,12,3,1,2,10,16,16,15,6,5,3,2,2,11,1,14,6,3,7,1,5,3,3,14,3,7,3,5,8,3,6,17,1,1,7,3,1,2,6,1,7,7,12,6,6,2,1,6,3,6,2,1,5,1,8,10,2,6,1,7,3,5,7,7,5,7,2,5,1,19,19,1,12,5,10,2,19,1,3,19,6,1,5,11,2,1,2,5,2,5,8,2,2,2,5,3,1,21,2,3,7,10,1,8,1,3,17,17,1,5,3,10,14,1,2,14,14,1,15,6,3,2,17,17,1,1,1,2,2,3,3,2,2,7,7,2,1,2,8,2,20,3,2,3,12,7,6,5,12,2,3,11,3,1,1,8,16,10,1,6,6,6,11,1,6,5,2,5,11,1,2,10,6,14,6,3,3,5,2,6,17,15,1,2,2,17,5,3,3,5,8,1,6,3,14,3,2,1,7,2,8,11,5,14,3,19,1,3,7,3,3,8,8,6,1,3,1,14,14,10,3,2,1,12,2,3,1,2,2,6,6,7,10,10,12,24,1,21,21,5,11,12,12,2,1,19,8,6,2,1,1,19,10,6,2,15,15,7,10,14,12,14,5,11,7,12,2,1,14,10,7,10,3,17,25,10,5,5,3,12,5,2,14,5,8,1,11,5,29,2,7,20,12,14,1,10,6,17,16,6,7,11,12,3,1,23,11,10,11,5,10,6,2,17,15,20,5,10,1,17,3,7,15,5,11,6,19,14,15,7,1,2,17,8,15,10,26,6,1,2,10,6,14,12,6,1,16,6,12,10,10,14,1,6,1,6,6,12,6,6,1,2,5,10,8,10,1,6,8,17,11,6,3,6,5,1,2,1,2,6,6,12,14,7,1,7,1,8,2,3,14,11,6,3,11,3,1,6,17,12,8,2,10,3,12,12,2,7,5,5,17,2,5,10,12,21,15,6,10,10,7,15,11,2,7,10,3,1,2,7,10,15,1,1,6,5,5,3,17,19,7,1,15,2,8,7,1,6,2,1,15,19,7,15,1,8,3,3,20,8,1,11,7,8,7,1,12,11,1,10,17,2,23,3,7,20,20,3,11,5,1,1,8,1,6,2,11,1,5,1,10,7,20,17,8,1,2,10,6,2,1,23,11,11,7,2,21,5,5,8,1,1,10,12,15,2,1,10,5,2,2,5,1,2,11,10,1,8,10,12,2,12,2,8,6,19,15,8,2,16,7,5,14,2,1,3,3,10,16,20,5,8,14,8,3,14,2,1,5,16,16,2,10,8,17,17,10,10,11,3,5,1,17,17,3,17,5,6,7,7,12,19,15,20,11,10,2,6,6,5,5,1,16,16,8,7,2,1,3,5,20,20,6,7,5,23,14,3,10,2,2,7,10,10,3,5,5,8,14,11,14,14,11,19,5,5,2,12,25,5,2,11,8,10,5,11,10,12,10,2,15,15,15,5,10,1,12,14,8,5,6,2,26,15,21,15,12,2,8,11,5,5,16,5,2,17,3,2,2,3,15,3,8,10,7,10,3,1,14,14,8,8,8,19,10,12,3,8,2,20,16,10,6,15,6,1,12,12,15,15,8,11,17,7,7,7,3,10,1,5,19,11,7,12,8,12,7,5,10,1,11,1,6,21,1,1,10,3,8,5,6,5,20,25,17,5,2,16,14,11,1,17,10,14,5,16,5,2,7,3,8,17,7,19,12,6,5,1,3,12,43,11,8,11,5,19,10,5,11,7,20,6,12,35,5,3,17,10,2
,12,6,5,21,24,15,5,10,3,15,1,12,6,3,17,3,2,3,5,5,14,11,8,1,8,10,5,25,8,7,2,6,3,11,1,11,7,3,10,7,12,10,8,6,1,1,17,3,1,1,2,19,6,10,2,2,7,5,16,3,2,11,10,7,10,21,3,5,2,21,3,14,6,7,2,24,3,17,3,21,8,5,11,17,5,6,10,5,20,1,12,2,3,20,6,11,12,14,6,6,1,14,15,12,15,6,20,7,7,19,3,7,5,16,12,6,7,2,10,3,2,11,8,6,6,5,1,11,1,15,21,14,6,3,2,2,5,6,1,3,5,3,6,20,1,15,12,2,3,3,7,1,16,5,24,10,7,1,12,16,8,26,16,15,10,19,11,6,6,5,6,5)

  & I'd like to know whether, & how, the numbers are clustered.  In an extreme & illustrative example, 1..10 would have zero clusters;
1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1, 2 & 7);
17,22,20,45,47,51,82,84,83 would have 3 clusters (around 20, 47 & 83).  In my set, when I scan it, I intuitively figure there are lots of numbers close to 0 & a lot close to 20 (or thereabouts).

I saw info about k-clusters but I'm not sure if I'm going down the right path.  I'm interested in k-clusters & will teach myself, but my priority is working out this problem.

Do you know the name of the algorithm I'm trying to use?  If so, are there python libraries like numpy that I can leverage?  I imagine that I could iterate from 0 to 100% using each value as an artificial mean, discard scores that are more than a fixed deviation away from it, and count the scores that remain; then at the end I could set a threshold on that count to decide which artificial means are kept.  Something like this (no attempt at correct syntax):

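# scores is the list of percentage scores given above
means = {}
deviation = 5
threshold = int(0.25 * len(scores))
for i in range(101):  # candidate means 0..100 inclusive
    # count the scores that lie within `deviation` of the candidate mean
    count = sum(1 for j in scores if abs(j - i) <= deviation)
    if count > threshold:
        means[i] = count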

That algorithm is entirely untested, but I think it could work; I just don't want to reinvent the wheel.  Any ideas would be kindly appreciated.
