how to sort a list of tuples with custom function

Fri Aug 4 10:39:06 EDT 2017

On Friday, August 4, 2017 at 10:08:56 PM UTC+8, Ho Yeung Lee wrote:
> i had changed to use kmeans
> 
> https://gist.github.com/hoyeunglee/2475391ad554e3d2b2a40ec24ab47940
> 
> i do not know whether write it correctly
> but it seems can cluster to find words in window, but not perfect
> 
> 
> On Wednesday, August 2, 2017 at 3:06:40 PM UTC+8, Peter Otten wrote:
> > Glenn Linderman wrote:
> > 
> > > On 8/1/2017 2:10 PM, Piet van Oostrum wrote:
> > >> Ho Yeung Lee <jobmattcon at gmail.com> writes:
> > >>
> > >>> def isneighborlocation(lo1, lo2):
> > >>>      if abs(lo1[0] - lo2[0]) < 7  and abs(lo1[1] - lo2[1]) < 7:
> > >>>          return 1
> > >>>      elif abs(lo1[0] - lo2[0]) == 1  and lo1[1] == lo2[1]:
> > >>>          return 1
> > >>>      elif abs(lo1[1] - lo2[1]) == 1  and lo1[0] == lo2[0]:
> > >>>          return 1
> > >>>      else:
> > >>>          return 0
> > >>>
> > >>>
> > >>> sorted(testing1, key=lambda x: (isneighborlocation.get(x[0]), x[1]))
> > >>>
> > >>> return something like
> > >>> [(1,2),(3,3),(2,5)]
> > 
> > >> I think you are trying to sort a list of two-dimensional points into a
> > >> one-dimensiqonal list in such a way thet points that are close together
> > >> in the two-dimensional sense will also be close together in the
> > >> one-dimensional list. But that is impossible.
> > 
> > > It's not impossible, it just requires an appropriate distance function
> > > used in the sort.
> > 
> > That's a grossly misleading addition. 
> > 
> > Once you have an appropriate clustering algorithm
> > 
> > clusters = split_into_clusters(items) # needs access to all items
> > 
> > you can devise a key function
> > 
> > def get_cluster(item, clusters=split_into_clusters(items)):
> >     return next(
> >         index for index, cluster in enumerate(clusters) if item in cluster
> >     )
> > 
> > such that
> > 
> > grouped_items = sorted(items, key=get_cluster)
> > 
> > but that's a roundabout way to write
> > 
> > grouped_items = sum(split_into_clusters(items), [])
> > 
> > In other words: sorting is useless, what you really need is a suitable 
> > approach to split the data into groups. 
> > 
> > One well-known algorithm is k-means clustering:
> > 
> > https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html
> > 
> > Here is an example with pictures:
> > 
> > https://dzone.com/articles/k-means-clustering-scipy


i use number of clusters = 120

https://gist.github.com/hoyeunglee/2475391ad554e3d2b2a40ec24ab47940
https://drive.google.com/file/d/0Bxs_ao6uuBDUZFByNVgzd0Jrdm8/view?usp=sharing

using my previous is not suitable for english words
but using kmeans is better, however not perfect to cluster words, 
some words are missing