help in algorithm

Thu Aug 11 16:23:13 EDT 2005

Bengt Richter wrote:
> On Wed, 10 Aug 2005 16:51:55 +0200, Paolino <paolo_veronelli at tiscali.it> wrote:
> 
> 
>>I have a self-organizing net whose aim is clustering words.
>>Let's say the clustering is based on their sets of 2-grams.
>>Words are then instances of this class.
>>
>>class clusterable(str):
>>  def __abs__(self):# the set of q-grams (to be calculated only once)
>>    return set([(self+self[0])[n:n+2] for n in range(len(self))])
>>  def __sub__(self,other): # the q-grams distance between 2 words
>>    set1=abs(self)
>>    set2=abs(other)
>>    return len(set1|set2)-len(set1&set2)
>>
>>I'm looking for the medium of a set of words, i.e. the word which
>>minimizes the sum of the distances from those words.
>>
>>That is, the word minimizing: sum([medium-word for word in words])
>>
>>
>>Thanks for ideas, Paolino
>>
> 
> Just wondering if this is a desired result:
> 
>  >>> clusterable('banana')-clusterable('bananana')
>  0

Yes, the clustering is the main filter; it should (I hope) cut the
space of words down by one or two orders of magnitude.
The final choices must be made with the expensive Levenshtein distance, or
another edit-type distance.
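
As a rough illustration of this two-stage idea, here is a minimal sketch:
the cheap 2-gram distance builds a shortlist, and only the shortlist is
ranked with an edit distance. The function names and the
max_qgram_distance threshold are made up for the example (they are not
part of the class above), and the Levenshtein routine is the plain
textbook dynamic-programming version.

```python
def qgrams(word):
    """Circular 2-gram set, same construction as clusterable.__abs__."""
    if not word:  # assumption: an empty word has no 2-grams
        return set()
    padded = word + word[0]
    return {padded[n:n + 2] for n in range(len(word))}


def qgram_distance(a, b):
    """Size of the symmetric difference of the two 2-gram sets."""
    s1, s2 = qgrams(a), qgrams(b)
    return len(s1 | s2) - len(s1 & s2)


def levenshtein(a, b):
    """Textbook edit distance, keeping only one row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def closest(query, words, max_qgram_distance=4):
    """Cheap 2-gram filter first, expensive edit distance on survivors."""
    shortlist = [w for w in words
                 if qgram_distance(query, w) <= max_qgram_distance]
    return sorted(shortlist, key=lambda w: levenshtein(query, w))
```

With this sketch, 'banana' and 'bananana' still have 2-gram distance 0,
and it is the second stage that tells them apart.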

For now I'm using an empirical solution, where I assume the best set has
length L equal to the average of the lengths. Then I take the first L
2-grams from the frequency distribution of 2-grams.
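
A sketch of that heuristic, under assumptions that may not match the
real code: L is taken as the rounded mean of the word lengths, each
2-gram is counted once per word that contains it, and the word list is
not empty. The names circular_2grams, empirical_center and
total_distance are only illustrative.

```python
from collections import Counter


def circular_2grams(word):
    """Circular 2-gram set, same construction as clusterable.__abs__."""
    padded = word + word[0]
    return {padded[n:n + 2] for n in range(len(word))}


def empirical_center(words):
    """Return the L most frequent 2-grams, L = rounded mean word length."""
    target_len = round(sum(len(w) for w in words) / len(words))
    counts = Counter()
    for w in words:
        counts.update(circular_2grams(w))  # one count per word per 2-gram
    return {gram for gram, _ in counts.most_common(target_len)}


def total_distance(center, words):
    """Sum of 2-gram set distances from a candidate set to every word."""
    return sum(len(center | circular_2grams(w)) - len(center & circular_2grams(w))
               for w in words)
```

total_distance makes it possible to compare the chosen set with other
candidate sets on the same words. Ties among equally frequent 2-grams
are broken arbitrarily here.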

I have no clue whether this is the right set, and I'm sure that set is
not a word, as there is no chance of chaining those 2-grams to form a word.
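
Whether a given set of 2-grams can be chained into a word at all can be
tested, at least for the circular 2-grams used above. Treat each 2-gram
as a directed edge from its first character to its second: a word whose
circular 2-gram set is exactly the given set corresponds to a closed
walk using every edge at least once, and such a walk exists only when
the graph is strongly connected. The sketch below checks just that; the
function name is made up, and the word it implies may be longer than L,
since 2-grams can repeat.

```python
def can_form_circular_word(grams):
    """True if some word has exactly `grams` as its circular 2-gram set."""
    if not grams:
        return False
    forward, backward = {}, {}
    for a, b in grams:  # each 2-gram "ab" is an edge a -> b
        forward.setdefault(a, set()).add(b)
        backward.setdefault(b, set()).add(a)
    nodes = set(forward) | set(backward)

    def reachable(start, edges):
        """All characters reachable from start following the given edges."""
        seen, stack = {start}, [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Strongly connected: one node reaches all others and is reached by all.
    start = next(iter(nodes))
    return (reachable(start, forward) == nodes
            and reachable(start, backward) == nodes)
```

For example, the 2-gram set of 'banana' passes the check, while a set
such as {'ba', 'nd'} does not.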

Thanks for comments

Paolino