need some kind of "coherence index" for a group of strings

jladasky at itu.edu
Thu Nov 3 19:14:53 EDT 2016


On Thursday, November 3, 2016 at 3:47:41 PM UTC-7, jlad... at itu.edu wrote:
> On Thursday, November 3, 2016 at 1:09:48 PM UTC-7, Neil D. Cerutti wrote:
> > you may also be 
> > able to use some items "off the shelf" from Python's difflib.
> 
> I wasn't aware of that module, thanks for the tip!
> 
> difflib.SequenceMatcher.ratio() returns a numerical value which represents the "similarity" between two strings.  I don't see a precise definition of "similar", but it may do what the OP needs.

Following up to myself... I just experimented with difflib.SequenceMatcher.ratio() and discovered something.  The result is not symmetric ("commutative," loosely speaking).  That is, it doesn't ALWAYS produce the same ratio when the two strings are swapped.

Here's an excerpt from my interpreter session.

==========

In [1]: from difflib import SequenceMatcher

In [2]: import numpy as np

In [3]: sim = np.zeros((4,4))


== snip ==


In [10]: strings
Out[10]: 
('Here is a string.',
 'Here is a slightly different string.',
 'This string should be significantly different from the other two?',
 "Let's look at all these string similarity values in a matrix.")

In [11]: for r, s1 in enumerate(strings):
   ....:     for c, s2 in enumerate(strings):
   ....:         m = SequenceMatcher(lambda x:x=="", s1, s2)
   ....:         sim[r,c] = m.ratio()
   ....:

In [12]: sim
Out[12]: 
array([[ 1.        ,  0.64150943,  0.2195122 ,  0.30769231],
       [ 0.64150943,  1.        ,  0.47524752,  0.30927835],
       [ 0.2195122 ,  0.45544554,  1.        ,  0.28571429],
       [ 0.30769231,  0.28865979,  0.33333333,  1.        ]])

==========

The values along the matrix diagonal, of course, are all ones, because each string was compared to itself.

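I also expected the values reflected across the matrix diagonal to match.  The first row does in fact match the first column.  The remaining pairs disagree somewhat: 0.47524752 vs. 0.45544554 for strings 2 and 3, 0.30927835 vs. 0.28865979 for strings 2 and 4, and 0.28571429 vs. 0.33333333 for strings 3 and 4.  The differences are not large, but they are there.  I don't know the reason why.  Caveat programmer.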
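
For what it's worth, the difflib documentation does caution that the result of ratio() may depend on the order of the arguments, and it defines the ratio as 2.0*M/T, where M is the number of matching elements found and T is the total length of both sequences. The sketch below reproduces the order dependence on a short pair of strings and shows one possible workaround: averaging the two orderings. The helper names ratio and symmetric_ratio are made up for illustration and are not part of difflib. None is passed as the junk function, which should behave like the lambda in the session above, since no single character ever equals the empty string.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of a and b as reported by SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a, b).ratio()


def symmetric_ratio(a, b):
    """Hypothetical helper (not in difflib): average of both argument orders."""
    return (ratio(a, b) + ratio(b, a)) / 2.0


# Swapping the arguments changes the answer for this pair.
print(ratio("tide", "diet"))  # 0.25
print(ratio("diet", "tide"))  # 0.5

# The averaged value is the same whichever string comes first.
print(symmetric_ratio("tide", "diet"))  # 0.375
print(symmetric_ratio("diet", "tide"))  # 0.375
```

Averaging is only one choice; taking the max or min of the two orderings would also give a symmetric matrix, and which is appropriate depends on what the similarity values are used for.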


