Ensuring symmetry in difflib.SequenceMatcher

Peter Otten __peter__ at web.de
Wed Nov 24 04:43:41 EST 2010


John Yeung wrote:

> I'm generally pleased with difflib.SequenceMatcher:  It's probably not
> the best available string matcher out there, but it's in the standard
> library and I've seen worse in the wild.  One thing that kind of
> bothers me is that it's sensitive to which argument you pick as "seq1"
> and which you pick as "seq2":
> 
> Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import difflib
> >>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
> 0.44444444444444442
> >>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
> 0.66666666666666663
> >>>
> 
> Is this a bug?  I am guessing the algorithm is implemented correctly,
> and that it's just an inherent property of the algorithm used.  It's
> certainly not what I'd call a desirable property.  Are there any
> simple adjustments that can be made without sacrificing (too much)
> performance?

import difflib

def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    return (S(None, a, b).ratio() + S(None, b, a).ratio()) / 2.0

I'm expecting a 50% performance loss, since it runs the matcher twice ;)
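
As a rough check, the averaged ratio can be tried on the two strings from the question. The sketch below also includes ordered_ratio, a hypothetical alternative that is not part of difflib: it puts the arguments into a fixed order before a single comparison, so it avoids the second matcher run. It assumes both sequences can be compared with <, and it returns one of the two asymmetric values rather than their average.

```python
import difflib


def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    # Average the ratio over both argument orders.
    return (S(None, a, b).ratio() + S(None, b, a).ratio()) / 2.0


def ordered_ratio(a, b, S=difflib.SequenceMatcher):
    # Hypothetical helper: always pass the smaller sequence first,
    # so swapping the arguments cannot change the result.
    if b < a:
        a, b = b, a
    return S(None, a, b).ratio()


forward = symmetric_ratio('BYRD', 'BRADY')
backward = symmetric_ratio('BRADY', 'BYRD')
print(forward, backward)  # the same value for both orders

print(ordered_ratio('BYRD', 'BRADY'), ordered_ratio('BRADY', 'BYRD'))
```

Which of the two is preferable depends on whether the averaged score or the single-run cost matters more for the data at hand.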

Seriously, have you tried calculating the ratio with realistic data?
Without looking into the source, I would expect the two ratios to be
closer to each other there.
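
One way to check that expectation is to measure the gap between the two argument orders over your own pairs. This is only a sketch: ratio_gap is a made-up helper name, and the single pair in the list is the example from the question, to be replaced with real data.

```python
import difflib


def ratio_gap(a, b, S=difflib.SequenceMatcher):
    # Absolute difference between the ratios for the two argument orders.
    return abs(S(None, a, b).ratio() - S(None, b, a).ratio())


# Placeholder input: substitute pairs taken from the real data set.
pairs = [('BYRD', 'BRADY')]

for a, b in pairs:
    print(a, b, ratio_gap(a, b))
```

If the gaps stay small on real data, the asymmetry may not be worth paying for a second matcher run.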

Peter


