How fuzzy is get_close_matches() in difflib?

John Henry john106henry at hotmail.com
Fri Nov 17 01:59:41 EST 2006


I ran into a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"].

Well, they both hit "HIDEDST1" as the first match, which is not exactly
the result I was looking for.  I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
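
For reference, the scores behind this result can be inspected directly.
This is only a sketch: it assumes get_close_matches() ranks candidates by
SequenceMatcher.ratio(), as the difflib documentation describes, and the
helper name score() is made up for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

candidates = ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

def score(word, candidate):
    # Hypothetical helper. The candidate is passed as the first sequence
    # and the search word as the second, which is the order current
    # CPython uses inside get_close_matches().
    return SequenceMatcher(None, candidate, word).ratio()

for word in ("HIDESST1", "HIDESCT1"):
    # Ranked matches as returned by difflib (default n=3, cutoff=0.6).
    print(word, get_close_matches(word, candidates))
    # Raw similarity score of every candidate, in list order.
    for candidate in candidates:
        print("   ", candidate, score(word, candidate))
```

For "HIDESCT1", both "HIDEDST1" and "HIDEDCT1" score 0.875 (7 of the 8
characters are matched in each case), so the two candidates are tied. Their
relative order is then a matter of tie-breaking rather than similarity. In
current CPython, equal scores appear to be ordered by comparing the strings
themselves, which would put "HIDEDST1" first.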

Steven D'Aprano wrote:
> On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
>
> > I did try them and I am impressed.  It helped me find a lot of useful
> > info.   I just want to get a feel as to what constitutes a "match".
>
> The source code has lots of comments, but they don't explain the basic
> algorithm (at least not in the difflib.py supplied with Python 2.3).
>
> There is no single diff algorithm, but I believe that the basic idea is to
> look for insertions and/or deletions of strings. If you want more
> detail, google "diff". Once you have a list of differences, the closest
> match is the candidate string with the fewest differences from the
> search string.
>
> As for getting a feel of what constitutes a match, I really can't make any
> better suggestion than just try lots of examples with the interactive
> Python shell.
> 
> 
> 
> -- 
> Steven D'Aprano
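
As a rough illustration of what "fewest differences" means in difflib: the
documentation defines SequenceMatcher.ratio() as 2.0*M/T, where M is the
number of matched characters and T is the combined length of both strings.
get_close_matches() keeps candidates whose ratio is at least cutoff
(default 0.6) and returns the best n (default 3). The sketch below recomputes
the ratio from get_matching_blocks(); the helper name manual_ratio() is
hypothetical and not part of difflib.

```python
from difflib import SequenceMatcher, get_close_matches

def manual_ratio(a, b):
    # Hypothetical helper: rebuild ratio() from the matching blocks.
    matcher = SequenceMatcher(None, a, b)
    # Each block is (start_in_a, start_in_b, size); the final block is a
    # zero-length sentinel, so it adds nothing to the sum.
    matched = sum(size for _, _, size in matcher.get_matching_blocks())
    return 2.0 * matched / (len(a) + len(b))

names = ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

# Matched characters are "HIDE" + "CT1" = 7, so the ratio is 2*7/16.
print(manual_ratio("HIDEDCT1", "HIDESCT1"))

# A stricter cutoff drops the weaker candidates from the result.
print(get_close_matches("HIDESCT1", names, n=4, cutoff=0.8))
```

Raising cutoff or printing the raw ratios is one way to see how much
separates the candidates, since the returned list alone does not show
whether two matches scored the same.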



