[issue2986] difflib.SequenceMatcher not matching long sequences

Wed Sep 1 23:33:00 CEST 2010

Terry J. Reedy <tjreedy at udel.edu> added the comment:

While refactoring the code for 2.7, I discovered that the description of the heuristic for 2.6 and in the code comments is off by 1. "items that appear more than 1% of the time" should actually be "items whose duplicates (after the first) appear more than 1% of the time". The discrepancy arises because in the following code

        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

len(indices) is retrieved *before* the index i of the current elt is added. Whatever one might think the heuristic 'should' have been (and by the nature of heuristics, there is no right answer), the default behavior must remain as it is, so we adjusted the code and doc to match that.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2986>
_______________________________________