[issue2986] difflib.SequenceMatcher not matching long sequences

Mike Rotondo report at bugs.python.org
Mon Mar 30 02:40:17 CEST 2009


Mike Rotondo <mrotondo at gmail.com> added the comment:

>From the source, it seems that there is undocumented behavior to
SequenceMatcher which is causing this error. If b is longer than 200
characters, it will consider any element x in b that takes up more than
1% of it's contents as "popular", and thus junk. 

So, in this case, difflib is treating each individual digit as an
element of your sequences, and each one takes up more than 1% of the
complete sequence b. Therefore, each one is "popular", and therefore
ignored.

A snippet which demonstrates this:

from difflib import SequenceMatcher
for i in range(1, 202)[::10]:
  a = "a" * i
  b = "b" + "a" * i
  s = SequenceMatcher(None, a, b)
  print s.find_longest_match(0, len(a), 0, len(b))

Up til i=200, the strings match, but afterwards they do not because "a"
is "popular". 

Strangely, if you get rid of the "b" at the beginning of b, they
continue to match at lengths greater than 200. This may be a bug, I'll
keep looking into it but someone who knows more should probably take a
look too.

The comments from difflib.py say some interesting things:
 # b2j also does not contain entries for "popular" elements, meaning 
 # elements that account for more than 1% of the total elements, and
 # when the sequence is reasonably large (>= 200 elements); this can
 # be viewed as an adaptive notion of semi-junk, and yields an enormous
 # speedup when, e.g., comparing program files with hundreds of
 # instances of "return NULL;"

This seems to mean that you won't actually get an accurate diff in
certain cases, which seems odd. At the very least, this behavior should
probably be documented. Do people think it should be changed to get rid
of the "popularity" heuristic?

----------
nosy: +mrotondo
versions: +Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2986>
_______________________________________


More information about the Python-bugs-list mailing list