[ python-Bugs-1528074 ] difflib.SequenceMatcher.find_longest_match() wrong result

SourceForge.net noreply at sourceforge.net
Mon Mar 19 02:51:45 CET 2007


Bugs item #1528074, was opened at 2006-07-24 19:59
Message generated for change (Comment added) made by jimjjewett
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1528074&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Machin (sjmachin)
Assigned to: Tim Peters (tim_one)
Summary: difflib.SequenceMatcher.find_longest_match() wrong result

Initial Comment:
A short example script is attached.
Two strings, each 500 bytes long. Longest match is the
first 32 bytes of each string. Result produced is a
10-byte match later in the string.

If one byte is chopped off the end of each string (len
now 499 each), the correct result is produced.

Other observations, none of which may be relevant:
1. The problem may be in the heuristic for "popular"
elements in the __chain_b method. In this particular
example, the character '0' (which has a frequency of
6 in the "b" string) is not "popular" with a len of
500 but is "popular" with a len of 499.
2. '0' is the last byte of the correct longest match.
3. The correct longest match is at the start of
each of the strings.
4. Disabling the "popular" heuristic (like below)
appears to make the problem go away:

                if 0:
                # if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
5. The callable self.isbpopular is created but appears
to be unused.
6. The determination of "popular" doesn't accord with
the comments, which say 1%. However, with len==500, it
takes 6 occurrences for an element to be "popular".
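
A minimal sketch of the arithmetic behind points 1 and 6,
replaying the condition quoted in point 4 for a single element
(the occurrence counts are illustrative, not taken from the
attached script):

for n in (499, 500):                 # n is len(b)
    for prior in (4, 5, 6):          # occurrences already in indices
        popular = n >= 200 and prior * 100 > n
        print "n=%d prior=%d -> popular: %s" % (n, prior, popular)

# With prior == 5, 500 > 499 is true but 500 > 500 is false, so the
# same element can be "popular" at len 499 yet not at len 500, which
# is consistent with point 1.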

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2007-03-18 21:51

Message:
Logged In: YES 
user_id=764593
Originator: NO

I think the bug is documentation-only.

According to the comments (but not the docstring) find_longest_match
returns the longest _interesting_ match.  A single element appearing too
often is likely to cause spurious matches, and is therefore not
interesting.

I do agree that this should be noted more prominently, so people don't try
things like comparing text strings letter-by-letter (where 1% is far too
low a threshold for a 26-character alphabet).
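
For instance, here is a hypothetical character-level comparison
(the strings are invented for illustration): in a few hundred
characters of English text, every common letter exceeds the 1%
threshold, so the matcher quietly discards all of them:

import difflib

# Two ~500-character texts compared character by character. Common
# letters occur far more often than 1% of the time, so once n >= 200
# SequenceMatcher treats them as "popular" and the reported longest
# match can be much shorter than the true longest common substring.
a = "the quick brown fox jumps over the lazy dog. " * 11
b = "pack my box with five dozen liquor jugs. " * 12
m = difflib.SequenceMatcher(None, a, b)
print m.find_longest_match(0, len(a), 0, len(b))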

And yes, the comments on popular are correct -- it ignores elements which
constitute *more* than 1%.  

I recommend closing this set of tracker items. If you could submit changes
to the documentation (docstrings and/or help files; maybe even the
comments), I would recommend applying them.

----------------------------------------------------------------------

Comment By: Denys Rtveliashvili (rtvd)
Date: 2007-03-14 16:11

Message:
Logged In: YES 
user_id=1416496
Originator: NO

By the way, I think the implementation would be better replaced
completely. The current one has O(n^2) computational complexity,
while one based on suffix trees built with Ukkonen's algorithm
would need only O(n).
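
For reference, here is a sketch of one linear-time approach, built
on a suffix automaton (a close relative of the suffix tree) rather
than Ukkonen's construction; it is illustrative only and is not the
fix submitted as #1678345:

def build_suffix_automaton(s):
    # Each state is a dict with 'len' (length of the longest string
    # in its class), 'link' (suffix link) and 'next' (transitions).
    sa = [{'len': 0, 'link': -1, 'next': {}}]
    last = 0
    for ch in s:
        cur = len(sa)
        sa.append({'len': sa[last]['len'] + 1, 'link': -1, 'next': {}})
        p = last
        while p != -1 and ch not in sa[p]['next']:
            sa[p]['next'][ch] = cur
            p = sa[p]['link']
        if p == -1:
            sa[cur]['link'] = 0
        else:
            q = sa[p]['next'][ch]
            if sa[p]['len'] + 1 == sa[q]['len']:
                sa[cur]['link'] = q
            else:
                clone = len(sa)
                sa.append({'len': sa[p]['len'] + 1,
                           'link': sa[q]['link'],
                           'next': dict(sa[q]['next'])})
                while p != -1 and sa[p]['next'].get(ch) == q:
                    sa[p]['next'][ch] = clone
                    p = sa[p]['link']
                sa[q]['link'] = clone
                sa[cur]['link'] = clone
        last = cur
    return sa

def longest_common_substring(a, b):
    # Build the automaton of a in O(len(a)), then stream b through
    # it, tracking the longest window of b matched so far: O(len(b)).
    sa = build_suffix_automaton(a)
    v = length = best = best_end = 0
    for i, ch in enumerate(b):
        while v and ch not in sa[v]['next']:
            v = sa[v]['link']
            length = sa[v]['len']
        if ch in sa[v]['next']:
            v = sa[v]['next'][ch]
            length += 1
        else:
            v = length = 0
        if length > best:
            best, best_end = length, i + 1
    return b[best_end - best:best_end]

On the quick test below, longest_common_substring("a" + "b"*199,
"b"*199 + "c") returns "b"*199; no length-dependent heuristic is
involved.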

----------------------------------------------------------------------

Comment By: Denys Rtveliashvili (rtvd)
Date: 2007-03-11 11:29

Message:
Logged In: YES 
user_id=1416496
Originator: NO

I have submitted a test case for this bug to SourceForge; the ID is
#1678339.
I have also submitted a fix (ID #1678345), but the fix reduces
performance significantly.

----------------------------------------------------------------------

Comment By: Denys Rtveliashvili (rtvd)
Date: 2007-03-10 12:24

Message:
Logged In: YES 
user_id=1416496
Originator: NO

The quick test for this bug is:


import difflib

for i in xrange(190, 200):
    text1 = "a" + "b"*i
    text2 = "b"*i + "c"
    m = difflib.SequenceMatcher(None, text1, text2)
    (aptr, bptr, l) = m.find_longest_match(0, len(text1), 0, len(text2))
    print "i:", i, "  l:", l, "  aptr:", aptr, "  bptr:", bptr
    assert l == i


The assertion will fail when i==199 (the length of the texts will be
200).
And yes, the bug is clearly "populardict"-related.
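
To see what the answer should be without the heuristic, a
brute-force reference (a hypothetical helper, not part of difflib)
can be run alongside the test above:

def reference_longest_match(a, b):
    # O(len(a) * len(b)) dynamic programming over match lengths
    # ending at (i, j). Returns (i, j, size) like find_longest_match,
    # but applies no junk or "popular" filtering (tie-breaking rules
    # may differ).
    best_i = best_j = best_size = 0
    prev = [0] * (len(b) + 1)
    for i in xrange(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in xrange(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_size:
                    best_size = cur[j]
                    best_i, best_j = i - best_size, j - best_size
        prev = cur
    return best_i, best_j, best_size

# For i == 199: reference_longest_match("a" + "b"*199, "b"*199 + "c")
# returns (1, 0, 199), while SequenceMatcher reports a shorter match.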

----------------------------------------------------------------------
