A bug in difflib module? (find_longest_match)

Mon Aug 4 02:48:23 EDT 2008

This came up again and I was taking a look at it. There seems to still
be no resolution. I have a patch that can add a kwarg to skip this
behavior if you know you need otherwise. Right now its a simple
boolean flag, but is this enough?

Are there any use cases anyone has to define how this case is
determined? For example, does the user of SequenceMatcher ever need to
tell it what is considered a large sequence? Right now it is hardcoded
at 200 and an element is popular is more than 1% of the elements are
it. So, does anyone need more than boolean or do you also need to say
"a large sequence is 1000 or more in size and a popular element needs
to be 10% of the sequence", as well as just turning off the behavior
completely.

With that answered, I'll apply submit the patch to the tracker.

On Thu, May 1, 2008 at 6:52 AM, Gabriel Genellina
<gagsl-py2 at yahoo.com.ar> wrote:
> En Thu, 01 May 2008 06:21:22 -0300, n00m <n00m at narod.ru> escribió:
>
>>> > import difflib
>>> > s = difflib.SequenceMatcher(None, s1, s2)
>>> > print s.find_longest_match(0, len(s1), 0, len(s2))
>>> > (0, 0, 0)
>>> >
>>> > I think it's line #314 in difflib "who's to blame" --
>>>
>>> Me too. Could you think of some alternative? Simply disabling that
>>> "popularity check" would slow down the algorithm, according to the
>>> comments.
>>
>> No idea :)
>
> The "ignore popular elements" is only an optmization, and it should not be
> applied in your case because it forces the algorithm to yield an invalid
> result.
> I can think of two alternatives:
> - tune up the conditions when the optimization is used, or
> - make it user configurable
>
> SequenceMatcher is a public class, and it is also internally used by Differ
> and others to compare both sequences of lines *and* pairs of similar lines
> (considered as sequences of characters). In this last usage the "ignore
> popular elements" has no much sense, as shown in your example feeding
> directly two dissimilar strings.
> In principle one should disable the "populardict" stuff when dealing with
> strings. Below is a simple attempt to detect that case:
>
> (around line 311 in difflib.py)
>
>        b_is_string = isinstance(b, basestring) # add this line
>        for i, elt in enumerate(b):
>            if elt in b2j:
>                indices = b2j[elt]
>                if not b_is_string and n >= 200 and len(indices) * 100 > n: #
> change this line
>                    populardict[elt] = 1
>                    del indices[:]
>                else:
>                    indices.append(i)
>            else:
>                b2j[elt] = [i]
>
>
> --
> Gabriel Genellina
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>

-- 
Read my blog! I depend on your acceptance of my opinion! I am interesting!
http://ironfroggy-code.blogspot.com/