SequenceMatcher bug ?

Thu Dec 11 03:30:35 EST 2008

"Gabriel Genellina" <gagsl-py2 at yahoo.com.ar> wrote:

>En Wed, 10 Dec 2008 15:14:20 -0200, eliben <eliben at gmail.com> escribió:
>
>> What ? This can't be.
>>
>> 1. Go to http://try-python.mired.org/
>> 2. Type
>> import difflib
>> 3. Type
>> difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
>>
>> Don't you get 0 as the answer ?
>
>Ah, but that isn't the same expression you posted originally:
>
>SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio()
>
>Using *that* expression I got near 1.0 always. But leaving out the [5] at  
>the end, it's true, it gives the wrong answer.
>...
>I've updated the tracker item.

Your assessment that it is the same problem as #1528074 is correct.  It's
the "popularity" optimization.  The key here is that the second sequence
consists of more than 200 identical items.  For example, all of the
following give the same bad result:

difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
difflib.SequenceMatcher(None, [4] + [5]      , [5] * 200).ratio()
difflib.SequenceMatcher(None, [4]            , [5] * 200).ratio()

If you print get_matching_blocks(), you'll see that there are none, because
the "b" sequence is optimized completely away.  The #1528074 calls it
"working by designed" and suggests updating the doc.  However, I would
argue that it's worth checking for this.
-- 
Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.