[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

Tim Peters tim.peters at gmail.com
Tue Jul 13 23:20:24 CEST 2010


[Tim]
>> ...
>> BTW, it's not clear whether ratio() computes a _useful_ value in the
>> presence of junk, however that may be defined.

[Terry Reedy]
> I agree, which is one reason why one should be able to disable auto-junking.

Yup.

> There are a number of statistical methods for analyzing similarity matrices,
> analogous to correlation matrices, except that entries are in [0.0,1.0]
> rather than [-1.0,1.0]. For my Ph.D. thesis, I did such analyses for sets of
> species. Similarity measures should usually be symmetric and increase with
> greater matching. The heuristic can cause both to fail.

Two things about that.  First, ratio() is here _defined_ to be (when
comparing sequences a and b):

    2.0 * number_of_matches / (len(a) + len(b))

The denominator is not affected by any junk heuristics, so this ratio
does indeed increase directly with "greater matching".  But I don't
know exactly what you mean by "greater matching" - I get the
impression you think that's an inherent property of the sequences,
but, as before, I don't see any meaning for it independent of the
specific matching algorithm used.
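
For example, a quick sketch at the interpreter (no junk in play):

    >>> from difflib import SequenceMatcher
    >>> # longest matching block is "bcd", so 2.0 * 3 / (4 + 4)
    >>> SequenceMatcher(None, "abcd", "bcde").ratio()
    0.75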

The current implementation of quick_ratio() may be closer to what you
seem to be thinking.  That views both sequences as multisets, and
considers number_of_matches to be the cardinality of their
intersection.  That measure ignores the possibility of junk, and also
ignores the order in which elements appear - it's wholly independent
of the matching algorithm. But it returns 1.0 whenever the second
sequence is any permutation of the elements in the first, so it
throws away any notion of ordering.
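
A tiny sketch of the difference:

    >>> from difflib import SequenceMatcher
    >>> sm = SequenceMatcher(None, "abcd", "dcba")
    >>> sm.quick_ratio()  # multiset view:  every element pairs up
    1.0
    >>> sm.ratio()  # only one length-1 block survives left-to-right matching
    0.25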

Second, it's common as mud for ratio() to give a different result
depending on the order of SequenceMatcher's arguments, junk or no
junk.  This is mostly because it's a good thing for SequenceMatcher to
run in finite time ;-)

It's not just comparing sequences in the abstract; it's doing so in
the context of producing a one-pass "left to right" edit sequence
transforming the first sequence into the second, using only "insert"
and "delete" operations.  If the longest matching block between
sequences X and Y is M, X and Y are effectively split into:

    X = A + M + B
    Y = C + M + D

and then the same process is recursively applied to the pairs (A, C)
and (B, D).  If, for example, there are many matches between blocks A
and D, they'll never be seen - a one-pass edit sequence restricted to
"insert" and "delete" has to view M as a wall, transforming A into C
entirely before it can skip over the matching M and start thinking
about how to transform B into D.

For that reason, when there are multiple maximal matching blocks,
exactly which one is picked to be M can have percolating effects on
how many additional matches are found _later_ (BTW, it's also a
reason why a notion of junk can improve the subjective _quality_ of
results as judged by humans:  if M is composed of things all of which
"look interesting" to people, people tend to think the match is
intuitive).

The method find_longest_match() uses to pick a winner is documented:

    Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
        alo <= i <= i+k <= ahi
        blo <= j <= j+k <= bhi
    and for all (i',j',k') meeting those conditions,
        k >= k'
        i <= i'
        and if i == i', j <= j'

    In other words, of all maximal matching blocks, return one that
    starts earliest in a, and of all those maximal matching blocks that
    start earliest in a, return the one that starts earliest in b.

This has consequences.  Here's a tiny example:

    X = "abbb"
    Y = "bbab"

All maximal matching blocks between X and Y have length 2.  "bb"
appears twice in X and once in Y.  "ab" appears once in each.  Pass
X first, and M="ab" is picked as the winner.  That breaks the
sequences into:

    X = "" + "ab" + "bb"
    Y = "bb" + "ab" + ""

and no further matches are found between "" and "bb", or between "bb"
and "".  ratio() thus returns 0.5.

But pass Y first, and M="bb" is picked as the winner, breaking the
sequences into:

    X = "a" + "bb" + "b"
    Y = "" + "bb" + "ab"

and an _additional_ match (on "b") is later found when comparing the
suffixes ("b" and "ab").  ratio() thus returns 0.75 in that case.

I can conceive of trying all possible winners (and so on recursively),
and then backtracking to pick a winner at each branch that maximizes
the total number of matches - but "it's a good thing for
SequenceMatcher to run in finite time" ;-)


> There are multiple possible definitions of similarity for sets (and
> arguments thereabout). I am sure the same is true for sequences. But I
> consider the definition for .ratio, without the heuristic, to be sensible. I
> would consider using it should the occasion arise.

If you've ever used difflib's file-comparison functions, they use
ratio() extensively under the covers.  The most interesting case when
producing diffs for human consumption is when two blocks have _no_
matching lines in common.  difflib's file comparators look for the two
most similar lines (highest ratio()) in that case, trying to guess
where (&, later, how) a human may have edited a line.  This produces
terrifically useful & intuitive output - when it works ;-)
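
For instance, a sketch (2.x syntax) of the kind of thing ndiff does
with two nearly identical lines:

    import difflib

    old = ["the quick brown fox\n"]
    new = ["the quick browne fox\n"]
    # The two lines' ratio() clears ndiff's internal similarity cutoff,
    # so they're paired up and a '?' guide line marks the inserted 'e'.
    for line in difflib.ndiff(old, new):
        print line,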


>> It certainly was the intent that nothing would be
>> called junk unless it appeared at least twice, so the "n>= 200"
>> clause ensures that 1% of n is at least 2.

> Since 2 cannot be greater than something that is at least 2, you ensured
> that nothing would be called junk unless it appears at least thrice.

OK, staring at the code, the minimum is actually 4 times.  It's true that

    if n >= 200 and len(indices) * 100 > n:

can't succeed when len(indices) is 2, but when it's 3 we've
_previously_ seen 3 instances of the element and are currently looking
at the fourth.
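
Here's a sketch of the degenerate case issue 2986 complains about,
under the heuristic as it stands (2.x syntax):

    from difflib import SequenceMatcher

    # Below the length-200 cutoff:  no auto-junking, so identical
    # sequences match fully.
    print SequenceMatcher(None, "x" * 199, "x" * 199).ratio()   # 1.0

    # At length 200 the heuristic fires ('x' is far more than 1% of the
    # sequence), its index list is effectively discarded, and the ratio
    # collapses even though the sequences are identical.
    print SequenceMatcher(None, "x" * 200, "x" * 200).ratio()   # 0.0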

...

>> I'd call it "autojunk", cuz that's what it would do.  Also a useful
>> syntactic overlap with the name of the current "isjunk" argument.

> I like that. I am now leaning toward the following:
>
> G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 'autojunk
> = True' to the constructor signature. This is the minimal change that fixes
> the bug of no choice while keeping the default as is. So it is a minimal
> violation of the usual stricture against API changes in bugfix releases.

I'm not current on the rules for bugfix releases, but way back when,
I'm afraid, this wouldn't have been allowed.  The problem is that
anyone using
the new argument in 2.7.1 would be creating code that would blow up if
run under 2.7.  Seems that the 6 people who care ;-) about this case
would happily agree to live with that, but I have to defer to those
more current on bugfix release practice.
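
For concreteness, if the proposal lands as described, callers would
write something like this (a sketch of the proposed signature; nothing
in 2.7.0 accepts an 'autojunk' argument today):

    >>> from difflib import SequenceMatcher
    >>> # Proposed:  autojunk=False disables the popular-element
    >>> # heuristic, so the degenerate case above matches fully again.
    >>> SequenceMatcher(None, "x" * 200, "x" * 200, autojunk=False).ratio()
    1.0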

> I would doc this as "Use an internal heuristic that identifies 'common' items
> as junk." and separately describe the 'current heuristic', leaving open the
> possibility of changing it.
>
> Possible suboption: enforce 'autojunk in (True,False)' so the user cannot
> forget that it is a switch and not a tuning parameter.

Don't think that part is needed.

> In 3.2, expose as an attribute a tuple 'heuristic' or '_heuristic' with the
> tuning parameters for the current heuristic. Adding the _ would indicate
> that it is a private, expert-only, use at your own risk, subject to change
> attribute.
>
> If we agree on this much, we can then discuss what the tuple should be for
> 3.2.

Think the most pressing thing is to give people a way to turn the damn
thing off.  An ugly way would be to trigger on an unlikely
input-output behavior of the existing isjunk argument.  For example,
if

    isjunk("what's the airspeed velocity of an unladen swallow?")

returned

    "don't use auto junk!"

and 2.7.1 recognized that as meaning "don't use auto junk", code could
be written under 2.7.1 that didn't blow up under 2.7.  It could
_behave_ differently, although that's true of any way of disabling the
auto-junk heuristics.

