difflib

Sun Apr 22 21:03:33 EDT 2001

[posted & mailed]

[Andrew Dalke]
> I'm also surprised.  I haven't heard of that alignment method
> before, as compared to sequence alignment methods for bioinformatics.
> But then I haven't really delved into that topic with sequences.

That's because you're trying to outguess nature, but SequenceMatcher is
trying to outguess people:  the latter doesn't care about the most efficient
way to change X into Y, it wants to guess what a *person* probably did when
changing X into Y by hand.  This can make text-file diffs much easier to
follow.

    before:  private Thread currentThread;
    after:   private volatile Thread currentThread;

Run this thru your usual sequence comparator, and it's likely to say "Ah, I
know!  They inserted 'e volatil' between the 't' and 'e' at the end of
'private'".  They can reach this bizarre conclusion because 'e Thread
currentThread' is a long matching subsequence.  Or:

    before:  ab
    after:   acab

"Ah, I know!  They stuffed 'ca' into the middle!" (something Unix diff and
Windows Windiff both believe, albeit via entirely different algorithms).

But "it's obvious" how a *person* changed the first example, and more likely
than not that a *person* actually tacked 'ac' on to the front in the second
example.  SequenceMatcher is trying to get those results, even if there are
"more efficient" ways to get from before to after.  AFAIK, it's only people
using vi who spend minutes plotting out the most efficient key sequence to
get a job done <wink>.

> ...
> Perhaps some of the biopython work along these lines might be
> relevant to the general population?

I don't know, but I named it "difflib.py" instead of "sequencematcher.py"
just in case one of you intrepid researchers did have something of general
value.  difflib.py came into being when Eric Raymond was writing a more
general library, and we both came to the conclusion that the SequenceMatcher
algorithm is the only one we actually had good results with in practice.  But
he was trying to outguess people too.

> ...
> Hmm, and it seems difflib has support for masking what is sometimes
> viewed as low information content regions.

No, Wootton has support for what difflib views as clumps of junk.  I'm not a
fan of overblown terminology unless *so* overblown it's funny <wink>.

comparatively y'rs  - tim