[Python-Dev] Difflib modifications [reposted]

Wed Dec 1 14:08:25 CET 2004

[Reposted to python-dev!]

Hello there,

    We've done some customizations to difflib to make it work well
with pagetests we are running on a project at Canonical, and we are
looking for some guidance as to what's the best way to do them. There
are some tricky bits that have to do with how the class inheritance is
put together, and since we would want to avoid duplicating difflib I
figured we'd ask and see if some grand ideas come up.

A [rough first cut of the] patch is inlined below. Essentially, it does:

    - Implements a custom Differ.fancy_compare function that supports
      ellipsis and omits equal content

    - Hacks _fancy_replace to skip ellipsis as well.

    - Hacks best_ratio and cutoff. I'm a bit fuzzy on why this was
      changed, to be honest, and Celso's travelling today, but IIRC it
      had to do with how difflib grouped changes.
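
For what it's worth, here is a small sketch of one plausible reason for
the cutoff change, going only by the comment in the patch: a line that
contains an ellipsis may score below the stock 0.75 cutoff against the
line it is meant to stand for, so _fancy_replace would not pair the two
up. The sample strings and the use of no junk filter (None) are
assumptions made for illustration; it is written for current Python
rather than 2.3.

```python
import difflib

STOCK_CUTOFF = 0.75    # cutoff in the unpatched _fancy_replace
PATCHED_CUTOFF = 0.01  # cutoff after the patch

# Illustrative lines, modelled on the comment in the patch.
want = '<a> ... </a>'
got = '<a> blabla </a>'

# Assumption: no character junk filter, unlike a configured Differ.
ratio = difflib.SequenceMatcher(None, want, got).ratio()

print('similarity ratio: %.3f' % ratio)
print('paired with stock cutoff:  ', ratio >= STOCK_CUTOFF)
print('paired with patched cutoff:', ratio >= PATCHED_CUTOFF)
```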

Essentially, what we aim for is:

    - Ignoring ellipsisized(!) content
    - Omitting content which is equal
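
To make the two goals concrete, here is a minimal standalone sketch (not
the patch itself, and written for current Python rather than 2.3). The
names ellipsis_match and terse_compare are hypothetical, and it makes a
simplifying assumption: wildcard matching is only attempted when a
'replace' block has the same number of lines on both sides.

```python
import difflib
import re

ELLIPSIS = '...'


def ellipsis_match(want, got):
    """Return True if got matches want, where '...' in want matches any text."""
    parts = [re.escape(part) for part in want.split(ELLIPSIS)]
    return re.fullmatch('.*'.join(parts), got, re.DOTALL) is not None


def terse_compare(want, got):
    """Yield '- '/'+ ' lines for real differences only (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, want, got)
    for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
        if tag == 'equal':
            continue  # omit content which is equal
        if tag == 'replace' and ahi - alo == bhi - blo:
            # Same-sized block: compare line by line, honouring ellipsis.
            for w, g in zip(want[alo:ahi], got[blo:bhi]):
                if not ellipsis_match(w, g):
                    yield '- ' + w
                    yield '+ ' + g
            continue
        # delete, insert, or uneven replace: dump both sides as-is.
        for line in want[alo:ahi]:
            yield '- ' + line
        for line in got[blo:bhi]:
            yield '+ ' + line


want = ['World ... Cruel', 'Dudes ... Cool']
got = ['World is Cruel', 'Dudes are Cool']
print(list(terse_compare(want, got)))
```

With the sample input from the doctest in the patch, nothing is
reported, since every expected line matches once the ellipsis is
treated as a wildcard.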

I initially thought the best way to do this would be to inherit from
SequenceMatcher and make it not return opcodes for ellipsis. However,
Differ instantiates SequenceMatcher directly, so there is no easy way to
substitute a subclass short of rewriting major bits of Differ. I suspect
Differ could easily be changed to look up the matcher through a class
attribute that we could override, but let me know what you think of the
whole thing.
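
A rough sketch of the class-attribute idea, to show what I mean.
PluggableDiffer is a hypothetical stand-in and not difflib.Differ (which
has no such hook today); matcher_class, EllipsisMatcher and
EllipsisDiffer are likewise made-up names, and the code targets current
Python rather than 2.3.

```python
import difflib


class EllipsisMatcher(difflib.SequenceMatcher):
    """Drops 'replace' opcodes whose a-side is a single '...' line."""

    def get_opcodes(self):
        opcodes = super().get_opcodes()
        return [op for op in opcodes if not self._is_ellipsis(op)]

    def _is_ellipsis(self, opcode):
        tag, alo, ahi, blo, bhi = opcode
        return (tag == 'replace' and ahi - alo == 1
                and self.a[alo].rstrip() == '...')


class PluggableDiffer:
    """Hypothetical Differ-like class with an overridable matcher."""

    matcher_class = difflib.SequenceMatcher  # the proposed hook

    def compare(self, a, b):
        cruncher = self.matcher_class(None, a, b)
        for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
            if tag == 'equal':
                continue  # omit equal content
            for line in a[alo:ahi]:
                yield '- ' + line
            for line in b[blo:bhi]:
                yield '+ ' + line


class EllipsisDiffer(PluggableDiffer):
    matcher_class = EllipsisMatcher  # the only override needed


want = ['header', '...', 'footer']
got = ['header', 'anything here', 'footer']
print(list(PluggableDiffer().compare(want, got)))
print(list(EllipsisDiffer().compare(want, got)))
```

The point is only that, with such a hook, the subclass needs a single
line; the opcode filtering lives entirely in the matcher.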

--- /usr/lib/python2.3/difflib.py	2004-11-18 20:05:38.720109040 -0200
+++ difflib.py	2004-11-18 20:24:06.731665680 -0200
@@ -885,6 +885,45 @@
             for line in g:
                 yield line
 
+    def fancy_compare(self, a, b):
+        """
+        >>> import difflib
+        >>> engine = difflib.Differ()
+        >>> got = ['World is Cruel', 'Dudes are Cool']
+        >>> want = ['World ... Cruel', 'Dudes ... Cool']
+        >>> list(engine.fancy_compare(want, got))
+        []
+         
+        """
+        cruncher = SequenceMatcher(self.linejunk, a, b)
+        for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
+
+            if tag == 'replace':
+                ## replace single line
+                if a[alo:ahi][0].rstrip() == '...' and ((ahi - alo) == 1):   
+                    g = None
+                ## two lines replaced  
+                elif a[alo:ahi][0].rstrip() == '...' and ((ahi - alo) > 1):   
+                    g = self._fancy_replace(a, (ahi - 1), ahi,
+                                            b, (bhi - 1), bhi)
+                ## common
+                else:
+                    g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
+            elif tag == 'delete':
+                g = self._dump('-', a, alo, ahi)
+            elif tag == 'insert':
+                g = self._dump('+', b, blo, bhi)
+            elif tag == 'equal':
+                # do not show anything
+                g = None
+            else:
+                raise ValueError, 'unknown tag ' + `tag`
+
+            if g:
+                for line in g:
+                    yield line
+        
+
     def _dump(self, tag, x, lo, hi):
         """Generate comparison results for a same-tagged range."""
         for i in xrange(lo, hi):
@@ -926,7 +965,13 @@
 
         # don't synch up unless the lines have a similarity score of at
         # least cutoff; best_ratio tracks the best score seen so far
-        best_ratio, cutoff = 0.74, 0.75
+        #best_ratio, cutoff = 0.74, 0.75
+
+        ## reduce the cutoff to have enough similarity
+        ## between '<something> ... <something>' and '<a> blabla </a>'
+        ## for example 
+        best_ratio, cutoff = 0.009, 0.01
+
         cruncher = SequenceMatcher(self.charjunk)
         eqi, eqj = None, None   # 1st indices of equal lines (if any)
 
@@ -981,7 +1026,11 @@
             cruncher.set_seqs(aelt, belt)
             for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
                 la, lb = ai2 - ai1, bj2 - bj1
-                if tag == 'replace':
+
+                if aelt[ai1:ai2] == '...':
+                    return
+
+                if tag == 'replace':                    
                     atags += '^' * la
                     btags += '^' * lb
                 elif tag == 'delete':

Take care,
--
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3361 2331