[Tutor] Word-by-word diff in Python

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu, 18 Apr 2002 20:29:38 -0700 (PDT)


> I'm looking for a word-by-word diff module / class in Python. I came
> across this Perl script; this is basically what I mean:
>
> Main page: http://mike-labs.com/wd2h/
> Perl script: http://mike-labs.com/wd2h/wd2h.html
> Example output: http://mike-labs.com/wd2h/diff.htm

Interesting!  Hmmm... if the indentation or formatting is significant, we
could turn a line-by-line diff utility into a word-by-word one by replacing
each newline with some sort of sentinel "NEWLINE" token.

We could then apply a string.split() to break the lines into individual
words.  Python comes with a standard library module called "difflib":

    http://www.python.org/doc/current/lib/module-difflib.html

that can be used to find differences between two texts.  Here's an
example:


###
>>> import difflib
>>> revision_1 = """Today, a generation raised in the shadows of the Cold
... War assumes new responsibilities in a world warmed by the sunshine of
... freedom""".split()
>>> revision_2 = """Today, a person raised in the shadows of the Cold War
... assumes new responsibilities in a world warmed by the sunshine of
... freedom""".split()
>>> difflib.ndiff(revision_1, revision_2)
<generator object at 0x81641b8>
>>> diff = difflib.ndiff(revision_1, revision_2)
>>> diff.next()
'  Today,'
>>> diff.next()
'  a'
>>> diff.next()
'- generation'
###

Note that what gets returned is a generator, a Python 2.2 style iterator
that lets us pull results from it one at a time through its "next()"
method.  That's useful if we want to conserve memory, but not quite so
useful if we want everything at once.
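
As a rough sketch of the "one at a time" idea: the session above is from
Python 2.2, and I'm assuming a Python 3 interpreter here, where the builtin
next(diff) takes the place of diff.next().  The shortened word lists and
the names "old", "new", and "changes" are made up for illustration.

```python
import difflib

old = "a generation raised in the shadows".split()
new = "a person raised in the shadows".split()

diff = difflib.ndiff(old, new)

# Builtin next() is the Python 3 spelling of the diff.next() call above.
first = next(diff)

# Looping over the generator pulls the remaining entries one at a time;
# here we keep only the words that were removed or added.
changes = [entry for entry in diff if entry[:2] in ("- ", "+ ")]

print(first)
print(changes)
```

Nothing is computed until we ask for it, so for long texts we can stop
early without building the whole diff.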


To grab the whole diff at once, let's convince Python to give it to us as
a list:

###
>>> results = list(difflib.ndiff(revision_1, revision_2))
>>> results
['  Today,', '  a', '- generation', '+ person', '  raised', '  in',
 '  the', '  shadows', '  of', '  the', '  Cold', '  War', '  assumes',
 '  new', '  responsibilities', '  in', '  a', '  world', '  warmed',
 '  by', '  the', '  sunshine', '  of', '  freedom']
###
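
Going back to the sentinel idea from the top of this message, here is one
way it might look.  This is only a sketch, assuming Python 3; the
"<NEWLINE>" token and the words_with_newlines() helper are names I made up,
and any token that can't appear in the text would do.

```python
import difflib

NEWLINE = "<NEWLINE>"  # hypothetical sentinel, assumed absent from the text


def words_with_newlines(text):
    """Split text into words, keeping each line break as a sentinel token."""
    # Pad the sentinel with spaces so split() treats it as its own word.
    return text.replace("\n", " " + NEWLINE + " ").split()


old = "raised in the shadows\nof the Cold War"
new = "raised in the shadows of the\nCold War"

tokens = list(difflib.ndiff(words_with_newlines(old),
                            words_with_newlines(new)))
for token in tokens:
    print(token)
```

Because the line breaks are now ordinary tokens, a moved line break shows
up in the diff just like a changed word would.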


And this ndiff output can be turned into nicely formatted HTML, with
strikeouts and everything.  *grin*
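
Here's a minimal sketch of that HTML step, assuming Python 3 (html.escape
comes from its standard library).  The word_diff_html() function is my own
invention, not part of difflib, and the choice of <del> and <ins> tags is
just one reasonable way to mark the changes.

```python
import difflib
from html import escape


def word_diff_html(old_text, new_text):
    """Render a word-level diff as HTML using <del> and <ins> tags."""
    pieces = []
    for entry in difflib.ndiff(old_text.split(), new_text.split()):
        # Each ndiff entry is a two-character code followed by the word.
        tag, word = entry[:2], escape(entry[2:])
        if tag == "- ":
            pieces.append("<del>" + word + "</del>")
        elif tag == "+ ":
            pieces.append("<ins>" + word + "</ins>")
        elif tag == "  ":
            pieces.append(word)
        # Entries starting with "? " are ndiff hint lines; skip them.
    return " ".join(pieces)


print(word_diff_html("a generation raised", "a person raised"))
# prints: a <del>generation</del> <ins>person</ins> raised
```

Most browsers draw <del> as a strikeout and <ins> as an underline, so a
little CSS on top of this gets close to the wd2h example output.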


Hope this helps!