diff for (text only) manuscripts

Tim Peters tim.one at comcast.net
Mon Mar 3 23:09:14 EST 2003


[Jon Slavin]
> This is not Python specific, but it seems that Python offers hope for
> a solution.
>
> I've been looking for a form of diff that is good for manuscripts --
> rather than programs.  That is, a diff that is word-oriented as opposed
> to line-oriented.  My motivation is that, as a scientist, I often collaborate
> with others on papers wherein we take turns editing a LaTeX file.
> Sometimes I'd like to know exactly how the file has been changed but I
> don't care if the lines wrap at a different point -- so I don't want
> to see noted as changed an entire paragraph if a couple words in the
> beginning have been added and the line breaks re-done.
>
> I have a little experience with Python and tried altering ndiff.py (in
> Python 2.1 Tools/scripts) but without too much luck.  The problem is
> that even if you make it do a diff on a series of words rather than
> lines, it's not clear how to put it all back together to make the
> result readable.
>
> Any suggestions welcome.

First figure out what you want the output to look like:  that's a UI issue,
and the technical details of diffing are irrelevant to getting output you
like.

After you figure that out, use difflib's SequenceMatcher class directly, and
note that it accepts sequences of any hashable objects.  The ndiff script
uses SequenceMatcher both to view a file as a sequence of lines, and to view
a line as a sequence of characters.  You're more likely to want to view a
file as a sequence of words, and a word as a sequence of characters (or a
file as a sequence of paragraphs, and a paragraph as a sequence of words,
and a word as a sequence of characters).  If you want to carry the original
file position (or anything else) along with your words, you can create a
little class with __hash__ and __eq__ methods that wraps each word together
with whatever metainfo you desire.  Base both methods on the word alone, so
that the same word at two different positions still compares equal.
SequenceMatcher will be happy to diff sequences of instances of that class
(if you implement __hash__ and __eq__, the instances are hashable, and
that's the only requirement SequenceMatcher imposes).
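
Here is a minimal sketch of that wrapper idea, not a finished tool.  It
assumes a current Python 3 and that a "word" is anything delimited by
whitespace.  The names Word, words_of and word_diff are made up for
illustration and aren't part of difflib; only SequenceMatcher and its
get_opcodes() method come from the library.  How to present the result is
still the UI question, so the sketch just returns the changed regions.

```python
from difflib import SequenceMatcher


class Word:
    """A word plus the line it came from (hypothetical helper class).

    Only the text takes part in hashing and equality, so the same word
    at a different file position still counts as a match.
    """

    def __init__(self, text, lineno):
        self.text = text
        self.lineno = lineno

    def __hash__(self):
        return hash(self.text)

    def __eq__(self, other):
        return isinstance(other, Word) and self.text == other.text

    def __repr__(self):
        return f"Word({self.text!r}, line {self.lineno})"


def words_of(text):
    """Split text into Word objects, ignoring where the lines wrap."""
    return [Word(w, n)
            for n, line in enumerate(text.splitlines(), start=1)
            for w in line.split()]


def word_diff(old_text, new_text):
    """Return (tag, old_words, new_words) for each changed region."""
    a, b = words_of(old_text), words_of(new_text)
    # autojunk=False turns off the "popular element" heuristic, which
    # could otherwise treat very common words as junk in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

Because line breaks never enter the comparison, re-wrapping a paragraph
yields no differences at all, while each Word in a reported change still
remembers the line it came from, which is handy when you get around to
formatting the output.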





