Diff of Text

Bryan bryanjugglercryptographer at yahoo.com
Sun Jun 6 00:16:32 EDT 2010


GZ wrote:
> I should distinguish between modifications and additions. In my above
> example, one line is modified/replaced, one line is added and one line
> is deleted. There are a total of 3 edits. I am looking for an
> alternative python library other than difflib that minimizes this
> number (edit distance).

Since you know the lingo, "edit distance", I expect you've already
Googled up the general results. Depending on what kinds of edits we
count, the problem of finding a minimal diff can be low-order poly-
time, NP-hard, or incomputable.

Gonzalo Navarro's 2001 survey, "A Guided Tour to Approximate String
Matching", is now available on-line for free:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.7225&rep=rep1&type=pdf
The paper notes, "issues have been put aside to keep a reasonable
scope," but presents more than most of us ever wanted to know about
the subject.

I don't know of a module that does what GZ asks. There are scores of
line-oriented diff implementations, and there are minimal-edit-
distance diffs, but the combination is rare at best. Problem domains
that call for a true minimal diff tend not to work in units of lines
of text.

--
--Bryan



More information about the Python-list mailing list