Diff of Text

Dave Angel davea at ieee.org
Sat Jun 5 22:09:03 EDT 2010


GZ wrote:
> <snip>
>>>>> I want a library that does unix 'diff' like function, i.e. compare two
>>>>> strings line by line and output the difference. Python's difflib does
>>>>> not work perfectly for me, because the resulting differences are
>>>>> pretty big. I would like an algorithm that generates the smallest
>>>>> differences.
>>>>>           
>> <snip>
> Thanks for your response.
>
> The verboseness of the format is not really my problem and I only care
> about line by line comparison for now.
>
> Let me think of a better way to express what I mean by a "smaller
> diff." After I diff the two strings, I will have something like this:
>
>   AAA
> - BBB
> + CCC
> + DDD
> - EEE
>
> It means the first line does not change, the second line is replaced
> by the third line, the forth line is new, and the fifth line is
> deleted.
>
> I define the "smallness" of the diff algorithm as "the sum of the
> total number of minuses and pluses". In my above example, it is 4 (two
> minuses and 2 pluses). Note that no matter what format we use to
> represent the diff, this number is the same.
>
> Python's difflib does not really minimize this number. It tries to
> make this number small, but also tries to yield matches that “look
> right” to people at the cost of increasing this number. (http://
> docs.python.org/library/difflib.html).
>
> What I am looking for is an algo that can really minimize this number.
>
>   
The - lines aren't needed, any more than the context lines are needed, 
so that will typically cut your results in half. But perhaps the real 
algorithm you're describing is one that permits lines to be moved as 
well as inserted and deleted. If you then consider that information to 
be free (it's not part of the measure you're specifying), you may do 
best with the following algorithm:

Scan the two files looking for lines that appear exactly once in each 
file. Consider those lines the first invariants, and everything else a 
potential change. Now, starting with each pair (which I call a bridge 
between islands), look at the line previous and the line following, and 
if either or both are also matched, expand the two islands to include 
them. As an island grows to butt up against another island, merge them.

Now, each bridge is a "move" operation, and every line left over in 
either file is either a plus or a minus, an insert or a delete. For most 
editing transactions, there will be relatively few of these.

For example, if three functions were reordered, from A, B, C, to A, C, 
B, there would be no plus or minus lines at all.

DaveA




More information about the Python-list mailing list