difflib and intelligent file differences

Thu Mar 26 10:00:52 EDT 2009

Hello All:

I am starting to work on a file comparison script that has to
compare the contents of two large files. Originally I planned to
sort both files on a numeric key and use UNIX's comm to do a line by
line comparison. However, that approach would fail, so I now think I
should have used Python from the start. Let me outline the problem.

Imagine two text files, f1 and f2.

f1 is

1
2
3
4
5

and f2 is

12
2
3
4
5

where each line can be thought of as a record, not a running sentence.
Okay, this one is easy: it is just a line by line comparison using
comm -3 f1 f2, which reports 1 and 12 as the differing records. BUT...
(and this is why I'm thinking of using Python's difflib to work on it)

Now say f1 is

1
2
3
4
5

and f2 is

2
3
4
5

The only difference in the *contents* is the record 1, but a strictly
positional line by line comparison would report every line as
different, because the missing first line shifts everything after it.
So what I'm really looking for is not just a line by line comparison,
but a comparison of the file contents.
Ideally, all I want to generate is a file of lines which would contain
the differences.
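
A minimal sketch of how difflib could handle the second example, assuming
both files fit in memory as lists of lines and are in a comparable order.
The name changed_lines is made up for illustration. difflib.ndiff aligns
the two sequences instead of comparing them position by position, so only
the missing record is flagged; on very large files it may be slow.

```python
import difflib

# The second example from above, with trailing newlines already stripped.
f1_lines = ["1", "2", "3", "4", "5"]
f2_lines = ["2", "3", "4", "5"]

def changed_lines(a, b):
    """Return only the ndiff entries marked as removed ('- ') or added ('+ ').

    Hypothetical helper: unchanged lines ('  ') and hint lines ('? ')
    are filtered out.
    """
    return [d for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ "))]

print(changed_lines(f1_lines, f2_lines))  # ['- 1']
```

Entries starting with "- " exist only in the first sequence, and entries
starting with "+ " exist only in the second.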

My first thought is to do a sweep: take one line from f1 and search
f2 for it. If it is found, delete it from a tmp version of f2; if it
is not found, write it to a difference file. Then move on to the
second line, and so on. At the end, any lines still left in the tmp
version of f2 were never matched, so they are appended to the
difference file as well. The result is a nice summary of the lines
(i.e., records) which are found in one file but not the other.
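
The sweep described above could be sketched with collections.Counter,
which treats each file as a multiset of records, so nothing has to be
searched for and deleted line by line. This is only one possible approach,
not difflib itself; the function name record_differences is made up for
illustration, and it assumes the order of records does not matter.

```python
from collections import Counter

def record_differences(lines1, lines2):
    """Return (records only in lines1, records only in lines2).

    Hypothetical helper. Duplicates are respected: a record that occurs
    twice in one input and once in the other is reported once.
    """
    c1 = Counter(lines1)
    c2 = Counter(lines2)
    # Counter subtraction keeps only the positive counts.
    only_in_1 = list((c1 - c2).elements())
    only_in_2 = list((c2 - c1).elements())
    return only_in_1, only_in_2

# First example: 1 was replaced by 12.
print(record_differences(["1", "2", "3", "4", "5"], ["12", "2", "3", "4", "5"]))
# Second example: 1 is simply missing from f2.
print(record_differences(["1", "2", "3", "4", "5"], ["2", "3", "4", "5"]))
```

With real files, the two lists could be built by reading each file and
stripping the trailing newline from every line before counting.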

Any suggestions where to start?