difflib and intelligent file differences

Thu Mar 26 11:03:23 EDT 2009

hayes.tyler at gmail.com wrote:

> My first thought is to do a sweep, where the first sweep takes one
> line from f1, travels f2, if found, deletes it from a tmp version of
> f2, and then on to the second line, and so on. If not found, it writes
> to a file. At the end, if there are also lines still in f1 that never
> were matched because it was longer, it appends those as well to the
> difference file. At the end, you have a nice summary of the lines
> (i.e., records) that appear in only one of the two files.
> 
> Any suggestions where to start?

You can adapt the script below, provided both files are already sorted.
Memory usage scales linearly with the size of the difference between the
files, and running time scales linearly with the file sizes.

> #!/usr/bin/env python
> 
> import sys
> 
> 
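> def run(fname_a, fname_b):
>     # Lines read so far from one file with no match yet in the other.
>     a_lines = set()
>     b_lines = set()
> 
>     # open() replaces the Python 2-only file(); 'with' closes both files.
>     with open(fname_a) as filea, open(fname_b) as fileb:
>         while True:
>             a = filea.readline()
>             b = fileb.readline()
>             if not (a or b):
>                 break  # both files are exhausted
> 
>             if a == b:
>                 continue
> 
>             # Cancel a against a pending line from b, or remember it.
>             if a in b_lines:
>                 b_lines.remove(a)
>             elif a:
>                 a_lines.add(a)
> 
>             # Same for b against the pending lines from a.
>             if b in a_lines:
>                 a_lines.remove(b)
>             elif b:
>                 b_lines.add(b)
> 
>     # Lines found only in the first file.
>     for line in a_lines:
>         print(line.rstrip('\n'))
> 
>     if a_lines or b_lines:
>         print('')
>         print('***************')
>         print('')
> 
>     # Lines found only in the second file.
>     for line in b_lines:
>         print(line.rstrip('\n'))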
> 
> 
> if __name__ == '__main__':
>     run(sys.argv[1], sys.argv[2])
>
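
If the files are not sorted, or the same record can appear more than
once, something closer to the sweep described in the question may be a
better fit. What follows is only a rough sketch under those assumptions,
not part of the script above: it pairs off matching lines one-for-one
with collections.Counter and reports the leftovers. The function name
unmatched_lines is made up for illustration. Note the trade-off: memory
here grows with the number of distinct lines in both files, not just
with the size of the difference.

```python
from collections import Counter


def unmatched_lines(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sorted lists of leftover records.

    Each argument is any iterable of lines, e.g. an open file object.
    Duplicates are paired off one-for-one, so a record that occurs twice
    in one input and once in the other is reported once.
    """
    count_a = Counter(line.rstrip("\n") for line in lines_a)
    count_b = Counter(line.rstrip("\n") for line in lines_b)
    # Counter subtraction keeps only positive counts, which is exactly
    # the set of records left over after matching.
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b
```

For real files, pass two open file objects, for example
unmatched_lines(open(fname_a), open(fname_b)), and print each returned
list with a separator line in between, as the script above does.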