comparing huge files

Wed Mar 15 23:56:32 EST 2006

s99999999s2003 at yahoo.com wrote:
> hi
> i wrote some code to compare 2 files. One is the base file, the other
> file i got from somewhere. I need to compare this file against the
> base,
> eg base file
> abc
> def
> ghi
> 
> eg another file
> abc
> def
> ghi
> jkl
> 
> after compare , the base file will be overwritten with "jkl". Also both
> files tend to grow towards > 20MB ..
> 
> Here is my code...using difflib.
> 
> pat = re.compile(r'^\+') ## i want to get rid of the '+' from the difflib output...
> def difference(filename,basename):
>         import difflib
>         base = open(basename)
>         a = base.readlines()
>         input = open(filename)
>         b = input.readlines()
>         d = difflib.Differ()
>         diff = list(d.compare(a, b))
>         if len(diff) > 0:
>                 os.remove(basename)
>                 o = open(basename, "aU")
>                 for i in diff:
>                         if pat.search(i):
>                                 i = i.lstrip("\+ ")
>                                 o.writelines(i)  ## write a new base file...
>                 o.close()
>         g = open(basename)
>         return g.readlines()
> 
> Whenever the 2 files get very large, i find that it's very slow
> comparing...any good advice to speed things up.? I thought of removing
> readlines() method, and use line by line compare. Is it a better way?
> thanks
> 

It seems like you want a new base that contains only those lines of 
'filename' that are not in 'basename', where 'basename' is an ordered 
subset of 'filename'. In other words, 'filename' has all of the lines 
of 'basename', in order, somewhere, but 'filename' also has some 
additional lines. Is that correct? difflib looks to be overkill for 
this. Here is a suggestion:

basefile = open(basename)
newfile = open(filename)
newbase = open('tmp.txt', 'w')

# File objects are their own iterators, so the inner loop resumes
# where it last stopped instead of starting over.
for baseline in basefile:
    for newline in newfile:
        if baseline != newline:
            newbase.write(newline)
        else:
            break

# Whatever is left in 'filename' after the last base line is new too.
newbase.writelines(newfile)

for afile in (basefile, newfile, newbase):
    afile.close()
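
For anyone who wants to try the idea on small files first, here is a 
minimal sketch of the same single-pass loop wrapped in a function. The 
names write_new_lines and demo and the temporary file names are made up 
for illustration, and it still assumes that the lines of 'basename' 
appear in 'filename' in the same order.

```python
import os
import tempfile


def write_new_lines(basename, filename, outname):
    """Write to outname the lines of filename that basename lacks.

    Assumes the lines of basename appear in filename in the same order.
    """
    with open(basename) as basefile, open(filename) as newfile, \
            open(outname, "w") as out:
        for baseline in basefile:
            # newfile is its own iterator, so this loop resumes where it stopped.
            for newline in newfile:
                if newline == baseline:
                    break
                out.write(newline)
        # Lines after the last base line are new as well.
        out.writelines(newfile)


def demo():
    """Run the function on two small temporary files and return the result."""
    workdir = tempfile.mkdtemp()
    base = os.path.join(workdir, "base.txt")
    new = os.path.join(workdir, "new.txt")
    out = os.path.join(workdir, "tmp.txt")
    with open(base, "w") as f:
        f.write("abc\ndef\nghi\n")
    with open(new, "w") as f:
        f.write("abc\ndef\nghi\njkl\n")
    write_new_lines(base, new, out)
    with open(out) as f:
        return f.read()


print(repr(demo()))  # 'jkl\n'
```

Only one line from each file is held in memory at a time, so the size of 
the files should not matter much. If the result is meant to overwrite 
the base, as in the original question, os.replace(outname, basename) 
after the call would be one way to do it.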

If 'basename' is not an ordered subset of 'filename', then difflib seems 
to be your best bet, because then you have a computationally intensive 
problem.
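
If it does come to difflib, one thing that might be worth trying is 
difflib.SequenceMatcher directly instead of Differ. Differ also compares 
similar lines character by character to produce its '?' hint lines, 
which is extra work when only the added lines are wanted. The sketch 
below is an assumption about what is needed, not a drop-in replacement: 
added_lines is a made-up name, and it treats every line that difflib 
reports as inserted or replaced as "new".

```python
import difflib


def added_lines(base_lines, new_lines):
    """Return the lines of new_lines that difflib marks as inserted or replaced."""
    # autojunk=False turns off the heuristic that ignores very common
    # lines once a sequence has 200 or more items.
    matcher = difflib.SequenceMatcher(None, base_lines, new_lines,
                                      autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return added


base = ["abc\n", "def\n", "ghi\n"]
new = ["abc\n", "def\n", "ghi\n", "jkl\n"]
print(added_lines(base, new))  # ['jkl\n']
```

This still reads both files into lists, so it will not be as light on 
memory as the single-pass loop, and I have not timed it on 20MB files.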

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/