Middle matching - any Python library functions (besides re)?

Mon Aug 28 08:11:09 EDT 2006

Simon Forman wrote:
> Paul Rubin wrote:
>> "EP" <eric.pederson at gmail.com> writes:
>>> Given that I am looking for matches of all files against all other
>>> files (of similar length) is there a better bet than using re.search?
>>> The initial application concerns files in the 1,000's, and I could use
>>> a good solution for a number of files in the 100,000's.
>> If these are text files, typically you'd use the Unix 'diff' utility
>> to locate the differences.
> 
> If you can, you definitely want to use diff.  Otherwise, the difflib
> standard library module may be of use to you.  Also, since you're
> talking about comparing many files to each other, you could pull out a
> substring of one file and use the 'in' "operator" to check if that
> substring is in another file.  Something like this:
> 
> f = open(filename) # or if binary open(filename, 'rb')
> f.seek(somewhere_in_the_file)
> substr = f.read(some_amount_of_data)
> f.close()
> 
> try_diffing_us = []
> for fn in list_of_filenames:
>     data = open(fn).read() # or again open(fn, 'rb')...
>     if substr in data:
>         try_diffing_us.append(fn)
> 
> # then diff just those filenames...
> 
> That's a naive implementation but it should illustrate how to cut down
> on the number of actual diffs you'll need to perform.  Of course, if
> your files are large it may not be feasible to do this with all of
> them.  But they'd have to be really large, or there'd have to be lots
> and lots of them...  :-)
> 
> More information on your actual use case would be helpful in narrowing
> down the best options.
> 
> Peace,
> ~Simon
> 
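
For the difflib route Simon mentions, here is a minimal sketch of finding the longest run two files have in common. It assumes both files fit in memory as strings, and the helper name longest_common_block is made up for illustration. SequenceMatcher gets slow on large inputs, so this is probably only practical after something cheaper has narrowed down the candidates.

```python
import difflib

def longest_common_block(a, b):
    """Return the longest substring that appears in both a and b."""
    # autojunk=False stops SequenceMatcher from discarding very frequent items
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # match.a is the start offset in a, match.size the length of the run
    return a[match.a:match.a + match.size]
```

A pair of files whose longest common block is only a few characters long is probably not worth handing to diff.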

Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?

The functions below may be of some help.

#!/usr/bin/python
#
#
# Function: generate and compare checksums on a file

import hashlib

def getsum(filename):
    """
    Generate the checksum from chunks of the file as they are read
    """
    md5sum = hashlib.md5()
    # binary mode, so the checksum covers the exact bytes on disk
    f = open(filename, 'rb')
    try:
        for block in getblocks(f):
            md5sum.update(block)
    finally:
        f.close()
    return md5sum.hexdigest()

def getblocks(f, blocksize=1024):
    """
    Read file in small chunks to avoid having large files loaded into memory
    """
    while True:
        s = f.read(blocksize)
        if not s:
            break
        yield s

def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    (caller and filename are accepted but not used)
    """
    if cs != check:
        return 1 # compare failed
    else:
        return 0 # compare successful
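
Keep in mind that a checksum only tells you whether two files are byte-for-byte identical, so it can weed out exact duplicates but says nothing about partial matches. Here is a rough sketch of grouping files by digest so that duplicates never reach diff. The names file_digest and group_identical are hypothetical, and the 64 KB block size is an arbitrary choice.

```python
import hashlib
from collections import defaultdict

def file_digest(filename, blocksize=65536):
    """Return the MD5 hex digest of a file, read in binary chunks."""
    md5sum = hashlib.md5()
    with open(filename, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for block in iter(lambda: f.read(blocksize), b''):
            md5sum.update(block)
    return md5sum.hexdigest()

def group_identical(filenames):
    """Map each digest to the list of files that share it."""
    groups = defaultdict(list)
    for fn in filenames:
        groups[file_digest(fn)].append(fn)
    return dict(groups)
```

Files that land in the same group are identical, so only one representative per group needs to be diffed against the rest.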

   -- 
Adversity: That which does not kill me only postpones the inevitable.