Middle matching - any Python library functions (besides re)?

Mon Aug 28 14:56:26 EDT 2006

Simon Forman wrote:
> Andrew Robert wrote:
>> Simon Forman wrote:
>>> Paul Rubin wrote:
>>>> "EP" <eric.pederson at gmail.com> writes:
>>>>> Given that I am looking for matches of all files against all other
>>>>> files (of similar length) is there a better bet than using re.search?
>>>>> The initial application concerns files in the 1,000's, and I could use
>>>>> a good solution for a number of files in the 100,000's.
>>>> If these are text files, typically you'd use the Unix 'diff' utility
>>>> to locate the differences.
>>> If you can, you definitely want to use diff.  Otherwise, the difflib
>>> standard library module may be of use to you.  Also, since you're
>>> talking about comparing many files to each other, you could pull out a
>>> substring of one file and use the 'in' "operator" to check if that
>>> substring is in another file.  Something like this:
>>>
>>> f = open(filename) # or if binary open(filename, 'rb')
>>> f.seek(somewhere_in_the_file)
>>> substr = f.read(some_amount_of_data)
>>> f.close()
>>>
>>> try_diffing_us = []
>>> for fn in list_of_filenames:
>>>     data = open(fn).read() # or again open(fn, 'rb')...
>>>     if substr in data:
>>>         try_diffing_us.append(fn)
>>>
>>> # then diff just those filenames...
>>>
>>> That's a naive implementation but it should illustrate how to cut down
>>> on the number of actual diffs you'll need to perform.  Of course, if
>>> your files are large it may not be feasible to do this with all of
>>> them.  But they'd have to be really large, or there'd have to be lots
>>> and lots of them...  :-)
>>>
>>> More information on your actual use case would be helpful in narrowing
>>> down the best options.
>>>
>>> Peace,
>>> ~Simon
>>>
>> Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?
>>
> 
> The thing about a checksum algorithm is that if you call it with some
> data, then change even one byte of the data and call it again, the
> resulting checksums will be (should be) very different.
> 
> Checksumming the entire file contents won't help, but as bearophile
> said: "If you can determine of a given a file where its heading garbage
> stops, then you can compute the signature just computing the python
> hash of some of the following bytes (30 or
> 200 byte may suffice)."
> 
> If the OP can do that then yes, it would likely be more efficient
> (although computing and comparing checksums on just 200 bytes or less
> might not be a significant gain on simply comparing the strings
> themselves.)  But if he can't then the next best thing would be to take
> a subsection of the file somewhere after the heading cruft and search
> (using string 'in' string form) for that subsection in other files.
> (Actually there may be a better option than that, but I'm just not
> bright enough to have thought of it...)
> 
>> Utilizing the functions below may be of some help.
>>
>>
>> #!/usr/bin/python
>> #
>> #
>> # Function: generate and compare checksums on a file
>>
>>
>> import md5, sys
>>
>>
>> def getsum(filename):
>>         """
>>         Generate the check sum based on received chunks of the file
>>         """
>>         md5sum = md5.new()
>>         f = open(filename, 'r')
>>         for line in getblocks(f) :
>>              md5sum.update(line)
>>         f.close()
>>         return md5sum.hexdigest()
>>
>> def getblocks(f, blocksize=1024):
>>         """
>>         Read file in small chunks to avoid having large files loaded into memory
>>         """
>>         while True:
>>                 s = f.read(blocksize)
>>                 if not s: break
>>                 yield s
>>
>> def checksum_compare(caller, cs='',check='', filename=''):
>>         """
>>         Compare the generated and received checksum valued
>>         """
>>         if cs != check:
>>                 return 1 # compare failed
>>         else:
>>                 return 0 # compare successful
>>
> 
> I'm curious why you included this elaborate function with it's unused
> args (caller and filename), unnecessary defaults, and it's odd
> inversion of Boolean values..
> 
> How is "if not checksum_compare(something, sum0, sum1): #do
> something..." any better than "if sum0 == sum1: #do something..." ?
> 
> Peace,
> ~Simon
> 
Because I was lazy..

The  checksume_compare came from something else I wrote that had special
logging and e-mailer calls in it.

Should have ripped the reference to caller and file name out..