Middle matching - any Python library functions (besides re)?
Andrew Robert
andrew.arobert at gmail.com
Mon Aug 28 08:11:09 EDT 2006
Simon Forman wrote:
> Paul Rubin wrote:
>> "EP" <eric.pederson at gmail.com> writes:
>>> Given that I am looking for matches of all files against all other
>>> files (of similar length) is there a better bet than using re.search?
>>> The initial application concerns files in the 1,000's, and I could use
>>> a good solution for a number of files in the 100,000's.
>> If these are text files, typically you'd use the Unix 'diff' utility
>> to locate the differences.
>
> If you can, you definitely want to use diff. Otherwise, the difflib
> standard library module may be of use to you. Also, since you're
> talking about comparing many files to each other, you could pull out a
> substring of one file and use the 'in' "operator" to check if that
> substring is in another file. Something like this:
>
> f = open(filename)  # or if binary open(filename, 'rb')
> f.seek(somewhere_in_the_file)
> substr = f.read(some_amount_of_data)
> f.close()
>
> try_diffing_us = []
> for fn in list_of_filenames:
>     data = open(fn).read()  # or again open(fn, 'rb')...
>     if substr in data:
>         try_diffing_us.append(fn)
>
> # then diff just those filenames...
>
> That's a naive implementation but it should illustrate how to cut down
> on the number of actual diffs you'll need to perform. Of course, if
> your files are large it may not be feasible to do this with all of
> them. But they'd have to be really large, or there'd have to be lots
> and lots of them... :-)
>
> More information on your actual use case would be helpful in narrowing
> down the best options.
>
> Peace,
> ~Simon
>
Would it be more efficient to checksum the files first, and then diff only the ones that fail the checksum comparison?
The functions below may be of some help.
#!/usr/bin/python
#
# Generate and compare checksums on a file

import md5

def getsum(filename):
    """
    Generate the checksum from successive chunks of the file.
    """
    md5sum = md5.new()
    f = open(filename, 'rb')  # binary mode so the digest is platform-independent
    for block in getblocks(f):
        md5sum.update(block)
    f.close()
    return md5sum.hexdigest()

def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory.
    """
    while True:
        s = f.read(blocksize)
        if not s:
            break
        yield s

def checksum_compare(cs='', check='', filename=''):
    """
    Compare the generated and received checksum values.
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare succeeded
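For what it's worth, here is a sketch of how the checksum idea could prune an all-against-all comparison: bucket the files by digest, so identical files land together and only files in different buckets could ever need a diff. This uses hashlib (the modern stand-in for the md5 module) and made-up demo filenames, so treat it as an illustration rather than drop-in code:

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def getsum(filename, blocksize=1024):
    # Same idea as getsum() above: hash the file in fixed-size chunks.
    md5sum = hashlib.md5()
    with open(filename, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            md5sum.update(block)
    return md5sum.hexdigest()

# Hypothetical demo data: three small files, two of them identical.
tmpdir = tempfile.mkdtemp()
contents = {'a.txt': b'hello world\n',
            'b.txt': b'hello world\n',
            'c.txt': b'something else\n'}
for name, data in contents.items():
    with open(os.path.join(tmpdir, name), 'wb') as f:
        f.write(data)

# Bucket files by digest; only files in different buckets differ,
# so those are the only candidates worth handing to diff/difflib.
groups = defaultdict(list)
for name in sorted(contents):
    groups[getsum(os.path.join(tmpdir, name))].append(name)

for digest, names in sorted(groups.items()):
    print(digest[:8], names)
```

The cost is one linear pass per file to compute the digest, which is almost always cheaper than diffing every pair.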
--
Adversity: That which does not kill me only postpones the inevitable.