Middle matching - any Python library functions (besides re)?

Simon Forman rogue_pedro at yahoo.com
Mon Aug 28 14:49:09 EDT 2006


Andrew Robert wrote:
> Simon Forman wrote:
> > Paul Rubin wrote:
> >> "EP" <eric.pederson at gmail.com> writes:
> >>> Given that I am looking for matches of all files against all other
> >>> files (of similar length) is there a better bet than using re.search?
> >>> The initial application concerns files in the 1,000's, and I could use
> >>> a good solution for a number of files in the 100,000's.
> >> If these are text files, typically you'd use the Unix 'diff' utility
> >> to locate the differences.
> >
> > If you can, you definitely want to use diff.  Otherwise, the difflib
> > standard library module may be of use to you.  Also, since you're
> > talking about comparing many files to each other, you could pull out a
> > substring of one file and use the 'in' "operator" to check if that
> > substring is in another file.  Something like this:
> >
> > f = open(filename) # or if binary open(filename, 'rb')
> > f.seek(somewhere_in_the_file)
> > substr = f.read(some_amount_of_data)
> > f.close()
> >
> > try_diffing_us = []
> > for fn in list_of_filenames:
> >     data = open(fn).read() # or again open(fn, 'rb')...
> >     if substr in data:
> >         try_diffing_us.append(fn)
> >
> > # then diff just those filenames...
> >
> > That's a naive implementation but it should illustrate how to cut down
> > on the number of actual diffs you'll need to perform.  Of course, if
> > your files are large it may not be feasible to do this with all of
> > them.  But they'd have to be really large, or there'd have to be lots
> > and lots of them...  :-)
> >
> > More information on your actual use case would be helpful in narrowing
> > down the best options.
> >
> > Peace,
> > ~Simon
> >
>
> Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?
>

The thing about a checksum algorithm is that if you call it with some
data, then change even one byte of the data and call it again, the
resulting checksums will be (should be) very different.
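For example (untested, just to illustrate the point, using the same md5
module as your code below):

import md5

a = md5.new('the quick brown fox').hexdigest()
b = md5.new('the quick brown fix').hexdigest()  # one byte changed
print a
print b
# the two digests will look nothing alike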

Checksumming the entire file contents won't help (the differing heading
garbage would give otherwise-identical files different checksums), but as
bearophile said: "If you can determine, for a given file, where its heading
garbage stops, then you can compute the signature just by computing the
python hash of some of the following bytes (30 or 200 bytes may suffice)."
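
Something along those lines (untested, and header_offset here is just a
placeholder -- the OP would have to supply whatever offset gets past the
heading garbage):

import md5

header_offset = 30   # placeholder: wherever the heading garbage ends

def signature(filename, header_offset, sig_size=200):
    """md5 of sig_size bytes starting just past the heading garbage."""
    f = open(filename, 'rb')
    f.seek(header_offset)
    data = f.read(sig_size)
    f.close()
    return md5.new(data).hexdigest()

# Group files by signature; only files that land in the same group
# need to be diffed against each other.
groups = {}
for fn in list_of_filenames:   # list_of_filenames as in my earlier snippet
    sig = signature(fn, header_offset)
    groups.setdefault(sig, []).append(fn)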

If the OP can do that, then yes, it would likely be more efficient
(although computing and comparing checksums on just 200 bytes or less
might not be a significant gain over simply comparing the strings
themselves).  But if he can't, then the next best thing would be to take
a subsection of the file from somewhere after the heading cruft and
search for that subsection in the other files (using the
'substring in data' form).  (Actually there may be a better option than
that, but I'm just not bright enough to have thought of it...)

>
> Utilizing the functions below may be of some help.
>
>
> #!/usr/bin/python
> #
> #
> # Function: generate and compare checksums on a file
>
>
> import md5, sys
>
>
> def getsum(filename):
>         """
>         Generate the check sum based on received chunks of the file
>         """
>         md5sum = md5.new()
>         f = open(filename, 'r')
>         for line in getblocks(f) :
>              md5sum.update(line)
>         f.close()
>         return md5sum.hexdigest()
>
> def getblocks(f, blocksize=1024):
>         """
>         Read file in small chunks to avoid having large files loaded into memory
>         """
>         while True:
>                 s = f.read(blocksize)
>                 if not s: break
>                 yield s
>
> def checksum_compare(caller, cs='',check='', filename=''):
>         """
>         Compare the generated and received checksum valued
>         """
>         if cs != check:
>                 return 1 # compare failed
>         else:
>                 return 0 # compare successful
>

I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values.

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon



