Middle matching - any Python library functions (besides re)?

Mon Aug 28 01:32:21 EDT 2006

Paul Rubin wrote:
> "EP" <eric.pederson at gmail.com> writes:
> > Given that I am looking for matches of all files against all other
> > files (of similar length) is there a better bet than using re.search?
> > The initial application concerns files in the 1,000's, and I could use
> > a good solution for a number of files in the 100,000's.
>
> If these are text files, typically you'd use the Unix 'diff' utility
> to locate the differences.

If you can, you definitely want to use diff.  Otherwise, the difflib
standard library module may be of use to you.  Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' operator to check whether that
substring occurs in another file.  Something like this:

# 'rb' works for text files too, and seek() then takes plain byte offsets
with open(filename, 'rb') as f:
    f.seek(somewhere_in_the_file)
    substr = f.read(some_amount_of_data)

try_diffing_us = []
for fn in list_of_filenames:
    # open in the same mode as above so substr and data are the same type
    with open(fn, 'rb') as other:
        data = other.read()
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...
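
For the diffing step itself, if calling out to diff isn't an option,
here is a minimal sketch using difflib.  The helper name diff_files is
made up for illustration, and unified_diff is just one of several
difflib functions you could pick; it also assumes the candidates are
text files small enough to read line by line into memory.

```python
import difflib

def diff_files(path_a, path_b):
    """Return a unified diff of two text files as a list of lines.

    An empty list means the two files have identical lines.
    """
    with open(path_a) as fa, open(path_b) as fb:
        lines_a = fa.readlines()
        lines_b = fb.readlines()
    # unified_diff yields lines lazily; collect them into a list
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=path_a, tofile=path_b))
```

You would call it as diff_files(filename, fn) for each fn that made it
into try_diffing_us.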

The substring filter above is a naive implementation, but it should
illustrate how to cut down on the number of actual diffs you'll need
to perform.  Note that it reads each candidate file entirely into
memory, one at a time, so if your files are large it may not be
feasible to do this with all of them.  But they'd have to be really
large, or there'd have to be lots and lots of them...  :-)
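
If the files do turn out to be too large to read whole, one possible
workaround is to let the OS page them in via the mmap module instead
of calling read().  This is only a sketch: file_contains is a
hypothetical helper name, and it assumes the substring is a non-empty
bytes object taken from a file opened in 'rb' mode.

```python
import mmap
import os

def file_contains(path, needle):
    """Return True if the bytes `needle` occur anywhere in the file at `path`."""
    if os.path.getsize(path) == 0:
        return False  # mmap cannot map an empty file
    with open(path, 'rb') as f:
        # length 0 maps the whole file; ACCESS_READ keeps it read-only
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mm.find(needle) != -1
        finally:
            mm.close()
```

In the loop above, "if substr in data" would then become
"if file_contains(fn, substr)", with no call to read() at all.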

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon