Middle matching - any Python library functions (besides re)?

EP eric.pederson at gmail.com
Sun Aug 27 19:43:14 EDT 2006


Hi,

I'm a bit green in this area and wonder whether there are existing
Python tools for this (or whether I have to scratch my head real hard
for an appropriate algorithm...).  I'd hate to build a solution inferior
to one that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin, there will be a matching pattern of bytes
in the middle, though it may sit at a different offset in each file.  I
want to find which files share the same original content with which
other files.  The cruft portions should be a small percentage of the
overall file lengths.

Given that I am looking for matches of all files against all other
files (of similar length), is there a better bet than using re.search?
The initial application concerns thousands of files, and I could use
a good solution that scales to hundreds of thousands.
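
For concreteness, here is a rough sketch of the naive re.search approach
I have in mind, not a finished solution.  The names shares_middle and
matching_pairs, and the 64-byte probe length, are placeholders I made up
for illustration; they do not come from any library.  The idea is to cut
a probe out of the middle of one file (which should lie inside the
original content if the cruft is small) and search for it in the other.

```python
import re
from itertools import combinations


def shares_middle(a, b, probe_len=64):
    """Return True if a probe cut from the middle of `a` occurs anywhere in `b`.

    `a` and `b` are bytes objects; `probe_len` is an arbitrary guess.
    """
    if len(a) < probe_len:
        return False
    start = (len(a) - probe_len) // 2
    probe = a[start:start + probe_len]
    # re.escape makes the probe match literally, not as regex syntax.
    return re.search(re.escape(probe), b) is not None


def matching_pairs(files, probe_len=64):
    """Compare every file against every other file (quadratic in file count).

    `files` maps a name to that file's contents as bytes.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(files), 2)
        if shares_middle(files[x], files[y], probe_len)
    ]


if __name__ == "__main__":
    origin = bytes(range(256)) * 4
    files = {
        "a": b"HEAD" + origin,        # cruft added to the front
        "b": origin + b"TAILTAIL",    # cruft added to the back
        "c": bytes(reversed(origin)), # unrelated content
    }
    print(matching_pairs(files))
```

The all-pairs loop is what worries me: the number of comparisons grows
with the square of the number of files.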

TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP



