Middle matching - any Python library functions (besides re)?

Paddy paddy3118 at netscape.net
Sun Aug 27 22:50:21 EDT 2006


EP wrote:
> Hi,
>
> I'm a bit green in this area and wonder to what extent there may be
> some existing Python tools (or if I have to scratch my head real hard
> for an appropriate algorithm... )  I'd hate to build an inferior
> solution to that someone has painstakingly built before me.
>
> I have some files which may have had the same origin, but some may have
> had some cruft added to the front, and some may have had some cruft
> added to the back; thus they may be of slightly different lengths, but
> if they had the same origin, there will be a matching pattern of bytes
> in the middle, though it may be offset relative to each other.  I want
> to find which files have in common with which other files the same
> pattern of origin within them.  The cruft portions should be a small %
> of the overall file lengths.

Are they source files?
Is the cruft comments?

Have you done an exhaustive search for info on the files and their
histories? If they are systematically generated then there may be a way
to systematically recover the matching info from those histories, such
as source code management logs, file pathnames, or file dates.

What more can you find out about the cruft? Is there a way to strip the
cruft programmatically?  If so, file sizes and checksums could be
compared on the stripped files.
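
A minimal sketch of that size-and-checksum comparison, assuming the cruft
can be removed by some rule. The strip_cruft() function below is purely
hypothetical (it pretends the cruft is leading/trailing whitespace and NUL
bytes) and would have to be replaced with whatever rule fits the real
files; group_by_stripped_content() is likewise just an illustrative name.

```python
import hashlib
from collections import defaultdict


def strip_cruft(data: bytes) -> bytes:
    # Hypothetical rule: treat leading/trailing whitespace and NUL
    # bytes as cruft. Replace with the real stripping logic.
    return data.strip(b" \t\r\n\x00")


def group_by_stripped_content(blobs):
    """Group names whose stripped contents are identical.

    blobs maps a name (e.g. a file path) to that file's bytes.
    Returns a list of sorted name lists, one per group of 2 or more.
    """
    groups = defaultdict(list)
    for name, data in blobs.items():
        core = strip_cruft(data)
        # Size plus checksum of the stripped content is the match key.
        key = (len(core), hashlib.sha256(core).hexdigest())
        groups[key].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Each file is hashed once, so this avoids comparing every file against
every other file, which matters if the count really reaches the
100,000's. It only helps if the cruft can be stripped exactly, though.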

>
> Given that I am looking for matches of all files against all other
> files (of similar length) is there a better bet than using re.search?
> The initial application concerns files in the 1,000's, and I could use
> a good solution for a number of files in the 100,000's.

Can you change the flow that creates these files so that the
information you want is generated alongside the files?

Can you use data later in the flow to more easily extract the info you
want?

>
> TIA for bearing with my ignorance of the clear solution I'm surely
> blind to...
>
> EP

Just some questions that I would ask myself if given the task.

- Paddy.



