String/source code analysis tools

Moosebumps moosebumps at moosebumps.com
Thu Apr 22 02:56:15 EDT 2004


I have a whole bunch of script files in a custom scripting "language" that
were copied and pasted all over the place, and the result is a huge mess.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.

Here is a simple example:

If I have two pieces of code like this:

func1( a, b, c, 13, d, e, f )
func2( x, y, z, z )

and

func1( a, b, c, 55, d, e, f )
func2( x, y, z, x )

I would like to be able to detect the redundancies.  This is obviously a
simple example; the real code is worlds messier.  Picture a 3-line script
where each line has 800 characters, copied 10 times over with slight
modifications among the 800 characters.  I'm not exaggerating.  So I'm
wondering if there is any code out there that will assist me in refactoring
this code.
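
One possible starting point, sketched here under assumptions and not as a
complete clone detector: split each script line into tokens and compare every
pair of lines with the standard library's difflib.SequenceMatcher, keeping
pairs whose similarity ratio is high.  The names tokenize and near_duplicates,
the token regex, and the 0.8 threshold are all hypothetical choices for
illustration; the custom language may need a different tokenizer.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def tokenize(line):
    # Assumed token rule: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)


def near_duplicates(lines, threshold=0.8):
    """Return (i, j, ratio) for every pair of lines at least `threshold` similar."""
    tokens = [tokenize(line) for line in lines]
    pairs = []
    for i, j in combinations(range(len(lines)), 2):
        # autojunk=False keeps frequent tokens such as "," from being ignored
        # on long lines.
        matcher = SequenceMatcher(None, tokens[i], tokens[j], autojunk=False)
        ratio = matcher.ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


script_lines = [
    "func1( a, b, c, 13, d, e, f )",
    "func2( x, y, z, z )",
    "func1( a, b, c, 55, d, e, f )",
    "func2( x, y, z, x )",
]
pairs = near_duplicates(script_lines)
for i, j, ratio in pairs:
    print(i, j, round(ratio, 3))
```

Comparing every pair is quadratic in the number of lines, so for a large set
of files some cheaper pre-grouping (for example by the first token on the
line) would probably be needed before running the comparison.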

My feeling is that a general solution to this is intractable, but I thought
I'd ask.  I'd probably have to roll my own based on the specifics of the
situation.

It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line.  How would I do a diff, token by token?  I don't know anything
about what algorithms diffs use.
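
As a rough illustration of a token-by-token diff, the sketch below feeds lists
of tokens (instead of lines) to difflib.SequenceMatcher from the Python
standard library and reports only the spans that differ.  The helper names
tokenize and token_diff and the token regex are assumptions made for this
example, not part of any existing tool.

```python
import re
from difflib import SequenceMatcher

# Assumed token rule: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def token_diff(old, new):
    """List (tag, old_tokens, new_tokens) for each span where the inputs differ."""
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = token_diff(
    "func1( a, b, c, 13, d, e, f )",
    "func1( a, b, c, 55, d, e, f )",
)
print(changes)
```

The spans that come back as "replace" are natural candidates for parameters
if the repeated code were folded into a single reusable definition.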

thanks,
MB





