String/source code analysis tools

Moosebumps moosebumps at moosebumps.com
Thu Apr 22 03:16:51 EDT 2004


Man, I love Python!  After writing this, with about 10 minutes of googling,
I found the difflib module, which can do diffs token by token.  I can
probably do what I want with about 10 lines of code.  Wow.
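
A minimal sketch of what a token-by-token diff with difflib might look like,
using the example from the original message.  The tokenize() and token_diff()
helpers and the regular expression are assumptions made for illustration; the
real scripting "language" may need a different tokenizer.

```python
import difflib
import re


def tokenize(text):
    # Assumption: a token is a run of word characters (names, numbers)
    # or a single punctuation character; whitespace is ignored.
    return re.findall(r"\w+|[^\w\s]", text)


def token_diff(a, b):
    """Return (tag, tokens_from_a, tokens_from_b) for each differing region."""
    ta, tb = tokenize(a), tokenize(b)
    # autojunk=False keeps difflib from discarding very frequent tokens
    # (such as commas) on long inputs.
    matcher = difflib.SequenceMatcher(None, ta, tb, autojunk=False)
    return [
        (tag, ta[i1:i2], tb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )"
new = "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )"

changes = token_diff(old, new)
for change in changes:
    print(change)
# ('replace', ['13'], ['55'])
# ('replace', ['z'], ['x'])
```

Because the comparison works on tokens rather than lines, an 800-character
line with one changed argument shows up as a single small change.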

I think the diff is pretty much the best solution -- but if anyone has any
other pointers I would appreciate it.  I would have to diff all pairs of
files to get a score of how similar they are to each other.  So if I
have 10 files I would have to run it 45 times (10 * 9 / 2) to get all pairs
of diffs.  That should be OK since they are small files in general.
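
The all-pairs scoring could be sketched roughly as follows.  This assumes the
score is SequenceMatcher.ratio() computed over tokens; similarity(),
rank_pairs(), and the file names are hypothetical, and the scripts are given
as an in-memory dict rather than read from disk.

```python
import difflib
import itertools
import re


def similarity(a, b):
    """Score two scripts from 0.0 (nothing shared) to 1.0 (identical tokens)."""
    # Assumption: same simple tokenizer as a word-or-punctuation split.
    ta = re.findall(r"\w+|[^\w\s]", a)
    tb = re.findall(r"\w+|[^\w\s]", b)
    return difflib.SequenceMatcher(None, ta, tb, autojunk=False).ratio()


def rank_pairs(scripts):
    """scripts maps a name to its text; returns (score, name_a, name_b) tuples,
    most similar pair first."""
    scored = [
        (similarity(scripts[x], scripts[y]), x, y)
        for x, y in itertools.combinations(sorted(scripts), 2)
    ]
    return sorted(scored, reverse=True)


scripts = {
    "a.txt": "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )",
    "b.txt": "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )",
    "c.txt": "something( else )",
}

ranked = rank_pairs(scripts)
for score, name_a, name_b in ranked:
    print("%.3f  %s  %s" % (score, name_a, name_b))
```

itertools.combinations yields each unordered pair once, so 10 files give 45
comparisons.  Pairs near the top of the list are the likeliest copy-and-paste
candidates.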

MB


"Moosebumps" <moosebumps at moosebumps.com> wrote in message
news:j0Khc.25133$Q%5.6444 at newssvr27.news.prodigy.com...
> I have a whole bunch of script files in a custom scripting "language" that
> were basically copied and pasted all over the place -- a huge mess,
> basically.
>
> I want to clean this up using Python -- and I'm wondering if there is any
> sort of algorithm for detecting copied and pasted code with slight
> modifications.
>
> i.e. a simple example:
>
> If I have two pieces of code like this:
>
> func1( a, b, c, 13, d, e, f )
> func2( x, y, z, z )
>
> and
>
> func1( a, b, c, 55, d, e, f )
> func2( x, y, z, x )
>
> I would like to be able to detect the redundancies.  This is obviously a
> simple example, the real code is worlds messier -- say a 3 line script, each
> line has 800 characters, copied 10 times over with slight modifications
> among the 800 characters.  I'm not exaggerating.  So I'm wondering if there
> is any code out there that will assist me in refactoring this code.
>
> My feeling is that a general solution to this is intractable, but I thought
> I'd ask.  I'd probably have to roll my own based on the specifics of the
> situation.
>
> It is actually sort of like a diff algorithm maybe, but it wouldn't go line
> by line.  How would I do a diff, token by token?  I don't know anything
> about what algorithms diffs use.
>
> thanks,
> MB