String/source code analysis tools

François Pinard pinard at iro.umontreal.ca
Tue May 11 19:43:41 EDT 2004


[Ira Baxter]
> "Moosebumps" <moosebumps at moosebumps.com> wrote in message
> news:j0Khc.25133$Q%5.6444 at newssvr27.news.prodigy.com...

> > I have a whole bunch of script files in a custom scripting
> > "language" that were basically copied and pasted all over the place
> > -- a huge mess, basically.  I want to clean this up using Python --
> > and I'm wondering if there is any sort of algorithm for detecting
> > copied and pasted code with slight modifications.

> Not in Python, but could be used to do this.  We offer a clone
> detection tool that works on very large source code bases, and detects
> cloned code with "slight modifications".  You'd have to provide a
> grammar for your 'scripting language'.  See
> http://www.semanticdesigns.com/Products/Clone/index.html.

Thanks for the reference, I'm saving it for later perusal or study.

Many years ago, because I had a cleanup problem which I presume is similar
to yours, I wrote and then used a tool for this, but all in C.  I called
it `mdiff' (for "multi-diff"), and it can likely be found within some old
pretest of `Free wdiff' -- I have not really touched `wdiff' in years, even
if I am pondering republishing it this summer, provided I find some free time.

`mdiff' looks for identical sequences of lines within one or more files
(I used it on many dozens of files at once).  One difficulty was
designing a usable way to display the output, and this was an
interesting problem at least.  `mdiff' did the job for me, but I do not
really remember the state of this project nor how `mdiff' would behave
if recompiled today.  But, as usual with me, if you feel like toying,
just ask for the sources, or hunt for them from my home web page! :-)
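
For illustration only, the idea of seeking identical sequences of lines
across several files can be sketched in Python as below.  This is not
how `mdiff' works internally; the function name `find_repeated_blocks',
the `min_lines' parameter, the whitespace stripping and the fixed-size
window approach are all assumptions made for the sketch.

```python
from collections import defaultdict


def find_repeated_blocks(files, min_lines=3):
    """Return runs of `min_lines` consecutive lines that occur more than once.

    `files` maps a file name to its text.  The result maps each repeated
    run (a tuple of stripped lines) to a list of (name, starting line)
    pairs, with line numbers counted from 1.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Strip surrounding whitespace so that re-indented copies still match.
        lines = [line.strip() for line in text.splitlines()]
        for start in range(len(lines) - min_lines + 1):
            window = tuple(lines[start:start + min_lines])
            if any(window):  # ignore windows made only of blank lines
                seen[window].append((name, start + 1))
    # Keep only the runs found in more than one place.
    return {run: places for run, places in seen.items() if len(places) > 1}


if __name__ == "__main__":
    sample = {
        "a.txt": "x\nfoo\nbar\nbaz\ny\n",
        "b.txt": "foo\n  bar\nbaz\nz\n",
    }
    for run, places in find_repeated_blocks(sample).items():
        print(places, run)
```

Overlapping windows are reported separately here, so a long copied
passage shows up as many short hits; merging them into maximal runs, and
tolerating "slight modifications", is left out of this sketch.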

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard



