[Tutor] How to identify clusters of similar files

Steven D'Aprano steve at pearwood.info
Sun Jun 3 04:00:32 CEST 2012


Albert-Jan Roskam wrote:
> Hi,
> 
> I want to use difflib to compare a lot (tens of thousands) of text files. I
> know that many files are quite similar as they are subsequent versions of
> the same document (a primitive kind of version control). What would be a
> good approach to cluster the files based on their likeness?

You have already identified the basic tool: difflib. But your question is not 
really about Python; it is more about which algorithm to use for clustering 
data according to similarity. That's a hard problem, and you should consider 
asking it on the main Python mailing list or newsgroup too.
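
To make the difflib part concrete, here is a minimal sketch, not a recommended 
clustering algorithm. It assumes the file contents are already loaded into a 
dict of strings, and it simply puts two files in the same group whenever they 
are linked by a chain of pairs whose difflib.SequenceMatcher ratio() is at or 
above a cut-off. The names similarity and cluster, and the 0.8 threshold, are 
my own inventions for illustration.

```python
import difflib
from itertools import combinations


def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster(texts, threshold=0.8):
    """Group names whose texts are linked by a chain of similar pairs.

    texts: dict mapping a name (e.g. a filename) to its text.
    threshold: hypothetical cut-off; tune it for your own data.
    """
    # Every name starts out as the representative of its own group.
    parent = {name: name for name in texts}

    def find(name):
        # Follow parent links up to the group's representative,
        # shortening the chain as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Compare every pair once; merge the two groups if they are similar.
    for a, b in combinations(texts, 2):
        if similarity(texts[a], texts[b]) >= threshold:
            parent[find(a)] = find(b)

    # Collect the names under their final representative.
    groups = {}
    for name in texts:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Be warned that this compares every pair of files, so with tens of thousands 
of files it means hundreds of millions of comparisons, and ratio() is not 
cheap. SequenceMatcher also has quick_ratio() and real_quick_ratio(), which 
return upper bounds on ratio() and can be used to skip pairs that cannot 
reach the threshold.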

Some search terms to get you started:

biopython
nltk  (the Natural Language Tool Kit)
unrooted phylogram


Good luck!


-- 
Steven

