[Tutor] How to identify clusters of similar files
Steven D'Aprano
steve at pearwood.info
Sun Jun 3 04:00:32 CEST 2012
Albert-Jan Roskam wrote:
> Hi,
>
> I want to use difflib to compare a lot (tens of thousands) of text files. I
> know that many files are quite similar as they are subsequent versions of
> the same document (a primitive kind of version control). What would be a
> good approach to cluster the files based on their likeness?
You have already identified the basic tool: difflib. But your question is not
really about Python; it is more about which algorithm to use for clustering
data by similarity. That's a hard problem, and you should consider asking it
on the main Python mailing list or newsgroup too.
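To make the difflib part concrete, here is a minimal sketch of one naive way
to group texts. It is only an illustration, not a recommended clustering
algorithm: the helper names (similarity, cluster_texts), the 0.8 threshold and
the sample documents are all made up for the example. Each text joins the
first cluster whose first member is similar enough, otherwise it starts a new
cluster.

```python
import difflib


def similarity(a, b):
    """Return difflib's similarity ratio for two strings (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster_texts(texts, threshold=0.8):
    """Greedy grouping (hypothetical helper, threshold is an assumption).

    texts maps a name to its content. A text joins the first cluster whose
    representative (its first member) scores at least `threshold`;
    otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of names
    for name, content in texts.items():
        for members in clusters:
            if similarity(texts[members[0]], content) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([name])
    return clusters


# Made-up sample data standing in for file contents.
docs = {
    "report_v1.txt": "The quick brown fox jumps over the lazy dog.",
    "report_v2.txt": "The quick brown fox jumped over the lazy dog.",
    "notes.txt": "Completely unrelated shopping list: eggs, milk, bread.",
}
print(cluster_texts(docs))
```

With tens of thousands of files, comparing every text against every cluster
representative this way is likely to be slow. SequenceMatcher also offers
quick_ratio() and real_quick_ratio(), which give cheaper upper bounds on
ratio() and can be used to skip obviously dissimilar pairs first.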
Some search terms to get you started:
biopython
nltk (the Natural Language Tool Kit)
unrooted phylogram
Good luck!
--
Steven