Comparing two book chapters (text files)

M.-A. Lemburg mal at egenix.com
Thu Feb 5 07:00:30 EST 2009


On 2009-02-05 02:20, Nick Matzke wrote:
> Hi all,
> 
> So I have an interesting challenge.  I want to compare two book
> chapters, which I have in plain text format, and find out (a) percentage
> similarity and (b) what has changed.
> 
> Some features make this problem different than what seems to be the
> standard text-matching problem solvable with e.g. difflib.  Here is what
> I mean:
> 
> * there is no guarantee that single lines from each file will be
> directly comparable -- e.g., if a few words are inserted into a
> sentence, then a chunk of the sentence will be moved to the next line,
> then a chunk of that line moved to the next, etc.
> 
> * Also, there are cases where paragraphs have been moved around,
> sections re-ordered, etc.  So it can't just be a "linear" match.
> 
> I imagine this kind of thing can't be all that hard in the grand scheme
> of things, but I couldn't find an easily applicable solution readily
> available.  I have advanced beginner Python skills but am not quite
> where I could do this kind of thing from scratch without some guidance
> about the likely functions, libraries etc. to use.
> 
> PS: I am going to have to do this for multiple book chapters, so various
> software packages, e.g. for Windows, are not really usable.
> 
> Any help is much appreciated!!

difflib is in the Python stdlib and provides several ways to implement
difference detection (e.g. SequenceMatcher, ndiff and unified_diff):

    http://docs.python.org/library/difflib.html
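
For the "percentage similarity" part of the question, a minimal sketch
along these lines may be enough. It assumes that splitting on whitespace
is an acceptable definition of a "word"; the function name
word_similarity and the sample strings are made up for illustration.
Because it compares word sequences rather than lines, re-wrapped
paragraphs should not count as changes.

```python
import difflib


def word_similarity(text_a, text_b):
    """Return a similarity ratio in [0, 1] computed over words, not lines."""
    # split() with no arguments splits on any whitespace, including
    # newlines, so line wrapping does not affect the result.
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False turns off the heuristic that treats very frequent
    # items as junk in long sequences.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


# Hypothetical sample data: same sentence, re-wrapped, one word removed.
old = "The quick brown fox\njumps over the lazy dog."
new = "The quick brown\nfox jumps over the dog."
print("similarity: %.1f%%" % (100 * word_similarity(old, new)))
```

Note that ratio() is 2*M/T (M = matching words, T = total words in both
texts), so it is a rough measure rather than a strict "percent changed".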

Here's a script that I use for diff'ing text files on a word
basis, called tdiff.py:

    http://downloads.egenix.com/python/tdiff.py

It helps a lot with text that gets word wrapped or reformatted.
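
The word-based idea can be sketched with difflib alone. This is not the
code from tdiff.py, just an illustration of the general approach under
the assumption that whitespace-separated tokens are good enough; the
name word_diff and the sample text are hypothetical.

```python
import difflib


def word_diff(text_a, text_b):
    """Return a list of (tag, old_words, new_words) for each changed span."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    changes = []
    # get_opcodes() yields (tag, i1, i2, j1, j2) tuples, where tag is one
    # of 'equal', 'replace', 'delete' or 'insert'.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.append((tag,
                        " ".join(words_a[i1:i2]),
                        " ".join(words_b[j1:j2])))
    return changes


# Hypothetical sample data: two edits plus a different line wrap.
old = "The chapter begins with a short\nintroduction to the topic."
new = "The chapter begins with a brief introduction\nto the main topic."
for tag, removed, added in word_diff(old, new):
    print("%-8s %r -> %r" % (tag, removed, added))
```

SequenceMatcher aligns the two texts in order, so a paragraph that was
moved will most likely show up as a deletion in one place and an
insertion in another rather than as a "move".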

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



