Comparing two book chapters (text files)

Chris Rebert crebert at ucsd.edu
Wed Feb 4 20:41:41 EST 2009


On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke <matzke at berkeley.edu> wrote:
> Hi all,
>
> So I have an interesting challenge.  I want to compare two book chapters,
> which I have in plain text format, and find out (a) percentage similarity
> and (b) what has changed.
>
> Some features make this problem different than what seems to be the standard
> text-matching problem solvable with e.g. difflib.  Here is what I mean:
>
> * there is no guarantee that single lines from each file will be directly
> comparable -- e.g., if a few words are inserted into a sentence, then a
> chunk of the sentence will be moved to the next line, then a chunk of that
> line moved to the next, etc.
>
> * Also, there are cases where paragraphs have been moved around, sections
> re-ordered, etc.  So it can't just be a "linear" match.
>
> I imagine this kind of thing can't be all that hard in the grand scheme of
> things, but I couldn't find an easily applicable solution readily available.
> I have advanced-beginner Python skills but am not quite where I could do
> this kind of thing from scratch without some guidance about the likely
> functions, libraries etc. to use.
>
> PS: I am going to have to do this for multiple book chapters, so various
> software packages, e.g. for Windows, are not really usable.

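Though not written in Python, GNU wdiff
(http://www.gnu.org/software/wdiff/wdiff.html) might be a good
starting point: it compares files word by word rather than line by
line, so re-wrapped lines do not show up as changes.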
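
If you would rather stay in Python, the same word-by-word idea can be
approximated with difflib by feeding it lists of words instead of lines.
The following is only a rough sketch, not a tested solution for real
book chapters: the function names, the blank-line paragraph splitting,
and the 0.6 threshold are all assumptions made for illustration. It
computes a similarity ratio, lists what changed, and pairs each old
paragraph with its closest new one regardless of position, which is one
way to cope with moved paragraphs.

```python
import difflib
import re


def words(text):
    # Split on any whitespace so that line wrapping does not matter.
    return text.split()


def word_similarity(a, b):
    # Similarity ratio in [0, 1] computed over word sequences.
    matcher = difflib.SequenceMatcher(None, words(a), words(b), autojunk=False)
    return matcher.ratio()


def word_changes(a, b):
    # Yield (tag, old_words, new_words) for every non-matching region.
    # tag is one of 'replace', 'delete', 'insert' (see get_opcodes()).
    wa, wb = words(a), words(b)
    matcher = difflib.SequenceMatcher(None, wa, wb, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield tag, " ".join(wa[i1:i2]), " ".join(wb[j1:j2])


def paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def match_paragraphs(old, new, threshold=0.6):
    # For each old paragraph, find the most similar new paragraph,
    # wherever it is. Returns (old_index, new_index_or_None, score).
    # The threshold is an arbitrary illustrative cut-off.
    new_paras = paragraphs(new)
    results = []
    for i, p in enumerate(paragraphs(old)):
        scored = [(word_similarity(p, q), j) for j, q in enumerate(new_paras)]
        score, j = max(scored, default=(0.0, None))
        results.append((i, j if score >= threshold else None, score))
    return results
```

To use it, read each chapter into a string (for example with
open(path).read(), where path is whatever your files are called) and
pass the two strings in. word_similarity() multiplied by 100 gives a
rough percentage, and word_changes() run on each matched paragraph pair
shows the edits. Comparing every paragraph against every other one is
quadratic, so it may be slow on long chapters.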

Cheers,
Chris

-- 
Follow the path of the Iguana...
http://rebertia.com
