Looking for a Python script to compare two files

Thu Nov 10 04:29:54 EST 2005

[david]
> I want to compare PDF-PDF files and WORD-WORD files.

OK. Well, that's clear enough.

> It seems that the right way is :
> First, extract text from PDF file or Word file.
> Then, use Difflib to compare these text files.

When you say "it seems that the right way is..." I'll
assume that this way meets your requirements. It
wouldn't be the right way if, for example, you
wanted to treat different header levels as different,
or to consider embedded graphics as significant etc.

> Would you please give me some more information 
> about the external diff tools?

Well, I could mention the name of the ones
which I might use (WinMerge and GNU diff), 
but I'm sure there are many of them around
the place, and you're far better off doing this:

http://www.google.co.uk/search?q=diff+tools

In case you didn't realise, the "difflib" I
referred to is a Python module from the standard
library:

<screendump>
Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> `difflib`
"<module 'difflib' from 'c:\\python24\\lib\\difflib.pyc'>"
>>>
</screendump>
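
As a rough illustration of the difflib step, here is a minimal sketch which
assumes the text has already been extracted into two strings. It is written
for a current Python 3 rather than the 2.4 shown above; the function name
unified_text_diff and the default labels old.txt / new.txt are made up for
the example.

```python
import difflib


def unified_text_diff(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Return a unified diff of two strings as a list of lines."""
    # Keep the line endings so the diff lines can be joined back verbatim.
    old_lines = old_text.splitlines(True)
    new_lines = new_text.splitlines(True)
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name))


if __name__ == "__main__":
    old = "one\ntwo\nthree\n"
    new = "one\n2\nthree\n"
    print("".join(unified_text_diff(old, new)))
```

If the two texts are identical the function returns an empty list, which
makes a convenient "no differences" test.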

> Are there some Python scripts that can extract text
> from PDF or WORD file?

Well, I'm sure there are, but my honest opinion is that,
unless you've got some compelling reason to do this in
Python, you're better off using, say:

+ antiword: http://www.winfield.demon.nl/
+ pdftotext from xpdf: http://www.foolabs.com/xpdf/home.html
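
If the external tools are used, they can still be driven from Python. The
following is only a sketch, for a current Python 3: it assumes antiword and
xpdf's pdftotext are installed and on the PATH, picks the tool from the file
extension, and guesses that antiword's output can be decoded as UTF-8. The
function names extraction_command and extract_text are invented for the
example.

```python
import subprocess


def extraction_command(path):
    """Return the command (as a list) which prints the text of `path`."""
    lower = path.lower()
    if lower.endswith(".pdf"):
        # "-" tells pdftotext to write to standard output rather than a file.
        return ["pdftotext", "-enc", "UTF-8", path, "-"]
    if lower.endswith(".doc"):
        # antiword writes the document text to standard output by default.
        return ["antiword", path]
    raise ValueError("don't know how to extract text from %r" % path)


def extract_text(path):
    """Run the external tool and return its output as a string."""
    output = subprocess.check_output(extraction_command(path))
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return output.decode("utf-8", "replace")
```

Calling extract_text ("c:/temp/test.pdf") would then give a string which can
be handed to difflib.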

If you really wanted to go with Python (for the learning
experience, if nothing else) then the most obvious candidates
are:

+ Word: use the pywin32 modules to automate Word and save the document
  as text:

http://pywin32.sf.net/

Something like this (assumes doc called c:\temp\test.doc exists):

<code>
import win32com.client

# EnsureDispatch generates the wrappers which make the wdFormat... constants available
word = win32com.client.gencache.EnsureDispatch ("Word.Application")
doc = word.Documents.Open (FileName="c:/temp/test.doc")
# Save a plain-text copy of the document
doc.SaveAs (
  FileName="c:/temp/test2.txt",
  FileFormat=win32com.client.constants.wdFormatText
)
word.Quit ()
del word

text = open ("c:/temp/test2.txt").read ()
print (text)
</code>

+ PDF: David Boddie's pdftools looks like about the only possibility:
(ducks as a thousand people jump on him and point out the alternatives)

http://www.boddie.org.uk/david/Projects/Python/pdftools/

Something like this might do the business. I'm afraid I've
no idea how to determine where the line-breaks are. This
was the first time I'd used pdftools, and the fact that
I could do this much is a credit to its usability!

<code>
from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text (contents):
  # Walk the (possibly nested) page contents, yielding the text of each Text item
  for item in contents:
    if isinstance (item, list):
      for i in contents_to_text (item):
        yield i
    elif isinstance (item, Text):
      yield item.text

doc = PDFDocument ("c:/temp/test.pdf")
n_pages = doc.count_pages ()
text = []
# Pages are numbered from 1
for n_page in range (1, n_pages+1):
  print ("Page %d" % n_page)
  page = doc.read_page (n_page)
  contents = page.read_contents ().contents
  text.extend (contents_to_text (contents))

# The pieces are joined as-is: no line-breaks are reconstructed
print ("".join (text))
</code>
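
Since the extracted PDF text comes out without reliable line-breaks, a
line-by-line diff may not be much use. One possible workaround, sketched
here for a current Python 3 and not anything pdftools itself provides, is
to compare the two texts word by word with difflib.SequenceMatcher. The
function name word_changes is invented for the example.

```python
import difflib


def word_changes(old_text, new_text):
    """Return (tag, old_words, new_words) for each span which differs."""
    # Splitting on whitespace makes the comparison independent of line-breaks.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


if __name__ == "__main__":
    old = "The quick brown fox jumps over the lazy dog"
    new = "The quick red fox jumps over the dog"
    for tag, removed, added in word_changes(old, new):
        print(tag, " ".join(removed), "->", " ".join(added))
```

Each tag is one of "replace", "delete" or "insert", so the result can be
turned into whatever kind of report is wanted.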

TJG
