[Tutor] PDF program [pdftotext]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu May 29 17:02:02 2003


On Thu, 29 May 2003, Ron Nixon wrote:

> I thought I saw a Python program that would read and extract text from
> PDF files. Anyone recall anything on this?


Hi Ron,


(If anyone finds an alternative to the solution below, I'd be very very
interested in this, as my work depends on doing this sort of stuff too!
*grin*)


The 'pdfsearch' project,

    http://pdfsearch.sourceforge.net/


uses the Unix utility 'pdftotext' as its backend to pull text out of these
files.  pdftotext is part of the 'xpdf' package:

    http://www.foolabs.com/xpdf/download.html


It wouldn't be too hard to write a Python wrapper around pdftotext.
Here's a sketch of a kind of wrapper:


###
import os

def extractPdfText(filename):
    # The trailing '-' tells pdftotext to write the text to stdout.
    return os.popen("pdftotext '%s' -" % filename).read()
###

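This isn't complete or foolproof (a filename containing a single quote
breaks the shell quoting, and nothing checks whether pdftotext is
installed or ran successfully), but it should be a good start.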
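
For anyone who wants something a bit sturdier, here is a minimal sketch of
the same idea that avoids the shell altogether. It assumes a Python that
ships the 'subprocess' and 'shutil.which' helpers. The name
extract_pdf_text and its 'pdftotext' parameter are my own inventions for
illustration, and decoding the output as UTF-8 is an assumption, since the
encoding pdftotext emits depends on its version and configuration.

```python
import shutil
import subprocess


def extract_pdf_text(filename, pdftotext="pdftotext"):
    """Return the text of a PDF by running the pdftotext utility.

    'pdftotext' is the command name or path to run (hypothetical
    parameter, handy if the binary is not on PATH).
    """
    if shutil.which(pdftotext) is None:
        raise RuntimeError("%s not found on PATH" % pdftotext)

    # Passing an argument list (no shell) means spaces or quotes in the
    # filename need no escaping. The trailing "-" sends text to stdout.
    result = subprocess.run(
        [pdftotext, filename, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        message = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError("pdftotext failed: %s" % message)

    # Assumption: output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_pdf_text("paper.pdf") (with a made-up filename) should
hand back the document's text as one string, and raise RuntimeError if
pdftotext is missing or exits with an error. This is still only a sketch:
it reads the whole output into memory and does nothing about
password-protected files.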



Good luck to you!