[Tutor] New newbie question. [PDFs and Python]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 9 Jul 2002 15:30:28 -0700 (PDT)


On Tue, 9 Jul 2002, SA wrote:

> On 7/9/02 4:32 PM, "Danny Yoo" <dyoo@hkn.eecs.berkeley.edu> wrote:
>
> >
> > To make pdftotext work nicely as a Python function, we can do something
> > like this:
> >
> > ###
> > import os
> >
> > def extractPDFText(pdf_filename):
> >   """Given a PDF file name, returns a new file object of the
> >   text of that PDF.  Uses the 'pdftotext' utility."""
> >   # The trailing "-" makes pdftotext write to standard output.
> >   return os.popen("pdftotext %s -" % pdf_filename)
> > ###
> >
> >
> > Here's a demonstration:
> >
> > ###
> > >>> f = extractPDFText('ortuno02.pdf')
> > >>> text = f.read()
> > >>> print text[:200]
> > EUROPHYSICS LETTERS
> > 1 March 2002
> >
> > Europhys. Lett., 57 (5), pp. 759
>
>
> Ok I can see how it would work like this. However, if I first convert
> the pdf to txt, do I not then have a pdf file and a text file?


The hyphen in the command is the key to doing this without an intermediate
'txt' file:

    return os.popen("pdftotext %s -" % pdf_filename)

The hyphen takes the place of the output file name; it tells pdftotext not
to write to disk, but to write to its "standard output".  popen() captures
the standard output of an external command and, instead of saving it to
disk, makes it directly available to us.  The object we get back behaves
like a file, but it's really a pipe connected to the running command, so no
intermediate text file is ever created.
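
Here's a rough sketch of the same idea using the subprocess module found in
newer Pythons (3.7 and up).  The names capture_stdout() and
extract_pdf_text() are made up for this example, and subprocess is swapped
in for os.popen(), so treat it as an illustration rather than the only way
to do it.  It assumes pdftotext is installed and on your PATH:

```python
import subprocess
import sys


def capture_stdout(argv):
    """Run the command in argv and return its standard output as a string."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def extract_pdf_text(pdf_filename):
    """Return the text of a PDF as a string, using the 'pdftotext' utility.

    Assumes pdftotext is installed; the "-" sends the text to standard output.
    """
    return capture_stdout(["pdftotext", pdf_filename, "-"])


if __name__ == "__main__":
    # Stand-in command so the demo runs without pdftotext or a PDF on hand:
    # a child Python process prints a line and we capture it, with no file
    # written anywhere.
    print(capture_stdout([sys.executable, "-c", "print('hello from a pipe')"]))
```

With pdftotext available, extract_pdf_text('ortuno02.pdf')[:200] should give
the same opening text as the demonstration quoted above.  Passing the
command as a list also sidesteps trouble with file names that contain
spaces, since no shell is involved.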


Hope this helps!