pdf to text

Thu Jan 25 16:46:51 EST 2007

On Thursday 25 January 2007 22:05, tubby wrote:

> I know this question comes up a lot, so here goes again. I want to read
> text from a PDF file, run re searches on the text, etc. I do not care
> about layout, fonts, borders, etc. I just want the text. I've been
> reading Adobe's PDF Reference Guide and I'm beginning to develop a
> better understanding of PDF in general, but I need a bit of help... this
> seems like it should be easier than it is.

It _seems_ that way. ;-)

One of the more promising suggestions for a way to solve this came
up in a comp.lang.python thread last year:

http://groups.google.com/group/comp.lang.python/msg/cb6c97a44ce4cbe9?dmode=source

Basically, if you have access to the pdftotext command (it is part
of xpdf), you should be able to get reasonable plain text out of a
PDF file.
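
As a rough sketch of that approach (not taken from the thread above),
something like the following should work, assuming pdftotext is
installed and on your PATH. The function names are just made up for
illustration, and passing "-" as the output file is what asks
pdftotext to write to standard output:

```python
import re
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext command line (hypothetical helper)."""
    # "-" as the output file tells pdftotext to write to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Return the text of a PDF by running the external pdftotext tool."""
    # Assumption: pdftotext (from xpdf) is installed and on the PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def search_text(text, pattern):
    """Run a regular expression over the extracted text."""
    return re.findall(pattern, text)
```

You would then call something like
search_text(pdf_to_text("report.pdf"), r"invoice \d+"), where
report.pdf and the pattern are only placeholders.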

> I know the text is compressed... that it would have stream and endstream
> markers and BT (Begin Text) and ET (End Text) and that the uncompressed
> text is enclosed in parentheses (this is my text). Has anyone here done
> this in a simple fashion? I've played with the pyPdf library some, but
> it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
> just want a simple PDF text extractor.
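
For what it's worth, the do-it-yourself route described there can be
sketched in a few lines, but only under heavy assumptions: streams
are either uncompressed or use plain Flate (zlib) compression, and
the text appears as literal (parenthesised) strings inside BT ... ET
blocks. This deliberately ignores font encodings, hex strings, other
filters, encryption and object streams, so it will fail on many real
files. The function name is made up for illustration:

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" markers.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects: everything between BT (Begin Text) and ET (End Text).
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings in parentheses, allowing backslash escapes
# (nested unescaped parentheses are NOT handled).
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def naive_pdf_text(data):
    """Very naive text extraction from the raw bytes of a PDF file."""
    lines = []
    for stream in STREAM_RE.finditer(data):
        raw = stream.group(1)
        try:
            # Assumption: the stream is Flate-compressed. decompressobj
            # tolerates trailing bytes such as the final newline.
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Otherwise treat the stream as uncompressed.
            content = raw
        for block in TEXT_BLOCK_RE.finditer(content):
            pieces = []
            for string in STRING_RE.finditer(block.group(1)):
                # Undo the \( \) and \\ escapes only.
                piece = re.sub(rb"\\([()\\])", rb"\1", string.group(1))
                pieces.append(piece.decode("latin-1"))
            if pieces:
                lines.append("".join(pieces))
    return "\n".join(lines)
```

Even when it works, the strings come out in whatever order and
encoding the PDF producer chose, which is a large part of why this
is harder than it looks.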

The pdftotext tool may do what you want:

  http://www.foolabs.com/xpdf/download.html

Let us know how you get on with it.

David