Script to extract text from PDF files

byte8bits at gmail.com
Wed Sep 26 09:06:54 EDT 2007


On Sep 25, 10:19 pm, Lawrence D'Oliveiro
<l... at geek-central.gen.new_zealand> wrote:

> > Doesn't work that well...
>
> This is inherent in the nature of PDF: it's a page-description language, not
> a document-interchange language. Each text-drawing command can put a block
> of text anywhere on the page, so you have no idea, just from parsing the
> PDF content, how to join these blocks up into lines, paragraphs, columns
> etc.

So (I'm not being a wise guy) how does pdftotext do it so well? The
text I can extract with it comes out as it appears in the doc.
Although there are various ways to insert and encode text in PDFs,
they're all well documented in the PDF specification
(http://www.adobe.com/devnet/pdf/pdf_reference.html). Going back to
pdftotext... it works well at extracting text from PDFs. I'd like a
native Python library that does the same. This can be done, and it
can be done in Python. I've made a small start. My hope was that
others would be interested in helping, but I can do it on my own
too... it'll just take a lot longer :)
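
To make the "joining blocks up into lines" problem concrete, here is a
minimal sketch of one way it could be approached. It assumes the PDF
content has already been parsed into (x, y, text) fragments, which is a
hypothetical input shape and not the output of any particular library.
The function name and the y_tolerance parameter are made up for
illustration, and this is a guess at the general idea, not a
description of what pdftotext actually does.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text), in PDF user-space units.
Fragment = Tuple[float, float, str]


def fragments_to_lines(fragments: List[Fragment],
                       y_tolerance: float = 2.0) -> List[str]:
    """Group positioned text fragments into lines of text.

    Fragments whose y coordinate is within y_tolerance of the first
    fragment on the current line are treated as part of that line.
    """
    # PDF y coordinates grow upward, so sort by descending y to read
    # top-to-bottom, then by ascending x to read left-to-right.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))

    lines = []  # each entry is [line_y, [(x, text), ...]]
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])

    # Re-sort each line by x, since small y differences inside one
    # line can put its fragments out of left-to-right order.
    return [" ".join(t for _, t in sorted(frags)) for _, frags in lines]


if __name__ == "__main__":
    sample = [
        (120.0, 700.5, "world"),
        (72.0, 700.0, "Hello"),
        (72.0, 686.0, "second"),
        (110.0, 686.0, "line"),
    ]
    for line in fragments_to_lines(sample):
        print(line)
```

This only handles a single column of horizontal text. Columns,
rotated text, and word spacing would all need more work on top of it.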

Brad





