Script to extract text from PDF files

David Boddie david at boddie.org.uk
Wed Sep 26 10:55:40 EDT 2007


On Wed Sep 26 15:06:54 CEST 2007, byte8bits wrote:

> On Sep 25, 10:19 pm, Lawrence D'Oliveiro
> <l... at geek-central.gen.new_zealand> wrote:
> 
> > This is inherent in the nature of PDF: it's a page-description language,
> > not a document-interchange language. Each text-drawing command can put a
> > block of text anywhere on the page, so you have no idea, just from
> > parsing the PDF content, how to join these blocks up into lines,
> > paragraphs, columns etc.
> 
> So (I'm not being a wise guy) how does pdftotext do it so well?

There's a little information on that online:

  http://www.glyphandcog.com/textext.html

You would need to look at the source code to see exactly what it does.
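
To make the problem Lawrence describes a bit more concrete, here is a naive
sketch of one possible heuristic: group positioned text fragments into lines by
comparing their baselines. This is not what pdftotext does (I haven't checked
its source), and the Fragment type, the group_into_lines function and the
y_tolerance parameter are all made up for illustration. It assumes some parser
has already given you each fragment's page coordinates.

```python
from collections import namedtuple

# Hypothetical record for one text-drawing operation after parsing.
# x, y are page coordinates of the fragment's origin; as in PDF, y grows upward.
Fragment = namedtuple("Fragment", "x y text")


def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments with nearly equal baselines into lines of text.

    y_tolerance is an assumed threshold in page units, not a value taken
    from any real tool.
    """
    lines = []  # list of (baseline_y, [fragments on that line])

    # Visit fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            # Close enough to the current line's baseline: same line.
            lines[-1][1].append(frag)
        else:
            # Otherwise start a new line at this baseline.
            lines.append((frag.y, [frag]))

    # Within each line, order fragments left to right and join them.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments deliberately given out of reading order, as a PDF may store them.
fragments = [
    Fragment(150, 700, "world"),
    Fragment(72, 688, "second line"),
    Fragment(72, 700.5, "Hello"),
]
print("\n".join(group_into_lines(fragments)))
```

Even this toy version shows where the trouble starts: it would happily merge
text from two side-by-side columns into one line, because nothing in the
fragment positions alone says where a column ends.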

> The text I can extract from PDFs is extracted as it appears in the doc.
> Although there are various ways to insert and encode text in PDFs,
> it's also well documented in the PDF specifications
> (http://www.adobe.com/devnet/pdf/pdf_reference.html).

Just because inserting and encoding are well documented doesn't mean that the
reverse processes are easy. :-/

> Going back to pdftotext... it works well at extracting text from PDF.
> I'd like a native Python library that does the same.

Maybe you should look at the source code for pdftotext, if that's an option.

> This can be done, and it can be done in Python. I've made a small start.
> My hope was that others would be interested in helping, but I can do it
> on my own too... it'll just take a lot longer :)

Can I suggest that you approach one or more authors of the existing Python
PDF solutions and work with them on this? There are at least four PDF parsers
written in Python out there.

David


