Script to extract text from PDF files

Lawrence D'Oliveiro ldo at geek-central.gen.new_zealand
Tue Sep 25 22:19:00 EDT 2007


In message <1190747931.415834.75670 at n39g2000hsh.googlegroups.com>, 
byte8bits at gmail.com wrote:

> On Sep 25, 3:02 pm, Paul Hankin <paul.han... at gmail.com> wrote:
>
>> Googling for 'pdf to text python' and following the first link
>> gives http://pybrary.net/pyPdf/
> 
> Doesn't work that well...

This is inherent in the nature of PDF: it's a page-description language, not
a document-interchange language. Each text-drawing command can put a block
of text anywhere on the page, so you have no idea, just from parsing the
PDF content, how to join these blocks up into lines, paragraphs, columns
etc.
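
To make that concrete, here is a rough sketch of the kind of guesswork a
text extractor is forced into. It assumes you have already pulled out each
text-drawing command as a position plus a string; the Fragment record, the
group_into_lines function and the y_tolerance value are all made-up names
and numbers for illustration, not part of pyPdf or any other library.

```python
from collections import namedtuple

# Hypothetical record for one text-drawing command: where it draws, and what.
Fragment = namedtuple("Fragment", "x y text")

def group_into_lines(fragments, y_tolerance=2.0):
    """Guess at lines by clustering fragments whose baselines are close.

    This is a heuristic only: nothing in the PDF says these fragments
    belong together, and y_tolerance is an arbitrary assumed threshold.
    """
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    # PDF's y axis points up the page, so descending y reads top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within each guessed line, order the fragments left to right.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]

# Fragments in arbitrary drawing order, as a PDF is free to emit them.
page = [
    Fragment(300, 700.0, "world"),
    Fragment(72, 686.0, "second line"),
    Fragment(72, 700.5, "Hello"),
]
print(group_into_lines(page))
```

Even this much only works for simple single-column pages. Feed it two
columns whose lines happen to share a baseline and it will happily splice
the left and right columns together, which is probably the sort of output
you are seeing.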


