reading text in pdf, some working sample code

Wed Nov 22 19:39:48 EST 2017

On 2017-11-21, Daniel Gross <grossd18 at gmail.com> wrote:
> I am new to python and jumped right into trying to read out (english) text
> from PDF files.

That's not a trivial task. However, I just released pycpdf, which might
help you out. Check out https://github.com/jribbens/pycpdf which shows
an example of extracting text from PDFs. It may or may not cope with
the particular PDFs you have, as there's quite a lot of variety within
the format.

Example:

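    import pycpdf

    # Read the whole file as bytes and close it before parsing.
    with open("file.pdf", "rb") as f:
        pdf = pycpdf.PDF(f.read())
    # The document info dictionary may be absent or lack a title.
    if pdf.info and pdf.info.get('Title'):
        print('Title:', pdf.info['Title'])
    for pageno, page in enumerate(pdf.pages):
        print('Page', pageno + 1)
        print(page.text)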
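
Since it may not cope with every PDF, you might want to guard the
extraction so that one bad file doesn't stop a batch run. This is only a
rough sketch and not part of pycpdf: extract_pages and its parse argument
are made-up names, and catching Exception is an assumption, since which
exceptions get raised on a bad file isn't covered here.

```python
def extract_pages(data, parse):
    """Return a list of page texts, or None if the PDF can't be handled.

    `data` is the raw bytes of the file and `parse` is a callable that
    turns those bytes into an object with a `.pages` sequence whose items
    have a `.text` attribute (the shape used in the example above).
    """
    try:
        pdf = parse(data)
        # Build the list inside the try block, in case text is only
        # decoded when each page is accessed.
        return [page.text for page in pdf.pages]
    except Exception as exc:  # assumption: exact exception types unknown
        print('Could not extract text:', exc)
        return None
```

With pycpdf installed, the call would be extract_pages(data, pycpdf.PDF),
where data is the bytes read from the file. Passing the parser in as an
argument is just so the function can be tried out without a real PDF.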