reading text in pdf, some working sample code

Wed Nov 22 21:07:54 EST 2017

On Wed, Nov 22, 2017 at 5:39 PM, Jon Ribbens <jon+usenet at unequivocal.eu>
wrote:

> On 2017-11-21, Daniel Gross <grossd18 at gmail.com> wrote:
> > I am new to python and jumped right into trying to read out (english) text
> > from PDF files.
>
> That's not a trivial task. However I just released pycpdf, which might
> help you out. Check out https://github.com/jribbens/pycpdf which shows
> an example of extracting text from PDFs. It may or may not cope with
> the particular PDFs you have, as there's quite a lot of variety within
> the format.
>
> Example:
>
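>     import pycpdf
>
>     # Read the whole file into memory; pycpdf.PDF takes the raw bytes.
>     with open("file.pdf", "rb") as f:
>         pdf = pycpdf.PDF(f.read())
>     # The Info dictionary is optional, so check before using it.
>     if pdf.info and pdf.info.get('Title'):
>         print('Title:', pdf.info['Title'])
>     for pageno, page in enumerate(pdf.pages):
>         print('Page', pageno + 1)
>         print(page.text)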

Sorry if I'm late to this party, but I use pdf2txt for this, the
command-line tool that comes with pdfminer. It works just fine and has
options for different encodings, page ranges, etc. On Debian or Ubuntu,
"apt install python-pdfminer" installs it.
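
For anyone who would rather call pdf2txt from a Python script than from
the shell, here is a minimal sketch. It assumes the tool is on your PATH
under the name "pdf2txt" (a pip install of pdfminer usually names it
"pdf2txt.py" instead, so the name is a parameter), and it uses only the
-c (codec) and -p (comma-separated page numbers) options. The function
names build_pdf2txt_command and extract_text are made up for this
example and are not part of pdfminer.

```python
import shutil
import subprocess


def build_pdf2txt_command(pdf_path, pages=None, codec="utf-8",
                          executable="pdf2txt"):
    """Return the argument list for a pdf2txt call (nothing is run here)."""
    cmd = [executable, "-c", codec]
    if pages:
        # -p takes a comma-separated list of 1-based page numbers.
        cmd += ["-p", ",".join(str(p) for p in pages)]
    cmd.append(pdf_path)
    return cmd


def extract_text(pdf_path, pages=None, codec="utf-8", executable="pdf2txt"):
    """Run pdf2txt and return the extracted text as a string."""
    # Assumption: the executable name depends on how pdfminer was installed.
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    cmd = build_pdf2txt_command(pdf_path, pages, codec, executable)
    # pdf2txt writes to stdout when no output file is given.
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(codec)
```

Something like print(extract_text("file.pdf", pages=[1, 2])) should then
print the text of the first two pages, provided pdf2txt can cope with
the file.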

-- 

**** Listen to my FREE CD at http://www.mellowood.ca/music/cedars ****
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: bob at mellowood.ca
WWW:   http://www.mellowood.ca