[TriPython] Any recommendations on PDF Reader library in Python?

Philip Semanchuk philip at semanchuk.com
Sat Apr 8 21:05:09 EDT 2017


> On Apr 8, 2017, at 9:51 AM, Ginny Ghezzo <ginnyghezzo at gmail.com> wrote:
> 
>   Does anyone have a favorite library for reading .pdfs?
>   I want to pull a schedule from a .pdf file and put it on my calendar.
>   (Side note: I know how to do two step conversion from .pdf to another
>   format and then use pandas but wanted to cut out the middle man.)
>   Cheers,
>   Ginny

Hi Ginny,
PDFMiner does a nice job of reading PDFs based on my limited experience, but the data it produces is pretty raw (i.e. a jumble of characters with associated (x, y) coordinates). I used PDFMiner in the test portion of a project and wrote some code to make its output less raw. That code is open source, so you could use it too. The project was a registration system for the Libyan national election, so there are some bits in the code specific to handling Arabic.

The code is here:
https://github.com/SmartElect/SmartElect/blob/develop/rollgen/tests/utils_for_tests.py

The three functions you’d use are extract_pdf_page(), extract_textlines(), and maybe clean_textlines(), like so:
>>> xml = extract_pdf_page('a_filename.pdf', 1)
>>> lines = extract_textlines(xml)
>>> lines = clean_textlines(lines)

You can see examples of how I used these functions here:
https://github.com/SmartElect/SmartElect/blob/develop/rollgen/tests/test_generate_pdf.py
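
To give a rough idea of what making character-level output "less raw" can involve, here is a minimal sketch that groups characters into lines by their coordinates. It is not the SmartElect code and does not call PDFMiner at all. The function name group_chars_into_lines, the (char, x, y) tuple format, and the y_tolerance parameter are all made up for illustration, and it assumes the usual PDF convention that y grows upward from the bottom of the page.

```python
def group_chars_into_lines(chars, y_tolerance=2.0):
    """Group (char, x, y) tuples into lines of text, top of page first.

    Hypothetical helper: a character joins the current line when its y
    coordinate is within y_tolerance of that line's first character.
    """
    lines = []  # each entry is [line_y, [(x, char), ...]]
    # Assumption: larger y means higher on the page, so sort descending by y.
    for char, x, y in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    # Within each line, order characters left to right by x.
    return [
        "".join(ch for _, ch in sorted(members, key=lambda m: m[0]))
        for _, members in lines
    ]


# Made-up sample data: two short lines, characters deliberately out of order.
sample = [
    ("i", 12.0, 700.1),
    ("H", 10.0, 700.0),
    ("k", 12.0, 680.0),
    ("O", 10.0, 680.2),
]
print(group_chars_into_lines(sample))  # prints ['Hi', 'Ok']
```

A real schedule would also need word spacing and column detection, which this sketch ignores; the tolerance value would need tuning to the font size in the actual PDF.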


Hope this helps
Philip



