[Pandas-dev] Python library to extract tables from PDF files

Vinayak Mehta vmehta94 at gmail.com
Thu Sep 27 06:49:04 EDT 2018


Hello everyone!

I'm a software engineer based out of New Delhi, India. I've been a long-time
pandas user and have used it in countless projects and scripts! Thanks to the
core developers and contributors for working on it!

I recently released a Python library which lets users extract data tables
out of PDF files, my first open-source library! Here's the link:
https://github.com/socialcopsdev/camelot

Its API is similar to the pandas read_* functions, most closely
read_html(). Like read_html(), it has a read_pdf() main interface which
returns a list of pandas DataFrames, one for each table found in the PDF
file, and it offers two flavors for parsing different types of tables!
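
As a minimal sketch of that interface (not an excerpt from the library's
docs): the wrapper name extract_tables and its arguments are hypothetical,
and the sketch assumes that each table returned by camelot.read_pdf()
exposes its DataFrame through a df attribute.

```python
def extract_tables(pdf_path, flavor="lattice"):
    """Return one pandas DataFrame per table found in the PDF.

    Hypothetical helper for illustration; only camelot.read_pdf() and
    the flavor argument come from the library itself.
    """
    # Imported lazily so this module loads even without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, flavor=flavor)
    # Assumption: each returned table carries its DataFrame as `.df`.
    return [table.df for table in tables]
```

Calling extract_tables("report.pdf") (a placeholder file name) would then
give a plain list of DataFrames, much like the result of read_html().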

I've created a comparison with other open-source PDF table extraction
libraries and tools in the wiki:
<https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>

I would be really grateful if you could check it out, see if it's useful
to you, and give me any feedback that may help me improve it. I promise it
would take less than 5 minutes of your time! :)

To the core devs: I was wondering if pandas would be open to accepting this
library as a contribution to its read_* interface? The library uses
OpenCV's morphological transformations to detect lines in PDFs when
flavor='lattice', which I could vendor or re-implement. It also has two
system-specific dependencies: python-tk (used by matplotlib) and
ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
problem since pandas also uses matplotlib, and for the second one, I could
look for a Python library alternative to ghostscript.
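
To give a rough idea of what re-implementing the line detection could
involve, here is a dependency-free sketch of a morphological opening
(erosion followed by dilation) with a long, thin structuring element on a
tiny binary image. It is an illustration of the general technique only,
not the library's actual code; the function names and the toy image are
made up for this example.

```python
def open_horizontal(image, length):
    """Morphological opening with a 1 x `length` structuring element.

    Only horizontal runs of at least `length` set pixels survive.
    """
    opened = [[0] * len(row) for row in image]
    for r, row in enumerate(image):
        for c in range(len(row) - length + 1):
            # Erosion: the window fits entirely inside set pixels.
            if all(row[c:c + length]):
                # Dilation: restore every pixel covered by that window.
                for j in range(c, c + length):
                    opened[r][j] = 1
    return opened


def open_vertical(image, length):
    """Same opening along columns, done by transposing the image."""
    transposed = [list(col) for col in zip(*image)]
    opened = open_horizontal(transposed, length)
    return [list(col) for col in zip(*opened)]


# Toy "page": a table border with one stray text pixel in the middle.
page = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

horizontal_lines = open_horizontal(page, 4)  # keeps the top and bottom rows
vertical_lines = open_vertical(page, 4)      # keeps the left and right columns
```

The stray pixel in the middle is too short to survive either opening, so
only the table's ruling lines remain in the two masks.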

Looking forward to hearing from you all!

Thanks for your time!

Vinayak