[Pandas-dev] Python library to extract tables from PDF files
Vinayak Mehta
vmehta94 at gmail.com
Thu Sep 27 06:49:04 EDT 2018
Hello everyone!
I'm a software engineer based out of New Delhi, India. I've been a
long-time pandas user and have used it in countless projects and scripts!
Thanks to the core developers and contributors for working on it!
I recently released a Python library which lets users extract data tables
out of PDF files, my first open-source library! Here's the link:
https://github.com/socialcopsdev/camelot
It has a similar API to the pandas read_* functions, bearing most
similarity to read_html(). Like read_html(), it has a read_pdf() main
interface which returns a list of pandas DataFrames, one for each table
found in the PDF file. It also offers two flavors for parsing different
types of tables!
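
As a rough sketch of what a call might look like: this assumes the camelot package is installed, that read_pdf() accepts `pages` and `flavor` keyword arguments, and that each returned table exposes its data through a `.df` attribute. The helper name `extract_tables` and the file name are hypothetical, used only for illustration.

```python
def extract_tables(pdf_path, flavor="lattice", pages="1"):
    """Return one pandas DataFrame per table found in the PDF."""
    # Imported lazily so this sketch can be loaded without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Assumption: each detected table exposes its data through a .df attribute.
    return [table.df for table in tables]


# Hypothetical usage; "report.pdf" is a placeholder file name.
# frames = extract_tables("report.pdf", flavor="lattice")
# print(frames[0].head())
```

This mirrors how read_html() is typically used: call once, then index into the returned list to get the table you want.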
I've created a comparison with other open-source PDF table extraction
libraries and tools in the wiki here:
<https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
I would be really grateful if you could check it out, see if it's useful
to you, and give me any feedback that may help me improve it. I promise it
would take less than 5 minutes of your time! :)
To the core devs: I was wondering if pandas would be open to accepting this
library as a contribution to its read_* interface? The library uses
OpenCV's morphological transformations to detect lines in PDFs when
flavor='lattice', which I could vendor or re-implement. It also has two
system-specific dependencies: python-tk (used by matplotlib) and
ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
problem since pandas also uses matplotlib, and for the second one, I could
look for a Python library alternative to ghostscript.
Looking forward to hearing from you all!
Thanks for your time!
Vinayak