[Pandas-dev] Python library to extract tables from PDF files

Tue Apr 30 05:05:30 EDT 2019

I won't be at PyCon, but happy to review a PR!

On Mon, Apr 29, 2019 at 10:28 AM Vinayak Mehta <vmehta94 at gmail.com> wrote:

> Sorry for the really late reply here. I agree that adding the package to
> the ecosystem page will be a good first step! I'll make a PR.
>
> I completely understand that something like this can't be included inside
> pandas because it calls a lot of external packages.
>
> I'll be at PyCon this week, would love to chat more about this if you're
> around! :)
>
> Vinayak
>
> On Fri, Nov 2, 2018 at 3:39 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Hi Vinayak,
>>
>> Thanks for mentioning this package on the list! Camelot looks like a
>> really useful package to me (I needed to extract some data out of a pdf
>> last week, so I was able to give it a try, and it did exactly what I wanted
>> :)).
>>
>> Regarding your question about adding it to pandas itself: personally I am
>> a bit hesitant to further broaden the scope of what is already in pandas,
>> although in this case it would mainly be calling the external package (but
>> which has quite some dependencies).
>> I think it is also nice to have a good ecosystem of packages that provide
>> additional IO functionality, but then we should do a better job advertising
>> them.
>>
>> In any case, I think it would already be a good first step to list the
>> package on the ecosystem page in the docs:
>> http://pandas.pydata.org/pandas-docs/stable/ecosystem.html (regardless
>> of the above discussion). And we could maybe also have a section on
>> additional formats on the IO page.
>> PR very welcome for that!
>>
>> Best,
>> Joris
>>
>> On Fri, Sep 28, 2018 at 8:02 PM Vinayak Mehta <vmehta94 at gmail.com> wrote:
>>
>>> I've created a Jupyter notebook which shows an example of how Camelot
>>> makes it easy to extract tables out of PDFs. In the example, I scrape a PDF
>>> from this disease outbreaks data source[1] using requests, extract tables
>>> from each page of the PDF and then concat those tables. Here's the gist!
>>> https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873
>>> :)
>>>
>>> [1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689
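
The workflow described above (download a PDF, extract the tables from each page, then concatenate them) could look roughly like the sketch below. This is not the code from the gist: the helper names `combine_tables` and `pdf_url_to_frame`, the local file name, and the timeout are all assumptions made for illustration, and it assumes requests, Camelot and pandas are installed.

```python
import pandas as pd


def combine_tables(frames):
    """Stack per-page DataFrames into one frame with a fresh 0..n-1 index.

    Hypothetical helper, not part of Camelot or pandas.
    """
    frames = list(frames)
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def pdf_url_to_frame(url, local_path="downloaded.pdf"):
    """Download a PDF, extract every table Camelot finds, and combine them.

    Hypothetical helper; `local_path` is a placeholder file name.
    """
    # Imported lazily so combine_tables() is usable without these packages.
    import requests
    import camelot

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(local_path, "wb") as handle:
        handle.write(response.content)

    # "1-end" asks Camelot to look at every page of the document.
    tables = camelot.read_pdf(local_path, pages="1-end")
    # Each extracted table exposes its contents as a DataFrame via `.df`.
    return combine_tables(table.df for table in tables)
```

Passing `ignore_index=True` to `pd.concat` is a design choice here: the per-page tables each start counting rows at 0, so renumbering avoids duplicate index labels in the combined frame.
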
>>>
>>> On Thu, Sep 27, 2018 at 4:19 PM Vinayak Mehta <vmehta94 at gmail.com>
>>> wrote:
>>>
>>>> Hello everyone!
>>>>
>>>> I'm a software engineer based out of New Delhi, India. I've been a
>>>> long-time pandas user and have used it in countless projects and scripts!
>>>> Thanks to the core developers and contributors for working on it!
>>>>
>>>> I recently released a Python library which lets users extract data
>>>> tables out of PDF files, my first open-source library! Here's the link:
>>>> https://github.com/socialcopsdev/camelot
>>>>
>>>> It has a similar API to the pandas read_* functions, bearing most
>>>> similarity to read_html(). Like read_html(), it has a read_pdf() main
>>>> interface which returns a list of pandas DataFrames for each table found in
>>>> the PDF file, and contains two flavors for parsing different types of
>>>> tables!
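
A minimal sketch of the interface described above, assuming Camelot is installed. The helper `extract_frames` is hypothetical and only wraps `camelot.read_pdf()`. The 'lattice' flavor is named later in this message; 'stream' is, as far as I know, the name of the second flavor. In the released library the returned collection holds table objects, and each one exposes its DataFrame as `.df`.

```python
KNOWN_FLAVORS = ("lattice", "stream")


def extract_frames(pdf_path, flavor="lattice", pages="1"):
    """Return one pandas DataFrame per table Camelot finds in the PDF.

    Hypothetical helper for illustration; only camelot.read_pdf() is
    part of the library itself.
    """
    if flavor not in KNOWN_FLAVORS:
        raise ValueError(
            f"flavor must be one of {KNOWN_FLAVORS}, got {flavor!r}"
        )

    # Imported lazily so the argument check works without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Each detected table carries its data as a DataFrame in `.df`.
    return [table.df for table in tables]
```

Calling `extract_frames("report.pdf", flavor="lattice")`, where `report.pdf` is a placeholder path, would then give a plain Python list of DataFrames, much like the list that `pandas.read_html()` returns.
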
>>>>
>>>> I've created a comparison with other open-source PDF table extraction
>>>> libraries and tools in the wiki here
>>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
>>>> .
>>>>
>>>> I would be really grateful if you could check it out, see if it's
>>>> useful to you, and give me any feedback that may help me improve it. I
>>>> promise it would take less than 5 minutes of your time! :)
>>>>
>>>> To the core devs: I was wondering if pandas would be open to accepting
>>>> this library as a contribution to its read_* interface? The library uses
>>>> OpenCV's morphological transformations to detect lines in PDFs when
>>>> flavor='lattice', which I could vendorize or re-implement. It also has two
>>>> system specific dependencies which are python-tk (used by matplotlib) and
>>>> ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
>>>> problem since pandas also uses matplotlib, and for the second one, I could
>>>> look for a Python library alternative to ghostscript.
>>>>
>>>> Looking forward to hearing from you all!
>>>>
>>>> Thanks for your time!
>>>>
>>>> Vinayak
>>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>