From vmehta94 at gmail.com  Mon Apr 29 04:27:49 2019
From: vmehta94 at gmail.com (Vinayak Mehta)
Date: Mon, 29 Apr 2019 13:57:49 +0530
Subject: [Pandas-dev] Python library to extract tables from PDF files
In-Reply-To:
References:
Message-ID:

Sorry for the really late reply here. I agree that adding the package to
the ecosystem page will be a good first step! I'll make a PR.

I completely understand that something like this can't be included inside
pandas because it calls a lot of external packages.

I'll be at PyCon this week, would love to chat more about this if you're
around! :)

Vinayak

On Fri, Nov 2, 2018 at 3:39 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Hi Vinayak,
>
> Thanks for mentioning this package on the list! Camelot looks like a
> really useful package to me (I needed to extract some data out of a PDF
> last week, so I was able to give it a try, and it did exactly what I
> wanted :)).
>
> Regarding your question about adding it to pandas itself: personally I am
> a bit hesitant to further broaden the scope of what is already in pandas,
> although in this case it would mainly be calling the external package
> (which does have quite a few dependencies).
> I think it is also nice to have a good ecosystem of packages that provide
> additional IO functionality, but then we should do a better job of
> advertising them.
>
> In any case, I think it would already be a good first step to list the
> package on the ecosystem page in the docs:
> http://pandas.pydata.org/pandas-docs/stable/ecosystem.html (regardless of
> the above discussion). And we could maybe also have a section on
> additional formats on the IO page.
> PR very welcome for that!
>
> Best,
> Joris
>
> On Fri, Sep 28, 2018 at 20:02, Vinayak Mehta wrote:
>
>> I've created a Jupyter notebook which shows an example of how Camelot
>> makes it easy to extract tables out of PDFs. In the example, I scrape a
>> PDF from this disease outbreaks data source[1] using requests, extract
>> tables from each page of the PDF and then concat those tables. Here's
>> the gist!
>> https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 :)
>>
>> [1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689
>>
>> On Thu, Sep 27, 2018 at 4:19 PM Vinayak Mehta wrote:
>>
>>> Hello everyone!
>>>
>>> I'm a software engineer based out of New Delhi, India. I've been a
>>> long-time pandas user and have used it in countless projects and
>>> scripts! Thanks to the core developers and contributors for working on
>>> it!
>>>
>>> I recently released a Python library which lets users extract data
>>> tables out of PDF files, my first open-source library! Here's the link:
>>> https://github.com/socialcopsdev/camelot
>>>
>>> It has a similar API to the pandas read_* functions, bearing most
>>> similarity to read_html(). Like read_html(), it has a read_pdf() main
>>> interface which returns a list of pandas DataFrames, one for each table
>>> found in the PDF file, and contains two flavors for parsing different
>>> types of tables!
>>>
>>> I've created a comparison with other open-source PDF table extraction
>>> libraries and tools in the wiki here.
>>>
>>> I would be really grateful if you could check it out and see if it's
>>> useful to you, and give me any feedback that may help me improve it. I
>>> promise it would take less than 5 minutes of your time! :)
>>>
>>> To the core devs: I was wondering if pandas would be open to accepting
>>> this library as a contribution to its read_* interface? The library uses
>>> OpenCV's morphological transformations to detect lines in PDFs when
>>> flavor='lattice', which I could vendorize or re-implement. It also has
>>> two system-specific dependencies, which are python-tk (used by
>>> matplotlib) and ghostscript (used to convert PDF to PNG).
>>> The first one shouldn't pose a problem since pandas also uses
>>> matplotlib, and for the second one, I could look for a Python library
>>> alternative to ghostscript.
>>>
>>> Looking forward to hearing from you all!
>>>
>>> Thanks for your time!
>>>
>>> Vinayak
>>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com  Tue Apr 30 05:05:30 2019
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 30 Apr 2019 11:05:30 +0200
Subject: [Pandas-dev] Python library to extract tables from PDF files
In-Reply-To:
References:
Message-ID:

I won't be at PyCon, but happy to review a PR!

On Mon, Apr 29, 2019 at 10:28, Vinayak Mehta wrote:

> Sorry for the really late reply here. I agree that adding the package to
> the ecosystem page will be a good first step! I'll make a PR.
>
> I completely understand that something like this can't be included inside
> pandas because it calls a lot of external packages.
>
> I'll be at PyCon this week, would love to chat more about this if you're
> around! :)
>
> Vinayak