[PSF-Community] Python library to extract data tables from PDF files

Vinayak Mehta vmehta94 at gmail.com
Fri Sep 28 15:36:54 EDT 2018


Hello David!

Yes, I've created a wiki page comparing Camelot with other open source
tools and libraries. tabula-py is a wrapper over tabula-java, which is used
by Tabula. You can check out the comparison of Camelot with Tabula here
<https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools#tabula>.
As you can see in the comparison, Camelot outperforms Tabula in almost all cases!

While Tabula either gives good output or fails miserably, Camelot
gives you complete control over the extraction process with various
configuration parameters! You can check out this section of the README
<https://github.com/socialcopsdev/camelot#why-camelot> for more
information. Camelot also lets you plot various geometries, like detected
lines, intersections, and tables in the PDF, to debug and improve table
extraction! You can check out this part of the documentation
<https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry>
for more information on that.
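
As a rough illustration of that control, here is a minimal sketch. It
assumes the keyword names used by later Camelot releases (pages, flavor,
split_text), which may differ from the version discussed in this thread.
The helper extract_tables and the file name report.pdf are hypothetical,
made up only for this example.

```python
def extract_tables(pdf_path, pages="1", flavor="lattice", **options):
    """Return one pandas DataFrame per table that Camelot detects.

    Assumption: camelot.read_pdf accepts `pages`, `flavor`, and extra
    keyword options, as in later Camelot releases.
    """
    # Imported lazily so this module can be loaded without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor, **options)
    # Each detected table exposes its contents as a DataFrame via `.df`.
    return [table.df for table in tables]


# Hypothetical usage, assuming a local file named report.pdf:
# frames = extract_tables("report.pdf", pages="all", flavor="stream", split_text=True)
```

Passing the options through unchanged keeps the sketch independent of any
one parser setting; the README section linked above describes which ones
are available.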

Try it out!

Vinayak

On Sat, Sep 29, 2018 at 12:34 AM David Mertz <mertz at gnosis.cx> wrote:

> Have you compared your tool with existing ones, such as
> https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302
> ?
>
> What notable difference in API and/or accuracy do you have?
>
> On Fri, Sep 28, 2018 at 2:32 PM Vinayak Mehta <vmehta94 at gmail.com> wrote:
>
>> I've created a Jupyter notebook which shows an example of how Camelot makes
>> it easy to extract tables out of PDFs.
>>
>>
>> In the example, I scrape a PDF from an Indian disease outbreaks data
>> source [1] using requests, extract tables from each page of the PDF using
>> Camelot, and then concatenate those tables. Here's the gist:
>> https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 :)
>>
>> [1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689
>>
>>
>> On Fri, Sep 28, 2018 at 12:01 PM Vinayak Mehta <vmehta94 at gmail.com>
>> wrote:
>>
>>> Hello everyone!
>>>
>>> I recently released a Python library which lets users extract data
>>> tables out of PDF files, my first open source library! Here's the link:
>>> https://github.com/socialcopsdev/camelot
>>>
>>> I've created a wiki page
>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
>>> comparing it to other open source PDF table extraction tools. I'm currently
>>> working on porting it to Python 3!
>>>
>>> I would be really grateful if you could check it out, see if it's
>>> useful to you, and give me any feedback that may help me improve it by
>>> replying here, opening an issue or a pull request!
>>>
>>> Looking forward to hearing from you all!
>>>
>>> Thanks for your time!
>>>
>>> Vinayak
>>>
>> _______________________________________________
>> PSF-Community mailing list
>> PSF-Community at python.org
>> https://mail.python.org/mailman/listinfo/psf-community
>>
>
>
> --
> Keeping medicines from the bloodstreams of the sick; food
> from the bellies of the hungry; books from the hands of the
> uneducated; technology from the underdeveloped; and putting
> advocates of freedom in prisons.  Intellectual property is
> to the 21st century what the slave trade was to the 16th.
>