[PSF-Community] Python library to extract data tables from PDF files

Álvaro Justen [Turicas] alvarojusten at gmail.com
Fri Sep 28 16:10:32 EDT 2018


Hi, Vinayak! Good work, thanks for sharing. :)

I'm the creator of the rows library[http://turicas.info/rows] and
implemented PDF support earlier this year (with 3 different strategies).
It's not released on PyPI yet, since I'm fixing some bugs before
releasing the next version, but you can try it out by installing:

    pip install \
        git+https://github.com/turicas/rows.git@feature/plugin-pdf#egg=rows \
        pdfminer.six cached-property

It's 100% written in Python and also has a command-line interface (so
you can run `rows convert http://example.com/file.pdf
newfile.(csv|xls|xlsx|html|sqlite)` or even `rows query "SELECT * FROM
table1 WHERE some_condition" http://example.com/file.pdf
--output=result.xls`).

The idea behind the extraction algorithms is to be flexible, so you
can plug your own if you want (depending on how the PDF is created,
the objects will be very different and you cannot use the same
ordering/grouping strategy).
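
To make the "plug your own strategy" idea concrete, here is a minimal,
self-contained sketch. It is not the actual rows API: `TextObject`,
`group_by_y_tolerance` and `extract_table` are hypothetical names made up
for illustration, and the coordinates are invented sample data. It only
shows how a grouping strategy could be passed in as a plain function, so
that a different PDF layout can use a different rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextObject:
    """Hypothetical text fragment with the top-left corner of its box."""
    x: float
    y: float
    text: str


# A strategy receives the text objects of one page and returns table rows.
Strategy = Callable[[List[TextObject]], List[List[str]]]


def group_by_y_tolerance(tolerance: float) -> Strategy:
    """Build a strategy that places an object on the current row when its
    y coordinate is within `tolerance` of the first object of that row."""

    def strategy(objects: List[TextObject]) -> List[List[str]]:
        rows = []  # each entry: (y of the row's first object, members)
        for obj in sorted(objects, key=lambda o: (o.y, o.x)):
            if rows and abs(obj.y - rows[-1][0]) <= tolerance:
                rows[-1][1].append(obj)
            else:
                rows.append((obj.y, [obj]))
        # Inside each row, order the cells from left to right.
        return [
            [o.text for o in sorted(members, key=lambda o: o.x)]
            for _, members in rows
        ]

    return strategy


def extract_table(objects: List[TextObject], strategy: Strategy) -> List[List[str]]:
    """Run whichever grouping strategy the caller plugged in."""
    return strategy(objects)


# Invented sample data: two visual rows whose y values are slightly off.
objects = [
    TextObject(10, 100.0, "name"), TextObject(80, 100.4, "age"),
    TextObject(10, 120.0, "Ana"), TextObject(80, 119.7, "31"),
]
table = extract_table(objects, group_by_y_tolerance(1.0))
print(table)
```

With a tolerance of 1.0 the four fragments collapse into two rows; with a
tolerance of 0.0 every fragment ends up on its own row, which is the kind
of difference that makes a single fixed strategy unsuitable for every PDF.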

I'm now implementing support for extracting tables from images (and
also from PDFs with images), but it probably won't make it into the
next version since I need a better OCR tool. What do you think about
joining efforts so we can have better libraries? I'm going to test the
PDFs you've cited with my code so we can compare better. Feel free to
contact me directly or join the chat at https://gitter.im/turicas/rows

Cheers,
 Álvaro Justen "Turicas"
    turicas.info / @turicas (twitter, github, youtube)
   +55 41 999 311 221

On Fri, Sep 28, 2018 at 11:43 AM Vinayak Mehta <vmehta94 at gmail.com> wrote:
>
> Hello everyone!
>
> I recently released my first open source library: a Python library that lets users extract data tables from PDF files! Here's the link: https://github.com/socialcopsdev/camelot
>
> I've created a wiki page comparing it to other open source PDF table extraction tools. I'm currently working on porting it to Python 3!
>
> I would be really grateful if you could check it out, see if it's useful to you, and give me any feedback that may help me improve it, by replying here or by opening an issue or a pull request!
>
> Looking forward to hearing from you all!
>
> Thanks for your time!
>
> Vinayak
> _______________________________________________
> PSF-Community mailing list
> PSF-Community at python.org
> https://mail.python.org/mailman/listinfo/psf-community