[PSF-Community] Python library to extract data tables from PDF files

Vinayak Mehta vmehta94 at gmail.com
Mon Oct 1 05:24:50 EDT 2018


Thanks Alvaro! rows looks top-notch; I'll check it out! Support for
extracting tables from images is on my roadmap too, so I'll drop by the rows
Gitter channel to discuss this further! :)

On Sat, Sep 29, 2018 at 1:40 AM Álvaro Justen [Turicas] <
alvarojusten at gmail.com> wrote:

> Hi, Vinayak! Good work, thanks for sharing. :)
>
> I'm the creator of the rows library [http://turicas.info/rows] and
> implemented PDF support earlier this year (with 3 different strategies).
> It's not released on PyPI yet, since I'm fixing some bugs before
> releasing the next version, but you can try it out by installing:
>
>     pip install git+https://github.com/turicas/rows.git@feature/plugin-pdf#egg=rows pdfminer.six cached-property
>
> It's written 100% in Python and also has a command-line interface, so
> you can run:
>
>     rows convert http://example.com/file.pdf newfile.(csv|xls|xlsx|html|sqlite)
>
> or even:
>
>     rows query "SELECT * FROM table1 WHERE some_condition" http://example.com/file.pdf --output=result.xls
>
> The idea behind the extraction algorithms is to be flexible, so you
> can plug in your own if you want (depending on how the PDF was created,
> its objects can be very different, so the same ordering/grouping
> strategy won't work for every file).
>
> I'm now implementing support for extracting tables from images (and also
> from PDFs with images), but it probably won't make it into the next
> version since I need a better OCR tool. What do you think about joining
> efforts so we can have better libraries? I'm going to test the PDFs
> you've cited with my code so we can compare results. Feel free to
> contact me directly or join the chat at https://gitter.im/turicas/rows
>
> Cheers,
>  Álvaro Justen "Turicas"
>     turicas.info / @turicas (twitter, github, youtube)
>    +55 41 999 311 221
>
> On Fri, Sep 28, 2018 at 11:43 AM Vinayak Mehta <vmehta94 at gmail.com> wrote:
> >
> > Hello everyone!
> >
> > I recently released a Python library that lets users extract data
> > tables from PDF files. It's my first open source library! Here's the link:
> > https://github.com/socialcopsdev/camelot
> >
> > I've created a wiki page comparing it to other open source PDF table
> > extraction tools. I'm currently working on porting it to Python 3!
> >
> > I would be really grateful if you could check it out, see if it's
> > useful to you, and give me any feedback that may help me improve it by
> > replying here, opening an issue or a pull request!
> >
> > Looking forward to hearing from you all!
> >
> > Thanks for your time!
> >
> > Vinayak
> > _______________________________________________
> > PSF-Community mailing list
> > PSF-Community at python.org
> > https://mail.python.org/mailman/listinfo/psf-community
>