[PSF-Community] Python library to extract data tables from PDF files

Vasudev Ram vasudevram at gmail.com
Fri Sep 28 15:56:24 EDT 2018


Very interesting, and congrats, Vinayak.

As someone who works on both PDF generation [1] and PDF text
extraction [2], I'd like to know what issues you faced w.r.t. the
accuracy of text extraction and the preservation of formatting.

[1] I'm the creator of xtopdf, a Python toolkit for PDF generation
from other file formats;

http://slides.com/vasudevram/xtopdf

http://bitbucket.org/vasudevram/xtopdf

[2] I worked on a project to extract text from PDF files. It used a C
library (xpdf), though, not a Python one. However, the text extraction
accuracy issues (some of which are technical issues inherent in the
PDF format, according to Glyph & Cog, the vendor of xpdf) are
language-independent. We saw transposed characters, missing
characters, and occasional junk characters. (I also wrote a heuristic
program to detect some such issues, but it could only reject the bad
extracts, not correct them.)
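
To give a rough idea of what such a check can look like, here is a
minimal Python sketch. It is not the program from that project (that
one was not written in Python); the function names and the 2%
threshold are assumptions made up for illustration. It only catches
junk characters; transposed or missing characters would need something
else, such as a dictionary lookup.

```python
import unicodedata


def junk_ratio(text):
    """Return the fraction of characters that look like extraction junk."""
    if not text:
        return 0.0
    junk = 0
    for ch in text:
        # Ordinary whitespace control characters are expected in text.
        if ch in "\n\r\t":
            continue
        category = unicodedata.category(ch)
        # Cc = control, Co = private use, Cn = unassigned;
        # U+FFFD is the Unicode replacement character.
        if category in ("Cc", "Co", "Cn") or ch == "\ufffd":
            junk += 1
    return junk / len(text)


def looks_bad(text, max_junk_ratio=0.02):
    """Flag an extract for rejection. The threshold is an arbitrary guess."""
    return junk_ratio(text) > max_junk_ratio
```

As with my original program, this can only reject a bad extract; it
cannot repair one.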

So the extraction was not 100% accurate, at least in my project. Also,
as I said, that vendor considers the issues inherent in PDF, partly
because it is a canvas-based model, not a text-based one.

I'll try to check out your project some time later.

Cheers,
Vasudev
-- 
vi quickstart: https://gumroad.com/l/vi_quick
Web site:      https://vasudevram.github.io
Blog:          https://jugad2.blogspot.com
Products:      https://gumroad.com/vasudevram

> While Tabula either gives good output or fails miserably, Camelot
> gives you complete control over the extraction process with various
> configuration parameters! You can check out this section of the README
> <https://github.com/socialcopsdev/camelot#why-camelot> for more
> information. Camelot also lets you plot various geometries like detected
> lines, intersections, tables in the PDF to debug and improve table
> extraction! You can check out this part of the documentation
> <https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry>
> for more information on that.
>

>>>> Hello everyone!
>>>>
>>>> I recently released a Python library which lets users extract data
>>>> tables out of PDF files, my first open source library! Here's the link:
>>>> https://github.com/socialcopsdev/camelot
>>>>
>>>> I've created a wiki page
>>>> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
>>>> comparing it to other open source PDF table extraction tools. I'm
>>>> currently working on porting it to Python 3!
>>>>
>>>> I would be really grateful if you could check it out and see if it's
>>>> useful to you and give me any feedback that may help me improve it, by
>>>> replying here, opening an issue or a pull request!
>>>>
>>>> Looking forward to hearing from you all!
>>>>
>>>> Thanks for your time!
>>>>
>>>> Vinayak
>>>>

