PDF->Text converter/extractor

Mon Nov 5 16:34:53 EST 2001

On Mon, Nov 05, 2001 at 09:42:25PM +0100, Igor Stroh wrote:
> Hi there,
> 
> Has anyone ever tried to extract text from a PDF with Python?
> So far there are two alternatives, but neither satisfies my needs
> (GPL license or the like, speed, and reliability):
> 1) Using pdftotext (Xpdf), which works on regular files
> 2) Using the commercial PageCatcher from reportlab.com (1000 bucks per
> license, lol) directly in a Python script (no files opened)
> 
> Though I haven't found anything yet, perhaps someone has already
> had the same problem and solved it by writing their own PDF parser? :) I'm
> too lazy to start reading the PDF specs and try to write the thing
> myself :)

Ghostscript is the usual answer for any PostScript processing.  My
Linux box also has 'pdftotext' installed, so I guess you can invoke it
from Python with os.system().
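
A minimal sketch of that idea, assuming 'pdftotext' from Xpdf is on the
PATH.  It uses subprocess.run rather than os.system so the extracted text
can be captured directly instead of going through a temporary file.  The
helper names build_pdftotext_command and extract_text are made up for
illustration; passing "-" as the output file asks pdftotext to write to
standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Return the argument list for one pdftotext run (hypothetical helper).

    "-" in the output-file position sends the text to stdout.
    """
    return [executable, pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # The output encoding depends on how pdftotext is configured; UTF-8 is
    # assumed here, with undecodable bytes replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") would then return the text of a file
named report.pdf (a placeholder name), and raises an exception if the
tool is missing or exits with an error.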

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>.
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin