PDF->Text converter/extractor

Mon Nov 5 16:52:57 EST 2001

Some time ago I wrote a script to extract text directly from PDF files; it's
quite easy. As I had a very large volume of text to extract (several gigabytes
of text), I now use pdftotext, which comes with Xpdf. I modified it slightly
for my needs. If you are interested, I will look for the script in my archives.
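
For what it's worth, here is a minimal sketch of how pdftotext could be driven
from Python without writing an intermediate text file. This is not my archived
script, only an illustration: it assumes pdftotext is installed and on the
PATH, and the helper names build_pdftotext_command and pdf_to_text are made up
for the example. It relies on pdftotext writing to stdout when the output file
name is "-".

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list (hypothetical helper name).

    "-" as the output file tells pdftotext to write the text to stdout,
    and "-enc UTF-8" requests UTF-8 output.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout keeps the physical layout of the page as far as possible.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=False):
    """Return the text of a PDF as a string by running pdftotext.

    Assumes the pdftotext binary (from Xpdf) is available on the PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling pdf_to_text("some.pdf") would then give the extracted text as one
string. For several gigabytes of text, a new process per file costs something,
so it may be worth running a few conversions in parallel.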

Bruno Lienard

"Igor Stroh" <igor.stroh at wohnheim.uni-ulm.de> wrote in message news:
3be6fa21$1 at sol.wohnheim.uni-ulm.de...
> Hi there,
>
> has anyone ever tried to extract text from a PDF with Python?
> So far, there are 2 alternatives, but neither of them satisfies my needs
> (GPL license (or the like), speed and reliability):
> 1) Using pdftotext (Xpdf) with usual files
> 2) Using the commercial PageCatcher from reportlab.com (1000 bucks per
> license lol) directly in a Python script (no files opened)
>
> Though I haven't found anything yet, perhaps there is someone who already
> had the same problem and solved it by writing their own PDF parser? :) I'm
> too lazy to start reading the PDF specs and try to write the thingy
> myself :)
>
> TIA,
> Igor