Fw: PDF library for reading PDF files

Mon Jan 19 08:04:34 EST 2004

In article <oxEOb.96911$Vs3.36407 at twister.socal.rr.com>,
Robert Kern  <rkern at ucsd.edu> wrote:
>Cameron Laird wrote:
>> In article <Xns9474CBDE9B2D7cpl19ghumspamgourmet at 62.153.159.134>,
>> Harald Massa  <cpl.19.ghum at spamgourmet.com> wrote:
>> 
>>>>I am looking for a library in Python that would read PDF files and I
>>>>could extract information from the PDF with it. I have searched with
>>>>google, but only found libraries that can be used to write PDF files. 
>>>
>>>reportlab has a lib called pagecatcher; it is fully supported with python, 
>>>it is not free.
>>>
>>>Harald
>> 
>> 
>> ReportLab's libraries are great things--but they do not "extract
>> information from the PDF" in the sense I believe the original
>> questioner intended.  
>
>No, but ReportLab (the company) has a product separate from reportlab 
>(the package) called PageCatcher that does exactly what the OP asked 
>for. It is not open source, however, and costs a chunk of change.

Let's take this one step further.  Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >.  I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee.  I
entirely agree that PageCatcher "read[s] PDF files ...
and ... extract[s] information from the PDF with it."

HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text
"extraction" (true?) and, unless PageCatcher has changed
a lot since I got my last copy, PDF-to-text is NOT one of
its functions.
-- 

Cameron Laird <claird at phaseit.net>
Business:  http://www.Phaseit.net