searching pdf files for certain info

Andreas Lobinger andreas.lobinger at netsurf.de
Tue Feb 22 09:38:33 EST 2005


Aloha,

rbt wrote:
> Not really a Python question... but here goes: Is there a way to read 
> the content of a PDF file and decode it with Python? I'd like to read 
> PDF's, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/groups?selm=400CF2E3.29506EAE%40netsurf.de&output=gplain
still applies here.

If you can live with a very basic implementation of a PDF library, you
might be interested in
http://sourceforge.net/projects/pdfplayground

In CVS (or the current snapshot), ppg/Doc/text_extract.txt contains
an example of text extraction:

  >>> import pdffile
  >>> import pages
  >>> import pdftool
  >>> import zlib
  >>> pf = pdffile.pdffile('../pdf-testset1/a.pdf')
  >>> pp = pages.pages(pf)
  >>> # decompress the content stream of the first page
  >>> c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
  >>> op = pdftool.parse_content(c)
  >>> # keep the operands of the text-showing operators ' and Tj
  >>> sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
  >>> for a in sop:
  ...     print a[0]
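
To make the idea behind the session above easier to follow without
installing anything, here is a rough, self-contained sketch that does not
use pdfplayground at all. It builds a tiny hand-written content stream,
deflates it the way a /FlateDecode stream would be stored, and then pulls
out the string operands of the Tj and ' operators with a regular
expression. The sample stream, the name extract_strings and the regex are
assumptions made for illustration only. Real PDFs also use TJ arrays, hex
strings, escapes, nested parentheses and font-specific encodings, none of
which this handles.

```python
import re
import zlib

# Hypothetical sample data: a minimal page content stream.
# Tj shows a string; ' moves to the next line and shows a string.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj (World) ' ET"
stream = zlib.compress(raw)  # stand-in for a /FlateDecode stream

# A literal string followed by a Tj or ' operator.
# Simplification: no nested parentheses, no backslash escapes,
# no hex strings and no TJ arrays are recognised.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*(Tj|')")


def extract_strings(compressed):
    """Return the string operands of Tj and ' in a deflated stream."""
    content = zlib.decompress(compressed)
    return [m.group(1).decode("latin-1")
            for m in SHOW_TEXT.finditer(content)]


found = extract_strings(stream)
for s in found:
    print(s)
```

Searching for a certain string is then an ordinary substring test on the
items of found, with the caveat that a word may be split across several
show-text operators in a real document.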

Wishing a happy day
	LOBI


