searching pdf files for certain info

rbt rbt at athop1.ath.vt.edu
Tue Feb 22 10:00:21 EST 2005


Andreas Lobinger wrote:
> Aloha,
> 
> rbt wrote:
> 
>> Not really a Python question... but here goes: Is there a way to read 
>> the content of a PDF file and decode it with Python? I'd like to read 
>> PDFs, decode them, and then search the data for certain strings.
> 
> 
> First of all,
> http://groups.google.de/groups?selm=400CF2E3.29506EAE%40netsurf.de&output=gplain 
> 
> still applies here.
> 
> If you can deal with a very basic implementation of a pdf-lib you
> might be interested in
> http://sourceforge.net/projects/pdfplayground
> 
> In the CVS (or the current snapshot) you can find in
> ppg/Doc/text_extract.txt an example for text extraction.
> 
>  >>> import pdffile
>  >>> import pages
>  >>> import pdftool
>  >>> import zlib
>  >>> pf = pdffile.pdffile('../pdf-testset1/a.pdf')
>  >>> pp = pages.pages(pf)
>  >>> c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
>  >>> op = pdftool.parse_content(c)
>  >>> sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
>  >>> for a in sop:
>  ...     print a[0]
> 
> Wishing a happy day
>     LOBI
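
For anyone without pdfplayground at hand, here is a rough sketch of the
same idea as the quoted session using only the standard library: inflate
each stream with zlib, then collect the string operands of the Tj and '
operators. It is an illustration, not pdfplayground's method. The names
extract_text, contains and search_pdf are made up for this example, and
it assumes Python 3, Flate-compressed (or uncompressed) content streams,
and literal (...) strings. TJ arrays, hex strings, escape sequences and
font encodings are not handled, so many real PDFs will yield little or
nothing.

```python
import re
import zlib

# Body of each "stream ... endstream" section. This simply scans for the
# keywords instead of honouring the /Length entry, which is an assumption
# that can fail on unusual files.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A literal string operand followed by the Tj or ' operator,
# e.g. "(Hello) Tj". Backslash escapes are matched but left undecoded.
TEXT_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')")


def extract_text(pdf_bytes):
    """Return the literal strings shown by Tj and ' operators, in file order."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left after the data
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            content = raw  # not Flate data; scan the bytes as they are
        for text in TEXT_RE.findall(content):
            found.append(text.decode("latin-1"))
    return found


def contains(pdf_bytes, needle):
    """True if any extracted string contains needle."""
    return any(needle in s for s in extract_text(pdf_bytes))


def search_pdf(path, needle):
    """Read a file from disk and search its extracted strings."""
    with open(path, "rb") as fh:
        return contains(fh.read(), needle)
```

Calling search_pdf('../pdf-testset1/a.pdf', 'some string') would then
report whether the string turns up in any extracted fragment. Text that
is split across several Tj operators will not match as a whole.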

Thanks guys... what if I convert it to PS via printing it to a file or 
something? Would that make it easier to work with?


