pdf to text

tubby tubby at bandaheart.com
Thu Jan 25 16:05:11 EST 2007


I know this question comes up a lot, so here goes again. I want to read 
text from a PDF file, run re searches on the text, etc. I do not care 
about layout, fonts, borders, etc. I just want the text. I've been 
reading Adobe's PDF Reference Guide and I'm beginning to develop a 
better understanding of PDF in general, but I need a bit of help... this 
seems like it should be easier than it is. Here's some code:

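import zlib

# Read the whole PDF into memory, one byte at a time.
fp = open('test.pdf', 'rb')
data = []  # renamed from 'bytes' so the builtin is not shadowed
while 1:
     byte = fp.read(1)
     if not byte:
         break  # stop before appending the empty end-of-file marker
     data.append(byte)
fp.close()

# Open the output file once instead of once per byte.
op = open('pdf.txt', 'a')

for byte in data:
     # Each new decompressor is handed a single byte here, so it never
     # sees a complete zlib stream.
     dco = zlib.decompressobj()
     try:
         s = dco.decompress(byte)
         # op.write(s)
         print(s)
     except Exception as e:
         print(e)

op.close()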

I know the text is compressed, that it sits between stream and endstream 
markers, and that once uncompressed it appears between BT (Begin Text) 
and ET (End Text) with the text itself enclosed in parentheses (this is 
my text). Has anyone here done this in a simple fashion? I've played 
with the pyPdf library some, but it seems overly complex for my needs 
(merge PDFs, write PDFs, etc.). I just want a simple PDF text extractor.
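
The stream/endstream, BT/ET and parenthesis steps described above can be 
sketched roughly as follows. This is only a minimal illustration, not a 
general PDF parser: it assumes the content streams are compressed with 
plain Flate (zlib), that the file is not encrypted, and that the text is 
stored as literal strings in a simple single-byte encoding. Hex strings, 
custom font encodings and object streams are ignored. The names 
STREAM_RE, TEXT_OBJECT_RE, LITERAL_RE and extract_text are made up for 
this example.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# A BT ... ET text object inside a decompressed content stream.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string in parentheses, allowing backslash escapes.
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)


def extract_text(pdf_bytes):
    """Return the literal strings found in Flate-compressed streams."""
    chunks = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        try:
            # Feed the whole stream to one decompressor, not byte by byte.
            content = zlib.decompressobj().decompress(stream.group(1))
        except zlib.error:
            continue  # not zlib data (for example an uncompressed stream)
        for text_object in TEXT_OBJECT_RE.finditer(content):
            for literal in LITERAL_RE.finditer(text_object.group(1)):
                raw = literal.group(0)[1:-1]  # drop the outer parentheses
                # Undo the \( \) and \\ escapes; other escapes are left as-is.
                raw = re.sub(rb"\\([()\\])", rb"\1", raw)
                chunks.append(raw.decode("latin-1"))
    return chunks
```

Under those assumptions, extract_text(open('test.pdf', 'rb').read()) 
returns a list of strings that can then be joined and searched with re.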

Thanks


