Script to extract text from PDF files

brad byte8bits at gmail.com
Tue Sep 25 13:41:56 EDT 2007


I have a very crude Python script that extracts text from some (and I 
emphasize some) PDF documents. On many PDF docs, I cannot extract text, 
but that is because I'm doing something wrong. The PDF spec is large and 
complex, and there are various ways to store and encode text. I 
wanted to post here and ask if anyone is interested in helping make the 
script better, which means it should accurately extract text from almost 
any PDF file, not just some.
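
To give a sense of why a crude approach only works on some files, here is 
a minimal sketch of one naive technique. It is an illustration only, not 
the script itself. It assumes that text sits in content streams that are 
either uncompressed or Flate (zlib) compressed, and that text is shown 
with literal strings followed by the Tj operator. The name 
extract_text_crude is hypothetical. Anything else (hex strings, TJ arrays, 
custom font encodings, object streams, encryption) is silently missed, 
which is probably the kind of gap that makes many documents come back 
empty.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A literal string operand "(...)" followed by the Tj (show text) operator.
# Backslash escapes are allowed inside the string; nested unescaped
# parentheses are NOT handled (assumption made to keep the sketch short).
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)


def extract_text_crude(pdf_bytes):
    """Return text found in simple Tj operators, one string per line."""
    chunks = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # Assume FlateDecode; decompressobj tolerates trailing EOL bytes.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # not zlib data, so treat the stream as uncompressed
        for raw in TJ_RE.findall(data):
            # Undo only the \( \) and \\ escapes; other escapes are left as-is.
            raw = re.sub(rb"\\([()\\])", rb"\1", raw)
            # latin-1 never fails to decode, but may not match the font encoding.
            chunks.append(raw.decode("latin-1"))
    return "\n".join(chunks)
```

Calling extract_text_crude(open("some.pdf", "rb").read()) returns a 
string, which will be empty for any file that stores its text in a way 
the two patterns above do not cover.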

I know the topic of reading/extracting the text from a PDF document 
natively in Python comes up every now and then on comp.lang.python... 
I've posted about it in the past myself. After searching for other 
solutions, I've resorted to attempting this on my own in my spare time. 
Using apps external to Python (pdftotext, etc.) is not really an option 
for me. If someone knows of a free native Python app that does this now, 
let me know and I'll use that instead!

So, if other, more experienced programmers are interested in helping make 
the script better, please let me know. I can host a website with the 
latest revision and do all of the grunt work.

Thanks,

Brad


