Script to extract text from PDF files

Wed Sep 26 11:13:15 EDT 2007

David Boddie wrote:
> There's a little information on that online:
>   http://www.glyphandcog.com/textext.html

Thanks, I'll read that.

> Just because inserting and encoding is well documented doesn't mean that the
> reverse processes are easy. :-/

Boy, that's an understatement. Most of the PDF tools I come across (in
fact almost all of them) write PDF docs: they output things to PDF.
Generating PDF files is dead simple, but extracting text from them in
an accurate, reliable manner is much more difficult.
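
To illustrate the gap, here is a minimal sketch of the naive approach,
assuming an uncompressed content stream whose strings are plain
Latin-1 literals. The sample stream and the names naive_extract and
SAMPLE_STREAM are made up for illustration. Real documents usually
compress their streams, use font-specific encodings or hex strings, and
position glyphs individually without explicit spaces, which is where
this approach falls apart.

```python
import re

# Hypothetical, hand-written content stream fragment (uncompressed).
SAMPLE_STREAM = b"BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"

# A PDF literal string: (...) with backslash escapes, no nested parentheses.
_LITERAL = rb"\((?:\\.|[^\\()])*\)"

# Either "(string) Tj" or "[(str) kerning (str) ...] TJ".
_SHOW = re.compile(
    rb"(" + _LITERAL + rb")\s*Tj"
    rb"|\[([-\d.\s]*(?:" + _LITERAL + rb"[-\d.\s]*)*)\]\s*TJ"
)


def _unescape(raw: bytes) -> str:
    """Strip the outer parentheses and undo the \\(, \\) and \\\\ escapes only."""
    body = re.sub(rb"\\([()\\])", rb"\1", raw[1:-1])
    # Assumption: bytes map directly to Latin-1 characters. Real fonts
    # often use custom encodings, so this is wrong for many documents.
    return body.decode("latin-1")


def naive_extract(stream: bytes) -> list:
    """Return the strings shown by Tj/TJ operators, in stream order."""
    chunks = []
    for match in _SHOW.finditer(stream):
        if match.group(1) is not None:
            # Simple case: one literal string followed by Tj.
            chunks.append(_unescape(match.group(1)))
        else:
            # TJ array: join the string pieces and ignore kerning numbers.
            parts = re.findall(_LITERAL, match.group(2))
            chunks.append("".join(_unescape(p) for p in parts))
    return chunks


if __name__ == "__main__":
    print(naive_extract(SAMPLE_STREAM))
```

Even on this toy input the result is a list of fragments in stream
order, with no information about word spacing, line breaks, or reading
order. Recovering those from glyph positions is the hard part.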

> Maybe you should look at the source code for pdftotext, if that's an option.

I'm not sure whether it's open source/free software with the source
available, but I'll look into that.

> Can I suggest that you approach one or more authors of the existing Python
> PDF solutions and work with them on this? There are at least four PDF parsers
> written in Python out there.

I appreciate that suggestion, but again, none of the current solutions
I've seen and tried extract text from PDF documents. I'd love to be
proven wrong on this point, so if one of those four PDF solutions you
mention does that, please let me know.

Thanks,

Brad