Script to extract text from PDF files

David Boddie david at boddie.org.uk
Wed Sep 26 18:53:04 EDT 2007


On Wed Sep 26 23:50:16 CEST 2007, byte8bits wrote:

> On Sep 26, 4:49 pm, Svenn Are Bjerkem <svenn.bjer... at googlemail.com>
> wrote:
> 
> > I have downloaded this package and installed it and found that the
> > text extraction is more or less useless. Looking into the code and
> > comparing it with the PDF spec shows a very early implementation of
> > text extraction. Luckily it is possible to override the text
> > extraction method in the base class without having to fiddle with the
> > original code. I tried to contact the developer to offer some help on
> > implementing text extraction, but he didn't answer my emails.

That's disappointing to hear, but it's understandable. I must have one
or two outstanding requests to add features to pdftools from a year ago.
I keep meaning to look into making the necessary changes, but it's not
something I'm looking forward to.

> Well, feel free to send any ideas or help to me! It seems simple... Do
> a binary read. Find 'stream' and 'endstream' sections.
> zlib.decompress() all the streams.

Assuming that they're Flate-encoded (that is, their /Filter is FlateDecode)...
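
For what it's worth, a minimal sketch of that check might look like the
following. It only decompresses streams whose dictionary mentions
/FlateDecode and skips everything else. The names STREAM_RE and
flate_streams are made up for illustration, and the regular expression is a
naive assumption: it ignores /Length, filter chains and predictors, and it
will misfire if the bytes "endstream" happen to occur inside the data.

```python
import re
import zlib

# Naive pattern (an assumption, not a real parser): a dictionary, the
# "stream" keyword, an end-of-line, then data up to the first "endstream".
STREAM_RE = re.compile(rb"<<(.*?)>>\s*stream\r?\n(.*?)endstream", re.DOTALL)


def flate_streams(pdf_bytes):
    """Yield the decompressed data of streams that declare /FlateDecode."""
    for match in STREAM_RE.finditer(pdf_bytes):
        dictionary, data = match.group(1), match.group(2)
        if b"/FlateDecode" not in dictionary:
            continue  # some other filter, or no filter at all
        # A decompress object leaves the end-of-line bytes that precede
        # "endstream" in unused_data instead of treating them as input.
        decompressor = zlib.decompressobj()
        try:
            result = decompressor.decompress(data)
        except zlib.error:
            continue  # claims to be Flate-encoded but is not decodable
        if decompressor.eof:  # only keep streams that decoded completely
            yield result
```

Usage would be something like list(flate_streams(open(path, "rb").read())),
where path is whatever file you are testing with.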

> Find BT and ET markers (Begin Text 
> & End Text) and finally locate the parens within those and string the
> text together.

Which works fine if the generator put in space characters. Otherwise,
it seems to me that you need to figure out where any spaces should go.
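
As a rough sketch of what "figuring out where spaces go" can mean: inside a
TJ array, the numbers between strings are position adjustments in
thousandths of a text space unit, and a large negative one moves the next
string to the right, which often stands in for a space. The code below
treats any adjustment below a threshold as a space. The function name
extract_text, the threshold of -200 and the Latin-1 decoding are all
assumptions for illustration; real documents need font metrics and
encodings, and this ignores positioning operators such as Td entirely.

```python
import re

# Naive patterns (assumptions): text objects delimited by BT/ET, and tokens
# that are literal strings, numbers, array brackets or operator names.
BT_ET_RE = re.compile(rb"BT\b(.*?)\bET\b", re.DOTALL)
TOKEN_RE = re.compile(
    rb"\((?:\\.|[^\\()])*\)|-?\d+(?:\.\d+)?|\[|\]|[A-Za-z*]+", re.DOTALL
)


def extract_text(content, gap=-200):
    """Join the literal strings in each BT/ET block, guessing at spaces."""
    blocks = []
    for block in BT_ET_RE.findall(content):
        pieces = []
        in_array = False
        for token in TOKEN_RE.findall(block):
            if token.startswith(b"("):
                # Drop the parentheses and undo \( \) \\ escapes only.
                raw = re.sub(rb"\\([()\\])", rb"\1", token[1:-1])
                pieces.append(raw.decode("latin-1"))
            elif token == b"[":
                in_array = True
            elif token == b"]":
                in_array = False
            elif in_array and token[:1] in b"-0123456789":
                # A big negative adjustment inside a TJ array: guess a space.
                if float(token) < gap:
                    pieces.append(" ")
        blocks.append("".join(pieces))
    return "\n".join(blocks)
```

Choosing the threshold is the hard part; a value that works for one
generator will glue words together or split them apart for another.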

> This works great on 3 out of 10 PDF documents, but my 
> main issue seems to be the zlib compressed streams. Some of them don't
> seem to be FlateDecodeable (although they claim to be) or the header
> is somehow incorrect. But, once I get a good stream and decompress it,
> things are OK from that point on. Seriously, if you have ideas, please
> let me know. I'll be glad to share what I've got so far.

You need to start from a good parser and build a higher-level text
extraction library on top of it.

> Not many people seem to be interested. I'll stop adding to this
> thread... I don't want to beat a dead horse. Anyone interested in
> helping can contact me via email.

On the contrary, lots of people are interested in this sort of thing:

http://phaseit.net/claird/comp.text.pdf/PDF_converters.html
http://sourceforge.net/projects/pdfplayground
http://www.adaptive-enterprises.com.au/~d/software/pdffile/
http://pybrary.net/pyPdf/
http://www.boddie.org.uk/david/Projects/Python/pdftools/

I discussed working with the author of pdfplayground, but things never
really got going.

I'd like to be part of a team working on a PDF library for Python, but my
views on software licensing mean that I'd prefer to use a strong copyleft
license rather than the permissive licenses found attached to most of the
above libraries.

David


