Script to extract text from PDF files

byte8bits at gmail.com
Wed Sep 26 17:50:16 EDT 2007


On Sep 26, 4:49 pm, Svenn Are Bjerkem <svenn.bjer... at googlemail.com>
wrote:

> I have downloaded this package and installed it and found that the
> text extraction is more or less useless. Looking into the code and
> comparing it with the PDF spec shows a very early implementation of
> text extraction. Luckily it is possible to override the text-extraction
> method in the base class without having to fiddle with the original
> code. I tried to contact the developer to offer some help on
> implementing text extraction, but he didn't answer my emails.
> --
> Svenn

Well, feel free to send any ideas or help to me! It seems simple: do
a binary read, find the 'stream' and 'endstream' sections, and
zlib.decompress() each stream. Then find the BT and ET markers (Begin
Text and End Text), locate the parentheses within them, and string the
text together. This works great on 3 out of 10 PDF documents, but my
main issue seems to be the zlib-compressed streams. Some of them don't
seem to be FlateDecode-able (although they claim to be), or the header
is somehow incorrect. But once I get a good stream and decompress it,
things are OK from that point on. Seriously, if you have ideas, please
let me know. I'll be glad to share what I've got so far.
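
For anyone who wants to experiment, here is a minimal sketch of the
steps described above. It is not the script discussed in this thread;
the names extract_text and inflate and the three regular expressions
are hypothetical and only illustrate the approach. It assumes text sits
in literal (parenthesised) strings without nested parentheses, and it
simply skips any stream that zlib rejects.

```python
import re
import zlib

# Raw bytes between the "stream" keyword (plus its end-of-line) and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects are delimited by the BT (Begin Text) and ET (End Text) operators.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string in parentheses; backslash escapes allowed, nesting not handled.
LITERAL_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")


def inflate(data):
    """Return the decompressed bytes, or None if zlib rejects the data."""
    try:
        # A decompress object tolerates trailing bytes (such as the newline
        # before "endstream") and truncated input.
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return None


def extract_text(pdf_bytes):
    """Concatenate the literal strings found inside BT...ET blocks."""
    pieces = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        content = inflate(stream.group(1))
        if content is None:
            continue  # not zlib data (or a bad header): skip this stream
        for block in TEXT_BLOCK_RE.finditer(content):
            for literal in LITERAL_RE.finditer(block.group(1)):
                raw = literal.group(0)[1:-1]  # drop the enclosing parentheses
                # Turn escapes such as \( \) \\ into the bare character.
                raw = re.sub(rb"\\(.)", lambda m: m.group(1), raw)
                pieces.append(raw.decode("latin-1"))
    return "".join(pieces)
```

Usage would be something like extract_text(open("some.pdf", "rb").read()).
Streams that fail to decompress may be using a filter other than
FlateDecode, which the PDF format allows; that is only one possible
cause, and this sketch does not check the stream dictionary for it.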

Not many people seem to be interested. I'll stop adding to this
thread... I don't want to beat a dead horse. Anyone interested in
helping can contact me via email.

Thanks,

Brad



