Script to extract text from PDF files

Thu Sep 27 15:40:37 EDT 2007

On Sep 26, 11:50 pm, byte8b... at gmail.com wrote:
> On Sep 26, 4:49 pm, Svenn Are Bjerkem <svenn.bjer... at googlemail.com>
> wrote:
>
> > I have downloaded this package and installed it and found that the
> > text-extraction is more or less useless. Looking into the code and
> > comparing with the PDF spec show a very early implementation of text
> > extraction. Luckily it is possible to overwrite the textextraction
> > method in the base class without having to fiddle with the original
> > code. I tried to contact the developer to offer some help on
> > implementing text extraction, but he didn't answer my emails.
> > --
> > Svenn
>
> Well, feel free to send any ideas or help to me! It seems simple... Do
> a binary read. Find 'stream' and 'endstream' sections.
> zlib.decompress() all the streams. Find BT and ET markers (Begin Text
> & End Text) and finally locate the parens within those and string the
> text together. This works great on 3 out of 10 PDF documents, but my
> main issue seems to be the zlib compressed streams. Some of them don't
> seem to be FlateDecodeable (although they claim to be) or the header
> is somehow incorrect. But, once I get a good stream and decompress it,
> things are OK from that point on. Seriously, if you have ideas, please
> let me know. I'll be glad to share what I've got so far.
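
For reference, the steps quoted above can be sketched roughly as
follows. This is a minimal illustration, not anyone's actual script:
the names STREAM_RE, TEXT_BLOCK_RE, STRING_RE, inflate_streams and
extract_text are made up here. Streams that fail to inflate are simply
skipped, and decompressobj() is used so that stray bytes after the
compressed data do not abort decompression. That is only one possible
cause of the failures described, and it is not confirmed anywhere in
this thread.

```python
import re
import zlib

# Hypothetical helper patterns for this sketch; real PDFs need a proper parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings in parentheses, allowing backslash escapes but not nesting.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the inflated contents of every stream that zlib can decode."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj() leaves trailing bytes in unused_data
            # instead of treating them as part of the stream.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data (or a damaged header): skip this stream.
            continue


def extract_text(pdf_bytes):
    """Join the literal strings found between BT and ET markers."""
    pieces = []
    for content in inflate_streams(pdf_bytes):
        for block in TEXT_BLOCK_RE.finditer(content):
            for string in STRING_RE.finditer(block.group(1)):
                pieces.append(string.group(1))
    return b" ".join(pieces).decode("latin-1")
```

Usage would be along the lines of
extract_text(open("paper.pdf", "rb").read()). The sketch ignores font
encodings, hex strings, other filters, and text positioning, so the
strings come out in stream order rather than reading order.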

So far I have found that extracting text from IEEE journal papers is
not as simple as described above. The IEEE journals are typeset in
typical journal style, with two-column body text, a one-column
abstract, and a blob of header and author information. Spread figures,
formulas, and footnotes around the pages and you end up using
basically every block text layout command there is in PDF.

I wanted to get pdftotext from the xpdf package to see what that tool
does with the IEEE PDFs, and to decide whether I should dive into its
sources to see how it gets things right. I have not got that far yet.
The purpose of my work was to extract the abstract of each paper and
put it into a database for later searching, but IEEE also has a search
engine on their journal DVD, so I have postponed the Python work.
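
If the pdftotext route turns out to be usable, the abstract step could
look something like the sketch below. It is untested against real IEEE
PDFs. The function names pdf_to_text and extract_abstract are invented
for illustration, and the end markers ("Index Terms", "Keywords",
"I. INTRODUCTION") are my guess at what follows an abstract, not
something taken from the papers.

```python
import re
import subprocess

# Assumed layout: the abstract starts at the word "Abstract" and runs
# until one of these headings (a guess, not checked against real papers).
ABSTRACT_RE = re.compile(
    r"Abstract\W*(.*?)"
    r"(?=\n\s*(?:Index Terms|Keywords|I\.\s+INTRODUCTION)|\Z)",
    re.DOTALL,
)


def pdf_to_text(path):
    """Run pdftotext (must be on PATH) and return its output as a string."""
    result = subprocess.run(
        ["pdftotext", path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_abstract(text):
    """Return the abstract with whitespace collapsed, or None if not found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    abstract = " ".join(match.group(1).split())
    return abstract or None
```

The result of extract_abstract(pdf_to_text("paper.pdf")) could then be
stored in whatever database is used for searching.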

I have got my Gentoo machine back on track, so that may change
again.
--
Svenn