Fw: PDF library for reading PDF files

Andreas Lobinger andreas.lobinger at netsurf.de
Tue Jan 20 04:20:35 EST 2004


Aloha,

Peter Galfi wrote:
> Thanks. I am studying the PDF spec, it just does not seem to be that easy
> having to implement all the decompressions, etc. The "information" I am
> trying to extract from the PDF file is the text, specifically in a way to
> keep the original paragraphs of the text. I have seen so far one shareware
> standalone tool that extracts the text (and a lot of other formatting
> garbage) into an RTF document keeping the paragraphs as well. I would need
> only the text.

As others wrote here, the simplest solution is to use an external
PDF-to-text program and postprocess its output. Read comp.text.pdf.
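
A minimal sketch of that approach, assuming the pdftotext command-line
tool (shipped with Xpdf) is installed and on the PATH. The function
names are hypothetical, and treating blank lines as paragraph breaks is
only a heuristic that will not hold for every document:

```python
import re
import subprocess


def extract_text(pdf_path):
    """Run the external pdftotext tool and return its output as a string.

    Assumes pdftotext is installed; "-" tells it to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_paragraphs(text):
    """Heuristic postprocessing: blank lines separate paragraphs,
    and wrapped lines inside a paragraph are rejoined with spaces."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]
```

Usage would be something like split_paragraphs(extract_text("file.pdf")),
where file.pdf is a placeholder name.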

There is no simple and consistent way to extract text from a PDF
because there are many ways to set text. What looks like a paragraph
on the page is not necessarily represented by a matching command
structure in the file.
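
To illustrate, here is a toy sketch with two made-up, uncompressed
content-stream fragments that would both draw "Hello World" on one
line. The fragments and the naive_strings helper are assumptions for
illustration only; real streams are usually compressed and may use
font-specific encodings:

```python
import re

# Variant A: the whole line is one string shown with the Tj operator.
STREAM_A = "BT /F1 12 Tf 72 700 Td (Hello World) Tj ET"

# Variant B: the same line shown with TJ as string pieces and position
# adjustments; the word gap is an adjustment (-250), not a space character.
STREAM_B = "BT /F1 12 Tf 72 700 Td [(Hel) -15 (lo) -250 (Wor) 10 (ld)] TJ ET"

# Matches simple literal strings "( ... )" without nested parentheses.
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def naive_strings(stream):
    """Return the literal strings in the order they appear in the stream."""
    return STRING.findall(stream)


print(naive_strings(STREAM_A))
print(naive_strings(STREAM_B))
```

Joining the pieces of variant B without looking at the position
adjustments loses the space between the two words, which is why
extraction needs to consider positions and not only the strings.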

Adobe recognized the difficulties for document reuse and introduced
tagged PDF in PDF 1.4. With tagged PDF it is possible to embed
structural information in the file. If you are interested in
using this, contact me.

Wishing a happy day
		LOBI


