Analyse of PDF (or EPS?)
David Boddie
davidb at mcs.st-and.ac.uk
Tue Nov 25 15:58:13 EST 2003
Johan Holst Nielsen <johan at weknowthewayout.com> wrote in message news:<3fbe00e8$0$95070$edfadb0f at dread11.news.tele.dk>...
> David Boddie wrote:
> > The full PDF specification is not exactly short, but it's fairly readable.
>
> Yep... I tried it... but there is no reason to do exactly the same if
> other people have already done that. And time is an issue too ;)
Time is always an issue. How much of it do you have? ;-)
> > I have a Python library which is able to identify a lot of the structure in simple
> > documents, including basic text extraction, but I've become pretty disillusioned
> > with it because so much work is required to extract more complex information.
> >
> > Maybe it's time to stick a license on it and upload it somewhere.
>
> Well, let me know ;) Maybe I could get a demo or something? That would
> be nice :)
You may be disappointed, but here it is:
http://www.boddie.org.uk/david/Projects/Python/pdftools/
The core of the library was written in a hurry over two years ago; later refinements
make it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.
Basic use:
import pdftools

file = "MyFile.pdf"
doc = pdftools.PDFdocument(file)

print "Document uses PDF format version", doc.document_version()

pages = doc.count_pages()
print "Document contains %i pages." % pages

if pages > 123:
    page123 = doc.read_page(123)
    contents123 = page123.read_contents()

    print "The objects found in this page:"
    print
    print contents123.contents
I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
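
For anyone curious about what the positioning information involves: PDF describes each coordinate transformation as six numbers [a b c d e f], and the "cm" operator in a page's content stream multiplies a new matrix onto the current transformation matrix (CTM). Default user space units are 1/72 inch. What follows is only a minimal sketch of that arithmetic, assuming plain "cm" operators and ignoring text matrices and page rotation. The helper names (concat, apply_matrix) and the sample operands are hypothetical and are not part of pdftools.

```python
# Sketch only: hypothetical helpers, not part of the pdftools library.
# A PDF matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f).

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

def concat(m, ctm):
    """Return m x ctm, which is how the "cm" operator updates the CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (a1 * a2 + b1 * c2,
            a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2,
            c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2,
            e1 * b2 + f1 * d2 + f2)

def apply_matrix(matrix, x, y):
    """Transform the point (x, y) by a six-number PDF matrix."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

# Made-up content stream fragment:
#   1 0 0 1 72 720 cm    (translate by 72, 720)
#   2 0 0 2 0 0 cm       (then scale by 2)
ctm = IDENTITY
ctm = concat((1.0, 0.0, 0.0, 1.0, 72.0, 720.0), ctm)
ctm = concat((2.0, 0.0, 0.0, 2.0, 0.0, 0.0), ctm)

# A point at (10, 5) in the innermost coordinate system, in default units.
x_pt, y_pt = apply_matrix(ctm, 10.0, 5.0)

# Default user space is 1/72 inch per unit, so dividing by 72 gives inches.
x_in, y_in = x_pt / 72.0, y_pt / 72.0
```

Real documents also nest these changes inside q/Q (save/restore) pairs, so a proper extractor would keep a stack of matrices rather than a single one.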
Have fun, and don't expect too much,
David