Analysis of PDF (or EPS?)

Tue Nov 25 15:58:13 EST 2003

Johan Holst Nielsen <johan at weknowthewayout.com> wrote in message news:<3fbe00e8$0$95070$edfadb0f at dread11.news.tele.dk>...
> David Boddie wrote:

> > The full PDF specification is not exactly short, but it's fairly readable.
> 
> Yep... I tried it... but there is no reason to do exactly the same if
> other people have already done that. And time is an issue too ;)

Time is always an issue. How much of it do you have? ;-)

> > I have a Python library which is able to identify a lot of the structure in simple
> > documents, including basic text extraction, but I've become pretty disillusioned
> > with it because so much work is required to extract more complex information.
> > 
> > Maybe it's time to stick a license on it and upload it somewhere.
> 
> Well, let me know ;) Maybe I could get a demo or something? That would
> be nice :)

You may be disappointed, but here it is:

    http://www.boddie.org.uk/david/Projects/Python/pdftools/

The core of the library was written in a hurry over two years ago; later refinements
have made it only slightly more robust. It was never really intended for anything other
than exploring the structure of PDF files.

Basic use:

    import pdftools

    # Path to the document to inspect ("path" avoids shadowing the built-in "file").
    path = "MyFile.pdf"
    doc = pdftools.PDFdocument(path)

    print "Document uses PDF format version", doc.document_version()

    pages = doc.count_pages()
    print "Document contains %i pages." % pages

    # Only try to read page 123 if the document is long enough.
    if pages > 123:

        page123 = doc.read_page(123)
        contents123 = page123.read_contents()

        print "The objects found in this page:"
        print
        print contents123.contents
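
For a quick check without the library, the format version is normally visible in
the header comment at the start of a PDF file (for example "%PDF-1.4"). The
following is only a minimal sketch under that assumption; the function names
pdf_header_version and read_pdf_version are hypothetical and are not part of
pdftools, and no claim is made that pdftools obtains the version this way.

```python
import re


def pdf_header_version(data):
    """Return the version from a '%PDF-x.y' header in a bytes object, or None.

    Hypothetical helper for illustration only; not part of pdftools.
    """
    # Only the start of the file is searched; the 1024-byte window is an
    # assumption made to tolerate a little leading junk before the header.
    match = re.search(br"%PDF-(\d+\.\d+)", data[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii")


def read_pdf_version(path):
    """Read the start of the file at 'path' and return its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))
```

Calling read_pdf_version("MyFile.pdf") should give a string such as "1.4" for a
well-formed file, or None if no header is found.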

I've not really dealt with the coordinate system very well. Ideally, it would be
trivial to extract all the device-independent positioning information but,
whenever I start to look at this, I get distracted. :-)
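
To give an idea of what handling the coordinate system involves: PDF positions
are expressed in user space and are mapped through transformation matrices
written as six numbers [a b c d e f]. The sketch below is not taken from
pdftools; apply_matrix and concat are hypothetical names, and the code only
shows the arithmetic for transforming a point and for combining two matrices.

```python
def apply_matrix(matrix, point):
    """Transform the point (x, y) by a PDF-style matrix [a b c d e f]."""
    a, b, c, d, e, f = matrix
    x, y = point
    # Row-vector convention: [x' y' 1] = [x y 1] * [[a b 0], [c d 0], [e f 1]]
    return (a * x + c * y + e, b * x + d * y + f)


def concat(first, second):
    """Return the single matrix equivalent to applying 'first', then 'second'."""
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = second
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


# Example: translate by (10, 20), then scale by 2 in both directions.
translate = (1, 0, 0, 1, 10, 20)
scale = (2, 0, 0, 2, 0, 0)
combined = concat(translate, scale)
position = apply_matrix(combined, (1, 1))  # (22, 42)
```

Extracting real positions from a page would additionally require tracking the
matrices set up in the content stream, which this sketch does not attempt.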

Have fun, and don't expect too much,

David