Working with PDFs?

Tue Aug 24 12:32:31 EDT 2010

<jyoung79 at kc.rr.com> wrote in message 
news:mailman.2465.1282591017.1673.python-list at python.org...
>> <jyoung79 at kc.rr.com> writes:
>
>>> - Pull out text from each PDF page (to search for specific words)
>>> - Combine separate pdf documents into one document
>>> - Add bookmarks (with destination settings)
>
>> PDF Shuffler is a Python app which does PDF merging and splitting very
>> well. I don't think it does anything else, though, but maybe that's
>> where your code comes in?
>
> Thank you Anssi, MRAB, Terry and Geremy for your replies.  I've been
> researching the apps you have recommended.  Just curious if anyone has
> used pyPdf?  While testing this, it seems to work pretty well for
> combining pdf files (seems to keep the annotation notes nicely also)
> and pulling out the text contents.  I'm not sure I'm going to be able
> to find anything that can add bookmarks though.  If you have used pyPdf,
> would you mind sharing your thoughts about it?
>
> Thanks.
>
> Jay

Hi Jay,

I use pyPdf, and I seem to remember having to patch it so it didn't crash
when a PDF dictionary contained duplicate keys (the part that holds the
document properties, I think).

Anyway, I use the package to get info from that document properties
dictionary, the page count, and so on, for displaying a build report to
users of a customized LaTeX system. So I'm using LaTeX to generate the PDFs
and pyPdf to glean data about the PDFs after the builds.
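
As a rough sketch of that kind of post-build report (not the actual code
from the system described above): the helper below reads the page count
and a few document properties through pyPdf's PdfFileReader methods
getNumPages() and getDocumentInfo(). The function names summarize_pdf and
summarize_file, and the choice of properties reported, are made up for
illustration.

```python
def summarize_pdf(reader):
    """Collect page count and a few document properties.

    `reader` is assumed to behave like pyPdf's PdfFileReader, i.e. to
    provide getNumPages() and getDocumentInfo().
    """
    info = reader.getDocumentInfo()  # None when the PDF has no /Info entry

    def prop(key):
        # Indexing (rather than .get) lets pyPdf resolve indirect objects.
        if info is not None and key in info:
            return info[key]
        return None

    return {
        "pages": reader.getNumPages(),
        "title": prop("/Title"),
        "author": prop("/Author"),
        "producer": prop("/Producer"),
    }


def summarize_file(path):
    # Hypothetical convenience wrapper. pyPdf is imported here so that
    # summarize_pdf() can be tried out without pyPdf installed.
    from pyPdf import PdfFileReader
    with open(path, "rb") as stream:
        return summarize_pdf(PdfFileReader(stream))
```

Calling summarize_file("build/output.pdf") (a placeholder path) would
return a small dict that a report script could format however it likes.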

I'd like to be able to do more with it, for example finding out whether any
fonts in the doc are not embedded.
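
One possible approach to the font check, offered only as a sketch under
some assumptions: in the PDF format an embedded font normally carries a
/FontDescriptor with a /FontFile, /FontFile2 or /FontFile3 entry, so a
font listed in a page's /Resources without one of those is probably not
embedded. The helper names below are hypothetical, and the code only looks
at fonts referenced directly from page resources (not fonts used inside
form XObjects), so treat its answer as a hint rather than a guarantee.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def _resolve(obj):
    # pyPdf objects expose getObject() to follow indirect references;
    # plain Python values are returned unchanged.
    return obj.getObject() if hasattr(obj, "getObject") else obj


def _lookup(mapping, key, default=None):
    # dict.get() works for plain dicts and pyPdf dictionaries alike;
    # the value is resolved afterwards in case it is an indirect reference.
    return _resolve(mapping.get(key, default))


def is_embedded(font):
    font = _resolve(font)
    if _lookup(font, "/Subtype") == "/Type3":
        return True  # Type 3 glyphs are defined inside the PDF itself
    # Composite (Type0) fonts keep the descriptor on their descendant fonts.
    candidates = [font]
    for child in _lookup(font, "/DescendantFonts", []):
        candidates.append(_resolve(child))
    for cand in candidates:
        descriptor = _lookup(cand, "/FontDescriptor", {})
        if any(key in descriptor for key in FONT_FILE_KEYS):
            return True
    return False


def unembedded_fonts(pages):
    """Return sorted names of fonts that appear not to be embedded."""
    names = set()
    for page in pages:
        resources = _lookup(_resolve(page), "/Resources", {})
        fonts = _lookup(resources, "/Font", {})
        for label in fonts:
            font = _lookup(fonts, label)
            if not is_embedded(font):
                # Fall back to the resource label if /BaseFont is missing.
                names.add(str(_lookup(font, "/BaseFont", label)))
    return sorted(names)
```

With pyPdf, the pages argument could be built from a PdfFileReader as
[reader.getPage(i) for i in range(reader.getNumPages())].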

--Tim Arnold