highlight words by regex in pdf files using python

Wed Mar 17 11:11:40 EDT 2010

On Wed, Mar 17, 2010 at 9:53 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
> Thank you for your long reply! But I'm not sure whether you understood my question.
>
> Acrobat can highlight certain words in PDFs, and I can add notes to the
> highlighted words as well. However, I find that I frequently end up
> highlighting words that could be matched by a regular
> expression.
>
> To improve my productivity, I don't want to do this manually in Acrobat
> but rather do it automatically, if such a tool is
> available. People on the reportlab mailing list said this is not possible
> with reportlab. And I don't see that PyPDF can do this. If you know of
> an API for this purpose, please let me know. Thank you!

I do not know of any API specific to this purpose, no.  But I
mentioned three libraries (pagecatcher, pdfminer, and pdfrw) that are
capable, to a greater or lesser extent, of reading in PDFs and giving
you the data from them, which you can then do your replacement on and
then write back out.  I would imagine this would be a piece of cake
with pagecatcher.  (I noticed you just posted on the reportlab mailing
list, but you did not specifically mention pagecatcher.)  It will
probably take more work with either of the other two.  It is probable
that none of them does exactly what you want, but any of them is likely
a better starting point than coding what you want from scratch.
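
To make the "do your replacement on the data" step a bit more concrete,
here is a minimal sketch of the regex part only. It assumes the page text
has already been extracted into plain strings by one of the libraries
above; the function name find_highlight_spans and the page_texts list are
hypothetical, made up for illustration, and nothing here reads or writes a
PDF. Turning the matched spans into actual highlights in the output file
is the library-specific part and is not shown.

```python
import re


def find_highlight_spans(page_texts, pattern):
    """Return (page_index, start, end, matched_text) for each regex match.

    page_texts is a hypothetical list of strings, one per page, standing in
    for whatever a PDF library's text extraction would give you.
    """
    regex = re.compile(pattern)
    spans = []
    for page_index, text in enumerate(page_texts):
        # finditer yields non-overlapping matches in left-to-right order
        for match in regex.finditer(text):
            spans.append(
                (page_index, match.start(), match.end(), match.group(0))
            )
    return spans


# Stand-in for extracted page text (an assumption, not real PDF output).
pages = [
    "The p-value was 0.03.",
    "No match here.",
    "Values 0.5 and 12.75 differ.",
]

# Find every decimal number so it could be highlighted later.
spans = find_highlight_spans(pages, r"\d+\.\d+")
for page_index, start, end, word in spans:
    print(page_index, start, end, word)
```

The character offsets are relative to the extracted text of each page, so
mapping them back to positions on the page would still depend on what the
chosen library exposes.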

Regards,
Pat