PDF: finding a blank image

Mon Jul 13 19:05:05 EDT 2009

DrLeif <l.lensgraf at gmail.com> writes:

> What I would like to do is have python detect a "blank" pages in a PDF
> file and remove it.  Any suggestions?

The odds are good that even a blank page is being "rendered" within
the PDF as having some small bits of data due to scanner resolution,
imperfections on the page, etc..  So I suspect you won't be able to just
look for a well-defined pattern in the resulting PDF or anything.

Unless you're using OCR, the odds are good that the scanner is
rendering the PDF as an embedded image.  What I'd probably do is
extract the image of the page, and then use image processing on it to
try to identify blank pages.  I haven't had the need to do this
myself, and tool availability would depend on platform, but for
example, I'd probably try ImageMagick's convert operation to turn the
PDF into images (like PNGs).  I think Gimp can also do a similar
conversion, but you'd probably have to script it yourself.

Once you have an image of a page, you could then use something like
OpenCV to process the page (perhaps a morphology operation to remove
small noise areas, then a threshold or non-zero counter to judge
"blankness"), or probably just something like PIL depending on
complexity of the processing needed.

Once you identify a blank page, removing it could either be with pure
Python (there have been other posts recently about PDF libraries) or
with external tools (such as pdftk under Linux for example).

-- David