Using PIL to find separator pages

half.italian at gmail.com
Thu May 31 20:09:12 EDT 2007


On May 31, 10:01 am, Larry Bates <larry.ba... at websafe.com> wrote:
> I have a project that I wanted to solicit some advice
> on from this group.  I have millions of pages of scanned
> documents with each page in an individual .JPG file.
> When the documents were scanned the people that did
> the scanning put a colored (hot pink) separator page
> between the individual documents.  I was wondering if
> there was any way to utilize PIL to scan through the
> individual files, look at some small section on the
> page, and determine if it is a separator page by
> somehow comparing the color to the separator page
> color?  I realize that this would be some sort of
> percentage match where 100% would be a perfect match
> and any number lower would indicate that it was less
> likely that it was a coverpage.
>
> Thanks in advance for any thoughts or advice.
>
> Regards,
> Larry Bates
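
The check described in the question (sample a small region, compare
its color to the separator color, report a percentage) could be
sketched roughly as follows.  This is only a sketch under assumptions:
the reference color HOT_PINK, the centered sample patch, the 90%
threshold, and the function names are all made up for illustration and
would need tuning against real separator scans.

```python
# Assumed reference colour; measure the real one from an actual separator scan.
HOT_PINK = (255, 105, 180)

# Largest possible distance between two RGB colours (black to white).
MAX_DIST = (3 * 255 ** 2) ** 0.5


def color_match(rgb, reference=HOT_PINK):
    """Return a 0-100 score; 100 means identical to the reference colour."""
    dist = sum((a - b) ** 2 for a, b in zip(rgb, reference)) ** 0.5
    return 100.0 * (1.0 - dist / MAX_DIST)


def region_mean_color(path, box_fraction=0.1):
    """Average RGB of a small patch in the centre of the page."""
    from PIL import Image, ImageStat  # imported here so color_match works without PIL

    with Image.open(path) as img:
        rgb_img = img.convert("RGB")
    w, h = rgb_img.size
    bw = max(1, int(w * box_fraction))
    bh = max(1, int(h * box_fraction))
    left, top = (w - bw) // 2, (h - bh) // 2
    patch = rgb_img.crop((left, top, left + bw, top + bh))
    return tuple(ImageStat.Stat(patch).mean)


def is_separator(path, threshold=90.0):
    """Hypothetical decision rule: score at or above threshold means separator."""
    return color_match(region_mean_color(path)) >= threshold
```

Averaging a patch rather than testing one pixel smooths out JPEG noise
and scanner speckle.  If the separator pages carry printed text in the
middle, a patch near a margin may be a better choice.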

I used GraphicsMagick for a similar situation.  Once installed, you can
run 'gm identify' to return all sorts of useful information about the
images.  In my case I had Python call 'gm' to identify the number of
colors in each image, then inspect the output and handle the image
accordingly.  I'll bet PIL could do a similar thing, but in my case I
was examining DPX files, which PIL can't handle.  Either approach will
most likely take a bit of time unless you spread the work over several
machines.
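
For illustration, calling 'gm' from Python to count colors might look
something like the sketch below.  It is not the script I actually
used.  It assumes 'gm' is on the PATH and that the "%k" format escape
reports the number of unique colors; the function names and the
max_colors cutoff are hypothetical and would have to be chosen by
looking at real output.

```python
import subprocess


def parse_color_count(output):
    """Extract the integer colour count from the text printed by gm."""
    return int(output.split()[0])


def count_colors(path):
    """Run 'gm identify' on one file and return its unique-colour count."""
    result = subprocess.run(
        ["gm", "identify", "-format", "%k", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gm reports a failure
    )
    return parse_color_count(result.stdout)


def looks_uniform(n_colors, max_colors=5000):
    """Hypothetical rule: few distinct colours suggests a blank or solid page."""
    return n_colors <= max_colors
```

Each call starts a new 'gm' process, which is part of why a run over
many files is slow on a single machine.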

~Sean



