Using PIL to find separator pages

Sat Jun 2 14:06:53 EDT 2007

Larry Bates wrote:
> Steve Holden wrote:
>> Larry Bates wrote:
>>> Steve Holden wrote:
>>>> Larry Bates wrote:
>>>>> I have a project that I wanted to solicit some advice
>>>>> on from this group.  I have millions of pages of scanned
>>>>> documents with each page in an individual .JPG file.
>>>>> When the documents were scanned the people that did
>>>>> the scanning put a colored (hot pink) separator page
>>>>> between the individual documents.  I was wondering if
>>>>> there was any way to utilize PIL to scan through the
>>>>> individual files, look at some small section on the
>>>>> page, and determine if it is a separator page by
>>>>> somehow comparing the color to the separator page
>>>>> color?  I realize that this would be some sort of
>>>>> percentage match where 100% would be a perfect match
>>>>> and any number lower would indicate that it was less
>>>>> likely that it was a separator page.
>>>>>
>>>>> Thanks in advance for any thoughts or advice.
>>>>>
>>>> I suspect the easiest way would be to select a few small patches of each
>>>> image and average the color values of the pixels, then normalize to hue
>>>> rather than RGB.
>>>>
>>>> Close enough to the hue you want (and you could include saturation and
>>>> intensity too, if you felt like it) across several areas of the page
>>>> would be a hit for a separator.
>>>>
>>>> regards
>>>>  Steve
>>> Steve,
>>>
>>> I'm completely lost on how to proceed.  I don't know how to average color
>>> values, normalize to hue...  Any guidance you could give would be greatly
>>> appreciated.
>>>
>>> Thanks in advance,
>>> Larry
>> I'd like to help but I don't have any sample code to hand. Maybe someone
>> who does could give you more of a clue. Let's hope so, anyway ...
>>
>> regards
>>  Steve
> 
> I think I've come up with something that will work.  I use PIL
> Image.getcolors() to get colors and take the top 10 colors of my
> background page.  I then calculate the average of the R, G, B
> components.  That becomes my reference.  Then I read a page and
> make the same calculation.  I then calculate the absolute value
> of the difference of R, G, B of the two values.  Summing those
> together gives something like the average difference between
> the two average colors (at least that is what I think it does).
> This seems to give me small numbers when the pages are the same
> and large numbers when they are different.  It isn't super fast
> but it is working.
> 
> Thanks for pushing me in the right direction.
> 
> -Larry

Well done! Thanks for letting me know that the basic approach was correct.
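
For anyone finding this thread in the archives, here is a minimal sketch
of the calculation described above. It is not Larry's actual code. It
assumes a count-weighted average over the most common colors (the post
does not say whether the average was weighted), and the names
average_top_colors and color_distance are made up for illustration.

```python
from PIL import Image


def average_top_colors(img, top_n=10):
    """Count-weighted mean (R, G, B) of the top_n most frequent colors."""
    rgb = img.convert("RGB")
    # getcolors() returns None if the image has more than maxcolors
    # distinct colors, so allow one color per pixel in the worst case.
    colors = rgb.getcolors(maxcolors=rgb.size[0] * rgb.size[1])
    # Each entry is (count, (r, g, b)); keep the most frequent ones.
    top = sorted(colors, key=lambda c: c[0], reverse=True)[:top_n]
    total = sum(count for count, _ in top)
    return tuple(
        sum(count * pixel[i] for count, pixel in top) / total
        for i in range(3)
    )


def color_distance(a, b):
    """Sum of absolute per-channel differences between two RGB triples."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

The reference would be computed once from a known separator scan, and a
page flagged when its distance to the reference falls below a threshold
chosen by trying it on a sample of real scans. Shrinking each page with
Image.thumbnail() before counting colors may help with speed.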

regards
  Steve
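
P.S. The hue idea from earlier in the thread could be sketched along the
following lines. This is only an illustration under stated assumptions:
the patch boxes, the tolerances, and the function names (patch_hsv,
hue_distance, looks_like_separator) are all hypothetical and would need
tuning against real scans.

```python
import colorsys

from PIL import Image, ImageStat


def patch_hsv(img, box):
    """Average the pixels in box (left, upper, right, lower), return HSV."""
    r, g, b = ImageStat.Stat(img.convert("RGB").crop(box)).mean
    # colorsys works on floats in the range 0.0 to 1.0.
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def hue_distance(h1, h2):
    """Distance between two hues on the color wheel (hue wraps at 1.0)."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)


def looks_like_separator(img, ref_hue, boxes, hue_tol=0.05, min_sat=0.3):
    """True if every sampled patch is saturated and close to ref_hue."""
    for box in boxes:
        h, s, _v = patch_hsv(img, box)
        # Hue is meaningless for near-white or near-gray patches, so
        # require some saturation before trusting the hue comparison.
        if s < min_sat or hue_distance(h, ref_hue) > hue_tol:
            return False
    return True
```

Sampling a few small patches rather than the whole page keeps the work
per image low, and comparing hue rather than raw RGB should be less
sensitive to differences in scan brightness.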
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden