Using PIL to find separator pages
Steve Holden
steve at holdenweb.com
Sat Jun 2 14:06:53 EDT 2007
Larry Bates wrote:
> Steve Holden wrote:
>> Larry Bates wrote:
>>> Steve Holden wrote:
>>>> Larry Bates wrote:
>>>>> I have a project that I wanted to solicit some advice
>>>>> on from this group. I have millions of pages of scanned
>>>>> documents with each page in an individual .JPG file.
>>>>> When the documents were scanned the people that did
>>>>> the scanning put a colored (hot pink) separator page
>>>>> between the individual documents. I was wondering if
>>>>> there was any way to utilize PIL to scan through the
>>>>> individual files, look at some small section on the
>>>>> page, and determine if it is a separator page by
>>>>> somehow comparing the color to the separator page
>>>>> color? I realize that this would be some sort of
>>>>> percentage match where 100% would be a perfect match
>>>>> and any number lower would indicate that it was less
>>>>> likely that it was a separator page.
>>>>>
>>>>> Thanks in advance for any thoughts or advice.
>>>>>
>>>> I suspect the easiest way would be to select a few small patches of each
>>>> image and average the color values of the pixels, then normalize to hue
>>>> rather than RGB.
>>>>
>>>> Close enough to the hue you want (and you could include saturation and
>>>> intensity too, if you felt like it) across several areas of the page
>>>> would be a hit for a separator.
>>>>
>>>> regards
>>>> Steve
>>> Steve,
>>>
>>> I'm completely lost on how to proceed. I don't know how to average color
>>> values, normalize to hue... Any guidance you could give would be greatly
>>> appreciated.
>>>
>>> Thanks in advance,
>>> Larry
>> I'd like to help but I don't have any sample code to hand. Maybe someone
>> who does could give you more of a clue. Let's hope so, anyway ...
>>
>> regards
>> Steve
>
> I think I've come up with something that will work. I use PIL
> Image.getcolors() to get colors and take the top 10 colors of my
> background page. I then calculate the average of the R, G, B
> components. That becomes my reference. Then I read a page and
> make the same calculation. I then calculate the absolute value
> of the difference of R, G, B of the two values. Summing those
> together gives something like the average difference between
> the two average colors (at least that is what I think it does).
> This seems to give me small numbers when the pages are the same
> and large numbers when they are different. It isn't super fast
> but it is working.
>
> Thanks for pushing me in the right direction.
>
> -Larry
Well done! Thanks for letting me know that the basic approach was correct.
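
For anyone reading this in the archive, here is a minimal sketch of the calculation Larry describes above. It is not his actual code. The function names, the unweighted average over the top colors, and the choice of maxcolors are all assumptions; only Image.getcolors() itself comes from his message.

```python
def average_of_top_colors(colors, top_n=10):
    """Average the R, G, B components of the most frequent colors.

    `colors` is a list of (count, (r, g, b)) tuples, the shape that
    PIL's Image.getcolors() returns for an RGB image.
    The average is unweighted (an assumption; counts only pick the top N).
    """
    top = sorted(colors, key=lambda item: item[0], reverse=True)[:top_n]
    n = len(top)
    return tuple(sum(rgb[i] for _, rgb in top) / n for i in range(3))


def color_distance(avg_a, avg_b):
    """Sum of absolute per-channel differences (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(avg_a, avg_b))


def page_average(path, top_n=10):
    """Load one page image and return its averaged top colors."""
    # Imported here so the two helpers above can be used without PIL.
    from PIL import Image

    with Image.open(path) as img:
        rgb = img.convert("RGB")
        # getcolors() returns None if the image has more distinct colors
        # than maxcolors, so allow one color per pixel.
        colors = rgb.getcolors(maxcolors=rgb.size[0] * rgb.size[1])
    return average_of_top_colors(colors, top_n)
```

The reference would be page_average() of a known separator page, and a page would count as a separator when color_distance() to that reference falls below some cutoff. The cutoff is something to tune against real scans; no value is given in this thread.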
regards
Steve
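
As a postscript, the patch-and-hue idea suggested earlier in the thread could be sketched roughly as follows. This is an illustration under stated assumptions, not tested on real scans: the function names, the 15 degree hue tolerance and the 0.3 saturation floor are invented for the example. Each patch is just a sequence of (r, g, b) pixels, which might come from regions cut out of a page with PIL's Image.crop().

```python
import colorsys


def mean_rgb(pixels):
    """Average a sequence of (r, g, b) pixels, each channel 0-255."""
    pixels = list(pixels)
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def hue_and_saturation(rgb):
    """Return (hue in degrees, saturation 0-1) for an RGB color."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0, s


def hue_distance(h1, h2):
    """Shortest distance between two hues on the 360 degree circle."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def is_separator(patches, target_rgb, max_hue_diff=15.0, min_saturation=0.3):
    """True if every patch averages to a color near the target hue.

    The saturation floor rejects white or grey patches, whose hue is
    meaningless. Both thresholds are illustrative guesses.
    """
    target_hue, _ = hue_and_saturation(target_rgb)
    for pixels in patches:
        hue, sat = hue_and_saturation(mean_rgb(pixels))
        if sat < min_saturation:
            return False
        if hue_distance(hue, target_hue) > max_hue_diff:
            return False
    return True
```

Requiring every patch to match keeps a page with one pink stamp or logo from counting as a separator; a looser rule (most patches match) would be an equally reasonable choice.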
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden