Reading image dimensions with PIL

Will McGugan news at NOwillmcguganSPAM.com
Wed May 18 05:02:11 EDT 2005


Dave Brueck wrote:

> 
> 
> If you're tossing images that are too _small_, is there any benefit to 
> not downloading the whole image, checking it, and then throwing it away?

It's a 'webscraper' app that downloads images based on search criteria. 
The user may want only images above 640x480, although the general case 
will be something like 200x200, to avoid downloading thumbnails.

> 
> Checking just the first 1K probably won't save you too much time unless 
> you're over a modem. Are you using a byte-range HTTP request to pull 
> down the images or just a normal GET (via e.g. urllib)? If you're not 
> using a byte-range request, then all of the data is already on its way 
> so maybe you could go ahead and get it all.

I'm not familiar with byte-range requests. Is this a standard feature of 
webservers? I know there will be more than one K in the pipeline if I do 
a read, but if I close the file object from urllib it will stop the 
download if there is data remaining, won't it?
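
For reference, a byte-range request is an ordinary GET carrying a `Range` header. A server that supports it answers 206 Partial Content with only the requested bytes; one that does not ignores the header and sends the full body with 200. The following is a minimal Python 3 sketch using `urllib.request`; the helper names and the default of 1024 bytes are assumptions for illustration, not part of the scraper discussed here.

```python
import urllib.request


def build_range_request(url, nbytes=1024):
    """Build a GET request asking only for the first nbytes of the resource."""
    # Range offsets are inclusive, so the first 1024 bytes are 0-1023.
    header = "bytes=0-%d" % (nbytes - 1)
    return urllib.request.Request(url, headers={"Range": header})


def fetch_first_bytes(url, nbytes=1024):
    """Return (status, data) for the first nbytes of url."""
    req = build_range_request(url, nbytes)
    with urllib.request.urlopen(req) as resp:
        # 206: the server honoured the range. 200: it is sending everything,
        # so we still read only nbytes and close the connection.
        return resp.status, resp.read(nbytes)
```

Checking the status code is what tells you whether the range was honoured, so the caller can decide whether closing the connection early actually saved any bandwidth.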

> 
> But hey, if your current approach works... :) It _is_ a bit 
> unconventional, so to reduce the risk you could test it on a decent mix 
> of image types (normal JPEG, progressive JPEG, normal & progressive GIF, 
> png, etc.) - just to make sure  PIL is able to handle partial data for 
> all different types you might encounter.
> 
> Also, if PIL can't handle the partial data, can you reliably detect that 
> scenario? If so, you could detect that case and use the 
> download-it-all-and-check approach as a failsafe.

The PIL code worked with most of the images I threw at it (just jpegs). 
If there was no 'size' attribute then I just continued to download the 
entire image. It may have caused a memory leak though; with this code in 
place, memory usage increased continuously.
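
The fallback described here (check the first chunk, keep reading only if no size was found) can be sketched independently of PIL. This is a minimal Python 3 sketch; `probe_size`, `parse_size`, `chunk_size` and `max_bytes` are hypothetical names, and `parse_size` stands in for whatever header parser is used, returning None until it has seen enough data.

```python
import io


def probe_size(stream, parse_size, chunk_size=1024, max_bytes=65536):
    """Read from stream in chunks until parse_size(buffer) returns a size.

    Returns (size, buffer). size is None if nothing was found within
    max_bytes or before the stream ended.
    """
    buf = b""
    while len(buf) < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # stream exhausted
        buf += chunk
        size = parse_size(buf)  # re-parse everything read so far
        if size is not None:
            return size, buf
    return None, buf


# Usage with an in-memory stream and a stand-in parser that "finds" the
# size only once at least 3000 bytes have arrived.
fake_parser = lambda buf: (640, 480) if len(buf) >= 3000 else None
size, buf = probe_size(io.BytesIO(b"\0" * 5000), fake_parser)
```

Re-parsing the whole buffer on every chunk is wasteful for large inputs, but with a cap of a few tens of kilobytes it keeps the sketch simple. The returned buffer lets the caller continue the download without re-reading what was already fetched.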

Actually, this may all be moot now. Originally I looked at reading the 
image dimensions from the jpeg header, but that turned out to be 
non-trivial and I gave up. Fortunately I found some Perl code that does 
it, and converted it to Python (and I don't even know Perl!). Here's the 
code if anyone is interested:

import struct


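def GetJpegSize(data):

     idata = iter(data)

     width = None
     height = None

     try:

         # A JPEG stream must start with the SOI marker, FF D8.
         B1 = ord(idata.next())
         B2 = ord(idata.next())

         if B1 != 0xFF or B2 != 0xD8:
             return -1, -1

         while True:

             byte = ord(idata.next())

             # Scan forward to the next marker prefix (FF)...
             while byte != 0xFF:
                 byte = ord(idata.next())

             # ...and skip any FF padding; 'byte' ends up as the marker code.
             while byte == 0xFF:
                 byte = ord(idata.next())

             if byte >= 0xc0 and byte <= 0xc3:
                 # SOF0-SOF3 frame header: skip the 2-byte segment length
                 # and 1-byte sample precision, then read height and width.
                 idata.next()
                 idata.next()
                 idata.next()
                 height, width = struct.unpack(
                     '>HH', "".join(idata.next() for b in range(4)))
                 break
             else:
                 # Any other segment: its 2-byte length includes the length
                 # field itself, so skip the remaining length - 2 bytes.
                 offset = struct.unpack(
                     '>H', idata.next() + idata.next())[0] - 2
                 for _ in xrange(offset):
                     idata.next()

     except StopIteration:
         # Ran out of data before finding a frame header.
         pass

     return width, height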


if __name__ == "__main__":

     first_k = file("test.jpg","rb").read(1024)

     print GetJpegSize(first_k)


Returns (-1, -1) for a non-jpeg, or (None, None) if the size wasn't 
contained in the data supplied (some jpegs have embedded thumbnails), or 
(width, height) if the dimensions were found.
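
The code above relies on Python 2 features (`iterator.next()`, `xrange`, `file()`, the print statement, and iterating a str one byte at a time). The following is a minimal sketch of the same scan for Python 3, where indexing a `bytes` object yields integers. The name `get_jpeg_size` is hypothetical, and this is a port under those assumptions rather than the code that was tested for this post.

```python
import struct


def get_jpeg_size(data):
    """Return (width, height) read from the JPEG frame header in data.

    Returns (-1, -1) if data does not start with the JPEG SOI marker,
    or (None, None) if no frame header lies within the bytes supplied.
    """
    if len(data) < 2:
        return None, None
    if data[:2] != b"\xff\xd8":
        return -1, -1
    pos = 2
    try:
        while True:
            # Scan to the next FF, then past any run of FF padding bytes.
            while data[pos] != 0xFF:
                pos += 1
            while data[pos] == 0xFF:
                pos += 1
            marker = data[pos]
            pos += 1
            if 0xC0 <= marker <= 0xC3:
                # Skip 2-byte segment length and 1-byte sample precision.
                height, width = struct.unpack(">HH", data[pos + 3:pos + 7])
                return width, height
            # Any other segment: its length field counts itself, so jumping
            # by that length lands just past the segment.
            (length,) = struct.unpack(">H", data[pos:pos + 2])
            pos += max(length, 2)
    except (IndexError, struct.error):
        # Ran off the end of the supplied bytes.
        return None, None
```

It would be called the same way, for example on `open("test.jpg", "rb").read(1024)`, and keeps the three return conventions described above.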

And the original source: http://wiki.tcl.tk/757


Thanks,

Will


-- 
http://www.willmcgugan.com
"".join( [ {'*':'@','^':'.'}.get(c,None) or chr(97+(ord(c)-84)%26) for c 
in "jvyy*jvyyzpthtna^pbz" ] )


