[Spambayes-checkins] spambayes/spambayes ImageStripper.py,1.4,1.5
Skip Montanaro
montanaro at users.sourceforge.net
Sun Sep 10 00:18:31 CEST 2006
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv30280
Modified Files:
ImageStripper.py
Log Message:
Add crude support for multi-frame GIFs to PIL_decode_parts(). I made a few
assumptions:

    1. NetPBM support will eventually be ripped out. Everyone should be
       able to install PIL. Consequently, no attempt to update the NetPBM
       code was made.

    2. The image with the fewest background pixels is probably the one
       containing the text. GIF image frames can be just part of the
       overall image, so this assumption will be violated in the future.
       For the time being it appears most spammers have a hard time setting
       frame duration properly (are they trying to induce epileptic seizures
       or sell stocks?), let alone carving up frames into pieces. We'll
       cross that bridge when we come to it.

    3. If an image's info dict doesn't have a "duration" key it's assumed to
       be a single-frame image.
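
The heuristic in assumption 2 can be sketched without PIL. This is a rough
illustration only, not the code in this check-in: pick_text_frame() is a
hypothetical helper, and each "histogram" is assumed to be a plain list of
per-value pixel counts, like the list returned by a PIL image's histogram()
method. The largest count in a histogram is treated as the number of
background pixels, and the frame where that count is smallest wins.

```python
def pick_text_frame(histograms):
    """Return the index of the frame with the fewest background pixels.

    Hypothetical helper for illustration. Each histogram is a list of
    pixel counts, one entry per pixel value. Returns None when there
    are no frames.
    """
    best_index = None
    best_bg = float("inf")
    for index, hist in enumerate(histograms):
        # Assume the most frequent pixel value is the background, so
        # its count is the number of background pixels in this frame.
        bg = max(hist)
        if bg < best_bg:
            best_index = index
            best_bg = bg
    return best_index


# Made-up histograms for three frames of 100 pixels each.
frames = [
    [95, 5, 0, 0],    # nearly blank frame
    [60, 30, 10, 0],  # frame with the most non-background pixels
    [90, 10, 0, 0],   # nearly blank frame
]
chosen = pick_text_frame(frames)
```

With the made-up counts above, the second frame (index 1) is chosen because
its largest histogram bucket is the smallest of the three.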
Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** ImageStripper.py 14 Aug 2006 02:58:11 -0000 1.4
--- ImageStripper.py 9 Sep 2006 22:18:28 -0000 1.5
***************
*** 22,26 ****
  try:
!     from PIL import Image
  except ImportError:
      Image = None
--- 22,26 ----
  try:
!     from PIL import Image, ImageSequence
  except ImportError:
      Image = None
***************
*** 189,192 ****
--- 189,219 ----
              continue
          else:
+             # Spammers are now using GIF image sequences. From examining a
+             # miniscule set of multi-frame GIFs it appears the frame with
+             # the fewest number of background pixels is the one with the
+             # text content.
+
+             if "duration" in image.info:
+                 # Big assumption? I don't know. If the image's info dict
+                 # has a duration key assume it's a multi-frame image. This
+                 # should save some needless construction of pixel
+                 # histograms for single-frame images.
+                 bgpix = 1e17 # ridiculously large number of pixels
+                 try:
+                     for frame in ImageSequence.Iterator(image):
+                         # Assume the pixel with the largest value is the
+                         # background.
+                         bg = max(frame.histogram())
+                         if bg < bgpix:
+                             image = frame
+                             bgpix = bg
+                 # I've empirically determined:
+                 # * ValueError => GIF image isn't multi-frame.
+                 # * IOError => Decoding error
+                 except IOError:
+                     tokens.add("invalid-image:%s" % part.get_content_type())
+                     continue
+                 except ValueError:
+                     pass
              image = image.convert("RGB")