[spambayes-dev] Latest CVS update, Ocrad for Windows

Michele Belloli mb at symbolic.it
Mon Sep 4 10:26:42 CEST 2006


skip at pobox.com wrote:
> I updated the OCR capabilities a bit more today.  I added more intelligent
> assembly of split images into a single image after noticing that the
> spammers don't simply chop up multi-part GIF images horizontally.  I also
> added a couple extra options (ocrad_scale and ocrad_charset) which control
> the image scaling factor (default is 2) and character set (default is
> "ascii") Ocrad uses.  Scaling the image by a factor of 2 was a pretty
> obvious win:
>
>     false positive percentages
>         0.000  0.000  tied          
>         0.000  0.000  tied          
>         0.000  0.000  tied          
>         0.000  0.000  tied          
>         0.000  0.000  tied          
>
>     won   0 times
>     tied  5 times
>     lost  0 times
>
>     total unique fp went from 0 to 0 tied          
>     mean fp % went from 0.0 to 0.0 tied          
>
>     false negative percentages
>         4.213  4.213  tied          
>         1.404  0.843  won    -39.96%
>         3.371  2.809  won    -16.67%
>         2.528  2.247  won    -11.12%
>         4.213  3.652  won    -13.32%
>
>     won   4 times
>     tied  1 times
>     lost  0 times
>
>     total unique fn went from 56 to 49 won    -12.50%
>     mean fn % went from 3.14606741573 to 2.75280898876 won    -12.50%
>
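
As a quick check of the arithmetic in the comparison above, the percentages
in the right-hand column appear to be plain relative changes against the
first column. This is a minimal sketch, not the code of the comparison
script itself; the function name relative_change is made up for illustration.

```python
def relative_change(before, after):
    """Return the change from before to after as a percentage of before.

    Returns None when the baseline is 0, where a relative change is
    undefined (the table above reports 0 -> 0 as "tied").
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


# Total unique false negatives quoted above: 56 before, 49 after.
print("%.2f%%" % relative_change(56, 49))        # -12.50%
# Second false negative row quoted above: 1.404 before, 0.843 after.
print("%.2f%%" % relative_change(1.404, 0.843))  # -39.96%
```

A negative value means the second run had fewer errors, which the table
labels "won".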
> Scaling by a factor of three was even better in the false negative
> department but regressed a bit in the false positive category so I checked
> Options.py in with a default scaling factor of 2.  A couple things could
> stand to be further tested:
>
>     * I have no idea how good Ocrad's scaling algorithm is.  It's possible
>       that PIL or NetPBM's scaling code is better.  If so, it would make
>       sense to scale the images before feeding to Ocrad.
>
>     * The images I've seen so far were all plain English, so I blindly made
>       ascii the default charset.  The other choices were iso-8859-9 and
>       iso-8859-15.  I simply assumed ascii would be the most appropriate
>       default, but didn't test it.
>
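
For anyone experimenting with the two settings discussed above, here is a
rough sketch of how a scale factor and character set could be turned into an
Ocrad command line. It assumes the ocrad_scale and ocrad_charset options
correspond to Ocrad's --scale and --charset flags (check "ocrad --help" for
your build). The function name and the way SpamBayes actually invokes Ocrad
are not taken from the source; they are assumptions.

```python
# Charsets named in the message above; other Ocrad builds may accept more.
KNOWN_CHARSETS = ("ascii", "iso-8859-9", "iso-8859-15")


def build_ocrad_command(image_path, scale=2, charset="ascii", ocrad="ocrad"):
    """Build an argument list for running Ocrad on a PNM image.

    Hypothetical helper: the defaults mirror the ones described above
    (scale 2, charset "ascii"), but this is not SpamBayes code.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    if charset not in KNOWN_CHARSETS:
        raise ValueError("unknown charset: %r" % (charset,))
    return [ocrad,
            "--scale=%d" % scale,
            "--charset=%s" % charset,
            image_path]


print(build_ocrad_command("spam.pnm"))
print(build_ocrad_command("spam.pnm", scale=3, charset="iso-8859-15"))
```

The resulting list can be handed to subprocess.run() once Ocrad is on the
PATH. Building the list separately makes it easy to try different scale
factors without touching the rest of the pipeline.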
> Finally, I put together a really simpleminded Ocrad-for-Windows release
> based upon the ocrad.exe binary that Tony built.  Check the Files section of
> the SpamBayes project site:
>
>     http://sourceforge.net/project/showfiles.php?group_id=61702
>
> and grab ocrad-cygwin.
>
> There are a few caveats:
>
>     1. I don't do Windows.  (No, really, I don't, strange as that may seem.)
>        This is no fancy-schmancy point-and-shoot Windows installer.  It's
>        just a simple zip file with the Ocrad 0.15 distribution, Tony's .exe
>        file and the patch he applied to the source.
>
>     2. I don't do Windows.  The code I've written so far has been done
>        entirely on my Mac.  I've made no obvious concessions to portability.
>        That said, I hope portability issues won't be daunting for any early
>        adopters.
>
>     3. I don't do Windows.  If you have problems it won't do you any good to
>        mail me directly.  Post about problems on the SpamBayes bug tracker:
>
>            http://sourceforge.net/tracker/?group_id=61702&atid=498103
>
>     4. If you do Windows you will need PIL to take advantage of the recent
>        changes:
>
>            http://www.pythonware.com/products/pil/
>
>        (unless you want to put hair on your chest and build NetPBM on
>        Windows).  Fredrik Lundh provides prebuilt Windows versions of PIL.
>        Grab the one appropriate for the version of Python you have
>        installed.
>
>     5. If you do Windows (or any other platform for that matter), feedback
>        to the lists about successes and failures would be helpful.
>
> Cheers,
>
> Skip
>
Hi,

I'm very interested in this OCR work and in how SpamBayes analyzes image
spam.
There is now a new kind of image spam that uses animated images. I've
received a lot of this "animated spam" lately, so it may become very
common before long.
You can find a brief description of it here:
http://www.viruslist.com/en/weblog?weblogid=196822613

I would like to ask how your OCR handles this kind of image.
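
To make the question concrete: an animated GIF is simply a GIF file holding
more than one image block, so a filter first has to notice that there are
several frames before it can decide which of them to OCR. The sketch below
counts the frames using only the standard library. It is an illustration of
the file format, not a description of what SpamBayes does, and the function
names are made up.

```python
def _skip_sub_blocks(data, pos):
    """Skip a chain of GIF data sub-blocks; return the position after it."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:          # a zero-length sub-block ends the chain
            return pos
        pos += size


def count_gif_frames(data):
    """Count image blocks in GIF bytes (more than one means animated).

    Raises ValueError for non-GIF input and IndexError for truncated data.
    """
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    packed = data[10]          # flags byte of the logical screen descriptor
    pos = 13                   # 6-byte header + 7-byte screen descriptor
    if packed & 0x80:          # global color table present
        pos += 3 * (2 ** ((packed & 0x07) + 1))
    frames = 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:      # trailer: end of file
            break
        if block == 0x2C:      # image descriptor: one frame
            frames += 1
            packed = data[pos + 8]
            pos += 9
            if packed & 0x80:  # local color table present
                pos += 3 * (2 ** ((packed & 0x07) + 1))
            pos += 1           # LZW minimum code size byte
        elif block == 0x21:    # extension block: skip its label byte
            pos += 1
        else:
            raise ValueError("unexpected block type 0x%02X" % block)
        pos = _skip_sub_blocks(data, pos)
    return frames
```

Typical use would be count_gif_frames(open(path, "rb").read()) > 1 to flag
an attachment as animated.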

Thank you very much for your time.

Regards

-- 
Michele  Belloli
Research & Development Dept.


Symbolic - Network Security Distributor
http://www.symbolic.it

eXtensiveControl: the new Content Filtering solution for SMEs
http://www.extensivecontrol.it/


