[spambayes-dev] Latest CVS update, Ocrad for Windows
Michele Belloli
mb at symbolic.it
Mon Sep 4 10:26:42 CEST 2006
skip at pobox.com wrote:
> I updated the OCR capabilities a bit more today. I added more intelligent
> assembly of split images into a single image after noticing that the
> spammers don't simply chop up multi-part GIF images horizontally. I also
> added a couple extra options (ocrad_scale and ocrad_charset) which control
> the image scaling factor (default is 2) and character set (default is
> "ascii") Ocrad uses. Scaling the image by a factor of 2 was a pretty
> obvious win:
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
>
> won 0 times
> tied 5 times
> lost 0 times
>
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
>
> false negative percentages
> 4.213 4.213 tied
> 1.404 0.843 won -39.96%
> 3.371 2.809 won -16.67%
> 2.528 2.247 won -11.12%
> 4.213 3.652 won -13.32%
>
> won 4 times
> tied 1 times
> lost 0 times
>
> total unique fn went from 56 to 49 won -12.50%
> mean fn % went from 3.14606741573 to 2.75280898876 won -12.50%
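The "won" percentages in the comparison above appear to be relative changes, (after - before) / before. A minimal sketch that reproduces them from the quoted figures, assuming that formula; the function name relative_change is hypothetical and is not taken from the SpamBayes test tools:

```python
def relative_change(before, after):
    """Percent change from before to after (negative = fewer errors)."""
    if before == after:
        return 0.0  # reported as "tied" in the table
    if before == 0:
        return float("inf")  # assumption: no finite ratio when starting from zero
    return (after - before) / before * 100.0


# False negative percentages quoted above: (scale 1, scale 2)
fn_pairs = [(4.213, 4.213), (1.404, 0.843), (3.371, 2.809),
            (2.528, 2.247), (4.213, 3.652)]

for before, after in fn_pairs:
    change = relative_change(before, after)
    verdict = "tied" if change == 0 else ("won" if change < 0 else "lost")
    print("%.3f %.3f %s %+.2f%%" % (before, after, verdict, change))

# Total unique false negatives went from 56 to 49
print("total %+.2f%%" % relative_change(56, 49))
```

Because the mean is the total divided by a fixed message count, the mean and the total change by the same relative amount.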
>
> Scaling by a factor of three was even better in the false negative
> department but regressed a bit in the false positive category so I checked
> Options.py in with a default scaling factor of 2. A couple things could
> stand to be further tested:
>
> * I have no idea how good Ocrad's scaling algorithm is. It's possible
> that PIL or NetPBM's scaling code is better. If so, it would make
> sense to scale the images before feeding to Ocrad.
>
> * The images I've seen so far were all plain English, so I blindly made
> ascii the default charset. The other choices were iso-8859-9 and
> iso-8859-15. I simply assumed ascii would be the most appropriate
> default, but didn't test it.
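The first point above, scaling an image before handing it to Ocrad, can be sketched without any imaging library. This is only an illustration under stated assumptions: it uses plain pixel replication (not the algorithm used by Ocrad, PIL, or NetPBM), it works on a bitmap held as a list of 0/1 rows, and the helper names scale_bitmap and to_plain_pbm are hypothetical. The output is plain PBM text, one of the PNM formats Ocrad reads.

```python
def scale_bitmap(rows, factor):
    """Scale a 2-D list of 0/1 pixels by an integer factor (pixel replication)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    scaled = []
    for row in rows:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [pixel for pixel in row for _ in range(factor)]
        scaled.extend(list(wide) for _ in range(factor))
    return scaled


def to_plain_pbm(rows):
    """Serialize 0/1 pixels as a plain (P1) PBM document; 1 means black."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    lines = ["P1", "%d %d" % (width, height)]
    lines.extend(" ".join(str(pixel) for pixel in row) for row in rows)
    return "\n".join(lines) + "\n"


glyph = [[0, 1, 0],
         [1, 1, 1]]
doubled = scale_bitmap(glyph, 2)   # same factor as the ocrad_scale default
pbm_text = to_plain_pbm(doubled)
print(pbm_text)
```

A smoothing resampler from PIL or NetPBM would replace scale_bitmap in a real comparison; replication is used here only because it is easy to verify by eye.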
>
> Finally, I put together a really simpleminded Ocrad-for-Windows release
> based upon the ocrad.exe binary that Tony built. Check the Files section of
> the SpamBayes project site:
>
> http://sourceforge.net/project/showfiles.php?group_id=61702
>
> and grab ocrad-cygwin.
>
> There are a few caveats:
>
> 1. I don't do Windows. (No, really, I don't, strange as that may seem.)
> This is no fancy-schmancy point-and-shoot Windows installer. It's
> just a simple zip file with the Ocrad 0.15 distribution, Tony's .exe
> file and the patch he applied to the source.
>
> 2. I don't do Windows. The code I've written so far has been done
> entirely on my Mac. I've made no obvious concessions to portability.
> That said, I hope portability issues won't be daunting for any early
> adopters.
>
> 3. I don't do Windows. If you have problems it won't do you any good to
> mail me directly. Post about problems on the SpamBayes bug tracker:
>
> http://sourceforge.net/tracker/?group_id=61702&atid=498103
>
> 4. If you do Windows you will need PIL to take advantage of the recent
> changes:
>
> http://www.pythonware.com/products/pil/
>
> (unless you want to put hair on your chest and build NetPBM on
> Windows). Fredrik Lundh provides prebuilt Windows versions of PIL.
> Grab the one appropriate for the version of Python you have
> installed.
>
> 5. If you do Windows (or any other platform for that matter), feedback
> to the lists about successes and failures would be helpful.
>
> Cheers,
>
> Skip
Hi,
I'm very interested in this OCR work and in how SpamBayes analyzes image
spam.
A new kind of image spam uses animated images. I've received a lot of
this "animated spam" lately, so it may become very common soon.
A brief description is available here:
http://www.viruslist.com/en/weblog?weblogid=196822613
How does your OCR code handle this kind of image?
Thanks a lot for your time.
Regards,
--
Michele Belloli
Research & Development Dept.
Symbolic - Network Security Distributor
http://www.symbolic.it
eXtensiveControl: the new Content Filtering solution for SMEs
http://www.extensivecontrol.it/