[Spambayes] Windows compatibility - OCR [was: Unwanted

Sat Nov 4 15:11:04 CET 2006

On Sat, 2006-11-04 at 14:19 +0100, Vibe Grevsen wrote:
> Hi friends,
>   
> >> OCR code's now been tweaked and tested to work in both WinXP and
> >> Win9x.
> >> This should work in unix as well.
> >>  
> >> Here is a summary:
> >>  
> >> 1. Put ocrad 0.16 in the path
> >
> > As a note, for Windows you need a copy of ocrad with Skip's patch that
> > opens pnm files in binary mode; otherwise ocrad will fail on a lot of
> > files.
> 
> Actually you're probably referring to my "patch"? (Ocrad/CygWin1.dll)
> http://mail.python.org/pipermail/spambayes/2006-October/019983.html
> 
> If you have MinGW experience - which I don't - I think you can compile
> an exe-only build which doesn't need the dll. But then I don't know if it is
> actually working because of the POSIX emulation or because they changed the
> source. (I did not...)
> 
> You're right, Skip pointed it out in the ocrad forum, but the developer was
> reluctant to change this then, so I don't actually know why 0.16 is working...
> I just know it is, which is fine for me.

I was referring to this mail
http://www.nabble.com/Ocrad-opens-files-in-text-mode-t2485744.html

If you use the cygwin emulation layer you probably have no problem with
binary/text files. I have no experience with mingw, but I compiled ocrad
with it and I'm using the result (without the cygwin dll) with no problem;
you do have to open the pnm file in binary mode, though.
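
To illustrate the binary-mode point, here is a minimal Python sketch. It is
not ocrad's code; the helper name write_and_read_pgm is hypothetical and the
tiny 2x2 image is made up. It writes a binary PGM (P5) file and reads it back
with "rb", so that pixel bytes such as 0x0D, 0x0A and 0x1A, which a text-mode
read on Windows can translate or treat as end of file, come back unchanged.

```python
import os
import tempfile


def write_and_read_pgm(pixels: bytes, width: int, height: int) -> bytes:
    """Write a binary PGM (P5) file and return its raw bytes read back."""
    header = b"P5\n%d %d\n255\n" % (width, height)
    fd, path = tempfile.mkstemp(suffix=".pgm")
    try:
        # "wb": write bytes exactly, with no newline translation.
        with os.fdopen(fd, "wb") as f:
            f.write(header + pixels)
        # "rb": read bytes exactly; CR, LF and Ctrl-Z are ordinary data.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# Pixel values chosen because text mode on Windows may alter or truncate them.
pixels = bytes([0x0D, 0x0A, 0x1A, 0xFF])
data = write_and_read_pgm(pixels, 2, 2)
```

The same idea applies to any program that reads pnm data: the pixel area is
raw bytes, not text.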

> 
> 
> 
> > Have you tried other ocr programs?
> 
> No, not yet.
> 
> Tony Meyer suggested Tesseract:
> http://mail.python.org/pipermail/spambayes-dev/2006-September/003750.html
> but there seemed to be build issues... I haven't tried..

I built tesseract with no problem.
I have done a very quick test with it and it's difficult to use (at
least on Windows I was not able to get any results if the image wasn't
in the same folder as the executable).
I tested a few spam images and the results were poor.

> 
> I mailed with NoSpam Today! Support (SpamAssassin based) before I chose SB.
> They were doing research on FuzzyOcr and ImageInfo. Maybe we could ask
> again about their results. I believe FuzzyOcr is gocr-based?

Yes, they are using gocr. But as I said in my previous mail, it has its
own problems.

> 
> 
> 
> > I tried gocr and I think that its results are somewhat better, but version
> > 0.41 + pgm patch almost hangs
> 
> Ok, probably needs some tweaking then.
> Since the ocr is working with ocrad and - as you see below - I get very
> good results I will be moving on to the next area now.
You are lucky. My results are so-so. I probably get a reduction of
60-70% of spam with images (which in itself could be considered not bad),
but way too much spam is still not stopped.

I'm going to recheck my environment to see if something is wrong.

> 
> I think it is far more beneficial to do more research into the actual processing
> as you commented elsewhere than to start the whole testing/tweaking all over
> again with a new ocr engine. Of course that is just my opinion...
Yes and no. We need a decent ocr engine to start with; then we may focus
on better image manipulation.
At the moment spambayes has trouble with images for the following
reasons:

- PIL sometimes fails to handle the image. I'm still investigating the
issue, but the images seem reasonably correct (IE, Firefox and many
viewers, on linux and windows, are able to display them). It's quite
rare and not a big issue.
- ocr results are poor. The worst cases are when you get a sequence of
single chars (char space char space ...) or one long word. Both are
ignored by spambayes.
There are images which contain more than words and in this case we may
get no tokens.
In a few cases, if the colors used inside the image are changed you get a
different result.
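
To show why those two ocr failure modes leave nothing to train on, here is a
rough Python sketch of a length filter. It is a simplification and not the
actual spambayes tokenizer; the function name keep_tokens and the limits 3 and
12 are assumptions chosen for illustration.

```python
def keep_tokens(ocr_text, min_len=3, max_len=12):
    """Return lowercase words whose length lies in [min_len, max_len].

    min_len and max_len are assumed values for this sketch only.
    """
    words = ocr_text.lower().split()
    return [w for w in words if min_len <= len(w) <= max_len]


# Clean ocr output: every word survives the filter.
good = keep_tokens("Watch out here comes the big one")

# "char space char space ...": every piece is too short, nothing survives.
spaced = keep_tokens("w a t c h o u t")

# One long run with no spaces: too long, nothing survives.
fused = keep_tokens("watchoutherecomesthebigone")
```

With a filter like this, both the spaced and the fused output produce an
empty list, so such an image adds no word tokens at all.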

I have no knowledge of image processing, but I tried a few simple
operations (like scaling, sharpening, converting to gray, ...) and got no
better results. They were all quick tests and the results are in no way
conclusive.

> 
> 
> >> 5. Finally I suggest you change the default scale from 1 to 2 like in
> >> this line
> >>  
> >>         scale = options["Tokenizer", "ocrad_scale"] or 2
> >
> > changing this surely doesn't hurt, but ocrad_scale is already set to 2
> > in Options.py
> 
> Ok, I missed that. Don't know which one takes precedence:
> ImageStripper.py, Options.py or bayescustomize.ini.
From my understanding, Options.py sets the default values,
bayescustomize.ini contains the values chosen by the user, and in
ImageStripper.py the programmer may embed their own values, ignoring the
user's choice (joking).
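
The layering can be sketched in Python with the standard configparser module.
This is only an illustration of defaults, user overrides and an "or" fallback;
it is not how spambayes loads its options, and the names DEFAULTS,
load_options and effective_scale are hypothetical.

```python
import configparser

# Built-in defaults, playing the role of Options.py (illustrative value).
DEFAULTS = {"Tokenizer": {"ocrad_scale": "2"}}


def load_options(user_ini_text):
    """Load defaults first, then let the user's ini text override them."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read_string(user_ini_text)
    return parser


def effective_scale(parser, hardcoded_fallback=2):
    """Mimic 'options[...] or 2': the fallback applies only to a falsy value."""
    value = parser.getint("Tokenizer", "ocrad_scale")
    return value or hardcoded_fallback


no_override = effective_scale(load_options(""))
user_choice = effective_scale(load_options("[Tokenizer]\nocrad_scale = 3\n"))
zero_value = effective_scale(load_options("[Tokenizer]\nocrad_scale = 0\n"))
```

In this sketch the hard-coded fallback only matters when the option comes back
as 0; with a non-zero default already present, the "or 2" is redundant.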

> 
> With 2 you should get image tokens of this quality:
> 
> watch
> out
> here
> comes
> the
> big
> one!
> srrl
> about
> blow
> your
> minds
> add
> srrl
> your
> radar
> mon
> nov
> ob
> companu
> name:
> stellar
> resource
> new
> (otc
> bb:srrl.ob)
> sumbol:
> srrl
> prlce:
> tl._
> targe_:
> tio
> skip:r 10
> ueru
> s_rong
> buu
> our
> last
> feature,
> posted
> cains
> ouer
> __o_
> the
> span
> weekithose-
> are
> ridiculous
> cainsl
> cet
> srrl
> nowl
> will
> makinc
> stunninc
> skip:a 10
> next
> weekl
> massiue
> campaicns
> are
> about
> startl
> watch
> srrl
> trade
> monday
> nou
> obl
> don't
> left
> out!
> 
> That is about 90% recognition or so.
Yes, sometimes the results are good and sometimes they are much worse. In a
few cases a scaling factor of 3 is better. Right now I'm retraining
with ocrad_scale set to 3. We will see in the next few days if the results
are better or worse.

> 
> 
> > probably should be removed (or set to 2 as you suggest)
> 
> Then I suggest removal as you say. Better avoid redundancy ( clutter :) )
> 
> 
> 
> Happy coding :)
> 
> Vibe
-- 
Luigi Pugnetti

Symbolic S.p.A.
V.le Mentana, 29
I-43100 Parma
Italy

Tel: +39 0521 708811
Fax: +39 0521 776190