[Spambayes] Windows compatibility - OCR [was: Unwanted

Vibe Grevsen grevsen at gmail.com
Sat Nov 4 14:19:56 CET 2006


Hi friends,
  
>> OCR code's now been tweaked and tested to work in both WinXP and
>> Win9x.
>> This should work in unix as well.
>>  
>> Here is a summary:
>>  
>> 1. Put ocrad 0.16 in the path
>
> As a note, for Windows you need a copy of ocrad with Skip's patch that
> opens pnm files in binary mode, otherwise ocrad will fail on a lot of
> files.

Actually you're probably referring to my "patch"? (Ocrad/CygWin1.dll)
http://mail.python.org/pipermail/spambayes/2006-October/019983.html

If you have MinGW experience - which I don't - I think you can compile
an exe-only build which doesn't need the dll. But then I don't know whether
it is actually working because of the POSIX emulation or because they
changed the source. (I did not...)

You're right, Skip pointed it out in the ocrad forum, but the developer was
reluctant to change it at the time, so I don't actually know why 0.16 is
working... I just know it is, which is fine for me.
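
For anyone wondering why binary mode matters: my understanding is that
text mode on Windows translates CR LF pairs and can treat a 0x1A byte as
end-of-file, and raw pnm pixel data may contain both. Below is a minimal
Python sketch of the idea only. It is not the ocrad patch; the helper names
and the sample file are made up for illustration.

```python
import os
import tempfile


def write_pgm(path, width, height, pixels):
    """Write a raw (P5) greyscale image; 'wb' keeps every byte untouched."""
    header = b"P5\n%d %d\n255\n" % (width, height)
    with open(path, "wb") as f:
        f.write(header + bytes(pixels))


def read_raw(path):
    """Read the file back in binary mode, so no newline translation occurs."""
    with open(path, "rb") as f:
        return f.read()


# Hypothetical 2x2 image whose pixel values happen to be CR, LF, Ctrl-Z, 255.
pixels = [0x0D, 0x0A, 0x1A, 0xFF]
path = os.path.join(tempfile.mkdtemp(), "sample.pgm")
write_pgm(path, 2, 2, pixels)
data = read_raw(path)
```

Read this way, the last four bytes of data are exactly the pixel values that
were written, whatever the platform.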



> Have you tried other ocr programs?

No, not yet.

Tony Meyer suggested Tesseract:
http://mail.python.org/pipermail/spambayes-dev/2006-September/003750.html
but there seemed to be build issues... I haven't tried..

I exchanged mail with NoSpam Today! support (SpamAssassin-based) before I
chose SB. They were doing research on FuzzyOcr and ImageInfo. Maybe we could
ask again about their results. I believe FuzzyOcr is gocr-based?



> I tried gocr and I think that its result are somewhat better but version
> 0.41 + pgm patch almost hangs

Ok, it probably needs some tweaking then.
Since the OCR is working with ocrad and - as you see below - I get very
good results, I will be moving on to the next area now.

I think it is far more beneficial to do more research into the actual
processing, as you commented elsewhere, than to start the whole
testing/tweaking all over again with a new OCR engine. Of course, that is
just my opinion...



>> 5. Finally I suggest you change the default scale from 1 to 2 like in
>> this line
>>  
>>         scale = options["Tokenizer", "ocrad_scale"] or 2
>
> changing this surely doesn't hurt but ocrad_scale it's already set to 2
> in Options.py

Ok, I missed that. I don't know which one takes precedence:
ImageStripper.py, Options.py or bayescustomize.ini.
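
What I can say is how the "or 2" in that line behaves in plain Python: it
only kicks in when the looked-up value is falsy (0, None, empty). Here is a
small sketch with an ordinary dict standing in for the real options object.
The dict, the merged() helper and the "overrides win over defaults" order
are my assumptions for illustration, not the actual SpamBayes loader.

```python
# Hypothetical stand-in for the defaults; keyed by (section, name) so the
# lookup has the same shape as options["Tokenizer", "ocrad_scale"].
DEFAULTS = {("Tokenizer", "ocrad_scale"): 2}


def merged(defaults, overrides):
    """Assumed layering: values from an ini-style override replace defaults."""
    result = dict(defaults)
    result.update(overrides)
    return result


def effective_scale(options, fallback=2):
    """Same shape as the line above: the fallback is used only for falsy values."""
    return options["Tokenizer", "ocrad_scale"] or fallback


no_override = effective_scale(merged(DEFAULTS, {}))
with_override = effective_scale(merged(DEFAULTS, {("Tokenizer", "ocrad_scale"): 3}))
zero_override = effective_scale(merged(DEFAULTS, {("Tokenizer", "ocrad_scale"): 0}))
```

So under these assumptions the in-code fallback would matter only if the
option ended up as 0 or empty; any other configured value would pass through
unchanged.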

With 2 you should get image tokens of this quality:

watch
out
here
comes
the
big
one!
srrl
about
blow
your
minds
add
srrl
your
radar
mon
nov
ob
companu
name:
stellar
resource
new
(otc
bb:srrl.ob)
sumbol:
srrl
prlce:
tl._
targe_:
tio
skip:r 10
ueru
s_rong
buu
our
last
feature,
posted
cains
ouer
__o_
the
span
weekithose-
are
ridiculous
cainsl
cet
srrl
nowl
will
makinc
stunninc
skip:a 10
next
weekl
massiue
campaicns
are
about
startl
watch
srrl
trade
monday
nou
obl
don't
left
out!

That is about 90% recognition or so.



> probably should be removed (or set to 2 as you suggest)

Then I suggest removal, as you say. Better to avoid redundancy (clutter :) )



Happy coding :)

Vibe

