[Spambayes] date for new release to handle image spam?

Mark Hammond mhammond at skippinet.com.au
Mon Feb 5 01:33:51 CET 2007


> If you run ocrad over some spam text images you can see what it generates.
> If it finds nothing, nothing comes out the back end.  If it sees something,
> it's almost certain to be some garbage text peculiar to it, unlikely to turn
> up in normal text.  For example, here's a pretty clean image:
>
>     http://www.webfast.com/~skip/bogus-5-3.png
>
> Here's what ocrad produces by default:
>
>     COULD THl_ BE THE NEXT IBM_
>     ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl
>     WAIIW LllL p_ Ll_E A WAW_ _IARll__ WO_DA_ _EPIEWBER lll
>
>     IomO_n_ __m_ L |_IL IOWP_IER_ |_I (o_h__ OII LllL p_)
>     __o__ __mbol LllL
>     F_ld__ Ilo__ O Tl (_o s_/_ On F_ld__ Alon_|)
>     _ d__ |__o__ __
>     I____n_ R__lnO ___onO B__
>     \
>     ln _h_ Io____ ot _ W___. LllL W____ ______| ___nnlnO Wo___'
>
>     L ln___n__lon_| Anno_n___
>
>     On_lo__h(IW) _P_o_P__ TP_hnoloO_ b_
>     B_llP_ p_oo_ Da_a _P___|__ Ba_k_O_ and _P__o_P_
>     |__ ____ __n____lon p__Aqco_TM_/P__AID CO_TM_
>     _|__a Po__ablP wloh _OPPd _olld __a_P D_|_P TP_hnoloO_
>     _h_ W___oOoll_. _hP Wo_ld _ _|___ _g laO_oO ComOrfP_
>     _Pa___lnO W_ldla _ Q_a_ll TP_hnoloO_
>     \
>     L ln___n__lon_| _IOn_ _4 _W E__oO__n Dl___lb__lon AO___m_n_
>
>     Th_ b_Pmo__ __PO b_wa_d _a__|_al _Pn___P |_ amonO o_hP_ p__|__|_P
>     dl___lb__lon aO_PPmPn__ ____Pn_|_ _ndP_ nPOo_la_lon ª_ _P_P_al addl_lonal
>     hlOh O_ofi_ _POlon_ and _PO_P_Pn__ a kP_ ___a_POl_ Oa__nP__hlO _ha_ _P___P_
>     l ln_P_na_lonal ComO__P__ wl_h ___|_ Olobal ma_kP_ _Pa_h and O_a_an_PPd
>     O_P _alP_ and lo_k_ _hP _omOan_ ln hlOhl_ dP_|_ablP p__|__|_P dl___lb__lon
>     ma_kP__
>
>     READ MORE ONLINE NOWl
>
>     OPPORl__||_ DOE_ _ol __OI_ o_ IWE DOOR E_ER_ DA_|
>     _o _A_E A Wl__IE IOODD LllL lo _O_R RADAR _ow A_D
>     WAIIW II _OARl

FWIW, I am getting *much* better results with gocr than ocrad.  gocr running
over that same image results in:

--- 8< ---
_        _ _   _
COULD THIS BE THE NEXT IBM?
ALL SIGNS SHOW THAT LITL IS ABOUT TO EXPLODE!

Company Name:
Stock Symbol:
Friday Close:    O.71 (Up 6O_a On Friday Alone!)
S-dayTarget:   $3
Current Rating:  Strong Buy
\

In the Course of a Week, LITL Makes Several Stunning Moves!

L International Announces:

- OneTouch(TM) Recovery Technology hr
Bullet-Proof Data Security Backups and Restores          ,
- Its Next-Generation PuRA_GO(TM)/PuRAID-GO(TM)
UItra-Portable High-Speed Solid State Drive Technology
. - the metropolis, the worldt First l9'' Laptop compWer
Featuring Nvidiat Quad-SLI Technology   _

\
L International Signs $4SM European Distribution Agreement

- T_s hremost step hrward tactical venture is, among other exclusive
distribution agreements, currently under negotiation gr several additional
high-pro_t regions and represents a key strategic partnership that secures
L International Computers with truly global market reach and guaranteed
pre-sales, and locks the company in highly desirable exclusive distribution
marke.ts.

--- >8 ---

Indeed, I have never seen an image that ocrad does better on than gocr.
FWIW, I'm currently halfway through modifying spambayes to support either
ocrad or gocr, in the hope that using gocr will actually cause a noticeable
reduction in image spam.  Unfortunately, using gocr I see no reduction at
all (which isn't to say there is not a small reduction; it just doesn't
"seem" to me like it has reduced).
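
For illustration only, here is a minimal sketch of what choosing between the
two engines might look like.  It is not the actual spambayes change: the names
OCR_COMMANDS, build_command and extract_text are hypothetical, and it assumes
both binaries are on PATH, that the image has already been converted to PNM,
and that each engine writes its text to stdout.

```python
import subprocess

# Hypothetical table: engine name -> function building its command line.
# Assumes "ocrad FILE" and "gocr -i FILE" both print recognized text to stdout.
OCR_COMMANDS = {
    "ocrad": lambda path: ["ocrad", path],
    "gocr": lambda path: ["gocr", "-i", path],
}


def build_command(engine, image_path):
    """Return the argv list for the chosen engine, or raise ValueError."""
    if engine not in OCR_COMMANDS:
        raise ValueError("unknown OCR engine: %r" % (engine,))
    return OCR_COMMANDS[engine](image_path)


def extract_text(engine, image_path, runner=subprocess.run):
    """Run the engine over a PNM image and return its text.

    Returns "" if the engine is missing, times out, or exits non-zero,
    so a failed OCR run simply contributes no text.
    The runner argument exists only so the call can be replaced in tests.
    """
    cmd = build_command(engine, image_path)
    try:
        result = runner(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    if result.returncode != 0:
        return ""
    # OCR output is not guaranteed to be valid UTF-8; latin-1 never fails.
    return result.stdout.decode("latin-1")
```

With something like this, extract_text("gocr", "bogus-5-3.pnm") and
extract_text("ocrad", "bogus-5-3.pnm") could be run over the same corpus to
compare the two engines (the .pnm file name here is just an example).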

Mark


