From sethg at goodmanassociates.com  Fri Feb  2 20:45:46 2007
From: sethg at goodmanassociates.com (Seth Goodman)
Date: Fri, 2 Feb 2007 13:45:46 -0600
Subject: [spambayes-dev] was [Spambayes] date for new release to handle
	image spam?
In-Reply-To: <17859.22356.153465.151561@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAIENBOGAA.sethg@goodmanassociates.com>

skip at pobox.com wrote on Friday, February 02, 2007 9:23 AM -0600:

> Seth Goodman wrote:
>
> > The word salad they use to drown out significant clues generally
> > fails, but if they throw enough words at it, they sometimes dilute
> > the spam clues sufficiently.  The fact that they throw hundreds of
> > "noise" words at the filters for every spam clue they want to hide
> > and Bayesian filters still catch half or three-quarters of it
> > shows how powerful the Bayesian approach really is....
>
> Hmmm... Could we do something to measure the amount of word salad
> without penalizing large non-image emails?

That's a very interesting idea:  a meta-analysis after tokenizing.  To
restate the hypothesis you imply:  spam using word salad may have a
different percentage of tokens that are significant clues than non-spam
email.  Taking this further, there may also be differences in the total
number of distinct tokens generated, and how many of those tokens are
from words versus synthetic tokens.  So in general, try to make use of any
correlation between spamminess and meta-information like total number of
tokens generated, total number of word tokens generated, number of
significant clues and number of non-significant clues.  A very cool
general extension to Bayesian classification.

I don't know how you'd put this meta-information into a form that
Spambayes could make use of.  Let's see, the database tells you how many
times a given token appears in the ham/spam training sets.  From this
you calculate a spam probability that is combined with the results of
other tokens to give an overall spam probability.  For a numeric value
token, you want to calculate a spam probability of the numeric value
with respect to the values in the ham/spam training sets.  It's a
different calculation, but it is still probably amenable to using a
chi-square distribution so you can combine it with other clues.
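
For reference, here is a minimal sketch of chi-square (Fisher-style)
combining, assuming every clue, whether a word token or a numeric
meta-token, has already been reduced to a spam probability strictly
between 0 and 1.  The names chi2q and combine_clues are illustrative; this
is a simplified rendering of the idea, not the SpamBayes classifier code.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-square variable with `dof` degrees of freedom
    exceeds x2.  Only valid for even `dof`, which is all we need here."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)


def combine_clues(probs):
    """Combine per-clue spam probabilities into one score in [0, 1].

    Each probability must be strictly between 0 and 1.  A score near 1
    means spam, near 0 means ham, and near 0.5 means unsure.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    # Evidence for spam: small values of (1 - p) make this statistic large.
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: small values of p make this statistic large.
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0
```

Under this scheme a numeric meta-token would simply contribute one more
entry to the list, e.g. combine_clues([0.9, 0.8, p_meta]), where p_meta is
whatever probability the numeric model assigns.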


>
> > - zombie hosts tend to be weak on SMTP etiquette, so one clue is
> >   that they often fail to wait when asked; making the SMTP client
> >   wait for 30 seconds before sending the "connect banner" often
> >   tricks impatient zombies into spewing, and you can then hang up;
>
> Yeah, but this is a job for postgrey and other similar tools.

Yes, sendmail/exim/qmail, but we're completely in agreement on the
location.  My point to the OP was that the MTA is the best place to make
spam filtering more effective by cutting down on the amount of spam
post-acceptance filters have to process.  The example was meant to show
the kind of behavioral clues that suggest an SMTP client may not be a
legitimate mail host and that the connection should be refused.  I was
suggesting that doing the MTA part a little better has a far greater
return than anything you do later.  I suspect that the best rejection
criterion for image spam is the identity of the SMTP client (a zombie
host), and that's hard to do once a message is delivered to a user
mailbox.

After giving a few examples, I realized that the decision process is
similar to the one used in a post-acceptance spam filter, so perhaps
MTAs could make use of Bayesian classification to make better
decisions.  The current state of the art (OK, bleeding edge) is to use a
reputation system that accumulates reputation (hamminess) for each of
several possible sender identity types, identity qualification methods
and qualification results.  For example, there are three common
identities available at SMTP envelope time:  connecting IP address,
connecting hostname, and SMTP MAILFROM address (domain part only).
Because of the prevalence of forgery, you attempt to qualify each
identity using a hierarchy of possible methods.  Common methods to
qualify an identity are SPF and forward/reverse DNS.  Each qualification
method can produce results of pass, fail or unknown.  The tuple of
(identity, qualification method, qualification result) forms an atom in
the database and holds a reputation score.  There are also behavioral
clues from the connecting SMTP client which are useful when there is no
reputation data.  Finally, there is a time component so the data remains
current.

Every time a connecting MTA offers a message, the receiving MTA must
make a trinary decision analogous to what Spambayes does:  accept all
messages from this sender (whitelist), deny all messages from this
sender (blacklist), or allow the sender to present messages but filter
each one for content (unsure).  The quality of the decisions is
particularly important for senders with no reputation, as that is where
most spam comes from, yet it also includes infrequent senders with real
messages.  Sender in this context means mail host or domain that bounces
go to, not the mailbox address of the author.
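
The reputation atoms and the three-way decision described above could be
sketched roughly as follows.  Everything here is hypothetical: the class
and function names, the exponential half-life used for the time component,
and the accept/reject thresholds are illustrative assumptions, not taken
from any existing MTA or reputation system.

```python
from typing import Dict, Iterable, Tuple

# (identity, qualification method, qualification result), for example
# ("mailfrom:example.org", "spf", "pass").  The string formats are made up.
Atom = Tuple[str, str, str]


class ReputationStore:
    """Toy reputation database keyed by atom, with exponential time decay."""

    def __init__(self, half_life_days: float = 30.0) -> None:
        self.half_life_days = half_life_days
        # atom -> (score at last update, day of last update)
        self._atoms: Dict[Atom, Tuple[float, float]] = {}

    def score(self, atom: Atom, today: float) -> float:
        """Return the decayed reputation; unknown atoms score 0.0."""
        if atom not in self._atoms:
            return 0.0
        value, day = self._atoms[atom]
        return value * 0.5 ** ((today - day) / self.half_life_days)

    def record(self, atom: Atom, delta: float, today: float) -> None:
        """Add hamminess (positive) or spamminess (negative) evidence."""
        self._atoms[atom] = (self.score(atom, today) + delta, today)


def decide(store: ReputationStore, atoms: Iterable[Atom], today: float,
           accept_at: float = 5.0, reject_at: float = -5.0) -> str:
    """Return 'whitelist', 'blacklist' or 'unsure' for a connecting sender."""
    total = sum(store.score(atom, today) for atom in atoms)
    if total >= accept_at:
        return "whitelist"
    if total <= reject_at:
        return "blacklist"
    # Little or mixed reputation: accept the message but filter its content.
    return "unsure"
```

A sender with no recorded atoms lands in "unsure", which is the case the
paragraph above singles out as the one where decision quality matters most.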

--
Seth Goodman


From sethg at goodmanassociates.com  Sat Feb  3 21:48:24 2007
From: sethg at goodmanassociates.com (Seth Goodman)
Date: Sat, 3 Feb 2007 14:48:24 -0600
Subject: [spambayes-dev] was [Spambayes] date for new release to
	handle image spam?
In-Reply-To: <MHEGIFHMACFNNIMMBACAIENBOGAA.sethg@goodmanassociates.com>
Message-ID: <MHEGIFHMACFNNIMMBACAAEOOOGAA.sethg@goodmanassociates.com>

Another possible meta-token that might help detect word salad
(probably what Skip had in mind):

  percentage of unique word tokens that are not significant

Whether or not this would help classify word salad better is
anyone's guess.  I would hope that your own correspondents have
some messages in the training set, so a larger fraction of their
obscure words would be significant clues than you'd expect of
random text from other sources.

Using a percentage rather than an absolute number may avoid bias
towards large or small messages.  Then again, having both
percentage and total number versions of this meta-token may prove
useful for some users' training sets, as their legitimate mail may
tend towards large or small messages.  If one version or the other
is not useful for an end user, that meta-token will probably turn
out to not be significant and will be excluded from the overall
score.  Using meta-information is a little scary, since the
underlying tokens already contribute to the overall spam score.
I think the trick is to devise meta-tokens that describe overall
message characteristics and are relatively independent of
individual token scores.
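
A rough sketch of how both versions of this meta-token might be computed
for one message.  It assumes that synthetic tokens can be recognised by a
":" in the token text and that a word counts as "significant" when its
spam probability is at least 0.1 away from 0.5; both are simplifying
assumptions for illustration, as is the function name.

```python
def word_salad_meta(tokens, spam_prob, min_strength=0.1):
    """Return (count, percentage) of unique word tokens that are NOT
    significant clues.

    tokens:     iterable of token strings for one message
    spam_prob:  dict mapping known tokens to a spam probability in [0, 1];
                tokens missing from the dict are treated as 0.5 (unknown)
    """
    # Assumption: synthetic tokens look like "subject:hi" or "url:example".
    words = {t for t in tokens if ":" not in t}
    insignificant = {
        w for w in words
        if abs(spam_prob.get(w, 0.5) - 0.5) < min_strength
    }
    percentage = 100.0 * len(insignificant) / len(words) if words else 0.0
    return len(insignificant), percentage
```

Returning both the absolute count and the percentage keeps the choice
between them open, so training data can show which one (if either) turns
out to carry any signal.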

-- 
Seth Goodman

From skip at pobox.com  Sat Feb  3 22:16:57 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sat, 3 Feb 2007 15:16:57 -0600
Subject: [spambayes-dev] was [Spambayes] date for new release to
 handle image spam?
In-Reply-To: <MHEGIFHMACFNNIMMBACAAEOOOGAA.sethg@goodmanassociates.com>
References: <MHEGIFHMACFNNIMMBACAIENBOGAA.sethg@goodmanassociates.com>
	<MHEGIFHMACFNNIMMBACAAEOOOGAA.sethg@goodmanassociates.com>
Message-ID: <17860.64457.594466.274613@montanaro.dyndns.org>


    Seth> Another possible meta-token that might help detect word salad
    Seth> (probably what Skip had in mind):

    Seth>   percentage of unique word tokens that are not significant

I see a chicken-and-egg situation developing when we try to compute these
sorts of numbers.  Start with an empty database.  Train on a ham message.  No
words are significant at that point, so having no significant word tokens is
a hammy clue.  Train on a spam.  By definition all words in the database at
this point are significant, so only words not yet seen will be deemed not
significant.

Lather, rinse, repeat.

Maybe after you're done training on all available messages you can toss all
these percentage tokens and make a second pass over your messages computing
only those tokens.  Are there better ways to compute tokens such as this
which depend on the contribution of other messages in the database?

Skip


From sethg at goodmanassociates.com  Mon Feb  5 01:43:51 2007
From: sethg at goodmanassociates.com (Seth Goodman)
Date: Sun, 4 Feb 2007 18:43:51 -0600
Subject: [spambayes-dev] was date for new release ...
In-Reply-To: <17860.64457.594466.274613@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAKEBKOHAA.sethg@goodmanassociates.com>

skip at pobox.com wrote on Saturday, February 03, 2007 3:17 PM -0600:

> Seth> Another possible meta-token that might help detect word salad
> Seth> (probably what Skip had in mind):
>
> Seth>   percentage of unique word tokens that are not significant
>
> I see a chicken-and-egg situation developing when we try to compute
> these sorts of numbers.  Start with an empty database.  Train on a ham
> message.  No words are significant at that point, so having no
> significant word tokens is a hammy clue.  Train on a spam.  By
> definition all words in the database at this point are significant,
> so only words not yet seen will be deemed not significant.

It definitely has chicken and egg properties.


>
> Lather, rinse, repeat.
>
> Maybe after you're done training on all available messages you can
> toss all these percentage tokens and make a second pass over your
> messages computing only those tokens.  Are there better ways to
> compute tokens such as this which depend on the contribution of
> other messages in the database?

I hope so.  This is fundamentally different from drawing an inference
from previously observed word frequencies.  Numeric value meta-tokens
are not the result of binary experiments.  They exist for every message,
whether ham or spam, and they are real numbers.  We don't know their
underlying distribution.  The problem is to estimate the probability
that a message that contains a token with a given numeric value is ham
or spam based on the values of that token observed in trained ham and
spam.

This is a very raw idea, not even half-baked.  I think this problem
becomes tractable if we assume the token values are Gaussian
distributed, even if we believe they aren't.  It should be possible to
estimate the likelihood that a given token value is from a spam message
based on the distribution of that token's value in both trained ham and
spam.  If it's Gaussian, we only need to know the mean and variance of
each distribution.

If this turns out to work at all, we wouldn't need that much information
in the database.  For each numeric value token you model this way, you
need at least the mean and variance for each of ham and spam.  To
untrain a value, I think you could get away with keeping only the
intermediate values used to calculate variance, and I vaguely recall two
of them.  If you want to support arbitrary real values, these are all
floats, with the possibility that the intermediate variables are double
precision.
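
To make the idea concrete, here is a minimal sketch under the Gaussian
assumption stated above.  It keeps a count, a running sum and a running
sum of squares per class, so a value can be untrained by subtraction.  The
class and function names, the equal priors for ham and spam, and the
variance floor are all assumptions made for illustration.

```python
import math


class RunningStats:
    """Mean and variance of one numeric meta-token for one class."""

    def __init__(self):
        self.n = 0
        self.total = 0.0      # sum of trained values
        self.total_sq = 0.0   # sum of squares of trained values

    def train(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def untrain(self, x):
        self.n -= 1
        self.total -= x
        self.total_sq -= x * x

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        # Population variance.  This form can lose precision when the
        # variance is tiny relative to the mean, hence the clamp at zero.
        return max(self.total_sq / self.n - self.mean ** 2, 0.0)


def gaussian_pdf(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance)


def spam_probability(x, ham, spam, min_variance=1e-6):
    """Estimate P(spam | value x), assuming Gaussian values and equal priors."""
    p_spam = gaussian_pdf(x, spam.mean, max(spam.variance, min_variance))
    p_ham = gaussian_pdf(x, ham.mean, max(ham.variance, min_variance))
    if p_spam + p_ham == 0.0:
        return 0.5  # both densities underflowed; no evidence either way
    return p_spam / (p_spam + p_ham)
```

The result is an ordinary probability, so in principle it could be fed
into the same combining step as the word-token probabilities.  Whether the
Gaussian assumption is good enough in practice would have to be tested.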

--
Seth Goodman


From mhammond at skippinet.com.au  Mon Feb  5 04:24:38 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon, 5 Feb 2007 14:24:38 +1100
Subject: [spambayes-dev] [Spambayes] date for new release to handle
	image spam?
In-Reply-To: <09c701c748bd$4fb7e8a0$230a0a0a@enfoldsystems.local>
Message-ID: <09e201c748d5$2b1db6b0$230a0a0a@enfoldsystems.local>

In the message below (which I sent to spambayes instead of -dev), I
mentioned I got much better results with gocr than ocrad.  I've uploaded my
patch at
http://sourceforge.net/tracker/index.php?func=detail&aid=1652111&group_id=61702&atid=498105,
and I've assigned it to skip for a quick scan.  There are
some bits of the outlook patch mixed in there too, but that shouldn't
distract from the rest of the patch.  I'd obviously welcome all testing of
this and am happy to check it in.

Cheers,

Mark

> -----Original Message-----
> From: spambayes-bounces at python.org
> [mailto:spambayes-bounces at python.org]On Behalf Of Mark Hammond
> Sent: Monday, 5 February 2007 11:34 AM
> To: skip at pobox.com
> Cc: spambayes at python.org
> Subject: Re: [Spambayes] date for new release to handle image spam?
>
>
> > If you run ocrad over some spam text images you can see what it
> > generates.  If it finds nothing, nothing comes out the back end.  If
> > it sees something, it's almost certain to be some garbage text
> > peculiar to it, unlikely to turn up in normal text.  For example,
> > here's a pretty clean image:
> >
> >     http://www.webfast.com/~skip/bogus-5-3.png
> >
> > Here's what ocrad produces by default:
> >
> >     COULD THl_ BE THE NEXT IBM_
> >     ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl
> >     WAIIW LllL p_ Ll_E A WAW_ _IARll__ WO_DA_ _EPIEWBER lll
> >
> >     IomO_n_ __m_ L |_IL IOWP_IER_ |_I (o_h__ OII LllL p_)
> >     __o__ __mbol LllL
> >     F_ld__ Ilo__ O Tl (_o s_/_ On F_ld__ Alon_|)
> >     _ d__ |__o__ __
> >     I____n_ R__lnO ___onO B__
> >     \
> >     ln _h_ Io____ ot _ W___. LllL W____ ______| ___nnlnO Wo___'
> >
> >     L ln___n__lon_| Anno_n___
> >
> >     On_lo__h(IW) _P_o_P__ TP_hnoloO_ b_
> >     B_llP_ p_oo_ Da_a _P___|__ Ba_k_O_ and _P__o_P_
> >     |__ ____ __n____lon p__Aqco_TM_/P__AID CO_TM_
> >     _|__a Po__ablP wloh _OPPd _olld __a_P D_|_P TP_hnoloO_
> >     _h_ W___oOoll_. _hP Wo_ld _ _|___ _g laO_oO ComOrfP_
> >     _Pa___lnO W_ldla _ Q_a_ll TP_hnoloO_
> >     \
> >     L ln___n__lon_| _IOn_ _4 _W E__oO__n Dl___lb__lon AO___m_n_
> >
> >     Th_ b_Pmo__ __PO b_wa_d _a__|_al _Pn___P |_ amonO o_hP_ p__|__|_P
> >     dl___lb__lon aO_PPmPn__ ____Pn_|_ _ndP_ nPOo_la_lon ?_ _P_P_al addl_lonal
> >     hlOh O_ofi_ _POlon_ and _PO_P_Pn__ a kP_ ___a_POl_ Oa__nP__hlO _ha_ _P___P_
> >     l ln_P_na_lonal ComO__P__ wl_h ___|_ Olobal ma_kP_ _Pa_h and O_a_an_PPd
> >     O_P _alP_ and lo_k_ _hP _omOan_ ln hlOhl_ dP_|_ablP p__|__|_P dl___lb__lon
> >     ma_kP__
> >
> >     READ MORE ONLINE NOWl
> >
> >     OPPORl__||_ DOE_ _ol __OI_ o_ IWE DOOR E_ER_ DA_|
> >     _o _A_E A Wl__IE IOODD LllL lo _O_R RADAR _ow A_D
> >     WAIIW II _OARl
>
> FWIW, I am getting *much* better results with gocr than ocrad.  gocr
> running over that same image results in:
>
> --- 8< ---
> _        _ _   _
> COULD THIS BE THE NEXT IBM?
> ALL SIGNS SHOW THAT LITL IS ABOUT TO EXPLODE!
>
> Company Name:
> Stock Symbol:
> Friday Close:    O.71 (Up 6O_a On Friday Alone!)
> S-dayTarget:   $3
> Current Rating:  Strong Buy
> \
>
> In the Course of a Week, LITL Makes Several Stunning Moves!
>
> L International Announces:
>
> - OneTouch(TM) Recovery Technology hr
> Bullet-Proof Data Security Backups and Restores          ,
> - Its Next-Generation PuRA_GO(TM)/PuRAID-GO(TM)
> UItra-Portable High-Speed Solid State Drive Technology
> . - the metropolis, the worldt First l9'' Laptop compWer
> Featuring Nvidiat Quad-SLI Technology   _
>
> \
> L International Signs $4SM European Distribution Agreement
>
> - T_s hremost step hrward tactical venture is, among other exclusive
> distribution agreements, currently under negotiation gr
> several additional
> high-pro_t regions and represents a key strategic partnership
> that secures
> L International Computers with truly global market reach and
> guaranteed
> pre-sales, and locks the company in highly desirable
> exclusive distribution
> marke.ts.
>
> --- >8 ----
>
> Indeed, I have never seen an image that ocrad does better on than
> gocr.  FWIW, I'm currently 1/2 way through modifying spambayes to
> support either ocrad or gocr, in the hope that using gocr will
> actually cause a noticeable reduction in image spam - unfortunately,
> using gocr I see no reduction at all (which isn't to say there is not
> a small reduction - it just doesn't "seem" to me like it has reduced).
>
> Mark
>
> _______________________________________________
> SpamBayes at python.org
> http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html
>


From skip at pobox.com  Mon Feb  5 05:01:15 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 4 Feb 2007 22:01:15 -0600
Subject: [spambayes-dev] [Spambayes] date for new release to handle
	image spam?
In-Reply-To: <09e201c748d5$2b1db6b0$230a0a0a@enfoldsystems.local>
References: <09c701c748bd$4fb7e8a0$230a0a0a@enfoldsystems.local>
	<09e201c748d5$2b1db6b0$230a0a0a@enfoldsystems.local>
Message-ID: <17862.44043.681609.437198@montanaro.dyndns.org>


    Mark> In the message below (which I sent to spambayes instead of -dev),
    Mark> I mentioned I got much better results with gocr than ocrad.  I've
    Mark> uploaded my patch at
    Mark> http://sourceforge.net/tracker/index.php?func=detail&aid=1652111&group_id=61702&atid=498105,
    Mark> and I've assigned it to skip for a quick scan.

    Mark> There are some bits of the outlook patch mixed in there too, but
    Mark> that shouldn't distract from the rest of the patch.  I'd obviously
    Mark> welcome all testing of this and am happy to check it in.

Mark,

I made some changes today as well (not yet checked in) in an attempt to
improve the ability of ocrad to extract text from images.  It really
requires the text be dark and the background be light in order to "see"
anything.  I believe a perfectly formatted image where the text is white on
a black background results in no output by ocrad.  I created a patch to try
and remedy that:

  http://sourceforge.net/tracker/index.php?func=detail&aid=1652120&group_id=61702&atid=498105

I need to get to bed, but I'll try to look at your gocr patch Monday or
Tuesday.

Skip

From skip at pobox.com  Tue Feb  6 03:07:01 2007
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 5 Feb 2007 20:07:01 -0600
Subject: [spambayes-dev] gocr is definitely improving...
Message-ID: <17863.58053.857731.637415@montanaro.dyndns.org>

I got a mail with image spam today (I probably got quite a few but gmail
blocks most of them nowadays):

    http://www.webfast.com/~skip/thermometer.gif

I ran gocr 0.41 over it and got this output:

    > _'__o______ __ ____o______ ___
    i__8
    _____ 00,__ 0 0_,_>
    0 __8 ___E3 __>_E3 __ E3_,__ _____
    0,__,_ _ _0______ _ 0 __0


    _, ___ _E3____ E3 _ _ _ ____ ____ 'o__0____ ____ 0>,E3
    _______ __ _________, _,______ _ 0 __________ ___,,_____,
    ____ ____',____ ____ ___ ___ _ 0 ___ >__ ____ ___
    ____ _ ___E3_ ___e__ ___E3___ 0 ______

The latest version is 0.43, so I downloaded and built it (with a couple
slight tweaks needed).  When fed the same image it spit out:

    _  _  _  _ _  _          _


    X;niy_nha_ Technology Ltd
    qnb oI_ _
    p_rce I1.SB lP 1_.6_
    hb te: H_ts Il_ghs of I1._B TodJy
    .M_ rc _ Fxpected T _ rr _

    Ini thc Izst 3 _ eks they ha_e ianded o_er I1.Z
    M II_on _n contracts. TJdays n _ Jnnounced anothe?
    huge cont_iact. Read all the n _ and set ycur buy
    fur_ mm f_rst cn_ng Tuesday nD rn_ng!

Pretty huge improvement.  (I think you can see why I gave up on gocr
before.)  By comparison, with my latest massaging of the input fed to ocrad
I get:

    X?nU?nha? TechnologU L_d
    glbol! _
    p_rce __.58 LP _3.6_
    __e: H__s H_ghs or __.78 Tod_V
    _re _ Expec_ed T_rr_

    In _he las_ 3 _ehs _heV ha_e landed o_er t_.2
    n?ll?on ?n con_roc_s, TodoVs n_ onnounced ono_her
    huge con_rac_, RPad all _he n_ and se_ Uour buU
    ror mM r?rs_ _h?ng TuesdaU nDrn?ng!

Without any massaging ocrad doesn't find any text.  You have to give the
--invert flag.  Seems like it should automatically try to invert the image
if its first attempt to extract text completely fails.
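
The automatic retry suggested above can be approximated in a wrapper
around the command line.  This is only a sketch: it assumes ocrad is on
the PATH, that the image is already in a format ocrad can read, and that
empty standard output means "no text found".  The function names are
illustrative.

```python
import subprocess


def run_ocrad(args):
    """Run ocrad with the given argument list and return its stdout."""
    result = subprocess.run(["ocrad", *args], capture_output=True,
                            text=True, check=False)
    return result.stdout


def extract_text(image_path, run=run_ocrad):
    """Try ocrad normally, then once more with --invert if nothing came out.

    `run` is injectable so the retry logic can be exercised without
    having ocrad installed.
    """
    text = run([image_path])
    if text.strip():
        return text
    # First attempt found no text at all; retry on the inverted image.
    return run(["--invert", image_path])
```

The retry costs a second OCR pass only for images where the first pass
produced nothing, so well-behaved dark-on-light images are unaffected.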

At any rate, gocr looks much better than it did.  I'm going to install it
and give your patch a try for a couple days.  It looks fine based on a
simple skim of the changes.  Go ahead and check it in so more people can
play with it.

Skip


From pl at symbolic.it  Tue Feb  6 09:39:14 2007
From: pl at symbolic.it (Luigi Pugnetti)
Date: Tue, 06 Feb 2007 09:39:14 +0100
Subject: [spambayes-dev] gocr is definitely improving...
In-Reply-To: <17863.58053.857731.637415@montanaro.dyndns.org>
References: <17863.58053.857731.637415@montanaro.dyndns.org>
Message-ID: <1170751155.29941.46.camel@localhost.localdomain>

On Mon, 2007-02-05 at 20:07 -0600, skip at pobox.com wrote:

<snip>
> 
> Without any massaging ocrad doesn't find any text.  You have to give the
> --invert flag.  Seems like it should automatically try to invert the image
> if its first attempt to extract text completely fails.

You could use a simple check to find out whether the invert flag is needed:

    from PIL import ImageStat  # plain "import ImageStat" with classic PIL
    mean = ImageStat.Stat(image).mean  # per-band means; assumes an RGB image
    if mean[0] + mean[1] + mean[2] >= 128 * 3:
        pass  # the invert flag is needed

This is a very simple check that can sometimes fail (inversion is needed
but the condition is false; I've never seen the opposite).  Checking
whether two of the mean[] values are greater than 128 would probably
suffice, especially when one of them is very big (> 190).
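
The two variants of the check can be written as small functions over the
per-channel means, which keeps them easy to compare on the same images.
This is a sketch only: the function names are made up, the thresholds are
the ones quoted above, and the "very big (> 190)" refinement is left out.

```python
def needs_invert_sum(means, threshold=128):
    """First rule: the sum of the R, G and B means reaches 3 * threshold."""
    red, green, blue = means[:3]
    return red + green + blue >= threshold * 3


def needs_invert_two_of_three(means, threshold=128):
    """Relaxed rule: at least two of the three channel means exceed threshold."""
    return sum(1 for m in means[:3] if m > threshold) >= 2


# With Pillow, the means could be obtained along these lines:
#     from PIL import Image, ImageStat
#     means = ImageStat.Stat(Image.open(path).convert("RGB")).mean
```

For example, channel means of (200, 130, 0) fail the first rule, since
they sum to 330, but pass the relaxed one, which is the kind of case the
relaxed rule is meant to catch.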

> 
> At any rate, gocr looks much better than it did.  I'm going to install it
> and give your patch a try for a couple days.  It looks fine based on a
> simple skim of the changes.  Go ahead and check it in so more people can
> play with it.
> 
> Skip
> 
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev at python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev
-- 
Luigi Pugnetti

Symbolic S.p.A.
V.le Mentana, 29
I-43100 Parma
Italy

Tel: +39 0521 708811
Fax: +39 0521 776190


From genojoe at neo.rr.com  Wed Feb  7 10:15:27 2007
From: genojoe at neo.rr.com (Gene Rhodes)
Date: Wed, 7 Feb 2007 04:15:27 -0500
Subject: [spambayes-dev] Desired Feature
Message-ID: <009c01c74a98$81ea0330$650fa8c0@PRESARIO>

One feature that may already exist but I cannot find is the following:

 
I have a list of 20 or so people that I regularly receive email from.
Sometimes their email is identified as Spam regardless of the "training".

 
Is there a way to create a list of email addresses that are always treated
as "good" regardless of their spam score?  This is such an obvious request
that I think that it may already exist but I am not able to find it.

 
I am using Outlook 2003, Road runner and Windows XP in my home.  

 

From genojoe at neo.rr.com  Wed Feb  7 10:23:36 2007
From: genojoe at neo.rr.com (Gene Rhodes)
Date: Wed, 7 Feb 2007 04:23:36 -0500
Subject: [spambayes-dev] FW: Desired Feature
Message-ID: <00a101c74a99$a5284540$650fa8c0@PRESARIO>

In the email sent immediately preceding this one, I asked whether there is
a way to create a list of email addresses that are always treated as "good"
regardless of their spam score.

Please add the following comment:

If the preceding feature does not exist, consider adding the following to
your FAQ:

 
Question:

Can I create a list of email addresses that are always accepted?

 
Answer:

No, this capability currently does not exist in Spambayes.

 
This FAQ will help others that have the same desire.  I think many people
would find this helpful.  Again, this feature may exist.  If that is the
case, I apologize to you for implying that it does not exist.

 

From skip at pobox.com  Wed Feb  7 14:17:40 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 7 Feb 2007 07:17:40 -0600
Subject: [spambayes-dev] FW: Desired Feature
In-Reply-To: <00a101c74a99$a5284540$650fa8c0@PRESARIO>
References: <00a101c74a99$a5284540$650fa8c0@PRESARIO>
Message-ID: <17865.53620.723031.43882@montanaro.dyndns.org>


    Gene> One feature that may already exist but I cannot find is the
    Gene> following:

    Gene> I have a list of 20 or so people that I regularly receive email
    Gene> from.  Sometimes their email is identified as Spam regardless of
    Gene> the "training".
    ...
    Gene> Is there a way to create a list of email addresses that are always
    Gene> treated as "good" regardless of their spam score?  This is such an
    Gene> obvious request that I think that it may already exist but I am
    Gene> not able to find it.

This is already in the FAQ (question 6.6):

    http://spambayes.sourceforge.net/faq.html

Short answer: If you really want whitelisting, add a filter to Outlook
that's run before SpamBayes scores those messages.

Skip

From bishop at aeroprise.com  Wed Feb  7 20:55:08 2007
From: bishop at aeroprise.com (Peter Bishop)
Date: Wed, 7 Feb 2007 11:55:08 -0800
Subject: [spambayes-dev] new FAQ needed
Message-ID: <MAILGsA7MnuXoKEQpQh0000bca4@mail.aeroprise.com>

I have been monitoring the spambayes list and helping out some of the
spambayes users for a couple of months.

Some of the least frequently answered questions are related to the response
below.  I recommend that this answer be reformatted into a FAQ and added to
the FAQ list under "reinstalling SpamBayes".  The information below should
be included in the "Addin doesn't load" section of the Troubleshooting Guide
that is installed with the product. (It would be good to have some
verification from more knowledgeable SpamBayes people that this info is
good.)

I suspect this is not the right issue to raise on this list, but I need help
figuring out how to help get this done, or get the ball rolling to get it
done.

Peter Bishop
-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of Klieg
Sent: Tuesday, February 06, 2007 5:31 PM
To: spambayes at python.org
Subject: Re: [Spambayes] spambayes quit working and won't reinstall


You have likely corrected this by now, but I encountered a similar
difficulty. SpamBayes quit working. The menu bar in Outlook still showed the
controls, however they did not respond. I tried re-installing the product
and it showed registered in the log, however still did not work.

I un-installed the product, deleted the menu from Outlook, then re-installed
the product.

It still did not work. When I tried to turn on the COM add-in in Outlook by
checking the add-in under Tools > Options > Other > Advanced Options > COM
Add-Ins, still nothing. When going back into the COM Add-Ins options, the
checkbox was un-checked.

Then I selected the Spambayes item in the COM Add-Ins option window and
selected 'Remove'. I then selected 'Add' to add in the SpamBayes add-in. In
the file dialogue box, you need to navigate to the add-in under
Program Files > Spambayes > Bin > outlook_addin.dll.

The addin takes a little while to install at this point, then the Spambayes
menu items should show up.

You'll need to re-configure some of the Spambayes setup, then things should
be back in working order.


Norm Dingle wrote:
> 
> I am running Outlook 2003 and had Spambayes running fine.
>
> All of a sudden it quit working. I have read the troubleshooting
> manual and nothing seems to work.
>
> I have uninstalled Spambayes and reinstalled several times.
>
> The log file says it is registered.
>
> When I go to the Advanced Options/COM Add-Ins dialog, Spambayes shows in
> the list but unchecked.
>
> I check it and then restart Outlook. No Spambayes. The add-in dialog
> then shows it unchecked again.
>
> Right now I don't have any other ideas about what to try.
>
> Thanks
>
> Norm



From Sjoerd.Mullender at cwi.nl  Tue Feb 13 08:38:53 2007
From: Sjoerd.Mullender at cwi.nl (Sjoerd Mullender)
Date: Tue, 13 Feb 2007 08:38:53 +0100
Subject: [spambayes-dev] bug in spambayes/ImageStripper.py?
Message-ID: <45D16B0D.7020603@cwi.nl>

There are two occurrences of the name "program_name" in
spambayes/ImageStripper.py.  Shouldn't both be "engine_name"?

-- 
Sjoerd Mullender


From mhammond at skippinet.com.au  Wed Feb 14 01:53:41 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 14 Feb 2007 11:53:41 +1100
Subject: [spambayes-dev] bug in spambayes/ImageStripper.py?
In-Reply-To: <45D16B0D.7020603@cwi.nl>
Message-ID: <021b01c74fd2$92dd1f80$060a0a0a@enfoldsystems.local>

> There are two occurrences of the name "program_name" in
> spambayes/ImageStripper.py.  Shouldn't both be "engine_name"?

They should indeed!  I've checked that in.

Cheers,

Mark


From f.rougon at free.fr  Mon Feb 19 14:23:42 2007
From: f.rougon at free.fr (Florent Rougon)
Date: Mon, 19 Feb 2007 14:23:42 +0100
Subject: [spambayes-dev] Tesseract OCR
Message-ID: <87r6sm5li9.fsf@florent.maison>

Hi,

I just discovered the existence of Tesseract OCR, whose homepage[1] says:

  A commercial quality OCR engine originally developed at HP between
  1985 and 1995. In 1995, this engine was among the top 3 evaluated by
  UNLV. It was open-sourced by HP and UNLV in 2005.

I thought some of you (Skip, Mark) might be interested if you hadn't
heard about this software yet.

According to the Debian package page[2], Tesseract OCR is command-line
driven, which sounds good for you. And according to the Debian copyright
file, the software is released under the Apache License, version 2.0.

That's it, end of advertisement. Thanks for the great spam filter that
saved my life, and keep up the good work! :)

Regards,


  [1] http://sourceforge.net/projects/tesseract-ocr

  [2] http://packages.debian.org/unstable/graphics/tesseract-ocr

-- 
Florent

From mhammond at skippinet.com.au  Tue Feb 20 00:16:44 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 20 Feb 2007 10:16:44 +1100
Subject: [spambayes-dev] Tesseract OCR
In-Reply-To: <87r6sm5li9.fsf@florent.maison>
Message-ID: <02af01c7547c$06184440$010a0a0a@enfoldsystems.local>

> I just discovered the existence of Tesseract OCR, whose
> homepage[1] says:
>
>   A commercial quality OCR engine originally developed at HP between
>   1985 and 1995. In 1995, this engine was among the top 3 evaluated by
>   UNLV. It was open-sourced by HP and UNLV in 2005.
>
> I thought some of you (Skip, Mark) might be interested if you hadn't
> heard about this software yet.

You could help us out here too, by running some of your image spam against
the various engines and manually inspecting the accuracy of the text versus
what you actually see in the image.  My quick experiments show that
tesseract is very close to the results I get from gocr, and significantly
better than ocrad.
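
Manual inspection could be complemented by a crude numeric score.  The
sketch below compares each engine's output against a hand-typed
transcription of the image using difflib from the Python standard library.
The function names are illustrative, and a similarity ratio is only a
rough proxy for how useful the extracted tokens are to the classifier.

```python
import difflib


def ocr_accuracy(expected, actual):
    """Similarity between a hand-typed transcription and OCR output (0..1)."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return matcher.ratio()


def rank_engines(expected, outputs):
    """Return (score, engine name) pairs, best engine first.

    outputs: dict mapping an engine name to the text it produced.
    """
    scored = [(ocr_accuracy(expected, text), name)
              for name, text in outputs.items()]
    return sorted(scored, reverse=True)


# Example using the first line of the ocrad and gocr output quoted earlier
# in this thread, with the expected text assumed to be what a person would
# read in the image.
ranking = rank_engines(
    "COULD THIS BE THE NEXT IBM?",
    {
        "ocrad": "COULD THl_ BE THE NEXT IBM_",
        "gocr": "COULD THIS BE THE NEXT IBM?",
    },
)
```

Averaging the score over a batch of image spam would give a simple
per-engine figure to set alongside the manual impressions.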

Mark


From mhammond at skippinet.com.au  Thu Feb 22 02:01:56 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 22 Feb 2007 12:01:56 +1100
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
Message-ID: <00d801c7561d$0d628830$020a0a0a@enfoldsystems.local>

Hi guys,
  On a bit of a whim, I mailed the jocr-devels list to try and get informal
approval for distributing their Windows binary with spambayes.  I believe
that as we are not modifying the binary, as long as we point to the original
binary (and therefore implicitly point at their source tarball etc, as
required by the GPL) we should be fine.

Anyone have any comments or objections to us releasing spambayes with a gocr
binary?

Mark

-----Original Message-----
From: Joerg Schulenburg [mailto:Joerg.Schulenburg at URZ.Uni-Magdeburg.DE]
Sent: Tuesday, 20 February 2007 8:16 PM
To: Mark Hammond
Subject: Re: [Jocr-devels] redistribute gocr binaries?


I have no problem with that. I am also trying to improve how gocr handles
spam images.

Joerg.

On Sun, 18 Feb 2007, Mark Hammond wrote:

> Hi all,
>  I'm involved with the 'spambayes' project (spambayes.org), an open-source
> client-based spam solution, and we've had recent success in using gocr with
> our recent OCR enhancements.  Spambayes is released under a 'Python' style
> open-source license which is closer to a BSD license than to the GPL.
>
> Are there any license considerations or any other objections to us including
> a gocr binary with our Windows binaries?  If not, are there any other
> requests or guidelines you would like us to adhere to?  We are looking at
> including the unmodified binary at
> http://www-e.uni-magdeburg.de/jschulen/ocr/gocr043.exe in our application
> directory (an email address for Peter B L Meijer wasn't obvious, otherwise
> I'd CC him)
>
> Thanks,
>
> Mark

--
-------------------------------------------------------------------------
-                                                             \V/       -
- EMAIL: joerg.schulenburg at urz.uni-magdeburg.de              (o o)      -
----------------------------------------------------------oOo-(_)-oOo----
- http://www-e.uni-magdeburg.de/jschulen/
- PGP 1024D/53BDFBE3, 3816 B803 D578 F5AD 12FD  FE06 5D33 0C49 53BD FBE3
-------------------------------------------------------------------------


From skip at pobox.com  Sat Feb 24 22:58:01 2007
From: skip at pobox.com (skip at pobox.com)
Date: Sat, 24 Feb 2007 15:58:01 -0600
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
In-Reply-To: <00d801c7561d$0d628830$020a0a0a@enfoldsystems.local>
References: <00d801c7561d$0d628830$020a0a0a@enfoldsystems.local>
Message-ID: <17888.46313.313814.279176@montanaro.dyndns.org>


    Mark> On a bit of a whim, I mailed the jocr-devels list to try and get
    Mark> informal approval for distributing their windows binary with
    Mark> spambayes.  I believe that as we are not modifying the binary, as
    Mark> long as we point to the original binary (and therefore implicitly
    Mark> pointing at their source tarball etc, as required by the GPL) we
    Mark> should be fine.

    Mark> Anyone have any comments or objections to us releasing spambayes
    Mark> with a gocr binary?

No objections.  I went through the same procedure with ocrad.  Might as well
be thorough.

Skip

From f.rougon at free.fr  Mon Feb 26 19:26:06 2007
From: f.rougon at free.fr (Florent Rougon)
Date: Mon, 26 Feb 2007 19:26:06 +0100
Subject: [spambayes-dev] Tesseract OCR
In-Reply-To: <02af01c7547c$06184440$010a0a0a@enfoldsystems.local> (Mark
	Hammond's message of "Tue, 20 Feb 2007 10:16:44 +1100")
References: <02af01c7547c$06184440$010a0a0a@enfoldsystems.local>
Message-ID: <87slcsn5c1.fsf@florent.maison>

Hi Mark,

"Mark Hammond" <mhammond at skippinet.com.au> wrote:

> You could help us out here too, by running some of your image spam against
> the various engines and manually inspecting the accuracy of the text versus
> what you actually see in the image.

Sure... I just severely lack time to do that in the foreseeable future.
:-/

So, my message was just in case you didn't know about tesseract yet.

> My quick experiments show that tesseract is very close to the results
> I get from gocr, and significantly better than ocrad.

OK, so you had already tried it. Thanks!

-- 
Florent

From mhammond at skippinet.com.au  Tue Feb 27 11:14:25 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue, 27 Feb 2007 21:14:25 +1100
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
In-Reply-To: <17888.46313.313814.279176@montanaro.dyndns.org>
Message-ID: <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local>

>     Mark> On a bit of a whim, I mailed the jocr-devels list
>     Mark> to try and get informal approval for distributing
>     Mark> their windows binary with spambayes.

> No objections.  I went through the same procedure with ocrad.
> Might as well be thorough.

Yeah - I stumbled across your mail to the ocrad -devel list a while ago - it
inspired my similar mail to gocr :)

My next proposition is that we enable this OCR support by default, and
enable gocr as the default OCR engine - how does that sound?  If OK, should
we do that by promoting the relevant 'X-' options to 'official' options, or
just change the default values for the options as implemented?

Mark


From skip at pobox.com  Tue Feb 27 12:57:38 2007
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 27 Feb 2007 05:57:38 -0600
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
In-Reply-To: <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local>
References: <17888.46313.313814.279176@montanaro.dyndns.org>
	<002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local>
Message-ID: <17892.7346.76540.56365@montanaro.dyndns.org>


    Mark> My next proposition is that we enable this OCR support by default,
    Mark> and enable gocr as the default OCR engine - how does that sound?
    Mark> If OK, should we do that by promoting the relevant 'X-' options to
    Mark> 'official' options, or just change the default values for the
    Mark> options as implemented?

Sounds good to me.  Default to gocr, drop the "X-".

Skip

From sjoerd at acm.org  Tue Feb 27 13:25:00 2007
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Tue, 27 Feb 2007 13:25:00 +0100
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
In-Reply-To: <17892.7346.76540.56365@montanaro.dyndns.org>
References: <17888.46313.313814.279176@montanaro.dyndns.org>	<002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local>
	<17892.7346.76540.56365@montanaro.dyndns.org>
Message-ID: <45E4231C.5010801@acm.org>

On 2007-02-27 12:57, skip at pobox.com wrote:
>     Mark> My next proposition is that we enable this OCR support by default,
>     Mark> and enable gocr as the default OCR engine - how does that sound?
>     Mark> If OK, should we do that by promoting the relevant 'X-' options to
>     Mark> 'official' options, or just change the default values for the
>     Mark> options as implemented?
> 
> Sounds good to me.  Default to gocr, drop the "X-".

Do keep in mind that gocr and ocrad are not installed on all systems.
It's great if you put it into the Windows distribution, but on Linux
that is not an option, and spambayes should still work on Linux as well.

On Fedora Core, those two programs are not available unless the system
administrator gets them from some third party repository (gocr (but not
ocrad) is available in freshrpms) or builds them from source.

-- 
Sjoerd Mullender


From skip at pobox.com  Tue Feb 27 13:59:53 2007
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 27 Feb 2007 06:59:53 -0600
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
In-Reply-To: <45E4231C.5010801@acm.org>
References: <17888.46313.313814.279176@montanaro.dyndns.org>
	<002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local>
	<17892.7346.76540.56365@montanaro.dyndns.org>
	<45E4231C.5010801@acm.org>
Message-ID: <17892.11081.89379.191710@montanaro.dyndns.org>


    Sjoerd> Do keep in mind that gocr and ocrad are not installed on all
    Sjoerd> systems.  It's great if you put it into the Windows
    Sjoerd> distribution, but on Linux that is not an option, and spambayes
    Sjoerd> should still work on Linux as well.

The image analysis code will still work if you enable the image cracking
options but don't have the requisite ocr engine installed.  It just emits a
message to standard error and returns.
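
A minimal sketch of that kind of fall-back, in present-day Python 3.
This is not the actual spambayes code: the function name ocr_text, the
wording of the message, and the assumption that the engine can be run
as "<engine> <image file>" and prints the recognised text on standard
output are all illustrative.

```python
import shutil
import subprocess
import sys


def ocr_text(image_path, engine="gocr"):
    """Return text recognised in image_path, or "" if the engine is missing.

    Hypothetical helper illustrating the behaviour described above.
    """
    executable = shutil.which(engine)
    if executable is None:
        # Engine not installed: say so on standard error and return,
        # so that the rest of the message is still tokenized as usual.
        print(f"OCR engine {engine!r} not found; skipping image analysis",
              file=sys.stderr)
        return ""
    # Assumed invocation: the engine takes the image file as its only
    # argument and writes the recognised text to standard output.
    result = subprocess.run([executable, image_path],
                            capture_output=True, text=True, check=False)
    return result.stdout
```

With no engine on the path, the caller simply gets an empty string and
no image-derived tokens are produced.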

    Sjoerd> On Fedora Core, those two programs are not available unless the
    Sjoerd> system administrator gets them from some third party repository
    Sjoerd> (gocr (but not ocrad) is available in freshrpms) or builds them
    Sjoerd> from source.

I suspect that's true for many systems.  It is certainly true on my Mac.  On
my Ubuntu system both ocrad and gocr are available if I check the "community
maintained" box in the repositories popup.

Skip

From sjoerd at acm.org  Tue Feb 27 14:15:15 2007
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Tue, 27 Feb 2007 14:15:15 +0100
Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries?
In-Reply-To: <17892.11081.89379.191710@montanaro.dyndns.org>
References: <17888.46313.313814.279176@montanaro.dyndns.org>	<002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local>	<17892.7346.76540.56365@montanaro.dyndns.org>	<45E4231C.5010801@acm.org>
	<17892.11081.89379.191710@montanaro.dyndns.org>
Message-ID: <45E42EE3.1010102@acm.org>

On 2007-02-27 13:59, skip at pobox.com wrote:
>     Sjoerd> Do keep in mind that gocr and ocrad are not installed on all
>     Sjoerd> systems.  It's great if you put it into the Windows
>     Sjoerd> distribution, but on Linux that is not an option, and spambayes
>     Sjoerd> should still work on Linux as well.
> 
> The image analysis code will still work if you enable the image cracking
> options but don't have the requisite ocr engine installed.  It just emits a
> message to standard error and returns.

I'm not sure that is a good user interface: you change nothing in the
code, yet you get a warning about a missing program.

Would it be possible for OCR to default to off on all systems except
Windows, where gocr and/or ocrad is included?
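
A sketch of what such a platform-dependent default could look like, in
present-day Python 3.  The function name default_ocr_enabled is
hypothetical and is not an existing spambayes option; it only
illustrates the proposal of enabling OCR out of the box where the
engine is bundled.

```python
import sys


def default_ocr_enabled(platform=sys.platform):
    """Return True only where an OCR binary is assumed to be bundled.

    Hypothetical default: on for Windows (sys.platform == "win32"),
    off everywhere else unless the user turns it on explicitly.
    """
    return platform == "win32"
```

Users on other systems who do install gocr or ocrad would then enable
the option themselves, and nobody sees a warning without asking for OCR.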

-- 
Sjoerd Mullender
