From sethg at goodmanassociates.com Fri Feb 2 20:45:46 2007
From: sethg at goodmanassociates.com (Seth Goodman)
Date: Fri, 2 Feb 2007 13:45:46 -0600
Subject: [spambayes-dev] was [Spambayes] date for new release to handle image spam?
In-Reply-To: <17859.22356.153465.151561@montanaro.dyndns.org>
Message-ID:

skip at pobox.com wrote on Friday, February 02, 2007 9:23 AM -0600:

> Seth Goodman wrote:
>
> > The word salad they use to drown out significant clues generally
> > fails, but if they throw enough words at it, they sometimes dilute
> > the spam clues sufficiently. The fact that they throw hundreds of
> > "noise" words at the filters for every spam clue they want to hide
> > and Bayesian filters still catch half or three-quarters of it
> > shows how powerful the Bayesian approach really is....
>
> Hmmm... Could we do something to measure the amount of word salad
> without penalizing large non-image emails?

That's a very interesting idea: a meta-analysis after tokenizing. To restate the hypothesis you imply: spam using word salad may have a different percentage of tokens that are significant clues than non-spam email. Taking this further, there may also be differences in the total number of distinct tokens generated, and how many of those tokens are from words versus synthetic tokens. So in general, try to make use of any correlation between spamminess and meta-information like total number of tokens generated, total number of word tokens generated, number of significant clues and number of non-significant clues. A very cool general extension to Bayesian classification.

I don't know how you'd put this meta-information into a form that Spambayes could make use of. Let's see, the database tells you how many times a given token appears in the ham/spam training sets. From this you calculate a spam probability that is combined with the results of other tokens to give an overall spam probability.
For a numeric value token, you want to calculate a spam probability of the numeric value with respect to the values in the ham/spam training sets. It's a different calculation, but it is still probably amenable to using a chi-square distribution so you can combine it with other clues.

>
> > - zombie hosts tend to be weak on SMTP etiquette, so one clue is
> > that they often fail to wait when asked; making the SMTP client
> > wait for 30 seconds before sending the "connect banner" often
> > tricks impatient zombies into spewing, and you can then hang up;
>
> Yeah, but this is a job for postgrey and other similar tools.

Yes, sendmail/exim/qmail, but we're completely in agreement on the location. My point to the OP was that the MTA is the best place to make spam filtering more effective by cutting down on the amount of spam post-acceptance filters have to process. The example was meant to show the kind of behavioral clues that suggest an SMTP client may not be a legitimate mail host and that the connection should be refused. I was suggesting that doing the MTA part a little better has far greater return than anything you do later. I suspect that the best rejection criterion for image spam is the identity of the SMTP client (a zombie host), and that's hard to do once a message is delivered to a user mailbox.

After giving a few examples, I realized that the decision process is similar to the one used in a post-acceptance spam filter, so perhaps MTAs could make use of Bayesian classification to make better decisions.

The current state of the art (OK, bleeding edge) is to use a reputation system that accumulates reputation (hamminess) for each of several possible sender identity types, identity qualification methods and qualification results. For example, there are three common identities available at SMTP envelope time: connecting IP address, connecting hostname, and SMTP MAILFROM address (domain part only).
Because of the prevalence of forgery, you attempt to qualify each identity using a hierarchy of possible methods. Common methods to qualify an identity are SPF and forward/reverse DNS. Each qualification method can produce results of pass, fail or unknown. The tuple of (identity, qualification method, qualification result) forms an atom in the database and holds a reputation score. There are also behavioral clues from the connecting SMTP client which are useful when there is no reputation data. Finally, there is a time component so the data remains current. Every time a connecting MTA offers a message, the receiving MTA must make a trinary decision analogous to what Spambayes does: accept all messages from this sender (whitelist), deny all messages from this sender (blacklist), or allow the sender to present messages but filter each one for content (unsure). The quality of the decisions is particularly important for senders with no reputation, as that is where most spam comes from, yet it also includes infrequent senders with real messages. Sender in this context means mail host or domain that bounces go to, not the mailbox address of the author. -- Seth Goodman From sethg at goodmanassociates.com Sat Feb 3 21:48:24 2007 From: sethg at goodmanassociates.com (Seth Goodman) Date: Sat, 3 Feb 2007 14:48:24 -0600 Subject: [spambayes-dev] was [Spambayes] date for new release to handleimage spam? In-Reply-To: Message-ID: Another possible meta-token that might help detect word salad (probably what Skip had in mind): percentage of unique word tokens that are not significant Whether or not this would help classify word salad better is anyone's guess. I would hope that your own correspondents have some messages in the training set, so a larger fraction of their obscure words would be significant clues than you'd expect of random text from other sources. Using a percentage rather than an absolute number may avoid bias towards large or small messages. 
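A minimal sketch of how such a percentage meta-token could be computed, under stated assumptions: the function names `pct_insignificant` and `bucket_token` and the token name `meta:pct-insignificant:NN` are hypothetical, not part of SpamBayes. "Significant" is taken here to mean that a token's spam probability is at least 0.1 away from 0.5, which mirrors (as far as I understand it) the classifier's `minimum_prob_strength` option; unseen tokens are treated as having probability 0.5. Bucketing the percentage is just one possible way to turn a number into an ordinary string token.

```python
def pct_insignificant(tokens, spamprob, min_strength=0.1, unknown_prob=0.5):
    """Percentage of unique tokens whose spam probability is too close
    to 0.5 to count as a significant clue.

    tokens   -- iterable of word tokens from one message
    spamprob -- dict mapping token -> spam probability (assumed to come
                from an already trained database)
    """
    unique = set(tokens)
    if not unique:
        return 0.0
    weak = sum(
        1 for t in unique
        if abs(spamprob.get(t, unknown_prob) - 0.5) < min_strength
    )
    return 100.0 * weak / len(unique)


def bucket_token(pct, width=10):
    """Map a percentage onto a coarse string token (hypothetical naming)."""
    bucket = min(int(pct // width) * width, 100 - width)
    return "meta:pct-insignificant:%d" % bucket
```

Because the result is a percentage of unique tokens, a long message and a short message with the same mix of strong and weak clues get the same token, which is the size-independence argued for above.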
Then again, having both percentage and total number versions of this meta-token may prove useful for some users' training sets, as their legitimate mail may tend towards large or small messages. If one version or the other is not useful for an end user, that meta-token will probably turn out to not be significant and will be excluded from the overall score. Using meta-information is a little scary, since the underlying tokens already contribute to the overall spam score. I think the trick is to devise meta-tokens that describe overall message characteristics and are relatively independent of individual token scores. -- Seth Goodman From skip at pobox.com Sat Feb 3 22:16:57 2007 From: skip at pobox.com (skip at pobox.com) Date: Sat, 3 Feb 2007 15:16:57 -0600 Subject: [spambayes-dev] was [Spambayes] date for new release to handleimage spam? In-Reply-To: References: Message-ID: <17860.64457.594466.274613@montanaro.dyndns.org> Seth> Another possible meta-token that might help detect word salad Seth> (probably what Skip had in mind): Seth> percentage of unique word tokens that are not significant I see a chicken-and-egg situation developing when we try to compute these sort of numbers. Start with an empty database. Train on a ham message. No words are significant at that point, so having no significant word tokens is a hammy clue. Train on a spam. By definition all words in the database at this point are significant, so only words not yet seen will be deemed not significant. Lather, rinse, repeat. Maybe after you're done training on all available messages you can toss all these percentage tokens and make a second pass over your messages computing only those tokens. Are there better ways to compute tokens such as this which depend on the contribution of other messages in the database? 
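The two-pass idea above could be sketched roughly as follows. This is only an illustration under loud assumptions: the spam probability used here is a simplified ratio of per-class message frequencies, not the formula SpamBayes actually uses, and every name (`train_words`, `spamprob`, `meta_token`, `train_two_pass`, the `meta:pct-insignificant:NN` token) is hypothetical.

```python
from collections import Counter


def train_words(messages):
    """Pass 1: for each class, count how many messages contain each token."""
    counts = {"ham": Counter(), "spam": Counter()}
    totals = Counter()
    for label, tokens in messages:
        totals[label] += 1
        counts[label].update(set(tokens))
    return counts, totals


def spamprob(token, counts, totals):
    """Simplified spam probability (an assumption, not SpamBayes' formula)."""
    ham = counts["ham"][token] / float(max(totals["ham"], 1))
    spam = counts["spam"][token] / float(max(totals["spam"], 1))
    if ham + spam == 0:
        return 0.5  # never seen: carries no evidence either way
    return spam / (ham + spam)


def meta_token(tokens, counts, totals, min_strength=0.1):
    """Percentage of unique tokens that are not significant, in 25% buckets."""
    unique = set(tokens)
    weak = [t for t in unique
            if abs(spamprob(t, counts, totals) - 0.5) < min_strength]
    pct = 100.0 * len(weak) / len(unique) if unique else 0.0
    return "meta:pct-insignificant:%d" % (min(int(pct // 25), 3) * 25)


def train_two_pass(messages):
    """Pass 2 runs only after the word database is complete, so every
    message's meta-token is computed against the same, final database."""
    counts, totals = train_words(messages)
    meta_counts = {"ham": Counter(), "spam": Counter()}
    for label, tokens in messages:
        meta_counts[label][meta_token(tokens, counts, totals)] += 1
    return counts, totals, meta_counts
```

With a tiny corpus almost every token appears in only one class and so counts as significant, which is exactly the chicken-and-egg effect described above; the second pass removes the dependence on training order but not the dependence on corpus size.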
Skip

From sethg at goodmanassociates.com Mon Feb 5 01:43:51 2007
From: sethg at goodmanassociates.com (Seth Goodman)
Date: Sun, 4 Feb 2007 18:43:51 -0600
Subject: [spambayes-dev] was date for new release ...
In-Reply-To: <17860.64457.594466.274613@montanaro.dyndns.org>
Message-ID:

skip at pobox.com wrote on Saturday, February 03, 2007 3:17 PM -0600:

> Seth> Another possible meta-token that might help detect word salad
> Seth> (probably what Skip had in mind):
>
> Seth> percentage of unique word tokens that are not significant
>
> I see a chicken-and-egg situation developing when we try to compute
> these sort of numbers. Start with an empty database. Train on a ham
> message. No words are significant at that point, so having no
> significant word tokens is a hammy clue. Train on a spam. By
> definition all words in the database at this point are significant,
> so only words not yet seen will be deemed not significant.

It definitely has chicken and egg properties.

>
> Lather, rinse, repeat.
>
> Maybe after you're done training on all available messages you can
> toss all these percentage tokens and make a second pass over your
> messages computing only those tokens. Are there better ways to
> compute tokens such as this which depend on the contribution of
> other messages in the database?

I hope so. This is fundamentally different from drawing an inference from previously observed word frequencies. Numeric value meta-tokens are not the result of binary experiments. They exist for every message, whether ham or spam, and they are real numbers. We don't know their underlying distribution. The problem is to estimate the probability that a message that contains a token with a given numeric value is ham or spam based on the values of that token observed in trained ham and spam.

This is a very raw idea, not even half-baked. I think this problem becomes tractable if we assume the token values are Gaussian distributed, even if we believe they aren't.
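A rough sketch of what the Gaussian assumption could look like in code, offered as an illustration rather than a worked-out design. Assumptions: equal prior weight for ham and spam, population variance with a small floor, and per-class storage of a count, a sum and a sum of squares so that untraining is plain subtraction. The names `RunningStats`, `gaussian_pdf` and `spam_probability` are hypothetical. The sum-of-squares form is simple but can lose precision for large values.

```python
import math


class RunningStats:
    """Count, sum and sum of squares of one numeric meta-token in one class."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def train(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def untrain(self, x):
        # Reverses an earlier train(x); no per-message history is needed.
        self.n -= 1
        self.total -= x
        self.total_sq -= x * x

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        # Population variance, floored so the density never divides by zero.
        return max(self.total_sq / self.n - self.mean ** 2, 1e-9)


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def spam_probability(x, ham, spam):
    """P(spam | value x), assuming Gaussian values and equal priors."""
    p_ham = gaussian_pdf(x, ham.mean, ham.variance)
    p_spam = gaussian_pdf(x, spam.mean, spam.variance)
    if p_ham + p_spam == 0.0:
        return 0.5  # value is far from both distributions: no evidence
    return p_spam / (p_ham + p_spam)
```

The returned value lies between 0 and 1, so in principle it could be fed into the same combining step as ordinary token probabilities; whether that behaves well in practice is untested here.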
It should be possible to estimate the likelihood that a given token value is from a spam message based on the distribution of that token's value in both trained ham and spam. If it's Gaussian, we only need to know the mean and variance of each distribution.

If this turns out to work at all, we wouldn't need that much information in the database. For each numeric value token you model this way, you need at least the mean and variance for each of ham and spam. To untrain a value, I think you could get away with keeping only the intermediate values used to calculate variance, and I vaguely recall two of them. If you want to support arbitrary real values, these are all floats, with the possibility that the intermediate variables are double precision.

--
Seth Goodman

From mhammond at skippinet.com.au Mon Feb 5 04:24:38 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon, 5 Feb 2007 14:24:38 +1100
Subject: [spambayes-dev] [Spambayes] date for new release to handle image spam?
In-Reply-To: <09c701c748bd$4fb7e8a0$230a0a0a@enfoldsystems.local>
Message-ID: <09e201c748d5$2b1db6b0$230a0a0a@enfoldsystems.local>

In the message below (which I sent to spambayes instead of -dev), I mentioned I got much better results with gocr than ocrad. I've uploaded my patch at http://sourceforge.net/tracker/index.php?func=detail&aid=1652111&group_id=61702&atid=498105, and I've assigned it to skip for a quick scan. There are some bits of the outlook patch mixed in there too, but that shouldn't distract from the rest of the patch. I'd obviously welcome all testing of this and am happy to check it in.

Cheers,

Mark

> -----Original Message-----
> From: spambayes-bounces at python.org
> [mailto:spambayes-bounces at python.org]On Behalf Of Mark Hammond
> Sent: Monday, 5 February 2007 11:34 AM
> To: skip at pobox.com
> Cc: spambayes at python.org
> Subject: Re: [Spambayes] date for new release to handle image spam?
> > > > If you run ocrad over some spam text images you can see what > > it generates. > > If it finds nothing, nothing comes out the back end. If it > > sees something, > > it's almost certain to be some garbage text peculiar to it, > > unlikely to turn > > up in normal text. For example, here's a pretty clean image: > > > > http://www.webfast.com/~skip/bogus-5-3.png > > > > Here's what ocrad produces by default: > > > > COULD THl_ BE THE NEXT IBM_ > > ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl > > WAIIW LllL p_ Ll_E A WAW_ _IARll__ WO_DA_ _EPIEWBER lll > > > > IomO_n_ __m_ L |_IL IOWP_IER_ |_I (o_h__ OII LllL p_) > > __o__ __mbol LllL > > F_ld__ Ilo__ O Tl (_o s_/_ On F_ld__ Alon_|) > > _ d__ |__o__ __ > > I____n_ R__lnO ___onO B__ > > \ > > ln _h_ Io____ ot _ W___. LllL W____ ______| ___nnlnO Wo___' > > > > L ln___n__lon_| Anno_n___ > > > > On_lo__h(IW) _P_o_P__ TP_hnoloO_ b_ > > B_llP_ p_oo_ Da_a _P___|__ Ba_k_O_ and _P__o_P_ > > |__ ____ __n____lon p__Aqco_TM_/P__AID CO_TM_ > > _|__a Po__ablP wloh _OPPd _olld __a_P D_|_P TP_hnoloO_ > > _h_ W___oOoll_. _hP Wo_ld _ _|___ _g laO_oO ComOrfP_ > > _Pa___lnO W_ldla _ Q_a_ll TP_hnoloO_ > > \ > > L ln___n__lon_| _IOn_ _4 _W E__oO__n Dl___lb__lon AO___m_n_ > > > > Th_ b_Pmo__ __PO b_wa_d _a__|_al _Pn___P |_ amonO o_hP_ > p__|__|_P > > dl___lb__lon aO_PPmPn__ ____Pn_|_ _ndP_ nPOo_la_lon ?_ > > _P_P_al addl_lonal > > hlOh O_ofi_ _POlon_ and _PO_P_Pn__ a kP_ ___a_POl_ > > Oa__nP__hlO _ha_ _P___P_ > > l ln_P_na_lonal ComO__P__ wl_h ___|_ Olobal ma_kP_ _Pa_h > > and O_a_an_PPd > > O_P _alP_ and lo_k_ _hP _omOan_ ln hlOhl_ dP_|_ablP > > p__|__|_P dl___lb__lon > > ma_kP__ > > > > READ MORE ONLINE NOWl > > > > OPPORl__||_ DOE_ _ol __OI_ o_ IWE DOOR E_ER_ DA_| > > _o _A_E A Wl__IE IOODD LllL lo _O_R RADAR _ow A_D > > WAIIW II _OARl > > FWIW, I am getting *much* better results with gocr than > ocrad. gocr running > over that same image results in: > > --- 8< --- > _ _ _ _ > COULD THIS BE THE NEXT IBM? 
> ALL SIGNS SHOW THAT LITL IS ABOUT TO EXPLODE! > > Company Name: > Stock Symbol: > Friday Close: O.71 (Up 6O_a On Friday Alone!) > S-dayTarget: $3 > Current Rating: Strong Buy > \ > > In the Course of a Week, LITL Makes Several Stunning Moves! > > L International Announces: > > - OneTouch(TM) Recovery Technology hr > Bullet-Proof Data Security Backups and Restores , > - Its Next-Generation PuRA_GO(TM)/PuRAID-GO(TM) > UItra-Portable High-Speed Solid State Drive Technology > . - the metropolis, the worldt First l9'' Laptop compWer > Featuring Nvidiat Quad-SLI Technology _ > > \ > L International Signs $4SM European Distribution Agreement > > - T_s hremost step hrward tactical venture is, among other exclusive > distribution agreements, currently under negotiation gr > several additional > high-pro_t regions and represents a key strategic partnership > that secures > L International Computers with truly global market reach and > guaranteed > pre-sales, and locks the company in highly desirable > exclusive distribution > marke.ts. > > --- >8 ---- > > Indeed, I have never seen an image that ocrad does better on > than gocr. > FWIW, I'm currently 1/2 way through modifying spambayes to > support either > ocrad or gocr, in the hope that using gocr will actually > cause a noticible > reduction in image spam - unfortunately, using gocr I see no > reduction at > all (which isn't to say there is not a small reduction - it > just doesn't > "seem" to me like it has reduced). > > Mark > > _______________________________________________ > SpamBayes at python.org > http://mail.python.org/mailman/listinfo/spambayes > Check the FAQ before asking: http://spambayes.sf.net/faq.html > From skip at pobox.com Mon Feb 5 05:01:15 2007 From: skip at pobox.com (skip at pobox.com) Date: Sun, 4 Feb 2007 22:01:15 -0600 Subject: [spambayes-dev] [Spambayes] date for new release to handle image spam? 
In-Reply-To: <09e201c748d5$2b1db6b0$230a0a0a@enfoldsystems.local>
References: <09c701c748bd$4fb7e8a0$230a0a0a@enfoldsystems.local> <09e201c748d5$2b1db6b0$230a0a0a@enfoldsystems.local>
Message-ID: <17862.44043.681609.437198@montanaro.dyndns.org>

Mark> In the message below (which I sent to spambayes instead of -dev),
Mark> I mentioned I got much better results with gocr than ocrad. I've
Mark> uploaded my patch at
Mark> http://sourceforge.net/tracker/index.php?func=detail&aid=1652111&group_id=61702&atid=498105,
Mark> and I've assigned it to skip for a quick scan.
Mark> There are some bits of the outlook patch mixed in there too, but
Mark> that shouldn't distract from the rest of the patch. I'd obviously
Mark> welcome all testing of this and am happy to check it in.

Mark,

I made some changes today as well (not yet checked in) in an attempt to improve the ability of ocrad to extract text from images. It really requires the text be dark and the background be light in order to "see" anything. I believe a perfectly formatted image where the text is white on a black background results in no output by ocrad. I created a patch to try and remedy that:

http://sourceforge.net/tracker/index.php?func=detail&aid=1652120&group_id=61702&atid=498105

I need to get to bed, but I'll try to look at your gocr patch Monday or Tuesday.

Skip

From skip at pobox.com Tue Feb 6 03:07:01 2007
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 5 Feb 2007 20:07:01 -0600
Subject: [spambayes-dev] gocr is definitely improving...
Message-ID: <17863.58053.857731.637415@montanaro.dyndns.org> I got a mail with image spam today (I probably got quite a few but gmail blocks most of them nowadays): http://www.webfast.com/~skip/thermometer.gif I ran gocr 0.41 over it and got this output: > _'__o______ __ ____o______ ___ i__8 _____ 00,__ 0 0_,_> 0 __8 ___E3 __>_E3 __ E3_,__ _____ 0,__,_ _ _0______ _ 0 __0 _, ___ _E3____ E3 _ _ _ ____ ____ 'o__0____ ____ 0>,E3 _______ __ _________, _,______ _ 0 __________ ___,,_____, ____ ____',____ ____ ___ ___ _ 0 ___ >__ ____ ___ ____ _ ___E3_ ___e__ ___E3___ 0 ______ The latest version is 0.43, so I downloaded and built it (with a couple slight tweaks needed). When fed the same image it spit out: _ _ _ _ _ _ _ X;niy_nha_ Technology Ltd qnb oI_ _ p_rce I1.SB lP 1_.6_ hb te: H_ts Il_ghs of I1._B TodJy .M_ rc _ Fxpected T _ rr _ Ini thc Izst 3 _ eks they ha_e ianded o_er I1.Z M II_on _n contracts. TJdays n _ Jnnounced anothe? huge cont_iact. Read all the n _ and set ycur buy fur_ mm f_rst cn_ng Tuesday nD rn_ng! Pretty huge improvement. (I think you can see why I gave up on gocr before.) By comparison, with my latest massaging of the input fed to ocrad I get: X?nU?nha? TechnologU L_d glbol! _ p_rce __.58 LP _3.6_ __e: H__s H_ghs or __.78 Tod_V _re _ Expec_ed T_rr_ In _he las_ 3 _ehs _heV ha_e landed o_er t_.2 n?ll?on ?n con_roc_s, TodoVs n_ onnounced ono_her huge con_rac_, RPad all _he n_ and se_ Uour buU ror mM r?rs_ _h?ng TuesdaU nDrn?ng! Without any massaging ocrad doesn't find any text. You have to give the --invert flag. Seems like it should automatically try to invert the image if its first attempt to extract text completely fails. At any rate, gocr looks much better than it did. I'm going to install it and give your patch a try for a couple days. It looks fine based on a simple skim of the changes. Go ahead and check it in so more people can play with it. 
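The "try again inverted if nothing comes out" behaviour suggested above could be wrapped around the OCR call along these lines. A minimal sketch under assumptions: the image has already been converted to a format ocrad can read, ocrad writes the recognized text to standard output, and `--invert` is the flag mentioned above. The helper names `run_ocrad` and `extract_text` are hypothetical, and the OCR runner is passed in as a parameter so the retry logic can be exercised without ocrad installed.

```python
import subprocess


def run_ocrad(path, invert=False):
    """Run ocrad on one image file and return whatever text it prints."""
    cmd = ["ocrad"]
    if invert:
        cmd.append("--invert")
    cmd.append(path)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout


def extract_text(path, run_ocr=run_ocrad):
    """Try the image as-is; if no text at all comes back, retry inverted."""
    text = run_ocr(path, invert=False)
    if text.strip():
        return text
    return run_ocr(path, invert=True)
```

The retry costs a second OCR run only for images that produced nothing the first time, so ordinary dark-on-light images are unaffected.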
Skip

From pl at symbolic.it Tue Feb 6 09:39:14 2007
From: pl at symbolic.it (Luigi Pugnetti)
Date: Tue, 06 Feb 2007 09:39:14 +0100
Subject: [spambayes-dev] gocr is definitely improving...
In-Reply-To: <17863.58053.857731.637415@montanaro.dyndns.org>
References: <17863.58053.857731.637415@montanaro.dyndns.org>
Message-ID: <1170751155.29941.46.camel@localhost.localdomain>

On Mon, 2007-02-05 at 20:07 -0600, skip at pobox.com wrote:
>
> Without any massaging ocrad doesn't find any text. You have to give the
> --invert flag. Seems like it should automatically try to invert the image
> if its first attempt to extract text completely fails.

You could use a simple check to find out whether the invert flag is needed:

    from PIL import ImageStat   # plain "import ImageStat" on older PIL installs

    mean = ImageStat.Stat(image).mean   # per-band means, assuming an RGB image
    if mean[0] + mean[1] + mean[2] >= (128 * 3):
        invert = True   # the invert flag is needed

This is a very simple check that can sometimes fail (inversion is needed but the condition is false; I've never seen the opposite). Probably checking whether two of the mean[] values are greater than 128 could suffice, especially when one of them is very big (> 190).

>
> At any rate, gocr looks much better than it did. I'm going to install it
> and give your patch a try for a couple days. It looks fine based on a
> simple skim of the changes. Go ahead and check it in so more people can
> play with it.
>
> Skip
>
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev at python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev
--
Luigi Pugnetti
Symbolic S.p.A.
V.le Mentana, 29
I-43100 Parma Italy
Tel: +39 0521 708811
Fax: +39 0521 776190

From genojoe at neo.rr.com Wed Feb 7 10:15:27 2007
From: genojoe at neo.rr.com (Gene Rhodes)
Date: Wed, 7 Feb 2007 04:15:27 -0500
Subject: [spambayes-dev] Desired Feature
Message-ID: <009c01c74a98$81ea0330$650fa8c0@PRESARIO>

One feature that may already exist but I cannot find is the following:

I have a list of 20 or so people that I regularly receive email from.
Sometimes their email is identified as Spam regardless of the "training".

Is there a way to create a list of email addresses that are always treated as "good" regardless of their spam score? This is such an obvious request that I think that it may already exist but I am not able to find it.

I am using Outlook 2003, Road runner and Windows XP in my home.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20070207/81e02f15/attachment.htm

From genojoe at neo.rr.com Wed Feb 7 10:23:36 2007
From: genojoe at neo.rr.com (Gene Rhodes)
Date: Wed, 7 Feb 2007 04:23:36 -0500
Subject: [spambayes-dev] FW: Desired Feature
Message-ID: <00a101c74a99$a5284540$650fa8c0@PRESARIO>

In the email sent immediately preceding this email, I stated:

One feature that may already exist but I cannot find is the following:

I have a list of 20 or so people that I regularly receive email from. Sometimes their email is identified as Spam regardless of the "training".

Is there a way to create a list of email addresses that are always treated as "good" regardless of their spam score? This is such an obvious request that I think that it may already exist but I am not able to find it.

I am using Outlook 2003, Road runner and Windows XP in my home.

Please add the following comment:

If the preceding feature does not exist, consider adding the following to your FAQ:

Question: Can I create a list of email addresses that are always accepted?

Answer: No, this capability currently does not exist in Spambayes.

This FAQ will help others that have the same desire. I think many people would find this helpful.

Again, this feature may exist. If that is the case, I apologize to you for implying that it does not exist.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20070207/ef1a05a2/attachment.htm

From skip at pobox.com Wed Feb 7 14:17:40 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 7 Feb 2007 07:17:40 -0600
Subject: [spambayes-dev] FW: Desired Feature
In-Reply-To: <00a101c74a99$a5284540$650fa8c0@PRESARIO>
References: <00a101c74a99$a5284540$650fa8c0@PRESARIO>
Message-ID: <17865.53620.723031.43882@montanaro.dyndns.org>

Gene> One feature that may already exist but I cannot find is the
Gene> following:

Gene> I have a list of 20 or so people that I regularly receive email
Gene> from. Sometimes their email is identified as Spam regardless of
Gene> the "training".

...

Gene> Is there a way to create a list of email addresses that are always
Gene> treated as "good" regardless of their spam score? This is such an
Gene> obvious request that I think that it may already exist but I am
Gene> not able to find it.

This is already in the FAQ (question 6.6):

http://spambayes.sourceforge.net/faq.html

Short answer: If you really want whitelisting, add a filter to Outlook that's run before SpamBayes scores those messages.

Skip

From bishop at aeroprise.com Wed Feb 7 20:55:08 2007
From: bishop at aeroprise.com (Peter Bishop)
Date: Wed, 7 Feb 2007 11:55:08 -0800
Subject: [spambayes-dev] new FAQ needed
Message-ID:

I have been monitoring the spambayes list and helping out some of the spambayes users for a couple of months. Some of the least frequently answered questions are related to the response below.

I recommend that this answer be reformatted into a FAQ and added to the FAQ list under "reinstalling SpamBayes". The information below should be included in the "Addin doesn't load" section of the Troubleshooting Guide that is installed with the product.

(It would be good to have some verification from more knowledgeable SpamBayes people that this info is good.)
I suspect this is not the right issue to raise on this list, but I need help figuring out how to help get this done, or get the ball rolling to get it done.

Peter Bishop

-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On Behalf Of Klieg
Sent: Tuesday, February 06, 2007 5:31 PM
To: spambayes at python.org
Subject: Re: [Spambayes] spambayes quit working and won't reinstall

You have likely corrected this by now, but I encountered a similar difficulty. SpamBayes quit working. The menu bar in Outlook still showed the controls, however they did not respond. I tried re-installing the product and it showed registered in the log, however still did not work.

I un-installed the product, deleted the menu from Outlook, then re-installed the product. It still did not work. When I tried to turn on the COM add-in in Outlook by checking the add-in under Tools > Options > Other > Advanced Options > COM Add-Ins, still nothing. When going back into the COM Add-Ins options, the checkbox was un-checked.

Then I selected the Spambayes item in the COM Add-Ins option window and selected 'Remove'. I then selected 'Add' to add in the SpamBayes add-in. In the file dialogue box, you need to navigate to the add-in under program files > Spambayes > Bin > outlook_addin.dll. The add-in takes a little while to install at this point, then the Spambayes menu items should show up. You'll need to re-configure some of the Spambayes setup, then things should be back in working order.

Norm Dingle wrote:
>
> I am running Outlook 2003 and had Spambayes running fine.
>
> All of a sudden it quit working. I have read the trouble shooting
> manual and nothing seems to work.
>
> I have uninstalled Spambayes and reinstalled several times.
>
> The log file says it is registered.
>
> When I go to the Advanced options/COM Add-Ins dialog, Spambayes shows in
> the list but unchecked.
>
> I check it and then restart Outlook. No Spambayes.
> The addin dialog then shows it unchecked again.
>
> Right now I don't have any other ideas about what to try.
>
> Thanks
>
> Norm
>
> --
> No virus found in this outgoing message.
> Checked by AVG Anti-Virus.
> Version: 7.0.344 / Virus Database: 267.10.18/90 - Release Date:
> 9/5/2005
>
> _______________________________________________
> Spambayes at python.org
> http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html
>

--
View this message in context: http://www.nabble.com/spambayes-quit-working-and-won%27t-reinstall-tf282807.html#a8838231
Sent from the Python - spambayes mailing list archive at Nabble.com.

_______________________________________________
SpamBayes at python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html

From Sjoerd.Mullender at cwi.nl Tue Feb 13 08:38:53 2007
From: Sjoerd.Mullender at cwi.nl (Sjoerd Mullender)
Date: Tue, 13 Feb 2007 08:38:53 +0100
Subject: [spambayes-dev] bug in spambayes/ImageStripper.py?
Message-ID: <45D16B0D.7020603@cwi.nl>

There are two occurrences of the name "program_name" in spambayes/ImageStripper.py. Shouldn't both be "engine_name"?

--
Sjoerd Mullender

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 369 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20070213/a1c262ba/attachment.pgp

From mhammond at skippinet.com.au Wed Feb 14 01:53:41 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 14 Feb 2007 11:53:41 +1100
Subject: [spambayes-dev] bug in spambayes/ImageStripper.py?
In-Reply-To: <45D16B0D.7020603@cwi.nl>
Message-ID: <021b01c74fd2$92dd1f80$060a0a0a@enfoldsystems.local>

> There are two occurrences of the name "program_name" in
> spambayes/ImageStripper.py. Shouldn't both be "engine_name"?

They should indeed!
I've checked that in. Cheers, Mark From f.rougon at free.fr Mon Feb 19 14:23:42 2007 From: f.rougon at free.fr (Florent Rougon) Date: Mon, 19 Feb 2007 14:23:42 +0100 Subject: [spambayes-dev] Tesseract OCR Message-ID: <87r6sm5li9.fsf@florent.maison> Hi, I just discovered the existence of Tesseract OCR, whose homepage[1] says: A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. I thought some of you (Skip, Mark) might be interested if you hadn't heard about this software yet. According to the Debian package page[2], Tesseract OCR is command-line driven, which sounds good for you. And according to the Debian copyright file, the software is released under the Apache License, version 2.0. That's it, end of advertisement. Thanks for the great spam filter that saved my life, and keep up the good work! :) Regards, [1] http://sourceforge.net/projects/tesseract-ocr [2] http://packages.debian.org/unstable/graphics/tesseract-ocr -- Florent From mhammond at skippinet.com.au Tue Feb 20 00:16:44 2007 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue, 20 Feb 2007 10:16:44 +1100 Subject: [spambayes-dev] Tesseract OCR In-Reply-To: <87r6sm5li9.fsf@florent.maison> Message-ID: <02af01c7547c$06184440$010a0a0a@enfoldsystems.local> > I just discovered the existence of Tesseract OCR, whose > homepage[1] says: > > A commercial quality OCR engine originally developed at HP between > 1985 and 1995. In 1995, this engine was among the top 3 evaluated by > UNLV. It was open-sourced by HP and UNLV in 2005. > > I thought some of you (Skip, Mark) might be interested if you hadn't > heard about this software yet. You could help us out here too, by running some of your image spam against the various engines and manually inspecting the accuracy of the text versus what you actually see in the image. 
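For anyone wanting to make that comparison a little less subjective, here is a small sketch that scores each engine's output against a hand-typed transcription using the standard library's `difflib`. The helper names `ocr_accuracy` and `rank_engines` are hypothetical, and the similarity ratio is only a rough proxy for OCR accuracy, not an established metric for it. The sample strings are the first line of the ocrad and gocr outputs posted earlier in this thread; the "expected" text is an assumption, since the image itself is not reproduced here.

```python
import difflib


def ocr_accuracy(expected, ocr_output):
    """Similarity in [0, 1] between a manual transcription and OCR output,
    ignoring differences in whitespace only."""
    def normalize(text):
        return " ".join(text.split())
    matcher = difflib.SequenceMatcher(None, normalize(expected),
                                      normalize(ocr_output))
    return matcher.ratio()


def rank_engines(expected, outputs):
    """Return (engine name, score) pairs, best score first."""
    scores = {name: ocr_accuracy(expected, text)
              for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


expected = "COULD THIS BE THE NEXT IBM?"   # assumed transcription
outputs = {
    "ocrad": "COULD THl_ BE THE NEXT IBM_",
    "gocr": "COULD THIS BE THE NEXT IBM?",
}
ranking = rank_engines(expected, outputs)
```

Averaging such scores over a few dozen image spams per engine would give a number to set beside the manual impression.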
My quick experiments show that tesseract is very close to the results I get from gocr, and significantly better than ocrad. Mark From mhammond at skippinet.com.au Thu Feb 22 02:01:56 2007 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu, 22 Feb 2007 12:01:56 +1100 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? Message-ID: <00d801c7561d$0d628830$020a0a0a@enfoldsystems.local> Hi guys, On a bit of a whim, I mailed the jocr-devels list to try and get informal approval for distributing their windows binary with spambayes. I believe that as we are not modifying the binary, as long as we point to the original binary (and therefore implicitly pointing at their source tarball etc, as required by the GPL) we should be fine. Anyone have any comments or objections to us releasing spambayes with a gocr binary? Mark -----Original Message----- From: Joerg Schulenburg [mailto:Joerg.Schulenburg at URZ.Uni-Magdeburg.DE] Sent: Tuesday, 20 February 2007 8:16 PM To: Mark Hammond Subject: Re: [Jocr-devels] redistribute gocr binaries? I have no problems with that. I also try to improve gocr according to spam images. Joerg. On Sun, 18 Feb 2007, Mark Hammond wrote: > Hi all, > I'm involved with the 'spambayes' project (spambayes.org), an open-source > client-based spam solution, and we've had recent success in using gocr with > our recent OCR enhancements. Spambayes is released under a 'Python' style > open-source license which is closer to a BSD license than to the GPL. > > Are there any license considerations or any other objections to us including > a gocr binary with our Windows binaries? If not, are there any other > requests or guidelines you would like us to adhere to? 
We are looking at > including the unmodified binary at > http://www-e.uni-magdeburg.de/jschulen/ocr/gocr043.exe in our application > directory (an email address for Peter B L Meijer wasn't obvious, otherwise > I'd CC him) > > Thanks, > > Mark > > -- ------------------------------------------------------------------------- - \V/ - - EMAIL: joerg.schulenburg at urz.uni-magdeburg.de (o o) - ----------------------------------------------------------oOo-(_)-oOo---- - http://www-e.uni-magdeburg.de/jschulen/ - PGP 1024D/53BDFBE3, 3816 B803 D578 F5AD 12FD FE06 5D33 0C49 53BD FBE3 ------------------------------------------------------------------------- From skip at pobox.com Sat Feb 24 22:58:01 2007 From: skip at pobox.com (skip at pobox.com) Date: Sat, 24 Feb 2007 15:58:01 -0600 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? In-Reply-To: <00d801c7561d$0d628830$020a0a0a@enfoldsystems.local> References: <00d801c7561d$0d628830$020a0a0a@enfoldsystems.local> Message-ID: <17888.46313.313814.279176@montanaro.dyndns.org> Mark> On a bit of a whim, I mailed the jocr-devels list to try and get Mark> informal approval for distributing their windows binary with Mark> spambayes. I believe that as we are not modifying the binary, as Mark> long as we point to the original binary (and therefore implicitly Mark> pointing at their source tarball etc, as required by the GPL) we Mark> should be fine.
Mark> Anyone have any comments or objections to us releasing spambayes Mark> with a gocr binary? No objections. I went through the same procedure with ocrad. Might as well be thorough. Skip From f.rougon at free.fr Mon Feb 26 19:26:06 2007 From: f.rougon at free.fr (Florent Rougon) Date: Mon, 26 Feb 2007 19:26:06 +0100 Subject: [spambayes-dev] Tesseract OCR In-Reply-To: <02af01c7547c$06184440$010a0a0a@enfoldsystems.local> (Mark Hammond's message of "Tue, 20 Feb 2007 10:16:44 +1100") References: <02af01c7547c$06184440$010a0a0a@enfoldsystems.local> Message-ID: <87slcsn5c1.fsf@florent.maison> Hi Mark, "Mark Hammond" wrote: > You could help us out here too, by running some of your image spam against > the various engines and manually inspecting the accuracy of the text versus > what you actually see in the image. Sure... I just severely lack time to do that in the foreseeable future. :-/ So, my message was just "in case you didn't know" about tesseract yet. > My quick experiments show that tesseract is very close to the results > I get from gocr, and significantly better than ocrad. OK, so you had already tried it. Thanks! -- Florent From mhammond at skippinet.com.au Tue Feb 27 11:14:25 2007 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue, 27 Feb 2007 21:14:25 +1100 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? In-Reply-To: <17888.46313.313814.279176@montanaro.dyndns.org> Message-ID: <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local> > Mark> On a bit of a whim, I mailed the jocr-devels list > Mark> to try and get informal approval for distributing > Mark> their windows binary with spambayes. > No objections. I went through the same procedure with ocrad. > Might as well be thorough. Yeah - I stumbled across your mail to the ocrad -devel list a while ago - it inspired my similar mail to gocr :) My next proposition is that we enable this OCR support by default, and enable gocr as the default OCR engine - how does that sound?
If OK, should we do that by promoting the relevant 'X-' options to 'official' options, or just change the default values for the options as implemented? Mark From skip at pobox.com Tue Feb 27 12:57:38 2007 From: skip at pobox.com (skip at pobox.com) Date: Tue, 27 Feb 2007 05:57:38 -0600 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? In-Reply-To: <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local> References: <17888.46313.313814.279176@montanaro.dyndns.org> <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local> Message-ID: <17892.7346.76540.56365@montanaro.dyndns.org> Mark> My next proposition is that we enable this OCR support by default, Mark> and enable gocr as the default OCR engine - how does that sound? Mark> If OK, should we do that by promoting the relevant 'X-' options to Mark> 'official' options, or just change the default values for the Mark> options as implemented? Sounds good to me. Default to gocr, drop the "X-". Skip From sjoerd at acm.org Tue Feb 27 13:25:00 2007 From: sjoerd at acm.org (Sjoerd Mullender) Date: Tue, 27 Feb 2007 13:25:00 +0100 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? In-Reply-To: <17892.7346.76540.56365@montanaro.dyndns.org> References: <17888.46313.313814.279176@montanaro.dyndns.org> <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local> <17892.7346.76540.56365@montanaro.dyndns.org> Message-ID: <45E4231C.5010801@acm.org> On 2007-02-27 12:57, skip at pobox.com wrote: > Mark> My next proposition is that we enable this OCR support by default, > Mark> and enable gocr as the default OCR engine - how does that sound? > Mark> If OK, should we do that by promoting the relevant 'X-' options to > Mark> 'official' options, or just change the default values for the > Mark> options as implemented? > > Sounds good to me. Default to gocr, drop the "X-". Do keep in mind that gocr and ocrad are not installed on all systems. 
It's great if you put it into the Windows distribution, but on Linux that is not an option, and spambayes should still work on Linux as well. On Fedora Core, those two programs are not available unless the system administrator gets them from some third party repository (gocr (but not ocrad) is available in freshrpms) or builds them from source. -- Sjoerd Mullender From skip at pobox.com Tue Feb 27 13:59:53 2007 From: skip at pobox.com (skip at pobox.com) Date: Tue, 27 Feb 2007 06:59:53 -0600 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? In-Reply-To: <45E4231C.5010801@acm.org> References: <17888.46313.313814.279176@montanaro.dyndns.org> <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local> <17892.7346.76540.56365@montanaro.dyndns.org> <45E4231C.5010801@acm.org> Message-ID: <17892.11081.89379.191710@montanaro.dyndns.org> Sjoerd> Do keep in mind that gocr and ocrad are not installed on all Sjoerd> systems. It's great if you put it into the Windows Sjoerd> distribution, but on Linux that is not an option, and spambayes Sjoerd> should still work on Linux as well. The image analysis code will still work if you enable the image cracking options but don't have the requisite ocr engine installed. It just emits a message to standard error and returns. Sjoerd> On Fedora Core, those two programs are not available unless the Sjoerd> system administrator gets them from some third party repository Sjoerd> (gocr (but not ocrad) is available in freshrpms) or builds them Sjoerd> from source. I suspect that's true for many systems. It is certainly true on my Mac. On my Ubuntu system both ocrad and gocr are available if I check the "community maintained" box in the repositories popup.
Skip From sjoerd at acm.org Tue Feb 27 14:15:15 2007 From: sjoerd at acm.org (Sjoerd Mullender) Date: Tue, 27 Feb 2007 14:15:15 +0100 Subject: [spambayes-dev] FW: [Jocr-devels] redistribute gocr binaries? In-Reply-To: <17892.11081.89379.191710@montanaro.dyndns.org> References: <17888.46313.313814.279176@montanaro.dyndns.org> <002101c75a58$0fa4a9e0$070a0a0a@enfoldsystems.local> <17892.7346.76540.56365@montanaro.dyndns.org> <45E4231C.5010801@acm.org> <17892.11081.89379.191710@montanaro.dyndns.org> Message-ID: <45E42EE3.1010102@acm.org> On 2007-02-27 13:59, skip at pobox.com wrote: > Sjoerd> Do keep in mind that gocr and ocrad are not installed on all > Sjoerd> systems. It's great if you put it into the Windows > Sjoerd> distribution, but on Linux that is not an option, and spambayes > Sjoerd> should still work on Linux as well. > > The image analysis code will still work if you enable the image cracking > options but don't have the requisite ocr engine installed. It just emits a > message to standard error and returns. I'm not sure that is a good user interface: without changing anything in the code, you get a warning about a missing program. Would it be possible to have OCR default to off on all systems except Windows, where gocr and/or ocrad is included? -- Sjoerd Mullender
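For illustration only, here is a minimal sketch of the kind of default Sjoerd is asking about: OCR is on by default where a binary is bundled (Windows, per the discussion above) and otherwise on only if an OCR program is actually found on the PATH, so nobody gets a warning about a missing program without having asked for OCR. The function names default_ocr_engine and default_enable_ocr are hypothetical and are not part of the spambayes options code; the sketch also assumes a modern Python 3 with shutil.which.

```python
import shutil
import sys


def default_ocr_engine(candidates=("gocr", "ocrad"), which=shutil.which):
    """Return the name of the first OCR program found on PATH, else None.

    `candidates` is tried in order, so gocr is preferred over ocrad here.
    `which` is injectable so the lookup can be faked in tests.
    """
    for name in candidates:
        if which(name):
            return name
    return None


def default_enable_ocr(platform=sys.platform, which=shutil.which):
    """Decide whether OCR should be enabled by default (hypothetical policy).

    Assumption from the thread: the Windows installer ships a gocr binary,
    so OCR can default to on there. Elsewhere, enable it only when an
    engine is actually installed.
    """
    if platform.startswith("win"):
        return True
    return default_ocr_engine(which=which) is not None


if __name__ == "__main__":
    print("OCR enabled by default:", default_enable_ocr())
    print("OCR engine found on PATH:", default_ocr_engine())
```

A user who later installs gocr or ocrad would get OCR without editing any configuration, while a user without either program sees no message at all; whether that is preferable to an explicit option is a design choice for the list to decide.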