From sanjaydarisi at cox.net Sat Nov 1 05:13:08 2003 From: sanjaydarisi at cox.net (sanjaydarisi@cox.net) Date: Sat Nov 1 05:13:17 2003 Subject: [spambayes-dev] Question about binary installer! Message-ID: <20031101101308.EFFQ24944.fed1mtao01.cox.net@smtp.west.cox.net> I am curious on how to make the spambayes outlook addin binary installer. Is py2exe used or McMillan installer? 'cos I tried with both and got errors. Like I tried using py2exe with setup_all.py file in the windows/py2exe dir and got a few errors regarding the options of py2exe like 'exclude-dll' etc. I have py2exe 0.4.2 and python 2.3 Then i went back to McMillan installer and tried. I used the Outlook2000\installer\spambayes_addin.py|spec|iss files. I was able to get a .exe file at the end. But, it won't work like I was able to install and the spambayes won't come up when I start the Outlook. Is it the same file that is used in building the spambayes outlook addin binary installer? I have few COM errors in the log file, complaining about lack of resource section for the image files and an assertion error saying 'Should not yet have a toolbar' Is there anything else that I have to do inaddition to using those files? Could anyone let me know how the binary installer of spambayes Outlook addin is built and could share the script used in this process? I'd really appreciate your help. Thank you, Sanjay. From theller at python.net Sat Nov 1 06:44:19 2003 From: theller at python.net (Thomas Heller) Date: Sat Nov 1 06:44:20 2003 Subject: [spambayes-dev] Re: Question about binary installer! References: <20031101101308.EFFQ24944.fed1mtao01.cox.net@smtp.west.cox.net> Message-ID: <65i4uxd8.fsf@python.net> writes: > I am curious on how to make the spambayes outlook addin binary > installer. Is py2exe used or McMillan installer? 'cos I tried with > both and got errors. 
Like I tried using py2exe with setup_all.py file > in the windows/py2exe dir and got a few errors regarding the options > of py2exe like 'exclude-dll' etc. I have py2exe 0.4.2 and python 2.3 If you try py2exe, you should use the 0.5.0a prerelease in the files section. You need win32all build 161, however. Thomas From skip at pobox.com Mon Nov 3 12:54:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 3 12:54:10 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? Message-ID: <16294.38457.868547.794422@montanaro.dyndns.org> We know some problems arise if grossly different numbers of ham or spam exist in the training databases. I wonder if there might be problems within datasets if different numbers of particular hams or spams have been used in the training. That's probably not worded well. Let me demonstrate with a concrete example. Suppose I've trained on exactly 1000 ham and 1000 spam, just to eliminate that source of problems. Within the 1000 hams, suppose I've trained on 800 python messages, 100 messages about cars and 100 messages about pop psychology. We know that if I get a message about a subject which I've never trained on before (say, woodworking) that there are likely to be topic-specific clues I've never seen which won't contribute to scoring the message as ham ("router", "lathe", "sawdust", ...). Questions: * How many woodworking messages will I need to train as ham to get the system to properly recognize those messages as ham? Would that large glut of python-related messages hamper the ability of the classifier to detect woodworking messages as ham? * Similarly, would the 8:1 ratio of python messages to messages about cars or pop psychology have an effect on scoring any of those messages accurately? 
Skip From tdickenson at geminidataloggers.com Mon Nov 3 13:15:21 2003 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Mon Nov 3 13:15:25 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <16294.38457.868547.794422@montanaro.dyndns.org> References: <16294.38457.868547.794422@montanaro.dyndns.org> Message-ID: <200311031815.21719.tdickenson@geminidataloggers.com> On Monday 03 November 2003 17:54, Skip Montanaro wrote: > We know some problems arise if grossly different numbers of ham or spam > exist in the training databases. I wonder if there might be problems > within datasets if different numbers of particular hams or spams have been > used in the training. Dont scare the new users with talk of problems..... I train using *everything* in my kmail folders. That is 1 part spam, 4 parts python mailing lists, 6 parts other lists, 1 part personal email, and 4 parts automated log message. No perceptable problems so far. -- Toby Dickenson From kennypitt at hotmail.com Mon Nov 3 13:17:16 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Nov 3 13:18:18 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <16294.38457.868547.794422@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Suppose I've trained on exactly 1000 ham and 1000 spam, > just to eliminate that source of problems. Within the 1000 hams, > suppose I've trained on 800 python messages, 100 messages about cars > and 100 messages about pop psychology. We know that if I get a > message about a subject which I've never trained on before (say, > woodworking) that there are likely to be topic-specific clues I've > never seen which won't contribute to scoring the message as ham > ("router", "lathe", "sawdust", ...). > > Questions: > > * How many woodworking messages will I need to train as ham to > get the system to properly recognize those messages as ham? 
> Would that large glut of python-related messages hamper the > ability of the classifier to detect woodworking messages as ham? I would think one would be sufficient, assuming of course that none of the words in your woodworking message already appear in your *spam* training. SpamBayes only considers tokens that are *in* the message being classified, not tokens that are *not in* the message. So, regardless of how many times a token has appeared in the python messages, it will not even be considered in the scoring if it does not appear in the woodworking message. On the other hand, if that token *does* appear in the woodworking message then it will be solidly scored as ham and therefore increase the probability of the message being correctly classified. > * Similarly, would the 8:1 ratio of python messages to messages > about cars or pop psychology have an effect on scoring any of > those messages accurately? I wouldn't think so. Since all of these messages are considered ham, the tokens from the python messages would at best reinforce the *correct* classification of the other messages, and at worst would contribute nothing one way or the other to the scoring. Just my thoughts, totally unproven scientifically. -- Kenny Pitt From skip at pobox.com Mon Nov 3 13:42:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 3 13:42:47 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <200311031815.21719.tdickenson@geminidataloggers.com> References: <16294.38457.868547.794422@montanaro.dyndns.org> <200311031815.21719.tdickenson@geminidataloggers.com> Message-ID: <16294.41366.356102.723324@montanaro.dyndns.org> >> I wonder if there might be problems within datasets if different >> numbers of particular hams or spams have been used in the training. Toby> Dont scare the new users with talk of problems..... Any new user who subscribes to spambayes-dev deserves to get scared every once in awhile. 
Skip From skip at pobox.com Mon Nov 3 14:09:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 3 14:09:20 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <20031103181814.F241FF58F3@orb.pobox.com> References: <16294.38457.868547.794422@montanaro.dyndns.org> <20031103181814.F241FF58F3@orb.pobox.com> Message-ID: <16294.42960.302363.849243@montanaro.dyndns.org> >> * How many woodworking messages will I need to train as ham to get >> the system to properly recognize those messages as ham? Would that >> large glut of python-related messages hamper the ability of the >> classifier to detect woodworking messages as ham? Kenny> I would think one would be sufficient, assuming of course that Kenny> none of the words in your woodworking message already appear in Kenny> your *spam* training. SpamBayes only considers tokens that are Kenny> *in* the message being classified, not tokens that are *not in* Kenny> the message. So, regardless of how many times a token has Kenny> appeared in the python messages, it will not even be considered Kenny> in the scoring if it does not appear in the woodworking message. Kenny> On the other hand, if that token *does* appear in the woodworking Kenny> message then it will be solidly scored as ham and therefore Kenny> increase the probability of the message being correctly Kenny> classified. Let me rephrase the question again. There's a discussion in Gary Robinson's LJ article http://www.linuxjournal.com/article.php?sid=6467 about dealing with rare words which I didn't really follow. If I've trained on 1000 other ham messages and now encounter a woodworking message, some of the words in there are likely to have not been seen before ("lathe", for example). Such words obviously can't contribute to scoring that message. Let's assume I then train that message as ham. "lathe" now has a hamcount of 1 and a spamcount of 0. It is a "rare word". 
How many more messages which contain "lathe" do I have to train on before it is no longer "rare". In particular, by training on 1000 other hams which don't contain that word, have I somehow created an artificial barrier to getting woodworking-specific words to have full effect as ham indicators? If there is a problem, it might be fairly easy to fall into a trap which is a bit difficult to get out of. Suppose I'm starting from scratch and I know I have several mailboxes: * python - 800 messages * cars - 100 messages * pop-psycology - 100 messages * spam - 1000 messages As a new user, it might be very easy for me to ask SB to score all messages in the first three mailboxes as ham and all in the fourth as spam, thus creating a problem (if one exists). *If* such a problem exists (and it very well may not), it might be better if I could tell the system to pick a random sample of each of my collections such that the relative number of hams and spams is about equal and so that the imbalance between mailboxes classified as ham or spam is not too great either. Skip From popiel at wolfskeep.com Mon Nov 3 15:14:45 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Nov 3 15:14:50 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: Message from Skip Montanaro of "Mon, 03 Nov 2003 13:09:04 CST." <16294.42960.302363.849243@montanaro.dyndns.org> References: <16294.38457.868547.794422@montanaro.dyndns.org> <20031103181814.F241FF58F3@orb.pobox.com> <16294.42960.302363.849243@montanaro.dyndns.org> Message-ID: <20031103201445.D5C852DF59@cashew.wolfskeep.com> In message: <16294.42960.302363.849243@montanaro.dyndns.org> Skip Montanaro writes: > >Let me rephrase the question again. There's a discussion in Gary Robinson's >LJ article > > http://www.linuxjournal.com/article.php?sid=6467 > >about dealing with rare words which I didn't really follow. It's talking about the math behind unknown_word_strength and unknown_word_prob. 
>If I've trained >on 1000 other ham messages and now encounter a woodworking message, some of >the words in there are likely to have not been seen before ("lathe", for >example). Such words obviously can't contribute to scoring that message. >Let's assume I then train that message as ham. "lathe" now has a hamcount >of 1 and a spamcount of 0. It is a "rare word". How many more messages >which contain "lathe" do I have to train on before it is no longer "rare". A word is not "rare" or "not rare" according to the classifier... it's not just a binary switch. All words have their probabilities adjusted towards unknown_word_prob by an amount determined by unknown_word_strength and the number of trained messages in which the word has appeared. The more often the word has been seen (and trained), the smaller the adjustment. The only way this could be a binary switch would be if the unknown word adjustments were strong enough to pull the probability for a word inside the .4-.6 range (assuming default settings) that the classifier outright ignores... but the default settings for unknown_word_* aren't that strong. I seem to recall that the hapax values (from only a single instance trained) are around .31 and .69 for ham and spam respectively. >In particular, by training on 1000 other hams which don't contain that word, >have I somehow created an artificial barrier to getting woodworking-specific >words to have full effect as ham indicators? No. Training on other mail which does not contain the word does not affect the score for a word at all (unless you have the experimental ham/spam imbalance adjustment enabled and it's actually doing something... and you specifically engineered for question to make the imbalance adjustment moot). >If there is a problem, it might be fairly easy to fall into a trap which is >a bit difficult to get out of. Lucky for us, there is no problem here. 
;-) - Alex From kennypitt at hotmail.com Mon Nov 3 15:23:14 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Nov 3 15:23:35 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <16294.42960.302363.849243@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Let me rephrase the question again. There's a discussion in Gary > Robinson's LJ article > > http://www.linuxjournal.com/article.php?sid=6467 > > about dealing with rare words which I didn't really follow. If I've > trained on 1000 other ham messages and now encounter a woodworking > message, some of the words in there are likely to have not been seen > before ("lathe", for example). Such words obviously can't contribute > to scoring that message. Let's assume I then train that message as > ham. "lathe" now has a hamcount of 1 and a spamcount of 0. It is a > "rare word". How many more messages which contain "lathe" do I have > to train on before it is no longer "rare". In particular, by training > on 1000 other hams which don't contain that word, have I somehow > created an artificial barrier to getting woodworking-specific words > to have full effect as ham indicators? OK, I see where you're coming from. I answered a related (albeit much simpler ) question for someone on the Spambayes list not long ago. The "rare word" adjustment is a way of adjusting the contributed probability for words that haven't been seen very often. In your example of "lathe" with ham=1 and spam=0, the straight probability of spam [spam / (spam + ham)] would be 0.0, but one occurrence doesn't make it the most reliable indicator. SpamBayes adjusts this using the "unknown_word_strength" (s in the Robinson article) and "unknown_word_prob" (x in the article) options. You can see the adjustment calculation in the probability() function in classifier.py. The default for these options in Options.py are s=0.45 and x=0.5. 
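A minimal sketch of the adjustment described above, in modern Python rather than the Python 2.3 of the thread. The helper name and signature are illustrative assumptions and are not copied from classifier.py; the formula is the (s*x + n*p)/(s + n) form from the Robinson article, with s and x set to the defaults quoted above. It assumes both training totals are nonzero.

```python
# Illustrative sketch only: adjusted_spamprob is a hypothetical helper,
# not the probability() function in SpamBayes' classifier.py.

def adjusted_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Pull a token's by-counting spamprob toward x, weighted by s.

    hamcount, spamcount: trained messages of each kind containing the token.
    nham, nspam: total trained ham and spam messages (assumed nonzero).
    s: unknown_word_strength; x: unknown_word_prob.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never-seen token: only the prior is available
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    prob = spamratio / (hamratio + spamratio)  # the "straight" probability
    return (s * x + n * prob) / (s + n)


if __name__ == "__main__":
    # A token seen only in ham, with growing counts, in a balanced
    # 1000 ham / 1000 spam training set.
    for ham in (1, 5, 10, 50):
        print(ham, round(adjusted_spamprob(ham, 0, 1000, 1000), 6))
```

Only the token's own counts and the two training totals enter the calculation; counts for other tokens never appear.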
Using these defaults with the case of 1 ham and no spam, the actual probability contributed to the chi2 combining is 0.155172. As the total number of occurrences of the token increases, the contributed probability gets closer and closer to the straight probability. So, for ham=5 and spam=0, contributed probablity is 0.041284; for ham=10 and spam=0, contributed probability is 0.021531; and for ham=50 and spam=0, contributed probability is 0.004460. As you can see, the probability moves back toward the straight probability fairly quickly. The important thing to note with respect to your original concerns, though, is that this "rare" word calculation is entirely independent of any other tokens in the training data. The calculation involves the original straight probability, the fixed factors of s and x, and the total number of occurrences of that token in both ham and spam. There is no fixed cutoff that says a word is no longer rare, but neither does the definition of rare depend on the relative numbers compared to any other token in the training data. -- Kenny Pitt From skip at pobox.com Mon Nov 3 15:47:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 3 15:47:24 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <20031103201445.D5C852DF59@cashew.wolfskeep.com> References: <16294.38457.868547.794422@montanaro.dyndns.org> <20031103181814.F241FF58F3@orb.pobox.com> <16294.42960.302363.849243@montanaro.dyndns.org> <20031103201445.D5C852DF59@cashew.wolfskeep.com> Message-ID: <16294.48840.459042.449590@montanaro.dyndns.org> >> Let me rephrase the question again. There's a discussion in Gary >> Robinson's LJ article >> >> http://www.linuxjournal.com/article.php?sid=6467 >> >> about dealing with rare words which I didn't really follow. alex> It's talking about the math behind unknown_word_strength and alex> unknown_word_prob. 
>> If I've trained on 1000 other ham messages and now encounter a >> woodworking message, some of the words in there are likely to have >> not been seen before ("lathe", for example). Such words obviously >> can't contribute to scoring that message. Let's assume I then train >> that message as ham. "lathe" now has a hamcount of 1 and a spamcount >> of 0. It is a "rare word". How many more messages which contain >> "lathe" do I have to train on before it is no longer "rare". alex> A word is not "rare" or "not rare" according to the alex> classifier... I understand that it's not a binary thing. I used that term because Gary used it in his article. I seem to be having trouble making my ideas understood today... Was my exposition that vague? >> If there is a problem, it might be fairly easy to fall into a trap >> which is a bit difficult to get out of. alex> Lucky for us, there is no problem here. ;-) That's all I was asking. Skip From skip at pobox.com Mon Nov 3 16:08:03 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 3 16:09:20 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: References: <16294.42960.302363.849243@montanaro.dyndns.org> Message-ID: <16294.50099.106403.122008@montanaro.dyndns.org> Kenny> The important thing to note with respect to your original Kenny> concerns, though, is that this "rare" word calculation is Kenny> entirely independent of any other tokens in the training data. Kenny> The calculation involves the original straight probability, the Kenny> fixed factors of s and x, and the total number of occurrences of Kenny> that token in both ham and spam. There is no fixed cutoff that Kenny> says a word is no longer rare, but neither does the definition of Kenny> rare depend on the relative numbers compared to any other token Kenny> in the training data. Thanks, this is what I was getting at. 
Skip

From tim.one at comcast.net Mon Nov 3 16:49:46 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 3 16:49:55 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <20031103201445.D5C852DF59@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> No. Training on other mail which does not contain the word does not
> affect the score for a word at all ...

It's a bit curious that this is true only so long as the word has appeared
in only one kind of training data (only in spam, or only in ham). As soon
as a word appears in at least one of each, training on msgs that don't
contain the word can change the word's score.

Example: suppose we've trained on 100 ham and a 100 spam, and "lathe"
appeared in exactly one ham. Its by-counting spamprob is then

>>> h = 1./100
>>> s = 0./100
>>> s/(h+s)
0.0
>>>

So long as we never see "lathe" in spam, s's numerator is 0 no matter how
many additional ham and spam we train on, so s is 0, so the by-counting
spamprob remains 0/(h+0) = 0.

Change the example so we've seen "lathe" in one ham and one spam:

>>> h = 1./100
>>> s = 1./100
>>> s/(h+s)
0.5
>>>

The by-counting spamprob is then 0.5, which makes fine intuitive sense.

Now suppose we train on 100 more ham, and don't see "lathe" again:

>>> h = 1./200
>>> s = 1./100
>>> s/(h+s)
0.66666666666666674
>>>

Now "lathe" seems spammy! It should, since we've seen it in a greater
percentage of spam than ham. I'm not sure we've got the best guess to 17
significant digits, though. Make the imbalance wilder and the by-counting
spamprob gets wilder too:

>>> h = 1./20000
>>> s = 1./100
>>> s/(h+s)
0.99502487562189057
>>>

That offends my intuition -- the word is so rare (2 of 20100 msgs) that it's
hard to believe that 99.5% is a sane guess. The Bayesian adjustment knocks
it down a lot based on how few times it's been seen in total:

>>> (.45*.5 + 2.0*_)/(.45 + 2.0)
0.90410193928317584
>>>

But that still seems like a high guess to me.
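The interactive sessions above can be folded into one small function, sketched here in modern Python. The name by_counting_spamprob is an assumption made for illustration and is not a SpamBayes API; it computes only the unadjusted ratio s/(h+s) and assumes the token has been seen at least once and that both training totals are nonzero.

```python
# Hypothetical helper for illustration; not taken from the SpamBayes source.

def by_counting_spamprob(hamcount, spamcount, nham, nspam):
    """Unadjusted spamprob from per-class ratios, s/(h+s)."""
    h = hamcount / float(nham)    # fraction of trained ham containing the token
    s = spamcount / float(nspam)  # fraction of trained spam containing the token
    return s / (h + s)


if __name__ == "__main__":
    # Token seen in one ham and one spam: extra ham that lacks the
    # token still moves the score, because h shrinks while s stays put.
    for nham in (100, 200, 20000):
        print(nham, by_counting_spamprob(1, 1, nham, 100))

    # Token seen only in ham: the numerator stays 0 however much is trained.
    print(by_counting_spamprob(1, 0, 100, 100))
    print(by_counting_spamprob(1, 0, 20000, 100))
```

Holding the token counts fixed and varying only nham isolates the effect: the score changes only when the token has been seen in both classes.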
The experimental ham/spam imbalance option knocked it down a lot more. Unfortunately, that also moved spamprobs a lot closer to 0.5 for words that appeared lots of times in the over-represented category, and that made it a Bad Idea overall. It's tempting to ignore words that haven't appeared in at least N messages total (for some N). Alas, Graham's original algorithm had a gimmick like that, and testing said it worked better not to have such a cutoff. And for the mistake-based training many of us have fallen into, scoring hapaxes is very important. So we can't ignore rare words -- but in the presence of strong imbalance, I think we're still missing a trick. From popiel at wolfskeep.com Mon Nov 3 17:07:57 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Nov 3 17:08:00 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: Message from "Tim Peters" of "Mon, 03 Nov 2003 16:49:46 EST." References: Message-ID: <20031103220757.323B72DF59@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[T. Alexander Popiel] >> No. Training on other mail which does not contain the word does not >> affect the score for a word at all ... > >It's a bit curious that this is true only so long as the word has appeared >in only one kind of training data (only in spam, or only in ham). As soon >as a word appears in at least one of each, training on msgs that don't >contain the word can change the word's score. Yarg. I stand corrected. Perhaps it's time to test a variation where the prob is based on hamcount and spamcount instead of hamratio and spamratio. Hrm. *tap, tap, tap* I'll be back in a few hours... - Alex From tim.one at comcast.net Mon Nov 3 18:43:32 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 3 18:43:40 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? In-Reply-To: <20031103220757.323B72DF59@cashew.wolfskeep.com> Message-ID: >> [T. Alexander Popiel] >>> No. 
Training on other mail which does not contain the word does not >>> affect the score for a word at all ... [Tim] >> It's a bit curious that this is true only so long as the word has >> appeared in only one kind of training data (only in spam, or only in >> ham). As soon as a word appears in at least one of each, training >> on msgs that don't contain the word can change the word's score. >> ... [Alex] > Yarg. I stand corrected. > > Perhaps it's time to test a variation where the prob is based on > hamcount and spamcount instead of hamratio and spamratio. Hrm. > *tap, tap, tap* I'll be back in a few hours... Well, they're all the same if the # of training ham == the # of training spam. Computing spambprobs based on ratios is a first attempt at surviving in the face of unbalanced training data. For example, if a token appeared in 99 of 100 spam, and 100 of 10,000 ham, a spamprob of 0.5 (100/(100+100)) doesn't make intuitive sense. In effect, computing based on ratios (s/(s+h) where s = 99/100 and h=100/10000) answers what would happen *if* we had trained on equal numbers of each, while keeping the percentages of ham and spam containing the token fixed. In the example, if 99 of 100 spam contained a given token, then our best guess is that, if we had seen 10,000 spam instead, we would have seen the token in 9,900 of those. Then 9900/(9900+100) gives the same result as the current s/(s+h). IOW, s/(s+h) gives the result that "prob is based on hamcount and spamcount" gives if we extrapolate our actual training data to what it would be if it were balanced. If it's already balanced, the computed spamprob is the same whether computed by raw count or by ratio. So if you try raw count, the only interesting tests would be on unbalanced training data. From popiel at wolfskeep.com Mon Nov 3 19:34:58 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Nov 3 19:35:01 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: Message from "Tim Peters" of "Mon, 03 Nov 2003 18:43:32 EST."
References:
Message-ID: <20031104003458.142372DF59@cashew.wolfskeep.com>

In message: "Tim Peters" writes:
>>
>> Perhaps it's time to test a variation where the prob is based on
>> hamcount and spamcount instead of hamratio and spamratio. Hrm.
>> *tap, tap, tap* I'll be back in a few hours...
>
>Well, they're all the same if the # of training ham == the # of training
>spam. Computing spambprobs based on ratios is a first attempt at surviving
>in the face of unbalanced training data.

Hrm, yes. I'm obviously not thinking all that well today.

This leads me to thoughts where the elements of the probability are
scaled nonlinearly by the ham/spam imbalance before combining them
into the prob, instead of scaling the perceived number of messages
(and thus effectively scaling unknown_word_strength) afterward...

Time to cogitate on which continuous asymptotic functions might be
effective at this.

>IOW, s/(s+h) gives the result that "prob is based on hamcount and spamcount"
>gives if we extrapolate our actual training data to what it would be if it
>were balanced. If it's already balanced, the computed spamprob is the same
>whether computed by raw count or by ratio. So if you try raw count, the
>only interesting tests would be on unbalanced training data.

I'm currently testing against my RL data, which is between 60% and
70% spam overall (rising to about 90% spam in recent weeks).

- Alex

From anthony at interlink.com.au Tue Nov 4 06:27:26 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Nov 4 06:31:15 2003
Subject: [spambayes-dev] 1.0a7
Message-ID: <200311041127.hA4BRQ73005475@localhost.localdomain>

Well, it's the end of a very long weekend (I love living in a place that
gives us a public holiday for a horse race :-) and the 1.0a7 release is
available for now from

    http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.tar.gz

Can people please check it out and make sure it's sane?
I'm trying to follow the process for getting a Windows checkout of the code to get Windows-style line-endings on the file (to paraphrase Dilbert: "Here's a nickel kid. Go buy a real line-ending") but WinCVS is being a snarky little sod. A zipfile will end up at the same place if/when I get it to play nice. The release is tagged, so if someone who's suffered windows long enough to get a working CVS checkout wants to make the zip, this would be fine. Alternately, the windows users can deal with correct line endings . I'll push the release itself out first thing tomorrow if it looks good. I've made a bunch of changes to the README.txt to remove old application names from it - README-DEVEL.txt is, however, woefully out of date. I've put a note in it mentioning this. Anthony From anthony at interlink.com.au Tue Nov 4 06:42:15 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Tue Nov 4 06:46:00 2003 Subject: [spambayes-dev] 1.0a7 In-Reply-To: <200311041127.hA4BRQ73005475@localhost.localdomain> Message-ID: <200311041142.hA4BgGIK005815@localhost.localdomain> >>> Anthony Baxter wrote > http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.tar.gz There's also http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.zip now. I gave up on trying to make sense of the insanity that is Windows, and just found the magic zip option to mangle line endings. A bit of 'find' magic, and ta-da, only the .txt files are mangled, the JPGs &c are still ok. I think Anthony -- Anthony Baxter It's never too late to have a happy childhood. 
From theller at python.net Tue Nov 4 07:07:46 2003 From: theller at python.net (Thomas Heller) Date: Tue Nov 4 07:07:53 2003 Subject: [spambayes-dev] Re: 1.0a7 References: <200311041127.hA4BRQ73005475@localhost.localdomain> <200311041142.hA4BgGIK005815@localhost.localdomain> Message-ID: Anthony Baxter writes: >>>> Anthony Baxter wrote >> http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.tar.gz > > There's also > http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.zip > now. > > I gave up on trying to make sense of the insanity that is Windows, > and just found the magic zip option to mangle line endings. A bit > of 'find' magic, and ta-da, only the .txt files are mangled, the > JPGs &c are still ok. I think I use command line cvs on Windows, and never have problems with line endings. Unless some crazy guy commits Windows line ending files with cygwin tools. Thomas From anthony at interlink.com.au Tue Nov 4 08:00:36 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Tue Nov 4 08:04:43 2003 Subject: [spambayes-dev] Re: 1.0a7 In-Reply-To: Message-ID: <200311041300.hA4D0a5p007315@localhost.localdomain> >>> Thomas Heller wrote > I use command line cvs on Windows, and never have problems with line > endings. > Unless some crazy guy commits Windows line ending files with cygwin > tools. The idea here is to get a checked out spambayes with the txt files with windows line endings. After going down the path of WinCVS, TortoiseCVS, Putty, 14 different implementations of ssh for windows each with different functionality, I gave up and just found the magic zip flag on Unix. It would be nice to know a way to get cvs working on windows with ssh auth - it looks (to me) like the only way is to install most of cygwin, which seems to defeat the purpose Anthony -- Anthony Baxter It's never too late to have a happy childhood. 
From theller at python.net Tue Nov 4 08:49:48 2003 From: theller at python.net (Thomas Heller) Date: Tue Nov 4 08:49:58 2003 Subject: [spambayes-dev] Re: 1.0a7 References: <200311041300.hA4D0a5p007315@localhost.localdomain> Message-ID: <4qxk9rb7.fsf@python.net> Anthony Baxter writes: >>>> Thomas Heller wrote >> I use command line cvs on Windows, and never have problems with line >> endings. >> Unless some crazy guy commits Windows line ending files with cygwin >> tools. > > The idea here is to get a checked out spambayes with the txt files > with windows line endings. After going down the path of WinCVS, > TortoiseCVS, Putty, 14 different implementations of ssh for windows > each with different functionality, I gave up and just found the magic > zip flag on Unix. It would be nice to know a way to get cvs working on > windows with ssh auth - it looks (to me) like the only way is to install > most of cygwin, which seems to defeat the purpose I followed the instructions here a looong time ago, and it still works. This is also linked from here: An alternative would be to use anon cvs, which doesn't require ssh auth. Thomas From anthony at interlink.com.au Tue Nov 4 09:19:40 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Tue Nov 4 09:23:29 2003 Subject: [spambayes-dev] Re: 1.0a7 In-Reply-To: <4qxk9rb7.fsf@python.net> Message-ID: <200311041419.hA4EJemW016562@localhost.localdomain> >>> Thomas Heller wrote > I followed the instructions here > a looong time ago, and it still works. Aha. I was foolishly reading the docs on the wincvs &c websites. Thanks! > An alternative would be to use anon cvs, which doesn't require ssh auth. But for SF's 24-hour delay in anon cvs, that keeps it more-or-less useless. -- Anthony Baxter It's never too late to have a happy childhood. 
From kennypitt at hotmail.com Tue Nov 4 09:35:26 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Nov 4 09:35:54 2003
Subject: [spambayes-dev] Re: 1.0a7
In-Reply-To: <200311041419.hA4EJemW016562@localhost.localdomain>
Message-ID:

Anthony Baxter wrote:
> But for SF's 24-hour delay in anon cvs, that keeps it more-or-less
> useless.

Seems to be much less than 24-hour now, although still not perfect. As of 9:30am EST time, I pulled all your release_1_0_a7 tagged changes up through MANIFEST.in 1.7.2.2. The only update I haven't seen yet is the 1.4.2.3 update to README-DEVEL.txt.

--
Kenny Pitt

From kennypitt at hotmail.com Tue Nov 4 11:15:18 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Nov 4 11:15:48 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To:
Message-ID:

Tim Peters wrote:
> I'm not sure we've got the best guess
> to 17 significant digits, though. Make the imbalance wilder
> and the by-counting spamprob gets wilder too:
>
> >>> h = 1./20000
> >>> s = 1./100
> >>> s/(h+s)
> 0.99502487562189057
> >>>
>
> That offends my intuition -- the word is so rare (2 of 20100 msgs)
> that it's hard to believe that 99.5% is a sane guess. The Bayesian
> adjustment knocks it down a lot based on how few times it's been seen
> in total:
>
> >>> (.45*.5 + 2.0*_)/(.45 + 2.0)
> 0.90410193928317584
> >>>

Wow, that's interesting. I had always considered words that were either ham or spam, but never a little of both. In a way it makes sense because 1/20000 ham is so close to zero that the word should be considered spammy.

This seems even more scary, though. Compare your last example to the case where the token has only been seen in 1 spam and no ham:

>>> h = 0./20000
>>> s = 1./100
>>> s/(h+s)
1.0
>>> (.45*.5 + 1.*_)/(.45 + 1.)
0.84482758620689669
>>>

The spam prob here is less than the case of 1 ham and 1 spam because of the "rare word" adjustment.
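The arithmetic in the interactive sessions of this thread can be collected into one small function. This is only a sketch for following the numbers: the function name and its parameter names are made up for illustration and are not the SpamBayes API; the constants 0.45 and 0.5 are simply the ones that appear in the sessions.

```python
def adjusted_spamprob(ham_count, spam_count, n_ham, n_spam,
                      strength=0.45, prior=0.5):
    """Ratio-based spam probability with the "rare word" adjustment.

    ham_count / spam_count: messages of each kind containing the token.
    n_ham / n_spam: total messages of each kind in the training data.
    strength, prior: the .45 and .5 used in the sessions (names assumed).
    """
    n = ham_count + spam_count
    if n == 0:
        # Token never seen: nothing to go on but the prior.
        return prior
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    by_counting = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the by-counting guess toward the prior; the pull weakens as
    # the token is seen more often in total.
    return (strength * prior + n * by_counting) / (strength + n)


# 20000 ham and 100 spam trained, token seen in 1 spam throughout.
for hams in (0, 1, 10):
    print(hams, round(adjusted_spamprob(hams, 1, 20000, 100), 4))
# 0 0.8448
# 1 0.9041
# 10 0.9346
```

The printed values match the sessions: with this imbalance, each of the first few ham sightings raises the adjusted spam probability.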
So, if the token has only been seen once in spam and is later seen once in ham, it gets spammier? Yikes! If we go to h=10:

>>> h = 10./20000
>>> s = 1./100
>>> s/(h+s)
0.95238095238095233
>>> (.45*.5 + 11.*_)/(.45 + 11.)
0.93460178831357876
>>>

And the spam prob is still going up! So whenever we have an extreme imbalance like this, the first n occurrences of a token added to the larger corpus, where n depends on the size of the imbalance, actually causes the probability of the *opposite* classification to *increase*.

--
Kenny Pitt

From richie at entrian.com Tue Nov 4 14:08:43 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov 4 14:08:57 2003
Subject: [spambayes-dev] Re: 1.0a7
In-Reply-To: <200311041300.hA4D0a5p007315@localhost.localdomain>
References: <200311041300.hA4D0a5p007315@localhost.localdomain>
Message-ID:

[Anthony]
> The idea here is to get a checked out spambayes with the txt files
> with windows line endings. After going down the path of WinCVS,
> TortoiseCVS, Putty, 14 different implementations of ssh for windows
> each with different functionality, I gave up and just found the magic
> zip flag on Unix. It would be nice to know a way to get cvs working on
> windows with ssh auth - it looks (to me) like the only way is to install
> most of cygwin, which seems to defeat the purpose

SourceForge have excellent instructions on how to set up WinCVS and PuTTY here:

http://sourceforge.net/docman/display_doc.php?docid=766&group_id=1

(Thanks for doing the release, BTW - I'll give it a bash later on tonight.)

--
Richie Hindle
richie@entrian.com

From mhammond at skippinet.com.au Tue Nov 4 17:32:33 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Nov 4 17:32:13 2003
Subject: [spambayes-dev] More CVS branch/tags questions
Message-ID: <02b601c3a323$8b250d70$0500a8c0@eden>

I saw Skip raise this last week, but I think he was asking different questions.

My understanding is that we are moving towards 1.0 on the release_1_0 branch.
Is that correct?

If so, I'm a little confused by this :) If we look at an edited log from sb_server, we see (Please see my comments/questions inline with "****", and at the end:

RCS file: /cvsroot/spambayes/spambayes/scripts/sb_server.py,v
Working file: sb_server.py
head: 1.11
...
symbolic names:
        release_1_0_a7: 1.6.2.1
        outlook-1-0-fork: 1.11
        release_1_0: 1.6.0.2
        release_1_0_a6: 1.6

**** My reading of this is that this file was branched for 1.0 at 1.6. Correct?

revision 1.11
date: 2003/10/07 00:36:30; author: anadelonbrin; state: Exp; lines: +2 -3
Fix [ spambayes-Bugs-818871 ] sb_server.py calls undefined variable
----------------------------
revision 1.10
date: 2003/09/29 04:43:09; author: anadelonbrin; state: Exp; lines: +27 -0
...
----------------------------
revision 1.9
date: 2003/09/25 00:10:31; author: mhammond; state: Exp; lines: +99 -15
Patch [ 809008 ] safe start/stop and exlusive execution on windows
...
----------------------------
revision 1.8
date: 2003/09/24 05:28:53; author: anadelonbrin; state: Exp; lines: +3 -1
This should fix [ spambayes-Bugs-809769 ] TypeError when training 1.0a6
----------------------------
revision 1.7
date: 2003/09/19 23:38:10; author: anadelonbrin; state: Exp; lines: +5 -6
...
Add the various interface improvements discussed on spambayes-dev. In particular, an advanced 'find token' query is available, the 'find message' query is improved, and the review messages page is more customisable.
----------------------------
...
----------------------------
revision 1.6.2.1
date: 2003/09/24 03:54:14; author: anadelonbrin; state: Exp; lines: +4 -1
Stupid global variables! Thanks to a global variable not being updated, when we recreated everything, the userinterface kept using the old classifier. Since we now behave and close that one, this caused all sorts of problems. Get rid of the damn global variable, and correctly update it, and all is well in the world again.
In addition, don't save an empty database.
I think we make assumptions about the db being non-empty in some places.
This should fix [ spambayes-Bugs-809769 ] TypeError when training 1.0a6
(I can't believe it took so long for me to find this!)
=============================================================================

*** From my reading of this, the "1.0" release is missing a number of significant patches - all 1.7->1.11 checkings appear to *not* be on the 1.0 release.

And very interestingly, note that 1.8 and 1.6.2.1 *both* claim to fix [809769], and on the same day.

I doubt this is the intention - I can't recall anyone deciding to fix real, verified bugs *after* the 1.0 release. Can anyone shed any light?

Thanks,

Mark.

From popiel at wolfskeep.com Tue Nov 4 17:41:47 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Nov 4 17:41:51 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: Message from "T. Alexander Popiel" of "Mon, 03 Nov 2003 14:07:57 PST." <20031103220757.323B72DF59@cashew.wolfskeep.com>
References: <20031103220757.323B72DF59@cashew.wolfskeep.com>
Message-ID: <20031104224147.8893E2DE36@cashew.wolfskeep.com>

In message: <20031103220757.323B72DF59@cashew.wolfskeep.com>
"T. Alexander Popiel" writes:
>
>Perhaps it's time to test a variation where the prob is based on
>hamcount and spamcount instead of hamratio and spamratio. Hrm.
>*tap, tap, tap* I'll be back in a few hours...

FWIW, basing the prob on the raw counts instead of the ratios is an incredibly clearcut loss. Only won twice on the false positives (by relatively small margins), but lost EVERY time on the false negatives by large amounts.
- Alex From richie at entrian.com Tue Nov 4 17:46:58 2003 From: richie at entrian.com (Richie Hindle) Date: Tue Nov 4 17:47:11 2003 Subject: [spambayes-dev] Re: 1.0a7 In-Reply-To: References: <200311041300.hA4D0a5p007315@localhost.localdomain> Message-ID: <3vagqvsc0kv1m74f1in0n8567396agou5f@4ax.com> [Me, earlier] > (Thanks for doing the release, BTW - I'll give it a bash later on > tonight.) A quick smoke-test of the web interface and the POP3 server on Windows failed to smoke, so all's fine with me. -- Richie Hindle richie@entrian.com From kennypitt at hotmail.com Tue Nov 4 17:54:48 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Nov 4 17:55:16 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <02b601c3a323$8b250d70$0500a8c0@eden> Message-ID: Mark Hammond wrote: > I saw Skip raise this last week, but I think he was asking different > questions. > > My understanding is that we are moving towards 1.0 on the release_1_0 > branch. Is that correct? > > If so, I'm a little confused by this :) If we look at an edited log > from sb_server, we see (Please see my comments/questions inline with > "****", and at the end: > > RCS file: /cvsroot/spambayes/spambayes/scripts/sb_server.py,v > Working file: sb_server.py > head: 1.11 > ... > symbolic names: > release_1_0_a7: 1.6.2.1 > outlook-1-0-fork: 1.11 > release_1_0: 1.6.0.2 > release_1_0_a6: 1.6 > > **** My reading of this is that this file was branched for 1.0 at 1.6. > Correct? > > *** From my reading of this, the "1.0" release is missing a number of > significant patches - all 1.7->1.11 checkings appear to *not* be on > the 1.0 release. > > And very interestingly, note that 1.8 and 1.6.2.1 *both* claim to fix > [809769], and on the same day. > > I doubt this is the intention - I can't recall anyone deciding to fix > real, verified bugs *after* the 1.0 release. Can anyone shed any > light? After looking at the full log, here is how I see it. 
Fixes 1.8 and 1.6.2.1 went in on the same day because that is the correct way to do things at this stage. Fixes that apply to both the 1.0 and 1.1 releases need to be made on both the branch and the trunk (because the current state of those versions could be different at the time the fix is made). Revs 1.7 and 1.10 are enhancements to the UI that came after the feature freeze for 1.0, so were not applied to the branch. I'm not certain, but I think 1.11 was a fix to a problem caused by the mods in rev 1.10. Rev 1.9 is a bit gray. The problem definitely applies to 1.0, so would probably make a reasonable fix to add to the 1.0 branch. On the other hand, it is in a sense adding a "new" feature to prevent multiple execution. -- Kenny Pitt From skip at pobox.com Tue Nov 4 17:59:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Nov 4 17:59:58 2003 Subject: [spambayes-dev] less is more? Message-ID: <16296.12135.892469.753587@montanaro.dyndns.org> I've been meaning to try restarting my training database from scratch for quite awhile. I finally broke down and did that this afternoon. I'm quite satisfied with the performance of the system with its current 24 spams and 11 hams. As a bonus, the database is a lot smaller (currently 340k vs 20MB for the old db) and things seem to run somewhat faster since it's a lot easier to keep the entire db file in memory. Overnight I'm sure I'll get a fair number of errors and I get a load of email the current db hasn't seen, but so far, so good. Skip From richie at entrian.com Tue Nov 4 18:31:24 2003 From: richie at entrian.com (Richie Hindle) Date: Tue Nov 4 18:31:39 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <02b601c3a323$8b250d70$0500a8c0@eden> References: <02b601c3a323$8b250d70$0500a8c0@eden> Message-ID: [Mark] > My understanding is that we are moving towards 1.0 on the release_1_0 > branch. Is that correct? I think so, yes. 
release_1_0 was supposed to be bugfix-only, and the head (moving towards version 1.1) was for enhancements. Here's Tony's original mail: > As discussed earlier, I've created a cvs branch - 'release_1_0' - to move > toward 1.0b1 and then 1.0. > > If I understand things rightly (going by Jeremy and Richie's comments) the > main branch is now for 1.1 work, so is un-feature frozen ;). If people > could check 1.0 bugfixes into the release_1_0 branch (and 1.1, as needed), > that would be great. Re-reading Tony's mail, I should have pointed out at the time that we shouldn't commit edits to both places, but should use "cvs up [-j moving-tag] -j release_1_0" to periodically merge the bugfix branch onto the head. Nuts. >From looking at the logs, it seems you're right, Mark - bugfixes have been hitting the head instead of release_1_0. Also, some fixes have been committed to both the head and release_1_0, which will probably make merging release_1_0 back onto the head a pain - you always get more conflicts when you do that. (I should have encouraged more discussion of branch strategy when all this came up - we make heavy use of CVS branches at work, and we know a bit about how best to manage them.) One thing that CVS is spectacularly bad at is giving you an overview of what's been happening, so it's hard to say where we should go from here. How much enhancement work has gone onto the head since release_1_0 was taken? If it's not very much then maybe we should just give it a solid testing then merge release_1_0 back onto the head as soon as 1.0a7 is out. We then either take 1.0a8 from the head (bletch) or start again with a new bugfix branch and a better-advertised branch management strategy... 
--
Richie two-commits-in-two-months-like-he-has-room-to-talk Hindle
richie@entrian.com

From richie at entrian.com Tue Nov 4 18:58:56 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov 4 18:59:09 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To:
References: <02b601c3a323$8b250d70$0500a8c0@eden>
Message-ID:

[Kenny]
> Fixes that apply to both the 1.0 and 1.1 releases
> need to be made on both the branch and the trunk (because the current
> state of those versions could be different at the time the fix is made).

No, definitely not! Fixes made on the bugfix branch should be batch-merged onto the head once in a while, using "cvs up -j". If you have a policy of manually fixing both places then inevitably you sometimes forget. If you do lots of manual merging and only then try to merge the branches you get hundreds of conflicts.

By doing a periodic merge, you keep the code roughly in step as far as bugfixes go - close enough that cvs can do meaningful merges - and you fix conflicts when you get them. The fact that the head and the bugfix branch differ is rarely a problem when it comes to merging - if you get conflicts, you fix them. The fix is usually obvious, and only needs to applied once.

Here's how the system works over time:

Head:  1.1  1.2  1.3  1.4 ....... 1.9 ......... 1.20 ........ 1.30
                  |                ^             ^             ^
bugfix branch     |                | merge 1     | merge2      | merge 3
(eg. release_1_0) |                |             |             |
                  --> 1.3.1.1   1.3.2.3 ..... 1.3.2.9 ..... 1.3.2.20

So you start with the head, then at some point you take your bugfix branch. 1.4 is a feature, 1.3.1.1 is a bugfix (the specific numbers don't matter).

At some point you decide to merge the bugfix branch onto the head. Why?
 o People working on the head are frustrated by the bugs
 o It's been a long time and the branches are getting out of step
 o Someone wants to start a major piece of work and wants to branch off
   the head to do it, and they want the bugfixes in place on their branch

So you get CVS to apply all the edits that have been made on the bugfix branch to the head: you take a head checkout and do "cvs up -j bugfix". You get some conflicts where bugfixes have been made to code that's changed on the head, and you fix them (this is less of a problem than you might think, once the code has stopped migrating wholesale from place to place as it can do in the early stages of a project).

The next time you're going to want to do this merge, you'll want to take all the edits made on the bugfix branch between this merge and the time you do the next merge, and apply them to the head. So after this first merge, you mark the point on the bugfix branch at which you did your merge by tagging it: "cvs tag bugfix_to_head" in a bugfix checkout. That will apply the bugfix_to_head tag to 1.3.2.3 on the bugfix branch (again, you don't care about the numbers in the real world because you're operating on entire branches).

Then at some later date comes merge 2: take the edits between bugfix_to_head and the current state of the bugfix branch and apply them to the head. In a head checkout, "cvs up -j bugfix_to_head -j bugfix". Fix your conflicts, and in a bugfix checkout, move your marker tag to the new position: "cvs tag -F bugfix_to_head". That moves bugfix_to_head to 1.3.2.9, ready for merge 3 at some later date.

All this takes longer to explain than to just do. 8-) In the long run it guarantees that bugfixes don't get lost, and that people can consistently use each branch for its intended purpose. Bugfix releases are always made from the bugfix branch, which eventually comes to an end when a new feature release goes out - and a new bugfix branch is taken for that release.
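The moving-tag bookkeeping described above can be pictured with a toy model. This is not CVS and does not call it: the class, its attribute names and the edit labels are all invented for illustration, and conflicts are ignored entirely. It only shows why each bugfix reaches the head exactly once when the marker is moved after every merge.

```python
class BugfixBranch:
    """Toy stand-in for a bugfix branch with a movable merge marker."""

    def __init__(self):
        self.edits = []        # edits committed on the branch, in order
        self.merged_upto = 0   # plays the role of the bugfix_to_head tag

    def commit(self, edit):
        self.edits.append(edit)

    def merge_onto(self, head):
        """Apply edits made since the last merge, then move the marker."""
        # Analogous to "cvs up -j bugfix_to_head -j bugfix" in a head
        # checkout (for the very first merge, just "cvs up -j bugfix").
        pending = self.edits[self.merged_upto:]
        head.extend(pending)
        # Analogous to "cvs tag -F bugfix_to_head" in a bugfix checkout.
        self.merged_upto = len(self.edits)
        return pending


head = ["feature-1.4"]
bugfix = BugfixBranch()

bugfix.commit("fix-A")
bugfix.commit("fix-B")
merge_1 = bugfix.merge_onto(head)   # picks up fix-A and fix-B

head.append("feature-1.9")
bugfix.commit("fix-C")
merge_2 = bugfix.merge_onto(head)   # picks up only fix-C

print(head)
# ['feature-1.4', 'fix-A', 'fix-B', 'feature-1.9', 'fix-C']
```

If the marker were never moved, the second merge would try to re-apply fix-A and fix-B, which is where the extra conflicts mentioned in the thread would come from.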
-- Richie Hindle richie@entrian.com From anthony at interlink.com.au Wed Nov 5 08:12:26 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Wed Nov 5 08:17:36 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: Message-ID: <200311051313.hA5DCQaE017601@localhost.localdomain> >>> Richie Hindle wrote > Re-reading Tony's mail, I should have pointed out at the time that we > shouldn't commit edits to both places, but should use "cvs up [-j > moving-tag] -j release_1_0" to periodically merge the bugfix branch onto > the head. Nuts. Note that you can also just use cvs diff to apply the fix by hand to both the trunk and the branch. This is likely to cause less pain, as you're then relying on less cvs magic. >From my recent python-dev postings about the python maintenance branch, I'd suggest the following: - checkin to the trunk. If the fix is a bugfix, and suitable for the branch, include "bugfix candidate" in the checkin message. - (preferably) check your bugfix into the branch as well. I suggest having two checkouts, one on the branch, one on the trunk. - (otherwise) someone else notices that the "bugfix" needs to be applied to the branch as well, and does so. I need to apply the various doc changes I made in the last couple of days to the trunk, I will try to do so soon. -- Anthony Baxter It's never too late to have a happy childhood. From skip at pobox.com Wed Nov 5 09:08:08 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 5 09:08:35 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: References: <02b601c3a323$8b250d70$0500a8c0@eden> Message-ID: <16297.1096.280851.95630@montanaro.dyndns.org> Richie> From looking at the logs, it seems you're right, Mark - bugfixes Richie> have been hitting the head instead of release_1_0. 
Also, some Richie> fixes have been committed to both the head and release_1_0, Richie> which will probably make merging release_1_0 back onto the head Richie> a pain - you always get more conflicts when you do that. My take on this is that at this point in time we should not be working on any branches. Everything should happen on the trunk until a release is about the be cut, at which point a branch is made, then frozen except for crucial bug fixes. Once the release is complete, the branch dies (or at best any changes it contains which are not on the trunk are merged back into the trunk). After we've actually had a 1.0 release, we create a branch called something like release10_maint, to which bug fixes are backported from the trunk. At some point in time, that branch also dies (probably fairly quickly, after a 1.1 or 1.2 release). This is more-or-less how the Python development works. The advantage from my standpoint is that most developers can be content to only check in changes to the trunk and occasionally backport their changes (if they are bug fixes) to an obvious branch. The only people who have to worry much about branches are release managers. Branches leading up to a release are very short-lived. Skip From anthony at interlink.com.au Wed Nov 5 09:28:59 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Wed Nov 5 09:32:23 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <16297.1096.280851.95630@montanaro.dyndns.org> Message-ID: <200311051429.hA5ESxa3019296@localhost.localdomain> >>> Skip Montanaro wrote > My take on this is that at this point in time we should not be working on > any branches. Everything should happen on the trunk until a release is > about the be cut, at which point a branch is made, then frozen except for > crucial bug fixes. This is how the Python release process works. 
At the moment we seem to be following something more like the Mozilla process, where we cut a branch for the upcoming release once we're past the point of adding new features. Having said that, I'd say the time to branch is at the point where we're about to cut the first beta. So we've possibly done it too soon here. OTOH, I don't know what is stopping us from cutting 1.0b1 in a couple of weeks, with a possible RC a couple of weeks after that. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From kennypitt at hotmail.com Wed Nov 5 09:41:33 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Nov 5 09:41:54 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: Message-ID: Richie Hindle wrote: > [Kenny] >> Fixes that apply to both the 1.0 and 1.1 releases >> need to be made on both the branch and the trunk (because the current >> state of those versions could be different at the time the fix is >> made). > > No, definitely not! Fixes made on the bugfix branch should be > batch-merged onto the head once in a while, using "cvs up -j". That's cool, I didn't know CVS could do that. I was simply going off the previous description of bug fixing that you referenced in your previous message. > So you get CVS to apply all the edits that have been made on the > bugfix branch to the head: you take a head checkout and do "cvs up -j > bugfix". You get some conflicts where bugfixes have been made to code > that's changed on the head, and you fix them (this is less of a > problem than you might think, once the code has stopped migrating > wholesale from place to place as it can do in the early stages of a > project). Have you ever seen this become an issue if the new line of development does decide to do a significant refactoring of the code? Thanks for the excellent description of the process. That's good information for anyone working on a CVS project, not just for SpamBayes. 
-- Kenny Pitt From richie at entrian.com Wed Nov 5 13:37:18 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Nov 5 13:37:33 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <200311051429.hA5ESxa3019296@localhost.localdomain> References: <16297.1096.280851.95630@montanaro.dyndns.org> <200311051429.hA5ESxa3019296@localhost.localdomain> Message-ID: [Kenny] > Have you ever seen this become an issue if the new line of development > does decide to do a significant refactoring of the code? No, not really. Either the code has moved so you get a conflict and have to hand-merge, which is no harder than with any other branch management scheme, or the bugfix is no longer meaningful because the code has gone away, so you just take the version from the head. [Skip] > My take on this is that at this point in time we should not be working on > any branches. Everything should happen on the trunk until a release is > about the be cut, at which point a branch is made, then frozen except for > crucial bug fixes. That's fair enough as long we're happy to release new features with every release. As I understood it, at the time we had lots of new features in the pipeline *and* a need to release some bugfixes. Perhaps the features should have gone onto a branch and the trunk remained the place to do bugfixes - we also do that, and it's no problem. People developing significant new features get all the benefits of source control without stepping on the toes of the other developers or the release process. [Anthony] > Note that you can also just use cvs diff to apply the fix by hand to both > the trunk and the branch. This is likely to cause less pain, as you're > then relying on less cvs magic. CVS is good at this kind of thing as long as you give it enough help (managing tags and branches sensibly) - I'd rather let CVS do the heavy lifting than do it by hand. 
All that said, the branch strategy you use is far less important than letting everyone know what it is! I don't have a big axe to grind about this - spambayes isn't (yet) a big enough project for the choice of branch strategy to be critical. Our failure this time, if there even was a failure, was in not advertising the strategy loudly enough. [Anthony] > OTOH, I don't know what is stopping us from cutting 1.0b1 in a couple of > weeks, with a possible RC a couple of weeks after that. The DBRunRecoveryError problem is stopping us, IMHO. Stephen Harper posted to the spambayes list yesterday with a possible method for reproducing that (http://mail.python.org/pipermail/spambayes/2003-November/009021.html) which I need to find time to look into. -- Richie Hindle richie@entrian.com From skip at pobox.com Wed Nov 5 14:54:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 5 14:55:11 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: References: <16297.1096.280851.95630@montanaro.dyndns.org> <200311051429.hA5ESxa3019296@localhost.localdomain> Message-ID: <16297.21907.471869.85745@montanaro.dyndns.org> Richie> [Anthony] >> OTOH, I don't know what is stopping us from cutting 1.0b1 in a couple >> of weeks, with a possible RC a couple of weeks after that. Richie> The DBRunRecoveryError problem is stopping us, IMHO. Stephen Richie> Harper posted to the spambayes list yesterday with a possible Richie> method for reproducing that Richie> (http://mail.python.org/pipermail/spambayes/2003-November/009021.html) Richie> which I need to find time to look into. Greg Smith just checked in some changes to the bsddb package in the Python CVS tree related to deadlocks and threading. It might be worth seeing if those changes help. 
Skip

From richie at entrian.com Wed Nov 5 14:56:00 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Nov 5 14:56:14 2003
Subject: [spambayes-dev] Re: [Spambayes] Lotus Notes filter error KeyError: ('Hammie', 'header_spam_string')
In-Reply-To: <2384386.1068059505079.JavaMail.root@gonzo.psp.pas.earthlink.net>
References: <2384386.1068059505079.JavaMail.root@gonzo.psp.pas.earthlink.net>
Message-ID: <50liqvg29dnuk1s4k5hdvf1q5um1r01jop@4ax.com>

[Mike]
> File "C:\Program Files\Python23\Scripts\sb_notesfilter.py", line 237, in processAndTrain
>     str = options["Hammie", "header_spam_string"]

I don't know much about the Notes stuff, but that looks like a bug. That piece of code should probably be:

    if is_spam:
        str = options["Headers", "header_spam_string"]
    else:
        str = options["Headers", "header_ham_string"]

You can see from spambayes/Options.py that header_spam_string goes in the Headers section, not the Hammie section. There are a few other places in sb_notesfilter.py where similar code ("options["Hammie", "header_xxx_string"]") appears - that should be changed too.

Mike, does changing "Hammie" to "Headers" in those places fix your problem?

I'm forwarding this to spambayes-dev to see whether anyone there knows for sure whether I'm right about this...?

--
Richie Hindle
richie@entrian.com

From richie at entrian.com Thu Nov 6 02:58:40 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Nov 6 02:58:52 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <16297.21907.471869.85745@montanaro.dyndns.org>
References: <16297.1096.280851.95630@montanaro.dyndns.org> <200311051429.hA5ESxa3019296@localhost.localdomain> <16297.21907.471869.85745@montanaro.dyndns.org>
Message-ID:

[Skip]
> Greg Smith just checked in some changes to the bsddb package in the Python
> CVS tree related to deadlocks and threading. It might be worth seeing if
> those changes help.

Thanks. If I can reproduce the problem with Spambayes, I'll try it again with CVS Python.
-- Richie Hindle richie@entrian.com From kennypitt at hotmail.com Thu Nov 6 10:22:18 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Nov 6 10:22:44 2003 Subject: [spambayes-dev] RE: [Spambayes] Spambayes 1.0a7 - windows proxy_tray installation In-Reply-To: <3FA9D6E0.3000409@swiftdsl.com.au> Message-ID: Phil Pierotti wrote: > [mebbe I didn't install something the way it was expecting] > But what I see is that under the distribution, there's > windows\pop3proxy_*.py ("the scripts") > windows\resources\ (with all the .ico etc resources for the > systray program) > > The scripts are installed under > \Python23\Scripts > by setuup.py, but there's no corresponding > \Python23\windows\resources\ > with all the icon/resources > > So: > > (a) did I not install something properly > (b) dd the installer not install the resources properly > (c) are the paths in the script wrong > (d) all of the above > (e) none of the above, I'm just smoking too much crack (as per usual) Looks like you hit the nail right on the head. Glad I finished reading my inbox before replying to your previous message . At some point in the not-too-distant past, a decision was made that the Windows scripts pop3proxy_service.py and pop3proxy_tray.py should be installed to the Python Scripts directory along with the other command-line scripts. It seems this was a bit premature, as pop3proxy_tray obviously isn't designed to be run that way. When run from source, the icon resources are required to be present in a directory structure that isn't appropriate for installing into the main Python directory. For now, you should be able to get around this problem by going to the windows dir in your original source tree and running pop3proxy_tray.py from there. I've CC'd the spambayes-dev list in hopes that someone can take a look at this. At the very least, we should probably stop copying it to the Python\Scripts directory until the problem is fixed. 
-- Kenny Pitt From richie at entrian.com Thu Nov 6 16:36:20 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Nov 6 16:36:35 2003 Subject: [spambayes-dev] Hunter-killer drones Message-ID: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> Dev people, Before I start digging through the CVS logs to see who committed the following code to BrighterAsyncChat.handle_error in Dibbler.py: if type == socket.error and v[0] == 9: # Why? Who knows... pass so that I know where to dispatch the hunter-killer drones, does anybody want to confess to it? Throwing away that exception is causing an infinite loop in sb_server.py whenever something happens to a browser socket, like someone going to a different page during training. Rumour has it that this leads to the infamous DBRunRecoveryError, but I haven't confirmed that yet. Own up and explain yourself, or the next thing you hear will be the whine of tiny nuclear-tipped turbines... -- Richie Hindle richie@entrian.com From tim.one at comcast.net Thu Nov 6 16:58:20 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Nov 6 16:58:25 2003 Subject: [spambayes-dev] Hunter-killer drones In-Reply-To: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> Message-ID: [Riche Hindle] > Dev people, > > Before I start digging through the CVS logs to see who committed the > following code to BrighterAsyncChat.handle_error in Dibbler.py: > > if type == socket.error and v[0] == 9: # Why? Who knows... > pass > > so that I know where to dispatch the hunter-killer drones, does > anybody want to confess to it? I'd love to, except I didn't do it. CVS annotate says it's been like that since version 1.1, and, indeed, the oldest version in the repository already had it: Revision 1.1 Fri Jan 17 20:21:07 2003 UTC (9 months, 2 weeks ago) by richiehindle > Throwing away that exception is causing an infinite loop in > sb_server.py whenever something happens to a browser socket, like > someone going to a different page during training. That's probably not good . 
> Rumour has it that this leads to the infamous DBRunRecoveryError, Cool! Doubly worth pursuing then. > but I haven't confirmed that yet. > > Own up and explain yourself, or the next thing you hear will be the > whine of tiny nuclear-tipped turbines... It's far too late to stop that now -- I just saw the drones whiz by my window! From skip at pobox.com Thu Nov 6 17:02:25 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Nov 6 17:02:38 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? Message-ID: <16298.50417.896066.477124@montanaro.dyndns.org> I run SpamBayes on a couple machines filtering scoring mail for several email addresses which eventually find their way to my mailbox. I'd like to stuff the hostname into the score somehow. My first attempt was [Headers] classification_header_name: X-Spambayes-Classification: titan This failed with this error: Attempted to set [Headers] classification_header_name with invalid value ... Without considering it further, I then tried: [Headers] header_spam_string: titan: spam header_ham_string: titan: ham This also failed. Next, I tried [Headers] header_spam_string: titan:spam header_ham_string: titan:ham then [Headers] header_spam_string: titan-spam header_ham_string: titan-ham which finally worked. It looks to me like OptionsClass.HEADER_VALUE is too restrictive, but I'll leave it for the author of that code to decide whether or not to loosen it up. 
Skip From nas-spambayes at python.ca Thu Nov 6 17:04:46 2003 From: nas-spambayes at python.ca (Neil Schemenauer) Date: Thu Nov 6 17:02:59 2003 Subject: [spambayes-dev] Hunter-killer drones In-Reply-To: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> Message-ID: <20031106220445.GA24610@mems-exchange.org> On Thu, Nov 06, 2003 at 09:36:20PM +0000, Richie Hindle wrote: > Before I start digging through the CVS logs to see who committed the > following code to BrighterAsyncChat.handle_error in Dibbler.py: > > if type == socket.error and v[0] == 9: # Why? Who knows... > pass On my system 9 == errno.EBADF (Bad file descriptor). No idea why someone would want to ignore it. Neil From richie at entrian.com Thu Nov 6 17:20:46 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Nov 6 17:21:01 2003 Subject: [spambayes-dev] Hunter-killer drones In-Reply-To: References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> Message-ID: <5vhlqvoif2857jl2lc675a7b9v0k9p0idp@4ax.com> [Tim] > CVS annotate says it's been like that since version 1.1 That's why I don't want to have to dig through CVS - it was introduced before the code was moved from pop3proxy.py (or possibly the original web configurator, whatever that was called) into Dibbler.py. So I need to hunt around in the attic... much easier to deploy the Drones. -- Richie Hindle richie@entrian.com From richie at entrian.com Thu Nov 6 17:34:07 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Nov 6 17:34:21 2003 Subject: [spambayes-dev] Hunter-killer drones In-Reply-To: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> Message-ID: [Me] > Own up and explain yourself, or the next thing you hear will be the whine > of tiny nuclear-tipped turbines... Ha! The Drones are now en route to Central Illinois. Take a picture of *this*! Ka-BOOM! 
Sadly I'm not sure the culprit reads this list any more, so he may never know the cause of his demise (and that of the unfortunate West Central Illinois Tractor Pullers Association - http://www.wcitpa.com - whose headquarters will unfortunately be destroyed as well). I'll send him a personal email to apologise - perhaps he'll see it before the Drones reach him. -- Richie Hindle richie@entrian.com From papaDoc at videotron.ca Thu Nov 6 17:38:39 2003 From: papaDoc at videotron.ca (papaDoc) Date: Thu Nov 6 17:38:42 2003 Subject: [spambayes-dev] Hunter-killer drones In-Reply-To: References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com> Message-ID: <3FAACD6F.3040303@videotron.ca> Hi, We want a name, we want a name ........ ;-) What a nice explosion, I was able to see it from Montreal !!! Remi From kennypitt at hotmail.com Fri Nov 7 13:13:34 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Nov 7 13:13:59 2003 Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon In-Reply-To: Message-ID: Bob Chojnacki wrote: > Hi, > > I really like SpamBayes and the Outlook plugin. It is working much > better than other spam filters, considering I get 85-95% spam. I am > currently using version 008.1. I read your FAQ about the problems > with making the Outlook envelope tray icon go away. (I am also not > sure if this is the right email address to send this comment, so > please bear with me if it isn't.) > > Is the following link helpful? (Keep in mind that I am not a Windows > programmer): > > http://www.slipstick.com/dev/code/clearenvicon.htm Thanks for the link. I created the following code to implement this in the Outlook plugin and attached it to a menu item for testing. It was, in fact, successful in removing the new mail envelope from the taskbar. Now, the *really* tricky part is figuring out when to remove the icon. 
====================

def RemoveNewMailIcon():
    win32gui.EnumWindows(_removeIconCallback, None)

def _removeIconCallback(hwnd, extra):
    # Check for Outlook window class.
    if win32gui.GetClassName(hwnd) == "rctrl_renwnd32":
        # Got the correct class, but we need to make sure window title is
        # empty because there may be other top-level Outlook windows.
        if win32gui.GetWindowText(hwnd) == "":
            return not _killNewMailIcon(hwnd)
        else:
            return True
    else:
        return True

WUM_RESETNOTIFICATION = win32con.WM_USER + 7
def _killNewMailIcon(hwnd):
    nid = (hwnd, 0)
    if not win32gui.Shell_NotifyIcon(win32gui.NIM_DELETE, nid):
        return False
    else:
        win32gui.SendMessage(hwnd, WUM_RESETNOTIFICATION, 0, 0)
        return True

====================

-- 
Kenny Pitt

From bob at jellyvision.com Fri Nov 7 13:40:33 2003
From: bob at jellyvision.com (Bob Chojnacki)
Date: Fri Nov 7 13:37:11 2003
Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon
In-Reply-To: <20031107121504.7ea42d3ec3ef466293ceca3f8ae215f9.in@ansel.jellyvision.com>
Message-ID:

> Now, the *really* tricky part is figuring out when to remove the icon.

I noticed right after I sent my email (blush) the comment in their code

    ' add some code to check whether the latest items are "interesting"

The comment is akin to the old Steve Martin comedy routine:

    "How to become a millionaire. First, get a million dollars..."

Sorry about that.

Bob

> -----Original Message-----
> From: Kenny Pitt [mailto:kennypitt@hotmail.com]
> Sent: Friday, November 07, 2003 12:14 PM
> To: 'Bob Chojnacki'; spambayes@python.org
> Cc: spambayes-dev@python.org
> Subject: RE: [Spambayes] Outlook Envelope Tray Icon
>
>
> Bob Chojnacki wrote:
> > Hi,
> >
> > I really like SpamBayes and the Outlook plugin. It is working much
> > better than other spam filters, considering I get 85-95% spam. I am
> > currently using version 008.1. I read your FAQ about the problems
> > with making the Outlook envelope tray icon go away.
(I am also not > > sure if this is the right email address to send this comment, so > > please bear with me if it isn't.) > > > > Is the following link helpful? (Keep in mind that I am not a Windows > > programmer): > > > > http://www.slipstick.com/dev/code/clearenvicon.htm > > Thanks for the link. I created the following code to implement this in > the Outlook plugin and attached it to a menu item for testing. It was, > in fact, successful in removing the new mail envelope from the taskbar. > Now, the *really* tricky part is figuring out when to remove the icon. > > ==================== > > def RemoveNewMailIcon(): > win32gui.EnumWindows(_removeIconCallback, None) > > def _removeIconCallback(hwnd, extra): > # Check for Outlook window class. > if win32gui.GetClassName(hwnd) == "rctrl_renwnd32": > # Got the correct class, but we need to make sure window title > is > # empty because there may be other top-level Outlook windows. > if win32gui.GetWindowText(hwnd) == "": > return not _killNewMailIcon(hwnd) > else: > return True > else: > return True > > WUM_RESETNOTIFICATION = win32con.WM_USER + 7 > def _killNewMailIcon(hwnd): > nid = (hwnd, 0) > if not win32gui.Shell_NotifyIcon(win32gui.NIM_DELETE, nid): > return False > else: > win32gui.SendMessage(hwnd, WUM_RESETNOTIFICATION, 0, 0) > return True > > ==================== > > -- > Kenny Pitt From rmalayter at bai.org Fri Nov 7 13:47:54 2003 From: rmalayter at bai.org (Ryan Malayter) Date: Fri Nov 7 13:47:59 2003 Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon Message-ID: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org> > From: Bob Chojnacki > Subject: RE: [Spambayes] Outlook Envelope Tray Icon > > I noticed right after I sent my email (blush) the comment in > their code > > ' add some code to check whether the latest items are "interesting" > > The comment is akin to the old Steve Martin comedy routine: > > "How to become a millionaire. First, get a million dollars..." > > Sorry about that. 
Is it really that hard? Maybe I'm not thinking it through enough, but I suggest this simple approach: Check for unread messages in the SpamBayes "watched" folders. Check the spam score on each of those unread messages. If any exist where the Spam score is below the certain ham threshold, show the icon if not, everything new was spam, and you can remove the icon. This might take a second or two two but it can happen right after every SpamBayes scoring run gets triggered. So we'll see the new mail icon for at most a few seconds. Regards, Ryan From adam.walker at rbwconsulting.com Fri Nov 7 14:07:48 2003 From: adam.walker at rbwconsulting.com (Adam Walker) Date: Fri Nov 7 14:08:03 2003 Subject: [spambayes-dev] Re: [Spambayes] Outlook Envelope Tray Icon In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org> References: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org> Message-ID: <3FABED84.5050607@rbwconsulting.com> What about mail delivered to unwatched folders? What about mail delivered to watched and unwatched folders in the same batch? Why do people feel they need to drop everything and read an email when it comes in? Ryan Malayter wrote: > >Is it really that hard? > >Maybe I'm not thinking it through enough, but I suggest this simple >approach: > Check for unread messages in the SpamBayes "watched" folders. > Check the spam score on each of those unread messages. > If any exist where the Spam score is below the certain ham >threshold, show the icon > if not, everything new was spam, and you can remove the icon. > >This might take a second or two two but it can happen right after every >SpamBayes scoring run gets triggered. So we'll see the new mail icon for >at most a few seconds. 
> >Regards, > Ryan > > > From kennypitt at hotmail.com Fri Nov 7 14:08:12 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Nov 7 14:08:36 2003 Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org> Message-ID: Ryan Malayter wrote: >> From: Bob Chojnacki >> Subject: RE: [Spambayes] Outlook Envelope Tray Icon >> >> I noticed right after I sent my email (blush) the comment in >> their code >> >> ' add some code to check whether the latest items are "interesting" >> > Is it really that hard? Maybe it isn't. I just know I'm not the guy familiar enough with the code to determine that. > > Maybe I'm not thinking it through enough, but I suggest this simple > approach: > Check for unread messages in the SpamBayes "watched" folders. > Check the spam score on each of those unread messages. > If any exist where the Spam score is below the certain ham > threshold, show the icon > if not, everything new was spam, and you can remove the icon. > > This might take a second or two two but it can happen right after > every SpamBayes scoring run gets triggered. So we'll see the new mail > icon for at most a few seconds. Some possible issues I can think of: First, depending on your SpamBayes configuration the processing can be triggered each time a new message is added to the Inbox. I'm not sure if it would be a good idea to check all messages in all watched folders every time a new message is received, as that might prove time-consuming (especially for those of us who tend to let our Inboxes get cluttered with old mail). Second, we've seen in other cases that we can't always rely on Outlook to do things in the order that we expect. I'm not sure we can guarantee that we are processing the message *after* Outlook has already created the tray icon. We could end up "removing" an icon that doesn't yet exist, and then have Outlook add it after we've finished our processing. P.S. 
I've moved this discussion over to the spambayes-dev list, which is probably a more appropriate venue for these implementation details. -- Kenny Pitt From cdellario at whatif-productions.com Fri Nov 7 14:43:37 2003 From: cdellario at whatif-productions.com (Chris Dellario) Date: Fri Nov 7 14:42:47 2003 Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon Message-ID: <113EE4C6211B1D41A34E54A089F4795C0AFCAB@mailbox.whatif-productions.com> Because, unfortunately, some of us have co-workers (boss, boss's boss, any number of department heads, etc) who expect us to have read an email a few minutes after they've sent it. Those of us who have the privilege of working offset are often expected to reply sooner than others to "prove" that we're working. ------------------------------------------------------------ Chris Dellario Lead Engineer Whatif Productions LLC http://www.whatif.info (617) 977-0115 -----Original Message----- From: Adam Walker [mailto:adam.walker@rbwconsulting.com] Sent: Friday, November 07, 2003 2:08 PM To: Ryan Malayter Cc: spambayes-dev@python.org; spambayes@python.org; Bob Chojnacki Subject: Re: [Spambayes] Outlook Envelope Tray Icon What about mail delivered to unwatched folders? What about mail delivered to watched and unwatched folders in the same batch? Why do people feel they need to drop everything and read an email when it comes in? Ryan Malayter wrote: > >Is it really that hard? > >Maybe I'm not thinking it through enough, but I suggest this simple >approach: > Check for unread messages in the SpamBayes "watched" folders. > Check the spam score on each of those unread messages. > If any exist where the Spam score is below the certain ham >threshold, show the icon > if not, everything new was spam, and you can remove the icon. > >This might take a second or two two but it can happen right after every >SpamBayes scoring run gets triggered. So we'll see the new mail icon for >at most a few seconds. 
> >Regards, > Ryan > > > _______________________________________________ Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes Check the FAQ before asking: http://spambayes.sf.net/faq.html From spambayes at whateley.com Sat Nov 8 02:13:16 2003 From: spambayes at whateley.com (Brendon) Date: Sun Nov 9 16:12:03 2003 Subject: [spambayes-dev] Re: [Spambayes] Outlook Envelope Tray Icon In-Reply-To: <3FABED84.5050607@rbwconsulting.com> References: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org> <3FABED84.5050607@rbwconsulting.com> Message-ID: <200311071220.06215.spambayes@whateley.com> On Friday 07 November 2003 11:07 am, Adam Walker wrote: > What about mail delivered to unwatched folders? What about mail > delivered to watched and unwatched folders in the same batch? Why do > people feel they need to drop everything and read an email when it comes > in? > Probably the same reason most people can't _not_ answer a ringing phone! That, or just lonely? Brendon. From richie at entrian.com Mon Nov 10 15:46:19 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Nov 10 15:46:43 2003 Subject: [spambayes-dev] Re: Offer to Help / Development Participation In-Reply-To: References: Message-ID: <8rsvqv0vt1804s1akodgn77060evpgr83e@4ax.com> David, Darrell, [David] > I'm very impressed with your work and would be glad to help [...] [Darrell] > [...] any development help you need don't hesitate to drop me a line. Many thanks for the offers! Maybe other developers would like to make specific suggestions (hence I've forwarded this to the spambayes-dev mailing list), but there's a whole bunch of things that you could do, starting with the non-technical: o Try to reproduce bugs that we're having trouble reproducing; see the bug list at http://sourceforge.net/tracker/?group_id=61702&atid=498103 (807217 is my personal hate figure). 
o Help with testing; we're only able to test within our own environments, and only the developers who are around at the time of a release are able to do even that. Some "real people" who could help test in their environments would be a big help. o Help improve the website; there's a Wiki page about that at http://www.entrian.com/sbwiki/WebSiteDevelopment o Help improve the documentation, especially for the non-Outlook applications (POP3 proxy, IMAP filter, Notes filter, sb_filter). o Help out newbies on the mailing list. o Make contributions to the Wiki, http://www.entrian.com/sbwiki - any hints and tips, scripts, recipes etc. o Taking part in discussions on the developer's mailing list at spambayes-dev@python.org. You don't need to be a developer to participate, you just need to have a decent grasp of the project and have opinions about how it should be developed. For those with programming skills, there's even more you could help with, even without in-depth knowledge of the code. The code's pretty accessible, and developers are always glad to answer questions about how it all works. Here's a small list off the top of my head: o Test patches, tidying them up, making them fit the coding standard (http://www.python.org/peps/pep-0008.html) if they don't already. See http://sourceforge.net/tracker/?group_id=61702&atid=498105 o Fix bugs - turning a bug report into a patch makes it far more likely to be fixed! o Improve our unit tests, or help develop an acceptance test framework. o Once you've got a handle on how the code works, implement feature requests. o Backport bugfixes from the head onto the bugfix branch, although our branch strategy is a little up in the air at the moment, so that's one for the future. o Help with sailing our fleet of luxury yachts from the Caribbean to the Med for the Spring season... or am I dreaming again? 8-) There are probably a dozen other things that I haven't thought of. 
-- 
Richie Hindle
richie@entrian.com

From skip at pobox.com Mon Nov 10 17:49:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 10 17:49:49 2003
Subject: [spambayes-dev] Another tweak to try - asciify_subject
Message-ID: <16304.5639.788824.89239@montanaro.dyndns.org>

We're all familiar with the recent attempts to foil spam filters by adding Latin-1 accents to message subjects (and sometimes to message bodies):

    We c?n mak? it l?nger now

The attached context diff maps subjects through a "latscii" codec I wrote which does little more than strip accents. (It also maps various symbols to reasonable ASCII equivalents, like mapping '?' -> '!'.) This showed a small improvement in false negatives for me (1 out of 10 on the timcv meter, n == 10, 500 messages per bucket) and no change in false positives:

    false positive percentages
        0.600  0.600  tied
        0.000  0.000  tied
        0.200  0.200  tied
        0.400  0.400  tied
        0.000  0.000  tied
        0.800  0.800  tied
        0.200  0.200  tied
        0.800  0.800  tied
        0.200  0.200  tied
        0.400  0.400  tied

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 18 to 18 tied
    mean fp % went from 0.36 to 0.36 tied

    false negative percentages
        2.200  2.200  tied
        1.000  1.000  tied
        2.200  2.000  won     -9.09%
        3.000  3.000  tied
        1.600  1.600  tied
        2.000  2.000  tied
        1.000  1.000  tied
        2.000  2.000  tied
        1.600  1.600  tied
        1.400  1.400  tied

    won   1 times
    tied  9 times
    lost  0 times

    total unique fn went from 90 to 89 won     -1.11%
    mean fn % went from 1.8 to 1.78 won     -1.11%

    ham mean                     ham sdev
       4.92    4.94   +0.41%       14.95   14.98   +0.20%
       5.14    5.16   +0.39%       15.47   15.48   +0.06%
       4.89    4.90   +0.20%       14.51   14.53   +0.14%
       5.31    5.34   +0.56%       15.80   15.85   +0.32%
       4.61    4.62   +0.22%       14.80   14.83   +0.20%
       5.71    5.75   +0.70%       17.21   17.28   +0.41%
       4.32    4.33   +0.23%       13.45   13.50   +0.37%
       4.83    4.85   +0.41%       14.83   14.87   +0.27%
       4.38    4.38   +0.00%       13.97   14.02   +0.36%
       5.96    5.97   +0.17%       17.38   17.40   +0.12%

    ham mean and sdev for all runs
       5.01    5.02   +0.20%       15.29   15.33   +0.26%

    spam mean                    spam sdev
      90.76   90.84   +0.09%       19.66   19.58   -0.41%
      91.16   91.23   +0.08%       17.64   17.57   -0.40%
      91.25   91.29   +0.04%       18.84   18.79   -0.27%
      88.31   88.36   +0.06%       22.55   22.49   -0.27%
      90.54   90.62   +0.09%       18.50   18.42   -0.43%
      91.64   91.68   +0.04%       17.75   17.69   -0.34%
      91.19   91.33   +0.15%       17.82   17.71   -0.62%
      91.66   91.69   +0.03%       18.76   18.74   -0.11%
      91.31   91.39   +0.09%       17.97   17.85   -0.67%
      91.87   91.96   +0.10%       17.07   16.97   -0.59%

    spam mean and sdev for all runs
      90.97   91.04   +0.08%       18.74   18.66   -0.43%

    ham/spam mean difference: 85.96 86.02 +0.06

If you test this out, it will have no effect if you don't have any messages in your training databases which use this trick. When I first ran it, I hadn't factored in any recent messages and saw nothing. After I ran splitndirs.py over my current small (153 spam, 102 ham) training databases, then ran rebal -n 300 followed by rebal -n 500 to stir the pot a bit, I saw the above changes.

While I was at it, I wrote a simple Makefile to run the cross validation tests. This should speed things up in the common case where your training database and your base.ini file don't change (cutting processing time approximately in half). Use it like so:

    make BASE=std TRIAL=ascii

A plain make assumes your base and trial option files are std.ini and trial.ini, respectively.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: application/octet-stream
Size: 737 bytes
Desc: Makefile for running cross validations
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/Makefile.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff Type: application/octet-stream Size: 6759 bytes Desc: asciify_subject tweak Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/sb.obj From tim.one at comcast.net Tue Nov 11 11:08:45 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Nov 11 11:08:49 2003 Subject: [spambayes-dev] RE: Bug in UserInterface.py In-Reply-To: Message-ID: Would someone familiar with UserInterface.py please check in the attached patch, or add it to the patch manager if you're unsure about it? Thanks! -------------- next part -------------- An embedded message was scrubbed... From: "Mats Kindahl" Subject: Bug in UserInterface.py Date: Tue, 11 Nov 2003 03:39:58 -0500 Size: 1983 Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031111/8a8f0679/attachment.mht From kennypitt at hotmail.com Tue Nov 11 12:18:09 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Nov 11 12:18:49 2003 Subject: [spambayes-dev] RE: Bug in UserInterface.py In-Reply-To: Message-ID: Tim Peters wrote: > Would someone familiar with UserInterface.py please check in the > attached patch, or add it to the patch manager if you're unsure about > it? Thanks! [from attached patch] """ diff -r1.32 UserInterface.py 274c274 < sc_re = re.compile("%s:(.*)\n" % \ --- > sc_re = re.compile("%s:\s*(\d*\.\d+|\d+\.\d*).*\n" % \ """ This would probably work if and when the fix for bug #831388 is applied. However, the current code inserts the probability in the X-Spambayes-Spam-Probability: header using str(prob), which can go into the "e" exponent notation for very small probs. If this happens, the patched regex will fail to properly identify the probability. 
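
The failure mode described above can be reproduced with a few lines. This is only a sketch in present-day Python 3, not the code in UserInterface.py: the header name is the one quoted in the message, and `extract` is a hypothetical helper written for this illustration. It formats a probability with str(), as the message says the current code does, and tries to read it back with the regex from the attached patch.

```python
import re

# Header name as quoted in the message above.
HEADER = "X-Spambayes-Spam-Probability"

# Regex from the attached patch: it requires a decimal point and knows
# nothing about exponent notation.
patched = re.compile(r"%s:\s*(\d*\.\d+|\d+\.\d*).*\n" % HEADER)


def extract(regex, prob):
    """Hypothetical helper: write prob with str() and parse it back."""
    line = "%s: %s\n" % (HEADER, str(prob))
    match = regex.search(line)
    return match.group(1) if match else None


plain = extract(patched, 0.5)      # ordinary value, parsed as written
tiny = extract(patched, 1.5e-07)   # str() switches to exponent notation
whole = extract(patched, 1e-05)    # exponent notation with no decimal point

print(plain, tiny, whole)
```

In the second case the regex still matches, but only the mantissa is captured, so a very small probability would be read back as a large one. In the third case there is no decimal point at all, so nothing matches.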
The regex can be modified as follows if you want to account for this possibility: sc_re = re.compile("%s:\s*((\d*\.\d+|\d+\.\d*)(e[-+]\d+)?).*\n" % \ In addition, I believe that there will always be at least one digit before the decimal point as the leading zero is always included, so we should be able to simplify the expression to: sc_re = re.compile("%s:\s*(\d+\.\d*(e[-+]\d+)?).*\n" % \ -- Kenny Pitt From richie at entrian.com Tue Nov 11 16:58:41 2003 From: richie at entrian.com (Richie Hindle) Date: Tue Nov 11 16:59:01 2003 Subject: [spambayes-dev] Website bug and proposed fix Message-ID: Hi, Jens Rantil has kindly pointed out that we have some broken links on our website, in particular the "SF Project Page" link that appears throughout the site. I've never looked at the website stuff before so I could be way off base, but the problem seems to be that we're applying posixpath.normpath to a URL, with results that look like this: >>> import posixpath >>> posixpath.normpath("http://sourceforge.net/projects/spambayes") 'http:/sourceforge.net/projects/spambayes' I'd say the fix was to break apart the URL and only run the path component through normpath. Here's a patch - I don't want to commit it, partly because I don't know the code, and partly because the website build system doesn't fully work on my machine so I can't thoroughly test it. I've also removed a rather cryptic comment that seems to refer to history rather than the current state of play. 
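
The idea of normalizing only the path component can be sketched as follows. This is an illustration in Python 3 using urllib.parse (the code under discussion uses the Python 2 urlparse module), and `normalize_url` is a hypothetical helper, not the LinkFixer method itself.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical helper: normalize only the path part of a URL."""
    parts = urlsplit(url)
    # normpath("") returns ".", so leave an empty path alone.
    path = posixpath.normpath(parts.path) if parts.path else parts.path
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


# Applying normpath to the whole URL collapses the "//" after the scheme.
broken = posixpath.normpath("http://sourceforge.net/projects/spambayes")

# Splitting first keeps the scheme and host intact.
absolute = normalize_url("http://sourceforge.net/projects/../projects/spambayes")
relative = normalize_url("../apps/./outlook/bugs.html")

print(broken)
print(absolute)
print(relative)
```

One caveat of this sketch: posixpath.normpath also drops a trailing slash, which may or may not matter for a given link.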
Index: scripts/ht2html/LinkFixer.py
===================================================================
RCS file: /cvsroot/spambayes/website/scripts/ht2html/LinkFixer.py,v
retrieving revision 1.2
diff -c -r1.2 LinkFixer.py
*** scripts/ht2html/LinkFixer.py 28 Oct 2003 04:37:08 -0000 1.2
--- scripts/ht2html/LinkFixer.py 11 Nov 2003 21:57:40 -0000
***************
*** 8,13 ****
--- 8,15 ----
  """

  import sys
+ import urlparse
+ import posixpath # use posix semantics for urls
  from types import StringType

  SLASH = '/'
***************
*** 37,49 ****
              url = 'index.html'
          elif url[-1] == '/':
              url = url + 'index.html'
!         absurl = SLASH.join([self.__rootdir, self.__relthis, url])
          # normalize the path, kind of the way os.path.normpath() does.
!         # urlparse ought to have something like this...
!         # hrm - MarkH thinks this is broken, so it has been replaced
!         # with normpath - what is the problem with normpath?
!         import posixpath # use posix semantics for urls
!         absurl = posixpath.normpath(absurl)
          self.msg('absurl= %s', absurl)
          return absurl

--- 39,51 ----
              url = 'index.html'
          elif url[-1] == '/':
              url = url + 'index.html'
!         # normalize the path, kind of the way os.path.normpath() does.
!         # urlparse ought to have something like this built in...
!         scheme, addr, path, params, query, frag = urlparse.urlparse(url)
!         abspath = SLASH.join([self.__rootdir, self.__relthis, path])
!         path = posixpath.normpath(abspath)
!         absurl = urlparse.urlunparse((scheme, addr, path, params, query, frag))
          self.msg('absurl= %s', absurl)
          return absurl

-- 
Richie Hindle
richie@entrian.com

From mhammond at skippinet.com.au Tue Nov 11 17:22:28 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Nov 11 17:22:12 2003
Subject: [spambayes-dev] Website bug and proposed fix
In-Reply-To:
Message-ID: <08f801c3a8a2$4b7011f0$0500a8c0@eden>

> Hi,
>
> Jens Rantil has kindly pointed out that we have some broken links on
> our website, in particular the "SF Project Page" link that appears
> throughout the site.
> > I've never looked at the website stuff before so I could be way off > base, but the problem seems to be that we're applying > posixpath.normpath > to a URL, with results that look like this: > > >>> import posixpath > >>> posixpath.normpath("http://sourceforge.net/projects/spambayes") > 'http:/sourceforge.net/projects/spambayes' > > I'd say the fix was to break apart the URL and only run the path > component through normpath. Here's a patch - I don't want to > commit it, > partly because I don't know the code, and partly because the website > build system doesn't fully work on my machine so I can't > thoroughly test > it. Mea culpa. This was a hack I made when trying to get the apps/outlook/bugs.html file working. The code was breaking for me with relative links, and I had the impression that the "link fixer" was only fixing relative links. I've checked your fix in (I still have a related problem with bugs.html, but it exists before and after your patch.) Mozilla appears to have done the right thing with those links! > I've also removed a rather cryptic comment that seems to refer to > history rather than the current state of play. hehe - surely that comment helped you track the bug? At least I put my name next to my suspect code ;) Mark. From skip at pobox.com Wed Nov 12 09:48:25 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 12 09:48:31 2003 Subject: [spambayes-dev] sb_filter change Message-ID: <16306.18489.981478.992986@montanaro.dyndns.org> I modified sb_filter.py to accept one or more file names on the command line. Existing behavior should be retained. If a single message is read from stdin, the output message will have a From_ line only if the input message did. When processing files from the command line, it uses mboxutils.getmbox() to decipher their format. In such cases, the output is always a Unix-style mailbox on stdout. This change probably doesn't have a lot of practical use, but I find it helpful in one situation. 
When I wanted to score a mailbox full of messages to identify outliers (perhaps mistakes in my classification of a large body of messages), I used to do this:

    formail -s sb_filter.py < somembox \
    | egrep -i '^(x-spambayes-classification|message-id): '

which incurred sb_filter.py startup for each message. Now I execute

    sb_filter.py somembox \
    | egrep -i '^(x-spambayes-classification|message-id): '

which runs a lot faster. I should be able to figure out how to process my incoming mail that way as well, then spit the result into

    formail -s procmail

to do the usual procmail processing.

This usage suggests an enhancement to mboxutils.getmbox(). Currently, it doesn't recognize Tim-style training databases (e.g. Data/Ham/SetN, where all files have numeric filenames). mboxutils.DirOfTxtFileMailbox could be extended to simply accept all plain files as messages and all subdirectories as nested DirOfTxtFileMailboxes. Would that change break anyone's usage? (What are .lorien files anyway?)

Skip

From patterson at Tech2020.org Wed Nov 12 12:28:30 2003
From: patterson at Tech2020.org (Kevin Patterson)
Date: Wed Nov 12 12:28:36 2003
Subject: [spambayes-dev] (no subject)
Message-ID:

Will spambayes ever work in a Terminal server/Citrix environment? It works fine when only one instance of Outlook is running. Do you know of anything I could do config wise on spambayes to fix this? Or on the server side.

Keep up the great work. Thank you!

Kevin Patterson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031112/8574420c/attachment.html

From skip at pobox.com Wed Nov 12 12:40:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 12:44:03 2003
Subject: [spambayes-dev] Who can explain this?
Message-ID: <16306.28829.807690.155222@montanaro.dyndns.org>

Racking my brain trying to figure out just what persistent storage file I was using (because sometimes it seemed to use ~/hammie.db and sometimes ~/.hammiedb), I came across this in sb_filter.py:

    # This is a bit of a hack to counter the default for
    # persistent_storage_file changing from ~/.hammiedb to hammie.db
    # This will work unless a user:
    #   * had hammie.db as their value for persistent_storage_file, and
    #   * their config file was loaded by Options.py.
    if options["Storage", "persistent_storage_file"] == \
       options.default("Storage", "persistent_storage_file"):
        options["Storage", "persistent_storage_file"] = \
                           "~/.hammiedb"

Can we just rip this hack out and let the user's options file dictate things?

Skip

From skip at pobox.com Wed Nov 12 13:05:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 13:05:28 2003
Subject: [spambayes-dev] proposal for more uniform option setting from the command line
Message-ID: <16306.30305.187928.62359@montanaro.dyndns.org>

Our command lines still seem to be a mish mash of little hacks. Everything of interest can be set via the INI file, but there are only a few options which can be set via the command line, and not (I don't believe) in a consistent way across SB apps.

How about instead of only allowing specific options to be overridden on the command line we use a consistent syntax for overriding *any* option from the command line? For example, to set the ["Storage", "persistent_storage_file"] we could use something like

    -o Storage:persistent_storage_file:~/.hammiedb

or

    --option=Storage:persistent_storage_file:~/.hammiedb

The general syntax of an option setting command line arg would then be section:field:value. The post-getopt.getopt() code might look something like:

    from spambayes.Options import options
    for opt, arg in opts:
        ...
        elif opt in ('-o', '--option'):
            options.set(*arg.split(':'))
        ...
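
One wrinkle with a plain arg.split(':') is that the value itself may contain a colon (a Windows path, for instance). The parsing step above can be sketched so that only the first two colons are treated as separators. This is a minimal illustration in Python 3, not SpamBayes code: `parse_option_overrides` is a hypothetical helper that only collects (section, field, value) triples, leaving the call to the real options object out of the picture.

```python
import getopt


def parse_option_overrides(argv):
    """Hypothetical helper: collect -o/--option section:field:value triples."""
    opts, args = getopt.getopt(argv, "o:", ["option="])
    overrides = []
    for opt, arg in opts:
        if opt in ("-o", "--option"):
            # Split on the first two colons only, so that a value such
            # as a Windows path keeps its own colon.
            parts = arg.split(":", 2)
            if len(parts) != 3:
                raise ValueError("expected section:field:value, got %r" % arg)
            overrides.append(tuple(parts))
    return overrides, args


overrides, rest = parse_option_overrides([
    "-o", "Storage:persistent_storage_file:C:\\sb\\hammie.db",
    "--option=Headers:header_ham_string:titan-ham",
    "somembox",
])
print(overrides)
print(rest)
```

An incomplete argument such as "-o Storage" raises ValueError in this sketch; a real implementation could use that case to print help instead.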
We would then deprecate any command line args used to twiddle options using any other syntax. Use of those args would trigger a message to stderr like: Deprecated form: "-d ~/hammie.db" found. Use "-o Storage:persistent_storage_file:~/hammie.db" instead. This could be extended further. Should the user give an incomplete -o flag such as "-o Storage" or "-o Storage:spam_cache", help about that section or variable could be emitted: saw_help = False for opt, arg in opts: ... elif opt in ('-o', '--option'): # this would probably be folded into an OptionsClass method val = arg.split(':') if len(val) < 3: options.help(sys.stderr, *val) saw_help = True else: options.set(*arg.split(':')) ... if saw_help: raise SystemExit where OptionsClass.OptionsClass.help() would look something like: def help(self, stream, sect=None, opt=None): if sect is None: # dump help about all options elif opt is None: # dump help about sect else: # dump help about options[sect, opt] Skip From papaDoc at videotron.ca Wed Nov 12 13:15:23 2003 From: papaDoc at videotron.ca (papaDoc) Date: Wed Nov 12 13:15:26 2003 Subject: [spambayes-dev] proposal for more uniform option setting from the command line In-Reply-To: <16306.30305.187928.62359@montanaro.dyndns.org> References: <16306.30305.187928.62359@montanaro.dyndns.org> Message-ID: <3FB278BB.2040402@videotron.ca> Hi, > Our command lines still seem to be a mish mash of little hacks. Everything of interest can be set via the INI file, but there are only a few options which can be set via the command line, and not (I don't believe) in a consistent way across SB apps. This is really true and there was several complain about that. > The general syntax of an option setting command line arg would then be section:field:value. 
The post-getopt.getopt() code might look something like: This is interesting +1 But We have to check if it could be possible to have and option that will not be/can not be included in the options file like my patch about the -t for the sb_mboxtrain.py Remi From kennypitt at hotmail.com Wed Nov 12 13:43:06 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Nov 12 13:47:05 2003 Subject: [spambayes-dev] proposal for more uniform option setting from thecommand line In-Reply-To: <16306.30305.187928.62359@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > How about instead of only allowing specific options to be overridden > on the command line we use a consistent syntax for overriding *any* > option from the command line? For example, to set the ["Storage", > "persistent_storage_file"] we could use something like > > -o Storage:persistent_storage_file:~/.hammiedb > > or > > --option=Storage:persistent_storage_file:~/.hammiedb This sounds useful for those doing testing with various options, and I'm all for it from that standpoint. However, I'm not sure how useful it would be for the average user. > We would then deprecate any command line args used to twiddle options > using any other syntax. Use of those args would trigger a message to > stderr like: > > Deprecated form: "-d ~/hammie.db" found. > Use "-o Storage:persistent_storage_file:~/hammie.db" instead. I don't know if it's good to go that far. The new syntax is rather cumbersome, especially if I'm typing the command manually. Also, some command line flags can set several related option values to the correct combination (e.g. set both the database filename and type with one flag), and the new syntax would require knowing the correct combination and providing all the correct values. > This could be extended further. 
Should the user give an incomplete > -o flag such as "-o Storage" or "-o Storage:spam_cache", help about > that section or variable could be emitted: What about options that have no effect on the application being run? Would it be possible to detect them and show help in that case also? How would we present a list of useful options to the end user without overwhelming them with rarely changed settings and gory internal details? -- Kenny Pitt From popiel at wolfskeep.com Wed Nov 12 13:53:30 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Nov 12 13:53:35 2003 Subject: [spambayes-dev] proposal for more uniform option setting from the command line In-Reply-To: Message from Skip Montanaro of "Wed, 12 Nov 2003 12:05:21 CST." <16306.30305.187928.62359@montanaro.dyndns.org> References: <16306.30305.187928.62359@montanaro.dyndns.org> Message-ID: <20031112185330.D4ADD2DDA2@cashew.wolfskeep.com> In message: <16306.30305.187928.62359@montanaro.dyndns.org> Skip Montanaro writes: > >How about instead of only allowing specific options to be overridden on the >command line we use a consistent syntax for overriding *any* option from the >command line? +1 >The general syntax of an option setting command line arg would then be >section:field:value. The post-getopt.getopt() code might look something >like: > > from spambayes.Options import options > > for opt, arg in opts: > ... > elif opt in ('-o', '--option'): > options.set(*arg.split(':')) > ... Only problem here is that this particular phrasing makes it impossible to set an option value with a colon in it. Better would be to use options.set(*arg.split(':', 2)). >We would then deprecate any command line args used to twiddle options using >any other syntax. Use of those args would trigger a message to stderr like: > > Deprecated form: "-d ~/hammie.db" found. > Use "-o Storage:persistent_storage_file:~/hammie.db" instead. +1 >This could be extended further. 
Should the user give an incomplete -o flag >such as "-o Storage" or "-o Storage:spam_cache", help about that section or >variable could be emitted: I would tend to put this in a separate syntax, and have an incomplete specification just emit an error message (possibly saying something like 'use --help=Storage for more information'). - Alex From skip at pobox.com Wed Nov 12 14:48:31 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 12 14:48:40 2003 Subject: [spambayes-dev] proposal for more uniform option setting from the command line In-Reply-To: <3FB278BB.2040402@videotron.ca> References: <16306.30305.187928.62359@montanaro.dyndns.org> <3FB278BB.2040402@videotron.ca> Message-ID: <16306.36495.340289.654317@montanaro.dyndns.org> Remi> But We have to check if it could be possible to have and option Remi> that will not be/can not be included in the options file like my Remi> patch about the -t for the sb_mboxtrain.py Sure. Command line args which are specific to an application and don't involve modifications to the options database would still be fine. I'm more after the "'-d file' means file is a dbhash and '-D file' means file is a pickle" sort of arg. These can be dispensed with if the user can set the appropriate option(s) from the command line in a more general fashion. This sort of thing might also be useful for at least casual testing. I have this asciify_subject option I'm playing with. I could compare the output of these two commands: sb_filter.py ~/Mail/unsure \ | egrep -i 'x-spambayes-classification' sb_filter.py -o Tokenizer:asciify_subject:True ~/Mail/unsure \ | egrep -i 'x-spambayes-classification' to see if it helps push some of my current unsures in the right direction. 
Skip From skip at pobox.com Wed Nov 12 14:54:04 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 12 14:55:27 2003 Subject: [spambayes-dev] proposal for more uniform option setting from thecommand line In-Reply-To: <20031112184659.240962764A7@orb.pobox.com> References: <16306.30305.187928.62359@montanaro.dyndns.org> <20031112184659.240962764A7@orb.pobox.com> Message-ID: <16306.36828.190258.68060@montanaro.dyndns.org> >> -o Storage:persistent_storage_file:~/.hammiedb >> >> or >> >> --option=Storage:persistent_storage_file:~/.hammiedb Kenny> This sounds useful for those doing testing with various options, Kenny> and I'm all for it from that standpoint. However, I'm not sure Kenny> how useful it would be for the average user. Correct. However, the average user probably shouldn't be giving any command line options (or very few) anyway, but should be twiddling bits in the options file to make them persistent. >> Deprecated form: "-d ~/hammie.db" found. >> Use "-o Storage:persistent_storage_file:~/hammie.db" instead. Kenny> I don't know if it's good to go that far. The new syntax is Kenny> rather cumbersome, especially if I'm typing the command manually. I can buy that, though if you're using a modern shell, command recall can mitigate most of that. (I don't think DOS shells or vanilla /bin/sh qualify as "modern shells". I'm talking tcsh, bash, ksh, etc.) Kenny> Also, some command line flags can set several related option Kenny> values to the correct combination (e.g. set both the database Kenny> filename and type with one flag), and the new syntax would Kenny> require knowing the correct combination and providing all the Kenny> correct values. I think that's more confusing than it ought to be. Having -d and -D simultaneously set two options seems >> This could be extended further. 
Should the user give an incomplete >> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about >> that section or variable could be emitted: Kenny> What about options that have no effect on the application being Kenny> run? I hadn't considered that. Kenny> Would it be possible to detect them and show help in that case Kenny> also? I suppose so, but the application would then have to register all the options it's interested in. How would the application author know what all the storage options were without diving into storage.py and friends? Kenny> How would we present a list of useful options to the end user Kenny> without overwhelming them with rarely changed settings and gory Kenny> internal details? Experiment, I suppose. It appears the majority of users will use the Outlook plugin for which this doesn't apply. I suspect I'm appealing more to the propeller heads among us. Skip From skip at pobox.com Wed Nov 12 14:59:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 12 14:59:54 2003 Subject: [spambayes-dev] proposal for more uniform option setting from the command line In-Reply-To: <20031112185330.D4ADD2DDA2@cashew.wolfskeep.com> References: <16306.30305.187928.62359@montanaro.dyndns.org> <20031112185330.D4ADD2DDA2@cashew.wolfskeep.com> Message-ID: <16306.37168.627459.610349@montanaro.dyndns.org> >> This could be extended further. Should the user give an incomplete >> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about >> that section or variable could be emitted: Alex> I would tend to put this in a separate syntax, and have an Alex> incomplete specification just emit an error message (possibly Alex> saying something like 'use --help=Storage for more information'). I thought about this. I suppose you're right about the incomplete flags. It doesn't give you a way to ask about all option file sections either. 
I think it's best to leave --help/-h alone (no args) and have a pair of standard options used like --help-section=Storage --help-section=Storage:spam_cache --help-all-sections with obvious semantics. Or maybe you can glob things: --help-section=* # problematic - * can be special to shells or --help-section=all # relies on special "section" "all" Skip From kennypitt at hotmail.com Wed Nov 12 15:09:54 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Nov 12 15:13:54 2003 Subject: [spambayes-dev] proposal for more uniform option setting from thecommand line In-Reply-To: <16306.36828.190258.68060@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > > Kenny> Also, some command line flags can set several related option > Kenny> values to the correct combination (e.g. set both the database > Kenny> filename and type with one flag), and the new syntax would > Kenny> require knowing the correct combination and providing all the > Kenny> correct values. > > I think that's more confusing than it ought to be. Having -d and -D > simultaneously set two options seems Bad example . Should have known from past experience that those were the ones you're gunning for. > > >> This could be extended further. Should the user give an incomplete > >> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about > >> that section or variable could be emitted: > > Kenny> What about options that have no effect on the application being > Kenny> run? > > I hadn't considered that. > > Kenny> Would it be possible to detect them and show help in that case > Kenny> also? > > I suppose so, but the application would then have to register all the > options it's interested in. How would the application author know > what all the storage options were without diving into storage.py and > friends? Good point. There are quite a few layers to most operations, and digging up an exhaustive list of what is actually used for a particular case would be extremely difficult. 
> > Kenny> How would we present a list of useful options to the end user > Kenny> without overwhelming them with rarely changed settings and gory > Kenny> internal details? > > Experiment, I suppose. > > It appears the majority of users will use the Outlook plugin for > which this doesn't apply. I suspect I'm appealing more to the > propeller heads among us. If that is the intended audience then all of my comments above are pretty much moot. As I said initially, I'm all for it from the standpoint of testing, and the propeller heads don't need no stinkin' help, right? So, my final vote: +1 -- Kenny Pitt From mhammond at skippinet.com.au Wed Nov 12 16:39:16 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Nov 12 16:38:58 2003 Subject: [spambayes-dev] (no subject) In-Reply-To: Message-ID: <02f001c3a965$6c8432e0$0500a8c0@eden> You have not given us any indication of what problem you are seeing. Please see the "Troubleshooting Guide" that comes with SpamBayes, and create a new bug, being sure to upload the log file for your session. Regards, Mark -----Original Message----- From: spambayes-dev-bounces@python.org [mailto:spambayes-dev-bounces@python.org]On Behalf Of Kevin Patterson Sent: Thursday, 13 November 2003 4:28 AM To: spambayes-dev@python.org Subject: [spambayes-dev] (no subject) Will spambayes ever work in a Terminal server/Citrix enviroment? It works fine when only one instance of outlook is running. Do you know of anything I could do config wise on spambayes to fix this? Or on the server side. Keep up the great work. Thank you! Kevin Patterson -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031113/39085ab7/attachment.html From skip at pobox.com Wed Nov 12 17:13:34 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Nov 12 17:13:44 2003 Subject: [spambayes-dev] Re: proposal for more uniform option setting from the command line Message-ID: <16306.45198.286429.808432@montanaro.dyndns.org> me> How about instead of only allowing specific options to be overridden me> on the command line we use a consistent syntax for overriding *any* me> option from the command line? I took the first step in this direction, adding a set_from_cmdline method to OptionsClass.OptionsClass and modifying sb_filter.py to accept -o/--option flags. Seems to work for me. I decided to push as much processing into OptionsClass.py as possible (including error message display) to make it as easy as possible to process these options from other apps. In sb_filter.py, I had to add 'o:' and 'option=' to the appropriate getopt.getopt() arg and add elif opt in ('-o', '--option'): Options.options.set_from_cmdline(arg, sys.stderr) to the post-getopt() call processing. If you want to handle error recovery yourself, simply omit the sys.stderr arg from the call and wrap the call in the appropriate try/except incantation. Note that the way I have it set up, it displays the error but continues processing. I don't know if that's necessarily a good way to do it, but under the assumption that this is mostly for propeller head use, it seems okay to let processing continue and thus be able to potentially catch multiple errors. Skip From tp at diffenbach.org Thu Nov 13 23:47:06 2003 From: tp at diffenbach.org (TP Diffenbach) Date: Thu Nov 13 23:43:40 2003 Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin Message-ID: I'd like to extend the Spambayes Outlook plugin a bit. In the Spambayes Outlook Plugin, in which module are the header lines (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted? 
In which module is the spam percentage score added to the Outlook mail item? (Why I'm doing this: the headers aren't accessible in Outlook except via View|Options, or programmatically. I want an Outlook form that automatically displays the headers, but doing it in Visual Basic Script is problematic because of Outlook's security policies. Other work-arounds (using Redemption for Outlook or writing my own hook) are possible, but I'd prefer just to leverage Spambayes. So I'd like to add the headers as a user-defined field (ugh, duplication) are design a form that merely accesses that.) Thanks, Tom -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 1044 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031113/a2c53459/winmail.bin From tim.one at comcast.net Fri Nov 14 01:14:57 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Nov 14 01:14:59 2003 Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin In-Reply-To: Message-ID: [TP Diffenbach] > I'd like to extend the Spambayes Outlook plugin a bit. > > In the Spambayes Outlook Plugin, in which module are the header lines > (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted? The plugin doesn't use the CDO API -- it's too problematic across Outlook variations (e.g., it appears it's not even available in most IMO configurations, unless the user manually installs CDO from their Office CD). It uses low-level MAPI instead. Search the source for the MAPI PR_TRANSPORT_MESSAGE_HEADERS_A property and you'll soon find it. Be warned that raw MAPI can be extremely painful to work with (although it's a hell of a lot easier to work with from Python than from C!); OTOH, it's much faster than CDO too, and that's important to the plugin for high-volume users. > In which module is the spam percentage score added to the Outlook > mail item? 
The same module you'll find above, in the SetField method of class MAPIMsgStoreMsg (I'm resisting becoming a remote search button in your text editor ). > (Why I'm doing this: the headers aren't accessible in Outlook except > via View|Options, or programmatically. I want an Outlook form that > automatically displays the headers, but doing it in Visual Basic > Script is problematic because of Outlook's security policies. Using MAPI directly appears to sidestep most Outlook whining. At least so far. -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 1040 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031114/3e30d968/winmail.bin From kennypitt at hotmail.com Fri Nov 14 09:08:29 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Nov 14 09:08:55 2003 Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin In-Reply-To: Message-ID: TP Diffenbach wrote: > I'd like to extend the Spambayes Outlook plugin a bit. > > In the Spambayes Outlook Plugin, in which module are the header lines > (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted? > > In which module is the spam percentage score added to the Outlook > mail item? > > (Why I'm doing this: the headers aren't accessible in Outlook except > via View|Options, or programmatically. I want an Outlook form that > automatically displays the headers, but doing it in Visual Basic > Script is problematic because of Outlook's security policies. Other > work-arounds (using Redemption for Outlook or writing my own hook) > are possible, but I'd prefer just to leverage Spambayes. So I'd like > to add the headers as a user-defined field (ugh, duplication) are > design a form that merely accesses that.) If you do "Show spam clues for current message" and scroll past the Significant Tokens section to the Message Stream section, it shows the raw content of the e-mail including all headers. 
Is that the kind of information you're trying to get access to? -- Kenny Pitt From tim.one at comcast.net Fri Nov 14 20:02:45 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Nov 14 20:02:53 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: Message-ID: Jeremy (Hylton) sent me some work-related email today, the output from running a statistics-gathering program over a ZODB database. We both wondered why I hadn't gotten the message, and I eventually discovered that it was actually in my Spam folder, and at "the wrong end" to boot (the view on my Spam folder is sorted by spam score). It had an internal ham score of exactly 0 and an internal spam score of exactly 1. So I trained on it as ham, and the next time he sent a similar report, things were reversed: the new one got ham=1 and spam=0. So what unforgivable sin had he committed in the first email? Heh. It had virtually no English text, but lots, and lots, and lots of different integers (about 100KB worth). There were about a half dozen strong ham clues that it had come from him, but about 140 spam clues from the variety of little integers, most hapaxes that had appeared in one training spam each. I view that mostly as a danger of mistake-based training: as I've mentioned before, mistake-based training tends toward being hapax-driven, and hapaxes are brittle. There's nothing *inherently* spammy about, say, 16384, and because that's a power of 2 and I'm a computer geek, that *would* have appeared in several training ham if I hadn't fallen into mistake-based training (yes, 16384 had indeed appeared in one training spam). So it's a cute one. I have to note that it argues in favor of a whitelist gimmick too -- although that wouldn't have done me any good since I never would have anticipated that anything Jeremy sent would get scored as spam. 
Even if I had anticipated it, I don't remember all the email accounts he uses, and probably wouldn't have thought to whitelist the account he used to send this one. So if any spammers are reading this, here's how to get by my mistake-based filter now: add scads of random little integers to your spam. If the rest of your spam is brief enough, it will get a spam score of 0, because now my database has even more little integer hapaxes in the *ham* direction. amusedly y'rs - tim From rob at hooft.net Sat Nov 15 04:27:57 2003 From: rob at hooft.net (Rob Hooft) Date: Sat Nov 15 04:31:03 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: Message-ID: <3FB5F19D.5070506@hooft.net> Tim Peters wrote: > I view that mostly as a danger of mistake-based training: as I've mentioned > before, mistake-based training tends toward being hapax-driven, and hapaxes > are brittle. There's nothing *inherently* spammy about, say, 16384, and > because that's a power of 2 and I'm a computer geek, that *would* have > appeared in several training ham if I hadn't fallen into mistake-based > training (yes, 16384 had indeed appeared in one training spam). I am now training on all mistakes and unsures, plus all ham scoring more than 0.02 and all spam scoring less than 0.99. Total trained messages is ~250 both ways, and 97+ of spam scores 0.99+ leaving only 1-2 new spams per day, less than 1 unsure per day, and ~1 new ham per day to train on. I am really pleased by the performance of this training schedule. It is not as brittle as mistake-based training, but it still ignores the obvious repeating things like CVS log messages of which I receive a few dozen per day. It keeps the database reasonably small, but not really hapax driven. Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tp at diffenbach.org Sat Nov 15 06:33:07 2003 From: tp at diffenbach.org (TP Diffenbach) Date: Sat Nov 15 06:29:27 2003 Subject: [spambayes-dev] test -- can I use an arbitrary from address? Message-ID: Can I mail to the list from my real address, or only from the address I signed up with? -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 1048 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031115/cf90e3de/winmail.bin From richie at entrian.com Sat Nov 15 11:06:52 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Nov 15 11:07:22 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: Message-ID: <6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com> [Tim] > There were about a half dozen strong ham > clues that it had come from him, but about 140 spam clues from the variety > of little integers, most hapaxes that had appeared in one training spam > each. Perhaps it's an argument for not classifying using hapaxes? Wait for any given clue to appear in more than one message before it becomes valid for classification. Has anyone tried this? (And not just for SpamBayes - Bill?) It could well have helped with the similar spectacular false positive that I reported a few weeks ago - that was from a colleague as well, and consisted of a list of US state codes and state names. Many of those were spam hapaxes.
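As a rough sketch of that idea (this is not how the SpamBayes classifier is actually written), assume the training counts are available as a dict mapping each token to a (spam_count, ham_count) pair. A token would then be ignored until it has been seen in at least two training messages. The names usable_clues and min_messages are hypothetical:

```python
def usable_clues(tokens, counts, min_messages=2):
    """Return the message's tokens that appear in enough training messages.

    counts maps token -> (spam_count, ham_count); tokens missing from it
    have never been trained on and are always skipped.
    """
    usable = []
    for tok in set(tokens):
        spam_count, ham_count = counts.get(tok, (0, 0))
        if spam_count + ham_count >= min_messages:
            usable.append(tok)
    return sorted(usable)
```

With min_messages=2 a token like "16384" that occurred in a single training spam contributes nothing, while tokens seen repeatedly still count. Whether that costs more accuracy than it saves is exactly the open question.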
-- Richie Hindle richie@entrian.com From rob at hooft.net Sat Nov 15 11:26:24 2003 From: rob at hooft.net (Rob Hooft) Date: Sat Nov 15 11:29:31 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com> References: <6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com> Message-ID: <3FB653B0.3080507@hooft.net> Richie Hindle wrote: > [Tim] > >>There were about a half dozen strong ham >>clues that it had come from him, but about 140 spam clues from the variety >>of little integers, most hapaxes that had appeared in one training spam >>each. > > > Perhaps it's argument for not classifying using hapaxes? Wait for any > given clue to appear in more than one message before it becomes valid for > classification. Has anyone tried this? (And not just for SpamBayes - > Bill?) ? h?v? n?t tr??d ?t, b?t ? ?m q??t? s?r? ?t w??ld p?rf?rm w?rs?! Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie at entrian.com Sat Nov 15 11:30:22 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Nov 15 11:30:44 2003 Subject: [spambayes-dev] Whitelists (was: A spectacular false positive) In-Reply-To: References: Message-ID: [Tim] > I have to note that it argues in favor of a whitelist > gimmick too -- although that wouldn't have done me any good since I never > would have anticipated that anything Jeremy sent would get scored as spam. > Even if I had anticipated it, I don't remember all the email accounts he > uses, and probably wouldn't have thought to whitelist the account he used to > send this one. I've been thinking about whitelists, and the more I think about them the more I'm in favour of them. We can do things with a built-in SpamBayes whitelist that you just can't do with standard email client filters - things that I think would address your objections, Tim. 
All these rules would be optional, and possibly behind another rule that says "An address must qualify N times before this happens": o Whenever a message is trained as ham, add the From address to the whitelist. o Whenever a message is trained as spam, remove the From address from the whitelist. o Whenever a message is received from a whitelisted address, and scores as solid (for some value of 'solid') ham, auto-train the message as ham. You'd use this for personal acquaintances only, and not for mailing lists or organisations (amazon.com, ebay.com, etc.) Add a couple of other features: o Give it an mbox file (or Outlook folder, etc.) and it adds all the addresses to the whitelist. o Support wildcard patterns in the whitelist, eg. *@myemployer.com and I think you have something that would be mostly automated. You wouldn't need to dig out all your acquaintances' addresses and add them by hand, because the act of training would catch many of them. The ability to add all the addresses in a folder would catch most of the rest (for anyone that keeps a good deal of old email around, which I suspect is most people, especially in a working environment). The upshot: I still don't trust SpamBayes to delete my Spam without looking at it. This feature would mean I *would* trust it, because I could be sure that when one of my friends or colleagues sends me a spammy message (cf. the list of US state names I received a while ago) it doesn't get classified as spam. I'm prepared to take the risk of forged From addresses because the time spent weeding out those will be far less than the time I currently take glancing down my entire list of ~150 spams per day. I'm prepared to take the risk that the first ever email a friend sends me gets deleted as spam (very unlikely).
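The training rules and wildcard patterns listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Whitelist class and its method names are hypothetical, nothing like it exists in SpamBayes yet, and the "qualify N times" and auto-train rules are left out. It uses the standard library's email.utils.parseaddr to pull the address out of a From header and fnmatch for the wildcard patterns:

```python
from email.utils import parseaddr
from fnmatch import fnmatchcase


class Whitelist:
    """Hypothetical whitelist of exact addresses plus wildcard patterns."""

    def __init__(self, patterns=()):
        self.addresses = set()
        # Patterns are lowercased once so matching is case-insensitive.
        self.patterns = [p.lower() for p in patterns]

    @staticmethod
    def _address(from_header):
        # parseaddr returns (realname, address); keep only the address.
        return parseaddr(from_header)[1].lower()

    def trained_as_ham(self, from_header):
        """Training a message as ham adds its From address."""
        addr = self._address(from_header)
        if addr:
            self.addresses.add(addr)

    def trained_as_spam(self, from_header):
        """Training a message as spam removes its From address."""
        self.addresses.discard(self._address(from_header))

    def matches(self, from_header):
        """True if the From address is listed or matches a wildcard pattern."""
        addr = self._address(from_header)
        if not addr:
            return False
        return addr in self.addresses or any(
            fnmatchcase(addr, p) for p in self.patterns
        )
```

A filter would call matches() on the From header before scoring and skip the classifier on a hit. Note that a wildcard pattern keeps matching even after a message from that domain is trained as spam; whether that is desirable is a policy question this sketch does not settle.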
I keep all my old mail, sorted into ham and spam, so generating my whitelist will be easy (and even if you don't keep all your old mail, generating a training-based whitelist for frequent correspondents, or adding wildcard patterns for all work addresses, would be easy). Other features we'd need: o Manual editing in web interface / an Outlook dialog - just a newline-separated list of names or wildcard patterns. o Import / export of whitelists as plain text files (choice of merge or replace on import) Classification would just override whatever the classifier said, adding "X-Spambayes-Classification: ham". If you ask for evidence, you get "X-Spambayes-Evidence: Whitelist rule '' matches 'From:
'". Questions: o How to get the actual address from a To/From header - the address would need separating from the real name and any quoting. o Which headers to use? Probably just From to keep it simple; maybe Reply-To as well. o Should there be a blacklist as well, for symmetry? Probably not - a whitelist is far more useful. A blacklist would only be useful if you were getting persistent false negatives from the same address despite repeated training - if that's happening then something's broken 8-) o Where to store the whitelist - it could get big, so bayescustomize.ini might not be the place. Ongoing problems with DBRunRecovery errors put me off putting it in the clues database. -- Richie Hindle richie@entrian.com From richie at entrian.com Sat Nov 15 11:37:37 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Nov 15 11:37:58 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <3FB653B0.3080507@hooft.net> References: <6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com> <3FB653B0.3080507@hooft.net> Message-ID: [Richie] > Perhaps it's argument for not classifying using hapaxes? Wait for any > given clue to appear in more than one message before it becomes valid for > classification. Has anyone tried this? (And not just for SpamBayes - > Bill?) [Rob] > ? h?v? n?t tr??d ?t, b?t ? ?m q??t? s?r? ?t w??ld p?rf?rm w?rs?! 8-) I'm sure it would perform worse in the short term, but as the size of the training set increased, I think the performance would pretty much catch up while the chance of false positives would remain significantly smaller. (I speak with the conviction of someone with no evidence and negligible mathematical ability...) 
-- Richie Hindle richie@entrian.com From richie at entrian.com Sat Nov 15 11:57:27 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Nov 15 11:57:58 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: References: Message-ID: Pablo, [Moving this thread to spambayes-dev@python.org] > All right, third try, then I'll quit :-) Our apologies - we don't mean to ignore people who offer to help - far from it! > I just want to ask if there's any interest > at all in having Spambayes available in Spanish or not. We'd love to have international versions, though there are a lot of issues involved. I don't mean to put you off the idea, or to imply that we're not prepared to put effort into this, but these things need taking into account... Many (most?) of the English strings in SpamBayes are mixed in with the code. Taking the source code as it is and translating the strings into Spanish would be unmaintainable - we'd have two entirely separate versions of the code, and any edits would have to be applied to both. So the first job to do would be to pull out all those hard-coded strings into a language file. That's not a huge job, and one that any computer-literate person could probably do 95% of, even if they weren't a programmer. Still more effort than simply translating a collection of English phrases into Spanish, though. More English text appears in HTML pages. Some of these are mostly text, like the Outlook help pages, and maintaining two versions would not be too bad (though any stylistic changes might have to be applied twice). Some, however, are quite strictly defined in ways that make them machine-readable - the web interface (as used by the POP3 proxy and IMAP filter) defines its user interface in little pieces of HTML that are joined together by the program at runtime. Translating that stuff would need more technical knowledge, and probably a significant amount of re-engineering to make it maintainable.
A lot of the Outlook interface (at least the dialogs) is defined in Windows resource files, which require Visual Studio to edit them (or there may be third-party programs to do it - any free ones that people know of?). You can edit them by hand but it's a huge pain. They also contain a lot of information that's not just language strings, meaning there's a lot of duplication between the different language versions that causes maintenance headaches. I'd be interested to hear from people out there in the world with solutions to this problem!

Lastly, a social issue. You could become the support department for hundreds of Spanish SpamBayes users!

So. Do you still want to do this? 8-) And are there SpamBayes developers (or other Python-literate SpamBayes users - Pablo, you don't say whether you're a Python programmer?) who have the time to make the necessary software changes for this to work?

There may be other, cheaper, alternatives to doing the work that I haven't considered (for instance, translating all the strings in place and then maintaining the edits as a set of patches that get applied to each release - are there Open Source projects that work that way?)

I may be painting an unnecessarily bleak picture here, because I'm no expert on i18n. I'd love for someone to come in and say "No, you've got it all wrong, it's easy, look!" Anyone?

--
Richie Hindle
richie@entrian.com

From skip at pobox.com Sat Nov 15 13:56:19 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Nov 15 13:56:43 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB5F19D.5070506@hooft.net>
References: <3FB5F19D.5070506@hooft.net>
Message-ID: <16310.30419.316274.137644@montanaro.dyndns.org>

    Rob> I am now training on all mistakes and unsures, plus all ham scoring
    Rob> more than 0.02 and all spam scoring less than 0.99.

I used to use that sort of scheme as well, but it gets tedious after a while and just grows my training database.
The problem was that most ham scored 0.0 and after concluding a message was ham I let procmail toss it in the proper mailbox. This meant that the few hams which didn't score 0.0 were scattered all over the place, so I had to constantly be on the lookout for them. I suppose I could have added a copy rule to my procmailrc file to save all non-zero ham, but that would have just been another mailbox to look at. I already have unsure, lospam and hispam. That would add hiham.

Also, when you get two of essentially the same spam, do you train on both? I'm trying to be careful now to minimize that sort of duplication. I have so many email addresses feeding into skip@mojam.com that I generally get multiples of everything.

Finally, I also gave up on training on low-scoring spams. If it's spam and not a mistake, it's good enough for me.

At the moment I have a training database of 133 spams and 111 hams.

Skip

From matt at mondoinfo.com Sat Nov 15 14:20:16 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat Nov 15 14:20:21 2003
Subject: [spambayes-dev] Whitelists (was: A spectacular false positive)
In-Reply-To:
References:
Message-ID: <1068923279.1.1879@mint-julep.mondoinfo.com>

> Questions:
>
> o How to get the actual address from a To/From header - the address
> would need separating from the real name and any quoting.

That one's pretty easy:

>>> import email
>>> import email.Utils
>>> f=open("test")
>>> m=email.message_from_file(f)
>>> email.Utils.parseaddr(m.get("From",""))[1]
'richie@entrian.com'

Regards,
Matt

From listas at loquecreas.com Sat Nov 15 15:02:13 2003
From: listas at loquecreas.com (Pablo Vieira)
Date: Sat Nov 15 15:02:43 2003
Subject: [spambayes-dev] RE: [Spambayes] RV: I18N and L10N
In-Reply-To:
Message-ID:

Wow, looks challenging, but very interesting! I'm not a Python programmer but I'm a C programmer and learning Java right now. I'll join you guys at the developers list. I might have some suggestions. Thanks for answering, finally!
;-)

Pablo

From tim.one at comcast.net Sat Nov 15 16:42:47 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 16:42:43 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB5F19D.5070506@hooft.net>
Message-ID:

[Rob Hooft]
> I am now training on all mistakes and unsures, plus all ham scoring
> more than 0.02 and all spam scoring less than 0.99.
Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to match? Then you can describe the same thing as just "mistakes and unsures" (which is what I mean by "mistake-based training"). > Total trained messages is ~250 both ways, and 97+ of spam scores 0.99+ > leaving only 1-2 new spams per day, less than 1 unsure per day, and > ~1 new ham per day to train on. > > I am really pleased by the performance of this training schedule. It > is not as brittle as mistake-based training, but it still ignores the > obvious repeating things like CVS log messages of which I receive a > few dozen per day. It keeps the database reasonably small, but not > really hapax driven. Sigh -- we need solid research on training disciplines that work great in real-life use, respecting that anything requiring human input will barely get used except by geeks who never tire of watching the training process. We're getting a lot of anecdotal evidence (which ain't the same thing) about different schemes, and I'm afraid no two of the developers train in the same way anymore. It's a good thing the algorithm appears to have turned out to be robust against almost any training insanity short of what Outlook users can stumble into <0.9 wink>. Oh well. In the meantime, I think your msg would be a great addition to Richie's spambayes wiki. I know *you* know where that is, because a coworker found your http://www.entrian.com/sbwiki/RobsSetup there yesterday, and it was exactly what he needed to set up our code with his maildir-based system. From tim.one at comcast.net Sat Nov 15 17:02:58 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Nov 15 17:02:53 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: Message-ID: [Richie Hindle] >>> Perhaps it's argument for not classifying using hapaxes? Wait for >>> any given clue to appear in more than one message before it becomes >>> valid for classification. Has anyone tried this? (And not just for >>> SpamBayes - Bill?) 
[Rob Hooft]
>> I have not tried it, but I am quite sure it would perform worse!

[Richie]
> 8-)
>
> I'm sure it would perform worse in the short term, but as the size of
> the training set increased, I think the performance would pretty much
> catch up while the chance of false positives would remain
> significantly smaller. (I speak with the conviction of someone with
> no evidence and negligible mathematical ability...)

Graham's original scheme ignored tokens that hadn't appeared at least 5 times in training data. Some of the very earliest experiments played with that, moving the cutoff both higher and lower. The evidence was very clear (not like the noise-level results most recent experiments have shown -- this was "0 lost 1 tied 9 won" territory) that a cutoff of 0 worked best.

Part of the "reason" is surely that *every* token first *enters* the database as a hapax. When new kinds of fuzzy ham and spam appear, one example often introduces enough hapaxes so that the next instance of the same kind of thing is nailed to the correct category just from scoring the hapaxes in it. I noticed this dramatically during the last major round of worm spew, where I was getting about 1,000 worm-related turds each day. Like Skip suggested recently, I only trained on one at a time, and then rescored the morning's unsures. Training on 6 total examples turned out to be enough that I never had to train on another -- and "that worked" almost purely by capturing different hapaxes unique to about 6 different variations of the worm spew I was getting.

So hapaxes are (I believe) really the heart of what lets lazy, minimal mistake-based training work as well as it does. It will always be brittle, though.

A scheme I would like to try can't be tried easily anymore because we removed some of the info it needs from our database: ignore hapaxes that haven't been *used* in scoring over the last (say) week.
Spam especially seems to come in spurts, where I might get 100 copies in a few days of a spam containing "16384". That hapax is very valuable in nailing minor variations of that spam until that spam campaign ends; but after that point, I probably never use it to score a spam again, yet it stays in the database forever. If it stays there long enough, Jeremy is eventually going to use it too . Especially since more & more of us are inclining toward using tiny databases (compared to what we used to do), making space for a "last used" timestamp may not be nearly as scary as it used to be. From tp at diffenbach.org Sat Nov 15 17:27:10 2003 From: tp at diffenbach.org (TP Diffenbach) Date: Sat Nov 15 17:23:31 2003 Subject: FW: [spambayes-dev] Code locations in Spambayes Outlook plugin Message-ID: Tim, thanks for your help. Knowing what to grep on make this a one line code change, and running "python addin.py" painlessly installed it. Thanks too to Kenny, for your response. A bit of mucking about with the Outlook forms (and ignoring an Outlook popped-up suggestion after a bit of Googling), made it work, and now I can see the headers without having to delve into Outlook's menus, and I can do it without code in the form, which disables Outlook's auto-preview (so much of using Outlook seems to involve working around stupid design decisions in Outlook). BTW, I'm signed up to the spambayes-dev list under a different email than I use to post to the list; will this cause any problems? Thanks, Tom -----Original Message----- [TP Diffenbach] > I'd like to extend the Spambayes Outlook plugin a bit. > > In the Spambayes Outlook Plugin, in which module are the header lines > (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted? The plugin doesn't use the CDO API -- it's too problematic across Outlook variations (e.g., it appears it's not even available in most IMO configurations, unless the user manually installs CDO from their Office CD). It uses low-level MAPI instead. 
Search the source for the MAPI PR_TRANSPORT_MESSAGE_HEADERS_A property and you'll soon find it. Be warned that raw MAPI can be extremely painful to work with (although it's a hell of a lot easier to work with from Python than from C!); OTOH, it's much faster than CDO too, and that's important to the plugin for high-volume users.

> In which module is the spam percentage score added to the Outlook
> mail item?

The same module you'll find above, in the SetField method of class MAPIMsgStoreMsg (I'm resisting becoming a remote search button in your text editor ).

From rob at hooft.net Sat Nov 15 18:09:13 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 18:12:18 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <16310.30419.316274.137644@montanaro.dyndns.org>
References: <3FB5F19D.5070506@hooft.net> <16310.30419.316274.137644@montanaro.dyndns.org>
Message-ID: <3FB6B219.6090706@hooft.net>

Skip Montanaro wrote:
> Rob> I am now training on all mistakes and unsures, plus all ham scoring
> Rob> more than 0.02 and all spam scoring less than 0.99.
>
> I used to use that sort of scheme as well, but it gets tedious after awhile
> and just grows my training database.

[...]

> Also, when you get two of essentially the same spam, do you train on both?
> I'm trying to be careful now to minimize that sort of duplication. I have
> so many email addresses feeding into skip@mojam.com that I generally get
> multiples of everything.

I do not get a lot of true duplicates, definitely not in the non-obvious spam. This is my .procmailrc; it indeed has the copy-rule you mention.
LOGFILE=/home/h/hooft/procmail.log

:0 fw:hamlock
| /home/h/hooft/bin/sb_filter.py

# Messages that are so obviously spam that we should not train on them
:0
* ^X-SpamBayes-Classification: spam; 1.00
.ztrain.obvious-spam/

# Messages that are spam but we might want to train on them
:0
* ^X-SpamBayes-Classification: spam
.ztrain.spam/

# Unsure messages must be copied to the unsure folder for training
:0 c
* ^X-SpamBayes-Classification: unsure
.ztrain.unsure/

# Ham that doesn't score 0.00 is eligible for training as well
:0 c
* ^X-SpamBayes-Classification: ham; 0.0[2-9]
.ztrain.ham/

:0 c
* ^X-SpamBayes-Classification: ham; 0.1[0-9]
.ztrain.ham/

##
##
## Split into folders
##
##
:0
* ^List-Id:.*python-announce-list
.python.Announce/

## Etc.

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob at hooft.net Sat Nov 15 18:15:05 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 18:18:08 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To:
References:
Message-ID: <3FB6B379.3070304@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
>
>>I am now training on all mistakes and unsures, plus all ham scoring
>>more than 0.02 and all spam scoring less than 0.99.
>
>
> Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to match?
> Then you can describe the same thing as just "mistakes and unsures" (which
> is what I mean by "mistake-based training").

Because I still "never look" at anything that scores over 0.90. They are all spam. But the spammiest of those, the ones over 0.995, are not even used for training. At the ham-side you're right: it is the same.

Rob

-- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one at comcast.net Sat Nov 15 18:37:01 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 18:37:02 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB6B379.3070304@hooft.net>
Message-ID:

[Rob Hooft]
>>> I am now training on all mistakes and unsures, plus all ham scoring
>>> more than 0.02 and all spam scoring less than 0.99.

[Tim]
>> Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to
>> match? Then you can describe the same thing as just "mistakes and
>> unsures" (which is what I mean by "mistake-based training").

[Rob]
> Because I still "never look" at anything that scores over 0.90. They
> are all spam.

I don't understand. Suppose a message scores 0.93. 0.93 > 0.90, so by what you just said you never look at it. But 0.93 < 0.99, so by what you first said you *do* train on it. Is it possible to simultaneously both train on a thing and never look at it? I guess I don't know what "never look" means. You mean you don't use your eyeballs to physically look at the 0.93 message, but let spambayes auto-train on its own "it's spam" decision then? That would be consistent with all that you said, so I'm assuming now that's the intended meaning.

From popiel at wolfskeep.com Sat Nov 15 18:42:51 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Nov 15 18:42:54 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: Message from "Tim Peters" of "Sat, 15 Nov 2003 17:02:58 EST."
References:
Message-ID: <20031115234251.228272DF6A@cashew.wolfskeep.com>

In message: "Tim Peters" writes:
>
>Especially since more & more of us are inclining toward using tiny databases
>(compared to what we used to do), making space for a "last used" timestamp
>may not be nearly as scary as it used to be.

This is something that I don't understand... why do we care if the database is huge?
With 100 gigabyte drives commonplace, why are we quibbling over 20 or 40 megabytes? - Alex From rob at hooft.net Sat Nov 15 18:54:01 2003 From: rob at hooft.net (Rob Hooft) Date: Sat Nov 15 18:57:01 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: Message-ID: <3FB6BC99.3080208@hooft.net> Tim Peters wrote: > [Rob Hooft] > >>>>I am now training on all mistakes and unsures, plus all ham scoring >>>>more than 0.02 and all spam scoring less than 0.99. > > > [Tim] > >>>Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to >>>match? Then you can describe the same thing as just "mistakes and >>>unsures" (which is what I mean by "mistake-based training"). > > > [Rob] > >>Because I still "never look" at anything that scores over 0.90. They >>are all spam. > > > I don't understand. Suppose a message scores 0.93. 0.93 > 0.90, so by what > you just said you never look at it. But 0.93 < 0.99, so by what you first > said you *do* train on it. Is it possible to simulataneously both train on > a thing and never look at it? I guess I don't know what "never look" means. > You mean you don't use your eyeballs to physically look at the 0.93 message, > but let spambayes auto-train on its own "it's spam" decision then? That > would be consistent with all that you said, so I'm assuming now that's the > intended meaning. Exactly. I am assuming that the 0.93 message has some "old-fashioned" spammy characteristics, but the spammer is looking at new techniques to disguise his messages in the future. He is just not radical enough to get into my unsure box. My automatic training on these messages now makes sure that this new trick will be useless in the future. Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob at hooft.net Sat Nov 15 18:58:05 2003 From: rob at hooft.net (Rob Hooft) Date: Sat Nov 15 19:01:09 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <20031115234251.228272DF6A@cashew.wolfskeep.com> References: <20031115234251.228272DF6A@cashew.wolfskeep.com> Message-ID: <3FB6BD8D.8040902@hooft.net> T. Alexander Popiel wrote: > In message: > "Tim Peters" writes: > >>Especially since more & more of us are inclining toward using tiny databases >>(compared to what we used to do), making space for a "last used" timestamp >>may not be nearly as scary as it used to be. > > > This is something that I don't understand... why do we care if the > database is huge? With 100 gigabyte drives commonplace, why are > we quibbling over 20 or 40 megabytes? My database is on an ISP's server with 100,000 clients and limited disk quota? But then again: % ll .hammiedb -rw-rw-r-- 1 hooft hooft 1277952 Nov 14 07:35 .hammiedb Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie at entrian.com Sat Nov 15 19:31:29 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Nov 15 19:31:51 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: Message-ID: [Tim] > A scheme I would like to try can't be tried easily anymore because we > removed some of the info it needs from our database: ignore hapaxes that > haven't been *used* in scoring over the last (say) week. *That* I like. Best of both worlds. -- Richie Hindle richie@entrian.com From tim.one at comcast.net Sat Nov 15 22:04:42 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Nov 15 22:04:49 2003 Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin In-Reply-To: Message-ID: [TP Diffenbach] > ... 
> A bit of mucking about with the Outlook forms (and ignoring an Outlook > popped-up suggestion after a bit of Googling), made it work, and now > I can see the headers without having to delve into Outlook's menus, > and I can do it without code in the form, which disables Outlook's > auto-preview (so much of using Outlook seems to involve working > around stupid design decisions in Outlook). Also the lack of a full object model (e.g., the convolutions we endure to deal with the toolbar, to play with the rule system sanely, and the inability to automate setting up spam-score columns in views, are pretty dreadful; we won't mention the maddening "new mail" systray icon). It's a pain. > BTW, I'm signed up to the spambayes-dev list under a different email > than I use to post to the list; will this cause any problems? For whom? It's OK by me -- it's not a restricted list (anyone can post here, subscribed or not; and anyone can read the list, via its archive on the web). -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 1052 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031115/4d6ff582/winmail.bin From tim.one at comcast.net Sat Nov 15 22:34:59 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Nov 15 22:35:04 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <20031115234251.228272DF6A@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > This is something that I don't understand... why do we care if the > database is huge? With 100 gigabyte drives commonplace, why are > we quibbling over 20 or 40 megabytes? I expect large drives are still rare among consumers, and this has become a "mass market" application. 
It wouldn't be *just* the database size, of course -- keeping "last access" up to date also requires caching token timestamps in memory, and most significantly updating the DB on disk after scoring (we never have to write to disk after scoring now, only after training). So there are many costs. I'd feel a lot better about it if Berkeley DB were a lot faster on Windows, and wasn't still implicated in so many maddeningly baffling database corruption reports. From tim.one at comcast.net Sat Nov 15 23:49:34 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Nov 15 23:49:47 2003 Subject: [spambayes-dev] RE: [Spambayes] RV: I18N and L10N In-Reply-To: Message-ID: [Pablo Vieira] > ... > Since you guys state very clearly that no one should email the > developers directly I'm putting this here. I just want to ask if > there's any interest at all in having Spambayes available in Spanish > or not. I expect that, like most of the rest of the developers, I think that would be great, but don't have relevant experience, or time, to give to it. If I were you , I'd announce my *intent* on the newsgroup comp.lang.python, to attract the interest of Python programmers with real-life I/L*N experience. There are people who know pretty much exactly what to do, but they don't hang out on this list, and this has much more to do with using Python's relevant features (like Unicode) correctly than with SpamBayes specifically. You might have luck asking on a Zope mailing list too (Zope is a popular web content management system coded in Python, with users all over the world, and within the last couple years has benefited by many peoples' intense help with I/L*N). 
A problem I know came up repeatedly in the Zope experience: a 100% commitment to Unicode can make life much easier, but old-time Python programmers have to be dragged kicking and screaming to Unicode ("it's inefficient", "it's wasteful", "it's too hard", ..., all the kinds of things old people say when they're too cranky to learn new tricks ). You'll have my support in fighting that battle, but not really much of my help -- because I'm one of the old farts who still hasn't learned anything about how to live in a Unicode world. Asians are likely to complain about Unicode too, but adapting SpamBayes to Asian languages has many deep problems that European languages shouldn't face (spambayes splits the body into tokens by whitespace, and that's it -- it deliberately didn't assume 7-bit ASCII English). I'm not sure whether the Python email package plays nicely with Unicode. That could be a real problem at the starting gate, if not. From tim.one at comcast.net Sun Nov 16 02:23:55 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Nov 16 02:23:59 2003 Subject: [spambayes-dev] Native Outlook 2003 spam filtering Message-ID: Some geeks with too much time on their hands reverse-engineered huge parts of OL2003's secret spam gimmicks, and wrote a detailed account: http://www.mapilab.com/articles/outlook_spam_filter.html As always, the first release of a thing from MS is so bizarre that competitors are lulled into laughing MS off. For example, the dictionary of words and word weights is fixed: it doesn't learn, and it's the same for all users. So, if you're a spammer, you just mail your spam to your own OL2K+3, and fiddle it until the filter likes it. Then all OL3K-997 installations will like it. If anyone knows Bill Gates, please tell him he's welcome to use our code for free . 
From jm at jmason.org Sun Nov 16 03:05:37 2003
From: jm at jmason.org (Justin Mason)
Date: Sun Nov 16 03:05:49 2003
Subject: [spambayes-dev] Native Outlook 2003 spam filtering
In-Reply-To:
Message-ID: <20031116080538.7F4D416EFD@jmason.org>

Tim Peters writes:
> Some geeks with too much time on their hands reverse-engineered huge parts
> of OL2003's secret spam gimmicks, and wrote a detailed account:
>
> http://www.mapilab.com/articles/outlook_spam_filter.html
>
> As always, the first release of a thing from MS is so bizarre that
> competitors are lulled into laughing MS off. For example, the dictionary of
> words and word weights is fixed: it doesn't learn, and it's the same for
> all users. So, if you're a spammer, you just mail your spam to your own
> OL2K+3, and fiddle it until the filter likes it. Then all OL3K-997
> installations will like it.

Bizarre! *Great* article, though. Thanks for the pointer!

(PS: I like link 8 on Appendix B ;)

--j.

From papaDoc at videotron.ca Sun Nov 16 14:44:53 2003
From: papaDoc at videotron.ca (Remi Ricard)
Date: Sun Nov 16 14:44:03 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To:
References:
Message-ID: <1069011891.3384.8.camel@porsche.hq.simlog.com>

Hi,

> All right, third try, then I'll quit :-),

Please don't quit,

> I just want to ask if there's any interest
> at all in having Spambayes available in Spanish or not.

I might be interested in a French version, but I seem interested in a Spanish version.

What you should ask is: "Is it easy to translate the English string to whatever." If the answer is yes then "You" can do it and provide a patch.
Remi

--
Remi Ricard

From tim.one at comcast.net Sun Nov 16 15:44:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Nov 16 15:44:37 2003
Subject: [spambayes-dev] Native Outlook 2003 spam filtering
In-Reply-To: <20031116080538.7F4D416EFD@jmason.org>
Message-ID:

[Justin Mason, on http://www.mapilab.com/articles/outlook_spam_filter.html ]
> Bizarre! *Great* article, though. Thanks for the pointer!

Ya, it reminds me of the summer month I spent disassembling the Radio Shack Model 100's ROM, to track down a suspected bug in its BASIC interpreter. The horrible fascination of it all keeps you going long after you find the answer you were looking for.

> (PS: I like link 8 on Appendix B ;)

I didn't mention that on purpose -- I'm not sure the spambayes developers could stand to see SpamBayes subjected to such no-holds-barred rigorous criticism .

From sanjaydarisi at cox.net Sun Nov 16 18:43:03 2003
From: sanjaydarisi at cox.net (sanjaydarisi@cox.net)
Date: Sun Nov 16 18:43:12 2003
Subject: [spambayes-dev] Quick questions!
Message-ID: <20031116234302.BADI9968.fed1mtao05.cox.net@smtp.west.cox.net>

I have three quick questions regarding the spambayes Outlook addin.

Firstly, isn't it possible to add the spam field directly to the Outlook view, instead of the user adding it manually from the user-defined fields?

Secondly, if I want to add the "delete as spam"/"recover as ham" buttons to the message view that is displayed when an email message is double-clicked in Outlook, how do I do that? Any ideas?

Thirdly, if I want to add a personal signature regarding spambayes after I install it for my Outlook, how do I do that? Any suggestions?

Thanks in advance,
Sanjay.

From tameyer at ihug.co.nz Sun Nov 16 21:43:52 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 21:41:21 2003
Subject: [spambayes-dev] Who can explain this?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130407C93C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29CE@its-xchg4.massey.ac.nz> > Racking my brain trying to figure out just what persistent > storage file I was using (because sometimes it seemed to use > ~/hammie.db and sometimes ~/.hammiedb), I came across this in > sb_filter.py: [...] > Can we just rip this hack out and let the user's options file > dictate things? This is my bad from some time ago. +1 to getting rid of it. I'm not sure that 'hammie.db' is the best default for the option, though. For one it can lead to lots of hammie.db files being created, depending on what the cwd is at the time spambayes is run. For another, it's nice to default to having the personal files separated out from the application ones. os.path.expanduser("~/.hammiedb") works nicely enough on WinXP - does it work happily with other Windows flavours and with pre-OSX Macs? (I presume OSX is fine). Personally, I like that more as a default (although I suppose if the default is to be changed, the 'hammie' name could also be dropped). FWIW, these days, Windows users with Mark's extensions, for which spambayes can't find a database, get one created in (effectively) ~/Application Data/SpamBayes/Proxy), if memory serves me correctly. (Called statistics_database.db, I think). =Tony Meyer From tameyer at ihug.co.nz Sun Nov 16 22:07:06 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Nov 16 22:04:37 2003 Subject: [spambayes-dev] Re: [Spambayes] Lotus Notes filter error KeyError:('Hammie', 'header_spam_string') In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1F6B@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B132@its-xchg4.massey.ac.nz> > [Mike] > > File "C:\Program Files\Python23\Scripts\sb_notesfilter.py", > line 237, in processAndTrain > > str = options["Hammie", "header_spam_string"] > > I don't know much about the Notes stuff, but that looks like > a bug. 
That piece of code should probably be: > > if is_spam: > str = options["Headers", "header_spam_string"] > else: > str = options["Headers", "header_ham_string"] [...] > I'm forwarding this to spambayes-dev to see whether anyone > there knows for sure whether I'm right about this...? This is a bit out of date now, but I can't see a message confirming it, so: Yes, that's right. When I went through the scripts and updated the options names I either missed some in notesfilter (as with some elsewhere), or deliberately left notesfilter alone (I can't recall which) since it needs updating in various places (TimS created it, used it, and planned to update it, but never managed to find time). I notice that this fix has been checked in anyway, so this is just in case you were wondering, or for anyone reading the archives ;) =Tony Meyer From tameyer at ihug.co.nz Sun Nov 16 22:17:20 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Nov 16 22:14:50 2003 Subject: [spambayes-dev] RE: [Spambayes] Spambayes 1.0a7 - windowsproxy_tray installation In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1F96@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29CF@its-xchg4.massey.ac.nz> > At some point in the not-too-distant past, a decision was > made that the Windows scripts pop3proxy_service.py and > pop3proxy_tray.py should be installed to the Python Scripts > directory along with the other command-line scripts. It > seems this was a bit premature, as pop3proxy_tray obviously > isn't designed to be run that way. This is my fault - I forgot about the icons. The problem was (several people reported it after the 1.0a6 release) that everything else *is* installed into the python scripts directory, and so the readme tells people to then discard the expanded archive - but they then didn't have the contents of the (spambayes) windows directory. > I've CC'd the > spambayes-dev list in hopes that someone can take a look at > this. 
A few comments: * For the vast majority of people, this won't be a problem, because they'll use the binary installer for spambayes and it'll install a frozen pop3proxy_tray in the requested place. * Could the tray handle the icon files like the web ui handles its non-Python files (with resourcepackage)? Would this be desirable? * What do other python programs do that have 'support' files? > At the very least, we should probably stop copying it > to the Python\Scripts directory until the problem is fixed. I would rather that we just came up with a solution to the missing icon files problem and checked that in, rather than stop copying it. After all, it's only a copy - running the script from the expanded archive will work fine. If we do stop copying it, then the readme (and website?) need to be updated to explain which files need to be kept from the expanded archive, and which don't. =Tony Meyer From tameyer at ihug.co.nz Sun Nov 16 22:20:54 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Nov 16 22:18:19 2003 Subject: [spambayes-dev] Re: Offer to Help / Development Participation In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130407BE77@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B134@its-xchg4.massey.ac.nz> > Many thanks for the offers! Maybe other developers would > like to make specific suggestions (hence I've forwarded this > to the spambayes-dev mailing list), but there's a whole bunch > of things that you could do, starting with the non-technical: [list of suggestions snipped] > There are probably a dozen other things that I haven't thought of. Richie - could you put something like this up on the wiki somewhere? And maybe link to it from the "how can I help" FAQ? It's much more comprehensive than stuff that I've seen/written before, and it is a fairly common question. (The wiki's probably better than the FAQ, since this'll presumably change as things progress).
=Tony Meyer From tameyer at ihug.co.nz Sun Nov 16 22:28:48 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Nov 16 22:26:14 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1FBF@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz> > It looks to me like > OptionsClass.HEADER_VALUE is too restrictive, but I'll leave > it for the author of that code to decide whether or not to > loosen it up. I wrote (many of) the regexes in OptionsClass, and in my defense I'll note that (somewhere) at the time I pointed out that they needed to be checked out by someone more expert at them than me. It's currently "[\w\.\-\*]+". Someone here must know offhand what the valid characters in an email header are, yes? Or do we just go with flexibility and use ".+"? =Tony Meyer From tameyer at ihug.co.nz Sun Nov 16 22:34:13 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Nov 16 22:31:38 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1F67@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B136@its-xchg4.massey.ac.nz> > [Anthony] > OTOH, I don't know what is stopping us from cutting 1.0b1 > in a couple of > weeks, with a possible RC a couple of weeks after that. [Richie] > The DBRunRecoveryError problem is stopping us, IMHO. I would agree with that, but I don't think that there's anything else (although I haven't read the bug reports for the last couple of weeks...). It would be interesting to see if the changes in 1.0a7 reduce the occurrences of the problem in the messageinfo db (I think they should). On a positive* note, it appears that during the time I was away, my wife's system (current cvs spambayes, Python 2.2.2) has started corrupting the stats db every time it's run, so I might be able to chase something down from that in the next couple of days. 
=Tony Meyer * Not in her opinion, however . From skip at pobox.com Sun Nov 16 22:35:16 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Nov 16 23:04:41 2003 Subject: [spambayes-dev] Whitelists (was: A spectacular false positive) In-Reply-To: References: Message-ID: <16312.16884.440384.414721@montanaro.dyndns.org> Richie> o Whenever a message is trained as spam, remove the From Richie> address from the whitelist. So when a spammer forges Barry Warsaw's address (as I've seen before), Barry disappears from my whitelist? Most of my email that comes in as ham is never a candidate for training. Even if I fed my current ham training database to a whitelist generator it wouldn't whitelist a single '@python.org' address. It would get a number of Python-related email addresses though: gerrit@nl.linux.org tim_one@users.sourceforge.net amk@amk.ca anthony@interlink.com.au ... While all these addresses are certainly valuable contacts in the Python world, they are hardly representative of the email addresses which would float to the front of my cortex if I decided to build a whitelist manually. They just happen to be authors of Python-related messages on which I've trained. My current set of ham includes 11 python-list messages, two python-checkins messages, and one each from the spambayes, mailman and python-dev lists. My Python mailbox obviously contains a lot more mail, but it includes messages from random people asking Python questions which I simply forgot to delete as well as messages I've saved for their content, not necessarily who they are from. Richie> o Whenever a message is received from a whitelisted addresses, Richie> and scores as solid (for some value of 'solid') ham, Richie> auto-train the message as ham. You'd use this for personal Richie> acquaintances only, and not for mailing lists or Richie> organisations (amazon.com, ebay.com, etc.) Now we're back to growing large databases. 
I think over time you might wind up with a highly unbalanced set of ham and spam. Of course, as Tim pointed out, we all seem to be flying more-or-less by the seat of our pants vis a vis training, so one feature is as good as another. Still, I get so few false positives that I find it hard to believe a whitelist - even if it included my wife and my boss - would be helpful. Richie> The upshot: I still don't trust SpamBayes to delete my Spam Richie> without looking at it. I have auto-deleted spam with a classification of "spam; 1.00" for a couple months. My boss hasn't fired me yet for not responding to an email. Skip From skip at pobox.com Sun Nov 16 22:13:16 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Nov 16 23:04:44 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <20031115234251.228272DF6A@cashew.wolfskeep.com> References: <20031115234251.228272DF6A@cashew.wolfskeep.com> Message-ID: <16312.15564.157062.319322@montanaro.dyndns.org> >> Especially since more & more of us are inclining toward using tiny >> databases (compared to what we used to do), making space for a "last >> used" timestamp may not be nearly as scary as it used to be. Alex> This is something that I don't understand... why do we care if the Alex> database is huge? With 100 gigabyte drives commonplace, why are Alex> we quibbling over 20 or 40 megabytes? It's not an issue of 20-40 megabytes, it's how many messages are represented by that file. In my case, I had a training database of around 21MB and on the order of 10,000 ham and somewhat fewer spam (maybe 7,000 or so), depending on how aggressively I'd been training and how recently I'd whacked off the oldest 10%-20% of my messages. I think there's a psychological hurdle to overcome to simply throw away 17,000 messages, even if it's not working optimally, because it does represent a substantial time investment. That hurdle is much lower when your training database is under 500 messages.
Heck, I can rebuild one of that size in next to no time. Here's something I think would be interesting. At the moment I have about 40 unsures awaiting a decision from me (train or discard). I'm trying consciously to be conservative. What I'd like to know is which message, if added to my training database, would have the greatest effect on the scores of the other unsure messages. That would help me decide which ones yield the most benefit. OTOH, maybe I'd do just as well to train on every fourth unsure or select unsures to train on with a probability of 0.25 (1/4 picked purely out of thin air, so don't ask where I got it :-). Skip From tim.one at comcast.net Mon Nov 17 00:24:50 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 17 00:24:44 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz> Message-ID: [Skip] >> It looks to me like OptionsClass.HEADER_VALUE is too restrictive, but >> I'll leave it for the author of that code to decide whether or not to >> loosen it up. [Tony] > I wrote (many of) the regexes in OptionsClass, and in my defense I'll > note that (somewhere) at the time I pointed out that they needed to > be checked out by someone more expert at them than me. > > It's currently "[\w\.\-\*]+". Someone here must know offhand what > the valid characters in an email header are, yes? Or do we just go > with flexibility and use ".+"? RFC 822 sez: The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). But who cares? Not me.
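[Editorial sketch] The RFC 822 field-name rule quoted above (printable ASCII 33 to 126, excluding the colon) can be written as a character class. This is only an illustration in modern Python, assuming a standalone helper; the names `FIELD_NAME` and `is_valid_field_name` are hypothetical and are not part of OptionsClass, and the rule covers header names only, not header values.

```python
import re

# RFC 822 field-name: printable ASCII with values 33..126, except ':' (58).
# \x21-\x39 covers '!' through '9'; \x3b-\x7e covers ';' through '~'.
# Hypothetical name; not the pattern used by spambayes itself.
FIELD_NAME = re.compile(r"[\x21-\x39\x3b-\x7e]+")


def is_valid_field_name(name):
    """Return True if the whole string is a legal RFC 822 field-name."""
    return FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("X-Spambayes-Classification", "Bad Name", "Bad:Name"):
        print(candidate, is_valid_field_name(candidate))
```

A space, a colon, a control character, or a high-bit character anywhere in the string makes the match fail, which is the strict reading of the standard; whether spambayes should be that strict is exactly the open question in this thread.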
From tameyer at ihug.co.nz Mon Nov 17 02:33:57 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Nov 17 02:31:25 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1DC2@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz> [This thread seems to have died a week ago, but since I was away, and have things to say , and it doesn't seem to be resolved, I figured I'd resurrect it. While I'm doing notes: thanks Richie, Anthony and Skip for outlining the various processes in more detail - great stuff for us cvs newbies]. [Richie] > Re-reading Tony's mail, I should have pointed out at the time that we > shouldn't commit edits to both places, but should use "cvs up [-j > moving-tag] -j release_1_0" to periodically merge the bugfix > branch onto the head. Nuts. I've been very guilty of this. Basically for every edit if I thought it fit in the 'bug fix' category, then I recommitted it on the branch. The changelog outlines (for the most part) which ones I put into the branch, and which ones were only on the head. > From looking at the logs, it seems you're right, Mark - > bugfixes have been > hitting the head instead of release_1_0. Also, some fixes have been > committed to both the head and release_1_0, which will probably make > merging release_1_0 back onto the head a pain - you always get more > conflicts when you do that. Apart from the last couple of weeks, I committed the majority of changes to the branch (a mixture of stuff from me, and copying other people's fixes from the head). With the exception of the windows specific stuff, I'm pretty sure that I branch-committed all the changes that looked to me like bug fixes. I did do it by copying and pasting, mostly (and then checking), so hopefully there won't be too many conflicts. 
> (I should have encouraged more discussion of > branch strategy when all this came up - we make heavy use of > CVS branches at work, and we know a bit about how best to manage them.) I should have asked more questions, too, sorry. I'm very much a newcomer to cvs, and was probably pushing towards the 1.0 release most for a period, so should have ensured that I was doing it right. > How much enhancement work has gone onto the head since release_1_0 was > taken? A lot with the web interface. The changelog details it - it's the stuff in the 1.1a1 section, rather than the 1.0a7 one. [Anthony] > I'd suggest the following: > > - checkin to the trunk. If the fix is a bugfix, and suitable for the > branch, include "bugfix candidate" in the checkin message. > > - (preferably) check your bugfix into the branch as well. I suggest > having two checkouts, one on the branch, one on the trunk. > > - (otherwise) someone else notices that the "bugfix" needs to be > applied to the branch as well, and does so. This is more-or-less what I was doing, I think, except that I based the "bugfix candidate" decision on discussion in the lists, the checkin message, and my own fallible head, rather than an explicit message. > Having said that, I'd say the time to branch is at the point > where we're about to cut the first beta. So we've possibly done > it too soon here. This was almost the intent here, too. The original aim was to create the branch, release 1.0a6, then in a very short time release 1.0b1; before release, the code that became 1.0a6 seemed pretty stable, and the main reason for the release was to have an alpha with the new script/option names before releasing a beta. Of course, after it was released, the db-closing/interface bug surfaced, and there was a resurgence of dbrunrecovery errors, plus a few others. [Richie] > Our failure this time, if there even was a failure, was in > not advertising the strategy loudly enough.
When a strategy is decided, what would be the best way to advertise it, given that people may join the development team at any point? Something in readme-devel? And speaking of deciding a strategy, what is the spambayes one? Personally, I'm in favour of someone else deciding and giving me steps to follow :) It does seem likely that if we can resolve the db corruption bug, a beta wouldn't be far off, so it would be good to decide by then :) =Tony Meyer From tdickenson at devmail.geminidataloggers.co.uk Mon Nov 17 04:11:36 2003 From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Mon Nov 17 04:11:47 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: Message-ID: <200311170911.36239.tdickenson@devmail.geminidataloggers.co.uk> On Saturday 15 November 2003 01:02, Tim Peters wrote: > It had > virtually no English text, but lots, and lots, and lots of different > integers (about 100KB worth). There were about a half dozen strong ham > clues that it had come from him, but about 140 spam clues from the variety > of little integers, most hapaxes that had appeared in one training spam > each. > > I view that mostly as a danger of mistake-based training: as I've > mentioned before, mistake-based training tends toward being hapax-driven, > and hapaxes are brittle. There's nothing *inherently* spammy about, say, > 16384, and because that's a power of 2 and I'm a computer geek, that > *would* have appeared in several training ham if I hadn't fallen into > mistake-based training (yes, 16384 had indeed appeared in one training > spam). I occasionally see the inverse problem. I train on every email I receive, including many hams containing lots of numbers like Jeremy sent you. Occasionally I get a spam where 2 or 3 numbers (in a price list, usually) are enough to classify it as ham. Would you have been as surprised by the same result if Jeremy had sent you a long list of effectively random words?
-- Toby Dickenson From m0davis at pacbell.net Mon Nov 17 07:23:27 2003 From: m0davis at pacbell.net (Martin Stone Davis) Date: Mon Nov 17 07:30:25 2003 Subject: [spambayes-dev] Idea to re-energize corpus learning Message-ID: I recently started this thread on the POPFile forum, but it applies just as well to SpamBayes. https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099 -Martin From skip at pobox.com Mon Nov 17 08:34:15 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 17 08:34:31 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1FBF@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz> Message-ID: <16312.52823.91199.400874@montanaro.dyndns.org> >> It looks to me like OptionsClass.HEADER_VALUE is too restrictive... Tony> It's currently "[\w\.\-\*]+". Someone here must know offhand what Tony> the valid characters in an email header are, yes? Or do we just Tony> go with flexibility and use ".+"? Anything printable is okay, yes? that would be [ -~]+ I think. Do we need to worry about people including control characters or high-bit-set stuff? Skip From skip at pobox.com Mon Nov 17 08:47:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 17 08:47:50 2003 Subject: [spambayes-dev] Idea to re-energize corpus learning In-Reply-To: References: Message-ID: <16312.53618.652677.190274@montanaro.dyndns.org> Martin> I recently started this thread on the POPFile forum, but it Martin> applies just as well to SpamBayes. Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099 See my note from Sunday on spambayes-dev: http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html Just because you train on a gazillion spams and hams doesn't mean the best course once you've screwed something up isn't to start over. 
Like I said in the above message, I think there's a certain psychological barrier you have to overcome before you throw out a massive training database. I suspect POPfile learns about as quickly as SpamBayes, so without proof I assert that starting over there is often going to be the right course as well. For example, it's rather easy for me to scan my current training database for mistakes, either in a semi-automated fashion using sb_filter.py or manually, because it only contains about 250 messages. This was extremely difficult using my previous monster database (15k-20k messages). Skip From m0davis at pacbell.net Mon Nov 17 09:25:08 2003 From: m0davis at pacbell.net (Martin Stone Davis) Date: Mon Nov 17 09:24:53 2003 Subject: [spambayes-dev] Re: Idea to re-energize corpus learning In-Reply-To: <16312.53618.652677.190274@montanaro.dyndns.org> References: <16312.53618.652677.190274@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Martin> I recently started this thread on the POPFile forum, but it > Martin> applies just as well to SpamBayes. > > Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099 > > See my note from Sunday on spambayes-dev: > > http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html > > Just because you train on a gazillion spams and hams doesn't mean the best > course once you've screwed something up isn't to start over. Like I said in > the above message, I think there's a certain psychological barrier you have > to overcome before you throw out a massive training database. I suspect > POPfile learns about as quickly as SpamBayes, so without proof I assert that > starting over there is often going to be the right course as well. > > For example, it's rather easy for me to scan my current training database > for mistakes, either in a semi-automated fashion using sb_filter.py or > manually, because it only contains about 250 messages. 
This was extremely > difficult using my previous monster database (15k-20k messages). > > Skip Wouldn't it be nice if there were some middle ground between continuing to train the huge immovable database and starting over fresh? After all, it's more than just a psychological barrier. Having to train 100% of incoming messages after starting over is real work, and especially frustrating when you *know* that 80-90% would have been correctly classified anyway if only you hadn't started over. So why not soften the blow? That's what my proposal amounts to: achieving some sort of middle ground between the status quo and starting over. After performing a "Soften training SEVERELY" (where the counts are all set to their square roots), messages would still be classified in more-or-less the same way. However, further training would then be far more effective, since the counts would be lower. Doesn't that sound like a good idea? -Martin P.S. I'm also sure that POPfile learns just as quickly as SpamBayes, since they are based on the same principle. From tim.one at comcast.net Mon Nov 17 09:54:36 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 17 09:54:26 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? In-Reply-To: <16312.52823.91199.400874@montanaro.dyndns.org> Message-ID: [Skip] > It looks to me like OptionsClass.HEADER_VALUE is too > restrictive... > ... > Anything printable is okay, yes? The colon is forbidden in a header field name. > that would be [ -~]+ I think. The blank is also forbidden in a header field name. > Do we need to worry about people including control characters or > high-bit-set stuff? Not if people are willing to adhere to the standard . From skip at pobox.com Mon Nov 17 10:04:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 17 10:04:37 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? 
In-Reply-To: References: <16312.52823.91199.400874@montanaro.dyndns.org> Message-ID: <16312.58231.468785.103935@montanaro.dyndns.org> Tim> [Skip] >> It looks to me like OptionsClass.HEADER_VALUE is too >> restrictive... >> ... >> Anything printable is okay, yes? Tim> The colon is forbidden in a header field name. >> that would be [ -~]+ I think. Tim> The blank is also forbidden in a header field name. >> Do we need to worry about people including control characters or >> high-bit-set stuff? Tim> Not if people are willing to adhere to the standard. I believe OptionsClass.HEADER_VALUE refers to the value of a particular header, not its name. Everything you wrote is correct for OptionsClass.HEADER_NAME. Right now, both have the same value: HEADER_NAME = r"[\w\.\-\*]+" HEADER_VALUE = r"[\w\.\-\*]+" I am happy to leave HEADER_NAME as is, but would like to change HEADER_VALUE to HEADER_VALUE = "[ -~]+" or should that be HEADER_VALUE = "[\t -~]+" ? Skip From tim.one at comcast.net Mon Nov 17 10:26:12 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 17 10:26:06 2003 Subject: [spambayes-dev] OptionsClass.is_valid too picky? In-Reply-To: <16312.58231.468785.103935@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I believe OptionsClass.HEADER_VALUE refers to the value of a > particular, not its name. Everything you wrote is correct for > OptionsClass.HEADER_NAME. Right now, both have the same value: > > HEADER_NAME = r"[\w\.\-\*]+" > HEADER_VALUE = r"[\w\.\-\*]+" > > I am happy to leave HEADER_NAME as is, but would like to change > HEADER_VALUE to > > HEADER_VALUE = "[ -~]+" > > or should that be > > HEADER_VALUE = "[\t -~]+" http://www.faqs.org/rfcs/rfc822.html The field-body may be composed of any ASCII characters, except CR or LF. (While CR and/or LF may be present in the actual text, they are removed by the action of unfolding the field.)
This seems to contradict the definition of "text" given later, which allows bare CR and bare LF too, just not the CRLF combination. "ASCII characters" isn't clearly defined, although the lexical definition for CHAR later is *described* as "any ASCII character" in English and *defined* as decimal 0 to decimal 127. One reason email clients get incompatible is that these early standards can be darned hard to make full sense of. So "suit yourself" is what many do in practice, although "be liberal in what you accept" is the Official Mantra offered as equally fuzzy advice. From kennypitt at hotmail.com Mon Nov 17 10:42:23 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Nov 17 10:43:13 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: Message-ID: Tim Peters wrote: > Sigh -- we need solid research on training disciplines that work > great in real-life use, respecting that anything requiring human > input will barely get used except by geeks who never tire of watching > the training process. We're getting a lot of anecdotal evidence > (which ain't the same thing) about different schemes, and I'm afraid > no two of the developers train in the same way anymore. It's a good > thing the algorithm appears to have turned out to be robust against > almost any training insanity short of what Outlook users can stumble > into <0.9 wink>. Yes, the Outlook plugin pretty much guarantees mistake-based training for anyone not familiar enough with the program (or too lazy) to update the training through SpamBayes Manager periodically. The majority of my ham comes either from the same list of senders at work, or from the SpamBayes lists, so it didn't take SpamBayes long to start classifying all of those correctly. I got up to almost 10:1 spam to ham ratio pretty quickly. To try to work around the problem, I implemented two experimental options to train on all certain ham and train on all certain spam.
Since I can turn them on or off independently, I can use them to get my ratio back in balance and then turn them off. What I'd like to implement is a way to do this automatically. I'd like to say something like, "If my spam count reaches twice my ham count then train on all certain hams until the counts are within 5% of each other again." These cutoffs would of course be configurable. It will take me a little while to get around to implementing this and even longer to see if it is effective, but I'll report results (or at least perceptions) when I have them. -- Kenny Pitt From eckert at indiana.edu Mon Nov 17 11:22:12 2003 From: eckert at indiana.edu (Eckert, Robert D) Date: Mon Nov 17 11:23:21 2003 Subject: [spambayes-dev] Can't move items that are in the results list from an Outlook Find when SpamBayes is installed Message-ID: <885BB3CAB85CBD44B73B52CFBC1FC55EAAF2D5@iu-mssg-mbx08.exchange.iu.edu> Hi, I am using Outlook 2002 with an Exchange 2002 server when I work. The copy of Outlook is locally installed on my PC, which is running Windows 2000 Professional. All software is up to date and patched. When I do a "Find" operation on my Inbox and get a results list, then select all the items and attempt to drag and drop them into another folder in my folder list, Outlook says "Can't move the items" in a dialog box. When SpamBayes (or Qurb before SpamBayes) is installed, the move operation fails, yet when SpamBayes is not installed, the operation succeeds without a problem. Can you address what is happening here? Thank you, An otherwise *very* satisfied SpamBayes user.
-Bob Bob Eckert - Principal Analyst eckert@indiana.edu (812) 855-7209 - (812) 855-8299 Fax Indiana University University Information Technology Services University Information Services 2711 East 10th Street - Room 101.5 Bloomington, IN 47408 From kennypitt at hotmail.com Mon Nov 17 12:47:18 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Nov 17 12:47:54 2003 Subject: [spambayes-dev] RE: [Spambayes] Spambayes 1.0a7 - windowsproxy_tray installation In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29CF@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> At some point in the not-too-distant past, a decision was >> made that the Windows scripts pop3proxy_service.py and >> pop3proxy_tray.py should be installed to the Python Scripts >> directory along with the other command-line scripts. It >> seems this was a bit premature, as pop3proxy_tray obviously >> isn't designed to be run that way. > > This is my fault - I forgot about the icons. The problem was (several > people reported it after the 1.0a6 release) that everything else *is* > installed into the python scripts directory, and so the readme tells > people to then discard the expanded archive - but they then didn't > have the contents of the (spambayes) windows directory. > > A few comments: > * For the vast majority of people, this won't be a problem, because > they'll use the binary installer for spambayes and it'll install a > frozen pop3proxy_tray in the requested place. Very true, as this has proven the norm for the Outlook plugin. We just need to finalize the installer and get it out in front of people. These Windows-specific scripts seem to be more akin to the Outlook plugin than to the more Unix-oriented command line scripts, so would the best course be to handle them the same way? For the Outlook plugin, the binary installer is the general case, and those who want/need to run from source do so from the complete source directory. 
> * Could the tray handle the icon files like the web ui handles its > non python files? (with resourcepackage)? Would this be desirable? Probably, with enough extra work. The Web UI is quite happy having the raw file data available in an in-memory object. Windows will happily load an icon from either a disk file or a properly formatted resource (as the binary will use). IIRC, though, it doesn't provide much help if you have the same data as the file but it isn't physically in a file. >> At the very least, we should probably stop copying it >> to the Python\Scripts directory until the problem is fixed. > > I would rather that we just came up with a solution to the missing > icon files problem and checked that in, rather than stop copying it. Agreed. I was thinking only about reducing confusion in the meantime. -- Kenny Pitt From skip at pobox.com Mon Nov 17 13:28:36 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 17 13:31:57 2003 Subject: [spambayes-dev] Re: Idea to re-energize corpus learning In-Reply-To: References: <16312.53618.652677.190274@montanaro.dyndns.org> Message-ID: <16313.4948.609409.61316@montanaro.dyndns.org> >>>>> "Martin" == Martin Stone Davis writes: Martin> Skip Montanaro wrote: Martin> I recently started this thread on the POPFile forum, but it Martin> applies just as well to SpamBayes. >> Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099 >> >> See my note from Sunday on spambayes-dev: >> >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html Martin> Wouldn't it be nice if there were some middle ground between Martin> continuing to train the huge immovable database and starting Martin> over fresh? Sure, it would, but why propagate mistakes, even if they are smaller in magnitude? I should have continued my previous message instead of leaving people to draw their own conclusions.
With a small database, if you have an error, it's easier to find, and if you can't find it, starting from scratch is not a big problem. With a large database there's this feeling that, "but... but... but... I'll be throwing away all that *good* data and all my (valuable) work!" Martin> Having to train 100% of incoming messages after starting over is Martin> real work, and especially frustrating when you *know* that Martin> 80-90% would have been correctly classified anyway if only you Martin> hadn't started over. If you only train on mistakes and unsures (as many of us appear to do now), then the effort is lessened. I don't see any practical benefit to training on every Python-related message I receive as ham. I currently have about 20 in my training database. If I was smart, I could probably figure out how to reduce that number. As far as I can tell, nearly every valid Python-related message I receive gets a ham score of 0.00 (rounded). None get scored as unsure or spam. How long should I beat that particular dead horse? Since blowing away my gazillion message training database I've started from scratch twice. Considering the volume of mail I get, getting back to a 250-message training database is little effort at all for me. SpamBayes seems to start scoring most stuff pretty well after seeing just a few hams and spams, so the cost is minimal. The problem with spam is that it varies all over the map (subject wise). My hams fall into just a few categories though, so good messages begin to be correctly classified almost immediately. Spam tends to linger in the unsure category much longer. My current approach to that problem is to try and push my spam_cutoff down further. If you want to seed a training database, you might try initially adding just the most recent message from each of your active ham mailboxes. I could add just ten messages and be almost certain they would all be useful indicators of ham.
Once I've added a few spams, I'd probably see pretty good classification results. Given a 20k-message training database which contains mistakes, I will have a hard time finding and correcting those mistakes. Your approach is to reduce the magnitude of the mistakes by reducing the weight of the current training database. I effectively take the same approach, it's just that I've actually deleted the mistakes. I've thrown the baby out with the bath water (you just shrink your babies ;-), but I get plenty of babies in my incoming mail feed. If I'm careful, perhaps I'll avoid introducing the same mistakes next time. Martin> Doesn't that sound like a good idea? I suppose. Mine doesn't require any new code to be written though. I'm really not saying your idea is bad, just that mine ought to be "good enough" and requires no extra code to be written. You should be able to write a little Python script which will march through your database and reduce the counts by appropriate amounts. You will have to be aware of a couple corner conditions: * The counts for some words will round to zero. You have to decide whether to keep them as hapaxes or delete them altogether. * Roundoff error might leave you with some assertion errors like the dreaded assert hamcount <= nham assert spamcount <= nspam You'll also have to take care to avoid that case. One thing I tried in the past was to whack off the oldest 10%-20% of my training database and retrain on the result. That's another option to try to remove errors. If you as a trainer get better at your job, over time you will also reduce the number of mistakes in your training database. This approach also has the pleasant side effect of deleting old messages, keeping your training data more current as the nature of spam shifts. If you initially trained on a large body of saved mail though, you might wind up whacking out many/most/all the clues pertaining to a particular subject area and have to add some new messages in to compensate. 
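A minimal sketch of the count-reducing script described above, with both corner conditions handled. It assumes the database can be viewed as a plain dict mapping each token to a (hamcount, spamcount) pair plus the two message totals; that layout, the function name shrink_counts and its parameters are illustrative stand-ins, not the actual SpamBayes storage API.

```python
def shrink_counts(word_counts, nham, nspam, factor=0.5, keep_hapaxes=True):
    """Scale every count by `factor` and return (new_counts, new_nham, new_nspam).

    word_counts is a hypothetical dict: token -> (hamcount, spamcount).
    """
    def scale(n):
        # Floor rather than round: the same monotone function is applied to
        # token counts and to totals, so hamcount <= nham survives scaling.
        scaled = int(n * factor)
        if keep_hapaxes and n > 0 and scaled == 0:
            scaled = 1  # keep a once-seen token as a hapax
        return scaled

    new_nham, new_nspam = scale(nham), scale(nspam)
    new_counts = {}
    for token, (hamcount, spamcount) in word_counts.items():
        h, s = scale(hamcount), scale(spamcount)
        if h == 0 and s == 0:
            continue  # token rounded away entirely: delete it
        # The invariants behind the dreaded assertions.
        assert h <= new_nham and s <= new_nspam
        new_counts[token] = (h, s)
    return new_counts, new_nham, new_nspam


if __name__ == "__main__":
    db = {"python": (800, 3), "viagra": (1, 400), "rare": (1, 0)}
    print(shrink_counts(db, 1000, 1000))
```

With keep_hapaxes=False the token "rare" above would be dropped instead of kept at a count of 1; which behaviour scores better is exactly the kind of thing that would need testing.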
Skip From popiel at wolfskeep.com Mon Nov 17 14:08:26 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Nov 17 14:08:33 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: Message from "Kenny Pitt" of "Mon, 17 Nov 2003 10:42:23 EST." References: Message-ID: <20031117190826.5EB772DF1B@cashew.wolfskeep.com> In message: "Kenny Pitt" writes: >Tim Peters wrote: >> Sigh -- we need solid research on training disciplines that work >> great in real-life use, respecting that anything requiring human >> input will barely get used except by geeks who never tire of watching >> the training process. FWIW, this sort of research is what I built the incremental harness for. It really ought to be named something like the time-sequence harness, but I didn't think of that at the time. In any case, use the harness, you can specify (in regimes.py) any particular training behaviour you want. Using that, you can run cv-esque tests to check effectiveness. Unfortunately, after building the harness, I lost all will to actually use it. :-/ >To try to work around the problem, I implemented two experimental >options to train on all certain ham and train on all certain spam. >Since I can turn them on or off independently, I can use them to get my >ratio back in balance and then turn them off. What I'd like to >implement is a way to do this automatically. I'd like to say something >like, "If my spam count reaches twice my ham count then train on all >certain hams until the counts are within 5% of each other again." These >cutoffs would of course be configurable. This is a training behaviour which is easily emulated using the harness above. I'd love to see some quantitative numbers on it vs. training on everything or training on just mistakes and unsures (both of which are preexisting regimes). 
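The rebalancing rule quoted above ("if my spam count reaches twice my ham count then train on all certain hams until the counts are within 5% of each other again") could be sketched roughly as follows. This is a standalone illustration only: the class name, method names and defaults are made up for the example, and it does not use the interface that regimes.py actually expects.

```python
class BalancedTrainingPolicy:
    """Train on mistakes/unsures always; train on certain messages of the
    lagging class only while the ham/spam counts are being rebalanced."""

    def __init__(self, nham=0, nspam=0, trigger_ratio=2.0, tolerance=0.05):
        self.nham = nham
        self.nspam = nspam
        self.trigger_ratio = trigger_ratio  # imbalance that starts catch-up
        self.tolerance = tolerance          # closeness that ends catch-up
        self.catching_up = None             # None, "ham" or "spam"
        self._update_mode()

    def _update_mode(self):
        if self.catching_up is None:
            if self.nspam > 0 and self.nspam >= self.trigger_ratio * self.nham:
                self.catching_up = "ham"
            elif self.nham > 0 and self.nham >= self.trigger_ratio * self.nspam:
                self.catching_up = "spam"
        else:
            larger = max(self.nham, self.nspam)
            if abs(self.nham - self.nspam) <= self.tolerance * larger:
                self.catching_up = None

    def should_train(self, label, certain):
        """label is "ham" or "spam"; certain is False for mistakes and unsures."""
        if not certain:
            return True
        return self.catching_up == label

    def record_training(self, label):
        if label == "ham":
            self.nham += 1
        else:
            self.nspam += 1
        self._update_mode()
```

The two thresholds give the rule some hysteresis: catch-up training starts at the 2:1 trigger and only stops once the counts are back within 5%, so it does not flip on and off with every message.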
>It will take me a little while to get around to implementing this and >even longer to see if it is effective, but I'll report results (or at >least perceptions) when I have them. Cool. - Alex From popiel at wolfskeep.com Mon Nov 17 14:26:28 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Nov 17 14:27:18 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: Message from Skip Montanaro of "Sun, 16 Nov 2003 21:13:16 CST." <16312.15564.157062.319322@montanaro.dyndns.org> References: <20031115234251.228272DF6A@cashew.wolfskeep.com> <16312.15564.157062.319322@montanaro.dyndns.org> Message-ID: <20031117192628.B11E32DF1B@cashew.wolfskeep.com> In message: <16312.15564.157062.319322@montanaro.dyndns.org> Skip Montanaro writes: > >It's not an issue of 20-40 megabytes, it's how many messages are represented >by that file. In my case, I had a training database of around 21MB and on >the order of 10,000 ham and somewhat fewer spam (maybe 7,000 or so), >depending on how agressively I'd been training and how recently I'd whacked >off the oldest 10%-20% of my messages. > >I think there's a psychological hurdle to overcome to simply throw away >17,000 messages, even if it's not working optimally, because it does >represent a substantial time investment. That hurdle is much lower when >your training database is under 500 messages. Heck, I can rebuild one of >that size in next to no time. I agree that the time investment is an issue... even more than the message count. I have a scheduled job that runs every night and retrains from scratch, but if I were doing it manually, then I too would probably hesitate to whack the database. However, with the foreknowledge that I'd want to whack the database, I set stuff up (saving all mail, and the categorization of said mail) so it would be easy to retrain. Perhaps what we really need is some easy way to allow people to retrain in bulk... 
with the data required for doing so collected by default instead of only by unusual forethought. >Here's something I think would be interesting. At the moment I have about >40 unsures awaiting a decision from me (train or discard). I'm trying >consciously to be conservative. What I'd like to know is which message, if >added to my training database, would have the greatest effect on the scores >of the other unsure messages. That would help me decide which ones yield >the most benefit. I tend to think that you're over-optimizing... many times over, this project has shown that stupid beats smart. >OTOH, maybe I'd do just as well to train on every fourth >unsure or select unsures to train on with a probability of 0.25 (1/4 picked >purely out of thin air, so don't ask where I got it :-). I believe (without proof) this is true. - Alex From skip at pobox.com Mon Nov 17 16:04:41 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 17 16:04:54 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <20031117190826.5EB772DF1B@cashew.wolfskeep.com> References: <20031117190826.5EB772DF1B@cashew.wolfskeep.com> Message-ID: <16313.14313.535777.235945@montanaro.dyndns.org> Alex> FWIW, this sort of research is what I built the incremental Alex> harness for. It really ought to be named something like the Alex> time-sequence harness, but I didn't think of that at the time. I gather that several files in testtools are related to your harness? A quick read of incremental.HOWTO.txt suggests: incremental.* regimes.py mksets.py dotest.sh Anything else? It seems that these files dominate that directory. Maybe we should create a time-sequence or incremental-test subdirectory and push them into that.
Skip From skip at pobox.com Mon Nov 17 16:08:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Nov 17 16:08:58 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <20031117192628.B11E32DF1B@cashew.wolfskeep.com> References: <20031115234251.228272DF6A@cashew.wolfskeep.com> <16312.15564.157062.319322@montanaro.dyndns.org> <20031117192628.B11E32DF1B@cashew.wolfskeep.com> Message-ID: <16313.14556.545461.7393@montanaro.dyndns.org> >> Here's something I think would be interesting. At the moment I have >> about 40 unsures awaiting a decision from me (train or discard). I'm >> trying conciously to be conservative. What I'd like to know is which >> message, if added to my training database, would have the greatest >> effect on the scores of the other unsure messages. That would help >> me decide which ones yield the most benefit. Alex> I tend to think that you're over-optimizing... many times over, Alex> this project has shown that stupid beats smart. Agreed, but we're in more-or-less uncharted territory here. We all know that testing strategies haven't received nearly the attention that the basic algorithm has. My unsures are dominated by spams at the moment. I'm just experimenting with this stuff and trying to be careful about getting my ham/spam ratio too out-of-whack. Skip From tim.one at comcast.net Mon Nov 17 16:13:57 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 17 16:13:52 2003 Subject: [spambayes-dev] Re: Idea to re-energize corpus learning In-Reply-To: Message-ID: [Martin Stone Davis] > ... > So why not soften the blow? That's what my proposal amounts to: > achieving some sort of middle ground between the status quo and > starting over. After performing a "Soften training SEVERELY" (where > the counts are all set to their square roots), messages would still > be classified in more-or-less the same way. You can't know that without running serious tests, and it sounds like something tests would prove wrong. 
SpamBayes effectively computes spamprobs from ratios, and sqrt(x)/sqrt(y) = sqrt(x/y): the effective relative ratios would also get "square rooted", and that's likely to cause massive changes in scoring. "The usual" way (in many fields) to diminish counts that have grown "too large" is to add 1, then shift right by a bit. The purpose of adding 1 first is to prevent an original count of 1 from becoming 0. Other than that, it's basically "cut all the counts in half". Then (x/2)/(y/2) = x/y, so that relative ratios aren't affected (much; counts 2*i+1 and 2*i+2, for any i >= 0, are both reduced to i+1, so relative ratios can still change some, and especially for small i). > However, further training would then be far more effective, since the > counts would be lower. > > Doesn't that sound like a good idea? If test results say that it is, yes; otherwise no. A problem with artificially mangling token counts is that you'll probably lose the ability to meaningfully untrain a message again (the relationship between token counts and total number of ham and spam trained on is destroyed by reducing only one of them, but if you reduce the total counts too then you've got more messages you *could* untrain on than the (reduced) total count believes is possible; untraining anyway will then lead to worsening inaccuracy until the reduced total count "goes negative", at which point the code will probably blow up, or start to deliver pure nonsense results). > -Martin > > P.S. I'm also sure that POPfile learns just as quickly as SpamBayes, > since they are based on the same principle. Sorry, but unless you've tested this, you have no basis for such a claim. May be true, may be false, but "same principle" doesn't determine it a priori (overlooking that the ways in which SpamBayes and POPfile determine a category actually have very little in common). From popiel at wolfskeep.com Mon Nov 17 16:17:23 2003 From: popiel at wolfskeep.com (T.
Alexander Popiel) Date: Mon Nov 17 16:17:29 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: Message from Skip Montanaro of "Mon, 17 Nov 2003 15:04:41 CST." <16313.14313.535777.235945@montanaro.dyndns.org> References: <20031117190826.5EB772DF1B@cashew.wolfskeep.com> <16313.14313.535777.235945@montanaro.dyndns.org> Message-ID: <20031117211723.CFCA32DF1B@cashew.wolfskeep.com> In message: <16313.14313.535777.235945@montanaro.dyndns.org> Skip Montanaro writes: > > Alex> FWIW, this sort of research is what I built the incremental > Alex> harness for. It really ought to be named something like the > Alex> time-sequence harness, but I didn't think of that at the time. > >I gather that several files in testttools are related to your harness? A >quick read of incremental.HOWTO.txt suggests: > > incremental.* > regimes.py > mksets.py > dotest.sh > >Anything else? It seems that these files dominate that directory. Maybe we >should create a time-sequence or incremental-test subdirectory and push them >into that. Docs: incremental.HOWTO.txt incremental.TODO.txt Prep: es2hs.py sort+group.py mksets.py Actual harness: incremental.py regimes.py Analysis: mkgraph.py Handy wrapper script: dotest.sh Total of 9 files out of 19 in the testtools dir... not quite to domination, but close. ;-) If you think they should be pushed further down a directory hole, that's OK with me... just update the sys.path mangling at the top of relevant files to be sure to grab spambayes from the local tree and not some installed version... FWIW, the prep scripts can be used with stuff other than this harness, too. They just build the semi-standard 1-message-per-file Data/{Ham,Spam}/Set* testing tree with specially named files (names indicating sequence and grouping information). 
- Alex From m0davis at pacbell.net Mon Nov 17 20:03:12 2003 From: m0davis at pacbell.net (Martin Stone Davis) Date: Mon Nov 17 20:03:37 2003 Subject: [spambayes-dev] Re: Idea to re-energize corpus learning In-Reply-To: References: Message-ID: Tim Peters wrote: > [Martin Stone Davis] > >>... >>So why not soften the blow? That's what my proposal amounts to: >>achieving some sort of middle ground between the status quo and >>starting over. After performing a "Soften training SEVERELY" (where >>the counts are all set to their square roots), messages would still >>be classified in more-or-less the same way. > > > You can't know that without running serious tests, and it sounds like > something tests would prove wrong. SpamBayes effectively computes spamprobs > from ratios, and sqrt(x)/sqrt(y) = sqrt(x/y): the effective relative ratios > would also get "square rooted", and that's likely to cause massive changes > in scoring. Yes, scores in my system would get pushed closer to 1. Which means it should act a little more "unsure" about all the words. I don't see anything so terrible about that, but it's something to keep in mind. > > "The usual" way (in many fields) to diminish counts that have grown "too > large" is to add 1, then shift right by a bit. The purpose of adding 1 > first is to prevent an original count of 1 from becoming 0. Other than > that, it's basically "cut all the counts in half". Then (x/2)/(y/2) = x/y, > so that relative ratios aren't affected (much; counts 2*i+1 and 2*i+2, for > any i >= 0, are both reduced to i+1, so relative ratios can still change > some, and especially for small i). This way would be fine too. As long as the counts are reduced somehow, I'd achieve the goal of making further training more effective. I will try it though, so thanks for the tip. > > >>However, further training would then be far more effective, since the >>counts would be lower. >> >>Doesn't that sound like a good idea? 
> > > If test results say that it is, yes; otherwise no. A problem with > artificially mangling token counts is that you'll probably lose the ability > to meaningfully untrain a message again (the relationship betwen token > counts and total number of ham and spam trained on is destroyed by reducing > only one of them, but if you reduce the total counts too then you've got > more messages you *could* untrain on than the (reduced) total count believes > is possible; untraining anyway will then lead to worsening inaccuracy until > the reduced total count "goes negative", at which point the code will > probably blow up, or start to deliver pure nonsense results). True, but the whole point of my system is that I don't want to have to go over previously trained stuff to try to make it work better. So the fact that it's tough to meaningfully untrain messages after softening is no problem for me. (Hmmm, you might still do it: train A, soften, train B, harden, untrain A. That should be kinda meaningful, if a little confusing. But again, it's not a big issue for me.) > > >>-Martin >> >>P.S. I'm also sure that POPfile learns just as quickly as SpamBayes, >>since they are based on the same principle. > > > Sorry, but unless you've tested this, you have no basis for such a claim. > May be true, may be false, but "same principle" doesn't determine it a > priori (overlooking that the ways in which SpamBayes and POPfile determine a > category actually have very little in common). True, but I was just expressing my confidence in Skip's assertion to the same effect. I'll be more careful next time. :P -Martin P.S. Someone posted a hack to POPfile which will let me test this idea. So that makes one tester... I'll try both the "square root" method and the "cut all the counts in half" method. 
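For anyone wanting to compare the two softening methods side by side, here is a small sketch of what each does to the spam:ham ratio of a single token, following the arithmetic discussed in this thread. The function names and the example counts are made up for illustration; neither SpamBayes nor POPFile provides these helpers.

```python
import math


def soften_sqrt(count):
    """'Soften SEVERELY': replace a count by its square root (kept as a float)."""
    return math.sqrt(count)


def soften_halve(count):
    """Add 1, then shift right one bit, so a count of 1 stays 1 rather than 0."""
    return (count + 1) >> 1


def ratio(spamcount, hamcount):
    return spamcount / hamcount


spamcount, hamcount = 400, 4  # made-up counts for one token

before = ratio(spamcount, hamcount)
after_sqrt = ratio(soften_sqrt(spamcount), soften_sqrt(hamcount))
after_halve = ratio(soften_halve(spamcount), soften_halve(hamcount))

print(before)       # 100.0
print(after_sqrt)   # 10.0, the square root of the original ratio
print(after_halve)  # 100.0, unchanged for these counts

# Halving is not exact for small counts: 3 and 4 both become 2.
print(soften_halve(3), soften_halve(4))  # 2 2
```

Whether either change actually helps scoring is a separate question that only test runs can answer.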
From m0davis at pacbell.net Mon Nov 17 20:22:09 2003 From: m0davis at pacbell.net (Martin Stone Davis) Date: Mon Nov 17 20:22:30 2003 Subject: [spambayes-dev] Re: Idea to re-energize corpus learning In-Reply-To: <16313.4948.609409.61316@montanaro.dyndns.org> References: <16312.53618.652677.190274@montanaro.dyndns.org> <16313.4948.609409.61316@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: >>>>>>"Martin" == Martin Stone Davis writes: > > > Martin> Skip Montanaro wrote: > Martin> I recently started this thread on the POPFile forum, but it > Martin> applies just as well to SpamBayes. > >> > Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099 > >> > >> See my note from Sunday on spambayes-dev: > >> > >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html > > Martin> Wouldn't it be nice if there were some middle ground between > Martin> continuing to train the huge immovable database and starting > Martin> over fresh? > > Sure, it would, but why propagate mistakes, even if they are smaller in > magnitude? I should have continued my previous message instead of leaving > people to draw their own conclusions. With a small database, if you have an > error, it's easier to find, and if you can't find it, starting from scratch > is not a big problem. With a large database there's this feeling that, > "but... but... but... I'll be throwing away all that *good* data and all my > (valuable) work!" > > Martin> Having to train 100% of incoming messages after starting over is > Martin> real work, and especially frustrating when you *know* that > Martin> 80-90% would have been correctly classified anyway if only you > Martin> hadn't started over. > > If you only train on mistakes and unsures (as many of us appear to do now), > then the effort is lessened. I don't see any practical benefit to training > on every Python-related message I receive as ham. I currently have about 20 > in my training database. 
If I was smart, I could probably figure out how to > reduce that number. As far as I can tell, nearly every valid Python-related > message I receive gets a ham score of 0.00 (rounded). None get scored as > unsure or spam. How long should I beat that particular dead horse? > > Since blowing away my gazillion message training database I've started from > scratch twice. Considering the volume of mail I get, getting back to a > 250-message training database is little effort at all for me. SpamBayes > seems to start scoring most stuff pretty well after seeing just a few hams > and spams, so the cost is minimal. The problem with spam is that it varies > all over the map (subject wise). My hams fall into just a few categories > though, so good messages begin to be correctly classified almost > immediately. Spam tends to linger in the unsure category must longer. My > current approach to that problem is to try and push my spam_cutoff down > further. > > If you want to seed a training database, you might try initially adding just > the most recent message from each of your active ham mailboxes. I could add > just ten messages and be almost certain they would all be useful indicators > of ham. Once I've added a few spams, I'd probably see pretty good > classification results. > > Given a 20k-message training database which contains mistakes, I will have a > hard time finding and correcting those mistakes. Your approach is to reduce > the magnitude of the mistakes by reducing the weight of the current training > database. I effectively take the same approach, it's just that I've > actually deleted the mistakes. I've thrown the baby out with the bath water > (you just shrink your babies ;-), but I get plenty of babies in my incoming > mail feed. If I'm careful, perhaps I'll avoid introducing the same mistakes > next time. > > Martin> Doesn't that sound like a good idea? > > I suppose. Mine doesn't require any new code to be written though. 
> > I'm really not saying your idea is bad, just that mine ought to be "good > enough" and requires no extra code to be written. I get your point. But for whatever reason, I am just much less tolerant than you of having to futz with the training database. Even if it isn't *perfect*, I feel better about shrinking those babies than throwing them out, since I really *hate* having to meet new babies. Okay, we've stretched that analogy far enough! > You should be able to > write a little Python script which will march through your database and > reduce the counts by appropriate amounts. You will have to be aware of a > couple corner conditions: > > * The counts for some words will round to zero. You have to decide > whether to keep them as hapaxes or delete them altogether. > > * Roundoff error might leave you with some assertion errors like the > dreaded > > assert hamcount <= nham > assert spamcount <= nspam > > You'll also have to take care to avoid that case. Ah, but you see: I'm too lazy to learn enough Python to get that to work. But if I ever do try, thanks for the pointers. > > One thing I tried in the past was to whack off the oldest 10%-20% of my > training database and retrain on the result. Hold it right there. Whack off? hehe hehehe hheheheheheehehe. > That's another option to try > to remove errors. If you as a trainer get better at your job, over time you > will also reduce the number of mistakes in your training database. This > approach also has the pleasant side effect of deleting old messages, keeping > your training data more current as the nature of spam shifts. If you > initially trained on a large body of saved mail though, you might wind up > whacking out many/most/all the clues pertaining to a particular subject area > and have to add some new messages in to compensate. Let's call it the "kill the oldest babies" method. I actually thought about that one first before I came up with the shrinking babies.
I figured that I would prefer shrinking them since I wouldn't usually know how much I liked those older babies. Aghhhhhhhhh babies! Thanks for the input, -Martin From sanjaydarisi at cox.net Mon Nov 17 21:14:25 2003 From: sanjaydarisi at cox.net (Sanjay Darisi) Date: Mon Nov 17 21:07:33 2003 Subject: [spambayes-dev] Accessing delivery time of an email message! Message-ID: <3FB98081.6040204@cox.net> If I want to access the time stamp on the email (Outlook), which property should I use? Is it PR_DELIVER_TIME that I need to use? It's a PT_SYSTIME type, So the documentation says that it is pyTime object. So, I tried using time.ctime(int(deliverytime)) and it complains ValueError: unconvertible time This is what i've done, In SB\Outlook2000\msgstore.py in class MAPIMsgStoreMsg, I added PR_DELIVER_TIME to message_init_props . And tag, deliverytime = prop_row[8] self.deliverytime = deliverytime And at the end, in the test() function for msg in folder.GetMessageGenerator(): print time.ctime(int(msg.deliverytime)) When I execute msgstore.py at the command prompt, it says ValueError: unconvertible time. Am I missing anything obvious? How can I access the sent/delivery time of an email message in outlook? Thank you in advance, Sanjay. From tim.one at comcast.net Tue Nov 18 00:33:30 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Nov 18 00:33:28 2003 Subject: [spambayes-dev] imbalance within ham or spam training sets? Message-ID: [Tim, quite a while ago] >> I'm not sure we've got the best guess >> to 17 significant digits, though . Make the imbalance wilder >> and the by-counting spamprob gets wilder too: >> >> >>> h = 1./20000 >> >>> s = 1./100 >> >>> s/(h+s) >> 0.99502487562189057 >> >>> >> >> That offends my intuition -- the word is so rare (2 of 20100 msgs) >> that it's hard to believe that 99.5% is a sane guess. 
The Bayesian >> adjustment knocks it down a lot based on how few times it's been >> seen in total: >> >> >>> (.45*.5 + 2.0*_)/(.45 + 2.0) 0.90410193928317584 >> >>> [Kenny Pitt] > Wow, that's interesting. I had always considered words that were > either ham or spam, but never a little of both. In a way it makes > sense because 1/20000 ham is so close to zero that the word should be > considered spammy. > > This seems even more scary, though. Compare your last example to the > case where the token has only been seen in 1 spam and no ham: > > >>> h = 0./20000 > >>> s = 1./100 > >>> s/(h+s) > 1.0 > >>> (.45*.5 + 1.*_)/(.45 + 1.) > 0.84482758620689669 > >>> > > The spam prob here is less than the case of 1 ham and 1 spam because > of the "rare word" adjustment. So, if the token has only been seen > once in spam and is later seen once in ham, it gets spammier? Yikes! > If we go to h=10: > > >>> h = 10./20000 > >>> s = 1./100 > >>> s/(h+s) > 0.95238095238095233 > >>> (.45*.5 + 11.*_)/(.45 + 11.) 0.93460178831357876 > > And the spam prob is still going up! So whenever we have an extreme > imbalance like this, the first n occurrences of a token added to the > larger corpus, where n depends on the size of the imbalance, actually > causes the probability of the *opposite* classification to *increase*. That's an excellent analysis, and I repeated it in full so that it's easier to find later . This is a systematically counterintuitive effect that's inevitable when working with highly unbalanced training data. I know Gary Robinson is thinking "about stuff like this" now, and I hope he has time to dream up a better way to cope. One gloss: > I had always considered words that were either ham or spam, but never a > little of both. Most words are like that! If you dig thru your entire database, and ignore hapaxes (words that appeared only once total across all training data), I bet you'll find that few appeared only in ham or only in spam. 
Ours is a "preponderance of evidence" scheme, not a "smoking gun" scheme. That's what makes it hard to fool (no fixed word, or even collection of words, is/are strong enough on their own to force a decision). From tim.one at comcast.net Tue Nov 18 10:53:21 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Nov 18 10:53:12 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: <200311170911.36239.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: [Toby Dickenson] > I occasionally see the inverse problem. I train on every email I > receive, including many hams containing lots of numbers like Jeremy > sent you. Occasionally I get a spam where 2 or 3 numbers (in a price > list, usually) are enough to classify it as ham. If you train on everything, and you get substantially more ham than spam, then your training data is unbalanced in a way that would (I think) push in that direction. > Would you have been as suprised by the same result if Jeremy had sent > you a long list of effectively random words? Yes, I'd expect that to tend toward unsure, given the way I've trained. I tried generating a random email like so: >>> f = file('/updates/word.lst') >>> d = dict.fromkeys(f) >>> len(d) 173528 >>> import random >>> for w in random.sample(d, 300): ... print w, and then pasting the result into an email. word.lst is just a list of English words, one per line. That wasn't particularly revealing: it scored as a low Unsure (22), but very few of the words had ever been trained on, so were simply ignored (for example, I had never trained on burkites, zemstvo, or morphallaxes before). The few words that remained were solidly hammy (compiler, initial) or solidly spammy (male, sexy), about the same number of each. What pushed it toward the ham side of unsure were the half-dozen header clues claiming that the message was sent from me, and to me using my real name. 
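The session above is Python 2 (file(), the print statement, sampling straight from a dict). A rough Python 3 equivalent might look like the sketch below; the function name random_word_message and the seed parameter are additions for illustration, and current Python 3 expects random.sample to be given a sequence, hence the list.

```python
import random


def random_word_message(lines, k, seed=None):
    """Return k distinct words, chosen at random from an iterable of
    one-word-per-line strings, joined into a single message body."""
    # De-duplicate and strip newlines; sorting makes a seeded run repeatable.
    vocab = sorted({line.strip() for line in lines if line.strip()})
    rng = random.Random(seed)
    return " ".join(rng.sample(vocab, k))


if __name__ == "__main__":
    # A tiny stand-in list; in practice, pass an open word-list file instead.
    sample_lines = ["compiler\n", "initial\n", "zemstvo\n", "burkites\n"]
    print(random_word_message(sample_lines, 3))
```

Passing an open file object works the same way, since iterating over a text file yields one line at a time.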
I tried again, boosting the # of random words to 3000, to try to stumble into more I'd actually trained on. As expected, that pushed it more toward exactly Unsure: Combined Score: 47% (0.465326) Internal ham score (*H*): 0.571772 Internal spam score (*S*): 0.502424 Little integers are different for me, because while they show up in tons of geek ham, I've trained on very little of that because that kind of stuff rarely scores above 1, and almost never scores above my ham cutoff of 20. So mistake-based training almost never trains on geek ham anymore. My non-geek friends don't write much about integers . From mhammond at skippinet.com.au Tue Nov 18 21:09:19 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Nov 18 21:09:03 2003 Subject: [spambayes-dev] RE: Accessing delivery time of an email message! In-Reply-To: <3FB98081.6040204@cox.net> Message-ID: <117801c3ae42$24c5e750$0500a8c0@eden> I'm not sure exactly what property you should use - there are a number of time related properties for a message. See dump_props.py, and fiddle with the code there - this dumps all the date properties for a message. Mark. > -----Original Message----- > From: Sanjay Darisi [mailto:sanjaydarisi@cox.net] > Sent: Tuesday, 18 November 2003 1:14 PM > To: spambayes-dev@python.org; mhammond@skippinet.com.au > Subject: Accessing delivery time of an email message! > > > > If I want to access the time stamp on the email (Outlook), which > property should I use? Is it PR_DELIVER_TIME that I need to > use? It's a > PT_SYSTIME type, So the documentation says that it is pyTime object. > So, I tried using time.ctime(int(deliverytime)) and it complains > ValueError: unconvertible time > > This is what i've done, > > In SB\Outlook2000\msgstore.py > > in class MAPIMsgStoreMsg, I added PR_DELIVER_TIME to > message_init_props > . 
And > > tag, deliverytime = prop_row[8] > > self.deliverytime = deliverytime > > And at the end, in the test() function > > for msg in folder.GetMessageGenerator(): > print time.ctime(int(msg.deliverytime)) > > When I execute msgstore.py at the command prompt, it says ValueError: > unconvertible time. Am I missing anything obvious? How can I > access the > sent/delivery time of an email message in outlook? > > Thank you in advance, > Sanjay. > > From mhammond at skippinet.com.au Tue Nov 18 21:59:28 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Nov 18 21:59:12 2003 Subject: [spambayes-dev] Can't move items that are in the results list froman Outlook Find when SpamBayes is installed In-Reply-To: <885BB3CAB85CBD44B73B52CFBC1FC55EAAF2D5@iu-mssg-mbx08.exchange.iu.edu> Message-ID: <118801c3ae49$26afb1c0$0500a8c0@eden> Hi, Please see the "troubleshooting guide" installed with SpamBayes for information on how to report a bug, ensuring you attach the relevant log files (also in the troubleshooting guide). Regards, Mark. > -----Original Message----- > From: spambayes-dev-bounces@python.org > [mailto:spambayes-dev-bounces@python.org]On Behalf Of Eckert, Robert D > Sent: Tuesday, 18 November 2003 3:22 AM > To: spambayes-dev@python.org > Cc: Eckert, Robert D > Subject: [spambayes-dev] Can't move items that are in the results list > froman Outlook Find when SpamBayes is installed > > > Hi, > I am using Outlook 2002 with an Exchange 2002 server when I work. > The copy of Outlook is locally installed on my PC which is running > Windows 2000 Professional. All software is up to date and patched. > > When I do "Find" operation on Inbox and get a results list, then > select all the items and attempt to drag and drop them into another > folder in my folder list, Outlook says: Can't move the items in a > dialog box. 
> > When SpamBayes (or Qurb before SpamBayes) is installed, the move > operation > fails, yet with SpamBayes is not installed, the operation succeeds > without > a problem. > > Can you address what is happening here? > > Thank you, > > An otherwise *very* satisfied SpamBayes user. > > -Bob > > > Bob Eckert - Principal Analyst > eckert@indiana.edu (812) 855-7209 - (812) 855-8299 Fax > Indiana University > University Information Technology Services > University Information Services > 2711 East 10th Street - Room 101.5 > Bloomington, IN 47408 > > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 2608 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/a5718a35/winmail.bin From kennypitt at hotmail.com Wed Nov 19 12:30:45 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Nov 19 12:31:14 2003 Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam bayes installation on Windows 2000pc Message-ID: tony.flury@bt.com wrote: > The plug fails to initialise - I attach the Logs files from 4 > attempts - all of which mention permissions problems > > <> <> <> > <> > > Any assistance would be useful And in reply to a request for more info... tony.flury@bt.com wrote: > Outlook is 2002 (SP-2) > > This is a new install into Outlook - outlook is running fine. Outlook > was installed clean onto this PC. > > Yes - the user I run under is not the Admin user. This user appears to be experiencing a scenario that I had some concerns about as I was testing the py2exe-based installer. It would be great if someone knowledgeable in this area could look into it. Unfortunately, that probably == Mark, as if he doesn't have enough to do. Here's what I think is happening. 
When we build the binary installer, we have win32com pre-generate the typelib wrappers (gen_py cache) and put them into the binary. We do this using the Outlook 2000 typelib, which has a typelib version of 9.0. At runtime, win32com checks to see if that same typelib version is installed. If not, it checks to see if it can substitute a typelib with a higher minor version number. Outlook 2002 (XP) has a typelib version of 9.1, and Outlook 2003 is version 9.2. In this case, win32com does not find the version 9.0 typelib but it does find the 9.1 typelib. In order to substitute the newer typelib, win32com attempts to regenerate the wrappers because they don't exist in the binary, and it attempts to write them into the app installation directory. If the user doesn't have admin privileges then this fails. Any thoughts on how we should handle this? Should we include multiple versions of the typelib wrappers? Can we force win32com to output to the user's temp directory? Am I maybe just missing the root cause entirely? -- Kenny Pitt -------------- next part -------------- A non-text attachment was scrubbed... Name: spambayes1.log Type: application/octet-stream Size: 2912 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes1.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: spambayes3.log Type: application/octet-stream Size: 3267 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes3.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: spambayes2.log Type: application/octet-stream Size: 3267 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes2.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: spambayes1.log Type: application/octet-stream Size: 2912 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes1-0001.obj From mhammond at skippinet.com.au Wed Nov 19 17:42:10 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Nov 19 17:41:54 2003 Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam bayesinstallation on Windows 2000pc In-Reply-To: Message-ID: <140101c3aeee$5ebf7140$0500a8c0@eden> > Here's what I think is happening. When we build the binary installer, > we have win32com pre-generate the typelib wrappers (gen_py cache) and > put them into the binary. We do this using the Outlook 2000 typelib, > which has a typelib version of 9.0. > > At runtime, win32com checks to see if that same typelib version is > installed. If not, it checks to see if it can substitute a > typelib with > a higher minor version number. Outlook 2002 (XP) has a > typelib version > of 9.1, and Outlook 2003 is version 9.2. > > In this case, win32com does not find the version 9.0 typelib > but it does > find the 9.1 typelib. In order to substitute the newer typelib, > win32com attempts to regenerate the wrappers because they > don't exist in > the binary, and it attempts to write them into the app installation > directory. If the user doesn't have admin privileges then this fails. > > Any thoughts on how we should handle this? Should we include multiple > versions of the typelib wrappers? Can we force win32com to output to > the user's temp directory? Am I maybe just missing the root cause > entirely? I think you are on the money. However, the world has shifted. Newer versions will be released using py2exe, and will have the "gencache" inside the .zip file. win32com will then consider it "read-only", and in the scenario you outline above *should* use the pre-generated 9.0 typelib (as it has the same minor version). My testing shows this to work fine with a 9.1 typelib. 
I *hope* that the logic still holds up when a 9.2 exists too . The tlb checking/validation code is pretty horrible and due for a cleanup (but I didn't write it this time ) Mark. From kennypitt at hotmail.com Wed Nov 19 17:49:28 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Nov 19 17:49:57 2003 Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam bayesinstallation on Windows 2000pc In-Reply-To: <140101c3aeee$5ebf7140$0500a8c0@eden> Message-ID: Mark Hammond wrote: >> In this case, win32com does not find the version 9.0 typelib >> but it does >> find the 9.1 typelib. In order to substitute the newer typelib, >> win32com attempts to regenerate the wrappers because they >> don't exist in >> the binary, and it attempts to write them into the app installation >> directory. If the user doesn't have admin privileges then this >> fails. > > I think you are on the money. However, the world has shifted. Newer > versions will be released using py2exe, and will have the "gencache" > inside the .zip file. win32com will then consider it "read-only", > and in the scenario you outline above *should* use the pre-generated > 9.0 typelib (as it has the same minor version). > > My testing shows this to work fine with a 9.1 typelib. I *hope* that > the logic still holds up when a 9.2 exists too . The tlb > checking/validation code is pretty horrible and due for a cleanup > (but I didn't write it this time ) I have all 3 typelibs on my system, so don't know if I'm getting the complete picture. When I run from the py2exe binary, I've never gotten an error and it hasn't generated any new wrapper classes into the dist\bin or dist\lib directories. I'll try removing any wrappers from site-packages\win32com\gen_py and my temp dir and run again to make sure nothing gets regenerated elsewhere. FYI, I installed the trial version of InBoxer and it has the same problem since it is based on the old plug-in binary mechanism. 
It created a support\gen_py subdirectory in the app dir and wrote new wrappers for my 9.2 typelib. -- Kenny Pitt From mhammond at skippinet.com.au Wed Nov 19 18:23:35 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Nov 19 18:23:16 2003 Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam bayesinstallation on Windows 2000pc In-Reply-To: Message-ID: <142101c3aef4$282f0810$0500a8c0@eden> Kenny: > I have all 3 typelibs on my system, so don't know if I'm getting the > complete picture. When I run from the py2exe binary, I've > never gotten > an error and it hasn't generated any new wrapper classes into the > dist\bin or dist\lib directories. Excellent - that is the correct behaviour. It should be exactly the same regardless of what typelibs you have installed - including *none* of them. The intent now is that win32com.gen_py knows it is frozen, so *never* attempts to load typelibs, for either generation or version checking purposes. This has performance advantages even in the usual case when the typelibs are all installed. > I'll try removing any wrappers from > site-packages\win32com\gen_py and my temp dir and run again > to make sure > nothing gets regenerated elsewhere. Hopefully your installed Python and the py2exe dll should not be able to conflict, even if they wanted to! If you can trick anything into failing that smells like it might be related, let me know. > FYI, I installed the trial version of InBoxer and it has the same > problem since it is based on the old plug-in binary mechanism. It > created a support\gen_py subdirectory in the app dir and wrote new > wrappers for my 9.2 typelib. Yeah, the "old" way sucks for a number of reasons :) Mark. From tim.one at comcast.net Wed Nov 19 21:38:20 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Nov 19 21:38:22 2003 Subject: [spambayes-dev] It's only money Message-ID: Anyone want to chip in $7,500.00 for this once-in-a-month opportunity? If so, just make your check out to me. 
couldn't-give-or-receive-a-finer-gift-ly y'rs - tim -------------- next part -------------- An embedded message was scrubbed... From: "Cristi Brown" Subject: [PSF-Board] InfoWorld Product Spotlight Advertising Program forSpamBayes Date: Wed, 19 Nov 2003 15:17:27 -0500 Size: 4173 Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/ad3cbdc7/attachment-0001.mht From anthony at interlink.com.au Wed Nov 19 22:24:30 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Wed Nov 19 22:24:59 2003 Subject: [spambayes-dev] It's only money In-Reply-To: Message-ID: <200311200324.hAK3OWXj027252@localhost.localdomain> >>> "Tim Peters" wrote > Anyone want to chip in $7,500.00 for this once-in-a-month opportunity? If > so, just make your check out to me. > > couldn't-give-or-receive-a-finer-gift-ly y'rs - tim Jeez. One thing the SB project's done for me is to make me realise that most of these PC review magazines are on the take. So much for the wall between editorial and advertising... From kennypitt at hotmail.com Thu Nov 20 10:14:59 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Nov 20 10:15:53 2003 Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam bayesinstallation on Windows 2000pc In-Reply-To: <142101c3aef4$282f0810$0500a8c0@eden> Message-ID: Mark Hammond wrote: > Kenny: >> I'll try removing any wrappers from >> site-packages\win32com\gen_py and my temp dir and run again >> to make sure >> nothing gets regenerated elsewhere. > > Hopefully your installed Python and the py2exe dll should not be able > to conflict, even if they wanted to! If you can trick anything into > failing that smells like it might be related, let me know. Just wanted to report the results of my test. Everything seems to be working as expected. I renamed my site-packages\win32com\gen_py dir (and recreated it with only the original __init__.py file to be safe). 
I also renamed the registry keys for the Outlook 2000 and Office 2000 typelib versions so that they wouldn't be found. I then registered the py2exe binary addin and loaded Outlook. SpamBayes ran fine, and no files were created in the dist\bin, dist\lib, or site-packages\win32com\gen_py directories.

--
Kenny Pitt

From skip at pobox.com Sat Nov 22 01:17:26 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Nov 22 01:17:31 2003
Subject: [spambayes-dev] more selective Received: header mining...
Message-ID: <16318.65398.374469.490455@montanaro.dyndns.org>

I made a change to the mine_received_headers stuff this evening, adding a new option, gateway_machines. The idea is that the only Received: header which is really useful is the one which crosses the boundary between your known "good" network and the wild free-for-all part of the net. Received: headers from hosts internal to your network are meaningless, since for the most part, all mail passes through them, while Received: headers from hosts external to your network probably just contain random garbage which clogs your database with meaningless tokens.

On the other hand, information from the point at which your mail system receives a message can be useful. You can trust your network's mail server to at least get the IP address of the delivering host. When processing Received: headers, I use the gateway_machines option (a regular expression) to detect when I first encounter an SMTP server I trust. I have four useful email addresses: skip@mojam.com, skip@pobox.com, skip@python.org and montanaro@users.sourceforge.net, so I set gateway_machines to

    mojam\.com|pobox\.com|python\.org|sourceforge\.net

The attached context diff implements the change. If you leave gateway_machines an empty string, mine_received_headers will have its original meaning. If you set it to something, it will cause only the earliest Received: header which matches your regular expression to be processed.
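
[Archive note] The attached diff was scrubbed from the archive, so the following is only a minimal sketch of the selection rule described above, not the actual patch. The function name received_headers_to_mine is hypothetical, and reading "earliest" as "the first matching header in the order the headers appear in the message" is an assumption.

```python
import email
import re

def received_headers_to_mine(raw_message, gateway_machines):
    """Pick which Received: header values to tokenize.

    With an empty `gateway_machines`, every Received: header is returned
    (the pre-existing behaviour). Otherwise only the first header, in
    message order, that matches the regular expression is returned.
    """
    msg = email.message_from_string(raw_message)
    headers = msg.get_all("Received", [])
    if not gateway_machines:
        return headers
    pattern = re.compile(gateway_machines, re.IGNORECASE)
    for value in headers:
        if pattern.search(value):
            return [value]
    # No trusted gateway seen: nothing worth mining.
    return []

if __name__ == "__main__":
    raw = (
        "Received: from inside.example by mailhost.example\n"
        "Received: from unknown.example by mx.pobox.com\n"
        "Received: from made.up.example by relay.example\n"
        "Subject: test\n"
        "\n"
        "body\n"
    )
    regex = r"mojam\.com|pobox\.com|python\.org|sourceforge\.net"
    print(received_headers_to_mine(raw, regex))
```

Returning a list in both cases keeps the calling code the same whether or not the option is set.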
It's hard to tell how well this will work, since improvements are necessarily very small at this stage of the game. It certainly seems like it might be time-sensitive. Machines which were open relays a year ago may be closed off now, forcing spammers to use different routes to your mailbox. I'm thinking it might be more helpful with small training databases and small messages, as it adds more relevant clues for the classifier to munch on.

My only testing to this point has been to see how it does on my current unsure mailbox. At the moment it contains about 50 messages, a mixture of ham and spam (though mostly spam) which all scored unsure when they landed there and which for one reason or another I have yet to delete or save somewhere else. Before enabling gateway_machines no messages scored as spam. After setting it to the above regex and retraining from scratch (~170 hams and 250 spams), three more messages from my unsure mailbox scored as spam. Not surprisingly, the number of 'received:' records in my training database dropped substantially (from 2289 to 1254) after enabling this.

Finally, note that the couple of context diffs here were pulled out of already modified versions of tokenizer.py and Options.py, so patch will probably apply them with offsets.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diffs
Type: application/octet-stream
Size: 5414 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031122/b4485c26/sb.obj

From skip at pobox.com Sat Nov 22 23:40:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Nov 22 23:40:50 2003
Subject: [spambayes-dev] "X-" as a prefix for experimental options
Message-ID: <16320.14909.793055.342411@montanaro.dyndns.org>

I think the easiest way for people to play with new options is if they are in CVS instead of having to apply patches.
Posting context diffs doesn't seem to be yielding a stampede of testers for several (trivial, perhaps) recent ideas. As an alternative, I propose that experimental options be simply incorporated into CVS with an "X-" prefix on the option name (e.g., ["Tokenizer", "X-gateway_machines"]) and that they always be off by default. This allows a couple things to happen:

  * They would be more easily available to early adopters who might not
    have the usual facility we've come to expect with cvs and patch(1).
    As the Outlook plugin-using population continues to grow, the relative
    number of cvs-and-patch aficionados will dwindle.

  * They could be documented as experimental and included in a SpamBayes
    release.

  * User interfaces like sb_server.py or the Outlook plugin could
    recognize such options and display them in a distinctive manner which
    makes it clear they are experimental, and possibly even solicit
    feedback on them (particularly if such applications could report some
    relevant statistics where warranted).

  * Elevating such options to non-experimental status only requires
    removing the "X-" prefix from that option's use in distributed code.
    Instances of the "X-" prefixed names which remain in options files
    might elicit a warning, but still serve to set the now
    non-experimental option value.

  * The options parser could warn (but not fatally) about option file
    settings that have "X-" prefixes which don't correspond to actual
    options. This way, the code which implements them could be ripped out
    if they are deemed not useful without fear that programs which use
    them will begin to fail, possibly silently in the case of
    non-interactive use.
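
[Archive note] The last two points of the proposal amount to a small lookup rule in the options parser. The sketch below is one possible reading of it, not code from SpamBayes: the function apply_option and the plain dict keyed by (section, name) are hypothetical stand-ins for the real options machinery.

```python
import warnings

def apply_option(known, section, name, value):
    """Set an option from an options file, tolerating stale "X-" names.

    `known` maps (section, name) to the current value of every option the
    code actually implements. Returns True if a value was set.
    """
    key = (section, name)
    if key in known:
        known[key] = value
        return True
    if name.startswith("X-"):
        promoted = (section, name[2:])
        if promoted in known:
            # The option has graduated: warn, but still honour the setting.
            warnings.warn("%s:%s is no longer experimental; use %s"
                          % (section, name, name[2:]))
            known[promoted] = value
            return True
        # The experiment was removed: warn, but do not fail.
        warnings.warn("ignoring unknown experimental option %s:%s"
                      % (section, name))
        return False
    raise KeyError("unknown option %s:%s" % (section, name))

if __name__ == "__main__":
    opts = {("Tokenizer", "gateway_machines"): ""}
    apply_option(opts, "Tokenizer", "X-gateway_machines", r"pobox\.com")
    print(opts)
```

Only names with the "X-" prefix get the lenient treatment here; a misspelled ordinary option still raises, so typos in stable settings are not silently ignored.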
Skip From mhammond at skippinet.com.au Sun Nov 23 00:55:40 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sun Nov 23 00:55:51 2003 Subject: [spambayes-dev] "X-" as a prefix for experimental options In-Reply-To: <16320.14909.793055.342411@montanaro.dyndns.org> Message-ID: <194801c3b186$6d8bd340$0500a8c0@eden> A problem I see is that the users will have no way of measuring any changes. The binaries don't come with any of the test tools, and relying on lots of people giving subjective results doesn't seem useful. I think we need some kind of better, application based testing framework first. The scripts we use now predate all of the applications, and I can never remember how to run them. If I could just get a test tool to run directly over Outlook folders, we would be much closer (for Outlook anyway ). This needn't be too hard - just abstracting the test tools a little so they allow sub-classes to extract the actual message streams for the test runs. Ultimately, we end up with a simple way for either Outlook or sb_server to run tests over the training sets, and report succinct results. Otherwise, I doubt anything will change in terms of the number of *users* running tests (let alone developers ) Mark. > -----Original Message----- > From: spambayes-dev-bounces@python.org > [mailto:spambayes-dev-bounces@python.org]On Behalf Of Skip Montanaro > Sent: Sunday, 23 November 2003 3:40 PM > To: spambayes-dev@python.org > Subject: [spambayes-dev] "X-" as a prefix for experimental options > > > > I think the easiest way for people to play with new options > is if they are > in CVS instead of having to apply patches. Posting context > diffs doesn't > seem to be yielding a stampede of testers for several > (trivial, perhaps) > recent ideas. As an alternative, I propose that experimental > options be > simply incorporated into CVS with an "X-" prefix on the > option name (e.g., > ["Tokenizer", "X-gateway_machines"]) and that they always be > off by default. 
> This allows a couple things to happen: > > * They would be more easily available to early adopters > who might not > have the usual facility we've come to expect with cvs > and patch(1). > As the Outlook plugin-using population continues to > grow, the relative > number of cvs-and-patch aficianados will dwindle. > > * They could documented as experimental and included in a > SpamBayes > release. > > * User interfaces like sb_server.py or the Outlook plugin could > recognize such options and display them in a > distinctive manner which > makes it clear they are experimental, and possibly even solicit > feedback on them (particularly if such applications > could report some > relevant statistics where warranted). > > * Elevating such options to non-experimental status only requires > removing the "X-" prefix from that option's use in > distributed code. > Instances of the "X-" prefixed names which remain in > options files > might elicit a warning, but still serve to set the now > non-experimental option value. > > * The options parser could warn (but not fatally) about > option file > settings that have "X-" prefixes which don't correspond > to actual > options. This way, the code which implements them > could be ripped out > if they are deemed not useful without fear that > programs which use > them will begin to fail, possibly silently in the case of > non-interactive use. 
> > Skip > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev From skip at pobox.com Sun Nov 23 08:59:22 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Nov 23 08:59:33 2003 Subject: [spambayes-dev] "X-" as a prefix for experimental options In-Reply-To: <194801c3b186$6d8bd340$0500a8c0@eden> References: <16320.14909.793055.342411@montanaro.dyndns.org> <194801c3b186$6d8bd340$0500a8c0@eden> Message-ID: <16320.48442.882585.738466@montanaro.dyndns.org> Mark> If I could just get a test tool to run directly over Outlook Mark> folders, we would be much closer (for Outlook anyway ). Mark> This needn't be too hard - just abstracting the test tools a Mark> little so they allow sub-classes to extract the actual message Mark> streams for the test runs. You may be able to extend mboxutils.getmbox() to handle Outlook folders. That would allow many tools (though not timcv.py) to handle them automagically. timcv.py may need more than sequential access to the messages, however. I've never looked at it. Mark> Ultimately, we end up with a simple way for either Outlook or Mark> sb_server to run tests over the training sets, and report succinct Mark> results. Otherwise, I doubt anything will change in terms of the Mark> number of *users* running tests (let alone developers ) Yes, it would be nice to move in that direction. PEP time? Skip From bernie at pobox.com Mon Nov 24 05:47:52 2003 From: bernie at pobox.com (Bernard Payne) Date: Mon Nov 24 05:47:53 2003 Subject: [spambayes-dev] Reviewing trained messages Message-ID: <00a801c3b278$68f719f0$30132352@nec> Hi - Could you consider an entry in the FAQ section on how to access the database of messages which have been trained - i.e. in the case that you misclassify a message and want to sort it out? 
In my case I have a message that was correctly classified, but it seems like it was not delivered to my Outlook Express inbox - a little odd I know, but if I could access the "already trained" database I could see if this is really the case or not. I use the web interface pointing to localhost:8800 to review & train messages, in case this is relevant. Thanks...Bernie Payne -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031124/212d4cfb/attachment.html From barry at python.org Mon Nov 24 09:53:37 2003 From: barry at python.org (Barry Warsaw) Date: Mon Nov 24 09:53:40 2003 Subject: [spambayes-dev] Three patches for better Evolution integration Message-ID: <1069685616.2365.17.camel@geddy> I finally spent some time this weekend trying to integrate sb and Ximian Evolution (I use 1.4.5 as my primary mail reader). As others have pointed out before, Evolution lets you create a filter, one of the criteria of which can be "pipe message to shell script". Then Evolution can match the exit code of the script to determine what to do with the message. So I use sb_imapfilter.py to train a local database, then I run sb_xmlrpcserver.py to do the scoring against this database. I wrote a small client called sb_score.py which basically just pipes stdin to the server, calls XMLHammie.score() and compares the float return value against spam_cutoff and ham_cutoff. The script returns 0 for ham, 1 for unsure, and 2 for spam (also -1 if there's an error). Currently, I'm just moving spam to a separate folder, leaving ham and unsure in my inbox. I'll probably refine that soon to move unsures as well. I've uploaded three patches to SF. 848311 is a small patch to sb_imapfilter.py so that it honors html_ui::launch_browser when the -b option is given. I wanted it to start the web server and not start a browser, but there didn't seem to be any way to make this happen. 
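
[Archive note] The exit-code convention described for sb_score.py earlier in this message can be sketched as below. This is not the uploaded script; the function name, the default cutoff values, and the treatment of scores exactly on a cutoff are assumptions, and the XML-RPC call that fetches the score is left out.

```python
HAM, UNSURE, SPAM = 0, 1, 2

def exit_code_for_score(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a classifier score in [0.0, 1.0] to a filter exit code.

    The cutoff defaults are illustrative; in practice they would come
    from the same options the scoring server uses.
    """
    if score < ham_cutoff:
        return HAM
    if score < spam_cutoff:
        return UNSURE
    return SPAM

if __name__ == "__main__":
    for s in (0.01, 0.47, 0.99):
        print(s, exit_code_for_score(s))
```

A mail client filter can then branch on the exit status, for example moving anything with status 2 into a spam folder and leaving 0 and 1 in the inbox.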
In 848314 I had to make several changes to sb_xmlrpcserver.py to 1) make the socket reusable, 2) fix XMLHammie.score(). The latter method was trying to wrap the float return value in a Binary, but that's both broken and unnecessary . Now it just returns the float directly. 848319 is my sb_score.py script. I don't like sb_imapfilter.py's tendency to create copies of messages (with one marked deleted) in my ham and spam training folders. I vaguely remember some discussion about this and I'm not sure if it's fixable or not (I'm using uw_imap -- yeah, yeah, I know, I know). I may try, but if I fail, I'll probably just rsync over those two folders and do an mbox train on them. Evolution is applying the filter and moving the message, and that all looks good. Evolution doesn't seem any slower . I had two problems though. As I was getting things going, I'd stop my xmlrpc server and retrain, then restart the server. This seemed to give Evolution fits, which it spitting up cryptic error messages and forcing a restart. This only happens occasionally though, but definitely seems to be related to my training regimen. It also wasn't doing a very good job of classifying messages. Of course, maybe that had something to do with bugs in my bayescustomize.ini file where I swapped my ham and spam training folders ;). I've fixed that now, blown away my database, retrained, and now am awaiting the daily flood of messages, both tasty and rancid. -Barry From jens.rantil at telia.com Mon Nov 24 16:13:34 2003 From: jens.rantil at telia.com (Jens Rantil) Date: Mon Nov 24 16:14:46 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: References: Message-ID: <20031124221334.0d4cfb22.jens.rantil@telia.com> Hi Richie, I am one of those passive readers at this forum =) and finally have something to say... On Sat, 15 Nov 2003 16:57:27 +0000 Richie Hindle wrote: > We'd love to have international versions, though there are a lot of issues > involved. 
I don't mean to put you off the idea, or to imply that we're > not prepared to put effort into this, but these things need taking into > account... > > Many (most?) of the English strings in SpamBayes are mixed in with the > code. Taking the source code as it is an translating the strings into > Spanish would be unmaintainable - we'd have two entirely separate versions > of the code, and any edits would have to be applied to both. So the first > job to do would be to pull out all those hard-coded strings into a > language file. That's not a huge job, and one that any computer-literate > person could probably do 95% of, even if they weren't a programmer. Still > more effort than simply translating a collection of English phrases into > Spanish, though. No one here seems to have mentioned the GNU gettext project. Why not try to integrate SB with gettext instead? I believe that should be the best solution...and if no one is happy doing it I might have a look at it, however, I can't promise having time. =) See http://www.python.org/doc/current/lib/module-gettext.html and http://www.gnu.org/software/gettext/gettext.html for more info. Regards, Jens Rantil, a fan of yours =) PS. When reading thrue all the messages in the user mailing list I find that many of the questions concerning the outlook plugin aren't replied. I would suggest adding at least a line in the FAQ on how to set the logging frequency and where to find the log file so that it can be attached to the mails. Perhaps that would help a lot in solving some of the errors which seems to circumvent in the plugin? DS. From kennypitt at hotmail.com Mon Nov 24 16:33:56 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Nov 24 16:34:49 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: <20031124221334.0d4cfb22.jens.rantil@telia.com> Message-ID: Jens Rantil wrote: > Hi Richie, > I am one of those passive readers at this forum =) and finally have > something to say... 
> > On Sat, 15 Nov 2003 16:57:27 +0000 > Richie Hindle wrote: > >> We'd love to have international versions, though there are a lot of >> issues involved. I don't mean to put you off the idea, or to imply >> that we're not prepared to put effort into this, but these things >> need taking into account... > > No one here seems to have mentioned the GNU gettext project. Why not > try to integrate SB with gettext instead? IIRC, the GNU license is not compatible with the SpamBayes/PSF license. -- Kenny Pitt From barry at python.org Mon Nov 24 16:59:30 2003 From: barry at python.org (Barry Warsaw) Date: Mon Nov 24 16:59:37 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: References: Message-ID: <1069711169.10090.5.camel@anthem> On Mon, 2003-11-24 at 16:33, Kenny Pitt wrote: > > No one here seems to have mentioned the GNU gettext project. Why not > > try to integrate SB with gettext instead? > > IIRC, the GNU license is not compatible with the SpamBayes/PSF license. The GPL and the PSF license are compatible. -Barry From mhammond at skippinet.com.au Mon Nov 24 17:03:14 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Nov 24 17:04:15 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: <1069711169.10090.5.camel@anthem> Message-ID: <1cc901c3b2d6$c21f6a10$0500a8c0@eden> > On Mon, 2003-11-24 at 16:33, Kenny Pitt wrote: > > > > No one here seems to have mentioned the GNU gettext > project. Why not > > > try to integrate SB with gettext instead? > > > > IIRC, the GNU license is not compatible with the > SpamBayes/PSF license. > > The GPL and the PSF license are compatible. Here we go again :) Wouldn't it mean that we must release SB under the GPL? That is what my understanding of Python being "compatible" means - Python can be released in a GPL'd project, but not the other way around - using GNU with Python doesn't allow us to re-licence the GNU stuff. Mark. 
From tim.one at comcast.net Mon Nov 24 17:19:49 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 24 17:19:57 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: <20031124221334.0d4cfb22.jens.rantil@telia.com> Message-ID: [Jens Rantil] > ... > PS. When reading thrue all the messages in the user mailing list I > find that many of the questions concerning the outlook plugin aren't > replied. Yup, and I blame the users . Seriously, there's exactly one person with a deep understanding of the Outlook plugin internals, against more than a hundred thousand people who have downloaded it. Mark couldn't keep up with the stream of questions even if he were paid to work on it full-time (and, of course, he's not paid to work on it at all -- it has to come out of his spare time). So users with problems are going to have to help each other solve them. > I would suggest adding at least a line in the FAQ on how to > set the logging frequency and where to find the log file so that > it can be attached to the mails. Suggesting more work for Mark to do probably isn't going to help. If you think you can write FAQ entries that would help people, please do! > Perhaps that would help a lot in solving some of the errors which > seems to circumvent in the plugin? Alas, quite possibly not. From all I can tell, the plugin works fine for the vast majority of people who install it. The cases where it doesn't work are the ones we hear about, and if the developers had ever seen these failures themselves, they would already be fixed. Most people seem able to find the log (the troubleshooting guide already contains lots of info about exactly where to find it); a problem is that the logs don't seem to be much help in resolving the problems that remain. 
From barry at python.org Mon Nov 24 17:24:21 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 17:24:27 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <1cc901c3b2d6$c21f6a10$0500a8c0@eden>
References: <1cc901c3b2d6$c21f6a10$0500a8c0@eden>
Message-ID: <1069712660.10090.18.camel@anthem>

On Mon, 2003-11-24 at 17:03, Mark Hammond wrote:

> > On Mon, 2003-11-24 at 16:33, Kenny Pitt wrote:
> > > > No one here seems to have mentioned the GNU gettext project. Why not
> > > > try to integrate SB with gettext instead?
> > >
> > > IIRC, the GNU license is not compatible with the SpamBayes/PSF license.
> >
> > The GPL and the PSF license are compatible.
>
> Here we go again :) Wouldn't it mean that we must release SB under the GPL?
> That is what my understanding of Python being "compatible" means - Python
> can be released in a GPL'd project, but not the other way around - using GNU
> with Python doesn't allow us to re-licence the GNU stuff.

I haven't been following this thread, but in this specific example, I
doubt it's necessary even to worry about it. Python has its own gettext
implementation that doesn't share any code with GNU gettext. In fact,
Python's class-based API works better for Python code than the classic
gettext API.

ducking-ly y'rs,
-Barry

From jens.rantil at telia.com Mon Nov 24 17:29:21 2003
From: jens.rantil at telia.com (Jens Rantil)
Date: Mon Nov 24 17:30:55 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <20031124221334.0d4cfb22.jens.rantil@telia.com>
References: <20031124221334.0d4cfb22.jens.rantil@telia.com>
Message-ID: <20031124232922.790e7369.jens.rantil@telia.com>

Once again,

On Mon, 24 Nov 2003 22:13:34 +0100 Jens Rantil wrote:
> No one here seems to have mentioned the GNU gettext project.

Also...if there was such an implementation I would gratefully add a
Swedish translation.
=) /Jens From richie at entrian.com Mon Nov 24 17:31:10 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Nov 24 17:31:42 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: <1cc901c3b2d6$c21f6a10$0500a8c0@eden> References: <1069711169.10090.5.camel@anthem> <1cc901c3b2d6$c21f6a10$0500a8c0@eden> Message-ID: [Jens] > No one here seems to have mentioned the GNU gettext project. [Kenny] > IIRC, the GNU license is not compatible with the SpamBayes/PSF license. [Barry] > The GPL and the PSF license are compatible. [Mark] > Wouldn't it mean that we must release SB under the GPL? I think that's a red herring. I'm sure someone will correct me if I'm wrong, but I believe Python's gettext support does not include any GNU code. It's a re-implementation of a subset of the GNU system in Python (plus extra bits) and is PSF-licensed along with the rest of Python. Certainly it ships with Python (unlike readline, for example). Jens: I don't know anything about gettext apart from its existence 8-) and the basic idea. It would be fine for strings in source code (and was the kind of job I meant when I said "any computer-literate person could probably do 95% of [it]"). It certainly couldn't be used for the Windows dialogs, and probably not for the HTML (or could it? - you seem to know more about it than I do). > ...a line in the FAQ on how to set the logging frequency... I'll let one of the Outlook guys field that one - the Outlook plugin is something else whose existence is pretty much the sum total of my knowledge. 8-) -- Richie Hindle richie@entrian.com From tameyer at ihug.co.nz Mon Nov 24 17:51:32 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Nov 24 17:51:47 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315108@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E3@its-xchg4.massey.ac.nz> > PS. 
When reading thrue all the messages in the user mailing > list I find that many of the questions concerning the outlook > plugin aren't replied. I don't reply to Outlook plug-in messages when I'm busy with other things because: * I know that the FAQ, help pages, or recent archives have the answer to the question, * The user hasn't included enough information, even though the help pages say exactly what needs to be included (including finding the log and including it), * There's an open bug report about it already (so the info should be added to that), or * I don't know the answer. The people here volunteer their own time to help out with this project, so it's not too much to expect that the users put some effort into asking for help. The first three of those reasons are the user's fault. If info is too hard to find, then a message about *that* would be welcome, and probably dealt with reasonably promptly. > I would suggest adding at least a line > in the FAQ on how to set the logging frequency and where to > find the log file so that it can be attached to the mails. The Outlook help pages do this - does it need to be in the FAQ as well? (If the users don't read the help pages, will they read the FAQ?). =Tony Meyer From barry at python.org Mon Nov 24 18:02:29 2003 From: barry at python.org (Barry Warsaw) Date: Mon Nov 24 18:02:41 2003 Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N In-Reply-To: References: <1069711169.10090.5.camel@anthem> <1cc901c3b2d6$c21f6a10$0500a8c0@eden> Message-ID: <1069714948.7132.0.camel@anthem> On Mon, 2003-11-24 at 17:31, Richie Hindle wrote: > I think that's a red herring. I'm sure someone will correct me if I'm > wrong, but I believe Python's gettext support does not include any GNU > code. It's a re-implementation of a subset of the GNU system in Python > (plus extra bits) and is PSF-licensed along with the rest of Python. Correct. 
-Barry From barry at python.org Mon Nov 24 21:22:05 2003 From: barry at python.org (Barry Warsaw) Date: Mon Nov 24 21:22:16 2003 Subject: [spambayes-dev] Re: Three patches for better Evolution integration In-Reply-To: <1069685616.2365.17.camel@geddy> References: <1069685616.2365.17.camel@geddy> Message-ID: <1069726924.7132.14.camel@anthem> On Mon, 2003-11-24 at 09:53, Barry Warsaw wrote: > I've uploaded three patches to SF. I think these patches are pretty stable now. Evolution is very nicely filtering spam and unsures. Anybody mind if I check these into the cvs head? (Anthony thought it would be okay.) -Barry From tameyer at ihug.co.nz Mon Nov 24 21:43:33 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Nov 24 21:43:44 2003 Subject: [spambayes-dev] Re: Three patches for better Evolution integration In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13043151BA@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E4@its-xchg4.massey.ac.nz> > I think these patches are pretty stable now. Evolution is > very nicely filtering spam and unsures. Anybody mind if I > check these into the cvs head? (Anthony thought it would be okay.) Quickly adding comments, both here and to the trackers... #848319: sb_score.py - would sb_xmlscore.py be a better name? (I don't care enough to debate this if you don't agree ;) so +1 to checking this in). #848314: +1 to checking this in. #848311: At the moment if you run "sb_imapfilter.py", you get the web server with no browser launched. I would prefer a different fix to add following the config option, so that it matches sb_server more closely. Patch attached to tracker. =Tony Meyer From tim.one at comcast.net Mon Nov 24 21:49:54 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Nov 24 21:50:05 2003 Subject: [spambayes-dev] Re: Three patches for better Evolution integration In-Reply-To: <1069726924.7132.14.camel@anthem> Message-ID: [Barry Warsaw] > I've uploaded three patches to SF. 
>
> I think these patches are pretty stable now. Evolution is very nicely
> filtering spam and unsures. Anybody mind if I check these into the
> cvs head? (Anthony thought it would be okay.)

Well, you're a SpamBayes project admin, so if you can't check something
in, you're sorely in need of learning to abuse your power! Sounds fine
to me.

For those who don't know, Ximian's Evolution is a free email client for
Linux and Solaris, obviously aiming to mimic Outlook look-&-feel (but,
one hopes, not Outlook's variety of baffling bugs). If Barry keeps
honking on this day and night for the next month, Ximian might be half
as pleasant to use with SpamBayes as Mark's Outlook addin.

From tameyer at ihug.co.nz Mon Nov 24 21:56:38 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Nov 24 21:56:46 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304314FFA@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz>

> I don't like sb_imapfilter.py's tendency to create copies of
> messages (with one marked deleted) in my ham and spam
> training folders. I vaguely remember some discussion about
> this and I'm not sure if it's fixable or not (I'm using
> uw_imap -- yeah, yeah, I know, I know). I may try, but if I
> fail, I'll probably just rsync over those two folders and do
> an mbox train on them.

The reason it does that is to mark the messages with an id so that it
can identify them in future (by adding an "X-SpamBayes-ID" header). IMAP
doesn't let you modify a message (or even move one), so the filter makes
a copy instead. IMAP has ids of its own (one for the message and one for
the folder), but they're not guaranteed to be permanent, and there were
early problems because with some servers they aren't (the best that the
spec offers is to let you know when the ids will be all wrong).
The reason for marking the messages is so that they aren't continually
trained (hence also the reliance on the 'message info' db). If you can
come up with a way around this, that would be fantastic, and make
imapfilter a lot simpler. If you can't be bothered trying, and have
access to the mail via something other than IMAP, then yes, that would
be much easier.

=Tony Meyer

From barry at python.org Mon Nov 24 23:05:28 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 23:05:38 2003
Subject: [spambayes-dev] Re: Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E4@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E4@its-xchg4.massey.ac.nz>
Message-ID: <1069733127.31869.15.camel@anthem>

On Tue, 2003-11-25 at 10:43, Tony Meyer wrote:

> #848319: sb_score.py - would sb_xmlscore.py be a better name? (I don't care
> enough to debate this if you don't agree ;) so +1 to checking this in).

I went with sb_evoscore.py for reasons given in the tracker. :)

> #848314: +1 to checking this in.

Done, thanks.

> #848311: At the moment if you run "sb_imapfilter.py", you get the web server
> with no browser launched. I would prefer a different fix to add following
> the config option, so that it matches sb_server more closely. Patch
> attached to tracker.

Here's the problem. When I start sb_imapfilter.py with no options, I get
an error saying "You need to specify both a server and a username".
Since I didn't need to when I used -b, I don't know why I should need to
do that now.

The basic problem seems to be that launchUI is used both to determine
whether the server gets started and whether the browser gets started.
The alternative patch doesn't fix this, so it doesn't help me too much.
But I'm not sure what the intent is so I won't make any changes to this
script for now.

Thanks!
-Barry From barry at python.org Mon Nov 24 23:08:26 2003 From: barry at python.org (Barry Warsaw) Date: Mon Nov 24 23:08:32 2003 Subject: [spambayes-dev] Three patches for better Evolution integration In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz> Message-ID: <1069733305.31869.19.camel@anthem> On Tue, 2003-11-25 at 10:56, Tony Meyer wrote: > The reason it does that is to mark the messages with an id so that it can > identify them in future (by adding an "X-SpamBayes-ID" header). Why won't Message-ID work for this? I know that's not guaranteed to be unique, or even present on the messages in the imap server, but they'll /probably/ exist, and they'll /probably/ be unique, so it might be good enough. Alternatively, you could fingerprint some part of the message that won't change and then store the fingerprint in the message database. -Barry From tameyer at ihug.co.nz Tue Nov 25 00:29:12 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Nov 25 00:29:20 2003 Subject: [spambayes-dev] Three patches for better Evolution integration In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13043151E0@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E6@its-xchg4.massey.ac.nz> [Tony] > The reason it does that is to mark the messages with an id so that it > can identify them in future (by adding an "X-SpamBayes-ID" header). [Barry] > Why won't Message-ID work for this? I know that's not > guaranteed to be unique, or even present on the messages in > the imap server, but they'll /probably/ exist, and they'll > /probably/ be unique, so it might be good enough. I considered this early on, and it wouldn't be all that difficult to change to it (the 'message info' db doesn't care what key it's given). It seemed simpler to just go with something definite, rather than 'if there is a message id, use it, if not, then do something else'. 
(As long as duplicates are very rare, I don't care about that, since it would just mean that a message was wrongly ignored). Would you be willing to guess what percentage of messages have a message-id? I thought that maybe the number of tokens in my db that start with "message-id" would be an indication, but that's only 900/3779, which seems far too low. (A portion of those messages never left the Exchange server, so don't have anything much in the way of headers, but not that many). > Alternatively, you could fingerprint some part of the message > that won't change and then store the fingerprint in the > message database. I considered this, too, but wasn't sure whether it would work or not. The message can't really change (apart from the IMAP id and flags), but I wasn't sure how much of the message I would need to use to make sure that it was unique, what the best way of fingerprinting would be, or anything much, really . I think (depending on the results of "probably", above), this is the best way to go, if anyone does want to implement it. Basically what it comes down to is that (TimS and) I wrote imapfilter because we were sick of answering requests for it (and as an exercise). If someone were to say "I'll take over imapfilter completely", I would be very happy :). As it is, I doubt that I'll get around to doing much in the way of improvements (although I'm happy to work on bugs). Hopefully someone with the time & interest will come along at some point in the not-too-distant future. =Tony Meyer From tameyer at ihug.co.nz Tue Nov 25 01:00:36 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Nov 25 01:00:46 2003 Subject: [spambayes-dev] Re: Three patches for better Evolutionintegration In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315211@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E8@its-xchg4.massey.ac.nz> [Barry] > Here's the problem. 
When I start sb_imapfilter.py with no > options, I get an error saying "You need to specify both a > server and a username". This is a bug hanging over from old code. I'll put a new patch on the tracker. > The basic problem seems to be that launchUI is used both to > determine whether the server gets started and whether the > browser gets started. It looks like this, although that's not actually the case (whatever the value of launchUI is, if neither doClassify or doTrain is true, the server is started). Again, hangover from older code. I'll include fixing this in the new patch. > The alternative patch doesn't fix this, so it doesn't help me > too much. > But I'm not sure what the intent is so I won't make any > changes to this script for now. When you have a chance, if you could take a look at the revised patch and see if it meets your needs, that would be great :) =Tony Meyer From sjoerd at acm.org Tue Nov 25 05:00:05 2003 From: sjoerd at acm.org (Sjoerd Mullender) Date: Tue Nov 25 05:00:29 2003 Subject: [spambayes-dev] Three patches for better Evolution integration In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz> Message-ID: <3FC32825.9000600@acm.org> Tony Meyer wrote: > The reason for marking the messages is so that they aren't continually > trained (hence also the reliance on the 'message info' db). If you can come > up with a way around this, that would be fantastic, and make imapfilter a > lot simpler. If you can't be bothered trying, and have access to the mail > via something other than IMAP, then yes, that would be much easier. I was thinking, could you use the IMAP command to add a flag to a message, such as STORE +FLAGS Classified and STORE +FLAGS Trained. You can select messages with SEARCH KEYWORD Classified. 
You wouldn't have to change the message, so you don't need to make
copies (unless of course you have to move the message to a different
folder).

--
Sjoerd Mullender

From tameyer at ihug.co.nz Tue Nov 25 18:15:26 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Nov 25 18:15:06 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13043152C9@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B191@its-xchg4.massey.ac.nz>

[Sjoerd Mullender]
> I was thinking, could you use the IMAP command to add a flag to a
> message, such as STORE +FLAGS Classified and STORE +FLAGS
> Trained. You can select messages with SEARCH KEYWORD Classified. You
> wouldn't have to change the message, so you don't need to make
> copies (unless of course you have to move the message to a
> different folder).

The problem is that not all IMAP servers allow you to store arbitrary
flags (at least this is how the RFC reads to me; correct me if I'm
wrong). So this would mean we only support a subset of IMAP servers.
Again, if someone decides that this is the way to go, I don't really
care, but likewise, I don't care enough to do it myself.

=Tony Meyer

From richie at entrian.com Tue Nov 25 19:05:17 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov 25 19:06:16 2003
Subject: [spambayes-dev] RE: Bug in UserInterface.py
In-Reply-To:
References:
Message-ID:

[Mats]
> I have a minor bug that triggers when you have enabled
> header_score_logarithm, have a spam probability of more than
> 0.995 (or less than 0.005) and try to view clues from the
> review page of the POP3PROXY.

[Kenny]
> This would probably work if and when the fix for bug #831388 is applied.

I've applied that fix, and a simplified version of Mats' patch - thanks
to all concerned! UserInterface.py 1.34.
-- Richie Hindle richie@entrian.com From tim.one at comcast.net Tue Nov 25 21:36:20 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Nov 25 21:36:25 2003 Subject: [spambayes-dev] more selective Received: header mining... In-Reply-To: <16318.65398.374469.490455@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I made a change to the mine_received_headers stuff this evening, > adding a new option, gateway_machines. The idea is that the only > Received: header which is really useful is the one which crosses the > boundary between your known "good" network and the wild free-for-all > part of the net. Received: headers from hosts internal to your > network are meaningless, since for the most part, all mail passes > through them, while Received: headers from hosts external to your > network probably just contain random garbage which clogs your > database with meaningless tokens. I don't know that that's so. On the spam side, some spammers forge a sequence of Received headers to make it appear as if the path to your machine was legitimate, and the specific paths they forge can be clues. On the ham side, different senders' emails often take different paths that leave behind distinctive clues on their end of the pipe. If a token in the database is indeed worthless, that can be detected by (1) the token is never used for scoring anymore; and/or, (2) the token has a spamprob in the range we ignore. If your real concern is purging useless tokens, then analysis based on #1 and #2 should identify huge masses of useless tokens, including all due to Received headers. #1 is hard to do now, of course (since we don't save any token access-time info in the database). BTW, the Outlook addin currently leaves mine_received_headers at its default False, so I don't have any tokens due to Received lines in my databases. 
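Tim's criterion #2 (a token whose spamprob falls in the range the classifier ignores) lends itself to a small sketch. The following is an illustration only, not SpamBayes code: it assumes the token database can be viewed as a plain mapping from token to spamprob, it assumes the ignored band is 0.1 either side of 0.5, and the function name and the "received:" token spellings are hypothetical.

```python
def tokens_in_ignored_band(spamprobs, min_strength=0.1):
    """Return the tokens whose spamprob is too close to 0.5 to matter.

    spamprobs maps token -> spam probability (an assumed, simplified view
    of the database). min_strength is the assumed half-width of the band
    around 0.5 that scoring ignores.
    """
    return sorted(
        token
        for token, prob in spamprobs.items()
        if abs(prob - 0.5) < min_strength
    )


# Hypothetical tokens and probabilities, for illustration only.
example = {
    "received:mail.example.com": 0.52,
    "received:relay.example.net": 0.45,
    "viagra": 0.99,
    "python": 0.02,
}

print(tokens_in_ignored_band(example))
```

Criterion #1 (tokens never used for scoring any more) cannot be sketched the same way, since, as Tim notes, the database does not currently record any access-time information.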
From tim.one at comcast.net Tue Nov 25 22:03:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Nov 25 22:03:37 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <16312.15564.157062.319322@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> ...
> Here's something I think would be interesting. At the moment I have
> about 40 unsures awaiting a decision from me (train or discard). I'm
> trying consciously to be conservative. What I'd like to know is which
> message, if added to my training database, would have the greatest
> effect on the scores of the other unsure messages. That would help
> me decide which ones yield the most benefit.

If you can define what "greatest effect on the scores of the other
unsure messages" means, exactly, then it should be easy to automate that
decision (for each unsure: train on it, score all the other unsures,
compute "the effect" on their scores (whatever that means to you),
untrain it; then pick the one with the greatest whatever-it-is you
measured). Google on "active learning" classification to get a warm
fuzzy feeling that this may be a fine thing to do.

I train on "the worst" Unsure first (lowest-scoring spam or
highest-scoring ham), then rescore Unsures, and repeat until they're all
gone. A number of Unsures usually get resolved on their own this way,
especially near-duplicates of a new spam. I don't spend any time any
more trying to guess whether a message "really is" ham or spam -- if
it's not obvious after 5 seconds, I toss it without training on it at
all.
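The train / score / untrain loop Tim outlines could look something like the sketch below. Everything here is an assumption made for illustration: ToyClassifier is a deliberately crude stand-in (not the SpamBayes classifier), "the effect" is taken to be the total absolute change in the other unsures' scores, and each unsure is assumed to come with the label the user would give it.

```python
from collections import Counter


class ToyClassifier:
    """Crude stand-in classifier: averages per-token spam ratios."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def score(self, tokens):
        ratios = []
        for token in set(tokens):
            s, h = self.spam[token], self.ham[token]
            if s + h > 0:  # skip tokens never seen in training
                ratios.append(s / (s + h))
        return sum(ratios) / len(ratios) if ratios else 0.5


def most_influential(classifier, unsures):
    """Return (index, effect) of the unsure whose training moves the
    other unsures' scores the most.

    unsures is a list of (tokens, is_spam) pairs. The classifier is left
    as it was found, since every trial training is undone again.
    """
    baseline = [classifier.score(tokens) for tokens, _ in unsures]
    best_index, best_effect = None, -1.0
    for i, (tokens, is_spam) in enumerate(unsures):
        classifier.train(tokens, is_spam)
        effect = sum(
            abs(classifier.score(other) - baseline[j])
            for j, (other, _) in enumerate(unsures)
            if j != i
        )
        classifier.untrain(tokens, is_spam)
        if effect > best_effect:
            best_index, best_effect = i, effect
    return best_index, best_effect


# Hypothetical unsure messages, already tokenized and labelled.
unsures = [
    (["viagra", "cheap"], True),
    (["viagra", "pills"], True),
    (["meeting"], False),
    (["cheap", "deal"], True),
]

index, effect = most_influential(ToyClassifier(), unsures)
print(index, effect)
```

Other definitions of "the effect" (for example, how many unsures leave the unsure band) would slot into the same loop by changing only the expression that computes effect.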
From sjoerd at acm.org Wed Nov 26 05:39:56 2003
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Wed Nov 26 05:40:02 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayes message.py, 1.40, 1.41
In-Reply-To:
References:
Message-ID: <3FC482FC.7030707@acm.org>

Richie Hindle wrote:
> Update of /cvsroot/spambayes/spambayes/spambayes
> In directory sc8-pr-cvs1:/tmp/cvs-serv19794
>
> Modified Files:
>     message.py
> Log Message:
> Patch 831388: Make message.py respect the header_score_digits option.
>
> Bugfix candidate (probably).
>
>
> Index: message.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/spambayes/message.py,v
> retrieving revision 1.40
> retrieving revision 1.41
> diff -C2 -d -r1.40 -r1.41
> *** message.py 8 Oct 2003 04:04:35 -0000 1.40
> --- message.py 25 Nov 2003 23:11:18 -0000 1.41
> ***************
> *** 342,346 ****
>
>       if options['Headers','include_score']:
> !         disp = str(prob)
>           if options["Headers", "header_score_logarithm"]:
>               if prob<=0.005 and prob>0.0:
> --- 342,346 ----
>
>       if options['Headers','include_score']:
> !         disp = ("%."+str(options["Headers", "header_score_digits"])+"f") % prob
>           if options["Headers", "header_score_logarithm"]:
>               if prob<=0.005 and prob>0.0:
>

This can be done with

    disp = "%.*f" % (options["Headers", "header_score_digits"], prob)

which looks more readable to me.
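The equivalence between the two spellings is easy to check in isolation. The sketch below uses stand-in values rather than the real options object; the function names are hypothetical and exist only to put the two forms side by side.

```python
def format_score_concat(prob, digits):
    # Form used in the checked-in patch: build the format string first.
    return ("%." + str(digits) + "f") % prob


def format_score_star(prob, digits):
    # Sjoerd's form: '*' takes the precision from the argument tuple.
    return "%.*f" % (digits, prob)


for prob, digits in [(0.987654, 2), (0.5, 4), (0.004321, 3)]:
    assert format_score_concat(prob, digits) == format_score_star(prob, digits)
    print(format_score_star(prob, digits))
```

Note that with the '*' form the precision comes first in the tuple and the value second, which is the order the format string consumes them in.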
-- Sjoerd Mullender From sjoerd at acm.org Wed Nov 26 05:42:35 2003 From: sjoerd at acm.org (Sjoerd Mullender) Date: Wed Nov 26 05:42:45 2003 Subject: [spambayes-dev] Three patches for better Evolution integration In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B191@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130212B191@its-xchg4.massey.ac.nz> Message-ID: <3FC4839B.90206@acm.org> Tony Meyer wrote: > [Sjoerd Mullender] > >>I was thinking, could you use the IMAP command to add a flag to a >>message, such as STORE +FLAGS Classified and STORE +FLAGS >>Trained. You can select messages with SEARCH KEYWORD Classified. You >>wouldn't have to change the message, so you don't need to make >>copies (unless of course you have to move the message to a >>different folder). > > > The problem is that not all IMAP servers allow you to store arbitary flags > (at least this is how the RFC reads to me; correct me if I'm wrong). So > this would mean we only support a subset of IMAP servers. Again, if someone > decides that this is the way to go, I don't really care, but I likewise, I > don't care enough to do it myself. I can't say I care very much either at the moment. But just for the record, it seems that the Cyrus and UW IMAP servers both implement this. -- Sjoerd Mullender From barry at python.org Wed Nov 26 09:10:10 2003 From: barry at python.org (Barry Warsaw) Date: Wed Nov 26 09:10:34 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: Message-ID: <1069855810.3419.58.camel@anthem> On Tue, 2003-11-25 at 22:03, Tim Peters wrote: > If you can define what "greatest effect on the scores of the other unsure > messages" means, exactly, then it should be easy to automate that decision > (for each unsure: train on it, score all the other unsures, compute "the > effect" on their scores (whatever that means to you), untrain it; then pick > the one with the greatest whatever-it-is you measured). Sounds like a genetic algorithm. 
The trick is deciding what it is you want to maximize. -Barry From richie at entrian.com Wed Nov 26 17:09:48 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Nov 26 17:10:37 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayes message.py, 1.40, 1.41 In-Reply-To: <3FC482FC.7030707@acm.org> References: <3FC482FC.7030707@acm.org> Message-ID: [Me, applying patch 831388] > disp = ("%."+str(options["Headers", "header_score_digits"])+"f") % prob [Sjoerd] > This can be done with > disp = "%.*f" % (options["Headers", "header_score_digits"], prob) > which looks more readable to me. You're right, that is better. Checked in as message.py 1.43 - thanks! -- Richie Hindle richie@entrian.com From kennypitt at hotmail.com Wed Nov 26 17:19:03 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Nov 26 17:19:44 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayesmessage.py, 1.40, 1.41 In-Reply-To: Message-ID: Richie Hindle wrote: > [Me, applying patch 831388] >> disp = ("%."+str(options["Headers", "header_score_digits"])+"f") % prob > > [Sjoerd] >> This can be done with >> disp = "%.*f" % (options["Headers", "header_score_digits"], prob) >> which looks more readable to me. > > You're right, that is better. Checked in as message.py 1.43 - thanks! Identical format string is used in hammie.py. Might be worth changing it there as well. -- Kenny Pitt From richie at entrian.com Wed Nov 26 18:01:32 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Nov 26 18:02:04 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayesmessage.py, 1.40, 1.41 In-Reply-To: References: Message-ID: [Kenny] > Identical format string is used in hammie.py. Might be worth changing > it there as well. Good spot. Done. 
--
Richie Hindle
richie@entrian.com

From mhammond at skippinet.com.au Wed Nov 26 18:08:13 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Nov 26 18:08:22 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz>
Message-ID: <027801c3b472$2b4c06a0$0200a8c0@eden>

> [This thread seems to have died a week ago, but since I was away, and
> have things to say, and it doesn't seem to be resolved, I figured I'd
> resurrect it. While I'm doing notes: thanks Richie, Anthony and Skip
> for outlining the various processes in more detail - great stuff for
> us cvs newbies].

That sounds great - but I am afraid I am still not sure what the
resolution is.

My specific issue is that the branch does not include patches needed to
run effectively on Windows (notably, the patches made to the server,
service, and tray to ensure only one instance is running, and to ensure
the tray handles the service correctly). So, as far as I can tell, the
branch has what we want to release as a "stand alone" sb_server, but the
trunk includes what I want to release for a Windows binary.

So I find myself unable to move on the Windows binary, but not
understanding what has happened well enough to fix things.

Should we abandon the branch, merging everything back to the trunk? I
don't see the branch is offering us any value, firstly as it is now very
old, and secondly as the trunk doesn't seem to have had any changes that
truly should be post 1.0 - unless we don't count a Windows binary in
1.0.

I will start moving on the binary again as soon as someone can help me
resolve this. In the meantime, it looks like resolving 355 bugs as
duplicates.

Mark.
From richie at entrian.com Thu Nov 27 03:41:34 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Nov 27 03:42:04 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <027801c3b472$2b4c06a0$0200a8c0@eden> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz> <027801c3b472$2b4c06a0$0200a8c0@eden> Message-ID: <93ebsvkek2u6gjfrll6e6d7ks98ni49ta4@4ax.com> [Mark] > Should we abandon the branch, merging everything back to the trunk? +1 We're still in alpha, and we never really decided what our branch management strategy should be. Let's start again with the trunk. -- Richie Hindle richie@entrian.com From tameyer at ihug.co.nz Fri Nov 28 01:00:35 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Fri Nov 28 01:00:40 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315896@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F1@its-xchg4.massey.ac.nz> [Mark] > Should we abandon the branch, merging everything > back to the trunk? I don't see the branch is offering > us any value, firstly as it is now very old, and secondly > as the trunk doesn't seem to have had any changes that > truly should be post 1.0 - unless we don't count a > Windows binary in 1.0. The sb_server interface has had quite a few changes on the trunk that haven't been put into the branch. I certainly hope that they didn't introduce new bugs (since I wrote most of it), but they may have done. I think it would certainly mean that we should do another release before we consider that things are stable. OTOH, the changes were all sugar, and they could be ripped out and added back in another day. Anyone have any idea how many people are using sb_server from cvs? Enough that the changes (they're fairly old now, even though they've never made a release) seem fairly stable? -0 from me, anyway. > I will start moving on the binary again as > soon as someone can help me resolve this. 
I can't help resolve it, really, but now I'm back and have caught up with things, I'm happy to help out however I can with the 'full' binary. Are you still thinking of doing one final Outlook only one? Give me a list of things to do . (Or maybe you already did; I'll have to look through the stuff I have on my to-do list). > In the meantime > it looks like resolving 355 bugs as duplicates I tried to get some of those out of the way; there sure are a lot of Outlook ones there at the moment. If the new release solves the 0x000000 install problem, that'd get rid of a lot... =Tony Meyer From mhammond at skippinet.com.au Fri Nov 28 05:37:06 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri Nov 28 05:37:16 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F1@its-xchg4.massey.ac.nz> Message-ID: <018e01c3b59b$925be480$0200a8c0@eden> > [Mark] > > Should we abandon the branch, merging everything > > back to the trunk? I don't see the branch is offering > > us any value, firstly as it is now very old, and secondly > > as the trunk doesn't seem to have had any changes that > > truly should be post 1.0 - unless we don't count a > > Windows binary in 1.0. > > The sb_server interface has had quite a few changes on the trunk that > haven't been put into the branch. I certainly hope that they didn't > introduce new bugs (since I wrote most of it), but they may > have done. I > think it would certainly mean that we should do another > release before we > consider that things are stable. OTOH, the changes were all > sugar, and they > could be ripped out and added back in another day. > > Anyone have any idea how many people are using sb_server from > cvs? Enough > that the changes (they're fairly old now, even though they've > never made a > release) seem fairly stable? > > -0 from me, anyway. I should clarify that by "abandon the branch", I don't actually mean abandon . 
I mean we merge the branch back onto the trunk (cvs up -j ...), handling any
conflicts and resolving the pain it may cause. I haven't tried to see what
conflicts actually arise, but am willing to. I expect the only real conflicts
will be where a bug *has* been fixed in both places.

Or is that what you thought I meant, and still -0?

> still thinking of doing one final Outlook only one? Give me a list of
> things to do. (Or maybe you already did; I'll have to look through
> the stuff I have on my to-do list).

OK - this is my strawman plan:

1) Merge as above.
2) Let things settle for a week or so, so poor CVS users all get to suffer
   alone.
3) Put together a binary from my current py2exe setup script, which includes
   CVS and a number of sb_ programs.
4) Announce this binary as a "binary-beta", calling it 0.75 or something.
5) Any major bugs will presumably be part of the "binary framework", so maybe
   0.76 etc, depending on the damage.
6) Move towards release 0.8 - this will be simultaneous windows-binary and
   source.
7) Move towards release 0.9 - aim for 4 weeks after 0.8, addressing only bugs.
8) Just *before* the 0.9 release, cut a new 1.0 branch. Release 0.9.
9) Move towards 1.0, again aiming for 4 weeks, possibly with 2x release
   candidates.

As far as I can tell, almost everyone is in bug-fix-only mode already (except
that damn bass-playing Warsaw). I'm pretty much in that mode for Outlook
too. So up until (8), which is when we cut a new 1.0 branch, all new real
features go on some development branch (a branch per feature - whatever). We
all reserve our right to use a fairly liberal definition of "bug" for
low-risk, high-benefit tweaks, but I think our app is mature enough that we
can happily tell people who want truly new features to grab a CVS branch.

After the branch is cut, we get seriously anal about "bugfix only", with the
expectation the branch lasts only 4 weeks (which flies when everyone is busy!)
This is likely to impact almost no-one once we are re-merged.

Moving fast towards 1.0 seems in everyone's benefit, and this would get us
there around the end of Feb. If we can get there sooner due to the only bugs
being old ones we don't know how to fix, all the better.

Happy-new-year ly,

Mark.

From barry at python.org Fri Nov 28 08:37:07 2003
From: barry at python.org (Barry Warsaw)
Date: Fri Nov 28 08:37:12 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <018e01c3b59b$925be480$0200a8c0@eden>
References: <018e01c3b59b$925be480$0200a8c0@eden>
Message-ID: <1070026627.20553.10.camel@anthem>

On Fri, 2003-11-28 at 05:37, Mark Hammond wrote:

> As far as I can tell, almost everyone is in bug-fix-only mode already
> (except that damn bass-playing Warsaw).

Yeah, if we could only get him to read this mailing list once in a while,
then he might even agree with the plan.

-Not That Guy

From support at netcom3.com Fri Nov 28 17:54:59 2003
From: support at netcom3.com (Auto Auction Center)
Date: Fri Nov 28 17:57:26 2003
Subject: Proposal[spambayes-dev] Business opportunity
Message-ID: <000601c3b602$a6dee080$0a01a8c0@Netcom3>

About Us: Out of over 100,000 merchants, Netcom3 (Auto Center) is ranked #1
as the most successful company through Clickbank Network. Netcom3 (The Auto
Center) has millions of used car buyers every month in its automotive network
with one of the largest providers of e-business for the automotive industry.
By 2004, millions of customers will be using our no obligation service to bid
on vehicles and make offer requests to our network of over 67,200 sellers and
dealers. We have been in the automotive business for over 7 years with a huge
profit gain each and every year. We are now planning to expand our company by
offering free advertising on our online auto site which will also appear on
our network partner's sites Including: Yahoo.com, cars.com, Autotrader.com,
autoweb.com and many more auto giants.
Advertising with us is FREE and you can instantly generate an ongoing stream of targeted leads. All leads can be emailed to you or customers can go directly to your website to view what you have to offer. Profit Potential: By signing up with Netcom3.com, you will have access to over 63,000,000 potential auto buyers every month for FREE. Just imagine how much money you can save by signing up with Netcom3 appose to posting your on ads on our network partners sites yourself. See example below. Example: To post an ad on yahoo.com auto site it will cost you $34.95 per car. But if you sign up with Netcom3.com it will cost you nothing and your ad will appear on Netcom3 site and its network partner sites including yahoo.com for free. Example2: Place 100 ads on yahoo.com will cost you $3495.00. Place 100 ads with Netcom3 and its partner sites will cost you $0. You can post as many cars as you like for free. We are so confident in our service that we are not obligating you or any of our other affiliates to contract with us. You can back out at anytime. Give it a try and I guarantee you will be glad you did. If you are interested in receiving more qualified leads please visit our web site for more details at http://www.netcom3.com/sell.htm Thank You Netcom3 / Auto Center Marketing Department -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031128/80645b1e/attachment.html From skip at pobox.com Thu Nov 27 07:01:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Nov 28 18:16:18 2003 Subject: [spambayes-dev] A spectacular false positive In-Reply-To: References: <16312.15564.157062.319322@montanaro.dyndns.org> Message-ID: <16325.59286.217664.627909@montanaro.dyndns.org> >> What I'd like to know is which message, if added to my training >> database, would have the greatest effect on the scores of the other >> unsure messages. 
That would help me decide which ones yield the most
>> benefit.

Tim> If you can define what "greatest effect on the scores of the other
Tim> unsure messages" means, exactly, then it should be easy to automate
Tim> that decision (for each unsure: train on it, score all the other
Tim> unsures, compute "the effect" on their scores (whatever that means
Tim> to you), untrain it; then pick the one with the greatest
Tim> whatever-it-is you measured).

I mean "pushes the remaining unsures the furthest away from their current
scores". I guess I want to maximize:

    sum([abs(old-new) for (old,new) in zip(oldprobs, newprobs)])

Tim> Google on
Tim>     "active learning" classification
Tim> to get a warm fuzzy feeling that this may be a fine thing to do.

Thanks. When I get a chance, I may. On the other hand, I may just take your
word for it.

Tim> I train on "the worst" Unsure first (lowest-scoring spam or
Tim> highest-scoring ham), then rescore Unsures, and repeat until
Tim> they're all gone. A number of Unsures usually get resolved on
Tim> their own this way, especially near-duplicates of a new spam

I've been doing this sort of thing, though perhaps not consistently enough.

Tim> I don't spend any time any more trying to guess whether a message
Tim> "really is" ham or spam -- if it's not obvious after 5 seconds, I
Tim> toss it without training on it at all.

Ditto.

Skip

From gward at python.net Sun Nov 30 12:16:14 2003
From: gward at python.net (Greg Ward)
Date: Sun Nov 30 12:16:18 2003
Subject: [spambayes-dev] Clever avoidance technique
Message-ID: <20031130171614.GA10222@cthulhu.gerg.ca>

Here's a nifty variation on the invisible-text-in-HTML tactic: make the
invisible text vaguely relevant to the recipient of the spam. I just got one
this morning that's immediately, obviously spam from these headers:

    From: "Inconvenience O. Imprecision"
    To: Gward
    Subject: Gward, meet singles in your area U7n2QHvxKLmBOhTROl57D5Q7crCNQzbL
    Date: Sat, 29 Nov 2003 16:42:22 -0500

but if I look in the HTML body, I see this:

The Defense Technical Information Center (DTIC= =ae) is the central facility for the collection and dissemination of scie= ntific and technical information for the Department of Defense (DoD)=2e M= uch of this information is made available by DTIC in the form of technica= l reports about completed research, and research summaries of ongoing res= earch=2e u62Mb6TFJNptB0duTKrhqDiJDdBNRazm

which isn't terribly relevant to me... but a little farther on (after the actual spam payload, encoded of course), we see this:

The Handle System allows handles to be both cr= eated and resolved in a distributed fashion (see the diagram on this page= for an overview of the Handle System architecture)=2e Both creation and = resolution can be accomplished using dedicated clients, common clients su= ch as web browsers using special extensions or plug-ins, or unextended cl= ients going through various proxies=2e In all cases, communication with t= he Handle System is carried out using the Handle System protocol which ha= s a formal specification and some specific implementations, all freely av= ailable from CNRI=2e The protocol has a corresponding client library avai= lable in C and Java=2e The C client library has been used by CNRI in the = creation of a handle-aware extension to the Netscape and Microsoft web br= owsers=2e The Java client library has been used to create an http-to-hand= [...] Interesting! This would probably count as ham for any computer geek. However, the above blurb describes software produced by my former employer, and you can probably get to it with 3 or 4 clicks from my home page. And, knowing CNRI, the first blurb is probably vaguely related -- most of their money comes from the US military-industrial-entertainment complex, after all. This feels very much like it's targeted at Bayesian filters -- eg. I suspect SpamAssassin pre-2.6 would have had a better chance at calling this one spam than Spambayes (which scored it 0.198, just barely ham for my thresholds). Full message attached in case you're curious. Greg -- Greg Ward http://www.gerg.ca/ Earn cash in your spare time -- blackmail your friends! -------------- next part -------------- An embedded message was scrubbed... From: "Inconvenience O. 
Imprecision" Subject: Gward, meet singles in your area U7n2QHvxKLmBOhTROl57D5Q7crCNQzbL Date: Sat, 29 Nov 2003 16:42:22 -0500 Size: 4908 Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031130/d32a569a/meet-singles.mht From tim.one at comcast.net Sun Nov 30 17:09:43 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Nov 30 17:09:49 2003 Subject: [spambayes-dev] Clever avoidance technique In-Reply-To: <20031130171614.GA10222@cthulhu.gerg.ca> Message-ID: [Greg Ward] > Here's a nifty variation on the invisible-text-in-HTML tactic: make > the invisible text vaguely relevant to the recipient of the spam. Jeremy got exactly the same white-on-white text as you got below, about two weeks ago, although the container spam had different content (albeit the same thrust). I think the CNRI connection is coincidence -- lots of spam contains color-on-close-color decoy text, but you never notice it except on those rare occasions it ends up being hammy to you. > I just got one this morning that's immediately, obviously spam from > these headers: Why, are you married now, or you just don't get ham on Saturdays anymore ? > From: "Inconvenience O. Imprecision" > To: Gward > Subject: Gward, meet singles in your area > U7n2QHvxKLmBOhTROl57D5Q7crCNQzbL > Date: Sat, 29 Nov 2003 16:42:22 -0500 > > but if I look in the HTML body, I see this: > >

The Defense Technical Information Center > (DTIC= =ae) is the central facility for the collection and > dissemination of scie= ntific and technical information for the > Department of Defense (DoD)=2e M= uch of this information is made > available by DTIC in the form of technica= l reports about > completed research, and research summaries of ongoing res= earch=2e > u62Mb6TFJNptB0duTKrhqDiJDdBNRazm

> > which isn't terribly relevant to me... Jeremy thought it was, as DTIC worked closely with CNRI, and even hosted a symposium on CNRI's handle system (the topic of the next blurb below). > but a little farther on (after the actual spam payload, encoded of > course), we see this: > >

The Handle System allows handles to be > both cr= eated and resolved in a distributed fashion (see the > diagram on this page= for an overview of the Handle System > architecture)=2e Both creation and = resolution can be accomplished > using dedicated clients, common clients su= ch as web browsers > using special extensions or plug-ins, or unextended cl= ients going > through various proxies=2e In all cases, communication with t= he > Handle System is carried out using the Handle System protocol which > ha= s a formal specification and some specific implementations, all > freely av= ailable from CNRI=2e The protocol has a corresponding > client library avai= lable in C and Java=2e The C client library > has been used by CNRI in the = creation of a handle-aware extension > to the Netscape and Microsoft web br= owsers=2e The Java client > library has been used to create an http-to-hand= [...] > > Interesting! This would probably count as ham for any computer geek. > However, the above blurb describes software produced by my former > employer, and you can probably get to it with 3 or 4 clicks from my > home page. Ya, and Johnny Carrero's 1998 Folsom Fitness Extravaganza is only two clicks from your home page, the McConnell Brain Imaging Centre only one. If they were targeting you specifically, they hit stuff relatively *hard* to find. And, knowing CNRI, the first blurb is probably vaguely > related -- most of their money comes from the US > military-industrial-entertainment complex, after all. > > This feels very much like it's targeted at Bayesian filters -- eg. I > suspect SpamAssassin pre-2.6 would have had a better chance at calling > this one spam than Spambayes (which scored it 0.198, just barely ham > for my thresholds). Jeremy and Guido both got spam a while back with a sure way to beat SpamBayes: the spam was added to replies to mailing list postings of theirs, with their original subject lines and the quoted text of their original messages. 
That trick is all but guaranteed to find lots of tokens hammy to you, and seems a lot cheaper & simpler than crawling over web pages looking for "related interests". But after a couple of those, we never saw that trick again. It's more expensive than spraying the same set of spam content at every address you can find, and I expect the response rate from targeting tech mailing-list posters was so low as to make it a net monetary loss. It would be nice to "do something" about the color-on-close-color trick, but I don't yet see it *working* often enough to be worth the expense and bother. From tameyer at ihug.co.nz Sun Nov 30 19:17:25 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Nov 30 19:17:31 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315B7A@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F3@its-xchg4.massey.ac.nz> [Mark] > I haven't tried to see > what conflicts actually arise, but am willing to. I expect > the only real conflicts will be where a bug *has* been fixed > in both places. I'm pretty sure that you'll find that the branch contains *nothing* that the trunk doesn't already, apart from the WHAT_IS_NEW.TXT file. In fact, I'll bet you 50% of your SpamBayes Outlook plug-in profits to date that that's the case . > Or is that what you thought I meant, and still -0? Well, yes. My only worry is that the trunk contains various code that the branch doesn't which is definitely not bug-fix and does change sb_server.py, UserInterface.py and ProxyUI.py a fair bit. Of course, if I did it right, I didn't introduce any bugs adding that code anyway, but ... OTOH, it would be nice to have some of those features in a release sooner than May 04 :) (Especially the one that lets people submit a decent bug report). I'll upgrade to +0 :) I think everyone else is in favour, anyway. > OK - this is my strawman plan: [...] 
> 3) Put together a binary from my current py2exe setup script, > which includes CVS and a number of sb_ programs. Does one need a special version of py2exe for this? If so, is it one that there's a binary available for? (i.e. can I do this without VC++?) > 4) Announce this binary as a "binary-beta", calling it 0.75 > or something. > 5) Any major bugs will presumably be part of the "binary > framework", so maybe 0.76 etc, depending on the damage. And on the 'encourage people to try it out' side, there are a few bugs that have been fixed since 1.0a7, so they may wish to get those benefits. [...rest of steps...] All this looks +1 to me. > As far as I can tell, almost everyone is in bug-fix-only mode > already (except that damn bass-playing Warsaw ). I kinda am, but certainly wasn't for a period after we went into 'feature freeze' (hence the new features in the trunk, and not in the branch). I would be willing to hold off adding any new features for the (NZ/Au) summer, although I would like to integrate the Japanese/Asian languages patches (which have been patiently waiting, and continually updated, for a while now). I don't know if that's bug or feature tampering . It would be interesting to try and put together a testing framework that works with the apps (as in the other thread), too, but that could easily be on a branch. > So up until (8), > which is when we cut a new 1.0 branch, all new real features > go one some development branch (a branch per feature - > whatever). I would be happy doing this, though. Instead of a branch per feature, what about a branch per app? (So (as needed), a sb_server_experimental branch, an sb_imapfilter_experimental branch, and so on (with better names)). > We all reserve our right to use a fairly liberal > definition of "bug" for low-risk, high-benefit tweaks, but I > think our app is mature enough that we can happily tell > people who want truly new features to grab a CVS branch. Agreed. 
> After the branch is cut, we get seriously anal about "bugfix > only", with the expectation the branch lasts only 4 weeks > (which flies when everyone is busy!) Agreed. > Moving fast towards 1.0 seems in everyone's benefit Agreed :) If everyone else agrees with this, we really ought to put a copy of the above list somewhere, too, so that Barry can be pointed to it later . README-DEVEL.TXT, maybe? =Tony Meyer
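
Editorial sketch, going back to the "A spectacular false positive" thread above: Tim's recipe (for each unsure: train on it, score all the other unsures, measure the effect, untrain it, then pick the one with the greatest effect) combined with Skip's sum-of-absolute-differences measure might look roughly like the Python below. Everything except Skip's expression is an assumption made for illustration: `ToyClassifier`, `total_shift` and `most_influential` are hypothetical names, the toy scorer is a plain average of smoothed token probabilities, and none of this is the SpamBayes API or its actual scoring scheme.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable filter; NOT the SpamBayes API."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of ham messages containing it

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        # Exactly undoes a previous train() call on the same message.
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def score(self, tokens):
        # Mean of smoothed per-token spam probabilities (0.0 ham, 1.0 spam).
        # An unseen token contributes a neutral 0.5.
        probs = [(self.spam[t] + 1) / (self.spam[t] + self.ham[t] + 2)
                 for t in set(tokens)]
        return sum(probs) / len(probs) if probs else 0.5


def total_shift(oldprobs, newprobs):
    # Skip's measure: how far the scores moved, summed over all messages.
    return sum(abs(old - new) for old, new in zip(oldprobs, newprobs))


def most_influential(classifier, unsures):
    """Return (index, effect) of the unsure whose training moves the others most.

    unsures is a list of (tokens, is_spam) pairs, i.e. messages the user has
    already judged by hand but not yet trained on.
    """
    best_index, best_effect = None, -1.0
    for i, (tokens, is_spam) in enumerate(unsures):
        others = [t for j, (t, _) in enumerate(unsures) if j != i]
        oldprobs = [classifier.score(t) for t in others]
        classifier.train(tokens, is_spam)       # tentatively train on it
        newprobs = [classifier.score(t) for t in others]
        classifier.untrain(tokens, is_spam)     # put the database back
        effect = total_shift(oldprobs, newprobs)
        if effect > best_effect:
            best_index, best_effect = i, effect
    return best_index, best_effect


if __name__ == "__main__":
    unsures = [
        (["free", "offer", "pills"], True),
        (["free", "offer"], True),
        (["meeting", "agenda"], False),
    ]
    print(most_influential(ToyClassifier(), unsures))
```

Each candidate costs one train, one untrain and a rescoring of every other unsure, so the work grows with the square of the number of unsures; that is probably fine for a handful of messages and worth rethinking for a large backlog.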