From seandarcy at hotmail.com Thu Apr 1 17:47:17 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Thu Apr 1 17:47:21 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

I'm training using the web interface. When I click train I get the following:

500 Server error

Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in found_terminator
    getattr(plugin, name)(**params)
  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in onReview
    fromCache=True)
  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 200, in takeMessage
    types.StringsTypes):
NameError: global name 'types' is not defined

So, I used CVS today. Now I get:

500 Server error

Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in found_terminator
    getattr(plugin, name)(**params)
  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in onReview
    fromCache=True)
  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 213, in takeMessage
    fromcorpus.removeMessage(msg)
  File "/usr/lib/python2.3/site-packages/spambayes/FileCorpus.py", line 151, in removeMessage
    Corpus.Corpus.removeMessage(self, message, observer_flags)
  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 147, in removeMessage
    obs.onRemoveMessage(message, observer_flags)
  File "/usr/lib/python2.3/site-packages/spambayes/storage.py", line 606, in onRemoveMessage
    if flags.find(NO_TRAINING_FLAG) < 0:
AttributeError: 'NoneType' object has no attribute 'find'

sean

From skip at pobox.com Thu Apr 1 18:41:28 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Apr 1 18:41:42 2004
Subject: [spambayes-dev] Dibbler.py error in training
In-Reply-To:
References:
Message-ID: <16492.43176.91868.294903@montanaro.dyndns.org>

    sean> So, I used CVS today. Now I get:
    ...
    sean> File "/usr/lib/python2.3/site-packages/spambayes/storage.py", line 606, in
    sean> onRemoveMessage
    sean> if flags.find(NO_TRAINING_FLAG) < 0:

    sean> AttributeError: 'NoneType' object has no attribute 'find'

This looks like a bug in onRemoveMessage(). I don't know what the meaning of a flags value of None is supposed to be so I can't fix it, but it's clear that the flags.find() call has to be conditional on flags not being None. Tony added that code in the past week or so. I trust he will know the correct fix.

Skip

From skip at pobox.com Fri Apr 2 11:33:38 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Apr 2 11:33:50 2004
Subject: [spambayes-dev] sb_bnfilter.py/sb_bnserver.py
In-Reply-To: <16492.45455.798964.179426@montanaro.dyndns.org>
References: <1080424390.4065.24.camel@porsche.hq.simlog.com> <16487.22766.448116.900778@montanaro.dyndns.org> <4068267B.5030206@videotron.ca> <16490.18079.594164.489427@montanaro.dyndns.org> <16492.45455.798964.179426@montanaro.dyndns.org>
Message-ID: <16493.38370.576646.51972@montanaro.dyndns.org>

    Skip> Toby's timeout changes coupled with a change to the PATH setting
    Skip> in my procmailrc file seem to have fixed the problems I was
    Skip> having.

I've been running Toby's sb_bnfilter.py with this setup since yesterday. It's processed around 1500 messages with no hiccups (no procmail.log messages so far). This is looking good. After a bit more exercise I think we should consider it as a replacement for sb_filter.py.
Skip

From kennypitt at hotmail.com Fri Apr 2 13:35:35 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Apr 2 13:36:51 2004
Subject: [spambayes-dev] Dibbler.py error in training
In-Reply-To: <16492.43176.91868.294903@montanaro.dyndns.org>
Message-ID:

Skip Montanaro wrote:
>     sean> So, I used CVS today. Now I get:
>     ...
>     sean> File "/usr/lib/python2.3/site-packages/spambayes/storage.py", line 606, in
>     sean> onRemoveMessage
>     sean> if flags.find(NO_TRAINING_FLAG) < 0:
>
>     sean> AttributeError: 'NoneType' object has no attribute 'find'
>
> This looks like a bug in onRemoveMessage(). I don't know what the
> meaning of a flags value of None is supposed to be so I can't fix it,
> but it's clear that the flags.find() call has to be conditional on
> flags not being None. Tony added that code in the past week or so.
> I trust he will know the correct fix.

I just checked in a fix for this. flags=None was supposed to represent that no flags had been passed. I changed the code to use integer bit values that could be OR'd together if we ever add more flags in the future, and it now defaults to flags=0 for no flags.

This error only seems to occur if you retrain a message that was trained into the wrong corpus (either correcting a training mistake, or retraining a false positive or false negative with a train-on-everything strategy). onRemoveMessage is not called on the unknown corpus.

--
Kenny Pitt

From seandarcy at hotmail.com Fri Apr 2 18:47:29 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Fri Apr 2 18:47:33 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

>I just checked in a fix for this. flags=None was supposed to represent
>that no flags had been passed. I changed the code to use integer bit
>values that could be OR'd together if we ever add more flags in the
>future, and it now defaults to flags=0 for no flags.
>
>This error only seems to occur if you retrain a message that was trained
>into the wrong corpus (either correcting a training mistake, or
>retraining a false positive or false negative with a train-on-everything
>strategy). onRemoveMessage is not called on the unknown corpus.

Updated from cvs. New error message:

Training...
500 Server error

Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in found_terminator
    getattr(plugin, name)(**params)
  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in onReview
    fromCache=True)
  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 209, in takeMessage
    if opt in notate_opt and \
AttributeError: 'NoneType' object has no attribute 'startswith'

Did I get the fix from cvs? Maybe Sourceforge just didn't update it.

cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/spambayes co spambayes
U spambayes/contrib/sb_bnfilter.py
U spambayes/spambayes/Corpus.py
U spambayes/spambayes/FileCorpus.py
U spambayes/spambayes/storage.py

sean

From kennypitt at hotmail.com Mon Apr 5 09:44:40 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr 5 09:45:56 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

sean darcy wrote:
>> I just checked in a fix for this. [snip]
>
> Updated from cvs. New error message:
>
>
>
> Training...
> 500 Server error
>
> Traceback (most recent call last):
>
> File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line
> 461, in found_terminator getattr(plugin, name)(**params)
>
> File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line
> 391, in onReview fromCache=True)
>
> File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line
> 209, in takeMessage if opt in notate_opt and \
>
> AttributeError: 'NoneType' object has no attribute 'startswith'
>
>
> Did I get the fix from cvs? Maybe Sourceforge just didn't update it.
>
> cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/spambayes
> co spambayes
> U spambayes/contrib/sb_bnfilter.py
> U spambayes/spambayes/Corpus.py
> U spambayes/spambayes/FileCorpus.py
> U spambayes/spambayes/storage.py

Well, anonymous CVS does have a little bit of a delay but it looks like you got all of the affected files, and it looks like this error is in a different location. I'll try to make time to take a look if someone else doesn't beat me to it.

--
Kenny Pitt

From kennypitt at hotmail.com Mon Apr 5 10:02:35 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr 5 10:03:53 2004
Subject: [spambayes-dev] Dibbler.py error in training
In-Reply-To:
Message-ID:

Kenny Pitt wrote:
>>
>> Training...
>> 500 Server error
>>
>> Traceback (most recent call last):
>>
>> File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line
>> 461, in found_terminator getattr(plugin, name)(**params)
>>
>> File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line
>> 391, in onReview fromCache=True)
>>
>> File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line
>> 209, in takeMessage if opt in notate_opt and \
>>
>> AttributeError: 'NoneType' object has no attribute 'startswith'
>> ...
>
> Well, anonymous CVS does have a little bit of a delay but it looks
> like you got all of the affected files, and it looks like this error
> is in a different location.
> I'll try to make time to take a look if
> someone else doesn't beat me to it.

I went ahead and took a look at this. It was a different problem accidentally introduced a little while ago while fixing a previous bug. I checked in another fix in Corpus.py for it. Look for revision 1.18 to come through anon cvs.

We really appreciate these problem reports. Everyone uses the software differently so it is often impossible for the developers to catch problems with every combination of option settings.

--
Kenny Pitt

From ta-meyer at ihug.co.nz Mon Apr 5 18:48:56 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Apr 5 18:49:13 2004
Subject: [spambayes-dev] Incremental filtering and the spam folder
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BC9@its-xchg4.massey.ac.nz>


From ta-meyer at ihug.co.nz Mon Apr 5 18:52:15 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Apr 5 18:52:33 2004
Subject: [spambayes-dev] Incremental filtering and the spam folder
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BCA@its-xchg4.massey.ac.nz>

[Sorry about the blank one of these. Did the shortcut for 'send' instead of 'save'].

The log file has a rather confusing message when incremental training is *disabled*:

"""
SpamBayes: Watching (for incremental training) in 'Personal Folders/Spam'
"""

This gets logged because the spam folder is (from what I can tell) still watched for incremental training, but when an item gets added to it, if the train_manual_spam option is False, nothing happens.

Would it not be better to only add the hook if the train_manual_spam option is True? Or is there some other reason that the spam folder has to be hooked?

=Tony Meyer

From seandarcy at hotmail.com Mon Apr 5 19:25:11 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Mon Apr 5 19:25:18 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

>I went ahead and took a look at this.
>It was a different problem
>accidentally introduced a little while ago while fixing a previous bug.
>I checked in another fix in Corpus.py for it. Look for revision 1.18 to
>come through anon cvs.

Did that. Sadly:

Training...
500 Server error

Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line 461, in found_terminator
    getattr(plugin, name)(**params)
  File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line 391, in onReview
    fromCache=True)
  File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line 209, in takeMessage
    if (notate_opt is not None) and (opt in notate_opt) and \
AttributeError: 'NoneType' object has no attribute 'startswith'

>We really appreciate these problem reports. Everyone uses the software
>differently so it is often impossible for the developers to catch
>problems with every combination of option settings.
>
>--
>Kenny Pitt

Thanks for the kind words - but I really appreciate you guys for doing all the work on this.

sean

From davejameson at comcast.net Mon Apr 5 20:17:31 2004
From: davejameson at comcast.net (Dave Jameson)
Date: Mon Apr 5 20:19:58 2004
Subject: [spambayes-dev] spamBayes ideas
Message-ID:

Hello,

First let me say thank you for all your hard work on this project - it is fantastic! I have recommended it to many people who have found it to be everything I claimed ;-) I am a product planner for a very large software project so hopefully my ideas aren't too lame.

1. I have noticed lately that many spammers are moving to add fake HTML tags in the middle of the words to screw the parsers up, much in the same way that people obfuscate their email addresses on web pages to beat the spambots. (E.G.
from a spam received today - www.lifeisimportant.biz

). I was thinking if you could database valid HTML tags (perhaps learned and pre-populated?) so that new unknown tags would count as spam probability. This would primarily mean inverting the way < > tags are handled compared to other words, that is assuming spam, learning ham. In the above example
would be ham and the others spam. You could even set a property file to allow x number of false tags to score the whole email as spam. In the above example spam there were 11 fake tags.

2. The last one is a bit fancy but here goes. On possible spam measure the recovered vs. bad and look at the scores. With an algorithm you should be able to auto adjust the thresholds - just a thought.

HTH,
Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040405/2bed38cc/attachment.html

From kennypitt at hotmail.com Tue Apr 6 09:13:39 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr 6 09:14:59 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

sean darcy wrote:
>> I went ahead and took a look at this. It was a different problem
>> accidentally introduced a little while ago while fixing a previous
>> bug. I checked in another fix in Corpus.py for it. Look for
>> revision 1.18 to come through anon cvs.
>
> Did that. Sadly:
>
> Training...
> 500 Server error
>
> Traceback (most recent call last):
>
> File "/usr/lib/python2.3/site-packages/spambayes/Dibbler.py", line
> 461, in found_terminator getattr(plugin, name)(**params)
>
> File "/usr/lib/python2.3/site-packages/spambayes/ProxyUI.py", line
> 391, in onReview fromCache=True)
>
> File "/usr/lib/python2.3/site-packages/spambayes/Corpus.py", line
> 209, in takeMessage if (notate_opt is not None) and (opt in
> notate_opt) and \
>
> AttributeError: 'NoneType' object has no attribute 'startswith'

Oops, looks like I misread the original error message. The fix I put in is probably a useful safeguard, but not the one that was causing the problem.

In looking more closely, though, something seems a little odd here. The offending object that is coming back None appears to be the msg[header] reference.
If I'm not mistaken, that means that either the Subject: or To: header is missing entirely from the message, which is very unusual.

Could you, by chance, attach a copy of the message that is causing the error? A copy of it should appear as a file in one of the cache directories below the directory containing your training database, or you could just view the message source from Review Messages and copy-and-paste it.

--
Kenny Pitt

From kennypitt at hotmail.com Tue Apr 6 09:37:21 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr 6 09:38:38 2004
Subject: [spambayes-dev] Incremental filtering and the spam folder
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677BCA@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> The log file has a rather confusing message when incremental training
> is *disabled*:
>
> """
> SpamBayes: Watching (for incremental training) in 'Personal
> Folders/Spam' """
>
> This gets logged because the spam folder is (from what I can tell)
> still watched for incremental training, but when an item gets added
> to it, if the train_manual_spam option is False, nothing happens.
>
> Would it not be better to only add the hook if the train_manual_spam
> option is True? Or is there some other reason that the spam folder
> has to be hooked?

Haven't looked at the code to see if anything else is going on in the hook function, but at the very least it seems like we should check the train_manual_spam option and not generate the log message if incremental training is disabled.
--
Kenny Pitt

From seandarcy at hotmail.com Tue Apr 6 14:22:33 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Tue Apr 6 14:22:41 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

----Original Message Follows----
From: "Kenny Pitt" <kennypitt@hotmail.com>
To: "'sean darcy'" <seandarcy@hotmail.com>,<skip@pobox.com>
CC: <spambayes-dev@python.org>
Subject: RE: [spambayes-dev] Dibbler.py error in training
Date: Tue, 6 Apr 2004 09:13:39 -0400

...................................
>Oops, looks like I misread the original error message. The fix I put in
>is probably a useful safeguard, but not the one that was causing the
>problem.
>
>In looking more closely, though, something seems a little odd here. The
>offending object that is coming back None appears to be the msg[header]
>reference. If I'm not mistaken, that means that either the Subject: or
>To: header is missing entirely from the message, which is very unusual.

It's not that unusual for the Subject header to be missing. Looking over past emails, I've found some "ham" posts that had no subject. In any event, some of the posts to be trained do have no Subject - all spam.
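
A minimal sketch of the failure being discussed, for anyone who wants to reproduce it outside SpamBayes. It relies only on the standard `email` package, where indexing a message by a header name that is absent returns None instead of raising. The helper name `header_startswith`, the sample message, and the "spam," prefix are made up for illustration; this is not the code in Corpus.py.

```python
from email import message_from_string

# A message with no Subject: or To: header at all, similar to the spam in this thread.
raw = "From: qziwpklwit@musician.org\nMessage-id: <x@example.invalid>\n\nbody\n"
msg = message_from_string(raw)


def header_startswith(msg, header, prefix):
    """Hypothetical helper: prefix test that tolerates a missing header."""
    value = msg[header]  # None when the header is absent, not an exception
    if value is None:
        return False
    return value.startswith(prefix)


missing = msg["Subject"]                                  # header is absent
safe = header_startswith(msg, "Subject", "spam,")         # guarded, no AttributeError
present = header_startswith(msg, "From", "qziwpklwit")    # header exists
```

Calling `msg["Subject"].startswith(...)` directly on this message raises the same AttributeError as in the traceback, which is why the guard has to test the header value itself.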
Here's an example from "tokens" on the untrained message page:

Tokens for: (none) (15)

Word                                  Probability  Times in ham  Times in spam
content-type:text/plain               0.288326     1576          556
from:addr:qziwpklwit                  -            0             0
from:addr:musician.org                0.844828     0             1
from:no real name:2**0                0.186886     825           165
to:none                               0.878691     2             14
cc:none                               0.351951     979           463
sender:none                           0.410456     978           593
reply-to:none                         0.271479     746           242
x-mailer:none                         0.417812     832           520
message-id:@mta13.srv.hcvlny.cv.net   0.844828     0             1
header:Date:1                         0.500287     1742          1519
header:Received:3                     0.77726      215           654
header:Message-id:1                   0.907877     144           1238
header:From:1                         0.500718     1739          1519
header:Return-path:1                  0.940104     95            1302

Here's the message source:

Return-path:
Received: from mta13.srv.hcvlny.cv.net (mta13.srv.hcvlny.cv.net [167.206.5.82])
 by mstr9.srv.hcvlny.cv.net (iPlanet Messaging Server 5.2 HotFix 1.16 (built May 14 2003))
 with ESMTP id <0HVC00G0PB4QME@mstr9.srv.hcvlny.cv.net>; Mon, 29 Mar 2004 08:36:26 -0500 (EST)
Received: from f94006.upc-f.chello.nl (f94006.upc-f.chello.nl [80.56.94.6])
 by mta13.srv.hcvlny.cv.net (iPlanet Messaging Server 5.2 HotFix 1.16 (built May 14 2003))
 with SMTP id <0HVC00ISEAU5TL@mta13.srv.hcvlny.cv.net>; Mon, 29 Mar 2004 08:34:03 -0500 (EST)
Received: from 123.224.24.65 by 80.56.94.6 with qdtrhun [1
Date: Mon, 29 Mar 2004 08:34:03 -0500 (EST)
Date-warning: Date header was inserted by mta13.srv.hcvlny.cv.net
From: qziwpklwit@musician.org
Message-id: <0HVC00IM1B0CTL@mta13.srv.hcvlny.cv.net>
Content-transfer-encoding: 7BIT
X-Spambayes-Classification: unsure
X-Spambayes-Spam-Probability: 0.84
X-Spambayes-Level: ********
X-Spambayes-MailId: 1080858684-6

>Could you, by chance, attach a copy of the message that is causing the
>error?

The untrained message page has about 60 messages. How do I know which one is the problem?

>A copy of it should appear as a file in one of the cache
>directories below the directory containing your training database, or
>you could just view the message source from Review Messages and
>copy-and-paste it.
You've lost me. Here's my spambayes data directory:

ls
bayescustomize.ini      _pop3proxy.log            pop3proxy-spam-cache
bayescustomize.ini~     pop3proxy.log-1           pop3proxy-unknown-cache
bayescustomize.ini.bak  pop3proxy.log-evolution   spambayes.messageinfo.db
hammie.db               pop3proxy.log-evolution~  start.info
pop3proxy-ham-cache     pop3proxy.log-mozilla

When I grep for the odd "From" name I get nothing:

grep -R qziwpklwit *

I'm looking for spam in all the wrong places.

>--
>Kenny Pitt

sean

From kennypitt at hotmail.com Tue Apr 6 15:55:46 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Apr 6 15:57:06 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

sean darcy wrote:
>> In looking more closely, though, something seems a little odd here.
>> The offending object that is coming back None appears to be the
>> msg[header] reference. If I'm not mistaken, that means that either
>> the Subject: or To: header is missing entirely from the message,
>> which is very unusual.
>
> It's not that unusual for the Subject header to be missing. Looking
> over past emails, I've found some "ham" posts that had no subject. In
> any event, some of the posts to be trained do have no Subject - all
> spam.

Well, it's certainly not unusual for the Subject: header to be empty but I didn't realize that it was legal to leave out the header entirely. Guess I'll have to go back and re-read the spec!

Anyway, I checked in a new fix (Corpus.py 1.19) to guard against missing headers, so give that a try when it comes through and let us know the results.

>> Could you, by chance, attach a copy of the message that is causing
>> the error?
>
> The untrained message page has about 60 messages. How do I know which
> one is the problem?
Click the "Defer" heading to make sure that is the default for all messages, then select a classification for only one message at a time to see which one dies. You can then go back to Review Messages and click the subject of that message to display the message source.

>> A copy of it should appear as a file in one of the cache
>> directories below the directory containing your training database, or
>> you could just view the message source from Review Messages and
>> copy-and-paste it.
>
> You've lost me. Here's my spambayes data directory:
>
> ls
> bayescustomize.ini _pop3proxy.log pop3proxy-spam-cache
> bayescustomize.ini~ pop3proxy.log-1
> pop3proxy-unknown-cache
> bayescustomize.ini.bak pop3proxy.log-evolution
> spambayes.messageinfo.db
> hammie.db pop3proxy.log-evolution~ start.info
> pop3proxy-ham-cache pop3proxy.log-mozilla

The pop3proxy-unknown-cache subdirectory contains copies of e-mails that haven't been trained yet, up to the expiration age which I believe defaults to 7 days. No worries, though. The message source you included in the message was what I was interested in.

--
Kenny Pitt

From tameyer at ihug.co.nz Wed Apr 7 02:59:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 7 02:59:20 2004
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes Corpus.py, 1.18, 1.19
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305CE452B@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BE7@its-xchg4.massey.ac.nz>

> The real culprit seems to be msg[header], so check that for None
> instead. It seems odd for a message to be missing a Subject: or
> To: header, but this is spam after all and malformed messages are
> not unusual.

I believe this is the correct fix. Sorry I didn't manage to look at this previously (or just think of this case in the first place and allow for it in the code I wrote), but I've been flat out this week.
Thanks heaps for doing the work to fix it Kenny :)

=Tony Meyer

From seandarcy at hotmail.com Thu Apr 8 00:00:01 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Thu Apr 8 00:00:06 2004
Subject: [spambayes-dev] Dibbler.py error in training
Message-ID:

>----Original Message Follows----
>From: "Kenny Pitt" <kennypitt@hotmail.com>
>To: "'sean darcy'"
><seandarcy@hotmail.com>,<skip@pobox.com>
>CC: <spambayes-dev@python.org>
>Subject: RE: [spambayes-dev] Dibbler.py error in training
>Date: Tue, 6 Apr 2004 15:55:46 -0400
>
>Anyway, I checked in a new fix (Corpus.py 1.19) to guard against missing
>headers, so give that a try when it comes through and let us know the
>results.

Tada! It worked. Thanks for all the help.

sean

From ta-meyer at ihug.co.nz Thu Apr 8 02:47:31 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Thu Apr 8 02:47:42 2004
Subject: [spambayes-dev] RE: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305A4976A@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677BF2@its-xchg4.massey.ac.nz>

Sorry about the slow response.

> I've just installed it on my Win2K machine with Outlook 2000
> and it fails upon Outlook startup. Message is pretty
> unhelpful ("failed to load, close Outlook & restart", or
> something similar). Needless to say, restart did not help.
> I've unchecked it in the COM add-ins, re-checked but to no
> avail. So I've uninstalled it and reinstalled previous
> version (09).

I think that maybe this is to do with the version of Outlook that is used to build the installer. There is a comment somewhere that says that Outlook 2000 should be used, and I only have Outlook 2002, so used that. Nothing much else has changed with Outlook, so I'm fairly confident that that's the problem.
It does mean that Mark needs to build the installer, though, or someone else with OL2K.

Apart from the ActivePython problem (and assuming the above is correct), I think that we're ready to put the release out. I'm happy to do the build & everything, if Mark is available to build the binaries (there's been a lot of pywin32 activity lately, so he might have the time).

Mark (if you're reading this) - what are your thoughts about putting a release out? Are you able to use the code that Thomas posted to patch the ActivePython pythoncom.dll, or do we just require a newer ActivePython install to use new spambayes releases? (Once the new ActivePython is out, of course!).

> What do you need in order to analyse the problem?

About 2 extra hours in each day .

=Tony Meyer

From theller at python.net Thu Apr 8 05:46:39 2004
From: theller at python.net (Thomas Heller)
Date: Thu Apr 8 05:46:48 2004
Subject: [spambayes-dev] Re: 1.0b1 Release candidates
References: <1ED4ECF91CDED24C8D012BCF2B034F1305A4976A@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677BF2@its-xchg4.massey.ac.nz>
Message-ID: <3c7eiyyo.fsf@python.net>

"Tony Meyer" writes:

> Sorry about the slow response.
>
>> I've just installed it on my Win2K machine with Outlook 2000
>> and it fails upon Outlook startup. Message is pretty
>> unhelpful ("failed to load, close Outlook & restart", or
>> something similar). Needless to say, restart did not help.
>> I've unchecked it in the COM add-ins, re-checked but to no
>> avail. So I've uninstalled it and reinstalled previous
>> version (09).
>
> I think that maybe this is to do with the version of Outlook that is used to
> build the installer. There is a comment somewhere that says that Outlook
> 2000 should be used, and I only have Outlook 2002, so used that. Nothing
> much else has changed with Outlook, so I'm fairly confident that that's the
> problem. It does mean that Mark needs to build the installer, though, or
> someone else with OL2K.
>
> Apart from the ActivePython problem (and assuming the above is correct), I
> think that we're ready to put the release out. I'm happy to do the build &
> everything, if Mark is available to build the binaries (there's been a
> lot of pywin32 activity lately, so he might have the time).
>
> Mark (if you're reading this) - what are your thoughts about putting a
> release out? Are you able to use the code that Thomas posted to patch the
> ActivePython pythoncom.dll,

... and should that code go into py2exe (maybe with an additional change to py2exe so that it can specify the LCID for the resources) ...

> or do we just require a newer ActivePython
> install to use new spambayes releases? (Once the new ActivePython is out,
> of course!).

IIUC, nobody needs the new ActiveState Python to release spambayes, but the existing AS Python dll conflicts with the py2exe'd binaries.

(Besides: The existing AS Visual Python plugin for MS Visual Studio also relies on the pywin32 registry entries, so it might take some time to sort this out).

>
>> What do you need in order to analyse the problem?
>
> About 2 extra hours in each day .

Now that's a great idea - where can I get them ;-) ?

>
> =Tony Meyer

Thomas

From seandarcy at hotmail.com Fri Apr 9 19:35:58 2004
From: seandarcy at hotmail.com (sean darcy)
Date: Fri Apr 9 19:36:04 2004
Subject: [spambayes-dev] train on missing headers?
Message-ID:

[ I posted this before, but it didn't show up. So if it does...]

Now that there's a fix for missing headers, I realize how much of my spam is in fact missing headers, esp. Subject headers. But when I look at clues, missing headers isn't one of them. Most of this spam is classed as either unsure or ham. Maybe continued training will sort this out.

It seems to me that missing a header has lots of predictive value. Can this be incorporated in the spambayes tokens/clues?

sean

From mhammond at skippinet.com.au Fri Apr 9 21:02:14 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Apr 9 21:02:37 2004
Subject: [spambayes-dev] RE: 1.0b1 Release candidates
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677BF2@its-xchg4.massey.ac.nz>
Message-ID: <000601c41e97$771c60e0$0200a8c0@eden>

> Sorry about the slow response.

You should talk! :)

> I think that maybe this is to do with the version of Outlook
> that is used to build the installer. There is a comment somewhere
> that says that Outlook 2000 should be used, and I only have
> Outlook 2002, so used that. Nothing much else has changed with
> Outlook, so I'm fairly confident that that's the problem.

Unfortunately, the typelibs for office 2000 are hard-coded in a couple of spots. I fear that upgrading these to later typelibs will prevent SpamBayes working at all for the older users. No time to, or easy way to look at this particular problem.

> It does mean that Mark needs to build the
> installer, though, or someone else with OL2K.

Which I am still struggling to do! The binary builds and registers fine, but silently fails to load when I start outlook. But this time I am not trying to do it 1 day before taking off, and will nail it :) Most wolves have moved away from my door, so I have a little time now.

> Apart from the ActivePython problem (and assuming the above
> is correct), I think that we're ready to put the release out.
> I'm happy to do the build & everything, if Mark is available
> to build the binaries (there's been a lot of pywin32 activity
> lately, so he might have the time).

See above :) I think for now I will stick with the original plan - manually edit the resource string in the Python DLL we ship.

> Mark (if you're reading this) - what are your thoughts about putting a
> release out?
> Are you able to use the code that Thomas posted
> to patch the ActivePython pythoncom.dll,

I'm sorry, but I seem to have missed that, and can't find it. I've a message or 2 from Thomas to catch up on next, but don't recall it being one of them.

Mark.

From tim.one at comcast.net Fri Apr 9 22:43:09 2004
From: tim.one at comcast.net (Tim Peters)
Date: Fri Apr 9 22:43:15 2004
Subject: [spambayes-dev] train on missing headers?
In-Reply-To:
Message-ID:

[sean darcy]
> Now that there's a fix for missing headers, I realize how much of my
> spam is in fact missing headers, esp. Subject headers. But when I
> look at clues, missing headers isn't one of them. Most of this spam
> is classed as either unsure or ham. Maybe continued training will
> sort this out.
>
> It seems to me that missing a header has lots of predictive value. Can
> this be incorporated in the spambayes tokens/clues?

You can set the option

    [Tokenizer]
    record_header_absence: True

to experiment with this. I know it's helpful for me (or was, more than a year ago, when I last tested it ).

From mhammond at skippinet.com.au Fri Apr 9 23:06:38 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Apr 9 23:06:57 2004
Subject: [spambayes-dev] RE: Incremental filtering and the spam folder
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677BCA@its-xchg4.massey.ac.nz>
Message-ID: <002001c41ea8$d8708bd0$0200a8c0@eden>

> Would it not be better to only add the hook if the
> train_manual_spam option is True? Or is there some other
> reason that the spam folder has to be hooked?

I think you are correct - we could avoid the hook altogether. However, I'm still reluctant to change this, as it does risk breakage. We can fix it post 1.0.

Mark.
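
A rough sketch of the change discussed in this thread, under the assumption that the only reason to hook the spam folder is incremental training. The function name `folders_to_watch` and the logger name are hypothetical and do not come from the Outlook add-in; the point is only that the hook and its log line are both skipped when train_manual_spam is off.

```python
import logging

log = logging.getLogger("spambayes.sketch")  # hypothetical logger name


def folders_to_watch(spam_folder, train_manual_spam):
    """Return the folders that should get an incremental-training hook."""
    watched = []
    if train_manual_spam:
        # Only hook (and only log) when manual spam training is enabled.
        watched.append(spam_folder)
        log.info("Watching (for incremental training) in '%s'", spam_folder)
    return watched


disabled = folders_to_watch("Personal Folders/Spam", False)
enabled = folders_to_watch("Personal Folders/Spam", True)
```

If the hook turns out to serve another purpose, the same test could guard just the log message, which is the smaller change Kenny suggested.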
From matt at mondoinfo.com Sat Apr 10 17:05:51 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sat Apr 10 17:08:47 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer Message-ID: <1081622811.78.614@mint-julep.mondoinfo.com> I've lately been getting a bunch of spam that's almost entirely nonsense except for a link or two. Perhaps not surprisingly, SpamBayes hasn't been catching it all that well. I could probably improve SpamBayes's performance by turning on more header checks but on account of some peculiarities of my email, I'm reluctant to do that. (I read various postmaster, webmaster, and ARIN contact addresses that get almost nothing but spam but it's important that I see what little legitimate mail goes to them.) I don't remember who mentioned it here first, but it seemed to me that adding a DNS lookup for URLs to the tokenizer would be a good idea. There's hardly any limit to the number of domains a spammer can register, but the number of networks that are willing to host a spammer's website seems to be reasonably small. So I hacked the tokenizer to generate tokens for the address that a URL in a message resolves to. It generates four tokens for each address, stripping values from the dotted-quad from right to left. That is, 10.1.2.3 would generate: url-ip:10/8 url-ip:10.1/16 url-ip:10.1.2/24 url-ip:10.1.2.3/32 (I realize that that's not how networks are allocated these days, but byte boundaries seemed as good an arbitrary place to make the cuts as any other.) A day's worth of unscientific testing suggested that it works pretty well; the new tokens quickly started to show up in the classifier's evidence. So I set up buckets for a 5-way cross-validation set and ran timcv.py. The only classification difference between the two runs is that unsures dropped from 27 to 25. 
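The url-ip token scheme described above can be sketched in a few lines. This is only an illustration: the function name url_ip_tokens is hypothetical (it is not part of tokenizer.py), the code is written for a current Python rather than the 2.3 of the thread, and it assumes it is handed a well-formed IPv4 dotted quad.

```python
def url_ip_tokens(dotted_quad):
    """Return the four url-ip tokens for one IPv4 address, widest prefix first.

    Hypothetical helper for illustration; prefixes are cut at byte
    boundaries, as in the description above.
    """
    octets = dotted_quad.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted quad, got %r" % (dotted_quad,))
    # n octets kept -> prefix length of 8 * n bits
    return ["url-ip:%s/%d" % (".".join(octets[:n]), 8 * n) for n in range(1, 5)]


if __name__ == "__main__":
    for token in url_ip_tokens("10.1.2.3"):
        print(token)
```

Keeping the prefix length in the token text means a /16 clue and a /24 clue for the same network stay distinct in the database.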
Here's the output from cmp.py for those who can interpret it better than I can:

nodnss.txt -> dnss.txt
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.1 to 0.1 tied

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fn went from 1 to 1 tied
mean fn % went from 0.1 to 0.1 tied

ham mean                     ham sdev
   0.27    0.22  -18.52%        3.16    2.51  -20.57%
   0.36    0.33   -8.33%        3.83    3.61   -5.74%
   0.68    0.66   -2.94%        7.28    7.21   -0.96%
   0.14    0.10  -28.57%        1.03    0.89  -13.59%
   0.31    0.30   -3.23%        2.54    2.54   +0.00%

ham mean and sdev for all runs
   0.35    0.32   -8.57%        4.13    3.97   -3.87%

spam mean                    spam sdev
  99.90   99.82   -0.08%        1.02    1.28  +25.49%
  99.74   99.83   +0.09%        2.99    1.98  -33.78%
  98.91   98.91   +0.00%        5.15    5.11   -0.78%
  98.39   98.44   +0.05%        9.37    9.35   -0.21%
  98.86   98.79   -0.07%        6.36    6.84   +7.55%

spam mean and sdev for all runs
  99.16   99.16   +0.00%        5.77    5.79   +0.35%

ham/spam mean difference: 98.81 98.84 +0.03

I suspect that the results would have been better if I had chosen more recent spam. I think that I inadvertently chose the oldest spam from my spam archive. In case anyone would like to play with it, I'll append my trivial patch.
It requires pydns from:

http://sourceforge.net/projects/pydns/

I think that some lines may need to be un-wrapped by hand. The code is governed by the option x-pick_apart_urls so you'll need to have that turned on for it to work. If you want to do comparison testing, you'll want that option turned on for both runs. You should note that while an individual DNS lookup is pretty cheap, doing thousands of them slows the test down a lot and may hammer your resolving nameserver pretty hard. I hacked it up in a way that suits me for testing only. Among the things that ought to be changed if anyone wants it added to the distributed code:

It should have its own option
The timeout should be configurable
The imports should be moved to a sane place

Regards, Matt

*** tokenizer.py.orig 2004-04-10 12:13:20.000000000 -0500
--- tokenizer.py 2004-04-10 15:34:21.000000000 -0500
***************
*** 1052,1057 ****
--- 1052,1078 ----
              url = urllib.unquote(url)
              scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+
+             import DNS
+             import DNS.Base
+             DNS.DiscoverNameServers()
+             r=DNS.DnsRequest(timeout=1)
+             try:
+                 replies=r.req(netloc).answers
+             except DNS.Base.DNSError:
+                 pass
+             else:
+                 for reply in replies: # Should we limit to one A record?
+                     if reply["typename"]=="A":
+                         dottedQuad=reply["data"]
+                         pushclue("url-ip:%s/32" % dottedQuad)
+                         dottedQuadList=dottedQuad.split(".")
+                         pushclue("url-ip:%s/8" % dottedQuadList[0])
+                         pushclue("url-ip:%s.%s/16" % (dottedQuadList[0],dottedQuadList[1]))
+                         pushclue("url-ip:%s.%s.%s/24" % (dottedQuadList[0],
+                                  dottedQuadList[1],dottedQuadList[2]))
+
+

              # one common technique in bogus "please (re-)authorize yourself"
              # scams is to make it appear as if you're visiting a valid
              # payment-oriented site like PayPal, CitiBank or eBay, when you

From skip at pobox.com Sat Apr 10 22:49:25 2004 From: skip at pobox.com (Skip Montanaro) Date: Sat Apr 10 22:49:29 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1081622811.78.614@mint-julep.mondoinfo.com> References: <1081622811.78.614@mint-julep.mondoinfo.com> Message-ID: <16504.45621.192141.340899@montanaro.dyndns.org> Matt> I don't remember who mentioned it here first, but it seemed to me Matt> that adding a DNS lookup for URLs to the tokenizer would be a good Matt> idea. There's hardly any limit to the number of domains a spammer Matt> can register, but the number of networks that are willing to host Matt> a spammer's website seems to be reasonably small. So I hacked the Matt> tokenizer to generate tokens for the address that a URL in a Matt> message resolves to. Matt, Doesn't mine_received_headers work for you? I've got lots of tokens in my database like:

received:65.248
received:65.248.59
received:65.248.59.178
received:65.248.59.196
received:65.248.59.35

which records all the possible fragments of the ip addresses through which the mail moves.
Skip From matt at mondoinfo.com Sat Apr 10 23:19:29 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sat Apr 10 23:20:30 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <16504.45621.192141.340899@montanaro.dyndns.org> References: <1081622811.78.614@mint-julep.mondoinfo.com> <16504.45621.192141.340899@montanaro.dyndns.org> Message-ID: <1081652165.08.529@mint-julep.mondoinfo.com> Dear Skip, > Doesn't mine_received_headers work for you? I've got lots of > tokens in my database like: > received:65.248 > received:65.248.59 > received:65.248.59.178 > received:65.248.59.196 > received:65.248.59.35 > which records all the possible fragments of the ip addresses > through which the mail moves. I expect that it would help most of the time, but it's not what I wanted to do. Some of the addresses that I read go through different SMTP servers. In particular, there are two servers that receive mail for a webmaster address, a postmaster address, and an ARIN contact address that I read. Those addresses get almost nothing but spam, but I need to get what little legitimate mail does get sent to them. Using mine_received_headers, I'd have a very strong spam clue that was really for the wrong reason. Whether that one clue would push the legitimate mail that I get at those addresses into the wrong bucket is hard for me to tell since I don't get enough legitimate mail sent to them to be able to perform much of an experiment. In addition, my unscientific poking at recent spam suggests to me that spam is sent to my servers from a lot of different places. But the sites spamvertized tend to be on a much smaller number of networks. It seems that it's easier for a spammer to find a compromised PC to relay though than it is for them to find someone willing to host a their site. 
For example, looking though my logs for this evening, I find four spams that advertise seemingly unrelated products but which have URLs that resolve to addresses within the same /24 in China. Regards, Matt From skip at pobox.com Sat Apr 10 23:48:20 2004 From: skip at pobox.com (Skip Montanaro) Date: Sat Apr 10 23:48:23 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1081652165.08.529@mint-julep.mondoinfo.com> References: <1081622811.78.614@mint-julep.mondoinfo.com> <16504.45621.192141.340899@montanaro.dyndns.org> <1081652165.08.529@mint-julep.mondoinfo.com> Message-ID: <16504.49156.490243.142533@montanaro.dyndns.org> Matt> In particular, there are two servers that receive mail for a Matt> webmaster address, a postmaster address, and an ARIN contact Matt> address that I read. Those addresses get almost nothing but spam, Matt> but I need to get what little legitimate mail does get sent to Matt> them. Using mine_received_headers, I'd have a very strong spam Matt> clue that was really for the wrong reason. Whether that one clue Matt> would push the legitimate mail that I get at those addresses into Matt> the wrong bucket is hard for me to tell since I don't get enough Matt> legitimate mail sent to them to be able to perform much of an Matt> experiment. Unless those messages are extremely short, I doubt it would matter much. It's going to be just one clue among many. I have no trouble getting the occasional good mail from the pychecker mailing list, which gets almost nothing but spam these days. Matt> It seems that it's easier for a spammer to find a compromised PC Matt> to relay though than it is for them to find someone willing to Matt> host a their site. In which case I doubt either of these network ip classification schemes will have much effect. 
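To make the "one clue among many" argument concrete, here is a rough sketch of chi-squared combining in the style the classifier uses. It is an illustration under stated assumptions, not the SpamBayes code: the names chi2q and combined_score are made up for this example, the clue probabilities are invented, every probability is assumed to lie strictly between 0 and 1, and the sketch skips the real classifier's clue selection and limits.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with `dof` degrees of
    freedom is at least x2. Only valid for even dof."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def combined_score(probs):
    """Combine per-clue spam probabilities into one score in [0, 1]."""
    n = len(probs)
    # s is near 1 when the clues look collectively spammy,
    # h is near 1 when they look collectively hammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


if __name__ == "__main__":
    hammy = [0.1] * 20    # a typical good message with plenty of hammy clues
    sparse = [0.5] * 3    # a nearly empty message with no useful clues
    for name, clues in (("hammy", hammy), ("sparse", sparse)):
        before = combined_score(clues)
        after = combined_score(clues + [0.99])  # add one strong spam clue
        print("%-6s before=%.3f after=%.3f" % (name, before, after))
```

With many hammy clues the extra spam clue barely moves the score, while on a message with almost no other evidence it shifts the score well toward spam, which is the distinction being drawn in this thread.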
Skip From sethg at GoodmanAssociates.com Sun Apr 11 00:05:47 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Sun Apr 11 00:05:49 2004 Subject: [spambayes-dev] various Outlook version and RFC2822 compliance Message-ID: I have heard before that certain types of checks, such as the name of file attachments, are not possible in the Outlook plug-in since the data format used was pre-RFC2822 and Outlook, in fact, destroys some of the MIME structure necessary to see these things. I have also heard that later versions of Outlook are more RFC2822 compliant. I have a few questions here that have probably been discussed among the developers already. 1) How RFC2822 compliant is the stored message format in the various versions of Outlook subsequent to Outlook2000? Without even looking at the code I can tell that Outlook2000 is not compliant due to the total absence of a References: header, which causes many people real problems who view mailing lists by conversation thread. 2) If the later versions of Outlook are more (or perhaps even fully?) RFC2822 compliant, would it be possible to detect the Outlook version and enable generating the additional tokens that are available with the web proxy? I realize this is not a simple matter. I was just wondering how far we are from a more unified code base. -- Seth Goodman From tim.one at comcast.net Sun Apr 11 00:51:50 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Apr 11 00:51:57 2004 Subject: [spambayes-dev] various Outlook version and RFC2822 compliance In-Reply-To: Message-ID: [Seth Goodman] > I have heard before that certain types of checks, such as the name of file > attachments, are not possible in the Outlook plug-in It's not that this is impossible, it's that nobody has written Outlook-specific code necessary to do it. > since the data format used was pre-RFC2822 and Outlook, in fact, destroys > some of the MIME structure necessary to see these things. Outlook destroys all MIME structure. 
Our parser understands only MIME structure. > I have also heard that later versions of Outlook are more RFC2822 > compliant. Outlook keeps getting better at both accepting and creating standard email, but it doesn't store email in this format. Our Outlook addin sees email in the way Outlook stores it. > I have a few questions here that have probably been discussed among the > developers already. > > 1) How RFC2822 compliant is the stored message format in the various > versions of Outlook subsequent to Outlook2000? Outlook's storage format has nothing to do with any Internet standard (regardless of Outlook version). It's possible to get the original headers as a blob of text from the Outlook message store (and we do), but that's all of the original MIME structure Outlook preserves. ... > 2) If the later versions of Outlook are more (or perhaps even fully?) > RFC2822 compliant, would it be possible to detect the Outlook version and > enable generating the additional tokens that are available with the web > proxy? If the antecedent were true, yes . > I realize this is not a simple matter. I was just wondering how far we > are from a more unified code base. Tokenizing anything beyond what the Outlook addin can tokenize now will require new Outlook-specific code. From sethg at GoodmanAssociates.com Sun Apr 11 01:45:44 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Sun Apr 11 01:45:46 2004 Subject: FW: [spambayes-dev] various Outlook version and RFC2822 compliance Message-ID: > From: Tim Peters > Sent: Saturday, April 10, 2004 11:52 PM > > > [Seth Goodman] <...> > > I have also heard that later versions of Outlook are more RFC2822 > > compliant. > > Outlook keeps getting better at both accepting and creating > standard email, > but it doesn't store email in this format. Our Outlook addin > sees email in > the way Outlook stores it. Those filthy buggers. 
You'd think with the rest of the world using RFC2822, or at least trying to, these guys would relent and store the messages in that format so that just in case they ever wanted to use a message for anything later, it would all be there. But noooooo! > Tokenizing anything beyond what the Outlook addin can tokenize now will > require new Outlook-specific code. Sounds like a lot of work and an undocumented, moving target. A recipe for a mess. Too bad I'm so habituated to this mail client (and you guys have done such a fine job of integrating SpamBayes into it). Maybe OpenOffice will eventually make an Outlook look-alike, maybe even with an RFC2822 storage option. With open source code, you could actually see how the internals worked instead of reverse engineering it. But for better or worse, Outlook will probably remain the "standard" for many years to come. -- Seth Goodman From sethg at GoodmanAssociates.com Sun Apr 11 01:46:51 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Sun Apr 11 01:46:52 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer Message-ID: > From: Skip Montanaro > Sent: Saturday, April 10, 2004 10:48 PM > <...> > Matt> It seems that it's easier for a spammer to find a compromised PC > Matt> to relay though than it is for them to find someone willing to > Matt> host a their site. > > In which case I doubt either of these network ip classification > schemes will > have much effect. I don't know, Matt may have a point here. I've been getting a lot of salad spams that mostly end up in the Unsure folder and tend to score somewhat neutral. Many of them do not even use real words to dilute the sales pitch, they use random combinations of letters separated by white space so there are relatively few significant tokens. It's not the smartest strategy, but I've seen quite a bit of it. In such cases, could a strong spam clue, such as the netblock of a spamvertised web site, possibly push it from Unsure into Spam? 
I don't have a feel for Chi-squared combining so this is a question, not an assertion. I agree with Matt that because of the huge number of compromised windows boxes with cables modems on providers (like Comcast) that do not restrict outgoing port 25 connections to their smarthost, the chance of getting two spams from the same compromised box are almost nil. Even if you fragment the header IP addresses in the same way that Matt suggests (maybe you already do?), the sheer size of IP address space allocated to dynamic IP pools at major providers is orders of magnitude larger than the IP space of hosting services willing to host sites for enlargement products. It seems that the hosting service IP's are more likely generate strong spam clues than the source IP's of the compromised windows boxes. Whether this would ultimately make enough of a difference, I don't know. -- Seth Goodman From skip at pobox.com Sun Apr 11 08:25:15 2004 From: skip at pobox.com (Skip Montanaro) Date: Sun Apr 11 08:25:26 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: References: Message-ID: <16505.14635.741313.773689@montanaro.dyndns.org> Matt> It seems that it's easier for a spammer to find a compromised PC Matt> to relay though than it is for them to find someone willing to Matt> host a their site. Skip> In which case I doubt either of these network ip classification Skip> schemes will have much effect. Seth> I don't know, Matt may have a point here. I've been getting a lot Seth> of salad spams .... In such cases, could a strong spam clue, such Seth> as the netblock of a spamvertised web site, possibly push it from Seth> Unsure into Spam? Sure, if there are few tokens, one extra token may have a large enough effect. That wasn't the case I was referring to. Matt was worried about losing the occasional good message in a sea of spam on a few important mailing lists. 
If those good messages are fairly typical (or if he's trained on a few of them), there are probably plenty of hammy tokens in each one, in which case throwing in a netblock isn't going to add much. Seth> Even if you fragment the header IP addresses in the same way that Seth> Matt suggests (maybe you already do?), the sheer size of IP Seth> address space allocated to dynamic IP pools at major providers is Seth> orders of magnitude larger than the IP space of hosting services Seth> willing to host sites for enlargement products. Yes, I believe mine_received_headers does fragment in the same way as Matt's scheme (minus the /(8,16,24,32) suffix which I think is superfluous), which was why I mentioned it in the first place. I think with mine_received_headers enabled we're already collecting the same information (actually more in most instances, since all Received: headers are parsed). Here are some examples gotten using spamcounts (post-sorted by the spam prob) from my current database.

* mail.python.org (slightly hammy):

% spamcounts -r 'received:12.155'
db: /Users/skip/.hammiedb
token,nspam,nham,spam prob
received:12.155,269,387,0.40438528783
received:12.155.117,269,387,0.40438528783
received:12.155.117.29,269,387,0.40438528783

* pobox.com, main relay for most of my mail (again, mostly mildly hammy, though with some outliers):

% spamcounts -r 'received:(208\.58|207\.8)'
db: /Users/skip/.hammiedb
token,nspam,nham,spam prob
received:208.58.216,0,1,0.155172413793
received:208.58.216.73,0,1,0.155172413793
received:207.8.226.3,66,92,0.412197950796
received:207.8.214.3,67,93,0.413216308473
received:207.8.214,73,98,0.42129893514
received:208.58.1.193,87,116,0.422927556996
received:207.8,208,269,0.430284644233
received:207.8.226,135,171,0.435415990012
received:208.58,193,239,0.440949675391
received:208.58.1,193,238,0.441982563175
received:208.58.1.194,99,118,0.450429768447
received:207.8.226.2,69,79,0.460422504704
received:208.58.1.197,5,5,0.494310099573
received:207.8.214.2,6,5,0.53799693756
received:208.58.1.198,4,1,0.771713070997

* mail.mojam.com, where my mail eventually winds up (mildly spammy because I get lots of non-skip@mojam.com stuff there which is primarily spam):

% spamcounts -r 'received:199.249'
db: /Users/skip/.hammiedb
token,nspam,nham,spam prob
received:199.249.165.21,0,1,0.155172413793
received:199.249.165.25,0,1,0.155172413793
received:199.249,90,55,0.614718002838
received:199.249.165,90,55,0.614718002838
received:199.249.165.175,90,54,0.619037063122

Now I cheat and just sort all received: features by spam prob. The highest is

received:69.6,7,0,0.969798657718
received:biz,7,0,0.969798657718

(perhaps not surprising). Looking up some of the individual addresses in the 69.6 block yields a bunch of "host not found" responses. Also, not all that surprising. Looking at the other end of the spectrum, I see

received:66.163,0,6,0.0348837209302

The ip's I have in that block refer to Yahoo's mail servers. This suggests to me they do a pretty good job keeping their relays closed to abuse. Seth> It seems that the hosting service IP's are more likely generate Seth> strong spam clues than the source IP's of the compromised windows Seth> boxes. Whether this would ultimately make enough of a difference, Seth> I don't know. Of course, whether or not this helps on any given message depends to a large degree on how many other features the tokenizer extracts from the message. Switching gears a bit, I suspect we could probably toss out the received:N.N.N.N and received:N.N.N features and not lose much in the way of accuracy since all but a few of them are hapaxes.

feature pattern      total   hapaxes
---------------      -----   -------
received:N             177     77 (44%)
received:N.N          1606   1228 (76%)
received:N.N.N        2140   1927 (90%)
received:N.N.N.N      2548   2362 (93%)

Perhaps the same holds true for hostname-based features (received:biz, received:creosote.python.org, etc), though it's less clear cut. Perhaps none of them are worth keeping:

feature pattern      total   hapaxes
---------------      -----   -------
received:a             320    257 (80%)
received:a.a          1046    867 (83%)
received:a.a.a        1222   1062 (87%)
received:a.a.a.a       682    609 (89%)

The above data are from my database which currently contains 102863 tokens. If I removed all the three- and four-component received: features I'd reduce the database size by about six percent. I'll restate my question. What does Matt's proposal do that mine_received_headers doesn't do already? Skip

From matt at mondoinfo.com Sun Apr 11 12:28:49 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sun Apr 11 12:30:00 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <16504.49156.490243.142533@montanaro.dyndns.org> References: <1081622811.78.614@mint-julep.mondoinfo.com> <16504.45621.192141.340899@montanaro.dyndns.org> <1081652165.08.529@mint-julep.mondoinfo.com> <16504.49156.490243.142533@montanaro.dyndns.org> Message-ID: <1081656741.01.529@mint-julep.mondoinfo.com> Dear Skip, > Unless those messages are extremely short, I doubt it would matter > much. It's going to be just one clue among many. I have no trouble > getting the occasional good mail from the pychecker mailing list, > which gets almost nothing but spam these days. Thanks for the clue. I'll give it a try. >> It seems that it's easier for a spammer to find a compromised PC >> to relay though than it is for them to find someone willing to >> host their site. > In which case I doubt either of these network ip classification > schemes will have much effect. Sorry for not being clear. What I should have mentioned earlier is that it doesn't seem to me that an unusual amount of spam comes from the networks that host spammers' websites. I don't think that mine_received_headers and the scheme I'm testing will generate much of the same data.
In the last 24 hours, I've had 29 spams for which SpamBayes's classifier used as evidence URL's IPs in 202/8, 218/8, 219/8, and 221/8. On the ham side, the IP for mail.python.org has figured in evidence for 15 hams. Spammers seem to be limited in their choice of networks for hosting, but they can't know what networks the URLs that you or I get in ham messages will resolve to. In that respect, those IPs fit well with what SpamBayes does: spammers have a constrained spam "vocabulary" and can't know a random individual's limited ham "vocabulary". Regards, Matt From Arnold.Lou829 at rogers.com Sun Apr 11 12:41:45 2004 From: Arnold.Lou829 at rogers.com (Lou Arnold) Date: Sun Apr 11 12:38:07 2004 Subject: [spambayes-dev] Compatibility with Norton Antivirus Message-ID: I have installed Norton Anti-Virus (NAV) to secure my incoming email. As I understand things, it has a POP3 Proxy that sits between the ISP mail server and my MS Outlook 2000. Q: Will the SpamBayes' POP3 proxy replace or interfere with NAV? From pje at telecommunity.com Sun Apr 11 12:55:05 2004 From: pje at telecommunity.com (Phillip J. Eby) Date: Sun Apr 11 12:55:05 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: Message-ID: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> At 08:25 AM 4/11/04 -0400, spambayes-dev-request@python.org wrote: >I'll restate my question. What does Matt's proposal do that >mine_received_headers doesn't do already? It looks at URLs embedded in the message *body*. As a simple contrast, if I link here to: http://enlarge-my-spam.com?id=123456 That will produce a very *different* set of IP tokens than the Received: headers of this message. And, if the same spam is sent from a thousand compromised PC's, they will all still have the same URL IP cues, despite lacking any Received: headers in common. Yes, they'll also have tokens representing parts of the domain name, but spammers can cheaply change their domain names to avoid being recognized. 
Their website IP addresses are not only harder to change, but take advantage of the fact that so-called "bulletproof hosting" providers are a "bad neighborhood" for links. So, if you train on these tokens, then you could potentially nail entirely unrelated spammers who simply host with the same ISP. Of course, the spammers' next move would likely be to use redirects from non-"bulletproof" hosts, but everything we can do to make it more difficult and more costly for them is a good thing. From sethg at GoodmanAssociates.com Sun Apr 11 14:33:53 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Sun Apr 11 14:33:54 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> Message-ID: > From: Phillip J. Eby > Sent: Sunday, April 11, 2004 11:55 AM > > > At 08:25 AM 4/11/04 -0400, spambayes-dev-request@python.org wrote: > >[Skip] > >I'll restate my question. What does Matt's proposal do that > >mine_received_headers doesn't do already? > > It looks at URLs embedded in the message *body*. ... That's _exactly_ what I was getting at. Mine_received_headers only looks at headers, which don't contain the IP's of spamvertised sites. Much of, if not most, spam today comes direct-to-MX from compromised windows boxes operating on broadband, dynamic IP connections from providers that don't limit customers' use of outgoing port 25 connections. The theory, if it is worth anything, is that the total size of the IP address space for "bad-boy" hosting service web-servers is puny compared with the dynamic IP pools of major providers who do not block outgoing port 25 connections. Having the token database learn the former is feasible, while having it learn the latter is pretty hopeless. For exactly the same reason, I would guess that the message source IP is probably better at identifying ham than spam. For this property alone, it is extremely valuable. 
My friends' tendency to use an occasional spammy word is partially offset by the strong ham clues from their outgoing MTA IP and their personal email address. In terms of detecting spam, the token database does a great job at detecting repetitive spam sources, but is somewhat ill-suited for the dynamic IP phenomenon. Rather than have the token database learn to be a mediocre dynamic IP blacklist, it would probably be better to use a proxy to query a real dynamic IP blacklist and add a header for SpamBayes to mine. However, that's outside the scope of SpamBayes. -- Seth Goodman

From skip at pobox.com Sun Apr 11 18:58:30 2004 From: skip at pobox.com (Skip Montanaro) Date: Sun Apr 11 18:58:35 2004 Subject: [spambayes-dev] Compatibility with Norton Antivirus In-Reply-To: References: Message-ID: <16505.52630.74591.355838@montanaro.dyndns.org> Lou> I have installed Norton Anti-Virus (NAV) to secure my incoming Lou> email. As I understand things, it has a POP3 Proxy that sits Lou> between the ISP mail server and my MS Outlook 2000. Lou> Q: Will the SpamBayes' POP3 proxy replace or interfere with NAV? If you configure things correctly, it should work just fine. Your setup might look like this:

 real           Spambayes           NAV                 Your
 POP3    <--->  POP3 proxy   <--->  POP3 proxy   <--->  email
 server         server              server              client

Let's pick some hypothetical names. Your machine is "localhost". Your real POP3 server is mail.myisp.com. Configure Spambayes to get mail from mail.myisp.com on port 110 and listen to port 110 on localhost. Configure NAV's proxy server to get mail from Spambayes on port 110 of localhost and listen to port 1110 on localhost. Configure your email client to get mail from port 1110 on localhost. The reason I suggest placing Spambayes ahead of NAV is that we've seen situations where NAV popped up a dialog box seeking input from the user which the user apparently didn't see (maybe it didn't get raised to the top of the stack of windows).
If NAV is upstream from Spambayes everything grinds to a halt and it looks like Spambayes has hung. By placing Spambayes between the real POP3 server and NAV at least its web interface should still be responsive. Maybe it's a useless distinction. It should work either way. You might have to configure things so NAV comes first if it can't connect to localhost on a port other than 110. Skip From skip at pobox.com Sun Apr 11 19:09:49 2004 From: skip at pobox.com (Skip Montanaro) Date: Sun Apr 11 19:09:54 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> References: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> Message-ID: <16505.53309.360402.514258@montanaro.dyndns.org> >> I'll restate my question. What does Matt's proposal do that >> mine_received_headers doesn't do already? Phillip> It looks at URLs embedded in the message *body*. As a simple Phillip> contrast, if I link here to: Phillip> http://enlarge-my-spam.com?id=123456 Phillip> That will produce a very *different* set of IP tokens than the Phillip> Received: headers of this message. Ah, okay. I missed that in Matt's post. If the tokenizer's x-pick_apart_urls option is True, it picks apart URLs embedded in the body of the message. It's not as ip-centered as Matt's code. Skip From matt at mondoinfo.com Sun Apr 11 19:25:11 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sun Apr 11 19:26:01 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <16505.53309.360402.514258@montanaro.dyndns.org> References: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> <16505.53309.360402.514258@montanaro.dyndns.org> Message-ID: <1081725547.82.472@mint-julep.mondoinfo.com> > Ah, okay. I missed that in Matt's post. If the tokenizer's > x-pick_apart_urls option is True, it picks apart URLs embedded in > the body of the message. It's not as ip-centered as Matt's code. 
Yes, the two needn't be related. That just turned out to be the best place in the code to do the lookup. I wanted the patch to be as simple as possible in the hopes that someone else would like to test it. Here's a little more data: Since yesterday morning, SpamBayes has scored 352 messages for me. Of those, a url-ip token has figured in the evidence for 262 of them. Only 90 were scored without one. Regards, Matt From skip at pobox.com Sun Apr 11 19:56:14 2004 From: skip at pobox.com (Skip Montanaro) Date: Sun Apr 11 19:56:25 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1081725547.82.472@mint-julep.mondoinfo.com> References: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> <16505.53309.360402.514258@montanaro.dyndns.org> <1081725547.82.472@mint-julep.mondoinfo.com> Message-ID: <16505.56094.688745.208522@montanaro.dyndns.org> >> Ah, okay. I missed that in Matt's post. If the tokenizer's >> x-pick_apart_urls option is True, it picks apart URLs embedded in the >> body of the message. It's not as ip-centered as Matt's code. Matt> Yes, the two needn't be related. That just turned out to be the Matt> best place in the code to do the lookup. I wanted the patch to be Matt> as simple as possible in the hopes that someone else would like to Matt> test it. Can your mods be easily factored into the x-pick_apart_urls option? 
Skip From matt at mondoinfo.com Sun Apr 11 20:10:01 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sun Apr 11 20:10:09 2004 Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <16505.56094.688745.208522@montanaro.dyndns.org> References: <5.1.1.6.0.20040411124603.026a56a0@mail.telecommunity.com> <16505.53309.360402.514258@montanaro.dyndns.org> <1081725547.82.472@mint-julep.mondoinfo.com> <16505.56094.688745.208522@montanaro.dyndns.org> Message-ID: <1081728065.93.472@mint-julep.mondoinfo.com> Dear Skip, > Can your mods be easily factored into the x-pick_apart_urls option? If I understand your question correctly, the answer is that the DNS lookup code is governed by that option now. If my patch were ever added to the distributed code, I suspect that it would make sense to leave it under x-pick_apart_urls and add another option that affected only the DNS lookup code. It would be necessary to note in the documentation that turning that option on only had an effect if x-pick_apart_urls was also turned on but I don't imagine that that would be a serious problem. Regards, Matt From tameyer at ihug.co.nz Sun Apr 11 21:25:13 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Apr 11 21:25:34 2004 Subject: [spambayes-dev] RE: Incremental filtering and the spam folder In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E10E63@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C04@its-xchg4.massey.ac.nz> > > Would it not be better to only add the hook if the > > train_manual_spam option is True? Or is there some > > other reason that the spam folder has to be hooked? > > I think you are correct - we could avoid the hook > alltogether. However, I'm still reluctant to change this, as > it does risk breakage. We can fix it post 1.0. Sounds good. 
So I remember, I've opened a tracker: [ 933473 ] Unnecessary spam folder hook I'll run with the patch until it gets checked in as well, for some additional assurance that it'll work. =Tony Meyer From tameyer at ihug.co.nz Sun Apr 11 21:36:02 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Apr 11 21:36:29 2004 Subject: [spambayes-dev] RE: 1.0b1 Release candidates In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E10E34@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B7E@its-xchg4.massey.ac.nz> > Unfortunately, the typelibs for office 2000 are hard-coded in > a couple of spots. I fear that upgrading these to later > typelibs will prevent SpamBayes working at all for the older > users. No time to, or easy way to look at this particular problem. In my local copies, I've changed the hard-coded codes to match the ones I generate. I suspect that you're right, since it didn't appear that my binary built worked for Amir. I have access to OL2K and OL2k2, so I could play around with this at some point, but there doesn't seem to be much of a need for it right now. (i.e for as long as you're willing to be the plug-in builder). > I'm sorry, but I seem to have missed that, and can't find it. > I've a message or 2 from Thomas to catch up on next, but > don't recall it being one of them. This one: Though note also his comment here: =Tony Meyer From mhammond at skippinet.com.au Sun Apr 11 22:06:38 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sun Apr 11 22:06:56 2004 Subject: [spambayes-dev] Re: 1.0b1 Release candidates In-Reply-To: <3c7eiyyo.fsf@python.net> Message-ID: <07e701c42032$cad3fa80$0200a8c0@eden> Tony: > release out? Are you able to use the code that Thomas > posted to patch the ActivePython pythoncom.dll Tony - note that the code posted by Thomas is to patch python23.dll, as packaged by py2exe and shipped by us. It does not touch the ActivePython DLLs at all. 
What we are doing is changing the registry location that *we* read Python options from. This way, we won't be reading the standard "2.3", which is what ActivePython uses. > ... and should that code go into py2exe (maybe with an > additional change > to py2exe so that it can specify the LCID for the resources) ... Yes, I believe it should, and I agree losing that version information is no big deal. IIUC, you were also implying that Python 2.4 should be patched to use language independent resource here? I'll have a bash at making a py2exe patch, while I'm making the other patches up. Mark. From tameyer at ihug.co.nz Sun Apr 11 22:14:30 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Apr 11 22:15:24 2004 Subject: [spambayes-dev] Re: 1.0b1 Release candidates In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E112EF@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C0A@its-xchg4.massey.ac.nz> > Tony - note that the code posted by Thomas is to patch > python23.dll, as packaged by py2exe and shipped by us. It > does not touch the ActivePython DLLs at all. > > What we are doing is changing the registry location that *we* > read Python options from. This way, we won't be reading the > standard "2.3", which is what ActivePython uses. Ah, ok - I understand this now. I'm glad I left this alone . =Tony Meyer From kennypitt at hotmail.com Mon Apr 12 08:57:29 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Apr 12 08:58:46 2004 Subject: [spambayes-dev] Compatibility with Norton Antivirus In-Reply-To: Message-ID: Lou Arnold wrote: > I have installed Norton Anti-Virus (NAV) to secure my incoming email. > As I understand things, it has a POP3 Proxy that sits between the ISP > mail server and my MS Outlook 2000. > > Q: Will the SpamBayes' POP3 proxy replace or interfere with NAV? I run NAV 2003 at home, and at least as of that version NAV no longer operates as a POP3 proxy. 
It hooks directly into the network protocol stack as a "filter" that sees the traffic on the POP3 port before whatever application is accessing the port. I simply configure SpamBayes to talk to my POP3 server and my mail client to talk to SpamBayes, and NAV filters the mail traffic as SpamBayes reads it from the POP3 server. -- Kenny Pitt From kennypitt at hotmail.com Mon Apr 12 09:06:26 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Apr 12 09:07:48 2004 Subject: [spambayes-dev] RE: 1.0b1 Release candidates In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B7E@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> Unfortunately, the typelibs for office 2000 are hard-coded in >> a couple of spots. I fear that upgrading these to later >> typelibs will prevent SpamBayes working at all for the older >> users. No time to, or easy way to look at this particular problem. > > In my local copies, I've changed the hard-coded codes to match the > ones I generate. I suspect that you're right, since it didn't appear > that my binary built worked for Amir. I have access to OL2K and > OL2k2, so I could play around with this at some point, but there > doesn't seem to be much of a need for it right now. (i.e for as long > as you're willing to be the plug-in builder). On my system, I just grabbed copies of MSO9.DLL and MSOUTL9.OLB from an old installation of Outlook 2K and registered them with regsvr32 on my Outlook 2003 system. That installed the two type libraries that SpamBayes is looking for. Now everything builds without errors and creates the right genpy interface GUIDs when I run py2exe. Haven't actually installed and run the binary on an OL2K system to verify operation, though. 
-- 
Kenny Pitt

From rmalayter at bai.org Mon Apr 12 11:45:51 2004
From: rmalayter at bai.org (Ryan Malayter)
Date: Mon Apr 12 11:45:57 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A021C8D95@cliff.bai.org>

[Seth Goodman]
> 1) How RFC2822 compliant is the stored message format in the various
> versions of Outlook subsequent to Outlook2000? Without even looking at
> the code I can tell that Outlook2000 is not compliant due to the total
> absence of a References: header, which causes many people real problems
> who view mailing lists by conversation thread.

I've done a bit of research into this, as I am trying to find a way to
reliably reconstruct the MIME structure in the Outlook plug-in, beyond
simply synthesizing a token for attachments.

When any version of Outlook, even 2003, stores mail in a .PST file, the
messages are converted to Microsoft's "MAPI" format, which destroys the
MIME structure. The MAPI format is mostly proprietary and only partially
documented, and seems to get tweaks from version to version. This
situation is not likely to change, since MS needs to preserve some form
of backwards compatibility. Many people run different versions of Outlook
on different machines, and MS would get a boatload of support calls from
people trying to open newer PST files on older Outlook versions if they
changed the format drastically. A version of this MAPI format is what is
exposed via the Outlook APIs to the SpamBayes Outlook plug-in, and is the
source of the issue with attachments.

Now, when you use Outlook 2003 and mail is *stored* on a Microsoft
Exchange *2003* server (not a PST file), the mail is not converted to
MAPI format automatically. It remains in RFC format in the Exchange
server database and even when it is sent to the Outlook 2003 client.
This is nice, because it drastically reduces the "format conversion" CPU
load on the Exchange server.

However, there still appears to be no way to access this RFC-compliant
message stream programmatically from within the Outlook 2003 client. The
Outlook client performs the RFC-to-MAPI format conversion on the fly.
You can get the RFC format message stream through various means on the
server side, but this is not much help to the SpamBayes plug-in.

One thing I have been able to do is create a Windows file share of the
Exchange Installable File System (ExIFS), which basically gives you
access to a set of read-only files representing each message in RFC
format. Assuming you were to set up this file share on your Exchange
server with appropriate permissions, you could then add code to the
SpamBayes plug-in to look at the RFC-formatted message from this file
share.

This method is certainly a hack, and may not work in the future, since
MS appears to be moving away from the ExIFS. And since most users of the
SB code base do not use Exchange servers, but rather connect to standard
POP3 or IMAP servers, it is probably not worth pursuing a patch to the
general SB code base to make this work.

> 2) If the later versions of Outlook are more (or perhaps even fully?)
> RFC2822 compliant, would it be possible to detect the Outlook version
> and enable generating the additional tokens that are available with the
> web proxy?

Another option I was looking at would be to use a subset of the
SpamBayes POP3/IMAP filter in the Outlook client to retrieve messages in
RFC format. This way, if you left your mail on the server, you could
still use the Outlook plug-in user interface, but it would actually go
and retrieve the mail from the server via IMAP or POP3 rather than using
Outlook's API to get a message stream. If it couldn't find the message
via IMAP or POP3, that means the message is no longer on the mail server
and it would use the version provided by Outlook's API.

This basically would mean there would need to be a level of integration
between the Outlook plug-in and the IMAP/POP3 proxies, and *all* Outlook
plug-in installations of SpamBayes would also be IMAP or POP3 proxy
installations.

It seems this is going to be difficult to get working, though, with the
possibility of little gain if tokenizing file attachments doesn't prove
generally useful. So I'm going to go back to trying to synthesize a MIME
header for attachments when I have the time.

If you have any more thoughts, please let me know.

Thanks,
Ryan

From pje at telecommunity.com Mon Apr 12 18:47:47 2004
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Apr 12 18:48:08 2004
Subject: FW: [spambayes-dev] Results for DNS lookup in tokenizer
In-Reply-To:
Message-ID: <5.1.1.6.0.20040412184300.0200b790@telecommunity.com>

At 09:36 PM 4/11/04 -0400, "Phillip J. Eby" wrote:
>Of course, the spammers' next move would likely be to use redirects from
>non-"bulletproof" hosts, but everything we can do to make it more difficult
>and more costly for them is a good thing.

Oh, by the way, Slashdot ran an article today on a similar scheme:

http://slashdot.org/article.pl?sid=04/04/12/1956252

From mhammond at skippinet.com.au Mon Apr 12 23:20:57 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Apr 12 23:21:17 2004
Subject: [spambayes-dev] various Outlook version and RFC2822 compliance
In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A021C8D95@cliff.bai.org>
Message-ID: <119901c42106$56b465c0$0200a8c0@eden>

> I've done a bit of research into this, as I am trying to find a way to
> reliably reconstruct the MIME structure in the Outlook plug-in, beyond
> simply synthesizing a token for attachments.

As a matter of interest, why do you want to do this? Won't you have to
reconstruct all the attachments in this stream, just to have them pulled
apart (but promptly ignored) by the tokenizer? For binary attachments,
including virus payload, this would seem significant.
Given the various problems and version dependencies we have extracting this stream, it would seem much simpler to use documented stable Outlook interfaces to synthesize the few tokens we are talking about. While I agree it is an interesting problem, I don't see why it would be the best way for the Outlook addin to approach it. Is there something I am missing? Thanks, Mark. From mhammond at skippinet.com.au Mon Apr 12 23:22:23 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Apr 12 23:22:41 2004 Subject: [spambayes-dev] Compatibility with Norton Antivirus In-Reply-To: Message-ID: <119d01c42106$8a58c3d0$0200a8c0@eden> > operates as a POP3 proxy. It hooks directly into the network protocol > stack as a "filter" that sees the traffic on the POP3 port before > whatever application is accessing the port. Hmm - now that sounds like fun :) Mark. From rmalayter at bai.org Tue Apr 13 00:00:16 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Tue Apr 13 00:00:21 2004 Subject: [spambayes-dev] various Outlook version and RFC2822 compliance Message-ID: <792DE28E91F6EA42B4663AE761C41C2A021C8DED@cliff.bai.org> [Ryan Malayter] >As a matter of interest, why do you want to do this? Won't >you have to reconstruct all the attachments in this stream, >just to have them pulled apart (but promptly ignored) by the >tokenizer? Because there may be more tokenizing options that require the full RFC2822-plus-MIME structure of the message. I also figured it would be neat and tidy to put the Outlook plug-in on equal footing with the proxy versions of Outlook. It would certainly make the two versions respond the same when testing the same corpora.
Synthesizing tokens would scratch this particular itch a few people are having with attachment names, but not necessarily any future itches. I figured if I could solve the RFC-format problem easily (which I can't), it would solve my current issue and also be better for the future of the code base. -ryan- From tameyer at ihug.co.nz Tue Apr 13 01:59:34 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Apr 13 01:59:52 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E11030@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> Have you tried using the x-slurp_urls option as a solution for this problem? (I'm not saying it's a better solution, just curious if you have, and if so, what the results were). > In case anyone would like to play with it, I'll append my trivial > patch. It requires pydns from: > > http://sourceforge.net/projects/pydns/ This concerns me a bit. I'd want to see really dramatic results before something in the core distribution required non-standard libraries to be installed. How complex is the code that the patch is using? Running timcv.py was *really* slow, too - I don't know whether this was because a lot of messages timed out, or that the DNS lookup was slow, or what, but it worries me a bit. Doing the DNS enquiry interactively was very quick, and at this time of night our DNS server isn't used much at all, so quite responsive. Here are my results using timcv.py -n5 with two corpora. First cmp.py results, then a table.py with just running with defaults as well. The first one (my wife's mail for the last few months) is a win (-1 fn, -4 unsure). The second one (my work mail for the last few months) is a loss (two unsure move into fn in one run, the rest unchanged). Note that in both of these the standard x-pick_apart_urls option does nothing (good or bad) for me. 

-> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> tested 99 hams & 357 spams against 400 hams & 1428 spams
-> tested 101 hams & 358 spams against 398 hams & 1427 spams
-> tested 100 hams & 359 spams against 399 hams & 1426 spams
-> tested 100 hams & 358 spams against 399 hams & 1427 spams
-> tested 99 hams & 353 spams against 400 hams & 1432 spams
-> tested 99 hams & 357 spams against 400 hams & 1428 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.279  0.279  tied
    0.279  0.279  tied
    0.559  0.559  tied
    2.266  2.266  tied
    2.521  2.241  won    -11.11%

won   1 times
tied  4 times
lost  0 times

total unique fn went from 21 to 20 won -4.76%
mean fn % went from 1.18076754281 to 1.12474513385 won -4.74%

ham mean                     ham sdev
   0.00    0.01  +(was 0)        0.04    0.04   +0.00%
   0.49    0.49   +0.00%         4.91    4.91   +0.00%
   0.02    0.01  -50.00%         0.12    0.11   -8.33%
   0.03    0.02  -33.33%         0.21    0.21   +0.00%
   0.01    0.01   +0.00%         0.08    0.08   +0.00%

ham mean and sdev for all runs
   0.11    0.11   +0.00%         2.21    2.21   +0.00%

spam mean                    spam sdev
  96.02   96.11   +0.09%        13.44   13.60   +1.19%
  97.15   97.31   +0.16%        11.27   11.10   -1.51%
  97.12   97.30   +0.19%        11.86   11.89   +0.25%
  94.93   94.92   -0.01%        17.08   17.53   +2.63%
  94.99   95.08   +0.09%        17.16   17.26   +0.58%

spam mean and sdev for all runs
  96.05   96.15   +0.10%        14.40   14.55   +1.04%

ham/spam mean difference: 95.94 96.04 +0.10

filename:      libbys  libby_picks  libby_pickms
ham:spam:    499:1785     499:1785      499:1785
fp total:           0            0             0
fp %:            0.00         0.00          0.00
fn total:          21           21            20
fn %:            1.18         1.18          1.12
unsure t:         118          119           114
unsure %:        5.17         5.21          4.99
real cost:     $44.60       $44.80        $42.80
best cost:     $11.80       $11.80        $12.00
h mean:          0.11         0.11          0.11
h sdev:          2.21         2.21          2.21
s mean:         96.04        96.05         96.15
s sdev:         14.40        14.40         14.55
mean diff:      95.93        95.94         96.04
k:               5.78         5.78          5.73

-> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.870  6.870  tied
    3.125  3.125  tied
    7.813  9.375  lost   +19.99%
    3.906  3.906  tied
    5.469  5.469  tied

won   0 times
tied  4 times
lost  1 times

total unique fn went from 35 to 37 lost +5.71%
mean fn % went from 5.43654580153 to 5.74904580153 lost +5.75%

ham mean                     ham sdev
   0.18    0.18   +0.00%         1.77    1.77   +0.00%
   0.01    0.01   +0.00%         0.17    0.17   +0.00%
   0.01    0.01   +0.00%         0.12    0.12   +0.00%
   0.03    0.01  -66.67%         0.39    0.13  -66.67%
   0.28    0.29   +3.57%         3.37    3.38   +0.30%

ham mean and sdev for all runs
   0.10    0.10   +0.00%         1.72    1.71   -0.58%

spam mean                    spam sdev
  88.89   88.89   +0.00%        25.38   25.48   +0.39%
  90.07   90.39   +0.36%        23.20   22.75   -1.94%
  87.23   87.13   -0.11%        28.96   29.35   +1.35%
  90.79   90.92   +0.14%        23.89   23.80   -0.38%
  90.31   90.67   +0.40%        25.99   25.52   -1.81%

spam mean and sdev for all runs
  89.46   89.60   +0.16%        25.59   25.52   -0.27%

ham/spam mean difference: 89.36 89.50 +0.14

filename:   exchanges  exchange_picks  exchange_pickms
ham:spam:    1391:643        1391:643         1391:643
fp total:           0               0                0
fp %:            0.00            0.00             0.00
fn total:          35              35               37
fn %:            5.44            5.44             5.75
unsure t:          83              82               80
unsure %:        4.08            4.03             3.93
real cost:     $51.60          $51.40           $53.00
best cost:     $33.80          $33.80           $33.00
h mean:          0.10            0.10             0.10
h sdev:          1.72            1.72             1.71
s mean:         89.34           89.46            89.60
s sdev:         25.65           25.59            25.52
mean diff:      89.24           89.36            89.50
k:               3.26            3.27             3.29

=Tony Meyer

From mhammond at skippinet.com.au Tue Apr 13 02:27:13 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Apr 13 02:27:40 2004
Subject: [spambayes-dev] ANNOUNCE: SpamBayes release 1.0b1
Message-ID: <00a701c42120$5ca69e70$0200a8c0@eden>

The SpamBayes team is pleased to announce the latest release of
SpamBayes - 1.0b1.

Like the last version, this is both a release of the source code and of
an installation program for all Microsoft Windows users. The Windows
installation program will install either the Outlook add-in (for
Microsoft Outlook users), or the SpamBayes server program (for all other
mail client users, including Microsoft Outlook Express). All Windows
users (including existing users of the Outlook add-in) are encouraged to
use the installation program. If you wish to use the source-code
version, you will also need to install Python - see README.txt in the
source tree for more information.

This release fixes a number of bugs in the last release, including a bug
that could cause your PC to operate as an open mail relay in some cases.
We recommend that all existing users upgrade. For a detailed description
of everything (well, everything we remember) that has changed since the
last release, you can view our WHAT_IS_NEW.txt file, either online, or
in the source distribution.

Get it via the 'Download' page at http://www.spambayes.org/download.html

Enjoy the new release and your spam-free mailbox :-)

Thanks to everyone involved in this release, particularly, and as usual,
Tony Meyer for putting most of the actual release together!

Mark. (on behalf of the SpamBayes team)

--- What is SpamBayes? ---

The SpamBayes project is working on developing a Bayesian (of sorts)
anti-spam filter (in Python), initially based on the work of Paul Graham.
The major difference between this and other, similar projects is the emphasis on testing newer approaches to scoring messages. The project includes a number of different applications, all using the same core code, ranging from a plug-in for Microsoft Outlook, to a POP3 proxy, to various command-line tools. From anthony at interlink.com.au Tue Apr 13 04:48:27 2004 From: anthony at interlink.com.au (Anthony Baxter) Date: Tue Apr 13 04:50:01 2004 Subject: [spambayes-dev] Re: [Spambayes] ANNOUNCE: SpamBayes release 1.0b1 In-Reply-To: <00a701c42120$5ca69e70$0200a8c0@eden> References: <00a701c42120$5ca69e70$0200a8c0@eden> Message-ID: <407BA95B.201@interlink.com.au> Mark Hammond wrote: > The SpamBayes team is pleased to announce the latest release of SpamBayes - > 1.0b1. Woohoo! Well done to everyone involved in this release process. When we get to 1.0, it's probably worth mentioning in the announcement that although it's a "1.0" release, it's actually the 9th? 10th? release of the software. and-now-onto-the-almost-mythical-"1.0"-release... Anthony -- Anthony Baxter It's never too late to have a happy childhood. From sjoerd at acm.org Tue Apr 13 06:35:28 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Tue Apr 13 06:35:32 2004 Subject: [spambayes-dev] python setup.py build is failing Message-ID: <407BC270.5040900@acm.org> With a completely up-to-date checkout of spambayes, the command "python setup.py build" fails with the message error: file 'scripts/sb_bnfilter.py' does not exist It seems to me somebody checked in a change to setup.py to also compile and install sb_bnfilter.py but forgot to check in the file itself... 
-- Sjoerd Mullender From skip at pobox.com Tue Apr 13 09:41:54 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Apr 13 09:42:01 2004 Subject: [spambayes-dev] python setup.py build is failing In-Reply-To: <407BC270.5040900@acm.org> References: <407BC270.5040900@acm.org> Message-ID: <16507.60962.924060.82138@montanaro.dyndns.org> Sjoerd> With a completely up-to-date checkout of spambayes, the command Sjoerd> "python setup.py build" fails with the message Sjoerd> error: file 'scripts/sb_bnfilter.py' does not exist I noticed the change to the installation procedure float by on the checkins list but never saw anything which indicated that sb_bnfilter.py and sb_bnserver.py were moved from contrib to scripts. Sjoerd> It seems to me somebody checked in a change to setup.py to also Sjoerd> compile and install sb_bnfilter.py but forgot to check in the Sjoerd> file itself... It was there, just not where setup.py was looking. I cvs remove'd them from contrib and cvs add'ed them to scripts, then modified setup.py to also install sb_bnserver.py. Skip From kennypitt at hotmail.com Tue Apr 13 10:28:39 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Apr 13 10:29:59 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: > Have you tried using the x-slurp_urls option as a solution for this > problem? (I'm not saying it's a better solution, just curious if you > have, and if so, what the results were). > >> In case anyone would like to play with it, I'll append my trivial >> patch. It requires pydns from: >> >> http://sourceforge.net/projects/pydns/ > > This concerns me a bit. I'd want to see really dramatic results > before something in the core distribution required non-standard > libraries to be installed. Any reason why socket.gethostbyname(hostname) wouldn't work? 
I wrote a patch a while back using that function to do DNS queries against a DNSBL blacklist server and create additional tokens based on the results. There are two problems with doing DNS queries during tokenization. The first is performance because you're having to wait for the result of network operations instead of just manipulating local data. My DNSBL queries worked well, but didn't improve the overall accuracy enough to justify the performance hit. The second is training. DNS lookups are by nature dynamic, so the results generated are not necessarily the same every time you do it. Training (in particular, correcting the training of a message that was previously trained incorrectly) relies on the tokens that get generated for a particular message being identical every time the message is tokenized. If some of the tokens rely on additional data from a DNS query, those tokens may be different when the user gets around to retraining the message. -- Kenny Pitt From matt at mondoinfo.com Tue Apr 13 13:52:41 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Tue Apr 13 13:53:01 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: References: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> Message-ID: <1081876416.34.1193@mint-julep.mondoinfo.com> >>> http://sourceforge.net/projects/pydns/ [Tony Meyer] >> This concerns me a bit. I'd want to see really dramatic results >> before something in the core distribution required non-standard >> libraries to be installed. I don't necessarily disagree. Still, even if it went into the core distribution, it would surely be sensible to have it turned off by default and distutils makes installing PyDNS pretty simple. I've thought for a while that it would be good to get some DNS module into Python's standard library but I've never thought that I had a strong enough argument to bring it up publicly. Using it in SpamBayes might be a start. 
[Kenny Pitt] > Any reason why socket.gethostbyname(hostname) wouldn't work? I > wrote a patch a while back using that function to do DNS queries > against a DNSBL blacklist server and create additional tokens based > on the results. As far as I can tell, socket.gethostbyname() doesn't respect the timeout set by socket.setdefaulttimeout(). That's apt to make the performance hit rather worse. > There are two problems with doing DNS queries during tokenization. > The first is performance because you're having to wait for the > result of network operations instead of just manipulating local > data. My DNSBL queries worked well, but didn't improve the overall > accuracy enough to justify the performance hit. Personally, as long as I set the timeout pretty low, I barely notice the difference. When my mail client fetches a couple of emails, they're scored quickly enough that I don't notice an additional delay. If it fetches 100 or so, that's going to take a while in either case. No doubt, other people would have different experiences. > The second is training. DNS lookups are by nature dynamic, so the > results generated are not necessarily the same every time you do > it. Training (in particular, correcting the training of a message > that was previously trained incorrectly) relies on the tokens that > get generated for a particular message being identical every time > the message is tokenized. If some of the tokens rely on additional > data from a DNS query, those tokens may be different when the user > gets around to retraining the message. That's certainly a disadvantage. I think that legitimate servers don't move around all that much, so it may turn out to be a relatively small one but it would be nice to know for sure. 
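
The two concerns above (a bounded wait for each lookup, and tokens that stay the same when a message is retrained later) can be sketched as follows. This is only an illustration under stated assumptions, not the patch discussed in this thread: it uses the standard library rather than PyDNS, it is written for a current Python rather than 2.3, and the function names, the "url-ip:" token spelling and the "lookup-failed" token are all hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool so a slow lookup only blocks a worker thread.
_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(hostname, timeout=2.0, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None on failure or timeout.

    The lookup runs in a worker thread and we stop waiting after `timeout`
    seconds. The worker itself is not cancelled; it finishes in the
    background.
    """
    future = _pool.submit(resolver, hostname)
    try:
        return future.result(timeout=timeout)
    except (FutureTimeout, OSError):
        # socket.gaierror is a subclass of OSError.
        return None


def url_ip_tokens(hostname, cache, timeout=2.0, resolver=socket.gethostbyname):
    """Build hypothetical 'url-ip:' tokens for a hostname found in a URL.

    `cache` is any dict-like store. Remembering the first answer (including
    a failure) means tokenizing the same message again yields the same
    tokens, even if the DNS answer has changed in the meantime.
    """
    if hostname not in cache:
        cache[hostname] = resolve_with_timeout(hostname, timeout, resolver)
    address = cache[hostname]
    if address is None:
        return ["url-ip:lookup-failed"]
    octets = address.split(".")
    # One token per prefix length: /8, /16, /24 and /32.
    return ["url-ip:%s/%d" % (".".join(octets[:n]), 8 * n) for n in range(1, 5)]
```

In practice the cache would need to be persistent (and probably keyed per message rather than per hostname) for retraining to see the original tokens; a plain dict is used here only to keep the sketch short.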
Regards, Matt From tameyer at ihug.co.nz Tue Apr 13 18:58:34 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Apr 13 18:58:57 2004 Subject: [spambayes-dev] python setup.py build is failing In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E1170A@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B87@its-xchg4.massey.ac.nz> > I noticed the change to the installation procedure float by > on the checkins list but never saw anything which indicated > that sb_bnfilter.py and sb_bnserver.py were moved from > contrib to scripts. [...] > It was there, just not where setup.py was looking. I cvs > remove'd them from contrib and cvs add'ed them to scripts, > then modified setup.py to also install sb_bnserver.py. Sorry, this is my fault. I noticed that there was a new script, but didn't notice that it wasn't in the scripts directory (or that there were actually two). Annoyingly, I forgot that I had made this change when I built the 1.0b1 dists, and so figured that the testing I did with 1.0b1rc1 would still be ok (lesson to be learned there). I'll put 1.0b1.1's on sf now. =Tony Meyer From tameyer at ihug.co.nz Tue Apr 13 19:16:37 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Apr 13 19:16:49 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B88@its-xchg4.massey.ac.nz> > Here are my results using timcv.py -n5 with two corpora. > First cmp.py results, then a table.py with just running with > defaults as well. And here are two more (they were running too slow to get out yesterday, but completed overnight). The first one is my non-work mail for the last few months; the second one is the five sets that make up the SpamAssassin Public Archive (the bzip files starting with 2003...). Once again, the standard x-pick_apart_urls option does nothing (good or bad) for me. 
The SAPC one is just a loss, and the other is a more substantial loss (although each win with one run). -> tested 4692 hams & 386 spams against 18762 hams & 1537 spams -> tested 4695 hams & 381 spams against 18759 hams & 1542 spams -> tested 4693 hams & 383 spams against 18761 hams & 1540 spams -> tested 4690 hams & 384 spams against 18764 hams & 1539 spams -> tested 4684 hams & 389 spams against 18770 hams & 1534 spams -> tested 4692 hams & 386 spams against 18762 hams & 1537 spams -> tested 4695 hams & 381 spams against 18759 hams & 1542 spams -> tested 4693 hams & 383 spams against 18761 hams & 1540 spams -> tested 4690 hams & 384 spams against 18764 hams & 1539 spams -> tested 4684 hams & 389 spams against 18770 hams & 1534 spams false positive percentages 0.000 0.000 tied 0.021 0.021 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.00425985090522 to 0.00425985090522 tied false negative percentages 1.036 1.036 tied 1.050 1.575 lost +50.00% 0.783 0.522 won -33.33% 1.823 2.083 lost +14.26% 1.285 1.799 lost +40.00% won 1 times tied 1 times lost 3 times total unique fn went from 23 to 27 lost +17.39% mean fn % went from 1.19553834481 to 1.40321699713 lost +17.37% ham mean ham sdev 0.09 0.10 +11.11% 1.73 1.72 -0.58% 0.11 0.11 +0.00% 2.24 2.09 -6.70% 0.12 0.12 +0.00% 2.05 2.05 +0.00% 0.09 0.08 -11.11% 2.01 1.78 -11.44% 0.04 0.05 +25.00% 0.88 1.19 +35.23% ham mean and sdev for all runs 0.09 0.09 +0.00% 1.85 1.80 -2.70% spam mean spam sdev 95.65 95.35 -0.31% 15.15 16.13 +6.47% 95.77 95.20 -0.60% 15.18 16.83 +10.87% 97.06 96.05 -1.04% 11.42 13.61 +19.18% 95.32 94.61 -0.74% 16.75 18.41 +9.91% 95.57 95.40 -0.18% 15.57 16.05 +3.08% spam mean and sdev for all runs 95.87 95.32 -0.57% 14.94 16.29 +9.04% ham/spam mean difference: 95.78 95.23 -0.55 -> tested 830 hams & 380 spams against 3320 hams & 1517 spams -> tested 830 hams & 380 spams against 3320 hams & 1517 spams -> 
tested 830 hams & 379 spams against 3320 hams & 1518 spams -> tested 830 hams & 379 spams against 3320 hams & 1518 spams -> tested 830 hams & 379 spams against 3320 hams & 1518 spams -> tested 830 hams & 380 spams against 3320 hams & 1517 spams -> tested 830 hams & 380 spams against 3320 hams & 1517 spams -> tested 830 hams & 379 spams against 3320 hams & 1518 spams -> tested 830 hams & 379 spams against 3320 hams & 1518 spams -> tested 830 hams & 379 spams against 3320 hams & 1518 spams false positive percentages 0.241 0.241 tied 0.482 0.482 tied 0.000 0.000 tied 0.120 0.120 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fp went from 7 to 7 tied mean fp % went from 0.168674698795 to 0.168674698795 tied false negative percentages 0.789 1.053 lost +33.46% 0.526 0.526 tied 0.528 0.264 won -50.00% 0.264 0.264 tied 1.055 1.319 lost +25.02% won 1 times tied 2 times lost 2 times total unique fn went from 12 to 13 lost +8.33% mean fn % went from 0.632551034579 to 0.685182613526 lost +8.32% ham mean ham sdev 0.67 0.61 -8.96% 6.87 6.56 -4.51% 0.95 0.85 -10.53% 8.69 8.08 -7.02% 0.87 0.81 -6.90% 7.10 6.79 -4.37% 0.60 0.57 -5.00% 6.64 6.49 -2.26% 0.48 0.42 -12.50% 4.87 4.62 -5.13% ham mean and sdev for all runs 0.71 0.65 -8.45% 6.94 6.60 -4.90% spam mean spam sdev 97.13 96.89 -0.25% 12.08 13.00 +7.62% 98.59 98.50 -0.09% 8.09 8.49 +4.94% 98.57 98.44 -0.13% 8.03 8.15 +1.49% 98.59 98.54 -0.05% 7.51 7.68 +2.26% 97.91 97.72 -0.19% 11.50 12.22 +6.26% spam mean and sdev for all runs 98.16 98.02 -0.14% 9.66 10.18 +5.38% ham/spam mean difference: 97.45 97.37 -0.08 filename: ihugs ihug_picks ihug_pickms ham:spam: 23454:1923 23454:1923 23454:1923 fp total: 1 1 1 fp %: 0.00 0.00 0.00 fn total: 23 23 27 fn %: 1.20 1.20 1.40 unsure t: 169 171 176 unsure %: 0.67 0.67 0.69 real cost: $66.80 $67.20 $72.20 best cost: $57.00 $56.60 $62.40 h mean: 0.09 0.09 0.09 h sdev: 1.89 1.85 1.80 s mean: 95.86 95.87 95.32 s sdev: 14.99 14.94 16.29 mean diff: 95.77 95.78 95.23 k: 
5.67 5.70 5.26 filename: sapcs sapc_picks sapc_pickms ham:spam: 4150:1897 4150:1897 4150:1897 fp total: 7 7 7 fp %: 0.17 0.17 0.17 fn total: 12 12 13 fn %: 0.63 0.63 0.69 unsure t: 99 99 100 unsure %: 1.64 1.64 1.65 real cost: $101.80 $101.80 $103.00 best cost: $70.60 $70.20 $70.80 h mean: 0.71 0.71 0.65 h sdev: 6.92 6.94 6.60 s mean: 98.14 98.16 98.02 s sdev: 9.72 9.66 10.18 mean diff: 97.43 97.45 97.37 k: 5.86 5.87 5.80 From matt at mondoinfo.com Tue Apr 13 22:12:38 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Tue Apr 13 22:13:06 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B88@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2B88@its-xchg4.massey.ac.nz> Message-ID: <1081905546.39.1651@mint-julep.mondoinfo.com> Dear Tony, > And here are two more (they were running too slow to get out > yesterday, but completed overnight). > Once again, the standard x-pick_apart_urls option does nothing > (good or bad) for me. The SAPC one is just a loss, and the other > is a more substantial loss (although each win with one run). Hm. Well that's probably enough evidence. A tiny win for me and a small loss for you. What's odd is that doesn't seem to match what I'm seeing in my inbox. I was seeing nonsense spams there and now I'm not. Perhaps the range of spams that DNS lookup is useful for is just too narrow. Regards, Matt From tameyer at ihug.co.nz Wed Apr 14 03:03:49 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Apr 14 03:04:07 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E118EA@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz> > Hm. Well that's probably enough evidence. A tiny win for me > and a small loss for you. 
I don't know if it's enough, but it's likely that it's all you'll be able to solicit here <0.1 wink>. > What's odd is that doesn't seem to match what I'm seeing in > my inbox. I was seeing nonsense spams there and now I'm not. If you go through your spam folder and look at the clues for messages that look like the ones that used to be there, do you see these tokens? It could be that the spammers sending these types of messages took a holiday this week <0.5 wink>. In any case, if you're happy running from source, then there's nothing stopping you keeping the patch going for your own system - it seems unlikely that it'll conflict with any tokenizer changes in the near future. > Perhaps the range of spams that DNS lookup is useful for is > just too narrow. I suspect that it's that the spams that this helps to nail are already nailed with other techniques. I was reading some past messages today and that reminded me to suggest that you try (if you haven't already) the x-use_bigrams option. At least some people have found that it's better at nailing short spams (although maybe not quite as good at some of the more 'talky' spams). Testing and developer experience (I'm not sure if any users have turned the option on) does indicate that it's a win overall. =Tony Meyer From tdickenson at geminidataloggers.com Wed Apr 14 05:55:06 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Wed Apr 14 05:55:10 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677C1C@its-xchg4.massey.ac.nz> Message-ID: <200404141055.06066.tdickenson@geminidataloggers.com> On Tuesday 13 April 2004 06:59, Tony Meyer wrote: > Have you tried using the x-slurp_urls option as a solution for this > problem? (I'm not saying it's a better solution, just curious if you have, > and if so, what the results were). 
Like x-slurp_urls, enabling this option could allow host names to be used as a bug by spammers to determine whether an email address is live. That doesn't seem likely, but it's not impossible. (And it would need custom DNS hosting too.... so if we ever see this happening we would be able to expand this patch to use DNS NS records as spam clues!) > Running > timcv.py was *really* slow, too - I don't know whether this was because a > lot of messages timed out, or that the DNS lookup was slow, or what, A barely related question..... which of our filtering methods allow for parallel filtering? sb_filter out of procmail does, provided your MTA runs procmail concurrently :-) but sb_bnfilter will serialise them again :-( -- Toby Dickenson From tameyer at ihug.co.nz Wed Apr 14 19:17:51 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Apr 14 19:20:20 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E119E7@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B97@its-xchg4.massey.ac.nz> > Like x-slurp_urls, enabling this option could allow host > names to be used as a bug by spammers to determine whether > an email address is live. That doesn't seem likely, but it's > not impossible. This was discussed (a lot) back when the x-slurp_urls option was first offered. It's probably the main reason why, even if it does live past being an experimental option, it'll never default to True. It's also the reason for the x-only_slurp_base option - I can't see any way (other than registering a domain per message) that it could then be used as an 'address is live' indicator. OTOH, enabling x-only_slurp_base does a lot of harm to the results in my testing. If the x-slurp_urls option is ever shown to be really effective, then it seems likely that a middle path between the two, where it's very difficult to put any tracking information in, could be created easily enough. > A barely related question.....
which of our filtering methods > allow for parallel filtering? sb_filter out of procmail does, > provided your MTA runs procmail concurrently :-) but sb_bnfilter > will serialise them again :-( I'm pretty sure that we don't support more than one process accessing the database at one time at all. As for one process filtering multiple messages at a time, I believe sb_server can do this (i.e. if two connections are made, to different local proxy ports, at the same time). sb_imapfilter and the Outlook plug-in don't. I do have a version of the testing setup that runs on a cluster, but I presume that's not the sort of parallel you were meaning? =Tony Meyer From spambayes at kungfoocoder.org Wed Apr 14 19:40:33 2004 From: spambayes at kungfoocoder.org (Paul Wagland) Date: Wed Apr 14 19:40:45 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B97@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B97@its-xchg4.massey.ac.nz> Message-ID: <1081986032.11188.63.camel@morsel.kungfoocoder.org> On Thu, 2004-04-15 at 01:17, Tony Meyer wrote: > > Like x-slurp_urls, enabling this option could allow host > > names to be used as a bug by spammers to determine whether > > an email address is live. That doesn't seem likely, but it's > > not impossible. > > This was discussed (a lot) back when the x-slurp_urls option was first > offered. It's probably the main reason why even if it does live past being > an experimental option, it'll never default to True. It's also the reason > for the x-only_slurp_base option - I can't see any way (other than > registering a domain per message) that it could then be used as a 'address > is live' indicator. Just as a side issue... they only need a subdomain per message, not a full domain. I.e. aaa.spamisevil.com is just as unique as aaaspamisevil.com. So, it would be fairly easy to set up to harvest "good" addresses.
And, as a bonus, if you don't care about the image being shown, just about the e-mail address, you can return a false random response for the DNS lookup. Indeed, one early web site that I saw actually did cookie-less session tracking using URL rewriting, but instead of playing with the URL, they played with the hostname in a manner similar to aaacookieid.www.host.com. Food for thought, Cheers, Paul From tameyer at ihug.co.nz Wed Apr 14 19:46:23 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Apr 14 19:46:36 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E92428@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677C46@its-xchg4.massey.ac.nz> > Just as a side issue... they only need a subdomain for > message, not a full domain. I.e. aaa.spamisevil.com is just > as unique as aaaspamisevil.com I was really talking about the x-slurp_urls option, rather than the DNS lookup. With that option's x-only_slurp_base the URL that is retrieved is the simplest form of the URL, i.e. "aaaspamisevil.com" or "massey.ac.nz". Doing a simple HTTP request for a webpage like that doesn't (AFAICT) include any information at all about who is doing the request. This means that you *do* need a domain per message. It also means that if I have a spammy page at "spam.massey.ac.nz", but "massey.ac.nz" is ham, the clues generated will make things worse, not better. Of course, if the root domain is legitimately hammy and they have spammy subdomains/pages, there's a reasonable chance that you can get the spammy people kicked off.
=Tony Meyer From matt at mondoinfo.com Wed Apr 14 21:40:40 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Wed Apr 14 21:41:34 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1305E118EA@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz> Message-ID: <1081990843.72.559@mint-julep.mondoinfo.com> [me] > Hm. Well that's probably enough evidence. A tiny win for me > and a small loss for you. [Tony Meyer] > I don't know if it's enough, but it's likely that it's all you'll > be able to solicit here <0.1 wink>. <0.9 chuckle> > If you go through your spam folder and look at the clues for > messages that look like the ones that used to be there, do you see > these tokens? I do. For example, I have a nonsense spam ("ostrich rimy cowlick derange...") that has the subject "Our little secret". And its clues include: 0.908 url-ip:221.5.250.122/32 0.908 url-ip:221.5.250/24 0.908 url-ip:221.5/16 0.965 url-ip:221/8 > It could be that the spammers sending these types of messages took > a holiday this week <0.5 wink>. It may also be that sending nonsense spams is a new tactic among spammers (born of the success of SpamBayes of course) and testing against spam even a month old won't show much advantage. I was certainly motivated to try the url-ip thing because of the unsures I had seen in the previous week or so. > In any case, if you're happy running from source, then there's > nothing stopping you keeping the patch going for your own system - > it seems unlikely that it'll conflict with any tokenizer changes in > the near future. Indeed, I plan to. It doesn't seem to do me any harm. I'm mostly miffed that the value of my Fabulously Clever Idea isn't borne out by actual testing. I expect that Tim Peters in particular has enormous sympathy . 
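The four url-ip clues listed earlier in this message suggest that each resolved address is turned into one token per prefix length (/32, /24, /16, /8). A minimal sketch of that expansion follows; the token spelling is copied from the clue list, but the function name url_ip_tokens and the way the actual patch builds its tokens are assumptions.

```python
def url_ip_tokens(ip):
    """Expand a dotted-quad IPv4 address into url-ip tokens.

    Produces one token per prefix length, most specific first, in the
    spelling seen in the clue list (e.g. "url-ip:221.5/16").
    """
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % (ip,))
    tokens = []
    for count in (4, 3, 2, 1):
        prefix = ".".join(octets[:count])
        tokens.append("url-ip:%s/%d" % (prefix, 8 * count))
    return tokens


print(url_ip_tokens("221.5.250.122"))
```

Emitting the broader prefixes as well lets the classifier learn that a whole netblock is spammy even when the individual host has never been seen before, which would fit the 0.965 score on url-ip:221/8 above.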
> I suspect that it's that the spams that this helps to nail are > already nailed with other techniques. That seems like the most likely explanation. > I was reading some past messages today and that reminded me to > suggest that you try (if you haven't already) the x-use_bigrams > option. At least some people have found that it's better at > nailing short spams (although maybe not quite as good at some of > the more 'talky' spams). Testing and developer experience (I'm not > sure if any users have turned the option on) does indicate that > it's a win overall. Since I now have a nifty set of ten buckets, I'm glad to try out other folks' Fabulously Clever Ideas. Here's the result: normals.txt -> bigramss.txt -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.1 to 0.1 tied false negative percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fn went from 1 to 1 tied mean fn % went from 0.1 to 0.1 tied ham mean ham sdev 0.27 0.28 +3.70% 3.13 2.97 -5.11% 0.36 0.58 +61.11% 3.86 4.91 +27.20% 0.68 0.92 +35.29% 7.28 8.16 +12.09% 0.14 0.24 +71.43% 1.03 1.83 +77.67% 0.31 0.30 -3.23% 2.53 2.78 +9.88% ham mean and sdev for all runs 0.35 0.46 
+31.43% 4.13 4.71 +14.04% spam mean spam sdev 99.89 99.77 -0.12% 1.02 1.61 +57.84% 99.74 99.89 +0.15% 2.99 1.29 -56.86% 98.92 99.24 +0.32% 5.15 4.27 -17.09% 98.37 98.38 +0.01% 9.43 8.39 -11.03% 98.86 98.82 -0.04% 6.36 6.71 +5.50% spam mean and sdev for all runs 99.16 99.22 +0.06% 5.79 5.28 -8.81% ham/spam mean difference: 98.81 98.76 -0.05 Alas, it seems that there's not much advantage there either. The only classification difference seems to be that the number of unsures went up by two. Regards, Matt From mcclurgm at bellsouth.net Thu Apr 15 00:21:59 2004 From: mcclurgm at bellsouth.net (Mark McClurg) Date: Thu Apr 15 00:22:04 2004 Subject: [spambayes-dev] _pop3proxyspam.mbox on Desktop? Message-ID: <407E0DE7.7060200@bellsouth.net> I have just installed spambayes for use on XP with Outlook Express. I've run a few emails through for training, and all appears operational - I'm excited to have this program to decrease the SPAM i've been dealing with. I've one question though. A file by the name of _pop3proxyspam.mbox has now shown up on my Desktop. I don't see any reference to this file being placed in any directory, and I see no directory in the .ini file that I would modify to have this file written elsewhere. Can someone explain what I can/should do - it's confusing with it visible on the Desktop. Thanks! Mark From tameyer at ihug.co.nz Thu Apr 15 02:21:45 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Apr 15 02:21:52 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz> > Since I now have a nifty set of ten buckets, I'm glad to try > out other folks' Fabulously Clever Ideas. Always appreciated! 
If you contribute nothing else to SpamBayes (and I'm sure you will :) simply testing out other people's ideas and letting everyone know the results helps a lot - especially since not many people manage to get time to do this these days. If you want to do more (it gets addictive, trust me ;) there are all the current x- options... > Here's the result: [...] > Alas, it seems that there's not much advantage there either. > The only classification difference seems to be that the > number of unsures went up by two. I should have looked at your original cmp.py posting more closely (and have now). I think that you've hit the "Peters barrier", i.e. your results with the defaults are so good that it's hard to measure whether any changes are doing you any good or not. Your defaults run only has one fp and one fn - to improve on this, the new Fabulously Clever Idea would need to directly target those two messages (without losing the rest). Unless the improvement is all in the unsures - since cmp.py output doesn't mention them, I can't tell how many there are in the defaults; maybe this is where the room to improve is. (If you still have the rates.py output around, could you post a table.py for the defaults, dns and bigrams outputs?) If you run "fpfn.py ratespyoutputs.txt" (with the appropriate rates.py output file) it'll spit out a list of the fp's and fn's (all two of them ;) for that test. It'd be worth taking a look at these two messages and seeing what they are. It might be that they are basically impossible to get right - for example, a message from someone you've never had mail from before quoting a spam with a single line addition - that's very difficult to classify as ham without getting a lot of fn's, too. 
=Tony Meyer From skip at pobox.com Thu Apr 15 10:51:35 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Apr 15 10:52:05 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1081990843.72.559@mint-julep.mondoinfo.com> References: <1ED4ECF91CDED24C8D012BCF2B034F1305E118EA@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2B93@its-xchg4.massey.ac.nz> <1081990843.72.559@mint-julep.mondoinfo.com> Message-ID: <16510.41335.349259.911009@montanaro.dyndns.org> Matt> It may also be that sending nonsense spams is a new tactic among Matt> spammers (born of the success of SpamBayes of course) and testing Matt> against spam even a month old won't show much advantage. I was Matt> certainly motivated to try the url-ip thing because of the unsures Matt> I had seen in the previous week or so. My guess is that for the most part spammers need to move their websites only somewhat less often than they need to move mail hosts. If they are connected to the web via a more-or-less respectable ISP they probably get shut out pretty quickly. Accordingly, month-old IP addresses may indeed not be worth much. Motivated mostly by my desire to keep my database size small, I routinely (every few weeks) sort my ham and spam databases by date and whack off the oldest 5% to 20% of the messages they contain. This may have the side effect of improving the IP address sensitivity. Skip From skip at pobox.com Thu Apr 15 12:01:56 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Apr 15 12:02:09 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz> <16510.45556.523672.814107@montanaro.dyndns.org> >> Since I now have a nifty set of ten buckets, I'm glad to try out >> other folks' Fabulously Clever Ideas.
Tony> Always appreciated! If you contribute nothing else to SpamBayes Tony> (and I'm sure you will :) simply testing out other people's ideas Tony> and letting everyone know the results helps a lot - especially Tony> since not many people manage to get time to do this these days. One thing I think we need to be careful of is using test data sets whose messages are too old. It's apparent the spammers are a moving target, so what worked one or six months ago (or perhaps even a week ago) may not work as well today. Skip From matt at mondoinfo.com Thu Apr 15 14:48:12 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Thu Apr 15 15:00:06 2004 Subject: [spambayes-dev] Results for DNS lookup in tokenizer In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz> Message-ID: <1082053205.04.641@mint-julep.mondoinfo.com> [Tony Meyer] > Your defaults run only has one fp and one fn - to improve on this, > the new Fabulously Clever Idea would need to directly target those > two messages (without losing the rest). Unless the improvement is > all in the unsures - since cmp.py output doesn't mention them, I > can't tell how many there are in the defaults; maybe this is where > the room to improve is. There is some room to improve the unsures. With the defaults, I get 27 unsures out of 1000 messages. > (If you still have the rates.py output around, could you post a > table.py for the defaults, dns and bigrams outputs?) 
Here you go:

filename:      normal   bigrams       dns
ham:spam:   1000:1000 1000:1000 1000:1000
fp total:           1         1         1
fp %:            0.10      0.10      0.10
fn total:           1         1         1
fn %:            0.10      0.10      0.10
unsure t:          27        29        26
unsure %:        1.35      1.45      1.30
real cost:     $16.40    $16.80    $16.20
best cost:     $10.20    $11.60     $9.60
h mean:          0.35      0.46      0.32
h sdev:          4.13      4.71      3.97
s mean:         99.16     99.22     99.16
s sdev:          5.79      5.28      5.79
mean diff:      98.81     98.76     98.84
k:               9.96      9.89     10.13

> If you run "fpfn.py ratespyoutputs.txt" (with the appropriate
> rates.py output file) it'll spit out a list of the fp's and fn's
> (all two of them ;) for that test. It'd be worth taking a look at
> these two messages and seeing what they are. It might be that they
> are basically impossible to get right - for example, a message from
> someone you've never had mail from before quoting a spam with a
> single line addition - that's very difficult to classify as ham
> without getting a lot of fn's, too.

The false positive is one I ran into in real life. It's a confirmation of an order for a pair of headphones. There are lots of spammy words in it and I don't think I have much other ham from that company or on that subject.

The false negative is harder to explain. The subject is "Help your employees avoid heat-related illnesses". It's not the most traditional sort of spam since it doesn't ask me to buy anything now. Scoring it against my normal database, it gets 0.789. Judging from the evidence reported, it seems that's because I live in Minneapolis and talk about the weather a lot <22 winks celsius>.
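For readers checking the table above by hand: the "real cost" and "k" rows appear to follow from the other rows. The sketch below reproduces them under two assumptions inferred from the numbers in this thread and not read from the source: a false positive costs $10, a false negative $1 and an unsure $0.20, and k is the mean difference divided by the sum of the ham and spam standard deviations. The function names are made up for illustration.

```python
# Cost weights inferred from the posted tables (an assumption, not the source).
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def real_cost(fp, fn, unsure):
    """Dollar cost of a run given its fp, fn and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Distance between the ham and spam means, in units of summed sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# Figures copied from the normal / bigrams / dns columns of the table.
runs = {
    "normal": dict(fp=1, fn=1, unsure=27, h_mean=0.35, h_sdev=4.13, s_mean=99.16, s_sdev=5.79),
    "bigrams": dict(fp=1, fn=1, unsure=29, h_mean=0.46, h_sdev=4.71, s_mean=99.22, s_sdev=5.28),
    "dns": dict(fp=1, fn=1, unsure=26, h_mean=0.32, h_sdev=3.97, s_mean=99.16, s_sdev=5.79),
}

for name, r in runs.items():
    cost = real_cost(r["fp"], r["fn"], r["unsure"])
    k = separation_k(r["h_mean"], r["h_sdev"], r["s_mean"], r["s_sdev"])
    print("%-8s real cost: $%.2f  k: %.2f" % (name, cost, k))
```

With one fp and one fn in every column, the whole difference in real cost between the three runs comes from the unsures, at 20 cents apiece.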
Regards, Matt From matt at mondoinfo.com Thu Apr 15 21:37:26 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Thu Apr 15 21:48:35 2004 Subject: [spambayes-dev] New results for DNS lookup in tokenizer In-Reply-To: <1082053205.04.641@mint-julep.mondoinfo.com> References: <1ED4ECF91CDED24C8D012BCF2B034F1305E92470@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2B9F@its-xchg4.massey.ac.nz> <1082053205.04.641@mint-julep.mondoinfo.com> Message-ID: <1082078348.2.1077@mint-julep.mondoinfo.com> It turns out that I was right when I speculated that using DNS lookups would work better on more-recent spam. I re-did my spam sets from the thousand most recent spams in my spam archive and got rather better results: new-pick-aparts.txt -> new-dnss.txt -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams -> tested 200 hams & 200 spams against 800 hams & 800 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.1 to 0.1 tied false negative percentages 0.500 0.000 won -100.00% 4.500 3.500 won -22.22% 1.000 0.500 won -50.00% 0.000 0.000 tied 3.000 2.500 won -16.67% won 4 times tied 1 times lost 0 times total unique fn went from 18 to 13 won -27.78% mean fn % went from 1.8 to 1.3 won -27.78% ham mean ham sdev 0.34 0.33 -2.94% 3.28 3.22 -1.83% 0.14 0.14 +0.00% 1.39 1.36 -2.16% 0.50 0.50 +0.00% 6.84 6.84 +0.00% 0.48 0.31 
-35.42% 3.75 2.10 -44.00% 0.35 0.38 +8.57% 3.78 4.15 +9.79% ham mean and sdev for all runs 0.36 0.33 -8.33% 4.19 4.02 -4.06% spam mean spam sdev 98.00 98.49 +0.50% 10.43 8.53 -18.22% 94.60 95.38 +0.82% 19.89 18.38 -7.59% 97.52 97.96 +0.45% 11.40 10.63 -6.75% 98.77 98.87 +0.10% 6.47 6.81 +5.26% 94.78 95.38 +0.63% 18.47 17.51 -5.20% spam mean and sdev for all runs 96.73 97.22 +0.51% 14.37 13.33 -7.24% ham/spam mean difference: 96.37 96.89 +0.52 In addition, unsures decreased some: filename: new-pick-apart new-dns ham:spam: 1000:1000 1000:1000 fp total: 1 1 fp %: 0.10 0.10 fn total: 18 13 fn %: 1.80 1.30 unsure t: 46 40 unsure %: 2.30 2.00 real cost: $37.20 $31.00 best cost: $21.60 $19.80 h mean: 0.36 0.33 h sdev: 4.19 4.02 s mean: 96.73 97.22 s sdev: 14.37 13.33 mean diff: 96.37 96.89 k: 5.19 5.58 That's not an enormous win but it suggests that I probably am seeing the improvement in my inbox that I think I'm seeing. And the false-negatives that are eliminated are nonsense spams or spams with lots of bland, unrelated text in them. It's very arguable that a technique that only works well on recent spam shouldn't be included in SpamBayes until it has proven its value over some time. Regards, Matt From pekka.takala at pp.inet.fi Fri Apr 16 09:00:46 2004 From: pekka.takala at pp.inet.fi (Pekka Takala) Date: Fri Apr 16 08:59:49 2004 Subject: [spambayes-dev] Getting Mozilla (or Netscape) to work with pop3proxy (LINUX) Message-ID: <407FD8FE.7030303@pp.inet.fi> Here is a way to get mozilla to work with pop3proxy and multiple users on same machine. The pop3proxy software is started and stopped when needed so the amount of users on same machine can be theoretically countless. The configuration needs a little patience, but it is easy to find and also can be used with other mail clients, not just mozilla or netscape. This is tested on mozilla and netscape. 1. Install the pop3proxy software normally as root to /usr/bin and configure it normally. 
Test that your setup works ABSOLUTELY NORMALLY when you start pop3proxy by hand and stop it after reading your mail.

2. Locate the startup script of Mozilla. On Debian Linux systems it is /usr/bin/mozilla.

3. With your favorite editor, search for the line containing MOZ_PROGRAM. Comment it out for reference, then make a new copy of the line. On my system it looks like this after modification. The original line points to mozilla-bin, the new line to popmozilla.sh. The Netscape startup script works the same way, except that mozilla-bin is netscape-bin. DO NOT TOUCH THE REST OF THE FILE. Remember that you need the path and name of the original binary when creating popmozilla.sh.

-----
MOZ_DIST_BIN="/usr/lib/mozilla"
#MOZ_PROGRAM="/usr/lib/mozilla/mozilla-bin"
MOZ_PROGRAM="/usr/lib/mozilla/popmozilla.sh"
MOZ_CLIENT_PROGRAM="/usr/lib/mozilla/mozilla-xremote-client"
-----

4. Save the file after the modifications and go to the directory where mozilla-bin is.

5. Create a new file popmozilla.sh and put this inside:

#!/bin/sh
# Check whether pop3proxy is already running; do not start it if it is.
# The [s] in the pattern keeps grep from matching its own command line.
if ps ax | grep -q '[s]b_server.py'; then
    echo "pop3proxy already running -> not starting"
else
    echo "Starting pop3proxy"
    /usr/bin/pop3proxy &
fi
# The mozilla binary; we start it here with all the arguments
# given by /usr/bin/mozilla.
/usr/lib/mozilla/mozilla-bin "$@"
# After mozilla quits, we stop pop3proxy.
# Try SIGTERM first:
killall -15 sb_server.py
# If SIGTERM did not do it, SIGKILL will.
killall -9 sb_server.py

6. Save the file, then apply chmod 755 to it.

7. To test: start mozilla, then try to go to localhost:8880 with mozilla. If the spambayes page comes up, pop3proxy starts OK!

8. Shut down mozilla, then try "ps ax | grep '[s]b_server.py'" (the [s] stops grep from listing itself). If the process does not show, the script works. If it shows, something is wrong and you need to re-check your scripts.

This script allows multiple users to use pop3proxy without fear of reading each other's private e-mails.
The users only need to start mozilla, then configure it and that's all. The "already running" -test is there, because mozilla can be started multiple times (i.e when reading news you click a link and so on). The script may not be full-featured, but at least it allows multiple users to read their e-mails without much problems. -- Pekka "Pihti" Takala Nothing can be so bad that you cannot find something good in it! 65XXX assembler programmer/developer, linux user From juntunen at well.com Fri Apr 16 21:54:22 2004 From: juntunen at well.com (Thomas Juntunen) Date: Fri Apr 16 21:56:40 2004 Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15 In-Reply-To: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 04/16/04, Skip Montanaro imposed order on a stream of electrons to say: >One thing I think we need to be careful of is using test data sets whose >messages are too old. It's apparent the spammers are a moving target, so >what worked one or six months ago (or perhaps even a week ago) may not work >as well today. You might be surprised. I've saved all the spam I've received since late 2000. In recent months, I set up SpamAssassin 2.6.1 with the default rules and ran everything (some 12K messages) through it. A guy named Terry Sullivan (who knows a _lot_ more about statistics than I do) analyzed them and presented some conclusions about spam volatility at the MIT conference this past January. He composed a summary article about it here: http://www.qaqd.com/research/spam-e1.htm The upshot was spam changes a lot more slowly than common thought suggests. 
Thomas Juntunen -----BEGIN PGP SIGNATURE----- Version: PGP SDK 3.0 iQA/AwUBQICAPdFoei/9T3YdEQIf1QCgknpLGMgUAaQSChg+GNw3mL0feCoAoJEi 0CprW+cw1AISUFLI8qC0Jm3n =lGJK -----END PGP SIGNATURE----- From skip at pobox.com Fri Apr 16 23:52:53 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Apr 16 23:53:00 2004 Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15 In-Reply-To: References: Message-ID: <16512.43541.10392.272029@montanaro.dyndns.org> >> One thing I think we need to be careful of is using test data sets >> whose messages are too old. It's apparent the spammers are a moving >> target, so what worked one or six months ago (or perhaps even a week >> ago) may not work as well today. Thomas> .... A guy named Terry Sullivan (who knows a _lot_ more about Thomas> statistics than I do) analyzed [Thomas's data] and presented Thomas> some conclusions about spam volatility at the MIT conference Thomas> this past January. He composed a summary article about it here: Thomas> http://www.qaqd.com/research/spam-e1.htm Thomas> The upshot was spam changes a lot more slowly than common Thomas> thought suggests. I'm not going to try and argue with statistics, however, if I understand the summary article, it appears that two features in the principle component analysis account for 86% of the properties of your data set and that all the other features were indistinguishable from noise. I don't know how 86% relates to how much spam those two features would reliably detect, especially in the presence of ham, but my guess is that it's much less than the 99+% we need to have an effective spam filtering solution. Looking at how Spambayes has classified my mail since mid-December, I see 168k spams (~ 60%), 87k hams (~ 31%) and 27k unsures (~ 10%). If Spambayes was only identifying 86% of the spams (does the PCA number imply that?), that would be another 23k spams I'd have had to look at. 
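
[Worked check of the figures above. This is only a sketch of the arithmetic: the counts are the ones quoted in the message, and treating the 86% PCA figure as a detection rate is the hypothetical reading that the message itself questions.]

```python
# Counts quoted in the message, in thousands of messages.
spam_k, ham_k, unsure_k = 168, 87, 27
total_k = spam_k + ham_k + unsure_k


def share(count_k):
    """Percentage of all classified mail, rounded to one decimal place."""
    return round(100.0 * count_k / total_k, 1)


# Assumption for illustration only: read the 86% PCA figure as the fraction
# of spam that would be caught. The message doubts this reading.
detection_rate = 0.86
missed_k = round(spam_k * (1 - detection_rate), 1)

print("total (k):", total_k)
print("spam/ham/unsure shares (%):", share(spam_k), share(ham_k), share(unsure_k))
print("spam missed at 86% detection (k):", missed_k)
```

The shares come out close to the rounded 60% / 31% / 10% in the message, and the missed-spam figure is about 23.5k, in line with the "another 23k" estimate.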
In addition, PCA doesn't seem like it begins to address the issue of false positives and false negatives. Who cares if it identifies 86% of the spams if it also erroneously classifies 1% (to pick a number out of thin air) of the hams as spams? It's clear that spammers try different things. They have to move from one mail host to another. They have to cover their tracks by routing mail through open relays. They have to "infect" vulnerable machines to create open relays for themselves. They have to add hash busters. They have to disguise key words (like "v1@grA"). They have to gut their sales pitch and just refer you to a URL. They have to add word salad (both nonsense words and real, but randomly chosen words). They do this and lots of other stuff to try and squeak as much spam past filters as they can. I believe they will continue to try other tricks. One can hope that they are running out of tricks to try, but I'm pessimistic. Skip From juntunen at well.com Sat Apr 17 11:27:29 2004 From: juntunen at well.com (Thomas Juntunen) Date: Sat Apr 17 11:27:57 2004 Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15 In-Reply-To: <16512.43541.10392.272029@montanaro.dyndns.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 04/16/04, Skip Montanaro imposed order on a stream of electrons to say: >I don't know how 86% relates to how much spam those two features would >reliably detect, especially in the presence of ham, but my guess is that it's >much less than the 99+% we need to have an effective spam filtering solution. Absolutely. I was trying to make the point that we've found spammers change their tactics much more slowly than is commonly assumed. FWIW, the single most common characteristic of my corpus, HTML/mutlipart with no other parts, would stop around 37% of spam by itself. 
If anything, this research says a two-stage system, simple SA or some such to stop the real grunge, then SB or some such to apply more selective filtering on a smaller inflow, should be workable. >It's clear that spammers try different things. Yep. I don't have a number handy, but consider that a message can only be munged in so many ways before it is undeliverable. The total useful permutations might be too large for a human to handle easily, but I'm betting not for a computer. [snip description of spammer tricks] >I believe they will continue to try other tricks. One can hope that they are >running out of tricks to try, but I'm pessimistic. It's interesting you mention this. I can't say a whole lot right now, but Dr. Sullivan has devised an interesting technique that statistically looks at all the sorts of things you've mentioned. We looked at that stuff in order to try and pin down which spamware any particular spammer might be using, since all those tricks can be considered characteristics of spamware. We came to realize, they are also characteristics of the spammers themselves. Working in conjunction with some folks from Spamhaus, Sullivan is refining a technique to "fingerprint" particualr spammers by their choices of URLs/domains, presentation, and so forth. This only works for spammers whose volume is high enough to overcome the "noise" inherent in email, but after letting his tool work through a corpus and group spam messages by sender, then manually checking these with WHOIS, dig and so forth, the tool is right a little over 50% of the time with no training whatsoever. I think he is planning to present something about this at some conference (CEAS?) this summer. Anyway, all I wanted to try and make clear was there was statistical evidence that spam techniques change a lot more slowly than people usually assume. Not that this was some form of better filtering. 
In fact, I've been waiting for SpamBayes to get to at least a beta release so I can install it on my Apple laptop.

Thanks for the feedback!

Thomas Juntunen

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBQIE+jdFoei/9T3YdEQIQRQCgyzMCfviABf/wBKpNZId/Cw3z2xMAnikC
MTYEuD/Ri5tzgdbbNj0HPhO/
=xU9A
-----END PGP SIGNATURE-----

From sethg at GoodmanAssociates.com Sat Apr 17 14:34:44 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sat Apr 17 14:34:46 2004
Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15
In-Reply-To:
Message-ID:

> From: Thomas Juntunen
> Sent: Friday, April 16, 2004 8:54 PM
<...>
> http://www.qaqd.com/research/spam-e1.htm

I would like to read this article, but the link redirects to a login page that doesn't accept 'guest', 'anonymous' or an email address as a login. Could you provide another link or send me a copy of the article?

From Skip's post, he mentioned principal component analysis as the technique the author used. If this is the same as the method by that name we use in electrical engineering, this means decomposing a signal into a series of eigenvectors (orthogonal components), each with a length (the eigenvalue) that indicates the strength (electrical power) of that particular component. You then throw away the components that are similar in size to those that are known to be noise (completely random, no information content), leaving what are called the principal components. Under good conditions, the principal components comprise _most_ of the information portion of the signal, though it doesn't always come out that way. This is but one of many methods for breaking a signal down into orthogonal components and removing noise. The method has its pros and cons, which have a lot to do with the nature of the signal and how much you know about it ahead of time.

I can think of several issues applying PC analysis to a text message instead of a signal stream.
Since a text message can be parsed in different ways to create a signal to do the Eigendecomposition on, results will depend on whether you treat it as a bit stream, a character stream (with what character length?) or a token stream (tokenized how?). It would also be possible to treat the SpamAssassin results as tokens and use only those to create a token stream. I need to read the article, but applying Eigendecomposition to a text message raises a lot of questions for me. -- Seth Goodman From juntunen at well.com Sun Apr 18 12:06:19 2004 From: juntunen at well.com (Thomas Juntunen) Date: Sun Apr 18 12:07:07 2004 Subject: [spambayes-dev] URL Correction In-Reply-To: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hey folks, Anyone who wants to see that article on spam volatility by Terry Sullivan, be advised he has sent me a corrected URL. The other one pointed at a draft, this one is for the final version. Sorry about that! http://www.qaqd.com/research/mit04sum.html Thomas Juntunen -----BEGIN PGP SIGNATURE----- Version: PGP SDK 3.0 iQA/AwUBQIKZadFoei/9T3YdEQImCQCg31/OPjjfGv+0ayK92WqLdIFtuY0An24o qvvkBFq5Vt9J7Vn+FKxPJlAE =BUxZ -----END PGP SIGNATURE----- From juntunen at well.com Sun Apr 18 15:23:33 2004 From: juntunen at well.com (Thomas Juntunen) Date: Sun Apr 18 15:23:31 2004 Subject: [spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 17 In-Reply-To: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 04/18/04, Seth Goodman imposed order on a stream of electrons to say: >I can think of several issues applying PC analysis to a text message instead >of a signal stream. Since a text message can be parsed in different ways to >create a signal to do the Eigendecomposition on, results will depend on >whether you treat it as a bit stream, a character stream (with what >character length?) or a token stream (tokenized how?). It would also be >possible to treat the SpamAssassin results as tokens and use only those to >create a token stream. 
So far as I understand it, Dr. Sullivan didn't analyze the message text or headers themselves, he looked at which SpamAssassin rules were triggered over time. So the triggered rules are the vectors in this case. Thomas Juntunen -----BEGIN PGP SIGNATURE----- Version: PGP SDK 3.0 iQA/AwUBQILHo9Foei/9T3YdEQKgNwCg6cT33IzOO5zXawXu8Bsdh14HJ2QAn3dW xAl1gEdAFiWxQP8z9dVgVdZ/ =q7r9 -----END PGP SIGNATURE----- From tdickenson at geminidataloggers.com Mon Apr 19 10:09:57 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Mon Apr 19 10:10:04 2004 Subject: [spambayes-dev] Re: [Spambayes] Re: Cannot connect to socket with sb_bnserver.py In-Reply-To: <200404191455.04662.tdickenson@geminidataloggers.com> References: <1080424390.4065.24.camel@porsche.hq.simlog.com> <4083D158.8020809@videotron.ca> <200404191455.04662.tdickenson@geminidataloggers.com> Message-ID: <200404191509.57948.tdickenson@geminidataloggers.com> replies set to spambayes-dev@python.org On Monday 19 April 2004 14:55, Toby Dickenson wrote: > that strace log shows that unlink("/home/ricard/.sbbnsock-modeleT") fails > with ENOENT, so the socket does not exist. > > But connect to that socket is failing with ECONNREFUSED. Thats strange..... aha! your linux kernel must be 2.2.x ! right? patch attached. -- Toby Dickenson -------------- next part -------------- A non-text attachment was scrubbed... 
Name: time_to_get_a_proper_unix.diff Type: text/x-diff Size: 1232 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040419/9730dde9/time_to_get_a_proper_unix.bin From papaDoc at videotron.ca Mon Apr 19 10:21:15 2004 From: papaDoc at videotron.ca (papaDoc) Date: Mon Apr 19 10:21:31 2004 Subject: [spambayes-dev] Re: [Spambayes] Re: Cannot connect to socket with sb_bnserver.py In-Reply-To: <200404191509.57948.tdickenson@geminidataloggers.com> References: <1080424390.4065.24.camel@porsche.hq.simlog.com> <4083D158.8020809@videotron.ca> <200404191455.04662.tdickenson@geminidataloggers.com> <200404191509.57948.tdickenson@geminidataloggers.com> Message-ID: <4083E05B.4060008@videotron.ca> Hi Toby, >>that strace log shows that unlink("/home/ricard/.sbbnsock-modeleT") fails >>with ENOENT, so the socket does not exist. >> >>But connect to that socket is failing with ECONNREFUSED. Thats strange..... >> >> > >aha! your linux kernel must be 2.2.x ! right? > > Yes 2.2.20 I will apply the patch and let you know what happen. Remi -- /"\ \ / X ASCII Ribbon Campaign / \ Against HTML Email From gbrown at alumni.caltech.edu Mon Apr 19 12:42:36 2004 From: gbrown at alumni.caltech.edu (Glenn Brown) Date: Mon Apr 19 12:42:46 2004 Subject: [spambayes-dev] =?iso-8859-1?q?FW=3A_Overnight_shipping_on_x=E3n?= =?iso-8859-1?q?ax=2Cval=EDum_and_more?= Message-ID: <000a01c4262d$533c48a0$2301000a@Glenn> Dear spambayes-dev: Even when forwarded to myself, the Spambayes' Outlook plugin will not score this message, delete it as spam, or show spam clues. I've tried versions 0.80 and 1.0b1, and both have the same problem. In 6 months, I've only seen this one other time, which was yesterday, so I'm suspecting a new attack triggering an internal Spambayes failure. I have no clue how they might be doing this, but I bet you will find the problem intriquing. I apologize for forwarding HTML email to this list, but I doubt plain text will trigger the bug... 
I'm going to be really embarrased if the forwarded message does not exibit the same behaviour, but tests on my system assure me that it will, and if it doesn't, your spam filter should prevent you from seeing this message. ;) I have not been able to rule out some problem specific to my system (like maybe some database scalability limit, after ~24000 spam messages) because I don't know any Spambayes users I can forward this to, other then you. Enjoy, --Glenn -----Original Message----- From: aboveboard achilles [mailto:wazfapwjnojtk@utvinternet.ie] Sent: Monday, April 19, 2004 12:24 AM To: Alan Subject: Cc:Overnight shipping on x?nax,val?um and more abreact abundant acidulous abram acute abo acetylene abrasion abduct acs acrylate abyssinia aborigine aback acrobatic aaa accelerate account accusation abstain ace absence accompanist abstruse acm acculturate acidulous actinide abscissae abbas acquire abram accusatory acanthus aching abrupt absinthe abstention abysmal acceptor aboveground abduct across accra achromatic abnormal abstention ac accredit access abbreviate aboriginal abeyance acreage accrual acolyte acadia abbott accredit abbas -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040419/b93fec76/attachment.html From papaDoc at videotron.ca Mon Apr 19 13:17:09 2004 From: papaDoc at videotron.ca (Remi Ricard) Date: Mon Apr 19 13:17:23 2004 Subject: [spambayes-dev] A new patch submitted on sourceforge (pychsum) Message-ID: <40840995.3020307@videotron.ca> Hi, I submitted a new file (as a patch) on sourceforge. This is an utilities to generated a check sum of spam email and compare the resulting sum to previous spam if there is a match you can (flush this new spam). This file was created by Skip and sent to me by mail. (I hope Skip won't kill me since I'm submitting this and I did not ask for his permission). Remi P.S. 
I submitted this file since I had trouble finding the good and latest version on my PCs.

From papaDoc at videotron.ca Mon Apr 19 14:27:02 2004
From: papaDoc at videotron.ca (Remi Ricard)
Date: Mon Apr 19 14:27:07 2004
Subject: [spambayes-dev] Re: [Spambayes] Re: Cannot connect to socket with sb_bnserver.py
Message-ID: <408419F6.20700@videotron.ca>

Hi Toby,

> aha! your linux kernel must be 2.2.x ! right?
>
> patch attached.
>
<cut the patch (see previous message to get the patch)>

It is working with the patch !!!!!!!!!!!!!!!!!!!!!!!!

I needed to set my PATH so that sb_bnfilter can find sb_bnserver.py, and it is working.

This is the difference between sb_filter and sb_bnfilter:

/gmc/logiciels/spambayes/scripts$ time for i in {1,2,3,4,5}; do echo "Running $i"; cat ~/Tmp/mail.eml | /gmc/logiciels/spambayes/scripts/sb_filter.py -d ~/Tmp/spambayes.statistic.db; done

real    1m38.150s
user    1m33.270s
sys     0m3.990s

/gmc/logiciels/spambayes/scripts$ time for i in {1,2,3,4,5}; do echo "Running $i"; cat ~/Tmp/mail.eml | /gmc/logiciels/spambayes/scripts/sb_bnfilter.py -d ~/Tmp/spambayes.statistic.db; done

real    0m46.275s
user    0m9.160s
sys     0m1.300s

So thanks to all of you.

From kennypitt at hotmail.com Mon Apr 19 14:27:38 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Apr 19 14:29:31 2004
Subject: RE: [spambayes-dev] FW: Overnight shipping on xãnax,valíum and more
In-Reply-To: <000a01c4262d$533c48a0$2301000a@Glenn>
Message-ID:

This message scored fine on my SpamBayes, and produced a spam score of 0.7% thanks to the wonders of SpamBayes mailing list ham clues. I'm running from latest CVS source, which is basically equivalent to 1.0b1, on Outlook 2003. Here's the top portion of the "Show spam clues".

Could you attach log files from when you encountered this error? The Troubleshooting Guide under SpamBayes / Help / Troubleshooting Guide will tell you how to find them.
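
[Worked check of the clue listing in this message. This is a sketch, not the SpamBayes source: it assumes the classifier's default "unknown word" strength of 0.45 and probability of 0.5, which the message does not state, and the function names are made up for illustration. Under those assumptions the per-token probabilities and the combined score in the listing can be reproduced from the counts shown.]

```python
# Assumed SpamBayes defaults (not stated in the message).
UNKNOWN_WORD_STRENGTH = 0.45
UNKNOWN_WORD_PROB = 0.5


def token_spamprob(ham_count, spam_count, nham, nspam):
    """Spam probability of one token, pulled toward 0.5 when evidence is thin."""
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    return (UNKNOWN_WORD_STRENGTH * UNKNOWN_WORD_PROB + n * raw) / (
        UNKNOWN_WORD_STRENGTH + n
    )


def combined_score(ham_score, spam_score):
    """Fold the internal *H* and *S* scores into one value between 0 and 1."""
    return (spam_score - ham_score + 1.0) / 2.0


# Training totals and scores as reported in the clue listing.
NHAM, NSPAM = 103, 169
print("'plain'      (4 ham, 0 spam):", token_spamprob(4, 0, NHAM, NSPAM))
print("'spambayes'  (9 ham, 2 spam):", token_spamprob(9, 2, NHAM, NSPAM))
print("'abbreviate' (0 ham, 1 spam):", token_spamprob(0, 1, NHAM, NSPAM))
print("combined score:", combined_score(0.985378, 0.000156295))
```

With only 103 ham and 169 spam trained, a token seen once in spam and never in ham gets about 0.845 rather than 1.0, which is why the word-salad tokens in the forwarded spam carry limited weight against the strong mailing-list ham clues.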
Combined Score: 1% (0.00738917)
Internal ham score (*H*): 0.985378
Internal spam score (*S*): 0.000156295

# ham trained on: 103
# spam trained on: 169

86 Significant Tokens

token                                          spamprob   #ham #spam
'plain'                                        0.0505618     4     0
'problem.'                                     0.0505618     4     0
'seeing'                                       0.0505618     4     0
'versions'                                     0.0505618     4     0
'filter'                                       0.0652174     3     0
'forwarded'                                    0.0652174     3     0
'message-----'                                 0.0652174     3     0
'alan'                                         0.0918367     2     0
'plugin'                                       0.0918367     2     0
'subject:dev'                                  0.0918367     2     0
'to:addr:spambayes-dev'                        0.0918367     2     0
'header:Importance:1'                          0.0941901    37     6
'spambayes'                                    0.134245      9     2
'subject:spambayes'                            0.135953      5     1
'html'                                         0.148058      8     2
'assure'                                       0.155172      1     0
'attack'                                       0.155172      1     0
'limit,'                                       0.155172      1     0
'myself,'                                      0.155172      1     0
'received:edu'                                 0.155172      1     0
'sender:addr:spambayes-dev-bounces'            0.155172      1     0
'spam,'                                        0.155172      1     0
'to,'                                          0.155172      1     0
'tried'                                        0.162588      4     1
'skip:- 10'                                    0.165055      7     2
'outlook'                                      0.166134     10     3
'monday,'                                      0.202339      3     1
'tests'                                        0.202339      3     1
'spam'                                         0.23915      14     7
'does'                                         0.24132      10     5
'text'                                         0.246248      6     3
'across'                                       0.252149      4     2
'sent:'                                        0.252149      4     2
'specific'                                     0.252149      4     2
'really'                                       0.260641      9     5
"i've"                                         0.267806      7     4
'internal'                                     0.268313      2     1
'message.'                                     0.268313      2     1
"i'm"                                          0.272039     15     9
'x-mailer:microsoft outlook, build 10.0.4510'  0.280132      5     3
'users'                                        0.295068      9     6
'problem'                                      0.29801       6     4
'from:'                                        0.310408      7     5
'cc:2**0'                                      0.324958      4     3
'message'                                      0.336556     17    14
'database'                                     0.343235      6     5
'doing'                                        0.343235      6     5
'subject:'                                     0.343235      6     5
'maybe'                                        0.348391      7     6
'new'                                          0.364136     31    29
'subject:: '                                   0.364685     15    14
'header:Errors-To:1'                           0.371938     31    30
'sender:no real name:2**0'                     0.379611     29    29
'should'                                       0.380359     16    16
'to:'                                          0.380741     13    13
'able'                                         0.381108     11    11
'list,'                                        0.387141      3     3
'time,'                                        0.387141      3     3
'score'                                        0.390945      2     2
'after'                                        0.395493     15    16
'sender:addr:python.org'                       0.396461     27    29
'to:addr:python.org'                           0.396461     27    29
'2004'                                         0.399493     12    13
'you.'                                         0.616696      6    16
'proto:http'                                   0.622724     52   141
'april'                                        0.631635      1     3
'subject:and'                                  0.638645      2     6
'dear'                                         0.643221      5    15
'even'                                         0.643748      6    18
'account'                                      0.671001      5    17
'shipping'                                     0.67222       2     7
'doubt'                                        0.691855      1     4
'rule'                                         0.691855      1     4
'this,'                                        0.691855      1     4
'abbreviate'                                   0.844828      0     1
'abundant'                                     0.844828      0     1
'acetylene'                                    0.844828      0     1
'failure.'                                     0.844828      0     1
'forwarding'                                   0.844828      0     1
'subject:shipping'                             0.844828      0     1
'subject:\xed'                                 0.844828      0     1
'trigger'                                      0.844828      0     1
'accelerate'                                   0.908163      0     2
'apologize'                                    0.908163      0     2
'subjectcharset:iso-8859-1'                    0.958716      0     5
'url:biz'                                      0.985437      0    15

_____

From: spambayes-dev-bounces@python.org [mailto:spambayes-dev-bounces@python.org] On Behalf Of Glenn Brown
Sent: Monday, April 19, 2004 12:43 PM
To: spambayes-dev@python.org
Subject: [spambayes-dev] FW: Overnight shipping on xãnax,valíum and more

Even when forwarded to myself, the Spambayes' Outlook plugin will not score this message, delete it as spam, or show spam clues. I've tried versions 0.80 and 1.0b1, and both have the same problem. In 6 months, I've only seen this one other time, which was yesterday, so I'm suspecting a new attack triggering an internal Spambayes failure. I have no clue how they might be doing this, but I bet you will find the problem intriquing.

I apologize for forwarding HTML email to this list, but I doubt plain text will trigger the bug... I'm going to be really embarrased if the forwarded message does not exibit the same behaviour, but tests on my system assure me that it will, and if it doesn't, your spam filter should prevent you from seeing this message. ;)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040419/a60f83ec/attachment-0001.html

From pekka.takala at pp.inet.fi Tue Apr 20 04:38:33 2004
From: pekka.takala at pp.inet.fi (Pekka Takala)
Date: Tue Apr 20 04:37:49 2004
Subject: [spambayes-dev] re: Getting Mozilla (or Netscape) to work with pop3proxy (LINUX)
Message-ID: <4084E189.7070405@pp.inet.fi>

And nothing is good without bug support:

Sometimes

    ps ax |grep -q sb_server.py

finds itself (the grep process shows up in the ps output), although sb_server is not actually running. Changing the line to read

    ps ax |grep -v grep | grep -q sb_server.py

fixes the problem.

--
Pekka "Pihti" Takala
Nothing can be so bad that you cannot find something good in it!
65XXX assembler programmer/developer, linux user

From mhammond at skippinet.com.au Tue Apr 20 19:18:03 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Apr 20 19:18:27 2004
Subject: RE: [spambayes-dev] FW: Overnight shipping on xãnax,valíum and more
In-Reply-To:
Message-ID: <13be01c4272d$bb978ee0$0200a8c0@eden>

>> Even when forwarded to myself, the Spambayes' Outlook plugin will not
>> score this message, delete it as spam, or show spam clues. I've
>> tried versions 0.80 and 1.0b1,

When you say not delete or show spam clues, what exactly happens? Do you get a message that no filterable items are selected?

Kenny:
> This message scored fine on my SpamBayes, and produced a
> spam score of 0.7% thanks to the wonders of SpamBayes mailing
> list ham clues. I'm running from latest CVS source, which

The problem could be in the mail as received by Glenn, but once forwarded on, it works again (as Outlook inserts what is missing). That is, the function IsFilterCandidate() in msgstore.py is telling us not to filter the message.

Glenn - if you can work out how, can you see if you can get "dump_props.exe" to find the message.
It should be a matter of opening a command-prompt, running: dump_props.exe Overnight shipping > dump.txt and hopefully dump.txt will be created with information on all messages in your inbox with "Overnight shipping" in the subject. If you can get the information on the message out, please open a bug at sourceforge, assigning it to me, and attaching the output. Thanks, Mark. From kennypitt at hotmail.com Wed Apr 21 10:19:54 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Apr 21 10:21:36 2004 Subject: =?iso-8859-1?Q?RE:_=5Bspambayes-dev=5D_FW:_Overnight_shipping_on_x=E3na?= =?iso-8859-1?Q?x=2Cval=EDum_and_more?= In-Reply-To: <13be01c4272d$bb978ee0$0200a8c0@eden> Message-ID: Mark Hammond wrote: >>> Even when forwarded to myself, the Spambayes' Outlook plugin will >>> not score this message, delete it as spam, or show spam clues. I've >>> tried versions 0.80 and 1.0b1, > > When you say not delete or show spam clues, what exactly happens? Do > you get a message that no filterable items are selected? > > Kenny: >> This message scored fine on my SpamBayes, and produced a >> spam score of 0.7% thanks to the wonders of SpamBayes mailing >> list ham clues. I'm running from latest CVS source, which > > The problem could be in the mail as received by Glenn, but once > forward on, it again works (as Outlook inserts what is missing). ie, > the function IsFilterCandidate() in msgstore.py is telling us not to > filter the message. > > Glenn - if you can work out how, can you see if you can get > "dump_props.exe" to find the message. It should be a matter of > opening a command-prompt, running: > > dump_props.exe Overnight shipping > dump.txt > > and hopefully dump.txt will be created with information on all > messages in your inbox with "Overnight shipping" in the subject. If > you can get the information on the message out, please open a bug at > sourceforge, assigning it to me, and attaching the output. 
Mark, attached is a logfile I got from Glenn that was not CC'd to the list. Maybe this will give you some more ideas as to what's going wrong. My best guess was that one of the tokens in the database is not pickled correctly, possibly corrupted or maybe a leftover from one of the foreign character problems we've had, and that this particular message just happens to be one of the few that include that token. -- Kenny Pitt -------------- next part -------------- A non-text attachment was scrubbed... Name: spambayes1.log Type: application/octet-stream Size: 20576 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040421/9bebd489/spambayes1.obj From jbanderas_23.86gd at mailcity.com Tue Apr 20 16:21:19 2004 From: jbanderas_23.86gd at mailcity.com (Julian Banderas) Date: Wed Apr 21 11:44:24 2004 Subject: [spambayes-dev] Mala direta e-mails listas de email http://www.gueb.de/divulgamail Message-ID: <200404202021.i3KKLEsr072349@mxzilla1.xs4all.nl> As melhores listas segmentadas de e-mails para mala direta. Todos os tipos: http://www.gueb.de/divulgamail Cadastros de e-mails segmentados por estados, profiss?es, empresas e pessoas f?sicas. Tudo que voc? pracisa para fazer a divulga??o e publicidade do seu neg?cio, programas para spam e e-mail marketing. Listagens atualizadas e garantidas. Visite agora: http://www.gueb.de/divulgamail From tameyer at ihug.co.nz Wed Apr 21 20:09:09 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Apr 21 20:10:29 2004 Subject: =?iso-8859-1?Q?RE:_=5Bspambayes-dev=5D_FW:_Overnight_shipping_on_x=E3na?= =?iso-8859-1?Q?x=2Cval=EDum_and_more?= In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305FF637D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA0@its-xchg4.massey.ac.nz> > Mark, attached is a logfile I got from Glenn that was not > CC'd to the list. Maybe this will give you some more ideas > as to what's going wrong. [...] 
I had something pretty similar to this myself just this morning (traceback at end of message). I believe that I caused it by force quitting Outlook when it was busy working at something. Not long afterwards I also got the "Bayes database has X, message database has Y" error message. You'll notice that Glenn also does: """ Bayes database initialized with 14808 spam and 3729 good messages *** - message database has 18536 messages - bayes has 18537 - something is screwey """ Perhaps something similar is at fault here? In any case, retraining does look like the best option, and it solved the "buttons don't do anything" problem for me, too. (I have the two old databases if anyone really cares, but it doesn't seem worth looking at). There is a lot of training there, but on the plus side retraining will mean things work faster, the imbalance can be addressed, and lots of us (IIRC) are happy with sub-1000-message databases. Plus the problem will be fixed :) (I'm just hoping this doesn't mean that I'm now channelling all the problems that arise here... 
)

=Tony Meyer

"""
Traceback (most recent call last):
  File "C:\Python23\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python23\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File "C:\Python23\lib\site-packages\win32com\server\policy.py", line 542, in _invokeex_
    return func(*args)
  File "D:\spambayes\Outlook2000\addin.py", line 700, in OnClick
    TrainAsHam(msgstore_message, self.manager, save_db = False)
  File "D:\spambayes\Outlook2000\addin.py", line 142, in TrainAsHam
    if train.train_message(msgstore_message, False, manager.classifier_data):
  File "D:\spambayes\Outlook2000\train.py", line 52, in train_message
    cdata.bayes.learn(tokenize(stream), is_spam)
  File "D:\spambayes\spambayes\classifier.py", line 273, in learn
    self._add_msg(wordstream, is_spam)
  File "D:\spambayes\spambayes\classifier.py", line 375, in _add_msg
    record = self._wordinfoget(word)
  File "D:\spambayes\spambayes\storage.py", line 261, in _wordinfoget
    r = self.db.get(word)
  File "C:\Python23\Lib\shelve.py", line 111, in get
    return self[key]
  File "C:\Python23\Lib\shelve.py", line 119, in __getitem__
    value = Unpickler(f).load()
cPickle.UnpicklingError: invalid load key, ' '.
"""

From tameyer at ihug.co.nz Wed Apr 21 20:54:42 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Apr 21 20:55:43 2004
Subject: [spambayes-dev] RE: [Spambayes] Amazing sloth
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304830103@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA7@its-xchg4.massey.ac.nz>

(I think maybe I wasn't wrong about channelling other people's problems... )

> Here's a weird one, w/ Outlook 2000 and the addin from
> not-so-recent-anymore CVS. I decided to start over from
> scratch today, so have a new (Berkeley) DB.

This is with Outlook 2002 (SP2) and the addin also from not-so-recent-anymore CVS.
I also started over (see the spambayes-dev message) from scratch today with a new (Berkeley) DB. Specifically, I just trashed the old database files (while Outlook was closed) and started training as things arrived in the unsure folder. > It's taking the addin from 4 to 10 seconds to score each(!) > message. That's whether it's new incoming email, or via the > "Filter messages ..." menu item, or via a single "Show spam > clues". It's mind-numbingly slow. And, no surprise since you've read this far, I found this too. Bizarre. > While a message is being scored, Outlook is unresponsive to > keyboard or mouse input, but the process is using very little > CPU (typically a fraction of a percent, with very brief > spikes). So it's waiting on *something*, but don't know what. > > Nothing odd in the PythonWin Trace Collector display. Ran > scanpst on all the relevant .pst files -- no problems. The > sloth persists after restarting Outlook, and after a reboot. > No other Outlook operations have slowed, just SpamBayes. All of this applies to my experience as well, although I didn't try scanpst (I don't know if I have it, and since it didn't do Tim any good, it probably wouldn't have helped me anyway). > Two hours later: Heh ;). I didn't spend two hours on it, though. I remembered Tim's message and so after about 5 minutes just started with new db's again. > The sloth went away then, just as mysteriously and > dramatically as it appeared. Outlook remained open the entire time: > > extremely slow > retrain on 5 new ham and 5 new spam from scratch > zippy again I started afresh (from training 1 ham and 9 spam) also, but in the same way as before - close Outlook, move aside slow db's and start Outlook again. Also zippy once again. > So no clues, just bizarre symptoms. If it happens to you, > don't be an idiot like I just was: save the .db file before > retraining the problem away (it's the only relevant thing I > can think of that changed). 
Normally I would have done just that, but I recalled this message (me who struggles to remember what I had for dinner last night! ) and so have it zipped away for analysis. So, I offer it up to anyone interested in looking into it, or offer myself up to spend time looking into it if someone can suggest ways of doing that. I don't really know where to start. =Tony Meyer From tim.one at comcast.net Wed Apr 21 22:01:13 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Apr 21 22:01:19 2004 Subject: [spambayes-dev] RE: [Spambayes] Amazing sloth In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA7@its-xchg4.massey.ac.nz> Message-ID: [Tim, from a while ago] >> Here's a weird one, w/ Outlook 2000 and the addin from >> not-so-recent-anymore CVS. I decided to start over from >> scratch today, so have a new (Berkeley) DB. [Tony Meyer] > This is with Outlook 2002 (SP2) and the addin also from > not-so-recent-anymore CVS. I also started over (see the spambayes-dev > message) from scratch today with a new (Berkeley) DB. Specifically, > I just trashed the old database files (while Outlook was closed) and > started training as things arrived in the unsure folder. >> It's taking the addin from 4 to 10 seconds to score each(!) >> message. That's whether it's new incoming email, or via the >> "Filter messages ..." menu item, or via a single "Show spam >> clues". It's mind-numbingly slow. > And, no surprise since you've read this far, I found this too. > Bizarre. I should mention that it happened two more times for me after starting over from scratch, with very few msgs trained on each time (certainly less than 50 total). At that point I got a new box with a gigabyte of RAM, and switched to using a giant pickled dict instead. Much faster scoring, no problems, but much slower Outlook startup time and incremental training times. 
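
[Sketch of the trade-off described above, as a standalone Python program that times the two kinds of store. Everything here is an assumption for illustration: the token records are made up, the helper names are hypothetical, and shelve uses whichever dbm backend the local Python provides, which is not necessarily the bsddb hash flavor discussed in this thread. No timings are claimed; they depend entirely on the machine and backend.]

```python
import os
import pickle
import shelve
import tempfile
import time


def build_records(n):
    """Made-up token records: token -> (spam_count, ham_count)."""
    return {"token%d" % i: (i % 7, i % 5) for i in range(n)}


def time_pickled_dict(records, path):
    """One big pickle: the whole store is read at startup, lookups are in memory."""
    with open(path, "wb") as f:
        pickle.dump(records, f, pickle.HIGHEST_PROTOCOL)
    start = time.perf_counter()
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    load_s = time.perf_counter() - start
    start = time.perf_counter()
    hits = sum(1 for key in records if key in loaded)
    lookup_s = time.perf_counter() - start
    return hits, load_s, lookup_s


def time_shelf(records, path):
    """Per-key database: nothing to load up front, each lookup goes to the store."""
    start = time.perf_counter()
    with shelve.open(path) as db:
        for key, value in records.items():
            db[key] = value
    insert_s = time.perf_counter() - start
    start = time.perf_counter()
    with shelve.open(path) as db:
        hits = sum(1 for key in records if db.get(key) is not None)
    lookup_s = time.perf_counter() - start
    return hits, insert_s, lookup_s


records = build_records(2000)
with tempfile.TemporaryDirectory() as tmp:
    pickle_result = time_pickled_dict(records, os.path.join(tmp, "tokens.pck"))
    shelf_result = time_shelf(records, os.path.join(tmp, "tokens.shelf"))

print("pickled dict: hits=%d load=%.4fs lookups=%.4fs" % pickle_result)
print("shelf:        hits=%d inserts=%.4fs lookups=%.4fs" % shelf_result)
```

Pointing time_shelf at a copy of a database that shows the slowdown, and raising the record count, would be one way to see whether lookups or inserts are the part that degrades.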
>> While a message is being scored, Outlook is unresponsive to >> keyboard or mouse input, but the process is using very little >> CPU (typically a fraction of a percent, with very brief >> spikes). So it's waiting on *something*, but don't know what. >> >> Nothing odd in the PythonWin Trace Collector display. Ran >> scanpst on all the relevant .pst files -- no problems. The >> sloth persists after restarting Outlook, and after a reboot. >> No other Outlook operations have slowed, just SpamBayes. > All of this applies to my experience as well, although I didn't try > scanpst (I don't know if I have it, and since it didn't do Tim any > good, it probably wouldn't have helped me anyway). Whenever you see reference to the "Inbox Repair Tool", it means scanpst.exe. I'm amazed that MS continues to make this thing so hard to find: .pst files routinely get corrupted in minor and major ways by Outlook (whether or not SpamBayes is installed), and scanpst.exe finds at least one problem in my .pst files every day(!). You have scanpst.exe, but you may have to search your disk to find it. > Heh ;). I didn't spend two hours on it, though. I remembered Tim's > message and so after about 5 minutes just started with new db's again. >> The sloth went away then, just as mysteriously and >> dramatically as it appeared. Outlook remained open the entire time: >> >> extremely slow >> retrain on 5 new ham and 5 new spam from scratch >> zippy again > I started afresh (from training 1 ham and 9 spam) also, but in the > same way as before - close Outlook, move aside slow db's and start > Outlook again. Also zippy once again. >> So no clues, just bizarre symptoms. If it happens to you, >> don't be an idiot like I just was: save the .db file before >> retraining the problem away (it's the only relevant thing I >> can think of that changed). > Normally I would have done just that, but I recalled this message (me > who struggles to remember what I had for dinner last night! 
) > and so have it zipped away for analysis. > > So, I offer it up to anyone interested in looking into it, or offer > myself up to spend time looking into it if someone can suggest ways > of doing that. I don't really know where to start. Since I moved to a giant pickled dict, I don't care anymore <0.5 wink>. An interesting experiment would be to open it directly from a non-SpamBayes Python program, and just time lookups and inserts. There was a disturbing Python bug report against bsddb that I closed as hopeless: http://www.python.org/sf/881522 This was about a huge slowdown in shelve after several thousands of keys had been added. There were strong hints that the huge slowdown was specific to the combination of: "a modern" bsddb (after the ancient 1.85) Windows the hash flavor of bsddb There were also hints that the BTree flavor of bsddb was faster than the hash flavor, independent of the mystery-slowdown in the hash flavor. Since we experienced Amazing Sloth under different versions of Outlook, and very different OSes, my top guess has to be that the fault is in the dbhash flavor of bsddb. From tameyer at ihug.co.nz Thu Apr 22 01:16:11 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Apr 22 01:16:34 2004 Subject: [spambayes-dev] RE: [Spambayes] Amazing sloth In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305FF6565@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CB3@its-xchg4.massey.ac.nz> > I should mention that it happened two more times for me after > starting over from scratch, with very few msgs trained on > each time (certainly less than 50 total). Yay, something to look forward to . I've managed to get up to 11 ham and 11 spam without any problems, though. [...] > You have scanpst.exe, but you may > have to search your disk to find it. Indeed I did. It was in C:\Program Files\Common Files\System\Mapi\1033. Of course. > Since I moved to a giant pickled dict, I don't care anymore > <0.5 wink>. 
I suppose I do (assuming it may happen to me again), since I don't really want to switch to a pickled dict, because I open and close Outlook reasonably often, and have other uses for my (much smaller) memory. > An interesting experiment would be to open it > directly from a non-SpamBayes Python program, and just time > lookups and inserts. Lookups don't appear to be affected at all, but inserts definitely are. I've tried really simple (just multiple insertions) tests comparing a new database, a database around the same size (which is about 5500 keys), the slow database, and another Berkeley db with the same data (exporting the slow one to text and then using that to create a new db) in case it was just some quirk of entry order or the file itself. There doesn't seem to be any difference between the dbs with the same data, but they are 3 to 4 times slower than either the new db or the similarly sized one. This is with Python 2.3.3 and bsddb or Python 2.2.3 and bsddb3. Playing around with creating dbs of the same size doesn't seem to be getting me any closer to creating another database with this odd effect. I realise that you don't have time to look into this, but any chance you have further suggestions about how I might investigate it? > There was a disturbing Python bug report against bsddb that I > closed as hopeless: > > http://www.python.org/sf/881522 I read this, and it does seem like it could be related, but I'm not sure how to test that :) =Tony Meyer
From skip at pobox.com Fri Apr 23 10:59:55 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Apr 23 11:00:02 2004 Subject: [spambayes-dev] Interesting way to purge old msgs w/ t-t-e Message-ID: <16521.12139.820834.72911@montanaro.dyndns.org> I have been running train-to-exhaustion for awhile now and like it. The only persistent problem I've had to deal with is how to purge old data, that is, what old messages to delete so my database doesn't grow without bound. The solution popped into my brain the other day: use the new reversed() builtin. If indicated on the tte.py command line with the --reverse flag, it sets up the mailbox iterators to march in reverse. This gives more weight to more recent messages. Coupled with the --cullext flag it allows me to easily purge old messages which aren't used in the actual training. Startup for each testing round is delayed slightly, but that seems to be the only negative side effect. Skip From jkx at pythonfr.org Sat Apr 24 17:00:56 2004 From: jkx at pythonfr.org (Jkx@Pythonfr) Date: Sat Apr 24 16:59:55 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin Message-ID: <200404242300.56525.jkx@pythonfr.org> First hy to everyone :) This is my first post on this mailing list. I take a little time this afternoon to write this piece of code and i want to know what other think about it. Extract from the code """ SpamBayes server compliant w/ spamassassin spamassassin can run as a daemon for a large scale network (spamd). To use the spamd spassassin provide a short piece of C code written to be really efficient to make the glue between MTA and spamd. SpamBayes (which is my preferred spam filtering) don't provide this kind of stuff. It came w/ some python xmlrpc client server, but forking a python for each incomming mail eat too much cpu on linux box.
So this piece of code is a fake spamassassin server that use a spambayes for filtering. This version has been tested w/ version 2.63 of spamc Take care it dosn't support: - SSL - BSMTP benchmark results w/ 600 mails w/ a ~ 650Kb trained DB: - procmail + sb_filter.py: 70 mails/min - procmail + spamc + this: 206 mails/min (TCP or unixdomain achieve the same perf) important notes: - i don't test other server cause i need to something that work on system-wide - this doesn't support simultanus acces so i achieve the same thoughout put w/ maildrop instead of procmail The reason why i don't write this, is that i don't know what to do: thread / fork / async ??? - it support filtering for virtual hosted mailbox even if this is not the defaut behaviour """ Any comments / blame / flame .. is welcome :) ByeBye -------------- next part -------------- A non-text attachment was scrubbed... Name: sb_global_server.py Type: application/x-python Size: 6309 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040424/8b6058e9/sb_global_server.bin From skip at pobox.com Sat Apr 24 17:38:21 2004 From: skip at pobox.com (Skip Montanaro) Date: Sat Apr 24 17:38:46 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <200404242300.56525.jkx@pythonfr.org> References: <200404242300.56525.jkx@pythonfr.org> Message-ID: <16522.56909.879061.103902@montanaro.dyndns.org> jkx> spamassassin can run as a daemon for a large scale network jkx> (spamd). To use the spamd spassassin provide a short piece of C jkx> code written to be really efficient to make the glue between MTA jkx> and spamd. jkx> SpamBayes (which is my preferred spam filtering) don't provide this jkx> kind of stuff. It came w/ some python xmlrpc client server, but jkx> forking a python for each incomming mail eat too much cpu on linux jkx> box. jkx> So this piece of code is a fake spamassassin server that use a jkx> spambayes for filtering. 
Take a look at sb_bnfilter.py. It's like spamc/spamd only better. The daemon (sb_bnserver.py) is forked automatically and quietly exits after a few seconds of idle time. sb_bnfilter.py has only recently been added to the Spambayes CVS repository. In the latest distribution I think it still turns up in the contrib directory, but I moved it to the scripts directory a week or two ago, so it should get installed by default when the next distribution is released. Skip From jkx at pythonfr.org Sat Apr 24 18:31:56 2004 From: jkx at pythonfr.org (Jkx@Pythonfr) Date: Sat Apr 24 18:30:54 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <16522.56909.879061.103902@montanaro.dyndns.org> References: <200404242300.56525.jkx@pythonfr.org> <16522.56909.879061.103902@montanaro.dyndns.org> Message-ID: <200404250031.56607.jkx@pythonfr.org> On Saturday 24 April 2004 23:38, Skip Montanaro wrote: > jkx> So this piece of code is a fake spamassassin server that use a > jkx> spambayes for filtering. > > Take a look at sb_bnfilter.py. It's like spamc/spamd only better. The > daemon (sb_bnserver.py) is forked automatically and quietly exits after a > few seconds of idle time. Yes i understand your meaning, but this tend to do something really different. sb_bnserver.py forks itself and wait for the user connection, by this way it cache the db parsing and all the hammie classes needed for working w/. but : 1) you still need to create a python process for every incomming mail sb_bnfilter. And python, even if it not a weight bloat, python eat something like 4.5Mb of memory instead of the poor 500Ko of spamc 2) sb_bnserver need to be launch by the user (thought sb_bnfilter), and it is written in this way, so it isn't system-wide filering. spamc as some usefull stuff like round-robin filtering .. For example, if i need to dipatch a lot of mail in mailbox (mailing list for example), for every user it will fork n servers .. and so on ? 
I think sb_bn* is pretty nice for a system with only a few mail accounts, and should perform very well for bursts of mail dispatched to a single user, such as after a fetchmail... but this isn't my goal.

I admit that my code is a bit rough, as it only does the filtering, and provides neither db caching nor simultaneous access, but this is a first try ..

/ apologies for my poor English /

From tameyer at ihug.co.nz Sat Apr 24 22:15:20 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sat Apr 24 22:16:21 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305FF6CB4@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CC7@its-xchg4.massey.ac.nz>

> sb_bnfilter.py has only recently been added to the Spambayes
> CVS repository. In the latest distribution I think it still
> turns up in the contrib directory, but I moved it to the
> scripts directory a week or two ago, so it should get
> installed by default when the next distribution is released.

FYI, the latest 1.0b1.1 source dists (both .zip and .tar.gz) include your fix, moving them to the scripts directory (because otherwise the setup.py script didn't manage to complete; my bad for the missed testing).

=Tony Meyer

From skip at pobox.com Sat Apr 24 22:31:37 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sat Apr 24 22:31:44 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404250031.56607.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org> <16522.56909.879061.103902@montanaro.dyndns.org> <200404250031.56607.jkx@pythonfr.org>
Message-ID: <16523.8969.140499.186982@montanaro.dyndns.org>

jkx> 1) you still need to create a python process for every incomming
jkx> mail sb_bnfilter. And python, even if it not a weight bloat,
jkx> python eat something like 4.5Mb of memory instead of the poor
jkx> 500Ko of spamc

The sb_bnfilter/sb_bnserver combination runs several times faster on my machine.
It would probably be faster if you recoded sb_bnfilter.py in C. Feel free. jkx> 2) sb_bnserver need to be launch by the user (thought sb_bnfilter), jkx> and it is written in this way, so it isn't system-wide filering. jkx> spamc as some usefull stuff like round-robin filtering .. For jkx> example, if i need to dipatch a lot of mail in mailbox (mailing jkx> list for example), for every user it will fork n servers .. and jkx> so on ? I don't recall that you said you wanted a single system-wide filter. Spambayes isn't designed that way at any rate. It will require some significant effort. jkx> I think sb_bn* is pretty nice for a system w/ only few mail jkx> accounts and should performs very for bursting email dispatch for a jkx> single user like after a fetchmail... but this isn't my goal. Some folks have experimented with using Spambayes for system-wide filtering. I don't know that anybody's produced any conclusive results. That said, one approach might be to rework sb_bnserver.py to open several unix domain sockets (one per user) and listen on all of them. When a connection is made on a socket spin off a new thread to handle it and use that user's database to score the message. If the user doesn't have a database of their own, default to a general database. Once you have that working, you can rewrite sb_bnfilter.py in C to reduce memory consumption and maybe improve performance a bit. sb_bnserver.py could probably be sped up just by running it with psyco. 
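A minimal sketch of the approach described above (one Unix domain socket per user, a new thread per connection, and a fallback to a general database), written in present-day Python 3 rather than the Python 2.3 of this thread. It is not taken from sb_bnserver.py: the names pick_database, score, handle and serve are hypothetical, the "<user>.sock" naming is an assumption, and score() is a stand-in that treats a "database" as a plain set of spammy words instead of calling the real SpamBayes classifier.

```python
import os
import selectors
import socket
import threading


def pick_database(user, user_dbs, general_db):
    """Return the user's own database, or the general one if they have none."""
    return user_dbs.get(user, general_db)


def score(database, message):
    """Hypothetical stand-in scorer: fraction of words found in the database."""
    words = message.split()
    if not words:
        return 0.5
    return sum(1 for word in words if word in database) / len(words)


def handle(conn, database):
    """Read one message until EOF, then reply with its score on one line."""
    with conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
        message = b"".join(chunks).decode("utf-8", "replace")
        conn.sendall(("%.3f\n" % score(database, message)).encode("ascii"))


def serve(socket_dir, users, user_dbs, general_db, stop_event):
    """Listen on <socket_dir>/<user>.sock for each user until stop_event is set."""
    sel = selectors.DefaultSelector()
    listeners = []
    for user in users:
        path = os.path.join(socket_dir, user + ".sock")
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen()
        # Remember which user this socket belongs to.
        sel.register(listener, selectors.EVENT_READ, data=user)
        listeners.append((listener, path))
    try:
        while not stop_event.is_set():
            for key, _events in sel.select(timeout=0.2):
                conn, _addr = key.fileobj.accept()
                database = pick_database(key.data, user_dbs, general_db)
                # One thread per connection, as suggested in the message.
                threading.Thread(target=handle, args=(conn, database),
                                 daemon=True).start()
    finally:
        sel.close()
        for listener, path in listeners:
            listener.close()
            os.unlink(path)
```

A client would connect to its user's socket, send the message, shut down the write side of the connection to signal end of input, and read back the score. Each listening socket costs one file descriptor in the server, which is the trade-off argued over later in the thread.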
Skip From jkx at pythonfr.org Sat Apr 24 23:22:11 2004 From: jkx at pythonfr.org (Jkx@Pythonfr) Date: Sat Apr 24 23:21:08 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <16523.8969.140499.186982@montanaro.dyndns.org> References: <200404242300.56525.jkx@pythonfr.org> <200404250031.56607.jkx@pythonfr.org> <16523.8969.140499.186982@montanaro.dyndns.org> Message-ID: <200404250522.11544.jkx@pythonfr.org> On Sunday 25 April 2004 04:31, Skip Montanaro wrote: > jkx> 1) you still need to create a python process for every incomming > jkx> mail sb_bnfilter. And python, even if it not a weight bloat, > jkx> python eat something like 4.5Mb of memory instead of the poor > jkx> 500Ko of spamc > > The sb_bnfilter/sb_bnserver combination runs several times faster on my > machine. It would probably be faster if you recoded sb_bnfilter.py in C. > Feel free. Faster than ? - sb_filter ? - spamc + code_attached in previous email ? But why should i rewrote sb_bnfilter in C, since sb_bnserver doesn't feet w/ my needs . > jkx> 2) sb_bnserver need to be launch by the user (thought > sb_bnfilter), jkx> and it is written in this way, so it isn't > system-wide filering. jkx> spamc as some usefull stuff like round-robin > filtering .. For jkx> example, if i need to dipatch a lot of mail in > mailbox (mailing jkx> list for example), for every user it will fork n > servers .. and jkx> so on ? > > I don't recall that you said you wanted a single system-wide filter. > Spambayes isn't designed that way at any rate. It will require some > significant effort. Where significant effort ? I really miss something. Have you read the code i provided ? It just serve as 1 single server (hammie filter) for a large number of users. But all have their own database. - one and only one server (not one per user !) 
- every user have its own db > jkx> I think sb_bn* is pretty nice for a system w/ only few mail > jkx> accounts and should performs very for bursting email dispatch for > a jkx> single user like after a fetchmail... but this isn't my goal. > > Some folks have experimented with using Spambayes for system-wide > filtering. I don't know that anybody's produced any conclusive results. What you think of system-wide filtering is : using the same hammie filter database for all the users. Once more .. this is not what my code is done for. my code try to face this problems: - spawning a python at each incomming mail (spamc) - having one deamon (or more) per user . > That said, one approach might be to rework sb_bnserver.py to open several > unix domain sockets (one per user) and listen on all of them. When a > connection is made on a socket spin off a new thread to handle it and use > that user's database to score the message. If the user doesn't have a > database of their own, default to a general database. Do you really want to open one UnixDomain socket per user ????? I usually work w/ about 50 users right now .. ( and i wrote this code to do on ~ 1000 accounts .. ). Another thing, i don't care about 'general database' .. this isn't the goal i want a system managable for a large number of user.. > Once you have that working, you can rewrite sb_bnfilter.py in C to reduce > memory consumption and maybe improve performance a bit. sb_bnserver.py > could probably be sped up just by running it with psyco. pscyco have nothing about that. the trouble is 'exec a python' at each email this is a bad idea. That why i use code ripped from spamassassin, because 1) it is really efficient code 2) quite clear code (despite too much goto) 3) it is system-wide: - use syslogd - handler error (you don't loose mails w/) - have round-robin capabities .. - and so on .. I wrote this for sys-admin who wants to have spambayes for a large scale of users. 
and that can manage easly the way mails are filtered .. - only one spambayes server - all incoming mails are sent (thought spamc) to this server - and every user use it's own hammie database in there home. so even it the server falls for a strange raison mails aren't lost .. (spamc do that perfectly ) Bye Bye From skip at pobox.com Sun Apr 25 00:32:32 2004 From: skip at pobox.com (Skip Montanaro) Date: Sun Apr 25 00:32:37 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <200404250522.11544.jkx@pythonfr.org> References: <200404242300.56525.jkx@pythonfr.org> <200404250031.56607.jkx@pythonfr.org> <16523.8969.140499.186982@montanaro.dyndns.org> <200404250522.11544.jkx@pythonfr.org> Message-ID: <16523.16224.232593.68870@montanaro.dyndns.org> jkx> Where significant effort ? jkx> I really miss something. Have you read the code i provided ? It jkx> just serve as 1 single server (hammie filter) for a large number of jkx> users. But all have their own database. jkx> - one and only one server (not one per user !) jkx> - every user have its own db No, I admit I didn't read your code. I read your mail message and must have not fully understood what you were after. My apologies. jkx> Do you really want to open one UnixDomain socket per user ????? Sure, why not? Unix domain sockets are pretty cheap. jkx> I usually work w/ about 50 users right now .. ( and i wrote this jkx> code to do on ~ 1000 accounts .. ). jkx> Another thing, i don't care about 'general database' .. this isn't jkx> the goal i want a system managable for a large number of user.. I don't think a shared database would work except for a very close group of users (very similar ideas of what constitutes ham and spam). How do your users train their databases? I presume you are doing all this on your mail server. Are your users local or remote? >> Once you have that working, you can rewrite sb_bnfilter.py in C to >> reduce memory consumption and maybe improve performance a bit. 
>> sb_bnserver.py could probably be sped up just by running it with
>> psyco.

jkx> pscyco have nothing about that. the trouble is 'exec a python' at
jkx> each email

I don't see 'exec a python' as a huge problem. Presumably on a busy server the python interpreter and all the compiled bytecode will just be sitting in memory buffers awaiting activation. Lots of systems do the equivalent of 'exec a python' or more on a per message basis. Have you tried it? Was it too slow?

jkx> so even it the server falls for a strange raison mails aren't lost
jkx> .. (spamc do that perfectly )

I'd rather trust my mail's delivery to procmail. If sb_bn*.py craps out, procmail is there to recover the message for me. So far that combination has been very robust. It processes between 2,000 and 3,000 messages daily (about 70% spam) for me on my laptop without a hiccup. I generally don't even notice that it's running.

I just ran a quick test of sb_bnfilter.py on my laptop. In a directory containing 501 spams (between 24 and 3080 lines each, average 142 lines) I executed:

    for f in `find . -type f` ; do
        time sb_bnfilter.py < $f > /dev/null
    done 2>&1 | egrep real | sed -e 's/[^0-9.]//g' > ~/tmp/times.txt

The minimum real time was 0.180 seconds. The maximum was 1.057 seconds. The mean time was 0.260 seconds. I then tried it with a byte-compiled version of sb_bnfilter.py:

    for f in `find . -type f` ; do
        time python ~/local/bin/sb_bnfilter.pyc < $f > /dev/null
    done 2>&1 | egrep real | sed -e 's/[^0-9.]//g' > ~/tmp/times2.txt

The times improved slightly: min 0.172, max 0.957, mean 0.241. I then tried a third test, adding -A 1000 to the sb_bnfilter.py command line in the second test to keep a single sb_bnserver.py running for the entire test. Results: min 0.169, max 0.841, mean 0.236.

I'd try the psyco test but my laptop is a Mac. Presumably performance would also improve on a more serious mail server. What's your target processing time per message?
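The message does not say how the min/max/mean figures were computed from times.txt, so the following is only one possible way to do it, sketched in Python 3. The function name summarize is hypothetical. It assumes each non-blank line holds a single number of seconds, which in turn assumes every run took under a minute: the sed expression strips the "m" from output such as 0m0.180s, so a time of a minute or more would have its minutes digit glued onto the seconds.

```python
def summarize(lines):
    """Return (minimum, maximum, mean) of the timings, one number per line."""
    times = [float(line) for line in lines if line.strip()]
    if not times:
        raise ValueError("no timings found")
    return min(times), max(times), sum(times) / len(times)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_times.py ~/tmp/times.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            low, high, mean = summarize(handle)
        print("min %.3f  max %.3f  mean %.3f" % (low, high, mean))
```

Running it once per output file (times.txt, times2.txt) would give the three figures quoted for each test.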
Skip From jkx at pythonfr.org Sun Apr 25 08:50:21 2004 From: jkx at pythonfr.org (Jkx@Pythonfr) Date: Sun Apr 25 08:49:16 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <16523.16224.232593.68870@montanaro.dyndns.org> References: <200404242300.56525.jkx@pythonfr.org> <200404250522.11544.jkx@pythonfr.org> <16523.16224.232593.68870@montanaro.dyndns.org> Message-ID: <200404251450.21193.jkx@pythonfr.org> On Sunday 25 April 2004 06:32, Skip Montanaro wrote: > jkx> Where significant effort ? > > > No, I admit I didn't read your code. I read your mail message and must > have not fully understood what you were after. My apologies. > > jkx> Do you really want to open one UnixDomain socket per user ????? > > Sure, why not? Unix domain sockets are pretty cheap. Simply because this is not realist ... this will eat a bunch of socket for nothing .. Have you ever heard that OS has max open file descriptor limit ? ? > How do your > users train their databases? I presume you are doing all this on your mail > server. Are your users local or remote? The train will be done thought cron in Maildir folder. The users are remote and use folders via imap > jkx> pscyco have nothing about that. the trouble is 'exec a python' at > jkx> each email > > I don't see 'exec a python' as a huge problem. Presumably on a busy server > the python interpreter and all the compiled bytecode will just be sitting > in memory buffers awaiting activation. Lots of systems do the equivalent > of 'exec a python' or more on a per message basis. Have you tried it? Was > it too slow? I think you should look closer at how mail delivery works ! Have you ever think that you can deliver a bunch of mails at the same time ? So you don't have only a one 'exec python' but you will have one per user for simultanous incomming mail.. For example filtering done thought maildrop can get (by default) 100 simultanus filter.. 
so do you really think that 100 * exec python is the same as 100 * spamc ??? (cause spamc eat ~500 Kb and python ~ 4.5 Mb ) > jkx> so even it the server falls for a strange raison mails aren't lost > jkx> .. (spamc do that perfectly ) > > > I just ran a quick test of sb_bnfilter.py on my laptop. In a directory > containing 501 spams (between 24 and 3080 lines each, average 142 lines) I > executed: [snip] .. This test doesn't represent any valuable information, since it use 1) only one user 2) only one access .. so only 1 spwan per mail etc etc .. Please test the same thing w/ ~10 users .. and measure the nb of mail path thought the system (MTA + procmail + filter) > Presumably performance would also improve on a more serious mail server. > What's your target processing time per message? The less .. simply .. I just added a cache system to my code (maintaning a hash of already open hammie db) .. and i achieve to something like 300 mails / min. and test without any filtering give me something like 600 mails / min... i think doing better would be hard but can be done ( using fork / thread or async on the server socket delivery) Bye Bye .. From tdickenson at geminidataloggers.com Sun Apr 25 13:21:32 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Sun Apr 25 13:21:38 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <200404251450.21193.jkx@pythonfr.org> References: <200404242300.56525.jkx@pythonfr.org> <16523.16224.232593.68870@montanaro.dyndns.org> <200404251450.21193.jkx@pythonfr.org> Message-ID: <200404251821.33037.tdickenson@geminidataloggers.com> On Sunday 25 April 2004 13:50, Jkx@Pythonfr wrote: > > jkx> Do you really want to open one UnixDomain socket per user ????? > > > > Sure, why not? Unix domain sockets are pretty cheap. > > Simply because this is not realist ... this will eat a bunch of socket for > nothing .. Have you ever heard that OS has max open file descriptor > limit ? ? 
There is an engineering trade-off here between having one big process and one big socket covering all users, and having per-user processes and sockets. You are right that it doesn't make sense to have one big global process listening on hundreds of sockets.

Disadvantages of having one big process include:

1. security. This big process has to be privileged enough to read a .hammiedb from every user's home directory. In practice I guess you run it as root. The spambayes development team doesn't have the culture to justify that kind of trust IMO. (also IMO, nor should it)

2. functionality. spambayes assumes a per-user operational model. For example, I think sb_global_server currently doesn't handle per-user ~/.spambayesrc.

> I think you should look closer at how mail delivery works !
> Have you ever think that you can deliver a bunch of mails at the same time
> ? So you don't have only a one 'exec python' but you will have one per user
> for simultanous incomming mail.. For example filtering done thought
> maildrop can get (by default) 100 simultanus filter.. so do you really
> think that 100 * exec python is the same as 100 * spamc ??? (cause spamc
> eat ~500 Kb and python ~ 4.5 Mb )

Yes, using python for sb_bnfilter is a short-term measure. It's a prototype. A C version is in progress.

> my code try to face this problems:
> - spawning a python at each incomming mail (spamc)
> - having one deamon (or more) per user .

I agree the first of those is a problem, and needs to be fixed in sb_bn*. (reusing the lightweight s.a. code here is a good trick btw.)

I'm unconvinced so far that the overhead of having one daemon per user is a bigger problem than having spambayes run in a shared daemon with higher privileges than a normal user.
-- Toby Dickenson From jkx at pythonfr.org Sun Apr 25 15:23:48 2004 From: jkx at pythonfr.org (Jkx@Pythonfr) Date: Sun Apr 25 15:22:43 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <200404251821.33037.tdickenson@geminidataloggers.com> References: <200404242300.56525.jkx@pythonfr.org> <200404251450.21193.jkx@pythonfr.org> <200404251821.33037.tdickenson@geminidataloggers.com> Message-ID: <200404252123.48303.jkx@pythonfr.org> On Sunday 25 April 2004 19:21, Toby Dickenson wrote: > On Sunday 25 April 2004 13:50, Jkx@Pythonfr wrote: > There is an engineering compromise here split around having one big process > and one big socket covering all users, compared to having per-user > processes and sockets. You are right that it doesnt make sense to have one > big global process listening on 100's of sockets. > > Disadvantages of having one big process include: > 1. security. This big process has to be priveliged enough to read a > .hammiedb from every users home directory. In practice I guess you run it > as root. The spambayes development team doesnt have the culture to justify > that kind of trust IMO. (also IMO, nor should it) In fact, if you look deeper in the code, you will see that i use this on virtual mail domain. with hierarchie looking as: /var/lib/vmail/domain/user for example /var/lib/vmail/example.com/contact. So i don't really get trouble w/ the right since this all virtual domains are only owned by one user 'vmail'. But anyway in a normal setup you can: - run the deamon as root (as done w/ spamassassin .. i don't think this is really risky, because by default the the socket is binded to localhost .. etc etc ..) - And you can easily imagine to run this as another user, and tweak the self.dbname according to your needs.. (for example put all the db in a unique folder .. which only one account can access. this is a common way to do stuff for large system) > 2. functionality. spambayes assumes a per-user operational model. 
For > example, I think sb_global_server currently doesnt handle per-user > ~/.spambayesrc. Yeah that's true .. but again i think this is only a matter on implementation. As the filter is done at each request .. i can imagine parsing this configuration file too. It's my first hack w/ spambayes. I just discovered the code yersterday .. so i think i really miss some points. And that why i'm asking for support here. > > I think you should look closer at how mail delivery works ! > > Have you ever think that you can deliver a bunch of mails at the same > > time ? So you don't have only a one 'exec python' but you will have one > > per user for simultanous incomming mail.. For example filtering done > > thought maildrop can get (by default) 100 simultanus filter.. so do you > > really think that 100 * exec python is the same as 100 * spamc ??? (cause > > spamc eat ~500 Kb and python ~ 4.5 Mb ) > > Yes, using python for sb_bnfilter is a short-term measure. Its a prototype. > C version is in progress Please look at spamc code, because it try to cover a large amount of issue (by alowing the change of username for example .. which is really usefull in my approach, or round-robin filtering too .. ) > > my code try to face this problems: > > - spawning a python at each incomming mail (spamc) > > - having one deamon (or more) per user . > > I agree the first of those is a problem, and needs to be fixed in sb_bn*. > (reusing the lightweight s.a. code here is a good trick btw.) Fine :) > I'm unconvinced so far that the overhead of having one deamon per user is a > bigger problem than having spambayes run in a shared deamon with higher > priveliges than a normal user. I'm quite agree w/ you but keep in mind my example, my approach isn't so much different. I run only one process for all user. and i think i can do that w/ sb_bn* but the sb_bn* doesn't support the 'user' setting. I really think that allowing users (in a large system) to spwan process is a bad idea. 
not for a workstation of course but of a filtering server with 1000 mail accounts. My approach is exactly the same as used by things like virus scanner .. Do you think hosting company will allow people to install there own virus scrambler on there account ? They don't .. why : Just because this will be to painfull to administer and spawn a bunch of process for nothing. My approach is the same. I really like spambayes, but i can't use on large system not because it isn't stable enought or doesn't feet w/ the goal but simply because it doesn't provide a nice way to administer a large number of accounts. Many thanks Toby .. this is kool to heard something different than look at sb_bn*. Bye Bye .. From ta-meyer at ihug.co.nz Sun Apr 25 19:50:31 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Sun Apr 25 19:50:38 2004 Subject: [spambayes-dev] Testing Tools Changes Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CD1@its-xchg4.massey.ac.nz> I recently created new testing corpora for myself and ran various tests. As part of this, I made various changes to the testing scripts to make things easier. I'd like to know if anyone thinks any of these are worth checking in: export.py (in the Outlook2000 directory): I added a command-line option to skip printing the total number of messages that would be exported. I didn't really care what this number was, and generating it took a long time. PRO: This number doesn't seem all that useful. CON: This complicates a fairly simple script with another option. export.py: I added a command-line option to only export messages that were received via a certain account. I wanted an automatic method of separating out messages from a couple of accounts, and this seemed the easiest way. It compares the "Delivered" or "Envelope" header to the given regex and only exports if it matches. 
In addition, if the account is "Exchange", then it only exports if it appears to be an Exchange message (missing those headers; has the "X-Exchange-Junk" stuff. PRO: This is a handy way to only get certain messages out of Outlook. CON: This complicates the script a fair bit, and I haven't done any checking to see how robust the Delivered/Envelope headers are (all I know is that all my non-Exchange messages have one or the other of these). msgstore.py (in the Outlook2000 directory): When creating the 'faked up' Exchange headers, I added a "X-Exchange-Delivery-Time" header, which the data from that Outlook property. Without this, a lot of the exported messages couldn't be sorted by the incremental testing stuff, so ended up at the end, which isn't really accurate. sort+group.py: If it can't find any received headers, check for a sort+"X-Exchange-Delivery-Time" header, and use that instead. PRO: This is a very simple change, and doesn't have any effect on classification, and improves the accuracy of incremental testing. CON: This gets added every time that we add fake headers for an Exchange message, and there is presumably a (very small, I think) cost involved with that - this includes day-to-day use of the plug-in, when this has no effect at all. mksets.py: added -H and -S command-line options to specify an alternative pair of directories to create the sets in, rather than being fixed to "Data/Ham" and "Data/Spam". PRO: This is more like the other scripts. CON: ? incremental.py: at the moment, it uses *all* mail in Data/ - I changed it to use the TestDriver hamdir/spamdir options only (so that you can have multiple corpora in the Data/ directory, but test only some of it). PRO: Makes the incremental testing more like the timcv stuff which more people are familiar with. Also easier to use, IMO. CON: Changes the way the script works, so could break existing testing setups. 
fpfn.py: added a command-line flag to also print out unsures (IIRC this script predates unsures) as well as fp and fn. PRO: Especially when one reaches the Peters barrier and has very few fp or fn, looking at the unsures is interesting. CON: Complicates a very simple script (there are no command-line options at the moment) and don't fit the name (but having a 'fpfnunsure.py' script that does this seems pointless). I also changed fpfn.py to print out each message and offer to move it to the corresponding ham/spam set (I used it to check for misclassified messages), but it doesn't seem like this is a good addition to the script. I also wrote a few scripts to process the incremental.py output, using both mkgraph.py and Excel (via COM), so that I ended up with reasonably useful spreadsheets. If anyone is interested in these, let me know and I'll put them somewhere (I don't think there's any point checking them in, though). =Tony Meyer =Tony Meyer From mhammond at skippinet.com.au Sun Apr 25 22:31:50 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sun Apr 25 22:32:12 2004 Subject: [spambayes-dev] Testing Tools Changes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CD1@its-xchg4.massey.ac.nz> Message-ID: <00e501c42b36$a1baaa90$0200a8c0@eden> > export.py (in the Outlook2000 directory): I added a > command-line option to > skip printing the total number of messages that would be > exported. I didn't > really care what this number was, and generating it took a long time. > PRO: This number doesn't seem all that useful. > CON: This complicates a fairly simple script with another option. > > export.py: I added a command-line option to only export > messages that were > received via a certain account. I wanted an automatic method > of separating > out messages from a couple of accounts, and this seemed the > easiest way. It > compares the "Delivered" or "Envelope" header to the given > regex and only > exports if it matches. 
In addition, if the account is > "Exchange", then it > only exports if it appears to be an Exchange message (missing > those headers; > has the "X-Exchange-Junk" stuff. > PRO: This is a handy way to only get certain messages out of Outlook. > CON: This complicates the script a fair bit, and I haven't > done any checking > to see how robust the Delivered/Envelope headers are (all I > know is that all > my non-Exchange messages have one or the other of these). The last change sounds a little nasty, but in general these are tools for us to use to try and perform decent testing for Outlook users. AFAIK, this has never happened :) Thus, anything that may move us in that direction is encouraged! (You have not referenced a msgstore.py change above though, have you?) > msgstore.py (in the Outlook2000 directory): When creating the > 'faked up' > Exchange headers, I added a "X-Exchange-Delivery-Time" > header, which the Can you explain that one a little more? Would it be possible/better to generate the correct Date header? (I assume you are saying these messages don't have this header) > mksets.py: added -H and -S command-line options to specify an > alternative > pair of directories to create the sets in, rather than being fixed to > "Data/Ham" and "Data/Spam". > PRO: This is more like the other scripts. > CON: ? Sounds OK to me. No real opinion on the others. Mark. From ta-meyer at ihug.co.nz Sun Apr 25 22:52:02 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Sun Apr 25 22:52:10 2004 Subject: [spambayes-dev] Testing Tools Changes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E437C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BC3@its-xchg4.massey.ac.nz> > > export.py: I added a command-line option to only export > > messages that were > > received via a certain account. [...] > The last change sounds a little nasty, but in general these > are tools for us to use to try and perform decent testing for > Outlook users. 
AFAIK, this has never happened :) Thus, > anything that may move us in that direction is > encouraged! :) > (You have not referenced a msgstore.py change above though, > have you?) No. For my mail, either it had a "Delivered-To:" header, an "Envelope-To:" header (with the value being a unique identifier (like 'ta-meyer@pop.ihug.co.nz') for the account), or it was a message that never left the local Exchange server. I don't know how true this holds across other mail servers, though (this is with three different mail servers). So I could simply check for those headers, or for the "X-Exchange-Message" header. I agree that this sounds a little nasty (and you should see the code ). I lean towards not checking it in (anyone interested can hopefully find this thread in the archives anyway), since I'm not all that sure how common (given the hypothetical decent testing) needing this (separation by mail account) would be. > > msgstore.py (in the Outlook2000 directory): When creating the > > 'faked up' > > Exchange headers, I added a "X-Exchange-Delivery-Time" > > header, which the > > Can you explain that one a little more? Would it be > possible/better to generate the correct Date header? > (I assume you are saying these messages don't have this header) Yes, I am saying that (or should have ;). For example, I get headers like this (these are all the headers for this particular message - it had no subject): """ X-Exchange-Message: true To: Meyer, Tony X-Exchange-Delivery-Time: Fri, 23 Apr 2004 15:56:50 +1200 """ (This obviously includes the one I added). It probably would be better to generate the Date header instead (or maybe a Received header?) - I was too lazy to look up the spec for what one of those would use, so added my own. A proper Received or Date header would allow any tokenizing options that work with those headers to use the data, which would be a more beneficial (assuming those options help!) change. I could work up a patch that does this instead, perhaps. 
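
A minimal sketch of what generating a real Date header from the delivery time could look like. It assumes the delivery time is already available as a POSIX timestamp (the value from the Outlook PR_MESSAGE_DELIVERY_TIME property may need converting first), and the helper name fake_date_header is hypothetical, not something that exists in msgstore.py. It uses email.utils.formatdate from the current standard library.

```python
from email.utils import formatdate


def fake_date_header(delivery_timestamp):
    """Return an RFC 2822 style Date header line for a POSIX timestamp.

    Hypothetical helper for illustration only; not part of msgstore.py.
    """
    # usegmt=True emits a fixed "GMT" zone, so the result does not depend
    # on the local timezone of the machine doing the export.
    return "Date: " + formatdate(delivery_timestamp, usegmt=True)


if __name__ == "__main__":
    # Any POSIX timestamp works here; this one is just an example value.
    print(fake_date_header(1082692610))
```

A header built this way would be visible to any tokenizer option that already looks at Date, which is the benefit discussed above; whether Date or Received is the better choice is still open.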
In terms of code, in _GetFakeHeaders, I also retrieved the
PR_MESSAGE_DELIVERY_TIME property, added the appropriate 'delivery_time =
self._GetPotentiallyLarge...' bit and then did:

"""
from time import timezone
from email.Utils import formatdate
headers.append("X-Exchange-Delivery-Time: "+\
               formatdate(int(delivery_time)-timezone, True))
"""

I formatted the date so that it matched the one that a Received header has,
because this made the change to sort+group.py simpler than leaving it as
Outlook delivered it.

=Tony Meyer

From tdickenson at geminidataloggers.com Mon Apr 26 03:27:32 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Mon Apr 26 03:27:38 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404252123.48303.jkx@pythonfr.org>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404251821.33037.tdickenson@geminidataloggers.com>
	<200404252123.48303.jkx@pythonfr.org>
Message-ID: <200404260827.32385.tdickenson@geminidataloggers.com>

On Sunday 25 April 2004 20:23, Jkx@Pythonfr wrote:
> On Sunday 25 April 2004 19:21, Toby Dickenson wrote:
> > On Sunday 25 April 2004 13:50, Jkx@Pythonfr wrote:

> - And you can easily imagine to run this as another user, and tweak
>   the self.dbname according to your needs.. (for example put all
>   the db in a unique folder .. which only one account can access.
>   this is a common way to do stuff for large system)

That makes sense.

> > Yes, using python for sb_bnfilter is a short-term measure. Its a
> > prototype. C version is in progress
>
> Please look at spamc code

Thanks for the tip. (I'm sure it won't be usable as-is; the auto-forking in
sb_bnfilter is useful on small systems where you don't want to run any daemon
most of the time.)

> I really think that allowing users (in a large system) to spwan process
> is a bad idea. not for a workstation of course but of a filtering server
> with 1000 mail accounts.
My approach is exactly the same as used by things > like virus scanner .. Do you think hosting company will allow people to > install there own virus scrambler on there account ? They don't .. why : > Just because this will be to painfull to administer and spawn a bunch of > process for nothing. My approach is the same. So then you have 1000 different spambayes databases? My .hammiedb is 20M, so your big mail server needs 20G of storage for spam databases. This is sure to affect delivery performance since there is no way to cache all of that. > and i think i > can do that w/ sb_bn* but the sb_bn* doesn't support the 'user' setting. You can specify database filename and socket names on the sb_bnfilter command line. It doesnt support 'users' directly, but it provides all you need to layer a 'users' system on top. -- Toby Dickenson From jkx at pythonfr.org Mon Apr 26 03:51:18 2004 From: jkx at pythonfr.org (Jkx@Pythonfr) Date: Mon Apr 26 03:53:58 2004 Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin In-Reply-To: <200404260827.32385.tdickenson@geminidataloggers.com> References: <200404242300.56525.jkx@pythonfr.org> <200404251821.33037.tdickenson@geminidataloggers.com> <200404252123.48303.jkx@pythonfr.org> <200404260827.32385.tdickenson@geminidataloggers.com> Message-ID: <20040426075118.GB2996@tp1.enstb.org> On Mon, Apr 26, 2004 at 08:27:32AM +0100, Toby Dickenson wrote: > > I really think that allowing users (in a large system) to spwan process > > is a bad idea. not for a workstation of course but of a filtering server > > with 1000 mail accounts. My approach is exactly the same as used by things > > like virus scanner .. Do you think hosting company will allow people to > > install there own virus scrambler on there account ? They don't .. why : > > Just because this will be to painfull to administer and spawn a bunch of > > process for nothing. My approach is the same. > > So then you have 1000 different spambayes databases? 
My .hammiedb is 20M, so
> your big mail server needs 20G of storage for spam databases. This is sure to
> affect delivery performance since there is no way to cache all of that.

Yes, there is no way to cache, but I think the system will be trained on a
small amount of spam, so I hope the database won't be too big. My current SB
db is around 1.2 MB or so.

But you really pointed me to something I missed. Is there any way to
'resize' or 'consolidate' the db?

> > and i think i
> > can do that w/ sb_bn* but the sb_bn* doesn't support the 'user' setting.
>
> You can specify database filename and socket names on the sb_bnfilter command
> line. It doesnt support 'users' directly, but it provides all you need to
> layer a 'users' system on top.

That's true .. but plugging this into a postfix delivery would be so simple ..

-- 
Jérôme Kerdreux / Labo MI ENST Brest

From skip at pobox.com Mon Apr 26 08:13:58 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Apr 26 08:14:17 2004
Subject: [spambayes-dev] SpamBayes server compliant w/ spamassassin
In-Reply-To: <200404260827.32385.tdickenson@geminidataloggers.com>
References: <200404242300.56525.jkx@pythonfr.org>
	<200404251821.33037.tdickenson@geminidataloggers.com>
	<200404252123.48303.jkx@pythonfr.org>
	<200404260827.32385.tdickenson@geminidataloggers.com>
Message-ID: <16524.64774.732715.536641@montanaro.dyndns.org>

    Toby> So then you have 1000 different spambayes databases? My .hammiedb
    Toby> is 20M, so your big mail server needs 20G of storage for spam
    Toby> databases. This is sure to affect delivery performance since there
    Toby> is no way to cache all of that.

The number varies. My database is about 2.5MB and does just fine. (See my
recent mail about using train-to-exhaustion and training backwards in the
file.) That gets you down around 2.5GB, which is largely cacheable.
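
The back-of-the-envelope numbers in this thread can be checked with a few lines of Python. This is only a sketch: the 1000-account figure and the per-user database sizes (20 MB, 2.5 MB, 1.2 MB) are the ones quoted in the thread, decimal units are assumed (1 GB = 1000 MB), and the function name is made up for illustration.

```python
def total_storage_gb(accounts, db_size_mb):
    """Total storage in GB for one database per account (1 GB = 1000 MB)."""
    return accounts * db_size_mb / 1000.0


if __name__ == "__main__":
    # Per-user database sizes mentioned in the thread, in MB.
    for label, size_mb in [("Toby", 20), ("Skip", 2.5), ("Jkx", 1.2)]:
        print("%-4s %5.1f MB/user -> %5.1f GB for 1000 accounts"
              % (label, size_mb, total_storage_gb(1000, size_mb)))
```

The point is only that total storage scales linearly with per-user database size, so the training regime matters as much as the number of accounts.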
Skip

From papaDoc at videotron.ca Mon Apr 26 21:31:39 2004
From: papaDoc at videotron.ca (Remi Ricard)
Date: Mon Apr 26 21:31:40 2004
Subject: [spambayes-dev] Openning a db
Message-ID: <1083029498.3822.15.camel@porsche>

Hi,

I'm having a problem:

When I create a db from scratch using the following command

sb_mboxtrain -d ./hammie.db -g ham.mbox -s spam.mbox

the db is created with dbm_type="best" in dbmstorage.py. This calls the
function dbmstorage.py: open_db3hash

But when I try to train again with the same command line (I know this does
nothing to the database, but continue reading)

sb_mboxtrain -d ./hammie.db -g ham.mbox -s spam.mbox

then I get the error message:

-------------------
  File "/gmc/logiciels/spambayes/spambayes/spambayes/hammie.py", line 266, in open
    return Hammie(storage.open_storage(filename, useDB, mode))
  File "/gmc/logiciels/spambayes/spambayes/spambayes/storage.py", line 680, in open_storage
    return klass(data_source_name, mode)
  File "/gmc/logiciels/spambayes/spambayes/spambayes/storage.py", line 164, in __init__
    self.load()
  File "/gmc/logiciels/spambayes/spambayes/spambayes/storage.py", line 189, in load
    self.dbm = dbmstorage.open(self.db_name, self.mode)
  File "/gmc/logiciels/spambayes/spambayes/spambayes/dbmstorage.py", line 75, in open
    return f(db_name, mode)
  File "/gmc/logiciels/spambayes/spambayes/spambayes/dbmstorage.py", line 22, in open_dbhash
    return bsddb.hashopen(*args)
  File "/usr/local/lib/python2.3/bsddb/__init__.py", line 193, in hashopen
    d.open(file, db.DB_HASH, flags, mode)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- ./hammid.db: unsupported hash version: 8')
------------------------

since the db is opened with dbhash by calling the function
dbmstorage.py: open_dbhash

To solve my problem I'm forcing dbm_type to be what I want, but I don't
think this can be a fix ;-). So this is all yours to solve....

I'm running on RedHat 9 with python2.3 compiled from source.
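
One way to narrow this kind of problem down is to ask Python which dbm flavour it thinks a file is before opening it. The sketch below uses dbm.whichdb from the current standard library (in Python 2.3 the equivalent lived in the separate whichdb module); the describe_db helper and the demo file are made up for illustration and are not part of SpamBayes. It reports only the database family, so it may well not distinguish two Berkeley DB hash versions, which appears to be the mismatch in the traceback above.

```python
import dbm
import dbm.dumb
import os
import tempfile


def describe_db(path):
    """Return a short description of what dbm.whichdb reports for path."""
    kind = dbm.whichdb(path)
    if kind is None:
        return "missing or unreadable"
    if kind == "":
        return "exists, but format not recognised"
    return kind


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo")
        # dbm.dumb is pure Python, so this demo does not need Berkeley DB.
        db = dbm.dumb.open(path, "c")
        db[b"key"] = b"value"
        db.close()
        print(describe_db(path))
        print(describe_db(os.path.join(tmp, "nothing-here")))
```

This is a diagnostic aid rather than a fix: it shows what the detection step sees, which helps tell a wrong-opener problem from a corrupt or missing file.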
the whichdb is really the one from the python2.3 (this was found with echo test | strace -tt -f -o trace.txt python sb_mboxtrain.py -d hammid.db -g ham.mbox -s spam.mbox) Remi -- Remi Ricard From papaDoc at videotron.ca Tue Apr 27 08:41:03 2004 From: papaDoc at videotron.ca (papaDoc) Date: Tue Apr 27 08:41:08 2004 Subject: [spambayes-dev] Re: [Spambayes] Spam bayes in French ? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CEC@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677CEC@its-xchg4.massey.ac.nz> Message-ID: <408E54DF.7060409@videotron.ca> Hi, >Someone asked about this a wee while ago (on spambayes-dev, maybe?), but I >don't know if anything was done or not. I'm happy to make time to make any >code changes necessary to make the task easier, but I can't really offer >much in the way of translation itself (I suppose I could do a Maori >translation at a push, but I imagine the potential users of such a >translation number approximately 0). Based on list traffic, a French >version should be easily do-able. > >I'm guessing that someone here must have some (even non-Python) experience >at doing this, yes? Any volunteers to coordinate the effort? > > I volunteer to do the French translation but I'm almost a non-Python so the code should be set up if possible. Remi -- /"\ \ / X ASCII Ribbon Campaign / \ Against HTML Email From tim.one at comcast.net Tue Apr 27 10:54:33 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Apr 27 10:54:47 2004 Subject: [spambayes-dev] Local boy makes good Message-ID: Congratulations! Inboxer (Sean True's SpamBayes derivative) is the "Microsoft & WUGNET Shareware Pick of the Week": http://www.wugnet.com/shareware/spow.asp?ID=551 This is a big deal, since it gets announced in approximately 38 billion copies of MS's "Windows Platform News" newsletter (that's how I found out about it). 
From seant at iname.com Tue Apr 27 11:41:39 2004 From: seant at iname.com (Sean True) Date: Tue Apr 27 11:43:33 2004 Subject: [spambayes-dev] RE: [Spambayes] Local boy makes good In-Reply-To: Message-ID: > > Congratulations! Inboxer (Sean True's SpamBayes derivative) is the > "Microsoft & WUGNET Shareware Pick of the Week": > > http://www.wugnet.com/shareware/spow.asp?ID=551 > > This is a big deal, since it gets announced in approximately > 38 billion > copies of MS's "Windows Platform News" newsletter (that's how > I found out > about it). > Impossible (obviously) without the spambayes community. Thanks to _all_ of you. -- Sean From valk.beekman at xs4all.nl Tue Apr 27 18:14:53 2004 From: valk.beekman at xs4all.nl (Valk Beekman) Date: Tue Apr 27 18:15:03 2004 Subject: [spambayes-dev] wish from new user Message-ID: <008201c42ca5$119f6020$6501a8c0@nl> As a new user I would like to be able to set the word Spambayes uses to mark spam myself (or it should default to something like "*spam*" . The way it works now I would have all correspondence with "spam" in the subjectline discarded by OE. Sometimes people I know send me questions about spam. Regards & s6 with your project, Valk Beekman -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040428/c96202ae/attachment.html From tameyer at ihug.co.nz Tue Apr 27 18:19:50 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Apr 27 18:20:03 2004 Subject: [spambayes-dev] wish from new user In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E4A1D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CF6@its-xchg4.massey.ac.nz> > As a new user I would like to be able to set the > word Spambayes uses to mark spam myself (or it > should default to something like "*spam*" . The > way it works now I would have all correspondence > with "spam" in the subjectline discarded by OE. 
> Sometimes people I know send me questions about spam. Two points: 1. The tag is actually "spam,", so you are safe as long as people don't put a comma right after the word 'spam'. 2. You can change this, you just have to manually edit your configuration file. The FAQ has lots of details about changing the options - you're after the Headers section, and the header_spam_string option. =Tony Meyer --- Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy. From sourceforge at metrak.com Tue Apr 27 19:27:39 2004 From: sourceforge at metrak.com (paul sorenson) Date: Tue Apr 27 19:27:46 2004 Subject: [spambayes-dev] no messages to review In-Reply-To: References: Message-ID: <408EEC6B.20007@metrak.com> I am running spampayes proxy with mozilla thunderbird on Win XP. I just installed 1.0b1 in the last couple of days. When I attempt to review messages I see the message: "There are no untrained messages to display". This is despite receiving dozens of email each day. This has been happening for some time (before this install). Then every now and then it seems to recognize a whole lot of messages. Clicking "previous day" followed by "next day" doesn't seem to get me back to where I started. For sanity's sake I just checked the thunderbird was point to the proxy, not my mail server. cheers From mhammond at skippinet.com.au Tue Apr 27 19:40:43 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Apr 27 19:42:12 2004 Subject: [spambayes-dev] Release 1.0? Message-ID: <126401c42cb1$0ed80c50$0200a8c0@eden> The last release seems to have gone OK, with only a couple of packaging issues. What say Tony and I just turn the crank, we call it 1.0, have a beer and little party, and move on? Mark. 
From tim.one at comcast.net Tue Apr 27 21:18:16 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Apr 27 21:18:25 2004 Subject: [spambayes-dev] Release 1.0? In-Reply-To: <126401c42cb1$0ed80c50$0200a8c0@eden> Message-ID: [Mark Hammond] > The last release seems to have gone OK, with only a couple of packaging > issues. What say Tony and I just turn the crank, we call it 1.0, have a > beer and little party, and move on? I wish we had a better database story -- but apparently not enough to give up enough sleep to get us one. Other than that, the only killer flaw I notice ten times a day (in the Outlook addin) is that in the "Filter messages ..." dialog, "Start Filtering" should be the DEFPUSHBUTTON instead of "Close". I've got "Automatically move pointer to the default button in a dialog box" enabled on my laptop (I hate using touchpads!), and so my mouse pointer always flies to the wrong button when I open that dialog. "Start Training" would be a more useful default button on the Training tab too. In short, if I'm reduced to whining about petty crap like that, we're overdue for a 1.0 release . Fantastic work, everyone! From skip at pobox.com Tue Apr 27 21:22:23 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Apr 27 21:22:34 2004 Subject: [spambayes-dev] Release 1.0? In-Reply-To: <126401c42cb1$0ed80c50$0200a8c0@eden> References: <126401c42cb1$0ed80c50$0200a8c0@eden> Message-ID: <16527.1871.288554.714725@montanaro.dyndns.org> Mark> What say Tony and I just turn the crank, we call it 1.0, have a Mark> beer and little party, and move on? I'm hoisting a stein already. ;-) Skip From tameyer at ihug.co.nz Tue Apr 27 22:31:26 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Apr 27 22:31:41 2004 Subject: [spambayes-dev] Release 1.0? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E4A51@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677CF9@its-xchg4.massey.ac.nz> > The last release seems to have gone OK, with only a couple of > packaging issues. What say Tony and I just turn the crank, > we call it 1.0, have a beer and little party, and move on? If you give me to the weekend, then I'm fine with this. I'd like to incorporate a few imapfilter bug fixes that have been worked out over the last week or so, and have another run through the open bug list to see if there's anything there that can/should be resolved. I'm flat out at work until then, though. (And a beer and party suit the weekend more anyway ;) For the 1.1a1 release, I'd really like to: * Finish up the 'auto configure' stuff for sb_server. Basically create a wizard like the Outlook one that can setup SpamBayes, your mail client, and do some initial training. (With a limited list of clients (OE, Eudora, Mozilla, Opera at the moment) - for the rest, you're on your own). * Have an imapfilter binary in the binary dist. It's getting used by a few more people now, so it would seem a nice option. Maybe that'll convince someone else to take over maintaining it, too . * Finish off the pop3dnd stuff that I started some time back. This is mostly working, and I still like the concept (training by drag-and-drop in arbitrary mail clients), and it'd be nice to offer as an experimental option. * Wack off a few deprecated/experimental options. 1.0 should in be use for a while, so they get their chance :) Plus looking at the database issues, as always, and training techniques (particularly figuring out a way to offer tte)... =Tony meyer From mhammond at skippinet.com.au Tue Apr 27 23:54:36 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Apr 27 23:54:55 2004 Subject: [spambayes-dev] Release 1.0? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CF9@its-xchg4.massey.ac.nz> Message-ID: <13fb01c42cd4$867749b0$0200a8c0@eden> > If you give me to the weekend, then I'm fine with this. I'd like to > incorporate a few imapfilter bug fixes that have been worked > out over the > last week or so, and have another run through the open bug > list to see if > there's anything there that can/should be resolved. I'm flat > out at work > until then, though. (And a beer and party suit the weekend > more anyway ;) That sounds good to me! Especially the party and beer bit :) If we restrict this to low-risk new bugs, we can still go for 1.0 as the next release. Mark. From mhammond at skippinet.com.au Wed Apr 28 00:28:26 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Apr 28 00:28:50 2004 Subject: [spambayes-dev] Release 1.0? In-Reply-To: <20040428011829.E8EC3A6395@dampier.southern.net.au> Message-ID: <144501c42cd9$4203b0c0$0200a8c0@eden> > Other than that, the only killer flaw I notice ten times a day (in the > Outlook addin) is that in the "Filter messages ..." dialog, "Start > Filtering" should be the DEFPUSHBUTTON instead of "Close". I've got > "Automatically move pointer to the default button in a dialog > box" enabled > on my laptop (I hate using touchpads!), and so my mouse > pointer always flies > to the wrong button when I open that dialog. Fixed! > "Start > Training" would be a > more useful default button on the Training tab too. That one is harder, as the existing default button (Close) is not on the property-page, but the parent. Setting "Start" to DEFPUSHBUTTON gets it drawn like it is the default, but "Close" still does too and seems to win :) > Fantastic work, everyone! Absolutely! Not one of us here could have done anything with the others. Congratulations, and thank you! Mark. 
From combover at mn.rr.com Wed Apr 28 03:03:57 2004 From: combover at mn.rr.com (combover) Date: Wed Apr 28 03:03:40 2004 Subject: [spambayes-dev] Possible new header parsing option... In-Reply-To: <144501c42cd9$4203b0c0$0200a8c0@eden> References: <144501c42cd9$4203b0c0$0200a8c0@eden> Message-ID: <408F575D.5050103@mn.rr.com> Was looking over SPF (http://spf.pobox.com) last weekend, and it looks very promising - already a handful of major domains have implemented it. Of course, the headers that will be associated with SPF's checks: http://spf.pobox.com/newheader.html will not be widely used until the major MTAs provide that option, but it seems to me that they could prove to be valuable tokens at the very least, and there might be a possibility of creating a SpamBayes plugin script to do the checking at the client level. Then again, my understanding of how MTAs work and where exactly SPF checks would need to occur is not the best. Again, this isn't going to be the most useful until the majority of domains have published records, but would be beneficial once that point is reached. My one concern with the specification itself, though, is: what's to stop spammers from forging these headers themselves? Is there a mechanism in the existing MTA plugins to discard any SPF headers already in place in a received mail? I know this is probably not the best place for those concerns, so maybe i'll subscribe to their dev list... From rmalayter at bai.org Wed Apr 28 08:28:06 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Wed Apr 28 08:28:13 2004 Subject: [spambayes-dev] Possible new header parsing option... Message-ID: <792DE28E91F6EA42B4663AE761C41C2A021C96AC@cliff.bai.org> [combover] >My one concern with the specification itself, though, is: >what's to stop spammers from forging these headers >themselves? Nothing, as you've guessed correctly. >Is there a mechanism in the existing MTA plugins to discard >any SPF headers already in place in a received mail? 
I know
>this is probably not the best place for those concerns, so
>maybe i'll subscribe to their dev list...

That would be the correct approach. If a receiving MTA checks for SPF
compliance, it should throw out all other SPF-related headers before adding
its own. Assuming the MTAs do this correctly, and SPF use becomes widespread
(my domain is one of only 7500 or so registered), these headers will be very
useful clues to spambayes.

However, with Microsoft supporting Caller-ID for Email, and Yahoo! supporting
Domain Keys, SPF may not be the ultimate winner as a sending-host
verification standard. I'm placing my bets on a unified standard emerging
sometime in the next few years. Spam costs Yahoo! and MS so much money they
cannot afford to bicker about this issue too long.

From kennypitt at hotmail.com Wed Apr 28 09:21:30 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Apr 28 09:22:59 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <144501c42cd9$4203b0c0$0200a8c0@eden>
Message-ID:

Mark Hammond wrote:
>> Other than that, the only killer flaw I notice ten times a day (in
>> the Outlook addin) is that in the "Filter messages ..." dialog,
>> "Start Filtering" should be the DEFPUSHBUTTON instead of "Close".
>> I've got "Automatically move pointer to the default button in a
>> dialog box" enabled on my laptop (I hate using touchpads!), and so
>> my mouse pointer always flies to the wrong button when I open that
>> dialog.
>
> Fixed!

Did this get checked in? I didn't see any notice on spambayes-checkins and
cvs update didn't apply any changes.

-- 
Kenny Pitt

From sourceforge at metrak.com Wed Apr 28 18:54:28 2004
From: sourceforge at metrak.com (paul sorenson)
Date: Wed Apr 28 18:54:33 2004
Subject: [spambayes-dev] Re: no messages to review
In-Reply-To:
References:
Message-ID: <40903624.4050304@metrak.com>

Today the proxy decided there were messages to review.
I have buttons "previous day" "refresh" "next day" and ended up with 4 screenfuls of messages for training. If spambayes reports no messages to train but I have been receiving messages, is there a simple way to check what criterion it is using? > ------------------------------ > > Message: 3 > Date: Wed, 28 Apr 2004 09:27:39 +1000 > From: paul sorenson > Subject: [spambayes-dev] no messages to review > To: spambayes-dev@python.org > Message-ID: <408EEC6B.20007@metrak.com> > Content-Type: text/plain; charset=us-ascii; format=flowed > > I am running spambayes proxy with mozilla thunderbird on Win XP. I just > installed 1.0b1 in the last couple of days. > > When I attempt to review messages I see the message: "There are no > untrained messages to display". This is despite receiving dozens of > email each day. > > This has been happening for some time (before this install). Then every > now and then it seems to recognize a whole lot of messages. Clicking > "previous day" followed by "next day" doesn't seem to get me back to > where I started. From ta-meyer at ihug.co.nz Wed Apr 28 21:53:15 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Apr 28 21:53:26 2004 Subject: [spambayes-dev] Testing Tools Changes Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BD1@its-xchg4.massey.ac.nz> [me] > I also changed fpfn.py to print out each message and > offer to move it to the corresponding ham/spam set (I used it > to check for misclassified messages), > but it doesn't seem like this is a good addition to the script. Browsing, I notice that this has been offered before (which would have saved me the bother): [ 618932 ] fpfn.py: add interactivity on unix I don't know if this makes it any more/less worthwhile including, though. 
=Tony Meyer From clare at optushome.com.au Thu Apr 29 07:54:05 2004 From: clare at optushome.com.au (Clare Wagemans) Date: Thu Apr 29 07:52:41 2004 Subject: [spambayes-dev] Spam Bayes falls over regularly Message-ID: Dear Sir Every now and then, not only after updating, I get the message that SpamBayes is not working. The box "Definite Spam" just disappears and I have to create a new one, enable SpamBayes and then retrain. I think it has happened about 5 times in 6 months. regards Clare Wagemans From tim.one at comcast.net Thu Apr 29 15:21:35 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Apr 29 15:21:35 2004 Subject: [spambayes-dev] RE: [Python-Dev] SSH problems getting into SourceForge's CVS? In-Reply-To: <200404291420.i3TEKGn05101@guido.python.org> Message-ID: If you're getting messages like this today when trying to cvs up: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: POSSIBLE DNS SPOOFING DETECTED! @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ The RSA1 host key for cvs.spambayes.sourceforge.net has changed, and the key for the according IP address 66.35.250.209 is unknown. This could either mean that DNS SPOOFING is happening or the IP address for the host and its host key have changed at the same time. @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! [yadda yadda yadda] it's apparently because SourceForge changed the way they like their CVS servers to get addressed. 
Here's a link to a python-dev article with a Python script to crawl over your checkout tree and change the hidden CVS cruft so that SF stops whining at you: http://mail.python.org/pipermail/python-dev/2004-April/044593.html For example, in the root of my spambayes checkout, I ran cvs_chroot.py :ext:tim_one@cvs.sourceforge.net:/cvsroot/spambayes and then SF stopped complaining. The hostname part of my URLs used to be "cvs.spambayes.sourceforge.net", and SF doesn't want the "spambayes." part there anymore. Since this string gets buried in CVS admin files in each of your subtrees too, you really don't want to hunt them down and fiddle them all by hand. From G.Hartmann at kamax.de Fri Apr 30 04:30:22 2004 From: G.Hartmann at kamax.de (Hartmann, Gunther) Date: Fri Apr 30 04:28:27 2004 Subject: [spambayes-dev] Access rights for spambayes outlook plugin Message-ID: <54930B904AEB7D468F603F92D518796901C66CE0@kxw2kho9.kamax.de> Dear All, I have searched the FAQ for my problem but couldn't find any hint, so I am asking here. I would like to have my Inbox scanned by a colleague's spambayes while I am out of office. We are running Outlook 2000 against an Exchange server and my colleague opens my inbox during startup of his Outlook. However, he can't select my inbox in the SpamBayes Manager's folder selection box. It is displayed and one can select the checkbox, but the message at the bottom doesn't reflect this additional selection and stays with '1 folder selected'. Clicking on OK doesn't select this additional inbox either. If I configure the spambayes .ini file by adding a second inbox identifier, it refuses to start spambayes and the log file reads 'access refused'. So my question is: what access rights do I need to grant my colleague, and on which objects? I tried the highest possible one (which is 8) on both my mailbox AND the inbox folder - but it didn't work. Any hints? Mit freundlichen Grüßen / Best Regards / Saludos Gunther Hartmann Dr. 
Gunther Hartmann Director R&D KAMAX Tel: +49 6633 79 162 Fax +49 6633 79 6162 mailto:g.hartmann@kamax.de http://www.kamax.com From agrabren at yahoo.com Fri Apr 30 14:05:18 2004 From: agrabren at yahoo.com (Kevin Bruckert) Date: Fri Apr 30 14:05:55 2004 Subject: [spambayes-dev] Microsoft Exchange Server integration / Web Interface Integration Message-ID: <20040430180518.93111.qmail@web41506.mail.yahoo.com> I've searched back a few months and found someone (Sean) discussing this in June of 2003, but no follow-up since and no useful links (in my opinion). So let me explain my interests, and also offer my assistance (I'm a seasoned programmer, although new to Python... But I learn fast). On a Microsoft Exchange 2003 server, we receive plenty of spam. I've tried various solutions, but they all fall short in the UI arena. What I want to do is the following: Each user has a separate database, although an initial global database never hurt anyone. From there, users can either install a client module into their Outlook, giving them an easy-to-use feedback mechanism. Or, for many of the users, integration into the Exchange Server web interface. The integration into the web interface should be smooth and easy-to-use as well, instead of having to run between multiple places to report spam. By running the filters on the server, mail is filtered on entry to the system, and allows quick access to import email while on-the-go. I'm willing to put in as much effort as I can to do such work, but might want a little help at various stages to understand the existing architecture and prevent re-working areas which are already written. Thanks, Kevin Bruckert __________________________________ Do you Yahoo!? Win a $20,000 Career Makeover at Yahoo! 
HotJobs http://hotjobs.sweepstakes.yahoo.com/careermakeover From rmalayter at bai.org Fri Apr 30 14:22:31 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Fri Apr 30 14:22:35 2004 Subject: [spambayes-dev] Microsoft Exchange Server integration / WebInterface Integration Message-ID: <792DE28E91F6EA42B4663AE761C41C2A02411854@cliff.bai.org> [Kevin Bruckert] > On a Microsoft Exchange 2003 server, we receive > plenty of spam. I've tried various solutions, but they > all fall short in the UI arena. What I want to do is > the following: Each user has a separate database, > although an initial global database never hurt anyone. > From there, users can either install a client module > into their Outlook, giving them an easy-to-use > feedback mechanism. Or, for many of the users, > integration into the Exchange Server web interface. > The integration into the web interface should be > smooth and easy-to-use as well, instead of having to > run between multiple places to report spam. > > By running the filters on the server, mail is filtered > on entry to the system, and allows quick access to > import email while on-the-go. > > I'm willing to put in as much effort as I can to do > such work, but might want a little help at various > stages to understand the existing architecture and > prevent re-working areas which are already written. The best server-side Exchange Server filter we evaluated in terms of UI was Sunbelt Software's iHateSpam Server Edition. We ended up buying it because it was the simplest to use and deploy, and works reasonably well, even though it's not a Bayesian filter. The only UI is a set of folders created in each user's inbox that contain filtered spam as well as a whitelist and blacklist. I suggest you check it out; they have a free demo. We get capture rates of 88% with the current version and our threshold set to 170. 
None of the other commercial or open-source Bayesian filters - which filter more accurately - came close to iHateSpam SE in terms of deployment ease and ease of use. Those two factors overrode all of our other criteria. That said, if you want to use Spambayes in a server-side scenario, you'll have to customize the code a lot to make it work. You might be better off trying to use something like DSPAM on a Linux box as a gateway in front of your Exchange server. It has per-user filtering. Another option that worked well for us for a while was ASSP, available at http://assp.sourceforge.net. However, it does not do per-user filtering. It has one DB for all users on a server. We ended up abandoning it because we couldn't get our test group to train it well, despite lots of instruction. The performance was very good when it was only the IT department using it, though ;-). Regards, Ryan