From skip at pobox.com Sun May 2 15:56:31 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun May 2 15:56:16 2004
Subject: [spambayes-dev] If msg.as_string() fails...
Message-ID: <16533.21103.56041.528817@montanaro.dyndns.org>

I encountered a couple spams today with sb_filter.py where msg.as_string() failed and the exception wasn't caught:

    Traceback (most recent call last):
      File "/usr/local/bin/sb_filter.py", line 257, in ?
        main()
      File "/usr/local/bin/sb_filter.py", line 249, in main
        action(msg)
      File "/usr/local/bin/sb_filter.py", line 181, in filter
        return self.h.filter(msg)
      File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 148, in filter
        return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
      File "/usr/local/lib/python2.2/email/Message.py", line 113, in as_string
        g.flatten(self, unixfrom=unixfrom)
      File "/usr/local/lib/python2.2/email/Generator.py", line 103, in flatten
        self._write(msg)
      File "/usr/local/lib/python2.2/email/Generator.py", line 131, in _write
        self._dispatch(msg)
      File "/usr/local/lib/python2.2/email/Generator.py", line 157, in _dispatch
        meth(msg)
      File "/usr/local/lib/python2.2/email/Generator.py", line 200, in _handle_text
        raise TypeError, 'string payload expected: %s' % type(payload)
    TypeError: string payload expected:

(Python 2.2.3, email package version 2.5.3).

I'd like to do something similar to what the POP3 proxy and IMAP filters do to graft in an X-Spambayes-Exception header, but at the point where this occurs all I have is an email.Message object, no raw message text as POP3 and IMAP programs have. Is there some way to unfailingly get the raw message text from an email.Message object? I didn't see an obvious way to do this without doing precisely what email.Generator does. Any clues?

Thx,

Skip

From tameyer at ihug.co.nz Sun May 2 17:45:58 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun May 2 17:46:09 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130626958F@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D23@its-xchg4.massey.ac.nz>

> I encountered a couple spams today with sb_filter.py where
> msg.as_string() failed and the exception wasn't caught:
[...]
> TypeError: string payload expected:

Not an answer to your question, but related: reports of this problem are becoming more common, both with sb_filter and sb_imapfilter (sb_server handles it because it has a raw "except:", but imapfilter has "except email.Errors.MessageParseError" which doesn't catch the TypeError - I'll fix this).

I think it would be worth working in a fix for this somehow, so that these messages are correctly (as much as possible) filtered. Does anyone else want to work on this?

(+1 to getting sb_filter to do the exception header thing, though).

=Tony Meyer

From skip at pobox.com Sun May 2 18:22:52 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun May 2 18:24:48 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D23@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130626958F@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677D23@its-xchg4.massey.ac.nz>
Message-ID: <16533.29884.911670.520614@montanaro.dyndns.org>

    Tony> I think it would be worth working in a fix for this somehow, so
    Tony> that these messages are correctly (as much as possible) filtered.
    Tony> Does anyone else want to work on this?

    Tony> (+1 to getting sb_filter to do the exception header thing,
    Tony> though).

I looked through the code that generates messages in sb_filter.py. In theory, down in mboxutils.get_message() we could attach the raw message text to the generated Message object if the input was a string:

    msg._raw = obj

That doesn't help if the object being parsed is already a Message object.
It also seems like a very bad hack just to make the raw text available in case message flattening fails at tail end.

Skip

From tameyer at ihug.co.nz Sun May 2 23:34:26 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun May 2 23:36:01 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13060E4AE8@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BDE@its-xchg4.massey.ac.nz>

> That sounds good to me! Especially the party and beer bit :)
> If we restrict this to low-risk new bugs, we can still go
> for 1.0 as the next release.

The only things I've checked in are low risk - there are various other things that would be nice to work on, but they're too risky to do right before 1.0, so can wait for 1.1a1.

I've updated the changelog and what_is_new files.

So what's the plan from here? I've done all the changes that I'll find time for - does anyone else have others? Maybe Skip would like to get the sb_filter exception stuff done?

Should we go ahead and create 1.0rc1's? Since this is a reasonably important release, I'd like to try the installers on a few machines, which I won't be able to do until tomorrow, but I can easily put the source rc's together.

=Tony Meyer

From tim.one at comcast.net Sun May 2 23:57:22 2004
From: tim.one at comcast.net (Tim Peters)
Date: Sun May 2 23:57:20 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <144501c42cd9$4203b0c0$0200a8c0@eden>
Message-ID:

[Tim Peters]
>> Other than that, the only killer flaw I notice ten times a day (in
>> the Outlook addin) is that in the "Filter messages ..." dialog,
>> "Start Filtering" should be the DEFPUSHBUTTON instead of "Close".
>> I've got "Automatically move pointer to the default button in a
>> dialog box" enabled on my laptop (I hate using touchpads!), and so
>> my mouse pointer always flies to the wrong button when I open that
>> dialog.

[Mark Hammond]
> Fixed!

And a joy it is!
Gotta love Windows: after bringing up the filtering dialog, "Start Filtering" is the default button now, so you can start filtering either by clicking the mouse or by hitting the ENTER key. If you start it by clicking the mouse, when it's done you can hit ESC to quit the dialog. But if you start it by hitting ENTER, when it's done hitting ESC "doesn't work" -- the dialog stays open. On my WinXP OL2003 it generates a "you can't do that" sound, and on my Win98SE OL2000 it sits in silence.

So, for my fellow touchpad-hating laptop users, left-click the mouse and punch ESC after. That worked on both boxes, BTW.

at-least-it's-kinda-consistent-ly y'rs - tim

From tameyer at ihug.co.nz Mon May 3 01:30:32 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon May 3 01:30:43 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306269645@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D36@its-xchg4.massey.ac.nz>

> I looked through the code that generates messages in
> sb_filter.py. In theory, down in mboxutils.get_message() we
> could attach the raw message text to the generated Message
> object if the input was a string:
>
> msg._raw = obj

Something like this would also work (in theory) for sb_server/sb_imapfilter if put in message.py.

> That doesn't help if the object being parsed is already a
> Message object. It also seems like a very bad hack just to
> make the raw text available in case message flattening fails
> at tail end.

Agreed. I'm not sure we can come up with a really nice general solution (other than Anthony's new, relaxed, parser, which I presume will end up in an email package one of these days).

For this particular problem (reports of which are increasing) what about something like this?

>>> try:
...     print msg.as_string()
... except TypeError:
...     parts = []
...     for part in msg.get_payload():
...         parts.append(part.as_string())
...     print "\n".join(parts)
...
Obviously the "print"s would be "return"s or whatever.

=Tony Meyer

From skip at pobox.com Mon May 3 09:13:25 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon May 3 09:13:19 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D36@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1306269645@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677D36@its-xchg4.massey.ac.nz>
Message-ID: <16534.17781.705841.701960@montanaro.dyndns.org>

    Tony> For this particular problem (reports of which are increasing) what
    Tony> about something like this?

    >>> try:
    ...     print msg.as_string()
    ... except TypeError:
    ...     parts = []
    ...     for part in msg.get_payload():
    ...         parts.append(part.as_string())
    ...     print "\n".join(parts)
    ...

Isn't the code in the except clause more-or-less what msg.as_string() itself does?

I'll give it some thought. Do we have a message we know makes this process barf? The few I encountered the other day didn't fail on my development machine which runs Python from CVS and version 2.5.5 of the email package. The machine on which it failed runs Python 2.2.3 and email 2.5.3. Maybe shipping 2.5.5 with Spambayes and installing it is a reasonable alternative.

Didn't we used to ship some version of the email package? I don't see it in my CVS sandbox now.

Skip

From barry at python.org Mon May 3 09:18:59 2004
From: barry at python.org (Barry Warsaw)
Date: Mon May 3 09:19:06 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D36@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D36@its-xchg4.massey.ac.nz>
Message-ID: <1083590338.1035.55.camel@anthem.wooz.org>

On Mon, 2004-05-03 at 01:30, Tony Meyer wrote:

> Agreed. I'm not sure we can come up with a really nice general solution (other
> than Anthony's new, relaxed, parser, which I presume will end up in an email
> package one of these days).

Ya, one of these days.
I think I currently have the most recent copy of it sitting in my inbox waiting for a few hours to clean it up and "just" fix some edge cases.

-Barry

From barry at python.org Mon May 3 09:21:57 2004
From: barry at python.org (Barry Warsaw)
Date: Mon May 3 09:22:02 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <16533.21103.56041.528817@montanaro.dyndns.org>
References: <16533.21103.56041.528817@montanaro.dyndns.org>
Message-ID: <1083590516.1035.59.camel@anthem.wooz.org>

On Sun, 2004-05-02 at 15:56, Skip Montanaro wrote:

> I'd like to do something similar to what the POP3 proxy and IMAP filters do
> to graft in an X-Spambayes-Exception header, but at the point where this
> occurs all I have is an email.Message object, no raw message text as POP3
> and IMAP programs have. Is there some way to unfailingly get the raw
> message text from an email.Message object? I didn't see an obvious way to
> do this without doing precisely what email.Generator does. Any clues?

No, but I've been thinking of having email 3.0 provide some kind of 'raw' flag that would capture the original source of the message on an attribute of the (outer) message object.

Note also that the intent is for the 3.0 parser to add a flag to the message object if it encounters breakage. That could be tokenized on and, to the extent that the breakage was unfixable, would be a flag to the generator that as_string() and friends would fail.

-Barry

From tdickenson at geminidataloggers.com Mon May 3 11:00:08 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Mon May 3 11:00:10 2004
Subject: [spambayes-dev] C implementation of sb_bnfilter
Message-ID: <200405031600.08263.tdickenson@geminidataloggers.com>

Don't get too excited... it's not here yet. But I am looking at it.

I would like some tips about where in CVS to put any C source. In /scripts/ next to sb_bnfilter.py seems natural. Except for the slightly irritating fact that it isn't a script.
Thanks in advance,

--
Toby Dickenson

From gbrown at alumni.caltech.edu Mon May 3 13:33:38 2004
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Mon May 3 14:17:38 2004
Subject: [spambayes-dev] FW: Overnight shipping on xănax,valíum and more
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677CA0@its-xchg4.massey.ac.nz>
Message-ID: <008d01c43134$c5393e60$1208a8c0@Glenn>

> "Bayes database has X, message database has Y" error message."

You're right. Retraining fixed this and the unscored messages problem went away.

> In any case, retraining does look like the best option,

Retraining worked well following Kenny Pitt's advice, which was to retrain on just 5 spam and 5 ham and then train only on the hammiest spam and spammiest ham in my archive, using "Filter messages" to rescore the collection after each iteration. It was horrifically slow, but required only about 30 ham and 40 spam messages to be sure of all ham and catch 90% of spam. If I were to do it again, I would first lower the spam threshold from 90% to 70% or lower. Methinks all those random words make it hard to be so sure of spam.

Thanks to Kenny for his suggestions. Sorry for the slow reply, but I was in Hawaii. :)

> the imbalance can be addressed

In the hopes it will inspire me to code a fix for the imbalance problem, I refuse to manually balance the database. IMHO, SpamBayes will not be mature until it doesn't require manual balancing, and the problem will not be addressed if all potential developers manually balance.

Thanks for the help,
--Glenn

From skip at pobox.com Mon May 3 17:31:11 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon May 3 17:30:59 2004
Subject: [spambayes-dev] C implementation of sb_bnfilter
In-Reply-To: <200405031600.08263.tdickenson@geminidataloggers.com>
References: <200405031600.08263.tdickenson@geminidataloggers.com>
Message-ID: <16534.47647.86371.923800@montanaro.dyndns.org>

    Toby> I would like some tips about where in CVS to put any C source.
Since that's a new beast, I'd suggest either contrib or an altogether new subdirectory. How about "src"? That's a common name for the directory containing C code in mixed language applications.

Skip

From tameyer at ihug.co.nz Mon May 3 21:24:08 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon May 3 21:24:23 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13062697F2@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D3F@its-xchg4.massey.ac.nz>

> Isn't the code in the except clause more-or-less what
> msg.as_string() itself does?

Basically. It could probably also just set the type to multipart and that would work as well, but I'm not really sure of the right way to do that, so used this.

> I'll give it some thought. Do we have a message we know
> makes this process barf?

There are a few around. I'll attach one to this message - it fails for me with current CVS SpamBayes, but works ok with code like I posted.

> The few I encountered the other
> day didn't fail on my development machine which runs Python
> from CVS and version 2.5.5 of the email package.
> The machine on which it failed runs Python 2.2.3 and email
> 2.5.3. Maybe shipping 2.5.5 with Spambayes and installing it is a
> reasonable alternative.

That's interesting. If the attached one works also, then this is probably a better solution (the binary users won't notice any difference, and it's not that big a deal for source users to download that, too).

> Didn't we used to ship some version of the email package? I
> don't see it in my CVS sandbox now.

It was there once, but is gone (as much as anything is gone from CVS) now. It was either before my time or right at the time that I started using SpamBayes (can't recall), so I don't know why it was there or why it isn't anymore.
=Tony Meyer

From t-meyer at ihug.co.nz Mon May 3 21:29:34 2004
From: t-meyer at ihug.co.nz (Tony Meyer)
Date: Mon May 3 21:29:47 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>

> I'm copying the spambayes list
> since people started reporting this problem on this list too.

I've moved this to cc spambayes-dev instead, because we're already discussing this there, and it'll just get lost in the bug reports on the main list.

> I suspect that the crashes occur because these messages have
> multipart boundaries but have a text content type header.

That seems to be correct. Two additional notes:

Skip Montanaro thinks that he had a message like this fail with Python 2.2.3 and email 2.5.3, but work fine with Python from CVS and version 2.5.5 of the email package, so that might be worth looking into. He's going to check whether this is the case or not.

For SpamBayes (and so presumably other apps that use the email package like this) we're either going to (again) include a more up-to-date/patched version of the email package, or handle the exception in our code. Adding something like this:

>>> try:
...     print msg.as_string()
... except TypeError:
...     parts = []
...     for part in msg.get_payload():
...         parts.append(part.as_string())
...     print "\n".join(parts)
...

works for me (obviously msg is an email.Message or similar, and you change print to whatever you want it to be). Adding this to the two spambayes modules that need it may be simpler for us than including a patched email package.

=Tony Meyer

From tameyer at ihug.co.nz Mon May 3 21:29:58 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon May 3 21:30:10 2004
Subject: [spambayes-dev] If msg.as_string() fails...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306269A56@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D41@its-xchg4.massey.ac.nz>

> There are a few around. I'll attach one to this message

Or maybe this one ;)

=Tony Meyer
-------------- next part --------------
Return-Path:
Message-ID:
From: "Dustin Cantu"
Reply-To: "Dustin Cantu"
To: whomever
Subject: Homeowners Get the best r.ate in half the time
Date: Wed, 28 Apr 2004 22:41:02 -0600
MIME-Version: 1.0
Content-Type: text/html; boundary="--05801166718280276418"
X-Priority: 3
X-IP: 244.223.184.112
Status: O
X-Status:
X-Keywords:

----05801166718280276418
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

If you are paying more than 6% on your mort.gage, we can slash your payment!

GUARANTEED LOWEST RA.TES ON THE PLANET

APPROVAL REGARDLESS OF C.REDIT HISTORY!

Start saving today

Show Me The Lowest Ra.tes

mop pompey slow sod voss colloquy mantlepiece patagonia imagen oracle execute gould shameface care pentagon milan rummage harlan centrist arccosine borax rand copy herbert camp atlantis awesome deficit american minuend bob rescue salle shade chuckwalla songbook bike differentiate bootes condominium nellie village pizzicato venus voyage horace cute indiscreet clone cherry amalgam debtor coxcomb versailles bizet counterfeit dependent elsie doherty

re.move

----05801166718280276418--

From mhammond at skippinet.com.au Mon May 3 21:51:04 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon May 3 21:51:23 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BDE@its-xchg4.massey.ac.nz>
Message-ID: <0d0f01c4317a$4321db80$0200a8c0@eden>

> Should we go ahead and create 1.0rc1's? Since this is a reasonably
> important release, I'd like to try the installers on a few
> machines, which I won't be able to do until tomorrow, but I can
> easily put the source rc's together.

Sounds good to me. I suggest we push this release out, then cut a CVS tag and branch. Hopefully the RC will become the actual release - but if not, we just patch on that branch.

This then opens up the trunk for everyone to break everything again :)

Sound OK?

Mark.

From skip at pobox.com Mon May 3 22:52:33 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon May 3 23:24:20 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
Message-ID: <16535.1393.304440.918139@montanaro.dyndns.org>

    Tony> Skip Montanaro thinks that he had a message like this fail with
    Tony> Python 2.2.3 and email 2.5.3, but work fine with Python from CVS
    Tony> and version 2.5.5 of the email package, so that might be worth
    Tony> looking into. He's going to check whether this is the case or
    Tony> not.

Better yet, I have a message (attached) which works w/ Python CVS (email 2.5.5), fails w/ Python 2.3.3 (email 2.5.4), and prints as expected with your loop-over-get_payload trick. I'm offline at the moment but will try to get a change checked in later this evening or tomorrow morning.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogus
Type: application/octet-stream
Size: 2943 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040503/2bdfc1de/bogus.obj

From tameyer at ihug.co.nz Tue May 4 00:09:26 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue May 4 00:10:56 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306269A6E@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D46@its-xchg4.massey.ac.nz>

> Sounds good to me. I suggest we push this release out, then
> cut a CVS tag and branch. Hopefully the RC will become the
> actual release - but if not, we just patch on that branch.
>
> This then opens up the trunk for everyone to break everything again :)
>
> Sound OK?

Sounds good to me (sounds more-or-less like what everyone liked last time we talked about branches and things, too, IIRC). I think you're about at the moment, so I'll put together the source ones right now.

BTW nice work on all the bug closing :)

=Tony Meyer

From tameyer at ihug.co.nz Tue May 4 00:14:52 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue May 4 00:17:39 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D46@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D47@its-xchg4.massey.ac.nz>

[me]
> Sounds good to me (sounds more-or-less like what everyone
> liked last time we talked about branches and things, too,
> IIRC). I think you're about at the moment, so I'll put
> together the source ones right now.

Thinking slightly slower than send-clicking: the one other thing that might be nice to include would be the fix for the 'TypeError' mail parsing problem. However, the ones with the most to gain are sb_filter users, who are probably most willing to use CVS or apply a patch to a script (sb_imapfilter and sb_server will both keep running and just fail to classify that message, and Outlook is immune).
I suspect that, in any case, the fix will either be too big to do pre-1.0 (like including a whole different email package) or so small that it can safely go in between the 1.0rc and 1.0.

=Tony Meyer

From anthony at interlink.com.au Tue May 4 01:01:59 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 4 01:02:32 2004
Subject: [spambayes-dev] Re: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz>
Message-ID: <409723C7.6050606@interlink.com.au>

Please try out the new FeedParser in the current Python-CVS. It should be considerably more robust than the old parser. In addition, it can be fed the message text "in chunks" and it will do the correct thing.

http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Lib/email/FeedParser.py

--
Anthony Baxter
It's never too late to have a happy childhood.

From mhammond at skippinet.com.au Tue May 4 01:08:05 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue May 4 01:08:24 2004
Subject: [spambayes-dev] 1.0rc1 on sourceforge
Message-ID: <000001c43195$c946e050$0200a8c0@eden>

Tony and I are in the process of cutting the new release. The release has been created, but is still inactive. Tony wants to test out the source archives at home, but it should work :)

The binary:
http://prdownloads.sourceforge.net/spambayes/spambayes-1.0rc1.exe?download

The sources:
http://prdownloads.sourceforge.net/spambayes/spambayes-1.0rc1.tar.gz?download
http://prdownloads.sourceforge.net/spambayes/spambayes-1.0rc1.zip?download

Assuming all goes well, we hope to announce the release tomorrow, and then cut a 1.0 branch.

Cheers!

Mark.

From alex at gabuzomeu.net Tue May 4 04:02:41 2004
From: alex at gabuzomeu.net (Alexandre Ratti)
Date: Tue May 4 04:00:22 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <16535.1393.304440.918139@montanaro.dyndns.org>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> <16535.1393.304440.918139@montanaro.dyndns.org>
Message-ID: <40974E21.5090905@gabuzomeu.net>

Skip Montanaro wrote:

> Better yet, I have a message (attached) which works w/ Python CVS (email
> 2.5.5), fails w/ Python 2.3.3 (email 2.5.4), and prints as expected with
> your loop-over-get_payload trick. I'm offline at the moment but will try to
> get a change checked in later this evening or tomorrow morning.

In case you need more test data, I have saved 3 messages that crashed Spambayes and the email package (2.5.4):

http://alexandre.ratti.free.fr/python/email/

From anthony at interlink.com.au Tue May 4 06:44:41 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 4 06:47:24 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <40974E21.5090905@gabuzomeu.net>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> <16535.1393.304440.918139@montanaro.dyndns.org> <40974E21.5090905@gabuzomeu.net>
Message-ID: <40977419.1050709@interlink.com.au>

Alexandre Ratti wrote:
>
> Skip Montanaro wrote:
>
>> Better yet, I have a message (attached) which works w/ Python CVS (email
>> 2.5.5), fails w/ Python 2.3.3 (email 2.5.4), and prints as expected with
>> your loop-over-get_payload trick. I'm offline at the moment but will
>> try to
>> get a change checked in later this evening or tomorrow morning.
>
>
> In case you need more test data, I have saved 3 messages that crashed
> Spambayes and the email package (2.5.4):
>
> http://alexandre.ratti.free.fr/python/email/

These are all correctly parsed by the current-CVS version of the email package. Well, "correct" in this case means that they're considered a single text/html part. The boundary tag is (correctly) ignored.

I'll be making a release of my email-torture-test package this evening with these tests and more.

Anthony

From anthony at interlink.com.au Tue May 4 08:01:48 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue May 4 08:02:57 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <40974E21.5090905@gabuzomeu.net>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> <16535.1393.304440.918139@montanaro.dyndns.org> <40974E21.5090905@gabuzomeu.net>
Message-ID: <4097862C.40004@interlink.com.au>

Ok, I've made a tarball up of my MIME torture tests. They're available from

http://www.interlink.com.au/anthony/tech/mime/

See the ABOUT.txt file there for more details. If you have examples of horror that aren't already covered (in particular, anything that breaks the current-CVS python parser!) please send them my way. If you'd prefer I sanitise them to remove your email addresses, let me know.

--
Anthony Baxter
It's never too late to have a happy childhood.
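
[Editorial sketch, not part of the original thread.] The fallback proposed earlier in this thread (call msg.as_string() and, on TypeError, flatten each sub-part separately) can be sketched as a small helper. This is only an illustration in present-day Python 3, not the code that was checked in to SpamBayes: the name safe_as_string and the choice to re-emit the outer headers ahead of the parts are assumptions, and the fallback makes no attempt to reproduce MIME boundaries.

```python
from email.message import Message


def safe_as_string(msg, unixfrom=False):
    """Flatten msg to text, falling back to per-part flattening.

    Hypothetical helper: neither the name nor the exact output format
    comes from the thread.
    """
    try:
        return msg.as_string(unixfrom=unixfrom)
    except TypeError:
        # The generator expected a string payload (for example because
        # the headers claim text/html) but found something else,
        # typically a list of sub-messages.
        payload = msg.get_payload()
        if not isinstance(payload, list):
            raise
        # Assumption: keep the outer headers, then append each part
        # flattened on its own.
        headers = "".join(
            "%s: %s\n" % (name, value) for name, value in msg.items()
        )
        parts = [part.as_string() for part in payload]
        return headers + "\n" + "\n".join(parts)


# Build a message resembling the ones discussed: a text/html
# Content-Type whose payload is nevertheless a list of parts.
broken = Message()
broken["Subject"] = "test"
broken["Content-Type"] = 'text/html; boundary="xyz"'
first, second = Message(), Message()
first.set_payload("part one")
second.set_payload("part two")
broken.set_payload([first, second])

text = safe_as_string(broken)
```

Returning the text instead of printing it lets the caller decide what to do with it, which matches the remark that the "print"s in the interactive version would become returns.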

From skip at pobox.com Tue May 4 09:14:54 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue May 4 09:14:54 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <16535.1393.304440.918139@montanaro.dyndns.org>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> <16535.1393.304440.918139@montanaro.dyndns.org>
Message-ID: <16535.38734.770897.586576@montanaro.dyndns.org>

    Tony> Skip Montanaro thinks that he had a message like this fail with
    Tony> Python 2.2.3 and email 2.5.3, but work fine with Python from CVS
    Tony> and version 2.5.5 of the email package...

    Skip> ... [I] will try to get a change checked in later this evening or
    Skip> tomorrow morning.

I added an as_string() function to mboxutils.py. I updated sb_filter.py, sb_bnserver.py and hammie.py to use it. I will leave it for other authors to decide if they need to use it as well. In particular, Tony already has a similar solution in place for sb_server.py and sb_imapfilter.py. I didn't see any reason to modify that code so soon before a release. I just stuck with the applications I knew would benefit. Other candidates include sb_mboxtrain.py and sb_pop3dnd.py.

After the release I think it would be worthwhile to consider subclassing email.Message.Message, morphing mboxutils.as_string() into a method, then using it everywhere. I doubt it will be difficult to do, but would have touched more bits of code than I felt comfortable with just before the release. Longer term it looks like Barry has other robustification ideas to implement directly in the email package.

Skip

From skip at pobox.com Tue May 4 09:16:09 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue May 4 09:16:05 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D47@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D46@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677D47@its-xchg4.massey.ac.nz>
Message-ID: <16535.38809.868607.432369@montanaro.dyndns.org>

    Tony> I suspect that, in any case, the fix will either be too big to do
    Tony> pre-1.0 (like including a whole different email package) or so
    Tony> small that it can safely go in between the 1.0rc and 1.0.

The fix is in place for sb_filter and sb_bnserver.

Skip

From skip at pobox.com Tue May 4 09:41:03 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue May 4 09:41:56 2004
Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not
In-Reply-To: <40974E21.5090905@gabuzomeu.net>
References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> <16535.1393.304440.918139@montanaro.dyndns.org> <40974E21.5090905@gabuzomeu.net>
Message-ID: <16535.40303.320055.546812@montanaro.dyndns.org>

    Alexandre> In case you need more test data, I have saved 3 messages that
    Alexandre> crashed Spambayes and the email package (2.5.4):

    Alexandre> http://alexandre.ratti.free.fr/python/email/

Thanks. These seem to tickle a different bug. They pass through msg.as_string() without resorting to the mboxutils.as_string() machinery, but lose the message body.

Skip

From tdickenson at geminidataloggers.com Tue May 4 10:01:48 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Tue May 4 10:01:52 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D47@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D47@its-xchg4.massey.ac.nz>
Message-ID: <200405041501.48799.tdickenson@geminidataloggers.com>

On Tuesday 04 May 2004 05:14, Tony Meyer wrote:

> However, the ones with the most to gain are sb_filter users, who
> are probably most willing to use CVS or apply a patch to a script
> (sb_imapfilter and sb_server will both keep running and just fail to
> classify that message, and Outlook is immune).

sb_filter should be fail-safe with this too. Whatever calls sb_filter is supposed to ignore its output if it exits with an exception and a non-zero exit code.

--
Toby Dickenson

From skip at pobox.com Tue May 4 10:28:22 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue May 4 10:28:20 2004
Subject: [spambayes-dev] Release 1.0?
In-Reply-To: <200405041501.48799.tdickenson@geminidataloggers.com>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D47@its-xchg4.massey.ac.nz> <200405041501.48799.tdickenson@geminidataloggers.com>
Message-ID: <16535.43142.824101.88929@montanaro.dyndns.org>

    Toby> On Tuesday 04 May 2004 05:14, Tony Meyer wrote:
    >> However, the ones with the most to gain are sb_filter users, who are
    >> probably most willing to use CVS or apply a patch to a script
    >> (sb_imapfilter and sb_server will both keep running and just fail to
    >> classify that message, and Outlook is immune).

    Toby> sb_filter should be fail-safe with this too. Whatever calls
    Toby> sb_filter is supposed to ignore its output if it exits with an
    Toby> exception and a non-zero exit code.

Note that sb_filter.py can process a large mailbox or maildir as well as a single message. In that situation I think you might want it to recover as well as it can and continue. In the normal case I agree that exiting with a non-zero status is okay.

After a little more messing around with the mboxutils.as_string() function I'm not certain it's all that robust itself.
I don't think it's doing quite the right thing with boundaries and wonder if it should recurse instead of just calling part.as_string() for each part. Skip From toby at tarind.com Wed May 5 18:08:16 2004 From: toby at tarind.com (Toby Dickenson) Date: Wed May 5 18:08:21 2004 Subject: [spambayes-dev] sb_bnfilter performance Message-ID: <200405052308.16839@trumpet.tarind.com> Ive been squeezing some more performance out of sb_bnfilter..... I have a C implementation of sb_bnfilter that reduces filtering time of a typical email from 60ms (with the python sb_bnfilter.py) down to 21ms. [those times are for second and subsequent runs with sb_bnserver still running]. For comparison, the original sb_filter.py runs in 257ms. Using the C implementation of sb_bnfilter to filter an *empty* email reduces the run time to 3ms, so it looks like any further gains will come from changes in sb_bnserver...... Using Psyco in sb_bnserver improves the run time for the second and subsequent runs from 21ms to 15 ms, but the cost is extra overhead for the first run. On this machine the break-even point is after 12 runs, so I think its worth leaving on by default and add a switch to disable it. These changes are in the bnfilter_in_c_branch CVS branch for now; they dont belong in the 1.0 release. From tameyer at ihug.co.nz Wed May 5 18:21:45 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed May 5 18:22:04 2004 Subject: [spambayes-dev] sb_bnfilter performance In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306269F8F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> > Using the C implementation of sb_bnfilter to filter an > *empty* email reduces the run time to 3ms, so it looks > like any further gains will come from changes in sb_bnserver...... Out of curiosity, have you profiled sb_bnserver at all?
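[Editor's sketch, not part of the original message.] One way to answer the profiling question is to wrap the per-message work in a profiler and look at the cumulative times. The sketch below uses the standard cProfile and pstats modules from current Python (not the Python 2.2/2.3 tools available when this thread was written). The tokenize() and classify() functions are hypothetical stand-ins, not SpamBayes code; in a real run they would be replaced by whatever sb_bnserver calls per message.

```python
import cProfile
import io
import pstats


def tokenize(text):
    # Hypothetical stand-in for the real tokenizer: split on whitespace.
    return text.lower().split()


def classify(text):
    # Hypothetical stand-in for scoring: fraction of "spammy" tokens.
    tokens = tokenize(text)
    spammy = sum(1 for t in tokens if t in {"viagra", "free", "winner"})
    return spammy / max(len(tokens), 1)


def profile_classify(text, runs=1000):
    """Profile repeated classify() calls and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(runs):
        classify(text)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(5)
    return out.getvalue()


report = profile_classify("free offer for the lucky winner " * 20)
print(report)
```

Sorting by cumulative time shows whether tokenizing or scoring dominates, which is the split the question above is really asking about.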
I wonder if the actual tokenization/classification of the message is the majority of this 21ms time, which would be hard to improve in speed (without starting to recode the core SpamBayes code itself in C). If you have profiled, it'd be interesting to see where the time is being spent (i.e. please post it!). You might be able to find gains by turning off some tokenizing options (perhaps there are time-expensive ones that don't give much in the way of accuracy?). Using Python 2.4 (from CVS) might also speed things up a bit, since I gather that there are numerous speed improvements with the built-ins like dict and list. > These changes are in the bnfilter_in_c_branch CVS branch for > now; they dont belong in the 1.0 release. Note that soon (today, probably) there'll be a 1.0 branch and so you'll be able to put these on the trunk. =Tony Meyer From jkx at pythonfr.org Wed May 5 18:28:00 2004 From: jkx at pythonfr.org (Jkx@Home) Date: Wed May 5 18:26:03 2004 Subject: [spambayes-dev] sb_bnfilter performance In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> Message-ID: <1083796080.30222.13.camel@p2b.soif.fr> On jeu, 2004-05-06 at 00:08, Toby Dickenson wrote: > Ive been squeezing some more performance out of sb_bnfilter..... > > I have a C implementation of sb_bnfilter that reduces filtering time of a > typical email from 60ms (with the python sb_bnfilter.py) down to 21ms. [those > times are for second and subsequent runs with sb_bnserver still running]. For > comparison, the original sb_filter.py runs in 257ms. > > These changes are in the bnfilter_in_c_branch CVS branch for now; they dont > belong in the 1.0 release. This sound great :) I hope this have the same behaviours that spamc have : - disable sb_bnserver fork - specify username - and support for unix domain and tcp .. I don't have the time to test this right now (i will be off) until 2/3 days. 
But i think i gonna check/hack this soon. I put sb_global_server in a production mode (read for little groups of users) and it's works pretty fine, ( mainly due to the hammie cache in) Bye Bye. From t-meyer at ihug.co.nz Wed May 5 18:36:28 2004 From: t-meyer at ihug.co.nz (Tony Meyer) Date: Wed May 5 18:36:46 2004 Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306269BBF@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BF1@its-xchg4.massey.ac.nz> > After the release I think it would be worthwhile to consider > subclassing email.Message.Message, morphing > mboxutils.as_string() into a method, then using it > everywhere. I doubt it will be difficult to do, but would > have touched more bits of code than I felt comfortable with > just before the release. This is basically what spambayes.message.Message is, although it has a bit more, too (and doesn't as yet have the TypeError fix, but it will get it at some point). sb_server, sb_imapfilter and sb_pop3dnd use this class. spambayes.message has two classes at the moment - Message and SBHeaderMessage. sb_filter and sb_mboxtrain could probably use one or the other of these (which would also finish off making the header adding code consistent). If the id/classification stuff in Message wasn't wanted, then we could change Message to be even simpler (just the modified as_string(), for example), and add a new PersistentMessage class with the other code, which SBHeaderMessage would then subclass (rather than Message). I'm pretty sure that no SpamBayes code currently refers to spambayes.message.Message, just to SBHeaderMessage. A couple of things: 1. I'm a little concerned that our message class has the same name as the email.Message one. I think renaming ours to "SBMessage" or something like that would be cleaner. 2. Importing spambayes.message will mean you end up with a spambayes.messageinfo.db file somewhere.
This stuff is very messy and is crying out to be fixed (I have spambayes.messageinfo.db files everywhere). It's a bit risky pre 1.0, but I should get time to take a look at it sometime soon. =Tony Meyer From tdickenson at geminidataloggers.com Thu May 6 03:00:47 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Thu May 6 03:00:52 2004 Subject: [spambayes-dev] sb_bnfilter performance In-Reply-To: <1083796080.30222.13.camel@p2b.soif.fr> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> <1083796080.30222.13.camel@p2b.soif.fr> Message-ID: <200405060800.47620.tdickenson@geminidataloggers.com> On Wednesday 05 May 2004 23:28, Jkx@Home wrote: > On jeu, 2004-05-06 at 00:08, Toby Dickenson wrote: > > Ive been squeezing some more performance out of sb_bnfilter..... > > I have a C implementation of sb_bnfilter that reduces filtering time > This sound great :) > I hope this have the same behaviours that spamc have : It doesnt. Those features make sense in your spamc/sb_global_server configuration, but sb_bnfilter has different requirements. Its just a faster sb_filter, ideal for use in procmail or mua filters. sb_bnfilter is not a server-based solution (my apologies for using a component name containing the word "server" that might hint otherwise ;-) > - disable sb_bnserver fork > - specify username > - and support for unix domain and tcp .. Those features would be nice somewhere, but not in sb_filter. A related question: why not use ReadyExec? Its implementation relies on copying the filter's stdin file descriptor into the server process. This is a little hairy, since the server can continue to read from that file descriptor even after the filter process has been killed. > I put sb_global_server in a production mode (read for little groups of > users) and it's works pretty fine, ( mainly due to the hammie cache in) Nice. 
-- Toby Dickenson From tdickenson at geminidataloggers.com Thu May 6 03:09:12 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Thu May 6 03:09:15 2004 Subject: [spambayes-dev] sb_bnfilter performance In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> Message-ID: <200405060809.12861.tdickenson@geminidataloggers.com> On Wednesday 05 May 2004 23:21, Tony Meyer wrote: > > Using the C implementation of sb_bnfilter to filter an > > *empty* email reduces the run time to 3ms, so it looks > > like any further gains will come from changes in sb_bnserver...... > > Out of curiosity, have you profiled sb_bnserver at all? Not yet, but that will be next. -- Toby Dickenson From Murali.Rajan at unisys.com Thu May 6 09:30:44 2004 From: Murali.Rajan at unisys.com (Rajan, Murali) Date: Thu May 6 09:30:51 2004 Subject: [spambayes-dev] JUNK-EMAIL Message-ID: <3F3674A6119CE54D95959F99E3DDD75E03842C2D@USTR-EXCH4.na.uis.unisys.com> Suppress new message icon from appearing when mail is directed to Junk-Email folder. Especially after SPAMEBAYS has learned what is junk email. At least make it an option users can choose. Murali Rajan Phone (610)648-4599. Net2 385-4599. Fax (610)648-4699. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. From kennypitt at hotmail.com Thu May 6 10:48:55 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu May 6 10:50:36 2004 Subject: [spambayes-dev] JUNK-EMAIL In-Reply-To: <3F3674A6119CE54D95959F99E3DDD75E03842C2D@USTR-EXCH4.na.uis.unisys.com> Message-ID: Rajan, Murali wrote: > Suppress new message icon from appearing when mail is directed to > Junk-Email folder. Especially after SPAMEBAYS has learned what is > junk email. 
At least make it an option users can choose. See FAQ 3.8: http://spambayes.sourceforge.net/faq.html#how-can-i-get-rid-of-the-envelope-tray-icon-for-spam -- Kenny Pitt From Jkx at pythonfr.org Thu May 6 19:35:17 2004 From: Jkx at pythonfr.org (Jerome Kerdreux) Date: Thu May 6 19:33:16 2004 Subject: [spambayes-dev] sb_bnfilter performance In-Reply-To: <200405060800.47620.tdickenson@geminidataloggers.com> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> <1083796080.30222.13.camel@p2b.soif.fr> <200405060800.47620.tdickenson@geminidataloggers.com> Message-ID: <20040506233516.GB2085@larsen-b.com> On Thu, May 06, 2004 at 08:00:47AM +0100, Toby Dickenson wrote: > On Wednesday 05 May 2004 23:28, Jkx@Home wrote: > > On jeu, 2004-05-06 at 00:08, Toby Dickenson wrote: > > > Ive been squeezing some more performance out of sb_bnfilter..... > > > > I have a C implementation of sb_bnfilter that reduces filtering time > > > This sound great :) > > I hope this have the same behaviours that spamc have : > > It doesnt. Those features make sense in your spamc/sb_global_server > configuration, but sb_bnfilter has different requirements. Its just a faster > sb_filter, ideal for use in procmail or mua filters. > > sb_bnfilter is not a server-based solution (my apologies for using a component > name containing the word "server" that might hint otherwise ;-) > > > - disable sb_bnserver fork > > - specify username > > - and support for unix domain and tcp .. > > Those features would be nice somewhere, but not in sb_filter. We don't talk about sb_filter but sb_bnfilter and, i think you do anything SB related in (except sb_bnserver fork) ? i don't really see where is the difference .. your code just need a option to disable the sb_bnserver fork to use the global server no ? Toby could you send me your code by email please, because i don't have cvs install here .. > A related question: why not use ReadyExec?
Its implementation relies on > copying the filter's stdin file descriptor into the server process. This is a > little hairy, since the server can continue to read from that file descriptor > even after the filter process has been killed. What's this ? URL ? Bye Bye . From otrcomm at isp-systems.com Fri May 7 02:07:33 2004 From: otrcomm at isp-systems.com (OTR Comm) Date: Fri May 7 02:08:10 2004 Subject: [spambayes-dev] Web Interface to SpamBayes (procmail) Message-ID: <409B27A5.8CD4FAC@isp-systems.com> Hello, I just started using SpamBayes with procmail. I have developed a web based interface into the system written in perl and would like some people to test it out. Is this the appropriate list, or should I send the request to the users list? You can see the interface at http://www.wildapachemail.net/cgi-bin/spambayes/sb_login.cgi logon as 'newuser', password 'newuser' and then send email to newuser@wildapache.net Unfortunately, none of the spammers know this email address, but if you have some test spam messages, you can send them. I don't have any of the Help system developed yet, but I will soon. If anyone would like a copy of the tarball for the system, let me know and I will setup an ftp site for it! Again, if this is not the correct list, then I apologize. Thanks, Murrah Boswell From pbelanger at forsk.com Fri May 7 03:33:47 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 03:33:41 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... 
Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/e100b202/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 03:42:20 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 03:42:11 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/d99663e3/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 03:44:44 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 03:44:36 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/9e1bc013/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 03:54:58 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 03:54:49 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... 
Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/5fbef477/Outlook-0001.bin From tameyer at ihug.co.nz Fri May 7 04:00:46 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Fri May 7 04:01:23 2004 Subject: [spambayes-dev] Web Interface to SpamBayes (procmail) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13063B564C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BF7@its-xchg4.massey.ac.nz> > I just started using SpamBayes with procmail. I have > developed a web based interface into the system written in > perl and would like some people to test it out. Is there any reason that you're building a web interface from scratch, rather than simply using/subclassing the one that sb_server/sb_imapfilter use? (In fact, can't you practically do this already with sb_server and sb_upload?) =Tony Meyer From pbelanger at forsk.com Fri May 7 04:17:30 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 04:17:43 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/d6fa478f/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 05:16:05 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 05:16:02 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... 
Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/ac69bd5c/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 09:08:39 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 09:08:30 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/4071731d/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 09:36:44 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 09:36:41 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/63e84900/Outlook-0001.bin From sjoerd at acm.org Fri May 7 09:52:36 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Fri May 7 09:52:41 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems In-Reply-To: References: Message-ID: <409B94A4.5050904@acm.org> What's going on? I have already received 15 copies of this message. I checked a couple, and they have different Date and Message-Id headers, so it looks like a problem at the source... Philippe Belanger wrote: > Why do I get this error message when I select a suspect mail and click > on delete as spam ? > > > After that I get a message indicating that the Outlook session contains > macros which maybe are not sure ? 
So I deactivate it but how do I see > what these macros are (I know this may be not a SpamBayes question) ? > > When a message has been classified as sure Spam and I agree with this > classification (in JunkMail folder) , is it the right way to select it > and click on Delete or is there an integrated way to say : "Well it's > effectively spam, I don't want to keep it know, but hope Spam Bayes used > it to improve its learning process ? > > Thank you by advance, > > Philippe BELANGER > FORSK > Tel: +33 (0)5.62.74.72.17 > Fax:+33 (0)5.62.74.72.11 > pbelanger@forsk.com > > > > ------------------------------------------------------------------------ > > > ------------------------------------------------------------------------ > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev -- Sjoerd Mullender From pbelanger at forsk.com Fri May 7 10:06:09 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 10:06:00 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/4eb17ebd/Outlook-0001.bin From pbelanger at forsk.com Fri May 7 11:11:43 2004 From: pbelanger at forsk.com (Philippe Belanger) Date: Fri May 7 11:11:42 2004 Subject: [spambayes-dev] Questions : After 3 days of (good) usage, SpamBays seems to encou nter problems Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... 
Name: Outlook.bmp Type: image/bmp Size: 37854 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/05d4de04/Outlook-0001.bin From otrcomm at isp-systems.com Fri May 7 12:22:14 2004 From: otrcomm at isp-systems.com (OTR Comm) Date: Fri May 7 12:33:50 2004 Subject: [spambayes-dev] Web Interface to SpamBayes (procmail) References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BF7@its-xchg4.massey.ac.nz> Message-ID: <409BB7B6.448C09C2@isp-systems.com> > Is there any reason that you're building a web interface from scratch, > rather than simply using/subclassing the one that sb_server/sb_imapfilter > use? Probably my ignorance then! I haven't closely looked at sb_server to see how it works on a linux box with sendmail and procmail. I also don't want my users to have to load any software on their machines, but I want to give each user the ability to manage their own spam filter. I may be completely off base with my work, but it seems that if it were that simple, there would not be any need for the Outlook plugin. Is this not so? > > (In fact, can't you practically do this already with sb_server and > sb_upload?) I do not know! Like I said, I just started working with SpamBayes, but I will look into these modules. Thanks, Murrah Boswell From COrchard at bcbc.bc.ca Fri May 7 18:00:51 2004 From: COrchard at bcbc.bc.ca (Orchard, Chris ) Date: Fri May 7 18:00:51 2004 Subject: [spambayes-dev] (no subject) Message-ID: <46C2D3D3D981D41187E300508B6670341250A187@bcbcmail.bcbc.bc.ca> Wow, Great program, any chance there might be a release for Pocket Outlook for the Pocket PC Platform? Chris "Boots" Orchard Orchard Internet Services a division of Orchard Enterprises 5031 Rocky Point Rd. Victoria, BC V9C 4G4 Canada Tel: (250) 881-4139 Email: ois@oe.ca -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040507/78c730be/attachment.html From tameyer at ihug.co.nz Sat May 8 23:11:46 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat May 8 23:12:06 2004 Subject: [spambayes-dev] Web Interface to SpamBayes (procmail) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13063B5771@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BF9@its-xchg4.massey.ac.nz> > Probably my ignorance then! I haven't closely looked at sb_server to > see how it works on a linux box with sendmail and procmail. I also > don't want my users to have to load any software on their > machines, but I want to give each user the ability to manage their > own spam filter. Making the existing tools multi-user would mean a bit of work, but I would think less than doing the whole lot from scratch. You could probably put something at the front that asked for a username/password and then loaded the appropriate configuration file. Note that even if sb_server/sb_upload aren't any use, all the interface code should still be usable, if you want it. The UserInterface.py module has most of it - see ProxyUI.py or ImapUI.py for example usage. > I may be completely off base with my work, but it seems that > if it were that simple, there would not be any need for the > Outlook plugin. Is this not so? You'll never get the same sort of experience with a web-based interface as you will with an integrated plug-in. How do you do drag-and-drop training, for example? (Well, the sb_pop3dnd script gives that, but still...). What about buttons so that you can select a message, click a button, and have it trained? Classify any selected message? Classify some incoming messages, but not others? In addition, in my experience, people would rather stay within the application - going out to a web browser doesn't make sense to them. 
=Tony Meyer From sethg at GoodmanAssociates.com Tue May 11 22:13:01 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Tue May 11 22:13:01 2004 Subject: [spambayes-dev] button ideas (oh boy) Message-ID: Using the Outlook plug-in for regimes other than mistake based training for a while now, I think there are two places where a second button would make things a lot simpler. These other training strategies all involve training on some messages that are already correctly classified. This is currently not very easy with the plug-in and I suspect that it discourages some people from experimenting with different training regimes. I have been told that everyone is born knowing how to program in Python. I haven't tested that, probably because I couldn't handle failing at something I was born knowing how to do. Here is specifically what I am proposing, and the names of the buttons are completely up for grabs. 1) In the Inbox (and other watched ham folders), there could be a second button for "Train As Good" for messages that have scores too far from zero. Pressing the button would train the message as ham and keep it in the current folder. 2) In the Spam folder, there could be a second button for "Train As Spam" for messages that have scores too far from 100%. Pressing the button would train on the message as spam and keep it in the Spam folder. It's more complicated than we now have, but it's simpler than moving the message to the Unsure folder, then having to press either "Delete As Spam" (not too bad) or "Recover From Spam" (totally confusing, since the message was not classified as spam). This way, we would always have two buttons visible in every folder, but the name of the second button would change with the context. I know that this makes the code more complicated, but it makes sense from the user standpoint. I suppose at this point someone will point out that if I think this is such a cool idea, why don't I just code it up myself. 
Good point, this is open source. But let me remind you, I'm a long-time hardware guy (and a good one!) ... think about that a half dozen times or so before suggesting that I program anything in any language besides English or VHDL. Be careful, you may get what you wish for. State machines, transmission lines and analog circuits, no problem. Object classes and encapsulation, I dunno. Python? I'm afraid of snakes. -- Seth Goodman From ExchangeCompServ at massey.ac.nz Thu May 13 03:32:42 2004 From: ExchangeCompServ at massey.ac.nz (ExchangeCompServ@massey.ac.nz) Date: Thu May 13 03:33:04 2004 Subject: [spambayes-dev] Symantec AVF detected an unrepairable virus in a message you sent Message-ID: <01ec01c438bc$7a0e2150$0e817b82@massey.ac.nz> Subject of the message: Hey, ya! =)) Recipient of the message: Meyer, Tony From tameyer at ihug.co.nz Thu May 13 03:41:48 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu May 13 03:41:54 2004 Subject: [spambayes-dev] Symantec AVF detected an unrepairable virus in amessage you sent In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13063B63C4@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677DE2@its-xchg4.massey.ac.nz> Ak. Sorry about that - if the fools here have started contributing to the worldwide flood of messages like this, I'll try to stop them doing that... > -----Original Message----- > From: spambayes-dev-bounces@python.org > [mailto:spambayes-dev-bounces@python.org] On Behalf Of > ExchangeCompServ@massey.ac.nz > Sent: Thursday, 13 May 2004 7:33 p.m. > To: spambayes-dev@python.org > Subject: [spambayes-dev] Symantec AVF detected an > unrepairable virus in amessage you sent > > > Subject of the message: Hey, ya! 
=)) > Recipient of the message: Meyer, Tony > > > > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev > From mhammond at skippinet.com.au Thu May 13 05:57:04 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu May 13 05:57:27 2004 Subject: [spambayes-dev] button ideas (oh boy) In-Reply-To: Message-ID: <01e601c438d0$a5d3a300$0200a8c0@eden> My main questions is why you aren't trusting SpamBayes? If spambayes is correctly classifying these messages, what makes you believe you should train on them? I'd like a little evidence that this would help future classification. While it may be intuitive to you, Uncle Timmy has made way too many comments about ignoring that intuition to ignore :) Mark. > -----Original Message----- > From: spambayes-dev-bounces+mhammond=keypoint.com.au@python.org > [mailto:spambayes-dev-bounces+mhammond=keypoint.com.au@python.org]On > Behalf Of Seth Goodman > Sent: Wednesday, 12 May 2004 12:13 PM > To: SpamBayes-dev Forum > Subject: [spambayes-dev] button ideas (oh boy) > > > Using the Outlook plug-in for regimes other than mistake > based training for > a while now, I think there are two places where a second > button would make > things a lot simpler. These other training strategies all > involve training > on some messages that are already correctly classified. This > is currently > not very easy with the plug-in and I suspect that it > discourages some people > from experimenting with different training regimes. I have > been told that > everyone is born knowing how to program in Python. I haven't > tested that, > probably because I couldn't handle failing at something I was > born knowing > how to do. > > Here is specifically what I am proposing, and the names of > the buttons are > completely up for grabs.
> > 1) In the Inbox (and other watched ham folders), there could > be a second > button for "Train As Good" for messages that have scores too > far from zero. > Pressing the button would train the message as ham and keep it in the > current folder. > > 2) In the Spam folder, there could be a second button for > "Train As Spam" > for messages that have scores too far from 100%. Pressing > the button would > train on the message as spam and keep it in the Spam folder. > > It's more complicated than we now have, but it's simpler than > moving the > message to the Unsure folder, then having to press either > "Delete As Spam" > (not too bad) or "Recover From Spam" (totally confusing, > since the message > was not classified as spam). This way, we would always have > two buttons > visible in every folder, but the name of the second button > would change with > the context. I know that this makes the code more > complicated, but it makes > sense from the user standpoint. > > I suppose at this point someone will point out that if I > think this is such > a cool idea, why don't I just code it up myself. Good point, > this is open > source. But let me remind you, I'm a long-time hardware guy > (and a good > one!) ... think about that a half dozen times or so before > suggesting that I > program anything in any language besides English or VHDL. Be > careful, you > may get what you wish for. State machines, transmission > lines and analog > circuits, no problem. Object classes and encapsulation, I > dunno. Python? > I'm afraid of snakes. 
> > -- > > Seth Goodman > > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev From sethg at GoodmanAssociates.com Thu May 13 06:32:26 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu May 13 06:32:28 2004 Subject: [spambayes-dev] button ideas (oh boy) In-Reply-To: <01e601c438d0$a5d3a300$0200a8c0@eden> Message-ID: > From: Mark Hammond > Sent: Thursday, May 13, 2004 4:57 AM > > > My main questions is why you aren't trusting SpamBayes? > If spambayes > is correctly classifying these messages, what makes you believe you should > train on them? With a train on almost everything (TOAE) regime, you train on ham that scores higher than a certain threshold and spam that scores less than another threshold. I personally use 0.1% and 95% for the two thresholds. The name of the regime is somewhat of a misnomer because you actually train on very few messages once you have a good training set. In addition to training on all the unsures, you go through the ham and spam folders every couple of days and train on any messages that meet the criteria. Normally, there are only a few messages that are not unsures that meet the criteria. I like this method because it tends to keep the number of unsures down. I don't like to constantly check the unsure folder for ham, as it can sit there for quite a few hours before I check it. This method seems to do a good job of keeping it out of there. > > I'd like a little evidence that this would help future classification. > While it may be intuitive to you, Uncle Timmy has made way too > many comments > about ignoring that intuition to ignore :) Of course. I think there was a fair amount of discussion on this particular regime some months ago. My recollection was that there was some CV data presented to show that it was better than train on errors, which is what the Outlook plug-in is currently set up to do. 
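A minimal sketch of the TOAE selection rule described above, assuming scores are spam probabilities between 0.0 and 1.0 and using the 0.1% and 95% thresholds quoted in the message. The function and constant names are hypothetical and are not part of the SpamBayes API.

```python
# Hypothetical illustration of train-on-almost-everything (TOAE) selection.
# Thresholds come from the message above; every name here is made up.

HAM_TRAIN_ABOVE = 0.001   # train on ham that scores higher than 0.1%
SPAM_TRAIN_BELOW = 0.95   # train on spam that scores lower than 95%


def needs_training(score, is_spam):
    """Return True if a correctly filed message should still be trained on."""
    if is_spam:
        return score < SPAM_TRAIN_BELOW
    return score > HAM_TRAIN_ABOVE


def select_for_training(messages):
    """messages: iterable of (msg_id, score, is_spam) tuples."""
    return [msg_id for msg_id, score, is_spam in messages
            if needs_training(score, is_spam)]


if __name__ == "__main__":
    sample = [
        ("ham-1", 0.0002, False),   # confidently ham: skipped
        ("ham-2", 0.03, False),     # ham, but scored too high: train
        ("spam-1", 0.999, True),    # confidently spam: skipped
        ("spam-2", 0.80, True),     # spam, but scored too low: train
    ]
    print(select_for_training(sample))
```

With a well-trained database most messages fall outside both bands, which matches the observation that very few messages end up being trained.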
-- Seth Goodman
From popiel at wolfskeep.com Thu May 13 12:13:49 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu May 13 12:13:53 2004 Subject: [spambayes-dev] button ideas (oh boy) In-Reply-To: Message from "Seth Goodman" of "Thu, 13 May 2004 05:32:26 CDT." References: Message-ID: <20040513161349.5AF892DF8A@cashew.wolfskeep.com>
In message: "Seth Goodman" writes:
>> From: Mark Hammond
>> Sent: Thursday, May 13, 2004 4:57 AM
>>
>> I'd like a little evidence that this would help future classification.
>> While it may be intuitive to you, Uncle Timmy has made way too
>> many comments
>> about ignoring that intuition to ignore :)
>
>Of course. I think there was a fair amount of discussion on this particular
>regime some months ago. My recollection was that there was some CV data
>presented to show that it was better than train on errors, which is what the
>Outlook plug-in is currently set up to do.
For evidence and pretty graphs, look at:
http://www.entrian.com/sbwiki/TrainOnErrorsAndUnsures
vs.
http://www.entrian.com/sbwiki/TrainOnAlmostEverything
- Alex
From glenn at myri.com Thu May 13 13:13:35 2004 From: glenn at myri.com (Glenn Brown) Date: Thu May 13 13:13:39 2004 Subject: [spambayes-dev] My experience with SpamBayes. Message-ID: <002401c4390d$a0e7a430$1208a8c0@Glenn>
Folks,
I used to use SpamBayes' Outlook plugin naively, and saw poor results (frequent false negatives and 25% of messages went into "Junk Suspects" with an imbalanced DB of >10000 messages). Then I did the following, and I'm seeing near-perfect classification with 55 ham and 83 spam in my DB.
I retrained, trying to train only on the hammiest spam and spammiest ham. Specifically, I did the following:
Turn off "train that a message is * when it is moved *"
Train on 5 ham and 5 spam.
Put a large recent ham+spam collection in Junk Suspects.
while (messages in Junk Suspects) {
    Use "Filter Messages" to only score (but not filter) Junk Suspects;
    Move all spam scoring > 90% to Junk Email.
    Move all ham scoring < 10% to Inbox.
    Use "delete as spam" on the lowest-scoring spam.
    Use "recover from spam" on the highest-scoring ham.
}
Set the ham threshold to 10% and spam threshold to 60%, so I only use "delete as spam" on hammy spam, reducing the tendency to create an imbalanced DB in daily use.
The process was fairly slow, but the results are excellent.
IMHO, it's a poor piece of software that requires the user to manually balance a database and/or develop the expertise to manually train as I did. These problems will not go away as long as developers continue to compare results using only balanced ham+spam sets, ignoring the plight of the naive user.
I send a big "Thank you." to Kenny Pitt for suggesting the general approach, and to all SpamBayes developers for providing this useful free software. I look forward to future developments.
Cheers, --Glenn
From sethg at GoodmanAssociates.com Thu May 13 13:15:05 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu May 13 13:15:20 2004 Subject: [spambayes-dev] button ideas (oh boy) In-Reply-To: <20040513161349.5AF892DF8A@cashew.wolfskeep.com> Message-ID:
> From: T. Alexander Popiel
> Sent: Thursday, May 13, 2004 11:14 AM
>
> <...>
> For evidence and pretty graphs, look at:
> http://www.entrian.com/sbwiki/TrainOnErrorsAndUnsures
> vs.
> http://www.entrian.com/sbwiki/TrainOnAlmostEverything
Thanks, Alex! I'm relieved that I didn't dream the whole thing up.
-- Seth Goodman From barry at python.org Thu May 13 19:21:47 2004 From: barry at python.org (Barry Warsaw) Date: Thu May 13 19:22:03 2004 Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> Message-ID: <1084490505.28228.843.camel@anthem.wooz.org> On Mon, 2004-05-03 at 21:29, Tony Meyer wrote: > > I'm copying the spambayes list > > since people started reporting this problem on this list too. > > I've moved this to cc spambayes-dev instead, because we're already > discussing this there, and it'll just get lost in the bug reports on the > main list. > > > I suspect that the crash occur because these messages have > > multipart boundaries but have a text content type header. > > That seems to be correct. > > Two additional notes: > > Skip Montanaro thinks that he had a message like this fail with Python 2.2.3 > and email 2.5.3, but work fine with Python from CVS and version 2.5.5 of the > email package, so that might be worth looking into. He's going to check > whether this is the case or not. It didn't until about 5 minutes ago, but it does (work fine) in email 2.5.5 now. Fortunately, we snuck it in under the Python 2.3.4 wire. 
-Barry From barry at python.org Thu May 13 19:25:49 2004 From: barry at python.org (Barry Warsaw) Date: Thu May 13 19:26:02 2004 Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not In-Reply-To: <40974E21.5090905@gabuzomeu.net> References: <1ED4ECF91CDED24C8D012BCF2B034F13062699B7@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2BE4@its-xchg4.massey.ac.nz> <16535.1393.304440.918139@montanaro.dyndns.org> <40974E21.5090905@gabuzomeu.net> Message-ID: <1084490749.28228.845.camel@anthem.wooz.org> On Tue, 2004-05-04 at 04:02, Alexandre Ratti wrote: > In case you need more test data, I have saved 3 messages that crashed > Spambayes and the email package (2.5.4): > > http://alexandre.ratti.free.fr/python/email/ None of these crash email 2.5.5 now. -Barry From barry at python.org Thu May 13 19:28:57 2004 From: barry at python.org (Barry Warsaw) Date: Thu May 13 19:29:10 2004 Subject: [spambayes-dev] RE: [Email-SIG] Maybe a bug, maybe not In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BF1@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BF1@its-xchg4.massey.ac.nz> Message-ID: <1084490936.28228.848.camel@anthem.wooz.org> On Wed, 2004-05-05 at 18:36, Tony Meyer wrote: > 1. I'm a little concerned that our message class has the same name as the > email.Message one. I think renaming ours to "SBMessage" or something like > that would be cleaner. Mailman does something similar. It subclasses email.Message.Message as Mailman.Message.Message and it hasn't been much of a problem. -Barry From tameyer at ihug.co.nz Thu May 13 19:46:30 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu May 13 19:46:46 2004 Subject: [spambayes-dev] My experience with SpamBayes. 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306556DA7@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677DEF@its-xchg4.massey.ac.nz>
> I used to use SpamBayes' Outlook plugin naively, and saw poor results
> (frequent false negatives and 25% of messages went into "Junk
> Suspects" with an imbalanced DB of >10000 messages). Then I did the
> following, and I'm seeing near-perfect classification with 55 ham
> and 83 spam in my DB.
[description of non-edge training snipped]
> IMHO, it's a poor piece of software that requires the user to manually
> balance a database and/or develop the expertise to manually
> train as I did.
It's not 100% clear whether nonedge (like you did) or mistake-based training works better (testing has been limited - most of it by Alex and then me), and the idea of mistake-based training has been around longer, so that is what the plug-in is optimised for.
What you originally did is highly unlikely to be mistake-based training, if you had over 10,000 messages trained (and the database would probably have been roughly balanced, too). If you had, then you would probably have seen results about as good as you are getting now.
FWIW, to do mistake-based training with the plug-in:
* Don't train anything. Everything ends up unsure.
* Train all mistakes - this means everything unsure, all false positives, and all false negatives.
Simple, right? And to do the training, you only have to use either the "Delete As"/"Recover From" buttons, or the drag-and-drop method (with the incremental training options on). It only takes a few messages for classification to be good, so hardly any time is involved, and, like nonedge, you end up with a small database.
Until it's clearer which training regime is actually best, leaving the plug-in set up for mistake-based training seems wise, given that people are familiar with it and it's very simple.
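A rough sketch of the mistake-based rule listed above: classify a message from its score, and train only when the verdict is unsure or disagrees with what the user says the message is. The cutoff values (0.20 and 0.90) and all names are assumptions made for illustration; this is not the plug-in's code.

```python
# Hypothetical illustration of mistake-based training.
# Cutoffs are assumed values; nothing here is the SpamBayes API.

HAM_CUTOFF = 0.20    # scores below this are treated as ham
SPAM_CUTOFF = 0.90   # scores at or above this are treated as spam


def classify(score):
    """Map a spam probability to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(score, is_spam):
    """Train on unsures, false positives and false negatives only."""
    actual = "spam" if is_spam else "ham"
    return classify(score) != actual


if __name__ == "__main__":
    for score, is_spam in [(0.05, False), (0.50, False), (0.95, False), (0.10, True)]:
        print(score, is_spam, classify(score), should_train(score, is_spam))
```

With an empty database every message would score as unsure, so everything is trained at first and the amount of training drops off as classification improves.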
That said, there are plans to make changes to make it easier to try out other training regimes - automatically refiltering the unsure folder, for example. However, the focus at the moment is getting 1.0 out the door - and this means not making any major changes, and focusing on stability. Once work on 1.1 starts, these sorts of things will get added in. (Maybe some sort of train-to-exhaustion scheme, too, which enforces balance).
Don't forget that you're using bleeding-edge software, here, along with reasonably bleeding-edge ideas.
> These problems will not go away as long as developers
> continue to compare results using only balanced ham+spam sets, ignoring
> the plight of the naïve user.
I hope you are not referring to the SpamBayes developers here. If you are, then you should consider looking at the tests that have been done. Start with the link to Alex's tests above, and you'll immediately find tests looking at the effects that an imbalance has. Look through the spambayes-dev/spambayes archives for cross-validation testing, and consider how many actually use balanced corpora (hint: not all).
If there's something in the documentation (etc) that you think encourages people to train-on-everything (which is probably what you were initially doing), then point it out and we'll address it. Once upon a time, train-on-everything seemed like the best thing to do, so there could easily be legacy text that encourages it. This is volunteer-based open-source - to get better, people need to contribute!
=Tony Meyer
From tim.one at comcast.net Thu May 13 20:05:50 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu May 13 20:05:55 2004 Subject: [spambayes-dev] My experience with SpamBayes. In-Reply-To: <002401c4390d$a0e7a430$1208a8c0@Glenn> Message-ID:
[glenn@myri.com]
> ...
> IMHO, it's a poor piece of software that requires the user to manually
> balance a database and/or develop the expertise to manually train as
> I did.
These problems will not go away as long as developers continue > to compare results using only balanced ham+spam sets, ignoring the > plight of the naive user. Naive users are also welcome to leave our code undownloaded . The problem here really stems from that truly good training strategies have been much more an area of research than of "mere" development. The project was initially aimed at filtering high-volume tech mailing lists on server-class machines, and all of the hundreds of hours of research I did at the start (I started this project, BTW) were aimed at that. It was a bonus and a small surprise that it *can* work as well as it does for individual low-volume email use too -- but effective training strategies for that weren't known at the time the code was released. It's still seems muddy what works well *over time* for individuals (you haven't gotten there yet -- something that works great at first may degrade badly over time). I still blow away my training entirely a few times per year, and start over from scratch. > I send a big "Thank you." to Kenny Pitt for suggesting the general > approach, and to all SpamBayes developers for providing this useful > free software. I look forward to future developments. Thank you! From Byrganov at everest.radiology.uiowa.edu Thu May 13 20:48:23 2004 From: Byrganov at everest.radiology.uiowa.edu (Byrganov@everest.radiology.uiowa.edu) Date: Fri May 14 23:11:30 2004 Subject: [spambayes-dev] Spambayes-dev, How do they f@k.k with snakes? In-Reply-To: References: Message-ID: <5L8643A8EGFHI1K7@everest.radiology.uiowa.edu> Looks like you've come to a real Z00 here! Yeap! We have goats, we have horses, sheep, snakes, even dogs! e have lots of @n1m@ls here and we also have lots of g1r|s who just love to have some s. e -x with these creatures? How do they do it? http://zoo-action.com/av/val/?nAkOg How do they sa-ck those c0c.k-s? How do they f@kk with snakes? Snakes don't have c0c.k-s!!! Guys! 
Our g1r|s can do it with every creature they want! They are ready for it! They are tired from men! They do realize that wild @n1m@ls are f@kking like no man would ever f@kk them. Cause they are animals and they f@kk just like everybody did thousands and millions years ago! http://zoo-action.com/av/val/?ltzbg Stunning 1ma-.ges, v1de0s, art series, lots of @n1m@ls, y0.u-n.g horny g1r|s spre@d1ng their legs and s@kking c0c-k.s! This is a first ever -X-.-X-.-X- zoo where every g1r| can f@kk the creature she wants! LOOK AT THIS NOW! jdaKigGG jtFdkAOPH From otrcomm at isp-systems.com Sat May 15 00:16:18 2004 From: otrcomm at isp-systems.com (OTR Comm) Date: Sat May 15 00:16:30 2004 Subject: [spambayes-dev] SPAM In This List Message-ID: <40A59992.76F08A24@isp-systems.com> Hello, I guess this list isn't protected with SpamBayes huh? I received some SPAM through this list from someone pushing zoo-action.com. The message came through my SpamBayes filter as 'unsure.' The question I have is, if I reclassify it as spam, will it also make other emails addressed to spambayes-dev@python.org then get classified as spam also? Thanks, Murrah Boswell From skip at pobox.com Sat May 15 00:15:43 2004 From: skip at pobox.com (Skip Montanaro) Date: Sat May 15 00:22:57 2004 Subject: [spambayes-dev] Error with the new FeedParser Message-ID: <16549.39279.700843.525979@montanaro.dyndns.org> I cvs up'd my Python tree today and reinstalled (I generally run from CVS HEAD, silly me I suppose). I quickly got a failure from some of my collected spam when running tte.py. Isolating the culprit message (attached) I ran it through sb_filter.py using sb_filter.py spam10 and got this traceback: Traceback (most recent call last): File "/Users/skip/local/bin/sb_filter.py", line 257, in ? 
main() File "/Users/skip/local/bin/sb_filter.py", line 246, in main for msg in mbox: File "/Users/skip/local/lib/python2.4/mailbox.py", line 35, in next return self.factory(_Subfile(self.fp, start, stop)) File "/Users/skip/local/lib/python2.4/site-packages/spambayes/mboxutils.py", line 129, in get_message msg = email.message_from_string(obj) File "/Users/skip/local/lib/python2.4/email/__init__.py", line 45, in message_from_string return Parser(_class, strict=strict).parsestr(s) File "/Users/skip/local/lib/python2.4/email/Parser.py", line 67, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "/Users/skip/local/lib/python2.4/email/Parser.py", line 56, in parse feedparser.feed(data) File "/Users/skip/local/lib/python2.4/email/FeedParser.py", line 145, in feed self._call_parse() File "/Users/skip/local/lib/python2.4/email/FeedParser.py", line 149, in _call_parse self._parse() File "/Users/skip/local/lib/python2.4/email/FeedParser.py", line 317, in _parsegen mo = boundaryre.match(line) TypeError: expected string or buffer I'll file a bug report on SF against the email package, but I thought I'd also post here in case Barry or Anthony are awake at this hour... Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/octet-stream Size: 215750 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040514/bbe95114/attachment-0001.obj From tim.one at comcast.net Sat May 15 00:33:22 2004 From: tim.one at comcast.net (Tim Peters) Date: Sat May 15 00:33:29 2004 Subject: [spambayes-dev] SPAM In This List In-Reply-To: <40A59992.76F08A24@isp-systems.com> Message-ID: [OTR Comm] > I guess this list isn't protected with SpamBayes huh? I received some > SPAM through this list from someone pushing zoo-action.com. People often need to discuss spam on this list, including examples, so no, no filtering of any kind is done here. This list isn't moderated either. 
> The message came through my SpamBayes filter as 'unsure.' The question > I have is, if I reclassify it as spam, will it also make other emails > addressed to spambayes-dev@python.org then get classified as spam also? That's what I do, and it doesn't hurt. SpamBayes gives all tokens equal weight, and the "to" address is just one token. What I have seen: if the first time I see a specific spam campaign is because someone sends a sample here, that will probably rate Unsure, I'll train it as ham, and then because of that ham training when I start getting instances of the campaign *as* spam they'll also score as ham. One or two trainings on those (as spam, of course) generally takes care of that (although I have some non-default options set, and that may be relevant). From otrcomm at isp-systems.com Sat May 15 01:06:33 2004 From: otrcomm at isp-systems.com (OTR Comm) Date: Sat May 15 01:06:44 2004 Subject: [spambayes-dev] SPAM In This List References: <200405150433.i4F4XQoN025233@wildapache.net> Message-ID: <40A5A559.3B8C64E0@isp-systems.com> Thanks! > spam they'll also score as ham. One or two trainings on those (as spam, of > course) generally takes care of that (although I have some non-default > options set, and that may be relevant). If you don't mind, what are these 'non-default options?' From skip at pobox.com Sat May 15 12:05:04 2004 From: skip at pobox.com (Skip Montanaro) Date: Sat May 15 12:03:04 2004 Subject: [spambayes-dev] Error with the new FeedParser In-Reply-To: <16549.39279.700843.525979@montanaro.dyndns.org> References: <16549.39279.700843.525979@montanaro.dyndns.org> Message-ID: <16550.16304.868910.552710@montanaro.dyndns.org> Skip> I cvs up'd my Python tree today and reinstalled (I generally run Skip> from CVS HEAD, silly me I suppose). I quickly got a failure from Skip> some of my collected spam when running tte.py. ... 
Skip> mo = boundaryre.match(line) Skip> TypeError: expected string or buffer Should anyone else encounter this, I submitted a bug report on SF last night: http://python.org/sf/954320 and attached a patch this morning which seems to fix the problem. Skip From barry at python.org Sat May 15 12:19:43 2004 From: barry at python.org (Barry Warsaw) Date: Sat May 15 12:19:53 2004 Subject: [spambayes-dev] Error with the new FeedParser In-Reply-To: <16550.16304.868910.552710@montanaro.dyndns.org> References: <16549.39279.700843.525979@montanaro.dyndns.org> <16550.16304.868910.552710@montanaro.dyndns.org> Message-ID: <1084637982.1350.229.camel@anthem.wooz.org> On Sat, 2004-05-15 at 12:05, Skip Montanaro wrote: > Skip> I cvs up'd my Python tree today and reinstalled (I generally run > Skip> from CVS HEAD, silly me I suppose). I quickly got a failure from > Skip> some of my collected spam when running tte.py. > > ... > > Skip> mo = boundaryre.match(line) > Skip> TypeError: expected string or buffer > > Should anyone else encounter this, I submitted a bug report on SF last > night: > > http://python.org/sf/954320 > > and attached a patch this morning which seems to fix the problem. That's exactly the right patch. I have the same one sitting in my cvs checkout, but I've been trying to boil down the spam10 example to add it to the test suite. Thanks! -Barry From barry at python.org Sat May 15 12:28:53 2004 From: barry at python.org (Barry Warsaw) Date: Sat May 15 12:29:00 2004 Subject: [spambayes-dev] Error with the new FeedParser In-Reply-To: <1084637982.1350.229.camel@anthem.wooz.org> References: <16549.39279.700843.525979@montanaro.dyndns.org> <16550.16304.868910.552710@montanaro.dyndns.org> <1084637982.1350.229.camel@anthem.wooz.org> Message-ID: <1084638532.1350.231.camel@anthem.wooz.org> On Sat, 2004-05-15 at 12:19, Barry Warsaw wrote: > That's exactly the right patch. 
I have the same one sitting in my cvs > checkout, but I've been trying to boil down the spam10 example to add it > to the test suite. Fixed in cvs. -Barry From tameyer at ihug.co.nz Sun May 16 00:00:15 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun May 16 00:00:25 2004 Subject: [spambayes-dev] button ideas (oh boy) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13063B60EA@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C021E@its-xchg4.massey.ac.nz> > 1) In the Inbox (and other watched ham folders), there could > be a second button for "Train As Good" for messages that have > scores too far from zero. Pressing the button would train the > message as ham and keep it in the current folder. I don't have much of a problem (post 1.0) with a "Train/Keep As Good" button, but I don't really like the idea of having it only appear for certain messages - seems far too likely to confuse people. There's an open feature request for this: [ 796129 ] 'Keep as Good' Button > 2) In the Spam folder, there could be a second button for > "Train As Spam" for messages that have scores too far from 100%. > Pressing the button would train on the message as spam and keep > it in the Spam folder. If the button in 1) was added, it would make sense to add this, simply to keep the balance. This sounds like a reasonable idea, and I'm sure Mark would be ok with us adding these either post-1.0 or in a branch, and seeing what people thought of them. I'll change the tracker (above) to be assigned to me, and try and whip up a patch for this at some point (i.e. not in the next couple of weeks). =Tony Meyer --- Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy. 
From tim.one at comcast.net Sun May 16 02:10:15 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun May 16 02:10:16 2004 Subject: [spambayes-dev] SPAM In This List In-Reply-To: <40A5A559.3B8C64E0@isp-systems.com> Message-ID: [Tim] >> ... (although I have some non-default options set, and that may be >> relevant). [OTR Comm] > If you don't mind, what are these 'non-default options?' """ [Tokenizer] replace_nonascii_chars: True record_header_absence: True mine_received_headers: True [Classifier] use_bigrams: True """ use_bigrams in particular has major effects, creating a much larger database packed with many more hapaxes (tokens that appear only once). The classifier learns faster when it's enabled (less training is needed to get to a comparable level of effectiveness). OTOH, the database is much larger than without it, and over time it's unclear whether it retains an effectiveness advantage. In large-scale train-on-everything tests quite some time ago, leaving it off did just as well, and created a much smaller database, so use_bigrams didn't have anything to recommend it for high-volume applications on server-class machines. The jury is still out on whether the tradeoffs differ for personal classifiers. From kennypitt at hotmail.com Mon May 17 15:03:08 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon May 17 15:04:38 2004 Subject: [spambayes-dev] button ideas (oh boy) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C021E@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> 1) In the Inbox (and other watched ham folders), there could be a >> second button for "Train As Good" for messages that have scores too >> far from zero. Pressing the button would train the message as ham and >> keep it in the current folder. > > I don't have much of a problem (post 1.0) with a "Train/Keep As Good" > button, but I don't really like the idea of having it only appear for > certain messages - seems far too likely to confuse people. 
There's > an open feature request for this: > > [ 796129 ] 'Keep as Good' Button > 498106> > >> 2) In the Spam folder, there could be a second button for "Train As >> Spam" for messages that have scores too far from 100%. >> Pressing the button would train on the message as spam and keep it in >> the Spam folder. > > If the button in 1) was added, it would make sense to add this, > simply to keep the balance. > > This sounds like a reasonable idea, and I'm sure Mark would be ok > with us adding these either post-1.0 or in a branch, and seeing what > people thought of them. I'll change the tracker (above) to be assigned > to me, and try and whip up a patch for this at some point (i.e. not in > the next couple of weeks). We already have buttons to perform both training functions ("this is spam" and "this is good"), but the current labels could be misleading if we repurpose them. Even now the labels can be confusing to some (e.g. "Delete as Spam" doesn't really delete anything). I suggest renaming the buttons to more generic "Spam" and "Not Spam" (which also reduces the amount of space used by the toolbar). Keeping in mind that we can only train a given message once, we'll probably still need to dynamically determine which buttons are available for each message. Disabling inappropriate buttons might be better than removing them so that the toolbar isn't bouncing around as the user moves to different messages in the folder. I suggest the following rules for enabling the buttons. These would be based on the currently selected message, and would be the same regardless of which folder the message is stored in. The rules are for a single message, and there may be issues if the user has selected multiple messages where some have been trained and some haven't. 1. If the message has never been trained, enable both the "Spam" and "Not Spam" buttons. 2. If the message was previously trained as good, enable the "Spam" button and disable the "Not Spam" button. 3. 
If the message was previously trained as spam, disable the "Spam" button and enable the "Not Spam" button. If I get a chance, I'll have a look at what it would take to check the state each time the user changes the message selection. I've considered trying something like this several times, but just haven't gotten around to it. I'm pretty slammed with the work that pays the bills right now, so Tony may still beat me to it. -- Kenny Pitt From Eicker at eWerx.com Tue May 18 01:34:39 2004 From: Eicker at eWerx.com (=?iso-8859-1?Q?Eicker_|_eWerx!..communications=B0?=) Date: Tue May 18 01:35:02 2004 Subject: [spambayes-dev] ReleBayes - Food For Tought? Message-ID: <009801c43c99$d5ca8a20$1a0cfea9@eWerxD600> Hi! First of all: I'd like to say *thank you* for your great piece of software that helps to save time every day! Second: I've thought about the potential of Bayes-filters since I've installed SpamBayes and can see how good it works every day. What's on my wish list now is "ReleBayes": I mean a filter that not only kills spam (that's SpamBayes) but a filter that sorts ham-mails by relevancy. It would be even easier for the user to use and even more powerful: - Relevancy should be learned when a user *replies* to an email. I believe you respond to 15-25% of your eMails only like I do. - If there's a new mail that ReleBayes scores as relevant it could simply be ranked top in the afflicted folder instead of being ranked by date. I hope this is interesting food for thoughts. Smile!.. Gerrit eWerx!..communications? | Werbung Marketing PR? | beyond the noise? http://eWerx.com/news/ E: mailto:0700@eWerx.com T: 0700/eWerxcom [ 0700/39379266 ] F: 0700/eWerxcom [ 0700/39379266 ] P: Heerstr. 
134 | D-58553 Halver
From ta-meyer at ihug.co.nz Tue May 18 03:51:59 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Tue May 18 03:52:11 2004 Subject: [spambayes-dev] PSF Donations Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C022C@its-xchg4.massey.ac.nz>
Various comments users have made recently have made me wonder what sort of donations SpamBayes is generating for the PSF. Is there any chance that we could occasionally get information like "there were x donations in the last x months, which totalled $x"? Is there a reason that we shouldn't get this information?
I'm curious more than anything, although the information could be useful for things (like convincing the PSF to pay for the domain name, or convincing another project that this is a useful way to help the PSF out).
I suppose I should ask some PSF person this, but some of you reading this are PSF people, right?
=Tony Meyer
From tdickenson at geminidataloggers.com Tue May 18 03:57:08 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Tue May 18 03:57:13 2004 Subject: [spambayes-dev] sb_bnfilter performance In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677D6B@its-xchg4.massey.ac.nz> Message-ID: <200405180857.08260.tdickenson@geminidataloggers.com>
On Wednesday 05 May 2004 23:21, Tony Meyer wrote:
> Out of curiosity, have you profiled sb_bnserver at all?
I've spent a little time on this, testing using sb_bnfilter to filter my whole inbox and spam folder with psyco turned off.
The profiler showed one hot lambda function in the classifier, eliminated in the patch below. Repeating the test without the profiler showed only a few percent increase in speed. Unless there are objections I will commit this change anyway; to my eyes it is also a small readability improvement.
After that, most of the time goes into bsddb.
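As a rough illustration of why dropping a comparison lambda can help: a comparison function is called from Python for every pairwise comparison during the sort, while a plain sort of tuples compares them natively. The sketch below assumes clues are (probability, word, record) tuples with made-up values; the helper names are hypothetical, and `cmp` is recreated by hand because Python 3 no longer provides it.

```python
from functools import cmp_to_key


def cmp(a, b):
    """Stand-in for Python 2's built-in cmp(), which Python 3 removed."""
    return (a > b) - (a < b)


def clues_with_cmp(clues):
    """Strip the record first, then sort (word, prob) pairs with a lambda."""
    pairs = [(w, p) for p, w, r in clues]
    pairs.sort(key=cmp_to_key(lambda a, b: cmp(a[1], b[1])))
    return pairs


def clues_with_plain_sort(clues):
    """Sort the (prob, word, record) tuples directly, then strip the record."""
    ordered = sorted(clues)
    return [(w, p) for (p, w, r) in ordered]


if __name__ == "__main__":
    # Illustrative data only; the third field stands in for a per-word record.
    clues = [(0.91, "viagra", 3), (0.02, "python", 7), (0.50, "hello", 1)]
    print(clues_with_cmp(clues))
    print(clues_with_plain_sort(clues))
```

The two approaches agree whenever the probabilities are distinct. When two clues share a probability, the lambda version keeps their original order, while the plain tuple sort falls back to comparing the words (and then the records), so the order of tied clues can differ.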
We currently call shelve.get once for each token; which calls both bsddb.__getattr__ *and* .has_attr. Hacking shelve.py to replace the has_attr call with a KeyError exception handler gave roughly a 10% gain. Nice, but not enough to tempt me to polish any changes. Profiler output attached. -- Index: spambayes/classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/classifier.py,v retrieving revision 1.23 diff -c -2 -r1.23 classifier.py *** spambayes/classifier.py 6 Feb 2004 21:43:00 -0000 1.23 --- spambayes/classifier.py 18 May 2004 05:43:10 -0000 *************** *** 221,226 **** if evidence: ! clues = [(w, p) for p, w, r in clues] ! clues.sort(lambda a, b: cmp(a[1], b[1])) clues.insert(0, ('*S*', S)) clues.insert(0, ('*H*', H)) --- 221,226 ---- if evidence: ! clues.sort() ! clues = [(w,p) for (p,w,r) in clues] clues.insert(0, ('*S*', S)) clues.insert(0, ('*H*', H)) Toby Dickenson -------------- next part -------------- 5222854 function calls (5185915 primitive calls) in 116.128 CPU seconds Ordered by: internal time, call count List reduced from 245 to 20 due to restriction <20> ncalls tottime percall cumtime percall filename:lineno(function) 59953 12.636 0.000 12.720 0.000 /usr/lib/python2.3/bsddb/__init__.py:140(has_key) 59835 8.378 0.000 9.464 0.000 /usr/lib/python2.3/bsddb/__init__.py:114(__getitem__) 931 6.478 0.007 6.533 0.007 ../scripts/sb_bnserver.py:135(get_request) 3797 6.272 0.002 21.605 0.006 /usr/lib/python2.3/email/Generator.py:162(_write_headers) 231478 5.789 0.000 31.549 0.000 /home/toby/projects/spambayes/spambayes/storage.py:253(_wordinfoget) 1128 5.247 0.005 8.652 0.008 /usr/lib/python2.3/email/Generator.py:362(_make_boundary) 400632 3.746 0.000 8.526 0.000 /home/toby/projects/spambayes/spambayes/tokenizer.py:1532(tokenize_body) 13644/13454 3.586 0.000 3.750 0.000 /usr/lib/python2.3/email/Header.py:419(_split_ascii) 930 3.528 0.004 51.244 0.055 
/home/toby/projects/spambayes/spambayes/classifier.py:430(_getclues) 930 3.449 0.004 3.449 0.004 /usr/lib/python2.3/socket.py:161(close) 930 3.181 0.003 7.975 0.009 /home/toby/projects/spambayes/spambayes/hammie.py:40(formatclues) 181183 3.089 0.000 4.111 0.000 /home/toby/projects/spambayes/spambayes/OptionsClass.py:597(get) 6631/2823 2.699 0.000 3.923 0.001 /usr/lib/python2.3/sre_parse.py:367(_parse) 231478 2.516 0.000 34.831 0.000 /home/toby/projects/spambayes/spambayes/classifier.py:504(_worddistanceget) 59835 2.508 0.000 11.972 0.000 /usr/lib/python2.3/shelve.py:114(__getitem__) 2611 2.433 0.001 2.449 0.001 /usr/lib/python2.3/email/Generator.py:191(_handle_text) 1880/930 2.019 0.001 6.066 0.007 /usr/lib/python2.3/email/Parser.py:143(_parsebody) 59996 1.830 0.000 1.830 0.000 /usr/lib/python2.3/email/Generator.py:41(_is8bitstring) 1239 1.761 0.001 1.761 0.001 /home/toby/projects/spambayes/spambayes/tokenizer.py:1175(find_html_virus_clues) 59976 1.748 0.000 2.154 0.000 /usr/lib/python2.3/email/Header.py:344(_encode_chunks) From ta-meyer at ihug.co.nz Tue May 18 04:27:24 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Tue May 18 04:27:55 2004 Subject: [spambayes-dev] ANNOUNCE: SpamBayes release 1.0rc1 Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C022E@its-xchg4.massey.ac.nz> The SpamBayes team is pleased to announce the latest release of SpamBayes - 1.0rc1. Like the last two versions, this is both a release of the source code and of an installation program for all Microsoft Windows users. The Windows installation program will install either the Outlook add-in (for Microsoft Outlook users), or the SpamBayes server program (for all other mail client users, including Microsoft Outlook Express). All Windows users (including existing users of the Outlook add-in) are encouraged to use the installation program. If you wish to use the source-code version, you will also need to install Python - see README.txt in the source tree for more information. 

This release fixes a number of reasonably minor bugs in the last release; however, we still highly recommend that existing users upgrade. For a detailed description of everything (well, everything we remember) that has changed since the last release, you can view our WHAT_IS_NEW.txt file, either online, or in the source distribution.

Get it via the 'Download' page at http://www.spambayes.org/download.html

Enjoy the new release and your spam-free mailbox :-)

Thanks to everyone involved in this release, particularly, and as usual, Mark Hammond for putting most of this release together!

Tony.
(on behalf of the SpamBayes team)

--- What is SpamBayes? ---

The SpamBayes project is working on developing a Bayesian (of sorts) anti-spam filter (in Python), initially based on the work of Paul Graham. The major difference between this and other, similar projects is the emphasis on testing newer approaches to scoring messages. The project includes a number of different applications, all using the same core code, ranging from a plug-in for Microsoft Outlook, to a POP3 proxy, to various command-line tools.

From janne.sinkkonen at hut.fi Tue May 18 06:35:14 2004
From: janne.sinkkonen at hut.fi (Janne Sinkkonen)
Date: Tue May 18 06:35:24 2004
Subject: [spambayes-dev] ReleBayes - Food For Tought?
In-Reply-To: <009801c43c99$d5ca8a20$1a0cfea9@eWerxD600>
References: <009801c43c99$d5ca8a20$1a0cfea9@eWerxD600>
Message-ID: <200405181335.14784.janne.sinkkonen@hut.fi>

On Tuesday 18 May 2004 08:34, Eicker | eWerx!..communications° wrote:
> What's on my wish list now is "ReleBayes": I mean a filter that not
> only kills spam (that's SpamBayes) but a filter that sorts ham-mails
> by relevancy. It would be even easier for the user to use and even
> more powerful:
>
> - Relevancy should be learned when a user *replies* to an email. I
> believe you respond to 15-25% of your eMails only like I do.

I have done this, running two incarnations of spambayes (in Linux). I split mail into two folders, one for mail I probably won't reply to and one for mail I probably will. This kind of works, at least well enough that I want to stick with it.

Implementation is simple enough - just run two spambayes with different initialization files, and tune options to give different header names etc.

I agree that there is potential in automatic text analysis and classification. Spambayes would be a good framework for trying various kinds of new probabilistic text analysis techniques (Latent Dirichlet Allocation, multinomial PCA, etc.) with some kind of discriminative approach.

--
Janne

From perl at rhesa.com Tue May 18 21:19:53 2004
From: perl at rhesa.com (Rhesa Rozendaal)
Date: Tue May 18 21:20:02 2004
Subject: [spambayes-dev] Re: [Spambayes] ANNOUNCE: SpamBayes release 1.0rc1
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C022E@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13064C022E@its-xchg4.massey.ac.nz>
Message-ID: <40AAB639.4090402@rhesa.com>

Tony Meyer wrote:
> The SpamBayes team is pleased to announce the latest release of SpamBayes -
> 1.0rc1.

First off, thank you very much for the new release. I just finished installing it, and it feels very good! The training web interface is a great deal more responsive, initial training of about 1000 messages was very quick, and it contains many other small improvements. I like the added statistics, and the many added advanced and experimental options.

I do have two small comments though:

- I was already running pop3proxy as a service, and the installer didn't want to install the new version. The reason at first was an unrecognised option in my old bayescustomize.ini in the [html_ui] section. After I commented out the offending line, it just crashed. Of course I was running an older version (0.7a I think), and since this is only 1.0, no big deal. And manually removing the old service and installing the new version fixed it.
[ I'm sorry to say I was so stupid to forget noting the offending option before throwing everything away, so I cannot include it here. I only remember there were two options in that section, and the offending one ended in _to, value True ] - Two of the new advanced options immediately caught my eyes: Default training for spam/ham. I always had to manually check 'discard' for those, so these looked like a real time saver. Given that I receive an average of 300 messages a day, of which ~50% is spam, and I only want to train on selected unsures, you can imagine it takes quite a bit of scrolling to find the spam section. I was hoping I could just quickly set the unsures to the correct values, and jump down to the page to press Train, but unfortunately these options are not honored. All hams still default to Ham, and all spams to Spam. [ A very small improvement for me would be a Train button at the top of the page. I trust Spambayes enough in its ham and spam judgments, so I do not need to look at those sections. In the past year, I have never had a false positive, and definitely less than 10 false negatives ] Overall, it looks like you greatly improved an already awesome product! So thank you very, very much! Rhesa Rozendaal > Tony. > (on behalf of the SpamBayes team) > From tim.one at comcast.net Wed May 19 00:29:54 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed May 19 00:29:56 2004 Subject: [spambayes-dev] PSF Donations In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C022C@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > Various comments users have made recently have made me wonder what sort > of donations SpamBayes is generating for the PSF. > > Is there any chance that we could occasionally get information like > "there were x donations in the last x months, which totalled $x"? Yes, but it's slim . Neal Norwitz is the PSF Treasurer, and I believe is the only person with ready access to this info. 
All his work for the PSF is "spare time" volunteer stuff too, so I hesitate to ask him for more (he does an amazing amount of work for the PSF already). He's working on making regular financial summary reports to the PSF membership. We don't want to post every detail to a web site for the same reasons no business (non-profit or not) does -- besides just attracting cranks and nuisance lawsuits, there are privacy issues too. > Is there a reason that we shouldn't get this information? The same reason SpamBayes doesn't have a whitelist gimmick: lack of the combination of interest and spare time on the part of the people able to supply one . > I'm curious more than anything, although the information could be useful > for things (like convincing the PSF to pay for the domain name, or > convincing another project that this is a useful way to help the PSF out). > > I suppose I should ask some PSF person this, but some of you reading this > are PSF people, right? Yes, at least 7 PSF members have posted here, including two directors and an officer. I'm comfortable summarizing one of Neal's recent summaries to the Board like so: in the 9 months since the PSF started accepting donations "for real": 354 contributions totaling $19,172.37. Of those, 131 contributions totaling $2,885.50 came from the SpamBayes PayPal button or via a check with a SpamBayes annotation. So SpamBayes accounts for about a third of the contributions and about a sixth of the contributed dollars so far. It's by far the largest source of contributions that don't come directly from the PSF website. I think it's a fine showing, especially considering that we usually encourage people not to contribute . 
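
The proportions quoted in the message above can be checked with a few lines of arithmetic. This is only a minimal sketch: the four figures are taken from the message, while the variable names and the `share` helper are illustrative and not part of any SpamBayes or PSF tooling.

```python
# Figures quoted in the message above (nine months of PSF donations).
total_count, total_dollars = 354, 19172.37   # all contributions
sb_count, sb_dollars = 131, 2885.50          # attributed to SpamBayes


def share(part, whole):
    """Return part/whole as a fraction (hypothetical helper)."""
    return part / whole


count_share = share(sb_count, total_count)        # fraction of contributions
dollar_share = share(sb_dollars, total_dollars)   # fraction of dollars

# Average size of a SpamBayes-sourced gift versus all other gifts.
avg_sb = sb_dollars / sb_count
avg_other = (total_dollars - sb_dollars) / (total_count - sb_count)

print(f"{count_share:.1%} of contributions, {dollar_share:.1%} of dollars")
print(f"average SpamBayes gift ${avg_sb:.2f}, average other gift ${avg_other:.2f}")
```

The count share comes out a little over one third and the dollar share a little under one sixth, which matches the rough fractions given: SpamBayes-sourced gifts are numerous but smaller on average than the rest.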
From Richard.Peik at UNISYS.com Wed May 19 18:51:09 2004 From: Richard.Peik at UNISYS.com (Peik, Richard A) Date: Wed May 19 18:51:13 2004 Subject: [spambayes-dev] SpamBayes at message-set boundary Message-ID: Over several months of use, SpamBayes never filters out the spam message that's most recent of current harvest of messages, forcing explicit DeleteAsSpam to remove. Regards. Dick (Richard A.) Peik Windows BIS, MRI and ICE Development Unisys Roseville (MN) (651) 635-3464 / N2: 524-3464 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040519/f3838f7f/attachment.html From ta-meyer at ihug.co.nz Tue May 25 03:16:04 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Tue May 25 03:17:27 2004 Subject: [spambayes-dev] PSF Donations In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306557C05@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C023F@its-xchg4.massey.ac.nz> [Tony Meyer] > Is there any chance that we could occasionally get information like > "there were x donations in the last x months, which totalled $x"? [Tim Peters] > Yes, but it's slim . [rest snipped] The broad summary that you gave was plenty to satisfy my curiosity, so I won't be bothering Neal. =Tony Meyer (Why did I bother posting this? One, to say thanks to Tim - thanks! - since I hadn't done that, and, two, because I realised that I left things looking like I might actually bother doing more about this and didn't want anyone to follow it up on my behalf.) 
From rob at hooft.net Wed May 26 16:25:43 2004 From: rob at hooft.net (Rob Hooft) Date: Wed May 26 16:30:59 2004 Subject: [spambayes-dev] Testing Tools Changes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BD1@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2BD1@its-xchg4.massey.ac.nz> Message-ID: <40B4FD47.9040305@hooft.net> Tony Meyer wrote: > [me] > >>I also changed fpfn.py to print out each message and >>offer to move it to the corresponding ham/spam set (I used it >>to check for misclassified messages), >>but it doesn't seem like this is a good addition to the script. > > > Browsing, I notice that this has been offered before (which would have saved > me the bother): > > [ 618932 ] fpfn.py: add interactivity on unix > 702&atid=498105> > > I don't know if this makes it any more/less worthwhile including, though. :-) 18 months ago! Rob From tim.one at comcast.net Thu May 27 01:48:39 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu May 27 01:48:52 2004 Subject: [spambayes-dev] RE: [Spambayes-checkins] website faq.txt,1.71,1.72 In-Reply-To: Message-ID: [Tony Meyer] > Modified Files: > faq.txt > Log Message: > Why don't you filter/moderate the list and make it so only subscribers can > post? Thanks, Tony! For many things -- this one just reminded me . From T.A.Meyer at massey.ac.nz Mon May 31 02:37:57 2004 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Mon May 31 02:38:31 2004 Subject: [spambayes-dev] RE: Brazilian Portuguese Translation Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1306874D68@its-xchg4.massey.ac.nz> [From personal email] > Anyway, I'm interested in working in a translation for the > plugin version of SB. It would be fantastic to have one by the > 1.0 final release. If you guide me through, I'd like to make one > for the OE version, too. > > I'm quite experienced with interface translations, mainly with > PostNuke CMS, but also others. 
Thanks for the offer - I agree that it would be great to have translations, and Brazilian Portuguese is as good a start as any! I doubt that this could make it into the 1.0 final release, though, since any changes between the 1.0 release candidates and the final have to be pretty small, and really ought to only be bug fixes. However, it'd be great for 1.1 (and I'm sure a 1.1a1 will arrive reasonably quickly after 1.0). I've cc'd this to the spambayes developers list, where this sort of thing gets discussed. While I'm happy to put in time to help with the internationalisation process, I don't have any experience with this at all, so it'd be better if one of the other SpamBayes developers could indicate how this should be done (I presume that there's a normal way to do it with Python scripts). Anyway, hopefully this should start the conversation flowing... =Tony Meyer From anthony at interlink.com.au Mon May 31 07:02:12 2004 From: anthony at interlink.com.au (Anthony Baxter) Date: Mon May 31 07:02:29 2004 Subject: [spambayes-dev] RE: Brazilian Portuguese Translation In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306874D68@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1306874D68@its-xchg4.massey.ac.nz> Message-ID: <40BB10B4.3040301@interlink.com.au> Speaking as a release manager, my approach is that the only fixes between a release candidate and a final release should be related to packaging, or critical bugs that have shown up in the release candidate. Other bugs can wait until the next release. New features should absolutely not be in the release. What's the status of the 1.0 release? Is there a bug blocking it? If it's likely to take much longer, it should probably be branched off now so that things like this can progress on the trunk. Anthony -- Anthony Baxter It's never too late to have a happy childhood. 

From mhammond at skippinet.com.au Mon May 31 21:10:14 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon May 31 21:10:26 2004
Subject: Release branch (was RE: [spambayes-dev] RE: Brazilian Portuguese Translation)
In-Reply-To: <40BB10B4.3040301@interlink.com.au>
Message-ID: <03a101c44775$3229c340$0200a8c0@eden>

Anthony:
> What's the status of the 1.0 release? Is there a bug blocking
> it? If it's likely to take much longer, it should probably be
> branched off now so that things like this can progress on the
> trunk.

Sorry - I kinda got distracted :)

Since 1.0rc, there have been a number of checkins, but from what I can see, pretty much are all still bugfixes, or not related to the build (eg, new checksum/sb_bnserver C module)

Therefore, I propose I create a "release_1_0-fork" tag and "release_1_0-branch" branch from the trunk as it stands now. We then consider the trunk open-slather for the 2.0 release.

Tony and I can then negotiate how much of his autoconf stuff goes in 1.0, or if we just do 1.0 asap. I vote for the latter :) In either case, we do a 1.0rc2, and this time *seriously* hope we release it as 1.0 - only major regressions would be considered for yet another RC. Tony and I repeat as necessary until we stop screwing up.

Sound OK?

Mark.

From tameyer at ihug.co.nz Mon May 31 21:46:38 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon May 31 21:46:46 2004
Subject: Release branch (was RE: [spambayes-dev] RE: Brazilian Portuguese Translation)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306966EE1@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C0264@its-xchg4.massey.ac.nz>

> Sorry - I kinda got distracted :)

Me too.

> Since 1.0rc, there have been a number of checkins, but from
> what I can see, pretty much are all still bugfixes, or not
> related to the build (eg, new checksum/sb_bnserver C module)

Changes I'd consider for 1.0[rc2] that weren't in 1.0rc1:

* The "select child folder" bug fix with multiple Exchange mailboxes. This comes close to critical, IMO.
* Installing the default_bayes_customize.ini file. This is a packaging bug, so should be included, IMO.

And maybe Skip's fixes for using db_expimp.py with Python 2.2. The rest can definitely wait.

> Therefore, I propose I create a "release_1_0-fork" tag and
> "release_1_0-branch" branch from the trunk as it stands now.

Alternatively, you could cut it from what it was like at the 1.0rc1 tag. I'm happy to copy over the various fixes (the ones above, plus Skip's fix for mboxutils, plus the ones I've commented with 'bugfix'). Your call.

> We then consider the trunk open-slather for the 2.0 release.

What's the plan with life after 1.0, anyway? Is 1.1 just for bug fixes and anything new goes in 2.0? Or does 1.1 have bug fixes and minor changes, and anything major goes in 2.0?

> In either case, we do a 1.0rc2, and this time
> *seriously* hope we release it as 1.0 - only major
> regressions would be considered for yet another RC.

What about we pick a deadline date, too? If I have a deadline, I'm less likely to get distracted.

=Tony Meyer

From t-meyer at ihug.co.nz Mon May 31 21:47:59 2004
From: t-meyer at ihug.co.nz (Tony Meyer)
Date: Mon May 31 21:48:09 2004
Subject: [spambayes-dev] RE: [Python-Dev] anonymous cvs checkout
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677EDA@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677EDD@its-xchg4.massey.ac.nz>

[Oops. Forgot to include the list in my reply]

[Edward Loper]
> I'm having trouble checking out an anonymous copy of the Python CVS
> repository.

FWIW, neither a fresh checkout nor an update works here (WinXP), either, although (at least some) other sourceforge projects do. ViewCVS doesn't work, either (shows an empty page), so it's certainly not just you.

The SF site status does mention that there's maintenance scheduled (for tomorrow) - maybe that's relevant? Does developer access still work?

=Tony Meyer

From tameyer at ihug.co.nz Mon May 31 21:53:12 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon May 31 21:53:21 2004
Subject: [spambayes-dev] RE: [Python-Dev] anonymous cvs checkout
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306966F01@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677EE0@its-xchg4.massey.ac.nz>

This is not my day. Ignore that, wrong list. The best you can say about it is that spambayes-dev and python-dev are kinda close...