From kennypitt at hotmail.com Mon Dec 1 09:40:59 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Dec 1 09:41:35 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F3@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> Well, yes. My only worry is that the trunk contains various code
> that the branch doesn't which is definitely not bug-fix and does
> change sb_server.py, UserInterface.py and ProxyUI.py a fair bit. Of
> course, if I did it right, I didn't introduce any bugs adding that
> code anyway, but ...

I've been using the updated version for quite awhile now, both from
source and with the new py2exe Windows binary. We wrung out a couple
of minor bugs early on, but I haven't had any problems with it in
several weeks.

> OTOH, it would be nice to have some of those features in a release
> sooner than May 04 :) (Especially the one that lets people submit a
> decent bug report).

Haven't tried the bug report feature, so can't give you a read on the
stability there.

>> 3) Put together a binary from my current py2exe setup script,
>> which includes CVS and a number of sb_ programs.
>
> Does one need a special version of py2exe for this? If so, is it one
> that there's a binary available for? (i.e. can I do this without
> VC++?)

It's the version in the sandbox subdirectory in py2exe CVS. There
currently isn't a binary install that I know of, but I'm sure someone
could throw one together if needed. I've been meaning to do it myself
so I can put it on my computer at home, but just haven't gotten around
to it.

On an at best partially related aside, if/when we redo the release
branching could we possibly do something with the version numbering in
Version.py? It seems a bit confusing to have a completely different
version number for every app, especially when they appear to be totally
unrelated to the "1.0a7" type release numbering.
--
Kenny Pitt

From skip at pobox.com Mon Dec 1 10:01:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 1 10:01:24 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <027801c3b472$2b4c06a0$0200a8c0@eden>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz>
 <027801c3b472$2b4c06a0$0200a8c0@eden>
Message-ID: <16331.22457.680172.948562@montanaro.dyndns.org>

    Mark> Should we abandon the branch, merging everything back to the
    Mark> trunk?

Yes. Let's reserve branches for the brief time leading up to a release
and for maintenance of old releases (if we decide that's worth doing,
which I doubt we will).

Skip

From Hugo.Duncan at alcan.com Mon Dec 1 11:44:13 2003
From: Hugo.Duncan at alcan.com (Hugo.Duncan@alcan.com)
Date: Mon Dec 1 11:41:00 2003
Subject: [spambayes-dev] sb_notesfilter.py changes
Message-ID:

Hi,

I downloaded spambayes a few days ago, saw that you had some notes
support, and tried to integrate the script to run on receipt of mail in
my notes client.

These were the changes that I had to make (attached diff file):

add -P password option to specify the notes password. Not terribly
secure I guess, but you don't have to use it if you don't want to.

make it so that the pathname of the mail database is used both on
server and on local machine.

make the replication occur only if running on the server fails.

allow redirection of stdout and stderr to file (-R filename)

allow logging to a notes database.

add file "Spam" to processed mail, to record spam probability.

I then added an "agent" (lotus speak for script) that processes new
mail, and some menu options for manually marking as spam, for unmarking
falsely classified spam and for training as ham.

Regards,
Hugo

(See attached file: sb_notesfilter.diff)

Notice:
This message and any attachments are the property of Alcan and are
intended solely for the named recipients or entity to whom this message
is addressed.
If you have received this message in error please inform the sender via
e-mail and destroy the message. If you are not the intended recipient
you are not allowed to use, copy or disclose the contents or
attachments in whole or in part.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_notesfilter.diff
Type: application/octet-stream
Size: 6216 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031201/55cee7d2/sb_notesfilter-0001.obj

From tim at fourstonesExpressions.com Mon Dec 1 11:52:23 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec 1 11:52:55 2003
Subject: [spambayes-dev] sb_notesfilter.py changes
In-Reply-To:
References:
Message-ID:

> Hi,
>
> I downloaded spambayes a few days ago, saw that you had some notes
> support, and tried to integrate the script to run on receipt of mail
> in my notes client.

Very primitive support, as you have seen. Glad to see that these were
the only changes you required. Would you let me know how it works for
you as time goes on? One thing that I noticed was that it became less
and less able to accurately classify, and I think it's related to a
couple of things. One is that notes does not give you headers in the
rfc822 sense. Thus the "no headers found" token becomes the deciding
factor if your spam/ham ratio gets way out of whack, which mine was...
like 3000:250 or something like that. It started deciding that almost
everything was spam... let me know what your experience is. If your
s:h ratio remains more reasonable, i would expect it to behave much
more reasonably.

I'd love to have a notes integration, similar to the outlook
integration, that doesn't rely on an external program using the com
interface... but that is beyond my abilities with notes.

>
> These were the changes that I had to make (attached diff file):
>
> add -P password option to specify the notes password.
> Not terribly secure I guess, but you don't have to use it if you
> don't want to.
>
> make it so that the pathname of the mail database is used both on
> server and on local machine.
>
> make the replication occur only if running on the server fails.
>
> allow redirection of stdout and stderr to file (-R filename)
>
> allow logging to a notes database.
>
> add file "Spam" to processed mail, to record spam probability.

I'll check 'em out! Thanks.

>
> I then added an "agent" (lotus speak for script) that processes new
> mail, and some menu options for manually marking as spam, for
> unmarking falsely classified spam and for training as ham.

I tried to do this, but just couldn't figure it out... if you can give
me the source for the agent, I can include it in the documentation...
again, thanks for the input.

>
> Regards,
> Hugo
>
> (See attached file: sb_notesfilter.diff)
>
> Notice:
> This message and any attachments are the property of Alcan and are
> intended solely for the named recipients or entity to whom this
> message is addressed. If you have received this message in error
> please inform the sender via e-mail and destroy the message. If you
> are not the intended recipient you are not allowed to use, copy or
> disclose the contents or attachments in whole or in part.

--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!

Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From Hugo.Duncan at alcan.com Mon Dec 1 12:23:43 2003
From: Hugo.Duncan at alcan.com (Hugo.Duncan@alcan.com)
Date: Mon Dec 1 12:20:28 2003
Subject: [spambayes-dev] Re: sb_notesfilter.py changes
Message-ID:

Tim,

Thanks for getting any sort of notes support!

> it became less and less able to accurately classify

I'll keep an eye on this. Thanks for the warning.

> One is that notes does not give you headers in the rfc822 sense.

Although you can access them in the document fields.
I just wrote an agent to extract these so that I could send stuff to
SpamCop.

> I'd love to have a notes integration, similar to the outlook
> integration, that doesn't rely on an external program using the com
> interface... but that is beyond my abilities with notes.

Presumably the outlook integration uses some sort of dll? does it
still require a python interpreter ?

> if you can give me the source for the agent

This is the "After new mail arrives" agent, in LotusScript. Not very
pretty, but it works for me.

Sub Initialize
    Err=0
    res%=Shell("c:/usr/Python23/pythonw.exe c:/usr/Python23/scripts/sb_notesfilter.py -t -c -r your_server_name -l your_db_name -f Spambayes -d notesbayes -i index_name -P your_password_here -R c:/tmp/bayes.log -L SpamBayesLog",1)
    If (Err<>0) Then
        Messagebox Error$
    End If
End Sub

The others are "SimpleAction"'s to move the mail to the appropriate
folders.

Notice:
This message and any attachments are the property of Alcan and are
intended solely for the named recipients or entity to whom this message
is addressed. If you have received this message in error please inform
the sender via e-mail and destroy the message. If you are not the
intended recipient you are not allowed to use, copy or disclose the
contents or attachments in whole or in part.

From tim at fourstonesExpressions.com Mon Dec 1 12:52:23 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec 1 12:52:43 2003
Subject: [spambayes-dev] Re: sb_notesfilter.py changes
In-Reply-To:
References:
Message-ID:

> Tim,
>
> Thanks for getting any sort of notes support!

My pleasure. Was born out of necessity.

>> One is that notes does not give you headers in the rfc822 sense.
>
> Although you can access them in the document fields. I just wrote
> an agent to extract these so that I could send stuff to SpamCop.
>

Hmmmm.... don't know if that's available to the com interface.
>> I'd love to have a notes integration, similar to the outlook
>> integration, that doesn't rely on an external program using the com
>> interface... but that is beyond my abilities with notes.
>
> Presumably the outlook integration uses some sort of dll? does
> it still require a python interpreter ?

It is a straight python program, that uses Mark Hammond's windows
interfacing dll to access the COM interface that notes sports. That
interface is not particularly rich...

What version of notes are you using? V5, I presume...

--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!

Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From richie at entrian.com Mon Dec 1 15:22:39 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec 1 15:22:49 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To:
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F3@its-xchg4.massey.ac.nz>
Message-ID:

[Kenny]
> On an at best partially related aside, if/when we redo the release
> branching could we possibly do something with the version numbering in
> Version.py? It seems a bit confusing to have a completely different
> version number for every app, especially when they appear to be totally
> unrelated to the "1.0a7" type release numbering.

+1

I wasn't around when that was introduced, but I have to say it's never
made much sense to me.

--
Richie Hindle
richie@entrian.com

From tameyer at ihug.co.nz Mon Dec 1 18:53:20 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec 1 18:53:28 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477C04@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F5@its-xchg4.massey.ac.nz>

[Kenny]
> I've been using the updated version for quite awhile now,
> both from source and with the new py2exe Windows binary.
> We wrung out a couple of minor bugs early on, but I haven't had
> any problems with it in several weeks.

Good to know.

> Haven't tried the bug report feature, so can't give you a
> read on the stability there.

It's fairly straightforward, and worked for me, so hopefully goes ok.
At the very least it should manage to give people an understanding of
what information we need.

> It's the version in the sandbox subdirectory in py2exe CVS.
> There currently isn't a binary install that I know of, but
> I'm sure someone could throw one together if needed. I've
> been meaning to do it myself so I can put it on my computer
> at home, but just haven't gotten around to it.

If you could make me a copy, that would be fantastic. At the moment I'm
stuck either testing whatever Mark throws my way or running from source
only. I suppose I could just install VC++ (we have some sort of site
license here, I gather), but I really can't be bothered.

[Kenny]
> On an at best partially related aside, if/when we redo the
> release branching could we possibly do something with the
> version numbering in Version.py? It seems a bit confusing to
> have a completely different version number for every app,
> especially when they appear to be totally unrelated to the
> "1.0a7" type release numbering.

[Richie]
> I wasn't around when that was introduced, but
> I have to say it's never made much sense to me.

The thing is that the various apps do change at different rates -
mboxtrain/filter, for example, tend to change much more slowly (I
imagine because, being simpler in concept, they're more stable to begin
with), and the core of the system hasn't changed for a long time (since
1.0a1?). As more apps start getting released separately, I think this
will become more important.

I agree that it could be improved, though.
One thing I think would help (I've made this change locally (ages ago),
but haven't checked it in) is removing the 'interface version' from
pop3proxy/imapfilter and having a separate 'interface version' (my
fault it was there). This, for example, lets you see that the interface
in this release has had significant changes, but pop3proxy itself has
not changed at all. Another thing is that it's somewhat out of date in
terms of names, which is easily fixed.

Mark could probably explain the reasoning better than me, though, since
he came up with it.

=Tony Meyer

From kennypitt at hotmail.com Tue Dec 2 09:39:35 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 2 09:40:12 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F5@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> [Kenny]
>> I've been using the updated version for quite awhile now,
>> both from source and with the new py2exe Windows binary. We
>> wrung out a couple of minor bugs early on, but I haven't had
>> any problems with it in several weeks.
>
> Good to know.

Oops, may have spoken too soon. Just noticed I'm getting the following
error whenever I save configuration in the binary version. This doesn't
happen when I run from source. Note that the only thing I changed in
config was the spam cutoff value, so it shouldn't have anything to do
with the close/reopen of the training database.
"""
500 Server error

Traceback (most recent call last):
  File "spambayes\Dibbler.pyc", line 457, in found_terminator
  File "spambayes\UserInterface.pyc", line 801, in onChangeopts
  File "spambayes\ProxyUI.pyc", line 691, in reReadOptions
ImportError: No module named Options
"""

--
Kenny Pitt

From dave at boost-consulting.com Tue Dec 2 11:37:14 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Tue Dec 2 12:30:19 2003
Subject: [spambayes-dev] Serious problem
Message-ID:

Something curious started happening to all the email I receive after I
upgraded from the "old" (pre-reorganization) Spambayes to the new one
(e.g. that includes "sb_filter.py").

When mail is run through sb_filter.py, any line which begins with a
period ("."), and all lines thereafter, are stripped from the email.
For example, the following paragraph begins with "./configure". If you
don't see it, you're seeing the bug.

./configure is the command most people use to run a configure script.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com

From skip at pobox.com Tue Dec 2 12:35:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 2 12:35:48 2003
Subject: [spambayes-dev] Serious problem
In-Reply-To:
References:
Message-ID: <16332.52584.801341.395335@montanaro.dyndns.org>

    Dave> When mail is run through sb_filter.py, any line which begins with
    Dave> a period ("."), and all lines thereafter, are stripped from the
    Dave> email. For example, the following paragraph begins with
    Dave> "./configure". If you don't see it, you're seeing the bug.

    Dave> ./configure is the command most people use to run a configure script.

I think the problem is elsewhere in your mail tool chain. I saw the dot
.at .the .beginning .of .the .line .just .fine. Also saw several other
lines after it, including your electronic signature.
Skip

From pclarke at dynapower.com Tue Dec 2 12:49:40 2003
From: pclarke at dynapower.com (Clarke, Peter)
Date: Tue Dec 2 12:49:45 2003
Subject: [spambayes-dev] new Spambayes "feature"??
Message-ID: <769511B15AAE874591E117534314507704F00D@dppexchange.dynapower.com>

THE BEST spam-filter I know about (and I know "just a little!" - tried
several, (even PAID for some!)). I believe I have only had one e-mail
falsely labeled as spam, and NONE the other way, in over two months of
heavy attack by the spam-generating community!!

Suggestion:
It would be neat if the "you've got mail" icon in the "system tray" (at
the bottom of the screen) would be deleted whenever Spambayes sends any
spam e-mail off to the assigned "junk" folder.

If the software already has this feature, then I'm obviously ignorant
on it, and how to activate it - if so, please enlighten.

Thanks a million!

Peter W Clarke
Chief Engineer - ULTRACAST* Products
DYNAPOWER CORPORATION
Specialists in AC and DC Power Conversion Systems
ULTRACAST* Cast Coil Transformers
85 Meadowland Drive (05403)
PO Box 9210
S. Burlington, VT 05407-9210
Phone: (802) 652-1354
Fax: (802) 652-1371
E-mail: pclarke@dynapower.com
Web: www.dynapower.com

IMPORTANT: The information contained in this communication is
confidential and/or proprietary business or technical data. It is
intended for receipt only by the individual or entity to which it is
addressed. If the reader of this message is not the intended recipient,
you are hereby notified that any dissemination, copying, or
distribution of this communication is strictly prohibited. If you have
received this communication in error, please immediately notify us by
telephone 802-860-7200 or electronically by return message, and delete
or destroy all copies of this communication.
From dave at boost-consulting.com Tue Dec 2 13:06:52 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Tue Dec 2 13:07:17 2003
Subject: [spambayes-dev] Serious problem
In-Reply-To: <16332.52584.801341.395335@montanaro.dyndns.org> (Skip
 Montanaro's message of "Tue, 2 Dec 2003 11:35:36 -0600")
References: <16332.52584.801341.395335@montanaro.dyndns.org>
Message-ID:

Skip Montanaro writes:

> Dave> When mail is run through sb_filter.py, any line which begins with
> Dave> a period ("."), and all lines thereafter, are stripped from the
> Dave> email. For example, the following paragraph begins with
> Dave> "./configure". If you don't see it, you're seeing the bug.
>
> Dave> ./configure is the command most people use to run a configure script.
>
> I think the problem is elsewhere in your mail tool chain. I saw the dot

I thought I'd ruled that out, but on further investigation, you're
quite right. Sorry for the noise. Here's an excerpt from what I sent my
sysadmin:

---

Not long ago I started seeing a problem with my incoming emails. Any
line beginning with a period ("."), and all following lines, are
stripped from the message. If I turn off the procmail processing
altogether, the effect goes away.

I'm running all mail through procmail to filter spam and re-forwarding
them to myself using the following .procmailrc:

    LOGFILE=$HOME/.procmaillog
    PYTHONPATH=/usr/home/dave/src/spambayes:/usr/home/dave/src/email-2.5

    # Pass everything through the Spambayes filter
    :0 fw
    |/usr/local/bin/python /usr/home/dave/src/spambayes/scripts/sb_filter.py -d $HOME/h

    # Forward the mail back to myself
    :0
    ! dave

I can verify that the problem has nothing to do with Spambayes by
replacing that line with:

    | (echo "X-Spambayes-Classification: ham; 0.00" ; cat)

which basically force-classifies the message as ham to prevent an
infinite mail-rule loop.
procmail is still seeing the right email contents up until the point
the mail is forwarded back to me, which I can verify by adding:

    :0 c
    |cat >> bayeslog2

and inspecting bayeslog2.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com

From kennypitt at hotmail.com Tue Dec 2 13:10:12 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 2 13:10:44 2003
Subject: [spambayes-dev] new Spambayes "feature"??
In-Reply-To: <769511B15AAE874591E117534314507704F00D@dppexchange.dynapower.com>
Message-ID:

Clarke, Peter wrote:
> Suggestion:
> It would be neat if the "you've got mail" icon in the "system tray"
> (at the bottom of the screen) would be deleted whenever Spambayes
> sends any spam e-mail off to the assigned "junk" folder.

See FAQ 3.8:
http://spambayes.sourceforge.net/faq.html#how-can-i-get-rid-of-the-envelope-tray-icon-for-spam

--
Kenny Pitt

From kennypitt at hotmail.com Tue Dec 2 14:30:44 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 2 14:31:27 2003
Subject: [spambayes-dev] Bug in timcv.py
Message-ID:

It looks like there is a bug in timcv. I tried to run a test of
training on only a small number of messages, and I got the following
output.

"""
C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain 10 --SpamTrain 10 --HamTest 150 --SpamTest 400 timcv_10-10.txt
Traceback (most recent call last):
  File "timcv.py", line 170, in ?
    main()
  File "timcv.py", line 167, in main
    drive(nsets)
  File "timcv.py", line 115, in drive
    d.test(hamstream, spamstream)
  File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
    t.predict(spam, True, new_spam)
  File "C:\src\python\spambayes_exp\spambayes\Tester.py", line 92, in predict
    prob = guess(example)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 158, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 395, in _getclues
    prob = self.probability(record)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
    assert hamcount <= nham
AssertionError
"""

I took a quick look at timcv.py, and I think I know what is happening.
The ham and spam streams for initial training are created with
"train=1", but the untrain() for the set being tested is done using
streams that are created with "train=0". If the HamTrain/SpamTrain
counts are different from the HamTest/SpamTest counts then the
untrain() does not use the same set of messages.

I can, of course, work around this by setting
build_each_classifier_from_scratch, but just wanted to let everyone
know about the mismatch.

I noticed another curiosity in the traceback: I ran the test from
inside directory "C:\src\python\spambayes_exp", which contains my
modified version of SpamBayes. When the traceback gets to
classifier.py, however, you can see that classifier.py was loaded from
"C:\src\python\spambayes" instead, which is where I have my original
CVS version of SpamBayes. I don't have any PYTHONPATH environment
variable set, and I don't know what else might cause it to jump paths
like that. Can one of you more experienced python'ers explain this?
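
One way to look into the path question above, as a rough sketch rather
than a diagnosis: ask Python which file a module was actually loaded
from, and which search-path entry would be consulted first. The helper
names where_loaded and first_match_on_path are hypothetical (they are
not part of SpamBayes), and the second only approximates the import
system by looking for a plain .py file or a directory with the
module's name.

```python
import importlib
import os
import sys


def where_loaded(module_name):
    """Import module_name and return the absolute path of the file it
    was loaded from, or None for modules without a __file__."""
    module = importlib.import_module(module_name)
    path = getattr(module, "__file__", None)
    return os.path.abspath(path) if path else None


def first_match_on_path(module_name, search_path=None):
    """Return the first search-path entry that holds a top-level module
    or package directory called module_name, or None if none does.

    This is a simplification: it ignores .pyc-only modules, zip
    imports, and import hooks.
    """
    entries = sys.path if search_path is None else search_path
    for entry in entries:
        directory = entry or os.getcwd()  # '' on sys.path means the cwd
        base = os.path.join(directory, module_name)
        if os.path.isfile(base + ".py") or os.path.isdir(base):
            return os.path.abspath(directory)
    return None


if __name__ == "__main__":
    # 'json' is used only because it is always available.
    print(where_loaded("json"))
    print(first_match_on_path("json"))
```

Calling where_loaded("spambayes.classifier") from each checkout,
assuming the package is importable there, should show whether an
earlier sys.path entry is shadowing the modified tree.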
--
Kenny Pitt

From hugoduncan at users.sf.net Tue Dec 2 14:50:40 2003
From: hugoduncan at users.sf.net (Hugo Duncan)
Date: Tue Dec 2 15:40:26 2003
Subject: [spambayes-dev] Re: Re: sb_notesfilter.py changes
References:
Message-ID:

>>> One is that notes does not give you headers in the rfc822 sense.
>>
>> Although you can access them in the document fields.
>
> Hmmmm.... don't know if that's available to the com interface.

Attached an updated diff that adds From, Sender, Received and ReplyTo
fields in a new getMessage function.

> What version of notes are you using? V5, I presume...

6.02

Hugo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031202/c23a5004/sb_notesfilter.html

From papaDoc at videotron.ca Tue Dec 2 18:56:31 2003
From: papaDoc at videotron.ca (Remi Ricard)
Date: Tue Dec 2 18:52:56 2003
Subject: [spambayes-dev] Feature of allow_remote_connections ?
Message-ID: <1070409391.14340.6.camel@porsche.hq.simlog.com>

Hi,

I was testing something and found a strange behavior.

It is looking like the options allow_remote_connections needs two items
separated by a comma.

This won't work

[html_ui]
xxx.xxx.xxx.xxx

N.B xxx.xxx.xxx.xxx is a real IP address

the error is

Attempted to set [html_ui] allow_remote_connections with invalid value xxx.xxx.xxx.xxx ()
Traceback (most recent call last):
  File "/gmc/logiciels/spambayes/scripts/sb_server.py", line 106, in ?
    from spambayes.UserInterface import UserInterfaceServer
  File "/gmc/logiciels/spambayes/spambayes/UserInterface.py", line 46, in ?
"""

If I use

[html_ui]
localhost,xxx.xxx.xxx.xxx

then there is no error.

This is the function in UserInterface.py

    def onIncomingConnection(self, clientSocket):
        """Checks the security settings."""
        remoteIP = clientSocket.getpeername()[0]
        trustedIPs = options["html_ui", "allow_remote_connections"]

        if trustedIPs == "*" or remoteIP == clientSocket.getsockname()[0]:
            return True

        trustedIPs = trustedIPs.replace('.', '\.').replace('*', '([01]?\d\d?|2[04]\d|25[0-5])')
        for trusted in trustedIPs.split(','):
            if re.search("^" + trusted + "$", remoteIP):
                return True

        return False

If I read the python code correctly you need to have a "," in the
trustedIPs string !

--
Remi Ricard

From mhammond at skippinet.com.au Tue Dec 2 19:49:22 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 2 19:49:36 2003
Subject: [spambayes-dev] Branch merge
Message-ID: <038f01c3b937$4b738e70$2c00a8c0@eden>

I had a bash at merging the branch back onto the trunk. There were a
fairly large number of conflicts, but after examining them all, it
appears Tony has been doing an excellent job at keeping all his patches
on both the branch and the trunk - so most of the conflicts were
resolved in favour of the trunk.

I have attached 2 patches. sb_docs.patch is patches to the various doc
files - README, README-DEVEL, etc etc etc. Most of these are typos
fixed on the branch, and the most recent release notes. A quick look by
people who edit these files would be great!

sb_code.patch contains the changes to code files required to merge the
trunk and the branch. ImapUI.py has a number of reasonable looking
changes which check if currently logged on, and that the server name is
valid. spambayes/__init__.py bumps the version number. This makes a
grand total of 2 .py files that are affected by the merge. Given the
trivial nature of the patch required to do the merge, the question
appears to be "what is *missing*"!

I also attempted to upgrade the test suite in the hope of catching any
errors.
I have already checked these in.

Any comments, or +1s on me checking this in? After we get past this, I
will update README-DEVEL with our current 1.0 plan, and re-visit
Version.py ;)

Thanks,

Mark.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_docs.patch
Type: application/octet-stream
Size: 12498 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031203/c8862d0b/sb_docs.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_code.patch
Type: application/octet-stream
Size: 4249 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031203/c8862d0b/sb_code.obj

From mhammond at skippinet.com.au Tue Dec 2 19:54:42 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 2 19:54:57 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F5@its-xchg4.massey.ac.nz>
Message-ID: <039a01c3b938$0a5ab610$2c00a8c0@eden>

> If you could make me a copy, that would be fantastic. At the
> moment I'm stuck either testing whatever Mark throws my way or
> running from source only. I suppose I could just install VC++ (we
> have some sort of site license here, I gather), but I really can't
> be bothered.

I threw together a py2exe binary for Tony yesterday. Let me know if
anyone else wants it.

[As an aside, but useful for people who want to build *everything* from
sources, it is now possible to build win32all itself via a distutils
script. This should mean that given the source to SpamBayes, win32all
and py2exe, a single automated build procedure to churn out all
distribution types should be possible.]

Mark.

From tim.one at comcast.net Tue Dec 2 22:59:52 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 2 22:59:56 2003
Subject: [spambayes-dev] Bug in timcv.py
In-Reply-To:
Message-ID:

[Kenny Pitt]
> It looks like there is a bug in timcv.
> I tried to run a test of
> training on only a small number of messages, and I got the following
> output.
>
> """
> C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain
> 10 --SpamTrain 10 --HamTest 150 --SpamTest 400 timcv_10-10.txt

Wow -- I didn't even know those options ({Ham,Spam}{Train,Test})
existed. They warp the meaning of "cross validation" beyond my
recognition, so I wish they had been added to a new "cv-like" test
driver instead. Oh well.

> Traceback (most recent call last):
> ...
>   File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
>     t.predict(spam, True, new_spam)
> ...
>   File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
>     assert hamcount <= nham
> AssertionError
> """

Ouch.

> I took a quick look at timcv.py, and I think I know what is happening.
> The ham and spam streams for initial training are created with
> "train=1",

Right.

> but the untrain() for the set being tested is done using streams that
> are created with "train=0".

Right.

> If the HamTrain/SpamTrain counts are different from the
> HamTest/SpamTest counts then the untrain() does not use the same
> set of messages.

This isn't cross-validation testing, so the optimizations in timcv.py
*for* true cv testing stopped making sense when these other options
were added.

> I can, of course, work around this by setting
> build_each_classifier_from_scratch, but just wanted to let everyone
> know about the mismatch.

I'd rather see these options moved into a different test driver,
leaving timcv.py unsurprising again. Since timcv.py is the primary
driver for serious testing, it should be kept as simple and bulletproof
as possible.
I regret that the build_each_classifier_from_scratch option was added
to it for the same reason (as the comments for that option say, there
was a need for that option at one time, when evaluating some
since-rejected combining schemes where *incremental* training and
untraining were impossible; those schemes went away, but the option
stayed behind to muddy the waters).

> I noticed another curiosity in the traceback: I ran the test from
> inside directory "C:\src\python\spambayes_exp", which contains my
> modified version of SpamBayes. When the traceback gets to
> classifier.py, however, you can see that classifier.py was loaded from
> "C:\src\python\spambayes" instead, which is where I have my original
> CVS version of SpamBayes. I don't have any PYTHONPATH environment
> variable set, and I don't know what else might cause it to jump paths
> like that. Can one of you more experienced python'ers explain this?

Run Python with -v to get a report of how every import got satisfied.
Then stare until your eyes bleed <0.9 wink>.

I notice that a lot of the scripts these days muck around with sys.path
directly, thus changing Python's search path dynamically, at runtime.
That's *usually* a Bad Idea. If I were you, I'd take a critical look at
the fix_sys_path() function in sb_test_support.py. I don't know how
this got so convoluted, but gobs of dynamic code trying to "fix" what
should be statically known (or at worst fiddled once in a config file)
is a pretty sure recipe for confusion.

From mhammond at skippinet.com.au Tue Dec 2 23:26:35 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 2 23:26:50 2003
Subject: [spambayes-dev] Bug in timcv.py
In-Reply-To:
Message-ID: <042a01c3b955$a3cc9bc0$2c00a8c0@eden>

> > I noticed another curiosity in the traceback: I ran the test from
> > inside directory "C:\src\python\spambayes_exp", which contains my
> > modified version of SpamBayes.
When the traceback gets to > > classifier.py, however, you can see that classifier.py was > loaded from > > "C:\src\python\spambayes" instead, which is where I have my original > > CVS version of SpamBayes. I don't have any PYTHONPATH environment > > variable set, and I don't know what else might cause it to > jump paths > > like that. Can one of you more experienced python'ers explain this? > > Run Python with -v to get a report of how every import got > satisfied. Then stare until your eyes bleed <0.9 wink>. I guess that both the spambayes directory itself, *and* the spambayes parent, are on sys.path (and probably the different versions of each). Thus 'import Options' may be resolved as either 'spambayes.Options' or simply 'Options'. But as Tim said, you can confirm it yourself if for some strange reason you really care > I notice that a lot of the scripts > these days muck around with sys.path directly, thus changing > Python's search > path dynamically, at runtime. That's *usually* a Bad Idea. Yeah, I'd like to fix these, as I am responsible for some. IMO, the "package-ness" of SpamBayes isn't that well defined - mainly as the concept was created after the core code. The Outlook2000 directory isn't a package, but arguably should be. Another reason is that for Outlook, I have never insisted that a user do a "setup.py install" before using the addin. I attempt to use the code directly from the source-tree, including the core spambayes package. If we do move towards forcing source-code users to use distutils to install the package, we may be able to drop even more. Is this a good thing? Tim - I assume you tend to use SpamBayes directly from the CVS tree - is that correct? If so, you manage sys.path manually? > If I were you, > I'd take a critical look at the fix_sys_path() function in > sb_test_support.py. 
I don't know how this got so convoluted, > but gobs of > dynamic code trying to "fix" what should be statically known > (or at worst > fiddled once in a config file) is a pretty sure recipe for confusion. Well, sb_test_support just got created *today*, so poor Kenny would not have seen it when he sent the mail . Also, this file is used *only* by the 'unit test style tests' rather than the 'validation style tests' that timcv exists in. The hacks in sb_test_support were a small step towards reducing the sys.path hacking, but only for that single directory. Mark. From skip at pobox.com Wed Dec 3 10:28:43 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 3 10:28:52 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] website faq.txt,1.51,1.52 In-Reply-To: References: Message-ID: <16334.299.607894.605629@montanaro.dyndns.org> Tony> Missing mail: it's odd that this is very suddenly a FAQ, and Tony> nowhere near a release. Proof positive that SpamBayes is moving further down the food chain. Skip From tameyer at ihug.co.nz Wed Dec 3 21:15:09 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 21:15:16 2003 Subject: [spambayes-dev] Branch merge In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130447814C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FB@its-xchg4.massey.ac.nz> > I had a bash at merging the branch back onto the trunk. [...] > I have attached 2 patches. sb_docs.patch is patches to the > various doc files - README, README-DEVEL, etc etc etc. It makes no difference if the WHAT_IS_NEW file doesn't get updated, or what the contents end up looking like - this gets rewritten with each release so will need to be redone for 1.0a8 or whatever it is that the scheme has us releasing next. (Even a binary-only 1.0a75 would need to have it modified from the 1.0a7 version *). The rest looks fine to me. 
* Actually, we'll have to figure out what happens here anyway, since it's probably not even included with the binary, although it does form the basis of the sourceforge 'release notes'. > sb_code.patch contains the changes to code files required to > merge the trunk and the branch. ImapUI.py has a number of > reasonable looking changes which check if currently logged > on, and that the server name is valid. Yeah, these should be on the trunk, too. Not sure how I missed that, although I was a bit preoccupied at the time . Actually, the code needs to be improved a bit, since I think at the moment it'll give a half completed page if not logged in; I'll fix that once it's on the trunk since that seems easiest (and it's hardly urgent). (This does mean that I lose my bet, though. I'd also forgotten about the version number bump and the README-DEVEL tidying). > I also attempted to upgrade the test suite in the hope of > catching any errors. I have already checked these in. When you're updating README-DEVEL, you could put something in saying that we expect that all new code will come with an appropriate unittest. You never know, we might fool newcomers into thinking that that's the actual situation, and get some movement along those lines . > Any comments, or +1s on me checking this in? +1 here. =Tony Meyer From tameyer at ihug.co.nz Wed Dec 3 21:37:57 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 21:38:04 2003 Subject: [spambayes-dev] Feature of allow_remote_connections ? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130447813D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FC@its-xchg4.massey.ac.nz> > It is looking like the options allow_remote_connections needs > two items separated by a comma. 
> > This won't work > [html_ui] > xxx.xxx.xxx.xxx N.B. xxx.xxx.xxx.xxx is a real IP address > the error is > Attempted to set [html_ui] allow_remote_connections with invalid value > xxx.xxx.xxx.xxx () I tried this: """ [html_ui] allow_remote_connections:123.123.123.123 """ And it worked fine here. "allow_remote_connections:123.123.123.123, 123.123.123.132" didn't, but that's because of the space after the comma, I think ("allow_remote_connections:123.123.123.123,123.123.123.132" works). That said, the 'correct' way for that option to be set up would really be to expect a tuple and have the regex only allow *one* of the possibilities to match. The options code would then take care of single/multiple values. It's a pretty simple fix, I think, but I'm a bit wary of checking it in since I don't use this, and didn't write it. Any volunteers? (I'll produce the code). > This is the function in UserInterface.py [...] > for trusted in trustedIPs.split(','): [...] > If I read the Python code correctly you need to have a "," in > the trustedIPs string! Doing split(',') should return a single item (the whole string) if there aren't any commas at all, so this should work. That said, if the options code was taking care of the multiple values as it should be, then the split wouldn't be necessary at all (and they could be separated by spaces, or whatever, like the other options). This doesn't explain why you had troubles, though. Maybe the regex failed for some other reason? It's certainly very complicated looking (I presume it checks the IP is valid). For our purposes "((?:\d{1,3}\.){3}\d{1,3})" would probably do. If you're comfortable mucking about with regexes you could take the IP_LIST one out of OptionsClass.py and see why it failed the IP you gave it (I like using Kodos for this sort of thing).
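A rough sketch of the two points above, in modern Python rather than the 2.3 of the day, and not the actual SpamBayes code: split(',') on a string with no commas returns the whole string as a single item, and a space after a comma survives the split unless it is stripped. The helper name parse_trusted_ips is made up for illustration, and the pattern is the simplified one suggested above, which checks only the dotted-quad shape and not the range of each octet.

```python
import re

# Simplified pattern from the message above: four dot-separated groups of
# one to three digits. It does NOT check that each octet is <= 255.
SIMPLE_IP = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")


def parse_trusted_ips(value):
    """Split a comma-separated option value into a list of IP strings.

    Hypothetical helper, not part of SpamBayes. Whitespace around each
    item is stripped, so "a, b" and "a,b" behave the same.
    """
    ips = [item.strip() for item in value.split(",")]
    bad = [ip for ip in ips if not SIMPLE_IP.fullmatch(ip)]
    if bad:
        raise ValueError("invalid IP address(es): %r" % (bad,))
    return ips


# No comma at all: split still yields one item, the whole string.
single = "123.123.123.123".split(",")

# A space after the comma is kept by a plain split, which is enough to
# make a strict per-item pattern reject the second address.
raw = "123.123.123.123, 123.123.123.132".split(",")
```

With the strip in place the space-after-comma form parses the same as the form without a space, which is roughly what having the options code handle multiple values would give for free.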
=Tony Meyer From tameyer at ihug.co.nz Wed Dec 3 22:09:33 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 22:10:18 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477EFB@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FD@its-xchg4.massey.ac.nz> > Here is a py2exe. It *should* work. I had a quick go a building it: I got a 'can't find the gen_py file' error, which I fixed by changing the versions from 0,9,0 to 0,9,1 and 0,2,0 to 0,2,2 which is what I have. Does this hurt? Is there a better way to do this? After that it all appeared to work (from 2 minutes testing). I'll swap my wife's system over to this to test it out there (win98) and do some more next week. =Tony Meyer From tameyer at ihug.co.nz Wed Dec 3 22:25:02 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 22:25:08 2003 Subject: [spambayes-dev] More CVS branch/tags questions In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477F8F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FE@its-xchg4.massey.ac.nz> > Unzip the attached file into your site-packages directory and > see if it does the trick. I don't think py2exe setup makes > any other changes, but I'm not certain. This was built for > Python 2.3.2 using VC++ 6.0. Thanks! (Mark gave me a copy, too, so I'm well supplied). Building the installer seems ok, but I get these errors: Traceback (most recent call last): File "pop3proxy_tray.py", line 407, in _ProxyThread File "sb_server.pyc", line 869, in start File "sb_server.pyc", line 847, in main File "spambayes\ProxyUI.pyc", line 156, in __init__ File "spambayes\UserInterface.pyc", line 254, in __init__ File "spambayes\UserInterface.pyc", line 122, in __init__ File "spambayes\UserInterface.pyc", line 240, in readUIResources File "spambayes\resources\__init__.pyc", line 30, in ? 
File "resourcepackage\package.pyc", line 100, in scan WindowsError: [Errno 3] The system cannot find the path specified: 'C:\\Program Files\\SpamBayes\\lib\\spambayes.zip\\spambayes\\resources/*.*' Loading database... SMTP Listener on port 25 is proxying smtp.massey.ac.nz:25 Traceback (most recent call last): File "pop3proxy_tray.py", line 389, in OnCommand File "pop3proxy_tray.py", line 431, in Start File "sb_server.pyc", line 863, in prepare File "sb_server.pyc", line 688, in buildServerStrings TypeError: iteration over non-sequence Have you seen those? Neither occurs when I use the source. (The zip is there, and has the resources directory in it). If you haven't, then I'll try and rummage around and figure out what's causing them next week. =Tony Meyer From ta-meyer at ihug.co.nz Wed Dec 3 22:42:07 2003 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 22:42:14 2003 Subject: [spambayes-dev] Testing the binary installer (Was More CVS branch/tags questions) Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz> [me, just now] > Building the installer seems ok, but I get these errors: > > Traceback (most recent call last): > File "pop3proxy_tray.py", line 407, in _ProxyThread > File "sb_server.pyc", line 869, in start > File "sb_server.pyc", line 847, in main > File "spambayes\ProxyUI.pyc", line 156, in __init__ > File "spambayes\UserInterface.pyc", line 254, in __init__ > File "spambayes\UserInterface.pyc", line 122, in __init__ > File "spambayes\UserInterface.pyc", line 240, in readUIResources > File "spambayes\resources\__init__.pyc", line 30, in ? > File "resourcepackage\package.pyc", line 100, in scan > WindowsError: [Errno 3] The system cannot find the path specified: > 'C:\\Program Files\\SpamBayes\\lib\\spambayes.zip\\spambayes\\resources/*.*' I figured this one out (and probably why Kenny and Mark don't see it). 
This happens if you have the resourcepackage __init__.py in the spambayes/resources directory (which is necessary to get the files in there to automatically update). resourcepackage isn't able to find the files inside the zip (and even if it could, would have to alter the zip to change the files). The easy solution is to just have the cvs __init__.py; the correct solution is probably to add something in that stops the check if we're frozen. Does that sound right to you, Richie? =Tony Meyer From tameyer at ihug.co.nz Wed Dec 3 22:46:18 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 22:46:24 2003 Subject: [spambayes-dev] Testing the binary installer (Was More CVSbranch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F616@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1ED@its-xchg4.massey.ac.nz> [Me, making another mistake] > The easy solution is to just have the cvs __init__.py; Oops. I meant a blank __init__.py, or to not have resourcepackage installed. =Tony Meyer From tameyer at ihug.co.nz Wed Dec 3 23:11:21 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 3 23:11:26 2003 Subject: [spambayes-dev] Using reload() with modules from zips (Was More CVS branch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477FA0@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FF@its-xchg4.massey.ac.nz> > Oops, may have spoken too soon. Just noticed I'm getting the > following error whenever I save configuration in the binary > version. [...] > ImportError: No module named Options I remember this from the last binary (from Mark) I tried, and I get it too. IMO, this is a Python bug. Try this: """ >set PYTHONPATH=path/to/spambayes.zip >python >>> from spambayes import Options >>> Options >>> reload(Options) Traceback (most recent call last): File "<stdin>", line 1, in ?
ImportError: No module named Options """ Looks to me like reload() doesn't work with module from zip's, which I presume it should. Could one of the Python experts correct me if I'm wrong here? Otherwise I presume I should open a (python) sf bug about this. =Tony Meyer From richie at entrian.com Thu Dec 4 04:02:26 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Dec 4 04:02:37 2003 Subject: [spambayes-dev] Testing the binary installer (Was More CVS branch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz> Message-ID: [Tony] > the correct solution > is probably to add something in that stops the check if we're frozen. Does > that sound right to you, Richie? Sounds good. I'm sure Mike Fletcher would appreciate a patch. 8-) -- Richie Hindle richie@entrian.com From richie at entrian.com Thu Dec 4 04:02:29 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Dec 4 04:02:39 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/windows pop3proxy_tray.py, 1.15, 1.16 In-Reply-To: References: Message-ID: [Tony] > Since 15/10/03, SetDefaultItem has been available for menus in win32all, so use that > as we should. (It appears to set the font of the item correctly, but not have any effect > in terms of action, so still capture the double-click ourselves. Someone correct me if > I've done this wrongly). I've just had a look at a tray app I wrote years ago, and it does the same thing. -- Richie Hindle richie@entrian.com From richie at entrian.com Thu Dec 4 04:02:30 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Dec 4 04:02:41 2003 Subject: [spambayes-dev] Branch merge In-Reply-To: <038f01c3b937$4b738e70$2c00a8c0@eden> References: <038f01c3b937$4b738e70$2c00a8c0@eden> Message-ID: <1nttsvk6smcgunmjup66glucnjioqbfo81@4ax.com> [Mark] > Any comments, or +1s on me checking this in? I have no time to read it, but I trust you. 
+0.99 (so if it all goes wrong I can blame the 0.01 8-) -- Richie Hindle richie@entrian.com From kennypitt at hotmail.com Thu Dec 4 11:11:33 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 4 11:12:06 2003 Subject: [spambayes-dev] Testing the binary installer (Was More CVSbranch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: > [me, just now] >> Building the installer seems ok, but I get these errors: >> >> Traceback (most recent call last): >> File "pop3proxy_tray.py", line 407, in _ProxyThread >> File "sb_server.pyc", line 869, in start >> File "sb_server.pyc", line 847, in main >> File "spambayes\ProxyUI.pyc", line 156, in __init__ >> File "spambayes\UserInterface.pyc", line 254, in __init__ >> File "spambayes\UserInterface.pyc", line 122, in __init__ >> File "spambayes\UserInterface.pyc", line 240, in readUIResources >> File "spambayes\resources\__init__.pyc", line 30, in ? >> File "resourcepackage\package.pyc", line 100, in scan >> WindowsError: [Errno 3] The system cannot find the path specified: >> 'C:\\Program > Files\\SpamBayes\\lib\\spambayes.zip\\spambayes\\resources/*.*' > > I figured this one out (and probably why Kenny and Mark don't see > it). This happens if you have the resourcepackage __init__.py in the > spambayes/resources directory (which is necessary to get the files in > there to automatically update). Resource package isn't able to find > the files inside the zip (and even if it could, would have to alter > the zip to change the files). I just checked and the last binary I built was done with the resourcepackage __init__.py still in place, yet I don't get the error. I think you are probably still correct on the problem, though. There is a good chance that I ran from source at least once since I last modified any of the resources, so the .py files would have gotten updated then. Would that cause resourcepackage to not try to regenerate? 
Try running from source first and then rebuilding the binary to see if you still get the error. > The easy solution is to just have the cvs __init__.py; the correct > solution is probably to add something in that stops the check if > we're frozen. Does that sound right to you, Richie? This definitely should be done for a release version. -- Kenny Pitt From kennypitt at hotmail.com Thu Dec 4 11:27:18 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 4 11:27:49 2003 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/windows spambayes.iss, 1.2, 1.3 In-Reply-To: Message-ID: Tony Meyer wrote: > Update of /cvsroot/spambayes/spambayes/windows > In directory sc8-pr-cvs1:/tmp/cvs-serv3138/windows > > Modified Files: > spambayes.iss > Log Message: > These dlls end up in the lib directory here, and I'm pretty sure that that's where > they're meant to be these days. An old .iss, maybe? > > Index: spambayes.iss > =================================================================== > RCS file: /cvsroot/spambayes/spambayes/windows/spambayes.iss,v > retrieving revision 1.2 > retrieving revision 1.3 > diff -C2 -d -r1.2 -r1.3 > *** spambayes.iss 23 Oct 2003 22:54:09 -0000 1.2 > --- spambayes.iss 4 Dec 2003 03:12:41 -0000 1.3 > *************** > *** 19,24 **** > Source: "py2exe\dist\lib\*.*"; DestDir: "{app}\lib"; Flags: ignoreversion > Source: "py2exe\dist\bin\python23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion > ! Source: "py2exe\dist\bin\pythoncom23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion > ! Source: "py2exe\dist\bin\PyWinTypes23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion > > Source: "py2exe\dist\bin\outlook_addin.dll"; DestDir: "{app}\bin"; Check: > InstallingOutlook; Flags: ignoreversion regserver > --- 19,24 ---- > Source: "py2exe\dist\lib\*.*"; DestDir: "{app}\lib"; Flags: ignoreversion > Source: "py2exe\dist\bin\python23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion > ! 
Source: "py2exe\dist\lib\pythoncom23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion > ! Source: "py2exe\dist\lib\PyWinTypes23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion > > Source: "py2exe\dist\bin\outlook_addin.dll"; DestDir: "{app}\bin"; Check: > InstallingOutlook; Flags: ignoreversion regserver Actually, those two DLLs only go in the lib directory now, not in the bin directory. The "py2exe\dist\lib\*.*" line already takes care of that. -- Kenny Pitt From kennypitt at hotmail.com Thu Dec 4 11:46:17 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 4 11:46:48 2003 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/windowsspambayes.iss, 1.2, 1.3 In-Reply-To: Message-ID: [Me, before realizing I had modified this file locally] > Tony Meyer wrote: >> Update of /cvsroot/spambayes/spambayes/windows >> In directory sc8-pr-cvs1:/tmp/cvs-serv3138/windows >> >> Modified Files: >> spambayes.iss >> Log Message: >> These dlls end up in the lib directory here, and I'm pretty sure >> that that's where they're meant to be these days. An old .iss, >> maybe? > > Actually, those two DLLs only go in the lib directory now, not in the > bin directory. The "py2exe\dist\lib\*.*" line already takes care of > that. And now for the completion of that partial thought: So, the correct fix is to delete those two DLLs from the .iss entirely. Some background: A little while back, I noticed that these DLLs weren't found in some cases when they were in the bin directory (e.g. if you tried to register the plugin DLL when you weren't sitting in the bin directory). Mark and I determined that if the DLLs are in the python path, they will be found correctly in all cases, so that's why setup_all was changed to copy them to the dist\lib directory instead of the dist\bin directory. -- Kenny Pitt From dbulgrien at vcsd.com Thu Dec 4 12:06:03 2003 From: dbulgrien at vcsd.com (Dennis W. 
Bulgrien) Date: Thu Dec 4 12:20:26 2003 Subject: [spambayes-dev] Re: Outlook Envelope Tray Icon References: Message-ID: One place that I have noticed would be nice is when the "Delete as Spam" button is pressed. With SpamBayes Manager, Training tab, Incremental Training frame, Clicking Delete as Spam should "mark the message as read", the icon is not cleared even though the message is marked as read. This is unexpected because the Filtering tab, Certain Spam frame, Mark spam as read check-box keeps the icon from appearing when spam comes in and is ushered to the spam folder (Advanced tab set to Enabled background filtering, default delays). Maybe the latter works because SpamBayes marks it read even BEFORE Outlook displays the icon. "Kenny Pitt" wrote... ... Thanks for the link. I created the following code to implement this in the Outlook plugin and attached it to a menu item for testing. It was, in fact, successful in removing the new mail envelope from the taskbar. Now, the *really* tricky part is figuring out when to remove the icon. ... From kennypitt at hotmail.com Thu Dec 4 13:47:59 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 4 13:48:34 2003 Subject: [spambayes-dev] pop3proxy_tray error Message-ID: I decided it was probably time to do a little more thorough testing of the proxy tray than just my average daily usage. I tried to stop SpamBayes from the right-click menu, and then start it again. Here's the output I got when it tried to restart SpamBayes. """ Loading database...
Traceback (most recent call last): File "pop3proxy_tray.py", line 389, in OnCommand function() File "pop3proxy_tray.py", line 431, in Start sb_server.prepare(state=sb_server.state) File "C:\src\python\spambayes_exp\scripts\sb_server.py", line 861, in prepare state.buildServerStrings() File "C:\src\python\spambayes_exp\scripts\sb_server.py", line 685, in buildServerStrings serverStrings = ["%s:%s" % (s, p) for s, p in self.servers] TypeError: iteration over non-sequence """ At this point, the tray icon indicates that SpamBayes is running (I had to hover my mouse over the icon to get it to update, maybe a different problem), but none of the ports have been opened so any attempt to connect to the mail server, review messages in the ui, etc. fails. This happens whether I am running from source or the py2exe binary. -- Kenny Pitt From kennypitt at hotmail.com Thu Dec 4 14:16:44 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 4 14:17:16 2003 Subject: [spambayes-dev] pop3proxy_tray icons Message-ID: I don't know if anyone else has noticed this or not, but on my Windows 2000 system the green and red circles in the current pop3proxy_tray icons are very difficult to make out. I created the attached icons as a possible alternative. They are basic 16-color icons and show up quite nicely on both Windows 2000 and Windows XP. The attached patch is also required because the LoadImage calls pass 0,0 for the icon size. That loads the icon using the default 32x32 size, scaling a 16x16 icon up to 32x32 if necessary. Since icons in the tray are only 16x16, they then get scaled back down when displayed and still end up looking bad. I also attached an alternate sbicon that I created in the spirit of the icons in the Web UI. It uses the envelope icon from the Wingdings font with the same blue outline color used in the UI icons. I modified my py2exe\setup_all.py to use this as the icon for all the generated exe's. 
-- Kenny Pitt -------------- next part -------------- A non-text attachment was scrubbed... Name: sb-stopped.ico Type: image/x-icon Size: 318 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/sb-stopped.bin -------------- next part -------------- A non-text attachment was scrubbed... Name: sb-started.ico Type: image/x-icon Size: 318 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/sb-started.bin -------------- next part -------------- A non-text attachment was scrubbed... Name: pop3proxy_tray.diff Type: application/octet-stream Size: 1992 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/pop3proxy_tray.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: sbicon.ico Type: image/x-icon Size: 4710 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/sbicon.bin From sanjaydarisi at cox.net Thu Dec 4 14:29:29 2003 From: sanjaydarisi at cox.net (Sanjay Darisi) Date: Thu Dec 4 14:29:34 2003 Subject: [spambayes-dev] Closing Manager window... Message-ID: <3FCF8B19.5030108@cox.net> I realized that everytime the spambayes Manager dialog is closed it saves to the config file. I found that the close button just closes the dialog. So, is there any event associated with the closing of this dialog that invokes the SaveConfig function in Manager.py file. Oops...I didn't tell you, I am running spambayes outlook addin 0.81 Thanks in advance, Sanjay. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/bb33296e/attachment.html From mhammond at skippinet.com.au Thu Dec 4 17:35:38 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Dec 4 17:35:55 2003 Subject: [spambayes-dev] Using reload() with modules from zips (Was More CVSbranch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FF@its-xchg4.massey.ac.nz> Message-ID: <07dd01c3bab6$f2ff7cf0$2c00a8c0@eden> > Looks to me like reload() doesn't work with module from zip's, which I > presume it should. > > Could one of the Python experts correct me if I'm wrong here? > Otherwise I > presume I should open a (python) sf bug about this. I agree this is a bug. While .zip files are generally logically "readonly", there is no reason that a .zip file could not be updated dynamically while an app is running. I'm not so sure it will see quick attention though, so we should consider handling this in our code. I'm also not sure exactly *why* we are doing a reload - saving user options should not require us to reload the Options module, and I'm fairly sure no code exists that updates the .zip with a new Options file even if it did :) Mark. From kennypitt at hotmail.com Thu Dec 4 17:55:40 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 4 17:56:12 2003 Subject: [spambayes-dev] Using reload() with modules from zips (Was MoreCVSbranch/tags questions) In-Reply-To: <07dd01c3bab6$f2ff7cf0$2c00a8c0@eden> Message-ID: Mark Hammond wrote: > I'm also not sure exactly *why* we are doing a reload - saving user > options should not require us to reload the Options module, and I'm > fairly sure no code exists that updates the .zip with a new Options > file even if it did :) I agree. The Python manual says of the reload() function: "This is useful if you have edited the module source file using an external editor and want to try out the new version without leaving the Python interpreter." 
In the binary, we have no source to reload. Even when running from source, it doesn't seem useful to recompile Options.py. It isn't done for any of the other modules. I assume the reload has the side-effect of rerunning any initialization that occurs when the module is first loaded, but what good would that do during a save? We already have all the options that the user specified or we wouldn't be able to save them, so why throw them all out and then immediately reload them again? As far as I can tell, neither Options.py nor OptionsClass.py does anything except read in the values, so it doesn't seem like there should be any side-effects. Maybe we should just comment out the reload and see what happens (he said while opening his text editor). -- Kenny Pitt From mhammond at skippinet.com.au Thu Dec 4 23:47:35 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Dec 4 23:47:46 2003 Subject: [spambayes-dev] release_1_0 branch is dead Message-ID: <083701c3baea$e7695760$2c00a8c0@eden> I've checked in my merge of the branch. As per my previous mail, the changes were pretty trivial, so I expect no problems. But consider it official - the branch is dead. Mark. From kennypitt at hotmail.com Fri Dec 5 10:21:02 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Dec 5 10:21:35 2003 Subject: [spambayes-dev] Using reload() with modules from zips (WasMoreCVSbranch/tags questions) In-Reply-To: Message-ID: [Me, yesterday evening] > Mark Hammond wrote: >> I'm also not sure exactly *why* we are doing a reload - saving user >> options should not require us to reload the Options module, and I'm >> fairly sure no code exists that updates the .zip with a new Options >> file even if it did :) > > [snip meandering commentary by me] > > Maybe we should just comment out the reload and see what happens (he > said while opening his text editor). I commented out the 4 lines that do the importing and reloading of the Options module, and then rebuilt the binary.
Brief initial testing showed no problems with these lines taken out. I changed the listening port for one of my POP servers. I was able to save the configuration change successfully with no traceback, and by monitoring with TCPView I saw the listening port change immediately. I then also changed the spam cutoff and again saved successfully. Finally, I exited sb_tray and reloaded it, and both changes were still present in the config file. -- Kenny Pitt From skip at pobox.com Fri Dec 5 16:43:54 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Dec 5 16:44:01 2003 Subject: [spambayes-dev] More on training - eyeballs and edits appreciated Message-ID: <16336.64538.768060.984524@montanaro.dyndns.org> I added a bunch of text to the SpamBayes Wiki about training today (several doses of caffeine later). I apologize for the long delay. Thanks to Seth and Ryan for stepping up to the plate in my virtual absence. I also added the training aphorisms I posted a couple weeks ago. Have a look: http://www.entrian.com/sbwiki/TrainingIdeas Feel free to comment on anything or edit the page using the link at the bottom... Skip From sl6dt at cc.usu.edu Sun Dec 7 15:46:48 2003 From: sl6dt at cc.usu.edu (sl6dt) Date: Sun Dec 7 15:49:30 2003 Subject: [spambayes-dev] Web filtering Message-ID: <3FD367C7@webster.usu.edu> Hello everyone, I am the new guy to the list. I was wondering today if it were possible to make a Bayesian filter for web pages to block undesired content? This way we could provide a free plugin to everybody who doesn't want porn sites to show up in their web browser. What does everyone think?
John Mulholland From skip at pobox.com Sun Dec 7 18:13:01 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Dec 7 18:13:03 2003 Subject: [spambayes-dev] Web filtering In-Reply-To: <3FD367C7@webster.usu.edu> References: <3FD367C7@webster.usu.edu> Message-ID: <16339.46077.318890.127461@montanaro.dyndns.org> John> I was wondering today if it were possible to make a bayesian John> filter for web pages to block undesired content? Check out mod_spambayes.py in the contrib directory. It's a SpamBayes plugin for Amit Patel's proxy web server. Skip From sl6dt at cc.usu.edu Sun Dec 7 18:32:30 2003 From: sl6dt at cc.usu.edu (sl6dt) Date: Sun Dec 7 18:34:46 2003 Subject: [spambayes-dev] Web filtering Message-ID: <3FD3BEEC@webster.usu.edu> Thank you for the information. I am not familiar with Amit Patel's proxy web server. Who is working on this plugin? Can I help? John Mulholland >===== Original Message From skip@pobox.com ===== > John> I was wondering today if it were possible to make a bayesian > John> filter for web pages to block undesired content? > >Check out mod_spambayes.py in the contrib directory. It's a SpamBayes >plugin for Amit Patel's proxy web server. > >Skip From skip at pobox.com Sun Dec 7 19:48:21 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Dec 7 19:48:21 2003 Subject: [spambayes-dev] Web filtering In-Reply-To: <3FD3BEEC@webster.usu.edu> References: <3FD3BEEC@webster.usu.edu> Message-ID: <16339.51797.668396.312753@montanaro.dyndns.org> John> Thank you for the information. I am not familiar with Amit John> Patel's proxy web server. Who is working on this plugin? Can I John> help? The URL for the proxy server is at the top of the mod_spambayes.py file. I wrote the plugin, though as you can see, it's pretty minimal. Nobody's working on it at the moment. You're more than welcome to enhance it. I only wrote it as an exercise. 
Skip From tameyer at ihug.co.nz Mon Dec 8 03:13:56 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 8 03:14:07 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/windowspop3proxy_tray.py, 1.15, 1.16 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F6D2@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F1@its-xchg4.massey.ac.nz> [Tony] > Since 15/10/03, SetDefaultItem has been available for menus in > win32all, so use that as we should. (It appears to set the font of > the item correctly, but not have any effect in terms of action, so > still capture the double-click ourselves. Someone correct me if I've > done this wrongly). [Richie] > I've just had a look at a tray app I wrote years ago, and it > does the same thing. Good enough for me. (I presume this was not in Python, or you had your own extension to get access to the SetDefaultItem function?) =Tony Meyer From tameyer at ihug.co.nz Mon Dec 8 03:17:29 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 8 03:17:35 2003 Subject: [spambayes-dev] Using reload() with modules from zips (Was More CVSbranch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F6FA@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F2@its-xchg4.massey.ac.nz> [Tony] > Looks to me like reload() doesn't work with module from > zip's, which I presume it should. [Mark] > I agree this is a bug. While .zip files are generally > logically "readonly", there is no reason that a .zip file > could not be updated dynamically while an app is running. At the least a more accurate ("can't reload from zip") message would be good, I would think. I'll submit this as a bug for Python. > I'm not so sure it will see quick attention though, so we > should consider handling this in our code. I'm not sure that it deserves quick attention either, since I don't imagine this is a high use feature. 
> I'm also not sure exactly *why* we are doing a reload - > saving user options should not require us to reload the > Options module, and I'm fairly sure no code exists that > updates the .zip with a new Options file even if it did :) I *think* (before my time, IIRC) the reason is to generate a new Options.options object, that has all the new values. I had figured that the correct behaviour for us would be to remove the reload and explicitly recreate/update the options object. =Tony Meyer From tameyer at ihug.co.nz Mon Dec 8 03:24:29 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 8 03:24:35 2003 Subject: [spambayes-dev] RE: [Spambayes] Hotmail Confusion In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458FE32@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F3@its-xchg4.massey.ac.nz> [Mark, more than once] > I am starting to believe > that the 'background filtering' option should be the default. +1. =Tony Meyer From richie at entrian.com Mon Dec 8 03:32:14 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 8 03:32:21 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/windowspop3proxy_tray.py, 1.15, 1.16 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F1@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130458F6D2@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F130212B1F1@its-xchg4.massey.ac.nz> Message-ID: <8nd8tv46t6cdsi1vnil5mmv45bacq2m636@4ax.com> [Tony] > Since 15/10/03, SetDefaultItem has been available for menus in > win32all, so use that as we should. (It appears to set the font of > the item correctly, but not have any effect in terms of action, so > still capture the double-click ourselves. Someone correct me if I've > done this wrongly). > > [Richie] > I've just had a look at a tray app I wrote years ago, and it > does the same thing. [Tony] > Good enough for me. 
(I presume this was not in Python, or you had your own > extension to get access to the SetDefaultItem function?) It was in C. -- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Mon Dec 8 07:03:16 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Dec 8 07:03:31 2003 Subject: [spambayes-dev] Using reload() with modules from zips (Was MoreCVSbranch/tags questions) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F2@its-xchg4.massey.ac.nz> Message-ID: <002d01c3bd83$43afb560$2c00a8c0@eden> [Tony, quoting me] > > I'm also not sure exactly *why* we are doing a reload - > > saving user options should not require us to reload the > > Options module, and I'm fairly sure no code exists that > > updates the .zip with a new Options file even if it did :) > > I *think* (before my time, IIRC) the reason is to generate a new > Options.options object, that has all the new values. I had > figured that the > correct behaviour for us would be to remove the reload and explicitly > recreate/update the options object. Yes - the 'Options' module's mainline code actually reads the config file - so I can see why a reload is needed if you want to re-read an options file that may have been externally modified (now that I think about it ) The solution seems pretty simple: the top-level Options code gets moved into a function, an 'if __name__' block is added which calls it, and all occurrences of reload(Options) are also replaced similarly. I'll do it, unless someone beats me to it (fingers crossed - I've lost the context, such as any existing bugs etc), or I forget <0.1-wink> Mark. From mhammond at skippinet.com.au Mon Dec 8 07:07:42 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Dec 8 07:07:52 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins]spambayes/windowspop3proxy_tray.py, 1.15, 1.16 In-Reply-To: <8nd8tv46t6cdsi1vnil5mmv45bacq2m636@4ax.com> Message-ID: <002e01c3bd83$e29de570$2c00a8c0@eden> > [Tony] > > Good enough for me. 
(I presume this was not in Python, or > > you had your own extension to get access to the SetDefaultItem function?) > > It was in C. Isn't it great to have moved on from those bad old days? I know it is for me :) Good-enough-to-keep-persisting-with-win32all , ly Mark. From richie at entrian.com Mon Dec 8 15:15:30 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 8 15:15:37 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins]spambayes/windowspop3proxy_tray.py, 1.15, 1.16 In-Reply-To: <002e01c3bd83$e29de570$2c00a8c0@eden> References: <8nd8tv46t6cdsi1vnil5mmv45bacq2m636@4ax.com> <002e01c3bd83$e29de570$2c00a8c0@eden> Message-ID: [Richie] > It was in C. [Mark] > Isn't it great to have moved on from those bad old days? I know it is for > me :) Certainly is. The day I next write a C program from scratch will be the day I need to write a device driver. Or maybe not even then: http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&oe=UTF-8&th=a5f5f93827f5d230&seekm=mailman.226.1070896324.16879.python-list%40python.org&frame=off -- Richie Hindle richie@entrian.com From skip at pobox.com Tue Dec 9 11:03:18 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 9 11:03:45 2003 Subject: [spambayes-dev] RE: [Spambayes] Strip Subject of Non-alpha In-Reply-To: References: <16340.55172.171511.255475@montanaro.dyndns.org> Message-ID: <16341.62022.48555.624970@montanaro.dyndns.org> >> I never got overwhelming encouragement for my ideas about how to add >> experimental extensions to the CVS repository. Tim> Probably because it came attached to such a weak change . Okay, ignore the bit about a specific "enhancement". We all know most of them don't work anyway. Still, suppose someone comes up with an idea (we get them all the time in the spambayes mailing list): "I know, how about using the new header transmogrification feature of RFC-4822?", but doesn't have the programming cojones to implement it. 
Someone else comes along, realizes it wouldn't be such a big deal to implement, does so and posts, "Okay, try the version in CVS. SpamBayes now has a "Headers:X-transmogrify" option. Let us know whether it helps or not." People can then experiment with RFC-4822 transmogrification. If it proves not to be a worthy addition, the code can be ripped out. The key is tweaking the options parser to not care if there is no "Tokenizer:X-transmogrify" option (because the code was ripped out later) or to map "Tokenizer:X-transmogrify" to "Tokenizer:transmogrify" if it gains acceptance and moves out of the trial stage. (In fact, perhaps it should work the other way as well, so we can rip stuff out that's not useful without breaking people's options files. See below.) I just checked in a change to spambayes/OptionsClass.py which implements an experimental/deprecated option feature. It works like this:

    * Option is "foo", user sets "foo". status quo.
    * Option is "X-foo", user sets "X-foo". status quo.
    * Option is "foo", user sets "X-foo". "foo" is set silently.
    * Option is "X-foo", user sets "foo". "X-foo" is set and a warning emitted.

The third case covers experimental options. The fourth case covers deprecated options. (The description for deprecated options in Options.py should start with "(DEPRECATED) ".) Tim> Really, a few people tested it and it didn't seem to matter either Tim> way. Granted. One thing I wonder about is how "current" people's training databases are. New techniques like c?mm?nt ?cc?nt??t??n or em.bed-ed punc#tua_tion aren't likely to turn up much in older training databases. I canned my old training database recently and have been working on rebuilding it from scratch. I think it's important that our training databases evolve as spam does. Another change I have locally is the remove_punctuation tokenizer gimmick I alluded to above.
It also doesn't seem to change fp/fn results at the level of pushing messages clearly out of one category into another; however, it seems to pretty consistently spread the ham/spam means apart a bit and reduce their standard deviations. I'm more interested in a framework for making such experimental changes easier for non-programmers to try out. Tim> Experimental extensions are fine by me, and you proposed a decent Tim> scheme for putting them in. The downside is that every piece of Tim> code complicates the whole, and I really don't know why you'd Tim> *want* to check in a gimmick that made no real difference to anyone Tim> who tried it (if I remember all the reports correctly -- maybe Tim> not). The point isn't sticking code in, it's being able to easily yank it back out. (I think my checkin should make that easier.) You mentioned generate_time_buckets and extract_dow. I'll turn the screws in a moment to deprecate them. If this idea doesn't fly with people, or these options are deemed crucial for enough people we can just un-deprecate them. (BTW, has anyone on a Unix-ish system tried out testtools/Makefile when running timcv? If so, does it help or am I the only person who finds it useful?) Skip From skip at pobox.com Tue Dec 9 11:17:50 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 9 11:17:57 2003 Subject: [spambayes-dev] mboxutils.DirOfTxtFileMailbox - broaden its scope? Message-ID: <16341.62894.540999.159231@montanaro.dyndns.org> mboxutils.getmbox creates a DirOfTxtFileMailbox() object in certain situations. Looking at the code, it ignores any hierarchy within the given directory, and only considers files ending in ".txt" or ".lorien". Would anyone object if I broadened this class's mandate to recursively traverse subdirectories and consider all other files it encounters as message files? This would, for example, allow you to call spambayes.mboxutils.getmbox("Data/Ham") in your test directory and walk through all the ham in your training database.
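
A minimal sketch of the traversal proposed above, not the actual DirOfTxtFileMailbox code: it walks a directory recursively and treats every file it finds as one message. The names iter_message_files and iter_messages are hypothetical, and the sketch is written for a current Python 3 rather than the Python 2.3 of this thread.

```python
import email
import os


def iter_message_files(root):
    """Yield the path of every file under root, descending into subdirectories."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            # No filter on the extension: every file counts as a message.
            yield os.path.join(dirpath, name)


def iter_messages(root):
    """Parse each file found under root as a single email message."""
    for path in iter_message_files(root):
        with open(path, "rb") as fp:
            yield email.message_from_binary_file(fp)
```

Something like `for msg in iter_messages("Data/Ham"): ...` would then visit every ham message regardless of how the directory is laid out. Files that are not email at all would still be parsed (the email package is lenient), which is the risk with stray ".txt" files.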
I've been using this for a month or so with no ill effects, though I have to admit I have no idea what a ".lorien" file is, so I have no directories like that to break. (Also, in the world outside SpamBayes, I often add ".txt" to files which don't contain email. ) Skip From mhammond at skippinet.com.au Wed Dec 10 00:50:07 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Dec 10 00:50:24 2003 Subject: [spambayes-dev] Outlook CVS 'UserProperty' changes Message-ID: <000201c3bee1$7a119ec0$2c00a8c0@eden> I've just changed the way we manage UserProperties in the Outlook addin. The way we check if an Outlook folder has a "UserProperty" has changed, and the way we create this "UserProperty" has also changed. It does *not* change the way the "Spam" field is saved in the message (that uses MAPI properties), but the way Outlook shows these values (a subtle but real distinction) Most people should see absolutely no change - all your folders will already have this 'Spam' field, and this should be detected correctly. Until now, the 'Unsure' folder never has this property automatically created by SpamBayes, so unless you created this field manually (via the 'Field Chooser') the field should now be created for you - but I assume almost everyone here has already done that. So if a few of you would like to kill a few minutes , I would appreciate a little test - especially by Outlook XP users. * For at least one of your Watched, Spam and Unsure folders: * If the 'Spam' field is being shown for this folder, right-click the column header, and select 'Remove this column' * Right-click any column header, and select 'Field Chooser' * Select 'User defined fields' - The 'Spam' field should appear. Select it. * Click the 'Delete' button, confirm the deletion, and close the field chooser. Re-start outlook. The log (at any level) should show: ... 
Folder 'Personal Folders/Inbox' has no field named 'Spam' - creating SpamBayes: Watching for new messages in folder Personal Folders/Inbox ... Note the first entry - each folder that you deleted the field from should show a similar message. Then restart Outlook again, this time checking you do *not* see that message. Finally, go back to your folders and bring up the 'Field Chooser'. When you select 'User Defined Fields', you should find the 'Spam' field magically re-created. Drag it back to your view, and you are back where you started. If you are using anon-cvs, please wait until manager rev 1.91 and msgstore rev 1.78 appear . Thanks! Mark From tameyer at ihug.co.nz Wed Dec 10 02:10:11 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 10 02:10:18 2003 Subject: [spambayes-dev] Outlook CVS 'UserProperty' changes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046B4310@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677741@its-xchg4.massey.ac.nz> > So if a few of you would like to kill a few minutes , I would > appreciate a little test - especially by Outlook XP users. Outlook XP SP2 here. > * For at least one of your Watched, Spam and Unsure folders: [...] > Re-start outlook. The log (at any level) should show: Sorry, I got: """ Error adding field to 'Unsure' folder ('0000000038A1BB1005E5101AA1BB08002B2A56C20000454D534D44422E444C4C0000000000 0000001B55FA20AA6611CD9BC800AA002FC45A0C0000004954532D5843484734002F6F3D4D61 7373657920556E69766572736974792F6F753D4D41535345592F636E3D526563697069656E74 732F636E3D542E412E4D6579657200', '000000002CFF45187C119D4295E615A8AD7B7676010098B01D2717B9D411B38F0008C784093 1000010DF96B20000') NameError: global name 'PR_USERFIELDS' is not defined """ I get this every time I start Outlook now, and don't have the field available in the field chooser. 
=Tony Meyer From mhammond at skippinet.com.au Wed Dec 10 02:19:32 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Dec 10 02:19:44 2003 Subject: [spambayes-dev] Outlook CVS 'UserProperty' changes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677741@its-xchg4.massey.ac.nz> Message-ID: <000001c3beed$f5c80980$2c00a8c0@eden> > > So if a few of you would like to kill a few minutes , I would > > appreciate a little test - especially by Outlook XP users. > > Outlook XP SP2 here. > > > * For at least one of your Watched, Spam and Unsure folders: > [...] > > Re-start outlook. The log (at any level) should show: > > Sorry, I got: Thanks! Fixed - please try again. Mark From papaDoc at videotron.ca Wed Dec 10 10:01:10 2003 From: papaDoc at videotron.ca (papaDoc) Date: Wed Dec 10 10:01:23 2003 Subject: [spambayes-dev] Command line options Message-ID: <3FD73536.3080101@videotron.ca> Hi, I submitted a patch on SourceForge to make the command line options more consistent across all the scripts (-d for dbm and -D for pickle). Remi From papaDoc at videotron.ca Wed Dec 10 18:55:25 2003 From: papaDoc at videotron.ca (papaDoc) Date: Wed Dec 10 18:55:28 2003 Subject: [spambayes-dev] Patch to make sb_mboxtrain to work on windows Message-ID: <3FD7B26D.8000304@videotron.ca> Hi, I submitted a patch on SourceForge to make the sb_mboxtrain script work on Windows. Remi From skip at pobox.com Wed Dec 10 21:22:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 10 21:23:14 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests... In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677744@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13046B4478@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677744@its-xchg4.massey.ac.nz> Message-ID: <16343.54531.802087.451246@montanaro.dyndns.org> >> Big mistake. Stuff started getting wacky real fast.... Guess what? >> One of the messages in the digest was an obvious spam.

    Tony> This is perhaps a drawback of the minimalist database size
    Tony> training strategy. I'm guessing that if you had a larger
    Tony> database, the effect wouldn't have been as pronounced?

Maybe. At the moment, I have 9768 tokens in my database and 7731 of them are hapaxes. As you suggest, it would appear mistakes can throw things off more dramatically, but it is also easier to detect. I'd be interested to see what others' hapax fractions are:

    >>> import shelve
    >>> db = shelve.open(".hammiedb")
    >>> n = 0
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])
    7731
    >>> len(db)
    9769
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
    0.79146191646191644

(The -1 is to eliminate the 'saved state' token. I'm just being pedantic. ;-)

Another interesting thing (I think) might be to investigate the importance of synthetic tokens (e.g.: 'url:eweek' or 'received:168.10.156') vs. natural tokens (e.g., 'highlight' or 'dot') for smaller vs larger databases. I think one of the reasons training a single unsure has a dramatic effect on a bunch of other unsure spams is because of all the synthetic tokens they have in common due to similar delivery mechanisms (gotta use that account before it gets shut down...). If a spammer spews a bunch of messages from ISP A, then gets booted, his next spew will be from somewhere else. I suspect many of the ISP-related synthetic tokens generated will only ever be hapaxes, and thus be much more important with a small database than with a large one.

It's just a theory. Hey, maybe that's another master's thesis idea for Brett Cannon... ;-)

Skip

From kennypitt at hotmail.com Thu Dec 11 09:58:45 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 11 09:59:25 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <16343.54531.802087.451246@montanaro.dyndns.org> Message-ID:

Skip Montanaro wrote:
> I'd be interested to see what others' hapax fractions are:
>
> >>> import shelve
> >>> db = shelve.open(".hammiedb")
> >>> n = 0
> >>> len([k for k in db if db[k] in [(0,1),(1,0)]])
> 7731
> >>> len(db)
> 9769
> >>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
> 0.79146191646191644

My current Outlook training database has 40 good and 59 spam. Here are my results:

    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])
    8158
    >>> len(db)
    11274
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
    0.72367604009580411

-- Kenny Pitt

From tim.one at comcast.net Thu Dec 11 11:17:47 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 11 11:17:48 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests... In-Reply-To: <16343.54531.802087.451246@montanaro.dyndns.org> Message-ID:

[Tony]
> This is perhaps a drawback of the minimalist database size
> training strategy.

I think it's a consequence of mistake-based training (and minimal database size is a (another) consequence of *that*).

> I'm guessing that if you had a larger database, the effect wouldn't
> have been as pronounced?

A mistake in training has smaller effect under TOE (train-on-everything). The other side of that is that a correctly-trained example also has smaller effect under TOE.

[Skip]
> Maybe. At the moment, I have 9768 tokens in my database and 7731 of
> them are hapaxes. As you suggest, it would appear mistakes can throw
> things off more dramatically,

We're rediscovering the bases for these old mantras:

    Mistake-based training leads to hapax-driven scoring.
    Hapax-driven scoring is brittle.
    "brittle" is an antonym of "robust" .

But in my personal email life, I've been very happy with mistake-based training despite its drawbacks.

> but it is also easier to detect.

Heh -- isn't that *because* it throws things off so dramatically ?
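
The hapax fraction computed in the interactive sessions above can be restated as a small function. This is only a sketch: hapax_fraction is a hypothetical helper, it assumes the database maps each token to a (spam count, ham count) pair as those sessions imply, and it is written for a current Python 3 rather than the Python 2.3 of this thread.

```python
def hapax_fraction(db, skip_keys=()):
    """Return (hapaxes, tokens, fraction) for a token -> (spam, ham) mapping.

    A hapax is a token seen exactly once in training: once in spam and
    never in ham, or once in ham and never in spam.
    """
    hapax_counts = {(0, 1), (1, 0)}
    # Leave out bookkeeping entries that are not real tokens.
    tokens = [k for k in db if k not in skip_keys]
    hapaxes = sum(1 for k in tokens if tuple(db[k]) in hapax_counts)
    fraction = hapaxes / len(tokens) if tokens else 0.0
    return hapaxes, len(tokens), fraction
```

With a shelve-backed database, a call such as `hapax_fraction(shelve.open(".hammiedb"), skip_keys=("saved state",))` would play the role of the `len(db)-1` adjustment, assuming the bookkeeping entry really is stored under that key.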
> I'd be interested to see what others' hapax fractions are: I don't think that's the right thing to measure. There's really nothing in a database that's interesting on its own; the only thing that matters to performance is what gets used during *scoring* (everything else just sits there, passively, the same as if it didn't exist (except for its effect on database size)). A message score mostly derived from hapaxes is brittle because a single contrary training example can change the classifier's view of a hapax from "hammy" or "spammy" to "neither", and two contrary training examples can swing it to the other classification. In the early days, the database kept track of the last time a token was used in scoring, and the test framework kept track of how often each token got used in scoring. There isn't an out-of-the-box way to get at that info anymore, so it's much harder to investigate how mistake-based training leads to hapax-driven scoring now. It's not *all* bad, or mistake-based training wouldn't be so effective for so many of us. Maybe the clearest example is that the hapaxes found in a new spam campaign are precisely what let us get away with training one sample and thereafter catch others from that campaign; in effect, hapaxes act like a pretty large set of lexical fingerprints in that case. > ... > Another interesting thing (I think) might be to investigate the > importance of synthetic tokens (e.g.: 'url:eweek' or > 'received:168.10.156') vs. natural tokens (e.g., 'highlight' or > 'dot') for smaller vs larger databases. I think one of the reasons > training a single unsure has a dramatic effect on a bunch of other > unsure spams is because of all the synthetic tokens they have in > common due to similar delivery mechanisms (gotta use that account > before it gets shut down...). If a spammer spews a bunch of messages > from ISP A, then gets booted, his next spew will be from somewhere > else.
I suspect many of the ISP-related synthetic tokens generated > will only ever be hapaxes, and thus be much more important with a > small database than with a large one. It was established before that hapaxes are vital in mistake-based training. If you want to test that quickly but informally, modify a copy of your database to throw away all the hapaxes, then live with that reduced database for a while. It will probably have a hard time even with the messages it was originally trained with. From skip at pobox.com Thu Dec 11 11:38:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Dec 11 11:38:47 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests... In-Reply-To: References: <16343.54531.802087.451246@montanaro.dyndns.org> Message-ID: <16344.40324.210107.698842@montanaro.dyndns.org> >> I'd be interested to see what others' hapax fractions are: Tim> I don't think that's the right thing to measure. There's really Tim> nothing in a database that's interesting on its own, the only thing Tim> that matters to performance is what gets used during *scoring* Tim> (everything else just sits there, passively, the same as if it Tim> didn't exist (except for its effect on database size)). Yes, you're correct, of course. So what we might want to look at is the relative occurrence of 0.84 and 0.16 scores in message clues? Tim> It's not *all* bad, or mistake-based training wouldn't be so Tim> effective for so many of us. Maybe the clearest example is that Tim> the hapaxes found in a new spam campaign are precisely what let us Tim> get away with training one sample and thereafter catch others from Tim> that campaign; in effect, hapaxes act like a pretty large set of Tim> lexical fingerprints in that case. This is where I think the synthetic vs. natural tokens thing would be interesting. I get lots of Viagra spam, most of which is caught, but in my current database, 'viagra' is a hapax. In fact, it appears I only added it very recently. 
Here's the evidence header from a message with the subject:

    Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped Overnight

which was scored around 12:25 AM today:

    X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
        'subject:Free': 0.16; 'store': 0.23; 'next': 0.25; 'list,': 0.30;
        'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
        'header:Reply-To:1': 0.64; 'enter': 0.67;
        'content-type:multipart/alternative': 0.68;
        'content-type:text/html': 0.74; 'doctors': 0.84;
        'prescription': 0.84; 'received:103]': 0.84;
        'received:165.175': 0.84; 'received:175': 0.84;
        'received:199.249.165.175': 0.84; 'received:249.165.175': 0.84;
        'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98

Most of the spammy clues are synthetic tokens related to delivery (and are mostly hapaxes), not content. My 'train an unsure or false negative, check for spams' method suggests this is the case, since training on a single message often pushes several other spams about completely different topics into the spam category.

This suggests a couple other downsides to minimalist training. One, spammers have to move, so hapaxes related to delivery are likely to only be useful for a short period while the spammer is abusing a single account. Two, if a delivery token pushes a bunch of other messages into the spam category which are then never used as inputs to training, the opportunity to reinforce that token's quality is lost, even though it might actually appear fairly frequently in spam.

Skip

From richie at entrian.com Thu Dec 11 12:38:33 2003 From: richie at entrian.com (Richie Hindle) Date: Thu Dec 11 12:38:37 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage Message-ID:

In response to Skip's question about hapax ratios, I ran his script and received an error. I boiled the problem down to this:

    >>> print [db[k] for k in db]
    Traceback (most recent call last):
      File "hapaxes.py", line 3, in ?
        print [db[k] for k in db]
      File "C:\Python23\lib\shelve.py", line 118, in __getitem__
        f = StringIO(self.dict[key])
      File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
        return self.db[key]
    KeyError: 'pics'

Excuse me? Er, so how many of these things are there?

    >>> len([k for k in db if db.get(k, None) is None])
    306

And what do they look like?

    >>> from pprint import pprint as p
    >>> p([k for i, k in enumerate(db) if db.get(k, None) is None and i % 50 == 0])
    ['magnetism',
     'url:mlqnuvs',
     'from:addr:wi872u',
     'autograph.',
     'url:ff-programs',
     'motels,']

So they have nothing obvious in common. Looking through the full list it's obvious that they don't all come from one message. Some are obviously ham clues and some are obviously spam.

I'm probably winging my way towards a DBRunRecovery error, unless someone can explain what's going on?

-- Richie Hindle richie@entrian.com

From skip at pobox.com Thu Dec 11 17:00:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Dec 11 17:00:14 2003 Subject: [spambayes-dev] Saving last set and get times in database Message-ID: <16344.59627.337541.794331@montanaro.dyndns.org>

Here's an initial patch which maintains last set and get times for tokens: https://sourceforge.net/tracker/index.php?func=detail&aid=858564&group_id=61702&atid=498105

Very experimental... Caveat emptor...

Skip

From kennypitt at hotmail.com Fri Dec 12 09:36:49 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Dec 12 09:37:23 2003 Subject: [spambayes-dev] FW: [Spambayes] feature request Message-ID:

Mark Hammond wrote:
> [Rayfes]
>> Is there
>> any way to have SpamBayes play a sound every time is marks
>> a message a spam and maybe a different sound for possible spam?
>> That way if I hear a new message come in I could
>> just wait a few seconds to hear whether it was marked spam.
>> If I happen to get multiple messages at time I may miss realizing
>> that some Ham messages came in with some spam but that's ok with me.
> > That is a pretty good idea. If someone can nail down the exact > feature request, I think we could add it. I just submitted patch #858925 that implements a first stab at this. The file notify_sound_patch.txt in the attached ZIP describes the approach I took to answer Mark's original issues. It borrows heavily from Mark's background filtering timer code to implement a "message batch accumulation" delay timer. If you think there is sufficient interest in having this feature, then please try this out and comment on what you do or don't like about the approach. It's been working well for the way I use Outlook, but then that's what I designed it for so others might prefer it to work differently. If we decide to add it to the product, I'll put together an update to SpamBayes Manager to configure it. -- Kenny Pitt From mhammond at skippinet.com.au Sat Dec 13 22:19:15 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat Dec 13 22:19:37 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? Message-ID: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden> As far as I understand it, experimental_ham_spam_imbalance_adjustment has been found to be ineffective, and that all default options have now set this to False. However, there is an issue regarding existing users of the addin. For Outlook in particular, if an old copy of "default_bayes_customize.ini" exists, we do not copy our new version over it. As experimental_ham_spam_imbalance_adjustment was set to True in early versions, this option will remain in effect, even when these users upgrade to newer versions. I don't think a similar issue exists with the other apps. 
Short term, the solution seems to be to nuke experimental_ham_spam_imbalance_adjustment from classifier.py - unless of course, there is some good reason to leave it for continued experiments (in which case I would just force it False in the Outlook init code) Longer term, I think the way we copy this file to the users data directory was a mistake, and I am likely to fix it (there is a bug on it from a confused user). Any thoughts? Mark. From tim.one at comcast.net Sat Dec 13 23:18:42 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 13 23:18:43 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden> Message-ID: [Mark Hammond] > As far as I understand it, experimental_ham_spam_imbalance_adjustment > has been found to be ineffective, It seemed OK so long as the data didn't get *too* unbalanced -- when there was extreme imbalance, it was not only ineffective, it did major harm. > and that all default options have now set this to False. I hope so. That was the plan. > However, there is an issue regarding existing users of the addin. For > Outlook in particular, if an old copy of "default_bayes_customize.ini" > exists, we do not copy our new version over it. As > experimental_ham_spam_imbalance_adjustment was set to True in early > versions, this option will remain in effect, even when these users > upgrade to newer versions. > > I don't think a similar issue exists with the other apps. > > Short term, the solution seems to be to nuke > experimental_ham_spam_imbalance_adjustment from classifier.py - > unless of course, there is some good reason to leave it for continued > experiments (in which case I would just force it False in the Outlook > init code) Na, it's a proven loser. I just deleted the code from classifier.py, and reworded some of the docs. Options.py still knows about it, though, to avoid breaking any .ini file that still references it. 
I'm not sure how to get rid of it completely. > Longer term, I think the way we copy this file to the users data > directory was a mistake, and I am likely to fix it (there is a bug on > it from a confused user). > > Any thoughts? Unsure what you have in mind -- but doubt it's insane . From tim.one at comcast.net Sun Dec 14 00:05:02 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Dec 14 00:05:04 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Richie Hindle] > In response to Skip's question about hapax ratios, I ran his script > and received an error. I boiled the problem down to this: > > >>> print [db[k] for k in db] > Traceback (most recent call last): > File "hapaxes.py", line 3, in ? > print [db[k] for k in db] > File "C:\Python23\lib\shelve.py", line 118, in __getitem__ > f = StringIO(self.dict[key]) > File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__ > return self.db[key] > KeyError: 'pics' > > Excuse me? Er, so how many of these things are there? > > >>> len([k for k in db if db.get(k, None) is None]) 306 Ouch. What do you get if you open the database directly, instead of indirecting thru a shelf? I'm just trying to make sure it's really the database that's hosed. For example, here's a complete program picking on my database: PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db" import bsddb d = bsddb.hashopen(PATH, 'r') print len(d) print len([k for k in d if d.get(k, None) is None]) That printed 40787, then 0, when I ran it just now. > And what do they look like? Doesn't matter -- it should never happen! > >>> from pprint import pprint as p > >>> p([k for i, k in enumerate(db) if db.get(k, None) is None and i > % 50 == 0]) > ['magnetism', > 'url:mlqnuvs', > 'from:addr:wi872u', > 'autograph.', > 'url:ff-programs', > 'motels,'] > > So they have nothing obvious in common. Looking through the full list > it's obvious that they don't all come from one message. 
Some are > obviously ham clues and some are obviously spam. > > I'm probably winging my way towards a DBRunRecovery error, unless > someone can explain what's going on? I've fixed miserable *similar* bugs in ZODB's BTrees (enumerating finds keys that direct lookup doesn't believe exist), so I'm not shocked if some other database screws up in this way too. Gotta say, I'm half ready to declare that ZODB is the only database anyone should ever use (the bugs in that are long fixed ). From mhammond at skippinet.com.au Sun Dec 14 06:06:31 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sun Dec 14 06:06:45 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: Message-ID: <0fe101c3c232$55023ff0$2c00a8c0@eden> > Na, it's a proven loser. I just deleted the code from > classifier.py, and reworded some of the docs. Thanks! That was exactly what I hoped would happen :) > Options.py still knows about it, though, to > avoid breaking any .ini file that still references it. I'm > not sure how to get rid of it completely. Yep - me too, and just perfect! Thanks, Mark. From skip at pobox.com Sun Dec 14 15:31:29 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Dec 14 15:31:35 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden> References: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden> Message-ID: <16348.51361.368410.400031@montanaro.dyndns.org> Mark> However, there is an issue regarding existing users of the addin. Mark> For Outlook in particular, if an old copy of Mark> "default_bayes_customize.ini" exists, we do not copy our new Mark> version over it. As experimental_ham_spam_imbalance_adjustment Mark> was set to True in early versions, this option will remain in Mark> effect, even when these users upgrade to newer versions. 
If you rip out the code in classifier.py, you should be able to simply change its name in Options.py to x-experimental_ham_spam_imbalance_adjustment That's how you deprecate an option based upon the code I added to OptionsClass.py the other day. If the user sets "foo" but it doesn't exist and "x-foo" does, a message is printed to stderr, but nothing bombs. Take a look at the docstring for the OptionsClass module. Mark> Longer term, I think the way we copy this file to the users data Mark> directory was a mistake, and I am likely to fix it (there is a bug Mark> on it from a confused user). What's the mistake? I think it's correct to not obliterate the user's local copy of the config file. Plenty of programs either don't copy config files during install if a copy is already present, or install it to a different name so the user can compare local and as-distributed versions of the file. Skip From skip at pobox.com Sun Dec 14 15:34:32 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Dec 14 15:34:35 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <16348.51544.686632.177594@montanaro.dyndns.org> Tim> import bsddb Tim> d = bsddb.hashopen(PATH, 'r') Tim> print len(d) Tim> print len([k for k in d if d.get(k, None) is None]) Tim> That printed 40787, then 0, when I ran it just now. I also get N and 0 for my working database (opened with anydbm). Skip From tim.one at comcast.net Sun Dec 14 19:22:42 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Dec 14 19:22:46 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <16348.51361.368410.400031@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > If you rip out the code in classifier.py, you should be able to simply > change its name in Options.py to > > x-experimental_ham_spam_imbalance_adjustment > > That's how you deprecate an option based upon the code I added to > OptionsClass.py the other day. 
If the user sets "foo" but it doesn't > exist and "x-foo" does, a message is printed to stderr, but nothing > bombs. Take a look at the docstring for the OptionsClass module. Cool! I just did that. There's a minor problem: the OptionsClass module says the magic prefix is X- (uppercase), but only x- (lowercase) works as intended. With X-experimental_ham_spam_imbalance_adjustment the warning is warning: Invalid option experimental_ham_spam_imbalance_adjustment in section Classifier in file C:\WINDOWS\Application Data\SpamBayes\default_bayes_customize.ini and with x-experimental_ham_spam_imbalance_adjustment it's the mostly hoped-for warning: option experimental_ham_spam_imbalance_adjustment in section Classifier is deprecated I'm not sure what your intent was, but the code should match the docs one way or the other. The second form of message should probably include the filename too. Works slick, though! From tim.one at comcast.net Sun Dec 14 20:06:03 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Dec 14 20:06:14 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests... In-Reply-To: <16344.40324.210107.698842@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > This is where I think the synthetic vs. natural tokens thing would be > interesting. I'm not sure what's being distinguished here. > I get lots of Viagra spam, most of which is caught, but in my current > database, 'viagra' is a hapax. In fact, it appears I only added it > very recently. Here's the evidence header from a message with the > subject: > > Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped > Overnight > > which was scored around 12:25 AM today: > > X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16; > 'subject:Free': 0.16; "Free" in a Subject line and "drug" in the body are hammy for you? Staring at clues from mistake-based training can be, umm, counter-intuitive . 
> 'store': 0.23; 'next': 0.25; 'list,': 0.30; > 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62; > 'header:Reply-To:1': 0.64; 'enter': 0.67; > 'content-type:multipart/alternative': 0.68; > 'content-type:text/html': 0.74; 'doctors': 0.84; > 'prescription': 0.84; 'received:103]': 0.84; > 'received:165.175': 0.84; 'received:175': 0.84; > 'received:199.249.165.175': 0.84; 'received:249.165.175': > 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98 > > Most of the spammy clues are synthetic tokens related to delivery > (and are mostly hapaxes), not content. I'm not sure what's synthetic about these. Most of your spam clues come from the email *headers*, but that's fair game. Note that mining received headers is disabled by default, so you're getting a pile of clues most people aren't getting. Maybe they should. > My 'train an unsure or false negative, check for spams' method suggests > this is the case, since training on a single message often pushes several > other spams about completely different topics into the spam category. I'm unclear on what's noteworthy about that. The biz domain is used by lots of spam, lots of spam has a yahoo.com return address, lots of spam is multipart/alternative HTML, and so on. Looks like you're generating 4 correlated clues from a single Received header, and that you got one spam before from the same box. Strangely, though, it looks like you're sucking out *suffixes* of IP addrs instead of prefixes (you've got 199.249.165.175 249.165.175 165.175 and 175 but not the almost-surely more useful 199.249.165 199.249 and 199 ). > This suggests a couple other downsides to minimalist training. One, > spammers have to move, so hapaxes related to delivery are likely to > only be useful for a short period while the spammer is abusing a > single account. IP *prefixes* should be useful despite that, due to the way IP space is handed out. 
If you're a spammer with a cooperative host, you're likely to get other IP addresses from the netblocks assigned to that host, and they'll share a common prefix. > Two, if a delivery token pushes a bunch of other messages into the > spam category which are then never used as inputs to training, the > opportunity to reinforce that token's quality is lost, even though it > might actually appear fairly frequently in spam. I expect 'subject:Free' was a fine example of that. From skip at pobox.com Sun Dec 14 21:13:43 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Dec 14 21:13:50 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests... In-Reply-To: References: <16344.40324.210107.698842@montanaro.dyndns.org> Message-ID: <16349.6359.927187.517763@montanaro.dyndns.org> >> X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16; >> 'subject:Free': 0.16; Tim> "Free" in a Subject line and "drug" in the body are hammy for you? Tim> Staring at clues from mistake-based training can be, umm, Tim> counter-intuitive . Yeah, one of the online communities I participate in is a list of parents of "troubled kids", hence the hammy "drug" reference. "subject:Free" comes from the music community: Subject: SFS Special Announcement (Free Guest List to Fluid this Friday) >> 'store': 0.23; 'next': 0.25; 'list,': 0.30; >> 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62; >> 'header:Reply-To:1': 0.64; 'enter': 0.67; >> 'content-type:multipart/alternative': 0.68; >> 'content-type:text/html': 0.74; 'doctors': 0.84; >> 'prescription': 0.84; 'received:103]': 0.84; >> 'received:165.175': 0.84; 'received:175': 0.84; >> 'received:199.249.165.175': 0.84; 'received:249.165.175': >> 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98 >> >> Most of the spammy clues are synthetic tokens related to delivery >> (and are mostly hapaxes), not content. Tim> I'm not sure what's synthetic about these. 
I guess my operational definitions of "synthetic" and "natural" tokens are in order: "natural tokens" are those which derive simply by splitting the message body on whitespace boundaries. "synthetic tokens" are those which are not "natural tokens". Tim> Most of your spam clues come from the email *headers*, but that's Tim> fair game. Note that mining received headers is disabled by Tim> default, so you're getting a pile of clues most people aren't Tim> getting. Maybe they should. Sure, email headers are fair game, but if the tokenizer didn't do anything special with them, that "subject:Free" token would at most just be "free" or "Free". >> My 'train an unsure or false negative, check for spams' method >> suggests this is the case, since training on a single message often >> pushes several other spams about completely different topics into the >> spam category. Tim> I'm unclear on what's noteworthy about that. The biz domain is Tim> used by lots of spam, lots of spam has a yahoo.com return address, Tim> lots of spam is multipart/alternative HTML, and so on. Looks like Tim> you're generating 4 correlated clues from a single Received header, Tim> and that you got one spam before from the same box. Strangely, Tim> though, it looks like you're sucking out *suffixes* of IP addrs Tim> instead of prefixes (you've got Tim> 199.249.165.175 Tim> 249.165.175 Tim> 165.175 Tim> and Tim> 175 Tim> but not the almost-surely more useful Tim> 199.249.165 Tim> 199.249 Tim> and Tim> 199 Tim> ). I don't know. I agree those look backwards (that's my mail server, BTW). OTOH, given the fairly random assignment of IP networks, I doubt it makes much sense for the above IP address to be stripped of more than the last two octets ("received:199.249.165.175", "received:199.249.165" and "received:199.249"). "received:199", where 199 is the first octet, not the last, almost certainly means nothing. If it's spammy or hammy, it's just by sheer coincidence. 
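For illustration only, here is a minimal sketch of the prefix-style tokens being discussed. It is not the SpamBayes tokenizer: the function name ip_prefix_tokens and the min_octets parameter are hypothetical, and only the "received:" tag is taken from the evidence header above. min_octets=2 gives the three tokens suggested in the preceding paragraph; min_octets=1 also emits the bare first octet.

```python
def ip_prefix_tokens(ip, min_octets=1, tag="received:"):
    """Yield a token for a dotted-quad IPv4 address and each leading prefix.

    Hypothetical helper, not SpamBayes code. min_octets is the shortest
    prefix (counted in octets) that still gets a token.
    """
    if not 1 <= min_octets <= 4:
        raise ValueError("min_octets must be between 1 and 4")
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % (ip,))
    # Longest form first, then drop one trailing octet at a time.
    for n in range(4, min_octets - 1, -1):
        yield tag + ".".join(octets[:n])


for token in ip_prefix_tokens("199.249.165.175", min_octets=2):
    print(token)
# received:199.249.165.175
# received:199.249.165
# received:199.249
```

Whether the one-octet token is worth generating is exactly the open question in this thread, so the sketch leaves that choice to the caller.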
>> This suggests a couple other downsides to minimalist training. One, >> spammers have to move, so hapaxes related to delivery are likely to >> only be useful for a short period while the spammer is abusing a >> single account. Tim> IP *prefixes* should be useful despite that, due to the way IP Tim> space is handed out. If you're a spammer with a cooperative host, Tim> you're likely to get other IP addresses from the netblocks assigned Tim> to that host, and they'll share a common prefix. Again, no more general than the first two octets (a class B network). Class A networks are very rare (for obvious reasons): http://euclid.math.brandeis.edu/turtschi/whois/neta1.html >> Two, if a delivery token pushes a bunch of other messages into the >> spam category which are then never used as inputs to training, the >> opportunity to reinforce that token's quality is lost, even though it >> might actually appear fairly frequently in spam. Tim> I expect 'subject:Free' was a fine example of that. 'subject:Free' is now slightly spammy, having turned up in three spams and only one ham at this point. Skip From tim.one at comcast.net Sun Dec 14 22:09:04 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Dec 14 22:09:14 2003 Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests... In-Reply-To: <16349.6359.927187.517763@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I guess my operational definitions of "synthetic" and "natural" > tokens are in order: > > "natural tokens" are those which derive simply by splitting the > message body on whitespace boundaries. > > "synthetic tokens" are those which are not "natural tokens". OK. Now I've forgotten why you drew the distinction to begin with <0.9 wink>. [about busting apart IP addrs] > I don't know. I agree those look backwards (that's my mail server, > BTW). 
OTOH, given the fairly random assignment of IP networks, I > doubt it makes much sense for the above IP address to be stripped of > more than the last two octets ("received:199.249.165.175", > "received:199.249.165" and "received:199.249"). "received:199", > where 199 is the first octet, not the last, almost certainly means > nothing. If it's spammy or hammy, it's just by sheer coincidence. In that case, the database will learn it; since it can't generate more than 126 legitimate "Class A" tokens total, it's a trivial database burden. OTOH, for someone in the DOD, it may be valuable to know that email came from a DOD Class A network. On the third hand, spammers often forge Received headers, and I doubt most do research to forge sensible IPs. IOW, the system learns what does and doesn't work, in both directions, provided only that it's shown potentially interesting stuff. > ... > Again, no more general than the first two octets (a class B network). > Class A networks are very rare (for obvious reasons): > > http://euclid.math.brandeis.edu/turtschi/whois/neta1.html They're rarer than that now -- that's over 4 years old, and lots of those have been busted up. Since current practice is to assign a range of initial bits instead of initial bytes, maybe we should generate all *bit* prefixes instead. That would sure test whether correlation is our friend . From richie at entrian.com Mon Dec 15 04:00:19 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 15 04:00:52 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: hfahtvkejos62a0d8jfmc2mkvghpvgvkih@4ax.com Message-ID: [Richie] > >>> print [db[k] for k in db] > KeyError: 'pics' [Tim] > Ouch. What do you get if you open the database directly, instead of > indirecting thru a shelf? I'm just trying to make sure it's really the > database that's hosed. 
I think we're using different versions of bsddb - your code fails for me:

>>> d = bsddb.hashopen("/src/tests/spambayes/hammie.db")
>>> len(d)
52331
>>> len([k for k in d if d.get(k, None) is None])
Traceback (most recent call last):
  File "", line 1, in -toplevel-
    len([k for k in d if d.get(k, None) is None])
  File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
    return self.db[key]
TypeError: Integer keys only allowed for Recno and Queue DB's

I think this is because GET_ITER is creating a list-style iterator rather than a dict-style one. bsddb objects don't look much like dictionaries:

>>> len([k for k in d.keys() if d.get(k, None) is None])
Traceback (most recent call last):
  File "", line 1, in -toplevel-
    len([k for k in d.keys() if d.get(k, None) is None])
AttributeError: _DBWithCursor instance has no attribute 'get'

I have Python 2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit (Intel)] on win32. Assuming that's a red herring, here's an equivalent that works for me:

>>> def get(d, k, default):
        try:
            return d[k]
        except KeyError:
            return default

>>> len([k for k in d.keys() if get(d, k, None) is None])
305

So yes, the underlying database is screwed. But one token less screwed than last time - lovely. (I now get 305 when going through shelve as well.) I've done some training in between, which must have jiggled things around.

[Tim]
> Gotta say, I'm half ready to declare
> that ZODB is the only database anyone should ever use (the bugs in that are
> long fixed ).

I'm certainly underwhelmed by bsddb in single-file mode. One day I want to make spambayes use full transaction mode - that really ought to work. (Does anyone know of any simple Python code I can steal that uses bsddb in full-on multi-everything DBEnv mode? The pybsddb docs just link to the SleepyCat C API docs, which aren't very approachable.)
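The consistency check in the session above does not depend on bsddb itself, so it can be sketched against any mapping-like object. This is an illustration under stated assumptions, not code from the thread: missing_keys and FlakyDict are hypothetical names, and FlakyDict only simulates a damaged database in which some enumerated keys cannot be fetched. It is written for current Python, not the 2.3 used in the thread.

```python
def missing_keys(db):
    """Return the keys that db.keys() reports but db[key] cannot fetch."""
    missing = []
    for key in db.keys():
        try:
            db[key]
        except KeyError:
            missing.append(key)
    return missing


class FlakyDict(dict):
    """Stand-in for a damaged database: lookups of some keys fail."""

    def __init__(self, data, broken):
        dict.__init__(self, data)
        self.broken = set(broken)

    def __getitem__(self, key):
        # Enumeration still sees the key, but direct lookup denies it.
        if key in self.broken:
            raise KeyError(key)
        return dict.__getitem__(self, key)


db = FlakyDict({"pics": 1, "magnetism": 2, "our": 3}, broken=["pics"])
print(missing_keys(db))
# ['pics']
```

On a healthy database the returned list should be empty, which matches the "N and 0" results reported earlier in the thread.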
-- Richie Hindle richie@entrian.com From tim at fourstonesExpressions.com Mon Dec 15 08:24:34 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Mon Dec 15 08:24:41 2003 Subject: [spambayes-dev] Fwd: [Spambayes] Won't work anymore In-Reply-To: <000201c3c2cf$352f58a0$e09b2e04@home> References: <000201c3c2cf$352f58a0$e09b2e04@home> Message-ID: I don't know if anyone saw this on the spambayes list, but it seems severe, and I don't know how to respond.... ------- Forwarded message ------- From: Jones Clan To: spambayes@python.org Subject: [Spambayes] Won't work anymore Date: Sun, 14 Dec 2003 21:49:29 -0800 > I loved your product with Outlook 2000. But now that I have installed > XP, it won't work. I don't get any errors but I click the button on the > toolbar and nothing happens. I have uninstalled and reinstalled > thinking it had to be installed again after the Outlook upgrade. Still > nothing. Please help because I miss using your product. > > McLean Jones > NO Sugar - NO Carb Energy Drink > www.getsomexs.com > user: mclean > pass: guest > 888.870.5070 > > -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun -------------- next part -------------- A non-text attachment was scrubbed... Name: attachment675.dat Type: application/octet-stream Size: 180 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031215/2b1b86ea/attachment675-0001.obj From tim at fourstonesExpressions.com Mon Dec 15 08:26:27 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Mon Dec 15 08:26:33 2003 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile In-Reply-To: <000001c3c2b8$ce2de0b0$1e02a8c0@JDi8000> References: <000001c3c2b8$ce2de0b0$1e02a8c0@JDi8000> Message-ID: Oops... forwarded the wrong message. This is the one I was thinking of. 
This seems severe, and I've not seen this problem pop up in the list before. I don't know how to respond. ------- Forwarded message ------- From: My Tech To: spambayes@python.org Subject: [Spambayes] SpamBayes Corrupted My Profile Date: Sun, 14 Dec 2003 22:09:07 -0500 > After installing SpamBayes, Outlook could only be opened in Safe Mode. > (That is to say that when clicking on the Outlook desktop icon, a > dialogue > box popped up before the application would open, informing me that > Outlook > had encountered an error and needed to shut down. The checkbox to > "Restart > Outlook" was already checked and I clicked on the "Don't Send [Error > Report > to Microsoft]" button. Then, a new dialogue box popped up saying that > Outlook failed to start correctly and asked me if I wanted to start in > Safe > Mode, "Yes" or "No." If I select "No", then the first dialogue box > re-appears telling me about Outlook encountering an error and wanting to > restart. If I select "Yes", only then will Outlook open.) > > I've come to find out that installing SpamBayes has corrupted my Windows > Administrator profile and that is why Outlook will not open. PLEASE HELP > ASAP!!! I do not want to have to reinstall my OS (and all of my > software) > because of this. > > FYI: My Windows OS: 2000 Professional, 5.00.2195, Service Pack 4 > SpamBayes installer used: SpamBayes-Outlook-Setup-0081.exe > Outlook version: 2002, part of Office XP Small Business Edition > > If there is a way to fix this, please tell me. Also, please send > detailed > instructions for installing SpamBayes, as it appears that I did not do it > correctly (even though I followed the instructions per the SpamBayes > website.) > > Thank you. -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun -------------- next part -------------- A non-text attachment was scrubbed... 
Name: attachment695.dat Type: application/octet-stream Size: 180 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031215/64523b79/attachment695.obj From skip at pobox.com Mon Dec 15 09:48:48 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 15 09:48:51 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: References: <16348.51361.368410.400031@montanaro.dyndns.org> Message-ID: <16349.51664.580182.339027@montanaro.dyndns.org> Tim> Cool! I just did that. There's a minor problem: the OptionsClass Tim> module says the magic prefix is Tim> X- Tim> (uppercase), but only Tim> x- Tim> (lowercase) works as intended. With Tim> X-experimental_ham_spam_imbalance_adjustment Tim> the warning is Tim> warning: Invalid option experimental_ham_spam_imbalance_adjustment in Tim> section Classifier in file Tim> C:\WINDOWS\Application Data\SpamBayes\default_bayes_customize.ini Tim> and with Tim> x-experimental_ham_spam_imbalance_adjustment Tim> it's the mostly hoped-for Tim> warning: option experimental_ham_spam_imbalance_adjustment in Tim> section Classifier is deprecated Tim> I'm not sure what your intent was, but the code should match the Tim> docs one way or the other. The second form of message should Tim> probably include the filename too. My intent was to mimic rfc-822-style experimental headers, but it appears I don't really understand what ConfigParser does vis a vis case-sensitivity. (I thought it was case-insensitive, and the code suggests it is, but that seems to not quite be the case.) In OptionsClass.merge_file() I originally wanted to use X- (note its presence in a comment I forgot to change), but wound up switching to x-. A little more investigation suggests that ConfigParser does indeed ignore case in the values it reads from the options file, but that the code in OptionsClass.py doesn't treat the options it stores in self._options that way. 
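As a hedged illustration of the case-sensitivity point: the standard library parser (ConfigParser in the Python 2 of this thread, configparser today) lower-cases option names as it reads them, so a lookup table that stores "X-..." verbatim will never match. The sketch below is not the OptionsClass implementation; option_status and DEPRECATED_PREFIX are hypothetical names, and it simply folds case on both sides before comparing.

```python
import configparser  # spelled ConfigParser in Python 2

DEPRECATED_PREFIX = "x-"


def option_status(section, name, known):
    """Classify an option name as 'valid', 'deprecated' or 'invalid'.

    known is a set of (section, option) pairs. Option names are compared
    case-insensitively, so "X-foo" and "x-foo" mean the same thing.
    """
    known = set((sect, opt.lower()) for sect, opt in known)
    name = name.lower()
    if (section, name) in known:
        return "valid"
    if (section, DEPRECATED_PREFIX + name) in known:
        return "deprecated"
    return "invalid"


# The option is registered with an uppercase X- prefix on purpose.
known = set([("Classifier", "X-experimental_ham_spam_imbalance_adjustment")])

parser = configparser.ConfigParser()
parser.read_string(
    "[Classifier]\nExperimental_Ham_Spam_Imbalance_Adjustment: True\n"
)
names = parser.options("Classifier")
print(names)
# ['experimental_ham_spam_imbalance_adjustment']
for name in names:
    print(name, option_status("Classifier", name, known))
# experimental_ham_spam_imbalance_adjustment deprecated
```

Folding case at comparison time lets users write either x- or X- without the docs and the code having to agree on one spelling.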
I don't really care what hoops we as programmers have to jump through, but I'd like users to be able to use either x- or X-. I agree, the docstrings should match required usage. In any case, it appears Tim (or someone else) has fixed things. Skip From tim.one at comcast.net Mon Dec 15 11:00:13 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 15 11:00:47 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Richie Hindle] > I think we're using different versions of bsddb - your code fails for > me: > > >>> d = bsddb.hashopen("/src/tests/spambayes/hammie.db") > >>> len(d) > 52331 > >>> len([k for k in d if d.get(k, None) is None]) > Traceback (most recent call last): > File "", line 1, in -toplevel- > len([k for k in d if d.get(k, None) is None]) > File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__ > return self.db[key] > TypeError: Integer keys only allowed for Recno and Queue DB's > > I think this is because GET_ITER is creating a list-style iterator > rather than a dict-style one. bsddb objects don't look much like > dictionaries: > > >>> len([k for k in d.keys() if d.get(k, None) is None]) > Traceback (most recent call last): > File "", line 1, in -toplevel- > len([k for k in d.keys() if d.get(k, None) is None]) > AttributeError: _DBWithCursor instance has no attribute 'get' Not here: >>> PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db" >>> import bsddb >>> d = bsddb.hashopen(PATH, 'r') >>> len([k for k in d.keys() if d.get(k, None) is None]) 0 >>> > I have Python 2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit > (Intel)] on win32. Assuming that's a red herring, I wouldn't assume that -- it may be the whole ball of wax. I'm using exactly the same, *except* I'm using 2.3.3c1 (also on Windows), and a number of bsddb3 fixes have been checked in since Python 2.3. It would help if you tried 2.3.3c1. 
If your symptoms above persist, then we've got a Major Mystery to sort out (e.g., maybe you-- or I --aren't getting the version of bsddb the Windows installer intended us to get). > here's an equivalent that works for me: > > >>> def get(d, k, default): > try: > return d[k] > except KeyError: > return default > > >>> len([k for k in d.keys() if get(d, k, None) is None]) 305 > > So yes, the underlying database is screwed. But one token less > screwed than last time - lovely. (I now get 305 when going through > shelve as well.) I've done some training in between, which must have > jiggled things around. ... > I'm certainly underwhelmed by bsddb in single-file mode. One day I > want to make spambayes use full transaction mode - that really ought > to work. (Does anyone know of any simple Python code I can steal that > uses bsddb in full-on multi-everything DBEnv mode? The pybsddb docs > just link to the SleepyCat C API docs, which aren't very > approachable.) Best I can suggest is studying Python's bsddb3 substantial test suite. ZODB has modules to build ZODB's transaction model on top of a Berkeley database, but I don't think I'd call that simple. I'm not a bsddb guy, though, so those are just random things I've seen. From tim.one at comcast.net Mon Dec 15 11:12:52 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 15 11:12:57 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <16349.51664.580182.339027@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > My intent was to mimic rfc-822-style experimental headers, but it > appears I don't really understand what ConfigParser does vis a vis > case-sensitivity. (I thought it was case-insensitive, and the code > suggests it is, but that seems to not quite be the case.) In > OptionsClass.merge_file() I originally wanted to use X- (note its > presence in a comment I forgot to change), but wound up switching to > x-. 
A little more investigation suggests that ConfigParser does > indeed ignore case in the values it reads from the options file, but > that the code in OptionsClass.py doesn't treat the options it stores > in self._options that way. I'd call that a bug in OptionsClass.py, then. ConfigParser is modeled on RFC 822 header fields, and supplies case-insensitive option names *because* 822 mandates case-insensitive semantics. > I don't really care what hoops we as programmers have to jump > through, but I'd like users to be able to use either x- or X-. I > agree, the docstrings should match required usage. In any case, it > appears Tim (or someone else) has fixed things). Tony checked in a bunch of changes, but I suspect it's still case-sensitive. From kennypitt at hotmail.com Mon Dec 15 11:27:46 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Dec 15 11:28:35 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: Tim Peters wrote: > [Richie Hindle] >> I think we're using different versions of bsddb - your code fails >> for me: >> >> >>> d = bsddb.hashopen("/src/tests/spambayes/hammie.db") >> >>> len(d) >> 52331 >> >>> len([k for k in d if d.get(k, None) is None]) >> Traceback (most recent call last): >> File "", line 1, in -toplevel- >> len([k for k in d if d.get(k, None) is None]) >> File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__ >> return self.db[key] >> TypeError: Integer keys only allowed for Recno and Queue DB's > > Not here: > > >>> PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db" > >>> import bsddb > >>> d = bsddb.hashopen(PATH, 'r') > >>> len([k for k in d.keys() if d.get(k, None) is None]) > 0 > >>> > >> I have Python 2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit >> (Intel)] on win32. Assuming that's a red herring, > > I wouldn't assume that -- it may be the whole ball of wax. 
I'm using > exactly the same, *except* I'm using 2.3.3c1 (also on Windows), and a > number of bsddb3 fixes have been checked in since Python 2.3. It > would help if you tried 2.3.3c1. If your symptoms above persist, > then we've got a Major Mystery to sort out (e.g., maybe you-- or I > --aren't getting the version of bsddb the Windows installer intended > us to get). I get the same results as Tim using the 2.3.2 final version: Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32 In my 2.3.2 lib, the "return self.db[key]" line in __getitem__ is on line 116 of __init__.py, not line 86 as in Richie's traceback. I could expect some changes between Python 2.3 and 2.3.2, but 30 lines seems a bit much between minor bugfix releases. Is that possibly an indicator of a bsddb version mismatch? -- Kenny Pitt From tim.one at comcast.net Mon Dec 15 11:40:36 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 15 11:40:40 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Kenny Pitt] > I get the same results as Tim using the 2.3.2 final version: Python > 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on > win32 > > In my 2.3.2 lib, the "return self.db[key]" line in __getitem__ is on > line 116 of __init__.py, not line 86 as in Richie's traceback. I > could expect some changes between Python 2.3 and 2.3.2, but 30 lines > seems a bit much between minor bugfix releases. Is that possibly an > indicator of a bsddb version mismatch? It's more an indicator of bugs in 2.3's bsddb support. __init__.py was at rev 1.5 in the 2.3 release, and is at rev 1.12(!) today: http://cvs.sf.net/viewcvs.py/python/python/dist/src/Lib/bsddb/__init__.py I see that support for the iterator and mapping protocols wasn't added until rev 1.6, which is why they don't work for Richie in 2.3 final. 
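A toy model may help show why a wrapper without iterator support fails in the way Richie's traceback shows. This is not the bsddb source: NoIterDB and IterDB are hypothetical classes, and only the error message text is copied from the traceback in this thread. When a class defines __getitem__ but not __iter__, Python falls back to the old sequence protocol and calls __getitem__(0), __getitem__(1), and so on, which a string-keyed database rejects.

```python
class NoIterDB:
    """Toy wrapper with __getitem__ and keys() but no __iter__ or get()."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Message text borrowed from the traceback in this thread.
            raise TypeError("Integer keys only allowed for Recno and Queue DB's")
        return self._data[key]

    def keys(self):
        return list(self._data.keys())


class IterDB(NoIterDB):
    """Same wrapper with the iterator and mapping helpers added."""

    def __iter__(self):
        return iter(self.keys())

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default


def try_iterate(db):
    """Return the sorted keys seen by 'for k in db', or the error text."""
    try:
        return sorted(k for k in db)
    except TypeError as exc:
        return "TypeError: %s" % exc


print(try_iterate(NoIterDB({"pics": 1, "our": 2})))
# TypeError: Integer keys only allowed for Recno and Queue DB's
print(try_iterate(IterDB({"pics": 1, "our": 2})))
# ['our', 'pics']
```

Iterating over d.keys() instead of d, as in Richie's workaround, sidesteps the fallback entirely because the list has its own iterator.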
From kennypitt at hotmail.com Mon Dec 15 12:19:19 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Dec 15 12:19:57 2003 Subject: [spambayes-dev] sb_server UI error Message-ID: Looks like a usage got missed when deprecating the extract_dow option: """ 500 Server error Traceback (most recent call last): File "spambayes\Dibbler.pyc", line 457, in found_terminator File "spambayes\UserInterface.pyc", line 629, in onAdvancedconfig File "spambayes\UserInterface.pyc", line 692, in _buildConfigPage File "spambayes\OptionsClass.pyc", line 563, in valid_input KeyError: ('Tokenizer', 'extract_dow') """ This appears to come from the adv_map in ProxyUI.py. The generate_time_buckets option will likely generate the same error. -- Kenny Pitt From barry at python.org Mon Dec 15 12:53:26 2003 From: barry at python.org (Barry Warsaw) Date: Mon Dec 15 12:53:32 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071510805.970.122.camel@anthem> On Mon, 2003-12-15 at 04:00, Richie Hindle wrote: > (Does anyone know of any simple Python code I can steal that uses bsddb in > full-on multi-everything DBEnv mode? Sorry, but that's too oxymoronic of a request to fulfill. But you can look at ZODB's BerkeleyDB based storage code which is a good working example of a full-on transactional BerkeleyDB application. If you can ignore the peculiarities of ZODB's storage API and all the tables used to support it, the code should be helpful. -Barry From skip at pobox.com Mon Dec 15 12:57:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 15 12:59:29 2003 Subject: [spambayes-dev] sb_server UI error In-Reply-To: References: Message-ID: <16349.63007.770473.205555@montanaro.dyndns.org> Kenny> Looks like a usage got missed when deprecating the extract_dow Kenny> option: ... Yeah, I wasn't aware these things were referenced anywhere but in the Options.py and tokenizer files. 
Try removing lines from ImapUI.py and ProxyUI.py which contain extract_dow or generate_time_buckets, then start again. If that works, let me know and I'll check in the change. (Deprecated options should probably not be offered in even the advanced options page, right?) Skip From kennypitt at hotmail.com Mon Dec 15 13:13:03 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Dec 15 13:13:39 2003 Subject: [spambayes-dev] sb_server UI error In-Reply-To: <16349.63007.770473.205555@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Kenny> Looks like a usage got missed when deprecating the extract_dow > Kenny> option: > ... > > Yeah, I wasn't aware these things were referenced anywhere but in the > Options.py and tokenizer files. Try removing lines from ImapUI.py and > ProxyUI.py which contain extract_dow or generate_time_buckets, then > start again. If that works, let me know and I'll check in the change. I don't have a setup to test ImapUI.py, but that works for ProxyUI.py. > (Deprecated options should probably not be offered in even the > advanced options page, right?) Probably not, but FWIW adding the 'x-' in front of the option name in ProxyUI.py also works. I suppose you could make a case for leaving the options on the config page for a release or two so users can see that they have been deprecated. Don't know if anyone is more likely to see it there than in the logs, though. -- Kenny Pitt From tameyer at ihug.co.nz Mon Dec 15 19:22:00 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 15 19:22:08 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0815@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz> [Skip] > I don't really care what hoops we as programmers have to jump through, > but I'd like users to be able to use either x- or X-. I agree, the > docstrings should match required usage. 
In any case, it appears Tim > (or someone else) has fixed things). What happens at the moment is that ConfigParser lowercases all the option names (but not section names; I don't know whether that's deliberate or not) when it reads them from a file. So users can happily use "X-" or "x-" and by the time OptionsClass deals with them it'll be "x-". I changed the comments so that they all use "x-", but haven't added anything about this. Us programmers *must* use "x-" when referring to the options at the moment. [Tim] > I'd call that a bug in OptionsClass.py, then. ConfigParser > is modeled on RFC 822 header fields, and supplies > case-insensitive option names *because* 822 mandates > case-insensitive semantics. So should get/set in our OptionsClass also be case insensitive in situations other than reading in the config files? At the moment options["Sect", "Opt"] != options["Sect", "opt"], but it would be a simple enough change (and we certainly don't have any options with the same name but differing case). > Tony checked in a bunch of changes, but I suspect it's still > case-sensitive. I made the mistake of checking things in before I really had figured out what was happening in ConfigParser, so half the check-ins are repairing the other half. It's case-sensitive *apart* from reading in the file. =Tony Meyer From tameyer at ihug.co.nz Mon Dec 15 19:29:30 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 15 19:29:38 2003 Subject: [spambayes-dev] sb_server UI error In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C085C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677765@its-xchg4.massey.ac.nz> [Kenny] > I don't have a setup to test ImapUI.py, but that works for ProxyUI.py. You can test ImapUI.py without actually having an IMAP connection. Just run "sb_imapfilter.py -b" and go to the config page. It'll work, anyway. I've checked these in. 
[Skip] > (Deprecated options should probably not be offered in even the > advanced options page, right?) +1 here. > I suppose you could make a case for leaving the options on the > config page for a release or two so users can see that they have > been deprecated. > Don't know if anyone is more likely to see it there than in > the logs, though. I think leaving it there is just asking for someone to set it, and that any "x-" (experimental *or* deprecated) option shouldn't be exposed via the regular config pages. (OTOH, I have the basis of a web interface for timcv.py which exposes *only* those options). I think we need some other way of presenting the warnings. Even just in the status panel of the home web interface page would be better than only in the logs. If someone wanted to put in the effort, the tray app could also put up a little window that pointed out that important messages were on that page. =Tony Meyer From skip at pobox.com Mon Dec 15 20:02:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 15 20:02:47 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13047C0815@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz> Message-ID: <16350.22965.664272.622282@montanaro.dyndns.org> >> Tony checked in a bunch of changes, but I suspect it's still >> case-sensitive. Tony> I made the mistake of checking things in before I really had Tony> figured out what was happening in ConfigParser... Join the club. 
:-) Skip From tim.one at comcast.net Mon Dec 15 21:23:07 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 15 21:23:15 2003 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes Options.py, 1.90, 1.91 UserInterface.py, 1.35, 1.36 classifier.py, 1.11, 1.12 In-Reply-To: Message-ID: > *** UserInterface.py 11 Dec 2003 18:44:23 -0000 1.35 > --- UserInterface.py 16 Dec 2003 02:03:31 -0000 1.36 > *************** > *** 306,309 **** > --- 306,313 ---- > for tok in tokens: > clues.append((tok, None)) > + # Need to regenerate the tokens (is there a way to > + # 'rewind' or copy a generator? Would that be > + # more effecient? > + tokens = tokenizer.tokenize(message) > probability = self.classifier.spamprob(tokens) > cluesTable = self._fillCluesTable(clues) Change the first line of the function to: tokens = list(tokenizer.tokenize(message)) There's no need to tokenize again, then. The construction of clues can be the one-liner: clues = [(tok, None) for tok in tokens] From tameyer at ihug.co.nz Mon Dec 15 21:48:35 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 15 21:48:41 2003 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayesOptions.py, 1.90, 1.91 UserInterface.py, 1.35, 1.36 classifier.py, 1.11, 1.12 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0976@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0B@its-xchg4.massey.ac.nz> [...] > + # Need to regenerate the tokens (is there a way to > + # 'rewind' or copy a generator? Would that be > + # more efficient? [...] > Change the first line of the function to: > > tokens = list(tokenizer.tokenize(message)) > > There's no need to tokenize again, then. The construction of > clues can be the one-liner: > > clues = [(tok, None) for tok in tokens] Thanks. The _getclues call later was expecting a generator rather than a list, but I've fixed that too, and the code is nicer now, I think. I've checked this in. 
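A small sketch of why the list() call matters: a generator can be walked only once, so building the clues from it leaves nothing for the classifier. The tokenize and spamprob functions below are hypothetical stand-ins, not the SpamBayes tokenizer or classifier.

```python
def tokenize(message):
    """Stand-in tokenizer (hypothetical): yields whitespace-separated words."""
    for word in message.split():
        yield word


def spamprob(tokens):
    """Stand-in scorer (hypothetical): fraction of tokens equal to 'free'."""
    tokens = list(tokens)
    if not tokens:
        return 0.5
    return sum(1 for t in tokens if t == "free") / float(len(tokens))


message = "free money free lunch"

# Materialize the generator once so it can be walked more than once.
tokens = list(tokenize(message))
clues = [(tok, None) for tok in tokens]
probability = spamprob(tokens)

# For contrast: a bare generator is exhausted after the first pass.
gen = tokenize(message)
first_pass = [(tok, None) for tok in gen]
second_pass = list(gen)  # nothing left to yield
```

The cost is holding every token in memory at once, which is presumably acceptable for a single message shown in the UI.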
=Tony Meyer From tameyer at ihug.co.nz Mon Dec 15 22:05:18 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 15 22:07:53 2003 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C07AF@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677769@its-xchg4.massey.ac.nz> > Oops... forwarded the wrong message. This is the one I was > thinking of. > This seems severe, and I've not seen this problem pop up in the list > before. I don't know how to respond. I thought the same thing, although I'm not entirely convinced that it was the SpamBayes installer that did this. Also, whenever I've had a corrupted profile, I've had to dump the entire profile, which this guy obviously hasn't. Presumably rolling back the registry would fix it. If it's actually a problem with Outlook, not the Windows profile (which seems more likely), then Outlook's detect and repair should fix it. Does everything work apart from Outlook? If so, it seems highly unlikely that it's the Windows profile that is corrupt. If not, what is it that fails? As for the instructions to install the SpamBayes Outlook plug-in: 1. Download the installer. 2. Double-click the installer. 3. Go through the installer prompts. 4. You're done. =Tony Meyer From tim.one at comcast.net Mon Dec 15 23:37:49 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 15 23:37:53 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > ... > I made the mistake of checking things in before I really had figured > out what was happening in ConfigParser, so half the check-ins are > repairing the other half. It's case-sensitive *apart* from reading > in the file. 
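The asymmetry described in the quoted text (option names folded to lower case when the file is read, but not on later lookups) can be sketched as follows. This is an illustration under assumptions: it uses Python 3's configparser module, the renamed successor of the ConfigParser module discussed in the thread, and the Options class is a hypothetical toy, not OptionsClass.py.

```python
import configparser

# Hypothetical .ini fragment; the option is spelled with mixed case on purpose.
INI_TEXT = """
[Tokenizer]
X-Extract_Dow = True
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Option names are lowercased when the file is read...
stored_names = parser.options("Tokenizer")
# ...and lookups through the parser are case-insensitive as well.
value = parser.get("Tokenizer", "X-EXTRACT_DOW")
# Section names are left alone, so they stay case-sensitive.
has_lower_section = parser.has_section("tokenizer")


class Options(object):
    """Toy option store (hypothetical) keyed by (section, option)."""

    def __init__(self):
        self._options = {}

    def set(self, sect, opt, value):
        self._options[sect, opt.lower()] = value

    def get(self, sect, opt):
        # Normalize here so callers may use any case for the option name.
        return self._options[sect, opt.lower()]


options = Options()
for sect in parser.sections():
    for opt in parser.options(sect):
        options.set(sect, opt, parser.get(sect, opt))
```

Normalizing in both set() and get() is what makes the store insensitive to case regardless of where the name came from; normalizing only at file-read time gives the mixed behaviour described above.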
I don't follow the distinctions being made here, but that's OK because nobody should have to: option names were intended to be case-insensitive, regardless of context. Whether they're read from .py files, or from .ini files, or passed as arguments -- all should act the same way. As things were when I last checked something in, I left this comment in Options.py, because it explained the truth of it at that time: # XXX The "x-" prefix can't be "X-" instead, else it's considered # XXX an invalid option instead of a deprecated one. That behavior # XXX doesn't match the OptionsClass comments. ("x-experimental_ham_spam_imbalance_adjustment", ... It "shouldn't" make any difference there either (but it did make a difference) whether that's spelled x-experimental_ham_spam_imbalance_adjustment or X-experimental_ham_spam_imbalance_adjustment or x-ExPeRiMeTtAl_HaM_sPaM_iMbaLaNcE_aDjUsYmEnT etc. From tameyer at ihug.co.nz Mon Dec 15 23:40:34 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Dec 15 23:40:43 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C09B4@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467776C@its-xchg4.massey.ac.nz> > I don't follow the distinctions being made here, but that's OK because > nobody should have to: option names were intended to be > case-insensitive, regardless of context. Whether they're > read from .py files, or from .ini files, or passed as arguments -- all > should act the same way. This is not the case at the moment, but I'll check in some changes in a minute to make it so. =Tony Meyer From tim at fourstonesExpressions.com Mon Dec 15 23:42:06 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Mon Dec 15 23:42:14 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: References: Message-ID: On Mon, 15 Dec 2003 23:37:49 -0500, Tim Peters wrote: > It "shouldn't" make any difference there either (but it did make a > difference) whether that's spelled > > x-ExPeRiMeTtAl_HaM_sPaM_iMbaLaNcE_aDjUsYmEnT I suspect that the reason this one didn't work as advertised is because "adjusyment" isn't an actual option -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From tim at fourstonesExpressions.com Mon Dec 15 23:54:14 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Mon Dec 15 23:54:23 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayes OptionsClass.py, 1.19, 1.20 In-Reply-To: References: Message-ID: Watch the world reel now On Mon, 15 Dec 2003 20:48:31 -0800, Tony Meyer wrote: > Update of /cvsroot/spambayes/spambayes/spambayes > In directory sc8-pr-cvs1:/tmp/cvs-serv9453/spambayes > > Modified Files: > OptionsClass.py > Log Message: > Option names are always case insensitive, no matter what. > > Index: OptionsClass.py > =================================================================== > RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v > retrieving revision 1.19 > retrieving revision 1.20 > diff -C2 -d -r1.19 -r1.20 > *** OptionsClass.py 15 Dec 2003 09:20:33 -0000 1.19 > --- OptionsClass.py 16 Dec 2003 04:48:28 -0000 1.20 > *************** > *** 552,586 **** > def display_name(self, sect, opt): > '''A name for the option suitable for display to a user.''' > ! return self._options[sect, opt].display_name() > def default(self, sect, opt): > '''The default value for the option.''' > ! return self._options[sect, opt].default() > def doc(self, sect, opt): > '''Documentation for the option.''' > ! return self._options[sect, opt].doc() > def valid_input(self, sect, opt): > '''Valid values for the option.''' > !
return self._options[sect, opt].valid_input() > def no_restore(self, sect, opt): > '''Do not restore this option when restoring to defaults.''' > ! return self._options[sect, opt].no_restore() > def is_valid(self, sect, opt, value): > '''Check if this is a valid value for this option.''' > ! return self._options[sect, opt].is_valid(value) > def multiple_values_allowed(self, sect, opt): > '''Multiple values are allowed for this option.''' > ! return self._options[sect, opt].multiple_values_allowed() > > def is_boolean(self, sect, opt): > '''The option is a boolean value. (Support for Python 2.2).''' > ! return self._options[sect, opt].is_boolean() > > def convert(self, sect, opt, value): > '''Convert value from a string to the appropriate type.''' > ! return self._options[sect, opt].convert(value) > > def unconvert(self, sect, opt): > '''Convert value from the appropriate type to a string.''' > ! return self._options[sect, opt].unconvert() > > def get_option(self, sect, opt): > --- 552,586 ---- > def display_name(self, sect, opt): > '''A name for the option suitable for display to a user.''' > ! return self._options[sect, opt.lower()].display_name() > def default(self, sect, opt): > '''The default value for the option.''' > ! return self._options[sect, opt.lower()].default() > def doc(self, sect, opt): > '''Documentation for the option.''' > ! return self._options[sect, opt.lower()].doc() > def valid_input(self, sect, opt): > '''Valid values for the option.''' > ! return self._options[sect, opt.lower()].valid_input() > def no_restore(self, sect, opt): > '''Do not restore this option when restoring to defaults.''' > ! return self._options[sect, opt.lower()].no_restore() > def is_valid(self, sect, opt, value): > '''Check if this is a valid value for this option.''' > ! return self._options[sect, opt.lower()].is_valid(value) > def multiple_values_allowed(self, sect, opt): > '''Multiple values are allowed for this option.''' > ! 
return self._options[sect, > opt.lower()].multiple_values_allowed() > > def is_boolean(self, sect, opt): > '''The option is a boolean value. (Support for Python 2.2).''' > ! return self._options[sect, opt.lower()].is_boolean() > > def convert(self, sect, opt, value): > '''Convert value from a string to the appropriate type.''' > ! return self._options[sect, opt.lower()].convert(value) > > def unconvert(self, sect, opt): > '''Convert value from the appropriate type to a string.''' > ! return self._options[sect, opt.lower()].unconvert() > > def get_option(self, sect, opt): > *************** > *** 588,598 **** > if self.conversion_table.has_key((sect, opt)): > sect, opt = self.conversion_table[sect, opt] > ! return self._options[sect, opt] > > def get(self, sect, opt): > '''Get an option value.''' > ! if self.conversion_table.has_key((sect, opt)): > ! sect, opt = self.conversion_table[sect, opt] > ! return self.get_option(sect, opt).get() > > def __getitem__(self, key): > --- 588,598 ---- > if self.conversion_table.has_key((sect, opt)): > sect, opt = self.conversion_table[sect, opt] > ! return self._options[sect, opt.lower()] > > def get(self, sect, opt): > '''Get an option value.''' > ! if self.conversion_table.has_key((sect, opt.lower())): > ! sect, opt = self.conversion_table[sect, opt.lower()] > ! return self.get_option(sect, opt.lower()).get() > > def __getitem__(self, key): > *************** > *** 601,612 **** > def set(self, sect, opt, val=None): > '''Set an option.''' > ! if self.conversion_table.has_key((sect, opt)): > ! sect, opt = self.conversion_table[sect, opt] > if self.is_valid(sect, opt, val): > ! self._options[sect, opt].set(val) > else: > print >> sys.stderr, ("Attempted to set [%s] %s with > invalid" > " value %s (%s)" % > ! (sect, opt, val, type(val))) > > def set_from_cmdline(self, arg, stream=None): > --- 601,612 ---- > def set(self, sect, opt, val=None): > '''Set an option.''' > ! if self.conversion_table.has_key((sect, opt.lower())): > ! 
sect, opt = self.conversion_table[sect, opt.lower()] > if self.is_valid(sect, opt, val): > ! self._options[sect, opt.lower()].set(val) > else: > print >> sys.stderr, ("Attempted to set [%s] %s with > invalid" > " value %s (%s)" % > ! (sect, opt.lower(), val, type(val))) > > def set_from_cmdline(self, arg, stream=None): > *************** > *** 617,620 **** > --- 617,621 ---- > """ > sect, opt, val = arg.split(':', 2) > + opt = opt.lower() > try: > val = self.convert(sect, opt, val) > *************** > *** 716,720 **** > if section is not None and option is not None: > output.write(self._options[section, > ! option].as_nice_string(section)) > return output.getvalue() > > --- 717,721 ---- > if section is not None and option is not None: > output.write(self._options[section, > ! > option.lower()].as_nice_string(section)) > return output.getvalue() > > *************** > *** 724,728 **** > if section is not None and sect != section: > continue > ! output.write(self._options[sect, opt].as_nice_string(sect)) > return output.getvalue() > > --- 725,729 ---- > if section is not None and sect != section: > continue > ! output.write(self._options[sect, > opt.lower()].as_nice_string(sect)) > return output.getvalue() > > > > > _______________________________________________ > Spambayes-checkins mailing list > Spambayes-checkins@python.org > http://mail.python.org/mailman/listinfo/spambayes-checkins > -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From tim.one at comcast.net Mon Dec 15 23:55:51 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 15 23:55:55 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: Message-ID: >> It "shouldn't" make any difference there either (but it did make a >> difference) whether that's spelled >> >> x-ExPeRiMeTtAl_HaM_sPaM_iMbaLaNcE_aDjUsYmEnT [Tim Stone] > I suspect that the reason this one didn't work as advertised is > because "adjusyment" isn't an actual option Good eye! As I meant to say the first time, option names were meant to be case-insensitive regardless of context, *and* to be insensitive to any substitutions of the letters in "Timmy". So, for example, adjustment and adjusiieny are also the same option. Generalization to Unicode is left as an exercise. From tim at fourstonesExpressions.com Tue Dec 16 00:05:19 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Tue Dec 16 00:05:26 2003 Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment? In-Reply-To: References: Message-ID: On Mon, 15 Dec 2003 23:55:51 -0500, Tim Peters wrote: > Good eye! As I meant to say the first time, option names were meant to > be > case-insensitive regardless of context, *and* to be insensitive to any > substitutions of the letters in "Timmy". So, for example, Well, at least you didn't use the hated (and more redundant) "Timmie" form... but I suppose you suffered as much of that type of abuse as I did. > > adjustment > > and > > adjusiieny I wonder if there are encryption possibilities here... -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From skip at pobox.com Tue Dec 16 01:33:21 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 16 01:33:17 2003 Subject: [spambayes-dev] one bigram nit Message-ID: <16350.42801.651892.388851@montanaro.dyndns.org> I see one compatibility problem with the bigram stuff. We currently have a key in the database called 'saved state' which stores a tuple: (db version, spamcount, hamcount).
If that is ever generated as a bigram the database will get hosed. If backwards compatibility is an issue you might want to choose a different bigram connector than ' '. If backwards compatibility isn't a big deal, I'd bump the PICKLE_VERSION value and choose another value for the state key, probably a non-string object. Skip From kennypitt at hotmail.com Tue Dec 16 10:17:16 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Dec 16 10:17:56 2003 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes OptionsClass.py, 1.19, 1.20 In-Reply-To: Message-ID: I was looking at this same code a little bit yesterday, and one thing struck me as odd. The get(), get_option(), and set() functions use self.conversion_table to translate the requested option, but many of the other functions such as display_name() don't. Is there a reason for that? If not, I was wondering if it would be easier to just fix get_option() for case-insensitivity and then have all the other getter functions call it instead of accessing self._options directly. Also note that the self.conversion_table translation in get() is already redundant, as it then calls get_option() which will do the exact same translation. Tony Meyer wrote: > Update of /cvsroot/spambayes/spambayes/spambayes > In directory sc8-pr-cvs1:/tmp/cvs-serv9453/spambayes > > Modified Files: > OptionsClass.py > Log Message: > Option names are always case insensitive, no matter what. > > Index: OptionsClass.py > =================================================================== > RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v > retrieving revision 1.19 > retrieving revision 1.20 > diff -C2 -d -r1.19 -r1.20 > *** OptionsClass.py 15 Dec 2003 09:20:33 -0000 1.19 > --- OptionsClass.py 16 Dec 2003 04:48:28 -0000 1.20 > *************** > *** 552,586 **** > def display_name(self, sect, opt): > '''A name for the option suitable for display to a user.''' > ! 
return self._options[sect, opt].display_name() >[snip] > def get_option(self, sect, opt): > --- 552,586 ---- > def display_name(self, sect, opt): > '''A name for the option suitable for display to a user.''' > ! return self._options[sect, opt.lower()].display_name() >[snip] > def get_option(self, sect, opt): > *************** > *** 588,598 **** > if self.conversion_table.has_key((sect, opt)): > sect, opt = self.conversion_table[sect, opt] > ! return self._options[sect, opt] > > def get(self, sect, opt): > '''Get an option value.''' > ! if self.conversion_table.has_key((sect, opt)): > ! sect, opt = self.conversion_table[sect, opt] > ! return self.get_option(sect, opt).get() > > def __getitem__(self, key): > --- 588,598 ---- > if self.conversion_table.has_key((sect, opt)): > sect, opt = self.conversion_table[sect, opt] > ! return self._options[sect, opt.lower()] > > def get(self, sect, opt): > '''Get an option value.''' > ! if self.conversion_table.has_key((sect, opt.lower())): > ! sect, opt = self.conversion_table[sect, opt.lower()] > ! return self.get_option(sect, opt.lower()).get() > > def __getitem__(self, key): > *************** > *** 601,612 **** > def set(self, sect, opt, val=None): > '''Set an option.''' > ! if self.conversion_table.has_key((sect, opt)): > ! sect, opt = self.conversion_table[sect, opt] > if self.is_valid(sect, opt, val): > ! self._options[sect, opt].set(val) > else: > print >> sys.stderr, ("Attempted to set [%s] %s with > invalid" " value %s (%s)" % > ! (sect, opt, val, type(val))) > > def set_from_cmdline(self, arg, stream=None): > --- 601,612 ---- > def set(self, sect, opt, val=None): > '''Set an option.''' > ! if self.conversion_table.has_key((sect, opt.lower())): > ! sect, opt = self.conversion_table[sect, opt.lower()] > if self.is_valid(sect, opt, val): > ! self._options[sect, opt.lower()].set(val) > else: > print >> sys.stderr, ("Attempted to set [%s] %s with > invalid" " value %s (%s)" % > ! 
(sect, opt.lower(), val, > type(val))) > > def set_from_cmdline(self, arg, stream=None): -- Kenny Pitt From tim.one at comcast.net Tue Dec 16 12:10:32 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 16 12:10:43 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk email folder. In-Reply-To: <000601c3c3f3$ff64ef20$1014a8c0@station16> Message-ID: [from the spambayes list] > We use Spambayes in my company with great success, and have come > across only one bug, which I have not found listed. Since this has > happened to all three of us using Spambayes, I was surprised to not > find it in the troubleshooting guide. > > After the user accidentally deleted the Junk email folder or the Junk > Suspect folder, I created new ones, but Spambayes would not filter to > them. > ... I wonder whether the Outlook addin should stop trying to remember Outlook's internal folder IDs, remember the user-visible string paths instead, and enumerate the folders to (re)discover the internal Outlook IDs "whenever anything may have changed". It's hard to explain that creating a folder with the same name in the same place doesn't create a folder with the same name in the same place . From tim at fourstonesExpressions.com Tue Dec 16 12:19:44 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Tue Dec 16 12:19:50 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk email folder. In-Reply-To: References: Message-ID: On Tue, 16 Dec 2003 12:10:32 -0500, Tim Peters wrote: > I wonder whether the Outlook addin should stop trying to remember > Outlook's > internal folder IDs, remember the user-visible string paths instead, and > enumerate the folders to (re)discover the internal Outlook IDs "whenever > anything may have changed". It's hard to explain that creating a folder > with the same name in the same place doesn't create a folder with the > same > name in the same place . This problem seems to crop up a LOT. 
I don't know if it's possible to do what you say, but I think this is gonna continue to be an achilles heel for the plugin unless we do *something* about it... -- Vous exprimer; Expr?sese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From popiel at wolfskeep.com Tue Dec 16 12:41:48 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Dec 16 12:41:52 2003 Subject: [spambayes-dev] one bigram nit In-Reply-To: Message from Skip Montanaro of "Tue, 16 Dec 2003 00:33:21 CST." <16350.42801.651892.388851@montanaro.dyndns.org> References: <16350.42801.651892.388851@montanaro.dyndns.org> Message-ID: <20031216174148.B737B2DF7F@cashew.wolfskeep.com> In message: <16350.42801.651892.388851@montanaro.dyndns.org> Skip Montanaro writes: > >I see one compatibility problem with the bigram stuff. We currently have a >key in the database called 'saved state' which stores a tuple: (db version, >spamcount, hamcount). If that is ever generated as a bigram the database >will get hosed. If backwards compatibility is an issue you might want to >choose a different bigram connector than ' '. If backwards compatibility >isn't a big deal, I'd bump the PICKLE_VERSION value and choose another value >for the state key, probably a non-string object. I'd actually take a different approach: we should prefix all "natural" tokens (defined elsewhere as those tokens generated by the whitespace split over the message body) with "body:", so that text in the body cannot conflict with our synthetic tokens of any flavor. As it stands, I think that the words url:python and url:org would get confused with parts of http://python.org, just because we don't have any protections for naturals aliasing synthetics... Backwards compatibility is overrated; retraining is easy. 
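The collision Skip describes, and the namespacing proposed here, can be sketched roughly as follows. The helper names body_tokens and bigrams are hypothetical, and the bigram scheme (adjacent tokens joined by a single space) is a simplification assumed for illustration; only the 'saved state' key and the ' ' connector come from the thread.

```python
STATE_KEY = "saved state"  # reserved database key named in Skip's message


def body_tokens(text):
    """Yield whitespace-split words from the body, each prefixed with 'body:'."""
    for word in text.split():
        yield "body:" + word


def bigrams(tokens, connector=" "):
    """Yield adjacent token pairs joined by connector (a simplified scheme)."""
    tokens = list(tokens)
    for left, right in zip(tokens, tokens[1:]):
        yield left + connector + right


text = "your saved state url:python"

# Without a prefix, a bigram built from raw words can equal the reserved key.
raw_bigrams = list(bigrams(text.split()))
collides_raw = STATE_KEY in raw_bigrams

# With the prefix, neither the unigrams nor the bigrams can match it, and a
# body word that merely looks like a synthetic token stays distinguishable.
prefixed = list(body_tokens(text))
prefixed_bigrams = list(bigrams(prefixed))
collides_prefixed = STATE_KEY in prefixed or STATE_KEY in prefixed_bigrams
```

The trade-off is the one already stated above: every existing token in a trained database would change spelling, so a prefix like this implies retraining.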
- Alex From tim at fourstonesExpressions.com Tue Dec 16 12:42:56 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Tue Dec 16 12:43:04 2003 Subject: [spambayes-dev] Fwd: Re: [Spambayes] RE: Spambayes Digest, Vol 64, Issue 68 In-Reply-To: References: Message-ID: Gosh, I still get goofed up between sending to spambayes and spambayes-dev... I'm livin in the past... ------- Forwarded message ------- From: Tim Stone To: akiva@atwood.co.il, spambayes@python.org Subject: Re: [Spambayes] RE: Spambayes Digest, Vol 64, Issue 68 Date: Tue, 16 Dec 2003 11:29:26 -0600 > On Tue, 16 Dec 2003 19:20:55 +0200, Akiva Atwood > wrote: > >>> Is anyone else having problems with these types of spams recently? Has >>> some prolific spammer changed tactics? Most of the one's I've seen seem >>> to originate from Australia or Asia. >> >> I've been getting a lot of them. I thought there was a problem with MY >> filter, and reinstalled it. > > This might be well dealt with by changing the unknown word probability > to indicate a stronger spamminess. By default, it's .5, iirc. Perhaps > we should do some experiments with pushing it to .6 or .7. My corpus > has virtually none of these spams, so I can't say what would happen, and > I imagine that our test corpus has relatively few of them as well. > Comments anyone? > -- Vous exprimer; Expr?sese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From tim at fourstonesExpressions.com Tue Dec 16 14:58:17 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Tue Dec 16 14:58:29 2003 Subject: [spambayes-dev] Who wants to pretend to be a spammer? Message-ID: How about this for a "testing" regimen... one of us can send a known list of spambayes users a series of "spams," with the idea being to see how many of them can get through existing databases, and how long it takes their databases to learn to correctly classify them? 
Would that be an interesting exercise?

--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From kennypitt at hotmail.com Tue Dec 16 15:13:11 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 16 15:13:49 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer?
In-Reply-To:
Message-ID:

Tim Stone wrote:
> How about this for a "testing" regimen... one of us can send a known
> list of spambayes users a series of "spams," with the idea being to
> see how many of them can get through existing databases, and how long
> it takes their databases to learn to correctly classify them? Would
> that be an interesting exercise?

Interesting idea, but wouldn't it be tricky to make your pseudo-spams representative of real-world spam patterns? For example, it seems like whatever e-mail address and/or SMTP server you use to send the messages would quickly become a significant spam clue.

--
Kenny Pitt

From tim at fourstonesExpressions.com Tue Dec 16 15:18:47 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 16 15:19:45 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer?
In-Reply-To:
References:
Message-ID:

On Tue, 16 Dec 2003 15:13:11 -0500, Kenny Pitt wrote:

> Tim Stone wrote:
>> How about this for a "testing" regimen... one of us can send a known
>> list of spambayes users a series of "spams," with the idea being to
>> see how many of them can get through existing databases, and how long
>> it takes their databases to learn to correctly classify them? Would
>> that be an interesting exercise?
>
> Interesting idea, but wouldn't it be tricky to make your pseudo-spams
> representative of real-world spam patterns? For example, it seems like
> whatever e-mail address and/or SMTP server you use to send the messages
> would quickly become a significant spam clue.

Yeah, those could be some challenges.
I'm not convinced of the usefulness of the idea, but it *could* give us a leg up on spam as it evolves. I dunno, maybe it can't evolve fast enough to fool us for long, but...

--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From richie at entrian.com Tue Dec 16 15:28:31 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 16 15:28:39 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To:
References:
Message-ID:

[Tim]
> It would help if you tried 2.3.3c1.

Your code works under 2.3.3c1, but still lists 304 (yes, 304 - just under a year until I'm clear 8-) broken tokens.

[Tim]
> Best I can suggest is studying Python's bsddb3 substantial test suite.

[Barry]
> you can look at ZODB's BerkeleyDB based storage code which is a good
> working example of a full-on transactional BerkeleyDB application.

Thanks guys - if and when I get the chance, I'll have a look.

Unless there's a Python-savvy lurker out there who'd like to take on a smallish, fairly well-spec'd and potentially very important SpamBayes development task? IMHO this is the only bug that's preventing sb_server from entering beta status.

--
Richie Hindle
richie@entrian.com

From barry at python.org Tue Dec 16 15:41:44 2003
From: barry at python.org (Barry Warsaw)
Date: Tue Dec 16 15:41:46 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To:
References:
Message-ID: <1071607304.7979.39.camel@geddy>

On Tue, 2003-12-16 at 15:28, Richie Hindle wrote:

> Unless there's a Python-savvy lurker out there who'd like to take on a
> smallish, fairly well-spec'd and potentially very important SpamBayes
> development task? IMHO this is the only bug that's preventing sb_server
> from entering beta status.

I really wish I had the time. But I'll help play "consultant" on BerkeleyDB stuff.
-Barry

From popiel at wolfskeep.com Tue Dec 16 15:46:08 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 16 15:46:12 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer?
In-Reply-To: Message from Tim Stone of "Tue, 16 Dec 2003 14:18:47 CST."
References:
Message-ID: <20031216204608.C52152DF7F@cashew.wolfskeep.com>

In message: Tim Stone writes:
>On Tue, 16 Dec 2003 15:13:11 -0500, Kenny Pitt
>wrote:
>
>> Interesting idea, but wouldn't it be tricky to make your pseudo-spams
>> representative of real-world spam patterns? For example, it seems like
>> whatever e-mail address and/or SMTP server you use to send the messages
>> would quickly become a significant spam clue.
>
>Yeah, those could be some challenges. I'm not convinced of the
>usefulness of the idea, but it *could* give us a leg up on spam as it
>evolves. I dunno, maybe it can't evolve fast enough to fool us for long,
>but...

Those would be the same challenges that the initial testing had with the multi-source corpora (where significant spam all came from one source and significant ham all came from a different place)... which is why headers were almost completely ignored for the first six months or so of development. A good first approximation of returning to that would be to turn off all the from/to/received/msgid header parsing.

Responding to the idea (someone emulating a spammer): wouldn't it be easier to just distribute a corpus of spam, and have people grab it and test it against their databases?

- Alex

From richie at entrian.com Tue Dec 16 17:01:47 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 16 17:02:00 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071607304.7979.39.camel@geddy>
References: <1071607304.7979.39.camel@geddy>
Message-ID:

[Barry]
> I really wish I had the time. But I'll help play "consultant" on
> BerkeleyDB stuff.

If I find the time for this project, I might just make you regret saying that.
8-) -- Richie Hindle richie@entrian.com From tim.one at comcast.net Tue Dec 16 21:40:53 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 16 21:41:00 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Tim, on the spambayes list, about x-use_bigrams in CVS] > I see that it's a cruder approximation to the suggested scoring > algorithm (which I implemented at one time). For example ... I checked in the intended implementation. Here's the checkin comment: Implemented the intended "tiling" version of x-use_bigrams. Tried to restore most of the speed lost when this option *isn't* in use. Will add comments later. Anyone using x-use_bigrams needs to retrain: synthesized bigrams now begin with a "bi:" prefix. Skip, that last point addresses your (good!) concern about ambiguity wrt the special 'saved state' key. Here's what I've found so far. My main personal database is currently trained on 474 ham and 489 spam, using mostly mistake-and-unsure-based training, with a spam cutoff of 95 and a ham cutoff of 4 (yup, those are extreme -- I've been experimenting). Database size (a bsddb3 hash database): without x-use_bigrams 2,544KB with x-use_bigrams 10,288KB That's a major size boost, and (of course) is expected (bigrams create fat hapaxes at a prodigious rate). There's no reason to suppose that the selection of training ham and spam based on mistake-and-unsure training from a unigram-only classifier makes much sense for a mixed uni+bi-gram classifier; to the contrary, the latter almost certainly has different strengths and weaknesses. An example of that is the highest scoring ham in my inbox. Because I had previously put copies of some of those into my ham training data, back when my ham cutoff was 20, without x-use_bigrams no message in my inbox today scores above 20. 
These are the worst:

    6 6 6 7 7 7 7 8 8 8 9 9 9 12 13 13 14 16

After retraining on the same training sets with x-use_bigrams, then rescoring my inbox, the highest-scoring ham in my inbox are worse:

    7 8 8 9 10 12 13 13 13 13 16 22 25 31 34 38 45 49

I'm confident that this is an artifact of using training sets based on picking on the weakest performance of a different scoring strategy, and that had I been using train-on-everything all along, that result would have been very different.

There's an interesting example in the other direction too: the last time I started over from scratch, I left one Unsure in my Unsure folder, and have kept it there ever since. It's a long and chatty spam, about a topic I even have some interest in (no, my wang already has carpet burns), and I wanted to see how mistake-based training changed its score over time. It drifted slowly upward all along, from the low 40s to the low 80s. Under x-use_bigrams, though, the score zoomed to 95.34. The difference is high-scoring bigrams that appeared in a few other spam:

    'bi:any questions,'  0.908163  0   2
    'bi:website at:'     0.908163  0   2
    'bi:visit our'       0.931987  1  17
    'bi:create your'     0.934783  0   3
    'bi:than years'      0.934783  0   3

"than years" is a peculiar one, eh?! The original text was

    ... more than 30 years ago ...

and we skipped "30" because it's shorter than 3 characters.

So, conclusions for now:

+ x-use_bigrams is going to bloat your database bigtime.

+ If you use train-on-everything, and want to try it, no problem.

+ If you're doing mistake-based training and want to try it, probably best to start over from scratch.

+ I believe that mistake-based training under this method is likely to be substantially more brittle than mistake-based training under the (still default) unigram-only scheme, because it's even more hapax-driven (synthesizing bigrams creates many more hapaxes).

+ OTOH, bigrams are better at recognizing the language of advertising.
For example, "bi:website at:" is more clearly a "call to action" than either "website" or "at:". From hooft at o2w.nl Wed Dec 17 00:35:43 2003 From: hooft at o2w.nl (hooft@o2w.nl) Date: Wed Dec 17 00:35:48 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <43800.80.126.9.240.1071639343.squirrel@secure.o2w.nl> > [Kenny Pitt] >> I get the same results as Tim using the 2.3.2 final version: Python >> 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on >> win32 >> >> In my 2.3.2 lib, the "return self.db[key]" line in __getitem__ is on >> line 116 of __init__.py, not line 86 as in Richie's traceback. I >> could expect some changes between Python 2.3 and 2.3.2, but 30 lines >> seems a bit much between minor bugfix releases. Is that possibly an >> indicator of a bsddb version mismatch? > > It's more an indicator of bugs in 2.3's bsddb support. __init__.py was > at rev 1.5 in the 2.3 release, and is at rev 1.12(!) today: > > http://cvs.sf.net/viewcvs.py/python/python/dist/src/Lib/bsddb/__init__.py > > I see that support for the iterator and mapping protocols wasn't added > until rev 1.6, which is why they don't work for Richie in 2.3 final. Imagine the people like me that are using Python2.2 on the systems of their ISPs: Python 2.2.2 (#1, Oct 26 2002, 20:34:17) [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-108.7.2)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import shelve >>> d=shelve.open('.hammiedb') >>> [k for k in d] Traceback (most recent call last): File "", line 1, in ? 
  File "/usr/local/lib/python2.2/shelve.py", line 70, in __getitem__
    f = StringIO(self.dict[key])
TypeError: key type must be string
>>>

Regards,

Rob Hooft

From jtech at hyperionmail.com Wed Dec 17 00:57:42 2003
From: jtech at hyperionmail.com (My Tech)
Date: Wed Dec 17 00:58:13 2003
Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677769@its-xchg4.massey.ac.nz>
Message-ID: <005301c3c462$b01cc360$1e02a8c0@JDi8000>

Hi Guys -

Just to provide some more detail (but unfortunately I don't think it will help solve the mystery of what happened)...

First, I ran Outlook's detect and repair. No change.

Second, I uninstalled and reinstalled my Office XP application (which includes Outlook). No change.

Third, I created a new profile in Outlook to see if that would make a difference. Nope. Still no change.

Fourth, I created a new Windows profile/user with administrative rights. Doing this, Outlook opened without a problem. Didn't re-install SpamBayes for fear of making a bad problem worse.

I can't roll back the registry because I'm running Windows 2000, not XP (unless you know something about 2000 functionality that I don't.) Also, I was logged in as Administrator when I installed SpamBayes (and subsequently encountered the Outlook problem), so I couldn't dump this profile.

Everything else seems to work fine, except for Outlook, so my first guess was that it was an Outlook problem. However, considering that I tried Outlook detect & repair and then uninstalled/reinstalled Office XP with no resulting change, I'm left to conclude that it's a Windows profile problem.

I'm really at a loss to know what to do, outside of reinstalling the OS and all of my software (a long and tedious process that I'm not looking forward to). Any further advice would be greatly appreciated.

Thanks.
-----Original Message----- From: Tony Meyer [mailto:tameyer@ihug.co.nz] Sent: Monday, December 15, 2003 10:05 PM To: 'Tim Stone'; spambayes-dev@python.org Cc: jtech@hyperionmail.com Subject: RE: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile > Oops... forwarded the wrong message. This is the one I was > thinking of. > This seems severe, and I've not seen this problem pop up in the list > before. I don't know how to respond. I thought the same thing, although I'm not entirely convinced that it was the SpamBayes installer that did this. Also, whenever I've had a corrupted profile, I've had to dump the entire profile, which this guy obviously hasn't. Presumably rolling back the registry would fix it. If it's actually a problem with Outlook, not the Windows profile (which seems more likely), then Outlook's detect and repair should fix it. Does everything work apart from Outlook? If so, it seems highly unlikely that it's the Windows profile that is corrupt. If not, what is it that fails? As for the instructions to install the SpamBayes Outlook plug-in: 1. Download the installer. 2. Double-click the installer. 3. Go through the installer prompts. 4. You're done. =Tony Meyer From tim.one at comcast.net Wed Dec 17 01:15:50 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 01:15:52 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <43800.80.126.9.240.1071639343.squirrel@secure.o2w.nl> Message-ID: [hooft@o2w.nl] > Imagine the people like me that are using Python2.2 on the systems of > their ISPs: > Python 2.2.2 (#1, Oct 26 2002, 20:34:17) > [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-108.7.2)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import shelve > >>> d=shelve.open('.hammiedb') > >>> [k for k in d] > Traceback (most recent call last): > File "", line 1, in ? 
> File "/usr/local/lib/python2.2/shelve.py", line 70, in __getitem__ > f = StringIO(self.dict[key]) > TypeError: key type must be string > >>> OK, I did. Now what ? From mhammond at skippinet.com.au Wed Dec 17 01:45:41 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Dec 17 01:46:00 2003 Subject: [spambayes-dev] pop3proxy_tray icons In-Reply-To: Message-ID: <002801c3c469$66671e30$2c00a8c0@eden> Better late than never :) > I don't know if anyone else has noticed this or not, but on my Windows > 2000 system the green and red circles in the current pop3proxy_tray > icons are very difficult to make out. I created the attached > icons as a > possible alternative. They are basic 16-color icons and show up quite > nicely on both Windows 2000 and Windows XP. I agree. > The attached patch is also required because the LoadImage > calls pass 0,0 > for the icon size. That loads the icon using the default 32x32 size, > scaling a 16x16 icon up to 32x32 if necessary. Since icons > in the tray > are only 16x16, they then get scaled back down when displayed > and still > end up looking bad. Excellent! > I also attached an alternate sbicon that I created in the > spirit of the > icons in the Web UI. It uses the envelope icon from the > Wingdings font > with the same blue outline color used in the UI icons. I modified my > py2exe\setup_all.py to use this as the icon for all the > generated exe's. I haven't done that :) I've checked it all in. Thanks, Mark. From anthony at interlink.com.au Wed Dec 17 01:47:13 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Wed Dec 17 01:47:28 2003 Subject: [spambayes-dev] Re: Auto-response for your message to the "Spambayes" mailing list In-Reply-To: Message-ID: <200312170647.hBH6lDCV008087@localhost.localdomain> A whole bunch of the header lines in the spambayes autoresponse are being included on the line with the header. Can someone fix, or else do whatever's necessary for me to be able to fix it? 
>>> spambayes-bounces@python.org wrote > READ THIS! (If you want help.) > > This is an automated response to an email message you sent to the > spambayes@python.org mailing list. Please read this message carefully > to see if it answers your question(s). > > > Before you do anything else: ---------------------------- > > Before asking a question on the list, please take a moment and check > the frequently asked questions page: > > http://spambayes.sourceforge.net/faq.html > > > What is Spambayes? ------------------ > [snip] From tameyer at ihug.co.nz Wed Dec 17 04:21:45 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 17 04:21:53 2003 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/testtools urlslurper.py, 1.6, NONE In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0D79@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677770@its-xchg4.massey.ac.nz> Opps. The comment window had scrolled down and I didn't notice. Only the last line should be there in the comments for this. > -----Original Message----- > From: spambayes-checkins-bounces@python.org > [mailto:spambayes-checkins-bounces@python.org] On Behalf Of Tony Meyer > Sent: Wednesday, 17 December 2003 10:17 p.m. > To: spambayes-checkins@python.org > Subject: [Spambayes-checkins] spambayes/testtools > urlslurper.py,1.6,NONE > > > Update of /cvsroot/spambayes/spambayes/testtools > In directory sc8-pr-cvs1:/tmp/cvs-serv671/testtools > > Removed Files: > urlslurper.py > Log Message: > Add the basis of a new experimental (and highly debatable) > option to 'slurp' URLs. > > This is based on the urlslurper.py script in the testtools > directory, which in turn > was based on Richard Jowsey's URLSlurper.java. > > Basically, when the option is enabled, instead of just > tokenizing the URLs in a message, > we also retrieve the content at that address (if it's not > text, we ignore it). 
> > When classifying, if the message has a 'raw' score in the > unsure range, and if the > number of tokens is less than max_discriminators, and adding > these 'slurped' tokens > would push the message into the ham/spam range, then they are used. > > This isn't necessary anymore; use the experimental > URLRetriever options > instead. > > --- urlslurper.py DELETED --- > > > > _______________________________________________ > Spambayes-checkins mailing list > Spambayes-checkins@python.org > http://mail.python.org/mailman/listinfo/spambayes-checkins > From mhammond at skippinet.com.au Wed Dec 17 06:44:31 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Dec 17 06:44:48 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk emailfolder. In-Reply-To: Message-ID: <009e01c3c493$2489c0b0$2c00a8c0@eden> > I wonder whether the Outlook addin should stop trying to > remember Outlook's > internal folder IDs, remember the user-visible string paths > instead, and > enumerate the folders to (re)discover the internal Outlook > IDs "whenever > anything may have changed". I'm not sure what you had in mind for "anything may have changed", but in general, I agree. I always had the idea that we would also store the FQN, and fall back to that when necessary, making the folder ID more a "cached" value. It just never happened. It does get complex though - what happens when the user renames the folder? Before you know it, we have even more cruft that noone really understand why is there Another alternative would be to change things so that most errors re-displayed the config wizard. Of course, 0.81 has a bug in the config wizard that relates directly to deleted folders , but otherwise, it seems a reasonable approach. If the config wizard also detected "we are probably trained OK", and allowed you to continue without retraining (really just a checkbox and 20 LOC), that whole process should take under a minute. 
Either way, I'm going for a new combined binary before this even gets a look in Mark. From mhammond at skippinet.com.au Wed Dec 17 06:47:39 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Dec 17 06:47:54 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayesOptionsClass.py, 1.19, 1.20 In-Reply-To: Message-ID: <009f01c3c493$936963a0$2c00a8c0@eden> > Watch the world reel now > > On Mon, 15 Dec 2003 20:48:31 -0800, Tony Meyer > wrote: > > > Update of /cvsroot/spambayes/spambayes/spambayes > > In directory sc8-pr-cvs1:/tmp/cvs-serv9453/spambayes > > > > Modified Files: > > OptionsClass.py > > Log Message: > > Option names are always case insensitive, no matter what. Yay! I *nearly* did that quite some time ago, but was worried I would be (silently) accused of loosening reasonable code to handle my sloppy style. It also means a number of .lower() calls can be removed from Outlook! Entropy-catches-us-all ly, Mark. From kennypitt at hotmail.com Wed Dec 17 10:00:15 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Dec 17 10:00:55 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junkemailfolder. In-Reply-To: <009e01c3c493$2489c0b0$2c00a8c0@eden> Message-ID: Mark Hammond wrote: >> I wonder whether the Outlook addin should stop trying to remember >> Outlook's internal folder IDs, remember the user-visible string >> paths instead, and enumerate the folders to (re)discover the >> internal Outlook IDs "whenever anything may have changed". > > I'm not sure what you had in mind for "anything may have changed", > but in general, I agree. I always had the idea that we would also > store the FQN, and fall back to that when necessary, making the > folder ID more a "cached" value. It just never happened. It does > get complex though - what happens when the user renames the folder? 
> Before you know it, we have even more cruft that noone really
> understand why is there

One of the most common problems seems to be when the spam folder is actually still sitting under Deleted Items. The ID is unchanged so SpamBayes keeps moving the spam there and people think the messages are just disappearing. As a partial interim solution, could we check for this special case, i.e. if we successfully access the spam folder by ID but its parent folder is Deleted Items then move it back to the top level (or store the original FQN and move it back to there)?

Another possibility might be to attach an ItemAdd event handler to the Deleted Items folder and check for an item with the same ID as the spam folder. Does ItemAdd get called for added folders, or only for added items?

--
Kenny Pitt

From skip at pobox.com Wed Dec 17 11:10:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 11:11:06 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To:
References:
Message-ID: <16352.32783.360628.370482@montanaro.dyndns.org>

    Tim> Database size (a bsddb3 hash database):

    Tim>     without x-use_bigrams    2,544KB
    Tim>     with x-use_bigrams      10,288KB

    Tim> That's a major size boost, and (of course) is expected (bigrams
    Tim> create fat hapaxes at a prodigious rate).

I've been experimenting with the bigram stuff and like it so far. I also have some mods to the DBDictClassifier stuff which add timestamps (last set, last used) to the database. There's some interaction between the two which keeps me from using the two together. It may be worthwhile considering a last used timestamp to control the number of unused (or rarely used) tokens.

The first thing I did was retrain and then score my then current unsure mailbox. Out of about 40 messages it scored over half of them as spam with bigrams enabled. I then took my entire training database (around 140 spams and 100 hams) and tossed them into my unsure mailbox.
Using that now much bigger mailbox (about 280 messages), I then started with a fresh round of unsure+mistake based training. I got to roughly the same performance as without bigrams using a much smaller set of training messages. I'm currently at 97 spams and 64 hams. I'm still getting a fair number of unsures, but the false positive rate doesn't seem horrible (I've seen a few, but haven't been counting). Tim> + I believe that mistake-based training under this method is likely Tim> to be substantially more brittle than mistake-based training Tim> under the (still default) unigram-only scheme, because it's even Tim> more hapax-driven (synthesizing bigrams creates many more Tim> hapaxes). As I was training, I noticed some wild fluctuations in scores with bigrams enabled, especially with small databases. Skip From tim at fourstonesExpressions.com Wed Dec 17 11:18:49 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Wed Dec 17 11:18:55 2003 Subject: [spambayes-dev] Re: [Spambayes] How low can you go? In-Reply-To: <16352.32783.360628.370482@montanaro.dyndns.org> References: <16352.32783.360628.370482@montanaro.dyndns.org> Message-ID: On Wed, 17 Dec 2003 10:10:55 -0600, Skip Montanaro wrote: > I've been experimenting with the bigram stuff and like it so far. I also > have some mods to the DBDictClassifier stuff which add timestamps (last > set, > last used) to the database. There's some interaction between the two > which > keeps me from using the two together. It may be worthwhile considering a > last used timestamp to control the number of unused (or rarely used) > tokens. iirc, there was quite a bit of discussion about aging mechanisms quite a few months ago. It seemed like most everyone agreed that it was a good idea, but nobody wanted to implement it for database size considerations. It still seems like a good idea... -- Vous exprimer; Expr?sese; Te stesso esprimere; Express yourself! 
Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From skip at pobox.com Wed Dec 17 11:29:12 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 17 11:29:11 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk emailfolder. In-Reply-To: <009e01c3c493$2489c0b0$2c00a8c0@eden> References: <009e01c3c493$2489c0b0$2c00a8c0@eden> Message-ID: <16352.33880.908212.67671@montanaro.dyndns.org> Mark> .... It does get complex though - what happens when the user Mark> renames the folder? Before you know it, we have even more cruft Mark> that noone really understand why is there And thus a proper Windows application. Skip From skip at pobox.com Wed Dec 17 11:45:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 17 11:45:42 2003 Subject: [spambayes-dev] Re: [Spambayes] How low can you go? In-Reply-To: References: <16352.32783.360628.370482@montanaro.dyndns.org> Message-ID: <16352.34858.121487.578149@montanaro.dyndns.org> Tim> iirc, there was quite a bit of discussion about aging mechanisms Tim> quite a few months ago. It seemed like most everyone agreed that Tim> it was a good idea, but nobody wanted to implement it for database Tim> size considerations. It still seems like a good idea... Size definitely does matter. With both bigrams and my set/used timestamps (datetime objects), the size of the database ballooned. I think the set timestamp could be dispensed with and the last used timestamp converted to something smaller, like a YYYYMMDD string. Skip From tim.one at comcast.net Wed Dec 17 12:39:54 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 12:40:01 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <16352.34858.121487.578149@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Size definitely does matter. With both bigrams and my set/used > timestamps (datetime objects), the size of the database ballooned. 
I > think the set timestamp could be dispensed with and the last used > timestamp converted to something smaller, like a YYYYMMDD string. A small integer should be enough for last-used, like the number of days between the day the database was first created and the day a feature was most recently used in scoring. That's easily computed, easy to use *in* computations, and consumes no more than 3 bytes in a binary pickle (proto 1 or proto 2) until about 180 years after the database was created . Especially with the bigram scheme-- which creates a relatively enormous number of hapaxes --I expect the best use for a per-feature "last used" timestamp is to expire hapaxes that haven't been used in scoring for N days. That should yield major size savings, actually increase resistance to "spectacular failures" (which so far most often seem to be associated with hitting a large number of old hapaxes from "the other" category), and *probably* not hurt anything else. Expiring "near hapaxes" too gets dicier, and more so the more liberal the conception of "near". From nobody at spamcop.net Wed Dec 17 13:21:08 2003 From: nobody at spamcop.net (Seth Goodman) Date: Wed Dec 17 13:21:13 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <16352.34858.121487.578149@montanaro.dyndns.org> Message-ID: [Tim Stone] > Tim> iirc, there was quite a bit of discussion about aging mechanisms > Tim> quite a few months ago. It seemed like most everyone agreed that > Tim> it was a good idea, but nobody wanted to implement it > for database > Tim> size considerations. It still seems like a good idea... > > [Skip Montanaro] > Size definitely does matter. With both bigrams and my set/used > timestamps (datetime objects), the size of the database > ballooned. I think > the set timestamp could be dispensed with and the last used timestamp > converted to something smaller, like a YYYYMMDD string. 
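As a rough sketch of the compact last-used stamp discussed above (a day count since database creation, used to expire idle hapaxes): the record layout `(spamcount, hamcount, last_used_day)` and both function names are assumptions made for illustration, not the SpamBayes database format.

```python
import datetime

def day_stamp(created, today):
    """Days elapsed between database creation and today, as a small int."""
    return (today - created).days

def expire_idle_hapaxes(db, today_stamp, max_idle_days):
    """Delete tokens seen in exactly one trained message that have not been
    used in scoring for more than max_idle_days.

    db maps token -> (spamcount, hamcount, last_used_day); this layout is
    hypothetical. Returns the number of tokens removed.
    """
    stale = [
        tok
        for tok, (spamcount, hamcount, last_used) in db.items()
        if spamcount + hamcount == 1 and today_stamp - last_used > max_idle_days
    ]
    for tok in stale:
        del db[tok]
    return len(stale)
```

Only true hapaxes are touched here; widening the test to "near hapaxes" is the dicier step the thread warns about.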
I know this is a developer conversation, so I hope you don't mind if I offer my two cents. And I definitely agree that size matters, at least for databases.

I have seen a lot of references, not just in this thread, to ageing out individual tokens. For a probability calculation in which one of the variables is the number of messages of a given class that a token appears in, it seems dangerous to remove only some tokens from a message and not adjust the message count. Here's my problem with it: all tokens from a trained message *could* conceivably age out individually, but the trained message count for the appropriate category would not change. This would result in wrong probabilities for *all* other tokens, since the database is in the same state as before the message was trained but the trained message count is now wrong. It is even harder to conceive what the trained message count should be if you only remove some of the tokens from a message. Using a token ageing scheme, the trained message counts would monotonically rise until you started over, despite removing plenty of tokens over time. I do understand that most of the aged out tokens would be oddball hapaxes, but not all of them will be.

Though I often hear "intuition is a poor guide", I would propose ageing out whole messages rather than tokens. This at least maintains the integrity of your basic probability calculation. It also has the advantage of enforcing balanced (or unbalanced in a particular way) training set size. This would require adding all the tokens from a trained message to the message database and the message entry would be timestamped rather than the individual tokens. When a message got too old, all its tokens would have their counts decremented and the trained message count for that message class would also be decremented.
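A toy sketch of the whole-message ageing described above, assuming each trained message is stored with its token set and a training day. The class and its attribute names are hypothetical and do not correspond to the SpamBayes classifier API:

```python
from collections import Counter, deque

class MessageAgingStore:
    """Keeps per-token counts and per-class message counts in step by
    expiring whole messages instead of individual tokens."""

    def __init__(self):
        self.messages = deque()      # (day, is_spam, tokens), oldest first
        self.nspam = 0
        self.nham = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, tokens, is_spam, day):
        """Record one message; each distinct token counts once per message."""
        tokens = frozenset(tokens)
        self.messages.append((day, is_spam, tokens))
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def expire(self, today, max_age_days):
        """Untrain every message older than max_age_days; return how many."""
        removed = 0
        while self.messages and today - self.messages[0][0] > max_age_days:
            _, is_spam, tokens = self.messages.popleft()
            counts = self.spam_counts if is_spam else self.ham_counts
            for tok in tokens:
                counts[tok] -= 1
                if counts[tok] <= 0:
                    del counts[tok]   # drop tokens no message supports
            if is_spam:
                self.nspam -= 1
            else:
                self.nham -= 1
            removed += 1
        return removed
```

Because token counts and the trained message counts are decremented together, the ratios used in the probability calculation stay consistent. The price is storing every trained message's token set, which is itself a database size cost.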
I would propose going one step further to give the train-on-everything approach some additional "memory" for atypical messages (of either type) that don't occur regularly enough to always be in a fixed-size database. This might give it some of the advantages of the train-on-exceptions schemes, perhaps with less of the "brittle" behavior others have noted and I have seen as well. One possible mechanism to do this is as follows:
1) If the database message count is at maximum, untrain the oldest message.
2) Score the new message to be trained.
3) Move the new training message timestamp into the future by an amount related to its "distance" from a perfect score for that message type.
More atypical messages that classify poorly would be timestamped further into the future and would thus stick around longer than ones that classify perfectly. The ones that classify perfectly would have their tokens replaced sooner, which should be no great loss. With train on everything, there should be lots of messages that classify very well to take their place. There could be a scaling constant that sets the maximum amount of extra time that an unusual message remains in the database. This determines how long the database "memory" is, along with the maximum message count and the number of messages that you train per day (depends on your training scheme). The goal of this is to allow train on everything, keep moderate database sizes and still have a long enough memory for atypical messages that are infrequent. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From wsy at merl.com Wed Dec 17 13:44:48 2003 From: wsy at merl.com (Bill Yerazunis) Date: Wed Dec 17 13:45:03 2003 Subject: [spambayes-dev] Re: [Spambayes] How low can you go? In-Reply-To: References: Message-ID: <200312171844.hBHIimx07769@localhost.localdomain> From: "Seth Goodman" [... re aging out tokens ...]
Here's a particularly cute solution I implemented in CRM114. The problem is that if you choose to store a token's last-seen date, you will likely consume almost as much space in the storage of the date as you will in the token count or the token hash. But most tokens are hapaxes anyway. They have very low value, and you probably will _never_ see them again. So, when you need to clean up the database a little, go through and decrement the "seen" count on a few (very few!) tokens. Choose the tokens to decrement randomly. REALLY randomly. Don't pick one chain that's too long and decrement every element in it. Decrement only every sixteenth one, or only the ones that have values that, when added to the system clock, have a hash with the low order byte == 0x00, or something like that. Sure, you're losing information, but that's a necessary consequence of forgetting tokens. The net result is very fast and has an acceptable level of damage to accuracy. Tests show that, at least for CRM114 which is HEAVILY hapax-oriented, the damage does not increase the error rate until you get into obscenely small databases (i.e. less than 100K slots). Anyway, this is how it is implemented in CRM114, and it seems to work acceptably well. -Bill Yerazunis From jm at jmason.org Wed Dec 17 13:59:02 2003 From: jm at jmason.org (Justin Mason) Date: Wed Dec 17 13:59:24 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: <20031217185904.1E1F217076@jmason.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Tim Peters writes: > [Skip Montanaro] > > Size definitely does matter. With both bigrams and my set/used > > timestamps (datetime objects), the size of the database ballooned. I > > think the set timestamp could be dispensed with and the last used > > timestamp converted to something smaller, like a YYYYMMDD string.
> > A small integer should be enough for last-used, like the number of days > between the day the database was first created and the day a feature was > most recently used in scoring. That's easily computed, easy to use *in* > computations, and consumes no more than 3 bytes in a binary pickle (proto 1 > or proto 2) until about 180 years after the database was created. FWIW -- in SpamAssassin, we used to use an approximate scheme that fit the remaining UNIX epoch into 2 bytes something like you're suggesting (by dividing time_t by several hours and starting the current epoch from 1 Jan 2000, or something like that). However we found that we ran into expiry problems for large dbs and busy sites, because that just didn't give us enough precision -- having a granularity of hours wasn't good enough. So SpamAssassin db version 2 now just uses a plain old long containing a time_t value, and damn the db bloat. A bit bigger, but expiry now works reliably ;) However, a good way we found to cut down hapax db bloat was to use a polymorphic format for the tokens in the db; if a token has spamcount < 8 and hamcount < 8, it's marshalled so that the spamcount and hamcount are both shoved into 1 byte as a bitmask, with the high bits set. Here's the perl code in question:

    sub tok_pack {
      my ($self, $ts, $th, $atime) = @_;
      $ts ||= 0; $th ||= 0; $atime ||= 0;
      if ($ts < 8 && $th < 8) {
        return pack ("CV", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
      } else {
        return pack ("CVVV", TWO_LONGS_FORMAT, $ts, $th, $atime);
      }
    }

I do like Bill Y's "sunspots expiry" scheme though ;) - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Exmh CVS iD8DBQE/4Kd2QTcbUG5Y7woRAh/DAKC6MGlXpd1bEeR2/BzTmhtH71075ACgg21j pJ85tiGe697R3s90bP/LRS4= =slib -----END PGP SIGNATURE----- From nobody at spamcop.net Wed Dec 17 14:00:45 2003 From: nobody at spamcop.net (Seth Goodman) Date: Wed Dec 17 14:00:51 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <200312171844.hBHIimx07769@localhost.localdomain> Message-ID: [Bill Yerazunis] > Here's a particularly cute solution I implemented in CRM114. ---------snip---------------- > Choose the tokens to decrement randomly. REALLY randomly. Don't Does CRM114 use the number of trained ham and trained spam *messages* as variables in its probability calculation? If not, then you wouldn't expect that deleting infrequently used tokens would do much damage. AFAIK, SpamBayes uses the trained message counts in the probability calculation and those becomes inaccurate if you delete individual tokens. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From spambayes at whateley.com Wed Dec 17 14:08:52 2003 From: spambayes at whateley.com (Brendon Whateley) Date: Wed Dec 17 14:09:00 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: References: Message-ID: <200312171108.56120.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Seth, Couldn't we maintain (and use) a synthetic #messages value that is generated using the average number of tokens/message. This way, as tokens are removed from the database, the synthetic number could be adjusted. It seems (and I don't have time to think about it now, have to go pay the dog license!) that such a scheme would work quite well along with the "remove old tokens" scheme that ages unused tokens? It probably doesn't matter if the number is accurate, provided the DB doesn't contain far too few tokens. Brendon. On Wednesday 17 December 2003 11:00 am, Seth Goodman wrote: > [Bill Yerazunis] > > > Here's a particularly cute solution I implemented in CRM114. > > ---------snip---------------- > > > Choose the tokens to decrement randomly. REALLY randomly. Don't > > Does CRM114 use the number of trained ham and trained spam *messages* as > variables in its probability calculation? 
If not, then you wouldn't expect > that deleting infrequently used tokens would do much damage. AFAIK, > SpamBayes uses the trained message counts in the probability calculation > and those becomes inaccurate if you delete individual tokens. > > -- > Seth Goodman > > Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com > > Spambots: disregard the above > > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev -----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBP+CpxJuupqACStRwEQJZEACg23t52C7CDk5ghZsRU3KsmetsPUMAoIXQ nYVJM0QJ0tQOKT5RjZZugjRn =ZxqV -----END PGP SIGNATURE----- From richie at entrian.com Wed Dec 17 16:32:36 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Dec 17 16:32:45 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071607304.7979.39.camel@geddy> References: <1071607304.7979.39.camel@geddy> Message-ID: [Barry] > I'll help play "consultant" on BerkeleyDB stuff. [Tim] > I'm half ready to declare that ZODB is the only database anyone > should ever use This is probably a hopelessly naive question, but can I have the best of both worlds? If I use ZODB with a BerkeleyDB back end, will that be process- and thread-safe (without using ZEO)? -- Richie Hindle richie@entrian.com From tim.one at comcast.net Wed Dec 17 17:16:44 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 17:16:47 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Richie Hindle] > This is probably a hopelessly naive question, but can I have the best > of both worlds? If I use ZODB with a BerkeleyDB back end, will that > be process- and thread-safe (without using ZEO)? 
My understanding is that, regardless of back end, ZODB is thread-safe among the threads in a single process, but that you cannot open a connection to a ZODB database from more than one process simultaneously without using ZEO. Don't consider ZEO to be such a big deal, though: code using ZEO looks exactly the same as code not using ZEO, except for the lines that initially open the database. Where a direct use of ZODB may open a FileStorage, for example, the same code wishing to use ZEO would open a ClientStorage instead, and that's it. Once you are using ZEO, you get distributed access for free (you can connect to the ZEO server via an arbitrary pair, so can access a ZODB database living anywhere your network can reach). Note that Jeremy already wrote code to run spambayes via ZEO, in the project's pspam/ directory. I don't know how much bitrot that's suffered. Note too that in addition to getting the best of both worlds, you may also get the worst of both worlds. For example, if BDB really does suffer corruption problems, then it would be something of a miracle if ZODB-on-BDB were somehow immune. Also note that the full ZODB back ends (like FileStorage and Berkeley) support unlimited undo, so the physical database keeps every revision ever made to every object. So they need 'pack' steps from time to time to announce that you promise never to care about revisions before a time you specify to pack, so that the physical database can reclaim their space. Finally, note that any form of concurrent modification can end up creating inconsistent data. ZODB solves this by raising ConflictError whenever inconsistency is possible, and the app has to be prepared to catch that (the usual response then is to try the transaction again, and on the second attempt it will *start* with the data successfully committed by the other transaction(s) involved in the conflict). 
That could be a real problem if many threads or processes keep modifying the same info simultaneously (like the counts attached to, say, "the"). From barry at python.org Wed Dec 17 17:32:03 2003 From: barry at python.org (Barry Warsaw) Date: Wed Dec 17 17:32:12 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071700323.27808.50.camel@anthem> On Wed, 2003-12-17 at 17:16, Tim Peters wrote: > Note too that in addition to getting the best of both worlds, you may also > get the worst of both worlds. For example, if BDB really does suffer > corruption problems, then it would be something of a miracle if ZODB-on-BDB > were somehow immune. Except that the BerkeleyDB based storages use the full-blown bsddb transactional interface, so from that side of things, they should be thread and multiproc safe. Assuming anyone really understands how BerkeleyDB (and the Python wrapper around it) works , I'd feel pretty confident storing data into it. > Also note that the full ZODB back ends (like FileStorage and Berkeley) > support unlimited undo, so the physical database keeps every revision ever > made to every object. So they need 'pack' steps from time to time to > announce that you promise never to care about revisions before a time you > specify to pack, so that the physical database can reclaim their space. Note that there is a "full" BDB storage and a "minimal" storage. The latter doesn't retain multiple revisions. The former can be configured to "autopack" occasionally to cut down on space it consumes. -Barry From skip at pobox.com Wed Dec 17 17:40:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 17 17:40:21 2003 Subject: [spambayes-dev] sb_filter experimental args Message-ID: <16352.56151.499376.839685@montanaro.dyndns.org> Are the sb_filter.py arguments marked [EXPERIMENTAL] (try sb_filter.py --help) really still experimental? 
They've been there a long while and as far as I know there's no move afoot to get rid of them (I use them from my .procmailrc file). If not, I will update the docstring. Skip From skip at pobox.com Wed Dec 17 17:59:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 17 17:59:43 2003 Subject: [spambayes-dev] empty urls in bigram? Message-ID: <16352.57313.372507.14545@montanaro.dyndns.org> I just noticed this bigram in my clues: 'bi:url: url:'. If 'url:' would only be presented once as a clue, does it make sense to form a bigram with two instances of it? What does an empty "url:" token mean? Skip From skip at pobox.com Wed Dec 17 18:07:21 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 17 18:07:19 2003 Subject: [spambayes-dev] empty urls in bigram? Message-ID: <16352.57769.222771.631763@montanaro.dyndns.org> I just noticed this bigram in my clues: 'bi:url: url:'. If 'url:' would only be presented once as a clue, does it make sense to form a bigram with two instances of it? More examples: >>> [k for k in db if re.match(r"bi:([^ ]+) \1$", k) is not None] ['bi:very, very,', 'bi:charset:utf-8 charset:utf-8', 'bi:megamek megamek', 'bi:time time', 'bi:[input] [input]', 'bi:billboard billboard', 'bi:the the', 'bi:state state', 'bi:prince prince', 'bi:subject:$ subject:$', 'bi:phpmyadmin phpmyadmin', 'bi:amsn amsn', 'bi:fund fund', 'bi:against, against,', 'bi:camera camera', 'bi:received:mailnull@localhost) received:mailnull@localhost)', 'bi:pago pago', 'bi:chicago chicago', 'bi:charset:iso-8859-1 charset:iso-8859-1', 'bi:pdfcreator pdfcreator', 'bi:gour gour', 'bi:subject:. subject:.', 'bi:received:30950 received:30950', 'bi:subject:- subject:-', "bi:subject:' subject:'", 'bi:fma fma', 'bi:subject:.. 
subject:..', 'bi:miktex miktex', 'bi:this this', 'bi:help help', 'bi:url:2 url:2', 'bi:fluid fluid', 'bi:sell, sell,', 'bi:$50.00 $50.00', 'bi:forum forum', 'bi:scummvm scummvm', 'bi:url:com url:com', 'bi:received:2612 received:2612', 'bi:download download', 'bi:hanukah hanukah', 'bi:becomes becomes', 'bi:men men', 'bi:url:ami url:ami', 'bi:subject:2003 subject:2003', 'bi:*** ***', 'bi:encore encore', 'bi:virus:src="cid: virus:src="cid:', 'bi:subject:You subject:You', 'bi:filezilla filezilla', 'bi:received:3948 received:3948', 'bi:charset:windows-874 charset:windows-874', 'bi:content-type:text/plain content-type:text/plain', 'bi:subject:, subject:,', 'bi:url:contactus url:contactus', 'bi:charset:windows-1252 charset:windows-1252', 'bi:have have', 'bi:url:catalog url:catalog', 'bi:or: or:', 'bi:aid aid', 'bi:url:sendmail url:sendmail', 'bi:url:%s url:%s', 'bi:url:tracking url:tracking', 'bi:described described', 'bi:you you', 'bi:music music', 'bi:springs springs', 'bi:any any', 'bi:charset:us-ascii charset:us-ascii', 'bi:url:email-reports url:email-reports', 'bi:url:cgi url:cgi', 'bi:url:newsletter_2003_oct url:newsletter_2003_oct', 'bi:indianapolis indianapolis', 'bi:dev-c++ dev-c++', 'bi:subject:* subject:*', 'bi:url:forums url:forums', 'bi:relix relix', 'bi:mau mau', 'bi:subject:: subject::', 'bi:$$$ $$$', 'bi:url:signup url:signup', 'bi:#include #include', 'bi:%s, %s,', 'bi:speech speech', 'bi:content-type:image/gif content-type:image/gif', 'bi:url:news url:news', 'bi:record, record,', 'bi:url:3 url:3', 'bi:subject:/ subject:/', 'bi:gaim gaim', 'bi:bang bang', 'bi:>> >>', 'bi:charset:windows-1256 charset:windows-1256', 'bi:liberopops liberopops', 'bi:url: url:', 'bi:subject:spambayes subject:spambayes', 'bi:url:complaint url:complaint', 'bi:received:jln@localhost) received:jln@localhost)', 'bi:free free', 'bi:coast coast', 'bi:received:16781 received:16781', 'bi:following following', 'bi:url:xdr2 url:xdr2', 'bi:card card', 'bi:a1> a1>', 'bi:unsubscribe 
unsubscribe', 'bi:toshiba toshiba', 'bi:jingle jingle', 'bi:charset:iso-2022-jp charset:iso-2022-jp', 'bi:subject:% subject:%', 'bi:your your'] I suppose some of them might make sense, but most are probably artifacts. Maybe bigrams should only be generated if the current and previous tokens differ. Skip From nobody at spamcop.net Wed Dec 17 18:23:48 2003 From: nobody at spamcop.net (Seth Goodman) Date: Wed Dec 17 18:23:48 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: An interesting aside to the message ageing proposal I made is that it would help fight what is being discussed in the "Spam of the Future" threads. It would do this by keeping the token databases current with the message stream so that it would adapt as quickly as possible to the extraneous words used and then retire them after a time. Another implementation suggestion for using an approach like this with a train-on-everything scheme is to only train *after* the user has verified all the classifications. If we allow it to classify on-the-fly and it makes a mistake, a whole bunch of mistakes will likely follow. It's probably better to allow the classifier to do the best it can do in its present form, then after moving any mis-classified messages into their appropriate folders, do an incremental training on all emails in a given list of folders. This will only train messages that were previously untrained, at least in the Outlook plug-in version.
-- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From richie at entrian.com Wed Dec 17 18:25:45 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Dec 17 18:25:54 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071700323.27808.50.camel@anthem> References: <1071700323.27808.50.camel@anthem> Message-ID: [Barry, responding to Tim] > the BerkeleyDB based storages use the full-blown bsddb > transactional interface, so from that side of things, they should be > thread and multiproc safe. and: > Note that there is a "full" BDB storage and a "minimal" storage. The > latter doesn't retain multiple revisions. Fantastic. So in theory at least... o All the SpamBayes programs could use BDB-backed ZODB instead of directly using bsddb. o They would automatically work nicely together with a single writer (eg. sb_server is training while sb_filter is classifying), and with a bit more work catching ConflictErrors, we could even have multiple writers. o The database wouldn't get significantly bigger than with direct use of bsddb. o Since BDB uses bsddb in transaction mode rather than single-file mode, we can say goodbye to those nasty little DBRunRecovery errors. Yay! Tim, did this: > I'm half ready to declare that ZODB is the only database anyone should > ever use apply to BDB-backed ZODB, or only to ZODB's native storage? Unless there's something I'm missing (licensing problems, deployment problems, portability problems...?) it could be that we should replace our current DBDictClassifier (which suffers from DBRunRecovery errors and isn't multiprocess-safe) with a ZODBClassifier using a BDB back end. From a position of complete ignorance, I'd hazard a guess that the implementation would end up a lot simpler than rewriting DBDictClassifier to use bsddb in full-on transactional mode - the hassles of doing that have already been sorted out in ZODB. 
Am I in cloud cuckoo land? -- Richie Hindle richie@entrian.com From tim at fourstonesExpressions.com Wed Dec 17 18:37:24 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Wed Dec 17 18:37:30 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: <1071700323.27808.50.camel@anthem> Message-ID: On Wed, 17 Dec 2003 23:25:45 +0000, Richie Hindle wrote: > Unless there's something I'm missing (licensing problems, deployment > problems, portability problems...?) Not insignificant issues... > it could be that we should replace our > current DBDictClassifier (which suffers from DBRunRecovery errors and > isn't multiprocess-safe) with a ZODBClassifier using a BDB back end. It certainly can't hurt to give it a try... any sample code out there? > Am I in cloud cuckoo land? Well... we're all cloud dwellers, you know -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From tim.one at comcast.net Wed Dec 17 18:43:45 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 18:43:47 2003 Subject: [spambayes-dev] empty urls in bigram? In-Reply-To: <16352.57313.372507.14545@montanaro.dyndns.org> Message-ID: [Skip] > I just noticed this bigram in my clues: 'bi:url: url:'. If 'url:' > would only be presented once as a clue, does it make sense to form a > bigram with two instances of it? Sure -- why not? The same thing might happen to "really really" in The only product that makes your toes really really big! Since repetition is a form of advertising hyperbole (FREE FREE FREE!), I like the chance to catch it this way. You could try removing the possibility and running large-scale tests both ways, but I think there are more basic questions about the unibi approach open now.
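Running the comparison suggested above would need a bigram generator that can switch repeated-token pairs on and off. A sketch, assuming tokens arrive as a flat sequence of strings: the function name `bigrams` and its `allow_repeats` flag are hypothetical, and only the `bi:` prefix format is taken from the clues shown in this thread.

```python
def bigrams(tokens, allow_repeats=True):
    """Yield 'bi:<prev> <cur>' for each adjacent pair of tokens.

    With allow_repeats=False, pairs whose two tokens are identical
    (for example 'bi:url: url:') are skipped.
    """
    prev = None
    for tok in tokens:
        if prev is not None and (allow_repeats or tok != prev):
            yield "bi:%s %s" % (prev, tok)
        prev = tok
```

Feeding the same corpus through both settings and comparing error rates would show whether the repeated pairs carry signal or are mostly artifacts.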
Note that we *won't* score more than one instance of "really really" per message -- bigram clues are subjected to the same duplicate-squashing as unigram clues. > What does an empty "url:" token mean? It doesn't *mean* anything . Staring at the code, looks like it's produced if and only if a URL contains two adjacent characters from this set: ;?:@&=+,$. So 'bi:url: url:' would come from three adjacent characters in that set. Sounds spammy to me. From tameyer at ihug.co.nz Wed Dec 17 18:55:14 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 17 18:55:36 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins]spambayes/spambayesOptionsClass.py, 1.19, 1.20 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0DCF@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677772@its-xchg4.massey.ac.nz> [Tony in log message] > Option names are always case insensitive, no matter what. [Mark] > Yay! I *nearly* did that quite some time ago, but was > worried I would be > (silently) accused of loosening reasonable code to handle my > sloppy style. It also means a number of .lower() calls can be > removed from Outlook! Oh good, we have a guinea pig! =Tony Meyer From tim.one at comcast.net Wed Dec 17 19:13:58 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 19:14:04 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Seth Goodman] > Does CRM114 use the number of trained ham and trained spam *messages* > as variables in its probability calculation? If not, then you > wouldn't expect that deleting infrequently used tokens would do much > damage. AFAIK, SpamBayes uses the trained message counts in the > probability calculation Yes. > and those becomes inaccurate if you delete individual tokens. No, it doesn't matter if that's *all* you do. Say I've trained on 243 ham, and 257 spam, total, and throw out the hapax 'bi:choose the'. 
That has no effect on the fact that the features I didn't throw out still came from training on 243 ham and 257 spam, total. The problem comes when untraining a message M. That reduces the count of total messages trained on, but if I threw away a hapax H from M previously, and H reappeared later, it would be a mistake to reduce the category count on H during untraining M. There's another bullet we haven't bitten yet: saving a map of message id to an explicit list of all tokens produced by that message (Skip wants the inverse of that mapping for diagnostic purposes too). Given that, training and untraining of individual messages could proceed smoothly despite intervening changes in tokenization details; expiring entire messages would be straightforward; and when expiring an individual feature, it would be enough to remove that feature from each msg->[feature] list it's in (then untraining on a msg later wouldn't *try* to decrement the per-feature count of any feature that had previously been expired individually and appeared in the msg at the time). That's all easy enough to do, but the database grows ever bigger. It would probably need reworking to start using "feature ids" (little integers) too, so that relatively big strings didn't have to get duplicated all over the database. From tim.one at comcast.net Wed Dec 17 20:16:30 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 20:16:31 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Richie] > Fantastic. So in theory at least... > > o All the SpamBayes programs could use BDB-backed ZODB instead of > directly using bsddb. Yes, but then they also have to use persistent objects in ZODB's sense of the word. That's not scary to me, because SpamBayes was originally designed with ZODB's BTrees in mind as the mapping data structure.
There is no *direct* access to bsddb via ZODB, you interact with ZODB's view of the world then, and BDB is just a (mostly) invisible, and wholly inaccessible, implementation detail. > o They would automatically work nicely together with a single writer > (eg. sb_server is training while sb_filter is classifying), Surprise! Nope. The reader will suffer a ReadConflictError if it tries to access anything that's been modified by the writer since the reader began its current transaction. This protects the reader from seeing inconsistent data. The reader is always in *some* transaction, so you can't worm around this. ZODB 3.3 will support "multiversion concurrency control", which will deliver the state of the data (to the reader) current *at* the time the transaction began, and there are no ReadConflictErrors then. But that hasn't been released yet. > and with a bit more work catching ConflictErrors, we could even have > multiple writers. ConflictErrors can only be guaranteed not to happen now if there are no writers. > o The database wouldn't get significantly bigger than with direct > use of bsddb. That one's hard to guess in advance. The BDB back end creates a number of distinct database tables to support ZODB's ideas of object identity, object revisions, and how objects all tie together. That's all metadata, on top of the application data we work with directly. But BTrees are a pretty space-efficient structure, and there are builtin flavors of BTree that are especially compact for mappings having integers as keys or values. > o Since BDB uses bsddb in transaction mode rather than single-file > mode, we can say goodbye to those nasty little DBRunRecovery > errors. Yay! That would be great -- although I still haven't seen one of these, despite running 3 different Outlooks on 3 different bsddb3's for a loooong time now! >> I'm half ready to declare that ZODB is the only database anyone >> should ever use > apply to BDB-backed ZODB, or only to ZODB's native storage? 
ZODB's BTrees rock. The backend storage format is just a detail. ZODB doesn't have a native format, BTW -- you get the kind of storage you explicitly ask for (there is no default), and I bet there are at least 10 flavors of storage by now. FileStorage is by far the most frequently used. We should all be aware that BDB-backed ZODB is a pretty new thing, and isn't yet used in production anywhere that I'm aware of. FileStorage has been through the wringer at sites with enormous loads for years, so is easier to trust -- and its pragmatics are much better understood too. Tuning BDB appears to be a major undertaking even on a tuning-friendly platform like Linux. > Unless there's something I'm missing (licensing problems, deployment > problems, portability problems...?) ZODB is OSI-certified Open Source, like Python. You can even piss on it and sell the result as art, if you want to . > it could be that we should replace our current DBDictClassifier (which > suffers from DBRunRecovery errors and isn't multiprocess-safe) with a > ZODBClassifier using a BDB back end. From a position of complete > ignorance, I'd hazard a guess that the implementation would end up a > lot simpler than rewriting DBDictClassifier to use bsddb in full-on > transactional mode - the hassles of doing that have already been > sorted out in ZODB. Having never written anything myself using bsddb3's "real" interface, I can't say how hard that would be. I *expect* it would actually be easy for someone with a non-trivial understanding of BDB. The only use we have for BDB now is to use it as if it were a giant dict -- it probably doesn't get any simpler than that. > Am I in cloud cuckoo land? Na, talk is cheap and always sane . 
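The "bit more work catching ConflictErrors" discussed in this message amounts to a retry loop around each transaction. The sketch below is deliberately library-free: the `ConflictError` class is a local stand-in rather than ZODB's own exception, and `work`, `commit` and `abort` are hypothetical caller-supplied hooks, so it shows only the control flow and not any real ZODB API.

```python
class ConflictError(Exception):
    """Local stand-in for a storage-level conflict (NOT ZODB's class)."""


def run_with_retries(work, commit, abort, max_attempts=3):
    """Call work() then commit(); on ConflictError, abort() and try again.

    work, commit and abort are hypothetical callables standing in for
    a real transaction API. The last ConflictError is re-raised once
    max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = work()
            commit()
            return result
        except ConflictError:
            abort()               # discard stale state before retrying
            if attempt == max_attempts:
                raise
```

Each retry re-runs `work()` from scratch, which matters because the second attempt is supposed to start from the data the competing transaction committed. For heavily contended entries the loop may retry often, so a bound on attempts seems prudent.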
From barry at python.org Wed Dec 17 20:21:53 2003 From: barry at python.org (Barry Warsaw) Date: Wed Dec 17 20:22:01 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: <1071700323.27808.50.camel@anthem> Message-ID: <1071710513.27808.62.camel@anthem> On Wed, 2003-12-17 at 18:25, Richie Hindle wrote: > o The database wouldn't get significantly bigger than with direct use of > bsddb. I didn't say that. :) ZODB's storage API and object model require many ancillary tables in order to keep house properly. The overall disk usage of a BDB-backed ZODB will be greater than if you could just model the data structures you needed directly onto BerkeleyDB BTrees (most likely). With ZODB, it's likely that object pickles overwhelm the housekeeping tables so it may not matter much, but for spambayes, I'm not sure that would be the case (I haven't looked closely at exactly what data spambayes wants to store). > o Since BDB uses bsddb in transaction mode rather than single-file mode, > we can say goodbye to those nasty little DBRunRecovery errors. Yay! That's the hope, anyway. :) -Barry From barry at python.org Wed Dec 17 20:31:01 2003 From: barry at python.org (Barry Warsaw) Date: Wed Dec 17 20:31:11 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071711060.27808.67.camel@anthem> On Wed, 2003-12-17 at 20:16, Tim Peters wrote: > Having never written anything myself using bsddb3's "real" interface, I > can't say how hard that would be. I *expect* it would actually be easy for > someone with a non-trivial understanding of BDB. The only use we have for > BDB now is to use it as if it were a giant dict -- it probably doesn't get > any simpler than that. If you map all square-bracket setitems to .put()'s and square-bracket getitems to .get()'s, it's fairly straightforward.
That is, provided you can define the transaction boundaries so you can call txn begin, abort, and commit at the Right Times. You will want to pass the BDB txn object into the .gets and .puts to make it all work smoothly. Add a little extra goo to create the environment if it doesn't exist (or join it if it does), and viola! or contrabasso! -Barry From tim.one at comcast.net Wed Dec 17 20:36:48 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 20:36:49 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071710513.27808.62.camel@anthem> Message-ID: [Barry] > ... > (I haven't looked closely at exactly what data spambayes wants to store). The token statistics database now is a single (but large) mapping from short 8-bit strings to 2-tuples of little integers. The strings are usually less than 16 characters, and never a lot longer than that (the tokenizer truncates very long strings, synthesizing short "skip" tokens as proxies). It would be nice to have other mappings too, like forward and inverse msgid <-> bag_of_tokens maps. A little-integer timestamp may get added to the 2-tuples. From nobody at spamcop.net Wed Dec 17 20:41:30 2003 From: nobody at spamcop.net (Seth Goodman) Date: Wed Dec 17 20:41:30 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Tim Peters] > No, it doesn't matter if that's *all* you do. Say I've trained > on 243 ham, > and 257 spam, total, and throw out the hapax 'bi:choose the'. That has no > effect on that the features I didn't throw out still came from training on > 243 ham and 257 spam, total. OK, but there are still a couple of potential problems. 1) Let's say the discarded bi-gram occurs in a spam at a later date. Though it was only a hapax, it now contributes nothing. 2) Let's say we want to train on a spam with the discarded bi-gram. It was originally a hapax, so it should now have an occurrence count of two. 
After training, it again shows up as a hapax. This is a more significant problem. 3) Do we eventually reduce the occurrence count of a non-hapax token? If we do, we could eventually have none of the tokens from a trained message present but its message count will still be there. Unless we implement your token cross-reference as explained below, the message counts will eventually not be correct if we expire enough tokens. If we don't expire a lot of tokens over the long run, why bother? > > The problem comes when untraining a message M. That reduces the count of > total messages trained on, but if I threw away a hapax H from M > previously, > and H reappeared again later, it would be a mistake to reduce the category > count on H during untraining M. Yup, and you have the solution below. > > There's another bullet we haven't bitten yet, saving a map of > message id to > an explicit list of all tokens produced by that message (Skip wants the > inverse of that mapping for diagnostic purposes too). Given > that, training > and untraining of individual messages could proceed smoothly despite > intervening changes in tokenization details; expiring entire > messages would > be straightforward; and when expiring an individual feature, it would be > enough to remove that feature from each msg->[feature] list it's in (then > untraining on a msg later wouldn't *try* to decrement the > per-feature count > of any feature that had previously been expired individually and > appeared in > the msg at the time). This definitely works. But why bother tracking, cross-referencing and expiring individual tokens when we can just expire whole messages, which is a lot simpler? It accomplishes the goal of keeping the token databases cleaned of excessive hapaxes and gradually expires non-hapax tokens, as well. There is also less need for reverse indexing of tokens to messages, since all messages and their tokens will eventually expire. However, if people need that feature, they need it. 
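The msg_id -> tokens bookkeeping quoted above can be sketched roughly as follows. This is only a toy illustration, not SpamBayes code: the class name TinyStore and its methods are hypothetical, it keeps a single count per token instead of separate ham and spam counts, and it lives entirely in memory. It shows why untraining a message after one of its hapaxes was expired (and later reappeared via another message) does not decrement the wrong count.

```python
from collections import Counter


class TinyStore:
    """Toy token store (hypothetical) that remembers each message's tokens."""

    def __init__(self):
        self.counts = Counter()   # token -> number of trained messages containing it
        self.msg_tokens = {}      # msg_id -> set of tokens recorded at training time
        self.nmessages = 0        # total messages trained on

    def train(self, msg_id, tokens):
        toks = set(tokens)
        self.msg_tokens[msg_id] = toks
        self.counts.update(toks)
        self.nmessages += 1

    def expire_token(self, token):
        # Drop the token's count and remove it from every message's saved
        # token set, so a later untrain will not try to decrement it.
        self.counts.pop(token, None)
        for toks in self.msg_tokens.values():
            toks.discard(token)

    def untrain(self, msg_id):
        # Walk the saved token set instead of re-tokenizing the message.
        for tok in self.msg_tokens.pop(msg_id):
            self.counts[tok] -= 1
            if self.counts[tok] == 0:
                del self.counts[tok]
        self.nmessages -= 1


store = TinyStore()
store.train("m1", ["bi:choose the", "viagra"])
store.expire_token("bi:choose the")      # hapax from m1 is thrown away
store.train("m2", ["bi:choose the"])     # the same token reappears later
store.untrain("m1")                      # must not touch m2's count
print(store.counts["bi:choose the"], store.nmessages)
```

In this sketch the count for the reappearing token stays at 1 after m1 is untrained, because the expired token was removed from m1's saved set. Expiring a whole message is just untrain() on its id.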
> > That's all easy enough to do, but the database grows ever bigger. > It would > probably need reworking to start using "feature ids" (little > integers) too, > so that relatively big strings didn't have to get duplicated all over the > database. No argument there. How about a 32-bit hash for any token whether unigram, bi-gram, etc.? The token database could then consist of an ordered list of 32-bit hashes paired with an occurrence count (16-bits would probably do it). That's only six bytes/token, and you could use your indexing method of choice, if any, to speed up the lookups. Similarly, if we implemented a message database with this method, each token in a message would only take up four bytes. The hash calculation costs something, but the smaller database size and quicker lookup time could make up for it. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From tameyer at ihug.co.nz Wed Dec 17 20:53:21 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Dec 17 20:53:39 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0F6D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677773@its-xchg4.massey.ac.nz> [Tim] > The token statistics database now is a single (but large) > mapping from short 8-bit strings to 2-tuples of little > integers. I think part of the Japanese/Asian languages patch which I keep meaning to look more closely into has these turn into unicode strings (how many bits is that? I know nothing much about unicode; English is good enough for me ). (Just in case someone was about to implement a new spambayes db system with only 8-bit tokens). 
=Tony Meyer From barry at python.org Wed Dec 17 21:17:49 2003 From: barry at python.org (Barry Warsaw) Date: Wed Dec 17 21:18:04 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071713869.27808.75.camel@anthem> On Wed, 2003-12-17 at 20:36, Tim Peters wrote: > The token statistics database now is a single (but large) mapping from short > 8-bit strings to 2-tuples of little integers. The strings are usually less > than 16 characters, and never a lot longer than that (the tokenizer > truncates very long strings, synthesizing short "skip" tokens as proxies). The raw bsddb interface wants keys and values to be strings and for btree access methods, the length doesn't really matter. You could pickle the 2-tuples or just do something easily splittable like '%s|%s' % two_tuple. Sounds like one BTree table would do the trick there. > It would be nice to have other mappings too, like forward and inverse msgid > <-> bag_of_tokens maps. A little-integer timestamp may get added to the > 2-tuples. Each of those would be a separate table, of course. bag_of_token maps sounds like you'd want to pickle the data value. -Barry From tim.one at comcast.net Wed Dec 17 21:47:29 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 17 21:47:33 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071713869.27808.75.camel@anthem> Message-ID: [Tim] >> The token statistics database now is a single (but large) mapping >> from short 8-bit strings to 2-tuples of little integers. The >> strings are usually less than 16 characters, and never a lot longer >> than that (the tokenizer truncates very long strings, synthesizing >> short "skip" tokens as proxies). [Barry Warsaw] > The raw bsddb interface wants keys and values to be strings and for > btree access methods, the length doesn't really matter. 
You could > pickle the 2-tuples or just do something easily splittable like > '%s|%s' % two_tuple. We already pickle this stuff, but it goes through the shelve module so pretends to be transparent. I want to get shelve out of it anyway, because shelve adds little value at high cost (there are too many layers of indirection through Python-level methods now -- slooooow). There are very few textual sites where pickle<->unpickle dances are needed (that's already been cleanly factored out). > Sounds like one BTree table would do the trick there. Yup. We're using BDB hash now. I don't know that this was a conscious decision. I'd ask whether BDB hash or BDB BTree would be faster, but I don't want to put you on the spot . >> It would be nice to have other mappings too, like forward and >> inverse msgid <-> bag_of_tokens maps. A little-integer timestamp >> may get added to the 2-tuples. > Each of those would be a separate table, of course. bag_of_token maps > sounds like you'd want to pickle the data value. They would be very much like the indices we build for full search in ZCTextIndex. This is easy to do with ZODB's IO and OO flavors of BTree, because BTree values can also be BTrees (etc), and all the pieces are automagically cut down to reasonably small storage chunks then. I'd ask whether BDB supports something similar, but ... . the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the- code-ly y'rs - tim From barry at python.org Wed Dec 17 22:17:12 2003 From: barry at python.org (Barry Warsaw) Date: Wed Dec 17 22:17:23 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071717431.27808.122.camel@anthem> On Wed, 2003-12-17 at 21:47, Tim Peters wrote: > I'd ask whether BDB hash or BDB BTree would be faster, but I > don't want to put you on the spot . Oh, you can do better than that. It's easy: the answer is yes! 
> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the- > code-ly y'rs - tim But see, I'm actually doing you a favor by resolutely ducking that responsibility. How else are you going to be able to fix things when I take a leave of absence to follow the Britster on her 2-year long come back world tour? You really will eventually thank me for forcing you to write it. anyone-going-on-such-a-tour-has-no-shame-anyway-ly y'rs, -Barry From anthony at interlink.com.au Wed Dec 17 23:48:55 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Wed Dec 17 23:49:36 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: <200312180448.hBI4muIJ010785@localhost.localdomain> >>> "Tim Peters" wrote > the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the- > code-ly y'rs - tim Isn't that how I suckered MarkH into working on the Outlook plugin? Anthony -- Anthony Baxter It's never too late to have a happy childhood. From tim.one at comcast.net Thu Dec 18 00:08:39 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 00:08:50 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Seth Goodman] > OK, but there are still a couple of potential problems. Oh, sure -- but testing is the only judge of what works here. > 1) Let's say the discarded bi-gram occurs in a spam at a later date. > Though it was only a hapax, it now contributes nothing. I doubt it matters. Most text classification systems (this field is more than 40 years old, BTW) ignore hapaxes entirely, and also ignore tokens that don't appear in at least *several* distinct training examples (see Paul Graham's essay, where he carried on that tradition). We don't ignore anything, because testing said it worked better not to ignore anything in this particular task. It wasn't a killer-strong improvement to pay attention to everything, but was a statistically significant win. Good enough. 
Since then, use in real life, unlike our randomized cross-validation testing, doesn't see messages "at random" at all: it sees them ordered in time. That appears to make a difference, and actually helps us overall. After some 16 months of watching this algorithm in various tests and in practice, I've identified only two clear, repeated effects of hapaxes: 1. Good: When a spam campaign begins, the hapaxes in its first example very often help to nail the upcoming variations in that campaign. People with small databases using mistake-based training see this dramatically, and it's very handy for them in real-life use. A similar effect helps on the ham side, when training (e.g.) on that once-per-month HTML newsletter from (say) American Century Investments, which look very spammy the first time around. Because legit companies pay ad firms small fortunes to establish "brand identity", such newsletters are typically *stuffed* with hapaxes identifying the source. 2. Bad: Most spam campaigns fizzle out within a month. The hapaxes stick around, though. Sooner or later an unusual ham comes across that just happens to hit a large number of the leftover spam hapaxes, then serves as a "spectacular failure" example here. They're very rare, but very unsettling when they occur (well, likely *because* they're so rare for most people). > 2) Let's say we want to train on a spam with the discarded bi-gram. > It was originally a hapax, so it should now have an occurrence count > of two. After training, it again shows up as a hapax. This is a > more significant problem. Based on what evidence? Token spamprobs are guesses at best, and an estimated spamprob based on only one or two examples isn't even reliable to one significant digit. The difference between seeing something once or twice doesn't move a spamprob much, either. So I have to guess that this effect is so tiny it will be lost in estimation noise.
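As a rough check on the once-versus-twice point, here is a minimal sketch of a Robinson-style adjusted spamprob. It is an assumption-laden illustration, not the project's classifier: the function name is made up, and the prior strength s=0.45 and prior probability x=0.5 are assumed values chosen for the example. The 243 ham / 257 spam totals are the ones mentioned earlier in the thread.

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Hypothetical helper: Bayesian-adjusted spam probability for one token.

    s and x are assumed prior strength and prior probability, not values
    taken from this thread.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)   # raw estimate from the counts
    n = hamcount + spamcount                 # how much evidence we have
    return (s * x + n * p) / (s + n)         # shrink toward x when n is small


# A token seen only in spam, once versus twice.
once = adjusted_spamprob(1, 0, nspam=257, nham=243)
twice = adjusted_spamprob(2, 0, nspam=257, nham=243)
print(round(once, 3), round(twice, 3))
```

Under these assumptions the two values come out near 0.84 and 0.91, so the token is a fairly strong spam clue either way, and the gap is small next to the uncertainty of an estimate built from one or two messages.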
In early experiments, the database stored more info, and the test framework was able to report which features were used *most* often in making a correct decision. Several times I took the few hundred "most valuable" features (based on a combination of how often they contributed to a correct decision, and their spamprob strength (distance, in either direction, from 0.5)), and threw them out of the database. An amazing (at the time) thing was that this didn't hurt performance -- if the classifier was blinded to what *were* its best clues, it found another set of clues that did just as well overall. Performance eventually deteriorated dramatically if this was done over and over again, but the system has already been shown to be very robust against losing even its best features. That's one reason I'm not worried about throwing away its least useful features (hapaxes have weak spamprobs, and hapaxes that haven't been *used* in scoring for N days may as well not have existed at all for the last N days -- and most hapaxes are like that, no matter how big N is). > 3) Do we eventually reduce the occurrence count of a non-hapax token? There are many possible schemes. Strongly storage-conscious schemes only save a byte or two for a count, and periodically shift all the counts right by 1 bit, to prevent overflow. That seems to work very well in systems that do it. I've already said here that I see the primary point of expiring hapaxes as being a means to reduce database size, and in the context of the much more storage-intensive mixed unigram/bigram scheme. Hapaxes can account for the bulk of the storage all by themselves (this isn't unique to spam filtering, btw -- across many kinds of computer text indexing systems, hapaxes typically account for about half the content), and most hapaxes are never seen again. I'm experimenting with a mixed unigram/bigram classifier right now. 
It's been trained on (just) 94 ham and 96 spam so far, but there are already 51,378 features in the database. 45,624 of them are hapaxes -- that's 89%! I could eliminate the rest of the database entirely, and not cut its size enough to care about. This is why picking specifically on hapaxes is a high-value proposition (high potential, low risk). > If we do, we could eventually have none of the tokens from a trained > message present but its message count will still be there. Unless we > implement your token cross-reference as explained below, the message > counts will eventually not be correct if we expire enough tokens. I want to do expiration "correctly". But even if all the tokens from a message expire when the total message count is N, it still doesn't change that counts on tokens that remain were in fact derived from N messages, and so N remains the best possible thing to feed into the spamprob guesses. > If we don't expire a lot of tokens over the long run, why bother? I expect an enormous number of hapaxes to expire, in steady state essentially equaling the rate at which they're created by new messages. In the example above, 90% of the features created for me right now *are* hapaxes. I expect that to drop with more training, but for hapaxes to remain both the single biggest database consumer, and the least valuable tokens to retain. >> ... >> There's another bullet we haven't bitten yet, saving a map of >> message id to an explicit list of all tokens produced by that >> message (Skip wants the inverse of that mapping for diagnostic >> purposes too). 
Given that, training and untraining of individual >> messages could proceed smoothly despite intervening changes in >> tokenization details; expiring entire messages would be >> straightforward; and when expiring an individual feature, it would >> be enough to remove that feature from each msg->[feature] list it's >> in (then untraining on a msg later wouldn't *try* to decrement the >> per-feature count of any feature that had previously been expired >> individually and appeared in the msg at the time). > This definitely works. But why bother tracking, cross-referencing and > expiring individual tokens when we can just expire whole messages, > which is a lot simpler? I doubt that it's simpler at all, and you earlier today sketched quite an elaborate scheme for expiring different messages at different rates. That's got its share of tuning parameters (aka wild-ass guesses ) too, showed every sign of being just the beginning of its brand of complication, and has no testing or experience to support it. We know a lot about the real-life effects of hapaxes now. BTW, the single worst thing you can do with a system of this type is train a message into the wrong category. Everyone does it eventually, and some people can't seem to help but doing it often. Maybe that's a UI problem at heart -- I don't know, because I seem to be unusually resistant to it. It's happened to me too, though, and it can be hard to recover. One sterling use for a feature -> msg_ids map is, as Skip noted, a way to find out *why* your latest spam was a false negative: look at the low-scoring features, then look at the messages with those features that were trained on as ham. This has an excellent shot at pinpointing mis-trained messages. That's difficult at best now, and is a real problem for some people. I've got gigabytes of unused disk space myself . Evolution of this system would also be served by saving an explicit msg_id -> features map.
When we change tokenization to get a small win, sometimes the tokens originally added to a database by training on message M can no longer be reconstructed by re-tokenizing M (the tokenizer has changed! if it always returned exactly what it returned before the change, there wasn't much point to the change ). Blindly untraining anyway can violate database invariants then, eventually manifesting as assertion errors and the need to retrain from scratch. The only clear and simple way to prevent this is to save a map from msg_id to the tokens it originally produced. Then untraining simply walks that list, and nothing can go wrong as a result. That's a bit subtle, so takes some long-term experience to appreciate at a gut level. Of more immediate concern to most users is that only the obsessed *want* to save their spam. Most people want to throw spam away ASAP. But, if they do that, we currently have no way to expire any spam they ever trained on. Moving toward saving msg_ids <-> features maps solves that too, and with suitable reuse of little integers for feature ids can store the relevant bits about trained messages in less space than it takes to save the original messages. Note that hapaxes would waste the most resource in this context too. >> That's all easy enough to do, but the database grows ever bigger. >> It would probably need reworking to start using "feature ids" >> (little integers) too, so that relatively big strings didn't have to >> get duplicated all over the database. > No argument there. How about a 32-bit hash for any token whether > unigram, bi-gram, etc.? The token database could then consist of an > ordered list of 32-bit hashes paired with an occurrence count > (16-bits would probably do it). That's only six bytes/token, and you > could use your indexing method of choice, if any, to speed up the > lookups. We ran experiments on that before, and results were dreadful. 
32-bit hashes have far too high a collision rate on a sizable database (don't forget the Birthday Paradox here!), confusing ham with spam in highly entertaining ways (provided you're just experimenting and don't really care how well it does). An MD5 or SHA-1 hash would be fine, but then it's up to 16 or 20 bytes per feature, and most of the strings we store in the current pure unigram scheme are shorter than that. A 64-bit hash would probably be OK. Another hated (widely in this project, among the developers) consequence of using hash codes is that mining the database for clues is useless then. "Hey, hash code 45485448 is your strongest spam clue!" "Oh -- no wonder, then" . Storing the actual feature strings as plainly as possible is extremely helpful for development, debugging, and research. > Similarly, if we implemented a message database with this method, each > token in a message would only take up four bytes. The hash calculation > costs something, but the smaller database size and quicker lookup time > could make up for it. We're not going to abandon plain strings, because they're far too useful and loved in various reports intended for human consumption. Adding feature_id <-> feature_string maps would allow for effective compression of message storage. From tim.one at comcast.net Thu Dec 18 00:38:55 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 00:38:57 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677773@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > I think part of the Japanese/Asian languages patch which I keep > meaning to look more closely into has these turn into unicode strings > (how many bits is that? I know nothing much about unicode; English > is good enough for me ). > > (Just in case someone was about to implement a new spambayes db > system with only 8-bit tokens). Overall, I'd encourage them in that vice. 
I did all I could to keep SpamBayes neutral across European "Latin-insert-your-favorite-number" languages, except for the non-default Anglocentric replace_nonascii_chars option. That's why I favored split-on-whitespace as the only msg body lexing gimmick (of course it helped a lot that s-o-w did best in tests across all lexing schemes ever tried!); have consistently resisted attempts to add knowledge about "punctuation" (except in header-line contexts, where standards constrain the permitted characters); haven't voiced any support for gimmicks like "map Latin-1 into letters that look more like the ones I'm used to" (but as the replace_nonascii_chars perpetrator, couldn't oppose them in good conscience as options either ); and haven't written a u'' literal anywhere in the source. My belief is that Asian languages are so different in what they would need to do a good job that someone wanting that would be better off forking the project. I really don't want to see masses of deeply different algorithms all slammed into the same codebase, not even if "the cost" were just massively refactoring SpamBayes to add another two layers of expensive indirection. SpamBayes isn't required to be all things to all people. I haven't studied the patch you're talking about, so maybe it's just a one-liner . Alas, I'm aware of it, and have read the patch comments, and the panic above is a fair reflection of my first, second, and third reactions. As to how many bits are in a Unicode string, you don't want to know. "It depends." Pickles store them in an Anglocentric format (UTF-8) that happens to consume exactly the same number of bytes as now if the string consists of just US ASCII characters. The memory burden is much larger, though (Python Unicode string objects are big beasts). 
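The point about UTF-8 storage size can be checked with a few lines. This is a minimal sketch assuming a modern Python 3 interpreter (the thread predates it and was talking about Python 2 unicode objects); the helper name utf8_size and the sample tokens are made up for illustration.

```python
def utf8_size(token):
    """Hypothetical helper: number of bytes needed to store token as UTF-8."""
    return len(token.encode("utf-8"))


# Pure-ASCII tokens cost one byte per character; anything else costs more.
for token in ["viagra", "caf\u00e9", "\u65e5\u672c\u8a9e"]:
    print(repr(token), len(token), utf8_size(token))
```

The ASCII token takes 6 bytes for 6 characters, the accented Latin token takes 5 bytes for 4 characters, and the three-character Japanese token takes 9 bytes. This covers only the serialized size; the in-memory cost of string objects is a separate matter.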
From tim.one at comcast.net Thu Dec 18 00:49:04 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 00:49:05 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <200312180448.hBI4muIJ010785@localhost.localdomain> Message-ID: [Tim] >> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the- >> code-ly y'rs - tim [Anthony Baxter] > Isn't that how I suckered MarkH into working on the Outlook plugin? Yes it is! It's *also* how Barry suckered me into writing the spambayes tokenizer and classifier to begin with -- although he's conveniently forgetting the karmic reciprocal obligation now . From tameyer at ihug.co.nz Thu Dec 18 01:44:11 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Dec 18 01:44:20 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0EE3@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677778@its-xchg4.massey.ac.nz> [Tim] > Note that Jeremy already wrote code to run spambayes via ZEO, > in the project's pspam/ directory. I don't know how much > bitrot that's suffered. Not as much as I had thought. I believe the check-ins I just made get it working again - at least the three main scripts (pop.py, scoremsg.py and update.py) appear to do what they are meant to. It works without socket.AF_UNIX now, too. It is still separate from everything else, of course, but it does work again... =Tony Meyer From barry at python.org Thu Dec 18 02:02:09 2003 From: barry at python.org (Barry Warsaw) Date: Thu Dec 18 02:02:18 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071730928.17717.13.camel@anthem> On Thu, 2003-12-18 at 00:49, Tim Peters wrote: > [Tim] > >> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the- > >> code-ly y'rs - tim > > [Anthony Baxter] > > Isn't that how I suckered MarkH into working on the Outlook plugin?
> > Yes it is! It's *also* how Barry suckered me into writing the spambayes > tokenizer and classifier to begin with -- although he's conveniently > forgetting the karmic reciprocal obligation now . Nope, I'm just younger than you, and old guys are so easily duped. -Barry From anthony at interlink.com.au Thu Dec 18 04:52:35 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Thu Dec 18 04:53:00 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: <200312180952.hBI9qZC6005810@localhost.localdomain> >>> "Tim Peters" wrote > Yes it is! It's *also* how Barry suckered me into writing the spambayes > tokenizer and classifier to begin with -- although he's conveniently > forgetting the karmic reciprocal obligation now . I'm sure this is some sort of standard method of getting things done in the opensource world. Eric Raymond's Cathedral and Bazaar metaphor extends here, of course - in a bazaar often you end up getting suckered. now-who-took-my-wallet, Anthony -- Anthony Baxter It's never too late to have a happy childhood. From skip at pobox.com Thu Dec 18 08:41:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Dec 18 08:41:11 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: References: Message-ID: <16353.44663.34193.301968@montanaro.dyndns.org> Tim> I'm experimenting with a mixed unigram/bigram classifier right now. Tim> It's been trained on (just) 94 ham and 96 spam so far, but there Tim> are already 51,378 features in the database. 45,624 of them are Tim> hapaxes -- that's 89%! Late yesterday afternoon I tweaked my procmailrc file to automatically train on everything which scored as ham or spam. I awoke this morning to a database with 489 spam, 600 ham and 198,747 features, 158,116 of which were hapaxes (80%). At the same time I moved my ham/spam thresholds closer to 0 and 1 to minimize the amount of retraining necessary to counteract false positives and false negatives.
(It's kind of a pain because I'm also saving the messages I train on, so I have to rummage around in a Unix mbox to find incorrectly trained messages.) I train unsures by hand. Still only 16 unsures overnight, but my database is up to 10.5MB, so training and scoring time is on the rise. Bringing it back to this topic, hapax expiration seems like both a worthwhile step to take from space/time considerations, and even less likely to produce problems because I'm training on everything I see. Now if I could only test this setup easily without a huge time investment. Perhaps a few more Emacs keybindings are in order. Tim> BTW, the single worst thing you can do with a system of this type Tim> is train a message into the wrong category. Everyone does it Tim> eventually, and some people can't seem to help but doing it often. :-) Skip From kennypitt at hotmail.com Thu Dec 18 10:01:42 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 18 10:02:23 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071710513.27808.62.camel@anthem> Message-ID: Barry Warsaw wrote: > On Wed, 2003-12-17 at 18:25, Richie Hindle wrote: >> o Since BDB uses bsddb in transaction mode rather than single-file >> mode, we can say goodbye to those nasty little DBRunRecovery >> errors. Yay! > > That's the hope, anyway. :) Unfortunately I don't think we can say that with any confidence until we know why they are occurring in the first place. The following comments apply to using bsddb directly. I'm not familiar with ZODB, and don't know if they are already handling all of these issues. The BerkeleyDB docs say this: """ Errors can occur in the Berkeley DB library where the only solution is to shut down the application and run recovery (for example, if Berkeley DB is unable to allocate heap memory). 
When a fatal error occurs in Berkeley DB, methods will throw a DbRunRecoveryException, at which point all subsequent database calls will also fail in the same way. When this occurs, recovery should be performed. """ (http://www.sleepycat.com/docs/api_cxx/runrec_class.html) This seems to indicate that this problem can be caused by more than just threading problems. It also says this: """ When building transactionally protected applications, there are some special issues that must be considered. The most important one is that if any thread of control exits for any reason while holding Berkeley DB resources, recovery must be performed... """ (http://www.sleepycat.com/docs/ref/transapp/app.html) This seems very clear that using full transactional mode does not protect you from DbRunRecovery errors. I wonder if the only real solution is to run recovery when opening the database. This should be easy for the Outlook add-in, sb_server, etc. where a single, long-running process performs all database access (just specify the DB_RECOVER flag when opening the environment). Running recovery requires that only one thread in one process has access to the database environment until recovery is complete. This might be harder to accomplish for apps such as sb_filter that could be run as multiple simultaneous processes. -- Kenny Pitt From barry at python.org Thu Dec 18 10:10:22 2003 From: barry at python.org (Barry Warsaw) Date: Thu Dec 18 10:10:29 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071760220.26140.3.camel@anthem> On Thu, 2003-12-18 at 10:01, Kenny Pitt wrote: > This seems very clear that using full transactional mode does not > protect you from DbRunRecovery errors. True, but it makes them rarer. > I wonder if the only real solution is to run recovery when opening the > database. This should be easy for the Outlook add-in, sb_server, etc. 
> where a single, long-running process performs all database access (just > specify the DB_RECOVER flag when opening the environment). Running > recovery requires that only one thread in one process has access to the > database environment until recovery is complete. This might be harder > to accomplish for apps such as sb_filter that could be run as multiple > simultaneous processes. In all the BDB apps I've written, I always pass the DB_RECOVER flag to the open call. Except for coordinating the above, it's harmless if recovery doesn't need to happen. -Barry From kennypitt at hotmail.com Thu Dec 18 10:16:51 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 18 10:17:27 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071711060.27808.67.camel@anthem> Message-ID: Barry Warsaw wrote: > On Wed, 2003-12-17 at 20:16, Tim Peters wrote: > >> Having never written anything myself using bsddb3's "real" >> interface, I can't say how hard that would be. I *expect* it would >> actually be easy for someone with a non-trivial understanding of >> BDB. The only use we have for BDB now is to use it as if it were a >> giant dict -- it probably doesn't get any simpler than that. > > If you map all square-bracket setitems to .put()'s and square-bracket > getitems to .get()'s, it's fairly straightforward. That is, provided > you can define the transaction boundaries so you can call txn begin, > abort, and commit at the Right Times. You will want to pass the BDB > txn object into the .gets and .puts to make it all work smoothly. > Add a little extra goo to create the environment if it doesn't exist > (or join it if it does), and viola! or contrabasso! The bsddb package includes a dbshelve module that handles all the required dictionary access methods to provide compatibility with standard shelve functionality. It also allows specifying the DB_ENV when opening the database. 
The only thing it doesn't seem to handle is transactions, but I'm not convinced we need that. Transactions are only really important if you are updating several related entries, and need to be able to rollback the whole lot if any one of them fails. There are some points in SpamBayes that could be reworked to use transactions (e.g. rollback all token count updates for a single message if we can't update them all), but I don't think that has anything to do with the DbRunRecovery errors. The important thing re our suspected cause would be the multi-thread and multi-process locking, and that can be used independently of transactions. -- Kenny Pitt From tim.one at comcast.net Thu Dec 18 10:21:15 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 10:21:13 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <1071760220.26140.3.camel@anthem> Message-ID: [Barry] > ... > In all the BDB apps I've written, I always pass the DB_RECOVER flag to > the open call. Except for coordinating the above, it's harmless if > recovery doesn't need to happen. OTOH, don't you also do some *seemingly* senseless dance (revealed to you by a SleepyCat guy) involving back-to-back checkpoints so that the next harmless recovery doesn't take forever not to do any harm ? That's probably all wrong, but it might jog your memory. From barry at python.org Thu Dec 18 10:29:32 2003 From: barry at python.org (Barry Warsaw) Date: Thu Dec 18 10:29:40 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: <1071761371.26140.7.camel@anthem> On Thu, 2003-12-18 at 10:21, Tim Peters wrote: > [Barry] > > ... > > In all the BDB apps I've written, I always pass the DB_RECOVER flag to > > the open call. Except for coordinating the above, it's harmless if > > recovery doesn't need to happen. 
> > OTOH, don't you also do some *seemingly* senseless dance (revealed to you by > a SleepyCat guy) involving back-to-back checkpoints so that the next > harmless recovery doesn't take forever not to do any harm ? That's why the Sleepycat guy told me to checkpoint occasionally (the BDB storages do that in a thread), /and/ to force a checkpoint twice before closing the database. I'm sure the latter is mostly voodoo, but our faith gives us the strength of conviction. -Barry From kennypitt at hotmail.com Thu Dec 18 12:34:32 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 18 12:35:08 2003 Subject: [spambayes-dev] Broken link on website Message-ID: Just discovered a (partially) broken link on the website. On the Windows page (http://spambayes.sourceforge.net/windows.html) in the "Non Outlook Solutions" section, the POP3 link goes to the correct page but the wrong bookmark. The link references a bookmark of "#pop3", but the anchor tag on the destination page now uses the bookmark name "sb_server". -- Kenny Pitt From skip at pobox.com Thu Dec 18 12:56:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Dec 18 12:56:50 2003 Subject: [spambayes-dev] Broken link on website In-Reply-To: References: Message-ID: <16353.60003.407697.88323@montanaro.dyndns.org> Kenny> ... the POP3 link goes to the correct page but the wrong Kenny> bookmark. The link references a bookmark of "#pop3", but the Kenny> anchor tag on the destination page now uses the bookmark name Kenny> "sb_server". Thanks. Should be fixed. Skip From tim.one at comcast.net Thu Dec 18 14:09:09 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 14:09:07 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Kenny Pitt] > The bsddb package includes a dbshelve module that handles all the > required dictionary access methods to provide compatibility with > standard shelve functionality. 
It also allows specifying the DB_ENV > when opening the database. Speaking of which, 4 of the test_bsddb3.py tests fail on Win98SE with the soon-to-be-released Python 2.3.3 (which is at least as well as that test has ever done on that platform). The 4 failing tests all exercise the dbshelve module: ERROR: test01_basics (bsddb.test.test_dbshelve.EnvBTreeShelveTestCase) ERROR: test01_basics (bsddb.test.test_dbshelve.EnvHashShelveTestCase) ERROR: test01_basics (bsddb.test.test_dbshelve.EnvThreadBTreeShelveTestCase) ERROR: test01_basics (bsddb.test.test_dbshelve.EnvThreadHashShelveTestCase) and all die with the same traceback and error: Traceback (most recent call last): File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 75, in test01_basics self.do_open() File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 238, in do_open self.env.open(homeDir, self.envflags | db.DB_INIT_MPOOL | db.DB_CREATE) DBAgainError: (11, 'Resource temporarily unavailable -- unable to join the environment') If that isn't just an artifact of something else the test suite is doing, it's enough to kill the idea of using dbshelve on Windows. > The only thing it doesn't seem to handle is transactions, but I'm not > convinced we need that. > > Transactions are only really important if you are updating several > related entries, and need to be able to rollback the whole lot if any > one of them fails. I expect a transaction commit supplies a natural and useful boundary for doing a database checkpoint operation (see earlier email w/ Barry; making frequent checkpoints is probably important so that running recovery when the database is opened runs quickly). > ... > The important thing re our suspected cause would be the multi-thread > and multi-process locking, and that can be used independently of > transactions. Gregory Smith found and fixed several bugs in the bsddb3 use-it-like-a-dict wrappers we've *been* using, all related to concurrent access. 
Unfortunately, it doesn't look like anyone backported those fixes for the Python 2.3 release (the last few checkins only exist on the trunk, which is Python 2.4 development). Given the history of bsddb3 support so far, I think we'll be best off using the Berkeley-native APIs as directly as possible, avoiding "convenience wrappers" like the plague. Very little of our code interacts with the database directly, and bugs in those wrappers have probably caused hundreds of times more hours of bug-chasing than would have been required to write a few extra lines of lower-level code. Of course, using the Berkeley-native API directly should run faster too, but I don't hold that against it *too* much. From tim.one at comcast.net Thu Dec 18 14:46:16 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 14:46:16 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk emailfolder. In-Reply-To: <009e01c3c493$2489c0b0$2c00a8c0@eden> Message-ID: [Tim] >> I wonder whether the Outlook addin should stop trying to remember >> Outlook's internal folder IDs, remember the user-visible string >> paths instead, and enumerate the folders to (re)discover the >> internal Outlook IDs "whenever anything may have changed". [Mark Hammond] > I'm not sure what you had in mind for "anything may have changed", In the limit, I suppose that means finding the folder object again from scratch every time a folder object is needed. Anything else is just optimization. > but in general, I agree. I always had the idea that we would also > store the FQN, and fall back to that when necessary, making the > folder ID more a "cached" value. It just never happened. It does > get complex though - what happens when the user renames the folder? > Before you know it, we have even more cruft that no one really > understands why it is there It's a part of Outlook's model that doesn't make sense to people.
When my sister, for example, renames or moves a folder holding Word documents, she'd be baffled if Word *did* magically notice this. The idea that a data object is accessed by, and only by, its current "string path", has been beat into her by Explorer, by all the other Office programs, and-- for that matter --by all other programs she uses too. Outlook is the oddball exception here, but only wrt its internal objects. I remember how surprised *I* was the first time I changed the name of a folder that was the "move to" target of an Outlook rule, and the rule kept moving things into the renamed folder; I had expected Outlook to screw up, or at best to pop up an error box the next time the rule conditions fired, telling me the rule no longer made sense. That would have been fine by me. So we can play along with Outlook's model, and have 99% of our users wonder why SpamBayes deletes all their email , or match the mental model everyone (except Outlook experts) has from the start. I agree they're deeply incompatible, but each gives a clear answer to questions like "what happens when the user renames the folder?". > Another alternative would be to change things so that most errors > re-displayed the config wizard. I'm not sure how that could help. The problem people have now is that there *aren't* any "hard" errors after they delete their Spam or Unsure folders by mistake -- SpamBayes plays along with Outlook's "the name and path are irrelevant, I still know where the *object* is", and users have no idea Outlook works that way. What they do instead is create a new Spam folder directly under Personal Folders, and then are baffled again because that doesn't work either (for the same reason -- they're thinking of string name and path, not Outlook object identity). > ... 
> Either way, I'm going for a new combined binary before this even gets > a look in +1 From kennypitt at hotmail.com Thu Dec 18 14:53:08 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 18 14:53:45 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt] >> The bsddb package includes a dbshelve module that handles all the >> required dictionary access methods to provide compatibility with >> standard shelve functionality. It also allows specifying the DB_ENV >> when opening the database. > > Speaking of which, 4 of the test_bsddb3.py tests fail on Win98SE with > the soon-to-be-released Python 2.3.3 (which is at least as well as > that test has ever done on that platform). The 4 failing tests all > exercise the dbshelve module: > > ERROR: test01_basics (bsddb.test.test_dbshelve.EnvBTreeShelveTestCase) > ERROR: test01_basics (bsddb.test.test_dbshelve.EnvHashShelveTestCase) > ERROR: test01_basics > (bsddb.test.test_dbshelve.EnvThreadBTreeShelveTestCase) ERROR: > test01_basics (bsddb.test.test_dbshelve.EnvThreadHashShelveTestCase) > > and all die with the same traceback and error: > > Traceback (most recent call last): > File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 75, in > test01_basics > self.do_open() > File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 238, in > do_open self.env.open(homeDir, self.envflags | db.DB_INIT_MPOOL | > db.DB_CREATE) DBAgainError: (11, 'Resource temporarily unavailable -- > unable to join the environment') > > If that isn't just an artifact of something else the test suite is > doing, it's enough to kill the idea of using dbshelve on Windows. Notice that the exception is occurring in test_dbshelve itself, not in the dbshelve module. I have a sneaking suspicion that this would be a general problem with bsddb on Win98, and not just if we used dbshelve. 
dbshelve doesn't do a whole lot besides pickling and unpickling the item values before calling the direct API. The "Windows Notes" page in the Berkeley DB docs says: """ On Windows/9X, files opened by multiple processes do not share data correctly. For this reason, the DB_SYSTEM_MEM flag is implied for any application that does not specify the DB_PRIVATE flag, causing the system paging file to be used for sharing data. """ (http://www.sleepycat.com/docs/ref/build_win/notes.html) Possibly related? I assume DBAgainError maps to error code EAGAIN in the C API, which means "The shared memory region was locked and (repeatedly) unavailable." Maybe something isn't getting released properly from the earlier non-environment tests. -- Kenny Pitt From tim.one at comcast.net Thu Dec 18 15:45:46 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 15:45:44 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Kenny Pitt] > ... > > Notice that the exception is occurring in test_dbshelve itself, not in > the dbshelve module. I have a sneaking suspicion that this would be a > general problem with bsddb on Win98, and not just if we used dbshelve. > dbshelve doesn't do a whole lot besides pickling and unpickling the > item values before calling the direct API. > > The "Windows Notes" page in the Berkeley DB docs says: Sorry, I can't make more time for this. The Berkeley docs don't mention any Berkeley bugs on Win9x, just cautions. The Berkeley wrappers "we" (Python) wrote have had a miserable history on Windows, which is why I'll just repeat that we'll be better off avoiding "our" convenience wrappers like the plague, sticking as close to base Berkeley as possible. This isn't getting better. 
The bsddb3 tests on Win98SE on the current Python CVS trunk suffer 6 errors and 48 failures, up from 4 errors and 0 failures under 2.3.3: Gregory's attempts to "fix the wrappers" have actually made things much worse on Win98SE, and this can't be pinned on Sleepycat because the Sleepycat distro in use hasn't changed. OTOH, the ZODB test suite exercises ZODB-on-BDB, which is also coded in Python, and is actually more likely to fail on Linux than on Win98SE these days (presumably due to thread-race bugs in our test setup). Barry coded ZODB-on-BDB using the Berkeley-native API, and *that* hasn't given us any headaches on any flavor of Windows. From kennypitt at hotmail.com Thu Dec 18 16:32:15 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Dec 18 16:33:00 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt] >> ... I have a sneaking suspicion that this would >> be a general problem with bsddb on Win98, and not just if we used >> dbshelve. dbshelve doesn't do a whole lot besides pickling and >> unpickling the item values before calling the direct API. > > Sorry, I can't make more time for this. Understood. Just one more quick question if you would, since I think I may have misunderstood where you were coming from. > ... The Berkeley > wrappers "we" (Python) wrote have had a miserable history on Windows, > which is why I'll just repeat that we'll be better off avoiding "our" > convenience wrappers like the plague, sticking as close to base > Berkeley as possible. > > [snip] > > OTOH, the ZODB test suite exercises ZODB-on-BDB, which is also coded > in Python, and is actually more likely to fail on Linux than on > Win98SE these days (presumably due to thread-race bugs in our test > setup). Barry coded ZODB-on-BDB using the Berkeley-native API, and > *that* hasn't given us any headaches on any flavor of Windows. 
So are you recommending that we avoid using the whole _bsddb.pyd binary package? I originally thought you were referring only to the Python-coded wrappers. -- Kenny Pitt From tameyer at ihug.co.nz Thu Dec 18 17:30:23 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Dec 18 17:30:32 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junkemailfolder. In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C11A2@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A10@its-xchg4.massey.ac.nz> [Mark] > It does get complex though - what happens when the user renames > the folder? [Tim] > It's a part of Outlook's model that doesn't make sense to > people. When my sister, for example, renames or moves a > folder holding Word documents, she'd be baffled if Word *did* > magically notice this. The idea that a data object is > accessed by, and only by, its current "string path", has been > beat into her by Explorer, by all the other Office programs, > and-- for that matter --by all other programs she uses too. This surprised me - I've usually found people don't understand why Explorer et al can't keep track of things when they are renamed. I think maybe the difference is that most of these people started out using Macs rather than Windows, and Macs have always managed this better (compare aliases to shortcuts, for example). Of course, the Outlook plug-in is for Windows users, so that's not all that relevant ;) +1 to switching to names in a release at some point in the future. How often do Outlook 'experts' change the names of their spam/unsure folders anyway? =Tony Meyer From tim.one at comcast.net Thu Dec 18 18:04:21 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 18:04:21 2003 Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junkemailfolder. 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A10@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > This surprised me - I've usually found people don't understand why > Explorer et al can't keep track of things when they are renamed. I > think maybe the difference is that most of these people started out > using Macs rather than Windows, and Macs have always managed this > better (compare aliases to shortcuts, for example). I said Windows programs "beat it into her" because it wasn't natural at first. But I'm not sure if anything explainable could have captured initial expectations -- I think people quickly forget how overwhelmingly complicated most GUIs are at first glance. Hell, there are still things I did in Visual Studio 5 years ago that I've never been able to find again <0.5 wink>. > ... > +1 to switching to names in a release at some point in the future. I'm not voting yet -- still something to mull over. I guess that's +0, then. > How often do Outlook 'experts' change the names of their spam/unsure > folders anyway? Not often. I did several times at the start, while working out training strategies I could live with. There's an inconsistency here: if you move an Outlook folder by dragging it to a *different* .pst file in your Folder List display, *then* Outlook rules (and SpamBayes) lose track of it entirely. Any rules that reference it turn themselves off. So sometimes Outlook makes you reselect the folder after a change, and sometimes it doesn't. I can guarandamntee that my sisters will never grow a mental model of "OK, folders in Outlook work by object identity, not name, but object identity is relative to the containing .pst file". From nobody at spamcop.net Thu Dec 18 18:41:04 2003 From: nobody at spamcop.net (Seth Goodman) Date: Thu Dec 18 18:41:06 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: Tim, Thanks for taking the time to construct such a complete set of answers.
I learned a lot from it and I assume other list readers did as well. > > [Seth Goodman] > > If we do, we could eventually have none of the tokens from a trained > > message present but its message count will still be there. Unless we > > implement your token cross-reference as explained below, the message > > counts will eventually not be correct if we expire enough tokens. > > [Tim Peters] > I want to do expiration "correctly". But even if all the tokens from a > message expire when the total message count is N, it still doesn't change > that counts on tokens that remain were in fact derived from N > messages, and > so N remains the best possible thing to feed into the spamprob guesses. Not really. If you decrement all the token counts from a trained message, the database is in the exact same state as it was before you trained on that message (ignoring subsequent messages trained). At that point, the trained message count was N-1, so that is the best thing to use for the probability calculation rather than N. The message count will keep increasing as you train new messages but the token database will eventually level off. That suggests that the trained message counts will become too large as time goes on. If you only expire hapaxes, perhaps the incorrect message count is a technicality and won't have a significant effect on the spam probabilities. But unless you expire non-hapaxes as well, the token database can't track a changing message stream very well. Once you start expiring non-hapax tokens (is there a name for these?), my guess is that you can no longer ignore the incorrect message count issue. So how _do_ you do expiration "correctly" if not by whole messages? > >> [Tim Peters] > >> ... > >> There's another bullet we haven't bitten yet, saving a map of > >> message id to an explicit list of all tokens produced by that > >> message (Skip wants the inverse of that mapping for diagnostic > >> purposes too). 
Given that, training and untraining of individual > >> messages could proceed smoothly despite intervening changes in > >> tokenization details; expiring entire messages would be > >> straightforward; and when expiring an individual feature, it would > >> be enough to remove that feature from each msg->[feature] list it's > >> in (then untraining on a msg later wouldn't *try* to decrement the > >> per-feature count of any feature that had previously been expired > >> individually and appeared in the msg at the time). > > > [Seth Goodman] > > This definitely works. But why bother tracking, cross-referencing and > > expiring individual tokens when we can just expire whole messages, > > which is a lot simpler? > > [Tim Peters] > I doubt that it's simpler at all, and you earlier today sketched quite an > elaborate scheme for expiring different messages at different > rates. That's > got its share of tuning parameters (aka wild-ass guesses ) > too, showed > every sign of being just the beginning of its brand of > complication, and has > no testing or experience to support it. We know a lot about the real-life > effects of hapaxes now. Offhand, adding a single timestamp per message at training time sounds easier than tracking the last time seen for every token in the database. As far as the "elaborate" scheme I suggested for variable expiration times, all that's involved is changing the message timestamp before storing it. Since you don't have anything like that now, you can just ignore that idea and the extra parameter that goes with it. BTW, that parameter value is not just a wild-ass guess, it's a SWAG (sophisticated wild-ass guess), and I don't like them any better than you do :) Either way, rather than frequently searching for expired tokens (in a very long list), you would only do token expiration when you have to train a new message. At that point, you find the oldest trained message (from a much shorter list) and untrain it. 
The extra complication is storing the token list with each message ID plus its training timestamp. That doesn't sound big compared to cross-referencing every token to every message it appeared in. They're certainly not mutually exclusive and you later made a good argument for having this extra information anyway. > [Tim Peters] > BTW, the single worst thing you can do with a system of this type > is train a > message into the wrong category. Everyone does it eventually, and some > people can't seem to help but doing it often. Maybe that's a UI > problem at > heart -- I don't know, because I seem to be unusually resistant > to it. It's I agree completely. This was an important motivation for expiring a whole message at a time. Training mistakes would eventually drop out of the database without user intervention. Not that a tool to help track down training mistakes wouldn't be great, but a "casual" user could still make occasional mistakes and the system would recover by itself. > [Tim Peters] > happened to me too, though, and it can be hard to recover. One > sterling use > for a feature -> msg_ids map is, as Skip noted, a way to find out > *why* your > latest spam was a false negative: look at the low-scoring features, then > look at the messages with those features that were trained on as > ham. This > has an excellent shot at pinpointing mis-trained messages. > That's difficult > at best now, and is a real problem for some people. I've got gigabytes of > unused disk space myself . No argument there, it's a great feature for problem-solving. > [Tim Peters] > Evolution of this system would also be served by saving an > explicit msg_id -> > features map. When we change tokenization to get a small win, > sometimes the > tokens originally added to a database by training on message M > can no longer > be reconstructed by re-tokenizing M (the tokenizer has changed!
if it > always returned exactly what it returned before the change, there wasn't > much point to the change ). Blindly untraining anyway can violate > database invariants then, eventually manifesting as assertion > errors and the > need to retrain from scratch. The only clear and simple way to > prevent this > is to save a map from msg_id to the tokens it originally produced. Then > untraining simply walks that list, and nothing can go wrong as a result. I agree completely and that's why I suggested saving the token list with each message. Your feature_ID scheme makes it practical. > [Tim Peters] > That's a bit subtle, so takes some long-term experience to appreciate at a > gut level. Of more immediate concern to most users is that only the > obsessed *want* to save their spam. Most people want to throw spam away > ASAP. But, if they do that, we currently have no way to expire any spam > they ever trained on. Moving toward saving msg_ids <-> features > maps solves > that too, and with suitable reuse of little integers for feature ids can > store the relevant bits about trained messages in less space than it takes > to save the original messages. Note that hapaxes would waste the most > resource in this context too. Sounds like _you're_ arguing for expiration of whole messages :) I know you're not arguing that, but if there were bidirectional msg_id <-> feature_ID maps, it would be fairly easy to expire whole messages. That would obviate the need to track last time seen for every token. In any case, I hope you move in the direction of saving such maps as it adds so much flexibility. > [Tim Peters] > We're not going to abandon plain strings, because they're far too > useful and > loved in various reports intended for human consumption. Adding > feature_id > <-> feature_string maps would allow for effective compression of message > storage. All your arguments on this point make lots of sense. 
I'm a little surprised that you had significant collisions mapping perhaps 100K items (my guess) into a 32-bit space. I think that is rather dependent on the hash used, but that's what you saw. Since you need the cleartext anyway, your feature-ID concept is far superior. Thanks for educating me. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From rmalayter at bai.org Thu Dec 18 18:57:11 2003 From: rmalayter at bai.org (Ryan Malayter) Date: Thu Dec 18 18:57:19 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? Message-ID: <792DE28E91F6EA42B4663AE761C41C2A01A75280@cliff.bai.org> {Seth Goodman} > All your arguments on this point make lots of sense. > I'm a little surprised that you had significant > collisions mapping perhaps 100K items (my guess) > into a 32-bit space. I think that is rather dependent > on the hash used, but that's what you saw. That's not surprising at all to me. Because of the "birthday paradox", even very input-sensitive (random-looking) hash functions like the 160-bit SHA-1 only give 80 bits of collision resistance. With a 32-bit perfect hash, you get just 16 bits of collision resistance. That means there is a 50% chance of a collision if you hash just 65,536 items. Hash more items than that, and your chances of collision go up further. If your hash function isn't perfectly (randomly) distributed in the 32-bit space, things could be much worse with 100,000 hashes in a collection. I would suggest storing at least a 64-bit hash; perhaps the first 8 bytes of an SHA-1 or MD5 hash would be appropriate. There exists good optimized code for both algorithms in the public domain.
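For illustration only, taking the first 8 bytes of a SHA-1 digest might look like the sketch below. The function name token_hash64 is hypothetical, and the UTF-8 encoding of the token is an assumption; nothing here is taken from the SpamBayes code.

```python
import hashlib


def token_hash64(token):
    """Return a 64-bit integer built from the first 8 bytes of the token's SHA-1.

    Assumes the token is a text string and encodes it as UTF-8 before hashing.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

The result fits in 8 bytes per token, at the cost of no longer being able to recover the original string from the stored value.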
Regards, Ryan From tim.one at comcast.net Thu Dec 18 19:09:45 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 19:09:44 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Kenny Pitt] > So are you recommending that we avoid using the whole _bsddb.pyd > binary package? I originally thought you were referring only to the > Python-coded wrappers. We need _bsddb.pyd to use a modern BDB at all from within Python on Windows. I'm not familiar with what all is in that DLL, but there certainly aren't any Python-coded wrappers in it. It's about 5000 lines worth of compiled C code, and wraps Sleepycat's C API so that it can be called from Python. Most of it consists of short wrapper functions calling Sleepycat functions with similar names, just converting raw C bits to and from Python objects at the boundaries ... Hmm. Maybe you're not aware of this: http://pybsddb.sourceforge.net/bsddb3.html That's effectively the *real* documentation for Python's modern Berkeley module. It hasn't been folded into the Python doc set yet. _bsddb.pyd is intended to provide quite direct ways of calling the native Sleepycat API functions. The *Python* docs never get around to explaining that, so if that's what you've been looking at, you've been looking at old docs describing lots of "legacy tricks" last updated for Berkeley 1.85, around the time of the last asteroid-induced mass extinction event . I'm suggesting avoiding the legacy tricks, avoiding the slow & buggy stuff trying to make current BDB "act just like a magical Python dict", and writing as directly as possible to Sleepycat's C API (which, I hasten to add, is much easier from Python than from C). Barry will testify it's not that bad, and since he's actually done it in a much more difficult (wrt database demands) project, I believe him.
Sleepycat BDB has its own idioms and rhythms, and I'd like us to try to play along with them instead of fighting, or trying to hide, them. From nobody at spamcop.net Thu Dec 18 20:25:16 2003 From: nobody at spamcop.net (Seth Goodman) Date: Thu Dec 18 20:25:25 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A01A75280@cliff.bai.org> Message-ID: > {Seth Goodman} > > All your arguments on this point make lots of sense. > > I'm a little surprised that you had significant > > collisions mapping perhaps 100K items (my guess) > > into a 32-bit space. I think that is rather dependent > > on the hash used, but that's what you saw. > > [Ryan Malayter] > That's not surprising at all to me. Because of the "birthday paradox", > even very input-sensitive (random-looking) hash functions like the > 160-bit SHA-1 only give 80 bits of collision resistance. With a 32 bit > perfect hash, you get just 16 bits of collision resistance. That means > there is a 50% chance of a collision if you hash just 65,536 items. Hash > more items than that, and your chances of collision go up further. > > If your hash function isn't perfectly (randomly) distributed in the > 32-bit space, things could be much worse with 100,000 hashes in a > collection. As I understand it, the birthday paradox leads to the conclusion that for a 32-bit perfect hash function, after hashing around 78,000 items (just over 16-bits worth), you are likely to experience a _single_ collision. What Tim described sounded like they probably had multiple collisions to account for the spectacular failures they saw. I don't know the size of the token databases they dealt with back then, but I doubt a single collision in a token list of 78K items would affect the classifier. Since most of the tokens are hapaxes anyway (perhaps 80-90% ?), it is most probable that there would be no visible effect. 
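As a rough check of these birthday-paradox figures, here is a small sketch. It assumes an ideal, uniformly distributed hash and uses the standard approximation P = 1 - exp(-n(n-1)/(2N)) for n items hashed into N buckets; the function names are made up for illustration.

```python
import math


def collision_probability(n_items, hash_bits):
    """Approximate chance of at least one collision among n_items ideal hashes."""
    buckets = 2.0 ** hash_bits
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * buckets))


def items_for_probability(p, hash_bits):
    """Approximate number of items at which the collision chance reaches p."""
    buckets = 2.0 ** hash_bits
    return math.sqrt(2.0 * buckets * math.log(1.0 / (1.0 - p)))


def expected_colliding_pairs(n_items, hash_bits):
    """Expected number of pairs of items that share a hash value."""
    return n_items * (n_items - 1) / (2.0 * 2.0 ** hash_bits)


if __name__ == "__main__":
    print(collision_probability(65536, 32))      # chance of a collision at 2**16 items
    print(items_for_probability(0.5, 32))        # items needed for an even chance
    print(expected_colliding_pairs(100000, 32))  # colliding pairs among 100K tokens
    print(collision_probability(100000, 64))     # same 100K tokens, 64-bit hash
```

Under these assumptions the even-chance point for a 32-bit hash comes out at roughly 77,000 items, the chance of any collision at 65,536 items is a little under 40%, and 100K tokens would be expected to produce only about one colliding pair.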
You are of course correct that going over the 78K-item limit would give more collisions, but it would take quite a few collisions for one of the colliding tokens to be something other than a hapax. I am guessing that unless there were a lot more than 100K tokens, the 32-bit hash function used probably didn't do as good a randomizing job as needed. Since they ultimately had to construct a map of hash_value <-> token_string, they could have detected collisions (check the token already stored with the hash value) and done something about it (i.e. use next empty bucket). Since this would be a rare event, it wouldn't have cost much. In any case, Tim's idea of a mapping token_string <-> feature_ID (i.e. sequentially allocated number with "wrap-around") sounds much simpler. However, it is important that the number has enough bits that previously allocated feature_ID's are ready to be reused (their tokens expired) by the time the allocation number "wraps around" to them. This just means that the number should probably be 32 bits. Assuming you generate 100K tokens per day, the wrap-around time for a 32-bit number is 117 years. For a 24-bit number and the same rate of token production, the wrap-around time is 167 days (around 5.5 months). I'd go for the 32-bit number and not worry about pathological operating schemes or new tokenizers. Even at 1 million new tokens per day, the wrap-around time for a 32-bit feature_ID is over 10 years. Why hash when you can sequentially allocate? This was just a bad idea on my part. And it won't be the last one :) -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From tim.one at comcast.net Thu Dec 18 20:26:43 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 20:26:49 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Seth Goodman] > Thanks for taking the time to construct such a complete set of > answers.
I learned a lot from it and I assume other list readers did > as well. My pleasure, but I'm afraid it was taken out of sleep time, and I can't do that again. So, no offense intended, I have to be very brief here, while wanting to do more: > Not really. If you decrement all the token counts from a trained > message, the database is in the exact same state as it was before you > trained on that message (ignoring subsequent messages trained). At > that point, the trained message count was N-1, so that is the best > thing to use for the probability calculation rather than N. The > message count will keep increasing as you train new messages but the > token database will eventually level off. That suggests that the > trained message counts will become too large as time goes on. > > If you only expire hapaxes, perhaps the incorrect message count is a > technicality and won't have a significant effect on the spam > probabilities. But unless you expire non-hapaxes as well, the token > database can't track a changing message stream very well. Once you > start expiring non-hapax tokens (is there a name for these?), my > guess is that you can no longer ignore the incorrect message count > issue. So how _do_ you do expiration "correctly" if not by whole > messages? I only intend to expire hapaxes for now, with whole-msg expiration after; but one thing at a time, and each step will take a long time for testing. There's no rush. The idea that all the tokens in a message could get expired seems too implausible to me to worry about, when only hapaxes are expired. ... > Offhand, adding a single timestamp per message at training time sounds > easier than tracking the last time seen for every token in the > database. As far as the "elaborate" scheme I suggested for variable > expiration times, all that's involved is changing the message > timestamp before storing it. 
Since you don't have anything like that > now, you can just ignore that idea and the extra parameter that goes > with it. BTW, that parameter value is not just a wild-ass guess, > it's a SWAG (sophisticated wild-ass guess), and I don't like them any > better than you do :) > > Either way, rather than frequently searching for expired tokens (in a > very long list), you would only do token expiration when you have to > train a new message. At that point, you find the oldest trained > message (from a much shorter list) and untrain it. The extra > complication is storing the token list with each message ID plus its > training timestamp. That doesn't sound big compared to cross > referencing every token to every message it appeared in. They're > certainly not mutually exclusive and you later made a good argument > for having this extra information anyway. There are messages I never want to expire. That creates major new UI headaches to be doable. I believe (but don't yet know) that expiring hapaxes can be done without need for user intervention, and without harm. At some point, if you want to try your ideas, *try* your ideas -- that's what Open Source is all about. Everyone is born knowing how to program in Python, although most don't realize it until they try. ... > I agree completely. This was an important motivation for expiring a > whole message at a time. Training mistakes would eventually drop out > of the database without user intervention. Not that a tool to help > track down training mistakes wouldn't be great, but a "casual" user > could still make occasional mistakes and the system would recover by > itself. Without intervention, it will also expire the screaming bright-red HTML birthday message sent by my favorite 7-year-old niece, and when she's 8 the next one may get tagged as spam. These are the kinds of messages I never want to expire. 
"Elaborate" before referred to untested gimmicks for adjusting expiration date based on "how far away" a message was from its correct classification, etc. I don't have a feel for whether that can be made to work well in real life, and it needs serious implementation effort and testing to get a good feel. In the vanishingly small time I can still make for this project, I need to give it to things my experience suggests will almost certainly win with no more effort or surprises than I already know they require enduring. ... > Sounds like _you're_ arguing for expiration of whole messages :) Oh yes, I do want that -- eventually. We have no experience with that in this project, though; we have a lot of experience with the consequences of hapaxes, and I have no fears remaining about picking on them. > I know you're not arguing that, but if there were bidirectional msg_id > <-> feature_ID maps, it would be fairly easy to expire whole > messages. Yes, and that's a real attraction. Doing the actual expiration would be trivially easy and fast then. Deciding *when* to do expiration, and of which messages, are the things we really don't know anything about yet. > That would obviate the need to track last time seen for every token. Only if you don't want also to be able to expire tokens on their own. > In any case, I hope you move in the direction of saving such maps as > it adds so much flexibility. Not to mention database size . ... > All your arguments on this point make lots of sense. I'm a little > surprised that you had significant collisions mapping perhaps 100K > items (my guess) into a 32-bit space. That would be a very small database for the mixed unigram-bigram scheme, and the unigram-only database I used most often in original testing (for filtering high-volume tech mailing lists) contained about 350K tokens. As Ryan explained later, the Birthday Paradox can't be avoided here, and has real consequences. 
> I think that is rather dependent on the hash used, but that's what > you saw. I used Python's builtin 32-bit hash() function, and the observed collision rate was indistinguishable from what a truly random 32-bit hash would have produced (about one standard deviation lower). The damnable thing is that you only need one extremely unfortunate collision to start seeing results that are incomprehensible to the human eye. > Since you need the cleartext anyway, your feature-ID concept is far > superior. We don't *need* the cleartext, really, it's just highly desirable. I'll certainly endure a lot to keep the cleartext. If this isn't the smallest or fastest spam filter possible, I don't really care. I don't even care whether it's popular. What I care about most is whether it filters my damn spam. > Thanks for educating me. Don't mistake a lecture for education . I'd love to be able to afford the luxury of *discussing* it with you instead (you've got a lot of plausible ideas and express them well), but afraid I just can't. With any luck, maybe my employer will go out of business . From listsub at wickedgrey.com Thu Dec 18 20:43:31 2003 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Thu Dec 18 20:44:10 2003 Subject: [spambayes-dev] Hapaxes? (was: How low can you go?) References: Message-ID: <3FE257C3.1020800@wickedgrey.com> Tim Peters wrote: > > Oh yes, I do want that -- eventually. We have no experience with that in > this project, though; we have a lot of experience with the consequences of > hapaxes, and I have no fears remaining about picking on them. Does anyone have any gentle nudges to information explaining what hapaxes are in a spambayes context? I _think_ I've canvased the website, some of the docs that come with the install (enough to get it running under Linux), etc., but I don't recall seeing anything about them. 
I'm not afraid to Use the Source, Luke (but not while I'm here at work ;), so just a pointer to a file name would probably be enough (if there's a hapax.py I'm gonna feel silly - I'll check when I get home ;). Thanks! Eli From tameyer at ihug.co.nz Thu Dec 18 20:52:50 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Dec 18 20:52:56 2003 Subject: [spambayes-dev] Hapaxes? (was: How low can you go?) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C1287@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467777E@its-xchg4.massey.ac.nz> > Does anyone have any gentle nudges to information explaining what > hapaxes are in a spambayes context? Wrt SpamBayes, 'word' is a token, and 'corpus' is the token database. Is this enough information? =Tony Meyer From tim.one at comcast.net Thu Dec 18 21:16:25 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 18 21:16:28 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Ryan Malayter] >> That's not surprising at all to me. Because of the "birthday >> paradox", ... [Seth Goodman] > As I understand it, the birthday paradox leads to the conclusion that > for a 32-bit perfect hash function, after hashing around 78,000 items > (just over 16-bits worth), you are likely to experience a _single_ > collision. What Tim described sounded like they probably had > multiple collisions to account for the spectacular failures they saw. > I don't know the size of the token databases they dealt with back > then, but I doubt a single collision in a token list of 78K items > would affect the classifier. Since most of the tokens are hapaxes > anyway (perhaps 80-90% ?), it is most probable that there would be no > visible effect. > ... Let me clarify this: the experiments we ran couldn't actually use a 32-bit hash code because they used a Python dict to simulate a giant sparse array, and the box I was using didn't have enough RAM to deal with this load. 
Instead we ran with smaller hash codes and smaller training sets, projecting results. The results were too discouraging for anyone here to want to continue along that line. It's all in the archives if you want to dig back far enough (I don't). With a 32-bit hash code, the expected # of collisions for a truly random hash is close to 1, with a standard deviation also close to one, at about 92,600 items, so Seth is quite close. With 350K items (close to the # of tokens in the pure-unigram database I was actually using at the time), the mean # of collisions is a bit over 14 with an sdev of about 3.8. Those numbers aren't scary, and Python's hash() was indeed behaving as a random hash would have. We were considering schemes with much higher feature-generation rates than pure-unigram at the time, though, so all those stats don't matter to what we were really wondering about. BTW, discussions like this really don't belong on the spambayes list. They're fine on spambayes-dev, though, so I've set reply-to to that. Anyone who wants to follow that level of tech-talk should subscribe to spambayes-dev. From mhammond at skippinet.com.au Thu Dec 18 21:33:57 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Dec 18 21:34:14 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <200312180952.hBI9qZC6005810@localhost.localdomain> Message-ID: <004f01c3c5d8$8e28c430$2c00a8c0@eden> > I'm sure this is some sort of standard method of getting things > done in the opensource world. Eric Raymond's Cathedral and Bazaar > metaphor extends here, of course - in a bazaar often you end up > getting suckered. The cool thing is that just like a bazaar, the seller does indeed figure the buyer for a sucker - but still the buyer goes home thinking he got a bargain. Amazingly, everyone truly is happy! But-who-is-who ly, Mark.
From listsub at wickedgrey.com Thu Dec 18 21:33:55 2003 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Thu Dec 18 21:34:33 2003 Subject: [spambayes-dev] Hapaxes? (was: How low can you go?) References: <1ED4ECF91CDED24C8D012BCF2B034F130467777E@its-xchg4.massey.ac.nz> Message-ID: <3FE26393.8010105@wickedgrey.com> Tony Meyer wrote: > > > > Wrt SpamBayes, 'word' is a token, and 'corpus' is the token database. > > Is this enough information? Yes, thank you. :) Heh, glossary. Who'd have thunk? Another newbie Q: were hapaxes not stored at one time? Some of the recent discussion implies that a recent change (storing them?) has increased the DB size considerably. Was that the only heuristic, or was it tokens seen less than N times...? Just trying to get up to speed. :) Eli From matt at mondoinfo.com Thu Dec 18 21:58:46 2003 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Thu Dec 18 21:59:04 2003 Subject: [spambayes-dev] Hapaxes? (was: How low can you go?) In-Reply-To: <3FE26393.8010105@wickedgrey.com> References: <1ED4ECF91CDED24C8D012BCF2B034F130467777E@its-xchg4.massey.ac.nz> <3FE26393.8010105@wickedgrey.com> Message-ID: <1071802446.62.680@mint-julep.mondoinfo.com> > Another newbie Q: were hapaxes not stored at one time? Some of the > recent discussion implies that a recent change (storing them?) has > increased the DB size considerably. Was that the only heuristic, > or was it tokens seen less than N times...? Hapaxes have always been stored. There have been various experiments with removing them since they seem to make up about half of an "average" database. It turns out that if you have a well-trained database, you can remove hapaxes with little effect on scoring. The problem comes if you're doing ongoing training. If you remove hapaxes every day, a strong clue that only arrives once a day will never persist to become a strong clue. 
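A toy illustration of that failure mode, not SpamBayes code: the token name, the one-sighting-per-day schedule, and the `prune_hapaxes` helper are all invented for this sketch.

```python
from collections import Counter

def prune_hapaxes(counts):
    """Drop every token that has been seen exactly once."""
    return Counter({tok: n for tok, n in counts.items() if n > 1})

def simulate(days, prune_daily):
    """Train on one message per day that contains the same rare token,
    optionally removing hapaxes at the end of each day."""
    counts = Counter()
    for _ in range(days):
        counts["rare-but-telling-clue"] += 1    # the clue arrives once a day
        if prune_daily:
            counts = prune_hapaxes(counts)       # daily cleanup
    return counts["rare-but-telling-clue"]

print("without pruning:", simulate(30, prune_daily=False))
print("with daily pruning:", simulate(30, prune_daily=True))
```

With daily pruning the token is thrown away each time before a second sighting can accumulate, so its count never gets past one.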
Regards, Matt From tim.one at comcast.net Fri Dec 19 00:29:50 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Dec 19 00:29:53 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: <200312180952.hBI9qZC6005810@localhost.localdomain> Message-ID: [Anthony Baxter] > I'm sure this is some sort of standard method of getting things > done in the opensource world. Eric Raymond's Cathedral and Bazaar > metaphor extends here, of course - in a bazaar often you end up > getting suckered. Let's see. Everyone here knows you did most of the work in setting up the spambayes web site, and some of us know you're doing the bulk of the work for producing today's Python 2.3.3 release. So who thinks Anthony's a sucker? And who thinks he's taken on these glamorous jobs just to get Barry's regurgitated drugs and quality sex from all the SpamBayes and Python groupies in Australia? Yes, it is a hard call. if-we-didn't-appreciate-each-other-we'd-be-rich-ly y'rs - tim From skip at pobox.com Fri Dec 19 07:07:28 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Dec 19 07:07:23 2003 Subject: [spambayes-dev] a bit better received header parsing Message-ID: <16354.59904.523711.847643@montanaro.dyndns.org> A few days ago Tim noticed that some ip addresses in Received: headers were being broken down in the wrong order. For example, '[199.249.165.175]' would yield fragments in the wrong order: '[199.249.165.175]', '249.165.175]', '165.175]' and '175]'. The problem was that the hostname recognizer was catching them and fragmenting them from right-to-left, as if they were hostnames. I solved that problem and a couple others related to locating hostnames and ip addresses in Received: headers. I have no idea if it will help or not, but your database will be oh-so-much-cleaner if you do a complete retrain after a cvs up.
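A minimal sketch of the distinction, not the tokenizer's actual code: hostnames are most informative from the right (top-level domain first), while dotted-quad addresses are most informative from the left (network first). The regular expression and function names here are assumptions made for illustration.

```python
import re

# Hypothetical pattern for a bare or bracketed dotted-quad address.
IP_RE = re.compile(r"^\[?(\d{1,3}(?:\.\d{1,3}){3})\]?$")

def hostname_fragments(host):
    """Suffixes of a hostname, longest first."""
    parts = host.lower().split(".")
    return [".".join(parts[i:]) for i in range(len(parts))]

def ip_fragments(addr):
    """Prefixes of a dotted-quad address, shortest first."""
    parts = addr.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def received_fragments(token):
    """Choose the fragmenting direction based on what the token looks like."""
    match = IP_RE.match(token)
    if match:
        return ip_fragments(match.group(1))
    return hostname_fragments(token)

print(received_fragments("[199.249.165.175]"))
print(received_fragments("mail.example.com"))
```

Checking for an address before falling back to the hostname rule keeps the brackets out of the fragments and avoids generating clues such as '175]' that say nothing about the sending network.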
Cheers, Skip From kennypitt at hotmail.com Fri Dec 19 09:09:37 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Dec 19 09:10:15 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt] >> So are you recommending that we avoid using the whole _bsddb.pyd >> binary package? I originally thought you were referring only to the >> Python-coded wrappers. > > We need _bsddb.pyd to use a modern BDB at all from within Python on > Windows. I'm not familiar with what all is in that DLL, but there > certainly aren't any Python-coded wrappers in it. It's about 5000 > lines worth of compiled C code, and wraps Sleepycat's C API so that > it can be called from Python. Most of it consists of short wrapper > functions calling Sleepycat functions with similar names, just > converting raw C bits to and from Python objects at the boundaries ... OK, then you were talking about what I originally thought you were talking about. I was getting worried that the C wrappers themselves were too buggy to be of use <0.5 wink>. > Hmm. Maybe you're not aware of this: > > http://pybsddb.sourceforge.net/bsddb3.html > > That's effectively the *real* documentation for Python's modern > Berkeley module. Yeah, that's the documentation I started from, which basically just summarizes what each function is for and shows the Python syntax for calling them. It then refers you to the Berkeley docs for any level of detail about option flags and such, so I usually just go straight to the Sleepycat C++ API docs. The Python wrappers are basically a direct mapping of the C++ class structure, and I've previously worked a little with BDB in C++. > I'm suggesting avoiding the legacy tricks, avoiding the slow & buggy > stuff trying to make current BDB "act just like a magical Python > dict", and writing as directly as possible to Sleepycat's C API > (which, I hasten to add, is much easier from Python than from C).
OK, I can go with that, and it should be relatively straightforward. My concern still stands re Win98, though. Maybe I didn't express it clearly. Whenever you use direct BDB through the pybsddb/bsddb3/bsddb module in a multi-thread/multi-user scenario, you always have to start with a call to initialize the DB environment before you can do anything else. You expressed some concern over the breakage on Win98 of the tests in test_dbshelve.py. Unfortunately, the line that always fails is that very first and most basic initialization call, the same one that we would need to call for any use in SpamBayes. It is a direct call into the C-code wrappers, and happens *before* any "legacy tricks". Since the test suite opens and closes a number of databases and environments before it gets to the point that fails, there could be some adverse effects there. Maybe the best thing is to throw some test code into SpamBayes and see if it will even start up on Win98. I don't have access to a Win98 test system, but if I can code up enough support that we can try this out, would you be willing to give it a test? It will probably be after the holidays before I can get to it, but we'll see. -- Kenny Pitt From popiel at wolfskeep.com Fri Dec 19 11:55:37 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Dec 19 11:55:41 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Tim Peters" of "Thu, 18 Dec 2003 20:26:43 EST." References: Message-ID: <20031219165537.EDB162DF7F@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >> Sounds like _you're_ arguing for expiration of whole messages :) > >Oh yes, I do want that -- eventually. We have no experience with that in >this project, though; we have a lot of experience with the consequences of >hapaxes, and I have no fears remaining about picking on them. Actually, there have been experiments done (by me) with expiry of whole messages. 
I invite you to look at the 'expire4months' regime for my incremental testing harness. Performance was worse than remembering everything, but significantly better than mistake-based training (with the 'fpfnunsure' regime). I have not done any experiments with just nuking hapaxes; I didn't see any reason to do a partial job instead of a full one. >> I know you're not arguing that, but if there were bidirectional msg_id >> <-> feature_ID maps, it would be fairly easy to expire whole >> messages. >> That would obviate the need to track last time seen for every token. > >Only if you don't want also to be able to expire tokens on their own. No... just find the most recent message that the token appeared in, which would be a quick search through a few message times. A really quick search if you're only looking to expire hapaxes. - Alex From skip at pobox.com Fri Dec 19 14:57:16 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Dec 19 16:01:43 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" Message-ID: <16355.22556.564839.561779@montanaro.dyndns.org> I've been running with mine_received_headers set to True for quite awhile. I fixed a couple nits this morning with the regular expressions used to pick out hostnames and ip addresses from Received: headers. The hostname re was frequently picking up ip addresses and chomping them from the wrong end. I am pleased with how well it seems to work at this point(*). Looking at a graph or table of the 'received:.*' spamprob distribution shows that (for me, at least) the bulk of the spamprobs are at or outside of the hapax points. See: http://www.musi-cal.com/~skip/rcvd.png http://www.musi-cal.com/~skip/rcvd.txt The graph plots the number of features with a given spamprob. The two impulses at the hapax points are 523 (0.155...) and 1047 (0.844...). I cropped the graph so the smaller values would be visible. 
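Those two values are what one would expect if the spamprob uses a Robinson-style adjustment, (s*x + n*p)/(s + n), with a prior strength s = 0.45 and a prior probability x = 0.5. Both constants and the function name below are assumptions made for this sketch rather than something stated in this thread.

```python
def adjusted_spamprob(nspam, nham, total_spam, total_ham, s=0.45, x=0.5):
    """Robinson-style adjusted spam probability for a single token.

    s (prior strength) and x (prior probability) are assumed values.
    """
    spam_ratio = nspam / float(total_spam)
    ham_ratio = nham / float(total_ham)
    p = spam_ratio / (spam_ratio + ham_ratio)   # raw estimate from counts
    n = nspam + nham                            # evidence for this token
    return (s * x + n * p) / (s + n)

# Training-set sizes are illustrative; for a hapax they cancel out.
print(adjusted_spamprob(1, 0, 163, 171))   # token seen once, in spam
print(adjusted_spamprob(0, 1, 163, 171))   # token seen once, in ham
```

A token seen exactly once can only land on one of those two values, which would explain why the hapaxes pile up as two sharp impulses in the graph.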
Obviously, this is still strongly hapax-driven (I have a small database at the moment - 163 spam, 171 ham), but the data suggests that the hapax values are pretty good indicators of the direction that feature will take when the second instance is seen. While I was messing with the received header regular expressions today I also noticed that Sendmail sometimes adds "may be forged" to a header. Here's a bit from the sendmail docs in the context of an open relay discussion:

    QAA02454: ... Relaying denied
    QAA02454: ruleset=check_rcpt, arg1=, relay=some.domain [10.0.0.1] (may be forged), reject=550 ... Relaying denied
    QAA02454: from=, size=0, class=0, pri=0, nrcpts=0, proto=SMTP, relay=some.domain [10.0.0.1] (may be forged)

Here the (may be forged) is the important part: it means that the DNS data for the host is inconsistent, and hence the name is not used for the relaying check but only the IP number. This is also a very good spam indicator:

    % spamcounts -r 'may be forged'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    bi:received:may be forged received:mx,1,0,0.844827586207
    bi:received:may be forged received:biz,2,0,0.908163265306
    received:may be forged,5,0,0.95871559633
    bi:received:may be forged received:com,1,0,0.844827586207
    bi:received:127.0.0.1 received:may be forged,5,0,0.95871559633
    bi:received:may be forged received:il,1,0,0.844827586207

I generate it within the block controlled by the mine_received_headers option. A quick scan of my testing databases shows this is overwhelmingly associated with spam (shows up in 221 out of 6843 spams and only 30 out of 8395 ham). I'm inclined to trust sendmail on this one and just add it. It seems like a very objective feature. In fact, if other mail transport agents provide similar clues about forged addresses, I think we should look for their clues and lump them all into one 'received:may be forged' feature. Skip (*) Here's a quick summary of my latest setup. I'm running from CVS (natch).
I pushed my cutoffs out to 0.05 and 0.95 and run with bigrams enabled. I train on all mistakes and unsures. I also have it automatically training on a random 10% of the messages which score as ham or spam. I tried training on everything, but the database was growing way too quickly. The extreme cutoffs minimize the chance of a fp or fn which would mean to untrain I have to go find the message and move it from one pile to the other. So far, no fp's, a few fn's and fewer unsures than I anticipated. From tim.one at comcast.net Sat Dec 20 01:06:28 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 20 01:06:33 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System Message-ID: As the project admins should already know, SourceForge is integrating a donations system throughout their site (see the attached for a bit of detail). I'd like the people here working their spare-time asses off on SpamBayes to give this some thought. We don't *have* to give SpamBayes contributions to the PSF, and I wouldn't object if the people doing the work here wanted to split donations among themselves. It probably wouldn't amount to much, but even 100 bucks now and again can work wonders for morale. I don't have a stake in this either way. My employment contract forbids doing compensated work on anything other than employer-assigned tasks, so my fingers aren't allowed in the pie. That's fine by me. As a Director of the PSF, I really appreciate that SpamBayes donations have been given to the PSF so far, but objectively speaking I think the rationale for doing so is weak (and the way things seem to be heading, we should probably start giving them to Sleepycat instead). Anyway, give it some thought over the holidays! It's your project more than mine, and has been for a long time. I'll support whatever decision you make. -------------- next part -------------- An embedded message was scrubbed...
From: "SourceForge.net Team" Subject: SF.NET Project Donation System Date: Fri, 19 Dec 2003 02:29:33 -0500 Size: 2713 Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031220/07e6863a/attachment.mht
From Guido.DellaBruna at meteoswiss.ch Sat Dec 20 10:38:19 2003 From: Guido.DellaBruna at meteoswiss.ch (Guido.DellaBruna@meteoswiss.ch) Date: Sat Dec 20 10:38:23 2003 Subject: [spambayes-dev] SpamBayes and Outlook on Metaframe Message-ID: <27E7C74777D477408C644B95A24E6F87041045@lomex01.meteoswiss.ch> Hello, sorry, I'm not sure if this is the right place for such questions, but here i goes: I would like to install SpamBayes Outlook-Plugin on Citrix-Metaframe. The problem seems to be the directory where to install the DB: in "My Computer" I only have access to some network drives (no local "C:", for example) and to a "local" folder named "My Documents". Is it possible to instruct SpamBayes to use that folder for the Spam database (and any other file needed by SpamBayes)? Or do I need to modify the Python code and recompile it? I didn't find a way to change this in the GUI. Many thanks, -- Guido Della Bruna Processo Meteo Locarno MeteoSvizzera Via ai Monti 146 CH-6605 Locarno 5 Monti Svizzera From nobody at spamcop.net Fri Dec 19 17:58:10 2003 From: nobody at spamcop.net (Seth Goodman) Date: Sat Dec 20 13:48:10 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: > [Tim Peters] > There are messages I never want to expire. That creates major new UI > headaches to be doable. I believe (but don't yet know) that expiring > hapaxes can be done without need for user intervention, and without harm. I hope the "without harm" part is true. See my question two sections down. > [Tim Peters] > At some point, if you want to try your ideas, *try* your ideas -- > that's what Open Source is all about. Everyone is born knowing how to > program in Python, although most don't realize it until they try.
I admit I wasn't aware that I could program in Python since birth, but I'm willing to take your word on that. We all have hidden potential. So that I don't have to re-invent that round thing with the axle in the middle, could someone please give me some hints as to which of the mapping features we've discussed in this thread exist or will soon exist and where I can look for them? I saw on spambayes-dev that there is discussion of a new database, so I don't want to go off on a useless fork with the present db if that comes to pass. Search for your inner newbie when you answer this. > > [Seth Goodman] > > I agree completely. This was an important motivation for expiring a > > whole message at a time. Training mistakes would eventually drop out > > of the database without user intervention. Not that a tool to help > > track down training mistakes wouldn't be great, but a "casual" user > > could still make occasional mistakes and the system would recover by > > itself. > > [Tim Peters] > Without intervention, it will also expire the screaming bright-red HTML > birthday message sent by my favorite 7-year-old niece, and when > she's 8 the > next one may get tagged as spam. These are the kinds of messages I never > want to expire. ... Here lies my concern. I sincerely hope that correct classification of these infrequent, unusual messages is not hapax-driven. If it is, the result of pruning infrequently-used hapaxes will be as bad as deleting the whole message. If that is the case, the _only_ solution will be to keep either those hapaxes or the whole message trained forever. Either way, I agree this is a big UI problem without an obvious intuitive solution. It does appear from looking at the scoring of some of my "typical" messages that hapaxes don't contribute much, as you've said before. Could you look at the scoring of a couple of those special messages and tell if their scoring would be seriously affected if the hapaxes were gone? 
-- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From richie at entrian.com Sat Dec 20 15:42:28 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Dec 20 15:42:41 2003 Subject: [spambayes-dev] Re: [Spambayes] SpamBayes and financial sponsorship In-Reply-To: <2045949.1071845232983.JavaMail.jboss@p15135617.pureserver.info> References: <2045949.1071845232983.JavaMail.jboss@p15135617.pureserver.info> Message-ID: Hi Dawn, Replying on behalf of the SpamBayes team: > You have received this email because your project has been nominated for > financial sponsorship by Gary Daw. [...] The panel's decision will be > emailed to you just as soon as your nomination has been evaluated. Great! We're very pleased to hear it, and we look forward to hearing the decision. > PS. We are considering providing facilities to our members for managing their own > projects, such as CVS repositories, issue tracking and mailing lists. Would that > be something that you would be interested in? We already use SourceForge for CVS and issue tracking, and we run our own mailing lists. As far as I'm aware, we're all happy with our current setup. -- Richie Hindle richie@entrian.com From richie at entrian.com Sat Dec 20 15:56:04 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Dec 20 15:56:15 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: References: Message-ID: [Tim and Barry] > [much ZODB and bsddb wisdom] Thanks, guys. I'll try to do two things over Christmas: o Write a script to hammer the current SpamBayes bsddb code, to try to reproduce the problems we've been seeing and to test the second thing: o Write a ZODB-on-BDB storage for SpamBayes. [Kenny] > Maybe the best thing is to throw some test code > into SpamBayes and see if it will even start up on Win98.
I don't have > access to a Win98 test system, but if I can code up enough support that > we can try this out, would you be willing to give it a test? It will > probably be after the holidays before I can get to it, but we'll see. I have a win98 environment that I'll be happy to run test code in. -- Richie Hindle richie@entrian.com From richie at entrian.com Sat Dec 20 17:04:40 2003 From: richie at entrian.com (Richie Hindle) Date: Sat Dec 20 17:04:52 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: References: Message-ID: [Tim] > As the project admins should already know, SourceForge is integrating a > donations system throughout their site (see the attached for a bit of > detail). > > I'd like the people here working their spare-time asses off on SpamBayes to > give this some thought. We don't *have* to give SpamBayes contributions to > the PSF, and I wouldn't object if the people doing the work here wanted to > split donations among themselves. It probably wouldn't amount to much, but > even 100 bucks now and again can work wonders for morale. This is a great idea in principle, but devising a fair system for distributing the donations would be difficult. Trying to measure people's contributions, when that means original code, bugfixes, patches, contributions to spambayes-dev, work on the web site, providing support to users, writing documentation, admin... it's more difficult than SpamBayes itself. (Anyway, we all know Mark deserves all of it for fighting Outlook all this time. And in Australian dollars, 100 bucks US would set him up for life! 8-) A couple of ideas do spring to mind, though: Anyone who's spent real money on the project, like Rob with the spambayes.org domain, could be reimbursed. We could add developer links to the Donations page, so that if a user wanted to donate to a specific developer, he could. Though that in itself raises fairness problems: who gets an entry on that page?
Do they get to write their own entry (for example, could Barry put up an entry saying "Bassist. Please help." - that would be grossly unfair to those of us without that kind of affliction.) I'd love to make money from SpamBayes (and my wife would *really* love it 8-) but I wouldn't want to leave others feeling short changed. [Tim] > Anyway, give it some thought over the holidays! It's your project more than > mine, and has been for a long time. I'll support whatever decision you > make. And if we're ever in the same bar at the same time, you won't need to buy any drinks. 8-) -- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Sat Dec 20 17:58:54 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat Dec 20 17:59:15 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: Message-ID: <029e01c3c74c$d84a8470$2c00a8c0@eden> [Richie] > This is a great idea in principle, but devising a fair system for > distributing the donations would be difficult. Trying to > measure people's > contributions, when that means original code, bugfixes, patches, > contributions to spambayes-dev, work on the web site, > providing support to > users, writing documentation, admin... it's more difficult > that SpamBayes > itself. Agreed. A more practical problem is that someone would need to collect this money, and this would have tax implications, even if they tried to say they were just "holding" it. > (Anyway, we all know Mark deserves all of it for > fighting Outlook > all this time. And in Australian dollars, 100 bucks US would > set him up for life! 8-) With change :) > A couple of ideas do spring to mind, though: > > Anyone who's spent real money on the project, like Rob with the > spambayes.org domain, could be reimbursed. I agree, but not sure how this could work in practical terms with the tax and holding issues. 
> We could add developer links to the Donations page, so that if a user > wanted to donate to a specific developer, he could. Though > that in itself > raises fairness problems: who gets an entry on that page? Do > they get to > write their own entry (for example, could Barry put up an entry saying > "Bassist. Please help." - that would be grossly unfair to those of us > without that kind of affliction.) > > I'd love to make money from SpamBayes (and my wife would > *really* love it > 8-) but I wouldn't want to leave others feeling short changed. I think there is something here. One approach would be that any listed "developers" are eligible. Our "donations" page could list the developers, and include a link to their personal sourceforge page. What they say about themselves there is their issue. Our "donations" page makes no attempt to guide towards individuals - all developers are considered equal. If you gain the respect and credibility to be listed as a developer, and opt-in with your paypal account number, then you qualify to receive donations. The developers make no attempt to guide people to anyone - we just point at the donations page, and shut up. A risk is that this will lead to lots of people trying to become developers. We may need a semi-formal process for new members, maybe borrowing from the Python +-1/+-0 system - all developers must vote, and one +1 is required, and a single -0 is a veto. To avoid conflicts (which I doubt, but worth covering), we could adopt the same system for the entire donations system - a single developer could choose to veto the whole scheme (presumably as they felt it unfair). In that case, we drop the entire scheme, and move back to 100% going to the PSF. Extending this a little to handle reimbursing real costs - assuming the person with the cost is a developer, then we *could* put a note at the top of the 'donations' page saying 'please pay this person first, as he has real costs to recover'.
Once these are recovered, this developer moves back to the 'normal' list. > [Tim] > > Anyway, give it some thought over the holidays! It's your > project more than > > mine, and has been for a long time. I'll support whatever > decision you > > make. > > And if we're ever in the same bar at the same time, you won't > need to buy > any drinks. 8-) Isn't it a shame all these fine taxation people won't let us start a SpamBayes slush fund :) Mark. From mhammond at skippinet.com.au Sat Dec 20 18:00:38 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat Dec 20 18:00:55 2003 Subject: [spambayes-dev] SpamBayes and Outlook on Metaframe In-Reply-To: <27E7C74777D477408C644B95A24E6F87041045@lomex01.meteoswiss.ch> Message-ID: <029f01c3c74d$16b7cf60$2c00a8c0@eden> The GUI does not expose this option, but if you read the 'Configuration Guide' (available via the 'About' document after installation) you will find information on how to manually configure the data directory SpamBayes uses. Mark. > Hello, > > sorry, I'm not sure if this is the right place for such questions, but > here it goes: > > I would like to install SpamBayes Outlook-Plugin on Citrix-Metaframe. > The problem seems to be the directory where to install the DB: in "My > Computer" I only have access to some network drives (no local > "C:", for > example) and to a "local" folder named "My Documents". Is it > possible to > instruct SpamBayes to use that folder for the Spam database (and any > other file needed by SpamBayes)? Or do I need to modify the > Python code > and recompile it? I didn't find a way to change this in the GUI.
> > Many thanks, > > -- > Guido Della Bruna > Processo Meteo Locarno > MeteoSvizzera > Via ai Monti 146 > CH-6605 Locarno 5 Monti > Svizzera > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev From barry at python.org Sat Dec 20 18:40:12 2003 From: barry at python.org (Barry Warsaw) Date: Sat Dec 20 18:40:36 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: References: Message-ID: <1071963612.17967.22.camel@anthem> On Sat, 2003-12-20 at 17:04, Richie Hindle wrote: > Do they get to > write their own entry (for example, could Barry put up an entry saying > "Bassist. Please help." - that would be grossly unfair to those of us > without that kind of affliction.) Plus, think of all the bummed out guitar players -- oh wait, they're guitar players, they gets the chicks, they don't needs the money. Of course, since I've managed to contribute so little to the project, it's only fair I get the lion's share of the dough. And because I have the same employer as Tim, you're going to have to put "for funky accompaniment" in the memo section of the checks just to make everything copacetic. ain't-quittin'-my-day-job-for-this-one-either-ly y'rs, -Barry From anthony at interlink.com.au Sat Dec 20 22:18:30 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Sat Dec 20 22:18:52 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: Message-ID: <200312210318.hBL3IUcW017811@localhost.localdomain> As others have raised, trying to distribute this money would be somewhat tricky. I'd suggest that we instead kick any money we receive via SourceForge back to SourceForge. They run a service we all depend on, and I'd like to see them being able to continue to do so. Anthony -- Anthony Baxter It's never too late to have a happy childhood. 
From matt at mondoinfo.com Sat Dec 20 22:39:34 2003 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sat Dec 20 22:39:41 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: <200312210318.hBL3IUcW017811@localhost.localdomain> References: <200312210318.hBL3IUcW017811@localhost.localdomain> Message-ID: <1071977330.29.10858@mint-julep.mondoinfo.com> > As others have raised, trying to distribute this money would be > somewhat tricky. I'd suggest that we instead kick any money we > receive via SourceForge back to SourceForge. They run a service we > all depend on, and I'd like to see them being able to continue to > do so. In the absence of a more direct way to support SpamBayes or individual developers, I'd suggest routing donations to the Python Software Foundation. It's a properly-organized charity so it's equipped to receive them. And I don't think there's any reason that it can't spend money to support SpamBayes when there's an opportunity to. I also understand that that's apt to help the PSF and, indirectly, Python since the PSF has to show the government that it gets donations from lots of different people. Full disclosure: I'm a member of the PSF but don't sit on its board. Regards, Matt From tim.one at comcast.net Sat Dec 20 23:27:22 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 20 23:27:26 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: <16355.22556.564839.561779@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I've been running with mine_received_headers set to True for quite > awhile. I fixed a couple nits this morning with the regular > expressions used to pick out hostnames and ip addresses from > Received: headers. The hostname re was frequently picking up ip > addresses and chomping them from the wrong end. I am pleased with > how well it seems to work at this point(*). 
Looking at a graph or > table of the 'received:.*' spamprob distribution shows that (for me, > at least) the bulk of the spamprobs are at or outside of the hapax > points. See: > > http://www.musi-cal.com/~skip/rcvd.png > http://www.musi-cal.com/~skip/rcvd.txt > > The graph plots the number of features with a given spamprob. The two > impulses at the hapax points are 523 (0.155...) and 1047 (0.844...). > I cropped the graph so the smaller values would be visible. > > Obviously, this is still strongly hapax-driven (I have a small > database at the moment - 163 spam, 171 ham), but the data suggests > that the hapax values are pretty good indicators of the direction > that feature will take when the second instance is seen. Cool! Thanks for the good work. I'll give this a try too. > While I was messing with the received header regular expressions > today I also noticed that Sendmail sometimes adds "may be forged" to > a header. Here's a bit from the sendmail docs in the context of an > open relay discussion: > > QAA02454: ... Relaying denied > QAA02454: ruleset=check_rcpt, arg1=, > relay=some.domain [10.0.0.1] (may be forged), > reject=550 ... Relaying denied > QAA02454: from=, size=0, class=0, pri=0, > nrcpts=0, proto=SMTP, relay=some.domain [10.0.0.1] (may > be forged) > > Here the (may be forged) is the important part: it means that the > DNS data for the host is inconsistent, and hence the name is not > used for the relaying check but only the IP number. 
> > This is also a very good spam indicator: >
> % spamcounts -r 'may be forged'
> db: /Users/skip/.hammiedb
> token,nspam,nham,spam prob
> bi:received:may be forged received:mx,1,0,0.844827586207
> bi:received:may be forged received:biz,2,0,0.908163265306
> received:may be forged,5,0,0.95871559633
> bi:received:may be forged received:com,1,0,0.844827586207
> bi:received:127.0.0.1 received:may be forged,5,0,0.95871559633
> bi:received:may be forged received:il,1,0,0.844827586207
> > I generate it within the block controlled by the mine_received_headers > option. A quick scan of my testing databases shows this is > overwhelmingly associated with spam (shows up in 221 out of 6843 > spams and only 30 out of 8395 ham). > > I'm inclined to trust sendmail on this one and just add it. It seems > like a very objective feature. I agree -- it's extremely unlikely to lose. The ones to worry about are things spammers could inject to push things in the ham direction, but they're not gonna get far forging "may be forged" unless I have a *very* weird idea of ham . > In fact, if other mail transport agents provide similar clues about > forged addresses, I think we should look for their clues and lump them > all into one 'received:may be forged' feature. I noticed this in the headers of a spam today:
Received: from shawmail-cg-shawcable-net (c-24-9-163-244.client.comcast.net[24.9.163.244](untrusted sender)) by rwcrmxc11.comcast.net (rwcrmxc11) with SMTP id <20031220054919r1100n4pj1e>; Sat, 20 Dec 2003 05:49:20 +0000
It's the "(untrusted sender)" part that's interesting. I'd suggest *not* folding that in with "may be forged", though. There probably aren't a lot of strings of this nature, so the database burden should be trivial, and I *bet* different strings will prove to have different spamprobs. > (*) Here's a quick summary of my latest setup. I'm running from CVS > (natch). I pushed my cutoffs out to 0.05 and 0.95 and run with > bigrams enabled.
I train on all mistakes and unsures. I also have > it automatically training on a random 10% of the messages with score > as ham or spam. I tried training on everything, but the database was > growing way too quickly. The extreme cutoffs minimize the chance of > a fp or fn which would mean to untrain I have to go find the message > and move it from one pile to the other. So far, no fp's, a few fn's > and fewer unsures than I anticipated. I'm running 0.04 and 0.95 with bigrams now, sticking to just mistake-and-unsure training, after seeding with 50 of each, although the seeds were the most recent trained on from my mistake-and-unsure-trained unigram classifier. Am at about 145 of each now. I don't trust it yet -- it's still surprising too often. I had disappointing results with a purely mistake/unsure-trained unigram classifier before; the bigram one isn't disappointing so far, it just leaves me cautious after a few days. I expect (without proof) that *some* random component is very helpful, at least to get the thing started. It's still 89% hapax. I had expected that percentage to drop by now, but without a random component I'm not sure that was a reasonable expectation:
spam+ham   count      %   cumm
       1   63611  88.85  88.85
       2    4126   5.76  94.61
       3    1377   1.92  96.54
       4     680   0.95  97.49
       5     397   0.55  98.04
       6     255   0.36  98.40
       7     178   0.25  98.65
       8     134   0.19  98.83
       9     109   0.15  98.98
      10      70   0.10  99.08
     ...
From tim at fourstonesExpressions.com Sat Dec 20 23:59:46 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Sat Dec 20 23:59:53 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: <200312210318.hBL3IUcW017811@localhost.localdomain> References: <200312210318.hBL3IUcW017811@localhost.localdomain> Message-ID: On Sun, 21 Dec 2003 14:18:30 +1100, Anthony Baxter wrote: > > As others have raised, trying to distribute this money would be > somewhat tricky. I'd suggest that we instead kick any money we > receive via SourceForge back to SourceForge.
They run a service > we all depend on, and I'd like to see them being able to continue > to do so. > > Anthony > +1 from me. -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From tim.one at comcast.net Sun Dec 21 00:03:07 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Dec 21 00:03:12 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: <1071977330.29.10858@mint-julep.mondoinfo.com> Message-ID: [Matthew Dixon Cowles] > In the absence of a more direct way to support SpamBayes or > individual developers, I'd suggest routing donations to the Python > Software Foundation. Matt, we already do. Visit http://spambayes.sourceforge.net/donations.html for proof . There's a PayPal button on that page, which contributes directly to the PSF now. Some users have done that. > It's a properly-organized charity so it's equipped to receive them. > And I don't think there's any reason that it can't spend money to > support SpamBayes when there's an opportunity to. Me neither, and I want the PSF to do things like that, but I think it's still a long way off. The PSF's work is all done by unpaid "spare time" volunteers too, and while we're accomplishing what we need to accomplish there, it's in very slow motion. We're overwhelmingly still bogged down trying to clean up legalities; e.g., after something like 2 years in existence, we're still trying to get legally sound contribution (code, not money) forms established. I expect that people contributing cash to SpamBayes would like a more direct connection, and I sympathize. > I also understand that that's apt to help the PSF and, indirectly, > Python since the PSF has to show the government that it gets > donations from lots of different people.
The SpamBayes-derived contributions have helped a lot in moving the PSF toward meeting the so-called "public support ratio" test, which the PSF must meet to retain public charity status. I don't think the PSF *needs* the SpamBayes contributions for that, though. > Full disclosure: I'm a member of the PSF but don't sit on its board. If you want to, you probably can . Everyone should keep in mind that we're not talking big money here. The total contributed to the PSF from all sources so far wouldn't pay one person's salary, and the SpamBayes donations are a small part of that. The PSF's Treasurer could break out exact numbers, but I don't want to bother him -- when I said "100 bucks now & again" earlier, that's the right ballpark, given current contributions. OTOH, I expect SpamBayes contributions would increase if there were a plausibly direct connection between giving money and getting back a better SpamBayes someday. From tim.one at comcast.net Sun Dec 21 01:21:25 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Dec 21 01:21:30 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: Message-ID: Good news and bad news on mine_received_headers in my classifier now. The good news is that it *generally* made ham hammier and spam spammier. The bad news is that spam leaking thru python.org mailing lists is much more likely to score as ham than unsure as before, due to the large number of new python.org-related clues. The lowest-scoring spam in my training data now is this: """ ... Subject: HOT OPPORTUNITY ...
JUST CHECK OUT MY WEBSITE http://www.webspawner.com/users/hawkk/index.html -- http://mail.python.org/mailman/listinfo/python-list """ It turns out I've actually trained on two copies of that one, but despite that it's scoring only 19 now:
Combined Score: 19% (0.193802)
Internal ham score (*H*): 0.85051
Internal spam score (*S*): 0.238114
These are all the "ah, this came from a python.org mailing list" features now, more than doubling the number of such features before:
'url:mailman'                                           0.128016
'url:listinfo'                                          0.130533
'url:python'                                            0.135874
'bi:proto:http url:mail'                                0.138712
'url:python-list'                                       0.145499
'received:127'                                          0.146801
'received:127.0'                                        0.146801
'received:127.0.0'                                      0.146801
'received:127.0.0.1'                                    0.146801
'bi:received:12.155.117.29 received:localdomain'        0.1549
'received:localhost.localdomain'                        0.16481
'sender:addr:python-list-bounces+tim.one=comcast.net'   0.168566
'sender:addr:python.org'                                0.168824
'received:12'                                           0.211812
'bi:to:addr:python.org to:no real name:2**0'            0.213042
'received:12.155'                                       0.214529
'received:12.155.117'                                   0.214529
'received:mail.python.org'                              0.214529
'received:python.org'                                   0.214529
'url:org'                                               0.221085
So it's got 11(!) new correlated clues extracted from two Received headers:
Received: from mail.python.org ([12.155.117.29]) by sccrmxc14.comcast.net (sccrmxc14) with ESMTP id <20031211091604s14001ch25e>; Thu, 11 Dec 2003 09:16:04 +0000
X-Originating-IP: [12.155.117.29]
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.22) id 1AUMvU-0000lU-7N for tim.one@comcast.net; Thu, 11 Dec 2003 04:16:04 -0500
If I were doing train-on-everything instead of just mistakes, I'm afraid the spamprobs on the python.org clues would approach 0 (I get a couple hundred ham from mail.python.org every day, but typically no spam from there) -- then we'd be close to "spectacular failure" territory, for such very short spam. Something to be aware of, anyway!
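For readers following the numbers in this thread, here is a minimal sketch of how a per-token spamprob and the combined score can be computed. It is an illustration under stated assumptions, not the SpamBayes source: the function names are hypothetical, and the two constants (0.45 and 0.5) are assumed because they reproduce the hapax values (0.155... and 0.844...) and the spamcounts figures quoted earlier in the thread. The training totals 163 spam and 171 ham are the database sizes Skip mentioned.

```python
# Illustrative sketch only; names and constants are assumptions, chosen
# because they reproduce the figures quoted in this thread.
UNKNOWN_WORD_STRENGTH = 0.45  # assumed weight "s" given to the prior
UNKNOWN_WORD_PROB = 0.5       # assumed prior "x" for a never-seen token


def adjusted_spamprob(spamcount, hamcount, nspam, nham):
    """Spam probability for one token, pulled toward the prior when
    the token has been seen only a few times."""
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    total = spamratio + hamratio
    # Raw estimate from the counts alone; fall back to the prior if unseen.
    prob = spamratio / total if total else UNKNOWN_WORD_PROB
    n = spamcount + hamcount
    s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
    return (s * x + n * prob) / (s + n)


def combined_score(ham_score, spam_score):
    """Fold the internal *H* and *S* scores into one value in [0, 1]."""
    return (spam_score - ham_score + 1.0) / 2.0


if __name__ == "__main__":
    # A token seen once, in spam only (a spam hapax).
    print(adjusted_spamprob(1, 0, 163, 171))
    # A token seen in 200 ham and no spam: the spamprob heads toward 0.
    print(adjusted_spamprob(0, 200, 163, 171))
    # The H and S values reported for the python-list spam above.
    print(combined_score(0.85051, 0.238114))
```

The second call shows the worry about train-on-everything: with hundreds of ham-only sightings the prior barely matters, so a clue such as 'received:python.org' ends up very close to 0.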
On the other side, all the ham in my training data scores 0 now (rounded to two digits), which I've never seen before. That's remarkable since the only ham in there came from mistakes and unsures (50 left over from my unigram classifier, about 100 added since then). Only 5 training spam don't score 100 (rounded), which are exactly the 5 training spam that came from a python.org mailing list. Overall, that's also better than I've seen before, although the bit of python.org spam is doing worse than I've seen before (for the obvious reason explained above). From matt at mondoinfo.com Sun Dec 21 14:08:03 2003 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sun Dec 21 14:09:09 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: References: <1071977330.29.10858@mint-julep.mondoinfo.com> Message-ID: <1072030630.02.10929@mint-julep.mondoinfo.com> [Tim, on donations to the PSF] > Matt, we already do. Visit > http://spambayes.sourceforge.net/donations.html > for proof . There's a PayPal button on that page, which > contributes directly to the PSF now. Some users have done that. Actually, I was aware of that. But since there's a new mechanism that may be better or cheaper or at least different it might be confusing to accept donations that go to different places. > I expect that people contributing cash to SpamBayes would like a > more direct connection, and I sympathize. Agreed. But since no one has yet suggested a good way to donate directly to SpamBayes, I think that having the money go to the PSF is a trifle closer to supporting SpamBayes than having the money go to SourceForge. Other people may have different thoughts. >> Full disclosure: I'm a member of the PSF but don't sit on its >> board. > If you want to, you probably can . You're not getting off the board that easily . 
Regards, Matt From tameyer at ihug.co.nz Sun Dec 21 20:35:09 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Dec 21 20:35:49 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7A6B@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A13@its-xchg4.massey.ac.nz> [Tim] > I'd like the people here working their spare-time asses off on > SpamBayes to give this some thought. We don't *have* to give > SpamBayes contributions to the PSF, and I wouldn't object if the > people doing the work here wanted to split donations among > themselves. I'm happy with any contributions going to the PSF, or SF, or whatever 'good cause' people like most (including a new yacht for Mark ). If contributions were more likely to be 100 bucks every day or two instead of now and again, I might think differently ;) In addition, if I received anything, I don't know how that would be viewed here - if it was seen as payment for work, then I'd have to greatly reduce the amount of time I muck about with spambayes. That said, although that's my preference, I don't feel strongly against an 'individual' system, if that's the majority preference. > It probably wouldn't amount to much, but even 100 bucks now > and again can work wonders for morale. True, but knowing (or wondering) that others are getting 100 bucks now and again and you're not, or knowing that people could be giving you a buck now and then, and aren't, might be negative for morale. [Richie] > (Anyway, we all know Mark deserves all of it for > fighting Outlook all this time. And in Australian dollars, > 100 bucks US would set him up for life! 8-) Or in NZ dollars, I could buy most of Narnia & Middle Earth ;) > Anyone who's spent real money on the project, like Rob with > the spambayes.org domain, could be reimbursed. 
I think if the money does keep going to the PSF, then we ought to be able to convince them to fork out for that sort of thing (is there anything apart from the domain that costs at the moment?). From my ignorant perspective, that seems easier in terms of tax etc, too. > We could add developer links to the Donations page, so that > if a user wanted to donate to a specific developer, he could. It seems to me that this would end up being more of a donate-for-support page, which leaves out those people that support but don't develop. My personal suspicion is that people are more likely to want to donate for support than development, anyway. =Tony Meyer From tameyer at ihug.co.nz Sun Dec 21 23:44:58 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Dec 21 23:45:05 2003 Subject: [spambayes-dev] "X-" as a prefix for experimental options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304314CAC@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677795@its-xchg4.massey.ac.nz> [Mark] > A problem I see is that the users will have no way > of measuring any changes. The binaries don't come with > any of the test tools, and relying on lots of people > giving subjective results doesn't seem useful. [...] > I think we need some kind of better, application based testing > framework first. The scripts we use now predate all of the > applications, and I can never remember how to run them. If I could > just get a test tool to run directly over Outlook folders, we would be > much closer (for Outlook anyway ). This needn't be too hard > - just abstracting the test tools a little so they allow > sub-classes to extract the actual message streams for the > test runs. I've made (a very rough) start to something like this and checked it in. If you apply the attached patch to sb_server.py and then go to http://localhost:8880/cv, you'll be presented with a page for running the 'timcv' test (defaults against any of the experimental/deprecated options). 
It's all very rough at the moment, but I'd be interested to know if people thought that this would be user friendly enough (for advanced users, not everyone), or ideas about other ways to go about it. [Mark] > Ultimately, we end up with a simple way for either Outlook > or sb_server to run tests over the training sets, and report succinct > results. Otherwise, I doubt anything will change in terms of the > number of *users* running tests (let alone developers ) I definitely agree that this is needed :) If something like this does end up in sb_server, then it would be extremely simple to add it to Outlook, too. In fact, if it was presented in a message (like "Show Clues") then the exact html could maybe be used . =Tony Meyer -------------- next part -------------- A non-text attachment was scrubbed... Name: ttui.patch Type: application/octet-stream Size: 1076 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031222/6a66af97/ttui.obj From tim.one at comcast.net Mon Dec 22 00:24:25 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 00:24:36 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <20031219165537.EDB162DF7F@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > Actually, there have been experiments done (by me) with expiry of > whole messages. Yes. By "the project" having experience I mean controlled tests run by several across their own email mix, using exactly the same strategy, with reporting and analysis and all that good stuff. We've done little of that (as a group) over the last year. > I invite you to look at the 'expire4months' regime for my incremental > testing harness. Performance was worse than remembering everything, > but significantly better than mistake-based training (with the > 'fpfnunsure' regime). > > I have not done any experiments with just nuking hapaxes; I didn't see > any reason to do a partial job instead of a full one. There may not be one. 
The question arose specifically in the context of the mixed unigram/bigram classifier, which grows the database at a much faster rate. I've got ~90% hapaxes after a couple days with that, and the database is already 3x larger than after months of mistake/unsure training under the pure-unigram classifier. Expiring a full message doesn't seem to make sense after two days, or even after a week; expiring unused hapaxes may; that's for experiment to decide. >>> I know you're not arguing that, but if there were bidirectional >>> msg_id <-> feature_ID maps, it would be fairly easy to expire whole >>> messages. >>> >>> That would obviate the need to track last time seen for every token. >> Only if you don't want also to be able to expire tokens on their own. > No... just find the most recent message that the token appeared in, > which would be a quick search through a few message times. A really > quick search if you're only looking to expire hapaxes. I don't want to expire a hapax if it's been used recently in *scoring*. Message times can't distinguish used from unused features. If you're doing train-on-everything (with or without whole-msg expiration), a hapax used in scoring becomes a non-hapax the first time it's used in scoring. For mistake/unsure training, a hapax used in scoring remains a hapax if the message being scored ends up correctly classified. Hapaxes that are never seen again also remain hapaxes. Distinguishing used from unused requires recording use. Followups set to spambayes-dev@python.org, as this speculative stuff really doesn't belong on the general spambayes list. From popiel at wolfskeep.com Mon Dec 22 00:40:35 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 22 00:40:40 2003 Subject: [spambayes-dev] RE: How low can you go? In-Reply-To: Message from "Tim Peters" of "Mon, 22 Dec 2003 00:24:25 EST." References: Message-ID: <20031222054035.EB7672DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[T. 
Alexander Popiel]
>> Actually, there have been experiments done (by me) with expiry of
>> whole messages.
>
>Yes.  By "the project" having experience I mean controlled tests run by
>several across their own email mix, using exactly the same strategy, with
>reporting and analysis and all that good stuff.  We've done little of that
>(as a group) over the last year.

Ah.  Yes, that hasn't happened...  I've been as lax as most folks
with regard to trying to replicate other people's testing, too. :-(

>> No... just find the most recent message that the token appeared in,
>> which would be a quick search through a few message times.  A really
>> quick search if you're only looking to expire hapaxes.
>
>I don't want to expire a hapax if it's been used recently in *scoring*.

*blink* *blink*  Oh, right, you don't train on everything like I do.
Sometimes I forget. ;-)

- Alex

From gbrown at alumni.caltech.edu Mon Dec 22 11:49:23 2003
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Mon Dec 22 11:49:30 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
Message-ID: <01ae01c3c8ab$8e3003a0$6601a8c0@Glenn>

I fear my email box is seeing a reliable Spam attack on Bayesian filters,
starting in the past week: the tweaking of spam tokens by repeating
characters.  If spammers use 0-3 repetitions of each letter, a spam token
like "investment" can be spelled 4^10 (about a million) different ways.  I
don't want to suffer a million spam messages to train my filter for this
one word.

A simple solution would be to eliminate character repetitions in the spam
database.  This produces 163 ambiguities out of the 25143 words in the
Solaris /usr/dict/words list of words in the English language, but probably
none of these are spam tokens.  I've appended a list of the ambiguous
tokens below.  For example, "be" represents "be" and "bee".

I won't be implementing this feature myself, but would sure like to see it
in my favorite spam filter.
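A rough sketch of the normalization proposed above, for anyone who wants to
measure it.  This is not SpamBayes code: collapse_repeats and
ambiguous_groups are hypothetical helper names, and cutting every run of a
repeated character down to one is only one reading of "eliminate character
repetitions".

```python
import re

# A run is any character followed by one or more copies of itself.
_REPEAT_RE = re.compile(r"(.)\1+")


def collapse_repeats(token):
    """Return token with every run of a repeated character cut to one."""
    return _REPEAT_RE.sub(r"\1", token)


def ambiguous_groups(words):
    """Group words that become identical once repeats are collapsed.

    Returns a dict mapping each collapsed form to the sorted list of
    distinct original words that share it (only forms with 2+ words).
    """
    groups = {}
    for word in words:
        groups.setdefault(collapse_repeats(word), set()).add(word)
    return {key: sorted(group)
            for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    for sample in ("siickkk", "deprrravved", "investment"):
        print(sample, "->", collapse_repeats(sample))
    print(ambiguous_groups(["be", "bee", "the", "thee", "very"]))
```

Running ambiguous_groups over a word list such as /usr/dict/words is one
way to produce a collision list like the one appended below.  Legitimately
doubled letters collapse too ("totally" becomes "totaly"), so the same
function would have to be applied both when training and when scoring.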
Cheers to all the SpamBayes developers, --Glenn Alan Alison Barnet Bela Burt De Diane Douglas Eliot Eliot Emanuel Gary Godwin Greg Haley Herman Kaufman Kenan Liget Lilian Marieta Mathews Matson McConel NW Nichols Paterson Philip SE SW Scot Shafer Shepard Simons Wals Whitaker ad advise apointe as bare bat be bel below bel below bet bib bit bled boby bogy bon both bred bus but canister canon canvas carton chery chose col coma con con cop coral cot desert desicate devise devote discus divorce dragon drol drop duly el el escape fed fel fiance filet fogy fury gable gal glom god gripe grove hel his hop hot i i in inbred invite ken knel later legate lop lose lot mana marque mate met milenia mortgage mot ne non nose of pal parole pep pepy per pol pol pop pose put red refuge retire rifle robin rod rot salon sen shot slop son sped step stop tapa ten the til to todle tol tor tot very vi vi we wed whop willful -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031222/8aea057d/attachment-0001.html From skip at pobox.com Mon Dec 22 11:57:25 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 11:57:36 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: <01ae01c3c8ab$8e3003a0$6601a8c0@Glenn> References: <01ae01c3c8ab$8e3003a0$6601a8c0@Glenn> Message-ID: <16359.8821.262470.812169@montanaro.dyndns.org> Glenn> I fear my email box is seeing a reliable Spam attack on Bayesian Glenn> filters, starting in the past week: the tweaking of spam tokens Glenn> by repeating characters. Only if "deprrravved" is a hammy word for you. If not, then it has no effect and other clues are used to distinguish ham from spam. Can you post a full set of clues for such a message? 
Skip From skip at pobox.com Mon Dec 22 12:10:13 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 12:10:29 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: References: <16355.22556.564839.561779@montanaro.dyndns.org> Message-ID: <16359.9589.956992.208750@montanaro.dyndns.org> >> While I was messing with the received header regular expressions >> today I also noticed that Sendmail sometimes adds "may be forged" to >> a header.... >> I'm inclined to trust sendmail on this one and just add it. It seems >> like a very objective feature. Tim> I agree -- it's extremely unlikely to lose. The ones to worry Tim> about are things spammers could inject to push things in the ham Tim> direction, but they're not gonna get far forging "may be forged" Tim> unless I have a *very* weird idea of ham . I just checked in tokenizer.py with this change. Note that it's guarded by options["Tokenizer", "mine_received_headers"]. Skip Tim> I noticed this in the headers of a spam today: Tim> Received: from shawmail-cg-shawcable-net Tim> (c-24-9-163-244.client.comcast.net[24.9.163.244](untrusted sender)) Tim> by rwcrmxc11.comcast.net (rwcrmxc11) with SMTP Tim> id <20031220054919r1100n4pj1e>; Sat, 20 Dec 2003 05:49:20 +0000 Tim> It's the "(untrusted sender)" part that's interesting. I'd suggest Tim> *not* folding that in with "may be forged", though. There probably Tim> aren't a lot of strings of this nature, so the database burden Tim> should be trivial, and I *bet* different strings will prove to have Tim> different spamprobs. You're probably right. In this case it may just be that an ident lookup failed (many servers don't run identd), so the assertion that the message is spam would be much weaker. Poking around Google a bit suggests "(untrusted sender)" is something specific to Comcast. 
I'm happy to add it if you would like, but in the mail I've saved it actually seems to turn up a bit more in ham (six messages) than in spam (one message) and not at all in my current training database. All such lines also match "client2?\.attbi\.com". Skip From gbrown at alumni.caltech.edu Mon Dec 22 12:28:24 2003 From: gbrown at alumni.caltech.edu (Glenn Brown) Date: Mon Dec 22 12:28:35 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: <16359.8821.262470.812169@montanaro.dyndns.org> Message-ID: <01bb01c3c8b1$01927d50$6601a8c0@Glenn> > Only if "deprrravved" is a hammy word for you. If not, then it has no > effect and other clues are used to distinguish ham from spam. Well, after 10K+ spam messages of training, the approach gets a 1% spam score for the following otherwise obvious spam. > Can you post a full set of clues for such a message? Done. --Glenn Spam Score: 1% (0.00944801) word spamprob #ham #spam '*H*' 0.981104 - - '*S*' 1.71005e-007 - - "i've" 0.0338633 513 64 'guys' 0.0546273 191 39 'wet' 0.0661982 93 23 'stan' 0.097452 7 2 'think' 0.099588 845 336 'header:Received:2' 0.106663 534 229 'amazing.' 0.118307 18 8 'got' 0.123214 507 256 'x-mailer:qualcomm windows eudora version 6.0.0.22' 0.125758 38 19 'it.' 0.154291 570 374 'from:addr:aloktorvaldis.com' 0.155172 1 0 'from:addr:ger' 0.155172 1 0 'from:name:detractors m. 
tinnier' 0.155172 1 0 'heyyouguys,' 0.155172 1 0 'huunngg' 0.155172 1 0 'message-id:@aloktorvaldis.com' 0.155172 1 0 'reply-to:addr:aloktorvaldis.com' 0.155172 1 0 'subject:giirrllss' 0.155172 1 0 'subject:soak' 0.155172 1 0 'subject:squiiirrrtt' 0.155172 1 0 'url:aloktorvaldis' 0.155172 1 0 'hit' 0.16835 138 100 'little' 0.170749 412 305 'when' 0.174352 1030 783 'reply-to:no real name:2**0' 0.181731 2589 2071 'well' 0.190478 390 330 'splash' 0.203534 5 4 'what' 0.206076 1117 1044 'skip:o 30' 0.206761 23 21 'they' 0.210826 1024 985 'seen' 0.211543 372 359 'into' 0.220027 586 595 'tiny' 0.220335 32 32 'very' 0.220758 705 719 'then' 0.234163 596 656 'that' 0.257805 2233 2794 'have' 0.264403 2222 2877 'has' 0.264422 1086 1406 'just' 0.264724 1257 1630 'over' 0.265763 837 1091 "won't" 0.269641 153 203 'how' 0.273075 771 1043 "don't" 0.284679 824 1181 'reply-to:addr:stacy' 0.290906 1 1 'faces.' 0.292799 3 4 "it's" 0.299524 726 1118 'totally' 0.308229 47 75 'also' 0.327243 665 1165 'love' 0.329743 174 308 'will' 0.330548 1301 2314 'right' 0.33105 399 711 'want' 0.336363 636 1161 'ever' 0.341633 273 510 'see' 0.343879 847 1599 'truly' 0.348488 36 69 'their' 0.356169 528 1052 'most' 0.362206 485 992 'with' 0.367376 1955 4090 'net.' 
0.371879 10 21 'the' 0.378413 3332 7308 'taken' 0.380944 112 248 'kk0kks' 0.389062 1 2 'masssive' 0.389062 1 2 'sluttttsss' 0.389062 1 2 'squirting' 0.389062 1 2 'ads' 0.3959 34 80 'promise' 0.397485 14 33 'are' 0.399717 1652 3963 'dripping' 0.606785 1 6 'simply' 0.621827 86 510 'believe' 0.622983 140 834 'subject:and' 0.634866 115 721 'anymore' 0.647478 7 47 'address' 0.652343 124 839 'subject:the' 0.653141 147 998 'yours' 0.67376 23 172 'jusdt' 0.689717 1 9 'squirt' 0.689717 1 9 'subject:place' 0.689717 1 9 'skip:d 30' 0.74621 26 277 'here' 0.748105 579 6197 'url:html' 0.760285 284 3247 'subject:that' 0.777933 8 103 'url:face' 0.781771 1 15 'subject:all' 0.78701 9 122 'girls' 0.804866 11 166 'gushing' 0.809961 1 18 'url:index' 0.828674 117 2042 'to:name:gbrown' 0.884536 1 33 'recieve' 0.918175 4 170 Message Stream: Received: from ptd-204-210-94-45.maine.rr.com [204.210.94.45] by alumniweb.ir.caltech.edu (SMTPD32-8.05) id ACDA3B000C4; Sat, 20 Dec 2003 23:33:46 -0800 Received: from aloktorvaldis.com (mail.aloktorvaldis.com [64.156.186.89]) by ptd-204-210-94-45.maine.rr.com (Postfix) with ESMTP id E76C1C464E for ; Sun, 21 Dec 2003 02:33:36 -0500 Message-ID: <6.0.0.22.1.20031221023336.c0bc8684@aloktorvaldis.com> X-Sender: centurions@mail.aloktorvaldis.com X-Mailer: QUALCOMM Windows Eudora Version 6.0.0.22 Reply-To: stacy@aloktorvaldis.com Date: Sun, 21 Dec 2003 02:33:36 -0500 To: Gbrown From: "Detractors M. Tinnier" Subject: giirrllss that squiiirrrtt and soak all over the place MIME-Version: 1.0 Content-Type: text/plain; format=flowed Content-Transfer-Encoding: 7bit X-IMAIL-SPAM-STATISTICS: 1.0000 X-RCPT-TO: Status: U X-UIDL: 371502929 heyyouguys, Gushing and squirting teeeenn girls that splash and squirt all over their boyfriend's faces. Simply the wetest most totally dripping pusssiesss you have got to see on the net. 
These girls are so soaking wet that you won't believe how they squirt all over the masssive kk0kks of these very well huunngg guys into their tiny little cooochiees. It's totally amazing. And these are the pretiest little young sluttttsss that I've think I've ever seen when it comes to this and this site has a phrreeeee triall to go with it. http://www.aloktorvaldis.com/dollar/face/index.html You'll love what you see I just promise you. DAcLBBIXKwQVHggXAksaCgkNDgYRRQAdHg== Also if you don't want to recieve anymore of these ads from me then you jusdt have to hit on the address right here just hit it and you will be taken offitwww.splasterastem.com/wanton/ and you will be taken off! yours truly stan Message Tokens: 130 unique tokens 'address' 'ads' 'all' 'also' 'amazing.' 'and' 'anymore' 'are' 'believe' "boyfriend's" 'cc:none' 'comes' 'content-type:text/plain' 'cooochiees.' "don't" 'dripping' 'ever' 'faces.' 'from' 'from:addr:aloktorvaldis.com' 'from:addr:ger' 'from:name:detractors m. tinnier' 'girls' 'got' 'gushing' 'guys' 'has' 'have' 'header:Date:1' 'header:From:1' 'header:MIME-Version:1' 'header:Message-ID:1' 'header:Received:2' 'header:Reply-To:1' 'header:Subject:1' 'header:To:1' 'here' 'heyyouguys,' 'hit' 'how' 'huunngg' "i've" 'into' "it's" 'it.' 'jusdt' 'just' 'kk0kks' 'little' 'love' 'masssive' 'message-id:@aloktorvaldis.com' 'most' 'net.' 'off!' 
'over' 'phrreeeee' 'pretiest' 'promise' 'proto:http' 'pusssiesss' 'recieve' 'reply-to:addr:aloktorvaldis.com' 'reply-to:addr:stacy' 'reply-to:no real name:2**0' 'right' 'see' 'seen' 'sender:none' 'simply' 'site' 'skip:d 30' 'skip:o 30' 'sluttttsss' 'soaking' 'splash' 'squirt' 'squirting' 'stan' 'subject: ' 'subject:all' 'subject:and' 'subject:giirrllss' 'subject:over' 'subject:place' 'subject:soak' 'subject:squiiirrrtt' 'subject:that' 'subject:the' 'taken' 'teeeenn' 'that' 'the' 'their' 'then' 'these' 'they' 'think' 'this' 'tiny' 'to:2**0' 'to:addr:alumni.caltech.edu' 'to:addr:gbrown' 'to:name:gbrown' 'totally' 'triall' 'truly' 'url:aloktorvaldis' 'url:com' 'url:dollar' 'url:face' 'url:html' 'url:index' 'url:www' 'very' 'want' 'well' 'wet' 'wetest' 'what' 'when' 'will' 'with' "won't" 'x-mailer:qualcomm windows eudora version 6.0.0.22' 'you' "you'll" 'you.' 'young' 'yours' From popiel at wolfskeep.com Mon Dec 22 12:39:15 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 22 12:39:19 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: Message from "Glenn Brown" of "Mon, 22 Dec 2003 09:28:24 PST." <01bb01c3c8b1$01927d50$6601a8c0@Glenn> References: <01bb01c3c8b1$01927d50$6601a8c0@Glenn> Message-ID: <20031222173915.C01132DF61@cashew.wolfskeep.com> In message: <01bb01c3c8b1$01927d50$6601a8c0@Glenn> "Glenn Brown" writes: >> Only if "deprrravved" is a hammy word for you. If not, then it has no >> effect and other clues are used to distinguish ham from spam. >> Can you post a full set of clues for such a message? > >Done. >'from:addr:aloktorvaldis.com' 0.155172 1 0 >'from:addr:ger' 0.155172 1 0 >'from:name:detractors m. 
tinnier' 0.155172 1 0 >'heyyouguys,' 0.155172 1 0 >'huunngg' 0.155172 1 0 >'message-id:@aloktorvaldis.com' 0.155172 1 0 >'reply-to:addr:aloktorvaldis.com' 0.155172 1 0 >'subject:giirrllss' 0.155172 1 0 >'subject:soak' 0.155172 1 0 >'subject:squiiirrrtt' 0.155172 1 0 >'url:aloktorvaldis' 0.155172 1 0 Based on these clues, I'd say that you trained on one of these messages as ham. That'll certainly encourage a ham classification for them. What happens if you untrain and then retrain as spam? - Alex From tim.one at comcast.net Mon Dec 22 12:50:25 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 12:50:35 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: <20031222173915.C01132DF61@cashew.wolfskeep.com> Message-ID: [Glenn Brown] >> 'from:addr:aloktorvaldis.com' 0.155172 1 0 >> 'from:addr:ger' 0.155172 1 0 >> 'from:name:detractors m. tinnier' 0.155172 1 0 >> 'heyyouguys,' 0.155172 1 0 >> 'huunngg' 0.155172 1 0 >> 'message-id:@aloktorvaldis.com' 0.155172 1 0 >> 'reply-to:addr:aloktorvaldis.com' 0.155172 1 0 >> 'subject:giirrllss' 0.155172 1 0 >> 'subject:soak' 0.155172 1 0 >> 'subject:squiiirrrtt' 0.155172 1 0 >> 'url:aloktorvaldis' 0.155172 1 0 [T. Alexander Popiel] > Based on these clues, I'd say that you trained on one of these > messages as ham. That'll certainly encourage a ham classification > for them. Yup, looks certain -- or else Glenn makes some mighty fine distinctions about which kinds of porn spam he *wants* to see . This line: 'reply-to:addr:stacy' 0.290906 1 1 also tells us the database was trained on a lot more spam than ham (a token appearing equally often in both ends up with a decidedly hammy spamprob). Glenn, you should find that spambayes works better if you train on *less* spam (or more ham -- the math works out best if you train on an approximately equal number of each). This database isn't wildly unbalanced, but it's beyond the point where my classifier starts acting flaky. 
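The two numbers discussed above can be checked with a little arithmetic.
The sketch below assumes the classifier's Robinson-style adjustment with
unknown_word_strength S = 0.45 and unknown_word_prob X = 0.5; those values
are taken here as assumptions (they are believed to be the defaults), and
the function names are made up for illustration.

```python
S = 0.45  # assumed unknown_word_strength
X = 0.5   # assumed unknown_word_prob


def spamprob(hamcount, spamcount, nham, nspam, s=S, x=X):
    """Adjusted spam probability for one token.

    hamcount/spamcount: trained messages of each kind containing the token.
    nham/nspam: total trained ham and spam messages.
    The token must have been seen at least once (hamcount + spamcount > 0).
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * prob) / (s + n)


def implied_spam_to_ham_ratio(observed, s=S, x=X):
    """Back out nspam/nham from the spamprob of a token seen once in each."""
    n = 2
    prob = (observed * (s + n) - s * x) / n   # undo the adjustment
    return (1 - prob) / prob                  # prob == nham / (nham + nspam)


if __name__ == "__main__":
    # Seen once in ham, never in spam: independent of nham and nspam.
    print(round(spamprob(1, 0, 1000, 3000), 6))
    # 'reply-to:addr:stacy' was listed at 0.290906 with counts 1 and 1.
    print(round(implied_spam_to_ham_ratio(0.290906), 2))
```

Under those assumptions a token seen once in ham and never in spam comes
out at 0.225/1.45, about 0.155172, which matches the clue list, and the
0.290906 for 'reply-to:addr:stacy' implies roughly three spam trained for
every ham.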
From nobody at spamcop.net Mon Dec 22 13:04:03 2003 From: nobody at spamcop.net (Seth Goodman) Date: Mon Dec 22 13:04:06 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: > >>> [Seth Goodman] > >>> I know you're not arguing that, but if there were bidirectional > >>> msg_id <-> feature_ID maps, it would be fairly easy to expire whole > >>> messages. > >>> > >>> That would obviate the need to track last time seen for every token. > > >> [Tim Peters] > >> Only if you don't want also to be able to expire tokens on their own. > > > [T. Alexander Popiel] > > No... just find the most recent message that the token appeared in, > > which would be a quick search through a few message times. A really > > quick search if you're only looking to expire hapaxes. > > [Tim Peters] > I don't want to expire a hapax if it's been used recently in *scoring*. > Message times can't distinguish used from unused features. If > you're doing > train-on-everything (with or without whole-msg expiration), a > hapax used in > scoring becomes a non-hapax the first time it's used in scoring. For But for really unusual messages of the type you were concerned about, this may only happen once a year, or so, which is too long for a hapax-expiration scheme. > mistake/unsure training, a hapax used in scoring remains a hapax if the > message being scored ends up correctly classified. Hapaxes that are never > seen again also remain hapaxes. Distinguishing used from unused requires > recording use. -------------------------------------- I'm reposting an earlier post that didn't receive any comments (poor netiquette, I know) because I feel it's relevant to both comments made subsequently in this thread and the question of expiring hapaxes not recently used vs. whole messages. I also asked for a little help getting started to be able to test some of my own and/or other peoples' ideas and would still like to do that, unless you folks would prefer otherwise. 
I've noticed that hapaxes do seem to contribute to scoring when the training set is small and I think I've seen others make similar comments. This also may be the case for really odd messages. So please forgive me for the repost, but here it is: > [Tim Peters] > There are messages I never want to expire. That creates major new UI > headaches to be doable. I believe (but don't yet know) that expiring > hapaxes can be done without need for user intervention, and without harm. I hope the "without harm" part is true. See my question two sections down. > [Tim Peters] > At some point, if you want to try your ideas, *try* your ideas -- > that's what Open Source is all about. Everyone is born knowing how to > program in Python, although most don't realize it until they try. I admit I wasn't aware that I could program in Python since birth, but I'm willing to take your word on that. We all have hidden potential. So that I don't have to re-invent that round thing with the axle in the middle, could someone please give me some hints as to which of the mapping features we've discussed in this thread exist or will soon exist and where I can look for them? I saw on spambayes-dev that there is discussion of a new database, so I don't want to go off on a useless fork with the present db if that comes to pass. Search for your inner newbie when you answer this. > > [Seth Goodman] > > I agree completely. This was an important motivation for expiring a > > whole message at a time. Training mistakes would eventually drop out > > of the database without user intervention. Not that a tool to help > > track down training mistakes wouldn't be great, but a "casual" user > > could still make occasional mistakes and the system would recover by > > itself. > > [Tim Peters] > Without intervention, it will also expire the screaming bright-red HTML > birthday message sent by my favorite 7-year-old niece, and when > she's 8 the > next one may get tagged as spam. 
These are the kinds of messages I never > want to expire. ... Here lies my concern. I sincerely hope that correct classification of these infrequent, unusual messages is not hapax-driven. If it is, the result of pruning infrequently-used hapaxes will be as bad as deleting the whole message. If that is the case, the _only_ solution will be to keep either those hapaxes or the whole message trained forever. Either way, I agree this is a big UI problem without an obvious intuitive solution. It does appear from looking at the scoring of some of my "typical" messages that hapaxes don't contribute much, as you've said before. Could you look at the scoring of a couple of those special messages and tell if their scoring would be seriously affected if the hapaxes were gone? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From skip at pobox.com Mon Dec 22 13:09:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 13:09:53 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: <01bb01c3c8b1$01927d50$6601a8c0@Glenn> References: <16359.8821.262470.812169@montanaro.dyndns.org> <01bb01c3c8b1$01927d50$6601a8c0@Glenn> Message-ID: <16359.13160.783191.328164@montanaro.dyndns.org> Looks like you have a mistake in your training: Glenn> word spamprob #ham #spam ... Glenn> 'from:addr:aloktorvaldis.com' 0.155172 1 0 Glenn> 'from:addr:ger' 0.155172 1 0 Glenn> 'from:name:detractors m. tinnier' 0.155172 1 0 Glenn> 'heyyouguys,' 0.155172 1 0 Glenn> 'huunngg' 0.155172 1 0 Glenn> 'message-id:@aloktorvaldis.com' 0.155172 1 0 Glenn> 'reply-to:addr:aloktorvaldis.com' 0.155172 1 0 Glenn> 'subject:giirrllss' 0.155172 1 0 Glenn> 'subject:soak' 0.155172 1 0 Glenn> 'subject:squiiirrrtt' 0.155172 1 0 Glenn> 'url:aloktorvaldis' 0.155172 1 0 You said that message was spam, yet the above suggests you trained on it as ham one time. 
My guess is that if you untrained it, the outcome would be unsure or spam. Skip From skip at pobox.com Mon Dec 22 13:15:36 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 13:15:43 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: References: Message-ID: <16359.13512.453204.264142@montanaro.dyndns.org> >> [Tim Peters] >> I don't want to expire a hapax if it's been used recently in >> *scoring*. Message times can't distinguish used from unused >> features. If you're doing train-on-everything (with or without >> whole-msg expiration), a hapax used in scoring becomes a non-hapax >> the first time it's used in scoring. For Seth> But for really unusual messages of the type you were concerned Seth> about, this may only happen once a year, or so, which is too long Seth> for a hapax-expiration scheme. Under the heading of "practicality beats purity"... If you know a given type of message is ham but is seen infrequently, train on it twice. That makes sure none of its tokens are hapaxes, and are thus never candidates for deletion. Hmmm... That violates my "never train on a message twice" dictum. Skip From tim.one at comcast.net Mon Dec 22 13:35:30 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 13:35:34 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Seth Goodman] > But for really unusual messages of the type you were concerned about, > this may only happen once a year, or so, which is too long for a > hapax-expiration scheme. Yes, and I'm aware of that. > I'm reposting an earlier post that didn't receive any comments (poor > netiquette, I know) because I feel it's relevant to both comments made > subsequently in this thread and the question of expiring hapaxes not > recently used vs. whole messages. 
I also asked for a little help
> getting started to be able to test some of my own and/or other
> people's ideas and would still like to do that, unless you folks
> would prefer otherwise.

Sorry, I can't make time to reply now.  Your original message is still
sitting in my queue (actually, several of your msgs are -- you write a lot,
you know), and I'll get to it when I can.

Let's do the easy ones:

> could someone please give me some hints as to which of the mapping
> features we've discussed in this thread exist

None.  We map string features to pairs of little integers (ham count and
spam count) now, and that's all.

> or will soon exist

Also none.

> and where I can look for them?

For now, somewhere over the rainbow.

> I saw on spambayes-dev that there is discussion of a new database,

Also just speculation at this time.  We "have problems" with the most-common
Berkeley back end now (there are several other database back ends you
*could* configure spambayes to use already), and mostly those threads are
trying to find ways to sidestep those problems.  "Problems" == error
messages from Berkeley saying that the database is corrupted.  It's very
unusual to see these in the Outlook addin, but it has happened.  For some
people on Linux, they seem downright common.

> so I don't want to go off on a useless fork with the present db if
> that comes to pass.

Try to say something more specific about what you want to investigate, and
you'll probably get a better answer.

From nobody at spamcop.net Mon Dec 22 13:40:40 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Mon Dec 22 13:41:56 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <16359.13512.453204.264142@montanaro.dyndns.org>
Message-ID:

> >> [Tim Peters]
> >> I don't want to expire a hapax if it's been used recently in
> >> *scoring*.  Message times can't distinguish used from unused
> >> features.
If you're doing train-on-everything (with or without > >> whole-msg expiration), a hapax used in scoring becomes a non-hapax > >> the first time it's used in scoring. For > > Seth> But for really unusual messages of the type you were concerned > Seth> about, this may only happen once a year, or so, which > is too long > Seth> for a hapax-expiration scheme. > > [Skip Montanaro] > Under the heading of "practicality beats purity"... > > If you know a given type of message is ham but is seen infrequently, train > on it twice. That makes sure none of its tokens are hapaxes, and are thus > never candidates for deletion. Great point. That solves the problem for hapax expiration and unusual messages. > [Skip Montanaro] > Hmmm... That violates my "never train on a message twice" dictum. Since you're thinking pragmatically, don't worry about the dictum. Presumably, you would only do this rarely, i.e. on messages the likes of which you only expect a couple times a year. For the Outlook version, you would have to make a copy of the message and train on that, but it would still solve the problem. Just out of curiosity, does the proxy version of SpamBayes have the same protection as the Outlook version against training on the same msg_id twice? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From sethg at GoodmanAssociates.com Mon Dec 22 14:06:14 2003 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Mon Dec 22 14:06:16 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: > [Tim Peters] > Sorry, I can't make time to reply now. Your original message is still > sitting in my queue (actually, several of your msgs are -- you > write a lot, > you know ), and I'll get to it when I can. Sorry about that. I'll try to keep the noise level down. > > [Seth Goodman] > > so I don't want to go off on a useless fork with the present db if > > that comes to pass. 
>
> [Tim Peters]
> Try to say something more specific about what you want to investigate, and
> you'll probably get a better answer.

I would like to investigate whole message expiration with different
training and expiration schemes.  From our previous discussion, it seems
that the most flexible way to approach this is by going to a system with
several bidirectional maps implemented in the databases:

    feature_id <-> token
    msg_id (+ training timestamp) <-> feature_id
    token database w/training timestamp per entry

Instead of training timestamp, expiration time might be preferable.  If
none of this exists, I guess I need to start there.  I was hoping that some
of this might exist since you are already experimenting with hapax
expiration.  I thought I read that there was experimental code that mapped
tokens to the message_id's they were trained from, but that may have been
wishful thinking.  In any case, these are all significant database changes,
and I was afraid to go off half-cocked if the underlying database was not
going to hang around.

Any advice as to how to proceed would be appreciated.  If this is too
ambitious for a first project, please help me pare it down.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above

From tim.one at comcast.net Mon Dec 22 14:20:14 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 14:20:25 2003
Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged"
In-Reply-To: <16359.9589.956992.208750@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> ...
> Poking around Google a bit suggests "(untrusted sender)" is something
> specific to Comcast.  I'm happy to add it if you would like, but in
> the mail I've saved it actually seems to turn up a bit more in ham
> (six messages) than in spam (one message) and not at all in my
> current training database.  All such lines also match
> "client2?\.attbi\.com".
It really doesn't matter whether it looks hammy or spammy to you -- each person's classifier learns "what works" for that person's email mix. IOW, I'm not looking for "spam clues" here, I'm looking for potentially interesting raw data to throw at the classifier, be that hammy or spammy or neutral. It's the classifier's job to *learn* what's useful, but it can only see what we explicitly show it. A generalization of this gimmick finds several potentially interesting Received comments in my current little training database: 'received:(built aug 5\n 2002)' spam: 0 ham: 1 'received:(built aug 5 2002)' spam: 0 ham: 1 'received:(built mar 18 2003)' spam: 0 ham: 2 'received:(built may\n 14 2003)' spam: 0 ham: 1 'received:(built may 7 2001)' spam: 0 ham: 1 'received:(built may 13 2002)' spam: 0 ham: 3 'received:(built may 14 2003)' spam: 0 ham: 6 'received:(built nov\n 25 2002)' spam: 0 ham: 2 'received:(built nov 6 2002)' spam: 0 ham: 2 'received:(built nov 25 2002)' spam: 0 ham: 3 'received:(built nov 6\n 2002)' spam: 0 ham: 2 'received:(built sep 23\n 2002)' spam: 0 ham: 1 'received:(built sep 23 2002)' spam: 0 ham: 2 'received:(helo bala)' spam: 0 ham: 1 'received:(helo cyb)' spam: 0 ham: 1 'received:(helo gamer)' spam: 0 ham: 1 'received:(helo hp751n)' spam: 0 ham: 1 'received:(helo mailscanner)' spam: 0 ham: 1 'received:(may\n\tbe forged)' spam: 0 ham: 1 'received:(no client certificate requested)' spam: 0 ham: 3 'received:(qmail 20043 invoked from network)' spam: 0 ham: 1 'received:(qmail 20649 invoked from network)' spam: 0 ham: 1 'received:(qmail 20705 invoked from network)' spam: 0 ham: 1 'received:(qmail 29420 invoked from network)' spam: 0 ham: 1 'received:(qmail 30856 invoked from network)' spam: 0 ham: 1 'received:(qmail 59242 invoked by uid 1002)' spam: 0 ham: 1 'received:(qmail 6276 invoked by uid 99)' spam: 0 ham: 1 'received:(qmail 6378 invoked from network)' spam: 0 ham: 1 'received:(qmail 6383 invoked from network)' spam: 0 ham: 1 'received:(qmail 76214 
invoked by uid 0)' spam: 0 ham: 1
'received:(qmail 94959 invoked by uid 399)' spam: 0 ham: 1

'received:(built feb 13 2003)' spam: 1 ham: 1

'received:(helo 3sfm)' spam: 1 ham: 0
'received:(helo d1e)' spam: 1 ham: 0
'received:(helo lsi)' spam: 1 ham: 0
'received:(helo s9rr4v)' spam: 1 ham: 0
'received:(helo timslaptop)' spam: 1 ham: 0
'received:(helo xtr)' spam: 1 ham: 0
'received:(qmail 13979 invoked from network)' spam: 1 ham: 0
'received:(qmail 5950 invoked by uid 500)' spam: 1 ham: 0
'received:(sasktel mail service)' spam: 1 ham: 0

'received:(smtp server)' spam: 2 ham: 1
'received:(misconfigured sender)' spam: 12 ham: 5
'received:(may be forged)' spam: 3 ham: 1
'received:(untrusted sender)' spam: 9 ham: 3

Note that one of the "may be forged" comments there was split across lines
('(may\n\tbe forged)').

That was done via adding

    received_complaints_re = re.compile(r'\(\w+(?:\s+\w+)+\)')

and replacing

    if header.lower().find('may be forged') != -1:
        yield 'received:may be forged'

with

    for x in received_complaints_re.findall(header.lower()):
        yield 'received:' + x

Since these feed into bigrams too, there are a lot more combinations.  Some
are purely spammy so far:

'bi:received:(untrusted sender) received:ca' spam: 3 ham: 0
'bi:received:63.240.213.250 received:(may be forged)' spam: 3 ham: 0

and some are purely hammy so far:

'bi:received:(built may 14 2003) received:172' spam: 0 ham: 5

From popiel at wolfskeep.com Mon Dec 22 14:27:27 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 14:27:30 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: Message from "Seth Goodman" of "Mon, 22 Dec 2003 12:04:03 CST."
References:
Message-ID: <20031222192727.874882DF61@cashew.wolfskeep.com>

In message: "Seth Goodman" writes:
>could someone please give me some hints as to which of the mapping features
>we've discussed in this thread exist or will soon exist and where I can look
>for them?
If we're trying to get reproducible results, then I strongly suggest looking at the various testing frameworks we have built. I haven't done any stuff with maintaining the mappings, but my expire4months regime does keep message lists for expiry. Building a mapping on top of that shouldn't be too difficult... >I saw on spambayes-dev that there is discussion of a new database, so >I don't want to go off on a useless fork with the present db if that >comes to pass. Again, if we're trying to get reproducible results, then I think that the main DB and such is the wrong place to be starting. We shouldn't be treating just anecdotal evidence from running changed code with our ongoing live mail feeds as the best we can do. While the Outlook plugin has done wonders for our popularity, it seems to have utterly destroyed our rigor. People now typically don't have the slightest clue how to go from their normal usage to a testing deployment... or at least don't know how to extract their mail from Outlook's clutches so that they have data to work _on_. As I don't use Outlook in any environment where I see spam, I don't know how to write the newbie guide to fix this... if indeed it is fixable. I'm spoiled by doing most of my mail handling in an environment which encourages treating mail as data to be arbitrarily processed, instead of just viewed through a gui. - Alex From skip at pobox.com Mon Dec 22 14:55:41 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 14:55:59 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: References: Message-ID: <16359.19517.682605.648156@montanaro.dyndns.org> Seth> I would like to investigate whole message expiration with Seth> different training and expiration schemes. 
From our previous Seth> discussion, it seems that the most flexible way to approach this Seth> is by going to a system with the several bidirectional maps Seth> implemented in the databases: feature_id <-> token, msg_id (+ Seth> training timestamp) <-> feature_id and token database w/training Seth> timestamp per entry. Instead of training timestamp, expiration Seth> time might be preferable. I'll just toss out a thought with nothing really to back it up besides my seat-of-the-pants experience. You might find it easier to experiment with different table layouts using SQL. There are both MySQL and PostgreSQL classifiers available (browse spambayes/storage.py). You could add new tables or new columns to existing tables without much fuss. Also, hapax expiration would be pretty simple. (Add a last_used column, arrange for it to get incremented whenever a row is fetched - fairly trivial with PostgreSQL's triggers I think, then use it to expire hapaxes periodically.) Finally, problems of multi-thread or multi-process access to the database should go away. Skip From tim.one at comcast.net Mon Dec 22 15:10:38 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 15:10:46 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <20031222192727.874882DF61@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > Again, if we're trying to get reproducible results, then I think that > the main DB and such is the wrong place to be starting. Right! > We shouldn't be treating just anecdotal evidence from running changed > code with our ongoing live mail feeds as the best we can do. We're really not, Alex. It's just a source of ideas to try, and nothing has changed as a result of it (some experimental, non-default options have been added, but that's it). > While the Outlook plugin has done wonders for our popularity, it seems > to have utterly destroyed our rigor. I'm still comfortable with what's been checked in. 
While there's been massive refactoring of the code, very little has changed in how messages get tokenized and scored. Nothing material has changed in classifier.py, except for removing experimental_ham_spam_imbalance_adjustment support, and there was plenty of evidence that that gimmick hurt more than it helped, and more so the more unbalanced training got. It was a proven loser (since I wrote it to begin with, I'm biased in its favor ). I did check in a few material changes to tokenizer.py over the last year without full-scale testing. These were all in the nature of untangling HTML obfuscations, so that the classifier got a better idea of what the human email reader sees, instead of tokenizing mountains of raw numeric character entities, nonsense tags, and other coding tricks unique to HTML. That was driven by staring at low-scoring unsures, and identifying tricks that had no purpose beyond disguising the rendered content. Tests (on my own email and on my original large test data) showed that de-obfuscating that stuff was a pure win, so I was willing to risk that much. I'm hard pressed to think of other default behavior that's changed. > People now typically don't have the slightest clue how to go from > their normal usage to a testing deployment... or at least don't know > how to extract their mail from Outlook's clutches so that they have > data to work _on_. That's for sure, and is one reason nothing else material *has* been checked in. Mark knows how to extract email from Outlook for usable testing, and wrote some code to help do that, but I haven't yet had time to figure out how it's done myself. I'm sure very few Outlook users have. I agree that needs to change. I've been speculating about lots of stuff lately, but I have no intention of checking in any of that as default behavior without full-blown, multi-corpus rigorous testing. > As I don't use Outlook in any environment where I see spam, I don't > know how to write the newbie guide to fix this... 
if indeed it is > fixable. I'm spoiled by doing most of my mail handling in an > environment which encourages treating mail as data to be arbitrarily > processed, instead of just viewed through a gui. OTOH, Outlook users are spoiled by that GUI, deeply integrated with spambayes. It's truly a joy to use, day-to-day. Training spambayes effectively via the Outlook UI remains more than a bit of a puzzle, though, and that extends in part to everyone who isn't prepared to retrain from scratch at the drop of a pin. There's a growing disconnect that way between what developers are happy to do, and what "real users" are able to tolerate. That's worth some thought too. From skip at pobox.com Mon Dec 22 15:29:58 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 15:30:13 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: References: <16359.9589.956992.208750@montanaro.dyndns.org> Message-ID: <16359.21574.614482.990083@montanaro.dyndns.org> Tim> A generalization of this gimmick finds several potentially Tim> interesting Received comments in my current little training Tim> database: ... Interesting scheme. When I tried that I got swamped by '(qmail NNN ...' stuff, where it appears that NNN is a process id. To retain this in its current form I suspect we'd have to either specifically eliminate such features or implement hapax expiration. Tim> Note that one of the "may be forged" comments there was split Tim> across lines ('(may\n\tbe forged)'). Perhaps we should add header = re.sub(r'\s+', ' ', header) to the "for header ..." loop in any case? It seems that many other headers get split that way. If we're looking for features which include whitespace we should probably normalize it. I'm willing to tuck the more general received sifting into the tokenizer controlled by a new experimental option. Let me know if you want me to take that step. 
Skip From igidon at resystemsgroup.com Mon Dec 22 15:44:44 2003 From: igidon at resystemsgroup.com (Ira L. Gidon) Date: Mon Dec 22 15:44:57 2003 Subject: [spambayes-dev] Duplicate E-Mails Message-ID: <00d501c3c8cc$73d55430$1e14a8c0@ILGToshiba> I am running on a laptop using Windows XP. I am using Outlook 2002 SP-2 for e-mail. I installed Spambayes and I started getting duplicate e-mails (same e-mail with same date/time stamp). It appears that there is a delay in notifying my Exchange server that the e-mail has been downloaded. I can sometimes get 3 or 4 copies of the same e-mail. This definitely started at the same time as the installation of Spambayes. I was hoping someone can provide me with a solution. Thanks! Ira Gidon From tim.one at comcast.net Mon Dec 22 15:50:54 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 15:50:57 2003 Subject: [spambayes-dev] Duplicate E-Mails In-Reply-To: <00d501c3c8cc$73d55430$1e14a8c0@ILGToshiba> Message-ID: [Ira L. Gidon] > I am running on a laptop using Windows XP. > I am using Outlook 2002 SP-2 for e-mail. > > I installed Spambayes and I started getting duplicate e-mails (same > e-mail with same date/time stamp). It appears that there is a delay > in notifying my Exchange server that the e-mail has been downloaded. > I can sometimes get 3 or 4 copies of the same e-mail. This definitely > started at the same time as the installation of Spambayes. > > I was hoping someone can provide me with a solution. Try SpamBayes -> SpamBayes Manager ... -> Advanced and check the "Enable background filtering" box. That cures a lot of strange Outlook symptoms, and will be enabled by default the next time the Outlook addin is released. I don't know whether it will cure your problem (I've never seen it happen myself), but it's easy to try.
From popiel at wolfskeep.com Mon Dec 22 16:09:18 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 22 16:09:21 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Seth Goodman" of "Mon, 22 Dec 2003 13:06:14 CST." References: Message-ID: <20031222210918.1EB742DF61@cashew.wolfskeep.com> In message: "Seth Goodman" writes: > >I would like to investigate whole message expiration with different training >and expiration schemes. Ah, in that case, definitely look at the incremental framework that I built. I have various training regimes that do train-on-everything vs. mistake-only, as well as one which expires stuff based on time. Making more regimes to do various other things should be very easy. >From our previous discussion, it seems that the most flexible way to >approach this is by going to a system with the several bidirectional >maps implemented in the databases: feature_id <-> token, msg_id (+ >training timestamp) <-> feature_id and token database w/training >timestamp per entry. Instead of training timestamp, expiration time >might be preferable. Definite overkill. Most of this won't be needed for any given regime, and will instead just bloat the transient data requirements during testing. Just make each regime keep track of the data it needs to do whatever it wants to do. - Alex From tim.one at comcast.net Mon Dec 22 16:41:17 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 16:41:25 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: <16359.21574.614482.990083@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Interesting scheme. When I tried that I got swamped by '(qmail NNN > ...' stuff, where it appears that NNN is a process id. To retain > this in its current form I suspect we'd have to either specifically > eliminate such features or implement hapax expiration. 
Changing the regexp to use [a-z] instead of \w would weed out all that stuff. I didn't see any containing numbers that looked promising. The all-text ones looked interesting, though. > Perhaps we should add > > header = re.sub(r'\s+', ' ', header) > > to the "for header ..." loop in any case? There are many "for header" loops, and I'm not sure which one(s) you're talking about here. If you want to do this somewhere, header = ' '.join(header.split()) is faster. > It seems that many other headers get split that way. If we're > looking for features which include whitespace we should probably > normalize it. I doubt this is often a concern. It's dangerous to make basic changes "in general", so don't do it except where there's a specific need. It should be fine in Received lines. As a counter-example, Subject line parsing *wants* to know whether tab characters appear, and runs of multiple spaces are also significant there. It's irrelevant to parsing of multi-line address headers (like To and Cc) because email.Utils.getaddresses() is already used for those, and already hides the line structure. > I'm willing to tuck the more general received sifting into the > tokenizer controlled by a new experimental option. Let me know if > you want me to take that step. No, I don't want another experimental option just for this. It seems clear enough already that "may be forged" is potentially interesting, and also that "may be forged" isn't the only potentially interesting string. We should suck up a bunch of them, or none of them. The classifier will learn which are and aren't useful, and it sure looks like that will vary depending on user (that one of my ISPs is Comcast and one of yours isn't is not a good reason to poo-poo the clues Comcast leaves behind ). From popiel at wolfskeep.com Mon Dec 22 16:54:35 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 22 16:54:39 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" of "Mon, 22 Dec 2003 15:10:38 EST." References: Message-ID: <20031222215435.CE64A2DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[T. Alexander Popiel] > >> We shouldn't be treating just anecdotal evidence from running changed >> code with our ongoing live mail feeds as the best we can do. > >We're really not, Alex. It's just a source of ideas to try, and nothing has >changed as a result of it (some experimental, non-default options have been >added, but that's it). You're right, and I'm being overly emphatic. The significant work over the last year has almost entirely been with the Outlook integration; the original core of the project has gone fairly dormant. For UI stuff, you don't need rigor (unless you're Don Norman), and I've been letting some of that bleed over into my perception of all of the recent progress. >I did check in a few material changes to tokenizer.py over the last year >without full-scale testing. These were all in the nature of untangling HTML >obfuscations, so that the classifier got a better idea of what the human >email reader sees, instead of tokenizing mountains of raw numeric character >entities, nonsense tags, and other coding tricks unique to HTML. These are definitely good changes. The header whitespace normalization that's been suggested in a separate thread may also be, though I'm less certain of that one; since the vast majority of people don't look at the headers, I suspect there's a greater chance of something quirky but useful there that'd be obscured by the normalization. (I suppose it depends on whether intermediate mailservers unwrap and rewrap the headers...) >> I'm spoiled by doing most of my mail handling in an >> environment which encourages treating mail as data to be arbitrarily >> processed, instead of just viewed through a gui. > >OTOH, Outlook users are spoiled by that GUI, deeply integrated with >spambayes. It's truly a joy to use, day-to-day. 
Training spambayes >effectively via the Outlook UI remains more than a bit of a puzzle, though, >and that extends in part to everyone who isn't prepared to retrain from >scratch at the drop of a pin. There's a growing disconnect that way between >what developers are happy to do, and what "real users" are able to tolerate. >That's worth some thought too. I don't use a gui at all from my normal mail, so I really don't know what it would be like to have spambayes 'tightly integrated'. As it is, I've got a couple folders set up for spambayes use, and some procmail stuff... but any retraining or corrections only take effect in my nightly rebuild-the-database-from-scratch, unless I go out of my way to kick off a rebuild early. I think the biggest disconnect by far is whether or not people are willing to keep every single piece of mail they get for months or years at a time. That's what I'm doing now... but I think I can count the number of people who do that on one hand. The next test that I'm actually interested in doing is a comparison between training on everything and training on everything that isn't 1.00 or 0.00 (rounded). I may post a regime for that shortly. - Alex From skip at pobox.com Mon Dec 22 16:58:59 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 16:59:09 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: References: <16359.21574.614482.990083@montanaro.dyndns.org> Message-ID: <16359.26915.135330.791705@montanaro.dyndns.org> Tim> Changing the regexp to use [a-z] instead of \w would weed out all Tim> that stuff. I'll give that a try. Thanks. >> Perhaps we should add >> >> header = re.sub(r'\s+', ' ', header) >> >> to the "for header ..." loop in any case? Tim> There are many "for header" loops, and I'm not sure which one(s) Tim> you're talking about here. If you want to do this somewhere, Tim> header = ' '.join(header.split()) Tim> is faster. Okay. 
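For anyone following along, here is a minimal sketch of the two whitespace-normalization spellings quoted above. It is not taken from the SpamBayes source; the function names and the sample strings are made up for illustration. Both collapse a folded header to single spaces, and they differ only in how leading and trailing whitespace is treated.

```python
import re


def normalize_with_re(header):
    # Replace every run of whitespace (spaces, tabs, newlines) with one space.
    # Leading or trailing runs become a single leading or trailing space.
    return re.sub(r'\s+', ' ', header)


def normalize_with_split(header):
    # split() with no argument splits on whitespace runs and drops empty
    # leading/trailing pieces; join() glues the words back with one space.
    return ' '.join(header.split())


folded = "(may\n\tbe forged)"          # hypothetical folded Received comment
print(normalize_with_re(folded))       # prints (may be forged)
print(normalize_with_split(folded))    # prints (may be forged)

padded = "  a \t b  "
print(repr(normalize_with_re(padded)))     # prints ' a b '
print(repr(normalize_with_split(padded)))  # prints 'a b'
```

For matching parenthesized comments inside a Received line the difference at the edges should not matter, since the comments sit in the interior of the header.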
I was just referring to the loop over the Received headers in the section of code we've been messing with. >> I'm willing to tuck the more general received sifting into the >> tokenizer controlled by a new experimental option. Let me know if >> you want me to take that step. Tim> No, I don't want another experimental option just for this. It Tim> seems clear enough already that "may be forged" is potentially Tim> interesting, and also that "may be forged" isn't the only Tim> potentially interesting string. We should suck up a bunch of them, Tim> or none of them. The classifier will learn which are and aren't Tim> useful, and it sure looks like that will vary depending on user Tim> (that one of my ISPs is Comcast and one of yours isn't is not a Tim> good reason to poo-poo the clues Comcast leaves behind ). Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted sender)". I posted a note to comp.mail.misc asking for equivalents to "(may be forged)" for other MTAs. I'll see if anything interesting turns up which warrants investigation. Skip From tim.one at comcast.net Mon Dec 22 17:28:33 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 17:28:38 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: <16359.26915.135330.791705@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Okay. I was just referring to the loop over the Received headers in > the section of code we've been messing with. Cool! The line structure clearly doesn't do anything except get in the way for us there. > ... > Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted > sender)". I posted a note to comp.mail.misc asking for equivalents > to "(may be forged)" for other MTAs. I'll see if anything > interesting turns up which warrants investigation. Don't you think this is a "stupid beats smart" kind of thing? I do.
Besides those strings, "(no client certificate requested)" is 100% correlated with ham for me now, and "(misconfigured sender)" is curiously mixed. I don't know who's generating them, but after weeding out the ones containing digits there are so few remaining I don't give a rip. MTAs will change over time, MTAs in other countries may use different words, spammers trying to forge Received lines are (if history is any guide) quite likely to screw up small details ... the classifier will learn all this on its own, provided it's not blinded to the raw data by a presumption that we know in advance what will and won't be useful.

be-stupid-be-happy-ly y'rs - tim

From skip at pobox.com Mon Dec 22 18:13:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 18:13:54 2003
Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged"
In-Reply-To:
References: <16359.26915.135330.791705@montanaro.dyndns.org>
Message-ID: <16359.31400.650350.732281@montanaro.dyndns.org>

>> Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted
>> sender)".

Tim> Don't you think this is a "stupid beats smart" kind of thing?

For the moment I'd like to at least make a passing stab at understanding what those phrases mean (or at least what generates them).

If anyone else would like to generate some raw data, you could run something like this:

    from spambayes.mboxutils import getmbox
    import re, pprint
    d = {}
    for msg in getmbox(""):
        hdrs = msg.get_all("received", ())
        for hdr in hdrs:
            for hit in pat.findall(' '.join(hdr.split())):
                d[hit] = d.get(hit,0)+1
    l = [(d[k], k) for k in d if d[k] > 2]
    l.sort()
    pprint.pprint(l)

using a relatively recent cvs checkout (one that has the more general definition of getmbox()). The conditional in the lc is just to trim the output to a reasonable size.
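For readers without a SpamBayes checkout, the same counting idea can be sketched with only the standard library. This is an illustrative stand-in, not the script above: `count_received_comments`, `PAT`, `SAMPLE`, and the `.invalid` hostnames are all made up here, and the letters-only pattern is an assumption about what the undefined `pat` was meant to be.

```python
import email
import re
from collections import Counter

# Assumed pattern: parenthesized comments of two or more letters-only words.
PAT = re.compile(r'\([a-z]+(?:\s+[a-z]+)+\)', re.I)


def count_received_comments(messages, min_count=3):
    """Return sorted (count, comment) pairs seen at least min_count times."""
    counts = Counter()
    for msg in messages:
        for hdr in msg.get_all("received", ()):
            # Folded headers keep their line breaks, so collapse whitespace
            # before matching.
            flat = ' '.join(str(hdr).split())
            for hit in PAT.findall(flat):
                counts[hit] += 1
    return sorted((n, hit) for hit, n in counts.items() if n >= min_count)


# Made-up sample message; the hostnames are placeholders.
SAMPLE = (
    "Received: from a.example.invalid (may\n"
    "\tbe forged)\n"
    "\tby mx.example.invalid (qmail 123 invoked from network)\n"
    "Received: from b.example.invalid (untrusted sender)\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(SAMPLE)
print(count_received_comments([msg], min_count=1))
# prints [(1, '(may be forged)'), (1, '(untrusted sender)')]
```

The default `min_count=3` mirrors the `d[k] > 2` filter in the script above. To run it over a real mailbox, passing `mailbox.mbox(path)` from the standard `mailbox` module as `messages` should work, where `path` is your own mbox file.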
Using a couple training databases I get:

    [(3, '(HELO bean)'),
     (3, '(HELO ckalin)'),
     (3, '(HELO default)'),
     (3, '(HELO laptop)'),
     (3, '(No client certificate requested)'),
     (3, '(authenticated user wgmachado)'),
     (3, '(may be fabricated)'),
     (4, '(HELO jim)'),
     (4, '(HELO vaio)'),
     (4, '(Postfix MTA)'),
     (4, '(account dave HELO nefarious)'),
     (4, '(verified OK)'),
     (5, '(HELO there)'),
     (6, '(HELO lion)'),
     (7, '(HELO bogdanm)'),
     (8, '(HELO opus)'),
     (8, '(misconfigured sender)'),
     (15, '(NEW ZEALAND STANDARD TIME)'),
     (15, '(untrusted sender)'),
     (17, '(HELO localhost)'),
     (18, '(from localhost)'),
     (19, '(SMTP Server)'),
     (26, '(MET DST)'),
     (28, '(NEW ZEALAND DAYLIGHT TIME)'),
     (435, '(may be forged)')]

I am really starting to worry about those kiwis. Are these header phrases part of their master plan for world domination? Tom Ridge just raised our alert level in the US to "orange". Is there a correlation? Do you think I should call 9-1-1?

Tim> be-stupid-be-happy-ly y'rs - tim

Every time I try that I'm happy until Ellen hits me with a 2-by-4. Then my head hurts like hell for about three days.

Skip

From popiel at wolfskeep.com Mon Dec 22 18:35:05 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 18:35:34 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: Message from "T. Alexander Popiel" of "Mon, 22 Dec 2003 13:54:35 PST." <20031222215435.CE64A2DF61@cashew.wolfskeep.com>
References: <20031222215435.CE64A2DF61@cashew.wolfskeep.com>
Message-ID: <20031222233505.608672DF61@cashew.wolfskeep.com>

In message: <20031222215435.CE64A2DF61@cashew.wolfskeep.com> "T. Alexander Popiel" writes:
>
>The next test that I'm actually interested in doing is a comparison
>between training on everything and training on everything that isn't
>1.00 or 0.00 (rounded). I may post a regime for that shortly.

Regime 'nonedge' is now checked in for this. I'll be running the tests with it shortly.
- Alex From richie at entrian.com Mon Dec 22 18:57:27 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 22 18:57:39 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: <16359.31400.650350.732281@montanaro.dyndns.org> References: <16359.26915.135330.791705@montanaro.dyndns.org> <16359.31400.650350.732281@montanaro.dyndns.org> Message-ID: > If anyone else would like to generate some raw data Your script didn't define 'pat' - I've assumed you meant: pat = re.compile(r'\(\w+(?:\s+\w+)+\)') Here's what I get from my corpus of 20,000 verified spams: [(3, '(HELO 0j3x2or)'), (3, '(HELO 2vqmm)'), (3, '(HELO 3bn0dn2)'), (3, '(HELO 3frty7)'), (3, '(HELO 6qzmi3)'), (3, '(HELO QRJATYDI)'), (3, '(HELO ben)'), (3, '(HELO d9vyix)'), (3, '(HELO ic6nlfq)'), (3, '(HELO laabud)'), (3, '(HELO ojeudcb)'), (3, '(HELO pebbyrl)'), (3, '(HELO pm9he0)'), (3, '(HELO r26)'), (3, '(HELO richie)'), (3, '(HELO vzjqt6x)'), (3, '(HELO xhz5j)'), (3, '(HELO yu5s)'), (3, '(untrusted sender)'), (4, '(built Aug 19 2002)'), (4, '(built May 7 2001)'), (6, '(HELO kos)'), (6, '(built Jul 28 2003)'), (6, '(built Oct 18 2002)'), (7, '(built Feb 21 2002)'), (8, '(HELO localhost)'), (9, '(built Sep 8 2003)'), (11, '(HELO pm69)'), (12, '(built Feb 13 2003)'), (15, '(HELO pm65)'), (18, '(built Mar 18 2003)'), (21, '(built May 14 2003)'), (27, '(SMTP Server)'), (149, '(may be forged)')] And these from the 12,000 or so message in the spambayes and spambayes-dev archives - not 100% spam-free, but very very nearly: [(3, '(HELO GR43)'), (3, '(HELO WPWD0038)'), (3, '(HELO diffy2)'), (3, '(HELO gamer)'), (3, '(built Jul 12 2002)'), (4, '(HELO jimws)'), (4, '(HELO localhost)'), (6, '(HELO dj2klap)'), (6, '(built Feb 21 2002)'), (6, '(built Sep 8 2003)'), (7, '(userid 1)'), (8, '(EHLO localhost)'), (8, '(MET DST)'), (8, '(No client certificate requested)'), (8, '(SquirrelMail authenticated user gaza)'), (8, '(built Jul 28 2003)'), (9, '(0 bits)'), (11, '(HELO 
STRIPER)'), (11, '(built Jan 23 2003)'), (11, '(built Oct 18 2002)'), (11, '(sSMTP sendmail emulation)'), (13, '(HELO jim)'), (13, '(SMTP Server)'), (16, '(built Nov 6 2002)'), (21, '(built Nov 25 2002)'), (26, '(HELO striper)'), (27, '(built Jul 29 2002)'), (28, '(built Jan 7 2003)'), (33, '(misconfigured sender)'), (34, '(userid 4)'), (35, '(HELO lion)'), (51, '(may be forged)'), (59, '(built Feb 13 2003)'), (86, '(built May 14 2003)'), (99, '(built Sep 23 2002)'), (100, '(built Mar 18 2003)'), (101, '(built May 13 2002)'), (158, '(untrusted sender)'), (364, '(built Aug 5 2002)')] So "(may be forged)" would be a weak spam clue for me, while "(untrusted sender)" would be a strong ham clue - but 133 of those 158 are from Tim... Even taking Tim out of the equation, it's 25-to-3 in favour of ham. The other 25 are from maybe a dozen other people. Ah - all are either attbi.com or comcast.net. Here's an example of an attbi.com one: Received: from hal2 (h00e01840da57.ne.client2.attbi.com[24.91.108.212](untrusted sender)) by attbi.com (rwcrmhc11) with SMTP id <2003061814314101300an5bve>; Wed, 18 Jun 2003 14:31:42 +0000 Message-ID: Make of all that what you will. -- Richie Hindle richie@entrian.com From gbrown at alumni.caltech.edu Mon Dec 22 20:31:43 2003 From: gbrown at alumni.caltech.edu (Glenn Brown) Date: Mon Dec 22 20:32:17 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: Message-ID: <01ff01c3c8f4$8760f680$6601a8c0@Glenn> > [T. Alexander Popiel] > > Based on these clues, I'd say that you trained on one of these > > messages as ham. That'll certainly encourage a ham classification > > for them. > > Yup, looks certain -- or else Glenn makes some mighty fine distinctions > about which kinds of porn spam he *wants* to see . > [T. Alexander Popiel] > > Based on these clues, I'd say that you trained on one of these > > messages as ham. That'll certainly encourage a ham classification > > for them. 
> > Yup, looks certain -- or else Glenn makes some mighty fine distinctions > about which kinds of porn spam he *wants* to see . I had "recovered from spam" that very message before scoring it and sending the output. My intention is was to remove the message from the "spam" db, but I forgot it moved the message to "inbox" instead of "junk suspects". I'm sure this SNAFU effectively killed this thread, but the meme is planted. If character repetition attacks become a problem, time will tell, and the solution is easy... --Glenn From skip at pobox.com Mon Dec 22 21:17:07 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 22 21:17:18 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: References: <16359.26915.135330.791705@montanaro.dyndns.org> <16359.31400.650350.732281@montanaro.dyndns.org> Message-ID: <16359.42403.517834.642502@montanaro.dyndns.org> Richie> Your script didn't define 'pat' - I've assumed you meant: Richie> pat = re.compile(r'\(\w+(?:\s+\w+)+\)') Whoops. I was cutting-n-pasting from an interpreter session. 'pat' was actually pat = re.compile(r'\([a-z]+(?:\s+[a-z]+)+\)', re.I) but yours is close enough. Thanks for the input/output. Richie> Here's what I get from my corpus of 20,000 verified spams: ... Richie> (3, '(untrusted sender)'), ... Richie> (149, '(may be forged)')] Richie> And these from the 12,000 or so message in the spambayes and Richie> spambayes-dev archives - not 100% spam-free, but very very Richie> nearly: ... Richie> (51, '(may be forged)'), ... Richie> (158, '(untrusted sender)'), ... Richie> "(untrusted sender)".... Ah - all are either attbi.com or Richie> comcast.net. Here's an example of an attbi.com one: Yup, this tag is almost certainly added by Comcast's MTA (they bought AT&T's cable internet business not that long ago). It's interesting that you seem to have a lot of HELO's with the same value. Frequent correspondents perhaps? 
I don't see that many HELO's (some from localhost). Are they generated close to your machine (in a late Received: header)? Skip From tim.one at comcast.net Mon Dec 22 21:44:12 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 22 21:44:16 2003 Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse In-Reply-To: <01ff01c3c8f4$8760f680$6601a8c0@Glenn> Message-ID: [Glenn Brown] > I had "recovered from spam" that very message before scoring it and > sending the output. My intention is was to remove the message from > the "spam" db, but I forgot it moved the message to "inbox" instead > of "junk suspects". Ah! You're still well-advised to better balance your training data. The imbalance now is hurting you. > I'm sure this SNAFU effectively killed this thread, but the meme is > planted. If character repetition attacks become a problem, time will > tell, and the solution is easy... That can't be known without testing, a great many tokens aren't "words" at all, and SpamBayes isn't limited to English even if they were. IOW, it may or may not prove an effective gimmick, but nobody can claim to know one way or the other without testing. There are mnay, many ohter wyas to obcsure werds t00, but they all have in comon that they make the s p a m m e r luk like an 1d1ot, and so cut response rate. From mhammond at skippinet.com.au Tue Dec 23 01:29:25 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Dec 23 01:29:43 2003 Subject: [spambayes-dev] Experimental SpamBayes build available Message-ID: <001801c3c91e$1cde2bf0$2c00a8c0@eden> Hi all, I have just uploaded an installer for a new experimental binary of SpamBayes. This binary includes *both* the Outlook addin and the sb_server applications. The installer attempts to detect the most appropriate one to install. Everything is built from CVS sources as of today. Hopefully, this will mean the Outlook addin has a number of bugs fixed over the 0.8 release. 
However, it is possible there are a number of bugs *not* in 0.8, and even the possibility that it will not work at all for many people (as this is released with different 'python->.exe' technology than previous versions). The sb_server application suite all seems to work fine too, so non-Outlook users are also encouraged to try this version. Note that it comes with almost no documentation (as there is none!) and that this is the first release of such a binary, so this too is bleeding edge. Thus, only brave people willing to test out stuff with almost no release notes should try it :) To further dissuade you, I am leaving for a week or so holiday, and will not be in a position to respond to any mail or bugs relating to this build. That said, it works well for me and the testing I have done on a number of machines. If anyone is keen, please visit http://starship.python.net/crew/mhammond/spambayes/ Happy holidays! Mark. From skip at pobox.com Tue Dec 23 10:43:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 23 10:44:03 2003 Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions? Message-ID: <16360.25269.160738.272779@montanaro.dyndns.org> The comment for DBDictClassifier._wordinfoset says: # "Singleton" words (i.e. words that only have a single instance) # take up more than 1/2 of the database, but are rarely used # so we don't put them into the wordinfo cache, but write them # directly to the database # If the word occurs again, then it will be brought back in and # never be a singleton again. # This seems to reduce the memory footprint of the DBDictClassifier by # as much as 60%!!! This also has the effect of reducing the time it # takes to store the database With the recent testing of bigrams the clause "but are rarely used" would seem to be at least partially false. I'm not too concerned about memory footprint of the classifier, since I have lots of memory and use sb_filter.py, not one of the long-running servers or plugins.
I also wonder about the contention that it reduces the database store time. It's probably true that the time spent at shutdown is shorter, but that time has been amortized over the entire runtime of the program. Perhaps we should reexamine the caching in DBDictClassifier. I would like it to be able to inherit a bit more functionality from its base class. If the assumptions it makes aren't entirely accurate, much of the extra work maintaining caches might be avoided. Skip From nobody at spamcop.net Tue Dec 23 11:40:32 2003 From: nobody at spamcop.net (Seth Goodman) Date: Tue Dec 23 11:40:34 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <16359.19517.682605.648156@montanaro.dyndns.org> Message-ID: Thanks to all for replying. However, I am still a bit confused by the advice (or like we say in sunny Wisconsin, Uff Dah!). Skip suggests trying out MySQL or PostgreSQL to implement the various bidirectional mappings (I assume this means trash the existing database and create new ones). Alex suggests that bidirectional maps are overkill and not to bother. Alex also has some scripts that do much of what I am trying to do, but it sounds like they will only work in a procmail environment and not with Outlook, which is where I am stuck. I run an Outlook client in IMO mode and fetch mail with POP3. Tim appeared to agree with Alex that I shouldn't mess with the main database but I should nonetheless experiment and I know he likes the bidirectional maps. I understand that there are also a bunch of testing frameworks/harnesses checked in and standard data sets to test against, though it sounds like they don't work with Outlook, which is a real pity. So I'm again asking for direction in the initial, most important decisions.
For testing message and hapax expiration with various training regimens under the Outlook environment (if that is even possible or reasonable): 1) Do you recommend that I use the Outlook code base or ditch the Outlook plug-in and install the sbproxy version from source? I hate to lose the integration and I don't even know if the proxy produces mbox-style mail folders that the myriad scripts already written can work with. 2) Do you recommend I start with the existing database and modify it, or as Skip suggested, change over to a database that doesn't have the multi-thread corruption problem? 3) And finally, Skip previously suggested that I check out the CVS trunk. Is that still your recommendation? Thanks for all your help. I just want to avoid taking initial mis-steps that would make anything I put together useless to anybody else. I also don't want to duplicate efforts that others who are experienced have already taken. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From skip at pobox.com Tue Dec 23 12:07:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 23 12:07:22 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: References: <16359.19517.682605.648156@montanaro.dyndns.org> Message-ID: <16360.30271.161758.305580@montanaro.dyndns.org> Seth> 2) Do you recommend I start with the existing database and modify Seth> it, or as Skip suggested, change over to a database that doesn't Seth> have the multi-thread corruption problem? That's not why I suggested MySQL or PostgreSQL. Sure, thread safety would be a nice side-effect, but for testing I probably wouldn't care much about that. I suggested them because it would be easy to experiment with different database structures. Seth> 3) And finally, Skip previously suggested that I check out the CVS Seth> trunk. Is that still your recommendation? For testing, yes. 
I'd also recommend you ditch the Outlook plugin for testing. If you've ever done any Unix programming you'll probably find cobbling stuff together much easier without the overhead of a GUI. Skip From nobody at spamcop.net Tue Dec 23 13:18:29 2003 From: nobody at spamcop.net (Seth Goodman) Date: Tue Dec 23 13:19:07 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <16360.30271.161758.305580@montanaro.dyndns.org> Message-ID: > [Skip Montanaro] > I'd also recommend you ditch the Outlook plugin for testing. If > you've ever > done any Unix programming you'll probably find cobbling stuff > together much > easier without the overhead of a GUI. Just to be clear, I would then use the sbserver code, as I run Windows, not Unix. I have done Unix scripts in the past, and certainly appreciated the flexibility and ease of messing around, but I no longer have a Unix setup. Are the scripts that people are checking in (like Alex's nonedge script) compatible with the mail folders produced (if any) by the sbserver code? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From skip at pobox.com Tue Dec 23 13:25:37 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 23 13:25:44 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: References: <16360.30271.161758.305580@montanaro.dyndns.org> Message-ID: <16360.34977.846047.659325@montanaro.dyndns.org> Seth> Just to be clear, I would then use the sbserver code, as I run Seth> Windows, not Unix. Yeah, or sb_filter.py and/or sb_mboxtrain.py. Note that I'm assuming you're going to test your changes on a collection of saved mail, not on your incoming mail feed. Seth> I have done Unix scripts in the past, and certainly appreciated Seth> the flexibility and ease of messing around, but I no longer have a Seth> Unix setup.
Are the scripts that people are checking in (like Seth> Alex's nonedge script) compatible with the mail folders produced Seth> (if any) by the sbserver code? sb_server.py is a proxy. It doesn't create long-term storage for messages. It only annotates messages it fetches on your behalf from your POP3 server. Skip From tim at fourstonesExpressions.com Tue Dec 23 15:10:06 2003 From: tim at fourstonesExpressions.com (Tim Stone) Date: Tue Dec 23 15:10:13 2003 Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions? In-Reply-To: <16360.25269.160738.272779@montanaro.dyndns.org> References: <16360.25269.160738.272779@montanaro.dyndns.org> Message-ID: On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro wrote: > Perhaps we should reexamine the caching in DBDictClassifier. I would > like > it to be able to inherit a bit more functionality from its base class. > If > the assumptions it makes aren't entirely accurate, much of the extra work > maintaining caches might be avoided. I have no idea where that comment came from... The scheme seems bogus to me. It's a word, it occurs once or many times, there's no reason to treat it differently. If we have memory consumption problems, then that's the problem to fix. We've had a bunch of discussion about using other db systems (zodb, mysql, etc.). Perhaps this is yet another reason to "modernize" our database. -- Vous exprimer; Exprésese; Te stesso esprimere; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com See my writing at www.xanga.com/obj3kshun From skip at pobox.com Tue Dec 23 15:26:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 23 15:26:15 2003 Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions?
In-Reply-To: References: <16360.25269.160738.272779@montanaro.dyndns.org> Message-ID: <16360.42206.839409.41850@montanaro.dyndns.org> Tim> On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro wrote: >> Perhaps we should reexamine the caching in DBDictClassifier. Tim> I have no idea where that comment came from... That much I can tell you. Mark wrote the comment on May 30th. Here's the checkin comment: 2 changes to the way the DB classifier manages words: * As per Tim P's mail, keep a list of "changed words" with a flag indicating "change" or "delete". This prevents the database save from updating every single word ever loaded by the db. * From Sean, a change that prevents caching of hapaxes. Such words are saved directly to the DB. This reduces the memory footprint significantly (as these words are not kept in memory) and helps save times. This change makes "incremental" saving of the database happen in a reasonable time, and doesn't degrade after a complete retrain etc. I'm off for a weekend holiday - someone can just back this out if I screwed it up Perhaps Mark can elaborate when he returns from holiday. If we are going to cache lookups in the file-based classifiers, I'd prefer to restructure things so we can reuse behavior defined in classifier.Classifier wherever possible. That means that self.wordinfo should refer to the real file storage, not a cache. _wordinfoget() and friends can then rely on the versions in classifier.Classifier and front that functionality with caches or apply other annotations. This all breaks down when you consider the SQL-based classifiers, but they've only ever been experimental (I think - is anyone using them on a regular basis?), so I think it's okay for the maintenance burden to be higher for them. Skip From nobody at spamcop.net Tue Dec 23 15:33:24 2003 From: nobody at spamcop.net (Seth Goodman) Date: Tue Dec 23 15:40:53 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <16360.34977.846047.659325@montanaro.dyndns.org> Message-ID: > Seth> Just to be clear, I would then use the sbserver code, as I run > Seth> Windows, not Unix. > > [Skip Montanaro] > Yeah, or sb_filter.py and/or sb_moxtrain.py. Note that I'm > assuming you're > going to test your changes on a collection of saved mail, not on your > incoming mail feed. In that case, is it possible to leave the Outlook binary installed for my incoming mail stream while I use sb_mboxtrain.py and sb_filter.py for stored mbox testing? My system doesn't seem to have a PythonPath environment variable, so I would guess this is possible, so long as I can keep all the relevant paths different. If I can have the Outlook binary and non-Outlook source working at the same time, is there a way to convert my saved Outlook mail folders to mbox format so that I _can_ see how the changes I make work on my own mail stream as well? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From kennypitt at hotmail.com Tue Dec 23 16:04:10 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Dec 23 16:04:50 2003 Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifierassumptions? In-Reply-To: Message-ID: Tim Stone wrote: > On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro > wrote: > >> Perhaps we should reexamine the caching in DBDictClassifier. I >> would like it to be able to inherit a bit more functionality from >> its base class. If the assumptions it makes aren't entirely >> accurate, much of the extra work maintaining caches might be avoided. > > I have no idea where that comment came from... The scheme seems bogus > to me. It's a word, it occurs once or many times, there's no reason > to treat it differently. If we have memory consumption problems, > then that's the problem to fix.. We've had a bunch of discussion > about using other db systems (zodb, mysql, etc.). 
Perhaps this is > yet another reason to "modernize" our database. The comment appears in the _wordinfoset() function, which means it is called when a message is trained. I believe the original reasoning was probably that there are a lot of tokens in a newly trained message that have never been seen before, and quite likely will never be seen again. It would be a waste of memory to cache lots of singleton tokens that will never be used to classify another message, so the token is saved to the database on disk but is discarded from the memory cache. If the token is ever needed when classifying a message in the future, then it will be read in from the database and will then be kept in the memory cache. Because the uni/bigram scheme generates so many more tokens from the same message, I would think this reasoning would apply even more so there. This same caching scheme could be applied to any of the random-access database storage mechanisms, such as MySQL or Postgres. It doesn't seem like it would apply to pickles, however, because the complete list of all known tokens is always kept in memory for a pickle. Since PickledClassifier also derives from Classifier, I would have to vote against moving caching logic into the base Classifier class. Maybe a DBClassifierBase class derived from Classifier and containing the caching logic for all database storage mechanisms would be in order. Regarding the reduced store time, this "optimization" seems to be oriented towards a train-on-everything strategy and a long running application such as sb_server. Keeping updates in memory means that the counts for a token can be updated multiple times with only one database write at the end, while writing out singletons immediately keeps the size of the change list down so that the database update doesn't take quite so long at shutdown. 
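For anyone following along, the singleton handling described above can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the actual DBDictClassifier code: the class name HapaxSkippingCache, its method names, and the plain dict standing in for the on-disk database are all hypothetical, and word counts are simplified to a (spamcount, hamcount) tuple.

```python
class HapaxSkippingCache:
    """Memory cache over a slower key/value store (hypothetical sketch).

    Words seen only once ("singletons"/hapaxes) are written straight to
    the store and kept out of the in-memory cache.
    """

    def __init__(self, db):
        self.db = db          # stands in for the on-disk database
        self.cache = {}       # words considered worth keeping in memory
        self.changed = set()  # cached words not yet written back

    def get(self, word):
        # A lookup always promotes the word into the cache,
        # whether or not it is still a singleton.
        if word not in self.cache:
            if word not in self.db:
                return None
            self.cache[word] = self.db[word]
        return self.cache[word]

    def set(self, word, counts):
        spamcount, hamcount = counts
        if spamcount + hamcount <= 1:
            # Singleton: write through to the store and drop it
            # from the cache and from the pending-change list.
            self.db[word] = counts
            self.cache.pop(word, None)
            self.changed.discard(word)
        else:
            # Seen more than once: keep in memory, write back later.
            self.cache[word] = counts
            self.changed.add(word)

    def flush(self):
        # One store write per changed word, however often it was updated.
        for word in self.changed:
            self.db[word] = self.cache[word]
        self.changed.clear()
```

With this sketch, training a brand-new token writes it to db and leaves the cache untouched; a later get() pulls it into the cache, after which further updates are batched until flush() is called.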
With the caching and optimization in the database engines being what it is today, it seems that we might be better off to always write changes to the DB immediately and dispense with the whole self.changed_words thing altogether. When there are multiple processes that could be using the database at the same time, any caching (read or write) that we do ourselves outside the database engine has the potential to generate inconsistencies in the data anyway. Whew, that's a much longer response than I intended. Guess that's what happens when things get slow before the holidays. -- Kenny Pitt From richie at entrian.com Tue Dec 23 17:19:33 2003 From: richie at entrian.com (Richie Hindle) Date: Tue Dec 23 17:19:46 2003 Subject: [spambayes-dev] default to mine_received_headers=True, "may be forged" In-Reply-To: <16359.42403.517834.642502@montanaro.dyndns.org> References: <16359.26915.135330.791705@montanaro.dyndns.org> <16359.31400.650350.732281@montanaro.dyndns.org> <16359.42403.517834.642502@montanaro.dyndns.org> Message-ID: <28fhuvs6574p1v00amrap1j62v7s46vvrk@4ax.com> [Skip] > It's interesting that you seem to have a lot of HELO's with the same value. > Frequent correspondents perhaps? Based on looking at a couple of examples, each unique HELO corresponds with one person. "HELO lion", for instance, is David LeBlanc. From the spams, "HELO kos" are all instances of the same spam. "HELO pm69" are all opt-in spams from Interseer, a service that watches your website uptime at the cost of (very mildly) spamming you. > I don't see that many HELO's (some from localhost). Are they generated > close to your machine (in a late Received: header)? The three I looked at were all in the initial Received header - that is, the one that was added by the first MTA, and therefore appears last in the message. 
-- Richie Hindle richie@entrian.com From richie at entrian.com Tue Dec 23 19:14:10 2003 From: richie at entrian.com (Richie Hindle) Date: Tue Dec 23 19:14:23 2003 Subject: [spambayes-dev] FW: SF.NET Project Donation System In-Reply-To: <029e01c3c74c$d84a8470$2c00a8c0@eden> References: <029e01c3c74c$d84a8470$2c00a8c0@eden> Message-ID: <66mhuv0k1kjgmkbv786r6ja6gpuei5us0v@4ax.com> [Richie] > Anyone who's spent real money on the project, like Rob with the > spambayes.org domain, could be reimbursed. [Mark] > I agree, but not sure how this could work in practical terms with the tax > and holding issues. We could temporarily add them to the top of the donations page, and use your "Pay this guy first" idea. Then users could donate directly to them. [Mark] > Our "donations" page could list the developers, > and include a link to their personal sourceforge page. What they say about > themselves there is their issue. Sounds sensible. We could leave the PSF donation button there, and also include a donation button for SourceForge (although they already take a 5% cut of all donations, subject to a $1 minimum). That should cover all the bases. [Tony] > It seems to me that this would end up being more of a donate-for-support > page, which leaves out those people that support but don't develop. My > personal suspicion is that people are more likely to want to donate for > support than development, anyway. Good point. Maybe we should have a scheme whereby a non-developer contributor can be nominated for inclusion on the donations page. -- Richie Hindle richie@entrian.com From popiel at wolfskeep.com Tue Dec 23 20:24:49 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Dec 23 20:24:54 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Seth Goodman" of "Tue, 23 Dec 2003 10:40:32 CST." References: Message-ID: <20031224012449.35EA12DF61@cashew.wolfskeep.com> In message: "Seth Goodman" writes: >Thanks to all for replying. 
Eh, I'm just satisfying one of my own vices: babbling at people while scrambling to do the groundwork to back up my babble. >Alex suggests that bidirectional maps are overkill and not to bother. Hrm. I think I'd rephrase to say that the maps are overkill for most all of the individual tests/regimes that you might be interested in. Furthermore, while we're just trying things out, it seems to make more sense to do the tests individually as we come up with them, instead of trying to make some over-arching generalization that could be used to implement any of them. >Alex also has some scripts that do much of what I am trying to do, but >it sounds like they will only work in a procmail environment and not >with Outlook, which is where I am stuck. My scripts don't really work in a mail environment at all; they work in an environment where data content (which happens to be RFC 822 formatted mail messages) is stored in files in a specific directory structure with a special naming convention. This structure is: Data/ Ham/ reservoir/ Set1/ Set2/ ... SetN/ Spam/ reservoir/ Set1/ Set2/ ... SetN/ Inside each of the bottom-level directories is a set of files named with a 4-digit number, a dash, and a 6-digit number, such as 0267-045075. The 4-digit number is a day-of-arrival indicator (for grouping vs. periodic processes like the fixed retraining in the 'corrected' regime), and the 6-digit number is a unique sequence number (for ordering all the messages for behaviour-over-time analysis). Note that the above structure can be used for Tim's cv tests, too; his framework uses the directory hierarchy but doesn't care about the file names. More information on how I generate and manipulate this structure is in the incremental.HOWTO.txt in the testtools directory of the project. Also, the README-DEVEL.txt in the root of the project explains a lot more about this structure and the other tools for manipulating it. >I run an Outlook client in IMO mode and fetch mail with POP3. 
To get at your raw mail messages, I'd stick a POP3 proxy in there which saved each message into a separate file... but I'm a protocol weenie, and there might be easier ways to get at the data. >I understand that there are also a bunch of testing frameworks/harnesses >checked in Yes. The testtools directory is your friend. >and standard data sets to test against This we do not have (in any significant quantity), for multiple reasons: 1) If we have a standard data set, then we'll end up with a tool that's good at classifying that data set, not random people's mail. 2) While sharing spam is fairly innocuous, sharing ham opens up all sorts of privacy concerns... and if we filter out private info from the stuff we share, then we're systematically neglecting a portion of the data we're trying to represent. 3) We seem to enjoy nagging each other into running tests on private datasets. There seems to be some thought that if we nag enough people, someone will actually read the code that's being tested and point out where we're being stupid. <.5 wink> >though it sounds like they don't work with Outlook, which is a real pity. They don't really work with any mail handler, as mentioned above; instead, they work on organized data, so you can rerun tests time and time again after various fidgets and fixes. The reason why Outlook is a particular problem is that Outlook mutilates mail, irretrievably destroying the RFC 822 structure that it may have once been delivered in. A similar structure can theoretically be recreated, but like many recreations, some information (like the separators used in MIME encapsulation, etc) is not the same. >So I'm again asking for direction in the initial, most important decisions. >For testing message and hapax expiration with various training regimens >under the Outlook environment (if that is even possible or reasonable): > >1) Do you recommend that I use the Outlook code base or ditch the Outlook >plug-in and install the sbproxy version from source?
I hate to lose the >integration and I don't even know if the proxy produces mbox-style mail >folders that the myriad scripts already written can work with. I'm strongly in favor of ditching Outlook entirely. >2) Do you recommend I start with the existing database and modify it, or as >Skip suggested, change over to a database that doesn't have the multi-thread >corruption problem? I'm not even sure if the test harnesses use a database backend at all; I think they may be keeping everything in memory. Dunno. I haven't looked at that in ages. What I would suggest is starting with the existing test harnesses and building from there. >3) And finally, Skip previously suggested that I check out the CVS trunk. >Is that still your recommendation? Definitely. Last I heard, there's a bunch of stuff (including all the test info) that's in CVS but not in the binary distributions. >Thanks for all your help. I just want to avoid taking initial mis-steps >that would make anything I put together useless to anybody else. I also >don't want to duplicate efforts that others who are experienced have already >taken. Reproducing what's gone before is useful. Duplicating it is not so useful. Where the line is drawn between the two is something I'll leave to someone else. ;-) - Alex From popiel at wolfskeep.com Tue Dec 23 20:32:33 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Dec 23 20:32:38 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Seth Goodman" of "Tue, 23 Dec 2003 14:33:24 CST." References: Message-ID: <20031224013233.A3CCB2DF61@cashew.wolfskeep.com> In message: "Seth Goodman" writes: >> >> [Skip Montanaro] >> Yeah, or sb_filter.py and/or sb_moxtrain.py. Note that I'm >> assuming you're >> going to test your changes on a collection of saved mail, not on your >> incoming mail feed. 
> >In that case, is it possible to leave the Outlook binary installed for my >incoming mail stream while I use sb_mboxtrain.py and sb_filter.py for stored >mbox testing? Certainly. You can run test tools completely independent of your mail feed, without affecting it at all. >My system doesn't seem to have a PythonPath environment variable, so I >would guess this is possible, so long as I can keep all the relevant >paths different. Exactly. Most of the test scripts have path-futzing stuff at the top to find local copies of the spambayes code, too, so it's theoretically possible even if you do have a PythonPath set. >is there a way to convert my saved Outlook mail folders to mbox format This is much more problematic; Mark may have code for this, but as I mentioned in my last mail, Outlook mutilates mail. It may be easier to just start collecting fresh by inserting something which saves incoming mail before Outlook gets its grubby little hands all over it. >so that I _can_ see how the changes I make work on my own mail stream >as well? This is the most laudable goal of all... for it is how we judge if things are good or bad. :-) - Alex From tameyer at ihug.co.nz Tue Dec 23 20:47:32 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 23 20:47:39 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7DAE@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677796@its-xchg4.massey.ac.nz> [Seth Goodman] > Just out of curiosity, does the proxy version > of SpamBayes have the same protection as the Outlook version > against training on the same msg_id twice? Kinda. It won't train a message with the same id twice, but that id is generated when mail travels through the proxy. So if you download the same message (through the proxy) twice, then you'll have two messages identical apart from the ids. 
If you used the web interface to find a message after you had already trained it, training it again will have no effect (unless the classification is different, then it'll be fixed). FWIW, the imap filter does the same thing, except that since mail isn't downloaded (it's a filter not a proxy) mail does get given a permanent unique id. If you wanted to train a message twice and not download it twice, you could simply duplicate the one sitting in the "Unknown" cache and give it a different name (fitting the scheme). =Tony Meyer From tameyer at ihug.co.nz Tue Dec 23 20:50:00 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 23 20:50:08 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7DFB@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677797@its-xchg4.massey.ac.nz> > People now > typically don't have the slightest clue how to go from their > normal usage to a testing deployment... or at least don't > know how to extract their mail from Outlook's clutches so > that they have data to work _on_. Right now, the idea is simply that people run the "export.py" script in the Outlook2000 directory (running from source, obviously), which churns out the 'standard' testing setup containing all the messages in the folders Outlook knows about. From there you run tests like anyone else. =Tony Meyer From tameyer at ihug.co.nz Tue Dec 23 20:56:19 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 23 20:56:24 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7F62@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677798@its-xchg4.massey.ac.nz> [Seth Goodman] > I understand that there are also a bunch of testing > frameworks/harnesses checked in and standard data sets to > test against, though it sounds like they don't work with > Outlook, which is a real pity. 
You can still use all the testing tools with mail that you receive via Outlook, though. This is what I do, (AFAIK) what Mark does. Look at the export.py script in the Outlook2000 directory. > 1) Do you recommend that I use the Outlook code base or ditch > the Outlook plug-in and install the sbproxy version from > source? Stick with the plug-in. sb_server's not going to give you anything helpful in the way of testing (my experimental TestToolsUI excluded ). > 3) And finally, Skip previously suggested that I check out > the CVS trunk. Is that still your recommendation? Yes. [Later] > In that case, is it possible to leave the Outlook binary > installed for my incoming mail stream while I use > sb_mboxtrain.py and sb_filter.py for stored mbox testing? Yes you can leave the binary installed. You don't need to use sb_mboxtrain or sb_filter if you're going to use the testing setup, though. You're after the scripts in the testtools directory, not the scripts one. (If I understand the recommendations that have been made so far). > My system doesn't seem to have a PythonPath environment variable, > so I would guess this is possible, so long as I can keep all > the relevant paths different. Just don't run "addin.py" in the Outlook2000 directory, and the plug-in binary will keep on chugging. > If I can have the Outlook binary and non-Outlook source > working at the same time, is there a way to convert my > saved Outlook mail folders to mbox format so that I _can_ > see how the changes I make work on my own mail stream as well? export.py in the Outlook2000 directory. Let me know if you have any troubles getting the testing setup going or exporting the messages from Outlook. =Tony Meyer From tim.one at comcast.net Tue Dec 23 21:11:48 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 23 21:11:53 2003 Subject: [spambayes-dev] comment assertion error? revisitDBDictClassifierassumptions? 
In-Reply-To: Message-ID: [Kenny Pitt] You're doing an excellent job of channeling Mark, and I have only a little to add. From a 5-mile view, we run a memory cache (which happens to be a Python dict) on top of a disk-based database, in order that the system not run too slow to bear. The memory cache is effective at speeding normal operation; that's why it's there. It may err on the side of keeping too much in memory. > The comment appears in the _wordinfoset() function, which means it is > called when a message is trained. I believe the original reasoning > was probably that there are a lot of tokens in a newly trained > message that have never been seen before, and quite likely will never > be seen again. It would be a waste of memory to cache lots of > singleton tokens that will never be used to classify another message, > so the token is saved to the database on disk but is discarded from > the memory cache. If the token is ever needed when classifying a > message in the future, then it will be read in from the database and > will then be kept in the memory cache. All correct. > Because the uni/bigram scheme generates so many more tokens from the > same message, I would think this reasoning would apply even more so > there. Me too. > This same caching scheme could be applied to any of the random-access > database storage mechanisms, such as MySQL or Postgres. That's right, and if looking up frequently referenced tokens goes faster in a dict than reading from disk (hint: it does ), it will help them too. > It doesn't seem like it would apply to pickles, however, because > the complete list of all known tokens is always kept in memory for a > pickle. Also right. Skip, what you described before makes me wonder why you'd want a disk-based database: I'm not too concerned about memory footprint of the classifier, since I have lots of memory ... I also wonder about the contention that it reduces the database store time.
If you want peak classification and/or training speed, have lots of memory, and don't care about initialization or finalization time, running a plain Python dict (stored as a giant binary pickle) is definitely the way to go. It's much faster, and it was much faster still before we added layers of indirection to *allow* dict operations to get satisfied by "real" databases instead. FWIW, the memory cache may not apply much to ZODB either, since ZODB keeps accessed Python objects (which is what ZODB stores) in its own memory cache. > Since PickledClassifier also derives from Classifier, I would have > to vote against moving caching logic into the base Classifier class. > Maybe a DBClassifierBase class derived from Classifier and containing > the caching logic for all database storage mechanisms would be in > order. Of course different storage mechanisms may want different caching strategies. > Regarding the reduced store time, this "optimization" seems to be > oriented towards a train-on-everything strategy and a long running > application such as sb_server. Keeping updates in memory means that > the counts for a token can be updated multiple times with only one > database write at the end, while writing out singletons immediately > keeps the size of the change list down so that the database update > doesn't take quite so long at shutdown. It was really aimed at incremental training. When you hit, e.g., the "Delete as Spam" button in the Outlook addin with even just one msg selected, the Berkeley db on disk is synch'ed after training. This makes for a *very* perceptible delay if the cache contains lots of info that differs from what's on disk. Startup and shutdown time are also important in this context, and amortizing those costs has major "perceived usability" benefits. If, e.g., you run from a giant pickled Python dict instead, you can expect to wait several seconds (at best) whenever loading it from, or storing it to, disk. 
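To make the incremental-save point concrete, here is a rough sketch of the "changed words" idea from the checkin comment quoted earlier in the thread: only words touched since the last sync are written, so the sync cost follows the number of changes rather than the number of words ever loaded. Apart from changed_words, every name here (ChangeTrackingStore, WORD_CHANGED, WORD_DELETED, load, update, delete, sync) is an illustrative assumption, and a plain dict stands in for the Berkeley db.

```python
WORD_CHANGED = "changed"  # illustrative flag values, assumed for this sketch
WORD_DELETED = "deleted"


class ChangeTrackingStore:
    """Cache that remembers which words differ from the backing store."""

    def __init__(self, db):
        self.db = db             # stands in for the database on disk
        self.cache = {}          # in-memory copies of loaded words
        self.changed_words = {}  # word -> WORD_CHANGED or WORD_DELETED

    def load(self, word):
        # Reading a word caches it but does not mark it as changed.
        if word not in self.cache:
            self.cache[word] = self.db[word]
        return self.cache[word]

    def update(self, word, counts):
        self.cache[word] = counts
        self.changed_words[word] = WORD_CHANGED

    def delete(self, word):
        self.cache.pop(word, None)
        self.changed_words[word] = WORD_DELETED

    def sync(self):
        # Write only what changed; return the number of store operations.
        writes = 0
        for word, flag in self.changed_words.items():
            if flag == WORD_CHANGED:
                self.db[word] = self.cache[word]
            else:
                self.db.pop(word, None)
            writes += 1
        self.changed_words.clear()
        return writes
```

In this sketch, loading a thousand words and then training a single message that touches two of them results in two store operations at sync time, which is the behaviour that keeps the delay after a single "Delete as Spam" small.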
> With the caching and optimization in the database engines being what > it is today, it seems that we might be better off to always write > changes to the DB immediately and dispense with the whole > self.changed_words thing altogether. This should be measured; it's not (or shouldn't be) a religious issue. I have no experience with general-purpose database engines that are actually fast; only some that aren't as slow as others <0.5 wink>. > When there are multiple processes that could be using the database > at the same time, any caching (read or write) that we do ourselves > outside the database engine has the potential to generate > inconsistencies in the data anyway. A conclusion there, one way or the other, depends on specific details. Concurrent read-write access is never simple, and I'm not sure anyone uses spambayes that way anyway. From tameyer at ihug.co.nz Tue Dec 23 21:16:19 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 23 21:16:23 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D8049@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467779A@its-xchg4.massey.ac.nz> [Alex] > The reason why Outlook is a particular problem is that > Outlook mutilates mail, irretrievably destroying the RFC 822 > structure that it may have once been delivered in. A similar > structure can theoretically be recreated, but like many > recreations, some information (like the separators used in > MIME encapsulation, etc) is not the same. [...] > I'm strongly in favor of ditching Outlook entirely. The export.py script does a reasonable job of putting everything back together again. Actually, I believe it does the exact same job as when getting a message to pass to tokenizer for general use. 
So although popping a proxy in between Outlook and the POP3 server to catch raw messages would certainly be more pure and correct (sb_server can do this, BTW, just set the cache expiry limit *really* high and don't bother classifying any messages), for practical purposes using the data that Outlook gives is just as useful. (Since if anything got accepted into the core those using the Outlook plug-in would be dealing with those effects). This is a (another) good reason for us to try each other's patches (and I will get to the incremental ones soon, honest!) since some of us have Outlook-altered messages to test, and others have nice pure message streams. > I'm not even sure if the test harnesses use a database > backend at all; I think they may be keeping everything in > memory. Dunno. I haven't looked at that in ages. They keep everything in memory unless you've enabled the 'save the classifier' option (can't remember what it's called; too lazy to check), and then it pickles them. =Tony Meyer From tim.one at comcast.net Wed Dec 24 01:07:21 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 24 01:07:30 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Kenny Pitt] > ... > Whenever you use direct BDB through the pybsddb/bsddb3/bsddb module > in a multi-thread/multi-user scenario, you always have to start with > a call to initialize the DB environment before you can do anything > else. You expressed some concern over the breakage on Win98 of the > tests in test_dbshelve.py. Unfortunately, the line that always fails > is that very first and most basic initialization call, the same one > that we would need to call for any use in SpamBayes. I don't think there's a problem with that:

C:\Python23>python lib/bsddb/test/test_dbshelve.py -v
test01_basics (__main__.DBShelveTestCase) ... ok
test02_cursors (__main__.DBShelveTestCase) ... ok
test01_basics (__main__.BTreeShelveTestCase) ... ok
test02_cursors (__main__.BTreeShelveTestCase) ... ok
test01_basics (__main__.HashShelveTestCase) ... ok
test02_cursors (__main__.HashShelveTestCase) ... ok
test01_basics (__main__.ThreadBTreeShelveTestCase) ... ok
test02_cursors (__main__.ThreadBTreeShelveTestCase) ... ok
test01_basics (__main__.ThreadHashShelveTestCase) ... ok
test02_cursors (__main__.ThreadHashShelveTestCase) ... ok
test01_basics (__main__.EnvBTreeShelveTestCase) ... ERROR
test02_cursors (__main__.EnvBTreeShelveTestCase) ... ok
test01_basics (__main__.EnvHashShelveTestCase) ... ERROR
test02_cursors (__main__.EnvHashShelveTestCase) ... ok
test01_basics (__main__.EnvThreadBTreeShelveTestCase) ... ERROR
test02_cursors (__main__.EnvThreadBTreeShelveTestCase) ... ok
test01_basics (__main__.EnvThreadHashShelveTestCase) ... ERROR
test02_cursors (__main__.EnvThreadHashShelveTestCase) ... ok

Note that the 4 Env instances of test02_cursors pass. They're doing the full-blown open-with-env bit too. It's the 4 Env instances of test01_basics that fail, and all of them die with the same traceback:

Traceback (most recent call last):
  File "lib/bsddb/test/test_dbshelve.py", line 75, in test01_basics
    self.do_open()
  File "lib/bsddb/test/test_dbshelve.py", line 238, in do_open
    self.env.open(homeDir, self.envflags | db.DB_INIT_MPOOL | db.DB_CREATE)
DBAgainError: (11, 'Resource temporarily unavailable -- unable to join the environment')

This isn't the *first* time test01_basics opens with an env, though. Line 75 is here:

    def test01_basics(self):
        if verbose:
            print '\n', '-=' * 30
            print "Running %s.test01_basics..." % self.__class__.__name__

        self.populateDB(self.d)
        self.d.sync()
        self.do_close()
        self.do_open()     # ********************* LINE 75
        d = self.d

The test setUp() method already does self.do_open() once by the time test01_basics begins. So there's something screwed up about how the test tries to close and reopen the dbshelve (self.d) on this box.
Figuring out exactly what would require digging into the guts of the stinkin' dbshelve module, to see how *its* stinkin' close method screws up . If I comment out lines 74 and 75 (the back-to-back close()/open() pair), the 4 env instances of test01_basics all pass. In fact, they all pass if I just comment out line 74. They also pass if I replace lines 74 and 75 with: self.tearDown() self.setUp() self.populateDB(self.d) The only way they don't pass is to do exactly what the test does . > ... > Maybe the best thing is to throw some test code into SpamBayes > and see if it will even start up on Win98. Yes. > I don't have access to a Win98 test system, but if I can code up > enough support that we can try this out, would you be willing to give > it a test? Certainly. > It will probably be after the holidays before I can get to it, but > we'll see. That's fine, there's no rush. Especially since it will work . From tim.one at comcast.net Wed Dec 24 02:33:21 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 24 02:33:31 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130467779A@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > The export.py script does a reasonable job of putting everything back > together again [from Outlook]. Thanks, Tony! I'm mortified to admit I had forgotten where this script lived. > Actually, I believe it does the exact same job as when getting a > message to pass to tokenizer for general use. In particular, exactly the same as when scoring a message, or training on one. The MIME armor (if any) is gone, (at least all) non text/* attachments are gone, and if the original headers contained Content-Type or Content-Transfer-Encoding specs, they're gone too. If it was multipart/alternative with text/plain and text/html sections, they're both slammed into the body, without indication of where one ends and the other begins. 
But that's the way we score Outlook email, and it's darned hard to do better. Outlook's message store is a complicated beast, and predates current email standards; they tacked MIME email on top of a sprawling store that didn't know anything about MIME, spraying bits and pieces all over the place. Pretty cool . > So although popping a proxy in between Outlook and the POP3 server to > catch raw messages would certainly be more pure and correct > (sb_server can do this, BTW, just set the cache expiry limit *really* > high and don't bother classifiying any messages), for practical > purposes using the data that Outlook gives is just as useful. For anyone using spambayes via the Outlook addin, it's *better* to use export.py than to capture the incoming email bytestream. SpamBayes can't reconstruct the original bytestream from Outlook (not out of laziness, it's simply impossible), so how the classifier would do if it *could* see the original bytestream is irrelevant to real-life Outlook use. It's close enough that I doubt it matters much. From kennypitt at hotmail.com Wed Dec 24 08:44:48 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Dec 24 08:45:26 2003 Subject: [spambayes-dev] comment assertion error? revisitDBDictClassifierassumptions? In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt] >> With the caching and optimization in the database engines being what >> it is today, it seems that we might be better off to always write >> changes to the DB immediately and dispense with the whole >> self.changed_words thing altogether. > > This should be measured; it's not (or shouldn't be) a religious > issue. I have no experience with general-purpose database engines > that are actually fast; only some that aren't as slow as others > <0.5 wink>. As always, never assume anything without thorough testing, right? 
>> When there are multiple processes that could be using the database >> at the same time, any caching (read or write) that we do ourselves >> outside the database engine has the potential to generate >> inconsistencies in the data anyway. > > A conclusion there, one way or the other, depends on specific details. > Concurrent read-write access is never simple, and I'm not sure anyone > uses spambayes that way anyway. As far as I can tell, this should only happen with sb_filter/sb_mboxtrain. All the other solutions that I know about (Outlook, sb_server, sb_imapfilter, sb_xmlrpcserver) have a single server process that handles all database access. Out of any remaining solutions, I also suspect they are rarely used since I hardly ever see them mentioned on any of the mailing lists. This leads to a question regarding the proposed direct BerkeleyDB storage. If we never access the database from more than one process at the same time, do we really need a full-fledged multi-process environment for Berkeley? You can do private, multi-thread environments that provide sufficient locking with less overhead for a single process. Any guesses from anyone as to what cases would require cross-process locking? -- Kenny Pitt From kennypitt at hotmail.com Wed Dec 24 08:56:59 2003 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Dec 24 08:57:39 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt] >> ... >> Unfortunately, the line that always fails >> is that very first and most basic initialization call, the same one >> that we would need to call for any use in SpamBayes. > > I don't think there's a problem with that: > > ... > > Note that the 4 Env instances of test02_cursors pass. They're doing > the full-blown open-with-env bit too. It's the the 4 Env instances of > test01_basics that fail, and all of them die with the same traceback: > > ... 
> > So there's something screwed up about how the > test tries to close and reopen the dbshelve (self.d) on this box. > Figuring out exactly what would require digging into the guts of the > stinkin' dbshelve module, to see how *its* stinkin' close method > screws up . I suspect some timing issue with the Windows disk cache not immediately flushing stuff to disk. That's just idle speculation, of course, but I have seen similar things in other development projects. > If I comment out lines 74 and 75 (the back-to-back close()/open() > pair), the 4 env instances of test01_basics all pass. > > ... > > The only way they don't pass is to do exactly what the test does > . > >> ... >> Maybe the best thing is to throw some test code into SpamBayes >> and see if it will even start up on Win98. > > Yes. Good to know, thanks. I'll proceed along that line, then. I can't think of a good reason that we should need to close and then immediately reopen the same database. -- Kenny Pitt From skip at pobox.com Wed Dec 24 10:07:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 24 10:07:10 2003 Subject: [spambayes-dev] test_storage.py failing Message-ID: <16361.43929.945893.137105@montanaro.dyndns.org> I ran the spambayes/test/test_storage.py this morning for the first time (on a fresh CVS checkout) and got several instances of the same error. 
Here's one example: ERROR: testHapax (__main__.DBStorageTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_storage.py", line 137, in setUp return _StorageTestBase.setUp(self) File "test_storage.py", line 20, in setUp self.classifier = self.__class__.StorageClass(self.db_name) TypeError: __init__() takes exactly 1 argument (2 given) I inserted print self.__class__.StorageClass right above the class call and in the error case it's always instantiating DBDictClassifier which does take a db_name argument, so I'm a bit confused about why this is generating an error. My brain is not in a high enough gear to see why. I think I'll just go do a little last minute Christmas shopping and let someone else figure it out. Skip From tim.one at comcast.net Wed Dec 24 12:35:33 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 24 12:35:37 2003 Subject: [spambayes-dev] test_storage.py failing In-Reply-To: <16361.43929.945893.137105@montanaro.dyndns.org> Message-ID: [spambayes-dev-bounces@python.org] [Skip] > I ran the spambayes/test/test_storage.py this morning for the first > time (on a fresh CVS checkout) and got several instances of the same > error. How many is several? Note that there are only 5 tests here, so if several means more than 2, it's possible that *all* the tests died this way for you. That would be a clue. 
> Here's one example: > > ERROR: testHapax (__main__.DBStorageTestCase) > > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_storage.py", line 137, in setUp > return _StorageTestBase.setUp(self) > File "test_storage.py", line 20, in setUp > self.classifier = self.__class__.StorageClass(self.db_name) > TypeError: __init__() takes exactly 1 argument (2 given) > > I inserted > > print self.__class__.StorageClass > > right above the class call and in the error case it's always > instantiating DBDictClassifier which does take a db_name argument, so > I'm a bit confused about why this is generating an error. I expect you need to run python with -v to see how the imports are getting satisfied -- the only guess I have is that you're not getting the classes the test expects to get. Here are runs on my box (Pythons 2.3.3 and 2.2.3): C:\Code\spambayes>echo %PYTHONPATH% \code\spambayes C:\Code\spambayes>\python23\python spambayes/test/test_storage.py -v testHapax (__main__.PickleStorageTestCase) ... ok test_bug777026 (__main__.PickleStorageTestCase) ... ok testHapax (__main__.DBStorageTestCase) ... ok testNoDBMAvailable (__main__.DBStorageTestCase) ... ok test_bug777026 (__main__.DBStorageTestCase) ... ok ---------------------------------------------------------------------- Ran 5 tests in 0.050s OK C:\Code\spambayes>\python22\python spambayes/test/test_storage.py -v testHapax (__main__.PickleStorageTestCase) ... ok test_bug777026 (__main__.PickleStorageTestCase) ... ok testHapax (__main__.DBStorageTestCase) ... ok testNoDBMAvailable (__main__.DBStorageTestCase) ... ok test_bug777026 (__main__.DBStorageTestCase) ... ok ---------------------------------------------------------------------- Ran 5 tests in 0.110s Note that I checked in some code cleanup for test_storage.py right before typing this msg, but I got the same results before too. 
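When a constructor's behaviour doesn't match the source you are reading, as with the TypeError above, one way to follow the "see how the imports are getting satisfied" advice without wading through all of python -v's output is to ask the imported module where it came from and what its __init__ accepts. This is only a sketch: the helper names and the DemoClassifier class are hypothetical, and you would pass the real module and class in practice.

```python
import importlib
import inspect


def module_origin(name):
    """Return the file a module was actually imported from, if any."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


def init_arg_names(cls):
    """Return the names __init__ accepts, not counting self."""
    params = inspect.signature(cls.__init__).parameters
    return [p for p in params if p != "self"]


class DemoClassifier:
    """Hypothetical stand-in for the class under test."""

    def __init__(self, db_name):
        self.db_name = db_name


if __name__ == "__main__":
    # A path outside your checkout suggests a different copy is in use.
    print(module_origin("json"))
    print(init_arg_names(DemoClassifier))
```

If the reported path points somewhere other than the checkout being tested, the class the test instantiates may not be the one whose source you are looking at.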
From nobody at spamcop.net Wed Dec 24 13:46:14 2003 From: nobody at spamcop.net (Seth Goodman) Date: Wed Dec 24 13:46:23 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: Thanks so much Alex, Skip, Tony and Tim! This gives me all the rope I need, as they say. I'll dig in and ask for specific help when I run into problems later. I've been saving my whole mail stream for a while, so I do have something to test on. First task is get export.py working, next explore the test programs that others have already checked in. Thanks again. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From richie at entrian.com Wed Dec 24 16:31:31 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Dec 24 16:31:46 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <001801c3c91e$1cde2bf0$2c00a8c0@eden> References: <001801c3c91e$1cde2bf0$2c00a8c0@eden> Message-ID: [Mark] > I have just uploaded an installer for a new experimental binary of > SpamBayes. This binary includes *both* the Outlook addin and the sb_server > applications. Nice one! Barring a few minor glitches (which I'll enter into the SF tracker when I get the chance) sb_tray worked like a charm for me. -- Richie Hindle richie@entrian.com From popiel at wolfskeep.com Thu Dec 25 15:24:04 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Dec 25 15:24:08 2003 Subject: [spambayes-dev] Reduced training test results Message-ID: <20031225202404.757652DF61@cashew.wolfskeep.com> Training on just those messages whose score isn't 0.00 or 1.00 (rounded) seems to be a huge win over training on everything. Not so much because the accuracy is better (though accuracy does seem to be improved by neglecting those messages that it's already certain about), but because of a hugely reduced training set (and thus database). 
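A minimal sketch of the training decision described here, assuming that "rounded" means rounding the score to two decimal places and that a message at an extreme score in the *wrong* direction still gets trained. The function name is hypothetical and this is not the incremental.py harness code.

```python
def should_train_nonedge(score, is_spam):
    """Decide whether to train on a message under a 'nonedge'-style rule.

    score   -- classifier score in [0.0, 1.0]
    is_spam -- the message's true classification
    """
    rounded = round(score, 2)
    # A message is skipped only when it is already scored at the correct
    # extreme: 1.00 for spam, 0.00 for ham.
    already_certain = rounded == (1.0 if is_spam else 0.0)
    return not already_certain


if __name__ == "__main__":
    print(should_train_nonedge(0.996, True))   # spam already at 1.00
    print(should_train_nonedge(0.60, True))    # spam, but not certain
```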
Specifically, training on everything yielded a database with 70,000 messages, while training only on the non-extreme put only about 3,500 messages into the database. Unfortunately, I don't have firm numbers on token counts. Also of significant interest is that the classifier doesn't seem to decay as badly over time. With training on everything, the unsure rate in particular (and fn to a much lesser extent) goes up significantly after about 200 days' worth of traffic, though the fp rate stays low. With just training on those things that aren't already certain, the unsure rate climbs much more slowly after 200 days (with the cumulative rate staying relatively flat), while the fp and fn rates stay at very low values. Details of my experiment parameters: I've got about 77000 messages in my dataset, covering a span of 418 days. Of these, about 21500 are ham, and nearly 56000 are spam. I include virus/worm messages in my spam, and the "latest windows update" worm makes its presence felt around day 360. I divided my dataset into 10 subsets, and ran the incremental.py harness over these 10 times, excluding 1 set each time, as per normal cv-ish behaviour. Thus, each of my measurements is replicated 10 times, with slightly different input data. Finally, I did the above-mentioned 10 runs using both the 'perfect' and 'nonedge' regimes. The 'perfect' regime trains on every message using the proper ham/spam classification, while the 'nonedge' regime trains only on those messages that were not correctly classified with 0.00 or 1.00 (rounded) scores. I've plotted both the cumulative and 7-day average values for error rates (fp, fn, and unsure) and training counts (ham and spam). Pictures (and a copy of this writeup) are on my website at: http://www.wolfskeep.com/~popiel/spambayes/nonedge - Alex PS. Sorry this took so long, but running the perfect regime on such a large dataset took a couple days on my machine... I need more memory!
;-) From tim.one at comcast.net Thu Dec 25 17:10:23 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 25 17:10:29 2003 Subject: [spambayes-dev] A new and altogether different bsddb breakage In-Reply-To: Message-ID: [Kenny Pitt] > ... > I suspect some timing issue with the Windows disk cache not > immediately flushing stuff to disk. That's just idle speculation, of > course, but I have seen similar things in other development projects. It's true that doing fileobject.flush() on Windows doesn't make any guarantee about writing anything to disk. Python 2.3 grew an os.fsync implementation for Windows, and os.fsync(fileobject.fileno()) does write to disk on Windows (and sometimes takes a veeeeery long time to do so!). That calls the MS C _commit() function under the covers, which in turn calls the Win32 FlushFileBuffers(). > ... > I can't think of a good reason that we should need to close and then > immediately reopen the same database. Me neither, but I bet we can find a way if we need to. In particular, you pointed to Sleepycat docs before containing cautions about how things need to be set up under Windows, and I'm almost certain the test suite doesn't do that. From tim.one at comcast.net Thu Dec 25 17:10:24 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Dec 25 17:10:36 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Seth Goodman] > ... > If I can have the Outlook binary and non-Outlook source working at > the same time, Probably, but I don't really know. I run the Outlook addin directly from a CVS checkout of spambayes, and have never used the binary installer (I don't object to it , it's just that using it would consume a little more non-existent "spare time"). > is there a way to convert my saved Outlook mail folders to mbox > format export.py in the spambayes Outlook2000 directory works fine, and I just checked in a pile of changes so it works even finer. 
> so that I _can_ see how the changes I make work on my own mail > stream as well? That's potentially more difficult than what you've (or I've!) been doing: to run "what if I changed this or that?" experiments, you need to save every email you ever get, and ensure that each one is correctly classified. Else you're not reproducing your original email stream, so it's anyone's guess then what you'd really be testing. Two days ago I created a new .pst file, with two folders "All ham" and "All spam". Since then I've been copying each message I get into one of them. When it comes time to use export.py, I'll have to temporarily fiddle my spambayes config to say that "All ham" is my (only) ham folder and "All spam" my (only) spam folder (export.py gets its idea of where your ham and spam training data are from your Outlook spambayes config file). Copying all incoming msgs is a bit of a PITA for me, and if you use Outlook rules too (I don't) to sort ham into different folders, may be a royal PITA. So it goes -- Outlook wasn't designed for running spam-filter experiments (then again, no email client was, and that's why we have a "standard" test-data directory structure of our own). Ah, I've noted before that I throw away half my Unsures unclassified, because I can't tell whether they're ham or spam (these are usually barely intelligible msgs addressed to public "admin" or "help" kinds of addresses). I'm making an arbitrary guess about each of those too, and saving a copy in "All ham" or "All spam". I *expect* a relatively high Unsure rate because of this aspect of my email mix. No part of the testing framework can be talked into believing that Unsure is the *desired* outcome for a msg, though, so I either have to make a guess about each, or damage the experimental setup in unknown ways by not saving *all* my email. 
From skip at pobox.com Thu Dec 25 18:08:40 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Dec 25 18:08:47 2003 Subject: [spambayes-dev] test_storage.py failing In-Reply-To: References: <16361.43929.945893.137105@montanaro.dyndns.org> Message-ID: <16363.28152.314476.785433@montanaro.dyndns.org> Tim> [Skip] >> I ran the spambayes/test/test_storage.py this morning for the first >> time (on a fresh CVS checkout) and got several instances of the same >> error. Tim> How many is several? Note that there are only 5 tests here, so if Tim> several means more than 2, it's possible that *all* the tests died Tim> this way for you. That would be a clue. "several" was 3 in this case - all the DBDictClassifier tests. Tim> I expect you need to run python with -v to see how the imports are Tim> getting satisfied -- the only guess I have is that you're not Tim> getting the classes the test expects to get. They were coming from the right place. I eventually figured out that distutils didn't overwrite my installed copy when I tried installing from a new CVS version. Sorry for the false alarm. I wonder if I should file a bug report against distutils... Skip From popiel at wolfskeep.com Thu Dec 25 18:59:14 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Dec 25 18:59:18 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Tim Peters" of "Thu, 25 Dec 2003 17:10:24 EST." References: Message-ID: <20031225235914.8CE0E2DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >Ah, I've noted before that I throw away half my Unsures unclassified, >because I can't tell whether they're ham or spam >No part of the testing framework can be talked into believing that >Unsure is the *desired* outcome for a msg, Hrm. Good point. 
Perhaps we should fix this, adding a third branch to the testing framework's data directory tree, and then convincing the test code to use messages in that third branch in the classify phase, but not in the train phase. And then we'd have the six error states of ham->spam, ham->unsure, unsure->ham, unsure->spam, spam->ham, and spam->unsure. Hrm. Not for me to do today, though... I'm still running more variations of the stuff I posted about earlier. Redoing the fpfnunsure test that I did last March (with my new dataset so it's comparable), and then adding in 200 day message expiry to my nonedge regime. - Alex From skip at pobox.com Fri Dec 26 09:35:40 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Dec 26 09:35:52 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: <20031225202404.757652DF61@cashew.wolfskeep.com> References: <20031225202404.757652DF61@cashew.wolfskeep.com> Message-ID: <16364.18236.225460.401395@montanaro.dyndns.org> Alex> Also of significant interest is that the classifier doesn't seem Alex> to decay as badly over time. With training on everything, the Alex> unsure rate in particular (and fn to a much lesser extent) goes up Alex> significantly after about 200 days worth of traffic, though the fp Alex> rate stays low. With just training on those things that aren't Alex> already certain, the unsure rate climbs much more slowly after 200 Alex> days (with the cumulative rate staying relatively flat), while the Alex> fp and fn rates stay at very low values. Alex> Details of my experiment parameters: Alex> I've got about 77000 messages in my dataset, covering a span of Alex> 418 days. Of these, about 21500 are ham, and nearly 56000 are spam. Alex> I include virus/worm messages in my spam, and the "latest windows Alex> update" worm makes its presence felt around day 360. Is it possible that the ham/spam ratio isn't as bad when you don't train on everything? 
Skip From nobody at spamcop.net Fri Dec 26 12:07:52 2003 From: nobody at spamcop.net (Seth Goodman) Date: Fri Dec 26 12:07:57 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: > [Tim Peters] > Two days ago I created a new .pst file, with two folders "All > ham" and "All > spam". Since then I've been copying each message I get into one of them. That's exactly what I've been doing for a while, so that's encouraging. I have local ham and spam corpus folders in outlook.pst that I move (or copy) _all_ messages into when I am finished with them. I toss a few unsures, but most go into one bucket or the other. Those two folders in Outlook.pst get autoarchived into SpamCorpus1.pst when messages in them are more than three days old. That gives me time to manually track statistics (another PITA). > [Tim Peters] > When it comes time to use export.py, I'll have to temporarily fiddle my > spambayes config to say that "All ham" is my (only) ham folder and "All > spam" my (only) spam folder (export.py gets its idea of where your ham and > spam training data are from your Outlook spambayes config file). Which place in the SpamBayes manager is the one that changes the config that export.py uses? There are ham and spam folder specifications in more than one place: filtering, training and watched folders at least, there may be more. > [Tim Peters] > Copying all incoming msgs is a bit of a PITA for me, and if you > use Outlook > rules too (I don't) to sort ham into different folders, may be a > royal PITA. > So it goes -- Outlook wasn't designed for running spam-filter experiments > (then again, no email client was, and that's why we have a "standard" > test-data directory structure of our own). Yeah, I use a lot of rules and sub-folders, so I have developed a "recipe" to make sure I don't screw up the semi-manual sorting (the thought of learning VB and the insides of Outlook is painful; my hat's off to Mark). 
One thing I do that may or may not be typical is that I let Outlook rules take care of all the mailing list traffic. That includes almost no spam and so I don't train or classify it (the list admins do a good job). Therefore, I _don't_ include it in my ham corpus. This gives me a roughly 1:5 ham/spam corpus, instead of roughly even, but that's the mail stream that SpamBayes sees. I _do_ make sure the training sets have equal numbers of messages. At present, my corpus is about 7,500 messages total. This may not be enough to "divide into ten sets", etc. Or is it? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From dave at boost-consulting.com Fri Dec 26 13:16:55 2003 From: dave at boost-consulting.com (David Abrahams) Date: Fri Dec 26 13:17:06 2003 Subject: [spambayes-dev] NEWTRICKS Message-ID: I keep getting quite a few spams which fit the descriptions below (from NEWTRICKS.txt): - Punctuation sometimes gets inserted in otherwise spammy words or phrases, e.g.: "Ch-eck ou=t ou-r sel)ection _of grea)t R_X -emgffj". It might be helpful to try stripping punctuation. (Idea from Paul Sorenson) - Similarly, some letters get replaced by numbers, e.g.: "V1agra" instead of "Viagra". Mapping numbers to suitable letters might help in some situations. Since "this file is for ideas that have or have not yet been tried", I'd love to know what constitutes "trying". Is there some official testing procedure or corpus we can test against? I'd like to know whether any change I make is worth proposing. Of course I can try it on my own databases of Ham and Spam first... 
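The two NEWTRICKS ideas quoted above could be tried with something like the following sketch. It is not part of the SpamBayes tokenizer; the function name and the digit-to-letter table are assumptions, and the table in particular is a guess (a "1" could as easily stand for "l" as for "i").

```python
import string

# Hypothetical digit-to-letter mapping; adjust to taste.
DIGIT_TO_LETTER = str.maketrans("013457", "oieast")
# Translation table that deletes all ASCII punctuation.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_token(token):
    """Strip punctuation, map look-alike digits to letters, lowercase."""
    stripped = token.translate(STRIP_PUNCTUATION)
    # Leave purely numeric tokens (years, prices) alone.
    if any(ch.isalpha() for ch in stripped):
        stripped = stripped.translate(DIGIT_TO_LETTER)
    return stripped.lower()


if __name__ == "__main__":
    for raw in "Ch-eck ou=t ou-r sel)ection V1agra 2003".split():
        print(raw, "->", normalize_token(raw))
```

Whether this helps is an empirical question: the normalized tokens would need to be generated alongside or instead of the originals, and the results compared against an unmodified run on the same ham and spam.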
-- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Fri Dec 26 13:36:09 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Dec 26 13:36:33 2003 Subject: [spambayes-dev] NEWTRICKS In-Reply-To: References: Message-ID: <16364.32665.369857.975422@montanaro.dyndns.org> Dave> I keep getting quite a few spams which fit the descriptions below Dave> (from NEWTRICKS.txt): Dave> - Punctuation sometimes gets inserted in otherwise spammy words Dave> or phrases, e.g.: "Ch-eck ou=t ou-r sel)ection _of grea)t R_X Dave> -emgffj". It might be helpful to try stripping punctuation. Dave> (Idea from Paul Sorenson) Dave> - Similarly, some letters get replaced by numbers, e.g.: Dave> "V1agra" instead of "Viagra". Mapping numbers to suitable Dave> letters might help in some situations. Dave> Since "this file is for ideas that have or have not yet been Dave> tried", I'd love to know what constitutes "trying". Is there some Dave> official testing procedure or corpus we can test against? I'd Dave> like to know whether any change I make is worth proposing. Of Dave> course I can try it on my own databases of Ham and Spam first... I tried the first (eliding punctuation from words). From a testing standpoint it turns out to not be all that useful, I think for a couple reasons: * There are plenty of other spammy clues in such messages which are sufficient to kick these messages into spam range. Most of this stuff winds up scoring at 0.95 or above for me. If they don't score as spam for you, train on a few and see how it does then. * Training databases full of old-ish mail won't contain many of these sorts of messages, so enabling punctuation removal won't change things very much. Skip From popiel at wolfskeep.com Fri Dec 26 13:44:21 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Dec 26 13:44:25 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: Message from Skip Montanaro of "Fri, 26 Dec 2003 08:35:40 CST." 
<16364.18236.225460.401395@montanaro.dyndns.org> References: <20031225202404.757652DF61@cashew.wolfskeep.com> <16364.18236.225460.401395@montanaro.dyndns.org> Message-ID: <20031226184421.D3E342DF61@cashew.wolfskeep.com> In message: <16364.18236.225460.401395@montanaro.dyndns.org> Skip Montanaro writes: > > Alex> Also of significant interest is that the classifier doesn't seem > Alex> to decay as badly over time. With training on everything, the > Alex> unsure rate in particular (and fn to a much lesser extent) goes up > Alex> significantly after about 200 days worth of traffic, though the fp > Alex> rate stays low. With just training on those things that aren't > Alex> already certain, the unsure rate climbs much more slowly after 200 > Alex> days (with the cumulative rate staying relatively flat), while the > Alex> fp and fn rates stay at very low values. > > Alex> Details of my experiment parameters: > > Alex> I've got about 77000 messages in my dataset, covering a span of > Alex> 418 days. Of these, about 21500 are ham, and nearly 56000 are spam. > Alex> I include virus/worm messages in my spam, and the "latest windows > Alex> update" worm makes its presence felt around day 360. > >Is it possible that the ham/spam ratio isn't as bad when you don't train on >everything? Eyeballing the graphs, it seems that the ratio is slightly _more_ unbalanced for the nonedge regime, rather than less. Also, from looking closer at the 7-day span graphs, I see that the inflection point is at about 120 days, not 200. - Alex From popiel at wolfskeep.com Fri Dec 26 14:09:23 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Dec 26 14:09:28 2003 Subject: [spambayes-dev] NEWTRICKS In-Reply-To: Message from David Abrahams of "Fri, 26 Dec 2003 13:16:55 EST." 
References: Message-ID: <20031226190923.5708E2DF61@cashew.wolfskeep.com> In message: David Abrahams writes: > >Since "this file is for ideas that have or have not yet been tried", >I'd love to know what constitutes "trying". Is there some official >testing procedure or corpus we can test against? I'd like to know >whether any change I make is worth proposing. Of course I can try it >on my own databases of Ham and Spam first... Heh. We just went through this question with Seth Goodman. Basic summation of the last week or so of advice is: Grab the latest CVS image, then read README-DEVEL.txt and incremental.HOWTO.txt. Lots of good info in there. Collect your own ham & spam corpora, put them into the appropriate directory structure, then run the testing tools over them with different options/classifiers/tokenizers/whatnot. Post results and enough explanation so that people can try to replicate your results using their own corpora. - Alex From popiel at wolfskeep.com Fri Dec 26 14:21:08 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Dec 26 14:21:12 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Seth Goodman" of "Fri, 26 Dec 2003 11:07:52 CST." References: Message-ID: <20031226192108.B38002DF61@cashew.wolfskeep.com> In message: "Seth Goodman" writes: > >One thing I do that may or may not be typical is that I let Outlook rules >take care of all the mailing list traffic. That includes almost no spam and >so I don't train or classify it (the list admins do a good job). Therefore, >I _don't_ include it in my ham corpus. Reasonable. >This gives me a roughly 1:5 ham/spam corpus, instead of roughly even, but >that's the mail stream that SpamBayes sees. This is the stuff I'd tend to use for the testing, as opposed to your equal-sized training sets. >At present, my corpus is about 7,500 messages total. This may not be enough >to "divide into ten sets", etc. Or is it? 
I think we did our first classifier shootouts with a minimum of 2,000 messages, so you should be fine. You may not have enough to see some of the longer-term effects I'm now witnessing (with inflection points at 120 and 200+ days), but you should be able to get started, at least. And heck, those inflection points (or the timing thereof) may be peculiarities of my own data. It'd be good to see.

- Alex

From stephena at hiwaay.net Fri Dec 26 16:32:45 2003
From: stephena at hiwaay.net (Stephen Anderson)
Date: Fri Dec 26 16:32:49 2003
Subject: [spambayes-dev] Two SB on One Computer
Message-ID:

Hi,

I searched through the archives and I couldn't find anything conclusive on this. Please don't hesitate to point me back to the archives if you know I've missed something.

I'm using the sb_server (pop3proxy) on an XP computer as a service. I'd like to install it as two separate services and use two separate databases and separate web management ports so two different users can each have their own customized spam filter.

I tickled through the service script and eye-balled the sb_server but I'm not sure what all assumptions are made that would make two instances of SB overlap. Can anybody give me some insight on what things they think I will have to watch out for?

Thank you!

Stephen Anderson
===========================================================================
http://wecanstopspam.org

From tim.one at comcast.net Fri Dec 26 21:13:04 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Dec 26 21:13:15 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <20031225235914.8CE0E2DF61@cashew.wolfskeep.com>
Message-ID:

[Tim]
>> ...
>> Ah, I've noted before that I throw away half my Unsures unclassified,
>> because I can't tell whether they're ham or spam
>> ...
>> No part of the testing framework can be talked into believing that
>> Unsure is the *desired* outcome for a msg, though ...

[T. Alexander Popiel]
> Hrm. Good point.
Perhaps we should fix this, adding a third branch > to the testing framework's data directory tree, and then convincing > the test code to use messages in that third branch in the classify > phase, but not in the train phase. And then we'd have the six > error states of ham->spam, ham->unsure, unsure->ham, unsure->spam, > spam->ham, and spam->unsure. The added complication is unattractive -- I'm OK with guessing "the right" category, even while believing that doesn't make sense . From tim.one at comcast.net Fri Dec 26 21:13:05 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Dec 26 21:13:19 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message-ID: [Seth Goodman] > ... > Which place in the SpamBayes manager is the one that changes the > config that export.py uses? There are ham and spam folder > specifications in more than one place: filtering, training and > watched folders at least, there may be more. Training. This will become clear when you run export.py, since it displays the names of the folders it's exporting. Don't hesitate to run export.py. It doesn't change your .pst files in any way -- it's harmless, and the files it creates can be thrown away at will. > ... > One thing I do that may or may not be typical is that I let Outlook > rules take care of all the mailing list traffic. That includes > almost no spam and so I don't train or classify it (the list admins > do a good job). Therefore, I _don't_ include it in my ham corpus. > This gives me a roughly 1:5 ham/spam corpus, instead of roughly even, > but that's the mail stream that SpamBayes sees. Yet it remains possible that the best training strategy for your mix requires artificially forcing a particular ratio. Picture an extreme: if your actual incoming ratio is a million to one ... > I _do_ make sure the training sets have equal numbers of messages. At > present, my corpus is about 7,500 messages total. This may not be > enough to "divide into ten sets", etc. 
Or is it?

It's plenty. The last multi-corpus "death match" experiments here required that participants use exactly 10 sets of ham and 10 sets of spam, each set having exactly 200 messages. That's a grand total of 4,000 msgs.

However, it's not clear *what* to test anymore. At the start, this project was aimed at high-volume mailing lists, where the admins were thought most likely to train on giant sets of ham and spam a few times per year. Randomized cross-validation testing is a fine approach for that use. There are apparently only a few people who use spambayes that way, though, and among the rest of us no two seem to train in the same way. Incremental training, and preserving the order in which messages arrive, seem overwhelmingly more interesting to most real users.

So what may be more important now, building on Alex's incremental testers, isn't the sheer number of messages so much as the span of time they cover.

Indeed, for new users, it's important to know how this filter behaves after training on just a few messages. That's my particular interest with the experimental mixed unigram/bigram scheme: the hope is that it "learns faster". In earlier tests, I never found anything that beat the pure unigram scheme *given enough training data*, but few users have 20,000 recent ham and spam to start off with.

OTOH, I don't have enough exhaustive personal email saved away to measure anything other than how the system performs across a few days, and a scheme that "learns fast" starting from nothing *may* also be slow to adapt to changes over time (we all know a bright kid who never outgrew their 6th-grade worldview, right ?).

Oh well. There have always been more ideas to test than were possible to cover.

From tim.one at comcast.net Fri Dec 26 21:13:06 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Dec 26 21:13:25 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <20031226192108.B38002DF61@cashew.wolfskeep.com> Message-ID: [Seth Goodman] >> This gives me a roughly 1:5 ham/spam corpus, instead of roughly >> even, but that's the mail stream that SpamBayes sees. [T. Alexander Popiel] > This is the stuff I'd tend to use for the testing, as opposed to your > equal-sized training sets. If Seth is going to test incremental training regimes, then yes, his entire email stream (well, the parts of it scored by spambayes -- he said he uses Outlook rules to exempt a large part of it from getting scored at all) should be included. If he wants to do cross-validation testing, he should still bust it all up into the same number of sets. timcv's "ham-keep" and "spam-keep" options can be used then to select random equal-sized (or non-equal-sized) subsets dynamically. In your (Alex's) recent "nonedge" incremental training experiment, it looks like your training data grew to about a 5.5::1 spam::ham ratio after 400 days. I know my personal classifiers start acting flaky whenever I've let them get imbalanced by more than 2::1 in either direction. So if I had your data, I'd be curious to try variations that force better balance. I have my data, but it's less than a week old . You have enough data that it may well be more interesting to you to try variations including expiration (the second derivative of your "Cumulative Trained Counts" ham training curve appears slightly negative, but your spam training curve appears mostly straight except for two points where it clearly gets steeper -- a hypothesis is that your ham isn't changing much over time, but that your spam is, the weight of the old spam training data is making it harder to adjust to the spam changes, and that this gets worse over time; OTOH, with the spam::ham training imbalance getting worse over time too, it may just be that the classifier is getting flakier over time too for that reason alone). 
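The idea mentioned above of picking random equal-sized (or deliberately unequal) subsets of ham and spam from each set, so that a cross-validation run can force a chosen ratio, might look something like the sketch below. It is not timcv's actual implementation: sample_upto, pick_balanced, the seed argument, and the list-of-lists layout of the sets are all assumptions made for illustration only.

```python
import random


def sample_upto(msgs, keep, rng):
    """Return `keep` randomly chosen messages, or all of them if fewer exist."""
    msgs = list(msgs)
    if keep is None or keep >= len(msgs):
        return msgs
    return rng.sample(msgs, keep)


def pick_balanced(ham_sets, spam_sets, ham_keep, spam_keep, seed=0):
    """For each (ham set, spam set) pair, keep a random subset of each.

    ham_sets and spam_sets are parallel lists, one entry per Set directory,
    each entry being a list of message names (a hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [(sample_upto(ham, ham_keep, rng),
             sample_upto(spam, spam_keep, rng))
            for ham, spam in zip(ham_sets, spam_sets)]


if __name__ == "__main__":
    ham_sets = [["ham%d_%d" % (s, i) for i in range(5)] for s in range(3)]
    spam_sets = [["spam%d_%d" % (s, i) for i in range(25)] for s in range(3)]
    for ham, spam in pick_balanced(ham_sets, spam_sets, 5, 10, seed=42):
        print(len(ham), len(spam))
```

Keeping the selection seeded means two people comparing results can at least reproduce each other's subsets, though the chosen counts themselves remain a judgment call.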
From tim.one at comcast.net Sat Dec 27 00:52:36 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 27 00:52:41 2003 Subject: [spambayes-dev] NEWTRICKS In-Reply-To: Message-ID: [David Abrahams] > I keep getting quite a few spams which fit the descriptions below > (from NEWTRICS.txt): I'm sure everyone gets them, the interesting question is whether they're evading your spambayes filter. They don't seem to give mine particular trouble (of course I train on those that score Unsure; I'm not sure I've ever seen one score as Ham). > ... [descriptions of attempted obfuscation via insertion of > punctuation, and replacing letters by digits] ... > Since "this file is for ideas that have or have not yet been tried", > I'd love to know what constitutes "trying". Is there some official > testing procedure or corpus we can test against? I'd like to know > whether any change I make is worth proposing. Of course I can try it > on my own databases of Ham and Spam first... There's no official corpus, else we'd be teaching the system to recognize that corpus. Alex gave the right pointers to docs for the testing framework. From tim.one at comcast.net Sat Dec 27 00:52:37 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 27 00:52:45 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: <20031225202404.757652DF61@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > Training on just those messages whose score isn't 0.00 or 1.00 > (rounded) seems to be a huge win over training on everything. > Not so much because the accuracy is better (though accuracy > does seem to be improved by neglecting those messages that it's > already certain about), I'm afraid TOE gives too much weight to systematically correlated tokens. My experience with python.org mailing lists has pointed in that direction since the start, but it's probably more general than that. 
In a recap nutshell, every piece of email coming from python.org has (with mine_received_headers enabled) about a dozen tokens effectively saying "I came from python.org". I get several hundred ham like that every day, but also a few spam per week. Under TOE, the "python.org clues" get spamprobs approaching 0, and a dozen very strong ham tokens is hard to overcome. As a result, it's *hard* for a spam leaking thru python.org to score as spam on my end -- even under mistake-based training, where the spamprobs on python.org-tokens are much higher than they'd be under TOE. I expect most (maybe all) of the developers here have similar long-term sources of ham, feeding you daily with correlated tokens effectively identifying the source. An irony is that I don't need those python.org tokens: the *content* of those msgs is solidly hammy even without them. Maybe we should ignore our strongest clues <0.5 wink>. > but because of a hugely reduced training set (and thus database). > Specifically, training on everything yielded a database with 70,000 > messages, while training only on the non-extreme put only about > 3,500 messages into the database. Unfortunately, I don't have firm > numbers on token counts. That's OK. It was rigorously established before that the # of tokens either does or doesn't go up with the square root, or some other function, of the message count . > Also of significant interest is that the classifier doesn't seem > to decay as badly over time. With training on everything, the > unsure rate in particular (and fn to a much lesser extent) goes > up significantly after about 200 days worth of traffic, That's peculiar. Did you try this with different starting dates, and find that "about 200 days" was invariant across starting dates -- or did you try a single starting date, and note that something funny happened about 200 days after that single starting date. 
I think the latter, in which case it's natural to speculate that something significant changed around then in your ham and/or spam mix. Thanks for the report, Alex! Good work. From popiel at wolfskeep.com Sat Dec 27 01:01:24 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Dec 27 01:01:30 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Tim Peters" of "Fri, 26 Dec 2003 21:13:04 EST." References: Message-ID: <20031227060125.07CC32DF80@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[Tim] >>> ... >>> Ah, I've noted before that I throw away half my Unsures unclassified, >>> because I can't tell whether they're ham or spam >>> ... >>> No part of the testing framework can be talked into believing that >>> Unsure is the *desired* outcome for a msg, though ... > >[T. Alexander Popiel] >> Hrm. Good point. Perhaps we should fix this, adding a third branch >> to the testing framework's data directory tree, ... >> And then we'd have the six error states ... > >The added complication is unattractive -- I'm OK with guessing "the right" >category, even while believing that doesn't make sense . Oh, good. That wasn't a bit of hackery I was looking forward to. If you're OK with ignoring it, then I certainly am. - Alex From popiel at wolfskeep.com Sat Dec 27 01:19:35 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Dec 27 01:20:37 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Tim Peters" of "Fri, 26 Dec 2003 21:13:06 EST." References: Message-ID: <20031227061940.7C5492DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >In your (Alex's) recent "nonedge" incremental training experiment, it looks >like your training data grew to about a 5.5::1 spam::ham ratio after 400 >days. Yup. 
I have a nice picture now of the ratio over time at the bottom of the report at: http://www.wolfskeep.com/~popiel/spambayes/nonedge >I know my personal classifiers start acting flaky whenever I've let >them get imbalanced by more than 2::1 in either direction. Interestingly enough, though, the nonedge did better than TOE, despite a worse imbalance. >So if I had your data, I'd be curious to try variations that force better >balance. I'd love to... but I haven't been able to come up with anything which maintains the balance better without extreme artificiality. If you think of any regimes that make sense, I'd be more than happy to run them. >You have enough data that it >may well be more interesting to you to try variations including expiration *grin* That's part of what's been burning my CPU ever since I posted the last report. I'll have another report, including that, probably within 3 days. Still have more to test... and my runs are taking between 6 and 20 hours each, depending on the memory used by the classifiers. - Alex From popiel at wolfskeep.com Sat Dec 27 01:39:02 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Dec 27 01:39:06 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Tim Peters" of "Fri, 26 Dec 2003 21:13:05 EST." References: Message-ID: <20031227063902.9D9372DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >So what may be more important now, building on Alex's incremental testers, >isn't the sheer number of messages so much as the span of time they cover. I've been having this supposition, too, but was afraid of scaring people off by voicing it. After all, I don't know if anyone else has been anal enough to have been maintaining growing corpora for over a year... 
>OTOH, I don't have enough exhaustive personal email saved away to measure >anything other than how the system performs across a few days, and a scheme >that "learns fast" starting from nothing *may* also be slow to adapt to >changes over time (we all know a bright kid who never outgrew their >6th-grade worldview, right ?). Heh. I could be convinced to run the bigram scheme over my dataset after I'm done with my current set of tests... though I may need a gig of memory to do it. ;-) My current 256 meg is dying under the load of TOE with expiry. - Alex From popiel at wolfskeep.com Sat Dec 27 01:43:13 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Dec 27 01:43:17 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: Message from "Tim Peters" of "Sat, 27 Dec 2003 00:52:37 EST." References: Message-ID: <20031227064313.AB1702DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[T. Alexander Popiel] > >> Also of significant interest is that the classifier doesn't seem >> to decay as badly over time. With training on everything, the >> unsure rate in particular (and fn to a much lesser extent) goes >> up significantly after about 200 days worth of traffic, > >That's peculiar. Did you try this with different starting dates, and find >that "about 200 days" was invariant across starting dates -- or did you try >a single starting date, and note that something funny happened about 200 >days after that single starting date. I think the latter, in which case >it's natural to speculate that something significant changed around then in >your ham and/or spam mix. It was in fact the latter, and I'm just now prepping for spinning my dataset by 80 and 160 days to revalidate. Even odder things are happening at specific times in the expiry stuff, and I want to see if it's specific real times, or time after training commences... 
- Alex

From tim.one at comcast.net Sat Dec 27 04:58:31 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 04:58:43 2003
Subject: [spambayes-dev] New sort+group.py
Message-ID:

Attached is a major rewrite of testtools/sort+group.py. Anyone who uses that, please give it a try. If nobody gripes, I'll check it in. (If you're on Linux, the attached probably has Windows line ends, and you may need to change that.)

It's used exactly the same way as before, and creates filenames with the same pattern as before, *except* that any pre-existing extension (like ".txt" on Windows) is preserved. Extensions are necessary for sane life on Windows, but the code currently checked in strips extensions as part of renaming.

The major thrust of the changes is to order msgs by full-precision UTC timestamp. It was sorting just by date (not time), and wasn't accounting for that different ISPs may be in different time zones. It also failed to parse many of the Received headers in my email, partly because Comcast's Received headers don't make any attempt to keep the date-time part on a single line. Other failures were due to "unusual" spellings in the date-time part. Instead email.Utils.parsedate_tz() is used to parse this stuff, and that didn't fail on any of the email I've tried so far.

Almost all Received headers I see have hour:minute:second info, and since I do incremental training during the day, as email comes in, it's important to me that the email be ordered at finer granularity than "a day". A second should be good enough . My various ISPs are in different time zones too, and normalizing to UTC should help model that, e.g., the first time I see a new spam campaign it's much more likely to arrive from my MSN account than from my Comcast account.
-------------- next part --------------
#! /usr/bin/env python

### Sort and group the messages in the Data hierarchy.
### Run this prior to mksets.py for setting stuff up for
### testing of chronological incremental training.

"""Usage: sort+group.py

This program has no options!  Muahahahaha!
"""

import sys
import os
import glob
import time
from email.Utils import parsedate_tz, mktime_tz

loud = True

SECONDS_PER_DAY = 24 * 60 * 60

# Scan the file with path fpath for its first Received header, and return
# a UTC timestamp for the date-time it specifies.  If anything goes wrong
# (can't find a Received header; can't parse the date), return None.
# This is the best guess about when we received the msg.
def get_time(fpath):
    fh = file(fpath, "rb")
    # Find first Received header.
    for line in fh:
        if line.lower().startswith("received:"):
            break
    else:
        print "\nNo Received header found."
        fh.close()
        return None
    # Paste on the continuation lines.
    received = line
    for line in fh:
        if line[0] in ' \t':
            received += line
        else:
            break
    fh.close()
    # RFC 2822 says the date-time field must follow a semicolon at the end.
    i = received.rfind(';')
    if i < 0:
        print "\n" + received
        print "No semicolon found in Received header."
        return None
    # We only want the part after the semicolon.
    datestring = received[i+1:]
    # It may still be split across lines (like "Wed, \r\n\t22 Oct ...").
    datestring = ' '.join(datestring.split())
    as_tuple = parsedate_tz(datestring)
    if as_tuple is None:
        print "\n" + received
        print "Couldn't parse the date: %r" % datestring
        return None
    return mktime_tz(as_tuple)

def main():
    """Main program; parse options and go."""

    data = []   # list of (time_received, path) pairs
    now = time.time()
    if loud:
        print "Scanning everything"
    for name in glob.glob('Data/*/*/*'):
        if loud:
            sys.stdout.write("%-78s\r" % name)
            sys.stdout.flush()
        when_received = get_time(name)
        data.append((when_received or now, name))

    if loud:
        print ""
        print "Sorting ..."
    data.sort()

    # First rename all the files to a form we can't produce in the end.
    # This is to protect against name clashes in case the files are
    # already named according to the scheme we use.
    if loud:
        print "Renaming first pass ..."
    for dummy, name in data:
        dirname = os.path.dirname(name)
        basename = os.path.basename(name)
        os.rename(name, os.path.join(dirname, "-"+basename))

    if loud:
        print "Renaming second pass ..."
    earliest = data[0][0]  # timestamp of earliest msg received
    for i, (when_received, name) in enumerate(data):
        dirname = os.path.dirname(name)
        basename = os.path.basename(name)
        extension = os.path.splitext(basename)[-1]
        group = int((when_received - earliest) / SECONDS_PER_DAY)
        newbasename = "%04d-%06d%s" % (group, i, extension)
        os.rename(os.path.join(dirname, "-"+basename),
                  os.path.join(dirname, newbasename))

if __name__ == "__main__":
    main()

From richie at entrian.com Sat Dec 27 05:07:33 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Dec 27 05:07:37 2003
Subject: [spambayes-dev] Two SB on One Computer
In-Reply-To:
References:
Message-ID:

Hi Stephen,

> I'm using the sb_server (pop3proxy) on an XP computer as a service. I'd
> like to install it as two separate services and use two separate databases
> and separate web management ports so two different users can each have
> their own customized spam filter.

You can't (I don't think) do this with the service, but you can certainly do it when running sb_server from the command line, or via the Startup group. Just run each one in its own working directory with its own bayescustomize.ini:

 o Create a directory for each instance of sb_server.

 o In each, create a bayescustomize.ini with minimal settings. This:

     [html_ui]
     port=1234

   is probably enough. Set up the rest through http://localhost:1234

 o Run sb_server from the command line in each directory.
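The directory-per-instance steps above could be scripted along these lines. This is a rough sketch only: the instance names and port numbers are made-up placeholders, make_instance_dirs is a hypothetical helper, and the only setting written is the [html_ui] port shown in the message; everything else is assumed to be configured afterwards through each instance's web interface.

```python
import configparser
import os


def make_instance_dirs(base_dir, ui_ports):
    """Create one working directory per instance, each with its own
    minimal bayescustomize.ini.

    ui_ports maps a (hypothetical) instance name to the web UI port that
    instance should use.  Returns the paths of the ini files written.
    """
    created = []
    for name, port in sorted(ui_ports.items()):
        inst_dir = os.path.join(base_dir, name)
        os.makedirs(inst_dir, exist_ok=True)
        config = configparser.ConfigParser()
        config["html_ui"] = {"port": str(port)}
        ini_path = os.path.join(inst_dir, "bayescustomize.ini")
        with open(ini_path, "w") as fh:
            config.write(fh)
        created.append(ini_path)
    return created


if __name__ == "__main__":
    # Placeholder names and ports; pick whatever suits the two users.
    for path in make_instance_dirs("sb_instances", {"user1": 8881,
                                                    "user2": 8882}):
        print(path)
```

Each instance would then be started from inside its own directory, so that it picks up the ini file (and keeps its database) there.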
-- Richie Hindle richie@entrian.com From dave at boost-consulting.com Sat Dec 27 07:08:32 2003 From: dave at boost-consulting.com (David Abrahams) Date: Sat Dec 27 07:08:40 2003 Subject: [spambayes-dev] NEWTRICKS In-Reply-To: (Tim Peters's message of "Sat, 27 Dec 2003 00:52:36 -0500") References: Message-ID: "Tim Peters" writes: > [David Abrahams] >> I keep getting quite a few spams which fit the descriptions below >> (from NEWTRICS.txt): > > I'm sure everyone gets them, the interesting question is whether they're > evading your spambayes filter. They are showing up as Unsure; I wouldn't see them otherwise. > They don't seem to give mine particular trouble (of course I train > on those that score Unsure; Me too. > I'm not sure I've ever seen one score as Ham). Me neither. >> ... [descriptions of attempted obfuscation via insertion of >> punctuation, and replacing letters by digits] ... > >> Since "this file is for ideas that have or have not yet been tried", >> I'd love to know what constitutes "trying". Is there some official >> testing procedure or corpus we can test against? I'd like to know >> whether any change I make is worth proposing. Of course I can try it >> on my own databases of Ham and Spam first... > > There's no official corpus, else we'd be teaching the system to recognize > that corpus. Alex gave the right pointers to docs for the testing > framework. Thanks. We'll see if my Christmas downtime lasts long enough for me to be able to try that ;-) -- Dave Abrahams Boost Consulting www.boost-consulting.com From popiel at wolfskeep.com Sat Dec 27 14:29:33 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Dec 27 14:29:38 2003 Subject: [spambayes-dev] New sort+group.py In-Reply-To: Message from "Tim Peters" of "Sat, 27 Dec 2003 04:58:31 EST." References: Message-ID: <20031227192933.AD0CC2DF61@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >Attached is a major rewrite of testtools/sort+group.py. Yay! 
I'll be happy to admit I just sort of threw that together.

>Anyone who uses that, please give it a try.

Trying... but it seems to have major problems with python2.2. It barfs on enumerate(), and it doesn't seem to be picking up continuation lines, either, so I suspect the file reading style you're using isn't grokked correctly, either.

>The major thrust of the changes is to order msgs by full-precision UTC
>timestamp. It was sorting just by date (not time), and wasn't accounting
>for that different ISPs may be in different time zones.

Oops. Doh. Thanks for catching that. All my mail gets received and timestamped by my local machine, so the timezones weren't an issue... but ignoring time of day entirely is rather embarrassing.

- Alex, who is now trying to get it to work with python2.2...

From tim.one at comcast.net Sat Dec 27 14:51:46 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 14:51:49 2003
Subject: [spambayes-dev] New sort+group.py
In-Reply-To: <20031227192933.AD0CC2DF61@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> Trying... but it seems to have major problems with python2.2.

Ah, fiddlesticks -- does someone here still give a rip about Python 2.2? I don't. 2.2 is dead -- it's no longer maintained, 2.3.3 is out and universally regarded as stabler & faster than 2.2.3, and development has moved on to 2.4. I'd like to drop all our 2.2 compatibility cruft; it's a growing mass of dead weight.

> It barfs on enumerate(), and it doesn't seem to be picking up
> continuation lines, either, so I suspect the file reading style
> you're using isn't grokked correctly, either.

Right, it wouldn't. The easiest pithy explanation is that file objects in 2.2 *have* iterators, but file objects in 2.3 *are* iterators.
To use the same style of code under both requires getting an explicit iterator,

    it = iter(fh)

and then doing

    for line in it:

everywhere instead of

    for line in fh:

As is, the

    for line in fh:

lines under 2.2 are really jumping across internal file buffers. That was crazy behavior, and that's why it got repaired for 2.3 (but the fix couldn't be backported to 2.2 lest some crazy code relied on the broken behavior).

>> The major thrust of the changes is to order msgs by full-precision
>> UTC timestamp. It was sorting just by date (not time), and wasn't
>> accounting for that different ISPs may be in different time zones.

> Oops. Doh. Thanks for catching that. All my mail gets received
> and timestamped by my local machine, so the timezones weren't an
> issue... but ignoring time of day entirely is rather embarrassing.

Not your fault: it's not possible to find out time of day under 2.2 either .

From popiel at wolfskeep.com Sat Dec 27 15:03:15 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 15:03:19 2003
Subject: [spambayes-dev] New sort+group.py
In-Reply-To: Message from "Tim Peters" of "Sat, 27 Dec 2003 04:58:31 EST."
References:
Message-ID: <20031227200315.3E1242DF61@cashew.wolfskeep.com>

In message: "Tim Peters" writes:
>
>Attached is a major rewrite of testtools/sort+group.py.

Here's a patch to make it work with python2.2. It appears that the 'for line in fh:' syntax for filereading in 2.2 buffered a bunch of lines which were then unavailable for use to subsequent similar loops in the case of the first loop terminating early. Also, enumerate() didn't seem to exist, so I just maintained the count manually.

Enjoy.

- Alex

--- sort+group.py.noworky	Sat Dec 27 11:58:04 2003
+++ sort+group.py	Sat Dec 27 11:57:37 2003
@@ -26,20 +26,21 @@
 def get_time(fpath):
     fh = file(fpath, "rb")
     # Find first Received header.
-    for line in fh:
+    line = fh.readline()
+    while line != "\r\n" and line != "\n" and line != "":
         if line.lower().startswith("received:"):
             break
+        line = fh.readline()
     else:
         print "\nNo Received header found."
         fh.close()
         return None
     # Paste on the continuation lines.
     received = line
-    for line in fh:
-        if line[0] in ' \t':
-            received += line
-        else:
-            break
+    line = fh.readline()
+    while line[0] in ' \t':
+        received += line
+        line = fh.readline()
     fh.close()
     # RFC 2822 says the date-time field must follow a semicolon at the end.
     i = received.rfind(';')
@@ -90,7 +91,9 @@
     if loud:
         print "Renaming second pass ..."
     earliest = data[0][0]  # timestamp of earliest msg received
-    for i, (when_received, name) in enumerate(data):
+    i = 0
+    for when_received, name in data:
+        i = i + 1
         dirname = os.path.dirname(name)
         basename = os.path.basename(name)
         extension = os.path.splitext(basename)[-1]

From popiel at wolfskeep.com Sat Dec 27 16:45:50 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 16:45:55 2003
Subject: [spambayes-dev] New sort+group.py
In-Reply-To: Message from "Tim Peters" of "Sat, 27 Dec 2003 14:51:46 EST."
References:
Message-ID: <20031227214550.D698C2DF61@cashew.wolfskeep.com>

In message: "Tim Peters" writes:
>[T. Alexander Popiel]
>> Trying... but it seems to have major problems with python2.2.
>
>Ah, fiddlesticks -- does someone here still give a rip about Python 2.2?

Yeah, I do. There are no more recent versions packaged for Debian stable. Sorry. This will likely change when sarge makes it to stable... but that's likely not going to happen for at least 6 months.

>Not your fault: it's not possible to find out time of day under 2.2 either
>.
*pthbbbt* - Alex From tim.one at comcast.net Sat Dec 27 21:41:53 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 27 21:41:55 2003 Subject: [spambayes-dev] New sort+group.py In-Reply-To: <20031227214550.D698C2DF61@cashew.wolfskeep.com> Message-ID: [Tim] >> Ah, fiddlesticks -- does someone here still give a rip about Python >> 2.2? [T. Alexander Popiel] > Yeah, I do. There are no more recent versions packaged for Debian > stable. Sorry. This will likely change when sarge makes it to > stable... but that's likely not going to happen for at least 6 months. Heh. You're on a Linux system and can't upgrade a package? Makes me glad I'm running Windows, where others don't dictate what I can run on my own machine . I made sort+group.py 2.2.3-friendly, far as I can tell, but since 2.3 came out I don't use 2.2 for anything anymore -- if I introduce more incompatibilities, I won't know. From tim.one at comcast.net Sat Dec 27 22:07:56 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 27 22:07:59 2003 Subject: [spambayes-dev] Code changes Message-ID: The meaning of Outlook2000/export.py's -n option has changed. Here's the checkin comment: INCOMPATIBLE CHANGE: the -n option now gives the number of Set subdirectories desired, instead of a number of msgs per Set subdir "to shoot for". If you want to run, e.g., 10-fold cross-validation, you have to have exactly 10 Set folders, and the # of msgs per folder is of much less importance. Also added a note recommending to run rebal.py afterwards. rebal is the expert in setting up randomized Set subdirectories, and the export.py script probably should have stuck to just extracting msgs from Outlook. utilities/rebal.py has grown a -t option, which makes it (once again) easy to use with a standard test setup. It was originally easy to use that way, but grew -r and -s options, presumably added by someone with a non-standard test setup. 
Unfortunately, those with a standard test setup had to use them too, and they're both clumsy and error-prone to use with a standard test setup. -t can't be used in the same run with -r or -s. Those with a standard test setup no longer need to worry about -r or -s, just -t; vice versa for those with a non-standard test setup.

The changes to testtools/sort+group.py discussed here have been checked in, after fiddling to play nice with Python 2.2.3 too.

From popiel at wolfskeep.com Sat Dec 27 22:32:16 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 22:32:20 2003
Subject: [spambayes-dev] New sort+group.py
In-Reply-To: Message from "Tim Peters" of "Sat, 27 Dec 2003 21:41:53 EST."
References:
Message-ID: <20031228033216.2897F2DF61@cashew.wolfskeep.com>

In message: "Tim Peters" writes:
>[Tim]
>>> Ah, fiddlesticks -- does someone here still give a rip about Python
>>> 2.2?
>
>[T. Alexander Popiel]
>> Yeah, I do. There are no more recent versions packaged for Debian
>> stable. Sorry. This will likely change when sarge makes it to
>> stable... but that's likely not going to happen for at least 6 months.
>
>Heh. You're on a Linux system and can't upgrade a package? Makes me glad
>I'm running Windows, where others don't dictate what I can run on my own
>machine.

Eh, it's not that I can't... it's that if I do, I either have to go through a lot of hassle to package it myself, or I have to go through a lot of hassle to make all the packages that depend on python ignore the fact that there's no python package listed as installed. A pain either way. (I _might_ be able to grab a package version out of sarge and recompile to avoid a library-version-incompatibility cascade which would require me to upgrade half my system to possibly broken versions... but still, a nuisance).

>I made sort+group.py 2.2.3-friendly, far as I can tell, but since 2.3 came
>out I don't use 2.2 for anything anymore -- if I introduce more
>incompatibilities, I won't know.
*nod* I'll tell ya if you broke it for me. ;-)

- Alex

From sourceforge at metrak.com Sun Dec 28 01:16:32 2003
From: sourceforge at metrak.com (Paul Sorenson)
Date: Sun Dec 28 01:16:36 2003
Subject: [spambayes-dev] error training on dbx file
Message-ID: <00e901c3cd0a$23566f20$c48b0fcb@home.classware.com.au>

With code I checked out from CVS in the last 24 hours or so, I got the error below when trying to train a dbx file via the web interface. I have python 2.3.3 installed and the box involved is running Windows XP.

oe_mailbox.py doesn't appear to import time.

500 Server error

Traceback (most recent call last):
  File "C:\usr\spambayes\spambayes\Dibbler.py", line 457, in found_terminator
    getattr(plugin, name)(**params)
  File "C:\usr\spambayes\spambayes\UserInterface.py", line 479, in onTrain
    content = self._convertToMbox(content)
  File "C:\usr\spambayes\spambayes\UserInterface.py", line 521, in _convertToMbox
    content = oe_mailbox.convertToMbox(content)
  File "C:\usr\spambayes\spambayes\oe_mailbox.py", line 465, in convertToMbox
    dbxBuffer += "From spambayes@spambayes.org %s\n%s" \
NameError: global name 'strftime' is not defined

From sourceforge at metrak.com Sun Dec 28 01:26:24 2003
From: sourceforge at metrak.com (Paul Sorenson)
Date: Sun Dec 28 01:26:24 2003
Subject: [spambayes-dev] Re: error training on dbx file
Message-ID: <00fa01c3cd0b$847ffae0$c48b0fcb@home.classware.com.au>

Please ignore this error. The user has been deleted :-)

From rob at hooft.net Mon Dec 29 04:37:58 2003
From: rob at hooft.net (Rob Hooft)
Date: Mon Dec 29 04:39:16 2003
Subject: [spambayes-dev] Reduced training test results
In-Reply-To: <20031225202404.757652DF61@cashew.wolfskeep.com>
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
Message-ID: <3FEFF5F6.1090004@hooft.net>

T. Alexander Popiel wrote:
> Training on just those messages whose score isn't 0.00 or 1.00
> (rounded) seems to be a huge win over training on everything.
Told you: See the section "Train on Errors, Unsures, and non-obvious correct decisions" at http://www.entrian.com/sbwiki/TrainingIdeas Happy that it comes out as I thought it would, though. > Not so much because the accuracy is better (though accuracy > does seem to be improved by neglecting those messages that it's > already certain about), but because of a hugely reduced training > set (and thus database). Both are effects I can feel in practice! Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From skip at pobox.com Mon Dec 29 09:05:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 29 12:41:16 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: <3FEFF5F6.1090004@hooft.net> References: <20031225202404.757652DF61@cashew.wolfskeep.com> <3FEFF5F6.1090004@hooft.net> Message-ID: <16368.13462.404757.694070@montanaro.dyndns.org> Rob> T. Alexander Popiel wrote: >> Training on just those messages whose score isn't 0.00 or 1.00 >> (rounded) seems to be a huge win over training on everything. Rob> Told you: Rob> See the section "Train on Errors, Unsures, and non-obvious correct Rob> decisions" at http://www.entrian.com/sbwiki/TrainingIdeas I think we need to split that page into multiple chunks. I (directly and indirectly) contributed a fair amount of content to that page, but my eyes just glaze over now when reading it. Anybody got some pretty graphs to break up the text? Skip From popiel at wolfskeep.com Mon Dec 29 12:51:22 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 29 12:51:29 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: Message from Rob Hooft of "Mon, 29 Dec 2003 10:37:58 +0100." <3FEFF5F6.1090004@hooft.net> References: <20031225202404.757652DF61@cashew.wolfskeep.com> <3FEFF5F6.1090004@hooft.net> Message-ID: <20031229175122.C3E6A2DE88@cashew.wolfskeep.com> In message: <3FEFF5F6.1090004@hooft.net> Rob Hooft writes: >T. 
Alexander Popiel wrote: >> Training on just those messages whose score isn't 0.00 or 1.00 >> (rounded) seems to be a huge win over training on everything. > >Told you: >See the section "Train on Errors, Unsures, and non-obvious correct >decisions" at http://www.entrian.com/sbwiki/TrainingIdeas Hrm. I suppose that I ought to actually look at the wiki. ;-) Is there any way for me to upload my plots to go along with any discussion that I might add to the above page? I could just reference them on my machine, but it seems better to keep the wiki content all in one place. >> Not so much because the accuracy is better (though accuracy >> does seem to be improved by neglecting those messages that it's >> already certain about), but because of a hugely reduced training >> set (and thus database). > >Both are effects I can feel in practice! FWIW, using this training style with my nightly retrains cut my database size in half (from 21 meg to 10 meg). This is with a 4-month horizon, too, so the difference would likely be even greater with a longer span. - Alex From popiel at wolfskeep.com Mon Dec 29 13:28:36 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 29 13:28:41 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: Message from Skip Montanaro of "Mon, 29 Dec 2003 08:05:10 CST." <16368.13462.404757.694070@montanaro.dyndns.org> References: <20031225202404.757652DF61@cashew.wolfskeep.com> <3FEFF5F6.1090004@hooft.net> <16368.13462.404757.694070@montanaro.dyndns.org> Message-ID: <20031229182836.EA1012DE88@cashew.wolfskeep.com> In message: <16368.13462.404757.694070@montanaro.dyndns.org> Skip Montanaro writes: > > Rob> T. Alexander Popiel wrote: > >> Training on just those messages whose score isn't 0.00 or 1.00 > >> (rounded) seems to be a huge win over training on everything. 
> > Rob> Told you: > Rob> See the section "Train on Errors, Unsures, and non-obvious correct > Rob> decisions" at http://www.entrian.com/sbwiki/TrainingIdeas > >I think we need to split that page into multiple chunks. Agreed. I think that using the subpage mechanism would be good. >Anybody got some pretty graphs to break up the text? Several, now... though I may need to see if I can rescale them to something less than full-page. Hrm. Also, a few of these training ideas are already represented by regimes for the incremental harness... and I've got a couple more to check in. I also recognize that my names for the regimes are, umm, less than optimal; if people have better names for such, please speak up. As an example, I'm probably going to rename the 'perfect' regime to 'TrainOnEverything'. Suggestions for capitalization style? Should it be TrainOnEverything, train_on_everything, or something else? - Alex From richie at entrian.com Mon Dec 29 13:32:29 2003 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 29 13:32:35 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: <20031229175122.C3E6A2DE88@cashew.wolfskeep.com> References: <20031225202404.757652DF61@cashew.wolfskeep.com> <3FEFF5F6.1090004@hooft.net> of "Mon, 29 Dec 2003 10:37:58 +0100." <3FEFF5F6.1090004@hooft.net> <20031229175122.C3E6A2DE88@cashew.wolfskeep.com> Message-ID: [Alex] > Is there any way for me to upload my plots to go along with any > discussion that I might add to the above page? I could just > reference them on my machine, but it seems better to keep the > wiki content all in one place. You can't upload images into the Wiki, no. You can either reference images on another server, as you say, or if you make them available to me then I'll upload them onto the Wiki server and let you know their URLs. 
-- Richie Hindle richie@entrian.com From tim.one at comcast.net Mon Dec 29 13:57:54 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 29 13:58:07 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: <20031229182836.EA1012DE88@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > Suggestions for capitalization style? Should it be TrainOnEverything, > train_on_everything, or something else? I like the latter. Barry's experience with the email package is that especially non-native English readers have an easier time with underscores than with CamelCasing. Underscores are also more natural if these end up as specifiable values in .ini files. From skip at pobox.com Mon Dec 29 15:14:42 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 29 15:14:54 2003 Subject: [spambayes-dev] Reduced training test results In-Reply-To: <20031229175122.C3E6A2DE88@cashew.wolfskeep.com> References: <20031225202404.757652DF61@cashew.wolfskeep.com> <3FEFF5F6.1090004@hooft.net> <20031229175122.C3E6A2DE88@cashew.wolfskeep.com> Message-ID: <16368.35634.271833.34086@montanaro.dyndns.org> Alex> Is there any way for me to upload my plots to go along with any Alex> discussion that I might add to the above page? I could just Alex> reference them on my machine, but it seems better to keep the wiki Alex> content all in one place. Dunno. You'll have to poke around the Wiki help. Skip From tim.one at comcast.net Mon Dec 29 15:37:17 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Dec 29 15:37:42 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: <20031227061940.7C5492DF61@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > Yup. I have a nice picture now of the ratio over time at the bottom > of the report at: > http://www.wolfskeep.com/~popiel/spambayes/nonedge Hmm. 
That appears to be using a log scale for the Y (ratio) axis, so what *appears* to be straight-line growth in the ratio after about day 150 is really exponential growth. That could get bad over time . > ... > Interestingly enough, though, the nonedge did better than TOE, despite > a worse imbalance. Yup, I saw that. >> So if I had your data, I'd be curious to try variations that force >> better balance. > I'd love to... but I haven't been able to come up with anything which > maintains the balance better without extreme artificiality. If you > think of any regimes that make sense, I'd be more than happy to run > them. Oh, there are billions of things that could be tried. Who knows what might pay? Picking just enough edge ham at random for training to force balance is one idea. The definition of "nonedge" is arbitrarily mutable too: there's nothing a priori compelling about "0.00 or 1.00 after rounding to 2 decimal digits after the radix point". For example, maybe it's better to use 3 decimal digits, or 1, or maybe it's really best to use 2 digits after the radix point when the score is expressed in base 7 . Asymmetric bounds also have some attraction, since, e.g., in mistake-based training "by hand" I always end up moving the ham cutoff closer to 0 than the spam cutoff is to 1. IOW, empirically, in my own email mix, and based on one kind of lazy training, my region of certainty for ham is smaller than my region of certainty for spam. This makes some sense to me, since my ham is more uniform than my spam. Heh. Except at Christmas, and probably through the first week of next year, when I get piles of msgs from people I only hear from once a year. From popiel at wolfskeep.com Mon Dec 29 17:00:14 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Dec 29 17:00:18 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? In-Reply-To: Message from "Tim Peters" of "Mon, 29 Dec 2003 15:37:17 EST." 
References: Message-ID: <20031229220014.2B7C02DE88@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[T. Alexander Popiel] >> ... >> Yup. I have a nice picture now of the ratio over time at the bottom >> of the report at: >> http://www.wolfskeep.com/~popiel/spambayes/nonedge > >Hmm. That appears to be using a log scale for the Y (ratio) axis, so what >*appears* to be straight-line growth in the ratio after about day 150 is >really exponential growth. That could get bad over time . Yeah, I used log scale for the ratio... log makes more sense to me for ratios. I can trivially replot on linear scale if you want. ;-) >Oh, there are billions of things that could be tried. Who knows what might >pay? Aye, there are. I don't have billions of CPU-days to burn, though, so I'm trying to winnow down to stuff that's likely to pay off. Theoretical beauty is one measure that sort of appeals. >displayed>. No argument there. I have no particular love for that rule, either. >Asymmetric bounds also have some attraction, since, e.g., in mistake-based >training "by hand" I always end up moving the ham cutoff closer to 0 than >the spam cutoff is to 1. One thing that's occurred to me is to have the training cutoffs at N sigma from mean (where N == .5?) for the two populations; how you'd bootstrap that is an open question, of course. - Alex From mhammond at skippinet.com.au Mon Dec 29 17:21:26 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Dec 29 17:21:48 2003 Subject: [spambayes-dev] test_storage.py failing In-Reply-To: <16363.28152.314476.785433@montanaro.dyndns.org> Message-ID: <0e7401c3ce5a$19e1e900$2c00a8c0@eden> > They were coming from the right place. I eventually figured out that > distutils didn't overwrite my installed copy when I tried > installing from a > new CVS version. > > Sorry for the false alarm. I wonder if I should file a bug > report against > distutils... I have seen similar things with distutils. 
If the 'installed' file has a later date than the file being installed, distutils decides not to install it. I struck this when I modified an installed version of a file, making a quick hack of a change for debugging. My idea was that by changing it in the installed copy, I wouldn't need to undo the change, and would just rely on distutils to overwrite with the correct copy. I went so far as stepping through distutils in a debugger before I saw that was the intent of the code. I agree it sucks, but it doesn't appear to be a bug. Mark. From skip at pobox.com Mon Dec 29 17:55:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 29 17:55:31 2003 Subject: [spambayes-dev] test_storage.py failing In-Reply-To: <0e7401c3ce5a$19e1e900$2c00a8c0@eden> References: <16363.28152.314476.785433@montanaro.dyndns.org> <0e7401c3ce5a$19e1e900$2c00a8c0@eden> Message-ID: <16368.45275.633710.948924@montanaro.dyndns.org> >> Sorry for the false alarm. I wonder if I should file a bug report >> against distutils... Mark> I have seen similar things with distutils. If the 'installed' Mark> file has a later date than the file being installed, distutils Mark> decides not to install it. I think cvs exacerbates the problem by setting the mod time of checked out files to the last checkin date. 
Here's a listing of the Outlook2000 in my ~/tmp/spambayes directory: % ls -ltr Outlook2000/ total 284 -rw-rw-r-- 1 skip staff 4102 Oct 3 00:23 README.txt -rw-rw-r-- 1 skip staff 1779 Dec 14 05:23 default_bayes_customize.ini -rw-rw-r-- 1 skip staff 7199 Dec 15 23:06 train.py -rw-rw-r-- 1 skip staff 3539 Dec 15 23:06 oastats.py -rw-rw-r-- 1 skip staff 6205 Dec 15 23:06 config_wizard.py -rw-rw-r-- 1 skip staff 57355 Dec 19 00:25 msgstore.py -rw-rw-r-- 1 skip staff 7105 Dec 19 00:27 filter.py -rw-rw-r-- 1 skip staff 39497 Dec 20 05:21 manager.py -rw-rw-r-- 1 skip staff 18414 Dec 21 21:13 config.py -rw-rw-r-- 1 skip staff 73521 Dec 21 21:16 addin.py -rw-rw-r-- 1 skip staff 7407 Dec 21 21:17 about.html drwxrwxr-x 15 skip staff 510 Dec 23 22:00 dialogs drwxrwxr-x 7 skip staff 238 Dec 23 22:00 docs drwxrwxr-x 11 skip staff 374 Dec 23 22:00 sandbox drwxrwxr-x 10 skip staff 340 Dec 23 22:00 installer drwxrwxr-x 6 skip staff 204 Dec 23 22:00 images -rw-rw-r-- 1 skip staff 29666 Dec 23 22:06 tester.py -rw-r--r-- 1 skip staff 8209 Dec 29 15:58 export.py drwxrwxr-x 5 skip staff 170 Dec 29 15:58 CVS Note that I created the entire tree just a few days before Christmas, yet the README.txt file has a timestamp of October 3rd. Mark> I went so far as stepping through distutils in a debugger before I Mark> saw that was the intent of the code. I agree it sucks, but it Mark> doesn't appear to be a bug. Maybe there's a flag in cvs which will set the timestamp appropriately. Alternatively, I suppose a 'find . -type f | xargs touch' would work for us Unix geeks. Still, it's surprising. (Best thing would be to install if the source and destination files have different checksums.) Skip From ltnieh at earthlink.net Mon Dec 29 21:01:27 2003 From: ltnieh at earthlink.net (Luther Nieh) Date: Mon Dec 29 21:01:26 2003 Subject: [spambayes-dev] Full re-initialization Message-ID: Hello SpamBayes Tech support, Thank you for developing this useful software. 
I have used it for a few weeks and I feel it has been doing its job as described in your documentation.

One day, as I was cleaning up the M/S Outlook folders, I accidentally deleted the spam email folder. Now, SpamBayes does not seem to work anymore. It said it couldn't send spam emails to that folder, even though I had manually re-created the spam email folder. I even attempted to remove and re-install SpamBayes without any improvement.

Please let me know what one does to have the program re-initialize itself as if it was a new installation.

Thank you for your help.

Regards,
Luther Nieh
ltnieh@sunnydesign.com
12/29/03

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.552 / Virus Database: 344 - Release Date: 12/15/2003

From popiel at wolfskeep.com Mon Dec 29 23:29:17 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 29 23:29:22 2003
Subject: [spambayes-dev] Full re-initialization
In-Reply-To: Message from "Luther Nieh" of "Mon, 29 Dec 2003 18:01:27 PST."
References:
Message-ID: <20031230042918.115EB2DE88@cashew.wolfskeep.com>

In message: "Luther Nieh" writes:
>
>Hello SpamBayes Tech support,

Well, we're not really tech support here... it's just the people who wrote the code and us hangers-on who heckle from the sides.

>One day, as I was cleaning up the M/S Outlook folders, I accidentally
>deleted the spam email folder. Now, SpamBayes does not seem to work
>anymore. It said it couldn't send spam emails to that folder, even
>though I had manually re-created the spam email folder.

In this case, Outlook is "helpful" and remembers that you "moved" the spam mail folder to the trash. What you should do is go back to the configuration panel where you set the spam mail folder, and point it back to your re-created spam mail folder. Once that's done, all should be happy again.
- Alex (who doesn't use Outlook and thus can't get super-specific) From mhammond at skippinet.com.au Tue Dec 30 01:09:22 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Dec 30 01:09:31 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: Message-ID: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden> [Richie] > [Mark] > > I have just uploaded an installer for a new experimental binary of > > SpamBayes. This binary includes *both* the Outlook addin > and the sb_server > > applications. > > Nice one! Barring a few minor glitches (which I'll enter into the SF > tracker when I get the chance) sb_tray worked like a charm for me. Great! I didn't see any new items in the tracker though. If they are trivial, just mail them to me. Otherwise, did anyone else try this build? Either for Outlook or sb_server? I fear I may have "disclaimed" the build a little too much, as this is the only reply I got, and I see no new bugs etc. Thanks, Mark. From theller at python.net Tue Dec 30 04:54:51 2003 From: theller at python.net (Thomas Heller) Date: Tue Dec 30 03:55:35 2003 Subject: [spambayes-dev] Re: test_storage.py failing References: <16363.28152.314476.785433@montanaro.dyndns.org> <0e7401c3ce5a$19e1e900$2c00a8c0@eden> Message-ID: "Mark Hammond" writes: >> They were coming from the right place. I eventually figured out that >> distutils didn't overwrite my installed copy when I tried >> installing from a >> new CVS version. >> >> Sorry for the false alarm. I wonder if I should file a bug >> report against >> distutils... > > I have seen similar things with distutils. If the 'installed' file has a > later date than the file being installed, distutils decides not to install > it. > > I struck this when I modified an installed version of a file, making a quick > hack of a change for debugging. My idea was that by changing it in the > installed copy, I wouldn't need to undo the change, and would just rely on > distutils to overwrite with the correct copy. 
> > I went so far as stepping through distutils in a debugger before I saw that > was the intent of the code. I agree it sucks, but it doesn't appear to be a > bug. There's a '--force' command line option for distutils' install command for that. Thomas From richie at entrian.com Tue Dec 30 08:50:02 2003 From: richie at entrian.com (Richie Hindle) Date: Tue Dec 30 08:50:10 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden> References: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden> Message-ID: <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com> [Richie] > Nice one! Barring a few minor glitches (which I'll enter into the SF > tracker when I get the chance) sb_tray worked like a charm for me. [Mark] > Great! I didn't see any new items in the tracker though. If they are > trivial, just mail them to me. Sorry Mark, Christmas has been pretty hectic. 8-) Here are the notes I made. Lots of these aren't anything to do with the binary packaging, but I'll send the lot anyway. If you need any more detail, just ask: I have Outlook, and the installer says "Outlook appears to be installed". But I don't use Outlook, so I clear that checkbox, check the Server box, and hit Next. The I think, "Hang on, I might as well have a look at the Outlook plugin" so I hit Back. It now says "Outlook does not appear to be installed". A bit misleading. At the end of the install I checked both "View welcome.html" and "View proxy_readme". Only welcome.html appeared. I can launch many instances of sb_tray without complaint. The ini file for the proxy appeared in "C:\Documents and Settings\rjh\Application Data\SpamBayes\Proxy" as you'd expect, but the database and cache directories appeared in "C:\Program Files\SpamBayes\bin". Then after restarting, another set of database and cache directories appeared in "C:\Documents and Settings\rjh". 
I guess sb_tray writes them into the working directory, and the installer's working directory is "C:\Program Files\SpamBayes\bin" when it launches sb_tray for the first time. Then when you start it from the Start menu, the working directory is "C:\Documents and Settings\rjh". If I right-click the tray icon and go "Stop spambayes", the icon goes red after a second or two and the proxy stops. When I go right-click / Start, the proxy doesn't start, and the icon still shows red. If I move the cursor over the icon - without clicking - it goes green, but the proxy still hasn't started. Right-clicking presents a "Stop" command, which makes the icon go red again, but as soon as I move the cursor over the icon it returns to green. I have to exit and restart before the proxy will restart. I'd question whether we need the Stop/Start command - why would I want the tray icon to stay there but the application to not run? Stop vs. Exit is not a clear distinction. Things like firewalls and virus scanners need this because they can be intrusive, but sb_tray is not intrusive - if your email client is configured to use it then it must be running for your email to work, and if your email client isn't configured to use it then it has no effect. After training through the web interface, the home page still says "Database has no training information ..." even though the stats say "Total emails trained: Spam: 3 Ham: 18". Only after changing and saving the configuration does it update to say "Database only has 18 good and 3 spam ..." Then after subsequent training it still says "18 good and 3 spam". Defaulting the "Maximum results" field in the Find pane to 1 seems wrong. It made sense when all you could do was search for a message ID (because they're unique) but if I'm searching for text, I'll want to see all the hits. The Find pane only looks in the unknown cache, so it won't find anything once you've trained. It ought to look in the ham and spam caches as well. 
I deliberately induced a false positive (by training on a thousand spams with no hams trained) then corrected it via the Review page, and the statistics now say "1 being false negatives" (plural: ack!) and "0 being false positives". That's the wrong way round. -- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Tue Dec 30 08:55:30 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Dec 30 08:55:41 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com> Message-ID: <0fa601c3cedc$97e6d400$2c00a8c0@eden> > Sorry Mark, Christmas has been pretty hectic. 8-) Woo hoo - me too - and I've a blinder planned for tomorrow night ;) Thanks for that! I'll reply in detail for each of these points - either a 'fixed' or the bug number. Happy new year! Mark. From skip at pobox.com Tue Dec 30 09:37:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 30 09:37:35 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com> References: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden> <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com> Message-ID: <16369.36262.575261.379159@montanaro.dyndns.org> Mark, Thanks for the new installer. I tried it out on my little-used Win2k machine. While it seemed to install fine, the tray icon does nothing but briefly change the pointer to an hourglass. I double-clicked the sb_server.exe icon and it popped up a window then immediately went away. Following the suggestion in the readme_proxy.html file I tried right-mousing the tray menu item but saw nothing but the usual Windows fluff ("Save/Restore Desktop Icons", "Open", ..., "Properties"). Skip From nobody at spamcop.net Tue Dec 30 11:32:18 2003 From: nobody at spamcop.net (Seth Goodman) Date: Tue Dec 30 11:36:48 2003 Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031229220014.2B7C02DE88@cashew.wolfskeep.com> Message-ID: > [T. Alexander Popiel] > One thing that's occurred to me is to have the training cutoffs at > N sigma from mean (where N == .5?) for the two populations; how you'd > bootstrap that is an open question, of course. Great idea. The first pass could just be set to two constant thresholds, then start computing the mean, SD and new thresholds. This should converge fairly quickly. Another idea is to use the two means, but decide how many SD's to go for each one based on the incoming ham/spam ratio. This requires you to make an assumption about the distributions. Along the same lines, one more possibility is to construct a cumulative distribution function (CDF) of new mail received, then set the training thresholds such that you would train an equal number of ham/spam. This also lets you set the total number of messages trained, or at least to limit it to a maximum value. Since this is a batch (nightly?) process rather than continuous, the CDF calculation is a posteriori so both the ratio and number of new trained messages will be achieved exactly. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From nobody at spamcop.net Tue Dec 30 11:39:46 2003 From: nobody at spamcop.net (Seth Goodman) Date: Tue Dec 30 11:40:06 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden> Message-ID: I tried the straight Outlook add-in and so far, no bugs to report! It appeared to recognize my old databases just fine, but I retrained to take advantage of any things you smoothed out in the tokenizer. I like the new spam clues page with the internal ham and spam scores plus the number of significant tokens listed. Nice job! 
--
Seth Goodman

Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com
Spambots: disregard the above

From tim.one at comcast.net Tue Dec 30 14:17:56 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 30 14:18:07 2003
Subject: [spambayes-dev] A URL experiment
Message-ID:

Over on the spambayes list yesterday, we were discussing a particularly good identity-theft scam spam, purporting to be from PayPal. It linked extensively to PayPal's real site, and about the only fishy lexical thing was a highly obfuscated href (full of % escapes).

We don't do anything special with % escapes in URLs now. Maybe we should. The attached patch does.

I don't have enough personal email saved to make for a good test, but who cares. I just took what I had, slammed it randomly into 10 even sets, and did "the usual" cross-validation business on it. All of this email is less than a week old, is all the email I've gotten since then, is atypical for me (Christmas time -> a lot less email than usual, but a spike in personal email), and runs 3:1 in favor of ham. None of that matters, though -- *whatever* you have, and however you train, the interesting question is just how it does with the patch, compared to without it.

I ran my 10-fold CV with "the default" settings for Outlook. These match the current (CVS) project defaults, with the addition of

    [Tokenizer]
    replace_nonascii_chars: True
    record_header_absence: True

I'm *not* using mine_received_headers or x-use_bigrams in these tests.
befores -> afters -> tested 151 hams & 52 spams against 1359 hams & 468 spams [19 repetitions of that] false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.662 0.662 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.0662251655629 to 0.0662251655629 tied false negative percentages 1.923 1.923 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 1.923 1.923 tied 1.923 1.923 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fn went from 3 to 3 tied mean fn % went from 0.576923076924 to 0.576923076924 tied ham mean ham sdev 0.44 0.44 +0.00% 4.52 4.52 +0.00% 0.34 0.34 +0.00% 4.11 4.11 +0.00% 0.27 0.27 +0.00% 3.16 3.16 +0.00% 0.17 0.17 +0.00% 1.51 1.51 +0.00% 1.06 1.06 +0.00% 9.11 9.12 +0.11% 0.00 0.00 +(was 0) 0.01 0.01 +0.00% 0.78 0.78 +0.00% 8.16 8.16 +0.00% 0.42 0.43 +2.38% 5.19 5.21 +0.39% 0.01 0.01 +0.00% 0.11 0.11 +0.00% 0.07 0.07 +0.00% 0.90 0.90 +0.00% ham mean and sdev for all runs 0.36 0.36 +0.00% 4.77 4.78 +0.21% spam mean spam sdev 96.41 96.43 +0.02% 13.52 13.51 -0.07% 98.51 98.56 +0.05% 6.99 6.99 +0.00% 97.80 97.80 +0.00% 6.42 6.41 -0.16% 98.21 98.22 +0.01% 7.31 7.30 -0.14% 93.00 93.03 +0.03% 16.68 16.66 -0.12% 97.40 97.41 +0.01% 8.29 8.27 -0.24% 97.58 97.70 +0.12% 12.30 12.18 -0.98% 97.01 97.02 +0.01% 14.38 14.37 -0.07% 95.90 96.03 +0.14% 11.61 11.46 -1.29% 98.86 98.86 +0.00% 6.12 6.11 -0.16% spam mean and sdev for all runs 97.07 97.11 +0.04% 11.09 11.05 -0.36% ham/spam mean difference: 96.71 96.75 +0.04 Not much to talk about there! Pretty much indistinguishable, although the spam mean went up a tad consistently, and the spam sdev down a tad consistently. 
table.py's "best cost" output shows that I could have reduced the optimal cost by 1 unsure if I changed my cutoffs: filename: before after ham:spam: 1510:520 1510:520 fp total: 1 1 fp %: 0.07 0.07 fn total: 3 3 fn %: 0.58 0.58 unsure t: 39 39 unsure %: 1.92 1.92 real cost: $20.80 $20.80 best cost: $17.60 $17.40 h mean: 0.36 0.36 h sdev: 4.77 4.78 s mean: 97.07 97.11 s sdev: 11.09 11.05 mean diff: 96.71 96.75 k: 6.10 6.11 So the change would have been the tiniest of wins for me. For you? BTW, the fp here was an "end of year sale" blaring HTML ad from Gateway. That's ham to me, but there are no other msgs from Gateway in this email. It contains enough Gateway-specific lexicalisms that training on one is enough to score future ones as solid ham. The PayPal scam that started this remained a solid FN. -------------- next part -------------- Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.27 diff -c -u -r1.27 tokenizer.py --- tokenizer.py 30 Dec 2003 16:26:33 -0000 1.27 +++ tokenizer.py 30 Dec 2003 18:45:59 -0000 @@ -1011,9 +1011,25 @@ Stripper.__init__(self, url_re.search, re.compile("").search) def tokenize(self, m): + import urllib + proto, guts = m.groups() tokens = ["proto:" + proto] pushclue = tokens.append + + # %nn escapes are usually intentional obfuscation. Generate a lot + # of correlated tokens if the URL contains a lot of them. The + # classifier will learn which specific ones are and aren't spammy. + escapes = re.findall(r'%..', guts) + tokens.extend(["url:" + escape for escape in escapes]) + + try: + # Tokenize the unobfuscated URL. + guts = urllib.unquote(guts) + except: + pushclue("url:invalid escapes") + # And guts is unchanged; however, I don't think urllib.unquote() + # ever raises an exception now. # Lose the trailing punctuation for casual embedding, like: # The code is at http://mystuff.org/here? Didn't resolve. 
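The escape handling in the patch above can be tried on its own, outside the tokenizer. What follows is only a minimal sketch in present-day Python 3 (where `urllib.unquote` lives at `urllib.parse.unquote`); the function name `escape_tokens` is made up for illustration and is not part of tokenizer.py.

```python
import re
from urllib.parse import unquote


def escape_tokens(guts):
    """Return (tokens, unquoted_guts) for the part of a URL after 'proto://'.

    Hypothetical helper: it mirrors the idea in the patch, not its exact code.
    """
    # One token per %XX escape, so a heavily obfuscated URL produces many
    # correlated clues and the classifier can learn which ones are spammy.
    escapes = re.findall(r'%..', guts)
    tokens = ["url:" + escape for escape in escapes]
    # The de-obfuscated text is what the rest of the URL tokenizer would see.
    return tokens, unquote(guts)


tokens, plain = escape_tokens("www.p%61ypal.com/%7Euser")
print(tokens)
print(plain)
```

Emitting one token per escape, rather than a single count, leaves it to the classifier to decide that an escaped letter is suspicious while an escaped space is not.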
From skip at pobox.com Tue Dec 30 16:45:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 30 16:46:10 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: References: Message-ID: <16369.61971.794205.276663@montanaro.dyndns.org> Tim> Over on the spambayes list yesterday, we were discussing a Tim> particularly good identity-theft scam spam, purporting to be from Tim> PayPal. It linked extensively to PayPal's real site, and about the Tim> only fishy lexical thing was a highly obfuscated href (full of % Tim> escapes). Tim> We don't do anything special with % escapes in URLs now. Maybe we Tim> should. The attached patch does. I tried a somewhat different approach (patch is attached) and got similar results (all ties at the more gross level, slight increase in spam mean and slight decrease in spam sdev, no change to ham at all (*)): stds.txt -> pickurlss.txt -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams -> tested 250 hams & 300 spams against 1000 hams & 1200 spams false positive percentages 0.000 0.000 tied 0.400 0.400 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.08 to 0.08 tied false negative percentages 3.333 3.333 tied 5.000 5.000 tied 7.333 7.333 tied 5.667 5.667 tied 4.000 4.000 tied won 0 times tied 5 times lost 0 times total unique fn went from 76 to 76 tied mean fn % went from 5.06666666667 to 5.06666666667 tied ham mean ham sdev 1.64 
1.64 +0.00% 8.44 8.44 +0.00% 0.99 0.99 +0.00% 8.29 8.29 +0.00% 2.82 2.82 +0.00% 12.52 12.52 +0.00% 1.58 1.58 +0.00% 8.29 8.29 +0.00% 1.30 1.30 +0.00% 8.04 8.04 +0.00% ham mean and sdev for all runs 1.66 1.66 +0.00% 9.30 9.30 +0.00% spam mean spam sdev 93.80 93.82 +0.02% 19.39 19.35 -0.21% 90.56 90.58 +0.02% 24.31 24.26 -0.21% 89.24 89.27 +0.03% 27.03 27.04 +0.04% 89.27 89.27 +0.00% 25.51 25.50 -0.04% 92.72 92.74 +0.02% 21.67 21.67 +0.00% spam mean and sdev for all runs 91.12 91.14 +0.02% 23.81 23.80 -0.04% ham/spam mean difference: 89.46 89.48 +0.02 (*) Operational question: Given that my training data is somewhat small at the moment (roughly 1000-1500 each of ham and spam), would I be better off testing with fewer larger sets (e.g, 5 sets w/ 250 msgs each) or with more smaller sets (e.g, 10 sets w/ 125 msgs each)? Skip -------------- next part -------------- Index: spambayes/Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.97 diff -c -r1.97 Options.py *** spambayes/Options.py 30 Dec 2003 16:26:33 -0000 1.97 --- spambayes/Options.py 30 Dec 2003 21:42:48 -0000 *************** *** 145,150 **** --- 145,155 ---- """(DEPRECATED) Extract day of the week tokens from the Date: header.""", BOOLEAN, RESTORE), + ("x-pick_apart_urls", "Extract clues about url structure", False, + """(EXPERIMENTAL) Note whether url contains non-standard port or + user/password elements.""", + BOOLEAN, RESTORE), + ("replace_nonascii_chars", "Replace non-ascii characters", False, """If true, replace high-bit characters (ord(c) >= 128) and control characters with question marks. 
This allows non-ASCII character Index: spambayes/tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.27 diff -c -r1.27 tokenizer.py *** spambayes/tokenizer.py 30 Dec 2003 16:26:33 -0000 1.27 --- spambayes/tokenizer.py 30 Dec 2003 21:42:48 -0000 *************** *** 13,18 **** --- 13,20 ---- import time import os import binascii + import urlparse + import urllib try: from sets import Set except ImportError: *************** *** 1014,1019 **** --- 1016,1038 ---- proto, guts = m.groups() tokens = ["proto:" + proto] pushclue = tokens.append + + if options["Tokenizer", "x-pick_apart_urls"]: + url = proto + "://" + guts + num_pcs = url.count("%") + if num_pcs: + pushclue("url:%d %%s" % num_pcs) + url = urllib.unquote(url) + scheme, netloc, path, params, query, frag = urlparse.urlparse(url) + user_pwd, host_port = urllib.splituser(netloc) + if user_pwd is not None: + pushclue("url:has user") + host, port = urllib.splitport(host_port) + if port is not None: + if scheme == "http" and port != '80': + pushclue("url:non-standard http port") + elif scheme == "https" and port != '443': + pushclue("url:non-standard https port") # Lose the trailing punctuation for casual embedding, like: # The code is at http://mystuff.org/here? Didn't resolve. From tameyer at ihug.co.nz Tue Dec 30 17:51:29 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 30 17:51:37 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304985F95@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz> [I'll leave the install stuff for Mark, but I can sort out the rest of these]. > The ini file for the proxy appeared in "C:\Documents and > Settings\rjh\Application Data\SpamBayes\Proxy" as you'd > expect, but the database and cache directories appeared in > "C:\Program Files\SpamBayes\bin". 
Did the ini file have the appropriate [Storage] lines in it? It's meant to add them in there, storing the directories in that directory, too. You didn't already have an ini file in there, did you? (It only adds those lines if it's a new file, so that it doesn't overwrite someone's settings). > I'd question whether > we need the Stop/Start command - why would I want the tray > icon to stay there but the application to not run? I was thinking this just yesterday. I'm not sure what the original reasoning behind having it was (and it may have been me that put it there ;). +1 to getting rid of it, unless someone does know the reasoning. We can dump the 'stopped' icon, then, too. (I'd like to see a '!' icon, though, which appeared when there were important status messages to review). > After training through the web interface, the home page still > says "Database has no training information ..." even though > the stats say "Total emails trained: Spam: 3 Ham: 18". Good spotting. I've checked in a fix for this. > Defaulting the "Maximum results" field in the Find pane to 1 > seems wrong. It made sense when all you could do was search > for a message ID (because they're unique) but if I'm > searching for text, I'll want to see all the hits. Fair enough. Line 435 of ui.html; change it to whatever you like most :) > The Find pane only looks in the unknown cache, so it won't > find anything once you've trained. It ought to look in the > ham and spam caches as well. Are you positive? The code has it looking in all three, and a quick test here had it finding messages in more than one. > I deliberately induced a false positive (by training on a > thousand spams with no hams trained) then corrected it via > the Review page, and the statistics now say "1 being false > negatives" (plural: ack!) and "0 being false positives". > That's the wrong way round. Oops, my bad. I've checked in a fix for this. I think I've fixed all the plurals, too.
If you've still got that false positive statistic around, could you give it a run from cvs? =Tony Meyer From tameyer at ihug.co.nz Tue Dec 30 20:32:02 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 30 20:32:09 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304985F9F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777BD@its-xchg4.massey.ac.nz> [Skip] > Thanks for the new installer. I tried it out on my > little-used Win2k machine. While it seemed to install fine, > the tray icon does nothing but briefly change the pointer to > an hourglass. I double-clicked the sb_server.exe icon and it > popped up a window then immediately went away. In your temp directory (C:\Documents and Settings\[username]\Local Settings\Temp in Win2k, I think) there should be some SpamBayesServerN.log files (where N is a number). Could you grab any that are there and mail them to me/the list? (I haven't seen this myself, so it would be good to figure out what it is). =Tony Meyer From tameyer at ihug.co.nz Tue Dec 30 20:36:44 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 30 20:36:48 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304985F55@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1D@its-xchg4.massey.ac.nz> [Mark] > Otherwise, did anyone else try this build? Either for > Outlook or sb_server? I fear I may have "disclaimed" the > build a little too much, as this is the only reply I got, and > I see no new bugs etc. I briefly tried it, and all seemed ok to me (Outlook XP and the others on WinXP). I've also been trying various experimental builds of my own (both on the XP box and on Win98 at home). I can't see how they would be different to your build since it's using the same process. (Latest CVS each time). 
Sorry I didn't report back earlier - didn't your message say that you were away for a couple of weeks and wouldn't be looking at anything until then? I got a 'not urgent' impression from it :) No bugs to report - the install has worked fine for me, and I've fixed anything wrong I've found with the source itself. I've made some improvements to the documentation, too - is it sufficient for a 8.5 release, do you think? =Tony Meyer From tim.one at comcast.net Tue Dec 30 20:59:44 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 30 20:59:46 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: <16369.61971.794205.276663@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I tried a somewhat different approach (patch is attached) and got > similar results (all ties at the more gross level, slight increase in > spam mean and slight decrease in spam sdev, no change to ham at all > (*)): 3-way compare on my data: filename: before after skip ham:spam: 1510:520 1510:520 1510:520 fp total: 1 1 1 fp %: 0.07 0.07 0.07 fn total: 3 3 3 fn %: 0.58 0.58 0.58 unsure t: 39 39 39 unsure %: 1.92 1.92 1.92 real cost: $20.80 $20.80 $20.80 best cost: $17.60 $17.40 $17.80 h mean: 0.36 0.36 0.36 h sdev: 4.77 4.78 4.77 s mean: 97.07 97.11 97.08 s sdev: 11.09 11.05 11.03 mean diff: 96.71 96.75 96.72 k: 6.10 6.11 6.12 The "best cost" measure actually got marginally worse, but not significantly so. Note that this part of the patch can't be helping much: + num_pcs = url.count("%") + if num_pcs: + pushclue("url:%d %%s" % num_pcs) That is, raw counts are almost never useful -- if I have a URL in a spam that embeds 40 escapes, that does nothing to indict a URL with 39 (or 41) escapes. Pumping out log2(a_count) usually does more good. 
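The log2 suggestion above could look something like the following. This is a sketch only, in present-day Python 3; the function name and the token spelling are invented for illustration and are not what either patch emits.

```python
def escape_count_token(url):
    """Return a bucketed '%'-count token, or None if the URL has no escapes.

    Hypothetical helper: buckets the count by floor(log2(count)) so that
    nearby counts (39, 40, 41) share one token instead of three.
    """
    count = url.count("%")
    if count == 0:
        return None
    # For a positive integer, bit_length() - 1 equals floor(log2(count)),
    # computed exactly without floating point.
    bucket = count.bit_length() - 1
    return "url:escapes 2**%d" % bucket


for n in (1, 39, 40, 41):
    print(n, escape_count_token("x" + "%20" * n))
```

With raw counts, a spam containing 40 escapes teaches the classifier nothing about a URL with 39; with the bucketed token both land in the same bucket.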
I *expect* the approach in my patch would work better, though (generating lots of correlated tokens -- there are good reasons to escape some punctuation characters in URLs, but the only good reason to escape a letter or digit is to obfuscate; let the classifier see these things, and it will learn that on its own, as appropriate, for each escape code; then a URL escaping several letters or digits will get penalized more the more heavily it employs this kind of obfuscation). > (*) Operational question: Given that my training data is somewhat > small at the moment (roughly 1000-1500 each of ham and spam), would I > be better off testing with fewer larger sets (e.g, 5 sets w/ 250 msgs > each) or with more smaller sets (e.g, 10 sets w/ 125 msgs each)? If you ask me , cross-validation should *always* be done with a minimum of 10 sets, regardless of how much data you have. There are many reasons for this, from statistical reliability of the grand averages at the end (they're subject to central-limit theorem constraints, and the more sets the more reliable they are, growing with the square root of the # of sets); to that it's extremely important to see run-by-run comparisons (how many runs won, lost, tied), and just about any distribution of those numbers is achievable by chance with few sets (IOW, "9 won, 1 tied, 0 lost" is very much harder to account for by chance than "4 won, 1 tied, 0 lost"; likewise "1 won, 8 tied, 1 lost" is much less likely to be produced by a significant (good or bad) change than "1 won, 3 tied, 1 lost"). Note, though, that cross-validation is modeling the performance of a train-on-everything strategy, and in random time order to boot. If that's not how you train, the results may be irrelevant to what you'll see in real life. It should be good enough to weed out really bad ideas-- and highlight really good ones --regardless, though. 
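To put rough numbers on the won/tied/lost argument above, here is a small sign-test sketch in Python 3. It rests on two simplifying assumptions that are not in the message: ties are dropped, and each remaining run is treated as an independent fair coin flip when the change makes no real difference. The function name is made up for illustration.

```python
from math import comb


def chance_of_at_least(won, lost):
    """One-sided sign test: probability of at least `won` wins out of
    won + lost non-tied runs if each run were a fair coin flip."""
    n = won + lost
    return sum(comb(n, k) for k in range(won, n + 1)) / 2 ** n


print(chance_of_at_least(9, 0))   # "9 won, 1 tied, 0 lost"
print(chance_of_at_least(4, 0))   # "4 won, 1 tied, 0 lost"
```

Under those assumptions, nine straight wins has a chance of 1/512, while four straight wins has a chance of 1/16, which is why a 5-set run says so much less than a 10-set run.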
From tameyer at ihug.co.nz Tue Dec 30 22:17:49 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 30 22:18:02 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130498603B@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz> My results (this is with a chunk of my most recent mail, with timcv.py -n10). Tim's patch: bases.txt -> nntims.txt -> tested 357 hams & 395 spams against 3311 hams & 3704 spams -> tested 397 hams & 384 spams against 3271 hams & 3715 spams -> tested 385 hams & 433 spams against 3283 hams & 3666 spams -> tested 407 hams & 397 spams against 3261 hams & 3702 spams -> tested 350 hams & 412 spams against 3318 hams & 3687 spams -> tested 338 hams & 405 spams against 3330 hams & 3694 spams -> tested 359 hams & 416 spams against 3309 hams & 3683 spams -> tested 358 hams & 405 spams against 3310 hams & 3694 spams -> tested 348 hams & 411 spams against 3320 hams & 3688 spams -> tested 369 hams & 441 spams against 3299 hams & 3658 spams -> tested 357 hams & 395 spams against 3311 hams & 3704 spams -> tested 397 hams & 384 spams against 3271 hams & 3715 spams -> tested 385 hams & 433 spams against 3283 hams & 3666 spams -> tested 407 hams & 397 spams against 3261 hams & 3702 spams -> tested 350 hams & 412 spams against 3318 hams & 3687 spams -> tested 338 hams & 405 spams against 3330 hams & 3694 spams -> tested 359 hams & 416 spams against 3309 hams & 3683 spams -> tested 358 hams & 405 spams against 3310 hams & 3694 spams -> tested 348 hams & 411 spams against 3320 hams & 3688 spams -> tested 369 hams & 441 spams against 3299 hams & 3658 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.246 0.246 tied 0.000 0.000 tied 0.000 0.000 tied 0.557 0.557 tied 0.559 0.559 tied 0.287 0.287 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 6 to 6 tied mean fp % went from 0.164881884948 to
0.164881884948 tied false negative percentages 0.253 0.253 tied 0.781 0.781 tied 0.462 0.462 tied 0.756 0.756 tied 0.243 0.243 tied 0.247 0.247 tied 0.240 0.240 tied 0.494 0.494 tied 0.973 0.973 tied 0.454 0.454 tied won 0 times tied 10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.490257037938 to 0.490257037938 tied ham mean ham sdev 1.18 1.17 -0.85% 7.76 7.67 -1.16% 0.99 0.99 +0.00% 6.64 6.64 +0.00% 0.84 0.85 +1.19% 6.14 6.14 +0.00% 1.99 2.10 +5.53% 9.46 9.73 +2.85% 0.49 0.49 +0.00% 3.59 3.57 -0.56% 0.85 0.87 +2.35% 5.45 5.46 +0.18% 1.16 1.16 +0.00% 9.30 9.29 -0.11% 1.20 1.30 +8.33% 8.13 8.66 +6.52% 1.55 1.55 +0.00% 8.05 8.05 +0.00% 0.47 0.47 +0.00% 3.22 3.15 -2.17% ham mean and sdev for all runs 1.08 1.10 +1.85% 7.13 7.21 +1.12% spam mean spam sdev 98.75 98.75 +0.00% 8.72 8.72 +0.00% 97.67 97.69 +0.02% 11.26 11.24 -0.18% 98.08 98.14 +0.06% 10.12 9.97 -1.48% 98.16 98.16 +0.00% 10.19 10.20 +0.10% 98.35 98.41 +0.06% 8.77 8.69 -0.91% 98.45 98.47 +0.02% 8.97 8.86 -1.23% 98.35 98.41 +0.06% 9.73 9.69 -0.41% 98.25 98.36 +0.11% 9.16 8.96 -2.18% 97.93 97.97 +0.04% 11.99 11.98 -0.08% 98.92 98.93 +0.01% 7.62 7.62 +0.00% spam mean and sdev for all runs 98.30 98.34 +0.04% 9.72 9.66 -0.62% ham/spam mean difference: 97.22 97.24 +0.02 Skip's patch: bases.txt -> pickskips.txt -> tested 357 hams & 395 spams against 3311 hams & 3704 spams -> tested 397 hams & 384 spams against 3271 hams & 3715 spams -> tested 385 hams & 433 spams against 3283 hams & 3666 spams -> tested 407 hams & 397 spams against 3261 hams & 3702 spams -> tested 350 hams & 412 spams against 3318 hams & 3687 spams -> tested 338 hams & 405 spams against 3330 hams & 3694 spams -> tested 359 hams & 416 spams against 3309 hams & 3683 spams -> tested 358 hams & 405 spams against 3310 hams & 3694 spams -> tested 348 hams & 411 spams against 3320 hams & 3688 spams -> tested 369 hams & 441 spams against 3299 hams & 3658 spams -> tested 357 hams & 395 spams against 3311 hams & 3704 spams 
-> tested 397 hams & 384 spams against 3271 hams & 3715 spams -> tested 385 hams & 433 spams against 3283 hams & 3666 spams -> tested 407 hams & 397 spams against 3261 hams & 3702 spams -> tested 350 hams & 412 spams against 3318 hams & 3687 spams -> tested 338 hams & 405 spams against 3330 hams & 3694 spams -> tested 359 hams & 416 spams against 3309 hams & 3683 spams -> tested 358 hams & 405 spams against 3310 hams & 3694 spams -> tested 348 hams & 411 spams against 3320 hams & 3688 spams -> tested 369 hams & 441 spams against 3299 hams & 3658 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.246 0.246 tied 0.000 0.000 tied 0.000 0.000 tied 0.557 0.557 tied 0.559 0.559 tied 0.287 0.287 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 6 to 6 tied mean fp % went from 0.164881884948 to 0.164881884948 tied false negative percentages 0.253 0.253 tied 0.781 0.781 tied 0.462 0.462 tied 0.756 0.756 tied 0.243 0.243 tied 0.247 0.247 tied 0.240 0.240 tied 0.494 0.494 tied 0.973 0.973 tied 0.454 0.454 tied won 0 times tied 10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.490257037938 to 0.490257037938 tied ham mean ham sdev 1.18 1.18 +0.00% 7.76 7.76 +0.00% 0.99 0.99 +0.00% 6.64 6.64 +0.00% 0.84 0.84 +0.00% 6.14 6.14 +0.00% 1.99 1.99 +0.00% 9.46 9.46 +0.00% 0.49 0.50 +2.04% 3.59 3.60 +0.28% 0.85 0.87 +2.35% 5.45 5.55 +1.83% 1.16 1.16 +0.00% 9.30 9.30 +0.00% 1.20 1.21 +0.83% 8.13 8.14 +0.12% 1.55 1.55 +0.00% 8.05 8.06 +0.12% 0.47 0.47 +0.00% 3.22 3.22 +0.00% ham mean and sdev for all runs 1.08 1.08 +0.00% 7.13 7.14 +0.14% spam mean spam sdev 98.75 98.78 +0.03% 8.72 8.56 -1.83% 97.67 97.70 +0.03% 11.26 11.25 -0.09% 98.08 98.08 +0.00% 10.12 10.12 +0.00% 98.16 98.17 +0.01% 10.19 10.15 -0.39% 98.35 98.38 +0.03% 8.77 8.73 -0.46% 98.45 98.46 +0.01% 8.97 8.97 +0.00% 98.35 98.38 +0.03% 9.73 9.68 -0.51% 98.25 98.29 +0.04% 9.16 9.05 -1.20% 97.93 97.95 +0.02% 11.99 11.98 -0.08% 
98.92 98.93 +0.01% 7.62 7.62 +0.00% spam mean and sdev for all runs 98.30 98.32 +0.02% 9.72 9.68 -0.41% ham/spam mean difference: 97.22 97.24 +0.02 3-way compare: filename: bases nntims pickskips ham:spam: 3668:4099 3668:4099 3668:4099 fp total: 6 6 6 fp %: 0.16 0.16 0.16 fn total: 20 20 20 fn %: 0.49 0.49 0.49 unsure t: 178 173 175 unsure %: 2.29 2.23 2.25 real cost: $115.60 $114.60 $115.00 best cost: $93.00 $91.20 $92.40 h mean: 1.08 1.10 1.08 h sdev: 7.13 7.21 7.14 s mean: 98.30 98.34 98.32 s sdev: 9.72 9.66 9.68 mean diff: 97.22 97.24 97.24 k: 5.77 5.76 5.78 Rather like Tim's results, really, at least to my ignorant eyes. =Tony Meyer From tameyer at ihug.co.nz Tue Dec 30 22:32:56 2003 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Dec 30 22:33:01 2003 Subject: [spambayes-dev] pop3proxy_tray error In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F6E9@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777C0@its-xchg4.massey.ac.nz> [Kenny on the 5th of December] > I tried to stop SpamBayes from the right-click menu, and then > start it again. Here's the output I got when it tried to > restart SpamBayes. [...] > serverStrings = ["%s:%s" % (s, p) for s, p in self.servers] > TypeError: iteration over non-sequence I finally got around to finding this and fixing it :) I think it's also the problem that Richie found. Fixed in sb_server 1.16, I hope. =Tony Meyer From tim.one at comcast.net Tue Dec 30 22:46:36 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 30 22:46:40 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer, tries the patches] Thanks, Tony! > ... > Rather like Tim's results, really, at least to my ignorant eyes. The results are both as weakly positive as things get, but at least neither patch is doing any harm. 
As before, I'd rather see Skip try to deal with % escapes the way my patch did -- that's a common obfuscation trick, and I bet it accounts for the small reduction in Unsures you saw. My patch should do a lot more to penalize that trick than Skip's. Both patches tokenize the de-obfuscated URL, so they're a wash in that respect. Skip's patch also exposes higher-level concepts to the classifier, like "non-standard port number". I don't see that often, but when I do it's usually in email from my work account (e.g., trying to get me to preview a pre-release site change, accessed via a non-standard port so it doesn't interfere with the production site). That's OK, though: *my* classifier will learn that's a ham clue in my email mix -- so it goes. Since everyone is getting some good out of Skip's changes (and I don't think his treatment of % escapes is making a difference), and also getting some good out of mine (which don't try to do anything except get some good out of % escapes), combining the two will do better than either, or cancel each other out <0.5 wink>. From mhammond at skippinet.com.au Tue Dec 30 23:31:45 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Dec 30 23:31:58 2003 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz> Message-ID: <10fe01c3cf57$02248cc0$2c00a8c0@eden> > > The ini file for the proxy appeared in "C:\Documents and > > Settings\rjh\Application Data\SpamBayes\Proxy" as you'd > > expect, but the database and cache directories appeared in > > "C:\Program Files\SpamBayes\bin". > > Did the ini file have the appropriate [Storage] lines in it? > It's meant to add them in there, storing the directories in > that directory, too. > You didn't already have an ini file in there, did you? (It only > adds those > lines if it's a new file, so that it doesn't overwrite > someone's settings).
I really don't like the code in Options.py that handles the default values for these storage items. I'm not sure it is to blame, but it did cause me to see a new .db file created in the cwd, rather than the data directory - as my INI file already existed, it didn't get the default FQN for the new option.

IMO, the ini files should generally store relative path names, relative to the directory of the config file being used. This means we never allow the cwd to determine anything other than the location of the main config file, as all paths resolve via the directory of this file. A single Options.resolve_path() should be able to do this for us.

Code speaks louder than words - I'm suggesting:

Options.py, line 1156, the code starting:

    # If the file doesn't exist, then let's get the user to
    # store their databases and caches here as well, by
    # default, and save the file.
    db_name = os.path.join(windowsUserDirectory,
                           "statistics_database.db")

And all similar setting of the options to FQNs die. The default remains "statistics_database.db". All code that uses this option ('persistent_storage_file') does so via a new function:

    def get_pathname_option(section, value):
        filename = options.get(section, value)
        # Absolute paths are used as given; relative ones resolve
        # against the directory of the config file, never the cwd.
        if os.path.isabs(filename):
            return filename # maybe expanduser() to *nix?
        return os.path.join(os.path.dirname(optionsPathname), # existing global
                            filename)

Or-something-like-that ly,

Mark.

From mhammond at skippinet.com.au Tue Dec 30 23:45:26 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 30 23:45:39 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1D@its-xchg4.massey.ac.nz>
Message-ID: <110101c3cf58$e95d3fa0$2c00a8c0@eden>

> I briefly tried it, and all seemed ok to me (Outlook XP and
> the others on
> WinXP). I've also been trying various experimental builds of
> my own (both
> on the XP box and on Win98 at home).
I can't see how they would be > different to your build since it's using the same process. > (Latest CVS each > time). I've a few win32all changes yet to release in binary - but from memory most are pretty trivial. I'll do a win32all at the same time (next year :) > Sorry I didn't report back earlier - didn't your message say > that you were > away for a couple of weeks and wouldn't be looking at > anything until then? It probably did, but I meant a few days :) I always had to get back in time for this huge bender of a party we have planned! About to take off now. > I got a 'not urgent' impression from it :) It certainly wasn't urgent, and I'm glad it worked so well. I was too pessimistic to believe "no news is good news", but it seems to have been the case! > is it sufficient for > a 8.5 release, > do you think? I think an 0.85 would be perfect, in both source and binary. We then go to 0.9, and we could still end up with 1.0 by March! Happy new year! Mark. From skip at pobox.com Wed Dec 31 00:02:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 31 00:02:29 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: References: <16369.61971.794205.276663@montanaro.dyndns.org> Message-ID: <16370.22611.419923.477159@montanaro.dyndns.org> Tim> Note that this part of the patch can't be helping much: Tim> + num_pcs = url.count("%") Tim> + if num_pcs: Tim> + pushclue("url:%d %%s" % num_pcs) Tim> That is, raw counts are almost never useful -- if I have a URL in a Tim> spam that embeds 40 escapes, that does nothing to indict a URL with Tim> 39 (or 41) escapes. Pumping out log2(a_count) usually does more Tim> good. I realized that before trying, but not having any raw data upon which to base things, I left it as-is. If I enable it I'll look at some results to see what tokens are actually generated and how they seem to correlate with ham and spam.
One other possibility would be a sort of "Watership Down" approach: "1, 2, 3, many" (or something similar - rabbits can't count very high). The problem with log2(count) in this situation is there seems to be a practical limit to how many % signs a URL might have (maybe 50?), so something that creates buckets using division (counts // 5 ???) might do a decent job of lumping things together. I'm off work the next couple of days and have some house guests in from out of town, so I probably won't look at this much. I will try to at least build a database from my current training set using this feature and see how things shake out. (Maybe tomorrow morning before everyone's up and about.) Tim> I *expect* the approach in my patch would work better, though Tim> (generating lots of correlated tokens -- there are good reasons to Tim> escape some punctuation characters in URLs, but the only good Tim> reason to escape a letter or digit is to obfuscate; let the Tim> classifier see these things, and it will learn that on its own, as Tim> appropriate, for each escape code; then a URL escaping several Tim> letters or digits will get penalized more the more heavily it Tim> employs this kind of obfuscation). My problem with that approach is the stuff the spammers escape can be essentially random, as in the bogus URL you received. I think you might get scads of hapaxes (or at least low-count escapes). Stuff with high-counts will be legitimate (%20 and so forth). Conclusions obviously await some eyeballing of databases. >> (*) Operational question: Given that my training data is somewhat >> small at the moment (roughly 1000-1500 each of ham and spam), would I >> be better off testing with fewer larger sets (e.g, 5 sets w/ 250 msgs >> each) or with more smaller sets (e.g, 10 sets w/ 125 msgs each)? Tim> If you ask me , cross-validation should *always* be done with Tim> a minimum of 10 sets, regardless of how much data you have. 
There Tim> are many reasons for this, from statistical reliability of the Tim> grand averages at the end (they're subject to central-limit theorem Tim> constraints, and the more sets the more reliable they are, growing Tim> with the square root of the # of sets); Thanks, I will rebalance my training database to 10 sets and see how that goes. Tim> Note, though, that cross-validation is modeling the performance of Tim> a train-on-everything strategy, and in random time order to boot. The random time order isn't so important to me at the moment, because all the messages I'm using are recent (received within the past month or so). The "train on everything" aspect is more interesting. I find the cross-validation tests never perform as well as in real life. ;-) Tim> If that's not how you train, the results may be irrelevant to what Tim> you'll see in real life. It should be good enough to weed out Tim> really bad ideas-- and highlight really good ones --regardless, Tim> though. There's the rub. What might be really good ideas at this point will probably only result in very small changes in performance because the baseline system is currently so good. Skip From skip at pobox.com Wed Dec 31 00:04:33 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 31 00:04:48 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130498603B@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz> Message-ID: <16370.22753.639278.165195@montanaro.dyndns.org> Tony> 3-way compare: Tony> filename: bases nntims pickskips Tony> ham:spam: 3668:4099 3668:4099 Tony> 3668:4099 Tony> fp total: 6 6 6 Tony> fp %: 0.16 0.16 0.16 Tony> fn total: 20 20 20 ... What do you use to generate these three-way comparisons? 
Skip From richie at entrian.com Wed Dec 31 08:54:25 2003 From: richie at entrian.com (Richie Hindle) Date: Wed Dec 31 08:54:34 2003 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/scripts sb_server.py, 1.15, 1.16 In-Reply-To: References: Message-ID: [Tony] > Modified Files: > sb_server.py > Log Message: > When we stopped sb_server and then restarted, we didn't init the state, so it > wouldn't work. Fix that. > > [...] > > def prepare(): > + state.init() > state.prepare() This edit keeps appearing and disappearing. Mark removed that line in order to fix the fact that none of sb_server's command line arguments worked. Tony has now put it back in order to fix a restart problem, which has once again broken all the command line arguments. The docstring for State.__init__() describes how the code was originally intended to work: """Initialises the State object that holds the state of the app. The default settings are read from Options.py and bayescustomize.ini and are then overridden by the command-line processing code in the __main__ code below.""" The __main__ code is now in a function called run(), and the code to read the options is now in State.init(). Calling State.init() a second time, as prepare() now does, overwrites the command line options set up by run(). It does seem weird that there's a State.prepare() and a global prepare(), but the global prepare() calls both State.init() and State.prepare(). I don't know much about this code (if it was checked in under my name, it must have been my evil twin that wrote it 8-) Does anyone have a clear idea of what each of prepare(), State.__init__(), State.init(), and State.prepare() are all intended to do? I think the command line option code needs to be inserted somewhere into one of them, but I'm not 100% sure what each of them is for. PS. (next eight hours): Mark: You are too drunk to reply. PS. (after eight hours): Mark: How's the head? 
8-)

--
Richie Hindle
richie@entrian.com

From richie at entrian.com Wed Dec 31 09:38:40 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 09:38:49 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <10fe01c3cf57$02248cc0$2c00a8c0@eden>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz> <10fe01c3cf57$02248cc0$2c00a8c0@eden>
Message-ID:

[Mark]
> IMO, the ini files should generally store relative path names, being
> relative to the directory of the config file being used.

+1, definitely.

--
Richie Hindle
richie@entrian.com

From richie at entrian.com Wed Dec 31 09:59:15 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 09:59:22 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304985F95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>
Message-ID:

[Tony]
> Did the ini file have the appropriate [Storage] lines in it? It's meant to
> add them in there, storing the directories in that directory, too. You
> didn't already have an ini file in there, did you? (It only adds those
> lines if it's a new file, so that it doesn't overwrite someone's settings).

The environment's at work, so I don't know. I can find out on Friday.

> Fair enough. Line 435 of ui.html; change it to whatever you like most :)

Done. 20.

> The code has it looking in all three, and a quick test
> here had it finding messages in more than one.

You're quite right. I've no idea what happened last time - I'll double-check on Friday.

> If you've still got that false positive statistic around,
> could you give it a run from cvs?

Yes, that's now working. Thanks for that, and the other fixes.
--
Richie Hindle
richie@entrian.com

From skip at pobox.com Wed Dec 31 10:53:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 31 10:53:32 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To:
References: <16369.61971.794205.276663@montanaro.dyndns.org>
Message-ID: <16370.61687.75647.724533@montanaro.dyndns.org>

    Tim> Note that this part of the patch can't be helping much:

    Tim> + num_pcs = url.count("%")
    Tim> + if num_pcs:
    Tim> +     pushclue("url:%d %%s" % num_pcs)

    Tim> That is, raw counts are almost never useful -- if I have a URL in a
    Tim> spam that embeds 40 escapes, that does nothing to indict a URL with
    Tim> 39 (or 41) escapes. Pumping out log2(a_count) usually does more
    Tim> good.

Okay, here are the raw number of URL percents as present in my current ham/spam database:

    npcs   nspam   nham
       1      21     46
       2       4      1
       3       2      2
       4       1      2
       5       0      1
       6       2      2
       7       1      1
       8       0      2
      14       2      0
      15       0      1
      16       1      0
      18       1      0
      23       1      0
      24       1      0
      28       1      0
      30       1      0
      38       2      0
      40       1      0
      42       1      0
      74       1      0
      75       1      0
      84       1      0
      97       1      0
     103       1      0
     109       1      0
     191       1      0

I redid my patch to generate tokens like so:

    pushclue("url:%%%d" % int(log2(num_pcs)))

Converting the first column to int(log(n,2)) then rebuilding the database gives:

    log(npcs)   nspam   nham
        0          21     46
        1           6      3
        2           4      2
        3           2      2
        4           5      0
        5           3      0
        6           2      0
        7           1      0

The new cv test results are essentially the same (I still have just five sets):

    stds.txt -> pickurlss.txt
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams
    -> tested 250 hams & 300 spams against 1000 hams & 1200 spams

    false positive percentages
        0.000  0.000  tied
        0.400  0.400  tied
        0.000  0.000  tied
        0.000  0.000  tied
        0.000  0.000  tied

    won   0 times
    tied  5 times
    lost  0 times

    total unique fp went from 1 to 1 tied
    mean fp % went from 0.08 to 0.08 tied

    false negative percentages
        3.333  3.333  tied
        5.000  5.000  tied
        7.333  7.333  tied
        5.667  5.667  tied
        4.000  4.000  tied

    won   0 times
    tied  5 times
    lost  0 times

    total unique fn went from 76 to 76 tied
    mean fn % went from 5.06666666667 to 5.06666666667 tied

    ham mean                     ham sdev
       1.64    1.64   +0.00%        8.44    8.45   +0.12%
       0.99    0.99   +0.00%        8.29    8.29   +0.00%
       2.82    2.82   +0.00%       12.52   12.52   +0.00%
       1.58    1.58   +0.00%        8.29    8.29   +0.00%
       1.30    1.30   +0.00%        8.04    8.04   +0.00%

    ham mean and sdev for all runs
       1.66    1.66   +0.00%        9.30    9.30   +0.00%

    spam mean                    spam sdev
      93.80   93.83   +0.03%       19.39   19.31   -0.41%
      90.56   90.59   +0.03%       24.31   24.26   -0.21%
      89.24   89.28   +0.04%       27.03   27.04   +0.04%
      89.27   89.27   +0.00%       25.51   25.50   -0.04%
      92.72   92.74   +0.02%       21.67   21.67   +0.00%

    spam mean and sdev for all runs
      91.12   91.14   +0.02%       23.81   23.79   -0.08%

    ham/spam mean difference: 89.46 89.48 +0.02

Skip

From popiel at wolfskeep.com Wed Dec 31 14:06:34 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Dec 31 14:06:39 2003
Subject: [spambayes-dev] Semi-results for TOE, TOAE, and expiry
Message-ID: <20031231190634.B0F0D2DF7C@cashew.wolfskeep.com>

Yes, a few days ago I promised a further report on various things, including the effects of alternate start points in my dataset and expiry on the train_on_everything and train_on_almost_everything regimes. Unfortunately, all I have at the moment is some preliminary results and a big heap of frustration pointed at my computer.

1. There doesn't appear to be anything particularly magical about 120 days after start.
Rotating my data forward or backward 80 days shows that (a) there was a particular event/change in my data at about 120 days after I started collecting that affects the accuracy of further classifications, and (b) the general curve of getting better for a few months then decaying for the rest of time still holds even when the data is rotated... but the curve is not as distinct when not reinforced by (a).

2. Expiry (as I implemented it) appears to be a very bad thing for long-term TOAE. I implemented it to expire trained messages after 120 days, without completely rebuilding the classifier. This resulted in significantly degraded accuracy after about 250 days, though that may just be due to an ever increasing spam/ham imbalance. There was a sharp drop in the amount of spam training for about 30 days after the initial expiry date, and then a net spam training rate about equivalent to non-expiring TOAE until the "latest windows update" worm, after which spam training about doubled the non-expiry version. This seems to show that spam mutation has a strong effect on 4-month expiry for TOAE. On the other hand, net ham training was fairly consistently slightly negative after expiry commenced, showing that once it got a good idea of what ham was and threw out the oddballs that got trained on initially, it didn't need much to categorize ham. By the end of the mess (at 418 days) the spam:ham ratio was over 15:1, and the unsure rate was around 3% (compared to non-expiring with 4.5:1 and 1%).

3. Expiry for TOE seems neutral (compared to non-expiring TOE), to the best of my ability to eyeball the three runs that actually completed.

The graphs I have are at:

http://www.wolfskeep.com/~popiel/spambayes/plots/expire.html

My primary machine (cashew.wolfskeep.com) unfortunately doesn't seem to have the capability to maintain reliable service while running these tests anymore.
They're just too big, and it doesn't have the memory/CPU to do everything all at once (including running my web server, a mysql engine, my mail feed, etc.). Plus, it appears that Linux 2.4.18 doesn't take too kindly to multiple processes trying to access/manipulate a single directory with over 100,000 files in it; anything that touches that directory after things have started going wonky just hangs in disk-wait. I'm suspecting a deadlock in the filesystem layer on extended directory operations... probably due to not enough file cache (see my memory problems) to hold the entire structure at once. I haven't poked deep enough into the ext2 drivers to be sure, though. Anyway, I'm not going to be able to do all that much until I get this straightened out. I'll add graphs and stuff to the wiki as I have time, but that's likely to be all for a bit... - Alex From popiel at wolfskeep.com Wed Dec 31 14:15:45 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Dec 31 14:15:50 2003 Subject: [spambayes-dev] A URL experiment In-Reply-To: Message from Skip Montanaro of "Tue, 30 Dec 2003 23:04:33 CST." <16370.22753.639278.165195@montanaro.dyndns.org> References: <1ED4ECF91CDED24C8D012BCF2B034F130498603B@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz> <16370.22753.639278.165195@montanaro.dyndns.org> Message-ID: <20031231191545.E64892DF7C@cashew.wolfskeep.com> In message: <16370.22753.639278.165195@montanaro.dyndns.org> Skip Montanaro writes: > > Tony> 3-way compare: > > Tony> filename: bases nntims pickskips > Tony> ham:spam: 3668:4099 3668:4099 > Tony> 3668:4099 > Tony> fp total: 6 6 6 > Tony> fp %: 0.16 0.16 0.16 > Tony> fn total: 20 20 20 > ... > >What do you use to generate these three-way comparisons? That's table.py output. - Alex From popiel at wolfskeep.com Wed Dec 31 14:18:58 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel)
Date: Wed Dec 31 14:19:23 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: Message from Skip Montanaro of "Tue, 30 Dec 2003 23:02:11 CST." <16370.22611.419923.477159@montanaro.dyndns.org>
References: <16369.61971.794205.276663@montanaro.dyndns.org> <16370.22611.419923.477159@montanaro.dyndns.org>
Message-ID: <20031231191858.85C5B2DF7C@cashew.wolfskeep.com>

In message: <16370.22611.419923.477159@montanaro.dyndns.org>
Skip Montanaro writes:
>
>There's the rub. What might be really good ideas at this point will
>probably only result in very small changes in performance because the
>baseline system is currently so good.

Aye, I think that's what killed this sort of tokenizer/classifier testing a year ago...

- Alex

From tameyer at ihug.co.nz Wed Dec 31 16:45:55 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 31 16:46:02 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/scripts sb_server.py, 1.15, 1.16
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C191@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A20@its-xchg4.massey.ac.nz>

> This edit keeps appearing and disappearing. Mark removed
> that line in order to fix the fact that none of sb_server's
> command line arguments worked. Tony has now put it back in
> order to fix a restart problem, which has once again broken
> all the command line arguments.

Oops. Sorry, I must try and pay more attention...

> Does anyone have a clear idea of what each of prepare(),
> State.__init__(), State.init(), and State.prepare() are all
> intended to do? I think the command line option code needs
> to be inserted somewhere into one of them, but I'm not 100%
> sure what each of them is for.
I'm not sure about intent, but this is what they currently do:

__init__():
    calls init()
init():
    opens the log file
    sets up the list of servers/ports
    loads options from configuration file
    resets statistics
prepare():
    opens mutex
    (and in calling createWorkers())
    opens db
    opens the corpora
    creates the trainers

Should init() be only done once for every time that sb_server is run, and prepare() each time it is started/stopped? In that case it should be:

__init__():
    calls init()
init():
    opens the log file
    loads options from configuration file
    resets statistics
prepare():
    opens mutex
    sets up the list of servers/ports
    (and in calling createWorkers())
    opens db
    opens the corpora
    creates the trainers

This makes prepare() a kind of anti-close(). This is probably the fix I should have applied - moving setting up the list of servers/ports to prepare() instead of calling init() again. It needs to be done *sometime* after close(), though, and before the call to createWorkers(). Whether the log file and statistics should be reset on start/stop, I don't know, but I suspect not.

This would mean that the command-line code could be anywhere between __init__() and prepare().

+1 to docstrings for init() and prepare(), though :)

=Tony Meyer

From tameyer at ihug.co.nz Wed Dec 31 17:15:41 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 31 17:15:47 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C0C8@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A21@its-xchg4.massey.ac.nz>

> I really don't like the code in Options.py that handles the
> default values for these storage items. I'm not sure it is
> to blame, but it did cause me to see a new .db file created
> in the cwd, rather than the data directory - as my INI file
> already existed, it didn't get the default FQN for the new option.
The idea was that if you already had an ini file, then you had already set things up (you were an existing user), and so we wouldn't want to fiddle with your setup, whatever it was, because that might mean that we lose track of your databases. (This could even be with the default "hammie.db" in the cwd for persistent_storage_file, if the user always runs the script from the same directory). Part of the problem is that when consolidating the storage name options, I picked "hammie.db" over "~/.hammiedb", which was by far the better option. I didn't realise that it would expand quite nicely on Windows (well, 2k and XP; I presume earlier as well), and I presume on Macs as well. > IMO, the ini files should generally store relative path > names, being relative to the directory of the config file > being used. +1. I can go through and check this in if you like (once others have ripped into the idea ;). > Code speaks louder than words - I'm suggesting: [...] > The default remains "statistics_database.db". The proper default, of course, is still "hammie.db". When I put that code in to put things into a better place for Windows users I figured that since they wouldn't have an existing db, a more easily understandable name would be good, too, but I didn't think that I could change it for everyone. Do we continue setting this option to "statistics_database.db" in that place? (Without the FQN). The same code also gives new default names for the cache directories and messageinfo db. > All code that uses this option > ('persistent_storage_file') does so via a new function: Presumably also these: [URLRetriever] x-cache_directory [Storage] messageinfo_storage_file [Storage] spam_cache [Storage] ham_cache [Storage] unknown_cache What about these? 
[TestDriver] spam_directories
[TestDriver] ham_directories

> def get_pathname_option(section, value):
>     filename = options.get(section, value)
>     if not os.path.isabs(filename):
>         return filename

Shouldn't that be "if os.path.isabs(filename):"?

>     # maybe expanduser() to *nix?

I think this is necessary, yes, but before this (because os.path.isabs("~/hammie.db") is False). What about this?

def get_pathname_option(section, value):
    filename = os.path.expanduser(options.get(section, value))
    if os.path.isabs(filename):
        return filename
    return os.path.join(os.path.dirname(optionsPathname), # existing global
                        filename)

How do people feel about having this happen implicitly when one of these options is used, rather than explicitly? (I worry that we'll miss an occurrence of one of them, or that someone (maybe me!) will add new code and forget to use the get_pathname_option function). Something like this:

[Current code in OptionsClass.py]

    def get(self):
        '''Get option value.'''
        return self.value

[Proposed]

    def expand_path(self, value):
        filename = os.path.expanduser(value)
        if os.path.isabs(filename):
            return filename
        return os.path.join(os.path.dirname(optionsPathname), # existing global
                            filename)

    def get(self):
        '''Get option value. If the option is a path, then get relative
        to the configuration file.'''
        if self.allowed_values in [PATH, FILE_WITH_PATH,]: # maybe also VARIABLE_PATH?
            return self.expand_path(self.value)
        return self.value

=Tony Meyer

From richie at entrian.com Wed Dec 31 20:54:51 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 20:55:01 2003
Subject: [spambayes-dev] Strange performance dip and DBRunRecoveryError retreat
Message-ID:

As part of trying to reproduce the DBRunRecoveryError problems (a task that I'm giving up on for now - see below) I've written a script to hammer the core SpamBayes code, repeatedly training and classifying using faked-up messages.
It manages about 40 train-and-classify loops per second on my 2.4GHz P4, *except* between about 100 and 400 messages, when the performance drops to about a tenth of that and then recovers.

I've done enough investigation to know that the time is being spent in the core SpamBayes code and not my script, that it's only the occasional message that takes a long time (around a second in a few cases) and that it can be either training or classifying that slows down.

I've committed the script as testtools/hammer.py, and I offer this as a curiosity to anyone interested. I'm not going to pursue this myself because I've never seen a similar complaint about real-world SpamBayes use.

The script includes code to build fake emails that look similar to real-world ones, but which are all unique and include random elements. Maybe this will be useful to someone in the future. It works by taking a small collection of real emails and chopping pieces out of them at random, then stitching them back together.

I don't think the script is going to be a lot of use in tracking down DBRunRecoveryErrors - it *will* reproduce them as it is, but only by mimicking a bug that was fixed in 1.0a6, and people have still been complaining about DBRunRecoveryErrors in 1.0a6 and 1.0a7.

Having read up on full-mode bsddb, and bsddb-backed ZODB (including the phrases "The underlying Berkeley database technology requires maintenance, careful system resource planning, and tuning for performance." and "BerkeleyDB never deletes "old" log files. Eventually, if you do not maintain your Berkeley database by deleting "old" log files, you will run out of disk space") I've given up - for the moment at least - on trying to use full-mode bsddb (with or without ZODB). sb_server users should use a pickle and be done with it. Maybe we should change the default.

Maybe it's five to two and I should be in bed.
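
The message-faking idea described above (chop pieces out of a few real emails at random, then stitch them back together) could be sketched roughly as follows. This is an illustration only and not the code in testtools/hammer.py: the function name make_fake_message, its parameters and the "serial-N" marker are all assumptions invented for this example.

```python
import random


def make_fake_message(samples, rng, serial, pieces=4, piece_len=40):
    """Stitch random slices of the sample texts into one new text.

    Hypothetical helper: `samples` is a list of non-empty strings taken
    from real messages, `rng` is a random.Random instance, and `serial`
    is a caller-supplied counter that keeps every result distinct.
    """
    chunks = []
    for _ in range(pieces):
        source = rng.choice(samples)
        # Pick a start so the slice stays inside the source where possible;
        # short sources are simply used whole.
        start = rng.randrange(max(1, len(source) - piece_len + 1))
        chunks.append(source[start:start + piece_len])
    # The serial marker guarantees uniqueness even if the slices repeat.
    chunks.append("serial-%d" % serial)
    return "\n".join(chunks)


if __name__ == "__main__":
    corpus = [
        "Subject: meeting notes from Tuesday, please review before Friday",
        "Subject: cheap watches and other things you did not ask for",
    ]
    rng = random.Random(2003)
    for i in range(3):
        print(make_fake_message(corpus, rng, serial=i))
        print("-" * 20)
```

A real harness would feed each generated message to the trainer and classifier in a loop and time each call; that part is left out here because it depends on the SpamBayes API.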
-- Richie Hindle richie@entrian.com From sourceforge at metrak.com Wed Dec 31 22:55:58 2003 From: sourceforge at metrak.com (Paul Sorenson) Date: Wed Dec 31 22:56:05 2003 Subject: [spambayes-dev] converToMbox broken Message-ID: <00b601c3d01b$2a13f030$c48b0fcb@home.classware.com.au> Further to my insane ramblings the other day about trying to train on dbx files with the web interface, the only way I could get oe_mailbox.convertToMbox() to work was by adding: from time import * at the top of the file. Otherwise you get "global symbol strftime not found" I am using python 2.3.2 on linux this time round, the other day I tried it on a WinXP box with 2.3.3. Cheers