From tim@fourstonesExpressions.com Sun Dec 1 04:09:19 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat, 30 Nov 2002 22:09:19 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To:
Message-ID: <1QPA8GDWQ09KFUORPQN65JDQLE0IDT.3de98b6f@riven>

Dredging up a less recent thread...

11/21/2002 11:07:36 AM, Neale Pickett wrote:

>> Options class is a bit too much right now... Lots of wonderful options
>> for research, but nobody in their right mind would tweak most of them.
>> I think we should split options into ones that the average person
>> would be interested in, and those that the average person should never
>> touch. Then give 'em a (Richie style web) ui to tweak the ones that
>> they should be interested in. I'd be happy to run with that one for a
>> while...
>
>Now there's an idea, make a configuration engine that runs as a
>SimpleHTTPServer, and have the person connect with their favorite
>browser to some port on localhost. Isn't that how SATAN worked? In any
>case, that would be a very good project for someone who wanted to help
>out but didn't know where to start :)

I've taken a good swipe at creating a configuration application that the
average user could use to make simple changes to the spambayes
configuration. In particular, it's useful for pop3 users to configure their
settings. But it's also useful for other settings as well, and perhaps even
for some administrative tasks, like purging old words from databases, doing
a zodb pack, maybe printing word probabilities, and stuff like that.

It's html based, and it uses a subclass of SimpleHTTPServer, named
SmarterHTTPServer. (Sorry, Richie, I just couldn't figure out how to make
the stuff in pop3proxy work for me...) SimpleHTTPServer cannot serve
requests with parameters very well, so SmarterHTTPServer adds that
functionality. It also adds the ability to call 'methlets' or methods on
the server class which can be used to dynamically create content.
These two additional functions are quite handy, and may even warrant
inclusion into SimpleHTTPServer.

The program maintains options changes in bayescustomize.ini. Lemme tell ya
what... the way Options.py is set up made figuring out how to do this stuff
one freakin pain. What's up with that? Why does OptionsClass wrap a
ConfigParser, rather than subclass it? Using bayescustomize.ini is very
nice, because it allows the user to easily revert back to default values if
that ever becomes necessary.

To execute this module, just invoke OptionConfig.py, optionally followed by
a port number. The working directory should be the same one where the
bayescustomize.ini file is located. The port number is the port the http
server will listen on, and defaults to 8000. Then point your browser at
http://localhost:8000 (or whatever port you chose).

I've embedded all the necessary html in the module itself, but this can
really only be temporary. We will inevitably accumulate html for little
applications like this (e.g. pop3proxy), and more importantly, for
documentation. I think that the documentation standard for the project
should be html. The look and feel that Richie came up with for the
pop3proxy works very well for me. We need to decide how we're going to
structure the directories for that kind of stuff. May I propose the
following:

html
application
pop3proxy
optionConfig
doc
graphics

- TimS

>
>Neale
>
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From neale@woozle.org Sun Dec 1 04:16:20 2002
From: neale@woozle.org (Neale Pickett)
Date: 30 Nov 2002 20:16:20 -0800
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

So then, "Moore, Paul" is all like:

> From: Tim Stone - Four Stones Expressions
> > So... does this lay to rest forever the pickle/dbm debate? Is there any
> > reason left to use a pickle?
> >
> Sorry, quite the opposite (IMHO). The patch switches to using shelve,
> which uses anydbm, which (still) uses the buggy BerkeleyDB 1.85 on
> Windows. So Windows users should probably still use pickles.

I've just checked in a new anydbm that has a more appropriate list of
database back-ends to try on the Windows platform. But it needs someone
with a Windows box to fix the dumb test I put in it:

    # XXX: Some windows dude should fix this test
    if sys.platform == "windows":
        # dbm on windows is awful.
        _names = ["dbhash", "gdbm", "dumbdbm"]
    else:
        _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]

So, if you are a Windows dude and feel up to fixing that test, please do
so, and remove the first comment while you're at it :)

This should eliminate any dbm concerns for Windows folk.

Neale

From neale@woozle.org Sun Dec 1 04:19:02 2002
From: neale@woozle.org (Neale Pickett)
Date: 30 Nov 2002 20:19:02 -0800
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To:
References:
Message-ID:

So then, Tim Stone - Four Stones Expressions is all like:

> So... does this lay to rest forever the pickle/dbm debate? Is there
> any reason left to use a pickle?

The pickle is still smaller, and it's faster to write out the whole thing
than it is the dbm. So the original recommendations, and the ones in
hammiebulk.py, are still accurate: pickle for pop3proxy and hammiesrv, dbm
for hammiefilter.

If you use the Outlook plugin or Jeremy's ZODB driver, you don't need to
concern yourself with this.

From richie@entrian.com Sun Dec 1 14:16:53 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 14:16:53 +0000
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <1QPA8GDWQ09KFUORPQN65JDQLE0IDT.3de98b6f@riven>
References: <1QPA8GDWQ09KFUORPQN65JDQLE0IDT.3de98b6f@riven>
Message-ID: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>

[Tim Stone]
> I've taken a good swipe at creating a configuration application that the
> average user could use to make simple changes to the spambayes configuration.
> [...]
> It's html based, and it uses a subclass of SimpleHTTPServer, named
> SmarterHTTPServer. (Sorry, Richie, I just couldn't figure out how to make the
> stuff in pop3proxy work for me...)

Your configurator looks great! One of the things on my list of things to do
is to turn the HTML user interface code into a plugin-hosting library, so
that new components like this can be plugged into the user interface. It's
a shame I didn't get round to doing this before you wrote your code - maybe
you and I can work together to design that API, and make your configurator
the first plugin? I think we can combine my HTTP server code with your
'methlet' idea, and come up with something that works very well and makes
it easy to write further plugins. When I have time, hopefully later today,
I'll write up the thoughts I have on it.

> I've embedded all the necessary html in the module itself, but this can really
> only be temporary. We will inevitably accumulate html for little applications
> like this (e.g. pop3proxy), and more importantly, for documentation. I think
> that the documentation standard for the project should be html. The look and
> feel that Richie came up with for the pop3proxy works very well for me. We
> need to decide how we're going to structure the directories for that kind of
> stuff. May I propose the following:
>
> html
> application
> pop3proxy
> optionConfig
> doc
> graphics

John Draper has said a similar thing - he wants to add to the HTML user
interface as well, and he wants administrators to be able to plug in their
own look and feel (by replacing images, stylesheets and so on).
This structure looks good (we need to include stylesheets - maybe the
'graphics' area could be called 'global' or something and include
stylesheets, javascript modules, and so on?)

--
Richie Hindle
richie@entrian.com

From tim@fourstonesExpressions.com Sun Dec 1 16:58:01 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 01 Dec 2002 10:58:01 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>
Message-ID: <5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>

12/1/2002 8:16:53 AM, Richie Hindle wrote:

>
>[Tim Stone]
>> I've taken a good swipe at creating a configuration application that the
>> average user could use to make simple changes to the spambayes configuration.
>> [...]
>> It's html based, and it uses a subclass of SimpleHTTPServer, named
>> SmarterHTTPServer. (Sorry, Richie, I just couldn't figure out how to make the
>> stuff in pop3proxy work for me...)
>
>Your configurator looks great! One of the things on my list of things to do is
>to turn the HTML user interface code into a plugin-hosting library, so that
>new components like this can be plugged into the user interface. It's a
>shame I didn't get round to doing this before you wrote your code - maybe
>you and I can work together to design that API, and make your configurator
>the first plugin? I think we can combine my HTTP server code with your
>'methlet' idea, and come up with something that works very well and makes
>it easy to write further plugins. When I have time, hopefully later today,
>I'll write up the thoughts I have on it.

I'm not sure I understand why you're using asynchat and asyncore. The
SimpleHTTPServer thingy needed a little tweaking, but it's capable of
threaded responses, etc... What was the advantage you gained?

>
>> I've embedded all the necessary html in the module itself, but this can really
>> only be temporary.
>> We will inevitably accumulate html for little applications
>> like this (e.g. pop3proxy), and more importantly, for documentation. I think
>> that the documentation standard for the project should be html. The look and
>> feel that Richie came up with for the pop3proxy works very well for me. We
>> need to decide how we're going to structure the directories for that kind of
>> stuff. May I propose the following:
>>
>
>John Draper has said a similar thing - he wants to add to the HTML user
>interface as well, and he wants administrators to be able to plug in their
>own look and feel (by replacing images, stylesheets and so on). This
>structure looks good (we need to include stylesheets - maybe the 'graphics'
>area could be called 'global' or something and include stylesheets,
>javascript modules, and so on?)

I'm a bit of a stickler on subdirectory contents on websites, because you
can really get into a mishmash quickly. I think that css and js files (in
particular) should be kept separate from graphic files, which in turn
should be kept separate from html... How about:

ui
cgi-bin
pop3proxy
optionConfig
html
doc
graphics
style
js

>
>--
>Richie Hindle
>richie@entrian.com
>
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From skip@pobox.com Sun Dec 1 18:15:09 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 1 Dec 2002 12:15:09 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>
References: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>
	<5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>
Message-ID: <15850.20909.555491.227850@montanaro.dyndns.org>

    Tim> I'm not sure I understand why you're using asynchat and asyncore.

Makes it (relatively) easy to talk to multiple connections simultaneously
without resorting to multiple threads. It requires you to reorient how you
look at such things, but once you understand the model it's pretty easy to
program.
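
A minimal sketch of the "many connections, one thread" model described
above. It is not pop3proxy's code: it swaps in the standard-library
selectors module for asyncore so that it runs on current Python versions,
and the function name serve_once and the connection labels are made up
for illustration.

```python
import selectors
import socket
import time


def serve_once(connections, timeout=5.0):
    """Read one message from each connection, all in a single thread.

    `connections` maps a label to a connected socket (hypothetical setup).
    """
    sel = selectors.DefaultSelector()
    received = {}
    for label, sock in connections.items():
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, data=label)
    deadline = time.monotonic() + timeout
    # The selector reports which sockets are readable; nothing ever blocks
    # on one connection while another one has data waiting.
    while len(received) < len(connections) and time.monotonic() < deadline:
        for key, _events in sel.select(timeout=0.5):
            received[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)
    sel.close()
    return received


# Two in-process socket pairs stand in for a POP3 client and a browser.
pop3_client, pop3_server = socket.socketpair()
http_client, http_server = socket.socketpair()
pop3_client.sendall(b"USER alice")
http_client.sendall(b"GET /home")
result = serve_once({"pop3": pop3_server, "http": http_server})
for s in (pop3_client, pop3_server, http_client, http_server):
    s.close()
```

Because every handler runs to completion in the same thread, shared
objects need no locks, which is the property the thread goes on to discuss.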
Skip

From richie@entrian.com Sun Dec 1 18:51:22 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 18:51:22 +0000
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <15850.20909.555491.227850@montanaro.dyndns.org>
References: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>
	<5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>
	<15850.20909.555491.227850@montanaro.dyndns.org>
Message-ID:

[Tim Stone]
> I'm not sure I understand why you're using asynchat and asyncore.

[Skip]
> Makes it (relatively) easy to talk to multiple connections simultaneously
> without resorting to multiple threads. It requires you to reorient how you
> look at such things, but once you understand the model it's pretty easy to
> program.

To expand on this a bit, it means that you can have the existing HTML user
interface, your configurator, and multiple POP3 proxies, all potentially
being used by multiple simultaneous users, all within one thread of one
process and all sharing common data structures (the 'options' object, a
Classifier instance, etc) without ever having to worry about synchronising
anything.

At the moment, for instance, we have a potential (and rather contrived I
admit) problem whereby your configurator could be halfway through writing
the ini file when the POP3 proxy tries to read it. Using asyncore to run
everything within one thread of one process prevents that entire class of
problem with no extra effort.

Asyncore works in exactly the same way as your methlets - it takes away the
procedural programming job of reading and writing sockets, and instead asks
that the programmer writes event handler functions. Your
OptionsConfigurator.homepage(self, parms) is an event handler for the
"someone is asking for the homepage" event. This is exactly how my
async-based HTTP server works - my UserInterface.onHome(self, params) does
the same job for the pop3proxy.py HTML user interface.
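
The "URL as event, method as handler" idea can be sketched as follows.
This is only an illustration under stated assumptions: the dispatch
function, the changeopts method and the returned strings are hypothetical,
and neither SmarterHTTPServer's nor pop3proxy's real code is shown here.

```python
from urllib.parse import parse_qs, urlsplit


class OptionsConfigurator:
    """Hypothetical handler class: public method names double as URL paths."""

    def homepage(self, params):
        return "<h1>Options</h1>"

    def changeopts(self, params):
        # parse_qs yields a list of values per key; use the first one.
        return "port set to %s" % params["port"][0]


def dispatch(handler, url):
    """Map '/name?key=value' onto handler.name({'key': ['value']})."""
    parts = urlsplit(url)
    name = parts.path.strip("/") or "homepage"
    method = getattr(handler, name, None)
    # Refuse private names so a URL cannot reach internals of the class.
    if name.startswith("_") or not callable(method):
        return "404 no such page: %s" % name
    return method(parse_qs(parts.query))
```

A request for "/changeopts?port=8110" ends up as a plain method call, so
the person writing the handler never touches a socket.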
The plugin API I have in mind will work exactly that way - and there'll
certainly be no requirement for the plugin programmer to know about
asyncore.

--
Richie Hindle
richie@entrian.com

From richie@entrian.com Sun Dec 1 20:20:10 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 20:20:10 +0000
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To:
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

[Neale]
> I've just checked in a new anydbm that has a more appropriate list of
> database back-ends to try on the Windows platform. [...]
> This should eliminate any dbm concerns for Windows folk.

You left dbhash in the list - that's just another interface to the broken
bsddb. And if that gets removed, Windows users will be left with dumbdbm -
the name doesn't inspire confidence, and the docstring says "XXX TO DO: -
seems to contain a bug when updating..."

As far as I can see there's a complete solution available to these DBM
problems. Perhaps I've missed something, but I've been back over all the
discussions and I can't see anything wrong with it:

 o We demand bsddb 3 or better on platforms where bsddb is the dbm
   implementation that gets picked up. So until Python 2.3 is released,
   Windows users need to install pybsddb. I've just done this and it's
   trivial. (We already demand a new "email" library and no-one's
   complained.) Would this cause problems on any other platforms?

 o If training goes slowly, we implement Tim Peters' idea: "Bulk training
   could be taught to use a new classifier based on an in-memory dict.
   When that's done, the in-memory dict's ham and spam counts would be
   added into the persistent DB (rewriting only those WordInfo records
   corresponding to words that appeared in the bulk training data), and
   then the in-memory dict could be thrown away."
 o Or (Neale) you were talking about writing a caching front-end for the
   DBM (regardless of which actual DBM was behind it) - that would work
   as well.

Wouldn't that solve *everything*? Startup times would be quick, training
would be quick, no buggy DBM implementations would be used, and different
components wouldn't default to different storage formats (hammie vs.
pop3proxy). Installing pybsddb on Windows is trivial, and once Python 2.3
comes out you won't even need to do that.

I've probably missed something - it's hard to keep up!

--
Richie Hindle
richie@entrian.com

From lists@morpheus.demon.co.uk Sun Dec 1 20:34:43 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 01 Dec 2002 20:34:43 +0000
Subject: [Spambayes] don't update if you don't want to retrain
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

Neale Pickett writes:

> I've just checked in a new anydbm that has a more appropriate list of
> database back-ends to try on the Windows platform. But it needs someone
> with a Windows box to fix the dumb test I put in it:
>
>     # XXX: Some windows dude should fix this test
>     if sys.platform == "windows":
>         # dbm on windows is awful.
>         _names = ["dbhash", "gdbm", "dumbdbm"]
>     else:
>         _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]
>

I see someone changed "windows" to "win32". But the other problem is more
serious. Windows doesn't *have* gdbm or dbm - the problem lies with
"dbhash" (the Berkeley DB implementation).

So the Windows branch should be

    if sys.platform == "windows":
        # The Berkeley DB implementation on Windows is out of date
        _names = ["gdbm", "dbm", "dumbdbm"]

(or probably just _names = ["dumbdbm"]).

Paul.
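
Putting the points from this exchange together, the platform test might
look roughly like the sketch below. The function name candidate_backends
is hypothetical (the checked-in anydbm uses a module-level _names list),
and the Windows list follows the "probably just dumbdbm" suggestion above
rather than anything confirmed as committed.

```python
import sys


def candidate_backends(platform=None):
    """Return dbm module names to try, in order of preference."""
    if platform is None:
        platform = sys.platform
    if platform == "win32":
        # sys.platform reports "win32" on Windows, not "windows".
        # dbhash is left out because it wraps the outdated Berkeley DB
        # there, and gdbm/dbm do not exist on Windows at all.
        return ["dumbdbm"]
    return ["dbhash", "gdbm", "dbm", "dumbdbm"]
```

Taking the platform as an optional argument keeps the Windows branch
testable from a non-Windows machine.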
--
This signature intentionally left blank

From tim@fourstonesExpressions.com Sun Dec 1 21:35:48 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 01 Dec 2002 15:35:48 -0600
Subject: [Spambayes] don't update if you don't want to retrain
Message-ID: <7SMC0SN3YX2UWE0549CLJ3ZMJA8.3dea80b4@riven>

'twas me that changed it to win32. When I do a print sys.platform, out
comes win32...

- TimS

12/1/2002 2:34:43 PM, Paul Moore wrote:

>Neale Pickett writes:
>
>> I've just checked in a new anydbm that has a more appropriate list of
>> database back-ends to try on the Windows platform. But it needs someone
>> with a Windows box to fix the dumb test I put in it:
>>
>>     # XXX: Some windows dude should fix this test
>>     if sys.platform == "windows":
>>         # dbm on windows is awful.
>>         _names = ["dbhash", "gdbm", "dumbdbm"]
>>     else:
>>         _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]
>>
>
>I see someone changed "windows" to "win32". But the other problem is
>more serious. Windows doesn't *have* gdbm or dbm - the problem lies
>with "dbhash" (the Berkeley DB implementation).
>
>So the Windows branch should be
>
>    if sys.platform == "windows":
>        # The Berkeley DB implementation on Windows is out of date
>        _names = ["gdbm", "dbm", "dumbdbm"]
>
>(or probably just _names = ["dumbdbm"]).
>
>Paul.
>
>--
>This signature intentionally left blank
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From lists@morpheus.demon.co.uk Sun Dec 1 23:04:25 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 01 Dec 2002 23:04:25 +0000
Subject: [Spambayes] don't update if you don't want to retrain
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

Richie Hindle writes:

> As far as I can see there's a complete solution available to these DBM
> problems.
> Perhaps I've missed something, but I've been back over all the
> discussions and I can't see anything wrong with it:
>
>  o We demand bsddb 3 or better on platforms where bsddb is the dbm
>    implementation that gets picked up. So until Python 2.3 is released,
>    Windows users need to install pybsddb. I've just done this and it's
>    trivial. (We already demand a new "email" library and no-one's
>    complained.) Would this cause problems on any other platforms?

I'm all in favour of this. However, it's worth pointing out a couple of
things:

1. Email is pure python, bsddb is not only in C, but also needs a 3rd
   party library (Sleepycat DB). No problem on Windows (Python 2.3 will
   come with it built in, and there's a trivial-to-install binary build
   for 2.2 users), but might it cause problems on Unix systems?

2. On Unix, as I understand it, it's possible to use the new Sleepycat DB
   with the old Python module. So Unix users quite possibly don't need to
   bother with bsddb 3.

The simple answer is to require bsddb 3 on Windows with Python 2.2, and
otherwise use it if present, otherwise use the built-in dbhash (and assume
that a suitably up to date Berkeley DB is behind it).

But as I said, I'm happy with your approach - I only offer this if Unix
users don't like the bsddb 3 requirement...

Paul.

--
This signature intentionally left blank

From skip@pobox.com Sun Dec 1 23:18:15 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 1 Dec 2002 17:18:15 -0600
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To:
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID: <15850.39095.51566.137899@montanaro.dyndns.org>

    Paul> 1. Email is pure python, bsddb is not only in C, but also needs a 3rd
    Paul>    party library (Sleepycat DB). No problem on Windows (Python 2.3
    Paul>    will come with it built in, and there's a trivial-to-install binary
    Paul>    build for 2.2 users), but might it cause problems on Unix systems?

Unlikely.
Most Unixes have had recent versions of Sleepycat's library available for
a long time. Versions 3 or 4(.0) are required for pybsddb. Failing that,
Version 2 doesn't suffer with the bugs that Version 1 does. The old bsddb
will still be available, just not built by default.

    Paul> 2. On Unix, as I understand it, it's possible to use the new Sleepycat
    Paul>    DB with the old Python module. So Unix users quite possibly don't
    Paul>    need to bother with bsddb 3.

Correct. The new module has already been checked into CVS though, so Unix
types will get it as the default but be able to fall back to Version 2 (or
even 1) if they want.

don't-worry-about-us-we're-just-fine-ly, y'rs,

Skip

From richie@entrian.com Sun Dec 1 23:49:38 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 23:49:38 +0000
Subject: [Spambayes] The database question that would not die
Message-ID:

I've tried using bsddb3 on Windows, and the results are encouraging.
Testing with 500 spams, 500 hams and 500 unknowns looks like this:

          Training 1000   Database size   Classifying 500   Database load
  Pickle  65 seconds      999,540         35 seconds        4 seconds
  bsddb3  82 seconds      1,318,912       43 seconds        (negligible)

Close enough on all counts, I'd say (and the startup time will be a bigger
and bigger win as the database grows). Small savings in time and space for
some operations aren't worth the hassle of having two formats, IMHO.

Here's what I did:

 o Installed pybsddb, which gave me the bsddb3 module

 o Created dbhash3.py, a duplicate of dbhash.py (16 lines of code) that
   refers to bsddb3 rather than bsddb

 o Changed anydbm.py to always use dbhash3 on Windows.

I can see a few possible objections:

 o There may be platforms on which anydbm defaults to bsddb 1.85, but for
   which installing bsddb3 is a pain. Any takers?

 o Current pickle users may violently object to the (small?) time and
   space losses incurred by switching to using an anydbm database (which
   may not be bsddb3 on their platform). Any takers?
 o Insisting on bsddb3 prevents closed-source use of the spambayes code
   until Python 2.3 is released. I can't imagine anyone here objecting...?
   I only mention this one for completeness.

 o We should skip bsddb3 and go directly to ZODB. My feeling is that this
   is possibly a good long-term goal, but at this stage it would be
   premature.

 o The dramatic fifth objection, which I haven't thought of but which
   means this idea will never fly. Any takers? 8-)

So now I can ask the question that Neale (I think) asked a while ago - is
there any need to keep the pickle option?

I would LOVE for us to drop the pickle option before I submit my article
to the Linux Journal, which has to happen before Thursday 5th December.
Explaining the different database formats will be an embarrassment - much
better to simply say "Python 2.2 users on Windows also need to download
bsddb3 from ".

--
Richie Hindle
richie@entrian.com

From papaDoc@videotron.ca Mon Dec 2 00:47:13 2002
From: papaDoc@videotron.ca (Remi Ricard)
Date: Sun, 01 Dec 2002 19:47:13 -0500
Subject: [Spambayes] pop3proxy and Mozilla documentation
Message-ID: <1038790033.1032.11.camel@porsche>

Hi,

This is my first draft for the documentation.
The presentation is really plain. I will improve the formatting, add
color and images, after your comments.

So any comment will be welcome.

By the way, I don't know if I should have done it in French. If it is
not understandable let me know I won't be frustrated,
it will just help me improve my English.

P.S. OptionConfig.py is now working for me __author__ is not defined. If
I comment out this line I get
Classes:
    OptionsConfigurator - changes select values in Options.py
Abstact:
    Some text here: too long to copy
To Do:
    o Suggestions?
: File name too long

--
Remi Ricard

From tim@fourstonesExpressions.com Mon Dec 2 03:44:43 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 01 Dec 2002 21:44:43 -0600
Subject: [Spambayes] pop3proxy and Mozilla documentation
In-Reply-To: <1038790033.1032.11.camel@porsche>
Message-ID:

Remi, I don't know why, but I can't see the attachment...

12/1/2002 6:47:13 PM, Remi Ricard wrote:

>Hi,
>
>This is my first draft for the documentation.
>The presentation is really plain. I will improve the formating, add
>color and images, after your comments.
>
>So any comment will be welcome.
>
>By the way, I don't know if I should have done it in French. If it is
>not understandable let me know I won't be frustrated,
>it will just help me improve my English.
>
>P.S. OptionConfig.py is now working for me __author__ is not defined. If
>I comment out this line I get
>Classes:
>    OptionsConfigurator - changes select values in Options.py
>Abstact:
>    Some text here: too long to copy
>To Do:
>    o Suggestions?
>
>: File name too long

This is a new one to me. It's working just fine on my machine, and
Richie's too. What platform are you on?

- TimS

>
>--
>Remi Ricard
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>

c'est moi - TimS
www.fourstonesExpressions.com

From neale@woozle.org Mon Dec 2 04:58:22 2002
From: neale@woozle.org (Neale Pickett)
Date: 01 Dec 2002 20:58:22 -0800
Subject: [Spambayes] The database question that would not die
In-Reply-To:
References:
Message-ID:

So then, Richie Hindle is all like:

> I've tried using bsddb3 on Windows, and the results are encouraging.

Neato. This seems consistent with my experience. I've been recommending
that people using the pop3proxy and hammiesrv (all 1 of them <0.2 wink>)
use the pickle because it will be faster in all cases than the dbm.
The dbm win comes into play with hammiefilter training or scoring one
message at a time, and then the win is huge.

> Close enough on all counts, I'd say (and the startup time will be a
> bigger and bigger win as the database grows). Small savings in time
> and space for some operations aren't worth the hassle of having two
> formats, IMHO.

While Tim S's storage class makes having two formats much easier, it would
be nice if a pop3proxy database could be used by hammiefilter without
having to change a configuration file. I can't think of a good reason not
to drop pickle, but I think we should wait to see what Tim Peters thinks
about it.

I've only ever used the pickle when testing new code, to see if it works
with both storage formats.

Neale

From Paul.Moore@atosorigin.com Mon Dec 2 10:31:15 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 2 Dec 2002 10:31:15 -0000
Subject: [Spambayes] Easy task for Outlook
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E36@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> The plugin could do with some kind of "log file" strategy.
> Currently all "print" statements go to the win32traceutil
> package. However, once we package this up as a stand-alone
> DLL, this won't fly.

This seems an ideal candidate for the Python 2.3 "logging" module (see PEP
282). There's a standalone version for Python 2.2 - would the fact that
this introduces another dependency on a temporarily external module be an
issue?

(I'm not volunteering to do this - my time is pretty used up at the
moment. But it would be a shame if someone reinvented the wheel here...)

Paul.
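
As a rough illustration of the suggestion above, print statements could
be replaced by a file-backed logger along these lines. This is a sketch,
not the Outlook plugin's code: the helper make_logger, the logger name
"spambayes.outlook" and the message format are all assumptions.

```python
import logging


def make_logger(path, name="spambayes.outlook"):
    """Return a logger that appends formatted records to the file at `path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

A call such as make_logger("spambayes.log").info("trained on %d messages", 3)
would then take the place of a bare print, and the output survives when
there is no console to print to.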
From francois.granger@free.fr Mon Dec 2 10:35:34 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 02 Dec 2002 11:35:34 +0100
Subject: [Spambayes] The database question that would not die
In-Reply-To:
Message-ID:

on 2/12/02 0:49, Richie Hindle at richie@entrian.com wrote:

> I've tried using bsddb3 on Windows, and the results are encouraging.
>
> So now I can ask the question that Neale (I think) asked a while ago - is
> there any need to keep the pickle option?

Is the conversion between Pickle and bsddb3 an issue at all ?
If there is only one user involved, I think no.

--
Mail is a means of communication. People should ask themselves about the
political implications of their choices (or non-choices) of tools and
technologies. For clean mail:
--

From Paul.Moore@atosorigin.com Mon Dec 2 10:36:07 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 2 Dec 2002 10:36:07 -0000
Subject: [Spambayes] don't update if you don't want to retrain
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E37@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
> 'twas me that changed it to win32. When I do a print
> sys.platform, out comes win32...

Sorry, that change was correct. My bad wording, plus a cut&paste typo,
made it look like I was suggesting that "windows" was correct - I wasn't
:-(

But the bit about needing to remove dbhash *was* intended...

Excuse me while I go and type 1000 times "I must check my posts before
sending them"...

Paul.

From richie@entrian.com Mon Dec 2 12:22:51 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 12:22:51 +0000
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9 Corpus.py,1.5,1.6
In-Reply-To:
References:
Message-ID:

[Richie]
> so the on-demand-ness should come for free for all Corpus-using code.

[Mark]
> How much Corpus-using code is there?
> Are there any plans to move any
> existing code that does not use it towards using it? I've raised this with
> Tim S for Outlook, and it doesn't appear we will - I have no idea about the
> other apps though.

Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it,
but doesn't seem to use it (?) I'd like to see more of the existing code
using it, but then again I'm not in a hurry to implement the idea
myself...

In an ideal (meaning "engineering purity") world, we'd have abstract
Corpus and Message interfaces, and all the applications would code to
those interfaces regardless of the concrete classes implementing them.
Then any application would work with messages stored in any format -
hammie could classify your Outlook messages from the command line, the
Outlook plug-in could train on messages in mbox files, and so on. In the
real world, that kind of thing usually turns out either to be YAGNI or so
hard as to be unreasonable. Where we end up will probably be somewhere in
between.

I was able to scratch an itch using Corpus - it was exactly what I needed
for the web training interface (partly because Tim and I discussed the
design of Corpus with that in mind). If other people find they can scratch
itches with it, its usage will grow, otherwise it won't. Migrating
already-working code to use a new library for reasons of engineering
purity isn't an itch that many people suffer from.

I have a *much* bigger problem with Corpus, which is that I find the word
'Corpus' impossible to type. Is it just me?

> In the back of my mind, I am pondering if we need a better directory
> structure - maybe with the core engine in a package, and some of these
> "wrappers" used only by a few application also into their own?

Isn't this also YAGNI? We have a few tens of Python files in the project -
do we really need to split it up? And if we do, should we be doing it with
the code this young?
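
For readers unfamiliar with the "abstract Corpus and Message interfaces"
idea mentioned above, here is a minimal sketch. Every class and method
name in it is hypothetical and chosen for illustration; it does not
reproduce the actual spambayes Corpus.py API.

```python
from abc import ABC, abstractmethod


class Message(ABC):
    """Abstract message: applications only ever see this interface."""

    @abstractmethod
    def text(self):
        """Return the message content as a string."""


class Corpus(ABC):
    """Abstract collection of messages, whatever the storage format."""

    @abstractmethod
    def messages(self):
        """Return an iterator over Message objects."""


class InMemoryMessage(Message):
    def __init__(self, text):
        self._text = text

    def text(self):
        return self._text


class InMemoryCorpus(Corpus):
    """One concrete backend; an mbox or Outlook backend would be siblings."""

    def __init__(self, messages):
        self._messages = list(messages)

    def messages(self):
        return iter(self._messages)


def count_words(corpus):
    """Stand-in for an application: it touches only the abstract interface."""
    return sum(len(msg.text().split()) for msg in corpus.messages())
```

Swapping InMemoryCorpus for another Corpus subclass would leave
count_words untouched, which is the portability described above; whether
that is worth the migration effort is the YAGNI question being raised.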

-- 
Richie Hindle
richie@entrian.com

From richie@entrian.com Mon Dec 2 12:27:57 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 12:27:57 +0000
Subject: [Spambayes] The database question that would not die
In-Reply-To:
References:
Message-ID:

[Richie]
> So now I can ask the question that Neale (I think) asked a while ago - is
> there any need to keep the pickle option?

[François]
> Is the conversion between Pickle and bsddb3 an issue at all ?
> If there is only one user involved, I think no.

I wasn't going to provide an upgrade path, if that's what you meant...
pickle users will need to retrain, but that happens on a regular basis
anyway, as we change the pickle version. Sooner or later we'll need to
worry about upgrading existing databases, but it's too early for that now.

While you're here, François, would the switch to anydbm have a big effect
on your MacOS 9 platform? I don't know what database types are supported
there.

-- 
Richie Hindle
richie@entrian.com

From skip@pobox.com Mon Dec 2 13:03:52 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 2 Dec 2002 07:03:52 -0600
Subject: [Spambayes] The database question that would not die
In-Reply-To:
References:
Message-ID: <15851.23096.388509.925822@montanaro.dyndns.org>

    Richie> o There may be platforms on which anydbm defaults to bsddb
    Richie> 1.85, but for which installing bsddb3 is a pain. Any takers?

I think there are some misunderstandings still out there about the various
incarnations of the bsddb module and the underlying Berkeley DB code. Even
if everyone understands what's what, the language I see used suggests they
might not. Let me try and make sure everyone has a similar grasp of the
issues and terminology.

There has been a bsddb or dbhash module in Python for quite a while (five
years at least). It requires the Berkeley DB library, originally available
from UC Berkeley, but now from Sleepycat (whose founders were grad students
at Berkeley when they wrote the earliest versions).

The original bsddb module was written against Berkeley DB 1.85. That
version created two interfaces, a C API (the version 1.85 API), and a file
format (the version 1.85 file format). If you ask file(1) about a file
created with it the version numbers will likely differ. File format
versions and library release versions have no obvious correspondence to
the untrained observer.

There were various bugs in the code in db 1.85. To correct (some of) those
bugs, file format changes were necessary. This originally happened in
version 1.86, which, unfortunately, was never widely adopted (licensing
issues?). The C API didn't change.

When version 2.x of Berkeley DB was released (I think by Sleepycat shortly
after its founding), they changed the file formats again and added a new C
API. The old version 1.85 C API was still available (and still is even in
the most recent versions). This API is what the original bsddb module was
written against.

When version 3.x of Berkeley DB was released, Sleepycat added another C API
(or at least extended the version 2 API significantly). Pybsddb (aka
bsddb3, aka the current bsddb module in CVS) was written against this
richer API. This API remained current through version 4.0.x of Sleepycat's
offerings. Unfortunately, in version 4.1.x, they changed some aspects of it
which cause problems for Pybsddb. Consequently, you can't build Pybsddb
against the 4.1.x library.

So, here's a summary of what works with what:

    The historic bsddb module (bsddb185 in CVS now) works with any version
    of the Berkeley DB library as long as the 1.85 C API is enabled. If you
    use it with version 1.85 of the library you may experience data
    corruption problems because of bugs in the code and file structure (not
    the 1.85 API). You can use it safely with later versions of the library
    as long as the 1.85 API was enabled during configuration.

    The current bsddb module (Pybsddb, bsddb3, bsddb in CVS) works with
    versions 3.x and 4.0.x of the Berkeley DB library.
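
To find out which of these modules the generic dbm layer actually picks
on a given machine, something along the following lines may help. It is
only a sketch: it creates a throwaway database and asks whichdb what
made it. It reports the Python module name, not the version of the
underlying Berkeley DB library, and the function name and the "probe"
file name are made up for the example. The try/except covers both
spellings: anydbm and whichdb in the Pythons discussed here, and the dbm
package that later Pythons use instead.

```python
import os
import shutil
import tempfile

try:
    # Module names as used in this thread (Python 2).
    import anydbm as dbm_module
    from whichdb import whichdb
except ImportError:
    # Later Pythons fold both into the dbm package.
    import dbm as dbm_module
    from dbm import whichdb


def backend_used_for_new_database():
    """Create a scratch database and return the name of the module that
    the generic dbm layer chose to create it with."""
    directory = tempfile.mkdtemp()
    try:
        path = os.path.join(directory, "probe")
        db = dbm_module.open(path, "c")   # "c": create if missing
        db[b"key"] = b"value"             # make sure something is written
        db.close()
        return whichdb(path)
    finally:
        shutil.rmtree(directory, ignore_errors=True)


if __name__ == "__main__":
    print(backend_used_for_new_database())
```

Which name comes back depends entirely on what is installed, so treat
the result as a hint about the platform rather than as a version check.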

Skip

From papaDoc@videotron.ca Mon Dec 2 13:18:58 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Mon, 02 Dec 2002 08:18:58 -0500
Subject: [Spambayes] pop3proxy and Mozilla documentation
In-Reply-To:
References:
Message-ID: <3DEB5DC2.9080807@videotron.ca>

Hi,

>Remi, I don't know why, but I can't see the attachment...
>
I don't know either. When I took my mail from work, I also did not receive
the attachment. I was sure I added it to my email. I will resend it this
evening from home with the missing part. I forgot to add how to filter the
mail he.. he..

>>
>>P.S. OptionConfig.py is now working for me __author__ is not defined. If
>>I comment out this line I get
>>Classes:
>>OptionsConfigurator - changes select values in Options.py
>>Abstact:
>> Some text here: too long to copy
>>To Do:
>> o Suggestions?
>>
>>: File name too long
>>
>>
>
>This is a new one to me. It's working just fine on my machine, and Richie's
>too. What platform are you on?
>
I'm running Red Hat 7.2 with python 2.2.1 if I remember correctly.
I needed to get the email directory from the CVS since I don't have
the latest python that comes with it.

papaDoc

From richie@entrian.com Mon Dec 2 13:47:29 2002
From: richie@entrian.com (richie@entrian.com)
Date: Mon, 02 Dec 2002 13:47:29 +0000
Subject: [Spambayes] The database question that would not die
In-Reply-To: <15851.23096.388509.925822@montanaro.dyndns.org>
Message-ID:

[Skip]
> I think there are some misunderstandings still out there about the various
> incarnations of the bsddb module and the underlying Berkeley DB code.
> [...]
> So, here's a summary of what works with what:
>
> The historic bsddb module (bsddb185 in CVS now) works with any version
> of the Berkeley DB library as long as the 1.85 C API is enabled. If you
> use it with version 1.85 of the library you may experience data
> corruption problems because of bugs in the code and file structure (not
> the 1.85 API).
You can use it safely with later versions of the library
> as long as the 1.85 API was enabled during configuration.
>
> The current bsddb module (Pybsddb, bsddb3, bsddb in CVS) works with
> versions 3.x and 4.0.x of the Berkeley DB library.

Thanks for the clarification. To rephrase my question in these terms:

Are there any platforms on which, when you ask anydbm to create a database,
it uses version 1.85 of the underlying Berkeley DB library to do that? And
if there are such platforms, is upgrading the underlying Berkeley DB
library, either directly or by installing pybsddb (aka bsddb3), a pain for
a typical user of that platform?

I strongly believe that if no such platform exists, we should drop pickle
support in favour of using anydbm, and add a check that if the underlying
database library chosen by anydbm is the Berkeley DB library, it is version
2 or better. On Windows, people can meet this requirement by installing
pybsddb or Python 2.3.

-- 
Richie Hindle
richie@entrian.com

From Paul.Moore@atosorigin.com Mon Dec 2 14:14:46 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 2 Dec 2002 14:14:46 -0000
Subject: [Spambayes] The database question that would not die
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619962@UKDCX001.uk.int.atosorigin.com>

See dead horse, flog. Repeat as required :-) Sorry.

From: richie@entrian.com [mailto:richie@entrian.com]
> Are there any platforms on which, when you ask anydbm to create a database,
> it uses version 1.85 of the underlying Berkeley DB library to do that? And
> if there are such platforms, is upgrading the underlying Berkeley DB library,
> either directly or by installing pybsddb (aka bsddb3), a pain for a typical
> user of that platform?

1. Yes, Windows, with Python 2.2.
2. Yes. Not because installing pybsddb/bsddb3 is difficult, but because
   pybsddb/bsddb3 doesn't upgrade the library that anydbm uses, but instead
   installs a second, parallel copy, which is accessible under a different
   name (bsddb3).
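
Because the newer library only shows up under its own module name, code
that wants it has to ask for it by name before falling back to the
generic layer. A rough sketch of that lookup follows; the function name
and the candidate list are assumptions made for the example, not
anything in the spambayes source, and "dbm" is included only because
later Pythons renamed anydbm to that.

```python
import importlib


def pick_dbm_module(candidates=("bsddb3", "anydbm", "dbm")):
    """Return (name, module) for the first candidate that can be imported.

    Candidates are tried in order, so the separately installed bsddb3 is
    preferred whenever it is present.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %r could be imported" % (candidates,))


if __name__ == "__main__":
    name, module = pick_dbm_module()
    print("using", name)
```

The caller still has to know how to open a database with whichever
module comes back, since bsddb3 and the generic layer do not share an
identical interface.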

> I strongly believe that if no such platform exists, we should drop pickle
> support in favour of using anydbm, and add a check that if the underlying
> database library chosen by anydbm is the Berkeley DB library, it is version
> 2 or better. On Windows, people can meet this requirement by installing
> pybsddb or Python 2.3.

We have to code explicitly to use bsddb3 if that is present. If it is not,
we can fall back on anydbm (and complain loudly at Windows users). I do not
believe that bsddb (either the standard library one or bsddb3) offers any
way to check the version of the underlying Sleepycat code.

Paul.

From wsy@merl.com Mon Dec 2 14:44:10 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Mon, 2 Dec 2002 09:44:10 -0500
Subject: [Spambayes] CRM114 in November breaks 99.9%. :-)
References: <20021202040836.54151.qmail@mail.archub.org>
Message-ID: <200212021444.gB2EiA327329@localhost.localdomain>

Final test statistics for CRM114 for November are in:

Standard rules apply (no whitelists, no blacklists, realtime email stream
only (no "canned spam"), train only on errors, polynomial length 5)

 For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)

 Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
                  Accepts  Rejects  Emails
 1993   3914      4        0        5911    99.915        2

 Spam features in hash tables: 398K
 Nonspam features in hash tables: 299K

There was just 1 spam that got through in the last week of November-
a very strange spam written in mixed English and Czech trying to sell
me diesel engine parts. It came through on a moto-head email list,
which I suppose might be slightly topical, and it certainly was amusing,
rather reminiscent of the Monty Python "camshaft smuggling" skit,
but it's still spam and counts as such.

This gives an N+1 accuracy of > 99.9% for the entire month of November.
(99.932% for N-accuracy).

So, CRM114 barely squeaked through the month at >99.9%. Barely.
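
The message does not spell out how the two accuracy figures are derived.
One reading that reproduces both of them, assuming "N+1" means charging
the filter with one extra hypothetical error on top of the observed
ones, is the following sketch. The function and variable names are made
up for the example.

```python
def accuracy_percent(errors, total, extra_errors=0):
    """Percentage of messages handled correctly, optionally counting
    some hypothetical additional errors."""
    return 100.0 * (1.0 - float(errors + extra_errors) / total)


# Figures from the table above.
false_accepts = 4
false_rejects = 0
total_emails = 5911

errors = false_accepts + false_rejects
n_accuracy = accuracy_percent(errors, total_emails)
n_plus_1_accuracy = accuracy_percent(errors, total_emails, extra_errors=1)

print("N accuracy:   %.3f" % n_accuracy)         # 99.932
print("N+1 accuracy: %.3f" % n_plus_1_accuracy)  # 99.915
```

Under that reading a single further miss would still have left the
month above 99.9%, which fits the "barely" above.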
There's clearly still work to be done (the spambayes mailing list is
kicking around the proper way to evaluate probabilities; I'm looking into
some of their ideas as well.)



--- On The Other Hand (the bad news)---

December is looking much worse - TWO have gotten through already over
the weekend (one "barnyard teen" pornspam- it hasn't seen that before)
and one very short mortgage solicitation, written folksy-style.

I'm also getting mailer errors now out of Sendmail whenever I do
a "learn"; I'm starting to think that our systems people have
upgraded something and broken something else in the process. This
throws some question onto whether the CRM114 training code is actually
getting run at all, or whether the increasing spam rate is
symptomatic of the evolution of spam against static filters.

 -Bill Yerazunis

From wsy@merl.com Mon Dec 2 14:51:33 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Mon, 2 Dec 2002 09:51:33 -0500
Subject: [Spambayes] CRM114 in November breaks 99.9%. :-)
References: <20021202040836.54151.qmail@mail.archub.org>
Message-ID: <200212021451.gB2EpXq27342@localhost.localdomain>

Ooops, messed up the spreadsheet... corrected statistics below:

Even-More-Final test statistics for CRM114 for November are in:

Standard rules apply (no whitelists, no blacklists, realtime email stream
only (no "canned spam"), train only on errors, polynomial length 5)

 For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)

 Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
                  Accepts  Rejects  Emails
 1931   3914      4        0        5849    99.914        2

 Spam features in hash tables: 398K
 Nonspam features in hash tables: 299K

There was just 1 spam that got through in the last week of November-
a very strange spam written in mixed English and Czech trying to sell
me diesel engine parts.
It came through on a moto-head email list,
which I suppose might be slightly topical, and it certainly was amusing,
rather reminiscent of the Monty Python "camshaft smuggling" skit,
but it's still spam and counts as such.

This gives an N+1 accuracy of > 99.9% for the entire month of November.
(99.932% for N-accuracy).

So, CRM114 barely squeaked through the month at >99.9%. Barely. There's
clearly still work to be done (the spambayes mailing list is kicking
around the proper way to evaluate probabilities; I'm looking into some
of their ideas as well.)



--- On The Other Hand (the bad news)---

December is looking much worse - TWO have gotten through already over
the weekend (one "barnyard teen" pornspam- it hasn't seen that before)
and one very short mortgage solicitation, written folksy-style.

I'm also getting mailer errors now out of Sendmail whenever I do
a "learn"; I'm starting to think that our systems people have
upgraded something and broken something else in the process. This
throws some question onto whether the CRM114 training code is actually
getting run at all, or whether the increasing spam rate is
symptomatic of the evolution of spam against static filters.

 -Bill Yerazunis

From bkc@murkworks.com Mon Dec 2 14:58:20 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 02 Dec 2002 09:58:20 -0500
Subject: [Spambayes] The database question that would not die
In-Reply-To:
Message-ID: <3DEB2D6C.31813.9E8625E@localhost>

On 1 Dec 2002 at 23:49, Richie Hindle wrote:

>         Training 1000  Database size  Classifying 500  Database load
> Pickle  65 seconds     999,540        35 seconds       4 seconds
> bsddb3  82 seconds     1,318,912      43 seconds       (negligible)

How many tokens are stored in the pickle / bsddb3 in this example?

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements

From tim@fourstonesExpressions.com Mon Dec 2 15:09:34 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 02 Dec 2002 09:09:34 -0600
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6
In-Reply-To:
Message-ID:

12/2/2002 6:22:51 AM, Richie Hindle wrote:

>
>[Richie]
>> so the on-demand-ness should come for free for all Corpus-using code.
>
>[Mark]
>> How much Corpus-using code is there? Are there any plans to move any
>> existing code that does not use it towards using it? I've raised this with
>> Tim S for Outlook, and it doesn't appear we will - I have no idea about the
>> other apps though.
>
>Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it,
>but doesn't seem to use it (?)
>
>I'd like to see more of the existing code using it, but then again I'm not
>in a hurry to implement the idea myself... In an ideal (meaning
>"engineering purity") world, we'd have abstract Corpus and Message
>interfaces, and all the applications would code to those interfaces
>regardless of the concrete classes implementing them. Then any application
>would work with messages stored in any format - hammie could classify your
>Outlook messages from the command line, the Outlook plug-in could train on
>messages in mbox files, and so on. In the real world, that kind of thing
>usually turns out either to be YAGNI or so hard as to be unreasonable.
>
>Where we end up will probably be somewhere in between. I was able to
>scratch an itch using Corpus - it was exactly what I needed for the web
>training interface (partly because Tim and I discussed the design of Corpus
>with that in mind). If other people find they can scratch itches with it,
>its usage will grow, otherwise it won't.
Migrating already-working code to
>use a new library for reasons of engineering purity isn't an itch that many
>people suffer from.

Well, I'm a bit of an engineering purist, and I think that there's benefit
to having a single abstract interface for message storage. Right now, we
have mbox stuff, corpus stuff, outlook stuff. Mark has indicated that he's
not interested in the abstraction for the outlook stuff, and that's fine.
But I think the mbox/msg stuff should disappear. They don't do anything
that corpus doesn't do at the moment, and it's gonna get confusing down the
road for someone who becomes interested in our code. Not to mention that
our code reflects on us... Let's take the plunge and make the Corpus stuff
the 'standard', and where it doesn't support a current requirement, let's
fix it.

- TimS

>
>I have a *much* bigger problem with Corpus, which is that I find the word
>'Corpus' impossible to type. Is it just me?
>
>> In the back of my mind, I am pondering if we need a better directory
>> structure - maybe with the core engine in a package, and some of these
>> "wrappers" used only by a few application also into their own?
>
>Isn't this also YAGNI? We have a few tens of Python files in the project -
>do we really need to split it up? And if we do, should we be doing it with
>the code this young?
>
>--
>Richie Hindle
>richie@entrian.com
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From bkc@murkworks.com Mon Dec 2 15:23:46 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 02 Dec 2002 10:23:46 -0500
Subject: [Spambayes] The database question that would not die
In-Reply-To:
References: <15851.23096.388509.925822@montanaro.dyndns.org>
Message-ID: <3DEB3361.19290.9FFA921@localhost>

On 2 Dec 2002 at 13:47, richie@entrian.com wrote:

> I strongly believe that if no such platform exists, we should drop pickle
> support in favour of using anydbm, and add a check that if the underlying
> database library chosen by anydbm is the Berkeley DB library, it is version
> 2 or better. On Windows, people can meet this requirement by installing
> pybsddb or Python 2.3.
>

Sorry I haven't been keeping up with this issue. I have my own "database
format" that I want to use for classifier storage..

Has the classifier interface to "storage" been abstracted yet? I thought
that's where things were headed. But I haven't had a chance to cvs update
lately.

Can I "drop-in" my own "database instance" into hammie or the Outlook
plugin in a transparent way?

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements

From msergeant@startechgroup.co.uk Mon Dec 2 15:22:23 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 02 Dec 2002 15:22:23 +0000
Subject: [Spambayes] CRM114 in November breaks 99.9%.
:-)
In-Reply-To: <200212021444.gB2EiA327329@localhost.localdomain>
References: <20021202040836.54151.qmail@mail.archub.org> <200212021444.gB2EiA327329@localhost.localdomain>
Message-ID: <3DEB7AAF.4080206@startechgroup.co.uk>

Bill Yerazunis said the following on 02/12/02 14:44:
> Final test statistics for CRM114 for November are in:
>
> Standard rules apply (no whitelists, no blacklists, realtime email stream
> only (no "canned spam"), train only on errors, polynomial length 5)
>
> For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)
>
> Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
>                  Accepts  Rejects  Emails
> 1993   3914      4        0        5911    99.915        2
>
> Spam features in hash tables: 398K
> Nonspam features in hash tables: 299K

CRM114's learn and classify stuff looks really interesting, but it has a
really freaky syntax to someone who is used to regular procedural or OO
languages like Perl, Python, C, etc. Is there *any* chance the library
in crm114 for learning and classifying can be extracted into a plain
.so? That would be tremendous, and I'd be willing to build a perl XS
library for it in a heartbeat.

If not, we'll just have to try and copy the sparse binary polynomial
hash idea ;-)

From francois.granger@free.fr Mon Dec 2 15:50:26 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 02 Dec 2002 16:50:26 +0100
Subject: [Spambayes] The database question that would not die
In-Reply-To:
Message-ID:

on 2/12/02 13:27, Richie Hindle at richie@entrian.com wrote:

> I wasn't going to provide an upgrade path, if that's what you meant...
> pickle users will need to retrain, but that happens on a regular basis
> anyway, as we change the pickle version. Sooner or later we'll need to
> worry about upgrading existing databases, but it's too early for that now.

Just a reminder ;-) Being a tech support guy, I always think about
compatibility, upgrade, documentation...
That side of product development ;-)

> While you're here, François, would the switch to anydbm have a big effect
> on your MacOS 9 platform? I don't know what database types are supported
> there.

No issue, we have gdbm on Mac, which gets selected. I don't know how robust
it is, but I can ask on the MacPython sig if needed. I tested it once or
twice with pop3proxy and got no issue.

-- 
Mail is a means of communication. People should ask themselves about the
political implications of the choices (or non-choices) of their tools and
technologies. For clean mail: --

From wsy@merl.com Mon Dec 2 15:57:52 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Mon, 2 Dec 2002 10:57:52 -0500
Subject: [Spambayes] CRM114 in November breaks 99.9%. :-)
In-Reply-To: <3DEB7AAF.4080206@startechgroup.co.uk> (message from Matt Sergeant on Mon, 02 Dec 2002 15:22:23 +0000)
References: <20021202040836.54151.qmail@mail.archub.org> <200212021444.gB2EiA327329@localhost.localdomain> <3DEB7AAF.4080206@startechgroup.co.uk>
Message-ID: <200212021557.gB2FvqC28251@localhost.localdomain>

   From: Matt Sergeant

   CRM114's learn and classify stuff looks really interesting, but it has a
   really freaky syntax to someone who is used to regular procedural or OO
   languages like Perl, Python, C, etc.

It _is_ procedural, it's just extremely high level. Perhaps higher-level
than APL if you count statements rather than operators.

And sorry about the syntax. I was being playful, and reading a book on
Latin at the time, which is why it uses symmetric declensional parsing
rather than something more sane, like recursive descent. (*)

   Is there *any* chance the library
   in crm114 for learning and classifying can be extracted into a plain
   .so? That would be tremendous, and I'd be willing to build a perl XS
   library for it in a heartbeat.

Yes, it's not difficult to get at the code.

Pop the .gz open, emacs the file crm114.c, and look for the case
headers "CRM_LEARN" and "CRM_CLASSIFY" respectively.
The code there
is _not_ generated, but executed in-line, so cut and paste will work.

The current code requires a null-terminated string as input, but
that's because of the GNU regex library limits (when TRE gives me a
new library, that requirement will go away). You _will_ need to link
it against a regex library (of your choice, CRM114 uses the standard
ANSI regcomp/regexec calling sequence), and the OS itself needs to
support stat() [for file existence/length] and mmap() [to map a file
into virtual memory without actually reading it in a byte at a time-
this is just for efficiency and can be worked around].

How bad do you want it? :-)

   If not, we'll just have to try and copy the sparse binary polynomial
   hash idea ;-)

Always legitimate. It's GPLware, no problemo.

 -Bill Yerazunis

(*) all in all, I like the way it ended up; one can just type programs
on the command line and they do useful things. But hindsight is always
20/20, and "less weirdass" might be better in the long run.

From kanderson@bbn.com Mon Dec 2 16:04:30 2002
From: kanderson@bbn.com (Ken Anderson)
Date: Mon, 02 Dec 2002 11:04:30 -0500
Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-)
In-Reply-To: <200212021444.gB2EiA327329@localhost.localdomain>
References: <20021202040836.54151.qmail@mail.archub.org>
Message-ID: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com>

The "train only on errors" bothers me. Can you say what you use for a
training set and what you use for a test set?

At 09:44 AM 12/2/2002, Bill Yerazunis wrote:
>Final test statistics for CRM114 for November are in:
>
>Standard rules apply (no whitelists, no blacklists, realtime email stream
>only (no "canned spam"), train only on errors, polynomial length 5)
>
> For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)
>
> Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
>                  Accepts  Rejects  Emails
> 1993   3914      4        0        5911    99.915        2
>
> Spam features in hash tables: 398K
> Nonspam features in hash tables: 299K
>
>There was just 1 spam that got through in the last week of November-
>a very strange spam written in mixed English and Czech trying to sell
>me diesel engine parts. It came through on a moto-head email list,
>which I suppose might be slightly topical, and it certainly was amusing,
>rather reminiscent of the Monty Python "camshaft smuggling" skit,
>but it's still spam and counts as such.
>
>This gives an N+1 accuracy of > 99.9% for the entire month of November.
>(99.932% for N-accuracy).
>
>So, CRM114 barely squeaked through the month at >99.9%. Barely. There's
>clearly still work to be done (the spambayes mailing list is kicking
>around the proper way to evaluate probabilities; I'm looking into some
>of their ideas as well.)
>
>
>
>--- On The Other Hand (the bad news)---
>
>December is looking much worse - TWO have gotten through already over
>the weekend (one "barnyard teen" pornspam- it hasn't seen that before)
>and one very short mortgage solicitation, written folksy-style.
>
>I'm also getting mailer errors now out of Sendmail whenever I do
>a "learn"; I'm starting to think that our systems people have
>upgraded something and broken something else in the process. This
>throws some question onto whether the CRM114 training code is actually
>getting run at all, or whether the increasing spam rate is
>symptomatic of the evolution of spam against static filters.
>
> -Bill Yerazunis

From richie@entrian.com Mon Dec 2 16:12:30 2002
From: richie@entrian.com (richie@entrian.com)
Date: Mon, 02 Dec 2002 16:12:30 +0000
Subject: [Spambayes] The database question that would not die
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619962@UKDCX001.uk.int.atosorigin.com>
Message-ID:

[Paul]
> See dead horse, flog. Repeat as required :-) Sorry.

Tell me about it. This is proving really difficult. Am I the only one
who thinks that having two incompatible database formats sucks? Especially
when they're each the default for different pieces of the same software, so
you can't use those pieces together without reconfiguring things.

> 1. Yes, Windows, with Python 2.2.
> 2. Yes. [reasons snipped]

I know, and I believe I've already dealt with Windows. Please see
http://mail.python.org/pipermail/spambayes/2002-December/002385.html

> I do not believe that bsddb (neither the standard library one, nor
> bsddb3) offers any way to check the version of the underlying Sleepycat
> code.

OK, fine. I agree with whoever said that we document the fact that we
require 2 or better, provide a link to pybsddb for Windows, and let users
of other platforms worry about it themselves - other platforms have
allegedly had Berkeley DB 2 or better for ages, which brings me back to the
dead horse question: are there platforms other than Windows where using
anydbm instead of pickle will cause problems? Windows we've dealt with,
Unix has a recent Berkeley DB, the Mac has gdbm (thanks François!), are
there any others? (Are there even any other platforms that we need to
consider?) If not, let's ditch pickles before we get publicity from the
Linux Journal articles and the Spam conference.

[Brad]
> Has the classifier interface to "storage" been abstracted yet? I thought
> that's where things were headed. But I haven't had a chance to cvs update
> lately. Can I "drop-in" my own "database instance"

Yes, all that has been done.
The main project has a pickle interface and
a DBM interface, and I'm proposing we ditch the pickle interface because it
no longer has any advantages. Adding another interface should be easy.

-- 
Richie Hindle
richie@entrian.com

From msergeant@startechgroup.co.uk Mon Dec 2 16:21:10 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 02 Dec 2002 16:21:10 +0000
Subject: [Spambayes] CRM114 in November breaks 99.9%. :-)
In-Reply-To: <200212021557.gB2FvqC28251@localhost.localdomain>
References: <20021202040836.54151.qmail@mail.archub.org> <200212021444.gB2EiA327329@localhost.localdomain> <3DEB7AAF.4080206@startechgroup.co.uk> <200212021557.gB2FvqC28251@localhost.localdomain>
Message-ID: <3DEB8876.8070408@startechgroup.co.uk>

Bill Yerazunis said the following on 02/12/02 15:57:
>    From: Matt Sergeant
>
>    CRM114's learn and classify stuff looks really interesting, but it has a
>    really freaky syntax to someone who is used to regular procedural or OO
>    languages like Perl, Python, C, etc.
>
> It _is_ procedural, it's just extremely high level. Perhaps higher-level
> than APL if you count statements rather than operators.

Sorry, I meant "procedural like Perl/Python/C" not "procedural, like
Perl/Python/C". Actually maybe python shouldn't be in that list since it
has a weirdass syntax too :-)

>    Is there *any* chance the library
>    in crm114 for learning and classifying can be extracted into a plain
>    .so? That would be tremendous, and I'd be willing to build a perl XS
>    library for it in a heartbeat.
>
> Yes, it's not difficult to get at the code.
>
> Pop the .gz open, emacs the file crm114.c, and look for the case
> headers "CRM_LEARN" and "CRM_CLASSIFY" respectively. The code there
> is _not_ generated, but executed in-line, so cut and paste will work.
>
> The current code requires a null-terminated string as input, but
> that's because of the GNU regex library limits (when TRE gives me a
> new library, that requirement will go away).
You _will_ need to link
> it against a regex library (of your choice, CRM114 uses the standard
> ANSI regcomp/regexec calling sequence), and the OS itself needs to
> support stat() [for file existence/length] and mmap() [to map a file
> into virtual memory without actually reading it in a byte at a time-
> this is just for efficiency and can be worked around].

I was thinking of punting on splitting the email to tokens back to the host
language. Since perl and python both support POSIX regexps (and thus
[[:graph:]]) it's probably easier that way. Unless there's an inherent
reason it has to be embedded in the library.

> How bad do you want it? :-)

What interests me is the hashing technique. It should be reasonably easy to
extract that, but for me it's just a lack of tuits - it's hard enough
keeping up with my regular day to day activities, and my todo list never
gets shorter.

> (*) all in all, I like the way it ended up; one can just type programs
> on the command line and they do useful things. But hindsight is always
> 20/20, and "less wierdass" might be better in the long run.

I imagine you'd get a few more users with a regular syntax ;-)

Matt.

From popiel at wolfskeep.com Mon Dec 2 17:51:50 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon, 02 Dec 2002 09:51:50 -0800
Subject: [Spambayes] The database question that would not die
In-Reply-To: Message from richie@entrian.com
References:
Message-ID: <20021202175150.22F5C2DEB1@cashew.wolfskeep.com>

In message: richie@entrian.com writes:
>[Paul]
>> See dead horse, flog. Repeat as required :-) Sorry.
>
>Tell me about it. This is proving really difficult. Am I the only one
>who thinks that having two incompatible database formats sucks?

No, you're not the only one. I'd be chiming in if I actually had any
time to deal with it. Unfortunately, my time recently has been sucked
into a different black hole (finally got nightly backups working
properly again).

I'm getting ready to switch from my own home-brew Graham implementation
to hammiefilter for my real live incoming feed. (Yes, I've been testing
spambayes but not really using it up to this point.) As I make that
transition, I'll become quite interested in what database format is
used... I'll also make my procmailrc and support scripts available.

- Alex

From glouis at dynamicro.on.ca Mon Dec 2 18:40:21 2002
From: glouis at dynamicro.on.ca (Greg Louis)
Date: Mon, 2 Dec 2002 13:40:21 -0500
Subject: [Spambayes] train on error - to exhaustion?
Message-ID: <20021202184021.GA6315@athame.dynamicro.on.ca>

Training on error means "classify messages from the training corpus in
random order; if the classifier errs or is uncertain, submit that
message (once?) for training." Has anyone tried either of:

1) when the classifier errs or is uncertain, train on that message
   until the classifier gets it right, or

2) train once on each error, but then repeat the whole training process
   until all messages are classified correctly?

I'd think the latter might be beneficial, but haven't tried it yet
myself.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |

From wsy at merl.com Mon Dec 2 19:43:18 2002
From: wsy at merl.com (Bill Yerazunis)
Date: Mon, 2 Dec 2002 14:43:18 -0500
Subject: [Spambayes] train on error - to exhaustion?
In-Reply-To: <20021202184021.GA6315@athame.dynamicro.on.ca> (message from Greg Louis on Mon, 2 Dec 2002 13:40:21 -0500)
References: <20021202184021.GA6315@athame.dynamicro.on.ca>
Message-ID: <200212021943.gB2JhIl29523@localhost.localdomain>

   From: Greg Louis

   Training on error means "classify messages from the training corpus in
   random order; if the classifier errs or is uncertain, submit that
   message (once?) for training." Has anyone tried either of:

   1) when the classifier errs or is uncertain, train on that message
      until the classifier gets it right,

I've looked into that on CRM114; the circumstance never happens.
I typically submit the erroneous message three times in rapid succession: - once to get a "before training" value to confirm the misclassify; - once with "train this message as" turned on (*) - and once again to get an "after training" result and verify the learn. It's never misclassified any message ever on the "after training" verification, so I don't know if it would change anything or not to re-train again and again until it gets the classification correct. 2) train once on each error, but then repeat the whole training process until all messages are classified correctly? I'd think the latter might be beneficial, but haven't tried it yet myself. Hmmm... that would be a good way to do regression checking to verify that every message that is classified correctly once is classified correctly forevermore. -Bill Y. (*) this is the step that seems to be running "LEARNing", but for some reason sendmail is getting upset at me and returning an error message _as well as_ the confirmation message. Bizarre. I'm working on it. From neale at woozle.org Mon Dec 2 20:19:52 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 12:19:52 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: <3DEB3361.19290.9FFA921@localhost> References: <15851.23096.388509.925822@montanaro.dyndns.org> <3DEB3361.19290.9FFA921@localhost> Message-ID: So then, "Brad Clements" is all like: > Has the classifier interface to "storage" been abstracted yet? I > thought that's where things were headed. But I haven't had a chance to > cvs update lately. > > Can I "drop-in" my own "database instance" into hammie or the Outlook > plugin in a transparent way? If we do end up canning the pickle, I guess we could support this sort of thing by making everything instantiate a storage.PersistentClassifier = storage.DBDictClassifier. Then folks like Brad could write their own class and set storage.PersistentClassifier equal to that. 
Unless your "database instance" is something the rest of us would be interested in? You've piqued my interest, Brad, now you gotta tell us what you're up to ;) Neale From neale at woozle.org Mon Dec 2 20:27:11 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 12:27:11 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: References: Message-ID: So then, Richie Hindle is all like: > Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it, > but doesn't seem to use it (?) > > I'd like to see more of the existing code using it, but then again I'm not > in a hurry to implement the idea myself... I have to confess that I haven't even looked at Corpus.py yet. hammiebulk imports it because it needed it for some verbose variable at one point. But I'm going to read up before I take it out, maybe there's something there I can use :) Neale From bkc at murkworks.com Mon Dec 2 20:40:20 2002 From: bkc at murkworks.com (Brad Clements) Date: Mon, 02 Dec 2002 15:40:20 -0500 Subject: [Spambayes] The database question that would not die In-Reply-To: References: <3DEB3361.19290.9FFA921@localhost> Message-ID: <3DEB7D92.26160.B217D9F@localhost> On 2 Dec 2002 at 12:19, Neale Pickett wrote: > Unless your "database instance" is something the rest of us would be > interested in? You've piqued my interest, Brad, now you gotta tell us what > you're up to ;) Just playing around with compressing the token list. So, I take 315680 tokens from my training database, stored as a pickle is 4,597,551 bytes, but I can get it down to 2,105,046 bytes with almost no decompression overhead. But .. What I really want to do is replicate the pickle/db speed trials so I can do some real testing, both on linux and windows, memory mapped or not. I think the "database interface" should be abstract, regardless of what I do. 
Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From richie at entrian.com Mon Dec 2 21:47:08 2002 From: richie at entrian.com (Richie Hindle) Date: Mon, 02 Dec 2002 21:47:08 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: <3DEB2D6C.31813.9E8625E@localhost> References: <3DEB2D6C.31813.9E8625E@localhost> Message-ID: [Richie] > Training 1000 Database size Classifying 500 Database load > Pickle 65 seconds 999,540 35 seconds 4 seconds > bsddb3 82 seconds 1,318,912 43 seconds (negligible) [Brad] > How many tokens are stored in the pickle / bsddb3 in this example? 31846 -- Richie Hindle richie@entrian.com From neale at woozle.org Mon Dec 2 22:06:43 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 14:06:43 -0800 Subject: [Spambayes] OT: hotels near subway in Boston? Message-ID: So I'm booking travel for this spam conference next month, and I'm learning that I probably don't want to rent a car in Boston. The country being very automobile-happy, though, I can't find any hotels that advertise proximity to the subway. Are there any Boston-area residents on the list who can recommend a place to stay that's near the subway? I'm a tourist so I can't handle a lot of bus transfers. Thanks Neale From richie at entrian.com Mon Dec 2 22:08:42 2002 From: richie at entrian.com (Richie Hindle) Date: Mon, 02 Dec 2002 22:08:42 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: References: <3DEB2D6C.31813.9E8625E@localhost> Message-ID: [Richie] > Training 1000 Database size Classifying 500 Database load > Pickle 65 seconds 999,540 35 seconds 4 seconds > bsddb3 82 seconds 1,318,912 43 seconds (negligible) [Brad] > How many tokens are stored in the pickle / bsddb3 in this example? [Richie] > 31846 Sorry, brain trouble. The real answer is 30236. 
-- Richie Hindle richie@entrian.com From bkc at murkworks.com Mon Dec 2 22:27:21 2002 From: bkc at murkworks.com (Brad Clements) Date: Mon, 02 Dec 2002 17:27:21 -0500 Subject: [Spambayes] wordinfoget Message-ID: <3DEB96A7.27517.B837981@localhost> My storage method is most efficient when given a pre-sorted list of words, so, in _getclues, I would want wordstream to be sorted first. I guess I'll have to override _getclues, add_msg and friends in my subclass ;-) Which .py file in CVS generates the comparative time test for db and pickle training/classifying? If its not in .cvs, could someone email it to me? Thanks Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From richie at entrian.com Mon Dec 2 22:29:53 2002 From: richie at entrian.com (Richie Hindle) Date: Mon, 02 Dec 2002 22:29:53 +0000 Subject: [Spambayes] wordinfoget In-Reply-To: <3DEB96A7.27517.B837981@localhost> References: <3DEB96A7.27517.B837981@localhost> Message-ID: [Brad] > Which .py file in CVS generates the comparative time test for db and pickle > training/classifying? I don't know whether such a thing exists - I produced my results the old-fashioned way, with a command prompt and a watch. 8-) -- Richie Hindle richie@entrian.com From bkc at murkworks.com Mon Dec 2 22:36:38 2002 From: bkc at murkworks.com (Brad Clements) Date: Mon, 02 Dec 2002 17:36:38 -0500 Subject: [Spambayes] wordinfoget In-Reply-To: References: <3DEB96A7.27517.B837981@localhost> Message-ID: <3DEB98D4.7778.B8BF7D7@localhost> oh, ok. which test modules did you time? On 2 Dec 2002 at 22:29, Richie Hindle wrote: > > [Brad] > > Which .py file in CVS generates the comparative time test for db and > > pickle training/classifying? > > I don't know whether such a thing exists - I produced my results the > old-fashioned way, with a command prompt and a watch. 
8-) > > -- > Richie Hindle > richie@entrian.com > Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From richie at entrian.com Mon Dec 2 22:47:28 2002 From: richie at entrian.com (Richie Hindle) Date: Mon, 02 Dec 2002 22:47:28 +0000 Subject: [Spambayes] wordinfoget In-Reply-To: <3DEB98D4.7778.B8BF7D7@localhost> References: <3DEB96A7.27517.B837981@localhost> <3DEB98D4.7778.B8BF7D7@localhost> Message-ID: [Brad] > which test modules did you time? For training, I ran: hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -d -p temp.bsddb3 hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -D -p temp.pickle For classifying, I ran: hammiebulk.py -u 500-hams.mbox -d -p richie-500.bsddb3 hammiebulk.py -u 500-hams.mbox -D -p richie-500.pickle (because I didn't have an mbox of 500 random ham/spam messages to hand). In each of the four cases I ran the command twice and timed the second one. I'm using a hacked version of the software that uses bsddb3 - if you need my patches, let me know. -- Richie Hindle richie@entrian.com From trebor at animeigo.com Mon Dec 2 22:35:36 2002 From: trebor at animeigo.com (Robert Woodhead) Date: Mon, 2 Dec 2002 17:35:36 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> References: <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> Message-ID: At 11:04 AM -0500 12/2/02, Ken Anderson wrote: >The "train only on errors" bothers me. Can you say what you use for >a training set and what you use for a test set? Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only. 
R -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. From neale at woozle.org Mon Dec 2 22:56:50 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 14:56:50 -0800 Subject: [Spambayes] wordinfoget In-Reply-To: References: <3DEB96A7.27517.B837981@localhost> <3DEB98D4.7778.B8BF7D7@localhost> Message-ID: So then, Richie Hindle is all like: > [Brad] > > which test modules did you time? > > For training, I ran: > > hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -d -p temp.bsddb3 > hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -D -p temp.pickle > > For classifying, I ran: > > hammiebulk.py -u 500-hams.mbox -d -p richie-500.bsddb3 > hammiebulk.py -u 500-hams.mbox -D -p richie-500.pickle That is what I did, too. Unix has a "time" command you can put in front of a command line, which will tell you all sorts of neat statistics. I did five runs of each (pickle and non) and averaged the times by hand. Neale From tim.one at comcast.net Mon Dec 2 23:18:33 2002 From: tim.one at comcast.net (Tim Peters) Date: Mon, 02 Dec 2002 18:18:33 -0500 Subject: [Spambayes] OT: hotels near subway in Boston? In-Reply-To: Message-ID: [Neale Pickett] > So I'm booking travel for this spam conference next month, Where is the conference located? I'm guessing Cambridge. > and I'm learning that I probably don't want to rent a car in Boston. Not unless you're traveling to a "far" suburb (like Burlington). Boston proper is a very small city, it's a maze of unmarked one-way streets, and there's very little parking space (in Boston or Cambridge). > The country being very automobile-happy, though, I can't find any > hotels that advertise proximity to the subway. googling on hotel boston subway finds a bunch. 
> Are there any Boston-area residents on the list who can recommend a > place to stay that's near the subway? Anywhere in Boston proper is close to the T (what locals call the subway) ... hmm, I see the conference is at the MIT Media Lab, and that http://www.media.mit.edu/contact/hotels.html lists a couple dozen convenient hotels. For whatever reason, they don't mention the T there either! MIT is at the Kendall Square stop on the Red Line. Any hotel in the MIT or Harvard area (two T stops away from MIT) would work fine, and the airport is easy to get to and from via T (take shuttle bus 33 from the terminal to the T station -- that's free). > I'm a tourist so I can't handle a lot of bus transfers. Heh . Instead you can handle a lot of T transfers: Subway directions from Logan International Airport: Take the free airport shuttle to the subway "T" station. Take the Blue Line to Government Center stop where you will switch to the Green Line. Take the Green Line to Park Street stop where you will switch to the Red Line. Take the Red Line to Kendall Square stop. It's easier than it sounds. At least it was the sixth time I did it when I lived there <0.9 wink>. From wsy at merl.com Tue Dec 3 02:30:46 2002 From: wsy at merl.com (Bill Yerazunis) Date: Mon, 2 Dec 2002 21:30:46 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: (message from Robert Woodhead on Mon, 2 Dec 2002 17:35:36 -0500) References: <20021202040836.54151.qmail@mail.archub.org> Message-ID: <200212030230.gB32UkR30864@localhost.localdomain> X-Sender: trebor@mail.animeigo.com Date: Mon, 2 Dec 2002 17:35:36 -0500 From: Robert Woodhead Cc: spamfilt@archub.org, spambayes@python.org X-Spam-Status: No, hits=-14.9 required=7.0 tests=IN_REP_TO,REFERENCES,SIGNATURE_SHORT_DENSE, SPAM_PHRASE_01_02,SUBJECT_MONTH,SUBJECT_MONTH_2 version=2.41 X-Spam-Level: At 11:04 AM -0500 12/2/02, Ken Anderson wrote: >The "train only on errors" bothers me. 
Can you say what you use for >a training set and what you use for a test set? Training a particular incarnation of CRM114 usually takes a week or two; I read my mail (both categories) and when I find a piece of mail misclassified, I train that one piece into the filter. After a couple of days the errors get very sparse; after two or three weeks, I "go for data" and that's what gets reported in the monthlies. The current spam.css files are pretty much based on the live spam errors in the first week of October; since only four spam came through in all of November and only two were worth training on (the Czech Diesel Parts spam was just too funny to train out), the .css files are pretty much unchanged. Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only. I did put in that capability as a flag called "refute". You can say learn < refute > ( spamfile.css ) /[[:graph:]]/ to unlearn something as nonspam, and then you can relearn it in the proper category, but except for testing code paths, I've never actually used it. On the other hand, there's an old difficulty in AI that one of my teachers called "the Kalman Belly Gaze". If you let a filter (of any type, he was teaching Kalman filters at the time but it applies to any trained filter) learn on it's own output stream, it quickly reinforces it's own behavior to the exclusion of all else (i.e. it goes off and gazes at it's own navel, simply ignoring the reality of the world around it). The reason I haven't auto-trained is due to my lack of understanding on what the limiting amount of self-teaching one can allow that doesn't go off into belly gaze. 
-Bill Yerazunis From kanderson at bbn.com Tue Dec 3 02:00:40 2002 From: kanderson at bbn.com (Ken Anderson) Date: Mon, 02 Dec 2002 21:00:40 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: References: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> Message-ID: <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> Yes, this is my concern. I think the approach Robert describes is perfectly find for adaptively learning how to filter email, though there should probably be some for of forgetting, though the system will eventually forget on its own as words occur less often. However, if this is the approach Bill uses, you can't use to for performance estimates. Our speech and natural language group is very careful not to mix its training set with its test set. When they do, they do something like 10 fold cross validation which averages (?) the results of 10 experiments that take some random fraction of the data as training and the rest as testing. This gives a lower performance score that is likely to be more accurate on real data. If your getting 3 9's be sure you're getting them the hard way. k At 05:35 PM 12/2/2002, Robert Woodhead wrote: >At 11:04 AM -0500 12/2/02, Ken Anderson wrote: >>The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set? > >Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only. > >R > >-- >=========================================================== >Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ >=========================================================== >http://selfpromotion.com/ The Net's only URL registration >SHARESERVICE. A power tool for power webmasters. 
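Ken's point about never scoring on mail the filter was trained on can be sketched in a few lines of Python. This is only an illustration of the usual k-fold form of cross validation (every message is held out exactly once), not what the BBN group or CRM114 actually runs; k_fold_indices, cross_validate and the train_and_score callback are hypothetical names made up for the sketch.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Split range(n_items) into k disjoint folds after a seeded shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives k folds of nearly equal size.
    return [indices[i::k] for i in range(k)]


def cross_validate(items, train_and_score, k=10, seed=0):
    """Average the score of k runs; each run tests on a fold it never trained on.

    train_and_score(train_items, test_items) is a hypothetical callback that
    trains a fresh classifier on train_items and returns its accuracy on
    test_items.
    """
    scores = []
    for held_out in k_fold_indices(len(items), k, seed):
        held = set(held_out)
        train = [items[i] for i in range(len(items)) if i not in held]
        test = [items[i] for i in held_out]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```

The averaged figure is normally lower than a score measured on the training mail itself, which is the reason for doing it this way.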
From tim at fourstonesExpressions.com Tue Dec 3 03:01:06 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 02 Dec 2002 21:01:06 -0600 Subject: [Spambayes] Corpus module (was: Upgrade problem) In-Reply-To: Message-ID: <9595B96183MJYYW5ZXVF0VP4XMI0IC.3dec1e72@riven> Ok, so I found the message, and here are my thoughts. I freely admit that the abstraction was done completely from a single concrete example, that being the pop3proxy. It seems that the competing interests here can be successfully resolved by further abstraction. The 'Corpus' that Mark describes is essentially an iterator, which doesn't work well for the pop3proxy, but works well for the outlook plugin. I've spent some time looking at the Hammie/Hammiebulk/mboxutils stuff, along with the rfc822/Mailbox/email.* stuff over the last week, and I think that we (I) have managed to somewhat reinvent the wheel. It sounded like a good idea to me and Tim1 at the time... I certainly don't view Corpora as being particularly static. I view any collection of messages that are somehow related as a Corpus. Perhaps a better (more portable) term would have been Folder. Beats me. At any rate, I don't think anybody is locked in to the classes as they exist right now. Neale and Richie have added/removed stuff they need/don't need from them. I *would* like to see a single abstraction that works for the whole project. Should we start over? I'm ok with that... - TimS 11/7/2002 10:48:54 PM, "Mark Hammond" wrote: >> Laughing and pointing should be directed towards me rather than Tim. > >None of that, but some thoughts . > >I think that the classes I posted a while ago suffer from the exact reverse >problem as your idea. My idea was to make a "message store" that is largely >independent of training. I believe the problem with your design is that it >deals with the training at the expense of the message store. > >Obviously, but worth mentioning, is that there are competing interests here. 
>My focus is towards clients, and specifically the outlook one (if there were >more clients I would be happy to think of them too ). Alot of the >focus of this group is towards admins rather than individuals (which is just >fine!) But it seems the current thinking is of a corpus as being a fairly >static, well-controlled set of messages used almost purely for training >purposes. > >For client programs, this may not be practical. The corpus is a more >dynamic set of messages - and worse, actually *is* the user's set of >messages rather than a collection of message copies. > >For example, "moving" a message in a corpus may actually mean moving the >message in the user's real inbox. This may or may not be what is intended - >a corpus "move" operation is more about changing a message's classification >than it is about physically moving pieces of mail around. > >> A Corpus wouldn't know how to create Message objects, nor would a Message >> object know how to create itself - classes *derived from* them would know >> how to do that. For instance (totally untested code, probably full of >> typos) - >> >> class Message: > >Jeremy and I both posted real code, so starting with something that takes >that into consideration would be good. > >> I may be putting too much >> into the base class by demanding that the text of the message be given to >> the constructor - that precludes making FileMessage lazy, and >> only read the >> file when it needs to.] > >It also defeats the abstract nature of the class. > >> 'Corpus' works the same way; again, the details may be naive, but this is >> the general idea: > >I'm hoping I don't sound grumpy, but again, the few systems that already >exist for this engine are the best ones to use to discover the naivety early > > >> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an >> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on. > >I can't quite imagine that at the moment, as per my comments at the top. 
> >Off the top of my head, I believe we need: >* An abstract "message id" >* A message classification database, as discussed before - basically just a >dictionary, keyed by ID, holding either "spam" or "ham". >* A "corpus" becomes just an enumerator of message IDs for bulk/batch >training. It has no move etc operations. >* A "message store" is capable of returning a message object given its ID. >* The training API simply takes message objects and updates the probability >and message databases. > >At that level, we really don't need much else - no folders or any other >grouping of messages. I'm really not too sure there is much value in adding >higher-level concepts such as folders or message store "move" operations - >certainly not at the outset, where there are too many competing >requirements. > >> Yes - this could work using observer objects registered with Corpus >> objects: > >This could work, but may be too simple to be necessary. If the process of >re-training a message in the Outlook GUI becomes: > >def RetrainMessageAsSpam(): > # Outlook specific code to get an ID. > message = message_store.GetMessage(id) > if not classifier.IsSpam(message): > classifier.train(message, is_spam=True) > >And not a whole lot else, it doesn't seem worth it. Unfortunately, the >decision to perform the retrain is the complex, but client specific part. >Is this a newly delivered message? Did the user manually move the message >somewhere? Did the user click one of our buttons? Is the user deleting old >ham that we want to train on before it dies forever? > >Outlook does this via examining what Outlook event we are seeing, and >looking at meta-data we possibly previously attached to the message. I'm >not sure this can be encapsulated well at the moment without adding all our >meta-data etc baggage to the base classes. 
> >> Most of the *new* code that's needed is defining the abstract concepts and >> their interfaces, rather than writing code that actually *does* anything - >> it's building a framework. > >*cough* ummm... This is doomed to failure. Code *must* do something to be >taken seriously. At the very least, I would expect to see the existing test >driver framework running against these "abstract concepts" > >> Once the framework is there, most of the code needed to implement the >> functionality should already be in the project - code to hook >> into Outlook, >> to train on a message, to parse mbox files, and so on. It just needs >> hooking into the framework. > >See above . > >Mark. > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From papaDoc at videotron.ca Tue Dec 3 01:38:42 2002 From: papaDoc at videotron.ca (Remi Ricard) Date: Mon, 02 Dec 2002 20:38:42 -0500 Subject: [Spambayes] pop3proxy and Mozilla documentatio (2 try) Message-ID: <1038879522.998.3.camel@porsche> Hi, This is my second try at sending the documentation on how to use pop3proxy and Mozilla. The same warnings apply. Don't forget any comment is welcome. -- Remi Ricard From papaDoc at videotron.ca Tue Dec 3 01:56:31 2002 From: papaDoc at videotron.ca (Remi Ricard) Date: Mon, 02 Dec 2002 20:56:31 -0500 Subject: [Spambayes] pop3proxy and Mozilla documentatio (third try) Message-ID: <1038880590.998.9.camel@porsche> Hi again, I don't know what is going on but my attachment are not following my mail. (Evolution should be a good mail program ?????) Since my documentation is not really big I will include it in my email Here it comes --------------------------- Documentation for the Spambayes pop3proxy.py program.

This documentation describes how to use pop3proxy.py with Mozilla:mail, although pop3proxy is not restricted to Mozilla.

I will talk about Mozilla:mail because it is the only mail reader I use with pop3proxy.

First some definitions:

  • What is Spambayes?

    This project is developing a Bayesian anti-spam classifier, initially based on the work of Paul Graham, in python.

  • What is spam?

    Broadly speaking: any email that's not wanted by the end-user. More specifically: unsolicited bulk email; email that you do not want and did not ask for, and was sent to a whole bunch of people by automated means at the same time it was sent to you. This definition deliberately excludes viruses and those stupid jokes sent to you by your Aunt Tillie.

  • What is ham?

    The opposite of spam; not necessarily email that you want or that you asked for, just anything that's not unsolicited bulk email.

  • What is a proxy?

    A proxy is a program that acts as an intermediary between your PC and something. (I hope this is general enough).


Now that we have some definitions we can be more specific:

So what is pop3proxy ?

pop3proxy.py is a program written by the Spambayes team. It is a middleman installed between your current pop servers (usually provided by your ISP) and your mail reader. Upon request, it calls your usual pop server (or one of them) and gets the mail from it. It then classifies each mail into one of 3 categories: spam, ham or unsure. After the classification, it adds a new header, by default X-Spambayes-Classification, that you can look at to find the status of the mail. Then it forwards the mail to you the same way your usual pop server does.

Your mail reader can use the new header to sort the mail into 3 categories: ham, spam and unsure. pop3proxy.py can talk to as many pop servers as you want. I have 3 pop servers at 3 different ISPs, but to simplify the documentation I will only use two. If you have only one pop server, it is even simpler.
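To make the tagging step concrete, here is a minimal Python sketch of adding the header to a message. It is not the code pop3proxy.py actually uses; tag_message and the classify callable are hypothetical stand-ins, and only the header name comes from the text above.

```python
from email import message_from_string

HEADER = "X-Spambayes-Classification"  # default header name described above


def tag_message(raw_text, classify):
    """Return the message text with the classification header added.

    classify is a hypothetical callable that returns "ham", "spam" or
    "unsure"; it stands in for the real classifier.
    """
    msg = message_from_string(raw_text)
    # Remove any copy of the header that arrived with the mail, so a sender
    # cannot pre-label a message as ham. Deleting a missing header is a no-op.
    del msg[HEADER]
    msg[HEADER] = classify(msg)
    return msg.as_string()
```

The body of the mail is left alone; only one header line is added, which is why the mail reader sees the message exactly as the pop server delivered it plus the verdict.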

With pop3proxy installed and running:

  pop server on ISP                 Proxy on localhost      Port on localhost   Mail reader

  --------------------             ---------------------         ----
 |pop.videotron.ca:110| <-------> |                     | <---> |6110| <-\     --------------
  --------------------            |                     |        ----     \   |              |
                                  |    pop3proxy.py     |                  -> | Mozilla:mail |
  ------------------              |                     |        ----     /   |              |
 |mail.ulaval.ca:110|   <-------> |                     | <---> |6111| <-/     --------------
  ------------------               ---------------------         ----

Without pop3proxy installed and running:

  pop server on ISP              Mail reader

  --------------------
 |pop.videotron.ca:110| <-\     --------------
  --------------------     \   |              |
                            -> | Mozilla:mail |
  ------------------       /   |              |
 |mail.ulaval.ca:110|   <-/     --------------
  ------------------

Keeping the pictures above in mind, I will explain how all of this works. Usually, when pop3proxy is not installed and you want to get your email from your pop server, you set your mail reader to talk to your pop server and tell the mail reader to use port 110 (the usual port for a pop server). You do this for all your pop servers.

When pop3proxy is running, your mail reader talks to pop3proxy and asks it to get the mail from a pop server (pop.videotron.ca or mail.ulaval.ca in the picture above). To distinguish which pop server you want the mail from, you talk to pop3proxy on different ports.

In the pictures above, talking to pop3proxy through port 6110 (on your PC) tells it to talk to pop.videotron.ca. If you talk to pop3proxy through port 6111, it knows that you want the mail from the mail.ulaval.ca server.

The association local port <--> pop server can be done when you start pop3proxy by adding some command line options or by configuring some parameters in the file Options.py or by using the new OptionConfig.py program.

Setting things up!

Modification to the Options.py file:

I changed the following lines:
pop3proxy_servers:
pop3proxy_ports:
to
pop3proxy_servers: pop.videotron.ca:110,mail.ulaval.ca:110
pop3proxy_ports: 6110, 6111
Note: The order is important, since the first item in the pop3proxy_servers list is associated with the first item in the pop3proxy_ports list. This means that I have associated port 6110 on my PC (i.e. localhost) with the pop server pop.videotron.ca. When I talk to pop3proxy through port 6110, it knows that I want the mail from the server pop.videotron.ca.
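The positional pairing described in the note can be sketched in Python as follows. This is an illustration only, not how pop3proxy.py parses its options; pair_servers_with_ports is a hypothetical helper, and defaulting a missing remote port to 110 is my assumption.

```python
def pair_servers_with_ports(servers_option, ports_option):
    """Map each local port to the (host, port) of the pop server behind it.

    The first server goes with the first port, the second with the second,
    and so on, exactly as the two option lines are read side by side.
    """
    servers = [s.strip() for s in servers_option.split(",") if s.strip()]
    ports = [int(p) for p in ports_option.split(",") if p.strip()]
    if len(servers) != len(ports):
        raise ValueError("need exactly one local port per pop server")
    mapping = {}
    for server, local_port in zip(servers, ports):
        host, _, remote_port = server.partition(":")
        # Assumption: a server written without ":port" uses the usual 110.
        mapping[local_port] = (host, int(remote_port or 110))
    return mapping


if __name__ == "__main__":
    print(pair_servers_with_ports(
        "pop.videotron.ca:110,mail.ulaval.ca:110", "6110, 6111"))
```

Swapping the two entries on either line (but not both) would silently point port 6110 at the other ISP, which is why the order matters.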

Modification to the mail reader (i.e. Mozilla:mail)

You need to make some modifications in the Mail & Newsgroups Account Settings window. To get this window, start Mozilla, then select the menu Window->Mail & Newsgroups. The Mail & Newsgroups window will appear; select the menu Edit->Mail & Newsgroups Account Settings.... You will get the window you need. If you have already created an email account, you will need to edit that entry (see below). But for now we will start from scratch.

  • We select: Add Account
  • We select: Email Account, then we press the Next Button
  • We enter the required information, then we press the Next Button
  • We select:
    • For server type = POP (this is what most of us will need)
    • Incoming Server = localhost (This is different: usually we would enter our ISP's mail server here, but now the mail reader talks to pop3proxy, which is on our computer (i.e. localhost), and pop3proxy talks to the mail server.)
    • Outgoing Server: relais.videotron.ca (Enter here what your ISP told you to enter. We follow their instructions since we don't need to classify our outgoing mail. P.S. Spammers, here you enter dev_null.spammer.com), then we press the Next Button
  • We enter the required information (User Name), then we press the Next Button
  • We enter the required information (Account Name=ricard), then we press the Finish Button
Now you need to edit the information you have just entered to specify the port we will use to talk to pop3proxy.
In the left part of the window, under the entry created by the steps above (for me it is ricard), select the item Server Settings. The right part of the window should change and show the following fields.
  • Server Name: localhost
  • User Name: ricard
  • Port: 110
You need to change the port number to the one you specified in the file Options.py (for me, 6110). For the next account I will use the second number on the pop3proxy_ports: line.

Using the new header to classify the mail

To classify the mail we will use the filter option available in mozilla:mail.

First we need to create a new filter item. Usually we can filter on subject, sender or body, but we need to filter on the new header X-Spambayes-Classification. To do this you need at least one account (see above for how to create one).
  • In the Mail & Newsgroups window select the menu Tools->Message Filters...
  • In the new window, click New.
  • Create the new item by clicking the arrow to the right of Subject, then choose Customize in the drop-down list.
  • Write X-Spambayes-Classification in the field and click the Add button.
Now this new header can be used as a filter criterion. Since you can do whatever you want with it, I will simply show the setup I use.

Example on how to use the new filter item

In my Inbox I have 4 subfolders: 2 that receive the mail from mailing lists (Spambayes and Freesco), one that receives mail classified as unsure by pop3proxy, and finally one for the spam. The Inbox itself will hold only mail from my friends (hopefully).

Each good folder also filters on X-Spambayes-Classification = ham.
Inbox              (Filter on X-Spambayes-Classification = ham)
 |-----> Spambayes (Filter on Subject = [Spambayes] and
 |                  X-Spambayes-Classification = ham)
 |-----> Freesco   (Filter on Subject = [freesco] and
 |                  X-Spambayes-Classification = ham)
 |-----> Unsure    (Filter on X-Spambayes-Classification = unsure)
 |-----> Spam      (Filter on X-Spambayes-Classification = spam)
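
The folder layout above amounts to a small decision procedure. The Python sketch below mirrors it; the function name choose_folder is hypothetical, and the case-insensitive matching and rule order (classification first, then list tags) are assumptions about how the filters are arranged, not a description of Mozilla's filter engine.

```python
def choose_folder(subject, classification):
    """Return the folder a message would land in under the setup above."""
    label = classification.strip().lower()
    if label == "spam":
        return "Spam"
    if label == "unsure":
        return "Unsure"
    if label == "ham":
        # Only ham is sorted into the mailing-list folders.
        if "[spambayes]" in subject.lower():
            return "Spambayes"
        if "[freesco]" in subject.lower():
            return "Freesco"
    # Anything left over stays in the Inbox.
    return "Inbox"


print(choose_folder("[Spambayes] dbm on windows", "ham"))
```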
-- Remi Ricard From msergeant at startechgroup.co.uk Tue Dec 3 09:52:20 2002 From: msergeant at startechgroup.co.uk (Matt Sergeant) Date: Tue, 03 Dec 2002 09:52:20 +0000 Subject: [Spambayes] OT: hotels near subway in Boston? In-Reply-To: References: Message-ID: <3DEC7ED4.4060403@startechgroup.co.uk> Neale Pickett said the following on 02/12/02 22:06: > So I'm booking travel for this spam conference next month, and I'm > learning that I probably don't want to rent a car in Boston. The > country being very automobile-happy, though, I can't find any hotels > that advertise proximity to the subway. Are there any Boston-area > residents on the list who can recommend a place to stay that's near the > subway? I'm a tourist so I can't handle a lot of bus transfers. I'm staying at the Marriot. Matt. From glouis at dynamicro.on.ca Tue Dec 3 12:04:36 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 07:04:36 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <200212021943.gB2JhIl29523@localhost.localdomain> References: <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> Message-ID: <20021203120436.GA1332@athame.dynamicro.on.ca> On 20021202 (Mon) at 1443:18 -0500, Bill Yerazunis wrote: > > 2) train once on each error, but then repeat the whole training process > until all messages are classified correctly? > > I'd think the latter might be beneficial, but haven't tried it yet > myself. > > Hmmm... that would be a good way to do regression checking to > verify that every message that is classified correctly once > is classified correctly forevermore. I have tried it now. I started from scratch, with 6372 spams and 6372 nonspams, and did a single pass of training-on-error. Then I did second, third, fourth and fifth passes. 
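
For readers who want to see the procedure spelled out, repeated train-on-error passes can be sketched in Python roughly as follows. The toy word-count classifier and the three-message corpus are stand-ins invented for this illustration; they are not bogofilter and not the corpus used in the experiment.

```python
from collections import Counter

spam_words, ham_words = Counter(), Counter()

def train(text, is_spam):
    """Toy training step: count the words of the message."""
    (spam_words if is_spam else ham_words).update(text.split())

def classify(text):
    """Toy classifier: spam if the spam counts outweigh the ham counts."""
    words = text.split()
    return sum(spam_words[w] for w in words) > sum(ham_words[w] for w in words)

def train_on_error_passes(messages, max_passes=5):
    """Train only on misclassified messages; repeat until a clean pass.

    Returns the number of messages trained in each pass.
    """
    trained_per_pass = []
    for _ in range(max_passes):
        errors = 0
        for text, is_spam in messages:
            if classify(text) != is_spam:
                train(text, is_spam)
                errors += 1
        trained_per_pass.append(errors)
        if errors == 0:
            break
    return trained_per_pass

corpus = [("cheap pills now", True),
          ("meeting at noon", False),
          ("cheap meeting room", False)]
counts = train_on_error_passes(corpus)
print(counts)
```

The list returned plays the same role as the per-pass counts reported in this message: each entry is how many messages had to be trained in that round.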
Here are the numbers of messages that had to be trained on each pass:

  rounds spam good
1      1 1090  764
2      2  193   56
3      3   28   15
4      4   10    5
5      5    8    3

Then I took three files of 1624 nonspams each and three files of 617 spams each and ran bogofilter on them with the training db's from each round of training:

   round run fpos fneg err percent
1      1   0   22  126 148    6.60
2      1   1   17  123 140    6.25
3      1   2   19  121 140    6.25
4      2   0   23  105 128    5.71
5      2   1   18  113 131    5.85
6      2   2   22  109 131    5.85
7      3   0   23  104 127    5.67
8      3   1   18  111 129    5.76
9      3   2   22  108 130    5.80
10     4   0   23  104 127    5.67
11     4   1   18  111 129    5.76
12     4   2   22  108 130    5.80
13     5   0   23  103 126    5.62
14     5   1   19  108 127    5.67
15     5   2   22  107 129    5.76

Summarizing,

  round meanerrpc lcl95 ucl95
1     1      6.37  6.13  6.60
2     2      5.80  5.56  6.04
3     3      5.74  5.50  5.98
4     4      5.74  5.50  5.98
5     5      5.68  5.44  5.92

It appears that a second round of training did improve discrimination slightly, but after that the law of diminishing returns set in.

What remains to be done is to start again from scratch and do a full training, followed by one round of training-on-error, and run the test data against those two training sets to see if the result is any different.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |

From tim at fourstonesExpressions.com Tue Dec 3 13:40:05 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 03 Dec 2002 07:40:05 -0600
Subject: [Spambayes] Rethinking Corpus, mboxutils, life, the world, everything
In-Reply-To:
Message-ID: <96KECWQNMYSGFALGGE3XA72ZUOSPB7.3decb435@riven>

12/2/2002 2:27:11 PM, Neale Pickett wrote:

>So then, Richie Hindle is all like:
>
>> Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it,
>> but doesn't seem to use it (?)
>>
>> I'd like to see more of the existing code using it, but then again I'm not
>> in a hurry to implement the idea myself...
>
>I have to confess that I haven't even looked at Corpus.py yet.
>hammiebulk imports it because it needed it for some verbose variable at >one point. But I'm going to read up before I take it out, maybe there's >something there I can use :) The Corpus stuff was created in response to primarily the needs of the pop3proxy. That process manages sets of mail for 'the other' clients, like Netscape, Opera, OE, etc., for which we don't have any hooks into their internals. The only 'interface' we have from them is their pop3 socket datastream. We can't tell when a message moves around in one of their folders, and so we have to keep caches of the mail we receive and give them a user interface they can use to train the classifier with the cached mail. Corpus and its subclass FileCorpus manage that cache for the pop3proxy. Message and its subclass FileMessage wrap each message, giving it an interface that is particularly suited for the pop3proxy. ExpiryCorpus and ExpiryFileCorpus allow the cache contents to be age purged, so the cache doesn't grow indefinitely. All of this is quite suitable for the pop3proxy, but not at all suitable for the Outlook client, which has plenty of hooks into the mail persistence mechanism. The Corpus is observable, and sends notification of two events: a message addition and a message removal. The Trainer class is an observer, and trains a classifier appropriately, based on the kind of trainer it is and whether a message is being added to or removed from the corpus it's observing. In the Outlook client (nearly as I can tell) the idea of a cached corpus is nonsense. Mark can tell when a message moves from one folder to another, and can do the training based on the kind of folder, so this 'third party' user interface to an observable cache messages is not a paradigm that works for outlook. The other thing involved is the mboxutils and msgs 'legacy'. This appears to be primarily directed at unix-style mailboxes, with the message classes being kinda force-fit into some other use-cases. 
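
The observable-corpus arrangement described above (a corpus announcing additions and removals, a trainer reacting to them) can be sketched in Python as below. This is only an illustration of the pattern: the method names add_observer, add_message, remove_message, on_add and on_remove are hypothetical and are not the names used in Corpus.py, and the trainer just records what it would have done instead of touching a real classifier.

```python
class Corpus:
    """Observable set of messages keyed by message id."""

    def __init__(self):
        self.messages = {}
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def add_message(self, key, text):
        self.messages[key] = text
        for obs in self.observers:
            obs.on_add(text)

    def remove_message(self, key):
        text = self.messages.pop(key)
        for obs in self.observers:
            obs.on_remove(text)


class Trainer:
    """Observer that learns on addition and unlearns on removal."""

    def __init__(self, is_spam):
        self.is_spam = is_spam
        self.log = []  # stands in for calls into a real classifier

    def on_add(self, text):
        self.log.append(("learn", self.is_spam, text))

    def on_remove(self, text):
        self.log.append(("unlearn", self.is_spam, text))


spam_corpus = Corpus()
spam_trainer = Trainer(is_spam=True)
spam_corpus.add_observer(spam_trainer)
spam_corpus.add_message("msg1", "buy now")
spam_corpus.remove_message("msg1")
print(spam_trainer.log)
```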
Clearly unix-style mailboxes represent a third message persistence paradigm, a single file with all the messages in it, with a recognizable boundary line between. (btw, it seems like it would be fairly easy to screw up this kind of mailbox...) Hammie* uses this stuff, even when it's not training on unix mailboxes, and there's code rambling around in there that says "if I'm looking at a mbox, do (a), if I'm looking at a directory, do (b), if I'm looking at a ..." There are clearly some valid candidates for abstraction in this arena.

So when I look at Corpus, I think that some further abstraction is necessary. Mark saw this instantly, it took me longer. Specifically, the concept of a 'corpus' carries some definitional baggage that has to do with training and such. The Corpus class is abstract in definition, but it makes too many assumptions about its environment to be abstract *enough*. I think we should refactor and introduce another level of abstraction, perhaps called 'Folder'. Here's a strawman:

class Folder:
    """Basic iteration, maybe not much else here"""
    def __getitem__(self, key):
    def keys(self):
    def __iter__(self):
    def makeMessage(self, key):

class Directory(Folder):
    def __init__(self, directory):

class Mbox(Folder):
    def __init__(self, mbox):

class Outlook(Folder):
    def __init__(self, ???):

class FileCorpus(Directory):
    """Observable set of messages"""

class FileCache(FileCorpus):
    """Expirable set of messages"""

class Message:
    """Message wrapper, maybe even is just email.Message"""

class MessageFactory:
    """Abstract factory for Message"""

class FileMessageFactory:
    """Wraps a file system message"""

class OutlookMessageFactory:
    """Wraps an outlook message, probably only has a key and
    delegator methods to outlook api (?)"""

class SomeOtherMessageFactory:
    """wraps some other kind of message...
    you get the idea"""

>
>Neale
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From wsy at merl.com Tue Dec 3 14:51:59 2002
From: wsy at merl.com (Bill Yerazunis)
Date: Tue, 3 Dec 2002 09:51:59 -0500
Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-)
In-Reply-To: <311837598.1038877084@[192.168.2.9]> (message from Brian Burton on Tue, 03 Dec 2002 00:58:04 -0500)
References: <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <311837598.1038877084@[192.168.2.9]>
Message-ID: <200212031451.gB3Epxi32211@localhost.localdomain>

   From: Brian Burton

   > Training a particular incarnation of CRM114 usually takes a week or
   > two; I read my mail (both categories) and when I find a piece of mail
   > misclassified, I train that one piece into the filter.

   Training only on errors after a cut-off point is interesting. Why do you do this? Is there a reason not to increment the good/spam counts for terms in every email? Is it to avoid overflowing the counts in your hash table or is this likely to be more accurate since it keeps the message counts small?

The reason I started doing it is that I used "unsigned char" as the counters in the big hash tables, to keep them as small as reasonable (remember, we're doing really _random_ accesses of these files and we thrash virtual memory and cache like crazy).

The bin incrementer is "smart" in that it won't wrap past 255, but it is losing data at that point, and losing it on the _most_ significant features.

I did consider "uncorking" the values up to unsigned int16, but I haven't had a good justification to do that yet. It's a simple change and if there's a need, it'll happen.

   > After a couple of days the errors get very sparse; after two or three
   > weeks, I "go for data" and that's what gets reported in the monthlies.
Perhaps I misunderstand, but doesn't that mean that you are training up to a desirable accuracy before beginning to measure your accuracy? Is the transition from training to performance measurement based on a predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in corpus, or 14 calendar days of training) or based on the accuracy rising to a certain level? It's measured intuitively, by when I find I'm just not getting enough errors to keep my attention in training. This _is_ human-guided training, mind you. Other influences on when to start are "it's the start of November, start getting data". and "now that the BCR has that nasty underflow problem fixed and the data has settled down, let's get numbers". The other issue that can't be dodged is that spam is not ergodic; spam evolves in fits and starts; my spam of 1996 is very different than my spam of 2002. Any filter that is trained and tested against data statically is operating "in vitro"- a necessary and useful scientific measure but it misses the point of how well a spam filter can retrain on the fly against evolution in action. The training period coincidentally works out to be about 2+ weeks of training, and co-coincidentally I usually have just a few bins in the hash table maxing out about then. (right now I've got 7 bins out of a million maxed out in the spam hashtable, and 5 bins out of a million maxed out in the nonspam hashtable.) If I were to find that I was maxing out a significant number of bins (say, hundreds) I'd rebuild with unsigned int16 bins and accept the performance hit. (yes, this is a very "engineering" style approach; I'm not a good mathematician, so I just do experiments and report on what comes back.) For those of you with exceptionally high boredom thresholds, the current under-test spectra histograms follow. It does exhibit a comforting long distribution tail. -Bill Y. 
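
As an aside on the bin counters discussed above: a counter that sticks at 255 instead of wrapping can be sketched in Python like this. It is an illustration of the idea only, using a bytearray as a stand-in for the unsigned-char hash file; the names BIN_MAX, increment_bin and maxed_out are made up for the example and are not taken from CRM114.

```python
BIN_MAX = 255  # largest value an unsigned 8-bit bin can hold


def increment_bin(bins, index):
    """Add one to a bin, sticking at BIN_MAX rather than wrapping to 0."""
    if bins[index] < BIN_MAX:
        bins[index] += 1


def maxed_out(bins):
    """Count the bins that have saturated and are now losing data."""
    return sum(1 for value in bins if value == BIN_MAX)


bins = bytearray(8)          # eight zeroed one-byte bins
for _ in range(300):         # more hits than a single byte can count
    increment_bin(bins, 3)
increment_bin(bins, 5)

print(bins[3], bins[5], maxed_out(bins))
```

The saturated bins are the ones that show up as "bin value 255" in a histogram like the one that follows.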
Sparse spectra file spam.css has 1048577 bins total total number of hash datums in this file is 398830 now scanning bins- please be patient... bin value 0 found 786135 times bin value 1 found 188350 times bin value 2 found 48948 times bin value 3 found 11125 times bin value 4 found 8550 times bin value 5 found 2511 times bin value 6 found 992 times bin value 7 found 464 times bin value 8 found 470 times bin value 9 found 240 times bin value 10 found 140 times bin value 11 found 104 times bin value 12 found 77 times bin value 13 found 65 times bin value 14 found 46 times bin value 15 found 47 times bin value 16 found 32 times bin value 17 found 36 times bin value 18 found 19 times bin value 19 found 17 times bin value 20 found 30 times bin value 21 found 11 times bin value 22 found 14 times bin value 23 found 8 times bin value 24 found 7 times bin value 25 found 7 times bin value 26 found 6 times bin value 27 found 10 times bin value 28 found 9 times bin value 29 found 7 times bin value 30 found 6 times bin value 31 found 6 times bin value 32 found 5 times bin value 33 found 2 times bin value 34 found 5 times bin value 35 found 2 times bin value 36 found 6 times bin value 37 found 5 times bin value 38 found 2 times bin value 39 found 2 times bin value 40 found 4 times bin value 41 found 2 times bin value 43 found 3 times bin value 44 found 1 times bin value 46 found 3 times bin value 47 found 1 times bin value 50 found 2 times bin value 52 found 3 times bin value 53 found 3 times bin value 55 found 1 times bin value 56 found 3 times bin value 58 found 1 times bin value 60 found 1 times bin value 62 found 1 times bin value 64 found 1 times bin value 69 found 1 times bin value 73 found 1 times bin value 74 found 1 times bin value 76 found 1 times bin value 77 found 1 times bin value 89 found 1 times bin value 90 found 2 times bin value 103 found 1 times bin value 105 found 2 times bin value 116 found 1 times bin value 121 found 1 times bin value 130 found 1 times bin 
value 143 found 1 times bin value 146 found 1 times bin value 157 found 1 times bin value 171 found 1 times bin value 175 found 2 times bin value 189 found 1 times bin value 208 found 1 times bin value 255 found 7 times Sparse spectra file nonspam.css has 1048577 bins total total number of hash datums in this file is 299527 now scanning bins- please be patient... bin value 0 found 819494 times bin value 1 found 187269 times bin value 2 found 31009 times bin value 3 found 7158 times bin value 4 found 1776 times bin value 5 found 614 times bin value 6 found 371 times bin value 7 found 165 times bin value 8 found 100 times bin value 9 found 76 times bin value 10 found 74 times bin value 11 found 46 times bin value 12 found 46 times bin value 13 found 29 times bin value 14 found 46 times bin value 15 found 53 times bin value 16 found 38 times bin value 17 found 16 times bin value 18 found 24 times bin value 19 found 9 times bin value 20 found 5 times bin value 21 found 11 times bin value 22 found 7 times bin value 23 found 13 times bin value 24 found 5 times bin value 25 found 6 times bin value 26 found 6 times bin value 27 found 5 times bin value 28 found 3 times bin value 29 found 3 times bin value 30 found 10 times bin value 31 found 5 times bin value 32 found 4 times bin value 33 found 4 times bin value 34 found 3 times bin value 35 found 3 times bin value 36 found 5 times bin value 37 found 2 times bin value 38 found 3 times bin value 39 found 3 times bin value 40 found 2 times bin value 41 found 2 times bin value 45 found 1 times bin value 46 found 2 times bin value 48 found 3 times bin value 49 found 3 times bin value 50 found 1 times bin value 51 found 1 times bin value 52 found 2 times bin value 54 found 1 times bin value 55 found 1 times bin value 56 found 1 times bin value 57 found 1 times bin value 58 found 1 times bin value 59 found 1 times bin value 60 found 1 times bin value 64 found 1 times bin value 66 found 1 times bin value 67 found 1 times bin value 
71 found 2 times bin value 72 found 1 times bin value 74 found 1 times bin value 75 found 1 times bin value 78 found 1 times bin value 79 found 1 times bin value 80 found 2 times bin value 82 found 2 times bin value 83 found 1 times bin value 86 found 1 times bin value 95 found 1 times bin value 102 found 1 times bin value 104 found 1 times bin value 113 found 1 times bin value 122 found 1 times bin value 138 found 1 times bin value 164 found 1 times bin value 169 found 1 times bin value 173 found 1 times bin value 183 found 1 times bin value 189 found 1 times bin value 222 found 1 times bin value 254 found 1 times bin value 255 found 5 times Enter bin value to zeroize, or 0 to exit: From trebor at animeigo.com Tue Dec 3 14:28:10 2002 From: trebor at animeigo.com (Robert Woodhead) Date: Tue, 3 Dec 2002 09:28:10 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> References: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> Message-ID: >However, if this is the approach Bill uses, you can't use to for >performance estimates. Our speech and natural language group is >very careful not to mix its training set with its test set. When >they do, they do something like 10 fold cross validation which >averages (?) the results of 10 experiments that take some random >fraction of the data as training and the rest as testing. ah, but the point is, since each individual user will have his own email stream to train on, all you care about is how accurate the system is when it looks at the very next email that comes in. Thus, a system that gets very good after a few weeks of training on all the incoming mail, AND STAYS THAT WAY, is what you want in the real world. 
Dividing up training sets can be good for analysing the statistical properties of particular algorithm choices, but what counts (in a production environment) is real world performance, and real world filters have to adapt as the spam (and ham) changes over time. Tests like "pick a random sample, train on it, and then pick another sample (nonintersecting) from the same corpus, and test" don't properly reflect the real world environment. Spams are ordered by time! Thus, my philosophical position is that a real world app has to train on every incoming email (and be corrected by the user when it goofs). At 9:30 PM -0500 12/2/02, Bill Yerazunis wrote: >The reason I haven't auto-trained is due to my lack of understanding >on what the limiting amount of self-teaching one can allow that >doesn't go off into belly gaze. This cannot happen unless the user is derelict in not correcting the output. If he is, then the input to the training system is 100% correct. And if the training system has an aging system, correction mistakes will eventually decay (and, if they cause misclassifications, the user will notice and correct the filter). Keep in mind there is always a new stream of incoming spam and ham to work with. R -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. From brian at burton-computer.com Tue Dec 3 05:47:45 2002 From: brian at burton-computer.com (Brian Burton) Date: Tue, 03 Dec 2002 00:47:45 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. 
:-) In-Reply-To: <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> References: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> Message-ID: <311217897.1038876465@[192.168.2.9]> --On Monday, December 02, 2002 9:00 PM -0500 Ken Anderson wrote: > However, if this is the approach Bill uses, you can't use to for > performance estimates. Our speech and natural language group is very > careful not to mix its training set with its test set. When they do, > they do something like 10 fold cross validation which averages (?) the > results of 10 experiments that take some random fraction of the data as > training and the rest as testing. > > This gives a lower performance score that is likely to be more accurate > on real data. Absolutely. That's the way I evaluate algorithms in SpamProbe as well. I use 10 different random partitionings of my good and bad spams into training and test subsets. Some tests yield excellent results. Others yield bad results. The average is always somewhere in the middle. Taking only a single partitioning isn't a very good way to evaluate the accuracy of an algorithm. All the best, ++Brian From brian at burton-computer.com Tue Dec 3 05:58:04 2002 From: brian at burton-computer.com (Brian Burton) Date: Tue, 03 Dec 2002 00:58:04 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. 
:-) In-Reply-To: <200212030230.gB32UkR30864@localhost.localdomain> References: <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <200212030230.gB32UkR30864@localhost.localdomain> Message-ID: <311837598.1038877084@[192.168.2.9]> --On Monday, December 02, 2002 9:30 PM -0500 Bill Yerazunis wrote: > Training a particular incarnation of CRM114 usually takes a week or > two; I read my mail (both categories) and when I find a piece of mail > misclassified, I train that one piece into the filter. Training only on errors after a cut-off point is interesting. Why do you do this? Is there a reason not to increment the good/spam counts for terms in every email? Is it to avoid overflowing the counts in your hash table or is this likely to be more accurate since it keeps the message counts small? > After a couple of days the errors get very sparse; after two or three > weeks, I "go for data" and that's what gets reported in the monthlies. Perhaps I misunderstand, but doesn't that mean that you are training up to a desirable accuracy before beginning to measure your accuracy? Is the transition from training to performance measurement based on a predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in corpus, or 14 calendar days of training) or based on the accuracy rising to a certain level? All the best, ++Brian From glouis at dynamicro.on.ca Tue Dec 3 16:27:34 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 11:27:34 -0500 Subject: [Spambayes] train on error - to exhaustion? 
In-Reply-To: <20021203120436.GA1332@athame.dynamicro.on.ca>
References: <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> <20021203120436.GA1332@athame.dynamicro.on.ca>
Message-ID: <20021203162734.GA12825@athame.dynamicro.on.ca>

On 20021203 (Tue) at 0704:36 -0500, Greg Louis wrote:
>
> Summarizing,
>   round meanerrpc lcl95 ucl95
> 1     1      6.37  6.13  6.60
> 2     2      5.80  5.56  6.04
> 3     3      5.74  5.50  5.98
> 4     4      5.74  5.50  5.98
> 5     5      5.68  5.44  5.92
>
> It appears that a second round of training did improve discrimination
> slightly, but after that the law of diminishing returns set in.
>
> What remains to be done is to start again from scratch and do a full
> training, followed by one round of training-on-error, and run the test
> data against those two training sets to see if the result is any
> different.

       train meanerrpc lcl95 ucl95
1 production      2.11  1.79  2.44
2   errtwice      5.80  5.48  6.12
3       full      5.10  4.78  5.43
4    fullerr      5.10  4.78  5.43

Production refers to my big production training set, just for comparison; it was full-trained up to about 10k spams and 10k hams and then trained, not randomly, on every error encountered since. Errtwice is two rounds of training-on-error with the 6372-of-each training corpus. Full is one round of full training with the same corpus, and fullerr is one round of full training followed by one round of train-on-error (only 18 spams and 221 nonspams were registered in that round; although the means are identical, there was some variation in the individual runs).

Doesn't look as though pure training-on-error is particularly advantageous with the Robinson-Fisher (chi) calculation method. It may still be useful in maintaining the effectiveness of an established training base.
The above experiment is described more fully at http://www.bgl.nu/~glouis/bogofilter/training.html -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From tim at zope.com Tue Dec 3 16:53:10 2002 From: tim at zope.com (Tim Peters) Date: Tue, 3 Dec 2002 11:53:10 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <20021203162734.GA12825@athame.dynamicro.on.ca> Message-ID: [Greg Louis] > ... > Doesn't look as though pure training-on-error is particularly > advantageous with the Robinson-Fisher (chi) calculation method. Are you hashing tokens? spambayes does not, CRM114 does. Bill generates about 16 hash codes per input token, and with just a million hash buckets, collision rates zoom quickly if you train on everything. The experiments spambayes did with CRM114-like schemes were a disaster due to this -- we continued to train on everything, with hashing but without any bounds on bucket count, and the hash collisions quickly caused outrageously bad classification mistakes. Removing the hashing cured that, but then the database size goes through the roof (when generating ~16 "exact strings" per input token, and training on everything). Training-on-error helps Bill because it slashes hash collisions, simply via producing far fewer hash codes than does training on everything. Experiments in the default non-hashing spambayes unigram code found that train-on-error hurt the unsure rate but not the FP or FN rates. > It may still be useful in maintaining the effectiveness of an established > training base. Possibly; we didn't do any experiments on that. From relson at osagesoftware.com Tue Dec 3 16:57:58 2002 From: relson at osagesoftware.com (David Relson) Date: Tue, 03 Dec 2002 11:57:58 -0500 Subject: [Spambayes] train on error - to exhaustion? 
In-Reply-To: <20021203162734.GA12825@athame.dynamicro.on.ca> References: <20021203120436.GA1332@athame.dynamicro.on.ca> <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> <20021203120436.GA1332@athame.dynamicro.on.ca> Message-ID: <4.3.2.7.2.20021203115102.00e234a0@mail.osagesoftware.com> At 11:27 AM 12/3/02, Greg Louis wrote: >Doesn't look as though pure training-on-error is particularly >advantageous with the Robinson-Fisher (chi) calculation method. It may >still be useful in maintaining the effectiveness of an established >training base. Greg, That makes sense. By definition, with training-on-error, only some of the training corpora are put into the word lists. The obvious result is smaller word lists. Other than list size, the effects are less clear. On the one hand, incoming messages will have fewer "hits" in the word lists; while on the other hand, the hits will be more "meaningful". With the smaller lists, there is less "breadth of knowledge" about spam and ham. This could account for the lack of advantage of training-on-error. David From glouis at dynamicro.on.ca Tue Dec 3 17:11:10 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 12:11:10 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: References: <20021203162734.GA12825@athame.dynamicro.on.ca> Message-ID: <20021203171110.GA13054@athame.dynamicro.on.ca> On 20021203 (Tue) at 1153:10 -0500, Tim Peters wrote: > [Greg Louis] > > ... > > Doesn't look as though pure training-on-error is particularly > > advantageous with the Robinson-Fisher (chi) calculation method. > > Are you hashing tokens? spambayes does not, CRM114 does. Bill generates > about 16 hash codes per input token, and with just a million hash buckets, > collision rates zoom quickly if you train on everything. Understood. 
We don't hash tokens, and I agree that the sentence you quoted is misleading; I should have said something like "bogofilter's current tokenization and the R-F classification method." I didn't try any of bogofilter's other calculation methods. > The experiments spambayes did with CRM114-like schemes were a > disaster due to this -- we continued to train on everything, with > hashing but without any bounds on bucket count, and the hash > collisions quickly caused outrageously bad classification mistakes. > Removing the hashing cured that, but then the database size goes > through the roof (when generating ~16 "exact strings" per input > token, and training on everything). Yup. > Training-on-error helps Bill because it slashes hash collisions, simply via > producing far fewer hash codes than does training on everything. I didn't mean to imply otherwise, and your correction of my sloppy wording is appreciated. > Experiments in the default non-hashing spambayes unigram code found that > train-on-error hurt the unsure rate but not the FP or FN rates. > > > It may still be useful in maintaining the effectiveness of an established > > training base. > > Possibly; we didn't do any experiments on that. Neither have I; I've been doing it in practice and it seems to work (my fp/fn are coming down), but I would like to perform a properly-designed experiment to assess it. -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From neale at woozle.org Tue Dec 3 17:19:52 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:19:52 -0800 Subject: [Spambayes] dbm on windows, hopefully for the last time Message-ID: What do you all think of this: new option "dbm_type" which can be "best", "db3hash", "dbhash", "gdbm", or "dumbdbm". If it's "best", then the best available dbm implementation will be used. Note that "best" on Windows excludes "dbhash". So now, you get the best one your platform supports by default. 
Or you can specify a specific dbm if you like that better.

This will remove the "anydbm" module, but add a tiny "dbmstorage" module.

Please let me know what you think. I'll check it in if I don't get any "no, don't do that" comments. Here's the diff:

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.78
diff -u -r1.78 Options.py
--- Options.py 26 Nov 2002 00:43:51 -0000 1.78
+++ Options.py 3 Dec 2002 17:13:20 -0000
@@ -372,6 +372,10 @@
 
 [globals]
 verbose: False
+# What DBM storage type should we use? Must be best, db3hash, dbhash,
+# gdbm, dumbdbm. Windows folk should steer clear of dbhash. Default is
+# "best", which will pick the best DBM type available on your platform.
+dbm_type: best
 """
 
 int_cracker = ('getint', None)
@@ -460,6 +464,7 @@
                   'html_ui_launch_browser': boolean_cracker,
                   },
     'globals': {'verbose': boolean_cracker,
+                'dbm_type': string_cracker,
                 },
 }
 
Index: anydbm.py
===================================================================
RCS file: anydbm.py
diff -N anydbm.py
--- anydbm.py 2 Dec 2002 20:23:39 -0000 1.3
+++ /dev/null 1 Jan 1970 00:00:00 -0000
@@ -1,57 +0,0 @@
-#! /usr/bin/env python
-"""Generic interface to all dbm clones.
-
-This is just like anydbm from the Python distribution, except that this
-one leaves out the "dbm" type on Windows, since reliable reports have it
-that this module is antiquated and most dreadful.
-
-"""
-
-import sys
-
-try:
-    class error(Exception):
-        pass
-except (NameError, TypeError):
-    error = "anydbm.error"
-
-if sys.platform in ["win32"]:
-    # dbm on windows is awful.
-    _names = ["bsddb3", "gdbm", "dumbdbm"]
-else:
-    _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]
-_errors = [error]
-_defaultmod = None
-
-for _name in _names:
-    try:
-        _mod = __import__(_name)
-    except ImportError:
-        continue
-    if not _defaultmod:
-        _defaultmod = _mod
-    _errors.append(_mod.error)
-
-if not _defaultmod:
-    raise ImportError, "no dbm clone found; tried %s" % _names
-
-error = tuple(_errors)
-
-def open(file, flag = 'r', mode = 0666):
-    # guess the type of an existing database
-    from whichdb import whichdb
-    result=whichdb(file)
-    if result is None:
-        # db doesn't exist
-        if 'c' in flag or 'n' in flag:
-            # file doesn't exist and the new
-            # flag was used so use default type
-            mod = _defaultmod
-        else:
-            raise error, "need 'c' or 'n' flag to open new db"
-    elif result == "":
-        # db type cannot be determined
-        raise error, "db type could not be determined"
-    else:
-        mod = __import__(result)
-    return mod.open(file, flag, mode)
Index: dbmstorage.py
===================================================================
RCS file: dbmstorage.py
diff -N dbmstorage.py
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ dbmstorage.py 3 Dec 2002 17:13:20 -0000
@@ -0,0 +1,53 @@
+"""Wrapper to open an appropriate dbm storage type."""
+
+from Options import options
+
+class error(Exception):
+    pass
+
+def open_db3hash(*args):
+    """Open a bsddb3 hash."""
+    import bsddb3
+    return bsddb3.hashopen(*args)
+
+def open_dbhash(*args):
+    """Open a bsddb hash. Don't use this on Windows."""
+    import bsddb
+    return bsddb.hashopen(*args)
+
+def open_gdbm(*args):
+    """Open a gdbm database."""
+    import gdbm
+    return gdbm.open(*args)
+
+def open_dumbdbm(*args):
+    """Open a dumbdbm database."""
+    import dumbdbm
+    return dumbdbm.open(*args)
+
+def open_best(*args):
+    if sys.platform == "win32":
+        funcs = [open_db3hash, open_gdbm, open_dumbdbm]
+    else:
+        funcs = [open_db3hash, open_dbhash, open_gdbm, open_dumbdbm]
+    for f in funcs:
+        try:
+            return f(*args)
+        except ImportError:
+            pass
+    raise error("No dbm modules available!")
+
+open_funcs = {
+    "best": open_best,
+    "db3hash": open_db3hash,
+    "dbhash": open_dbhash,
+    "gdbm": open_gdbm,
+    "dumbdbm": open_dumbdbm,
+    }
+
+def open(*args):
+    dbm_type = options.dbm_type.lower()
+    f = open_funcs.get(dbm_type)
+    if not f:
+        raise error("Unknown dbm type in options file")
+    return f(*args)
Index: storage.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/storage.py,v
retrieving revision 1.5
diff -u -r1.5 storage.py
--- storage.py 2 Dec 2002 06:02:03 -0000 1.5
+++ storage.py 3 Dec 2002 17:13:20 -0000
@@ -51,6 +51,7 @@
 import cPickle as pickle
 import errno
 import shelve
+import dbmstorage
 
 PICKLE_TYPE = 1
 NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training
@@ -130,7 +131,8 @@
 
         if options.verbose:
             print 'Loading state from',self.db_name,'database'
-        self.db = shelve.DbfilenameShelf(self.db_name, self.mode)
+        self.dbm = dbmstorage.open(self.db_name, self.mode)
+        self.db = shelve.Shelf(self.dbm)
 
         if self.db.has_key(self.statekey):
             t = self.db[self.statekey]

From neale at woozle.org Tue Dec 3 17:22:39 2002
From: neale at woozle.org (Neale Pickett)
Date: 03 Dec 2002 09:22:39 -0800
Subject: [Spambayes] dbm on windows, hopefully for the last time
In-Reply-To:
References:
Message-ID:

So then, Neale Pickett is all like:

> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ dbmstorage.py 3 Dec 2002 17:13:20 -0000
> @@ -0,0
+1,53 @@ > +"""Wrapper to open an appropriate dbm storage type.""" > + > +from Options import options dbmstorage.py will, of course, also import sys :) > + if sys.platform == "win32": > + funcs = [open_db3hash, open_gdbm, open_dumbdbm] > + else: > + funcs = [open_db3hash, open_dbhash, open_gdbm, open_dumbdbm] From glouis at dynamicro.on.ca Tue Dec 3 17:23:25 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 12:23:25 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <4.3.2.7.2.20021203115102.00e234a0@mail.osagesoftware.com> References: <20021203120436.GA1332@athame.dynamicro.on.ca> <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> <20021203120436.GA1332@athame.dynamicro.on.ca> <4.3.2.7.2.20021203115102.00e234a0@mail.osagesoftware.com> Message-ID: <20021203172325.GB13054@athame.dynamicro.on.ca> On 20021203 (Tue) at 1157:58 -0500, David Relson wrote: > By definition, with training-on-error, only some of the > training corpora are put into the word lists. The obvious result is > smaller word lists. I can confirm that. "twice" is the directory where the db files were built by two rounds of train-on-error: # ls -l full twice full: total 47288 -rw-r--r-- 1 spamtest root 38936576 Dec 3 07:24 goodlist.db -rw-r--r-- 1 spamtest root 9424896 Dec 3 07:06 spamlist.db twice: total 22168 -rw-r--r-- 1 spamtest users 15761408 Dec 2 14:54 goodlist.db -rw-r--r-- 1 spamtest users 6905856 Dec 2 14:55 spamlist.db > Other than list size, the effects are less clear. On > the one hand, incoming messages will have fewer "hits" in the word lists; > while on the other hand, the hits will be more "meaningful". With the > smaller lists, there is less "breadth of knowledge" about spam and > ham. This could account for the lack of advantage of training-on-error. 
The fact that you get only half a percent more errors with less than half the bulk of wordlists does suggest that full training introduces a lot of unproductive cruft, though. What I _think_ I'm seeing is that, when done on top of an existing "full" base, training on every error as it's encountered does quickly improve the discrimination. That's gut-feeling and could be wrong -- experimentation is needed. -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From neale at woozle.org Tue Dec 3 17:25:50 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:25:50 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: <3DEB7D92.26160.B217D9F@localhost> References: <3DEB3361.19290.9FFA921@localhost> <3DEB7D92.26160.B217D9F@localhost> Message-ID: So then, "Brad Clements" is all like: > I think the "database interface" should be abstract, regardless of > what I do. It is abstracted in a few places: you can write a new PersistentClassifier class (in storage.py), or you can have the DBDictClassifier use a new dbm storage backend (also in storage.py). If my latest patch is amenable to everyone, you can also hack dbmstorage.py to include a new dbm-like back-end. Neale From skip at pobox.com Tue Dec 3 17:28:34 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 11:28:34 -0600 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: <15852.59842.998057.524034@montanaro.dyndns.org> Neale> What do you all think of this: new option "dbm_type" which can be Neale> "best", "db3hash", "dbhash", "gdbm", or "dumbdbm". If it's Neale> "best", then the best available dbm implementation will be used. Neale> Note that "best" on Windows excludes "dbhash". Looks like a winner to me. 
Skip From richie at entrian.com Tue Dec 3 17:44:19 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 17:44:19 +0000 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: [Neale] > What do you all think of this: new option "dbm_type" which can be > "best", "db3hash", "dbhash", "gdbm", or "dumbdbm". If it's "best", then > the best available dbm implementation will be used. Note that "best" on > Windows excludes "dbhash". Looks spot on - nice one! We should also change the default for pop3proxy_persistent_use_database to True. -- Richie Hindle richie@entrian.com From neale at woozle.org Tue Dec 3 17:50:44 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:50:44 -0800 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: So then, Richie Hindle is all like: > We should also change the default for > pop3proxy_persistent_use_database to True. For that matter, what do you think about moving persistent_use_database back to the [global] section and doing away with *_presistent_use_database? Neale From skip at pobox.com Tue Dec 3 18:10:28 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 12:10:28 -0600 Subject: [Spambayes] The database question that would not die In-Reply-To: References: <15851.23096.388509.925822@montanaro.dyndns.org> Message-ID: <15852.62356.862434.212872@montanaro.dyndns.org> richie> Are there any platforms on which, when you ask anydbm to create richie> a database, it uses version 1.85 of the underlying Berkeley DB richie> library to do that? Yes, unfortunately the Python Windows installer is distributed with Berkeley DB 1.85. On other platforms it's a hit-or-miss proposition. I don't believe any Linux vendors ship with db1 as the default anymore, but I could easily be disabused of that notion. I don't know about the commercial Unix vendors. Has anyone considered Sleepycat's caveats about using 1.85? 
The relevant page is here: http://www.sleepycat.com/historic.html The q/a about 1.85 is: Are there known problems with the 1.85 and 1.86 versions? Yes. Specifically, we recommend that you avoid the following operations when using versions 1.85 and 1.86: * Btree cursor (seq and put using a cursor) operations. * Large numbers of btree duplicates (specifically, avoid migrating duplicate keys to internal pages). * Large numbers of btree deletes (you should periodically dump and rebuild the database if you delete large numbers of records). * Overwriting or deleting overflow hash key/data pairs (pairs with items larger than the page size). * Intermixing hash cursor operations with deletes. In addition: * As there was no locking support in version 1.85, you cannot perform concurrent read/write operations in the database. * As there was no logging or transaction support in version 1.85, you must re-create your database whenever abnormal application termination occurs (e.g., either the application or the system crashes) as the database may have been left in a corrupted state. Finally, you should not upgrade your GNU gcc or Solaris compiler. Optimizations in versions of gcc 2 that were in alpha test in the summer of 1997, and a version of the standard Solaris WorkShop Compiler that was in beta test in the fall of 1997, trigger bugs in versions 1.85 and 1.86 that will cause sporadic core dumps. It seems to me the most important issues for us are the last two bullets in the first section and the last bullet in the second section. How close can we come to avoiding them? I don't think we should have any overflow has key/data pairs. The largest item in my current hammie.db file is only 108 bytes. Does the code do things like foo = db.next() if someprop(foo): del db[foo[0]] ? If not that may not be a problem either. The "abnormal termination" bit bothers me some, based on historical prejudices about Windows' (in)stability. I imagine others can speak to that. 
Skip From richie at entrian.com Tue Dec 3 19:08:07 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 19:08:07 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: <15852.62356.862434.212872@montanaro.dyndns.org> References: <15851.23096.388509.925822@montanaro.dyndns.org> <15852.62356.862434.212872@montanaro.dyndns.org> Message-ID: [Skip] > Yes, unfortunately the Python Windows installer is distributed with Berkeley > DB 1.85. I should have said "except Windows" - I know we need to special-case that one. > Has anyone considered Sleepycat's caveats about using 1.85? I've read it, and although we might not hit any of the specific problems they mention right now, it seemed sufficiently scary to put me off using it. Who knows what code will be added to spambayes in the future - we can't make any assumptions. I think the patch Neale posted today does an excellent job of avoiding the problems - let's go with that. -- Richie Hindle richie@entrian.com From richie at entrian.com Tue Dec 3 19:14:04 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 19:14:04 +0000 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: [Neale] > For that matter, what do you think about moving persistent_use_database > back to the [global] section and doing away with > *_presistent_use_database? Yes, good plan. But what about *_persistent_storage_file? That defaults to ~/.hammiedb for hammie, which is meaningless on Windows 9x but very sensible on Unix. Maybe we need to move from having per-application defaults in bayescustomize.ini to having per-platform defaults? This is effectively what we've done with "dbm_type: best" (but in a different place). 
-- Richie Hindle richie@entrian.com From knutsen at yahoo.com Tue Dec 3 20:35:26 2002 From: knutsen at yahoo.com (Mark Knutsen) Date: Tue, 3 Dec 2002 12:35:26 -0800 (PST) Subject: [Spambayes] Great project; please keep up the good work Message-ID: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Looks like you've got something good going on there, especially the Outlook 2000 plugin for all of us stuck in corporate world. However, I'm neither a Python nor a Windows developer (I do Perl on Linux) and only use Windows to browse the Web and read my email at work, so I'm a bit leery of the installation process at present. What are the chances of an easy, packaged install coming down the pike? ===== --Mark Knutsen (Have you visited http://tbcy.org lately?) __________________________________________________ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com From skip at pobox.com Tue Dec 3 20:57:58 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 14:57:58 -0600 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: <20021203203526.28181.qmail@web10006.mail.yahoo.com> References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Message-ID: <15853.6870.487509.485369@montanaro.dyndns.org> Mark> What are the chances of an easy, packaged install coming down the Mark> pike? It's in the cards, though I'm not sure anyone knows what the timeframe for that is. 
Skip

From francois.granger at free.fr Tue Dec 3 20:59:47 2002
From: francois.granger at free.fr (François Granger)
Date: Tue, 3 Dec 2002 21:59:47 +0100
Subject: [Spambayes] Great project; please keep up the good work
In-Reply-To: <20021203203526.28181.qmail@web10006.mail.yahoo.com>
References: <20021203203526.28181.qmail@web10006.mail.yahoo.com>
Message-ID:

At 12:35 -0800 3/12/02, in message [Spambayes] Great project; please keep up the good work, Mark Knutsen wrote:
>
>However, I'm neither a Python nor a Windows developer (I do Perl on Linux)
>and only use Windows to browse the Web and read my email at work, so I'm a
>bit leery of the installation process at present. What are the chances of an
>easy, packaged install coming down the pike?

Go for the pop3proxy. It is a "one size fits all" solution that works very nicely. Other people will help you install on Unix with Procmail.

--
E-mail is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies.
For clean e-mail: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From mhammond at skippinet.com.au Tue Dec 3 21:08:16 2002
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 4 Dec 2002 08:08:16 +1100
Subject: [Spambayes] Great project; please keep up the good work
In-Reply-To: <20021203203526.28181.qmail@web10006.mail.yahoo.com>
Message-ID:

> Looks like you've got something good going on there, especially the Outlook
> 2000 plugin for all of us stuck in corporate world.
>
> However, I'm neither a Python nor a Windows developer (I do Perl on Linux)
> and only use Windows to browse the Web and read my email at work, so I'm a
> bit leery of the installation process at present. What are the chances of an
> easy, packaged install coming down the pike?

There is a good chance - I am working on this at the moment.

Mark.
From richie at entrian.com Tue Dec 3 21:26:10 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 21:26:10 +0000 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Message-ID: <2u7quu0t59l75012u6hjdnv85k41ihsj6u@4ax.com> [Mark Knutsen] > What are the chances of an easy, packaged install coming down the pike? [Mark Hammond] > There is a good chance - I am working on this at the moment. Fantastic! This is good news. Do you know for sure yet what this will include? Is it intended to be Outlook-specific, or will it include everything (hammie, the web interface, the POP3 proxy, etc)? Will you be shipping Python as part of the package? Would you like me to stop asking annoying questions now? 8-) -- Richie Hindle richie@entrian.com From richie at entrian.com Tue Dec 3 21:28:54 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 21:28:54 +0000 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: <2u7quu0t59l75012u6hjdnv85k41ihsj6u@4ax.com> References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> <2u7quu0t59l75012u6hjdnv85k41ihsj6u@4ax.com> Message-ID: [Mark Knutsen] > What are the chances of an easy, packaged install coming down the pike? [Mark Hammond] > There is a good chance - I am working on this at the moment. [Richie Hindle] > Fantastic! This is good news. Do you know for sure yet what this will > include? Is it intended to be Outlook-specific, or will it include > everything (hammie, the web interface, the POP3 proxy, etc)? Will you be > shipping Python as part of the package? Would you like me to stop asking > annoying questions now? 8-) Will it include bsddb3? 
-- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Tue Dec 3 21:51:14 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 4 Dec 2002 08:51:14 +1100 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: Message-ID: > [Richie Hindle] > > Fantastic! This is good news. Do you know for sure yet what this will > > include? Is it intended to be Outlook-specific, or will it include > > everything (hammie, the web interface, the POP3 proxy, etc)? > > Will you be > > shipping Python as part of the package? Would you like me to > > stop asking > > annoying questions now? 8-) > > Will it include bsddb3? It will be Outlook specific - there will only be DLL files, no executables. Of course, I would be happy to apply the same technology to a general distribution. And my plan is for it to include bsddb3 ;) Mark. From francois.granger at free.fr Tue Dec 3 22:13:37 2002 From: francois.granger at free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger) Date: Tue, 3 Dec 2002 23:13:37 +0100 Subject: [Spambayes] Fwd: Re: [Pythonmac-SIG] Database engine Message-ID: >Delivered-To: online.fr-francois.granger@free.fr >Date: Tue, 3 Dec 2002 22:24:30 +0100 >Subject: Re: [Pythonmac-SIG] Database engine >Cc: MacPython >To: François Granger >From: Jack Jansen > > >On maandag, dec 2, 2002, at 18:21 Europe/Amsterdam, François Granger wrote: > >>On the Spambayes mailing list there was a discussion about the quality of >>the current bsddb database engine on Windows platforms. I was asked the >>question of how it is on Mac OS9 side. All I could say is that anydbm rely >>on gdbm. But I don't know which engine it is. >> >>Anyone knows a little on this ? > >It's the gdbm engine. This is a GNU database engine of approximately >10-12 years old. It used to be popular in its day, but nowadays most >people seem to prefer bsddb. But gdbm was reasonably easy to port to >MacOS9, and I never looked at bsddb afterwards. 
>-- >- Jack Jansen >http://www.cwi.nl/~jack - >- If I can't dance I don't want to be part of your revolution -- >Emma Goldman - -- Recently using MacOSX....... From skip at pobox.com Tue Dec 3 22:26:45 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 16:26:45 -0600 Subject: [Spambayes] hammie misquote? Message-ID: <15853.12197.292027.777745@montanaro.dyndns.org> In hammie.py --help the output includes: -g PATH mbox or directory of known good messages (non-spam) to train on. Can be specified more than once, or use - for stdin. -s PATH mbox or directory of known spam messages to train on. Can be specified more than once, or use - for stdin. As far as I can tell feeding it directories instead of mbox files, doesn't actually work. The code in train() suggests this as well: def train(hammie, msgs, is_spam): """Train bayes with all messages from a mailbox.""" mbox = mboxutils.getmbox(msgs) ... which is called like so: for g in good: print "Training ham (%s):" % g train(h, g, False) save = True where good is a list containing one directory if I invoke hammie like so: BAYESCUSTOMIZE=pfx.ini python ./hammie.py -g Data/Ham/Set1 -p ./hammie.db -d Did I miss something or is this a documentation mistake? Skip From baa at encodeweb.dk Tue Dec 3 22:45:20 2002 From: baa at encodeweb.dk (=?ISO-8859-1?Q?Brian_=C5gren?=) Date: Tue, 03 Dec 2002 23:45:20 +0100 Subject: [Spambayes] language support Message-ID: <3DED3400.7070101@encodeweb.dk> Hi SpamBayes Folks. I'm from a non-english country and was considering the consequences of using bayes based spam filtering with my language. As i see the problem .. all my spam is in english, most of my non-spam (incl. ham) is in danish .. my whife almost never gets any non-spam in english sċ if she (or i) would get something which is in english all the words in the email that is known, will be known from spam and therefore consider to be "most-likely-spam". Is this the case, or? which licence is being used for this project? 
I'd like to use this in a webmail-app i'm writing (in java), I've used python a while back in the university, Would it be better for me to write another implementation of this or use jpython to incorporate your project into mine? - Brian Aagren From richie at entrian.com Tue Dec 3 23:27:28 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 23:27:28 +0000 Subject: [Spambayes] Fwd: Re: [Pythonmac-SIG] Database engine In-Reply-To: References: Message-ID: <8afquukb0ljib2est5m889vfufo5iosojc@4ax.com> Hi François, > From: Jack Jansen > [...] > It's the gdbm engine. This is a GNU database engine of approximately > 10-12 years old. It used to be popular in its day, but nowadays most > people seem to prefer bsddb. But gdbm was reasonably easy to port to > MacOS9, and I never looked at bsddb afterwards. Thanks to you and Jack for the confirmation - this should mean we don't need to treat the Mac in any special way in terms of database type. Neale's recent edits should pick up gdbm, and all should be well. -- Richie Hindle richie@entrian.com From dereks at itsite.com Tue Dec 3 18:53:09 2002 From: dereks at itsite.com (Derek Simkowiak) Date: Tue, 3 Dec 2002 13:53:09 -0500 (EST) Subject: [Spambayes] New Application of SpamBayesian tech? Message-ID: Surfing Slashdot, ran accrossed the interview at http://www.theopenenterprise.com/story/TOE20021202S0001 which is about finding jobs. I saw this part: ------------------------------------------------------- [The interviewee mentions they got 3000 resumes in a single weekend...] [Interviewer] TheOpenEnterprise: How do you handle 3000 resumes? Do you look at them all? [Interviewee] Cranston-Cuebas: In a sense, we do. But we first scan them quickly to filter out applicants without relevant skills. We create an index of all incoming resumes and search on keywords. That's why it's important for job-seekers to repeat the major skills multiple times in their resume. 
Another reason is that some recruiters use applicant tracking programs that do automatic skills assessment based on keywords found in the resume, and will rank resumes based on that assessment. ------------------------------------------------------- Is anyone else seeing what I'm seeing? It seems like the SpamBayes algorithms are perfectly suited to this task... and would be far more accurate than whatever simple "keyword" tracking the current apps use. For some reason, the application of "filtering in" with SpamBayes (instead of "filtering out") never occurred to me before. Given the large number of people looking for jobs in the U.S., this seems like a good opportunity. Anyone else find this interesting? --Derek From skip at pobox.com Wed Dec 4 03:07:40 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 21:07:40 -0600 Subject: [Spambayes] New Application of SpamBayesian tech? In-Reply-To: References: Message-ID: <15853.29052.186166.31687@montanaro.dyndns.org> Derek> [Interviewer] TheOpenEnterprise: How do you handle 3000 resumes? Do you Derek> look at them all? Derek> [Interviewee] Cranston-Cuebas: In a sense, we do. But we first Derek> scan them quickly to filter out applicants without relevant Derek> skills. Derek> Is anyone else seeing what I'm seeing? Yes. It also seems to me that web page content filtering proxies (you know, keeping your kids or employees from visiting XXX websites) would be another good application of the technology. Skip From neale at woozle.org Wed Dec 4 04:31:50 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 20:31:50 -0800 Subject: [Spambayes] hammie misquote? In-Reply-To: <15853.12197.292027.777745@montanaro.dyndns.org> References: <15853.12197.292027.777745@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > BAYESCUSTOMIZE=pfx.ini python ./hammie.py -g Data/Ham/Set1 -p ./hammie.db -d > > Did I miss something or is this a documentation mistake? Ah, yes, hrm. 
Here's the problem: the mboxutils module makes some guesses about the path you give it, and for the Data sets (at least, for *my* Data sets), it guesses wrong. The fix would be to just get mboxutils to recognize your flavor of directory. Myself, I just made my data sets look like Maildirs and then everything was fine. But that's just a hack, not a solution ;) Tim S, is the Corpus module smarter about things like this? From neale at woozle.org Wed Dec 4 04:34:55 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 20:34:55 -0800 Subject: [Spambayes] New Application of SpamBayesian tech? In-Reply-To: <15853.29052.186166.31687@montanaro.dyndns.org> References: <15853.29052.186166.31687@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Yes. It also seems to me that web page content filtering proxies (you > know, keeping your kids or employees from visiting XXX websites) would > be another good application of the technology. Not to mention IDS (Intrusion Detection Systems). IANAS but I have a friend who is, and he's suggested to me a few times that it would be very interesting and possibly fruitful to apply Bayesian analysis to network security. But I think I'm going to have to pull out the probab/stats book from college before I embark on such a thing :) From neale at woozle.org Wed Dec 4 04:39:09 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 20:39:09 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: <15852.62356.862434.212872@montanaro.dyndns.org> References: <15851.23096.388509.925822@montanaro.dyndns.org> <15852.62356.862434.212872@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Has anyone considered Sleepycat's caveats about using 1.85? I think we do read and write at the same time when training, by the way. I don't really know what they mean by "concurrent read/write operations". 
Are they talking about two processes working on the same database, or do they mean one process doing both operations? But is it really worth persuing? I'd be happy to just write off 1.85, and it seems most of the windows folks are okay with that too. Neale From skip at pobox.com Wed Dec 4 04:49:10 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 22:49:10 -0600 Subject: [Spambayes] How are individual values stored in the database? Message-ID: <15853.35142.408092.547180@montanaro.dyndns.org> I thought values associated with keys in the DBDict thing were stored as little pickles. Scanning the code in dbdict.py suggests that's the case, but I'm unable to unserialize items using either cPickle or marshal: Python 2.3a0 (#6, Nov 13 2002, 19:57:35) [GCC 3.1 20020420 (prerelease)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import dbhash >>> db = dbhash.open("hammie.db") >>> db["pfxlen:5"] 'W(GA\xce\xf6\xbf-\xc0$\x89K\x00K\x07K\x00G?\x9e\xed\x19\xc5\x95y\xfdtq\x01.' >>> import cPickle as pickle >>> pickle.loads(db["pfxlen:5"]) Traceback (most recent call last): File "", line 1, in ? cPickle.UnpicklingError: invalid load key, 'W'. >>> import marshal >>> marshal.loads(db["pfxlen:5"]) Traceback (most recent call last): File "", line 1, in ? ValueError: bad marshal data I used to be able to do this (I can still do it with the hammie.db file I generated in mid-November). The file in question was created by hammie.py invocations like BAYESCUSTOMIZE=pfx.ini python ./hammie.py -g ham.mbox -p ./hammie.db -d where pfx.ini has these lines: [Tokenizer] address_headers: to cc summarize_prefixes: True (I'm trying to evaluate a new tokenizer change and want to examine raw counts for the generated tokens.) I realize WordInfo objects aren't being pickled any longer, but I thought tuples were. What have I missed? 
Thx, Skip From skip at pobox.com Wed Dec 4 04:51:12 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 22:51:12 -0600 Subject: [Spambayes] hammie misquote? In-Reply-To: References: <15853.12197.292027.777745@montanaro.dyndns.org> Message-ID: <15853.35264.832756.67640@montanaro.dyndns.org> Neale> The fix would be to just get mboxutils to recognize your flavor Neale> of directory. Myself, I just made my data sets look like Neale> Maildirs and then everything was fine. But that's just a hack, Neale> not a solution ;) Nothin' special about my directory. It's of the usual Unix variety. Its contents are the one message per file thing Tim defined for testing. What are "Maildirs"? How do they differ from Tim's thing? Skip From neale at woozle.org Wed Dec 4 05:08:38 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 21:08:38 -0800 Subject: [Spambayes] hammie misquote? In-Reply-To: <15853.35264.832756.67640@montanaro.dyndns.org> References: <15853.12197.292027.777745@montanaro.dyndns.org> <15853.35264.832756.67640@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Nothin' special about my directory. It's of the usual Unix variety. > Its contents are the one message per file thing Tim defined for > testing. What are "Maildirs"? How do they differ from Tim's thing? Maildirs are, well, here's a picture. $HOME/Maildir/ new/ 1038978004.24787_1.gwydion cur/ 1037168130.15835_0.gwydion,S=542:2,S 1037214764.7823_0.gwydion,S=1749:2,S tmp/ So the idea here is that when the MTA is writing a new message, it does so in a new file in tmp/, one file per message. When it's done, it renames the file into the new/ directory (that's an atomic operation on just about every FS). Then when your client has read the message, it puts in in the cur/ directory. So you don't need to lock anything. It's super-de-duper for NFS-mounted mail directories, and beats mbox files on everything but indexing. Google maildir for more info. 
So strictly speaking, all files in a Maildir have to be named NUMBER.STRING.STRING. But our stuff just reads in every file in the directory. I made a symlink to my Set1 directory called "cur" and told it to train on Data/Ham. So it slurped in every file.

An MH directory, on the other hand, doesn't have the new/, cur/ and tmp/ subdirectories; all the messages are in the same directory. And they all have to be numbers, starting at 1.

The way mboxutils works currently, it first tries to read the directory as a maildir (looking for a "cur" subdirectory). Then, if "/Mail/" is in the pathname, it reads it as an MH directory. Otherwise, it treats it as a directory of text files and only reads *.txt and *.lorien (what is this?) files.

So I guess we could change that last option to read everything, but it has to be that way for some reason. Anyone care to elucidate this point?

Neale

From Paul.Moore at atosorigin.com Wed Dec 4 09:01:00 2002
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Wed, 4 Dec 2002 09:01:00 -0000
Subject: [Spambayes] Interesting behaviour from the Outlook client
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com>

Over the past few days, I've been seeing an increase in FNs and Unsures. I initially trained on my inbox and spam folders (386 ham, 999 spam), and since then I've trained on errors only. I'm now at 391 ham and 1011 spam. Initially, I was getting no errors, and 1 or 2 unsures per day. Now, I'm starting to get at least 1 FN per day, and a slight increase in the unsure rate.

It's far too early to tell, but could this be related to Tim's code to handle unbalanced training sets? As time goes on, the spam:ham ratio will increase (as FNs happen more often than FPs) and so the impact of spam clues will be lessened (by Tim's code).
I'll keep monitoring this, but my "real life" mail is definitely unbalanced (home is massively biased in favour of spam, work massively biased in favour of ham, but I pre-filter mailing lists which muddies the water badly).

I dunno. Do the testing gurus round here have any idea whether this type of hypothesis could be tested in practice?

Paul.

From francois.granger at free.fr Wed Dec 4 09:23:22 2002
From: francois.granger at free.fr (François Granger)
Date: Wed, 04 Dec 2002 10:23:22 +0100
Subject: [Spambayes] pop3proxy documentation
In-Reply-To: <1038880590.998.9.camel@porsche>
Message-ID:

on 3/12/02 2:56, Remi Ricard at papaDoc@videotron.ca wrote:

> Since my documentation is not really big I will include it in my email

Nice job.

I suggest that instead of modifying the Options.py file you instruct the guy to create the bayescustomize.ini file....

I added instructions for MacOS with Eudora and Entourage....

Please find only the added text below:

=======================================================

For MacOS 9

Before anything

Because of the way MacOS handles multitasking, pop3proxy does not run very fast. On a Cube or a G4 400, I found it usable, but only just. YMMV. To handle network connections to localhost, it is easier to add a hosts file. If you don't have one already, create one with any text editor.

Its name must be "hosts". It should be located in the "Preferences" folder. Its content should be similar to:

localhost CNAME fbgmac.intranet.teleprosoft.com
fbgmac.intranet.teleprosoft.com A 127.0.0.1

The localhost and 127.0.0.1 values must be exactly as shown. If you don't know the right value to use for fbgmac.intranet.teleprosoft.com, put anything that looks like it. It has to be exactly the same at the end of the first line and at the beginning of the second line.

Once this file is created, go to the "TCP/IP" control panel. Set the user level to Administrator. Click on "Use a host file" and select this file. Save your changes.

On the Mac, you can turn a Python script into a double-clickable applet. Just drag and drop the pop3proxy.py script onto the BuildApplet application. You'll get a double-clickable pop3proxy application.

Create or modify the "bayescustomize.ini" file in the Spambayes folder.

Be sure it contains these lines:

[pop3proxy]
pop3proxy_servers: pop.videotron.ca:110,mail.ulaval.ca:110
pop3proxy_ports: 110, 6111
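
To make the pairing explicit: the first entry in pop3proxy_ports is the local port for the first server, the second for the second server. The Python sketch below reads those two lines and prints which local port leads to which remote server. It is an illustration only and not how pop3proxy itself parses its options; `proxy_map` is a hypothetical helper, and treating a server with no port as port 110 is an assumption.

```python
import configparser

SAMPLE = """\
[pop3proxy]
pop3proxy_servers: pop.videotron.ca:110,mail.ulaval.ca:110
pop3proxy_ports: 110, 6111
"""


def proxy_map(ini_text):
    """Return {local_port: (remote_host, remote_port)} from the ini text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["pop3proxy"]

    servers = [s.strip() for s in section["pop3proxy_servers"].split(",")]
    ports = [int(p) for p in section["pop3proxy_ports"].split(",")]
    if len(servers) != len(ports):
        raise ValueError("need exactly one local port per server")

    mapping = {}
    for server, local_port in zip(servers, ports):
        host, _, remote_port = server.partition(":")
        # Assumption: a server listed without a port uses the standard POP3 port.
        mapping[local_port] = (host, int(remote_port or 110))
    return mapping


if __name__ == "__main__":
    for local_port, (host, remote_port) in sorted(proxy_map(SAMPLE).items()):
        print("localhost:%d -> %s:%d" % (local_port, host, remote_port))
```

With the sample above, a mail client pointed at localhost port 110 reaches the videotron server, and one pointed at localhost port 6111 reaches the ulaval server, which is what the client setup below relies on.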

Configuring Entourage

Go to the Tools menu and choose Accounts.

Click on New and choose POP.

Fill in the various fields. For the POP server field, put "localhost".

For the videotron account, you are done.

For the ulaval account, in the "Advance receive option" window, click on the "Ignore the default POP port" check box and type in 6111.

Filtering with Entourage

The rule can be:

If
    Specific header: X-Spambayes-Classification Contains ham
then    
    do nothing
If
    Specific header: X-Spambayes-Classification Contains spam
then    
    Move message to folder Spam
If
    Specific header: X-Spambayes-Classification Contains unsure
then    
    Move message to folder Unsure
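
For readers whose mail client has no such rule editor, the same three rules can be sketched in Python. This is a minimal illustration, assuming only that the proxy adds an X-Spambayes-Classification header containing ham, spam or unsure, as the rules above do; `target_folder` and `FOLDER_FOR` are hypothetical names, and the folder names are simply the ones used in the rules.

```python
import email

# Folder for each classification; None means "leave the message where it is".
FOLDER_FOR = {"ham": None, "spam": "Spam", "unsure": "Unsure"}


def target_folder(raw_message):
    """Return the folder a message should be moved to, or None to do nothing."""
    msg = email.message_from_string(raw_message)
    value = (msg.get("X-Spambayes-Classification") or "").lower()

    # Same order and same "contains" matching as the rules above.
    for label in ("ham", "spam", "unsure"):
        if label in value:
            return FOLDER_FOR[label]

    # No header at all (for example, mail that bypassed the proxy): do nothing.
    return None


if __name__ == "__main__":
    sample = "X-Spambayes-Classification: spam\nSubject: test\n\nbody\n"
    print(target_folder(sample))
```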

Configuring Eudora

In Eudora, you will be able to reach only one POP server, since you can configure only one port number for POP. But on this server, you can access more than one account.

Go to the Tool menu and choose Personalities.

Create a new personality with the POP server as "localhost".

With the proposed "bayescustomize.ini" you will be able to talk only to the videotron server.

Filtering with Eudora

The rule can be:

Match
    Header: X-Spambayes-Classification contains ham
Action    
    do nothing
Match
    Header: X-Spambayes-Classification contains spam
Action    
    Transfer To Spam
Match
    Header: X-Spambayes-Classification contains unsure
Action    
    Transfer To Unsure
=======================================================

--
Mail is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies. For clean mail: --

From petera at intrinsica.co.uk Wed Dec 4 09:47:03 2002
From: petera at intrinsica.co.uk (Peter Arnold)
Date: Wed, 4 Dec 2002 09:47:03 -0000
Subject: [Spambayes] Outlook addin: Removing the tray icon
Message-ID:

It's always bugged me that Outlook leaves the New Mail icon in the system tray after a rule or addin has moved or deleted all the newly arrived e-mail. I know there's no programmatic interface to remove the icon but I found some Visual Basic code at http://www.slipstick.com/dev/code/clearenvicon.htm that does the job. I've converted the three pages of VB to 5 lines of python (!) and submitted it as a patch (http://sourceforge.net/tracker/index.php?func=detail&aid=648271&group_id=61702&atid=498105).

I'm a bit bamboozled where to put it in the actual addin code so I'm hoping someone more knowledgeable than me will be able to do that. I imagine it should be invoked after scanning all new e-mail in the inbox and determining that all of it was spam.

Peter Arnold
petera@intrinsica.co.uk

_____________________________________________________________________
This e-mail has been scanned for viruses by the WorldCom Internet Managed Scanning Service - powered by MessageLabs. For further information visit http://www.worldcom.com

From Alexander at Leidinger.net Wed Dec 4 10:03:23 2002
From: Alexander at Leidinger.net (Alexander Leidinger)
Date: Wed, 4 Dec 2002 11:03:23 +0100
Subject: [Spambayes] hammie misquote?
In-Reply-To: References: <15853.12197.292027.777745@montanaro.dyndns.org> <15853.35264.832756.67640@montanaro.dyndns.org> Message-ID: <20021204110323.62d9a18f.Alexander@Leidinger.net> On 03 Dec 2002 21:08:38 -0800 Neale Pickett wrote: > The way mboxutils works currently, it first tries to read the directory > as a maildir (looking for a "cur" subdirectory). Then, if "/Mail/" is > in the pathname, it reads it as an MH directory. Otherwise, it treats > it as a directory of text files and only reads *.txt and *.lorien (what > is this?) files. > > So I guess we could change that last option to read everything, but it > has to be that way for some reason. Anyone care to elucidate this > point? It's the way Tim had it's directory set up (at least this is how I had understand it at the time I implemented the first version of this functionality). *.txt and *.lorien are files with one mail per file. I think *.lorien denotes a particular set of SPAM mails. IMHO we can remove the restriction, I assume it evolved from a quick hack. Bye, Alexander. -- The computer revolution is over. The computers won. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From mwh at python.net Wed Dec 4 11:49:23 2002 From: mwh at python.net (Michael Hudson) Date: 04 Dec 2002 11:49:23 +0000 Subject: [Spambayes] Re: New Application of SpamBayesian tech? References: <15853.29052.186166.31687@montanaro.dyndns.org> Message-ID: <2madjmqa0s.fsf@starship.python.net> Neale Pickett writes: > Skip Montanaro writes: > > > Yes. It also seems to me that web page content filtering proxies (you > > know, keeping your kids or employees from visiting XXX websites) would > > be another good application of the technology. > > Not to mention IDS (Intrusion Detection Systems). 
> > IANAS but I have a friend who is, and he's suggested to me a few times > that it would be very interesting and possibly fruitful to apply > Bayesian analysis to network security. But I think I'm going to have to > pull out the probab/stats book from college before I embark on such a > thing :) I have half a mind to see how it works as a replacement for gnus' adaptive scoring. A harder problem than spam filtering, I guess, but it might be interesting. Cheers, M. From jm at jmason.org Wed Dec 4 11:55:04 2002 From: jm at jmason.org (Justin Mason) Date: Wed, 04 Dec 2002 11:55:04 +0000 Subject: [Spambayes] New Application of SpamBayesian tech? In-Reply-To: Message from Skip Montanaro <15853.29052.186166.31687@montanaro.dyndns.org> Message-ID: <20021204115509.6E55716F17@jmason.org> Skip Montanaro said: > Yes. It also seems to me that web page content filtering proxies (you know, > keeping your kids or employees from visiting XXX websites) would be another > good application of the technology. BTW I'm reasonably sure I saw a patent for bayesian prob analysis applied to web filtering on the IBM database. --j. From tim at fourstonesExpressions.com Wed Dec 4 13:07:08 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 04 Dec 2002 07:07:08 -0600 Subject: [Spambayes] pop3proxy documentation In-Reply-To: Message-ID: 12/4/2002 3:23:22 AM, François Granger wrote: >on 3/12/02 2:56, Remi Ricard at papaDoc@videotron.ca wrote: > >> Since my documentation is not really big I will include it in my email > >Nice job. > >I suggest that instead of mofifying the Option.py file you instruct the guy >to create the bayescustomize.ini file.... There is a configuration script to do this now. OptionsConfig.py. You should definitely have users use this script, rather than manually modify bayescustomize.ini. And you should definitely not instruct them to modify Options.py under any circumstances. It's just a bit too critical. 
If you accidentally screw it up, the whole system dies a horrible death. - TimS > >I added instructions for MacOS with Eudora and Entourage.... > >Please find only the added text below: > >======================================================= >

> >======================================================= > >-- >Le courrier est un moyen de communication. Les gens devraient >se poser des questions sur les implications politiques des choix (ou non >choix) de leurs outils et technologies. Pour des courriers propres : > -- > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From papaDoc at videotron.ca Wed Dec 4 15:11:43 2002 From: papaDoc at videotron.ca (papaDoc) Date: Wed, 04 Dec 2002 10:11:43 -0500 Subject: [Spambayes] pop3proxy documentation In-Reply-To: References: Message-ID: <3DEE1B2F.7010704@videotron.ca> Hi, >>I suggest that instead of mofifying the Option.py file you instruct the guy >>to create the bayescustomize.ini file.... >> >> > >There is a configuration script to do this now. OptionsConfig.py. You should >definitely have users use this script, rather than manually modify >bayescustomize.ini. And you should definitely not instruct them to modify >Options.py under any circumstances. It's just a bit too critical. If you >accidentally screw it up, the whole system dies a horrible death. > Since I don't want to kill anyone I will explain how to use OptionsConfig.py. 
papaDoc

From noreply at sourceforge.net Wed Dec 4 08:59:33 2002
From: noreply at sourceforge.net (noreply@sourceforge.net)
Date: Wed, 04 Dec 2002 00:59:33 -0800
Subject: [Spambayes] [ spambayes-Patches-648271 ] Code to remove the New Mail icon
Message-ID:

Patches item #648271, was opened at 2002-12-04 08:59
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=648271&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Peter Arnold (lardladpa)
Assigned to: Nobody/Anonymous (nobody)
Summary: Code to remove the New Mail icon

Initial Comment:
It would be great if, having processed the newly arrived e-mail and discovered that they were all spam, the addin could remove the New Message icon from the system tray. I know there's no programmatic interface to do this but I found some VB code at http://www.slipstick.com/dev/code/clearenvicon.htm

I've converted the 3 pages of VB to this small bit of python

    import win32gui

    # Locate the outlook window owning the tray icon
    hWnd = win32gui.FindWindow("rctrl_renwnd32", "")
    if hWnd != 0:
        # Send a NIM_DELETE to remove the icon
        nid = (hWnd, 0)
        win32gui.Shell_NotifyIcon(2, nid)
        # Send a WUM_RESETNOTIFICATION to the owning window
        win32gui.SendMessage(hWnd, 1031, 0, 0)

It would be super if this patch could be integrated into the outlook plugin although I'm not quite sure where in the code it would go.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=648271&group_id=61702

From wsy at merl.com Wed Dec 4 15:36:26 2002
From: wsy at merl.com (Bill Yerazunis)
Date: Wed, 4 Dec 2002 10:36:26 -0500
Subject: [Spambayes] Anyone else seeing increasing error rates over time?
Message-ID: <200212041536.gB4FaQP02769@localhost.localdomain>

Thread-Index: AcKbc6q3qg2tOWANRlqgdMLcb73raw==

Over the past few days, I've been seeing an increase in FNs and Unsures.
I initially trained on my inbox and spam folders (386 ham, 999 spam), and since then I've trained on errors only. I'm now at 391 ham and 1011 spam. Initially, I was getting no errors, and 1 or 2 unsures per day. Now, I'm starting to get at least 1 FN per day, and a slight increase in the unsure rate.

It's far too early to tell, but could this be related to Tim's code to handle unbalanced training sets? As time goes on, the spam:ham ratio will increase (as FNs happen more often than FPs) and so the impact of spam clues will be lessened (by Tim's code). I'll keep monitoring this, but my "real life" mail is definitely unbalanced (home is massively biased in favour of spam, work massively biased in favour of ham, but I pre-filter mailing lists which muddies the water badly).

I dunno. Do the testing gurus round here have any idea whether this type of hypothesis could be tested in practice?

I'm seeing an increase in error rates as well. I'm starting to think of it as "evolution in action", that is, it's actually an indication of how fast spam mutates. The errors are new kinds of spam, or at least new topics, or in a new style, and not simple misclassifies in the classic sense.

Looking at the statistics on CRM114, as of today (with the run starting Nov 1):

    Week 1 - zero errors
    Week 2 - zero errors
    Week 3 - two errors
    Week 4 - two errors
    Week 5 - four errors, and it's only Wednesday!

As of the start of week 5, I'm back to Train On Errors on-the-fly, and I'll let you know if that helps or not.

It's too early to really have any assurance that this is the case, but I'll hypothesize that this shows that spam has a measurable nonzero mutation rate, and that mutation rate can be approximated by:

    Total really new spams seen = Spams seen * (e^(kT) - 1)

where T is the elapsed time in days since training stopped, and k is an empirical constant with value of roughly .0001

Paul Moore: see if this predicts your increase in errors.
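
For anyone who wants to plug in their own numbers, the rule can be written as a short Python function. This is only a sketch under two assumptions that the message does not spell out: that the formula reads Spams seen * (e^(kT) - 1), and that "Spams seen" means the cumulative count since training stopped (daily rate times T). The name `expected_new_spams` is made up for the example.

```python
import math

K = 0.0001  # the empirical constant quoted above, per day


def expected_new_spams(spams_per_day, days_since_training, k=K):
    """Estimate how many 'really new' spams have arrived since training stopped.

    Assumes 'Spams seen' is the cumulative total over the period, i.e.
    spams_per_day * days_since_training.
    """
    spams_seen = spams_per_day * days_since_training
    # expm1(x) computes e**x - 1 accurately for the tiny x values used here.
    return spams_seen * math.expm1(k * days_since_training)


if __name__ == "__main__":
    for days in (5, 10, 20, 40):
        print(days, round(expected_new_spams(100, days), 2))
```

At 100 spams a day this gives roughly 0.25 after 5 days and roughly 4 after 20 days. Because kT stays small, the estimate grows roughly with the square of the elapsed time.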
If you get 100 spams a day, and it's been 5 days since you last trained, this rule predicts 1/4 chance of a spam by the 5th day... but 4 spams by the 20th day. HUGE SCREAMING CAVEAT: This equation is pure smoke and mirrors, as I have far too little data to get an error bar that isn't the entire plotting area; a case of "torturing the data until it confesses" sufficient to warrant investigation by the Hague Tribunal. n.b.: The spams I've seen come through since the start of the November run are in general really new, and either 1)written so well that they even fool me into reading for a page or two until I figure out that they're spams (or have me laughing so hard that I keep reading anyway) or 2)written so tersely that it takes some background research to figure out that they're spams. The exception was the first occurrence of "Barnyard Teen" spam (you figure it out...) And gee, just when I thought things had settled down enough that I could sit back and make CRM114 truly 8-bit clean and wchar-safe... -Bill Yerazunis From tim.one at comcast.net Wed Dec 4 17:10:48 2002 From: tim.one at comcast.net (Tim Peters) Date: Wed, 04 Dec 2002 12:10:48 -0500 Subject: [Spambayes] Interesting behaviour from the Outlook client In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > Over the past few days, I've been seeing an increase in FNs and > Unsures. I initially trained on my inbox and spam folders (386 > ham, 999 spam), and since then I've trained on errors only. I'm > now at 391 ham and 1011 spam. Initially, I was getting no errors, > and 1 or 2 unsures per day. Now, I'm starting to get at least 1 > FN per day, and a slight increase in the unsure rate. My experiments with mistake-based training all said it was brittle, due to extreme reliance on hapaxes. That makes it more of a keyword-spotting classifier than a statistical inferencer. 
But since you've trained on only 5 ham + 12 spam since starting mistake-based training, I think this is just evidence that spam is changing. > It's far too early to tell, but could this be related to Tim's > code to handle unbalanced training sets? As time goes on, the > spam:ham ratio will increase (as FNs happen more often than FPs) > and so the impact of spam clues will be lessened (by Tim's code). This is so, and an increase in FN is an expected outcome of the imbalance adjustment, if you have more spam than ham. If you want to experiment with life without the imbalance adjustment, comment out the experimental_ham_spam_imbalance_adjustment: True line in your default_bayes_customize.ini file (in your spambayes Outlook2000 directory). That will make everything look less spammy, so an increase in FP is an expected outcome if you do this. > I'll keep monitoring this, but my "real life" mail is definitely > unbalanced (home is massively biased in favour of spam, work > massively biased in favour of ham, but I pre-filter mailing lists > which muddies the water badly). > > I dunno. Do the testing gurus round here have any idea whether > this type of hypothesis could be tested in practice? What exactly is the hypothesis? Whatever it is , it's certainly testable, but testing w/ Outlook is at best clumsy (testing is easiest if you have a stream of plain-text msgs ordered by time received; getting that out of Outlook is a series of battles). From neale at woozle.org Thu Dec 5 02:35:18 2002 From: neale at woozle.org (Neale Pickett) Date: 04 Dec 2002 18:35:18 -0800 Subject: [Spambayes] busy Message-ID: I've been pretty loud on the list recently so I figured I should let you all know that I've become quite busy with an upcoming internal release at $FIRM, so I'm not going to be very active for a little while. That means you'll all be deprived of my "kamikaze commit" style for a bit. 
But I guess reliability can be nice every now and then, if it's in small amounts Good luck on finishing your article in time, Richie! Neale From tim.one at comcast.net Thu Dec 5 04:01:11 2002 From: tim.one at comcast.net (Tim Peters) Date: Wed, 04 Dec 2002 23:01:11 -0500 Subject: [Spambayes] FW: PyCon DC 2003: Call For Papers Message-ID: In case you missed the announcement, or just need more pressure , the first PyCon is scheduled for the end of March, and Steve would *love* to get a paper on the spambayes project. This is a low-cost conference in Washington, DC, aimimg more at hackers than suits. I expect to be there, but expect my employer would kill me if I so much as mentioned this project. http://www.python.org/pycon/ -----Original Message----- From: python-list-admin@python.org [mailto:python-list-admin@python.org]On Behalf Of Steve Holden Sent: Wednesday, December 04, 2002 4:44 PM To: python-list@python.org; python-announce-list@python.org Subject: PyCon DC 2003: Call For Papers PyCon DC 2003, the first Python Community Conference, has now issued a formal call for papers, which you can read at www.python.org/pycon/cfp.html The organizing committee is interested in any and all submissions for presentations. Traditional presentation styles will doubtless be the norm, but if you would like to experiment with a different format you are encouraged to mail suggestions to pycon-interest at python dot org if you are a subscriber to that list, or to the address given at the foot of this message. Time is short, so please make sure any questions are sent in promptly. We will do our best to turn them around quickly. We look forward to seeing you at PyCon DC 2003, for which registration details should be published shortly. 
Steve Holden mailto:sholden@holdenweb.com PyCon Committee Chair pycondc-2003 at python dot org -- http://mail.python.org/mailman/listinfo/python-list From skip at pobox.com Thu Dec 5 19:40:22 2002 From: skip at pobox.com (Skip Montanaro) Date: Thu, 5 Dec 2002 13:40:22 -0600 Subject: [Spambayes] msg.get_content_type()? Message-ID: <15855.43942.935960.375535@montanaro.dyndns.org> I can't remember the relationship between the email package and Python version. Did we decide that 2.2.2 was required? I'm getting an AttributeError on one machine (running 2.2.1) complaining that a message object doesn't have get_content_type. I should be able to just drop the 2.2.2 or 2.3 email package into site-packages, right? Thx, Skip From richie at entrian.com Thu Dec 5 19:48:39 2002 From: richie at entrian.com (Richie Hindle) Date: Thu, 5 Dec 2002 19:48:39 +0000 Subject: [Spambayes] msg.get_content_type()? In-Reply-To: <15855.43942.935960.375535@montanaro.dyndns.org> References: <15855.43942.935960.375535@montanaro.dyndns.org> Message-ID: <8cdef7d942da49f3.dlg@entrian.com> [Skip] > I can't remember the relationship between the email package and Python > version. Did we decide that 2.2.2 was required? I'm getting an > AttributeError on one machine (running 2.2.1) complaining that a message > object doesn't have get_content_type. I should be able to just drop the > 2.2.2 or 2.3 email package into site-packages, right? Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or better of the email package - you can get that from Python 2.3, or download it from http://mimelib.sf.net. No released version of Python ships with it. -- Richie Hindle richie@entrian.com From tim at fourstonesExpressions.com Thu Dec 5 19:50:08 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 05 Dec 2002 13:50:08 -0600 Subject: [Spambayes] msg.get_content_type()? 
In-Reply-To: <8cdef7d942da49f3.dlg@entrian.com> Message-ID: <7552JHSNM86PFTR1Y6ZRPGRQ11U.3defadf0@riven> 12/5/2002 1:48:39 PM, "Richie Hindle" wrote: > >[Skip] >> I can't remember the relationship between the email package and Python >> version. Did we decide that 2.2.2 was required? I'm getting an >> AttributeError on one machine (running 2.2.1) complaining that a message >> object doesn't have get_content_type. I should be able to just drop the >> 2.2.2 or 2.3 email package into site-packages, right? > >Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or better of >the email package - you can get that from Python 2.3, or download it >from http://mimelib.sf.net. No released version of Python ships with >it. Argh... another external module dependency? - TimS > >-- >Richie Hindle >richie@entrian.com > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From skip at pobox.com Thu Dec 5 19:56:46 2002 From: skip at pobox.com (Skip Montanaro) Date: Thu, 5 Dec 2002 13:56:46 -0600 Subject: [Spambayes] msg.get_content_type()? In-Reply-To: <8cdef7d942da49f3.dlg@entrian.com> References: <15855.43942.935960.375535@montanaro.dyndns.org> <8cdef7d942da49f3.dlg@entrian.com> Message-ID: <15855.44926.841650.437684@montanaro.dyndns.org> Richie> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or Richie> better of the email package - you can get that from Python 2.3, Richie> or download it from http://mimelib.sf.net. No released version Richie> of Python ships with it. Thanks. I pulled the email package from my release22maint branch. It seems to be 2.4.3. Skip From richie at entrian.com Thu Dec 5 21:08:33 2002 From: richie at entrian.com (Richie Hindle) Date: Thu, 5 Dec 2002 21:08:33 +0000 Subject: [Spambayes] msg.get_content_type()? 
In-Reply-To: <7552JHSNM86PFTR1Y6ZRPGRQ11U.3defadf0@riven> References: <7552JHSNM86PFTR1Y6ZRPGRQ11U.3defadf0@riven> Message-ID: [Richie] > you need version 2.4.3 or better of the email package [...] > No released version of Python ships with it. [Tim] > Argh... another external module dependency? This one's always been there - we've always required this version of 'email'. At least this one's pure Python, and the same on all platforms. -- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Fri Dec 6 03:21:46 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 6 Dec 2002 14:21:46 +1100 Subject: [Spambayes] Outlook addin: Removing the tray icon In-Reply-To: Message-ID: > It's always bugged me that Outlook leaves the New Mail icon in the > system tray after a rule or addin has moved or deleted all the newly > arrived e-mail. I know there's no programmatic interface to remove the > icon but I found some Visual Basic code at > http://www.slipstick.com/dev/code/clearenvicon.htm that does the job. > I've converted the three pages of VB to 5 lines of python (!) and > submitted it as a patch > (http://sourceforge.net/tracker/index.php?func=detail&aid=648271&group_i > d=61702&atid=498105). > > I'm a bit bamboozled where to put it in the actual addin code so I'm > hoping someone more knowledgeable than me will be able to do that. I > imagine it should be invoked after scanning all new e-mail in the inbox > and determining that all of it was spam. I'm not sure that this addin is the correct place for this code, or how you picture it working. One way would be that if there are no new items in the Inbox after filtering, the icon is removed. While this sounds OK on the face of it, I'm not sure how useful it will be in the real world - it certainly won't help me. I can't remember the last time my inbox had zero unread items . What I *could* see is a useful standalone addin just for this purpose. This addin would actually *replace* the Outlook item. 
This could be smarter - such as only showing up if there are new items since you last opened outlook. This would be far more useful, but beyond the scope of the SpamBayes addin. It really wouldn't be too hard, and I would be happy to help out with it - we certainly have all the tools available, including a sample that creates a new taskbar icon, etc. It would make a great sample for win32all ;) Of course, if others think that a simple bit of code in the addin would be useful for the majority, then speak up and I will squish it in somewhere... Mark. From tim.one at comcast.net Fri Dec 6 03:36:23 2002 From: tim.one at comcast.net (Tim Peters) Date: Thu, 05 Dec 2002 22:36:23 -0500 Subject: [Spambayes] Outlook addin: Removing the tray icon In-Reply-To: Message-ID: [Peter Arnold] > It's always bugged me that Outlook leaves the New Mail icon in the > system tray after a rule or addin has moved or deleted all the newly > arrived e-mail. > ... [Mark Hammond] > I'm not sure that this addin is the correct place for this code, > or how you picture it working. > > ... [how Mark pictures it working ] ... > > Of course, if others think that a simple bit of code in the addin > would be useful for the majority, then speak up and I will squish > it in somewhere... -1. We (meaning you ...) have enough work here without making this project a dumping ground for generic Outlook annoyances. A distinct Outlook addin would be fine, though. I have to say I've always ignored the New Mail icon, and could never figure out what it thought it was trying to tell me -- it seems to appear and disappear at random. If someone invested a year in figuring out what it's doing, it would be a shame if installing *this* code ruined their hard-won mental model . 
From piersh at friskit.com Fri Dec 6 03:55:14 2002
From: piersh at friskit.com (Piers Haken)
Date: Thu, 5 Dec 2002 19:55:14 -0800
Subject: [Spambayes] Outlook addin: Removing the tray icon
Message-ID: <9891913C5BFE87429D71E37F08210CB92C742A@zeus.sfhq.friskit.com>

I agree. Also it probably wouldn't work in the case where you have unread messages that have been redirected, by inbox rules, to unwatched folders.

Piers.

> -----Original Message-----
> From: Tim Peters [mailto:tim.one@comcast.net]
> Sent: Thursday, December 05, 2002 7:36 PM
> To: Mark Hammond; Peter Arnold
> Cc: spambayes@python.org
> Subject: RE: [Spambayes] Outlook addin: Removing the tray icon
>
>
> [Peter Arnold]
> > It's always bugged me that Outlook leaves the New Mail icon in the
> > system tray after a rule or addin has moved or deleted all the newly
> > arrived e-mail. ...
>
> [Mark Hammond]
> > I'm not sure that this addin is the correct place for this code, or
> > how you picture it working.
> >
> > ... [how Mark pictures it working ] ...
> >
> > Of course, if others think that a simple bit of code in the addin
> > would be useful for the majority, then speak up and I will squish it
> > in somewhere...
>
> -1. We (meaning you ...) have enough work here without
> making this project a dumping ground for generic Outlook
> annoyances. A distinct Outlook addin would be fine, though.
> I have to say I've always ignored the New Mail icon, and
> could never figure out what it thought it was trying to tell
> me -- it seems to appear and disappear at random. If someone
> invested a year in figuring out what it's doing, it would be
> a shame if installing *this* code ruined their hard-won
> mental model .
>
>
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes
>

From mhammond at skippinet.com.au Fri Dec 6 07:17:26 2002
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 6 Dec 2002 18:17:26 +1100
Subject: [Spambayes] Interesting behaviour from the Outlook client
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com>
Message-ID:

> Over the past few days, I've been seeing an increase in FNs and
> Unsures. I initially trained on my inbox and spam folders (386
> ham, 999 spam), and since then I've trained on errors only. I'm
> now at 391 ham and 1011 spam. Initially, I was getting no errors,
> and 1 or 2 unsures per day. Now, I'm starting to get at least 1
> FN per day, and a slight increase in the unsure rate.

I think something is broken. I'm not sure what though :(

I am seeing bizarre stuff that I can't explain, and don't even know how to describe reasonably :(

Eg, recently I saw a clear spam scored as 3%. The spam-clues showed:

    word            spamprob    #ham  #spam
    ...
    'card-swipe'    0.123921       2      0
    'cash-only'     0.123921       2      0

but still lots of obvious spam clues (ie, not everything was screwed). However, I was certain these don't appear in ham, so I did a full re-train. Then, these were correctly identified as only in spam (ie, not in ham), so the spam got a solid 100%.

Interestingly I did a full retrain very recently before this. I suspect incremental retrain is broken, but I haven't looked too far - I just throw this out in speculation that there may be a more subtle bug in the training rather than the algorithm or in the options that control it.

Mark.
From db3l at fitlinxx.com Fri Dec 6 16:25:16 2002 From: db3l at fitlinxx.com (David Bolen) Date: 06 Dec 2002 11:25:16 -0500 Subject: [Spambayes] Re: Outlook addin: Removing the tray icon References: Message-ID: "Peter Arnold" writes: > It's always bugged me that Outlook leaves the New Mail icon in the > system tray after a rule or addin has moved or deleted all the newly > arrived e-mail. Isn't that just because the movement itself does not make the message unread? At least on my system (others experience may vary), the new mail icon behaves pretty consistently - if I have an unread/new message (in any of my folders on any message store) it'll show up when the first such message appears in the system. I have occasionally gotten confused when an automatic rule had moved a new message into a sub-folder that I didn't have expanded in my folder list so it was sort of a hunt to locate that new message. But once the last message is marked read the icon goes away. > I'm a bit bamboozled where to put it in the actual addin code so I'm > hoping someone more knowledgeable than me will be able to do that. I > imagine it should be invoked after scanning all new e-mail in the inbox > and determining that all of it was spam. It would seem to me to be cleaner if we just considered adding an option to the addin to mark messages it moved as read (probably two choices - one for the spam and one for the maybe spam), which should accomplish the same thing with respect to the icon but let Outlook itself take care of it following it's normal rules. I know that at the moment pretty much the only thing I'm doing with the spam folder contents is selecting all the messages that were moved into it by the addin, and marking them as read (since I'm still saving them for future retrainings). So the option to mark them read automatically might be attractive. 
A downside is that it would be harder to notice just which messages were recently moved there, but if you just make semi-regular scans of the folder that's probably minor. -- David From db3l at fitlinxx.com Fri Dec 6 16:34:26 2002 From: db3l at fitlinxx.com (David Bolen) Date: 06 Dec 2002 11:34:26 -0500 Subject: [Spambayes] Re: Interesting behaviour from the Outlook client References: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com> Message-ID: "Mark Hammond" writes: > I think something is broken. I'm not sure what though :( > > I am seeing bizarre stuff that I can't explain, and don't even know how to > describe reasonably :( Eg, recently I saw a clear spam scored as 3%. The > spam-clues showed: For another data point, in case it's related. I've been seeing sporadic spams scoring close to 0 although they're clearly spam. If I dump the spam clues, it actually shows *S* as 1 and *H* around 1e-6 (so clearly spam) but the Spam field in the message in Outlook still shows a very low value (sometimes 0%). If all I do is leave the message as unread and re-run the filter on unread mail it will rescore it as 100% and move it to the spam folder. I'm also occasionally seeing messages fail to show a score in their Spam column. It leads me to think that somehow the wrong message is being operated on at some point and/or the result stored in the wrong location. Since bringing up the spam-clues window appears to re-score the message, it would seem as if the earlier attempt either used the wrong message for the purpose of scoring (but stored it in the spam message) or in some other way got out of sync. It doesn't seem like the scoring itself is inaccurate. There is, however, no direct one-to-one correlation between messages missing a score and these bad spam scores. 
A particularly interesting point is that in these cases, I have been unable to find the affected message listed in the trace window - in either the case where it happens as new mail when the client is already running, or as unread mail detected upon client startup. But the message has clearly had its spam field updated. -- David PS: In what is more surely an Outlook issue, has anyone else had messages put back to an unread status after they've already been read? It happens semi-frequently to me (and sometimes correlated with a failure to update the spam column in the display until I switch out of and back into the folder). I'm assuming it's a race condition between my viewing the message clearing the unread bit and something that the addin is doing setting it back, but it's tricky to isolate a regular procedure for reproducing. From popiel at wolfskeep.com Fri Dec 6 16:37:24 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri, 06 Dec 2002 08:37:24 -0800 Subject: [Spambayes] Outlook addin: Removing the tray icon In-Reply-To: Message from Tim Peters References: Message-ID: <20021206163724.83A0E2DED6@cashew.wolfskeep.com> In message: Tim Peters writes: > >I have to say I've always ignored the New Mail icon, >and could never figure out what it thought it was trying to tell me -- it >seems to appear and disappear at random. If someone invested a year in >figuring out what it's doing, it would be a shame if installing *this* code >ruined their hard-won mental model . Turning off a few options (like the mail preview pane, which is a security risk anyway) makes the behaviour much more obvious: The new mail icon appears whenever new mail arrives, and the Outlook client is running. The new mail icon disappears whenever any message which had been unread gets marked as read, or the Outlook client exits. 
Note that being shown in the preview pane is often enough to get a message marked as read, so under some options configurations merely bringing the Outlook window to the foreground when the current message is a new message (due to the folder being previously empty or the click to raise the window also selecting a new message or somesuch) is enough to (apparently inconsistently) make the new mail icon disappear. - Alex (who suffers with Outlook at work) From Paul.Moore at atosorigin.com Fri Dec 6 16:38:35 2002 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Fri, 6 Dec 2002 16:38:35 -0000 Subject: [Spambayes] Re: Interesting behaviour from the Outlook client Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E76@UKDCX001.uk.int.atosorigin.com> From: David Bolen [mailto:db3l@fitlinxx.com] > I'm also occasionally seeing messages fail to show a score in their > Spam column. It leads me to think that somehow the wrong message is > being operated on at some point and/or the result stored in the wrong > location. [...] > PS: In what is more surely an Outlook issue, has anyone else had > messages put back to an unread status after they've already been read? > It happens semi-frequently to me (and sometimes correlated with a > failure to update the spam column in the display until I switch out of > and back into the folder). I'm assuming it's a race condition between > my viewing the message clearing the unread bit and something that the > addin is doing setting it back, but it's tricky to isolate a regular > procedure for reproducing. Yes, I see both of these behaviours. I also can't find a consistent pattern, but I think you're right that it's a race condition or synchronisation problem of some sort... Paul From barry at python.org Fri Dec 6 16:54:49 2002 From: barry at python.org (Barry A. Warsaw) Date: Fri, 6 Dec 2002 11:54:49 -0500 Subject: [Spambayes] msg.get_content_type()? 
References: <15855.43942.935960.375535@montanaro.dyndns.org> <8cdef7d942da49f3.dlg@entrian.com>
Message-ID: <15856.54873.377409.136435@gargle.gargle.HOWL>

>>>>> "RH" == Richie Hindle writes:

    RH> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or
    RH> better of the email package - you can get that from Python
    RH> 2.3, or download it from http://mimelib.sf.net. No released
    RH> version of Python ships with it.

Wrong.

Python 2.2.2 (#1, Oct 15 2002, 12:24:47)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> email.__version__
'2.4.3'
>>> from email.Message import Message
>>> Message.get_content_type
>>> email.__file__
'/usr/local/lib/python2.2/email/__init__.pyc'

-Barry

From tim at fourstonesExpressions.com Fri Dec 6 16:59:13 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 06 Dec 2002 10:59:13 -0600
Subject: [Spambayes] msg.get_content_type()?
In-Reply-To: <15856.54873.377409.136435@gargle.gargle.HOWL>
Message-ID:

12/6/2002 10:54:49 AM, barry@python.org (Barry A. Warsaw) wrote:

>
>>>>>> "RH" == Richie Hindle writes:
>
> RH> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or
> RH> better of the email package - you can get that from Python
> RH> 2.3, or download it from http://mimelib.sf.net. No released
> RH> version of Python ships with it.
>
>Wrong.
>
>Python 2.2.2 (#1, Oct 15 2002, 12:24:47)
>[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> import email
>>>> email.__version__
>'2.4.3'
>>>> from email.Message import Message
>>>> Message.get_content_type
>
>>>> email.__file__
>'/usr/local/lib/python2.2/email/__init__.pyc'
>
>-Barry

Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
IDLE 0.8 -- press F1 for help
>>> import email
>>> email.__version__
'2.4.3'
>>> from email.Message import Message
>>> Message.get_content_type
>>> email.__file__
'C:\\Program Files\\Python2.2\\lib\\email\\__init__.pyc'

- TimS

c'est moi - TimS
www.fourstonesExpressions.com

From barry at python.org Fri Dec 6 17:03:35 2002
From: barry at python.org (Barry A. Warsaw)
Date: Fri, 6 Dec 2002 12:03:35 -0500
Subject: [Spambayes] msg.get_content_type()?
References: <15856.54873.377409.136435@gargle.gargle.HOWL>
Message-ID: <15856.55399.635079.816568@gargle.gargle.HOWL>

>>>>> "TS" == Tim Stone writes:

    TS> Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)]

Specifically, Python 2.2.x where x < 2 is /not/ sufficient.

-Barry

From tim at fourstonesExpressions.com Fri Dec 6 21:46:26 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 06 Dec 2002 15:46:26 -0600
Subject: [Spambayes] A followup to an article on spammers...
Message-ID:

Totally hilarious...

http://www.freep.com/money/tech/mwend6_20021206.htm

c'est moi - TimS
www.fourstonesExpressions.com

From mhammond at skippinet.com.au Sat Dec 7 01:41:22 2002
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 7 Dec 2002 12:41:22 +1100
Subject: [Spambayes] RE: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6
In-Reply-To:
Message-ID:

[Me]
> > In the back of my mind, I am pondering if we need a better directory
> > structure - maybe with the core engine in a package, and some of these
> > "wrappers" used only by a few application also into their own?

[Richie]
> Isn't this also YAGNI? We have a few tens of Python files in the
> project -
> do we really need to split it up? And if we do, should we be
> doing it with the code this young?

Yeah, I think so :) We all know young minds are easily manipulated, so
getting the code while it is young is good.

The main directory has 46 or so .py files in it now, which is getting too
many. There are 3 clear categories:

* Main engine.
* pop3proxy application.
* Test code.

Even just making this split would be a good thing. If we can factor some
commonly used "application base classes" (ie, the intent of Corpus.py etc),
then these could stay in the main directory (contradicting what I said above
). I don't want to go overboard, but I think something could be done.

The longer we leave it, the harder it gets. I don't have a real strong
opinion, but am bringing this up because I feel it now, not simply because
it offends my sensibilities

Actually-running-quite-low-on-sensibilities ly,

Mark.

From piersh at friskit.com Sat Dec 7 02:52:54 2002
From: piersh at friskit.com (Piers Haken)
Date: Fri, 6 Dec 2002 18:52:54 -0800
Subject: [Spambayes] Using spambayes with outlook XP's hotmail connector
Message-ID: <9891913C5BFE87429D71E37F08210CB929751A@zeus.sfhq.friskit.com>

The following patch allows spambayes to correctly filter messages on
hotmail when using Outlook XP's hotmail connector.

It simply ignores the exception that occurs when spambayes tries to set
the 'spam' field on a message which resides on hotmail - the hotmail
connector doesn't support such property changes.

Piers.

Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.36
diff -u -r1.36 msgstore.py
--- msgstore.py 25 Nov 2002 05:57:41 -0000 1.36
+++ msgstore.py 7 Dec 2002 02:35:03 -0000
@@ -631,7 +631,10 @@
 
     def Save(self):
         assert self.dirty, "asking me to save a clean message!"
-        self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE | USE_DEFERRED_ERRORS)
+        try:
+            self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE | USE_DEFERRED_ERRORS)
+        except:
+            pass
         self.dirty = False
 
     def _DoCopyMove(self, folder, isMove):

From lists at webcrunchers.com Sat Dec 7 10:26:52 2002
From: lists at webcrunchers.com (John D.)
Date: Sat, 7 Dec 2002 02:26:52 -0800
Subject: [Spambayes] All these Default Mailboxes in OpenBSD. What for?
Message-ID:

How can I stop mail from the outside coming into "root", yet still allow
internal system mail to go to root?

# Basic system aliases -- these MUST be present
MAILER-DAEMON: postmaster
postmaster: root

# General redirections for pseudo accounts
bin: root
daemon: root
named: root
nobody: root
operator: root
uucp: root
www: root
ftp-bugs: root
popa3d: root
proxy: root
smmsp: root
sshd: root
_portmap: root
_rstatd: root
_identd: root
_rusersd: root
_fingerd: root
_x11: root

Why MUST these be present?
We are getting an unusual amount of spam mail sent to these Email addresses,
and want to know why these are created in the first place.

John

From skip at pobox.com Sat Dec 7 19:11:40 2002
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 7 Dec 2002 13:11:40 -0600
Subject: [Spambayes] You talk, it types - interesting irony...
Message-ID: <15858.18412.476854.880620@montanaro.dyndns.org>

I find it mildly interesting that spammers are hawking Dragon Systems'
Naturally Speaking, while Tim Peters, a former Dragon rocket scientist has
been actively working on a tool to thwart such hawking. ;-)

Skip

From skip at pobox.com Sat Dec 7 20:12:18 2002
From: skip at pobox.com (Skip Montanaro)
Date: Sat, 7 Dec 2002 14:12:18 -0600
Subject: [Spambayes] Is this the right way to untrain a group of messages?
Message-ID: <15858.22050.915577.31246@montanaro.dyndns.org>

I just realized I fed the wrong mbox file to hammie.
After poking around a bit I came up with this untrain sequence: mbox = mboxutils.getmbox("/Users/skip/tmp/newham") h = hammie.open("hammie.db", mode='w') for msg in mbox: h.untrain_ham(msg) h.store() It seemed to work, but is that the right way to untrain an mbox file? Skip From tim at fourstonesExpressions.com Sat Dec 7 20:13:26 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat, 07 Dec 2002 14:13:26 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: <15858.22050.915577.31246@montanaro.dyndns.org> Message-ID: 12/7/2002 2:12:18 PM, Skip Montanaro wrote: > >I just realized I fed the wrong mbox file to hammie. After poking around a >bit I came up with this untrain sequence: > > mbox = mboxutils.getmbox("/Users/skip/tmp/newham") > h = hammie.open("hammie.db", mode='w') > for msg in mbox: > h.untrain_ham(msg) > h.store() > >It seemed to work, but is that the right way to untrain an mbox file? It is if you originally trained as ham... - TimS > >Skip > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From skip at pobox.com Sat Dec 7 21:38:10 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 15:38:10 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: References: <15858.22050.915577.31246@montanaro.dyndns.org> Message-ID: <15858.27202.880939.275746@montanaro.dyndns.org> >> It seemed to work, but is that the right way to untrain an mbox file? Tim> It is if you originally trained as ham... - TimS Thanks, yes, I did. It wasn't that I was supposed to train as spam, but that I fed it the unclean ham mbox (still had SpamAssassin and VM headers). 
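
As a rough illustration of why untraining only undoes the damage when it is applied to the same category the messages were trained under, here is a toy model. It is a sketch under stated assumptions, not the SpamBayes classifier or the hammie API: `ToyCounts`, `train_ham`, `untrain_ham` and `snapshot` are hypothetical names, and the only assumption is that the classifier keeps a per-category message count plus per-token counts.

```python
from collections import Counter


class ToyCounts:
    """Toy stand-in for the ham-side bookkeeping of a classifier (hypothetical)."""

    def __init__(self):
        self.nham = 0
        self.ham_tokens = Counter()

    def train_ham(self, tokens):
        # Each message counts once, and each distinct token counts once per message.
        self.nham += 1
        for tok in set(tokens):
            self.ham_tokens[tok] += 1

    def untrain_ham(self, tokens):
        # Exact inverse of train_ham; tokens whose count reaches zero are dropped.
        self.nham -= 1
        for tok in set(tokens):
            self.ham_tokens[tok] -= 1
            if self.ham_tokens[tok] == 0:
                del self.ham_tokens[tok]

    def snapshot(self):
        return self.nham, dict(self.ham_tokens)


counts = ToyCounts()
counts.train_ham(["meeting", "tuesday"])
before = counts.snapshot()

# Messages from the "wrong" mbox, trained as ham by mistake (made-up tokens).
wrong_mbox = [["x-spam-status", "meeting"], ["vm-header", "lunch"]]
for tokens in wrong_mbox:
    counts.train_ham(tokens)
for tokens in wrong_mbox:
    counts.untrain_ham(tokens)

print(counts.snapshot() == before)
```

In this toy, untraining the same messages from the same category restores the earlier state; untraining them from the other category would instead leave both sets of counts wrong.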
Skip From tim.one at comcast.net Sat Dec 7 21:41:46 2002 From: tim.one at comcast.net (Tim Peters) Date: Sat, 07 Dec 2002 16:41:46 -0500 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: <15858.27202.880939.275746@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > It wasn't that I was supposed to train as spam, but that I fed it the > unclean ham mbox (still had SpamAssassin and VM headers). Skip, unless you've changed the defaults, such headers are ignored. From skip at pobox.com Sat Dec 7 21:59:34 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 15:59:34 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: References: <15858.27202.880939.275746@montanaro.dyndns.org> Message-ID: <15858.28486.462125.381180@montanaro.dyndns.org> Skip> It wasn't that I was supposed to train as spam, but that I fed it Skip> the unclean ham mbox (still had SpamAssassin and VM headers). Tim> Skip, unless you've changed the defaults, such headers are ignored. Yeah, I realize that, however, by default SA stuffs a block of scoring information at the top of spam message bodies, so I needed to run unheader.py to get rid of that. As long as I was at it, I figured I might as well get rid of the VM-related headers. On the off-chance that I ever visit that mbox in VM the offsets in those headers would all be bogus. Skip From skip at pobox.com Sun Dec 8 03:00:01 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 21:00:01 -0600 Subject: [Spambayes] using binary pickles makes for much smaller databases Message-ID: <15858.46513.333137.130764@montanaro.dyndns.org> I was messing around with various things today. One thing I tried is to modify Python's shelve.py and Spambayes' storage.py to allow and use binary pickles. 
Before: -rw-rw-r-- 1 skip staff 20914176 Dec 7 18:20 hammie.db After: -rw-rw-r-- 1 skip staff 10874880 Dec 7 18:32 hammie.db In both cases I trained 13144 hams and 6662 spams starting with no hammie.db file. The databases each wound up with 324310 keys. The times seemed about the same: 324.66user+62.30sys for the ascii version and 322.89user+60.61sys for the binary version. The wall clock times weren't comparable because I was doing other things as they ran. Attached are diffs for Python's Lib/shelve.py and Spambayes' storage.py. I believe they should both be backward compatible though I haven't tested it. Let me know if you think they are reasonable changes. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: shelve.diff Type: application/octet-stream Size: 1342 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021207/4d69f2f0/shelve.exe -------------- next part -------------- A non-text attachment was scrubbed... Name: storage.diff Type: application/octet-stream Size: 1078 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021207/4d69f2f0/storage.exe From skip at pobox.com Sun Dec 8 04:27:55 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 22:27:55 -0600 Subject: [Spambayes] More messin' around - common email prefixes Message-ID: <15858.51787.475757.151843@montanaro.dyndns.org> I modified the tokenizer to generate tokens related to common prefixes in email addresses. One observation several people have made is that some spammers send out email to clumps of alphabetically similar addresses. One spam I received recently was sent to To: Cc: , , , , I fooled around a bit generating tokens that take into account the length of the common prefix and the number of recipients. I generate tokens that are the product of the length of the common prefix and the number of recipients divided by 10. 
In the above case I score it a '4' ((6 * 7) // 10). I only generate the
token if there is more than one recipient and a non-zero common prefix.
Here's the distribution of tokens in my database (13144 hams, 6662 spams):

('pfxlen:0', (18, 209))
('pfxlen:1', (48, 32))
('pfxlen:2', (42, 10))
('pfxlen:3', (24, 2))
('pfxlen:4', (23, 0))
('pfxlen:5', (16, 0))
('pfxlen:6', (16, 0))
('pfxlen:7', (11, 0))
('pfxlen:8', (6, 0))
('pfxlen:9', (4, 0))
('pfxlen:10', (5, 0))
('pfxlen:11', (1, 0))
('pfxlen:12', (1, 0))
('pfxlen:14', (1, 0))
('pfxlen:17', (1, 0))
('pfxlen:18', (1, 0))
('pfxlen:19', (1, 0))
('pfxlen:24', (1, 0))
('pfxlen:28', (1, 0))

Not too surprisingly, higher scores are more associated with spam than with
ham. This distribution suggests to me that perhaps I should squash that to
two distinct tokens, one for scores of 0 or 1, and one for all higher
scores. I'll try that out in a bit.

Skip

From richie at entrian.com Sun Dec 8 15:24:47 2002
From: richie at entrian.com (Richie Hindle)
Date: Sun, 08 Dec 2002 15:24:47 +0000
Subject: [Spambayes] msg.get_content_type()?
In-Reply-To: <15856.54873.377409.136435@gargle.gargle.HOWL>
References: <15855.43942.935960.375535@montanaro.dyndns.org> <8cdef7d942da49f3.dlg@entrian.com> <15856.54873.377409.136435@gargle.gargle.HOWL>
Message-ID: <36h6vucleapmke20g0muvc963pit13g0j6@4ax.com>

[Richie]
> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or
> better of the email package - you can get that from Python
> 2.3, or download it from http://mimelib.sf.net. No released
> version of Python ships with it.

[Barry]
> Wrong.
>
> Python 2.2.2 (#1, Oct 15 2002, 12:24:47)
> [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import email
> >>> email.__version__
> '2.4.3'

Oops, my mistake. That's good news.

-- 
Richie Hindle
richie@entrian.com

From richie at entrian.com Sun Dec 8 15:25:06 2002
From: richie at entrian.com (Richie Hindle)
Date: Sun, 08 Dec 2002 15:25:06 +0000
Subject: [Spambayes] using binary pickles makes for much smaller databases
In-Reply-To: <15858.46513.333137.130764@montanaro.dyndns.org>
References: <15858.46513.333137.130764@montanaro.dyndns.org>
Message-ID:

> modify Python's shelve.py and Spambayes' storage.py to allow and use binary
> pickles.

Good plan! (I had no idea that shelve used text pickles.)

Below is an alternative implementation that avoids the need to change
shelve.py, though it's a slight hack in that a future version of shelve
could potentially break it by not keeping its pickler in a module global
called 'Pickler'. This goes at the top of storage.py:

---------------------------------------------------------------------------
# Make shelve use binary pickles by default.
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
    return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler
---------------------------------------------------------------------------

This gives me 335,872 bytes in 21 seconds vs. 679,936 bytes in 26 seconds.
These are wall-clock times on an otherwise-idle Win98 box for training on
200 messages. This is backwards-compatible too - I can still use my
existing database with no problems.

Can anyone see a problem with this code (or is anyone offended by grubbing
around with shelve.Pickler)? What if one of the DBMs supported by anydbm
doesn't support values with embedded NULL characters for instance? (Seems
unlikely.)

Skip, your patch to shelve.py looks like a good candidate for inclusion into
Python itself, assuming there really is no problem using binary pickles via
shelve/anydbm.
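
For readers who want to see the text-versus-binary size difference without touching shelve or a real database, the following is a minimal sketch using only the standard `pickle` module. The sample records and the helper name `pickled_sizes` are made up for illustration, and the byte counts it prints are not comparable to the hammie.db figures quoted in this thread. Recent Python versions already default to a binary pickle protocol, so this is mainly of historical interest.

```python
import pickle


def pickled_sizes(record):
    """Return (text_size, binary_size) in bytes for one pickled record."""
    text = pickle.dumps(record, protocol=0)    # protocol 0: the ASCII "text" format
    binary = pickle.dumps(record, protocol=2)  # protocol 2: a binary format
    return len(text), len(binary)


# Hypothetical token -> (count, count) records, only for illustration.
sample = {"token%d" % i: (i, i * 2) for i in range(1000)}
text_size, binary_size = pickled_sizes(sample)
print("text: %d bytes, binary: %d bytes" % (text_size, binary_size))

# Either encoding loads back to the same object, which is why switching
# the writer does not by itself stop older text pickles from being read.
assert pickle.loads(pickle.dumps(sample, protocol=0)) == sample
assert pickle.loads(pickle.dumps(sample, protocol=2)) == sample
```

The pickle format is self-describing, so a reader does not need to be told which protocol was used when the data was written.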
-- Richie Hindle richie@entrian.com From richie at entrian.com Sun Dec 8 15:25:19 2002 From: richie at entrian.com (Richie Hindle) Date: Sun, 08 Dec 2002 15:25:19 +0000 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: References: Message-ID: [Mark] > The main directory has 46 or so .py files in it now, which is getting too > many. You're probably right - now that I look at it again, it is getting a bit crowded in there. > There are 3 clear categories: > > * Main engine. > * pop3proxy application. > * Test code. There are also the command-line applications: hammie*.py and mboxtrain.py. I think they're in a different category from the "main engine" code. > Even just making this split would be a good thing. If we can factor some > commonly used "application base classes" (ie, the intent of Corpus.py etc), > then these could stay in the main directory (contradicting what I said above > ). I don't want to go overboard, but I think something could be done. I think you're right, but I also think it itches you more than it itches me (must be the hot Aussie summer - it's freezing in the UK). One possible issue: will we lose CVS history by moving files about? Does SourceForge give us the ability to move a file and its CVS history together? -- Richie Hindle richie@entrian.com From popiel at wolfskeep.com Sun Dec 8 16:47:36 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sun, 08 Dec 2002 08:47:36 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: Message from Richie Hindle References: Message-ID: <20021208164736.15A292DED1@cashew.wolfskeep.com> In message: Richie Hindle writes: > >[Mark] >> The main directory has 46 or so .py files in it now, which is getting too >> many. > >You're probably right - now that I look at it again, it is getting a bit >crowded in there. > >> There are 3 clear categories: >> >> * Main engine. 
>> * pop3proxy application.
>> * Test code.
>
>There are also the command-line applications: hammie*.py and mboxtrain.py.
>I think they're in a different category from the "main engine" code.

I agree that the breakup is a good thing, and that hammie* and
friends should be in their own category. As a python newbie, I
wonder if breaking it up will complicate invocation, though...
will the spambayes core stuff have to be placed someplace special
(on a module search path?) to be found by the various front-ends
in their subdirectories?

- Alex

From Alexander at Leidinger.net Sun Dec 8 18:56:41 2002
From: Alexander at Leidinger.net (Alexander Leidinger)
Date: Sun, 8 Dec 2002 19:56:41 +0100
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6
In-Reply-To:
References:
Message-ID: <20021208195641.27ba9bba.Alexander@Leidinger.net>

On Sun, 08 Dec 2002 15:25:19 +0000 Richie Hindle wrote:

> One possible issue: will we lose CVS history by moving files about? Does
> SourceForge give us the ability to move a file and its CVS history
> together?

Removing a file puts it into the attic (a special directory in the CVS
repository). You can still get it from there. CVS itself doesn't have a
"move" command; if you have shell access to the CVS repository, you can
copy the xxx,v files to the new location and "cvs remove" them in the
old location (directly moving them in the repository is not an option,
because you can't go back to an old version then).

Bye,
Alexander.

-- 
Actually, Microsoft is sort of a mixture between the Borg and the Ferengi.
http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From noreply at sourceforge.net Sun Dec 8 18:39:39 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Sun, 08 Dec 2002 10:39:39 -0800 Subject: [Spambayes] [ spambayes-Bugs-650496 ] hammie.py discards headers Message-ID: Bugs item #650496, was opened at 2002-12-08 18:39 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Simon Baatz (bnomis26) Assigned to: Nobody/Anonymous (nobody) Summary: hammie.py discards headers Initial Comment: When feeding the (malformed) attached mail to hammie.py in filter mode, the headers of the mail are not present in the output. Command line: python hammie.py -f -d -p ~/mail/hammie.db < msg.lAoM Output: X-Spambayes-Classification: ham; 0.00 --Amazon.com_multipart_boundary____________ Content-Type: text/plain; charset=iso-8859-1 Vielen Dank für Ihre Bestellung bei Amazon.de. --Amazon.com_multipart_boundary____________ Content-Type: text/html; charset=iso-8859-1 --Amazon.com_multipart_boundary____________-- ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702 From Todd.Miller at courtesan.com Mon Dec 9 15:17:52 2002 From: Todd.Miller at courtesan.com (Todd C. Miller) Date: Mon Dec 9 17:34:04 2002 Subject: [Spambayes] Re: All these Default Mailboxes in OpenBSD. What for? In-Reply-To: Your message of "Sat, 07 Dec 2002 02:26:52 PST." References: Message-ID: <200212092217.gB9MHqkL000956@xerxes.courtesan.com> In message so spake "John D." 
(lists): > # General redirections for pseudo accounts > bin: root > daemon: root > named: root > nobody: root > operator: root > uucp: root > www: root > ftp-bugs: root > popa3d: root > proxy: root > smmsp: root > sshd: root > _portmap: root > _rstatd: root > _identd: root > _rusersd: root > _fingerd: root > _x11: root > > Why MUST these be present? > We are getting an unusual amount od spam mail sent to these Email addresses, > and want to know why these are created in the first place. These are all pseudo-users. They either own files in the filesystem or act as unprivileged users that certain commands run as. If they are not aliases to something, then any mail that happens to come for them will just end up in /var/mail, which could eventually fill up /var. There's no reason they have to point to anything real, though. They could just go to /dev/null if you want (though operator is often used as real user and some people mail www instead of webmaster). - todd From skip at pobox.com Mon Dec 9 22:51:19 2002 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 9 23:51:10 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? Message-ID: <15861.29383.516933.165140@montanaro.dyndns.org> I just tried runing pop3proxy as python pop3proxy.py -t -b -p hammie.db -d and fetch the two sample messages it has. They come through just fine but don't appear to be scored. Is that a property of the test mode or will I run into that problem when grabbing mail from a real server as well? Version is recent CVS - updated earlier this evening. Thx, Skip From skip at pobox.com Mon Dec 9 23:06:58 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 10 00:06:48 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? In-Reply-To: <15861.29383.516933.165140@montanaro.dyndns.org> References: <15861.29383.516933.165140@montanaro.dyndns.org> Message-ID: <15861.30322.597106.427808@montanaro.dyndns.org> Skip> ... and fetch the two sample messages [pop3proxy] has. 
They come Skip> through just fine but don't appear to be scored. I am seeing the same behavior from hammie.py. I must be muffing something. Here's my .ini file:

[Hammie]
hammie_debug_header: True

[Tokenizer]
address_headers: from to cc
generate_recipients: true

Skip From tim at fourstonesExpressions.com Tue Dec 10 00:06:43 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Dec 10 01:07:24 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? In-Reply-To: <15861.29383.516933.165140@montanaro.dyndns.org> Message-ID: <83XWGCWQHDLFXVBAS4ZDCGEIDLDD0.3df58473@riven> No, Skip, there appears to be something wrong... I don't have time to look at it tonight, but this doesn't sound right. - TimS 12/9/2002 10:51:19 PM, Skip Montanaro wrote: >I just tried running pop3proxy as > > python pop3proxy.py -t -b -p hammie.db -d > >and fetch the two sample messages it has. They come through just fine but >don't appear to be scored. Is that a property of the test mode or will I >run into that problem when grabbing mail from a real server as well? > >Version is recent CVS - updated earlier this evening. > >Thx, > >Skip > > c'est moi - TimS www.fourstonesExpressions.com From richie at entrian.com Tue Dec 10 13:10:16 2002 From: richie at entrian.com (Richie Hindle) Date: Tue Dec 10 08:10:21 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? In-Reply-To: <15861.29383.516933.165140@montanaro.dyndns.org> References: <15861.29383.516933.165140@montanaro.dyndns.org> Message-ID: <3dpbvuo01o5pnce5ejo3bj8gl52a0ru89l@4ax.com> [Skip] > I just tried running pop3proxy as > > python pop3proxy.py -t -b -p hammie.db -d > > and fetch the two sample messages it has. They come through just fine but > don't appear to be scored.
Is that a property of the test mode or will I > run into that problem when grabbing mail from a real server as well? That's a property of test mode. "pop3proxy -t" runs a test server that serves up two unscored messages. I'm surprised it even accepts the other switches with -t. I should really put the test server in a different source file - if you want to use the proxy to score the test messages, you should run one "pop3proxy -t" and one "pop3proxy " - the first will run the test server and the second will run the proxy itself. I don't know why hammie isn't scoring things, but the test POP3 server is a red herring. -- Richie Hindle richie@entrian.com From skip at pobox.com Tue Dec 10 08:16:24 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 10 09:16:21 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? In-Reply-To: <15861.30322.597106.427808@montanaro.dyndns.org> References: <15861.29383.516933.165140@montanaro.dyndns.org> <15861.30322.597106.427808@montanaro.dyndns.org> Message-ID: <15861.63288.43031.18877@montanaro.dyndns.org> Skip> ... and fetch the two sample messages [pop3proxy] has. They come Skip> through just fine but don't appear to be scored. Skip> I am seeing the same behavior from hammie.py. I must be muffing Skip> something. Aside from the bogus .ini file there were several new modules (at least Corpus, storage and dbmstorage) which weren't being installed. I think I'm all set now. Sorry for the false alarm(s). Skip From bkc at murkworks.com Tue Dec 10 09:57:22 2002 From: bkc at murkworks.com (Brad Clements) Date: Tue Dec 10 09:51:53 2002 Subject: [Spambayes] Anyone find this spam interesting? Message-ID: <3DF5B90A.3986.1EC6AF74@localhost> Just got this, it's .. hmm. First, some relevant headers: To: Subject: {%CRAND2%}Need extra Cash? - Get Paid in 48 HRS!
- Home Reps Needed{%CRAND1%} Date: Tue, 10 Dec 2002 08:53:30 +0700 MiME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_00D6_55E22C0A.B3647D70" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: AOL 7.0 for Windows US sub 118 Importance: Normal ------=_NextPart_000_00D6_55E22C0A.B3647D70 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: base64 eyVDUkFORDYlfQ0KDQpJbW1lZGlhdGUgSGVscCBOZWVkZWQuICBXZSBhcmUg YSAuY29tIGNvcnBvcmF0aW9uIHRoYXQgaXMgZ3Jvd2luZyBhdCBhIHRyZW1l Here's the text of the base64 decoded portion: {%CRAND6%} Immediate Help Needed. We are a .com corporation that is growing at a tremendous rate of over 1000% per year. We simply cannot keep up. We are looking for motivated individuals who are looking to earn a substantial income working from home. This is a real world opportunity to make an excellent income from home. No experience is required. We will provide you with any training you may need. We are looking for energetic and self motivated people. If that is you, then click on the link below and complete our online information request form, and one of our employment specialist will contact you. http://www.digitalcraftsmanship.com/pg.htm So if you are looking to be employed at home, with a career that will provide you vast opportunities and a substantial income, please fill out our online information request form here now: http://www.digitalcraftsmanship.com/pg.htm Take a Look {%CRAND8%} ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ Your email address was obtained from an opt-in list. If you wish to be deleted from this list, please click on the following link: http://www.digitalcraftsmanship.com/remove/remove.html and you will be removed from the list. If you have previously dealt with this matter and are still receiving this message, you may call our Abuse Control Center at 1-866-667-5398, or write us at: NOUCE1, 6822 22nd Ave. N., St. Petersburg, FL 33710-3918. 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ {%CRAND9%} Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From noreply at sourceforge.net Tue Dec 10 02:42:06 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Tue Dec 10 10:14:08 2002 Subject: [Spambayes] [ spambayes-Bugs-651365 ] getattr recursion in Corpus.py Message-ID: Bugs item #651365, was opened at 2002-12-10 11:42 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651365&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: getattr recursion in Corpus.py Initial Comment: After feeding a bunch of new messages into pop3proxy, classifying them and when trying to save the result, I got a recursion loop (followed by recursion depth exceeded) in \cvshome\spambayes\Corpus.py|__getattr__|269] After looking into setSubstance, I noticed that setSubstance (called by load) only sets the attributes payload and hdrtxt when the pattern matches. I temporarily added an else clause to bmatch, i.e.

    if bmatch:
        self.payload = bmatch.group(2)
        self.hdrtxt = sub[:bmatch.start(2)]
        print ".",
    else:
        self.payload = "nix\r\n"
        self.hdrtxt = "nix\r\n"
        print "?", len(sub),

and indeed, when trying to save, I notice that after about 800 good messages, ~ 100 have an empty message, see the output below. I don't really know what I'm doing here, but this fix at least allows me to continue.
-------------------------
C:\archiv\cvshome\spambayes>python -u pop3proxy.py -l 8110 mail.gmd.de
Loading database... Done.
Listener on port 8110 is proxying mail:110
User interface url is http://localhost:8880
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 
0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 . . . . . . . . . . . . . . . 
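A minimal sketch of how the recursion described in this report can arise, assuming (this is not taken from Corpus.py) that __getattr__ calls load() lazily and then retries the lookup. The names LazyMessage, BODY_RE, the safe flag and payload_or_error are hypothetical, and the sketch is written for Python 3 rather than the Python 2.2 shown in the traceback.

```python
import re

# Hypothetical header/body splitter; the real pattern in Corpus.py may differ.
BODY_RE = re.compile(r"(.*?\r?\n\r?\n)(.*)", re.DOTALL)


class LazyMessage:
    """Illustrative stand-in for a message that loads itself on demand."""

    def __init__(self, raw, safe):
        self.raw = raw
        self.safe = safe  # True enables the else-branch fallback

    def load(self):
        bmatch = BODY_RE.match(self.raw)
        if bmatch:
            self.payload = bmatch.group(2)
            self.hdrtxt = self.raw[:bmatch.start(2)]
        elif self.safe:
            # Always define both attributes, even when nothing matched.
            self.payload = ""
            self.hdrtxt = ""

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name in ("payload", "hdrtxt"):
            self.load()
            # If load() left the attribute unset, this lookup fails again
            # and re-enters __getattr__ without end.
            return getattr(self, name)
        raise AttributeError(name)


def payload_or_error(raw, safe):
    """Return the payload, or the string 'recursion' if lookup never ends."""
    try:
        return LazyMessage(raw, safe).payload
    except RecursionError:
        return "recursion"
```

With safe set to False an empty message recurses until the interpreter gives up, which matches the shape of the traceback; with safe set to True both attributes always exist after load(), which is the effect of the else clause in the report.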
----------------------- Initial traceback: error: uncaptured python exception, closing channel <__main__.UserInterface connected at 0x2213470> (exceptions.RuntimeError:maximum recursion depth exceeded [C:\Python22\lib\asyncore.py|poll|95] [C:\Python22\lib\asyncore.py|handle_read_event|392] [C:\Python22\lib\asynchat.py|handle_read|112] [C:\archiv\cvshome\spambayes\pop3proxy.py|found_terminator|804] [C:\archiv\cvshome\spambayes\pop3proxy.py|onRequest|830] [C:\archiv\cvshome\spambayes\pop3proxy.py|onReview|1093] [C:\archiv\cvs\spambayes\Corpus.py|takeMessage|188] [C:\archiv\cvs\spambayes\FileCorpus.py|addMessage|140] [C:\archiv\cvs\spambayes\FileCorpus.py|store|231] [C:\archiv\cvs\spambayes\Corpus.py|getSubstance|318] [C:\archiv\cvs\spambayes\Corpus.py|__getattr__|269] [C:\archiv\cvs\spambayes\Corpus.py|__getattr__|269] [C:\archiv\cvs\spambayes\Corpus.py|__getattr__|269] [C:\archiv\cvs\spambayes\Corpus.py|__getat ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651365&group_id=61702 From grobinson at transpose.com Tue Dec 10 10:15:59 2002 From: grobinson at transpose.com (Gary Robinson) Date: Tue Dec 10 10:16:05 2002 Subject: [Spambayes] A grassroots auto whitelist Message-ID: I had a wild idea that I'd like to bounce off the readers of SpamBayes and Spamflt. There is a company, Habeas, attempting to leverage trademark and copyright law to get people to insert a trademarked/copyrighted haiku into their emails. Then when spammers try it, they can be sued for trademark/copyright infringement (more here: http://radio.weblogs.com/0101454/categories/spam/2002/12/09.html#a200). It's free for individuals to use the haiku, but corporations have to pay.
I was thinking today that it could be interesting to do something less legalistic, but that would still probably have a certain practical effectiveness (particularly since I am not thrilled about including their silly haiku in all my emails). Suppose we all started putting some character string like ImNotSpam! in our emails. Then spam filters would come to associate them with non-spam, and emails with that word in it would have a better chance of getting through. Of course, spammers could do the same thing, but we could trademark it and, if anybody infringes, give lawyers the chance to sue in return for all (or most) of the damages as their contingency fee. But more effectively, instead of basing it on a simple character string like ImNotSpam, it could be that we base it on the URL of some antispam resource like Paul Graham's http://www.paulgraham.com/antispam.html or the wiki I started recently, http://spamland.org. (A URL with such a name can also be a registered trademark.) Of course, spammers could include such a URL, but wouldn't want to if it pointed to a potent source of antispam info such as a list of spam filtering products, and the trademark issue would be an additional danger. So, some balance would be achieved over time. If it got to be a popular tool for legitimate individuals to get their emails through, some spammers would use it, but if they did so on too large a scale they would be in danger of being sued and if a URL were used, they would also be informing people about how to deal with spam (the URL could list antispam products etc. as Paul's site and spamland do). A URL would also explain what the effort was about and exhort people who come to the page to also start using the token in their emails, so it would be viral. I'd be very interested in any thoughts on this. If readers of these lists wanted to try including such a URL in their emails, we could get the grassroots efforts started.
--Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim.one at comcast.net Tue Dec 10 11:05:12 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 11:09:25 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: <3DF5B90A.3986.1EC6AF74@localhost> Message-ID: [Brad Clements] > Just got this, its .. hmm. Brad, what *might* be interesting about it? It looked like vanilla work-at-home spam. From bkc at murkworks.com Tue Dec 10 11:17:20 2002 From: bkc at murkworks.com (Brad Clements) Date: Tue Dec 10 11:12:03 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: References: <3DF5B90A.3986.1EC6AF74@localhost> Message-ID: <3DF5CBC7.11245.1F0FE5CB@localhost> > [Brad Clements] > > Just got this, its .. hmm. > > Brad, what *might* be interesting about it? It looked like vanilla > work-at-home spam. > I'm a butterscotch fan myself .. -- Uh ok, are we decoding base64 text attachments now, or tossing them? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one at comcast.net Tue Dec 10 11:16:56 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 11:21:50 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: <3DF5CBC7.11245.1F0FE5CB@localhost> Message-ID: [Brad Clements] > Uh ok, are we decoding base64 text attachments now, or tossing them? We've been decoding them almost since The Beginning. Quoted-printable too. We aren't decoding embedded uuencoded sections, though. From bkc at murkworks.com Tue Dec 10 11:31:08 2002 From: bkc at murkworks.com (Brad Clements) Date: Tue Dec 10 11:26:02 2002 Subject: [Spambayes] Anyone find this spam interesting? 
In-Reply-To: References: <3DF5CBC7.11245.1F0FE5CB@localhost> Message-ID: <3DF5CF04.12768.1F1C88FE@localhost> > [Brad Clements] > > Uh ok, are we decoding base64 text attachments now, or tossing them? > > We've been decoding them almost since The Beginning. Quoted-printable too. > We aren't decoding embedded uuencoded sections, though. > (Church lady from SNL) "Oh, nevermind". Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From popiel at wolfskeep.com Tue Dec 10 08:34:15 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Dec 10 11:31:41 2002 Subject: [Spambayes] A grassroots auto whitelist In-Reply-To: Message from Gary Robinson References: Message-ID: <20021210163415.EAB9A2DED2@cashew.wolfskeep.com> In message: Gary Robinson writes: >Suppose we all started putting some character string like ImNotSpam! in our >emails. Then spam filters would come to associate them with non-spam, and >emails with that word in it would have a better chance of getting through. >But more effectively, instead of basing it on a simple character string like >ImNotSpam, it could be that we base it on the URL of some antispam resource >like Paul Graham's http://www.paulgraham.com/antispam.html or the wiki I >started recently, http://spamland.org. (A URL with such a name can also be a >registered trademark.) The largest problem I see with such (other than the fact that I despise canned text in my emails, hence no .sig) is that the URLs in question become prime targets for domain-name-grabbing or other nefarious trickery. I'm much happier to leave the selection of strong ham clues to less overt means. - Alex From grobinson at transpose.com Tue Dec 10 13:54:16 2002 From: grobinson at transpose.com (Gary Robinson) Date: Tue Dec 10 13:54:27 2002 Subject: [Spambayes] What the heck Message-ID: I couldn't resist. See my sig below. At least it will be an interesting experiment.
(I didn't include the trademark/copyright mechanism because I don't want to run up against Habeas' pending patent. And I think it may not be necessary in any case.) So as not to take up more bandwidth on these lists I won't say anything more about this unless somebody has a comment or question they want me to respond to. Bottom line, I'm up for trying things and seeing what happens. --Gary -- http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim.one at comcast.net Tue Dec 10 14:42:46 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 14:46:32 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: <3DF5CF04.12768.1F1C88FE@localhost> Message-ID: [Tim] > We've been decoding them [base64 text sections] almost since The Beginning. > Quoted-printable too. We aren't decoding embedded uuencoded sections, > though. [Brad Clements] > (Church lady from SNL) > > "Oh, nevermind". It's unclear -- there have been bugs in decoding base64 before, and may still be. Why did you ask originally? For example, perhaps that spam got classified as ham, and you didn't see any of the decoded spammy words in the clue list. In that case, we'd need the original msg to figure out what went wrong. From tim at fourstonesExpressions.com Tue Dec 10 16:12:26 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Dec 10 17:13:02 2002 Subject: [Spambayes] What the heck Message-ID: 12/10/2002 3:56:55 PM, Gary Robinson wrote: > >> It's an interesting idea... I'll bet that a spammer would have no qualms about >> including such a url in spam, though. > >That may be, but for the spammers to be motivated to do so, the idea will >have had to already been very successful in terms of people using it! 
> >If the spammers then put the URL in their spams, millions of people will >gain access to the latest info about the most powerful current antispam >techniques... probably including who to lobby to get antispam laws passed, >tips about who and where the spammers might be would be in a discussion >area, etc.... > >The outcome would be good either way! :) > >--Gary > > >-- >http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails What will the tokenizer do with the above url? - TimS > >Gary Robinson >CEO >Transpose, LLC >grobinson@transpose.com >207-942-3463 >http://www.emergentmusic.com >http://radio.weblogs.com/0101454 > > >> From: Tim Stone - Four Stones Expressions >> Reply-To: tim@fourstonesExpressions.com >> Date: Tue, 10 Dec 2002 15:51:36 -0600 >> To: Gary Robinson >> Subject: Re: [Spambayes] What the heck >> >> It will be interesting to see how many links back to the page you get... I >> wonder if the spambayes tokenizer will discard the 'ToDestroy...Emails' part >> as being a too long word... >> >> It's an interesting idea... I'll bet that a spammer would have no qualms about >> including such a url in spam, though. >> >> - TimS >> >> 12/10/2002 12:54:16 PM, Gary Robinson wrote: >> >>> I couldn't resist. See my sig below. At least it will be an interesting >>> experiment. >>> >>> (I didn't include the trademark/copyright mechanism because I don't want to >>> run up against Habeas' pending patent. And I think it may not be necessary >>> in any case.) >>> >>> So as not to take up more bandwidth on these lists I won't say anything more >>> about this unless somebody has a comment or question they want me to respond >>> to. >>> >>> Bottom line, I'm up for trying things and seeing what happens.
>>> >>> >>> --Gary >>> >>> -- >>> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails >>> >>> Gary Robinson >>> CEO >>> Transpose, LLC >>> grobinson@transpose.com >>> 207-942-3463 >>> http://www.emergentmusic.com >>> http://radio.weblogs.com/0101454 >>> >>> >>> >>> _______________________________________________ >>> Spambayes mailing list >>> Spambayes@python.org >>> http://mail.python.org/mailman/listinfo/spambayes >>> >>> >> >> >> c'est moi - TimS >> www.fourstonesExpressions.com >> >> > > > c'est moi - TimS www.fourstonesExpressions.com http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails From skip at pobox.com Tue Dec 10 16:24:52 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 10 17:24:43 2002 Subject: [Spambayes] What the heck In-Reply-To: References: Message-ID: <15862.27060.809694.985672@montanaro.dyndns.org> >> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails Tim> What will the tokenizer do with the above url? - TimS Chew it up and spit out little pieces I believe, something like url:spamland url:org url:jsp url:Wiki url:ToDestroySpamIncludeThisLinkInAllLegitEmails Not sure about that last one. It might generate some sort of skip token, but I think that's just for regular long words. Skip From tim.one at comcast.net Tue Dec 10 17:28:17 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 17:30:34 2002 Subject: [Spambayes] What the heck In-Reply-To: Message-ID: > http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails [Tim Stone] > What will the tokenizer do with the above url? 
- TimS It generates 6 tokens: proto:http url:spamland url:org url:jsp url:wiki url:todestroyspamincludethislinkinalllegitemails From tim at fourstonesExpressions.com Tue Dec 10 16:33:11 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Dec 10 17:35:05 2002 Subject: [Spambayes] What the heck In-Reply-To: Message-ID: 12/10/2002 4:28:17 PM, Tim Peters wrote: >> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails > >[Tim Stone] >> What will the tokenizer do with the above url? - TimS > >It generates 6 tokens: > >proto:http >url:spamland >url:org >url:jsp >url:wiki >url:todestroyspamincludethislinkinalllegitemails skip_max_wordsize only applies to words, not to url fragments? > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails From tim.one at comcast.net Tue Dec 10 17:50:30 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 17:52:49 2002 Subject: [Spambayes] What the heck In-Reply-To: Message-ID: > http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails > It generates 6 tokens: > > proto:http > url:spamland > url:org > url:jsp > url:wiki > url:todestroyspamincludethislinkinalllegitemails > skip_max_wordsize only applies to words, not to url fragments? It does not apply to url fragments. As to the first half of the question, tokenizer.py is open for inspection . From skip at pobox.com Tue Dec 10 23:01:18 2002 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 11 00:01:07 2002 Subject: [Spambayes] New option: summarize_email_prefixes Message-ID: <15862.50846.195158.599726@montanaro.dyndns.org> I just checked in code for a new option: summarize_email_prefixes. 
It tries to take advantage of clumps of related email addresses in a single message, e.g.: To: Cc: , , , , It's not a big win, but "pfxlen:big" is a very strong spam indicator. It might help on small messages without many other clues. I'd like others to give it a try and post their results. The code is pretty straightforward, so I won't go into more detail. Just gaze at tokenizer.py for a few seconds. Skip From noreply at sourceforge.net Tue Dec 10 20:56:28 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed Dec 11 00:03:11 2002 Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages Message-ID: Bugs item #651840, was opened at 2002-12-10 23:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mitchell Surface (msurface) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain.py eats old messages Initial Comment: When mboxtrain.py is run against a mbox containing messages it has already trained on, it deletes the old messages. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 From msurface at myvine.com Wed Dec 11 00:05:42 2002 From: msurface at myvine.com (Mitchell Surface) Date: Wed Dec 11 00:05:47 2002 Subject: [Spambayes] Bug in mboxtrain.py? Message-ID: <20021211050542.GA14646@brewer.fwn.fortwayne.com> I think I may have found a bug in mboxtrain.py. When you run mboxtrain.py against a mbox that contains messages that have already been trained on, the old messages are deleted. I opened a bug on SF for this. It's late here, I'll try to find some time to look at the code tomorrow and see what's going on. It's probably something simple and better eyes than mine will spot it more quickly, but I wanted to give a warning as soon as I could. 
It wasn't a highlight of my day to see a mailbox disappear. <0.5 wink> Oh well, that's why it's called pre-alpha code, right? -- Mitchell Surface N9OSL Fort Wayne, IN USA Don't ever think you know what's right for the other person. He might start thinking he knows what's right for you. -- Paul Williams, `Das Energi' From neale at woozle.org Wed Dec 11 08:23:17 2002 From: neale at woozle.org (Neale Pickett) Date: Wed Dec 11 11:23:26 2002 Subject: [Spambayes] Bug in mboxtrain.py? In-Reply-To: <20021211050542.GA14646@brewer.fwn.fortwayne.com> (Mitchell Surface's message of "Wed, 11 Dec 2002 00:05:42 -0500") References: <20021211050542.GA14646@brewer.fwn.fortwayne.com> Message-ID: Mitchell Surface writes: > I think I may have found a bug in mboxtrain.py. When you run > mboxtrain.py against a mbox that contains messages that have already > been trained on, the old messages are deleted. I opened a bug on SF for > this. Zowie! Well *that* was a dumb bug. I hope you backed up your mbox--sorry about that. I've checked in the fix. Please, everyone using mboxtrain.py on an mbox file, update your source from CVS. Thanks for the bug report, Mitchell.
Neale From noreply at sourceforge.net Wed Dec 11 08:42:27 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed Dec 11 11:48:40 2002 Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages Message-ID: Bugs item #651840, was opened at 2002-12-10 23:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mitchell Surface (msurface) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain.py eats old messages Initial Comment: When mboxtrain.py is run against a mbox containing messages it has already trained on, it deletes the old messages. ---------------------------------------------------------------------- >Comment By: Mitchell Surface (msurface) Date: 2002-12-11 11:42 Message: Logged In: YES user_id=21257 I just did a cvs up and it looks like the code has been rewritten to not do this. I'll test tonight and post results. Thanks guys! ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 From noreply at sourceforge.net Wed Dec 11 08:52:18 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed Dec 11 12:01:57 2002 Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages Message-ID: Bugs item #651840, was opened at 2002-12-10 20:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mitchell Surface (msurface) >Assigned to: Neale Pickett (npickett) Summary: mboxtrain.py eats old messages Initial Comment: When mboxtrain.py is run against a mbox containing messages it has already trained on, it deletes the old messages.
---------------------------------------------------------------------- Comment By: Mitchell Surface (msurface) Date: 2002-12-11 08:42 Message: Logged In: YES user_id=21257 I just did a cvs up and it looks like the code has been rewritten to not do this. I'll test tonight and post results. Thanks guys! ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 From noreply at sourceforge.net Wed Dec 11 08:52:18 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed Dec 11 12:02:16 2002 Subject: [Spambayes] [ spambayes-Bugs-650496 ] hammie.py discards headers Message-ID: Bugs item #650496, was opened at 2002-12-08 10:39 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Simon Baatz (bnomis26) >Assigned to: Neale Pickett (npickett) Summary: hammie.py discards headers Initial Comment: When feeding the (malformed) attached mail to hammie.py in filter mode, the headers of the mail are not present in the output. Command line: python hammie.py -f -d -p ~/mail/hammie.db < msg.lAoM Output: X-Spambayes-Classification: ham; 0.00 --Amazon.com_multipart_boundary____________ Content-Type: text/plain; charset=iso-8859-1 Vielen Dank für Ihre Bestellung bei Amazon.de.
--Amazon.com_multipart_boundary____________
Content-Type: text/html; charset=iso-8859-1

--Amazon.com_multipart_boundary____________--

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702

From noreply at sourceforge.net  Wed Dec 11 08:53:19 2002
From: noreply at sourceforge.net (noreply@sourceforge.net)
Date: Wed Dec 11 12:02:28 2002
Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages
Message-ID: 

Bugs item #651840, was opened at 2002-12-10 20:56
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

Category: None
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Mitchell Surface (msurface)
Assigned to: Neale Pickett (npickett)
Summary: mboxtrain.py eats old messages

Initial Comment:
When mboxtrain.py is run against a mbox containing messages
it has already trained on, it deletes the old messages.

----------------------------------------------------------------------

>Comment By: Neale Pickett (npickett)
Date: 2002-12-11 08:53

Message:
Logged In: YES
user_id=619391

I think this is fixed with my most recent cvs checkin.  Feel
free to re-open the bug if not :)

----------------------------------------------------------------------

Comment By: Mitchell Surface (msurface)
Date: 2002-12-11 08:42

Message:
Logged In: YES
user_id=21257

I just did a cvs up and it looks like the code has been
rewritten to not do this. I'll test tonight and post results.
Thanks guys!

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

From neale at watchguard.com  Wed Dec 11 22:02:16 2002
From: neale at watchguard.com (Neale Pickett)
Date: Thu Dec 12 01:27:44 2002
Subject: [Spambayes] New option: summarize_email_prefixes
In-Reply-To: <15862.50846.195158.599726@montanaro.dyndns.org> (Skip Montanaro's message of "Tue, 10 Dec 2002 23:01:18 -0600")
References: <15862.50846.195158.599726@montanaro.dyndns.org>
Message-ID: 

Skip Montanaro  writes:

> It's not a big win, but "pfxlen:big" is a very strong spam indicator.
> It might help on small messages without many other clues.  I'd like
> others to give it a try and post their results.

It didn't make one bit of difference for me.  So if it's helpful to you,
I'm okay with it :)

"""
cv1s -> cv2s
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    1.000  1.000  tied
    1.000  1.000  tied
    2.000  2.000  tied
    3.000  3.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 14 to 14 tied
mean fp % went from 1.4 to 1.4 tied

false negative percentages
    0.500  0.500  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fn went from 3 to 3 tied
mean fn % went from 0.3 to 0.3 tied

ham mean                     ham sdev
   4.39    4.39   +0.00%       16.31   16.31   +0.00%
   3.62    3.62   +0.00%       14.22   14.22   +0.00%
   5.01    5.01   +0.00%       18.67   18.67   +0.00%
   4.94    4.94   +0.00%       19.29   19.29   +0.00%
   4.20    4.21   +0.24%       14.37   14.38   +0.07%

ham mean and sdev for all runs
   4.43    4.43   +0.00%       16.71   16.71   +0.00%

spam mean                    spam sdev
  99.23   99.23   +0.00%        6.99    6.99   +0.00%
  99.26   99.26   +0.00%        7.41    7.41   +0.00%
  99.96   99.96   +0.00%        0.31    0.25  -19.35%
  98.96   98.96   +0.00%        8.51    8.51   +0.00%
  99.63   99.63   +0.00%        2.71    2.71   +0.00%

spam mean and sdev for all runs
  99.41   99.41   +0.00%        6.07    6.07   +0.00%

ham/spam mean difference: 94.98 94.98 +0.00
"""

From skip at pobox.com  Thu Dec 12 08:34:11 2002
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 12 09:34:21 2002
Subject: [Spambayes] New option: summarize_email_prefixes
In-Reply-To: 
References: <15862.50846.195158.599726@montanaro.dyndns.org>
Message-ID: <15864.40547.760153.378737@montanaro.dyndns.org>


    >> It's not a big win, but "pfxlen:big" is a very strong spam indicator.

    Neale> It didn't make one bit of difference for me.  So if it's helpful to
    Neale> you, I'm okay with it :)

It didn't help me much either.  I figured it was worth leaving in as an
experimental device because other people had asked about it before.

Skip

From ducky at webfoot.com  Thu Dec 12 11:28:41 2002
From: ducky at webfoot.com (Kaitlin Duck Sherwood)
Date: Thu Dec 12 14:26:06 2002
Subject: [Spambayes] Tiny bug
In-Reply-To: 
References: 
Message-ID: 

Last night I checked out spambayes, and ran into trouble in mboxtest.py.

There's a line in mboxtest.py
	from timtest import Msg
but mboxtest.py works much better if it's
	from msgs import Msg

I presume that as a newbie, I shouldn't (or can't) check in.

Cheers.

No reply needed.

From tim at fourstonesExpressions.com  Thu Dec 12 13:31:33 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Dec 12 14:32:09 2002
Subject: [Spambayes] Tiny bug
In-Reply-To: 
Message-ID: 

I have no idea why timtest is there, but your fix is correct.  You are
correct that you cannot check in.
- TimS (not the tim in timtest...;)

12/12/2002 1:28:41 PM, Kaitlin Duck Sherwood  wrote:

>Last night I checked out spambayes, and ran into trouble in mboxtest.py.
>
>There's a line in mboxtest.py
>	from timtest import Msg
>but mboxtest.py works much better if it's
>	from msgs import Msg
>
>I presume that as a newbie, I shouldn't (or can't) check in.
>
>Cheers.
>
>No reply needed.
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com
http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails



From piper_dragon at lycos.com  Fri Dec 13 04:12:12 2002
From: piper_dragon at lycos.com (douglas P craig)
Date: Fri Dec 13 00:26:15 2002
Subject: [Spambayes] Lindsey Carter
Message-ID: 

You guys are way over my head but if I understand correctly the email I
received is only sent to get me to submit to the AVS service.  I take it
your passion is working on software to detect spam.  Educate me someone
and tell me what ham is?

Thanks
Doug

From skip at pobox.com  Thu Dec 12 23:58:19 2002
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec 13 01:00:36 2002
Subject: [Spambayes] Lindsey Carter
In-Reply-To: 
References: 
Message-ID: <15865.30459.363352.142342@montanaro.dyndns.org>


    doug> You guys are way over my head but if I understand correctly the
    doug> email I received is only sent to get me to submit to the AVS
    doug> service.

Not sure what the "AVS service" is.  We're not trying to get you to buy or
submit to anything.

    doug> I take it your passion is working on software to detect
    doug> spam. Educate me someone and tell me what ham is?

The opposite of spam.

Skip

From tim.one at comcast.net  Sun Dec 15 21:11:01 2002
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 15 21:12:06 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To: 
Message-ID: 

I got a typical mortgage spam today, surprising because it scored 0.78, at
the high end of my personal-email Unsure range (which ends at 0.80).  There
were very few words in the clue listing; it got a score as high as it did
because of the subject line

    Low rates will not last forever.

some assorted spammish header clues, URL clues, and the single word
"month!".  Staring at the source revealed a cute trick I haven't seen
before:

    ...
    Let the Lenders
Compete for your Loan!
    ...

That is, the spammy words like Lenders and Compete and Loan! are broken up
by embedded HTML comments.  Our tokenizer does strip HTML comments, but
replaces each with a blank, so the spammy words remain broken up.

I'll fix that.  In the meantime, if anyone knows this spammer , counsel
them to break up the word "month!" too, as that was the highest-spamprob
token in the whole msg.

From dereks at itsite.com  Sun Dec 15 19:40:39 2002
From: dereks at itsite.com (Derek Simkowiak)
Date: Sun Dec 15 22:41:47 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To: 
Message-ID: 

>     Let the Lenders
> Compete for your Loan!
> [...] Our tokenizer does strip HTML comments, but replaces each with a
> blank, so the spammy words remain broken up.
>
> I'll fix that.

	Pretend I'm a spammer.

Hi!  Greeat Deeals with lo
w rates!

	(I.e., not just comments, but valid HTML tags too.)

	For that matter, since unrecognized tags are ignored by browsers,
it could be:

Hi! Great deals Here!

	Hell, it wouldn't even need to look like HTML:

Hi! Great deals here!

	I haven't followed the discussions on HTML handling, but given
this latest cute trick this other stuff can't be far away.



--Derek


From tim at fourstonesExpressions.com  Sun Dec 15 21:49:14 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun Dec 15 22:49:50 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To: 
Message-ID: 

12/15/2002 6:40:39 PM, Derek Simkowiak  wrote:

>>     Let the Lenders 
>> Compete for your Loan!
>
>> [...] Our tokenizer does strip HTML comments, but replaces each with a
>> blank, so the spammy words remain broken up.
>>
>> I'll fix that.
>
>	Pretend I'm a spammer.
>
>Hi!  Greeat Deeals with lo
w rates!
>
>	(I.e., not just comments, but valid HTML tags too.)
>
>	For that matter, since unrecognized tags are ignored by browsers,
>it could be:
>
>Hi! Great deals Here!
>
>	Hell, it wouldn't even need to look like HTML:
>
>Hi! Great deals here!
>
>	I haven't followed the discussions on HTML handling, but given
>this latest cute trick this other stuff can't be far away.

Right, but our tokenizer as it stands would defeat all of these.  It 
would have defeated Tim's example too, except that it replaced each 
stripped comment with a blank.  This is a great example of how the 
efforts of teams like ours are already forcing spammers into more and more 
convoluted behaviors, which will make their mail even more readily 
recognizable!  - TimS

>
>
>
>--Derek
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com
http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails



From popiel at wolfskeep.com  Sun Dec 15 21:10:02 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 16 00:07:03 2002
Subject: [Spambayes] Another hammie setup
Message-ID: <20021216051002.3FE3A2DE86@cashew.wolfskeep.com>

A couple weeks ago, I mentioned that I was finally going to start
using hammie for my live filtering, and that I'd share the scripts,
etc that I generated to do so.

First off, let me describe how I've got things set up.  I am an
avid (and rather religious) MH user, so my mail folders are of
course stored in the MH format (directories full of single-message
files, where the filenames are numbers indicating ordering in the
folder).  I've got four mail folders of interest for this discussion:
everything, spam, newspam, and inbox.

When mail arrives, it is classified, then immediately copied into the
everything folder.  If it was classified as spam or ham, it is
trained as such, reinforcing the classification.  Then, if it was
labeled as spam, it goes into the newspam folder; otherwise it
goes into my inbox.
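
The delivery rule above can be sketched as a small pure function.  This is
only an illustration of the flow as described, not the actual scripts; the
route name and the two cutoff values are hypothetical.

```python
def route(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return (label, train_as, folders) for one incoming message.

    The cutoffs are placeholder values chosen for this sketch.
    """
    if score >= spam_cutoff:
        label = "spam"
    elif score <= ham_cutoff:
        label = "ham"
    else:
        label = "unsure"
    # Only confident classifications are trained on at delivery time.
    train_as = label if label != "unsure" else None
    # Everything is archived; spam is diverted, the rest lands in the inbox.
    destination = "newspam" if label == "spam" else "inbox"
    return label, train_as, ["everything", destination]
```

Note that an unsure message is archived and delivered to the inbox but not
trained on until the nightly run.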

When I read my mail (from inbox or newspam), I move any confirmed
spam into my spam folder; ham may be deleted.  (Of course, I still
have a copy of my ham in the everything folder.)

Every night, I run a complete retraining (from cron at 2:10am);
it trains on all mail in the everything folder that is less than
4 months old.  If a given message has an identical copy in the spam
or newspam folder, then it is trained as spam; otherwise it is
trained as ham.  This does mean that unread unsures will be
treated as ham for up to a day; there's few enough of them that
I don't care.  The four-month age limit will have the effect of
expiring old mail out of the training set, which will keep the
database size fairly manageable (it's currently just under 10 meg,
with 6 days to go until I have 4 months of data).
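
For anyone wanting to try something similar, the selection step of such a
retraining might look roughly like the sketch below.  It is not the script
from my setup: the function names are made up, "4 months" is approximated as
120 days, message age is taken from the file's modification time, and
"identical copy" is checked by hashing file contents.  All of those are
assumptions.

```python
import hashlib
import os
import time

MAX_AGE_DAYS = 120  # assumption: "4 months" approximated as 120 days


def digest(path):
    """Hash a message file so identical copies can be recognized."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def message_files(folder):
    """Yield paths of MH message files (numeric names) in folder order."""
    names = [n for n in os.listdir(folder) if n.isdecimal()]
    for name in sorted(names, key=int):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            yield path


def label_messages(everything, spam_folders, now=None):
    """Yield (path, is_spam) for each sufficiently recent archived message."""
    now = time.time() if now is None else now
    cutoff = now - MAX_AGE_DAYS * 86400
    spam_digests = {digest(p) for d in spam_folders for p in message_files(d)}
    for path in message_files(everything):
        if os.path.getmtime(path) < cutoff:
            continue  # expired out of the training set
        yield path, digest(path) in spam_digests
```

Each (path, is_spam) pair would then be handed to whatever trainer is in
use; anything not found in a spam folder is treated as ham, which is why
unread unsures count as ham until they are filed.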

The retraining generates a little report for me each night,
showing a graph of my ham and spam levels over time.  Here's
a sample:

Scanning spamdir (/home/cashew/popiel/Mail/spam):
Scanning spamdir (/home/cashew/popiel/Mail/newspam):
Scanning everything
sshsshsshsshsshsshsshshsshshshshsshshshshshshsshsshshsshssshsshshsshshsshshsssh
shshshsshshsshshshshshssshshshsshsshsshshshshshshsshshhshshsshshshshssshssshshs
ssshs
  154
  152|
  144|
  136|
  128|                                                   h
  120|                                                   h      s
  112|                             s       ss     ss s   h   s  ss
  104|                             ss      ss     ss sHs h   s  ss
   96|                           s ss   s  sH  s  ss sHs h  Sss ss
   88|                    h  ss  s sss ss  sH sss ssssHHhS sSsssss
   80|                 s sSH ss ssssss sssssH HssssHsHHHSS sSsssss
   72|                 ssHSH ssssssssssssHHsHSHssHsHsHHHSSssSsssss
   64|      s  s  s s sHsHSHsssssssHsHsssHHsHSHssHsHsHHHSSssSsssss
   56|   s sss ss sssssHHHSHsHsssHsHHHHssHHsHSHHsHHHsHHHSSsHSsssss
   48|   ssssssssssssssHHHSHHHHssHsHHHHHsHHsHSHHsHHHsHHHSSsHSssHsss
   40|   ssssssssssHsHHHHHSHHHHHsHsHHHHHHHHHHSHHsHHHHHHHSSsHSHsHHss
   32|   ssHHssHsssHHHHHHHSHHHHHHHsHHHHHHHHHHSHHsHHHHHHHSSHHSHHHHHs
   24|   ssHHHHHHHsHHHHHHHSHHHHHHHsHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
   16|   HsHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
    8|   HHHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHH
    0|SSSUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
     +------------------------------------------------------------

Total: 6441 ham, 9987 spam (60.79% spam)

real    7m45.049s
user    5m38.980s
sys     0m39.170s

This is a set of overlaid bar graphs; s is for spam, h is for ham,
u is unsure.  The shorter bars are in front and capitalized.  In
the example, I have very few days where I have more ham than spam.
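
The overlay rule can be sketched as follows.  This is a guess at the
drawing logic from the description, not the code that produced the graph
above: the render name, the per-day dictionaries, and the rule that a row
labelled y shows bars taller than y are assumptions.

```python
def render(days, step=8):
    """Draw overlaid bars; days is a list like [{"h": 10, "s": 20, "u": 1}].

    In each cell the shortest bar reaching that row is drawn in front; it is
    capitalized unless it is also the tallest bar in its column.
    """
    top = max(max(counts.values()) for counts in days)
    rows = []
    y = (top // step) * step
    while y >= 0:
        cells = []
        for counts in days:
            # Bars reaching above this row, shortest (front-most) first.
            visible = sorted((n, k) for k, n in counts.items() if n > y)
            if not visible:
                cells.append(" ")
                continue
            height, letter = visible[0]
            tallest = max(counts.values())
            cells.append(letter if height == tallest else letter.upper())
        rows.append("%5d|%s" % (y, "".join(cells)))
        y -= step
    return rows


for line in render([{"h": 10, "s": 20, "u": 1}, {"h": 20, "s": 5, "u": 0}]):
    print(line)
```

In the two-day example, the first column has more spam than ham (capital H
in front of lowercase s) and the second has more ham than spam.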

My scripts (and a .procmailrc) are available at:
  http://www.wolfskeep.com/~popiel/spambayes/hammie

- Alex

From popiel at wolfskeep.com  Sun Dec 15 21:13:30 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 16 00:10:26 2002
Subject: [Spambayes] One question I forgot to ask...
Message-ID: <20021216051330.0E6F22DE86@cashew.wolfskeep.com>

I forgot to ask in my last mail: would people like me to add
the scripts I use for my nightly retraining (given my slightly
unusual setup) to the project?

- Alex

From tim.one at comcast.net  Mon Dec 16 00:26:44 2002
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 16 00:28:58 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To: 
Message-ID: 

[Tim]
>     ...
>     Let the Lenders 
> Compete for your Loan!
>     ...
>
> That is, the spammy words like Lenders and Compete and Loan! are
> broken up by embedded HTML comments.  Our tokenizer does strip HTML
> comments, but replaces each with a blank, so the spammy words remain
> broken up.
>
> I'll fix that.  In the meantime, if anyone knows this spammer ,
> counsel them to break up the word "month!" too, as that was the
> highest-spamprob token in the whole msg.

It's fixed, and that particular spam is nailed now.  Among the previously
"hidden" words, 'refinance', 'equity', and 'debt' all have higher spamprobs
than 'month!', and the score is now at the high end of the spam range.

stupid-beats-smart-stupid-beats-smart-stupid-beats-smart-ly y'rs  - tim

From tim.one at comcast.net  Mon Dec 16 00:46:38 2002
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 16 00:49:15 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To: 
Message-ID: 

[Derek Simkowiak, on embedding other kinds of tags in words]
> ...
> 	I haven't followed the discussions on HTML handling, but given
> this latest cute trick this other stuff can't be far away.

I don't know, but Tim Stone was right that we strip out all HTML tags, so it
wouldn't help them against this system.  They could still work around that,
by including extremely long tags -- our cheap-ass regexp gimmicks are
bounded in how far they'll look ahead when deciding what is and isn't a tag
(we don't even know whether we're looking at HTML, and don't want to chew up
non-HTML text that just happens to contain "<").  Someday I expect we'll
need "a real" HTML parser -- but not today .

The technically cleverest spam I've gotten to date remains an HTML spam that
interspersed legitimate news stories & tech newsgroup postings with the
spam, but specified a tiny font and white-on-white for the legit parts.
Invisible when rendered.  I've only seen that once, and part of the downside
of stripping HTML tags is that the classifier will never learn on its own
which HTML tricks are used to get this effect.  OTOH, you can't guess
someone's "ham words" without knowing something about them, and personal
information is very expensive for spammers to obtain or exploit.

From mwh at python.net  Mon Dec 16 12:18:41 2002
From: mwh at python.net (Michael Hudson)
Date: Mon Dec 16 07:18:43 2002
Subject: [Spambayes] Re: Cute spam trick
References: 
Message-ID: <2md6o2cg2m.fsf@starship.python.net>

Tim Peters  writes:

> OTOH, you can't guess someone's "ham words" without knowing
> something about them, and personal information is very expensive for
> spammers to obtain or exploit.

Indeed, if they know that much about me, a better tactic would be to
try to sell me something I might actually want...

Cheers,
M.

-- 
  There are two kinds of large software systems: those that evolved
  from small systems and those that don't work.
                           -- Seen on slashdot.org, then quoted by amk

From neale at woozle.org  Mon Dec 16 10:46:08 2002
From: neale at woozle.org (Neale Pickett)
Date: Mon Dec 16 13:46:17 2002
Subject: [Spambayes] One question I forgot to ask...
In-Reply-To: <20021216051330.0E6F22DE86@cashew.wolfskeep.com> ("T. Alexander Popiel"'s message of "Sun, 15 Dec 2002 21:13:30 -0800")
References: <20021216051330.0E6F22DE86@cashew.wolfskeep.com>
Message-ID: 

"T. Alexander Popiel"  writes:

> I forgot to ask in my last mail: would people like me to add
> the scripts I use for my nightly retraining (given my slightly
> unusual setup) to the project?

Yes.  But I think we should go ahead and create that hammie subdir
first.  All hammie front-ends (hammiefilter, hammiebulk, hammiecli/srv,
mboxtrain, any others) would be moved there, as well as the HAMMIE.txt
file and any other hammie-like things.  hammie.py would stay in the
top-level directory, since lots of things use it.

What do y'all think of that?

Neale

From trebor at animeigo.com  Mon Dec 16 13:37:55 2002
From: trebor at animeigo.com (Robert Woodhead)
Date: Mon Dec 16 13:48:55 2002
Subject: [Spambayes] Re: Spambayes Digest, Vol 52, Issue 26
In-Reply-To: 
References: 
Message-ID: 

>The technically cleverest spam I've gotten to date remains an HTML spam that
>interspersed legitimate news stories & tech newsgroup postings with the
>spam, but specified a tiny font and white-on-white for the legit parts.
>Invisible when rendered.  I've only seen that once, and part of the downside
>of stripping HTML tags is that the classifier will never learn on its own
>which HTML tricks are used to get this effect.  OTOH, you can't guess
>someone's "ham words" without knowing something about them, and personal
>information is very expensive for spammers to obtain or exploit.

I was a bit surprised that you guys haven't run across the embedding
tricks before.  In my spam parsing, I have the parser spit out not
only the words, but also the tokens internal to a tag (< and > are
considered whitespace), and catenate those words broken up by tags.

So

foobar foobaz

results in output:

derf foobar bork foobaz font color ffffff

Seems to work well.  The state machine for doing this is trivial.
And the extra stuff you glean from the interior of tags is likely to
be significant.

R

From popiel at wolfskeep.com  Mon Dec 16 11:05:45 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 16 14:02:40 2002
Subject: [Spambayes] One question I forgot to ask...
In-Reply-To: Message from Neale Pickett of "Mon, 16 Dec 2002 10:46:08 PST."
References: <20021216051330.0E6F22DE86@cashew.wolfskeep.com>
Message-ID: <20021216190545.1AFDF2DE8C@cashew.wolfskeep.com>

In message:   Neale Pickett  writes:
>
>But I think we should go ahead and create that hammie subdir
>first.  All hammie front-ends (hammiefilter, hammiebulk, hammiecli/srv,
>mboxtrain, any others) would be moved there, as well as the HAMMIE.txt
>file and any other hammie-like things.  hammie.py would stay in the
>top-level directory, since lots of things use it.
>
>What do y'all think of that?

Sounds good to me.  Let the Great Rearrangement begin!

- Alex

From popiel at wolfskeep.com  Mon Dec 16 12:16:53 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 16 15:13:47 2002
Subject: [Spambayes] Re: Spambayes Digest, Vol 52, Issue 26
In-Reply-To: Message from Robert Woodhead
References: 
Message-ID: <20021216201653.934562DE8C@cashew.wolfskeep.com>

In message:   Robert Woodhead  writes:
>
>I was a bit surprised that you guys haven't run across the embedding
>tricks before.  In my spam parsing, I have the parser spit out not
>only the words, but also the tokens internal to a tag (< and >
>are considered whitespace), and catenate those words broken up by
>tags.

>Seems to work well.  The state machine for doing this is trivial.
>And the extra stuff you glean from the interior of tags is likely to
>be significant.

We (I use the term loosely, since I didn't do any of the work)
did some stuff with paying attention to HTML tags.  You're right,
the effects _were_ significant: significantly bad.  It made it
impossible to talk _about_ specific HTML or send a mail in HTML
without being called spam.  The (highly correlated) HTML markers
all got associated so strongly with spam that any HTML presence
was instant damnation.

Depending on what sort of people send you mail, this may or may
not be a problem. ;-)

I suspect some interesting stuff could be done by deciding to
pay attention to all but a select set of HTML tags, while treating
,

,


, and other similar basic formatting tags as whitespace.  It would be
interesting to try to determine the set of tags to ignore based on
a collection of HTML ham vs. HTML spam... but I don't have such a
collection, and since I've already got a 0.6% unsure rate with no
errors, I'm not too motivated.

I now well understand why Tim Peters lost interest in algorithm
tweaking; until the amount of spam leaking through increases by
an order of magnitude, I'm probably just going to ignore it, as
I ignored it for the five years before last summer.  The good is
the enemy of the perfect, too.

- Alex

From tim at fourstonesExpressions.com  Mon Dec 16 14:15:02 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Dec 16 15:16:26 2002
Subject: [Spambayes] One question I forgot to ask...
In-Reply-To: 
Message-ID: 

Go for it, dude. - TimS

12/16/2002 12:46:08 PM, Neale Pickett  wrote:

>"T. Alexander Popiel"  writes:
>
>> I forgot to ask in my last mail: would people like me to add
>> the scripts I use for my nightly retraining (given my slightly
>> unusual setup) to the project?
>
>Yes.  But I think we should go ahead and create that hammie subdir
>first.  All hammie front-ends (hammiefilter, hammiebulk, hammiecli/srv,
>mboxtrain, any others) would be moved there, as well as the HAMMIE.txt
>file and any other hammie-like things.  hammie.py would stay in the
>top-level directory, since lots of things use it.
>
>What do y'all think of that?
>
>Neale
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com
http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails



From lists at webcrunchers.com  Mon Dec 16 19:01:43 2002
From: lists at webcrunchers.com (John D.)
Date: Mon Dec 16 22:07:02 2002
Subject: [Spambayes] Anyone find this spam interesting?
In-Reply-To: <3DF5B90A.3986.1EC6AF74@localhost>
Message-ID: 

>Just got this, its .. hmm.
>
>First, some relevant headers:

Yea - I'm getting an average of about 10 of these per week.

While on the same subject...   How many of these ones have you people gotten?

You have been approved.
Cash Grant Amount:
$10,000-$5,000,000
Did You Know?
-Each Year the U.S. Goverment Gives away BILLIONS in cash grants?
-There are No special requirements to obtain these grants.
-These are Free Cash Grants That you NEVER have to repay!

crunch,You Qualify!
Click Here
Limited Time Offer

I'm getting a whopping 250 of these per week.   These guys really love me.

Remember:  my Email is crunch@shopip.com...   my from mail above is strictly
for mailing lists, and I don't always read it, except pull it down once
a week or so.

John

From tim.one at comcast.net  Tue Dec 17 01:04:01 2002
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 17 01:05:51 2002
Subject: [Spambayes] 
Message-ID: 

[Robert Woodhead]
> I was a bit surprised that you guys haven't run across the embedding
> tricks before.

I don't know that we haven't, just that only one such managed to get itself
classified as Unsure in my personal email so far.  That was discussed at
length here when it happened.  It got a high ham score for *me* because one
of the news stories it included was about the DC-area snipers, and since I
live in the area I had lots of ham from friends and relatives talking about
that too.  The other putative ham it included wasn't notably hammy to my
classifier, and would not have saved the msg from being called spam -- the
spammy parts were extremely spammy.

> In my spam parsing, I have the parser spit out not only the words,
> but also the tokens internal to a tag (< and > are considered
> whitespace), and catenate those words broken up by tags.
>
> So
>
> foobar foobaz
>
> results in output:
>
> derf foobar bork foobaz font color ffffff

OTOH, we go out of our way to strip almost all evidence of tags, lest every
HTML email be classified as spam.

> Seems to work well.  The state machine for doing this is trivial.
> And the extra stuff you glean from the interior of tags is likely
> to be significant.

For a long time we had an option not to strip HTML tags, because in the
early days my comp.lang.python test found that extremely helpful (not
surprising!  there are virtually no legit HTML msgs on tech mailing lists,
while lots of spam is HTML).  A result was that every one of the few legit
HTML c.l.py msgs became false positives, and a larger number of legit
non-HTML c.l.py msgs talking *about* HTML became FP.  As other parts of the
algorithms improved, the advantage of these "killer clues" eventually fell
to nothing, and then below nothing because of their bad effects on the FP
rate.  This was all quantified by experiments at the time.

Later I put a bit back in, to capture specific suspicious tags (like " (popiel@wolfskeep.com)
References: <20021216201653.934562DE8C@cashew.wolfskeep.com>
Message-ID: <200212171311.gBHDB5o04483@localhost.localdomain>

   From: "T. Alexander Popiel" 

   [...]
   I now well understand why Tim Peters lost interest in algorithm
   tweaking; until the amount of spam leaking through increases by
   an order of magnitude, I'm probably just going to ignore it, as
   I ignored it for the five years before last summer.  The good is
   the enemy of the perfect, too.

Completely correct.  When testing for > 99.9% accuracy requires that
you grab fresh spam for a month, and the spam mutation rate introduces
new spams at almost that rate, it just isn't fun any more.

The question to ask is "how good is good enough?"  As a human I'm
good to 99.84 % accuracy in classifying spam, so maybe that's good
enough and the pursuit of four-nines accuracy is a windmill to tilt at.
I'm not sure yet, but it's an interesting hypothesis to consider.

     -Bill Yerazunis

From popiel at wolfskeep.com  Tue Dec 17 09:17:02 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 17 12:13:54 2002
Subject: [Spambayes] Anyone find this spam interesting?
In-Reply-To: Message from "John D." of "Mon, 16 Dec 2002 19:01:43 PST."
References: 
Message-ID: <20021217171702.EA27D2DED2@cashew.wolfskeep.com>

In message:   "John D."  writes:
>
>How many of these ones have you people gotten?
[ Free Cash Grant spam ]
>I'm getting a whopping 250 of these per week.   These guys really love me.

Heh.  I got a couple similar to this back in October.
Not getting any with that keyphrase recently...

- Alex

From grobinson at transpose.com  Tue Dec 17 16:23:00 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Tue Dec 17 16:23:05 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: <87257970.1040120563@[192.168.2.9]>
Message-ID: 

I want to suggest a two-stage plan to solve the spam problem. I'm not sure
if it makes sense, but it's interesting enough to me that I decided to share
it to see what other people think.

FIRST STAGE

Many of you are aware of the http://wecanstopspam.org idea, whereby:

--If a lot of real people use it in the sigs of real emails, spam filters
will get trained to see it as a very strong indicator of being legitimate.
Thus, it will have become a sort of "virtual whitelist". I see this as being
able to counteract, to some extent, the fact that spammers will be trying to
use words with no very spammy associations. Instead, this technique puts the
stress on "hammy" words, in particular this very hammy indicator.

--If the URL does become widely used and is accepted by filters, of course
spammers will want to include it too. But at that point, it will be popular
enough that filter authors will be motivated to make sure that only visible,
clickable versions of the URL are given a high hamminess value.
So spammers would have to, in effect, advertise the wecanstopspam.org
website and provide a convenient link.

--The URL would contain information about how to combat spam, as it does
now, but hopefully much better written and presented, as the site evolves
under community guidance. So spammers that include it will be helping their
targets to fight spam.

SECOND STAGE

The problem with all possibly foolproof anti-spam approaches, such as the
pay-to-spam approach, or the camram one (http://www.camram.org/), is that
there is a huge chicken-or-egg problem. The world really has to settle on
one solution and get a real critical mass of users in order for it to work.

Now, if in fact it gets to the point that spammers are sending the
http://wecanstopspam.org URL to millions of users a day (or even if it
doesn't, but millions of individuals are using it because of the virtual
whitelist aspect), then there will be enormous power associated with the
wecanstopspam.org site.

That is, that site may then, all by itself, have the power to determine what
the world standard solution is by announcing it on the site. What will it
be? That would be determined by some sort of community process. Maybe online
voting, or maybe a conference where people would discuss and finally vote on
the solution.

CONCLUSION

If:

--A compelling enough meme could be crafted that people would want to
include the URL in their sigs so that it would spread in a p2p viral
fashion, and

--It is in fact possible for filters to only give credit to the token when
it is visible and clickable,

then it seems to me that this could serve as a realistic means for solving
the chicken-and-egg problem, thereby creating a single dominant standard
with enough critical mass to actually work.

The basis for it is that it avoids the chicken-or-egg problem in the first
stage by leveraging existing spam technology.
It can do that because the substrate is already in place for the idea to get to critical mass, in the form of existing adaptive spam filters such as Graham's. Then when it gets to critical mass, spammers will want to co-opt the token, except that in the act of doing that they give the wecanstopspam.org site enough power to enable the world to agree on a foolproof solution.

Now, I realize the above may be crazy since I haven't thought about it for that long. But I just thought it was perhaps interesting enough to be worth sharing.

Feedback?

--Gary

--
Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig.

Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.transpose.com
http://radio.weblogs.com/0101454

From piersh at friskit.com Tue Dec 17 16:44:37 2002
From: piersh at friskit.com (Piers Haken)
Date: Tue Dec 17 19:30:15 2002
Subject: [Spambayes] Two Stage Plan
Message-ID: <9891913C5BFE87429D71E37F08210CB92C744B@zeus.sfhq.friskit.com>

Sounds like a disaster to me. I hope that spambayes will have an option to completely ignore ANY instance of this URL in ALL messages.

http://wecanstopspam.org

Piers.

From grobinson at transpose.com Tue Dec 17 20:21:11 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Tue Dec 17 20:21:15 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C744B@zeus.sfhq.friskit.com>
Message-ID:

Well, the whole key to the idea is that people would get behind it. Some irrational cranks are always expected, of course, no matter how worthy an idea is. The question is whether a LOT of people can get behind it.

--Gary

--
Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig.

Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.transpose.com
http://radio.weblogs.com/0101454

From grobinson at transpose.com Tue Dec 17 20:49:26 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Tue Dec 17 20:49:33 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To:
Message-ID:

> Vendor involvement: smtp, mta, and maybe even routers and isp vendors may need
> to become deeply involved. They are at the present highly disincented to be
> involved in the fight against spam. In some ways, a fight against spam could
> be perceived as a fight against some of these vendors, and it would behove
> (sp?) us to woo them to our side before that point.

I actually tend to think this will not need integration as deep as you suspect. Penny-per-email schemes make a lot of sense, for example. Camram is interesting because it has many of the advantages without the dangers. See http://spamland.org/jsp/Wiki?Ideas for links.

I'm not saying the final solution has to be chosen from those, at all -- I think as things evolve, better solutions may appear (or may already exist). But I do think it's likely that a practical solution will appear that really only needs buy-in from users and email software providers. Yes?

Right now, it's arguably the case that only Microsoft has the oomph to possibly make such a solution appear, but I think most of the rest of us would like to see a completely open solution that any open-source software provider (or non-MS commercial vendor) can participate in. And Microsoft is unlikely to play in such a space unless it has no choice because the standard is already accepted.
> Legal ramifications: the first thing that jumps to my mind here is that the
> emergence of a powerful antispam organization gives spammers a nicely visible
> legal target for lawsuits, etc. They could claim infringement of rights,
> which puts things in a federal arena. Also, multinational stuff gets
> involved, which complicates things considerably. I think we should pay close
> attention to this kind of issue.

That's not something I've considered. You're talking about free speech rights? I wonder what precedent there is... One thing I think of is that some telcos sell features where unknown callers have to record who they are, so that people have a choice about whether to let them through... We'd have to talk to a lawyer about this, but my guess is we could get some pro bono time. Good point, I'm glad you brought it up!

> May I propose a whitepaper? I'd be happy to work with you (or someone) on it.
> I'm somewhat of a neophyte on the subject still, but I have good writing
> skills, and my naivete (sp?) might prove to be an asset. Paul Graham's
> article certainly stirred worldwide action, and an article like this might be
> just the ticket...

Yeah, I think a whitepaper is a great idea. I personally couldn't take the time to make a really polished one now, so teaming up could be a good idea.

Also, I was thinking of offering to talk about it at the Jan 17 conference in Cambridge, if initial feedback on these lists is decent (with the expected exceptions, of course). It seemed to me that that might be a good place to further test the waters before spending a lot of time on a polished white paper.

To me, the ideas we're discussing seem logical, at least tentatively, with the big caveat that some people just won't want to include a URL in their sigs no matter whether or not it will ultimately save them significant time every day or help their emails get through! Such practicalities will take second place. What I don't know is what proportion of people will feel that way and what proportion would like to participate in a practical path to solving the problem.

--Gary

--
Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig.

Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.transpose.com
http://radio.weblogs.com/0101454

> From: Tim Stone - Four Stones Expressions
> Reply-To: tim@fourstonesExpressions.com
> Date: Tue, 17 Dec 2002 16:04:39 -0600
> To: Gary Robinson
> Subject: Re: [Spambayes] Two Stage Plan
>
> Gary, you've made some very good points here. There are several avenues
> remaining to be identified. Two that I can think of (off the top of my head)
> are: vendor involvement, and legal ramifications.
>
> Vendor involvement: smtp, mta, and maybe even routers and isp vendors may need
> to become deeply involved. They are at the present highly disincented to be
> involved in the fight against spam. In some ways, a fight against spam could
> be perceived as a fight against some of these vendors, and it would behove
> (sp?) us to woo them to our side before that point.
>
> Legal ramifications: the first thing that jumps to my mind here is that the
> emergence of a powerful antispam organization gives spammers a nicely visible
> legal target for lawsuits, etc. They could claim infringement of rights,
> which puts things in a federal arena. Also, multinational stuff gets
> involved, which complicates things considerably. I think we should pay close
> attention to this kind of issue.
>
> May I propose a whitepaper? I'd be happy to work with you (or someone) on it.
> I'm somewhat of a neophyte on the subject still, but I have good writing
> skills, and my naivete (sp?) might prove to be an asset. Paul Graham's
> article certainly stirred worldwide action, and an article like this might be
> just the ticket...
>
> - TimS
>
> c'est moi - TimS
> www.fourstonesExpressions.com
> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails

From skip at pobox.com Tue Dec 17 21:48:52 2002
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 17 22:49:01 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To:
References: <87257970.1040120563@[192.168.2.9]>
Message-ID: <15871.61476.530128.540562@montanaro.dyndns.org>

Gary,

I'm not sure I understand all the ramifications, but I'm willing to give it a try. I doubt it can make the volume of spam I receive any worse all by itself, and if spambayes winds up seeing it as a strong enough spam indicator I'll just dump it. ;-)

--
Skip Montanaro - skip@pobox.com
http://www.musi-cal.com/
http://www.wecanstopspam.org/

From piersh at friskit.com Tue Dec 17 22:23:27 2002
From: piersh at friskit.com (Piers Haken)
Date: Wed Dec 18 01:09:03 2002
Subject: [Spambayes] Two Stage Plan
Message-ID: <9891913C5BFE87429D71E37F08210CB9297524@zeus.sfhq.friskit.com>

I think the real question is how much of a pain in the ass it is going to be when joe email-user finds that his spam filter stops working because of this and he has no idea how to make it ignore the token. Or are we assuming that anti-spam products are for the technically savvy only?

It seems to me that it's more trouble than it's worth, especially since we already have filters that are capable of filtering almost all spam correctly. Why introduce something into the system that could potentially reduce their effectiveness?
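A minimal sketch of the kind of "ignore this token" option discussed in this thread, assuming a filter that scores a message from per-token spam probabilities. Everything here is hypothetical: `IGNORED_TOKENS`, `TOKEN_SPAMPROB`, `tokenize` and `score` are illustrative names rather than the spambayes API, the probabilities are made-up numbers rather than trained data, and the plain average stands in for whatever combining rule a real filter uses.

```python
import re

# Hypothetical ignore list: tokens the user never wants to influence the score.
IGNORED_TOKENS = {"wecanstopspam.org"}

# Illustrative per-token spam probabilities (invented values, not trained data).
TOKEN_SPAMPROB = {
    "wecanstopspam.org": 0.01,  # assumed strongly "hammy" after training on real sigs
    "free": 0.90,
    "cash": 0.92,
    "grant": 0.85,
}

UNKNOWN_PROB = 0.5  # neutral value for tokens never seen in training


def tokenize(text):
    """Lowercase the text and return word-like tokens, keeping dots and hyphens inside a token."""
    return re.findall(r"[a-z0-9][a-z0-9.\-]*[a-z0-9]|[a-z0-9]", text.lower())


def score(text, ignored=frozenset()):
    """Average the spam probabilities of the distinct tokens, skipping ignored ones."""
    tokens = set(tokenize(text)) - set(ignored)
    if not tokens:
        return UNKNOWN_PROB
    return sum(TOKEN_SPAMPROB.get(t, UNKNOWN_PROB) for t in tokens) / len(tokens)


spam = "Free cash grant http://wecanstopspam.org"
print(score(spam))                          # the URL token pulls the score toward ham
print(score(spam, ignored=IGNORED_TOKENS))  # the URL token no longer has any effect
```

Filtering tokens before scoring, rather than deleting them from the trained database, keeps the setting reversible: clearing the ignore list restores the original behaviour without retraining.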
Call me an irrational crank, but... oh wait...

Piers.

From popiel at wolfskeep.com Tue Dec 17 22:29:45 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Dec 18 01:26:31 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: Message from Gary Robinson
References:
Message-ID: <20021218062945.76AF72DED2@cashew.wolfskeep.com>

In message: Gary Robinson writes:
>Well, the whole key to the idea is that people would get behind it. Some
>irrational cranks are always expected, of course, no matter how worthy an
>idea is. The question is whether a LOT of people can get behind it.

Hrm, this mailing list seems to be filled with a lot of irrational cranks, myself included. I thought that one of the main features of the spambayes approach is that we actually follow the data, instead of trying to coerce the data (or the mail-sending population) into supporting some particular classification scheme. I guess we're just not into groupthink here.

- Alex

From grobinson at transpose.com Wed Dec 18 07:06:08 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Wed Dec 18 07:06:11 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: <20021218062945.76AF72DED2@cashew.wolfskeep.com>
Message-ID:

> Hrm, this mailing list seems to be filled with a lot of irrational
> cranks, myself included. .... I guess
> we're just not into groupthink here.

The message I was responding to was completely flippant in tone, so I unfortunately responded in kind. Sorry. I shouldn't have. My fault.
I'm actually not trying to push the idea I presented. First of all I don't know if the reasoning is right or wrong until it's discussed. Secondly, even if the reasoning is right, it won't work if people don't LIKE it. So all I'm interested in doing at this point is find out what people think. So far, it looks like some people like it and some don't. But it seems that maybe enough like it to get it off the ground if it actually makes sense! > I thought that one of the main features > of the spambayes approach is that we actually follow the data, > instead of trying to coerce the data (or the mail-sending population) > into supporting some particular classification scheme. My personal feelings on this point are a) Yes, and the adaptive statistical approach is working great for me personally compared to not having a spam filter and also compared to the lame one in microsoft's Entourage. b) It isn't really as perfect as I would like, because FP's do still occur, so I still do have to check the subject and/or sender of every rejected email, which still costs me valuable time every day. c) Since there is nothing to give spammers a disincentive for sending me more spam, I have no reason to assume that I won't get 10 times as much a year from now as I do now (I do get 10 times as much now as a year ago), and then checking rejected email will take 10 times longer, whch would be completely intolerable. And I see no reason why it shouldn't be 100 times worse 2 years from now. But even if things just stay as they are, they are much worse than I want. d) I believe that there are other approaches that will make things much more difficult for spammers (such as cost-based ones) which would in fact vastly reduce the amount of spam and thus do a much more complete job of solving the problem. The challenge is to get everybody aboard the same solution. 
e) While there is a natural tendency toward individualism, there are situations where everybody coming aboard facilitates something worth the sacrifice in individualism. TCP/IP and HTML being good examples. --Gary -- Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig. Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.transpose.com http://radio.weblogs.com/0101454 > From: "T. Alexander Popiel" > Date: Tue, 17 Dec 2002 22:29:45 -0800 > To: Gary Robinson > Cc: SpamBayes , popiel@wolfskeep.com > Subject: Re: [Spambayes] Two Stage Plan > > In message: > Gary Robinson writes: > >> Well, the whole key to the idea is that people would get behind it. Some >> irrational cranks are always expected, of course, no matter how worthy an >> idea is. The question is whether a LOT of people can get behind it. > > Hrm, this mailing list seems to be filled with a lot of irrational > cranks, myself included. I thought that one of the main features > of the spambayes approach is that we actually follow the data, > instead of trying to coerce the data (or the mail-sending population) > into supporting some particular classification scheme. I guess > we're just not into groupthink here. > > - Alex > From grobinson at transpose.com Wed Dec 18 07:36:58 2002 From: grobinson at transpose.com (Gary Robinson) Date: Wed Dec 18 07:37:01 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: <20021218062945.76AF72DED2@cashew.wolfskeep.com> Message-ID: Another thought in response to your response. Again, there is a natural reluctance to all get behind one solution. But you know what? We don't have a choice. Microsoft will choose a solution and all Microsoft users will use it. And because Microsoft has enough umph, that solution will work very well for those users. Yes, if you buy Windows, you will have your spam problem solved to a great degree. 
But if you don't have Windows, then..., er..., your email might not get through because Microsoft will be filtering spam for good upright Microsoft customers. They'll say they are doing the best they can, but can't control what non-MS users do, so can't control whether their email gets through to non MS users. At the same time, mysteriously, all the tools won't be available for non-MS vendors to create fully compatible email clients and servers. These barriers will almost certainly include patents. So, IMO, there isn't really a choice. A single solution will emerge; there is no question about it because MS has the motivation and means to provide it. And it will work great for 90% of users (the ones who use Windows and Outlook on their desktops). It will exist within a year or two. The ONLY question is whether that single solution is, at its core, one more tool to increase the power of MS's monopoly, or whether it's something that the rest of the world can participate in too. --Gary -- Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig. Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.transpose.com http://radio.weblogs.com/0101454 > From: "T. Alexander Popiel" > Date: Tue, 17 Dec 2002 22:29:45 -0800 > To: Gary Robinson > Cc: SpamBayes , popiel@wolfskeep.com > Subject: Re: [Spambayes] Two Stage Plan > > In message: > Gary Robinson writes: > >> Well, the whole key to the idea is that people would get behind it. Some >> irrational cranks are always expected, of course, no matter how worthy an >> idea is. The question is whether a LOT of people can get behind it. > > Hrm, this mailing list seems to be filled with a lot of irrational > cranks, myself included.
I thought that one of the main features > of the spambayes approach is that we actually follow the data, > instead of trying to coerce the data (or the mail-sending population) > into supporting some particular classification scheme. I guess > we're just not into groupthink here. > > - Alex > From drew-public at poured.net Wed Dec 18 11:13:19 2002 From: drew-public at poured.net (Drew Raines) Date: Wed Dec 18 12:30:56 2002 Subject: [Spambayes] Re: Cute spam trick References: Message-ID: Tim Peters writes: > Staring at the source revealed a cute trick I haven't seen before: > > ... > Let the Lenders
> Compete for your Loan!
> ... > > That is, the spammy words like Lenders and Compete and Loan! are broken up > by embedded HTML comments. Perhaps the evolution of this technique could result in HTML- e-mail's demise? If nothing else, MUA's which send with HTML by default may stop doing so. One can dream. From popiel at wolfskeep.com Wed Dec 18 10:19:22 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Dec 18 13:16:06 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: Message from Gary Robinson References: Message-ID: <20021218181922.A96CA2DED0@cashew.wolfskeep.com> In message: Gary Robinson writes: >Again, there is a natural reluctance to all get behind one solution. > >But you know what? We don't have a choice. > >Microsoft will choose a solution and all Microsoft users will use it. And >because Microsoft has enough umph, that solution will work very well for >those users. > >Yes, if you buy Windows, you will have your spam problem solved to a great >degree. I don't buy that. Either the product or the argument. ;-) Given that spam mutates at a measurable rate (with filters seeming to degrade after about 3 to 6 months), to make such a solution work, Microsoft would have to provide updates regularly and people would have to accept those updates regularly. This could either be as product upgrades (which people are already complaining about) or through a subscription service (which people are leery of). If you were to claim that people using MSN would have their spam problem solved, then I might be able to believe that. However, that's a lot less than the 80-90% of users needed to make it the Generally Accepted Way (One True Way's little brother)... and as such, it would have to interoperate with other schemes. 
- Alex From grobinson at transpose.com Wed Dec 18 13:25:59 2002 From: grobinson at transpose.com (Gary Robinson) Date: Wed Dec 18 13:26:46 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: <20021218181922.A96CA2DED0@cashew.wolfskeep.com> Message-ID: > > Given that spam mutates at a measurable rate (with filters seeming > to degrade after about 3 to 6 months), to make such a solution work, > Microsoft would have to provide updates regularly and people would > have to accept those updates regularly. This could either be as > product upgrades (which people are already complaining about) or > through a subscription service (which people are leery of). No. What if people have to pay a penny for each email message they send (such that legitimate people don't pay anything because they "earn" by receiving email about as much as they "spend" by sending it)? Or what if their computer has to do a certain amount of work... say 15 seconds of CPU time... to generate a "hashcash" coin that is required by the recipient before an email can get through? It wouldn't affect normal users at all... but spammers would suddenly be unable to send millions of spams without making a much larger hardware investment than they are likely to be able to afford. What if MS makes the mechanisms for one of these kinds of schemes 100% transparent so normal users don't have to think about them at all? You're thinking in terms of filters based on content. That's not the only way to do it, and I don't think it's what's going to happen again. There is no way that spam can mutate around the kinds of solutions above. For instance, the cost-based one will COST something no matter what the spammers do. They will have to pay the cost or give up. Gary From popiel at wolfskeep.com Wed Dec 18 10:59:37 2002 From: popiel at wolfskeep.com (T.
Alexander Popiel) Date: Wed Dec 18 13:56:21 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: Message from Gary Robinson References: Message-ID: <20021218185937.B5ADF2DED0@cashew.wolfskeep.com> In message: Gary Robinson writes: > >No. What if people have to pay a penny for each email message they send >(such that legitimate people don't pay anything because they "earn" by >receiving email about as much as they "spend" by sending it)? If the money is being verified, then you have a money clearinghouse problem... which leads right back to the subscription service. If the money isn't being verified, then it's worthless. >Or what if their computer has to do a certain amount of work... say 15 >seconds of CPU time... to generate a "hashcash" coin that is required by the >recipient before an email can get through? First, not all computers are created equal; what's 15 seconds on my PentiumII is 5 seconds on my Athlon. Or less than a second on an FPGA programmed for the purpose. There's no way to tell how much someone actually spent on the 'coin'. Second, Moore's Law is against you, creating something equivalent to runaway inflation... forcing everyone to upgrade their hardware as fast as the spammers do, to keep from paying extortionate time costs to send mail while still keeping the spammers from getting away free. >What if MS makes the mechanisms for one of this kind of scheme 100% >transparent so normal users don't have to think about them at all? I don't think they _can_. Effort in sending purely digital content cannot be verified by only software, and anything else incurs service costs. - Alex From grobinson at transpose.com Wed Dec 18 14:06:46 2002 From: grobinson at transpose.com (Gary Robinson) Date: Wed Dec 18 14:06:49 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: <20021218185937.B5ADF2DED0@cashew.wolfskeep.com> Message-ID: > From: "T.
Alexander Popiel" > Date: Wed, 18 Dec 2002 10:59:37 -0800 > To: Gary Robinson > Cc: SpamBayes , popiel@wolfskeep.com > Subject: Re: [Spambayes] Two Stage Plan > > In message: > Gary Robinson writes: >> >> No. What if people have to pay a penny for each email message they send >> (such that legitimate people don't pay anything because they "earn" by >> receiving email about as much as they "spend" by sending it)? > > If the money is being verified, then you have a money clearinghouse > problem... which leads right back to the subscription service. If > the money isn't being verified, then it's worthless. There would be a subscription, but it would not need to cost much at all. It would not have to be a significant barrier. It would probably be part of an overall subscription that included software updates, etc. As you should know, MS is moving toward a "software rental" model. > >> Or what if their computer has to do a certain amount of work... say 15 >> seconds of CPU time... to generate a "hashcash" coin that is required by the >> recipient before an email can get through? > > First, not all computers are created equal; what's 15 seconds on my > PentiumII is 5 seconds on my Athlon. Or less than a second on an > FPGA programmed for the purpose. There's no way to tell how much > someone actually spent on the 'coin'. Right, you can get maybe a factor of 10 or 15 that way. Spammers would need a factor of thousands. Also, of course, people with slow machines could be required to put in 10 times as many cycles... it would just be one more downside to a slow machine... probably still one that would be unnoticed, however. > > Second, Moore's Law is against you, creating something equivalent > to runaway inflation... forcing everyone to upgrade their hardware > as fast as the spammers do, to keep from paying extortionate > time costs to send mail while still keeping the spammers from getting > away free. People are already upgrading their hardware for other reasons.
Those that don't are penalized for it and this would be one more penalty. > I don't think they _can_. Effort in sending purely digital content > cannot be verified by only software, That is totally incorrect. Gary and anything else incurs service > costs. > > - Alex > From grobinson at transpose.com Wed Dec 18 15:09:12 2002 From: grobinson at transpose.com (Gary Robinson) Date: Wed Dec 18 15:46:23 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: <20021218200240.AAE952DED0@cashew.wolfskeep.com> Message-ID: > I'm currently working for a company in the transaction tracking and > aggregation business. Transactions under about 5 cents still are a > net loss to the vendor, even with significant volumes. The service > cost is high enough to keep the company afloat (that is, significant). Ah good. Now we can talk about something interesting. What is the basis of the 5 cents? Where does it go? My thinking is that if MS is controlling the whole thing, putting $.01 from one person's account to another will take less overhead than an email, which costs nothing. I.e., I think the complexity comes in when multiple companies are involved. It is one more reason why such a solution would increase MS's monopoly power. >> Right you can get maybe a factor of 10 or 15 that way. Spammers would >> need a factor of thousands. > > Funny... the camram site quotes an expected factor of 500. Darn close > to your 'thousands'. Thanks for the correction, which I will keep in mind. It doesn't change the argument, however. --Gary -- Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig. Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.transpose.com http://radio.weblogs.com/0101454 > From: "T. Alexander Popiel" > Date: Wed, 18 Dec 2002 12:02:40 -0800 > To: Gary Robinson > Cc: popiel@wolfskeep.com > Subject: Re: [Spambayes] Two Stage Plan > > Taking this off the list, since it's getting down to > 'is so!' 
'is not!' exchanges. ;-) > > In message: > Gary Robinson writes: >> >>> If the money is being verified, then you have a money clearinghouse >>> problem... which leads right back to the subscription service. >> >> There would be a subscription, but it would not need to cost much at all. It >> would not have to be a significant barrier. it would probably be part of an >> overall subscription that included software updates, etc. As should know, MS >> is moving toward a "software rental" model. > > I'm currently working for a company in the transaction tracking and > aggregation business. Transactions under about 5 cents still are a > net loss to the vendor, even with significant volumes. The service > cost is high enough to keep the company afloat (that is, significant). > > Yes, MS is moving to a rental model. Many people are moving away > from MS (or just staying with old versions) because of it. > >>> First, not all computers are created equal; what's 15 seconds on my >>> PentiumII is 5 seconds on my Athlon. Or less than a second on an >>> FPGA programmed for the purpose. There's no way to tell how much >>> someone actually spent on the 'coin'. >> >> Right you can get maybe a factor of 10 or 15 that way. Spammers would >> need a factor of thousands. > > Funny... the camram site quotes an expected factor of 500. Darn close > to your 'thousands'. > > Also, to defend against such attacks, they suggest changing the puzzle > on an irregular basis... leading back to the upgrade/subscription > problem. > > - Alex > From grobinson at transpose.com Wed Dec 18 15:48:08 2002 From: grobinson at transpose.com (Gary Robinson) Date: Wed Dec 18 15:48:08 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: Message-ID: >> I'm currently working for a company in the transaction tracking and >> aggregation business. Transactions under about 5 cents still are a >> net loss to the vendor, even with significant volumes. 
The service >> cost is high enough to keep the company afloat (that is, significant). > > Ah good. Now we can talk about something interesting. > > What is the basis of the 5 cents? Where does it go? > > My thinking is that if MS is controlling the whole thing, putting $.01 from > one person's account to another will take less overhead than an email, which > costs nothing. I.e., I think the complexity comes in when multiple companies > are involved. It is one more reason why such a solution would increase MS's > monopoly power. Answering my own question, with a request for feedback: Mainly it seems to me that if the transaction is between people's accounts the expenses involve 1) Loading the account in the first place -- there would be a charge there. The credit card company would get some reasonable fee. Assuming this was part of an MS software rental program, however, it would be pretty much absorbed in the transaction to pay for that. Another $1 or $5 per year out of probably $50 or more is only going to get zinged by, I dunno, absolute 10% max overall including the minimum fee and the percentage, which would be $.001 overhead per $.01 transaction. (NOTE I am not an expert, at all, on the exact fees here, but am just trying to get an overall sense of things, please correct me.) 2) There is some expense for storing the data associated with each transaction. I've been in the software biz for > 20 years, including having full responsibility for the design of very large customer databases for what used to be called New York Telephone. Given the cost of disk space these days, and the efficiency of modern databases, my gut feeling, for what it's worth, is that the cost of storing the transactions is negligible compared to $.01. 3) There is expense for communications etc but you aren't sending email unless you have that stuff covered, so that isn't an issue. The above, again, assumes that only MS users are involved.
So I see a small overhead per email, well worth it to be rid of spam. (Except of course MS would use its power to squeeze as much profit from each person as they could once they've gotten everyone locked in; eventually it wouldn't be so great a deal if they really do end up with control. So they shouldn't have it. There should be some organization that the various email software vendors participate in so that it has the one-vendor advantage of MS without giving MS full control.) I'm not the only one who thinks that the pay-to-play idea makes sense, at least if it's under the control of MS. There are some very, very smart people who do. If we're wrong because of the financial overhead you mention, it would be good for us to understand that. --Gary -- Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig. Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.transpose.com http://radio.weblogs.com/0101454 From esj at harvee.billerica.ma.us Wed Dec 18 16:21:34 2002 From: esj at harvee.billerica.ma.us (Eric S. Johansson) Date: Wed Dec 18 16:31:01 2002 Subject: [Spambayes] an alternative use of filters Message-ID: <3E00E6DE.4070201@harvee.billerica.ma.us> I'm working on another antispam project (camram) which converts e-mail from a receiver pays (traditional with or without filters) to a sender pays system in which proof of work postage stamps are the go/nogo test. Unstamped mail that isn't from a white listed address generates a postage due notice. Needless to say, that can generate a lot of postage due notices. In an attempt to reduce the number of postage due notices, I'm interested in using Spam filters to categorize the mail into three buckets; clearly Spam, clearly not Spam, and can't tell. Only the can't tell messages will get postage due notices. So, will your filter give me this discrimination capability?
---eric From popiel at wolfskeep.com Wed Dec 18 13:39:15 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Dec 18 16:37:19 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: Message from Gary Robinson References: Message-ID: <20021218213915.771762DED0@cashew.wolfskeep.com> In message: Gary Robinson writes: >> I'm currently working for a company in the transaction tracking and >> aggregation business. Transactions under about 5 cents still are a >> net loss to the vendor, even with significant volumes. The service >> cost is high enough to keep the company afloat (that is, significant). > >Ah good. Now we can talk about something interesting. > >What is the basis of the 5 cents? Where does it go? Alas, I'm not one of the financial guys here, so I can't say with any authority, but here's my best understanding: 1. There's account setup costs. These are variable, but typically fairly hefty for the vendors. I would be unsurprised if there's a fair amount of milking going on here, but there is a minimum level. 2. There's data storage costs. These tend to vary wildly based on who you ask. I know that the big raid arrays are expensive, though, particularly when you get to dual or triple failure tolerance. 3. There's connectivity costs. These are not particularly high bandwidth, but latency needs to be low and the reliability needs to be extreme. Anything more than 3 second turnaround is unacceptable, and downtime _really_ hurts. (The latency requirement might be relaxed if the only product is email... but for stuff like ringtones, customers get annoyed easily.) 4. There's billing costs. Actually doing collections of the money is one of the largest expenses... even if it's just to a credit card. Somewhere down the line (for typical consumers, at least), a physical mail is getting sent, a check is getting handled and processed... and this cost gets reflected up the line. 
While processing fees are generally expressed as a base + percentage of the sum, if you do too many small transactions, then the credit card companies will start increasing the base (so cross-cancellation doesn't help much). 5. There's authentication costs. Fraud is a noticeable problem. Note that authentication is not necessarily per-transaction, but it is required on a statistically sufficient basis. 6. There's retrieval costs. Outside audits in particular are expensive. 7. There's security and destruction costs. The data has to be safe from prying eyes, and it must go away after certain lengths of time. When it goes away, you must be sure it goes away... privacy liability sucks. Unfortunately, I don't have the information to actually put values to any of these pieces. I've probably left out a few pieces, too. Only the ambient office knowledge that "<5 cents isn't worth it". We really encourage transactions between $.75 and $3.00. Above $3.00, we start imposing additional authentication measures. - Alex From grobinson at transpose.com Wed Dec 18 16:51:38 2002 From: grobinson at transpose.com (Gary Robinson) Date: Wed Dec 18 16:51:41 2002 Subject: [Spambayes] Two Stage Plan In-Reply-To: <20021218213915.771762DED0@cashew.wolfskeep.com> Message-ID: OK, but MOST of those expenses go away in the MS-monopoly-it-goes-along-with-your-software-rental fees, as far as I can tell, right? 1. Account setup has already happened. 2. There is very little storage per $.01 transaction of an already registered user. It could probably be done in, oh, 20 bytes, including timestamp. Yes, there is some overhead to carry out the updates, but as an experienced database guy, I don't think it's going to be too significant for this application. 3. You're correct about the latency being relaxed for emails. I don't think it will add significantly to the cost to get an acceptable latency. 4. Billing costs.
The only billing cost is in loading the account, say, once per year, and it would probably be part of an overall software subscription bill that MS would be sending out anyway, if MS is the vendor. It would be negligible comparatively. Remember that $1 would actually pay for MANY transactions because the $.01 is being shifted between user accounts all the time and is reused by the recipient. 5. Authentication. Again only done in the loading of the account, and depending on whether it's part of a larger software rental package, may be being done anyway. 6. Retrieval costs -- outside audits. I think this probably goes away too for the reasons above but you'd have to spell it out more for me to be sure. 7. Security, etc. Already covered in the MS case for their subscribers. For a non-MS company to do it, NOT as part of an overall software rental plan, the expense will be more. But I don't see that it's going to be remotely $.05 per email, because there is only one "hard" transaction (where money is actually moved by credit card) per year usually, to fill the accounts so that they can shift $.01 back and forth by means of very simple "soft" database transactions (again I'm quite experienced in database stuff, not only from designing large systems for the phone company, but because of personally designing and running an online service where people paid by the minute). --Gary -- Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig. Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.transpose.com http://radio.weblogs.com/0101454 > From: "T. Alexander Popiel" > Date: Wed, 18 Dec 2002 13:39:15 -0800 > To: Gary Robinson > Cc: SpamBayes , popiel@wolfskeep.com > Subject: Re: [Spambayes] Two Stage Plan > > In message: > Gary Robinson writes: >>> I'm currently working for a company in the transaction tracking and >>> aggregation business.
Transactions under about 5 cents still are a >>> net loss to the vendor, even with significant volumes. The service >>> cost is high enough to keep the company afloat (that is, significant). >> >> Ah good. Now we can talk about something interesting. >> >> What is the basis of the 5 cents? Where does it go? > > Alas, I'm not one of the financial guys here, so I can't say with > any authority, but here's my best understanding: > > 1. There's account setup costs. These are variable, but typically > fairly hefty for the vendors. I would be unsurprised if there's > a fair amount of milking going on here, but there is a minimum > level. > > 2. There's data storage costs. These tend to vary wildly based on > who you ask. I know that the big raid arrays are expensive, > though, particularly when you get to dual or triple failure > tolerance. > > 3. There's connectivity costs. These are not particularly high > bandwidth, but latency needs to be low and the reliability > needs to be extreme. Anything more than 3 second turnaround > is unacceptable, and downtime _really_ hurts. (The latency > requirement might be relaxed if the only product is email... > but for stuff like ringtones, customers get annoyed easily.) > > 4. There's billing costs. Actually doing collections of the > money is one of the largest expenses... even if it's just to > a credit card. Somewhere down the line (for typical consumers, > at least), a physical mail is getting sent, a check is getting > handled and processed... and this cost gets reflected up the > line. While processing fees are generally expressed as a > base + percentage of the sum, if you do too many small > transactions, then the credit card companies will start > increasing the base (so cross-cancellation doesn't help much). > > 5. There's authentication costs. Fraud is a noticable problem. > Note that authentication is not necessarily per-transaction, > but it is required on a statistically sufficient basis. > > 6. 
There's retrieval costs. Outside audits in particular are > expensive. > > 7. There's security and destruction costs. The data has to be > safe from prying eyes, and it must go away after certain > lengths of time. When it goes away, you must be sure it > goes away... privacy liabiliy sucks. > > Unfortunately, I don't have the information to actually put > values to any of these pieces. I've probably left out a few > pieces, too. Only the ambient office knowledge that "<5 cents > isn't worth it". We really encourage transactions between $.75 > and $3.00. Above $3.00, we start imposing additional > authentication measures. > > - Alex > From tim.one at comcast.net Wed Dec 18 17:04:39 2002 From: tim.one at comcast.net (Tim Peters) Date: Wed Dec 18 17:06:56 2002 Subject: [Spambayes] an alternative use of filters In-Reply-To: <3E00E6DE.4070201@harvee.billerica.ma.us> Message-ID: [Eric S. Johansson] > I'm working on another antispam project (camram) which converts > e-mail from a receiver pays (traditional with or without filters) to a > sender pays system in which proof of work postage stamps are the > go/nogo test. > > Unstamped mail that isn't from a white listed address generates a > postage to notice. Needless to say, that can generate a lot of postage > due notices. In an attempt to reduce the number of postage due notices, > I'm interested in using Spam filters to categorize the mail into three > buckets; clearly Spam, clearly not Spam, and can't tell. Only the can't > tell messages will get postage due notices. > > So, will your filter give me this discrimination capability? Three-way classification is the intended use of the spambayes classifier. A msg gets a score from 0.0 (ham) to 1.0 (spam) and there are two configurable cutoffs: msgs with a score below ham_cutoff are called Ham, above spam_cutoff Spam, and any score between those Unsure. 
While experience varies across test sets and care in training, in my experience Unsures are, over time, about half spam and half ham. A curious and semi-encouraging thing is that they're overwhelmingly msgs *I* can't judge at a glance either, and sometimes it's so hard to tell I just throw the msg away as unintelligible. I call that "semi-"encouraging because, in conjunction with camram, I don't believe I'd want Unsures stopped from reaching me. For example, a common class of Unsures is commercial HTML email from companies I do business with; e.g., last week I got an Unsure that was an auto-generated order receipt for an online order of a software program. I wanted to get the receipt, but the email was very spammish, full of ads and links for follow-on offers, and other marketing collateral. I doubt reply email would be seen by a human, so a postage-due scheme probably would have dropped it into the bit bucket on both ends. The outstanding feature of the kind of classifier we're using is that it adjusts to an individual's notions of what constitutes ham and spam, so this kind of mistake is less frequent here than under other systems (for example, the order receipt mentioned above wasn't called spam, because the system knew I ordered other software of similar nature in the past; but the email *would* have been called spam if most other people had received it). But the error rates are, while very low for individual use, still non-zero, and I expect they always will be. So, if you try this, I suggest setting ham_cutoff very low (below 0.05), and spam_cutoff very high (over 0.95). The median ham score is essentially 0, and the median spam score is essentially 1.0, so, while aggressive, this isn't quite as extreme as it may sound at first. The problem I expect remains, though: solicited commercial email, and especially the first few times a user gets one from a given vendor, will end up Unsure, and there may not be anyone on the other end to respond to a postage nag.
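The three-way split described above, and the way a postage-due scheme might hang off it, can be sketched in a few lines. This is only an illustration, not the spambayes API: `classify` and `needs_postage_due_notice` are hypothetical names, the default cutoffs (0.02 and 0.98) are just sample values inside the "below 0.05" and "over 0.95" ranges suggested in the message, and treating a score exactly equal to a cutoff as Unsure is an assumption.

```python
def classify(score: float, ham_cutoff: float = 0.02,
             spam_cutoff: float = 0.98) -> str:
    """Map a score in [0.0, 1.0] to 'ham', 'spam' or 'unsure'.

    Hypothetical helper: scores below ham_cutoff are ham, scores above
    spam_cutoff are spam, everything in between is unsure.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


def needs_postage_due_notice(score: float, whitelisted: bool = False) -> bool:
    """Only mail the filter can't call either way gets a notice.

    Whitelisted senders never get one (assumption based on the
    camram description quoted in the message).
    """
    if whitelisted:
        return False
    return classify(score) == "unsure"


if __name__ == "__main__":
    for s in (0.0, 0.01, 0.5, 0.97, 1.0):
        print(s, classify(s), needs_postage_due_notice(s))
```

With cutoffs this far apart the Unsure band is wide, so solicited commercial mail like the order receipt mentioned above would still land in it and trigger a notice.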
From popiel at wolfskeep.com Wed Dec 18 14:33:23 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Dec 18 17:30:07 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: Message from Gary Robinson
References:
Message-ID: <20021218223323.A4E362DED0@cashew.wolfskeep.com>

In message: Gary Robinson writes:
>OK, but MOST of those expenses go away in the MS-monopoly-it-goes-along-with-your-software-rental fees, as far as I can tell, right?
>
>1. Account setup has already happened.
>
>2. There is very little storage per $.01 transaction of an already registered user. It could probably be done in, oh, 20 bytes, including timestamp. Yes, there is some overhead to carry out the updates, but as an experienced database guy, I don't think it's going to be too significant for this application.

You can only shrink it that small if you give up traceability. Otherwise you need things like the datetime, sender/receiver, and unique identifiers... minimum of about a hundred bytes once indexes are added, etc. More if you use non-binary formats or wrap stuff in XML or something stupid (but popular!).

>4. Billing costs. The only billing cost is in loading the account, say, once per year, and it would probably be part of an overall software subscription bill that MS would be sending out anyway, if MS is the vendor. It would be negligible comparatively. Remember that $1 would actually pay for MANY transactions because the $.01 is being shifted between user accounts all the time and is reused by the recipient.

If you only bill once a year, then you start running into problems tracking down mid-level fraud (because the customers won't alert you to problems until they see the bill, and by then the trail has most likely gone cold).

Small-scale fraud is just a cost of business, and large-scale fraud you'll notice yourself.

>5. Authentication. Again only done in the loading of the account, and depending on whether it's part of a larger software rental package, may be being done anyway.

If you only do authentication at account load time, then you run into problems with spammers masquerading as other users. Return-address forgery is already a problem... it'll get worse as real money gets attached to it.

>6. Retrieval costs -- outside audits. I think this probably goes away too for the reasons above but you'd have to spell it out more for me to be sure.

No, outside audits don't go away. Anyone who handles large amounts of money (and this would be a large amount of money, even at only a penny per email) gets their accounting practices scrutinized. I'll grant that MS is already under such scrutiny, but being a broker as well as a vendor adds a whole other dimension to the mess.

- Alex

From grobinson at transpose.com Wed Dec 18 17:50:43 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Wed Dec 18 17:50:43 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: <20021218223323.A4E362DED0@cashew.wolfskeep.com>
Message-ID:

>> this application.
>
> You can only shrink it that small if you give up traceability. Otherwise you need things like the datetime, sender/receiver, and unique identifiers... minimum of about a hundred bytes once indexes are added, etc. More if you use non-binary formats or wrap stuff in XML or something stupid (but popular!).

True, but I still don't see it as an issue. The email addresses would be on file at the vendor and could be represented in each transaction as 5 byte (max) binary numbers. 8 if you don't want to be fancy to save space. I can't see this going to more than 50 bytes...

If it did get to be an issue, which again I do NOT think it is due to the cost of storage today and even further decreased cost of storage tomorrow, then traceability COULD be given up, perhaps -- worth considering. We're talking about $.01 transactions here. If worse came to worst and something went askew with one, it simply would not be worth fixing. This is something that needs to be thought about more IF the storage was an issue, which again, I don't think it is.

Let's forget about the fact that individuals can buy 100gig hard drives for $100 or something these days. Suppose we are talking about secure online storage. From Apple, you can get a secure, backed up 1gig online drive for < $400 per year. That would handle, easily, 10 million of these transactions (100 bytes per transaction). That's $0.00004 per transaction for storage.

The cost of physical storage for this is negligible.

The actual CPU time of the updates would possibly be a bigger factor, but I really don't think it would be much more. If you disagree we could look into it further.

> If you only bill once a year, then you start running into problems tracking down mid-level fraud (because the customers won't alert you to problems until they see the bill, and by then the trail has most likely gone cold).
>
> Small-scale fraud is just a cost of business, and large-scale fraud you'll notice yourself.

Fraud is a factor with most financial transactions, but Microsoft would be guaranteeing this service (the delivery of an email) and there would not be fraud in that question as far as I can see. (I.e. MS would not defraud users about whether an email was delivered.)

Again this becomes more complicated if this is done via cooperating email client vendors, but I don't think that changes the dynamic in a major way.

>> 5. Authentication. Again only done in the loading of the account, and depending on whether it's part of a larger software rental package, may be being done anyway.
>
> If you only do authentication at account load time, then you run into problems with spammers masquerading as other users. Return-address forgery is already a problem... it'll get worse as real money gets attached to it.

Encrypted passwords, etc. Clearly there would be challenges in making everything secure. MS will have already done that as part of its subscription services but some kind of consortium would have to do it from scratch.

It would be interesting to try to quantify that in some way. Any ideas?

>> 6. Retrieval costs -- outside audits. I think this probably goes away too for the reasons above but you'd have to spell it out more for me to be sure.
>
> No, outside audits don't go away. Anyone who handles large amounts of money (and this would be a large amount of money, even at only a penny per email) gets their accounting practices scrutinized. I'll grant that MS is already under such scrutiny, but being a broker as well as a vendor adds a whole other dimension to the mess.

OK, right, the central organization, whether MS or a consortium, would have to handle audits.

For MS, auditing this would probably be folded into overall audits in such a way that it wouldn't be a major factor. It could be more of one for an independent consortium. How do we quantify this?

Gary

From tim.one at comcast.net Wed Dec 18 23:56:34 2002
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 18 23:58:52 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To:
Message-ID:

MS-watchers should note that adaptive content-based spam filtering has been a highlight of the very recent ad campaign (reported to cost $300 million -- nearly twice Python's ad budget for a whole year) for MSN 8. A broad patent already covers it, and more are on the way:

http://tinyurl.com/3o6m

While it's unclear exactly what MS does, they've funded a ton of research on "support vector machines" (a bad name for a cool technique); the patent linked to above goes into that in some detail.
content-is-as-content-does-ly y'rs - tim

From rob at hooft.net Thu Dec 19 09:54:00 2002
From: rob at hooft.net (Rob Hooft)
Date: Thu Dec 19 03:54:20 2002
Subject: [Spambayes] Two Stage Plan
References: <20021218223323.A4E362DED0@cashew.wolfskeep.com>
Message-ID: <3E018928.3060505@hooft.net>

Gary,

How would you deal with mailing lists? Are they money factories? Are they extremely expensive to host? Or does it cost a tiny $11.50 to send an E-mail to the python mailing list? How about mail-news gateways? intra-company E-mail? Digests?

I can see a whole new type of fraud: anything that costs money on one side can be abused to earn money on the other side. Imagine people subscribing /dev/null addresses to high-volume mailing lists just to receive tons of money....

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From mhammond at skippinet.com.au Thu Dec 19 22:41:58 2002
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Dec 19 06:42:11 2002
Subject: [Spambayes] an alternative use of filters
In-Reply-To:
Message-ID: <005001c2a753$a42e7f30$530f8490@eden>

[Tim]
> While experience varies across test sets and care in training, in my experience Unsures are, over time, about half spam and half ham. A curious and semi-encouraging thing is that they're overwhelmingly msgs *I* can't judge at a glance either, and sometimes it's so hard to tell I just throw the msg away as unintelligible.

While we are dropping anecdotes, my experience is similar - except I find that I have more false negatives than false positives in the unsure range (very very few false-anythings outside our standard unsure range). All false positives are very spammy ham. IIRC, this also reflects the common test results.

I'm starting to get interested in the life-cycles of our corpora, as I am starting to get "annoyed" at these false-anythings. I believe simply that my tolerance level is falling (the better we get at filtering spam, the more offensive both uncaught spam and missed ham become). However, it *is* possible that as my ham:spam training ratios change, the effectiveness of the filter also changes subtly. As the Outlook system keeps all spam, and as I naturally delete a bit of everything *except* this spam, my spam:ham ratio slowly but continually increases.

When I get a round tuit, I would like to take some of the existing standard test code, and twist it into generating some sort of "expiry" based statistics - not just expiring unused words, but possibly expiring entire messages (and possibly never expiring "unsure" messages, etc). I'm starting to think this is the next natural progression of the Outlook client - working out how things go once we will have forgotten we even installed the filter, and we have 5 years of spam competing against 1/10th of the ham should we need to retrain.

Excluding-the-stand-alone-DLL-version-of-the-filter-which-is-getting-oh-so-close-ly,

Mark.

From grobinson at transpose.com Thu Dec 19 09:07:42 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Thu Dec 19 09:07:49 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: <3E018928.3060505@hooft.net>
Message-ID:

> How would you deal with mailing lists? Are they money factories? Are they extremely expensive to host? Or does it cost a tiny $11.50 to send an E-mail to the python mailing list? How about mail-news gateways? intra-company E-mail? Digests?

That's covered in the article that seems to be the seminal one in the field, http://www.research.ibm.com/journal/sj/414/forum.pdf.

You're right, it IS a complication -- one I end up regarding as a bit more onerous than the cost-based objections, which in the end I think can be covered with a very small annual subscription fee. (Alex, I'm happy to continue discussing it if you still think the costs add up to another conclusion... I don't KNOW an answer, I'm just reporting my impression based on the discussion to date.)

Though a complication, this mailing list problem, too, is handleable in the context of the idea. Actually I have some thinking on it that goes a bit beyond the article...

Today I'm extremely busy and won't try to cover the mailing list point here now, although I may say something about it when I have more time. I really MUST put my time elsewhere for today.

But really anyone interested in this should see the article mentioned above. It may be found at http://spamland.org/jsp/Wiki?Ideas and I hope the community adds more such resources over time.

--Gary

--
Help your email get through while making life harder for spammers: use http://wecanstopspam.org in your sig.

Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.transpose.com
http://radio.weblogs.com/0101454

From vanhorn at whidbey.com Thu Dec 19 11:56:02 2002
From: vanhorn at whidbey.com (G. Armour Van Horn)
Date: Thu Dec 19 14:56:06 2002
Subject: [Spambayes] Two Stage Plan
References:
Message-ID: <3E022452.A0660429@whidbey.com>

And bless their tiny little hearts. I've picked up several billable hours this month from Outlook users who suddenly aren't allowed to open the PDF files that have been sent to them.

It's not that Microsoft can't find a clue, there are lots of clues in the Redmond area, but either MS is fundamentally opposed to using them or can't recognize them. Microsoft shouldn't be allowed to handle mail at all, based on their record.

Van

Tim Peters wrote:

> MS-watchers should note that adaptive content-based spam filtering has been a highlight of the very recent ad campaign (reported to cost $300 million -- nearly twice Python's ad budget for a whole year) for MSN 8. A broad patent already covers it, and more are on the way:
>
> http://tinyurl.com/3o6m
>
> While it's unclear exactly what MS does, they've funded a ton of research on "support vector machines" (a bad name for a cool technique); the patent linked to above goes into that in some detail.
>
> content-is-as-content-does-ly y'rs - tim
>
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org
> http://mail.python.org/mailman/listinfo/spambayes

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------

From vanhorn at whidbey.com Thu Dec 19 11:49:05 2002
From: vanhorn at whidbey.com (G. Armour Van Horn)
Date: Thu Dec 19 14:59:38 2002
Subject: [Spambayes] Two Stage Plan
References:
Message-ID: <3E0222B1.F7A6C34E@whidbey.com>

You've never had any exposure to datacenter cost accounting, have you?
You're willing to turn over the control of e-mail to Microshaft, so you're probably not expecting to pay Oracle license fees for the database, but SQLServer licenses for systems with huge arrays of processors come with heart-stopping price tags as well. Then you pay for the computers, the maintenance agreements, the DBAs to keep it running. And for this scale of software, the database license is not a one-time expense, it's annual.

Also note that at the speed this will have to operate you can't use "secure online storage." You have to use silicon storage, multi-gigabyte arrays of virtual disk in front of the RAID arrays. I seriously doubt that anyone knows how to build the system you propose yet, although the InfiniBand proponents may be thinking in the right ballpark. (Not delivering, mind you, but at least thinking about it.) And behind the memory-resident part of the system you still are going to need the fastest RAID arrays. (I also don't think that you can store 5 or 8 or 50 bytes of data in a database with an incremental 5, 8, or 50 bytes of storage used.)

Given the latency requirements and the volume, I would expect the backup system to be an interesting exercise. Surely you aren't planning on a single global system to authorize all e-mails without backup, are you?

Even if your cost estimates weren't several orders of magnitude too low, the very concept of turning e-mail over to any single system (particularly the generally-malevolent convicted monopolist in Redmond) is a fundamental evil. Monocultures are destructive to the environment, and this happens to be an environment I care about.

Van

From grobinson at transpose.com Thu Dec 19 15:08:26 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Thu Dec 19 15:08:26 2002
Subject: [Spambayes] Two Stage Plan
In-Reply-To: <3E0222B1.F7A6C34E@whidbey.com>
Message-ID:

> You've never had any exposure to datacenter cost accounting, have you?

I've had two periods in my life of personally funding, designing, and running online consumer systems, and the current one has a paid-up Oracle license.

Your argument seems to depend on the assumption that you have knowledge of such things and I don't.

'Nuf said for now. More when I have time.

Gary

From Paul.Moore at atosorigin.com Fri Dec 20 09:01:30 2002
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Fri Dec 20 04:02:31 2002
Subject: [Spambayes] Outlook client - wrongly scored message
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619993@UKDCX001.uk.int.atosorigin.com>

I just got a definite occurrence of something I've suspected has happened before.

I started Outlook this morning, and the normal startup ritual occurred - 75 unread messages, the Spambayes addin went through them scoring and filtering them, and then my rules filed them all in various folders (I know the rules wizard isn't guaranteed to work at any particular time in relation to the addin, but it works, so why worry?)

But at the end, I had a clear spam (Korean characters, as far as I could tell) in my inbox, scored as 0%. I got the addin to show clues, and the clues clearly come out as 100% spam. Refiltering the message scored it as 100%.

I think this has come up before, and the conclusion was that it's some sort of subtle timing issue, with a score getting somehow attached to the wrong message. What I'm seeing certainly supports that theory.

Paul.

From esj at harvee.billerica.ma.us Fri Dec 20 11:44:57 2002
From: esj at harvee.billerica.ma.us (Eric S. Johansson)
Date: Fri Dec 20 11:45:02 2002
Subject: [Spambayes] an alternative use of filters
In-Reply-To:
References:
Message-ID: <3E034909.9000803@harvee.billerica.ma.us>

Tim, thank you for your reply. I hadn't realized it was to the general mailing list until I got the mailman notice. I've signed up now so I can start my "personal archive", as with so many of the other mailing lists I'm on. ;-)

Tim Peters wrote:
> Three-way classification is the intended use of the spambayes classifier.
> A msg gets a score from 0.0 (ham) to 1.0 (spam) and there are two configurable cutoffs: msgs with a score below ham_cutoff are called Ham, above spam_cutoff Spam, and any score between those Unsure.

Good to know. It looks like I'll be experimenting with a variety of filter engines to see how they work.

> While experience varies across test sets and care in training, in my experience Unsures are, over time, about half spam and half ham. A curious and semi-encouraging thing is that they're overwhelmingly msgs *I* can't judge at a glance either, and sometimes it's so hard to tell I just throw the msg away as unintelligible. I call that "semi-"encouraging because, in conjunction with camram, I don't believe I'd want Unsures stopped from reaching me. For example, a common class of Unsures is commercial HTML email from companies I do business with; e.g., last week I got an Unsure that was an auto-generated order receipt for an online order of a software program. I wanted to get the receipt, but the email was very spammish, full of ads and links for follow-on offers, and other marketing collateral. I doubt reply email would be seen by a human, so a postage-due scheme probably would have dropped it into the bit bucket on both ends.

I feel like I'm living in a bipolar world when it comes to choosing a solution for dealing with mystery meat. Non-geek computer users (ngcu) generally love the idea of canning mystery meat. The most common attitude is "if it's important, somebody will call me on the phone". When I tell them there is a jail they can go rummaging through if something important goes astray, objections, for the most part, fall by the wayside.

On the other hand, dealing with geek computer users (gcu) brings a different set of challenges. Most ngcu and enterprise organizations want absolutely no knobs or buttons to confuse the user. In contrast, gcu seem to want complete control over behavior. I must plead guilty to that as well and have had a few vocal ngcu "educate" me.

Now there is nothing stopping the camram filter from passing mystery meat through while at the same time sending out a challenge message. You see, I have a cunning plan. The act of sending out a challenge message creates more information that can be used in separating out ham/mystery meat/spam. Whether or not the challenge message is deliverable, that information can be fed back into the classifier for further refinement.

> The outstanding feature of the kind of classifier we're using is that it adjusts to an individual's notions of what constitutes ham and spam, so this kind of mistake is less frequent here than under other systems (for example, the order receipt mentioned above wasn't called spam, because the system knew I ordered other software of similar nature in the past; but the email *would* have been called spam if most other people had received it). But the error rates are, while very low for individual use, still non-zero, and I expect they always will be.

Classifiers for spam filtering will work well for people like us because we're willing to take the effort to train the system. It's sort of like speech recognition. Only one user in 5 succeeds, and success seems to be a function of the person's ability to consistently train the recognition engine. When I describe training processes to non-geek computer users, the reaction is not kind. It's best described as "seems too much like work". It will probably get easier if there are "delete as spam, delete as not spam" buttons on the user interface but I'm not holding my breath.

As for error rates, if you want to have some fun, plot out error rates vs. volume of traffic. A small ISP has mail volume on the order of 300,000 to 500,000 messages a week. A 0.1% error rate is 300 to 500 messages MIA. How many will trigger a customer support call? What happens when your volume reaches 850,000 messages per day? It gets really interesting.
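The arithmetic above, together with the three-way split Tim describes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes a flat per-message error rate that does not change with volume, the helper names `classify` and `expected_errors` are hypothetical and not part of spambayes or camram, and treating a score exactly equal to a cutoff as Unsure is an assumption of the sketch rather than documented classifier behavior.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way split of a score in [0.0, 1.0] into Ham / Unsure / Spam.

    Assumption of this sketch: a score exactly equal to a cutoff counts
    as Unsure. The real classifier may treat the boundaries differently.
    """
    if not 0.0 <= ham_cutoff <= spam_cutoff <= 1.0:
        raise ValueError("need 0.0 <= ham_cutoff <= spam_cutoff <= 1.0")
    if score < ham_cutoff:
        return "Ham"
    if score > spam_cutoff:
        return "Spam"
    return "Unsure"


def expected_errors(messages, error_rate):
    """Expected number of misclassified messages, rounded to whole messages."""
    return round(messages * error_rate)


if __name__ == "__main__":
    ERROR_RATE = 0.001  # the 0.1% figure from the text
    # Volumes from the text: a small ISP's week, and a busier site's day.
    for label, volume in [("300,000 per week", 300000),
                          ("500,000 per week", 500000),
                          ("850,000 per day", 850000)]:
        print(label, "->", expected_errors(volume, ERROR_RATE), "errors")
    # Illustrative cutoffs only; these are not claimed to be defaults.
    print(classify(0.5, ham_cutoff=0.05, spam_cutoff=0.95))
```

At the same 0.1% rate, 850,000 messages a day works out to 850 misfiled messages every day, which is more than the small ISP sees in a week.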
> So, if you try this, I suggest setting ham_cutoff very low (below 0.05), and spam_cutoff very high (over 0.95). The median ham score is essentially 0, and the median spam score is essentially 1.0, so, while aggressive, this isn't quite as extreme as it may sound at first. The problem I expect remains, though: solicited commercial email, and especially the first few times a user gets one from a given vendor, will end up Unsure, and there may not be anyone on the other end to respond to a postage nag.

I should probably run these classifiers on my mail stream and plot message scores vs. frequency. I would like to see if there are any interesting patterns we can use.

As for the not-responding problem, yup. It's the only way camram gets false positives. What I am planning on doing is, with users' permission, harvesting the addresses for which camram 1) successfully sent a postage-due notice and 2) got no response after 24 hours. I would then try to get these folks to generate stamps so that they can bypass the whole filter/classifier problem. I don't expect much success, but it's not my mail that's getting trapped.

thanks
---eric

PS how do you handle meatloaf?

From piersh at friskit.com Fri Dec 20 20:46:02 2002
From: piersh at friskit.com (Piers Haken)
Date: Fri Dec 20 23:31:28 2002
Subject: [Spambayes] Error with outlook pluing
Message-ID: <9891913C5BFE87429D71E37F08210CB92C7451@zeus.sfhq.friskit.com>

I just updated my CVS sources and now I'm getting the following, along with a dialog telling me that the plugin is disabled, when I start Outlook.

Anyone know what's up?

Piers.
-------- Outlook Spam Addin module loading SpamAddin - Connecting to Outlook Loaded bayes database from 'C:\Python22\spam\spambayes\Outlook2000\default_bayes_database.pck' Loaded message database from 'C:\Python22\spam\spambayes\Outlook2000\default_message_database.pck' Bayes database initialized with 2134 spam and 3370 good messages Error finding the MAPI folders for a folder switch event Traceback (most recent call last): File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 532, in OnFolderSwitch look_folder =3D self.manager.message_store.GetFolder(look_id) File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 217, in GetFolder table =3D folder.GetContentsTable(0) com_error: (-2147467259, 'Unspecified error', None, None) Warning: failed to create the Outlook user-property in folder 'Inbox' (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', 'A custom field with this name but a different data type already exists. Enter a different name.', None, 0, -2147352567), None) This is probably because the code has recently been changed, but it will have no effect on the filtering or scoring. AntiSpam: Watching for new messages in folder Inbox AntiSpam: Watching for new messages in folder Inbox Error installing folder hooks. 
Traceback (most recent call last):
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 636, in FiltersChanged
    self.UpdateFolderHooks()
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 662, in UpdateFolderHooks
    SpamFolderItemsEvent)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 671, in _HookFolderEvents
    for msgstore_folder in self.manager.message_store.GetFolderGenerator(
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 200, in GetFolderGenerator
    table = folder.GetContentsTable(0)
com_error: (-2147467259, 'Unspecified error', None, None)
Spam filtering is disabled - ignoring new message
Spam filtering is disabled - ignoring new message
Spam filtering is disabled - ignoring new message

Piers.

From rbyrnes at ozemail.com.au Mon Dec 23 15:17:14 2002
From: rbyrnes at ozemail.com.au (Rob B)
Date: Sun Dec 22 23:18:48 2002
Subject: [Spambayes] pop3proxy error
Message-ID: <5.1.1.6.2.20021223150301.0384a5e0@127.0.0.1>

I'm running pop3proxy on WinNT, under Python 2.2.3 (from Python.org)

I tend to get large volumes of mail over the weekends - in this case about 370 or so.
When trying to train pop3proxy through the web-interface on these new messages, I get: error: uncaptured python exception, closing channel <__main__.UserInterface connected at 0x1489590> (exceptions.MemoryError: [C:\Python22\lib\asyncore.py|poll|99] [C:\Python22\lib\asyncore.py|handle_read_event|396] [C:\Python22\lib\asynchat.py|handle_read|130] [C:\antispam\spambayes\pop3proxy.py|found_terminator|805] [C:\antispam\spambayes\pop3proxy.py|onRequest|831] [C:\antispam\spambayes\pop3proxy.py|onReview|1144] [C:\Program Files\rob\download\spambayes\Corpus.py|getSubstance|318] [C:\Program Files\rob\download\spambayes\Corpus.py|__getattr__|268] [C:\Program Files\rob\download\spambayes\FileCorpus.py|load|220]) I see that this error has happened back in October (and was claimed to be fixed), my version was taken from CVS only today (22nd December) cheers, Rob -- "There is very little further to go with a girl who has brought you coffee." - John Updike This is random quote 1069 of a collection of 1269 Distance from the centre of the brewing universe: [15200.8 km (8207.8 mi), 262.8 deg](Apparent) Rennerian Public Key fingerprint = 6219 33BD A37B 368D 29F5 19FB 945D C4D7 1F66 D9C5 From tim at fourstonesExpressions.com Sun Dec 22 22:28:54 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun Dec 22 23:29:39 2002 Subject: [Spambayes] pop3proxy error In-Reply-To: <5.1.1.6.2.20021223150301.0384a5e0@127.0.0.1> Message-ID: Rob, you've undoubtedly hit a volume that is beyond what anybody else has experienced with the pop3proxy. The author of the proxy is Richie Hindle, and I'm sure he'll have a peek when he gets a chance. Thanks for reporting this error. In the meantime, it isn't necessary to train each message that comes in. Go into the proxy cache subdirectories and manually delete all but a random sampling of messages. Then refresh the 'train' page in the proxy. 
There'll be a much smaller set of messages to train on, and you shouldn't have a memory problem. - TimS 12/22/2002 10:17:14 PM, Rob B wrote: >I'm running pop3proxy on WinNT, under Python 2.2.3 (from Python.org) > >I tend to get large volumes of mail over the weekends - in this case about >370 or so. When trying to train pop3proxy through the web- interface on >these new messages, I get: > >error: uncaptured python exception, closing channel <__main__.UserInterface >connected at 0x1489590> (exceptions.MemoryError: >[C:\Python22\lib\asyncore.py|poll|99] >[C:\Python22\lib\asyncore.py|handle_read_event|396] >[C:\Python22\lib\asynchat.py|handle_read|130] >[C:\antispam\spambayes\pop3proxy.py|found_terminator|805] >[C:\antispam\spambayes\pop3proxy.py|onRequest|831] >[C:\antispam\spambayes\pop3proxy.py|onReview|1144] [C:\Program >Files\rob\download\spambayes\Corpus.py|getSubstance|318] [C: \Program >Files\rob\download\spambayes\Corpus.py|__getattr__|268] [C:\Program >Files\rob\download\spambayes\FileCorpus.py|load|220]) > >I see that this error has happened back in October (and was claimed to be >fixed), my version was taken from CVS only today (22nd December) > >cheers, >Rob > >-- >"There is very little further to go with a girl who has brought you coffee." > - John Updike > >This is random quote 1069 of a collection of 1269 > >Distance from the centre of the brewing universe: >[15200.8 km (8207.8 mi), 262.8 deg](Apparent) Rennerian > >Public Key fingerprint = 6219 33BD A37B 368D 29F5 19FB 945D C4D7 1F66 D9C5 > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim at diligence.com Mon Dec 23 02:10:46 2002 From: tim at diligence.com (Tim Uckun) Date: Mon Dec 23 10:25:07 2002 Subject: [Spambayes] Spambayes Training. 
Message-ID: <4.2.0.58.20021223020505.00a59120@mail.diligence.com> Hello, you want some unsolicited advice? Considering that your spam filter will always encounter all incoming mail you could perhaps build in the training to your filter. I imagine a scenario like this. I get some spam your filter did not catch. I then forward that email to .. a) Myself (more on this) b) DetectedSpam@mydomain.com If I forward the email to myself I add something to the first line like --HeySpamBayesThisIsSpam-- If your filter catches this on the first line of an email it then learns from it. Similar things can be done with ham. This way you can even start with a completely blank database and then train your filter perfectly. The beauty is that you don't need some special mail client and it makes no difference whether spambayes is running as a proxy or through procmail. ---------------------------------------------- Tim Uckun Mobile Intelligence Unit. ---------------------------------------------- "There are some who call me TIM?" ---------------------------------------------- From wsy at merl.com Mon Dec 23 11:31:19 2002 From: wsy at merl.com (Bill Yerazunis) Date: Mon Dec 23 11:31:59 2002 Subject: [Spambayes] Spambayes Training. In-Reply-To: <4.2.0.58.20021223020505.00a59120@mail.diligence.com> (message from Tim Uckun on Mon, 23 Dec 2002 02:10:46 -0800) References: <4.2.0.58.20021223020505.00a59120@mail.diligence.com> Message-ID: <200212231631.gBNGVJf06202@localhost.localdomain> From: Tim Uckun Hello, you want some unsolicited advice? Considering that your spam filter will always encounter all incoming mail you could perhaps build in the training to your filter. I imagine a scenario like this. I get some spam your filter did not catch. I then forward that email to .. a) Myself (more on this) b) DetectedSpam@mydomain.com This is what I do on CRM114... you mail yourself the spam, with a command line (and password !) to learn the error in the correct format. 
It really saves time, and it means I don't have to touch the user agent- if I can read mail, I can program my filter. I also have commands to add things to the whitelist and blacklist, but I rarely use them now. -Bill Y. From richie at entrian.com Mon Dec 23 17:36:10 2002 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 23 12:36:20 2002 Subject: [Spambayes] Spambayes Training. In-Reply-To: <4.2.0.58.20021223020505.00a59120@mail.diligence.com> References: <4.2.0.58.20021223020505.00a59120@mail.diligence.com> Message-ID: Hi Tim, > Hello, you want some unsolicited advice? Always! 8-) > [snip forward-to-self idea] This is a good idea, and one that's been talked about before on this list - you can find the discussions by asking Google about , but they're all mixed in with a lot of other discussions. Tim Stone built a prototype based on this idea. Summarising: just like we proxy POP3 for incoming mail, the idea is to proxy SMTP for outgoing mail, and have special addresses for Ham and Spam. Your idea is simpler, in that we don't need to implement an SMTP proxy, but also less secure - if I know you're running spambayes, I can spam you with messages containing "--HeySpamBayesThisIsHam--" and fool the software into training on my spams as ham. It also means that you'll receive your own training emails, which means setting up another filter, and would be a pain for people on slow dialup links - the SMTP proxy could process the messages without forwarding them on. There's another problem with forwarding the mail - it destroys header information. We don't (currently) do a lot with the headers, but we do look at them, and losing information from them would make the system less accurate. 
Some email clients have a "Forward Verbatim" or "Forward as Attachment" command which could be used to work around this, but you're no longer in the realm of "you don't need some special mail client" - some mail clients won't get the full benefit, some may package attached messages in different ways, and so on. Bill, how does CRM114 cope with security? It uses a password which you need to keep secret? And does it have a way of coping with the header loss problem? -- Richie Hindle richie@entrian.com From richie at entrian.com Mon Dec 23 17:38:21 2002 From: richie at entrian.com (Richie Hindle) Date: Mon Dec 23 12:38:29 2002 Subject: [Spambayes] pop3proxy error In-Reply-To: References: <5.1.1.6.2.20021223150301.0384a5e0@127.0.0.1> Message-ID: [Rob] > exceptions.MemoryError [Tim S] > Rob, you've undoubtedly hit a volume that is beyond what anybody > else has experienced with the pop3proxy. The author of the proxy is > Richie Hindle, That's me! > and I'm sure he'll have a peek when he gets a chance. I will. I suspect the code is (unnecessarily) reading all the pending messages into memory at once, rather than processing them one at a time. I'll have a look and see what I can do - thanks for the bug report. -- Richie Hindle richie@entrian.com From wsy at merl.com Mon Dec 23 14:06:05 2002 From: wsy at merl.com (Bill Yerazunis) Date: Mon Dec 23 14:06:33 2002 Subject: [Spambayes] Spambayes Training. In-Reply-To: (message from Richie Hindle on Mon, 23 Dec 2002 17:36:10 +0000) References: <4.2.0.58.20021223020505.00a59120@mail.diligence.com> Message-ID: <200212231906.gBNJ65606863@localhost.localdomain> From: Richie Hindle Your idea is simpler, in that we don't need to implement an SMTP proxy, but also less secure - if I know you're running spambayes, I can spam you with messages containing "--HeySpamBayesThisIsHam--" and fool the software into training on my spams as ham. ... which is why CRM114 requires a password as well as the command. 
   It also means that you'll receive your own training emails, which means
   setting up another filter, and would be a pain for people on slow dialup
   links - the SMTP proxy could process the messages without forwarding them
   on.

Actually, I put the receive _in_, as a confirmation to myself that I actually had executed the training. Once your filter program is in control, it can decide whether to save, junk, or pass on confirmations of training operations. I personally like the confirmations.

   There's another problem with forwarding the mail - it destroys header
   information. We don't (currently) do a lot with the headers, but we do
   look at them, and losing information from them would make the system less
   accurate. Some email clients have a "Forward Verbatim" or "Forward as
   Attachment" command which could be used to work around this, but you're
   no longer in the realm of "you don't need some special mail client" -
   some mail clients won't get the full benefit, some may package attached
   messages in different ways, and so on.

   Bill, how does CRM114 cope with security? It uses a password which you
   need to keep secret? And does it have a way of coping with the header
   loss problem?

Yes, a password, which is (sadly) in plain text in the message, but since you're only mailing them to yourself on your local host (or at worst, back out to your ISP's mailserver and then right back to yourself), it's relatively secure.

As to header loss, I don't treat headers any differently than the body text, so it's not a big deal. (Also, I use emacs RMAIL, which is one of those mailreader clients that can easily toggle headers on and off in the forward.)

-Bill Y.
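
The password-protected forward-to-self training discussed in this thread might look something like the sketch below. It is only an illustration: the `--train <password> spam|ham--` first-line format and the function name `parse_training_command` are hypothetical, and neither spambayes nor CRM114 is claimed to use this syntax.

```python
import hmac

# Hypothetical command words; True means "train as spam", False "train as ham".
COMMANDS = {"spam": True, "ham": False}


def parse_training_command(body, password):
    """Return True (spam), False (ham), or None (no valid training command).

    Only the first non-blank line of the body is inspected. It must have the
    assumed form:  --train <password> spam|ham--
    """
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if not (line.startswith("--train ") and line.endswith("--")):
            return None
        parts = line[2:-2].split()
        if len(parts) != 3:
            return None
        _, supplied, kind = parts
        # Constant-time comparison, so the check does not leak the password
        # through timing differences.
        if not hmac.compare_digest(supplied.encode(), password.encode()):
            return None
        return COMMANDS.get(kind.lower())
    return None
```

A filter could call this on each incoming message before scoring it; a message without the right password is simply treated as ordinary mail, which addresses the "--HeySpamBayesThisIsHam--" poisoning concern raised earlier in the thread.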
From n.bergboer at cs.unimaas.nl Sat Dec 28 19:13:02 2002 From: n.bergboer at cs.unimaas.nl (Niek Bergboer) Date: Sat Dec 28 13:57:52 2002 Subject: [Spambayes] Mail classifiers, training sets and technical docs Message-ID: <20021228181302.GA14635@cs0050jac6s.unimaas.nl> Hello, Like many others, I suffer from Spam, and while surfing the web I came across your Bayesian mail classifier. However, since I am also doing my PhD research in the field of machine learning, I am especially interested. Specifically, I am using machine learning techniques (including classifiers) on images, but the application to email seems interesting as well. First off, I tried to find some in-depth technical documentation about your system, but I was unable to find it. Could you direct me to any literature references or papers on which the work is based? Being involved in machine learning, there are of course a number of "standard" questions that immediately pop up: Does the SpamBayes framework use any training before it gets shipped to the user? That is, does the user start out with a completely "dumb" system for which he has to provide _every_ single spam/ham example, or does the system come with a "basic training set" so that the system has some classification capabilities even before the user has specified any examples. If so, what kind of training set do you use? How large is it, and what is the dimension of your feature space? And do you plan to make a large training set available? In addition, I was wondering about the kind of classifiers that could be used. It seems to me that SpamBayes basically is a binary classifier: mail has to be classified as either ham ("1") or spam ("0") (or vice versa, if you like that better). In addition to (naive) Bayesian classifiers, there of course exist more. For example, a classifier that has been around for a while, but has only just begun to be viable (do to new training techniques) is the Support Vector Machine (SVM). 
In its basic form, this is a binary classifier (though multi-class problems can be handled as well nowadays). Theoretically, one could of course also use a very simple and crude (k-) Nearest Neighbor classifier, though one would need a large training set for this to work well. Based on which criteria was the choice for using a Bayesian classifier made? I wish you the best of luck and success with the project. Good work! TIA, Niek -- N.H. Bergboer - n.bergboer@cs.unimaas.nl University of Maastricht - +31-43-3883901 Institute of Knowledge and Agent Technology "I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question." Charles Babbage From tim at fourstonesExpressions.com Sat Dec 28 13:24:36 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Dec 28 14:25:56 2002 Subject: [Spambayes] Mail classifiers, training sets and technical docs In-Reply-To: <20021228181302.GA14635@cs0050jac6s.unimaas.nl> Message-ID: <96BAD9Z07D43OL72OM41UO93YSA8DA.3e0dfa74@riven> There are people that are much more qualified to answer your questions than I, but most everyone on this project is MIA during the holidays, so I'll give it a go. Undoubtedly someone will add to my remarks below. - TimS 12/28/2002 12:13:02 PM, Niek Bergboer wrote: >Hello, > >Like many others, I suffer from Spam, and while surfing the web I came >across your Bayesian mail classifier. However, since I am also doing my >PhD research in the field of machine learning, I am especially >interested. Specifically, I am using machine learning techniques >(including classifiers) on images, but the application to email seems >interesting as well. > >First off, I tried to find some in-depth technical documentation about >your system, but I was unable to find it. 
Could you direct me to any >literature references or papers on which the work is based? There's not much that's been developed yet. You can see some in-code commentary in classifier.py and tokenizer.py. Other than that, there are a few how-to and readme type documents in the project. That's about all at this point, unless there's some other doc that isn't checked in. This is all based on Paul Graham's spam article, which can be found by searching on those three words. > >Being involved in machine learning, there are of course a number of >"standard" questions that immediately pop up: > >Does the SpamBayes framework use any training before it gets shipped to >the user? That is, does the user start out with a completely >"dumb" system for which he has to provide _every_ single spam/ham >example, or does the system come with a "basic training set" so that the >system has some classification capabilities even before the user has >specified any examples. This is under discussion, but the general feeling is that there's no universally acceptable definition of spam, that works for everyone. One man's spam is another man's highly desirable mail... > >If so, what kind of training set do you use? How large is it, and what >is the dimension of your feature space? And do you plan to make a large >training set available? There are training corpora that are used basically for testing the effects of changes made in the tokenization and classification algorithms. These are not generally available. However, it's not difficult to accumulate your own set of spam and ham for training . > >In addition, I was wondering about the kind of classifiers that could >be used. It seems to me that SpamBayes basically is a binary classifier: >mail has to be classified as either ham ("1") or spam ("0") (or vice >versa, if you like that better). This is not quite true. 
Incoming mail is given a spam probability based upon your own training of the database and the tokens that are in the incoming mail. There are default probability thresholds for spam and ham, which you can configure to be tighter or looser as you wish. > In addition to (naive) Bayesian >classifiers, there of course exist more. For example, a classifier that >has been around for a while, but has only just begun to be viable (do to >new training techniques) is the Support Vector Machine (SVM). In its >basic form, this is a binary classifier (though multi-class problems can >be handled as well nowadays). Theoretically, one could of course also >use a very simple and crude (k-) Nearest Neighbor classifier, though one >would need a large training set for this to work well. No comment here, due to my limited qual... ;) > >Based on which criteria was the choice for using a Bayesian classifier >made? This started out as a research project to test the validity of Paul's assertions. As such, it is highly successful. Paul proposed a Bayesian classification. His rationale for that choice was not a subject of this research. > >I wish you the best of luck and success with the project. Good work! Thanks! I have Spambayes running on my Windoze system, with a standard off the shelf mailer, and it works beautifully. I'm loving it... now if I could just get my employer to use this technology... Thanks for your questions, please drop in often! - TimS > >TIA, > >Niek > >-- >N.H. Bergboer - n.bergboer@cs.unimaas.nl >University of Maastricht - +31-43-3883901 >Institute of Knowledge and Agent Technology > >"I have been asked, 'Pray, Mr. Babbage, if you put into the machine > wrong figures, will the right answers come out?' I am not able to > rightly apprehend the kind of confusion of ideas that could provoke > such a question." 
Charles Babbage > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim.one at comcast.net Sat Dec 28 14:42:50 2002 From: tim.one at comcast.net (Tim Peters) Date: Sat Dec 28 14:43:23 2002 Subject: [Spambayes] Mail classifiers, training sets and technical docs In-Reply-To: <20021228181302.GA14635@cs0050jac6s.unimaas.nl> Message-ID: [Niek Bergboer] > Like many others, I suffer from Spam, and while surfing the web I came > across your Bayesian mail classifier. However, since I am also doing my > PhD research in the field of machine learning, I am especially > interested. Specifically, I am using machine learning techniques > (including classifiers) on images, but the application to email seems > interesting as well. > > First off, I tried to find some in-depth technical documentation about > your system, but I was unable to find it. Could you direct me to any > literature references or papers on which the work is based? There are extensive comments in the source code, and some articles about this project will appear in Linux Journal "soon" (anyone know exactly when?). In the meantime, there are links to follow at http://spambayes.sourceforge.net/background.html The links to Paul Graham's and Gary Robinson's articles are essential reading. > Being involved in machine learning, there are of course a number of > "standard" questions that immediately pop up: > > Does the SpamBayes framework use any training before it gets shipped to > the user? This project hasn't had an alpha release yet, and initial training remains a bit of a mystery. Note that this approach isn't trying to "find spam" -- it's trying to separate spam from ham, based on samples of both. 
The great strength of the system is that what constitutes ham varies by individual, and it isn't generally possible to guess that (e.g., I've got no use for frequent-flyer solicitations, but my boss does -- they're all spam to me). > That is, does the user start out with a completely "dumb" system for > which he has to provide _every_ single spam/ham example, Doubt it. Any way at all of training has appeared to work very well, be that feeding it every email you get, or just feeding it mistakes, or even letting it train on its own decisions, correcting only the most egregious errors. That's for an individual's email. Attempts to train a single classifier for use by more than one user don't work as well, unless the user group has a lot in common. For example, I've gotten superb results on training a classifier for comp.lang.python postings, with error rates (of both kinds) so close to 0 that the difference can't be measured reliably across my 34,000 c.l.py test messages (20K ham and 14K spam). > or does the system come with a "basic training set" so that the > system has some classification capabilities even before the user has > specified any examples. The system doesn't come with anything yet. If you install it and try it, I predict you'll see good results within 24 hours of starting, and excellent results within a week (on your own email, and provided you take (just) a little care in training). > If so, what kind of training set do you use? Most people use their personal email. As above, I started the project with a random sampling of newsgroup postings, and a spam archive available over the web. > How large is it, People have tried training sets ranging from 1 message to over 50,000. > and what is the dimension of your feature space? Essentially unbounded. The raw text of the message body is broken by whitespace, and each resulting piece is "a feature". Many other kinds of tokens are generated for header lines, embedded URLs, etc. 
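
The whitespace tokenization described above can be sketched roughly as follows. This is a simplified illustration and not the project's tokenizer.py: the function name `tokenize`, the `url:` and header-name prefixes, and the lowercasing are all assumptions made for the example.

```python
import re

# Assumed, simplistic URL pattern: captures only the host part.
URL_RE = re.compile(r"https?://([^\s/]+)", re.IGNORECASE)


def tokenize(headers, body):
    """Yield feature tokens for one message.

    headers is a list of (name, value) pairs; body is the message text.
    """
    # Each whitespace-separated piece of the body is a feature.
    for word in body.split():
        yield word.lower()
    # Extra tokens for embedded URLs (hypothetical "url:" prefix).
    for host in URL_RE.findall(body):
        yield "url:" + host.lower()
    # Extra tokens for header lines, tagged with the header name.
    for name, value in headers:
        for word in value.split():
            yield "%s:%s" % (name.lower(), word.lower())
```

Because every distinct piece of text becomes its own feature, the feature space grows with the mail seen, which is why it is described as essentially unbounded.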
> And do you plan to make a large training set available?

There are many public spam archives available, so no on that count. It works better if people use their own spam anyway (for example, that's the only way to pick up header clues unique to their ISP). We can't supply a large training set of ham because what constitutes ham is specific to the user. It "would be nice" to seed a database with some set of msgs everyone would agree are ham, but that's surprisingly difficult to arrive at.

> In addition, I was wondering about the kind of classifiers that could
> be used. It seems to me that SpamBayes basically is a binary classifier:
> mail has to be classified as either ham ("1") or spam ("0") (or vice
> versa, if you like that better).

We generate a score from 0.0 (ham) to 1.0 (spam). One thing we've found is that it's important to have an Unsure category too: some messages are highly ambiguous, scoring high for both haminess and spaminess, or scoring low for both. This system is lost then, and all evidence to date suggests that all other systems are lost on such msgs too (indeed, we've often argued about such examples on this mailing list, sometimes doing a ridiculous amount of research to figure out whether a msg in question was ham or spam). The lovely thing is that this system is very good about *knowing* when it's lost, and such msgs really need human judgment.

> In addition to (naive) Bayesian classifiers,

Note that this system really has nothing in common with Bayesian classifiers. It got that name from Paul Graham's original essay, and if you follow the links you'll figure out why it got that name, why that name was dubious, and why this variation moved ever farther away from it.

> there of course exist more. For example, a classifier that has been
> around for a while, but has only just begun to be viable (due to
> new training techniques) is the Support Vector Machine (SVM). In its
> basic form, this is a binary classifier (though multi-class problems
> can be handled as well nowadays). Theoretically, one could of course
> also use a very simple and crude (k-) Nearest Neighbor classifier,
> though one would need a large training set for this to work well.

AFAIK, nobody on this mailing list has pursued those alternatives.

> Based on which criteria was the choice for using a Bayesian classifier
> made?

Purely on results (although, again, this isn't a Bayesian classifier). The results on my initial c.l.py test data, and on my own email, are so good that I see no way they can be improved. It does a better job than I can do, and in the very rare cases it makes a mistake, I haven't been able to conceive of a way that any system could do better: such msgs seem intractable. For example, one of the three false positives (out of 20K ham) in my c.l.py test is a quote of an entire Nigerian scam spam, prefaced by a one-line comment essentially saying "hey, this is spam". That it was a comment added by a real person makes the msg formally ham, and I realize that because I've got "real world knowledge" about what the words mean. Statistically, though, it's indistinguishable from Nigerian scam spam. If the comment had been made by a frequent c.l.py poster, that would have been enough to knock the msg into the Unsure category. But the poster is unique in the c.l.py data, so the msg had almost no redeeming features (it *did* have a few mild ham clues in the headers, but that's all).

OTOH, the system doesn't work so well for other kinds of "many users" apps. Tech mailing lists have some kind of focus, and commercial advertising on them that isn't specific to the list topic is *always* spam. Individuals' own email contains many kinds of solicited commercial email, and if a common classifier has to be trained to accept Expedia email for me, it's going to have a hard time blocking Hotel Discount Card spam for you.
Such seemingly fine distinctions don't appear to be a problem for a one-user classifier, although the first time or two I get marketing collateral from a company I do business with, it usually scores as Unsure.

> I wish you the best of luck and success with the project. Good work!

Thanks, Niek. You're welcome to use it too, you know.

From francois.granger at free.fr Sun Dec 29 00:41:44 2002
From: francois.granger at free.fr (François Granger)
Date: Sat Dec 28 18:41:49 2002
Subject: [Spambayes] Missing 2.0 compatibility lines in storage.py
Message-ID:

Fresh install on MacOSX with the standard Python distribution:

[fbg:/volumes/OS99/spambayes] fgranger% python
Python 2.2 (#1, 07/14/02, 23:25:09)
[GCC Apple cpp-precomp 6.14] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Fresh download from CVS, I get the following errors.

[fbg:/volumes/OS99/spambayes] fgranger% python hammiebulk.py
Traceback (most recent call last):
  File "hammiebulk.py", line 53, in ?
    import hammie
  File "hammie.py", line 5, in ?
    import storage
  File "storage.py", line 57, in ?
    NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training
NameError: name 'False' is not defined
[fbg:/volumes/OS99/spambayes] fgranger% python hammie.py
Traceback (most recent call last):
  File "hammie.py", line 5, in ?
    import storage
  File "storage.py", line 57, in ?
    NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training
NameError: name 'False' is not defined
[fbg:/volumes/OS99/spambayes] fgranger%

Just added these cut-and-pasted lines at the beginning of the file:

try:
    True, False, bool
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
    def bool(val):
        return not not val

--
Recently using MacOSX.......
From francois.granger at free.fr Sun Dec 29 13:46:57 2002
From: francois.granger at free.fr (François Granger)
Date: Sun Dec 29 07:47:04 2002
Subject: [Spambayes] Missing 2.0 compatibility lines in storage.py
Message-ID:

At 00:41 +0100 on 29/12/2002, in message [Spambayes] Missing 2.0 compatibility lines in storage., François Granger wrote:

>Just added these cut-and-pasted lines at the beginning of the file:
>
>try:
>    True, False, bool
>except NameError:
>    # Maintain compatibility with Python 2.2
>    True, False = 1, 0
>    def bool(val):
>        return not not val

Same for Corpus and Hammiebulk.

--
Recently using MacOSX.......

From a.paice at ntlworld.com Mon Dec 30 02:33:23 2002
From: a.paice at ntlworld.com (alan)
Date: Sun Dec 29 23:18:57 2002
Subject: [Spambayes] bayesian research
Message-ID: <000801c2afab$d3271ee0$db586451@frodo>

Hi,
I have been given the task of researching Bayesian mail filters for my final-year university dissertation.
I have been finding brick walls at every turn.
I know Paul Graham is great, and just about everyone talks about his plan for spam, but I need a starting place.
I have set my system to allow relays and have 1000's of spam examples.
Any ideas where I should start?

thanks for your time

Alan Paice
Cambridge, UK

From randy.diffenderfer at eds.com Mon Dec 30 01:24:44 2002
From: randy.diffenderfer at eds.com (Diffenderfer, Randy)
Date: Mon Dec 30 01:25:02 2002
Subject: [Spambayes] Mail classifiers, training sets and technical doc s
Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC35E@usahm014.exmi01.exch.eds.com>

> In addition to (naive) Bayesian
>classifiers, there of course exist more. For example, a classifier that
>has been around for a while, but has only just begun to be viable (due to
>new training techniques) is the Support Vector Machine (SVM). In its
>basic form, this is a binary classifier (though multi-class problems can
>be handled as well nowadays).
Theoretically, one could of course also >use a very simple and crude (k-) Nearest Neighbor classifier, though one >would need a large training set for this to work well. IIRC the words "Support Vector Machine" appear in some patent that Microsoft has on classification technology. Reference to it is in the spambayes archives somewhere. :-) Needless to say, building a system based upon something that M$ has a patented interest in is probably a losing proposition. From anthony at interlink.com.au Mon Dec 30 18:24:57 2002 From: anthony at interlink.com.au (Anthony Baxter) Date: Mon Dec 30 02:24:18 2002 Subject: [Spambayes] Mail classifiers, training sets and technical docs In-Reply-To: Message-ID: <200212300724.gBU7OvV29221@localhost.localdomain> >>> Tim Peters wrote > There are many public spam archives available, so no on that count. It > works better if people use their own spam anyway (for example, that's the > only way to pick up header clues unique to their ISP). We can't supply a > large training set of ham because what constitutes ham is specific to the > user. It "would be nice" to seed a database with some set of msgs everyone > would agree are ham, but that's surprisingly difficult to arrive at. A thought that occurs to me now - would it make more sense to instead provide a database seeded with a few obvious clues, rather than whole messages - for instance, start with a bunch of the standard "really really really bogus spam clues" from spamassassin? That way, people will hopefully start to get results immediately... Bah, brain foggy from too much Christmas, probably making no sense at all. -- Anthony Baxter It's never too late to have a happy childhood. 
From anthony at interlink.com.au Mon Dec 30 18:28:26 2002
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Dec 30 02:27:46 2002
Subject: [Spambayes] bayesian research
In-Reply-To: <000801c2afab$d3271ee0$db586451@frodo>
Message-ID: <200212300728.gBU7SQi29265@localhost.localdomain>

>>> "alan" wrote
> Hi,
> I have been given the task of researching Bayesian mail filters for my
> final year University dissertation.
> I have been finding brick walls at every turn.
> I know Paul Graham is great, and just about everyone talks about his plan for
> spam, but I need a starting place.
> I have set my system to allow relays and have 1000's of spam examples.
> Any ideas where I should start?

Look at the 'background' page on our website, for starters. Note that
you don't just need a collection of spam - you also want some of the
"real email" (we call it 'ham') that went with the spam. You can start
with differently sourced ham and spam, but you've then got a problem
with false clues (e.g. different 'Received' header lines from the
different mail systems).

For further info, download the code and read the source - it's heavily
commented, and there's a whooole pile of nice information in there.

It's probably also worth noting that this project has pretty much tossed
out the Graham algorithm.

Anthony
--
Anthony Baxter
It's never too late to have a happy childhood.

From pje at telecommunity.com Mon Dec 30 10:52:14 2002
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Dec 30 10:52:48 2002
Subject: [Spambayes] bayesian research
In-Reply-To: <000801c2afab$d3271ee0$db586451@frodo>
Message-ID: <5.1.0.14.0.20021230105132.032bb0a0@mail.telecommunity.com>

At 02:33 AM 12/30/02 +0000, alan wrote:
>I have set my system to allow relays and have 1000's of spam examples.

Please don't enable relays on your mail server; this makes the spam
problem worse for everyone, and will get your mail server blacklisted as
well.
From skip at pobox.com Mon Dec 30 09:59:26 2002
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 30 10:59:31 2002
Subject: [Spambayes] bayesian research
In-Reply-To: <5.1.0.14.0.20021230105132.032bb0a0@mail.telecommunity.com>
References: <000801c2afab$d3271ee0$db586451@frodo>
 <5.1.0.14.0.20021230105132.032bb0a0@mail.telecommunity.com>
Message-ID: <15888.27998.569940.743333@montanaro.dyndns.org>

    Phillip> At 02:33 AM 12/30/02 +0000, alan wrote:
    >> I have set my system to allow relays and have 1000's of spam examples.

    Phillip> Please don't enable relays on your mail server; this makes the
    Phillip> spam problem worse for everyone, and will get your mail server
    Phillip> blacklisted as well.

You have to admit it's a helluva fast way to collect spam though. ;-)

Skip

From tim.one at comcast.net Mon Dec 30 12:50:47 2002
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 30 12:51:20 2002
Subject: [Spambayes] Mail classifiers, training sets and technical docs
In-Reply-To: <200212300724.gBU7OvV29221@localhost.localdomain>
Message-ID:

[Anthony Baxter]
> A thought that occurs to me now - would it make more sense to instead
> provide a database seeded with a few obvious clues, rather than whole
> messages - for instance, start with a bunch of the standard "really
> really really bogus spam clues" from spamassassin?
>
> That way, people will hopefully start to get results immediately...
>
> Bah, brain foggy from too much Christmas, probably making no sense at
> all.

We ran tests "like that" before, based on a seed database derived from a
well-trained database, copying over only the words with very high spamprob
that had appeared "often" (so that their spamprobs are somewhat reliable).
The database then contains no words with spamprob < 0.5 (or, indeed, <
0.95, if that's the "very high spamprob" cutoff used).
Predictably, that boosts the false positive rate -- it's impossible for
anything to score as ham, unless ham_cutoff is also boosted above 0.5, so
Unsure is the best realistic classification you can hope for. It recovers
quickly after training. But then an empty database learns quickly too, and
doesn't have to fight off ghost spam.

From janzert at haskincentral.com Mon Dec 30 19:11:52 2002
From: janzert at haskincentral.com (Brian Haskin)
Date: Mon Dec 30 19:12:01 2002
Subject: [Spambayes] Re: bayesian research
In-Reply-To: <000801c2afab$d3271ee0$db586451@frodo>
References: <000801c2afab$d3271ee0$db586451@frodo>
Message-ID:

alan wrote:
> Hi,
> I have been given the task of researching Bayesian mail filters for my final
> year University dissertation.
> I have been finding brick walls at every turn.
> I know Paul Graham is great, and just about everyone talks about his plan for
> spam, but I need a starting place.
> I have set my system to allow relays and have 1000's of spam examples.
> Any ideas where I should start?
> thanks for your time
> Alan Paice
> Cambridge, UK

If you really want to collect spam this way, please run something like
jackpot (http://jackpot.uk.net/) rather than actually relaying spam.

Brian Haskin
brianjr@haskincentral.com