From tim@fourstonesExpressions.com Sun Dec 1 04:09:19 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat, 30 Nov 2002 22:09:19 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To:
Message-ID: <1QPA8GDWQ09KFUORPQN65JDQLE0IDT.3de98b6f@riven>

Dredging up a less recent thread...

11/21/2002 11:07:36 AM, Neale Pickett wrote:

>> Options class is a bit too much right now... Lots of wonderful options
>> for research, but nobody in their right mind would tweak most of them.
>> I think we should split options into ones that the average person
>> would be interested in, and those that the average person should never
>> touch. Then give 'em a (Richie style web) ui to tweak the ones that
>> they should be interested in. I'd be happy to run with that one for a
>> while...
>
>Now there's an idea, make a configuration engine that runs as a
>SimpleHTTPServer, and have the person connect with their favorite
>browser to some port on localhost. Isn't that how SATAN worked? In any
>case, that would be a very good project for someone who wanted to help
>out but didn't know where to start :)

I've taken a good swipe at creating a configuration application that the
average user could use to make simple changes to the spambayes
configuration. In particular, it's useful for pop3 users to configure
their settings. But it's also useful for other settings as well, and
perhaps even for some administrative tasks, like purging old words from
databases, doing a zodb pack, maybe printing word probabilities, and
stuff like that.

It's html based, and it uses a subclass of SimpleHTTPServer, named
SmarterHTTPServer. (Sorry, Richie, I just couldn't figure out how to
make the stuff in pop3proxy work for me...) SimpleHTTPServer cannot
serve requests with parameters very well, so SmarterHTTPServer adds that
functionality. It also adds the ability to call 'methlets' or methods on
the server class which can be used to dynamically create content.
These two additional functions are quite handy, and may even warrant
inclusion into SimpleHTTPServer.

The program maintains options changes in bayescustomize.ini. Lemme tell
ya what... the way Options.py is set up made figuring out how to do this
stuff one freakin pain. What's up with that? Why does OptionsClass wrap
a ConfigParser, rather than subclass it? Using bayescustomize.ini is
very nice, because it allows the user to easily revert back to default
values if that ever becomes necessary.

To execute this module, just invoke OptionConfig.py, optionally followed
by a port number. The working directory should be the same one where the
bayescustomize.ini file is located. The port number is the port the http
server will listen on, and defaults to 8000. Then point your browser at
http://localhost:8000 (or whatever port you chose).

I've embedded all the necessary html in the module itself, but this can
really only be temporary. We will inevitably accumulate html for little
applications like this (e.g. pop3proxy), and more importantly, for
documentation. I think that the documentation standard for the project
should be html. The look and feel that Richie came up with for the
pop3proxy works very well for me. We need to decide how we're going to
structure the directories for that kind of stuff. May I propose the
following:

    html
    application
    pop3proxy
    optionConfig
    doc
    graphics

- TimS

>
>Neale
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From neale@woozle.org Sun Dec 1 04:16:20 2002
From: neale@woozle.org (Neale Pickett)
Date: 30 Nov 2002 20:16:20 -0800
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

So then, "Moore, Paul" is all like:

> From: Tim Stone - Four Stones Expressions
> > So... does this lay to rest forever the pickle/dbm debate? Is there any
> > reason left to use a pickle?
>
> Sorry, quite the opposite (IMHO). The patch switches to using shelve,
> which uses anydbm, which (still) uses the buggy BerkeleyDB 1.85 on
> Windows. So Windows users should probably still use pickles.

I've just checked in a new anydbm that has a more appropriate list of
database back-ends to try on the Windows platform. But it needs someone
with a Windows box to fix the dumb test I put in it:

    # XXX: Some windows dude should fix this test
    if sys.platform == "windows":
        # dbm on windows is awful.
        _names = ["dbhash", "gdbm", "dumbdbm"]
    else:
        _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]

So, if you are a Windows dude and feel up to fixing that test, please do
so, and remove the first comment while you're at it :)

This should eliminate any dbm concerns for Windows folk.

Neale

From neale@woozle.org Sun Dec 1 04:19:02 2002
From: neale@woozle.org (Neale Pickett)
Date: 30 Nov 2002 20:19:02 -0800
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To:
References:
Message-ID:

So then, Tim Stone - Four Stones Expressions is all like:

> So... does this lay to rest forever the pickle/dbm debate? Is there
> any reason left to use a pickle?

The pickle is still smaller, and it's faster to write out the whole
thing than it is the dbm. So the original recommendations, and the ones
in hammiebulk.py, are still accurate: pickle for pop3proxy and
hammiesrv, dbm for hammiefilter.

If you use the Outlook plugin or Jeremy's ZODB driver, you don't need to
concern yourself with this.

From richie@entrian.com Sun Dec 1 14:16:53 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 14:16:53 +0000
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <1QPA8GDWQ09KFUORPQN65JDQLE0IDT.3de98b6f@riven>
References: <1QPA8GDWQ09KFUORPQN65JDQLE0IDT.3de98b6f@riven>
Message-ID: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>

[Tim Stone]
> I've taken a good swipe at creating a configuration application that the
> average user could use to make simple changes to the spambayes configuration.
> [...]
> It's html based, and it uses a subclass of SimpleHTTPServer, named
> SmarterHTTPServer. (Sorry, Richie, I just couldn't figure out how to make the
> stuff in pop3proxy work for me...)

Your configurator looks great! One of the things on my list of things to
do is to turn the HTML user interface code into a plugin-hosting library,
so that new components like this can be plugged into the user interface.
It's a shame I didn't get round to doing this before you wrote your code -
maybe you and I can work together to design that API, and make your
configurator the first plugin? I think we can combine my HTTP server code
with your 'methlet' idea, and come up with something that works very well
and makes it easy to write further plugins. When I have time, hopefully
later today, I'll write up the thoughts I have on it.

> I've embedded all the necessary html in the module itself, but this can really
> only be temporary. We will inevitably accumulate html for little applications
> like this (e.g. pop3proxy), and more importantly, for documentation. I think
> that the documentation standard for the project should be html. The look and
> feel that Richie came up with for the pop3proxy works very well for me. We
> need to decide how we're going to structure the directories for that kind of
> stuff. May I propose the following:
>
> html
> application
> pop3proxy
> optionConfig
> doc
> graphics

John Draper has said a similar thing - he wants to add to the HTML user
interface as well, and he wants administrators to be able to plug in
their own look and feel (by replacing images, stylesheets and so on).
This structure looks good (we need to include stylesheets - maybe the
'graphics' area could be called 'global' or something and include
stylesheets, javascript modules, and so on?)

-- 
Richie Hindle
richie@entrian.com

From tim@fourstonesExpressions.com Sun Dec 1 16:58:01 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 01 Dec 2002 10:58:01 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>
Message-ID: <5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>

12/1/2002 8:16:53 AM, Richie Hindle wrote:

>
>[Tim Stone]
>> I've taken a good swipe at creating a configuration application that the
>> average user could use to make simple changes to the spambayes configuration.
>> [...]
>> It's html based, and it uses a subclass of SimpleHTTPServer, named
>> SmarterHTTPServer. (Sorry, Richie, I just couldn't figure out how to make the
>> stuff in pop3proxy work for me...)
>
>Your configurator looks great! One of the things on my list of things to do is
>to turn the HTML user interface code into a plugin-hosting library, so that
>new components like this can be plugged into the user interface. It's a
>shame I didn't get round to doing this before you wrote your code - maybe
>you and I can work together to design that API, and make your configurator
>the first plugin? I think we can combine my HTTP server code with your
>'methlet' idea, and come up with something that works very well and makes
>it easy to write further plugins. When I have time, hopefully later today,
>I'll write up the thoughts I have on it.

I'm not sure I understand why you're using asynchat and asyncore. The
SimpleHTTPServer thingy needed a little tweaking, but it's capable of
threaded responses, etc... What was the advantage you gained?

>
>> I've embedded all the necessary html in the module itself, but this can really
>> only be temporary.
>> We will inevitably accumulate html for little applications
>> like this (e.g. pop3proxy), and more importantly, for documentation. I think
>> that the documentation standard for the project should be html. The look and
>> feel that Richie came up with for the pop3proxy works very well for me. We
>> need to decide how we're going to structure the directories for that kind of
>> stuff. May I propose the following:
>>
>
>John Draper has said a similar thing - he wants to add to the HTML user
>interface as well, and he wants administrators to be able to plug in their
>own look and feel (by replacing images, stylesheets and so on). This
>structure looks good (we need to include stylesheets - maybe the 'graphics'
>area could be called 'global' or something and include stylesheets,
>javascript modules, and so on?)

I'm a bit of a stickler on subdirectory contents on websites, because
you can really get into a mishmash quickly. I think that css and js
files (in particular) should be kept separate from graphic files, which
should be kept separate from html... How about:

    ui
    cgi-bin
    pop3proxy
    optionConfig
    html
    doc
    graphics
    style
    js

>
>--
>Richie Hindle
>richie@entrian.com
>
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From skip@pobox.com Sun Dec 1 18:15:09 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 1 Dec 2002 12:15:09 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>
References: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>
 <5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>
Message-ID: <15850.20909.555491.227850@montanaro.dyndns.org>

    Tim> I'm not sure I understand why you're using asynchat and asyncore.

Makes it (relatively) easy to talk to multiple connections
simultaneously without resorting to multiple threads. It requires you to
reorient how you look at such things, but once you understand the model
it's pretty easy to program.

Skip

From richie@entrian.com Sun Dec 1 18:51:22 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 18:51:22 +0000
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <15850.20909.555491.227850@montanaro.dyndns.org>
References: <736kuu05lkbff8ugfsdbnsfq17uv0ql8t2@4ax.com>
 <5421979886ZEB1W05A8EAGGE32C0PK.3dea3f99@riven>
 <15850.20909.555491.227850@montanaro.dyndns.org>
Message-ID:

[Tim Stone]
> I'm not sure I understand why you're using asynchat and asyncore.

[Skip]
> Makes it (relatively) easy to talk to multiple connections simultaneously
> without resorting to multiple threads. It requires you to reorient how you
> look at such things, but once you understand the model it's pretty easy to
> program.

To expand on this a bit, it means that you can have the existing HTML
user interface, your configurator, and multiple POP3 proxies, all
potentially being used by multiple simultaneous users, all within one
thread of one process and all sharing common data structures (the
'options' object, a Classifier instance, etc) without ever having to
worry about synchronising anything. At the moment, for instance, we have
a potential (and rather contrived I admit) problem whereby your
configurator could be halfway through writing the ini file when the POP3
proxy tries to read it. Using asyncore to run everything within one
thread of one process prevents that entire class of problem with no
extra effort.

Asyncore works in exactly the same way as your methlets - it takes away
the procedural programming job of reading and writing sockets, and
instead asks that the programmer writes event handler functions. Your
OptionsConfigurator.homepage(self, parms) is an event handler for the
"someone is asking for the homepage" event. This is exactly how my
async-based HTTP server works - my UserInterface.onHome(self, params)
does the same job for the pop3proxy.py HTML user interface.
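
The event-handler style described above can be sketched with no socket
code at all: the request path picks the method to call, and the query
string becomes its parameters. This is only an illustration. Every name
in it (MethletDispatcher, the on_ prefix, OptionsPages) is hypothetical
and not taken from OptionConfig.py or pop3proxy.py, and it uses Python
3's urllib.parse rather than the modules available in 2002.

```python
from urllib.parse import urlsplit, parse_qs


class MethletDispatcher:
    """Route a request URL to a method named after its path.

    Hypothetical helper for illustration only.
    """

    def dispatch(self, url):
        parts = urlsplit(url)
        # "/homepage" -> "homepage"; an empty path falls back to the home page.
        name = parts.path.strip("/") or "homepage"
        # Only methods carrying the on_ prefix are reachable from a URL.
        handler = getattr(self, "on_" + name, None)
        if handler is None:
            return 404, "no handler for %r" % name
        # parse_qs returns lists; keep only the first value of each parameter.
        params = {k: v[0] for k, v in parse_qs(parts.query).items()}
        return 200, handler(params)


class OptionsPages(MethletDispatcher):
    """A plugin author writes only event handlers like these."""

    def on_homepage(self, params):
        return "<h1>Options</h1>"

    def on_set(self, params):
        return "set %s = %s" % (params.get("name"), params.get("value"))
```

In a real server, whichever layer owns the sockets (threaded or
asyncore-style) would call dispatch() once per request, so the handler
methods never see a socket.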
The plugin API I have in mind will work exactly that way - and there'll
certainly be no requirement for the plugin programmer to know about
asyncore.

-- 
Richie Hindle
richie@entrian.com

From richie@entrian.com Sun Dec 1 20:20:10 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 20:20:10 +0000
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To:
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

[Neale]
> I've just checked in a new anydbm that has a more appropriate list of
> database back-ends to try on the Windows platform. [...]
> This should eliminate any dbm concerns for Windows folk.

You left dbhash in the list - that's just another interface to the
broken bsddb. And if that gets removed, Windows users will be left with
dumbdbm - the name doesn't inspire confidence, and the docstring says
"XXX TO DO: - seems to contain a bug when updating..."

As far as I can see there's a complete solution available to these DBM
problems. Perhaps I've missed something, but I've been back over all the
discussions and I can't see anything wrong with it:

 o We demand bsddb 3 or better on platforms where bsddb is the dbm
   implementation that gets picked up. So until Python 2.3 is released,
   Windows users need to install pybsddb. I've just done this and it's
   trivial. (We already demand a new "email" library and no-one's
   complained.) Would this cause problems on any other platforms?

 o If training goes slowly, we implement Tim Peters' idea: "Bulk
   training could be taught to use a new classifier based on an
   in-memory dict. When that's done, the in-memory dict's ham and spam
   counts would be added into the persistent DB (rewriting only those
   WordInfo records corresponding to words that appeared in the bulk
   training data), and then the in-memory dict could be thrown away."

 o Or (Neale) you were talking about writing a caching front-end for the
   DBM (regardless of which actual DBM was behind it) - that would work
   as well.

Wouldn't that solve *everything*? Startup times would be quick, training
would be quick, no buggy DBM implementations would be used, and
different components wouldn't default to different storage formats
(hammie vs. pop3proxy). Installing pybsddb on Windows is trivial, and
once Python 2.3 comes out you won't even need to do that.

I've probably missed something - it's hard to keep up!

-- 
Richie Hindle
richie@entrian.com

From lists@morpheus.demon.co.uk Sun Dec 1 20:34:43 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 01 Dec 2002 20:34:43 +0000
Subject: [Spambayes] don't update if you don't want to retrain
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

Neale Pickett writes:

> I've just checked in a new anydbm that has a more appropriate list of
> database back-ends to try on the Windows platform. But it needs someone
> with a Windows box to fix the dumb test I put in it:
>
>     # XXX: Some windows dude should fix this test
>     if sys.platform == "windows":
>         # dbm on windows is awful.
>         _names = ["dbhash", "gdbm", "dumbdbm"]
>     else:
>         _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]
>

I see someone changed "windows" to "win32". But the other problem is
more serious. Windows doesn't *have* gdbm or dbm - the problem lies with
"dbhash" (the Berkeley DB implementation).

So the Windows branch should be

    if sys.platform == "windows":
        # The Berkeley DB implementation on Windows is out of date
        _names = ["gdbm", "dbm", "dumbdbm"]

(or probably just _names = ["dumbdbm"]).

Paul.
-- 
This signature intentionally left blank

From tim@fourstonesExpressions.com Sun Dec 1 21:35:48 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 01 Dec 2002 15:35:48 -0600
Subject: [Spambayes] don't update if you don't want to retrain
Message-ID: <7SMC0SN3YX2UWE0549CLJ3ZMJA8.3dea80b4@riven>

'twas me that changed it to win32. When I do a print sys.platform, out
comes win32...

- TimS

12/1/2002 2:34:43 PM, Paul Moore wrote:

>Neale Pickett writes:
>
>> I've just checked in a new anydbm that has a more appropriate list of
>> database back-ends to try on the Windows platform. But it needs someone
>> with a Windows box to fix the dumb test I put in it:
>>
>>     # XXX: Some windows dude should fix this test
>>     if sys.platform == "windows":
>>         # dbm on windows is awful.
>>         _names = ["dbhash", "gdbm", "dumbdbm"]
>>     else:
>>         _names = ["dbhash", "gdbm", "dbm", "dumbdbm"]
>>
>
>I see someone changed "windows" to "win32". But the other problem is
>more serious. Windows doesn't *have* gdbm or dbm - the problem lies
>with "dbhash" (the Berkeley DB implementation).
>
>So the Windows branch should be
>
>    if sys.platform == "windows":
>        # The Berkeley DB implementation on Windows is out of date
>        _names = ["gdbm", "dbm", "dumbdbm"]
>
>(or probably just _names = ["dumbdbm"]).
>
>Paul.
>
>--
>This signature intentionally left blank
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From lists@morpheus.demon.co.uk Sun Dec 1 23:04:25 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 01 Dec 2002 23:04:25 +0000
Subject: [Spambayes] don't update if you don't want to retrain
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

Richie Hindle writes:

> As far as I can see there's a complete solution available to these DBM
> problems.
> Perhaps I've missed something, but I've been back over all the
> discussions and I can't see anything wrong with it:
>
> o We demand bsddb 3 or better on platforms where bsddb is the dbm
>   implementation that gets picked up. So until Python 2.3 is released,
>   Windows users need to install pybsddb. I've just done this and it's
>   trivial. (We already demand a new "email" library and no-one's
>   complained.) Would this cause problems on any other platforms?

I'm all in favour of this. However, it's worth pointing out a couple of
things:

1. Email is pure python, bsddb is not only in C, but also needs a 3rd
   party library (Sleepycat DB). No problem on Windows (Python 2.3 will
   come with it built in, and there's a trivial-to-install binary build
   for 2.2 users), but might it cause problems on Unix systems?
2. On Unix, as I understand it, it's possible to use the new Sleepycat
   DB with the old Python module. So Unix users quite possibly don't
   need to bother with bsddb 3.

The simple answer is to require bsddb 3 on Windows with Python 2.2, and
otherwise use it if present, otherwise use the built-in dbhash (and
assume that a suitably up to date Berkeley DB is behind it).

But as I said, I'm happy with your approach - I only offer this if Unix
users don't like the bsddb 3 requirement...

Paul.
-- 
This signature intentionally left blank

From skip@pobox.com Sun Dec 1 23:18:15 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 1 Dec 2002 17:18:15 -0600
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To:
References: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>
Message-ID: <15850.39095.51566.137899@montanaro.dyndns.org>

    Paul> 1. Email is pure python, bsddb is not only in C, but also needs a 3rd
    Paul>    party library (Sleepycat DB). No problem on Windows (Python 2.3
    Paul>    will come with it built in, and there's a trivial-to-install binary
    Paul>    build for 2.2 users), but might it cause problems on Unix systems?

Unlikely. Most Unixes have had recent versions of Sleepycat's library
available for a long time. Versions 3 or 4(.0) are required for pybsddb.
Failing that, Version 2 doesn't suffer from the bugs that Version 1
does. The old bsddb will still be available, just not built by default.

    Paul> 2. On Unix, as I understand it, it's possible to use the new Sleepycat
    Paul>    DB with the old Python module. So Unix users quite possibly don't
    Paul>    need to bother with bsddb 3.

Correct. The new module has already been checked into CVS though, so
Unix types will get it as the default but be able to fall back to
Version 2 (or even 1) if they want.

don't-worry-about-us-we're-just-fine-ly, y'rs,

Skip

From richie@entrian.com Sun Dec 1 23:49:38 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 01 Dec 2002 23:49:38 +0000
Subject: [Spambayes] The database question that would not die
Message-ID:

I've tried using bsddb3 on Windows, and the results are encouraging.
Testing with 500 spams, 500 hams and 500 unknowns looks like this:

         Training 1000  Database size  Classifying 500  Database load
Pickle   65 seconds       999,540      35 seconds       4 seconds
bsddb3   82 seconds     1,318,912      43 seconds       (negligible)

Close enough on all counts, I'd say (and the startup time will be a
bigger and bigger win as the database grows). Small savings in time and
space for some operations aren't worth the hassle of having two formats,
IMHO.

Here's what I did:

 o Installed pybsddb, which gave me the bsddb3 module

 o Created dbhash3.py, a duplicate of dbhash.py (16 lines of code) that
   refers to bsddb3 rather than bsddb

 o Changed anydbm.py to always use dbhash3 on Windows.
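
The anydbm change in the last step amounts to a per-platform preference
list that is tried in order. A rough sketch of that idea follows, under
loudly stated assumptions: it targets Python 3, where the stdlib
back-ends live in the dbm package; pick_backend and the two lists are
made up for illustration and are not the code that was checked in; and
bsddb3 here means the third-party pybsddb package, which is simply
skipped when it is not installed.

```python
import importlib
import sys


def pick_backend(names):
    """Return the first module in `names` that can be imported.

    Hypothetical helper, in the spirit of anydbm's _names list.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Back-end not available on this machine; try the next one.
            continue
    raise ImportError("no usable dbm backend among %r" % (names,))


# Illustrative preference lists only. Note the platform string is
# "win32", not "windows".
if sys.platform == "win32":
    PREFERRED = ["bsddb3", "dbm.dumb"]
else:
    PREFERRED = ["bsddb3", "dbm.gnu", "dbm.ndbm", "dbm.dumb"]

backend = pick_backend(PREFERRED)
```

Because dbm.dumb is pure Python and always present, the lookup cannot
fail with these lists; the open question in the thread is only which
back-ends deserve to come before it.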

I can see a few possible objections:

 o There may be platforms on which anydbm defaults to bsddb 1.85, but
   for which installing bsddb3 is a pain. Any takers?

 o Current pickle users may violently object to the (small?) time and
   space losses incurred by switching to using an anydbm database (which
   may not be bsddb3 on their platform). Any takers?

 o Insisting on bsddb3 prevents closed-source use of the spambayes code
   until Python 2.3 is released. I can't imagine anyone here
   objecting...? I only mention this one for completeness.

 o We should skip bsddb3 and go directly to ZODB. My feeling is that
   this is possibly a good long-term goal, but at this stage it would be
   premature.

 o The dramatic fifth objection, which I haven't thought of but which
   means this idea will never fly. Any takers? 8-)

So now I can ask the question that Neale (I think) asked a while ago -
is there any need to keep the pickle option?

I would LOVE for us to drop the pickle option before I submit my article
to the Linux Journal, which has to happen before Thursday 5th December.
Explaining the different database formats will be an embarrassment -
much better to simply say "Python 2.2 users on Windows also need to
download bsddb3 from ".

-- 
Richie Hindle
richie@entrian.com

From papaDoc@videotron.ca Mon Dec 2 00:47:13 2002
From: papaDoc@videotron.ca (Remi Ricard)
Date: Sun, 01 Dec 2002 19:47:13 -0500
Subject: [Spambayes] pop3proxy and Mozilla documentation
Message-ID: <1038790033.1032.11.camel@porsche>

Hi,

This is my first draft for the documentation. The presentation is really
plain. I will improve the formatting, add color and images, after your
comments.

So any comment will be welcome.

By the way, I don't know if I should have done it in French. If it is
not understandable let me know. I won't be frustrated, it will just help
me improve my English.

P.S. OptionConfig.py is not working for me: __author__ is not defined.
If I comment out this line I get

Classes:
    OptionsConfigurator - changes select values in Options.py
Abstact:
    Some text here: too long to copy
To Do:
    o Suggestions?

: File name too long

-- 
Remi Ricard

From tim@fourstonesExpressions.com Mon Dec 2 03:44:43 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 01 Dec 2002 21:44:43 -0600
Subject: [Spambayes] pop3proxy and Mozilla documentation
In-Reply-To: <1038790033.1032.11.camel@porsche>
Message-ID:

Remi, I don't know why, but I can't see the attachment...

12/1/2002 6:47:13 PM, Remi Ricard wrote:

>Hi,
>
>This is my first draft for the documentation.
>The presentation is really plain. I will improve the formatting, add
>color and images, after your comments.
>
>So any comment will be welcome.
>
>By the way, I don't know if I should have done it in French. If it is
>not understandable let me know. I won't be frustrated,
>it will just help me improve my English.
>
>P.S. OptionConfig.py is not working for me: __author__ is not defined. If
>I comment out this line I get
>Classes:
>    OptionsConfigurator - changes select values in Options.py
>Abstact:
>    Some text here: too long to copy
>To Do:
>    o Suggestions?
>
>: File name too long

This is a new one to me. It's working just fine on my machine, and
Richie's too. What platform are you on?

- TimS

>
>--
>Remi Ricard
>

c'est moi - TimS
www.fourstonesExpressions.com

From neale@woozle.org Mon Dec 2 04:58:22 2002
From: neale@woozle.org (Neale Pickett)
Date: 01 Dec 2002 20:58:22 -0800
Subject: [Spambayes] The database question that would not die
In-Reply-To:
References:
Message-ID:

So then, Richie Hindle is all like:

> I've tried using bsddb3 on Windows, and the results are encouraging.

Neato. This seems consistent with my experience.
I've been recommending that people using the pop3proxy and hammiesrv
(all 1 of them <0.2 wink>) use the pickle because it will be faster in
all cases than the dbm. The dbm win comes into play with hammiefilter
training or scoring one message at a time, and then the win is huge.

> Close enough on all counts, I'd say (and the startup time will be a
> bigger and bigger win as the database grows). Small savings in time
> and space for some operations aren't worth the hassle of having two
> formats, IMHO.

While Tim S's storage class makes having two formats much easier, it
would be nice if a pop3proxy database could be used by hammiefilter
without having to change a configuration file.

I can't think of a good reason not to drop pickle, but I think we should
wait to see what Tim Peters thinks about it. I've only ever used the
pickle when testing new code, to see if it works with both storage
formats.

Neale

From Paul.Moore@atosorigin.com Mon Dec 2 10:31:15 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 2 Dec 2002 10:31:15 -0000
Subject: [Spambayes] Easy task for Outlook
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E36@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> The plugin could do with some kind of "log file" strategy.
> Currently all "print" statements go to the win32traceutil
> package. However, once we package this up as a stand-alone
> DLL, this won't fly.

This seems an ideal candidate for the Python 2.3 "logging" module (see
PEP 282). There's a standalone version for Python 2.2 - would the fact
that this introduces another dependency on a temporarily external module
be an issue?

(I'm not volunteering to do this - my time is pretty used up at the
moment. But it would be a shame if someone reinvented the wheel here...)

Paul.
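
For what it's worth, replacing print statements with the logging module
might look roughly like this. It is a sketch only, written against the
logging API as it exists in current Python; make_plugin_logger, the
logger name and the record format are all invented for illustration and
are not part of the Outlook plugin.

```python
import logging


def make_plugin_logger(path, name="spambayes.plugin"):
    """Return a logger that appends timestamped records to `path`.

    Hypothetical helper. Call it once per process: every call attaches
    another file handler to the same named logger.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as make_plugin_logger("plugin.log").info("trained on %d
messages", 42) then takes the place of a bare print, and the log file
keeps working when there is no console or trace collector attached.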

From francois.granger@free.fr Mon Dec 2 10:35:34 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 02 Dec 2002 11:35:34 +0100
Subject: [Spambayes] The database question that would not die
In-Reply-To:
Message-ID:

on 2/12/02 0:49, Richie Hindle at richie@entrian.com wrote:

> I've tried using bsddb3 on Windows, and the results are encouraging.
>
> So now I can ask the question that Neale (I think) asked a while ago - is
> there any need to keep the pickle option?

Is the conversion between Pickle and bsddb3 an issue at all?
If there is only one user involved, I think not.

-- 
Mail is a means of communication. People should ask themselves about the
political implications of their choices (or non-choices) of tools and
technologies. For clean mail:
--

From Paul.Moore@atosorigin.com Mon Dec 2 10:36:07 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 2 Dec 2002 10:36:07 -0000
Subject: [Spambayes] don't update if you don't want to retrain
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E37@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
> 'twas me that changed it to win32. When I do a print
> sys.platform, out comes win32...

Sorry, that change was correct. My bad wording, plus a cut&paste typo,
made it look like I was suggesting that "windows" was correct - I wasn't
:-(

But the bit about needing to remove dbhash *was* intended...

Excuse me while I go and type 1000 times "I must check my posts before
sending them"...

Paul.

From richie@entrian.com Mon Dec 2 12:22:51 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 12:22:51 +0000
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6
In-Reply-To:
References:
Message-ID:

[Richie]
> so the on-demand-ness should come for free for all Corpus-using code.

[Mark]
> How much Corpus-using code is there?
> Are there any plans to move any
> existing code that does not use it towards using it? I've raised this with
> Tim S for Outlook, and it doesn't appear we will - I have no idea about the
> other apps though.

Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports
it, but doesn't seem to use it (?) I'd like to see more of the existing
code using it, but then again I'm not in a hurry to implement the idea
myself...

In an ideal (meaning "engineering purity") world, we'd have abstract
Corpus and Message interfaces, and all the applications would code to
those interfaces regardless of the concrete classes implementing them.
Then any application would work with messages stored in any format -
hammie could classify your Outlook messages from the command line, the
Outlook plug-in could train on messages in mbox files, and so on. In the
real world, that kind of thing usually turns out either to be YAGNI or
so hard as to be unreasonable. Where we end up will probably be
somewhere in between.

I was able to scratch an itch using Corpus - it was exactly what I
needed for the web training interface (partly because Tim and I
discussed the design of Corpus with that in mind). If other people find
they can scratch itches with it, its usage will grow, otherwise it
won't. Migrating already-working code to use a new library for reasons
of engineering purity isn't an itch that many people suffer from.

I have a *much* bigger problem with Corpus, which is that I find the
word 'Corpus' impossible to type. Is it just me?

> In the back of my mind, I am pondering if we need a better directory
> structure - maybe with the core engine in a package, and some of these
> "wrappers" used only by a few application also into their own?

Isn't this also YAGNI? We have a few tens of Python files in the
project - do we really need to split it up? And if we do, should we be
doing it with the code this young?
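
To make the "abstract Corpus and Message interfaces" idea from earlier
in this message concrete, here is one hedged sketch of what coding to an
interface could mean. None of it is the actual Corpus.py API: the method
names, StringMessage, ListCorpus and keys_mentioning are all invented
for illustration, and it uses Python 3's abc module, which did not exist
at the time.

```python
from abc import ABC, abstractmethod


class Message(ABC):
    """What an application may assume about any stored message."""

    @abstractmethod
    def key(self):
        """Return an identifier unique within the corpus."""

    @abstractmethod
    def text(self):
        """Return the raw message text."""


class Corpus(ABC):
    """What an application may assume about any message store."""

    @abstractmethod
    def __iter__(self):
        """Yield Message objects."""


class StringMessage(Message):
    """Concrete message held entirely in memory."""

    def __init__(self, key, text):
        self._key, self._text = key, text

    def key(self):
        return self._key

    def text(self):
        return self._text


class ListCorpus(Corpus):
    """Concrete corpus backed by a plain list, e.g. for tests."""

    def __init__(self, messages):
        self._messages = list(messages)

    def __iter__(self):
        return iter(self._messages)


def keys_mentioning(corpus, word):
    """Application code: touches only the two interfaces above."""
    return [msg.key() for msg in corpus if word in msg.text()]
```

An mbox-backed or Outlook-backed corpus would be another pair of
subclasses, and keys_mentioning() would run unchanged against either.
Whether that flexibility is worth building is exactly the YAGNI question
raised above.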
-- Richie Hindle richie@entrian.com From richie@entrian.com Mon Dec 2 12:27:57 2002 From: richie@entrian.com (Richie Hindle) Date: Mon, 02 Dec 2002 12:27:57 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: References: Message-ID: [Richie] > So now I can ask the question that Neale (I think) asked a while ago - is > there any need to keep the pickle option? [François] > Is the conversion between Pickle and bsddb3 an issue at all? > If there is only one user involved, I think not. I wasn't going to provide an upgrade path, if that's what you meant... pickle users will need to retrain, but that happens on a regular basis anyway, as we change the pickle version. Sooner or later we'll need to worry about upgrading existing databases, but it's too early for that now. While you're here, François, would the switch to anydbm have a big effect on your MacOS 9 platform? I don't know what database types are supported there. -- Richie Hindle richie@entrian.com From skip@pobox.com Mon Dec 2 13:03:52 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 2 Dec 2002 07:03:52 -0600 Subject: [Spambayes] The database question that would not die In-Reply-To: References: Message-ID: <15851.23096.388509.925822@montanaro.dyndns.org> Richie> o There may be platforms on which anydbm defaults to bsddb Richie> 1.85, but for which installing bsddb3 is a pain. Any takers? I think there are some misunderstandings still out there about the various incarnations of the bsddb module and the underlying Berkeley DB code. Even if everyone understands what's what, the language I see used suggests they might not. Let me try and make sure everyone has a similar grasp of the issues and terminology. There has been a bsddb or dbhash module in Python for quite a while (five years at least). It requires the Berkeley DB library, originally available from UC Berkeley, but now from Sleepycat (whose founders were grad students at Berkeley when they wrote the earliest versions).
The original bsddb module was written against Berkeley DB 1.85. That version defined two interfaces: a C API (the version 1.85 API) and a file format (the version 1.85 file format). If you ask file(1) about a file created with it the version numbers will likely differ. File format versions and library release versions have no obvious correspondence to the untrained observer. There were various bugs in the code in db 1.85. To correct (some of) those bugs, file format changes were necessary. This first happened in version 1.86, which, unfortunately, was never widely adopted (licensing issues?). The C API didn't change. When version 2.x of Berkeley DB was released (I think by Sleepycat shortly after its founding), they changed the file formats again and added a new C API. The old version 1.85 C API was still available (and still is even in the most recent versions). This API is what the original bsddb module was written against. When version 3.x of Berkeley DB was released, Sleepycat added another C API (or at least extended the version 2 API significantly). Pybsddb (aka bsddb3, aka the current bsddb module in CVS) was written against this richer API. This API remained current through version 4.0.x of Sleepycat's offerings. Unfortunately, in version 4.1.x, they changed some aspects of it which cause problems for Pybsddb. Consequently, you can't build Pybsddb against the 4.1.x library. So, here's a summary of what works with what: The historic bsddb module (bsddb185 in CVS now) works with any version of the Berkeley DB library as long as the 1.85 C API is enabled. If you use it with version 1.85 of the library you may experience data corruption problems because of bugs in the code and file structure (not the 1.85 API). You can use it safely with later versions of the library as long as the 1.85 API was enabled during configuration. The current bsddb module (Pybsddb, bsddb3, bsddb in CVS) works with versions 3.x and 4.0.x of the Berkeley DB library.
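The summary above can be written down as data, which may make the combinations easier to check at a glance. This is only a sketch of the rules as stated in this message: the function names `works_with` and `corruption_risk` and the tuple encoding of library versions are assumptions made for illustration, and nothing here queries a real Berkeley DB installation.

```python
# Module names as used in the summary above.
HISTORIC = "bsddb185"   # the historic bsddb module
CURRENT = "bsddb3"      # Pybsddb / bsddb3 / bsddb in CVS


def works_with(module, lib_version, api_185_enabled=True):
    """Return True if `module` is expected to work with Berkeley DB
    `lib_version`, given as a (major, minor) tuple such as (4, 0).

    `api_185_enabled` (hypothetical flag) records whether the library
    was configured with the 1.85 compatibility API.
    """
    major, minor = lib_version[:2]
    if module == HISTORIC:
        # Any library version, as long as the 1.85 C API is enabled.
        return api_185_enabled
    if module == CURRENT:
        # Versions 3.x and 4.0.x only; 4.1.x changed the API.
        return major == 3 or (major, minor) == (4, 0)
    raise ValueError("unknown module: %r" % (module,))


def corruption_risk(module, lib_version):
    """The data-corruption caveat applies only to the historic module
    running on the 1.85 library itself."""
    return module == HISTORIC and tuple(lib_version[:2]) == (1, 85)


print(works_with(CURRENT, (4, 1)))        # 4.1.x is the problem case
print(corruption_risk(HISTORIC, (1, 85)))
```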
Skip From papaDoc@videotron.ca Mon Dec 2 13:18:58 2002 From: papaDoc@videotron.ca (papaDoc) Date: Mon, 02 Dec 2002 08:18:58 -0500 Subject: [Spambayes] pop3proxy and Mozilla documentation In-Reply-To: References: Message-ID: <3DEB5DC2.9080807@videotron.ca> Hi, >Remi, I don't know why, but I can't see the attachment... > I don't know either. When I picked up my mail from work, I did not receive the attachment either. I was sure I had added it to my email. I will resend it this evening from home with the missing part. I forgot to add how to filter the mail, he.. he.. >> >>P.S. OptionConfig.py is now working for me __author__ is not defined. If >>I comment out this line I get >>Classes: >>OptionsConfigurator - changes select values in Options.py >>Abstact: >> Some text here: too long to copy >>To Do: >> o Suggestions? >> >>: File name too long >> >> > >This is a new one to me. It's working just fine on my machine, and Richie's >too. What platform are you on? > I'm running Red Hat 7.2 with python 2.2.1 if I remember correctly. I needed to get the email directory from the CVS since I don't have the latest python that comes with it. papaDoc From richie@entrian.com Mon Dec 2 13:47:29 2002 From: richie@entrian.com (richie@entrian.com) Date: Mon, 02 Dec 2002 13:47:29 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: <15851.23096.388509.925822@montanaro.dyndns.org> Message-ID: [Skip] > I think there are some misunderstandings still out there about the various > incarnations of the bsddb module and the underlying Berkeley DB code. > [...] > So, here's a summary of what works with what: > > The historic bsddb module (bsddb185 in CVS now) works with any version > of the Berkeley DB library as long as the 1.85 C API is enabled. If you > use it with version 1.85 of the library you may experience data > corruption problems because of bugs in the code and file structure (not > the 1.85 API).
You can use it safely with later versions of the library > as long as the 1.85 API was enabled during configuration. > > The current bsddb module (Pybsddb, bsddb3, bsddb in CVS) works with > versions 3.x and 4.0.x of the Berkeley DB library. Thanks for the clarification. To rephrase my question in these terms: Are there any platforms on which, when you ask anydbm to create a database, it uses version 1.85 of the underlying Berkeley DB library to do that? And if there are such platforms, is upgrading the underlying Berkeley DB library, either directly or by installing pybsddb (aka bsddb3), a pain for a typical user of that platform? I strongly believe that if no such platform exists, we should drop pickle support in favour of using anydbm, and add a check that if the underlying database library chosen by anydbm is the Berkeley DB library, it is version 2 or better. On Windows, people can meet this requirement by installing pybsddb or Python 2.3. -- Richie Hindle richie@entrian.com From Paul.Moore@atosorigin.com Mon Dec 2 14:14:46 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Mon, 2 Dec 2002 14:14:46 -0000 Subject: [Spambayes] The database question that would not die Message-ID: <16E1010E4581B049ABC51D4975CEDB88619962@UKDCX001.uk.int.atosorigin.com> See dead horse, flog. Repeat as required :-) Sorry. From: richie@entrian.com [mailto:richie@entrian.com] > Are there any platforms on which, when you ask anydbm to create a database, > it uses version 1.85 of the underlying Berkeley DB library to do that? And > if there are such platforms, is upgrading the underlying Berkeley DB library, > either directly or by installing pybsddb (aka bsddb3), a pain for a typical > user of that platform? 1. Yes, Windows, with Python 2.2. 2. Yes. Not because installing pybsddb/bsddb3 is difficult, but because pybsddb/bsddb3 doesn't upgrade the library that anydbm uses, but instead installs a second, parallel, copy, which is accessible under a different name (bsddb3).
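Because the newer library lives under its own module name, code that wants it has to ask for it by that name and fall back to something else when it is missing. A minimal sketch of that "preferred module, then fallback" pattern follows. It is written for a modern Python (it uses `importlib`, which did not exist in Python 2.2), the helper name `first_importable` is hypothetical, and the module list in the demonstration uses a deliberately missing name plus the standard `dbm` package so that it runs anywhere; in the setting of this thread the list would presumably be `["bsddb3", "anydbm"]`.

```python
import importlib


def first_importable(names):
    """Return (name, module) for the first module in `names` that can
    be imported; raise ImportError if none of them can."""
    for name in names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed on this machine: try the next candidate.
            continue
    raise ImportError("none of these modules could be imported: %r" % (names,))


# Demonstration with one missing module and one stdlib module.
name, module = first_importable(["no_such_db_module", "dbm"])
print(name)
```

Note that this only selects a module by name; as pointed out in the thread, it says nothing about which version of the underlying library that module is linked against.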
> I strongly believe that if no such platform exists, we should drop pickle > support in favour of using anydbm, and add a check that if the underlying > database library chosen by anydbm is the Berkeley DB library, it is version > 2 or better. On Windows, people can meet this requirement by installing > pybsddb or Python 2.3. We have to code explicitly to use bsddb3 if that is present. If it is not, we can fall back on anydbm (and complain loudly at Windows users). I do not believe that bsddb (neither the standard library one, nor bsddb3) offers any way to check the version of the underlying Sleepycat code. Paul. From wsy@merl.com Mon Dec 2 14:44:10 2002 From: wsy@merl.com (Bill Yerazunis) Date: Mon, 2 Dec 2002 09:44:10 -0500 Subject: [Spambayes] CRM114 in November breaks 99.9%. :-) References: <20021202040836.54151.qmail@mail.archub.org> Message-ID: <200212021444.gB2EiA327329@localhost.localdomain> Final test statistics for CRM114 for November are in: Standard rules apply (no whitelists, no blacklists, realtime email stream only (no "canned spam"), train only on errors, polynomial length 5) For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)

  Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
                   Accepts  Rejects  Emails
  1993   3914      4        0        5911    99.915        2

Spam features in hash tables: 398K
Nonspam features in hash tables: 299K

There was just 1 spam that got through in the last week of November- a very strange spam written in mixed English and Czech trying to sell me diesel engine parts. It came through on a moto-head email list, which I suppose might be slightly topical, and it certainly was amusing, rather reminiscent of the Monty Python "camshaft smuggling" skit, but it's still spam and counts as such. This gives an N+1 accuracy of > 99.9% for the entire month of November. (99.932% for N-accuracy). So, CRM114 barely squeaked through the month at >99.9%. Barely.
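For readers checking the arithmetic: the two percentages quoted above are reproduced if "N-accuracy" is taken as one minus errors over total, and "N+1 accuracy" as the same figure with one extra error charged. That reading is an inference from the numbers in this message, not a definition given in it, and the function names below are hypothetical.

```python
def n_accuracy(errors, total):
    """Plain accuracy in percent: share of messages not misclassified."""
    return 100.0 * (1.0 - errors / total)


def n_plus_one_accuracy(errors, total):
    """Accuracy in percent with one additional error charged
    (assumed meaning of 'N+1 accuracy')."""
    return 100.0 * (1.0 - (errors + 1) / total)


# Figures from the table above.
spams, nonspams = 1993, 3914
false_accepts, false_rejects = 4, 0

errors = false_accepts + false_rejects
total = spams + nonspams + errors   # the table's total counts the errors separately

print(total)
print(round(n_plus_one_accuracy(errors, total), 3))
print(round(n_accuracy(errors, total), 3))
```

Under this reading the four false accepts are counted on top of the 1993 spams and 3914 nonspams, which is how the total of 5911 comes about.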
There's clearly still work to be done (the spambayes mailing list is kicking around the proper way to evaluate probabilities; I'm looking into some of their ideas as well.) --- On The Other Hand (the bad news)--- December is looking much worse - TWO have gotten through already over the weekend (one "barnyard teen" pornspam- it hasn't seen that before) and one very short mortgage solicitation, written folksy-style. I'm also getting mailer errors now out of Sendmail whenever I do a "learn"; I'm starting to think that our systems people have upgraded something and broken something else in the process. This throws some question onto whether the CRM114 training code is actually getting run at all, or whether the increasing spam rate is symptomatic of the evolution of spam against static filters. -Bill Yerazunis From wsy@merl.com Mon Dec 2 14:51:33 2002 From: wsy@merl.com (Bill Yerazunis) Date: Mon, 2 Dec 2002 09:51:33 -0500 Subject: [Spambayes] CRM114 in November breaks 99.9%. :-) References: <20021202040836.54151.qmail@mail.archub.org> Message-ID: <200212021451.gB2EpXq27342@localhost.localdomain> Ooops, messed up the spreadsheet... corrected statistics below: Even-More-Final test statistics for CRM114 for November are in: Standard rules apply (no whitelists, no blacklists, realtime email stream only (no "canned spam"), train only on errors, polynomial length 5) For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)

  Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
                   Accepts  Rejects  Emails
  1931   3914      4        0        5849    99.914        2

Spam features in hash tables: 398K
Nonspam features in hash tables: 299K

There was just 1 spam that got through in the last week of November- a very strange spam written in mixed English and Czech trying to sell me diesel engine parts.
It came through on a moto-head email list, which I suppose might be slightly topical, and it certainly was amusing, rather reminiscent of the Monty Python "camshaft smuggling" skit, but it's still spam and counts as such. This gives an N+1 accuracy of > 99.9% for the entire month of November. (99.932% for N-accuracy). So, CRM114 barely squeaked through the month at >99.9%. Barely. There's clearly still work to be done (the spambayes mailing list is kicking around the proper way to evaluate probabilities; I'm looking into some of their ideas as well.) --- On The Other Hand (the bad news)--- December is looking much worse - TWO have gotten through already over the weekend (one "barnyard teen" pornspam- it hasn't seen that before) and one very short mortgage solicitation, written folksy-style. I'm also getting mailer errors now out of Sendmail whenever I do a "learn"; I'm starting to think that our systems people have upgraded something and broken something else in the process. This throws some question onto whether the CRM114 training code is actually getting run at all, or whether the increasing spam rate is symptomatic of the evolution of spam against static filters. -Bill Yerazunis From bkc@murkworks.com Mon Dec 2 14:58:20 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 02 Dec 2002 09:58:20 -0500 Subject: [Spambayes] The database question that would not die In-Reply-To: Message-ID: <3DEB2D6C.31813.9E8625E@localhost> On 1 Dec 2002 at 23:49, Richie Hindle wrote:

>          Training 1000   Database size   Classifying 500   Database load
> Pickle   65 seconds      999,540         35 seconds        4 seconds
> bsddb3   82 seconds      1,318,912       43 seconds        (negligible)

How many tokens are stored in the pickle / bsddb3 in this example?
Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim@fourstonesExpressions.com Mon Dec 2 15:09:34 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 02 Dec 2002 09:09:34 -0600 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: Message-ID: 12/2/2002 6:22:51 AM, Richie Hindle wrote: > >[Richie] >> so the on-demand-ness should come for free for all Corpus-using code. > >[Mark] >> How much Corpus-using code is there? Are there any plans to move any >> existing code that does not use it towards using it? I've raised this with >> Tim S for Outlook, and it doesn't appear we will - I have no idea about the >> other apps though. > >Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it, >but doesn't seem to use it (?) > >I'd like to see more of the existing code using it, but then again I'm not >in a hurry to implement the idea myself... In an ideal (meaning >"engineering purity") world, we'd have abstract Corpus and Message >interfaces, and all the applications would code to those interfaces >regardless of the concrete classes implementing them. Then any application >would work with messages stored in any format - hammie could classify your >Outlook messages from the command line, the Outlook plug-in could train on >messages in mbox files, and so on. In the real world, that kind of thing >usually turns out either to be YAGNI or so hard as to be unreasonable. > >Where we end up will probably be somewhere in between. I was able to >scratch an itch using Corpus - it was exactly what I needed for the web >training interface (partly because Tim and I discussed the design of Corpus >with that in mind). If other people find they can scratch itches with it, >its usage will grow, otherwise it won't. 
Migrating already-working code to >use a new library for reasons of engineering purity isn't an itch that many >people suffer from. Well, I'm a bit of an engineering purist, and I think that there's benefit to having a single abstract interface for message storage. Right now, we have mbox stuff, corpus stuff, outlook stuff. Mark has indicated that he's not interested in the abstraction for the outlook stuff, and that's fine. But I think the mbox/msg stuff should disappear. They don't do anything that corpus doesn't do at the moment, and it's gonna get confusing down the road for someone who becomes interested in our code. Not to mention that our code reflects on us... Let's take the plunge and make the Corpus stuff the 'standard', and where it doesn't support a current requirement, let's fix it. - TimS > >I have a *much* bigger problem with Corpus, which is that I find the word >'Corpus' impossible to type. Is it just me? > >> In the back of my mind, I am pondering if we need a better directory >> structure - maybe with the core engine in a package, and some of these >> "wrappers" used only by a few application also into their own? > >Isn't this also YAGNI? We have a few tens of Python files in the project - >do we really need to split it up? And if we do, should we be doing it with >the code this young? 
> >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From bkc@murkworks.com Mon Dec 2 15:23:46 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 02 Dec 2002 10:23:46 -0500 Subject: [Spambayes] The database question that would not die In-Reply-To: References: <15851.23096.388509.925822@montanaro.dyndns.org> Message-ID: <3DEB3361.19290.9FFA921@localhost> On 2 Dec 2002 at 13:47, richie@entrian.com wrote: > I strongly believe that if no such platform exists, we should drop pickle > support in favour of using anydbm, and add a check that if the underlying > database library chosen by anydbm is the Berkeley DB library, it is version > 2 or better. On Windows, people can meet this requirement by installing > pybsddb or Python 2.3. > Sorry I haven't been keeping up with this issue. I have my own "database format" that I want to use for classifier storage.. Has the classifier interface to "storage" been abstracted yet? I thought that's where things were headed. But I haven't had a chance to cvs update lately. Can I "drop-in" my own "database instance" into hammie or the Outlook plugin in a transparent way? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From msergeant@startechgroup.co.uk Mon Dec 2 15:22:23 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 02 Dec 2002 15:22:23 +0000 Subject: [Spambayes] CRM114 in November breaks 99.9%. 
:-) In-Reply-To: <200212021444.gB2EiA327329@localhost.localdomain> References: <20021202040836.54151.qmail@mail.archub.org> <200212021444.gB2EiA327329@localhost.localdomain> Message-ID: <3DEB7AAF.4080206@startechgroup.co.uk> Bill Yerazunis said the following on 02/12/02 14:44: > Final test statistics for CRM114 for November are in: > > Standard rules apply (no whitelists, no blacklists, realtime email stream > only (no "canned spam"), train only on errors, polynomial length 5) > > For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)

>  Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
>                   Accepts  Rejects  Emails
>  1993   3914      4        0        5911    99.915        2
>
> Spam features in hash tables: 398K
> Nonspam features in hash tables: 299K

CRM114's learn and classify stuff looks really interesting, but it has a really freaky syntax to someone who is used to regular procedural or OO languages like Perl, Python, C, etc. Is there *any* chance the library in crm114 for learning and classifying can be extracted into a plain .so? That would be tremendous, and I'd be willing to build a perl XS library for it in a heartbeat. If not, we'll just have to try and copy the sparse binary polynomial hash idea ;-) From francois.granger@free.fr Mon Dec 2 15:50:26 2002 From: francois.granger@free.fr (François Granger) Date: Mon, 02 Dec 2002 16:50:26 +0100 Subject: [Spambayes] The database question that would not die In-Reply-To: Message-ID: on 2/12/02 13:27, Richie Hindle at richie@entrian.com wrote: > I wasn't going to provide an upgrade path, if that's what you meant... > pickle users will need to retrain, but that happens on a regular basis > anyway, as we change the pickle version. Sooner or later we'll need to > worry about upgrading existing databases, but it's too early for that now. Just a reminder ;-) Being a tech support guy, I always think about compatibility, upgrade, documentation...
That side of product development ;-) > While you're here, François, would the switch to anydbm have a big effect > on your MacOS 9 platform? I don't know what database types are supported > there. No issue: we have gdbm on the Mac, which gets selected. I don't know how robust it is, but I can ask on the MacPython sig if needed. I tested it once or twice with pop3proxy and had no issues. -- Mail is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies. For clean mail: -- From wsy@merl.com Mon Dec 2 15:57:52 2002 From: wsy@merl.com (Bill Yerazunis) Date: Mon, 2 Dec 2002 10:57:52 -0500 Subject: [Spambayes] CRM114 in November breaks 99.9%. :-) In-Reply-To: <3DEB7AAF.4080206@startechgroup.co.uk> (message from Matt Sergeant on Mon, 02 Dec 2002 15:22:23 +0000) References: <20021202040836.54151.qmail@mail.archub.org> <200212021444.gB2EiA327329@localhost.localdomain> <3DEB7AAF.4080206@startechgroup.co.uk> Message-ID: <200212021557.gB2FvqC28251@localhost.localdomain> From: Matt Sergeant CRM114's learn and classify stuff looks really interesting, but it has a really freaky syntax to someone who is used to regular procedural or OO languages like Perl, Python, C, etc. It _is_ procedural, it's just extremely high level. Perhaps higher-level than APL if you count statements rather than operators. And sorry about the syntax. I was being playful, and reading a book on Latin at the time, which is why it uses symmetric declensional parsing rather than something more sane, like recursive descent. (*) Is there *any* chance the library in crm114 for learning and classifying can be extracted into a plain .so? That would be tremendous, and I'd be willing to build a perl XS library for it in a heartbeat. Yes, it's not difficult to get at the code. Pop the .gz open, emacs the file crm114.c, and look for the case headers "CRM_LEARN" and "CRM_CLASSIFY" respectively.
The code there is _not_ generated, but executed in-line, so cut and paste will work. The current code requires a null-terminated string as input, but that's because of the GNU regex library limits (when TRE gives me a new library, that requirement will go away). You _will_ need to link it against a regex library (of your choice, CRM114 uses the standard ANSI regcomp/regexec calling sequence), and the OS itself needs to support stat() [for file existence/length] and mmap() [to map a file into virtual memory without actually reading it in a byte at a time- this is just for efficiency and can be worked around]. How bad do you want it? :-) If not, we'll just have to try and copy the sparse binary polynomial hash idea ;-) Always legitimate. It's GPLware, no problemo. -Bill Yerazunis (*) all in all, I like the way it ended up; one can just type programs on the command line and they do useful things. But hindsight is always 20/20, and "less wierdass" might be better in the long run. From kanderson@bbn.com Mon Dec 2 16:04:30 2002 From: kanderson@bbn.com (Ken Anderson) Date: Mon, 02 Dec 2002 11:04:30 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: <200212021444.gB2EiA327329@localhost.localdomain> References: <20021202040836.54151.qmail@mail.archub.org> Message-ID: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set? 
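As background for the question above, here is a minimal sketch of one reading of "train only on errors" on a live stream: every incoming message is classified before the filter has learned anything from it, and it is fed to training only if that classification was wrong. Under that reading the stream serves as both test set and training set. This is an assumption about the setup, not a description confirmed in the thread, and `ToyClassifier` is a deliberately crude stand-in that has nothing to do with CRM114's or spambayes' actual algorithms.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in classifier: votes by raw token counts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        spam_score = sum(self.counts["spam"][t] for t in tokens)
        ham_score = sum(self.counts["ham"][t] for t in tokens)
        # Ties (including "never seen any of these tokens") go to ham.
        return "spam" if spam_score > ham_score else "ham"

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())


def train_on_errors(classifier, stream):
    """Classify each (text, true_label) pair first; train only on the
    ones the classifier got wrong. Returns the number of errors."""
    errors = 0
    for text, label in stream:
        if classifier.classify(text) != label:
            errors += 1
            classifier.learn(text, label)
    return errors


stream = [
    ("cheap pills now", "spam"),
    ("meeting at noon", "ham"),
    ("cheap pills today", "spam"),
    ("lunch meeting tomorrow", "ham"),
]
clf = ToyClassifier()
errors = train_on_errors(clf, stream)
print(errors)
```

In this toy run only the first spam is ever trained on; the ham messages are never learned because the tie-breaking rule already classifies them correctly, which shows how little data a train-on-error filter may actually see.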
At 09:44 AM 12/2/2002, Bill Yerazunis wrote: >Final test statistics for CRM114 for November are in: > >Standard rules apply (no whitelists, no blacklists, realtime email stream >only (no "canned spam"), train only on errors, polynomial length 5) > > For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1) > > Spams Nonspams False False Total N+1 Accuracy NHC's > Accepts Rejects Emails > 1993 3914 4 0 5911 99.915 2 > > Spam features in hash tables: 398K > Nonspam features in hash tables: 299K > >There was just 1 spam that got through in the last week of November- >a very strange spam written in mixed English and Czech trying to sell >me diesel engine parts. It came through on a moto-head email list, >which I suppose might be slightly topical, and it certainly was amusing, >rather reminiscent of the Monty Python "camshaft smuggling" skit, >but it's still spam and counts as such. > >This gives an N+1 accuracy of > 99.9% for the entire month of November. >(99.932% for N-accuracy). > >So, CRM114 barely squeaked through the month at >99.9%. Barely. There's >clearly still work to be done (the spambayes mailing list is kicking >around the proper way to evaluate probabilities; I'm looking into some >of their ideas as well.) > > > >--- On The Other Hand (the bad news)--- > >December is looking much worse - TWO have gotten through already over >the weekend (one "barnyard teen" pornspam- it hasn't seen that before) >and one very short mortgage solicitation, written folksy-style. > >I'm also getting mailer errors now out of Sendmail whenever I do >a "learn"; I'm starting to think that our systems people have >upgraded something and broken something else in the process. This >throws some question onto whether the CRM114 training code is actually >getting run at all, or whether the increasing spam rate is >symptomatic of the evolution of spam against static filters. 
> > -Bill Yerazunis From richie@entrian.com Mon Dec 2 16:12:30 2002 From: richie@entrian.com (richie@entrian.com) Date: Mon, 02 Dec 2002 16:12:30 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619962@UKDCX001.uk.int.atosorigin.com> Message-ID: [Paul] > See dead horse, flog. Repeat as required :-) Sorry. Tell me about it. This is proving really difficult. Am I the only one who thinks that having two incompatible database formats sucks? Especially when they're each the default for different pieces of the same software, so you can't use those pieces together without reconfiguring things. > 1. Yes, Windows, with Python 2.2. > 2. Yes. [reasons snipped] I know, and I believe I've already dealt with Windows. Please see http://mail.python.org/pipermail/spambayes/2002-December/002385.html > I do not believe that bsddb (neither the standard library one, nor > bsddb3) offers any way to check the version of the underlying Sleepycat > code. OK, fine. I agree with whoever said that we should document the fact that we require 2 or better, provide a link to pybsddb for Windows, and let users of other platforms worry about it themselves - other platforms have allegedly had Berkeley DB 2 or better for ages, which brings me back to the dead horse question: are there platforms other than Windows where using anydbm instead of pickle will cause problems? Windows we've dealt with, Unix has a recent Berkeley DB, the Mac has gdbm (thanks François!), are there any others? (Are there even any other platforms that we need to consider?) If not, let's ditch pickles before we get publicity from the Linux Journal articles and the Spam conference. [Brad] > Has the classifier interface to "storage" been abstracted yet? I thought > that's where things were headed. But I haven't had a chance to cvs update > lately. Can I "drop-in" my own "database instance" Yes, all that has been done.
The main project has a pickle interface and a DBM interface, and I'm proposing we ditch the pickle interface because it no longer has any advantages. Adding another interface should be easy. -- Richie Hindle richie@entrian.com From msergeant@startechgroup.co.uk Mon Dec 2 16:21:10 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 02 Dec 2002 16:21:10 +0000 Subject: [Spambayes] CRM114 in November breaks 99.9%. :-) In-Reply-To: <200212021557.gB2FvqC28251@localhost.localdomain> References: <20021202040836.54151.qmail@mail.archub.org> <200212021444.gB2EiA327329@localhost.localdomain> <3DEB7AAF.4080206@startechgroup.co.uk> <200212021557.gB2FvqC28251@localhost.localdomain> Message-ID: <3DEB8876.8070408@startechgroup.co.uk> Bill Yerazunis said the following on 02/12/02 15:57: > From: Matt Sergeant > > CRM114's learn and classify stuff looks really interesting, but it has a > really freaky syntax to someone who is used to regular procedural or OO > languages like Perl, Python, C, etc. > > It _is_ procedural, it's just extremely high level. Perhaps higher-level > than APL if you count statements rather than operators. Sorry, I meant "procedural like Perl/Python/C" not "procedural, like Perl/Python/C". Actually maybe python shouldn't be in that list since it has a weirdass syntax too :-) > Is there *any* chance the library > in crm114 for learning and classifying can be extracted into a plain > .so? That would be tremendous, and I'd be willing to build a perl XS > library for it in a heartbeat. > > Yes, it's not difficult to get at the code. > > Pop the .gz open, emacs the file crm114.c, and look for the case > headers "CRM_LEARN" and "CRM_CLASSIFY" respectively. The code there > is _not_ generated, but executed in-line, so cut and paste will work. > > The current code requires a null-terminated string as input, but > that's because of the GNU regex library limits (when TRE gives me a > new library, that requirement will go away).
You _will_ need to link > it against a regex library (of your choice, CRM114 uses the standard > ANSI regcomp/regexec calling sequence), and the OS itself needs to > support stat() [for file existence/length] and mmap() [to map a file > into virtual memory without actually reading it in a byte at a time- > this is just for efficiency and can be worked around]. I was thinking of punting on splitting the email to tokens back to the host language. Since perl and python both support POSIX regexps (and thus [[:graph:]]) its probably easier that way. Unless there's an inherent reason it has to be embedded in the library. > How bad do you want it? :-) What interests me is the hashing technique. It should be reasonably easy to extract that, but for me it's just a lack of tuits - it's hard enough keeping up with my regular day to day activities, and my todo list never gets shorter. > (*) all in all, I like the way it ended up; one can just type programs > on the command line and they do useful things. But hindsight is always > 20/20, and "less wierdass" might be better in the long run. I imagine you'd get a few more users with a regular syntax ;-) Matt. From popiel at wolfskeep.com Mon Dec 2 17:51:50 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon, 02 Dec 2002 09:51:50 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: Message from richie@entrian.com References: Message-ID: <20021202175150.22F5C2DEB1@cashew.wolfskeep.com> In message: richie@entrian.com writes: >[Paul] >> See dead horse, flog. Repeat as required :-) Sorry. > >Tell me about it. This is proving really difficult. Am I the only one >who thinks that having two incompatible database formats sucks? No, you're not the only one. I'd be chiming in if I actually had any time to deal with it. Unfortunately, my time recently has been sucked into a different black hole (finally got nightly backups working properly again). 
I'm getting ready to switch from my own home-brew Graham implementation to hammiefilter for my real live incoming feed. (Yes, I've been testing spambayes but not really using it up to this point.) As I make that transition, I'll become quite interested in what database format is used... I'll also make my procmailrc and support scripts available. - Alex From glouis at dynamicro.on.ca Mon Dec 2 18:40:21 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Mon, 2 Dec 2002 13:40:21 -0500 Subject: [Spambayes] train on error - to exhaustion? Message-ID: <20021202184021.GA6315@athame.dynamicro.on.ca> Training on error means "classify messages from the training corpus in random order; if the classifier errs or is uncertain, submit that message (once?) for training." Has anyone tried either of: 1) when the classifier errs or is uncertain, train on that message until the classifier gets it right, or 2) train once on each error, but then repeat the whole training process until all messages are classified correctly? I'd think the latter might be beneficial, but haven't tried it yet myself. -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From wsy at merl.com Mon Dec 2 19:43:18 2002 From: wsy at merl.com (Bill Yerazunis) Date: Mon, 2 Dec 2002 14:43:18 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <20021202184021.GA6315@athame.dynamicro.on.ca> (message from Greg Louis on Mon, 2 Dec 2002 13:40:21 -0500) References: <20021202184021.GA6315@athame.dynamicro.on.ca> Message-ID: <200212021943.gB2JhIl29523@localhost.localdomain> From: Greg Louis Training on error means "classify messages from the training corpus in random order; if the classifier errs or is uncertain, submit that message (once?) for training." Has anyone tried either of: 1) when the classifier errs or is uncertain, train on that message until the classifier gets it right, I've looked into that on CRM114; the circumstance never happens. 
I typically submit the erroneous message three times in rapid succession: - once to get a "before training" value to confirm the misclassify; - once with "train this message as" turned on (*) - and once again to get an "after training" result and verify the learn. It's never misclassified any message ever on the "after training" verification, so I don't know if it would change anything or not to re-train again and again until it gets the classification correct. 2) train once on each error, but then repeat the whole training process until all messages are classified correctly? I'd think the latter might be beneficial, but haven't tried it yet myself. Hmmm... that would be a good way to do regression checking to verify that every message that is classified correctly once is classified correctly forevermore. -Bill Y. (*) this is the step that seems to be running "LEARNing", but for some reason sendmail is getting upset at me and returning an error message _as well as_ the confirmation message. Bizarre. I'm working on it. From neale at woozle.org Mon Dec 2 20:19:52 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 12:19:52 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: <3DEB3361.19290.9FFA921@localhost> References: <15851.23096.388509.925822@montanaro.dyndns.org> <3DEB3361.19290.9FFA921@localhost> Message-ID: So then, "Brad Clements" is all like: > Has the classifier interface to "storage" been abstracted yet? I > thought that's where things were headed. But I haven't had a chance to > cvs update lately. > > Can I "drop-in" my own "database instance" into hammie or the Outlook > plugin in a transparent way? If we do end up canning the pickle, I guess we could support this sort of thing by making everything instantiate a storage.PersistentClassifier = storage.DBDictClassifier. Then folks like Brad could write their own class and set storage.PersistentClassifier equal to that. 
Unless your "database instance" is something the rest of us would be interested in? You've piqued my interest, Brad, now you gotta tell us what you're up to ;) Neale From neale at woozle.org Mon Dec 2 20:27:11 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 12:27:11 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: References: Message-ID: So then, Richie Hindle is all like: > Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it, > but doesn't seem to use it (?) > > I'd like to see more of the existing code using it, but then again I'm not > in a hurry to implement the idea myself... I have to confess that I haven't even looked at Corpus.py yet. hammiebulk imports it because it needed it for some verbose variable at one point. But I'm going to read up before I take it out, maybe there's something there I can use :) Neale From bkc at murkworks.com Mon Dec 2 20:40:20 2002 From: bkc at murkworks.com (Brad Clements) Date: Mon, 02 Dec 2002 15:40:20 -0500 Subject: [Spambayes] The database question that would not die In-Reply-To: References: <3DEB3361.19290.9FFA921@localhost> Message-ID: <3DEB7D92.26160.B217D9F@localhost> On 2 Dec 2002 at 12:19, Neale Pickett wrote: > Unless your "database instance" is something the rest of us would be > interested in? You've piqued my interest, Brad, now you gotta tell us what > you're up to ;) Just playing around with compressing the token list. So, I take 315680 tokens from my training database, stored as a pickle is 4,597,551 bytes, but I can get it down to 2,105,046 bytes with almost no decompression overhead. But .. What I really want to do is replicate the pickle/db speed trials so I can do some real testing, both on linux and windows, memory mapped or not. I think the "database interface" should be abstract, regardless of what I do. 
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From richie at entrian.com Mon Dec 2 21:47:08 2002
From: richie at entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 21:47:08 +0000
Subject: [Spambayes] The database question that would not die
In-Reply-To: <3DEB2D6C.31813.9E8625E@localhost>
References: <3DEB2D6C.31813.9E8625E@localhost>
Message-ID:

[Richie]
>          Training 1000   Database size   Classifying 500   Database load
> Pickle   65 seconds      999,540         35 seconds        4 seconds
> bsddb3   82 seconds      1,318,912       43 seconds        (negligible)

[Brad]
> How many tokens are stored in the pickle / bsddb3 in this example?

31846

--
Richie Hindle
richie@entrian.com

From neale at woozle.org Mon Dec 2 22:06:43 2002
From: neale at woozle.org (Neale Pickett)
Date: 02 Dec 2002 14:06:43 -0800
Subject: [Spambayes] OT: hotels near subway in Boston?
Message-ID:

So I'm booking travel for this spam conference next month, and I'm
learning that I probably don't want to rent a car in Boston. The
country being very automobile-happy, though, I can't find any hotels
that advertise proximity to the subway. Are there any Boston-area
residents on the list who can recommend a place to stay that's near the
subway? I'm a tourist so I can't handle a lot of bus transfers.

Thanks

Neale

From richie at entrian.com Mon Dec 2 22:08:42 2002
From: richie at entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 22:08:42 +0000
Subject: [Spambayes] The database question that would not die
In-Reply-To:
References: <3DEB2D6C.31813.9E8625E@localhost>
Message-ID:

[Richie]
>          Training 1000   Database size   Classifying 500   Database load
> Pickle   65 seconds      999,540         35 seconds        4 seconds
> bsddb3   82 seconds      1,318,912       43 seconds        (negligible)

[Brad]
> How many tokens are stored in the pickle / bsddb3 in this example?

[Richie]
> 31846

Sorry, brain trouble. The real answer is 30236.
--
Richie Hindle
richie@entrian.com

From bkc at murkworks.com Mon Dec 2 22:27:21 2002
From: bkc at murkworks.com (Brad Clements)
Date: Mon, 02 Dec 2002 17:27:21 -0500
Subject: [Spambayes] wordinfoget
Message-ID: <3DEB96A7.27517.B837981@localhost>

My storage method is most efficient when given a pre-sorted list of words,
so, in _getclues, I would want wordstream to be sorted first.

I guess I'll have to override _getclues, add_msg and friends in my subclass ;-)

Which .py file in CVS generates the comparative time test for db and pickle
training/classifying?

If it's not in CVS, could someone email it to me?

Thanks

Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From richie at entrian.com Mon Dec 2 22:29:53 2002
From: richie at entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 22:29:53 +0000
Subject: [Spambayes] wordinfoget
In-Reply-To: <3DEB96A7.27517.B837981@localhost>
References: <3DEB96A7.27517.B837981@localhost>
Message-ID:

[Brad]
> Which .py file in CVS generates the comparative time test for db and pickle
> training/classifying?

I don't know whether such a thing exists - I produced my results the
old-fashioned way, with a command prompt and a watch. 8-)

--
Richie Hindle
richie@entrian.com

From bkc at murkworks.com Mon Dec 2 22:36:38 2002
From: bkc at murkworks.com (Brad Clements)
Date: Mon, 02 Dec 2002 17:36:38 -0500
Subject: [Spambayes] wordinfoget
In-Reply-To:
References: <3DEB96A7.27517.B837981@localhost>
Message-ID: <3DEB98D4.7778.B8BF7D7@localhost>

oh, ok. which test modules did you time?

On 2 Dec 2002 at 22:29, Richie Hindle wrote:

>
> [Brad]
> > Which .py file in CVS generates the comparative time test for db and
> > pickle training/classifying?
>
> I don't know whether such a thing exists - I produced my results the
> old-fashioned way, with a command prompt and a watch.
8-)
>
> --
> Richie Hindle
> richie@entrian.com
>

Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From richie at entrian.com Mon Dec 2 22:47:28 2002
From: richie at entrian.com (Richie Hindle)
Date: Mon, 02 Dec 2002 22:47:28 +0000
Subject: [Spambayes] wordinfoget
In-Reply-To: <3DEB98D4.7778.B8BF7D7@localhost>
References: <3DEB96A7.27517.B837981@localhost> <3DEB98D4.7778.B8BF7D7@localhost>
Message-ID:

[Brad]
> which test modules did you time?

For training, I ran:

  hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -d -p temp.bsddb3
  hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -D -p temp.pickle

For classifying, I ran:

  hammiebulk.py -u 500-hams.mbox -d -p richie-500.bsddb3
  hammiebulk.py -u 500-hams.mbox -D -p richie-500.pickle

(because I didn't have an mbox of 500 random ham/spam messages to hand).

In each of the four cases I ran the command twice and timed the second one.

I'm using a hacked version of the software that uses bsddb3 - if you need
my patches, let me know.

--
Richie Hindle
richie@entrian.com

From trebor at animeigo.com Mon Dec 2 22:35:36 2002
From: trebor at animeigo.com (Robert Woodhead)
Date: Mon, 2 Dec 2002 17:35:36 -0500
Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-)
In-Reply-To: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com>
References: <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com>
Message-ID:

At 11:04 AM -0500 12/2/02, Ken Anderson wrote:
>The "train only on errors" bothers me. Can you say what you use for
>a training set and what you use for a test set?

Yeah, have you considered training on everything? That is to say,
have CRM classify an email, assume it is correct, and train on it.
Then, if an email comes through as false positive or negative (an
error), you tell CRM to untrain on that email only.
R -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. From neale at woozle.org Mon Dec 2 22:56:50 2002 From: neale at woozle.org (Neale Pickett) Date: 02 Dec 2002 14:56:50 -0800 Subject: [Spambayes] wordinfoget In-Reply-To: References: <3DEB96A7.27517.B837981@localhost> <3DEB98D4.7778.B8BF7D7@localhost> Message-ID: So then, Richie Hindle is all like: > [Brad] > > which test modules did you time? > > For training, I ran: > > hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -d -p temp.bsddb3 > hammiebulk.py -g 500-hams.mbox -s 500-spams.mbox -D -p temp.pickle > > For classifying, I ran: > > hammiebulk.py -u 500-hams.mbox -d -p richie-500.bsddb3 > hammiebulk.py -u 500-hams.mbox -D -p richie-500.pickle That is what I did, too. Unix has a "time" command you can put in front of a command line, which will tell you all sorts of neat statistics. I did five runs of each (pickle and non) and averaged the times by hand. Neale From tim.one at comcast.net Mon Dec 2 23:18:33 2002 From: tim.one at comcast.net (Tim Peters) Date: Mon, 02 Dec 2002 18:18:33 -0500 Subject: [Spambayes] OT: hotels near subway in Boston? In-Reply-To: Message-ID: [Neale Pickett] > So I'm booking travel for this spam conference next month, Where is the conference located? I'm guessing Cambridge. > and I'm learning that I probably don't want to rent a car in Boston. Not unless you're traveling to a "far" suburb (like Burlington). Boston proper is a very small city, it's a maze of unmarked one-way streets, and there's very little parking space (in Boston or Cambridge). > The country being very automobile-happy, though, I can't find any > hotels that advertise proximity to the subway. googling on hotel boston subway finds a bunch. 
> Are there any Boston-area residents on the list who can recommend a > place to stay that's near the subway? Anywhere in Boston proper is close to the T (what locals call the subway) ... hmm, I see the conference is at the MIT Media Lab, and that http://www.media.mit.edu/contact/hotels.html lists a couple dozen convenient hotels. For whatever reason, they don't mention the T there either! MIT is at the Kendall Square stop on the Red Line. Any hotel in the MIT or Harvard area (two T stops away from MIT) would work fine, and the airport is easy to get to and from via T (take shuttle bus 33 from the terminal to the T station -- that's free). > I'm a tourist so I can't handle a lot of bus transfers. Heh . Instead you can handle a lot of T transfers: Subway directions from Logan International Airport: Take the free airport shuttle to the subway "T" station. Take the Blue Line to Government Center stop where you will switch to the Green Line. Take the Green Line to Park Street stop where you will switch to the Red Line. Take the Red Line to Kendall Square stop. It's easier than it sounds. At least it was the sixth time I did it when I lived there <0.9 wink>. From wsy at merl.com Tue Dec 3 02:30:46 2002 From: wsy at merl.com (Bill Yerazunis) Date: Mon, 2 Dec 2002 21:30:46 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: (message from Robert Woodhead on Mon, 2 Dec 2002 17:35:36 -0500) References: <20021202040836.54151.qmail@mail.archub.org> Message-ID: <200212030230.gB32UkR30864@localhost.localdomain> X-Sender: trebor@mail.animeigo.com Date: Mon, 2 Dec 2002 17:35:36 -0500 From: Robert Woodhead Cc: spamfilt@archub.org, spambayes@python.org X-Spam-Status: No, hits=-14.9 required=7.0 tests=IN_REP_TO,REFERENCES,SIGNATURE_SHORT_DENSE, SPAM_PHRASE_01_02,SUBJECT_MONTH,SUBJECT_MONTH_2 version=2.41 X-Spam-Level: At 11:04 AM -0500 12/2/02, Ken Anderson wrote: >The "train only on errors" bothers me. 
Can you say what you use for >a training set and what you use for a test set? Training a particular incarnation of CRM114 usually takes a week or two; I read my mail (both categories) and when I find a piece of mail misclassified, I train that one piece into the filter. After a couple of days the errors get very sparse; after two or three weeks, I "go for data" and that's what gets reported in the monthlies. The current spam.css files are pretty much based on the live spam errors in the first week of October; since only four spam came through in all of November and only two were worth training on (the Czech Diesel Parts spam was just too funny to train out), the .css files are pretty much unchanged. Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only. I did put in that capability as a flag called "refute". You can say learn < refute > ( spamfile.css ) /[[:graph:]]/ to unlearn something as nonspam, and then you can relearn it in the proper category, but except for testing code paths, I've never actually used it. On the other hand, there's an old difficulty in AI that one of my teachers called "the Kalman Belly Gaze". If you let a filter (of any type, he was teaching Kalman filters at the time but it applies to any trained filter) learn on it's own output stream, it quickly reinforces it's own behavior to the exclusion of all else (i.e. it goes off and gazes at it's own navel, simply ignoring the reality of the world around it). The reason I haven't auto-trained is due to my lack of understanding on what the limiting amount of self-teaching one can allow that doesn't go off into belly gaze. 
-Bill Yerazunis From kanderson at bbn.com Tue Dec 3 02:00:40 2002 From: kanderson at bbn.com (Ken Anderson) Date: Mon, 02 Dec 2002 21:00:40 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: References: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> Message-ID: <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> Yes, this is my concern. I think the approach Robert describes is perfectly fine for adaptively learning how to filter email, though there should probably be some form of forgetting, even if the system will eventually forget on its own as words occur less often. However, if this is the approach Bill uses, you can't use it for performance estimates. Our speech and natural language group is very careful not to mix its training set with its test set. When they do, they do something like 10-fold cross validation, which averages (?) the results of 10 experiments that take some random fraction of the data as training and the rest as testing. This gives a lower performance score that is likely to be more accurate on real data. If you're getting 3 9's, be sure you're getting them the hard way. k At 05:35 PM 12/2/2002, Robert Woodhead wrote: >At 11:04 AM -0500 12/2/02, Ken Anderson wrote: >>The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set? > >Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only. > >R > >-- >=========================================================== >Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ >=========================================================== >http://selfpromotion.com/ The Net's only URL registration >SHARESERVICE. A power tool for power webmasters.
From tim at fourstonesExpressions.com Tue Dec 3 03:01:06 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 02 Dec 2002 21:01:06 -0600 Subject: [Spambayes] Corpus module (was: Upgrade problem) In-Reply-To: Message-ID: <9595B96183MJYYW5ZXVF0VP4XMI0IC.3dec1e72@riven> Ok, so I found the message, and here are my thoughts. I freely admit that the abstraction was done completely from a single concrete example, that being the pop3proxy. It seems that the competing interests here can be successfully resolved by further abstraction. The 'Corpus' that Mark describes is essentially an iterator, which doesn't work well for the pop3proxy, but works well for the outlook plugin. I've spent some time looking at the Hammie/Hammiebulk/mboxutils stuff, along with the rfc822/Mailbox/email.* stuff over the last week, and I think that we (I) have managed to somewhat reinvent the wheel. It sounded like a good idea to me and Tim1 at the time... I certainly don't view Corpora as being particularly static. I view any collection of messages that are somehow related as a Corpus. Perhaps a better (more portable) term would have been Folder. Beats me. At any rate, I don't think anybody is locked in to the classes as they exist right now. Neale and Richie have added/removed stuff they need/don't need from them. I *would* like to see a single abstraction that works for the whole project. Should we start over? I'm ok with that... - TimS 11/7/2002 10:48:54 PM, "Mark Hammond" wrote: >> Laughing and pointing should be directed towards me rather than Tim. > >None of that, but some thoughts . > >I think that the classes I posted a while ago suffer from the exact reverse >problem as your idea. My idea was to make a "message store" that is largely >independent of training. I believe the problem with your design is that it >deals with the training at the expense of the message store. > >Obviously, but worth mentioning, is that there are competing interests here. 
>My focus is towards clients, and specifically the outlook one (if there were >more clients I would be happy to think of them too ). Alot of the >focus of this group is towards admins rather than individuals (which is just >fine!) But it seems the current thinking is of a corpus as being a fairly >static, well-controlled set of messages used almost purely for training >purposes. > >For client programs, this may not be practical. The corpus is a more >dynamic set of messages - and worse, actually *is* the user's set of >messages rather than a collection of message copies. > >For example, "moving" a message in a corpus may actually mean moving the >message in the user's real inbox. This may or may not be what is intended - >a corpus "move" operation is more about changing a message's classification >than it is about physically moving pieces of mail around. > >> A Corpus wouldn't know how to create Message objects, nor would a Message >> object know how to create itself - classes *derived from* them would know >> how to do that. For instance (totally untested code, probably full of >> typos) - >> >> class Message: > >Jeremy and I both posted real code, so starting with something that takes >that into consideration would be good. > >> I may be putting too much >> into the base class by demanding that the text of the message be given to >> the constructor - that precludes making FileMessage lazy, and >> only read the >> file when it needs to.] > >It also defeats the abstract nature of the class. > >> 'Corpus' works the same way; again, the details may be naive, but this is >> the general idea: > >I'm hoping I don't sound grumpy, but again, the few systems that already >exist for this engine are the best ones to use to discover the naivety early > > >> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an >> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on. > >I can't quite imagine that at the moment, as per my comments at the top. 
> >Off the top of my head, I believe we need: >* An abstract "message id" >* A message classification database, as discussed before - basically just a >dictionary, keyed by ID, holding either "spam" or "ham". >* A "corpus" becomes just an enumerator of message IDs for bulk/batch >training. It has no move etc operations. >* A "message store" is capable of returning a message object given its ID. >* The training API simply takes message objects and updates the probability >and message databases. > >At that level, we really don't need much else - no folders or any other >grouping of messages. I'm really not too sure there is much value in adding >higher-level concepts such as folders or message store "move" operations - >certainly not at the outset, where there are too many competing >requirements. > >> Yes - this could work using observer objects registered with Corpus >> objects: > >This could work, but may be too simple to be necessary. If the process of >re-training a message in the Outlook GUI becomes: > >def RetrainMessageAsSpam(): > # Outlook specific code to get an ID. > message = message_store.GetMessage(id) > if not classifier.IsSpam(message): > classifier.train(message, is_spam=True) > >And not a whole lot else, it doesn't seem worth it. Unfortunately, the >decision to perform the retrain is the complex, but client specific part. >Is this a newly delivered message? Did the user manually move the message >somewhere? Did the user click one of our buttons? Is the user deleting old >ham that we want to train on before it dies forever? > >Outlook does this via examining what Outlook event we are seeing, and >looking at meta-data we possibly previously attached to the message. I'm >not sure this can be encapsulated well at the moment without adding all our >meta-data etc baggage to the base classes. 
> >> Most of the *new* code that's needed is defining the abstract concepts and >> their interfaces, rather than writing code that actually *does* anything - >> it's building a framework. > >*cough* ummm... This is doomed to failure. Code *must* do something to be >taken seriously. At the very least, I would expect to see the existing test >driver framework running against these "abstract concepts" > >> Once the framework is there, most of the code needed to implement the >> functionality should already be in the project - code to hook >> into Outlook, >> to train on a message, to parse mbox files, and so on. It just needs >> hooking into the framework. > >See above . > >Mark. > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From papaDoc at videotron.ca Tue Dec 3 01:38:42 2002 From: papaDoc at videotron.ca (Remi Ricard) Date: Mon, 02 Dec 2002 20:38:42 -0500 Subject: [Spambayes] pop3proxy and Mozilla documentatio (2 try) Message-ID: <1038879522.998.3.camel@porsche> Hi, This is my second try at sending the documentation on how to use pop3proxy and Mozilla. The same warnings apply. Don't forget any comment is welcome. -- Remi Ricard From papaDoc at videotron.ca Tue Dec 3 01:56:31 2002 From: papaDoc at videotron.ca (Remi Ricard) Date: Mon, 02 Dec 2002 20:56:31 -0500 Subject: [Spambayes] pop3proxy and Mozilla documentatio (third try) Message-ID: <1038880590.998.9.camel@porsche> Hi again, I don't know what is going on but my attachment are not following my mail. (Evolution should be a good mail program ?????) Since my documentation is not really big I will include it in my email Here it comes --------------------------- Documentation for the Spambayes pop3proxy.py program.

This documentation describes how to use pop3proxy.py with Mozilla:mail, although pop3proxy is not restricted to Mozilla.

I talk about Mozilla:mail because it is the only mail reader I use with pop3proxy.

First some definitions:

What is Spambayes?
This project is developing a Bayesian anti-spam classifier, initially based on the work of Paul Graham, in Python.
What is spam?
Broadly speaking: any email that's not wanted by the end-user. More specifically: unsolicited bulk email; email that you do not want and did not ask for, and that was sent to a whole bunch of people by automated means at the same time it was sent to you. This definition deliberately excludes viruses and those stupid jokes sent to you by your Aunt Tillie.
What is ham?
The opposite of spam; not necessarily email that you want or that you asked for, just anything that's not unsolicited bulk email.
What is a proxy?
A proxy is a program that acts as an intermediary between your PC and something. (I hope this is general enough).

Now that we have some definitions we can be more specific:

So what is pop3proxy ?

pop3proxy.py is a program written by the Spambayes team. It is a middleman installed between your current pop servers (usually provided by your ISP) and your mail reader. Upon request, it calls your usual pop server (or one of them) and gets the mail from it. It then classifies each mail into one of 3 categories: Spam|Ham|Unsure. After the classification, it adds a new header, by default X-Spambayes-Classification, that you can look at to find the status of the mail. Then it forwards the mail to you the same way your usual pop server does.
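
To make the role of the new header concrete, here is a minimal sketch of how a script could read it from a downloaded message. It uses only Python's standard email package. The function name `classification`, the sample message, and the fallback to "unsure" when the header is missing are assumptions made for illustration; this is not code from pop3proxy.py.

```python
from email import message_from_string

# Header name as given in the text; everything else here is illustrative.
HEADER = "X-Spambayes-Classification"


def classification(raw_message, default="unsure"):
    """Return the lower-cased value of the classification header.

    If the header is absent (for example, the mail did not pass through
    the proxy), fall back to `default`. That fallback is an assumption.
    """
    msg = message_from_string(raw_message)
    value = msg.get(HEADER, default)
    return value.strip().lower()


# A made-up message, as it might look after passing through the proxy.
sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spambayes-Classification: ham\n"
    "\n"
    "Body text.\n"
)

print(classification(sample))
```

A mail reader's filter does the same thing in spirit: it looks at that one header and decides what to do with the message.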

Your mail reader can use the new header to sort the mail into 3 categories: ham, spam and unsure. pop3proxy.py can talk to as many pop servers as you want. I have 3 pop servers at 3 different ISPs, but to simplify the documentation I will only use two. If you have only one pop server, it is even simpler.

With pop3proxy installed and running:


  pop server on ISP                Proxy on localhost     Port on localhost     Mail reader

  --------------------             ---------------------          ----
 |pop.videotron.ca:110|  <------->|                     | <----> |6110| <-\     --------------
  --------------------            |                     |         ----     \   |              |
                                  |    pop3proxy.py     |                   -> | Mozilla:mail |
  ------------------              |                     |         ----     /   |              |
 |mail.ulaval.ca:110|    <------->|                     | <----> |6111| <-/     --------------
  ------------------               ---------------------          ----

Without pop3proxy installed and running:

 pop server on ISP                Mail reader
                                                             
  --------------------           
 |pop.videotron.ca:110|   <-\      --------------
  --------------------       \    |              |
                              ->  | Mozilla:mail |
  ------------------         /    |              |
 |mail.ulaval.ca:110|     <-/      --------------
  ------------------

Keeping the pictures above in mind, I will explain how all of this works. Usually, when pop3proxy is not installed and you want to get your email from your pop server, you set your mail reader to talk to your pop server and tell it to use port 110 (the usual port for a pop server). You do this for each of your pop servers.

When pop3proxy is running, your mail reader talks to pop3proxy and asks it to get the mail from the different pop servers (pop.videotron.ca or mail.ulaval.ca in the picture above). To tell pop3proxy which pop server you want the mail from, you talk to it on different ports.

In the pictures above, talking to pop3proxy on port 6110 (on your PC) tells it to talk to pop.videotron.ca. If you talk to pop3proxy on port 6111, it knows that you want the mail from the mail.ulaval.ca server.

The association local port <--> pop server can be set when you start pop3proxy by adding command line options, by configuring parameters in the file Options.py, or by using the new OptionConfig.py program.

Setting things up!

Modifications to the Options.py file:

I changed the following lines:

pop3proxy_servers:
pop3proxy_ports:

to

pop3proxy_servers: pop.videotron.ca:110,mail.ulaval.ca:110
pop3proxy_ports: 6110, 6111

Note: The order is important, since the first item in the pop3proxy_servers list is associated with the first item in the pop3proxy_ports list. This means that I have associated port 6110 on my PC (i.e. localhost) with the pop server pop.videotron.ca. When I talk to pop3proxy on port 6110, it knows that I want the mail from the server pop.videotron.ca.
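
The positional pairing described in the note can be sketched in a few lines of Python. This is only an illustration of the idea, not the way pop3proxy.py itself parses its options: the function name `pair_servers_with_ports`, the error raised on a length mismatch, and the default remote port of 110 when none is given are all assumptions.

```python
def pair_servers_with_ports(servers_option, ports_option):
    """Map each local port to the (host, port) of the remote pop server.

    Both arguments are comma-separated strings in the style of the
    pop3proxy_servers and pop3proxy_ports lines shown above. Items are
    paired by position: first server with first port, and so on.
    """
    servers = [s.strip() for s in servers_option.split(",") if s.strip()]
    ports = [int(p) for p in ports_option.split(",") if p.strip()]
    if len(servers) != len(ports):
        # Hypothetical behaviour: refuse a lopsided configuration.
        raise ValueError("need exactly one local port per server")
    mapping = {}
    for server, local_port in zip(servers, ports):
        host, _, remote_port = server.partition(":")
        # Assumption: fall back to the usual pop port when none is given.
        mapping[local_port] = (host, int(remote_port or 110))
    return mapping


mapping = pair_servers_with_ports(
    "pop.videotron.ca:110,mail.ulaval.ca:110", "6110, 6111")
for local_port, (host, remote_port) in sorted(mapping.items()):
    print(local_port, "->", host, remote_port)
```

Swapping the two entries on either line, but not on the other, would silently point port 6110 at mail.ulaval.ca, which is why the order matters.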

Modifications to the mail reader (i.e. Mozilla:mail)

You need to make some changes in the Mail & Newsgroups Account Settings window. To get this window, start Mozilla, then select the menu Window->Mail & Newsgroups. When the Mail & Newsgroups window appears, select the menu Edit->Mail & Newsgroups Account Settings.... You will get the window you need. If you have already created an email account, you will need to edit that entry (see below). But for now we will start from scratch.

We select: Add Account
We select: Email Account, then we press the Next Button
We enter the required information, then we press the Next Button
We select:
- For server type = POP (this is what most of us will need)
- Incoming Server = localhost (This is different: usually we would enter our ISP's mail server here, but now the mail reader talks to pop3proxy, which is on our computer (i.e. localhost), and pop3proxy talks to the mail server.)
- Outgoing Server: relais.videotron.ca (Here you enter what your ISP told you to enter. We follow what they said since we don't need to classify our outgoing mail. P.S. Spammers, here you enter dev_null.spammer.com), then we press the Next Button
We enter the required information (User Name), then we press the Next Button
We enter the required information (Account Name=ricard), then we press the Finish Button

Now you need to edit the information you have just entered to specify the port used to talk to pop3proxy.
In the left part of the window, under the entry you just created (for me it is ricard), select the item Server Settings. The right part of the window should change and show the following fields.

Server Name: localhost
User Name: ricard
Port: 110

You need to change the port number to the one you specified in the file Options.py. (I change this to 6110.) For the next account I will use the second number on the pop3proxy_ports: line.

Using the new header to classify the mail

To classify the mail we will use the filter option available in mozilla:mail.

First we need to create a new filter item. Usually we can filter on subject, sender or body, but we need to filter on the new header X-Spambayes-Classification. To do this you need at least one account (see above for how to create a new account).

In the Mail & Newsgroups window select the menu Tools->Message Filters...
In the new window, click on New.
Create the new item by clicking on the arrow to the right of Subject, then go to the item Customize in the drop-down list.
Write X-Spambayes-Classification in the field and click the Add button.

Now it is possible to use this new header as a filter criterion. Since you can do whatever you want with this new criterion, I will just give you the setup I use.

Example on how to use the new filter item

In my Inbox I have 4 sub folders: 2 that receive the mail from mailing lists (Spambayes and Freesco), one that receives mail that was classified as unsure by pop3proxy, and finally a sub folder for the spam. The Inbox itself will have only mail from my friends (hopefully).

Each of the good folders filters on X-Spambayes-Classification = ham.

Inbox              (Filter on X-Spambayes-Classification = ham)
 |-----> Spambayes (Filter on Subject = [Spambayes] and X-Spambayes-Classification = ham)
 |-----> Freesco   (Filter on Subject = [freesco] and X-Spambayes-Classification = ham)
 |-----> Unsure    (Filter on X-Spambayes-Classification = unsure)
 |-----> Spam      (Filter on X-Spambayes-Classification = spam)
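
For anyone who wants to see the rules above spelled out, here is a rough sketch of the same decisions in modern Python. This is not what Mozilla runs internally; the function name choose_folder is made up for this example, and I am assuming the header value starts with ham, spam or unsure (anything after a semicolon is ignored) and that subject matching is case-insensitive.

```python
from email import message_from_string

HEADER = "X-Spambayes-Classification"


def choose_folder(raw_message):
    """Return the folder the filter setup above would pick for one message."""
    msg = message_from_string(raw_message)
    # Assumption: the verdict is the first token of the header value.
    verdict = (msg.get(HEADER) or "").split(";")[0].strip().lower()
    subject = (msg.get("Subject") or "").lower()
    if verdict == "spam":
        return "Spam"
    if verdict == "unsure":
        return "Unsure"
    if verdict == "ham":
        if "[spambayes]" in subject:
            return "Spambayes"
        if "[freesco]" in subject:
            return "Freesco"
    # Plain ham, or mail that never passed through pop3proxy, stays put.
    return "Inbox"


if __name__ == "__main__":
    sample = (
        "Subject: [Spambayes] dbm on windows\n"
        "X-Spambayes-Classification: ham\n"
        "\n"
        "body text\n"
    )
    print(choose_folder(sample))
```

Note that mail with no X-Spambayes-Classification header at all (for example, mail fetched without going through pop3proxy) matches none of the filters and simply stays in the Inbox.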

-- Remi Ricard From msergeant at startechgroup.co.uk Tue Dec 3 09:52:20 2002 From: msergeant at startechgroup.co.uk (Matt Sergeant) Date: Tue, 03 Dec 2002 09:52:20 +0000 Subject: [Spambayes] OT: hotels near subway in Boston? In-Reply-To: References: Message-ID: <3DEC7ED4.4060403@startechgroup.co.uk> Neale Pickett said the following on 02/12/02 22:06: > So I'm booking travel for this spam conference next month, and I'm > learning that I probably don't want to rent a car in Boston. The > country being very automobile-happy, though, I can't find any hotels > that advertise proximity to the subway. Are there any Boston-area > residents on the list who can recommend a place to stay that's near the > subway? I'm a tourist so I can't handle a lot of bus transfers. I'm staying at the Marriot. Matt. From glouis at dynamicro.on.ca Tue Dec 3 12:04:36 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 07:04:36 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <200212021943.gB2JhIl29523@localhost.localdomain> References: <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> Message-ID: <20021203120436.GA1332@athame.dynamicro.on.ca> On 20021202 (Mon) at 1443:18 -0500, Bill Yerazunis wrote: > > 2) train once on each error, but then repeat the whole training process > until all messages are classified correctly? > > I'd think the latter might be beneficial, but haven't tried it yet > myself. > > Hmmm... that would be a good way to do regression checking to > verify that every message that is classified correctly once > is classified correctly forevermore. I have tried it now. I started from scratch, with 6372 spams and 6372 nonspams, and did a single pass of training-on-error. Then I did second, third, fourth and fifth passes. 
Here are the numbers of messages that had to be trained on each pass:

  rounds spam good
1      1 1090  764
2      2  193   56
3      3   28   15
4      4   10    5
5      5    8    3

Then I took three files of 1624 nonspams each and three files of 617 spams each and ran bogofilter on them with the training db's from each round of training:

   round run fpos fneg err percent
1      1   0   22  126 148    6.60
2      1   1   17  123 140    6.25
3      1   2   19  121 140    6.25
4      2   0   23  105 128    5.71
5      2   1   18  113 131    5.85
6      2   2   22  109 131    5.85
7      3   0   23  104 127    5.67
8      3   1   18  111 129    5.76
9      3   2   22  108 130    5.80
10     4   0   23  104 127    5.67
11     4   1   18  111 129    5.76
12     4   2   22  108 130    5.80
13     5   0   23  103 126    5.62
14     5   1   19  108 127    5.67
15     5   2   22  107 129    5.76

Summarizing,

  round meanerrpc lcl95 ucl95
1     1      6.37  6.13  6.60
2     2      5.80  5.56  6.04
3     3      5.74  5.50  5.98
4     4      5.74  5.50  5.98
5     5      5.68  5.44  5.92

It appears that a second round of training did improve discrimination slightly, but after that the law of diminishing returns set in.

What remains to be done is to start again from scratch and do a full training, followed by one round of training-on-error, and run the test data against those two training sets to see if the result is any different.

--
| G r e g  L o u i s          | gpg public key:      |
| http://www.bgl.nu/~glouis   | finger greg@bgl.nu   |

From tim at fourstonesExpressions.com Tue Dec 3 13:40:05 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 03 Dec 2002 07:40:05 -0600
Subject: [Spambayes] Rethinking Corpus, mboxutils, life, the world, everything
In-Reply-To:
Message-ID: <96KECWQNMYSGFALGGE3XA72ZUOSPB7.3decb435@riven>

12/2/2002 2:27:11 PM, Neale Pickett wrote:

>So then, Richie Hindle is all like:
>
>> Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it,
>> but doesn't seem to use it (?)
>>
>> I'd like to see more of the existing code using it, but then again I'm not
>> in a hurry to implement the idea myself...
>
>I have to confess that I haven't even looked at Corpus.py yet.
>hammiebulk imports it because it needed it for some verbose variable at >one point. But I'm going to read up before I take it out, maybe there's >something there I can use :) The Corpus stuff was created in response to primarily the needs of the pop3proxy. That process manages sets of mail for 'the other' clients, like Netscape, Opera, OE, etc., for which we don't have any hooks into their internals. The only 'interface' we have from them is their pop3 socket datastream. We can't tell when a message moves around in one of their folders, and so we have to keep caches of the mail we receive and give them a user interface they can use to train the classifier with the cached mail. Corpus and its subclass FileCorpus manage that cache for the pop3proxy. Message and its subclass FileMessage wrap each message, giving it an interface that is particularly suited for the pop3proxy. ExpiryCorpus and ExpiryFileCorpus allow the cache contents to be age purged, so the cache doesn't grow indefinitely. All of this is quite suitable for the pop3proxy, but not at all suitable for the Outlook client, which has plenty of hooks into the mail persistence mechanism. The Corpus is observable, and sends notification of two events: a message addition and a message removal. The Trainer class is an observer, and trains a classifier appropriately, based on the kind of trainer it is and whether a message is being added to or removed from the corpus it's observing. In the Outlook client (nearly as I can tell) the idea of a cached corpus is nonsense. Mark can tell when a message moves from one folder to another, and can do the training based on the kind of folder, so this 'third party' user interface to an observable cache messages is not a paradigm that works for outlook. The other thing involved is the mboxutils and msgs 'legacy'. This appears to be primarily directed at unix-style mailboxes, with the message classes being kinda force-fit into some other use-cases. 
Clearly unix-style mailboxes represent a third message persistence paradigm, a single file with all the messages in it, with a recognizable boundary line between. (btw, it seems like it would be fairly easy to screw up this kind of mailbox...) Hammie* uses this stuff, even when it's not training on unix mailboxes, and there's code rambling around in there that says "if I'm looking at a mbox, do (a), if I'm looking at a directory, do (b), if I'm looking at a ..." There are clearly some valid candidates for abstraction in this arena.

So when I look at Corpus, I think that some further abstraction is necessary. Mark saw this instantly, it took me longer. Specifically, the concept of a 'corpus' carries some definitional baggage that has to do with training and such. The Corpus class is abstract in definition, but it makes too many assumptions about its environment to be abstract *enough*. I think we should refactor and introduce another level of abstraction, perhaps called 'Folder'. Here's a strawman:

    class Folder:
        """Basic iteration, maybe not much else here"""
        def __getitem__(self, key):
        def keys(self):
        def __iter__(self):
        def makeMessage(self, key):

    class Directory(Folder):
        def __init__(self, directory):

    class Mbox(Folder):
        def __init__(self, mbox):

    class Outlook(Folder):
        def __init__(self, ???):

    class FileCorpus(Directory):
        """Observable set of messages"""

    class FileCache(FileCorpus):
        """Expirable set of messages"""

    class Message:
        """Message wrapper, maybe even is just email.Message"""

    class MessageFactory:
        """Abstract factory for Message"""

    class FileMessageFactory:
        """Wraps a file system message"""

    class OutlookMessageFactory:
        """Wraps an outlook message, probably only has a key and
        delegator methods to outlook api (?)"""

    class SomeOtherMessageFactory:
        """wraps some other kind of message... you get the idea"""

>
>Neale
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From wsy at merl.com Tue Dec 3 14:51:59 2002
From: wsy at merl.com (Bill Yerazunis)
Date: Tue, 3 Dec 2002 09:51:59 -0500
Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-)
In-Reply-To: <311837598.1038877084@[192.168.2.9]> (message from Brian Burton on Tue, 03 Dec 2002 00:58:04 -0500)
References: <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <311837598.1038877084@[192.168.2.9]>
Message-ID: <200212031451.gB3Epxi32211@localhost.localdomain>

From: Brian Burton

> Training a particular incarnation of CRM114 usually takes a week or
> two; I read my mail (both categories) and when I find a piece of mail
> misclassified, I train that one piece into the filter.

Training only on errors after a cut-off point is interesting. Why do you do this? Is there a reason not to increment the good/spam counts for terms in every email? Is it to avoid overflowing the counts in your hash table or is this likely to be more accurate since it keeps the message counts small?

The reason I started doing it is that I used "unsigned char" as the counters in the big hash tables, to keep them as small as reasonable (remember, we're doing really _random_ accesses of these files and we thrash virtual memory and cache like crazy). The bin incrementer is "smart" in that it won't wrap past 255, but it is losing data at that point, and losing it on the _most_ significant features. I did consider "uncorking" the values up to unsigned int16, but I haven't had a good justification to do that yet. It's a simple change and if there's a need, it'll happen.

> After a couple of days the errors get very sparse; after two or three
> weeks, I "go for data" and that's what gets reported in the monthlies.
Perhaps I misunderstand, but doesn't that mean that you are training up to a desirable accuracy before beginning to measure your accuracy? Is the transition from training to performance measurement based on a predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in corpus, or 14 calendar days of training) or based on the accuracy rising to a certain level? It's measured intuitively, by when I find I'm just not getting enough errors to keep my attention in training. This _is_ human-guided training, mind you. Other influences on when to start are "it's the start of November, start getting data". and "now that the BCR has that nasty underflow problem fixed and the data has settled down, let's get numbers". The other issue that can't be dodged is that spam is not ergodic; spam evolves in fits and starts; my spam of 1996 is very different than my spam of 2002. Any filter that is trained and tested against data statically is operating "in vitro"- a necessary and useful scientific measure but it misses the point of how well a spam filter can retrain on the fly against evolution in action. The training period coincidentally works out to be about 2+ weeks of training, and co-coincidentally I usually have just a few bins in the hash table maxing out about then. (right now I've got 7 bins out of a million maxed out in the spam hashtable, and 5 bins out of a million maxed out in the nonspam hashtable.) If I were to find that I was maxing out a significant number of bins (say, hundreds) I'd rebuild with unsigned int16 bins and accept the performance hit. (yes, this is a very "engineering" style approach; I'm not a good mathematician, so I just do experiments and report on what comes back.) For those of you with exceptionally high boredom thresholds, the current under-test spectra histograms follow. It does exhibit a comforting long distribution tail. -Bill Y. 
Sparse spectra file spam.css has 1048577 bins total total number of hash datums in this file is 398830 now scanning bins- please be patient... bin value 0 found 786135 times bin value 1 found 188350 times bin value 2 found 48948 times bin value 3 found 11125 times bin value 4 found 8550 times bin value 5 found 2511 times bin value 6 found 992 times bin value 7 found 464 times bin value 8 found 470 times bin value 9 found 240 times bin value 10 found 140 times bin value 11 found 104 times bin value 12 found 77 times bin value 13 found 65 times bin value 14 found 46 times bin value 15 found 47 times bin value 16 found 32 times bin value 17 found 36 times bin value 18 found 19 times bin value 19 found 17 times bin value 20 found 30 times bin value 21 found 11 times bin value 22 found 14 times bin value 23 found 8 times bin value 24 found 7 times bin value 25 found 7 times bin value 26 found 6 times bin value 27 found 10 times bin value 28 found 9 times bin value 29 found 7 times bin value 30 found 6 times bin value 31 found 6 times bin value 32 found 5 times bin value 33 found 2 times bin value 34 found 5 times bin value 35 found 2 times bin value 36 found 6 times bin value 37 found 5 times bin value 38 found 2 times bin value 39 found 2 times bin value 40 found 4 times bin value 41 found 2 times bin value 43 found 3 times bin value 44 found 1 times bin value 46 found 3 times bin value 47 found 1 times bin value 50 found 2 times bin value 52 found 3 times bin value 53 found 3 times bin value 55 found 1 times bin value 56 found 3 times bin value 58 found 1 times bin value 60 found 1 times bin value 62 found 1 times bin value 64 found 1 times bin value 69 found 1 times bin value 73 found 1 times bin value 74 found 1 times bin value 76 found 1 times bin value 77 found 1 times bin value 89 found 1 times bin value 90 found 2 times bin value 103 found 1 times bin value 105 found 2 times bin value 116 found 1 times bin value 121 found 1 times bin value 130 found 1 times bin 
value 143 found 1 times bin value 146 found 1 times bin value 157 found 1 times bin value 171 found 1 times bin value 175 found 2 times bin value 189 found 1 times bin value 208 found 1 times bin value 255 found 7 times Sparse spectra file nonspam.css has 1048577 bins total total number of hash datums in this file is 299527 now scanning bins- please be patient... bin value 0 found 819494 times bin value 1 found 187269 times bin value 2 found 31009 times bin value 3 found 7158 times bin value 4 found 1776 times bin value 5 found 614 times bin value 6 found 371 times bin value 7 found 165 times bin value 8 found 100 times bin value 9 found 76 times bin value 10 found 74 times bin value 11 found 46 times bin value 12 found 46 times bin value 13 found 29 times bin value 14 found 46 times bin value 15 found 53 times bin value 16 found 38 times bin value 17 found 16 times bin value 18 found 24 times bin value 19 found 9 times bin value 20 found 5 times bin value 21 found 11 times bin value 22 found 7 times bin value 23 found 13 times bin value 24 found 5 times bin value 25 found 6 times bin value 26 found 6 times bin value 27 found 5 times bin value 28 found 3 times bin value 29 found 3 times bin value 30 found 10 times bin value 31 found 5 times bin value 32 found 4 times bin value 33 found 4 times bin value 34 found 3 times bin value 35 found 3 times bin value 36 found 5 times bin value 37 found 2 times bin value 38 found 3 times bin value 39 found 3 times bin value 40 found 2 times bin value 41 found 2 times bin value 45 found 1 times bin value 46 found 2 times bin value 48 found 3 times bin value 49 found 3 times bin value 50 found 1 times bin value 51 found 1 times bin value 52 found 2 times bin value 54 found 1 times bin value 55 found 1 times bin value 56 found 1 times bin value 57 found 1 times bin value 58 found 1 times bin value 59 found 1 times bin value 60 found 1 times bin value 64 found 1 times bin value 66 found 1 times bin value 67 found 1 times bin value 
71 found 2 times bin value 72 found 1 times bin value 74 found 1 times bin value 75 found 1 times bin value 78 found 1 times bin value 79 found 1 times bin value 80 found 2 times bin value 82 found 2 times bin value 83 found 1 times bin value 86 found 1 times bin value 95 found 1 times bin value 102 found 1 times bin value 104 found 1 times bin value 113 found 1 times bin value 122 found 1 times bin value 138 found 1 times bin value 164 found 1 times bin value 169 found 1 times bin value 173 found 1 times bin value 183 found 1 times bin value 189 found 1 times bin value 222 found 1 times bin value 254 found 1 times bin value 255 found 5 times Enter bin value to zeroize, or 0 to exit: From trebor at animeigo.com Tue Dec 3 14:28:10 2002 From: trebor at animeigo.com (Robert Woodhead) Date: Tue, 3 Dec 2002 09:28:10 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. :-) In-Reply-To: <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> References: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> Message-ID: >However, if this is the approach Bill uses, you can't use to for >performance estimates. Our speech and natural language group is >very careful not to mix its training set with its test set. When >they do, they do something like 10 fold cross validation which >averages (?) the results of 10 experiments that take some random >fraction of the data as training and the rest as testing. ah, but the point is, since each individual user will have his own email stream to train on, all you care about is how accurate the system is when it looks at the very next email that comes in. Thus, a system that gets very good after a few weeks of training on all the incoming mail, AND STAYS THAT WAY, is what you want in the real world. 
Dividing up training sets can be good for analysing the statistical properties of particular algorithm choices, but what counts (in a production environment) is real world performance, and real world filters have to adapt as the spam (and ham) changes over time. Tests like "pick a random sample, train on it, and then pick another sample (nonintersecting) from the same corpus, and test" don't properly reflect the real world environment. Spams are ordered by time! Thus, my philosophical position is that a real world app has to train on every incoming email (and be corrected by the user when it goofs). At 9:30 PM -0500 12/2/02, Bill Yerazunis wrote: >The reason I haven't auto-trained is due to my lack of understanding >on what the limiting amount of self-teaching one can allow that >doesn't go off into belly gaze. This cannot happen unless the user is derelict in not correcting the output. If he is, then the input to the training system is 100% correct. And if the training system has an aging system, correction mistakes will eventually decay (and, if they cause misclassifications, the user will notice and correct the filter). Keep in mind there is always a new stream of incoming spam and ham to work with. R -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. From brian at burton-computer.com Tue Dec 3 05:47:45 2002 From: brian at burton-computer.com (Brian Burton) Date: Tue, 03 Dec 2002 00:47:45 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. 
:-) In-Reply-To: <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> References: <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <5.0.2.1.2.20021202204727.033ae170@zima.bbn.com> Message-ID: <311217897.1038876465@[192.168.2.9]> --On Monday, December 02, 2002 9:00 PM -0500 Ken Anderson wrote: > However, if this is the approach Bill uses, you can't use to for > performance estimates. Our speech and natural language group is very > careful not to mix its training set with its test set. When they do, > they do something like 10 fold cross validation which averages (?) the > results of 10 experiments that take some random fraction of the data as > training and the rest as testing. > > This gives a lower performance score that is likely to be more accurate > on real data. Absolutely. That's the way I evaluate algorithms in SpamProbe as well. I use 10 different random partitionings of my good and bad spams into training and test subsets. Some tests yield excellent results. Others yield bad results. The average is always somewhere in the middle. Taking only a single partitioning isn't a very good way to evaluate the accuracy of an algorithm. All the best, ++Brian From brian at burton-computer.com Tue Dec 3 05:58:04 2002 From: brian at burton-computer.com (Brian Burton) Date: Tue, 03 Dec 2002 00:58:04 -0500 Subject: [Spambayes] Re: CRM114 in November breaks 99.9%. 
:-) In-Reply-To: <200212030230.gB32UkR30864@localhost.localdomain> References: <20021202040836.54151.qmail@mail.archub.org> <5.0.2.1.2.20021202105813.0209b360@zima.bbn.com> <200212030230.gB32UkR30864@localhost.localdomain> Message-ID: <311837598.1038877084@[192.168.2.9]> --On Monday, December 02, 2002 9:30 PM -0500 Bill Yerazunis wrote: > Training a particular incarnation of CRM114 usually takes a week or > two; I read my mail (both categories) and when I find a piece of mail > misclassified, I train that one piece into the filter. Training only on errors after a cut-off point is interesting. Why do you do this? Is there a reason not to increment the good/spam counts for terms in every email? Is it to avoid overflowing the counts in your hash table or is this likely to be more accurate since it keeps the message counts small? > After a couple of days the errors get very sparse; after two or three > weeks, I "go for data" and that's what gets reported in the monthlies. Perhaps I misunderstand, but doesn't that mean that you are training up to a desirable accuracy before beginning to measure your accuracy? Is the transition from training to performance measurement based on a predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in corpus, or 14 calendar days of training) or based on the accuracy rising to a certain level? All the best, ++Brian From glouis at dynamicro.on.ca Tue Dec 3 16:27:34 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 11:27:34 -0500 Subject: [Spambayes] train on error - to exhaustion? 
In-Reply-To: <20021203120436.GA1332@athame.dynamicro.on.ca>
References: <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> <20021203120436.GA1332@athame.dynamicro.on.ca>
Message-ID: <20021203162734.GA12825@athame.dynamicro.on.ca>

On 20021203 (Tue) at 0704:36 -0500, Greg Louis wrote:
>
> Summarizing,
>   round meanerrpc lcl95 ucl95
> 1     1      6.37  6.13  6.60
> 2     2      5.80  5.56  6.04
> 3     3      5.74  5.50  5.98
> 4     4      5.74  5.50  5.98
> 5     5      5.68  5.44  5.92
>
> It appears that a second round of training did improve discrimination
> slightly, but after that the law of diminishing returns set in.
>
> What remains to be done is to start again from scratch and do a full
> training, followed by one round of training-on-error, and run the test
> data against those two training sets to see if the result is any
> different.

       train meanerrpc lcl95 ucl95
1 production      2.11  1.79  2.44
2   errtwice      5.80  5.48  6.12
3       full      5.10  4.78  5.43
4    fullerr      5.10  4.78  5.43

Production refers to my big production training set, just for comparison; it was full-trained up to about 10k spams and 10k hams and then trained, not randomly, on every error encountered since. Errtwice is two rounds of training-on-error with the 6372-of-each training corpus. Full is one round of full training with the same corpus, and fullerr is one round of full training followed by one round of train-on-error (only 18 spams and 221 nonspams were registered in that round; although the means are identical, there was some variation in the individual runs).

Doesn't look as though pure training-on-error is particularly advantageous with the Robinson-Fisher (chi) calculation method. It may still be useful in maintaining the effectiveness of an established training base.
The above experiment is described more fully at http://www.bgl.nu/~glouis/bogofilter/training.html -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From tim at zope.com Tue Dec 3 16:53:10 2002 From: tim at zope.com (Tim Peters) Date: Tue, 3 Dec 2002 11:53:10 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <20021203162734.GA12825@athame.dynamicro.on.ca> Message-ID: [Greg Louis] > ... > Doesn't look as though pure training-on-error is particularly > advantageous with the Robinson-Fisher (chi) calculation method. Are you hashing tokens? spambayes does not, CRM114 does. Bill generates about 16 hash codes per input token, and with just a million hash buckets, collision rates zoom quickly if you train on everything. The experiments spambayes did with CRM114-like schemes were a disaster due to this -- we continued to train on everything, with hashing but without any bounds on bucket count, and the hash collisions quickly caused outrageously bad classification mistakes. Removing the hashing cured that, but then the database size goes through the roof (when generating ~16 "exact strings" per input token, and training on everything). Training-on-error helps Bill because it slashes hash collisions, simply via producing far fewer hash codes than does training on everything. Experiments in the default non-hashing spambayes unigram code found that train-on-error hurt the unsure rate but not the FP or FN rates. > It may still be useful in maintaining the effectiveness of an established > training base. Possibly; we didn't do any experiments on that. From relson at osagesoftware.com Tue Dec 3 16:57:58 2002 From: relson at osagesoftware.com (David Relson) Date: Tue, 03 Dec 2002 11:57:58 -0500 Subject: [Spambayes] train on error - to exhaustion? 
In-Reply-To: <20021203162734.GA12825@athame.dynamicro.on.ca> References: <20021203120436.GA1332@athame.dynamicro.on.ca> <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> <20021203120436.GA1332@athame.dynamicro.on.ca> Message-ID: <4.3.2.7.2.20021203115102.00e234a0@mail.osagesoftware.com> At 11:27 AM 12/3/02, Greg Louis wrote: >Doesn't look as though pure training-on-error is particularly >advantageous with the Robinson-Fisher (chi) calculation method. It may >still be useful in maintaining the effectiveness of an established >training base. Greg, That makes sense. By definition, with training-on-error, only some of the training corpora are put into the word lists. The obvious result is smaller word lists. Other than list size, the effects are less clear. On the one hand, incoming messages will have fewer "hits" in the word lists; while on the other hand, the hits will be more "meaningful". With the smaller lists, there is less "breadth of knowledge" about spam and ham. This could account for the lack of advantage of training-on-error. David From glouis at dynamicro.on.ca Tue Dec 3 17:11:10 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 12:11:10 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: References: <20021203162734.GA12825@athame.dynamicro.on.ca> Message-ID: <20021203171110.GA13054@athame.dynamicro.on.ca> On 20021203 (Tue) at 1153:10 -0500, Tim Peters wrote: > [Greg Louis] > > ... > > Doesn't look as though pure training-on-error is particularly > > advantageous with the Robinson-Fisher (chi) calculation method. > > Are you hashing tokens? spambayes does not, CRM114 does. Bill generates > about 16 hash codes per input token, and with just a million hash buckets, > collision rates zoom quickly if you train on everything. Understood. 
We don't hash tokens, and I agree that the sentence you quoted is misleading; I should have said something like "bogofilter's current tokenization and the R-F classification method." I didn't try any of bogofilter's other calculation methods. > The experiments spambayes did with CRM114-like schemes were a > disaster due to this -- we continued to train on everything, with > hashing but without any bounds on bucket count, and the hash > collisions quickly caused outrageously bad classification mistakes. > Removing the hashing cured that, but then the database size goes > through the roof (when generating ~16 "exact strings" per input > token, and training on everything). Yup. > Training-on-error helps Bill because it slashes hash collisions, simply via > producing far fewer hash codes than does training on everything. I didn't mean to imply otherwise, and your correction of my sloppy wording is appreciated. > Experiments in the default non-hashing spambayes unigram code found that > train-on-error hurt the unsure rate but not the FP or FN rates. > > > It may still be useful in maintaining the effectiveness of an established > > training base. > > Possibly; we didn't do any experiments on that. Neither have I; I've been doing it in practice and it seems to work (my fp/fn are coming down), but I would like to perform a properly-designed experiment to assess it. -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From neale at woozle.org Tue Dec 3 17:19:52 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:19:52 -0800 Subject: [Spambayes] dbm on windows, hopefully for the last time Message-ID: What do you all think of this: new option "dbm_type" which can be "best", "db3hash", "dbhash", "gdbm", or "dumbdbm". If it's "best", then the best available dbm implementation will be used. Note that "best" on Windows excludes "dbhash". So now, you get the best one your platform supports by default. 
Or you can specify a specific dbm if you like that better. This will remove the "anydbm" module, but add a tiny "dbmstorage" module. Please let me know what you think. I'll check it in if I don't get any "no, don't do that" comments. Here's the diff: Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.78 diff -u -r1.78 Options.py --- Options.py 26 Nov 2002 00:43:51 -0000 1.78 +++ Options.py 3 Dec 2002 17:13:20 -0000 @@ -372,6 +372,10 @@ [globals] verbose: False +# What DBM storage type should we use? Must be best, db3hash, dbhash, +# gdbm, dumbdbm. Windows folk should steer clear of dbhash. Default is +# "best", which will pick the best DBM type available on your platform. +dbm_type: best """ int_cracker = ('getint', None) @@ -460,6 +464,7 @@ 'html_ui_launch_browser': boolean_cracker, }, 'globals': {'verbose': boolean_cracker, + 'dbm_type': string_cracker, }, } Index: anydbm.py =================================================================== RCS file: anydbm.py diff -N anydbm.py --- anydbm.py 2 Dec 2002 20:23:39 -0000 1.3 +++ /dev/null 1 Jan 1970 00:00:00 -0000 @@ -1,57 +0,0 @@ -#! /usr/bin/env python -"""Generic interface to all dbm clones. - -This is just like anydbm from the Python distribution, except that this -one leaves out the "dbm" type on Windows, since reliable reports have it -that this module is antiquated and most dreadful. - -""" - -import sys - -try: - class error(Exception): - pass -except (NameError, TypeError): - error = "anydbm.error" - -if sys.platform in ["win32"]: - # dbm on windows is awful. 
- _names = ["bsddb3", "gdbm", "dumbdbm"] -else: - _names = ["dbhash", "gdbm", "dbm", "dumbdbm"] -_errors = [error] -_defaultmod = None - -for _name in _names: - try: - _mod = __import__(_name) - except ImportError: - continue - if not _defaultmod: - _defaultmod = _mod - _errors.append(_mod.error) - -if not _defaultmod: - raise ImportError, "no dbm clone found; tried %s" % _names - -error = tuple(_errors) - -def open(file, flag = 'r', mode = 0666): - # guess the type of an existing database - from whichdb import whichdb - result=whichdb(file) - if result is None: - # db doesn't exist - if 'c' in flag or 'n' in flag: - # file doesn't exist and the new - # flag was used so use default type - mod = _defaultmod - else: - raise error, "need 'c' or 'n' flag to open new db" - elif result == "": - # db type cannot be determined - raise error, "db type could not be determined" - else: - mod = __import__(result) - return mod.open(file, flag, mode) Index: dbmstorage.py =================================================================== RCS file: dbmstorage.py diff -N dbmstorage.py --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ dbmstorage.py 3 Dec 2002 17:13:20 -0000 @@ -0,0 +1,53 @@ +"""Wrapper to open an appropriate dbm storage type.""" + +from Options import options + +class error(Exception): + pass + +def open_db3hash(*args): + """Open a bsddb3 hash.""" + import bsddb3 + return bsddb3.hashopen(*args) + +def open_dbhash(*args): + """Open a bsddb hash. 
Don't use this on Windows.""" + import bsddb + return bsddb.hashopen(*args) + +def open_gdbm(*args): + """Open a gdbm database.""" + import gdbm + return gdbm.open(*args) + +def open_dumbdbm(*args): + """Open a dumbdbm database.""" + import dumbdbm + return dumbdbm.open(*args) + +def open_best(*args): + if sys.platform == "win32": + funcs = [open_db3hash, open_gdbm, open_dumbdbm] + else: + funcs = [open_db3hash, open_dbhash, open_gdbm, open_dumbdbm] + for f in funcs: + try: + return f(*args) + except ImportError: + pass + raise error("No dbm modules available!") + +open_funcs = { + "best": open_best, + "db3hash": open_db3hash, + "dbhash": open_dbhash, + "gdbm": open_gdbm, + "dumbdbm": open_dumbdbm, + } + +def open(*args): + dbm_type = options.dbm_type.lower() + f = open_funcs.get(dbm_type) + if not f: + raise error("Unknown dbm type in options file") + return f(*args) Index: storage.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/storage.py,v retrieving revision 1.5 diff -u -r1.5 storage.py --- storage.py 2 Dec 2002 06:02:03 -0000 1.5 +++ storage.py 3 Dec 2002 17:13:20 -0000 @@ -51,6 +51,7 @@ import cPickle as pickle import errno import shelve +import dbmstorage PICKLE_TYPE = 1 NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training @@ -130,7 +131,8 @@ if options.verbose: print 'Loading state from',self.db_name,'database' - self.db = shelve.DbfilenameShelf(self.db_name, self.mode) + self.dbm = dbmstorage.open(self.db_name, self.mode) + self.db = shelve.Shelf(self.dbm) if self.db.has_key(self.statekey): t = self.db[self.statekey] From neale at woozle.org Tue Dec 3 17:22:39 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:22:39 -0800 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: So then, Neale Pickett is all like: > --- /dev/null 1 Jan 1970 00:00:00 -0000 > +++ dbmstorage.py 3 Dec 2002 17:13:20 -0000 > @@ -0,0 
+1,53 @@ > +"""Wrapper to open an appropriate dbm storage type.""" > + > +from Options import options dbmstorage.py will, of course, also import sys :) > + if sys.platform == "win32": > + funcs = [open_db3hash, open_gdbm, open_dumbdbm] > + else: > + funcs = [open_db3hash, open_dbhash, open_gdbm, open_dumbdbm] From glouis at dynamicro.on.ca Tue Dec 3 17:23:25 2002 From: glouis at dynamicro.on.ca (Greg Louis) Date: Tue, 3 Dec 2002 12:23:25 -0500 Subject: [Spambayes] train on error - to exhaustion? In-Reply-To: <4.3.2.7.2.20021203115102.00e234a0@mail.osagesoftware.com> References: <20021203120436.GA1332@athame.dynamicro.on.ca> <20021202184021.GA6315@athame.dynamicro.on.ca> <200212021943.gB2JhIl29523@localhost.localdomain> <20021203120436.GA1332@athame.dynamicro.on.ca> <4.3.2.7.2.20021203115102.00e234a0@mail.osagesoftware.com> Message-ID: <20021203172325.GB13054@athame.dynamicro.on.ca> On 20021203 (Tue) at 1157:58 -0500, David Relson wrote: > By definition, with training-on-error, only some of the > training corpora are put into the word lists. The obvious result is > smaller word lists. I can confirm that. "twice" is the directory where the db files were built by two rounds of train-on-error: # ls -l full twice full: total 47288 -rw-r--r-- 1 spamtest root 38936576 Dec 3 07:24 goodlist.db -rw-r--r-- 1 spamtest root 9424896 Dec 3 07:06 spamlist.db twice: total 22168 -rw-r--r-- 1 spamtest users 15761408 Dec 2 14:54 goodlist.db -rw-r--r-- 1 spamtest users 6905856 Dec 2 14:55 spamlist.db > Other than list size, the effects are less clear. On > the one hand, incoming messages will have fewer "hits" in the word lists; > while on the other hand, the hits will be more "meaningful". With the > smaller lists, there is less "breadth of knowledge" about spam and > ham. This could account for the lack of advantage of training-on-error. 
The fact that you get only half a percent more errors with less than half the bulk of wordlists does suggest that full training introduces a lot of unproductive cruft, though. What I _think_ I'm seeing is that, when done on top of an existing "full" base, training on every error as it's encountered does quickly improve the discrimination. That's gut-feeling and could be wrong -- experimentation is needed. -- | G r e g L o u i s | gpg public key: | | http://www.bgl.nu/~glouis | finger greg@bgl.nu | From neale at woozle.org Tue Dec 3 17:25:50 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:25:50 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: <3DEB7D92.26160.B217D9F@localhost> References: <3DEB3361.19290.9FFA921@localhost> <3DEB7D92.26160.B217D9F@localhost> Message-ID: So then, "Brad Clements" is all like: > I think the "database interface" should be abstract, regardless of > what I do. It is abstracted in a few places: you can write a new PersistentClassifier class (in storage.py), or you can have the DBDictClassifier use a new dbm storage backend (also in storage.py). If my latest patch is amenable to everyone, you can also hack dbmstorage.py to include a new dbm-like back-end. Neale From skip at pobox.com Tue Dec 3 17:28:34 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 11:28:34 -0600 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: <15852.59842.998057.524034@montanaro.dyndns.org> Neale> What do you all think of this: new option "dbm_type" which can be Neale> "best", "db3hash", "dbhash", "gdbm", or "dumbdbm". If it's Neale> "best", then the best available dbm implementation will be used. Neale> Note that "best" on Windows excludes "dbhash". Looks like a winner to me. 
Skip From richie at entrian.com Tue Dec 3 17:44:19 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 17:44:19 +0000 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: [Neale] > What do you all think of this: new option "dbm_type" which can be > "best", "db3hash", "dbhash", "gdbm", or "dumbdbm". If it's "best", then > the best available dbm implementation will be used. Note that "best" on > Windows excludes "dbhash". Looks spot on - nice one! We should also change the default for pop3proxy_persistent_use_database to True. -- Richie Hindle richie@entrian.com From neale at woozle.org Tue Dec 3 17:50:44 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 09:50:44 -0800 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: So then, Richie Hindle is all like: > We should also change the default for > pop3proxy_persistent_use_database to True. For that matter, what do you think about moving persistent_use_database back to the [global] section and doing away with *_presistent_use_database? Neale From skip at pobox.com Tue Dec 3 18:10:28 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 12:10:28 -0600 Subject: [Spambayes] The database question that would not die In-Reply-To: References: <15851.23096.388509.925822@montanaro.dyndns.org> Message-ID: <15852.62356.862434.212872@montanaro.dyndns.org> richie> Are there any platforms on which, when you ask anydbm to create richie> a database, it uses version 1.85 of the underlying Berkeley DB richie> library to do that? Yes, unfortunately the Python Windows installer is distributed with Berkeley DB 1.85. On other platforms it's a hit-or-miss proposition. I don't believe any Linux vendors ship with db1 as the default anymore, but I could easily be disabused of that notion. I don't know about the commercial Unix vendors. Has anyone considered Sleepycat's caveats about using 1.85? 
The relevant page is here: http://www.sleepycat.com/historic.html The q/a about 1.85 is: Are there known problems with the 1.85 and 1.86 versions? Yes. Specifically, we recommend that you avoid the following operations when using versions 1.85 and 1.86: * Btree cursor (seq and put using a cursor) operations. * Large numbers of btree duplicates (specifically, avoid migrating duplicate keys to internal pages). * Large numbers of btree deletes (you should periodically dump and rebuild the database if you delete large numbers of records). * Overwriting or deleting overflow hash key/data pairs (pairs with items larger than the page size). * Intermixing hash cursor operations with deletes. In addition: * As there was no locking support in version 1.85, you cannot perform concurrent read/write operations in the database. * As there was no logging or transaction support in version 1.85, you must re-create your database whenever abnormal application termination occurs (e.g., either the application or the system crashes) as the database may have been left in a corrupted state. Finally, you should not upgrade your GNU gcc or Solaris compiler. Optimizations in versions of gcc 2 that were in alpha test in the summer of 1997, and a version of the standard Solaris WorkShop Compiler that was in beta test in the fall of 1997, trigger bugs in versions 1.85 and 1.86 that will cause sporadic core dumps. It seems to me the most important issues for us are the last two bullets in the first section and the last bullet in the second section. How close can we come to avoiding them? I don't think we should have any overflow has key/data pairs. The largest item in my current hammie.db file is only 108 bytes. Does the code do things like foo = db.next() if someprop(foo): del db[foo[0]] ? If not that may not be a problem either. The "abnormal termination" bit bothers me some, based on historical prejudices about Windows' (in)stability. I imagine others can speak to that. 
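On the "intermixing hash cursor operations with deletes" caveat: one way to stay clear of it is to finish the scan first and delete afterwards. The following is a minimal sketch in present-day Python 3, not code from the spambayes tree; `purge_keys` and `should_delete` are made-up names, and the standard `dbm.dumb` module merely stands in for whichever dbm is really in use.

```python
import dbm.dumb
import os
import tempfile


def purge_keys(db, should_delete):
    """Delete every key for which should_delete(key, value) is true.

    The scan and the deletes are kept in separate passes, so nothing
    is being iterated while records are removed.
    """
    # Pass 1: read-only scan; remember the keys to remove.
    doomed = [key for key in db.keys() if should_delete(key, db[key])]
    # Pass 2: the scan is over; now delete.
    for key in doomed:
        del db[key]
    return len(doomed)


if __name__ == "__main__":
    # Tiny demonstration on a throwaway database.
    path = os.path.join(tempfile.mkdtemp(), "demo")
    db = dbm.dumb.open(path, "c")
    db[b"keep"] = b"1"
    db[b"drop"] = b"0"
    print(purge_keys(db, lambda key, value: value == b"0"))
    db.close()
```

The cost is holding the list of doomed keys in memory, which should be small next to the database itself.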
Skip From richie at entrian.com Tue Dec 3 19:08:07 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 19:08:07 +0000 Subject: [Spambayes] The database question that would not die In-Reply-To: <15852.62356.862434.212872@montanaro.dyndns.org> References: <15851.23096.388509.925822@montanaro.dyndns.org> <15852.62356.862434.212872@montanaro.dyndns.org> Message-ID: [Skip] > Yes, unfortunately the Python Windows installer is distributed with Berkeley > DB 1.85. I should have said "except Windows" - I know we need to special-case that one. > Has anyone considered Sleepycat's caveats about using 1.85? I've read it, and although we might not hit any of the specific problems they mention right now, it seemed sufficiently scary to put me off using it. Who knows what code will be added to spambayes in the future - we can't make any assumptions. I think the patch Neale posted today does an excellent job of avoiding the problems - let's go with that. -- Richie Hindle richie@entrian.com From richie at entrian.com Tue Dec 3 19:14:04 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 19:14:04 +0000 Subject: [Spambayes] dbm on windows, hopefully for the last time In-Reply-To: References: Message-ID: [Neale] > For that matter, what do you think about moving persistent_use_database > back to the [global] section and doing away with > *_presistent_use_database? Yes, good plan. But what about *_persistent_storage_file? That defaults to ~/.hammiedb for hammie, which is meaningless on Windows 9x but very sensible on Unix. Maybe we need to move from having per-application defaults in bayescustomize.ini to having per-platform defaults? This is effectively what we've done with "dbm_type: best" (but in a different place). 
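The "best available, per platform" idea can be sketched in a few lines. This is an illustration in present-day Python 3, not Neale's patch: the module names (`dbm.gnu`, `dbm.ndbm`, `dbm.dumb`) are today's standard-library ones rather than the bsddb3/bsddb/gdbm/dumbdbm set in the diff, and `CANDIDATES`, its preference order, and this `open_best` are assumptions made for the example.

```python
import importlib
import sys

# Preference order per platform (an assumption made for this sketch).
CANDIDATES = {
    "win32": ["dbm.dumb"],
    "default": ["dbm.gnu", "dbm.ndbm", "dbm.dumb"],
}


def open_best(filename, flag="c", platform=None):
    """Open filename with the first importable dbm module for the platform."""
    platform = platform or sys.platform
    names = CANDIDATES.get(platform, CANDIDATES["default"])
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not built into this Python; try the next one
        return module.open(filename, flag)
    raise RuntimeError("no dbm module available; tried %s" % names)
```

Keeping the per-platform lists in one table is what would let the same mechanism carry other per-platform defaults later, if that turns out to be wanted.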
-- Richie Hindle richie@entrian.com From knutsen at yahoo.com Tue Dec 3 20:35:26 2002 From: knutsen at yahoo.com (Mark Knutsen) Date: Tue, 3 Dec 2002 12:35:26 -0800 (PST) Subject: [Spambayes] Great project; please keep up the good work Message-ID: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Looks like you've got something good going on there, especially the Outlook 2000 plugin for all of us stuck in corporate world. However, I'm neither a Python nor a Windows developer (I do Perl on Linux) and only use Windows to browse the Web and read my email at work, so I'm a bit leery of the installation process at present. What are the chances of an easy, packaged install coming down the pike? ===== --Mark Knutsen (Have you visited http://tbcy.org lately?) __________________________________________________ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com From skip at pobox.com Tue Dec 3 20:57:58 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 14:57:58 -0600 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: <20021203203526.28181.qmail@web10006.mail.yahoo.com> References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Message-ID: <15853.6870.487509.485369@montanaro.dyndns.org> Mark> What are the chances of an easy, packaged install coming down the Mark> pike? It's in the cards, though I'm not sure anyone knows what the timeframe for that is. 
Skip From francois.granger at free.fr Tue Dec 3 20:59:47 2002 From: francois.granger at free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger) Date: Tue, 3 Dec 2002 21:59:47 +0100 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: <20021203203526.28181.qmail@web10006.mail.yahoo.com> References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Message-ID: At 12:35 -0800 3/12/02, in message [Spambayes] Great project; please keep up the good work, Mark Knutsen wrote: > >However, I'm neither a Python nor a Windows developer (I do Perl on Linux) >and only use Windows to browse the Web and read my email at work, so I'm a >bit leery of the installation process at present. What are the chances of an >easy, packaged install coming down the pike? Go for the pop3proxy. It is a "one size fits all" solution that works very nicely. Other people will help you install on Unix with Procmail.... -- Email is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies. For clean email: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html From mhammond at skippinet.com.au Tue Dec 3 21:08:16 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 4 Dec 2002 08:08:16 +1100 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Message-ID: > Looks like you've got something good going on there, especially > the Outlook > 2000 plugin for all of us stuck in corporate world. > > However, I'm neither a Python nor a Windows developer (I do Perl on Linux) > and only use Windows to browse the Web and read my email at work, so I'm a > bit leery of the installation process at present. What are the > chances of an > easy, packaged install coming down the pike? There is a good chance - I am working on this at the moment. Mark.
From richie at entrian.com Tue Dec 3 21:26:10 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 21:26:10 +0000 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> Message-ID: <2u7quu0t59l75012u6hjdnv85k41ihsj6u@4ax.com> [Mark Knutsen] > What are the chances of an easy, packaged install coming down the pike? [Mark Hammond] > There is a good chance - I am working on this at the moment. Fantastic! This is good news. Do you know for sure yet what this will include? Is it intended to be Outlook-specific, or will it include everything (hammie, the web interface, the POP3 proxy, etc)? Will you be shipping Python as part of the package? Would you like me to stop asking annoying questions now? 8-) -- Richie Hindle richie@entrian.com From richie at entrian.com Tue Dec 3 21:28:54 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 21:28:54 +0000 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: <2u7quu0t59l75012u6hjdnv85k41ihsj6u@4ax.com> References: <20021203203526.28181.qmail@web10006.mail.yahoo.com> <2u7quu0t59l75012u6hjdnv85k41ihsj6u@4ax.com> Message-ID: [Mark Knutsen] > What are the chances of an easy, packaged install coming down the pike? [Mark Hammond] > There is a good chance - I am working on this at the moment. [Richie Hindle] > Fantastic! This is good news. Do you know for sure yet what this will > include? Is it intended to be Outlook-specific, or will it include > everything (hammie, the web interface, the POP3 proxy, etc)? Will you be > shipping Python as part of the package? Would you like me to stop asking > annoying questions now? 8-) Will it include bsddb3? 
-- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Tue Dec 3 21:51:14 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 4 Dec 2002 08:51:14 +1100 Subject: [Spambayes] Great project; please keep up the good work In-Reply-To: Message-ID: > [Richie Hindle] > > Fantastic! This is good news. Do you know for sure yet what this will > > include? Is it intended to be Outlook-specific, or will it include > > everything (hammie, the web interface, the POP3 proxy, etc)? > > Will you be > > shipping Python as part of the package? Would you like me to > > stop asking > > annoying questions now? 8-) > > Will it include bsddb3? It will be Outlook specific - there will only be DLL files, no executables. Of course, I would be happy to apply the same technology to a general distribution. And my plan is for it to include bsddb3 ;) Mark. From francois.granger at free.fr Tue Dec 3 22:13:37 2002 From: francois.granger at free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger) Date: Tue, 3 Dec 2002 23:13:37 +0100 Subject: [Spambayes] Fwd: Re: [Pythonmac-SIG] Database engine Message-ID: >Delivered-To: online.fr-francois.granger@free.fr >Date: Tue, 3 Dec 2002 22:24:30 +0100 >Subject: Re: [Pythonmac-SIG] Database engine >Cc: MacPython >To: François Granger >From: Jack Jansen > > >On maandag, dec 2, 2002, at 18:21 Europe/Amsterdam, François Granger wrote: > >>On the Spambayes mailing list there was a discussion about the quality of >>the current bsddb database engine on Windows platforms. I was asked the >>question of how it is on Mac OS9 side. All I could say is that anydbm rely >>on gdbm. But I don't know which engine it is. >> >>Anyone knows a little on this ? > >It's the gdbm engine. This is a GNU database engine of approximately >10-12 years old. It used to be popular in its day, but nowadays most >people seem to prefer bsddb. But gdbm was reasonably easy to port to >MacOS9, and I never looked at bsddb afterwards.
>-- >- Jack Jansen >http://www.cwi.nl/~jack - >- If I can't dance I don't want to be part of your revolution -- >Emma Goldman - -- Recently using MacOSX....... From skip at pobox.com Tue Dec 3 22:26:45 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 16:26:45 -0600 Subject: [Spambayes] hammie misquote? Message-ID: <15853.12197.292027.777745@montanaro.dyndns.org> In hammie.py --help the output includes: -g PATH mbox or directory of known good messages (non-spam) to train on. Can be specified more than once, or use - for stdin. -s PATH mbox or directory of known spam messages to train on. Can be specified more than once, or use - for stdin. As far as I can tell, feeding it directories instead of mbox files doesn't actually work. The code in train() suggests this as well: def train(hammie, msgs, is_spam): """Train bayes with all messages from a mailbox.""" mbox = mboxutils.getmbox(msgs) ... which is called like so: for g in good: print "Training ham (%s):" % g train(h, g, False) save = True where good is a list containing one directory if I invoke hammie like so: BAYESCUSTOMIZE=pfx.ini python ./hammie.py -g Data/Ham/Set1 -p ./hammie.db -d Did I miss something or is this a documentation mistake? Skip From baa at encodeweb.dk Tue Dec 3 22:45:20 2002 From: baa at encodeweb.dk (=?ISO-8859-1?Q?Brian_=C5gren?=) Date: Tue, 03 Dec 2002 23:45:20 +0100 Subject: [Spambayes] language support Message-ID: <3DED3400.7070101@encodeweb.dk> Hi SpamBayes Folks. I'm from a non-English-speaking country and was considering the consequences of using Bayes-based spam filtering with my language. As I see the problem: all my spam is in English, and most of my non-spam (incl. ham) is in Danish. My wife almost never gets any non-spam in English, so if she (or I) got something in English, all the words in the email that are known would be known from spam, and it would therefore be considered "most-likely-spam". Is this the case? Which licence is being used for this project?
I'd like to use this in a webmail app I'm writing (in Java). I used Python a while back at university. Would it be better for me to write another implementation of this, or use JPython to incorporate your project into mine? - Brian Aagren From richie at entrian.com Tue Dec 3 23:27:28 2002 From: richie at entrian.com (Richie Hindle) Date: Tue, 03 Dec 2002 23:27:28 +0000 Subject: [Spambayes] Fwd: Re: [Pythonmac-SIG] Database engine In-Reply-To: References: Message-ID: <8afquukb0ljib2est5m889vfufo5iosojc@4ax.com> Hi François, > From: Jack Jansen > [...] > It's the gdbm engine. This is a GNU database engine of approximately > 10-12 years old. It used to be popular in its day, but nowadays most > people seem to prefer bsddb. But gdbm was reasonably easy to port to > MacOS9, and I never looked at bsddb afterwards. Thanks to you and Jack for the confirmation - this should mean we don't need to treat the Mac in any special way in terms of database type. Neale's recent edits should pick up gdbm, and all should be well. -- Richie Hindle richie@entrian.com From dereks at itsite.com Tue Dec 3 18:53:09 2002 From: dereks at itsite.com (Derek Simkowiak) Date: Tue, 3 Dec 2002 13:53:09 -0500 (EST) Subject: [Spambayes] New Application of SpamBayesian tech? Message-ID: Surfing Slashdot, ran across the interview at http://www.theopenenterprise.com/story/TOE20021202S0001 which is about finding jobs. I saw this part: ------------------------------------------------------- [The interviewee mentions they got 3000 resumes in a single weekend...] [Interviewer] TheOpenEnterprise: How do you handle 3000 resumes? Do you look at them all? [Interviewee] Cranston-Cuebas: In a sense, we do. But we first scan them quickly to filter out applicants without relevant skills. We create an index of all incoming resumes and search on keywords. That's why it's important for job-seekers to repeat the major skills multiple times in their resume.
Another reason is that some recruiters use applicant tracking programs that do automatic skills assessment based on keywords found in the resume, and will rank resumes based on that assessment. ------------------------------------------------------- Is anyone else seeing what I'm seeing? It seems like the SpamBayes algorithms are perfectly suited to this task... and would be far more accurate than whatever simple "keyword" tracking the current apps use. For some reason, the application of "filtering in" with SpamBayes (instead of "filtering out") never occurred to me before. Given the large number of people looking for jobs in the U.S., this seems like a good opportunity. Anyone else find this interesting? --Derek From skip at pobox.com Wed Dec 4 03:07:40 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 21:07:40 -0600 Subject: [Spambayes] New Application of SpamBayesian tech? In-Reply-To: References: Message-ID: <15853.29052.186166.31687@montanaro.dyndns.org> Derek> [Interviewer] TheOpenEnterprise: How do you handle 3000 resumes? Do you Derek> look at them all? Derek> [Interviewee] Cranston-Cuebas: In a sense, we do. But we first Derek> scan them quickly to filter out applicants without relevant Derek> skills. Derek> Is anyone else seeing what I'm seeing? Yes. It also seems to me that web page content filtering proxies (you know, keeping your kids or employees from visiting XXX websites) would be another good application of the technology. Skip From neale at woozle.org Wed Dec 4 04:31:50 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 20:31:50 -0800 Subject: [Spambayes] hammie misquote? In-Reply-To: <15853.12197.292027.777745@montanaro.dyndns.org> References: <15853.12197.292027.777745@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > BAYESCUSTOMIZE=pfx.ini python ./hammie.py -g Data/Ham/Set1 -p ./hammie.db -d > > Did I miss something or is this a documentation mistake? Ah, yes, hrm. 
Here's the problem: the mboxutils module makes some guesses about the path you give it, and for the Data sets (at least, for *my* Data sets), it guesses wrong. The fix would be to just get mboxutils to recognize your flavor of directory. Myself, I just made my data sets look like Maildirs and then everything was fine. But that's just a hack, not a solution ;) Tim S, is the Corpus module smarter about things like this? From neale at woozle.org Wed Dec 4 04:34:55 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 20:34:55 -0800 Subject: [Spambayes] New Application of SpamBayesian tech? In-Reply-To: <15853.29052.186166.31687@montanaro.dyndns.org> References: <15853.29052.186166.31687@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Yes. It also seems to me that web page content filtering proxies (you > know, keeping your kids or employees from visiting XXX websites) would > be another good application of the technology. Not to mention IDS (Intrusion Detection Systems). IANAS but I have a friend who is, and he's suggested to me a few times that it would be very interesting and possibly fruitful to apply Bayesian analysis to network security. But I think I'm going to have to pull out the probab/stats book from college before I embark on such a thing :) From neale at woozle.org Wed Dec 4 04:39:09 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 20:39:09 -0800 Subject: [Spambayes] The database question that would not die In-Reply-To: <15852.62356.862434.212872@montanaro.dyndns.org> References: <15851.23096.388509.925822@montanaro.dyndns.org> <15852.62356.862434.212872@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Has anyone considered Sleepycat's caveats about using 1.85? I think we do read and write at the same time when training, by the way. I don't really know what they mean by "concurrent read/write operations". 
Are they talking about two processes working on the same database, or do they mean one process doing both operations? But is it really worth persuing? I'd be happy to just write off 1.85, and it seems most of the windows folks are okay with that too. Neale From skip at pobox.com Wed Dec 4 04:49:10 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 22:49:10 -0600 Subject: [Spambayes] How are individual values stored in the database? Message-ID: <15853.35142.408092.547180@montanaro.dyndns.org> I thought values associated with keys in the DBDict thing were stored as little pickles. Scanning the code in dbdict.py suggests that's the case, but I'm unable to unserialize items using either cPickle or marshal: Python 2.3a0 (#6, Nov 13 2002, 19:57:35) [GCC 3.1 20020420 (prerelease)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import dbhash >>> db = dbhash.open("hammie.db") >>> db["pfxlen:5"] 'W(GA\xce\xf6\xbf-\xc0$\x89K\x00K\x07K\x00G?\x9e\xed\x19\xc5\x95y\xfdtq\x01.' >>> import cPickle as pickle >>> pickle.loads(db["pfxlen:5"]) Traceback (most recent call last): File "", line 1, in ? cPickle.UnpicklingError: invalid load key, 'W'. >>> import marshal >>> marshal.loads(db["pfxlen:5"]) Traceback (most recent call last): File "", line 1, in ? ValueError: bad marshal data I used to be able to do this (I can still do it with the hammie.db file I generated in mid-November). The file in question was created by hammie.py invocations like BAYESCUSTOMIZE=pfx.ini python ./hammie.py -g ham.mbox -p ./hammie.db -d where pfx.ini has these lines: [Tokenizer] address_headers: to cc summarize_prefixes: True (I'm trying to evaluate a new tokenizer change and want to examine raw counts for the generated tokens.) I realize WordInfo objects aren't being pickled any longer, but I thought tuples were. What have I missed? 
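One observation, going only by the bytes printed in the session above and not by a reading of dbdict.py: everything after the leading 'W' is a well-formed binary pickle, so the 'W' looks like a one-character marker sitting in front of the pickle rather than part of it. A small sketch in present-day Python 3 follows; the byte string is copied from the session, and what the individual fields mean is not stated in this message and is not assumed here.

```python
import pickle

# The value printed for db["pfxlen:5"] in the session above.
raw = (b"W(GA\xce\xf6\xbf-\xc0$\x89K\x00K\x07K\x00"
       b"G?\x9e\xed\x19\xc5\x95y\xfdtq\x01.")

# Split off the first byte; pickle.loads(raw) itself fails on it
# ("invalid load key"), but the remainder loads as an ordinary tuple.
marker, payload = raw[:1], raw[1:]
fields = pickle.loads(payload)
print(marker, fields)
```

If that reading is right, stripping the first byte before calling pickle.loads should be enough to get at the raw counts.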
Thx, Skip From skip at pobox.com Wed Dec 4 04:51:12 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue, 3 Dec 2002 22:51:12 -0600 Subject: [Spambayes] hammie misquote? In-Reply-To: References: <15853.12197.292027.777745@montanaro.dyndns.org> Message-ID: <15853.35264.832756.67640@montanaro.dyndns.org> Neale> The fix would be to just get mboxutils to recognize your flavor Neale> of directory. Myself, I just made my data sets look like Neale> Maildirs and then everything was fine. But that's just a hack, Neale> not a solution ;) Nothin' special about my directory. It's of the usual Unix variety. Its contents are the one message per file thing Tim defined for testing. What are "Maildirs"? How do they differ from Tim's thing? Skip From neale at woozle.org Wed Dec 4 05:08:38 2002 From: neale at woozle.org (Neale Pickett) Date: 03 Dec 2002 21:08:38 -0800 Subject: [Spambayes] hammie misquote? In-Reply-To: <15853.35264.832756.67640@montanaro.dyndns.org> References: <15853.12197.292027.777745@montanaro.dyndns.org> <15853.35264.832756.67640@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Nothin' special about my directory. It's of the usual Unix variety. > Its contents are the one message per file thing Tim defined for > testing. What are "Maildirs"? How do they differ from Tim's thing? Maildirs are, well, here's a picture. $HOME/Maildir/ new/ 1038978004.24787_1.gwydion cur/ 1037168130.15835_0.gwydion,S=542:2,S 1037214764.7823_0.gwydion,S=1749:2,S tmp/ So the idea here is that when the MTA is writing a new message, it does so in a new file in tmp/, one file per message. When it's done, it renames the file into the new/ directory (that's an atomic operation on just about every FS). Then when your client has read the message, it puts in in the cur/ directory. So you don't need to lock anything. It's super-de-duper for NFS-mounted mail directories, and beats mbox files on everything but indexing. Google maildir for more info. 
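The delivery step just described (write under tmp/, then rename into new/) can be sketched as follows. This is present-day Python 3 and only an illustration: `deliver` is a made-up helper, and the file name it builds is a simplified stand-in for the time.pid_seq.host pattern visible in the listing.

```python
import os
import socket
import time


def deliver(maildir, message_bytes, seq=0):
    """Write message_bytes into maildir without locking anything."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    name = "%d.%d_%d.%s" % (time.time(), os.getpid(), seq,
                            socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)
    # Write the whole message where no reader will look for it...
    with open(tmp_path, "wb") as f:
        f.write(message_bytes)
        f.flush()
        os.fsync(f.fileno())
    # ...then make it visible to readers with a single rename.
    os.rename(tmp_path, new_path)
    return new_path
```

A reader that only ever looks in new/ and cur/ never sees a half-written message, which is why no lock is needed.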
So strictly speaking, all files in a Maildir have to be named NUMBER.STRING.STRING. But our stuff just reads in every file in the directory. I made a symlink to my Set1 directory called "cur" and told it to train on Data/Ham. So it slurped in every file. An MH directory, on the other hand, doesn't have the new/, cur/ and tmp/ subdirectories; all the messages are in the same directory. And they all have to be numbers, starting at 1. The way mboxutils works currently, it first tries to read the directory as a maildir (looking for a "cur" subdirectory). Then, if "/Mail/" is in the pathname, it reads it as an MH directory. Otherwise, it treats it as a directory of text files and only reads *.txt and *.lorien (what is this?) files. So I guess we could change that last option to read everything, but it has to be that way for some reason. Anyone care to elucidate this point? Neale From Paul.Moore at atosorigin.com Wed Dec 4 09:01:00 2002 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Wed, 4 Dec 2002 09:01:00 -0000 Subject: [Spambayes] Interesting behaviour from the Outlook client Message-ID: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com> Over the past few days, I've been seeing an increase in FNs and Unsures. I initially trained on my inbox and spam folders (386 ham, 999 spam), and since then I've trained on errors only. I'm now at 391 ham and 1011 spam. Initially, I was getting no errors, and 1 or 2 unsures per day. Now, I'm starting to get at least 1 FN per day, and a slight increase in the unsure rate. It's far too early to tell, but could this be related to Tim's code to handle unbalanced training sets? As time goes on, the spam:ham ratio will increase (as FNs happen more often than FPs) and so the impact of spam clues will be lessened (by Tim's code).
I'll keep monitoring this, but my "real life" mail is definitely unbalanced (home is massively biased in favour of spam, work massively biased in favour of ham, but I pre-filter mailing lists which muddies the water badly). I dunno. Do the testing gurus round here have any idea whether this type of hypothesis could be tested in practice? Paul. From francois.granger at free.fr Wed Dec 4 09:23:22 2002 From: francois.granger at free.fr (François Granger) Date: Wed, 04 Dec 2002 10:23:22 +0100 Subject: [Spambayes] pop3proxy documentation In-Reply-To: <1038880590.998.9.camel@porsche> Message-ID: on 3/12/02 2:56, Remi Ricard at papaDoc@videotron.ca wrote: > Since my documentation is not really big I will include it in my email Nice job. I suggest that instead of modifying the Option.py file you instruct the guy to create the bayescustomize.ini file.... I added instructions for MacOS with Eudora and Entourage.... Please find only the added text below: =======================================================

For MacOS 9

Before anything else

Due to MacOS multitasking, the pop3proxy does not run very fast. On a Cube or a G4 400, I found it usable, but only just. YMMV. To handle network connections to localhost, it is easier to add a hosts file. If you don't have one already, create one with any text editor.

Its name must be "hosts". It should be located in the "Preferences" folder. Its contents should be similar to:

localhost CNAME fbgmac.intranet.teleprosoft.com
fbgmac.intranet.teleprosoft.com A 127.0.0.1

The localhost and 127.0.0.1 values must be exactly as shown. If you don't know the right value to use for fbgmac.intranet.teleprosoft.com, put anything that looks like it. It has to be exactly the same at the end of the first line and the beginning of the second line.

Once this file is created, go to the "TCP/IP" control panel. Set the user level to Administrator. Click on "Use a host file" and select this file. Save your changes.

On the Mac, you can transform a Python script into a double-clickable applet. Just drag and drop the pop3proxy.py script onto the BuildApplet application. You'll get a double-clickable pop3proxy application.

Create or modify the "bayescustomize.ini" file in the Spambayes folder.

Be sure it contains these lines:

[pop3proxy]
pop3proxy_servers: pop.videotron.ca:110,mail.ulaval.ca:110
pop3proxy_ports: 110, 6111
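As a rough illustration of how these two options line up, here is a small sketch that reads the section and pairs each remote server with its local listening port. It assumes the two lists are matched by position (which is how the account instructions in this note read) and that a missing remote port means 110; `proxy_map` is a hypothetical helper, not a function from pop3proxy.py.

```python
import configparser

SAMPLE = """\
[pop3proxy]
pop3proxy_servers: pop.videotron.ca:110,mail.ulaval.ca:110
pop3proxy_ports: 110, 6111
"""


def proxy_map(ini_text):
    """Return {local_port: (remote_host, remote_port)} for the [pop3proxy] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["pop3proxy"]

    servers = [s.strip() for s in section["pop3proxy_servers"].split(",") if s.strip()]
    ports = [int(p) for p in section["pop3proxy_ports"].split(",") if p.strip()]
    if len(servers) != len(ports):
        raise ValueError("need exactly one local port per remote server")

    mapping = {}
    for server, local_port in zip(servers, ports):
        host, _, remote_port = server.partition(":")
        # Assumption: fall back to the standard POP3 port if none is given.
        mapping[local_port] = (host, int(remote_port or 110))
    return mapping


if __name__ == "__main__":
    for local_port, (host, remote_port) in sorted(proxy_map(SAMPLE).items()):
        print(f"localhost:{local_port} -> {host}:{remote_port}")
```

Read this way, a mail client pointed at localhost port 110 reaches the videotron server, and one pointed at localhost port 6111 reaches the ulaval server.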

Configuring Entourage

Go to the Tools menu and choose Accounts.

Click on New and choose POP.

Fill in the various fields. For the POP server field, put "localhost".

For the videotron account, you are done.

For the ulaval account, in the "Advance receive option" window, click on the "Ignore the default POP port" check box and type in 6111.

Filtering with Entourage

The rule can be:

If
    Specific header: X-Spambayes-Classification Contains ham
then    
    do nothing
If
    Specific header: X-Spambayes-Classification Contains spam
then    
    Move message to folder Spam
If
    Specific header: X-Spambayes-Classification Contains unsure
then    
    Move message to folder Unsure

Configuring Eudora

In Eudora, you will be able to reach only one POP server, since you can configure only one port number for POP. On that one server, however, you can access more than one account.

Go to the Tool menu and choose Personalities.

Create a new personality with the POP server as "localhost".

With the proposed "bayescustomize.ini" you will be able to talk only to the videotron server.

Filtering with Eudora

The rule can be:

Match
    Header: X-Spambayes-Classification contains ham
Action    
    do nothing
Match
    Header: X-Spambayes-Classification contains spam
Action    
    Transfer To Spam
Match
    Header: X-Spambayes-Classification contains unsure
Action    
    Transfer To Unsure
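For mail clients without a rule editor, the same three rules can be approximated in a few lines of Python. This is only a sketch: it assumes the proxy adds an X-Spambayes-Classification header containing the word ham, spam or unsure, as the rules above do, and `target_folder` and the folder names are illustrative.

```python
from email import message_from_string

# Folder names mirror the rules above; None means "leave it in the inbox".
FOLDERS = {"spam": "Spam", "unsure": "Unsure"}


def target_folder(raw_message):
    """Return the folder a message should be moved to, or None for ham."""
    msg = message_from_string(raw_message)
    label = (msg.get("X-Spambayes-Classification") or "").lower()
    # "Contains" semantics, like the Entourage and Eudora rules.
    for keyword, folder in FOLDERS.items():
        if keyword in label:
            return folder
    return None  # ham, or no classification header at all
```

A message with no classification header is treated like ham here, which matches what the client rules do when none of them fire.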

======================================================= -- Mail is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies. For clean mail: -- From petera at intrinsica.co.uk Wed Dec 4 09:47:03 2002 From: petera at intrinsica.co.uk (Peter Arnold) Date: Wed, 4 Dec 2002 09:47:03 -0000 Subject: [Spambayes] Outlook addin: Removing the tray icon Message-ID: It's always bugged me that Outlook leaves the New Mail icon in the system tray after a rule or addin has moved or deleted all the newly arrived e-mail. I know there's no programmatic interface to remove the icon but I found some Visual Basic code at http://www.slipstick.com/dev/code/clearenvicon.htm that does the job. I've converted the three pages of VB to 5 lines of python (!) and submitted it as a patch (http://sourceforge.net/tracker/index.php?func=detail&aid=648271&group_id=61702&atid=498105). I'm a bit bamboozled where to put it in the actual addin code so I'm hoping someone more knowledgeable than me will be able to do that. I imagine it should be invoked after scanning all new e-mail in the inbox and determining that all of it was spam. Peter Arnold petera@intrinsica.co.uk _____________________________________________________________________ This e-mail has been scanned for viruses by the WorldCom Internet Managed Scanning Service - powered by MessageLabs. For further information visit http://www.worldcom.com From Alexander at Leidinger.net Wed Dec 4 10:03:23 2002 From: Alexander at Leidinger.net (Alexander Leidinger) Date: Wed, 4 Dec 2002 11:03:23 +0100 Subject: [Spambayes] hammie misquote?
In-Reply-To: References: <15853.12197.292027.777745@montanaro.dyndns.org> <15853.35264.832756.67640@montanaro.dyndns.org> Message-ID: <20021204110323.62d9a18f.Alexander@Leidinger.net> On 03 Dec 2002 21:08:38 -0800 Neale Pickett wrote: > The way mboxutils works currently, it first tries to read the directory > as a maildir (looking for a "cur" subdirectory). Then, if "/Mail/" is > in the pathname, it reads it as an MH directory. Otherwise, it treats > it as a directory of text files and only reads *.txt and *.lorien (what > is this?) files. > > So I guess we could change that last option to read everything, but it > has to be that way for some reason. Anyone care to elucidate this > point? It's the way Tim had it's directory set up (at least this is how I had understand it at the time I implemented the first version of this functionality). *.txt and *.lorien are files with one mail per file. I think *.lorien denotes a particular set of SPAM mails. IMHO we can remove the restriction, I assume it evolved from a quick hack. Bye, Alexander. -- The computer revolution is over. The computers won. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From mwh at python.net Wed Dec 4 11:49:23 2002 From: mwh at python.net (Michael Hudson) Date: 04 Dec 2002 11:49:23 +0000 Subject: [Spambayes] Re: New Application of SpamBayesian tech? References: <15853.29052.186166.31687@montanaro.dyndns.org> Message-ID: <2madjmqa0s.fsf@starship.python.net> Neale Pickett writes: > Skip Montanaro writes: > > > Yes. It also seems to me that web page content filtering proxies (you > > know, keeping your kids or employees from visiting XXX websites) would > > be another good application of the technology. > > Not to mention IDS (Intrusion Detection Systems). 
> > IANAS but I have a friend who is, and he's suggested to me a few times > that it would be very interesting and possibly fruitful to apply > Bayesian analysis to network security. But I think I'm going to have to > pull out the probab/stats book from college before I embark on such a > thing :) I have half a mind to see how it works as a replacement for gnus' adaptive scoring. A harder problem than spam filtering, I guess, but it might be interesting. Cheers, M. From jm at jmason.org Wed Dec 4 11:55:04 2002 From: jm at jmason.org (Justin Mason) Date: Wed, 04 Dec 2002 11:55:04 +0000 Subject: [Spambayes] New Application of SpamBayesian tech? In-Reply-To: Message from Skip Montanaro <15853.29052.186166.31687@montanaro.dyndns.org> Message-ID: <20021204115509.6E55716F17@jmason.org> Skip Montanaro said: > Yes. It also seems to me that web page content filtering proxies (you know, > keeping your kids or employees from visiting XXX websites) would be another > good application of the technology. BTW I'm reasonably sure I saw a patent for bayesian prob analysis applied to web filtering on the IBM database. --j. From tim at fourstonesExpressions.com Wed Dec 4 13:07:08 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 04 Dec 2002 07:07:08 -0600 Subject: [Spambayes] pop3proxy documentation In-Reply-To: Message-ID: 12/4/2002 3:23:22 AM, Fran�ois Granger wrote: >on 3/12/02 2:56, Remi Ricard at papaDoc@videotron.ca wrote: > >> Since my documentation is not really big I will include it in my email > >Nice job. > >I suggest that instead of mofifying the Option.py file you instruct the guy >to create the bayescustomize.ini file.... There is a configuration script to do this now. OptionsConfig.py. You should definitely have users use this script, rather than manually modify bayescustomize.ini. And you should definitely not instruct them to modify Options.py under any circumstances. It's just a bit too critical. 
If you accidentally screw it up, the whole system dies a horrible death. - TimS > >I added instructions for MacOS with Eudora and Entourage.... > >Please find only the added text below: > >======================================================= >


> >======================================================= > >-- >Le courrier est un moyen de communication. Les gens devraient >se poser des questions sur les implications politiques des choix (ou non >choix) de leurs outils et technologies. Pour des courriers propres : > -- > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From papaDoc at videotron.ca Wed Dec 4 15:11:43 2002 From: papaDoc at videotron.ca (papaDoc) Date: Wed, 04 Dec 2002 10:11:43 -0500 Subject: [Spambayes] pop3proxy documentation In-Reply-To: References: Message-ID: <3DEE1B2F.7010704@videotron.ca> Hi, >>I suggest that instead of mofifying the Option.py file you instruct the guy >>to create the bayescustomize.ini file.... >> >> > >There is a configuration script to do this now. OptionsConfig.py. You should >definitely have users use this script, rather than manually modify >bayescustomize.ini. And you should definitely not instruct them to modify >Options.py under any circumstances. It's just a bit too critical. If you >accidentally screw it up, the whole system dies a horrible death. > Since I don't want to kill anyone I will explain how to use OptionsConfig.py. 
papaDoc From noreply at sourceforge.net Wed Dec 4 08:59:33 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed, 04 Dec 2002 00:59:33 -0800 Subject: [Spambayes] [ spambayes-Patches-648271 ] Code to remove the New Mail icon Message-ID: Patches item #648271, was opened at 2002-12-04 08:59 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=648271&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Peter Arnold (lardladpa) Assigned to: Nobody/Anonymous (nobody) Summary: Code to remove the New Mail icon Initial Comment: It would be great if having processed the newly arrived e-mail and discovered that they were all spam the addin could remove the New Message icon from the system tray. I know there's no programmatic interface to do this but I found some VB code at http://www.slipstick.com/dev/code/clearenvicon.htm I've converted the 3 pages of VB to this small bit of python

    import win32gui
    # Locate the outlook window owning the tray icon
    hWnd = win32gui.FindWindow("rctrl_renwnd32", "")
    if hWnd != 0:
        # Send a NIM_DELETE to remove the icon
        nid = (hWnd, 0)
        win32gui.Shell_NotifyIcon(2, nid)
        # Send a WUM_RESETNOTIFICATION to the owning window
        win32gui.SendMessage(hWnd, 1031, 0, 0)

It would be super if this patch could be integrated into the outlook plugin although I'm not quite sure where in the code it would go. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=648271&group_id=61702 From wsy at merl.com Wed Dec 4 15:36:26 2002 From: wsy at merl.com (Bill Yerazunis) Date: Wed, 4 Dec 2002 10:36:26 -0500 Subject: [Spambayes] Anyone else seeing increasing error rates over time? Message-ID: <200212041536.gB4FaQP02769@localhost.localdomain> Thread-Index: AcKbc6q3qg2tOWANRlqgdMLcb73raw== Over the past few days, I've been seeing an increase in FNs and Unsures.
I initially trained on my inbox and spam folders (386 ham, 999 spam), and since then I've trained on errors only. I'm now at 391 ham and 1011 spam. Initially, I was getting no errors, and 1 or 2 unsures per day. Now, I'm starting to get at least 1 FN per day, and a slight increase in the unsure rate. It's far too early to tell, but could this be related to Tim's code to handle unbalanced training sets? As time goes on, the spam:ham ratio will increase (as FNs happen more often than FPs) and so the impact of spam clues will be lessened (by Tim's code). I'll keep monitoring this, but my "real life" mail is definitely unbalanced (home is massively biased in favour of spam, work massively biased in favour of ham, but I pre-filter mailing lists which muddies the water badly). I dunno. Do the testing gurus round here have any idea whether this type of hypothesis could be tested in practice? I'm seeing an increase in error rates as well. I'm starting to think of it as "evolution in action", that is, it's actually an indication of how fast spam mutates. The errors are new kinds of spam, or at least new topics, or in a new style, and not simple misclassifies in the classic sense. Looking at the statistics on CRM114, as of today (with the run starting Nov 1):

    Week 1 - zero errors
    Week 2 - zero errors
    Week 3 - two errors
    Week 4 - two errors
    Week 5 - four errors, and it's only Wednesday!

As of the start of week 5, I'm back to Train On Errors on-the-fly, and I'll let you know if that helps or not. It's too early to really have any assurance that this is the case, but I'll hypothesize that this shows that spam has a measurable nonzero mutation rate, and that mutation rate can be approximated by:

    Total really new spams seen = Spams seen * (e^(kT) - 1)

where T is the elapsed time in days since training stopped, and k is an empirical constant with value of roughly .0001. Paul Moore: see if this predicts your increase in errors.
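For anyone who wants to plug their own numbers into the rule of thumb above, here is a minimal sketch. It assumes "Spams seen" means the cumulative count since training stopped (daily rate times T), which is one reading of the formula; `really_new_spams` is a hypothetical helper and k = 0.0001 is just the rough value quoted.

```python
import math


def really_new_spams(spams_per_day, days, k=0.0001):
    """Estimate 'really new' spams after `days` without training.

    Implements: spams_seen * (e^(k*T) - 1), with spams_seen = rate * T.
    """
    spams_seen = spams_per_day * days
    # expm1(x) computes e^x - 1 accurately when x is tiny, as it is here.
    return spams_seen * math.expm1(k * days)


if __name__ == "__main__":
    for days in (5, 10, 20, 40):
        print(days, round(really_new_spams(100, days), 2))
```

Because both factors grow with T, the estimate grows roughly with the square of the elapsed time while k*T stays small.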
If you get 100 spams a day, and it's been 5 days since you last trained, this rule predicts 1/4 chance of a spam by the 5th day... but 4 spams by the 20th day. HUGE SCREAMING CAVEAT: This equation is pure smoke and mirrors, as I have far too little data to get an error bar that isn't the entire plotting area; a case of "torturing the data until it confesses" sufficient to warrant investigation by the Hague Tribunal. n.b.: The spams I've seen come through since the start of the November run are in general really new, and either 1)written so well that they even fool me into reading for a page or two until I figure out that they're spams (or have me laughing so hard that I keep reading anyway) or 2)written so tersely that it takes some background research to figure out that they're spams. The exception was the first occurrence of "Barnyard Teen" spam (you figure it out...) And gee, just when I thought things had settled down enough that I could sit back and make CRM114 truly 8-bit clean and wchar-safe... -Bill Yerazunis From tim.one at comcast.net Wed Dec 4 17:10:48 2002 From: tim.one at comcast.net (Tim Peters) Date: Wed, 04 Dec 2002 12:10:48 -0500 Subject: [Spambayes] Interesting behaviour from the Outlook client In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > Over the past few days, I've been seeing an increase in FNs and > Unsures. I initially trained on my inbox and spam folders (386 > ham, 999 spam), and since then I've trained on errors only. I'm > now at 391 ham and 1011 spam. Initially, I was getting no errors, > and 1 or 2 unsures per day. Now, I'm starting to get at least 1 > FN per day, and a slight increase in the unsure rate. My experiments with mistake-based training all said it was brittle, due to extreme reliance on hapaxes. That makes it more of a keyword-spotting classifier than a statistical inferencer. 
But since you've trained on only 5 ham + 12 spam since starting mistake-based training, I think this is just evidence that spam is changing. > It's far too early to tell, but could this be related to Tim's > code to handle unbalanced training sets? As time goes on, the > spam:ham ratio will increase (as FNs happen more often than FPs) > and so the impact of spam clues will be lessened (by Tim's code). This is so, and an increase in FN is an expected outcome of the imbalance adjustment, if you have more spam than ham. If you want to experiment with life without the imbalance adjustment, comment out the experimental_ham_spam_imbalance_adjustment: True line in your default_bayes_customize.ini file (in your spambayes Outlook2000 directory). That will make everything look less spammy, so an increase in FP is an expected outcome if you do this. > I'll keep monitoring this, but my "real life" mail is definitely > unbalanced (home is massively biased in favour of spam, work > massively biased in favour of ham, but I pre-filter mailing lists > which muddies the water badly). > > I dunno. Do the testing gurus round here have any idea whether > this type of hypothesis could be tested in practice? What exactly is the hypothesis? Whatever it is , it's certainly testable, but testing w/ Outlook is at best clumsy (testing is easiest if you have a stream of plain-text msgs ordered by time received; getting that out of Outlook is a series of battles). From neale at woozle.org Thu Dec 5 02:35:18 2002 From: neale at woozle.org (Neale Pickett) Date: 04 Dec 2002 18:35:18 -0800 Subject: [Spambayes] busy Message-ID: I've been pretty loud on the list recently so I figured I should let you all know that I've become quite busy with an upcoming internal release at $FIRM, so I'm not going to be very active for a little while. That means you'll all be deprived of my "kamikaze commit" style for a bit. 
But I guess reliability can be nice every now and then, if it's in small amounts Good luck on finishing your article in time, Richie! Neale From tim.one at comcast.net Thu Dec 5 04:01:11 2002 From: tim.one at comcast.net (Tim Peters) Date: Wed, 04 Dec 2002 23:01:11 -0500 Subject: [Spambayes] FW: PyCon DC 2003: Call For Papers Message-ID: In case you missed the announcement, or just need more pressure , the first PyCon is scheduled for the end of March, and Steve would *love* to get a paper on the spambayes project. This is a low-cost conference in Washington, DC, aimimg more at hackers than suits. I expect to be there, but expect my employer would kill me if I so much as mentioned this project. http://www.python.org/pycon/ -----Original Message----- From: python-list-admin@python.org [mailto:python-list-admin@python.org]On Behalf Of Steve Holden Sent: Wednesday, December 04, 2002 4:44 PM To: python-list@python.org; python-announce-list@python.org Subject: PyCon DC 2003: Call For Papers PyCon DC 2003, the first Python Community Conference, has now issued a formal call for papers, which you can read at www.python.org/pycon/cfp.html The organizing committee is interested in any and all submissions for presentations. Traditional presentation styles will doubtless be the norm, but if you would like to experiment with a different format you are encouraged to mail suggestions to pycon-interest at python dot org if you are a subscriber to that list, or to the address given at the foot of this message. Time is short, so please make sure any questions are sent in promptly. We will do our best to turn them around quickly. We look forward to seeing you at PyCon DC 2003, for which registration details should be published shortly. 
Steve Holden mailto:sholden@holdenweb.com PyCon Committee Chair pycondc-2003 at python dot org -- http://mail.python.org/mailman/listinfo/python-list From skip at pobox.com Thu Dec 5 19:40:22 2002 From: skip at pobox.com (Skip Montanaro) Date: Thu, 5 Dec 2002 13:40:22 -0600 Subject: [Spambayes] msg.get_content_type()? Message-ID: <15855.43942.935960.375535@montanaro.dyndns.org> I can't remember the relationship between the email package and Python version. Did we decide that 2.2.2 was required? I'm getting an AttributeError on one machine (running 2.2.1) complaining that a message object doesn't have get_content_type. I should be able to just drop the 2.2.2 or 2.3 email package into site-packages, right? Thx, Skip From richie at entrian.com Thu Dec 5 19:48:39 2002 From: richie at entrian.com (Richie Hindle) Date: Thu, 5 Dec 2002 19:48:39 +0000 Subject: [Spambayes] msg.get_content_type()? In-Reply-To: <15855.43942.935960.375535@montanaro.dyndns.org> References: <15855.43942.935960.375535@montanaro.dyndns.org> Message-ID: <8cdef7d942da49f3.dlg@entrian.com> [Skip] > I can't remember the relationship between the email package and Python > version. Did we decide that 2.2.2 was required? I'm getting an > AttributeError on one machine (running 2.2.1) complaining that a message > object doesn't have get_content_type. I should be able to just drop the > 2.2.2 or 2.3 email package into site-packages, right? Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or better of the email package - you can get that from Python 2.3, or download it from http://mimelib.sf.net. No released version of Python ships with it. -- Richie Hindle richie@entrian.com From tim at fourstonesExpressions.com Thu Dec 5 19:50:08 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 05 Dec 2002 13:50:08 -0600 Subject: [Spambayes] msg.get_content_type()? 
In-Reply-To: <8cdef7d942da49f3.dlg@entrian.com> Message-ID: <7552JHSNM86PFTR1Y6ZRPGRQ11U.3defadf0@riven> 12/5/2002 1:48:39 PM, "Richie Hindle" wrote: > >[Skip] >> I can't remember the relationship between the email package and Python >> version. Did we decide that 2.2.2 was required? I'm getting an >> AttributeError on one machine (running 2.2.1) complaining that a message >> object doesn't have get_content_type. I should be able to just drop the >> 2.2.2 or 2.3 email package into site-packages, right? > >Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or better of >the email package - you can get that from Python 2.3, or download it >from http://mimelib.sf.net. No released version of Python ships with >it. Argh... another external module dependency? - TimS > >-- >Richie Hindle >richie@entrian.com > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com From skip at pobox.com Thu Dec 5 19:56:46 2002 From: skip at pobox.com (Skip Montanaro) Date: Thu, 5 Dec 2002 13:56:46 -0600 Subject: [Spambayes] msg.get_content_type()? In-Reply-To: <8cdef7d942da49f3.dlg@entrian.com> References: <15855.43942.935960.375535@montanaro.dyndns.org> <8cdef7d942da49f3.dlg@entrian.com> Message-ID: <15855.44926.841650.437684@montanaro.dyndns.org> Richie> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or Richie> better of the email package - you can get that from Python 2.3, Richie> or download it from http://mimelib.sf.net. No released version Richie> of Python ships with it. Thanks. I pulled the email package from my release22maint branch. It seems to be 2.4.3. Skip From richie at entrian.com Thu Dec 5 21:08:33 2002 From: richie at entrian.com (Richie Hindle) Date: Thu, 5 Dec 2002 21:08:33 +0000 Subject: [Spambayes] msg.get_content_type()? 
In-Reply-To: <7552JHSNM86PFTR1Y6ZRPGRQ11U.3defadf0@riven> References: <7552JHSNM86PFTR1Y6ZRPGRQ11U.3defadf0@riven> Message-ID: [Richie] > you need version 2.4.3 or better of the email package [...] > No released version of Python ships with it. [Tim] > Argh... another external module dependency? This one's always been there - we've always required this version of 'email'. At least this one's pure Python, and the same on all platforms. -- Richie Hindle richie@entrian.com From mhammond at skippinet.com.au Fri Dec 6 03:21:46 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 6 Dec 2002 14:21:46 +1100 Subject: [Spambayes] Outlook addin: Removing the tray icon In-Reply-To: Message-ID: > It's always bugged me that Outlook leaves the New Mail icon in the > system tray after a rule or addin has moved or deleted all the newly > arrived e-mail. I know there's no programmatic interface to remove the > icon but I found some Visual Basic code at > http://www.slipstick.com/dev/code/clearenvicon.htm that does the job. > I've converted the three pages of VB to 5 lines of python (!) and > submitted it as a patch > (http://sourceforge.net/tracker/index.php?func=detail&aid=648271&group_i > d=61702&atid=498105). > > I'm a bit bamboozled where to put it in the actual addin code so I'm > hoping someone more knowledgeable than me will be able to do that. I > imagine it should be invoked after scanning all new e-mail in the inbox > and determining that all of it was spam. I'm not sure that this addin is the correct place for this code, or how you picture it working. One way would be that if there are no new items in the Inbox after filtering, the icon is removed. While this sounds OK on the face of it, I'm not sure how useful it will be in the real world - it certainly won't help me. I can't remember the last time my inbox had zero unread items . What I *could* see is a useful standalone addin just for this purpose. This addin would actually *replace* the Outlook item. 
This could be smarter - such as only showing up if there are new items since you last opened outlook. This would be far more useful, but beyond the scope of the SpamBayes addin. It really wouldn't be too hard, and I would be happy to help out with it - we certainly have all the tools available, including a sample that creates a new taskbar icon, etc. It would make a great sample for win32all ;) Of course, if others think that a simple bit of code in the addin would be useful for the majority, then speak up and I will squish it in somewhere... Mark. From tim.one at comcast.net Fri Dec 6 03:36:23 2002 From: tim.one at comcast.net (Tim Peters) Date: Thu, 05 Dec 2002 22:36:23 -0500 Subject: [Spambayes] Outlook addin: Removing the tray icon In-Reply-To: Message-ID: [Peter Arnold] > It's always bugged me that Outlook leaves the New Mail icon in the > system tray after a rule or addin has moved or deleted all the newly > arrived e-mail. > ... [Mark Hammond] > I'm not sure that this addin is the correct place for this code, > or how you picture it working. > > ... [how Mark pictures it working ] ... > > Of course, if others think that a simple bit of code in the addin > would be useful for the majority, then speak up and I will squish > it in somewhere... -1. We (meaning you ...) have enough work here without making this project a dumping ground for generic Outlook annoyances. A distinct Outlook addin would be fine, though. I have to say I've always ignored the New Mail icon, and could never figure out what it thought it was trying to tell me -- it seems to appear and disappear at random. If someone invested a year in figuring out what it's doing, it would be a shame if installing *this* code ruined their hard-won mental model . 
From piersh at friskit.com Fri Dec 6 03:55:14 2002 From: piersh at friskit.com (Piers Haken) Date: Thu, 5 Dec 2002 19:55:14 -0800 Subject: [Spambayes] Outlook addin: Removing the tray icon Message-ID: <9891913C5BFE87429D71E37F08210CB92C742A@zeus.sfhq.friskit.com> I agree. Also it probably wouldn't work in the case where you have unread messages that have been redirected, by inbox rules, to unwatched folders. Piers. > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net]=20 > Sent: Thursday, December 05, 2002 7:36 PM > To: Mark Hammond; Peter Arnold > Cc: spambayes@python.org > Subject: RE: [Spambayes] Outlook addin: Removing the tray icon >=20 >=20 > [Peter Arnold] > > It's always bugged me that Outlook leaves the New Mail icon in the=20 > > system tray after a rule or addin has moved or deleted all=20 > the newly=20 > > arrived e-mail. ... >=20 > [Mark Hammond] > > I'm not sure that this addin is the correct place for this code, or=20 > > how you picture it working. > > > > ... [how Mark pictures it working ] ... > > > > Of course, if others think that a simple bit of code in the addin=20 > > would be useful for the majority, then speak up and I will=20 > squish it=20 > > in somewhere... >=20 > -1. We (meaning you ...) have enough work here without=20 > making this project a dumping ground for generic Outlook=20 > annoyances. A distinct Outlook addin would be fine, though. =20 > I have to say I've always ignored the New Mail icon, and=20 > could never figure out what it thought it was trying to tell=20 > me -- it seems to appear and disappear at random. If someone=20 > invested a year in figuring out what it's doing, it would be=20 > a shame if installing *this* code ruined their hard-won=20 > mental model . 
> > _______________________________________________ > Spambayes mailing list > Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes >
From mhammond at skippinet.com.au Fri Dec 6 07:17:26 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 6 Dec 2002 18:17:26 +1100 Subject: [Spambayes] Interesting behaviour from the Outlook client In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com> Message-ID: > Over the past few days, I've been seeing an increase in FNs and > Unsures. I initially trained on my inbox and spam folders (386 > ham, 999 spam), and since then I've trained on errors only. I'm > now at 391 ham and 1011 spam. Initially, I was getting no errors, > and 1 or 2 unsures per day. Now, I'm starting to get at least 1 > FN per day, and a slight increase in the unsure rate. I think something is broken. I'm not sure what though :( I am seeing bizarre stuff that I can't explain, and don't even know how to describe reasonably :( Eg, recently I saw a clear spam scored as 3%. The spam-clues showed:
word            spamprob   #ham  #spam
...
'card-swipe'    0.123921      2      0
'cash-only'     0.123921      2      0
but still lots of obvious spam clues (ie, not everything was screwed). However, I was certain these don't appear in ham, so I did a full re-train. Then, these were correctly identified as only in spam (ie, not in ham), so the spam got a solid 100%. Interestingly I did a full retrain very recently before this. I suspect incremental retrain is broken, but I haven't looked too far - I just throw this out in speculation that there may be a more subtle bug in the training rather than the algorithm or in the options that control it. Mark.
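[Editor's sketch] The suspicion above rests on an invariant: training incrementally, including corrections, should leave the same per-token counts as a full retrain on the same messages. The toy below only illustrates that invariant. It is not SpamBayes code; the `ToyCounts` class, its methods and the sample tokens are all hypothetical stand-ins.

```python
from collections import Counter


class ToyCounts:
    """Per-token ham/spam counts; a stand-in, not the SpamBayes classifier."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        # Reverse an earlier train() call for the same message and label.
        counts = self.spam if is_spam else self.ham
        counts.subtract(set(tokens))
        # Drop tokens whose count fell to zero so both paths compare equal.
        for tok in [t for t, n in counts.items() if n <= 0]:
            del counts[tok]

    def snapshot(self):
        return dict(self.ham), dict(self.spam)


ham_msg = ["meeting", "lunch"]            # hypothetical ham tokens
spam_msg = ["card-swipe", "cash-only"]    # hypothetical spam tokens

# Incremental path: the spam is first misfiled as ham, then corrected.
incremental = ToyCounts()
incremental.train(ham_msg, is_spam=False)
incremental.train(spam_msg, is_spam=False)    # the mistake
incremental.untrain(spam_msg, is_spam=False)  # undo it
incremental.train(spam_msg, is_spam=True)     # file it correctly

# Full retrain path: every message trained once under its final label.
full = ToyCounts()
full.train(ham_msg, is_spam=False)
full.train(spam_msg, is_spam=True)

consistent = incremental.snapshot() == full.snapshot()
print("incremental matches full retrain:", consistent)
```

If a real incremental path skipped the untrain step, a token such as 'card-swipe' would keep a non-zero ham count, which is the kind of clue line shown in the message above.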
From db3l at fitlinxx.com Fri Dec 6 16:25:16 2002 From: db3l at fitlinxx.com (David Bolen) Date: 06 Dec 2002 11:25:16 -0500 Subject: [Spambayes] Re: Outlook addin: Removing the tray icon References: Message-ID: "Peter Arnold" writes: > It's always bugged me that Outlook leaves the New Mail icon in the > system tray after a rule or addin has moved or deleted all the newly > arrived e-mail. Isn't that just because the movement itself does not mark the message as read? At least on my system (others' experience may vary), the new mail icon behaves pretty consistently - if I have an unread/new message (in any of my folders on any message store) it'll show up when the first such message appears in the system. I have occasionally gotten confused when an automatic rule had moved a new message into a sub-folder that I didn't have expanded in my folder list so it was sort of a hunt to locate that new message. But once the last message is marked read the icon goes away. > I'm a bit bamboozled where to put it in the actual addin code so I'm > hoping someone more knowledgeable than me will be able to do that. I > imagine it should be invoked after scanning all new e-mail in the inbox > and determining that all of it was spam. It would seem to me to be cleaner if we just considered adding an option to the addin to mark messages it moved as read (probably two choices - one for the spam and one for the maybe spam), which should accomplish the same thing with respect to the icon but let Outlook itself take care of it following its normal rules. I know that at the moment pretty much the only thing I'm doing with the spam folder contents is selecting all the messages that were moved into it by the addin, and marking them as read (since I'm still saving them for future retrainings). So the option to mark them read automatically might be attractive.
A downside is that it would be harder to notice just which messages were recently moved there, but if you just make semi-regular scans of the folder that's probably minor. -- David From db3l at fitlinxx.com Fri Dec 6 16:34:26 2002 From: db3l at fitlinxx.com (David Bolen) Date: 06 Dec 2002 11:34:26 -0500 Subject: [Spambayes] Re: Interesting behaviour from the Outlook client References: <16E1010E4581B049ABC51D4975CEDB8861996B@UKDCX001.uk.int.atosorigin.com> Message-ID: "Mark Hammond" writes: > I think something is broken. I'm not sure what though :( > > I am seeing bizarre stuff that I can't explain, and don't even know how to > describe reasonably :( Eg, recently I saw a clear spam scored as 3%. The > spam-clues showed: For another data point, in case it's related. I've been seeing sporadic spams scoring close to 0 although they're clearly spam. If I dump the spam clues, it actually shows *S* as 1 and *H* around 1e-6 (so clearly spam) but the Spam field in the message in Outlook still shows a very low value (sometimes 0%). If all I do is leave the message as unread and re-run the filter on unread mail it will rescore it as 100% and move it to the spam folder. I'm also occasionally seeing messages fail to show a score in their Spam column. It leads me to think that somehow the wrong message is being operated on at some point and/or the result stored in the wrong location. Since bringing up the spam-clues window appears to re-score the message, it would seem as if the earlier attempt either used the wrong message for the purpose of scoring (but stored it in the spam message) or in some other way got out of sync. It doesn't seem like the scoring itself is inaccurate. There is, however, no direct one-to-one correlation between messages missing a score and these bad spam scores. 
A particularly interesting point is that in these cases, I have been unable to find the affected message listed in the trace window - in either the case where it happens as new mail when the client is already running, or as unread mail detected upon client startup. But the message has clearly had its spam field updated. -- David PS: In what is more surely an Outlook issue, has anyone else had messages put back to an unread status after they've already been read? It happens semi-frequently to me (and sometimes correlated with a failure to update the spam column in the display until I switch out of and back into the folder). I'm assuming it's a race condition between my viewing the message clearing the unread bit and something that the addin is doing setting it back, but it's tricky to isolate a regular procedure for reproducing. From popiel at wolfskeep.com Fri Dec 6 16:37:24 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri, 06 Dec 2002 08:37:24 -0800 Subject: [Spambayes] Outlook addin: Removing the tray icon In-Reply-To: Message from Tim Peters References: Message-ID: <20021206163724.83A0E2DED6@cashew.wolfskeep.com> In message: Tim Peters writes: > >I have to say I've always ignored the New Mail icon, >and could never figure out what it thought it was trying to tell me -- it >seems to appear and disappear at random. If someone invested a year in >figuring out what it's doing, it would be a shame if installing *this* code >ruined their hard-won mental model . Turning off a few options (like the mail preview pane, which is a security risk anyway) makes the behaviour much more obvious: The new mail icon appears whenever new mail arrives, and the Outlook client is running. The new mail icon disappears whenever any message which had been unread gets marked as read, or the Outlook client exits. 
Note that being shown in the preview pane is often enough to get a message marked as read, so under some options configurations merely bringing the Outlook window to the foreground when the current message is a new message (due to the folder being previously empty or the click to raise the window also selecting a new message or somesuch) is enough to (apparently inconsistently) make the new mail icon disappear. - Alex (who suffers with Outlook at work) From Paul.Moore at atosorigin.com Fri Dec 6 16:38:35 2002 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Fri, 6 Dec 2002 16:38:35 -0000 Subject: [Spambayes] Re: Interesting behaviour from the Outlook client Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E76@UKDCX001.uk.int.atosorigin.com> From: David Bolen [mailto:db3l@fitlinxx.com] > I'm also occasionally seeing messages fail to show a score in their > Spam column. It leads me to think that somehow the wrong message is > being operated on at some point and/or the result stored in the wrong > location. [...] > PS: In what is more surely an Outlook issue, has anyone else had > messages put back to an unread status after they've already been read? > It happens semi-frequently to me (and sometimes correlated with a > failure to update the spam column in the display until I switch out of > and back into the folder). I'm assuming it's a race condition between > my viewing the message clearing the unread bit and something that the > addin is doing setting it back, but it's tricky to isolate a regular > procedure for reproducing. Yes, I see both of these behaviours. I also can't find a consistent pattern, but I think you're right that it's a race condition or synchronisation problem of some sort... Paul From barry at python.org Fri Dec 6 16:54:49 2002 From: barry at python.org (Barry A. Warsaw) Date: Fri, 6 Dec 2002 11:54:49 -0500 Subject: [Spambayes] msg.get_content_type()? 
References: <15855.43942.935960.375535@montanaro.dyndns.org> <8cdef7d942da49f3.dlg@entrian.com> Message-ID: <15856.54873.377409.136435@gargle.gargle.HOWL> >>>>> "RH" == Richie Hindle writes: RH> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or RH> better of the email package - you can get that from Python RH> 2.3, or download it from http://mimelib.sf.net. No released RH> version of Python ships with it. Wrong. Python 2.2.2 (#1, Oct 15 2002, 12:24:47) [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import email >>> email.__version__ '2.4.3' >>> from email.Message import Message >>> Message.get_content_type >>> email.__file__ '/usr/local/lib/python2.2/email/__init__.pyc' -Barry From tim at fourstonesExpressions.com Fri Dec 6 16:59:13 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 06 Dec 2002 10:59:13 -0600 Subject: [Spambayes] msg.get_content_type()? In-Reply-To: <15856.54873.377409.136435@gargle.gargle.HOWL> Message-ID: 12/6/2002 10:54:49 AM, barry@python.org (Barry A. Warsaw) wrote: > >>>>>> "RH" == Richie Hindle writes: > > RH> Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or > RH> better of the email package - you can get that from Python > RH> 2.3, or download it from http://mimelib.sf.net. No released > RH> version of Python ships with it. > >Wrong. > >Python 2.2.2 (#1, Oct 15 2002, 12:24:47) >[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2 >Type "help", "copyright", "credits" or "license" for more information. >>>> import email >>>> email.__version__ >'2.4.3' >>>> from email.Message import Message >>>> Message.get_content_type > >>>> email.__file__ >'/usr/local/lib/python2.2/email/__init__.pyc' > >-Barry Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32 Type "copyright", "credits" or "license" for more information. 
IDLE 0.8 -- press F1 for help >>> import email >>> email.__version__ '2.4.3' >>> from email.Message import Message >>> Message.get_content_type >>> email.__file__ 'C:\\Program Files\\Python2.2\\lib\\email\\__init__.pyc' - TimS c'est moi - TimS www.fourstonesExpressions.com From barry at python.org Fri Dec 6 17:03:35 2002 From: barry at python.org (Barry A. Warsaw) Date: Fri, 6 Dec 2002 12:03:35 -0500 Subject: [Spambayes] msg.get_content_type()? References: <15856.54873.377409.136435@gargle.gargle.HOWL> Message-ID: <15856.55399.635079.816568@gargle.gargle.HOWL> >>>>> "TS" == Tim Stone writes: TS> Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] Specifically, Python 2.2.x where x < 2 is /not/ sufficient. -Barry From tim at fourstonesExpressions.com Fri Dec 6 21:46:26 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 06 Dec 2002 15:46:26 -0600 Subject: [Spambayes] A followup to an article on spammers... Message-ID: Totally hilarious... http://www.freep.com/money/tech/mwend6_20021206.htm c'est moi - TimS www.fourstonesExpressions.com From mhammond at skippinet.com.au Sat Dec 7 01:41:22 2002 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat, 7 Dec 2002 12:41:22 +1100 Subject: [Spambayes] RE: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: Message-ID: [Me] > > In the back of my mind, I am pondering if we need a better directory > > structure - maybe with the core engine in a package, and some of these > > "wrappers" used only by a few application also into their own? [Richie] > Isn't this also YAGNI? We have a few tens of Python files in the > project - > do we really need to split it up? And if we do, should we be > doing it with the code this young?
Yeah, I think so :) We all know young minds are easily manipulated, so getting the code while it is young is good. The main directory has 46 or so .py files in it now, which is getting too many. There are 3 clear categories:
* Main engine.
* pop3proxy application.
* Test code.
Even just making this split would be a good thing. If we can factor some commonly used "application base classes" (ie, the intent of Corpus.py etc), then these could stay in the main directory (contradicting what I said above ). I don't want to go overboard, but I think something could be done. The longer we leave it, the harder it gets. I don't have a real strong opinion, but am bringing this up because I feel it now, not simply because it offends my sensibilities Actually-running-quite-low-on-sensibilities ly, Mark. From piersh at friskit.com Sat Dec 7 02:52:54 2002 From: piersh at friskit.com (Piers Haken) Date: Fri, 6 Dec 2002 18:52:54 -0800 Subject: [Spambayes] Using spambayes with outlook XP's hotmail connector Message-ID: <9891913C5BFE87429D71E37F08210CB929751A@zeus.sfhq.friskit.com> The following patch allows spambayes to correctly filter messages on hotmail when using Outlook XP's hotmail connector. It simply ignores the exception that occurs when spambayes tries to set the 'spam' field on a message which resides on hotmail - the hotmail connector doesn't support such property changes. Piers.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.36
diff -u -r1.36 msgstore.py
--- msgstore.py 25 Nov 2002 05:57:41 -0000 1.36
+++ msgstore.py 7 Dec 2002 02:35:03 -0000
@@ -631,7 +631,10 @@
 
     def Save(self):
         assert self.dirty, "asking me to save a clean message!"
-        self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE | USE_DEFERRED_ERRORS)
+        try:
+            self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE | USE_DEFERRED_ERRORS)
+        except:
+            pass
         self.dirty = False
 
     def _DoCopyMove(self, folder, isMove):
From lists at webcrunchers.com Sat Dec 7 10:26:52 2002 From: lists at webcrunchers.com (John D.) Date: Sat, 7 Dec 2002 02:26:52 -0800 Subject: [Spambayes] All these Default Mailboxes in OpenBSD. What for? Message-ID: How can I stop mail from the outside coming into "root", yet still allow internal system mail to go to root?
# Basic system aliases -- these MUST be present
MAILER-DAEMON: postmaster
postmaster: root
# General redirections for pseudo accounts
bin: root
daemon: root
named: root
nobody: root
operator: root
uucp: root
www: root
ftp-bugs: root
popa3d: root
proxy: root
smmsp: root
sshd: root
_portmap: root
_rstatd: root
_identd: root
_rusersd: root
_fingerd: root
_x11: root
Why MUST these be present? We are getting an unusual amount of spam mail sent to these Email addresses, and want to know why these are created in the first place. John From skip at pobox.com Sat Dec 7 19:11:40 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 13:11:40 -0600 Subject: [Spambayes] You talk, it types - interesting irony... Message-ID: <15858.18412.476854.880620@montanaro.dyndns.org> I find it mildly interesting that spammers are hawking Dragon Systems' Naturally Speaking, while Tim Peters, a former Dragon rocket scientist, has been actively working on a tool to thwart such hawking. ;-) Skip From skip at pobox.com Sat Dec 7 20:12:18 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 14:12:18 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? Message-ID: <15858.22050.915577.31246@montanaro.dyndns.org> I just realized I fed the wrong mbox file to hammie.
After poking around a bit I came up with this untrain sequence:
    mbox = mboxutils.getmbox("/Users/skip/tmp/newham")
    h = hammie.open("hammie.db", mode='w')
    for msg in mbox:
        h.untrain_ham(msg)
    h.store()
It seemed to work, but is that the right way to untrain an mbox file? Skip From tim at fourstonesExpressions.com Sat Dec 7 20:13:26 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat, 07 Dec 2002 14:13:26 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: <15858.22050.915577.31246@montanaro.dyndns.org> Message-ID: 12/7/2002 2:12:18 PM, Skip Montanaro wrote: > >I just realized I fed the wrong mbox file to hammie. After poking around a >bit I came up with this untrain sequence:
>    mbox = mboxutils.getmbox("/Users/skip/tmp/newham")
>    h = hammie.open("hammie.db", mode='w')
>    for msg in mbox:
>        h.untrain_ham(msg)
>    h.store()
>It seemed to work, but is that the right way to untrain an mbox file? It is if you originally trained as ham... - TimS > >Skip c'est moi - TimS www.fourstonesExpressions.com From skip at pobox.com Sat Dec 7 21:38:10 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 15:38:10 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: References: <15858.22050.915577.31246@montanaro.dyndns.org> Message-ID: <15858.27202.880939.275746@montanaro.dyndns.org> >> It seemed to work, but is that the right way to untrain an mbox file? Tim> It is if you originally trained as ham... - TimS Thanks, yes, I did. It wasn't that I was supposed to train as spam, but that I fed it the unclean ham mbox (still had SpamAssassin and VM headers).
Skip From tim.one at comcast.net Sat Dec 7 21:41:46 2002 From: tim.one at comcast.net (Tim Peters) Date: Sat, 07 Dec 2002 16:41:46 -0500 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: <15858.27202.880939.275746@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > It wasn't that I was supposed to train as spam, but that I fed it the > unclean ham mbox (still had SpamAssassin and VM headers). Skip, unless you've changed the defaults, such headers are ignored. From skip at pobox.com Sat Dec 7 21:59:34 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 15:59:34 -0600 Subject: [Spambayes] Is this the right way to untrain a group of messages? In-Reply-To: References: <15858.27202.880939.275746@montanaro.dyndns.org> Message-ID: <15858.28486.462125.381180@montanaro.dyndns.org> Skip> It wasn't that I was supposed to train as spam, but that I fed it Skip> the unclean ham mbox (still had SpamAssassin and VM headers). Tim> Skip, unless you've changed the defaults, such headers are ignored. Yeah, I realize that, however, by default SA stuffs a block of scoring information at the top of spam message bodies, so I needed to run unheader.py to get rid of that. As long as I was at it, I figured I might as well get rid of the VM-related headers. On the off-chance that I ever visit that mbox in VM the offsets in those headers would all be bogus. Skip From skip at pobox.com Sun Dec 8 03:00:01 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 21:00:01 -0600 Subject: [Spambayes] using binary pickles makes for much smaller databases Message-ID: <15858.46513.333137.130764@montanaro.dyndns.org> I was messing around with various things today. One thing I tried is to modify Python's shelve.py and Spambayes' storage.py to allow and use binary pickles. 
Before: -rw-rw-r-- 1 skip staff 20914176 Dec 7 18:20 hammie.db After: -rw-rw-r-- 1 skip staff 10874880 Dec 7 18:32 hammie.db In both cases I trained 13144 hams and 6662 spams starting with no hammie.db file. The databases each wound up with 324310 keys. The times seemed about the same: 324.66user+62.30sys for the ascii version and 322.89user+60.61sys for the binary version. The wall clock times weren't comparable because I was doing other things as they ran. Attached are diffs for Python's Lib/shelve.py and Spambayes' storage.py. I believe they should both be backward compatible though I haven't tested it. Let me know if you think they are reasonable changes. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: shelve.diff Type: application/octet-stream Size: 1342 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021207/4d69f2f0/shelve.exe -------------- next part -------------- A non-text attachment was scrubbed... Name: storage.diff Type: application/octet-stream Size: 1078 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021207/4d69f2f0/storage.exe From skip at pobox.com Sun Dec 8 04:27:55 2002 From: skip at pobox.com (Skip Montanaro) Date: Sat, 7 Dec 2002 22:27:55 -0600 Subject: [Spambayes] More messin' around - common email prefixes Message-ID: <15858.51787.475757.151843@montanaro.dyndns.org> I modified the tokenizer to generate tokens related to common prefixes in email addresses. One observation several people have made is that some spammers send out email to clumps of alphabetically similar addresses. One spam I received recently was sent to To: Cc: , , , , I fooled around a bit generating tokens that take into account the length of the common prefix and the number of recipients. I generate tokens that are the product of the length of the common prefix and the number of recipients divided by 10. 
In the above case I score it a '4' ((6 * 7) // 10). I only generate the token if there is more than one recipient and a non-zero common prefix. Here's the distribution of tokens in my database (13144 hams, 6662 spams):
('pfxlen:0', (18, 209))
('pfxlen:1', (48, 32))
('pfxlen:2', (42, 10))
('pfxlen:3', (24, 2))
('pfxlen:4', (23, 0))
('pfxlen:5', (16, 0))
('pfxlen:6', (16, 0))
('pfxlen:7', (11, 0))
('pfxlen:8', (6, 0))
('pfxlen:9', (4, 0))
('pfxlen:10', (5, 0))
('pfxlen:11', (1, 0))
('pfxlen:12', (1, 0))
('pfxlen:14', (1, 0))
('pfxlen:17', (1, 0))
('pfxlen:18', (1, 0))
('pfxlen:19', (1, 0))
('pfxlen:24', (1, 0))
('pfxlen:28', (1, 0))
Not too surprisingly, higher scores are more strongly associated with spam than with ham. This distribution suggests to me that perhaps I should squash that to two distinct tokens, one for scores of 0 or 1, and one for all higher scores. I'll try that out in a bit. Skip From richie at entrian.com Sun Dec 8 15:24:47 2002 From: richie at entrian.com (Richie Hindle) Date: Sun, 08 Dec 2002 15:24:47 +0000 Subject: [Spambayes] msg.get_content_type()? In-Reply-To: <15856.54873.377409.136435@gargle.gargle.HOWL> References: <15855.43942.935960.375535@montanaro.dyndns.org> <8cdef7d942da49f3.dlg@entrian.com> <15856.54873.377409.136435@gargle.gargle.HOWL> Message-ID: <36h6vucleapmke20g0muvc963pit13g0j6@4ax.com> [Richie] > Any Python 2.2.x or 2.3 is fine, but you need version 2.4.3 or > better of the email package - you can get that from Python > 2.3, or download it from http://mimelib.sf.net. No released > version of Python ships with it. [Barry] > Wrong. > > Python 2.2.2 (#1, Oct 15 2002, 12:24:47) > [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import email > >>> email.__version__ > '2.4.3' Oops, my mistake. That's good news.
-- Richie Hindle richie@entrian.com From richie at entrian.com Sun Dec 8 15:25:06 2002 From: richie at entrian.com (Richie Hindle) Date: Sun, 08 Dec 2002 15:25:06 +0000 Subject: [Spambayes] using binary pickles makes for much smaller databases In-Reply-To: <15858.46513.333137.130764@montanaro.dyndns.org> References: <15858.46513.333137.130764@montanaro.dyndns.org> Message-ID: > modify Python's shelve.py and Spambayes' storage.py to allow and use binary > pickles. Good plan! (I had no idea that shelve used text pickles.) Below is an alternative implementation that avoids the need to change shelve.py, though it's a slight hack in that a future version of shelve could potentially break it by not keeping its pickler in a module global called 'Pickler'. This goes at the top of storage.py:
---------------------------------------------------------------------------
# Make shelve use binary pickles by default.
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
    return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler
---------------------------------------------------------------------------
This gives me 335,872 bytes in 21 seconds vs. 679,936 bytes in 26 seconds. These are wall-clock times on an otherwise-idle Win98 box for training on 200 messages. This is backwards-compatible too - I can still use my existing database with no problems. Can anyone see a problem with this code (or is anyone offended by grubbing around with shelve.Pickler)? What if one of the DBMs supported by anydbm doesn't support values with embedded NULL characters for instance? (Seems unlikely.) Skip, your patch to shelve.py looks like a good candidate for inclusion into Python itself, assuming there really is no problem using binary pickles via shelve/anydbm.
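[Editor's sketch] The size difference discussed above can be seen without touching shelve internals. The snippet below is an illustration under stated assumptions, not Skip's patch or the hack quoted above: it targets a current Python 3, where `shelve.open()` accepts a `protocol` argument, and the `records` dictionary is made-up sample data standing in for a word database.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical sample data: token -> pair of small counts.
records = {"token%05d" % i: (i % 7, i % 11) for i in range(2000)}

# Protocol 0 is the text format; protocol 2 is a binary format.
text_size = sum(len(pickle.dumps(v, protocol=0)) for v in records.values())
binary_size = sum(len(pickle.dumps(v, protocol=2)) for v in records.values())
print("text pickles:  ", text_size, "bytes")
print("binary pickles:", binary_size, "bytes")

# Write the shelf with binary pickles, then reopen it with default settings
# to check that the values still load.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.db")
    with shelve.open(path, protocol=2) as db:
        db.update(records)
    with shelve.open(path) as db:
        roundtrip_ok = db["token00042"] == records["token00042"]
print("round trip ok:", roundtrip_ok)
```

The saving on real data depends on what the values look like, so the figures printed here say nothing about the database sizes quoted in the thread.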
-- Richie Hindle richie@entrian.com From richie at entrian.com Sun Dec 8 15:25:19 2002 From: richie at entrian.com (Richie Hindle) Date: Sun, 08 Dec 2002 15:25:19 +0000 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: References: Message-ID: [Mark] > The main directory has 46 or so .py files in it now, which is getting too > many. You're probably right - now that I look at it again, it is getting a bit crowded in there. > There are 3 clear categories: > > * Main engine. > * pop3proxy application. > * Test code. There are also the command-line applications: hammie*.py and mboxtrain.py. I think they're in a different category from the "main engine" code. > Even just making this split would be a good thing. If we can factor some > commonly used "application base classes" (ie, the intent of Corpus.py etc), > then these could stay in the main directory (contradicting what I said above > ). I don't want to go overboard, but I think something could be done. I think you're right, but I also think it itches you more than it itches me (must be the hot Aussie summer - it's freezing in the UK). One possible issue: will we lose CVS history by moving files about? Does SourceForge give us the ability to move a file and its CVS history together? -- Richie Hindle richie@entrian.com From popiel at wolfskeep.com Sun Dec 8 16:47:36 2002 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sun, 08 Dec 2002 08:47:36 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: Message from Richie Hindle References: Message-ID: <20021208164736.15A292DED1@cashew.wolfskeep.com> In message: Richie Hindle writes: > >[Mark] >> The main directory has 46 or so .py files in it now, which is getting too >> many. > >You're probably right - now that I look at it again, it is getting a bit >crowded in there. > >> There are 3 clear categories: >> >> * Main engine. 
>> * pop3proxy application. >> * Test code. > >There are also the command-line applications: hammie*.py and mboxtrain.py. >I think they're in a different category from the "main engine" code. I agree that the breakup is a good thing, and that hammie* and friends should be in their own category. As a python newbie, I wonder if breaking it up will complicate invocation, though... will the spambayes core stuff have to be placed someplace special (on a module search path?) to be found by the various front-ends in their subdirectories? - Alex From Alexander at Leidinger.net Sun Dec 8 18:56:41 2002 From: Alexander at Leidinger.net (Alexander Leidinger) Date: Sun, 8 Dec 2002 19:56:41 +0100 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes FileCorpus.py,1.8,1.9Corpus.py,1.5,1.6 In-Reply-To: References: Message-ID: <20021208195641.27ba9bba.Alexander@Leidinger.net> On Sun, 08 Dec 2002 15:25:19 +0000 Richie Hindle wrote: > One possible issue: will we lose CVS history by moving files about? Does > SourceForge give us the ability to move a file and its CVS history > together? Removing a file puts it into the attic (a special directory in the CVS repository). You can still get it from there. CVS itself doesn't have a "move" command. If you have shell access to the CVS repository, you can copy the xxx,v files to the new location and "cvs remove" them in the old location (directly moving it in the repository is not an option, because you can't go back to an old version then). Bye, Alexander. -- Actually, Microsoft is sort of a mixture between the Borg and the Ferengi.
http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From noreply at sourceforge.net Sun Dec 8 18:39:39 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Sun, 08 Dec 2002 10:39:39 -0800 Subject: [Spambayes] [ spambayes-Bugs-650496 ] hammie.py discards headers Message-ID: Bugs item #650496, was opened at 2002-12-08 18:39 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Simon Baatz (bnomis26) Assigned to: Nobody/Anonymous (nobody) Summary: hammie.py discards headers Initial Comment: When feeding the (malformed) attached mail to hammie.py in filter mode, the headers of the mail are not present in the output. Command line:
python hammie.py -f -d -p ~/mail/hammie.db < msg.lAoM
Output:
X-Spambayes-Classification: ham; 0.00
--Amazon.com_multipart_boundary____________
Content-Type: text/plain; charset=iso-8859-1
Vielen Dank für Ihre Bestellung bei Amazon.de.
--Amazon.com_multipart_boundary____________
Content-Type: text/html; charset=iso-8859-1
--Amazon.com_multipart_boundary____________--
----------------------------------------------------------------------
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702 From Todd.Miller at courtesan.com Mon Dec 9 15:17:52 2002 From: Todd.Miller at courtesan.com (Todd C. Miller) Date: Mon Dec 9 17:34:04 2002 Subject: [Spambayes] Re: All these Default Mailboxes in OpenBSD. What for? In-Reply-To: Your message of "Sat, 07 Dec 2002 02:26:52 PST." References: Message-ID: <200212092217.gB9MHqkL000956@xerxes.courtesan.com> In message so spake "John D."
(lists): > # General redirections for pseudo accounts > bin: root > daemon: root > named: root > nobody: root > operator: root > uucp: root > www: root > ftp-bugs: root > popa3d: root > proxy: root > smmsp: root > sshd: root > _portmap: root > _rstatd: root > _identd: root > _rusersd: root > _fingerd: root > _x11: root > > Why MUST these be present? > We are getting an unusual amount of spam mail sent to these Email addresses, > and want to know why these are created in the first place. These are all pseudo-users. They either own files in the filesystem or act as unprivileged users that certain commands run as. If they are not aliases to something, then any mail that happens to come for them will just end up in /var/mail, which could eventually fill up /var. There's no reason they have to point to anything real, though. They could just go to /dev/null if you want (though operator is often used as a real user and some people mail www instead of webmaster). - todd From skip at pobox.com Mon Dec 9 22:51:19 2002 From: skip at pobox.com (Skip Montanaro) Date: Mon Dec 9 23:51:10 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? Message-ID: <15861.29383.516933.165140@montanaro.dyndns.org> I just tried running pop3proxy as
    python pop3proxy.py -t -b -p hammie.db -d
and fetched the two sample messages it has. They come through just fine but don't appear to be scored. Is that a property of the test mode or will I run into that problem when grabbing mail from a real server as well? Version is recent CVS - updated earlier this evening. Thx, Skip From skip at pobox.com Mon Dec 9 23:06:58 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 10 00:06:48 2002 Subject: [Spambayes] No X-* headers inserted by pop3proxy? In-Reply-To: <15861.29383.516933.165140@montanaro.dyndns.org> References: <15861.29383.516933.165140@montanaro.dyndns.org> Message-ID: <15861.30322.597106.427808@montanaro.dyndns.org> Skip> ... and fetch the two sample messages [pop3proxy] has.
They come
    Skip> through just fine but don't appear to be scored.

I am seeing the same behavior from hammie.py. I must be muffing something.
Here's my .ini file:

    [Hammie]
    hammie_debug_header: True

    [Tokenizer]
    address_headers: from to cc
    generate_recipients: true

Skip

From tim at fourstonesExpressions.com Tue Dec 10 00:06:43 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Dec 10 01:07:24 2002
Subject: [Spambayes] No X-* headers inserted by pop3proxy?
In-Reply-To: <15861.29383.516933.165140@montanaro.dyndns.org>
Message-ID: <83XWGCWQHDLFXVBAS4ZDCGEIDLDD0.3df58473@riven>

No, Skip, there appears to be something wrong... I don't have time to look
at it tonight, but this doesn't sound right. - TimS

12/9/2002 10:51:19 PM, Skip Montanaro wrote:

>I just tried running pop3proxy as
>
> python pop3proxy.py -t -b -p hammie.db -d
>
>and fetch the two sample messages it has. They come through just fine but
>don't appear to be scored. Is that a property of the test mode or will I
>run into that problem when grabbing mail from a real server as well?
>
>Version is recent CVS - updated earlier this evening.
>
>Thx,
>
>Skip
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com

From richie at entrian.com Tue Dec 10 13:10:16 2002
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 10 08:10:21 2002
Subject: [Spambayes] No X-* headers inserted by pop3proxy?
In-Reply-To: <15861.29383.516933.165140@montanaro.dyndns.org>
References: <15861.29383.516933.165140@montanaro.dyndns.org>
Message-ID: <3dpbvuo01o5pnce5ejo3bj8gl52a0ru89l@4ax.com>

[Skip]
> I just tried running pop3proxy as
>
> python pop3proxy.py -t -b -p hammie.db -d
>
> and fetch the two sample messages it has. They come through just fine but
> don't appear to be scored.
Is that a property of the test mode or will I
> run into that problem when grabbing mail from a real server as well?

That's a property of test mode. "pop3proxy -t" runs a test server that
serves up two unscored messages. I'm surprised it even accepts the other
switches with -t. I should really put the test server in a different
source file - if you want to use the proxy to score the test messages, you
should run one "pop3proxy -t" and one "pop3proxy " - the first will run
the test server and the second will run the proxy itself.

I don't know why hammie isn't scoring things, but the test POP3 server is
a red herring.

-- 
Richie Hindle
richie@entrian.com

From skip at pobox.com Tue Dec 10 08:16:24 2002
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 10 09:16:21 2002
Subject: [Spambayes] No X-* headers inserted by pop3proxy?
In-Reply-To: <15861.30322.597106.427808@montanaro.dyndns.org>
References: <15861.29383.516933.165140@montanaro.dyndns.org> <15861.30322.597106.427808@montanaro.dyndns.org>
Message-ID: <15861.63288.43031.18877@montanaro.dyndns.org>

    Skip> ... and fetch the two sample messages [pop3proxy] has. They come
    Skip> through just fine but don't appear to be scored.

    Skip> I am seeing the same behavior from hammie.py. I must be muffing
    Skip> something.

Aside from the bogus .ini file there were several new modules (at least
Corpus, storage and dbmstorage) which weren't being installed. I think I'm
all set now. Sorry for the false alarm(s).

Skip

From bkc at murkworks.com Tue Dec 10 09:57:22 2002
From: bkc at murkworks.com (Brad Clements)
Date: Tue Dec 10 09:51:53 2002
Subject: [Spambayes] Anyone find this spam interesting?
Message-ID: <3DF5B90A.3986.1EC6AF74@localhost>

Just got this, it's .. hmm.

First, some relevant headers:

To:
Subject: {%CRAND2%}Need extra Cash? - Get Paid in 48 HRS!
- Home Reps Needed{%CRAND1%} Date: Tue, 10 Dec 2002 08:53:30 +0700 MiME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_00D6_55E22C0A.B3647D70" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: AOL 7.0 for Windows US sub 118 Importance: Normal ------=_NextPart_000_00D6_55E22C0A.B3647D70 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: base64 eyVDUkFORDYlfQ0KDQpJbW1lZGlhdGUgSGVscCBOZWVkZWQuICBXZSBhcmUg YSAuY29tIGNvcnBvcmF0aW9uIHRoYXQgaXMgZ3Jvd2luZyBhdCBhIHRyZW1l Here's the text of the base64 decoded portion: {%CRAND6%} Immediate Help Needed. We are a .com corporation that is growing at a tremendous rate of over 1000% per year. We simply cannot keep up. We are looking for motivated individuals who are looking to earn a substantial income working from home. This is a real world opportunity to make an excellent income from home. No experience is required. We will provide you with any training you may need. We are looking for energetic and self motivated people. If that is you, then click on the link below and complete our online information request form, and one of our employment specialist will contact you. http://www.digitalcraftsmanship.com/pg.htm So if you are looking to be employed at home, with a career that will provide you vast opportunities and a substantial income, please fill out our online information request form here now: http://www.digitalcraftsmanship.com/pg.htm Take a Look {%CRAND8%} ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ Your email address was obtained from an opt-in list. If you wish to be deleted from this list, please click on the following link: http://www.digitalcraftsmanship.com/remove/remove.html and you will be removed from the list. If you have previously dealt with this matter and are still receiving this message, you may call our Abuse Control Center at 1-866-667-5398, or write us at: NOUCE1, 6822 22nd Ave. N., St. Petersburg, FL 33710-3918. 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~

{%CRAND9%}

Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From noreply at sourceforge.net Tue Dec 10 02:42:06 2002
From: noreply at sourceforge.net (noreply@sourceforge.net)
Date: Tue Dec 10 10:14:08 2002
Subject: [Spambayes] [ spambayes-Bugs-651365 ] getattr recursion in Corpus.py
Message-ID:

Bugs item #651365, was opened at 2002-12-10 11:42
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651365&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: getattr recursion in Corpus.py

Initial Comment:
After feeding a bunch of new messages into pop3proxy, classifying them,
and then trying to save the result, I got a recursion loop (followed by
recursion depth exceeded) in

\cvshome\spambayes\Corpus.py|__getattr__|269]

After looking into setSubstance, I noticed that setSubstance (called by
load) only sets the attributes payload and hdrtxt when the pattern
matches. I temporarily added an else clause to bmatch, i.e.

        if bmatch:
            self.payload = bmatch.group(2)
            self.hdrtxt = sub[:bmatch.start(2)]
            print ".",
        else:
            self.payload = "nix\r\n"
            self.hdrtxt="nix\r\n"
            print "?", len(sub),

and indeed, when trying to save, I notice that after about 800 good
messages, ~ 100 have an empty message, see the output below.

I don't really know what I'm doing here, but this fix at least allows me
to continue.

-------------------------
C:\archiv\cvshome\spambayes>python -u pop3proxy.py -l 8110 mail.gmd.de
Loading database... Done.
Listener on port 8110 is proxying mail:110
User interface url is http://localhost:8880
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[output condensed: a long run of "." markers, one per message whose pattern matched]
? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0
[output condensed: a long run of "? 0" markers, one per message whose pattern did not match, each with length 0]
. . . . . . . . . . . . . . .
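[Editorial sketch, not part of the original report.] The failure mode described above can be reproduced in miniature. What follows is a minimal Python 3 sketch and not the actual Corpus.py code: the class names `Message` and `SafeMessage`, the `_raw` field, and the header/body split done with `str.partition` are all assumptions made for illustration. It assumes a `__getattr__` that lazily calls `load()` and then retries the lookup, which never terminates when `load()` leaves the attribute unset.

```python
class Message:
    """Hypothetical stand-in for a lazily loaded corpus message."""

    def __init__(self, raw):
        self._raw = raw

    def load(self):
        # Split headers from body at the first blank line (an assumed rule).
        head, sep, body = self._raw.partition("\r\n\r\n")
        if sep:
            self.hdrtxt = head + sep
            self.payload = body
        # No else branch: on a non-match neither attribute is ever set.

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name in ("hdrtxt", "payload"):
            self.load()
            # If load() did not set the attribute, this lookup fails again
            # and re-enters __getattr__ until the recursion limit is hit.
            return getattr(self, name)
        raise AttributeError(name)


class SafeMessage(Message):
    """Same idea as the workaround in the report: always set both attributes."""

    def load(self):
        head, sep, body = self._raw.partition("\r\n\r\n")
        if sep:
            self.hdrtxt = head + sep
            self.payload = body
        else:
            # Placeholder values so that lookups terminate.
            self.hdrtxt = ""
            self.payload = ""
```

Reading `.payload` on a `Message` built from text with no blank line raises `RecursionError`, while the same text wrapped in `SafeMessage` yields an empty payload. Whether empty placeholders are the right fix for the real corpus code is a separate question; the sketch only shows why the lookup loops.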
-----------------------
Initial traceback:

error: uncaptured python exception, closing channel <__main__.UserInterface connected at 0x2213470> (exceptions.RuntimeError:maximum recursion depth exceeded
[C:\Python22\lib\asyncore.py|poll|95]
[C:\Python22\lib\asyncore.py|handle_read_event|392]
[C:\Python22\lib\asynchat.py|handle_read|112]
[C:\archiv\cvshome\spambayes\pop3proxy.py|found_terminator|804]
[C:\archiv\cvshome\spambayes\pop3proxy.py|onRequest|830]
[C:\archiv\cvshome\spambayes\pop3proxy.py|onReview|1093]
[C:\archiv\cvs\spambayes\Corpus.py|takeMessage|188]
[C:\archiv\cvs\spambayes\FileCorpus.py|addMessage|140]
[C:\archiv\cvs\spambayes\FileCorpus.py|store|231]
[C:\archiv\cvs\spambayes\Corpus.py|getSubstance|318]
[C:\archiv\cvs\spambayes\Corpus.py|__getattr__|269]
[C:\archiv\cvs\spambayes\Corpus.py|__getattr__|269]
[C:\archiv\cvs\spambayes\Corpus.py|__getattr__|269]
[C:\archiv\cvs\spambayes\Corpus.py|__getat

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651365&group_id=61702

From grobinson at transpose.com Tue Dec 10 10:15:59 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Tue Dec 10 10:16:05 2002
Subject: [Spambayes] A grassroots auto whitelist
Message-ID:

I had a wild idea that I'd like to bounce off the readers of SpamBayes and
Spamflt.

There is a company, Habeas, attempting to leverage trademark and copyright
law to get people to insert a trademarked/copyrighted haiku into their
emails. Then when spammers try it, they can be sued for
trademark/copyright infringement (more here:
http://radio.weblogs.com/0101454/categories/spam/2002/12/09.html#a200).
It's free for individuals to use the haiku, but corporations have to pay.
I was thinking today that it could be interesting to do something less
legalistic that would still probably have a certain practical
effectiveness (particularly since I am not thrilled about including their
silly haiku in all my emails).

Suppose we all started putting some character string like ImNotSpam! in our
emails. Then spam filters would come to associate it with non-spam, and
emails with that word in them would have a better chance of getting through.

Of course, spammers could do the same thing, but we could trademark it and,
if anybody infringes, give lawyers the chance to sue in return for all (or
most) of the damages as their contingency fee.

But more effectively, instead of basing it on a simple character string like
ImNotSpam, it could be that we base it on the URL of some antispam resource
like Paul Graham's http://www.paulgraham.com/antispam.html or the wiki I
started recently, http://spamland.org. (A URL with such a name can also be a
registered trademark.)

Of course, spammers could include such a URL, but wouldn't want to if it
pointed to a potent source of antispam info such as a list of spam
filtering products, and the trademark issue would be an additional danger.

So, some balance would be achieved over time. If it got to be a popular
tool for legitimate individuals to get their emails through, some spammers
would use it, but if they did so on too large a scale they would be in
danger of being sued and if a URL were used, they would also be informing
people about how to deal with spam (the URL could list antispam products
etc. as Paul's site and spamland do).

A URL would also explain what the effort was about and exhort people who
come to the page to also start using the token in their emails, so it
would be viral.

I'd be very interested in any thoughts on this. If readers of these lists
wanted to try including such a URL in their emails, we could get the
grassroots efforts started.
--Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim.one at comcast.net Tue Dec 10 11:05:12 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 11:09:25 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: <3DF5B90A.3986.1EC6AF74@localhost> Message-ID: [Brad Clements] > Just got this, its .. hmm. Brad, what *might* be interesting about it? It looked like vanilla work-at-home spam. From bkc at murkworks.com Tue Dec 10 11:17:20 2002 From: bkc at murkworks.com (Brad Clements) Date: Tue Dec 10 11:12:03 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: References: <3DF5B90A.3986.1EC6AF74@localhost> Message-ID: <3DF5CBC7.11245.1F0FE5CB@localhost> > [Brad Clements] > > Just got this, its .. hmm. > > Brad, what *might* be interesting about it? It looked like vanilla > work-at-home spam. > I'm a butterscotch fan myself .. -- Uh ok, are we decoding base64 text attachments now, or tossing them? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one at comcast.net Tue Dec 10 11:16:56 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 11:21:50 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: <3DF5CBC7.11245.1F0FE5CB@localhost> Message-ID: [Brad Clements] > Uh ok, are we decoding base64 text attachments now, or tossing them? We've been decoding them almost since The Beginning. Quoted-printable too. We aren't decoding embedded uuencoded sections, though. From bkc at murkworks.com Tue Dec 10 11:31:08 2002 From: bkc at murkworks.com (Brad Clements) Date: Tue Dec 10 11:26:02 2002 Subject: [Spambayes] Anyone find this spam interesting? 
In-Reply-To:
References: <3DF5CBC7.11245.1F0FE5CB@localhost>
Message-ID: <3DF5CF04.12768.1F1C88FE@localhost>

> [Brad Clements]
> > Uh ok, are we decoding base64 text attachments now, or tossing them?
>
> We've been decoding them almost since The Beginning. Quoted-printable too.
> We aren't decoding embedded uuencoded sections, though.
>

(Church lady from SNL)

"Oh, nevermind".

Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From popiel at wolfskeep.com Tue Dec 10 08:34:15 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 10 11:31:41 2002
Subject: [Spambayes] A grassroots auto whitelist
In-Reply-To: Message from Gary Robinson
References:
Message-ID: <20021210163415.EAB9A2DED2@cashew.wolfskeep.com>

In message:
Gary Robinson writes:
>Suppose we all started putting some character string like ImNotSpam! in our
>emails. Then spam filters would come to associate it with non-spam, and
>emails with that word in them would have a better chance of getting through.

>But more effectively, instead of basing it on a simple character string like
>ImNotSpam, it could be that we base it on the URL of some antispam resource
>like Paul Graham's http://www.paulgraham.com/antispam.html or the wiki I
>started recently, http://spamland.org. (A URL with such a name can also be a
>registered trademark.)

The largest problem I see with such (other than the fact that I despise
canned text in my emails, hence no .sig) is that the URLs in question
become prime targets for domain-name-grabbing or other nefarious trickery.

I'm much happier to leave the selection of strong ham clues to less overt
means.

- Alex

From grobinson at transpose.com Tue Dec 10 13:54:16 2002
From: grobinson at transpose.com (Gary Robinson)
Date: Tue Dec 10 13:54:27 2002
Subject: [Spambayes] What the heck
Message-ID:

I couldn't resist. See my sig below. At least it will be an interesting
experiment.
(I didn't include the trademark/copyright mechanism because I don't want to run up against Habeas' pending patent. And I think it may not be necessary in any case.) So as not to take up more bandwidth on these lists I won't say anything more about this unless somebody has a comment or question they want me to respond to. Bottom line, I'm up for trying things and seeing what happens. --Gary -- http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim.one at comcast.net Tue Dec 10 14:42:46 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 14:46:32 2002 Subject: [Spambayes] Anyone find this spam interesting? In-Reply-To: <3DF5CF04.12768.1F1C88FE@localhost> Message-ID: [Tim] > We've been decoding them [base64 text sections] almost since The Beginning. > Quoted-printable too. We aren't decoding embedded uuencoded sections, > though. [Brad Clements] > (Church lady from SNL) > > "Oh, nevermind". It's unclear -- there have been bugs in decoding base64 before, and may still be. Why did you ask originally? For example, perhaps that spam got classified as ham, and you didn't see any of the decoded spammy words in the clue list. In that case, we'd need the original msg to figure out what went wrong. From tim at fourstonesExpressions.com Tue Dec 10 16:12:26 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Dec 10 17:13:02 2002 Subject: [Spambayes] What the heck Message-ID: 12/10/2002 3:56:55 PM, Gary Robinson wrote: > >> It's an interesting idea... I'll bet that a spammer would have no qualms about >> including such a url in spam, though. > >That may be, but for the spammers to be motivated to do so, the idea will >have had to already been very successful in terms of people using it! 
>
>If the spammers then put the URL in their spams, millions of people will
>gain access to the latest info about the most powerful current antispam
>techniques... probably including who to lobby to get antispam laws passed,
>tips about who and where the spammers might be would be in a discussion
>area, etc....
>
>The outcome would be good either way! :)
>
>--Gary
>
>
>--
>http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails

What will the tokenizer do with the above url? - TimS

>
>Gary Robinson
>CEO
>Transpose, LLC
>grobinson@transpose.com
>207-942-3463
>http://www.emergentmusic.com
>http://radio.weblogs.com/0101454
>
>
>> From: Tim Stone - Four Stones Expressions
>> Reply-To: tim@fourstonesExpressions.com
>> Date: Tue, 10 Dec 2002 15:51:36 -0600
>> To: Gary Robinson
>> Subject: Re: [Spambayes] What the heck
>>
>> It will be interesting to see how many links back to the page you get... I
>> wonder if the spambayes tokenizer will discard the 'ToDestroy...Emails' part
>> as being a too long word...
>>
>> It's an interesting idea... I'll bet that a spammer would have no qualms about
>> including such a url in spam, though.
>>
>> - TimS
>>
>> 12/10/2002 12:54:16 PM, Gary Robinson wrote:
>>
>>> I couldn't resist. See my sig below. At least it will be an interesting
>>> experiment.
>>>
>>> (I didn't include the trademark/copyright mechanism because I don't want to
>>> run up against Habeas' pending patent. And I think it may not be necessary
>>> in any case.)
>>>
>>> So as not to take up more bandwidth on these lists I won't say anything more
>>> about this unless somebody has a comment or question they want me to respond
>>> to.
>>>
>>> Bottom line, I'm up for trying things and seeing what happens.
>>> >>> >>> --Gary >>> >>> -- >>> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails >>> >>> Gary Robinson >>> CEO >>> Transpose, LLC >>> grobinson@transpose.com >>> 207-942-3463 >>> http://www.emergentmusic.com >>> http://radio.weblogs.com/0101454 >>> >>> >>> >>> _______________________________________________ >>> Spambayes mailing list >>> Spambayes@python.org >>> http://mail.python.org/mailman/listinfo/spambayes >>> >>> >> >> >> c'est moi - TimS >> www.fourstonesExpressions.com >> >> > > > c'est moi - TimS www.fourstonesExpressions.com http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails From skip at pobox.com Tue Dec 10 16:24:52 2002 From: skip at pobox.com (Skip Montanaro) Date: Tue Dec 10 17:24:43 2002 Subject: [Spambayes] What the heck In-Reply-To: References: Message-ID: <15862.27060.809694.985672@montanaro.dyndns.org> >> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails Tim> What will the tokenizer do with the above url? - TimS Chew it up and spit out little pieces I believe, something like url:spamland url:org url:jsp url:Wiki url:ToDestroySpamIncludeThisLinkInAllLegitEmails Not sure about that last one. It might generate some sort of skip token, but I think that's just for regular long words. Skip From tim.one at comcast.net Tue Dec 10 17:28:17 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 17:30:34 2002 Subject: [Spambayes] What the heck In-Reply-To: Message-ID: > http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails [Tim Stone] > What will the tokenizer do with the above url? 
- TimS It generates 6 tokens: proto:http url:spamland url:org url:jsp url:wiki url:todestroyspamincludethislinkinalllegitemails From tim at fourstonesExpressions.com Tue Dec 10 16:33:11 2002 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Dec 10 17:35:05 2002 Subject: [Spambayes] What the heck In-Reply-To: Message-ID: 12/10/2002 4:28:17 PM, Tim Peters wrote: >> http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails > >[Tim Stone] >> What will the tokenizer do with the above url? - TimS > >It generates 6 tokens: > >proto:http >url:spamland >url:org >url:jsp >url:wiki >url:todestroyspamincludethislinkinalllegitemails skip_max_wordsize only applies to words, not to url fragments? > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS www.fourstonesExpressions.com http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails From tim.one at comcast.net Tue Dec 10 17:50:30 2002 From: tim.one at comcast.net (Tim Peters) Date: Tue Dec 10 17:52:49 2002 Subject: [Spambayes] What the heck In-Reply-To: Message-ID: > http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails > It generates 6 tokens: > > proto:http > url:spamland > url:org > url:jsp > url:wiki > url:todestroyspamincludethislinkinalllegitemails > skip_max_wordsize only applies to words, not to url fragments? It does not apply to url fragments. As to the first half of the question, tokenizer.py is open for inspection . From skip at pobox.com Tue Dec 10 23:01:18 2002 From: skip at pobox.com (Skip Montanaro) Date: Wed Dec 11 00:01:07 2002 Subject: [Spambayes] New option: summarize_email_prefixes Message-ID: <15862.50846.195158.599726@montanaro.dyndns.org> I just checked in code for a new option: summarize_email_prefixes. 
It tries to take advantage of clumps of related email addresses in a single message, e.g.: To: Cc: , , , , It's not a big win, but "pfxlen:big" is a very strong spam indicator. It might help on small messages without many other clues. I'd like others to give it a try and post their results. The code is pretty straightforward, so I won't go into more detail. Just gaze at tokenizer.py for a few seconds. Skip From noreply at sourceforge.net Tue Dec 10 20:56:28 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed Dec 11 00:03:11 2002 Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages Message-ID: Bugs item #651840, was opened at 2002-12-10 23:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mitchell Surface (msurface) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain.py eats old messages Initial Comment: When mboxtrain.py is run against a mbox containing messages it has already trained on, it deletes the old messages. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 From msurface at myvine.com Wed Dec 11 00:05:42 2002 From: msurface at myvine.com (Mitchell Surface) Date: Wed Dec 11 00:05:47 2002 Subject: [Spambayes] Bug in mboxtrain.py? Message-ID: <20021211050542.GA14646@brewer.fwn.fortwayne.com> I think I may have found a bug in mboxtrain.py. When you run mboxtrain.py against a mbox that contains messages that have already been trained on, the old messages are deleted. I opened a bug on SF for this. It's late here, I'll try to find some time to look at the code tomorrow and see what's going on. It's probably something simple and better eyes than mine will spot it more quickly, but I wanted to give a warning as soon as I could. 
It wasn't a highlight of my day to see a mailbox disappear. <0.5 wink> Oh
well, that's why it's called pre-alpha code, right?

-- 
Mitchell Surface N9OSL
Fort Wayne, IN USA

Don't ever think you know what's right for the other person.
He might start thinking he knows what's right for you.
-- Paul Williams, `Das Energi'

From neale at woozle.org Wed Dec 11 08:23:17 2002
From: neale at woozle.org (Neale Pickett)
Date: Wed Dec 11 11:23:26 2002
Subject: [Spambayes] Bug in mboxtrain.py?
In-Reply-To: <20021211050542.GA14646@brewer.fwn.fortwayne.com> (Mitchell Surface's message of "Wed, 11 Dec 2002 00:05:42 -0500")
References: <20021211050542.GA14646@brewer.fwn.fortwayne.com>
Message-ID:

Mitchell Surface writes:

> I think I may have found a bug in mboxtrain.py. When you run
> mboxtrain.py against a mbox that contains messages that have already
> been trained on, the old messages are deleted. I opened a bug on SF for
> this.

Zowie! Well *that* was a dumb bug. I hope you backed up your mbox--sorry
about that.

I've checked in the fix. Please, everyone using mboxtrain.py on an mbox
file, update your source from CVS.

Thanks for the bug report, Mitchell.
Neale

From noreply at sourceforge.net Wed Dec 11 08:42:27 2002
From: noreply at sourceforge.net (noreply@sourceforge.net)
Date: Wed Dec 11 11:48:40 2002
Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages
Message-ID:

Bugs item #651840, was opened at 2002-12-10 23:56
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mitchell Surface (msurface)
Assigned to: Nobody/Anonymous (nobody)
Summary: mboxtrain.py eats old messages

Initial Comment:
When mboxtrain.py is run against a mbox containing messages it has already
trained on, it deletes the old messages.

----------------------------------------------------------------------

>Comment By: Mitchell Surface (msurface)
Date: 2002-12-11 11:42

Message:
Logged In: YES
user_id=21257

I just did a cvs up and it looks like the code has been rewritten to not do
this, I'll test tonight and post results. Thanks guys!

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

From noreply at sourceforge.net Wed Dec 11 08:52:18 2002
From: noreply at sourceforge.net (noreply@sourceforge.net)
Date: Wed Dec 11 12:01:57 2002
Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages
Message-ID:

Bugs item #651840, was opened at 2002-12-10 20:56
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mitchell Surface (msurface)
>Assigned to: Neale Pickett (npickett)
Summary: mboxtrain.py eats old messages

Initial Comment:
When mboxtrain.py is run against a mbox containing messages it has already
trained on, it deletes the old messages.
----------------------------------------------------------------------

Comment By: Mitchell Surface (msurface)
Date: 2002-12-11 08:42

Message:
Logged In: YES
user_id=21257

I just did a cvs up and it looks like the code has been rewritten to not do
this, I'll test tonight and post results. Thanks guys!

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

From noreply at sourceforge.net Wed Dec 11 08:52:18 2002
From: noreply at sourceforge.net (noreply@sourceforge.net)
Date: Wed Dec 11 12:02:16 2002
Subject: [Spambayes] [ spambayes-Bugs-650496 ] hammie.py discards headers
Message-ID:

Bugs item #650496, was opened at 2002-12-08 10:39
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Simon Baatz (bnomis26)
>Assigned to: Neale Pickett (npickett)
Summary: hammie.py discards headers

Initial Comment:
When feeding the (malformed) attached mail to hammie.py in filter mode,
the headers of the mail are not present in the output.

Command line:
python hammie.py -f -d -p ~/mail/hammie.db < msg.lAoM

Output:
X-Spambayes-Classification: ham; 0.00

--Amazon.com_multipart_boundary____________
Content-Type: text/plain; charset=iso-8859-1

Vielen Dank für Ihre Bestellung bei Amazon.de.
--Amazon.com_multipart_boundary____________ Content-Type: text/html; charset=iso-8859-1 --Amazon.com_multipart_boundary____________-- ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=650496&group_id=61702 From noreply at sourceforge.net Wed Dec 11 08:53:19 2002 From: noreply at sourceforge.net (noreply@sourceforge.net) Date: Wed Dec 11 12:02:28 2002 Subject: [Spambayes] [ spambayes-Bugs-651840 ] mboxtrain.py eats old messages Message-ID: Bugs item #651840, was opened at 2002-12-10 20:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702 Category: None Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: Mitchell Surface (msurface) Assigned to: Neale Pickett (npickett) Summary: mboxtrain.py eats old messages Initial Comment: When mboxtrain.py is run against a mbox containing messages it has already trained on, it deletes the old messages. ---------------------------------------------------------------------- >Comment By: Neale Pickett (npickett) Date: 2002-12-11 08:53 Message: Logged In: YES user_id=619391 I think this is fixed with my most recent cvs checkin. Feel free to re-open the bug if not :) ---------------------------------------------------------------------- Comment By: Mitchell Surface (msurface) Date: 2002-12-11 08:42 Message: Logged In: YES user_id=21257 I justt did a cvs up and it looks like the code has been rewritten to not do this, I'll test tonight and post results. Thanks guys! 
----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=651840&group_id=61702

From neale at watchguard.com Wed Dec 11 22:02:16 2002
From: neale at watchguard.com (Neale Pickett)
Date: Thu Dec 12 01:27:44 2002
Subject: [Spambayes] New option: summarize_email_prefixes
In-Reply-To: <15862.50846.195158.599726@montanaro.dyndns.org> (Skip
 Montanaro's message of "Tue, 10 Dec 2002 23:01:18 -0600")
References: <15862.50846.195158.599726@montanaro.dyndns.org>
Message-ID:

Skip Montanaro writes:

> It's not a big win, but "pfxlen:big" is a very strong spam indicator.
> It might help on small messages without many other clues. I'd like
> others to give it a try and post their results.

It didn't make one bit of difference for me. So if it's helpful to you,
I'm okay with it :)

"""
cv1s -> cv2s
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    1.000  1.000  tied
    1.000  1.000  tied
    2.000  2.000  tied
    3.000  3.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 14 to 14 tied
mean fp % went from 1.4 to 1.4 tied

false negative percentages
    0.500  0.500  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fn went from 3 to 3 tied
mean fn % went from 0.3 to 0.3 tied

ham mean                     ham sdev
   4.39    4.39   +0.00%       16.31   16.31   +0.00%
   3.62    3.62   +0.00%       14.22   14.22   +0.00%
   5.01    5.01   +0.00%       18.67   18.67   +0.00%
   4.94    4.94   +0.00%       19.29   19.29   +0.00%
   4.20    4.21   +0.24%       14.37   14.38   +0.07%

ham mean and sdev for all runs
   4.43    4.43   +0.00%       16.71   16.71   +0.00%

spam mean                    spam sdev
  99.23   99.23   +0.00%        6.99    6.99   +0.00%
  99.26   99.26   +0.00%        7.41    7.41   +0.00%
  99.96   99.96   +0.00%        0.31    0.25  -19.35%
  98.96   98.96   +0.00%        8.51    8.51   +0.00%
  99.63   99.63   +0.00%        2.71    2.71   +0.00%

spam mean and sdev for all runs
  99.41   99.41   +0.00%        6.07    6.07   +0.00%

ham/spam mean difference: 94.98 94.98 +0.00
"""

From skip at pobox.com Thu Dec 12 08:34:11 2002
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 12 09:34:21 2002
Subject: [Spambayes] New option: summarize_email_prefixes
In-Reply-To:
References: <15862.50846.195158.599726@montanaro.dyndns.org>
Message-ID: <15864.40547.760153.378737@montanaro.dyndns.org>

    >> It's not a big win, but "pfxlen:big" is a very strong spam indicator.

    Neale> It didn't make one bit of difference for me. So if it's helpful to
    Neale> you, I'm okay with it :)

It didn't help me much either. I figured it was worth leaving in as an
experimental device because other people had asked about it before.

Skip

From ducky at webfoot.com Thu Dec 12 11:28:41 2002
From: ducky at webfoot.com (Kaitlin Duck Sherwood)
Date: Thu Dec 12 14:26:06 2002
Subject: [Spambayes] Tiny bug
In-Reply-To:
References:
Message-ID:

Last night I checked out spambayes, and ran into trouble in mboxtest.py.

There's a line in mboxtest.py
    from timtest import Msg
but mboxtest.py works much better if it's
    from msgs import Msg

I presume that as a newbie, I shouldn't (or can't) check in.

Cheers.

No reply needed.

From tim at fourstonesExpressions.com Thu Dec 12 13:31:33 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Dec 12 14:32:09 2002
Subject: [Spambayes] Tiny bug
In-Reply-To:
Message-ID:

I have no idea why timtest is there, but your fix is correct. You are
correct that you cannot check in.
- TimS (not the tim in timtest...;)

12/12/2002 1:28:41 PM, Kaitlin Duck Sherwood wrote:

>Last night I checked out spambayes, and ran into trouble in mboxtest.py.
>
>There's a line in mboxtest.py
>    from timtest import Msg
>but mboxtest.py works much better if it's
>    from msgs import Msg
>
>I presume that as a newbie, I shouldn't (or can't) check in.
>
>Cheers.
>
>No reply needed.
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com
http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails

From piper_dragon at lycos.com Fri Dec 13 04:12:12 2002
From: piper_dragon at lycos.com (douglas P craig)
Date: Fri Dec 13 00:26:15 2002
Subject: [Spambayes] Lindsey Carter
Message-ID:

You guys are way over my head but if I understand correctly the email I
received is only sent to get me to submit to the AVS service. I take it
your passion is working on software to detect spam. Educate me someone and
tell me what ham is?

Thanks
Doug

From skip at pobox.com Thu Dec 12 23:58:19 2002
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec 13 01:00:36 2002
Subject: [Spambayes] Lindsey Carter
In-Reply-To:
References:
Message-ID: <15865.30459.363352.142342@montanaro.dyndns.org>

    doug> You guys are way over my head but if I understand correctly the
    doug> email I received is only sent to get me to submit to the AVS
    doug> service.

Not sure what the "AVS service" is. We're not trying to get you to buy or
submit to anything.

    doug> I take it your passion is working on software to detect
    doug> spam. Educate me someone and tell me what ham is?

The opposite of spam.
Skip

From tim.one at comcast.net Sun Dec 15 21:11:01 2002
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 15 21:12:06 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To:
Message-ID:

I got a typical mortgage spam today, surprising because it scored 0.78, at
the high end of my personal-email Unsure range (which ends at 0.80). There
were very few words in the clue listing; it got a score as high as it did
because of the subject line

    Low rates will not last forever.

some assorted spammish header clues, URL clues, and the single word
"month!".

Staring at the source revealed a cute trick I haven't seen before:

    ...
    Let the Lenders
    Compete for your Loan!
    ...

That is, the spammy words like Lenders and Compete and Loan! are broken up
by embedded HTML comments. Our tokenizer does strip HTML comments, but
replaces each with a blank, so the spammy words remain broken up.

I'll fix that. In the meantime, if anyone knows this spammer, counsel them
to break up the word "month!" too, as that was the highest-spamprob token
in the whole msg.

From dereks at itsite.com Sun Dec 15 19:40:39 2002
From: dereks at itsite.com (Derek Simkowiak)
Date: Sun Dec 15 22:41:47 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To:
Message-ID:

> Let the Lenders
> Compete for your Loan!

> [...] Our tokenizer does strip HTML comments, but replaces each with a
> blank, so the spammy words remain broken up.
>
> I'll fix that.

Pretend I'm a spammer.

Hi! Greeat Deeals with lo

w rates!

(I.e., not just comments, but valid HTML tags too.)

For that matter, since unrecognized tags are ignored by browsers,
it could be:

Hi! Great deals Here!

Hell, it wouldn't even need to look like HTML:

Hi! Great deals here!

I haven't followed the discussions on HTML handling, but given
this latest cute trick this other stuff can't be far away.

--Derek

From tim at fourstonesExpressions.com Sun Dec 15 21:49:14 2002
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun Dec 15 22:49:50 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To:
Message-ID:

12/15/2002 6:40:39 PM, Derek Simkowiak wrote:

>> Let the Lenders
>> Compete for your Loan!
>
>> [...] Our tokenizer does strip HTML comments, but replaces each with a
>> blank, so the spammy words remain broken up.
>>
>> I'll fix that.
>
> Pretend I'm a spammer.
>
>Hi! Greeat Deeals with lo

w rates!
>
> (I.e., not just comments, but valid HTML tags too.)
>
> For that matter, since unrecognized tags are ignored by browsers,
>it could be:
>
>Hi! Great deals Here!
>
> Hell, it wouldn't even need to look like HTML:
>
>Hi! Great deals here!
>
> I haven't followed the discussions on HTML handling, but given
>this latest cute trick this other stuff can't be far away.

Right, but our current tokenizer would defeat all of these. It would have
defeated Tim's example, except that in the case of a stripped comment, it
replaced it with a blank.

This is a great example of how the efforts of teams like ours are already
forcing spammers into more and more convoluted behaviors, which will make
their mail even more readily recognizable!

- TimS

>
>--Derek
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

c'est moi - TimS
www.fourstonesExpressions.com
http://spamland.org/jsp/Wiki?ToDestroySpamIncludeThisLinkInAllLegitEmails

From popiel at wolfskeep.com Sun Dec 15 21:10:02 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 16 00:07:03 2002
Subject: [Spambayes] Another hammie setup
Message-ID: <20021216051002.3FE3A2DE86@cashew.wolfskeep.com>

A couple weeks ago, I mentioned that I was finally going to start
using hammie for my live filtering, and that I'd share the scripts,
etc that I generated to do so.

First off, let me describe how I've got things set up. I am an avid
(and rather religious) MH user, so my mail folders are of course
stored in the MH format (directories full of single-message files,
where the filenames are numbers indicating ordering in the folder).
I've got four mail folders of interest for this discussion:
everything, spam, newspam, and inbox.

When mail arrives, it is classified, then immediately copied into the
everything folder. If it was classified as spam or ham, it is trained
as such, reinforcing the classification.
Then, if it was labeled as spam, it goes into the newspam folder;
otherwise it goes into my inbox.

When I read my mail (from inbox or newspam), I move any confirmed spam
into my spam folder; ham may be deleted. (Of course, I still have a
copy of my ham in the everything folder.)

Every night, I run a complete retraining (from cron at 2:10am); it
trains on all mail in the everything folder that is less than 4 months
old. If a given message has an identical copy in the spam or newspam
folder, then it is trained as spam; otherwise it is trained as ham.
This does mean that unread unsures will be treated as ham for up to a
day; there's few enough of them that I don't care. The four-month age
limit will have the effect of expiring old mail out of the training
set, which will keep the database size fairly manageable (it's
currently just under 10 meg, with 6 days to go until I have 4 months
of data).

The retraining generates a little report for me each night, showing a
graph of my ham and spam levels over time.
Here's a sample:

Scanning spamdir (/home/cashew/popiel/Mail/spam):
Scanning spamdir (/home/cashew/popiel/Mail/newspam):
Scanning everything
sshsshsshsshsshsshsshshsshshshshsshshshshshshsshsshshsshssshsshshsshshsshshsssh
shshshsshshsshshshshshssshshshsshsshsshshshshshshsshshhshshsshshshshssshssshshs
ssshs
154
152|
144|
136|
128| h
120| h s
112| s ss ss s h s ss
104| ss ss ss sHs h s ss
 96| s ss s sH s ss sHs h Sss ss
 88| h ss s sss ss sH sss ssssHHhS sSsssss
 80| s sSH ss ssssss sssssH HssssHsHHHSS sSsssss
 72| ssHSH ssssssssssssHHsHSHssHsHsHHHSSssSsssss
 64| s s s s sHsHSHsssssssHsHsssHHsHSHssHsHsHHHSSssSsssss
 56| s sss ss sssssHHHSHsHsssHsHHHHssHHsHSHHsHHHsHHHSSsHSsssss
 48| ssssssssssssssHHHSHHHHssHsHHHHHsHHsHSHHsHHHsHHHSSsHSssHsss
 40| ssssssssssHsHHHHHSHHHHHsHsHHHHHHHHHHSHHsHHHHHHHSSsHSHsHHss
 32| ssHHssHsssHHHHHHHSHHHHHHHsHHHHHHHHHHSHHsHHHHHHHSSHHSHHHHHs
 24| ssHHHHHHHsHHHHHHHSHHHHHHHsHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
 16| HsHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
  8| HHHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHH
  0|SSSUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
   +------------------------------------------------------------
Total: 6441 ham, 9987 spam (60.79% spam)

real 7m45.049s
user 5m38.980s
sys 0m39.170s

This is a set of overlaid bar graphs; s is for spam, h is for ham,
u is unsure. The shorter bars are in front and capitalized. In the
example, I have very few days where I have more ham than spam.

My scripts (and a .procmailrc) are available at:

http://www.wolfskeep.com/~popiel/spambayes/hammie

- Alex

From popiel at wolfskeep.com Sun Dec 15 21:13:30 2002
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 16 00:10:26 2002
Subject: [Spambayes] One question I forgot to ask...
Message-ID: <20021216051330.0E6F22DE86@cashew.wolfskeep.com>

I forgot to ask in my last mail: would people like me to add
the scripts I use for my nightly retraining (given my slightly
unusual setup) to the project?
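
For readers following along, the nightly retraining rule from the
"Another hammie setup" message (train only on mail less than four
months old; label a message spam when an identical copy sits in the
spam or newspam folder, ham otherwise) might be sketched roughly as
below. This is an illustration, not the scripts from the URL above:
the function names, the use of file modification time as the message
age, the 120-day approximation of "four months", and the SHA-256
comparison standing in for "identical copy" are all assumptions.

```python
import hashlib
import time
from pathlib import Path

# Assumption: "four months" approximated as 120 days.
MAX_AGE_SECONDS = 120 * 24 * 3600


def digest(path):
    """Hash a message file so identical copies compare equal (assumed test)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def plan_retraining(everything_dir, spam_dirs, now=None):
    """Return (path, label) pairs for every recent message in everything_dir.

    A message is labeled "spam" if a byte-identical copy exists in any of
    spam_dirs, and "ham" otherwise. Messages older than MAX_AGE_SECONDS
    (judged by file mtime, an assumption) are skipped entirely.
    """
    now = time.time() if now is None else now
    spam_digests = {
        digest(p)
        for d in spam_dirs
        for p in Path(d).iterdir()
        if p.is_file()
    }
    plan = []
    for msg in sorted(Path(everything_dir).iterdir()):
        if not msg.is_file():
            continue
        if now - msg.stat().st_mtime > MAX_AGE_SECONDS:
            continue  # expired out of the training set
        label = "spam" if digest(msg) in spam_digests else "ham"
        plan.append((msg, label))
    return plan
```

The returned pairs would then be handed to whatever trainer is in use;
that step is left out here because it depends on the hammie front-end.
Note that, as in the description, an unread unsure message has no copy
in either spam folder and so comes out labeled ham.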
- Alex

From tim.one at comcast.net Mon Dec 16 00:26:44 2002
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 16 00:28:58 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To:
Message-ID:

[Tim]
> ...
> Let the Lenders
> Compete for your Loan!
> ...
>
> That is, the spammy words like Lenders and Compete and Loan! are
> broken up by embedded HTML comments. Our tokenizer does strip HTML
> comments, but replaces each with a blank, so the spammy words remain
> broken up.
>
> I'll fix that. In the meantime, if anyone knows this spammer,
> counsel them to break up the word "month!" too, as that was the
> highest-spamprob token in the whole msg.

It's fixed, and that particular spam is nailed now. Among the previously
"hidden" words, 'refinance', 'equity', and 'debt' all have higher spamprobs
than 'month!', and the score is now at the high end of the spam range.

stupid-beats-smart-stupid-beats-smart-stupid-beats-smart-ly y'rs - tim

From tim.one at comcast.net Mon Dec 16 00:46:38 2002
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 16 00:49:15 2002
Subject: [Spambayes] Cute spam trick
In-Reply-To:
Message-ID:

[Derek Simkowiak, on embedding other kinds of tags in words]
> ...
> I haven't followed the discussions on HTML handling, but given
> this latest cute trick this other stuff can't be far away.

I don't know, but Tim Stone was right that we strip out all HTML tags, so it
wouldn't help them against this system. They could still work around that,
by including extremely long tags -- our cheap-ass regexp gimmicks are
bounded in how far they'll look ahead when deciding what is and isn't a tag
(we don't even know whether we're looking at HTML, and don't want to chew up
non-HTML text that just happens to contain "<"). Someday I expect we'll
need "a real" HTML parser -- but not today.

The technically cleverest spam I've gotten to date remains an HTML spam that
interspersed legitimate news stories & tech newsgroup postings with the
spam, but specified a tiny font and white-on-white for the legit parts.
Invisible when rendered.
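
As an aside on the stripping behaviour discussed in this thread, here is
a minimal sketch of why replacing a stripped HTML comment with a blank
leaves the words broken while replacing it with nothing rejoins them,
and of how a tag pattern with a bounded length lets an over-long "tag"
slip through. The patterns, the MAX_TAG_LEN bound, the function names
and the sample text are all made up for illustration; they are not the
regexps in the Spambayes tokenizer, and the markup of the real spam is
not preserved in this archive.

```python
import re

# Illustrative patterns only; NOT the regexps used by the Spambayes
# tokenizer. MAX_TAG_LEN is a hypothetical bound chosen for the demo.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
MAX_TAG_LEN = 16
BOUNDED_TAG = re.compile(r"<[^<>]{1,%d}>" % MAX_TAG_LEN)


def strip_comments(text, replacement):
    """Replace every HTML comment in text with the given string."""
    return HTML_COMMENT.sub(replacement, text)


def strip_tags(text):
    """Drop anything that looks like a tag of at most MAX_TAG_LEN chars."""
    return BOUNDED_TAG.sub("", text)


# Made-up sample standing in for the spam described above.
sample = "Let the Len<!-- x -->ders Com<!-- y -->pete for your Lo<!-- z -->an!"

# Comment replaced by a blank: the spammy words stay broken up.
print(strip_comments(sample, " ").split())
# ['Let', 'the', 'Len', 'ders', 'Com', 'pete', 'for', 'your', 'Lo', 'an!']

# Comment replaced by nothing: the words are rejoined.
print(strip_comments(sample, "").split())
# ['Let', 'the', 'Lenders', 'Compete', 'for', 'your', 'Loan!']

# A short tag inside a word is removed, rejoining the word...
print(strip_tags("lo<b>w rates"))          # low rates

# ...but a "tag" longer than the bound is not recognized and is left alone.
long_tag = "lo<" + "x" * 40 + ">w rates"
print(strip_tags(long_tag) == long_tag)    # True
```

The bound is the trade-off described above: a short look-ahead avoids
eating ordinary text that merely contains "<", at the cost of missing
deliberately long tags.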
I've only seen that once, and part of the downside of stripping HTML tags is
that the classifier will never learn on its own which HTML tricks are used
to get this effect. OTOH, you can't guess someone's "ham words" without
knowing something about them, and personal information is very expensive
for spammers to obtain or exploit.

From mwh at python.net Mon Dec 16 12:18:41 2002
From: mwh at python.net (Michael Hudson)
Date: Mon Dec 16 07:18:43 2002
Subject: [Spambayes] Re: Cute spam trick
References:
Message-ID: <2md6o2cg2m.fsf@starship.python.net>

Tim Peters writes:

> OTOH, you can't guess someone's "ham words" without knowing
> something about them, and personal information is very expensive for
> spammers to obtain or exploit.

Indeed, if they know that much about me, a better tactic would be to
try to sell me something I might actually want...

Cheers,
M.

--
  There are two kinds of large software systems: those that evolved
  from small systems and those that don't work.
                           -- Seen on slashdot.org, then quoted by amk

From neale at woozle.org Mon Dec 16 10:46:08 2002
From: neale at woozle.org (Neale Pickett)
Date: Mon Dec 16 13:46:17 2002
Subject: [Spambayes] One question I forgot to ask...
In-Reply-To: <20021216051330.0E6F22DE86@cashew.wolfskeep.com> ("T.
 Alexander Popiel"'s message of "Sun, 15 Dec 2002 21:13:30 -0800")
References: <20021216051330.0E6F22DE86@cashew.wolfskeep.com>
Message-ID:

"T. Alexander Popiel" writes:

> I forgot to ask in my last mail: would people like me to add
> the scripts I use for my nightly retraining (given my slightly
> unusual setup) to the project?

Yes. But I think we should go ahead and create that hammie subdir
first. All hammie front-ends (hammiefilter, hammiebulk, hammiecli/srv,
mboxtrain, any others) would be moved there, as well as the HAMMIE.txt
file and any other hammie-like things. hammie.py would stay in the
top-level directory, since lots of things use it.

What do y'all think of that?
Neale From trebor at animeigo.com Mon Dec 16 13:37:55 2002 From: trebor at animeigo.com (Robert Woodhead) Date: Mon Dec 16 13:48:55 2002 Subject: [Spambayes] Re: Spambayes Digest, Vol 52, Issue 26 In-Reply-To: References: